Column: Preparing your content for machine learning
It was almost ten years ago when I first investigated integration with speech-to-text solutions as a way to enrich archival content. At that time, many of our customers were struggling to manage archives that included vast amounts of media that had very little metadata. These organizations were unable to find the right asset quickly, when they needed it, assuming that they even knew that the content was in their archive. Unfortunately, at that time the speech-to-text solutions available were not “ready for prime time.” There were too many errors, particularly when dealing with audio that was less than pristine or voices that had less common accents.
Even though that initial investigation was unsuccessful, we knew that it was only a matter of time before speech-to-text would achieve the quality our customers demanded and that other types of machine learning would follow. The applications for these technologies and the resources invested in their development ensured that they would eventually be successful. Speech-to-text became viable a few years ago, and we are now seeing improved results in video processing, such as celebrity recognition, object recognition, facial recognition, and sentiment analysis.
Media organizations are intrigued by machine learning technologies. However, implementing metadata enrichment involves more than just dumping an entire media library into a cloud bucket. There are a few factors to consider before the process starts, to ensure that you get the best results without breaking the bank.
The first decision is which types of machine learning you will use and which content will be processed. Will the service process all of your content, or will you limit it to content that fits certain criteria, such as:
- Age – only new assets, only historical assets, or a combination of both
- Source – only assets that come from certain sources
- Categorization – only assets that have specific existing metadata values
- Relevance – only assets that have been recently used and/or content related to recently used assets
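The criteria above can be sketched as a simple selection policy. This is only an illustration: the `Asset` fields, the `SelectionPolicy` class, and the `select_assets` helper are hypothetical names, not part of any particular asset management system, and a real catalog would likely expose far richer metadata.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set


@dataclass
class Asset:
    # Hypothetical, minimal view of an archived asset.
    asset_id: str
    created: date
    source: str
    category: str
    last_used: Optional[date] = None


@dataclass
class SelectionPolicy:
    # Each criterion is optional; None means "do not filter on this".
    created_on_or_after: Optional[date] = None   # Age
    sources: Optional[Set[str]] = None           # Source
    categories: Optional[Set[str]] = None        # Categorization
    used_within_days: Optional[int] = None       # Relevance

    def matches(self, asset: Asset, today: date) -> bool:
        if self.created_on_or_after is not None and asset.created < self.created_on_or_after:
            return False
        if self.sources is not None and asset.source not in self.sources:
            return False
        if self.categories is not None and asset.category not in self.categories:
            return False
        if self.used_within_days is not None:
            # Assets that have never been used cannot be "recently used".
            if asset.last_used is None:
                return False
            if (today - asset.last_used).days > self.used_within_days:
                return False
        return True


def select_assets(assets, policy: SelectionPolicy, today: date):
    """Return only the assets that should be sent for processing."""
    return [a for a in assets if policy.matches(a, today)]
```

Leaving every criterion unset selects the whole library, so the same policy object can describe both "process everything" and a narrowly targeted first pass.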
The machine learning system(s) that you choose may affect your media supply chain and storage policies, both in terms of where the assets are stored and which formats you choose to create and store. Speech-to-text services generally work with common audio and video formats, such as MP3, WAV, and MP4 files, and can achieve good results with standard consumer bitrates. On the other hand, video processing systems may require higher-resolution or higher-bitrate content, so typical “proxy” instances of assets may not be acceptable. For this reason, it may be necessary to create higher-quality MP4 instances to process. Some (but not all) machine learning systems support professional formats, allowing you to use the mezzanine or production instances of your assets. Even if they do support professional formats, you may want to consider transcoding to MP4 to reduce file sizes. Smaller files reduce connectivity requirements, and the machine learning system may also limit the maximum file size that it will process.
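One way to reason about this trade-off is to check each existing instance of an asset against the limits of the chosen service and pick the smallest one that qualifies. The sketch below assumes hypothetical `Rendition` and `ServiceLimits` structures; the actual accepted formats, minimum resolution, and size limits must come from the documentation of whichever service you select.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Rendition:
    # One stored instance of an asset (proxy, high-quality MP4, mezzanine, ...).
    name: str
    container: str      # e.g. "mp4", "mxf"
    height: int         # vertical resolution in pixels
    size_bytes: int


@dataclass(frozen=True)
class ServiceLimits:
    # Hypothetical constraints; real values depend on the service.
    accepted_containers: FrozenSet[str]
    min_height: int
    max_size_bytes: int


def pick_rendition(renditions: List[Rendition], limits: ServiceLimits) -> Optional[Rendition]:
    """Return the smallest rendition the service can accept, or None.

    None signals that no stored instance qualifies, so a new
    transcode would have to be created before processing.
    """
    usable = [
        r for r in renditions
        if r.container in limits.accepted_containers
        and r.height >= limits.min_height
        and r.size_bytes <= limits.max_size_bytes
    ]
    if not usable:
        return None
    # Prefer the smallest qualifying file to reduce transfer time and cost.
    return min(usable, key=lambda r: r.size_bytes)
```

Choosing the smallest qualifying file is a design choice aimed at connectivity and cost; if analysis quality matters more, the same filter could instead prefer the highest resolution that still fits under the size limit.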
Many machine learning services, including those from some of the major cloud providers, require assets to be hosted in their cloud storage. Selecting an archive management or asset management system that dynamically moves content between storage systems (whether in the cloud or on-premises) gives you the flexibility to choose machine learning services that meet your requirements.
Once you begin running your assets through a machine learning service, what will you do with the output? For speech-to-text services, the output may be a transcript or closed caption file. For video processing services, the output may be an XML or JSON file. You will want to have a system in place that can associate the information from the machine learning service with your video asset so that it becomes searchable metadata, whether as time-based markers and segments or as external files attached to the asset as unstructured metadata. Ideally, this system should also allow you to filter out information that is unnecessary.
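As a rough illustration of turning service output into time-based markers, the sketch below parses a JSON document and filters out low-confidence or unwanted entries. The JSON layout (a `detections` list with `label`, `confidence`, `start`, and `end` fields) and the `markers_from_json` helper are assumptions made for this example; every service defines its own output schema.

```python
import json


def markers_from_json(raw, min_confidence=0.8, ignore_labels=frozenset()):
    """Convert hypothetical detection output into time-based markers.

    Entries below min_confidence, or whose label is in ignore_labels,
    are dropped so that only useful metadata reaches the asset record.
    """
    data = json.loads(raw)
    markers = []
    for item in data.get("detections", []):
        if item["confidence"] < min_confidence:
            continue
        if item["label"] in ignore_labels:
            continue
        markers.append({
            "label": item["label"],
            "start_seconds": float(item["start"]),
            "end_seconds": float(item["end"]),
        })
    # Sort by start time so markers appear in timeline order.
    markers.sort(key=lambda m: m["start_seconds"])
    return markers
```

The resulting list of dictionaries could then be attached to the asset as time-based segments by whatever interface your asset management system provides.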
Advancements in machine learning have tremendous potential to unlock the value of archived media assets. Implementing a metadata enrichment process requires some planning and forethought, but choosing the right partners and systems will allow you to take advantage of these advancements efficiently and cost-effectively.