These days, you can get a 65” 4K Ultra High Definition, High Dynamic Range smart TV at your local superstore for under $1000 and stream vast libraries of premium movies and TV shows to it for under $20/month. Video compression technologies like H.264 and H.265 have been key to realizing this extraordinary result. Since the dawn of video compression, the primary codec design goals have been maximizing viewer-perceived quality while minimizing bandwidth, storage and coding costs, and delays. Starting with the CCITT H.120 standard in 1984 up to and including the still-developing ITU-T/MPEG Versatile Video Coding standard, perceptual quality and compression rates have steadily improved while staying within the practical and economic limits of network and computing capabilities.
However, a projected 95% of video or image content will never be seen by human eyes. A substantial amount of video produced by surveillance and traffic cameras, robots, drones, autonomous vehicles and other sources is often discarded or archived without a single person watching. Instead, these videos are inputs to computer vision and video analytics applications. Some existing codec components (e.g., motion estimation) have value for video analytics, but computing and network resources devoted to maintaining vibrant colors and clear pictures on large screens are wasted when only a robot is watching.
At Intel Labs, we started asking two questions:
- Can you use any of the capabilities intrinsic to current video codecs to improve video analytics results? -- i.e., “Compression Aware Analytics”
- What if you created a video codec designed from first principles for video analytics? -- i.e., “Analytics Aware Compression”
From these two questions arose the “Co-Adaptive Networking and Visual Analysis Systems” or CANVAS project led by Intel Labs’ Omesh Tickoo and Srinivasa Somayazulu. We showed some early CANVAS results at the 2019 Computer Vision and Pattern Recognition Conference (CVPR) in June.
The Right Tool for a Different Job
In my previous blog, “Sharing the Video Edge”, I described our work on smart city video analytics at the Intel Science and Technology Center for Visual Cloud Systems (ISTC-VCS) at Carnegie Mellon University. There, we’ve built out an urban testbed to demonstrate camera to edge to cloud distributed video analytics use cases. In those use cases, the processing workflow looks something like the figure below.
In these applications, computer vision cameras typically stream H.264 encoded video to an edge node and a cloud-based application that performs a pipeline of operations to produce a set of analytics results. In the figure above, the application is responsible for:
- Detecting and recognizing license plate numbers in the field of view
- Producing a track of the license plate through the field of view
- Re-encoding a snippet video of the license plate moving through the video
To accomplish this, the edge node first decodes the video into a sequence of individual frame images. Those images are fed into a neural network that detects objects and sends re-encoded video segments containing the objects to the cloud. The cloud decodes the arriving video and runs a plate recognizer and tracker to produce its results.
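As a concrete illustration, the conventional decode-detect-re-encode flow can be sketched as below. Every stage is a stub and all names are illustrative, not code from CANVAS or the ISTC-VCS testbed; a real system would call a hardware decoder, a detection network, and an H.264 encoder at each step.

```python
# Illustrative sketch of the conventional edge-to-cloud pipeline described above.

def decode(stream):
    """Decode an H.264 stream into individual frame images (stubbed)."""
    return [{"frame_id": i, "pixels": payload} for i, payload in enumerate(stream)]

def detect_plates(frames):
    """Run a detector over each frame; a stand-in that flags frames containing 'plate'."""
    return [f for f in frames if "plate" in f["pixels"]]

def reencode(frames):
    """Re-encode the detected frames into a snippet for the cloud (stubbed)."""
    return {"codec": "h264", "frames": [f["frame_id"] for f in frames]}

# Edge node: decode everything, detect, re-encode, ship to the cloud.
camera_stream = ["sky", "plate ABC-123", "road", "plate XYZ-789"]
snippet = reencode(detect_plates(decode(camera_stream)))
print(snippet)  # {'codec': 'h264', 'frames': [1, 3]}
```

Note that every frame is fully decoded and the interesting ones re-encoded, which is exactly the redundant work CANVAS aims to remove.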
This approach is common and expedient because it leverages the many years spent creating high quality, efficient and standardized video codecs. However, as the number of cameras and volume of data from each camera increases, the network and computing infrastructure can be overwhelmed. The CANVAS team believes that, when the task is constrained to analytics, there are better ways to do it.
Let’s go a little deeper on the two approaches: compression aware analytics and analytics aware compression.
Empowering Tomorrow’s Droids with Today’s Codecs
In CANVAS, the Compression Aware Analytics project exploits the information already contained in the encoded camera stream to remove unnecessary processing in the subsequent stages. For example, the decode stage can be eliminated by training a plate recognizer to use the encoded bitstream directly. The object tracker can use the recognizer outputs and the motion vectors in the encoded bitstream to produce the plate tracks. Motion vectors needn’t be recomputed. The video snippet can be extracted from the encoded video using the track timestamps and, if necessary, further compressed for transmission to the data center.
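To illustrate the idea (this is not the CANVAS implementation), a tracker can propagate a recognizer's bounding box using only the macroblock motion vectors already carried in the bitstream, with no pixel decode. The vector format and function names below are assumptions for the sketch; in practice the vectors would come from the H.264 parser.

```python
# Hedged sketch of compression-aware tracking: shift a detection box using
# motion vectors from the encoded bitstream. Vector extraction is stubbed.

def mean_motion(box, motion_vectors):
    """Average the motion vectors whose macroblocks fall inside `box`."""
    x0, y0, x1, y1 = box
    inside = [(dx, dy) for (mx, my, dx, dy) in motion_vectors
              if x0 <= mx < x1 and y0 <= my < y1]
    if not inside:
        return (0, 0)
    n = len(inside)
    return (sum(dx for dx, _ in inside) / n, sum(dy for _, dy in inside) / n)

def propagate(box, motion_vectors):
    """Shift the box by the dominant motion inside it -- no pixel decode needed."""
    dx, dy = mean_motion(box, motion_vectors)
    x0, y0, x1, y1 = box
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

# Macroblock motion vectors as (block_x, block_y, dx, dy); the plate region
# around (32..64, 16..32) is moving right by ~8 px per frame.
mvs = [(32, 16, 8, 0), (48, 16, 8, 0), (100, 100, 0, 0)]
plate_box = (32, 16, 64, 32)
print(propagate(plate_box, mvs))  # (40.0, 16.0, 72.0, 32.0)
```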
Designing the Perfect Droid Codec
The other CANVAS project, Analytics Aware Compression, posits that it is possible to improve compression rates and analytics performance by designing a codec that optimizes for the elements of the camera stream that matter for analytics. In general, analytics applications don’t require high perceptual quality. They need high resolution images of objects of interest and good object tracking through the field of view. Analytics aware compression adapts the encoder to emphasize high quality encoding of important frame regions (e.g., license plates) while de-emphasizing the quality of, or even dropping, the background. For example, in the figure at right, the detected pedestrian regions can be compressed at high resolution while, say, the grass can be transmitted at much lower resolution. Similarly, Analytics Aware Compression can reduce framerates or frame quality when there is little movement between frames. This same technique is used in current video codecs, but an analytics aware codec can take it to a new extreme.
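One way to picture ROI-weighted encoding is a per-macroblock quantization parameter (QP) map that spends bits on regions of interest and starves the background. The sketch below assumes an encoder that accepts such a map; the 16x16 macroblock grid and the convention that lower QP means higher quality follow H.264, but the constants and function are illustrative, not CANVAS code.

```python
# Sketch of analytics-aware ROI compression as a per-macroblock QP map.

MB = 16          # H.264 macroblock size in pixels
ROI_QP = 20      # low QP -> high quality for objects of interest
BG_QP = 45       # high QP -> heavy quantization for background

def qp_map(width, height, rois):
    """Return a QP per macroblock: ROI_QP where any ROI overlaps, else BG_QP."""
    cols, rows = width // MB, height // MB
    grid = [[BG_QP] * cols for _ in range(rows)]
    for (x0, y0, x1, y1) in rois:
        for r in range(y0 // MB, min(rows, (y1 + MB - 1) // MB)):
            for c in range(x0 // MB, min(cols, (x1 + MB - 1) // MB)):
                grid[r][c] = ROI_QP
    return grid

# A 64x32 frame with one pedestrian ROI covering the left half.
grid = qp_map(64, 32, [(0, 0, 32, 32)])
print(grid)  # [[20, 20, 45, 45], [20, 20, 45, 45]]
```

Production encoders expose similar hooks (per-region quantizer offsets), so this idea maps onto existing H.264 toolchains without a new bitstream format.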
Toward a CANVAS Codec
To validate our ideas, we ran an initial CANVAS experiment combining compression aware object tracking with analytics aware region of interest (ROI) compression in a pedestrian detection application. Our goal was to see if CANVAS techniques could appreciably reduce transmission and computation resource requirements while retaining object classification accuracy. Our approach is shown in the figure below. In a typical edge-to-cloud environment, we created a new “edge analytics encoder” that identified objects in a camera feed, encoded those ROIs as high-fidelity i-frames and combined them with the original motion vectors to create an analytics-optimized H.264 stream. Background information and p-frames were not transmitted. At the cloud, our decoder extracted and reconstructed a sequence of ROI frames. These were run through a Fast R-CNN object classifier to find the pedestrians.
The basic flow of a video through the systems is:
- The edge decodes an incoming camera stream into an i-frame sequence
- These i-frames run through a simple object detector to identify ROIs
- A bounding box tracker computes the ROI paths through the video
- The CANVAS encoder compresses the motion vectors and ROIs and streams them to the cloud
- At the cloud, CANVAS decodes the stream into i-frame images
- The decoded images run through an object classifier and an object track is produced from bitstream motion vectors
- The cloud feeds the classification results back to the edge to inform the object tracker
- The cloud classifier outputs the object class and track
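The steps above can be sketched end to end as follows. All codec work is stubbed and every name is illustrative (this is not the CANVAS code): the edge ships only ROI crops plus the bitstream motion vectors, and the cloud classifies the crops while deriving the track from the vectors.

```python
# Minimal end-to-end sketch of the CANVAS-style flow described above.

def edge_encode(frames):
    """Edge: detect ROIs, keep only ROI crops + motion vectors per frame."""
    stream = []
    for f in frames:
        if f["roi"] is not None:                 # simple object detector (stub)
            stream.append({"frame_id": f["id"],
                           "roi_crop": f["roi"], # sent as a high-fidelity i-frame
                           "mv": f["mv"]})       # original bitstream motion vectors
    return stream                                # background / p-frames never sent

def cloud_decode_classify(stream):
    """Cloud: classify each ROI crop and build the track from motion vectors."""
    labels = ["pedestrian" for _ in stream]      # classifier stand-in
    track = [pkt["mv"] for pkt in stream]        # track from bitstream MVs
    return labels, track

frames = [{"id": 0, "roi": "crop0", "mv": (8, 0)},
          {"id": 1, "roi": None,    "mv": (0, 0)},  # no object: nothing transmitted
          {"id": 2, "roi": "crop2", "mv": (7, 1)}]
labels, track = cloud_decode_classify(edge_encode(frames))
print(labels, track)  # ['pedestrian', 'pedestrian'] [(8, 0), (7, 1)]
```

The bandwidth saving falls out directly: frames with no object of interest contribute nothing to the stream.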
We ran this codec against a set of pedestrian videos on an Intel® Core™ i7-6770HQ processor using Intel® Quick Sync Video for video decode and the Intel® Movidius™ Myriad™ X VPU and Intel® Distribution of OpenVINO™ Toolkit for object detection and classification. Compared with a baseline R-CNN object detector on a high-fidelity video sequence, we saw several orders of magnitude improvement in bitrate and computational complexity with only minor impact to detection accuracy. Further details will be published soon. In some cases, such as videos with fast moving objects, we actually saw accuracy increases.
These are early results but we’re very encouraged. We believe that an analytics optimized codec can lead to improved application performance at the expense of human viewability. We continue research in this area while we explore whether there is an industry need for such a technology. We’re interested in connecting with industry and academic technical leaders who have ideas in compression aware analytics and analytics aware compression. Please reach out to Dr. Omesh Tickoo if you’d like to collaborate.
Check out other blogs from my visual cloud series:
- Overcoming Visual Analysis Paralysis -- Scanner, Spark, VDMS, Pandas and Apache Arrow (Oct 2019)
- Sharing the Video Edge -- Mainstream and FilterForward (Apr 2019)
- Feeling a Little Edgy -- OpenRTIST (Mar 2019)
- The Loneliness of the Expert -- Eureka (Mar 2019)
- Visual Data: Pack Rat to Explorer -- VDMS (Feb 2019)
- Scaling the Big Video Data Mountain -- Scanner (Jan 2019)