Computer Vision

I have been exploring the computer vision capabilities of the NXP i.MX6 ARM SoC with the goal of incorporating camera video and motion sensing features into the Lightwing Multimedia UI Engine. Lightwing could benefit tremendously from these features, but the documentation available for building computer vision applications on this platform is limited and out of date, so perhaps some of my lessons learned will be useful for other projects as well.

The NXP i.MX6 System on Chip has impressive computer vision capabilities, including a high-speed MIPI camera interface and GPU and IPU accelerators that can enhance solutions built with or without OpenCV. MIPI is a camera interface optimized for high speed, low power and low cost; the camera connects directly to the i.MX6 over a short flex cable. Compared with a USB webcam, it avoids the complexity of JPEG coding, USB cables and the UVC driver stack.


Computer Vision Features of the NXP i.MX6 SoC

OpenCV is the most widely used open source library for computer vision applications like motion sensing and object recognition, and it is supported on a wide range of platforms. Many implementations of its VideoCapture class are provided for different platforms and camera types. However, several additional components are required for it to work on Linux with a MIPI camera, mainly because of the pixel format the camera produces: OpenCV expects video frames in BGR24 format, while cameras typically deliver frames in a YUV format, so something in the pipeline has to perform this color space conversion.
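
To make this conversion concrete, here is a minimal sketch of the color space conversion itself using OpenCV's cvtColor, assuming the camera delivers packed YUYV (YUY2) frames; the frame size is illustrative only.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

using namespace cv;

int main()
{
    Mat mYuyvFrame(768, 1024, CV_8UC2);                     // Packed YUYV frame: two bytes per pixel (assumed camera format).
    Mat mBgrFrame;
    cvtColor(mYuyvFrame, mBgrFrame, COLOR_YUV2BGR_YUY2);    // Expand to the three-byte BGR24 pixels OpenCV expects.
    return 0;
}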

The following diagram illustrates the major components of a computer vision system on Linux on the i.MX6 platform and how video flows between them. Two architectures are shown, one using GStreamer and the other using OpenGL ES.


Components of a Computer Vision System on the NXP i.MX6

The typical solution is to use GStreamer, which has plugins for the NXP IPU, to do the required color space conversion. The IPU is NXP’s Image Processing Unit which also has video scaling features. However, the IPU is limited to a maximum video size of 1024 x 768 pixels. The GStreamer approach is complicated because of all the components required in the video path, but OpenCV’s VideoCapture class already has support for GStreamer. In testing this, I found that the GStreamer support is actually quite buggy and difficult to get working on this platform. So, I have explored another approach which has some major advantages.
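
For reference, when OpenCV is built with GStreamer support, its VideoCapture class can be handed a pipeline description directly. The sketch below shows what that might look like; the element names (imxv4l2videosrc, imxipuvideotransform) come from the gstreamer-imx plugin set and should be treated as assumptions, since the available elements vary between BSP releases.

#include <opencv2/videoio.hpp>

using namespace cv;

int main()
{
    // Assumed gstreamer-imx elements: imxv4l2videosrc captures from the MIPI camera,
    // imxipuvideotransform uses the IPU to convert and scale, and appsink hands frames to OpenCV.
    const char* szPipeline =
        "imxv4l2videosrc device=/dev/video0 ! "
        "imxipuvideotransform ! video/x-raw,format=BGR,width=1024,height=768 ! "
        "appsink";

    VideoCapture capture(szPipeline, CAP_GSTREAMER);        // Open the pipeline through OpenCV's GStreamer backend.

    if (!capture.isOpened())
        return -1;                                          // Pipeline failed to start.

    Mat mImageFrame;
    capture >> mImageFrame;                                 // Frames arrive already converted to BGR24.
    return 0;
}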

The other approach is to eliminate the GStreamer component, along with its plugins, and instead use the GPU to convert the video. This requires OpenGL ES 2.0, which is the API that Lightwing is built on. The GPU has the advantage that many custom video processing filters and effects can be implemented in its GLSL shader language, as well as color space conversion and even motion sensing computer vision algorithms. So, this approach could potentially even avoid the need for OpenCV.
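
As a rough sketch of how camera frames can reach the GPU without GStreamer, a raw YUYV frame can be uploaded into a texture with standard OpenGL ES 2.0 calls and then unpacked and converted by a fragment shader. On the i.MX6 the Vivante direct-texture extension can avoid this copy entirely, but the portable version is shown here; the buffer and size parameters are assumed to come from the V4L2 capture code.

#include <GLES2/gl2.h>

// Hedged sketch: hand one packed YUYV camera frame to the GPU as a two-byte-per-pixel
// texture. A fragment shader can then unpack the pixels and convert them to RGB.
// pFrameData, iWidth and iHeight are assumed to come from the V4L2 capture code.
void UploadCameraFrame(GLuint uVideoTexture, const unsigned char* pFrameData,
                       GLsizei iWidth, GLsizei iHeight)
{
    glBindTexture(GL_TEXTURE_2D, uVideoTexture);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    // Each texel holds one Y byte plus one shared U or V byte (YUYV is two bytes per pixel).
    glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE_ALPHA, iWidth, iHeight, 0,
                 GL_LUMINANCE_ALPHA, GL_UNSIGNED_BYTE, pFrameData);
}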

I have implemented edge detection on camera video in two different ways. The first approach was to use OpenCV's Canny function in C++, along with its VideoCapture class. In this case the IPU does the color space conversion to BGR24 format, but that is not visible here because it happens in the layers below OpenCV.

Example of Using an OpenCV Edge-Detection Filter with Camera Video

#include <opencv2/imgcodecs.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/core/utility.hpp>

#include <iostream>
#include <stdio.h>

using namespace cv;
using namespace std;

int iEdgeThreshSobel = 1;

Mat mImageFrame, mImageGray, mImageGrayBlur, mImageEdgeMask, mImageEdgeResult;
const char* WindowName = "Canny edge map with Sobel gradient";

static void onTrackbar(int, void*)
{
    cvtColor(mImageFrame, mImageGray, COLOR_BGR2GRAY);      // Convert frame to gray scale.
    blur(mImageGray, mImageGrayBlur, Size(3, 3));           // Blur to reduce noise before edge detection.

    // Canny edge detector with a Sobel filter.
    Canny(mImageGrayBlur, mImageEdgeMask, iEdgeThreshSobel, iEdgeThreshSobel * 3, 3);
    mImageEdgeResult = Scalar::all(0);                       // Clear the result to black.
    mImageFrame.copyTo(mImageEdgeResult, mImageEdgeMask);    // Copy only the pixels on detected edges.
    imshow(WindowName, mImageEdgeResult);                    // Display image frame in window.
}

int main(int argc, char** argv)
{
    VideoCapture capture;
    capture.open(0);                                         // Open camera device through V4L2.

    if (!capture.isOpened())                                 // Exit if the camera could not be opened.
        return -1;

    namedWindow(WindowName, WINDOW_KEEPRATIO);               // Create window and tool bar slide control.
    createTrackbar("Canny threshold Sobel", WindowName, &iEdgeThreshSobel, 100, onTrackbar);
    char key = 0;

    while (key != 'q')                                       // Continuously capture frames from the camera and display them.
    {
        capture >> mImageFrame;   // Capture another image frame from camera.

        if (mImageFrame.empty())
            break;

        onTrackbar(0, 0);                   // Show the image.
        key = (char)waitKey(30);      // Wait 30 milliseconds for a key press.
    }
    return 0;
}

Here is a captured screen shot from this program showing the result of OpenCV’s Canny edge-detection on camera video at 1024 x 768 resolution. The slider control at the top adjusts the threshold for the Canny Sobel filter to reject noise.

 

Frame Captured from Camera Showing OpenCV’s Canny Edge-Detection

My second approach was to implement the Sobel-Feldman algorithm directly in GLSL through OpenGL ES 2.0. This moves most of the heavy computation from the ARM cores to the GPU's shader engine. In this case the GPU also does the color space conversion to RGB32 format, but that is not shown here because it is abstracted by the texture2D sampler, using a built-in extension of the Vivante OpenGL ES 2.0 driver on the i.MX6 platform. A sketch of the application-side setup that compiles and drives this shader follows the code below.

Example of Using a Sobel Edge-Detection Filter on the GPU with Camera Video

// Sobel Fragment Shader - Displays video image with a Sobel-Feldman edge-detection filter applied.
precision highp float;
uniform sampler2D gsuTexture;      // Handle of texture for video frames.
uniform vec2      gsuDimensions;   // Horizontal and vertical size of the video frames in pixels.
varying vec2      gsvTexCoord;     // Interpolated texture coordinates from the vertex shader.

float ComputeAverage(vec3 vInput)
{
    float fAverage = (vInput.x + vInput.y + vInput.z) / 3.0;
    return fAverage;
}

float ComputeConvolution(mat3 mInput)
{
    vec2 vOffsets = vec2(1.0 / gsuDimensions.x, 1.0 / gsuDimensions.y);
    float fPixel00 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x - vOffsets.x, gsvTexCoord.y - vOffsets.y)).xyz);
    float fPixel01 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x, gsvTexCoord.y - vOffsets.y)).xyz);
    float fPixel02 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x + vOffsets.x, gsvTexCoord.y - vOffsets.y)).xyz);
    vec3 vRow0 = vec3(fPixel00, fPixel01, fPixel02);
    float fPixel10 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x - vOffsets.x, gsvTexCoord.y)).xyz);
    float fPixel11 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x, gsvTexCoord.y)).xyz);
    float fPixel12 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x + vOffsets.x, gsvTexCoord.y)).xyz);
    vec3 vRow1 = vec3(fPixel10, fPixel11, fPixel12);
    float fPixel20 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x - vOffsets.x, gsvTexCoord.y + vOffsets.y)).xyz);
    float fPixel21 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x, gsvTexCoord.y + vOffsets.y)).xyz);
    float fPixel22 = ComputeAverage(texture2D(gsuTexture, vec2(gsvTexCoord.x + vOffsets.x, gsvTexCoord.y + vOffsets.y)).xyz);
    vec3 vRow2 = vec3(fPixel20, fPixel21, fPixel22);
    vec3 vProducts0 = (mInput[0] * vRow0);
    vec3 vProducts1 = (mInput[1] * vRow1);
    vec3 vProducts2 = (mInput[2] * vRow2);
    vec3 vSums = vProducts0 + vProducts1 + vProducts2;
    return vSums.x + vSums.y + vSums.z;
}

void main()
{
    mat3 mHorizontal = mat3(1.0, 0.0, -1.0,    2.0, 0.0, -2.0,    1.0, 0.0, -1.0);
    mat3 mVertical = mat3(-1.0, -2.0, -1.0,    0.0, 0.0, 0.0,    1.0, 2.0, 1.0);
    float fHorizontalSum = ComputeConvolution(mHorizontal);
    float fVerticalSum = ComputeConvolution(mVertical);

    if ((fVerticalSum > 0.2) || (fHorizontalSum > 0.2) || (fVerticalSum < -0.2) || (fHorizontalSum < -0.2))
        gl_FragColor = vec4(1.0);      // Output white pixel.
    else
        gl_FragColor = vec4(0.0);      // Output black pixel.
}
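
For completeness, here is a minimal sketch of the application-side code that compiles this fragment shader and feeds it the two uniforms it expects, gsuTexture and gsuDimensions. It assumes an EGL context is already current and that the vertex shader, the shader source string and the camera-frame texture are created elsewhere, so they are passed in as parameters.

#include <GLES2/gl2.h>

// Hedged sketch: build and bind the Sobel program. The vertex shader, the GLSL source
// string for the fragment shader above and the camera-frame texture are assumed to be
// created elsewhere in the application.
GLuint CreateSobelProgram(GLuint uVertexShader, const char* szSobelSource,
                          GLuint uVideoTexture, GLfloat fWidth, GLfloat fHeight)
{
    GLuint uFragmentShader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(uFragmentShader, 1, &szSobelSource, NULL);         // Fragment shader text shown above.
    glCompileShader(uFragmentShader);

    GLuint uProgram = glCreateProgram();
    glAttachShader(uProgram, uVertexShader);
    glAttachShader(uProgram, uFragmentShader);
    glLinkProgram(uProgram);
    glUseProgram(uProgram);

    glActiveTexture(GL_TEXTURE0);                                      // Camera frame bound on texture unit 0.
    glBindTexture(GL_TEXTURE_2D, uVideoTexture);
    glUniform1i(glGetUniformLocation(uProgram, "gsuTexture"), 0);
    glUniform2f(glGetUniformLocation(uProgram, "gsuDimensions"), fWidth, fHeight);
    return uProgram;                                                   // Caller draws a full-screen quad with this program.
}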

Comparing the performance of these two approaches is interesting. The OpenCV approach does not use the GPU at all, only the IPU, and its measured ARM CPU use is about 25% with 1024 x 768 camera video. In the second approach, which uses the GPU to process the same video, ARM CPU use is less than 1%! This is not an entirely fair comparison, since the Canny filter is more general purpose and probably does a better job of rejecting noise, but the performance difference is still striking.


Frame Captured from Sobel Edge-Detection Implemented on GPU

 

My next steps will be to build on these two architectures to implement motion sensing for Lightwing.
