Teaching Computers to “See”

An image contains a wealth of information. It can convey the context of an event, showing what happened, who was involved and what it resulted in. To this end, many organizations have utilized imagery as a key data format in recent years to support processes in improving quality control in semiconductor manufacturing, analyzing tissue samples in pharmaceutical R&D and other sectors. One of the main challenges of images, however, is efficiently extracting the information they contain. Traditionally, researchers and engineers have manually reviewed and annotated images, but this time- and labor-intensive approach is becoming unsustainable, as experiments and operations increase in scale and complexity.

Automating image analysis computationally offers significant promise to tackle this challenge, but it has previously required deep domain expertise to implement effectively. Why is this? The key lies in teaching an image analytics application to understand “meaning.” Consider a program that identifies dogs in pictures: what makes a dog a “dog?” What makes it different from a cat? What if the dog is partially obscured by a bush or is wearing a hat? What if the image is upside down? All of these factors impact the complexity of the application. Until recent advances in machine learning, the design of such image analytics tools was often too labor intensive for widespread enterprise use.

The “How:” Convolutional Neural Networks

Modern image analytics leverage the ability of computers to recognize and manipulate patterns. As data, images are fundamentally collections of pixels arrayed in a square or rectangle. These pixels, taken together, can be combined to form edges, shapes, textures and objects. Subconsciously, this is how we see, building up these patterns into entities to which we ascribe meaning: a bottle, a laptop, a bird. Image analytics applications seek to replicate this process.

Machine learning, specifically deep learning, has made image analytics much more accessible. Previous approaches required manual feature definition, i.e., developers told the program what shapes and patterns to look for and what they meant. Convolutional Neural Networks (CNNs), a form of deep learning, generalized this process, allowing the computer to teach itself what features were meaningful. How is this possible? The multilayered structure of a CNN allows it to break the picture down into more manageable pieces (Figure 1). Convolution layers apply filters to regions of the image to detect features. This results in a stack of new images equal to the number of filters applied in the given layer. Pooling layers take these images and “downsample” them, reducing the dimensionality (i.e., size of the image) while preserving the spatial distribution of the data.

BIOVIA Technical Marketing