Internship Report

Bachelor’s thesis: Converting sign language into text using AI

We recently had the pleasure of co-promoting a Bachelor’s thesis for a student at University College Ghent. The research studies how artificial intelligence can be used to facilitate communication between Flemish Sign Language and spoken language.

Introduction

This thesis investigates whether artificial intelligence has the potential to improve communication between a deaf and a hearing person. Flemish Sign Language, like any other sign language, is an officially recognized language with its own vocabulary and grammar. Moreover, it is the mother tongue of approximately 6,000 Flemish people. Yet sign language is understood by only a small fraction of the world’s population, and difficulties quickly arise when a deaf and a hearing person try to have a simple conversation. The research therefore addresses a very contemporary problem for which technology such as artificial intelligence may be able to offer (part of) a solution.

This specific Bachelor’s thesis focuses on how this technology can be used to convert Flemish Sign Language into written text. It examines whether a smart application can be built that “looks” at the gestures of the deaf person through the camera of a device and then converts them into written or spoken text. In the opposite direction, the hearing person can say or type a few sentences, and these are translated into sign language on the basis of the sign language dictionary.

Sign language

Someone who is born deaf does not know what sound is, so it is natural that a visual language, namely a sign language, is the first language they learn. Sign language is a gestural-visual language: it is made visible through the hands, posture and facial expression. Every sign language is a full-fledged language with its own lexicon and grammar. The gestures used when speaking a sign language are different from the spontaneous gestures that hearing people make while speaking.

Sign language is a natural language. This means that sign languages were not ‘invented’ by humans, but originated naturally. As a result, sign languages differ around the world, and you speak the sign language of the country where you are born, just like a spoken language. There are currently 7,117 recognized languages worldwide, of which 144 are sign languages.

Flemish Sign Language is the sign language used by the Flemish Deaf community. It is generally known as VGT (Vlaamse Gebarentaal). It is officially recognized by Ethnologue: Languages of the World, is the mother tongue of about 6,000 Flemish people and is moreover mastered by about 7,000 Flemish people as a second language. The language originated from Belgian Sign Language.

Image Recognition

A cornerstone of the possible AI solution is image recognition. For image recognition, a Convolutional Neural Network (CNN) is used. These networks are among the most popular applications of Deep Learning and are well suited to image processing and computer vision. CNNs are mainly used to classify images, cluster them by similarity and identify objects. These algorithms can identify a variety of things: faces (face recognition), individuals (person detection), road signs (self-driving car technology), tumors (healthcare applications), and many other aspects of visual data.

A CNN algorithm sees an image differently than the human brain does. To an algorithm, an image looks like an array of thousands of numbers, each representing the color code of one pixel. The algorithm has to recognize patterns in all these numbers.

A CNN does this by breaking the image up into smaller groups of pixels and processing each group with a filter. Each filter is a small matrix of weights, and the network performs a series of calculations on the pixels the filter covers, comparing them to the specific pattern the network is looking for. In the first layers of the CNN, the algorithm detects simple patterns, such as rough edges and curves. As the network performs more convolutions, it starts to identify specific objects such as faces and animals.
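To make the filter idea concrete, here is a minimal sketch in plain Python of what a single convolution does: slide a small weight matrix over the image and take weighted sums of the pixels it covers. The edge-detecting kernel and the toy 4×5 “image” are illustrative; real CNNs use optimized libraries and learn the filter values during training.

```python
def convolve2d(image, kernel):
    """Apply a kernel (filter) to a 2D grayscale image, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out

# A hand-crafted vertical-edge filter: it responds strongly where
# pixel values change from left to right.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# A tiny "image" with a hard vertical edge between columns 2 and 3.
image = [[0, 0, 0, 9, 9],
         [0, 0, 0, 9, 9],
         [0, 0, 0, 9, 9],
         [0, 0, 0, 9, 9]]

print(convolve2d(image, edge_kernel))
# The output is near zero in the flat region and large where the
# filter sits on the edge.
```

The filter only “fires” where the image matches its pattern; a trained CNN stacks many such learned filters in successive layers.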

How does a CNN know what pattern to look for and whether its prediction is accurate? This is learned from a large amount of labeled training data. When training starts, all weights and filter values are chosen randomly, so the initial predictions contain little logic and are rarely correct. Every time the CNN makes a prediction on labeled data, it uses an error (loss) function to measure how close its prediction was to the actual label of the image. Based on this loss, the CNN updates its weights and filter values and repeats the process. Ideally, each training pass becomes a little more accurate.

Transfer sign language to text

In this thesis, the focus is mainly on how artificial intelligence can be used to “look”, through the camera of a device, at the gestures performed by the gesture-maker and immediately convert them into text. To complicate things, however, saying one sentence in sign language is not simply showing a sequence of still images. Each gesture is one fluid movement, and in a sentence the gestures are performed one after the other. To convert sign language into text, the statements of the signer therefore need to be filmed.

Our research started by looking into existing technologies that can recognize and analyze the content of a video by means of Deep Learning. It soon became apparent that AI has not yet evolved far enough to recognize this specific content in video. After further research, a solution presented itself: several companies offer image recognition services for still images. These AI services are APIs used to recognize and analyze the content of images. It is, of course, possible to split the video of the gesture-maker into frames and use those frames as images for image recognition.

An average smartphone now films at about 30 frames per second. For performance reasons, it is not a good idea to send 30 frames every second to an API to find a match with a gesture. To keep the performance of the application in our research high, only five frames are extracted from each video of one gesture, and only these frames are sent to the API to look for the best match. The word with the highest matching percentage is returned in response. The reason multiple images per gesture are sent to the API is that a gesture is one smooth movement and thus looks different at every moment. The five frames therefore give the algorithm a greater chance of finding the correct word.
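The sampling and matching steps above can be sketched in a few lines. The function names, the fixed count of five frames, and the `(word, probability)` pair format are illustrative choices, not part of any specific API.

```python
def sample_frame_indices(n_frames, n_samples=5):
    """Pick n_samples frame indices spread evenly over 0..n_frames-1,
    so the whole movement of the gesture is covered."""
    if n_frames <= n_samples:
        return list(range(n_frames))
    step = (n_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]

def best_match(predictions):
    """predictions: one (word, probability) pair per frame sent to
    the image-recognition API; return the most likely word."""
    return max(predictions, key=lambda p: p[1])[0]

# A one-second clip at 30 fps: five evenly spaced frames.
print(sample_frame_indices(30))

# Pick the word with the highest matching percentage over the frames.
print(best_match([("potato", 0.41), ("hello", 0.93), ("hello", 0.87)]))
```

Spreading the samples evenly is a simple heuristic; it assumes the gesture spans the whole clip, which held for the short single-gesture videos used here.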

Since each signer performs their sign language in a slightly different way, image recognition technology comes in handy. The algorithms can be trained in such a way that things can be recognized in every possible situation, for example with a lot of or little light, different camera angles and variations in execution, and so on.

Proof of Concept

As previously mentioned, image recognition technologies will be used to analyze the images. There are currently several APIs that offer services for this. In the thesis, the three best-known services were discussed: Azure Custom Vision, Google AutoML Vision and Amazon Rekognition.

Azure Custom Vision

Azure Custom Vision is one of the many available AI services from the Azure Cognitive Services. Azure Cognitive Services is developed and offered by Microsoft and is a comprehensive family of AI services and cognitive APIs. It helps developers build intelligent apps, without requiring specific knowledge of machine learning.

Custom Vision is the API that makes image recognition possible. It lets you build your own image classification system, put it into use and continuously improve it. The Custom Vision service uses a machine learning algorithm that classifies images within the labels you provide. The service offers two functions, namely classification and object detection. Image classification assigns one or more labels to an image. Object detection is similar, but also returns the coordinates in the image where the applied label can be found. In addition, when creating a new model, you can choose from different variants of the Custom Vision algorithm. The variants are optimized for images around a specific subject, for example commodities or landmarks, such as the Eiffel Tower.
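As a rough illustration, classifying an image boils down to one HTTP POST against the project's prediction endpoint. The URL shape and header names below follow the v3.0 prediction API as documented at the time of writing; verify them against the current Azure documentation before use, since versions and paths change. The endpoint, project id, iteration name and key are placeholders, and no request is actually sent here.

```python
def build_classify_request(endpoint, project_id, iteration_name, prediction_key):
    """Assemble the URL and headers for a Custom Vision image
    classification call (image bytes go in the request body)."""
    url = (f"{endpoint}/customvision/v3.0/Prediction/{project_id}"
           f"/classify/iterations/{iteration_name}/image")
    headers = {
        "Prediction-Key": prediction_key,
        "Content-Type": "application/octet-stream",  # raw image bytes
    }
    return url, headers

url, headers = build_classify_request(
    "https://westeurope.api.cognitive.microsoft.com",
    "my-project-id", "Iteration1", "<your-prediction-key>")
print(url)
```

The response is JSON containing one probability per tag, from which the best-scoring word can be taken.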

Google Cloud AutoML Vision

Google also offers a number of Machine Learning products, but these are currently much less comprehensive than the Azure Cognitive Services. Google has two products that use image recognition. The first, the Vision API, offers powerful pre-trained machine learning models for classifying images: it can quickly classify images into millions of predefined categories and can also read printed and handwritten text. Unfortunately, the Vision API cannot be used to classify gestures in a sign language, as it does not contain a pre-trained model for this. This is where Google Cloud’s AutoML Vision comes to the rescue: it makes it possible to train a machine learning model to classify images according to self-defined labels.

Amazon Rekognition Custom Labels

Finally, Amazon also offers several ML services, and here too there are two products that can recognize the content of an image: Amazon Rekognition and Amazon Rekognition Custom Labels. As the names suggest, one product consists of pre-trained models and the other can be trained for your own needs with your own data.

Training the application

During the thesis, we used the Azure Custom Vision service for the development of a demo application.

The very first step was to collect data for a training and test data set. The Custom Vision AI service is optimized to quickly spot differences between images, so that it is possible to start prototyping a model with a relatively small amount of data. About 50 images per gesture should already be a good start for a training dataset.

In addition, a labeled test data set is also required. This serves to estimate how well the model will score on new data and to prevent overfitting. The test data set will consist of a collection of labeled images that have never been used before during training.
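The split described above can be sketched as follows: shuffle the labeled images once and carve off a held-out fraction that is never used during training. The file names and the 20% test fraction are illustrative.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle items reproducibly and return (train, test) lists
    with no overlap, so test images are never seen in training."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# 50 labeled images of one gesture, as suggested in the text.
images = [f"potato_{i:03d}.jpg" for i in range(50)]
train, test = train_test_split(images)
print(len(train), len(test))  # 40 10
```

Fixing the random seed keeps the split reproducible across training rounds, so later models are evaluated on the same held-out images.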

Three factors for a well-trained model

The quality of an image classification system depends on the amount, quality and variety of the provided labeled data.

Quantity in data

The number of training images is the most important factor. It is recommended to use a minimum of 50 images per label as a starting point. With fewer images, there is a higher risk of overfitting: the performance figures will suggest good quality, but the model will struggle with real-world data.

From this it can be concluded that a large amount of data is required to train an algorithm to recognize gestures from images. Each gesture will require a fair number of images to be trained properly. Yet it is difficult to put a concrete number on this: the amount of data required depends strongly on its quality. A smaller number of training images taken in a representative environment will produce a better result than a larger number of less representative images. Fifty samples per label are said to be a good starting point, but in practice it can easily take several hundred samples per label.

Balanced data

In addition, it is important to take into account the relative amounts of the training data. For example, using 500 images for one label and only 50 images for another label creates an unbalanced dataset. This will result in the model being more accurate in predicting one label versus another. Maintaining a ratio of at least 1 to 2 between the label with the fewest images and the label with the most images, respectively, will likely provide satisfactory results. For example, if the label with the highest number of images has 500 images, then the label with the least number of images must have at least 250 images for training.
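The two rules of thumb above, at least roughly 50 images per label and at most a 1:2 gap between the smallest and largest label, can be turned into a simple dataset check. The thresholds are the ones given in the text; the function name is illustrative.

```python
from collections import Counter

def dataset_issues(labels, min_per_label=50, max_ratio=2.0):
    """labels: one label string per training image.
    Returns a list of human-readable warnings (empty = looks fine)."""
    counts = Counter(labels)
    issues = []
    for label, n in counts.items():
        if n < min_per_label:
            issues.append(f"{label}: only {n} images (< {min_per_label})")
    if max(counts.values()) > max_ratio * min(counts.values()):
        issues.append("unbalanced: the largest label has more than "
                      f"{max_ratio}x the images of the smallest")
    return issues

# 500 images for one gesture but only 50 for another: unbalanced.
labels = ["potato"] * 500 + ["hello"] * 50
print(dataset_issues(labels))
```

Running such a check before every training round makes it hard to drift into an unbalanced dataset as images are added over time.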

When training a model to recognize gestures, we initially start with an equal amount of data per gesture to be recognized. After a training round, it may turn out during testing that one gesture scores less accurately and is repeatedly confused with other gestures. To correct this lower accuracy, more examples of that specific gesture must be added to the dataset and the algorithm retrained, keeping in mind that the dataset must remain balanced at all times during the training process.

Variety in data

The variety of the training data is a final important factor that helps prevent overfitting and thus improves the quality of the algorithm. It is necessary to provide images that are representative of what will be submitted to the classification system during normal use. Otherwise, due to too little variation, the classification system may incorrectly learn to make predictions based on arbitrary features the images have in common. The figure below shows what a lack of variation looks like in a classification between apples and citrus fruits: the classifier may end up giving unnecessary importance to the hands or the white plates, rather than focusing on the characteristics of the apples or citrus fruits.

To avoid this problem it is necessary to provide a wide variety of images and thus ensure that the classification system can generalize properly. Some ways to make the dataset more diverse are:
• Place the object against different backgrounds
• Provide images with varying lighting
• Vary the size and quantity of the object
• Provide images taken from different camera angles

After sufficient and representative data has been collected, we can start training the algorithm. The first step is to create a new project in Azure Custom Vision. We choose classification, a single tag per image and the general domain. Next, all images of the training data are added to the project and labeled. The figure below, for example, shows that 32 images of the gesture “Potato” have been added.

Then the algorithm trains itself on the given data and calculates its accuracy by testing itself on the training data. The images below show that the algorithm performs well on the training data: this example indicated a precision of 100% after the first training round (a sign of overfitting). This percentage drops when the algorithm is tested on images it has never seen.

Once the algorithm has finished training, it can be tested manually using the test dataset. This checks whether the algorithm works properly and recognizes the gestures correctly. The chance is very small that an algorithm works well after the first training round; usually a number of training rounds follow before it is ready to be used. In the following rounds, the training dataset should be expanded with more images, including images with more elements from the environment in which the algorithm will be used. In our case, for example, we added images of gestures made indoors and outdoors, pictures taken from further away and closer by, from different angles and under other types of lighting. Once the algorithm works properly, it is ready to go live and classify the gestures for which it has been trained.

Feedback and retraining

Even after the application has been put into use, it is recommended to keep improving it. To this end, a function can be added to the application that lets users verify the model. For example, after translating a gesture, the application can show the user a button to indicate whether the returned word was correct or incorrect. If the user indicates an error, the application stores the corresponding gesture image in a separate collection. The developer can then review these images with false predictions and try to figure out why those gestures went wrong. Common causes are color combinations or types of lighting that are barely represented in the training dataset. Finally, these misclassified images, along with some similar examples, are added to the training dataset and the algorithm is retrained. This function takes little of the user’s time and can be very valuable for the further development of the application.
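The feedback loop described above can be sketched as a small component that files user-flagged frames away per gesture for later review. The class and method names are illustrative, and the in-memory dictionary stands in for whatever real storage the application would use.

```python
from collections import defaultdict

class FeedbackStore:
    """Collects frames that the user flagged as wrongly translated,
    grouped by the word the model predicted."""

    def __init__(self):
        self._wrong = defaultdict(list)  # predicted word -> frames

    def report(self, predicted_word, frame, correct):
        """Called after each translation, once the user has tapped
        'correct' or 'incorrect'."""
        if not correct:
            self._wrong[predicted_word].append(frame)

    def for_retraining(self):
        """Images for the developer to review and add back into the
        training dataset before the next training round."""
        return dict(self._wrong)

store = FeedbackStore()
store.report("potato", "frame_001.jpg", correct=True)
store.report("potato", "frame_002.jpg", correct=False)
print(store.for_retraining())  # only the flagged frame is kept
```

Grouping by predicted word makes it easy to spot which gestures fail most often and therefore which labels need more training examples.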

Conclusion

This thesis investigated how artificial intelligence can be used to promote communication between Flemish Sign Language and spoken language. The research results confirm that the application possibilities of artificial intelligence have advanced so far today that it is possible to convert sign language into written text using image recognition technology.

Based on the proof of concept that we realized, it can be concluded that an application can be brought to the market that converts Flemish Sign Language into spoken or written text and vice versa. The application then looks through the camera of a smartphone at the gestures that the gesture-maker performs and converts these into words using image recognition technology. In the opposite direction, the hearing person can speak or type a few sentences. These sentences are converted into sign language on the basis of the Flemish Sign Language dictionary. The deaf person can then view the sign language.

Student: Yasmine De Winne
Promoter: Stefaan De Cock
Co-promoter: Wouter Baetens
Institution: University College Ghent
Academic year: 2019-2020
