Data Gathering For Image Recognition

The other day I was reading an interesting blog post by Henk Boelman (a Microsoft Advocate). He described how you can build an image classifier with Azure Machine Learning Studio, using a dataset he had downloaded from Kaggle. But what if you can't find what you need on Kaggle? This blog post discusses how you can use Bing Image Search to generate your own dataset.

First, some explanation. The “Bing Image Search API” is part of Cognitive Services on Azure and lets you search for images just as you would on bing.com/images. Its search function takes a keyword and can also filter the results by size (height/width), file size, license, color, …

More information and a free, easy way to try it out can be found on Microsoft's product page:
https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/

Since we are talking about machine learning, all the code discussed here is in Python. That way you can simply copy it into your Jupyter/Azure Notebook.

Step 1: Install/Import necessary packages

The Bing Image Search API has its own Python SDK, which is very handy. If you prefer, you can also call it with classic HTTP requests.

!pip install azure-cognitiveservices-search-imagesearch

from azure.cognitiveservices.search.imagesearch import ImageSearchClient
from msrest.authentication import CognitiveServicesCredentials
import pandas as pd
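For the plain HTTP route mentioned in this step, here is a minimal sketch of how such a request could be put together. It only builds the URL and headers and sends nothing. The path `/bing/v7.0/images/search`, the example endpoint and the helper name `build_search_request` are assumptions for illustration, so check them against the endpoint shown for your own resource.

```python
from urllib.parse import urlencode

# Assumption: the v7 REST path for Bing Image Search; verify it for your resource.
SEARCH_PATH = "/bing/v7.0/images/search"


def build_search_request(endpoint, subscription_key, query, count=150, **filters):
    """Return (url, headers) for a GET request; nothing is sent here."""
    params = {"q": query, "count": count}
    # Optional filters (for example license or size) are passed through as-is.
    params.update({k: v for k, v in filters.items() if v is not None})
    url = endpoint.rstrip("/") + SEARCH_PATH + "?" + urlencode(params)
    # The subscription key travels in a request header, not in the URL.
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    return url, headers


# Hypothetical endpoint and placeholder key, only to show the shape of the result.
url, headers = build_search_request(
    "https://example.cognitiveservices.azure.com", "[YourKeyHere]", "red apples", count=10
)
print(url)
```

You could then send it with any HTTP client, for example `requests.get(url, headers=headers)`, and read the results from the JSON body. The SDK used in the rest of this post hides these details.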

Step 2: Configure subscription key + endpoint

The subscription key and endpoint can be found in the Azure Portal after you create a Cognitive Services resource. You don't need a dedicated Bing Search resource: for a while now, Bing Search and many other cognitive services have been bundled into one resource, which means one endpoint and one key for all of them.

I also added the search term here.

subscription_key = "[YourKeyHere]"

subscription_endpoint = "[YourEndpointHere]"
search_term = "apples"
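If you would rather not paste the key into a notebook that you share, one option is to read it from environment variables. This is only a sketch: the variable names `BING_SEARCH_KEY` and `BING_SEARCH_ENDPOINT` and the helper `load_settings` are made up for this example.

```python
import os


def load_settings(env=None):
    """Read key and endpoint from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    key = env.get("BING_SEARCH_KEY")
    endpoint = env.get("BING_SEARCH_ENDPOINT")
    if not key or not endpoint:
        raise RuntimeError("Set BING_SEARCH_KEY and BING_SEARCH_ENDPOINT first")
    return key, endpoint


# An explicit mapping is passed here so the sketch runs without real credentials;
# call load_settings() with no argument to read the actual environment.
subscription_key, subscription_endpoint = load_settings(
    {"BING_SEARCH_KEY": "[YourKeyHere]", "BING_SEARCH_ENDPOINT": "[YourEndpointHere]"}
)
```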

Step 3: Configure Client

In this step we create two new objects. The first is the credentials object, which holds the subscription key we just configured. The second is the client that we will use to make the call; it also needs the endpoint.

credentials = CognitiveServicesCredentials(subscription_key)

client = ImageSearchClient(endpoint=subscription_endpoint, credentials=credentials)

Step 4: Search and gather the results

Searching is very easy: call the search function on the client object. It has a number of parameters you can set (such as color, size, …). Here we only set what we are searching for and how many results we want back.

# Search for images
image_results = client.images.search(query=search_term, count=150)
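If one request does not return enough images, a possible approach is to request several pages with the `offset` parameter. The sketch below assumes that 150 is the largest `count` a single call accepts and that extra keyword arguments such as `color` or `size` are forwarded to the search function unchanged. The helper name `search_images` is hypothetical.

```python
def search_images(client, query, total=300, page_size=150, **filters):
    """Collect up to `total` results by requesting pages of at most `page_size`."""
    results = []
    offset = 0
    while offset < total:
        count = min(page_size, total - offset)
        # Extra keyword arguments (filters) are passed straight to the search call.
        page = client.images.search(query=query, count=count, offset=offset, **filters)
        items = page.value or []
        if not items:
            break  # no more results for this query
        results.extend(items)
        offset += len(items)
    return results
```

With the client from Step 3 this could be called as `search_images(client, search_term, total=300, color="Red")`. Paging by a running offset can return the same image twice, so dropping duplicate `content_url` values afterwards is a sensible extra step.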

Step 5: Convert to dataframe

The result we receive from the API is a list of objects. That might be handy in, say, a C# application, but here we want to work with plain tabular data, so the code below converts the result to a dataframe. You will notice that we get quite a lot of information back. Some of it is still stored as a nested object (the values containing "_type"), but we don't need that data here, so we leave it as it is.

# Convert to dataframe

df = pd.DataFrame([x.as_dict() for x in image_results.value])

df.head()

  | _type | accent_color | content_size | content_url | date_published | encoding_format | height | host_page_display_url | host_page_url | image_id | image_insights_token | insights_metadata | name | thumbnail | thumbnail_url | web_search_url | width
0 | ImageObject | 9D2E39 | 224643 B | https://upload.wikimedia.org/wikipedia/commons… | 2019-09-14T23:40:00.0000000Z | jpeg | 1200 | https://en.wikipedia.org/wiki/Apple | https://en.wikipedia.org/wiki/Apple | 6AEAF1C3894ED7D563916634549ED00175C970D3 | ccid_pfp3ysAm*mid_6AEAF1C3894ED7D563916634549E… | {} | "Apple – Wikipedia" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse3.mm.bing.net/th?id=OIP.pfp3ysAmXA6… | https://www.bing.com/images/search?view=detail… | 1200
1 | ImageObject | 4F300B | 161831 B | https://upload.wikimedia.org/wikipedia/commons… | 2019-12-21T20:48:00.0000000Z | jpeg | 1083 | https://en.wikipedia.org/wiki/Honeycrisp | https://en.wikipedia.org/wiki/Honeycrisp | 4531DA369849BECBA2980492B1F021A9EF15BB99 | ccid_wPWDNJmR*mid_4531DA369849BECBA2980492B1F0… | {} | "Honeycrisp – Wikipedia" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse1.mm.bing.net/th?id=OIP.wPWDNJmRmNP… | https://www.bing.com/images/search?view=detail… | 1200
2 | ImageObject | C47807 | 172706 B | https://www.tasteofhome.com/wp-content/uploads… | 2018-08-15T00:12:00.0000000Z | jpeg | 1200 | https://www.tasteofhome.com/collection/new-typ… | https://www.tasteofhome.com/collection/new-typ… | EC1EBFE716FD1033C0E0983931FEC881E8A1177D | ccid_7jSg14s4*mid_EC1EBFE716FD1033C0E0983931FE… | {} | "15 New Types of Apples You Should Be Buying | …" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse2.mm.bing.net/th?id=OIP.7jSg14s46gp… | https://www.bing.com/images/search?view=detail… | 1200
3 | ImageObject | 7E0F24 | 4462721 B | http://www.michiganapples.com/portals/0/MAC%20… | 2019-11-15T14:57:00.0000000Z | jpeg | 3738 | www.michiganapples.com/About/Varieties | http://www.michiganapples.com/About/Varieties | 753A53E3C6A184B4C55A074E275F120D20D4914C | ccid_3X+TRdGR*mid_753A53E3C6A184B4C55A074E275F… | {} | "Michigan Apple Varieties | Michigan Apple Comm…" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse2.mm.bing.net/th?id=OIP.3X-TRdGReOQ… | https://www.bing.com/images/search?view=detail… | 3738
4 | ImageObject | 4F1E0F | 419131 B | http://www.flinchbaughsorchard.com/wp-content/… | 2019-11-01T04:51:00.0000000Z | jpeg | 1691 | www.flinchbaughsorchard.com/apple-varieties | http://www.flinchbaughsorchard.com/apple-varie… | 7480E3C7DDE4776D891D003001872D954786935C | ccid_GBWUKwPQ*mid_7480E3C7DDE4776D891D00300187… | {} | "Apple Varieties | Flinchbaugh’s Orchard & Farm…" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse1.mm.bing.net/th?id=OIP.GBWUKwPQyNo… | https://www.bing.com/images/search?view=detail… | 1800
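The thumbnail column in this output still holds a nested dictionary. This post does not need it, but if you ever do, one option is to flatten each record before building the dataframe. The helper `flatten_record` below is a hypothetical sketch, and the sample record uses made-up values rather than real API output.

```python
def flatten_record(record, parent_key="", sep="_"):
    """Turn nested dicts into one level, e.g. thumbnail -> thumbnail_width."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested dictionaries.
            flat.update(flatten_record(value, name, sep))
        else:
            # Scalars and empty dicts are kept as they are.
            flat[name] = value
    return flat


# Made-up record, shaped like one entry of x.as_dict().
sample = {
    "name": "Example image",
    "width": 1200,
    "thumbnail": {"_type": "ImageObject", "width": 100, "height": 80},
    "insights_metadata": {},
}
flat_sample = flatten_record(sample)
print(flat_sample)
```

In the notebook this could be used as `pd.DataFrame([flatten_record(x.as_dict()) for x in image_results.value])`. The `json_normalize` function in pandas does something similar, but it joins the names with a dot by default.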

Step 6: Generate new filenames

Since the results come from many different websites, some filenames may be identical. In addition, some URLs have no file extension at all, because of routing or on-the-fly image generation. To fix this we add a function that guesses the MIME type from the URL and looks up the matching extension. Combined with a generated GUID, this gives us a unique filename.

import mimetypes
import uuid

def getFileName(contentUrl):
    # guess_type returns a (type, encoding) tuple; type is None if unknown
    mt = mimetypes.guess_type(contentUrl)
    if mt[0] is not None:
        ext = mimetypes.guess_extension(mt[0])
        # guess_extension can also return None for types it has no mapping for
        if ext is not None:
            return str(uuid.uuid1()) + ext
    # an empty string marks the rows we filter out later
    return ""

By applying the function above to each row, we add an extra column to the dataframe containing the newly generated fileName. NOTE: you might have noticed that the function can return an empty string. This only happens when it can't figure out the MIME type. We filter those results out.

df['fileName'] = df.apply(lambda x: getFileName(x.content_url), axis=1)

df = df[df['fileName'] != ""]

df.head()

  | _type | accent_color | content_size | content_url | date_published | encoding_format | height | host_page_display_url | host_page_url | image_id | image_insights_token | insights_metadata | name | thumbnail | thumbnail_url | web_search_url | width | fileName
0 | ImageObject | 9D2E39 | 224643 B | https://upload.wikimedia.org/wikipedia/commons… | 2019-09-14T23:40:00.0000000Z | jpeg | 1200 | https://en.wikipedia.org/wiki/Apple | https://en.wikipedia.org/wiki/Apple | 6AEAF1C3894ED7D563916634549ED00175C970D3 | ccid_pfp3ysAm*mid_6AEAF1C3894ED7D563916634549E… | {} | "Apple – Wikipedia" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse3.mm.bing.net/th?id=OIP.pfp3ysAmXA6… | https://www.bing.com/images/search?view=detail… | 1200 | 11c6ddea-3197-11ea-87d5-000d3aaa7d6e.jpe
1 | ImageObject | 4F300B | 161831 B | https://upload.wikimedia.org/wikipedia/commons… | 2019-12-21T20:48:00.0000000Z | jpeg | 1083 | https://en.wikipedia.org/wiki/Honeycrisp | https://en.wikipedia.org/wiki/Honeycrisp | 4531DA369849BECBA2980492B1F021A9EF15BB99 | ccid_wPWDNJmR*mid_4531DA369849BECBA2980492B1F0… | {} | "Honeycrisp – Wikipedia" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse1.mm.bing.net/th?id=OIP.wPWDNJmRmNP… | https://www.bing.com/images/search?view=detail… | 1200 | 11c6e2ea-3197-11ea-87d5-000d3aaa7d6e.jpe
2 | ImageObject | C47807 | 172706 B | https://www.tasteofhome.com/wp-content/uploads… | 2018-08-15T00:12:00.0000000Z | jpeg | 1200 | https://www.tasteofhome.com/collection/new-typ… | https://www.tasteofhome.com/collection/new-typ… | EC1EBFE716FD1033C0E0983931FEC881E8A1177D | ccid_7jSg14s4*mid_EC1EBFE716FD1033C0E0983931FE… | {} | "15 New Types of Apples You Should Be Buying | …" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse2.mm.bing.net/th?id=OIP.7jSg14s46gp… | https://www.bing.com/images/search?view=detail… | 1200 | 11c6e60a-3197-11ea-87d5-000d3aaa7d6e.jpe
3 | ImageObject | 7E0F24 | 4462721 B | http://www.michiganapples.com/portals/0/MAC%20… | 2019-11-15T14:57:00.0000000Z | jpeg | 3738 | www.michiganapples.com/About/Varieties | http://www.michiganapples.com/About/Varieties | 753A53E3C6A184B4C55A074E275F120D20D4914C | ccid_3X+TRdGR*mid_753A53E3C6A184B4C55A074E275F… | {} | "Michigan Apple Varieties | Michigan Apple Comm…" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse2.mm.bing.net/th?id=OIP.3X-TRdGReOQ… | https://www.bing.com/images/search?view=detail… | 3738 | 11c6e8e4-3197-11ea-87d5-000d3aaa7d6e.jpe
4 | ImageObject | 4F1E0F | 419131 B | http://www.flinchbaughsorchard.com/wp-content/… | 2019-11-01T04:51:00.0000000Z | jpeg | 1691 | www.flinchbaughsorchard.com/apple-varieties | http://www.flinchbaughsorchard.com/apple-varie… | 7480E3C7DDE4776D891D003001872D954786935C | ccid_GBWUKwPQ*mid_7480E3C7DDE4776D891D00300187… | {} | "Apple Varieties | Flinchbaugh’s Orchard & Farm…" | {'_type': 'ImageObject', 'width': 474, 'height… | https://tse1.mm.bing.net/th?id=OIP.GBWUKwPQyNo… | https://www.bing.com/images/search?view=detail… | 1800 | 11c6ebaa-3197-11ea-87d5-000d3aaa7d6e.jpe
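Rows whose URL has no recognisable extension are dropped in this step. A possible refinement, sketched below, is to fall back on the encoding_format column that the API already returns. The function name `file_name_with_fallback` is hypothetical, and the sketch assumes that encoding_format holds a bare format name such as "jpeg". It also uses `uuid4` (random) instead of `uuid1`, which is a swap on my side and not something the original notebook does.

```python
import mimetypes
import uuid


def file_name_with_fallback(content_url, encoding_format=None):
    """Like getFileName, but falls back to the API's encoding_format value."""
    mime_type, _ = mimetypes.guess_type(content_url)
    ext = mimetypes.guess_extension(mime_type) if mime_type else None
    if ext is None and encoding_format:
        # Assumption: encoding_format is a bare name such as "jpeg" or "png".
        ext = "." + encoding_format.lower()
    if ext is None:
        return ""  # still unknown: let the caller filter this row out
    return str(uuid.uuid4()) + ext


# Hypothetical URL without a file extension.
example_name = file_name_with_fallback("https://example.com/image?id=123", "jpeg")
print(example_name)
```

Applied to the dataframe this would look like `df.apply(lambda x: file_name_with_fallback(x.content_url, x.encoding_format), axis=1)`.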

Step 7: Create folder to save the images

This is a folder in your local or cloud environment.

import os

dirName = "image_gathering"
# exist_ok avoids an error when the folder is already there
os.makedirs(dirName, exist_ok=True)
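If you plan to gather images for several search terms, one subfolder per term keeps them apart. The sketch below assumes that you want a clean folder on every run, so an existing folder is emptied first. The helper name `prepare_folder` is hypothetical, and the demo at the bottom runs in a temporary directory so that it touches nothing else.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_folder(base_dir, search_term, clear=True):
    """Create base_dir/search_term; optionally empty it if it already exists."""
    target = Path(base_dir) / search_term
    if target.exists() and clear:
        shutil.rmtree(target)  # remove old downloads from a previous run
    target.mkdir(parents=True, exist_ok=True)
    return str(target)


# Demo in a throwaway directory: the second call clears the stale file.
with tempfile.TemporaryDirectory() as base:
    first = prepare_folder(base, "apples")
    Path(first, "old.jpg").write_text("stale")
    second = prepare_folder(base, "apples")
    remaining = list(Path(second).iterdir())
```

In the notebook you would call it as `dirName = prepare_folder("image_gathering", search_term)`.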

Step 8: Download images

First we create a function that tries to download the file to the new folder under the new filename. The Bing index may be outdated, so some downloads can fail (e.g. NotFound, NotAuthenticated, …). The function therefore returns True or False depending on whether the download was successful.

!pip install wget

import wget

def downloadFile(dirName, contentUrl, fileName):
    try:
        # save the image as dirName/fileName
        wget.download(contentUrl, dirName + "/" + fileName)
        return True
    except Exception:
        # any failure (HTTP error, timeout, bad URL) counts as not downloaded
        return False

By applying this function to every row, every result is downloaded, and we add an extra column that keeps track of whether the download was successful.

df['fileDownloaded'] = df.apply(lambda x: downloadFile(dirName, x.content_url, x.fileName), axis=1)
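If you prefer not to install an extra package, the same idea can be sketched with the standard library only. This is an alternative to the wget-based function, not what the original notebook uses. The name `download_file` and the 10 second timeout are assumptions. The small demo copies a local file through a file:// URL so that it runs without network access.

```python
import http.client
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(dir_name, content_url, file_name, timeout=10):
    """Download content_url to dir_name/file_name; return True on success."""
    target = Path(dir_name) / file_name
    try:
        with urllib.request.urlopen(content_url, timeout=timeout) as response:
            with open(target, "wb") as out:
                shutil.copyfileobj(response, out)
        return True
    except (OSError, ValueError, http.client.HTTPException):
        # URL errors and timeouts are OSError subclasses; ValueError covers bad URLs.
        target.unlink(missing_ok=True)  # drop a partial file, if any
        return False


# Offline demo: "download" a local file, then try one that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "source.jpg")
    source.write_bytes(b"fake image bytes")
    ok = download_file(tmp, source.as_uri(), "copy.jpg")
    copied = Path(tmp, "copy.jpg").read_bytes()
    missing = download_file(tmp, Path(tmp, "nope.jpg").as_uri(), "bad.jpg")
```

The timeout is the main design difference: a single unresponsive server then cannot stall the whole `df.apply` run.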

THE END

It's not that hard, and quite fast, to gather your own data from Bing Search results. If you want images for different search terms, you will need to run this script multiple times, or you can convert it into a function that accepts an array of search strings.
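As a rough sketch of that last idea, the function below loops over a list of search strings and collects one record per result. The name `collect_image_urls` and the record layout are hypothetical. It expects the client object from Step 3 and only reads the `content_url` attribute of each result.

```python
def collect_image_urls(client, search_terms, count=150):
    """Return one {"search_term", "content_url"} record per search result."""
    records = []
    for term in search_terms:
        response = client.images.search(query=term, count=count)
        for image in response.value or []:
            # Keep the search term so it can serve as the label later on.
            records.append({"search_term": term, "content_url": image.content_url})
    return records
```

A call such as `pd.DataFrame(collect_image_urls(client, ["apples", "pears"]))` would then give a single dataframe in which the search_term column can double as the class label for the classifier.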

Extra – Move data to datastore

If you are using Azure Machine Learning Studio, it's a good idea to move your images to a datastore so that other data scientists can use them too. The code below uploads the image folder to a folder of the same name on your datastore.

import azureml.core

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

datastore_name = 'workspaceblobstore'
datastore = Datastore.get(ws, datastore_name)
datastore.upload(dirName, dirName)

The full Jupyter Notebook can be found on my GitHub:

https://github.com/sammydeprez/PythonExamples/blob/master/BingImageSearch/BingImageSearch.ipynb

Enjoy gathering new data!

[Initially posted here]
