Azure Cognitive Search, NLP, OCR, Custom Entity Detection

TAFUTA - A cognitive search platform

Multiple (VITO, FIT, VLAIO, VIB)

CONTEXT

Make internal content discoverable with Tafuta

Organisations like Flanders Investment & Trade, VITO, VLAIO, and VIB have deployed Tafuta to make their internal content discoverable. Tafuta, Swahili for 'search', is a cognitive search platform that combines full-text search with AI-driven enrichment to turn raw documents into searchable, structured information. It won the Smart Investigation challenge in the 2020 Smart Policing Hackathon.

Research paper discovery

VITO uses Tafuta to give technology researchers a tool for retrieving relevant research papers and documents more accurately. The data sources span PDFs, Word documents, PowerPoint presentations, and Excel files from on-premises storage, plus external sources like Google Scholar. A custom entity detection skill enables searches based on economic, chemical, and process parameters.

Rich result presentation

Search results appear in a table showing the most relevant documents first, plus a graph representation showing how documents are connected and related to specific entities (people, organisations) or topics. The document database is automatically populated with new content based on research trends.

Custom per deployment

Each deployment includes custom models and classifiers tuned to the client's specific business goals and industry requirements.

HOW IT WORKS

What's a cognitive search solution?

Tafuta is built on the cognitive search pattern: a knowledge retrieval service with built-in AI capabilities. It provides a full-text search engine, persistent storage of search indexes, and integrated AI used during indexing to extract more text and structure from documents.

How it works

The solution uses NLP and AI services across vision, language, and speech, including OCR, translation, key phrase extraction, and entity detection. Together, these transform raw, unstructured information into searchable content. It handles Microsoft Word, PowerPoint, Excel, PDF, PNG, RTF, JSON, HTML, and XML formats.

Indexing and querying

The two primary workloads are indexing and querying. Indexing brings text into the solution and makes it searchable by processing inbound text into tokens stored in inverted indexes. Once an index is populated, you can send query requests that use relevance tuning, autocomplete, synonym matching, fuzzy matching, pattern matching, filtering, and sorting.
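To make the indexing and querying split concrete, here is a minimal Python sketch of an inverted index. It is an illustration of the general technique, not Tafuta's implementation: the tokenizer, the function names, and the sample documents are all assumptions, and the query uses simple AND matching with none of the relevance tuning or fuzzy matching mentioned above.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Indexing: map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Querying: return ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    "doc1": "Membrane filtration of process water",
    "doc2": "Process parameters for catalyst recovery",
    "doc3": "Water quality monitoring report",
}
index = build_inverted_index(docs)
hits = search(index, "process water")  # only doc1 contains both tokens
```

The index is built once per document set and queried many times, which is why the work is split into two workloads.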

AI ENRICHMENT

AI-enriched search engine

AI extracts text from images, blobs, and unstructured data sources, making content more searchable. Enrichment and extraction are implemented as cognitive skills attached to the indexer-driven pipeline. These include both built-in skills and custom skills we create for your domain.
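As a rough idea of what a custom skill can look like, the sketch below assumes the pipeline uses the Azure Cognitive Search custom Web API skill contract, where the indexer sends a batch of records under `values` and expects one result per `recordId`. The input name `text`, the output name `parameters`, and the regular expression are hypothetical choices for illustration; a real deployment would use a trained entity model and would expose this function behind an HTTP endpoint.

```python
import re

# Hypothetical pattern: a number followed by a unit, e.g. "350 °C" or "2.5 bar".
PARAMETER_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|bar|kPa|mol/L)")


def detect_parameters(text):
    """Return every number-plus-unit pair found in the text."""
    return [
        {"value": float(value), "unit": unit}
        for value, unit in PARAMETER_PATTERN.findall(text)
    ]


def run_skill(request_body):
    """Process one batch of records and return one result per recordId."""
    results = []
    for record in request_body.get("values", []):
        output = {
            "recordId": record["recordId"],
            "data": {},
            "errors": [],
            "warnings": [],
        }
        text = record.get("data", {}).get("text")
        if isinstance(text, str):
            output["data"]["parameters"] = detect_parameters(text)
        else:
            # Report the problem for this record without failing the batch.
            output["errors"].append(
                {"message": "Missing or non-string 'text' input."}
            )
        results.append(output)
    return {"values": results}


# Example batch with one valid record and one record without text.
response = run_skill({
    "values": [
        {"recordId": "1", "data": {"text": "Reaction at 350 °C and 2.5 bar"}},
        {"recordId": "2", "data": {}},
    ]
})
```

Handling errors per record means one malformed document does not block enrichment of the rest of the batch.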

Natural language processing skills

These skills cover entity recognition, language detection, key phrase extraction, text manipulation, sentiment detection, and PII detection. With them, unstructured text is mapped to searchable and filterable fields in the index.

Image processing skills

These skills cover Optical Character Recognition (OCR), facial detection, image interpretation, recognition of famous people and landmarks, and detection of attributes such as image orientation. They create text representations of image content, making visual information searchable.

Practical scenarios

Scanned documents (JPEG) become full-text searchable via OCR. PDFs that combine images and text benefit from NLP processing, which produces better results than standard indexing. Multilingual content is handled with automatic language detection and translation. Multimedia analysis covers audio, video, and images.

WHEN TO USE IT

When should you use a cognitive search solution?

A cognitive search solution fits organisations that need an in-company search experience similar to commercial web search engines, or need to consolidate heterogeneous content types into a private, user-defined search index.

Unstructured content at scale

When your raw content is largely undifferentiated text, images, or application files, the search solution identifies and extracts text during indexing, creates structure, and generates new information such as translated text or detected entities.

Custom text analytics

When your content needs linguistic or custom text analysis, analysers can be configured for specialised processing: filtering out diacritics, recognising patterns in strings, or building domain-specific entity classifiers for finance, science, or medicine.
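Two of the analysis steps named above, filtering out diacritics and recognising patterns in strings, can be sketched in a few lines of Python. This is an illustrative approximation using the standard library, not the analyser configuration of the search service itself. The function names are assumptions, and the CAS registry number pattern is only one example of a domain-specific pattern; it checks the digit layout and does not validate the check digit.

```python
import re
import unicodedata

# Illustrative pattern for CAS registry numbers, e.g. "64-17-5" (format only).
CAS_PATTERN = re.compile(r"\b\d{2,7}-\d{2}-\d\b")


def fold_diacritics(text):
    """Decompose characters and drop combining marks, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyse(text):
    """Fold diacritics, lowercase, and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", fold_diacritics(text).lower())


def find_cas_numbers(text):
    """Return all substrings that look like CAS registry numbers."""
    return CAS_PATTERN.findall(text)


tokens = analyse("Crème brûlée à Liège")
cas_numbers = find_cas_numbers("Ethanol (64-17-5) in water (7732-18-5)")
```

Applying the same folding to both documents and queries lets a search for "liege" match "Liège".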