Tafuta: AI-powered Cognitive Search

Is your organization struggling to identify and explore relevant content at scale? Are you confronted with different document types spread across various sources that you want to make easily searchable for your employees? Does your company hold a lot of content that needs linguistic or custom text analytics? Then a cognitive search solution might be what you need.

What’s a Cognitive Search Solution?

A Cognitive Search solution is a knowledge retrieval service with built-in AI capabilities. It provides a full-text search engine that performs indexing and query execution, persistent storage for the search indexes that you create and manage, and a natural query language for composing simple to complex queries. AI is integrated into the indexing step to extract additional text and structure from documents: the solution uses Natural Language Processing (NLP) and AI services across vision, language, and speech (OCR, translation, key phrase extraction, and detection of locations, people, and organizations) to transform raw, unstructured information into searchable content. A Cognitive Search solution is also integrated with services that automate data ingestion and retrieval from data sources, so your solution runs on an ever-growing content database. It can handle several popular file formats, such as Microsoft Word, PowerPoint, and Excel, as well as Adobe PDF, PNG, RTF, JSON, HTML, and XML.

The core of a Cognitive Search solution sits between the external data stores that contain your unindexed data and a client app that sends query requests to a search index and handles the response. An index schema determines the structure of the searchable content. At a high level, a Cognitive Search solution typically works as follows.

The two primary workloads of a Cognitive Search solution are indexing and querying.

  • Indexing brings text into your search solution and makes it searchable. Internally, inbound text is processed into tokens and stored in inverted indexes for fast scans. The subsequent analysis and transformations can result in new information and structures that did not previously exist, providing high utility for many search and knowledge mining scenarios.
  • Querying starts once an index is populated with searchable data: your organization can then send query requests to the solution. Search features include relevance tuning, autocomplete, synonym matching, fuzzy matching, pattern matching, filtering, and sorting.
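The two workloads can be illustrated with a minimal sketch. It is not how Tafuta or any particular search engine is implemented; the function names (`tokenize`, `build_inverted_index`, `search`) and the sample documents are assumptions made for illustration. Indexing turns text into tokens and stores them in an inverted index, and querying looks up the documents that contain every query token.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Indexing: map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Querying: return ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, used only for this illustration.
documents = {
    "doc1": "Annual report on chemical process safety",
    "doc2": "Safety guidelines for laboratory staff",
    "doc3": "Chemical inventory for the laboratory",
}
index = build_inverted_index(documents)
print(sorted(search(index, "chemical safety")))  # ['doc1']
```

A production engine adds relevance ranking, synonym and fuzzy matching, and filtering on top of this basic lookup.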

A Cognitive Search solution should match your specific business goals and industry requirements. That’s why we use custom models and classifiers for every client, such as legal clause classifiers or manufacturing material identifiers. After we’ve deployed the solution, we fine-tune the search results using rich, custom-tuned ranking models to tie your results to your business goals.

When should I use a Cognitive Search Solution?

A Cognitive Search solution is well suited for the following scenarios:

  • Your organization needs an in-company search experience similar to commercial web search engines.
  • You need to consolidate heterogeneous content types into a private, user-defined search index. You can populate the search index with documents from any source. Control over the index schema and refresh schedule is one of the key reasons for using our Cognitive Search solution.
  • Your raw content is largely undifferentiated text, image files, or application files. During indexing, the search solution identifies and extracts text, and creates structure and new information such as translated text or entities.
  • Your content needs linguistic or custom text analysis. With a Cognitive Search solution, you can configure analyzers to achieve specialized processing of raw content, such as filtering out diacritics, or recognizing and preserving patterns in strings.
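As a rough illustration of what a custom analyzer does, the sketch below strips diacritics and keeps strings that follow a known pattern together as a single token. The part-number pattern (letters, a hyphen, digits) and the function names are assumptions for this example, not the configuration of an actual product.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Normalize the text, then tokenize it.

    The first alternative of the pattern preserves hypothetical part
    numbers such as "ab-1234" as one token; the second matches plain words.
    """
    normalized = strip_diacritics(text).lower()
    return re.findall(r"[a-z]+-\d+|[a-z0-9]+", normalized)


print(analyze("Part AB-1234 für Café Zürich"))
# ['part', 'ab-1234', 'fur', 'cafe', 'zurich']
```

With an analyzer like this applied at both index and query time, a search for "cafe" also finds documents that contain "Café".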

AI-enriched search engine

In a Cognitive Search solution, AI can be used to extract text from images, blobs, and other unstructured data sources. Enrichment and extraction make your content more searchable. Both are implemented using cognitive skills attached to the indexer-driven pipeline. You can use ready-to-use skills (from Microsoft, for example) or embed external processing into a custom skill that we create for you. Examples of custom skills are a custom entity module or a document classifier that targets a specific domain such as finance, scientific publications, or medicine.
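To make the idea of a custom skill more concrete, here is a minimal sketch of a dictionary-based entity skill for the finance domain. Everything in it is hypothetical: the term list, the function names, and the batch shape (a `values` list of records with a `recordId` and a `data` payload), which is assumed here because web-API-style skills commonly exchange records in batches. A real custom skill would more likely wrap a trained model.

```python
import re

# Hypothetical domain dictionary; stands in for a trained entity model.
FINANCE_TERMS = {"ebitda": "metric", "ifrs": "standard", "goodwill": "asset"}


def extract_entities(text):
    """Return each known finance term found in the text once, in order."""
    found = []
    seen = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in FINANCE_TERMS and token not in seen:
            seen.add(token)
            found.append({"text": token, "type": FINANCE_TERMS[token]})
    return found


def run_skill(request):
    """Process a batch of records and return one result per input record."""
    results = []
    for record in request["values"]:
        text = record["data"].get("text", "")
        results.append({
            "recordId": record["recordId"],
            "data": {"entities": extract_entities(text)},
        })
    return {"values": results}


# Assumed request shape, for illustration only.
request = {"values": [
    {"recordId": "1", "data": {"text": "EBITDA rose under IFRS; EBITDA margin up."}},
]}
response = run_skill(request)
print(response["values"][0]["data"]["entities"])
```

The indexer would call such a skill for each batch of documents and map the returned entities onto fields of the search index.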

The built-in skills of a Cognitive Search solution fall into these categories:

  • Natural language processing skills include entity recognition, language detection, key phrase extraction, text manipulation, sentiment detection, and PII detection (extracts personal information from an input text and gives you the option of masking it). With these skills, unstructured text is mapped as searchable and filterable fields in an index.
  • Image processing skills include Optical Character Recognition (OCR) and identification of visual features, such as facial detection, image interpretation, image recognition (famous people and landmarks) or attributes like image orientation. These skills create text representations of image content, making it searchable using the query capabilities of the Search solution.

Natural language and image processing are applied during the data ingestion phase, and the results become part of a document’s composition in a searchable index in the Cognitive Search solution. Data is sourced as a data set and then pushed through an indexing pipeline using whichever built-in skills you need.
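The pipeline idea can be sketched as a list of skills that each add fields to a document before it is indexed. The two skills below are toy stand-ins for the built-in ones (a stop-word-based language detector and an email masker as a simple form of PII masking); their names, the stop-word lists, and the field names are assumptions for illustration only.

```python
import re


def detect_language(doc):
    """Toy language detection: count hits against tiny stop-word lists."""
    stop_words = {"en": {"the", "and", "is"}, "nl": {"de", "het", "een"}}
    words = set(re.findall(r"[a-z]+", doc["content"].lower()))
    scores = {lang: len(words & known) for lang, known in stop_words.items()}
    return {"language": max(scores, key=scores.get)}


def mask_emails(doc):
    """Toy PII skill: replace email addresses with a placeholder."""
    pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
    return {"masked_content": re.sub(pattern, "[EMAIL]", doc["content"])}


def enrich(doc, skills):
    """Run each skill in order and merge its output fields into the document."""
    enriched = dict(doc)
    for skill in skills:
        enriched.update(skill(enriched))
    return enriched


doc = {"id": "1", "content": "The report is ready, mail jan@example.com and the team."}
result = enrich(doc, [detect_language, mask_emails])
print(result["language"])        # en
print(result["masked_content"])  # The report is ready, mail [EMAIL] and the team.
```

Each new field (`language`, `masked_content`) can then be stored as a searchable or filterable field in the index.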

Using the built-in AI skills enables, among others, the following scenarios:

  • Scanned documents (JPEG) that you want to make full-text searchable. With an optical character recognition (OCR) skill we can identify, extract, and ingest text from JPEG files.
  • PDFs with combined image and text. Text in PDFs can be extracted during indexing without the use of enrichment steps, but adding image and natural language processing can often produce a better outcome than standard indexing provides.
  • Multi-lingual content against which you want to apply language detection and possibly text translation.
  • Content that you want to restructure through text split, merge, and shape operations.
  • Unstructured or semi-structured documents containing content that has inherent meaning or context that is hidden in the larger document.
  • Analysis of multimedia files (such as audio, video, and images).
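As an example of the text split operation mentioned in the list above, the sketch below breaks a long text into chunks on sentence boundaries so that each chunk can be enriched and indexed separately. The function name and the `max_chars` parameter are assumptions for this illustration; built-in split skills typically offer more options.

```python
import re


def split_text(text, max_chars=80):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_text("One two. Three four. Five six.", max_chars=20))
# ['One two. Three four.', 'Five six.']
```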

Blobs in particular often contain a large body of content that is packed into a single “field”. By attaching image and natural language processing skills to an indexer, you can surface information that is present in the raw content but not otherwise exposed as distinct fields. Some of the built-in cognitive skills that we use here are key phrase extraction, sentiment analysis, and entity recognition (people, organizations, and locations).

 

Tafuta

Tafuta is a Swahili verb that means “to search” or “to seek”. It’s our solution for your organization’s knowledge retrieval needs. Organizations like Flanders Investment & Trade, VITO, VLAIO & VIB have already implemented the solution and have experienced the benefits of search 2.0. Tafuta also won the Smart Investigation challenge in the 2020 Smart Policing Hackathon.

VITO, for example, uses Tafuta to offer its technological researchers a tool to retrieve relevant research papers and documents more accurately. The data sources include a large number of .pdf, .docx, .pptx and .xlsx files that are stored in an on-premises environment, as well as online resources like Google Scholar. Based on trends in research, the document database is also automatically populated with documents from external data sources. We’ve built a custom entity detection skill that allows for searches based on economic, chemical and process parameters. Search results are shown in a table, with the most relevant results first, and in a graph representation that shows how different documents are connected and related to specific entities (people, organizations, …) or topics.

Get in touch if you’d like to know what Cognitive Search can mean for your organization!