Extracting Text from Machine Drawings and Nested Tables – A teX.ai Success Story

The world is going digital, and society is becoming data-driven. Almost everything around us is now potential data that could be worth billions of dollars, for better or worse.

Businesses spend considerable effort on data analytics and on extracting data from various sources to streamline processes such as demand and supply, marketing, expansion, and technology evolution. Let’s focus on text analytics solutions and cover the basics.

Text analytics combines statistical and machine learning techniques to structure information into visual or otherwise understandable data sets. It can be leveraged for business intelligence, research, and other purposes.

According to GM Insights, the text analytics market had already surpassed USD 4 billion by 2018 and is expected to grow at a CAGR of 18% from 2019 to 2026.

The volume of data seems endless. Manually extracting or mining the data that matters for business analytics from this much information is expensive, error-prone, and often impractical. This is where text mining comes in.

Text mining, or text extraction, is the technique of deriving the significant information required for analytics from sources such as websites, emails, PDFs, and images. It relies on natural language processing (NLP) applications to extract the essential information.

In this blog post, let’s explore how our product teX.ai helped an oil & gas enterprise grow its business with AI-powered text analytics techniques.

A glimpse of the oil & gas sector!

The oil & gas sector is one of the largest industries, worth billions of dollars and employing thousands of people at every industrial plant. It is a major contributor to national GDP and experienced a significant boom in the 20th century.

The modern world, with its manufacturing units, chemical plants, pharmaceuticals, fertilizers, pesticides, dyes, plastics, and much more, demands a huge supply of petroleum products.

Even though the automotive sector is exploring the feasibility of alternative fuels like water or electricity, the manufacturing sectors still depend on oil & gas fuels.

All about our client and their purpose to join hands with us!

Our client is a pioneer in the oil & gas sector with a worldwide franchise. Their mission is to supply quality fuels to customers in sectors such as agriculture, transportation, pharmaceuticals, chemicals, and space tech. To run the business, they work with engineering disciplines, metallurgy, geophysics, IT support, sales, and other stakeholders.

Every phase of their business involves numerous documents containing complex machine part diagrams, process flow charts, graphs, nested tables, chemical compositions, and many other data formats. They found it hard to extract data from a huge number of PDFs in 5 different formats, each ranging from 2 to 100 pages.

Our client partnered with us to improve their text extraction process with teX.ai, and we delivered an accuracy of over 80% at 4x the speed of their existing process.

Now, let’s get into the types of documents our client had and the solutions offered by teX.ai.

How teX.ai handled the oil & gas sector’s various document formats

Our client enterprise deals with three types of documents in their processes in different formats. They include:

  • Quality Validation
  • Survey
  • Well Schematics

Let’s get into the details of the formats and the teX.ai solutions.

Quality Validation:

The oil and gas enterprise deals with chemical composition PDFs in 3 different formats, each around 10 pages long. They generally relied on manual effort to analyze and extract this data.

Solution: teX.ai identified the chemical composition details in the quality analysis tables using Optical Character Recognition (OCR) and converted them into simple key-value pairs.

We treated the chemical symbol as the key and the composition as the value. The output was 85% accurate, and text mining took only a few seconds even for documents with hundreds of pages.
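
The key-value step can be illustrated with a minimal sketch. It assumes the OCR engine has already returned the table as plain text with one "symbol value" pair per line. That line layout, the function name parse_composition, and the sample data are hypothetical and are not taken from the client’s documents or from teX.ai itself.

```python
import re

# Matches lines such as "C 0.18" or "Mn: 1.25 %".
# The line layout is an assumption made for this illustration.
ROW_PATTERN = re.compile(r"^\s*([A-Z][a-z]?)\s*[:=]?\s*(\d+(?:\.\d+)?)\s*%?\s*$")


def parse_composition(ocr_text):
    """Map chemical symbols (keys) to composition values found in OCR text."""
    composition = {}
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            symbol, value = match.groups()
            composition[symbol] = float(value)
    return composition


# Hypothetical OCR output for one quality analysis table.
sample = """Chemical Composition (%)
C 0.18
Mn: 1.25 %
Si 0.30
Remarks: accepted
"""

print(parse_composition(sample))
```

Lines that do not look like a symbol followed by a number, such as the table title or the remarks row, are skipped, so only the composition rows end up in the dictionary.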

Survey:

Our client works with a large number of public survey tables that need to be extracted from PDF documents for data analytics.

Solution: Our team used teX.ai to isolate the survey tables by running a keyword search over the OCR output. We then mined the survey data sets with Tabula or Camelot into presentable table formats.
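
A rough sketch of this two-step idea follows. The function names, the sample page texts, and the keyword are hypothetical; the page filter is plain Python, and the Camelot call is shown only as one possible way to pull tables from the selected pages, assuming Camelot is installed.

```python
def find_pages_with_keywords(page_texts, keywords):
    """Return 1-based page numbers whose OCR text mentions any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if any(keyword in text.lower() for keyword in lowered)
    ]


def extract_tables(pdf_path, page_numbers):
    """Extract tables from the selected pages as pandas DataFrames."""
    # Third-party dependency, imported lazily so the page filter works without it.
    import camelot

    pages = ",".join(str(number) for number in page_numbers)
    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [table.df for table in tables]


# Hypothetical OCR text, one string per PDF page.
page_texts = [
    "Cover page",
    "Table 1: Survey results by region",
    "Appendix",
    "Public SURVEY responses",
]

selected = find_pages_with_keywords(page_texts, ["survey"])
print(selected)
```

Filtering pages before table extraction keeps the more expensive table parser away from pages that contain no survey data.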

Well Schematics:

The oil & gas business also deals with documents that combine nested tables with diagrams of drilling equipment. They needed to analyze this data for business intelligence.

Solution: We mined the nested tables in the client’s documents in two phases. Our team used an FCN (Fully Convolutional Network) model in phase 1 and OpenCV in phase 2 to detect the rows in the tables, and generated the output data in CSV format.
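
To give a feel for row detection, here is a simplified pure-Python stand-in for the second phase. It is not the OpenCV pipeline used in the project. It assumes a binarized mask of a single table region is already available (1 for ink, 0 for background), treats pixel rows that are almost fully inked as horizontal rules, and reports the bands between rules as table rows in CSV form. The function names, the coverage threshold, and the toy mask are all hypothetical.

```python
import csv
import io


def find_rule_lines(mask, min_coverage=0.9):
    """Return y-indices of pixel rows that are mostly ink (horizontal rules)."""
    width = len(mask[0])
    return [y for y, row in enumerate(mask) if sum(row) >= min_coverage * width]


def rows_between_rules(rule_ys):
    """Turn rule positions into (top, bottom) bands between consecutive rules."""
    bands = []
    for upper, lower in zip(rule_ys, rule_ys[1:]):
        if lower - upper > 1:  # adjacent y-values belong to the same thick rule
            bands.append((upper + 1, lower - 1))
    return bands


def bands_to_csv(bands):
    """Serialize the detected row bands as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row", "top", "bottom"])
    for index, (top, bottom) in enumerate(bands, start=1):
        writer.writerow([index, top, bottom])
    return buffer.getvalue()


# Toy 8x6 mask: rules at y=0, y=3..4 (a thick rule), and y=7.
mask = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]

bands = rows_between_rules(find_rule_lines(mask))
print(bands_to_csv(bands))
```

In a real pipeline the mask would come from image thresholding and the row bands would be used to crop and OCR each row; this sketch only shows the row-finding logic and the CSV output step.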

How we deployed our text extraction solutions in client applications

Once the AI models automated text extraction at the required accuracy and performance levels, we moved on to deploying the application. Our team developed an admin module using Flask and used Docker containers to deploy teX.ai in the client applications.

Tech stack in teX.ai

Purpose | Technology
Optical Character Recognition (OCR) | Tesseract, Tesserocr, OCRmyPDF, PyTesseract
Preprocessing & postprocessing tools | xPDF, Poppler, OpenCV, Pandas, JSON
Table detection & extraction | Camelot, OpenCV, LSD (line segment detection), csv, TensorFlow, FCN (Fully Convolutional Networks), CNN (Convolutional Neural Networks)
Deployment | Flask & Docker

teX.ai’s business impact on our client’s business model

Our client was delighted with teX.ai’s performance: it extracted text from a large volume of documents with high accuracy and quick processing.

They were pleased that the automated text extraction model reduced human intervention by 75%. Let’s wind up with the highlights of the teX.ai extraction process for our client.

  • 4x faster text extraction from different formats
  • Reduction of human intervention by 75%
  • Accuracy of around 85%

Looking to extract actionable insights from text data? teX.ai is your all-in-one solution for all your text analytics needs. Talk to us now.

Author: Pradeep Parthiban
Pradeep is a Content Writer and Digital Marketing Specialist at Indium Software with a demonstrated history of working in the information technology and services industry.
