TableNet — Table Detection and Tabular Data Extraction from Images

Raisul Hazari
6 min read · Jan 27, 2021

Deep learning models are growing in popularity day by day thanks to their ability to understand and extract abstract, high-level information from raw data. They can also improve performance over traditional models.

In this blog we will discuss TableNet, an end-to-end deep learning model for detecting and extracting tabular data from images.

Table of Contents:

1. Description
2. Dataset Source
3. Problem Statement
4. Mapping to ML/DL Problem
5. Dataset Preparation
6. Model Development
7. Data Extraction
8. Deployment
9. Future works
10. Profile
11. References

1. Description:

The use of mobile phones and other digital devices is increasing rapidly. With the advent of new technologies we are moving towards the digital era, and instead of handing over physical documents we increasingly prefer to share scanned copies of our personal documents.

For example, when taking admission to an institution, what if, instead of entering all the required details manually, we could simply provide scanned copies of the mark sheets and certificates? Or, when a salaried person applies for a credit card or loan, what if a scanned copy of the payslip were enough? In either case the system should be able to extract the required information from the images, which is mostly laid out in tabular format, and process it further. This also helps when we want to store the information in a dataset for future use: since the data is captured directly from the images, the percentage of outliers will be very low. Hence, a self-sufficient system that can extract tabular information from an image can significantly reduce the amount of manual intervention.

2. Dataset Source:

The model has been trained on the publicly available Marmot dataset at
https://drive.google.com/drive/folders/1QZiv5RKe3xlOBdTzuTVuYRxixemVIODp

3. Problem Statement:

i) Detect whether a table is present in the given image. The table need not have visible borders; any structure with a row-and-column-like layout counts.
ii) If a table is present, extract the tabular information.

4. Mapping to ML/DL Problem:

We can treat this as a pixel-wise classification (semantic segmentation) problem: for every pixel, the model predicts whether it belongs to a table or column region, so the question is whether the model can locate the tables accurately.

4.1. Performance Metrics:

We will use accuracy along with precision and recall as performance metrics.
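Since the model outputs binary masks, these metrics are computed pixel-wise over the predicted masks. Below is a minimal NumPy sketch; the helper name and the thresholded 0/1 inputs are assumptions for illustration, not code from the project:

```python
import numpy as np

def pixel_metrics(y_true, y_pred):
    """Pixel-wise accuracy, precision and recall for a binary mask.
    y_true, y_pred: numpy arrays of 0/1 values with the same shape."""
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)
    tp = np.logical_and(y_true, y_pred).sum()   # mask pixels predicted correctly
    fp = np.logical_and(~y_true, y_pred).sum()  # background predicted as mask
    fn = np.logical_and(y_true, ~y_pred).sum()  # mask pixels the model missed
    accuracy = (y_true == y_pred).mean()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```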

4.2. Train-Test Construction:

We will split the data in an 80:20 ratio to evaluate the model's performance before deploying it to production.
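For example, with scikit-learn's train_test_split (the image_paths and mask_paths lists are assumed to be built during the dataset-preparation step below):

```python
from sklearn.model_selection import train_test_split

# image_paths and mask_paths are assumed to be parallel lists of
# original images and their corresponding mask files
train_imgs, test_imgs, train_masks, test_masks = train_test_split(
    image_paths, mask_paths, test_size=0.2, random_state=42)
```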

4.3. Approaches:

To solve the problem we will proceed step by step:
i) Dataset preparation: from the image annotations (provided in .xml files), extract the table and column pixel locations and rescale them to a 1024×1024 resolution.
ii) Prepare the table and column masks from the extracted data. By the end of this step we will have the original image and the corresponding table and column masks.
iii) Model development: the model will predict the table and column masks from the input image.
iv) Data extraction: once we have the predicted table and column masks, we can crop the masked part out of the original image and then extract the information using Tesseract-OCR.
v) Finally, deployment, so that the model can be used as a web service.

Next, we will go through each step in more detail.

5. Dataset Preparation:

We will train the model on the publicly available Marmot dataset; the original images are in .bmp format and the table annotations are in .xml format.

Here is a sample:

original-image

The table annotation appears in the .xml file as below:

table annotation

Clearly, we need to extract the useful information, namely the xmin, ymin, xmax, and ymax pixel locations of the tables and columns, and save it in a dataframe. We will use the xml.etree.ElementTree module for this purpose.
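A sketch of that extraction step is shown below. The tag names assume a Pascal-VOC-style layout and may need adjusting for your annotation files; the rescaling to 1024×1024 matches step (i) of our approach:

```python
import xml.etree.ElementTree as ET
import pandas as pd

def parse_annotation(xml_path, target_size=1024):
    """Extract bounding boxes from an annotation file and rescale them
    to the 1024x1024 training resolution."""
    root = ET.parse(xml_path).getroot()
    width = int(root.find('size/width').text)
    height = int(root.find('size/height').text)
    rows = []
    for obj in root.iter('object'):
        name = obj.find('name').text            # e.g. 'table' or 'column'
        box = obj.find('bndbox')
        rows.append({
            'label': name,
            'xmin': int(float(box.find('xmin').text)) * target_size // width,
            'ymin': int(float(box.find('ymin').text)) * target_size // height,
            'xmax': int(float(box.find('xmax').text)) * target_size // width,
            'ymax': int(float(box.find('ymax').text)) * target_size // height,
        })
    return pd.DataFrame(rows)
```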

After extracting the required information, we will create the table and column masks from xmin, ymin, xmax, and ymax and save them as .jpeg files.
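A minimal sketch of the mask-creation step, assuming the bounding boxes have already been rescaled to the 1024×1024 resolution (build_mask is an illustrative helper):

```python
import numpy as np
from PIL import Image

def build_mask(boxes, size=1024, out_path='mask.jpeg'):
    """Draw filled white rectangles for each (xmin, ymin, xmax, ymax)
    box on a black canvas and save the result as a JPEG mask."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for xmin, ymin, xmax, ymax in boxes:
        mask[ymin:ymax, xmin:xmax] = 255
    Image.fromarray(mask).save(out_path)
    return mask
```

Calling it once with the table boxes and once with the column boxes produces the two masks shown below.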

Here is a sample:

table-mask

and

column-mask

From the original images and the table and column masks we can construct the train and test datasets.

6. Model Development:

The model architecture is as follows,

Tablenet-architecture

The model takes a single input image and produces two semantically labelled output images, one for tables and one for columns.

The base network of the model is initialized with pre-trained VGG-19 features. This is followed by two decoder branches: one for segmentation of the table region and another for segmentation of the columns within a table.

The model shares the VGG-19 encoding layers between the table and column detectors, while the decoders for the two tasks are separate. The shared layers are trained repeatedly with gradients received from both detectors, while each decoder is trained independently.
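A simplified Keras sketch of this idea is given below. The exact layer counts and filter sizes differ from the paper; it only illustrates a shared VGG-19 encoder feeding two FCN-style decoder branches with skip connections:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_tablenet(input_shape=(1024, 1024, 3)):
    """Simplified TableNet: shared VGG-19 encoder, two decoder branches."""
    vgg = VGG19(include_top=False, weights='imagenet', input_shape=input_shape)
    pool3 = vgg.get_layer('block3_pool').output   # skip connection 1
    pool4 = vgg.get_layer('block4_pool').output   # skip connection 2
    pool5 = vgg.get_layer('block5_pool').output   # encoder output

    shared = layers.Conv2D(512, 1, activation='relu')(pool5)

    def decoder(feat, name):
        # upsample and fuse with the encoder skip connections
        d = layers.Conv2DTranspose(512, 3, strides=2, padding='same')(feat)
        d = layers.Add()([d, layers.Conv2D(512, 1)(pool4)])
        d = layers.Conv2DTranspose(256, 3, strides=2, padding='same')(d)
        d = layers.Add()([d, layers.Conv2D(256, 1)(pool3)])
        d = layers.Conv2DTranspose(64, 3, strides=4, padding='same')(d)
        return layers.Conv2DTranspose(1, 3, strides=2, padding='same',
                                      activation='sigmoid', name=name)(d)

    table_mask = decoder(shared, 'table_mask')
    column_mask = decoder(shared, 'column_mask')
    return Model(vgg.input, [table_mask, column_mask])
```

Both outputs can be trained jointly with a binary cross-entropy loss on each mask, so the shared encoder receives gradients from both branches while each decoder only sees its own.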

complete TableNet architecture

The predicted outputs are the table and column masks. Here is a sample after a few epochs:

predicted table and column masks

If we check the accuracy, precision, and recall:

metrics

From our observations, the accuracy, precision, and recall for the table mask are slightly higher than those for the column mask.

7. Data Extraction:

Once we have the mask images, our next target is to crop out of the original image only the region covered by the mask.
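A minimal sketch of this cropping step, assuming the predicted mask is a 2-D array of probabilities at the model's 1024×1024 resolution (crop_with_mask is an illustrative helper, not the project's exact code):

```python
import numpy as np
from PIL import Image

def crop_with_mask(image_path, mask, threshold=0.5):
    """Keep only the region of the original image covered by the
    predicted mask, using the mask's bounding box."""
    image = Image.open(image_path).convert('RGB').resize(mask.shape[::-1])
    ys, xs = np.where(mask > threshold)
    if len(xs) == 0:
        return None  # no table detected
    return image.crop((xs.min(), ys.min(), xs.max(), ys.max()))
```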

The sample output image is:

Table part has been cropped from the original Image using the predicted Table-mask

In the final step we will use Tesseract-OCR to extract the text from the cropped image.
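With pytesseract, the Python wrapper for Tesseract-OCR, this is essentially a one-liner (assuming the Tesseract binary is installed on the system):

```python
import pytesseract

# cropped is the PIL image returned by crop_with_mask above
text = pytesseract.image_to_string(cropped)
print(text)
```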

And this is the reference output

after extracting using pytesseract

8. Deployment:

The app is deployed on AWS using the Flask web framework. It is accessible from here; please use a browser that supports HTML5 for best performance.

For deployment we have created app.py, which contains all the Python code for the backend processing, and upload.html to take the input from the user. As an intermediate step the input image has to be converted into a large tensor; processing it requires at least 4 GB of RAM, so the app cannot be deployed on a relatively low-configuration system.
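A stripped-down sketch of what such an app.py could look like; preprocess, model, and extract_text are hypothetical stand-ins for the resize-to-tensor, TableNet inference, and crop-plus-OCR steps described above:

```python
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route('/')
def index():
    # upload.html holds the file-upload form shown to the user
    return render_template('upload.html')

@app.route('/predict', methods=['POST'])
def predict():
    # preprocess(), model and extract_text() are hypothetical helpers
    # standing in for the actual inference pipeline
    file = request.files['image']          # .png/.jpg/.jpeg upload
    image_tensor = preprocess(file)        # 1024x1024 input tensor
    table_mask, column_mask = model.predict(image_tensor)
    text = extract_text(file, table_mask)  # crop + pytesseract
    return render_template('upload.html', result=text)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```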

Here is a glimpse of our deployed web app:

the user is restricted to choosing only .png/.jpg/.jpeg formats

9. Future works:

Even though a single model is sufficient to both detect and extract the tabular information, there are still a few areas of improvement:

i) Information extraction depends directly on Tesseract-OCR. Even though the model can detect a table in a low-resolution image, Tesseract will not work well on one, so the result will suffer if the original image is not high resolution.

ii) If Tesseract does not return the text in a proper format, it is difficult to store it in a .csv file, so further post-processing is needed, as storing the tabular data in a .csv file is always preferable.

iii) Due to the small training set (only 400 images) and the comparatively small number of epochs (limited GPU and free sessions), precision and recall are quite low. We can retrain the model with more training images and proper annotations to improve them.

10. Profile:

If you're interested in a complete code walkthrough, please visit here.

Connect with me on LinkedIn.

11. References:

[1] Research paper with the detailed model architecture:
https://www.researchgate.net/publication/337242893_TableNet_Deep_Learning_Model_for_End-to-end_Table_Detection_and_Tabular_Data_Extraction_from_Scanned_Document_Images

[2] For the end to end case study
https://www.appliedaicourse.com/

[3] Image segmentation in TensorFlow:
https://www.tensorflow.org/tutorials/images/segmentation
