DocExtract - Document Information Extraction System

6th March, 2025

1. Tools and Models Chosen

To build an efficient, cost-effective, and accurate document information extraction and query system, the following tools and models were selected:

1.1 Tesseract OCR

  • Reasoning: Open-source, free, and works offline with good accuracy for printed text.
  • Limitation: Struggles with handwritten text and low-quality images, and requires manual preprocessing.
  • Alternative: PaddleOCR for better performance.
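
The manual preprocessing mentioned above usually means converting a scan to clean black-and-white before OCR. The following is a minimal sketch, assuming Pillow and pytesseract are installed and using Otsu thresholding as one possible binarization step. The function names and the choice of Otsu's method are illustrative assumptions, not the pipeline used in the prototype.

```python
def otsu_threshold(histogram):
    """Return the gray level that best separates dark and light pixels (Otsu's method)."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_sum += level * count
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so no split is possible at this level
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance (up to a constant factor)
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def ocr_image(path):
    """Binarize a scan and pass it to Tesseract (hypothetical helper)."""
    # Imported lazily so otsu_threshold can be used without these packages.
    from PIL import Image, ImageOps
    import pytesseract

    gray = ImageOps.autocontrast(Image.open(path).convert("L"))
    threshold = otsu_threshold(gray.histogram())
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    return pytesseract.image_to_string(binary, lang="eng")
```

A call such as ocr_image("invoice_scan.png") (a hypothetical file name) would return the recognized text as a string. Deskewing and denoising are other common preprocessing steps that this sketch leaves out.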

1.2 NLP Models

A) SpaCy (Rule-Based Matching & NER)

  • Reasoning: Lightweight, fast, CPU-optimized.
  • Limitation: Needs manual rule definitions, struggles with complex queries.
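
To illustrate what "manual rule definitions" implies, here is a small dependency-free sketch of rule-based entity extraction. Plain regular expressions stand in for spaCy's Matcher patterns so the example runs without any model. The labels and the invoice and date formats are assumptions chosen for illustration.

```python
import re

# Hand-written rules: every new format needs a new pattern.
RULES = {
    "INVOICE_ID": re.compile(r"\bINV-\d{4,}\b"),      # assumed ID format
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),     # ISO dates only
}


def extract_entities(text):
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


sample = "Invoice INV-20931 dated 2025-03-06, due 6 March 2025."
entities = extract_entities(sample)
```

The second date, "6 March 2025", is not extracted because no rule covers that format, which is the maintenance cost noted in the limitation above.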

B) Hugging Face DistilBERT

  • Reasoning: Contextual transformer with lower inference cost than BERT.
  • Limitation: Needs fine-tuning for domain-specific use.

C) Haystack + Elasticsearch/FAISS

  • Reasoning: Enables semantic search and scalable document QA.
  • Limitation: Adds infrastructure complexity.

1.2.1 Combined Alternatives

📌 Flow A:

  1. Retrieve relevant documents → FAISS/Elasticsearch
  2. Embed documents & query → SentenceTransformers (v2)
  3. Generate answer → T5 or GPT with retrieved context
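
The steps of Flow A can be sketched without any external services. In the sketch below a bag-of-words vector is a stand-in for a SentenceTransformers embedding, a sorted list is a stand-in for a FAISS or Elasticsearch index, and the final step only assembles the prompt that a generative model such as T5 would receive. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble the context-augmented prompt for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {query}\nAnswer:"


DOCS = [
    "Invoice INV-1001 was issued on 2025-01-15 for 450 EUR.",
    "The delivery note lists three pallets of paper.",
    "Invoice INV-1002 was issued on 2025-02-03 for 120 EUR.",
]
question = "When was invoice INV-1002 issued?"
top = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top)
```

In a real deployment the document vectors would be computed once and stored in the index rather than re-embedded on every query as this sketch does.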

📌 Flow B:

  1. Run NER → Extract entities (e.g., Invoice ID, Date)
  2. Index entities → FAISS/Elasticsearch
  3. Query entities → Retrieve documents using dense/keyword retrieval
  4. Generate response → Use a small generative model
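
A rough sketch of Flow B follows, focusing on the entity index. Regular expressions stand in for the NER step, an in-memory dictionary stands in for FAISS/Elasticsearch, and a string template stands in for the small generative model. The entity formats, document IDs, and function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed entity formats; a trained NER model would replace these patterns.
ENTITY_PATTERNS = {
    "INVOICE_ID": r"\bINV-\d{4,}\b",
    "DATE": r"\b\d{4}-\d{2}-\d{2}\b",
}


def run_ner(text):
    """Step 1: extract (label, value) entities from one document."""
    return {
        (label, value)
        for label, pattern in ENTITY_PATTERNS.items()
        for value in re.findall(pattern, text)
    }


def build_index(docs):
    """Step 2: map each entity to the IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in run_ner(text):
            index[entity].add(doc_id)
    return index


def query_entities(index, label, value):
    """Step 3: exact-match (keyword-style) lookup of an entity."""
    return sorted(index.get((label, value), set()))


def respond(label, value, doc_ids):
    """Step 4: template response (stand-in for a generative model)."""
    if not doc_ids:
        return f"No documents mention {label} {value}."
    return f"{label} {value} appears in: {', '.join(doc_ids)}."


DOCS = {
    "doc1": "Invoice INV-1001 issued 2025-01-15.",
    "doc2": "Invoice INV-1002 issued 2025-01-15.",
    "doc3": "Reminder for INV-1001 sent 2025-02-01.",
}
index = build_index(DOCS)
hits = query_entities(index, "INVOICE_ID", "INV-1001")
answer = respond("INVOICE_ID", "INV-1001", hits)
```

Because the index stores only extracted entities, any entity the NER step misses cannot be retrieved later, so extraction quality bounds the quality of the whole flow.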

Limitations

  • OCR accuracy degrades on low-resolution images and handwriting
  • Transformers can be computationally expensive
  • Rule-based NER requires manual updates
  • Cloud OCR (e.g., Google Vision) raises privacy concerns because documents leave the local environment
  • Infrastructure complexity for Haystack + Elasticsearch

Future Recommendations

  • Adopt PaddleOCR and improve image preprocessing
  • Fine-tune DistilBERT or build a custom RAG pipeline
  • Use containerized deployment with lightweight models
  • Ensure compliance via RBAC, encryption, and on-prem solutions

🧪 Implementation Link

Some prototype demos on different models using raw data: Click here