DocExtract - Document Information Extraction System

6th March, 2025

1. Tools and Models Chosen

To build an efficient, cost-effective, and accurate document information extraction and query system, the following tools and models were selected:

1.1 Tesseract OCR

Reasoning: Open-source, free, and works offline with good accuracy for printed text.
Limitation: Struggles with handwritten text and low-quality images, and requires manual preprocessing.
Alternative: PaddleOCR, which may give better accuracy on noisy or low-quality inputs.
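
The manual preprocessing mentioned above often starts with binarization. The following is a minimal sketch, not the project's actual pipeline: it picks a threshold with Otsu's method in plain Python and, if Pillow and pytesseract are installed, passes the binarized image to Tesseract. The function names otsu_threshold and ocr_file are hypothetical, as is the choice of Otsu's method.

```python
def otsu_threshold(hist):
    """Pick the grey level (0-255) that maximises between-class variance.

    `hist` is a 256-entry list of pixel counts per grey level.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level, count in enumerate(hist):
        weight_dark += count
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * count
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def ocr_file(path, lang="eng"):
    """Binarize an image and run Tesseract on it (assumes Pillow + pytesseract)."""
    # Imported lazily so the thresholding code above works without these packages.
    from PIL import Image, ImageOps
    import pytesseract

    gray = ImageOps.grayscale(Image.open(path))
    level = otsu_threshold(gray.histogram())
    binary = gray.point(lambda v: 255 if v > level else 0)
    return pytesseract.image_to_string(binary, lang=lang)


# Synthetic histogram: dark ink around level 30, light paper around level 220.
demo_hist = [0] * 256
demo_hist[30] = 50
demo_hist[220] = 50
demo_level = otsu_threshold(demo_hist)
```

Pixels above the returned level are treated as paper and the rest as ink. Real scans usually need further steps, such as deskewing and denoising, that this sketch leaves out.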

1.2 NLP Models

A) spaCy (Rule-Based Matching & NER)

Reasoning: Lightweight, fast, CPU-optimized.
Limitation: Needs manually defined rules and struggles with complex queries.

B) Hugging Face DistilBERT

Reasoning: Contextual transformer with lower inference cost than BERT.
Limitation: Needs fine-tuning for domain-specific use.
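
As a rough illustration of how an off-the-shelf DistilBERT model could answer questions over OCR output before any fine-tuning, the sketch below runs extractive QA page by page and keeps the most confident answer. It assumes the Hugging Face transformers package and the public distilbert-base-cased-distilled-squad checkpoint. The helper names and the 0.3 score cut-off are hypothetical and would need tuning on real documents.

```python
def best_answer(candidates, min_score=0.3):
    """Return the highest-scoring candidate, or None if none clears min_score."""
    viable = [c for c in candidates if c["score"] >= min_score]
    return max(viable, key=lambda c: c["score"]) if viable else None


def answer_from_pages(question, pages, min_score=0.3):
    """Ask the same question of every page and keep the most confident answer."""
    # Imported lazily: the model download is large and not needed for best_answer.
    from transformers import pipeline

    qa = pipeline(
        "question-answering",
        model="distilbert-base-cased-distilled-squad",
    )
    candidates = []
    for page_no, text in enumerate(pages, start=1):
        result = qa(question=question, context=text)
        candidates.append(
            {"page": page_no, "answer": result["answer"], "score": result["score"]}
        )
    return best_answer(candidates, min_score)


# Selection logic on made-up scores (no model required).
demo_candidates = [
    {"page": 1, "answer": "Acme", "score": 0.12},
    {"page": 2, "answer": "INV-1001", "score": 0.81},
]
demo_choice = best_answer(demo_candidates)
```

Returning None when every score is low gives the caller a way to fall back to rule-based extraction instead of reporting a weak guess.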

C) Haystack + Elasticsearch/FAISS

Reasoning: Enables semantic search and scalable document QA.
Limitation: Adds infrastructure complexity.

1.2.1 Combined Alternatives

📌 Flow A:

Embed documents & query → SentenceTransformers (v2)
Retrieve relevant documents → FAISS/Elasticsearch
Generate answer → T5 or GPT with retrieved context
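
The embed, retrieve, and generate steps of Flow A can be sketched end to end with the standard library alone. Everything here is a stand-in and loudly so: embed uses word counts in place of a SentenceTransformers model, VectorIndex does a brute-force cosine search in place of FAISS or Elasticsearch, and build_prompt only assembles the text that would be sent to T5 or GPT. All names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorIndex:
    """Brute-force nearest-neighbour index (stand-in for FAISS/Elasticsearch)."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, text):
        self._items.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self._items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


def build_prompt(query, hits):
    """Assemble the retrieved context into a prompt for a generative model."""
    context = "\n".join(f"[{doc_id}] {text}" for _, doc_id, text in hits)
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}"


index = VectorIndex()
index.add("doc1", "Invoice INV-1001 issued to Acme Corp, total due 450 USD")
index.add("doc2", "Purchase order for office chairs, delivery in April")
index.add("doc3", "Shipping notice: parcel dispatched from the Berlin warehouse")

question = "What is the total due on the Acme invoice?"
hits = index.search(question, k=1)
prompt = build_prompt(question, hits)
```

In a real deployment the three stand-ins would be replaced one at a time, which keeps the retrieval and generation stages independently testable.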

📌 Flow B:

Run NER → Extract entities (e.g., Invoice ID, Date)
Index entities → FAISS/Elasticsearch
Query entities → Retrieve documents using dense/keyword retrieval
Generate response → Use a small generative model
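
The first three steps of Flow B (extract entities, index them, query by entity) can be illustrated as follows. This is a sketch under stated assumptions: regular expressions stand in for the NER step, a dictionary of postings stands in for FAISS/Elasticsearch, and the invoice ID and date formats are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical entity formats; real documents will need their own patterns.
ENTITY_PATTERNS = {
    "INVOICE_ID": re.compile(r"\bINV-\d{4,}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_entities(text):
    """Return (label, value) pairs found in the text, grouped by label."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


class EntityIndex:
    """Maps each (label, value) entity to the documents that mention it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for entity in extract_entities(text):
            self._postings[entity].add(doc_id)

    def lookup(self, label, value):
        return sorted(self._postings.get((label, value), set()))


docs = {
    "a.txt": "Invoice INV-1001 dated 2025-03-06 for Acme Corp",
    "b.txt": "Reminder: INV-1001 is overdue; see also INV-2040 dated 2025-02-11",
}
entity_index = EntityIndex()
for doc_id, text in docs.items():
    entity_index.add(doc_id, text)

matches = entity_index.lookup("INVOICE_ID", "INV-1001")
```

The documents returned by lookup would then be handed to the small generative model as context for the final response.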

Limitations

OCR accuracy drops on low-resolution images and handwriting
Transformers can be computationally expensive
Rule-based NER requires manual updates
Cloud OCR (e.g., Google Vision) raises privacy concerns
Infrastructure complexity for Haystack + Elasticsearch

Future Recommendations

Adopt PaddleOCR and improve image preprocessing
Fine-tune DistilBERT or build a custom RAG pipeline
Use containerized deployment with lightweight models
Ensure compliance via RBAC, encryption, and on-prem solutions

🧪 Implementation Link

Some prototype demos on different models using raw data: Click here