1. Tools and Models Chosen
To build an efficient, cost-effective, and accurate document information extraction and query system, the following tools and models were selected:
1.1 Tesseract OCR
- Reasoning: Open-source, free, and works offline with good accuracy for printed text.
- Limitation: Struggles with handwritten text and low-quality images, and requires manual image preprocessing.
- Alternative: PaddleOCR, which generally performs better on noisy scans and multilingual text.
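Whichever OCR engine is used, the raw output usually needs light cleanup before downstream NLP; a minimal stdlib sketch (the `clean_ocr_text` helper is illustrative, not part of any OCR library):

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize whitespace in raw OCR output (illustrative helper)."""
    lines = []
    for line in raw.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse runs of spaces/tabs
        if line:                                     # drop blank lines
            lines.append(line)
    return "\n".join(lines)

print(clean_ocr_text("Invoice   No:  INV-001 \n\n  Date:\t2024-01-05 "))
# → Invoice No: INV-001
#   Date: 2024-01-05
```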
1.2 NLP Models
A) SpaCy (Rule-Based Matching & NER)
- Reasoning: Lightweight, fast, CPU-optimized.
- Limitation: Needs manually defined rules and struggles with complex queries.
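The rule-based matching idea can be approximated without spaCy; a minimal regex sketch extracting an invoice ID and a date (the patterns are hypothetical and must be maintained by hand, which is exactly the limitation noted above):

```python
import re

# Hypothetical patterns; real documents would need many more rules.
PATTERNS = {
    "invoice_id": re.compile(r"\bINV-\d{3,}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_entities(text: str) -> dict:
    """Return the first match per entity type, or None if absent."""
    out = {}
    for name, pat in PATTERNS.items():
        m = pat.search(text)
        out[name] = m.group(0) if m else None
    return out

print(extract_entities("Invoice INV-00123 issued on 2024-01-05"))
# → {'invoice_id': 'INV-00123', 'date': '2024-01-05'}
```

spaCy's `Matcher` plays the same role over token attributes instead of raw characters, which makes rules more robust but no less manual.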
B) Hugging Face DistilBERT
- Reasoning: Contextual transformer with lower inference cost than BERT.
- Limitation: Needs fine-tuning for domain-specific use.
C) Haystack + Elasticsearch/FAISS
- Reasoning: Enables semantic search and scalable document QA.
- Limitation: Adds infrastructure complexity.
1.2.1 Combined Alternatives
Flow A:
- Retrieve relevant documents → FAISS/Elasticsearch
- Embed documents & query → SentenceTransformers (v2)
- Generate answer → T5 or GPT with retrieved context
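The embed-and-rank step of Flow A can be illustrated with a toy bag-of-words "embedding" and cosine similarity standing in for SentenceTransformers and FAISS (a sketch of the ranking logic only, not a substitute for real dense retrieval):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "invoice INV-001 total amount due 2024-01-05",
    "shipping manifest for container 42",
]
query = "amount due on invoice"
best = max(docs, key=lambda d: cosine(embed(query), embed(d)))
print(best)  # the invoice document ranks highest
```

A real pipeline would replace `embed` with a SentenceTransformers model and the `max` scan with an approximate-nearest-neighbor lookup in FAISS.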
Flow B:
- Run NER → Extract entities (e.g., Invoice ID, Date)
- Index entities → FAISS/Elasticsearch
- Query entities → Retrieve documents using dense/keyword retrieval
- Generate response → Use a small generative model
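The steps of Flow B can be sketched end to end with stdlib stand-ins: a regex for the NER step, a plain dict for the FAISS/Elasticsearch index, and a response template in place of the small generative model (all names and data here are illustrative):

```python
import re

DOCS = {
    "doc1": "Invoice INV-00123 dated 2024-01-05 for ACME Corp",
    "doc2": "Invoice INV-00456 dated 2024-02-10 for Globex",
}

# Step 1: run 'NER' (a regex stand-in) and index entities per document.
index = {}
for doc_id, text in DOCS.items():
    for ent in re.findall(r"\bINV-\d+\b", text):
        index.setdefault(ent, []).append(doc_id)

# Steps 2-3: query an entity and retrieve its documents.
def lookup(entity: str) -> list:
    return index.get(entity, [])

# Step 4: 'generate' a response (template stand-in for a generative model).
def answer(entity: str) -> str:
    docs = lookup(entity)
    if not docs:
        return f"No document mentions {entity}."
    return f"{entity} appears in: {', '.join(DOCS[d] for d in docs)}"

print(answer("INV-00123"))
# → INV-00123 appears in: Invoice INV-00123 dated 2024-01-05 for ACME Corp
```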
Limitations
- OCR accuracy degrades on low-resolution images and handwriting
- Transformers can be computationally expensive
- Rule-based NER requires manual updates
- Cloud OCR (e.g., Google Vision) has privacy concerns
- Infrastructure complexity for Haystack + Elasticsearch
Future Recommendations
- Adopt PaddleOCR and improve image preprocessing
- Fine-tune DistilBERT or build a custom RAG pipeline
- Use containerized deployment with lightweight models
- Ensure compliance via RBAC, encryption, and on-prem solutions
🧪 Implementation Link
Prototype demos of the different models on raw data: Click here