Overview
Many real-world workflows involve bundled documents: a single PDF containing an invoice, a packing slip, and a certificate of origin, or a loan application with multiple forms. This cookbook shows how to build a pipeline that automatically splits the bundle, classifies each section, and extracts data from it.
What You’ll Build
An automated pipeline that:
- Splits a bundled PDF into individual documents
- Classifies each section by type
- Extracts structured data from each section using type-specific schemas
Step 1: Split the Document Bundle
First, define the expected document types and split the bundle.
Step 2: Define Type-Specific Schemas
Create an extraction schema for each document type.
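The first two steps can be sketched together as plain data: a split request that lists the expected document types, and one schema per type. This is a minimal sketch, not the documented API. `split_description` is the field named under Tips, but its shape here (a list of name/description pairs), the `input` key, the `build_split_request` helper, and every schema field are assumptions made for illustration.

```python
import json

# Document types expected in the bundle. The descriptions are illustrative;
# specific, non-overlapping wording should make classification easier.
DOCUMENT_TYPES = {
    "Invoice": "Bill from a seller listing line items, prices, and a total due",
    "Packing Slip": "List of shipped items with quantities but no prices",
    "Certificate of Origin": "Declaration of the country where the goods were produced",
    "Supporting Document": "Any section that matches none of the types above",
}


def build_split_request(document_ref: str) -> dict:
    """Build a split request body. The field layout is assumed, not documented."""
    return {
        "input": document_ref,  # hypothetical key for the uploaded bundle
        "split_description": [
            {"name": name, "description": description}
            for name, description in DOCUMENT_TYPES.items()
        ],
    }


# One JSON Schema per document type that should be extracted.
# The field names are hypothetical examples.
SCHEMAS = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["invoice_number", "total_amount"],
    },
    "Packing Slip": {
        "type": "object",
        "properties": {
            "shipment_id": {"type": "string"},
            "item_count": {"type": "integer"},
        },
        "required": ["shipment_id"],
    },
    "Certificate of Origin": {
        "type": "object",
        "properties": {
            "country_of_origin": {"type": "string"},
            "exporter": {"type": "string"},
        },
        "required": ["country_of_origin"],
    },
}

if __name__ == "__main__":
    print(json.dumps(build_split_request("bundle.pdf"), indent=2))
```

Keeping the type names in one dictionary means the split categories and the schema keys cannot drift apart: a section classified as "Invoice" looks up `SCHEMAS["Invoice"]` directly.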
Step 3: Extract Data from Each Section
Use async endpoints to process the sections in parallel.
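The submit-then-poll pattern described here might look like the sketch below. It runs offline against an in-memory stand-in, so nothing in it is the real client: `InMemoryClient`, `submit_extract`, `get_job`, the status strings, and the shape of a section (`{"type": ...}`) are all assumptions. In practice the two client methods would wrap the service's async extract and job-status endpoints.

```python
import asyncio


class InMemoryClient:
    """Stand-in for a real HTTP client so the sketch runs without a network."""

    def __init__(self, polls_until_done: int = 2):
        self._jobs = {}
        self._polls_until_done = polls_until_done

    async def submit_extract(self, section: dict, schema: dict) -> str:
        # A real client would POST the section and schema, then return a job ID.
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"polls": 0, "type": section["type"]}
        return job_id

    async def get_job(self, job_id: str) -> dict:
        # Report "processing" until the job has been polled enough times.
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] < self._polls_until_done:
            return {"status": "processing"}
        return {"status": "completed", "result": {"document_type": job["type"]}}


async def wait_for_job(client, job_id: str, interval: float = 0.01, max_polls: int = 100):
    """Poll one job until it completes, fails, or exhausts max_polls."""
    for _ in range(max_polls):
        job = await client.get_job(job_id)
        if job["status"] == "completed":
            return job["result"]
        if job["status"] == "failed":
            raise RuntimeError(f"extraction job {job_id} failed")
        await asyncio.sleep(interval)
    raise TimeoutError(f"extraction job {job_id} did not finish in time")


async def extract_all(client, sections: list, schemas: dict) -> list:
    """Submit one extraction job per section, then poll all jobs concurrently."""
    # Submit everything first so the service can start on all sections.
    job_ids = await asyncio.gather(
        *(client.submit_extract(s, schemas[s["type"]]) for s in sections)
    )
    # asyncio.gather returns results in the same order as the sections.
    return await asyncio.gather(*(wait_for_job(client, j) for j in job_ids))


if __name__ == "__main__":
    demo_sections = [{"type": "Invoice"}, {"type": "Packing Slip"}]
    demo_schemas = {"Invoice": {}, "Packing Slip": {}}
    print(asyncio.run(extract_all(InMemoryClient(), demo_sections, demo_schemas)))
```

The poll loop has both a failure branch and a cap on attempts so that one stuck job cannot block the whole bundle indefinitely.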
Using Studio Pipelines
Instead of coding the workflow manually, you can create a pipeline in Aifano Studio:
- Create a new pipeline with type parse_split_extract
- Configure the Split processor with your document categories
- Configure the Extract processor with your schemas
- Upload documents and run the pipeline with one click
Tips
Use split_description for better classification
Provide clear, distinct descriptions for each document type. The more
specific the description, the more accurate the classification.
Process sections in parallel
Submit all extraction jobs at once using async endpoints, then poll for
results. Because the jobs run concurrently, total time tends toward that of
the slowest job rather than the sum of all jobs.
Handle unknown document types
Include a catch-all category like “Supporting Document” to capture sections
that don’t match your defined types.
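One way to act on the catch-all category after splitting is to separate sections that have a schema from those that do not. The sketch below assumes each classified section is a dict with a `type` key; that shape and the `route_sections` helper are hypothetical.

```python
CATCH_ALL = "Supporting Document"


def route_sections(sections: list, schemas: dict, catch_all: str = CATCH_ALL):
    """Separate sections that can be extracted from those to set aside."""
    to_extract, unmatched = [], []
    for section in sections:
        section_type = section["type"]
        if section_type != catch_all and section_type in schemas:
            to_extract.append(section)
        else:
            # Catch-all sections and unexpected types go to manual review
            # instead of failing the whole bundle.
            unmatched.append(section)
    return to_extract, unmatched
```

Treating an unexpected type the same way as the catch-all keeps the pipeline running if the classifier ever returns a label that has no schema.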
Reuse job references
Use jobid:// references to avoid re-processing. Split once, then extract
from the same job multiple times with different schemas.
Next Steps
- Invoice Processing — Extract financial data from invoices
- Contract Analysis — Analyze legal documents
- Pipelines — Create reusable workflows in Studio