MarkItDown
Python tool for converting files and office documents to Markdown: a lightweight, production-ready utility for LLM pipelines and text analysis.
Overview
MarkItDown is a lightweight Python utility for converting various file formats to Markdown, specifically designed for use with LLMs and text analysis pipelines. It preserves important document structure (headings, lists, tables, links) while being optimized for token efficiency.
Why Markdown?
Markdown is extremely close to plain text with minimal markup, but still represents important document structure. Mainstream LLMs (OpenAI's GPT-4o, Anthropic's Claude, etc.) natively "speak" Markdown and often incorporate it into responses unprompted. Additionally, Markdown conventions are highly token-efficient.
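As a rough illustration of the token-efficiency point, the sketch below renders the same small table as Markdown and as HTML and compares their lengths. Character count is used only as a crude stand-in for token count (an assumption, since real tokenizers differ), and the sample rows are invented.

```python
# Compare the size of one small table rendered as Markdown vs. HTML.
# Character count is only a rough proxy for tokens; the data is invented.
rows = [("Name", "Qty"), ("Widget", "4"), ("Gadget", "10")]


def to_markdown(table):
    """Render rows as a pipe table with a header separator."""
    header, *body = table
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_html(table):
    """Render the same rows as a minimal HTML table."""
    header, *body = table
    head = "<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>"
    rest = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in body
    )
    return f"<table>{head}{rest}</table>"


markdown_text = to_markdown(rows)
html_text = to_html(rows)
print(len(markdown_text), len(html_text))
```

The gap comes from the closing tags and per-cell markup that HTML repeats for every value.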
Supported Formats
MarkItDown converts from:
- Documents: PDF, PowerPoint, Word, Excel (including older .xls files)
- Images: EXIF metadata extraction and OCR
- Audio: EXIF metadata and speech transcription (wav, mp3)
- Web: HTML
- Text: CSV, JSON, XML
- Archives: ZIP files (iterates over contents)
- Media: YouTube URLs (transcription)
- Books: EPubs
- And more!
Installation
Quick Install
# Install with all optional dependencies
pip install 'markitdown[all]'
From Source
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
Optional Dependencies
Install specific dependencies for more control:
# Only PDF, DOCX, and PPTX
pip install 'markitdown[pdf, docx, pptx]'
# Available options:
# [all] - All optional dependencies
# [pptx] - PowerPoint files
# [docx] - Word files
# [xlsx] - Excel files
# [xls] - Older Excel files
# [pdf] - PDF files
# [outlook] - Outlook messages
# [az-doc-intel] - Azure Document Intelligence
# [audio-transcription] - Audio transcription (wav, mp3)
# [youtube-transcription] - YouTube video transcription
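One way to pick extras is to derive them from the file types you actually handle. The helper below is a hypothetical convenience, not part of MarkItDown; the extension-to-extra mapping follows the option list above, and the `.msg` entry for `[outlook]` is an assumption.

```python
from pathlib import Path

# Extension -> extra name, following the option list above.
# The ".msg" mapping for [outlook] is an assumption.
EXTRAS_BY_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
}


def pip_command(paths):
    """Return a pip command covering the extras the given files would need."""
    suffixes = {Path(p).suffix.lower() for p in paths}
    extras = sorted(EXTRAS_BY_SUFFIX[s] for s in suffixes if s in EXTRAS_BY_SUFFIX)
    if not extras:
        return "pip install markitdown"
    return "pip install 'markitdown[" + ",".join(extras) + "]'"


print(pip_command(["q3.PDF", "notes.docx", "readme.txt"]))
```

Files with no matching extra (such as plain text) are ignored, so the command falls back to the core package when nothing optional is needed.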
Core Workflow
Command-Line Usage
# Basic conversion
markitdown path-to-file.pdf > document.md
# Specify output file
markitdown path-to-file.pdf -o document.md
# Pipe content
cat path-to-file.pdf | markitdown
Python API
from markitdown import MarkItDown
# Basic conversion
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)
# With Document Intelligence
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
# With LLM for image descriptions
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="optional custom prompt"
)
result = md.convert("example.jpg")
print(result.text_content)
Key Features
Structure Preservation
MarkItDown preserves important document elements:
- Headings (h1-h6)
- Lists (ordered and unordered)
- Tables
- Links
- Images (as Markdown references)
- Code blocks
- Blockquotes
OCR for Images
Extract text from images using OCR:
- EXIF metadata extraction
- Text recognition
- Table detection
- Layout analysis
Audio Transcription
Convert audio to text:
- EXIF metadata extraction
- Speech transcription (wav, mp3)
- Speaker diarization support
Azure Document Intelligence
Enhanced PDF conversion using Microsoft Document Intelligence:
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
MCP Server Integration
MarkItDown offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See markitdown-mcp for details.
Plugin System
MarkItDown supports third-party plugins:
# List installed plugins
markitdown --list-plugins
# Enable plugins
markitdown --use-plugins path-to-file.pdf
To find available plugins, search GitHub for #markitdown-plugin. To develop a plugin, see packages/markitdown-sample-plugin.
Use Cases
LLM Pipelines
Prepare documents for LLM consumption:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("report.pdf")
# Feed to LLM (`llm` is a placeholder for whatever client object you use)
response = llm.complete(
    prompt=result.text_content,
    instructions="Summarize this document"
)
Content Extraction
Extract structured data from documents:
result = md.convert("invoice.pdf")
# Markdown with preserved structure
# - Table for line items
# - Headings for sections
# - List for totals
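Once line items arrive as a Markdown table, they can be pulled back into Python structures. The following is a minimal sketch of that step, assuming simple pipe tables without escaped `|` characters; `parse_markdown_table` is a hypothetical helper, and the sample text is invented to stand in for `result.text_content`.

```python
def parse_markdown_table(markdown):
    """Return the first pipe table in `markdown` as a list of dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # first non-table line after the table ends it
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


# Invented sample standing in for result.text_content
sample = """# Invoice 1042

| Item | Qty | Price |
|---|---|---|
| Widget | 4 | 2.50 |
| Gadget | 1 | 10.00 |

Total due: 20.00
"""

items = parse_markdown_table(sample)
total = sum(int(item["Qty"]) * float(item["Price"]) for item in items)
print(items[0]["Item"], total)
```

All values come back as strings, so numeric columns need explicit conversion as in the `total` line.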
Document Analysis
Analyze multiple document types uniformly:
documents = []
for file in ["report.pdf", "data.xlsx", "slides.pptx"]:
    result = md.convert(file)
    documents.append(result.text_content)
# Process uniformly
for doc in documents:
    analyze(doc)  # Same analysis for all formats
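The `analyze` call above is left undefined. As one possible stand-in, the sketch below computes a few structural counts that work on any Markdown string regardless of the source format. The chosen metrics are an assumption for illustration only.

```python
import re


def analyze(markdown):
    """Return simple structural counts for a Markdown string."""
    lines = markdown.splitlines()
    return {
        # ATX headings: one to six '#' followed by whitespace
        "headings": sum(1 for line in lines if re.match(r"#{1,6}\s", line)),
        # Pipe-table lines, including the header separator row
        "table_rows": sum(1 for line in lines if line.strip().startswith("|")),
        "words": len(markdown.split()),
    }


# Invented sample standing in for converted output
sample = "# Title\n\nSome text here.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
print(analyze(sample))
```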
Batch Processing
Process multiple files efficiently:
# Convert all PDFs in directory
for file in *.pdf; do
  markitdown "$file" -o "${file%.pdf}.md"
done
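The same loop can be written in Python when you want recursion into subdirectories and a record of failures. This is a sketch under stated assumptions: `convert_tree` is a hypothetical helper, and the converter is passed in as a plain callable so the function does not depend on any particular MarkItDown configuration.

```python
from pathlib import Path


def convert_tree(src_dir, convert, pattern="*.pdf"):
    """Write a .md file next to every file under src_dir matching pattern.

    `convert` is any callable mapping a path string to Markdown text, e.g.
        md = MarkItDown()
        convert = lambda p: md.convert(p).text_content
    Returns a list of (path, error message) pairs for files that failed.
    """
    failures = []
    for src in sorted(Path(src_dir).rglob(pattern)):
        try:
            text = convert(str(src))
        except Exception as exc:  # keep going and report at the end
            failures.append((src, str(exc)))
            continue
        src.with_suffix(".md").write_text(text, encoding="utf-8")
    return failures
```

Collecting failures instead of raising keeps one unreadable file from stopping a long batch.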
RAG Applications
Build retrieval-augmented generation systems:
from markitdown import MarkItDown
from vector_store import VectorStore  # placeholder for your vector store client
md = MarkItDown()
store = VectorStore()
# documents: an iterable of file paths to index
for doc in documents:
    result = md.convert(doc)
    store.add(result.text_content, metadata=doc)
# Query
query = "What are the main findings?"
matches = store.search(query)
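Retrieval usually works better on sections than on whole documents, and the headings preserved in the Markdown output give natural split points. The sketch below shows one way to chunk by heading before adding to a store. `split_by_heading` is a hypothetical helper, and it assumes `#`-style headings; a `#` comment line inside a code block would be mistaken for a heading.

```python
import re


def split_by_heading(markdown, max_level=2):
    """Split Markdown into (heading, body) chunks at headings up to max_level."""
    pattern = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    chunks, title, lines = [], "", []

    def flush():
        # Emit the current chunk unless it has neither a title nor any text
        if title or any(line.strip() for line in lines):
            chunks.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = pattern.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


# Invented sample standing in for result.text_content
sample = (
    "# Report\nIntro text.\n## Findings\nSales rose.\n"
    "### Detail\nBy region.\n## Methods\nSurvey."
)
for heading, body in split_by_heading(sample):
    print(heading, "->", len(body), "chars")
```

Deeper headings (level 3 and below by default) stay inside their parent chunk, which keeps related detail together.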
Advanced Features
Streaming Conversion
Convert file-like objects:
import io
from markitdown import MarkItDown
md = MarkItDown()
# From binary file-like object
with open("file.pdf", "rb") as f:
    result = md.convert_stream(f)
# From BytesIO (file_content holds the raw bytes of a document)
buffer = io.BytesIO(file_content)
result = md.convert_stream(buffer)
Docker Support
# Build image
docker build -t markitdown:latest .
# Run conversion
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Azure Document Intelligence
Enhanced conversion for complex PDFs:
from markitdown import MarkItDown
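md = MarkItDown(
    docintel_endpoint="https://<resource>.cognitiveservices.azure.com/",
    # Keyword as given in this guide; check your installed version,
    # which may expect a credential object rather than a raw key
    docintel_key="<api_key>"
)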
result = md.convert("complex.pdf")
Benefits:
- Better OCR quality
- Improved table extraction
- Enhanced form field recognition
- Handwriting support
LLM-Enhanced Image Descriptions
Use LLMs to generate rich image descriptions:
from openai import OpenAI
from markitdown import MarkItDown
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail for accessibility purposes"
)
result = md.convert("chart.png")
# Result includes LLM-generated description
Comparison with Alternatives
vs textract
| Feature | MarkItDown | textract |
|---|---|---|
| Output format | Markdown (structured) | Plain text |
| Token efficiency | High (Markdown) | Lower (plain text) |
| Structure preservation | Yes | No |
| LLM-optimized | Yes | No |
| Python 3.10+ | Required | Supported |
vs PyPDF2 + pandas
# Old way (multiple libraries)
import PyPDF2
import pandas as pd
# Different APIs for different formats
reader = PyPDF2.PdfReader("file.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
df = pd.read_excel("file.xlsx")
# MarkItDown (unified API)
from markitdown import MarkItDown
md = MarkItDown()
pdf = md.convert("file.pdf")
xlsx = md.convert("file.xlsx")
Best Practices
Performance Optimization
- Install only needed dependencies:
# Instead of [all], install specific formats
pip install 'markitdown[pdf, docx]'
- Use batch processing for many files:
from concurrent.futures import ThreadPoolExecutor
from markitdown import MarkItDown
md = MarkItDown()
def convert_file(file):
    return md.convert(file)
# files: list of paths to convert
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(convert_file, files))
- Cache results when possible:
import hashlib
import os
def cached_convert(file_path):
    # The key is derived from the path, so a file that changes in place
    # will still return the previously cached text
    cache_key = hashlib.md5(file_path.encode()).hexdigest()
    cache_file = f"cache/{cache_key}.md"
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    result = md.convert(file_path)
    os.makedirs("cache", exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(result.text_content)
    return result.text_content
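The caching tip above keys the cache on the file path, which returns stale text when a file is edited in place. A variant keyed on file content avoids that. The sketch below is an illustration, not part of MarkItDown: the function name matches the tip, but the `convert` and `cache_dir` parameters are assumptions, with the converter passed in as a plain callable.

```python
import hashlib
from pathlib import Path


def cached_convert(path, convert, cache_dir="cache"):
    """Convert `path` once per distinct file content.

    `convert` is any callable mapping a path string to Markdown text, e.g.
        md = MarkItDown()
        convert = lambda p: md.convert(p).text_content
    """
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = convert(str(path))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Hashing reads the whole file, which is usually cheap next to conversion. Identical files stored under different names also share one cache entry.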
Quality Tips
- For complex PDFs: Use Azure Document Intelligence
- For images: Provide LLM prompt for better descriptions
- For scanned documents: Enable OCR explicitly
- For tables: Check the output formatting; it may need post-processing
Error Handling
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("file.pdf")
    if result.text_content:
        print("Success:", result.text_content[:100])
    else:
        print("Empty result")
except Exception as e:
    print(f"Conversion failed: {e}")
    # Fallback strategy
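One possible shape for the fallback strategy is an ordered chain of converters, tried until one returns non-empty text. The sketch below is hypothetical: `convert_with_fallback` is not a MarkItDown function, and each converter is any callable taking a path and returning text (for example a plain MarkItDown instance first, then one configured with Document Intelligence).

```python
def convert_with_fallback(path, converters):
    """Try each (name, callable) pair in order.

    Returns (name, text) from the first converter that yields non-empty text.
    Raises RuntimeError listing every failure if none succeeds.
    """
    errors = []
    for name, convert in converters:
        try:
            text = convert(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError(f"all converters failed for {path}: " + "; ".join(errors))
```

Treating an empty result as a failure matters for scanned PDFs, where conversion can succeed technically but return no text.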
Technical Details
Architecture
- Language: Python 3.10+
- Dependencies: Minimal core with optional feature groups
- License: MIT
- Maintainer: Microsoft AutoGen Team
Breaking Changes (0.0.1 to 0.1.0)
- Dependencies organized into optional feature-groups
  - Use pip install 'markitdown[all]' for backward compatibility
- convert_stream() now requires binary file-like objects
  - No longer accepts text file-like objects like io.StringIO
- DocumentConverter class interface changed
  - Now reads from file-like streams, not file paths
  - No temporary files created
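Code that previously passed text streams can adapt by re-encoding them before calling `convert_stream()`. The helper below is a small sketch of that adjustment. `as_binary_stream` is a hypothetical name, and UTF-8 is an assumed default encoding.

```python
import io


def as_binary_stream(stream, encoding="utf-8"):
    """Return a binary file-like object, re-encoding text streams if needed."""
    if isinstance(stream, io.TextIOBase):
        # Covers io.StringIO and files opened in text mode
        return io.BytesIO(stream.read().encode(encoding))
    return stream


text_stream = io.StringIO("<html><body><h1>Hi</h1></body></html>")
binary_stream = as_binary_stream(text_stream)
print(type(binary_stream).__name__)
```

Binary streams pass through untouched, so the helper is safe to apply unconditionally at call sites.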
Streaming Interface
# Before 0.1.0 (string path)
md.convert("file.pdf")
# After 0.1.0 (file-like object preferred)
with open("file.pdf", "rb") as f:
    md.convert_stream(f)
# String path still works for backward compatibility
md.convert("file.pdf") # Still supported
Limitations
- Not optimized for human-presentable conversions
- Designed for LLM consumption, not high-fidelity document conversion
- Some complex layouts may not be preserved perfectly
- OCR quality depends on input image quality
- Audio transcription requires audio-transcription feature
- YouTube transcription requires youtube-transcription feature
- Azure Document Intelligence requires Azure subscription
Testing
Running Tests
# Navigate to package
cd packages/markitdown
# Install hatch
pip install hatch
# Run tests
hatch shell
hatch test
Pre-commit Checks
# Run all pre-commit checks
pre-commit run --all-files
Contributing
MarkItDown welcomes contributions:
- Look at issues marked as 'open for contribution'
- Review pull requests marked as 'open for reviewing'
Creating Plugins
Develop third-party plugins following the sample in packages/markitdown-sample-plugin. Use hashtag #markitdown-plugin when sharing.
Community
- Stars: 85.3k
- Forks: 4.9k
- Used by: 2.1k projects
- Contributors: 74+
- Language: Python 99.5%, Dockerfile 0.5%
Related Tools
- textract: Alternative document extraction library
- LangChain: Document loaders integration
- AutoGen: LLM agent framework (built by same team)
- MCP (Model Context Protocol): Standardized tool integration
Example: Complete LLM Pipeline
from markitdown import MarkItDown
from openai import OpenAI
# Initialize MarkItDown
md = MarkItDown()
client = OpenAI()
# Convert document
result = md.convert("research_paper.pdf")
# Create analysis prompt
prompt = f"""
Analyze this research paper:
{result.text_content}
Provide:
1. Key findings
2. Methodology summary
3. Future work suggestions
"""
# Get analysis
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
Example: Multi-Document Analysis
from markitdown import MarkItDown
md = MarkItDown()
# Convert multiple formats
files = [
    "report.pdf",
    "data.xlsx",
    "presentation.pptx",
    "notes.txt"
]
documents = {}
for file in files:
    result = md.convert(file)
    documents[file] = result.text_content
# Analyze collectively
for file, content in documents.items():
    print(f"=== {file} ===")
    print(content[:200] + "...")
    print()
Release Notes
Latest version: 0.1.4 (Dec 1, 2025)
Recent improvements:
- Enhanced error handling
- Better plugin support
- Performance optimizations
- Additional format support
See releases for full changelog.
