MarkItDown

A lightweight, production-ready Python tool that converts files and office documents to Markdown for LLM pipelines and text analysis.

Overview

MarkItDown is a lightweight Python utility for converting various file formats to Markdown, specifically designed for use with LLMs and text analysis pipelines. It preserves important document structure (headings, lists, tables, links) while being optimized for token efficiency.

Why Markdown?

Markdown is extremely close to plain text with minimal markup, but still represents important document structure. Mainstream LLMs (OpenAI's GPT-4o, Anthropic's Claude, etc.) natively "speak" Markdown and often incorporate it into responses unprompted. Additionally, Markdown conventions are highly token-efficient.

Supported Formats

MarkItDown converts from:

Documents: PDF, PowerPoint, Word, Excel (including older .xls files)
Images: EXIF metadata extraction and OCR
Audio: EXIF metadata and speech transcription (wav, mp3)
Web: HTML
Text: CSV, JSON, XML
Archives: ZIP files (iterates over contents)
Media: YouTube URLs (transcription)
Books: EPubs
And more!

Installation

Quick Install

# Install with all optional dependencies
pip install 'markitdown[all]'

From Source

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'

Optional Dependencies

Install specific dependencies for more control:

# Only PDF, DOCX, and PPTX
pip install 'markitdown[pdf, docx, pptx]'

# Available options:
# [all]          - All optional dependencies
# [pptx]         - PowerPoint files
# [docx]         - Word files
# [xlsx]         - Excel files
# [xls]          - Older Excel files
# [pdf]          - PDF files
# [outlook]      - Outlook messages
# [az-doc-intel] - Azure Document Intelligence
# [audio-transcription]  - Audio transcription (wav, mp3)
# [youtube-transcription] - YouTube video transcription
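
To decide which extras a project needs, the files it handles can be mapped to the option names listed above. The following is a minimal sketch, not part of MarkItDown: the extra names come from the list above, while the helper names and the extension pairings (for example `.msg` for Outlook messages) are assumptions made for illustration.

```python
from pathlib import Path

# Extension -> extra. Extra names are taken from the list above;
# the extension pairings (e.g. ".msg" for Outlook) are assumptions.
EXTRA_FOR_SUFFIX = {
    ".pptx": "pptx",
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".pdf": "pdf",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def extras_for(paths):
    """Return the sorted extras needed for the given file names."""
    extras = set()
    for path in paths:
        extra = EXTRA_FOR_SUFFIX.get(Path(path).suffix.lower())
        if extra:
            extras.add(extra)
    return sorted(extras)


def pip_command(paths):
    """Build a pip install command string covering the given files."""
    extras = extras_for(paths)
    if not extras:
        return "pip install markitdown"
    return f"pip install 'markitdown[{','.join(extras)}]'"


print(pip_command(["report.pdf", "slides.PPTX", "notes.txt"]))
# prints: pip install 'markitdown[pdf,pptx]'
```

Formats with no entry in the mapping, such as plain text, add no extra and fall back to the core install.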

Core Workflow

Command-Line Usage

# Basic conversion
markitdown path-to-file.pdf > document.md

# Specify output file
markitdown path-to-file.pdf -o document.md

# Pipe content
cat path-to-file.pdf | markitdown

Python API

from markitdown import MarkItDown

# Basic conversion
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)

# With Document Intelligence
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="optional custom prompt"
)
result = md.convert("example.jpg")
print(result.text_content)

Key Features

Structure Preservation

MarkItDown preserves important document elements:

✅ Headings (h1-h6)
✅ Lists (ordered and unordered)
✅ Tables
✅ Links
✅ Images (as markdown references)
✅ Code blocks
✅ Blockquotes

OCR for Images

Extract text from images using OCR:

EXIF metadata extraction
Text recognition
Table detection
Layout analysis

Audio Transcription

Convert audio to text:

EXIF metadata extraction
Speech transcription (wav, mp3)
Speaker diarization support

Azure Document Intelligence

Enhanced PDF conversion using Microsoft Document Intelligence:

markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"

MCP Server Integration

MarkItDown offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See markitdown-mcp for details.

Plugin System

MarkItDown supports third-party plugins:

# List installed plugins
markitdown --list-plugins

# Enable plugins
markitdown --use-plugins path-to-file.pdf

To find available plugins, search GitHub for #markitdown-plugin. To develop a plugin, see packages/markitdown-sample-plugin.

Use Cases

LLM Pipelines

Prepare documents for LLM consumption:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.pdf")

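# Feed to an LLM. `llm` is a placeholder for your own client object;
# complete() and its arguments are illustrative, not a real API.
response = llm.complete(
    prompt=result.text_content,
    instructions="Summarize this document"
)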

Content Extraction

Extract structured data from documents:

result = md.convert("invoice.pdf")
# Markdown with preserved structure
# - Table for line items
# - Headings for sections
# - List for totals
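
Once line items arrive as a Markdown table, they can be turned into Python data with a small parser. This is a rough sketch that assumes the converted text uses pipe-delimited tables with a `---` separator row. The function names are hypothetical, and escaped pipes inside cells are not handled.

```python
def _is_separator(cells):
    """True if every cell looks like a Markdown separator such as --- or :---:."""
    return all(cell and set(cell) <= set("-: ") and "-" in cell for cell in cells)


def parse_markdown_tables(text):
    """Return a list of tables; each table is a list of row dicts keyed by header."""
    tables, block = [], []
    # The trailing empty string is a sentinel that flushes a table at end of text.
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            block.append([cell.strip() for cell in stripped[1:-1].split("|")])
            continue
        # A non-table line ends the current block; keep it only if it is a real table.
        if len(block) >= 2 and _is_separator(block[1]):
            header, rows = block[0], block[2:]
            tables.append([dict(zip(header, row)) for row in rows])
        block = []
    return tables


sample = """# Invoice

| Item | Qty |
| --- | --- |
| Widget | 2 |
| Gadget | 5 |
"""
for row in parse_markdown_tables(sample)[0]:
    print(row["Item"], row["Qty"])
```

In practice `sample` would be `result.text_content`. Cell values stay as strings, so numeric columns need an explicit conversion.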

Document Analysis

Analyze multiple document types uniformly:

from markitdown import MarkItDown

md = MarkItDown()

documents = []
for file in ["report.pdf", "data.xlsx", "slides.pptx"]:
    result = md.convert(file)
    documents.append(result.text_content)

# Process uniformly; analyze() stands in for your own analysis function
for doc in documents:
    analyze(doc)  # Same analysis for all formats

Batch Processing

Process multiple files efficiently:

# Convert all PDFs in directory
for file in *.pdf; do
    markitdown "$file" -o "${file%.pdf}.md"
done
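
The same batch can be driven from Python, which makes it easier to skip failures and collect the output paths. Here is a minimal sketch: `convert_directory` is a hypothetical helper, and `convert` is any callable that returns an object with a `text_content` attribute, such as `MarkItDown().convert`.

```python
from pathlib import Path


def convert_directory(src_dir, convert, pattern="*.pdf"):
    """Convert every file matching `pattern` and write a .md file next to it.

    `convert` takes a path string and returns an object with `text_content`.
    Returns the list of Markdown paths that were written.
    """
    written = []
    for path in sorted(Path(src_dir).glob(pattern)):
        out_path = path.with_suffix(".md")
        try:
            result = convert(str(path))
        except Exception as exc:
            # Report and continue so one bad file does not stop the batch.
            print(f"skipped {path.name}: {exc}")
            continue
        out_path.write_text(result.text_content, encoding="utf-8")
        written.append(out_path)
    return written


# Example usage (requires markitdown to be installed):
# from markitdown import MarkItDown
# convert_directory("docs", MarkItDown().convert)
```

Passing the converter in as an argument keeps the helper independent of how the `MarkItDown` instance is configured.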

RAG Applications

Build retrieval-augmented generation systems:

from markitdown import MarkItDown
from vector_store import VectorStore  # placeholder for your vector store client

md = MarkItDown()
store = VectorStore()

# File paths to index
paths = ["report.pdf", "data.xlsx", "slides.pptx"]

for path in paths:
    result = md.convert(path)
    store.add(result.text_content, metadata=path)

# Query
query = "What are the main findings?"
matches = store.search(query)
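
Retrieval usually works better on sections than on whole documents, and the preserved headings give a natural place to split. The sketch below is one possible approach, not something MarkItDown provides: `chunk_by_heading` is a hypothetical helper that assumes `#`-style headings and does not special-case `#` lines inside code blocks.

```python
import re

# Matches Markdown headings such as "# Title" or "### Details".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown, source):
    """Split Markdown into one chunk per heading section.

    Each chunk is a dict with the originating `source`, the section
    `heading` (empty for text before the first heading), and the `text`.
    """
    chunks = []
    title, lines = "", []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"source": source, "heading": title, "text": body})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    flush()
    return chunks


text = "Intro text\n\n# Findings\nA\n\n## Details\nB\n"
for chunk in chunk_by_heading(text, "report.pdf"):
    print(repr(chunk["heading"]), len(chunk["text"]))
```

Each chunk can then be added to the store on its own, with `source` and `heading` as metadata.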

Advanced Features

Streaming Conversion

Convert file-like objects:

import io
from markitdown import MarkItDown

md = MarkItDown()

# From binary file-like object
with open("file.pdf", "rb") as f:
    result = md.convert_stream(f)

# From BytesIO (file_content is a bytes object, e.g. a downloaded file)
buffer = io.BytesIO(file_content)
result = md.convert_stream(buffer)

Docker Support

# Build image
docker build -t markitdown:latest .

# Run conversion
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

Azure Document Intelligence

Enhanced conversion for complex PDFs:

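from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://<resource>.cognitiveservices.azure.com/",
    # Key-based auth is passed as a credential object (assumes a recent 0.1.x release)
    docintel_credential=AzureKeyCredential("<api_key>")
)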
result = md.convert("complex.pdf")

Benefits:

Better OCR quality
Improved table extraction
Enhanced form field recognition
Handwriting support

LLM-Enhanced Image Descriptions

Use LLMs to generate rich image descriptions:

from openai import OpenAI
from markitdown import MarkItDown

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail for accessibility purposes"
)

result = md.convert("chart.png")
# Result includes LLM-generated description

Comparison with Alternatives

vs textract

Feature	MarkItDown	textract
Output format	Markdown (structured)	Plain text
Token efficiency	High (Markdown)	Lower (plain text)
Structure preservation	✅ Yes	❌ No
LLM-optimized	✅ Yes	❌ No
Python 3.10+	✅ Required	✅ Supported

vs PyPDF2 + pandas

# Old way (multiple libraries)
import PyPDF2
import pandas as pd

# Different APIs for different formats
reader = PyPDF2.PdfReader("file.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
df = pd.read_excel("file.xlsx")

# MarkItDown (unified API)
from markitdown import MarkItDown
md = MarkItDown()
pdf = md.convert("file.pdf")
xlsx = md.convert("file.xlsx")

Best Practices

Performance Optimization

Install only needed dependencies:

# Instead of [all], install specific formats
pip install 'markitdown[pdf, docx]'

Use batch processing for many files:

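from concurrent.futures import ThreadPoolExecutor
from markitdown import MarkItDown

md = MarkItDown()
files = ["report.pdf", "data.xlsx", "slides.pptx"]

def convert_file(file):
    return md.convert(file)

with ThreadPoolExecutor(max_workers=4) as executor:
    # list() collects the results and surfaces any exception raised in a worker
    results = list(executor.map(convert_file, files))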

Cache results when possible:

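import hashlib
import os
from markitdown import MarkItDown

md = MarkItDown()

def cached_convert(file_path, cache_dir="cache"):
    # Key on the path string; a changed file at the same path reuses the old entry
    cache_key = hashlib.md5(file_path.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{cache_key}.md")

    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()

    result = md.convert(file_path)
    os.makedirs(cache_dir, exist_ok=True)  # create the cache directory on first use
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(result.text_content)
    return result.text_content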
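
The cache above keys on the path string, so editing a file in place keeps returning the stale Markdown. One alternative is to key on the file's bytes instead. The sketch below assumes that approach: both function names are hypothetical, and `convert` is any callable returning an object with `text_content`, such as `MarkItDown().convert`.

```python
import hashlib
from pathlib import Path


def content_key(file_path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_convert_by_content(file_path, convert, cache_dir="cache"):
    """Convert `file_path`, reusing a cached result when the bytes are unchanged."""
    cache_file = Path(cache_dir) / f"{content_key(file_path)}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    text = convert(file_path).text_content
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Hashing reads the whole file on every call. That is usually much cheaper than converting it, but it is still a cost for very large inputs.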

Quality Tips

For complex PDFs: Use Azure Document Intelligence
For images: Provide LLM prompt for better descriptions
For scanned documents: Enable OCR explicitly
For tables: Check output formatting, may need post-processing

Error Handling

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("file.pdf")
    if result.text_content:
        print("Success:", result.text_content[:100])
    else:
        print("Empty result")
except Exception as e:
    print(f"Conversion failed: {e}")
    # Fallback strategy
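
One possible fallback strategy is to try several converters in order and keep the first non-empty result. The following is a sketch under that assumption: `convert_with_fallback` is a hypothetical helper, and each entry in `converters` is a `(name, callable)` pair where the callable returns an object with `text_content`.

```python
def convert_with_fallback(path, converters):
    """Try each (name, convert) pair in order.

    Returns (name, text) for the first converter that yields non-empty text.
    Raises RuntimeError listing every failure if none succeeds.
    """
    errors = []
    for name, convert in converters:
        try:
            text = convert(path).text_content
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError(f"all converters failed for {path}: " + "; ".join(errors))


# Example usage (requires markitdown to be installed):
# from markitdown import MarkItDown
# converters = [
#     ("default", MarkItDown().convert),
#     ("docintel", MarkItDown(docintel_endpoint="<document_intelligence_endpoint>").convert),
# ]
# name, text = convert_with_fallback("file.pdf", converters)
```

Treating an empty result as a failure matters for scanned PDFs, where conversion can succeed without extracting any text.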

Technical Details

Architecture

Language: Python 3.10+
Dependencies: Minimal core with optional feature groups
License: MIT
Maintainer: Microsoft AutoGen Team

Breaking Changes (0.0.1 → 0.1.0)

Dependencies organized into optional feature-groups
- Use pip install 'markitdown[all]' for backward compatibility
convert_stream() now requires binary file-like objects
- No longer accepts text file-like objects like io.StringIO
DocumentConverter class interface changed
- Now reads from file-like streams, not file paths
- No temporary files created

Streaming Interface

# Before 0.1.0: convert_stream() also accepted text streams
# md.convert_stream(io.StringIO(text))  # no longer supported

# From 0.1.0: pass a binary file-like object
with open("file.pdf", "rb") as f:
    md.convert_stream(f)

# convert() still accepts a path string
md.convert("file.pdf")
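
Code that still holds text streams can adapt them before calling `convert_stream()`, since it requires a binary file-like object. A small sketch follows. `as_binary_stream` is a hypothetical helper, and UTF-8 is an assumed default encoding.

```python
import io


def as_binary_stream(stream, encoding="utf-8"):
    """Return a binary file-like object for `stream`.

    Text streams (io.StringIO, files opened in text mode) are read fully
    and re-wrapped as io.BytesIO; binary streams are returned unchanged.
    """
    if isinstance(stream, io.TextIOBase):
        return io.BytesIO(stream.read().encode(encoding))
    return stream


text_stream = io.StringIO("# Title\n\nBody text")
binary_stream = as_binary_stream(text_stream)
print(type(binary_stream).__name__)  # prints: BytesIO
```

The result can be passed to `md.convert_stream(binary_stream)`. The whole text is read into memory, so this suits small inputs.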

Limitations

Not optimized for human-presentable conversions
Designed for LLM consumption, not high-fidelity document conversion
Some complex layouts may not be preserved perfectly
OCR quality depends on input image quality
Audio transcription requires the audio-transcription extra
YouTube transcription requires the youtube-transcription extra
Azure Document Intelligence requires an Azure subscription

Testing

Running Tests

# Navigate to package
cd packages/markitdown

# Install hatch
pip install hatch

# Run tests
hatch shell
hatch test

Pre-commit Checks

# Run all pre-commit checks
pre-commit run --all-files

Contributing

MarkItDown welcomes contributions:

Look at issues marked 'open for contribution'
Review pull requests marked 'open for reviewing'

Creating Plugins

Develop third-party plugins following the sample in packages/markitdown-sample-plugin. Use hashtag #markitdown-plugin when sharing.

Community

Stars: 85.3k
Forks: 4.9k
Used by: 2.1k projects
Contributors: 74+
Language: Python 99.5%, Dockerfile 0.5%

Related Projects

textract: Alternative document extraction library
LangChain: Document loaders integration
AutoGen: LLM agent framework (built by same team)
MCP (Model Context Protocol): Standardized tool integration

Example: Complete LLM Pipeline

from markitdown import MarkItDown
from openai import OpenAI

# Initialize MarkItDown
md = MarkItDown()
client = OpenAI()

# Convert document
result = md.convert("research_paper.pdf")

# Create analysis prompt
prompt = f"""
Analyze this research paper:

{result.text_content}

Provide:
1. Key findings
2. Methodology summary
3. Future work suggestions
"""

# Get analysis
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)

Example: Multi-Document Analysis

from markitdown import MarkItDown

md = MarkItDown()

# Convert multiple formats
files = [
    "report.pdf",
    "data.xlsx",
    "presentation.pptx",
    "notes.txt"
]

documents = {}
for file in files:
    result = md.convert(file)
    documents[file] = result.text_content

# Analyze collectively
for file, content in documents.items():
    print(f"=== {file} ===")
    print(content[:200] + "...")
    print()

Release Notes

Latest version: 0.1.4 (Dec 1, 2025)

Recent improvements:

Enhanced error handling
Better plugin support
Performance optimizations
Additional format support

See releases for full changelog.
