Tool · 📅 Updated 2025-12-01

MarkItDown

Python tool for converting files and office documents to Markdown: PDFs, PowerPoint, Word, Excel, images with OCR, audio transcription, and more.

Tags: document-conversion, llm-pipeline, content-extraction, ocr, cli, document-processing, python
Stars: 85,300
Author: microsoft

Python tool for converting files and office documents to Markdown: a lightweight, production-ready utility for LLM pipelines and text analysis.

Overview

MarkItDown is a lightweight Python utility for converting various file formats to Markdown, specifically designed for use with LLMs and text analysis pipelines. It preserves important document structure (headings, lists, tables, links) while being optimized for token efficiency.

Why Markdown?

Markdown is extremely close to plain text with minimal markup, but still represents important document structure. Mainstream LLMs (OpenAI's GPT-4o, Anthropic's Claude, etc.) natively "speak" Markdown and often incorporate it into responses unprompted. Additionally, Markdown conventions are highly token-efficient.
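
As a rough illustration of the token-efficiency claim, the sketch below renders the same small table as HTML and as Markdown and compares their sizes. Character count is used only as a crude proxy for token count (an assumption; real tokenizers differ), and the sample rows are made up.

```python
# Same table expressed as HTML and as Markdown (sample data is illustrative).
rows = [("Item", "Qty"), ("Widget", "2"), ("Gadget", "5")]

def to_html(rows):
    """Render rows as a minimal HTML table, first row as the header."""
    header, *body = rows
    out = ["<table>", "<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>"]
    out += ["<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in body]
    out.append("</table>")
    return "\n".join(out)

def to_markdown(rows):
    """Render rows as a Markdown pipe table, first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

html, markdown = to_html(rows), to_markdown(rows)
# Character counts are a stand-in for token counts, not a measurement of them.
print(len(html), len(markdown))
```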

Supported Formats

MarkItDown converts from:

  • Documents: PDF, PowerPoint, Word, Excel (including older .xls files)
  • Images: EXIF metadata extraction and OCR
  • Audio: EXIF metadata and speech transcription (wav, mp3)
  • Web: HTML
  • Text: CSV, JSON, XML
  • Archives: ZIP files (iterates over contents)
  • Media: YouTube URLs (transcription)
  • Books: EPubs
  • And more!

Installation

Quick Install

# Install with all optional dependencies
pip install 'markitdown[all]'

From Source

git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'

Optional Dependencies

Install specific dependencies for more control:

# Only PDF, DOCX, and PPTX
pip install 'markitdown[pdf, docx, pptx]'

# Available options:
# [all]          - All optional dependencies
# [pptx]         - PowerPoint files
# [docx]         - Word files
# [xlsx]         - Excel files
# [xls]          - Older Excel files
# [pdf]          - PDF files
# [outlook]      - Outlook messages
# [az-doc-intel] - Azure Document Intelligence
# [audio-transcription]  - Audio transcription (wav, mp3)
# [youtube-transcription] - YouTube video transcription
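
To pick extras for a given set of files, a small helper can build the install command. This is a sketch only: the extension-to-extra mapping below is inferred from the option descriptions above (the `.msg` entry in particular is an assumption), and `pip_command` is a hypothetical helper, not part of MarkItDown.

```python
from pathlib import Path

# Assumed mapping from file extension to pip extra, inferred from the option
# descriptions above; verify against the package metadata before relying on it.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}

def pip_command(paths):
    """Build a pip install command covering only the extras these files need."""
    extras = set()
    for path in paths:
        extra = EXTRA_FOR_SUFFIX.get(Path(path).suffix.lower())
        if extra:
            extras.add(extra)
    if not extras:
        return "pip install markitdown"  # core package handles the rest
    return f"pip install 'markitdown[{','.join(sorted(extras))}]'"

print(pip_command(["report.pdf", "notes.DOCX", "data.csv"]))
```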

Core Workflow

Command-Line Usage

# Basic conversion
markitdown path-to-file.pdf > document.md

# Specify output file
markitdown path-to-file.pdf -o document.md

# Pipe content
cat path-to-file.pdf | markitdown

Python API

from markitdown import MarkItDown

# Basic conversion
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)

# With Document Intelligence
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="optional custom prompt"
)
result = md.convert("example.jpg")
print(result.text_content)

Key Features

Structure Preservation

MarkItDown preserves important document elements:

  • ✅ Headings (h1-h6)
  • ✅ Lists (ordered and unordered)
  • ✅ Tables
  • ✅ Links
  • ✅ Images (as markdown references)
  • ✅ Code blocks
  • ✅ Blockquotes
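
One way to spot-check that structure survived a conversion is to count Markdown constructs in the output. The sketch below uses simple line-based regexes; `structure_summary` is a hypothetical helper, the sample text is made up, and the patterns are approximations rather than a full Markdown parser.

```python
import re

def structure_summary(markdown: str) -> dict:
    """Count common Markdown constructs with simple line-based patterns."""
    lines = markdown.splitlines()
    return {
        # ATX headings: one to six '#' followed by a space
        "headings": sum(bool(re.match(r"#{1,6} ", l)) for l in lines),
        # Bullets (-, *, +) or numbered items (1.)
        "list_items": sum(bool(re.match(r"\s*([-*+]|\d+\.) ", l)) for l in lines),
        # Pipe-table lines, including the header separator row
        "table_rows": sum(l.lstrip().startswith("|") for l in lines),
        # Inline links, excluding image references that start with '!'
        "links": len(re.findall(r"(?<!!)\[[^\]]+\]\([^)]+\)", markdown)),
    }

sample = """# Report
## Findings
- First point
- See [the appendix](appendix.md)
1. Step one

| Item | Qty |
| --- | --- |
| Widget | 2 |
"""
print(structure_summary(sample))
```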

OCR for Images

Extract text from images using OCR:

  • EXIF metadata extraction
  • Text recognition
  • Table detection
  • Layout analysis

Audio Transcription

Convert audio to text:

  • EXIF metadata extraction
  • Speech transcription (wav, mp3)
  • Speaker diarization support

Azure Document Intelligence

Enhanced PDF conversion using Microsoft Document Intelligence:

markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"

MCP Server Integration

MarkItDown offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See markitdown-mcp for details.

Plugin System

MarkItDown supports third-party plugins:

# List installed plugins
markitdown --list-plugins

# Enable plugins
markitdown --use-plugins path-to-file.pdf

To find available plugins, search GitHub for #markitdown-plugin. To develop a plugin, see packages/markitdown-sample-plugin.

Use Cases

LLM Pipelines

Prepare documents for LLM consumption:

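from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown()
client = OpenAI()  # reads OPENAI_API_KEY from the environment
result = md.convert("report.pdf")

# Feed the converted Markdown to the LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Summarize this document"},
        {"role": "user", "content": result.text_content},
    ],
)
print(response.choices[0].message.content)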

Content Extraction

Extract structured data from documents:

result = md.convert("invoice.pdf")
# Markdown with preserved structure
# - Table for line items
# - Headings for sections
# - List for totals
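
If the converted output does contain a pipe table for the line items, it can be turned into records with a few lines of standard-library code. This is a sketch under that assumption: MarkItDown does not guarantee a pipe table for every PDF, `parse_first_table` is a hypothetical helper, and escaped pipes inside cells are not handled.

```python
def parse_first_table(markdown: str) -> list[dict]:
    """Return the rows of the first pipe table in the Markdown as dicts."""
    def cells(line):
        # Drop the outer pipes, then split on the inner ones
        return [c.strip() for c in line.strip().strip("|").split("|")]

    table = []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            table.append(line)
        elif table:
            break  # the first table has ended
    if len(table) < 3:
        return []  # need a header, a separator, and at least one data row
    header = cells(table[0])
    return [dict(zip(header, cells(row))) for row in table[2:]]

sample = """## Line items

| Description | Amount |
| --- | --- |
| Consulting | 1200.00 |
| Travel | 310.50 |

Total due: 1510.50
"""
items = parse_first_table(sample)
print(items)
```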

Document Analysis

Analyze multiple document types uniformly:

md = MarkItDown()
documents = []
for file in ["report.pdf", "data.xlsx", "slides.pptx"]:
    result = md.convert(file)
    documents.append(result.text_content)

# Process uniformly
for doc in documents:
    analyze(doc)  # analyze() is a placeholder for your own function; same call for all formats

Batch Processing

Process multiple files efficiently:

# Convert all PDFs in directory
for file in *.pdf; do
    markitdown "$file" -o "${file%.pdf}.md"
done
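
The same loop can be written in Python, which avoids one process start per file. A minimal sketch, assuming the `convert()` and `text_content` usage shown earlier; `output_path` and `convert_directory` are hypothetical helper names.

```python
from pathlib import Path

def output_path(src):
    """Map an input file to the .md file written next to it."""
    return Path(src).with_suffix(".md")

def convert_directory(directory, pattern="*.pdf"):
    """Convert every matching file, skipping outputs that already exist."""
    # Imported here so output_path() can be used without markitdown installed
    from markitdown import MarkItDown

    md = MarkItDown()
    written = []
    for src in sorted(Path(directory).glob(pattern)):
        dest = output_path(src)
        if dest.exists():
            continue  # do not redo finished work on a rerun
        dest.write_text(md.convert(str(src)).text_content, encoding="utf-8")
        written.append(dest)
    return written
```

Calling `convert_directory(".")` mirrors the shell loop above, with the added behavior that existing `.md` files are left alone.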

RAG Applications

Build retrieval-augmented generation systems:

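from markitdown import MarkItDown
from vector_store import VectorStore  # placeholder module: substitute your vector store client

md = MarkItDown()
store = VectorStore()

documents = ["report.pdf", "data.xlsx"]  # paths of the files to index
for doc in documents:
    result = md.convert(doc)
    store.add(result.text_content, metadata=doc)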

# Query
query = "What are the main findings?"
matches = store.search(query)
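
Retrieval usually works on chunks rather than whole documents, and the headings preserved in the Markdown give natural split points. The sketch below is one possible chunking approach, not something MarkItDown provides; `split_by_headings` and its `max_level` parameter are hypothetical.

```python
import re

def split_by_headings(markdown: str, max_level: int = 2) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at each heading up to max_level."""
    heading = re.compile(rf"^#{{1,{max_level}}} +(.*)$")
    chunks, title, body = [], None, []

    def flush():
        # Emit the chunk collected so far, unless it is completely empty
        text = "\n".join(body).strip()
        if title is not None or text:
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)  # deeper headings stay inside the current chunk
    flush()
    return chunks

sample = """Preamble text.

# Intro
Hello.

## Methods
### Detail
Steps here.
"""
chunks = split_by_headings(sample)
print([c["title"] for c in chunks])
```

Each chunk's `text` could then be passed to the store in place of the full `result.text_content`, with the title kept as metadata.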

Advanced Features

Streaming Conversion

Convert file-like objects:

import io
from markitdown import MarkItDown

md = MarkItDown()

# From binary file-like object
with open("file.pdf", "rb") as f:
    result = md.convert_stream(f)

# From BytesIO (file_content holds the document's raw bytes, e.g. from a download)
buffer = io.BytesIO(file_content)
result = md.convert_stream(buffer)

Docker Support

# Build image
docker build -t markitdown:latest .

# Run conversion
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

Azure Document Intelligence

Enhanced conversion for complex PDFs:

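from azure.core.credentials import AzureKeyCredential
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://<resource>.cognitiveservices.azure.com/",
    # Assumption: recent releases take a credential object via docintel_credential
    # rather than a raw key string; confirm against your installed version.
    docintel_credential=AzureKeyCredential("<api_key>"),
)
result = md.convert("complex.pdf")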

Benefits:

  • Better OCR quality
  • Improved table extraction
  • Enhanced form field recognition
  • Handwriting support

LLM-Enhanced Image Descriptions

Use LLMs to generate rich image descriptions:

from openai import OpenAI
from markitdown import MarkItDown

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail for accessibility purposes"
)

result = md.convert("chart.png")
# Result includes LLM-generated description

Comparison with Alternatives

vs textract

| Feature | MarkItDown | textract |
| --- | --- | --- |
| Output format | Markdown (structured) | Plain text |
| Token efficiency | High (Markdown) | Lower (plain text) |
| Structure preservation | ✅ Yes | ❌ No |
| LLM-optimized | ✅ Yes | ❌ No |
| Python 3.10+ | ✅ Required | ✅ Supported |

vs PyPDF2 + pandas

# Old way (multiple libraries)
import PyPDF2
import pandas as pd

# Different APIs for different formats
reader = PyPDF2.PdfReader("file.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
df = pd.read_excel("file.xlsx")

# MarkItDown (unified API)
from markitdown import MarkItDown
md = MarkItDown()
pdf = md.convert("file.pdf")
xlsx = md.convert("file.xlsx")

Best Practices

Performance Optimization

  1. Install only needed dependencies:

    # Instead of [all], install specific formats
    pip install 'markitdown[pdf, docx]'
    
  2. Use batch processing for many files:

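    from concurrent.futures import ThreadPoolExecutor
    from markitdown import MarkItDown
    
    md = MarkItDown()
    files = ["report.pdf", "data.xlsx", "slides.pptx"]  # paths to convert
    
    def convert_file(file):
        return md.convert(file)
    
    with ThreadPoolExecutor(max_workers=4) as executor:
        # map() is lazy; list() collects the results in input order
        results = list(executor.map(convert_file, files))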
    
  3. Cache results when possible:

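    import hashlib
    import os
    from markitdown import MarkItDown
    
    md = MarkItDown()
    
    def cached_convert(file_path):
        # Key on file contents so an edited file is reconverted
        with open(file_path, "rb") as f:
            cache_key = hashlib.md5(f.read()).hexdigest()
        cache_file = f"cache/{cache_key}.md"
    
        if os.path.exists(cache_file):
            with open(cache_file, encoding="utf-8") as f:
                return f.read()
    
        result = md.convert(file_path)
        os.makedirs("cache", exist_ok=True)  # create the cache directory on first use
        with open(cache_file, "w", encoding="utf-8") as f:
            f.write(result.text_content)
        return result.text_content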
    

Quality Tips

  1. For complex PDFs: Use Azure Document Intelligence
  2. For images: Provide LLM prompt for better descriptions
  3. For scanned documents: Enable OCR explicitly
  4. For tables: Check the output formatting; it may need post-processing

Error Handling

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("file.pdf")
    if result.text_content:
        print("Success:", result.text_content[:100])
    else:
        print("Empty result")
except Exception as e:
    print(f"Conversion failed: {e}")
    # Fallback strategy

Technical Details

Architecture

  • Language: Python 3.10+
  • Dependencies: Minimal core with optional feature groups
  • License: MIT
  • Maintainer: Microsoft AutoGen Team

Breaking Changes (0.0.1 → 0.1.0)

  • Dependencies organized into optional feature-groups
    • Use pip install 'markitdown[all]' for backward compatibility
  • convert_stream() now requires binary file-like objects
    • No longer accepts text file-like objects like io.StringIO
  • DocumentConverter class interface changed
    • Now reads from file-like streams, not file paths
    • No temporary files created
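
Code that used to pass text streams can be adapted by encoding to bytes first. The helper below is a sketch of that migration step; `to_binary_stream` is a hypothetical name, and UTF-8 is an assumed default encoding.

```python
import io

def to_binary_stream(source, encoding="utf-8"):
    """Return a binary file-like object for str, bytes, or text/binary streams."""
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    if isinstance(source, str):
        return io.BytesIO(source.encode(encoding))
    if isinstance(source, io.TextIOBase):  # e.g. io.StringIO or a file opened in text mode
        return io.BytesIO(source.read().encode(encoding))
    return source  # assume it is already a binary stream

stream = to_binary_stream(io.StringIO("name,qty\nwidget,2\n"))
print(stream.read())
```

The resulting object can be handed to `convert_stream()`. Because a bare stream carries no filename, the converter may need a format hint; check the `convert_stream()` signature of the installed version for how to supply one.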

Streaming Interface

# Before 0.1.0 (string path)
md.convert("file.pdf")

# After 0.1.0 (file-like object preferred)
with open("file.pdf", "rb") as f:
    md.convert_stream(f)

# String path still works for backward compatibility
md.convert("file.pdf")  # Still supported

Limitations

  • Not optimized for human-presentable conversions
  • Designed for LLM consumption, not high-fidelity document conversion
  • Some complex layouts may not be preserved perfectly
  • OCR quality depends on input image quality
  • Audio transcription requires the [audio-transcription] extra
  • YouTube transcription requires the [youtube-transcription] extra
  • Azure Document Intelligence requires Azure subscription

Testing

Running Tests

# Navigate to package
cd packages/markitdown

# Install hatch
pip install hatch

# Run tests
hatch shell
hatch test

Pre-commit Checks

# Run all pre-commit checks
pre-commit run --all-files

Contributing

MarkItDown welcomes contributions:

  • Look at issues marked 'open for contribution'
  • Review pull requests marked 'open for reviewing'

Creating Plugins

Develop third-party plugins following the sample in packages/markitdown-sample-plugin. Use hashtag #markitdown-plugin when sharing.

Community

  • Stars: 85.3k
  • Forks: 4.9k
  • Used by: 2.1k projects
  • Contributors: 74+
  • Language: Python 99.5%, Dockerfile 0.5%

Related Projects

  • textract: Alternative document extraction library
  • LangChain: Document loaders integration
  • AutoGen: LLM agent framework (built by same team)
  • MCP (Model Context Protocol): Standardized tool integration

Example: Complete LLM Pipeline

from markitdown import MarkItDown
from openai import OpenAI

# Initialize MarkItDown
md = MarkItDown()
client = OpenAI()

# Convert document
result = md.convert("research_paper.pdf")

# Create analysis prompt
prompt = f"""
Analyze this research paper:

{result.text_content}

Provide:
1. Key findings
2. Methodology summary
3. Future work suggestions
"""

# Get analysis
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)
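
Long papers can exceed what a model accepts in one request, so it may help to trim the Markdown before building the prompt. The sketch below keeps whole paragraphs up to a character budget; `truncate_markdown` is a hypothetical helper, and a character limit is only a stand-in for a real token limit.

```python
def truncate_markdown(text: str, max_chars: int) -> str:
    """Keep whole paragraphs from the start of text without exceeding max_chars."""
    if len(text) <= max_chars:
        return text
    kept, used = [], 0
    for para in text.split("\n\n"):
        cost = len(para) + (2 if kept else 0)  # account for the blank-line separator
        if used + cost > max_chars:
            break
        kept.append(para)
        used += cost
    # Fall back to a hard cut when even the first paragraph is too long
    return "\n\n".join(kept) if kept else text[:max_chars]

print(truncate_markdown("aaaa\n\nbbbb\n\ncccc", 10))
```

In the pipeline above, `truncate_markdown(result.text_content, max_chars)` would replace `result.text_content` inside the prompt, with `max_chars` chosen for the model in use.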

Example: Multi-Document Analysis

from markitdown import MarkItDown

md = MarkItDown()

# Convert multiple formats
files = [
    "report.pdf",
    "data.xlsx",
    "presentation.pptx",
    "notes.txt"
]

documents = {}
for file in files:
    result = md.convert(file)
    documents[file] = result.text_content

# Analyze collectively
for file, content in documents.items():
    print(f"=== {file} ===")
    print(content[:200] + "...")
    print()

Release Notes

Latest version: 0.1.4 (Dec 1, 2025)

Recent improvements:

  • Enhanced error handling
  • Better plugin support
  • Performance optimizations
  • Additional format support

See releases for full changelog.