16 KiB
Enhanced Markdown Conversion
This document describes the enhanced markdown conversion feature for RAG-Anything, which provides high-quality PDF generation from markdown files with multiple backend options and advanced styling.
Overview
The enhanced markdown conversion feature provides professional-quality PDF generation from markdown files. It supports multiple conversion backends, advanced styling options, syntax highlighting, and seamless integration with RAG-Anything's document processing pipeline.
Key Features
- Multiple Backends: WeasyPrint, Pandoc, and automatic backend selection
- Advanced Styling: Custom CSS, syntax highlighting, and professional layouts
- Image Support: Embedded images with proper scaling and positioning
- Table Support: Formatted tables with borders and professional styling
- Code Highlighting: Syntax highlighting for code blocks using Pygments
- Custom Templates: Support for custom CSS and document templates
- Table of Contents: Automatic TOC generation with navigation links
- Professional Typography: High-quality fonts and spacing
Installation
Required Dependencies
# Basic installation
pip install raganything[all]
# Required for enhanced markdown conversion
pip install markdown weasyprint pygments
Optional Dependencies
# For Pandoc backend (system installation required)
# Ubuntu/Debian:
sudo apt-get install pandoc wkhtmltopdf
# macOS:
brew install pandoc wkhtmltopdf
# Or using conda:
conda install -c conda-forge pandoc wkhtmltopdf
Backend-Specific Installation
WeasyPrint (Recommended)
# Install WeasyPrint with system dependencies
pip install weasyprint
# Ubuntu/Debian system dependencies:
sudo apt-get install -y build-essential python3-dev python3-pip \
python3-setuptools python3-wheel python3-cffi libcairo2 \
libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
libffi-dev shared-mime-info
Pandoc
- Download from: https://pandoc.org/installing.html
- Requires system-wide installation
- Used for complex document structures and LaTeX-quality output
Usage
Basic Conversion
from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig
# Create converter with default settings
converter = EnhancedMarkdownConverter()
# Convert markdown file to PDF
success = converter.convert_file_to_pdf(
input_path="document.md",
output_path="document.pdf",
method="auto" # Automatically select best available backend
)
if success:
print("✅ Conversion successful!")
else:
print("❌ Conversion failed")
Advanced Configuration
# Create custom configuration
config = MarkdownConfig(
page_size="A4", # A4, Letter, Legal, etc.
margin="1in", # CSS-style margins
font_size="12pt", # Base font size
line_height="1.5", # Line spacing
include_toc=True, # Generate table of contents
syntax_highlighting=True, # Enable code syntax highlighting
# Custom CSS styling
custom_css="""
body {
font-family: 'Georgia', serif;
color: #333;
}
h1 {
color: #2c3e50;
border-bottom: 2px solid #3498db;
padding-bottom: 0.3em;
}
code {
background-color: #f8f9fa;
padding: 2px 4px;
border-radius: 3px;
}
pre {
background-color: #f8f9fa;
border-left: 4px solid #3498db;
padding: 15px;
border-radius: 5px;
}
table {
border-collapse: collapse;
width: 100%;
margin: 1em 0;
}
th, td {
border: 1px solid #ddd;
padding: 8px 12px;
text-align: left;
}
th {
background-color: #f2f2f2;
font-weight: bold;
}
"""
)
converter = EnhancedMarkdownConverter(config)
Backend Selection
# Check available backends
converter = EnhancedMarkdownConverter()
backend_info = converter.get_backend_info()
print("Available backends:")
for backend, available in backend_info["available_backends"].items():
status = "✅" if available else "❌"
print(f" {status} {backend}")
print(f"Recommended backend: {backend_info['recommended_backend']}")
# Use specific backend
converter.convert_file_to_pdf(
input_path="document.md",
output_path="document.pdf",
method="weasyprint" # or "pandoc", "pandoc_system", "auto"
)
Content Conversion
# Convert markdown content directly (not from file)
markdown_content = """
# Sample Document
## Introduction
This is a **bold** statement with *italic* text.
## Code Example
```python
def hello_world():
print("Hello, World!")
return "Success"
Table
| Feature | Status | Notes |
|---|---|---|
| PDF Generation | ✅ | Working |
| Syntax Highlighting | ✅ | Pygments |
| Custom CSS | ✅ | Full support |
| """ |
success = converter.convert_markdown_to_pdf( markdown_content=markdown_content, output_path="sample.pdf", method="auto" )
### Command Line Interface
```bash
# Basic conversion
python -m raganything.enhanced_markdown document.md --output document.pdf
# With specific backend
python -m raganything.enhanced_markdown document.md --method weasyprint
# With custom CSS file
python -m raganything.enhanced_markdown document.md --css custom_style.css
# Show backend information
python -m raganything.enhanced_markdown --info
# Help
python -m raganything.enhanced_markdown --help
Backend Comparison
| Backend | Pros | Cons | Best For | Quality |
|---|---|---|---|---|
| WeasyPrint | • Excellent CSS support • Fast rendering • Great web-style layouts • Python-based |
• Limited LaTeX features • Requires system deps |
• Web-style documents • Custom styling • Fast conversion |
⭐⭐⭐⭐ |
| Pandoc | • Extensive features • LaTeX-quality output • Academic formatting • Many input/output formats |
• Slower conversion • System installation • Complex setup |
• Academic papers • Complex documents • Publication quality |
⭐⭐⭐⭐⭐ |
| Auto | • Automatic selection • Fallback support • User-friendly |
• May not use optimal backend | • General use • Quick setup • Development |
⭐⭐⭐⭐ |
Configuration Options
MarkdownConfig Parameters
@dataclass
class MarkdownConfig:
# Page layout
page_size: str = "A4" # A4, Letter, Legal, A3, etc.
margin: str = "1in" # CSS margin format
font_size: str = "12pt" # Base font size
line_height: str = "1.5" # Line spacing multiplier
# Content options
include_toc: bool = True # Generate table of contents
syntax_highlighting: bool = True # Enable code highlighting
image_max_width: str = "100%" # Maximum image width
table_style: str = "..." # Default table CSS
# Styling
css_file: Optional[str] = None # External CSS file path
custom_css: Optional[str] = None # Inline CSS content
template_file: Optional[str] = None # Custom HTML template
# Output options
output_format: str = "pdf" # Currently only PDF supported
output_dir: Optional[str] = None # Output directory
# Metadata
metadata: Optional[Dict[str, str]] = None # Document metadata
Supported Markdown Features
Basic Formatting
- Headers:
# ## ### #### ##### ###### - Emphasis:
*italic*,**bold**,***bold italic*** - Links:
[text](url),[text][ref] - Images:
,![alt][ref] - Lists: Ordered and unordered, nested
- Blockquotes:
> quote - Line breaks: Double space or
\n\n
Advanced Features
- Tables: GitHub-style tables with alignment
- Code blocks: Fenced code blocks with language specification
- Inline code:
backtick code - Horizontal rules:
---or*** - Footnotes:
[^1]references - Definition lists: Term and definition pairs
- Attributes:
{#id .class key=value}
Code Highlighting
```python
def example_function():
"""This will be syntax highlighted"""
return "Hello, World!"
function exampleFunction() {
// This will also be highlighted
return "Hello, World!";
}
## Integration with RAG-Anything
The enhanced markdown conversion integrates seamlessly with RAG-Anything:
```python
from raganything import RAGAnything
# Initialize RAG-Anything
rag = RAGAnything()
# Process markdown files - enhanced conversion is used automatically
await rag.process_document_complete("document.md")
# Batch processing with enhanced markdown conversion
result = rag.process_documents_batch(
file_paths=["doc1.md", "doc2.md", "doc3.md"],
output_dir="./output"
)
# The .md files will be converted to PDF using enhanced conversion
# before being processed by the RAG system
Performance Considerations
Conversion Speed
- WeasyPrint: ~1-3 seconds for typical documents
- Pandoc: ~3-10 seconds for typical documents
- Large documents: Time scales roughly linearly with content
Memory Usage
- WeasyPrint: ~50-100MB per conversion
- Pandoc: ~100-200MB per conversion
- Images: Large images increase memory usage significantly
Optimization Tips
- Resize large images before embedding
- Use compressed images (JPEG for photos, PNG for graphics)
- Limit concurrent conversions to avoid memory issues
- Cache converted content when processing multiple times
Examples
Sample Markdown Document
# Technical Documentation
## Table of Contents
[TOC]
## Overview
This document provides comprehensive technical specifications.
## Architecture
### System Components
1. **Parser Engine**: Handles document processing
2. **Storage Layer**: Manages data persistence
3. **Query Interface**: Provides search capabilities
### Code Implementation
```python
from raganything import RAGAnything
# Initialize system
rag = RAGAnything(config={
"working_dir": "./storage",
"enable_image_processing": True
})
# Process document
await rag.process_document_complete("document.pdf")
Performance Metrics
| Component | Throughput | Latency | Memory |
|---|---|---|---|
| Parser | 100 docs/hour | 36s avg | 2.5 GB |
| Storage | 1000 ops/sec | 1ms avg | 512 MB |
| Query | 50 queries/sec | 20ms avg | 1 GB |
Integration Notes
Important
: Always validate input before processing.
Conclusion
The enhanced system provides excellent performance for document processing workflows.
### Generated PDF Features
The enhanced markdown converter produces PDFs with:
- **Professional typography** with proper font selection and spacing
- **Syntax-highlighted code blocks** using Pygments
- **Formatted tables** with borders and alternating row colors
- **Clickable table of contents** with navigation links
- **Responsive images** that scale appropriately
- **Custom styling** through CSS
- **Proper page breaks** and margins
- **Document metadata** and properties
## Troubleshooting
### Common Issues
#### WeasyPrint Installation Problems
```bash
# Ubuntu/Debian: Install system dependencies
sudo apt-get update
sudo apt-get install -y build-essential python3-dev libcairo2 \
libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
libffi-dev shared-mime-info
# Then reinstall WeasyPrint
pip install --force-reinstall weasyprint
Pandoc Not Found
# Check if Pandoc is installed
pandoc --version
# Install Pandoc (Ubuntu/Debian)
sudo apt-get install pandoc wkhtmltopdf
# Or download from: https://pandoc.org/installing.html
CSS Issues
- Check CSS syntax in custom_css
- Verify CSS file paths exist
- Test CSS with simple HTML first
- Use browser developer tools to debug styling
Image Problems
- Ensure images are accessible (correct paths)
- Check image file formats (PNG, JPEG, GIF supported)
- Verify image file permissions
- Consider image size and format optimization
Font Issues
# Use web-safe fonts
config = MarkdownConfig(
custom_css="""
body {
font-family: 'Arial', 'Helvetica', sans-serif;
}
"""
)
Debug Mode
Enable detailed logging for troubleshooting:
import logging
# Enable debug logging
logging.basicConfig(
level=logging.DEBUG,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Create converter with debug logging
converter = EnhancedMarkdownConverter()
result = converter.convert_file_to_pdf("test.md", "test.pdf")
Error Handling
def robust_conversion(input_path, output_path):
"""Convert with fallback backends"""
converter = EnhancedMarkdownConverter()
# Try backends in order of preference
backends = ["weasyprint", "pandoc", "auto"]
for backend in backends:
try:
success = converter.convert_file_to_pdf(
input_path=input_path,
output_path=output_path,
method=backend
)
if success:
print(f"✅ Conversion successful with {backend}")
return True
except Exception as e:
print(f"❌ {backend} failed: {str(e)}")
continue
print("❌ All backends failed")
return False
API Reference
EnhancedMarkdownConverter
class EnhancedMarkdownConverter:
def __init__(self, config: Optional[MarkdownConfig] = None):
"""Initialize converter with optional configuration"""
def convert_file_to_pdf(self, input_path: str, output_path: str, method: str = "auto") -> bool:
"""Convert markdown file to PDF"""
def convert_markdown_to_pdf(self, markdown_content: str, output_path: str, method: str = "auto") -> bool:
"""Convert markdown content to PDF"""
def get_backend_info(self) -> Dict[str, Any]:
"""Get information about available backends"""
def convert_with_weasyprint(self, markdown_content: str, output_path: str) -> bool:
"""Convert using WeasyPrint backend"""
def convert_with_pandoc(self, markdown_content: str, output_path: str) -> bool:
"""Convert using Pandoc backend"""
Best Practices
-
Choose the right backend for your use case:
- WeasyPrint for web-style documents and custom CSS
- Pandoc for academic papers and complex formatting
- Auto for general use and development
-
Optimize images before embedding:
- Use appropriate formats (JPEG for photos, PNG for graphics)
- Compress images to reduce file size
- Set reasonable maximum widths
-
Design responsive layouts:
- Use relative units (%, em) instead of absolute (px)
- Test with different page sizes
- Consider print-specific CSS
-
Test your styling:
- Start with default styling and incrementally customize
- Test with sample content before production use
- Validate CSS syntax
-
Handle errors gracefully:
- Implement fallback backends
- Provide meaningful error messages
- Log conversion attempts for debugging
-
Performance optimization:
- Cache converted content when possible
- Process large batches with appropriate worker counts
- Monitor memory usage with large documents
Conclusion
The enhanced markdown conversion feature provides professional-quality PDF generation with flexible styling options and multiple backend support. It seamlessly integrates with RAG-Anything's document processing pipeline while offering standalone functionality for markdown-to-PDF conversion needs.