Files
office_document_test/docs/enhanced_markdown.md
MinalMahalaShorthillsAI 60f05e04cf improvised version
2025-07-28 10:08:54 +05:30

16 KiB

Enhanced Markdown Conversion

This document describes the enhanced markdown conversion feature for RAG-Anything, which provides high-quality PDF generation from markdown files with multiple backend options and advanced styling.

Overview

The enhanced markdown conversion feature provides professional-quality PDF generation from markdown files. It supports multiple conversion backends, advanced styling options, syntax highlighting, and seamless integration with RAG-Anything's document processing pipeline.

Key Features

  • Multiple Backends: WeasyPrint, Pandoc, and automatic backend selection
  • Advanced Styling: Custom CSS, syntax highlighting, and professional layouts
  • Image Support: Embedded images with proper scaling and positioning
  • Table Support: Formatted tables with borders and professional styling
  • Code Highlighting: Syntax highlighting for code blocks using Pygments
  • Custom Templates: Support for custom CSS and document templates
  • Table of Contents: Automatic TOC generation with navigation links
  • Professional Typography: High-quality fonts and spacing

Installation

Required Dependencies

# Basic installation
pip install raganything[all]

# Required for enhanced markdown conversion
pip install markdown weasyprint pygments

Optional Dependencies

# For Pandoc backend (system installation required)
# Ubuntu/Debian:
sudo apt-get install pandoc wkhtmltopdf

# macOS:
brew install pandoc wkhtmltopdf

# Or using conda:
conda install -c conda-forge pandoc wkhtmltopdf

Backend-Specific Installation

# Install WeasyPrint with system dependencies
pip install weasyprint

# Ubuntu/Debian system dependencies:
sudo apt-get install -y build-essential python3-dev python3-pip \
    python3-setuptools python3-wheel python3-cffi libcairo2 \
    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
    libffi-dev shared-mime-info

Pandoc

Usage

Basic Conversion

from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig

# Create converter with default settings
converter = EnhancedMarkdownConverter()

# Convert markdown file to PDF
success = converter.convert_file_to_pdf(
    input_path="document.md",
    output_path="document.pdf",
    method="auto"  # Automatically select best available backend
)

if success:
    print("✅ Conversion successful!")
else:
    print("❌ Conversion failed")

Advanced Configuration

# Create custom configuration
config = MarkdownConfig(
    page_size="A4",           # A4, Letter, Legal, etc.
    margin="1in",             # CSS-style margins
    font_size="12pt",         # Base font size
    line_height="1.5",        # Line spacing
    include_toc=True,         # Generate table of contents
    syntax_highlighting=True, # Enable code syntax highlighting
    
    # Custom CSS styling
    custom_css="""
    body { 
        font-family: 'Georgia', serif; 
        color: #333;
    }
    h1 { 
        color: #2c3e50; 
        border-bottom: 2px solid #3498db; 
        padding-bottom: 0.3em;
    }
    code { 
        background-color: #f8f9fa; 
        padding: 2px 4px; 
        border-radius: 3px;
    }
    pre {
        background-color: #f8f9fa;
        border-left: 4px solid #3498db;
        padding: 15px;
        border-radius: 5px;
    }
    table {
        border-collapse: collapse;
        width: 100%;
        margin: 1em 0;
    }
    th, td {
        border: 1px solid #ddd;
        padding: 8px 12px;
        text-align: left;
    }
    th {
        background-color: #f2f2f2;
        font-weight: bold;
    }
    """
)

converter = EnhancedMarkdownConverter(config)

Backend Selection

# Check available backends
converter = EnhancedMarkdownConverter()
backend_info = converter.get_backend_info()

print("Available backends:")
for backend, available in backend_info["available_backends"].items():
    status = "✅" if available else "❌"
    print(f"  {status} {backend}")

print(f"Recommended backend: {backend_info['recommended_backend']}")

# Use specific backend
converter.convert_file_to_pdf(
    input_path="document.md",
    output_path="document.pdf",
    method="weasyprint"  # or "pandoc", "pandoc_system", "auto"
)

Content Conversion

# Convert markdown content directly (not from file)
markdown_content = """
# Sample Document

## Introduction
This is a **bold** statement with *italic* text.

## Code Example
```python
def hello_world():
    print("Hello, World!")
    return "Success"

Table

Feature Status Notes
PDF Generation Working
Syntax Highlighting Pygments
Custom CSS Full support
"""

success = converter.convert_markdown_to_pdf( markdown_content=markdown_content, output_path="sample.pdf", method="auto" )


### Command Line Interface

```bash
# Basic conversion
python -m raganything.enhanced_markdown document.md --output document.pdf

# With specific backend
python -m raganything.enhanced_markdown document.md --method weasyprint

# With custom CSS file
python -m raganything.enhanced_markdown document.md --css custom_style.css

# Show backend information
python -m raganything.enhanced_markdown --info

# Help
python -m raganything.enhanced_markdown --help

Backend Comparison

Backend Pros Cons Best For Quality
WeasyPrint • Excellent CSS support
• Fast rendering
• Great web-style layouts
• Python-based
• Limited LaTeX features
• Requires system deps
• Web-style documents
• Custom styling
• Fast conversion
Pandoc • Extensive features
• LaTeX-quality output
• Academic formatting
• Many input/output formats
• Slower conversion
• System installation
• Complex setup
• Academic papers
• Complex documents
• Publication quality
Auto • Automatic selection
• Fallback support
• User-friendly
• May not use optimal backend • General use
• Quick setup
• Development

Configuration Options

MarkdownConfig Parameters

@dataclass
class MarkdownConfig:
    # Page layout
    page_size: str = "A4"              # A4, Letter, Legal, A3, etc.
    margin: str = "1in"                # CSS margin format
    font_size: str = "12pt"            # Base font size
    line_height: str = "1.5"           # Line spacing multiplier
    
    # Content options
    include_toc: bool = True           # Generate table of contents
    syntax_highlighting: bool = True   # Enable code highlighting
    image_max_width: str = "100%"      # Maximum image width
    table_style: str = "..."           # Default table CSS
    
    # Styling
    css_file: Optional[str] = None     # External CSS file path
    custom_css: Optional[str] = None   # Inline CSS content
    template_file: Optional[str] = None # Custom HTML template
    
    # Output options
    output_format: str = "pdf"         # Currently only PDF supported
    output_dir: Optional[str] = None   # Output directory
    
    # Metadata
    metadata: Optional[Dict[str, str]] = None  # Document metadata

Supported Markdown Features

Basic Formatting

  • Headers: # ## ### #### ##### ######
  • Emphasis: *italic*, **bold**, ***bold italic***
  • Links: [text](url), [text][ref]
  • Images: ![alt](url), ![alt][ref]
  • Lists: Ordered and unordered, nested
  • Blockquotes: > quote
  • Line breaks: Double space or \n\n

Advanced Features

  • Tables: GitHub-style tables with alignment
  • Code blocks: Fenced code blocks with language specification
  • Inline code: backtick code
  • Horizontal rules: --- or ***
  • Footnotes: [^1] references
  • Definition lists: Term and definition pairs
  • Attributes: {#id .class key=value}

Code Highlighting

```python
def example_function():
    """This will be syntax highlighted"""
    return "Hello, World!"
function exampleFunction() {
    // This will also be highlighted
    return "Hello, World!";
}

## Integration with RAG-Anything

The enhanced markdown conversion integrates seamlessly with RAG-Anything:

```python
from raganything import RAGAnything

# Initialize RAG-Anything
rag = RAGAnything()

# Process markdown files - enhanced conversion is used automatically
await rag.process_document_complete("document.md")

# Batch processing with enhanced markdown conversion
result = rag.process_documents_batch(
    file_paths=["doc1.md", "doc2.md", "doc3.md"],
    output_dir="./output"
)

# The .md files will be converted to PDF using enhanced conversion
# before being processed by the RAG system

Performance Considerations

Conversion Speed

  • WeasyPrint: ~1-3 seconds for typical documents
  • Pandoc: ~3-10 seconds for typical documents
  • Large documents: Time scales roughly linearly with content

Memory Usage

  • WeasyPrint: ~50-100MB per conversion
  • Pandoc: ~100-200MB per conversion
  • Images: Large images increase memory usage significantly

Optimization Tips

  1. Resize large images before embedding
  2. Use compressed images (JPEG for photos, PNG for graphics)
  3. Limit concurrent conversions to avoid memory issues
  4. Cache converted content when processing multiple times

Examples

Sample Markdown Document

# Technical Documentation

## Table of Contents
[TOC]

## Overview
This document provides comprehensive technical specifications.

## Architecture

### System Components
1. **Parser Engine**: Handles document processing
2. **Storage Layer**: Manages data persistence  
3. **Query Interface**: Provides search capabilities

### Code Implementation
```python
from raganything import RAGAnything

# Initialize system
rag = RAGAnything(config={
    "working_dir": "./storage",
    "enable_image_processing": True
})

# Process document
await rag.process_document_complete("document.pdf")

Performance Metrics

Component Throughput Latency Memory
Parser 100 docs/hour 36s avg 2.5 GB
Storage 1000 ops/sec 1ms avg 512 MB
Query 50 queries/sec 20ms avg 1 GB

Integration Notes

Important

: Always validate input before processing.

Conclusion

The enhanced system provides excellent performance for document processing workflows.


### Generated PDF Features

The enhanced markdown converter produces PDFs with:

- **Professional typography** with proper font selection and spacing
- **Syntax-highlighted code blocks** using Pygments
- **Formatted tables** with borders and alternating row colors
- **Clickable table of contents** with navigation links
- **Responsive images** that scale appropriately
- **Custom styling** through CSS
- **Proper page breaks** and margins
- **Document metadata** and properties

## Troubleshooting

### Common Issues

#### WeasyPrint Installation Problems
```bash
# Ubuntu/Debian: Install system dependencies
sudo apt-get update
sudo apt-get install -y build-essential python3-dev libcairo2 \
    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
    libffi-dev shared-mime-info

# Then reinstall WeasyPrint
pip install --force-reinstall weasyprint

Pandoc Not Found

# Check if Pandoc is installed
pandoc --version

# Install Pandoc (Ubuntu/Debian)
sudo apt-get install pandoc wkhtmltopdf

# Or download from: https://pandoc.org/installing.html

CSS Issues

  • Check CSS syntax in custom_css
  • Verify CSS file paths exist
  • Test CSS with simple HTML first
  • Use browser developer tools to debug styling

Image Problems

  • Ensure images are accessible (correct paths)
  • Check image file formats (PNG, JPEG, GIF supported)
  • Verify image file permissions
  • Consider image size and format optimization

Font Issues

# Use web-safe fonts
config = MarkdownConfig(
    custom_css="""
    body { 
        font-family: 'Arial', 'Helvetica', sans-serif; 
    }
    """
)

Debug Mode

Enable detailed logging for troubleshooting:

import logging

# Enable debug logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Create converter with debug logging
converter = EnhancedMarkdownConverter()
result = converter.convert_file_to_pdf("test.md", "test.pdf")

Error Handling

def robust_conversion(input_path, output_path):
    """Convert with fallback backends"""
    converter = EnhancedMarkdownConverter()
    
    # Try backends in order of preference
    backends = ["weasyprint", "pandoc", "auto"]
    
    for backend in backends:
        try:
            success = converter.convert_file_to_pdf(
                input_path=input_path,
                output_path=output_path,
                method=backend
            )
            if success:
                print(f"✅ Conversion successful with {backend}")
                return True
        except Exception as e:
            print(f"❌ {backend} failed: {str(e)}")
            continue
    
    print("❌ All backends failed")
    return False

API Reference

EnhancedMarkdownConverter

class EnhancedMarkdownConverter:
    def __init__(self, config: Optional[MarkdownConfig] = None):
        """Initialize converter with optional configuration"""
    
    def convert_file_to_pdf(self, input_path: str, output_path: str, method: str = "auto") -> bool:
        """Convert markdown file to PDF"""
    
    def convert_markdown_to_pdf(self, markdown_content: str, output_path: str, method: str = "auto") -> bool:
        """Convert markdown content to PDF"""
    
    def get_backend_info(self) -> Dict[str, Any]:
        """Get information about available backends"""
    
    def convert_with_weasyprint(self, markdown_content: str, output_path: str) -> bool:
        """Convert using WeasyPrint backend"""
    
    def convert_with_pandoc(self, markdown_content: str, output_path: str) -> bool:
        """Convert using Pandoc backend"""

Best Practices

  1. Choose the right backend for your use case:

    • WeasyPrint for web-style documents and custom CSS
    • Pandoc for academic papers and complex formatting
    • Auto for general use and development
  2. Optimize images before embedding:

    • Use appropriate formats (JPEG for photos, PNG for graphics)
    • Compress images to reduce file size
    • Set reasonable maximum widths
  3. Design responsive layouts:

    • Use relative units (%, em) instead of absolute (px)
    • Test with different page sizes
    • Consider print-specific CSS
  4. Test your styling:

    • Start with default styling and incrementally customize
    • Test with sample content before production use
    • Validate CSS syntax
  5. Handle errors gracefully:

    • Implement fallback backends
    • Provide meaningful error messages
    • Log conversion attempts for debugging
  6. Performance optimization:

    • Cache converted content when possible
    • Process large batches with appropriate worker counts
    • Monitor memory usage with large documents

Conclusion

The enhanced markdown conversion feature provides professional-quality PDF generation with flexible styling options and multiple backend support. It seamlessly integrates with RAG-Anything's document processing pipeline while offering standalone functionality for markdown-to-PDF conversion needs.