mirror of https://github.com/HKUDS/RAG-Anything.git synced 2025-08-20 19:01:34 +03:00

MinalMahalaShorthillsAI 60f05e04cf improvised version

2025-07-28 10:08:54 +05:30

Enhanced Markdown Conversion

This document describes the enhanced markdown conversion feature for RAG-Anything, which provides high-quality PDF generation from markdown files with multiple backend options and advanced styling.

Overview

The converter supports multiple backends, custom CSS styling, and syntax highlighting. It also plugs into RAG-Anything's document processing pipeline, where markdown files are converted to PDF before they are processed further.

Key Features

Multiple Backends: WeasyPrint, Pandoc, and automatic backend selection
Advanced Styling: Custom CSS, syntax highlighting, and professional layouts
Image Support: Embedded images with proper scaling and positioning
Table Support: Formatted tables with borders and professional styling
Code Highlighting: Syntax highlighting for code blocks using Pygments
Custom Templates: Support for custom CSS and document templates
Table of Contents: Automatic TOC generation with navigation links
Professional Typography: High-quality fonts and spacing

Installation

Required Dependencies

# Basic installation
pip install raganything[all]

# Required for enhanced markdown conversion
pip install markdown weasyprint pygments

Optional Dependencies

# For Pandoc backend (system installation required)
# Ubuntu/Debian:
sudo apt-get install pandoc wkhtmltopdf

# macOS:
brew install pandoc wkhtmltopdf

# Or using conda:
conda install -c conda-forge pandoc wkhtmltopdf

Backend-Specific Installation

WeasyPrint (Recommended)

# Install WeasyPrint with system dependencies
pip install weasyprint

# Ubuntu/Debian system dependencies:
sudo apt-get install -y build-essential python3-dev python3-pip \
    python3-setuptools python3-wheel python3-cffi libcairo2 \
    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
    libffi-dev shared-mime-info

Pandoc

Download from: https://pandoc.org/installing.html
Requires system-wide installation
Used for complex document structures and LaTeX-quality output
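
Because Pandoc and wkhtmltopdf are system-level tools, it can help to confirm they are on the PATH before selecting the Pandoc backend. The check below is a minimal sketch that uses only the standard library; the helper name `find_tools` is hypothetical and is not part of RAG-Anything.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report the tools mentioned in the installation steps above.
    for tool, path in find_tools(["pandoc", "wkhtmltopdf"]).items():
        print(f"{tool}: {path or 'not found'}")
```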

Usage

Basic Conversion

from raganything.enhanced_markdown import EnhancedMarkdownConverter, MarkdownConfig

# Create converter with default settings
converter = EnhancedMarkdownConverter()

# Convert markdown file to PDF
success = converter.convert_file_to_pdf(
    input_path="document.md",
    output_path="document.pdf",
    method="auto"  # Automatically select best available backend
)

if success:
    print("✅ Conversion successful!")
else:
    print("❌ Conversion failed")

Advanced Configuration

# Create custom configuration
config = MarkdownConfig(
    page_size="A4",           # A4, Letter, Legal, etc.
    margin="1in",             # CSS-style margins
    font_size="12pt",         # Base font size
    line_height="1.5",        # Line spacing
    include_toc=True,         # Generate table of contents
    syntax_highlighting=True, # Enable code syntax highlighting
    
    # Custom CSS styling
    custom_css="""
    body { 
        font-family: 'Georgia', serif; 
        color: #333;
    }
    h1 { 
        color: #2c3e50; 
        border-bottom: 2px solid #3498db; 
        padding-bottom: 0.3em;
    }
    code { 
        background-color: #f8f9fa; 
        padding: 2px 4px; 
        border-radius: 3px;
    }
    pre {
        background-color: #f8f9fa;
        border-left: 4px solid #3498db;
        padding: 15px;
        border-radius: 5px;
    }
    table {
        border-collapse: collapse;
        width: 100%;
        margin: 1em 0;
    }
    th, td {
        border: 1px solid #ddd;
        padding: 8px 12px;
        text-align: left;
    }
    th {
        background-color: #f2f2f2;
        font-weight: bold;
    }
    """
)

converter = EnhancedMarkdownConverter(config)

Backend Selection

# Check available backends
converter = EnhancedMarkdownConverter()
backend_info = converter.get_backend_info()

print("Available backends:")
for backend, available in backend_info["available_backends"].items():
    status = "✅" if available else "❌"
    print(f"  {status} {backend}")

print(f"Recommended backend: {backend_info['recommended_backend']}")

# Use specific backend
converter.convert_file_to_pdf(
    input_path="document.md",
    output_path="document.pdf",
    method="weasyprint"  # or "pandoc", "pandoc_system", "auto"
)
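
The `auto` method is described only as picking the best available backend. One way such a choice could work, given a dictionary shaped like `backend_info["available_backends"]`, is sketched here. The function `pick_backend` and its preference order are assumptions for illustration, not the library's actual selection logic.

```python
from typing import Dict, Optional, Sequence


def pick_backend(
    available: Dict[str, bool],
    preference: Sequence[str] = ("weasyprint", "pandoc"),
) -> Optional[str]:
    """Return the first preferred backend that is marked available, else None."""
    for name in preference:
        if available.get(name, False):
            return name
    return None
```

With this sketch, `pick_backend({"weasyprint": False, "pandoc": True})` returns `"pandoc"`, and an empty dictionary yields `None`, which a caller could treat as "no backend installed".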

Content Conversion

# Convert markdown content directly (not from file)
markdown_content = """
# Sample Document

## Introduction
This is a **bold** statement with *italic* text.

## Code Example
```python
def hello_world():
    print("Hello, World!")
    return "Success"
```

## Table

| Feature | Status | Notes |
|---------|--------|-------|
| PDF Generation | ✅ | Working |
| Syntax Highlighting | ✅ | Pygments |
| Custom CSS | ✅ | Full support |
"""

success = converter.convert_markdown_to_pdf(
    markdown_content=markdown_content,
    output_path="sample.pdf",
    method="auto"
)


Command Line Interface

```bash
# Basic conversion
python -m raganything.enhanced_markdown document.md --output document.pdf

# With specific backend
python -m raganything.enhanced_markdown document.md --method weasyprint

# With custom CSS file
python -m raganything.enhanced_markdown document.md --css custom_style.css

# Show backend information
python -m raganything.enhanced_markdown --info

# Help
python -m raganything.enhanced_markdown --help
```

Backend Comparison

| Backend | Pros | Cons | Best For | Quality |
|---------|------|------|----------|---------|
| WeasyPrint | • Excellent CSS support • Fast rendering • Great web-style layouts • Python-based | • Limited LaTeX features • Requires system deps | • Web-style documents • Custom styling • Fast conversion | ⭐⭐⭐⭐ |
| Pandoc | • Extensive features • LaTeX-quality output • Academic formatting • Many input/output formats | • Slower conversion • System installation • Complex setup | • Academic papers • Complex documents • Publication quality | ⭐⭐⭐⭐⭐ |
| Auto | • Automatic selection • Fallback support • User-friendly | • May not use optimal backend | • General use • Quick setup • Development | ⭐⭐⭐⭐ |

Configuration Options

MarkdownConfig Parameters

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class MarkdownConfig:
    # Page layout
    page_size: str = "A4"              # A4, Letter, Legal, A3, etc.
    margin: str = "1in"                # CSS margin format
    font_size: str = "12pt"            # Base font size
    line_height: str = "1.5"           # Line spacing multiplier
    
    # Content options
    include_toc: bool = True           # Generate table of contents
    syntax_highlighting: bool = True   # Enable code highlighting
    image_max_width: str = "100%"      # Maximum image width
    table_style: str = "..."           # Default table CSS
    
    # Styling
    css_file: Optional[str] = None     # External CSS file path
    custom_css: Optional[str] = None   # Inline CSS content
    template_file: Optional[str] = None # Custom HTML template
    
    # Output options
    output_format: str = "pdf"         # Currently only PDF supported
    output_dir: Optional[str] = None   # Output directory
    
    # Metadata
    metadata: Optional[Dict[str, str]] = None  # Document metadata
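
The layout fields are CSS-style strings, so they map naturally onto a print stylesheet. The sketch below shows one plausible translation into an `@page` rule and a `body` rule. `LayoutOptions` and `layout_css` are hypothetical names that mirror only the four layout fields above; they are not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class LayoutOptions:
    """Hypothetical subset of the page layout fields listed above."""
    page_size: str = "A4"
    margin: str = "1in"
    font_size: str = "12pt"
    line_height: str = "1.5"


def layout_css(opts: LayoutOptions) -> str:
    """Render the layout options as a print stylesheet fragment."""
    return (
        f"@page {{ size: {opts.page_size}; margin: {opts.margin}; }}\n"
        f"body {{ font-size: {opts.font_size}; line-height: {opts.line_height}; }}\n"
    )
```

A string produced this way could be passed as `custom_css`, which keeps page geometry and typography in one place when several documents share a layout.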

Supported Markdown Features

Basic Formatting

Headers: # ## ### #### ##### ######
Emphasis: *italic*, **bold**, ***bold italic***
Links: [text](url), [text][ref]
Images: ![alt](url), ![alt][ref]
Lists: Ordered and unordered, nested
Blockquotes: > quote
Line breaks: Two trailing spaces at the end of a line; a blank line starts a new paragraph

Advanced Features

Tables: GitHub-style tables with alignment
Code blocks: Fenced code blocks with language specification
Inline code: `code`
Horizontal rules: --- or ***
Footnotes: [^1] references
Definition lists: Term and definition pairs
Attributes: {#id .class key=value}

Code Highlighting

```python
def example_function():
    """This will be syntax highlighted"""
    return "Hello, World!"
```

```javascript
function exampleFunction() {
    // This will also be highlighted
    return "Hello, World!";
}
```


Integration with RAG-Anything

The enhanced markdown conversion integrates seamlessly with RAG-Anything:

```python
import asyncio

from raganything import RAGAnything

# Initialize RAG-Anything
rag = RAGAnything()


async def main():
    # Process markdown files - enhanced conversion is used automatically
    await rag.process_document_complete("document.md")


# process_document_complete is awaited, so it must run inside an event loop
asyncio.run(main())

# Batch processing with enhanced markdown conversion
result = rag.process_documents_batch(
    file_paths=["doc1.md", "doc2.md", "doc3.md"],
    output_dir="./output"
)

# The .md files will be converted to PDF using enhanced conversion
# before being processed by the RAG system
```

Performance Considerations

Conversion Speed

WeasyPrint: ~1-3 seconds for typical documents
Pandoc: ~3-10 seconds for typical documents
Large documents: Time scales roughly linearly with content
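
The figures above are rough estimates, so measuring on your own documents is more reliable. A small timing wrapper such as the following is enough for that; `timed` is a hypothetical helper, not part of the package.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `timed(converter.convert_file_to_pdf, "document.md", "document.pdf")` would return the success flag together with the elapsed seconds, assuming the `converter` object from the usage examples.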

Memory Usage

WeasyPrint: ~50-100MB per conversion
Pandoc: ~100-200MB per conversion
Images: Large images increase memory usage significantly

Optimization Tips

Resize large images before embedding
Use compressed images (JPEG for photos, PNG for graphics)
Limit concurrent conversions to avoid memory issues
Cache converted content when processing multiple times
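
The last two tips, bounding concurrency and caching repeated content, can be combined in a few lines. This is a sketch under stated assumptions: `convert_many` is a hypothetical helper, and `convert` stands in for whatever callable performs the real conversion (for instance, a wrapper around the converter that returns the PDF bytes).

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def convert_many(
    documents: List[str],
    convert: Callable[[str], bytes],
    max_workers: int = 2,
) -> List[bytes]:
    """Convert documents with bounded concurrency, converting identical content once."""

    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Deduplicate by content hash so each distinct document is converted once.
    unique: Dict[str, str] = {key(doc): doc for doc in documents}
    cache: Dict[str, bytes] = {}
    # max_workers caps how many conversions run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for digest, output in zip(unique, pool.map(convert, unique.values())):
            cache[digest] = output
    # Return results in the original order, including duplicates.
    return [cache[key(doc)] for doc in documents]
```

Keeping `max_workers` small trades throughput for a lower peak memory footprint, which matters most when documents embed large images.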

Examples

Sample Markdown Document

# Technical Documentation

## Table of Contents
[TOC]

## Overview
This document provides comprehensive technical specifications.

## Architecture

### System Components
1. **Parser Engine**: Handles document processing
2. **Storage Layer**: Manages data persistence  
3. **Query Interface**: Provides search capabilities

### Code Implementation
```python
from raganything import RAGAnything

# Initialize system
rag = RAGAnything(config={
    "working_dir": "./storage",
    "enable_image_processing": True
})

# Process document
await rag.process_document_complete("document.pdf")
```

## Performance Metrics

| Component | Throughput | Latency | Memory |
|-----------|------------|---------|--------|
| Parser | 100 docs/hour | 36s avg | 2.5 GB |
| Storage | 1000 ops/sec | 1ms avg | 512 MB |
| Query | 50 queries/sec | 20ms avg | 1 GB |

## Integration Notes

> **Important**: Always validate input before processing.

## Conclusion

The enhanced system provides excellent performance for document processing workflows.


Generated PDF Features

The enhanced markdown converter produces PDFs with:

- **Professional typography** with proper font selection and spacing
- **Syntax-highlighted code blocks** using Pygments
- **Formatted tables** with borders and alternating row colors
- **Clickable table of contents** with navigation links
- **Responsive images** that scale appropriately
- **Custom styling** through CSS
- **Proper page breaks** and margins
- **Document metadata** and properties
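
To make the clickable table of contents concrete, the sketch below shows how heading entries and link anchors could be derived from markdown source. It is an illustration only: `slugify` and `extract_toc` are hypothetical helpers, and the converter's own TOC generation may work differently.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # Marker that opens or closes a fenced code block.


def slugify(title: str) -> str:
    """Lowercase the title, drop punctuation, and join words with hyphens."""
    cleaned = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"[\s_-]+", "-", cleaned).strip("-")


def extract_toc(markdown_text: str) -> List[Tuple[int, str, str]]:
    """Return (level, title, anchor) for each '#' heading outside fenced code."""
    entries = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = re.match(r"^(#{1,6})\s+(.+?)\s*$", line)
        if match and not in_fence:
            title = match.group(2)
            entries.append((len(match.group(1)), title, slugify(title)))
    return entries
```

Tracking fences matters because a `#` comment inside a code block would otherwise be mistaken for a heading.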

Troubleshooting

Common Issues

WeasyPrint Installation Problems

```bash
# Ubuntu/Debian: Install system dependencies
sudo apt-get update
sudo apt-get install -y build-essential python3-dev libcairo2 \
    libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 \
    libffi-dev shared-mime-info

# Then reinstall WeasyPrint
pip install --force-reinstall weasyprint
```

Pandoc Not Found

# Check if Pandoc is installed
pandoc --version

# Install Pandoc (Ubuntu/Debian)
sudo apt-get install pandoc wkhtmltopdf

# Or download from: https://pandoc.org/installing.html

CSS Issues

Check CSS syntax in custom_css
Verify CSS file paths exist
Test CSS with simple HTML first
Use browser developer tools to debug styling
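
The first two checks can be automated before a conversion is attempted. The function below is a rough pre-flight sketch: `check_css` is a hypothetical helper, and its brace counting is deliberately crude (it ignores braces inside strings or comments), so treat a clean result as a hint rather than validation.

```python
from pathlib import Path
from typing import List, Optional


def check_css(custom_css: Optional[str] = None, css_file: Optional[str] = None) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if css_file is not None and not Path(css_file).is_file():
        problems.append(f"CSS file not found: {css_file}")
    if custom_css is not None:
        depth = 0
        for char in custom_css:
            depth += {"{": 1, "}": -1}.get(char, 0)
            if depth < 0:
                break  # A closing brace appeared before its opening brace.
        if depth != 0:
            problems.append("Unbalanced braces in custom_css")
    return problems
```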

Image Problems

Ensure images are accessible (correct paths)
Check image file formats (PNG, JPEG, GIF supported)
Verify image file permissions
Consider image size and format optimization
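
Path and format problems can be caught by scanning the markdown for image references before converting. The following is a minimal sketch; `check_images` is a hypothetical helper, it handles only inline `![alt](path)` references, and it skips anything that looks like a URL.

```python
import re
from pathlib import Path
from typing import List

IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")
SUPPORTED = {".png", ".jpg", ".jpeg", ".gif"}  # Formats listed above.


def check_images(markdown_text: str, base_dir: str = ".") -> List[str]:
    """Report local image references that are missing or use an unlisted format."""
    problems = []
    for target in IMAGE_PATTERN.findall(markdown_text):
        if re.match(r"^[a-z][a-z0-9+.-]*:", target, re.IGNORECASE):
            continue  # Remote or data URL: not checked here.
        path = Path(base_dir) / target
        if path.suffix.lower() not in SUPPORTED:
            problems.append(f"Unsupported format: {target}")
        elif not path.is_file():
            problems.append(f"Missing file: {target}")
    return problems
```

Pass the directory of the markdown file as `base_dir` so that relative image paths resolve the same way they would during conversion.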

Font Issues

# Use web-safe fonts
config = MarkdownConfig(
    custom_css="""
    body { 
        font-family: 'Arial', 'Helvetica', sans-serif; 
    }
    """
)

Debug Mode

Enable detailed logging for troubleshooting:

import logging

from raganything.enhanced_markdown import EnhancedMarkdownConverter

# Enable debug logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Create converter with debug logging
converter = EnhancedMarkdownConverter()
result = converter.convert_file_to_pdf("test.md", "test.pdf")

Error Handling

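from raganything.enhanced_markdown import EnhancedMarkdownConverter


def robust_conversion(input_path, output_path):
    """Try each backend in order of preference; return True on the first success."""
    converter = EnhancedMarkdownConverter()

    # Try backends in order of preference
    backends = ["weasyprint", "pandoc", "auto"]

    for backend in backends:
        try:
            success = converter.convert_file_to_pdf(
                input_path=input_path,
                output_path=output_path,
                method=backend
            )
        except Exception as e:
            # A backend may raise (for example, on a missing dependency); try the next one
            print(f"❌ {backend} failed: {e}")
            continue
        if success:
            print(f"✅ Conversion successful with {backend}")
            return True
        # The backend returned False without raising
        print(f"❌ {backend} reported failure")

    print("❌ All backends failed")
    return False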

API Reference

EnhancedMarkdownConverter

class EnhancedMarkdownConverter:
    def __init__(self, config: Optional[MarkdownConfig] = None):
        """Initialize converter with optional configuration"""
    
    def convert_file_to_pdf(self, input_path: str, output_path: str, method: str = "auto") -> bool:
        """Convert markdown file to PDF"""
    
    def convert_markdown_to_pdf(self, markdown_content: str, output_path: str, method: str = "auto") -> bool:
        """Convert markdown content to PDF"""
    
    def get_backend_info(self) -> Dict[str, Any]:
        """Get information about available backends"""
    
    def convert_with_weasyprint(self, markdown_content: str, output_path: str) -> bool:
        """Convert using WeasyPrint backend"""
    
    def convert_with_pandoc(self, markdown_content: str, output_path: str) -> bool:
        """Convert using Pandoc backend"""

Best Practices

Choose the right backend for your use case:
- WeasyPrint for web-style documents and custom CSS
- Pandoc for academic papers and complex formatting
- Auto for general use and development
Optimize images before embedding:
- Use appropriate formats (JPEG for photos, PNG for graphics)
- Compress images to reduce file size
- Set reasonable maximum widths
Design responsive layouts:
- Use relative units (%, em) instead of absolute (px)
- Test with different page sizes
- Consider print-specific CSS
Test your styling:
- Start with default styling and incrementally customize
- Test with sample content before production use
- Validate CSS syntax
Handle errors gracefully:
- Implement fallback backends
- Provide meaningful error messages
- Log conversion attempts for debugging
Performance optimization:
- Cache converted content when possible
- Process large batches with appropriate worker counts
- Monitor memory usage with large documents

Conclusion

The enhanced markdown conversion feature provides professional-quality PDF generation with flexible styling options and multiple backend support. It seamlessly integrates with RAG-Anything's document processing pipeline while offering standalone functionality for markdown-to-PDF conversion needs.
