Mirror of https://github.com/unclecode/crawl4ai.git (synced 2024-12-22 15:52:24 +03:00)
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in the `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.

This commit introduces significant enhancements to improve the efficiency, flexibility, and reliability of the crawler strategy.
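For readers who want to see how these additions might look from user code, a minimal sketch follows. The `text_mode` and `light_mode` keyword names are assumptions based on the feature names in this commit (they may differ from the actual parameters), while the `OpenAIModelFee` model and the `LLMExtractionStrategy` call mirror the `quickstart_async.py` example referenced above, including the Pydantic v2 `model_json_schema()` change.

```python
import asyncio
import os

from pydantic import BaseModel, Field

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")


async def main():
    # NOTE: `text_mode` and `light_mode` are assumed keyword names for the
    # "text-only" and "light" modes described in the commit message; check
    # the 0.4.1 release notes for the exact parameter names.
    async with AsyncWebCrawler(text_mode=True, light_mode=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",  # example URL from the quickstart
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv("OPENAI_API_KEY"),
                # Pydantic v2: model_json_schema() replaces the deprecated .schema()
                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
                instruction="Extract all model names with their input and output fees.",
            ),
        )
        print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(main())
```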
site_name: Crawl4AI Documentation
site_description: 🔥🕷️ Crawl4AI, Open-source LLM Friendly Web Crawler & Scraper
site_url: https://docs.crawl4ai.com
repo_url: https://github.com/unclecode/crawl4ai
repo_name: unclecode/crawl4ai
docs_dir: docs/md_v2

nav:
  - Home: 'index.md'
  - 'Installation': 'basic/installation.md'
  - 'Docker Deployment': 'basic/docker-deployment.md'
  - 'Quick Start': 'basic/quickstart.md'
  - Changelog & Blog:
    - 'Blog Home': 'blog/index.md'
    - 'Latest (0.4.1)': 'blog/releases/0.4.1.md'
    - 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'

  - Basic:
    - 'Simple Crawling': 'basic/simple-crawling.md'
    - 'Output Formats': 'basic/output-formats.md'
    - 'Browser Configuration': 'basic/browser-config.md'
    - 'Page Interaction': 'basic/page-interaction.md'
    - 'Content Selection': 'basic/content-selection.md'
    - 'Cache Modes': 'basic/cache-modes.md'

  - Advanced:
    - 'Content Processing': 'advanced/content-processing.md'
    - 'Magic Mode': 'advanced/magic-mode.md'
    - 'Hooks & Auth': 'advanced/hooks-auth.md'
    - 'Proxy & Security': 'advanced/proxy-security.md'
    - 'Session Management': 'advanced/session-management.md'
    - 'Session Management (Advanced)': 'advanced/session-management-advanced.md'

  - Extraction:
    - 'Overview': 'extraction/overview.md'
    - 'LLM Strategy': 'extraction/llm.md'
    - 'Json-CSS Extractor Basic': 'extraction/css.md'
    - 'Json-CSS Extractor Advanced': 'extraction/css-advanced.md'
    - 'Cosine Strategy': 'extraction/cosine.md'
    - 'Chunking': 'extraction/chunking.md'

  - API Reference:
    - 'Parameters Table': 'api/parameters.md'
    - 'AsyncWebCrawler': 'api/async-webcrawler.md'
    - 'AsyncWebCrawler.arun()': 'api/arun.md'
    - 'CrawlResult': 'api/crawl-result.md'
    - 'Strategies': 'api/strategies.md'

  - Tutorial:
    - '1. Getting Started': 'tutorial/episode_01_Introduction_to_Crawl4AI_and_Basic_Installation.md'
    - '2. Advanced Features': 'tutorial/episode_02_Overview_of_Advanced_Features.md'
    - '3. Browser Setup': 'tutorial/episode_03_Browser_Configurations_&_Headless_Crawling.md'
    - '4. Proxy Settings': 'tutorial/episode_04_Advanced_Proxy_and_Security_Settings.md'
    - '5. Dynamic Content': 'tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md'
    - '6. Magic Mode': 'tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md'
    - '7. Content Cleaning': 'tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md'
    - '8. Media Handling': 'tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md'
    - '9. Link Analysis': 'tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md'
    - '10. User Simulation': 'tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md'
    - '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md'
    - '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies_LLM.md'
    - '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies_Cosine.md'
    - '12. Session Crawling': 'tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md'
    - '13. Text Chunking': 'tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md'
    - '14. Custom Workflows': 'tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md'

theme:
  name: terminal
  palette: dark

markdown_extensions:
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences
  - admonition
  - pymdownx.details
  - attr_list
  - tables

extra_css:
  - assets/styles.css
  - assets/highlight.css
  - assets/dmvendor.css

extra_javascript:
  - assets/highlight.min.js
  - assets/highlight_init.js