- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
- Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`.
- Updated version number to 0.4.21 in `__version__.py`.
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
- Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
- Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
- Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
- Added support for exporting pages as PDFs.
- Enhanced screenshot functionality for long pages.
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.
- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.
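The storage state referred to above can be pictured as a JSON document. The following is a minimal sketch, assuming the layout Playwright uses for its storage-state files (a `cookies` list plus per-origin `localStorage` entries). The helper name `build_storage_state`, the cookie values, and the file name are hypothetical and not taken from the project.

```python
import json
import tempfile
from pathlib import Path


def build_storage_state(cookies, local_storage_by_origin):
    """Assemble a dict in the (assumed) Playwright storage-state layout."""
    return {
        "cookies": list(cookies),
        "origins": [
            {
                "origin": origin,
                "localStorage": [
                    {"name": key, "value": value} for key, value in items.items()
                ],
            }
            for origin, items in local_storage_by_origin.items()
        ],
    }


# Hypothetical session data, for illustration only.
state = build_storage_state(
    cookies=[
        {
            "name": "session",
            "value": "abc123",
            "domain": "example.com",
            "path": "/",
            "expires": -1,  # -1 marks a session cookie
            "httpOnly": True,
            "secure": True,
            "sameSite": "Lax",
        }
    ],
    local_storage_by_origin={"https://example.com": {"theme": "dark"}},
)

# Write the state to disk and read it back to confirm it round-trips as JSON.
state_path = Path(tempfile.mkdtemp()) / "storage_state.json"
state_path.write_text(json.dumps(state, indent=2))
restored = json.loads(state_path.read_text())
```

A file like this could then be passed to whichever storage-state option the crawler exposes. The exact parameter name is not stated in this changelog.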
Enhance Async Crawler with storage state handling
- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.
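As an illustration of the URL validation mentioned above, here is a minimal sketch using only the standard library. The function name `validate_url` and the accepted schemes are assumptions for this example; the crawler's own checks may accept other forms of input.

```python
from urllib.parse import urlparse


def validate_url(url):
    """Return the stripped URL, or raise ValueError with a descriptive message."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("URL must be a non-empty string")
    cleaned = url.strip()
    parsed = urlparse(cleaned)
    # Assumption: only http and https are treated as crawlable here.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {cleaned!r}")
    return cleaned
```

Raising early with a specific message keeps a malformed URL from surfacing later as an opaque browser navigation error.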
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.
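The full-page scanning idea can be sketched as a loop that keeps scrolling until the page height stops growing. This is only an illustration under stated assumptions: `FakePage` is a hypothetical stand-in for a real browser page, `scan_full_page` is not the project's function, and a real implementation would be asynchronous and wait for content between scrolls.

```python
class FakePage:
    """Stand-in for a browser page whose height grows as it is scrolled."""

    def __init__(self, heights):
        self._heights = list(heights)  # page height after each scroll
        self._index = 0
        self.scroll_count = 0

    def scroll_height(self):
        return self._heights[self._index]

    def scroll_to_bottom(self):
        self.scroll_count += 1
        # Lazy content keeps loading until the list of heights runs out.
        self._index = min(self._index + 1, len(self._heights) - 1)


def scan_full_page(page, max_scrolls=20):
    """Scroll until the page height stops growing or max_scrolls is reached."""
    last_height = page.scroll_height()
    for _ in range(max_scrolls):
        page.scroll_to_bottom()
        new_height = page.scroll_height()
        if new_height == last_height:
            break  # nothing new loaded, so the bottom was reached
        last_height = new_height
    return last_height


page = FakePage([1000, 2400, 3900, 3900])
final_height = scan_full_page(page)
```

The `max_scrolls` cap guards against pages with truly endless feeds, where the height never stabilizes.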
### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.
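To make the flag-based resource reduction concrete, the sketch below assembles Chromium command-line switches from two booleans. The switches shown are real Chromium flags, but which flags the project actually sets for `light_mode` and `text_only` is not given in this changelog, so the groupings and the helper name `build_browser_args` are assumptions.

```python
# Assumed flag groups; the project's real lists may differ.
LIGHT_MODE_ARGS = [
    "--disable-extensions",
    "--disable-background-networking",
    "--disable-sync",
    "--mute-audio",
]
TEXT_ONLY_ARGS = [
    "--blink-settings=imagesEnabled=false",  # skip image loading
    "--disable-gpu",
]


def build_browser_args(text_only=False, light_mode=False):
    """Return launch arguments for the requested modes, without duplicates."""
    args = []
    if light_mode:
        args.extend(LIGHT_MODE_ARGS)
    if text_only:
        args.extend(TEXT_ONLY_ARGS)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(args))
```

The resulting list could be passed as the `args` option when launching Chromium through Playwright.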
### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.
### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.
### Docs Update:
- Updated schema usage in the `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility with Pydantic v2.
- Enhanced LLM extraction instruction documentation.
This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
- Enhanced the web scraping strategy with new methods for optimized media handling.
- Added new utility functions for better content processing.
- Refined existing features for improved accuracy and efficiency in scraping tasks.
- Introduced more robust filtering criteria for media elements.
- Enhanced error handling in async crawler.
- Added flexible options in Markdown generation.
- Updated user agent settings for improved reliability.
- Reflected changes in documentation and examples.
- Introduced the PruningContentFilter for better content relevance.
- Implemented comprehensive unit tests for verification of functionality.
- Enhanced existing BM25ContentFilter tests for edge case coverage.
- Updated documentation to include usage examples for new filter.
- Added a new UserAgentGenerator class for generating random User-Agents.
- Integrated User-Agent generation in AsyncPlaywrightCrawlerStrategy for randomization.
- Enhanced HTTP headers with generated Client Hints.
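The pairing of a randomized User-Agent with matching Client Hints might look roughly like the sketch below. It is not the project's `UserAgentGenerator`: the function name `generate_headers`, the version list, and the platform table are hypothetical sample data. The header names (`Sec-CH-UA`, `Sec-CH-UA-Mobile`, `Sec-CH-UA-Platform`) are the standard Client Hints headers.

```python
import random

# Hypothetical sample data; a real generator would draw from a larger pool.
CHROME_MAJOR_VERSIONS = [120, 121, 122]
PLATFORMS = [
    ("Windows NT 10.0; Win64; x64", "Windows"),
    ("Macintosh; Intel Mac OS X 10_15_7", "macOS"),
    ("X11; Linux x86_64", "Linux"),
]


def generate_headers(rng=None):
    """Build a random Chrome User-Agent plus Client Hints that agree with it."""
    rng = rng or random
    major = rng.choice(CHROME_MAJOR_VERSIONS)
    ua_platform, hint_platform = rng.choice(PLATFORMS)
    user_agent = (
        f"Mozilla/5.0 ({ua_platform}) AppleWebKit/537.36 "
        f"(KHTML, like Gecko) Chrome/{major}.0.0.0 Safari/537.36"
    )
    return {
        "User-Agent": user_agent,
        # The brand list repeats the major version used in the User-Agent.
        "Sec-CH-UA": (
            f'"Chromium";v="{major}", "Google Chrome";v="{major}", '
            '"Not_A Brand";v="24"'
        ),
        "Sec-CH-UA-Mobile": "?0",
        "Sec-CH-UA-Platform": f'"{hint_platform}"',
    }


headers = generate_headers(random.Random(7))
```

Deriving both values from the same random draw keeps the User-Agent and the hints consistent, which matters because a mismatch between them is an easy signal for bot detection.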
Thanks for your contribution and such a nice approach. Now that I think of it, I guess I can make good use of this for some other part of the code. By the way, thank you so much; I will add your name to the new list of contributors.
- Added new Docker commands for platform-specific builds.
- Updated README with comprehensive installation and setup instructions.
- Introduced `post_install` method in setup script for automation.
- Refined migration processes with enhanced error logging.
- Bumped version to 0.3.746 and updated dependencies.
- Enhanced Dockerfile for platform-specific installations.
- Added `ARG` for `TARGETPLATFORM` and `BUILDPLATFORM`.
- Improved GPU support conditional on `TARGETPLATFORM`.
- Removed static pages mounting in `main.py`.
- Streamlined code structure to improve maintainability.
- Added a post-installation setup script for initialization.
- Updated README with installation notes for Playwright setup.
- Enhanced migration logging for better error visibility.
- Added 'pydantic' to requirements.
- Bumped version to 0.3.746.