mirror of https://github.com/BruceDone/awesome-crawler.git
synced 2021-06-07 22:49:07 +03:00

Commit: finish the first version
.gitignore (vendored, normal file, 94 lines added)
@@ -0,0 +1,94 @@
+# Created by .ignore support plugin (hsz.mobi)
+### Python template
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# IPython Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# dotenv
+.env
+
+# virtualenv
+venv/
+ENV/
+
+# Spyder project settings
+.spyderproject
+
+# Rope project settings
+.ropeproject
+
+# pycharm ide
+.idea
README.md (48 lines changed)
@@ -1,2 +1,46 @@
-# awesome-crawler
-A collection of awesome web crawler,spider in different language
+# Awesome-crawler
+A collection of awesome web crawlers, spiders and resources in different languages.
+
+## Python
+* [Scrapy](http://scrapy.org/) - A fast high-level screen scraping and web crawling framework.
+* [cola](https://github.com/chineking/cola) - A distributed crawling framework.
+* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
+* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
+
+## JavaScript
+* [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event-driven web crawler.
+* [node-crawler](https://github.com/bda-research/node-crawler) - Node-crawler has a clean, simple API.
+* [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.js; both HTTP and HTTPS are supported.
+
+## Java
+* [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environments.
+* [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
+* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.
+* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML INformation eXtraction.
+
+## C#
+* [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5. It includes a simple web content categorizer extension that can distinguish between web pages based on their content.
+* [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - Simple spider based on multithreading and regular expressions.
+* [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.
+
+
+## PHP
+* [dom-crawler](https://github.com/symfony/dom-crawler) - The DomCrawler component eases DOM navigation for HTML and XML documents.
+* [pspider](https://github.com/hightman/pspider) - Parallel web crawler written in PHP.
+* [php-spider](https://github.com/mvdbos/php-spider) - A configurable and extensible PHP web spider.
+
+## C++
+* [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.
+
+## Ruby
+* [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
+* [RubyRetriever](https://github.com/joenorton/rubyretriever) - RubyRetriever is a web crawler, scraper and file harvester.
+
+## Go
+* [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
+* [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
+
+## Scala
+* [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
+* [scrala](https://github.com/gaocegege/scrala) - Scala crawler (spider) framework, inspired by Scrapy.
+* [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
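All of the libraries cataloged in the README above revolve around the same core loop: fetch a page, extract its links, and queue them for the next round. As a minimal sketch of the link-extraction step, using only the Python standard library (the `LinkExtractor` name is illustrative, and the HTML is a static string so the example runs without network access):

```python
# Minimal sketch of a crawler's link-extraction step, stdlib only.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<html><body><a href="/a">A</a> <a href="/b">B</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → ['/a', '/b']
# A real crawler would resolve these against the page URL,
# dedupe them, and push them onto a fetch queue.
```

The frameworks listed above layer scheduling, politeness (robots.txt, crawl delays) and distribution on top of this basic loop.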