finish the first version

brucedone
2016-10-10 18:12:20 +08:00
parent bc2077e6ee
commit 080215684f
2 changed files with 140 additions and 2 deletions

.gitignore (vendored, new file, +94 lines)

@@ -0,0 +1,94 @@
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
# PyCharm IDE
.idea

README.md (+46, −2)

@@ -1,2 +1,46 @@
# Awesome-crawler
A collection of awesome web crawlers, spiders, and resources in different languages.
## Python
* [Scrapy](http://scrapy.org/) - A fast high-level screen scraping and web crawling framework (see the minimal spider sketch after this list).
* [cola](https://github.com/chineking/cola) - A distributed crawling framework.
* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
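
As a taste of what these frameworks look like in practice, here is a minimal Scrapy spider sketch. It crawls quotes.toscrape.com (a public scraping sandbox); the CSS selectors and field names are illustrative assumptions, not code from any listed project.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Each div.quote block holds one quotation; yield it as an item.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        # Follow the pagination link, if present, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Save it as quotes_spider.py and run `scrapy runspider quotes_spider.py -o quotes.json`; Scrapy handles scheduling, duplicate filtering, and retries for you.
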
## JavaScript
* [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event-driven web crawler.
* [node-crawler](https://github.com/bda-research/node-crawler) - node-crawler has a clean, simple API.
* [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.js; both HTTP and HTTPS are supported.
## Java
* [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environments.
* [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.
* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML INformation eXtraction.
## C#
* [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5. It includes a simple web content categorizer extension that can distinguish between web pages based on their content.
* [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - A simple spider based on multithreading and regular expressions.
* [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.
## PHP
* [dom-crawler](https://github.com/symfony/dom-crawler) - The DomCrawler component eases DOM navigation for HTML and XML documents.
* [pspider](https://github.com/hightman/pspider) - Parallel web crawler written in PHP.
* [php-spider](https://github.com/mvdbos/php-spider) - A configurable and extensible PHP web spider.
## C++
* [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.
## Ruby
* [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
* [RubyRetriever](https://github.com/joenorton/rubyretriever) - Web crawler, scraper and file harvester.
## Go
* [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
* [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows robots.txt policies and crawl delays (a bare-bones Python sketch of this polite fetch loop closes out this list).
## Scala
* [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
* [scrala](https://github.com/gaocegege/scrala) - Scala crawler (spider) framework, inspired by Scrapy.
* [ferrit](https://github.com/reggoodwin/ferrit) - A web crawler service written in Scala using Akka, Spray and Cassandra.
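
Every framework above automates some version of the same polite crawl loop: consult robots.txt, fetch a page, extract same-site links, wait, repeat. Below is a bare-bones Python sketch of that loop using only the standard library; the seed URL, page limit, and one-second delay are illustrative assumptions, not settings from any listed project.

```python
import time
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10, delay=1.0):
    # Honour the site's robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(seed, "/robots.txt"))
    robots.read()

    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay on the seed's host; this drops mailto:, external links, etc.
            if urlparse(absolute).netloc == urlparse(seed).netloc:
                queue.append(absolute)
        print(url)
        time.sleep(delay)  # crawl delay: be polite to the server


if __name__ == "__main__":
    crawl("http://quotes.toscrape.com/")
```

A real crawler replaces this naive in-memory frontier with a persistent, deduplicating queue and fetches concurrently; that machinery is exactly what the libraries in this list provide.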