new_frontera 0.9 documentation
new_frontera is a web crawling toolbox for building crawlers of any scale and purpose. It includes:
- a crawl frontier framework that manages when and what to crawl and checks whether the crawling goal has been accomplished,
- workers, Scrapy wrappers, and data bus components to scale and distribute the crawler.
new_frontera contains components for building a fully operational web crawler with Scrapy. Although it was originally designed for Scrapy, it can also be used with any other crawling framework or system.
Introduction
The purpose of this chapter is to introduce you to the concepts behind new_frontera so that you can get an idea of how it works and decide if it is suited to your needs.
- new_frontera at a glance
Understand what new_frontera is and how it can help you.
- Run modes
High-level architecture and new_frontera run modes.
- Quick start single process
Using Scrapy as a container for running new_frontera.
- Quick start distributed mode
Using SQLite and ZeroMQ.
- Cluster setup guide
Setting up a clustered version of new_frontera on multiple machines with HBase and Kafka.
Using new_frontera
- Installation Guide
How to install new_frontera, and dependency options.
- Crawling strategies
A list of built-in crawling strategies.
- Frontier objects
Understand the classes used to represent requests and responses.
- Middlewares
Filter or alter information for links and documents.
- Canonical URL Solver
Identify and make use of the canonical URL of a document.
- Backends
Built-in backends, and tips on implementing your own.
- Message bus
Built-in message bus reference.
- Writing custom crawling strategy
Implementing your own crawling strategy.
- Using the Frontier with Scrapy
Learn how to use new_frontera with Scrapy.
- Settings
Settings reference.
Advanced usage
- What is a Crawl Frontier?
Learn Crawl Frontier theory.
- Graph Manager
Define fake crawls of websites to test your frontier.
- Recording a Scrapy crawl
Create Scrapy crawl recordings and reproduce them later.
- Fine tuning of new_frontera cluster
Cluster deployment and fine tuning information.
- DNS Service
A few words about DNS service setup.
Developer documentation
- Architecture overview
See how new_frontera works and its different components.
- new_frontera API
Learn how to use the frontier.
- Using the Frontier with Requests
Learn how to use new_frontera with Requests.
- Examples
Some example projects and scripts using new_frontera.
- Tests
How to run and write new_frontera tests.
- Logging
A list of loggers for use with Python's native logging system.
- Testing a Frontier
Test your frontier in an easy way.
- Contribution guidelines
How to contribute.
- Glossary
Glossary of terms.