new_frontera 0.9 documentation

new_frontera is a web crawling toolbox for building crawlers of any scale and purpose. It includes:

  • a crawl frontier framework that manages when and what to crawl and checks whether the crawling goal has been accomplished,

  • workers, Scrapy wrappers, and data bus components to scale and distribute the crawler.

new_frontera contains the components needed to create a fully operational web crawler with Scrapy. Although it was originally designed for Scrapy, it can also be used with any other crawling framework or system.
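
As a rough illustration of what "managing when and what to crawl" and "checking for crawling goal accomplishment" can mean, here is a minimal toy sketch of a crawl frontier. It is not the new_frontera API: the class `ToyFrontier`, its methods, and the page-count goal are hypothetical names and choices made only for this example.

```python
import heapq


class ToyFrontier:
    """Illustrative crawl frontier; NOT part of the new_frontera API."""

    def __init__(self, max_pages):
        self._queue = []      # heap of (priority, insertion order, url)
        self._seen = set()    # URLs already scheduled, to avoid duplicates
        self._counter = 0     # tie-breaker that keeps insertion order stable
        self._fetched = 0     # number of URLs handed out so far
        self._max_pages = max_pages  # assumed crawling goal: a page budget

    def add(self, url, priority=1.0):
        """Schedule a URL; a lower priority value is crawled earlier."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._queue, (priority, self._counter, url))
        self._counter += 1

    def finished(self):
        """The goal is reached when the budget is spent or nothing is left."""
        return self._fetched >= self._max_pages or not self._queue

    def next_url(self):
        """Return the next URL to crawl, or None once the crawl is finished."""
        if self.finished():
            return None
        _, _, url = heapq.heappop(self._queue)
        self._fetched += 1
        return url


frontier = ToyFrontier(max_pages=2)
frontier.add("http://example.com/b", priority=2.0)
frontier.add("http://example.com/a", priority=1.0)
frontier.add("http://example.com/a", priority=0.5)  # duplicate, ignored
frontier.add("http://example.com/c", priority=3.0)

while not frontier.finished():
    print(frontier.next_url())
```

A fetcher would repeatedly ask the frontier for the next URL, download it, and feed newly discovered links back through `add`. The real framework separates these concerns into backends, crawling strategies, and a message bus, as described in the chapters listed below.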

Introduction

The purpose of this chapter is to introduce you to the concepts behind new_frontera so that you can get an idea of how it works and decide if it is suited to your needs.

new_frontera at a glance

Understand what new_frontera is and how it can help you.

Run modes

High level architecture and new_frontera run modes.

Quick start single process

Using Scrapy as a container for running new_frontera.

Quick start distributed mode

Using SQLite and ZeroMQ.

Cluster setup guide

Setting up a clustered version of new_frontera on multiple machines with HBase and Kafka.

Using new_frontera

Installation Guide

Installation how-to and dependency options.

Crawling strategies

A list of built-in crawling strategies.

Frontier objects

Understand the classes used to represent requests and responses.

Middlewares

Filter or alter information for links and documents.

Canonical URL Solver

Identify and make use of the canonical URL of a document.

Backends

Built-in backends, and tips on implementing your own.

Message bus

Built-in message bus reference.

Writing custom crawling strategy

Implementing your own crawling strategy.

Using the Frontier with Scrapy

Learn how to use new_frontera with Scrapy.

Settings

Settings reference.

Advanced usage

What is a Crawl Frontier?

Learn Crawl Frontier theory.

Graph Manager

Define fake crawls of websites to test your frontier.

Recording a Scrapy crawl

Create Scrapy crawl recordings and reproduce them later.

Fine tuning of new_frontera cluster

Cluster deployment and fine tuning information.

DNS Service

A few words about DNS service setup.

Developer documentation

Architecture overview

See how new_frontera works and its different components.

new_frontera API

Learn how to use the frontier.

Using the Frontier with Requests

Learn how to use new_frontera with Requests.

Examples

Some example projects and scripts using new_frontera.

Tests

How to run and write new_frontera tests.

Logging

A list of loggers for use with Python's native logging system.

Testing a Frontier

Test your frontier in an easy way.

Contribution guidelines

How to contribute.

Glossary

Glossary of terms.