---
id: scrapy
title: Using Scrapy
description: Convert Scrapy spiders into Apify Actors with platform storage and proxy integration.
---
import CodeBlock from '@theme/CodeBlock';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

import UnderscoreMainExample from '!!raw-loader!./code/scrapy_project/src/__main__.py';
import MainExample from '!!raw-loader!./code/scrapy_project/src/main.py';
import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';
In this guide, you'll learn how to use the Scrapy framework in your Apify Actors.
Scrapy is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify Actors, integrated with Apify storages, and executed on the Apify platform.
The Apify SDK provides an Apify-Scrapy integration. Its main challenge is combining two asynchronous frameworks that use different event loop implementations: Scrapy uses Twisted for asynchronous execution, while the Apify SDK is built on asyncio. The key is to install Twisted's `asyncioreactor`, which runs Twisted's asyncio-compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, so a Scrapy spider can run as an Apify Actor with minimal modifications.
In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. `apify.scrapy.run_scrapy_actor` installs Twisted's asyncio-compatible reactor and bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed.
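Putting these two helpers together, the Actor's entry point can stay very small. The following is a minimal sketch; the `src/main.py` module and its `main()` coroutine are assumptions based on the template layout described later, not a fixed API:

```python
# __main__.py (sketch): wire the Actor's main coroutine onto Twisted's
# asyncio-compatible reactor. Assumes src/main.py defines `main()`.
from apify.scrapy import initialize_logging, run_scrapy_actor

from .main import main  # the Actor's main coroutine (assumed name/location)

if __name__ == '__main__':
    # Unify log output across Scrapy, the Apify SDK, and other libraries.
    initialize_logging()
    # Install Twisted's asyncioreactor and run the main coroutine on it.
    run_scrapy_actor(main())
```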
Make sure the `SCRAPY_SETTINGS_MODULE` environment variable is set to the path of the Scrapy settings module. The `Actor` class also uses this variable to detect that the project is a Scrapy project, which triggers additional Scrapy-specific setup.
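For example, the variable can be exported in the entry point before anything imports Scrapy; the `src.settings` value below is an assumption matching the project layout used in this guide:

```python
import os

# Point Scrapy (and the Apify Actor class) at the project's settings module.
# 'src.settings' is an assumed module path; adjust it to your layout.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'

# Both Scrapy's settings loader and the Actor class read this variable.
print(os.environ['SCRAPY_SETTINGS_MODULE'])  # → src.settings
```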
Within the Actor's main coroutine, the Actor's input is processed as usual. The function `apify.scrapy.apply_apify_settings` is then used to configure Scrapy settings with Apify-specific components before the spider is executed. The key components and other helper functions are described in the next section.
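A minimal sketch of such a main coroutine follows. The `TitleSpider` import and the `proxyConfiguration` input field are assumptions for illustration, not a fixed contract:

```python
# main.py (sketch): read the Actor input, apply Apify-specific Scrapy
# settings, and run the spider on the Twisted reactor.
from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future

from apify import Actor
from apify.scrapy import apply_apify_settings

from .spiders.title import TitleSpider  # assumed spider location


async def main() -> None:
    async with Actor:
        # Process the Actor's input as usual.
        actor_input = await Actor.get_input() or {}
        proxy_config = actor_input.get('proxyConfiguration')

        # Swap in the Apify scheduler, dataset pipeline, and proxy middleware.
        settings = apply_apify_settings(proxy_config=proxy_config)

        # Start the crawl on Twisted and await its Deferred from asyncio.
        runner = CrawlerRunner(settings)
        await deferred_to_future(runner.crawl(TitleSpider))
```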
The Apify SDK provides several custom components to support integration with the Apify platform:
- `apify.scrapy.ApifyScheduler` - Replaces Scrapy's default scheduler with one that uses Apify's request queue for storing requests. It manages enqueuing, dequeuing, and maintaining the state and priority of requests.
- `apify.scrapy.ActorDatasetPushPipeline` - A Scrapy item pipeline that pushes scraped items to Apify's dataset. When enabled, every item produced by the spider is sent to the dataset.
- `apify.scrapy.ApifyHttpProxyMiddleware` - A Scrapy middleware that manages proxy configurations. This middleware replaces Scrapy's default `HttpProxyMiddleware` to facilitate the use of Apify's proxy service.
- `apify.scrapy.extensions.ApifyCacheStorage` - A storage backend for Scrapy's built-in HTTP cache middleware. This backend uses Apify's key-value store. Make sure to set `HTTPCACHE_ENABLED` and `HTTPCACHE_EXPIRATION_SECS` in your settings, or caching won't work.
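In terms of plain Scrapy settings, wiring these components in by hand roughly corresponds to the fragment below; the pipeline and middleware priority numbers are illustrative, not the exact values `apply_apify_settings` uses:

```python
# settings.py (fragment): Apify components enabled manually.
SCHEDULER = 'apify.scrapy.ApifyScheduler'

ITEM_PIPELINES = {
    'apify.scrapy.ActorDatasetPushPipeline': 1000,  # illustrative priority
}

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's own proxy middleware in favor of the Apify one.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'apify.scrapy.ApifyHttpProxyMiddleware': 750,  # illustrative priority
}

# HTTP cache backend backed by the Apify key-value store.
HTTPCACHE_STORAGE = 'apify.scrapy.extensions.ApifyCacheStorage'
```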
Additional helper functions in the `apify.scrapy` subpackage include:
- `apply_apify_settings` - Applies Apify-specific components to Scrapy settings.
- `to_apify_request` and `to_scrapy_request` - Convert between Apify and Scrapy request objects.
- `initialize_logging` - Configures logging for the Actor environment.
- `run_scrapy_actor` - Installs Twisted's asyncio reactor and bridges the asyncio and Twisted event loops.
The simplest way to start using Scrapy in Apify Actors is to use the Scrapy Actor template. The template provides a pre-configured project structure and setup that includes all necessary components to run Scrapy spiders as Actors and store their output in Apify datasets. If you prefer manual setup, refer to the example Actor section below for configuration details.
The Apify CLI supports converting an existing Scrapy project into an Apify Actor with a single command. The CLI expects the project to follow the standard Scrapy layout (including a scrapy.cfg file in the project root). During the wrapping process, the CLI:
- Creates the necessary files and directories for an Apify Actor.
- Installs the Apify SDK and required dependencies.
- Updates Scrapy settings to include Apify-specific components.
For further details, see the Scrapy migration guide.
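Assuming the Apify CLI is installed, the wrapping is started from the Scrapy project root:

```shell
# Run inside the directory that contains scrapy.cfg.
cd my-scrapy-project
apify init
```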
The following example demonstrates a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
<CodeBlock language="python" title="__main__.py">{UnderscoreMainExample}</CodeBlock>
<CodeBlock language="python" title="main.py">{MainExample}</CodeBlock>
<CodeBlock language="python" title="settings.py">{SettingsExample}</CodeBlock>
<CodeBlock language="python" title="items.py">{ItemsExample}</CodeBlock>
<CodeBlock language="python" title="spiders/title.py">{SpidersExample}</CodeBlock>

Under some circumstances, the platform may decide to migrate your Actor from one piece of infrastructure to another while a run is in progress. While Crawlee-based projects can pause and resume their work after such a restart, achieving the same with a Scrapy-based project is challenging.
As a workaround for this issue (tracked as apify/actor-templates#303), turn on caching with `HTTPCACHE_ENABLED` and set `HTTPCACHE_EXPIRATION_SECS` to at least a few minutes; the exact value depends on your use case. If your Actor gets migrated and restarted, the subsequent run will hit the cache, making it fast and avoiding unnecessary resource consumption.
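A settings fragment for this workaround might look like the following; the five-minute expiration is an example value to tune to your run length:

```python
# settings.py (fragment): cache responses in the key-value store so a
# migrated run can replay completed requests instead of refetching them.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 5 * 60  # example: five minutes
HTTPCACHE_STORAGE = 'apify.scrapy.extensions.ApifyCacheStorage'
```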
In this guide you learned how to use Scrapy in Apify Actors. You can now start building your own web scraping projects using Scrapy and the Apify SDK, and host them on the Apify platform. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!