---
id: crawlee
title: Using Crawlee
description: Build Apify Actors using Crawlee's BeautifulSoupCrawler, ParselCrawler, or PlaywrightCrawler.
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import CrawleeBeautifulSoupExample from '!!raw-loader!roa-loader!./code/05_crawlee_beautifulsoup.py';
import CrawleeParselExample from '!!raw-loader!roa-loader!./code/05_crawlee_parsel.py';
import CrawleePlaywrightExample from '!!raw-loader!roa-loader!./code/05_crawlee_playwright.py';

In this guide, you'll learn how to use the Crawlee library in your Apify Actors.

Introduction

Crawlee is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like HttpCrawler, BeautifulSoupCrawler and ParselCrawler, and browser-based crawlers like PlaywrightCrawler, to suit different scraping needs.

In this guide, you'll learn how to use Crawlee with BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler to build Apify Actors for web scraping.

Actor with BeautifulSoupCrawler

The BeautifulSoupCrawler is ideal for extracting data from static HTML pages. It uses BeautifulSoup for parsing and ImpitHttpClient for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, BeautifulSoupCrawler is a great choice for your scraping tasks. Below is an example of how to use it in an Apify Actor.

{CrawleeBeautifulSoupExample}
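
For orientation, here is a minimal sketch of how such an Actor could be wired together. It is not the bundled example file. It assumes an Actor input with a `start_urls` list of `{'url': ...}` objects (a hypothetical input shape), and the extracted fields (`title`, `h1s`) and the request limit are illustrative choices. The handler is kept at module level so it can be exercised with a stub context.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type hints; not imported at runtime.
    from crawlee.crawlers import BeautifulSoupCrawlingContext


async def request_handler(context: 'BeautifulSoupCrawlingContext') -> None:
    """Extract the URL, title, and top-level headings from one page."""
    context.log.info(f'Scraping {context.request.url} ...')
    soup = context.soup  # BeautifulSoup object for the fetched page

    item = {
        'url': context.request.url,
        'title': soup.title.string if soup.title else None,
        'h1s': [h1.get_text(strip=True) for h1 in soup.find_all('h1')],
    }

    # Store the item in the default dataset and follow links found on the page.
    await context.push_data(item)
    await context.enqueue_links()


async def main() -> None:
    # Imported here so the handler above can be used without these packages.
    from apify import Actor
    from crawlee.crawlers import BeautifulSoupCrawler

    async with Actor:
        actor_input = await Actor.get_input() or {}
        # 'start_urls' is an assumed input field, not a fixed part of the API.
        start_urls = [entry['url'] for entry in actor_input.get('start_urls', [])]
        if not start_urls:
            Actor.log.info('No start URLs in the Actor input, nothing to do.')
            return

        # Illustrative limit that keeps test runs short.
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)
        crawler.router.default_handler(request_handler)
        await crawler.run(start_urls)
```

The Actor's entry point would typically start this with `asyncio.run(main())`.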

Actor with ParselCrawler

The ParselCrawler works in the same way as BeautifulSoupCrawler, but it uses the Parsel library for HTML parsing. Parsel supports both CSS and XPath selectors, which allows for more powerful and flexible data extraction. It is also generally faster than BeautifulSoupCrawler. Below is an example of how to use ParselCrawler in an Apify Actor.

{CrawleeParselExample}
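
To show the selector styles side by side, the following sketch extracts a title with a CSS query and headings with an XPath query. It is an illustration rather than the bundled example. The `start_urls` input field, the extracted fields, and the request limit are assumptions made for this sketch.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type hints; not imported at runtime.
    from crawlee.crawlers import ParselCrawlingContext


async def request_handler(context: 'ParselCrawlingContext') -> None:
    """Extract a page title via CSS and its headings via XPath."""
    selector = context.selector  # parsel.Selector for the fetched page

    item = {
        'url': context.request.url,
        # CSS query using Parsel's ::text pseudo-element.
        'title': selector.css('title::text').get(),
        # XPath query; whitespace is trimmed in Python afterwards.
        'h1s': [text.strip() for text in selector.xpath('//h1/text()').getall()],
    }

    await context.push_data(item)
    await context.enqueue_links()


async def main() -> None:
    # Imported here so the handler above can be used without these packages.
    from apify import Actor
    from crawlee.crawlers import ParselCrawler

    async with Actor:
        actor_input = await Actor.get_input() or {}
        # 'start_urls' is an assumed input field, not a fixed part of the API.
        start_urls = [entry['url'] for entry in actor_input.get('start_urls', [])]
        if not start_urls:
            Actor.log.info('No start URLs in the Actor input, nothing to do.')
            return

        crawler = ParselCrawler(max_requests_per_crawl=20)  # illustrative limit
        crawler.router.default_handler(request_handler)
        await crawler.run(start_urls)
```

Apart from the selector calls, the structure mirrors a BeautifulSoupCrawler Actor, so switching between the two mostly means rewriting the extraction lines.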

Actor with PlaywrightCrawler

The PlaywrightCrawler is built for handling dynamic web pages that rely on JavaScript for content rendering. Using the Playwright library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use PlaywrightCrawler in an Apify Actor.

{CrawleePlaywrightExample}
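
As a rough sketch of the browser-based variant, the handler below reads data from the rendered page through Playwright's `page` object instead of parsing raw HTML. It is not the bundled example. The `start_urls` input field, the extracted fields, and the crawler options shown are assumptions chosen for illustration.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type hints; not imported at runtime.
    from crawlee.crawlers import PlaywrightCrawlingContext


async def request_handler(context: 'PlaywrightCrawlingContext') -> None:
    """Extract the title and headings from a page after it has rendered."""
    page = context.page  # Playwright Page, already navigated to the request URL

    item = {
        'url': context.request.url,
        # Both calls query the live DOM, so JavaScript-rendered content is included.
        'title': await page.title(),
        'h1s': await page.locator('h1').all_text_contents(),
    }

    await context.push_data(item)
    await context.enqueue_links()


async def main() -> None:
    # Imported here so the handler above can be used without these packages.
    from apify import Actor
    from crawlee.crawlers import PlaywrightCrawler

    async with Actor:
        actor_input = await Actor.get_input() or {}
        # 'start_urls' is an assumed input field, not a fixed part of the API.
        start_urls = [entry['url'] for entry in actor_input.get('start_urls', [])]
        if not start_urls:
            Actor.log.info('No start URLs in the Actor input, nothing to do.')
            return

        # Headless mode and the request limit are illustrative settings.
        crawler = PlaywrightCrawler(headless=True, max_requests_per_crawl=20)
        crawler.router.default_handler(request_handler)
        await crawler.run(start_urls)
```

Running this locally requires the Playwright browsers to be installed (for example with `playwright install`). Because every request opens a page in a real browser, this crawler is noticeably heavier than the HTTP-based ones.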

Conclusion

In this guide, you learned how to use the Crawlee library in your Apify Actors. By using the BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

Additional resources