Extremely Fast Python Web Scraping

Build an Extremely Fast Python Web Scraper

A web scraper is a tool that extracts structured data from a website. With BeautifulSoup, requests, and other Python modules, you can build a working web scraper. These tools, however, are not fast enough for large jobs. In this article, I’ll show you how to build a much faster web scraper in Python.

Don’t use BeautifulSoup4

BeautifulSoup4 is nice and easy to use, but it is slow. It stays slow even if you use an external parser like lxml for HTML parsing or cchardet to detect the encoding.

Use selectolax instead of BeautifulSoup4 for HTML parsing

selectolax is a Python binding to the Modest and Lexbor engines, two HTML parsers written in C.

To install selectolax with pip:

pip install selectolax

selectolax is used much like BeautifulSoup4:

from selectolax.parser import HTMLParser

html = """
<body>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">
        <p class='p3'>Lorem ipsum</p>
        <p class='p3'>Lorem ipsum 2</p>
    </div>
    <div>
        <p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>
</body>
"""

parser = HTMLParser(html)

# Select all elements with class 'p3' (returns a list of nodes)
parser.css('p.p3')

# Select first match (returns a single node, or None if nothing matches)
parser.css_first('p.p3')

# Iterate over all nodes on the current level
for node in parser.css('div'):
    for cnode in node.iter():
        print(cnode.tag, cnode.html)

For more information, please visit the selectolax walkthrough tutorial.

Use httpx instead of requests

Python requests is a human-friendly HTTP client. It’s simple to use, but it’s not quick, because it supports only synchronous requests: each request has to finish before the next one starts.
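To see why synchronous requests hold a scraper back, here is a rough sketch that uses only the standard library. It simulates network latency with `asyncio.sleep` instead of making real requests. The `fake_fetch` function, the 0.1 second delay, and the example.com URLs are all assumptions made for the demonstration.

```python
import asyncio
import time

DELAY = 0.1  # simulated network latency per request, in seconds (an assumption)


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request: waits, then returns a dummy body."""
    await asyncio.sleep(DELAY)
    return f"body of {url}"


async def fetch_sequentially(urls):
    # Each request starts only after the previous one has finished.
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrently(urls):
    # All requests wait at the same time; gather keeps the input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    _, sequential = timed(fetch_sequentially(urls))
    _, concurrent = timed(fetch_concurrently(urls))
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

With five simulated requests, the sequential version waits for each delay in turn, while the concurrent version overlaps them, so its total time stays close to a single delay.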

httpx is a full-featured HTTP client for Python 3 that supports both HTTP/1.1 and HTTP/2 and provides sync and async APIs. It comes with a conventional synchronous API by default, and you can switch to the async client when you need concurrency. To install httpx with pip:

pip install httpx

httpx offers an API similar to requests. The example below uses the async client:

import asyncio

import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(main())

For examples and usage, please visit the httpx home page.

Use aiofiles for file IO

aiofiles is a Python package for asyncio-based file I/O. It provides a high-level API for working with files without blocking the event loop. To install aiofiles with pip:

pip install aiofiles

Basic usage:

import asyncio

import aiofiles

async def main():
    # Write a file without blocking the event loop
    async with aiofiles.open('test.txt', 'w') as f:
        await f.write('Hello world!')

    # Read it back
    async with aiofiles.open('test.txt', 'r') as f:
        print(await f.read())

asyncio.run(main())

For more information, please visit the aiofiles repository.

Conclusion

Basic web scraping in Python is simple, but it can take a long time. When you Google something like “rapid web scraping in python,” multiprocessing appears to be the simplest option, but it can only do so much. Hopefully, the tools described above can help you create a faster and more efficient Python web scraper.
