Using pip
The Python package installer makes it easy to install screpe.
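Assuming the package is published under the same name on PyPI:

```bash
pip install screpe
```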
Using git
Otherwise, clone this repository to your local machine with git, then install with Python.
```bash
git clone https://github.com/shanedrabing/screpe.git
cd screpe
python setup.py install
```
You can also simply download screpe.py and place it in your working
directory.
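For example, one way to fetch the single file from the command line
(assuming the repository's default branch is named `master`):

```bash
curl -O https://raw.githubusercontent.com/shanedrabing/screpe/master/screpe.py
```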
Initializing Screpe
Import the module in Python, and initialize a Screpe object.
```python
from screpe import Screpe

# do we want the scraper to remember previous responses?
scr = Screpe(is_caching=True)
```
All methods in this module live on the Screpe class, so there is no need to
import anything else!
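For example, the static helpers can be called on the class itself, while
stateful methods are called on an instance (both appear in the sections
below):

```python
# static helper, called on the class
html = Screpe.get("https://www.wikipedia.org")

# stateful method, called on our instance from above
soup = scr.dine("https://www.wikipedia.org")
```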
Requests and BeautifulSoup
If you are familiar with web scraping in Python, then you have probably used
the requests and bs4 packages before. There are a couple of static methods
that Screpe provides to make their usage even easier!
```python
# a webpage we want to scrape
url = "https://www.wikipedia.org"

# returns None if status code is not 200
html = Screpe.get(url)

# can handle None as input, parses the HTML with `lxml`
soup = Screpe.cook(html)

# check to make sure we have a soup object, otherwise see bs4
if soup is not None:
    print(soup.select_one("h1"))
```
We can marry these two functions with the instance method Screpe.dine.
Remember that we have the scr object from the section above.
```python
# get and cook
soup = scr.dine(url)
```
Responses from Screpe.dine can be cached and adhere to rate-limiting (see
the sections below).
Downloading a Webpage or a File
Commonly, we just want to download an image, a webpage, or some other file.
Let us see how to do this with Screpe!
```python
# locator to file we want, local path to where we want it
url = "https://www.python.org/static/img/python-logo.png"
fpath = "logo.png"

# let us use our object to download the file
scr.download(url, fpath)
```
Note that the URL can point to pretty much any filetype, since the response
is saved as binary data; just make sure the local path has the right
extension.
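For example, the same call can save a page's HTML (a small sketch reusing
`scr.download` from above):

```python
# the response body is written as-is, so a webpage works just like an image
scr.download("https://www.wikipedia.org", "wikipedia.html")
```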
Downloading an HTML Table
Sometimes there is a nice HTML table on a webpage that we want in a more
interoperable format. The pandas package can do this easily, and we take
advantage of that with Screpe.
```python
# this webpage contains a table that we want to download
url = "https://www.multpl.com/cpi/table/by-year"

# we save the table as a CSV file
fpath = "table.csv"

# the `which` parameter decides which table to save
scr.download_table(url, fpath, which=0)
```
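Since the output is a plain CSV file, we can load it right back with pandas
(which screpe already leans on for this feature):

```python
import pandas as pd

# read the saved table back into a DataFrame
df = pd.read_csv("table.csv")
print(df.head())
```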
Selenium
One of the most challenging tasks in web scraping is dealing with dynamic
pages that require a web browser to work properly. Thankfully, the selenium
package is pretty good at this, and Screpe removes the headaches surrounding
Selenium.
```python
# the homepage of Wikipedia has a search box
url = "https://www.wikipedia.org"

# let us open the page in a webdriver
scr.open(url)

# we can click on the input box
scr.click("input#searchInput")

# ...enter a search term
scr.send_keys("Selenium")

# ...and hit return to initiate the search
scr.bide(lambda: scr.send_enter())

# note that the `Screpe.bide` function takes a function as input, checks what
# page it is on, calls the function, and waits for the next page to load

# we can use bs4 once the next page loads!
soup = scr.source()
```
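From here, the soup behaves like any other bs4 object:

```python
# for example, print the first heading of the results page
if soup is not None:
    print(soup.select_one("h1"))
```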
Caching does not apply to the Selenium-related functions; browsing is a
stateful activity, and we cannot simply load an old webdriver state.
Asynchronous Requests
Screpe uses concurrent.futures to spawn a bunch of threads that can work
simultaneously to retrieve webpages.
```python
# a collection of URLs
urls = [
    "https://www.wikipedia.org/wiki/Dog",
    "https://www.wikipedia.org/wiki/Cat",
    "https://www.wikipedia.org/wiki/Sheep",
]

# we want soup objects for all
soups = scr.dine_many(urls)
```
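Each element is a soup object (just like Screpe.dine returns), so we can
iterate over the results as usual:

```python
# print the title of each page that was successfully fetched
for soup in soups:
    if soup is not None:
        print(soup.title.get_text())
```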
Rate-Limiting
If a site is sensitive to how often you send requests, consider setting your
Screpe object to halt before sending another request.
```python
# we give the function a duration, but we can derive that from a rate
rate_per_second = 2
duration_in_seconds = 1 / rate_per_second

# inform your scraper not to surpass the request interval
scr.halt_duration(duration_in_seconds)
```
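To see the limiter in action, we can time two back-to-back requests (a rough
sketch; it assumes neither response is already cached):

```python
import time

start = time.time()
scr.dine("https://www.wikipedia.org/wiki/Dog")
scr.dine("https://www.wikipedia.org/wiki/Cat")

# with a half-second interval, this should take at least ~0.5 seconds
print(time.time() - start)
```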
Note that cached responses do not adhere to the rate limit. After all, we
already have the response!
Caching
Sometimes, we have to request many pages. So that we do not waste bandwidth
(or exhaust a rate limit), we can use cached responses. Note that caching is
on by default; turn it off if you want real-time responses.
```python
# turn caching on
scr.cache_on()

# ...or turn it off
scr.cache_off()
```
We can save and load the cache between sessions for even more greatness!
```python
# where shall we save the cache? (binary file)
fpath = "cache.bin"

# save the cache
scr.cache_save(fpath)

# load the cache
scr.cache_load(fpath)

# clear the cache
scr.cache_clear()
```
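A common pattern is to load the cache at the start of a session and save it
at the end (a sketch using the methods above; the existence check is just a
precaution for the first run):

```python
import os

# load a previous session's cache, if one exists
if os.path.exists(fpath):
    scr.cache_load(fpath)

# ... scrape as usual ...

# persist the cache for next time
scr.cache_save(fpath)
```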
Copyright (c) 2022 Shane Drabing
