Using pip
The Python package installer makes it easy to install screpe.
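Assuming the package is published under the same name on PyPI:

```bash
pip install screpe
```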
Using git
Otherwise, clone this repository to your local machine with git, then install with Python.
```bash
git clone https://github.com/shanedrabing/screpe.git
cd screpe
python setup.py install
```
You can also simply download screpe.py and place it in your working
directory.
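For example, one way to fetch the single file from the command line
(assuming the repository's default branch is named `master`):

```bash
curl -O https://raw.githubusercontent.com/shanedrabing/screpe/master/screpe.py
```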
Initializing Screpe
Import the module in Python, and initialize a Screpe object.
```python
from screpe import Screpe

# do we want the scraper to remember previous responses?
scr = Screpe(is_caching=True)
```
All methods in this module live on the Screpe class, so there is no need to
import anything else!
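For example, the static helpers can be called on the class itself, while
stateful methods are called on an instance (both appear in the sections
below):

```python
# static helper, called on the class
html = Screpe.get("https://www.wikipedia.org")

# stateful method, called on our instance from above
soup = scr.dine("https://www.wikipedia.org")
```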
Requests and BeautifulSoup
If you are familiar with web scraping in Python, then you have probably used
the requests and bs4 packages before. There are a couple of static methods
that Screpe provides to make their usage even easier!
```python
# a webpage we want to scrape
url = "https://www.wikipedia.org"

# returns None if status code is not 200
html = Screpe.get(url)

# can handle None as input, parses the HTML with `lxml`
soup = Screpe.cook(html)

# check to make sure we have a soup object, otherwise see bs4
if soup is not None:
    print(soup.select_one("h1"))
```
We can marry these two functions with the instance method Screpe.dine.
Remember that we have the scr object from the section above.
```python
# get and cook
soup = scr.dine(url)
```
Responses from Screpe.dine can be cached and adhere to rate-limiting (see
the sections below).
Downloading a Webpage or a File
Commonly, we just want to download an image, a webpage, or some other file.
Let us see how to do this with Screpe!
```python
# locator to file we want, local path to where we want it
url = "https://www.python.org/static/img/python-logo.png"
fpath = "logo.png"

# let us use our object to download the file
scr.download(url, fpath)
```
Note that the URL can point to pretty much any filetype, since the response
is saved as binary data; just make sure the local path has the right
extension.
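For example, the same call can save a page's HTML (a small sketch reusing
`scr.download` from above):

```python
# the response body is written as-is, so a webpage works just like an image
scr.download("https://www.wikipedia.org", "wikipedia.html")
```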
Downloading an HTML Table
Sometimes there is a nice HTML table on a webpage that we want in a more
interoperable format. The pandas package can do this easily, and we take
advantage of that with Screpe.
```python
# this webpage contains a table that we want to download
url = "https://www.multpl.com/cpi/table/by-year"

# we save the table as a CSV file
fpath = "table.csv"

# the `which` parameter decides which table to save
scr.download_table(url, fpath, which=0)
```
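Since the output is a plain CSV file, we can load it right back with pandas
(which screpe already leans on for this feature):

```python
import pandas as pd

# read the saved table back into a DataFrame
df = pd.read_csv("table.csv")
print(df.head())
```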
Selenium
One of the most challenging tasks in web scraping is dealing with dynamic
pages that require a web browser to work properly. Thankfully, the selenium
package is pretty good at this, and Screpe removes the headaches surrounding
Selenium.
```python
# the homepage of Wikipedia has a search box
url = "https://www.wikipedia.org"

# let us open the page in a webdriver
scr.open(url)

# we can click on the input box
scr.click("input#searchInput")

# ...enter a search term
scr.send_keys("Selenium")

# ...and hit return to initiate the search
scr.bide(lambda: scr.send_enter())

# note that the `Screpe.bide` function takes a function as input, checks what
# page it is on, calls the function, and waits for the next page to load

# we can use bs4 once the next page loads!
soup = scr.source()
```
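From here, the soup behaves like any other bs4 object:

```python
# for example, print the first heading of the results page
if soup is not None:
    print(soup.select_one("h1"))
```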
Caching does not apply to the Selenium-related functions; browsing is a
stateful activity, and we cannot simply load an old webdriver state.
Asynchronous Requests
Screpe uses concurrent.futures to spawn a bunch of threads that can work
simultaneously to retrieve webpages.
```python
# a collection of URLs
urls = [
    "https://www.wikipedia.org/wiki/Dog",
    "https://www.wikipedia.org/wiki/Cat",
    "https://www.wikipedia.org/wiki/Sheep",
]

# we want soup objects for all
soups = scr.dine_many(urls)
```
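Each element is a soup object (just like Screpe.dine returns), so we can
iterate over the results as usual:

```python
# print the title of each page that was successfully fetched
for soup in soups:
    if soup is not None:
        print(soup.title.get_text())
```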
Rate-Limiting
If a site is sensitive to how often you send requests, consider setting your
Screpe object to halt before sending another request.
```python
# we give the function a duration, but we can derive that from a rate
rate_per_second = 2
duration_in_seconds = 1 / rate_per_second

# inform your scraper not to surpass the request interval
scr.halt_duration(duration_in_seconds)
```
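To see the limiter in action, we can time two back-to-back requests (a rough
sketch; it assumes neither response is already cached):

```python
import time

start = time.time()
scr.dine("https://www.wikipedia.org/wiki/Dog")
scr.dine("https://www.wikipedia.org/wiki/Cat")

# with a half-second interval, this should take at least ~0.5 seconds
print(time.time() - start)
```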
Note that cached responses do not adhere to the rate limit. After all, we
already have the response!
Caching
Sometimes, we have to request many pages. So that we do not waste bandwidth
(or exhaust a rate limit), we can use cached responses. Note that caching is
on by default; turn it off if you want real-time responses.
```python
# turn caching on
scr.cache_on()

# ...or turn it off
scr.cache_off()
```
We can save and load the cache between sessions for even more greatness!
```python
# where shall we save the cache? (binary file)
fpath = "cache.bin"

# save the cache
scr.cache_save(fpath)

# load the cache
scr.cache_load(fpath)

# clear the cache
scr.cache_clear()
```
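A common pattern is to load the cache at the start of a session and save it
at the end (a sketch using the methods above; the existence check is just a
precaution for the first run):

```python
import os

# load a previous session's cache, if one exists
if os.path.exists(fpath):
    scr.cache_load(fpath)

# ... scrape as usual ...

# persist the cache for next time
scr.cache_save(fpath)
```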
Copyright (c) 2022 Shane Drabing
