Scraping 101

Summary

This is a multipart post on how to scrape gu.se for fun and profit using Python. There are no prerequisites other than:

  • A small amount of patience
  • Some rudimentary idea of programming

OK, well, access to a computer of some sort is also necessary.

The beginning

First things first, unfortunately: before we can collect any data, we need to collect the tools to do so.

Installing Python

We need to install the Python interpreter (the program that reads Python code and runs it). This can be done in many ways: there are many Python versions and distributions, and of course many different operating systems to install Python on. For simplicity I will just choose the official python.org distribution and guide you through installing it on Windows 10. Head over to python.org and download the latest version.

Well, if you are too lazy (and running a recent Windows or macOS) you can just download the installers for 3.7.1 from this list:

So, just go ahead and install the thing. When installing on Windows, tick the box “Add Python 3.7 to PATH” as shown below.

Then click “Install now” and just agree to all the questions asked.

Just one more thing

Now that the Python interpreter is installed, we are ready to go! Almost. We just need one more thing: something to write the code with. There are many editors and you can use almost anything, or at least anything that does not introduce weird symbols. A lot of text editors that are not meant for programming do just that: they insert unwanted symbols (curly quotes, for example) and may not save your program text in the correct text encoding. Just go to https://code.visualstudio.com/ and download Visual Studio Code for your platform.

Or you can just use my links below to install version 1.29.1.

Just install and agree to all questions asked.
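To make the "weird symbols" problem a bit more concrete, here is a minimal sketch that lists every non-ASCII character in a piece of program text. The function name find_non_ascii and the sample line are made up for illustration, and not every non-ASCII character is an error (Python 3 is perfectly happy with them inside strings and comments), so treat the result as a hint about where to look and nothing more.

```python
def find_non_ascii(source_text):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 127:  # plain ASCII ends at code point 127
                hits.append((line_number, column, char))
    return hits


# Hypothetical example: a line pasted from a word processor.
# \u201c and \u201d are the curly quotes that replace the plain " character.
pasted = "print(\u201chello\u201d)"

for line_number, column, char in find_non_ascii(pasted):
    print("line", line_number, "column", column, "has", repr(char))
```

Python refuses to run a line like the pasted one because curly quotes are not valid string delimiters. An editor made for programming never swaps them in.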

Let the code begin

OK, so now we have Python and an editor installed. Open the editor, create a new file (File->New file) and paste this code:

print("hello")

Save the file as hello.py in your Documents directory. Now we want to run this thing. Click on View->Terminal and a subwindow will open under the code that you’ve written. Click the TERMINAL tab if it is not already highlighted, and make sure 1: powershell is chosen in the dropdown menu to the right of the tabs. Now just type:

cd Documents
python hello.py

What you should see is something like this:

If you don’t, then obviously something is wrong. Either try again or give up.
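Once hello.py runs, a slightly longer script can confirm which Python the terminal actually picked up, which is handy if more than one version is installed. This is only a sketch: the function name describe_interpreter and the file name version.py are my own choices, and it uses nothing beyond the standard sys module.

```python
import sys


def describe_interpreter():
    """Return a short description of the Python that is running this script."""
    # sys.version_info starts with (major, minor, micro), e.g. (3, 7, 1)
    version = ".".join(str(part) for part in sys.version_info[:3])
    # sys.executable is the path of the interpreter program itself
    return "Python {} at {}".format(version, sys.executable)


if __name__ == "__main__":
    print(describe_interpreter())
```

Save it as, say, version.py next to hello.py and run it the same way, with python version.py. If the version printed is not the one you just installed, the PATH checkbox from the installer is the first thing to check.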

The next level

Congratulations on making it this far! It does not matter if you have this exact setup: as long as you have a working Python installation, an editor to write code in, and you know how to run that code, you are ready to follow along in Part 2 of Scraping 101.