Scraping 101 part 4

December 2, 2018

Make the change

Let’s try to improve our script and see if we can get the gender.speculate method to identify the gender of every name. In our printout we can see that there is a 'U' in the group 'Universitetslektor':

'Universitetslektor': {'M': 5, 'U': 1, 'F': 4}

We can conclude that there is someone with the title Universitetslektor whose gender we could not identify. Either we did not parse the first name correctly, or it is marked as a 'U' because it is not in the gename database. Let’s first make sure that we extracted the names properly from the webpage. Since we are iterating through all the persons on the staff page, let us just print out the first name on every loop and have a look. Change the line:

first_name = name.split(',')[1].strip()

to:

first_name = name.split(',')[1].strip()
print(first_name)

Save the file and run the program again by typing: python scrape_gu.py. Now you will get a pretty big output: first a list of names, followed by the statistics that we saw before. Nothing surprising about that, right? Now scroll up through the list of names and you should find the name 'Mathias A.'. That is not just a first name, and gender.speculate will surely fail to figure out the gender here. We need to modify the script so that it parses this name correctly. What if we split the 'Mathias A.' string on the ' ' character and use only the first part? Sounds good, right? Make the following change to do this:

first_name = name.split(',')[1].strip().split(' ')[0]
print(first_name)

We just appended some stuff directly to the previous result: first we made the split on ' ' by appending .split(' '), and then we picked the first part by appending [0]. Counting in programming mostly starts at 0 and not 1; I think that is the only difference between man and machine. This all seems good for the string 'Mathias A.', but what about the other names that do not have this problem? Well, if we split, say, 'Marianthi' on the ' ' character, the first part of the result would still be 'Marianthi'. Let’s try it out before we change our program. In the terminal inside VS Code, just type python.

Now we are in the Python interpreter and can run Python commands directly. Type 'Marianthi'.split(' ')[0] and make sure you enter all the single quotes correctly. In Python you can write a string with double or single quotes; I prefer single quotes for no particular reason. The output should be what we hoped for: 'Marianthi'. Now type 'Mathias A.'.split(' ')[0] and hit enter, and you will see that this time, too, we get the result we hoped for: 'Mathias'. Below you will see the output. I also typed the commands without the trailing [0], which might give you some insight:
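
The interpreter session should look roughly like the sketch below. It is written as a small script rather than as interpreter input, so it can be run as is; the variable names single and double are only there for illustration.

```python
# Each split is shown with and without the trailing [0].
single = 'Marianthi'.split(' ')
print(single)       # ['Marianthi']
print(single[0])    # Marianthi

double = 'Mathias A.'.split(' ')
print(double)       # ['Mathias', 'A.']
print(double[0])    # Mathias
```

Without the [0] we get a list of all the parts, and with it we get only the first part. When you type the expressions directly in the interpreter, strings are echoed with their quotes, for example 'Mathias'.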

Now exit the Python interpreter by typing exit(). Now that we have done some testing, we can remove the print line that we added to our program earlier:

print(first_name)

Run the program again by typing python scrape_gu.py in the terminal and you should see the full stats output that we have already seen, but in particular:

'Universitetslektor': {'M': 6, 'F': 4, 'U': 0}

The 'U' that we had previously has now become an 'M'. In this context this is a good thing.

Hard working academics

Our next challenge is to have all the staff who hold multiple titles counted properly. This change will prove to be a little more complicated, so don’t worry if you don’t understand everything.

We have seen in our output things like:

'Universitetslektor, Prefekt': {'M': 0, 'F': 1, 'U': 0}

Does this problem seem somewhat familiar? We should be thinking about how to split this string into multiple strings. We could split on the comma, right? The only extra problem is that we will then end up with more titles than persons. OK, let’s try to implement this idea in our code. Change the line:

title = person.find_next('td').span['title']

Into:

titles = person.find_next('td').span['title'].split(',')

OK, now what? We renamed the variable title to titles, and it is now a list of at least one title, but the program we have is written to handle only one title. We need to make some more changes for the program to work. Earlier we talked about the following part of the program:

for person in all_persons:

That is a loop that will assign every person in all_persons to the variable person and then run the code inside the loop. We want to do the same thing here with titles and title. Make the following change to your code:
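
The listing for this step is not reproduced here. Judging by the full program shown further down, the change is most likely an inner loop over titles. Here is a minimal, self-contained sketch of that loop, where gen and the title string are hypothetical stand-ins for the values the real script computes for each person:

```python
# Hypothetical stand-ins for what the real script computes per person.
stats = {}
gen = 'F'
titles = 'Universitetslektor, Prefekt'.split(',')

# Count the person once for every title they hold.
for title in titles:
    title = title.strip()  # remove the space left over after the comma
    stats.setdefault(title, {'M': 0, 'F': 0, 'U': 0})
    stats[title][gen] += 1

print(stats)
```

In the real program these lines sit inside the for person in all_persons loop, so they are indented one level further.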

Save the file and run it, the output should be something along the lines of:

{'Doktorand': {'M': 6, 'F': 7, 'U': 0}, 'Universitetsadjunkt': {'M': 5, 'F': 11, 'U': 0}, 'Universitetslektor': {'M': 7, 'F': 7, 'U': 0}, 'Professor': {'M': 3, 'F': 4, 'U': 0}, 'Proprefekt': {'M': 0, 'F': 1, 'U': 0}, 'UNESCO-professor i yttrandefrihet': {'M': 0, 'F': 1, 'U': 0}, 'medieutveckling och global politik': {'M': 0, 'F': 1, 'U': 0}, 'Forskare': {'M': 0, 'F': 1, 'U': 0}, 'biträdande': {'M': 0, 'F': 1, 'U': 0}, 'Viceprefekt': {'M': 1, 'F': 0, 'U': 0}, 'Studierektor': {'M': 0, 'F': 1, 'U': 0}, 'Datadriftledare': {'M': 1, 'F': 0, 'U': 0}, 'Studieadministratör': {'M': 1, 'F': 1, 'U': 0}, 'Prefekt': {'M': 0, 'F': 1, 'U': 0}, 'Kommunikatör': {'M': 0, 'F': 1, 'U': 0}, 'Gästforskare': {'M': 0, 'F': 1, 'U': 0}, 'Docent': {'M': 1, 'F': 1, 'U': 0}, 'Gästlärare': {'M': 1, 'F': 0, 'U': 0}, 'Administrativ chef': {'M': 1, 'F': 0, 'U': 0}, 'inst': {'M': 1, 'F': 0, 'U': 0}, 'Personalhandläggare': {'M': 0, 'F': 1, 'U': 0}}

Is that good? Yes and no. It seems we did not handle some cases carefully enough. We can see some titles like:

'biträdande': {'M': 0, 'F': 1, 'U': 0}

'inst': {'M': 1, 'F': 0, 'U': 0}

Those titles are probably not correct. What has gone wrong? Looking more closely at the data, it seems that the comma is used to separate titles but is also used inside some titles. This is quite a common situation: the data is (hopefully) structured well enough for us humans to understand, but it is not optimal for machines to parse. After reading the titles again and looking for some structure that we can exploit, we note that every separate title in the title string starts with a capital letter. If we have a comma followed by a space character followed by a capital letter, then we have a new title. After some searching on the internet we find that we can split the title string on exactly this condition by changing:

titles = person.find_next('td').span['title'].split(',')

To:

titles = person.find_next('td').span['title']
titles = re.split(r', (?=[A-Z])', titles)

That looks complicated! Yes, but you don’t have to know the details other than that we used something called regular expressions. The logic, however, is pretty easy to understand. We already defined the conditions needed for a split of titles:

First we want a comma
Then we need the following character to be a space
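Lastly we need the next character to be an uppercase letter (the pattern checks for A to Z, and the (?=...) part means the letter itself is not removed by the split).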
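
To see these three conditions in action, here is a small sketch that applies the same pattern to a few title strings from our output. The helper name split_titles is only for illustration and is not part of the program. The last two calls show a caveat: [A-Z] only covers the letters A to Z, so a title starting with Å, Ä or Ö would not trigger a split. The title 'Överbibliotekarie' is made up to demonstrate this, and adding ÅÄÖ to the pattern is just one possible adjustment.

```python
import re


def split_titles(text, pattern=r', (?=[A-Z])'):
    """Split on ', ' only when an uppercase letter follows."""
    # (?=...) is a lookahead: it checks the next character
    # without making it part of the separator.
    return re.split(pattern, text)


print(split_titles('Universitetslektor, Prefekt'))
# ['Universitetslektor', 'Prefekt']
print(split_titles('Forskare, biträdande'))
# ['Forskare, biträdande']
print(split_titles('Administrativ chef, inst'))
# ['Administrativ chef, inst']

# Caveat: Ö is not in A-Z, so this made-up example is not split...
print(split_titles('Docent, Överbibliotekarie'))
# ['Docent, Överbibliotekarie']
# ...unless the Swedish letters are added to the character class.
print(split_titles('Docent, Överbibliotekarie', r', (?=[A-ZÅÄÖ])'))
# ['Docent', 'Överbibliotekarie']
```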

To use these regular expressions we need to import re, you can do this at the top of your file, like this:

import re  # add this line!
import requests
import requests_cache
from bs4 import BeautifulSoup
from gename import Gender
# ...
# and so on...

Save the program and run it again: python scrape_gu.py. Here is the full program code after our changes but without comments for reference:

import re
import requests
import requests_cache
from bs4 import BeautifulSoup
from gename import Gender


gender = Gender()
requests_cache.install_cache('cache')


def scrape():
    resp = requests.get('https://jmg.gu.se/om-institutionen/', params={
        'selectedTab': 2,
        'itemsPerPage': 500,
    })
    soup = BeautifulSoup(resp.text, 'html.parser')
    stats = {}
    all_persons = soup.find_all('a', {'class': 'person'})
    for person in all_persons:
        name = person.string.strip()
        first_name = name.split(',')[1].strip().split(' ')[0]
        gen = gender.speculate(first_name)
        titles = person.find_next('td').span['title']
        titles = re.split(r', (?=[A-Z])', titles)
        for title in titles:
            title = title.strip()
            stats.setdefault(title, {'M': 0, 'F': 0, 'U': 0})
            stats[title][gen] += 1
    print(stats)


scrape()

Run the program again and now we have some great success:

{'Doktorand': {'M': 6, 'F': 7, 'U': 0}, 'Universitetsadjunkt': {'M': 5, 'F': 11, 'U': 0}, 'Datadriftledare': {'M': 1, 'F': 0, 'U': 0}, 'Professor': {'M': 3, 'F': 4, 'U': 0}, 'Prefekt': {'M': 0, 'F': 1, 'U': 0}, 'Proprefekt': {'M': 0, 'F': 1, 'U': 0}, 'Administrativ chef, inst': {'M': 1, 'F': 0, 'U': 0}, 'Gästlärare': {'M': 1, 'F': 0, 'U': 0}, 'UNESCO-professor i yttrandefrihet, medieutveckling och global politik': {'M': 0, 'F': 1, 'U': 0}, 'Universitetslektor': {'M': 7, 'F': 7, 'U': 0}, 'Studierektor': {'M': 0, 'F': 1, 'U': 0}, 'Forskare, biträdande': {'M': 0, 'F': 1, 'U': 0}, 'Personalhandläggare': {'M': 0, 'F': 1, 'U': 0}, 'Docent': {'M': 1, 'F': 1, 'U': 0}, 'Gästforskare': {'M': 0, 'F': 1, 'U': 0}, 'Viceprefekt': {'M': 1, 'F': 0, 'U': 0}, 'Kommunikatör': {'M': 0, 'F': 1, 'U': 0}, 'Studieadministratör': {'M': 1, 'F': 1, 'U': 0}}

That looks awful and my eyes hurt

Right, so the output is not pretty. Can we do something about this? Can we at the very least have some sane line breaks to make the output more readable? There is a module called pprint, for pretty printing, available in Python’s so-called standard library. Let’s try that. We need to import it at the top of our script first, so change your program like this:

import pprint  # add this line!
import re
import requests
import requests_cache
from bs4 import BeautifulSoup
from gename import Gender
# ...
# and so on...

And then we use the pretty printer instead of print. Old code:

print(stats)

New code:

pprint.pprint(stats)

Running the program now yields something like this:

{'Administrativ chef, inst': {'F': 0, 'M': 1, 'U': 0},
 'Datadriftledare': {'F': 0, 'M': 1, 'U': 0},
 'Docent': {'F': 1, 'M': 1, 'U': 0},
 'Doktorand': {'F': 7, 'M': 6, 'U': 0},
 'Forskare, biträdande': {'F': 1, 'M': 0, 'U': 0},
 'Gästforskare': {'F': 1, 'M': 0, 'U': 0},
 'Gästlärare': {'F': 0, 'M': 1, 'U': 0},
 'Kommunikatör': {'F': 1, 'M': 0, 'U': 0},
 'Personalhandläggare': {'F': 1, 'M': 0, 'U': 0},
 'Prefekt': {'F': 1, 'M': 0, 'U': 0},
 'Professor': {'F': 4, 'M': 3, 'U': 0},
 'Proprefekt': {'F': 1, 'M': 0, 'U': 0},
 'Studieadministratör': {'F': 1, 'M': 1, 'U': 0},
 'Studierektor': {'F': 1, 'M': 0, 'U': 0},
 'UNESCO-professor i yttrandefrihet, medieutveckling och global politik': {'F': 1,
                                                                           'M': 0,
                                                                           'U': 0},
 'Universitetsadjunkt': {'F': 11, 'M': 5, 'U': 0},
 'Universitetslektor': {'F': 7, 'M': 7, 'U': 0},
 'Viceprefekt': {'F': 0, 'M': 1, 'U': 0}}
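
You may have noticed that the titles are now in alphabetical order and that the inner keys appear as F, M, U instead of M, F, U. Nothing is wrong with the data: pprint sorts dictionary keys by default. The sketch below shows this with two entries from our output. The width=40 argument is only there to force line breaks in such a small example, and sort_dicts=False requires Python 3.8 or later.

```python
import pprint

stats = {
    'Professor': {'M': 3, 'F': 4, 'U': 0},
    'Docent': {'M': 1, 'F': 1, 'U': 0},
}

# Default behaviour: keys are sorted at every nesting level.
sorted_text = pprint.pformat(stats, width=40)
print(sorted_text)

# Python 3.8+ only: keep the order in which the keys were inserted.
unsorted_text = pprint.pformat(stats, width=40, sort_dicts=False)
print(unsorted_text)
```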

I think that is good enough. Most of the time we would want to insert this data into a database or save it in some format readable by Excel, so putting more time into pretty printing is almost a waste of time.
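
As a rough idea of what saving for Excel could look like, here is a minimal sketch that turns the stats dictionary into CSV text using the csv module from the standard library. The function name stats_to_csv and the column layout are my own assumptions, not something the script does today.

```python
import csv
import io


def stats_to_csv(stats):
    """Return the stats dict as CSV text, one row per title."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['title', 'M', 'F', 'U'])
    for title, counts in sorted(stats.items()):
        writer.writerow([title, counts['M'], counts['F'], counts['U']])
    return buffer.getvalue()


example = {
    'Professor': {'M': 3, 'F': 4, 'U': 0},
    'Forskare, biträdande': {'M': 0, 'F': 1, 'U': 0},
}
print(stats_to_csv(example))
```

The csv module puts quotes around titles that contain a comma, so 'Forskare, biträdande' still ends up in a single column. To get a file that Excel can open, write the returned text to a file with a name ending in .csv.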

I hope you had fun. This marks the end of this series. I know I also talked about using this program to look at the gender distribution at other GU institutions, but I will leave that as an exercise :) A hint is to change the scrape function to accept a url parameter.