Sometimes we want to pull data from a website into an app or script we are building. Say you follow the ‘Upcoming Events’ list on the Python website, but you don’t want to visit the site every time just to check whether a new event has been posted. To handle this, imagine you are writing a Python bot that sends you an email whenever a new upcoming event is published on the Python website. Normally we would solve this with an API: send a GET request to an endpoint and get structured data back. But what if the site doesn’t offer a public API we can query? This is where web scraping comes in.
Using web scraping techniques we can download the HTML source code of a web page, and if that source contains the data we need, such as text or links, we can parse it with a few lines of code.
Let’s scrape and parse some specific data from the Python website, as mentioned in the first paragraph.
First, here is the full code; then I will explain how each line works, step by step.
In Python there are several libraries for web scraping, notably ‘lxml’ and ‘BeautifulSoup’. I preferred lxml for this project. If you are curious about ‘BeautifulSoup’, check out its documentation for more details.
def get_data(url, title_selector):
In the function definition we pass two parameters.
url: The URL of the website we want to scrape.
title_selector: A CSS selector that picks out the specific tags we are interested in.
response = requests.get(url)
We make a GET request to the website and get a response object.
if response.status_code == 200:
In this line we check if the request has succeeded.
content = str(response.content, 'utf-8')
content now holds the HTML source as a string. This is what we’ll parse.
Note: Not every website is served with UTF-8 encoding. If a page uses a different encoding and we decode it as ‘utf-8’ anyway, we may get strange characters in the result. Make sure the encoding you pass here matches the one the site actually uses.
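To see why the encoding matters, decode the same bytes with two different encodings; a small sketch:

```python
# 'ü' takes two bytes in UTF-8; decoding those bytes as Latin-1
# turns them into two unrelated characters instead.
raw = 'Düsseldorf'.encode('utf-8')

print(str(raw, 'utf-8'))    # Düsseldorf
print(str(raw, 'latin-1'))  # DÃ¼sseldorf
```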
tree = html.fromstring(content)
The html.fromstring function parses the HTML and returns a single element/document.
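For example, parsing a small fragment returns the element itself:

```python
from lxml import html

# A single-element fragment parses to that element
tree = html.fromstring('<p>Hello</p>')
print(tree.tag)  # p
```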
titles = tree.cssselect(title_selector)
Using the CSS selector, we search the tree and collect the upcoming event titles. After you see the output, I’ll make this parameter clearer.
for title in titles:
print(title.text_content())
Here we iterate over all the upcoming event titles with a for loop. The text_content() method gives us the text inside a tag. We could also read an attribute (like ‘href’) through the attrib dictionary, e.g. attrib["href"]. For more information, see the lxml documentation.
Now, it’s time to invoke our function.
get_data(
'https://www.python.org',
'div.event-widget > div.shrubbery > ul.menu > li > a'
)
The output will be the list of upcoming event titles, one per line.
We got the result correctly.
Now let’s make the CSS selector clearer. First, go to the Python website. When you scroll down a little, you’ll see a section called ‘Upcoming Events’.
These are the titles that appeared in our output. Right-click on one of the titles and then click ‘Inspect’.
As you can see, the titles sit inside several nested HTML elements. That’s why our selector doesn’t just match the <a> tag on its own: the page contains many different <a> tags with different texts, so we used a more specific selector that matches only the ‘Upcoming Events’ section.
Let’s remember the selector we passed to our function earlier. It looked like this:
'div.event-widget > div.shrubbery > ul.menu > li > a'
In general, CSS selectors are used by web developers and designers to style elements. If you are unsure how CSS selectors work, MDN’s CSS selectors guide is a good place to learn more.
Note: Web scraping has a drawback. When the website owner rebuilds the site or changes its markup, our selector will stop matching and we’ll have to update it. Even with that disadvantage, web scraping is still a good option when there is no API to get the data from.
I hope this article was useful for you. Thanks for reading.