Job searching is a tedious process: you wade through numerous job pages and lengthy job descriptions and summaries. In this article, I will try to make that process a bit easier for you using Python's Beautiful Soup and word cloud libraries.
We will extract information such as the job title, company name, location, job summary and description. All of this information will be converted into a data frame to provide a concise view of all the job descriptions in one place, and later on we will put that information into word clouds.
Web Scraping using BeautifulSoup
Web scraping is the process of extracting data from websites. Web scraping a web page involves fetching the page and extracting data from it. Web pages are built using text-based mark-up languages such as HTML and XHTML. Beautiful Soup is a Python library for pulling data out of such HTML and XML files.
We are going to extract the following information from one of the most popular job search sites – indeed.com:
Job title
Name of the company
Job location
Job summary and description
We will access all this information through HTML tags and their attributes, such as class. We will follow these steps to access the information above:
Gather the URLs of all the result pages (numbered 1, 2, 3 and so on at the bottom of each page)
Access each URL and extract its text
Store the extracted data
To gather the successive pages containing the job postings, we are going to run a search for Data Scientist. The URL you get in the browser is:
Now click on page 2, page 3 and so on, and observe the URL of each successive page you click:
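Observing how the URL changes from page to page lets us build all the page URLs up front. A minimal sketch of that idea, assuming the search term goes in a q parameter and the result offset in a start parameter that grows in steps of 10 (check these against the URLs you actually see in your browser):

```python
# Build the URLs of successive result pages.
# Assumption: "q" carries the search term and "start" the result
# offset (0, 10, 20, ...) -- verify against the real URLs.
from urllib.parse import urlencode

base_url = "https://www.indeed.com/jobs"

def page_url(query, page):
    """Return the URL for a given results page (page 0 is the first)."""
    params = {"q": query, "start": page * 10}
    return base_url + "?" + urlencode(params)

urls = [page_url("Data Scientist", p) for p in range(5)]
print(urls[0])  # → https://www.indeed.com/jobs?q=Data+Scientist&start=0
```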
Load all the required libraries and fetch the URLs of the first and successive pages. Next, we will visit each page and extract its content. We do that by calling urlopen().read() from Python's urllib.request module and parsing the page by passing the content to the BeautifulSoup constructor.
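The fetch-and-parse step can be sketched as below. fetch_page() downloads one page with urlopen().read(); the inline HTML snippet stands in for a downloaded page so the parsing step can be demonstrated without hitting the network (the div/class markup shown is a made-up example, not indeed.com's actual HTML):

```python
# Fetch a page's raw HTML and parse it with BeautifulSoup.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_page(url):
    # Download the raw HTML of one results page.
    return urlopen(url).read()

# Stand-in for a fetched page, so parsing can be shown offline.
sample_html = "<html><body><div class='title'>Data Scientist</div></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.find("div", class_="title").text)  # → Data Scientist
```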
Fetch and store the URLs of all pages and parse their content
Now that we have the content of all the pages, we are going to find and store the information related to the job title, job location and name of the company. To do so, right-click on a job listing and click Inspect to open the page source, then fetch the highlighted tags and their classes as shown in the code below.
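A sketch of that extraction, followed by the data-frame step mentioned earlier. The tag and class names used here ("title", "company", "location") and the sample markup are assumptions for illustration; indeed.com changes its HTML over time, so match them against what you see in the inspector:

```python
# Pull job title, company and location out of one parsed page,
# then collect everything into a DataFrame for a concise view.
# NOTE: the markup and class names below are illustrative assumptions.
from bs4 import BeautifulSoup
import pandas as pd

sample_html = """
<div class="row">
  <div class="title"><a href="/rc/clk?jk=123">Data Scientist</a></div>
  <span class="company">Acme Corp</span>
  <span class="location">New York, NY</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

titles = [t.text.strip() for t in soup.find_all("div", class_="title")]
companies = [c.text.strip() for c in soup.find_all("span", class_="company")]
locations = [l.text.strip() for l in soup.find_all("span", class_="location")]

df = pd.DataFrame({"title": titles, "company": companies, "location": locations})
```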
In order to get the job description for each job posting, we will find all the div tags with class title, and extract all the href links.
Append “https://www.indeed.com” to the beginning of each link because, in the page source, all the hrefs are relative. If you open one of these href links, you will see the job listing in detail.
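Rather than concatenating strings by hand, urljoin() from the standard library handles this prepending step safely. The href below is a made-up example of the relative links found inside the div tags with class title:

```python
# Turn relative href links into absolute indeed.com URLs.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<div class="title"><a href="/rc/clk?jk=abc123">Data Scientist</a></div>'
soup = BeautifulSoup(html, "html.parser")

# For every div.title, collect its <a href> links as absolute URLs.
links = [urljoin("https://www.indeed.com", a["href"])
         for div in soup.find_all("div", class_="title")
         for a in div.find_all("a", href=True)]
print(links[0])  # → https://www.indeed.com/rc/clk?jk=abc123
```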
Visualise with Word Cloud
We now have all the information we wanted for each job posting. Next, we will create word clouds using Python's wordcloud package and visualise them with the matplotlib library.
Word Cloud is a visual depiction of text data in which the size of each word indicates its frequency or importance. For instance, if we plot the Word cloud of job titles, the most common titles are displayed in a bigger size.
[Figure: word cloud of job titles]
[Figure: word cloud of hiring companies]
[Figure: word cloud of the locations of jobs posted @indeed]
[Figure: word cloud of job descriptions]
This is the simplest way to generate word clouds. There is a lot more you can do from here: play with different shape masks for the word cloud, or use advanced methods such as TF-IDF weighting.
Hope you liked reading this article.