Introduction

Web scraping is used to collect large amounts of information from websites. This documentation shows how to scrape websites, how to store the results in a database, and how to automate and schedule the scraping scripts so the data stays up to date.

Version:

1.0.0

About this App

Python, Selenium, Airflow

Technology

This app is built with Python, Selenium, Airflow, and PostgreSQL.
http://airflow.apache.org/
https://www.python.org/
https://selenium-python.readthedocs.io/
https://www.postgresql.org/

Getting Started

Job scraping is the programmatic collection of job-posting information from the web. Extracting this data automatically makes it possible to build a rich job database by integrating several data sources into one.

Scraping Infrastructure

(TODO: add basic diagram)

Important: Scraping publicly available data for analysis is generally legal, but scraping confidential information for profit is not. Most websites also deploy countermeasures to "stop" mass scraping, so it is important to add sleep delays to your scripts to avoid connection timeouts, which occur when too many requests come from the same IP. In practice, smaller scrapes with specific keywords also tend to give better results than mass scraping.
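A minimal sketch of such a delay helper, assuming a Selenium webdriver (the delay bounds are illustrative values, not ones taken from the scripts below):

import random
import time

def polite_get(wd, url, min_delay=3.0, max_delay=8.0):
    """Load a URL, then pause for a randomized interval.

    min_delay and max_delay are illustrative; tune them per site so
    requests from one IP stay below its rate limits.
    """
    wd.get(url)
    # A randomized pause looks less bot-like than a fixed interval.
    time.sleep(random.uniform(min_delay, max_delay))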

For example, I searched two keywords on Indeed: what = "Data Science" and where = "Canada". Over 1,700 results came up, but after scraping for 45 minutes the search ended with: "We have removed 352 job postings very similar to those already shown. To see these additional results, you may repeat your search with the omitted job postings included."

Why? Indeed deduplicates search results by location and job title.

https://recruitingblogs.com/m/blogpost

Company Name | Job Title                | Location       | Posting Date
Company A    | Administrative Assistant | Fairfield, OH  | March 8
Company A    | Administrative Assistant | Cincinnati, OH | February 14
Company A    | Software Engineer I      | Cincinnati, OH | February 14
Company A    | Software Engineer IV     | Cincinnati, OH | March 8

If someone searched for Company A's jobs, the results would display like this:

Administrative Assistant
Company A - Fairfield, OH - +1 Location

Software Engineer IV
Company A - Cincinnati, OH

We have removed 2 job postings very similar to those already shown. To see these additional results, you may repeat your search with the omitted job postings included.
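Indeed's exact similarity rules are not public; as a rough sketch, deduplication keyed on company and title (an assumption) can be reproduced in pandas on the sample table above, collapsing extra locations the way the "+1 Location" display does:

import pandas as pd

# The sample postings from the table above.
postings = pd.DataFrame({
    'Company': ['Company A'] * 4,
    'Title': ['Administrative Assistant', 'Administrative Assistant',
              'Software Engineer I', 'Software Engineer IV'],
    'Location': ['Fairfield, OH', 'Cincinnati, OH',
                 'Cincinnati, OH', 'Cincinnati, OH'],
})

# Keep the first posting per (Company, Title) and count collapsed locations.
grouped = postings.groupby(['Company', 'Title'], as_index=False).agg(
    Location=('Location', 'first'),
    n_locations=('Location', 'nunique'),
)
print(grouped)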

Prerequisites

Libraries: psycopg2 - SQLAlchemy - pandas - NumPy - BeautifulSoup - Selenium
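These can be installed with pip (psycopg2-binary is the precompiled wheel commonly installed in place of psycopg2):

pip install psycopg2-binary sqlalchemy pandas numpy beautifulsoup4 selenium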

Scraping and Data Integration

This section shows Python, BeautifulSoup, and Selenium scripts to extract data from websites.

Scraping LinkedIn Public Job Posts

This script shows how to get job-post data from LinkedIn.


LinkedIn Scraper

# -*- coding: utf-8 -*-
"""
Created on Wed Oct 27 16:59:29 2021
@author: Simon Renauld

Download and sources
https://chromedriver.chromium.org/downloads
https://www.guru99.com/xpath-selenium.html#1
https://maoviola.medium.com/a-complete-guide-to-web-scraping-linkedin-job-postings-ad290fcaa97f
"""
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.linkedin.com/jobs/search?keywords=Analytics&location=Vietnam&geoId=104195383&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
wd = webdriver.Chrome(executable_path=r'C:/Users/bin/chromedriver')
wd.get(url)
"""
Get Number of jobs per link search and location
"""
no_of_jobs = wd.find_element_by_css_selector('h1>span').get_attribute('innerText')
no_of_jobs = no_of_jobs.translate({ord('+'): None})
no_of_jobs = no_of_jobs.translate({ord(','): None})
no_of_jobs = int(no_of_jobs)
############# Scrool Page down to get more results
i = 2
while i <= int(no_of_jobs/25)+1:
wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
i = i + 1
try:
wd.find_element_by_xpath('/html/body/div/div/main/section/button').click()
time.sleep(5)
except:
pass
time.sleep(5)
if i > 60:
break
############# Break lope if no new results
try:
wd.find_element_by_class_name('jobs-search__results-list')
except NoSuchElementException:
break
job_lists = wd.find_element_by_class_name('jobs-search__results-list')
jobs = job_lists.find_elements_by_tag_name('li') # return a list
print(len(jobs))
# Get the job search card elements.
job_id = []
job_title = []
company_name = []
location = []
date = []
job_link = []
for job in jobs:
    job_id0 = job.find_element_by_css_selector('div').get_attribute('data-tracking-id')
    job_id.append(job_id0)
    job_title0 = job.find_element_by_css_selector('h3').get_attribute('innerText')
    job_title.append(job_title0)
    company_name0 = job.find_element_by_css_selector('h4').get_attribute('innerText')
    company_name.append(company_name0)
    location0 = job.find_element_by_css_selector('[class="job-search-card__location"]').get_attribute('innerText')
    location.append(location0)
    date0 = job.find_element_by_css_selector('div>div>time').get_attribute('datetime')
    date.append(date0)
    job_link0 = job.find_element_by_css_selector('a').get_attribute('href')
    job_link.append(job_link0)
time.sleep(15)
"""
#Get the detailed job descriptions
jd = []
seniority = []
emp_type = []
job_func = []
industries = []
for item in range(len(jobs)):
job_func0=[]
industries0=[]
# clicking job to view job details
job_click_path = f'/html/body/div/div/main/section/ul/li/div/a[{item+1}]'
job_click = job.find_element_by_xpath(job_click_path).click()
time.sleep(5)
jd_path = '/html/body/main/section/div[2]/section[2]/div'
jd0 = job.find_element_by_xpath(jd_path).get_attribute('innerText')
jd.append(jd0)
seniority_path = '/html/body/main/section/div[2]/section[2]/ul/li[1]/span'
seniority0 = job.find_element_by_xpath(seniority_path).get_attribute('innerText')
seniority.append(seniority0)
emp_type_path = '/html/body/main/section/div[2]/section[2]/ul/li[2]/span'
emp_type0 = job.find_element_by_xpath(emp_type_path).get_attribute('innerText')
emp_type.append(emp_type0)
job_func_path = '/html/body/main/section/div[2]/section[2]/ul/li[3]/span'
job_func_elements = job.find_elements_by_xpath(job_func_path)
for element in job_func_elements:
job_func0.append(element.get_attribute('innerText'))
job_func_final = ', '.join(job_func0)
job_func.append(job_func_final)
industries_path = '/html/body/main/section/div[2]/section[2]/ul/li[4]/span'
industries_elements = job.find_elements_by_xpath(industries_path)
for element in industries_elements:
industries0.append(element.get_attribute('innerText'))
industries_final = ', '.join(industries0)
industries.append(industries_final)
"""
# Collect the results into a DataFrame and export to CSV.
job_data = pd.DataFrame({'ID': job_id,
                         'Date': date,
                         'Company': company_name,
                         'Title': job_title,
                         'Location': location,
                         'Link': job_link})
job_data.to_csv(r'.\outputs\analyticsvietnam4.csv', encoding='utf-8-sig')
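Note: the script above uses the Selenium 3 API. Selenium 4 removed executable_path and the find_element_by_* helpers; a sketch of the equivalent Selenium 4 calls (the driver path is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 wraps the chromedriver path in a Service object.
wd = webdriver.Chrome(service=Service(r'C:/Users/bin/chromedriver'))

# find_element_by_css_selector('h1>span') becomes:
no_of_jobs = wd.find_element(By.CSS_SELECTOR, 'h1>span').get_attribute('innerText')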

Data Integration with Psycopg2 - PgSQL

This Python script shows how to pre-clean the scraped data and import it into a PostgreSQL database.


Data Integration with Psycopg2

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 31 20:15:22 2021
@author: renau
"""
# Import libraries
import glob
import os
import time

import pandas as pd
import psycopg2  # PostgreSQL driver used by SQLAlchemy
from sqlalchemy import create_engine

#################################################################
# Delete the previous merged file if it exists.
file = r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsall.csv'
if os.path.exists(file) and os.path.isfile(file):
    os.remove(file)
    print("file deleted")
else:
    print("file not found")
###########################################################
# Merge all CSVs from the LinkedIn output folder.
path = r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs'  # use your path
all_files = glob.glob(path + "/*.csv")
full_list_linkedin = []
for filename in all_files:
    df = pd.read_csv(filename)
    full_list_linkedin.append(df)
frame = pd.concat(full_list_linkedin, axis=0, ignore_index=True)
print(frame.head())
frame.rename(columns={'Unnamed: 0': 'auto_id'}, inplace=True)
for col in frame.columns:
    print(col)
###########################################################
# Pre-cleaning: drop unneeded columns and duplicate links.
frame = frame.drop(frame.columns[[0, 1, 6]], axis=1)  # df.columns is a zero-based pd.Index
frame = frame.drop(frame.columns[[5]], axis=1)
frame = frame.drop_duplicates(subset=['Link'])
# Lowercase all job titles.
frame['Title'] = frame['Title'].str.lower()
print(frame['Title'].head())
# Remove rows whose Title is exactly 'data entry'.
frame = frame[~frame['Title'].isin(['data entry'])]
print(frame.info())
# Output the merged frame to CSV.
frame.to_csv(r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsall.csv', encoding='utf-8-sig')
#################################################################
# Load into PostgreSQL.
# INPUT YOUR OWN CONNECTION STRING HERE
# (newer SQLAlchemy versions require the 'postgresql://' scheme, not 'postgres://')
conn_string = 'postgresql://postgres:admin@localhost/postgres'
db = create_engine(conn_string)
conn = db.connect()
start_time = time.time()
frame.to_sql('Linkedinjob', con=conn, if_exists='append', index=False)
print("to_sql duration: {} seconds".format(time.time() - start_time))
# Close the connection.
conn.close()
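Note that if_exists='append' re-inserts every scraped row on each run. To insert only rows whose Link is not already stored, one approach (a sketch, reusing frame and conn from the script above and assuming the table already exists) is to anti-join against the stored links before calling to_sql:

import pandas as pd

# Fetch the links already stored, then keep only unseen rows.
existing = pd.read_sql('SELECT "Link" FROM "Linkedinjob"', con=conn)
new_rows = frame[~frame['Link'].isin(existing['Link'])]
new_rows.to_sql('Linkedinjob', con=conn, if_exists='append', index=False)
print("inserted {} new rows".format(len(new_rows)))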

Get Job Description Details

This Python script shows how to loop through a CSV list of HTML links and extract the text of each job description.


Open HTML Links and Get Data

import csv
import time

import pandas as pd
from selenium import webdriver

browser = webdriver.Chrome(executable_path=r'C:/Users/bin/chromedriver')
contents = []
descriptions = []

# Open the HTML links from the CSV (column 5 holds the job link).
with open(r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsalltest.csv', encoding="utf8") as cp_csv:
    cp_url = csv.reader(cp_csv)
    next(cp_url)  # skip the header row
    for row in cp_url:
        link = row[5]
        contents.append(link)

# Each link should be something like "https://www...."
for link in contents:
    browser.get(link)
    time.sleep(3)
    try:
        # Expand the full description if a "Show more" button is present.
        show_more = browser.find_element_by_xpath("//button[contains(text(),'Show more')]")
        show_more.click()
    except:
        pass
    try:
        description = browser.find_element_by_id("main-content").text  # adapt to your HTML page
    except:
        description = "None"
    descriptions.append(description)

df_da = pd.DataFrame()
df_da['Description'] = descriptions
df_da['Link'] = contents
# Delete all text after these marker strings.
df_da['Description'] = df_da['Description'].str.split('Similar jobs').str[0]
df_da['Description'] = df_da['Description'].str.split('Off').str[0]
df_da.to_csv(r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsalltestdescriptions.csv', encoding='utf-8-sig')
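The introduction mentions scheduling these scripts with Airflow to keep the data up to date. A minimal, hedged DAG sketch for a daily run (the dag_id, task ids, and script paths are placeholders, not this project's actual files):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run the scraper, then the PostgreSQL load, once a day.
with DAG(
    dag_id='linkedin_scraper',
    start_date=datetime(2021, 11, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    scrape = BashOperator(
        task_id='scrape_linkedin',
        bash_command='python /path/to/linkedin_scraper.py',
    )
    load = BashOperator(
        task_id='load_postgres',
        bash_command='python /path/to/load_postgres.py',
    )
    scrape >> load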
