Web scraping is used to collect large amounts of information from websites. This documentation shows how to scrape websites, store the results in a database, and automate and schedule the scraping scripts so the data stays up to date.
1.0.0.
Python, Selenium, Airflow
This app is built with Python, Selenium, Airflow, and PostgreSQL.
http://airflow.apache.org/
https://www.python.org/
https://selenium-python.readthedocs.io/
https://www.postgresql.org/
Add basic diagram
Important:
It is generally legal to scrape publicly available data from websites and use it for analysis. However, scraping confidential information for profit is not.
However, most websites use tricks to stop mass scraping. It is important to add sleep calls to the scripts to avoid connection timeouts, which happen when too many requests come from the same IP (see the sketch after the example below). It is also likely that narrow searches on specific keywords give better results than mass scraping.
For example, I searched two keywords on the Indeed site: what = "Data Science" and where = "Canada". Over 1,700 results came up, but after 45 minutes of scraping, the run ended with: "We have removed 352 job postings very similar to those already shown. To see these additional results, you may repeat your search with the omitted job postings included."
How come? Indeed deduplicates your search results by location and job title.
https://recruitingblogs.com/m/blogpost
Company Name | Job Title | Location | Posting Date |
---|---|---|---|
Company A | Administrative Assistant | Fairfield, OH | March 8 |
Company A | Administrative Assistant | Cincinnati, OH | February 14 |
Company A | Software Engineer I | Cincinnati, OH | February 14 |
Company A | Software Engineer IV | Cincinnati, OH | March 8 |
If someone searched for Company A's jobs, the results would show up like this:
Administrative Assistant
Company A - Fairfield, OH - +1 Location
Software Engineer IV
Company A - Cincinnati, OH
We have removed 2 job postings very similar to those already shown. To see these additional results, you may repeat your search with the omitted job postings included.
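As noted above, spacing out requests is the main defence against timeouts and blocks. Here is a minimal sketch of a polite, randomized delay between page loads (the `polite_get` helper is illustrative, not part of the scripts below):

```python
import random
import time

def polite_get(driver, url, min_delay=3.0, max_delay=8.0):
    """Load a URL, then pause for a random interval so requests
    from the same IP do not arrive in a rapid, mechanical burst."""
    driver.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
```

A randomized interval looks less robotic than a fixed sleep; the 3-8 second range is an assumption to tune per site.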
Libraries: psycopg2 - sqlalchemy - pandas - numpy - BeautifulSoup - Selenium
This section shows Python, BeautifulSoup, and Selenium scripts to extract data from websites.
This script shows how to get job-posting data from LinkedIn.
```python
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 27 16:59:29 2021
@author: Simon Renauld

Downloads and sources
https://chromedriver.chromium.org/downloads
https://www.guru99.com/xpath-selenium.html#1
https://maoviola.medium.com/a-complete-guide-to-web-scraping-linkedin-job-postings-ad290fcaa97f
"""
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.linkedin.com/jobs/search?keywords=Analytics&location=Vietnam&geoId=104195383&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
wd = webdriver.Chrome(executable_path=r'C:/Users/bin/chromedriver')
wd.get(url)

# Get the number of jobs for this search and location
no_of_jobs = wd.find_element_by_css_selector('h1>span').get_attribute('innerText')
no_of_jobs = no_of_jobs.translate({ord('+'): None})
no_of_jobs = no_of_jobs.translate({ord(','): None})
no_of_jobs = int(no_of_jobs)

############# Scroll the page down to load more results (25 jobs per page)
i = 2
while i <= int(no_of_jobs / 25) + 1:
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    i = i + 1
    # Click the "See more jobs" button when it appears
    try:
        wd.find_element_by_xpath('/html/body/div/div/main/section/button').click()
        time.sleep(5)
    except NoSuchElementException:
        pass
    time.sleep(5)
    if i > 60:
        break
    ############# Break the loop if no new results are loaded
    try:
        wd.find_element_by_class_name('jobs-search__results-list')
    except NoSuchElementException:
        break

job_lists = wd.find_element_by_class_name('jobs-search__results-list')
jobs = job_lists.find_elements_by_tag_name('li')  # returns a list
print(len(jobs))

# Get the fields from each job search card
job_id = []
job_title = []
company_name = []
location = []
date = []
job_link = []
for job in jobs:
    job_id0 = job.find_element_by_css_selector('div').get_attribute('data-tracking-id')
    job_id.append(job_id0)
    job_title0 = job.find_element_by_css_selector('h3').get_attribute('innerText')
    job_title.append(job_title0)
    company_name0 = job.find_element_by_css_selector('h4').get_attribute('innerText')
    company_name.append(company_name0)
    location0 = job.find_element_by_css_selector('[class="job-search-card__location"]').get_attribute('innerText')
    location.append(location0)
    date0 = job.find_element_by_css_selector('div>div>time').get_attribute('datetime')
    date.append(date0)
    job_link0 = job.find_element_by_css_selector('a').get_attribute('href')
    job_link.append(job_link0)
time.sleep(15)

"""
# Optional: get the detailed job descriptions by clicking each card
jd = []
seniority = []
emp_type = []
job_func = []
industries = []
for item in range(len(jobs)):
    job_func0 = []
    industries0 = []
    # Click the job card to view the job details
    job_click_path = f'/html/body/div/div/main/section/ul/li/div/a[{item+1}]'
    wd.find_element_by_xpath(job_click_path).click()
    time.sleep(5)
    jd_path = '/html/body/main/section/div[2]/section[2]/div'
    jd0 = wd.find_element_by_xpath(jd_path).get_attribute('innerText')
    jd.append(jd0)
    seniority_path = '/html/body/main/section/div[2]/section[2]/ul/li[1]/span'
    seniority0 = wd.find_element_by_xpath(seniority_path).get_attribute('innerText')
    seniority.append(seniority0)
    emp_type_path = '/html/body/main/section/div[2]/section[2]/ul/li[2]/span'
    emp_type0 = wd.find_element_by_xpath(emp_type_path).get_attribute('innerText')
    emp_type.append(emp_type0)
    job_func_path = '/html/body/main/section/div[2]/section[2]/ul/li[3]/span'
    job_func_elements = wd.find_elements_by_xpath(job_func_path)
    for element in job_func_elements:
        job_func0.append(element.get_attribute('innerText'))
    job_func.append(', '.join(job_func0))
    industries_path = '/html/body/main/section/div[2]/section[2]/ul/li[4]/span'
    industries_elements = wd.find_elements_by_xpath(industries_path)
    for element in industries_elements:
        industries0.append(element.get_attribute('innerText'))
    industries.append(', '.join(industries0))
"""

job_data = pd.DataFrame({'ID': job_id,
                         'Date': date,
                         'Company': company_name,
                         'Title': job_title,
                         'Location': location,
                         'Link': job_link})

# Save the results to CSV
job_data.to_csv(r'.\outputs\analyticsvietnam4.csv', encoding='utf-8-sig')
```
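Note: the `find_element_by_*` helpers and the `executable_path` argument used above are Selenium 3 idioms that were removed in Selenium 4. On a recent Selenium, the equivalent calls look roughly like this (a sketch using the same selectors as above):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: a Service object instead of executable_path,
# and find_element(By..., ...) instead of find_element_by_*
wd = webdriver.Chrome(service=Service(r'C:/Users/bin/chromedriver'))
wd.get(url)
no_of_jobs = wd.find_element(By.CSS_SELECTOR, 'h1>span').get_attribute('innerText')
job_lists = wd.find_element(By.CLASS_NAME, 'jobs-search__results-list')
```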
This Python script shows how to pre-clean the scraped data and import it into a PostgreSQL database.
```python
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 31 20:15:22 2021
@author: renau
"""
# Import libraries
import glob
import os
import time

import pandas as pd
import psycopg2  # PostgreSQL driver used by SQLAlchemy
from sqlalchemy import create_engine

################################################################# Delete the previous merged file
# (so the old merged output is not merged into itself below)
file = r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsall.csv'
if os.path.exists(file) and os.path.isfile(file):
    os.remove(file)
    print("file deleted")
else:
    print("file not found")

########################################################### Merge all CSVs from the LinkedIn output folder
path = r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs'  # use your path
all_files = glob.glob(path + "/*.csv")
full_list_linkedin = []
for filename in all_files:
    df = pd.read_csv(filename)
    full_list_linkedin.append(df)
frame = pd.concat(full_list_linkedin, axis=0, ignore_index=True)
print(frame.head())
frame.rename(columns={'Unnamed: 0': 'auto_id'}, inplace=True)
for col in frame.columns:
    print(col)

########################################################### Pre-cleaning
# Drop the helper columns and remove duplicate links
frame = frame.drop(columns=['auto_id', 'ID'], errors='ignore')
frame = frame.drop_duplicates(subset=['Link'])
# Lower-case all job titles
frame['Title'] = frame['Title'].str.lower()
print(frame['Title'].head())
# Remove rows whose title contains unwanted keywords
frame = frame[~frame['Title'].str.contains('data entry', na=False)]
print(frame.info())

# Output the merged frame to CSV
frame.to_csv(r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsall.csv', encoding='utf-8-sig')

################################################################# Load into PostgreSQL
# (ideally inserting only new rows; see the note after this script)
# INPUT YOUR OWN CONNECTION STRING HERE
conn_string = 'postgresql://postgres:admin@localhost/postgres'
db = create_engine(conn_string)
conn = db.connect()
start_time = time.time()
frame.to_sql('Linkedinjob', con=conn, if_exists='append', index=False)
print("to_sql duration: {} seconds".format(time.time() - start_time))
# Close the connection
conn.close()
```
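`if_exists='append'` on its own re-inserts every row on every run, so the table accumulates duplicates. To actually insert only new rows, one option is to filter against the links already stored before calling `to_sql`. A sketch, assuming the `Linkedinjob` table already exists and `Link` uniquely identifies a posting:

```python
import pandas as pd

# Fetch the links already stored, then append only the unseen rows.
existing = pd.read_sql('SELECT "Link" FROM "Linkedinjob"', con=conn)
new_rows = frame[~frame['Link'].isin(existing['Link'])]
new_rows.to_sql('Linkedinjob', con=conn, if_exists='append', index=False)
print("inserted {} new rows".format(len(new_rows)))
```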
This Python script shows how to loop through a CSV list of HTML links and get the full text of each job description.
```python
import csv
import time

import pandas as pd
from selenium import webdriver

browser = webdriver.Chrome(executable_path=r'C:/Users/bin/chromedriver')

contents = []
descriptions = []

# Open the HTML links from the CSV (column 5 holds the link,
# which should be something like "https://www....")
with open(r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsalltest.csv', encoding="utf8") as cp_csv:
    cp_url = csv.reader(cp_csv)
    next(cp_url)  # skip the header row
    for row in cp_url:
        contents.append(row[5])

for link in contents:
    browser.get(link)
    time.sleep(3)
    # Expand the description by clicking "Show more" if present
    try:
        browser.find_element_by_xpath("//button[contains(text(),'Show more')]").click()
    except Exception:
        pass
    try:
        description = browser.find_element_by_id("main-content").text  # adapt to your HTML page
    except Exception:
        description = "None"
    descriptions.append(description)

df_da = pd.DataFrame()
df_da['Description'] = descriptions
df_da['Link'] = contents

# Delete all text after these markers (page chrome below the description)
df_da['Description'] = df_da['Description'].str.split('Similar jobs').str[0]
df_da['Description'] = df_da['Description'].str.split('Off').str[0]
df_da.to_csv(r'C:\Users\renau\OneDrive\02-Data Projects\01-Data-Engineering\webscrapping\linkedin\outputs\outputsalltestdescriptions.csv', encoding='utf-8-sig')
```
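To automate the pipeline, the three scripts above can be scheduled with Airflow so the data stays up to date. A minimal DAG sketch, assuming each script is wrapped in a callable (`scrape_jobs`, `clean_and_load`, and `fetch_descriptions` are illustrative names, not functions defined above):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables: in practice, wrap each script above
# in a function and import it here.
def scrape_jobs(): ...
def clean_and_load(): ...
def fetch_descriptions(): ...

with DAG(
    dag_id='linkedin_scraper',
    start_date=datetime(2021, 11, 1),
    schedule_interval='@daily',  # rescrape daily to keep the data fresh
    catchup=False,
    default_args={'retries': 1, 'retry_delay': timedelta(minutes=10)},
) as dag:
    scrape = PythonOperator(task_id='scrape_jobs', python_callable=scrape_jobs)
    load = PythonOperator(task_id='clean_and_load', python_callable=clean_and_load)
    details = PythonOperator(task_id='fetch_descriptions', python_callable=fetch_descriptions)

    # Run scraping, then cleaning/loading, then description fetching
    scrape >> load >> details
```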
Trademarks are protected by intellectual property rights.