Introduction

Web scraping is used to collect large amounts of information from websites. This documentation shows how to scrape websites, how to store the scraped data in a database, and how to automate and schedule the web scraper scripts so that the data stays up to date.

Version:

1.0.0.

About this App

Python, Selenium, Airflow

Technology

This app is built with Python, Selenium, and Airflow, with PostgreSQL as the database.
http://airflow.apache.org/
https://www.python.org/
https://selenium-python.readthedocs.io/
https://www.postgresql.org/

Getting Started

Job scraping means gathering job posting information online programmatically. This automated way of extracting data from the web makes it possible to build a comprehensive job database by integrating various data sources into one.

Scraping Infrastructure

Add basic diagram

Important: Scraping publicly available data from websites and using it for analysis is generally considered legal, whereas scraping confidential information for profit is not. Even so, most websites use techniques to stop mass scraping. It is important to add sleep delays to the scripts to avoid connection timeouts, which occur when too many requests come from the same IP address. Smaller, targeted scrapes (specific keywords) are also likely to give better results than mass scraping.
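The sleep delay mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function names polite_delay and fetch_all, the 2 to 5 second range, and the fetch callable are all assumptions chosen for the example.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random interval and return the number of seconds waited."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def fetch_all(urls, fetch, min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            polite_delay(min_seconds, max_seconds, sleep)
        results.append(fetch(url))
    return results
```

A randomized delay is used here rather than a fixed one so that requests are not sent at a perfectly regular rhythm. The fetch argument can be any function that loads one page, for example a wrapper around a Selenium driver.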

For example, I searched Indeed with two keywords: what = "Data Science" and where = "Canada". Over 1700 results came up, but after 45 minutes of scraping the results ended with the message "We have removed 352 job postings very similar to those already shown. To see these additional results, you may repeat your search with the omitted job postings included."

Why? Indeed deduplicates search results by location and job title.

https://recruitingblogs.com/m/blogpost

Company Name | Job Title                | Location       | Posting Date
Company A    | Administrative Assistant | Fairfield, OH  | March 8
Company A    | Administrative Assistant | Cincinnati, OH | February 14
Company A    | Software Engineer I      | Cincinnati, OH | February 14
Company A    | Software Engineer IV     | Cincinnati, OH | March 8

If someone searched for Company A's jobs, they would show up similar to this:

Administrative Assistant
Company A - Fairfield, OH - +1 Location

Software Engineer IV
Company A - Cincinnati, OH

We have removed 2 job postings very similar to those already shown. To see these additional results, you may repeat your search with the omitted job postings included.
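Indeed's actual deduplication logic is not public, so the following is only a sketch that reproduces the example above under stated assumptions: postings are grouped by company and by job title with a trailing level (I to V) ignored, the most recent posting in each group is shown, and other locations in the group are counted. The function names and the placeholder year used to compare dates are hypothetical.

```python
import re
from datetime import datetime

postings = [
    {"company": "Company A", "title": "Administrative Assistant", "location": "Fairfield, OH", "posted": "March 8"},
    {"company": "Company A", "title": "Administrative Assistant", "location": "Cincinnati, OH", "posted": "February 14"},
    {"company": "Company A", "title": "Software Engineer I", "location": "Cincinnati, OH", "posted": "February 14"},
    {"company": "Company A", "title": "Software Engineer IV", "location": "Cincinnati, OH", "posted": "March 8"},
]

# Assumption: a trailing level such as I, II, III, IV, or V is ignored
# when deciding whether two titles are "very similar".
LEVEL_SUFFIX = re.compile(r"\s+(?:I{1,3}|IV|V)$")


def normalize_title(title):
    return LEVEL_SUFFIX.sub("", title.strip()).lower()


def posted_date(posting):
    # The table gives no year, so a fixed placeholder year is assumed
    # purely to make the dates comparable.
    return datetime.strptime(posting["posted"] + " 2000", "%B %d %Y")


def collapse_similar(postings):
    """Return (shown postings, number of postings removed as similar)."""
    groups = {}
    for posting in postings:
        key = (posting["company"], normalize_title(posting["title"]))
        groups.setdefault(key, []).append(posting)

    shown = []
    for group in groups.values():
        newest = max(group, key=posted_date)
        other_locations = {p["location"] for p in group} - {newest["location"]}
        shown.append({**newest, "extra_locations": len(other_locations)})
    return shown, len(postings) - len(shown)


shown, removed = collapse_similar(postings)
for job in shown:
    extra = f" - +{job['extra_locations']} Location" if job["extra_locations"] else ""
    print(f"{job['title']} | {job['company']} - {job['location']}{extra}")
print(f"Removed {removed} similar postings")
```

For a scraper, the practical consequence is that the number of postings collected can be lower than the result count shown on the search page.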

Prerequisites

Libraries: psycopg2, SQLAlchemy, pandas, NumPy, BeautifulSoup, Selenium
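Before running the scripts, it can help to confirm that the libraries are installed. The sketch below is an illustration only; the helper name missing_modules is hypothetical, and the list uses each library's import name (BeautifulSoup is imported as bs4).

```python
import importlib.util

# Import names of the libraries listed above.
REQUIRED_MODULES = ["psycopg2", "sqlalchemy", "pandas", "numpy", "bs4", "selenium"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules() returns an empty list when every library is available, and otherwise the names that still need to be installed.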

Scraping and Data Integration

This section shows Python, BeautifulSoup, and Selenium scripts that extract data from websites.

Scraping LinkedIn Public Job Posts

This script shows how to get job post data from LinkedIn.


LinkedIn Scraper
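The original script is not reproduced on this page. As a rough sketch of the approach, the function below loads a results page with a Selenium driver and reads one record per job card. The CSS selectors, the function name scrape_job_cards, and the pause length are hypothetical placeholders; the real page structure has to be inspected and the selectors replaced accordingly.

```python
import time

# "css selector" is the string value of selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"

# Hypothetical selectors: inspect the live page and replace them.
CARD_SELECTOR = "li.job-card"
TITLE_SELECTOR = "h3"
COMPANY_SELECTOR = "h4"
LINK_SELECTOR = "a"


def scrape_job_cards(driver, url, pause_seconds=2.0):
    """Load a search results page and return one dict per job card."""
    driver.get(url)
    time.sleep(pause_seconds)  # give the page time to render
    jobs = []
    for card in driver.find_elements(CSS, CARD_SELECTOR):
        jobs.append({
            "title": card.find_element(CSS, TITLE_SELECTOR).text.strip(),
            "company": card.find_element(CSS, COMPANY_SELECTOR).text.strip(),
            "link": card.find_element(CSS, LINK_SELECTOR).get_attribute("href"),
        })
    return jobs
```

With Selenium installed, a driver can be created with webdriver.Chrome() from the selenium package and passed in as the first argument. Taking the driver as a parameter keeps the function easy to test without opening a browser.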

Data Integration with Psycopg2 - PgSQL

This Python script shows how to pre-clean the scraped data and import it into a PostgreSQL database.


Data Integration with Psycopg2
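The original script is not reproduced on this page. The sketch below illustrates one way the two steps could look: a pre-cleaning pass, then a bulk insert through a DB-API connection such as the one returned by psycopg2.connect(). The table name jobs, its four columns, and the function names are assumptions made for the example.

```python
# Hypothetical column layout of the target table.
FIELDS = ("title", "company", "location", "link")

# psycopg2 uses %s placeholders for query parameters.
INSERT_SQL = "INSERT INTO jobs (title, company, location, link) VALUES (%s, %s, %s, %s)"


def clean_rows(rows):
    """Trim whitespace, drop rows without a title, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = tuple((row.get(field) or "").strip() for field in FIELDS)
        if not record[0] or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def insert_jobs(conn, rows):
    """Clean the rows, insert them in one batch, and return the number inserted."""
    records = clean_rows(rows)
    cur = conn.cursor()
    try:
        cur.executemany(INSERT_SQL, records)
        conn.commit()
    finally:
        cur.close()
    return len(records)
```

Passing the values as parameters, instead of formatting them into the SQL string, lets the driver handle quoting of scraped text that may contain apostrophes or other special characters.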


Get Job Description Details

This Python script shows how to loop through a CSV list of HTML links to get the text of each job description.


Open HTML links and Get data
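The original script is not reproduced on this page. The following sketch shows the general loop: read the links from a CSV file, load each page, and keep only the visible text. To stay dependency-free it uses the standard library's html.parser in place of BeautifulSoup, and it takes the page loader as a fetch_html argument; the column name link and all function names are assumptions for the example.

```python
import csv
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def read_links(csv_file, column="link"):
    """Return the non-empty values of one column from an open CSV file."""
    return [row[column] for row in csv.DictReader(csv_file) if row.get(column)]


def collect_descriptions(links, fetch_html):
    """Fetch each link and return its text; failures are recorded, not raised."""
    results = []
    for link in links:
        try:
            text = html_to_text(fetch_html(link))
            results.append({"link": link, "description": text})
        except Exception as exc:
            results.append({"link": link, "description": None, "error": str(exc)})
    return results
```

Recording a failed link instead of stopping means one broken page does not end a long run; the failed links can be retried later.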


Data Mining and NLP
