Saphall/Yelp-ETL

Yelp-ETL

The file structure of this repo:

data/                               # Folder containing a link to the Yelp Dataset.
  |_ *.txt 
docs/                               # Folder containing .md files for explanations.      
  |_ images/                        # Screenshots
  |_ *.md
schema/                             # Folder containing SQL queries to create the database.
  |_ *.sql    
src/
  |_ pipeline/                      # Folder containing Python scripts
  |   |_ utils/                     # Folder containing utility modules.
  |   |_ *.py   
  |_ sql/                           # Folder containing DQL/DML.
      |_ procedures/                # Procedural queries   
      |_ queries/                   # SQL queries
      |_ validation_scripts/        # Validation queries

This is a complete ETL (Extract -> Transform -> Load) pipeline for the Yelp Dataset.

Dataset:

The Yelp Dataset is a subset of Yelp's business, review, and user data, available for academic use. It contains 160,585 business, 2,189,457 user, 8,635,403 review, 1,162,119 tip, 138,876 check-in, and 200,000 photo records in JSON format. This data is used for the ETL process in this repo.
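Because the review file alone holds millions of records, it helps to stream the JSON rather than load it all at once. The following is a minimal sketch, assuming each dataset file stores one JSON object per line. The helper names `iter_json_lines` and `count_records` are hypothetical and are not taken from this repo's pipeline code.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a line-delimited JSON file.

    Assumption: the file holds one JSON object per line.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records lazily, without holding the whole file in memory."""
    return sum(1 for _ in iter_json_lines(path))
```

Streaming one record at a time keeps memory use flat regardless of file size, which matters most for the review and user files.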

DataModel:

This is the target data model to be built from the dataset.

ETL:

The main.py file runs this process, letting us choose which parts of it to carry out.
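One way an entry point can let the user choose which parts of an ETL run to execute is a small command-line parser. The sketch below is illustrative only and is not the repo's actual main.py: the `--stages` flag, the `STAGES` table, and the placeholder stage functions are all assumptions made for this example.

```python
import argparse
from typing import Callable, Dict, List, Optional, Sequence


# Hypothetical placeholder stages; a real pipeline would call into its
# own extract/transform/load scripts here.
def extract() -> str:
    return "extract"


def transform() -> str:
    return "transform"


def load() -> str:
    return "load"


# Insertion order defines the execution order: E -> T -> L.
STAGES: Dict[str, Callable[[], str]] = {
    "extract": extract,
    "transform": transform,
    "load": load,
}


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run selected ETL stages (illustrative).")
    parser.add_argument(
        "--stages",
        nargs="+",
        choices=list(STAGES),
        default=list(STAGES),
        help="Stages to run; defaults to all of them.",
    )
    return parser.parse_args(argv)


def run(argv: Optional[Sequence[str]] = None) -> List[str]:
    """Run the chosen stages in pipeline order and return their names."""
    args = parse_args(argv)
    # Iterate over STAGES, not args.stages, so order on the command line does not matter.
    return [STAGES[name]() for name in STAGES if name in args.stages]
```

Under these assumptions, calling `run(["--stages", "extract", "load"])` would execute only the extract and load placeholders, while `run([])` would execute all three in order.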

Visualization:

The visualizations for data analysis and insights were built using Microsoft Power BI.

The complete documentation is in the docs/ folder.