DATA CLEANING & VISUALIZATIONS: General Assembly Final Project
Summary: End-to-End Data Cleaning, Analysis, and Visualizations on Kaggle.com's Grupo Bimbo Inventory Demand Challenge
Last September, I completed General Assembly's 10-week Data Science course where I learned industry-leading data science techniques. This course helped me develop fluency in Python, R, and Unix programming and increased my knowledge of data modeling, analysis, and visualization techniques.
For our final project, we were required to develop a data science project. Because of my interest in logistics, I chose to explore the Grupo Bimbo Inventory Demand Challenge dataset posted on Kaggle.com (a data science learning and collaboration website). Grupo Bimbo, a multinational baked goods company, is Mexico's 9th largest company and owns American brands like Sara Lee and Entenmann's. The challenge asked competitors to develop a model that could accurately predict the weekly net units sold to each of its clients, so that Grupo Bimbo could avoid sending out excess product.
Given the scope and difficulty of working on a time-series dataset with over 74 million observations, I used this opportunity to build an end-to-end data cleaning and analysis project.
Please view my code on GitHub.
Methodology:
Tools Used:
- Data Analysis: iPython Notebooks, R Markdown
- Python Libraries: pandas, numpy, sklearn, matplotlib, seaborn
- R Libraries: readr, dplyr, ggplot2, scales, treemap
Data Preparation:
Initial Challenges:
- Data Constraints: The Train data set contained over 74 million observations, which was too large to use in an iPython notebook.
- Translation: Upon reading in the datasets and inspecting their variables, I realized that the column names, the nearly 1 million client names, and the nearly 2,500 product names were all in Spanish!
Project Approach:
To address the data constraints and contain the scope of the project, I took a random sample of 0.1% of the data. This sample contains 74,810 observations spread over seven weeks.
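The sampling step can be sketched as follows. This is a minimal, dependency-free illustration and not the code from my notebook: it assumes the file is streamed row by row and each row is kept with probability equal to the sampling fraction, so the sample size is only approximately 0.1% of the rows. The function name `sample_rows` and the column names in the demo data are hypothetical.

```python
import csv
import io
import random


def sample_rows(lines, fraction, seed=0):
    """Keep each data row with probability `fraction` (Bernoulli sampling).

    `lines` is any iterable of CSV text lines whose first line is the header,
    so an open file can be streamed without loading it all into memory.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    sample = [row for row in reader if rng.random() < fraction]
    return header, sample


# Tiny in-memory stand-in for the training file; column names are illustrative.
demo = io.StringIO("week,client_id,units\n" + "".join(
    f"{3 + i % 7},{i},{i % 5}\n" for i in range(10_000)))
header, sample = sample_rows(demo, fraction=0.001, seed=42)
print(header, len(sample))
```

With a real file, the same function can be called on an open file handle, for example `with open(path, newline="") as f: sample_rows(f, 0.001)`. If the data already fits in a pandas DataFrame, `DataFrame.sample(frac=0.001)` draws a sample of that fraction directly.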
Clustering Client Types Using Natural Language Processing:
When addressing the issue of the dataset being in Spanish, my professor advised me to look at the code kernels posted on Kaggle. There, I found a kernel that uses a Natural Language Processing (NLP) technique called Count Vectorizer to build client groups from the individual client names in Spanish.
MODELS USED:
- Count Vectorizer - NLP technique that splits each client's name into word tokens and builds a vector recording how often each token occurs.
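To make the idea concrete, here is a hand-rolled sketch of the counting that a count vectorizer performs. It is a stand-in written with the standard library, not scikit-learn's `CountVectorizer` itself, although it mirrors that class's default behavior of lowercasing the text and keeping word tokens of two or more characters. The client names are made up for illustration and the helper name `token_counts` is hypothetical.

```python
import re
from collections import Counter


def token_counts(names):
    """Count lowercase word tokens of two or more characters across all names."""
    counts = Counter()
    for name in names:
        counts.update(re.findall(r"\b\w\w+\b", name.lower()))
    return counts


# Made-up client names for illustration; not taken from the dataset.
names = [
    "ESCUELA PRIMARIA BENITO JUAREZ",
    "CAFETERIA LA ESCUELA",
    "ABARROTES LUPITA",
]
counts = token_counts(names)
total = sum(counts.values())
for token, n in counts.most_common(3):
    print(f"{token}: {n} ({n / total:.1%} of all tokens)")
```

Dividing each token's count by the total number of tokens gives the kind of share quoted for 'escuela' in Step 2.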
Implementation:
Step 1: Inspect the client names. Initial observations indicate that they form a noisy and varied set of strings.
Step 2: Examine the vectorized words and their frequencies. Here, the word 'escuela' ('school') accounts for 0.56% of all collected words.
Step 3: Identify potential client groupings. In the sample of client names below, we can see that many clients have 'cafeteria' or 'restaurant' in their names.
Step 4: Define a function to map Spanish client names to English client types.
Step 5: Perform a value count on the new client types to verify the mapping.
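Steps 4 and 5 can be sketched as below. The keyword rules here are hypothetical examples, since the actual keyword list lives in my notebook and the Kaggle kernel; the sketch assumes a simple first-match substring rule, with unmatched names falling into an "Other" bucket.

```python
from collections import Counter

# Hypothetical keyword-to-type rules; the real list is not reproduced here.
RULES = [
    ("escuela", "School"),
    ("colegio", "School"),
    ("cafeteria", "Eatery"),
    ("restaurant", "Eatery"),
    ("abarrotes", "Grocery Store"),
    ("farmacia", "Pharmacy"),
]


def client_type(name, rules=RULES, default="Other"):
    """Return the English type of the first rule whose keyword is in the name."""
    lowered = name.lower()
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return default


# Made-up client names for illustration.
names = [
    "ESCUELA PRIMARIA BENITO JUAREZ",
    "CAFETERIA CENTRAL",
    "RESTAURANTE EL SOL",
    "ABARROTES LUPITA",
    "SIN NOMBRE",
]
types = [client_type(n) for n in names]

# Step 5: a value count shows how many names landed in each type.
print(Counter(types).most_common())
```

Because the first matching rule wins, the order of the rules matters when a name contains more than one keyword. A large "Other" count in the value count is a hint that more keywords are needed.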
Data Analysis:
After cleaning the dataset, I wanted to do some data analysis and visualizations. This kernel helped me learn about useful graphing and visualization techniques in R.
1. Weeks - Histogram of the Weekly Sales
- From this histogram, we can see that the sample contains approximately 10,000 observations for each week in the dataset.
2. Net Units Sold - Mean, Median, Standard Deviation:
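- The sample indicates that Grupo Bimbo sells its clients an average of 7.23 units per observation. The median is 3.0 units and the standard deviation is ~19.3 units. A mean well above the median points to a right-skewed distribution, with a small number of very large orders.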
3. Sales Depots - TreeMap of Sales Depots by % of Total Returns:
- The TreeMap of the top 100 Sales Depots by % of total returns indicates that the top 100 depots sell a nearly equivalent number of units and that Sales Depots 1250 and 2230 account for at least 0.08% of Grupo Bimbo's total returned products over the period. This is a signal to inspect the clients and products that are served from these two depots.
4. Products - TreeMap of Products by % of Total Returns:
- The TreeMap of the top 100 products by % of total returns indicates that at least 0.045% of Grupo Bimbo's returned products are 'Tostado 210g' and 'Mantecadas Nuez 123g'. 'Bolsa Mini Rocko' has the highest return rate among products with larger numbers of units sold.
5. Price of Products - Mean, Median, Standard Deviation:
- The sample indicates that the average price of Grupo Bimbo's products is Mex$14.37. The products have a median price of Mex$10.37 and a standard deviation of Mex$13.89.
6. Price of Products - Histogram and Estimated Density Function:
- The histogram and estimated density function indicate that the majority of Grupo Bimbo's products cost less than Mex$25.
7. Clients - TreeMap of Ungrouped Clients by % of Total Returns
- The TreeMap of the top 100 ungrouped clients by % of total returns indicates that 'Jalisco Remision' accounts for at least 0.16% of Grupo Bimbo's total returns. This is a signal to inspect this client's product bundle more closely.
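The numbers behind these summaries and TreeMaps come from two simple calculations: descriptive statistics on a column, and each group's share of total returned units. The sketch below shows both using only the standard library. It is an illustration under stated assumptions rather than my original R code: the rows and field names are made up, and the standard deviation shown is the sample standard deviation, which may differ slightly from a population standard deviation.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Return the mean, median, and sample standard deviation."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )


def return_shares(rows, key):
    """Map each value of `key` to its share of all returned units, largest first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["returned_units"]
    grand_total = sum(totals.values())
    shares = {k: v / grand_total for k, v in totals.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Made-up rows; the field names are illustrative, not the dataset's columns.
rows = [
    {"depot": 1250, "net_units": 3, "returned_units": 4},
    {"depot": 2230, "net_units": 12, "returned_units": 3},
    {"depot": 1250, "net_units": 1, "returned_units": 2},
    {"depot": 1110, "net_units": 8, "returned_units": 1},
]
mean, median, stdev = summarize([r["net_units"] for r in rows])
print(f"mean={mean:.2f} median={median} stdev={stdev:.2f}")

shares = return_shares(rows, "depot")
print(shares)
```

Passing a different `key`, such as a product or client field, gives the shares behind the other TreeMaps. The function assumes at least one returned unit in the data, since the shares are divided by the grand total.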
Takeaways:
Through this project, I gained valuable experience cleaning very messy datasets using natural language processing. This project also reinforced my comfort with using exploratory data analysis and visualization techniques to guide business insights.