This Sales Analysis was forked from GitHub provided by Kieth Galli. I decided to pick this project to do fun analysis on sales data. In this project, I will be cleaning the data and answering five questions. Before I load the csv file, I need to load some packages. The OS module in python provides functions for interacting with the operating system. The PANDAS modules in python provide data manipulation and analysis tools.
import os
import pandas as pd
Since our data is in csv files by months, we will merge the data into one csv file.
path = "./Sales_Data"
files = [file for file in os.listdir(path) if not file.startswith('.')
all_months_data = pd.DataFrame()
for file in files:
current_data = pd.read_csv(path+"/"+file)
all_months_data = pd.concat([all_months_data, current_data])
Now we read in the updated data frame.
all_data = pd.read_csv("all_data.csv") all_data.head()
Next we need to clean up the data. Sometimes you’ll have data with empty cells. I decided to clean the data by deleting the row.
# Find NAN
nan_df = all_data[all_data.isna().any(axis=1)] display(nan_df.head())
all_data = all_data.dropna(how='all') all_data.head()
The first question we want to answer is which month had the most sales? The Order Date column has date and time format as month/day/year hour:minute. To extract the month only, we can use string as str[0:2] for the first two characters.
all_data["Month"] = all_data["Order Date"].str[0:2]
all_data["Month"] = all_data["Month"].astype('int32')
Add a sales column with total purchase in dollars.
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
all_data['Sales'] = all_data['Quantity Ordered']*all_data['Price Each']
all_data.head()
Find the best month for sales and bar plot the months using matplotlib.
results = all_data.groupby('Month').sum()['Sales']
import matplotlib.pyplot as plt
months = range(1,13)
plt.bar(months, results)
plt.xticks(months)
plt.ylabel('Sales in USD ($)')
plt.xlabel('Month number')
plt.show()
After looking at our plot, we can see the month number 12, representing December, has the most sales in the year. We can see there is an increase in sales from November to December. I believe these months have the most sales since it's the holidays; specifically Christmas is in December.
The next question we want to answer is which city had the most sales? To do this we need to create a city, state column from our Purchase Address Column. Note, we need the city and state since many cities in the United States have the same name.
def get_city(address):
return address.split(',')[1]
def get_state(address):
return address.split(',')[2].split(' ')[1]
all_data['City'] = all_data["Purchase Address"].apply(lambda x: get_city(x)+
all_data.head()
results_city = all_data.groupby('City').sum()['Sales']
print(results_city)
City
Atlanta, GA 2795498.58
Austin, TX 1819581.75
Boston, MA 3661642.01
Dallas, TX 2767975.40
Los Angeles, CA 5452570.80
New York City, NY 4664317.43
Portland, ME 449758.27
Portland, OR 1870732.34
San Francisco, CA 8262203.91
Seattle, WA 2747755.48
Name: Sales, dtype: float64
Finally we plot the information on a bar plot graph.
# convert results to array
results_city=pd.to_numeric(results_city)
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
cities = [city for city, df in all_data.groupby('City')]
plt.bar(cities, results_city)
plt.ylabel('Sales in USD ($)')
plt.xlabel('City')
plt.xticks(cities, rotation='vertical', size=8)
plt.show()
The city with the most sales is San Francisco, CA.
At what time of day were there the most sales? Using our Order Date column we can extract the time the sale was charged.
#extract specific hour, minute in column
all_data['hour']= all_data['Order Date'].dt.hour
all_data['minute']= all_data['Order Date'].dt.minute
all_data.head()
Here is the line plot of the hours.
hours = [hour for hour, df in all_data.groupby('hour')]
plt.plot(hours, all_data.groupby('hour').count())
plt.ylabel('Count')
plt.xlabel('Hours')
plt.xticks(hours, rotation='vertical', size=8)
plt.grid
plt.show()
The best time to display ads are at local maxima in our graph, so the peak time is around12 am and 7pm.
What items were most frequently bought together?
df = all_data[all_data['Order ID'].duplicated(keep=False)]
df['Grouped']=df.groupby('Order ID')['Product'].transform(lambda x: ', '.join(x))
df = df[['Order ID', 'Grouped']].drop_duplicates()
df.head()
from itertools import combinations
from collections import Counter
count = Counter()
for row in df['Grouped']:
row_list = row.split(',')
count.update(Counter(combinations(row_list, 2)))
for key,value in count.most_common(10):
print(key, value)
The most items in pairs bought together was iPhone and lightning charging cable with 1005 sales. Since iPhones are no longer sold with charging cables in the box, customers who don't have a charging cable are forced to buy one.
Question 5:
Which items were ordered the most?
results3 = all_data.groupby('Product').sum()['Quantity Ordered']
print(results3)
Product
20in Monitor 4129
27in 4K Gaming Monitor 6244
27in FHD Monitor 7550
34in Ultrawide Monitor 6199
AA Batteries (4-pack) 27635
AAA Batteries (4-pack) 31017
Apple Airpods Headphones 15661
Bose SoundSport Headphones 13457
Flatscreen TV 4819
Google Phone 5532
LG Dryer 646
LG Washing Machine 666
Lightning Charging Cable 23217
Macbook Pro Laptop 4728
ThinkPad Laptop 4130
USB-C Charging Cable 23975
Vareebadd Phone 2068
Wired Headphones 20557
iPhone 6849
Name: Quantity Ordered, dtype: int64
products = [product for product, df in all_data.groupby('Product')]
plt.bar(products,results3)
plt.ylabel('Count')
plt.xlabel('Product')
plt.xticks(products, rotation='vertical')
plt.grid
plt.show()
The most bought item was AAA batteries (4-pack). We can see in our data set that AAA batteries (4-pack) is also the cheapest item in the store.
Overall, I had fun answering these questions. Thank you if you made it this far! :)