Sales Analysis

This Sales Analysis was forked from GitHub provided by Kieth Galli. I decided to pick this project to do fun analysis on sales data. In this project, I will be cleaning the data and answering five questions. Before I load the csv file, I need to load some packages. The OS module in python provides functions for interacting with the operating system. The PANDAS modules in python provide data manipulation and analysis tools.

import os

import pandas as pd

Since our data is in csv files by months, we will merge the data into one csv file.

path = "./Sales_Data"
files = [file for file in os.listdir(path) if not file.startswith('.')

all_months_data = pd.DataFrame()

for file in files:
current_data = pd.read_csv(path+"/"+file)
all_months_data = pd.concat([all_months_data, current_data])

Now we read in the updated data frame.

all_data = pd.read_csv("all_data.csv") all_data.head()

Next we need to clean up the data. Sometimes you’ll have data with empty cells. I decided to clean the data by deleting the row.

# Find NAN
nan_df = all_data[all_data.isna().any(axis=1)] display(nan_df.head())

all_data = all_data.dropna(how='all') all_data.head()

Question 1:

The first question we want to answer is which month had the most sales? The Order Date column has date and time format as month/day/year hour:minute. To extract the month only, we can use string as str[0:2] for the first two characters.

all_data["Month"] = all_data["Order Date"].str[0:2]

all_data["Month"] = all_data["Month"].astype('int32')

Add a sales column with total purchase in dollars.

all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])

all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

all_data['Sales'] = all_data['Quantity Ordered']*all_data['Price Each']

all_data.head()

Find the best month for sales and bar plot the months using matplotlib.

results = all_data.groupby('Month').sum()['Sales']

import matplotlib.pyplot as plt

months = range(1,13)

plt.bar(months, results)

plt.xticks(months)

plt.ylabel('Sales in USD ($)')

plt.xlabel('Month number')

plt.show()

After looking at our plot, we can see the month number 12, representing December, has the most sales in the year. We can see there is an increase in sales from November to December. I believe these months have the most sales since it's the holidays; specifically Christmas is in December.

Question 2:

The next question we want to answer is which city had the most sales? To do this we need to create a city, state column from our Purchase Address Column. Note, we need the city and state since many cities in the United States have the same name.

def get_city(address):

return address.split(',')[1]

def get_state(address):

return address.split(',')[2].split(' ')[1]

all_data['City'] = all_data["Purchase Address"].apply(lambda x: get_city(x)+

all_data.head()

results_city = all_data.groupby('City').sum()['Sales']

print(results_city)

City

Atlanta, GA 2795498.58

Austin, TX 1819581.75

Boston, MA 3661642.01

Dallas, TX 2767975.40

Los Angeles, CA 5452570.80

New York City, NY 4664317.43

Portland, ME 449758.27

Portland, OR 1870732.34

San Francisco, CA 8262203.91

Seattle, WA 2747755.48

Name: Sales, dtype: float64

Finally we plot the information on a bar plot graph.

# convert results to array

results_city=pd.to_numeric(results_city)

all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])

cities = [city for city, df in all_data.groupby('City')]

plt.bar(cities, results_city)

plt.ylabel('Sales in USD ($)')

plt.xlabel('City')

plt.xticks(cities, rotation='vertical', size=8)

plt.show()

The city with the most sales is San Francisco, CA.

Question 3:

At what time of day were there the most sales? Using our Order Date column we can extract the time the sale was charged.

#extract specific hour, minute in column

all_data['hour']= all_data['Order Date'].dt.hour

all_data['minute']= all_data['Order Date'].dt.minute

all_data.head()

Here is the line plot of the hours.

hours = [hour for hour, df in all_data.groupby('hour')]

plt.plot(hours, all_data.groupby('hour').count())

plt.ylabel('Count')

plt.xlabel('Hours')

plt.xticks(hours, rotation='vertical', size=8)

plt.grid

plt.show()

The best time to display ads are at local maxima in our graph, so the peak time is around12 am and 7pm.

Question 4:

What items were most frequently bought together?

df = all_data[all_data['Order ID'].duplicated(keep=False)]

df['Grouped']=df.groupby('Order ID')['Product'].transform(lambda x: ', '.join(x))

df = df[['Order ID', 'Grouped']].drop_duplicates()

df.head()

from itertools import combinations

from collections import Counter

count = Counter()

for row in df['Grouped']:

row_list = row.split(',')

count.update(Counter(combinations(row_list, 2)))

for key,value in count.most_common(10):

print(key, value)

The most items in pairs bought together was iPhone and lightning charging cable with 1005 sales. Since iPhones are no longer sold with charging cables in the box, customers who don't have a charging cable are forced to buy one.

Question 5:

Which items were ordered the most?

results3 = all_data.groupby('Product').sum()['Quantity Ordered']

print(results3)

Product

20in Monitor 4129

27in 4K Gaming Monitor 6244

27in FHD Monitor 7550

34in Ultrawide Monitor 6199

AA Batteries (4-pack) 27635

AAA Batteries (4-pack) 31017

Apple Airpods Headphones 15661

Bose SoundSport Headphones 13457

Flatscreen TV 4819

Google Phone 5532

LG Dryer 646

LG Washing Machine 666

Lightning Charging Cable 23217

Macbook Pro Laptop 4728

ThinkPad Laptop 4130

USB-C Charging Cable 23975

Vareebadd Phone 2068

Wired Headphones 20557

iPhone 6849

Name: Quantity Ordered, dtype: int64

products = [product for product, df in all_data.groupby('Product')]

plt.bar(products,results3)

plt.ylabel('Count')

plt.xlabel('Product')

plt.xticks(products, rotation='vertical')

plt.grid

plt.show()

The most bought item was AAA batteries (4-pack). We can see in our data set that AAA batteries (4-pack) is also the cheapest item in the store.

Overall, I had fun answering these questions. Thank you if you made it this far! :)

Page updated

Google Sites

Report abuse