top of page
Search

A SQL Analysis Of 4-year Sales Data

  • Writer: Abdul-Sherrif Saaka
    Abdul-Sherrif Saaka
  • Oct 24, 2022
  • 7 min read

Updated: Feb 24, 2023

Hi. Happy to have you here. In this project I analyze a 4-year sales dataset and answer some important business questions with SQL. MySQL is the SQL flavor of choice for this project. SQL stands for Structured Query Language and it is the language you'll need to extract most of your data that is housed in databases. Some call it "sequel", others call it "es-queue-el" , I am not here to judge(ᶜᵃˡˡ ⁱᵗ ˢᵉᵠᵘᵉˡ), but whatever you call it, just stay with me to the end. Let's get into it!


Business questions

These are the questions I will be answering in this project

Q1 top 10 products by revenue across each zone

Q2 channels that generated the most revenue in each state

Q3 sales made per year

Q4 profit by year

Q5 profit margins by year

Q6 products that sold the most in each zone in each year

Q7 sales made when there is discount v no discount

Q8 profit made when there is discount vs no discount

Q9 revenue by product category

Q10 order quantity by product category and zone

Q11 top product category(sales) by quarter in each year

Q12 products that are never out of stock

Q13 products that are out of stock sometimes

Q14 profit margins by category


Inspecting data with Excel

To begin I opened my "salesdata.csv" file in Excel for an initial inspection and I see that the names of the columns do not follow naming conventions in SQL. For example there is a column named "order id" which might present some problems in SQL so I correct that to "order_id" and do same for all the other columns. The difference in these two names being the underscore. There are no missing values so my data is ready to import into MySQL.


Importing data into MySQL

Before I import my data i create a new schema for the new project. A schema is simply a container in a database that holds objects like tables.


CREATE DATABASE sql_project_001;


After creating my schema named sql_project_001, I use the table import wizard in MySQL workbench to import the dataset. To ensure all records are successfully imported I run a select all query


SELECT * FROM sql_project_001.salesdata;


I run a query to see the data type of all the fields

SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.columns
WHERE TABLE_NAME = 'salesdata'


From the results of the query we see that order_date column is a varchar(text format) so I write a query to change to DATE data type. On my first two attempts I used CAST() and CONVERT() which both threw an error because both functions do not contain an argument for specifying the format of the original date. Which is what I needed to do here to help SQL convert my order_date from varchar to date. STR_TO_DATE() is the one to use here.

SELECT str_to_date(order_date, '%d %m %Y') order_date
FROM sql_project_001.salesdata;


Answering business questions

I checked for duplicates on the primary key(the order_id column) by using the DISTINCT function and found none. A primary key is the field that uniquely identifies each record. It cannot contain duplicates or null values. No dataset is perfect but this was not the dirtiest I have seen so it did not need a lot to be done on cleaning. After making sure that everything is clean and tidy it is time to start answering the business questions above


Q1 top 10 products by revenue across each zone

I used the ROW_NUMBER() window function to get the rank of the total revenue per product and zone and sorted it by the revenue in descending order with the "desc" function to get the highest revenues first. This produces a rank of all the revenues of all products. This is where the WHERE clause comes in and filters to show only revenues ranked from 1 to 10 based on the rank provided by the window function. The subquery on the FROM clause was necessary because without it I wouldn't be able to use the alias for the window function in the WHERE clause. Total revenue is absent in the table so I calculated it myself by summing up the products of the unit price and order quantity for each product. We will come across this multiple times in this project

Sneak peek of results


Q2 channels that generated the most revenue in each state

I needed another subquery here because a MAX(SUM()) function wouldn't work. So the result of the subquery contains the the sum of revenue for all channels across each state and the outer query uses the MAX() function to find the channel with the maximum revenue for each state. I used the ROUND() function to round the answer to 2 decimal places to make reading easier. We see the total revenue here again, which if you recall is the sum of the products of the unit price of all items and their order quantities

results


Q3 sales made per year

Over here and throughout this project I used the STR_TO_DATE() function to change my date to the right data type before working with it. In this query I used it with the YEAR() function to extract the year from the corrected date and calculated the total sales to find the sales per year

results

2012 was the year with most sales made with $18,368,505.24. Let's see if that equals year with most profit made in the next query






Q4 profit by year

Profit from this dataset = total revenue -total cost. Total revenue is the sum of the products of the unit price of all items and their order quantities and total cost is the sum of the products of the unit cost of all items and their order quantities. The year function makes another appearance here to help me group the profits by year.

results

2012 was in fact the year with the highest profits. I felt like this was not giving the full story concerning profits so even though it fell out of the scope of my assignment I decided to find the profit margins per year to get a better picture of the best performing year



Q5 profit margins by year

To find profit margin is to divide the profit obtained in Q4 above by the total revenue. I divided by 100 to get percentages

results

Although 2011 was the year with lowest sales and profit (because sales in 2011 only start in Q4 ), it has the highest profit margin 52.9%. This is an interesting insight that would require further investigation



Q6 products that sold the most in each zone in each year

Just like in Q1, to answer this question I needed to add up all order quantities, group by the products, and find the products with the highest order quantity in each zone and year. Since the sum and max functions do not work together I first created a subquery to get the total order quantities of all products with the SUM() function in each zone in each year and used MAX() function in the outer query to find the products with the highest number of orders

results


Q7 sales made when there is discount v no discount

To answer this question I used a common table expression, cte for short. A cte helps to simplify queries by acting as a virtual table. Since I have two values to find in this query I will have two select statements in the cte. To find sales when there is a discount I use the WHERE clause to filter the dataset on the promotion name field to find products and the corresponding promotion at the time of sale. And I do same for products with "no discount" in the promotion name field

results

There is a strong correlation between discounts and sales seeing as products with promotions and discounts sold 52% more than products with no discounts. Further analysis needed to get the full effect of discounts on sales



Q8 profit made when there is discount vs no discount

Again, I used a cte here which contained two select statements. I calculated for the profit and use the WHERE clause to filter the results on the promotion name field to find products and the corresponding promotion at the time of sale. And I do same for products with "no discount" in the promotion name field

results

Products with discounts sold more and made more profit





Q9 revenue by product category

results

Computers generated the most revenue at $16,409,174.98





Q10 order quantity by product category and zone

results


Q11 top product category(sales) by quarter in each year

In this query I use a new function- QUARTER() to extract the quarters of the year and products that sold the most in each quarter of each year

results

Overall, Computers saw the highest sales quarter in quarter 1 of 2012 and topped at least one quarter each year with the exception of 2011. It topped two quarters in 2012 and 2014. With the exception of 2014, cameras and camcorders topped the sales in quarter 4 of every year. This might have something to do with the holidays approaching at that time of year but further investigation would be needed to establish how strongly correlated they are




Q12 products that are never out of stock

This was a tricky one but to find products that had been in stock for every month of every year I figured the count of distinct months had to be 12 for every month AND the count of distinct years had to be 4 for all the four years

results

From my query these are the only three products that met the criteria



Q13 products that are out of stock sometimes

For this I decided that for a product to meet the criteria it had to have been out of stock in at least one month OR one year. So if a product appeared in every year but not every month of every year my query returned it and if it appeared in all 12 months but not for all years my query returned that too.

sneak peak of results


Q14 profit margins by category

For the last question I calculated the profit margin, grouped by the product category and used the CASE expression to return the level of the profit margin which were high, mid or low.

results

Music, movies and Audio books had the highest profit margin. Computers, the best selling product had the lowest profit margin at 48.17%





Conclusion

Let's wrap it up shall we? We can conclude that in the period under consideration the customers preferred to buy at the physical stores. Overall, computers generated the most sales and profit but had the lowest profit margin. Music , Movies and Audiobooks had the highest profit margin. During the holiday season, Cameras and camcorders were usually the best selling products but they dropped off in 2014 which warrants looking into. A worrying pattern though is how sales and profit consistently dropped from 2012-2014. Customers LOVE their discounts. I get them, I do too.



You can find the complete code and dataset on my Github


Thank you for reading , I hope you enjoyed it. You are always welcome to join the "sequel" side and your suggestions are always welcome here. The goal is consistent growth

Comments


©2022 by Abdul-Sherrif. Proudly created with Wix.com

bottom of page