Introduction

What we have here is a few months worth of sales data from a bakery.

The original dataset was put on Kaggle but deleted: https://www.kaggle.com/deleted-dataset/50231.

Fortunately, someone reuploaded the data here: https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery.

We’ll first jump in with some general exploratory data analysis (EDA) followed by a more specific analysis using market basket analysis (MBA).

Packages used

A list of packages utilized in this analysis:

##  [1] "arulesViz"   "grid"        "arules"      "Matrix"      "ggridges"   
##  [6] "gridExtra"   "ggthemes"    "viridis"     "viridisLite" "lubridate"  
## [11] "forcats"     "stringr"     "dplyr"       "purrr"       "readr"      
## [16] "tidyr"       "ggplot2"     "tidyverse"   "tibble"      "stats"      
## [21] "graphics"    "grDevices"   "utils"       "datasets"    "methods"    
## [26] "base"

Preliminary data peaking

Let’s get a quick idea of what our data holds and the structure thereof before embarking too deep in our analysis.

Right off the bat we see our time period of data collection is quite narrow: 161 days worth of data, only. We can also see the most popular transaction time is around 12:40pm, with Coffee being the most purchased item.

# import data
x <- read_csv("data/BreadBasket_DMS.csv") %>%
  mutate(Date=as.Date(Date),
         Time=hms(Time),
         Item=as.factor(Item))

summary(x)

##       Date                 Time                            Transaction  
##  Min.   :2016-10-30   Min.   :1H 21M 5S                   Min.   :   1  
##  1st Qu.:2016-12-03   1st Qu.:10H 56M 40S                 1st Qu.:2548  
##  Median :2017-01-21   Median :12H 37M 30S                 Median :5067  
##  Mean   :2017-01-17   Mean   :12H 45M 1.40125862959394S   Mean   :4952  
##  3rd Qu.:2017-02-28   3rd Qu.:14H 29M 37S                 3rd Qu.:7329  
##  Max.   :2017-04-09   Max.   :23H 38M 41S                 Max.   :9684  
##                                                                         
##       Item     
##  Coffee :5471  
##  Bread  :3325  
##  Tea    :1435  
##  Cake   :1025  
##  Pastry : 856  
##  NONE   : 786  
##  (Other):8395

glimpse(x)

## Observations: 21,293
## Variables: 4
## $ Date        <date> 2016-10-30, 2016-10-30, 2016-10-30, 2016-10-30, 2...
## $ Time        <S4: Period> 9H 58M 11S, 10H 5M 34S, 10H 5M 34S, 10H 7M ...
## $ Transaction <dbl> 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7,...
## $ Item        <fct> Bread, Scandinavian, Scandinavian, Hot chocolate, ...

# 161 days of data
difftime(x$Date[nrow(x)],x$Date[1])

## Time difference of 161 days

EDA

Let’s see what we can glean looking at various manipulations of the data we have available. Most of the analyses will be plotting total items or specific items sold per unit of time, as this is the most simple use of the data.

Item frequency

No real surprises with item frequency, although the NONE is slightly odd. It should be noted that when using count we’re adding all instances of that particular item. We are not discriminating between different sales receipts. So if one person has ordered 5 coffees, we count coffee 5 times rather than once. Right now this doesn’t bother me, we can look at it later on.

x1 <- x %>% 
  group_by(Item) %>% 
  summarise(Count = n()) %>% 
  arrange(desc(Count)) %>%
  slice(1:10) %>% # keep the top 10 items
  ggplot(aes(x=reorder(Item,Count),y=Count,fill=Item))+
  geom_bar(stat="identity")+
  coord_flip()+
  ggtitle("Most popular line items")+
  theme(legend.position="none")
x1

Items sold per day

Plotting total line items sold per day shows some obvious peaks and valleys. Saturday is the big sales day for this bakery, followed by Sunday.

# Line items sold per day
x2 <- x %>% 
  group_by(Date) %>% 
  summarise(Count= n()) %>% 
  mutate(Day=wday(Date,label=T)) %>% 
  ggplot(aes(x=Date,y=Count,fill=Day))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Line items sold per day")+
  scale_fill_viridis(discrete=TRUE)
x2

Items sold per weekday

We can confirm the general pattern we saw above by grouping by weekdays. It turns out Saturdays are still the top sales days with Sundays close behind. Mondays, Tuesdays, and Wednesdays are roughly tied; and the end of the week ramps sales on Thursdays and Fridays.

# Total line items sold per weekday
x3 <- x %>% 
  mutate(Day = wday(Date,label=T)) %>% 
  group_by(Day) %>% 
  summarise(Count= n()) %>% 
  ggplot(aes(x=Day,y=Count,fill=Day))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Total line items sold per weekday")+
  theme(legend.position="none")
x3

Unique transactions per weekday

So far our item totals have been based on… total items. Taking the total coffees sold per week, or total sandwiches sold per week gives us some great insight on how we might want to prepare ordering items for next week. It does not, however, give us an idea of how busy the store is on a certain day.

Plotting unique transactions per day is a more fair representation of ‘business’ throughout the week. We see below that Saturday maintains the top position with, this time, Friday following close behind. In contrast to before, Sunday sales are more in line with the Monday through Thursday! This could indicate a couple things (among others): there are group lunches in which one person foots the bill; hungrier people come in on Sundays, therefore ordering more items; people may be ordering takeout for groups at another location.

# Total unique transactions per weekday
x4 <- x %>% 
  mutate(wday=wday(Date,label=T)) %>% 
  group_by(wday,Transaction) %>% 
  summarise(n_distinct(Transaction)) %>% 
  summarise(Count=n()) %>% 
  ggplot(aes(x=wday,y=Count,fill=wday))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Unique transactions per weekday")+
  theme(legend.position="none")
x4

Unique items per purchase

We can test our hypothesis above by calculating how many items, on average, are purchased per transaction.

It turns out that people are definitely ordering more on Sundays. We see Sundays average total items per transaction measure is at 2.6, with the rest of the weekdays showing total transactions between 2.1 to 2.3. In general, people are more likely to buy an additional, 3rd item on Sundays.

x5.1 <- x %>% 
  mutate(Day = wday(Date,label=T)) %>% 
  group_by(Day) %>% 
  summarise(Count= n()) 

x5.2 <- x %>% 
  mutate(wday=wday(Date,label=T)) %>% 
  group_by(wday,Transaction) %>% 
  summarise(n_distinct(Transaction)) %>% 
  summarise(Count=n())

x5.3 <- data.frame(x5.1, # Days, total items
                   x5.2[2], # unique transactions
                   x5.1[2]/x5.2[2])  # items per unique transaction
colnames(x5.3) <- c("Day","Line","Unique","Items.Trans")

ggplot(x5.3,aes(x=Day,y=Items.Trans,fill=Day))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Total unique transactions per weekday")+
  theme(legend.position="none")+
  geom_text(aes(label=round(Items.Trans,1)), vjust=2)

Hourly sales

Similar to above, we calculate total items sold per hour, unique transactions per hour, and total items per transaction per hour.

Our items per hour and unique transactions per hour don’t seem to differ much throughout the day except in regards to scale. The majority of items sold seems to happen around 11:00am, with the lunch rush occuring between 9:00am and 2:00pm.

When we look at our hourly total items per transaction, we see that on the tail ends of the spectrum people are most likely to stop in and buy a single item, a coffee maybe. The closer we get towards lunchtime hours, the more likely we are to see multiple items being purchased.

Items per hour

# Line items sold per Hour
x7 <- x %>%
  mutate(Hour = as.factor(hour(x$Time))) %>% 
  group_by(Hour) %>% 
  summarise(Count=n()) %>% 
  ggplot(aes(x=Hour,y=Count,fill=Hour))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Line items sold per Hour")+
  theme(legend.position="none")
x7

Unique transactions per hour

# Unique transactions per Hour
x8 <- x %>% 
  mutate(Hour = as.factor(hour(Time)))%>% 
  group_by(Hour,Transaction) %>% 
  summarise(n_distinct(Transaction)) %>% 
  summarise(Count=n()) %>% 
  ggplot(aes(x=Hour,y=Count,fill=Hour))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Unique transactions per Hour")+
  theme(legend.position="none")
x8

Total items per transaction per hour

x9.1 <- x %>% 
  mutate(Hour = as.factor(hour(Time)))%>% 
  group_by(Hour) %>% 
  summarise(Count= n()) 

x9.2 <- x %>% 
  mutate(Hour = as.factor(hour(Time)))%>% 
  group_by(Hour,Transaction) %>% 
  summarise(n_distinct(Transaction)) %>% 
  summarise(Count=n())

x9.3 <- data.frame(x9.1, # Days, total items
                   x9.2[2], # unique transactions
                   x9.1[2]/x9.2[2])  # items per unique transaction
colnames(x9.3) <- c("Hour","Line","Unique","Items.Trans")

x9 <- 
  ggplot(x9.3,aes(x=Hour,y=Items.Trans,fill=Hour))+
  theme_fivethirtyeight()+
  geom_bar(stat="identity")+
  ggtitle("Total items per transaction per hour")+
  theme(legend.position="none")+
  geom_text(aes(label=round(Items.Trans,1)), vjust=2)
x9

Transaction density per hour per day

We can use a ridge plot to summarize some of the key points we saw above. We see that Saturday seems to have the largest, most constant density throughout the day. While Sunday has a very sharp peak in transactions around lunchtime but tapers off quicker.

x10 <- x %>% 
  mutate(Hour=as.factor(hour(Time)),
         Day=wday(Date,label=T)) %>% 
  group_by(Day,Hour,Transaction) %>% 
  summarise(Count=n()) %>% 
  ggplot(aes(x=Hour,y=Day,group=Day,fill=..density..))+
  theme_fivethirtyeight()+
  ggtitle("Transaction density per Hour by Day")+
  theme(legend.position="none")+
  geom_density_ridges_gradient(scale=2, rel_min_height=0.01)+
  scale_fill_viridis(option="magma")
x10

Market Basket Analysis

In order to use our data in MBA we need to convert it to the correct format.

Upon importing, we find data that agree with our previousn observations. We also see that most transactions are either one or two items. We also see 3 generous, 10-item transactions. They must have bought for the office, how nice!

We can visualize our most popular items similarly to how we did above using the itemFrequencyPlot from the arules package.

Import and peak

y <- read.transactions("data/BreadBasket_DMS.csv",
                       format="single",
                       cols=c(3,4),
                       sep=","
                       )
summary(y)

## transactions as itemMatrix in sparse format with
##  6614 rows (elements/itemsets/transactions) and
##  104 columns (items) and a density of 0.02008705 
## 
## most frequent items:
##  Coffee   Bread     Tea    Cake  Pastry (Other) 
##    3188    2146     941     694     576    6272 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10 
## 2556 2154 1078  546  187   67   18    3    2    3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.089   3.000  10.000 
## 
## includes extended item information - examples:
##                     labels
## 1               Adjustment
## 2 Afternoon with the baker
## 3                Alfajores
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3          1000

# item frequencies  
par(mfrow=c(1,2))
itemFrequencyPlot(y,topN=5,type="relative")
itemFrequencyPlot(y,topN=5,type="absolute")

Support and Confidence

We’ll need to define optimal support and confidence values. Before diving in too deeply we should define them.

Support: An itemset with high support means it appears in our transactions frequently. High support is directly related to the frequency an item appears. For example, coffee probably has very high support.

\[Support = \frac {n(X\Rightarrow Y)} {N}\] Where \(n(X\rightarrow Y)\) equals the count of both X and Y (the total itemset) appearing together. And \(N\) equals the total number of transactions.

Confidence: Is the strength of our association rule. For example, a confidence of 1 implies that when the LHS item is purchased, the RHS item is purchased 100% of the time.

\[Confidence = \frac {n(X\Rightarrow Y)} {n(X)}\] Confidence = n(X->Y)/n(X)
Where \(n(X\rightarrow Y)\) equals the count of both X and Y (the total itemset) appearing together. And \(n(X)\) equals the count of X (the LHS) appearing in transactions.

So let’s pin down our support value first.

summary(itemFrequency(y))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0001512 0.0002646 0.0013607 0.0200871 0.0146659 0.4820079

Remember, item frequency is directly related to support. We can see the distribution of values above, with the max frequency at 0.4820079. This is probably coffee which we saw above appears greater than 40% of the time in purchases.

Since our maximum support is 48%, let’s pick a value around 30% to start. We can find this value by dividing 30 by the total amount of transactions.

30/length(y)

## [1] 0.004535833

Assign rules

Our support value will be roughly 0.0045. So what about confidence? Let’s use a cointoss as inspiration. Remember that confidence is the likelihood, having purchased the item on the LHS, that the item on the RHS will be purchased in addition. Let’s go with 50%, or a value of 0.5.

rules <- apriori(y,parameter=list(support=0.0045,confidence=.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5  0.0045      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[104 item(s), 6614 transaction(s)] done [0.00s].
## sorting and recoding items ... [39 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules

## set of 20 rules

Inspect rules

We’re given a set of 20 rules, which is quite managable. Let’s take a look.

inspect(head(rules,by="support"))

##     lhs            rhs      support    confidence lift     count
## [1] {Cake}      => {Coffee} 0.05654672 0.5389049  1.118042 374  
## [2] {Pastry}    => {Coffee} 0.04868461 0.5590278  1.159790 322  
## [3] {Sandwich}  => {Coffee} 0.04233444 0.5679513  1.178303 280  
## [4] {NONE}      => {Coffee} 0.04172966 0.5810526  1.205484 276  
## [5] {Medialuna} => {Coffee} 0.03296039 0.5751979  1.193337 218  
## [6] {Cookies}   => {Coffee} 0.02978530 0.5267380  1.092800 197

Cake, pastry, and sandwich are most commonly bought in tandem with Coffee. Oddly, Juice is also frequently bought with Coffee… We can only assume that this is due to one person buying for another and hopefully not that some monster is drinking drinking both simultaneously.

inspect(head(rules,by="confidence"))

##     lhs                     rhs      support     confidence lift     count
## [1] {Cake,Sandwich}      => {Coffee} 0.005594194 0.7551020  1.566576  37  
## [2] {Toast}              => {Coffee} 0.025703054 0.7296137  1.513697 170  
## [3] {NONE,Sandwich}      => {Coffee} 0.005140611 0.6415094  1.330911  34  
## [4] {Spanish Brunch}     => {Coffee} 0.014061083 0.6326531  1.312537  93  
## [5] {Cake,Hot chocolate} => {Coffee} 0.006501361 0.6323529  1.311914  43  
## [6] {Salad}              => {Coffee} 0.007862111 0.6117647  1.269201  52

More interesting are the association rules. Sorting by confidence gives us items on the LHS that give a high chance of being bought together with the item on the RHS. When looking at purchases that contrain {Cake, Sandwich}, 75% of them also contain {Coffee}.

Considering the suppport or count values, the association of {Toast} => {Coffee} may be more useful.

What about lift?

inspect(head(rules,by="lift"))

##     lhs                     rhs      support     confidence lift     count
## [1] {Cake,Sandwich}      => {Coffee} 0.005594194 0.7551020  1.566576  37  
## [2] {Toast}              => {Coffee} 0.025703054 0.7296137  1.513697 170  
## [3] {NONE,Sandwich}      => {Coffee} 0.005140611 0.6415094  1.330911  34  
## [4] {Spanish Brunch}     => {Coffee} 0.014061083 0.6326531  1.312537  93  
## [5] {Cake,Hot chocolate} => {Coffee} 0.006501361 0.6323529  1.311914  43  
## [6] {Salad}              => {Coffee} 0.007862111 0.6117647  1.269201  52

Lift

The lift measure is related closely to the confidence measure.

Lift: is the ratio of Confidence to Expected Confidence. A high lift value implies that there is a greater chance that our rule is not simply a coincidence. The larger the lift ratio, the more significant the relationship between LHS and RHS.

\[Lift = \frac{Support(X\Rightarrow Y)} {Support(X) * Support(Y)}\]

Though our lift values are not incredibly high, they may be high enough to rule out random chance. If we really want to find some high lift values we can refine our ruleset and adjust our support and confidence parameters.

rules2 <- apriori(y,parameter=list(support=0.00045,confidence=.9))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5 0.00045      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[104 item(s), 6614 transaction(s)] done [0.00s].
## sorting and recoding items ... [72 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [87 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(rules2,by="lift",n=10))

##      lhs                                      rhs             support     
## [1]  {Juice,Pick and Mix Bowls}            => {Mineral water} 0.0004535833
## [2]  {Bread,Scone,Truffles}                => {Mineral water} 0.0004535833
## [3]  {Bread,Mineral water,Truffles}        => {Scone}         0.0004535833
## [4]  {Mineral water,Pick and Mix Bowls}    => {Juice}         0.0004535833
## [5]  {Cookies,Hearty & Seasonal}           => {Hot chocolate} 0.0004535833
## [6]  {Cookies,Extra Salami or Feta}        => {NONE}          0.0004535833
## [7]  {Coffee,Cookies,Extra Salami or Feta} => {NONE}          0.0004535833
## [8]  {Coffee,Hot chocolate,Spanish Brunch} => {NONE}          0.0004535833
## [9]  {Coffee,Coke,Truffles}                => {Sandwich}      0.0006047777
## [10] {Hot chocolate,NONE,Tea}              => {Sandwich}      0.0006047777
##      confidence lift     count
## [1]  1          60.67890 3    
## [2]  1          60.67890 3    
## [3]  1          29.39556 3    
## [4]  1          24.86466 3    
## [5]  1          19.33918 3    
## [6]  1          13.92421 3    
## [7]  1          13.92421 3    
## [8]  1          13.92421 3    
## [9]  1          13.41582 4    
## [10] 1          13.41582 4

Lift values really rocket up! We would likely be way off an any assumptions made using these association rules as they’ve only appeared a few times in total (very low support).

Let’s get into visualizations now.

Visualize

There are various plots available from the arulesViz library. Though they can be nice to look at I find they aren’t quite as useful as tweaking rules manually via the support and confindence values and filtering based on support/confidence/lift. But let’s see what we can see.

plot(rules, measure=c("support","lift"), shading="confidence")

Based on the graph we have one or two points of interest with higher lift than the surrounding points. By adding the interactive=TRUE option to the plot() function we can select the datapoints for further exploration. It turns out these points are the ones below:

inspect(head(rules,by="lift",n=2))

##     lhs                rhs      support     confidence lift     count
## [1] {Cake,Sandwich} => {Coffee} 0.005594194 0.7551020  1.566576  37  
## [2] {Toast}         => {Coffee} 0.025703054 0.7296137  1.513697 170

Another somewhat useful plot is the Two-key plot

plot(rules, method = "two-key plot")

We can use this to distinguish total items purchased. In this case we see most people buy only 2 items but 3 item purchases seem to have a higher confidence level.

There are other graphs available in the arulesViz library that I simply do not know how to properly use, and in some cases are not useful to us in this particular analysis.

plot(rules, method="graph")

The above graph is not useful in our situation but would have allowed us to visualize clusters of different RHS items if they existed. In our case almost everyone buys coffee with another item, so Coffee is the only clustered item. In a grocery store we might see many different clusters characterizing different shoppers.

plot(rules, method="grouped")

plot(rules, method="paracoord", control=list(reorder=TRUE))

Conclusions

Taking a cursory look at the data and doing some relatively simple graphical analyses, we were able to discern which days were busiest, and at what times throughout the day that most transactions were taking place.

This data might be useful when scheduling staff and determining when additional staff is required, or when only minimal staff is needed. Of course, the bakery had probably already figured this out in a couple weeks of business. The graphical representation does give it a nice concrete feeling.

More useful is the number or items per transaction throughout the day.

As a business we obviously want to sell more items. We see most transactions consist of either one or 2 purchases. If we implemented some kind of lunch special during peak hours that incentivizes a 3rd item (preferably with high support already), we may push the total average transactions from 2 to 3.

It also might be wortwhile to target off-hours by creating a two-item “combo” which pushes people from buying just the single coffee item to coffee plus another high support item.

Our MBA shows that \(Toast \Rightarrow Coffee\) is a common occurence. Also that \(Cake, Sandwich \Rightarrow Coffee\) has high lift and confidence but low support. We might increase the likelihood of a 3rd item purchased by introducing a Toast/Cake/Coffee or a Sandwich/Cake/Coffee special at discounted price.

Bakery Transaction Data Analysis

Edward Yu

2018-11-17