Intro

Financial and customer data has been modified to anonymize.

The overall goal is to create a customer tier list in which the lowest tier customers have their sales visits limited or cut off at a certain point. Similarly, we prioritize high tier customers. The business case appears to be more efficient utilization of resources in order to score the biggest profit for the company.

Clean

During the cleaning of the 3 different data files we are able to summarize what is actually in them.

Sales

The sales data is essentially, sales data. We appear to have one transaction per row with data including customer sold to, product line, gross profit, etc. This seems immediately useful for calculations regarding net sales, profits, and margins across all product lines.

If we want to rate customers it might be beneficial to group the sales data by customer and then take the mean or median margin, sales, or combinations thereof to produce some sort of ranking metric.

glimpse(df_sales)

## Observations: 9,009
## Variables: 15
## $ date                  <dttm> 2019-04-01, 2019-04-01, 2019-04-01, 2019-04-...
## $ customer_num_sold_to  <fct> 304707, 305370, 305370, 330489, 330489, 33048...
## $ customer_num_ship_to  <fct> NA, NA, NA, 330489, 330489, 330489, 330489, 3...
## $ customer_name_sold_to <fct> Wabtec Foundry Limited, Esco Ltd, Esco Ltd, W...
## $ customer_name_ship_to <fct> Not Assigned, Not Assigned, Not Assigned, Wir...
## $ sales_office          <fct> NA, NA, NA, 2306, 2306, 2306, 2306, 2306, 230...
## $ region                <fct> Not Assigned, Not Assigned, Not Assigned, Mid...
## $ local_employee        <fct> Not Assigned, Not Assigned, Not Assigned, Mik...
## $ ext_matl_group        <fct> 0030, 0001, 0041, 0010, 0025, 0027, 0028, 002...
## $ product_line          <fct> Filters, Cold Box / Epoxy, Release Agents & A...
## $ product_group         <fct> Others - Cellular Filters, Amine Curing Pu Sy...
## $ net_sales             <dbl> 0, 0, 0, 8434, 1184, 1523, 1600, 0, 458, 0, 1...
## $ gross_profit          <dbl> -624, -2259, -151, 2644, 602, 608, 675, -18, ...
## $ gross_margin          <dbl> 0, 0, 0, 4220, 980, 608, 675, -18, 244, 0, 92...
## $ freight_customer      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Visits

The data appears to be a summary of opportunity on a per customer level. We appear to have a lot of junk data such as the gross_profit_pct which has values all equal to zero, with 26 NA values. There are more variables like this that seem, perhaps, a bit contrived in what they are trying to represent.

There are quite a few values here that should be automatically generated but appear to be missing. There might be an issue with this that needs resolved further up if we actually need the data.

glimpse(df_visits)

## Observations: 1,974
## Variables: 45
## $ accounts              <fct> a b Foundry, a b Foundry Machining, a l Johns...
## $ representative        <fct> Fitt Wayne, Thomas Keith, Fitt Wayne, Fitt Wa...
## $ region                <fct> Sm n Am Distributors, Central Region, Sm n Am...
## $ sales_ytd             <dbl> 303, 0, 0, 0, 0, 0, 0, 10080, 2008, 0, 0, 441...
## $ sales_opp             <dbl> 0, 12960, 0, 0, 233500, 0, 51695, 2673, 0, 26...
## $ gross_profit_pct      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ blue_silver           <fct> Silver, Silver, Silver, Silver, Silver, Silve...
## $ alert_msg_0           <fct> NA, NA, Missing Segmentation Data, Missing Se...
## $ blue_silver_score     <dbl> 2, 1, 0, 0, 1, 0, 1, 3, 2, 1, 0, 5, 5, 7, 5, ...
## $ actual_score          <dbl> 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 3, 2, 3, 2, ...
## $ opp_score             <dbl> 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 3, 3, 3, ...
## $ profit_score          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, ...
## $ num_visits            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 13, 7, 3,...
## $ pct_total             <dbl> 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ growth_type           <fct> Average Account, Average Account, Average Acc...
## $ account_type          <fct> Average Account, Average Account, Average Acc...
## $ last_visit            <ord> Jan, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sales                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 9, 5, 2, ...
## $ cat                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, ...
## $ tpm                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, ...
## $ blm                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ alert_msg_1           <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ growth_project        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ gp_type_0             <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, C...
## $ opp_rank_0            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, M...
## $ growth_project_amt_0  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1...
## $ gp_type_1             <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ opp_rank_1            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ growth_project_amt_1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ alert_msg_2           <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, M...
## $ a                     <fct> $0, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA, F...
## $ b                     <fct> $0, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA, R...
## $ c                     <fct> $0, Ra, NA, NA, Ct, NA, Ra, Cb, NA, NA, NA, N...
## $ d                     <fct> $0, NA, NA, NA, Ct, NA, Cb, Cb, NA, NA, NA, N...
## $ e                     <fct> $0, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, N...
## $ f                     <fct> $0, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, N...
## $ g                     <fct> $0, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, N...
## $ h                     <fct> $0, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, N...
## $ i                     <fct> $0, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, N...
## $ business_line_a       <fct> 0.00, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA,...
## $ business_line_a_value <dbl> 0, 9990, NA, NA, 168500, NA, 38345, 2664, NA,...
## $ business_line_b       <fct> 0.00, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA,...
## $ business_line_b_value <dbl> 0, 1980, NA, NA, 25000, NA, 11494, 1, NA, 150...
## $ business_line_c       <fct> 0.00, Ra, NA, NA, Ct, NA, Ra, Cb, NA, NA, NA,...
## $ business_line_c_value <dbl> 0, 990, NA, NA, 20000, NA, 1850, 1, NA, NA, N...

Opportunity

The profit and loss opportunity data seems to be similar to the last visits data columns. The goal appears to be an attempt to log and rank opporunities from each business line.

glimpse(df_ploop)

## Observations: 1,974
## Variables: 30
## $ account    <fct> a b Foundry, a b Foundry Machining, a l Johnson Co Gc In...
## $ CB         <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 800000, NA...
## $ NB         <dbl> NA, 9990, NA, NA, 168500, NA, 11494, NA, NA, NA, NA, NA,...
## $ CT         <dbl> NA, 1980, NA, NA, 20000, NA, 1, 1, NA, NA, NA, NA, 10080...
## $ AD         <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 249000, 22...
## $ RA         <dbl> NA, 990, NA, NA, 20000, NA, 1850, 2664, NA, 1500, NA, NA...
## $ IN         <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA...
## $ FT         <dbl> NA, NA, NA, NA, NA, NA, 38345, NA, NA, NA, NA, 100000, N...
## $ RS         <dbl> NA, NA, NA, NA, 25000, NA, 1, 1, NA, 25000, NA, 92500, 4...
## $ MT         <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 15000, 260...
## $ opp_a      <fct> NA, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA, Ft, Ct, Ft, ...
## $ opp_b      <fct> NA, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA, Rs, Cb, Mt, ...
## $ bl_a       <fct> NA, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA, Ft, Ct, Ft, ...
## $ bl_a_value <dbl> NA, 9990, NA, NA, 168500, NA, 38345, 2664, NA, 25000, NA...
## $ bl_b       <fct> NA, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA, Rs, Cb, Mt, ...
## $ bl_b_value <dbl> NA, 1980, NA, NA, 25000, NA, 11494, 1, NA, 1500, NA, 925...
## $ bl_c       <fct> NA, Ra, NA, NA, Ct, NA, Ra, Cb, NA, NA, NA, NA, Rs, Rs, ...
## $ bl_c_value <dbl> NA, 990, NA, NA, 20000, NA, 1850, 1, NA, NA, NA, NA, 466...
## $ bl_d       <fct> NA, NA, NA, NA, Ct, NA, Cb, Cb, NA, NA, NA, NA, Ad, Ad, ...
## $ bl_d_value <dbl> NA, NA, NA, NA, 20000, NA, 1, 1, NA, NA, NA, NA, 249000,...
## $ bl_e       <fct> NA, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, NA, Mt, Ct, ...
## $ bl_e_value <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 15000, 160...
## $ bl_f       <fct> NA, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, NA, NA, NA, ...
## $ bl_f_value <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, 25...
## $ bl_g       <fct> NA, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, NA, NA, NA, ...
## $ bl_g_value <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, 39...
## $ bl_h       <fct> NA, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, NA, NA, NA, ...
## $ bl_h_value <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N...
## $ bl_i       <fct> NA, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, NA, NA, NA, ...
## $ bl_i_value <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N...

Explore

Foreward
It’s clear that certain product lines pull in more net sales while others pull in more profits or margins. When looking even more closely at individual product lines we see that for each product line, there is generally one very significant product group that shifts the numbers for that entire line. Amine curing PU systems, for example, pulls the weight for the Cold Box product line. If we look at the other products in the Cold Box product line–Isomax, $SO_2$ Curing Systems, Amine Catalysts, and Methylformiat Curing Systems–all of their net sales combined do not equal those of the Amine curing PU systems alone.

What this ultimately means is that calculations used to determine average gross profit margin % on a per product line basis cannot take into account the top movers within each product line. Where this comes into play is when combining average gross margins with the sales opportunity data. We cannot be sure if, for example, the $9,000 potential in sales for the Cold Box / Epoxy group comes from the Amine curing PU systems group–32% margin–or the Amine Catalysts group–19.9% margin.
Therefore the exploratory analysis below reaches some useful conclusions but with some caveats.

Summary

General exploration of the data by product line shows Cold box and No bake / Hot box as top movers in terms of net sales. Filters and Coatings & additives also appear to be high volume movers but the generally lower prices on a per item basis are likely what pull these back towards the lower end of the net sales spectrum. The mean gross profit margin is 26%, with a median of around 28%.

When we look more closely at our product lines, expanding them into their separate product groups, the picture becomes more clear. While Cold box appeared to be the leading group, more specifically, it is the Amine curing PU systems that pull the numbers up. We see similar stories with Pep set pulling up the No bake group, etc. This point is further illustrated via the below histograms.

What this may mean overall, is that when making critical decisions we need to look more specifically at product groups rather than broadly at product lines. When attempting to rank customers, we will also need to find a balance between sales volumes and margins.

Product lines & groups

Gross profit margins vs Net sales

Product lines

Very nice to visualize the overall performance of all product lines. We can see our “bread and butter” higher net selling items far off the the right, right in the middle of our gross profit margin range. Metallurgy, risers, and coatings appear to be more commodity-like items, with release agents and specialty resins being more “designer” items, netting higher profit margins. The filters fall between both, likely due to the contrast between standard and 3d filters.

Plot

Important to note that values below zero were included. It appears that primarily the Coatings & Additives group is affected by including these values. When excluded, their gross_profit_margin value increases to roughly 30%, from 16%.

##################################
# GPM vs NETSALES across PRODUCT LINES, size = TRANSACTION COUNT
line_counts <- df_sales %>%
  dplyr::filter(product_line != "#N/A") %>%
  group_by(product_line) %>%
  add_tally() %>%
  slice(1) %>%
  dplyr::select(product_line, n)

product_line_df <- df_sales %>%
  dplyr::filter(product_group != "#N/A") %>%
  group_by(product_line) %>%
  dplyr::summarise(net_sales = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   gross_margin = sum(gross_margin),
                   gross_profit_margin = gross_profit / net_sales) %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
  mutate_if(is.numeric, function(x) replace(x, x<0, 0))

product_line_df <- left_join(product_line_df,line_counts) %>% 
  mutate(p_line_counts = n) %>% select(-c(n)) %>% 
  mutate(gross_profit_margin=gross_profit_margin*100) %>% 
  set_colnames(c("Product Line", "Net Sales", "Gross Profit", "Gross Margin","Gross Profit Margin","Transaction Count"))

product_line_df %>% 
  ggplot(aes(x=`Net Sales`, y=`Gross Profit Margin`, label=`Product Line`))+
  geom_point(aes(fill=`Transaction Count`, size=`Transaction Count`), shape=21,alpha=1)+
  geom_label_repel(aes(label=`Product Line`),
                   alpha=.9, 
                   force=70,
                   box.padding = .9,
                   seed=1002,
                   size=4,
                   segment.colour = 'grey40')+
  theme_minimal()+
  labs(title = "Product Lines - Gross profit margin vs Net sales",
       caption = "Removed: 'Unassigned', 'Design services', 'Others'\nSize & color relative to Transaction count",
       fill="Transaction count")+
  scale_fill_viridis(option = "C")+
  scale_y_continuous(name = "Gross profit margin (%)",
                     breaks=seq(0,100,5),
                     limits=c(15,40))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,50000000,5000000),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
  guides(size=FALSE)

Interactive

##################################
# GPM vs NETSALES across PRODUCT LINES, size = TRANSACTION COUNT
# PLOTLY VERSION
mycurrency <- function(x){
  return(paste0("$",formatC(signif(x,3), format="fg",big.mark=",")))
}

gg_product_lines <- product_line_df %>%
  mutate(`Gross Profit Margin` = round(`Gross Profit Margin`,2)) %>% 
  # ggplot(aes(x=`Net Sales`, y=`Gross Profit Margin`, label=`Product Line`))+
  ggplot(aes(x=`Net Sales`, y=`Gross Profit Margin`, label=`Product Line`,
             text = paste('Product Line: ' , `Product Line`,
                          '<br>Net Sales: ', mycurrency(`Net Sales`),
                          '<br>Gross Profit Margin: ', paste0(round(`Gross Profit Margin`,2),"%"),
                          '<br>Transaction Count: '  , `Transaction Count`
                          )))+
  geom_point(aes(color=`Transaction Count`, size=`Transaction Count`))+
  theme_minimal()+
  labs(title = "Product Lines - Gross profit margin vs Net sales",
       caption = "Removed: 'Unassigned', 'Design services', 'Others'\nSize relative to Transaction count",
       fill="Transaction count")+
  scale_color_viridis(option = "C")+
  scale_y_continuous(name = "Gross profit margin (%)",
                     breaks=seq(0,100,5),
                     limits=c(15,40))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,50000000,5000000),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
  guides(size=FALSE)
  

# ggplotly(gg_product_lines) %>% 
ggplotly(gg_product_lines, tooltip = c("text")) %>%
  layout(title = list(text = paste0('Product Lines - Gross profit margin vs Net sales',
                                    '<br>',
                                    '<sup>',
                                    "Removed: 'Unassigned', 'Design services', 'Others'; Size relative to Transaction count",
                                    '</sup>')),
         legend=list(orientation="h",x=0,y=-.2))

Product groups

Though the previous plot is not necessarily misleading, it tells a more thorough story when product lines are ungrouped into their separate product groups. It becomes more clear that although the Cold Box / Epoxy group leads in net sales, this is due specifically to the Amine Curing PU Systems product group. The same can be said of the Filters and No Bake/ Hot Box groups with Zr-Filters and Pep Set products, respectively.

Scatter

##################################
# GPM vs NET SALES across PRODUCT GROUPS, size = transaction count, color = product lines
line_counts <- df_sales %>%
  dplyr::filter(product_line != "#N/A") %>%
  group_by(product_group) %>%
  add_tally() %>%
  slice(1) %>% 
  ungroup() %>% 
  dplyr::select(product_group, n)

gg_product_groups_df <- df_sales %>%
  dplyr::filter(product_group != "#N/A") %>%
  group_by(product_group,product_line) %>%
  dplyr::summarise(net_sales = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   gross_margin = sum(gross_margin),
                   gross_profit_margin = (gross_profit / net_sales)*100) %>% 
  arrange(-net_sales) %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
  mutate_if(is.numeric, function(x) replace(x, x<0, 0)) %>% 
  dplyr::filter(gross_profit_margin < 100 & gross_profit_margin > 0 & net_sales > 0) %>% 
  left_join(line_counts)

gg_product_groups_df %>% 
  ggplot(aes(x=net_sales, y=gross_profit_margin, label=product_group))+
  geom_point(aes(fill=product_line, size=n), shape=21)+
  geom_label_repel(data=subset(gg_product_groups_df,
                               (net_sales > 3e6 & gross_profit_margin > 25) | 
                                 gross_profit_margin > 40), 
                   aes(label=product_group),
                   alpha=.6,
                   force=90,
                   box.padding = .6,
                   seed=1003,
                   size=4,
                   segment.colour = 'grey40')+
  theme_minimal()+
  scale_y_continuous(name = "Gross profit margin (%)",
                     breaks=seq(-10,100,5))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,35e6, 5e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
  scale_fill_brewer(palette = "Set1")+
  guides(size=FALSE,
         fill = guide_legend(override.aes = list(size=5)))+
  labs(title = "Product Groups - Gross profit margin vs Net sales",
       subtitle = "Size = transaction count",
       caption = "Removed: 0%, 100% margins; net sales < 0",
       fill= "Product line")

Interactive

##################################
# GPM vs NET SALES across PRODUCT GROUPS, size = transaction count, color = product lines
# PLOTLY VERSION
gg_product_lines <- gg_product_groups_df %>% 
  ggplot(aes(x=net_sales, y=gross_profit_margin, label=product_group,
         text = paste0('Product Group: ', paste0(gg_product_groups_df$product_group," (",gg_product_groups_df$product_line, ")"),
                      '<br>Net Sales: ', mycurrency(gg_product_groups_df$net_sales),
                      '<br>Gross Profit Margin: ', paste0(round(gg_product_groups_df$gross_profit_margin,2),"%"),
                      '<br>Transaction Count: '  , gg_product_groups_df$n)))+
  geom_point(aes(color=product_line, size=n), alpha=.8)+
  theme_minimal()+
  scale_y_continuous(name = "Gross profit margin (%)",
                     breaks=seq(-10,100,5))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,35e6, 5e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
  scale_color_brewer(palette = "Set1")+
  guides(size=FALSE)+
  # guides(size=FALSE,
  #        color = guide_legend(override.aes = list(size=5)))+
  labs(title   = "Product Groups - Gross profit margin vs Net sales",
       caption = "Removed: 0%, 100% margins; net sales < 0",
       color   = "Product line")
  
# ggplotly(gg_product_lines)
ggplotly(gg_product_lines, tooltip = c("text")) %>%
  layout(title = list(text = paste0('Product Lines - Gross profit margin vs Net sales',
                                    '<br>',
                                    '<sup>',
                                    "Size = transaction count; Removed: 0%, 100% margins; net sales < 0",
                                    '</sup>')))

Sunburst

Sunburst diagram representing product lines as sub-categories of product groups. Size is relative to total gross profits.

Product lines, by product group, representing total gross profit

df_sales %>%
  dplyr::filter(gross_profit > 0) %>% 
  dplyr::select(product_line, product_group, gross_profit) %>% 
  mutate(product_line = gsub("[^[:alnum:] ]", "", product_line)) %>% 
  mutate(product_line = gsub("  ", " ", product_line)) %>%
  mutate(product_group = gsub("[^[:alnum:] ]", "", product_group)) %>% 
  mutate(product_group = gsub("  ", " ", product_group)) %>% 
  group_by(product_line, product_group) %>% 
  summarise(gp=sum(gross_profit)) %>% 
  mutate(path = paste0(product_line, "-", product_group)) %>% 
  ungroup() %>% 
  dplyr::select(path, gp) %>% 
  sunburst(legend=FALSE)

Customer analysis

Gross profit segmentation

By grouping customers into clusters representing roughly 10% of total gross profits we can glean a better understanding of where much of the profits come from. Each primary cluster in the treemap below represents 10% of total gross profits. The number on the top left of each cluster (1-9) represents a pseudo-ranking of customer “importance”, with importance representing percentage of total gross profits contributed.

For example, cluster 1 shows that Porter Warner Industries Llc represents the first 10% of total gross profits. cluster 2 represents customers that make up 11-20% of gross profits, and so on. If we list all the customers in primary clusters 1-7, for example, then we have a list of the fewest customers that make up 79% of total gross profits.

It’s an interesting thought experiment in deciding whether our rank 1: Porter Warner Industries Llc is a more “valuable” customer than, for example, our rank 2: Caterpillar Incorporated and Waupaca Foundry Inc customers. Although a larger share of profits originate with the larger customer, the risk is also doubled in the case of a loss. The rank 2 customers include a sort of built-in hedge against this while also bringing in significant value.

Treemap

Note, gross profits and net sales below zero are included.

#########################################
# TREEMAP of customers
df_totals <- df_sales %>% 
  dplyr::filter(customer_name_sold_to != "not assigned") %>%
  # dplyr::filter(gross_profit > 0 & net_sales > 0 & customer_name_sold_to != "not assigned") %>% 
  dplyr::summarise(net_sales = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   gross_margin = sum(gross_margin),
                   gross_profit_margin = gross_profit / net_sales)

gg_tree_df <- df_sales %>% 
  group_by(customer_name_sold_to) %>% 
  dplyr::filter(gross_profit > 0) %>% 
  dplyr::summarise(net_sales    = sum(net_sales),
                   gross_profit = sum(gross_profit)) %>% 
  arrange(-gross_profit) %>% 
  mutate(net_sales_total = df_totals$net_sales) %>% 
  mutate(net_sales_pct = net_sales/net_sales_total) %>% 
  mutate(gross_profit_total = df_totals$gross_profit) %>%
  mutate(gross_profit_pct = gross_profit/gross_profit_total) %>% 
  mutate(cusum_gp = cumsum(gross_profit_pct)) %>% 
  mutate(rank = ifelse(cusum_gp>.9, 9, 
                       ifelse(cusum_gp>.8, 8,
                              ifelse(cusum_gp>.7, 7,
                                     ifelse(cusum_gp>.6, 6,
                                            ifelse(cusum_gp>.5, 5,
                                                   ifelse(cusum_gp>.4, 4,
                                                          ifelse(cusum_gp>.3, 3,
                                                                 ifelse(cusum_gp>.2, 2,
                                                                        ifelse(cusum_gp>.1, 1, 0)))))))))) %>% 
  mutate(rank2 = as.factor(rank)) %>%
  group_by(rank) %>%
  mutate(customer_name_sold_to = toTitleCase(as.character(customer_name_sold_to))) %>% 
  add_tally()

tm <- treemap(
  gg_tree_df,
  index=c("rank2", "customer_name_sold_to"),
  vSize="gross_profit_pct",
  type="categorical",
  vColor="rank2",
  title="10% total gross profit clusters",
  algorithm="pivotSize",
  sortID = "rank",
  palette="-RdYlGn",
  fontsize.labels = c(16,10),
  border.col=c("white", "grey60"),
  lowerbound.cex.labels = 0,
  bg.labels=220,
  align.labels=list(
    c("left", "top"),
    c("center", "center")),
  position.legend = "none"
)

Table

The table of our ranked customer clusters shows a few things to us:

Num customers in rank: the number of customers allocated to each assigned ranking group.
Cum num customers in rank: a cumulative version of the Num customers in rank column.
Pct of total customers in rank: the percentage of customers in that rank out of the total amount of customers.
Pct total GP in rank: the percentage of gross profit in that rank out of the total gross profits.
Cum pct of total customers: a cumulative version of the Pct of total customers in rank column.
Cum pct total GP in rank: a cumulative version of the Pct total GP in rank column.

One of the more interesting bits of information we can gather from this table comes from the last two, cumulative columns. We see a pseudo-Pareto-like pattern. Row 8, or rank 8, shows that the top 11.1% of customers account for 79.4% of gross profits.

####################################
# DF of CUSTOMER RANK INFO
gg_tree_df_gp <- gg_tree_df %>% 
  group_by(rank) %>%
  summarise(pct_gp_in_rank = sum(gross_profit)/sum(gg_tree_df$gross_profit)) %>% 
  mutate(cum_pct_gp_in_rank = cumsum(pct_gp_in_rank))

gg_tree_df_gp <- gg_tree_df %>% 
  group_by(rank) %>% slice(1) %>% ungroup() %>% 
  mutate(cum_num_customers_in_rank = cumsum(n),
         pct_total_customers_in_rank = n/nrow(gg_tree_df),
         cum_pct_total_customers_in_rank = cumsum(pct_total_customers_in_rank)) %>% 
  left_join(gg_tree_df_gp) %>% 
  dplyr::select(rank, 
                n, 
                cum_num_customers_in_rank,
                pct_total_customers_in_rank,
                pct_gp_in_rank,
                cum_pct_total_customers_in_rank, 
                cum_pct_gp_in_rank) %>% 
  set_colnames(c("Rank", 
                 "Num customers in rank", 
                 "Cum num customers in rank", 
                 "Pct of total customers in rank",
                 "Pct total GrossProfit in rank",
                 "Cum pct of total customers", 
                 "Cum pct total GrossProfit in rank"))

datatable(gg_tree_df_gp, 
          options = list(
            autoWidth = TRUE,
            dom = 't'
            # dom = 'tip'
          ),
          class = 'cell-border stripe',
          caption = "Gross profit opportunity of ranked customers",
          rownames = FALSE) %>% 
  formatPercentage(c(names(gg_tree_df_gp[-c(1,2,3)])), 1) %>% 
  formatStyle(names(gg_tree_df_gp[2]),
              background = styleColorBar(range(gg_tree_df_gp[2]), 'bisque'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_tree_df_gp[3]),
              background = styleColorBar(range(gg_tree_df_gp[3]), 'cornsilk'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_tree_df_gp[4]),
              background = styleColorBar(range(gg_tree_df_gp[4]), 'lightblue'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_tree_df_gp[5]),
              background = styleColorBar(range(gg_tree_df_gp[5]), 'lightsalmon'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_tree_df_gp[6]),
              background = styleColorBar(range(gg_tree_df_gp[6]), 'lightblue'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_tree_df_gp[7]),
              background = styleColorBar(range(gg_tree_df_gp[7]), 'lightsalmon'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Lolipop

Here we can look at a sorted list of customers that essentially make up to 79% of gross profit. The list consists of 53 total customers.

#############################
# LOLIPOP OF SORTED GP pct CUSTOMERS

df_lolipop <- gg_tree_df %>%
  dplyr::select(customer_name_sold_to, gross_profit, gross_profit_pct, rank, net_sales) %>% 
  arrange(-gross_profit_pct) %>% 
  ungroup() %>% 
  mutate(Rank = as.factor(rank)) %>%
  mutate(customer_name_sold_to = toTitleCase(as.character(customer_name_sold_to))) %>% 
  slice(1:53)

gg_lolipop <- df_lolipop %>% 
  ggplot(aes(x=reorder(customer_name_sold_to,gross_profit_pct),
             y=gross_profit_pct,
             text=paste("Customer: ",customer_name_sold_to,
                        "<br>Rank: ", Rank,
                        "<br>Gross profit: ", mycurrency(gross_profit),
                        "<br>Gross profit % of total:", paste0(round( (signif(gross_profit_pct,3)*100), 3), "%"),
                        "<br>Net sales: ", mycurrency(net_sales))))+
  geom_point(aes(fill=Rank), size=3, stroke=.2, shape=21)+
  geom_segment(aes(x=customer_name_sold_to, 
                   xend=customer_name_sold_to, 
                   y=0, 
                   yend=gross_profit_pct), 
               alpha=.4,
               color="grey40")+
  theme(panel.grid.major.y = element_blank())+
  coord_flip()+
  theme_minimal()+
  scale_y_continuous(name = "Gross profit (%)",
                     labels = function(x) paste0(x * 100, '%'),
                     breaks=seq(0,.15,.01))+
  xlab("Customer")+
  # scale_color_brewer(palette="Dark2")+
  scale_fill_brewer(palette="RdYlGn",direction=-1)+
  labs(title = "Customers vs gross profit contribution",
       caption= "Top 53 customers accounting for ~79% gross profit")

ggplotly(gg_lolipop, tooltip = c("text")) %>% 
  layout(title = list(text = paste0('Customers vs gross profit contribution',
                                    '<br>',
                                    '<sup>',
                                    "Top 53 customers accounting for ~79% gross profit",
                                    '</sup>')))

Gross profit opportunity

Though important to know where current profits originate, and where the top percentage of company profits come from, if we don’t look at opportunity then we are ignoring the potential for growth.

The sales opportunity dataset provides sales opportunity estimates not only for each prospective foundry, but for each product line. We know that each product line has quite different margins thanks to our analysis above. Now, we can leverage this knowledge and provide accurate estimates of potential gross profits.

In other words, we will use our average calculated gross margin percent for each product line, multiply this by the sales opportunity figure, resulting in a better representation of potential gross profits.

Data

Though not shown in completion, the below data is the backbone for the new gross profit opportunity calculation. The raw opportunity data for each customer, for each product line, is used in conjuction with the margin data calculated above. This new product line gross profit opportunity metric is then added together for each customer and presented below.

Raw opportunity

The data is first reshaped into long format.

df_opps <- df_ploop %>% 
  left_join(df_visits, by=c('account'= 'accounts')) %>% 
  distinct() %>% 
  dplyr::filter(sales_opp > 0) %>% 
  dplyr::select(colnames(df_ploop[1:10])) %>% 
  mutate_if(is.numeric, funs(replace_na(., 0))) %>% 
  mutate(CAA = CT+AD) %>% 
  dplyr::select(-c(CT,AD)) %>% 
  set_colnames(c("account",
                 "Cold Box / Epoxy",
                 "No Bake / Hot Box",
                 "Release Agents & Auxiliaries",
                 "Inorganics",
                 "Filters",
                 "Risers",
                 "Metallurgy",
                 "Coatings & Additives")) %>% 
  pivot_longer(cols=-account, 
               names_to="product_line", 
               values_to="sales_opp")  %>% 
  mutate_if(is.character, as.factor)

datatable(df_opps %>% 
            set_colnames(c("Account","Product Line", "Sales Opportunity")),
          options = list(
            autoWidth = TRUE,
            # dom = 't'
            dom = 'ftip'
          ),
          class = 'cell-border stripe',
          caption = "Gross profit opportunity of ranked customers",
          rownames = FALSE) %>% 
  formatCurrency(c("Sales Opportunity"), "$")

Raw margins

Here we have the raw margin data from calculations above. The Inorganics margin was imputed using the mean of all other lines, since the data was missing.

df_product_line_sales <- df_sales %>%
  dplyr::filter(product_group != "#n/a") %>%
  dplyr::filter(product_line %in% levels(df_opps$product_line)) %>% 
  group_by(product_line) %>%
  dplyr::summarise(net_sales = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   # gross_margin = sum(gross_margin),
                   gross_profit_margin = gross_profit / net_sales) %>% 
  ungroup() %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
  mutate_if(is.numeric, function(x) replace(x, x<0, 0)) %>% 
  mutate(net_sales_pct = net_sales/sum(net_sales),
         gross_profit_pct = gross_profit/sum(gross_profit)) %>% 
  mutate(gross_profit_margin = ifelse(is.na(gross_profit_margin), 
                                      mean(gross_profit_margin,na.rm = TRUE), gross_profit_margin))

datatable(df_product_line_sales %>%
            dplyr::select(product_line, gross_profit_margin) %>%
            set_colnames(c("Product Line", "Gross Profit Margin %")),
          options = list(
            autoWidth = TRUE,
            dom = 't'
            # dom = 'tip'
            ),
          class = 'cell-border stripe',
          caption = "Calculated gross profit margins for each product line",
          rownames = FALSE) %>%
  formatPercentage(c("Gross Profit Margin %"), 2)

Treemap

################################
# TOP CUSTOMERS basedon potential profit

df_opps <- df_ploop %>% 
  left_join(df_visits, by=c('account'= 'accounts')) %>% 
  distinct() %>% 
  dplyr::filter(sales_opp > 0) %>% 
  dplyr::select(colnames(df_ploop[1:10])) %>% 
  mutate_if(is.numeric, funs(replace_na(., 0))) %>% 
  mutate(CAA = CT+AD) %>% 
  dplyr::select(-c(CT,AD)) %>% 
  set_colnames(c("account",
                 "Cold Box / Epoxy",
                 "No Bake / Hot Box",
                 "Release Agents & Auxiliaries",
                 "Inorganics",
                 "Filters",
                 "Risers",
                 "Metallurgy",
                 "Coatings & Additives")) %>% 
  pivot_longer(cols=-account, 
               names_to="product_line", 
               values_to="sales_opp")  %>% 
  mutate_if(is.character, as.factor)

df_product_line_sales <- df_sales %>%
  dplyr::filter(product_group != "#n/a") %>%
  dplyr::filter(product_line %in% levels(df_opps$product_line)) %>% 
  group_by(product_line) %>%
  dplyr::summarise(net_sales = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   # gross_margin = sum(gross_margin),
                   gross_profit_margin = gross_profit / net_sales) %>% 
  ungroup() %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
  mutate_if(is.numeric, function(x) replace(x, x<0, 0)) %>% 
  mutate(net_sales_pct = net_sales/sum(net_sales),
         gross_profit_pct = gross_profit/sum(gross_profit)) %>% 
  mutate(gross_profit_margin = ifelse(is.na(gross_profit_margin), 
                                      mean(gross_profit_margin,na.rm = TRUE), gross_profit_margin))

df_all_opps <- df_opps %>%
  left_join(df_product_line_sales %>% dplyr::select(product_line, gross_profit_margin)) %>% 
  mutate(gross_profit_opp = sales_opp*gross_profit_margin) %>% 
  dplyr::select(-c(gross_profit_margin)) %>% 
  dplyr::filter(sales_opp != 0) %>% 
  group_by(account) %>% 
  dplyr::summarise(sales_opp = sum(sales_opp),
                   gross_profit_opp = sum(gross_profit_opp)) %>% 
  left_join(df_visits %>% dplyr::select(accounts, num_visits), by=c('account' = 'accounts')) %>% 
  arrange(-gross_profit_opp) %>% 
  distinct() %>% 
  mutate(gross_profit_opp_pct = gross_profit_opp/sum(gross_profit_opp)) %>% 
  mutate(gross_profit_opp_cusum_pct = cumsum(gross_profit_opp_pct)) %>% 
  mutate(rank = ifelse(gross_profit_opp_cusum_pct>.9, 9, 
                       ifelse(gross_profit_opp_cusum_pct>.8, 8,
                              ifelse(gross_profit_opp_cusum_pct>.7, 7,
                                     ifelse(gross_profit_opp_cusum_pct>.6, 6,
                                            ifelse(gross_profit_opp_cusum_pct>.5, 5,
                                                   ifelse(gross_profit_opp_cusum_pct>.4, 4,
                                                          ifelse(gross_profit_opp_cusum_pct>.3, 3,
                                                                 ifelse(gross_profit_opp_cusum_pct>.2, 2,
                                                                        ifelse(gross_profit_opp_cusum_pct>.1, 1, 0)))))))))) %>% 
  mutate(rank2 = as.factor(rank)) %>%
  group_by(rank)



tm <- treemap(
  df_all_opps,
  index=c("rank", "account"),
  vSize="gross_profit_opp",
  type="categorical",
  vColor="rank2",
  title="10% total potential gross profit clusters",
  algorithm="pivotSize",
  sortID = "rank",
  palette="-RdYlGn",
  fontsize.labels = c(16,10),
  border.col=c("white", "grey60"),
  lowerbound.cex.labels = 0,
  bg.labels=220,
  align.labels=list(
    c("left", "top"),
    c("center", "center")),
  position.legend = "none"
)

Table

df_all_opps_gp <- df_all_opps %>% 
  group_by(rank) %>%
  summarise(pct_gp_opp_in_rank = sum(gross_profit_opp)/sum(df_all_opps$gross_profit_opp)) %>% 
  mutate(cum_pct_gp_opp_in_rank = cumsum(pct_gp_opp_in_rank))

df_all_opps_gp <- df_all_opps %>% 
  group_by(rank) %>%
  add_tally() %>% slice(1) %>% ungroup() %>% 
  mutate(cum_num_customers_in_rank = cumsum(n),
         pct_total_customers_in_rank = n/nrow(df_all_opps),
         cum_pct_total_customers_in_rank = cumsum(pct_total_customers_in_rank)) %>% 
  left_join(df_all_opps_gp) %>% 
  dplyr::select(rank, 
                n, 
                cum_num_customers_in_rank,
                pct_total_customers_in_rank,
                pct_gp_opp_in_rank,
                cum_pct_total_customers_in_rank, 
                cum_pct_gp_opp_in_rank) %>% 
  set_colnames(c("Rank", 
                 "Num customers in rank", 
                 "Cum num customers in rank", 
                 "Pct of total customers in rank",
                 "Pct total GP opportunity in rank",
                 "Cum pct of total customers", 
                 "Cum pct total GP opportunity in rank"))

datatable(df_all_opps_gp, 
          options = list(
            autoWidth = TRUE,
            dom = 't'
            # dom = 'tip'
          ),
          class = 'cell-border stripe',
          caption = "Gross profit opportunity of ranked customers",
          rownames = FALSE) %>% 
  formatPercentage(c(names(df_all_opps_gp[-c(1,2,3)])), 1) %>% 
  formatStyle(names(df_all_opps_gp[2]),
              background = styleColorBar(range(df_all_opps_gp[2]), 'bisque'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_all_opps_gp[3]),
              background = styleColorBar(range(df_all_opps_gp[3]), 'cornsilk'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_all_opps_gp[4]),
              background = styleColorBar(range(df_all_opps_gp[4]), 'lightblue'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_all_opps_gp[5]),
              background = styleColorBar(range(df_all_opps_gp[5]), 'lightsalmon'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_all_opps_gp[6]),
              background = styleColorBar(range(df_all_opps_gp[6]), 'lightblue'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_all_opps_gp[7]),
              background = styleColorBar(range(df_all_opps_gp[7]), 'lightsalmon'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Lolipop

GP opp % all seems very low, double check calculations. Total gross profit potential?

#################################
# GP OPP LOLIPOP

df_opp_lolipop <- df_all_opps %>%
  dplyr::select(account, gross_profit_opp, gross_profit_opp_pct, rank, sales_opp) %>% 
  arrange(-gross_profit_opp_pct) %>% 
  ungroup() %>% 
  mutate(Rank = as.factor(rank)) %>%
  slice(1:67)

df_opp_short_names <- df_opp_lolipop %>% 
  mutate(account = as.character(account)) %>% 
  mutate(nchar = nchar(account)) %>% 
  dplyr::select(account, nchar) %>% 
  arrange(-nchar) %>% 
  mutate(account_short = substring(account,1,35)) %>% 
  dplyr::select(account, account_short)
  # mutate_all(as.factor)

df_opp_lolipop <- df_opp_lolipop %>% 
  left_join(df_opp_short_names)


gg_df_opp_lolipop <- df_opp_lolipop %>% 
  ggplot(aes(x=reorder(account_short ,gross_profit_opp_pct),
             y=gross_profit_opp_pct,
             text=paste("Customer: ",account_short,
                        "<br>Rank: ", Rank,
                        "<br>Gross profit opportunity: ", mycurrency(gross_profit_opp),
                        "<br>Gross profit opportunity % of total:", paste0(round( (signif(gross_profit_opp_pct,3)*100), 3), "%"),
                        "<br>Net sales opportunity: ", mycurrency(sales_opp))))+
  geom_point(aes(fill=Rank), size=3, stroke=.2, shape=21)+
  geom_segment(aes(x=account_short, 
                   xend=account_short, 
                   y=0, 
                   yend=gross_profit_opp_pct), 
               alpha=.4,
               color="grey40")+
  theme(panel.grid.major.y = element_blank())+
  coord_flip()+
  theme_minimal()+
  scale_y_continuous(name = "Gross profit opportunity (%)",
                     labels = function(x) paste0(x * 100, '%'),
                     breaks=seq(0,.25,.02))+
  xlab("Customer")+
  # scale_color_brewer(palette="Dark2")+
  scale_fill_brewer(palette="RdYlGn",direction=-1)+
  labs(title = "Customers vs gross profit opportunity",
       caption= "Top 67 customers accounting for ~79% gross profit opportunity")

ggplotly(gg_df_opp_lolipop, tooltip = c("text")) %>% 
  layout(title = list(text = paste0('Customers vs gross profit opportunity',
                                    '<br>',
                                    '<sup>',
                                    "Top 67 customers accounting for ~79% gross profit opportunity",
                                    '</sup>')))

Analysis

We’ve performed a bit of analysis above, grouping current customers into categories of most and least profitable. Similarly, we’ve classified potential customers based on potential gross profit. At this point we should restate the goal of this endeavor.

Classify customer accounts (premium vs non-premium) and determine which accounts will continue to receive complimentary support services with no additional charge.

Also of interest is ranking customers on a scale that assists with prioritization of visits and/or attention. This seems straightforward and easily accomplished by simply looking at total gross profits on a per customer basis, however, because certain product lines have dramatically higher net sales than others, using a gross profit or net sales metric would unfairly select customer priority based on a single product line. In other words, we need a way to standardize or normalize the effects.

Priority by product line

To emphasize the point, we can show the clustered treemap visualization for each product line, along with the sorted lolipop diagram. It becomes apparent that not all customers carry the same importance when segregating by product line.

Treemap

Coatings & Additives

df_sales_customer <- df_sales %>% 
  dplyr::filter(product_line %in% levels(df_opps$product_line)) %>% 
  group_by(customer_name_sold_to, product_line) %>% 
  summarise(net_sales = sum(net_sales),
            gross_profit = sum(gross_profit),
            gross_profit_margin = gross_profit/net_sales) %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA))

get_pl_treemap <- function(PL = "Coatings & Additives"){
  a <- df_sales_customer %>% 
    dplyr::filter(net_sales > 0, 
                  gross_profit > 0,
                  gross_profit_margin > 0) %>% 
    dplyr::filter(product_line == PL) %>% 
    arrange(-gross_profit) %>% 
    ungroup() %>% 
    mutate(gross_profit_pct = gross_profit/sum(gross_profit),
           gross_profit_cum_pct = cumsum(gross_profit_pct)) %>% 
    mutate(rank = ifelse(gross_profit_cum_pct>.9, 9, 
                         ifelse(gross_profit_cum_pct>.8, 8,
                                ifelse(gross_profit_cum_pct>.7, 7,
                                       ifelse(gross_profit_cum_pct>.6, 6,
                                              ifelse(gross_profit_cum_pct>.5, 5,
                                                     ifelse(gross_profit_cum_pct>.4, 4,
                                                            ifelse(gross_profit_cum_pct>.3, 3,
                                                                   ifelse(gross_profit_cum_pct>.2, 2,
                                                                          ifelse(gross_profit_cum_pct>.1, 1, 0)))))))))) %>%
    mutate(rank2 = as.factor(rank))
  return(
    treemap(
      a,
      index=c("rank2", "customer_name_sold_to"),
      vSize="gross_profit_pct",
      type="categorical",
      vColor="rank2",
      title=paste0(PL, " - 10% total gross profit clusters"),
      algorithm="pivotSize",
      sortID = "rank",
      palette="-RdYlGn",
      fontsize.labels = c(16,10),
      border.col=c("white", "grey60"),
      lowerbound.cex.labels = 0,
      bg.labels=220,
      align.labels=list(
        c("left", "top"),
        c("center", "center")),
      position.legend = "none"
    )
  ) 
}
get_pl_treemap()

Cold Box / Epoxy

get_pl_treemap("Cold Box / Epoxy")

Filters

get_pl_treemap("Filters")

Metallurgy

get_pl_treemap("Metallurgy")

No Bake / Hot Box

get_pl_treemap("No Bake / Hot Box")

Release Agents & Auxiliaries

get_pl_treemap("Release Agents & Auxiliaries")

Risers

get_pl_treemap("Risers")

Lolipop

Coatings & Additives

get_pl_lolipop <- function(PL = "Coatings & Additives"){
  a <- df_sales_customer %>% 
    dplyr::filter(net_sales > 0, 
                  gross_profit > 0,
                  gross_profit_margin > 0) %>% 
    dplyr::filter(product_line == PL) %>% 
    arrange(-gross_profit) %>% 
    ungroup() %>% 
    mutate(gross_profit_pct = gross_profit/sum(gross_profit),
           gross_profit_cum_pct = cumsum(gross_profit_pct)) %>% 
    mutate(rank = ifelse(gross_profit_cum_pct>.9, 9, 
                         ifelse(gross_profit_cum_pct>.8, 8,
                                ifelse(gross_profit_cum_pct>.7, 7,
                                       ifelse(gross_profit_cum_pct>.6, 6,
                                              ifelse(gross_profit_cum_pct>.5, 5,
                                                     ifelse(gross_profit_cum_pct>.4, 4,
                                                            ifelse(gross_profit_cum_pct>.3, 3,
                                                                   ifelse(gross_profit_cum_pct>.2, 2,
                                                                          ifelse(gross_profit_cum_pct>.1, 1, 0)))))))))) %>%
    ungroup() %>% 
    mutate(Rank = as.factor(rank)) %>% 
    dplyr::filter(rank < 8)
  
  top_cust <- nrow(a)
  gp_scale <- ceiling(round(max(a$gross_profit_pct),3)*100)/100
  gp_scale_breaks <- floor(gp_scale/10*100)/100
  title  <- paste0(PL, " - Customers vs gross profit contribution")
  caption <-  paste0("Top ", top_cust, " customers accounting for ~80% gross profit")
  
  b <- a %>% 
    ggplot(aes(x=reorder(customer_name_sold_to,gross_profit_pct),
               y=gross_profit_pct,
               text=paste("Customer: ",customer_name_sold_to,
                          "<br>Rank: ", Rank,
                          "<br>Gross profit: ", mycurrency(gross_profit),
                          "<br>Gross profit % of total:", paste0(round( (signif(gross_profit_pct,3)*100), 3), "%"),
                          "<br>Net sales: ", mycurrency(net_sales))))+
    geom_point(aes(fill=Rank), size=3, stroke=.2, shape=21)+
    geom_segment(aes(x=customer_name_sold_to, 
                     xend=customer_name_sold_to, 
                     y=0, 
                     yend=gross_profit_pct), 
                 alpha=.4,
                 color="grey40")+
    theme(panel.grid.major.y = element_blank())+
    coord_flip()+
    theme_minimal()+
    scale_y_continuous(name = "Gross profit (%)",
                       labels = function(x) paste0(x * 100, '%'),
                       breaks=seq(0,gp_scale,gp_scale_breaks))+
    xlab("Customer")+
    # scale_color_brewer(palette="Dark2")+
    scale_fill_brewer(palette="RdYlGn",direction=-1)+
    labs(title = title,
         caption= caption)
  
  ggplotly(b, tooltip = c("text")) %>% 
    layout(title = list(text = paste0(title,
                                      '<br>',
                                      '<sup>',
                                      caption,
                                      '</sup>')))
  
}
get_pl_lolipop()

Cold Box / Epoxy

get_pl_lolipop("Cold Box / Epoxy")

Filters

get_pl_lolipop("Filters")

Metallurgy

get_pl_lolipop("Metallurgy")

No Bake / Hot Box

get_pl_lolipop("No Bake / Hot Box")

Release Agents & Auxiliaries

get_pl_lolipop("Release Agents & Auxiliaries")

Risers

get_pl_lolipop("Risers")

Scaling product lines

What normalization or scaling allows is a conversion of values to a range of 0 and 1. If we segregate our product lines and scale all values, we would be left with relative measures that would allow for more fair comparison of customer value between groups.

So instead of comparing a Cold Box customer with $2M gross profit to a Riser customer with $0.75M gross profit, we scale both of these maximum values to 1, giving them equal importance.

Looking at the data below we can visualize the effects first hand. The minimum and maximum values within each product line are converted to 0 and 1, respectively, and all values between are mapped to a relative value between.

Visualize scaling

Unscaled

df_sales_customer <- df_sales %>% 
  dplyr::filter(product_line %in% levels(df_opps$product_line)) %>% 
  group_by(customer_name_sold_to, product_line) %>% 
  summarise(net_sales = sum(net_sales),
            gross_profit = sum(gross_profit),
            gross_profit_margin = gross_profit/net_sales) %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA))

df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>% 
  ggplot(aes(x=net_sales, y=gross_profit_margin))+
  ggpointdensity::geom_pointdensity()+
  facet_wrap(~product_line)+
  theme_minimal()+
  scale_color_viridis(option="C")+
  labs(title = "Product Lines - Gross profit margin vs Net sales",
       caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
       color = "N-neighbors")+
  theme(legend.position = "none")+
  scale_y_continuous(name = "Gross profit margin (%)",
                     breaks=seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,10e6,2e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))

Scaled

df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>% 
  group_by(product_line) %>% 
  mutate_if(is.numeric, rescale) %>% 
  ggplot(aes(x=net_sales, y=gross_profit_margin))+
  ggpointdensity::geom_pointdensity()+
  facet_wrap(~product_line)+
  theme_minimal()+
  scale_color_viridis(option="C")+
  labs(title = "Product Lines - Scaled - Gross profit margin vs Net sales",
       caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
       color = "N-neighbors")+
  theme(legend.position = "none")+
  scale_y_continuous(name = "Gross profit margin (Scaled)",
                     breaks=seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))+
  scale_x_continuous(name = "Net sales (Scaled)",
                     breaks = seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))

Clustering

Working with our scaled data we can assign scores to each customer by clustering them into groups. The overall goal is to work within each product line, clustering customers into 5 different Scores, with the highest Score corresponding to the most profitable customers (i.e. largest gross profit value), and a score of 1 to the least profitable. The Score can then be tallied across all customers giving a relative total ranking.

For example. If Customer A is the top customer in Cold Box, only, they are assigned a rank of 5. If Customer B receives a rank of 3 on both Metallurgy and Filters, their score would then be a 6. Customer B would therefore have a higher score.

We group into 5 clusters (k=5), visualized in the below plot.

K = 5

Plotting gross profit margin vs net sales is how we’re used to visualizing so far, which we see in the first plot. Since the clusters were assigned using gross profit, it is slightly more straightforward to see how they are grouped in the alternative plot.

vs Gross profit margin

Typical plot we are used to seeing.

# scaled df
df_sales_customer_normal <- df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>%  
  group_by(product_line) %>% 
  mutate_if(is.numeric, rescale)

# cluster all in groups
df_sales_customer_normal %>% 
  group_by(product_line) %>% 
  mutate(cluster = (Ckmeans.1d.dp(gross_profit, k=5)$cluster)) %>% 
  ggplot(aes(x=net_sales, y=gross_profit_margin))+
  geom_point(aes(fill=as.factor(cluster)), shape=21,color="grey50",size=2,alpha=.5)+
  facet_wrap(~product_line)+
  theme_few()+ 
  scale_fill_manual(values=c("#ffccf7","#3e78b2","#dd6031","#baf2bd","#1e0002"))+
  labs(title = "Product Lines - Scaled, Clustered - Gross profit margin vs Net sales",
       caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
       fill = "Clusters")+
  # theme(legend.position = "bottom")+
  scale_y_continuous(name = "Gross profit margin (Scaled)",
                     breaks=seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))+
  scale_x_continuous(name = "Net sales (Scaled)",
                     breaks = seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))

vs Gross profit

Clusters are assigned based on gross profit, so slightly easier to visualize the clustering this way.

# cluster all in groups
df_sales_customer_normal %>% 
  group_by(product_line) %>% 
  mutate(cluster = (Ckmeans.1d.dp(gross_profit, k=5)$cluster)) %>% 
  ggplot(aes(x=net_sales, y=gross_profit))+
  geom_point(aes(fill=as.factor(cluster)), shape=21,color="grey50",size=2,alpha=.5)+
  facet_wrap(~product_line)+
  theme_few()+ 
  scale_fill_manual(values=c("#ffccf7","#3e78b2","#dd6031","#baf2bd","#1e0002"))+
  labs(title = "Product Lines - Scaled, Clustered - Gross profit margin vs Net sales",
       caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
       fill = "Clusters")+
  # theme(legend.position = "bottom")+
  scale_y_continuous(name = "Gross profit margin (Scaled)",
                     breaks=seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))+
  scale_x_continuous(name = "Net sales (Scaled)",
                     breaks = seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))

Clustered customers

Lolipop

df_scaled_scored_indiv <- df_sales_customer_normal %>% 
  mutate(cluster = (Ckmeans.1d.dp(gross_profit, k=5)$cluster)) %>% 
  dplyr::select(-c(net_sales, gross_profit, gross_profit_margin)) %>% 
  right_join(df_sales_customer) %>% 
  mutate(cluster = ifelse(is.na(cluster),0,cluster))

df_scaled_scored <- df_scaled_scored_indiv %>% 
  group_by(customer_name_sold_to) %>% 
  summarise(rank = sum(cluster)) %>% 
  right_join(df_scaled_scored_indiv) %>% 
  mutate(rank2 = as.factor(rank)) %>% 
  mutate(customer_short = substring(customer_name_sold_to,1,25)) %>%
  arrange(-rank)

gg_df_scaled_scored <- df_scaled_scored %>% 
  arrange(-rank) %>% 
  slice(1:300) %>% 
  # spread(key=product_line, value=net_sales) %>% 
  ggplot(aes(x=reorder(customer_short ,rank),
             y=rank,
             text=paste("Customer: ",customer_name_sold_to,
                        "<br>Rank: ", rank)))+
  geom_point(aes(fill=rank2), size=3, stroke=.2, shape=21)+
  geom_segment(aes(x=customer_short, 
                   xend=customer_short, 
                   y=0, 
                   yend=rank), 
               alpha=.4,
               color="grey40")+
  theme(panel.grid.major.y = element_blank())+
  coord_flip()+
  theme_minimal()+
  theme(legend.position = "none")+
  scale_y_continuous(name = "Rank",
                     breaks=seq(0,25,2))+
  xlab("Customer")+
  labs(title = "Customers vs Rank",
       caption= "Rank calculated by scaling gross profit\nFollowed by univariate clustering, k=5")

ggplotly(gg_df_scaled_scored, tooltip = c("text")) %>% 
  layout(title = list(text = paste0('Customers vs Rank',
                                    '<br>',
                                    '<sup>',
                                    "Rank calculated by scaling gross profit across product lines, followed by univariate clustering, k=5",
                                    '</sup>')))

Datatable

Pulling all sales data with the new, ranking variables we create a datatable summarizing everything from cumulative or Customer Rank, to separate ranks within each product line, or Product Line Cluster.

df_scaled_scored <- df_scaled_scored %>% 
  dplyr::select(customer_name_sold_to,
                product_line,
                rank,
                net_sales,
                gross_profit,
                gross_profit_margin,
                cluster) %>% 
  set_colnames(c("Customer",
                 "Product Line",
                 "Customer Rank",
                 "Net Sales",
                 "Gross Profit",
                 "Gross Profit Margin",
                 "Product Line Cluster"))

datatable(df_scaled_scored,
          options = list(
            autoWidth = TRUE
            # dom = 't'
            # dom = 'tip'
          ),
          class = 'cell-border stripe',
          caption = "Total gross profit of current customers",
          rownames = FALSE) %>% 
  formatPercentage(c(names(df_scaled_scored[c(6)])), 1) %>%
  formatCurrency(c(names(df_scaled_scored[c(4,5)])), "$") %>% 
  formatStyle(names(df_scaled_scored[3]),
              background = styleColorBar(as.numeric(range(df_scaled_scored[3], na.rm=TRUE)), 'lightsalmon'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_scaled_scored[4]), 
              background = styleColorBar(range(df_scaled_scored[4]), 'lightblue'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_scaled_scored[5]),
              background = styleColorBar(as.numeric(range(df_scaled_scored[5],na.rm = TRUE)), 'lightblue'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_scaled_scored[6]),
              background = styleColorBar(as.numeric(range(df_scaled_scored[6],na.rm = TRUE)), 'bisque'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(df_scaled_scored[7]),
              background = styleColorBar(as.numeric(range(df_scaled_scored[7],na.rm = TRUE)), 'lightsalmon'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Gold-Silver-Bronze

Though we have nice rank numbers for each customer it does seem more complicated than assigning a much simpler ranking of Gold/Silver/Bronze. If we look at the distribution of our ranks we find that the majority of customers fall into Rank 1. It makes sense to assign Rank 1 and below as Bronze.

At the other end of the spectrum we find that only 5% of customers are at or above Rank 7. These would be good Gold candidates. Thus, putting customers that fall between Ranks 2 - 6 into the Silver category.

We can assign new variables and plot the results to visualize our new classificiation system.

Score distribution

50% of customers fall into Rank 1 or below.
Only 5% of customers are Rank 7 or above.
The remaining 45% of customers of course fall between the above ranks.

ranks <- df_scaled_scored %>% 
  dplyr::select(Customer, `Customer Rank`) %>% 
  set_colnames(c("customer_name_sold_to", "rank"))

gg_ranked_customers <- df_sales %>% 
  group_by(customer_name_sold_to) %>% 
  dplyr::summarise(net_sales    = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   gross_profit_margin = gross_profit/net_sales) %>% 
  # slice(1) %>% 
  left_join(ranks) %>% 
  group_by(customer_name_sold_to) %>% 
  slice(1) %>% 
  dplyr::filter(net_sales > 0 & gross_profit_margin > 0 & gross_profit_margin < 1) %>% 
  mutate(gross_profit_margin =  gross_profit_margin*100) %>% 
  mutate(rank = ifelse(is.na(rank),0,rank))

gg_ranks_dt <- gg_ranked_customers %>% 
  group_by(rank) %>% 
  add_tally() %>% 
  arrange(-rank) %>% ungroup() %>% 
  as_tibble() %>% 
  mutate(n_pct = n/nrow(gg_ranked_customers)) %>% 
  group_by(rank) %>% slice(1) %>% ungroup() %>% 
  mutate(n_pct_cusum = cumsum(n_pct)) %>%  
  dplyr::select(rank,
                n,
                n_pct,
                n_pct_cusum) %>% 
  set_colnames(c("Rank",
                 "Num customers in rank",
                 "Pct customers in rank",
                 "Cusum pct customers in rank"))

datatable(gg_ranks_dt,
          options = list(
            autoWidth = TRUE,
            pageLength = 25,
            dom = 'ti'
            # dom = 'tip'
          ),
          fillContainer = FALSE,
          class = 'cell-border stripe',
          caption = "Distribution of ranks",
          rownames = FALSE) %>% 
  formatPercentage(c(names(gg_ranks_dt[c(3,4)])), 1) %>%
  formatStyle(names(gg_ranks_dt[2]),
              background = styleColorBar(as.numeric(range(gg_ranks_dt[2], na.rm=TRUE)), 'lightsalmon'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_ranks_dt[3]), 
              background = styleColorBar(range(gg_ranks_dt[3]), 'bisque'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_ranks_dt[4]),
              background = styleColorBar(as.numeric(range(gg_ranks_dt[4],na.rm = TRUE)), 'lightblue'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Visualize ranked customers

A final representation of customers segregated into olympic ranks.

vs Gross profit margin

df_olympics <- gg_ranked_customers %>% 
  mutate(olympics = ifelse(rank < 2, "Bronze", 
                           ifelse(rank < 7, "Silver", "Gold"))) %>% 
  mutate(gross_profit_margin = gross_profit_margin/100)

# df_olympics$olympics <- factor(df_olympics$olympics, levels = c("Gold","Silver","Bronze"))
df_olympics$olympics <- factor(df_olympics$olympics, levels = c("Bronze","Silver","Gold"))

gg_olympics <- df_olympics %>% 
  # ggplot(aes(x=net_sales, y=gross_profit,
  ggplot(aes(x=net_sales, y=gross_profit_margin,
             text=paste('Customer:',customer_name_sold_to,
                        '<br>Net Sales:',mycurrency(net_sales),
                        '<br>Gross Profit:',mycurrency(gross_profit),
                        '<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
                        '<br>Rank:',rank,
                        '<br>Medal:',olympics)
                        ))+
  geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
  theme_few()+
  labs(title = "Ranked customers - Gross profit margin vs Net sales",
       caption = "",
       fill="Rank",
       alpha = "")+
  # scale_fill_manual(values=c("#ffd707","#EEEEEE","#a77044"))+
  scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
  scale_alpha_discrete(range = c(0.5, 1))+
  scale_y_continuous(name = "Gross profit margin (%)",
                     labels=percent_format(suffix="%",accuracy=1),
                     breaks=seq(0,1,.25))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,15e6,5e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
  
ggplotly(gg_olympics, tooltip = c("text")) %>%
  layout(title = list(text = paste0('Ranked customers - Gross profit margin vs Net sales',
                                    '<br>',
                                    '<sup>',
                                    "",
                                    '</sup>')))

vs Gross profit

gg_olympics <- df_olympics %>% 
  ggplot(aes(x=net_sales, y=gross_profit,
             text=paste('Customer:',customer_name_sold_to,
                        '<br>Net Sales:',mycurrency(net_sales),
                        '<br>Gross Profit:',mycurrency(gross_profit),
                        '<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
                        '<br>Rank:',rank,
                        '<br>Medal:',olympics)
                        ))+
  geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
  theme_few()+
  labs(title = "Ranked customers - Gross profit margin vs Net sales",
       caption = "",
       fill="Rank",
       alpha = "")+
  # scale_fill_manual(values=c("#ffd707","#EEEEEE","#a77044"))+
  scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
  scale_alpha_discrete(range = c(0.5, 1))+
  scale_y_continuous(name = "Gross profit (USD)",
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"),
                     breaks=seq(0,4e6,1e6))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,15e6,5e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
  
ggplotly(gg_olympics, tooltip = c("text")) %>%
  layout(title = list(text = paste0('Ranked customers - Gross profit margin vs Net sales',
                                    '<br>',
                                    '<sup>',
                                    "",
                                    '</sup>')))

Normalizing product lines

One potential issue with scaling rather than normalizing is the preservation of outliers. In our scaling process we observed this in the non-normal distribution of Gold/Silver/Bronze rankings, with Gold essentially making up all of the extreme positive outliers, while Silver and Bronze made up the majority of customers. Is this a problem?

If we want to prioritize only the extreme outliers, then this is not a problem. However, normalization would allow a correction of this non-normal distribution, allowing us to use standard deviation as a useful measure. Specifically, in a normally distributed dataset, 68% of the observations fall within the +/- 1 standard deviation from the mean; 95% of data is contained within the +/- 2 standard deviation from the mean; 99.7% of data is contained within the +/- 3 standard deviation from the mean.

The benefits to normalizing rather than scaling is that in each category we capture the 20% highest performers rather than simply the extreme outliers.

Standard bell curve

Visualize normalization

We can visualize the normal distribution in a few different way, using scatterplots and density plots. Using the scatterplots we can easily see how the maxima and minima have been scaled to a value of 0 or 1, respectively. The density plots show how the normalization has given nice standardized bell curves of all data.

Scatter

Comparison of raw, scaled, normalized data.

Raw

Comparing raw data shows vastly different values with maximum values of the larger product lines shadowing the smaller.

df_customer_sales <- df_sales %>% 
  dplyr::filter(product_line %in% levels(df_opps$product_line)) %>% 
  group_by(customer_name_sold_to, product_line) %>% 
  summarise(net_sales = sum(net_sales),
            gross_profit = sum(gross_profit),
            gross_profit_margin = gross_profit/net_sales) %>% 
  mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>%  
  dplyr::select(-c(gross_profit_margin)) %>% 
  arrange(-gross_profit)

get_bestNormal <- function(df = df_customer_sales, 
                           PL = "Coatings & Additives", 
                           col = "gross_profit"){
  
  df_pl <- df_customer_sales %>% 
    dplyr::filter(product_line == PL) %>% 
    arrange(-gross_profit)
  
  new_colname = paste0("norm_",col)
  x <- df_pl[[col]]
  x_t <- tibble(!!new_colname := bestNormalize::bestNormalize(x, warn=FALSE)$x.t)
  
  new_df <- df_pl %>% 
    bind_cols(x_t)
  
  return(new_df)
}

combine_normal_df <- function(dfa = df_customer_sales, dfb = get_bestNormal(PL = "Filters")){
  merged <- dfa %>% 
    left_join(dfb, by=colnames(df_customer_sales))
  # full_join(dfb)
  return(merged)
  
}

# combine all norm_gross profits
all_norms <- combine_normal_df(dfa = df_customer_sales, 
                               dfb = get_bestNormal(PL = levels(gdata::drop.levels(df_customer_sales$product_line))[1]))
for (level in levels(gdata::drop.levels(df_customer_sales$product_line))){
  all_norms <- combine_normal_df(dfa = all_norms, 
                                 dfb = get_bestNormal(PL = level)) %>%
    mutate(norm_gross_profit = ifelse(is.na(norm_gross_profit.x),
                                      norm_gross_profit.y,
                                      norm_gross_profit.x)) %>%
    dplyr::select(-c(norm_gross_profit.x, norm_gross_profit.y))
}

# combine all norm_net_sales
all_norms <- combine_normal_df(dfa = all_norms, 
                               dfb = get_bestNormal(PL = levels(gdata::drop.levels(df_customer_sales$product_line))[1],
                                                    col = "net_sales"))
for (level in levels(gdata::drop.levels(df_customer_sales$product_line))){
  all_norms <- combine_normal_df(dfa = all_norms, 
                                 dfb = get_bestNormal(PL = level,
                                                      col = "net_sales")) %>%
    mutate(norm_net_sales = ifelse(is.na(norm_net_sales.x),
                                   norm_net_sales.y,
                                   norm_net_sales.x)) %>%
    dplyr::select(-c(norm_net_sales.x, norm_net_sales.y))
}

# unscaled
df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>% 
  # ggplot(aes(x=net_sales, y=gross_profit_margin))+
  ggplot(aes(x=net_sales, y=gross_profit))+
  ggpointdensity::geom_pointdensity(adjust=.5e6)+
  facet_wrap(~product_line)+
  theme_few()+
  scale_color_viridis(option="C")+
  labs(title = "Product Lines - Gross profit vs Net sales",
       caption = "Raw data",
       color = "N-neighbors")+
  # theme(legend.position = "none")+
  scale_y_continuous(name = "Gross profit (USD)",
                     breaks = seq(0,2e6,1e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,6e6,2e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))

Scaled

Scaling the values on both axes “stretches” out the data. We still have extreme outliers but they are now comparable to one another.

# scaled
df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>% 
  group_by(product_line) %>% 
  mutate_if(is.numeric, rescale) %>% 
  ggplot(aes(x=net_sales, y=gross_profit))+
  ggpointdensity::geom_pointdensity()+
  facet_wrap(~product_line)+
  theme_few()+
  scale_color_viridis(option="C")+
  labs(title = "Product Lines - Gross profit margin vs Net sales",
       caption = "Scaled data",
       color = "N-neighbors")+
  # theme(legend.position = "none")+
  scale_y_continuous(name = "Gross profit (Scaled)",
                     breaks=seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))+
  scale_x_continuous(name = "Net sales (Scaled)",
                     breaks = seq(0,1,.25),
                     labels=percent_format(suffix="%",accuracy=1))

Normalized

Normally distributed data has transformed data so that 68% of observations fall within 1 standard deviation of the mean, etc. Extreme outliers have been essentially eliminated.

all_norms %>% 
  mutate(norm_gross_profit_margin = norm_gross_profit/norm_net_sales) %>% 
  ggplot(aes(x=norm_net_sales, y=norm_gross_profit))+
  ggpointdensity::geom_pointdensity()+
  geom_vline(xintercept = fivenum(all_norms$norm_gross_profit),linetype="dashed",alpha=.3, color="grey50")+
  facet_wrap(~product_line)+
  ggpointdensity::geom_pointdensity()+
  facet_wrap(~product_line)+
  theme_few()+
  scale_color_viridis(option="C")+
  labs(title = "Product Lines - Gross profit margin vs Net sales",
       caption = "Normalized data",
       color = "N-neighbors")+
  # theme(legend.position = "none")+
  scale_y_continuous(name = "Gross profit (Normalized)",
                     breaks=seq(-3,3,1))+
  scale_x_continuous(name = "Net sales (Normalized)",
                     breaks = seq(-3,3,1))

Density

Density plots allow better visualization of the bell-curves. Our raw plot is skewed heavily to the left. Scaling shifts it slightly, but scaling does not help outliers. Normalization produces proper bell curves for us.

Raw

# raw
df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>% 
  ggplot(aes(x=gross_profit, y=..scaled..))+
  geom_density()+ facet_wrap(~product_line)+
  theme_few()+
  labs(title = "Product Lines - Gross profit density",
       caption = "Raw data")+
  scale_x_continuous(name = "Gross profit",
                     breaks = seq(0,2e6,1e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
  scale_y_continuous(name = "Density",
                     breaks=seq(0,1,.5),
                     labels=percent_format(suffix="%",accuracy=1))

Scaled

# scaled
df_sales_customer %>% 
  dplyr::filter(net_sales > 0, 
                gross_profit > 0,
                gross_profit_margin > 0) %>% 
  group_by(product_line) %>% 
  mutate_if(is.numeric, rescale) %>% 
  ggplot(aes(x=gross_profit, y=..scaled..))+
  geom_density()+ facet_wrap(~product_line)+
  theme_few()+
  labs(title = "Product Lines - Gross profit density",
       caption = "Scaled data")+
  scale_x_continuous(name = "Gross profit (Scaled)",
                     breaks=seq(0,1,.5),
                     labels=percent_format(suffix="%",accuracy=1))+
  scale_y_continuous(name = "Density",
                     breaks=seq(0,1,.5),
                     labels=percent_format(suffix="%",accuracy=1))

Normalized

# normalized
all_norms %>% 
  mutate(norm_gross_profit_margin = norm_gross_profit/norm_net_sales) %>% 
  ggplot(aes(x=norm_gross_profit, y=..scaled..))+
  geom_density()+ facet_wrap(~product_line) + 
  theme_few()+
  labs(title = "Product Lines - Gross profit density",
       caption = "Normalized data")+
  scale_x_continuous(name = "Gross profit (Normalized)",
                     breaks = seq(-3,3,1))+
  scale_y_continuous(name = "Density",
                     breaks=seq(0,1,.5),
                     labels=percent_format(suffix="%",accuracy=1))

Clustering

Now we cluster based on standard deviation from the mean values within each product line. We will break customers up into 6 different clusters:

Rank 1, extreme low performers: below -2 stdev from mean
Rank 2, low performers: between -2 and -1 stdev from mean
Rank 3, below average performers: between -1 and 0 stdev from mean
Rank 4, above average performers: between 0 and +1 stdev from mean
Rank 5, high performers: between +1 and +2 stdev from mean
Rank 6, extreme high performers: above +2 stdev from mean

Because normalization seeks to create a mean at zero and a standard deviation of 1, performing our clustering is far simpler than with scaling. We simply assign the appropriate Rank of 1-6 based on whether the value lies within the assigned range.

The benefits of this become quite obvious when we look at the below datatable summarizing our customers as a whole and by product line: instead of a random cluster number, we know for certain that a Rank 6 customer is in the top, roughly 0.3%, of highest gross profit customers. Similarly, if a customer is a Rank 4 customer, we know they are an above average performer based on our bell curve.

Distribution

Because we normalized our data, rankings are also evenly distributed.

# rank customers by product line
norms_customers_ranked <- all_norms %>% 
  mutate(norm_rank = ifelse(norm_gross_profit > 2, 6,
                            ifelse(norm_gross_profit >1, 5,
                                   ifelse(norm_gross_profit >0, 4, 
                                          ifelse(norm_gross_profit > -1, 3,
                                                 ifelse(norm_gross_profit > -2, 2, 1))))))
# group customers, add ranks
norms_customers_sum_ranks <- norms_customers_ranked %>% 
  group_by(customer_name_sold_to) %>% 
  summarise_at(vars(norm_rank), funs(sum,prod)) %>% 
  # arrange(-norm_rank) %>% 
  set_colnames(c("customer_name_sold_to","norm_rank_sum","norm_rank_prod"))

# join ranks and sum ranks to sales data
df_customers_norm_ranked <- df_customer_sales %>% 
  mutate(gross_profit_margin = gross_profit/net_sales) %>% 
  left_join(norms_customers_ranked %>% dplyr::select(customer_name_sold_to, product_line, norm_rank)) %>% 
  left_join(norms_customers_sum_ranks) %>% 
  arrange(customer_name_sold_to)

# distribution of individual ranks among all customers-productline combos
df_customers_norm_ranked %>% 
  ggplot(aes(x=norm_rank))+
  geom_histogram(bins=6, color="white")+
  theme_few()+
  scale_x_continuous(name = "Rank",
                     breaks =seq(0,6,1))+
  scale_y_continuous(name = "Count",
                     breaks = seq(0,300,50))+
  labs(title = "Histogram of customer ranks",
       caption= "Straightforward normal distribution of scores")

Clustered customers

Lolipop

# lolipop of top ranking customers-----
gg_customers_norm_ranked <- df_customers_norm_ranked %>% 
  group_by(customer_name_sold_to) %>% slice(1) %>% ungroup() %>% 
  mutate(customer_short = substring(customer_name_sold_to,1,25)) %>%
  arrange(-norm_rank_sum) %>% 
  dplyr::filter(norm_rank_sum > 10) %>% 
  ggplot(aes(x=reorder(customer_short ,norm_rank_sum),
             y=norm_rank_sum,
             text=paste("Customer: ",customer_name_sold_to,
                        "<br>Net sales: ", mycurrency(net_sales),
                        "<br>Gross profit: ", mycurrency(gross_profit),
                        "<br>Gross profit margin:", paste0(round( (signif(gross_profit_margin,3)*100), 3), "%"),
                        "<br>Rank: ", norm_rank_sum)))+
  geom_point(aes(fill=as.factor(norm_rank_sum)), size=3, stroke=.2, shape=21)+
  geom_segment(aes(x=customer_short, 
                   xend=customer_short, 
                   y=0, 
                   yend=norm_rank_sum), 
               alpha=.4,
               color="grey40")+
  theme(panel.grid.major.y = element_blank())+
  coord_flip()+
  theme_minimal()+
  theme(legend.position = "none")+
  scale_y_continuous(name = "Rank",
                     breaks=seq(0,36,2))+
  xlab("Customer")+
  labs(title = "Customers vs Rank (Rank > 10)",
       caption= "Rank: sum of normalized gross sales rankings by product lines")

ggplotly(gg_customers_norm_ranked, tooltip = c("text")) %>% 
  layout(title = list(text = paste0('Customers vs Rank (Rank > 10)',
                                    '<br>',
                                    '<sup>',
                                    "Rank: sum of normalized gross sales rankings by product lines",
                                    '</sup>')))

Datatable

# datatable of norm rank data ----
dt_customers_norm_ranked <- df_customers_norm_ranked %>% 
  arrange(-norm_rank_sum) %>% 
  dplyr::select(customer_name_sold_to,
                product_line,
                norm_rank_sum,
                net_sales,
                gross_profit,
                gross_profit_margin,
                norm_rank) %>% 
  set_colnames(c("Customer",
                 "Product Line",
                 "Customer Rank",
                 "Net Sales",
                 "Gross Profit",
                 "Gross Profit Margin",
                 "Product Line Rank"))

datatable(dt_customers_norm_ranked,
          options = list(
            autoWidth = TRUE
            # dom = 't'
            # dom = 'tip'
          ),
          class = 'cell-border stripe',
          caption = "Normalized customer rankings by product line",
          rownames = FALSE) %>% 
  formatPercentage(c(names(dt_customers_norm_ranked[c(6)])), 1) %>%
  formatCurrency(c(names(dt_customers_norm_ranked[c(4,5)])), "$") %>% 
  formatStyle(names(dt_customers_norm_ranked[3]),
              background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[3], na.rm=TRUE)), 'lightsalmon'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(dt_customers_norm_ranked[4]), 
              background = styleColorBar(range(dt_customers_norm_ranked[4]), 'lightblue'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(dt_customers_norm_ranked[5]),
              background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[5],na.rm = TRUE)), 'lightblue'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(dt_customers_norm_ranked[6]),
              background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[6],na.rm = TRUE)), 'bisque'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(dt_customers_norm_ranked[7]),
              background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[7],na.rm = TRUE)), 'lightsalmon'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Gold-Silver-Bronze

Though the above information is useful, it may still be encumbering to sort through numbered ranks. Once again we can look into assigning customers olympic medals. We can try a similar method as we tried above, with our scaled ranks: segregating ranks into clusters of 50%, 45%, and 5% for Bronze, Silver, and Gold ranks, respectively.

We first look at the distribution of our ranks and then onto the final visualized scoring.

Score distribution

Normalization reduced the heavy right-skewed distribution we saw when scaling our values, above. We do see the distribution of scores has evened out, still right-skewed, but far less right-skewed than without normalization.

# distribution of norm ranks ?
norm_ranks <- df_customers_norm_ranked %>% 
  dplyr::select(customer_name_sold_to, norm_rank_sum)

df_norm_ranks_distr <- df_sales %>% 
  group_by(customer_name_sold_to) %>% 
  dplyr::summarise(net_sales    = sum(net_sales),
                   gross_profit = sum(gross_profit),
                   gross_profit_margin = gross_profit/net_sales) %>% 
  # slice(1) %>% 
  left_join(norm_ranks) %>% 
  group_by(customer_name_sold_to) %>% 
  slice(1) %>% 
  dplyr::filter(net_sales > 0 & gross_profit_margin > 0 & gross_profit_margin < 1) %>% 
  mutate(gross_profit_margin =  gross_profit_margin*100) %>% 
  mutate(norm_rank_sum = ifelse(is.na(norm_rank_sum),0,norm_rank_sum))

dt_norm_ranks_distr <- df_norm_ranks_distr %>% 
  group_by(norm_rank_sum) %>% 
  add_tally() %>% 
  arrange(-norm_rank_sum) %>% ungroup() %>% 
  as_tibble() %>% 
  mutate(n_pct = n/nrow(df_norm_ranks_distr)) %>% 
  group_by(norm_rank_sum) %>% slice(1) %>% ungroup() %>% 
  mutate(n_pct_cusum = cumsum(n_pct)) %>% 
  dplyr::select(norm_rank_sum,
                n,
                n_pct,
                n_pct_cusum) %>% 
  set_colnames(c("Rank",
                 "Num customers in rank",
                 "Pct customers in rank",
                 "Cusum pct customers in rank"))

datatable(dt_norm_ranks_distr,
          options = list(
            dom = 'tipl ',
            autoWidth = TRUE,
            pageLength = 50
            # dom = 'tip'
          ),
          fillContainer = FALSE,
          class = 'cell-border stripe',
          caption = "Distribution of normalized ranks",
          rownames = FALSE) %>% 
  formatPercentage(c(names(gg_ranks_dt[c(3,4)])), 1) %>%
  formatStyle(names(gg_ranks_dt[2]),
              background = styleColorBar(as.numeric(range(gg_ranks_dt[2], na.rm=TRUE)), 'lightsalmon'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_ranks_dt[3]), 
              background = styleColorBar(range(gg_ranks_dt[3]), 'bisque'),
              # backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center') %>% 
  formatStyle(names(gg_ranks_dt[4]),
              background = styleColorBar(as.numeric(range(gg_ranks_dt[4],na.rm = TRUE)), 'lightblue'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Visualize ranked customers

vs Gross profit margin

# dotplot olympic norm_rank
df_norm_olympics <- df_norm_ranks_distr %>% 
  mutate(olympics = ifelse(norm_rank_sum < 5, "Bronze", 
                           ifelse(norm_rank_sum < 18, "Silver", "Gold"))) %>% 
  mutate(gross_profit_margin = gross_profit_margin/100)
  
df_norm_olympics$olympics <- factor(df_norm_olympics$olympics, levels = c("Bronze","Silver","Gold"))

gg_norm_olympics <- df_norm_olympics %>% 
  ggplot(aes(x=net_sales, y=gross_profit_margin,
             # ggplot(aes(x=net_sales, y=gross_profit_margin,
             text=paste('Customer:',customer_name_sold_to,
                        '<br>Net Sales:',mycurrency(net_sales),
                        '<br>Gross Profit:',mycurrency(gross_profit),
                        '<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
                        '<br>Rank:',norm_rank_sum,
                        '<br>Medal:',olympics)))+
  geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
  theme_few()+
  labs(title = "Ranked customers - Gross profit vs Net sales",
       caption = "",
       fill="Rank",
       alpha = "")+
  scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
  scale_alpha_discrete(range = c(0.5, 1))+
  # scale_y_continuous(name = "Gross profit (USD)",
  #                    labels = dollar_format(prefix="$", scale=1e-6,suffix="M"),
  #                    breaks=seq(0,4e6,1e6))+
  scale_y_continuous(name = "Gross profit margin (%)",
                     labels=percent_format(suffix="%",accuracy=1),
                     breaks=seq(0,1,.25))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,15e6,5e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))

ggplotly(gg_norm_olympics, tooltip = c("text")) %>%
  layout(title = list(text = paste0('Ranked customers - Gross profit margin vs Net sales',
                                    '<br>',
                                    '<sup>',
                                    "Ranks assigned using normalized customer/product-line data",
                                    '</sup>')))

vs Gross profit

# dotplot olympic norm_rank
df_norm_olympics <- df_norm_ranks_distr %>% 
  mutate(olympics = ifelse(norm_rank_sum < 5, "Bronze", 
                           ifelse(norm_rank_sum < 18, "Silver", "Gold"))) %>% 
  mutate(gross_profit_margin = gross_profit_margin/100)
  
df_norm_olympics$olympics <- factor(df_norm_olympics$olympics, levels = c("Bronze","Silver","Gold"))

gg_norm_olympics <- df_norm_olympics %>% 
  ggplot(aes(x=net_sales, y=gross_profit,
             text=paste('Customer:',customer_name_sold_to,
                        '<br>Net Sales:',mycurrency(net_sales),
                        '<br>Gross Profit:',mycurrency(gross_profit),
                        '<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
                        '<br>Rank:',norm_rank_sum,
                        '<br>Medal:',olympics)))+
  geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
  theme_few()+
  labs(title = "Ranked customers - Gross profit vs Net sales",
       caption = "",
       fill="Rank",
       alpha = "")+
  scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
  scale_alpha_discrete(range = c(0.5, 1))+
  scale_y_continuous(name = "Gross profit (USD)",
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"),
                     breaks=seq(0,4e6,1e6))+
  # scale_y_continuous(name = "Gross profit margin (%)",
  #                    labels=percent_format(suffix="%",accuracy=1),
  #                    breaks=seq(0,1,.25))+
  scale_x_continuous(name = "Net sales (USD)",
                     breaks = seq(0,15e6,5e6),
                     labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))

ggplotly(gg_norm_olympics, tooltip = c("text")) %>%
  layout(title = list(text = paste0('Ranked customers - Gross profit vs Net sales',
                                    '<br>',
                                    '<sup>',
                                    "Ranks assigned using normalized customer/product-line data",
                                    '</sup>')))

Customer Segmentation

Ranking customers based on financial impact

Edward Yu

26 March, 2020

Intro

Clean

Sales

Visits

Opportunity

Explore

Product lines & groups

Gross profit margins vs Net sales

Product lines

Plot

Interactive

Product groups

Scatter

Interactive

Sunburst

Customer analysis

Gross profit segmentation

Treemap

Table

Lolipop

Gross profit opportunity

Data

Raw opportunity

Raw margins

Treemap

Table

Lolipop

Analysis

Priority by product line

Treemap

Coatings & Additives

Cold Box / Epoxy

Filters

Metallurgy

No Bake / Hot Box

Release Agents & Auxiliaries

Risers

Lolipop

Coatings & Additives

Cold Box / Epoxy

Filters

Metallurgy

No Bake / Hot Box

Release Agents & Auxiliaries

Risers

Scaling product lines

Visualize scaling

Unscaled

Scaled

Clustering

K = 5

vs Gross profit margin

vs Gross profit

Clustered customers

Lolipop

Datatable

Gold-Silver-Bronze

Score distribution

Visualize ranked customers

vs Gross profit margin

vs Gross profit

Normalizing product lines

Visualize normalization

Scatter

Raw

Scaled

Normalized

Density

Raw

Scaled

Normalized

Clustering

Distribution

Clustered customers

Lolipop

Datatable