Financial and customer data has been modified to anonymize.
The overall goal is to create a customer tier list in which the lowest tier customers have their sales visits limited or cut off at a certain point. Similarly, we prioritize high tier customers. The business case appears to be more efficient utilization of resources in order to score the biggest profit for the company.
During the cleaning of the 3 different data files we are able to summarize what is actually in them.
The sales data is essentially, sales data. We appear to have one transaction per row with data including customer sold to, product line, gross profit, etc. This seems immediately useful for calculations regarding net sales, profits, and margins across all product lines.
If we want to rate customers it might be beneficial to group the sales data by customer and then take the mean or median margin, sales, or combinations thereof to produce some sort of ranking metric.
## Observations: 9,009
## Variables: 15
## $ date <dttm> 2019-04-01, 2019-04-01, 2019-04-01, 2019-04-...
## $ customer_num_sold_to <fct> 304707, 305370, 305370, 330489, 330489, 33048...
## $ customer_num_ship_to <fct> NA, NA, NA, 330489, 330489, 330489, 330489, 3...
## $ customer_name_sold_to <fct> Wabtec Foundry Limited, Esco Ltd, Esco Ltd, W...
## $ customer_name_ship_to <fct> Not Assigned, Not Assigned, Not Assigned, Wir...
## $ sales_office <fct> NA, NA, NA, 2306, 2306, 2306, 2306, 2306, 230...
## $ region <fct> Not Assigned, Not Assigned, Not Assigned, Mid...
## $ local_employee <fct> Not Assigned, Not Assigned, Not Assigned, Mik...
## $ ext_matl_group <fct> 0030, 0001, 0041, 0010, 0025, 0027, 0028, 002...
## $ product_line <fct> Filters, Cold Box / Epoxy, Release Agents & A...
## $ product_group <fct> Others - Cellular Filters, Amine Curing Pu Sy...
## $ net_sales <dbl> 0, 0, 0, 8434, 1184, 1523, 1600, 0, 458, 0, 1...
## $ gross_profit <dbl> -624, -2259, -151, 2644, 602, 608, 675, -18, ...
## $ gross_margin <dbl> 0, 0, 0, 4220, 980, 608, 675, -18, 244, 0, 92...
## $ freight_customer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
The data appears to be a summary of opportunity on a per customer level. We appear to have a lot of junk
data such as the gross_profit_pct
which has values all equal to zero, with 26 NA
values. There are more variables like this that seem, perhaps, a bit contrived in what they are trying to
represent.
There are quite a few values here that should be automatically generated but appear to be missing. There might be an issue with this that needs resolved further up if we actually need the data.
## Observations: 1,974
## Variables: 45
## $ accounts <fct> a b Foundry, a b Foundry Machining, a l Johns...
## $ representative <fct> Fitt Wayne, Thomas Keith, Fitt Wayne, Fitt Wa...
## $ region <fct> Sm n Am Distributors, Central Region, Sm n Am...
## $ sales_ytd <dbl> 303, 0, 0, 0, 0, 0, 0, 10080, 2008, 0, 0, 441...
## $ sales_opp <dbl> 0, 12960, 0, 0, 233500, 0, 51695, 2673, 0, 26...
## $ gross_profit_pct <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ blue_silver <fct> Silver, Silver, Silver, Silver, Silver, Silve...
## $ alert_msg_0 <fct> NA, NA, Missing Segmentation Data, Missing Se...
## $ blue_silver_score <dbl> 2, 1, 0, 0, 1, 0, 1, 3, 2, 1, 0, 5, 5, 7, 5, ...
## $ actual_score <dbl> 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 3, 2, 3, 2, ...
## $ opp_score <dbl> 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 3, 3, 3, ...
## $ profit_score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, ...
## $ num_visits <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 13, 7, 3,...
## $ pct_total <dbl> 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ growth_type <fct> Average Account, Average Account, Average Acc...
## $ account_type <fct> Average Account, Average Account, Average Acc...
## $ last_visit <ord> Jan, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sales <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 9, 5, 2, ...
## $ cat <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, ...
## $ tpm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, ...
## $ blm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ alert_msg_1 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ growth_project <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ gp_type_0 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, C...
## $ opp_rank_0 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, M...
## $ growth_project_amt_0 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1...
## $ gp_type_1 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ opp_rank_1 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ growth_project_amt_1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ alert_msg_2 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, M...
## $ a <fct> $0, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA, F...
## $ b <fct> $0, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA, R...
## $ c <fct> $0, Ra, NA, NA, Ct, NA, Ra, Cb, NA, NA, NA, N...
## $ d <fct> $0, NA, NA, NA, Ct, NA, Cb, Cb, NA, NA, NA, N...
## $ e <fct> $0, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, N...
## $ f <fct> $0, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, N...
## $ g <fct> $0, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, N...
## $ h <fct> $0, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, N...
## $ i <fct> $0, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, N...
## $ business_line_a <fct> 0.00, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA,...
## $ business_line_a_value <dbl> 0, 9990, NA, NA, 168500, NA, 38345, 2664, NA,...
## $ business_line_b <fct> 0.00, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA,...
## $ business_line_b_value <dbl> 0, 1980, NA, NA, 25000, NA, 11494, 1, NA, 150...
## $ business_line_c <fct> 0.00, Ra, NA, NA, Ct, NA, Ra, Cb, NA, NA, NA,...
## $ business_line_c_value <dbl> 0, 990, NA, NA, 20000, NA, 1850, 1, NA, NA, N...
The profit and loss opportunity data seems to be similar to the last visits data columns. The goal appears to be an attempt to log and rank opporunities from each business line.
## Observations: 1,974
## Variables: 30
## $ account <fct> a b Foundry, a b Foundry Machining, a l Johnson Co Gc In...
## $ CB <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 800000, NA...
## $ NB <dbl> NA, 9990, NA, NA, 168500, NA, 11494, NA, NA, NA, NA, NA,...
## $ CT <dbl> NA, 1980, NA, NA, 20000, NA, 1, 1, NA, NA, NA, NA, 10080...
## $ AD <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 249000, 22...
## $ RA <dbl> NA, 990, NA, NA, 20000, NA, 1850, 2664, NA, 1500, NA, NA...
## $ IN <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA...
## $ FT <dbl> NA, NA, NA, NA, NA, NA, 38345, NA, NA, NA, NA, 100000, N...
## $ RS <dbl> NA, NA, NA, NA, 25000, NA, 1, 1, NA, 25000, NA, 92500, 4...
## $ MT <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 15000, 260...
## $ opp_a <fct> NA, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA, Ft, Ct, Ft, ...
## $ opp_b <fct> NA, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA, Rs, Cb, Mt, ...
## $ bl_a <fct> NA, Nb, NA, NA, Nb, NA, Ft, Ra, NA, Rs, NA, Ft, Ct, Ft, ...
## $ bl_a_value <dbl> NA, 9990, NA, NA, 168500, NA, 38345, 2664, NA, 25000, NA...
## $ bl_b <fct> NA, Ct, NA, NA, Rs, NA, Nb, Cb, NA, Ra, NA, Rs, Cb, Mt, ...
## $ bl_b_value <dbl> NA, 1980, NA, NA, 25000, NA, 11494, 1, NA, 1500, NA, 925...
## $ bl_c <fct> NA, Ra, NA, NA, Ct, NA, Ra, Cb, NA, NA, NA, NA, Rs, Rs, ...
## $ bl_c_value <dbl> NA, 990, NA, NA, 20000, NA, 1850, 1, NA, NA, NA, NA, 466...
## $ bl_d <fct> NA, NA, NA, NA, Ct, NA, Cb, Cb, NA, NA, NA, NA, Ad, Ad, ...
## $ bl_d_value <dbl> NA, NA, NA, NA, 20000, NA, 1, 1, NA, NA, NA, NA, 249000,...
## $ bl_e <fct> NA, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, NA, Mt, Ct, ...
## $ bl_e_value <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, 15000, 160...
## $ bl_f <fct> NA, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, NA, NA, NA, ...
## $ bl_f_value <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, 25...
## $ bl_g <fct> NA, NA, NA, NA, NA, NA, Cb, Cb, NA, NA, NA, NA, NA, NA, ...
## $ bl_g_value <dbl> NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, 39...
## $ bl_h <fct> NA, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, NA, NA, NA, ...
## $ bl_h_value <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N...
## $ bl_i <fct> NA, NA, NA, NA, NA, NA, Cb, NA, NA, NA, NA, NA, NA, NA, ...
## $ bl_i_value <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N...
Foreward
It’s clear that certain product lines pull in more net sales while others pull in more profits or margins.
When looking even more closely at individual product lines we see that for each product line, there is
generally one very significant product group that shifts the numbers for that entire line. Amine
curing PU systems, for example, pulls the weight for the Cold Box product line.
If we look at the other products in the Cold Box product line–Isomax, \(SO_2\) Curing Systems, Amine Catalysts, and Methylformiat Curing
Systems–all of their net sales combined do not equal those of the Amine curing PU
systems alone.
What this ultimately means is that calculations used to determine average gross profit margin % on
a per product line basis cannot take into account the top movers within each product line. Where this comes
into play is when combining average gross margins with the sales opportunity data. We cannot be
sure if, for example, the $9,000 potential in sales for the Cold Box / Epoxy group comes
from the Amine curing PU systems group–32% margin–or the Amine Catalysts
group–19.9% margin.
Therefore the exploratory analysis below reaches some useful conclusions but with some caveats.
Summary
General exploration of the data by product line shows Cold box and No bake / Hot box as top movers in terms of net sales. Filters and Coatings & additives also appear to be high volume movers but the generally lower prices on a per item basis are likely what pull these back towards the lower end of the net sales spectrum. The mean gross profit margin is 26%, with a median of around 28%.
When we look more closely at our product lines, expanding them into their separate product groups, the picture becomes more clear. While Cold box appeared to be the leading group, more specifically, it is the Amine curing PU systems that pull the numbers up. We see similar stories with Pep set pulling up the No bake group, etc. This point is further illustrated via the below histograms.
What this may mean overall, is that when making critical decisions we need to look more specifically at product groups rather than broadly at product lines. When attempting to rank customers, we will also need to find a balance between sales volumes and margins.
Very nice to visualize the overall performance of all product lines. We can see our “bread and butter” higher net selling items far off the the right, right in the middle of our gross profit margin range. Metallurgy, risers, and coatings appear to be more commodity-like items, with release agents and specialty resins being more “designer” items, netting higher profit margins. The filters fall between both, likely due to the contrast between standard and 3d filters.
Important to note that values below zero were included. It appears that primarily the Coatings
& Additives group is affected by including these values. When excluded, their
gross_profit_margin
value increases to roughly 30%, from 16%.
##################################
# GPM vs NETSALES across PRODUCT LINES, size = TRANSACTION COUNT
line_counts <- df_sales %>%
dplyr::filter(product_line != "#N/A") %>%
group_by(product_line) %>%
add_tally() %>%
slice(1) %>%
dplyr::select(product_line, n)
product_line_df <- df_sales %>%
dplyr::filter(product_group != "#N/A") %>%
group_by(product_line) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_margin = sum(gross_margin),
gross_profit_margin = gross_profit / net_sales) %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
mutate_if(is.numeric, function(x) replace(x, x<0, 0))
product_line_df <- left_join(product_line_df,line_counts) %>%
mutate(p_line_counts = n) %>% select(-c(n)) %>%
mutate(gross_profit_margin=gross_profit_margin*100) %>%
set_colnames(c("Product Line", "Net Sales", "Gross Profit", "Gross Margin","Gross Profit Margin","Transaction Count"))
product_line_df %>%
ggplot(aes(x=`Net Sales`, y=`Gross Profit Margin`, label=`Product Line`))+
geom_point(aes(fill=`Transaction Count`, size=`Transaction Count`), shape=21,alpha=1)+
geom_label_repel(aes(label=`Product Line`),
alpha=.9,
force=70,
box.padding = .9,
seed=1002,
size=4,
segment.colour = 'grey40')+
theme_minimal()+
labs(title = "Product Lines - Gross profit margin vs Net sales",
caption = "Removed: 'Unassigned', 'Design services', 'Others'\nSize & color relative to Transaction count",
fill="Transaction count")+
scale_fill_viridis(option = "C")+
scale_y_continuous(name = "Gross profit margin (%)",
breaks=seq(0,100,5),
limits=c(15,40))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,50000000,5000000),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
guides(size=FALSE)
##################################
# GPM vs NETSALES across PRODUCT LINES, size = TRANSACTION COUNT
# PLOTLY VERSION
mycurrency <- function(x){
return(paste0("$",formatC(signif(x,3), format="fg",big.mark=",")))
}
gg_product_lines <- product_line_df %>%
mutate(`Gross Profit Margin` = round(`Gross Profit Margin`,2)) %>%
# ggplot(aes(x=`Net Sales`, y=`Gross Profit Margin`, label=`Product Line`))+
ggplot(aes(x=`Net Sales`, y=`Gross Profit Margin`, label=`Product Line`,
text = paste('Product Line: ' , `Product Line`,
'<br>Net Sales: ', mycurrency(`Net Sales`),
'<br>Gross Profit Margin: ', paste0(round(`Gross Profit Margin`,2),"%"),
'<br>Transaction Count: ' , `Transaction Count`
)))+
geom_point(aes(color=`Transaction Count`, size=`Transaction Count`))+
theme_minimal()+
labs(title = "Product Lines - Gross profit margin vs Net sales",
caption = "Removed: 'Unassigned', 'Design services', 'Others'\nSize relative to Transaction count",
fill="Transaction count")+
scale_color_viridis(option = "C")+
scale_y_continuous(name = "Gross profit margin (%)",
breaks=seq(0,100,5),
limits=c(15,40))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,50000000,5000000),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
guides(size=FALSE)
# ggplotly(gg_product_lines) %>%
ggplotly(gg_product_lines, tooltip = c("text")) %>%
layout(title = list(text = paste0('Product Lines - Gross profit margin vs Net sales',
'<br>',
'<sup>',
"Removed: 'Unassigned', 'Design services', 'Others'; Size relative to Transaction count",
'</sup>')),
legend=list(orientation="h",x=0,y=-.2))
Though the previous plot is not necessarily misleading, it tells a more thorough story when product lines are ungrouped into their separate product groups. It becomes more clear that although the Cold Box / Epoxy group leads in net sales, this is due specifically to the Amine Curing PU Systems product group. The same can be said of the Filters and No Bake/ Hot Box groups with Zr-Filters and Pep Set products, respectively.
##################################
# GPM vs NET SALES across PRODUCT GROUPS, size = transaction count, color = product lines
line_counts <- df_sales %>%
dplyr::filter(product_line != "#N/A") %>%
group_by(product_group) %>%
add_tally() %>%
slice(1) %>%
ungroup() %>%
dplyr::select(product_group, n)
gg_product_groups_df <- df_sales %>%
dplyr::filter(product_group != "#N/A") %>%
group_by(product_group,product_line) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_margin = sum(gross_margin),
gross_profit_margin = (gross_profit / net_sales)*100) %>%
arrange(-net_sales) %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
mutate_if(is.numeric, function(x) replace(x, x<0, 0)) %>%
dplyr::filter(gross_profit_margin < 100 & gross_profit_margin > 0 & net_sales > 0) %>%
left_join(line_counts)
gg_product_groups_df %>%
ggplot(aes(x=net_sales, y=gross_profit_margin, label=product_group))+
geom_point(aes(fill=product_line, size=n), shape=21)+
geom_label_repel(data=subset(gg_product_groups_df,
(net_sales > 3e6 & gross_profit_margin > 25) |
gross_profit_margin > 40),
aes(label=product_group),
alpha=.6,
force=90,
box.padding = .6,
seed=1003,
size=4,
segment.colour = 'grey40')+
theme_minimal()+
scale_y_continuous(name = "Gross profit margin (%)",
breaks=seq(-10,100,5))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,35e6, 5e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
scale_fill_brewer(palette = "Set1")+
guides(size=FALSE,
fill = guide_legend(override.aes = list(size=5)))+
labs(title = "Product Groups - Gross profit margin vs Net sales",
subtitle = "Size = transaction count",
caption = "Removed: 0%, 100% margins; net sales < 0",
fill= "Product line")
##################################
# GPM vs NET SALES across PRODUCT GROUPS, size = transaction count, color = product lines
# PLOTLY VERSION
gg_product_lines <- gg_product_groups_df %>%
ggplot(aes(x=net_sales, y=gross_profit_margin, label=product_group,
text = paste0('Product Group: ', paste0(gg_product_groups_df$product_group," (",gg_product_groups_df$product_line, ")"),
'<br>Net Sales: ', mycurrency(gg_product_groups_df$net_sales),
'<br>Gross Profit Margin: ', paste0(round(gg_product_groups_df$gross_profit_margin,2),"%"),
'<br>Transaction Count: ' , gg_product_groups_df$n)))+
geom_point(aes(color=product_line, size=n), alpha=.8)+
theme_minimal()+
scale_y_continuous(name = "Gross profit margin (%)",
breaks=seq(-10,100,5))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,35e6, 5e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
scale_color_brewer(palette = "Set1")+
guides(size=FALSE)+
# guides(size=FALSE,
# color = guide_legend(override.aes = list(size=5)))+
labs(title = "Product Groups - Gross profit margin vs Net sales",
caption = "Removed: 0%, 100% margins; net sales < 0",
color = "Product line")
# ggplotly(gg_product_lines)
ggplotly(gg_product_lines, tooltip = c("text")) %>%
layout(title = list(text = paste0('Product Lines - Gross profit margin vs Net sales',
'<br>',
'<sup>',
"Size = transaction count; Removed: 0%, 100% margins; net sales < 0",
'</sup>')))
Sunburst diagram representing product lines as sub-categories of product groups. Size is relative to total gross profits.
Product lines, by product group, representing total gross profit
df_sales %>%
dplyr::filter(gross_profit > 0) %>%
dplyr::select(product_line, product_group, gross_profit) %>%
mutate(product_line = gsub("[^[:alnum:] ]", "", product_line)) %>%
mutate(product_line = gsub(" ", " ", product_line)) %>%
mutate(product_group = gsub("[^[:alnum:] ]", "", product_group)) %>%
mutate(product_group = gsub(" ", " ", product_group)) %>%
group_by(product_line, product_group) %>%
summarise(gp=sum(gross_profit)) %>%
mutate(path = paste0(product_line, "-", product_group)) %>%
ungroup() %>%
dplyr::select(path, gp) %>%
sunburst(legend=FALSE)
By grouping customers into clusters representing roughly 10% of total gross profits we can glean a better understanding of where much of the profits come from. Each primary cluster in the treemap below represents 10% of total gross profits. The number on the top left of each cluster (1-9) represents a pseudo-ranking of customer “importance”, with importance representing percentage of total gross profits contributed.
For example, cluster 1
shows that Porter Warner Industries Llc represents the
first 10% of total gross profits. cluster 2
represents customers that make up 11-20% of
gross profits, and so on. If we list all the customers in primary clusters 1-7
, for
example, then we have a list of the fewest customers that make up 79% of total gross profits.
It’s an interesting thought experiment in deciding whether our rank 1: Porter Warner Industries Llc is a more “valuable” customer than, for example, our rank 2: Caterpillar Incorporated and Waupaca Foundry Inc customers. Although a larger share of profits originate with the larger customer, the risk is also doubled in the case of a loss. The rank 2 customers include a sort of built-in hedge against this while also bringing in significant value.
Note, gross profits and net sales below zero are included.
#########################################
# TREEMAP of customers
df_totals <- df_sales %>%
dplyr::filter(customer_name_sold_to != "not assigned") %>%
# dplyr::filter(gross_profit > 0 & net_sales > 0 & customer_name_sold_to != "not assigned") %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_margin = sum(gross_margin),
gross_profit_margin = gross_profit / net_sales)
gg_tree_df <- df_sales %>%
group_by(customer_name_sold_to) %>%
dplyr::filter(gross_profit > 0) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit)) %>%
arrange(-gross_profit) %>%
mutate(net_sales_total = df_totals$net_sales) %>%
mutate(net_sales_pct = net_sales/net_sales_total) %>%
mutate(gross_profit_total = df_totals$gross_profit) %>%
mutate(gross_profit_pct = gross_profit/gross_profit_total) %>%
mutate(cusum_gp = cumsum(gross_profit_pct)) %>%
mutate(rank = ifelse(cusum_gp>.9, 9,
ifelse(cusum_gp>.8, 8,
ifelse(cusum_gp>.7, 7,
ifelse(cusum_gp>.6, 6,
ifelse(cusum_gp>.5, 5,
ifelse(cusum_gp>.4, 4,
ifelse(cusum_gp>.3, 3,
ifelse(cusum_gp>.2, 2,
ifelse(cusum_gp>.1, 1, 0)))))))))) %>%
mutate(rank2 = as.factor(rank)) %>%
group_by(rank) %>%
mutate(customer_name_sold_to = toTitleCase(as.character(customer_name_sold_to))) %>%
add_tally()
tm <- treemap(
gg_tree_df,
index=c("rank2", "customer_name_sold_to"),
vSize="gross_profit_pct",
type="categorical",
vColor="rank2",
title="10% total gross profit clusters",
algorithm="pivotSize",
sortID = "rank",
palette="-RdYlGn",
fontsize.labels = c(16,10),
border.col=c("white", "grey60"),
lowerbound.cex.labels = 0,
bg.labels=220,
align.labels=list(
c("left", "top"),
c("center", "center")),
position.legend = "none"
)
The table of our ranked customer clusters shows a few things to us:
Num customers in rank
: the number of customers allocated to each assigned ranking
group.Cum num customers in rank
: a cumulative version of the Num customers in
rank
column.Pct of total customers in rank
: the percentage of customers in that rank out of the
total amount of customers.Pct total GP in rank
: the percentage of gross profit in that rank out of the total
gross profits.Cum pct of total customers
: a cumulative version of the Pct of total customers
in rank
column.Cum pct total GP in rank
: a cumulative version of the Pct total GP in
rank
column.One of the more interesting bits of information we can gather from this table comes from the last two, cumulative columns. We see a pseudo-Pareto-like pattern. Row 8, or rank 8, shows that the top 11.1% of customers account for 79.4% of gross profits.
####################################
# DF of CUSTOMER RANK INFO
gg_tree_df_gp <- gg_tree_df %>%
group_by(rank) %>%
summarise(pct_gp_in_rank = sum(gross_profit)/sum(gg_tree_df$gross_profit)) %>%
mutate(cum_pct_gp_in_rank = cumsum(pct_gp_in_rank))
gg_tree_df_gp <- gg_tree_df %>%
group_by(rank) %>% slice(1) %>% ungroup() %>%
mutate(cum_num_customers_in_rank = cumsum(n),
pct_total_customers_in_rank = n/nrow(gg_tree_df),
cum_pct_total_customers_in_rank = cumsum(pct_total_customers_in_rank)) %>%
left_join(gg_tree_df_gp) %>%
dplyr::select(rank,
n,
cum_num_customers_in_rank,
pct_total_customers_in_rank,
pct_gp_in_rank,
cum_pct_total_customers_in_rank,
cum_pct_gp_in_rank) %>%
set_colnames(c("Rank",
"Num customers in rank",
"Cum num customers in rank",
"Pct of total customers in rank",
"Pct total GrossProfit in rank",
"Cum pct of total customers",
"Cum pct total GrossProfit in rank"))
datatable(gg_tree_df_gp,
options = list(
autoWidth = TRUE,
dom = 't'
# dom = 'tip'
),
class = 'cell-border stripe',
caption = "Gross profit opportunity of ranked customers",
rownames = FALSE) %>%
formatPercentage(c(names(gg_tree_df_gp[-c(1,2,3)])), 1) %>%
formatStyle(names(gg_tree_df_gp[2]),
background = styleColorBar(range(gg_tree_df_gp[2]), 'bisque'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_tree_df_gp[3]),
background = styleColorBar(range(gg_tree_df_gp[3]), 'cornsilk'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_tree_df_gp[4]),
background = styleColorBar(range(gg_tree_df_gp[4]), 'lightblue'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_tree_df_gp[5]),
background = styleColorBar(range(gg_tree_df_gp[5]), 'lightsalmon'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_tree_df_gp[6]),
background = styleColorBar(range(gg_tree_df_gp[6]), 'lightblue'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_tree_df_gp[7]),
background = styleColorBar(range(gg_tree_df_gp[7]), 'lightsalmon'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')
Here we can look at a sorted list of customers that essentially make up to 79% of gross profit. The list consists of 53 total customers.
#############################
# LOLIPOP OF SORTED GP pct CUSTOMERS
df_lolipop <- gg_tree_df %>%
dplyr::select(customer_name_sold_to, gross_profit, gross_profit_pct, rank, net_sales) %>%
arrange(-gross_profit_pct) %>%
ungroup() %>%
mutate(Rank = as.factor(rank)) %>%
mutate(customer_name_sold_to = toTitleCase(as.character(customer_name_sold_to))) %>%
slice(1:53)
gg_lolipop <- df_lolipop %>%
ggplot(aes(x=reorder(customer_name_sold_to,gross_profit_pct),
y=gross_profit_pct,
text=paste("Customer: ",customer_name_sold_to,
"<br>Rank: ", Rank,
"<br>Gross profit: ", mycurrency(gross_profit),
"<br>Gross profit % of total:", paste0(round( (signif(gross_profit_pct,3)*100), 3), "%"),
"<br>Net sales: ", mycurrency(net_sales))))+
geom_point(aes(fill=Rank), size=3, stroke=.2, shape=21)+
geom_segment(aes(x=customer_name_sold_to,
xend=customer_name_sold_to,
y=0,
yend=gross_profit_pct),
alpha=.4,
color="grey40")+
theme(panel.grid.major.y = element_blank())+
coord_flip()+
theme_minimal()+
scale_y_continuous(name = "Gross profit (%)",
labels = function(x) paste0(x * 100, '%'),
breaks=seq(0,.15,.01))+
xlab("Customer")+
# scale_color_brewer(palette="Dark2")+
scale_fill_brewer(palette="RdYlGn",direction=-1)+
labs(title = "Customers vs gross profit contribution",
caption= "Top 53 customers accounting for ~79% gross profit")
ggplotly(gg_lolipop, tooltip = c("text")) %>%
layout(title = list(text = paste0('Customers vs gross profit contribution',
'<br>',
'<sup>',
"Top 53 customers accounting for ~79% gross profit",
'</sup>')))
Though important to know where current profits originate, and where the top percentage of company profits come from, if we don’t look at opportunity then we are ignoring the potential for growth.
The sales opportunity dataset provides sales opportunity estimates not only for each prospective foundry, but for each product line. We know that each product line has quite different margins thanks to our analysis above. Now, we can leverage this knowledge and provide accurate estimates of potential gross profits.
In other words, we will use our average calculated gross margin percent for each product line, multiply this by the sales opportunity figure, resulting in a better representation of potential gross profits.
Though not shown in completion, the below data is the backbone for the new gross profit opportunity calculation. The raw opportunity data for each customer, for each product line, is used in conjuction with the margin data calculated above. This new product line gross profit opportunity metric is then added together for each customer and presented below.
The data is first reshaped into long format.
df_opps <- df_ploop %>%
left_join(df_visits, by=c('account'= 'accounts')) %>%
distinct() %>%
dplyr::filter(sales_opp > 0) %>%
dplyr::select(colnames(df_ploop[1:10])) %>%
mutate_if(is.numeric, funs(replace_na(., 0))) %>%
mutate(CAA = CT+AD) %>%
dplyr::select(-c(CT,AD)) %>%
set_colnames(c("account",
"Cold Box / Epoxy",
"No Bake / Hot Box",
"Release Agents & Auxiliaries",
"Inorganics",
"Filters",
"Risers",
"Metallurgy",
"Coatings & Additives")) %>%
pivot_longer(cols=-account,
names_to="product_line",
values_to="sales_opp") %>%
mutate_if(is.character, as.factor)
datatable(df_opps %>%
set_colnames(c("Account","Product Line", "Sales Opportunity")),
options = list(
autoWidth = TRUE,
# dom = 't'
dom = 'ftip'
),
class = 'cell-border stripe',
caption = "Gross profit opportunity of ranked customers",
rownames = FALSE) %>%
formatCurrency(c("Sales Opportunity"), "$")
Here we have the raw margin data from calculations above. The Inorganics margin was imputed using the mean of all other lines, since the data was missing.
df_product_line_sales <- df_sales %>%
dplyr::filter(product_group != "#n/a") %>%
dplyr::filter(product_line %in% levels(df_opps$product_line)) %>%
group_by(product_line) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
# gross_margin = sum(gross_margin),
gross_profit_margin = gross_profit / net_sales) %>%
ungroup() %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
mutate_if(is.numeric, function(x) replace(x, x<0, 0)) %>%
mutate(net_sales_pct = net_sales/sum(net_sales),
gross_profit_pct = gross_profit/sum(gross_profit)) %>%
mutate(gross_profit_margin = ifelse(is.na(gross_profit_margin),
mean(gross_profit_margin,na.rm = TRUE), gross_profit_margin))
datatable(df_product_line_sales %>%
dplyr::select(product_line, gross_profit_margin) %>%
set_colnames(c("Product Line", "Gross Profit Margin %")),
options = list(
autoWidth = TRUE,
dom = 't'
# dom = 'tip'
),
class = 'cell-border stripe',
caption = "Calculated gross profit margins for each product line",
rownames = FALSE) %>%
formatPercentage(c("Gross Profit Margin %"), 2)
################################
# TOP CUSTOMERS basedon potential profit
df_opps <- df_ploop %>%
left_join(df_visits, by=c('account'= 'accounts')) %>%
distinct() %>%
dplyr::filter(sales_opp > 0) %>%
dplyr::select(colnames(df_ploop[1:10])) %>%
mutate_if(is.numeric, funs(replace_na(., 0))) %>%
mutate(CAA = CT+AD) %>%
dplyr::select(-c(CT,AD)) %>%
set_colnames(c("account",
"Cold Box / Epoxy",
"No Bake / Hot Box",
"Release Agents & Auxiliaries",
"Inorganics",
"Filters",
"Risers",
"Metallurgy",
"Coatings & Additives")) %>%
pivot_longer(cols=-account,
names_to="product_line",
values_to="sales_opp") %>%
mutate_if(is.character, as.factor)
df_product_line_sales <- df_sales %>%
dplyr::filter(product_group != "#n/a") %>%
dplyr::filter(product_line %in% levels(df_opps$product_line)) %>%
group_by(product_line) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
# gross_margin = sum(gross_margin),
gross_profit_margin = gross_profit / net_sales) %>%
ungroup() %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
mutate_if(is.numeric, function(x) replace(x, x<0, 0)) %>%
mutate(net_sales_pct = net_sales/sum(net_sales),
gross_profit_pct = gross_profit/sum(gross_profit)) %>%
mutate(gross_profit_margin = ifelse(is.na(gross_profit_margin),
mean(gross_profit_margin,na.rm = TRUE), gross_profit_margin))
df_all_opps <- df_opps %>%
left_join(df_product_line_sales %>% dplyr::select(product_line, gross_profit_margin)) %>%
mutate(gross_profit_opp = sales_opp*gross_profit_margin) %>%
dplyr::select(-c(gross_profit_margin)) %>%
dplyr::filter(sales_opp != 0) %>%
group_by(account) %>%
dplyr::summarise(sales_opp = sum(sales_opp),
gross_profit_opp = sum(gross_profit_opp)) %>%
left_join(df_visits %>% dplyr::select(accounts, num_visits), by=c('account' = 'accounts')) %>%
arrange(-gross_profit_opp) %>%
distinct() %>%
mutate(gross_profit_opp_pct = gross_profit_opp/sum(gross_profit_opp)) %>%
mutate(gross_profit_opp_cusum_pct = cumsum(gross_profit_opp_pct)) %>%
mutate(rank = ifelse(gross_profit_opp_cusum_pct>.9, 9,
ifelse(gross_profit_opp_cusum_pct>.8, 8,
ifelse(gross_profit_opp_cusum_pct>.7, 7,
ifelse(gross_profit_opp_cusum_pct>.6, 6,
ifelse(gross_profit_opp_cusum_pct>.5, 5,
ifelse(gross_profit_opp_cusum_pct>.4, 4,
ifelse(gross_profit_opp_cusum_pct>.3, 3,
ifelse(gross_profit_opp_cusum_pct>.2, 2,
ifelse(gross_profit_opp_cusum_pct>.1, 1, 0)))))))))) %>%
mutate(rank2 = as.factor(rank)) %>%
group_by(rank)
tm <- treemap(
df_all_opps,
index=c("rank", "account"),
vSize="gross_profit_opp",
type="categorical",
vColor="rank2",
title="10% total potential gross profit clusters",
algorithm="pivotSize",
sortID = "rank",
palette="-RdYlGn",
fontsize.labels = c(16,10),
border.col=c("white", "grey60"),
lowerbound.cex.labels = 0,
bg.labels=220,
align.labels=list(
c("left", "top"),
c("center", "center")),
position.legend = "none"
)
df_all_opps_gp <- df_all_opps %>%
group_by(rank) %>%
summarise(pct_gp_opp_in_rank = sum(gross_profit_opp)/sum(df_all_opps$gross_profit_opp)) %>%
mutate(cum_pct_gp_opp_in_rank = cumsum(pct_gp_opp_in_rank))
df_all_opps_gp <- df_all_opps %>%
group_by(rank) %>%
add_tally() %>% slice(1) %>% ungroup() %>%
mutate(cum_num_customers_in_rank = cumsum(n),
pct_total_customers_in_rank = n/nrow(df_all_opps),
cum_pct_total_customers_in_rank = cumsum(pct_total_customers_in_rank)) %>%
left_join(df_all_opps_gp) %>%
dplyr::select(rank,
n,
cum_num_customers_in_rank,
pct_total_customers_in_rank,
pct_gp_opp_in_rank,
cum_pct_total_customers_in_rank,
cum_pct_gp_opp_in_rank) %>%
set_colnames(c("Rank",
"Num customers in rank",
"Cum num customers in rank",
"Pct of total customers in rank",
"Pct total GP opportunity in rank",
"Cum pct of total customers",
"Cum pct total GP opportunity in rank"))
datatable(df_all_opps_gp,
options = list(
autoWidth = TRUE,
dom = 't'
# dom = 'tip'
),
class = 'cell-border stripe',
caption = "Gross profit opportunity of ranked customers",
rownames = FALSE) %>%
formatPercentage(c(names(df_all_opps_gp[-c(1,2,3)])), 1) %>%
formatStyle(names(df_all_opps_gp[2]),
background = styleColorBar(range(df_all_opps_gp[2]), 'bisque'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_all_opps_gp[3]),
background = styleColorBar(range(df_all_opps_gp[3]), 'cornsilk'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_all_opps_gp[4]),
background = styleColorBar(range(df_all_opps_gp[4]), 'lightblue'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_all_opps_gp[5]),
background = styleColorBar(range(df_all_opps_gp[5]), 'lightsalmon'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_all_opps_gp[6]),
background = styleColorBar(range(df_all_opps_gp[6]), 'lightblue'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_all_opps_gp[7]),
background = styleColorBar(range(df_all_opps_gp[7]), 'lightsalmon'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')
GP opp % all seems very low, double check calculations. Total gross profit potential?
#################################
# GP OPP LOLIPOP
df_opp_lolipop <- df_all_opps %>%
dplyr::select(account, gross_profit_opp, gross_profit_opp_pct, rank, sales_opp) %>%
arrange(-gross_profit_opp_pct) %>%
ungroup() %>%
mutate(Rank = as.factor(rank)) %>%
slice(1:67)
df_opp_short_names <- df_opp_lolipop %>%
mutate(account = as.character(account)) %>%
mutate(nchar = nchar(account)) %>%
dplyr::select(account, nchar) %>%
arrange(-nchar) %>%
mutate(account_short = substring(account,1,35)) %>%
dplyr::select(account, account_short)
# mutate_all(as.factor)
df_opp_lolipop <- df_opp_lolipop %>%
left_join(df_opp_short_names)
gg_df_opp_lolipop <- df_opp_lolipop %>%
ggplot(aes(x=reorder(account_short ,gross_profit_opp_pct),
y=gross_profit_opp_pct,
text=paste("Customer: ",account_short,
"<br>Rank: ", Rank,
"<br>Gross profit opportunity: ", mycurrency(gross_profit_opp),
"<br>Gross profit opportunity % of total:", paste0(round( (signif(gross_profit_opp_pct,3)*100), 3), "%"),
"<br>Net sales opportunity: ", mycurrency(sales_opp))))+
geom_point(aes(fill=Rank), size=3, stroke=.2, shape=21)+
geom_segment(aes(x=account_short,
xend=account_short,
y=0,
yend=gross_profit_opp_pct),
alpha=.4,
color="grey40")+
theme(panel.grid.major.y = element_blank())+
coord_flip()+
theme_minimal()+
scale_y_continuous(name = "Gross profit opportunity (%)",
labels = function(x) paste0(x * 100, '%'),
breaks=seq(0,.25,.02))+
xlab("Customer")+
# scale_color_brewer(palette="Dark2")+
scale_fill_brewer(palette="RdYlGn",direction=-1)+
labs(title = "Customers vs gross profit opportunity",
caption= "Top 67 customers accounting for ~79% gross profit opportunity")
ggplotly(gg_df_opp_lolipop, tooltip = c("text")) %>%
layout(title = list(text = paste0('Customers vs gross profit opportunity',
'<br>',
'<sup>',
"Top 67 customers accounting for ~79% gross profit opportunity",
'</sup>')))
We’ve performed a bit of analysis above, grouping current customers into categories of most and least profitable. Similarly, we’ve classified potential customers based on potential gross profit. At this point we should restate the goal of this endeavor.
Classify customer accounts (premium vs non-premium) and determine which accounts will continue to receive complimentary support services with no additional charge.
Also of interest is ranking customers on a scale that assists with prioritization of visits and/or attention. This seems straightforward and easily accomplished by simply looking at total gross profits on a per customer basis, however, because certain product lines have dramatically higher net sales than others, using a gross profit or net sales metric would unfairly select customer priority based on a single product line. In other words, we need a way to standardize or normalize the effects.
To emphasize the point, we can show the clustered treemap visualization for each product line, along with the sorted lolipop diagram. It becomes apparent that not all customers carry the same importance when segregating by product line.
df_sales_customer <- df_sales %>%
dplyr::filter(product_line %in% levels(df_opps$product_line)) %>%
group_by(customer_name_sold_to, product_line) %>%
summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_profit_margin = gross_profit/net_sales) %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA))
get_pl_treemap <- function(PL = "Coatings & Additives"){
a <- df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
dplyr::filter(product_line == PL) %>%
arrange(-gross_profit) %>%
ungroup() %>%
mutate(gross_profit_pct = gross_profit/sum(gross_profit),
gross_profit_cum_pct = cumsum(gross_profit_pct)) %>%
mutate(rank = ifelse(gross_profit_cum_pct>.9, 9,
ifelse(gross_profit_cum_pct>.8, 8,
ifelse(gross_profit_cum_pct>.7, 7,
ifelse(gross_profit_cum_pct>.6, 6,
ifelse(gross_profit_cum_pct>.5, 5,
ifelse(gross_profit_cum_pct>.4, 4,
ifelse(gross_profit_cum_pct>.3, 3,
ifelse(gross_profit_cum_pct>.2, 2,
ifelse(gross_profit_cum_pct>.1, 1, 0)))))))))) %>%
mutate(rank2 = as.factor(rank))
return(
treemap(
a,
index=c("rank2", "customer_name_sold_to"),
vSize="gross_profit_pct",
type="categorical",
vColor="rank2",
title=paste0(PL, " - 10% total gross profit clusters"),
algorithm="pivotSize",
sortID = "rank",
palette="-RdYlGn",
fontsize.labels = c(16,10),
border.col=c("white", "grey60"),
lowerbound.cex.labels = 0,
bg.labels=220,
align.labels=list(
c("left", "top"),
c("center", "center")),
position.legend = "none"
)
)
}
get_pl_treemap()
get_pl_lolipop <- function(PL = "Coatings & Additives"){
a <- df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
dplyr::filter(product_line == PL) %>%
arrange(-gross_profit) %>%
ungroup() %>%
mutate(gross_profit_pct = gross_profit/sum(gross_profit),
gross_profit_cum_pct = cumsum(gross_profit_pct)) %>%
mutate(rank = ifelse(gross_profit_cum_pct>.9, 9,
ifelse(gross_profit_cum_pct>.8, 8,
ifelse(gross_profit_cum_pct>.7, 7,
ifelse(gross_profit_cum_pct>.6, 6,
ifelse(gross_profit_cum_pct>.5, 5,
ifelse(gross_profit_cum_pct>.4, 4,
ifelse(gross_profit_cum_pct>.3, 3,
ifelse(gross_profit_cum_pct>.2, 2,
ifelse(gross_profit_cum_pct>.1, 1, 0)))))))))) %>%
ungroup() %>%
mutate(Rank = as.factor(rank)) %>%
dplyr::filter(rank < 8)
top_cust <- nrow(a)
gp_scale <- ceiling(round(max(a$gross_profit_pct),3)*100)/100
gp_scale_breaks <- floor(gp_scale/10*100)/100
title <- paste0(PL, " - Customers vs gross profit contribution")
caption <- paste0("Top ", top_cust, " customers accounting for ~80% gross profit")
b <- a %>%
ggplot(aes(x=reorder(customer_name_sold_to,gross_profit_pct),
y=gross_profit_pct,
text=paste("Customer: ",customer_name_sold_to,
"<br>Rank: ", Rank,
"<br>Gross profit: ", mycurrency(gross_profit),
"<br>Gross profit % of total:", paste0(round( (signif(gross_profit_pct,3)*100), 3), "%"),
"<br>Net sales: ", mycurrency(net_sales))))+
geom_point(aes(fill=Rank), size=3, stroke=.2, shape=21)+
geom_segment(aes(x=customer_name_sold_to,
xend=customer_name_sold_to,
y=0,
yend=gross_profit_pct),
alpha=.4,
color="grey40")+
theme(panel.grid.major.y = element_blank())+
coord_flip()+
theme_minimal()+
scale_y_continuous(name = "Gross profit (%)",
labels = function(x) paste0(x * 100, '%'),
breaks=seq(0,gp_scale,gp_scale_breaks))+
xlab("Customer")+
# scale_color_brewer(palette="Dark2")+
scale_fill_brewer(palette="RdYlGn",direction=-1)+
labs(title = title,
caption= caption)
ggplotly(b, tooltip = c("text")) %>%
layout(title = list(text = paste0(title,
'<br>',
'<sup>',
caption,
'</sup>')))
}
get_pl_lolipop()
What normalization or scaling allows is a conversion of values to a range of 0 and 1. If we segregate our product lines and scale all values, we would be left with relative measures that would allow for more fair comparison of customer value between groups.
So instead of comparing a Cold Box customer with $2M
gross profit to a Riser customer with
$0.75M
gross profit, we scale both of these maximum values to 1
, giving them
equal importance.
Looking at the data below we can visualize the effects first hand. The minimum and maximum values within
each product line are converted to 0
and 1
, respectively, and all values between
are mapped to a relative value between.
df_sales_customer <- df_sales %>%
dplyr::filter(product_line %in% levels(df_opps$product_line)) %>%
group_by(customer_name_sold_to, product_line) %>%
summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_profit_margin = gross_profit/net_sales) %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA))
df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
ggplot(aes(x=net_sales, y=gross_profit_margin))+
ggpointdensity::geom_pointdensity()+
facet_wrap(~product_line)+
theme_minimal()+
scale_color_viridis(option="C")+
labs(title = "Product Lines - Gross profit margin vs Net sales",
caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
color = "N-neighbors")+
theme(legend.position = "none")+
scale_y_continuous(name = "Gross profit margin (%)",
breaks=seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,10e6,2e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
group_by(product_line) %>%
mutate_if(is.numeric, rescale) %>%
ggplot(aes(x=net_sales, y=gross_profit_margin))+
ggpointdensity::geom_pointdensity()+
facet_wrap(~product_line)+
theme_minimal()+
scale_color_viridis(option="C")+
labs(title = "Product Lines - Scaled - Gross profit margin vs Net sales",
caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
color = "N-neighbors")+
theme(legend.position = "none")+
scale_y_continuous(name = "Gross profit margin (Scaled)",
breaks=seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))+
scale_x_continuous(name = "Net sales (Scaled)",
breaks = seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))
Working with our scaled data we can assign scores to each customer by clustering them into groups. The
overall goal is to work within each product line, clustering customers into 5 different
Scores
, with the highest Score
corresponding to the most profitable customers
(i.e. largest gross profit value), and a score of 1 to the least profitable. The Score
can
then be tallied across all customers giving a relative total ranking.
For example. If Customer A is the top customer in Cold Box, only, they are assigned a rank of 5. If Customer B receives a rank of 3 on both Metallurgy and Filters, their score would then be a 6. Customer B would therefore have a higher score.
We group into 5 clusters (k=5
), visualized in the below plot.
Plotting gross profit margin vs net sales is how we’re used to visualizing so far, which we see in the first plot. Since the clusters were assigned using gross profit, it is slightly more straightforward to see how they are grouped in the alternative plot.
Typical plot we are used to seeing.
# scaled df
df_sales_customer_normal <- df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
group_by(product_line) %>%
mutate_if(is.numeric, rescale)
# cluster all in groups
df_sales_customer_normal %>%
group_by(product_line) %>%
mutate(cluster = (Ckmeans.1d.dp(gross_profit, k=5)$cluster)) %>%
ggplot(aes(x=net_sales, y=gross_profit_margin))+
geom_point(aes(fill=as.factor(cluster)), shape=21,color="grey50",size=2,alpha=.5)+
facet_wrap(~product_line)+
theme_few()+
scale_fill_manual(values=c("#ffccf7","#3e78b2","#dd6031","#baf2bd","#1e0002"))+
labs(title = "Product Lines - Scaled, Clustered - Gross profit margin vs Net sales",
caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
fill = "Clusters")+
# theme(legend.position = "bottom")+
scale_y_continuous(name = "Gross profit margin (Scaled)",
breaks=seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))+
scale_x_continuous(name = "Net sales (Scaled)",
breaks = seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))
Clusters are assigned based on gross profit, so slightly easier to visualize the clustering this way.
# cluster all in groups
df_sales_customer_normal %>%
group_by(product_line) %>%
mutate(cluster = (Ckmeans.1d.dp(gross_profit, k=5)$cluster)) %>%
ggplot(aes(x=net_sales, y=gross_profit))+
geom_point(aes(fill=as.factor(cluster)), shape=21,color="grey50",size=2,alpha=.5)+
facet_wrap(~product_line)+
theme_few()+
scale_fill_manual(values=c("#ffccf7","#3e78b2","#dd6031","#baf2bd","#1e0002"))+
labs(title = "Product Lines - Scaled, Clustered - Gross profit margin vs Net sales",
caption = "Removed product lines not represented in opportunity data\n Removed values below zero",
fill = "Clusters")+
# theme(legend.position = "bottom")+
scale_y_continuous(name = "Gross profit margin (Scaled)",
breaks=seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))+
scale_x_continuous(name = "Net sales (Scaled)",
breaks = seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))
df_scaled_scored_indiv <- df_sales_customer_normal %>%
mutate(cluster = (Ckmeans.1d.dp(gross_profit, k=5)$cluster)) %>%
dplyr::select(-c(net_sales, gross_profit, gross_profit_margin)) %>%
right_join(df_sales_customer) %>%
mutate(cluster = ifelse(is.na(cluster),0,cluster))
df_scaled_scored <- df_scaled_scored_indiv %>%
group_by(customer_name_sold_to) %>%
summarise(rank = sum(cluster)) %>%
right_join(df_scaled_scored_indiv) %>%
mutate(rank2 = as.factor(rank)) %>%
mutate(customer_short = substring(customer_name_sold_to,1,25)) %>%
arrange(-rank)
gg_df_scaled_scored <- df_scaled_scored %>%
arrange(-rank) %>%
slice(1:300) %>%
# spread(key=product_line, value=net_sales) %>%
ggplot(aes(x=reorder(customer_short ,rank),
y=rank,
text=paste("Customer: ",customer_name_sold_to,
"<br>Rank: ", rank)))+
geom_point(aes(fill=rank2), size=3, stroke=.2, shape=21)+
geom_segment(aes(x=customer_short,
xend=customer_short,
y=0,
yend=rank),
alpha=.4,
color="grey40")+
theme(panel.grid.major.y = element_blank())+
coord_flip()+
theme_minimal()+
theme(legend.position = "none")+
scale_y_continuous(name = "Rank",
breaks=seq(0,25,2))+
xlab("Customer")+
labs(title = "Customers vs Rank",
caption= "Rank calculated by scaling gross profit\nFollowed by univariate clustering, k=5")
ggplotly(gg_df_scaled_scored, tooltip = c("text")) %>%
layout(title = list(text = paste0('Customers vs Rank',
'<br>',
'<sup>',
"Rank calculated by scaling gross profit across product lines, followed by univariate clustering, k=5",
'</sup>')))
Pulling all sales data with the new, ranking variables we create a datatable summarizing everything
from cumulative or Customer Rank
, to separate ranks within each product line, or
Product Line Cluster
.
df_scaled_scored <- df_scaled_scored %>%
dplyr::select(customer_name_sold_to,
product_line,
rank,
net_sales,
gross_profit,
gross_profit_margin,
cluster) %>%
set_colnames(c("Customer",
"Product Line",
"Customer Rank",
"Net Sales",
"Gross Profit",
"Gross Profit Margin",
"Product Line Cluster"))
datatable(df_scaled_scored,
options = list(
autoWidth = TRUE
# dom = 't'
# dom = 'tip'
),
class = 'cell-border stripe',
caption = "Total gross profit of current customers",
rownames = FALSE) %>%
formatPercentage(c(names(df_scaled_scored[c(6)])), 1) %>%
formatCurrency(c(names(df_scaled_scored[c(4,5)])), "$") %>%
formatStyle(names(df_scaled_scored[3]),
background = styleColorBar(as.numeric(range(df_scaled_scored[3], na.rm=TRUE)), 'lightsalmon'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_scaled_scored[4]),
background = styleColorBar(range(df_scaled_scored[4]), 'lightblue'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_scaled_scored[5]),
background = styleColorBar(as.numeric(range(df_scaled_scored[5],na.rm = TRUE)), 'lightblue'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_scaled_scored[6]),
background = styleColorBar(as.numeric(range(df_scaled_scored[6],na.rm = TRUE)), 'bisque'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(df_scaled_scored[7]),
background = styleColorBar(as.numeric(range(df_scaled_scored[7],na.rm = TRUE)), 'lightsalmon'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')
Though we have nice rank numbers for each customer it does seem more complicated than assigning a much
simpler ranking of Gold/Silver/Bronze. If we look at the distribution of our ranks we find that the
majority of customers fall into Rank 1
. It makes sense to assign Rank 1
and
below as Bronze.
At the other end of the spectrum we find that only 5% of customers are at or above Rank 7
.
These would be good Gold candidates. Thus, putting customers that fall between
Ranks 2 - 6
into the Silver category.
We can assign new variables and plot the results to visualize our new classificiation system.
50% of customers fall into Rank 1
or below.
Only 5% of customers are Rank 7
or above.
The remaining 45% of customers of course fall between the above ranks.
ranks <- df_scaled_scored %>%
dplyr::select(Customer, `Customer Rank`) %>%
set_colnames(c("customer_name_sold_to", "rank"))
gg_ranked_customers <- df_sales %>%
group_by(customer_name_sold_to) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_profit_margin = gross_profit/net_sales) %>%
# slice(1) %>%
left_join(ranks) %>%
group_by(customer_name_sold_to) %>%
slice(1) %>%
dplyr::filter(net_sales > 0 & gross_profit_margin > 0 & gross_profit_margin < 1) %>%
mutate(gross_profit_margin = gross_profit_margin*100) %>%
mutate(rank = ifelse(is.na(rank),0,rank))
gg_ranks_dt <- gg_ranked_customers %>%
group_by(rank) %>%
add_tally() %>%
arrange(-rank) %>% ungroup() %>%
as_tibble() %>%
mutate(n_pct = n/nrow(gg_ranked_customers)) %>%
group_by(rank) %>% slice(1) %>% ungroup() %>%
mutate(n_pct_cusum = cumsum(n_pct)) %>%
dplyr::select(rank,
n,
n_pct,
n_pct_cusum) %>%
set_colnames(c("Rank",
"Num customers in rank",
"Pct customers in rank",
"Cusum pct customers in rank"))
datatable(gg_ranks_dt,
options = list(
autoWidth = TRUE,
pageLength = 25,
dom = 'ti'
# dom = 'tip'
),
fillContainer = FALSE,
class = 'cell-border stripe',
caption = "Distribution of ranks",
rownames = FALSE) %>%
formatPercentage(c(names(gg_ranks_dt[c(3,4)])), 1) %>%
formatStyle(names(gg_ranks_dt[2]),
background = styleColorBar(as.numeric(range(gg_ranks_dt[2], na.rm=TRUE)), 'lightsalmon'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_ranks_dt[3]),
background = styleColorBar(range(gg_ranks_dt[3]), 'bisque'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_ranks_dt[4]),
background = styleColorBar(as.numeric(range(gg_ranks_dt[4],na.rm = TRUE)), 'lightblue'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')
A final representation of customers segregated into olympic ranks.
df_olympics <- gg_ranked_customers %>%
mutate(olympics = ifelse(rank < 2, "Bronze",
ifelse(rank < 7, "Silver", "Gold"))) %>%
mutate(gross_profit_margin = gross_profit_margin/100)
# df_olympics$olympics <- factor(df_olympics$olympics, levels = c("Gold","Silver","Bronze"))
df_olympics$olympics <- factor(df_olympics$olympics, levels = c("Bronze","Silver","Gold"))
gg_olympics <- df_olympics %>%
# ggplot(aes(x=net_sales, y=gross_profit,
ggplot(aes(x=net_sales, y=gross_profit_margin,
text=paste('Customer:',customer_name_sold_to,
'<br>Net Sales:',mycurrency(net_sales),
'<br>Gross Profit:',mycurrency(gross_profit),
'<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
'<br>Rank:',rank,
'<br>Medal:',olympics)
))+
geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
theme_few()+
labs(title = "Ranked customers - Gross profit margin vs Net sales",
caption = "",
fill="Rank",
alpha = "")+
# scale_fill_manual(values=c("#ffd707","#EEEEEE","#a77044"))+
scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
scale_alpha_discrete(range = c(0.5, 1))+
scale_y_continuous(name = "Gross profit margin (%)",
labels=percent_format(suffix="%",accuracy=1),
breaks=seq(0,1,.25))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,15e6,5e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
ggplotly(gg_olympics, tooltip = c("text")) %>%
layout(title = list(text = paste0('Ranked customers - Gross profit margin vs Net sales',
'<br>',
'<sup>',
"",
'</sup>')))
gg_olympics <- df_olympics %>%
ggplot(aes(x=net_sales, y=gross_profit,
text=paste('Customer:',customer_name_sold_to,
'<br>Net Sales:',mycurrency(net_sales),
'<br>Gross Profit:',mycurrency(gross_profit),
'<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
'<br>Rank:',rank,
'<br>Medal:',olympics)
))+
geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
theme_few()+
labs(title = "Ranked customers - Gross profit margin vs Net sales",
caption = "",
fill="Rank",
alpha = "")+
# scale_fill_manual(values=c("#ffd707","#EEEEEE","#a77044"))+
scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
scale_alpha_discrete(range = c(0.5, 1))+
scale_y_continuous(name = "Gross profit (USD)",
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"),
breaks=seq(0,4e6,1e6))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,15e6,5e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
ggplotly(gg_olympics, tooltip = c("text")) %>%
layout(title = list(text = paste0('Ranked customers - Gross profit margin vs Net sales',
'<br>',
'<sup>',
"",
'</sup>')))
One potential issue with scaling rather than normalizing is the preservation of outliers. In our scaling process we observed this in the non-normal distribution of Gold/Silver/Bronze rankings, with Gold essentially making up all of the extreme positive outliers, while Silver and Bronze made up the majority of customers. Is this a problem?
If we want to prioritize only the extreme outliers, then this is not a problem. However, normalization would allow a correction of this non-normal distribution, allowing us to use standard deviation as a useful measure. Specifically, in a normally distributed dataset, 68% of the observations fall within the +/- 1 standard deviation from the mean; 95% of data is contained within the +/- 2 standard deviation from the mean; 99.7% of data is contained within the +/- 3 standard deviation from the mean.
The benefits to normalizing rather than scaling is that in each category we capture the 20% highest performers rather than simply the extreme outliers.
We can visualize the normal distribution in a few different way, using scatterplots and density plots. Using the scatterplots we can easily see how the maxima and minima have been scaled to a value of 0 or 1, respectively. The density plots show how the normalization has given nice standardized bell curves of all data.
Comparison of raw, scaled, normalized data.
Comparing raw data shows vastly different values with maximum values of the larger product lines shadowing the smaller.
df_customer_sales <- df_sales %>%
dplyr::filter(product_line %in% levels(df_opps$product_line)) %>%
group_by(customer_name_sold_to, product_line) %>%
summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_profit_margin = gross_profit/net_sales) %>%
mutate_if(is.numeric, function(x) replace(x, is.infinite(x), NA)) %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
dplyr::select(-c(gross_profit_margin)) %>%
arrange(-gross_profit)
get_bestNormal <- function(df = df_customer_sales,
PL = "Coatings & Additives",
col = "gross_profit"){
df_pl <- df_customer_sales %>%
dplyr::filter(product_line == PL) %>%
arrange(-gross_profit)
new_colname = paste0("norm_",col)
x <- df_pl[[col]]
x_t <- tibble(!!new_colname := bestNormalize::bestNormalize(x, warn=FALSE)$x.t)
new_df <- df_pl %>%
bind_cols(x_t)
return(new_df)
}
combine_normal_df <- function(dfa = df_customer_sales, dfb = get_bestNormal(PL = "Filters")){
merged <- dfa %>%
left_join(dfb, by=colnames(df_customer_sales))
# full_join(dfb)
return(merged)
}
# combine all norm_gross profits
all_norms <- combine_normal_df(dfa = df_customer_sales,
dfb = get_bestNormal(PL = levels(gdata::drop.levels(df_customer_sales$product_line))[1]))
for (level in levels(gdata::drop.levels(df_customer_sales$product_line))){
all_norms <- combine_normal_df(dfa = all_norms,
dfb = get_bestNormal(PL = level)) %>%
mutate(norm_gross_profit = ifelse(is.na(norm_gross_profit.x),
norm_gross_profit.y,
norm_gross_profit.x)) %>%
dplyr::select(-c(norm_gross_profit.x, norm_gross_profit.y))
}
# combine all norm_net_sales
all_norms <- combine_normal_df(dfa = all_norms,
dfb = get_bestNormal(PL = levels(gdata::drop.levels(df_customer_sales$product_line))[1],
col = "net_sales"))
for (level in levels(gdata::drop.levels(df_customer_sales$product_line))){
all_norms <- combine_normal_df(dfa = all_norms,
dfb = get_bestNormal(PL = level,
col = "net_sales")) %>%
mutate(norm_net_sales = ifelse(is.na(norm_net_sales.x),
norm_net_sales.y,
norm_net_sales.x)) %>%
dplyr::select(-c(norm_net_sales.x, norm_net_sales.y))
}
# unscaled
df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
# ggplot(aes(x=net_sales, y=gross_profit_margin))+
ggplot(aes(x=net_sales, y=gross_profit))+
ggpointdensity::geom_pointdensity(adjust=.5e6)+
facet_wrap(~product_line)+
theme_few()+
scale_color_viridis(option="C")+
labs(title = "Product Lines - Gross profit vs Net sales",
caption = "Raw data",
color = "N-neighbors")+
# theme(legend.position = "none")+
scale_y_continuous(name = "Gross profit (USD)",
breaks = seq(0,2e6,1e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,6e6,2e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
Scaling the values on both axes “stretches” out the data. We still have extreme outliers but they are now comparable to one another.
# scaled
df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
group_by(product_line) %>%
mutate_if(is.numeric, rescale) %>%
ggplot(aes(x=net_sales, y=gross_profit))+
ggpointdensity::geom_pointdensity()+
facet_wrap(~product_line)+
theme_few()+
scale_color_viridis(option="C")+
labs(title = "Product Lines - Gross profit margin vs Net sales",
caption = "Scaled data",
color = "N-neighbors")+
# theme(legend.position = "none")+
scale_y_continuous(name = "Gross profit (Scaled)",
breaks=seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))+
scale_x_continuous(name = "Net sales (Scaled)",
breaks = seq(0,1,.25),
labels=percent_format(suffix="%",accuracy=1))
Normally distributed data has transformed data so that 68% of observations fall within 1 standard deviation of the mean, etc. Extreme outliers have been essentially eliminated.
all_norms %>%
mutate(norm_gross_profit_margin = norm_gross_profit/norm_net_sales) %>%
ggplot(aes(x=norm_net_sales, y=norm_gross_profit))+
ggpointdensity::geom_pointdensity()+
geom_vline(xintercept = fivenum(all_norms$norm_gross_profit),linetype="dashed",alpha=.3, color="grey50")+
facet_wrap(~product_line)+
ggpointdensity::geom_pointdensity()+
facet_wrap(~product_line)+
theme_few()+
scale_color_viridis(option="C")+
labs(title = "Product Lines - Gross profit margin vs Net sales",
caption = "Normalized data",
color = "N-neighbors")+
# theme(legend.position = "none")+
scale_y_continuous(name = "Gross profit (Normalized)",
breaks=seq(-3,3,1))+
scale_x_continuous(name = "Net sales (Normalized)",
breaks = seq(-3,3,1))
Density plots allow better visualization of the bell-curves. Our raw plot is skewed heavily to the left. Scaling shifts it slightly, but scaling does not help outliers. Normalization produces proper bell curves for us.
# raw
df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
ggplot(aes(x=gross_profit, y=..scaled..))+
geom_density()+ facet_wrap(~product_line)+
theme_few()+
labs(title = "Product Lines - Gross profit density",
caption = "Raw data")+
scale_x_continuous(name = "Gross profit",
breaks = seq(0,2e6,1e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))+
scale_y_continuous(name = "Density",
breaks=seq(0,1,.5),
labels=percent_format(suffix="%",accuracy=1))
# scaled
df_sales_customer %>%
dplyr::filter(net_sales > 0,
gross_profit > 0,
gross_profit_margin > 0) %>%
group_by(product_line) %>%
mutate_if(is.numeric, rescale) %>%
ggplot(aes(x=gross_profit, y=..scaled..))+
geom_density()+ facet_wrap(~product_line)+
theme_few()+
labs(title = "Product Lines - Gross profit density",
caption = "Scaled data")+
scale_x_continuous(name = "Gross profit (Scaled)",
breaks=seq(0,1,.5),
labels=percent_format(suffix="%",accuracy=1))+
scale_y_continuous(name = "Density",
breaks=seq(0,1,.5),
labels=percent_format(suffix="%",accuracy=1))
# normalized
all_norms %>%
mutate(norm_gross_profit_margin = norm_gross_profit/norm_net_sales) %>%
ggplot(aes(x=norm_gross_profit, y=..scaled..))+
geom_density()+ facet_wrap(~product_line) +
theme_few()+
labs(title = "Product Lines - Gross profit density",
caption = "Normalized data")+
scale_x_continuous(name = "Gross profit (Normalized)",
breaks = seq(-3,3,1))+
scale_y_continuous(name = "Density",
breaks=seq(0,1,.5),
labels=percent_format(suffix="%",accuracy=1))
Now we cluster based on standard deviation from the mean values within each product line. We will break customers up into 6 different clusters:
Because normalization seeks to create a mean at zero and a standard deviation of 1, performing our
clustering is far simpler than with scaling. We simply assign the appropriate Rank
of 1-6
based on whether the value lies within the assigned range.
The benefits of this become quite obvious when we look at the below datatable summarizing our customers
as a whole and by product line: instead of a random cluster number, we know for certain that a
Rank 6
customer is in the top, roughly 0.3%, of highest gross profit customers. Similarly,
if a customer is a Rank 4
customer, we know they are an above average performer based on
our bell curve.
Because we normalized our data, rankings are also evenly distributed.
# rank customers by product line
norms_customers_ranked <- all_norms %>%
mutate(norm_rank = ifelse(norm_gross_profit > 2, 6,
ifelse(norm_gross_profit >1, 5,
ifelse(norm_gross_profit >0, 4,
ifelse(norm_gross_profit > -1, 3,
ifelse(norm_gross_profit > -2, 2, 1))))))
# group customers, add ranks
norms_customers_sum_ranks <- norms_customers_ranked %>%
group_by(customer_name_sold_to) %>%
summarise_at(vars(norm_rank), funs(sum,prod)) %>%
# arrange(-norm_rank) %>%
set_colnames(c("customer_name_sold_to","norm_rank_sum","norm_rank_prod"))
# join ranks and sum ranks to sales data
df_customers_norm_ranked <- df_customer_sales %>%
mutate(gross_profit_margin = gross_profit/net_sales) %>%
left_join(norms_customers_ranked %>% dplyr::select(customer_name_sold_to, product_line, norm_rank)) %>%
left_join(norms_customers_sum_ranks) %>%
arrange(customer_name_sold_to)
# distribution of individual ranks among all customers-productline combos
df_customers_norm_ranked %>%
ggplot(aes(x=norm_rank))+
geom_histogram(bins=6, color="white")+
theme_few()+
scale_x_continuous(name = "Rank",
breaks =seq(0,6,1))+
scale_y_continuous(name = "Count",
breaks = seq(0,300,50))+
labs(title = "Histogram of customer ranks",
caption= "Straightforward normal distribution of scores")
# lolipop of top ranking customers-----
gg_customers_norm_ranked <- df_customers_norm_ranked %>%
group_by(customer_name_sold_to) %>% slice(1) %>% ungroup() %>%
mutate(customer_short = substring(customer_name_sold_to,1,25)) %>%
arrange(-norm_rank_sum) %>%
dplyr::filter(norm_rank_sum > 10) %>%
ggplot(aes(x=reorder(customer_short ,norm_rank_sum),
y=norm_rank_sum,
text=paste("Customer: ",customer_name_sold_to,
"<br>Net sales: ", mycurrency(net_sales),
"<br>Gross profit: ", mycurrency(gross_profit),
"<br>Gross profit margin:", paste0(round( (signif(gross_profit_margin,3)*100), 3), "%"),
"<br>Rank: ", norm_rank_sum)))+
geom_point(aes(fill=as.factor(norm_rank_sum)), size=3, stroke=.2, shape=21)+
geom_segment(aes(x=customer_short,
xend=customer_short,
y=0,
yend=norm_rank_sum),
alpha=.4,
color="grey40")+
theme(panel.grid.major.y = element_blank())+
coord_flip()+
theme_minimal()+
theme(legend.position = "none")+
scale_y_continuous(name = "Rank",
breaks=seq(0,36,2))+
xlab("Customer")+
labs(title = "Customers vs Rank (Rank > 10)",
caption= "Rank: sum of normalized gross sales rankings by product lines")
ggplotly(gg_customers_norm_ranked, tooltip = c("text")) %>%
layout(title = list(text = paste0('Customers vs Rank (Rank > 10)',
'<br>',
'<sup>',
"Rank: sum of normalized gross sales rankings by product lines",
'</sup>')))
# datatable of norm rank data ----
dt_customers_norm_ranked <- df_customers_norm_ranked %>%
arrange(-norm_rank_sum) %>%
dplyr::select(customer_name_sold_to,
product_line,
norm_rank_sum,
net_sales,
gross_profit,
gross_profit_margin,
norm_rank) %>%
set_colnames(c("Customer",
"Product Line",
"Customer Rank",
"Net Sales",
"Gross Profit",
"Gross Profit Margin",
"Product Line Rank"))
datatable(dt_customers_norm_ranked,
options = list(
autoWidth = TRUE
# dom = 't'
# dom = 'tip'
),
class = 'cell-border stripe',
caption = "Normalized customer rankings by product line",
rownames = FALSE) %>%
formatPercentage(c(names(dt_customers_norm_ranked[c(6)])), 1) %>%
formatCurrency(c(names(dt_customers_norm_ranked[c(4,5)])), "$") %>%
formatStyle(names(dt_customers_norm_ranked[3]),
background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[3], na.rm=TRUE)), 'lightsalmon'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(dt_customers_norm_ranked[4]),
background = styleColorBar(range(dt_customers_norm_ranked[4]), 'lightblue'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(dt_customers_norm_ranked[5]),
background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[5],na.rm = TRUE)), 'lightblue'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(dt_customers_norm_ranked[6]),
background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[6],na.rm = TRUE)), 'bisque'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(dt_customers_norm_ranked[7]),
background = styleColorBar(as.numeric(range(dt_customers_norm_ranked[7],na.rm = TRUE)), 'lightsalmon'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')
Though the above information is useful, it may still be encumbering to sort through numbered ranks. Once again we can look into assigning customers olympic medals. We can try a similar method as we tried above, with our scaled ranks: segregating ranks into clusters of 50%, 45%, and 5% for Bronze, Silver, and Gold ranks, respectively.
We first look at the distribution of our ranks and then onto the final visualized scoring.
Normalization reduced the heavy right-skewed distribution we saw when scaling our values, above. We do see the distribution of scores has evened out, still right-skewed, but far less right-skewed than without normalization.
# distribution of norm ranks ?
norm_ranks <- df_customers_norm_ranked %>%
dplyr::select(customer_name_sold_to, norm_rank_sum)
df_norm_ranks_distr <- df_sales %>%
group_by(customer_name_sold_to) %>%
dplyr::summarise(net_sales = sum(net_sales),
gross_profit = sum(gross_profit),
gross_profit_margin = gross_profit/net_sales) %>%
# slice(1) %>%
left_join(norm_ranks) %>%
group_by(customer_name_sold_to) %>%
slice(1) %>%
dplyr::filter(net_sales > 0 & gross_profit_margin > 0 & gross_profit_margin < 1) %>%
mutate(gross_profit_margin = gross_profit_margin*100) %>%
mutate(norm_rank_sum = ifelse(is.na(norm_rank_sum),0,norm_rank_sum))
dt_norm_ranks_distr <- df_norm_ranks_distr %>%
group_by(norm_rank_sum) %>%
add_tally() %>%
arrange(-norm_rank_sum) %>% ungroup() %>%
as_tibble() %>%
mutate(n_pct = n/nrow(df_norm_ranks_distr)) %>%
group_by(norm_rank_sum) %>% slice(1) %>% ungroup() %>%
mutate(n_pct_cusum = cumsum(n_pct)) %>%
dplyr::select(norm_rank_sum,
n,
n_pct,
n_pct_cusum) %>%
set_colnames(c("Rank",
"Num customers in rank",
"Pct customers in rank",
"Cusum pct customers in rank"))
datatable(dt_norm_ranks_distr,
options = list(
dom = 'tipl ',
autoWidth = TRUE,
pageLength = 50
# dom = 'tip'
),
fillContainer = FALSE,
class = 'cell-border stripe',
caption = "Distribution of normalized ranks",
rownames = FALSE) %>%
formatPercentage(c(names(gg_ranks_dt[c(3,4)])), 1) %>%
formatStyle(names(gg_ranks_dt[2]),
background = styleColorBar(as.numeric(range(gg_ranks_dt[2], na.rm=TRUE)), 'lightsalmon'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_ranks_dt[3]),
background = styleColorBar(range(gg_ranks_dt[3]), 'bisque'),
# backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>%
formatStyle(names(gg_ranks_dt[4]),
background = styleColorBar(as.numeric(range(gg_ranks_dt[4],na.rm = TRUE)), 'lightblue'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')
# dotplot olympic norm_rank
df_norm_olympics <- df_norm_ranks_distr %>%
mutate(olympics = ifelse(norm_rank_sum < 5, "Bronze",
ifelse(norm_rank_sum < 18, "Silver", "Gold"))) %>%
mutate(gross_profit_margin = gross_profit_margin/100)
df_norm_olympics$olympics <- factor(df_norm_olympics$olympics, levels = c("Bronze","Silver","Gold"))
gg_norm_olympics <- df_norm_olympics %>%
ggplot(aes(x=net_sales, y=gross_profit_margin,
# ggplot(aes(x=net_sales, y=gross_profit_margin,
text=paste('Customer:',customer_name_sold_to,
'<br>Net Sales:',mycurrency(net_sales),
'<br>Gross Profit:',mycurrency(gross_profit),
'<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
'<br>Rank:',norm_rank_sum,
'<br>Medal:',olympics)))+
geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
theme_few()+
labs(title = "Ranked customers - Gross profit vs Net sales",
caption = "",
fill="Rank",
alpha = "")+
scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
scale_alpha_discrete(range = c(0.5, 1))+
# scale_y_continuous(name = "Gross profit (USD)",
# labels = dollar_format(prefix="$", scale=1e-6,suffix="M"),
# breaks=seq(0,4e6,1e6))+
scale_y_continuous(name = "Gross profit margin (%)",
labels=percent_format(suffix="%",accuracy=1),
breaks=seq(0,1,.25))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,15e6,5e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
ggplotly(gg_norm_olympics, tooltip = c("text")) %>%
layout(title = list(text = paste0('Ranked customers - Gross profit margin vs Net sales',
'<br>',
'<sup>',
"Ranks assigned using normalized customer/product-line data",
'</sup>')))
# dotplot olympic norm_rank
df_norm_olympics <- df_norm_ranks_distr %>%
mutate(olympics = ifelse(norm_rank_sum < 5, "Bronze",
ifelse(norm_rank_sum < 18, "Silver", "Gold"))) %>%
mutate(gross_profit_margin = gross_profit_margin/100)
df_norm_olympics$olympics <- factor(df_norm_olympics$olympics, levels = c("Bronze","Silver","Gold"))
gg_norm_olympics <- df_norm_olympics %>%
ggplot(aes(x=net_sales, y=gross_profit,
text=paste('Customer:',customer_name_sold_to,
'<br>Net Sales:',mycurrency(net_sales),
'<br>Gross Profit:',mycurrency(gross_profit),
'<br>Gross Profit Margin:',paste0((round(gross_profit_margin*100,1)), "%"),
'<br>Rank:',norm_rank_sum,
'<br>Medal:',olympics)))+
geom_point(aes(fill=olympics,alpha=olympics), shape=21,size=2,stroke=.2)+
theme_few()+
labs(title = "Ranked customers - Gross profit vs Net sales",
caption = "",
fill="Rank",
alpha = "")+
scale_fill_manual(values=c("#a77044","#EEEEEE","#ffd707"))+
scale_alpha_discrete(range = c(0.5, 1))+
scale_y_continuous(name = "Gross profit (USD)",
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"),
breaks=seq(0,4e6,1e6))+
# scale_y_continuous(name = "Gross profit margin (%)",
# labels=percent_format(suffix="%",accuracy=1),
# breaks=seq(0,1,.25))+
scale_x_continuous(name = "Net sales (USD)",
breaks = seq(0,15e6,5e6),
labels = dollar_format(prefix="$", scale=1e-6,suffix="M"))
ggplotly(gg_norm_olympics, tooltip = c("text")) %>%
layout(title = list(text = paste0('Ranked customers - Gross profit vs Net sales',
'<br>',
'<sup>',
"Ranks assigned using normalized customer/product-line data",
'</sup>')))