Snowflake Statistical Functions (Part 2)

By Daniel Durling (Senior Data Scientist) and Dave Poole (Principal Data Engineer)

Background 

Welcome to Part Two in our series on Statistical Functions in Snowflake. Here we will look at some of the Snowflake statistical functions which generate multiple values that enable you to understand your data better.  

A quick reminder that in Part One we looked at a series of functions which return a single value and how that single value can help you understand your data. Part One is available here. In Part Three we will look at combining the outputs of these functions using one of our key visualisation tools: AWS QuickSight. 


Distributions & Correlations 

We saw in part one how single value summary statistics can help us understand our data. However, in most cases a single value such as the average, median or mode does not provide enough information.  We would be better off looking at some form of distribution of values.  Snowflake provides us with some useful functions in this space: 

distribution function & description table

CUME_DIST 

The CUME_DIST function allows you to calculate the cumulative distribution of the data within a table or within partitions of that table. To put this another way, we are looking at the data within a table / group of entries within a table, ordering them by a given variable, then calculating how much of the whole each group (and all groups before it in the ordering) takes up. We then normalise this to a value between 0 and 1.  This 0 – 1 range represents percentages where 1 is 100%, 0.35 is 35%, etc. 
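To make this concrete, here is a minimal Python sketch of the same calculation (an illustrative stand-in written for this article, not Snowflake's implementation):

```python
def cume_dist(values):
    """For each row (in ascending order), return the fraction of rows whose
    ordering value is less than or equal to it -- the logic behind CUME_DIST."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in sorted(values)]

# Four rows, two of which tie on the ordering column:
print(cume_dist([10, 20, 20, 30]))  # → [0.25, 0.75, 0.75, 1.0]
```

Note how the two tied rows share the value 0.75; in the examples that follow, orders placed on the same date share a cumulative frequency in exactly the same way.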

Try the following example based on the SNOWFLAKE_SAMPLE_DATA database. 

SELECT
    O_CLERK,
    O_ORDERDATE,
    CUME_DIST() OVER (PARTITION BY O_CLERK ORDER BY O_ORDERDATE) AS Cumulative_Frequency
FROM ORDERS
WHERE O_ORDERDATE BETWEEN '1997-01-01' AND '1997-12-31';

This shows us the cumulative frequency of orders by date for each clerk.  The records will appear as follows: 

o_clerk, o_orderdate and cumulative_frequency table

Look at the CUMULATIVE_FREQUENCY column carefully.  Why are there two identical values?  Because the cumulative distribution is ordered by O_ORDERDATE, the column represents the cumulative frequency for the whole day, not just the individual record. 

There are 261 records for Clerk#000000968 

  • Each record is approximately 0.38%  (1 / 261) 
  • 2 orders occur on 1st January 1997 so 2 x 0.38% = 0.76% 

This representation can be confusing, so we can aggregate the results to give a clearer answer: 

WITH daily_sales(O_Clerk, Order_Date, Cumulative_Frequency)
AS (
  SELECT
    O_CLERK,
    O_ORDERDATE,
    CUME_DIST() OVER (PARTITION BY O_CLERK ORDER BY O_ORDERDATE) AS Cumulative_Frequency
  FROM ORDERS
  WHERE O_ORDERDATE BETWEEN '1997-01-01' AND '1997-12-31'
)
SELECT
    O_Clerk,
    Order_Date,
    MAX(Cumulative_Frequency) AS Cumulative_Frequency,
    COUNT(*) AS Daily_Orders
FROM daily_sales
GROUP BY O_Clerk, Order_Date
ORDER BY O_Clerk, Order_Date;

This gives less confusing results: 

o_clerk, order_date, cumulative_frequency and daily_orders table

The CUME_DIST function is useful because its output allows us to describe the probability of random variables within our data. These distributions can then be compared group to group, allowing us to see the separation / overlap of our variables.  

As a practical example, imagine that you want to run an email campaign.  You want to know if your customer segmentation model was successful.  You have two groups of email addresses: 

  • Control group selected at random 
  • Test group chosen using your segmentation model 

The chart below is an example showing the cumulative distribution of responses for your two groups. Here we have: 

X axis = Percentage of each customer group contacted 

Y axis = Percentage of responses coming from the percentage of the customers contacted 

The straight blue line represents the responses from your control group.  If your control group is a truly random selection then the response rate will be constant.  When looking back on your campaign results any two random samples from your control group should produce the same response rate. 

The curved red line represents the response rate if our customers were ranked by our scoring model. 

You can see that the scoring model is successful as the campaign returns 70% responses from 30% of customers. 

Cumulative distribution used to illustrate results of a customer targeting strategy

Percentile vs percentage 

Next, we look at two percentile calculation functions. Remember that a percentile is not a percentage. A percentile is used to return to us a ranked value, whereas a percentage allows us to compare different quantities of different things in a consistent way. A percentage gives us information on ratios and proportions. A percentile tells us where a given value is in relation to all the other values within the dataset. 

If you get 80% on a test of 100 questions, we know you got 80 questions correct. However, we do not know anything about how well you did in comparison to others. If your score is at the 80th percentile, we know that 80% of the scores of people who took the test are equal to or below your score. 

Percentiles are a common type of Quantile. Other common types are quartiles (from which we can calculate the interquartile range – you might be familiar with this from viewing Box plots). 

PERCENTILE_DISC vs PERCENTILE_CONT 

The table below summarises the difference between the two functions: 

percentile_cont & percentile_disc table

1 If you would like to know more about the Beaufort scale check out this link. 

 

You can experiment with the percentiles by editing the following query. 

SET Desired_Percentile=0.55;

SELECT Item_Code,
    $Desired_Percentile,
    PERCENTILE_DISC($Desired_Percentile) WITHIN GROUP (ORDER BY Item_Value) AS Discrete,
    PERCENTILE_CONT($Desired_Percentile) WITHIN GROUP (ORDER BY Item_Value) AS Continuous
FROM (
  SELECT 1 AS Item_Code, 10 AS Item_Value
  UNION ALL
  SELECT 1 AS Item_Code, 20 AS Item_Value
  UNION ALL
  SELECT 1 AS Item_Code, 30 AS Item_Value
  UNION ALL
  SELECT 1 AS Item_Code, 40 AS Item_Value
) AS DT
GROUP BY Item_Code;

How PERCENTILE_CONT is calculated 

  • Take your list of values and arrange it in ascending order 
  • Calculate the list position: (desired percentile / 100) * (number_of_datapoints – 1) + 1 
  • Calculate the integer position by rounding down the list position 
  • Calculate the decimal part of the list position by subtracting the integer position from it 
  • The percentile value is then: datapoint at the integer position + (datapoint at (integer position + 1) – datapoint at the integer position) x decimal part of the list position 

Let us work through the 55th percentile example 

  • List position = ( 55/100 ) * (4 – 1) + 1 = 0.55 * 3 + 1 = 2.65 
  • Integer list position = 2 
  • Decimal part of list position = 0.65 
  • Datapoint at list position = 20 
  • Datapoint at next list position = 30 
  • 20 + ( 30 – 20 ) * 0.65 = 26.5 
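We can check the arithmetic with a short Python sketch (the function names and data here are our own illustrations, not Snowflake code); a PERCENTILE_DISC equivalent is included for comparison, returning the first actual datapoint whose cumulative distribution reaches the desired percentile:

```python
import math

def percentile_cont(values, p):
    """Linear interpolation between the two closest ranks (p is a fraction 0-1)."""
    data = sorted(values)
    pos = p * (len(data) - 1) + 1      # 1-based list position
    lower = math.floor(pos)            # integer part
    frac = pos - lower                 # decimal part
    if lower >= len(data):             # p = 1.0 lands exactly on the last value
        return float(data[-1])
    return data[lower - 1] + (data[lower] - data[lower - 1]) * frac

def percentile_disc(values, p):
    """Smallest actual value whose cumulative distribution is >= p."""
    data = sorted(values)
    n = len(data)
    for i, v in enumerate(data, start=1):
        if i / n >= p:
            return v

print(percentile_cont([10, 20, 30, 40], 0.55))  # ≈ 26.5, matching the worked example
print(percentile_disc([10, 20, 30, 40], 0.55))  # 30, an actual value from the list
```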

Percentile calculation is of great use when understanding where a given value is in relation to other values within the same group of data. 

RANK / DENSE_RANK / PERCENT_RANK  

Next let us look at these three ranking functions. These functions allow us to see the ordering of variables within our data in different ways. This information is valuable as it quickly allows us to understand the number of discrete categories within a subgroup, and the order they are in. 

RANK and DENSE_RANK do the same thing in slightly different ways. 

Let us imagine an Olympic cycle race where the winners tie.  We shall ignore the coin toss that decides the winner under Olympic rules. 

For the 2 winners, RANK and DENSE_RANK = 1 

For the person in 3rd place RANK = 3 but DENSE_RANK = 2 

Or to put it another way 

  • RANK will tell us the top 3 riders by time. 
  • DENSE_RANK will tell us the top 2 times, although these may belong to more than 2 riders. 
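The difference can be sketched in a few lines of Python (illustrative helpers written for this article, not the SQL functions themselves):

```python
def rank(values):
    """RANK: tied values share a rank, and the next rank skips ahead."""
    data = sorted(values)
    return [1 + sum(1 for v in data if v < x) for x in data]

def dense_rank(values):
    """DENSE_RANK: tied values share a rank, with no gaps afterwards."""
    data = sorted(values)
    distinct = sorted(set(values))
    return [1 + distinct.index(x) for x in data]

times = [35.2, 35.2, 36.1]   # two riders tie for the win
print(rank(times))           # → [1, 1, 3]
print(dense_rank(times))     # → [1, 1, 2]
```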

We can demonstrate this in an example query based on the TPCH_SF1.PART table within the SNOWFLAKE_SAMPLE_DATA database: 

SELECT
    P_RETAILPRICE,
    RANK() OVER (ORDER BY P_RETAILPRICE DESC) AS Ranked_Price,
    DENSE_RANK() OVER (ORDER BY P_RETAILPRICE DESC) AS Dense_Ranked_Price,
    PERCENT_RANK() OVER (ORDER BY P_RETAILPRICE DESC) AS Percent_Ranked_Price
FROM TPCH_SF1.PART;

Let us look at the results for the top 3 prices: 

p_retailprice, ranked_price, dense_ranked_price and percent_rank table 

We can see that DENSE_RANK identifies the top three prices whereas RANK shows that the top three prices cover 9 products. 

PERCENT_RANK returns the rank but specified as a percentage ranging from 0.0 to 1.0.  

From the documentation, items ranked in 1st place are assigned zero.  From that point on the calculation is (rank – 1) / (n – 1). 

n – 1 = 8, so for the other rows we can slot in the rank to see how the PERCENT_RANK was calculated 

  • (3 – 1) / 8 = 0.25 
  • (6 – 1) / 8 = 0.625 
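The same arithmetic can be checked with a small Python sketch (again an illustrative stand-in, using a hypothetical set of nine prices that reproduces the ranks above):

```python
def percent_rank(values):
    """PERCENT_RANK: (rank - 1) / (n - 1), so the top rank is always 0.0."""
    data = sorted(values, reverse=True)   # highest price ranks first
    n = len(data)
    ranks = [1 + sum(1 for v in data if v > x) for x in data]
    return [(r - 1) / (n - 1) for r in ranks]

# Nine rows: ranks 1, 1, 3, 3, 3, 6, 6, 6, 6 -- ties share a rank
prices = [100, 100, 90, 90, 90, 80, 80, 80, 80]
print(percent_rank(prices))  # → [0.0, 0.0, 0.25, 0.25, 0.25, 0.625, 0.625, 0.625, 0.625]
```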

You will see this ranking is similar (but not identical) to our CUME_DIST function above. 

These ranking functions allow us to create and save a given ordering of our data. We can of course create multiple rank columns depending on what we might wish to do with the data. 

COVAR_POP / COVAR_SAMP 

These two covariance functions are similar to the Variance function we looked at in Part One. The table below helpfully compares them: 

operation, variance and covariance table

Care must be taken when calculating and interpreting covariance: 

  • Both sets should use the same units of measure 
  • The numbers produced can take almost any value, so they must be interpreted in the context of those units of measure. 

Just as the variance in Part One allows us to calculate the (in this writer's opinion) much more useful standard deviation, the covariance allows us to calculate the more useful correlation coefficient. 

 CORR 

The CORR function produces the correlation coefficient between two sets of values. This function returns to us a value between -1 and 1 where: 

  • -1 = as one variable goes up, the other goes down 
  •  0 = no correlation 
  • +1 = as one variable goes up, so too does the other 

 The correlation coefficient is calculated using the covariances we calculated above as follows: 

Formula for the correlation coefficient

The bottom half of the fraction is the multiplication of the standard deviations for the two sets. 
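That relationship can be verified with a short Python sketch (illustrative helpers, not Snowflake internals): the correlation is the population covariance divided by the product of the population standard deviations.

```python
import math

def covar_pop(xs, ys):
    """Population covariance: mean of the products of the deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def stddev_pop(xs):
    """Population standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / n)

def corr(xs, ys):
    """Correlation coefficient: covariance over the product of the std devs."""
    return covar_pop(xs, ys) / (stddev_pop(xs) * stddev_pop(ys))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]     # perfectly proportional to x
print(corr(x, y))    # ≈ 1.0 (up to floating point)
```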

Compare the various statistics for the supply cost and retail price for each brand in the SNOWFLAKE_SAMPLE_DATA database: 

SELECT
    P.P_BRAND,
    COUNT(*) AS Number_Of_Parts,
    COVAR_POP(S.PS_SUPPLYCOST, P.P_RETAILPRICE) AS Covariance_Cost_Vs_Sales,
    CORR(S.PS_SUPPLYCOST, P.P_RETAILPRICE) AS Correlation_Coefficient,
    AVG(S.PS_SUPPLYCOST) AS Average_Supply_Cost,
    STDDEV(S.PS_SUPPLYCOST) AS Standard_Deviation_Supply_Cost,
    AVG(P.P_RETAILPRICE) AS Average_Retail_Price,
    STDDEV(P.P_RETAILPRICE) AS Standard_Deviation_Retail_Price
FROM PARTSUPP AS S
INNER JOIN PART AS P
    ON S.PS_PARTKEY = P.P_PARTKEY
GROUP BY P.P_BRAND;

The results show that the relationship between supply cost and retail price is almost non-existent. This lack of a relationship is still valuable information: it puts down a marker for the current level of correlation and lets us compare the data against our prior expectations and gut instincts. 

Correlation between variables is an easy-to-understand statistic which can reveal an underlying relationship you were unaware of.  

If you analyse the correlations between all the possible pairs of the items your company sells, you will no doubt find some correlations you were unaware of. Using this information, you can then devise strategies to take advantage of them: for example, placing correlated items close together within the store, or offering a discount on the lower margin item to test whether it lifts sales of both items and so increases overall profit. 

Conclusion  

In this second part of our series on statistical functions in Snowflake we have highlighted some of what we think are the most useful functions that generate more than a single value (for functions which return a single value, please check out Part One). We looked at cumulative distribution, discrete and continuous percentile calculations, and three different types of ranking function. 

These functions (alongside those we looked at in Part One) help us understand the datasets we are working on. In Part Three of this series, we will look at visualising these summarising datapoints alongside our actual data. 

Found this useful? Fancy a chat about statistics and Snowflake? Let us know here