Describing Data with Statistics in Go

In this post I’ll use Go to explore three basic measures of central tendency (the mean, the median and the mode) and one measure of dispersion: the range.

Calculating the Average

The mean or arithmetic average is probably the measure of central tendency that you are most familiar with. When we need to find the average of a set of data, we add up all the values and then divide this total by the number of values.

Let’s see this with an example:

This is the amount (in GBP) that Mr. Vamug has spent on the lottery in each of the last 12 months:

lotterySpending := []float64{150, 105, 127, 130, 46, 106, 36, 33, 46, 73, 84, 112}

Before we start: I am going to assume that the data passed to our functions is valid, so I will skip error handling for the sake of simplicity.

So we need to create a function that accepts a set of values (of any length) and returns the average.

import "fmt"

// Average returns the arithmetic mean of s.
func Average(s []float64) float64 {
    var sum float64
    for _, v := range s {
        sum += v
    }
    return sum / float64(len(s))
}

fmt.Printf("Thus, Mr. Vamug's monthly average spending in lottery is: £%.2f", Average(lotterySpending))
Thus, Mr. Vamug's monthly average spending in lottery is: £87.33

As you can see, the mean lets us describe an entire sample with a single number that represents the center of the data.
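I said earlier that I would skip error handling, but the one input worth knowing about is an empty slice: dividing 0 by 0 in floating point yields NaN rather than a panic. If you do want to guard against it, a minimal sketch could look like the following. The names AverageChecked and ErrEmptyInput are hypothetical and only used for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrEmptyInput is a hypothetical sentinel error used only in this sketch.
var ErrEmptyInput = errors.New("average of an empty data set is undefined")

// AverageChecked is a hypothetical variant of Average that rejects empty
// input instead of returning NaN (0 divided by 0 in floating point).
func AverageChecked(s []float64) (float64, error) {
	if len(s) == 0 {
		return 0, ErrEmptyInput
	}
	var sum float64
	for _, v := range s {
		sum += v
	}
	return sum / float64(len(s)), nil
}

func main() {
	if _, err := AverageChecked(nil); err != nil {
		fmt.Println("Error:", err) // Error: average of an empty data set is undefined
	}
	avg, _ := AverageChecked([]float64{2, 4, 6})
	fmt.Printf("%.2f\n", avg) // 4.00
}
```

Returning an error keeps the decision about what to do with empty data in the caller's hands, which is usually easier to reason about than checking for NaN afterwards.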

Calculating the Median

The median is another kind of average. It represents the middle of the data. Half of the observations are less than or equal to it and half of the observations are greater than or equal to it. The median is less sensitive than the mean to extreme values.

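To obtain the median we need to sort the values in ascending order and pick the middle one. If the number of values is even there is no single middle value, so we take the mean of the two middle values instead.

Note: slice indexes start at zero, so we need to account for this in order to get the correct position.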

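import "sort"

// Median returns the middle value of s, or the mean of the two middle
// values when len(s) is even.
func Median(s []float64) float64 {
    sorted := append([]float64(nil), s...) // sort a copy so the caller's slice is left untouched
    sort.Float64s(sorted)
    n := len(sorted)
    p := n - 1 // index starts at zero
    if n%2 == 0 {
        return (sorted[p/2] + sorted[p/2+1]) / 2
    }
    return sorted[(p+1)/2]
}

fmt.Printf("Thus, Mr. Vamug's monthly median spending in lottery is: £%.2f", Median(lotterySpending))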
Thus, Mr. Vamug's monthly median spending in lottery is: £94.50

The average is slightly lower than the median. I may discuss this in more detail when looking at skewness, but that will be a different post.
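Coming back to the claim that the median is less sensitive to extreme values than the mean, here is a small self-contained sketch. The two data sets are made up for illustration (they are not Mr. Vamug's figures), and the lowercase mean and median helpers are stand-ins so that the snippet runs on its own.

```go
package main

import (
	"fmt"
	"sort"
)

// mean returns the arithmetic mean of s (assumes s is non-empty).
func mean(s []float64) float64 {
	var sum float64
	for _, v := range s {
		sum += v
	}
	return sum / float64(len(s))
}

// median returns the middle value of s (assumes s is non-empty).
// It sorts a copy so the input is left untouched.
func median(s []float64) float64 {
	sorted := append([]float64(nil), s...)
	sort.Float64s(sorted)
	n := len(sorted)
	if n%2 == 0 {
		return (sorted[n/2-1] + sorted[n/2]) / 2
	}
	return sorted[n/2]
}

func main() {
	// Made-up values for illustration only.
	typical := []float64{10, 12, 14, 16, 18}
	withOutlier := []float64{10, 12, 14, 16, 500}

	fmt.Printf("mean=%.2f median=%.2f\n", mean(typical), median(typical))         // mean=14.00 median=14.00
	fmt.Printf("mean=%.2f median=%.2f\n", mean(withOutlier), median(withOutlier)) // mean=110.40 median=14.00
}
```

Replacing a single value with an extreme one drags the mean from 14 to 110.4, while the median stays at 14.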

Calculating the Mode

The mode is the value that occurs most frequently in a set of observations. You can find the mode simply by counting the number of times each value occurs in a data set. In order to achieve this we need to build a frequency table.

A frequency table is a table that lists items and shows the number of times the items occur.

I think we are better off splitting the logic into separate functions.

// FreqTable returns a map from each value in s to the number of times it occurs.
func FreqTable(s []float64) map[float64]int {
    ft := make(map[float64]int)
    for _, v := range s {
        ft[v]++
    }
    return ft
}

If we now pass lotterySpending to our function, we get a map of values along with their frequencies:

FreqTable(lotterySpending)
map[33:1 36:1 46:2 73:1 84:1 105:1 106:1 112:1 127:1 130:1 150:1]

From the above map we can see that the most frequent value is £46. Let’s now build our Mode function.

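// Mode returns the most frequent value in s, or an error when no value repeats.
func Mode(s []float64) (float64, error) {
    ft := FreqTable(s)
    var pos float64
    var max int // highest frequency seen so far
    // Start from zero so that pos is always set together with max.
    for value, n := range ft {
        if n > max {
            max = n
            pos = value
        }
    }
    // A highest frequency of 1 (or empty input) means no value repeats.
    if max <= 1 {
        return 0.0, fmt.Errorf("No mode")
    }
    return pos, nil
}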
md, err := Mode(lotterySpending)
md
46

Voilà! Our function returns the value 46. But what if no value repeats, that is, the frequency of every value is 1? In that case there is no mode, and we handle this scenario by returning an error.

Here is a quick example:

mode, err := Mode([]float64{1, 2, 3, 4})
if err != nil {
    fmt.Printf("Error: %s", err)
}
Error: No mode

Sometimes a data set has more than one mode: it is bimodal when there are two peaks, and multimodal when there are more than two. I will leave that implementation for another post.
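In the meantime, for the curious, here is a rough sketch of how every mode could be collected rather than just one. Modes is a hypothetical helper, not part of the code above, and returning nil when no value repeats is simply an assumption I made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Modes is a hypothetical helper: it returns every value that shares the
// highest frequency, sorted ascending, or nil when no value repeats.
func Modes(s []float64) []float64 {
	ft := make(map[float64]int)
	max := 0
	for _, v := range s {
		ft[v]++
		if ft[v] > max {
			max = ft[v]
		}
	}
	if max <= 1 {
		return nil
	}
	var modes []float64
	for value, n := range ft {
		if n == max {
			modes = append(modes, value)
		}
	}
	// Map iteration order is not defined, so sort for a stable result.
	sort.Float64s(modes)
	return modes
}

func main() {
	fmt.Println(Modes([]float64{1, 2, 2, 3, 3, 4})) // [2 3]
	fmt.Println(Modes([]float64{1, 2, 3, 4}))       // []
}
```

Sorting the result matters because Go does not guarantee the order in which a map is iterated, so without it the same input could return its modes in a different order on each run.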

Calculating the Range

The calculation of the range is very straightforward: the range is the difference between the highest and lowest values.

Use the range to understand the amount of dispersion in the data. A large range indicates greater dispersion in the data, while a small range indicates less dispersion.

The range can sometimes be misleading when there are extremely high or low values.

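// Range returns the difference between the highest and lowest values in s.
func Range(s []float64) float64 {
    sorted := append([]float64(nil), s...) // sort a copy so the caller's slice is left untouched
    sort.Float64s(sorted)
    return sorted[len(sorted)-1] - sorted[0]
}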
Range(lotterySpending)
117
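To see how a single extreme value can make the range misleading, here is a small sketch with made-up numbers. The helper spread is a hypothetical single-pass alternative to Range: it tracks the minimum and maximum as it goes instead of sorting, and it assumes a non-empty slice.

```go
package main

import "fmt"

// spread is a hypothetical single-pass alternative to Range.
// It assumes s is non-empty and does not modify it.
func spread(s []float64) float64 {
	min, max := s[0], s[0]
	for _, v := range s[1:] {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	return max - min
}

func main() {
	// Made-up values for illustration only.
	typical := []float64{40, 45, 50, 55, 60}
	withOutlier := []float64{40, 45, 50, 55, 600}

	fmt.Println(spread(typical))     // 20
	fmt.Println(spread(withOutlier)) // 560
}
```

Four of the five values are identical in both data sets, yet the range jumps from 20 to 560, which says very little about how the bulk of the data is spread.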

That’s it! I hope you have enjoyed this post. Stay tuned, as I plan to release more posts about statistics in Go.
