There are a variety of algorithms for computing the variance of a set of numbers. Recall that the variance of a set of values is $$v = \frac{1}{n-1} \sum_i \left( x_i - \tfrac{1}{n} \sum_{j} x_j \right)^2.$$ (This is the sample variance; whether we normalize by $n$ or $n-1$ does not matter for our discussion.)
The easiest way to compute this formula is to first compute the mean $$ \text{mean} = \tfrac{1}{n} \sum_{j} x_j $$ and then compute the variance from it: $$v = \frac{1}{n-1} \sum_i \left( x_i - \text{mean} \right)^2.$$
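The two-pass recipe above can be written directly; `twopassvar` is an illustrative name, not from the original:

```julia
using Statistics

function twopassvar(x::Vector{Float64})
    # Pass 1: compute the mean.
    m = mean(x)
    # Pass 2: sum the squared deviations from the mean.
    s = 0.0
    for i = 1:length(x)
        s += (x[i] - m)^2
    end
    return s/(length(x)-1)
end
```

This is numerically well behaved, but it touches every element twice.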
But this requires two passes over the data. If there are millions or billions of datapoints, or the data arrive as a stream, that can be expensive. A single pass suffices because the sum of squared deviations expands algebraically: $$\sum_i \left( x_i - \tfrac{1}{n} \sum_j x_j \right)^2 = \sum_i x_i^2 - \frac{1}{n}\Big(\sum_i x_i\Big)^2,$$ so it is enough to accumulate $\sum_i x_i$ and $\sum_i x_i^2$ in one sweep.
function badvar1(x::Vector{Float64})
    # One-pass "textbook" variance: accumulate sum(x) and sum(x.^2),
    # then combine them at the end.
    ex2 = 0.0   # running sum of squares
    ex = 0.0    # running sum
    n = length(x)
    for i=1:n
        ex2 = ex2 + x[i]^2
        ex = ex + x[i]
    end
    @show ex2, ex^2
    return 1.0/(n-1)*(ex2 - ex^2/n)
end
using Statistics
x = randn(100)
basevar = x -> (length(x)/(length(x)-1))*mean((x .- mean(x)).^2)
@show badvar1(x)
@show basevar(x)
x = randn(10000)
@show badvar1(x)
@show basevar(x)
Mathematically, the variance is invariant under shifts: adding a constant to every value leaves $v$ unchanged. Numerically, however, the one-pass formula is not. Watch what happens as we shift the data farther from zero.
x = randn(10000)+1e4
@show badvar1(x)
@show basevar(x)
x = randn(10000)+1e6
@show badvar1(x)
@show basevar(x)
x = randn(10000)+1e8
@show badvar1(x)
@show basevar(x)
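The failure is catastrophic cancellation: with data near $10^8$, both `ex2` and `ex^2/n` are on the order of $10^{16}$, where the spacing between adjacent `Float64` values exceeds the true variance of about 1. A minimal sketch of the effect:

```julia
# Adjacent representable Float64 values near 1e16 differ by 2.0,
# so a difference of order 1 cannot survive the subtraction
# ex2 - ex^2/n performed by badvar1.
@show eps(1.0e16)          # 2.0

# Adding 1.0 to 1e16 rounds back to 1e16, and the low bits vanish:
a = 1.0e16 + 1.0
@show a - 1.0e16           # 0.0, not 1.0
```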
function goodvar(x::Vector{Float64})
    # Welford's one-pass algorithm: maintain a running mean and a running
    # sum of squared deviations from the current mean.
    n = length(x); mean = 0.0; m2 = 0.0; N = 0.0
    for i=1:n
        N = N + 1
        delta = x[i] - mean          # deviation from the old mean
        mean = mean + delta/N        # updated running mean
        m2 = m2 + delta*(x[i]-mean)  # product of old and new deviations
    end
    return m2/(n-1)
end
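The updates in `goodvar` are the standard Welford recurrences. Writing $\bar x_N$ for the mean of the first $N$ values and $M_{2,N} = \sum_{i \le N} (x_i - \bar x_N)^2$, one can verify $$\bar x_N = \bar x_{N-1} + \frac{x_N - \bar x_{N-1}}{N}, \qquad M_{2,N} = M_{2,N-1} + (x_N - \bar x_{N-1})(x_N - \bar x_N).$$ Each step subtracts quantities of comparable, modest magnitude rather than two enormous accumulated sums, which is why the shift no longer destroys the answer.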
x = randn(10000)+1e8
basevar = var    # use the built-in var from Statistics as the reference
@show badvar1(x)
@show basevar(x)
@show goodvar(x)