## Computing the variance in one pass¶

There a variety of algorithms described for computing the variance of a set of numbers. Recall that the variance of a set values is $$v = \frac{1}{n-1} \sum_i \left( x_i - \tfrac{1}{n} \sum_{j} x_j \right)^2.$$ (This is the sample variance that is sometimes computed, the differences do not matter for our discussion.)

The easiest way to compute this formula is by first computing the mean: $$\text{mean} = \tfrac{1}{n-1} \sum_{j} x_j$$ and then computing the variance based on the mean $$v = \frac{1}{n-1} \sum_i \left( x_i - \text{mean} \right)^2.$$

But this requires two visits to each datapoint. If there are millions or billions of datapoints, this could be expensive.

In [26]:
function badvar1(x::Vector{Float64})
ex2 = 0.0
ex = 0.0
n = length(x)
for i=1:n
ex2 = ex2 + x[i]^2
ex = ex + x[i]
end
@show ex2, ex^2
return 1.0/(n-1)*(ex2 - (ex)^2/n)
end

Out[26]:
badvar1 (generic function with 1 method)
In [17]:
x = randn(100)
basevar = x -> (length(x)/(length(x)-1))*mean((x - mean(x)).^2)
@show basevar(x)

badvar1(x) = 1.032311000268402
basevar(x) = 1.0323110002684024

Out[17]:
1.0323110002684024
In [18]:
x = randn(10000)
@show basevar(x)

badvar1(x) = 1.0098437328466676
basevar(x) = 1.0098437328466736

Out[18]:
1.0098437328466736

Variance computations are immune to shifts

In [19]:
x = randn(10000)+1e4
@show basevar(x)

badvar1(x) = 1.019557363451189
basevar(x) = 1.0195578158695435

Out[19]:
1.0195578158695435
In [20]:
x = randn(10000)+1e6
@show basevar(x)

badvar1(x) = 0.9928992899289929
basevar(x) = 1.0077890284079822

Out[20]:
1.0077890284079822
In [27]:
x = randn(10000)+1e8
@show basevar(x)

(ex2,ex ^ 2) = (1.0000000002153153e20,1.0000000002153186e24)
basevar(x) = 1.0046044826796907

Out[27]:
1.0046044826796907
In [24]:
function goodvar(x::Vector{Float64})
n = length(x); mean = 0.0; m2 = 0.0; N = 0.0
for i=1:n
N = N + 1
delta = x[i] - mean
mean = mean + delta/N
m2 = m2 + delta*(x[i]-mean)
end
return m2/(n-1)
end

Out[24]:
goodvar (generic function with 1 method)
In [25]:
x = randn(10000)+1e8
basevar = var

badvar1(x) = 81.92819281928193

1.0168265735653041