# The future of the Lake Arrowhead matrix

Cleve Moler just wrote a fantastic blog post about the Lake Arrowhead graph. The Lake Arrowhead graph, or the Golub Graph as I like call it, was a coauthor graph centered around Gene collected at the 1993 Householder meeting in Lake Arrowhead, California. We'll study this network using a few one-liner Matlab link-prediction techniques. These are all matrix equations that Gene would have enjoyed, so this is a fitting tribute. We'll see which predicts the most correct links in the top 25 (with at most 5 predictions for any author ...)

There are few issues with the Matlab code that would ruffle the feathers of any numerical analysis. Namely, I've used the inv function to greatly simplify the codes. Hopefully we can agree it's okay for this simple example.

I've decided to crowdsource the actual ranking! Please vote for which list you think looks "most right"!

## Contents

Visit Gilbert's original website and download the tar.Z file, unzip it, proceed into the householder directory, and turn the "drawit.m" file into a function.

## Load the network and look at it.

The drawit function is included in the tar file with the graph. We do need to edit it, so convert it into a function, otherwise you won't get nice figures below

```load housegraph;
n = size(A,1);
drawit
axis off;
```

## Remove self-loops

When we are predicting the future, allowing self-loops hardly seems fair! Although, in truth, they have a fairly predictable affect on each of the methods we study. It's just easier to remove them.

```B = A - diag(diag(A));
```

## Katz scores

Leo Katz wrote a fantastic paper in 1953 on a new way to determine a "status index" that measures how important someone is in a social network. More recently, the Katz score between a pair of nodes was shown to predict links that would form in the future. See this paper by Jon Kleinberg (http://www.cs.cornell.edu/home/kleinber/link-pred.pdf) and some recent work by Inderjit Dhillon (http://www.cs.utexas.edu/users/inderjit/Talks/2010-IPAM-Inderjit.pdf)

The Katz score between node i and j is the number of walks between the nodes, where the weight of a walk of length k is alpha^k. These scores are usually defined via the Neumann series:

K = alpha*A + alpha^2*A^2 + alpha^3*A^3 + ...

which expresses the score for all pairs of nodes simultaneously. The relationship with the Neumann series means that we can evaluate this infinite series in Matlab using a single matrix inverse:

K = inv(eye(n) - alpha*A)

Picking alpha is always the key question with this method. While the inverse exists as long as alpha isn't one of a very few special values related to the eigenvalues of A, most of these choices are rotten. Choosing alpha = 1/largest degree is common as it'll make the Neumann series converge. We evaluate this value as well as a smaller value of alpha that should favor short connections instead of longer connections.

That's a long-winded explaination for the next three lines of Matlab code to actually compute them!

## Katz scores with a tight and nearly diagonally dominant, matrix

```fprintf('\nKatz scores with DD matrix\n');
alpha = 1./max(sum(B,2));
showpreds(inv(eye(n) - alpha*B),A,25,name);
```
```Katz scores with DD matrix
Golub - Paige                 with value 0.005967
Golub - Demmel                with value 0.005407
Golub - Schreiber             with value 0.005146
Golub - Hammarling            with value 0.004965
Schreiber - Heath                 with value 0.004887
VanLoan - VanDooren             with value 0.004752
TChan - Demmel                with value 0.004682
VanLoan - Heath                 with value 0.003874
Eisenstat - Plemmons              with value 0.003845
Golub - Liu                   with value 0.003818
VanDooren - Heath                 with value 0.003796
VanDooren - Plemmons              with value 0.003785
Golub - Funderlic             with value 0.003717
Golub - Harrod                with value 0.003699
Golub - Boley                 with value 0.003698
Moler - Demmel                with value 0.003689
Golub - NTrefethen            with value 0.003660
George - Gilbert               with value 0.003645
Heath - Meyer                 with value 0.003609
Eisenstat - Ng                    with value 0.003602
Bjorck - VanDooren             with value 0.003589
Kagstrom - Luk                   with value 0.003563
Gilbert - Eisenstat             with value 0.003547
Park - VanDooren             with value 0.003488
Schreiber - Bunch                 with value 0.003483
```

## Katz scores with a matrix that's very DD

```fprintf('\nKatz scores with very DD matrix\n');
alpha = 1./max(sum(B,2)+100); % smaller alpha bound
showpreds(inv(eye(n) - alpha*B),A,25,name);
```
```Katz scores with very DD matrix
Golub - Paige                 with value 0.000298
Golub - Demmel                with value 0.000245
Golub - Schreiber             with value 0.000242
Golub - Hammarling            with value 0.000240
Schreiber - Heath                 with value 0.000240
VanLoan - VanDooren             with value 0.000239
TChan - Demmel                with value 0.000238
VanLoan - Heath                 with value 0.000182
Eisenstat - Plemmons              with value 0.000182
Golub - Liu                   with value 0.000181
VanDooren - Heath                 with value 0.000181
VanDooren - Plemmons              with value 0.000181
Moler - Demmel                with value 0.000181
George - Gilbert               with value 0.000180
Golub - Funderlic             with value 0.000180
Golub - Harrod                with value 0.000180
Golub - Boley                 with value 0.000180
Eisenstat - Ng                    with value 0.000180
Golub - NTrefethen            with value 0.000180
Heath - Meyer                 with value 0.000179
Bjorck - VanDooren             with value 0.000179
Kagstrom - Luk                   with value 0.000179
Gilbert - Eisenstat             with value 0.000179
Park - VanDooren             with value 0.000179
Schreiber - Bunch                 with value 0.000178
```

## PageRank with alpha = 0.85

PageRank is typically used for finding the most important links on a web-page. However, it can just as easy be used to determine which links are most likely to form in the future! We compute the PageRank value for each node, where it "teleports" back to itself as a link-prediction measure.

Alpha = 0.85 is the standard value of alpha in PageRank. Let's see what predictions we get with this method. The output is a non-symmetric matrix and so we'll symmetrize it by taking the best prediction for each user.

```fprintf('\nPageRank with alpha = 0.85\n');
alpha = 0.85;
symmax = @(X) max(X,X');
showpreds(symmax(inv(eye(n) - alpha*B*diag(1./sum(B,2)))),A,25,name)
```
```PageRank with alpha = 0.85
Crevelli - Ipsen                 with value 0.874633
Kincaid - Varga                 with value 0.866249
Wold - Kagstrom              with value 0.699082
Golub - Modersitzki           with value 0.661777
Golub - Davis                 with value 0.579365
Gutknecht - Starke                with value 0.564489
Widlund - Boman                 with value 0.526378
He - Gragg                 with value 0.520894
Golub - Smith                 with value 0.512122
Varga - Hochbruck             with value 0.508966
Schreiber - Boman                 with value 0.490197
Davis - Demmel                with value 0.483887
TChan - MuntheKaas            with value 0.478409
MuntheKaas - Demmel                with value 0.451077
Golub - Ashby                 with value 0.447502
Modersitzki - Reichel               with value 0.446283
Golub - Szyld                 with value 0.434651
Golub - Harrod                with value 0.431189
Golub - Marek                 with value 0.420211
Marek - Widlund               with value 0.412368
Golub - Benzi                 with value 0.406861
Golub - Boley                 with value 0.405866
Golub - Starke                with value 0.403452
Varga - Szyld                 with value 0.401111
Golub - Hochbruck             with value 0.395218
```

## PageRank with alpha = 0.5

```fprintf('\nPageRank with alpha = 0.5\n');
alpha = 0.5;
symmax = @(X) max(X,X');
showpreds(symmax(inv(eye(n) - alpha*B*diag(1./sum(B,2)))),A,25,name)
```
```PageRank with alpha = 0.5
Crevelli - Ipsen                 with value 0.159729
Kincaid - Varga                 with value 0.157497
Wold - Kagstrom              with value 0.149887
Golub - Modersitzki           with value 0.110927
Gutknecht - Starke                with value 0.108330
Widlund - Boman                 with value 0.103562
Golub - Davis                 with value 0.101269
Schreiber - Boman                 with value 0.099594
Modersitzki - Reichel               with value 0.098967
Davis - Demmel                with value 0.098850
Varga - Hochbruck             with value 0.098413
He - Gragg                 with value 0.094472
TChan - MuntheKaas            with value 0.084800
MuntheKaas - Demmel                with value 0.082447
Marek - Widlund               with value 0.078260
MuntheKaas - Duff                  with value 0.076992
Varga - Szyld                 with value 0.076467
Golub - Smith                 with value 0.072827
Pothen - Demmel                with value 0.066630
Cullum - Demmel                with value 0.065399
Strakos - Demmel                with value 0.065399
Kenney - Paige                 with value 0.063973
Chandrasekaran - Jessup                with value 0.063891
Cullum - Hammarling            with value 0.062903
Strakos - Hammarling            with value 0.062903
```

## PageRank with alpha = 0.99

For a while, people were convinced that solving PageRank with alpha ~ 1 was the right thing to do as it kept the results as close to the graph as possible. For web-graphs, this is wrong as Vigna and his collaborators showed. However, for graphs that are strongly connected, it's still interesting. Let's see what we get!

```fprintf('\nPageRank with alpha = 0.99\n');
alpha = 0.99;
symmax = @(X) max(X,X');
showpreds(symmax(inv(eye(n) - alpha*B*diag(1./sum(B,2)))),A,25,name)
```
```PageRank with alpha = 0.99
Golub - Modersitzki           with value 7.740538
Golub - Davis                 with value 7.551120
Golub - Smith                 with value 7.516129
Golub - Szyld                 with value 7.425185
Golub - Ashby                 with value 7.411030
Golub - Marek                 with value 7.408324
Golub - Starke                with value 7.373870
Golub - Hochbruck             with value 7.355581
Golub - Harrod                with value 7.329255
Golub - Young                 with value 7.320046
Golub - Boley                 with value 7.316955
Golub - Nagy                  with value 7.284151
Golub - Pan                   with value 7.284151
Golub - Benzi                 with value 7.269946
Golub - Hansen                with value 7.248056
Golub - Kincaid               with value 7.246845
Golub - Funderlic             with value 7.240728
Golub - Gu                    with value 7.235458
Golub - Paige                 with value 7.230160
Golub - Park                  with value 7.222309
Golub - NTrefethen            with value 7.219603
Golub - Bjorstad              with value 7.202310
Golub - VanHuffel             with value 7.189794
Golub - Ruhe                  with value 7.182335
Golub - Liu                   with value 7.161991
```

## The matrix exponential

Benzi, Higham, and Estrada have written about using the matrix exponential for analyzing social networks. Let's see what we get here.

```fprintf('\nMatrix exponential with adjacency\n');
showpreds(expm(B),A,25,name);
```
```Matrix exponential with adjacency
Golub - Demmel                with value 200.667542
Golub - Schreiber             with value 163.492101
Golub - Hammarling            with value 134.885201
Golub - Paige                 with value 133.287838
Golub - Liu                   with value 107.695220
VanLoan - Heath                 with value 105.109816
VanDooren - Heath                 with value 104.281025
Golub - Moler                 with value 103.008898
VanDooren - Plemmons              with value 96.089354
Golub - Laub                  with value 93.927783
Schreiber - Heath                 with value 93.858705
Golub - Gilbert               with value 93.130294
Plemmons - Demmel                with value 91.019041
VanLoan - Plemmons              with value 89.104026
Golub - Barlow                with value 88.970881
Golub - Ng                    with value 87.220015
Golub - Duff                  with value 87.169014
Eisenstat - Plemmons              with value 87.001184
Golub - Funderlic             with value 86.966110
Luk - Heath                 with value 85.078033
VanDooren - Demmel                with value 81.426481
VanLoan - VanDooren             with value 80.838813
Luk - Plemmons              with value 79.727039
Golub - Harrod                with value 78.072424
Kagstrom - Heath                 with value 77.667173
```

## The matrix exponential with the normalized Laplacian

Based on work with PageRank, and Fan Chung Graham's "Heat-Kernel PageRank" version, I've often wondered how well the matrix exponential with the normalized Laplacian does instead ...

```fprintf('\nMatrix exponential with normalized Laplcian\n');
Dhalf = diag(sqrt(1./sum(B,2)));
L = eye(n) - Dhalf*A*Dhalf;
showpreds(expm(L),A,25,name);
```
```Matrix exponential with normalized Laplcian
Varga - Hochbruck             with value 0.201888
Gutknecht - Starke                with value 0.190164
Crevelli - Ipsen                 with value 0.187998
Kincaid - Varga                 with value 0.187151
Wold - Kagstrom              with value 0.184371
BunseGerstner - Ammar                 with value 0.178577
Marek - Widlund               with value 0.153282
Varga - Szyld                 with value 0.152184
He - Gragg                 with value 0.142192
Cullum - Strakos               with value 0.138646
Widlund - Boman                 with value 0.131688
Chandrasekaran - Jessup                with value 0.118900
Modersitzki - Reichel               with value 0.117852
Smith - Szyld                 with value 0.117272
He - Byers                 with value 0.110082
Bunch - Hansen                with value 0.107937
Bjorstad - Smith                 with value 0.102005
Fierro - OLeary                with value 0.100629
Young - Starke                with value 0.099905
Marek - Young                 with value 0.099319
OLeary - Smith                 with value 0.098833
Park - VanDooren             with value 0.098414
Marek - Starke                with value 0.098340
LeBorne - Fierro                with value 0.097025
Bai - VanHuffel             with value 0.093191
```

## Commute time

The commute time distance between two nodes of a graph is the expected time for a random walk started at one node to first the second node, and then return back to the first node. Pairs of nodes with no edges and small commute time distance are ones that are likely to form edges in the future. Let's see how this measure does!

```fprintf('\nCommute time\n');
K = diag(sum(B,2)) - B;
C = zeros(n,n); Ki = pinv(full(K));
for i=1:n, for j=1:n, C(i,j) = Ki(i,i) + Ki(j,j) - 2*Ki(i,j); end, end
showpreds(-C,A,25,name); % small commute times are likely "new links"
```
```Commute time
Golub - Demmel                with value -0.143612
Golub - Schreiber             with value -0.165561
Golub - Hammarling            with value -0.181163
Schreiber - Heath                 with value -0.217498
VanDooren - Demmel                with value -0.223000
Golub - Moler                 with value -0.230225
Plemmons - Demmel                with value -0.232642
VanDooren - Heath                 with value -0.241182
Golub - Paige                 with value -0.245504
VanDooren - Plemmons              with value -0.252665
VanLoan - Heath                 with value -0.256789
Schreiber - Hammarling            with value -0.257972
VanLoan - VanDooren             with value -0.258443
Schreiber - VanDooren             with value -0.259069
Hammarling - Heath                 with value -0.259965
Golub - Gragg                 with value -0.260218
Schreiber - Plemmons              with value -0.262964
Moler - Demmel                with value -0.265665
TChan - Demmel                with value -0.276060
Moler - Heath                 with value -0.277328
VanLoan - Hammarling            with value -0.280051
Hammarling - Plemmons              with value -0.280855
VanLoan - Plemmons              with value -0.288621
George - Schreiber             with value -0.291199
Luk - Demmel                with value -0.293697
```

## Appendix

You need two extra routines for this code:

```type drawit
```
```function drawit
% DRAWIT  Script to plot the graph of coauthors from Householder 93.

% A    is the adjacency matrix.
% prcm is a permutation; any permutation could be substituted.
% name is the vector of people's names.
% xy   is just xy coords of points on the unit circle (and the origin).

Aperm = A(prcm,prcm);
nameperm = name(prcm,:);
nfolks = max(size(A));

clf
gplot(Aperm,xy);
axis off;
axis square;
x = xy(:,1) * 1.05;
y = xy(:,2) * 1.05;
x(1) = .08;         % Put names(1) in the center of the circle.
y(1) = .06;
h = text(x,y,nameperm);

for k = 2:nfolks    % Shrink the font for the other names, and rotate them.
set(h(k),'fontsize',10);
set(h(k),'rotation',180/pi*atan2(y(k),x(k)));
end;

```
```type showpreds
```
```function showpreds(P,G,k,names)
n = size(P,1);
P = P - spones(G).*P; % remove all the given links
[i,j,v] = find(triu(P,1)); % remove any self-loops
[~,p] = sort(v,'descend');
cnt = zeros(n,1);
cur=0;
Preds = zeros(k,2);
for pi=p(:)'
if cur >= k, break; end
li=i(pi); lj=j(pi);
if cnt(li) > 5 && cnt(lj) > 5, continue; end
fprintf(' %20s - %-20s  with value %f \n', ...
deblank(names(li,:)), names(lj,:), v(pi));
cnt(li) = cnt(li)+1; cnt(lj)=cnt(lj)+1;
cur = cur+1;
Preds(cur,1) = li; Preds(cur,2) = lj;
end
% Display them
Preds = sparse(Preds(:,1),Preds(:,2),1,n,n);