Download (source code in C)
What is PROXIMUS?
PROXIMUS is a non-orthogonal matrix transform based on recursive partitioning of
a dataset depending on the distance of a relation from the dominant pattern.
The dominant pattern is computed as a binary singular vector of the matrix of
relations. It provides several facilities to analyze discrete attributed data including:
- discovering dominant and deviant patterns in the data in a
hierarchical manner
- clustering of data in an error-bounded and physically interpretable form
- finding a concise representation for the data
- isolating signal from noise in a multi-resolution framework
The code provided here can be used both as an application to compute non-orthogonal decomposition
of binary matrices and as a library for applications involving computations on binary datasets.
Content
| Makefile | compiles and builds executable code into file bnd. |
| src/bnd.c | contains the main routine for the application. |
| src/bndsolve.c | library for rank-one approximation and recursive decomposition. |
| src/matrix.c | library for binary matrix operations. |
| src/vector.c | library for discrete vector operations. |
| src/binvector.c | library for binary vector operations. |
| src/system.c | library for memory management. |
| src/io.c | library providing I/O operations for the application. |
| sample/A.mtx | sample input matrix. |
All source files are associated with a header file that can be found in the "include/" directory.
Running PROXIMUS
The application bnd requires name of the file containing the input matrix as its first argument. All other arguments are optional. The general syntax for running the application is as:
% bnd [filename] [optional arguments]
| Argument | Meaning | Default value |
| [filename] | Name of the file containing input matrix | None |
| -a [algorithm] | Algorithm
| (1) | Discrete: Uses a discrete objective function that is equivalent to the cost function of the problem. Algorithm of choice for many instances. |
| (2) | Continuous: Uses a continuous approximation to the cost function of the problem. Might be handy for loose approximations while working with very sparce matrices. |
| Discrete |
| -i [init] | Initialization strategy
| (1) | Allones |
| (2) | Center |
| (3) | Maximum |
| (4) | Partition |
| (5) | GGG |
| (6) | Neighbor |
| (7) | Random-row |
| (8) | Random |
| Random-row |
| -e [epsilon] | Bound on Hamming radius. If default value is used, the application will provide an exact decomposition. Increase this value in order to get a more compact decomposition on the cost of less accuracy in approximation. | 0 |
| -w | Write output vectors If used, presence and pattern
vectors will be written in files [filename].X.out and [filename].Y.out
respectively. | FALSE |
| -c [minclustersize] | Minimum size of a cluster. If this parameter is set to a value c>1, then bnd stops for submatrices having number of rows less than c without checking the stopping criterion. The idea while using this parameter will be "a submatrix that has c rows is small enough to be considered as a good cluster of rows." | 1 |
| -m [logmemsize] | Log2 of the static memory pool to be allocated for the application. Increase this value while working with larger matrices. | 28 (256 MB) |
| -t [logtmpsize] | Log2 of the temporary memory pool to be allocated for the application. Increase this value while working with larger matrices. | 24 (16 MB) |
Input & Output Format
Proximus uses row-major format to store sparse matrices. The input and output formats are consistent with this. In this format, each line contains the ID's of columns corresponding to the nonzero entries of one row. The first line of the file is reserved for declaration of matrix parameters: number of rows, number of columns and number of nonzeros in the specified order, all being integers. A file storing the matrix
Note the following constraints:
- The file has to have exactly n+1 rows, n being the declared number of rows.
- Numbering of columns starts from 0 and goes upto m-1, m being the declared number of columns.
- No column can be repeated in the nonzero list of a particular row.
- The file has to contain nz+3 integers, nz being the declared number of nonzeros.
- The file must not contain any empty rows.
The "sample" directory contains a sample sparse matrix file.
If the application is run with -w option, the resulting pattern (Y) and presence matrices (X) will be written in files with added extensions ".Y.out" and ".X.out" respectively in the same format.The presence matrix (X) will be a sparse binary matrix of size kXm and the pattern matrix(Y) will be of size kXn, k being the number of identified patterns. The approximation matrix will then be X'Y, where X' denotes the transpose of matrix X.
Related Publications
- Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan, "Non-orthogonal decomposition of binary matrices for bounded-error data compression and analysis", ACM Transactions on Mathematical Software (in press), 2005.pdf ps (Detailed description of the software)
- Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan, "Compression, clustering and pattern discovery in very high dimensional discrete-attribute datasets", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 447-461, 2005. pdf ps (Several applications in data mining)
- Jie Chi, Mehmet Koyuturk and Ananth Grama, "CONQUEST: A Distributed Tool for Constructing Summaries of High-Dimensional Discrete-Attributed Datasets", Proc. 4th SIAM Intl. Conf. on Data Mining (SDM 2004), pp. 154-165, 2004. pdf ps (Application in distributed data mining)
- Mehmet Koyuturk and Ananth Grama, "PROXIMUS: A Framework for Analyzing Very High Dimensional Discrete-Attributed Datasets", Proc. Ninth ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD 2003), pp. 147-156, 2003. pdf ps (Application in association rule mining)
- Mehmet Koyuturk, Ananth Grama, and Wojciech Szpankowski, "Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments", IEEE Computational Systems Bioinformatics Conf. Proc. (CSB 2003), pp.575-580, 2003. pdf ps (Application in discovery of regulation patterns in gene expression data)
- Mehmet Koyuturk, Ananth Grama and Naren Ramakrishnan, "Algebraic Techniques for Analysis of Large Discrete-Valued Datasets", Proc. 6th European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD 2002), pp.311-324,2002. pdf ps
Credits
This software was developed by Mehmet Koyuturk and Ananth Grama in colloboration with Naren Ramakrishnan of Virginia Tech at the Parallel & Distributed Systems Lab of Purdue University Computer Science Department.
Errors & Bugs
Please direct any questions and report any errors and bugs that you discover to koyuturk@cs.purdue.edu. We will also appreciate if you share with us any improvement you make on this code or any application you use it in.
|