Our research focuses on the design and analysis of algorithms, in particular, randomized algorithms and probabilistic analysis of algorithms, with applications to theory of distributed computing, network algorithms, and computational biology.

Traditional computer science research has focused primarily on designing, building, analyzing and programming computers. The increasing prominence of large-scale distributed communication networks such as wireless, sensor networks, Internet, and peer-to-peer networks, as well as other large-scale real-world networks such as the Web, social, and biological networks, has now shifted focus to designing, building, analyzing, and operating such networks. Distributed and decentralized computation is key to this research as it leads to scalable algorithms. Another key is coming up with models that capture these networks (which are typically dynamic) and designing algorithms that work well on such models.

The overarching goal of our research group is to identify and solve key theoretical and algorithmic problems arising in networks including distributed communication networks such as wireless, sensor networks, Internet and peer-to-peer networks, as well as other large-scale real-world networks such as the Web, social, and biological networks. Randomness plays an important role in our research and is useful in three different aspects: (a) randomized algorithms, where randomness is used in the design of efficient distributed algorithms; (b) probabilistic analysis, which involves design and analysis of probabilistic algorithms for random graph models of networks; and (c) stochastic modeling and analysis of dynamic networks and processes.

Our major research contributions include the development of the first provably efficient distributed approximation algorithms for many fundamental network optimization problems, design and mathematical analysis of energy-efficient algorithms, and development of efficient randomized protocols for conflict-free access in distributed networks. We have made contributions to the theory and algorithms for large-scale real-world networks, including the World Wide Web, peer-to-peer networks, and ad hoc sensor networks. Our research has led to the development of new and more accurate models for the World Wide Web, provably efficient protocols for peer-to-peer networks, new random-geometric graph techniques for modeling and analyzing sensor networks, and to the understanding of the algorithmic complexity of combinatorial optimization in power-law graphs --- a class of graphs that are now widely used to model real-world networks. Another aspect of this research focuses on network algorithms in computational biology. Our research has led to development of efficient algorithms for protein structure determination and a framework for rigorously analyzing such algorithms. We have also tackled fundamental problems in online algorithms, including the development of a novel information-theoretic framework for analyzing the performance of online algorithms and the design of universal online algorithms.

Current research projects include design and analysis of efficient distributed approximation algorithms, network models and algorithms, energy-efficient algorithms for wireless and sensor networks, distributed quantum computing, and algorithms for protein structure determination. Below I give an overview of our research contributions categorized under the following four areas: (1) distributed algorithms; (2) modeling and algorithmic issues in networks; (3) computational biology; and (4) online algorithms.

We thank the National Science Foundation and the Purdue Research Foundation for their funding support.

Theory of Distributed Algorithms

With the emergence of the Internet and other networking technologies such as peer-to-peer networks, overlay networks, ad hoc wireless and sensor networks, it has become all the more important to design and analyze efficient distributed algorithms for solving various distributed computing problems. Research focus is on developing: (1) distributed approximation algorithms for network optimization; (2) energy-efficient distributed algorithms; and (3) randomized distributed algorithms for conflict-free access.

Distributed Approximation Algorithms

Distributed approximation is an emerging research area at the intersection of two well-established theoretical computer science areas: distributed computing and approximation algorithms. Distributed approximation algorithms tradeoff optimality of the solution for the amount of resources consumed by the distributed algorithm. With the advent of resource-constrained networks such as sensor and peer-to-peer networks it is critical to design efficient distributed algorithms for network optimization problems that have low communication (message) and time complexity even possibly at the cost of a reduced quality of solution [GP9]. Our research has resulted in the design of the first, almost time-optimal distributed approximation algorithms for a number of fundamental network optimization problems including minimum spanning tree (MST), minimum Steiner tree, generalized Steiner forest, and the shortest paths problem.

While distributed (exact) algorithms for MST and shortest paths have been studied extensively for more than two decades, till recently, relatively less progress has been made in design and analysis of distributed approximation algorithms for these problems (e.g., see Elkin's survey). In a joint work with Ph.D. student Maleq Khan [GP3], we gave the first, fast distributed approximation algorithm for MST that constructs an O(log n)-approximate minimum spanning tree in arbitrary networks (n is the number of nodes in the network). This work won the best student paper award in the DISC 2006 conference and was one of the four invited papers selected by the DISC program committee that appeared in the Distributed Computing special issue dedicated to DISC 20th anniversary. Our algorithm, unlike the previous known distributed MST approximation algorithm of Elkin, ran in time that is independent of edge weights. Further its time complexity was shown to be existentially optimal within polylogarithmic factors (i.e., no other O(log n)-approximate distributed algorithm can in general do better). Our algorithm runs in time O(D + L log n), where L is a parameter called the local shortest path diameter of the graph and D is the network diameter. This time bound is better because L can be much smaller than sqrt(n), which was shown to be essentially a lower bound on distributed (exact) MST computation (Peleg and Rabinovich). A significant implication of our result is that there can be a large time gap between exact and approximate distributed MST computation: we show that there exists graphs in which the running time of our approximation algorithm is exponentially faster than the time-optimal distributed MST algorithm. L is small for important classes of networks such as unit disk graphs (which are popular models of ad hoc wireless networks) and a very general class of random weighted networks (which can model power-law networks such as the Internet) with high probability. This gives a simple and local O(log n)-approximate MST algorithm in the above graphs that runs in almost optimal O(D + log^2 n) time.

Our DISC 2006 paper [GP3] introduced a simple distributed scheme called the nearest neighbor tree (NNT) scheme that has been subsequently useful in a number of other important applications. (The NNT scheme is closely related to the classic approximation algorithm for the traveling salesman problem due to Rosenkrantz, Stearns, and Lewis.) We have used this scheme to: (a) design message-efficient distributed approximation algorithms for finding low-weight higher-order connected spanning subgraphs in complete networks [GP6] ; (b) design energy-efficient distributed MST algorithms in wireless networks [ GP4, GP2]; and (c) develop fault-tolerant topology control algorithms in ad hoc networks [GP18].

Developing uniform approaches to design efficient distributed approximation algorithms for a wide variety of problems is an important goal. The NNT scheme is a useful approach to designing O(log n)-approximation algorithms for the minimum Steiner tree as well [ GP3,GP9]. However, NNT scheme does not give a O(log n)-approximation when applied to the generalized Steiner forest (GSF) problem, an important generalization of Steiner tree problem. This motivated our subsequent PODC 2008 paper, co-authored with Maleq Khan, Fabian Kuhn, Dahlia Malkhi and Kunal Talwar [GP1], which develops an uniform approach based on the technique of probabilistic tree embeddings to design the first almost time-optimal distributed O(log n)-approximation algorithms for various network optimization problems including the generalized Steiner forest problem, the shortest paths problem, and the optimum routing cost tree problem. Our approach is randomized and based on a probabilistic tree embedding due to Fakcharoenphol, Rao, and Talwar (FRT embedding). We show how to efficiently compute an (implicit) FRT embedding in a decentralized manner and how to use the embedding to obtain expected O(log n)-approximate distributed algorithms for various problems. The distributed construction of the FRT embedding is based on our distributed algorithm for the computation of least elements (LE) lists (first used by Cohen), a distributed data structure that is useful in many applications (Cohen, Cohen and Kaplan). A significant byproduct of our LE lists algorithm is an improved leader election algorithm in synchronous networks that is both time optimal and almost message optimal, thus partially answering an important open problem raised by Peleg in 1990. Peleg gave a leader election algorithm that takes O(D) time and O(D|E|) messages (|E| is the number of edges in the network) and raised an open question of whether there exists a leader election algorithm that achieves both optimal message complexity of O(|E|+ n \log n) and optimal time complexity of O(D). Our algorithm takes O(D) time (deterministically) and O(|E| min(D,log n)) messages with high probability [GP1].

Energy-Efficient Distributed Algorithms

Designing energy-efficient algorithms has become an important task for energy-constrained wireless and sensor networks. Another facet of my research has focused on designing energy-efficient distributed algorithms and developing a framework for rigorous analysis of such algorithms. In addition to the traditional complexity measures of time and number of messages, we have developed an algorithmic theory that addresses the energy complexity which accounts for the total energy associated with the messages exchanged among the nodes in a distributed algorithm.

Our research has focused on developing energy-efficient distributed algorithms for constructing exact and approximate Euclidean minimum spanning trees. In an IEEE Trans. Paral. Dist. Syst. paper with Ph.D student Maleq Khan and Anil Kumar [GP4], we design the first energy-efficient algorithms for constructing an approximate MST in a random geometric graph (nodes are distributed randomly in a unit square). Based on the NNT scheme, we give two distributed approximation algorithms: (1) a randomized algorithm, called Random-NNT, that gives a O(log n) approximation to the MST and takes O(log n) energy; (2) a deterministic algorithm, called Coordinate-NNT, that gives a O(1) approximation and takes O(1) energy which is the best possible; this algorithm requires that each node knows its own coordinates, unlike Random-NNT.

Algorithms that have optimal message complexity are not necessarily energy optimal. In [GP4], we show that the classical message-optimal distributed (exact) MST algorithm of Gallager et al. requires at least Omega(log^2 n) energy in a random geometric graph. Our work also raised a key question of whether there exists a distributed (exact) MST algorithm of O(log n) energy complexity in a random geometric graph. In a subsequent SPAA 2008 paper [GP2], we answer this in the affirmative. Furthermore, we show that this is the best possible, if nodes do not have coordinate information, by showing a matching lower bound of Omega(log n) on the energy complexity. We also extend our results to arbitrary node distributions.

We have also worked on developing efficient distributed randomized algorithms for the important problem of computing aggregate functions (such as average, minimum, maximum, sum). There has been a lot of recent theoretical research in developing distributed algorithms for aggregate computation. In a joint work with Ph.D. student Jen-Yeu Chen and Dongyan Xu, we developed a simple randomized local algorithm whose running time was shown to depend on the eigen-structure of the underlying graph [GP7]. Our algorithm is more energy-efficient compared to the previous schemes of Boyd et al. and Kempe et al.

Recently, with Ph.D. student Jen-Yeu Chen, we have developed improved gossip-based algorithms for aggregate computation [GP8] which are more efficient compared to previous algorithms (Kempe et al., Kashyap et al.).

Randomized Algorithms for Conflict-Free Distributed Access

In many distributed settings data and resources are shared and there are constraints on their simultaneous access by participating nodes. A fundamental goal in such a setting is to design local distributed contention-resolution protocols so that all the competing nodes can access the shared resources in a coordinated manner. We study the following distributed access problem which naturally arises in many settings: given a set of data items shared among nodes in a distributed network, all nodes want to access all (or a subset of) the items residing on different nodes in a contention-free manner. In addition, in certain applications, items may need to move from one processor to the other during access. The goal is to design distributed protocols so that all nodes access all the desired items as quickly as possible, while not overloading the storage space of any one processor. In an Algorithmica 2007 paper co-authored with Ph.D. student Gahyun Park [GP5], we analyze a simple and natural, randomized, distributed contention-resolution protocol for the above problem. Our analysis involves a stochastic analysis of a ``balls into bins'' problem in a dynamic setting where balls (data items) move into bins (nodes) on request and we study the time and storage requirements to move all the balls to the requested bins. We give almost tight bounds on the performance of the protocol; they show that the protocol performs almost as well as an optimal (centralized) scheme but with no coordination overhead.

[GP1] M. Khan, F. Kuhn, D. Malkhi, G. Pandurangan, and K. Talwar. Efficient Distributed Approximation Algorithms via Probabilistic Tree Embeddings, to appear in Proceedings of the 27th Annual ACM Symposium on Principles of Distributed Computing (PODC) , August, 2008. pdf

[GP2] Y. Choi, M. Khan, V. S.A Kumar, and G. Pandurangan. Energy-Optimal Distributed Minimum Spanning Tree Algorithms, in Proceedings of the 20th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2008, 188-190. pdf

[GP3] M. Khan and G. Pandurangan. A Fast Distributed Approximation Algorithm for Minimum Spanning Trees, Distributed Computing, 20, 2008, 391-402, Invited paper. Conference version: 20th International Symposium on Distributed Computing (DISC) , Springer-Verlag, LNCS 4167, 2006, 355-369. Best Student Paper Award. pdf

[GP4] M. Khan, G. Pandurangan, and V.S.A Kumar. Distributed Algorithms for Constructing Approximate Minimum Spanning Trees in Wireless Networks, IEEE Transactions on Parallel and Distributed Systems, 2008, accepted for publication (in press). pdf

[GP5] G. Pandurangan and G. Park. Analysis of Randomized Protocols for Conflict-Free Distributed Access, Algorithmica, 49(2), 2007, 109-126. Conference version: Proceedings of the 24th Annual ACM Symposium on Principles of Distributed Computing (PODC), 2005, 274. pdf

[GP6] M. Khan, G. Pandurangan, and V.S.A Kumar. A Simple Randomized Scheme for Constructing Low-Weight $k$-Connected Spanning Subgraphs with Applications to Distributed Algorithms, Theoretical Computer Science, 385(1-3), 2007, 101-114. pdf

[GP7] J. Chen, G. Pandurangan, and D. Xu. Robust Computation of Aggregates in Wireless Sensor Networks: Distributed Randomized Algorithms and Analysis, IEEE Transactions on Parallel and Distributed Systems, 17(9), 2006, 987-1000. Conference version: Proceedings of the Fourth International Conference on Information Processing in Sensor Networks (IPSN), 2005, 348-355. pdf

[GP8] J. Chen and G. Pandurangan. Optimal Gossip-based Aggregate Computation, submitted. pdf.

[GP9] G. Pandurangan and M. Khan. Theory of Communication Networks, Algorithms and Theory of Computation Handbook, Second Edition, Edited by M.J. Atallah and M. Blanton, CRC Press, forthcoming. pdf.

[GP10] V. Denchev and G. Pandurangan. Distributed Quantum Computing: A New Frontier in Distributed Systems or Science Fiction?, to appear in ACM SIGACT News, September 2008. pdf

Modeling and Algorithmic Issues in Networks

Real-world networks such as peer-to-peer networks, World Wide Web, and sensor networks have underlying statistical properties which naturally arise in their deployment, evolution, operation, or maintenance. My approach uses stochastic processes and random graphs to model such networks. This not only gives insight into the real-world properties of these networks, but also is helpful in designing efficient algorithms for the associated algorithmic problems.

Analysis of Random Geometric Graph Models for Sensor Networks

Random geometric graphs have emerged as a popular model for ad hoc sensor networks. The aim is to study key properties of these random graphs motivated by engineering ad hoc sensor networks and show bounds on the thresholds for emergence of these properties. Computing such thresholds precisely is very useful in correctly and efficiently engineering ad hoc sensor networks to exhibit desirable properties such as connectivity, coverage, and low routing stretch. In a joint work with S. Muthukrishnan (SODA 2005), we have developed an approach called bin covering to determine the thresholds of these properties of random geometric graphs [GP13] . Our technique shows a way to obtain precise thresholds in a unified and simple manner. Typically, in the past, geometric random graph analyses involved sophisticated methods from continuum percolation theory; in contrast, our bin-covering approach is discrete and very simple, yet it gives us tight threshold bounds. The technique also yields algorithmic benefits as illustrated by a simple local routing algorithm for finding paths with low stretch.

Theory of Peer-to-Peer (P2P) Networks

P2P computing has become an important paradigm over the past few years due to their many advantages: decentralized search, sharing data and resources, and server-less computing. A fundamental problem is to build connected, low-diameter P2P networks. A P2P network is highly dynamic with nodes independently joining and leaving over time. Maintaining even connectivity becomes non-trivial in such a dynamic and decentralized setting. In a paper published in FOCS 2001 [GP16], we give a local protocol for participants to build P2P networks in a distributed fashion and prove (under a reasonable stochastic model) that it results in connected networks of {\em constant} degree and logarithmic diameter. This was the first protocol with provable guarantees on connectivity and diameter under a realistic dynamic setting. This has become one of the most cited theoretical papers in P2P networks (over 200 citations in Google Scholar). Our protocol is not fully decentralized --- it assumes that nodes have access to a small number of node addresses that are available through a centralized server. A number of subsequent papers have improved on our work and devised fully decentralized protocols which guarantee similar properties (e.g., Cooper et al.).

The above protocol is for building unstructured P2P networks. On the other hand, structured P2P networks, based on a Distributed Hash Table (DHT), creates a fully decentralized index that maps file identifiers to peers and allows a peer to search for a file very efficiently without flooding. We have also devised a simple protocol [JP08] for building structured P2P networks that operate efficiently in a highly dynamic setting (``churn-tolerant"). Such protocols can be useful in deployment of DHT networks.

Web Measurement and Modeling

Developing accurate Web models is important for at least two reasons: (1) to develop a deeper and accurate understanding of the evolution and structure of the Web; (2) to use the developed models to design and analyze better algorithms for search and mining. Prior modeling has dwelt with fitting models to the observed degree distribution of the Web; this is relying not only on a single set of parameters, but also on a ``local'' property of the Web. Our approach ( [GP14]) to modeling the Web is by studying the PageRank distribution of the Web; we show that it follows a power law remarkably similar to the in-degree distribution. We then develop detailed models to explain this observation, and moreover remain faithful to the observed degree distributions. We analyze these models, and compare the analysis to both snapshots from the Web and to graphs generated by simulations on the new models. This was the first work that goes beyond fitting degree distributions. Our work has been extensively cited (over 100 citations in Google Scholar) and also discussed in the book by Baldi et al. Further, the study of the correlations between the PageRank and the in-degree values of both the Web snapshots and our models sheds insight into the evolution of the Web. Our paper also studies the PageRank based model (also independently studied by Drinea et al) for the evolution of the Web, where nodes connect to other nodes with probability proportional to the PageRank. This model can be considered as a precursor to the notion of ``search-engine bias" that has since attracted a lot of attention (e.g., Fortunato et al.).

Algorithmic Complexity of Power-law Graphs

Our research has also focused on understanding the algorithmic complexity of important optimization problems in power-law graphs. The motivation for this research is the discovery that the degree sequences of many large-scale real-world graphs ranging from WWW, Internet, to social and biological networks exhibit a power-law distribution. In such networks, the number of nodes y of a given degree x is proportional to x^{-beta} where beta> 0 is a constant that depends on the application domain. There is practical evidence that optimization in power-law graphs is easier than in general graphs, prompting the basic question: Is combinatorial optimization over power-law graphs easy? Does the answer depend on the power-law exponent beta? In a Theoretical Computer Science 2007 paper [GP11], co-authored with Alessandro Ferrante (visiting Ph.D. student from U. of Salerno, Italy) and Kihong Park, we show that many classical NP-hard graph-theoretic optimization problems remain NP-hard on power-law graphs for certain values of beta. In particular, we show that classical NP-hard problems, such as clique and coloring, remain NP-hard for all beta >= 1. Moreover, we show that all the problems that satisfy the so-called ``optimal substructure property'' remain NP-hard for all beta > 0. These include classical problems such as minimum vertex-cover, maximum independent-set, and minimum dominating-set. Our hardness proofs involve designing efficient algorithms for constructing graphs with prescribed degree sequences that are tractable with respect to various optimization problems.

Our work initiates a systematic study of complexity of combinatorial optimization in graphs with prescribed degree sequences. Our results, in conjunction with recent experimental evidence, suggest that real-world power-law graphs may not be ``worst-case'' instances of power-law graphs, but rather typical instances, and may be well-modeled by power-law random graph models. A main ingredient in our proof is a technical lemma that guarantees that any arbitrary graph can be ``embedded'' in a suitably large (but polynomial in the size of the given graph) graph that conforms to the prescribed power-law degree sequence. This lemma may be of independent interest and can have other applications as well e.g., in showing hardness of approximation of problems in power-law graphs.

Evaluating QoS Properties in Statistically Multiplexed Networks

We have worked on the problem of accurately evaluating Quality-of-Service (QoS) parameters in statistically multiplexed communication networks (e.g., ATM). This is a key problem that arises in the design of routing protocols in multiplexed networks. Our work was motivated by the inaccurate way of estimating QoS properties (such as overflow probability) in current ATM network protocols. We model a communication request by a simple stochastic source: an on-off source. This models bursty traffic -- common in the Internet. Our STOC paper [GP17] gives an efficient Monte-Carlo method for estimating the failure probability of a general communication network. This method is particularly useful in a dynamic setting in which communication requests are dynamically added and eliminated from the system. The amortized cost in our solution of updating the estimate after each change is proportional to the fraction of links involved in the change rather than to the total number of links in the network.

[GP10] G. Pandurangan and S. Jagannathan. A Simple Churn-Tolerant Peer-to-Peer Scheme, submitted. pdf.

[GP11] A. Ferrante, G. Pandurangan, and K. Park. On the Hardness of Optimization in Power-Law Graphs, Theoretical Computer Science, 393, 2008, 220-230. Conference version: Proceedings of the 13th Annual International Computing and Combinatorics Conference (COCOON), Banff, Canada, Springer-Verlag, LNCS 4598, 2007, 417-427. pdf

[GP12] P. Drineas, A. Javed, M. Ismail, G. Pandurangan, R. Virrankoski, and A. Savvides. Distance Matrix Reconstruction from Incomplete Information for Sensor Network Localization, in Third Annual IEEE Conference on Sensor, Mesh, and Ad Hoc Communications and Networks (SECON), 2006, 536-544. pdf

[GP13] S. Muthukrishnan and G. Pandurangan. The Bin-Covering Technique for Thresholding Random Geometric Graph Properties, ACM-SIAM Symposium on Discete Algorithms (SODA), 2005, 989-998. pdf

[GP14] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to Characterize Web Structure, Internet Mathematics, 3(1), 2006, 1-20. Conference version: Proceedings of the 8th Annual International Conference on Combinatorics and Computing (COCOON), Springer-Verlag LNCS 2387, 2002, 330-339. pdf.

[GP15] T. Czajka and G. Pandurangan. Improved Random Graph Isomorphism, Journal of Discrete Algorithms, 6, 2008, 85-92. pdf

[GP16] G. Pandurangan, P. Raghavan, and E. Upfal. Building Low-Diameter Peer-to-Peer Networks, IEEE Journal on Selected Areas in Communications (JSAC), 21(6), 2003, 995-1002. Conference version: Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2001, 492-499. pdf

[GP17] G. Pandurangan and E. Upfal. Static and Dynamic Evaluation of QoS Properties, Journal of Interconnection Networks, 1(2), 2000, 135-150. Conference version: Proceedings of the ACM Symposium on Theory of Computing (STOC), 1999, 566-573. pdf

[GP18] V.S.A Kumar, M. Khan, M. Marathe, G. Pandurangan, and S.S. Ravi. Topology Control in Ad hoc Networks Under Stochastic Node Failures, submitted.

Computational Biology

Our research focus in computational biology is to develop provably efficient and practical algorithms for solving algorithmic problems arising in protein structure determination. Using techniques from random graph theory and probabilistic algorithms, the goal is to model and understand protein structure discovery and consequently to design and analyze algorithms for determining protein structure with rigorous guarantees. We have a developed a novel approach based on random graph models to protein structure determination via Nuclear magnetic resonance (NMR) spectroscopy. Our approach, for the first time, gives a rigorous framework for analyzing algorithms for protein structure determination via NMR.

NMR spectroscopy enables scientists to study protein structures, dynamics, and interactions in solution. These studies in turn provide valuable insights into the mechanisms by which proteins function, and the means of controlling these mechanisms (e.g., drug design). Graph representations and graph-theoretic problems underlie these applications. A contact graph for a protein structure has vertices for the individual amino acid residues in the protein and edges between ``close'' pairs. An interaction graph for NMR data has vertices for NMR-probed ``pseudoresidues'' (corresponding via an unknown mapping to the real residues), and edges between residue pairs that NMR data suggest might be interacting. The NMR edges are essentially the contact edges, significantly corrupted by experimental noise and ambiguity. A key problem that has to be solved in NMR studies is the assignment problem: identify the vertex correspondence between two graphs --- a NMR interaction graph (representing experimental data) and a contact graph (representing the structure). Our research has led to efficient algorithms for solving various versions of this problem (depending on the NMR experiment setup); these results have appeared in top computational biology conferences and journals: RECOMB 2004 and Jour. Comp. Bio. [GP22], Bioinformatics 2006 [GP20], ISMB 2008 and Bioinformatics 2008 [GP19].

The algorithmic problem underlying assignment is subgraph isomorphism and its variants and is computationally hard to solve. The NMR interaction graph can be viewed as a corrupted version of the contact graph, with an unknown vertex correspondence. Our approach is to use a novel random graph model to suitably model the noisy NMR interaction graph. The noisy edges in an NMR interaction graph are not randomly distributed, but are correlated, and we have developed a random graph model (different from the traditional G(n,p) random graph model) that properly captures the correlation structure among noise edges. In joint research with Chris Bailey-Kellogg and our graduate students [GP22, GP20, GP19], we develop and analyze efficient algorithms for the assignment problem. A key idea underlying these algorithms is my efficient randomized algorithm based on random walks for finding 2-factors and long paths in sparse random graphs [GP21]. We analyze the performance of our algorithms in our random graph model and show that it runs in expected polynomial time. A significant result is that our analysis also gives a threshold for efficient performance: the running time is polynomial only if the percentage of noisy edges is below a certain threshold. This threshold predicted by theory is validated by experimental results, which is a strong evidence towards our model.

Other Algorithmic Problems

We have worked in two well-known algorithmic problems in computational molecular biology: the restriction mapping problem and the de novo peptide sequencing problem. The restriction mapping problem is to locate the restriction sites of a given enzyme on a DNA molecule. In an invited paper in Jour. Comp. Sys. Sciences [GP23], we give the first polynomial-time algorithm for the restriction mapping problem. Our algorithm is also robust in dealing with laboratory errors. The De novo peptide sequencing problem arises in the reconstruction of the peptide sequence of a protein molecule. In the same paper [GP23], we give a simple and efficient algorithm for the problem which improves upon the previous result of Chen et al (SODA 2000).

[GP19] F. Xiong, G. Pandurangan, and C. Bailey-Kellogg. Contact Replacement for NMR Resonance Assignment, in Proceedings of the The 16th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), 2008. Journal version in Bioinformatics, 24(13), 2008, i205-i213. pdf.

[GP20] H. Kamichetty, C. Bailey-Kellogg, and G. Pandurangan. An Efficient Randomized Algorithm for Contact-Based NMR Backbone Resonance Assignment, Bioinformatics, 22(2), 2006, 172-180. pdf

[GP21] G. Pandurangan. On a Simple Randomized Algorithm for Finding a 2-Factor in Sparse Graphs, Information Processing Letters, 95(1), 2005, 321-327. pdf

[GP22] C. Bailey-Kellogg, S. Chainraj, and G. Pandurangan. A Random Graph Approach to NMR Sequential Assignment, Journal of Computational Biology, 12(6-7), 2005, 569-583. Invited paper. Conference version: Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2004, 58-67. pdf

[GP23] G. Pandurangan and H. Ramesh. The Restriction Mapping Problem Revisited, Journal of Computer and System Sciences, 65, 2002, 526-544. Invited paper for a special issue in computational biology. pdf

Online Algorithms

In an online setting, the input is continuously injected to the system, and the algorithm typically has to take some action (from a given set of actions) immediately after seeing the current input, without knowledge of future inputs. Each action may result in instantaneous losses at that time step. Online prefetching problem is a classical online problem which is similar to the classical caching (or demand paging) problem except that k pages can be prefetched into the cache (of size k) before every page request. Thus, unlike caching (where pages are fetched only on demand), in prefetching the loss function is memoryless, i.e., does not depend on previous action-input pairs, since before every request k (possibly new) pages can be prefetched without incurring penalty.

Our research addresses two fundamental questions: (1) How to use information-theoretic bounds to characterize performance of online algorithms? (2) How to design efficient universal online algorithms for online problems with memory loss functions (e.g., caching, where the loss depends on previous action-input pairs) ?

Using Entropy to Characterize Performance of Online Algorithms

Online algorithms have been studied extensively for the last twenty years in the theoretical computer science community mainly using the competitive analysis framework. Competitive analysis compares the performance of different algorithms, but it gives no information about the actual gain from using them. A basic theoretical question is identifying appropriate information-theoretic parameters that can accurately capture performance of online algorithms. This will be useful in allocating resources more effectively in dynamic online systems. For example, consider a malleable cache architecture where the cache can be dynamically partitioned between different data streams. A data stream that can make better use of a larger cache is assigned more space, while a stream with very little structure or repeats is allocated a smaller cache space. Efficient utilization of this technology requires a mechanism for predicting caching gain for a given data stream.

We have developed a new approach to quantify the performance of certain online algorithms based on entropy --- an intrinsic characteristic --- of the request sequence (ACM Trans. on Algorithms 2007) [GP25]. This approach complements the rich theory of competitive analysis and gives information on the actual gain in using the online algorithm. Assuming a very general stochastic input model, we showed non-trivial bounds, based on entropy rate of the source, on the performance of the best online algorithm for classical online problems such as prefetching, caching, and list accessing. Since entropy is a measure of randomness of the request sequence, it is reasonable to expect it to characterize the ``intrinsic bottleneck" in the performance of any online algorithm. In particular, for prefetching we show that the average fault-rate of the best online algorithm is a linear function of H, the entropy of the input source. These entropy-based bounds have been shown to be useful in understanding the performance of online algorithms in real applications. For example, Fonseca et al. use our bounds to explain the strong correlation between entropy and the performance of LRU algorithm in Web caches. For online problems which involve loss functions with memory we show that entropy does not fully capture performance in the following sense: two different input sources with the same entropy can have very different performance guarantees (unlike prefetching).

Universal Online Algorithms

A basic algorithmic problem in online computation is to design universal algorithms, i.e., algorithms that work well without knowledge of the input model. In many applications, no a priori knowledge of the source characteristics is available and statistical tests are either impossible, unreliable, or too costly. Universal algorithms can overcome these difficulties and indeed have been shown to be very useful in important practical applications, e.g., prefetching and data compression.

There is an extensive body of work on universal algorithms for online problems with memoryless loss functions (where the loss/penalty depends only on the current input and action) such as prediction, prefetching, and lossless data compression. However, designing such algorithms has been very challenging for online problems with memory loss functions including important problems such as caching (demand paging). Most previous works have devised online caching algorithms for specific classes of distributions with the assumption that the algorithm has full knowledge of the source. For example, Karlin, Phillips, and Raghavan present efficient algorithms when the request sequence is generated by a Markov chain under the assumption that the online algorithm has complete knowledge of the Markov chain.

Our Algorithmica 2008 paper [GP24], gives a general technique for designing universal online caching algorithms. The technique gives a way to ``convert" an algorithm that is not universal to begin with into one which is universal. We illustrate this technique by showing how to convert the Dominating Distribution (DOM) algorithm of Lund et al. into a universal caching algorithm. Our algorithm is universal for the class of mixing sources and gives an expected performance that is within 4 +o(1) times the optimal online algorithm (which has full knowledge of the input model and can use unbounded resources). Mixing sources includes powerful models such as Markov sources but is less general than stationary ergodic sources.

[GP24] G. Pandurangan and W. Szpankowski. A Universal Online Caching Algorithm Based on Pattern Matching, Algorithmica, 2008, accepted for publication (in press). Conference version: Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2005, 1151-1155. pdf

[GP25] G. Pandurangan and E. Upfal. Entropy-based Bounds for Online Algorithms, ACM Transactions on Algorithms, 3(1), 2007. (19 pages) Conference version: Can Entropy Characterize Performance of Online Algorithms ? Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) 2001, 727-734. pdf

[GP26] M. Hauskrecht, G. Pandurangan, and E. Upfal. Computing Near-Optimal Strategies for Stochastic Investment Planning Problems, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, 1999, 1310-1315. pdf

References

P. Baldi, P. Frasconi, and P. Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms, John Wiley, 2003.

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip algorithms: Design, analysis, and applications, Proceedings of the IEEE INFOCOM, 2005.

T. Chen, M. Kao, M. Tepel, J. Rush and G.M. Church. A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry. Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 389-398, 2000.

E. Cohen. Size-Estimation Framework with Applications to Transitive Closure and Reachability, Journal of Computer and System Sciences, 55(3), 1997, 441-453.

E. Cohen and H. Kaplan. Spatially-Decaying Aggregation Over a Network: Model and Algorithms, Journal of Computer and System Sciences, 73(3), 2007, 265-288.

C. Cooper, M. Dyer and C. Greenhill. Sampling regular graphs and a peer-to-peer network, Proc. of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , New York--Philadelphia (2005), pp. 980-988.

E. Drinea, M. Enachescu, and M. Mitzenmacher. Variations on Random Graph Models for the Web, Harvard Technical Report TR-06-01, 2001.

M. Elkin. Distributed approximation -- A survey, ACM SIGACT News, 35(4), 2004, 40-57.

M. Elkin. Unconditional Lower Bounds on the Time-Approximation Tradeoffs for the Distributed Minimum Spanning Tree Problem, Proceedings of Symposium on Theory of Computing (STOC) , 2004.

J. Fakcharoenphol, S. Rao, and K. Talwar. A Tight Bound on Approximating Arbitrary Metrics by Tree Metrics, Journal of Computer and System Sciences, 69(3), 2004, 485-497.

R. Fonseca, V. Almeida, and M. Crovella. Localilty in a Web of Streams, Comunications of the ACM , 48(1), 2005, 82-88. Conference version with B. Abrahao in Proceedings of the IEEE INFOCOM, 2003.

S. Fortunato, A. Flammini, F. Menczer, and A. Vespignani. Topical interests and the mitigation of search engine bias, Proceedings of the National Academy of Sciences (PNAS), 103(34), 12684-12689, 2006.

R. Gallager, P. Humblet, and P. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees, ACM Transactions on Programming Languages and Systems, 5(1), 1983, 66-77.

A.R. Karlin, S.J. Phillips, and P. Raghavan. Markov Paging, SIAM Journal on Computing, 30(3), 906-922, 2000.

S. Kashyap, S. Deb, K. Naidu, R. Rastogi, and A. Srinivasan. Efficient gossip-based aggregate computation, Proceedings of the ACM conference on Principles of Database Systems (PODS), 2006.

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based Computation of Aggregate Information, Proc. The 44th Annual IEEE Symp. on Foundations of Computer Science (FOCS), 2003.

C. Lund, S. Phillips, and N. Reingold. Paging against a Distribution and IP Networking, Journal of Computer and System Sciences, 58, 1999, 222-231.

D. Peleg. A time optimal leader election algorithm in general networks. Journal of Parallel and Disributed Computing, 8, 1990, 96--99.

D. Peleg and V. Rabinovich. A near-tight lower bound on the time complexity of distributed MST construction, Proceedings of the 40th IEEE FOCS, 1999, 253--261.

D. Rosenkrantz, R. Stearns, and P. Lewis. An analysis of several heuristics for the traveling salesman problem. SIAM J. Comput., 6(3):563--581, 1977.