Bruno Ribeiro

Assistant Professor of Computer Science

Purdue University

LWSN 2142C
Department of Computer Science
Purdue University
West Lafayette, IN




  1. Pedro Savarese, Mayank Kakodkar, Bruno Ribeiro, From Monte Carlo to Las Vegas: Improving Restricted Boltzmann Machine Training through Stopping Sets, AAAI, 2018 [arXiv:1711.08442] [pdf].
    Abstract []
    We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach {\em Markov Chain Las Vegas} (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and caps the number of Markov chain steps at $K$ (referred to as MCLV-$K$). We present an MCLV-$K$ gradient estimator (LVS-$K$) for RBMs and explore the correspondence and differences between LVS-$K$ and Contrastive Divergence (CD-$K$), with LVS-$K$ significantly outperforming CD-$K$ in training RBMs on the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
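    The stopping-set idea above can be sketched in a few lines. This is an illustrative toy only: `mclv_k` and `walk_step` are hypothetical names, the lazy random walk stands in for an RBM Gibbs chain, and the stopping set is a placeholder rather than one built from training data.

```python
import random

def mclv_k(start, step_fn, stopping_set, K, rng=random):
    """Run a Markov chain from `start` until it enters `stopping_set`
    or the cap of K steps is exhausted (the MCLV-K stopping rule).
    Returns (final_state, steps_taken, hit_stopping_set)."""
    state = start
    for t in range(1, K + 1):
        state = step_fn(state, rng)
        if state in stopping_set:
            return state, t, True
    return state, K, False

# Placeholder dynamics: a lazy random walk on the integers mod 16.
def walk_step(state, rng):
    return (state + rng.choice((-1, 0, 1))) % 16
```

    Runs that reach the stopping set within $K$ steps can support unbiased (Las Vegas) estimates at the cost of a random running time; capped runs fall back to a CD-$K$-style truncation, which is the trade-off the paper analyzes.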
  2. Jason Meng, Chandra S. Mouli, Bruno Ribeiro, Jennifer Neville, Subgraph Pattern Neural Networks for High-Order Graph Evolution Prediction, AAAI, 2018 [src + data][pdf].
    Abstract []
    In this work we generalize traditional node/link prediction tasks in dynamic heterogeneous networks, to consider joint prediction over larger $k$-node induced subgraphs. Our key insight is to incorporate the unavoidable dependencies in the training observations of induced subgraphs into both the input features and the model architecture itself via high-order dependencies. The strength of the representation is its invariance to isomorphisms and varying local neighborhood sizes, while still being able to take node/edge labels into account, and facilitating inductive reasoning (i.e., generalization to unseen portions of the network). Empirical results show that our proposed method significantly outperforms other state-of-the-art methods designed for static and/or single node/link prediction tasks. In addition, we show that our method is scalable and learns interpretable parameters.
  3. Pinghui Wang, Junzhou Zhao, Bruno Ribeiro, John C.S. Lui, Don Towsley, and Xiaohong Guan, Practical Characterization of Large Networks Using Neighborhood Information, Knowledge and Information Systems 2018 (accepted).
    Abstract []
    Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates with no theoretical basis for their accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or utilize the Web APIs available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees on being an unbiased estimator, and it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density estimation and edge label density estimation, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy of state-of-the-art methods.
  1. Fabricio Murai, Diogo Rennó, Bruno Ribeiro, Gisele L. Pappa, Don Towsley, Krista Gile, Selective Harvesting over Networks, Data Mining and Knowledge Discovery, 2017 [arXiv:1703.05082].
    Abstract []
    Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights under a query budget. However, in most networks, nodes, topology and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario suffers from a tunnel vision effect: without recourse to independent sampling, the urge to query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of AS methods and standard classifiers. We find that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as an ensemble, but switching between classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried next. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. From these observations we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, matching or exceeding the performance of competing methods on seven real network datasets in our evaluation.
  2. Jiasen Yang, Bruno Ribeiro, Jennifer Neville, Stochastic Gradient Descent for Relational Logistic Regression via Partial Network Crawls, UAI Seventh International Workshop on Statistical Relational AI, 2017 [arXiv:1703.07716].
    Abstract []
    Research in statistical relational learning has produced a number of methods for learning relational models from large-scale network data. While these methods have been successfully applied in various domains, they have been developed under the unrealistic assumption of full data access. In practice, however, the data are often collected by crawling the network, due to proprietary access, limited resources, and privacy concerns. Recently, we showed that the parameter estimates for relational Bayes classifiers computed from network samples collected by existing network crawlers can be quite inaccurate, and developed a crawl-aware estimation method for such models (Yang, Ribeiro, and Neville, 2017). In this work, we extend the methodology to learning relational logistic regression models via stochastic gradient descent from partial network crawls, and show that the proposed method yields accurate parameter estimates and confidence intervals.
  3. Jiasen Yang, Bruno Ribeiro, Jennifer Neville, Should We Be Confident in Peer Effects Estimated from Partial Crawls of Social Networks?, ICWSM (poster), 2017 [pdf].
    Abstract []
    Research in social network analysis and statistical relational learning has produced a number of methods for learning relational models from large-scale network data. Unfortunately, these methods have been developed under the unrealistic assumption of full data access. In practice, however, the data are often collected by crawling the network, due to proprietary access, limited resources, and privacy concerns. While prior studies have examined the impact of network crawling on the structural characteristics of the resulting samples, this work presents the first empirical study designed to assess the impact of widely used network crawlers on the estimation of peer effects. Our experiments demonstrate that the estimates obtained from network samples collected by existing crawlers can be quite inaccurate, unless a significant portion of the network is crawled. Meanwhile, motivated by recent advances in partial network crawling, we develop crawl-aware relational methods that provide accurate estimates of peer effects with statistical guarantees from partial crawls.
  4. Kun Tu, Bruno Ribeiro, Ananthram Swami and Don Towsley, Temporal Clustering in Time-Varying Networks with Time Series Analysis, NIPS 2017 Time Series Workshop.
  5. Mohamed S. Hassan, Bruno Ribeiro, Walid G. Aref, SBG-Sketch: A Self-Balanced Sketch for Labeled-Graph Stream Summarization, (preprint) [arXiv:1709.06723].
    Abstract []
    Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossy-compressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeled-graph streams that have highly skewed label distributions that might also vary over time. This paper introduces Self-Balanced Graph Sketch (SBG-Sketch, for short), a graphical sketch for summarizing and querying labeled-graph streams that can cope with all these challenges. SBG-Sketch maintains a synopsis for both the edge attributes (e.g., edge weight) and the topology of the streamed graph. SBG-Sketch allows efficient processing of graph-traversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show SBG-Sketch to reduce the estimation errors of state-of-the-art methods by up to 99%.
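    For readers unfamiliar with graphical sketches, a minimal count-min-style variant conveys the idea. This is not SBG-Sketch itself (which adds self-balancing across hash rings for skewed labels); it is a plain TCM-style sketch, with all names assumed for illustration.

```python
import random

class GraphSketch:
    """A minimal TCM-style graphical sketch (illustrative, not the
    SBG-Sketch algorithm): each of `depth` hash functions maps nodes
    into `width` buckets, and edge weights accumulate in a
    width x width matrix.  Queries take the minimum over the depth
    matrices, so estimates only ever overcount (hash collisions add
    weight, never remove it)."""
    def __init__(self, depth=4, width=64, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.width = width
        self.mats = [[[0.0] * width for _ in range(width)]
                     for _ in range(depth)]

    def _h(self, salt, node):
        return hash((salt, node)) % self.width

    def add_edge(self, u, v, weight=1.0):
        for salt, m in zip(self.salts, self.mats):
            m[self._h(salt, u)][self._h(salt, v)] += weight

    def edge_weight(self, u, v):
        return min(m[self._h(salt, u)][self._h(salt, v)]
                   for salt, m in zip(self.salts, self.mats))
```

    A directed edge stream updates `depth * 1` counters per arrival, so memory is fixed regardless of stream length; the skew-handling that distinguishes SBG-Sketch sits on top of this basic structure.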
  1. Flavio Figueiredo, Bruno Ribeiro, Jussara Almeida, Christos Faloutsos, TribeFlow: Mining & Predicting User Trajectories, ACM International Conference on World Wide Web (WWW'16), 2016
    Version with corrected eq. (1): [pdf] [src + data].
    Earlier technical report at [arXiv:1511.01032].
    Abstract []
    Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing an hour from now. Mindful of these challenges, we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.
  2. Bo Jiang, Daniel R. Figueiredo, Bruno Ribeiro, Don Towsley, On the Duration and Intensity of Competitions in Nonlinear Polya Urn Processes with Fitness, ACM SIGMETRICS, 2016 [pdf] Best Paper Award.
  3. Konstantin Avrachenkov, Bruno Ribeiro, Jithin K. Sreedharan, Inference in OSNs via Lightweight Partial Crawls, ACM SIGMETRICS, 2016.
    Final version [pdf]
    Earlier version arXiv:1510.05407, 2015 [pdf].
  4. Flavio Figueiredo, Bruno Ribeiro, Jussara M. Almeida, Christos Faloutsos, A Summary of the TribeFlow Model for Music Discovery Applications, ICML 2016 @ Machine Learning for Music Discovery Workshop (invited).
  5. Flavio Figueiredo, Bruno Ribeiro, Christos Faloutsos, Nazareno Andrade, Jussara M. Almeida, Mining Online Music Listening Trajectories, ISMIR: 17th International Symposium/Conference on Music Information Retrieval, 2016 (accepted).
    Final version [pdf]
    Abstract []
    Understanding the listening habits of users is a valuable undertaking for musicology researchers, artists, consumers and online businesses alike. With the rise of Online Music Streaming Services (OMSSs), large amounts of user behavioral data can be exploited for this task. In this paper, we present SWIFT-FLOWS, an approach that models user listening habits with regard to how user attention transitions between artists. SWIFT-FLOWS combines recent advances in trajectory mining, coupled with modulated Markov models, as a means to capture both how users switch attention from one artist to another, as well as how users fixate their attention on a single artist over short or long periods of time. We employ SWIFT-FLOWS on OMSS datasets, showing that it provides: (1) a semantically meaningful representation of habits; (2) accurate modeling of the attention span of users.
  1. Bruno Ribeiro, Minh Hoang, Ambuj Singh, Beyond Models: Forecasting Complex Network Processes Directly from Data, ACM International Conference on World Wide Web (WWW'15), 2015 [pdf] [Tech Report] [code + data] [pptx].
    Abstract []
    Complex network phenomena - such as information cascades in online social networks - are hard to fully observe, model, and forecast. In forecasting, a recent trend has been to forgo the use of parsimonious models in favor of models with increasingly large degrees of freedom that are trained to learn the behavior of a process from historical data. Extrapolating this trend into the future, eventually we would renounce models altogether. * But is it possible to forecast the evolution of a complex stochastic process directly from the data without a model? * In this work we show that model-free forecasting is possible. We present SED, an algorithm that forecasts process statistics based on relationships of statistical equivalence using two general axioms and historical data. To the best of our knowledge, SED is the first method that can perform axiomatic, model-free forecasts of complex stochastic processes. Our simulations using simple and complex evolving processes and tests performed on a large real-world dataset show promising results.
  2. Bruno Ribeiro, Christos Faloutsos, Modeling Website Popularity Competition in the Attention-Activity Marketplace, ACM International Conference on Web Search and Data (WSDM'15), 2015 [arXiv:1403.0600] [pptx].
    Abstract []
    How does a new startup drive the popularity of competing websites into oblivion like Facebook famously did to MySpace? This question is of great interest to academics, technologists, and financial investors alike. In this work we exploit the singular way in which Facebook wiped out the popularity of MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity competition model. Our model provides new insights into what Nobel Laureate Herbert A. Simon called the "marketplace of attention," which we recast as the attention-activity marketplace. Our model design is further substantiated by user-level activity of 250,000 MySpace users obtained between 2004 and 2009. The resulting model not only accurately fits the observed Daily Active Users (DAU) of Facebook and its competitors but also predicts their fate four years into the future.
  3. Bo Jiang, Liyuan Sun, Daniel R Figueiredo, Bruno Ribeiro and Don Towsley, On the duration and intensity of cumulative advantage competitions, J. Stat., 2015 (accepted) [pdf].
    Abstract []
    Network growth can be framed as a competition for edges among nodes in the network. As with various other social and physical systems, skill (fitness) and luck (random chance) act as fundamental forces driving competition dynamics. In the context of networks, cumulative advantage (CA), the rich-get-richer effect, is seen as a driving principle governing the edge accumulation process. However, competitions coupled with CA exhibit non-trivial behavior and little is formally known about the duration and intensity of CA competitions. By isolating two nodes in an ideal CA competition, we provide a mathematical understanding of how CA exacerbates the role of luck to the detriment of skill. We show, for instance, that when nodes start with few edges, an early stroke of luck can place the less skilled in the lead for an extremely long period of time, a phenomenon we call 'struggle of the fittest'. We prove that the duration of a simple skill-and-luck competition model exhibits a power-law tail when CA is present, regardless of skill difference, which is in sharp contrast to the exponential tails when fitness is distinct but CA is absent. We also prove that competition intensity is always upper bounded by an exponential tail, irrespective of CA and skills. Thus, CA competitions can be extremely long (infinite mean, depending on fitness ratio) but almost never very intense. The theoretical results are corroborated by extensive numerical simulations. Our findings have important implications for competitions not only among nodes in networks but also in contexts that leverage socio-physical models embodying CA competitions.
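    The two-node CA race described above is easy to simulate. The sketch below is a toy proxy, not the paper's model: `ca_competition` is a hypothetical name, "duration" is proxied by the time of the last tie, and "intensity" by the tie count.

```python
import random

def ca_competition(f1, f2, x1=1, x2=1, steps=10000, rng=random):
    """Simulate a two-agent cumulative-advantage race: at each step,
    agent i wins the next unit with probability proportional to
    fitness_i * current_count_i (linear CA).  Returns
    (last_tie_time, num_ties), toy proxies for the competition's
    duration and intensity."""
    last_tie, ties = 0, 0
    for t in range(1, steps + 1):
        p1 = f1 * x1 / (f1 * x1 + f2 * x2)
        if rng.random() < p1:
            x1 += 1
        else:
            x2 += 1
        if x1 == x2:
            last_tie, ties = t, ties + 1
    return last_tie, ties
```

    Even with $f_1 > f_2$, repeated runs show occasional very late last ties (the heavy-tailed durations), while the number of ties stays small, matching the "long but rarely intense" picture in the abstract.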
  1. Pinghui Wang, John C.S. Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, Xiaohong Guan, Efficiently Estimating Motif Statistics of Large Networks, ACM Transactions on Knowledge Discovery from Data (TKDD), Sept. 2014 [pdf].
    Older ArXiv version arXiv:1306.5288.
    Abstract []
    Exploring statistics of locally connected subgraph patterns (also known as network motifs) has helped researchers better understand the structure and function of biological and online social networks (OSNs). Nowadays, the massive size of some critical networks – often stored in already overloaded relational databases – effectively limits the rate at which nodes and edges can be explored, making it a challenge to accurately discover subgraph statistics. In this work, we propose sampling methods to accurately estimate subgraph statistics from as few queried nodes as possible. We present sampling algorithms that efficiently and accurately estimate subgraph properties of massive networks. Our algorithms require no pre-computation or complete network topology information. At the same time, we provide theoretical guarantees of convergence. We perform experiments using widely known data sets, and show that for the same accuracy, our algorithms require an order of magnitude fewer queries (samples) than the current state-of-the-art algorithms.
  1. Ting-Kai Huang, Bruno Ribeiro, Harsha V. Madhyastha, Michalis Faloutsos, The Socio-monetary Incentives of Online Social Network Malware Campaigns, ACM Conference on Online Social Networks (COSN'14), 2014 [pdf] [slides.pdf] [slides.pptx].
    Abstract []
    Online social networks (OSNs) offer a rich medium for malware propagation. Unlike other forms of malware, OSN malware campaigns direct users to malicious websites that hijack their accounts, posting malicious messages on their behalf with the intent of luring their friends to the malicious website, thus triggering word-of-mouth infections that cascade through the network, compromising thousands of accounts. * But how are OSN users lured to click on the malicious links? *
    In this work, we monitor 3.5 million Facebook accounts and explore the role of pure monetary, social, and combined socio-monetary psychological incentives in OSN malware campaigns. Among other findings we see that the majority of the malware campaigns rely on pure social incentives. However, we also observe that malware campaigns using socio-monetary incentives infect more accounts and last longer than campaigns with pure monetary or social incentives. The latter suggests the efficiency of an epidemic tactic surprisingly similar to the mechanism used by biological pathogens to cope with diverse gene pools.
  1. Flavio Figueiredo, Jussara M Almeida, Yasuko Matsubara, Bruno Ribeiro, Christos Faloutsos, Revisit Behavior in Social Media: The Phoenix-R Model and Discoveries, (ECML/PKDD 2014) [arXiv:1405.1459].
    Abstract []
    How many listens will an artist receive on an online radio? How about plays on a YouTube video? How many of these visits are new or returning users? Modeling and mining popularity dynamics of social activity has important implications for researchers, content creators and providers. We here investigate and model the effect of revisits on popularity. A revisit can be defined as a repeated play (song or video) or a re-tweet of the same hashtag over time. Using four datasets of social activity, with up to tens of millions media objects (e.g., YouTube videos, Twitter hashtags or LastFM artists), we show the effect of revisits in the popularity evolution of such objects. Secondly, we propose the Phoenix-R model which captures the popularity dynamics of individual objects. Phoenix-R has the desired properties of being: (1) parsimonious, being based on the minimum description length principle, and achieving lower root mean squared error than state-of-the-art baselines; (2) applicable, the model is effective for predicting future popularity values of time series.
  1. Bruno Ribeiro, Modeling and Predicting the Growth and Death of Membership-based Websites, WWW 2014 [pdf] [arXiv:1307.1354] [slides.pptx].

    Carnegie Mellon University Press Release: In Age of Information Overload, Ability To Sustain Attention Determines Success | CMU SCS Press Release

    How the Model Works commentary (last updated Feb 6 2014)

    In the news: CNET | The Times of India | Pittsburgh Post Gazette | ACM TechNews | The Connectivist | The Tartan | Business Standard | Pittsburgh Business Times | Business Insider (I never said that "website audiences that grew via word-of-mouth are much better off in the long-run for retaining and engaging users"; in fact, my work has a number of counter-examples to this claim) | e! Science News | Network World | Science Daily News | TechAdvisor | MIT Technology Review: A roundup of the most interesting innovations coming out of research centers and universities | NSF's Facebook Page

    Abstract []
    Driven by outstanding success stories of Internet startups such as Facebook and The Huffington Post, recent studies have thoroughly described their growth. These highly visible online success stories, however, overshadow an untold number of similar ventures that fail. The study of website popularity is ultimately incomplete without general mechanisms that can describe both successes and failures. In this work we present six years of the daily number of active users (DAU) of twenty-two membership-based websites -- encompassing online social networks, grassroots movements, online forums, and membership-only Internet stores -- well balanced between successes and failures. We then propose a combination of reaction-diffusion-decay processes whose resulting equations seem not only to describe the observed DAU time series well but also to provide means to roughly predict their evolution. This model allows an approximate automatic DAU-based classification of websites into self-sustainable vs. unsustainable, and of whether the startup's growth is mostly driven by marketing & media campaigns or by word-of-mouth adoption.
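    The sustainable-vs-unsustainable dichotomy can be illustrated with a deliberately simplified growth-decay recurrence. This is not the paper's reaction-diffusion-decay system; `dau_series` and its parameters are assumptions chosen only to show how the sign of (word-of-mouth rate minus decay rate) separates the two regimes.

```python
def dau_series(days, a0, alpha, beta, gamma):
    """Toy daily-active-users sketch (illustrative only; the paper's
    equations are richer): membership grows by word-of-mouth
    (alpha * A) and external media pressure (beta), and decays as
    members lose interest (gamma * A):  A' = A + alpha*A + beta - gamma*A.
    Returns the simulated DAU series, floored at zero."""
    series = [a0]
    for _ in range(days - 1):
        a = series[-1]
        series.append(max(0.0, a + alpha * a + beta - gamma * a))
    return series
```

    With `beta = 0`, `alpha > gamma` yields self-sustained growth while `alpha < gamma` decays toward zero; a large `beta` with `alpha < gamma` mimics a site propped up by marketing rather than word-of-mouth.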
  1. (alphabetical) K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro*, and D. Towsley, Pay Few, Influence Most: Online Myopic Network Covering, IEEE NetSciCom Workshop 2014 Best Paper Award [pdf] [slides.pptx].
    *Corresponding author.
    Older version as Technical Report arXiv.
    Abstract []
    Efficient marketing or awareness-raising campaigns seek to recruit a small number, $w$, of influential individuals -- where $w$ is the campaign budget -- that are able to cover the largest possible target audience through their social connections. To date, most of the related literature on maximizing this network cover assumes that the social network topology is known. Even in such a case the coverage problem is NP-hard. In practice, the network topology is generally unknown and needs to be discovered on-the-fly. In this paper we assume that the topology is gradually discovered thanks to recruited individuals disclosing their social connections. Our goal is to provide an efficient online algorithm that recruits individuals so as to maximize the size of target audience covered by the campaign.
    We analyze the performance of a variety of online myopic algorithms (i.e. that do not have a priori information on the topology) currently used to sample and search large networks, including the Flickr and YouTube networks. We also propose a new greedy online algorithm, Maximum Expected Uncovered Degree (MEUD). Our proposed algorithm greedily maximizes the expected size of the cover, but it requires the degree distribution to be known. For a class of random power law networks we show that MEUD simplifies into a straightforward procedure, denoted as MOD because it requires only the knowledge of the Maximum Observed Degree. We substantiate our analytical results with extensive simulations and show that MOD significantly outperforms all analyzed myopic algorithms.
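    A compact sketch of the MOD-style greedy rule follows. It is a reading of the idea, not the authors' code: `mod_cover` is a hypothetical name, and "observed degree" is taken here to mean the number of edges seen from already-recruited nodes.

```python
from collections import defaultdict

def mod_cover(adj, seed, budget):
    """Greedy online covering in the spirit of the MOD rule (a sketch,
    not the paper's implementation): repeatedly recruit the frontier
    node with the largest observed degree, i.e. the most edges seen
    from already-recruited nodes.  Only neighbors of recruited nodes
    are ever inspected, mimicking on-the-fly topology discovery.
    Returns (recruited_nodes, covered_nodes)."""
    recruited = {seed}
    observed = defaultdict(int)      # observed degree of frontier nodes
    for v in adj[seed]:
        observed[v] += 1
    while len(recruited) < budget and observed:
        # Deterministic tie-break on the node id for reproducibility.
        nxt = max(observed, key=lambda v: (observed[v], v))
        del observed[nxt]
        recruited.add(nxt)
        for v in adj[nxt]:
            if v not in recruited:
                observed[v] += 1
    covered = set(recruited)
    for u in recruited:
        covered.update(adj[u])
    return recruited, covered
```

    Each recruit both extends the cover and reveals new frontier nodes, which is exactly the myopic (no a priori topology) setting analyzed in the paper.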
  1. Bo Jiang, Liyuan Sun, Daniel R. Figueiredo, Bruno Ribeiro, Don Towsley, On the duration and intensity of cumulative advantage competitions and the struggle of the fittest, [arXiv:1402.4536].
    Abstract []
    At the heart of competition for resources lie two simple principles: "survival of the fittest", which reflects how intrinsic fitnesses of competing agents influence their ability to gather resources; and "cumulative advantage", which reflects how accumulated resources help gather even more resources. In this letter we show that competitions embodying just the essence of these two principles exhibit a variety of surprising behaviors with respect to their duration, as measured by how long it takes for an indisputable winner to emerge, and intensity, as measured by how many times agents have equal amount of resources. We demonstrate that when the fitnesses of the agents are different, long-lasting competitions continue to occur in spite of a dramatic decrease in their intensity and the certainty that the fitter agent will win. We also unveil an unexpected phenomenon that we call "struggle of the fittest", where long-lasting competitions become more probable when one agent is slightly fitter than when agents are equally fit. These results have important implications for our understanding of competitions and various applications that leverage such models.
  1. Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang, Xiaodong Wang, Cindy Chen, Benyuan Liu, Don Towsley, Who is Dating Whom: Characterizing User Behaviors of a Large Online Dating Site in Social Network Analysis – Community Detection and Evolution, Springer [arXiv:1401.5710].

    MIT Technology Review (Emerging Technology From the arXiv): Data Mining Reveals the Surprising Behavior of Users of Dating Websites

    Abstract []
    Online dating sites have become popular platforms for people to look for potential romantic partners. It is important to understand users' dating preferences in order to make better recommendations on potential dates. The message sending and replying actions of a user are strong indicators for what he/she is looking for in a potential date and reflect the user's actual dating preferences. We study how users' online dating behaviors correlate with various user attributes using a large real-world dataset from a major online dating site in China. Many of our results on user messaging behavior align with notions in social and evolutionary psychology: males tend to look for younger females while females put more emphasis on the socioeconomic status (e.g., income, education level) of a potential date. In addition, we observe that the geographic distance between two users and the photo count of users play an important role in their dating behaviors. Our results show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a potential date may largely be a result of random selection. We also find that both males and females are more likely to reply to users whose attributes come closest to the stated preferences of the receivers, and there is significant discrepancy between a user's stated dating preference and his/her actual online dating behavior. These results can provide valuable guidelines to the design of a recommendation engine for potential dates.
  1. Kun Tu, Bruno Ribeiro, H. Jiang, X. Wang, David Jensen, Benyuan Liu, Don Towsley, Online Dating Recommendations: Matching Markets and Learning Preferences, 5th International Workshop on Social Recommender Systems (SRS 2014) with WWW 2014 [arXiv:1401.8042].
    Abstract []
    Recommendation systems for online dating have recently attracted much attention from the research community. In this paper we propose a two-sided matching framework for online dating recommendations and design an LDA model to learn user preferences from the observed user messaging behavior and user profile features. Experimental results using data from a large online dating website show that two-sided matching significantly improves the rate of successful matches by as much as 45%. Finally, using simulated matchings, we show that the LDA model can correctly capture user preferences.
  1. Yeon-sup Lim, Bruno Ribeiro, Don Towsley, Classifying Latent Infection States in Complex Networks, SIMPLEX Workshop 2014 (@ WWW'14) [arXiv:1402.0013].
    Abstract []
    Algorithms for identifying the infection states of nodes in a network are crucial for understanding and containing infections. Often, however, only a relatively small set of nodes have a known infection state. Moreover, the length of time that each node has been infected is also unknown. This missing data – the infection state of most nodes and the infection time of the unobserved infected nodes – poses a challenge to the study of real-world cascades.
    In this work, we develop techniques to identify the latent infected nodes in the presence of missing infection time-and-state data. Based on the likely epidemic paths predicted by the simple susceptible-infected epidemic model, we propose a measure (Infection Betweenness) for uncovering these unknown infection states. Our experimental results using machine learning algorithms show that Infection Betweenness is the most effective feature for identifying latent infected nodes.
  1. James Atwood, Bruno Ribeiro, Don Towsley, Efficient Network Generation Under General Preferential Attachment, SIMPLEX Workshop 2014 (@ WWW'14) [arXiv:1403.4521].
    James's source code.
    Abstract []
    Preferential attachment (PA) models of network structure are widely used due to their explanatory power and conceptual simplicity. PA models are able to account for the scale-free degree distribution observed in many real-world large networks through the remarkably simple mechanism of sequentially introducing nodes that attach preferentially to nodes with high degree. The ability to efficiently generate instances from PA models is a key asset in understanding both the models themselves and the real networks that they represent. Surprisingly, little attention has been paid to the problem of efficient instance generation. In this paper, we show that the complexity of generating network instances from a PA model depends on the preference function of the model, provide efficient data structures that work under any preference function, and present empirical results from an implementation based on these data structures. We demonstrate that, by indexing growing networks with a simple augmented heap, we can implement a network generator which scales many orders of magnitude beyond existing capabilities ($10^6$–$10^8$ nodes). We show the utility of an efficient and general PA network generator by investigating the consequences of varying the preference functions of an existing model. We also provide “quicknet”, a freely-available open-source implementation of the methods described in this work.
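    For context, the classic special case of linear preference admits an O(1) sampling trick that needs no heap at all; the paper's augmented heap is what generalizes this to arbitrary preference functions. The sketch below (with the assumed name `linear_pa_graph`) shows only the linear case.

```python
import random

def linear_pa_graph(n, m, rng=random):
    """Generate a Barabasi-Albert-style network: each new node attaches
    m edges preferentially by degree.  The `targets` list holds one
    entry per unit of degree, so a uniform draw from it is exactly a
    degree-proportional draw -- the classic O(1) trick for *linear*
    preference functions only.  Seed nodes start with weight one each
    (a common simplification).  Returns an edge list."""
    edges = []
    targets = list(range(m))             # seed: m initial nodes
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:           # m distinct targets per new node
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.append(t)            # t gained one unit of degree
        targets.extend([new] * m)        # new node enters with degree m
    return edges
```

    A nonlinear preference function (e.g. degree squared) breaks the uniform-draw trick, since list multiplicities can no longer encode the weights exactly; that is precisely where an indexed structure such as the paper's augmented heap becomes necessary.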
  1. Bruno Ribeiro, Nicola Perra, and Andrea Baronchelli, Quantifying the Effect of Temporal Resolution on Time-varying Networks, Scientific Reports, 2013 [pdf]. arXiv:1211.7052.
    Abstract []
    Time-varying networks describe a wide array of systems whose constituents and interactions evolve in time. These networks are defined by an ordered stream of interactions between nodes. However, they are often represented as a sequence of static networks, resulting from aggregating all edges and nodes appearing at time intervals of size $\Delta t$. In this work we investigate the consequences of this procedure. In particular, we address the impact of an arbitrary $\Delta t$ on the description of a dynamical process taking place upon a time-varying network. We focus on the elementary random walk, and put forth a mathematical framework that provides exact results in the context of synthetic activity driven networks. Remarkably, the equations turn out to also describe accurately the behavior observed on real datasets. Our results provide the first analytical description of the bias introduced by time integrating techniques, and represent a step forward in the correct characterization of dynamical processes on time-varying graphs.
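    The aggregation step whose consequences the paper studies is simple to state in code (a minimal sketch, assuming the stream is a list of `(t, u, v)` contact tuples and `dt` plays the role of $\Delta t$; windows with no events are simply skipped here):

```python
from collections import defaultdict

def aggregate(stream, dt):
    """Aggregate a time-stamped edge stream into static snapshots:
    all contacts falling in the same window of size dt are collapsed
    into one undirected, unweighted edge set."""
    snaps = defaultdict(set)
    for t, u, v in stream:
        snaps[int(t // dt)].add((min(u, v), max(u, v)))
    return [snaps[k] for k in sorted(snaps)]
```

    All ordering information inside a window is discarded, which is precisely the source of the bias that time-integrating techniques introduce for dynamical processes such as random walks.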
  1. Peng Xia, Bruno Ribeiro, Cindy Chen, Benyuan Liu, Don Towsley, A Study of User Behavior on an Online Dating Site (short paper), ASONAM 2013.
    Abstract []
  1. Ting-Kai Huang, Md Sazzadur Rahman, Harsha V. Madhyastha, Michalis Faloutsos, and Bruno Ribeiro, An Analysis of Socware Cascades in Online Social Networks, WWW 2013 [pdf].
    Abstract []
    Online social networks (OSNs) have become a popular new vector for distributing malware and spam, which we refer to as socware. Unlike email spam, which is sent by spammers directly to intended victims, socware cascades through OSNs as compromised users spread it to their friends. In this paper, we analyze data from the walls of roughly 3 million Facebook users over five months, with the goal of developing a better understanding of socware cascades. We study socware cascades to understand: (a) their spatio-temporal properties, (b) the underlying motivations and mechanisms, and (c) the social engineering tricks used to con users. First, we identify an evolving trend in which cascades appear to be throttling their rate of growth to evade detection, and thus, lasting longer. Second, our forensic investigation into the infrastructure that supports these cascades shows that, surprisingly, Facebook seems to be inadvertently enabling most cascades; 44% of cascades are disseminated via Facebook applications. At the same time, we observe large groups of synergistic Facebook apps (more than 144 groups of size 5 or more) that collaborate to support multiple cascades. Lastly, we find that hackers rely on two social engineering tricks in equal measure—luring users with free products and appealing to users’ social curiosity—to enable socware cascades. Our findings present several promising avenues towards reducing socware on Facebook, but also highlight associated challenges.
  1. Fabricio Murai, Bruno Ribeiro, Don Towsley, and Krista Gile, Characterizing Branching Processes from Sampled Data, SIMPLEX Workshop 2013 (@ WWW'13). arXiv:1302.5847.
    Abstract []
    Branching processes model the evolution of populations of agents that randomly generate offspring. These processes, more patently the Galton-Watson process, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena using a Galton-Watson process is the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose a collection of methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents and instrumented with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population.
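    For intuition, the forward process being characterized can be simulated in a few lines (a toy sketch of the Galton-Watson process itself, not of the paper's estimators, which work backwards from a small sampled fraction of agents):

```python
import random

def galton_watson(offspring_pmf, max_generations, seed=0):
    """Simulate generation sizes of a Galton-Watson process in which
    every agent independently draws its offspring count from
    offspring_pmf (a dict mapping count -> probability)."""
    rng = random.Random(seed)
    counts, probs = zip(*sorted(offspring_pmf.items()))
    sizes = [1]                          # generation 0: a single ancestor
    for _ in range(max_generations):
        n = sum(rng.choices(counts, probs)[0] for _ in range(sizes[-1]))
        sizes.append(n)
        if n == 0:                       # population went extinct
            break
    return sizes
```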
  1. F. Murai, B. Ribeiro, D. Towsley, P. Wang, On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling, IEEE JSAC special issue on Network Science, 2013. [arXiv:1209.0736].
    Abstract []
    In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications in capacity planning and network theory, among other areas. Examples of real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow size distributions on the Internet. We demonstrate that it is hard to estimate the original set size distribution. The recoverability of original set size distributions presents a sharp threshold with respect to the fraction of elements that remain in the sets. If this fraction remains below a threshold, typically half of the elements in power-law and heavier-than-exponential-tailed distributions, then the original set size distribution is unrecoverable. We also discuss practical implications of our findings.
  1. D. Figueiredo, P. Nain, B. Ribeiro*, E. de Souza e Silva, and D. Towsley, Characterizing Continuous Time Random Walks on Time-varying Graphs, SIGMETRICS 2012. *Corresponding author. [pdf]. Technical Report arXiv:1112.5762.
    Abstract []
    In this paper we study the behavior of a continuous time random walk (CTRW) on a stationary and ergodic time varying dynamic graph. We establish conditions under which the CTRW is a stationary and ergodic process. In general, the stationary distribution of the walker depends on the walker rate and is difficult to characterize. However, we characterize the stationary distribution in the following cases: i) the walker rate is significantly larger or smaller than the rate at which the graph changes (time-scale separation), ii) the walker rate is proportional to the degree of the node that it resides on (coupled dynamics), and iii) the degrees of nodes belonging to the same connected component are identical (structural constraints). We provide examples that illustrate our theoretical findings.
  1. Bruno Ribeiro, Prithwish Basu, Don Towsley, Multiple Random Walks to Uncover Short Paths in Power Law Networks, IEEE INFOCOM NetSciCom 2012. [pdf].
    Abstract []
    Developing simple distributed algorithms to allow nodes to perform topology discovery and message routing using incomplete topological information is a problem of great interest in network science. Consider the following routing problem in the context of a large power law network $G$. A small number of nodes want to exchange messages over short paths on $G$. These nodes do not have access to the topology of $G$ but are allowed to crawl the network subject to a budget constraint. Only crawlers whose paths cross are allowed to exchange topological information. In this work we study the use of random walks (RWs) to crawl $G$. We show that RWs have the ability to find short paths and that this bears no relation to the paths that they take. Instead, it relies on two properties of RWs on power law networks:
    • The RW ability to observe a sizable fraction of the network edges;
    • The near certainty that two distinct RW sample paths cross after a small percentage of the nodes have been visited.
    We show promising simulation results on several real world networks.
  1. Bruno Ribeiro, Don Towsley, On the Estimation Accuracy of Degree Distributions from Graph Sampling, IEEE Conference on Decision and Control (invited), 2012. [pdf].
    This is a new version of the old technical report UM-CS-2011-036. [pdf].
    Abstract []
    Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. In this work we present an in-depth study of the Mean Squared Error (MSE) of sampling methods such as independent random vertex (RV) and random edge (RE) sampling and crawling methods such as random walks (RWs), a.k.a. RDS, and a Metropolis-Hastings algorithm whose target distribution is to uniformly sample vertices (MHRWu). This paper provides an upper bound for the MSE of a stationary RW as a function of the MSE of RE and the absolute value of the second most dominant eigenvalue of the RW transition probability matrix. We see that RW and RV sampling are optimal with respect to different weighted MSE optimizations and show when RW is preferable to RV sampling. Finally, we present an approximation to the MHRWu MSE. We evaluate the accuracy of our approximations and bound using simulations on large real world graphs.
  1. Bruno Ribeiro, Pinghui Wang, Fabricio Murai, and Don Towsley, Sampling Directed Graphs with Random Walks, INFOCOM 2012 [pdf].
    A version is available as UMass CMPSCI Technical Report UMCS-2011-031 [pdf].
    Errata []
    The statement of Theorem 3.1 is incorrect in the camera-ready version of our INFOCOM 2012 paper. It should read
    Theorem 3.1: \hat\pi(s_i) is an *asymptotically* unbiased estimator of \pi(s_i).
    Abstract []
    Despite recent efforts to characterize complex networks such as citation graphs or online social networks (OSNs), little attention has been given to developing tools that can be used to characterize directed graphs in the wild, where no pre-processed data is available. The presence of hidden incoming edges but observable outgoing edges poses a challenge to characterize large directed graphs through crawling. Unless we can crawl the entire graph or the directed graph edges are highly symmetrical, hidden incoming edges induce unknown biases in the sampled nodes. In this work we propose a random walk sampling algorithm that is less prone to these biases. The driving principle behind our random walk is to construct, in real-time, an undirected graph from the directed graph in a way that is consistent with the sample path followed by the algorithm walking on either graph. We also study out-degree and in-degree distribution estimation. Out-degrees are visible to the walker while in-degrees are hidden (latent). This makes for strikingly different estimation accuracies of in- and out-degree distributions. We show that our algorithm can accurately estimate out-degree distributions and show that no algorithm can accurately estimate unbiased in-degree distributions unless the directed graph is highly symmetrical.
  1. Y. Lim, D. S. Menasche, B. Ribeiro, D. Towsley, and P. Basu, Online estimating the k central nodes of a network. IEEE Network Science Workshop (NSW), 2011 [pdf].
    Abstract []
    Estimating the most influential nodes in a network is a fundamental problem in network analysis. Influential nodes may be important spreaders of diseases in biological networks, key actors in terrorist networks, or marketing targets in social networks. By identifying such central nodes, one can devise efficient strategies for prevention of diseases or crime and efficient marketing strategies. The goal of this paper is to estimate the k most central nodes in a network through parsimonious sampling. Centrality determines the relative importance of a particular node within the network. Conventional measures of node centrality include degree, betweenness, and closeness.
  1. Bruno Ribeiro, Daniel Figueiredo, Edmundo de Souza e Silva, and Don Towsley, Characterizing Dynamic Graphs with Continuous-time Random Walks, SIGMETRICS 2011 (Generalized to more general graph processes in our SIGMETRICS 2012 paper) [pdf].
    Abstract []
    In this paper we study the steady state behavior of continuous-time random walks (CTRW) on Markov dynamic networks. We consider two types of CTRWs: one that walks at a constant rate (CTRW-C) and another that walks with a rate proportional to the vertex degree (CTRW-D). We derive closed-form analytical expressions for the steady state (SS) distribution of these walkers. For CTRW-C we obtain the approximate SS distribution for either a very fast or very slow walker. We show that the behavior of CTRW-C and CTRW-D is strikingly different. Surprisingly, the steady state distribution of the fixed rate walker depends on the walker rate, which is not the case for the degree proportional walker. Such findings have direct implication on the design of algorithms to measure dynamic networks.
  1. Konstantin Avrachenkov, Bruno Ribeiro, and Don Towsley, Improving Random Walk Estimation Accuracy with Uniform Restarts, 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), Sept 2010.
    A version is available as INRIA Technical Report no 7394 [pdf].
    Abstract []
    This work proposes and studies the properties of a hybrid sampling scheme that mixes independent uniform node sampling and random walk (RW)-based crawling. We show that our sampling method combines the strengths of both uniform and RW sampling while minimizing their drawbacks. In particular, our method increases the spectral gap of the random walk, and hence, accelerates convergence to the stationary distribution. The proposed method resembles PageRank but unlike PageRank preserves time-reversibility. Applying our hybrid RW to the problem of estimating degree distributions of graphs shows promising results.
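    The hybrid scheme can be sketched with a short simulation (an illustration assuming an adjacency-list graph; jumping with probability $\alpha/(\deg(v)+\alpha)$ keeps the modified chain time-reversible, with stationary probabilities proportional to $\deg(v)+\alpha$):

```python
import random

def hybrid_walk(adj, steps, alpha=1.0, start=0, seed=0):
    """Random walk with uniform restarts: at node v, jump to a
    uniformly chosen node with probability alpha / (deg(v) + alpha),
    otherwise step to a uniformly chosen neighbor of v."""
    rng = random.Random(seed)
    nodes = list(adj)
    visits = {u: 0 for u in nodes}
    v = start
    for _ in range(steps):
        visits[v] += 1
        d = len(adj[v])
        if rng.random() < alpha / (d + alpha):
            v = rng.choice(nodes)      # uniform restart
        else:
            v = rng.choice(adj[v])     # ordinary RW step
    return visits
```

    On a 5-node star with $\alpha = 1$, for example, the center should be visited a fraction $(4+1)/13$ of the time in the long run; the restarts also keep the walker from being trapped in sparsely connected regions.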
  1. William Gauvin, Bruno Ribeiro, Benyuan Liu, Don Towsley, Jie Wang, Measurement and Gender-Specific Analysis of User Publishing Characteristics on MySpace, IEEE Network special issue on Online Social Networks, vol 24, number 5, Oct 2010.
    Abstract []
    Online social networks have become popular platforms for people to make connections, share information, and interact with each other. In an online social network, user publishing activities such as sending messages and posting photos represent online interactions between friends that involve the use of network and system resources. As more and more businesses use online social networks as a means to promote their brand names a good understanding of user publishing characteristics is important not only for capacity planning (server scaling estimation, network bandwidth provisioning, and performance tuning), but also for marketing analysis and security measures of online social networks. Recently there have been many efforts to measure and analyze various characteristics of online social networks. Most of these studies have focused on the network graph properties such as node degree distribution, clustering coefficient, and connected components. In this work, we measure and analyze the gender-specific user publishing characteristics on MySpace. Our results show that there are recognizable patterns with respect to profile attributes, user interactions, temporal usages, and blog contents for users of different genders. In particular, gender neutral profiles (most of them are music bands, TV shows, and other commercial sites) differ significantly from normal male and female profiles for several publishing patterns.
  1. Bruno Ribeiro and Don Towsley, Estimating and Sampling Graphs with Multidimensional Random Walks, ACM SIGCOMM Internet Measurement Conference, Nov, 2010. [src] [pdf]. IMC 2010 slides [pptx].
    Abstract []
    Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. Current sampling methods such as (independent) random vertex and random walks are useful but have drawbacks. Random vertex sampling may require too many resources (time, bandwidth, or money). Random walks, which normally require fewer resources per sample, can suffer from large estimation errors in the presence of disconnected or quasi-disconnected subgraphs. In this work we propose a new $m$-dimensional random walk that uses $m$ dependent random walkers. We prove that the proposed sampling method, which we call Frontier sampling, has all of the nice sampling properties of a regular random walk. At the same time, our simulations over large real world graphs show that, in the presence of disconnected or quasi-disconnected subgraphs, Frontier sampling exhibits lower estimation errors than regular random walks. We also argue that Frontier sampling is more suitable to sample power-law graphs than random vertex sampling.
    Errata []
    Table 1: Size of LCC for Internet RLT is 190,914 not 609,066.
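    The core of Frontier sampling is short enough to sketch (a simplified rendering of the method, assuming an undirected connected adjacency-list graph: $m$ coupled walkers, one of which moves at each step, chosen with probability proportional to its current node's degree):

```python
import random

def frontier_sampling(adj, m, steps, seed=0):
    """Frontier sampling sketch: keep m walkers; at each step pick one
    with probability proportional to its node's degree, record a
    uniformly chosen incident edge, and advance that walker along it."""
    rng = random.Random(seed)
    frontier = rng.sample(list(adj), m)    # m uniform starting nodes
    sampled_edges = []
    for _ in range(steps):
        degs = [len(adj[v]) for v in frontier]
        r = rng.random() * sum(degs)
        i = 0
        while r > degs[i]:                 # degree-proportional choice
            r -= degs[i]
            i += 1
        u = frontier[i]
        v = rng.choice(adj[u])             # uniform edge of u
        sampled_edges.append((u, v))
        frontier[i] = v                    # advance the chosen walker
    return sampled_edges
```

    Because the moving walker is degree-weighted, the sequence of traversed edges behaves like the edge sample of a single stationary random walk, while the $m$ uniform starting points hedge against disconnected or quasi-disconnected subgraphs.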
  1. Bruno Ribeiro, William Gauvin, Benyuan Liu, and Don Towsley, On MySpace Account Spans and Double Pareto-Like Distribution of Friends, IEEE Infocom 2010 Network Science Workshop, March, 2010. [pdf]
    An older version is available as UMass CMPSCI Technical Report UM-CS-2010-001 [pdf].
    Abstract []
    In this work we study the activity span of MySpace accounts and its connection to the distribution of the number of friends. The activity span is the time elapsed since the creation of the account until the user's last login time. We observe exponentially distributed activity spans. We also observe that the distribution of the number of friends over accounts with the same activity span is well approximated by a lognormal with a fairly light tail. These two findings shed light on the puzzling (yet unexplained) inflection point (knee) in the distribution of friends in MySpace when plotted in log-log scale. We argue that the inflection point resembles the inflection point of Reed's (Double Pareto) Geometric Brownian Motion with Exponential Stopping Times model. We also present evidence against the Dunbar number hypothesis of online social networks, which argues, without proof, that the inflection point is due to the Dunbar number (a theoretical limit on the number of people that a human brain can sustain active social contact with). While we answer many questions, we leave many others open.
2009 & earlier work
  1. Bruno Ribeiro, Daniel Figueiredo, and Don Towsley, Herding BitTorrent Traffic away from Expensive ISP Links, Jan, 2009. A version is available as UMass CMPSCI Technical Report UM-CS-2008-029 [pdf].
    Abstract []
    The steady increase in access link capacity of broadband subscribers coupled with the significant increase in traffic, partially due to popular peer-to-peer (P2P) file sharing applications, has had an adverse economic impact on last-mile Internet Service Providers (ISPs). Reducing the cost associated with P2P traffic has become a major issue for ISPs worldwide and has triggered several proposals. However, previous works advocate for changes to P2P software or require application-specific packet identification and manipulation. This paper presents a distinct approach. We leverage the observation that traffic costs can be reduced by diverting traffic away from expensive links and into cheaper links. We focus on BitTorrent (BT) traffic and propose a change to the ISP's access router packet schedulers in order to divert traffic away from links labeled as expensive by the ISP. Our proposed packet scheduler is stateless. Our approach requires no change to applications, is blind to packet payload, and only requires changes to access routers of ISPs that desire to reduce their BT traffic costs. Our simulation and experimental results indicate that our approach can successfully herd BT traffic towards cheaper links (up to 50% reduction), without sacrificing user-level performance.
  1. Bruno Ribeiro, Tao Ye, and Don Towsley, A Resource-minimalist Flow Size Histogram Estimator, ACM SIGCOMM Internet Measurement Conference, Oct, 2008. A version is available as UMass CMPSCI Technical Report 29-08 [pdf].
    The IMC 2008 slides: [ppt]
    Description []
    This work presents a data stream algorithm for coarse-grained Internet flow size histogram estimation.
    This algorithm is a fast sub-optimal estimator that also has a small memory footprint.
    The memory savings come from the application of a probabilistic counting technique proposed by
    Robert Morris in 1978 and from our hash folding technique.
    In this work we present two tools that are possibly of interest in their own right:
    • hash folding (multiplexing two or more hash tables into the space of one table)
    • semi-probabilistic sampling of packets in a flow.
    Errata []
    Section 3.1.2 has a typo: " ... its ownership bit changed from one to zero, g_j is decremented by one" should be " ... its ownership bit changed from one to zero, g_0 is decremented by one". g_0 counts the number of virtual counters in the sketch and is decremented when a virtual counter is lost (which happens when the ownership bit changes from one to zero). Thanks to Chia-Mu Yu for pointing out this typo.
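    The Morris counter at the heart of the memory savings is tiny (a minimal sketch of the 1978 technique itself, not of the paper's sketch data structure or of hash folding):

```python
import random

def morris_count(n_events, base=2.0, seed=0):
    """Morris (1978) probabilistic counter: store only an exponent c,
    incremented with probability base**(-c), so a count of n events
    is tracked with an exponent of roughly log_base(n)."""
    rng = random.Random(seed)
    c = 0
    for _ in range(n_events):
        if rng.random() < base ** (-c):
            c += 1
    return c

def morris_estimate(c, base=2.0):
    # Unbiased estimate of the true count from the stored exponent.
    return (base ** c - 1.0) / (base - 1.0)
```

    With base 2 the stored exponent is about $\log_2 n$, so it fits in roughly $\log_2 \log_2 n$ bits; bases closer to 1 trade some of that memory saving for lower estimate variance.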
  1. Bruno Ribeiro, Weifeng Chen, Gerome Miklau, and Don Towsley, Analyzing Privacy in Enterprise Packet Trace Anonymization, Proceedings of the 15th Annual Network & Distributed System Security Symposium (NDSS), Feb, 2008 [pdf] and [ppt]
    An extended version is currently available as UMass CMPSCI Technical Report TR 48-07. (BibTeX) [pdf].
    The source code of the software used in this work is available at [src].
    Abstract []
    In this work we show (and present the formal background of) an attack on anonymized IP addresses in (full and partial) prefix-preserving
    anonymized traces. The attack combines the use of external information (fingerprints) with the matching
    constraints imposed by prefix preservation.
    Perhaps most importantly, we develop analysis tools that allow data publishers to quantify the worst-case
    vulnerability of their trace given assumptions about the adversary's external information.
  1. Bruno Ribeiro, Don Towsley, Tao Ye, and Jean Bolot, Fisher Information of Sampled Packets: an Application to Flow Size Estimation, ACM SIGCOMM Internet Measurement Conference (IMC 2006), Rio de Janeiro, Brazil, October 2006. (BibTeX)
    The camera-ready version, the IMC'06 slides, and an errata are available here [pdf][ppt] (recommended).
    An older version is also available as UMass CMPSCI Technical Report 06-37 [pdf]
    Also available is a tar.gz file with the source code that computes the Cramer-Rao bounds found in the paper. It needs R 2.4 or newer: [src].
    The source code to numerically obtain the MLE can be found at [src]. It contains the version of wnlib that I used. Wnlib needs csh to compile.
    Description []
    In this work we look at the amount of statistical information, specifically about the flow size distribution, that can be obtained from probabilistic packet sampling.
    We focus primarily on TCP flows and ask if it is possible to extract enough information from a
    sampled packet stream in order to recover its original flow size distribution. We look at single and multiple
    monitor cases.
    This work also provides some insight into which TCP/IP fields can be used to improve flow size estimation.
    We also provide tools to pre-compute the number of samples needed to achieve a given confidence interval.
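    The measurement model in question is easy to reproduce (a sketch of independent packet sampling only, not of the Fisher-information analysis or the MLE): each packet survives with probability $p$, so a flow of $s$ packets yields a Binomial($s$, $p$) number of samples, and small flows may go entirely unseen.

```python
import random

def sample_packets(flow_sizes, p, seed=0):
    """Independent packet sampling: each packet of a flow is kept with
    probability p, so a size-s flow contributes Binomial(s, p) sampled
    packets; flows with zero sampled packets are invisible."""
    rng = random.Random(seed)
    observed = []
    for s in flow_sizes:
        k = sum(1 for _ in range(s) if rng.random() < p)
        if k > 0:
            observed.append(k)
    return observed
```

    Recovering the original flow size distribution from `observed` is the (hard) inverse problem whose achievable accuracy the Cramer-Rao bounds characterize.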
  1. Bruno Ribeiro, Edmundo de Souza e Silva, and Don Towsley, On the Efficiency of Path Diversity for Continuous Media Applications. A version is currently available as UMass CMPSCI Technical Report 05-19 [pdf].
    Description []
    Path diversity is beneficial for multimedia applications in the presence of loss.
    It ameliorates the effect of end-to-end burst losses by spreading such losses.
    Multimedia applications using path diversity are known to achieve better error performance than in the single path case.
    Packet interleaving is an older and more widespread technique that also spreads burst losses.
    To the best of my knowledge, this work is the first to provide evidence (and a mathematical proof :) that packet interleaving can
    also be used to achieve good error performance in multimedia applications.
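    The spreading effect is easy to see with a toy block interleaver (a generic illustration, not the scheme analyzed in the report):

```python
def interleave(seq, depth):
    """Block interleaver: write `seq` row-wise into `depth` rows and
    transmit column-wise; len(seq) must be a multiple of depth."""
    width = len(seq) // depth
    return [seq[r * width + c] for c in range(width) for r in range(depth)]

# Packets 0..11 with a depth-3 interleaver:
tx = interleave(list(range(12)), 3)
# A burst of 3 consecutive losses on the wire (tx[4:7]) hits original
# packets 5, 9 and 2: after de-interleaving, the burst has become
# isolated losses at least width-1 positions apart in playback order.
```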
  1. Weifeng Chen, Yong Huang, Bruno Ribeiro, Kyoungwon Suh, Honggang Zhang, Edmundo de Souza e Silva, Jim Kurose, and Don Towsley, Exploiting the IPID field to infer network path and end-system characteristics, Passive and Active Measurement Workshop (PAM'05), Boston, March 2005. [pdf] (BibTeX)
    Description []
    This work shows the many uses of the IPID field for network measurements.
    Among many other applications, it presents a technique to measure end-to-end one way delay differences that does
    not require the end host to be instrumented.
    This means that two hosts A and B can measure their one-way delay difference to a
    Windows 2000/XP end host C without the need to instrument host C -- or without the knowledge or consent of host C :-O .
  1. S. C. Coutinho, Bruno Ribeiro, On Holomorphic Foliations without Algebraic Solutions, Experimental Mathematics, v.10, n.4, p.529 - 536, 2001. [pdf]
    Description []
    This work uses Groebner bases, computed via the Buchberger algorithm, to determine whether a holomorphic foliation has an algebraic solution.



Over the years I have been fortunate to work closely with a set of exceptionally talented students.


Talks & Panels:

- Panels:

- Talks: