The Big Data Lab

Digital information about users is undoubtedly the oil of the new economy. Collecting, processing, and leveraging such data at large scale, a trend called Big Data, is the fuel that powers many online services like Facebook, Google, Amazon and Netflix. While Big Data has tremendous potential, it also raises severe concerns for users’ privacy. My main research interests focus on these two sides of Big Data.

Understanding, predicting and shaping human behavior

Big Data holds many promises, not only for the individual but also for the public good. At the individual level, Big Data can help users become more connected, productive, and entertained. At the societal level, Big Data creates tremendous opportunities in areas ranging from marketing to public health and urban planning. My work in this area has focused on social networks, trying to understand and characterize how topological features of social networks affect their overall performance. Such an understanding would then allow us to predict future performance and intervene when performance should be improved.

Privacy and security aspects of personal data

A series of privacy incidents over the last few months has focused public attention on how governments, businesses and other entities collect vast amounts of data about people’s lives and how that information is analyzed and used. Such concerns over privacy and data protection temper the tremendous opportunities and extraordinary benefits of Big Data. Finding the right balance between privacy risks and rewards remains a great challenge. My own work in this area has focused on investigating technological means of protecting personal data while keeping it as useful as possible.


The Big Data Lab is equipped with a computation cluster consisting of 8 powerful servers. Our cluster includes 128 cores, 1024 GB of RAM and 384 TB of secondary storage.


Are you the next generation of researchers and big data scientists?
We are currently recruiting outstanding graduate students and postdocs…


  • Continuous Monitoring of Parkinson's Disease Patients Using Wearables

    This research is in collaboration with Intel and the Michael J. Fox Foundation for Parkinson’s Research. The study includes about one thousand patients who wear smartwatches in their day-to-day environment to monitor their activity and symptoms. In addition, the patients receive a smartphone application that allows them to follow their measurements and their medication schedule. We use the collected data to examine the effect of medications and activity on motor symptoms and to predict that effect, in the hope of providing actionable insights to the PD community of patients and doctors.

  • Interactions and Location Privacy

    When we walk around the city, we cross paths with many other people every day. Such interactions might cause privacy leaks. In fact, if some of the people we meet are ‘corrupted’, they could disclose our location to a malicious entity. By analyzing a unique mobility dataset collected in a close community and containing absolute (GPS, Wi-Fi) and relative (Bluetooth) positioning, we found that the chances that a person is detected by another one are extremely high. This suggests that if a number of devices are tracked, we can get a good sense of the mobility of the whole community.
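The detection idea above can be illustrated with a minimal sketch. It assumes a hypothetical log of proximity encounters (for example, from Bluetooth scans) and estimates what fraction of the community is observed at least once by a set of tracked devices. The record format and all names are illustrative assumptions, not the format of the dataset used in the study.

```python
from typing import Iterable, Set, Tuple

# Hypothetical encounter record: (observer_id, observed_id, timestamp).
Encounter = Tuple[str, str, int]


def detected_fraction(encounters: Iterable[Encounter],
                      tracked: Set[str],
                      population: Set[str]) -> float:
    """Fraction of non-tracked people observed at least once by a tracked device."""
    targets = population - tracked
    if not targets:
        return 0.0
    seen = {observed
            for observer, observed, _ in encounters
            if observer in tracked and observed in targets}
    return len(seen) / len(targets)


# Toy data: five people, five encounters.
encounters = [
    ("a", "b", 1), ("a", "c", 2), ("b", "d", 3), ("c", "d", 4), ("e", "a", 5),
]
population = {"a", "b", "c", "d", "e"}

print(detected_fraction(encounters, {"a"}, population))       # 0.5
print(detected_fraction(encounters, {"a", "b"}, population))
```

Sweeping the size of the tracked set on real encounter data would show how quickly coverage of the community grows with the number of tracked devices.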

  • Quantitative Land Use Planning

    Land use planning is one of the core processes driving city development: by determining the quantities, spatial allocation, and mix of amenities, it plays a key role in shaping the character of urban areas and cities as a whole. The practice of land use planning has yet to capitalize on the predictive power of universal and quantifiable patterns emerging from the last 50 years of studies of cities as complex adaptive systems. We study a quantitative framework that incorporates data-driven methods into the urban development process.

  • Privacy-Aware Cyber Security

    Cyber attacks and security breaches are quickly becoming a serious threat to organizations, governments and individuals, and this trend is expected to continue. End-users are among the favorite targets of cyber attacks since they are considered the weakest link in the security loop. To counter this threat, cyber-security mechanisms increasingly track users’ devices, indirectly through network monitoring or directly with specialized software. As a result, users’ activities and data can be exposed and users’ privacy can potentially be compromised. This situation can lead users to evade cyber-security mechanisms altogether and lead regulators to limit the abilities of cyber-security technologies. We aim to evaluate methods for understanding the trade-off between privacy and cyber-security, and to propose solutions for balancing them. Specifically, we are studying how personal data stores can be used to provide cyber-security protection without exposing private data to a centralized server.

  • Scheduled Seeding for Viral Marketing

    One highly studied aspect of social networks is the identification of influential nodes that can spread ideas in a highly efficient way. The vast majority of works in this field seek a set of nodes that, if ‘seeded’ simultaneously, would maximize the information spread in the network through a viral infection process. However, only a few recent works have started to investigate the timing aspect, namely, finding not only which nodes should be seeded but also when to seed them. Moreover, recent works have shown that some of the underlying assumptions behind existing information spread models do not fit real-world scenarios. For example, most of these models rely on a large-scale viral infection process, while such processes have been shown to be quite rare in reality. In this work, we suggest a new model for information spread that better reflects real-world marketing scenarios, and a corresponding seeding heuristic which takes the timing aspect into account. By conducting a large set of empirical simulations, we show that under broad realistic assumptions, our suggested heuristic is able to improve the information spread by 50%-80% in comparison to state-of-the-art seeding heuristics.
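To make the difference between simultaneous and scheduled seeding concrete, here is a minimal simulation sketch. It uses the standard independent cascade model as a stand-in, which is an assumption: it is not the new spread model or the seeding heuristic proposed in this work. The graph, the schedules, and the activation probability `p` are all hypothetical.

```python
import random
from typing import Dict, List, Set

Graph = Dict[int, List[int]]
Schedule = Dict[int, List[int]]  # time step -> nodes to seed at that step


def simulate_scheduled_cascade(graph: Graph, schedule: Schedule, p: float,
                               steps: int, rng: random.Random) -> Set[int]:
    """Independent cascade in which seeds are injected according to a schedule."""
    active: Set[int] = set()
    frontier: Set[int] = set()
    for t in range(steps):
        # Inject the seeds scheduled for this step (already active nodes are skipped).
        new_seeds = {n for n in schedule.get(t, []) if n not in active}
        active |= new_seeds
        frontier |= new_seeds
        # Each newly active node gets one chance to activate each inactive neighbor.
        next_frontier: Set[int] = set()
        for node in sorted(frontier):
            for neighbor in graph.get(node, []):
                if neighbor not in active and rng.random() < p:
                    next_frontier.add(neighbor)
        active |= next_frontier
        frontier = next_frontier
    return active


def mean_spread(graph: Graph, schedule: Schedule, p: float,
                steps: int, runs: int, seed: int = 0) -> float:
    """Average number of active nodes over repeated simulations."""
    rng = random.Random(seed)
    total = sum(len(simulate_scheduled_cascade(graph, schedule, p, steps, rng))
                for _ in range(runs))
    return total / runs


# Toy path graph 0 - 1 - 2 - 3 with two alternative seeding plans.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
simultaneous = {0: [0, 3]}    # both seeds at step 0
scheduled = {0: [0], 2: [3]}  # second seed delayed to step 2

print(mean_spread(path, simultaneous, p=0.3, steps=5, runs=1000))
print(mean_spread(path, scheduled, p=0.3, steps=5, runs=1000))
```

On a graph this small the two plans behave almost identically; the sketch only shows where a schedule enters the simulation loop, so that different timing heuristics can be compared on larger networks.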

  • Incentivizing Safer Driving Behavior

    Car crashes take a tremendous toll on human life and the economy. In order to decrease risky driving, two complementary efforts are needed: (1) an effective measure for evaluating driving behavior and (2) an effective incentive scheme to encourage driving behavior changes. Previous studies have mainly focused on machine-automated feedback. To test several schemes for incentivizing safer driving behavior, we conducted a two-month field study in cooperation with a large public transportation company in Israel. The drivers were divided into three experimental groups: a control group, an individual incentive group (in which drivers were paid based on their improvement) and a social incentive group (in which drivers were paid based on their peers’ improvement). Analyzing the experiment results, we find that the two incentive groups presented an overall improvement of 25% in driving behavior, whereas the control group presented no difference. Moreover, our analysis reveals several surprising insights regarding the effectiveness of the two incentive schemes under different circumstances.

  • Online Signature Verification Using Wrist-Worn Devices

    Many systems nowadays, such as those used by banks and government offices, rely heavily on signature verification. With recent advancements in technology, many of these systems make use of dedicated ad-hoc digital devices such as tablets and smart-pens to capture, analyze and ultimately verify the signature. This paper suggests a novel verification system which makes use of wrist-worn devices, such as smartwatches and fitness trackers, instead of ad-hoc digital devices. The suggested method uses a set of known genuine and forged signatures, captured by the motion sensors available in a wrist-worn device, to train a machine learning classifier. Given an unknown signature, the resulting classifier is able to determine whether the signature is genuine or forged. In order to validate our method, we collected 1980 genuine and forged signatures from 66 different subjects, recorded simultaneously on both a tablet and a smartwatch. Applying our method to the collected dataset, we showed that it significantly outperforms two state-of-the-art tablet-based signature verification systems, obtaining an equal error rate (EER) of 2.36% and an area under the ROC curve (AUC) of 98.52%.
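For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute EER and AUC from classifier scores. It assumes that a higher score means "more likely genuine"; the scores are made-up toy values, and the function names are illustrative rather than taken from the paper.

```python
from typing import List, Tuple


def far_frr(genuine: List[float], forged: List[float],
            threshold: float) -> Tuple[float, float]:
    """False acceptance and false rejection rates when accepting scores >= threshold."""
    far = sum(s >= threshold for s in forged) / len(forged)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine: List[float], forged: List[float]) -> float:
    """Approximate EER: the operating point where FAR and FRR are closest."""
    thresholds = sorted(set(genuine) | set(forged))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    far, frr = min((far_frr(genuine, forged, t) for t in thresholds),
                   key=lambda rates: abs(rates[0] - rates[1]))
    return (far + frr) / 2


def auc(genuine: List[float], forged: List[float]) -> float:
    """Probability that a random genuine score beats a random forged one (ties count half)."""
    wins = sum((g > f) + 0.5 * (g == f) for g in genuine for f in forged)
    return wins / (len(genuine) * len(forged))


# Toy scores, not data from the study.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
forged_scores = [0.6, 0.3, 0.2, 0.1]

print(equal_error_rate(genuine_scores, forged_scores))  # 0.25
print(auc(genuine_scores, forged_scores))               # 0.9375
```

A lower EER and a higher AUC both indicate better separation between genuine and forged signatures.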

  • Population Dynamics in Israel

    Population dynamics can serve as an independent input to advanced models in epidemiology, human mobility, network analysis and other fields. Using call detail records (CDR) and data from Israel’s Central Bureau of Statistics, we aim to characterize the population dynamics in Israel in order to better understand, and later predict, population growth, clusters, relations and more.

  • Optimizing Vaccination Allocation for Pertussis in Israel

    Pertussis, also known as whooping cough, is a highly contagious bacterial disease that primarily affects infants. Globally, the disease is responsible for over 200,000 deaths annually in children under five. Despite vaccination against the disease, reported pertussis incidence has risen in developed countries over the past decade. Furthermore, despite a similar vaccination policy, the per capita incidence observed in Israel is 2-4 times higher than the incidence observed in the U.S. Therefore, revisiting existing vaccination policies on a country-specific basis is essential. The first part of this study aims at evaluating the actual extent of pertussis in Israel. To achieve this aim, we analyze reported cases of pertussis accumulated over nearly two decades in the surveillance systems of the Israeli Ministry of Health (IMoH) throughout the entire country. Using Markov chain Monte Carlo, we find that pertussis incidence has quadrupled and follows a four-year pattern of periodicity. Moreover, our findings could not be better explained by human factors such as misclassification or underreporting. The second part of this study aims to offer a total vaccination policy to reduce morbidity and mortality. We develop an age-structured continuous-time Markov process model of pertussis transmission in Israel. Our model integrates the primary IMoH data alongside a large dataset (over 2 TB) of private cellphone-based GPS traces to accurately capture mobility as well as the contact mixing patterns of the Israeli population. In our future work, we will finalize the construction of the transmission model and run simulation studies to optimize vaccine effectiveness in Israel. In light of our preliminary findings, and supported by our collaborators from the IMoH, our model is expected to shape pertussis immunization policy in Israel.
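As a rough illustration of what an age-structured continuous-time Markov transmission model looks like, the sketch below simulates a generic SIR process with the Gillespie algorithm. Everything specific here is an assumption: the SIR compartments, the two age groups, the contact matrix, and the rates `beta` and `gamma` are placeholders, not the structure or parameters of the pertussis model under development.

```python
import random
from typing import List, Sequence, Tuple


def gillespie_sir(S: Sequence[int], I: Sequence[int], R: Sequence[int],
                  contact: Sequence[Sequence[float]], beta: float, gamma: float,
                  t_max: float, rng: random.Random
                  ) -> Tuple[List[int], List[int], List[int]]:
    """Simulate an age-structured SIR Markov process until t_max or extinction.

    contact[a][b] is the (assumed) contact rate of age group a with group b.
    """
    S, I, R = list(S), list(I), list(R)
    groups = range(len(S))
    N = [S[a] + I[a] + R[a] for a in groups]  # group sizes stay constant
    t = 0.0
    while True:
        # Force of infection on each group, weighted by the contact matrix.
        force = [sum(contact[a][b] * I[b] / N[b] for b in groups if N[b] > 0)
                 for a in groups]
        infect = [beta * S[a] * force[a] for a in groups]
        recover = [gamma * I[a] for a in groups]
        rates = infect + recover
        total = sum(rates)
        if total <= 0:
            break  # no further events are possible
        t += rng.expovariate(total)  # exponential waiting time to the next event
        if t > t_max:
            break
        event = rng.choices(range(len(rates)), weights=rates)[0]
        if event < len(S):           # infection in group `event`
            S[event] -= 1
            I[event] += 1
        else:                        # recovery in group `event - len(S)`
            a = event - len(S)
            I[a] -= 1
            R[a] += 1
    return S, I, R


# Hypothetical two-group example (e.g. infants and everyone else).
contact = [[2.0, 1.0],
           [1.0, 3.0]]
print(gillespie_sir([50, 50], [1, 0], [0, 0], contact,
                    beta=0.5, gamma=0.2, t_max=50.0, rng=random.Random(1)))
```

In a model of this kind, mobility and contact data would inform the contact matrix, and a vaccination policy would be evaluated by moving individuals out of the susceptible compartment before or during the simulation.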

  • PDS-Based Recommender Systems

    Recommender systems have become extremely common in recent years and are used in a variety of domains such as movies and e-commerce. Existing recommender systems exhibit two major limitations: (1) Privacy: each service (‘application’) which implements a recommender system requires a database that contains information about all the users of the service. (2) Partial view: when recommending to users, each such service can rely only on the data that was collected by the service itself and does not have access to other data collected about the user. The Open Personal Data Store (OpenPDS) architecture was recently suggested for storing personal data in a privacy-preserving way. Inspired by the OpenPDS architecture, we suggest an architecture for content-based recommender systems that overcomes the two limitations mentioned above. The suggested architecture allows users to manage and gain control over their own data, and at the same time allows the recommender system to utilize the rich data collected about the user (potentially through other services) to produce more accurate recommendations in a privacy-preserving manner. We implement a prototype of the system and evaluate it in multiple recommender system settings, including web browsing data and public datasets. The evaluation focuses on how the recommendation process is enhanced by the use of multiple data sources about the user, and tests whether multi-source-based recommendations perform better than single-source-based recommendations.
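The core idea, scoring items inside the user's own data store so that raw data never reaches the service, can be sketched as follows. This is an illustrative toy and not the OpenPDS API or our prototype: the class, its methods, and the tag-counting profile are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


class PersonalDataStore:
    """Hypothetical user-controlled store; raw events never leave this object."""

    def __init__(self) -> None:
        self._events: Dict[str, List[List[str]]] = {}

    def add_events(self, source: str, tagged_items: Iterable[List[str]]) -> None:
        """Record consumed items (each a list of content tags) from one service."""
        self._events.setdefault(source, []).extend(tagged_items)

    def _profile(self, sources: Optional[Iterable[str]]) -> Counter:
        """Build a tag-frequency profile from the selected sources (all if None)."""
        allowed = None if sources is None else set(sources)
        profile: Counter = Counter()
        for source, items in self._events.items():
            if allowed is None or source in allowed:
                for tags in items:
                    profile.update(tags)
        return profile

    def recommend(self, candidates: Dict[str, List[str]], k: int,
                  sources: Optional[Iterable[str]] = None) -> List[str]:
        """Score candidates locally and return only the top-k item identifiers."""
        profile = self._profile(sources)
        scores = {item: sum(profile[tag] for tag in set(tags))
                  for item, tags in candidates.items()}
        ranked = sorted(scores, key=lambda item: (-scores[item], item))
        return ranked[:k]


pds = PersonalDataStore()
pds.add_events("movies", [["sci-fi", "space"], ["sci-fi"]])
pds.add_events("browsing", [["cooking"], ["cooking"], ["cooking"], ["cooking"]])

catalog = {"cookbook": ["cooking"], "space_doc": ["space", "sci-fi"], "novel": ["romance"]}
print(pds.recommend(catalog, k=1))                      # multi-source profile
print(pds.recommend(catalog, k=1, sources=["movies"]))  # single-source profile
```

The service only supplies the candidate catalog and receives ranked identifiers back, and the `sources` argument makes it easy to compare single-source and multi-source recommendations.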

  • Ride Sharing

    Ride sharing’s potential to reduce traffic congestion, CO2 emissions and fuel consumption was recently demonstrated via analysis of available mobility datasets. By analyzing a dataset of over 14 million taxi trips taken in New York City during January 2013, we find that if people are willing to experience a delay of up to five minutes, almost 70% of the rides could be shared (fig. A). Using the source-destination network of rides (fig. B), we identified seven network topological features that, combined, can effectively predict the benefit of ride sharing. We also observed that the number of rides varies considerably with the time of day and the day of the week (fig. D). Therefore, in future work we will investigate the time-related benefits of ride sharing, and also exploit different available datasets to suggest specific strategies for promoting ride sharing. [for further details see E. Shmueli et al., Ride Sharing: A Network Perspective. SBP 2015]
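The notion of "shareable under a maximum delay" can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis from the paper: travel times are straight-line distances at an assumed constant speed, only pairs of rides are combined, and pairs are matched greedily, so the result is a lower bound on what an optimal matching would achieve.

```python
import math
from typing import Iterator, List, Tuple

Point = Tuple[float, float]        # planar coordinates in km (assumption)
Trip = Tuple[float, Point, Point]  # (request time in minutes, origin, destination)

SPEED_KM_PER_MIN = 0.5  # assumed constant speed


def travel_time(a: Point, b: Point) -> float:
    return math.dist(a, b) / SPEED_KM_PER_MIN


def pair_delays(first: Trip, second: Trip) -> Iterator[Tuple[float, float]]:
    """Arrival delays (first, second) when `first` is picked up first, for both drop-off orders."""
    (s1, o1, d1), (s2, o2, d2) = first, second
    solo1 = s1 + travel_time(o1, d1)
    solo2 = s2 + travel_time(o2, d2)
    pickup2 = max(s1 + travel_time(o1, o2), s2)  # wait if the car arrives early
    # Drop the first passenger, then the second.
    a1 = pickup2 + travel_time(o2, d1)
    a2 = a1 + travel_time(d1, d2)
    yield a1 - solo1, a2 - solo2
    # Drop the second passenger, then the first.
    b2 = pickup2 + travel_time(o2, d2)
    b1 = b2 + travel_time(d2, d1)
    yield b1 - solo1, b2 - solo2


def shareable(t1: Trip, t2: Trip, max_delay: float) -> bool:
    """True if some pickup/drop-off order delays neither passenger by more than max_delay."""
    return any(max(delays) <= max_delay
               for order in ((t1, t2), (t2, t1))
               for delays in pair_delays(*order))


def shared_fraction(trips: List[Trip], max_delay: float) -> float:
    """Fraction of trips paired by a simple greedy matching."""
    matched = set()
    for i in range(len(trips)):
        if i in matched:
            continue
        for j in range(i + 1, len(trips)):
            if j not in matched and shareable(trips[i], trips[j], max_delay):
                matched.update((i, j))
                break
    return len(matched) / len(trips) if trips else 0.0


trips = [
    (0.0, (0.0, 0.0), (5.0, 0.0)),
    (1.0, (0.0, 0.0), (5.0, 0.0)),
    (0.0, (100.0, 100.0), (105.0, 100.0)),
]
print(shared_fraction(trips, max_delay=5.0))
```

Replacing the straight-line travel times with road-network estimates and the greedy pairing with a maximum matching would bring the sketch closer to a realistic analysis.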