Soumajyoti Sarkar

I am a fourth-year PhD candidate in Computer Science at Arizona State University, Tempe. I am fortunate to be advised by Paulo Shakarian as part of the CySIS lab at ASU. My PhD thesis focuses on measuring the impact of social network interactions using observational and experimental studies. I finished my undergraduate studies at the Indian Institute of Engineering Science and Technology (IIEST), Shibpur, after which I briefly worked at Deloitte as a consultant before deciding to pursue my PhD.

I interned at Nokia Bell Labs as part of the ENSA team at Murray Hill, New Jersey in Summer 2018. I am currently interning in the Twitter Search team working on machine learning for search relevance at their San Francisco HQ.

Please feel free to reach out to me by email (link below) about anything related to my research or potential collaborations.

Email  /  CV  /  Google Scholar  /  Github  /  LinkedIn


I'm interested in modeling the mechanisms of information spreading, or diffusion, on online platforms, and I now go beyond that to use these models for various predictive applications. For example, diffusion arising out of online social interactions can be used to reconstruct epidemic contact networks, to develop agent-based models for influence and rumor spreading, or as a sensor for monitoring the next cyber attack. I regularly use tools from time series analysis, graph analysis, convex optimization, and Bayesian modeling to tackle some of the challenges in modeling human behavior for such predictive tasks.

Recently, I have developed a strong interest in mechanism design for information diffusion that could be used to drive engagement on online platforms for crowdsourced lending and microfinance. On a lighter note, when I am not sitting in front of my computer, you can find me either on my jogging routine at Papago Park (my favorite place in Tempe) or at home going through one of the items on the list you can find here.

Selected Publications/Preprints

Impact of Social Influence on Adoption Behavior: An Online Controlled Experimental Evaluation
Soumajyoti Sarkar, Ashkan Aleali, Paulo Shakarian, Mika Armenta, Danielle Sanchez, Kiran Lakkaraju
IEEE/ACM Advances in Social Networks Analysis and Mining (Oral Presentation), Canada, 2019
pdf / Appendix / link

It is widely believed that the adoption behavior of a decision-maker in a social network is related to the number of signals it receives from its peers in the social network. It is unclear whether these same principles hold when the "pattern" by which they receive these signals varies and when potential decisions have different utilities. To investigate this, we manipulate social signal exposure in an online controlled experiment with human participants. Specifically, we change the number of signals and the pattern through which participants receive them over time. We analyze the effect through a controlled game in which each participant selects one option from six choices with differing utilities, one of which has the highest utility. We avoid network effects by holding the neighborhood network of the users constant. Over multiple rounds of the game, we observe the following: (1) even in the presence of monetary risks and previously acquired knowledge of the six choices, decision-makers tend to deviate from the obvious optimal decision when their peers make similar choices, (2) when the quantity of social signals varies over time, the probability that a participant selects the decision reflected by the social signals, and is therefore responsive to social influence, does not necessarily correlate proportionally with the absolute quantity of signals, and (3) an early subjugation to a higher quantity of peer social signals turned out to be a more effective strategy of social influence when aggregated over the rounds.

Leveraging Motifs to Model the Temporal Dynamics of Diffusion Networks
Soumajyoti Sarkar, Hamidreza Alvari, Paulo Shakarian
MSM Workshop, The Web Conference (WWW) (Oral Presentation), San Francisco, 2019
pdf / Appendix / link

Information diffusion mechanisms based on social influence models are mainly studied using the likelihood of adoption when active neighbors expose a user to a message. The problem arises primarily from the fact that, for the most part, the explicit information of who-exposed-whom among a group of active neighbors in a social network before a susceptible node is infected is not available. In this paper, we attempt to understand the diffusion process through information cascades by studying the temporal network structure of the cascades. In doing so, we accommodate the effect of exposures from active neighbors of a node through a network pruning technique that leverages network motifs to identify potential infectors responsible for exposures from among those active neighbors. We attempt to evaluate the effectiveness of the components used in modeling cascade dynamics, and especially whether the additional effect of the exposure information is useful. Following this model, we develop an inference algorithm, InferCut, that uses parameters learned from the model and the exposure information to predict the actual parent node of each potentially susceptible user in a given cascade.

Understanding and forecasting lifecycle events in information cascades
Soumajyoti Sarkar, Ruocheng Guo, Paulo Shakarian
Springer Social Network Analysis and Mining (SNAM) (Journal), 2017
pdf / Appendix / link

Most social network sites allow users to reshare a piece of information posted by a user. As time progresses, the cascade of reshares grows, eventually saturating after a certain time period. While previous studies have focused heavily on one aspect of the cascade phenomenon, specifically predicting when the cascade would go viral, in this paper, we take a more holistic approach by analyzing the occurrence of two events within the cascade lifecycle -- the period of maximum growth in terms of surge in reshares and the period where the cascade starts declining in adoption. We address the challenges in identifying these periods and then proceed to make a comparative analysis of these periods from the perspective of network topology. We study the effect of several node-centric structural measures on the reshare responses using Granger causality which helps us quantify the significance of the network measures and understand the extent to which the network topology impacts the growth dynamics. This evaluation is performed on a dataset of 7407 cascades extracted from the Weibo social network.

Predicting enterprise cyber incidents using social network analysis on the darkweb hacker forums
Soumajyoti Sarkar, Mohammad Almukaynizi, Jana Shakarian, Paulo Shakarian
International Conference on Cyber Conflict (CyCon U.S.) (Oral Presentation), Washington D.C., 2018
pdf / Appendix / link

With the rise in security breaches over the past few years, there has been an increasing need to mine insights from social media platforms to raise alerts of possible attacks in an attempt to defend conflict during competition. We use information from darkweb forums, leveraging the reply network structure of user interactions, with the goal of predicting enterprise cyber attacks. We use a suite of social network features on top of supervised learning models and validate them on a binary classification problem that attempts to predict whether there would be an attack on any given day for an organization. From our experiments, which use information from 53 darkweb forums over a span of 12 months to predict real-world organization cyber attacks for 2 different security events, we conclude that analyzing the path structure between groups of users is better than just studying network centralities like PageRank or relying on user posting statistics in the forums.

Using network motifs to characterize temporal network evolution leading to diffusion inhibition
Soumajyoti Sarkar, Ruocheng Guo, Paulo Shakarian
Springer Social Network Analysis and Mining (SNAM) (Journal), 2019
pdf / Appendix / link

Network motifs are patterns of over-represented node interactions in a network which have previously been used as building blocks to understand various aspects of social networks. In this paper, we use motif patterns to characterize the information diffusion process in social networks. We study the lifecycle of information cascades to understand what leads to saturation of growth in terms of cascade reshares, thereby resulting in expiration, an event we call "diffusion inhibition". In an attempt to understand what causes inhibition, we use motifs to dissect the network obtained from information cascades coupled with traces of historical diffusion or social network links. Our main results follow from experiments on a dataset of cascades from the Weibo platform and the Flixster movie ratings. We observe the temporal counts of 5-node undirected motifs from the cascade temporal networks leading to the inhibition stage. Empirical evidence from the analysis leads us to conclude the following about the stages preceding inhibition: (1) individuals tend to adopt information more from users they have known in the past through social networks or previous interactions, thereby creating patterns containing triads more frequently than acyclic patterns with linear chains, and (2) users need multiple exposures or rounds of social reinforcement to adopt a piece of information, and as a result information starts spreading slowly, thereby leading to the death of the cascade.

Mining user interaction patterns in the darkweb to predict enterprise cyber incidents
Soumajyoti Sarkar, Mohammad Almukaynizi, Jana Shakarian, Paulo Shakarian
Journal preprint

With the rise in security breaches over the past few years, there has been an increasing need to mine insights from social media platforms to raise alerts of possible attacks in an attempt to defend conflict during competition. In this study, we attempt to build a framework that utilizes unconventional signals from darkweb forums, leveraging the reply network structure of user interactions, with the goal of predicting enterprise-related external cyber attacks. We use both unsupervised and supervised learning models that address the challenges that come with the lack of enterprise attack metadata for ground truth validation as well as insufficient data for training the models. We validate our models on a binary classification problem that attempts to predict cyber attacks on a daily basis for an organization. Using several controlled studies on features leveraging the network structure, we measure the extent to which the indicators from the darkweb forums can be successfully used to predict attacks. We use information from 53 forums in the darkweb over a span of 17 months for the task. Our framework, applied to predicting real-world organization cyber attacks for 3 different security events, suggests that focusing on the reply path structure between groups of users, based on random walk transitions and community structures, performs better than relying solely on forum or user posting statistics prior to attacks.

Understanding Information Flow in Cascades Using Network Motifs
Soumajyoti Sarkar, Hamidreza Alvari, Paulo Shakarian
International Conference on Social Computing, Behavioral-Cultural Modeling (SBP) (Poster), Washington D.C., 2019
pdf / link

A growing set of applications consider the process of network formation by using subgraphs as a tool for generating the network topology. One of the pressing research challenges is thus to be able to use these subgraphs to understand the network topology of information cascades, which ultimately paves the way to theorize about how information spreads over time. In this paper, we make the first attempt at using network motifs to understand whether or not they can be used as generative elements for the diffusion network organization during different phases of the cascade lifecycle. In doing so, we propose a motif percolation-based algorithm that uses network motifs to measure the extent to which they can represent the temporal cascade network organization. We compare two phases of the cascade lifecycle from the perspective of diffusion: the phase of steep growth and the phase of inhibition prior to its saturation. Our experiments on a set of cascades from the Weibo platform and with 5-node motifs demonstrate that there are only a few specific motif patterns with triads that are able to characterize the spreading process, and hence the network organization, during the inhibition region better than during the phase of high growth. In contrast, we do not find compelling results for the phase of steep growth.

Less is More: Semi-Supervised Causal Inference for Detecting Pathogenic Users in Social Media
Hamidreza Alvari, Elham Shaabani, Soumajyoti Sarkar, Ghazaleh Beigi, Paulo Shakarian
CyberSafety Workshop, The Web Conference (WWW) (Oral Presentation), San Francisco, 2019
pdf / link

Recent years have witnessed a surge of manipulation of public opinion and political events by malicious social media actors. These users are referred to as "Pathogenic Social Media (PSM)" accounts. PSMs are key users in spreading misinformation in social media to viral proportions. These accounts can be either controlled by real users or automated bots. Identification of PSMs is thus of utmost importance for social media authorities. The burden usually falls to automatic approaches that can identify these accounts and protect social media reputation. However, lack of sufficient labeled examples for devising and training sophisticated approaches to combat these accounts is still one of the foremost challenges facing social media firms. In contrast, unlabeled data is abundant and cheap to obtain thanks to massive user-generated data. In this paper, we propose a semi-supervised causal inference PSM detection framework, SemiPsm, to compensate for the lack of labeled data. In particular, the proposed method leverages unlabeled data in the form of manifold regularization and only relies on cascade information.

A few things I am passionate about

Using Bayesian Variable Selection to uncover the causes behind funding speed on internet microfinance platforms: A Study on
Soumajyoti Sarkar
TPRC47: Research Conference on Communications, Information and Internet Policy (Conference), Washington D.C., 2019

Over the last couple of decades in the microfinance industry, there has been a spread of financial disintermediation on a global scale. Traditionally, even for a small supply of funds, banks would act as the conduit between the funds and the borrowers. It has now become possible to eliminate this with the advent of online communication via internet platforms like Kiva, Prosper, and LendingClub. Kiva, for example, works with Micro Finance Institutions (MFIs) in developing countries to build Internet profiles of borrowers with a brief biography, loan requested, loan term, and purpose. This has allowed funds to be disbursed from nations with liberal lending policies to nations where poor people find it difficult to raise money through banks. Additionally, it overcomes the bias created by loan disbursement through auctions on Prosper, which is unfavorably biased towards credit-trustworthy users and undermines new users.
We investigate the factors behind the funding speed (time between loan creation and loan funding) of loans using the dataset available from Kiva. The goal is to see whether there are region-agnostic factors that favor faster funding for projects compared to the history of the lenders and the borrowers.
To this end, we use Bayesian variable selection techniques that not only fit regression models to the data while simultaneously picking the most effective regressors, but also allow us to incorporate prior beliefs about the attributes of the projects and the users, which cannot generally be accomplished with standard regression models. Our experiments on the Kiva Loan dataset lead us to conclude that while both the country of the borrower and the borrower history impact the funding speed, it is mainly the sector of the loan request (such as agriculture, housing, or grocery stores), together with borrower history, that impacts the speed, much more than the region it belongs to. This counters the belief supporting the Lucas paradox of region biases being responsible for decreasing flatness. At the same time, it enables us to devise mechanisms to fight these biases on internet platforms by designing algorithms that put fair emphasis on the sector, distributing attention across all sectors rather than just a few.

Watching Roger Federer play

ATP Tour Schedule


External Reviewer, IJCAI 2018, 2019

External Reviewer, AAMAS 2016

External Reviewer, AAAI 2016


Graduate Student Instructor, CSE 485 Software Engineering Capstone, Fall 2018

Graduate Student Instructor, CSE 494 Artificial Intelligence for Cyber Security, Spring 2019
