Dataset for Identifying Influential Bloggers

Arizona State University, Computer Science and Engineering, Data Mining and Machine Learning


    Blogging has become a popular way for Web users to publish information on the Web. Bloggers write blog posts, share their likes and dislikes, voice their opinions, provide suggestions, report news, and form groups in the Blogosphere. Bloggers form virtual communities around similar interests. Activities in the Blogosphere affect the external world. One way to understand developments in the Blogosphere is to find influential blog sites. There are many non-influential blog sites, which form "the long tail". Whether or not a blog site is influential, it has influential bloggers. Inspired by the high impact of the influentials in a physical community, we study a novel problem of identifying influential bloggers at a blog site. Active bloggers are not necessarily influential. Influential bloggers can impact fellow bloggers in various ways. In this work, we discuss the challenges of identifying influential bloggers, investigate what constitutes an influential blogger, present a preliminary model that attempts to quantify a blogger's influence, and pave the way for building a robust model that allows for finding various types of the influentials. To illustrate these issues, we conduct experiments with data from a real-world blog site, evaluate multiple facets of the problem of identifying influential bloggers, and discuss unique challenges. We conclude with interesting findings and future work.


The dataset is freely available for academic and research use. If you use the dataset, please cite the following publication:
    Author = {Nitin Agarwal and Huan Liu and Lei Tang and Philip S. Yu},
    Booktitle = {Proceedings of the First ACM International Conference on Web Search and Data Mining (Video available at:},
    Title = {Identifying the Influential Bloggers},
    Pages = {207--218},
    Url = {},
    Year = {2008},

    The dataset and its description can be downloaded from here. The package contains a README file that explains the attributes of the dataset.





This project is sponsored by ONR grants N000140810477 (2008) and N00014-09-1-0165 (2009).

Created on 04/10/2009