annotation
working definition: what are splogs?
- Splogs are blogs where the articles are fake, and are only created for spamming. (There is frequent confusion between the terms "splog" and "spam in blogs".)(Wikipedia: Spam blog)
- Spam blogs (splogs) […] are weblog sites which the author uses only for promoting affiliated websites. The purpose is to increase the PageRank of the affiliated sites, get ad impressions from visitors, and/or use the blog as a link outlet to get new sites indexed. [Content of splogs] is often nonsense or text stolen from other websites. These blogs contain an unusually high number of links to sites associated with the splog creator which are often disreputable or otherwise useless websites. (Wikipedia: Spam blog)
- As with many powerful tools, blogging services can be both used and abused. The ease of creating and updating webpages with [blogging tools] has made it particularly prone to a form of behavior known as link spamming. Blogs engaged in this behavior are called spam blogs, and can be recognized by their irrelevant, repetitive, or nonsensical text, along with a large number of links, usually all pointing to a single site. (Blogger: About Spam Blogs)
motive: why do spammers create splogs?
People can earn revenue by generating traffic for commercial websites and receiving a per-click commission in return. When a website owner joins an affiliate program, he places a range of banners or text links on his site that point to affiliated websites. (In the case of Google AdSense, the links are dynamically delivered from Google search results.) When a user clicks on one of these links, the activity is tracked by the affiliate software. The owner gets paid when people click on the affiliated links (pay-per-click, or PPC) or when people make a purchase on the affiliated websites (commission), depending on the rules of the affiliate program. It is very easy and cheap to create a blog and join an affiliate program to generate revenue. Blogs created by spammers usually do not provide unique or useful content or services.
guide: how to recognize a splog?
In a typical splog, content is machine-generated in order to attract visitors through either search engines or individual blogs. We determine whether a blog is a splog based on how the blog's author or authors use it. Therefore, a blog that merely contains spam in the form of comment spam or trackback spam is not considered a splog.
Splogs typically show the following observable characteristics:
- Machine-generated content: A blog is considered a splog when we observe that its entries are generated automatically; such entries are usually nonsense, gibberish, repetitive, or copied from other blogs or websites.
- No value-addition: A blog is considered a splog when it offers its readers no useful or unique information at all, i.e. the blog adds no value for its readers. Some blogs use automatic content aggregation to provide a useful service such as podcasting; these are legitimate blogs because of the value they add.
- Hidden agenda, usually an economic goal: A blog is considered a splog when we observe commercial intention within the blog. Such intention is revealed by affiliate ads or out-going links pointing to affiliate websites.
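As an optional aid when checking for out-going links, the sketch below counts how many links in a page point to each host, so that a large number of links concentrated on one site stands out. It is only an illustration and not part of the annotation tool: the class and function names (`LinkCollector`, `link_host_counts`) are hypothetical, it uses only the Python standard library, and the final judgment remains with the annotator.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the host of every absolute <a href="..."> link (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Relative links have no host part and are skipped.
        host = urlparse(href).netloc.lower()
        if host:
            self.hosts.append(host)


def link_host_counts(html):
    """Return a Counter mapping each linked host to its number of links."""
    parser = LinkCollector()
    parser.feed(html)
    return Counter(parser.hosts)


if __name__ == "__main__":
    page = (
        '<a href="http://shop.example/a">deal</a>'
        '<a href="http://shop.example/b">deal</a>'
        '<a href="/archive">older posts</a>'
    )
    for host, count in link_host_counts(page).most_common():
        print(host, count)
```

A count like this says nothing about intent by itself; legitimate blogs also link heavily to a few sites, so it should only prompt a closer look.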
annotation tool
The tool has three panels:
- Left - Load & Label: lists a batch of 20 blogs sampled from the database (TREC data).
- Middle - Browsing Mode: lists the blog selected in the left panel and its entries that are archived in the database. The content of the blog or its entries can be displayed on the right panel in three modes:
  - www: fetches the blog directly from its hosting website,
  - source: shows the HTML source code, and
  - archived: fetches the blog homepage from the database.
- Right - Blog Homepage/Entry Browser: displays the content of the selected blog or its entry.
splog examples
- keyword stuffing, affiliated links (to itself)
- keyword stuffing, affiliated links (to specific webpage)
- hidden links
- link farm using blogrolls
- non-endorsement links in normal blogs
how to label?
For each sampled blog, we assign one of the following labels:
- [N] – Normal blog
- [S] – Splog
- [B] – Borderline
- [U] – Undecided
- [F] – Foreign
Ask the following questions:
- Does it make sense for a human to create such a blog and content? Is the content meaningful and useful for some readers? If the answer to either question is “Yes”, label it as [N].
- If the content seems fake, nonsensical, or not useful, ask: Is there any commercial intention within the blog? Commercial intention is revealed by affiliate ads or out-going links. If there is, check whether any spamming trick is being used. If you can find at least one trick, label it as [S]; otherwise, label it as [U].
- If you find some spamming tricks in a blog, but you think the content may have been uniquely generated by an author or may be useful for some readers, label it as [B].
- If you cannot decide, label it as [U].
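The questions above can be read as a small decision procedure. The sketch below is one possible reading, not a rule the annotation tool enforces. The field names and the function `suggest_label` are hypothetical. We assume the borderline check takes precedence over [N] when a spamming trick is present, and that fake content without commercial intention falls back to [U]. The [F] label is left out because the questions above do not say when to apply it.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """What an annotator noted for one blog (hypothetical field names)."""
    plausibly_human: bool    # does it make sense for a human to create it?
    useful_content: bool     # is the content meaningful and useful for some readers?
    commercial_intent: bool  # affiliate ads or out-going links seen?
    spam_trick_found: bool   # at least one spamming trick seen?


def suggest_label(obs: Observations) -> str:
    """Suggest one of N, S, B, U following the questions in this guide."""
    genuine = obs.plausibly_human or obs.useful_content
    # Assumption: tricks plus possibly genuine content means borderline.
    if obs.spam_trick_found and genuine:
        return "B"
    if genuine:
        return "N"
    # Content seems fake, nonsensical, or not useful from here on.
    if obs.commercial_intent and obs.spam_trick_found:
        return "S"
    # No trick found, no commercial intention, or simply unclear.
    return "U"


if __name__ == "__main__":
    example = Observations(
        plausibly_human=False,
        useful_content=False,
        commercial_intent=True,
        spam_trick_found=True,
    )
    print(suggest_label(example))
```

The suggestion is only a starting point; when the observations themselves are uncertain, [U] remains the appropriate label.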