Our friends operating search engines have to find techniques to eliminate the spam polluting their results. There is a lot of very active research going on in this field, some by the operators themselves, such as the ineffable Matt Cutts, but also by academics (although the difference between the two isn't always clear cut). Researchers at Stanford's InfoLab (where Brin and Page come from) have written the following article. As its title suggests, it surveys the main strategies used to counter unwanted spam on the web (particularly spam invading online communities):
Paul Heymann, Georgia Koutrika, Hector Garcia-Molina. Fighting Spam on Social Websites: A Survey of Potential Approaches and Future Challenges. IEEE Internet Computing Special Issue on Social Search, 11(6): 36-45 (2007).
So what do we discover in this interesting article (though it is not very informative from a technical viewpoint)? A classification of the different methods used to fight online spam. I've separated these into automatic and manual methods.
To demote (or downgrade) spam is to make sure it doesn't appear among the top results a search engine returns. This doesn't necessarily mean the page in question has been recognised as spam; it merely means there is enough suspicion about its contents to warrant downgrading it through a penalty. What kind of penalty can be imagined? Very simple things, from penalising .info domains to limiting contacts/connections from similar IPs, up to more elaborate setups like the famous TrustRank and SpamRank. These penalties are applied automatically; doing the same thing by hand would amount to deliberate spam detection, which is the next strategy.
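To make the TrustRank idea concrete, here is a minimal sketch (the toy graph, seed set and iteration count are all invented for illustration): trust is propagated from a hand-picked seed of good pages along outlinks, so pages that no trusted page reaches end up with near-zero scores and can be demoted.

```python
def trustrank(links, seeds, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for out in links.values() for p in out}
    # All initial trust mass sits on the hand-picked seed pages.
    seed_mass = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    score = dict(seed_mass)
    for _ in range(iters):
        # Restart mass goes back to the seeds, not uniformly to all pages
        # (this biased restart is what distinguishes TrustRank from PageRank).
        nxt = {p: (1 - damping) * seed_mass[p] for p in pages}
        for page, out in links.items():
            if out:
                share = damping * score[page] / len(out)
                for target in out:
                    nxt[target] += share
        score = nxt
    return score

# Toy web: "good" is a trusted seed linking to "ok"; "spam" only
# links to itself, so no trust ever flows into it.
web = {"good": ["ok"], "ok": ["good"], "spam": ["spam"]}
scores = trustrank(web, seeds={"good"})
# "spam" ends up with the lowest score and is a candidate for demotion.
```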
The idea here is to detect the spam pages within the whole corpus, which means being able to recognise a page as spam. The easiest way to do this is human: a moderator removes any page he or she examines and judges illegitimate. A more radical action is to remove all the pages belonging to an author identified as a spammer; an extreme one would be to remove any website with a name similar to that of an identified spammer.
Automatic techniques can also be used. The best known are based on content analysis (see the works of Ntoulas, Najork, Manasse and Fetterly), but there are also methods based on link analysis (detecting link farms; see the works of Wu and Davison, among many others) as well as on analysing web users' behaviour (here I'm afraid I don't have any references, but I assume it involves detecting publication frequency and the like).
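To give a flavour of content analysis, here is a toy check in the spirit of the features studied by Ntoulas et al. (the scoring formula and thresholds are invented for illustration, not taken from their paper): keyword-stuffed spam pages tend to repeat a few words heavily, so repetition and vocabulary diversity are cheap signals.

```python
from collections import Counter

def content_spam_score(text):
    """Higher score = more spam-like; a made-up heuristic, not a real classifier."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    top_fraction = counts.most_common(1)[0][1] / len(words)  # repetition of top word
    diversity = len(counts) / len(words)                     # vocabulary richness
    # Invented weighting: heavy repetition plus low diversity looks spammy.
    return top_fraction * (1 - diversity)

ham = "researchers at stanford published a survey of spam countermeasures"
spam = "cheap pills cheap pills cheap pills buy cheap pills now"
# The keyword-stuffed text scores higher than the ordinary sentence.
```

A real system would combine dozens of such features (word counts, compressibility, fraction of popular words) in a trained classifier rather than a single hand-tuned formula.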
The obvious problem with spam detection is that until a spam page is found, it is usually very well positioned, and it stays there until it is detected. This makes life sweet for the spammer and doesn't encourage him to give up the job.
Here the aim is to make it difficult for spam to get online in the first place and, if it does get through, to make spamming an expensive enterprise.
Manual techniques are very simple: make automatic interaction with the system near impossible, forcing the spammer to spend most of his time interacting with it by hand (or paying people to do it). It is also possible to have users make a micropayment for each action: painless for the normal user, but excruciating for a spammer who sends messages in great numbers.
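The "slow down automatic interaction" idea can be sketched with a token bucket (the capacity and refill rate below are invented): each account gets a small burst allowance and a slow refill, so a human posting occasionally is never blocked while a bot bulk-posting is refused almost immediately.

```python
import time

class TokenBucket:
    """Per-account rate limiter: actions spend tokens, tokens refill slowly."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=0.1)  # roughly one action per 10 s
burst = [bucket.allow() for _ in range(5)]
# The first three actions pass; the bot-like burst is refused after that.
```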
In the way of automated tests we have the all-pervading CAPTCHAs, as well as more amusing techniques like restricting access to a community (limiting the number of users, setting conditions for taking part, etc.) and maximising the personalisation available to each user. I like this last technique a lot, because the reasoning behind it is that if each user can personalise his or her page as much as they like, there's no more room for spammers: adapting to each individual page would be too much like hard work.
Briefly, then: the article is a pleasant and readable synthesis of the methods used to fight spam. Nothing new, but I encourage you to take a look at it (it contains no difficult mathematical formulas).