Abstract:
This work presents a mechanism for detecting Web Spam at large scale, using a distributed architecture based on the MapReduce paradigm for parallel processing and Support Vector Machines (SVM) as the learning algorithm for classification. Web Spam, that is, the unjustified assignment of relevance to pages on the Web, has become a widely studied topic, since the parties involved, search engines on one side and the users who request information from them on the other, can benefit or be harmed depending on how this issue is handled. Our solution offers an alternative for detecting Web Spam pages that combines the MapReduce programming model, implemented with Hadoop, with a cascade model of SVMs, running on Amazon Web Services, which provide a practical and inexpensive way to process large quantities of information in the cloud.