Two decades ago, few could have imagined that search engines would play as crucial a role in facilitating business as they do today. Search engines are now used to find every conceivable kind of information about places, things, people, and more. Most searches are made for entertainment, learning, business, and similar everyday purposes. Some searches, however, are made with a serious purpose, and their outcomes may influence significant decisions such as a capital purchase, an entrepreneur’s quest for an acquisition target, or a choice in an individual’s life. Search engines have become trusted guides that enhance the online experience, rather than mere signposts pointing a lost trekker in the right direction on the internet. Research carried out in 2005 on internet use found that the number of searches had increased and that most internet users were satisfied with the outcomes of their searches. A similar survey conducted in 2008 indicated that approximately 50% of internet users used search engines on a typical day (Kent, 2012).
Search engines are no longer seen as mere convenient informational tools. They have become powerful agents of transformation, making the business and educational environments more reliable and competitive. These environments have in turn faced new challenges and opportunities in various sectors (Sarker, 2002).
Challenges Facing Search Engines
Much web content is deliberately designed to manipulate its placement in the rankings of different search engines. The resulting content is referred to as spam. Traditional information retrieval (IR) collections did not contain spam. Owing to this, only limited research has been carried out on spam. Web search engines, on the other hand, have consistently developed and improved approaches for identifying and fighting spam. As search engines have developed, new spam methods have been devised in response. Web search engines do not publish their anti-spam techniques, so that spammers cannot circumvent them.
Historically, the trend has been a continual increase in the use of various kinds of spam. Some challenging research problems remain, and they involve developing spam-resistant algorithms. Web search engines are affected by various forms of spam, including text spam and link spam (Day, 2011).
All search engines assess the material contained in documents and rank each document according to the query posed to the system. Text spam modifies the text of a document so that web search engines classify it as relevant to the reader’s request. One way a spammer can improve a document’s ranking is to concentrate on a small set of keywords and try to improve the perceived relevance of the document for those keywords. The added text is often presented in a very small font or made invisible. Another method is to increase the number of keywords, so that the search engine sees the document as relevant to more queries. Though crude, including a dictionary at the bottom of the page increases the probability that the page is returned for obscure queries. A further technique is to place text in a different field of the web page so that it appears to be the key topic of the page (Ryals, 2008).
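The keyword-concentration technique described above can be illustrated with a rough density check. This is a minimal sketch, not a description of how any real search engine works: the function names and the 0.25 threshold are assumptions chosen for illustration, and a practical check would at least ignore common stop words.

```python
import re
from collections import Counter


def keyword_density(text: str, top_n: int = 3) -> list[tuple[str, float]]:
    """Return the top_n most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    return [(word, count / total) for word, count in counts.most_common(top_n)]


def looks_stuffed(text: str, threshold: float = 0.25) -> bool:
    """Flag text whose most frequent word exceeds the (hypothetical) threshold."""
    density = keyword_density(text, top_n=1)
    return bool(density) and density[0][1] > threshold


# Illustrative inputs, invented for this sketch.
STUFFED_TEXT = "cheap cars cheap cars cheap cars buy cheap cars"
NORMAL_TEXT = (
    "Search engines rank documents by estimating how relevant "
    "each one is to a query"
)
```

A single-word density is easy to evade, which is consistent with the point made above that spam methods change in response to detection.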
The advent of link evaluation by search engines has been accompanied by efforts of spammers to manipulate link evaluation systems. The most common approach is for the author to put a link farm as a footnote on every page of the site. The aim is to manipulate search engines that use raw counts of incoming links to determine the significance of a web page. Because such links can be spotted easily, more random linkage and more advanced techniques such as pseudo web-rings can be applied.
A problem with link farms is that they distract the reader, because they sit on pages that may also contain legitimate content. Doorway pages are a more sophisticated type of link farm. These pages consist of links and are built to be reached through search engines. Doorway pages are usually made up of numerous links, several of which may lead to the same page. Both doorway pages and link farms are most effective when link evaluation is sensitive to the number of links. Techniques that focus on the quality of links are much less vulnerable to these methods.
Cloaking involves serving different content to the search engine than to users. The search engine is thereby deceived about the content of the web page and scores the page in a way that does not reflect what the user sees. In some instances, cloaking may be used to assist search engines, by giving them access to one specific form of the content on web pages that offer content in multiple forms. However, cloaking may also be used to mislead search engines, allowing the web author to obtain the benefits of text and link spam without inconveniencing the readers of the page.
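The contrast between counting links and weighing their quality can be sketched on a toy graph. The text does not name a particular quality-based technique, so the PageRank-style score below is an assumption brought in for illustration, and the page names, damping factor, and iteration count are likewise hypothetical.

```python
def raw_inlink_counts(graph: dict[str, list[str]]) -> dict[str, int]:
    """Count incoming links per page, ignoring who the links come from."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simple power-iteration PageRank; pages without out-links spread evenly."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: "good" is linked from pages that are themselves linked to,
# while "spam" is linked only from four farm pages that nothing links to.
example_graph = {
    "A": ["B", "good"],
    "B": ["A", "good"],
    "C": ["A", "good"],
    "good": ["A", "B"],
    "spam": [],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "farm3": ["spam"],
    "farm4": ["spam"],
}

inlinks = raw_inlink_counts(example_graph)
scores = pagerank(example_graph)
```

In this toy graph, `spam` has four incoming links against three for `good`, yet the PageRank-style score places `good` above `spam`. The farm still raises the score of `spam` somewhat, so the sketch shows reduced sensitivity rather than immunity.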
Defence against Spam
Generally, search engines defend against text spam with empirical approaches. For example, it was once common for web sites to write text in white on a white background, so that the text reached the search engine while the site user was not affected by it. Link spam, on the other hand, has specific patterns that are easy to detect, though these patterns may mutate in much the same way as link spam detection techniques do. Further approaches to discovering link spam are therefore needed. One such approach is to apply a global analysis of the web rather than a local analysis of individual pages. Cloaking, by contrast, can only be discovered through crawling: the site is fetched first with an HTTP client that it takes to be a search engine, and then with a simple client that it takes to be an ordinary user (Sarker, 2002).
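One way to read the two-client check for cloaking is to fetch the same page under two client identities and compare what comes back. The sketch below keeps the network out of the picture by taking a `fetch` callable; the user-agent strings, the shingle size, the 0.5 threshold, and the two fake fetchers are all assumptions made for illustration.

```python
from typing import Callable


def word_shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(fetch: Callable[[str, str], str], url: str,
                  threshold: float = 0.5) -> bool:
    """Fetch url as a crawler and as a browser; flag very different responses."""
    crawler_view = fetch(url, "ExampleBot/1.0")  # hypothetical crawler identity
    browser_view = fetch(url, "Mozilla/5.0")     # generic browser identity
    similarity = jaccard(word_shingles(crawler_view), word_shingles(browser_view))
    return similarity < threshold


# Fake fetchers standing in for real HTTP requests.
def honest_fetch(url: str, user_agent: str) -> str:
    return "welcome to our shop for used cars and spare parts"


def cloaking_fetch(url: str, user_agent: str) -> str:
    if "Bot" in user_agent:
        return "free music downloads free movies free games free software"
    return "welcome to our shop for used cars and spare parts"
```

Legitimate pages also vary between requests (rotating advertisements, timestamps), so a similarity threshold of this kind would need tuning before it could be trusted.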
Control of Web Content in Search Engines
While spam is designed to mislead search engines, the web also contains text that misleads the users of a site. For instance, numerous web sites contain misleading information on history, health, and many other subjects. There are also sites whose information was once correct but is now out of date.
Although a great deal of study has gone into determining the relevance of web documents, the question of their quality and accuracy has received little attention. The web is not uniform in quality; therefore, methods for assessing the quality of web documents are essential for producing the best search results. One approach that has proved successful for assessing web quality is link analysis. However, any such technique still leaves areas for improvement and further research, given how massive a body of documents the web is (Ryals, 2008).
The Web Conventions
As the web has grown and developed, conventions for creating web documents and pages have evolved. Web search engines assume adherence to these conventions in order to improve search results. In particular, the conventions concern hyperlinks, anchor text, and META tags.
Because anchor text is, by convention, supposed to be descriptive, search engines can exploit it in their ranking functions. Search engines also consider whether the creator of a web page has included a link to another page, on the assumption that the designer included it in the belief that users of the source page would find the destination page relevant and interesting. Given the way many web pages are designed, this assumption is often valid. Nevertheless, there are prominent exceptions, such as link exchange programs, in which web page designers agree to link to each other in order to improve their ranking and connectivity, and advertisement links. Human users are adept at distinguishing links included for commercial purposes from those included for editorial reasons (Day, 2011).
The matter is complicated further because the utility of a link is not a binary function. For example, many web pages have links that let users download a version of Adobe Acrobat Reader. For a new user who does not have Acrobat Reader, such a link is more useful than for a user who has already downloaded the program. Similarly, most sites have a terms of service link at the bottom of every page. When a user first opens the site, the link might be very useful, but as the user moves on to other pages of the site, the link becomes less useful (Kent, 2012).
The other convention concerns the use of META tags. META tags are currently the primary means of including metadata in HTML. In theory, META tags may contain arbitrary content, though conventions have developed for what counts as meaningful material. META tags are important to search engines because the web designer uses them to describe the content of the page. Convention dictates that the content of a META tag is either a short textual summary of the page or a short list of keywords pertaining to the content of the page. Abuse of META tags is rampant. A web designer may include a summary of the whole site in the META tag, instead of a summary of the single page. The designer may also include keywords that are more general than the page warrants, for example using a META description of “car for sale” on a page that sells only one particular model of car (Ryals, 2008).
Generally, the accuracy of META tags is difficult for search engines to analyse, because the tags are not visible to web users and are therefore not constrained to be helpful to them. Nevertheless, a number of web page designers use META tags accurately. Therefore, if a web search engine can accurately judge the usefulness of the text provided in a META tag, the search results may be improved significantly. The same applies to other content that is not displayed on the page, such as the ALT text associated with the IMG (image) tag.
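One crude way to judge whether a META description can be trusted is to check how much of it is supported by the text of the page itself. The following is a minimal sketch using Python's standard `html.parser`; the class name, the `meta_support` measure, and the sample pages are assumptions for illustration, not an established scoring method.

```python
import re
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Collect the META description, image ALT texts, and page words."""

    def __init__(self):
        super().__init__()
        self.meta_description = ""
        self.alt_texts = []
        self.visible_words = []
        self._skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.meta_description = attributes.get("content") or ""
        elif tag == "img" and attributes.get("alt"):
            self.alt_texts.append(attributes["alt"])
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text outside script/style counts as page text (this includes the title).
        if not self._skip_depth:
            self.visible_words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def meta_support(html: str) -> float:
    """Fraction of META description words that also occur in the page text."""
    parser = PageTextParser()
    parser.feed(html)
    parser.close()
    meta_words = set(re.findall(r"[a-z0-9]+", parser.meta_description.lower()))
    if not meta_words:
        return 0.0
    return len(meta_words & set(parser.visible_words)) / len(meta_words)


# Invented sample pages for illustration.
HONEST_PAGE = """<html><head><title>Civic</title>
<meta name="description" content="Used Honda Civic for sale"></head>
<body><p>Used Honda Civic for sale, low mileage.</p>
<img src="civic.jpg" alt="Honda Civic"></body></html>"""

MISLEADING_PAGE = """<html><head>
<meta name="description" content="cheap flights hotels casino"></head>
<body><p>Used Honda Civic for sale.</p></body></html>"""
```

Word overlap says nothing about a description that is accurate but too general, such as “car for sale” on a page for one model, so a measure like this could only be one signal among several.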
While link analysis has become an important means of web-based information retrieval, there has been insufficient research into the different kinds of links on a page. Such research could try to distinguish commercial links from editorial links, or links that relate to meta-information about the web site from links that relate to the actual information on the web site. Existing research on link analysis is useful, since the designers of highly visible web sites are unlikely to contravene established web conventions. This alone, however, may not be reliable or sufficient. For example, a highly visible web page is more likely to carry advertisements than the average web page (Kent, 2012).
Understanding the nature of links is valuable because it allows a more sophisticated treatment of the associated anchor text. One potential approach would combine textual analysis of the anchor text with meta-information such as the URL of the link, together with information obtained from the web page itself.
Web search engines try to avoid crawling and indexing duplicate web pages, since such pages increase crawling time without adding new information to the search results. While individual page detection and mirror detection both attempt to solve the problem of duplicate pages, a simpler variant may reap most of the benefits while requiring little or no computational resources. Duplicate hosts are the largest source of duplicate pages on the web; therefore, solving the duplicate host problem may result in a significant improvement of the web crawler (Sarker, 2002).
In the last decade, the IR research community has started to work on a proper methodology for assessing web IR systems. The development of the TREC web track has contributed significantly to the understanding of the subject, but an appropriate methodology has not yet been attained. Thus, further research is needed on ways of solving the problems facing web search engines.
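The text does not say how duplicate hosts are to be found, so the following is only one hypothetical way to picture a cheap check: fingerprint a small sample of paths on each host and group hosts whose fingerprints match. The host names, the sample paths, and the in-memory `pages` mapping (standing in for fetched content) are all assumptions.

```python
import hashlib
from collections import defaultdict


def host_fingerprint(pages: dict[str, str], sample_paths: list[str]) -> str:
    """Hash the content found at a fixed sample of paths on one host."""
    digest = hashlib.sha256()
    for path in sample_paths:
        body = pages.get(path, "")
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(body.encode("utf-8")).digest())
    return digest.hexdigest()


def group_duplicate_hosts(hosts: dict[str, dict[str, str]],
                          sample_paths: list[str]) -> list[list[str]]:
    """Group hosts whose sampled pages are byte-for-byte identical."""
    groups = defaultdict(list)
    for host, pages in hosts.items():
        # Hosts with none of the sampled paths carry no evidence; skip them.
        if not any(path in pages for path in sample_paths):
            continue
        groups[host_fingerprint(pages, sample_paths)].append(host)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented crawl data for illustration.
SAMPLE_PATHS = ["/", "/about.html"]
CRAWLED = {
    "example.com": {"/": "Home page", "/about.html": "About us"},
    "www.example.com": {"/": "Home page", "/about.html": "About us"},
    "example.org": {"/": "A different home page", "/about.html": "About them"},
    "empty.example": {},
}

duplicates = group_duplicate_hosts(CRAWLED, SAMPLE_PATHS)
```

Exact hashing misses hosts that differ only in small details such as a date in the footer, which is where the more expensive page-level and mirror detection methods mentioned above would come in.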