In the internet search domain there is an obvious leader, Google. They have unquestionable dominance and provide the answer that the general public seems to currently be asking for. But how long can they continue to dominate?
I have not kept up with the latest statistics, but I am sure the growth of available content on the internet is accelerating. Last time I checked, estimates put the Google index at around 16 billion documents, and informed people estimated that 30% of that content was porn, 30% was duplicate content, and 40% was original work. Others have estimated growth trajectories such as the volume of content doubling every two years. If these figures are correct to an order of magnitude, then the challenges facing a comprehensive search engine like the big G are immense.
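To get a feel for what "doubling every two years" implies, here is a rough back-of-the-envelope sketch. It only restates the estimates quoted above (16 billion documents, a 40% original share); the function names are mine, and steady exponential growth is an assumption rather than a measured fact.

```python
# Illustrative arithmetic only: the inputs are the rough estimates quoted above.
INDEXED_DOCS = 16_000_000_000   # ~16 billion indexed documents
DOUBLING_PERIOD_YEARS = 2       # "doubling every two years"
ORIGINAL_SHARE = 0.40           # estimated share of original work


def projected_documents(years_ahead: float) -> float:
    """Project corpus size, assuming steady exponential doubling."""
    return INDEXED_DOCS * 2 ** (years_ahead / DOUBLING_PERIOD_YEARS)


def original_documents(total: float, share: float = ORIGINAL_SHARE) -> float:
    """Estimate how many documents are original work, given a fixed share."""
    return total * share


if __name__ == "__main__":
    for years in (2, 4, 10):
        total = projected_documents(years)
        print(f"+{years:>2} years: {total / 1e9:,.0f} billion documents, "
              f"{original_documents(total) / 1e9:,.1f} billion original")
```

Under these assumptions the corpus grows 32-fold in a decade, which is why even order-of-magnitude estimates are enough to make the point.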
The current approach to finding information taken by the three leading providers, Google, Microsoft and Yahoo, revolves around keyword searching. They take the text entered into the search box and use it to find documents that contain those words, or documents that they have algorithmically decided relate to those words. Most users enter a two- or three-word phrase into the search box, and they rarely use any logical operators. Statistical analysis of user behavior shows that advanced search functions, and especially the advanced search pages for building queries, are rarely used by the public. The search providers have made huge improvements in their ability to recognize “entities” in a user’s query, such as the names of people, places and companies. This has led to them placing links to authoritative sites that have been identified as being associated with those “things” at the top of the list. In addition, the search providers have now developed media-specific indexes and searches, so they can provide not just text searching for “things” but also image, audio and video searching. The past few major improvements released to general web search engines have revolved around integrating these meta-searching capabilities to present the top few hits from each index in a compelling user experience. Yahoo recently launched an enhanced search box that provides a rich set of suggestions for the user to select from; this is similar in concept to the Ask.com auto-suggestions recently launched via TV advertising. Both of these providers are subtly increasing the length of the query that users choose: by offering a list of phrases of 3 to 5 words, they let people quickly input a search phrase with 30 to 50% more text than the previous average.
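As a rough illustration of what plain keyword searching means, here is a minimal sketch of an inverted index with "all words must appear" matching. It is a toy under my own assumptions (the documents, function names and tokenizer are invented for the example) and is not how any of the providers above actually implement search.

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'()"


def tokenize(text):
    """Lowercase the text and split it into bare words."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents containing every word in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical three-document corpus.
DOCS = {
    1: "The barn swallow has long pointed wings.",
    2: "Spicy chicken wings are hard to swallow.",
    3: "Wing span of common garden birds.",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(keyword_search(index, "swallow wings"))
```

Both the bird page and the chicken-wings page match a two-word query like "swallow wings", because the engine is matching words, not meanings.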
Several search providers are attempting to launch “semantic” search engines. These engines attempt to find information based upon an understanding of the meaning of words and the relationships between phrases, entities and so on. For example, a swallow is a bird, and to swallow is also something animals do to eat their food. If the query is “swallow wings”, a semantic engine would in theory understand that this could be about the wings of the bird, and maybe also about eating hot spicy chicken at a bar. If the query was “swallow wing span”, the engine would isolate the entity more accurately, as it would understand span to be a measurement of the wing entity of the swallow (bird) entity. All this allows the search engine to generate a clear picture of the context of the user’s request. By understanding the full picture of the request, the search engine is able to reduce the scope of the search: it attempts to index the web based upon an underlying ontology (which would include birds, swallows (as birds) and wings (as they relate to birds)), and then uses this knowledge to reduce the set of indexed pages that are searched and so provide an accurate response. The problems of semantic search are, first, the computational expense required to analyze the web’s content and build the relationships to the ontology and, second, accurately extracting the true intent of the user’s query from the very few words they entered. It is worth checking out what Powerset.com is doing in this space; it is a huge challenge.
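The swallow example can be sketched with a toy ontology. Everything here is an assumption made for illustration: the sense labels, the related-term sets and the overlap scoring are mine, and a real semantic engine would be far more sophisticated than counting shared words.

```python
# Hypothetical mini-ontology: each sense of an ambiguous word lists
# terms that tend to appear alongside that sense.
SENSES = {
    "swallow": {
        "swallow (bird)": {"wing", "wings", "span", "nest", "migration"},
        "swallow (eating)": {"food", "wings", "chicken", "spicy", "drink"},
    }
}


def rank_senses(query, ambiguous_term="swallow"):
    """Score each sense by how many other query words it is related to."""
    terms = set(query.lower().split())
    context = terms - {ambiguous_term}
    scores = {
        sense: len(context & related)
        for sense, related in SENSES[ambiguous_term].items()
    }
    # Highest score first; ties are ordered alphabetically by sense label.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for query in ("swallow wings", "swallow wing span"):
        print(query, "->", rank_senses(query))
```

With "swallow wings" the two senses tie, so both the bird and the bar snack remain plausible; adding "span" tips the score clearly toward the bird.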
Vertical search engines attempt to improve search precision by predetermining the context of a query. A health search engine, for example, makes it very clear that you should use it to find information about subjects related to human health. By restricting the service provided by the platform, the service providers gain several advantages: the ontology of the searches is dramatically smaller than that of a comprehensive search, the number of documents likely to contain information related to the subject is far smaller, and the authoritative sources are probably well defined and known to the service provider.
Google, Microsoft and others are providing custom search tools which allow users to gather collections of sites and create their own specific index to search within. This attacks one of the problems of search: by reducing the collection of documents, the user will presumably be able to find what they want within that collection with greater precision, because there is less noise from other sites whose documents contain the keywords but are either off topic or non-authoritative. With this approach a human must maintain the index of sites to ensure that new content is acquired and old, redundant or inaccurate content is removed. The underlying search engine technology still restricts searching to keyword frequency analysis and PageRank-like algorithms, which unfortunately are unaware of the intent or topic that the custom search engine is meant to provide answers on.
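The core idea of a custom search collection can be sketched as a filter over ordinary keyword results. This is only my illustration of the concept, not the API of any provider's custom search product; the function name and the example.org style URLs are placeholders.

```python
from urllib.parse import urlparse


def restrict_to_sites(result_urls, allowed_sites):
    """Keep only results hosted on a curated site or one of its subdomains."""
    allowed = {site.lower() for site in allowed_sites}
    kept = []
    for url in result_urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == site or host.endswith("." + site) for site in allowed):
            kept.append(url)
    return kept


# Hypothetical keyword results for "swallow wings" (placeholder domains).
RESULTS = [
    "https://birds.example.org/swallow",
    "https://recipes.example.com/wings",
    "https://example.net/swallow-wings",
]

if __name__ == "__main__":
    # A human curator decides which sites belong in the collection.
    print(restrict_to_sites(RESULTS, ["example.org"]))
```

Note that the filter only narrows where the keywords are looked for. The ranking inside the collection is still plain keyword matching, which is the limitation described above.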
As the size of the internet corpus grows, the ability of comprehensive search engines to provide precise answers will become more and more challenged. The providers will look (and are constantly looking) for ways to identify the context of the user’s search request, whether through explicit declaration, prompted selection, or improved analysis of query semantics and past user behavior. Contextually sensitive search engines whose capabilities are tuned to a specific topic will eventually be capable of consistently outperforming the comprehensive search providers. Once the general public becomes aware of this, we should expect to see the large providers using their massive infrastructures to provide topically sensitive solutions. Early examples of this are cropping up frequently: Google’s Patent search and Microsoft Health Search are both great examples of what the next phase in search will be about. Google is exploiting its understanding of the patent information space to offer a service that is leagues ahead of the patent offices’ own, and it was built extremely quickly using their existing infrastructure and with limited modification to their underlying algorithms.
So I think the days of a single search box to find the answer to any question are numbered. This does not mean that Google won’t continue to dominate, but they will not be doing it through their original single search box in the years ahead.
Next I will continue by discussing some of my thoughts around the challenges created by how we are currently finding information on the web.