This is another in a series of notes from the 2009 KM World.
Enterprise Search Technologies was a
preconference workshop. These notes are done near real time, so please excuse any typos or spacing issues. The workshop was led by independent consultant Miles Kehoe
from New Idea Engineering, Inc. Here is the session description.
“This workshop, by a vendor
neutral consultant who has hands-on experience with a broad range of “out of
the box,” open source, commercial, and home grown solutions, provides an
overview of the enterprise search technology landscape. It reviews technologies
currently on the market; discusses pros and cons, strengths and weaknesses, and
specific requirements. Kehoe shares case studies that illuminate how search
technologies are leveraged in different types of organizations; and provides a
good introduction to and understanding of the enterprise search world.”
Miles said that great search
is conversational, open ended, flexible, and
smart. Conversational search
allows you to interact to focus your quest. This is especially important for
enterprise search, as search is much harder inside the enterprise. Never return just "no hits."
If the engine cannot find anything, it should ask the user for more. Every
search engine is built around a set of indices. This even applies to Google, which
creates an index through its spiders. Different search engines just add
different stuff around the indices.
Every search engine goes through a process. Some expose parts of it,
which gives you added flexibility to pull specific information out.
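To make the point about indices concrete, here is a minimal sketch of an inverted index, the basic structure most search engines build on. It is my own illustration, not something Miles showed, and the function names (`tokenize`, `build_index`, `search`) and sample documents are made up for the example. Real engines add ranking, stemming, security, and much more around this core.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing step: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search step: return ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Enterprise search is harder than web search",
    2: "Faceted navigation helps enterprise users",
    3: "Web spiders crawl pages to build an index",
}
index = build_index(docs)
print(sorted(search(index, "enterprise search")))
```

The slow work (reading and tokenizing every document) happens once in `build_index`, while `search` only does set lookups, which is why the heavy lifting belongs at indexing time.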
It used to be you got plain
search results pages in enterprise search like basic Google Web search. Now you
get what he called enterprise search 2.0 with visualization, navigation,
people, facets, etc. strung around the basic results.
There are two basic parts of
search: indexing and the actual search. It is better to spend time at indexing
(when people are not waiting) than at search time, when people are waiting. However, I
asked if the real time capabilities of Twitter were changing those
expectations. People want to see
stuff as soon as it exists.
Before he reviewed vendors,
Miles said it is not the technology but the methodology. It is how you
implement the search engine. I can agree with this.
As he started to review vendors
Miles mentioned Lucene/Solr, a free open source search engine that is behind a
number of search engines, including some commercial ones. It is Java based with
an Apache license, prolific documentation, many implementations, and you have
total control over search and relevance. However, there is some implementation
work required and it is hard to find answers. There are limited enterprise
support options. SearchBlox is packaged Lucene. Lucid Imagination is packaged
Solr.
Miles’ tier one vendors are:
Autonomy, Endeca, Exalead, Fast Search (the original independent version),
Google, Vivisimo. I have reviewed Exalead (see Exalead’s CloudView Offers Integrated Search
Capabilities and Exalead Provides Ability to Integrate with EMC Documentum). His criteria are:
broad enterprise presence, multi-platform search, market penetration, and clear
product vision. People like the
Google brand so they have a perception that Google enterprise search works
well. Not being in tier one is not necessarily bad; it just means a vendor does not meet all the
criteria. Other vendors I have reviewed that Miles also mentioned include Attivio
(see Attivio Aligns with Traction and Releases New Features), which is newer, and Recommind (see Recommind Provides Axcelerate eDiscovery 3.0 with
New Features), which is more vertically
focused.
Dates are important but web
servers provide bad data so it is hard to trust what you get. Miles gave the
example of a 1996 document appearing as new because it had just been
re-indexed.
The wifi started working so
Miles showed us Web sites with good search capabilities. Globrix is a UK real
estate site that uses FAST and you could see a lot of facets in home listings
such as number of bedrooms, bathrooms, price range, etc. Then we looked at Newssift
that displays sentiment on topics. We looked at Kosmix that provides an example
of exploratory search. It shows
things that are related and loosely related.
Next we covered supporting
technologies including document filters, connectors, social search, and
federation. Document filters are
part of the indexing process that converts binary source documents (PDF,
Office, etc.) into a stream of text for indexing. Connectors are utility tools to provide a clearly defined
interface between a search engine and external content. Some relate to indexing
and others to display. Connectbeam
is an example (see my reviews: Connectbeam Offers New Social Networking Application Integration
Possibilities).
Social search is a popular term
that applies to the capability to search corporate personal profiles to find
people in an organization with certain skills or experience. It typically
requires users to explicitly self-profile in order for searches to return
accurate results. Some products now track user behavior to implicitly associate
interests with users.
Federation refers to a program
that can dispatch user queries to one or more external data sources (search
engines, RDBMS systems, etc.) and present the combined results to the user.
Federation from unsecured resources is fairly easy. However, because relevance from each
source is calculated differently, it is sometimes difficult to integrate
results in a meaningful way.
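The relevance problem in federation can be sketched in a few lines. This is my own illustration under simple assumptions: each source returns (document, score) pairs on its own scale, and the scores are rescaled to 0..1 with min-max normalization before merging. The source names and scores are hypothetical, and min-max is just one possible approach, not one Miles recommended.

```python
def normalize(hits):
    """Rescale one source's scores to the 0..1 range (min-max normalization)."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    span = high - low
    # If every score is identical, treat all hits as equally relevant.
    return [(doc, 1.0 if span == 0 else (score - low) / span) for doc, score in hits]


def federate(results_by_source):
    """Merge hits from several sources into one list, best normalized score first."""
    merged = []
    for source, hits in results_by_source.items():
        for doc, score in normalize(hits):
            merged.append((score, source, doc))
    # The sort is stable, so ties keep the order in which sources were listed.
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged


# Hypothetical sources with very different scoring scales.
results_by_source = {
    "intranet": [("a", 12.0), ("b", 3.0)],
    "crm": [("c", 0.9), ("d", 0.5), ("e", 0.1)],
}
for score, source, doc in federate(results_by_source):
    print(f"{doc} from {source}: {score:.2f}")
```

Notice that the top hit from every source ends up with a score of 1.0, whether or not it is actually a good match. That is exactly why integrating federated results in a meaningful way is hard.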
Entity extraction recognizes
people, places, or things during indexing. In unsupervised extraction entities
are recognized through algorithms. In supervised extraction, the process is
seeded by human operators prior to processing.
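As a rough illustration of the seeded case, here is a minimal sketch in which a human-supplied list of known names is matched against text during indexing. This is a deliberately simple stand-in: real products generally use statistical or linguistic models, and the seed list and function name here are my own assumptions.

```python
import re

# Hypothetical seed list that a human operator would supply before processing.
SEED_ENTITIES = {
    "person": ["Miles Kehoe"],
    "organization": ["New Idea Engineering", "Google"],
}


def extract_entities(text, seeds=SEED_ENTITIES):
    """Return (entity_type, name) pairs for every seeded name found in the text."""
    found = []
    for entity_type, names in seeds.items():
        for name in names:
            # Match the whole name on word boundaries, ignoring case.
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append((entity_type, name))
    return found


sentence = "The workshop was led by Miles Kehoe of New Idea Engineering."
print(extract_entities(sentence))
```

The extracted pairs would typically be stored alongside the document in the index so they can drive navigation or filtering later.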
Sentiment analysis recognizes
positive or negative sentiment algorithmically during indexing. It is easier to
tell positive sentiment than negative.
Results clustering groups sets
of documents into categories based on content. It looks like facets and entity
extraction; however, clustering can be done independent of the query. Clustering
is often used in search results to assist the user to discover additional
related terms and content.
Faceted search is the result of
assigning documents in a search result list into a pre-defined taxonomy-like
order. Unlike clustering, which can appear similar, facets are based on the
query and populate pre-defined classes of content (authors, location, etc.). Facets
are often used to encourage interaction with the user.
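Here is a minimal sketch of how facet counts might be computed over a result list, in the spirit of the real estate example from earlier. The field names, listings, and function names are hypothetical; the point is only that the facet classes are fixed in advance while the counts depend on the current query's results.

```python
from collections import Counter

# Pre-defined facet classes (hypothetical field names).
FACET_FIELDS = ["bedrooms", "city"]


def facet_counts(results, fields=FACET_FIELDS):
    """Count how many results fall under each value of each pre-defined facet."""
    counts = {field: Counter() for field in fields}
    for doc in results:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def refine(results, field, value):
    """Narrow the result list to documents matching one chosen facet value."""
    return [doc for doc in results if doc.get(field) == value]


# Hypothetical result list for a home listings query.
results = [
    {"id": 1, "bedrooms": 2, "city": "London"},
    {"id": 2, "bedrooms": 3, "city": "London"},
    {"id": 3, "bedrooms": 2, "city": "Leeds"},
]
counts = facet_counts(results)
print(dict(counts["bedrooms"]))
print([doc["id"] for doc in refine(results, "city", "London")])
```

Showing the counts next to each facet value is what invites the user to click and narrow the search.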
A key to having good search is
to monitor it over time after the initial implementation. Look at what is
happening and make corrections. Look at what people are searching for and
accommodate them. You need to pull together a diverse collection of skills to
have a great search function (e.g., business domain experts and corporate
librarians, beyond just technical skills).
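One simple way to look at what people are searching for is to tally the queries that return nothing. The sketch below assumes a query log of (query, hit count) pairs, which is my own simplification; actual log formats vary by product, and the function name and sample log are made up.

```python
from collections import Counter


def zero_hit_report(log, top_n=3):
    """Return the most frequent queries that produced no results."""
    # Normalize case and whitespace so near-identical queries are counted together.
    misses = Counter(query.strip().lower() for query, hits in log if hits == 0)
    return misses.most_common(top_n)


# Hypothetical query log: (query text, number of hits returned).
log = [
    ("vacation policy", 0),
    ("Vacation Policy ", 0),
    ("expense form", 12),
    ("org chart", 0),
]
print(zero_hit_report(log))
```

A report like this gives the domain experts and librarians a concrete list of gaps to fix, whether through synonyms, new content, or better metadata.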
Miles mentioned two blogs on
the topic that he writes: EnterpriseSearchBlog.com and
SearchComponentsOnline.com.