This is the first in a series of notes from the 2009 KM
World. Please forgive typos as this is real time or near real time. Fundamentals
of Enterprise Search was a preconference workshop. It was led by Avi
Rappoport,
Principal, Search Tools Consulting, and Editor, SearchTools.com. She is an
independent consultant not connected with a vendor. Here is the session
description.
“Search engines, big and small,
have certain standard elements and processes. The more you understand them, the
easier to tune them to solve your real information needs. This practical
overview provides a big picture view of how search fits within enterprise and
websites, and a focused introduction to search technology and user experience.
Elements of search covered include robot spiders, database connectors and other
tools for locating content, indexing issues, query parsing, retrieval, relevance
ranking, and designing usable search interfaces. The workshop addresses common
search problems and solutions, security issues, languages, new interface
elements, important (and unimportant) features as well as providing tools for
choosing a search engine or evaluating an existing one.”
Avi began with the similarities and
differences between enterprise search and Web search. The differences include
limited scope, fewer meaningful hyperlinks for link analysis, security and
access control issues, content in databases, more control (for specifying value
ranking, etc.), and no search spam.
Next she covered text search
vs. database search. Text search indexes multiple content sources and uses
simple search commands instead of SQL. It offers flexible indexing and
retrieval, plus relevance ranking (a major issue). There are newer features such
as spell check, auto-complete, and facets. It works in the real world (e.g.,
eBay, Google).
Then she covered how
information architecture works with search. Information architecture is the art and science of
organizing information for access and use. It creates order and systems and
provides a standard vocabulary. Search can supplement information architecture
through user vocabularies and dynamic changes with new content.
KM and search are opposite ways
of approaching content. KM organizes stuff, and search finds it. There are two
main types of search: known-item search, with short queries and “good enough”
answers, and exploratory search for research purposes. The latter is what Darwin
Ecosystem excels at. It can help you discover relationships between content that
you did not know existed and were not explicitly looking for.
Avi said that search is an iceberg and people often see it as magic.
It is useful to index
everything as it is hard to know in advance what people want. Twitter has changed expectations for
real time indexing, even for intranets. Three minutes is a good expectation.
This is another impact of the consumer Web on enterprise computing.
Index security is an issue
(see my post: Attivio Aligns with Traction and Releases New Features). Without
the right security, people can see stuff they should not see.
You need to work with security people on search issues to avoid this, and the
search tool itself needs the right capabilities.
The first step is knowing what needs to be controlled; then you
can determine how to do this. Be
aware of privacy laws.
After you determine what to
secure, deal with access control. It is best to keep access control info as part
of the document store (what Attivio does; see the post mentioned above). There
are four levels of access control: one, access to the search engine; two,
collection-level access control (to portions of the search engine); three,
locked results as a teaser for a subscription; four, hit-level access control,
which links to an access control database at the point of display. The last is
the hardest to do, but it is useful when the rules are constantly changing.
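To make the hit-level idea concrete, here is a minimal sketch of my own (not something shown in the workshop). It assumes each hit carries the groups allowed to see it, copied from the document store, and that the user's groups are known at display time. The names Hit, allowed_groups, and filter_hits are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    doc_id: str
    title: str
    # Groups allowed to see this document (assumed to be stored with the document).
    allowed_groups: set = field(default_factory=set)


def filter_hits(hits, user_groups):
    """Keep only the hits that share at least one group with the user."""
    user_groups = set(user_groups)
    return [hit for hit in hits if hit.allowed_groups & user_groups]


hits = [
    Hit("1", "Cafeteria menu", {"everyone"}),
    Hit("2", "Salary bands", {"hr"}),
    Hit("3", "Product roadmap", {"engineering", "management"}),
]

# An engineer sees the menu and the roadmap, but not the salary bands.
visible = filter_hits(hits, {"everyone", "engineering"})
print([hit.title for hit in visible])
```

A real system would look the groups up in the access control database at display time rather than trusting a copy, which is what makes this level the hardest.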
Robot spiders start with a base
URL for all hosts. For each page they repeat this process: read the text into an
internal format, save the document in the cache, save the words into the index,
extract all links and check them against the rules, and add any new URLs to the
list. The process repeats over and over.
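The spider loop can be sketched in a few lines. This is only an illustration under my own assumptions: the PAGES dictionary stands in for real HTTP fetches, the only rule is "stay on the starting host", and the names PageParser and crawl are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# A tiny in-memory "site" standing in for real HTTP fetches (an assumption).
PAGES = {
    "http://intranet.example/": '<a href="/hr">HR</a> <a href="/it">IT</a> welcome page',
    "http://intranet.example/hr": '<a href="/">Home</a> <a href="http://other.example/">x</a> vacation policy',
    "http://intranet.example/it": '<a href="/hr">HR</a> password reset policy',
}


class PageParser(HTMLParser):
    """Collects the link targets and the words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(base_url):
    allowed_host = urlparse(base_url).netloc
    queue = deque([base_url])
    seen = {base_url}
    cache = {}  # url -> raw document
    index = {}  # word -> set of urls containing it

    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue
        cache[url] = html                      # save document in cache
        parser = PageParser()
        parser.feed(html)                      # read text into an internal form
        for word in parser.words:              # save words into the index
            index.setdefault(word, set()).add(url)
        for href in parser.links:              # extract links, check the rules
            link = urljoin(url, href)
            if urlparse(link).netloc != allowed_host:
                continue
            if link not in seen:               # only new URLs join the list
                seen.add(link)
                queue.append(link)
    return cache, index


cache, index = crawl("http://intranet.example/")
print(sorted(cache))
print(sorted(index["policy"]))
```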
There are common problems with
robots that SEO tries to avoid. Spiders can be disallowed by robots.txt or
the robots meta tag. Some also cannot handle URLs with ? and & (though all
spiders should handle these now), JavaScript, forms, interactive dynamic links,
session IDs that change, or multiple views of the same data (wikis and Lotus
Notes).
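Two of these problems are easy to show with Python's standard library. This is my own sketch, not from the workshop: the robots.txt rules, the spider name, and the list of session parameter names are all assumptions made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks every spider from /private/.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

blocked = rules.can_fetch("intranet-spider", "http://intranet.example/private/salaries.html")
allowed = rules.can_fetch("intranet-spider", "http://intranet.example/news.html")
print(blocked, allowed)

# Hypothetical session parameter names; real sites use their own.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonical_url(url):
    """Drop session parameters and sort the rest so one page gets one URL."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(query))))


print(canonical_url("http://intranet.example/doc?id=7&sid=abc123"))
```

Without something like canonical_url, every new session ID looks like a new page, and the spider indexes the same document again and again.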
External sources that have APIs,
like Twitter, can be brought into enterprise search. You might want to partition
this content so it does not clutter standard search. This is one way you can use
Darwin Ecosystem inside the enterprise to look outside it. Relevance is
relative. (This sounds obvious, but you need to remember it when creating a
relevance listing.)
Indexing multimedia needs to be
dealt with now. There can be internal and external metadata to support
this. Best to use human judgment
rather than automated systems. Automated systems can be a starting point but
they need to be fine-tuned by people. Speech to text and other automated
capabilities are still buggy.
Stop words are common or
ubiquitous terms. Traditionally you excluded them, but there are consequences,
such as copyright mentions. It is best to index everything, especially since
storage is much cheaper now. Avi gave a good example of the cost of excluding
stop words: searching for the phrase “whatever will be,” a song title. On the
other hand, you can get lots of irrelevant stuff. Another example is
the rock band The Who. Here is where relevance can help: you get a lot of
results if you include stop words, but you only need to see the best ten. I
tried this on Google and it did work, as the top results related to the band
even though there were 457,000,000 results. Avi said that Google may have set up
an exception for this term. She also said you might get different results on
wordpress.com.
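Here is a small sketch of why dropping stop words hurts these two queries. It is my own illustration; the stop word list is made up and far shorter than a real one.

```python
# A made-up, very short stop word list for illustration only.
STOP_WORDS = {"the", "who", "whatever", "will", "be", "a", "of", "and", "to"}


def tokenize(text, drop_stop_words):
    """Split text into lowercase words, optionally removing stop words."""
    words = text.lower().split()
    if drop_stop_words:
        words = [word for word in words if word not in STOP_WORDS]
    return words


for query in ("whatever will be", "The Who"):
    kept = tokenize(query, drop_stop_words=False)
    dropped = tokenize(query, drop_stop_words=True)
    print(query, "->", kept, "vs", dropped)
```

With stop words removed, both queries become empty, so there is nothing left to search for. Indexing everything keeps the words, and relevance ranking then has to sort out the flood of matches.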
Dealing with duplicate
documents can be complex. First you need to decide what is a duplicate and then
which is the primary if there are slight differences (e.g., typo
corrections). Exact match is easy;
similarity matching is harder but more useful, and worth it. It is best to
remove duplicates from the index and hide them from results unless
requested. This is what Google does.
You can create rules for handling duplicates, which is a good idea. The rules
need human supervision, but it is worth it.
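The difference between exact match and similarity can be sketched as follows. This is my own example, not a method from the workshop: exact match uses a hash of the text, and similarity uses overlapping three-word shingles compared with the Jaccard measure, which is one common technique among several. The sample sentences are invented.

```python
import hashlib


def fingerprint(text):
    """Exact-match check: identical text gives an identical hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """All runs of `size` consecutive words, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "The quarterly report is due on the first Monday of each month"
corrected = "The quarterly report is due on the first Monday of every month"
unrelated = "Parking permits can be renewed at the front desk"

print(fingerprint(original) == fingerprint(corrected))  # exact match misses the edit
print(jaccard(original, corrected))
print(jaccard(original, unrelated))
```

Where to set the similarity threshold, and which copy becomes the primary, are the rule decisions that need the human supervision mentioned above.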
Avi went through the search
process: search form, query parser, query engine (which goes to the inverted
index and back), relevance ranker (which goes to the
document store, gets stuff, and brings it back), formatter, and search results.
This all happens very fast now. Queries
come from many sources, not simply search fields. There are alerts, saved
searches, automated searches, geographic information systems, and others. You
need to balance relevance and completeness. You cannot have both.
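The stages of that process can be lined up in a toy pipeline. This is a sketch under my own assumptions, not how any particular engine works: the three documents are invented, queries are treated as AND of all terms, and the ranker just counts term occurrences.

```python
from collections import defaultdict

# A made-up document store: id -> text.
DOCUMENT_STORE = {
    1: "Travel expense policy for employees",
    2: "Expense report template",
    3: "Holiday travel schedule",
}


def build_inverted_index(store):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in store.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def parse_query(raw):
    """Query parser: lowercase and split into terms."""
    return raw.lower().split()


def retrieve(index, terms):
    """Query engine: ids of documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def rank(store, doc_ids, terms):
    """Relevance ranker: most term occurrences first, ties by id."""
    def score(doc_id):
        words = store[doc_id].lower().split()
        return sum(words.count(term) for term in terms)
    return sorted(doc_ids, key=lambda doc_id: (-score(doc_id), doc_id))


def format_results(store, ranked):
    """Formatter: numbered result lines."""
    return [f"{pos}. {store[doc_id]}" for pos, doc_id in enumerate(ranked, start=1)]


INDEX = build_inverted_index(DOCUMENT_STORE)


def search(raw_query):
    terms = parse_query(raw_query)
    hits = retrieve(INDEX, terms)
    return format_results(DOCUMENT_STORE, rank(DOCUMENT_STORE, hits, terms))


print(search("travel"))
print(search("expense travel"))
```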
Relevance ranking algorithms
work differently. The most common is TF-IDF, or term frequency-inverse
document frequency: how often is the query word in the document, and how often
is the word in the index? There are others, but this is the most efficient. Look
at the title, metadata, and top of the document. Remember that relevance is task
specific. There is no such thing as objective relevance. You can never please
everyone. Search is more like berry picking than hunting: try different stuff
instead of locking on a single goal. Here is where Darwin Ecosystem can help
with correlations of different topics with target key words.
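For the curious, here is one simple way to compute a TF-IDF score. This is my own sketch of a textbook variant (term count divided by document length, times the log of total documents over documents containing the term). Real engines use many different variants, and the three documents are invented.

```python
import math
from collections import Counter

# Made-up documents: id -> text.
DOCS = {
    "a": "search engine relevance ranking for enterprise search",
    "b": "enterprise content management",
    "c": "enterprise search security and access control",
}


def tf_idf(term, doc_id, docs):
    """Score one term for one document with a basic TF-IDF variant."""
    words = docs[doc_id].lower().split()
    # Term frequency: how often the word appears in this document.
    tf = Counter(words)[term] / len(words)
    # Document frequency: how many documents in the index contain the word.
    containing = sum(1 for text in docs.values() if term in text.lower().split())
    if containing == 0:
        return 0.0
    idf = math.log(len(docs) / containing)
    return tf * idf


for doc_id in DOCS:
    print(doc_id, tf_idf("search", doc_id, DOCS), tf_idf("enterprise", doc_id, DOCS))
```

Note that "enterprise" scores zero everywhere because it appears in every document, which is the point of the inverse document frequency part: words that are everywhere tell you nothing about which document is relevant.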
Be sure to limit the user interface complexity. Google is a
great example. Use familiar user interface elements. Put search into navigation
so it appears everywhere. With auto-complete, use a drop-down menu of matching
words. Base this on search logs and use the 7 to
10 most popular in alphabetical order.
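The auto-complete advice translates almost directly into code. This is a sketch of my own: the search log is invented, matching is by simple prefix, and the function name suggestions is hypothetical.

```python
from collections import Counter


def suggestions(search_log, prefix, limit=10):
    """Most popular logged queries starting with prefix, shown alphabetically."""
    prefix = prefix.lower().strip()
    if not prefix:
        return []
    counts = Counter(query.lower().strip() for query in search_log)
    matching = [(query, n) for query, n in counts.items() if query.startswith(prefix)]
    # Pick the most popular first (ties broken alphabetically)...
    matching.sort(key=lambda item: (-item[1], item[0]))
    most_popular = [query for query, _ in matching[:limit]]
    # ...then display that short list in alphabetical order.
    return sorted(most_popular)


# A made-up search log.
log = [
    "expense report", "expenses", "expense report", "expense policy",
    "Expense Report", "holiday schedule", "expenses",
]

print(suggestions(log, "exp"))
print(suggestions(log, "exp", limit=2))
```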
In summary, with enterprise search you have much more
control over the capabilities of, and decisions about, your search than on
the Web. Make good use of it.