My notes from Lucene Revolution
My notes and brainstorming from the Lucene Revolution conference in Cambridge, MA
- automatically search related topics that are not syntactically related (CareerBuilder example)
- type-ahead populated by terms in the same cluster (NHS example)
- clustering can recommend topics for a taxonomy based on content, but the user-visible topic labels must be hand-written or hand-selected, so there needs to be a (periodic?) human task of mapping the clusters to the taxonomy
- some clustering algorithms (fuzzy k-means, LDA) allow for overlapping topics
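To make the "overlapping topics" point concrete, here is a minimal sketch of the soft-membership step used by fuzzy k-means (fuzzy c-means). It is not taken from any talk. The two 2-D centroids and the fuzziness parameter `m=2.0` are assumptions chosen for illustration, and a real system would run this over high-dimensional term vectors via a library such as Mahout.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Return the degree of membership of `point` in each cluster (sums to 1)."""
    dists = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to it (split evenly on ties).
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centroids))]
    exponent = 2.0 / (m - 1.0)
    # Standard fuzzy c-means update: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Hypothetical topic centroids, for illustration only.
centroids = [(0.0, 0.0), (4.0, 0.0)]
print(fuzzy_memberships((2.0, 0.0), centroids))  # halfway between the two topics
print(fuzzy_memberships((1.0, 0.0), centroids))  # mostly the first topic, partly the second
```

Unlike plain k-means, every document gets a weight for every topic, so a document can legitimately show up under more than one topic in the interface.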
Named Entity Recognition (a subtask of Information Extraction)
- Use a framework such as GATE to annotate parts of speech, etc., then train the system on a domain-specific vocabulary
- There is no out-of-the-box solution for this; we always need to implement something specific to our domain
- There are some freely available dictionaries (Wikipedia, WordNet, etc.)
- Semantic annotation can be dictionary-based (text file) or ontology-based (OWL format)
- Semantic search takes this to the next level by querying the resulting semantic data (SPARQL)
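As a rough illustration of the dictionary-based option, the sketch below tags text against a small in-memory dictionary. The dictionary entries, the entity type names, and the `annotate` helper are all hypothetical. In practice the dictionary would be loaded from a text file, and a framework such as GATE would do the matching with far more linguistic care.

```python
import re

# Hypothetical domain dictionary: surface form -> entity type.
GAZETTEER = {
    "lucene": "Software",
    "solr": "Software",
    "cambridge": "Location",
    "penn state": "Organization",
}

def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, matched_text, entity_type) for each dictionary hit."""
    # Longest entries first so multi-word terms win over their sub-terms.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

for hit in annotate("Solr talks in Cambridge"):
    print(hit)
```

The resulting annotations are exactly the kind of metadata that can be indexed as extra fields alongside the document text.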
All the power of an intelligent search interface hinges on metadata!
What metadata do we have for our content?
Approach the search interface as a conversation that helps us better understand what the user wants and helps the user better understand what they want (travel agency example, NHS example)
The more we know about a user, the better recommendations we can make
What do we know about a user?
Products / Projects
Natural Language Processing
GATE – The gorilla of open source frameworks for NLP
Ontotext – Major contributor to GATE, Bulgarian company w/ all kinds of cool semantic stuff, much of it freely available for use
LingPipe – One of the more attractive and popular commercial frameworks for NLP, but expensive ($9.5K/year production license)
Mahout – Machine Learning algorithms designed to be highly scalable and run on Hadoop
Weka – Open source framework for Machine Learning, more mature and comprehensive than Mahout, but more research-oriented
Mallet – Another research-oriented open source framework for Machine Learning, particularly good at text classification, topic modeling, and sequential tagging
Nutch – Open source Google-style web crawler + search engine
Neo4j – Graph database for storing nodes and the relationships between them
Amazon Mechanical Turk – Crowdsourcing marketplace for small human tasks such as data entry
Multidimensional analysis of an unfolding event, need to distill information for decision makers:
- text analysis
- social media
- audio/video analysis
- positional/geo data
Company that specializes in building ontologies
Specialists in multilingual search + indexing, commercial plugin for Solr, supports a ton of languages (no decent open source option available for non-English searching).
Also offer some commercial solutions for text analytics.
ability to search television archives based on closed-caption subtitles
CiteSeerX (Penn State)
have a ton of web crawlers, use OpenCalais for text analysis
loooots of talk about Hadoop