My notes from Lucene Revolution


My notes and brainstorming from the Lucene Revolution conference in Cambridge, MA



  • automatically search related topics that are not syntactically related (CareerBuilder example)
  • type-ahead populated by terms in the same cluster (NHS example)
  • clustering can recommend topics for a taxonomy based on content, but the user-visible labels for the topics must be hand-written or hand-selected; there needs to be a (periodic?) human task of mapping the clusters to the taxonomy
  • some clustering algorithms (fuzzy k-means, LDA) allow for overlapping topics
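To make the overlapping-topics point concrete, here is a minimal sketch of the membership step used by fuzzy k-means (also called fuzzy c-means). It assumes the centroids have already been found; the 2-D vectors, the topic labels, and the 0.3 threshold are made up for illustration and are not from any talk.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Return the degree of membership of `point` in each cluster (sums to 1)."""
    dists = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    # Standard fuzzy c-means membership; m > 1 controls how soft the clusters are.
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

def topics_for(point, centroids, labels, threshold=0.3, m=2.0):
    """Return every topic label whose membership reaches the (assumed) threshold."""
    weights = fuzzy_memberships(point, centroids, m)
    return [label for label, w in zip(labels, weights) if w >= threshold]

# Hypothetical toy data: two cluster centroids and their hand-written labels.
centroids = [(0.0, 0.0), (10.0, 0.0)]
labels = ["nursing", "software"]

print(topics_for((5.0, 0.0), centroids, labels))  # halfway between: both topics
print(topics_for((1.0, 0.0), centroids, labels))  # close to the first centroid only
```

A hard clustering such as plain k-means would force the halfway document into exactly one topic; the soft memberships are what allow it to show up under both.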

Named Entity Recognition (aka Information Extraction)

  • Use a framework such as GATE for annotating parts of speech, etc. The system then needs to be trained on a domain-specific vocabulary
  • There is no out-of-the-box (OOTB) solution for this; we always need to implement something specific to our domain
  • There are some freely available dictionaries and lexical resources (Wikipedia, WordNet, etc.)
  • Semantic annotation can be dictionary-based (text file) or ontology-based (OWL format)
  • Semantic Search takes this to the next level (SPARQL)
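As a rough illustration of the dictionary-based flavor of semantic annotation, the sketch below tags text against a small term list. It is a plain-Python stand-in, not GATE's API, and the dictionary entries and entity types are hypothetical; a real system would load them from a domain-specific text file.

```python
import re

# Hypothetical domain dictionary: lowercase surface form -> entity type.
GAZETTEER = {
    "lucene": "Software",
    "solr": "Software",
    "apache software foundation": "Organization",
    "cambridge": "Location",
}

def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, matched_text, entity_type) tuples in text order."""
    # Try longer entries first so multi-word terms win over their sub-phrases.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

for span in annotate("Lucene Revolution was held in Cambridge, MA."):
    print(span)
```

This only finds exact dictionary hits, which is why the notes above stress domain-specific vocabularies: the quality of the annotations is bounded by the quality of the dictionary.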


All the power of an intelligent search interface hinges on metadata!

What metadata do we have for our content?


Approach the search interface as if it were a conversation: it should help us better understand what the user wants, and help the user better understand what they want (travel agency example, NHS example)

The more we know about a user, the better recommendations we can make

What do we know about a user?

Products / Projects

Natural Language Processing

GATE – The gorilla of open source frameworks for NLP

Ontotext – Major contributor to GATE, Bulgarian company w/ all kinds of cool semantic stuff, much of it freely available for use

LingPipe – One of the more attractive and popular commercial frameworks for NLP, but expensive ($9.5K/year production license)

Succinct list of NLP technologies

Exhaustive list of NLP technologies

Machine Learning

Mahout – Machine Learning algorithms designed to be highly scalable and run on Hadoop

Weka – Open source framework for Machine Learning, more mature and comprehensive than Mahout, but more research-oriented

Mallet – Another research-oriented open source framework for Machine Learning, particularly good at text classification, topic modeling, and sequential tagging


Nutch – Open source Google-style web crawler + search engine

Neo4j – Graph database for storing nodes and the relationships between them

Amazon Mechanical Turk – crowdsourcing marketplace for small human tasks such as data entry

Presentation Notes


Multidimensional analysis of an unfolding event, need to distill information for decision makers:

  • text analysis
  • social media
  • audio/video analysis
  • positional/geo data
  • complex event processing (CEP)


Company that specializes in building ontologies


Specialists in multilingual search and indexing. They offer a commercial plugin for Solr that supports a ton of languages (there is no decent open source option for non-English search).

They also offer some commercial solutions for text analytics.


ability to search television archives based on closed-caption subtitles

CiteSeerX (Penn State)

have a ton of web crawlers, use OpenCalais for text analysis


loooots of talk about Hadoop

Written on May 30, 2012