http://projects.apache.org/

Hive - Mahout - UIMA - Nutch - Solr - Lucene - HTTP Server

Hive

Mahout

To explore

UIMA

Probably only useful when you run a very large distributed system and must manage many annotators in complex ways, where simple local pipes that each add meta-data would quickly become unmaintainable.
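For comparison, here is a minimal sketch of the kind of "simple local pipe" meant above, written in plain Python rather than with UIMA. The annotator names and the dictionary-based document are assumptions made up for illustration; they are not part of any UIMA API.

```python
from typing import Callable, Dict, List

# Hypothetical document representation: a plain dict carrying text plus meta-data.
Document = Dict[str, object]
Annotator = Callable[[Document], Document]


def annotate_tokens(doc: Document) -> Document:
    """Add a naive whitespace tokenization under the 'tokens' key."""
    doc = dict(doc)  # copy so each step leaves its input untouched
    doc["tokens"] = str(doc["text"]).split()
    return doc


def annotate_capitalized(doc: Document) -> Document:
    """Add capitalized tokens, a very rough stand-in for entity candidates."""
    doc = dict(doc)
    doc["capitalized"] = [
        token.strip(".,;:!?") for token in doc["tokens"] if token[:1].isupper()
    ]
    return doc


def run_pipe(doc: Document, annotators: List[Annotator]) -> Document:
    """Apply each annotator in order; every step only adds meta-data."""
    for annotator in annotators:
        doc = annotator(doc)
    return doc


result = run_pipe(
    {"text": "Nutch crawls pages near Brussels."},
    [annotate_tokens, annotate_capitalized],
)
```

With two annotators this stays readable; the ordering dependency (the second step needs `tokens` from the first) is exactly what becomes hard to maintain as the number of annotators grows.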

To do

  1. look for geolocation annotation services that detect place names as entities
    1. in particular for Cookbook:OpenLayersAPI, in conjunction with the Lucene spatial search feature
  2. compare with Seedea:Seedea/Services and Wikipedia bots
  3. explore Natural Language Processing (almost) from Scratch, March 2011
    1. see also my topicmarks "summary"

Nutch

Local usage

Resources explored

To do

  1. find list of newly registered domain names
  2. find a spam list to avoid wasting time and bandwidth
    1. also find lists of domains already indexed by the major search engines
      1. consider a popularity list like Alexa in order to focus on the long tail
  3. check http://www.quora.com/Nutch
  4. consider http://wiki.apache.org/nutch/RunningNutchAndSolr
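The filtering idea in items 1 and 2 could be sketched as a set difference computed before any URL is handed to the crawler. This is only an illustration: the function name, the four input lists, and the `.example` domains are hypothetical, and nothing here uses Nutch itself.

```python
from typing import Iterable, List


def crawl_candidates(
    new_domains: Iterable[str],
    spam: Iterable[str],
    already_indexed: Iterable[str],
    popular: Iterable[str],
) -> List[str]:
    """Keep newly registered domains that are neither spam, nor already
    indexed by major search engines, nor popular (to stay in the long tail)."""

    def normalize(domains: Iterable[str]) -> set:
        return {d.strip().lower() for d in domains if d.strip()}

    excluded = normalize(spam) | normalize(already_indexed) | normalize(popular)
    return sorted(normalize(new_domains) - excluded)


# Hypothetical inputs; real lists would be loaded from files or feeds.
seeds = crawl_candidates(
    new_domains=["Fresh.example", "spammy.example", "big.example", "niche.example"],
    spam=["spammy.example"],
    already_indexed=[],
    popular=["big.example"],
)
```

The resulting list could then be written out as a seed file, one URL per line.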

Solr

Local usage

  • pmWiki file format
  • table of source name/location/id prefix/categories/...
    • php interface to pmwiki
      • pmwiki search page
        • URLRewrite should be taken into account
          • keep existing links coherent
          • yet allow for upgrade/distribution
      • prepare query
      • generate result as php code
      • display result as html through formatted php result
      • extend result via php (level 1, 2, ... cf seeks proposal)
      • add learning search project (cf other wiki page)
    • add more source
    • synchronize sources via Crontab
    • make available via tomcat
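The "prepare query" and "generate result" steps above could look roughly like the following. The notes plan a PHP interface; this sketch uses Python only to show the shape of the exchange. The base URL, the `/select` handler path, and the `title` field are assumptions about a local setup, while `q`, `start`, `rows`, `wt`, and the `response.docs` / `response.numFound` structure are standard parts of Solr's query parameters and JSON response format.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to the actual deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query: str, start: int = 0, rows: int = 10,
                     base: str = SOLR_BASE) -> str:
    """Prepare a query URL for Solr's select handler, asking for JSON."""
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"


def extract_docs(payload: str) -> Tuple[int, List[Dict]]:
    """Pull the total hit count and the documents out of a JSON response."""
    response = json.loads(payload)["response"]
    return response["numFound"], response["docs"]


url = build_select_url("title:lucene")

# A hand-written sample response, not output captured from a real server.
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "wiki-1"}]}}'
found, docs = extract_docs(sample)
```

Fetching `url` (for instance with `urllib.request.urlopen`) and passing the body to `extract_docs` would give the data needed to render the HTML result page.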

Resources explored

To do

  1. check http://www.quora.com/Solr
    1. http://www.quora.com/Lucene
  2. consider PersonalInformationStream#Finished and the potential advantages of having a local equivalent

Lucene

  • using StandardAnalyzer, whose tokenizer is generated with JFlex (a lexical analyzer generator), instead of WhitespaceAnalyzer
  • Spatial Lucene
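To make the difference between the two analyzers concrete, here is a rough Python imitation. It is not Lucene code: WhitespaceAnalyzer really does split on whitespace only, but the "standard-like" function below merely approximates StandardAnalyzer (grammar-based tokenization, lowercasing, optional stop-word removal) with a regular expression and a tiny, made-up stop-word set.

```python
import re
from typing import List

# Tiny illustrative stop-word set; Lucene's own default list differs.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()


def standard_like_tokens(text: str) -> List[str]:
    """Approximation: lowercase, split on non-word characters, drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


sample = "The Quick brown-fox, jumps!"
plain = whitespace_tokens(sample)
standard = standard_like_tokens(sample)
```

With the whitespace approach a search for `quick` would miss the token `Quick`, and `jumps` would miss `jumps!`, which is the practical reason to prefer the standard analyzer for wiki text.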

Note that most of the content is actually under Solr, even though Solr is "only" an interface to the Lucene engine.

HTTP Server

  • configure a forward proxy (for Sylvain)
  • limit access
    • mod_access
      • Order Deny,Allow
      • Deny from all
      • Allow from 127.0.0.1
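How the three directives above combine is easy to get backwards, so here is a small Python model of the `Order Deny,Allow` rule: Deny directives are evaluated first, a matching Allow overrides them, and a client matching neither is permitted. The function names are made up for this sketch, and only the `all` keyword and IP or network specifications are handled (no host names or environment variables). In newer Apache versions these directives are provided by mod_authz_host (2.2) or mod_access_compat (2.4) rather than mod_access.

```python
import ipaddress
from typing import List


def matches(spec: str, client_ip: str) -> bool:
    """True if the client address falls under an Allow/Deny specification."""
    if spec.lower() == "all":
        return True
    network = ipaddress.ip_network(spec, strict=False)
    return ipaddress.ip_address(client_ip) in network


def order_deny_allow(client_ip: str, deny: List[str], allow: List[str]) -> bool:
    """Model of 'Order Deny,Allow': Allow wins over Deny, default is to permit."""
    denied = any(matches(spec, client_ip) for spec in deny)
    allowed = any(matches(spec, client_ip) for spec in allow)
    return allowed or not denied


# The configuration from the notes: deny everyone, then allow localhost.
local_ok = order_deny_allow("127.0.0.1", deny=["all"], allow=["127.0.0.1"])
remote_ok = order_deny_allow("192.168.1.5", deny=["all"], allow=["127.0.0.1"])
```

With this configuration only requests from the machine itself get through, which is a reasonable starting point before opening a forward proxy to specific addresses.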

Previously in Lighttpd#Apache

To do

  1. correct links previously pointing to Lighttpd#Apache
  2. add previous links on Hadoop
    1. WithoutNotesSeptember10

Note

My notes on Tools gather what I know or want to know. Consequently they are not and will never be complete references. For this, official manuals and online communities provide much better answers.