http://projects.apache.org/
Hive - Mahout - UIMA - Nutch - Solr - Lucene - HTTP Server
Hive
Mahout
To explore
UIMA
Probably only useful when you have a very large distributed system and must manage many annotators in complex ways, where simple local pipes adding metadata would quickly become unmaintainable.
To do
- look for geolocation annotation services that detect place names as entities
- in particular for Cookbook:OpenLayersAPI and in conjunction with Lucene spatial search feature
- compare with Seedea:Seedea/Services and Wikipedia bots
- explore Natural Language Processing (almost) from Scratch, March 2011
- see also my topicmarks "summary"
Nutch
Local usage
- get the list of blogs from Person and crawl them
- http://localhost:8080/nutch-1.2/en/ (via Tomcat)
- crawled fewer than 10 sites with depth 3 and at most 50 links on 02/04/2011
/cygdrive/e/Downloads/nutch-1.2/bin/nutch crawl urls -dir crawl -depth 4 -topN 50
- encountering an error while trying to update through Tomcat
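
The crawl step above could be scripted along these lines. This is only a minimal sketch: the `nutch crawl` arguments are the ones from the command above, while the helper names `write_seeds` and `crawl_command`, the `seed.txt` file name and the example blog URLs are assumptions made for illustration.

```python
from pathlib import Path


def write_seeds(blog_urls, seed_dir="urls"):
    """Write one URL per line into <seed_dir>/seed.txt (hypothetical file name).

    Blank entries and duplicates are dropped; original order is kept.
    """
    unique = []
    for url in blog_urls:
        url = url.strip()
        if url and url not in unique:
            unique.append(url)
    directory = Path(seed_dir)
    directory.mkdir(parents=True, exist_ok=True)
    seed_file = directory / "seed.txt"
    seed_file.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return seed_file


def crawl_command(nutch="bin/nutch", seed_dir="urls", crawl_dir="crawl",
                  depth=4, top_n=50):
    """Build the argument list for the crawl command used above."""
    return [nutch, "crawl", seed_dir,
            "-dir", crawl_dir, "-depth", str(depth), "-topN", str(top_n)]


if __name__ == "__main__":
    # Placeholder URLs; the real list would come from the Person page.
    write_seeds(["http://blog-one.example/", "http://blog-two.example/"])
    print(" ".join(crawl_command()))
```

The argument list can be handed to `subprocess.run` once the path to `bin/nutch` is adjusted to the local install.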
Resources explored
To do
- find list of newly registered domain names
- find a spam list to avoid wasting time and bandwidth
- also find the lists indexed by major search engines
- consider popularity list like Alexa to focus on the long tail
- check http://www.quora.com/Nutch
- consider http://wiki.apache.org/nutch/RunningNutchAndSolr
Solr
Local usage
- pmWiki file format
- table of source name/location/id prefix/categories/...
- php interface to pmwiki
- pmwiki search page
- URLRewrite should be taken into account
- keep existing links coherent
- yet allow for upgrade/distribution
- prepare query
- generate result as php code
- display result as html through formatted php result
- extend result via php (level 1, 2, ... cf seeks proposal)
- add learning search project (cf other wiki page)
- add more sources
- synchronize sources via Crontab
- make available via tomcat
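
The "prepare query" and "generate result as php code" steps above could start from a small URL builder like the one below. It is a sketch under assumptions: the base URL is a guess (port 8983 is the one used by the stock Solr example, and a Tomcat deployment would likely differ), and `solr_select_url` is a hypothetical helper name. It relies on Solr's `wt=php` response writer, which returns the result as a PHP array literal.

```python
from urllib.parse import urlencode


def solr_select_url(query, base="http://localhost:8983/solr",
                    rows=10, start=0, wt="php"):
    """Build a Solr select URL; `base` is an assumed local address."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", wt)]
    return base + "/select?" + urlencode(params)


if __name__ == "__main__":
    # The PHP side would fetch this URL and format the returned array as HTML.
    print(solr_select_url("pmwiki search"))
```

Switching `wt` to `phps` would ask Solr for serialized PHP instead, which the PHP interface could read with `unserialize`.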
Resources explored
To do
- check http://www.quora.com/Solr
- http://www.quora.com/Lucene
- consider PersonalInformationStream#Finished and the potential advantages of having a local equivalent
Lucene
- using JFlex, a lexical analyzer generator, for StandardAnalyzer instead of WhitespaceAnalyzer
- Spatial Lucene
Note that most of the content is actually in Solr, even though it is "only" an interface for the Lucene engine.
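
To make the analyzer choice above concrete, here is a rough Python imitation of the two behaviours. It is not Lucene code and only approximates the real thing: StandardAnalyzer uses a JFlex-generated grammar with rules this regular expression does not reproduce, and the stop list below is a tiny illustrative one, not Lucene's default.

```python
import re

# Tiny illustrative stop list (assumption); Lucene's default set is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def whitespace_tokens(text):
    """Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation."""
    return text.split()


def standard_like_tokens(text):
    """Loosely like StandardAnalyzer: split on word boundaries, lowercase, drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


if __name__ == "__main__":
    sample = "The Lucene-based Search, explained."
    print(whitespace_tokens(sample))
    print(standard_like_tokens(sample))
```

With whitespace splitting, "Search," and "search" end up as different terms, which is the main practical reason to prefer the grammar-based analyzer for prose.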
HTTP Server
- configure a forward proxy (for Sylvain)
- limit access
- mod_access
- Order Deny,Allow
- Deny from all
- Allow from 127.0.0.1
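
The effect of the three directives above can be checked with a small model of the `Order Deny,Allow` rule: a request is refused only if it matches a Deny entry and no Allow entry, and anything matching neither is let through. The sketch below is an illustration, not Apache code; the function names are made up, and it only understands `all`, plain IP addresses and CIDR ranges, whereas Apache also accepts host names and partial addresses.

```python
import ipaddress


def matches(ip, rules):
    """Return True if `ip` matches any rule ('all', an address, or a CIDR range)."""
    address = ipaddress.ip_address(ip)
    for rule in rules:
        if rule == "all":
            return True
        if address in ipaddress.ip_network(rule, strict=False):
            return True
    return False


def allowed_deny_allow(ip, deny, allow):
    """Model of 'Order Deny,Allow': Allow overrides Deny, default is to allow."""
    if matches(ip, allow):
        return True
    if matches(ip, deny):
        return False
    return True


if __name__ == "__main__":
    deny, allow = ["all"], ["127.0.0.1"]
    for client in ("127.0.0.1", "192.168.1.5"):
        print(client, allowed_deny_allow(client, deny, allow))
```

With `Deny from all` and `Allow from 127.0.0.1`, only requests from the local machine reach the proxy, which is the intended restriction here.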
Previously in Lighttpd#Apache
To do
- correct links previously pointing to Lighttpd#Apache.
- add previous links on Hadoop
- WithoutNotesSeptember10
Note
My notes on Tools gather what I know or want to know. Consequently they are not and will never be complete references. For this, official manuals and online communities provide much better answers.