Fabien Benetou's PIM | Tools / ApacheProjects

http://projects.apache.org/

Hive - Mahout - UIMA - Nutch - Solr - Lucene - HTTP Server

Hive

Hive data warehouse system for Hadoop
Running Hive on Amazon Elastic MapReduce, Amazon Web Services Developer Community October 2009
Hive and Amazon Web Services on Hadoop Wiki
Hive vs. Pig by Lars George, Lineland October 2009
Facebook has the world's largest Hadoop cluster! by Dhruba Borthakur, HDFS Hadoop Blog 2010
- http://wiki.apache.org/hadoop/PoweredBy
- http://www.quora.com/What-are-the-largest-Hadoop-clusters-to-date

Mahout

To explore

Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko, Manning 2009
Collective Intelligence in Action by Satnam Alag, Manning 2008
Programming Collective Intelligence by Toby Segaran, O'Reilly Media 2007

UIMA

Probably only useful for a very large distributed system that has to orchestrate many annotators in complex ways, where simple local pipes adding metadata would quickly become unmaintainable.

WithoutNotesFebruary11#BuildingWatson
check http://uima.apache.org/downloads/sandbox/Solrcas/SolrcasUserGuide.html
- note that I already have an OpenCalais API key for http://uima.apache.org/sandbox.html#opencalais.annotator
- check if UIMA supports AWS#MTurk
- tried http://uima.apache.org/doc-uima-examples.html
- check http://uima.apache.org/sandbox.html#lucas.consumer
http://uima.lti.cs.cmu.edu components repository

To do

look for geolocalization annotation services with entity detection of place names
1. in particular for Cookbook:OpenLayersAPI and in conjunction with Lucene spatial search feature
compare with Seedea:Seedea/Services and Wikipedia bots
explore Natural Language Processing (almost) from Scratch, March 2011
1. see also my topicmarks "summary"

Nutch

Local usage

get the list of blogs from Person and crawl them
- http://localhost:8080/nutch-1.2/en/ (via Tomcat)
  - crawled <10 sites with depth 3, max 50 links, on 02/04/2011
    - /cygdrive/e/Downloads/nutch-1.2/bin/nutch crawl urls -dir crawl -depth 4 -topN 50
encountering an error while trying to update through Tomcat
- cf http://www.mail-archive.com/solr-user@lucene.apache.org/msg28393.html
  - no explicit solution
- no error via the default jetty interface on port 8983
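
A minimal sketch of the crawl preparation noted above, assuming the blog URLs are already available as a plain list (how they are extracted from the Person page is not covered here). The helper names `write_seed_file` and `crawl_command` are hypothetical; the only Nutch-specific part is the argument list, which mirrors the `bin/nutch crawl urls -dir crawl -depth 4 -topN 50` command above, with `urls` being a directory of text files holding one URL per line.

```python
from pathlib import Path


def write_seed_file(blog_urls, seed_dir="urls", filename="seed.txt"):
    """Write one URL per line into seed_dir/filename, skipping blanks and duplicates."""
    seed_path = Path(seed_dir)
    seed_path.mkdir(parents=True, exist_ok=True)
    seen = set()
    lines = []
    for url in blog_urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    target = seed_path / filename
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


def crawl_command(nutch_bin, seed_dir="urls", crawl_dir="crawl", depth=4, top_n=50):
    """Build the argument list for the Nutch 1.2 crawl command used in these notes."""
    return [nutch_bin, "crawl", seed_dir,
            "-dir", crawl_dir, "-depth", str(depth), "-topN", str(top_n)]
```

The returned list can be passed to `subprocess.run` from the Nutch directory; keeping it as a list avoids shell quoting problems with Cygwin paths.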

Resources explored

To do

find list of newly registered domain names
find a spam list to avoid wasting time and bandwidth
1. also find lists of sites already indexed by the major search engines
  1. consider a popularity list like Alexa's to focus on the long tail
check http://www.quora.com/Nutch
consider http://wiki.apache.org/nutch/RunningNutchAndSolr

Solr

Local usage

pmWiki file format
- consider adding previous revisions to allow search by revision
- PmWiki:PageFileFormat
- http://localhost:8983/solr/select/?q=test
- e:\webserver\htdocs\wiki\wiki.d\solr_load.bat
- e:\webserver\htdocs\wiki\mirrors\wiki.d\solr_load.bat
table of source name/location/id prefix/categories/...
- php interface to pmwiki
  - pmwiki search page
    - URLRewrite should be taken into account
      - keep existing links coherent
      - yet allow for upgrade/distribution
  - prepare query
  - generate result as php code
  - display result as html through formatted php result
  - extend result via php (level 1, 2, ... cf seeks proposal)
  - add learning search project (cf other wiki page)
- add more sources
- synchronize sources via Crontab
- make available via tomcat
  - now on http://localhost:8080/solr/admin/ following http://wiki.apache.org/solr/SolrTomcat#Single_Solr_app
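
A small sketch of the "prepare query" step listed above, written in Python rather than PHP purely for illustration. It builds the same kind of URL as the local test query (`http://localhost:8983/solr/select/?q=test`). The function name `build_select_url` and the extra parameters in the usage note are assumptions, not part of the existing setup.

```python
from urllib.parse import urlencode

# Default Jetty location taken from the notes; the Tomcat instance
# would use http://localhost:8080/solr instead.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, base=SOLR_BASE, **params):
    """Return a /select URL for the query; extra keyword arguments become URL parameters."""
    all_params = {"q": query}
    all_params.update(params)
    return base.rstrip("/") + "/select/?" + urlencode(all_params)
```

For instance `build_select_url("test", rows=5, wt="json")` would ask for five results in JSON, which the wiki search page could then format as HTML.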

Resources explored

http://lucene.apache.org/solr/tutorial.html
- very straightforward (if Java is properly installed)
#solr on freenode
Wikipedia:Apache Solr
http://wiki.apache.org/solr/SolrTomcat
- xampp 1.7.4 provides Tomcat
http://wiki.apache.org/solr/ExtractingRequestHandler for PDF and other files
Solr in 5 Minutes - Ignite Style Presentation by Mike Brevoort, 2010
Indexing Text and HTML Files Solr, the Lucene Search Server by Avi Rappoport, Lucid Imagination 2010
note that UIMA seems to be already present in the default Solr installation
index local files http://www.gossamer-threads.com/lists/lucene/general/68978
- mention of stream.file http://wiki.apache.org/solr/ContentStream
Integrating Solr: Ruby on Rails Integration by David Smiley and Eric Pugh, Packt Publishing 2010
- http://acts-as-solr.rubyforge.org/
- see also Ruby#Rails
  - in particular with websolr which provides easy usage with Ruby#Heroku
ActsAsSolrReloaded Demo by Diego Carrion, 2009

To do

check http://www.quora.com/Solr
1. http://www.quora.com/Lucene
consider PersonalInformationStream#Finished and the potential advantages of having a local equivalent

Lucene

using JFlex, a lexical analyzer generator, for StandardAnalyzer instead of WhitespaceAnalyzer
Spatial Lucene
- Location-aware search with Apache Lucene and Solr by Grant Ingersoll, Lucid Imagination 2010
  - in particular once geolocation data is included in the wikis

Note that most of the related notes are actually in the Solr section, even though Solr is "only" an interface for the Lucene engine.

HTTP Server

configure a forward proxy (for Sylvain)
- mod_proxy to allow a forward proxy with
  - ProxyRequests On
  - ProxyVia On
- mod_proxy_connect to allow SSL handshakes
- AllowCONNECT to allow CONNECT requests through port 443
limit access
- mod_access (renamed mod_authz_host in Apache 2.2)
  - Order Deny,Allow
  - Deny from all
  - Allow from 127.0.0.1
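
Put together, the directives above could look like the following fragment. This is a sketch assuming Apache 2.2 syntax with mod_proxy, mod_proxy_http and mod_proxy_connect already loaded. Wrapping the access rules in a `<Proxy *>` block is my assumption about where the restriction should apply, and the allowed address would have to be replaced by the client's.

```apache
# Forward proxy (assumes mod_proxy, mod_proxy_http, mod_proxy_connect are loaded)
ProxyRequests On
ProxyVia On

# Let CONNECT tunnels (HTTPS) go through port 443 only
AllowCONNECT 443

# Restrict who may use the proxy; an open forward proxy is easy to abuse
<Proxy *>
    Order Deny,Allow
    Deny from all
    Allow from 127.0.0.1
</Proxy>
```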

Previously in Lighttpd#Apache

To do

correct links previously pointing to Lighttpd#Apache.
add previous links on Hadoop
1. WithoutNotesSeptember10

Note

My notes on Tools gather what I know or want to know. Consequently they are not and will never be complete references. For this, official manuals and online communities provide much better answers.
