Apache Projects

Hive - Mahout - UIMA - Nutch - Solr - Lucene - HTTP Server
Hive
- Hive data warehouse system for Hadoop
- Running Hive on Amazon Elastic MapReduce, Amazon Web Services Developer Community, October 2009
- Hive and Amazon Web Services on Hadoop Wiki
- Hive vs. Pig by Lars George, Lineland October 2009
- Facebook has the world's largest Hadoop cluster! by Dhruba Borthakur, HDFS Hadoop Blog 2010
Mahout
To explore
- Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko, Manning 2009
- Collective Intelligence in Action by Satnam Alag, Manning 2008
- Programming Collective Intelligence by Toby Segaran, O'Reilly Media 2007
UIMA
Probably only useful for a very large distributed system in which many annotators have to be managed in complex ways, where simple local pipes that each add meta-data would quickly become unmaintainable.
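For contrast, the "simple pipes" alternative can be sketched in a few lines of Python. This is only an illustration and not UIMA code: the annotator functions, the document dictionary layout and the tiny place-name set are all made up for the example.

```python
import re

# Hypothetical mini gazetteer, only for this example.
PLACES = {"Paris", "Brussels", "Tokyo"}

def annotate_tokens(doc):
    """Add a list of word tokens to the document."""
    doc["tokens"] = re.findall(r"\w+", doc["text"])
    return doc

def annotate_places(doc):
    """Tag tokens found in the gazetteer as place names."""
    doc["places"] = [t for t in doc.get("tokens", []) if t in PLACES]
    return doc

def run_pipe(doc, annotators):
    """Apply each annotator in order; every step only adds meta-data."""
    for annotate in annotators:
        doc = annotate(doc)
    return doc

doc = run_pipe({"text": "Flying from Paris to Tokyo"},
               [annotate_tokens, annotate_places])
print(doc["places"])
```

With a handful of annotators this stays readable. Once annotators depend on each other's output, run on several machines or need versioned type systems, a framework such as UIMA starts to pay off.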
- WithoutNotesFebruary11#BuildingWatson
- check http://uima.apache.org/downloads/sandbox/Solrcas/SolrcasUserGuide.html
- note that I already have an OpenCalais API key for http://uima.apache.org/sandbox.html#opencalais.annotator
- check if UIMA supports AWS#MTurk
- tried http://uima.apache.org/doc-uima-examples.html
- check http://uima.apache.org/sandbox.html#lucas.consumer
- http://uima.lti.cs.cmu.edu components repository
To do
- look for geolocalization annotation services with entity detection of place names
- in particular for Cookbook:OpenLayersAPI and in conjunction with Lucene spatial search feature
- compare with Seedea:Seedea/Services and Wikipedia bots
- explore Natural Language Processing (almost) from Scratch, March 2011
- see also my topicmarks "summary"
Nutch
Local usage
- get the list of blogs from Person and crawl them
- http://localhost:8080/nutch-1.2/en/ (via Tomcat)
- crawled <10 sites with depth 3, max 50 links, on 02/04/2011
/cygdrive/e/Downloads/nutch-1.2/bin/nutch crawl urls -dir crawl -depth 4 -topN 50
- encountering an error while trying to update through Tomcat
- cf http://www.mail-archive.com/solr-user@lucene.apache.org/msg28393.html
- no explicit solution
- no error via the default jetty interface on port 8983
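The "get the list of blogs and crawl them" step could be prepared with a small Python helper. This is a rough sketch: the file name `seed.txt` and the function names are my own choices (as far as I know Nutch 1.x reads any text file found in the `urls` directory), and the command simply mirrors the one noted above.

```python
from pathlib import Path

def write_seed_file(blog_urls, urls_dir="urls"):
    """Write one URL per line to <urls_dir>/seed.txt, skipping blanks and duplicates."""
    seen, lines = set(), []
    for url in blog_urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    path = Path(urls_dir)
    path.mkdir(parents=True, exist_ok=True)
    seed = path / "seed.txt"  # hypothetical file name
    seed.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return seed

def crawl_command(nutch="bin/nutch", urls_dir="urls", crawl_dir="crawl",
                  depth=3, top_n=50):
    """Build the argument list for the Nutch 1.x crawl command."""
    return [nutch, "crawl", urls_dir, "-dir", crawl_dir,
            "-depth", str(depth), "-topN", str(top_n)]

print(" ".join(crawl_command(depth=4)))
```

The returned list can be handed to `subprocess.run` once the seed file exists.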
Resources explored
- http://trackgc.com/tr/resources/articles/NutchGuideForDummies.htm
- http://wiki.apache.org/nutch/FAQ#Will_Nutch_be_a_distributed.2C_P2P-based_search_engine.3F
To do
- find list of newly registered domain names
- find a spam list to avoid wasting time and bandwidth
- also find indexed lists by major search engine
- consider popularity list like Alexa to focus on the long tail
- check http://www.quora.com/Nutch
- consider http://wiki.apache.org/nutch/RunningNutchAndSolr
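Once such lists are found, filtering the candidate domains could look like the sketch below. It assumes three plain lists of domains or URLs (candidates, spam, popular); where they come from and how they are formatted is still open, so all names here are hypothetical.

```python
from urllib.parse import urlparse

def host_of(url_or_domain):
    """Return the lower-cased host, accepting bare domains or full URLs."""
    value = url_or_domain.strip().lower()
    host = urlparse(value if "://" in value else "http://" + value).hostname or ""
    return host[4:] if host.startswith("www.") else host

def long_tail(candidates, spam, popular):
    """Keep candidates that are neither known spam nor already popular."""
    blocked = {host_of(d) for d in spam} | {host_of(d) for d in popular}
    kept, seen = [], set()
    for item in candidates:
        host = host_of(item)
        if host and host not in blocked and host not in seen:
            seen.add(host)
            kept.append(host)
    return kept

print(long_tail(
    ["http://www.Example.org/blog", "spam.example", "big.example", "example.org"],
    spam=["spam.example"],
    popular=["http://big.example/"],
))
```

The result can be fed directly into the seed list for the crawler.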
Solr
Local usage
- pmWiki file format
- consider adding previous revisions to allow search by revision
- PmWiki:PageFileFormat
- http://localhost:8983/solr/select/?q=test
e:\webserver\htdocs\wiki\wiki.d\solr_load.bat
e:\webserver\htdocs\wiki\mirrors\wiki.d\solr_load.bat
- table of source name/location/id prefix/categories/...
- php interface to pmwiki
- pmwiki search page
- URLRewrite should be taken into account
- keep existing links coherent
- yet allow for upgrade/distribution
- prepare query
- generate result as php code
- display result as html through formatted php result
- extend result via php (level 1, 2, ... cf seeks proposal)
- add learning search project (cf other wiki page)
- add more sources
- synchronize sources via Crontab
- make available via tomcat
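Turning a PmWiki page file into a Solr document could be sketched as follows. Assumptions to check against PmWiki:PageFileFormat and the local schema.xml: page files are `key=value` lines, newlines inside values are stored as `%0a` (with `%3c` for `<` and `%25` for `%`), history entries use keys containing a colon, and the Solr schema has `id`, `author` and `text` fields. The `prefix` argument stands for the per-source id prefix from the table of sources.

```python
from xml.sax.saxutils import escape

def parse_pagefile(raw):
    """Parse the 'key=value' lines of a page file, ignoring history keys."""
    fields = {}
    for line in raw.split("\n"):
        key, sep, value = line.partition("=")
        if sep and ":" not in key:  # keys such as author:1234 are past revisions
            fields[key] = value
    return fields

def decode_value(value):
    """Undo the escapes assumed here: %0a (newline), %3c (<), %25 (%)."""
    return value.replace("%0a", "\n").replace("%3c", "<").replace("%25", "%")

def solr_add_xml(prefix, fields):
    """Build a Solr XML <add> message for the current revision only."""
    doc = {
        "id": prefix + fields.get("name", ""),
        "author": fields.get("author", ""),
        "text": decode_value(fields.get("text", "")),
    }
    body = "".join('<field name="%s">%s</field>' % (name, escape(value))
                   for name, value in doc.items())
    return "<add><doc>%s</doc></add>" % body

raw = ("version=pmwiki-2.2.0 ordered=1 urlencoded=1\n"
       "name=Main.HomePage\nauthor=someone\n"
       "text=Hello%0a1 %3c 2\nauthor:123=old\n")
print(solr_add_xml("wiki-", parse_pagefile(raw)))
```

Indexing previous revisions would mean emitting one extra document per history key instead of skipping them.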
Resources explored
- http://lucene.apache.org/solr/tutorial.html
- very straightforward (if Java is properly installed)
- #solr on freenode
- Wikipedia:Apache Solr
- http://wiki.apache.org/solr/SolrTomcat
- xampp 1.7.4 provides Tomcat
- http://wiki.apache.org/solr/ExtractingRequestHandler for PDF and other files
- Solr in 5 Minutes - Ignite Style Presentation by Mike Brevoort, 2010
- Indexing Text and HTML Files with Solr, the Lucene Search Server by Avi Rappoport, Lucid Imagination 2010
- note that UIMA seems to be already present in the default Solr installation
- index local files http://www.gossamer-threads.com/lists/lucene/general/68978
- mention of stream.file http://wiki.apache.org/solr/ContentStream
- Integrating Solr: Ruby on Rails Integration by David Smiley and Eric Pugh, Packt Publishing 2010
- http://acts-as-solr.rubyforge.org/
- see also Ruby#Rails
- in particular with websolr which provides easy usage with Ruby#Heroku
- ActsAsSolrReloaded Demo by Diego Carrion, 2009
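For indexing local PDFs and other files, the ExtractingRequestHandler and `stream.file` notes above can be combined into a request URL. A minimal sketch, assuming the handler is mapped to `/update/extract` as in the Solr example configuration and that remote streaming is enabled in solrconfig.xml; the function name and the sample path are made up.

```python
from urllib.parse import urlencode

def extract_url(doc_id, local_path, base="http://localhost:8983/solr", commit=True):
    """Build an /update/extract URL that asks Solr to read a file on its own disk."""
    params = [("literal.id", doc_id), ("stream.file", local_path)]
    if commit:
        params.append(("commit", "true"))
    return base + "/update/extract?" + urlencode(params)

print(extract_url("doc1", "/tmp/notes.pdf"))
```

The path is resolved on the machine running Solr, not on the client, which is fine for a purely local setup.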
To do
- check http://www.quora.com/Solr
- consider PersonalInformationStream#Finished and the potential advantages of having a local equivalent
Lucene
- using JFlex, a lexical analyzer generator, for StandardAnalyzer instead of WhitespaceAnalyzer
- Spatial Lucene
- Location-aware search with Apache Lucene and Solr by Grant Ingersoll, Lucid Imagination 2010
- in particular when geolocalisation data will be included in the wikis
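The practical difference between the two analyzers can be approximated in Python. This is a rough imitation and not Lucene code: the real StandardAnalyzer follows a JFlex-generated grammar with more careful rules, and the stop-word list below is an arbitrary sample.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only, in the spirit of WhitespaceAnalyzer."""
    return text.split()

def standard_like_tokens(text, stop_words=("the", "a", "an", "of", "and")):
    """Lower-case, split on non-word characters, drop a few stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in stop_words]

sample = "The Wiki, re-indexed: Solr and Lucene!"
print(whitespace_tokens(sample))
print(standard_like_tokens(sample))
```

With whitespace splitting, a query for `wiki` would miss the token `Wiki,`, which is the main reason to prefer the grammar-based analyzer for wiki text.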
Note that most of the content is actually in Solr, even though it is "only" an interface for the Lucene engine.
HTTP Server
- configure a forward proxy (for Sylvain)
- mod_proxy to allow a forward proxy with
- ProxyRequests On
- ProxyVia On
- mod_proxy_connect to allow SSL handshakes
- AllowCONNECT to allow the connection through port 443
- limit access
- mod_access
- Order Deny,Allow
- Deny from all
- Allow from 127.0.0.1
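Put together, the directives above could look like the fragment below. This is a sketch rather than a tested configuration: wrapping the access rules in a `<Proxy *>` block is my assumption (it is the usual way to restrict a forward proxy), and the Order/Deny/Allow syntax matches Apache 2.2 and earlier, while Apache 2.4 uses `Require` instead.

```apache
# Forward proxy sketch; assumes mod_proxy, mod_proxy_http and
# mod_proxy_connect are loaded.
ProxyRequests On
ProxyVia On

# Let CONNECT requests (SSL handshakes) through on port 443 only
AllowCONNECT 443

# Restrict who may use the proxy: local machine only
<Proxy *>
    Order Deny,Allow
    Deny from all
    Allow from 127.0.0.1
</Proxy>
```

Leaving `ProxyRequests On` without such an access restriction would create an open proxy, so the second part is not optional.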
Previously in Lighttpd#Apache
To do
- correct links previously pointing to Lighttpd#Apache.
- add previous links on Hadoop
Note
My notes on Tools gather what I know or want to know. Consequently they are not and will never be complete references. For this, official manuals and online communities provide much better answers.


