Fabien Benetou's PIM | ReadingNotes / AlgorithmsOfTheIntelligentWeb

Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko - ISBN: 1933988665 - Manning 2009

Motivation

Understanding the web requires understanding not just its usages or its infrastructure but also how information is processed to provide better and new experiences, which relies on increasingly complex techniques.

Pre-reading model

Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain before having read the book.

Reading

"Unlike traditional applications, intelligent applications adjust their behavior according to their input" (p.xiv)

1 What is the intelligent web?
- example of an app that would check not just spelling or grammar but also facts (p2)
  - possible with Needs#QA and limited to a personal database of facts?
  - see also Chapter5 for a description of the functionality
- defining the triangle of intelligence (p5) as aggregated content (raw data), reference structures (knowledge) and algorithms (thinking)
- paragraph on wikis (p9)
  - discusses automatic categorization and how "natural linkage of the pages provides fertile ground for advanced search (chapter 2), clustering (chapter 4), and other analytical techniques."
- "identify the areas where an intelligent component would add most value to your application." (p11)
2 Searching
- crawling -> indexing on tokenized content (brief mention of different analyzers) -> ranking -> result of search
  - see also ApacheProjects#Lucene
  - injecting spam
  - updating the ranking with PageRank, used against spam
    - Wikipedia:PageRank
  - using user clicks in a Naive Bayes classifier
    - Wikipedia:Naive Bayes classifier
- see also RailsCampParis3#FullTextSearch
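
The PageRank step of the pipeline above can be sketched as a power iteration. This is a minimal sketch and not the book's own code: the function name `pagerank`, the toy link graph, the damping factor of 0.85 and the fixed number of iterations are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by dangling pages (no outgoing links) is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its current rank among the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: several pages link to "hub"; nobody links to "spam".
toy_links = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": ["a"],
    "spam": ["hub"],
}
ranks = pagerank(toy_links)
best = max(ranks, key=ranks.get)
```

A page that nobody links to, like the hypothetical "spam" page here, only keeps the baseline share (1 - damping) / n, which is why link-based ranking helps against injected pages that merely repeat popular terms.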
3 Creating suggestions and recommendations
- similarity, mathematical distance and its 4 properties, metrics in general
- Wikipedia:Collaborative filtering
- Wikipedia:Jaccard metric, Wikipedia:Jaccard index, Wikipedia:Cosine similarity
- Wikipedia:Pearson correlation
- see also Recked: A Night of Recommendation Technologies held in January 2009 discovered earlier for PersonalInformationStream#Sources
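
The similarity measures listed above can be sketched in a few lines. This is a minimal sketch using the standard textbook definitions, not the book's code; the function names and the choice to return 0.0 for degenerate inputs (empty sets, zero vectors, constant ratings) are assumptions.

```python
import math


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are given similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


def cosine(x, y):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    dev_x = math.sqrt(sum((p - mean_x) ** 2 for p in x))
    dev_y = math.sqrt(sum((q - mean_y) ** 2 for q in y))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


# Hypothetical users described by the items they liked, or by their ratings.
liked_overlap = jaccard({"a", "b", "c"}, {"b", "c", "d"})
same_taste = pearson([1, 2, 3], [2, 4, 6])
opposite_taste = pearson([1, 2, 3], [3, 2, 1])
```

Jaccard works on sets (what was liked), while cosine and Pearson work on rating vectors; Pearson also subtracts each user's mean, so a user who rates everything high is not penalized.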
4 Clustering: grouping things together
- nice viz (p135)
- mention of Wikipedia:Curse of dimensionality
- see also
  - gCluto Graphical Clustering Toolkit by Matt Rasmussen
  - Single-Link, Complete-Link & Average-Link Clustering, book chapter Hierarchical clustering of Introduction to Information Retrieval, Cambridge University Press 2008
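
The three linkage criteria named in the last reference differ only in how the distance between two clusters is derived from the distances between their points. A minimal agglomerative sketch, assuming 2D points, Euclidean distance and a target number of clusters `k` (the function name and the toy data are hypothetical, and this naive version is far too slow for real data):

```python
import math


def agglomerate(points, k, linkage="single"):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[p] for p in points]

    def link(c1, c2):
        distances = [math.dist(a, b) for a in c1 for b in c2]
        if linkage == "single":      # closest pair of points
            return min(distances)
        if linkage == "complete":    # farthest pair of points
            return max(distances)
        if linkage == "average":     # mean over all pairs
            return sum(distances) / len(distances)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical data: two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
groups = agglomerate(toy_points, k=2, linkage="complete")
```

On well-separated data like this all three criteria agree; they start to differ on elongated or noisy clusters, where single-link tends to chain points together.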
5 Classification
- introduction on the value of having proper classes and the importance of hierarchies
  - see also on ontologies the SemanticWeb page
- mention of Wikipedia:Rete algorithm and of its implementation in Drools (JBoss)
  - on Prolog, backward and forward chaining, see class AI01/AI02 at UTC
  - mention of http://ai.eecs.umich.edu/cogarch0/common/issue/utility.html
- see also QuantitativeTrading#Chapter3 on backtesting
6 Combining classifiers
- bagging (bootstrap aggregating), introduced only after showing how to check one classifier against another
  - chi-square and z statistic
  - Cochran's Q test and the F test
- different strategies, weight, ...
  - majority vote
- boosting, iterative improvement
  - picking training sets biased toward those instances that were previously misclassified by the ensemble
  - "the essence of boosting [...]: find out what you don't know and bring in someone who does to cover for it" (p265)
  - e.g. arc-x4, AdaBoost
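
Bagging with a majority vote, as noted above, can be sketched as follows. This covers only bagging, not boosting, and is not the book's code: the one-dimensional threshold "stump" used as base learner, the function names, the toy data and the fixed random seed are all assumptions for illustration.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Toy base learner: threshold halfway between the two class means."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    if not zeros or not ones:
        # The sample holds a single class: always predict that class.
        only = 1 if ones else 0
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging(data, n_models=11, seed=0):
    """Train n_models stumps on bootstrap samples and combine them by vote."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return lambda x: majority_vote([model(x) for model in models])


# Hypothetical training set: (feature, label) pairs with labels 0 and 1.
toy_data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
ensemble = bagging(toy_data)
```

An odd number of models avoids ties in a two-class vote. Boosting would differ in the sampling step: instead of drawing uniformly, it would favour the instances the ensemble has misclassified so far.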
7 Putting it all together: an intelligent news portal
- rapid review and integration of most techniques that have been explained so far

Tools for examples and todos

http://code.google.com/p/yooreeka/
http://www.manning-sandbox.com/forum.jspa?forumID=438&start=0
manually added System.setProperty("iweb2.home","c:/iWeb2"); to deploy/bin/.bshrc to have the proper path
http://www.beanshell.org

Overall remarks and questions

keep a BDD mindset to analyze what the actual value of Collective Intelligence/ML/AI/... is (as some seem to have done during RailsCampParis3, constantly asking whether end-users would actually notice and benefit from it)
no magic: the problems are still hard, yet partial working solutions exist
- symbolic AI is looking for abstract structures (ideally one encompassing and efficient abstraction)
- ML is using data (ideally small and up to date)
what are the newest usages, not techniques, and who applies them?
what are the most famous non-Java frameworks?
- outside of WEKA, Mahout, ...
- and why are the Java frameworks so dominant?
TDD/BDD of assertions/tests to apply to learning machine learning?
are those computations single-use only, i.e. no re-use or generalization? are incremental updates hard?

Synthesis

So in the end, it was about X and was based on Y.

Criticisms

examples requiring a rather specific environment and yet often not working without modifications
hard to read code
nearly no equations
- and the few present can be... false! e.g. http://www.manning.com/marmanis/excerpt_errata.html regarding Bayes theorem

Vocabulary

(:new_vocabulary_start:) new_word (:new_vocabulary_end:)

Post-reading model

Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain after having read the book. Link it to the pre-reading model and align the two to help easy comparison.
