Fabien Benetou's PIM | ReadingNotes / IntelligentBio

Intelligent Bioinformatics : The Application of Artificial Intelligence Techniques to Bioinformatics Problems - ISBN 978-0-470-02175-0 - Wiley 2005

Motivation

Understand how the tools of AI, and ML in particular, get applied to biology, and how this could be a way to "edit back" our substrate by directly editing DNA, RNA or whatever representation biological information could have, eventually following homoiconicity.

Pre-reading model

Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain before having read the book.

Questions beforehand

Is DIY bio really attainable?
1. What more do I need to know in order to leverage the new tools? (cloud computing, human genome, ...)
2. Should I go further and read Structural Bioinformatic 2008 or Biobazaar 2008?
What trends can I expect?
1. What will be the consequences of those trends?
2. Is my knowledge regarding AI up to date?
How does this relate to Protocells?
What do the guys in #bioinformatics who do work in the field think about it?
Should the problem of information overload vs new tools to handle it be considered a (cognitive) arms race and thus part of the Seedea:Research/Drive?

Skimming

Part 1 (introduction) will, ironically, probably be the hardest, as my knowledge of biology is still embryonic
Part 2 on current techniques is pretty straightforward, so focusing on the specificities of biological data might be the more interesting aspect of it
Part 3 on future techniques should be a bit more complex yet quite classical. We should assess whether those techniques are now (2009) in use (the book was published in 2005): if yes, what are the results and what are the current "future techniques"; if not, what limited them.

Reading

Part 1 - Introduction

Chapter 1
- key components (DNA, mRNA, 5', 3', ...)
- key processes (translation, splicing, transcription, ...)
  - Molecular Visualizations of DNA by the Walter and Eliza Hall Institute of Medical Research
    - potentially using Brownian motion (the seemingly random movement of particles suspended in a fluid) to bring the required material where it should be
      - check instead Wikipedia:Passive transport, Wikipedia:Discrete nanoscale transport, Wikipedia:Microtubule, Wikipedia:Neurofilament, neurotubule, etc...
- the Wikipedia:Genome, the Wikipedia:Transcriptomics and the Wikipedia:Proteome
- summary of chapter p28
Chapter 2
- "assumption for proteins is that linear sequence determines shape which, in turn, determines function." (p32) resonates with Form follows function, a principle associated with modern architecture and industrial design in the 20th century.
- code comparison
  - Hamming distance (p32) -> +unit cost model = Levenshtein distance (p33)
- "computational cost" (p35) looks like combinatorics (cf N is a Number)
- animations on Microarray
  - DNA Microarray Methodology Animation Department of Biology, Davidson College
  - Microarrays from Genomics Education by Genome British Columbia
  - Microarray Method for Genetic Testing
- reverse engineering gene networks -> working from gene expression data is one of the key problems yet, as detailed regarding the amount of available data and the number of combinations, it is computationally tricky
- Ethical considerations (p48)
  - interesting point on genome closeness and relatives, how 1 analysis provides information about others and could lead to potential problems (very similar to social network analysis)
  - explanation on stem cells with totipotent, blastocyst, pluripotent, multipotent phases (p48) and a very clear schema (p49)
- summary of chapter p63
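
The two distances noted above for code comparison can be sketched as follows. This is a minimal illustration and not the book's own code: the function names are mine, and the Levenshtein version assumes the unit cost model (each substitution, insertion or deletion costs 1).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of unit-cost substitutions, insertions and deletions."""
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]
```

Hamming only compares position by position, so it cannot handle a missing base: hamming("ACGT", "AGT") raises an error, whereas levenshtein("ACGT", "AGT") counts a single deletion.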
Chapter 3
- schema are easy to understand but their readings...
- summary of chapter p98

Part 2 - Current techniques

Chapter 4
- Bayesian networks + handling loops = Markov networks
- HMM (Hidden Markov Model) description
- recommended tutorial on usage and classical tool in the domain HMMER
- summary of chapter p125
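
To make the HMM description above more concrete, here is a small Viterbi decoder (most probable sequence of hidden states). It is only a sketch: it does not use HMMER, and the two states and all probabilities below are invented for illustration, not taken from the book.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable hidden-state path.

    Assumes a non-empty observation sequence.
    """
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for symbol in observations[1:]:
        step = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: AT-rich versus GC-rich regions of a sequence.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

probability, path = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
```

Plain probabilities underflow on long sequences, so a real tool would presumably work with log-probabilities instead.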
Chapter 5 - Nearest Neighbour and Clustering Approaches
- 'unsupervised' techniques
- interesting as it is not limited to 1 or 2 dimensions of analysis (p130)
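
As an illustration of an unsupervised technique working on any number of dimensions, here is a minimal k-means sketch. The choice of k-means is mine (the notes above do not say which clustering algorithm the chapter details), the initialisation is deliberately naive, and the sample points are made up.

```python
import math


def kmeans(points, k, iterations=20):
    """Group n-dimensional points into k clusters (plain Lloyd iteration)."""
    centroids = [list(p) for p in points[:k]]  # naive start: the first k points
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return assignment, centroids


# Hypothetical 3-dimensional profiles (e.g. expression levels in 3 conditions).
PROFILES = [(0, 0, 0), (10, 10, 10), (0, 1, 0), (1, 0, 0), (10, 9, 10), (9, 10, 10)]
clusters, centres = kmeans(PROFILES, k=2)
```

No class label is given anywhere: the groups come only from the distances between points, which is what makes the approach unsupervised.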
Chapter 6 - Identification (Decision) Trees
- supervised as classes are not generated by the classification process itself
- sum-up algorithm
  - 1 For each feature, compute the gain criterion.
  - 2 Select the best feature and split the data according to the values in that feature.
  - 3 If each of the subsets contains just one class then stop. Otherwise, reapply points 1-3 on each of the subsets of data.
  - 4 If the data is not completely classified but there are no more splits available then stop.
- splits using information-theory measures to build the Gain criterion
- major work from John Ross Quinlan
- improvements with Bootstrap aggregating (bagging) and Boosting
- countering overfitting with pruning, using test sets
- see also uClassify with its classifiers and API
- Information on See5/C5.0 by Rulequest Research
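
The four steps of the sum-up algorithm above can be sketched as below. I assume here that the gain criterion is the entropy-based information gain (ID3 style); the feature names and the tiny dataset are hypothetical, and there is no pruning, bagging or boosting.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, feature):
    """Information gain obtained by splitting the data on one feature."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, features):
    if len(set(labels)) == 1:   # step 3: just one class left, stop
        return labels[0]
    if not features:            # step 4: no more splits available, stop
        return Counter(labels).most_common(1)[0][0]
    # Steps 1 and 2: compute the gain of each feature and split on the best one.
    best = max(features, key=lambda f: gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {"feature": best, "branches": {}}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree["branches"][value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return tree


# Hypothetical data: does a sequence carry a motif, and is it long or short?
ROWS = [
    {"motif": "yes", "length": "long"},
    {"motif": "yes", "length": "short"},
    {"motif": "no", "length": "long"},
    {"motif": "no", "length": "short"},
]
LABELS = ["coding", "coding", "noncoding", "noncoding"]
tree = build_tree(ROWS, LABELS, ["motif", "length"])
```

On this data "motif" separates the two classes perfectly while "length" brings no information, so the tree needs a single split.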
Chapter 7 - Neural Networks
- basic "bricks" that each process information very simply but with a very high amount of connections (linking with connectivism) in order to be able to generalize
- sigmoid function, discussed as a threshold model for neurons, added to Scale Free Punctuated Learning
- different architectures
  - 2 layers : one input layer and one output layer, or perceptron (Rosenblatt, 1958)
  - multiple layers with hidden layers (as popularised in Rumelhart & McClelland, 1986)
  - Kohonen Self Organizing Map (Kohonen, 1990)
    - Kohonen map according to Wikipedia, also described as self-organizing map (SOM) or self-organizing feature map (SOFM)
  - error backpropagation
- "The discrepancy between the two [input and output data] is calculated and the network then makes changes to its internal weights to reduce the error the next time this input data is presented." (p178-179)
  - learning through a change of topology in a neural network really starts to make Information Geometry fundamental
- drawbacks
  - overfitting
    - partly solved by conserving a subset of the training dataset to test
  - choosing and configuring the right architecture
    - any hidden layers used should have fewer units than the input layer
    - number of units in the hidden layer decreases from input to output
  - black-boxing, the result does not provide "human-understandable terms"
- SNNS - Stuttgart Neural Network Simulator, developed at University of Stuttgart, maintained at University of Tübingen
- few usages in bioinformatics
  - Classification and dimensionality reduction of gene expression data
  - Identifying protein subcellular location
- summary of chapter p193
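
The error-driven weight change quoted above (p178-179) can be shown on the smallest possible case: a single sigmoid unit, with no hidden layer and therefore no backpropagation proper. This is a sketch under my own assumptions (learning rate, number of epochs, the OR gate as training data), not the book's procedure and not SNNS.

```python
import math


def sigmoid(z):
    """Smooth threshold function squashing any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(inputs, weights, bias):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples, epochs=2000, rate=0.5):
    """Nudge weights after each sample so the error is smaller next time."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(inputs, weights, bias)  # the discrepancy
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training set: the logical OR of two binary inputs.
OR_GATE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_GATE)
```

The learned weights are just numbers with no "human-understandable" meaning, which already hints at the black-boxing drawback listed above.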
Chapter 8 - Genetic Algorithms
- inspiration from biological system yet not initially designed with bioinformatics in mind
- single/multi objectives GA
- operators/mutations/genes/populations/fitness mechanism
- usages for genes and biological data
  - Reverse engineering of regulatory networks
  - Multiple sequence alignment
- summary of chapter p217
- see also SpeedyGA. Hacking Evolution by Keki Burjorjee (not tested seriously so far or applied to bioinformatics)
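
The population/fitness/operators mechanism listed above can be sketched with a single-objective GA. Everything specific here is an assumption of mine: truncation selection, one-point crossover, elitism, the parameter values and the target motif used as fitness. It is unrelated to SpeedyGA.

```python
import random


def evolve(fitness, length, alphabet="ACGT", pop_size=30, generations=60,
           mutation_rate=0.05, seed=0):
    """Return the best individual found and the best fitness per generation."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = ["".join(rng.choice(alphabet) for _ in range(length))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]   # truncation selection
        children = [population[0]]             # elitism: the best survives as is
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)     # one-point crossover
            child = mum[:cut] + dad[cut:]
            child = "".join(                   # point mutations
                rng.choice(alphabet) if rng.random() < mutation_rate else gene
                for gene in child
            )
            children.append(child)
        population = children
    return max(population, key=fitness), history


TARGET = "GATTACAGATTACA"  # hypothetical motif, for illustration only


def match(candidate):
    """Fitness: number of positions identical to the target motif."""
    return sum(a == b for a, b in zip(candidate, TARGET))


best, history = evolve(match, len(TARGET))
```

Because of elitism the best fitness never decreases from one generation to the next; whether the exact target is reached depends on the parameters and the seed.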

Part 3 - Future Techniques

Chapter 9 - Genetic Programming
- GP = GA + parse tree to represent a solution to the problem
- created by John Koza (see also his homepage)
  - yet "automatic programming" and similar concepts seem very close to the initial goal of languages like Lisp (the first homoiconic programming language) so stating who "created" it can be difficult (GP ~1990 while Homoiconicity ~1960).
- "Genetic algorithms produce solutions that contain combinations of parameter values (possibly weighted) to satisfy a function, whereas GP produces solutions that contain a series of instructions for producing desired and specified program behaviour." (p225)
- see also Gary Cziko's WithoutMiracles especially section The Computer Can Know More than the Programmer of chapter 13 Evolutionary Computing: Selection Within Silicon
- "These include 'human-competitive' applications where GP has created previously patented inventions, or new inventions which are determined to be of patentable quality" (p231-232)
  - regarding automated invention and the patent system, see The Genie in the Machine: How Computer-Automated Inventing Is Revolutionizing Law and Business by Robert Plotkin, SUP 2009
- usage in bioinformatics
  - Genetic programming in data mining for drug discovery
  - Genetic programming for functional genomics in yeast data
- summary of chapter p236
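
To illustrate "GP = GA + parse tree" from the notes above, here is a sketch of a program represented as a parse tree, with evaluation and subtree crossover. The nested-tuple representation, the function set and the single variable x are my own assumptions, chosen to stay close to Lisp-like expressions.

```python
import operator
import random

FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(tree, x):
    """Run the program: a tree is "x", a number, or (function, left, right)."""
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for index in (1, 2):
            yield from subtrees(tree[index], path + (index,))


def replace(tree, path, new):
    """Return a copy of tree in which the node at path is swapped for new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(children)


def crossover(parent_a, parent_b, rng):
    """Graft a random subtree of parent_b onto a random node of parent_a."""
    path, _ = rng.choice(list(subtrees(parent_a)))
    _, donor = rng.choice(list(subtrees(parent_b)))
    return replace(parent_a, path, donor)


square_plus_one = ("add", ("mul", "x", "x"), 1)   # x * x + 1
double = ("add", "x", "x")                        # x + x
child = crossover(square_plus_one, double, random.Random(0))
```

Whatever nodes are picked, the child is still a valid tree and therefore still a runnable program, which is the point of evolving instructions rather than parameter values.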
Chapter 10 - Cellular Automata
- introduced by John von Neumann on the suggestion of Stan Ulam during the late 1940s
- more often used for simulations than for optimizations
- grid and behavior are equivalent to inherent laws of the system
- classical Conway's example
- usage in bioinformatics
  - Cellular automata model for enzyme kinetics
  - Simulation of an apoptosis reaction network using cellular automata
- summary of chapter p252
- see also
  - Swarm Intelligence
  - Multi-Agent systems seem similar but include a higher degree of complexity, especially related to communication (and protocols)
  - Stephen Wolfram's A New Kind of Science (NKS), 2002
  - Cellular Automata Laboratory by Rudy Rucker and John Walker
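
The classical Conway example mentioned above fits in a few lines, which shows how the grid plus a local rule act as the "laws" of the system. This sketch assumes an unbounded grid stored as a set of live cells; that representation is my choice, not the book's.

```python
from collections import Counter


def step(live):
    """Apply Conway's rules once to a set of live (row, col) cells."""
    # Count, for every cell next to a live one, how many live neighbours it has.
    neighbours = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth with exactly 3 neighbours, survival with 2 or 3, death otherwise.
    return {cell for cell, n in neighbours.items()
            if n == 3 or (n == 2 and cell in live)}


blinker = {(1, 0), (1, 1), (1, 2)}   # horizontal bar of three cells
next_generation = step(blinker)      # becomes a vertical bar
```

Applying step twice to the blinker gives back the original bar, while a 2x2 block stays unchanged.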
Chapter 11 - Hybrid Methods
- mash-ups fitting directly to their applications
- existing mapping between a repository of techniques and their applications?

Overall remarks and questions

the reading notes do not really answer the initial motivation of somehow seeing which technique best fits which problem and why

Synthesis

So in the end, it was about X and was based on Y.

Critics

Point A, B and C are debatable because of e, f and j.

Vocabulary

(:new_vocabulary_start:) new_word (:new_vocabulary_end:)

Post-reading model

Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain after having read the book. Link it to the pre-reading model and align the two to help easy comparison.
