Intelligent Bioinformatics : The Application of Artificial Intelligence Techniques to Bioinformatics Problems - ISBN 978-0-470-02175-0 - Wiley 2005
Motivation
Describe in a sentence or two what motivated me to read this book.
Pre-reading model
Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain before having read the book.
Questions beforehand
- Is DIY bio really attainable?
- What more do I need to know in order to leverage the new tools? (cloud computing, human genome, ...)
- Should I go further and read Structural Bioinformatics 2008 or Biobazaar 2008?
- What trends can I expect?
- What will be the consequences of those trends?
- Is my knowledge regarding AI up to date?
- How does this relate to protocell work?
- What do the guys in #bioinformatics who do work in the field think about it?
- Should the tension between information overload and the new tools to handle it be considered a (cognitive) arms race and thus part of the drive?
Skimming
- Part 1 on introduction will ironically probably be the hardest as my knowledge of biology is still embryonic
- Part 2 on current techniques is pretty straightforward, so focusing on what is specific to biological data might be the most interesting aspect of it
- Part 3 on future techniques should be a bit more complex yet quite classical. As the book was published in 2005, we should assess whether those techniques are now (2009) used: if yes, what are the results and what are the current "future techniques"; if not, what limited them.
Reading
Part 1 - Introduction
- Chapter 1
- key components (DNA, mRNA, 5', 3', ...)
- key processes (translation, splicing, transcription, ...)
- Molecular Visualizations of DNA by the Walter and Eliza Hall Institute of Medical Research
- potentially using Brownian motion (the seemingly random movement of particles suspended in a fluid) to bring the required material where it should be
- the genome, the transcriptome and the proteome
- summary of chapter p28
- Chapter 2
- "assumption for proteins is that linear sequence determines shape which, in turn, determines function." (p32) resonates with Form follows function, a principle associated with modern architecture and industrial design in the 20th century.
- code comparison
- Hamming distance (p32) -> +unit cost model = Levenshtein distance (p33)
- "computational cost" (p35) looks like combinatorics (cf N is a Number)
- animations on Microarray
- DNA Microarray Methodology Animation Department of Biology, Davidson College
- Microarrays from Genomics Education by Genome British Columbia
- Microarray Method for Genetic Testing
- reverse engineering gene networks -> gene expression data is one of the key problems yet, as detailed regarding the amount of available data and the number of combinations, computationally tricky
- Ethical considerations (p48)
- interesting point on genome closeness and relatives: how one person's analysis provides information about others and could lead to potential problems (very similar to social network analysis)
- explanation on stem cells with totipotent, blastocyst, pluripotent, multipotent phases (p48) and a very clear schema (p49)
- summary of chapter p63
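The code-comparison note above (Hamming distance, then Levenshtein distance once a unit cost is added for insertions and deletions) can be sketched as follows. This is a minimal illustration and not the book's code; the function names and the example sequences are assumptions chosen for the sketch.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for substitution, insertion and deletion."""
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical example sequences
print(hamming("GATTACA", "GACTATA"))     # 2
print(levenshtein("kitten", "sitting"))  # 3
```

Hamming distance only compares position by position, so it cannot handle sequences of different lengths; the unit cost model is what lets Levenshtein distance account for gaps.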
- Chapter 3
- schemas are easy to understand but their readings...
- summary of chapter p98
Part 2 - Current techniques
- Chapter 4
- Bayesian networks + handling loops = Markov networks
- HMM (Hidden Markov Model) description
- recommended tutorial on usage, and the classical tool in the domain: HMMER
- summary of chapter p125
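To make the HMM description above more concrete, here is a minimal sketch of the forward algorithm, which computes the probability of an observed sequence by summing over all hidden state paths. It is not taken from the book or from HMMER; the two-state model (AT-rich vs GC-rich regions) and all its probabilities are hypothetical values invented for the illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the HMM, summed over all hidden paths."""
    if not observations:
        return 1.0
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model: AT-rich vs GC-rich regions of a sequence
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

likelihood = forward("GGCACTGAA", STATES, START, TRANS, EMIT)
```

Profile HMMs as used in sequence analysis have more structure (match, insert and delete states), but the recursion is the same idea.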
- Chapter 5 - Nearest Neighbour and Clustering Approaches
- 'unsupervised' techniques
- interesting as it is not limited to 1 or 2 dimensions of analysis (p130)
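As an illustration of unsupervised clustering in more than two dimensions, here is a minimal k-means sketch. k-means is used here as one common clustering algorithm, not necessarily the one the chapter emphasises; the function name, the initial centroids and the 4-dimensional "expression profiles" are assumptions made for the example.

```python
from math import dist  # Euclidean distance in any dimension (Python 3.8+)


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    # clusters reflect the last assignment step
    return centroids, clusters


# Hypothetical 4-dimensional profiles (e.g. expression under 4 conditions)
profiles = [(1.0, 1.0, 0.0, 0.0), (1.2, 0.8, 0.0, 0.1),
            (0.0, 0.0, 1.0, 1.0), (0.1, 0.0, 0.9, 1.1)]
centroids, clusters = kmeans(profiles, [profiles[0], profiles[2]])
```

No class labels are given: the two groups emerge from the distances alone, which is what makes the approach unsupervised.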
- Chapter 6 - Identification (Decision) Trees
- supervised as classes are not generated by the classification process itself
- sum-up algorithm
- 1 For each feature, compute the gain criterion.
- 2 Select the best feature and split the data according to the values in that feature.
- 3 If each of the subsets contains just one class then stop. Otherwise, reapply points 1–3 on each of the subsets of data.
- 4 If the data is not completely classified but there are no more splits available then stop.
- splits using information-theory measures to build the Gain criterion
- major work from John Ross Quinlan
- improvements with Bootstrap aggregating (bagging) and Boosting
- countering overfitting by pruning with test sets
- see also uClassify with its classifiers and API
- Information on See5/C5.0 by Rulequest Research
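Steps 1 and 2 of the summarised algorithm (compute the gain criterion for each feature, then pick the best one) can be sketched as below. The sketch assumes the gain criterion is plain information gain, that is, entropy reduction as in Quinlan's ID3; refinements such as the gain ratio are not shown. The feature names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting
    the data on the values of `feature`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Step 2: select the feature with the highest gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: does a region contain a gene?
rows = [{"motif": "yes", "gc": "high"}, {"motif": "yes", "gc": "low"},
        {"motif": "no", "gc": "high"}, {"motif": "no", "gc": "low"}]
labels = ["gene", "gene", "none", "none"]
print(best_feature(rows, labels, ["motif", "gc"]))  # motif
```

Here splitting on "motif" yields pure subsets (gain of 1 bit) while "gc" tells nothing (gain of 0), so step 3 would stop right after the first split.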
- Chapter 7 - Neural Networks
- basic "bricks" that each process information very simply but with a very high number of connections (linking with connectivism) in order to be able to generalize
- sigmoid function, discussed as a threshold model for neurons, added to Scale Free Punctuated Learning
- different architectures
- 2 layers : one input layer and one output layer, or perceptron (Rosenblatt, 1958)
- multiple layers with hidden layers (as popularised in Rumelhart & McClelland, 1986)
- Kohonen Self Organizing Map (Kohonen, 1990)
- Kohonen map according to Wikipedia, also described as self-organizing map (SOM) or self-organizing feature map (SOFM)
- error backpropagation
- "The discrepancy between the two [input and output data] is calculated and the network then makes changes to its internal weights to reduce the error the next time this input data is presented." (p178-179)
- learning through a change of topology in a neural network really starts to make Information Geometry fundamental
- drawbacks
- overfitting
- partly solved by holding out a subset of the training dataset for testing
- choosing and configuring the right architecture
- any hidden layers used should have fewer units than the input layer
- number of units in the hidden layer decreases from input to output
- black-boxing, the result does not provide "human-understandable terms"
- SNNS - Stuttgart Neural Network Simulator, developed at University of Stuttgart, maintained at University of Tübingen
- few usages in bioinformatics
- Classification and dimensionality reduction of gene expression data
- Identifying protein subcellular location
- summary of chapter p193
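The sigmoid unit and the error-driven weight change quoted above can be illustrated with a single neuron. This is a deliberately reduced sketch: one sigmoid unit trained by gradient descent on squared error, which is the same update that backpropagation applies layer by layer in a multi-layer network. The learning rate, the number of epochs and the logical OR training set are assumptions for the example.

```python
from math import exp


def sigmoid(x):
    """Smooth threshold function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_unit(samples, epochs=2000, rate=0.5):
    """Train one sigmoid unit by gradient descent on squared error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Discrepancy between target and output, scaled by the sigmoid slope
            delta = (target - output) * output * (1.0 - output)
            # Change the internal weights to reduce the error next time
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias


# Hypothetical training set: logical OR, which is linearly separable
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_unit(OR_SAMPLES)
```

A single unit like this can only separate classes with a straight line (or hyperplane); hidden layers are what allow the architectures listed above to go beyond that.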
- Chapter 8 - Genetic Algorithms
- inspired by biological systems yet not initially designed with bioinformatics in mind
- single- and multi-objective GAs
- operators/mutations/genes/populations/fitness mechanism
- usages for genes and biological data
- Reverse engineering of regulatory networks
- Multiple sequence alignment
- summary of chapter p217
- see also SpeedyGA. Hacking Evolution by Keki Burjorjee (not tested seriously so far or applied to bioinformatics)
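The operators/mutations/genes/populations/fitness mechanism listed above can be sketched as a minimal single-objective GA. This is an illustrative toy, not the book's algorithm and not SpeedyGA: the parameter values, the tournament selection, the elitism and the "count the ones" fitness function are all assumptions chosen to keep the example short.

```python
import random


def evolve(fitness, genome_length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit-string genomes; returns the best genome and the best
    fitness seen at the start of each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        next_population = population[:2]  # elitism: keep the two fittest
        while len(next_population) < population_size:
            # Tournament selection: best of three random individuals
            mother = max(rng.sample(population, 3), key=fitness)
            father = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, genome_length)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), best_per_generation


# Hypothetical fitness: number of 1s in the genome ("OneMax")
best, history = evolve(fitness=sum)
```

For a bioinformatics use such as multiple sequence alignment, the genome would encode an alignment (for instance gap positions) and the fitness would be an alignment score; only those two pieces change.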
Part 3 - Future Techniques
- Chapter 9 - Genetic Programming
- GP = GA + parse tree to represent a solution to the problem
- created by John Koza (see also his homepage)
- yet "automatic programming" and similar concepts seem very close to the initial goal of languages like Lisp (the first homoiconic programming language), so stating who "created" it can be difficult (GP ~1990 while homoiconicity ~1960).
- "Genetic algorithms produce solutions that contain combinations of parameter values (possibly weighted) to satisfy a function, whereas GP produces solutions that contain a series of instructions for producing desired and specified program behaviour." (p225)
- see also Gary Cziko's Without Miracle especially section The Computer Can Know More than the Programmer of chapter 13 Evolutionary Computing: Selection Within Silicon
- "These include ‘human-competitive’ applications where GP has created previously patented inventions, or new inventions which are determined to be of patentable quality" (p231-232)
- regarding automated invention and the patent system, see The Genie in the Machine: How Computer-Automated Inventing Is Revolutionizing Law and Business by Robert Plotkin, SUP 2009
- usage in bioinformatics
- Genetic programming in data mining for drug discovery
- Genetic programming for functional genomics in yeast data
- summary of chapter p236
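The "GA + parse tree" idea can be sketched by showing what a GP individual looks like: a tree of instructions that is executed, rather than a list of parameter values. The nested-tuple representation, the three arithmetic functions and the target x*x + 1 are assumptions for this illustration; selection, crossover (swapping subtrees) and mutation would then work on these trees with the usual GA machinery.

```python
import operator
import random

# Hypothetical function set available to evolved programs
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(tree, x):
    """Run the program encoded by `tree` for the input value x."""
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, x), evaluate(right, x))
    if tree == "x":
        return x
    return tree  # numeric constant


def random_tree(rng, depth=3):
    """Grow a random parse tree with at most `depth` nested functions."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", float(rng.randint(-2, 2))])
    name = rng.choice(sorted(FUNCTIONS))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def error(tree, cases):
    """Sum of absolute errors over (input, expected) pairs; lower is better."""
    return sum(abs(evaluate(tree, x) - y) for x, y in cases)


# A hand-written individual encoding x*x + 1, and the cases it should satisfy
target = ("add", ("mul", "x", "x"), 1.0)
cases = [(float(x), float(x * x + 1)) for x in range(-3, 4)]
candidate = random_tree(random.Random(1))
```

Because programs are plain data structures here, manipulating them is straightforward, which echoes the remark above about Lisp and homoiconicity.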
- Chapter 10 - Cellular Automata
- introduced by John von Neumann on the suggestion of Stan Ulam during the late 1940s
- more often used for simulation than for optimization
- the grid and the cells' behavior (update rules) are equivalent to the inherent laws of the system
- classical example: Conway's Game of Life
- usage in bioinformatics
- Cellular automata model for enzyme kinetics
- Simulation of an apoptosis reaction network using cellular automata
- summary of chapter p252
- see also
- Swarm Intelligence
- Multi-Agent systems seem similar but include a higher degree of complexity, especially related to communication (and protocols)
- Stephen Wolfram's A New Kind of Science (NKS), 2002
- Cellular Automata Laboratory by Rudy Rucker and John Walker
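Conway's classical example mentioned above fits in a few lines and shows how a grid plus a local update rule define the whole "physics" of the simulation. This is a minimal sketch of the standard Game of Life rules; representing the unbounded grid as a set of live cells is a design choice of this example.

```python
from collections import Counter


def life_step(alive):
    """Advance Conway's Game of Life by one generation.

    `alive` is a set of (row, col) coordinates of live cells.
    """
    # Count, for every cell, how many live neighbours it has
    neighbour_counts = Counter(
        (r + dr, c + dc)
        for r, c in alive
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth with exactly 3 neighbours, survival with 2 or 3
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in alive)
    }


# The "blinker" oscillates between a horizontal and a vertical bar
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(life_step(blinker)))  # [(0, 1), (1, 1), (2, 1)]
```

The bioinformatics uses listed above (enzyme kinetics, apoptosis networks) follow the same pattern with richer cell states and rules that encode reaction probabilities.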
- Chapter 11 - Hybrid Methods
- mash-ups fitting directly to their applications
- existing mapping between a repository of techniques and their applications?
See also
- Health#PersonalGenomics
- Seedea's AI matrix
- MLOSS.org machine learning open source software
- especially Projects that are tagged with bioinformatics
- blog post on Bioinformatics tools by Cheng Soon Ong on March 20, 2009
- MLcomp objectively comparing machine learning programs across various datasets for multiple problem domains.
- Metamorphic code and how exon/intron and the different translation mechanisms look similar
- Section about Metamorphic Viruses in The Art of Computer Virus Research and Defense, Peter Szor, Addison Wesley Professional, 2005
- How Trivial DNA Changes Can Hurt Health by J. V. Chamary and Laurence D. Hurst, Scientific American, June 2009
- Folding@home from Stanford University
- Foldit, Solve Puzzles for Science, by University of Washington
- UWfoldit channel on YouTube for introductory videos
- the Interactorium based on Skyrails by Yose Widjaja
- Third International Workshop on Machine Learning in Systems Biology MLSB'09 - Ljubljana
- YRI Trio Dataset, Amazon Web Services Developer Community October 2009
- Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria
- 700GB
- Boolean Logic Unlocks The Key To Finding New Genes in Milliseconds by Aaron Saenz, Singularity Hub April 2010
- Artificial Intelligence and Molecular Biology edited by Lawrence Hunter (archived at AAAI.org)
Overall remarks and questions
- this? that?
Synthesis
So in the end, it was about X and was based on Y.
Critics
Point A, B and C are debatable because of e, f and j.
Vocabulary
(:new_vocabulary_start:) new_word (:new_vocabulary_end:)
Post-reading model
Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain after having read the book. Link it to the pre-reading model and align the two to help easy comparison.


