Intelligent Bioinformatics : The Application of Artificial Intelligence Techniques to Bioinformatics Problems - ISBN 978-0-470-02175-0 - Wiley 2005
Note: the reading was finished much earlier, but without recall sessions; recall starts a month before the current date so that it begins directly with widely spaced recalls.
Motivation
Understand how the tools of AI, and ML in particular, get applied to biology and how this could be a way to "edit back" our substrate by directly editing DNA, RNA or whatever representation biological information could have, eventually following homoiconicity.
Pre-reading model
Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain before having read the book.
Questions beforehand
- Is DIY bio really attainable?
- What more do I need to know in order to leverage the new tools? (cloud computing, human genome, ...)
- Should I go further and read Structural Bioinformatics 2008 or Biobazaar 2008?
- What trends can I expect?
- What will be the consequences of those trends?
- Is my knowledge regarding AI up to date?
- How does this relate to Protocells?
- What do the guys in #bioinformatics who do work in the field think about it?
- Should the problem of information overload vs new tools to handle it be considered a (cognitive) arms race and thus part of the Seedea:Research/Drive?
Skimming
- Part 1 on introduction will ironically probably be the hardest as my knowledge of biology is still embryonic
- Part 2 on current techniques is pretty straightforward, so focusing on the specificity of biological data might be the more interesting aspect of it
- Part 3 on future techniques should be a bit more complex yet quite classical; since the book was published in 2005, we should assess whether those techniques are in use now (2009): if yes, what the results are and what the current "future techniques" are; if not, what limited them.
Reading
Part 1 - Introduction
- Chapter 1
- Chapter 2
- "assumption for proteins is that linear sequence determines shape which, in turn, determines function." (p32) resonates with "Form follows function", a principle associated with modern architecture and industrial design in the 20th century.
- code comparison
- "computational cost" (p35) looks like combinatorics (cf N is a Number)
- animations on Microarray
- reverse engineering gene networks -> working from gene expression data is one of the key problems, yet it is computationally tricky given the amount of available data and the number of combinations, as detailed in the chapter
- Ethical considerations (p48)
- interesting point on genome closeness and relatives: how one analysis provides information about others and could lead to potential problems (very similar to social network analysis)
- explanation on stem cells with totipotent, blastocyst, pluripotent, multipotent phases (p48) and a very clear schema (p49)
- summary of chapter p63
- Chapter 3
- schema are easy to understand but their readings...
- summary of chapter p98
Part 2 - Current techniques
- Chapter 4
- Bayesian networks + handling loops = Markov networks
- HMM (Hidden Markov Model) description
- recommended tutorial on usage and classical tool in the domain HMMER
- summary of chapter p125
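To make the HMM description of Chapter 4 more concrete, here is a minimal sketch of the forward algorithm, which gives the probability of an observed sequence under a model. The two hidden states ("gc_rich", "at_rich") and all the probabilities are invented for illustration; they come neither from the book nor from HMMER.

```python
from itertools import product

# Hypothetical two-state model of a DNA sequence (all numbers are assumptions).
STATES = ("gc_rich", "at_rich")
START = {"gc_rich": 0.5, "at_rich": 0.5}
TRANS = {
    "gc_rich": {"gc_rich": 0.9, "at_rich": 0.1},
    "at_rich": {"gc_rich": 0.1, "at_rich": 0.9},
}
EMIT = {
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def forward(observations, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Probability of the observed sequence, summed over all hidden paths."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Sanity check: the probabilities of all sequences of a given length sum to 1.
total = sum(forward("".join(seq)) for seq in product("ACGT", repeat=3))
```

The hidden states are never observed directly; only the emitted symbols are, which is what makes the model "hidden". Profile HMM tools presumably rely on far richer models than this toy one.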
- Chapter 5 - Nearest Neighbour and Clustering Approaches
- ‘unsupervised’ techniques
- interesting as it is not limited to 1 or 2 dimensions of analysis (p130)
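As an illustration of unsupervised grouping that is not limited to 1 or 2 dimensions, here is a minimal k-means sketch. k-means is picked as one common clustering method; these notes do not record which algorithms the chapter actually details, and the 3-dimensional points below are made up.

```python
from math import dist  # Euclidean distance in any dimension (Python 3.8+)


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if empty)."""
    moved = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)
    return moved


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
        labels = assign(points, centroids)
    return labels, centroids


# Hypothetical 3-dimensional measurements forming two obvious groups.
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 10), (10, 11, 10), (11, 10, 10)]
labels, centroids = kmeans(points, [points[0], points[3]])
```

No class labels are given as input: the groups emerge from the distances alone, which is the "unsupervised" part. Adding a dimension only means adding a coordinate to each tuple.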
- Chapter 6 - Identification (Decision) Trees
- supervised as classes are not generated by the classification process itself
- sum-up algorithm
- 1 For each feature, compute the gain criterion.
- 2 Select the best feature and split the data according to the values in that feature.
- 3 If each of the subsets contains just one class then stop. Otherwise, reapply points 1–3 on each of the subsets of data.
- 4 If the data is not completely classified but there are no more splits available then stop.
- splits using information-theory measures to build the Gain criterion
- major work from John Ross Quinlan
- improvements with Bootstrap aggregating (bagging) and Boosting
- countering overfitting by pruning, using test sets
- see also uClassify with its classifiers and API
- Information on See5/C5.0 by Rulequest Research
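Steps 1 to 4 of the sum-up algorithm above can be sketched as follows. This assumes the "gain criterion" is plain information gain (entropy reduction) as in Quinlan's ID3, and that features are categorical; the toy rows, feature names and class labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy removed by splitting the data on one categorical feature."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) node."""
    if len(set(labels)) == 1:  # step 3: the subset contains just one class
        return labels[0]
    if not features:  # step 4: no more splits available, use the majority class
        return Counter(labels).most_common(1)[0][0]
    # steps 1 and 2: compute the gain of each feature and keep the best one
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (best, branches)


# Hypothetical toy data: "motif" separates the classes, "gc" does not.
rows = [
    {"motif": "yes", "gc": "high"},
    {"motif": "yes", "gc": "low"},
    {"motif": "no", "gc": "high"},
    {"motif": "no", "gc": "low"},
]
labels = ["coding", "coding", "noncoding", "noncoding"]
tree = build_tree(rows, labels, ["motif", "gc"])
```

Unlike a neural network, the resulting tree can be read directly as rules, for instance "if motif is yes then coding". Pruning, bagging and boosting are left out of this sketch.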
- Chapter 7 - Neural Networks
- a basic "brick" that processes information in a very simple way, but with a very high number of connections (linking with connectivism) in order to be able to generalize
- sigmoid function, discussed as a threshold model for neurons, added to Scale Free Punctuated Learning
- different architectures
- 2 layers : one input layer and one output layer, or perceptron (Rosenblatt, 1958)
- multiple layers with hidden layers (as popularised in Rumelhart & McClelland, 1986)
- Kohonen Self Organizing Map (Kohonen, 1990)
- error backpropagation
- "The discrepancy between the two [input and output data] is calculated and the network then makes changes to its internal weights to reduce the error the next time this input data is presented." (p178-179)
- learning through a change of topology in a neural network really starts to make Information Geometry fundamental
- drawbacks
- overfitting
- partly solved by conserving a subset of the training dataset to test
- choosing and configuring the right architecture
- any hidden layers used should have fewer units than the input layer
- number of units in the hidden layer decreases from input to output
- black-boxing, the result does not provide "human-understandable terms"
- SNNS - Stuttgart Neural Network Simulator, developed at the University of Stuttgart, maintained at the University of Tübingen
- few usages in bioinformatics
- Classification and dimensionality reduction of gene expression data
- Identifying protein subcellular location
- summary of chapter p193
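The weight correction quoted above (p178-179) can be sketched on the smallest possible case: a single sigmoid unit trained by gradient descent on the squared error. This is only the output-layer step of error backpropagation, without any hidden layer, and the learning rate, epoch count and OR data set are arbitrary choices made for illustration.

```python
import math
import random


def sigmoid(x):
    """Smooth threshold function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Output of one sigmoid unit for a given input vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=2000, rate=0.5, seed=0):
    """Adjust weights after each sample to reduce the output error."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # discrepancy scaled by the slope of the sigmoid at this output
            delta = (target - output) * output * (1.0 - output)
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias


# Logical OR, a linearly separable problem that one unit can learn.
samples = [
    ((0.0, 0.0), 0.0),
    ((0.0, 1.0), 1.0),
    ((1.0, 0.0), 1.0),
    ((1.0, 1.0), 1.0),
]
weights, bias = train_neuron(samples)
```

The learned knowledge ends up as a few numbers in `weights` and `bias`, which hints at the black-box drawback listed above: nothing in them reads as a human-understandable rule.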
- Chapter 8 - Genetic Algorithms
- inspired by biological systems, yet not initially designed with bioinformatics in mind
- single/multi objectives GA
- operators/mutations/genes/populations/fitness mechanism
- usages for genes and biological data
- Reverse engineering of regulatory networks
- Multiple sequence alignment
- summary of chapter p217
- see also SpeedyGA. Hacking Evolution by Keki Burjorjee (not tested seriously so far or applied to bioinformatics)
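The operators/mutations/genes/populations/fitness mechanism can be sketched as a minimal single-objective GA. The fitness function (OneMax, counting the 1s in a bit string) and every parameter value are assumptions chosen for brevity, not something taken from the book or from SpeedyGA.

```python
import random


def onemax(genes):
    """Toy fitness: the number of 1s in the bit string."""
    return sum(genes)


def tournament(population, fitness, rng, size=3):
    """Selection: the fittest of a few randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)


def crossover(mother, father, rng):
    """One-point crossover between two parents of equal length."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:]


def mutate(genes, rng, rate):
    """Flip each gene independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in genes]


def evolve(fitness, n_genes=20, pop_size=30, generations=40, rate=0.02, seed=1):
    """Return the best individual and the best fitness seen at each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(population, key=fitness)
        history.append(fitness(elite))
        children = [elite]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            child = crossover(
                tournament(population, fitness, rng),
                tournament(population, fitness, rng),
                rng,
            )
            children.append(mutate(child, rng, rate))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve(onemax)
```

Because of elitism the best fitness never decreases from one generation to the next. For a bioinformatics use such as multiple sequence alignment, the bit string and `onemax` would be replaced by an encoding of the alignment and a scoring function.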
Part 3 - Future Techniques
- Chapter 9 - Genetic Programming
- GP = GA + parse tree to represent a solution to the problem
- created by John Koza (see also his homepage)
- yet "automatic programming" and similar concepts seem very close to the initial goal of languages like Lisp (the first homoiconic programming language), so stating who "created" it can be difficult (GP ~1990 while homoiconicity ~1960).
- "Genetic algorithms produce solutions that contain combinations of parameter values (possibly weighted) to satisfy a function, whereas GP produces solutions that contain a series of instructions for producing desired and specified program behaviour." (p225)
- see also Gary Cziko's WithoutMiracles especially section The Computer Can Know More than the Programmer of chapter 13 Evolutionary Computing: Selection Within Silicon
- "These include ‘human-competitive’ applications where GP has created previously patented inventions, or new inventions which are determined to be of patentable quality" (p231-232)
- usage in bioinformatics
- Genetic programming in data mining for drug discovery
- Genetic programming for functional genomics in yeast data
- summary of chapter p236
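To illustrate "GP = GA + parse tree", here is a minimal sketch of how a candidate program can be represented as a parse tree, evaluated, and scored against test cases. The function set, the terminal set and the nested-tuple encoding are assumptions made for this example; the evolutionary loop itself would be the same kind of selection, crossover (swapping subtrees) and mutation as in a GA.

```python
import operator
import random

# Hypothetical primitive set: binary functions plus terminals (a variable, constants).
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
TERMINALS = ["x", 1.0, 2.0]


def evaluate(tree, env):
    """Recursively evaluate a tree of the form (function, left, right)."""
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):  # a variable name, looked up in the environment
        return env[tree]
    return tree  # a numeric constant


def random_tree(rng, depth):
    """Grow a random program, as done for a GP initial population."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def error(tree, cases):
    """Fitness to minimise: total absolute error over (input, output) cases."""
    return sum(abs(evaluate(tree, {"x": x}) - y) for x, y in cases)


# The program x * x + 1 written as a parse tree, and cases it should satisfy.
target = ("add", ("mul", "x", "x"), 1.0)
cases = [(x, x * x + 1.0) for x in (0.0, 1.0, 2.0)]
candidate = random_tree(random.Random(0), 3)
```

The nested tuples play the role of Lisp s-expressions: the program is itself a data structure that other code can cut, swap and rewrite, which is where the link with homoiconicity comes from.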
- Chapter 10 - Cellular Automata
- introduced by John von Neumann on the suggestion of Stan Ulam during the late 1940s
- more often used for simulation than for optimization
- grid and behavior are equivalent to inherent laws of the system
- classic example: Conway's Game of Life
- usage in bioinformatics
- Cellular automata model for enzyme kinetics
- Simulation of an apoptosis reaction network using cellular automata
- summary of chapter p252
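The idea that the grid and its update rule act as the inherent laws of the system can be sketched with Conway's Game of Life. Storing the live cells as a set of coordinates on an unbounded grid is an implementation choice made here for brevity, not something taken from the book.

```python
from collections import Counter


def life_step(alive):
    """Apply one generation of Conway's rules to a set of live (x, y) cells."""
    # Count, for every cell, how many live neighbours it has.
    neighbours = Counter(
        (x + dx, y + dy)
        for x, y in alive
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth with exactly 3 neighbours, survival with 2 or 3.
    return {
        cell
        for cell, count in neighbours.items()
        if count == 3 or (count == 2 and cell in alive)
    }


# The "blinker": three cells in a row that oscillate with period 2.
blinker = {(0, 1), (1, 1), (2, 1)}
vertical = life_step(blinker)
```

Every cell follows the same local rule and nothing is optimised, which matches the remark that cellular automata are used for simulation. An enzyme kinetics model would presumably replace the live/dead state with molecule types and the rule with reaction probabilities.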
- Chapter 11 - Hybrid Methods
- mash-ups fitting directly to their applications
- existing mapping between a repository of techniques and their applications?
See also
- UVs at l'UTC
- Health#PersonalGenomics
- Seedea's AI matrix
- MLOSS.org machine learning open source software
- Metamorphic code and how exons/introns and the different translation mechanisms look similar
- Section about Metamorphic Viruses in The Art of Computer Virus Research and Defense, Peter Szor, Addison Wesley Professional, 2005
- How Trivial DNA Changes Can Hurt Health by J. V. Chamary and Laurence D. Hurst, Scientific American, June 2009
- Folding@home from Stanford University
- Foldit, Solve Puzzles for Science, by University of Washington
- the Interactorium based on Skyrails by Yose Widjaja
- Third International Workshop on Machine Learning in Systems Biology MLSB'09 - Ljubljana
- YRI Trio Dataset, Amazon Web Services Developer Community October 2009
- Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria
- 700GB
- Boolean Logic Unlocks The Key To Finding New Genes in Milliseconds by Aaron Saenz, Singularity Hub April 2010
- Artificial Intelligence and Molecular Biology edited by Lawrence Hunter (archived at AAAI.org)
Overall remarks and questions
- the reading notes do not really answer the initial motivation of seeing which technique best fits which problem, and why
Synthesis
So in the end, it was about X and was based on Y.
Critics
Point A, B and C are debatable because of e, f and j.
Vocabulary
(:new_vocabulary_start:)
new_word
(:new_vocabulary_end:)
Post-reading model
Draw a schema (using PmGraphViz or another solution) of the situation of the area in the studied domain after having read the book. Link it to the pre-reading model and align the two to help easy comparison.