Machine Learning for Bioinformatics (COMP 766-001, Fall 2006)
- 06/11/06: Homework 3 solutions posted. Project proposal due Nov. 7.
- 20/10/06: Homework 3 has been assigned, and is due Oct. 26. Solutions to Homework 2 have been posted.
- 10/10/06: My office hours on Wed., Oct. 11 are moved to 2:30pm-3:30pm.
- 04/10/06: Homework 2 has been assigned, and is due Oct. 12. You should have received an e-mail with the assignment and data. They can also be obtained below in the Resources / handouts section.
- 27/09/06: Homework 1 is graded and will be handed back in class. Solutions are posted.
- 18/09/06: Homework 1 will be passed out in class tomorrow, or you can download it here: Homework_01.pdf. The data set is here: Homework_01_data.txt. The homework is due in or before class on 26/09/06, one week from tomorrow.
- 18/09/06: The paper for Tuesday's (19/09/06) class is Henikoff and Henikoff (1996) "Using substitution probabilities to improve position-specific scoring matrices" CABIOS, Vol. 12, No. 2, pp 135-143.
- 11/09/06: Contrary to the previous announcement, and to what I was suggesting in the last class, we will continue to hold class in the regular, assigned classroom (MT3438 04). The MCB conference room does not have enough whiteboard space.
- 05/09/06: Starting Tuesday (12/09/06), classes will be held in the McGill Centre for Bioinformatics (MCB) conference room. This room is on the lower level of the MCB. Knock on the MCB door (room 332 of Lyman-Duff building) to gain entry, go down the stairs to the left, and you will see the conference room.
- 05/09/06: Thursday's class (07/09/06) is cancelled, as I will be out of town. We resume on Tuesday (12/09/06).
Resources / handouts:
- Background on molecular biology -- at a minimum, you should know what the following things are and what they are about: DNA, RNA, gene, intron, exon, codon, amino acid, protein, protein complex, protein domain, transcription factor (TF), TF binding site, cell cycle, prokaryote, eukaryote. Some resources to help you learn these things include:
- Homework assignments and solutions.
Lecture schedule (in process of being reformatted):
| Lecture || Date || Topic(s) || Readings / materials |
| || Sep 5 || Introduction - What is machine learning? Course mechanics and outline. || |
| || Sep 7 || || |
| || Sep 12 || Brief review of probability theory. Parametric density estimation. || Bishop Chapter 2. |
| || Sep 14 || More parametric density estimation. Nonparametric density estimation. || Bishop Chapter 2. |
| || Sep 19 || Paper discussion. || Henikoff and Henikoff (1996) "Using substitution probabilities to improve position-specific scoring matrices" CABIOS, Vol. 12, No. 2, pp. 135-143. |
| || Sep 21 || More nonparametric density estimation. Testing for associations between discrete variables: Chi-square test. || For Chi-square - just about any stats book. |
| || Sep 26 || Testing for associations between discrete variables: information theory. || |
| || Sep 28 || Paper discussion. || Draghici et al. (2003) "Global functional profiling of gene expression" Genomics, Vol. 81, pp. 98-104. |
| || Oct 3 || More information theory. Begin prediction / regression: Linear & polynomial regression. Logistic regression. Naive Bayes. Gaussian discriminant analysis. || |
| || Oct 5 || || |
| || Oct 12 || || |
| || Oct 17 || Paper discussion. || Oberg et al. "Joint estimation of calibration and expression for high-density oligonucleotide arrays" Bioinformatics, Vol. 22, No. 19, pp. 2381-2387. |
| || Oct 19 || || |
| || Oct 24 || || |
| || Oct 26 || Decision and Regression Trees. Tests (for internal nodes). Criteria for test selection. Greedy growing and pruning. || Mitchell Ch. 3 |
| || Oct 31 || Closing comments on decision/regression trees. Random Forests. || Breiman "Random Forests" Machine Learning Vol. 45 pp. 5-32 (2001) |
| || Nov 2 || Boosting, especially AdaBoost. Paper reading. || Schapire "Theoretical views of boosting" EuroCOLT '99 pp. 1-10 (1999); Li et al. "Discovery of significant rules for classifying cancer diagnosis data" Bioinformatics Vol. 19 Suppl. 2 pp. ii93-ii102 (2003) |
| || Nov 7 || Continue boosting and paper discussion. || See above. |
| || Nov 9 || Nearest neighbor methods. Begin support vector machines. || Mitchell Chapter 8. For SVMs, any tutorial at www.kernel-machines.org. |
| || Nov 14 || Finish support vector machines. || See above. |
| || Nov 16 || Paper discussion. || Rangwala and Karypis (2005) "Profile-based direct kernels for remote homology detection and fold recognition" Bioinformatics, Vol. 21, No. 23, pp. 4239-4247. |
| || Nov 21 || || |
| || Nov 23 || || |
| || Nov 28 || || |
| || Nov 30 || || |
| || Dec 4 || || |
Taught by: Prof. Theodore J. Perkins
Office: McGill Centre for Bioinformatics
Course web page: http://www.mcb.mcgill.ca/~perkins/COMP766001_Fall2006
Class location: MT3438 04 (That is, Room 4 of 3438 McTavish, which is between the McGill Bookstore and the Undergraduate Student Union)
Class time: 1:05 PM to 2:25 PM, Tue and Thu
What this course is about:
The purpose of this course is to introduce students with some
background, or at least interest, in bioinformatics to the major
principles and techniques of machine learning, and to look at how
these can be applied to problems in bioinformatics. The course is
intended to be accessible to students from life science departments as
well as computer science or other technical departments. (Necessary
technical background will be kept to a minimum. On the other hand, the
course has previously been enjoyed by students who have already taken
COMP 652 - Machine Learning, for example. See more on
prerequisites below.) The specific topics to be covered include (not
necessarily in this order, and subject to revision based on student
interest):
- Probabilistic modeling - including the principle of maximum likelihood and Bayes's rule, density estimation, testing for association between variables, and a bit of Bayesian networks
- Unsupervised learning - including clustering and dimensionality reduction
- Supervised learning (also known as function approximation) - including linear and logistic regression, nearest neighbor, tree-based methods, artificial neural networks, support vector machines
- Modeling dynamical systems - including discrete time-series analysis, dynamical Bayes nets, and differential equation modeling
The goals of the course are to:
- Provide students with a "toolbox" of practical machine learning techniques that are useful for bioinformatics data analysis and research.
- Describe proper methodology for applying machine learning techniques, and common pitfalls.
- Give students enough expertise to understand and evaluate bioinformatics research papers that involve machine learning.
- Provide a sense of what can and what cannot be inferred from data.
- Examine which machine learning approaches have been most successful in bioinformatics to date.
Format: Approximately half of the classes will be lectures taught by Dr. Perkins, and half will be discussions of bioinformatics research papers that use machine learning.
- 33% -- Homework assignments, which may include written and programming exercises. Expect about 5 assignments of moderate length.
- 33% -- Research paper critiques and paper presentation. For classes in which a research paper is the main topic of discussion, students will write a short (1-2 page) review, which summarizes the paper, evaluates its strengths and weaknesses, and discusses potential improvements, alternative solutions, etc. Also, each paper will have a "student presenter", who is responsible for taking a few minutes to summarize the paper at the start of discussion.
- 33% -- Final exam or project, at each student's choice.
- 1% -- Freebie!
Prerequisites: Students should have studied calculus, have taken at least one class on probability/statistics, and have a basic background in computer science. If you are unsure, email me or talk to me in class.
Primary course materials:
- Lecture notes
- Various research papers
- Draft chapters of "Machine Learning and Bioinformatics: The Interface" by Mitra, Datta, Perkins and Michailidis
- And probably select chapters from the secondary course materials below
Secondary course materials:
- Neural Networks for Pattern Recognition. Bishop. Oxford University Press, 1997
- Machine Learning. Mitchell, McGraw-Hill, 1997
- Pattern Classification (2nd Edition). Duda, Hart, Stork. Wiley-Interscience, 2000
- The Elements of Statistical Learning. Hastie, Tibshirani, Friedman. Springer-Verlag, 2001
- Bioinformatics: A Machine Learning Approach. Baldi, Brunak. MIT Press, 1999.
- Probabilistic Reasoning in Intelligent Systems. Pearl. Morgan Kaufmann Publishers Inc., 1988
- Statistical Methods in Bioinformatics. Ewens, Grant. Springer-Verlag, 2001