Overview

Software developers, like many other knowledge workers, have sophisticated information needs while at the same time being overwhelmed with information. For example, searching for "How do I send an email with Java" finds roughly 143,000,000 related documents, including articles, forum posts, and mailing list archives: far more than anyone can usefully sift through. Recommendation Systems are tools that help users navigate large information spaces by providing recommendations, that is, pieces of information estimated to be relevant in the context of a given task. Recommendation Systems for Software Engineering, or RSSEs, provide recommendations in highly technical contexts where analyses of structured data (such as source code) must often complement traditional data mining techniques. For a more detailed overview of RSSEs, see the IEEE Software article [RWZ2010] in the reading list below.
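To make this concrete, here is a small, purely illustrative sketch (in Python, with invented usage data and method names) of one kind of analysis an RSSE might perform: mining which API methods tend to be used together in client code, and recommending the most frequent companions of a method the developer is already using. It is a toy example in the spirit of the usage-mining systems covered later in the course, not a description of any particular system.

    # Toy usage-mining recommender; the usage data below is invented for illustration.
    from collections import Counter
    from itertools import combinations

    # Each entry lists the API methods called together in one (hypothetical) client method.
    usage_contexts = [
        {"Session.getInstance", "MimeMessage.setFrom", "Transport.send"},
        {"Session.getInstance", "MimeMessage.setSubject", "Transport.send"},
        {"Session.getInstance", "Transport.send"},
        {"File.open", "File.read", "File.close"},
    ]

    # Count how often each pair of methods co-occurs in the same context.
    cooccurrence = Counter()
    for context in usage_contexts:
        for a, b in combinations(sorted(context), 2):
            cooccurrence[(a, b)] += 1

    def recommend(query, k=3):
        """Return up to k methods most frequently used together with `query`."""
        scores = Counter()
        for (a, b), count in cooccurrence.items():
            if a == query:
                scores[b] += count
            elif b == query:
                scores[a] += count
        return scores.most_common(k)

    # A developer calling Session.getInstance would be shown Transport.send first.
    print(recommend("Session.getInstance"))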

Learning Outcomes

The course will cover topics in three major areas:

The course will also help you practice and improve important soft skills for researchers:

Course Work and Evaluation

The course will involve a combination of "roadmap" lectures and invited lectures on selected topics, student presentations and discussion of research papers, and "work-in-progress" project presentations. The course also involves a major project: the development of a prototype RSSE. The final grade will take into account the project (50%), class participation (30%), and a take-home exam (20%). The course will be based on the book Recommender Systems: An Introduction by Jannach et al. [JZF2010] and on selected scientific papers.

Official Academic Integrity Statement

McGill University values academic integrity. Therefore all students must understand the meaning and consequences of cheating, plagiarism and other academic offenses under the Code of Student Conduct and Disciplinary Procedures (see www.mcgill.ca/students/srr for more information).

Language Policy

In accord with McGill University’s Charter of Students’ Rights, students in this course have the right to submit in English or in French any written work that is to be graded.

Seminars

Most class meetings are reserved for the presentation and discussion of research papers ("seminars"). For each paper, each student will be assigned the role of "presenter", "discussant", or "audience". The class participation grade will be based on performance in these roles.

Project

The course project is to develop a prototype RSSE. You can choose whatever application and technique you like, as long as it involves the analysis of software engineering artifacts. Although you will be expected to develop a complete and functional RSSE, you are encouraged to focus on a specific aspect that corresponds to your research area of interest (e.g., mining algorithms, data preprocessing, or user interfaces).
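To give a rough sense of the expected scope, the sketch below shows one possible decomposition of a project RSSE into the aspects mentioned above (data preprocessing, mining, and a user interface). The stage names and types are placeholders chosen for illustration, not a required architecture.

    # Hypothetical skeleton of an RSSE prototype; names and types are placeholders.
    from dataclasses import dataclass
    from typing import Iterable, List

    @dataclass
    class Recommendation:
        item: str       # e.g., a method, code example, bug report, or developer
        score: float    # estimated relevance to the developer's current task
        rationale: str  # short explanation shown alongside the recommendation

    def collect_artifacts(source: str) -> Iterable[str]:
        """Data preprocessing: load and clean software engineering artifacts
        (source files, commits, bug reports, ...) from the given source."""
        raise NotImplementedError

    def mine(artifacts: Iterable[str], query: str) -> List[Recommendation]:
        """Mining: analyze the artifacts and rank candidate recommendations
        for the developer's current task or query."""
        raise NotImplementedError

    def present(recommendations: List[Recommendation]) -> None:
        """User interface: display the top recommendations with their rationale."""
        for r in recommendations[:10]:
            print(f"{r.score:.2f}  {r.item}  ({r.rationale})")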

At the same time as you develop the technical aspects of your project, you will write a report on it using ACM's conference formatting guidelines. There are three milestones:

  1. Proposal: Email the instructor (by 26 January 11:53pm) a 3-page description of the RSSE you want to build. Pitch your project proposal to your classmates on 30 January and collect their feedback and reactions. Would they fund your project?
  2. Midterm report: Email the instructor (by 24 February 11:55pm) a 5-page report (extended from your proposal) that focuses on the motivation, main techniques, and general architecture of the RSSE. Your report should include details of the data sources and mining algorithms you use, references to reused packages, and illustrations of non-standard techniques or algorithms employed. The report should also include a brief discussion of (at least) the two or three most related works, with bibliographic references. Describe your system in class on 27 February.
  3. Grand finale: Present your completed system, including a live demo, on 11 or 16 April. Email your final, 8-page report (extended from the midterm report) to the instructor before 17 April, 11:59pm.

Details on the format of the reports and presentation, and general guidelines and advice, will be provided in class.

Final Exam

A one-page essay answering a synthesis question, to be completed on your own within a 24-hour period at some point after the project demos.

Schedule

This schedule is subject to change. The seminar readings are the articles listed under Seminar Articles in the Reading List below.

Date | Class Topics | Reading
Mon 9 Jan | Introduction to software engineering research. Roadmap: Recommendation systems. Overview of the project. | [RWZ2010]
Wed 11 Jan | Roadmap: Data mining software repositories | [XTL2009]
Mon 16 Jan | Seminar: Early Systems: CodeBroker and ExpertiseBrowser | [YF2002] [MH2002]
Wed 18 Jan | Seminar: Recommendations for the web: tags and shortcuts | [LM2010] [BCC2009]
Mon 23 Jan | Seminar: Applications of content-based recommendations: features and bug reports | [AHM2006] [DGH2011]
Wed 25 Jan | Seminar: Code comprehension: reuse and debugging | [HRR2009] [AJL2009]
Mon 30 Jan | Project proposals
Wed 1 Feb | Seminar: Mining code usage | [LZ2005] [BMM2009]
Mon 6 Feb | Seminar: Finding code examples | [SC2006] [BOL2010]
Wed 8 Feb | Seminar: Synthesizing code examples | [MXB2005] [DR2011]
Mon 13 Feb | Invited Lecture: Partial program analysis and the SemDiff recommender | [DR2008]
Wed 15 Feb | Invited Lecture: Mining user interaction data | [YR2011]
Mon 20 Feb | No class - Study break
Wed 22 Feb | No class - Study break
Mon 27 Feb | Work in progress presentations
Wed 29 Feb | Seminar: Specification Mining | [ABL2002] [GS2009]
Mon 5 Mar | Seminar: API property inference | [ZZX2009] [HST2010]
Wed 7 Mar | Seminar: Code Quality | [ECH2001] [KR2009]
Mon 12 Mar | Seminar: Bug prediction | [SZW2007] [BMN2011]
Wed 14 Mar | Seminar: Software Evolution | [ZWD2004] [KN2009]
Mon 19 Mar | Roadmap: Metrics and evaluation | [RRS2009] Chapter 8; [JZF2010] Chapter 7
Wed 21 Mar | Seminar: Personalization | [FYW2004] [TDH2005]
Mon 26 Mar | Seminar: Interaction traces | [PG2006] [FOM2010]
Wed 28 Mar | Seminar: User interfaces | [KRW2011] [SS2011]
Mon 2 Apr | Seminar: Explanation | [HKR2000] [VSR2009]
Wed 4 Apr | Roundtable: Privacy Issues in Recommender Systems | Selected by students
Mon 9 Apr | No class - Easter Monday
Wed 11 Apr | Project presentations
Mon 16 Apr | Project presentations

Reading List

General References

Sources not explicitly discussed as part of the seminars, but that provide useful additional background on the course in general or on specific topics.

[AT2005]G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, Jun. 2005.
[CACM1997]Communications of the ACM, Special Issue on Recommender Systems, vol. 40, no. 3, Mar. 1997.
[DR2008]B. Dagenais and M. P. Robillard, “Recommending adaptive changes for framework evolution,” in Proceedings of the 30th ACM/IEEE International Conference on Software Engineering, 2008, pp. 481–490.
[JZF2010]D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich, Recommender Systems: An Introduction. Cambridge University Press, 2010.
[RRS2009]F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Recommender Systems Handbook. Springer, 2009.
[RWZ2010]M. Robillard, R. Walker, and T. Zimmermann, “Recommendation systems for software engineering,” IEEE Software, vol. 27, no. 4, pp. 80–86, Aug. 2010.
[XTL2009]T. Xie, S. Thummalapenta, D. Lo, and C. Liu, “Data mining for software engineering,” IEEE Computer, vol. 42, no. 8, pp. 35–42, 2009.
[YR2011]A. T. T. Ying and M. P. Robillard, “The influence of the task on programmer behaviour,” in Proceedings of the 19th IEEE International Conference on Program Comprehension, 2011, pp. 31–40.

Seminar Articles

[ABL2002]G. Ammons, R. Bodík, and J. R. Larus, “Mining specifications,” in Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2002, pp. 4–16.
[AHM2006]J. Anvik, L. Hiew, and G. C. Murphy, “Who should fix this bug?,” in Proceedings of the 28th ACM/IEEE International Conference on Software Engineering, 2006, pp. 361–370.
[AJL2009]B. Ashok, J. Joy, H. Liang, S. K. Rajamani, G. Srinivasa, and V. Vangala, “DebugAdvisor: a recommender system for debugging,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2009, pp. 373–382.
[BCC2009]R. Baraglia et al., “Search shortcuts: a new approach to the recommendation of queries,” in Proceedings of the 3rd ACM Conference on Recommender Systems, 2009, pp. 77–84.
[BMM2009]M. Bruch, M. Monperrus, and M. Mezini, “Learning from examples to improve code completion systems,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2009, pp. 213–222.
[BMN2011]C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu, “Don’t touch my code! Examining the effects of ownership on software quality,” in Proceedings of the 8th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2011.
[BOL2010]S. K. Bajracharya, J. Ossher, and C. V. Lopes, “Leveraging usage similarity for effective retrieval of examples in code repositories,” in Proceedings of the 18th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, 2010, pp. 157–166.
[DGH2011]H. Dumitru et al., “On-demand feature recommendations derived from mining public product descriptions,” in Proceedings of the 33rd ACM/IEEE International Conference on Software Engineering, 2011, pp. 181–190.
[DR2011]E. Duala-Ekoko and M. P. Robillard, “Using structure-based recommendations to facilitate discoverability in APIs,” in Proceedings of the European Conference on Object-Oriented Programming, 2011, pp. 79–104.
[ECH2001]D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: a general approach to inferring errors in systems code,” in Proceedings of the 18th ACM Symposium on Operating Systems Principles, 2001, pp. 57–72.
[FOM2010]T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill, “A degree-of-knowledge model to capture source code familiarity,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 2010, pp. 385–394.
[FYW2004]F. Liu, C. Yu, and W. Meng, “Personalized Web search for improving retrieval effectiveness,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 28–40, Jan. 2004.
[GS2009]M. Gabel and Z. Su, “Symbolic mining of temporal specifications,” in Proceedings of the 30th ACM/IEEE International Conference on Software Engineering, 2009, pp. 51–60.
[HKR2000]J. L. Herlocker, J. A. Konstan, and J. Riedl, “Explaining collaborative filtering recommendations,” in Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, 2000, pp. 241–250.
[HRR2009]R. Holmes, T. Ratchford, M. P. Robillard, and R. J. Walker, “Automatically recommending triage decisions for pragmatic reuse tasks,” in Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, 2009, pp. 397–408.
[HST2010]H. Zhong, S. Thummalapenta, T. Xie, L. Zhang, and Q. Wang, “Mining API mapping for language migration,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 2010, pp. 195–204.
[KN2009]M. Kim and D. Notkin, “Discovering and representing systematic code changes,” in Proceedings of the 31st ACM/IEEE International Conference on Software Engineering, 2009, pp. 309–319.
[KR2009]D. Kawrykow and M. P. Robillard, “Improving API usage through automatic detection of redundant code,” in Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, 2009, pp. 111–122.
[KRW2011]B. P. Knijnenburg, N. J. M. Reijmer, and M. C. Willemsen, “Each to his own: how different users call for different interaction methods in recommender systems,” in Proceedings of the 5th ACM Conference on Recommender Systems, 2011, pp. 141–148.
[LM2010]M. Lipczak and E. Milios, “Learning in efficient tag recommendation,” in Proceedings of the 4th ACM Conference on Recommender Systems, 2010, pp. 167–174.
[LZ2005]Z. Li and Y. Zhou, “PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code,” in Proceedings of the 10th European Software Engineering Conference held jointly with the 13th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, 2005, pp. 306–315.
[MH2002]A. Mockus and J. D. Herbsleb, “Expertise browser: a quantitative approach to identifying expertise,” in Proceedings of the 24th ACM/IEEE International Conference on Software Engineering, 2002, pp. 503–512.
[MXB2005]D. Mandelin, L. Xu, R. Bodík, and D. Kimelman, “Jungloid mining: helping to navigate the API jungle,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005, pp. 48–61.
[PG2006]C. Parnin and C. Görg, “Building usage contexts during program comprehension,” in Proceedings of the 14th IEEE International Conference on Program Comprehension, 2006, pp. 13–22.
[SC2006]N. Sahavechaphan and K. Claypool, “XSnippet: mining for sample code,” in Proceedings of the 21st ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications, 2006, pp. 413–430.
[SS2011]E. I. Sparling and S. Sen, “Rating: how difficult is it?,” in Proceedings of the 5th ACM Conference on Recommender Systems, 2011, pp. 149–156.
[SZW2007]S. Kim, T. Zimmermann, E. J. Whitehead, and A. Zeller, “Predicting faults from cached history,” in Proceedings of the 29th ACM/IEEE International Conference on Software Engineering, 2007, pp. 489–498.
[TDH2005]J. Teevan, S. T. Dumais, and E. Horvitz, “Personalizing search via automated analysis of interests and activities,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 449–456.
[VSR2009]J. Vig, S. Sen, and J. Riedl, “Tagsplanations: explaining recommendations using tags,” in Proceedings of the 14th International Conference on Intelligent User Interfaces, 2009, pp. 47–56.
[YF2002]Y. Ye and G. Fischer, “Supporting reuse by delivering task-relevant and personalized information,” in Proceedings of the 24th ACM/IEEE International Conference on Software Engineering, 2002, pp. 513–523.
[ZWD2004]T. Zimmermann, P. Weissgerber, S. Diehl, and A. Zeller, “Mining version histories to guide software changes,” in Proceedings of the 26th ACM/IEEE International Conference on Software Engineering, 2004, pp. 563–572.
[ZZX2009]H. Zhong, L. Zhang, T. Xie, and H. Mei, “Inferring resource specifications from natural language API documentation,” in Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, 2009, pp. 307–318.

Acknowledgements

This course draws inspiration from many sources, and in particular: discussions on RSSEs with the co-organizers of the RSSE workshop (Walid Maalej, Rob Walker, and Tom Zimmermann), joint work on API property inference with Mira Mezini and Eric Bodden at TU Darmstadt, Ahmed Hassan's course on Mining Software Engineering Data, and the exciting work of my graduate students.