ProK 2014

Bellairs 2014 Workshop on Representing Programming Knowledge

14-21 February 2014, Barbados


The last decade has witnessed an explosion in the number of programming technologies and reusable components available to software developers. Each new technology and API introduces a body of knowledge that must be assimilated to achieve expertise. The fast pace of technological development means that documentation and learning resources often lag behind the corresponding technology, and that an unprecedented amount of technical knowledge is captured informally, tacitly, or not at all.

The software engineering research community has made major advances in techniques for mining software repositories, which can then be leveraged for inferring programming knowledge. However, a question now arises: how do we effectively represent, display, and disseminate programming knowledge?

This multidisciplinary workshop will bring together researchers in software engineering, knowledge representation, and information visualization with the goal of exploring how programming knowledge collected and inferred from software development organizations can be represented for the purpose of training and supporting software developers.


The workshop will take place at McGill University's Bellairs Research Institute, located directly on a beautiful beach in Barbados. The Bellairs Institute provides basic accommodation (in double occupancy only). The workshop activities will be held directly at the institute.


The workshop is organized by Martin Robillard with the collaboration of Christoph Treude. Due to limited space availability, attendance at the workshop is by invitation only.


Andrew Begel, Microsoft Research, USA
Walid Maalej, University of Hamburg, Germany
Andrian Marcus, Wayne State University, USA
Leon Moonen, Simula Research Laboratory, Norway
Tamara Munzner, University of British Columbia, Canada
Gail Murphy, University of British Columbia, Canada
Emerson Murphy-Hill, North Carolina State University, USA
Martin Pinzger, University of Klagenfurt, Austria
Lori Pollock, University of Delaware, USA
Martin Robillard, McGill University, Canada
David Shepherd, ABB Corporate Research, USA
Jonathan Sillito, University of Calgary, Canada
Lin Tan, University of Waterloo, Canada
Christoph Treude, McGill University, Canada
Sebastian Uchitel, University of Buenos Aires, Argentina
Thomas Zimmermann, Microsoft Research, USA

Workshop Summary

To investigate the various aspects of representing programming knowledge, each participant of ProK 2014 was asked to prepare a short talk on the topic and to bring an original programming task along with a list/map/model of the knowledge required to solve it.

Sunday, February 16

Each day of the workshop aimed at answering a different question, starting with "What knowledge does a programmer need?" on the first day. Martin Robillard gave an introduction to the workshop, classifying related work into psychology literature, plan-focused work, and empirical work. The following discussion brought out a first classification of different types of programming knowledge: knowledge about practices, code behaviour, communication, and basic computational knowledge. In the second talk of the day, Thomas Zimmermann presented 145 questions that engineers at Microsoft ask about software. In addition, two concrete programming tasks along with their required knowledge items were discussed. The discussion focused on general programming knowledge versus domain knowledge, but also on different presentations of knowledge. In particular, succinct answers on Stack Overflow were generally preferred over long documentation essays.

Monday, February 17

The theme of the discussion on Monday was "Where is programming knowledge stored?" Emerson Murphy-Hill presented his work on representing and improving developers' knowledge about tools, showing that developers are often not aware of all tools available to them. Possible solutions, such as more tool-based education, tool recommenders, and making tools easier to discover, were discussed. An interesting point of discussion was the question of what constitutes the unit of a tool. Next, another programming task was discussed, focusing on various sources of knowledge, including Stack Overflow, other developers, and source code. At this point, the workshop seemed to converge on the concept of a question as a unit of programming knowledge.

The next talk by Jonathan Sillito discussed how software developers decompose features into programming changes to be implemented in source code. To do this effectively, developers need a shared vision, which can be established at different points in time, ranging from up-front planning to after-the-fact hacks. Awareness, or ideally predictability, is needed to make this process efficient. Christoph Treude discussed different representations of programming knowledge in software documentation, such as API documentation, blog posts, and Stack Overflow. His talk focused on the different dimensions along which these artifacts differ, and he concluded that while many different kinds of documentation exist, there are good reasons for this differentiation. The next programming task presented a challenge that could not be solved with Stack Overflow since the required knowledge crosscut many different levels of granularity, too many for the solution to be found in a single Stack Overflow answer. This example sparked a discussion about the organization of knowledge, in particular whether it is feasible to index knowledge by scenarios or programming tasks.

Tuesday, February 18

The overarching question for Tuesday was "How should programmer knowledge be represented?" Lori Pollock introduced the concept of an action unit, a sequence of consecutive statements that logically implement a high level action as a substep within a method's primary function. Action units are important for understanding a larger method, and can be detected via blank lines, comments, or code clones. Action units can also be used to summarize code, or to suggest refactorings. Next, Lin Tan's talk focused on making knowledge actionable. In particular, she addressed the diversity of information available in source code comments, stating that not all comments are created equal. An open research challenge is how we can automatically identify which comments are useful. In his talk on summarizing knowledge embedded in software, Andrian Marcus discussed possible solutions to source code summarization, in particular task-based summaries and their use by software developers. As a good representation for source code summaries, tree-like structures were discussed.
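As an illustration of the blank-line cue mentioned in Lori Pollock's talk, the following minimal sketch groups consecutive non-blank lines of a method body into candidate action units. It is a simplification under stated assumptions, not the detection technique presented at the workshop: it ignores the comment and code-clone cues, and the function name `split_action_units` is hypothetical.

```python
from typing import List


def split_action_units(method_body: str) -> List[List[str]]:
    """Group consecutive non-blank lines into candidate action units.

    Assumption: a blank line is the only boundary cue. Comments and
    code clones, the other cues named in the talk, are not considered.
    """
    units: List[List[str]] = []
    current: List[str] = []
    for line in method_body.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the unit being built.
            current.append(line)
        elif current:
            # Blank line after some statements: close the current unit.
            units.append(current)
            current = []
    if current:
        # Close the last unit if the body does not end with a blank line.
        units.append(current)
    return units
```

Each returned group could then serve as the input to a summarization or refactoring-suggestion step, which this sketch does not attempt.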

In an afternoon brainstorming session, different categorizations of programming knowledge were developed in small groups.

The next talk by Sebastian Uchitel explored the use of source code abstractions for validation tasks. He defined the behaviour of code as the state space of variable values, and the abstraction of this behaviour as combining states into sequences to serve as units of knowledge, meant for human consumption. In his talk on context-aware software engineering, Walid Maalej proposed that there is a lot of knowledge embedded in context, which includes everything we can observe or interpret during software development apart from actual source code modifications. Context could be captured by logging low-level events, which would then need to be aggregated to yield useful context information. Three challenges were discussed: splitting developer activity into sessions, describing what happened in a session, and comparing contexts.
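The first of these challenges, splitting developer activity into sessions, can be sketched with a simple idle-time threshold. This is one possible heuristic and not the approach presented in the talk. The event format (a timestamp in seconds paired with an event name), the function name `split_sessions`, and the five-minute default for `max_idle_seconds` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event format: (timestamp in seconds, event name).
Event = Tuple[float, str]


def split_sessions(events: List[Event],
                   max_idle_seconds: float = 300.0) -> List[List[Event]]:
    """Split a log of low-level events into sessions.

    Assumption: a new session starts whenever the gap since the
    previous event exceeds max_idle_seconds.
    """
    sessions: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if sessions and event[0] - sessions[-1][-1][0] <= max_idle_seconds:
            # Close enough in time to the previous event: same session.
            sessions[-1].append(event)
        else:
            # First event, or the idle gap was too long: start a new session.
            sessions.append([event])
    return sessions
```

A fixed threshold is easy to reason about but treats a long build or a meeting as a session boundary; describing and comparing the resulting sessions, the other two challenges, are left open here.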

Wednesday, February 19

The theme of Wednesday was "How do programmers communicate and collaborate?" Gail Murphy's talk addressed effective communication, and she started by describing issues that often lead to ineffective communication, such as failing to ask for clarification, using jargon, and being overly critical and negative. In software engineering, communication also takes place between developers and tools, and communication can be interpreted as learning. Martin Pinzger's talk addressed online collaboration around source code changes. Currently, information about changes is stored only as text, thus lacking semantic and context information. Several ideas to extract, share, understand, coordinate, and communicate context-specific source code changes were discussed. When looking at source code changes, developers care most about the impact of these changes on their own development activities.

Thursday, February 20

The theme of the last workshop day was "What does the perfect tool look like?" In Tamara Munzner's talk on visualization, she defined visualization systems as systems that provide visual representations of datasets designed to help people carry out tasks. She introduced a nested model for visualization design and validation as well as a multi-level typology of abstract visualization tasks. A crucial step when designing visualizations that also applies to programmer tools is transforming the original data into a form that is well suited for addressing the users' needs.

David Shepherd's talk discussed ways to surface program knowledge in such a way that developers cannot miss it. He compared knowledge presentation in state-of-the-art IDEs to setups commonly used in practice, focusing on layout, space used for source code, and time spent in different parts of the IDE. Lessons learned from commonly used lightweight tools such as SublimeText include that source code is by far the most important aspect, and that inline tools seem to work well. Andrew Begel reported on a study of Windows Phone developers who were learning new platforms on the side. Their learning process was problem-centered, self-planned, and self-evaluated. Several shortcomings of current learning resources were discussed, such as common assumptions made by tutorial writers. The resulting question was whether we can enable developers to find what they should learn next.

The discussion was wrapped up with the presentation of a few more programming tasks. Common issues included a lack of version and usage information when trying to reproduce bugs and missing information on the natural sequence for learning concepts. No consensus was reached on the definition of a unit of programming knowledge, mostly because knowledge units vary so widely in size, ranging from entire software architectures to directives for API calls. One way to capture and disseminate knowledge is to generate meta-information about the knowledge already available, in order to support automatic reasoning.