Martin Robillard · Blog

Embracing Discrimination (in API Documentation)

8 August 2014 by Martin P. Robillard

A flat (non-hierarchical) content presentation mostly assumes the reader wants to read everything sequentially.

This assumption is fine for suspenseful novels, but it crumples in the case of API documentation. Often, a programmer will be looking for very precise information to get a specific task done, and will have no need for extraneous details. Yet, API reference documentation pages typically collect a wild variety of information (for evidence of this, see a different blog post).

As one example, we can look at the official Javadoc page of the Java Pattern class. It contains:

  1. A super-abstract description of what a regular expression pattern is;
  2. Information on how to create a pattern object and use it;
  3. A good tip for using a convenience method;
  4. A summary of 91 regular expression constructs, including how to represent a character in a Greek block;
  5. Various considerations on how to deal with more advanced issues such as escape characters, character classes, and groups;
  6. Notes on conformance to Unicode;
  7. Comparison with Perl5 regular expressions.

Most programmers will only need to read Item 1 once in their life. Items 2 and 3 will be useful to anyone who has not used the API for a few months. Some entries in Item 4 are indispensable for any regexp work, but others are more esoteric (if you have ever used a Greek block, please email me). Items 5 and 6 are clearly for more advanced usage scenarios, and Item 7 will be useful to readers working on compatibility concerns.
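To make Items 2 and 3 concrete, here is a minimal sketch of the kind of refresher they provide: compiling a pattern and using it through a matcher, next to the one-off convenience method Pattern.matches. The class name, the method names, and the date regex are made up for illustration; only the java.util.regex calls come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternBasics {

    // Item 2: compile once and reuse. A Pattern is immutable and safe to share.
    // The regex itself is a hypothetical example (an ISO-style date).
    private static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Returns the year of the first date found anywhere in the text, or -1 if there is none.
    static int firstYear(String text) {
        Matcher matcher = ISO_DATE.matcher(text);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    // Item 3: the convenience method for a regex that is used just once.
    // It compiles the pattern on every call and requires the whole input to match.
    static boolean isIsoDate(String text) {
        return Pattern.matches("\\d{4}-\\d{2}-\\d{2}", text);
    }

    public static void main(String[] args) {
        System.out.println(firstYear("Posted on 2014-08-08 by the author")); // 2014
        System.out.println(isIsoDate("2014-08-08"));                         // true
        System.out.println(isIsoDate("Posted on 2014-08-08"));               // false
    }
}
```

That is about a dozen lines of knowledge, and it is buried in the same page as the Unicode conformance notes.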

Ideally, content fulfilling such different information needs would be annotated or separated to enable more streamlined access. How can we disentangle all this?

For his M.Sc. thesis project, Yam Chhetri used a pattern-based technique to detect and extract fragments of API documentation likely to contain common and important information that all programmers should know. The information is extracted from Javadocs, and automatically recommended to programmers who are using a related type or method.

For example, a programmer using Thread would get the following recommendation (in an Eclipse view, for example):

An Executor is normally used instead of explicitly creating threads.

(Did you know this?)

The magic here is that, as part of the development of the approach, we mined the unordered word pattern (<API element> "is" "used") as an indicator of important and common programming knowledge. We then applied this pattern to find instances such as the sentence quoted above. Finally, the sentence was linked back to Thread because it contains the name of the class. When the use of the type Thread is detected, we can volunteer the information.
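The matching and linking steps can be sketched roughly as follows. This is not the implementation from the thesis, only an illustration under loud assumptions: the tokenizer is naive, the set of known API types is hard-coded, and a type counts as mentioned if its name appears in any letter case, optionally with a trailing "s" (which is how "threads" gets linked to Thread in this sketch). All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class KnowledgeItemSketch {

    // Lowercased word tokens of a sentence (naive tokenizer, assumed for illustration).
    static List<String> tokens(String sentence) {
        List<String> result = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Assumed heuristic: a type is mentioned if its name occurs, ignoring case and a plural "s".
    static boolean mentions(List<String> words, String typeName) {
        String name = typeName.toLowerCase(Locale.ROOT);
        return words.contains(name) || words.contains(name + "s");
    }

    // API types the sentence can be linked back to, in alphabetical order.
    static List<String> linkedTypes(String sentence, Set<String> apiTypes) {
        List<String> words = tokens(sentence);
        Set<String> linked = new TreeSet<>();
        for (String type : apiTypes) {
            if (mentions(words, type)) {
                linked.add(type);
            }
        }
        return new ArrayList<>(linked);
    }

    // Unordered word pattern: every required word occurs, in any order,
    // and the sentence mentions at least one API element.
    static boolean matches(String sentence, Set<String> requiredWords, Set<String> apiTypes) {
        return tokens(sentence).containsAll(requiredWords)
                && !linkedTypes(sentence, apiTypes).isEmpty();
    }

    public static void main(String[] args) {
        Set<String> pattern = new HashSet<>(Arrays.asList("is", "used"));
        Set<String> apiTypes = new HashSet<>(Arrays.asList("Executor", "Thread", "Pattern"));
        String sentence = "An Executor is normally used instead of explicitly creating threads.";
        if (matches(sentence, pattern, apiTypes)) {
            System.out.println(linkedTypes(sentence, apiTypes)); // [Executor, Thread]
        }
    }
}
```

With a table from type names to linked sentences, a tool only has to notice that Thread appears in the code being edited to know which sentence to show.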

As part of the research, we extracted 361 word patterns that could detect important programming knowledge.

We applied them to the entire Javadoc documentation of JDK 6 and discovered 14,597 fragments of information, which we called knowledge items.

In a study involving ten independent assessors, indispensable knowledge items recommended for API types were judged useful 57% of the time and potentially useful an additional 30% of the time.

So until all technical documentation is carefully marked up with metadata separating the different types of information, we can use text mining approaches to separate out common knowledge from specialized or archival information.