A flat (non-hierarchical) content presentation mostly assumes the reader wants to read everything sequentially.
This assumption is fine for suspenseful novels, but it crumples in the case of API documentation. Often, a programmer will be looking for very precise information to get a specific task done, and will have no need for extraneous details. Yet, API reference documentation pages typically collect a wild variety of information (for evidence of this, see a different blog post).
As one example, we can look at the Javadoc page of the Java regular-expression class. It contains:
Most programmers will only need to read Item 1 once in their life. Items 2 and 3 will be useful for anyone who has not used the API for several months. Some entries in Item 4 are indispensable for any regexp work, but some are more esoteric (if you have ever used a Greek block, please email me). Items 5 and 6 are clearly for more advanced usage scenarios, and Item 7 will be useful to readers working on compatibility concerns.
Ideally, content fulfilling such different information needs would be annotated or separated to enable more streamlined access. How can we disentangle all this?
For his M.Sc. thesis project, Yam Chhetri used a pattern-based technique to detect and extract fragments of API documentation likely to contain common and important information that all programmers should know. The information is extracted from Javadocs, and automatically recommended to programmers who are using a related type or method.
For example, a programmer using Thread would get the following recommendation (in an Eclipse view, for example):
An Executor is normally used instead of explicitly creating threads.
(Did you know this?)
The magic here is that, as part of the development of the approach, we mined the unordered word pattern (<API element>, "is", "used") as indicating important and common programming knowledge. We then applied this pattern to find instances such as the quote above. Finally, the information was linked back to Thread because it contains the name of the class. When the use of the type Thread is detected, we can volunteer the information.
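To make the idea concrete, here is a minimal sketch of unordered pattern matching and linking. It is not the thesis implementation: the function names, the tokenizer, and the naive plural handling ("threads" counts as a mention of Thread) are all assumptions made for illustration.

```python
import re

def words(sentence):
    """Lowercase word tokens of a sentence (illustrative tokenizer)."""
    return re.findall(r"[a-z]+", sentence.lower())

def matches(sentence, api_elements, required_words):
    """True if the sentence names a known API element and contains
    every required word, in any order."""
    tokens = set(words(sentence))
    has_element = any(e.lower() in tokens for e in api_elements)
    return has_element and all(w in tokens for w in required_words)

def linked_types(sentence, known_types):
    """Types whose name occurs in the sentence, either as-is or with a
    trailing 's' (an assumed, very naive plural rule)."""
    tokens = set(words(sentence))
    return [t for t in known_types
            if t.lower() in tokens or t.lower() + "s" in tokens]

KNOWN_TYPES = ["Executor", "Thread"]
sentence = "An Executor is normally used instead of explicitly creating threads."

# The pattern (<API element>, "is", "used") applied to the quote above.
is_item = matches(sentence, KNOWN_TYPES, ["is", "used"])
# Types the sentence could be recommended for.
targets = linked_types(sentence, KNOWN_TYPES)
```

With these assumptions, the quoted sentence matches the pattern and is linked to both Executor and Thread, so it could be volunteered to a programmer whose code uses Thread.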
As part of the research we extracted 361 word patterns that could detect important programming knowledge, including:
(<API element>, "lead")
(<API element>, "identify")
We applied them to the entire Javadoc of JDK 6 and discovered 14 597 fragments of information, which we called knowledge items.
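Scaling this up amounts to running every pattern over every sentence of the documentation and indexing the hits by the types they mention. The sketch below shows that loop on a toy corpus. The second document and the Widget type are invented for illustration, the sentence splitter is deliberately crude, and words are matched exactly with no stemming (so "leads" would not match "lead"), which may differ from what the actual tool does.

```python
import re
from collections import defaultdict

# Three of the word patterns mentioned in the text; the API-element
# slot is handled separately by looking for known type names.
PATTERNS = [("is", "used"), ("lead",), ("identify",)]

def sentences(text):
    """Split text after sentence-ending punctuation (crude splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_items(docs, known_types):
    """Map each type name to the sentences that mention it and match
    at least one word pattern."""
    index = defaultdict(list)
    for text in docs.values():
        for s in sentences(text):
            tokens = set(re.findall(r"[a-z]+", s.lower()))
            mentioned = [t for t in known_types if t.lower() in tokens]
            hit = any(all(w in tokens for w in p) for p in PATTERNS)
            if mentioned and hit:
                for t in mentioned:
                    index[t].append(s)
    return index

# Toy corpus: the Widget text is made up for this example.
docs = {
    "Executor": "An Executor is normally used instead of explicitly "
                "creating threads. Returns nothing.",
    "Widget": "Sharing a Widget can lead to errors. A Widget has a name.",
}
index = extract_items(docs, ["Executor", "Thread", "Widget"])
total_items = sum(len(v) for v in index.values())
```

Here only two of the four toy sentences survive as knowledge items. A recommender could then look up `index.get(type_name, [])` whenever it sees a type used in the editor.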
In a study involving ten independent assessors, indispensable knowledge items recommended for API types were judged useful 57% of the time and potentially useful an additional 30% of the time.
So, until all technical documentation is carefully marked up with metadata separating the different types of information, we can use text mining approaches to separate common knowledge from specialized or archival information.