Appendices and Datasets

This page contains additional resources related to the manuscript "Discovering Information Explaining API Types Using Text Classification", submitted to ICSE 2015 by Petrosyan, Robillard, and De Mori.

To improve the part-of-speech (POS) tags that the Stanford Parser assigns to technical concepts, we reimplemented a multi-word term detection algorithm and ran it on the Official Java Tutorials. We then selected the top-ranked phrases and forced the POS tagger to tag them as nouns.
List of multi-word phrases: multi_word_phrases.pdf
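
The tag-forcing step can be illustrated with a minimal sketch. It assumes the sentence is already available as (word, tag) pairs and that the phrase list has been loaded into memory; the function name `force_noun_tags`, the default tag `NN`, and the longest-match strategy are all hypothetical choices made for illustration, not the implementation used in the manuscript.

```python
def force_noun_tags(tagged_tokens, phrases, noun_tag="NN"):
    """Return a copy of tagged_tokens in which every token that falls
    inside a known multi-word phrase is retagged as a noun.

    tagged_tokens: list of (word, tag) pairs for one sentence.
    phrases: iterable of multi-word phrases, e.g. "event dispatch thread".
    noun_tag: tag to force (assumed here to be the Penn Treebank "NN").
    """
    # Index phrases as tuples of lowercase tokens for case-insensitive matching.
    phrase_tuples = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_tuples), default=0)

    words = [word.lower() for word, _ in tagged_tokens]
    result = list(tagged_tokens)  # shallow copy; the input list is not modified

    i = 0
    while i < len(words):
        matched = 0
        # Try the longest candidate first; phrases have at least two tokens.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrase_tuples:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                result[j] = (result[j][0], noun_tag)
            i += matched
        else:
            i += 1
    return result


if __name__ == "__main__":
    sentence = [("The", "DT"), ("event", "NN"), ("dispatch", "VB"),
                ("thread", "NN"), ("runs", "VBZ")]
    print(force_noun_tags(sentence, ["event dispatch thread"]))
```

In this sketch the override is applied to tags that already exist; in a real pipeline the forced tags would have to be supplied to the parser before parsing for them to influence the resulting dependencies.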

For our classification task we used dependency-based features. To find useful dependencies, we used the Java tutorials to extract 1785 dependencies in which either the governor or the dependent was a code-like term (CLT). After manual annotation, we calculated a z-score for each dependency and normalized it to use as the dependency's weight. Overall, the useful dependency instances mapped to 243 distinct typed dependencies and 39 distinct relations.
List of useful dependencies used: dependencies.csv
List of useful relation types used: relations.csv
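
This page does not state which statistic the z-score was computed over or how it was normalized, so the following is only one plausible reading, sketched under explicit assumptions: each dependency has a count of instances annotated as useful and a total count, the z-score is a one-proportion z-statistic against the overall proportion of useful instances, and normalization is min-max scaling to [0, 1]. The function name `dependency_weights` and the example keys are hypothetical.

```python
import math


def dependency_weights(counts):
    """Map each dependency to a weight in [0, 1].

    counts: dict mapping a dependency label to a pair
            (useful_instances, total_instances), with total_instances > 0.

    Assumptions (not taken from the manuscript): the z-score compares each
    dependency's proportion of useful instances with the overall proportion,
    and the weights are obtained by min-max normalization of the z-scores.
    """
    total_useful = sum(useful for useful, _ in counts.values())
    total = sum(n for _, n in counts.values())
    p0 = total_useful / total  # overall proportion of useful instances

    z_scores = {}
    for dep, (useful, n) in counts.items():
        standard_error = math.sqrt(p0 * (1 - p0) / n)
        # If every instance is useful (or none is), the z-score is undefined;
        # fall back to 0 so that all dependencies are treated alike.
        z_scores[dep] = (useful / n - p0) / standard_error if standard_error else 0.0

    lowest, highest = min(z_scores.values()), max(z_scores.values())
    span = highest - lowest
    return {dep: (z - lowest) / span if span else 0.0
            for dep, z in z_scores.items()}


if __name__ == "__main__":
    # Hypothetical annotation counts: (useful, total) per dependency.
    example = {"dobj(use, CLT)": (9, 10),
               "nsubj(CLT, be)": (5, 10),
               "prep_of(list, CLT)": (1, 10)}
    for dep, weight in dependency_weights(example).items():
        print(dep, round(weight, 3))
```

Other definitions are equally possible, for example standardizing raw frequencies against their mean and standard deviation; the weights actually used are the ones listed in dependencies.csv.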

To construct the training set for the classification task, we needed to manually annotate the tutorials. To ensure a high level of rigour in our annotation process, we wrote a detailed annotation guide.
Annotation Guide: annotation_guide.pdf

Studying how to discover tutorial sections relevant to API types requires a corpus of tutorials. We selected five tutorials covering four different Java APIs. Here are those five tutorials after pre-processing and annotation.
Annotated tutorials: JodaTime, Math, Col. Official, Col. Jenkov, Smack