26 June 2014 by Martin P. Robillard with Audris Mockus
Much effort is currently invested in increasing our understanding of software development by analyzing large data sets such as GitHub and StackOverflow (or their corporate equivalents). This type of effort is now known as Software Analytics (or, more generally, business analytics applied to software development). Unlike data collected as part of controlled experiments, the data analyzed in software analytics is typically produced for purposes other than research, so we have to be careful in our interpretation. Following a recent seminar, I sat down with Audris Mockus and we synthesized some of the discussions into a list of desirable practices for reporting on software analytics projects.
State the research question, but also explain how it impacts software engineering, what you expect the answer to be, and why.
Describe how the repository data was obtained. If the data collection process is opaque, it's not clear what the data will mean, if anything.
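As a purely illustrative sketch (not anyone's actual process), one way to make the collection transparent is to record the exact query, endpoint, and retrieval date alongside the data; the GitHub search query below is an assumption for the example.

```python
import datetime
import json

import requests

URL = "https://api.github.com/search/repositories"
QUERY = "language:java stars:>100"  # assumed selection criteria, for illustration only

response = requests.get(URL, params={"q": QUERY, "per_page": 100})
response.raise_for_status()
repositories = response.json()["items"]

# Log the provenance next to the data: endpoint, query, date, and result count.
provenance = {
    "endpoint": URL,
    "query": QUERY,
    "retrieved_on": datetime.date.today().isoformat(),
    "result_count": len(repositories),
}
with open("collection_log.json", "w") as log:
    json.dump(provenance, log, indent=2)
```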
Describe the missing relevant data. Repositories paint an incomplete picture of most phenomena of interest. What other data would complete the picture? Justify why it is acceptable to proceed without it.
Explicitly define the target population and
sample, along with their main attributes. Are you trying to
understand small or large projects? Are you making assumptions about
a programming technology? Argue why the sample is representative of
the population. For good guidelines on how to achieve this, check
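One hypothetical way to support such an argument is to compare the sample to the target population on a key attribute; the file names and the attribute (project size in KLOC) below are assumptions for illustration.

```python
import numpy as np
from scipy import stats

population_kloc = np.loadtxt("population_kloc.txt")  # assumed: one size per project
sample_kloc = np.loadtxt("sample_kloc.txt")

# Compare quartiles directly; large gaps suggest the sample is skewed.
for q in (25, 50, 75):
    print(f"{q}th percentile: population={np.percentile(population_kloc, q):.1f}, "
          f"sample={np.percentile(sample_kloc, q):.1f}")

# A two-sample Kolmogorov-Smirnov test gives a coarse check of distributional similarity.
statistic, p_value = stats.ks_2samp(population_kloc, sample_kloc)
print(f"KS statistic = {statistic:.3f}, p = {p_value:.3f}")
```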
Assume that all data sets have errors. Estimate
and report the error on all attributes. For example, if you are
taking into account the component flag in bug reports, what is the
estimated proportion of misclassified components? For a given
project, it could even be 100%.
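A minimal sketch of how such an error rate could be estimated and reported, assuming a manual audit of 100 randomly sampled bug reports (the numbers are placeholders, not real results):

```python
import math

# Assumed audit outcome: out of 100 randomly sampled bug reports, 14 had a
# component field that did not match the component determined by inspection.
sample_size = 100
misclassified = 14

p_hat = misclassified / sample_size
# Normal-approximation 95% confidence interval for the misclassification rate.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
print(f"Estimated misclassification rate: {p_hat:.1%} ± {margin:.1%}")
```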
Report distributions of all attributes that are used. This does not have to mean pages full of bar plots.
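For instance, a compact tabular summary can stand in for plots; the file and column names below are assumptions.

```python
import pandas as pd

# Assumed data set of projects with size, commit, and contributor attributes.
projects = pd.read_csv("projects.csv")
summary = projects[["kloc", "commits", "contributors"]].describe(
    percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]
)
print(summary.round(1))  # one row per statistic, one column per attribute
```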
Consider and discuss possible confounding factors. Typical ones include software artifact size, programming language, and calendar time.
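One hedged illustration of such a check: stratify by an obvious confounder and see whether an apparent relationship survives within each stratum. The file and column names are assumptions.

```python
import pandas as pd

# Assumed data set of modules with test coverage, defect counts, and language.
modules = pd.read_csv("modules.csv")

print("Overall correlation:",
      round(modules["coverage"].corr(modules["defects"]), 2))
# If the relationship vanishes within each language, language was confounding it.
for language, group in modules.groupby("language"):
    print(language, round(group["coverage"].corr(group["defects"]), 2))
```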
Check the values of baseline attributes before doing any complex modeling. Typical ones include the size and type of the project.
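A minimal sketch of such a baseline check, assuming defect counts and project size are available as columns of a data set:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed data set with one row per project.
projects = pd.read_csv("projects.csv")

# How much does size alone already explain?
baseline = smf.ols("defects ~ kloc", data=projects).fit()
print(f"Baseline R-squared (size only): {baseline.rsquared:.2f}")
# A more complex model is worth reporting only if it clearly improves on this.
```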
Ensure there is no major correlation among the
attributes you plan to use in the data analysis.
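A simple way to check this, assuming the candidate attributes are columns of a data frame (the column names are placeholders):

```python
import pandas as pd

projects = pd.read_csv("projects.csv")  # assumed columns listed below
predictors = projects[["kloc", "commits", "contributors", "age_days"]]

# Pairs with, say, |rho| > 0.8 are largely redundant; keep only one of each pair.
print(predictors.corr(method="spearman").round(2))
```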
Avoid automatic selection techniques, which can lead to over-fitting.
Use the simplest appropriate statistical techniques. If nothing else, this will make the research more widely accessible, understandable, verifiable, and replicable. For large data sets, p-values will practically be 0. Focus on effect sizes instead.
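A small simulated illustration of why: with very large samples even a negligible difference is "statistically significant", so an effect size such as Cohen's d is the more informative number. The data below are simulated, not from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated defect counts for two groups of 100,000 modules with a tiny true difference.
group_a = rng.poisson(3.00, 100_000)
group_b = rng.poisson(3.05, 100_000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: standardized mean difference, interpretable regardless of sample size.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd
print(f"p = {p_value:.1e}, Cohen's d = {cohens_d:.3f}")  # d < 0.2 is conventionally small
```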