A HUMAN STUDY OF PATCH MAINTAINABILITY
In this study of patch maintainability, the authors investigate the cost of fixing bugs manually versus using automatically generated patches, surveying repair techniques such as evolutionary search, dynamic modification, and program transformation. A key concern is whether machine-generated patches are as human-understandable and maintainable in the long term as human-written ones. The study asks how these notions of understandability and future maintainability can be measured concretely, whether machine-generated patches can be automatically augmented to improve maintainability, and how they compare to human-written patches in practice.
Presentation Transcript
A HUMAN STUDY OF PATCH MAINTAINABILITY
Zachary P. Fry, Bryan Landau, Westley Weimer
University of Virginia
{zpf5a,bal2ag,weimer}@virginia.edu
Bug Fixing
Fixing bugs manually is difficult and costly. Recent techniques explore automated patches:
- Evolutionary techniques: GenProg
- Dynamic modification: ClearView
- Enforcement of pre/post-conditions: AutoFix-E
- Program transformation via static analysis: AFix
While these techniques save developers time, there is some concern as to whether the patches produced are human-understandable and maintainable in the long run.
Questions Moving Forward
- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?
Measuring Quality and Maintainability
Functional quality:
- Does the implementation match the specification?
- Does the code execute correctly?
Non-functional quality:
- Is the code understandable to humans?
- How difficult is it to understand and alter the code in the future?
Software Functional Quality
Perfect: the implementation matches the specification.
Direct software quality metrics:
- Testing
- Defect density
- Mean time to failure
Indirect software quality metrics:
- Cyclomatic complexity
- Coupling and cohesion (CK metrics)
- Software readability
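As a rough illustration of one such indirect metric, the C sketch below approximates McCabe's cyclomatic complexity (one plus the number of decision points) by counting branching tokens in a source string. The helper names (count_occurrences, approx_cyclomatic_complexity) and the token-counting approach are illustrative assumptions, not the measurement apparatus used in the study; a real tool would use a proper parser.

#include <stdio.h>
#include <string.h>

/* Count non-overlapping occurrences of `word` in `src`.
   Naive substring count, for illustration only (it will also match
   keywords embedded in longer identifiers). */
static int count_occurrences(const char *src, const char *word) {
    int n = 0;
    size_t len = strlen(word);
    for (const char *p = strstr(src, word); p; p = strstr(p + len, word)) {
        n++;
    }
    return n;
}

/* Approximate cyclomatic complexity: one plus the number of decision points. */
static int approx_cyclomatic_complexity(const char *src) {
    const char *decisions[] = { "if", "for", "while", "case", "&&", "||", "?" };
    int total = 1;
    for (size_t i = 0; i < sizeof(decisions) / sizeof(decisions[0]); i++) {
        total += count_occurrences(src, decisions[i]);
    }
    return total;
}

int main(void) {
    const char *snippet =
        "if (len > s1_len - offset) { len = s1_len - offset; }\n"
        "for (i = 0; i < len; i++) { if (mask[i]) count++; }\n";
    printf("approximate cyclomatic complexity: %d\n",
           approx_cyclomatic_complexity(snippet));
    return 0;
}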
Software Non-functional Quality
Maintainability: human-centric factors affecting the ease with which bugs can be fixed and features can be added.
- Broadly related to the understandability of code
- Not easy to measure concretely with heuristics, unlike functional correctness
These automatically generated patches have been shown to be of high quality functionally; what about non-functionally?
Patch Maintainability Defined
Rather than using an approximation to measure understandability, we directly measure humans' ability to perform maintenance tasks.
Task: ask human participants questions that require them to read and understand a piece of code, and measure the effort required to provide correct answers.
This simulates the maintenance process as closely as possible.
PHP Bug #54454
Title: substr_compare incorrectly reports equality in some cases
Bug description: if main_str is shorter than str, substr_compare [mistakenly] checks only up to the length of main_str
substr_compare("cat", "catapult") = true
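The following minimal C sketch, not PHP's actual implementation, illustrates the reported behavior: if the comparison length is capped by the shorter main string alone, "cat" and "catapult" compare as equal. The function buggy_equal is a hypothetical stand-in for the faulty logic.

#include <stdio.h>
#include <string.h>

/* Buggy comparison in the spirit of the bug report: the comparison length is
   capped by main_str alone, so a short main_str can "equal" a longer str. */
static int buggy_equal(const char *main_str, const char *str) {
    size_t cmp_len = strlen(main_str);           /* only main_str's length */
    return memcmp(main_str, str, cmp_len) == 0;  /* 1 means "equal" */
}

int main(void) {
    /* Only the first 3 bytes of "catapult" are examined, so the call
       mistakenly reports equality, mirroring substr_compare("cat", "catapult"). */
    printf("buggy_equal(\"cat\", \"catapult\") = %d\n", buggy_equal("cat", "catapult"));

    /* A correct comparison distinguishes the two strings. */
    printf("strcmp says different: %d\n", strcmp("cat", "catapult") != 0);
    return 0;
}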
Motivating Example
if (offset >= s1_len) {
    php_error_docref(NULL TSRMLS_CC, E_WARNING, "The start position cannot exceed string length");
    RETURN_FALSE;
}
if (len > s1_len - offset) {
    len = s1_len - offset;
}
cmp_len = (uint) (len ? len : MAX(s2_len, (s1_len - offset)));
Motivating Example
len--;
if (mode & 2) {
    for (i = len - 1; i >= 0; i--) {
        if (mask[(unsigned char)c[i]]) {
            len--;
        } else {
            break;
        }
    }
}
if (return_value) {
    RETVAL_STRINGL(c, len, 1);
} else {
Automatic Documentation
Intuition suggests that patches augmented with documentation are more maintainable.
- Human patches can contain comments with hints as to the developer's intention when changing code.
- Automatic approaches cannot easily reason about why a change is made, but they can describe what was changed.
Automatically synthesized documentation: DeltaDoc (Buse et al., ASE 2010)
- Measures semantic program changes
- Outputs natural-language descriptions of changes
Automatic Documentation
if (!con->conditional_is_valid[dc->comp]) {
    if (con->conf.log_condition_handling) {
        TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
    }
    /* If not con->conditional_is_valid[dc->comp]
       No longer return COND_RESULT_UNSET; */
    return COND_RESULT_UNSET;
}
/* pass the rules */
switch (dc->comp) {
case COMP_HTTP_HOST: {
    char *ck_colon = NULL, *val_colon = NULL;
Questions Moving Forward
- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?
Evaluation
Focused research questions to answer:
1) How do different types of patches affect maintainability?
2) Which source code characteristics are predictive of our maintainability measurements?
3) Do participants' intuitions about maintainability and its causes agree with measured maintainability?
To answer these questions directly, we performed a human study using over 150 participants with real patches from existing systems.
Experiment - Subject Patches
We used patches from six benchmarks over a variety of subject domains:

Program     LOC        Defects  Patches
gzip        491,083    1        2
libtiff     77,258     7        14
lighttpd    61,528     3        4
php         1,046,421  9        17
python      407,917    1        2
wireshark   2,812,340  11       11
Total       4,896,547  32       50
Experiment - Subject Patches
- Original: the defective, un-patched code, used as a baseline for measuring relative changes
- Human-Accepted: human patches that have not been reverted to date
- Human-Reverted: human-created patches that were later reverted
- Machine: automatically generated patches created by the GenProg tool
- Machine+Doc: the same patches as above, but augmented with automatically synthesized documentation
Experiment - Maintenance Task
Sillito et al., "Questions programmers ask during software evolution tasks": recorded and categorized the questions developers actually asked while performing real maintenance tasks.
Example: "What is the value of the variable y on line X?"
Not: "Does this type have any siblings in the type hierarchy?"
Human Study
15 if (dc->prev) {
16     if (con->conf.log_condition_handling) {
17         log_error_write(srv, __FILE__, __LINE__, "sb", "go prev", dc->prev->key);
18     }
19     /* make sure prev is checked first */
20     config_check_cond_cached(srv, con, dc->prev);
21     /* one of prev set me to FALSE */
22     if (COND_RESULT_FALSE == con->cond_cache[dc->context_ndx].result) {
23         return COND_RESULT_FALSE;
24     }
25
26 }
27
28 if (!con->conditional_is_valid[dc->comp]) {
29     if (con->conf.log_condition_handling) {
30         TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
31     }
32
33     return COND_RESULT_UNSET;
34 }
Human Study - Question Presentation
Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33?
(Recall, you can use inequality symbols in your answer.)
Answer to the Question Above:
Human Study - Question Presentation
Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33?
(Recall, you can use inequality symbols in your answer.)
Answer to the Question Above: False
Evaluation Metrics
- Correctness: is the right answer reported?
- Time: what is the maintenance effort associated with understanding this code?
We favor correctness over time: participants were instructed to spend as much time as they deemed necessary to correctly answer the questions.
The percentages of correct answers over all types of patches were not different in a statistically significant way, so we focus on time, as it is an analog for the software engineering effort associated with program understanding.
Type of Patch vs. Maintainability
[Bar chart: percent time saved for correct answers when compared with the original code, by patch type: Human-Accepted, Machine, Human-Reverted, Machine+Doc]
Effort = average number of minutes it took participants to report a correct answer for all patches of a given type, relative to the original code
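A hedged sketch of how this effort measure could be computed from raw responses: average the minutes of correct answers per patch type, then report percent time saved relative to the Original code. The struct response layout, the helper mean_correct_time, and the minute values are illustrative assumptions, not the study's actual analysis scripts or data.

#include <stdio.h>

enum patch_type { ORIGINAL, HUMAN_ACCEPTED, HUMAN_REVERTED, MACHINE, MACHINE_DOC, NUM_TYPES };

struct response {
    enum patch_type type;
    int correct;      /* 1 if the participant answered correctly */
    double minutes;   /* time taken for this question */
};

/* Average minutes over correct answers only, per patch type. */
static void mean_correct_time(const struct response *r, int n, double mean[NUM_TYPES]) {
    double sum[NUM_TYPES] = {0};
    int    cnt[NUM_TYPES] = {0};
    for (int i = 0; i < n; i++) {
        if (r[i].correct) {
            sum[r[i].type] += r[i].minutes;
            cnt[r[i].type]++;
        }
    }
    for (int t = 0; t < NUM_TYPES; t++)
        mean[t] = cnt[t] ? sum[t] / cnt[t] : 0.0;
}

int main(void) {
    /* Hypothetical responses, for illustration only. */
    struct response data[] = {
        { ORIGINAL,       1, 6.0 }, { ORIGINAL,       1, 5.0 },
        { HUMAN_ACCEPTED, 1, 4.5 }, { HUMAN_ACCEPTED, 1, 5.5 },
        { MACHINE,        1, 6.5 }, { MACHINE,        0, 9.0 },
        { MACHINE_DOC,    1, 4.8 }, { HUMAN_REVERTED, 1, 6.6 },
    };
    const char *names[NUM_TYPES] = {
        "Original", "Human-Accepted", "Human-Reverted", "Machine", "Machine+Doc"
    };
    double mean[NUM_TYPES];
    mean_correct_time(data, (int)(sizeof data / sizeof data[0]), mean);

    /* Percent time saved for correct answers, relative to the Original code. */
    for (int t = HUMAN_ACCEPTED; t < NUM_TYPES; t++) {
        double saved = 100.0 * (mean[ORIGINAL] - mean[t]) / mean[ORIGINAL];
        printf("%-15s %+.1f%% time saved vs. Original\n", names[t], saved);
    }
    return 0;
}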
Characteristics of Maintainability
We measured various code features for all patches used in the human study.
Using a logistic regression model over these features, we can predict human accuracy when answering the questions in the study 73.16% of the time.
A Principal Component Analysis shows that 17 features account for 90% of the variance in the data: modeling maintainability is a complex problem.
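A minimal sketch of the prediction step of such a logistic regression model: a weighted sum of code-feature values is passed through the logistic function and thresholded to predict whether a participant answers correctly. The function predict_accuracy, the three chosen features, and all coefficient values are made-up illustrations; the real model was fit on the study's measured features.

#include <stdio.h>
#include <math.h>

#define NUM_FEATURES 3

/* Logistic (sigmoid) function. */
static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Predicted probability that a participant answers correctly, given
   code-feature values x and fitted coefficients w (w[0] is the intercept). */
static double predict_accuracy(const double w[NUM_FEATURES + 1], const double x[NUM_FEATURES]) {
    double z = w[0];
    for (int i = 0; i < NUM_FEATURES; i++)
        z += w[i + 1] * x[i];
    return sigmoid(z);
}

int main(void) {
    /* Hypothetical coefficients for three features: uses-per-assignment ratio,
       readability score, and out-of-scope variable ratio. */
    const double w[NUM_FEATURES + 1] = { 0.4, -0.8, 1.2, -0.9 };
    const double snippet[NUM_FEATURES] = { 2.1, 0.65, 0.3 };

    double p = predict_accuracy(w, snippet);
    printf("P(correct answer) = %.2f -> predicted %s\n",
           p, p >= 0.5 ? "correct" : "incorrect");
    return 0;
}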
Characteristics of Maintainability

Code Feature                                          Predictive Power
Ratio of variable uses per assignment                 0.178
Code readability                                      0.157
Ratio of variables declared out of scope vs. in scope 0.146
Number of total tokens                                0.097
Number of non-whitespace characters                   0.090
Number of macro uses                                  0.080
Average token length                                  0.078
Average line length                                   0.072
Number of conditionals                                0.070
Number of variable declarations or assignments        0.056
Maximum conditional clauses on any path               0.055
Number of blank lines                                 0.054
Human Intuition vs. Measurement
After completing the study, participants were asked to report which code features they thought increased maintainability the most.

Human-Reported Feature                    Votes  Predictive Power
Descriptive variable names                35     *0.000
Clear whitespace and indentation          25     *0.003
Presence of comments                      25     0.022
Shorter functions                         8      *0.000
Presence of nested conditionals           8      0.033
Presence of compiler directives / macros  7      0.080
Presence of global variables              5      0.146
Use of goto statements                    5      *0.000
Lack of conditional complexity            5      0.055
Uniform use and format of curly braces    5      0.014
Conclusions
From conducting a human study involving over 150 participants and patches fixing high-priority defects from real systems, we conclude:
- Humans take less time, on average, to answer questions about machine-generated patches with automated documentation than about human-created patches, which supports the use of automatic patch generation techniques in practice.
- There is a strong disparity between human intuitions about maintainability and our measurements, so we believe further study is merited in this area.
Questions?
Modified DeltaDoc
We modify DeltaDoc in the following ways:
- Include all changes, regardless of length of output
- Ignore all internal optimizations that lead to loss of information (e.g., ignoring suspected-unrelated statements)
- Include all relevant programmatic information (e.g., function arguments)
- Ignore all high-level output optimizations: favor comprehensive explanations over brevity
- Insert output directly above patches as comments
Experiment - Participants
Over 150 participants:
- 27 fourth-year undergraduate CS students
- 14 CS graduate students
- 116 Mechanical Turk internet participants
Accuracy cutoff imposed: ensuring people don't try to game the system requires special consideration. Any participant who failed to answer all questions, or who scored more than one standard deviation below the average undergraduate student's score, was removed.
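A sketch of that cutoff rule under illustrative numbers: compute the undergraduate mean and standard deviation, then drop any participant whose score falls more than one standard deviation below that mean. The helper mean_stddev and all score values are assumptions for illustration, not the study's data.

#include <stdio.h>
#include <math.h>

/* Mean and (population) standard deviation of an array of scores. */
static void mean_stddev(const double *x, int n, double *mean, double *sd) {
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++) sum += x[i];
    *mean = sum / n;
    for (int i = 0; i < n; i++) sq += (x[i] - *mean) * (x[i] - *mean);
    *sd = sqrt(sq / n);
}

int main(void) {
    /* Hypothetical scores (fraction of questions answered correctly). */
    double undergrad[] = { 0.80, 0.75, 0.90, 0.85, 0.70 };
    double turk[]      = { 0.95, 0.62, 0.88, 0.40, 0.79 };
    int n_ug = (int)(sizeof undergrad / sizeof undergrad[0]);
    int n_tk = (int)(sizeof turk / sizeof turk[0]);

    double mean, sd;
    mean_stddev(undergrad, n_ug, &mean, &sd);
    double cutoff = mean - sd;   /* one standard deviation below the UG average */

    printf("undergrad mean = %.2f, sd = %.2f, cutoff = %.2f\n", mean, sd, cutoff);
    for (int i = 0; i < n_tk; i++)
        printf("participant %d (score %.2f): %s\n", i, turk[i],
               turk[i] < cutoff ? "removed" : "kept");
    return 0;
}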
Experiment - Questions
- What conditions must hold to always reach line X during normal execution?
- What is the value of the variable y on line X?
- What conditions must be true for the function z() to be called on line X?
- At line X, which variables must be in scope?
- Given the following values for relevant variables, what lines are executed by beginning at line X? (e.g., Y=5 && Z=True)