Importance of Version Management in Spreadsheet Development

Explore why version management matters in spreadsheet development through the VEnron versioned spreadsheet corpus and the related evolution analysis. Learn how spreadsheet development resembles software engineering practice, and how spreadsheets are rarely maintained with traditional version control tools. Discover how managing spreadsheet versions can enhance collaboration and data integrity.

  • Version Management
  • Spreadsheet Development
  • Software Engineering
  • Collaboration
  • Data Integrity

Presentation Transcript


  1. ICSE 2016, Software Engineering in Practice. VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis. Wensheng Dou, Liang Xu, Shing-Chi Cheung, Chushu Gao, Jun Wei, Tao Huang

  2. A spreadsheet usage scenario. Service price in May: Alice enters data and writes formulas, then shares the spreadsheet with Bob.

  3. Create a spreadsheet for June. Starting from "Service price in May", Bob updates the data and modifies the formulas to produce "Service price in June", then shares it with Alice.

  4. Similar to software development! Developers and users retrieve, edit, and share.

  5. Version management is important. Software development: software code is well maintained by version control tools.

  6. Version management is missing. Spreadsheet development: however, spreadsheets are rarely maintained by version control tools such as SVN and Git.

  7. Do spreadsheet versions matter? [Figure: average lifetime of a spreadsheet; average number of users on a spreadsheet.] Source: F. Hermans, M. Pinzger, and A. van Deursen, ICSE '11

  8. Do spreadsheet versions matter? Find refactoring opportunities: the May version uses the constant 31 (31 days in May) and the June version uses 30 (30 days in June). Both 31 and 30 can be changed to a reference to a cell A1 that holds the number of days. (Real spreadsheets from VEnron.)

  9. Do spreadsheet versions matter? Find inconsistency: comparing similar versions exposes an inconsistency in which the formula is missing. (Real spreadsheets from VEnron.)

  10. Our goal: VEnron. Build an industrial-scale and publicly available spreadsheet evolution corpus. Facilitate future scientific studies on spreadsheet evolution.

  11. Why VEnron? Large spreadsheet corpora so far:
      • EUSES [1]: ~4,500 spreadsheets, obtained by searching on Google
      • FUSE [2]: ~250,000 spreadsheets, extracted from ~27 billion web pages
      • Enron [3]: ~15,000 spreadsheets, extracted from an email archive in the Enron corporation
      Independent spreadsheets! No version information!
      [1] M. Fisher and G. Rothermel, WEUSE '05
      [2] T. Barik, K. Lubick, J. Smith, J. Slankas, and E. Murphy-Hill, MSR '15
      [3] F. Hermans, M. Pinzger, and A. van Deursen, ICSE '15

  12. Why VEnron? The change history of spreadsheets is rarely documented: 90+ million end-user programmers work with no comment tracking or version control [1]. A few companies may use version management tools (e.g., SharePoint, SpreadGit, GitHub) for spreadsheets, but they are unwilling to share them due to business confidentiality.
      [1] P. Durusau and S. Hunting, The Markup Conference '15

  13. Version information is not available in spreadsheets! Why can we build VEnron?

  14. How are spreadsheets exchanged? Spreadsheet development: emails are a common way to exchange spreadsheets.

  15. Emails provide version information! Emails can (partially) provide the history of spreadsheets: sender, time, receivers, previous version, spreadsheet.

  16. How can we build VEnron?

  17. Why choose the Enron email archive? We choose the Enron email archive as our subject [1]. Publicly available: real emails and spreadsheets in the Enron corporation. Large: ~750,000 emails and ~15,000 unique spreadsheets.
      [1] http://info.nuix.com/Enron.html

  18. Approach overview: Enron emails → cluster spreadsheets → recover version orders → VEnron. Evolution group: a group of spreadsheets that originate from the same spreadsheet.

  19. How to cluster spreadsheets? Manually check ~15,000 spreadsheets and their related emails? Unclear semantics in these spreadsheets; unknown social network among users. Mission impossible!

  20. A viable way to cluster spreadsheets. Cluster spreadsheets into different groups according to spreadsheet filename similarity. Spreadsheets in an evolution group often have the same shortened filename after deleting version-related substrings.
      Version id: spreadsheet filename
      v1: May00_FOM_Req2.xls
      v2: Jun00_FOM_Req.xls
      v3: July00_FOM_Req.xls
      v4: July00_FOM_Req02.xls
      v5: Aug00_FOM_Req.xls
      Annotations on the filenames: ID for versions, date for versions, shortened names.
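
The filename shortening on slide 20 could be sketched roughly as below. This is only an illustration under stated assumptions: the regular expressions and the names `shorten` and `cluster` are hypothetical, since the slides do not say which version-related substrings were deleted or how.

```python
import re
from collections import defaultdict

# Hypothetical patterns for version-related substrings (assumption: a month
# name followed by a 2-4 digit year, and a trailing numeric version ID).
MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"
DATE = re.compile(MONTH + r"\d{2,4}", re.IGNORECASE)
TRAILING_ID = re.compile(r"\d+$")


def shorten(filename: str) -> str:
    """Delete the extension, date-like substrings, and a trailing numeric ID."""
    stem = filename.rsplit(".", 1)[0]
    stem = DATE.sub("", stem)
    stem = TRAILING_ID.sub("", stem)
    return stem.strip("_- ").lower()


def cluster(filenames):
    """Group filenames that share the same shortened name."""
    groups = defaultdict(list)
    for name in filenames:
        groups[shorten(name)].append(name)
    return dict(groups)


names = [
    "May00_FOM_Req2.xls",
    "Jun00_FOM_Req.xls",
    "July00_FOM_Req.xls",
    "July00_FOM_Req02.xls",
    "Aug00_FOM_Req.xls",
]
groups = cluster(names)
```

With the five filenames from the slide, all versions fall into one candidate group. Such simple patterns can also misfire (for example on a word that merely starts with a month abbreviation), which is consistent with the need for manual validation described on the next slide.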

  21. A viable way to cluster spreadsheets. The clustering approach may misjudge situations and wrongly cluster spreadsheets. We manually validated each group and adjusted it accordingly. Key idea: check whether all spreadsheets in a group share similar table structures. Similarity on worksheet names? Similarity on the structure of corresponding worksheets?
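
The authors validated groups manually. As a rough aid for that kind of check, worksheet-name similarity could be estimated with a Jaccard index, as in the sketch below. The function names, the input shape (a mapping from filename to worksheet names), and the 0.5 threshold are all assumptions made for illustration, not part of the VEnron approach as presented.

```python
def jaccard(a, b) -> float:
    """Jaccard index of two collections of worksheet names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suspicious_members(sheets_by_file, threshold=0.5):
    """Return files whose mean similarity to the rest of the group is low.

    sheets_by_file maps a filename to its list of worksheet names
    (hypothetical input; extracting the names is out of scope here).
    """
    flagged = []
    for name, sheets in sheets_by_file.items():
        others = [s for n, s in sheets_by_file.items() if n != name]
        if not others:
            continue
        mean = sum(jaccard(sheets, o) for o in others) / len(others)
        if mean < threshold:
            flagged.append(name)
    return flagged
```

Files flagged this way would only be candidates for a closer manual look; a low score does not prove that a spreadsheet belongs to a different evolution group.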

  22. Recover version orders (1). Manually recover version orders by using spreadsheet-related version information. Timestamps indicate version orders; they can be found in spreadsheet filenames, worksheet names, and spreadsheet tables.
      Version id: spreadsheet filename; worksheet names
      v1: May00_FOM_Req2.xls; FOM May Storage
      v2: Jun00_FOM_Req.xls; FOM Jun Storage
      v3: July00_FOM_Req.xls; FOM Jul Storage
      v4: July00_FOM_Req02.xls; FOM Jul Storage
      v5: Aug00_FOM_Req.xls; FOM Aug Storage

  23. Recover version orders (2). The email sending history: the spreadsheets exchanged between Alice and Bob belong to the same evolution group.

  24. Recover version orders (3). Email contents may clearly describe where the current spreadsheet comes from. Example: "The attached file is an update of CEF FOM June'00 request that was transmitted on 5/23/00" (this identifies the previous version).

  25. Recover version orders (4). Timestamps, email sending history, and email contents are combined into a total order: v1, v2, v3, v4.
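
One way to picture how the three kinds of evidence combine into a single order is a topological sort over "earlier than" pairs. The sketch below uses Python's standard `graphlib` module; the function `total_order` and the representation of evidence as (earlier, later) pairs are assumptions for illustration, whereas the slides describe a manual recovery process.

```python
from graphlib import TopologicalSorter


def total_order(versions, *evidence):
    """Order versions so that every (earlier, later) pair is respected.

    Each item in evidence is an iterable of (earlier, later) pairs, e.g. one
    from timestamps, one from email sending history, one from email contents.
    Raises graphlib.CycleError if the evidence is contradictory.
    """
    sorter = TopologicalSorter()
    for version in versions:
        sorter.add(version)
    for pairs in evidence:
        for earlier, later in pairs:
            sorter.add(later, earlier)  # 'earlier' must precede 'later'
    return list(sorter.static_order())


from_timestamps = [("v1", "v2")]
from_sending_history = [("v2", "v3")]
from_email_contents = [("v3", "v4")]
order = total_order(
    ["v1", "v2", "v3", "v4"],
    from_timestamps,
    from_sending_history,
    from_email_contents,
)
```

When the pairs do not determine a unique order, a topological sort returns just one of the valid orders, so the remaining ambiguity would still need human judgment.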

  26. How about VEnron?

  27. Statistics of evolution groups. 360 evolution groups; 7,294 spreadsheets and 35,373 worksheets in total. 83% of groups have more than 5 versions. [Chart axes: #group, #spreadsheet.]

  28. Statistics (more). 78% of groups are maintained by more than 1 user. [Chart axes: #group, #user.]

  29. Statistics (more). Count the number of Excel built-in errors, e.g., #DIV/0!, #N/A, etc. 16.9% of groups introduce new errors during evolution. [Chart axes: #error, #version id.] Find more statistics in the paper.
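
Counting built-in errors per version could look like the sketch below. It assumes the cell values of each version have already been extracted into plain Python lists, and it treats "introduces new errors" as an increase in the error count between consecutive versions. Both are simplifying assumptions; the slides do not give the exact definition used in the paper.

```python
# The seven classic Excel error values.
EXCEL_ERRORS = {"#DIV/0!", "#N/A", "#NAME?", "#NULL!", "#NUM!", "#REF!", "#VALUE!"}


def count_errors(cells) -> int:
    """Count cell values that are Excel built-in error strings."""
    return sum(
        1
        for value in cells
        if isinstance(value, str) and value.strip().upper() in EXCEL_ERRORS
    )


def introduces_new_errors(versions) -> bool:
    """True if any version has more errors than the version before it.

    versions is a list of cell-value lists, ordered from oldest to newest
    (hypothetical input format).
    """
    counts = [count_errors(cells) for cells in versions]
    return any(later > earlier for earlier, later in zip(counts, counts[1:]))
```

A count-based comparison cannot tell whether one error was fixed while another appeared in the same version, so a cell-by-cell comparison would be needed for a more precise measure.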

  30. Takeaway. VEnron: the first industrial-scale and publicly available spreadsheet evolution corpus. 360 evolution groups, including 7,294 spreadsheets and 35,373 worksheets. Initial statistics (evidence) have shown that VEnron contains interesting evolution and could be a basis for future studies on spreadsheet evolution. Have a try! http://sccpu2.cse.ust.hk/venron/

  31. THANK YOU!
