Insights into Open Source Collaboration and Community Dynamics


Explore insightful visual data and research findings on open-source projects, community collaboration, and project health. Discover trends in project development, user characterizations, and software engineering behaviors. Gain valuable business and research insights to enhance your projects and communities.

  • Open Source
  • Collaboration
  • Community Dynamics
  • Software Engineering
  • Project Health

Presentation Transcript


  1. GitHub Insights: Understanding Open Source. @jeffmcaffer, Microsoft; Georgios Gousios, Delft University of Technology (TU Delft); Kevin Lewis, Microsoft

  2. Snapshot overview

  3. Inspire confidence

  4. How open is a project? % commits from project community vs core team; % comments and commenters; % forks contributing; PR lifelines (http://ghtorrent.org/pullreq-perf/)
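
The first metric on this slide, the share of commits coming from the community rather than the core team, can be illustrated with a minimal Python sketch. The function name `community_share`, the author names, and the core-team set are all made up for illustration; the slides do not say how the real analysis decides who counts as core team.

```python
from collections import Counter


def community_share(commit_authors, core_team):
    """Return the percentage of commits authored outside the core team.

    commit_authors: one author login per commit (hypothetical input shape).
    core_team: set of logins treated as core team (an assumption; the
    slides do not define how core membership is determined).
    """
    if not commit_authors:
        return 0.0
    counts = Counter(
        "core" if author in core_team else "community"
        for author in commit_authors
    )
    return 100.0 * counts["community"] / len(commit_authors)


# Made-up sample data: five commits, two of them from non-core authors.
authors = ["alice", "bob", "carol", "alice", "dave"]
core = {"alice", "bob"}
print(f"{community_share(authors, core):.1f}% of commits come from the community")
```

The same counting approach would apply to the comment and commenter percentages, given a list of comment authors instead of commit authors.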

  5. Commits (core vs community)

  6. Commits (origin)

  7. Comments (core vs community)

  8. PR lifelines

  9. Are we using git in a distributed way?

  10. How many devs are there per country?

  11. Insights

  12. Business insights. Project health: ours (are we building a good community?); yours (is this a good project to use/invest in?). Product adoption: trends for products, APIs, technologies; sample/tutorial effectiveness. API health and evolution: How are people using our API? Are they using it right? Can it be improved? How many people would a change break?

  13. Research insights. User characterization: What is a Python developer? Is there such a thing? What else do they use? Software engineering research: project behavior; collaborative development approaches (ICSE); code evolution; biases (gender, location); automated code review

  14. Cross-domain insights. Mix this data with: social media data; StackOverflow questions; Slack conversations; sentiment analysis; customer satisfaction data. Get a more holistic view.

  15. Operational insights. Repository management: linting (license, readme, contributing); approvals (controlling public access); cataloging (giving structured access to an inherently unstructured world); reporting (compliance); security (API keys, etc.). User management: 2FA, de-provisioning, settings; org/team membership, CLAs. Multi-org insights.

  16. Approach Data for the masses

  17. GitHub by the numbers (mid-2016): 14 million users; 35 million repos (~half private); ~1 million events per day

  18. Approach: @GHTorrent (http://ghtorrent.org). Software engineering research project; open collaboration with 100s of users and researchers. Archive of ALL public events; complete capture of entities related to each event. Enable the analytics that people need; apply Big Data techniques; visualizations.

  19. How does it work? Events or webhooks (http://api.github.com/events). For each event, walk the JSON recursively; store results in MongoDB tables by entity type; remember relationships in MySQL. Periodically revisit entities to update: handle missed events; handle absent events; deal with deletes and updates (GH only emits create events).
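
The "walk the JSON recursively" step can be illustrated with a small Python sketch. It is not GHTorrent's own code: the function names `collect_api_urls` and `entity_type`, the rule for deriving an entity type from a URL path, and the trimmed event literal are assumptions made for this example. The sketch only finds the linked entities and guesses their type; fetching them and storing them in MongoDB and MySQL is left out.

```python
def collect_api_urls(node, found=None):
    """Walk a decoded JSON value and collect every string stored under a "url" key."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                found.append(value)
            else:
                collect_api_urls(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_api_urls(item, found)
    return found


def entity_type(url):
    """Guess an entity type from an API URL path (hypothetical rule for this sketch)."""
    parts = url.split("api.github.com/")[-1].split("/")
    # URLs like repos/<owner>/<name>/commits/<sha> name a sub-entity of a repo.
    if parts[0] == "repos" and len(parts) > 3:
        return parts[3]
    return parts[0]


# Trimmed-down event, loosely following the condensed example on the next slide.
event = {
    "id": "3820922721",
    "type": "PushEvent",
    "actor": {"id": 1054639, "url": "https://api.github.com/users/Cephei"},
    "repo": {"id": 44221657, "url": "https://api.github.com/repos/PowerDMS/Owin.Scim"},
    "payload": {
        "push_id": 1042059448,
        "commits": [{
            "url": "https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b309d"
        }],
    },
    "org": {"id": 2522349, "url": "https://api.github.com/orgs/PowerDMS"},
}

for url in collect_api_urls(event):
    print(entity_type(url), url)
```

Each collected URL could then be fetched and the response stored under its entity type, which is the part of the pipeline this sketch omits.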

  20. Example event (condensed): { "id": "3820922721", "type": "PushEvent", "actor": { "id": 1054639, "url": "https://api.github.com/users/Cephei" }, "repo": { "id": 44221657, "url": "https://api.github.com/repos/PowerDMS/Owin.Scim" }, "payload": { "push_id": 1042059448, "commits": [{ "url": "https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b309d" }] }, "org": { "id": 2522349, "url": "https://api.github.com/orgs/PowerDMS" } }

  21. Entities: commits, commit comments, events, followers, forks, issues, issue comments, issue events, orgs, org members, pull request comments, pull requests, repo collaborators, repo labels, repos, users, watchers

  22. GHTorrent architecture [diagram; labels: GitHub API, Event Retrieval, Events Queue (evt.commit, evt.watch, evt.fork), Data Retrieval, Commits Queue, Project Queue, Mirroring Cluster (Events, Projects, Commits)]

  23. GHTorrent by the numbers: data from Feb 2012 to present; ~5B event rows in MySQL; ~10TB of entity data in MongoDB; growing by 10GB per day

  24. Using the data You can do it too!

  25. Using the data: Hosted (http://ghtorrent.org). Online: query live MySQL and MongoDB; convenient, nothing to get or install; great for point investigations; 100-second query limit

  26. Using the data: Download. Get MySQL and MongoDB dumps from ghtorrent.org; run your own database servers; full control and power to query as needed

  27. Using the data: Self-service. Install and configure your own GHTorrent. Seed with existing data or start fresh; seeding can take a while. Use webhooks instead of events; enable tracking of your private repos (https://github.com/ghtorrent/ghtorrent-webhook). Need to get API key sets to avoid throttling.
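
The slide does not say how a set of API keys is used. One plausible reading is that requests are spread across several keys in turn. The sketch below shows that idea in Python under that assumption; the `TokenPool` class and the placeholder token strings are hypothetical and are not part of GHTorrent. It builds only the `Authorization` header and makes no network calls.

```python
from itertools import cycle


class TokenPool:
    """Hand out API tokens in round-robin order (hypothetical helper)."""

    def __init__(self, tokens):
        tokens = list(tokens)
        if not tokens:
            raise ValueError("at least one token is required")
        self._tokens = cycle(tokens)

    def next_headers(self):
        """Return request headers carrying the next token in the rotation."""
        return {"Authorization": f"token {next(self._tokens)}"}


# Placeholder values; real tokens would come from your own configuration.
pool = TokenPool(["TOKEN_A", "TOKEN_B"])
for _ in range(3):
    print(pool.next_headers())
```

A fuller version would also track each key's remaining quota and skip keys that are exhausted, rather than rotating blindly.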

  28. Using the data: Azure Data Lake. Scale out in Azure. Data Lake Store: ghtorrent.org data pumped into Data Lake Storage; exposes a WebHDFS access layer. Compute: Data Lake Analytics and query using U-SQL; Hadoop; Spark. https://github.com/Microsoft/ghinsights

  29. Resources: http://ghtorrent.org; https://github.com/Microsoft/ghinsights; @gousiosg, @jeffmcaffer, @kelewis
