Scalability Bugs in Large-Scale Software Systems

Scalability bugs are latent faults that manifest in large-scale deployments, typically with over 100 nodes. These bugs remain hidden in smaller deployments but surface as scale-dependent symptoms in larger systems. The complexity and impact of such bugs become evident only at a substantial scale, posing challenges for system reliability and performance.

Presentation Transcript


  1. Understanding Scalability Bugs in Large-Scale Software Systems. Bogdan Bo Stoica, 18 November 2024.

  4. Joint work with Prof. Haryadi Gunawi, Prof. Shan Lu, Prof. Kexin Pei, Jun Yang, Wordyka Nainggolan, Zahra Maharani, Marcellino Gaol, and Natanael Siregar.

  11. An example: Cassandra-12281. [Plot: time vs. # nodes, annotated "1 node bootstrapping = 1 min" and "1 node bootstrapping = 30 min".] Root cause: the gossip protocol has cubic (!!) complexity. Observation: this happens only at a large enough scale.
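
To make the effect of cubic complexity concrete, here is a minimal sketch that counts the steps of a hypothetical triple-nested loop over the nodes of a cluster. The class and method names are illustrative assumptions; this is not Cassandra's gossip code, only a stand-in for any O(N^3) routine.

```java
// Illustrative only: a hypothetical O(N^3) routine, not Cassandra's actual gossip code.
class CubicGrowth {
    // Counts the basic steps of a triple-nested loop over `nodes` nodes.
    static long steps(int nodes) {
        long count = 0;
        for (int i = 0; i < nodes; i++)
            for (int j = 0; j < nodes; j++)
                for (int k = 0; k < nodes; k++)
                    count++;
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical cluster sizes chosen only to show the growth trend.
        int[] clusterSizes = {8, 32, 128, 256};
        for (int n : clusterSizes) {
            System.out.printf("%4d nodes -> %,d steps%n", n, steps(n));
        }
    }
}
```

Growing the cluster 8x (from 32 to 256 nodes) multiplies the step count by 8^3 = 512, which is one way a cost that looks negligible in a small test cluster can dominate at scale.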

  12. What are scalability bugs? Definition: latent faults that are scale dependent and whose symptoms surface in large-scale deployments (e.g., >100 nodes) but are unlikely to surface in small/medium-scale deployments (e.g., <100 nodes).

  18. Scalability bugs data set: 350+ bug reports; 10 open-source systems; reported in the last 10+ years; confirmed and fixed.

  25. How do scalability bugs happen? Load (e.g., # requests); Data (e.g., large workload); Cluster (e.g., # nodes); Fail (e.g., # internal failures). Also noted on the slide: cancellation bugs [OSDI22] and retry bugs [SOSP24].

  28. Challenges in detecting scalability bugs: they are difficult to test in-house; they require workloads, not just unit tests; how do we judge that a scalability bug has happened?

  34. Scalability bug reproducibility efforts: of the 75 most recent reports, 48 reproduced and 27 unsuccessful. Issues listed: incomplete reports; resource mismatch; missing dependencies; poor test coverage. Slide annotation: traditional program analysis + LLMs.

  41. SOTA: test generation for correctness bugs. Iterative prompting with feedback metrics (typically, coverage).
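
Iterative prompting with a feedback metric can be pictured as a generate, measure, refine loop. The sketch below is an assumption about the general shape of such a loop, not the pipeline from the talk: `generator` stands in for an LLM call, `metric` stands in for a coverage tool, and `FeedbackLoop`, `refine`, `target`, and `maxRounds` are hypothetical names.

```java
import java.util.function.BiFunction;
import java.util.function.ToDoubleFunction;

// Sketch of a generate-measure-refine loop. The generator is a stand-in for an LLM call
// and the metric is a stand-in for a coverage tool; neither is a real API.
class FeedbackLoop {
    static String refine(BiFunction<String, Double, String> generator,
                         ToDoubleFunction<String> metric,
                         double target, int maxRounds) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        String candidate = generator.apply(null, 0.0);      // first attempt, no feedback yet
        for (int round = 0; round < maxRounds; round++) {
            double score = metric.applyAsDouble(candidate); // measure the candidate test
            if (score > bestScore) { best = candidate; bestScore = score; }
            if (bestScore >= target) break;                 // good enough, stop prompting
            candidate = generator.apply(best, bestScore);   // feed the score back
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy stand-ins: each round appends one more "call"; the score is calls / 5.
        BiFunction<String, Double, String> toyGenerator =
            (prev, score) -> prev == null ? "call;" : prev + "call;";
        ToDoubleFunction<String> toyMetric = test -> test.split(";").length / 5.0;
        System.out.println(refine(toyGenerator, toyMetric, 0.8, 10));
    }
}
```

Keeping the metric behind an interface means the same loop could, in principle, be driven by something other than coverage, which is the question the following slides raise for scalability bugs.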

  43. Metric for correctness bugs. HDFS/DFSInputStream.java, annotated with line coverage:

        bool createBlockReader():
          while (true) {                                   /* covered */
            try {
              block = refreshBlock(block);                 /* covered */
              dnInfo = getDNInfFor(block);                 /* covered */
              if (dnInfo == null)
                break;                                     /* covered */
            }
            catch (IOException e) {
              LOG("Failed to connect to " + dnInfo.addr
                  + " Retried " + ++retryCount + " times"); /* covered */
              addToDeadNodes(dnInfo.info);                 /* covered */
            }
          }

  44. Metric for scalability bugs? HDFS/DFSInputStream.java, annotated with line coverage:

        bool createBlockReader():
          while (true) {                                   /* covered */
            try {
              block = refreshBlock(block);                 /* covered */
              dnInfo = getDNInfFor(block);                 /* covered */
              if (dnInfo == null)
                break;                                     /* covered */
            }
            catch (IOException e) {
              LOG("Failed to connect to " + dnInfo.addr
                  + " Retried " + ++retryCount + " times"); /* covered */
              addToDeadNodes(dnInfo.info);                 /* covered */
            }
          }

  45. Metric for scalability bugs? HDFS/DFSInputStream.java, annotated with execution counts:

        bool createBlockReader():
          while (true) {                                   /* 100 times */
            try {
              block = refreshBlock(block);                 /* 100 times */
              dnInfo = getDNInfFor(block);                 /* 100 times */
              if (dnInfo == null)
                break;                                     /* 15 times */
            }
            catch (IOException e) {
              LOG("Failed to connect to " + dnInfo.addr
                  + " Retried " + ++retryCount + " times"); /* 3 times */
              addToDeadNodes(dnInfo.info);                 /* 3 times */
            }
          }
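
The contrast between "covered" and "N times" can be sketched with a small hit counter. This is a hypothetical illustration, not an HDFS or coverage-tool API: `HitCounter`, its probe names, and the simplified retry loop are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical probe recorder; names are illustrative, not an HDFS or coverage-tool API.
class HitCounter {
    private final Map<String, Long> hits = new LinkedHashMap<>();

    void hit(String probe) { hits.merge(probe, 1L, Long::sum); }

    long count(String probe) { return hits.getOrDefault(probe, 0L); }

    // Coverage collapses the count into a single bit.
    boolean covered(String probe) { return count(probe) > 0; }

    // Simplified stand-in for a retry loop: the first `failures` attempts fail.
    static HitCounter runRetryLoop(int failures) {
        HitCounter c = new HitCounter();
        int attempt = 0;
        while (true) {
            c.hit("loop");
            if (attempt++ < failures) {
                c.hit("retry");   // corresponds to the catch branch in the listing
                continue;
            }
            c.hit("success");
            break;
        }
        return c;
    }

    public static void main(String[] args) {
        for (int failures : new int[] {1, 1000}) {
            HitCounter c = runRetryLoop(failures);
            System.out.println("failures=" + failures
                + " covered(retry)=" + c.covered("retry")
                + " count(retry)=" + c.count("retry"));
        }
    }
}
```

Both runs report the retry branch as covered, so coverage cannot tell them apart; only the execution count shows that one run retried a thousand times more often than the other.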

  48. LLM-backed stress test generation pipeline. Metrics listed: execution frequency; instruction count; object allocation frequency; queue contention.

  50. Naïve prompting: ask the LLM to synthesize a stress test.
