
Google File System Design and Architecture Overview
Explore the design and architecture of the Google File System (GFS) as presented by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung in the Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. The system's interactions, fault tolerance, measurements, and conclusions are discussed, providing insights into its scalability, reliability, and availability in managing huge files efficiently.
Presentation Transcript
The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003. Presenter: Yen-Yu Chen. Date: June 4, 2024.
Contents: 1. Introduction 2. Design 3. System Interactions 4. Master Operation 5. Fault Tolerance and Diagnosis 6. Measurements 7. Conclusions
1. Introduction
Background: Like previous distributed file systems, GFS targets performance, scalability, reliability, and availability.
Different points in the design space: Component failures are the norm rather than the exception. Files are huge by traditional standards. Most files are mutated by appending new data rather than overwriting. Sustained bandwidth is more critical than low latency.
2. Design
Interface: A familiar file-system interface - create, delete, open, close, read, write - plus snapshot and atomic record append.
Architecture
Architecture - Chunks: Files are divided into fixed-size 64 MB chunks, much larger than typical file-system block sizes. Advantages of a large chunk size: it reduces interaction between client and master, a client can perform many operations on a given chunk, and it reduces the size of the metadata stored on the master.
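A minimal sketch, assuming only the 64 MB chunk size stated above, of how a byte offset in a file maps to a chunk index and an offset within that chunk (the function name is illustrative, not from GFS code):

```python
# Minimal sketch: translate a byte offset into (chunk_index, offset_within_chunk),
# assuming the fixed 64 MB chunk size described in the paper.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def locate(offset: int) -> tuple[int, int]:
    """Return (chunk_index, offset_within_chunk) for a byte offset."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 150,000,000 falls in chunk 2 (the third chunk).
print(locate(150_000_000))  # (2, 15782272)
```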
GFS chunkservers: Store chunks on local disks as Linux files. Read/write chunk data specified by a chunk handle and byte range.
GFS master: Maintains all file-system metadata in memory - namespace, access-control information, chunk locations, and lease management.
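A rough, hypothetical sketch of the kinds of in-memory metadata such a master keeps; the class and field names are illustrative, not taken from the paper or its code:

```python
# Hypothetical sketch of GFS-style master metadata (names are invented).
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                                           # globally unique chunk handle
    version: int = 0                                      # chunk version number
    locations: list[str] = field(default_factory=list)    # chunkserver addresses
    primary: str | None = None                            # replica currently holding the lease
    lease_expiry: float = 0.0                             # lease expiration time (seconds)

@dataclass
class FileInfo:
    chunks: list[int] = field(default_factory=list)       # chunk handles, in order
    acl: set[str] = field(default_factory=set)            # access-control information

class MasterMetadata:
    def __init__(self) -> None:
        self.namespace: dict[str, FileInfo] = {}          # full pathname -> file metadata
        self.chunks: dict[int, ChunkInfo] = {}            # chunk handle -> chunk metadata
```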
GFS client: Linked into each application. Implements the file-system API and communicates with the master and chunkservers.
Process (Read)
Process: Metadata flows between client and master only; file data flows between client and chunkservers only.
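A small, self-contained sketch of that read path; all classes and method names here are toy stand-ins, not the real GFS API. The client sends one small metadata request to the master, then transfers data directly from a chunkserver replica:

```python
# Toy model of the GFS read path: metadata from the master, data from a chunkserver.
CHUNK_SIZE = 64 * 1024 * 1024

class Chunkserver:
    def __init__(self):
        self.chunks: dict[int, bytes] = {}                 # chunk handle -> chunk data
    def read(self, handle: int, offset: int, length: int) -> bytes:
        return self.chunks[handle][offset:offset + length]

class Master:
    def __init__(self):
        self.files: dict[str, list[int]] = {}              # filename -> chunk handles
        self.locations: dict[int, list[Chunkserver]] = {}  # handle -> replica servers
    def lookup(self, filename: str, chunk_index: int):
        handle = self.files[filename][chunk_index]
        return handle, self.locations[handle]

def gfs_read(master: Master, filename: str, offset: int, length: int) -> bytes:
    # Metadata only: one small request to the master.
    handle, replicas = master.lookup(filename, offset // CHUNK_SIZE)
    # Data only: bulk transfer directly from a chunkserver, not via the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)
```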
3. System Interactions
Lease: Objective is to minimize load on the master. The master grants a lease to one replica, called the primary chunkserver.
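A toy sketch of lease granting so that at most one replica acts as primary at a time; the 60-second lease term is from the paper, while the class and method names are invented:

```python
# Toy lease table: grant or renew the primary lease for a chunk.
import time

LEASE_SECONDS = 60                                         # initial lease timeout in the paper

class LeaseTable:
    def __init__(self):
        self.primary: dict[int, tuple[str, float]] = {}    # handle -> (replica, expiry)

    def grant(self, handle: int, replica: str) -> bool:
        holder = self.primary.get(handle)
        if holder and holder[1] > time.time():
            return False                                   # an unexpired lease already exists
        self.primary[handle] = (replica, time.time() + LEASE_SECONDS)
        return True

    def renew(self, handle: int, replica: str) -> bool:
        holder = self.primary.get(handle)
        if holder and holder[0] == replica:                # only the current primary may extend
            self.primary[handle] = (replica, time.time() + LEASE_SECONDS)
            return True
        return False
```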
Dataflow: Write steps.
Atomic Record Appends: GFS appends the data to the file at least once atomically, at an offset of GFS's choosing; a record may therefore appear more than once, and clients must tolerate duplicates.
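An illustrative, in-memory sketch (not the real protocol) of why "at least once" can leave duplicate records: the client retries a failed append, and a replica that already applied the first attempt stores the record twice. Readers are expected to tolerate such duplicates:

```python
# Toy model of at-least-once record append with client retries.
import random

class Chunk:
    def __init__(self):
        self.records: list[bytes] = []

    def record_append(self, data: bytes) -> int:
        """Append data at an offset GFS chooses; may fail after applying."""
        self.records.append(data)
        offset = len(self.records) - 1
        if random.random() < 0.3:            # simulate a lost reply / failed replica
            raise TimeoutError("append reply lost")
        return offset

def append_at_least_once(chunk: Chunk, data: bytes) -> int:
    while True:                              # client retries until it sees a success
        try:
            return chunk.record_append(data)
        except TimeoutError:
            continue                         # an earlier attempt may already be stored

chunk = Chunk()
append_at_least_once(chunk, b"record-A")
print(chunk.records)                         # b"record-A" may appear more than once
```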
Snapshot: Goals are to quickly create branch copies of huge data sets and to easily checkpoint the current state. Copy-on-write technique: metadata for the source file or directory tree is duplicated, reference counts for the chunks are incremented, and chunks are copied later, at the first write.
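A hand-wavy sketch, using invented structures, of snapshot via copy-on-write: snapshotting only duplicates metadata and bumps chunk reference counts; a chunk is physically copied the first time someone writes to it:

```python
# Toy copy-on-write snapshot: metadata duplicated now, chunks copied on first write.
from collections import Counter

files: dict[str, list[int]] = {"/data/log": [1, 2, 3]}   # filename -> chunk handles
refcount: Counter = Counter({1: 1, 2: 1, 3: 1})          # chunk handle -> references
next_handle = 4

def snapshot(src: str, dst: str) -> None:
    files[dst] = list(files[src])            # duplicate metadata only
    for h in files[dst]:
        refcount[h] += 1                     # chunks are now shared

def write(path: str, chunk_index: int, data: bytes) -> None:
    global next_handle
    h = files[path][chunk_index]
    if refcount[h] > 1:                      # shared chunk: copy it before writing
        refcount[h] -= 1
        new_h = next_handle; next_handle += 1
        refcount[new_h] = 1
        files[path][chunk_index] = new_h     # (copying the chunk data itself omitted)
    # ... apply the write to the (possibly new) chunk ...

snapshot("/data/log", "/data/log.snap")
write("/data/log", 0, b"new bytes")          # triggers copy-on-write of chunk 1
```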
4. Master Operation
Namespace Management and Locking
Creation, Re-replication, Rebalancing: Replicate chunks that do not have a sufficient number of copies. Prioritize replicating frequently accessed chunks. Prioritize replicating chunks that have become bottlenecks.
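An illustrative sketch of how re-replication work could be ordered; the scoring here is a toy, not the paper's exact policy, but it captures the idea that chunks farther below their replication goal and chunks currently blocking clients come first:

```python
# Toy priority ordering for re-replication candidates.
import heapq

def priority(goal: int, live_replicas: int, blocking_client: bool) -> int:
    missing = goal - live_replicas
    return missing * 10 + (5 if blocking_client else 0)    # higher = more urgent

def plan(chunks: list[dict]) -> list[int]:
    heap = [(-priority(c["goal"], c["live"], c["blocking"]), c["handle"]) for c in chunks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

chunks = [
    {"handle": 1, "goal": 3, "live": 2, "blocking": False},
    {"handle": 2, "goal": 3, "live": 1, "blocking": True},   # most urgent
    {"handle": 3, "goal": 3, "live": 2, "blocking": True},
]
print(plan(chunks))   # [2, 3, 1]
```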
Garbage Collection: Deleted files are first renamed to a hidden name and may be removed later. Orphaned chunks (chunks unreachable from any file) are also collected lazily.
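A rough sketch of lazy deletion over a toy namespace (names invented): a delete is just a rename to a hidden name carrying a timestamp, and a later background scan removes hidden files older than a grace period (the paper's default is three days):

```python
# Toy lazy deletion: hide on delete, reclaim during a later namespace scan.
import time

GRACE_SECONDS = 3 * 24 * 3600
namespace: dict[str, list[int]] = {"/data/old": [7, 8]}

def delete(path: str) -> None:
    hidden = f"{path}.deleted.{int(time.time())}"          # hide instead of removing
    namespace[hidden] = namespace.pop(path)

def scan_namespace(now: float) -> None:
    for name in list(namespace):
        if ".deleted." in name:
            deleted_at = int(name.rsplit(".", 1)[1])
            if now - deleted_at > GRACE_SECONDS:
                del namespace[name]                        # metadata gone; chunks it referenced
                                                           # become orphaned and are reclaimed later

delete("/data/old")
scan_namespace(time.time() + GRACE_SECONDS + 1)
print(namespace)                                           # {}
```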
5. Fault Tolerance and Diagnosis
High Availability: Fast recovery via the operation log and checkpoints, chunk replication, and master replication.
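A simplified sketch, with invented file formats and names, of fast master recovery: load the latest checkpoint of the metadata, then replay only the operation-log records appended after that checkpoint:

```python
# Toy recovery: restore a metadata checkpoint, then replay newer log records.
import json

def recover(checkpoint_path: str, log_path: str) -> dict:
    with open(checkpoint_path) as f:
        state = json.load(f)                     # metadata snapshot
    applied = state.get("last_applied", 0)
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)            # one logged metadata mutation
            if record["seq"] > applied:          # skip records already in the checkpoint
                apply_mutation(state, record)
                state["last_applied"] = record["seq"]
    return state

def apply_mutation(state: dict, record: dict) -> None:
    if record["op"] == "create":
        state.setdefault("files", {})[record["path"]] = []
    elif record["op"] == "delete":
        state.get("files", {}).pop(record["path"], None)
```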
Data Integrity
Diagnostic: Logs record all operations on metadata.
6. Measurements
7. Conclusions
Advantages: Files are divided into chunks for storage, which can be accessed concurrently for high throughput. Control flow and data flow are separated when modifying data, making full use of each machine's bandwidth. Leases reduce the master's workload. Good fault tolerance.
Disadvantages: There is only one master; if there is too much metadata, its memory may not suffice, and with many clients the load on the single master becomes too high. Garbage collection that requires the master to scan all chunks is inefficient. The consistency model is loose and cannot handle tasks that require strong consistency.