Outline of Distributed File Systems and Data Center Architecture

Slide Note

"This content discusses the requirements and properties of distributed file systems like HDFS/GFS, along with insights into data center architecture. It covers important aspects such as server configurations, network issues, failures, and strategies for achieving reliable and high-performing data centers."

Uploaded by sekula_m on Mar 18, 2025

Presentation Transcript

HDFS/GFS

Outline
  Requirements for a Distributed File System
  HDFS Architecture
  Read/Write
  Research Directions
    Popularity
    Failures
    Network

Properties of a Data Center
  Servers are built from commodity devices
    Failure is extremely common
    Servers have only a limited amount of HDD space
  Network is over-subscribed
    The bandwidth between servers differs
  Demanding applications
    High throughput, low latency
  Resources are grouped into failure zones
    Independent units of failure

Data-Center Architecture
  [Figure: data-center network with links labeled 10GB, 25GB, and 100GB]

Data-Center Architecture
  [Figure: data-center servers grouped into Failure Domain 1 and Failure Domain 2]

Goals for a Data Center File System
  Reliable
    Overcome server failures
  High performing
    Provide good performance to applications
  Aware of network disparities
    Make data local to the applications

Common Design Principles
  For performance: partitioning the data
    Splitting data into chunks and distributing them provides high throughput
    Many readers can read the chunks in parallel
    Better than everyone reading the same file
    Determines how data is partitioned across nodes
  For reliability: replication
    Overcome failures by making copies
    At least one copy should be online
    Determines how data is duplicated across nodes
  For network disparity: rack-aware allocation
    Read from the closest block
    Write to the closest location
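
The partitioning and replication principles can be illustrated with a minimal sketch. The function names, the fixed chunk size, and the round-robin placement are assumptions made for illustration; they are not how HDFS or GFS actually assigns blocks.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Partition a byte string into fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def assign_replicas(num_chunks: int, nodes: list[str], copies: int) -> dict[int, list[str]]:
    """Hypothetical round-robin placement: each chunk gets `copies` distinct nodes."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for chunk_id in range(num_chunks):
        # Consecutive offsets modulo len(nodes) are distinct because copies <= len(nodes).
        placement[chunk_id] = [nodes[(chunk_id + k) % len(nodes)] for k in range(copies)]
    return placement


chunks = split_into_chunks(b"abcdefghij", 4)
placement = assign_replicas(len(chunks), ["n1", "n2", "n3", "n4"], copies=2)
```

Because different chunks land on different nodes, several readers can fetch different chunks in parallel, and each chunk survives the loss of any single node.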

HDFS Architecture
  Name Node
    Master (only one in a data center)
    All reads and writes go through the master for metadata; the block data itself moves directly between clients and data nodes
    Manages the data nodes
    Detects failures and triggers replication
    Tracks performance
    Tracks location of blocks
    Tracks block-to-node mapping
    Tracks status of data nodes
    Rebalances the data center
    Orchestrates reads and writes
  Data Node
    One per server
    Stores the blocks
    Tracks status of blocks
    Ensures integrity of blocks
  [Figure: a Name Node and several Data Nodes holding copies of blocks B and B`]
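
The bookkeeping role of the Name Node can be sketched as two in-memory maps. `ToyNameNode` and its methods are hypothetical names for illustration only; the real Name Node keeps far more state.

```python
from collections import defaultdict


class ToyNameNode:
    """Illustrative metadata store: which nodes hold which blocks."""

    def __init__(self):
        self.block_to_nodes = defaultdict(set)  # block id -> data nodes holding it
        self.node_to_blocks = defaultdict(set)  # data node -> block ids it stores

    def record_replica(self, block_id: str, node: str) -> None:
        """Called when a data node reports that it stores a block."""
        self.block_to_nodes[block_id].add(node)
        self.node_to_blocks[node].add(block_id)

    def locate(self, block_id: str) -> list[str]:
        """Answer a client's question: where are the copies of this block?"""
        return sorted(self.block_to_nodes.get(block_id, ()))

    def blocks_on(self, node: str) -> list[str]:
        """List the blocks that would be affected if this node failed."""
        return sorted(self.node_to_blocks.get(node, ()))


name_node = ToyNameNode()
name_node.record_replica("B", "dn1")
name_node.record_replica("B", "dn2")
name_node.record_replica("B'", "dn2")
```

Keeping both directions of the mapping makes the two common queries cheap: locating a block for a read, and listing the blocks to re-replicate when a node fails.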

What is a Distributed FS Write? (HDFS)
  For high performance
    Make N copies of the data to be written
    Default N = 3
  [Figure: a write of block B goes through the HDFS master and produces copies of B]

What is a Distributed FS Write? (HDFS)
  For fault tolerance
    Place copies in two different fault domains
    2 copies in the same rack
    1 copy in a different rack
  [Figure: three copies of block B spread across Zone 1 and Zone 2]

What is a Distributed FS Write? (HDFS)
  For network awareness
    Currently does nothing
    Picks two random racks
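
The write placement described on these slides (two copies in one rack, one in a different rack, racks picked at random) can be sketched as follows. The function name and the rack dictionary are assumptions for illustration; the actual HDFS placement policy has more rules than this.

```python
import random


def place_three_replicas(racks: dict[str, list[str]], rng: random.Random) -> list[tuple[str, str]]:
    """Return three (rack, node) pairs: two in one random rack, one in another."""
    eligible_primary = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if not eligible_primary:
        raise ValueError("need a rack with at least two nodes")
    primary = rng.choice(eligible_primary)

    others = [rack for rack, nodes in racks.items() if rack != primary and nodes]
    if not others:
        raise ValueError("need a second non-empty rack")
    secondary = rng.choice(others)

    # Two distinct nodes in the primary rack, one node in the secondary rack.
    first, second = rng.sample(racks[primary], 2)
    third = rng.choice(racks[secondary])
    return [(primary, first), (primary, second), (secondary, third)]


racks = {"r1": ["a", "b", "c"], "r2": ["d", "e"], "r3": ["f"]}
replicas = place_three_replicas(racks, random.Random(0))
```

The racks are chosen at random with no regard to where the writer sits, which matches the slide's remark that the write path does nothing for network awareness.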

What is a Distributed FS Read? (HDFS)
  For network awareness and performance
    Pick the closest copy to read from
  Nothing specific for reliability
  [Figure: a read of block B via the Name Node, with copies of B in Zone 1 and Zone 2]
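
Picking the closest copy can be sketched with a simple three-level distance. The distance values (0 for the same node, 1 for the same rack or zone, 2 otherwise) and the function names are assumptions chosen for illustration.

```python
def distance(reader: tuple[str, str], replica: tuple[str, str]) -> int:
    """Hypothetical network distance between two (rack, node) locations."""
    reader_rack, reader_node = reader
    replica_rack, replica_node = replica
    if reader_rack == replica_rack and reader_node == replica_node:
        return 0  # the copy is on the reader's own server
    if reader_rack == replica_rack:
        return 1  # same rack: traffic stays below the rack switch
    return 2      # different rack: traffic crosses the over-subscribed core


def closest_replica(reader: tuple[str, str], replicas: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the replica with the smallest distance to the reader."""
    if not replicas:
        raise ValueError("block has no replicas")
    return min(replicas, key=lambda replica: distance(reader, replica))


reader = ("zone1", "n1")
chosen = closest_replica(reader, [("zone2", "n7"), ("zone1", "n3")])
```

Here the reader in `zone1` is served from `("zone1", "n3")` rather than from the copy in `zone2`.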

Implications of Read/Write Semantics
  One application write == 3 HDFS writes
    Writes are costly!
  HDFS is optimized for write-once, read-many workloads
  What is an update/edit? Rewriting blocks?
    An update/edit is: delete the old data + write the new data
  [Figure: a modify request through the Name Node replaces the copies of B with new copies B`]
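
The cost of these semantics can be made concrete with a toy in-memory store. `ToyReplicatedStore` is a hypothetical class for illustration; its placement (the first `copies` nodes) is a deliberate simplification.

```python
class ToyReplicatedStore:
    """In-memory stand-in for a cluster: node name -> {block_id: data}."""

    def __init__(self, nodes: list[str], copies: int = 3):
        if copies > len(nodes):
            raise ValueError("need at least as many nodes as copies")
        self.disks = {node: {} for node in nodes}
        self.copies = copies
        self.physical_writes = 0  # counts writes that actually hit a node

    def write(self, block_id: str, data: bytes) -> None:
        # Hypothetical placement: simply use the first `copies` nodes.
        for node in list(self.disks)[: self.copies]:
            self.disks[node][block_id] = data
            self.physical_writes += 1

    def delete(self, block_id: str) -> None:
        for blocks in self.disks.values():
            blocks.pop(block_id, None)

    def update(self, block_id: str, data: bytes) -> None:
        # No in-place edit: drop every old replica, then write fresh ones.
        self.delete(block_id)
        self.write(block_id, data)

    def read_all(self, block_id: str) -> list[bytes]:
        return [blocks[block_id] for blocks in self.disks.values() if block_id in blocks]


store = ToyReplicatedStore(["n1", "n2", "n3", "n4"])
store.write("B", b"old")
store.update("B", b"new")
```

After one write and one update the counter shows six physical writes for two application-level operations, which is why the workload the slides have in mind is write-once, read-many.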

Interesting Challenges
  What happens with more popular blocks?
    Or less popular blocks?
  What happens during server failures?
    Can you lose data?
  What happens if you have a better network?
    No oversubscription

Popularity in HDFS
  Not all files are equivalent
    E.g., more people search for basketball than for hockey
  More popular blocks will have more contention
    Leads to slower performance
    Searches for basketball will be slower

Popularity in HDFS
  # of copies of a block = function(popularity)
    If 50 people search for basketball, then make 50 copies
    If only 3 search for hockey, then make 3
  You want as many copies of a block as readers
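
One possible form of the popularity function is sketched below. The lower bound of three copies and the cap at the number of nodes are assumptions added for illustration; the slide only states that the copy count should track the number of readers.

```python
def target_copies(concurrent_readers: int, num_nodes: int, min_copies: int = 3) -> int:
    """Hypothetical policy: one copy per reader, bounded below and above."""
    if num_nodes <= 0:
        raise ValueError("cluster must have at least one node")
    wanted = max(min_copies, concurrent_readers)  # never drop below the baseline
    return min(num_nodes, wanted)                 # a node holds at most one copy


popular = target_copies(concurrent_readers=50, num_nodes=100)
unpopular = target_copies(concurrent_readers=3, num_nodes=100)
```

With 100 nodes this yields 50 copies for the block with 50 readers and 3 copies for the block with 3 readers, matching the slide's example.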

Popularity in HDFS
  As data becomes old, fewer people care about it
    E.g., last year's weather versus today's weather
  When a block becomes old (older than a week)
    Reduce the number of copies
    In Facebook data centers, only one copy of old data is kept
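
The age rule can be written as a small function. The one-week threshold and the single copy for old data come from the slide; the function name, parameter names, and the default of three copies for fresh data are assumptions.

```python
from datetime import datetime, timedelta


def copies_for_block(
    last_written: datetime,
    now: datetime,
    fresh_copies: int = 3,
    old_copies: int = 1,
    max_age: timedelta = timedelta(days=7),
) -> int:
    """Return how many copies to keep, based only on the block's age."""
    is_old = now - last_written > max_age
    return old_copies if is_old else fresh_copies


now = datetime(2025, 3, 18)
today = copies_for_block(last_written=datetime(2025, 3, 17), now=now)
last_year = copies_for_block(last_written=datetime(2024, 3, 18), now=now)
```

A background task could apply such a rule periodically and delete surplus replicas of blocks that have aged past the threshold.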

Failures in Data Center
  Do servers fail?
    Facebook: 1% of servers fail after a reboot
    Google: at least one server fails a day
  A failed node doesn't send heartbeats
  The Name Node determines which blocks were on the failed node
  It then starts replication
  [Figure: a Name Node and Data Nodes holding copies of B and B`, before and after a node failure]
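
The three recovery steps (missed heartbeat, find affected blocks, start replication) can be sketched as below. The function names, the timeout, and the choice of the first available live node as the copy target are assumptions for illustration.

```python
def find_failed_nodes(last_heartbeat: dict[str, float], now: float, timeout: float) -> set[str]:
    """A node counts as failed when its last heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(
    block_to_nodes: dict[str, set[str]],
    failed: set[str],
    live_nodes: list[str],
    copies: int,
) -> dict[str, list[str]]:
    """For each under-replicated block, pick live nodes that should receive a new copy."""
    plan = {}
    for block_id, holders in sorted(block_to_nodes.items()):
        survivors = holders - failed
        missing = copies - len(survivors)
        if missing <= 0:
            continue  # still fully replicated
        if not survivors:
            continue  # no live copy left to replicate from: the block is lost
        candidates = [n for n in live_nodes if n not in survivors and n not in failed]
        plan[block_id] = candidates[:missing]
    return plan


heartbeats = {"dn1": 100.0, "dn2": 40.0, "dn3": 99.0, "dn4": 98.0}
failed = find_failed_nodes(heartbeats, now=105.0, timeout=30.0)
blocks = {"B": {"dn1", "dn2"}, "B'": {"dn2", "dn3"}}
plan = plan_rereplication(blocks, failed, ["dn1", "dn3", "dn4"], copies=2)
```

The skipped case with no survivors is the answer to "can you lose data?": if every holder of a block fails before re-replication finishes, there is nothing left to copy.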

Problems With Locality-Aware DFS
  Ignores contention on the servers
    I/O contention greatly impacts performance
  Ignores contention in the network
    Similar performance degradation

Types of Network Topologies
  Current networks
    Uneven bandwidth everywhere
  Future networks
    Even bandwidth everywhere
  [Figure: two network topologies with links labeled 10GB, 25GB, and 100GB]

Implications of Network Topologies
  Blocks can be more spread out!
    No need for two blocks within the same rack
  Same bandwidth everywhere, so no need for locality-aware placement

Summary
  Properties for a DFS
  Research challenges
    Popularity
    Failure
    Data placement

Un-discussed
  Cluster rebalancing
    Move blocks around based on utilization
  Data integrity
    Use checksums to check whether data has been corrupted
  Staging + pipeline
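
The data-integrity item can be illustrated with a checksum stored next to each block. CRC32 from Python's `zlib` module is used here as an assumption for the sketch; the slide does not say which checksum is used, and the function names are hypothetical.

```python
import zlib


def store_block(data: bytes) -> tuple[bytes, int]:
    """Return the block together with the checksum computed at write time."""
    return data, zlib.crc32(data)


def is_intact(data: bytes, expected_checksum: int) -> bool:
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(data) == expected_checksum


block, checksum = store_block(b"hello")
ok = is_intact(block, checksum)
corrupted_ok = is_intact(b"hellp", checksum)
```

When the check fails on one node, a reader can fall back to another replica, and the corrupted copy can be replaced from a good one.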
