
Understanding File Systems and Management
Explore the concept of file systems and their management, including what a file is, the role of file systems, and the key tasks involved in managing files. Gain insights into UNIX-style file systems, trade-offs in file system design, and more.
Presentation Transcript
FILE SYSTEMS AND MORE GREGORY KESDEN, 14-848 (CLOUD INFRASTRUCTURE) FALL 2019
FILE SYSTEMS: QUICK REVIEW GREGORY KESDEN, 14-848 (CLOUD INFRASTRUCTURE) FALL 2019 BASED UPON: HTTPS://HADOOP.APACHE.ORG/DOCS/R1.2.1/HDFS_DESIGN.HTML
WHAT IS A FILE? A file is a unit of data organized by the user. The data within a file isn't necessarily meaningful to the operating system. Instead, a file is created by the user and meaningful to the user. It is the job of the file system to maintain this unit without understanding or caring why. Contrast this with a record in a database, which is described by a schema and upon which the database enforces constraints.
WHAT IS A FILE SYSTEM? A file system is a service responsible for managing files. File systems typically implement persistent storage, although volatile file systems are also possible (/proc is one example).
WHAT DOES IT MEAN TO MANAGE FILES? Name files in meaningful ways: the file system should allow a user to locate a file using a human-friendly name; in many cases this is done using a hierarchical naming scheme like those we know and love in Windows, UNIX, Linux, &c. Access files: create, destroy, read, write, append, truncate, keep track of position within, &c. Physical allocation: decide where to store things, reduce fragmentation, keep related things close together, &c. Security and protection: ensure privacy, prevent accidental (or malicious) damage, &c. Resource administration: enforce quotas, implement priorities, &c.
RECALL, FOR EXAMPLE, UNIX-STYLE FILE SYSTEMS Block/request cache, name cache, inode cache, in-memory open file table, in-memory fd tables.
TRADE-OFFS Block size: smaller blocks = finer granularity = higher utilization; but smaller blocks = more blocks to manage = more overhead. Hierarchical naming: good for humans, tree-like lookup. Access control: finer grain = more metadata, and more metadata to access before or while getting to the data means higher latency, a larger memory footprint in caches, and more steps before getting to the bits we want. In general: more features = more management = more overhead in space and/or time.
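As a rough illustration of the block-size trade-off, here is a small sketch (not from the slides; the file and block sizes are made up) that computes the bookkeeping and internal-fragmentation cost of two block sizes for the same file:

```python
# Hypothetical illustration of the block-size trade-off described above.
def block_overhead(file_bytes: int, block_bytes: int) -> tuple[int, int]:
    """Return (number of blocks to manage, bytes wasted in the final partial block)."""
    blocks = -(-file_bytes // block_bytes)          # ceiling division
    wasted = blocks * block_bytes - file_bytes      # internal fragmentation
    return blocks, wasted

for block in (4 * 1024, 64 * 1024 * 1024):          # 4 KB vs 64 MB blocks
    n, waste = block_overhead(file_bytes=100 * 1024 * 1024 + 1, block_bytes=block)
    print(f"block={block:>9} bytes: {n:>6} blocks to track, {waste:>9} bytes wasted")
```

Small blocks waste almost no space but multiply the number of blocks to track; large blocks reverse the trade.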
HDFS ARCHITECTURE GREGORY KESDEN, 14-848 (CLOUD INFRASTRUCTURE) FALL 2019 BASED UPON: HTTPS://HADOOP.APACHE.ORG/DOCS/R1.2.1/HDFS_DESIGN.HTML
ASSUMPTIONS, CONTINUED Simple coherency model = lower overhead, higher throughput. Write once, read many (WORM) gets rid of most concurrency control and the resulting need for slow, blocking coordination. Moving computation is cheaper than moving data: the data is huge, the network is relatively slow, and the computation per unit of data is small; moving (migration) may not be necessary, mostly just placement of computation. Portability, even across heterogeneous infrastructure: at scale, things can be fundamentally different, or differ as updates roll out.
NAMENODE Master-slave architecture with 1x NameNode (coordinator). It manages the namespace and coordinates for clients: directory lookups and changes, block-to-DataNode mappings. Files are composed of blocks; blocks are stored by DataNodes. Note: user data never comes to or from a NameNode. The NameNode just coordinates.
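To make the coordinator role concrete, here is a minimal, hypothetical sketch of NameNode-style bookkeeping (class and method names invented, not HDFS code): a file-to-blocks map and a block-to-DataNodes map, fed by block reports.

```python
# Hypothetical sketch of NameNode-style bookkeeping; not actual HDFS code.
from collections import defaultdict

class MiniNameNode:
    def __init__(self):
        self.file_to_blocks = {}                    # path -> ordered list of block ids
        self.block_to_datanodes = defaultdict(set)  # block id -> set of DataNode ids

    def add_file(self, path, block_ids):
        self.file_to_blocks[path] = list(block_ids)

    def record_block_report(self, datanode_id, block_ids):
        # DataNodes periodically report which blocks they hold.
        for b in block_ids:
            self.block_to_datanodes[b].add(datanode_id)

    def locate(self, path):
        # A client asks where the blocks live; the data itself never flows through here.
        return [(b, sorted(self.block_to_datanodes[b]))
                for b in self.file_to_blocks[path]]

nn = MiniNameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.record_block_report("dn-a", ["blk_1"])
nn.record_block_report("dn-b", ["blk_1", "blk_2"])
print(nn.locate("/logs/day1"))
```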
DATANODE Many DataNodes (participants), one per node in the cluster. Each represents its node to the NameNode and manages the storage attached to that node. It handles read(), write(), etc. requests for clients, and stores blocks as directed by the NameNode: creating and deleting blocks, replicating blocks.
NAMESPACE Hierarchical namespace: directories, subdirectories, and files, managed by the NameNode. Maybe not strictly needed, but low overhead: files are huge and processed in their entirety, so name-to-block lookups are rare. Remember, the model is streaming of large files for processing; throughput, not latency, is optimized.
ACCESS MODEL (Just to be really clear.) Read anywhere: streaming is in parallel across blocks, across DataNodes. Write only at the end (append). Delete the whole file (rare). No edit/random write, etc.
LOCATION AWARENESS Site + 3-Tier Model is default
REPLICATION Blocks are replicated by default, and blocks are all the same size (except the tail). This provides fault tolerance and opportunities for parallelism. Replication is NameNode-managed, based upon heartbeats, block reports (a per-DataNode report of available blocks), and the replication factor for the file (per-file metadata).
REPLICATION Pipelined, not fanned out. Default: one replica on the local rack, two on a remote rack. Observations: rack failure is less likely than host failure; this doesn't distribute evenly; and it takes the same bandwidth to keep one and send two as to keep two and send one (pipelining).
REPLICATION Questions: Does all data need the same number of replicas? Hot or not? Replaceability? Does all data need its replicas in the same place?
REPLICA PLACEMENT AND SELECTION Assume bandwidth within a rack is greater than bandwidth outside the rack. Default placement: 2 nodes on the same rack, one on a different rack (beyond 3? random, subject to a replicas-per-rack limit). This gives fault tolerance, parallelism, and lower network overhead than spreading farther. Read from the closest replica (rack, site, global).
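A hedged sketch of the default placement described above, assuming a simple node-to-rack map (all names hypothetical): two replicas on the writer's rack, one on a different rack, and any extras placed randomly subject to a per-rack cap.

```python
# Hypothetical sketch of the rack-aware placement policy described above; not HDFS code.
import random
from collections import Counter

def place_replicas(writer_node, rack_of, replication=3, max_per_rack=2):
    local_rack = rack_of[writer_node]
    same_rack = [n for n in rack_of if rack_of[n] == local_rack and n != writer_node]
    other_rack = [n for n in rack_of if rack_of[n] != local_rack]

    chosen = [writer_node]                     # first replica: the writer's own node
    chosen += random.sample(same_rack, 1)      # second replica: same rack
    chosen += random.sample(other_rack, 1)     # third replica: a different rack

    while len(chosen) < replication:           # beyond 3: random, below the per-rack limit
        cand = random.choice([n for n in rack_of if n not in chosen])
        per_rack = Counter(rack_of[n] for n in chosen)
        if per_rack[rack_of[cand]] < max_per_rack:
            chosen.append(cand)
    return chosen

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
print(place_replicas("n1", rack_of))
```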
FILESYSTEM METADATA PERSISTENCE The EditLog records all metadata changes and is stored in the local host FS. The FSImage keeps all FS metadata, also stored in the local host FS, and is kept in memory for use. Periodically (by time interval or operation count), changes are merged in and checkpointed; the EditLog can be truncated via a checkpoint. Multiple copies of these files can be kept, in sync, for robustness. This slows things down, but that's okay given the infrequency of metadata changes.
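A toy sketch of that checkpoint cycle (file names and record formats are invented, not HDFS's actual formats): replay the EditLog into the image, write a new FSImage, then truncate the log.

```python
# Hypothetical sketch of EditLog + FSImage checkpointing; file formats are invented.
import json, os

EDITLOG, FSIMAGE = "editlog.jsonl", "fsimage.json"

def load_image():
    return json.load(open(FSIMAGE)) if os.path.exists(FSIMAGE) else {}

def replay_editlog(image):
    if not os.path.exists(EDITLOG):
        return image
    with open(EDITLOG) as f:
        for line in f:
            op = json.loads(line)
            if op["op"] == "create":
                image[op["path"]] = op["meta"]
            elif op["op"] == "delete":
                image.pop(op["path"], None)
    return image

def checkpoint():
    image = replay_editlog(load_image())   # merge logged changes into the image
    with open(FSIMAGE, "w") as f:
        json.dump(image, f)
    open(EDITLOG, "w").close()             # safe to truncate the log after the checkpoint
    return image

# Log one metadata change, then checkpoint.
with open(EDITLOG, "a") as f:
    f.write(json.dumps({"op": "create", "path": "/a", "meta": {"blocks": ["blk_1"]}}) + "\n")
print(checkpoint())
```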
FAILURE OF DATANODES Disk failure, node failure, partitioning: detected via heartbeats (long delay, by default), block maps, etc.; the response is to re-replicate. Corruption: detectable by the client via checksums; the client can determine what to do (nothing is an option). Metadata
DATA BLOCKS, STAGING Data blocks are large to minimize overhead for large files. Staging: initial creation and writes are cached locally and delayed; the request goes to the NameNode when the first chunk is full. Local caching is intended to support use of the memory hierarchy and the throughput needed for streaming; we don't want to block on the remote end. Replication runs from replica to replica in a replication pipeline, which maximizes the client's ability to stream.
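The replica-to-replica pipeline might be pictured with this minimal sketch (names invented; real DataNodes forward chunks concurrently over the network rather than via a recursive call):

```python
# Hypothetical sketch of pipelined (chained) replication: the client pushes one copy,
# and each replica forwards to the next instead of the client fanning out.
def pipeline_write(chunk: bytes, replicas: list) -> None:
    if not replicas:
        return
    head, rest = replicas[0], replicas[1:]
    head.store(chunk)            # write locally...
    pipeline_write(chunk, rest)  # ...then forward downstream (in reality, concurrently)

class Replica:
    def __init__(self, name):
        self.name, self.blocks = name, []
    def store(self, chunk):
        self.blocks.append(chunk)
        print(f"{self.name} stored {len(chunk)} bytes")

pipeline_write(b"x" * 64, [Replica("dn-a"), Replica("dn-b"), Replica("dn-c")])
```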
FINDING A NEEDLE IN HAYSTACK: FACEBOOK'S PHOTO STORAGE OSDI 2010 DOUG BEAVER, SANJEEV KUMAR, HARRY C. LI, JASON SOBEL, PETER VAJGEL FACEBOOK.COM AN OVERVIEW PRESENTED IN 14-848 (CLOUD INFRASTRUCTURE) GREGORY KESDEN
OVERVIEW Facebook is the world's largest photo-sharing site. As of 2010: 260 billion images, 20 petabytes of data; 1 billion photos (60 terabytes) uploaded each week; over 1 million images/second served at peak. News feed and albums are 98% of photo requests. Images are written once, read often, never modified, and rarely deleted.
WHY NOT USE A TRADITIONAL FILESYSTEM? There is no need for most metadata (directory tree, owner, group, etc.). It wastes space and, more importantly, slows access: reading, checking, and using it turned out to be the bottleneck. What cost? It seems small, but it is multiplied a lot. Think about the steps: map the name to an inode (name cache, directory files), read the inode, read the data. The latency is in the accesses themselves, not the tiny transfer.
HAYSTACK GOALS High throughput, low latency. Fault tolerance. Cost-effective (it is hugely scaled, right?). Simple (which means it matures to robustness quickly, with low operational cost).
LESSONS LEARNED FROM ORIGINAL CDNs serve the hottest photos, e.g. profile pictures, but don't help with the long tail of requests for older photos that such a large-volume site generates, which is a significant amount of traffic hitting the backing store. There are too many possibilities, and too few are reused, to keep them in a memory cache of any kind, not just via the CDN. Surprising complexity: directories of thousands of images ran into metadata inefficiencies in NAS, ~10 accesses/image. Even when optimized to hundreds of images per directory, it still took 3 accesses: metadata, inode, file.
WHY GO CUSTOM? Needed a better RAM:disk ratio. Unachievable as-is, because it would need too much RAM. Had to reduce the demand for RAM by reducing metadata.
REALITY AND GOAL Reality: can't keep all files in memory, or even enough for the long tail. Achievable goal: shrink the metadata so it can fit in memory. Result: 1 disk access per photo, for the photo itself (not metadata).
PHOTO URL http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo> The URL specifies the steps to retrieving the photo. The CDN looks up <Logical volume, Photo>. If it hits, great. If not, the CDN strips the <CDN> component and asks the Cache. If the Cache hits, great. If not, the Cache strips the <Cache> component and asks the back-end Haystack Store machine. If the request doesn't go through the CDN, it just starts at the second step.
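Roughly, the tiered lookup could look like the sketch below (the URL fields follow the slide, but the splitting logic and the dict-based tiers are hypothetical, not Haystack's actual API):

```python
# Hypothetical sketch of the tiered lookup implied by the URL
# http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>.
def fetch_photo(url, cdn, cache, store):
    _, _, _cdn_host, _cache_host, machine_id, volume_and_photo = url.split("/", 5)
    key = (machine_id, volume_and_photo)

    photo = cdn.get(key)            # 1. try the CDN
    if photo is not None:
        return photo
    photo = cache.get(key)          # 2. CDN miss: strip <CDN>, ask the Haystack Cache
    if photo is not None:
        return photo
    return store[key]               # 3. Cache miss: strip <Cache>, ask the Store machine

store = {("m7", "v12,p99"): b"...jpeg bytes..."}
print(fetch_photo("http://cdn.example/cache1/m7/v12,p99", cdn={}, cache={}, store=store))
```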
PHOTO UPLOAD The request goes to a web server. The web server requests a write-enabled logical volume from the Haystack Directory, assigns a unique ID to the photo, and uploads it to each physical volume associated with the logical volume.
HAYSTACK DIRECTORY Functions: maps logical volumes to physical volumes; load-balances writes across logical volumes and reads across physical volumes; determines whether a request should be handled by the CDN or the Cache; makes volumes read-only, when full or for operational reasons (machine-level granularity); removes failed physical volumes and replaces them with new ones. The Directory is stored in a replicated database with memcache.
HAYSTACK CACHE A distributed hash table with photo ID as the key. Cache a photo iff (1) the request is from an end user (not the CDN), since the CDN is much bigger than the Cache and a miss there is unlikely to hit in the smaller Cache, and (2) the volume is write-enabled, since volumes perform better when reading or writing but not a mix, so doing one or the other is helpful; this shelters reads, letting the volume focus on writes (no need to shelter once the volume is full, as there are no more writes). Could proactively push newly uploaded files into the Cache.
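The admission rule above reduces to a simple predicate; a minimal sketch, assuming only the two conditions named on the slide:

```python
# Hypothetical sketch of the Haystack Cache admission rule described above:
# cache a photo only if the request came from an end user (not the CDN)
# and the photo's volume is still write-enabled.
def should_cache(request_from_cdn: bool, volume_write_enabled: bool) -> bool:
    return (not request_from_cdn) and volume_write_enabled

assert should_cache(request_from_cdn=False, volume_write_enabled=True)      # shelter a busy write volume
assert not should_cache(request_from_cdn=True, volume_write_enabled=True)   # CDN already missed a bigger cache
assert not should_cache(request_from_cdn=False, volume_write_enabled=False) # read-only volume: no writes to shelter
```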
HAYSTACK STORE The Store needs the logical volume id and offset (and size). These need to be quick to get given the photo id, with no disk operations. Each Store machine keeps open file descriptors for each physical volume (a preloaded fd cache) and an in-memory mapping of photo ids to filesystem metadata (file, offset, size). A needle represents a file stored within Haystack; the in-memory mapping is from <photo id, type (size)> to <flags, size, offset>.
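A minimal sketch of that in-memory index (the field names follow the slide; the flag encoding and everything else are assumptions): map <photo id, type (size)> to <flags, size, offset> so a lookup needs no disk access.

```python
# Hypothetical sketch of the Store's in-memory needle index described above.
from dataclasses import dataclass

@dataclass
class NeedleMeta:
    flags: int      # e.g., assume bit 0 = deleted
    size: int       # needle size in bytes
    offset: int     # byte offset within the volume file

index = {}  # (photo_id, size_type) -> NeedleMeta

def add_needle(photo_id, size_type, flags, size, offset):
    index[(photo_id, size_type)] = NeedleMeta(flags, size, offset)

def lookup(photo_id, size_type):
    meta = index.get((photo_id, size_type))
    if meta is None or meta.flags & 1:   # missing, or marked deleted
        return None
    return meta                          # offset + size: exactly one disk read for the photo

add_needle(photo_id=42, size_type="thumb", flags=0, size=8_192, offset=1_048_576)
print(lookup(42, "thumb"))
```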
READS FROM STORE A Cache machine requests <logical volume id, key, alternate key, cookie>. The cookie is a random number assigned and maintained by the Directory upon upload; it prevents brute-force lookups via photo ids. The Store machine looks this up in its in-memory metadata (if not deleted), seeks to the offset in the volume file, reads the entire needle from disk, verifies the cookie and data integrity, and returns the photo to the Cache.
PHOTO WRITE The web server provides <logical volume id, key, type (size), cookie, data> to the Store machines (all of those associated with the logical volume). Each Store machine synchronously appends the needle to its physical volume and updates its mapping. The append-only design keeps this simple. But if a file is updated, e.g. rotated, the needle can't be changed; a new one must be appended, and if there are multiple, the greatest offset wins. (The Directory can update the mapping for logical volumes.)
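A simplified sketch of the append-plus-update step, with "greatest offset wins" for re-uploaded photos (the real needle layout with its header, cookie, and checksum is omitted; file and key names are invented):

```python
# Hypothetical sketch of an append-only needle write; the real on-disk
# needle format (magic numbers, cookie, checksum) is omitted.
import os

def append_needle(volume_path, index, photo_id, size_type, data: bytes):
    with open(volume_path, "ab") as vol:
        offset = vol.tell()                     # current end of the volume file
        vol.write(data)
    prev = index.get((photo_id, size_type))     # prev is (size, offset) or None
    if prev is None or offset > prev[1]:        # greatest offset wins on duplicates
        index[(photo_id, size_type)] = (len(data), offset)
    return offset

index = {}
append_needle("volume_001.dat", index, 42, "orig", b"original bytes")
append_needle("volume_001.dat", index, 42, "orig", b"rotated bytes")   # re-upload appends a new needle
print(index[(42, "orig")])
os.remove("volume_001.dat")                     # clean up the demo file
```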
DELETE Just a delete flag. Long live those party photos!
INDEX FILE An asynchronously updated checkpoint of the in-memory data structures, for use in the event of a reboot. The structures could be reconstructed without it, but much data would need to be crunched. It can be missing recent files and/or delete flags. Upon reboot: load the checkpoint, find the last needle it covers, add the needles after that from the volume file, and restore the checkpoint. Store machines re-verify the deleted flag after reading from storage, in case the index file was stale.
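The reboot path above might be sketched as follows (the checkpoint and volume representations are invented): load the checkpoint, find the last needle it covers, then add any needles appended after it from the volume file.

```python
# Hypothetical sketch of rebuilding the in-memory index from a stale checkpoint.
def recover(checkpoint: dict, volume_needles: list) -> dict:
    """checkpoint: photo_id -> offset; volume_needles: [(photo_id, offset), ...] in file order."""
    index = dict(checkpoint)                             # 1. load the checkpoint
    last = max(checkpoint.values(), default=-1)          # 2. offset of the last checkpointed needle
    for photo_id, offset in volume_needles:              # 3. add needles appended after that point
        if offset > last:
            index[photo_id] = offset
    return index

checkpoint = {1: 0, 2: 100}                              # stale: missing the needle at offset 200
volume = [(1, 0), (2, 100), (3, 200)]
print(recover(checkpoint, volume))                       # {1: 0, 2: 100, 3: 200}
```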
HOST FILESYSTEM Store machines should use a file system that requires little memory for random seeks within a large file, e.g., block maps vs. B-trees for logical-to-physical block mapping.
RECOVERING FROM FAILURE Failure detection: proactively test Stores. Connected? Each volume available? Each volume readable? On failure, mark the logical volumes on that Store read-only and fix it. In the worst case, copy over from other replicas (slow = hours).
OPTIMIZATIONS Compaction copies over used needles, ignoring deleted ones, then locks and atomically swaps in the new volume. 25% of photos are deleted over the course of a year, and they are more likely to be recent ones. Batch uploads, such as when whole albums are uploaded, improve performance via large, sequential writes.
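Compaction as described above, in a simplified sketch (the real operation locks the volume and swaps atomically; here we just return the new volume and index, and the record format is invented):

```python
# Hypothetical sketch of volume compaction: copy only non-deleted needles
# into a new volume, rebuilding the index with their new offsets.
def compact(old_needles):
    """old_needles: list of dicts with 'photo_id', 'deleted', 'data'."""
    new_volume, new_index, offset = [], {}, 0
    for n in old_needles:
        if n["deleted"]:
            continue                         # reclaim space from deleted photos
        new_volume.append(n["data"])
        new_index[n["photo_id"]] = (offset, len(n["data"]))
        offset += len(n["data"])
    return new_volume, new_index             # caller would lock and atomically swap these in

old = [
    {"photo_id": 1, "deleted": False, "data": b"aaaa"},
    {"photo_id": 2, "deleted": True,  "data": b"bbbb"},   # those party photos, finally gone
    {"photo_id": 3, "deleted": False, "data": b"cccc"},
]
print(compact(old)[1])                       # {1: (0, 4), 3: (4, 4)}
```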
WHY DID WE TALK ABOUT HAYSTACK? All the fun bits: the classical Browser-Server-Directory-CDN-Cache-Store layering; a simple, easy example, by design (simple = becomes robust fast); it optimizes for its use cases; the memory-disk trade, with a focus on fitting metadata into memory; real-world storage concerns.