
Relational Databases and Their Role in the Cloud
Explore the fundamentals of relational databases, the SQL language, and their structuring on disk. Delve into the limits of scalability within cloud environments. Discover the significance of relational databases in managing various types of data effectively.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
14-848 CLOUD INFRASTRUCTURE LECTURE 10 * FALL 2018 * KESDEN
SOCRATIVE https://api.socrative.com/rc/Nfu6Lp
TODAYS GOALS A brief introduction to relational databases A brief introduction to the SQL language used by relational databases A brief look at how relational databases are organized on disk Reflect on how they can be used within clouds Reflect on why there are limits to their ability to scale
NOT AMONG TODAYS GOALS A deep study of relational databases Or, even minimal functional competency with SQL databases If interested, take 15-615 or 15-645
DATABASE A database is an organized collection of data, stored and accessed electronically. -- https://en.wikipedia.org/wiki/Database
RELATIONAL DATABASE A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970. The relational model (RM) for database management is an approach to managing data using a structure and language consistent with first-order predicate logic where all data is represented in terms of tuples, grouped into relations. Virtually all relational database systems use SQL (Structured Query Language) for querying and maintaining the database. -- https://en.wikipedia.org/wiki/Relational_database -- https://en.wikipedia.org/wiki/Relational_model
RELATIONAL DATABASES ROLE IN CLOUDS Most widely used type of database since the 1980s Unlikely to lose that title, or see reduced growth in use, any time soon Good for up to millions of records Beyond that, for Big Data applications, we ll talk about various NoSQL databases Everything from managing system configuration data to customer accounts to inventories to order and shipping records, to HR data, etc.
RELATIONAL DATABASE, INFORMALLY A database wherein the data is structured in tables known as relations. Rows represent entities and are unordered A relational database, although drawn as a table, functions more like an unordered set of rows. Columns represent attributes Constraints can be enforced w.r.t.types and values Schema for each table describes attributes and associated types. Operations can be performed across tables.
EXAMPLE RELATION: STUDENT TABLE StudentID 123456789 098765432 135790246 AndrewID lkennedy alight gmcclendon FName Laurie Angela Gina LName Kennedy Light McClendon YearEnrolled 2014 2015 2016
KEYS: WHAT ARE THEY? Used to name rows within tables Used to associate rows across tables Can be tuple of attributes
KEYS: EXAMPLE ROLE StudentID 123456789 098765432 135790246 AndrewID lkennedy llight gmcclendon FName Laurie Laurie Gina LName Kennedy Light McClendon YearEnrolled 2014 2015 2016 Unique within relation Used to associate relations StudentID 135790246 123456789 09876543 Course 14-848 14-736 14-740 Semester F18 S17 F17 Grade A B+ A-
SQL (SEQUEL) STRUCTURED QUERY LANGUAGE Used by nearly all relational databases Both a Data Definition Language (DDL) and a Data Manipulation Language (DML) DDL: Create, define, modify, and administer (access restrictions, etc) relations Structure, types, constraints, etc not values therein DML: Query and modify values: Tuples and attributes thereof Hides organization of data, e.g. whether and how indexed, how stored, location, etc. Big difference from older database languages that described tree paths, etc. Name derived from earlier IBM predecessor, Standard English QUery Language
KEYS: TYPES Candidate Unique to row Primary Used as name for row Foreign Key used in another table Think of a foreign key as a pointer
SQL: CREATE CREATE TABLE StudentBios ( StudentID CHAR(9) NOT NULL, FName VARCHAR(200)NOT NULL, LName VARCHAR(200)NOT NULL, YearEnrolled YEAR ) StudentID AndrewID 123456789 lkennedy 098765432 llight 135790246 gmcclendon FName Laurie Laurie Gina LName Kennedy Light McClendon YearEnrolled 2014 2015 2016
SQL: CREATE, REVISITED CREATE TABLE StudentCourses ( StudentID CHAR(9) NOT NULL, Course CHAR(6) NOT NULL, Semester CHAR(3) NOT NULL, Grade CHAR(2) PRIMARY KEY (StudentID, Course, Semester), FOREIGN KEY (StudentID) REFERENCES StudentBios ) StudentID 135790246 123456789 09876543 Course 14-848 14-736 14-740 Semester F18 S17 F17 Grade A B+ A- Why is the primary key (StudentID, Course, Semester)? Not just StudentID?
SQL: INSERT INSERT INTO StudentBios (CHAR(9), VARCHAR(200),VARCHAR(200),YEAR) VALUES (123456789, lkennedy , Laurie , Kennedy , 2014) StudentID 123456789 098765432 135790246 AndrewID lkennedy llight gmcclendon FName Laurie Angela Gina LName Kennedy Light McClendon YearEnrolled 2014 2015 2016
SQL: SELECT SELECT Fname, Lname from StudentBios WHERE YearEnrolled >= 2015 StudentID 123456789 098765432 135790246 AndrewID lkennedy llight gmcclendon FName Laurie Angela Gina LName Kennedy Light McClendon YearEnrolled 2014 2015 2016
SQL: SELECT, CONTINUED SELECT StudentID, Fname, Lname from StudentBios WHERE YearEnrolled <= 2015 ORDER BY FName,LName StudentID 123456789 098765432 FName Angela Laurie LName Light Kennedy
SQL: DELETE DELETE FROM StudentBios WHERE StudentBios.Fname = Laurie StudentID 123456789 098765432 135790246 AndrewID lkennedy llight gmcclendon FName Laurie Laurie Gina LName Kennedy Light McClendon YearEnrolled 2014 2015 2016
SQL: JOIN Combines columns from one or more relations into a new relation Rows are aligned based upon a matching key Columns are selected from logically combined rows Yes, a relation can be joined with itself, hence one or more Complexity Some keys may be present in some relations but not others Depending upon which table is being iterated, some rows may be missed or some columns may be empty This leads to different flavors of JOIN
SQL: JOIN SELECT AndrewID, Course, Semester, Grade FROM StudentBios INNER JOIN StudentCourses WHERE StudentBIOS.StudentID = StudentCourses.StudentID This effectively forms a new relation which is the union of the StudentBios and StudentCourses tables The rows of the two are associated up based upon the StudentID of the the rows matching. INNER JOIN requires matches in both tables If there isn t a matching key in both tables it isn t included The result has (AndrewID, Course, Semester, Grade) columns
SQL: VIEWS CREATEVIEW StudentClassList AS SELECT AndrewID, Course, Semester FROM StudentBios INNER JOIN StudentCourses WHERE StudentBIOS.StudentID = StudentCourses.StudentID This forms what is essentially a virtual table Abstracts away complexity of underlying tables Allows view to be created from tables that may be from different domains or systems Allows protections to be applied to table, e.g. can access course list for student, but not course grades May enable better system performance by aiding caching, etc.
RELATIONAL DATABASES: MEMORY HIERARCHY: A QUICK REVIEW Really approximate access times (latency): Registers: Speed of processor SRAM (Cache): 2-3nS = ~2.5x10-6 mS = ~2.5/1,000,000 mS DRAM (Main Memory): 20-30nS = ~2.5x10-5 mS = 2.5/100,000 mS FlashRAM (SSD): 0.035 mS Disk Drive: 1 mS, sequential Disk Drive: 10 mS, random The name of the game is To build a system that performs nearly as well as if all of its memory is of the fastest type At a cost nearly as low as if all of its memory is of the slowest type Caching is the magic sauce
RELATIONAL DATABASES: ADDITIONAL CONSIDERATIONS Memory is limited High throughput is key ACID properties Atomicity A transaction is all done, or note done at all, but not half done (Succeed or fail) Consistency Find things consistent, leave things consistent. Don t break things with side-effects. Isolation Even if there is concurrency, effects should be the same as if sequential Durability Once it is done, it is done (.) Minimize overhead, while maximizing performance
RELATIONAL DATABASES: IMPLICATIONS OF ENVIRONMENT, GOALS Caching is critical Durability will require consideration of hardware SSD or other FlashRAM layer? Battery powered backup? Etc Indexes and data stored separately Pack as many indexes into RAM as possible. Think like Haystack, but 40 years earlier. Multiple indexes, separate for each use Indexes and data different caching domains
RELATIONAL DATABASES: TYPICAL ORGANIZATION Block/Page based storage of data pages Often LRU cached All same size for interchangeable allocation and caching, mapping to disk blocks, etc Indexes constructed to speed queries B-Tree based sorted, range-based, etc Caching may be page-based LRU. Hash based fast lookup without rich features Trades disk space for lookup speed Exact lookup only, not ranges, etc. May be dynamic Lookup results can be cached. Indexes point into data pages Commonly created by administrator, not based upon any intrinsic understanding by the database, itself
RELATIONAL DATABASES: OBTAINING ROBUSTNESS Backups are hard because they may require a freeze Blurry backups may not be useful because of ACID Some algorithms can make blurry backup and fix-up Replicated databases are a common option A few replicas provide robustness without blowing scale Need policies for working through failure and obtaining recovery Need atomic commit protocol or master-slave, etc, to maintain consistency
RELATIONAL DATABASES: SCALING UP Scaling to a point is easy enough Buy bigger disk Solution can be replicated Scaling even bigger is easy enough Buy an array Solution can be replicated Scaling even bigger gets complicated Distributing whole tables makes joins more expensive Sharding tables by row makes summarizing across attributes slow Sharding tables by column makes lookup by entity slow At a certain point, it may be better to focus on precise needs and structure a solution to meet them NoSQL Databases: Row-Oriented, Column-Oriented, Key-Value Stores, etc.
RELATIONAL DATABASES: SCALING UP, SIMPLE IDEAS Distill data as much as possible. Make representation of data as dense as possible. Keep data redundantly, if (absolutely needed) to allow distribution with performance
SQL: THE STORY CONTINUES There is a ton of theory behind relational databases There is a ton more to the SQL language There is a ton more to how SQL databases are organized, indexed, etc. Take 15-615 or 15-645 if interested.
MOVING FORWARD: DATA (NOT FILE) STORAGE AT SCALE Key-Value Stores Column, Row, and Document-Oriented Databases RAM layer storage