
Data Engineering in the Field: Overview, Challenges, Tools, and Patterns
"Explore the realm of data engineering with insights on online advertising and higher education challenges, tools like RDBMS and Python, and patterns for critical systems and data analysis. Dive into the world of data integration and software risks."
Presentation Transcript
DATA ENGINEERING IN THE FIELD
INTRODUCTION
I'm Dev Nambi
Principal Data Engineer @ Fred Hutch
Doing this since 2006
OVERVIEW
Case Studies
Challenges
Tools
Patterns
Questions
CASE 1 ONLINE ADVERTISING
CHALLENGES
People don't like ads - adversarial learning
Industry changes fast - software+schema changes, DevOps
Acquisitions & business changes - data integration
Lots and lots of ads - scale & latency
Profit at all costs - ethics & privacy
2006 - current tech didn't exist yet
TOOLS
Structured files - logs, XML
Distributed processing - HPC patterns
Structured storage - RDBMS, caches
Quick-change software - SQL, scripting languages
PATTERNS
Stateful systems are critical
Analysis on log data
Statistical analysis on data
Software changes + data = risk
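As an illustration of "analysis on log data", here is a minimal sketch that computes a click-through rate per ad from a structured log. The log format, the event names, and the `click_through_rates` helper are assumptions made for this example; they are not taken from the talk.

```python
from collections import Counter

# Hypothetical log format (an assumption): "<ISO timestamp> <event> <ad_id>"
SAMPLE_LOG = """\
2006-05-01T12:00:00 impression ad-1
2006-05-01T12:00:01 impression ad-2
2006-05-01T12:00:02 click ad-1
2006-05-01T12:00:03 impression ad-1
malformed line
"""

def click_through_rates(log_text):
    """Return {ad_id: clicks / impressions}, skipping lines that do not parse."""
    impressions, clicks = Counter(), Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # tolerate malformed lines instead of failing the whole run
        _timestamp, event, ad_id = parts
        if event == "impression":
            impressions[ad_id] += 1
        elif event == "click":
            clicks[ad_id] += 1
    # Only ads with at least one impression get a rate, so there is no division by zero.
    return {ad: clicks[ad] / n for ad, n in impressions.items()}

rates = click_through_rates(SAMPLE_LOG)
```

Skipping unparseable lines rather than raising is one way to soften the "software changes + data = risk" problem: a log format change degrades the numbers instead of stopping the job, though in practice the skipped lines should also be counted and monitored.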
CASE 2 HIGHER EDUCATION
CHALLENGES
6 mainframes, 300 DBs, 14K data systems - data integration
100+ departments - fragmented organization. Hard to know the problem
Public institution - risk-averse, decision paralysis
Higher education - FERPA rules
TOOLS
Structured storage - RDBMS
Machine learning & stats - Python
Visualizations - Tableau, D3
Data integration - SQL, Python
PATTERNS
Hard to know the problem - rapid prototyping
Data integration - entity disambiguation. ML. SQL+Python
Data integration - record linkage. ML. SQL+Python
Convincing people - visualizations
Common architecture: data > DB > ML > Viz
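To make "record linkage" concrete, here is a minimal sketch that links people across two systems by fuzzy name similarity, using only the Python standard library. The `registrar` and `housing` datasets, the `normalize` and `link_records` helpers, and the 0.85 threshold are all assumptions for illustration. The slide mentions ML and SQL for this job; simple string similarity is swapped in here to keep the example small.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and sort tokens so 'Smith, Ann' equals 'Ann Smith'."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))

def link_records(left, right, threshold=0.85):
    """Pair each left record with its most similar right record, if similar enough."""
    links = []
    for l_id, l_name in left.items():
        best_id, best_score = None, 0.0
        for r_id, r_name in right.items():
            score = SequenceMatcher(None, normalize(l_name), normalize(r_name)).ratio()
            if score > best_score:
                best_id, best_score = r_id, score
        if best_score >= threshold:
            links.append((l_id, best_id, round(best_score, 2)))
    return links

# Hypothetical records from two campus systems that share no common key.
registrar = {"R1": "Smith, Ann", "R2": "Lee, Robert"}
housing = {"H7": "Ann Smith", "H8": "Robert Lee", "H9": "Maria Garcia"}
links = link_records(registrar, housing)
```

This compares every pair, which is fine for a prototype but grows quadratically. Real datasets usually need a blocking step (for example, only comparing records that share a birth year) and more than one field per comparison.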
CASE 3 CANCER BIOLOGY
CHALLENGES
300+ research labs. Data integration headaches galore
Scaling. Genomics data can be multi-TB per sample
Patient privacy. HIPAA. De-identification needed
Custom algorithms. State-of-the-art changes every month
Researchers & software engineers speak different languages. Both are needed
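One small piece of de-identification can be sketched as follows: dropping direct identifiers and replacing the record key with a keyed-hash pseudonym. The field names (`mrn`, `name`, `sample`, `coverage`), the key, and both helper functions are assumptions for this example. This is only an illustration of the idea, not a claim of HIPAA compliance, which involves considerably more than pseudonymizing one field.

```python
import hashlib
import hmac

def pseudonymize(patient_id, secret_key):
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

def deidentify(record, secret_key, direct_identifiers=("name", "mrn")):
    """Drop direct identifiers and add a pseudonymous subject key."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["subject"] = pseudonymize(record["mrn"], secret_key)
    return cleaned

KEY = b"example-only-key"  # assumption: a real key would come from a secrets store
record = {"mrn": "12345", "name": "Jane Doe", "sample": "S-001", "coverage": 31.5}
safe = deidentify(record, KEY)
```

Using a keyed hash rather than a plain hash means someone without the key cannot rebuild the mapping by hashing a list of known identifiers, while the same patient still maps to the same pseudonym across datasets.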
TOOLS
Custom apps - command-line tools & R libraries
Cloud computing - how to handle the scale
Containers - reproducible code
Textbooks - have to learn bioinformatics
PATTERNS
Files for everything
Data immutability. Write-once, read-many. Data lakes
On-demand compute to run custom pipelines
Run X workflow when Y files appear
Rapid prototyping to the rescue
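The "run X workflow when Y files appear" pattern can be sketched with a simple polling function. The `run_on_new_files` helper, the `*.fastq` pattern, and the `count_reads` stand-in workflow are assumptions for illustration. A cloud deployment would more likely rely on storage event notifications than on polling a directory.

```python
import tempfile
from pathlib import Path

def run_on_new_files(watch_dir, pattern, workflow, seen):
    """Run `workflow` once for each matching file not already recorded in `seen`."""
    results = []
    for path in sorted(Path(watch_dir).glob(pattern)):
        if path.name in seen:
            continue  # inputs are write-once, so a processed file never needs a rerun
        results.append(workflow(path))
        seen.add(path.name)
    return results

def count_reads(path):
    """Stand-in workflow: count FASTQ records, assuming 4 lines per read."""
    return path.name, len(path.read_text().splitlines()) // 4

with tempfile.TemporaryDirectory() as tmp:
    seen = set()
    Path(tmp, "sample1.fastq").write_text("@r1\nACGT\n+\nIIII\n")
    first = run_on_new_files(tmp, "*.fastq", count_reads, seen)
    Path(tmp, "sample2.fastq").write_text("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    second = run_on_new_files(tmp, "*.fastq", count_reads, seen)
```

The second call only processes `sample2.fastq`. Data immutability is what makes the `seen` set sufficient: if files could be rewritten in place, the trigger would also need to track modification times or checksums.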
LESSONS LEARNED COMMON CHALLENGES
SOLVING THE RIGHT PROBLEM
Complicated domain area
Fragmented organization
Company politics
A good PM is worth their weight in gold
You think you know better than your users - you don't
INTEGRATING DATA
Merging similar datasets - entity disambiguation
Encountering missing data - impute & investigate
Merging different datasets - record linkage
There's no robust method to this. You'll develop a collection of scripts, tricks, and hacks to do this over time.
Get familiar with SQL, shell scripts, Python+pandas, regular expressions, awk/sed, JSON
A good place to start - http://www.redbook.io/
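For "impute & investigate", one possible sketch is to fill missing values with the median while flagging every filled row, so the gaps stay visible for follow-up. The `impute_with_flag` helper and the `credits` field are assumptions for this example, and plain Python is used instead of pandas to keep it dependency-free.

```python
from statistics import median

def impute_with_flag(rows, field):
    """Fill missing values of `field` with the median, marking each filled row."""
    observed = [r[field] for r in rows if r.get(field) is not None]
    fill = median(observed)  # raises StatisticsError if no value was observed at all
    result = []
    for r in rows:
        missing = r.get(field) is None
        result.append({**r, field: fill if missing else r[field],
                       f"{field}_imputed": missing})
    return result

# Hypothetical rows: the value is missing either as None or as an absent key.
rows = [{"id": 1, "credits": 12}, {"id": 2, "credits": None},
        {"id": 3, "credits": 18}, {"id": 4}]
filled = impute_with_flag(rows, "credits")
missing_rate = sum(r["credits_imputed"] for r in filled) / len(filled)
```

The flag column is the "investigate" half: a high `missing_rate`, or missing values concentrated in one source system, usually points to an upstream problem that imputation would otherwise hide.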
CHOOSING THE RIGHT SYSTEM DESIGN
What's the right system design to solve a problem?
Make it as simple as you can. Then make it even simpler.
Don't build a feature until you need it 3 times
Use proven technology & approaches, unless you can pay the cost of being an early adopter
DEVELOPING, TESTING, AND DEPLOYING SOFTWARE
What's the best way to work in a team?
What's a good way to change code to handle new & existing data?
How to extend an existing system cleanly?
CI/CD pipelines are great. Get used to writing simple, quick tests
Integrating with other systems/components is always expensive. This will affect your work more than anything else
Communication is everything. Ask for input, even when you don't think you need it. Keep an eye on what everyone else is doing
WORKING IN THE RIGHT PLACE
All industries have a dark side
Your work can create results you may disagree with:
  Invading people's privacy (online advertising, social media)
  Taking their money (finance)
  Denying people care (health insurance)
  Preserving existing gender+racial inequalities (literally everywhere)
  Environmental degradation (industrial ag, chemical engineering)
Your ability to influence these decisions will be limited.
Get controversial orders in writing (email).
Decide ahead of time what your breaking point is.
Network
LESSONS LEARNED COMMON TOOLS + PATTERNS
CLOUD COMPUTING
Infinite storage
Infinite compute
Parallelism
State and pipeline tools
Too many features to count
QUERIES ON DATA
Structured data inevitably uses a query language. Get used to different dialects of SQL for different data engines
Instead of a single relational DB (Oracle, SQL Server, MySQL), there are hundreds of data-storage-and-processing engines
Knowing the best one for the problem is key
Sometimes the best tool isn't really a query language, it's a script
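To illustrate the query-versus-script choice, here is a small sketch that computes the same aggregation twice: once in SQL (using Python's built-in `sqlite3` module with an in-memory database) and once as a plain script. The `courses` table and the sample rows are assumptions made for this example.

```python
import sqlite3
from collections import defaultdict

rows = [("bio", 120), ("bio", 80), ("chem", 60)]  # hypothetical (department, enrollment)

# Query-language version: load the rows into an in-memory SQLite DB, aggregate in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (department TEXT, enrollment INTEGER)")
conn.executemany("INSERT INTO courses VALUES (?, ?)", rows)
sql_totals = dict(conn.execute(
    "SELECT department, SUM(enrollment) FROM courses GROUP BY department"
))
conn.close()

# Script version: the same aggregation in plain Python.
script_totals = defaultdict(int)
for department, enrollment in rows:
    script_totals[department] += enrollment
```

Both versions produce the same totals. The SQL form tends to pay off when the data already lives in an engine or the question keeps changing, while the script form is often easier when the input is a handful of files or the logic does not map cleanly onto a query.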
DISTRIBUTED COMPUTE
There's usually too much data to process on a single machine
Get used to breaking work into pieces to run concurrently
Abstract compute (functions, containers) is a great way to package code to run on data
Lambda trigger -> Compute on data -> results is the most common design pattern I see, and very simple
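The "break work into pieces, run concurrently, combine results" idea can be sketched on a single machine with a thread pool. The `chunked`, `handler`, and `run_distributed` names are assumptions for illustration, and local threads stand in for the cloud functions or containers a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def handler(chunk):
    """Stand-in for the compute step that a trigger would invoke on one piece of data."""
    return sum(x * x for x in chunk)

def run_distributed(items, size=3, workers=4):
    """Fan the chunks out to a worker pool, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(handler, chunked(items, size)))
    return sum(partials)

total = run_distributed(list(range(10)))
```

The key property is that `handler` only needs its own chunk and returns a partial result that can be combined afterwards. Code written this way can move from threads to processes, containers, or serverless functions without changing its logic.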
STATISTICS, VISUALIZATION, AND STORIES
People respond to visuals, not numbers
People respond to stories, not statistics
Having a good ML model or statistical analysis only helps if you can convince your audience
SELF-DIRECTED LEARNING
You're far more effective when you know the industry you're in
Ask questions of everyone you work with. Even questions you think are stupid. Especially questions you think are stupid
Do your own experiments. Make prototypes. Try new things. Your failures are private, your successes are public
MORE INFO: HTTPS://DEVNAMBI.COM/DATAENGTALK