Deep Learning Workspace Engineering Practice and Lessons Learnt

An overview of engineering practices from building the DL (Deep Learning) Workspace: separating configuration from code, a microservice architecture built from mostly stateless modules, relying on high-quality modules, backups, and the principles of a modern microservice architecture.

  • Deep Learning
  • Engineering Practice
  • Microservice Architecture
  • High Quality Modules
  • Modern Architecture

Uploaded on Feb 27, 2025


Presentation Transcript


  1. DL (Deep Learning) Workspace Engineering Practice and Lessons Learnt
     Hongzhi Li, Jin Li, Sanjeev Mehrotra

  2. Key Engineering Practice
     • Separate Configuration & Code
     • Microservice Architecture with Mostly Stateless Modules
     • High Quality Modules

  3. Separate Configuration & Code
     • Benefit: significant code reuse and stability of the code base
     • A default configuration (in deploy.py) is checked into the repo
     • Cluster-specific configuration (in config.yaml) is not checked in
       • Backup/restore operation: back up/restore cluster-related configuration + keys from/to a blob
     • Code + configuration are rendered to a location that is gitignored, and executed there
     • State of the cluster is managed by a database (SQL Azure)
       • User database (who is authenticated, who is admin, etc.)
       • Scheduling database (what job is being scheduled, deleted, etc.)
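
The layering of a checked-in default configuration under a cluster-specific one can be sketched as a recursive merge. This is a minimal illustration, not the DL Workspace implementation: the two dictionaries stand in for the contents of deploy.py and config.yaml, and the keys, values, and the `merge_config` helper are all hypothetical.

```python
import copy

# Hypothetical defaults, standing in for the configuration checked into the repo.
DEFAULT_CONFIG = {
    "cluster_name": "dlworkspace",
    "storage": {"type": "nfs", "mount": "/mnt/share"},
    "database": {"engine": "sql-azure"},
}


def merge_config(default, override):
    """Return a new dict in which values from `override` win over `default`.

    Nested dicts are merged key by key; neither input is modified.
    """
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical cluster-specific settings, standing in for an unversioned config file.
cluster_config = {"cluster_name": "team-a", "storage": {"type": "glusterfs"}}

effective = merge_config(DEFAULT_CONFIG, cluster_config)
print(effective["cluster_name"], effective["storage"])
```

Because the merge returns a new dictionary, the checked-in defaults stay untouched and the same code base can be rendered against any number of cluster configurations.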

  4. Microservice Architecture with Mostly Stateless Modules
     • Build DL Workspace as a collection of microservices
     • Minimum dependency among services (so that the cost of switching a module is low)
       • OpenID authentication, secured etcd/Kubernetes clusters
       • MySQL vs SQL
       • CoreOS vs Ubuntu
       • ASP.NET Core vs Flask API service
       • File share: NFS, HDFS, GlusterFS, CIFS (Azure File Share)
       • Minimal/no change to other modules when a module needs to be changed/updated
     • Stateless microservices are strongly preferred
       • State is preserved in either the SQL server or the etcd servers
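
One way to picture a stateless module whose state lives in an external store is the sketch below. It is illustrative only: `StateStore`, `InMemoryStore`, and `JobService` are hypothetical names, and the in-memory dictionary merely stands in for a SQL database or etcd.

```python
from typing import Dict, Optional, Protocol


class StateStore(Protocol):
    """Hypothetical minimal interface any backing store must provide."""

    def get(self, key: str) -> Optional[str]: ...

    def put(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for an external store such as a SQL database or etcd."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


class JobService:
    """Holds no job state itself; every call reads from or writes to the store."""

    def __init__(self, store: StateStore) -> None:
        self._store = store

    def submit(self, job_id: str) -> None:
        self._store.put(f"job/{job_id}", "queued")

    def status(self, job_id: str) -> str:
        return self._store.get(f"job/{job_id}") or "unknown"


store = InMemoryStore()
JobService(store).submit("42")

# A fresh instance (for example, after a restart) sees the same state.
replacement = JobService(store)
print(replacement.status("42"))
```

Since the service depends only on the small `StateStore` interface, swapping the backing store touches one class and leaves the service code unchanged, which is the low switching cost the slide argues for.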

  5. High Quality Module
     • Evaluate the quality of a module before using it
       • Docker is of good quality, and has been stable
       • Kubernetes is of good quality, and has been stable
       • Nvidia-docker (preferred platform for DL workload)
         • Zombie process (Nvidia driver)
     • Try hard not to hack a module
       • Use docker/nvidia-docker/kubernetes/glusterfs/hdfs as is
       • Minimal code, and most of the issues we have encountered have also been encountered by the community

  6. Backup

  7. Modern microservice architecture
     • Embrace microservices
       • Hundreds/thousands of independent services form an ecosystem
       • Each microservice should evolve on its own (created/justified/deprecated through usage, not top-down design)
       • Each service is single-purpose, with a simple and well-defined API, modular and independent
       • Goal of the service owner: meet the needs of my clients, at minimum cost and effort
     • Standardize communication (network protocols, data formats, schema between services), rather than the services themselves
       • A service becomes standard by being better than its alternatives
     • Standardize infrastructure (cluster management, monitoring, diagnostics, alerting, etc.)
     • No need to standardize internals (e.g., programming language, framework, persistence)
     • Encourage open-source-like practice:
       • Good documentation from the get-go
       • Searchable code/documentation and discussion forum
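
Standardizing the data format between services, rather than their internals, might look like the following sketch. It assumes JSON as the wire format and a version field in every message; the `JobEvent` type, its fields, and the version check are hypothetical and not taken from the presentation.

```python
import json
from dataclasses import asdict, dataclass

SUPPORTED_SCHEMA_VERSION = 1  # hypothetical version agreed between services


@dataclass(frozen=True)
class JobEvent:
    """Hypothetical message shared by services, whatever language they use."""

    schema_version: int
    job_id: str
    state: str


def encode(event: JobEvent) -> str:
    """Serialize an event to JSON with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(payload: str) -> JobEvent:
    """Parse a JSON payload, rejecting versions this service does not understand."""
    data = json.loads(payload)
    if data.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        raise ValueError("unsupported schema_version")
    return JobEvent(
        schema_version=data["schema_version"],
        job_id=data["job_id"],
        state=data["state"],
    )


wire = encode(JobEvent(schema_version=1, job_id="42", state="running"))
print(wire)
print(decode(wire))
```

Only the message shape is shared here; each service remains free to choose its own language, framework, and persistence behind that boundary.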
