Managing RDMA Resources Through Device Cgroups

Explore how to control RDMA resource allocation by extending the device cgroup, preventing any single application or container from consuming all of the resources and ensuring fair distribution among containers and kernel-space consumers. Learn about the implementation, limitations, and future plans for this solution.

  • RDMA
  • Device Cgroups
  • Resource Allocation
  • Containerization
  • Kernel Space




Presentation Transcript


  1. RDMA cgroup Parav Pandit

  2. Problem statement: applications run by root or non-root users can take away all of the RDMA resources, leaving kernel-space consumers (SRP, NFS-RDMA, iSER) unable to function at all. Partial solution: block non-root users from accessing /dev/infiniband/uverbsX using the device cgroup white-list controller, as sketched below. Its granularity is allow/disallow, not the amount of resources. When containers are used, an unprivileged container can take away all of the RDMA adapter's resources, leaving other containers and kernel-space consumers with nothing.
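
  A minimal sketch of the white-list partial solution, assuming the cgroup-v1 devices controller is mounted at /sys/fs/cgroup/devices and that grp1 already exists with the target application inside it; the character-device major 231 is the usual value for /dev/infiniband/uverbsX but should be verified with ls -l /dev/infiniband/:

      # Deny the group any access (read, write, mknod) to the uverbs devices.
      # This is all-or-nothing: it cannot cap how many QPs/PDs a process uses.
      echo 'c 231:* rwm' > /sys/fs/cgroup/devices/grp1/devices.deny

      # Access can later be re-granted the same way via the allow list.
      echo 'c 231:* rwm' > /sys/fs/cgroup/devices/grp1/devices.allow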

  3. Solution: either (1) extend the existing device cgroup for RDMA resources, or (2) add a new rdma cgroup. A cgroup is a kernel construct exposed via a mounted file system (under /sys/fs/cgroup) with a fairly straightforward interface: creating a new directory creates a new cgroup with the root/default cgroup's properties, processes can be migrated from one cgroup to another, and deleting a directory while processes are still attached to the cgroup is disallowed. There are two approaches to applying limits. Common method: keep adjusting cgroup limits as required to tune the application and align with SLA or resource requirements. Less popular: keep cgroup limits constant (static at creation time) and migrate processes from one cgroup to another; this is imperfect with respect to enforcing limits. A short sketch of these mechanics is shown below.
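
  A short illustration of the cgroup mechanics described above, using the devices hierarchy as an example; the group name grp1 and the APP_PID variable are illustrative:

      # A new directory is a new cgroup, inheriting root/default properties.
      mkdir /sys/fs/cgroup/devices/grp1

      # Migrate a running process into the new cgroup.
      echo $APP_PID > /sys/fs/cgroup/devices/grp1/cgroup.procs

      # Deleting a cgroup that still has member processes is disallowed;
      # this rmdir fails (EBUSY) until grp1 is empty again.
      rmdir /sys/fs/cgroup/devices/grp1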

  4. Solution: device cgroup extension for RDMA. Example:
         echo 200 > /sys/fs/cgroup/devices/grp1/rdma.resource.max_qp
         echo 10 > /sys/fs/cgroup/devices/grp1/rdma.resource.max_pd
         echo 1 > /sys/fs/cgroup/devices/grp1/rdma.resource.max_uctx
     The device cgroup is extended to support RDMA resource limits. Limits are per cgroup rather than per RDMA device: this works for a single RDMA device (one or multi-port) and allows resource governing across multiple RDMA devices. Cgroup-level limits also simplify handling of hot-plugged devices, since resources of a newly added device are available by default; see the sketch below. The alternative per-device approach would require a user-space daemon to track hot-plugged devices, would require the white list to keep uverbs access disabled, and might require integration with or changes to an IB daemon.
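
  A hedged illustration of the per-cgroup (rather than per-device) semantics, reusing the grp1 limits set above; the rdma.resource.* file names follow the proposal on this slide, and the container-init PID lookup is only sketched:

      # The limits written to grp1 above govern allocations on every RDMA
      # device its processes open (one device, several devices, or a device
      # hot-plugged later), with no user-space tracking daemon required.
      # To confine an unprivileged container, move its init process into grp1:
      echo $CONTAINER_INIT_PID > /sys/fs/cgroup/devices/grp1/cgroup.procs

      # Once grp1 has 200 QPs in use, further QP creation by any process in
      # the group fails, while other cgroups and kernel consumers are unaffected.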

  5. Solution: device cgroup extension for RDMA. At present it implements only a hard limit; no weight or soft limit yet. Things in plan: a fail count and threshold, so that a higher-level controlling agent can span a cluster rather than a single node and track resources across an application's cgroups spread over multiple nodes (a purely hypothetical sketch follows below); addressing review comments, implementation bug fixes, and architecture concerns. Other resources such as pkey and gid are still to be worked out.
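
  A purely hypothetical sketch of the planned fail-count idea: the rdma.resource.fail_cnt file name is an assumption, not an existing interface, since the slide lists the fail count and threshold only as future work. A cluster-level agent could poll such a counter on each node and rebalance limits when a cgroup keeps hitting its threshold:

      # Hypothetical: poll a per-cgroup allocation-failure counter every 10 s
      # and report it to a cluster-wide controller for cross-node rebalancing.
      while sleep 10; do
          cat /sys/fs/cgroup/devices/grp1/rdma.resource.fail_cnt
      done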

  6. Issues/comments with the solution. Bugs/comments: 1. fork() was not taken care of for ucontext tracking; fixed. 2. Other minor fixes for multi-threaded applications; fixed. Architecture comments: 1. The Intel adapter doesn't have a one-to-one mapping of verbs to hardware resources, so the cgroup doesn't track hardware resources accurately, although it does track application resource usage correctly. 2. The hardware vendor should define a resource pool or hardware resource that can be tracked via the RDMA cgroup. 3. There should be a dedicated RDMA cgroup controller, since this is a different function from the device cgroup. 4. Instead of the verbs resource abstraction, use some higher-level abstraction of resources; this is unlikely to be accurate given the resource-based programming model.

  7. Comments/Questions/Feedback?

  8. Alternative design option considered: a per-device resource pool with per-device, resource-pool-specific limits. The vendor defines the resource types. During resource creation, the uverbs layer passes the resource pool id/pointer, and the cgroup layer keeps track of the created resource pools.

  9. 10,000 ft view (block diagram): application system calls enter through the /dev/infiniband/uverbs interface; resource pools, with resource limit configuration and accounting, live in the cgroup rdma controller; underneath sits the vendor HCA driver.

  10. New APIs (reconstructed here as C declarations; the struct names hw_res_type and hw_res_types are illustrative, since the slide leaves them anonymous):

          struct hw_res_type {
              const char *name;      /* resource type name */
              int max_value;         /* hardware maximum for this type */
          };

          struct hw_res_types {
              int num_resources;
              struct hw_res_type res_type[];
          };

          /* Query which hardware resource types an ib_device exposes. */
          struct hw_res_types *query_hw_resource_types(struct ib_device *device);

          /* Per-device resource pool management. */
          struct res_pool *create_resource_pool(struct ib_device *device,
                                                char *pool_name);
          void set_resource_limit(struct ib_device *device, struct res_pool *pool,
                                  int type, int max_limit);
          void destroy_resource_pool(struct ib_device *device, struct res_pool *pool);

          /* Resource creation calls take the pool to charge, e.g.: */
          struct ib_pd *ib_alloc_pd(struct ib_device *device, struct res_pool *pool);
