System Calls in Linux


In the Linux ABI, everything distills into a system call: /sys, /dev, and /proc all ultimately reduce to read() and write() syscalls. System calls are special-purpose function calls that elevate privilege and execute functions in the kernel. A function call is a special form of jmp that executes a block of code at a given address; function calls are implemented at the hardware level and are necessary for efficient program execution. System calls are essentially function calls with a few additions: privilege elevation and constrained entry points. Implementing a system call means invoking syscall(), which enters the kernel, elevates privilege, and instantiates the kernel-level environment. Several mechanisms exist for this entry: the legacy int 0x80, SYSENTER, and SYSCALL. Each system call has a number assigned to it, which indexes into a system call table to invoke the respective syscall handler and set up the kernel environment. Finally, syscalls return to user space by resetting the environment to its state before the call.





Presentation Transcript


  1. System Calls

  2. Linux ABI System Calls
     • Everything distills into a system call
       - /sys, /dev, /proc
       - read() & write() syscalls
     • What is a system call?
       - Special purpose function call
       - Elevates privilege
       - Executes function in kernel
     • But what is a function call?

  3. What is a function call?
     • Special form of jmp
       - Executes a block of code at a given address
       - Special instruction: call <fn-address>
     • Why not just use jmp?
     • What do function calls need?
       - int foo(int arg1, char * arg2);
       - Location: foo()
       - Arguments: arg1, arg2
       - Return code: int
     • Must be implemented at hardware level

  4. System Calls
     • Function calls are not that special
       - Just an abstraction built on top of hardware
     • System calls are basically function calls
       - With a few minor changes
       - Privilege elevation
       - Constrained entry points
     • Functions can call to any address
     • System calls must go through gates

  5. Implementing system calls
     • System calls are implemented as a single function call: syscall()
       - read() and write() actually just invoke syscall()
     • What does syscall() do?
       - Enters the kernel at a known location
       - Elevates privilege
       - Instantiates the kernel-level environment
     • Once inside the kernel, an appropriate system call handler is
       invoked based on the arguments to syscall()

  6. x86 and Linux
     • A number of different mechanisms for implementing syscalls
       - Legacy: int 0x80
         Invokes a single interrupt handler
       - 32 bit: SYSENTER
         Special instruction that sets up a preset kernel environment
       - 64 bit: SYSCALL
         64 bit version of SYSENTER
     • All jump to a preconfigured execution environment inside kernel space
       - Either interrupt context or OS defined context
     • What about arguments?
       - syscall(int syscall_num, args ...)

  7. Specific system calls
     • Each system call has a number assigned to it
       - Index into a system call table
       - Function pointers referencing each syscall handler
     • syscall(int syscall_num, args ...)
       - Sets up kernel environment
       - Invokes syscall_table[syscall_num](args ...);
     • Returns to user space
       - Resets environment to state before call

  8. man -s 2 write

     WRITE(2)            Linux Programmer's Manual            WRITE(2)

     NAME
            write - write to a file descriptor

     SYNOPSIS
            #include <unistd.h>

            ssize_t write(int fd, const void *buf, size_t count);

     DESCRIPTION
            write() writes up to count bytes from the buffer pointed
            to by buf to the file referred to by the file descriptor fd.

  9. SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                     size_t, count)
     {
         struct fd f = fdget_pos(fd);
         ssize_t ret = -EBADF;

         if (f.file) {
             loff_t pos = file_pos_read(f.file);
             ret = vfs_write(f.file, buf, count, &pos);
             if (ret >= 0)
                 file_pos_write(f.file, pos);
             fdput_pos(f);
         }

         return ret;
     }

  10. ssize_t __vfs_write(struct file *file, const char __user *p,
                          size_t count, loff_t *pos)
      {
          if (file->f_op->write)
              return file->f_op->write(file, p, count, pos);
          else if (file->f_op->write_iter)
              return new_sync_write(file, p, count, pos);
          else
              return -EINVAL;
      }
      EXPORT_SYMBOL(__vfs_write);

  11. static ssize_t console_write(struct file * filp, const char __user * buf,
                                   size_t size, loff_t * offset)
      {
          char * tmp_buf = kmalloc(size, GFP_KERNEL);

          if (!tmp_buf) {
              return -ENOMEM;
          }
          if (copy_from_user(tmp_buf, buf, size)) {
              kfree(tmp_buf);
              return -EFAULT;
          }
          kfree(tmp_buf);
          return size;
      }

      static struct file_operations cons_fops = {
          .read  = console_read,
          .write = console_write,
      };

  12. int 0x80
      • Old style system call invocation
        - Vectors into kernel through the IDT
        - Special interrupt (128) only used for system calls
      • IDT switches CPU to kernel mode
        - Changes CS segment to kernel CS segment
          Hard coded as __KERNEL_CS
        - Switches to kernel stack
      • IRQ handler inspects register contents for syscall # and arguments
        - System call index goes in %eax
      • Syscall handler invoked from syscall table
        - Like how IRQ handlers are invoked

  13. SYSENTER
      • More modern approach to syscall invocation
        - Allows OS to configure a syscall execution context
        - Configured via writes to hardware MSRs
        - Achieves same effect as an IRQ handler, but faster
      • Configured at boot time on each CPU
        - SYSENTER_CS_MSR: stores kernel code segment
        - SYSENTER_EIP_MSR: address of code to handle system calls
        - SYSENTER_ESP_MSR: kernel mode stack pointer
      • Application issues the sysenter instruction
        - Instantiates system call context
      • After system call, control returned to process with the sysexit instruction

  14. SYSENTER/SYSEXIT
      [Diagrams: SYSENTER operation and SYSEXIT operation]

  15. SYSCALL
      • Long mode version of SYSENTER
        - Separate set of MSRs for 64 bit mode
        - Assumes flat memory model (no segments)
      • Configured at boot time on each CPU
        - SYSCALL_STAR_MSR: stores code segment information
        - SYSCALL_LSTAR_MSR: stores 64 bit instruction pointer
        - SYSCALL_FMASK_MSR: masks for setting rflags values
      • Application issues the syscall instruction
        - Instantiates system call context
      • After system call, control returned to process with the sysret instruction

  16. SYSCALL/SYSRET
      [Diagrams: SYSCALL operation and SYSRET operation]

  17. System call optimizations
      • System calls can be invoked in multiple ways
        - Which one should a program use?
        - Do you need to support all options at compile time?
      • System calls add overhead
        - Kernel <-> user mode switches are expensive
      • Some system calls are pretty simple and don't modify state
        - E.g. getpid(), gettimeofday(), etc.
      • What if we can handle a syscall without invoking the kernel?

  18. VDSO
      • Kernel provided dynamic library for making system calls
        - Mapped into the address space of each process
        - Links with the standard C library
        - Automatically uses the optimal system call mechanism
      • Also provides optimized user space system calls
        - System calls executed without invoking kernel mode
        - __vdso_clock_gettime, __vdso_getcpu, __vdso_gettimeofday, __vdso_time

  19. [Diagram: the /bin/ls process address space (code, data, heap, stack)
      with libc.so mapped in; fread() in /lib/libc.so.6 calls read(), which
      goes through the VDSO and enters the Linux kernel via int 0x80,
      sysenter, or syscall]

  20. Kernel Environment
      • The kernel is a C program
        - Compiled instructions collected in a single binary
        - Linked and loaded similar to a regular program
          By the boot loader, not the OS
      • Kernel executes in its own virtual address space
        - This virtual address space is independent from process address spaces
        - They do not intersect
        - Allows kernel and processes to coexist in the same virtual address space

  21. Memory layout: traditional Unix (32 bit)
      • Program contents on the bottom
      • Kernel memory is on top (invisible to user code)
      • Dynamic memory is in the middle
        - Heap grows up (the brk ptr marks its top)
        - Stack grows down
      [Diagram, top to bottom: kernel virtual memory, stack, memory mapped
      region for shared libraries, run-time heap (via malloc), uninitialized
      data (.bss), initialized data (.data), program text (.text), address 0]

  22. Memory layout: modern Linux (64 bit)
      • Many more addresses
      • Kernel is no longer the top 1GB
        - Sparsely mapped in at various addresses
        - Memory mapped devices
      • Balancing address use between stack and heap is no longer an issue
        - Heap allocated using mmap()s
        - brk can still be used
      • VDSO region
        - User executable kernel code
        - User accessible kernel data: current time, plus much more
      [Diagram, top to bottom: kernel virtual memory, kernel physical memory,
      memory mapped devices, stack, VDSO, memory mapped region for shared
      libraries, run-time heaps (via malloc), uninitialized data (.bss),
      initialized data (.data), program text (.text), address 0]

  23. Memory management
      • Address space of a process is virtual memory
        - What the process sees
      • Virtual memory may or may not be backed by physical memory
        - Actual byte addressable memory devices on the motherboard (DRAM, NVM, etc.)
      • OS managed mapping of virtual memory to physical memory
        - Memory grouped together as pages: typically 4KB of physically contiguous memory
        - OS allocates pages for each process
        - OS maps allocated pages into the virtual address space of each process
      • OS tracks current mapping of all processes
        - What memory is assigned to whom
      • OS can change the mapping at any time
        - Move memory around
        - Move memory to disk (swapping)

  24. Kernel layout

  25. Physical Address Layout
      • BIOS loads the boot loader from the startup disk
      • Boot loader copies the Linux kernel from the root partition
        to the 1MB boundary

  26. Virtual Address Layouts (32 bit)
      [Diagram showing the 3 GB (0xc0000000) user/kernel split and the
      16 MB (0x01000000) kernel mapping]

  27. Virtual Address Layout (64 bit)
      Process: cat /proc/self/maps

      00400000-0040c000 r-xp 00000000 fd:00 1189777    /usr/bin/cat
      0060b000-0060c000 r--p 0000b000 fd:00 1189777    /usr/bin/cat
      0060c000-0060d000 rw-p 0000c000 fd:00 1189777    /usr/bin/cat
      01a26000-01a47000 rw-p 00000000 00:00 0          [heap]
      3dd8600000-3dd8620000 r-xp 00000000 fd:00 1179937  /usr/lib64/ld-2.18.so
      3dd881f000-3dd8820000 r--p 0001f000 fd:00 1179937  /usr/lib64/ld-2.18.so
      3dd8820000-3dd8821000 rw-p 00020000 fd:00 1179937  /usr/lib64/ld-2.18.so
      3dd8821000-3dd8822000 rw-p 00000000 00:00 0
      3dd8e00000-3dd8fb4000 r-xp 00000000 fd:00 1179948  /usr/lib64/libc-2.18.so
      3dd8fb4000-3dd91b3000 ---p 001b4000 fd:00 1179948  /usr/lib64/libc-2.18.so
      3dd91b3000-3dd91b7000 r--p 001b3000 fd:00 1179948  /usr/lib64/libc-2.18.so
      3dd91b7000-3dd91b9000 rw-p 001b7000 fd:00 1179948  /usr/lib64/libc-2.18.so
      3dd91b9000-3dd91be000 rw-p 00000000 00:00 0
      7f3b66ba0000-7f3b6d0c9000 r--p 00000000 fd:00 1183411  /usr/lib/locale/locale-archive
      7f3b6d0c9000-7f3b6d0cc000 rw-p 00000000 00:00 0
      7f3b6d0e6000-7f3b6d0e7000 rw-p 00000000 00:00 0
      7ffffed24000-7ffffed45000 rw-p 00000000 00:00 0    [stack]
      7ffffedb3000-7ffffedb5000 r--p 00000000 00:00 0    [vvar]
      7ffffedb5000-7ffffedb7000 r-xp 00000000 00:00 0    [vdso]
      ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

  28. Virtual Address Layout (64 bit)

      Start addr       | Offset    | End addr         | Size    | VM area description
      -----------------|-----------|------------------|---------|----------------------------------------------------------
      0000000000000000 | 0         | 00007fffffffffff | 128 TB  | user-space virtual memory, different per mm
      0000800000000000 | +128 TB   | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                       |           |                  |         | virtual memory addresses up to the -128 TB
                       |           |                  |         | starting offset of kernel mappings
      -----------------|-----------|------------------|---------|----------------------------------------------------------
      Kernel-space virtual memory, shared between all processes:
      -----------------|-----------|------------------|---------|----------------------------------------------------------
      ffff800000000000 | -128 TB   | ffff87ffffffffff | 8 TB    | ... guard hole, also reserved for hypervisor
      ffff880000000000 | -120 TB   | ffff887fffffffff | 0.5 TB  | LDT remap for PTI
      ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB   | direct mapping of all physical memory (page_offset_base)
      ffffc88000000000 | -55.5 TB  | ffffc8ffffffffff | 0.5 TB  | ... unused hole
      ffffc90000000000 | -55 TB    | ffffe8ffffffffff | 32 TB   | vmalloc/ioremap space (vmalloc_base)
      ffffe90000000000 | -23 TB    | ffffe9ffffffffff | 1 TB    | ... unused hole
      ffffea0000000000 | -22 TB    | ffffeaffffffffff | 1 TB    | virtual memory map (vmemmap_base)
      ffffeb0000000000 | -21 TB    | ffffebffffffffff | 1 TB    | ... unused hole
      ffffec0000000000 | -20 TB    | fffffbffffffffff | 16 TB   | KASAN shadow memory
      -----------------|-----------|------------------|---------|----------------------------------------------------------
      Identical layout to the 56-bit one from here on:
      -----------------|-----------|------------------|---------|----------------------------------------------------------
      fffffc0000000000 | -4 TB     | fffffdffffffffff | 2 TB    | ... unused hole
                       |           |                  |         | vaddr_end for KASLR
      fffffe0000000000 | -2 TB     | fffffe7fffffffff | 0.5 TB  | cpu_entry_area mapping
      fffffe8000000000 | -1.5 TB   | fffffeffffffffff | 0.5 TB  | ... unused hole
      ffffff0000000000 | -1 TB     | ffffff7fffffffff | 0.5 TB  | %esp fixup stacks
      ffffff8000000000 | -512 GB   | ffffffeeffffffff | 444 GB  | ... unused hole
      ffffffef00000000 | -68 GB    | fffffffeffffffff | 64 GB   | EFI region mapping space
      ffffffff00000000 | -4 GB     | ffffffff7fffffff | 2 GB    | ... unused hole
      ffffffff80000000 | -2 GB     | ffffffff9fffffff | 512 MB  | kernel text mapping, mapped to physical address 0
      ffffffffa0000000 | -1536 MB  | fffffffffeffffff | 1520 MB | module mapping space
      ffffffffff000000 | -16 MB    |                  |         | FIXADDR_START
      FIXADDR_START    | ~-11 MB   | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
      ffffffffff600000 | -10 MB    | ffffffffff600fff | 4 kB    | legacy vsyscall ABI
      ffffffffffe00000 | -2 MB     | ffffffffffffffff | 2 MB    | ... unused hole

      Both are contiguous ranges starting at physical address 0

  29. Kernel System.map

      Boot loader jumps here:
      0000000001000000 A phys_startup_64
      ffffffff81000000 T _text
      ffffffff81000000 T startup_64
      ffffffff81000110 T secondary_startup_64
      ffffffff810001b0 T start_cpu0

      Kernel initialization:
      ffffffff810b57f0 T vprintk
      ffffffff8118e650 T kfree
      ffffffff8118f780 T __kmalloc
      ffffffff8130a2a0 T memset
      ffffffff81309ff0 T memcpy

  30. Spectre/Meltdown
      • Kernel used to share a virtual address space with each process
        - Present in each process address space
        - Only accessible if hardware was in kernel mode
          Protected by page table HW
        - Allowed system calls to be made without switching page tables
          Performance optimization (just increase privilege level)
      • Spectre/Meltdown changed that
        - Allowed hardware to speculatively access kernel memory
        - Result of access could be read via a cache side channel
        - Location of access could be controlled by attacker
      • Mitigations: kernel and processes are no longer mapped into the
        same page tables
      • Effect: lots of stuff you read is no longer accurate

  31. Linked Lists

  32. structs and memory layout
      [Diagram: three fox structs, each with embedded list.next/list.prev
      fields linking it to its neighbors]

  33. Linked lists in Linux
      [Diagram: three fox structs, each embedding a list node
      (list { .next, .prev }) chained to the others]

  34. What about types?
      • list_entry() calculates a pointer to the containing struct

      struct list_head fox_list;
      struct fox * fox_ptr = list_entry(fox_list.next, struct fox, node);

  35. List access methods

      struct list_head some_list;

      list_add(struct list_head * new_entry, struct list_head * list);
      list_del(struct list_head * entry_to_remove);

      struct type * ptr;
      list_for_each_entry(ptr, &some_list, node) {
          ...
      }

      struct type * ptr, * tmp_ptr;
      list_for_each_entry_safe(ptr, tmp_ptr, &some_list, node) {
          list_del(&ptr->node);
          kfree(ptr);
      }
