Operating System

OS history

Phase 0

Pre-OS -1960

Phase 1

Batch-based, debug with core dump.

Spooling - Simultaneous peripheral operation online

  • Multiprograming - more and more memory available - OS resident to coordinate activities

Phase 2

  • Timesharing : 1962 CTSS from MIT - Later MULTICS
    • Terminals are cheap
      • let all users interact with system at once
      • process switching more frequent
      • debuging gets a lot easier
    • New OS services
    • Problems: response time and thrashing
  • Timer interrupts : pre-emptive multitasking

Phase 3

  • Personal computer
    • One user per machine
    • Initial similar to old batch

Phase 4

  • multi-computer per person
    • embedded
    • mobile
    • cloud

Future of OSes

  • Very small
  • Very large

Introduction

What is an Operating system

  • Interface between yser and hardwawre
  • manages sshared resources
  • Provides common services
  • Goals:
    • convenient to use
    • efficient
  • OS design reacts to hardware changes
  • OS vs. kernel

Characteristics of modern OSes

  • Enormous
  • Complex
  • Poorly understood

Modern OS functionality

  • Provides a virtual processor for application code
  • Concurrency
  • Manage I/O devices
  • Memory management
  • Files
  • Networks
  • Security

Coordination

  • Concurrency
  • I/O devices
  • Memory
  • File
  • Networks
  • Security

OS - hardware interface

Generic Computer Architecture

  • CPU
  • Memory
  • System bus
  • Bus adapter
  • Device controller
  • Important:
    • interrupts and traps
    • m-mapped i/o
    • VA-PA translation
    • DMA
Hardware Res OS abstraction
CPU Processes / Threads
Main mem Address spaces
Disk File system
Network Sockets / ports
Devices Virtual devices
OS services Hardware support
Isolation bet proc Kernel/user mode
Protected instruction
MMU
Virtual memory MMU/TLB
syscall traps
Async I/O Interrupts/DMA
CPU Scheduling Timer interrupts
Synchronization Atomic instruction

Typical OS structure

  • User-level
    • app
    • lib
  • Kernel-level
    • protected mode
    • a way to switch back and fro from user mode
  • Hardware-level
    • device exports known interface
    • device writes to devices registers
  • Virtual address space
  • parts of the address space
    • code
    • data/bss
    • heap
    • stack
    • I/O
  • MMU

Kernel Mode vs. User mode

  • Kernel mode instruction
    • D-I/O
    • VA/PA mapping
    • en/dis interrupts

Syscalls

  • Arguments in registers
  • raised privilege but under control of other kernel code
  • OS returns
    • err code
    • result in user buffer
    • should always check
  • Dialogue between user-mod and kernel should be semantically simple
    • use, implement, easier to make efficient

OS organization

  • Kernel
    • not a process
    • more time passive
    • privileged routines to intermediate hardware access/cross-process operations
    • make elevated privilege safe
    • definition varies
  • to start executing in kernel mode
    • during booting(down privilege after)
    • trap(system call)
    • interrupt

Traps

  • Architecture detects special events - syscall
  • when processor detects condition
    • save min CPU state
    • switch to kernel mode
    • transfer control to trap handler
    • find in trap table
    • jumps to address in trap table
    • handler saves more state, may disable interrupts
    • IRET/SYSRET

Controlling I/O devices

  • Hardware is controlled using device registers
    • MMIO
    • Programmed IO
  • DMA
  • Device signaling: Polling vs Interrupts
    • I/O concurrent with main processor
    • interrupt raises signal on CPU pin
    • interrupts can cost performance: flush pipeline + cache/TLB misses
  • Interrupt overload:
    • buffering, adapt between polling and interrupts
  • external interrupt controller
  • interrupts using kernel stack
    • userspace stack (might raise page faults - recursive trap)
  • interrupt handler
    • perform all work associated with device
    • dont disable very long 100us
  • interrupts can be hard to get right
    • concurrency is hard
    • standard debugging may not work

Initializing Traps/Interrupts

  • Vectors pinned at known physical address
  • initialized during boot process

Trap vs Interrupts

  • Traps are synchronous
    • generated inside the executing process
    • cannot be masked
  • Interrupts are asynchronous
    • generated from outside
    • can be masked

Intro to Processes

  • process : execution context of running program
    • an instance of a program
    • everything happens in either OS or in process
  • spatial protection
    • memory protection
    • privileged instructions
  • temporal protection
    • timer interrupts
  • traps allow user process to break through protection barrier
    • OS controls entry points

Process management

  • OS
    • create
    • delete
    • suspend
    • resume
    • scheduling
    • inter-process comm & sync
    • allocate resource
  • Process
    • signals
    • sockets
    • pipes
    • files
    • synchronization for resources modified by multiple processes
  • Space:
    • Stack
    • (EMPTY)
    • Heap
    • uninitialized data(bss seg)
    • static data(data seg)
    • code(text seg)
  • Pointers:
    • PC: current instruction
    • HP: top of heap
    • SP: bottom of stack
  • Thread:
    • execution context
    • share states
    • don’t own address spaces

OS Organization & intro into processes

  • idle state/halt code
  • turn-off cpu, till interrupts

Process

Process control block

  • memory state
  • processor state
  • kernel state
  • address space
  • special pointers: PC,HP,SP

Paging example

  • virtual memory into 4k pages, into page table
    • physical page
    • valid
    • read-only
    • kernel
  • cr3 register points to the page table
  • page fault
    • crash
    • fetch page
    • allocating new pages(lazy,init when accessed)

Process State Machine

  • New
  • Ready*
  • Running*: schedule/deschedule
  • Waiting*: block on timer. IO, page fault …
  • Terminated

Process control block

  • One per process, allocated in kernel memory
  • Tracks state of a process
    • process state
    • PID
    • machine state: PC,SP,registers
    • Memory management info
    • Open file table
    • Queue pointers(waiting, IO, sibling, parent … )
    • Scheduling info
  • when process created
    • allocate, initialize, put on ready queue
  • when terminates
    • deallocate and process state clean up
  • A process is either running or on the ready queue, on a single wait queue
  • context switch on the order of us
  • fork/exec vs CreateProcess
    • fork - returns twice
    • exit - does not return
    • exec - usually does not return; overwrites current process
  • fork bomb - ulimit - prevents having too many children
  • kill another process (kill(pid,SIGKILL))
  • copy on write
  • pid_t wait(int* status)
  • Orphans and zombies
    • zombie: dead process with uncollected status
    • orphan: parent died first
    • kernel reparenting to init(who waits on every child, daemons)
  • Some portion of Kernel address space is statically mapped so that some PA are easy to calculate
  • Page tables are hard to swap out - can’t easily be relocated

Scheduling

  • multiprogramming
  • dispatching: context switch mechanism
    • dispatcher called on external/internal events
  • scheduling: policy that chooses waht to run next

context switch

  • store cpu state
  • load most of CPU state
  • setup pseudo-exception stack
  • return from exception

Metric

  • CPU util
  • throughput
  • response time - time till first response
  • wait time - time spent in ready queue
  • variance and outliers
  • fairness
  • multiple users
  • distribution related

Preemptive scheduling

  • scheduling quantum
  • OS can take away CPU from process w/o warning
  • no visible impact of preemption
  • stronger fairness guarantee
  • synchronization :lock up…
  • increase scheduling overhead

  • FCFS: non-preemptive
    • simple, somewhat fair
    • short job stuck behind long jobs
    • CPU bound keep IO bound from IO
    • Time: arrival time
    • Length: length (next chunk of work)
  • Round-robin: preemptive analog of FCFS
    • blocks or quantum expires
    • average waiting time is bad
  • Shortest Job first: SJF
    • good throughput, not fair
    • optimal with respect to waiting time
    • long jobs starve, require advance knowledge of execution time
    • Preemptive: Shortest remaining time first

Priority scheduling

  • run the highest priority runnabe process
  • nice value (-20~19) from user
  • hard on shared machine
  • starvation prone, aging can help
    • raise process priority as it sits runnable
  • Multilevel feedback queue
    • lower priority but increase quantum

realtime scheduling

  • hard realtime: must meet deadline
  • soft realtime: cause glitches
  • deterministic scheduling

Priority inversion

  • high tries to access resources locked by low
  • increase low priority

IPC

Message passing

  • msgsnd()
  • msdrecv()
  • using pid

Shared memory

map files

Processes directly access same memory

context switching & large message

cache problem

sockets

network

Threads

  • Processes are heavyweight
    • can have multiple threads
    • mem mapping, expensive to swap
    • Cache/TLB state: flushing expensive, especially side effects
    • lot of kernel state
    • context switching between processes is expensive
  • threads are lightweight
    • logic flows
    • share states
      • same address space
      • same open file/socket tables
    • context switching between threads cheaper
    • interaction between threads easier/cheaper
    • not all OSes support threads efficiently

Threads

  • speedup on a multicore
  • creating a server with a lot of internal concurrency
  • independent computations within an address space
  • hardest debugging programs ever
  • usually shared memory

Two abstractions

  • shared resources
  • execution state

implementing threads

  • address space shared by threads
    • Code
    • Data & heap
  • thread private state
    • register
    • stack
  • shared states
    • read-only
    • writable -> hard
    • synchronization and concurrency control
  • Context switch between threads
    • no address space switch
  • thread termination can be very tricky

cooperating processes

  • parallelism
  • separating activities
  • fault isolation

  • new failure modes introduced -> concurrency
  • harder to debug
  • cleanup complicated

  • multiprogramming vs. multithreading
  • higher overhead but greater isolation
  • faster context switches and tighter integration

Kernel threads vs User threads

  • kernel threads
    • Os supports
    • separate OS state
    • schedule threads not processes
  • User threads
    • If OS not directly support threads
    • user-level thread library implements threads
    • context switch entirely within user space
    • no parallelism on multicore
  • event-based programming , UI networking

synchronization

cache, dram, numa, QPI

how to cooperate, too much milk

  • concurrency control
  • using atomic operations to eliminate race conditions(mem, device,file/os resources)
  • critical section: need mutual exclusion(atomically), using lock

shared memory synchronization

  • threads shared memory
  • preemptive thread scheduling: non-determinism
  • context switch at anytime
  • race condition: processes run in parallel depends on order in which that they are executed
  • conflicting access to a resource
  • force one to wait
  • need to pretect all possible locations where might conflict
    • not interfere
    • ordering

atomic operations

  • fetch_and_add, cmp_and_swap
  • series of operations that cannot be interrupted
  • root of most synchronization solutions
  • processor has to support some
  • OS uses them to build more sophisticated atomics

Using lock

lock,unlock

  • effect
  • eliminate race condition
  • increases overhead
  • restricts concurrency
  • other synchronization: transaction, semaphores, monitors

  • only one thread
  • progress: dont wait for available rock
  • bounded waiting
    • no thread needs to wait forever
  • peterson’s algorithm, (flag[2], turn)
    • set own flag, own turn
    • check other flag own turn
    • any uniprocessor
    • relies atomicity of mem ops
    • hard to have more than 2 threads,lambert’s backery
    • doesn’t work on many multicore(not reordeting stores, non trivial and non portable)
  • 99.9% don’t write synchronization operation yourself
  • Critical section: section of code, only one process/thread executing
    • mutex
    • lack of deadlock(ensure progress)
    • bounded waiting(ensure fairness)
    • entry code(preamble)
    • CS
    • exit code(post script)
  • lock example
    • spinlock
    • arm: load & and store exclusive, (load-link, store-conditional)

Multiprocessor memory model

  • uniprocessor memory is simple
    • cache is transparent
  • sequential consistency
    • result of exexution is as if were executed in sequential order operations of each individual proccesor in the order of the program
    • reordering ll, ss, ls, sl
    • Total store ordering: store buffer
  • memory fences(you know they know)
    • loads and stores cannot be moved across mfence
    • flushing the store buffer and prevent some reordering
    • expensive: sfence/ifence
  • mfence making sure critical section not get out
  • free of data races
    • not shared
    • shared, read only
    • shared, consistentcy protected by lock
  • racy code
    • volatile
    • mfence

Blocking / not Blocking

  • spinlock
  • yielding spinlock
  • blocking locks
  • hybrid solution

  • test-and-set(ship value around, writes will cause to ship buffer)
  • compare-and-swap(x86 not drop write)
  • fetch-and-add(counter)
  • load-linked, store-conditional(transaction-like)
  • where lock resides
  • granularity of locking

Semaphores

wait for event,reader/writer

counter that processes or threads can manipulate automatically

  • P() down() wait(): till counter>0, then automatically decrement it
  • V() up() signal() post(): atomically increment counter
  • counter represent number of available resources
  • complex arrangement, awkward

bounded buffer

  • producer
  • consumer
  • concurrent threads adding/removing from buffer
  • writer acquire all locks

Conditional variable

  • wait(): release lock and go to sleep, auto acquire lock when wakes up
  • signal(): wake up threads on event
  • broadcast(): wake up all threads on event

monitors

  • collection of data items
  • a lock
  • zero or more condition variables
  • cannot forget lock acquire/release
  • groups a lock with data and code
  • answers a lot of mundane questions: how many locks/when to acquire/release
  • works for almost every problem
  • CV/lock atomicity issue
  • pthreads: semaphore/CV

  • monitors and CV are a good framework for solving many kinds of sync problems
  • mutex : locks
  • blocks threads for events: CVs semaphores
  • always grab lock on entry and release on exit
  • guard wait in while not if

Invariants

  • something taht has to be true for your program to be correct
  • any part of your program that chagnes a variable or data structure must ensure that the invariant holds
  • parts of your program that use the data structure may assume that the invariant holds
  • must hold whenever the lock is not held
  • use assertions to make sure invariants hold

locks

  • dining philosophers problem
  • flips the order of grabing the fork(no cycle, forcing resolve dispute first)
  • resources held = resources needs
  • deadlock: two or more threads are waiting on events that only those threads can generate
    • don’t go away
    • no big picture
    • hard to handle automatically
  • livelock: system takes steps but no global progress
    • goes away when load decreases
  • deadlock conditions
    • mutual exclusion
    • hold and wait
    • no preemption(resource cannot be taken away)
    • circular wait
  • ignorance
  • detection and recovery: figure out when deadlock has occurred
    • resource allocation graph, cycle detection(own: res -> thread)
    • if someone holds a resource for too long
    • terminate threads, break locks. invoke exception handlers, create more resources, reboot…
  • avoidance: reject resource requests that might lead to deadlock
    • know a list of resources/ locks that is to be required
  • prevention: introduce rules
    • shared resources 1
    • never require more than 1 res 2
    • all required at once 2
    • 2 phase locking 2
    • free current and reacquire 2
    • preemptible resources 3
    • strictly order resources (acquire in increasing order) 4
    • require global system knowledge 4
  • distributed deadlock
    • hard to get a good picture of the global system
    • hard to coordinate actions
    • often in a state of partial failure
  • user mode code on linux
    • helgrind, linux - check for circular waits
    • solaris - strict lock ordering
    • avoid starving threads

Memory Management

  • Code, Data, Heap, Stack
  • Kernel mapped into all processes(linux high)
  • MMU remaps virtual to physical, read-only,supervisor-only, detects accesses to unmapped regions

Mapping Address

  • compile time(no MMU. conflict when loading two program, and inefficient use of DRAM)
  • load time(OS load program with fixed address offset)
  • dynamic

  • Uniprogramming
    • DOS, overlapping is low
  • Static Relocation
    • Allocate continuous memory for process, no safety and low efficiency.
  • Dynamic Relocation
    • Same layout, relocate address, hardware support.
    • Base and bounds, segmentation(base & bound), paging(fixed size), segmented paging
    • bits: valid,krnl,RO,accessed,dirty
    • Page table stored in DRAM, cache recently-used page table entries
  • TLB - fully associative
    • DTLB 1TLB STLB
    • TLB reach
    • TLB way, put in different bucket
    • multi-level page table, 9 bits 4 level - PML4, PDPT, PD, PT
  • sharing memory using paging (points to the same PPN), copy-on-write, one owner
  • Superpage: permit pages to have variable size
    • kernel, frame buffer, large data structures
    • few hardwired regions/syscalls/OS watches
  • demand paging: present & valid bit, page out to disk
    • Spatial/temporal locality
    • page selection
    • page replacement policy
      • Optimal: throw out page used farthest in the future
      • Random, FIFO, LRU, NRU(approx. LRU)
      • NRU: periodically clear reference bit
      • modified fifo(second chance)
      • Clock algorithm(modified second chance)
    • Belady’s Anomaly: for some algorithm more pages leads to more page faults(FIFO)
    • thrashing

Filesystem(linear array of bytes)

  • User’s viewpoint: files/directorys & operations
  • Physical viewpoint: sectors tracks disks
  • User OS Hardware

  • Disk: Cylinder/Track - Block/sector - 100-200 IOPS
  • Container of persistent data, directory system
  • Sequential/random

  • Data structure:
    • Global open file table
    • Per-process open file table
    • Free(disk) block list
    • Free inode list
    • File buffer cache
    • inode cache
    • name cache
  • Disk:
    • superblock
    • file, file descriptor(inode)
    • Directory
    • free block/inode maps
  • Name to inode(in directory)
  • find -> scan through inode tree
  • Open -> find, check, increment open count, per-process entry, return perprocess file talke index
  • read,write,seed

allocating blocks to file

  • Contiguous, Linked, indexed, multi-level indexed
  • Extent: (starting block, length)
  • hard links:
    • problem: cannot cross filesystems, loops, unreachable directorys
    • solution: only file
  • soft links

Fault tolerance

  • RAID: striping & redundancy 0, 1, 5(mirroring + striping + parity)
  • Cluster
  • careful dist writes & fsck: order writes that disk is recoverable
    • avoid major inconsistency: point to unfree, not pointed are free, no blocks belongs to more than one file
    • not a goal: never loose data
  • logging
    • record transactions then write: sweeper process
  • Caching/aggregated I/O/Prefetching
  • Disk head scheduling, disk interleaving

Flash memory

  • NOR/NAND(nor -> long erase cycle)
  • SLC/MLC(multiple bits per cell)
    • 100,000 erases
    • 1,000 to 10,000 erases MLC
  • read/write/erase(set all to 1)
  • reduce wear: Flash-translation layer(write to a pre-erased page) VPN-PPN
  • TRIM command for flash to erase
  • Write amplification: blocks written by the device/host
    • overprovisioning

Distributed system

  • NFS,printd, ntpd,DNS,kerberos,httpd ….
  • RPC read & write stateless
  • atmost once RPC

Cluster

  • replicating for fault tolerance
  • passive replication(Primary/backup, master-slave)
    • slave tracks master and ready to take over
  • Active replication(Quorum)
    • perform oerperations insync
  • CAP Theorem
    • A system that can tolerate arbitrary network partitions can only offer either
      • consistency
      • availability

Map-reduce

  • map: [(k, 1) for k in line.split() for line in fileChunk]
  • reduce: [(k, sum(onesList)) for (k, onesList) in tempChunk]

Security

  • Protection: prevent accidental or intentional misuse, malicious abuse hard to eliminate
  • Authentication
  • authorization
  • access enforcement

  • basic design principle
    • Least privilege
    • Economy of mechanism
    • Open design
    • complete mediation - check every access
    • permision based
    • separation of privilege
    • least common mechanism
    • easy to use
  • Mandatory access control and not under user control
  • SELinux
    • Type enforcement
    • Role-based access control
    • Security model is complicated and powerful

Virtual machine

  • Virtual Machine Monitor(hypervisor)
  • simulation
  • Idea: use HW directly, simulated system/trap calls
    • VMM handles VA-PA mapping
    • VMM leaves VA spaces in guest
    • simulates interrupts/syscalls
  • paravirutalization
    • VMM do direct io
    • modify guest kernel to avoid traps
  • goals
    • consolidation and load balancing
    • isolation
    • portability
    • fault tolerance
  • Intel VTX: new privilege -1(hypervisor), Guest OS kernel still at 0