Operating System
- OS history
- Introduction
- OS - hardware interface
- OS organization
- OS Organization & intro into processes
- Scheduling
- IPC
- Threads
- synchronization
- Semaphores
- Conditional variable
- monitors
- Invariants
- locks
- Memory Management
- Filesystem(linear array of bytes)
- Flash memory
- Distributed system
- Security
- Virtual machine
OS history
Phase 0
Pre-OS -1960
Phase 1
Batch-based, debug with core dump.
Spooling - Simultaneous peripheral operation online
- Multiprograming - more and more memory available - OS resident to coordinate activities
Phase 2
- Timesharing : 1962 CTSS from MIT - Later MULTICS
- Terminals are cheap
- let all users interact with system at once
- process switching more frequent
- debuging gets a lot easier
- New OS services
- Problems: response time and thrashing
- Terminals are cheap
- Timer interrupts : pre-emptive multitasking
Phase 3
- Personal computer
- One user per machine
- Initial similar to old batch
Phase 4
- multi-computer per person
- embedded
- mobile
- cloud
Future of OSes
- Very small
- Very large
Introduction
What is an Operating system
- Interface between yser and hardwawre
- manages sshared resources
- Provides common services
- Goals:
- convenient to use
- efficient
- OS design reacts to hardware changes
- OS vs. kernel
Characteristics of modern OSes
- Enormous
- Complex
- Poorly understood
Modern OS functionality
- Provides a virtual processor for application code
- Concurrency
- Manage I/O devices
- Memory management
- Files
- Networks
- Security
- …
Coordination
- Concurrency
- I/O devices
- Memory
- File
- Networks
- Security
OS - hardware interface
Generic Computer Architecture
- CPU
- Memory
- System bus
- Bus adapter
- Device controller
- Important:
- interrupts and traps
- m-mapped i/o
- VA-PA translation
- DMA
Hardware Res | OS abstraction |
---|---|
CPU | Processes / Threads |
Main mem | Address spaces |
Disk | File system |
Network | Sockets / ports |
Devices | Virtual devices |
OS services | Hardware support |
---|---|
Isolation bet proc | Kernel/user mode |
Protected instruction | |
MMU | |
Virtual memory | MMU/TLB |
syscall | traps |
Async I/O | Interrupts/DMA |
CPU Scheduling | Timer interrupts |
Synchronization | Atomic instruction |
Typical OS structure
- User-level
- app
- lib
- Kernel-level
- protected mode
- a way to switch back and fro from user mode
- Hardware-level
- device exports known interface
- device writes to devices registers
- Virtual address space
- parts of the address space
- code
- data/bss
- heap
- stack
- I/O
- MMU
Kernel Mode vs. User mode
- Kernel mode instruction
- D-I/O
- VA/PA mapping
- en/dis interrupts
Syscalls
- Arguments in registers
- raised privilege but under control of other kernel code
- OS returns
- err code
- result in user buffer
- should always check
- Dialogue between user-mod and kernel should be semantically simple
- use, implement, easier to make efficient
OS organization
- Kernel
- not a process
- more time passive
- privileged routines to intermediate hardware access/cross-process operations
- make elevated privilege safe
- definition varies
- to start executing in kernel mode
- during booting(down privilege after)
- trap(system call)
- interrupt
Traps
- Architecture detects special events - syscall
- when processor detects condition
- save min CPU state
- switch to kernel mode
- transfer control to trap handler
- find in trap table
- jumps to address in trap table
- handler saves more state, may disable interrupts
- IRET/SYSRET
Controlling I/O devices
- Hardware is controlled using device registers
- MMIO
- Programmed IO
- DMA
- Device signaling: Polling vs Interrupts
- I/O concurrent with main processor
- interrupt raises signal on CPU pin
- interrupts can cost performance: flush pipeline + cache/TLB misses
- Interrupt overload:
- buffering, adapt between polling and interrupts
- external interrupt controller
- interrupts using kernel stack
- userspace stack (might raise page faults - recursive trap)
- interrupt handler
- perform all work associated with device
- dont disable very long 100us
- interrupts can be hard to get right
- concurrency is hard
- standard debugging may not work
Initializing Traps/Interrupts
- Vectors pinned at known physical address
- initialized during boot process
Trap vs Interrupts
- Traps are synchronous
- generated inside the executing process
- cannot be masked
- Interrupts are asynchronous
- generated from outside
- can be masked
Intro to Processes
- process : execution context of running program
- an instance of a program
- everything happens in either OS or in process
- spatial protection
- memory protection
- privileged instructions
- temporal protection
- timer interrupts
- traps allow user process to break through protection barrier
- OS controls entry points
Process management
- OS
- create
- delete
- suspend
- resume
- scheduling
- inter-process comm & sync
- allocate resource
- Process
- signals
- sockets
- pipes
- files
- synchronization for resources modified by multiple processes
- Space:
- Stack
- (EMPTY)
- Heap
- uninitialized data(bss seg)
- static data(data seg)
- code(text seg)
- Pointers:
- PC: current instruction
- HP: top of heap
- SP: bottom of stack
- Thread:
- execution context
- share states
- don’t own address spaces
OS Organization & intro into processes
- idle state/halt code
- turn-off cpu, till interrupts
Process
Process control block
- memory state
- processor state
- kernel state
- address space
- special pointers: PC,HP,SP
Paging example
- virtual memory into 4k pages, into page table
- physical page
- valid
- read-only
- kernel
- cr3 register points to the page table
- page fault
- crash
- fetch page
- allocating new pages(lazy,init when accessed)
Process State Machine
- New
- Ready*
- Running*: schedule/deschedule
- Waiting*: block on timer. IO, page fault …
- Terminated
Process control block
- One per process, allocated in kernel memory
- Tracks state of a process
- process state
- PID
- machine state: PC,SP,registers
- Memory management info
- Open file table
- Queue pointers(waiting, IO, sibling, parent … )
- Scheduling info
- when process created
- allocate, initialize, put on ready queue
- when terminates
- deallocate and process state clean up
- A process is either running or on the ready queue, on a single wait queue
- context switch on the order of us
- fork/exec vs CreateProcess
- fork - returns twice
- exit - does not return
- exec - usually does not return; overwrites current process
- fork bomb - ulimit - prevents having too many children
- kill another process (kill(pid,SIGKILL))
- copy on write
- pid_t wait(int* status)
- Orphans and zombies
- zombie: dead process with uncollected status
- orphan: parent died first
- kernel reparenting to init(who waits on every child, daemons)
- Some portion of Kernel address space is statically mapped so that some PA are easy to calculate
- Page tables are hard to swap out - can’t easily be relocated
Scheduling
- multiprogramming
- dispatching: context switch mechanism
- dispatcher called on external/internal events
- scheduling: policy that chooses waht to run next
context switch
- store cpu state
- load most of CPU state
- setup pseudo-exception stack
- return from exception
Metric
- CPU util
- throughput
- response time - time till first response
- wait time - time spent in ready queue
- variance and outliers
- fairness
- multiple users
- distribution related
Preemptive scheduling
- scheduling quantum
- OS can take away CPU from process w/o warning
- no visible impact of preemption
- stronger fairness guarantee
- synchronization :lock up…
increase scheduling overhead
- FCFS: non-preemptive
- simple, somewhat fair
- short job stuck behind long jobs
- CPU bound keep IO bound from IO
- Time: arrival time
- Length: length (next chunk of work)
- Round-robin: preemptive analog of FCFS
- blocks or quantum expires
- average waiting time is bad
- Shortest Job first: SJF
- good throughput, not fair
- optimal with respect to waiting time
- long jobs starve, require advance knowledge of execution time
- Preemptive: Shortest remaining time first
Priority scheduling
- run the highest priority runnabe process
- nice value (-20~19) from user
- hard on shared machine
- starvation prone, aging can help
- raise process priority as it sits runnable
- Multilevel feedback queue
- lower priority but increase quantum
realtime scheduling
- hard realtime: must meet deadline
- soft realtime: cause glitches
- deterministic scheduling
Priority inversion
- high tries to access resources locked by low
- increase low priority
IPC
Message passing
- msgsnd()
- msdrecv()
- using pid
Shared memory
map files
Processes directly access same memory
context switching & large message
cache problem
sockets
network
Threads
- Processes are heavyweight
- can have multiple threads
- mem mapping, expensive to swap
- Cache/TLB state: flushing expensive, especially side effects
- lot of kernel state
- context switching between processes is expensive
- threads are lightweight
- logic flows
- share states
- same address space
- same open file/socket tables
- context switching between threads cheaper
- interaction between threads easier/cheaper
- not all OSes support threads efficiently
Threads
- speedup on a multicore
- creating a server with a lot of internal concurrency
- independent computations within an address space
- hardest debugging programs ever
- usually shared memory
Two abstractions
- shared resources
- execution state
implementing threads
- address space shared by threads
- Code
- Data & heap
- thread private state
- register
- stack
- shared states
- read-only
- writable -> hard
- synchronization and concurrency control
- Context switch between threads
- no address space switch
- thread termination can be very tricky
cooperating processes
- parallelism
- separating activities
fault isolation
- new failure modes introduced -> concurrency
- harder to debug
cleanup complicated
- multiprogramming vs. multithreading
- higher overhead but greater isolation
faster context switches and tighter integration
Kernel threads vs User threads
- kernel threads
- Os supports
- separate OS state
- schedule threads not processes
- User threads
- If OS not directly support threads
- user-level thread library implements threads
- context switch entirely within user space
- no parallelism on multicore
- event-based programming , UI networking
synchronization
cache, dram, numa, QPI
how to cooperate, too much milk
- concurrency control
- using atomic operations to eliminate race conditions(mem, device,file/os resources)
- critical section: need mutual exclusion(atomically), using lock
shared memory synchronization
- threads shared memory
- preemptive thread scheduling: non-determinism
- context switch at anytime
- race condition: processes run in parallel depends on order in which that they are executed
- conflicting access to a resource
- force one to wait
- need to pretect all possible locations where might conflict
- not interfere
- ordering
atomic operations
- fetch_and_add, cmp_and_swap
- series of operations that cannot be interrupted
- root of most synchronization solutions
- processor has to support some
- OS uses them to build more sophisticated atomics
Using lock
lock,unlock
- effect
- eliminate race condition
- increases overhead
- restricts concurrency
other synchronization: transaction, semaphores, monitors
- only one thread
- progress: dont wait for available rock
- bounded waiting
- no thread needs to wait forever
- peterson’s algorithm, (flag[2], turn)
- set own flag, own turn
- check other flag own turn
- any uniprocessor
- relies atomicity of mem ops
- hard to have more than 2 threads,lambert’s backery
- doesn’t work on many multicore(not reordeting stores, non trivial and non portable)
- 99.9% don’t write synchronization operation yourself
- Critical section: section of code, only one process/thread executing
- mutex
- lack of deadlock(ensure progress)
- bounded waiting(ensure fairness)
- entry code(preamble)
- CS
- exit code(post script)
- lock example
- spinlock
- arm: load & and store exclusive, (load-link, store-conditional)
Multiprocessor memory model
- uniprocessor memory is simple
- cache is transparent
- sequential consistency
- result of exexution is as if were executed in sequential order operations of each individual proccesor in the order of the program
- reordering ll, ss, ls, sl
- Total store ordering: store buffer
- memory fences(you know they know)
- loads and stores cannot be moved across mfence
- flushing the store buffer and prevent some reordering
- expensive: sfence/ifence
- mfence making sure critical section not get out
- free of data races
- not shared
- shared, read only
- shared, consistentcy protected by lock
- racy code
- volatile
- mfence
Blocking / not Blocking
- spinlock
- yielding spinlock
- blocking locks
hybrid solution
- test-and-set(ship value around, writes will cause to ship buffer)
- compare-and-swap(x86 not drop write)
- fetch-and-add(counter)
- load-linked, store-conditional(transaction-like)
- where lock resides
granularity of locking
Semaphores
wait for event,reader/writer
counter that processes or threads can manipulate automatically
- P() down() wait(): till counter>0, then automatically decrement it
- V() up() signal() post(): atomically increment counter
- counter represent number of available resources
- complex arrangement, awkward
bounded buffer
- producer
- consumer
- concurrent threads adding/removing from buffer
- writer acquire all locks
Conditional variable
- wait(): release lock and go to sleep, auto acquire lock when wakes up
- signal(): wake up threads on event
- broadcast(): wake up all threads on event
monitors
- collection of data items
- a lock
- zero or more condition variables
- cannot forget lock acquire/release
- groups a lock with data and code
- answers a lot of mundane questions: how many locks/when to acquire/release
- works for almost every problem
- CV/lock atomicity issue
pthreads: semaphore/CV
- monitors and CV are a good framework for solving many kinds of sync problems
- mutex : locks
- blocks threads for events: CVs semaphores
- always grab lock on entry and release on exit
guard wait in while not if
Invariants
- something taht has to be true for your program to be correct
- any part of your program that chagnes a variable or data structure must ensure that the invariant holds
- parts of your program that use the data structure may assume that the invariant holds
- must hold whenever the lock is not held
- use assertions to make sure invariants hold
locks
- dining philosophers problem
- flips the order of grabing the fork(no cycle, forcing resolve dispute first)
- resources held = resources needs
- deadlock: two or more threads are waiting on events that only those threads can generate
- don’t go away
- no big picture
- hard to handle automatically
- livelock: system takes steps but no global progress
- goes away when load decreases
- deadlock conditions
- mutual exclusion
- hold and wait
- no preemption(resource cannot be taken away)
- circular wait
- ignorance
- detection and recovery: figure out when deadlock has occurred
- resource allocation graph, cycle detection(own: res -> thread)
- if someone holds a resource for too long
- terminate threads, break locks. invoke exception handlers, create more resources, reboot…
- avoidance: reject resource requests that might lead to deadlock
- know a list of resources/ locks that is to be required
- prevention: introduce rules
- shared resources 1
- never require more than 1 res 2
- all required at once 2
- 2 phase locking 2
- free current and reacquire 2
- preemptible resources 3
- strictly order resources (acquire in increasing order) 4
- require global system knowledge 4
- distributed deadlock
- hard to get a good picture of the global system
- hard to coordinate actions
- often in a state of partial failure
- user mode code on linux
- helgrind, linux - check for circular waits
- solaris - strict lock ordering
- avoid starving threads
Memory Management
- Code, Data, Heap, Stack
- Kernel mapped into all processes(linux high)
- MMU remaps virtual to physical, read-only,supervisor-only, detects accesses to unmapped regions
Mapping Address
- compile time(no MMU. conflict when loading two program, and inefficient use of DRAM)
- load time(OS load program with fixed address offset)
dynamic
- Uniprogramming
- DOS, overlapping is low
- Static Relocation
- Allocate continuous memory for process, no safety and low efficiency.
- Dynamic Relocation
- Same layout, relocate address, hardware support.
- Base and bounds, segmentation(base & bound), paging(fixed size), segmented paging
- bits: valid,krnl,RO,accessed,dirty
- Page table stored in DRAM, cache recently-used page table entries
- TLB - fully associative
- DTLB 1TLB STLB
- TLB reach
- TLB way, put in different bucket
- multi-level page table, 9 bits 4 level - PML4, PDPT, PD, PT
- sharing memory using paging (points to the same PPN), copy-on-write, one owner
- Superpage: permit pages to have variable size
- kernel, frame buffer, large data structures
- few hardwired regions/syscalls/OS watches
- demand paging: present & valid bit, page out to disk
- Spatial/temporal locality
- page selection
- page replacement policy
- Optimal: throw out page used farthest in the future
- Random, FIFO, LRU, NRU(approx. LRU)
- NRU: periodically clear reference bit
- modified fifo(second chance)
- Clock algorithm(modified second chance)
- Belady’s Anomaly: for some algorithm more pages leads to more page faults(FIFO)
- thrashing
Filesystem(linear array of bytes)
- User’s viewpoint: files/directorys & operations
- Physical viewpoint: sectors tracks disks
User OS Hardware
- Disk: Cylinder/Track - Block/sector - 100-200 IOPS
- Container of persistent data, directory system
Sequential/random
- Data structure:
- Global open file table
- Per-process open file table
- Free(disk) block list
- Free inode list
- File buffer cache
- inode cache
- name cache
- Disk:
- superblock
- file, file descriptor(inode)
- Directory
- free block/inode maps
- Name to inode(in directory)
- find -> scan through inode tree
- Open -> find, check, increment open count, per-process entry, return perprocess file talke index
read,write,seed
allocating blocks to file
- Contiguous, Linked, indexed, multi-level indexed
- Extent: (starting block, length)
Links
- hard links:
- problem: cannot cross filesystems, loops, unreachable directorys
- solution: only file
- soft links
Fault tolerance
- RAID: striping & redundancy 0, 1, 5(mirroring + striping + parity)
- Cluster
- careful dist writes & fsck: order writes that disk is recoverable
- avoid major inconsistency: point to unfree, not pointed are free, no blocks belongs to more than one file
- not a goal: never loose data
- logging
- record transactions then write: sweeper process
- Caching/aggregated I/O/Prefetching
- Disk head scheduling, disk interleaving
Flash memory
- NOR/NAND(nor -> long erase cycle)
- SLC/MLC(multiple bits per cell)
- 100,000 erases
- 1,000 to 10,000 erases MLC
- read/write/erase(set all to 1)
- reduce wear: Flash-translation layer(write to a pre-erased page) VPN-PPN
- TRIM command for flash to erase
- Write amplification: blocks written by the device/host
- overprovisioning
Distributed system
- NFS,printd, ntpd,DNS,kerberos,httpd ….
- RPC read & write stateless
- atmost once RPC
Cluster
- replicating for fault tolerance
- passive replication(Primary/backup, master-slave)
- slave tracks master and ready to take over
- Active replication(Quorum)
- perform oerperations insync
- CAP Theorem
- A system that can tolerate arbitrary network partitions can only offer either
- consistency
- availability
- A system that can tolerate arbitrary network partitions can only offer either
Map-reduce
- map: [(k, 1) for k in line.split() for line in fileChunk]
- reduce: [(k, sum(onesList)) for (k, onesList) in tempChunk]
Security
- Protection: prevent accidental or intentional misuse, malicious abuse hard to eliminate
- Authentication
- authorization
access enforcement
- basic design principle
- Least privilege
- Economy of mechanism
- Open design
- complete mediation - check every access
- permision based
- separation of privilege
- least common mechanism
- easy to use
- Mandatory access control and not under user control
- SELinux
- Type enforcement
- Role-based access control
- Security model is complicated and powerful
Virtual machine
- Virtual Machine Monitor(hypervisor)
- simulation
- Idea: use HW directly, simulated system/trap calls
- VMM handles VA-PA mapping
- VMM leaves VA spaces in guest
- simulates interrupts/syscalls
- paravirutalization
- VMM do direct io
- modify guest kernel to avoid traps
- goals
- consolidation and load balancing
- isolation
- portability
- fault tolerance
- Intel VTX: new privilege -1(hypervisor), Guest OS kernel still at 0