Operating System

OS history
- Phase 0
- Phase 1
- Phase 2
- Phase 3
- Phase 4
- Future of OSes
Introduction
OS - hardware interface
OS organization
- Traps
- Controlling I/O devices
- Initializing Traps/Interrupts
- Trap vs Interrupts
- Intro to Processes
- Process management
OS Organization & intro into processes
- Process
- Paging example
- Process State Machine
- Process control block
Scheduling
- context switch
- Metric
- Preemptive scheduling
- Priority scheduling
- realtime scheduling
- Priority inversion
IPC
- Message passing
- Shared memory
- sockets
Threads
- Threads
- Two abstractions
- implementing threads
- cooperating processes
- Kernel threads vs User threads
synchronization
- shared memory synchronization
- atomic operations
- Using lock
- Multiprocessor memory model
- Blocking / not Blocking
Semaphores
- bounded buffer
Conditional variable
monitors
Invariants
locks
Memory Management
- Mapping Address
Filesystem(linear array of bytes)
- allocating blocks to file
- Links
- Fault tolerance
Flash memory
Distributed system
- Cluster
- Map-reduce
Security
Virtual machine

OS history

Phase 0

Pre-OS -1960

Phase 1

Batch-based, debug with core dump.

Spooling - Simultaneous peripheral operation online

Multiprograming - more and more memory available - OS resident to coordinate activities

Phase 2

Timesharing : 1962 CTSS from MIT - Later MULTICS
- Terminals are cheap
  - let all users interact with system at once
  - process switching more frequent
  - debuging gets a lot easier
- New OS services
- Problems: response time and thrashing
Timer interrupts : pre-emptive multitasking

Phase 3

Personal computer
- One user per machine
- Initial similar to old batch

Phase 4

multi-computer per person
- embedded
- mobile
- cloud

Future of OSes

Very small
Very large

Introduction

What is an Operating system

Interface between yser and hardwawre
manages sshared resources
Provides common services
Goals:
- convenient to use
- efficient
OS design reacts to hardware changes
OS vs. kernel

Characteristics of modern OSes

Enormous
Complex
Poorly understood

Modern OS functionality

Provides a virtual processor for application code
Concurrency
Manage I/O devices
Memory management
Files
Networks
Security
…

Coordination

Concurrency
I/O devices
Memory
File
Networks
Security

OS - hardware interface

Generic Computer Architecture

CPU
Memory
System bus
Bus adapter
Device controller
Important:
- interrupts and traps
- m-mapped i/o
- VA-PA translation
- DMA

Hardware Res	OS abstraction
CPU	Processes / Threads
Main mem	Address spaces
Disk	File system
Network	Sockets / ports
Devices	Virtual devices

OS services	Hardware support
Isolation bet proc	Kernel/user mode
	Protected instruction
	MMU
Virtual memory	MMU/TLB
syscall	traps
Async I/O	Interrupts/DMA
CPU Scheduling	Timer interrupts
Synchronization	Atomic instruction

Typical OS structure

User-level
- app
- lib
Kernel-level
- protected mode
- a way to switch back and fro from user mode
Hardware-level
- device exports known interface
- device writes to devices registers
Virtual address space
parts of the address space
- code
- data/bss
- heap
- stack
- I/O
MMU

Kernel Mode vs. User mode

Kernel mode instruction
- D-I/O
- VA/PA mapping
- en/dis interrupts

Syscalls

Arguments in registers
raised privilege but under control of other kernel code
OS returns
- err code
- result in user buffer
- should always check
Dialogue between user-mod and kernel should be semantically simple
- use, implement, easier to make efficient

OS organization

Kernel
- not a process
- more time passive
- privileged routines to intermediate hardware access/cross-process operations
- make elevated privilege safe
- definition varies
to start executing in kernel mode
- during booting(down privilege after)
- trap(system call)
- interrupt

Traps

Architecture detects special events - syscall
when processor detects condition
- save min CPU state
- switch to kernel mode
- transfer control to trap handler
- find in trap table
- jumps to address in trap table
- handler saves more state, may disable interrupts
- IRET/SYSRET

Controlling I/O devices

Hardware is controlled using device registers
- MMIO
- Programmed IO
DMA
Device signaling: Polling vs Interrupts
- I/O concurrent with main processor
- interrupt raises signal on CPU pin
- interrupts can cost performance: flush pipeline + cache/TLB misses
Interrupt overload:
- buffering, adapt between polling and interrupts
external interrupt controller
interrupts using kernel stack
- userspace stack (might raise page faults - recursive trap)
interrupt handler
- perform all work associated with device
- dont disable very long 100us
interrupts can be hard to get right
- concurrency is hard
- standard debugging may not work

Initializing Traps/Interrupts

Vectors pinned at known physical address
initialized during boot process

Trap vs Interrupts

Traps are synchronous
- generated inside the executing process
- cannot be masked
Interrupts are asynchronous
- generated from outside
- can be masked

Intro to Processes

process : execution context of running program
- an instance of a program
- everything happens in either OS or in process
spatial protection
- memory protection
- privileged instructions
temporal protection
- timer interrupts
traps allow user process to break through protection barrier
- OS controls entry points

Process management

OS
- create
- delete
- suspend
- resume
- scheduling
- inter-process comm & sync
- allocate resource
Process
- signals
- sockets
- pipes
- files
- synchronization for resources modified by multiple processes
Space:
- Stack
- (EMPTY)
- Heap
- uninitialized data(bss seg)
- static data(data seg)
- code(text seg)
Pointers:
- PC: current instruction
- HP: top of heap
- SP: bottom of stack
Thread:
- execution context
- share states
- don’t own address spaces

OS Organization & intro into processes

idle state/halt code
turn-off cpu, till interrupts

Process

Process control block

memory state
processor state
kernel state
address space
special pointers: PC,HP,SP

Paging example

virtual memory into 4k pages, into page table
- physical page
- valid
- read-only
- kernel
cr3 register points to the page table
page fault
- crash
- fetch page
- allocating new pages(lazy,init when accessed)

Process State Machine

New
Ready*
Running*: schedule/deschedule
Waiting*: block on timer. IO, page fault …
Terminated

Process control block

One per process, allocated in kernel memory
Tracks state of a process
- process state
- PID
- machine state: PC,SP,registers
- Memory management info
- Open file table
- Queue pointers(waiting, IO, sibling, parent … )
- Scheduling info
when process created
- allocate, initialize, put on ready queue
when terminates
- deallocate and process state clean up
A process is either running or on the ready queue, on a single wait queue
context switch on the order of us
fork/exec vs CreateProcess
- fork - returns twice
- exit - does not return
- exec - usually does not return; overwrites current process
fork bomb - ulimit - prevents having too many children
kill another process (kill(pid,SIGKILL))
copy on write
pid_t wait(int* status)
Orphans and zombies
- zombie: dead process with uncollected status
- orphan: parent died first
- kernel reparenting to init(who waits on every child, daemons)
Some portion of Kernel address space is statically mapped so that some PA are easy to calculate
Page tables are hard to swap out - can’t easily be relocated

Scheduling

multiprogramming
dispatching: context switch mechanism
- dispatcher called on external/internal events
scheduling: policy that chooses waht to run next

context switch

store cpu state
load most of CPU state
setup pseudo-exception stack
return from exception

Metric

CPU util
throughput
response time - time till first response
wait time - time spent in ready queue
variance and outliers
fairness
multiple users
distribution related

Preemptive scheduling

scheduling quantum
OS can take away CPU from process w/o warning
no visible impact of preemption
stronger fairness guarantee
synchronization :lock up…
increase scheduling overhead
FCFS: non-preemptive
- simple, somewhat fair
- short job stuck behind long jobs
- CPU bound keep IO bound from IO
- Time: arrival time
- Length: length (next chunk of work)
Round-robin: preemptive analog of FCFS
- blocks or quantum expires
- average waiting time is bad
Shortest Job first: SJF
- good throughput, not fair
- optimal with respect to waiting time
- long jobs starve, require advance knowledge of execution time
- Preemptive: Shortest remaining time first

Priority scheduling

run the highest priority runnabe process
nice value (-20~19) from user
hard on shared machine
starvation prone, aging can help
- raise process priority as it sits runnable
Multilevel feedback queue
- lower priority but increase quantum

realtime scheduling

hard realtime: must meet deadline
soft realtime: cause glitches
deterministic scheduling

Priority inversion

high tries to access resources locked by low
increase low priority

IPC

Message passing

msgsnd()
msdrecv()
using pid

Shared memory

map files

Processes directly access same memory

context switching & large message

cache problem

sockets

network

Threads

Processes are heavyweight
- can have multiple threads
- mem mapping, expensive to swap
- Cache/TLB state: flushing expensive, especially side effects
- lot of kernel state
- context switching between processes is expensive
threads are lightweight
- logic flows
- share states
  - same address space
  - same open file/socket tables
- context switching between threads cheaper
- interaction between threads easier/cheaper
- not all OSes support threads efficiently

Threads

speedup on a multicore
creating a server with a lot of internal concurrency
independent computations within an address space
hardest debugging programs ever
usually shared memory

Two abstractions

shared resources
execution state

implementing threads

address space shared by threads
- Code
- Data & heap
thread private state
- register
- stack
shared states
- read-only
- writable -> hard
- synchronization and concurrency control
Context switch between threads
- no address space switch
thread termination can be very tricky

cooperating processes

parallelism
separating activities
fault isolation
new failure modes introduced -> concurrency
harder to debug
cleanup complicated
multiprogramming vs. multithreading
higher overhead but greater isolation
faster context switches and tighter integration

Kernel threads vs User threads

kernel threads
- Os supports
- separate OS state
- schedule threads not processes
User threads
- If OS not directly support threads
- user-level thread library implements threads
- context switch entirely within user space
- no parallelism on multicore
event-based programming , UI networking

synchronization

cache, dram, numa, QPI

how to cooperate, too much milk

concurrency control
using atomic operations to eliminate race conditions(mem, device,file/os resources)
critical section: need mutual exclusion(atomically), using lock

shared memory synchronization

threads shared memory
preemptive thread scheduling: non-determinism
context switch at anytime
race condition: processes run in parallel depends on order in which that they are executed
conflicting access to a resource
force one to wait
need to pretect all possible locations where might conflict
- not interfere
- ordering

atomic operations

fetch_and_add, cmp_and_swap
series of operations that cannot be interrupted
root of most synchronization solutions
processor has to support some
OS uses them to build more sophisticated atomics

Using lock

lock,unlock

effect
eliminate race condition
increases overhead
restricts concurrency
other synchronization: transaction, semaphores, monitors
only one thread
progress: dont wait for available rock
bounded waiting
- no thread needs to wait forever
peterson’s algorithm, (flag[2], turn)
- set own flag, own turn
- check other flag own turn
- any uniprocessor
- relies atomicity of mem ops
- hard to have more than 2 threads,lambert’s backery
- doesn’t work on many multicore(not reordeting stores, non trivial and non portable)
99.9% don’t write synchronization operation yourself
Critical section: section of code, only one process/thread executing
- mutex
- lack of deadlock(ensure progress)
- bounded waiting(ensure fairness)
- entry code(preamble)
- CS
- exit code(post script)
lock example
- spinlock
- arm: load & and store exclusive, (load-link, store-conditional)

Multiprocessor memory model

uniprocessor memory is simple
- cache is transparent
sequential consistency
- result of exexution is as if were executed in sequential order operations of each individual proccesor in the order of the program
- reordering ll, ss, ls, sl
- Total store ordering: store buffer
memory fences(you know they know)
- loads and stores cannot be moved across mfence
- flushing the store buffer and prevent some reordering
- expensive: sfence/ifence
mfence making sure critical section not get out
free of data races
- not shared
- shared, read only
- shared, consistentcy protected by lock
racy code
- volatile
- mfence

Blocking / not Blocking

spinlock
yielding spinlock
blocking locks
hybrid solution
test-and-set(ship value around, writes will cause to ship buffer)
compare-and-swap(x86 not drop write)
fetch-and-add(counter)
load-linked, store-conditional(transaction-like)
where lock resides
granularity of locking

Semaphores

wait for event,reader/writer

counter that processes or threads can manipulate automatically

P() down() wait(): till counter>0, then automatically decrement it
V() up() signal() post(): atomically increment counter
counter represent number of available resources
complex arrangement, awkward

bounded buffer

producer
consumer
concurrent threads adding/removing from buffer
writer acquire all locks

Conditional variable

wait(): release lock and go to sleep, auto acquire lock when wakes up
signal(): wake up threads on event
broadcast(): wake up all threads on event

monitors

collection of data items
a lock
zero or more condition variables
cannot forget lock acquire/release
groups a lock with data and code
answers a lot of mundane questions: how many locks/when to acquire/release
works for almost every problem
CV/lock atomicity issue
pthreads: semaphore/CV
monitors and CV are a good framework for solving many kinds of sync problems
mutex : locks
blocks threads for events: CVs semaphores
always grab lock on entry and release on exit
guard wait in while not if

Invariants

something taht has to be true for your program to be correct
any part of your program that chagnes a variable or data structure must ensure that the invariant holds
parts of your program that use the data structure may assume that the invariant holds
must hold whenever the lock is not held
use assertions to make sure invariants hold

locks

dining philosophers problem
flips the order of grabing the fork(no cycle, forcing resolve dispute first)
resources held = resources needs
deadlock: two or more threads are waiting on events that only those threads can generate
- don’t go away
- no big picture
- hard to handle automatically
livelock: system takes steps but no global progress
- goes away when load decreases
deadlock conditions
- mutual exclusion
- hold and wait
- no preemption(resource cannot be taken away)
- circular wait
ignorance
detection and recovery: figure out when deadlock has occurred
- resource allocation graph, cycle detection(own: res -> thread)
- if someone holds a resource for too long
- terminate threads, break locks. invoke exception handlers, create more resources, reboot…
avoidance: reject resource requests that might lead to deadlock
- know a list of resources/ locks that is to be required
prevention: introduce rules
- shared resources 1
- never require more than 1 res 2
- all required at once 2
- 2 phase locking 2
- free current and reacquire 2
- preemptible resources 3
- strictly order resources (acquire in increasing order) 4
- require global system knowledge 4
distributed deadlock
- hard to get a good picture of the global system
- hard to coordinate actions
- often in a state of partial failure
user mode code on linux
- helgrind, linux - check for circular waits
- solaris - strict lock ordering
- avoid starving threads

Memory Management

Code, Data, Heap, Stack
Kernel mapped into all processes(linux high)
MMU remaps virtual to physical, read-only,supervisor-only, detects accesses to unmapped regions

Mapping Address

compile time(no MMU. conflict when loading two program, and inefficient use of DRAM)
load time(OS load program with fixed address offset)
dynamic
Uniprogramming
- DOS, overlapping is low
Static Relocation
- Allocate continuous memory for process, no safety and low efficiency.
Dynamic Relocation
- Same layout, relocate address, hardware support.
- Base and bounds, segmentation(base & bound), paging(fixed size), segmented paging
- bits: valid,krnl,RO,accessed,dirty
- Page table stored in DRAM, cache recently-used page table entries
TLB - fully associative
- DTLB 1TLB STLB
- TLB reach
- TLB way, put in different bucket
- multi-level page table, 9 bits 4 level - PML4, PDPT, PD, PT
sharing memory using paging (points to the same PPN), copy-on-write, one owner
Superpage: permit pages to have variable size
- kernel, frame buffer, large data structures
- few hardwired regions/syscalls/OS watches
demand paging: present & valid bit, page out to disk
- Spatial/temporal locality
- page selection
- page replacement policy
  - Optimal: throw out page used farthest in the future
  - Random, FIFO, LRU, NRU(approx. LRU)
  - NRU: periodically clear reference bit
  - modified fifo(second chance)
  - Clock algorithm(modified second chance)
- Belady’s Anomaly: for some algorithm more pages leads to more page faults(FIFO)
- thrashing

Filesystem(linear array of bytes)

User’s viewpoint: files/directorys & operations
Physical viewpoint: sectors tracks disks
User OS Hardware
Disk: Cylinder/Track - Block/sector - 100-200 IOPS
Container of persistent data, directory system
Sequential/random
Data structure:
- Global open file table
- Per-process open file table
- Free(disk) block list
- Free inode list
- File buffer cache
- inode cache
- name cache
Disk:
- superblock
- file, file descriptor(inode)
- Directory
- free block/inode maps
Name to inode(in directory)
find -> scan through inode tree
Open -> find, check, increment open count, per-process entry, return perprocess file talke index
read,write,seed

allocating blocks to file

Contiguous, Linked, indexed, multi-level indexed
Extent: (starting block, length)

Fault tolerance

RAID: striping & redundancy 0, 1, 5(mirroring + striping + parity)
Cluster
careful dist writes & fsck: order writes that disk is recoverable
- avoid major inconsistency: point to unfree, not pointed are free, no blocks belongs to more than one file
- not a goal: never loose data
logging
- record transactions then write: sweeper process
Caching/aggregated I/O/Prefetching
Disk head scheduling, disk interleaving

Flash memory

NOR/NAND(nor -> long erase cycle)
SLC/MLC(multiple bits per cell)
- 100,000 erases
- 1,000 to 10,000 erases MLC
read/write/erase(set all to 1)
reduce wear: Flash-translation layer(write to a pre-erased page) VPN-PPN
TRIM command for flash to erase
Write amplification: blocks written by the device/host
- overprovisioning

Distributed system

NFS,printd, ntpd,DNS,kerberos,httpd ….
RPC read & write stateless
atmost once RPC

Cluster

replicating for fault tolerance
passive replication(Primary/backup, master-slave)
- slave tracks master and ready to take over
Active replication(Quorum)
- perform oerperations insync
CAP Theorem
- A system that can tolerate arbitrary network partitions can only offer either
  - consistency
  - availability

Map-reduce

map: [(k, 1) for k in line.split() for line in fileChunk]
reduce: [(k, sum(onesList)) for (k, onesList) in tempChunk]

Security

Protection: prevent accidental or intentional misuse, malicious abuse hard to eliminate
Authentication
authorization
access enforcement
basic design principle
- Least privilege
- Economy of mechanism
- Open design
- complete mediation - check every access
- permision based
- separation of privilege
- least common mechanism
- easy to use
Mandatory access control and not under user control
SELinux
- Type enforcement
- Role-based access control
- Security model is complicated and powerful

Virtual machine

Virtual Machine Monitor(hypervisor)
simulation
Idea: use HW directly, simulated system/trap calls
- VMM handles VA-PA mapping
- VMM leaves VA spaces in guest
- simulates interrupts/syscalls
paravirutalization
- VMM do direct io
- modify guest kernel to avoid traps
goals
- consolidation and load balancing
- isolation
- portability
- fault tolerance
Intel VTX: new privilege -1(hypervisor), Guest OS kernel still at 0

Operating System

OS history

Phase 0

Phase 1

Phase 2

Phase 3

Phase 4

Future of OSes

Introduction

OS - hardware interface

OS organization

Traps

Controlling I/O devices

Initializing Traps/Interrupts

Trap vs Interrupts

Intro to Processes

Process management

OS Organization & intro into processes

Process

Paging example

Process State Machine

Process control block

Scheduling

context switch

Metric

Preemptive scheduling

Priority scheduling

realtime scheduling

Priority inversion

IPC

Message passing

Shared memory

sockets

Threads

Threads

Two abstractions

implementing threads

cooperating processes

Kernel threads vs User threads

synchronization

shared memory synchronization

atomic operations

Using lock

Multiprocessor memory model

Blocking / not Blocking

Semaphores

bounded buffer

Conditional variable

monitors

Invariants

locks

Memory Management

Mapping Address

Filesystem(linear array of bytes)

allocating blocks to file

Links

Fault tolerance

Flash memory

Distributed system

Cluster

Map-reduce

Security

Virtual machine