Edge AI Overview — Edge AI: Mobile and IoT (Chapter 1)

Edge AI moves inference workloads from cloud servers to devices physically located near data sources. This proximity removes the network round trip from the inference path, enables offline operation, and reduces bandwidth costs. For production deployments, these factors often determine whether a system is economically viable.

The fundamental constraint driving edge AI is the compute-to-connectivity ratio. A cloud GPU cluster can deliver thousands of TOPS (tera operations per second) but requires consistent bandwidth of 100 Mbps or more and introduces 50-200 ms of round-trip latency. A Raspberry Pi 4 delivers approximately 0.4 TOPS, a fraction of cloud throughput, but operates with zero network dependency and processes data as it arrives.
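The trade-off above can be sketched as simple arithmetic: end-to-end latency is compute time plus any network round trip. This is a minimal sketch, not a benchmark. It assumes each device runs at its peak TOPS rating, and the 2 GOP workload, the 1000 TOPS cloud figure, and the function name are hypothetical values chosen for illustration; the 0.4 TOPS and 50 ms figures come from the paragraph above.

```python
def inference_latency_ms(ops_per_inference: float,
                         device_tops: float,
                         network_rtt_ms: float = 0.0) -> float:
    """Estimate latency as compute time plus network round trip.

    Assumes the device sustains its peak TOPS rating, which real
    workloads rarely achieve, so treat the result as a lower bound.
    """
    compute_ms = ops_per_inference / (device_tops * 1e12) * 1e3
    return compute_ms + network_rtt_ms


# Hypothetical workload: 2 billion operations per inference.
MODEL_OPS = 2e9

# Edge: Raspberry Pi 4 at roughly 0.4 TOPS, no network hop.
edge_ms = inference_latency_ms(MODEL_OPS, device_tops=0.4)

# Cloud: assumed 1000 TOPS cluster, best-case 50 ms round trip.
cloud_ms = inference_latency_ms(MODEL_OPS, device_tops=1000.0,
                                network_rtt_ms=50.0)

print(f"edge:  {edge_ms:.3f} ms")
print(f"cloud: {cloud_ms:.3f} ms")
```

Under these assumptions the network round trip dominates the cloud path, so the slower device still responds sooner. With a much larger model the compute term grows and the comparison can reverse.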

Three primary device categories define the edge landscape. Microcontrollers (Cortex-M class) deliver under 1 TOPS at milliwatts of power draw, which suits simple signal classification. Single-board computers like the Raspberry Pi and Jetson Nano range from roughly 0.4 TOPS (the Raspberry Pi 4) to about 10 TOPS within 5-15 W thermal envelopes. Mobile systems-on-chip in flagship smartphones reach 20-40 TOPS while managing thermal throttling.
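One way to apply these categories is to pick the smallest tier whose throughput ceiling covers a workload. The sketch below is illustrative only: the `DeviceTier` type and `smallest_tier_for` helper are hypothetical names, and the ceilings (1, 10, and 40 TOPS) are the upper bounds quoted in the paragraph above, not measured values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeviceTier:
    name: str
    max_tops: float  # upper bound quoted in the text, not a measurement


# Ordered from least to most capable.
TIERS = [
    DeviceTier("microcontroller", 1.0),
    DeviceTier("single-board computer", 10.0),
    DeviceTier("mobile SoC", 40.0),
]


def smallest_tier_for(required_tops: float) -> Optional[DeviceTier]:
    """Return the least capable tier that meets the requirement,
    or None when no edge tier in the table is sufficient."""
    for tier in TIERS:
        if required_tops <= tier.max_tops:
            return tier
    return None


for need in (0.05, 5.0, 30.0, 100.0):
    tier = smallest_tier_for(need)
    print(need, "->", tier.name if tier else "exceeds listed edge tiers")
```

A real selection would also weigh power draw and thermal envelope, which this lookup ignores.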

Real-world failure modes cluster around thermal constraints and memory bandwidth. Mobile neural processing units (NPUs) can throttle from 100% to 40% of peak throughput within 90 seconds when ambient temperature exceeds 30°C. Memory-bound models, those whose parameters exceed twice the on-chip cache capacity, can suffer 10x slowdowns as swap operations activate. Awareness of these constraints shapes every aspect of edge deployment.
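Both failure modes can be checked on paper before deployment. The sketch below assumes the 40% throttle figure and the twice-the-cache rule of thumb from the paragraph above; the function names, the 30 TOPS NPU, the 2 MiB cache, and the parameter counts are hypothetical examples.

```python
def sustained_tops(peak_tops: float, throttle_factor: float = 0.4) -> float:
    """Throughput to plan for once thermal throttling sets in.

    The default factor is the 40% figure from the text; measure the
    real value on the target device where possible.
    """
    return peak_tops * throttle_factor


def is_memory_bound(param_count: int, bytes_per_param: int,
                    cache_bytes: int) -> bool:
    """Apply the rule of thumb: parameter storage larger than twice
    the on-chip cache marks the model as memory-bound."""
    return param_count * bytes_per_param > 2 * cache_bytes


CACHE_BYTES = 2 * 1024 * 1024  # hypothetical 2 MiB on-chip cache

# Hypothetical int8 models (1 byte per parameter).
print(is_memory_bound(5_000_000, 1, CACHE_BYTES))  # 5 MB of weights
print(is_memory_bound(1_000_000, 1, CACHE_BYTES))  # 1 MB of weights

# Plan against throttled throughput, not the peak rating.
print(sustained_tops(30.0))
```

Sizing a deployment against the throttled figure rather than the peak rating leaves headroom for warm environments.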

Budget allocation for edge projects typically splits 40% compute, 30% memory/storage, 20% power delivery, and 10% thermal management. Neglecting power delivery causes brownout resets under peak inference load. Skimping on thermal management triggers throttling that undermines all performance predictions.
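As a worked example of that split, the sketch below divides a total hardware budget by the percentages given above. The `allocate_budget` helper and the 500-unit total are hypothetical; only the percentages come from the text.

```python
# Percentages from the text, expressed as fractions of the total.
BUDGET_SPLIT = {
    "compute": 0.40,
    "memory_storage": 0.30,
    "power_delivery": 0.20,
    "thermal_management": 0.10,
}


def allocate_budget(total: float) -> dict:
    """Split a total budget across the four categories,
    rounded to two decimal places."""
    return {name: round(total * share, 2)
            for name, share in BUDGET_SPLIT.items()}


# Hypothetical per-device budget of 500 currency units.
for name, amount in allocate_budget(500).items():
    print(f"{name}: {amount}")
```

The split is a starting point; a fanless enclosure or battery operation may justify moving budget toward thermal management or power delivery.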

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
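The record described above can be captured with the standard library alone. This is a minimal sketch: the `run_record` helper, the package list, the data path, and the output string are placeholders to replace with the values from your own run.

```python
import json
import platform
from importlib import metadata


def run_record(packages, data_path, observed_output):
    """Collect the details needed to reproduce a local run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "platform": platform.platform(),
        "packages": versions,
        "data_path": data_path,
        "observed_output": observed_output,
    }


# Placeholder values; substitute the packages and paths you actually used.
record = run_record(["numpy"], "data/sample.bin", "<paste output here>")
print(json.dumps(record, indent=2))
```

Saving this JSON next to the exercise notes keeps the backend and memory constraints attached to the result they explain.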