Embodied AI — AI glossary

Embodied AI refers to AI systems that interact with the physical world through a body with sensors and actuators, rather than operating purely in software. For operators running local AI, the term arises when deploying models on robots, drones, or edge devices that must process real-time sensor data (cameras, LIDAR, microphones) and generate motor commands or other physical actions. The key constraint is latency: for closed-loop control, inference has to finish within the control period, typically tens of milliseconds or less. Meeting that budget often requires quantized models (e.g., Q4 or Q8) running on integrated-GPU hardware such as the Jetson Orin or Apple M-series chips. Memory and power budgets are tight, and on these platforms the GPU shares memory with the CPU rather than having dedicated VRAM, so model size is tuned to fit the hardware and batch size is normally kept at 1 to minimize latency.
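The latency constraint can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 30 Hz rate, and the per-stage timings are assumptions chosen for the example, not measurements from any device.

```python
def frame_budget_ms(rate_hz: float) -> float:
    """Time available per control cycle, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


def fits_budget(stage_ms: dict[str, float], rate_hz: float) -> tuple[bool, float]:
    """Return (fits, slack_ms) for a set of per-stage latencies."""
    budget = frame_budget_ms(rate_hz)
    slack = budget - sum(stage_ms.values())
    return slack >= 0.0, slack


# Hypothetical per-stage timings for one cycle (illustrative, not measured).
stages = {"preprocess": 4.0, "inference": 18.0, "postprocess": 3.0, "actuation": 2.0}
ok, slack = fits_budget(stages, rate_hz=30.0)
print(f"budget={frame_budget_ms(30.0):.1f} ms, fits={ok}, slack={slack:.1f} ms")
```

In practice the inputs would come from profiling on the target device, and the worst-case latency matters more than the average.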

Deeper dive

Embodied AI contrasts with disembodied AI (e.g., chatbots or image generators), which processes text or images without physical interaction. The embodiment can be a robotic arm, a legged robot, a drone, or even a smartphone with sensors. The AI model typically runs a perception-action loop: sense (e.g., camera frame) -> infer (e.g., object detection, path planning) -> act (e.g., motor torque). Because each action changes what the sensors see next, the loop imposes strict real-time requirements. For local AI operators, a common stack is ROS 2 for messaging with ONNX Runtime or TensorRT for inference on edge hardware. Quantization (e.g., INT8) and model pruning are standard techniques for meeting latency targets. A well-known embodied AI model is RT-2 (Robotic Transformer 2) from Google DeepMind, a vision-language-action model. Its weights have not been publicly released and its published variants have billions of parameters, so it is not something an operator runs locally on a Jetson Orin; in the original work it was served from cloud accelerators at only a few hertz. Operators who want a locally runnable model of this kind generally look to openly released alternatives such as OpenVLA. The field also includes sim-to-real transfer, where models trained in simulation (e.g., Isaac Sim) are deployed on real hardware.
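The sense -> infer -> act loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production control loop: `run_loop` and the three stand-in functions are hypothetical names, and a real robot would use its middleware's timers (for example, ROS 2 timers) instead of `time.sleep`.

```python
import time
from typing import Callable


def run_loop(sense: Callable[[], float],
             infer: Callable[[float], float],
             act: Callable[[float], None],
             rate_hz: float,
             cycles: int) -> int:
    """Run sense -> infer -> act at a fixed rate; return the number of missed deadlines."""
    period = 1.0 / rate_hz
    missed = 0
    next_deadline = time.monotonic() + period
    for _ in range(cycles):
        observation = sense()
        command = infer(observation)
        act(command)
        now = time.monotonic()
        if now > next_deadline:
            missed += 1                      # cycle overran its time slot
            next_deadline = now + period     # resynchronise instead of trying to catch up
        else:
            time.sleep(next_deadline - now)  # idle until the next slot
            next_deadline += period
    return missed


# Hypothetical stand-ins for a range sensor, a policy, and a motor driver.
commands: list[float] = []


def read_distance_m() -> float:
    return 0.5


def slow_down_policy(distance_m: float) -> float:
    return min(1.0, distance_m / 2.0)  # speed as a fraction of maximum


missed = run_loop(read_distance_m, slow_down_policy, commands.append,
                  rate_hz=100.0, cycles=5)
print(commands, missed)
```

Counting missed deadlines is a deliberate choice here: on a real system that counter is a cheap signal that the model is too slow for the chosen control rate.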

Practical example

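An operator deploying a mobile robot with a Jetson Orin NX 16GB runs a quantized YOLOv8n (INT8) for object detection at 30 FPS and a small policy network (e.g., 1M parameters) for collision avoidance. Total GPU memory usage is ~2 GB of the module's 16 GB of shared memory, leaving room for sensor processing. To keep up with the camera, the whole per-frame pipeline must finish in under 33 ms (1/30 s). Switching to a vision-language-action model in the class of RT-2 (billions of parameters) changes the picture: the operator would need an openly available model and 4-bit quantization to fit the weights in memory, and would likely have to accept a control rate of a few hertz at best. Because the Jetson's CPU and GPU share the same RAM, there is no separate system memory to offload layers to.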
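The sizing argument comes down to multiplying parameter count by bits per weight. The sketch below is a rough estimate for weights only; it ignores activations, caches, and runtime overhead, and the parameter counts in the table are assumptions picked for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts below are assumptions for illustration.
models = {
    "policy net (1M, INT8)": (1e6, 8),
    "detector (3M, INT8)": (3e6, 8),
    "5B model (4-bit)": (5e9, 4),
    "55B model (4-bit)": (55e9, 4),
}
for name, (params, bits) in models.items():
    print(f"{name}: {weight_memory_gb(params, bits):.3f} GB")
```

Under these assumptions the small networks are negligible, a 5B model at 4 bits needs about 2.5 GB for weights alone, and a 55B model at 4 bits would not fit in 16 GB at all.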

Workflow example

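In a typical workflow, the operator first trains a policy in simulation (e.g., using RLlib or Isaac Gym). They then export the model to ONNX and quantize it, either with ONNX Runtime's quantization tooling or by letting TensorRT calibrate INT8 when it builds the engine. On the robot, a ROS 2 node loads the resulting model (through TensorRT directly or through ONNX Runtime's TensorRT execution provider), subscribes to the camera topics, and publishes motor commands at 20 Hz. For a language-guided robot using llama.cpp, the operator would quantize a small LLM (e.g., Phi-3-mini) to Q4_K_M and run it on the Jetson with a short context (e.g., 512 tokens) to limit prompt-processing time and memory. Even so, a model of this size is unlikely to produce a full response within 100 ms on this hardware, so the LLM usually sits outside the 20 Hz loop and supplies high-level goals that the fast policy carries out.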
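To show what the quantization step does to the weights, here is a toy version of symmetric per-tensor INT8 quantization in plain Python. It is a sketch of the arithmetic only: tools such as ONNX Runtime and TensorRT use calibration data and often per-channel scales, and the function names and weight values here are made up for illustration.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.03, 1.00]  # toy values, not from a real model
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. Whether that error is acceptable for a given policy has to be checked on task metrics, not assumed.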