How Computer Vision Tracks Your Body: Pose Estimation Explained

March 28, 2026 · 8 min read

When you play 67 Speed, your webcam watches your arm movements and counts them in real time — no wearable sensors, no special hardware. The technology that makes this possible is called pose estimation, and it represents one of the most impressive achievements in modern computer vision.

What Is Pose Estimation?

Pose estimation is a computer vision technique that detects the position and orientation of a human body in an image or video stream. Rather than recognizing who someone is (that's facial recognition), pose estimation figures out where the body parts are — head, shoulders, elbows, wrists, hips, knees, ankles, and more.

The output of a pose estimation model is a set of keypoints (also called landmarks) — specific points on the body with x, y, and sometimes z coordinates. By connecting these keypoints, the system builds a virtual skeleton that mirrors your movements in real time.
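In code, a pose is little more than a list of landmarks plus a set of anatomical edges. Here's a minimal sketch in Python; the field names and edge list are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    name: str
    x: float                 # normalized image coordinate, 0 = left edge
    y: float                 # normalized image coordinate, 0 = top edge
    z: float = 0.0           # optional depth estimate (not all models provide it)
    visibility: float = 1.0  # model's confidence that the point is visible

# A few of the anatomical connections ("bones") used to draw the skeleton.
SKELETON_EDGES = [
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]

def skeleton_lines(keypoints):
    """Pair up detected keypoints along the predefined edges,
    skipping edges whose endpoints were not detected."""
    return [
        (keypoints[a], keypoints[b])
        for a, b in SKELETON_EDGES
        if a in keypoints and b in keypoints
    ]
```

Drawing the skeleton is then just rendering one line segment per returned pair, scaled from normalized coordinates to pixels.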

This technology powers a wide range of applications beyond gaming: physical therapy tracking, sports performance analysis, sign language translation, animation motion capture, and even security systems that detect falls in elderly care facilities.

How Pose Estimation Models Work

At a high level, modern pose estimation follows a three-step process:

  1. Person detection: The model first identifies that a human body is present in the frame and draws a bounding box around it.
  2. Keypoint localization: Within that bounding region, a neural network predicts the location of each body landmark. Most models generate a "heatmap" for each keypoint — a probability map showing where that joint is most likely located.
  3. Skeleton assembly: The detected keypoints are connected according to anatomical rules (shoulder connects to elbow, elbow connects to wrist) to form the final pose skeleton.
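Step 2 can be made concrete with a toy example: reading one keypoint out of its heatmap is, at its simplest, finding the hottest cell. A minimal sketch assuming NumPy (real models typically refine this coarse argmax with sub-pixel offsets):

```python
import numpy as np

def decode_heatmap(heatmap):
    """Return (x, y, confidence) for the hottest cell of one joint's heatmap."""
    flat_index = int(np.argmax(heatmap))          # index into the flattened array
    y, x = divmod(flat_index, heatmap.shape[1])   # convert back to row, column
    return x, y, float(heatmap[y, x])

# Toy 4x4 heatmap: the model is 90% confident the joint sits at column 2, row 1.
heatmap = np.zeros((4, 4))
heatmap[1, 2] = 0.9
```

Calling `decode_heatmap(heatmap)` on the toy array returns `(2, 1, 0.9)`. A real model produces one such heatmap per keypoint, so this decode runs once for every joint in the skeleton.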

The neural networks used for this task are typically convolutional neural networks (CNNs) or, more recently, transformer-based architectures. They're trained on massive datasets of annotated human poses — hundreds of thousands of images where human annotators have manually marked the location of every joint.

MediaPipe vs. OpenPose: Two Major Approaches

Two frameworks dominate the pose estimation landscape, each with distinct strengths.

OpenPose

Developed at Carnegie Mellon University, OpenPose was one of the first real-time multi-person pose estimation systems. It uses a bottom-up approach: instead of detecting people first and then finding their joints, it detects all joints in the image simultaneously and then figures out which joints belong to which person.

OpenPose detects 25 body keypoints, 21 hand keypoints per hand, and 70 facial landmarks. It's highly accurate and handles crowded scenes well, but it's computationally demanding — typically requiring a dedicated GPU to run at real-time speeds.

MediaPipe

Google's MediaPipe takes a different approach, optimized for real-time performance on consumer devices. Its Pose model uses a top-down pipeline: it first detects a person using a lightweight detector, then runs a pose model within the detected region. MediaPipe Pose tracks 33 keypoints with full 3D coordinates (x, y, and depth).

The key advantage of MediaPipe is efficiency. It runs smoothly on laptops, tablets, and even smartphones without GPU acceleration. This makes it ideal for web-based applications that need to work on everyday hardware — exactly the use case that 67 Speed requires.

MediaPipe can track 33 body landmarks in 3D at over 30 frames per second on a standard laptop — no GPU required. That's what makes browser-based body tracking games possible.

Keypoints and Landmarks: The Language of Body Tracking

Each pose estimation framework defines a specific set of keypoints. MediaPipe's 33-point model includes:

  - Face: nose, eyes (inner, center, and outer points), ears, and mouth corners
  - Arms and hands: shoulders, elbows, wrists, and three points per hand (pinky, index, thumb)
  - Lower body: hips, knees, ankles, heels, and foot index points

Each keypoint comes with a visibility score — a confidence value between 0 and 1 indicating how certain the model is that the point is visible and correctly located. This is crucial for applications like 67 Speed, where the system needs to determine whether a detected movement is a genuine arm swing or just camera noise.
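In practice, gating on the visibility score is a simple filter. An illustrative sketch; the 0.6 threshold is an assumed value, not 67 Speed's actual setting:

```python
def visible_keypoints(visibility_scores, threshold=0.6):
    """Keep only the landmarks whose visibility score clears the threshold."""
    return {
        name: score
        for name, score in visibility_scores.items()
        if score >= threshold
    }

# Example: the right wrist is occluded, so the model reports low confidence.
scores = {"left_wrist": 0.95, "right_wrist": 0.20, "nose": 0.88}
```

Filtering `scores` drops `right_wrist`, so downstream logic never acts on a guess the model itself doubts.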

Real-Time Tracking: The 30 FPS Challenge

For a game to feel responsive, it needs to process at least 30 frames per second. That gives the pose estimation model roughly 33 milliseconds per frame to detect the person, localize all keypoints, and return results — all while sharing computational resources with the game logic, rendering, and other browser processes.
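The budget arithmetic is worth making explicit. A small sketch; the 10 ms overhead figure is an assumption standing in for game logic, rendering, and browser work, not a measured value:

```python
def frame_budget_ms(target_fps):
    """Milliseconds available per frame at the target frame rate."""
    return 1000.0 / target_fps

def fits_budget(inference_ms, target_fps=30.0, overhead_ms=10.0):
    """True if model inference plus the rest of the frame's work fits the budget."""
    return inference_ms + overhead_ms <= frame_budget_ms(target_fps)
```

At 30 FPS the budget is about 33.3 ms, so a 20 ms inference fits with room to spare, while a 30 ms inference already blows the frame.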

Achieving this requires aggressive optimization at every stage: lightweight network architectures, reduced input resolution, running the full person detector only when tracking is lost (and otherwise reusing the previous frame's region of interest), and hardware acceleration via WebGL or WebAssembly SIMD in the browser.

How 67 Speed Uses Pose Estimation

In 67 Speed, the pose estimation pipeline is specifically tuned for arm movement counting. Here's how the system translates raw keypoint data into your score:

  1. Wrist tracking: The system primarily monitors the y-coordinates (vertical position) of both wrist keypoints. As you move your arms up and down, the wrist positions oscillate.
  2. Movement detection: An algorithm analyzes the wrist trajectory to identify complete movement cycles; each full swing from a peak to a trough and back counts as one movement.
  3. Noise filtering: Small movements caused by camera shake, breathing, or subtle posture shifts are filtered out using threshold values. Only movements exceeding a minimum displacement are counted.
  4. Confidence gating: If the pose model's confidence in a wrist position drops below a threshold (perhaps due to motion blur at high speeds), the system may hold the last reliable position rather than counting a spurious movement.
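The four steps above can be sketched as a small state machine with hysteresis. This is an illustrative reconstruction, not 67 Speed's actual code, and the threshold values are assumed:

```python
class RepCounter:
    """Count complete up-down wrist cycles from (y, visibility) samples.

    Hysteresis: the wrist must rise above up_y and then drop below down_y
    before a rep is counted, so small jitters between the two thresholds
    are ignored. y is a normalized coordinate: 0 is the top of the frame.
    """

    def __init__(self, up_y=0.4, down_y=0.6, min_visibility=0.5):
        self.up_y = up_y                      # wrist must rise above this line...
        self.down_y = down_y                  # ...then fall below this one
        self.min_visibility = min_visibility  # confidence gate threshold
        self.state = "down"
        self.count = 0

    def update(self, y, visibility):
        # Confidence gating: hold state on frames where the model is unsure.
        if visibility < self.min_visibility:
            return self.count
        if self.state == "down" and y < self.up_y:
            self.state = "up"                 # wrist raised past the high mark
        elif self.state == "up" and y > self.down_y:
            self.state = "down"               # full cycle completed
            self.count += 1
        return self.count
```

Feeding alternating wrist heights of 0.7 and 0.3 counts one rep per full cycle, while samples hovering around 0.5 never cross either threshold and are ignored, which is exactly the noise-filtering behavior described above.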

This pipeline runs entirely in the browser using JavaScript and WebAssembly, meaning your video data never leaves your device. The webcam feed is processed locally, and only your final score is transmitted — a design choice that prioritizes both performance and privacy.

The Future of Body Tracking in Games

Pose estimation technology is improving rapidly. Emerging models offer better accuracy in challenging conditions — low light, partial occlusion, unusual body positions — while running even faster on consumer hardware. As these models mature, we'll see increasingly sophisticated webcam-based games that can track not just arm position, but hand gestures, facial expressions, and full-body dance movements with the precision that currently requires expensive motion capture studios.

For now, the next time you play 67 Speed, you can appreciate the remarkable chain of technology that makes it all work: your webcam captures light, a neural network interprets that light as a human skeleton, an algorithm counts your arm movements, and your score appears on screen — all in the time it takes to blink.
