Physics World Model — Modality Catalog
Two imaging modalities, each with a description, experimental setup, and reconstruction guidance.
3D Gaussian Splatting
Description
3D Gaussian splatting represents scenes as a collection of learnable 3D Gaussian primitives, each parameterized by position, covariance (anisotropic 3D extent), opacity, and spherical harmonic color coefficients. Rendering rasterizes the Gaussians by projecting them to 2D screen space, sorting by depth, and alpha-compositing with a tile-based differentiable rasterizer. Training optimizes Gaussian parameters via gradient descent with adaptive density control (splitting, cloning, pruning). This achieves real-time (30+ fps) rendering at quality comparable to NeRF, from SfM point cloud initialization (COLMAP).
Principle
3-D Gaussian Splatting represents a scene as a set of anisotropic 3-D Gaussians, each with position, covariance, opacity, and spherical-harmonic color coefficients. Novel views are rendered by projecting (splatting) these Gaussians onto the image plane and alpha-compositing them in depth order. Unlike NeRF, rendering is rasterization-based and reaches real-time frame rates (tens to 100+ fps, depending on scene and resolution) with high visual quality.
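The depth-ordered alpha compositing above can be sketched for a single pixel. The depths, opacities, and colors here are hypothetical per-Gaussian values at that pixel, not output of a real rasterizer:

```python
import numpy as np

# Toy front-to-back alpha compositing for one pixel.
# Each splatted Gaussian contributes an opacity and an RGB color;
# these are illustrative values, not a real projection.
depths = np.array([2.0, 0.5, 1.2])            # camera-space depth of each Gaussian
alphas = np.array([0.7, 0.3, 0.5])            # opacity after 2D Gaussian falloff
colors = np.array([[1.0, 0.0, 0.0],           # red
                   [0.0, 1.0, 0.0],           # green
                   [0.0, 0.0, 1.0]])          # blue

order = np.argsort(depths)                    # sort front (near) to back (far)
pixel = np.zeros(3)
transmittance = 1.0
for i in order:
    pixel += transmittance * alphas[i] * colors[i]
    transmittance *= 1.0 - alphas[i]          # light blocked by this Gaussian

# 'pixel' is the composited color; 'transmittance' is what reaches the background
```

Note how the nearest Gaussian (depth 0.5) dominates: everything behind it is attenuated by its opacity before contributing.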
How to Build the System
Start with the same multi-view image dataset as NeRF (50-200 posed images via COLMAP). Initialize 3-D Gaussians from the SfM point cloud. Train by differentiable rasterization: project Gaussians to each training view, compute photometric loss (L1 + SSIM), and optimize positions, covariances, colors, and opacities via Adam. Adaptive densification (splitting/cloning Gaussians) and pruning runs periodically during training. Training takes ~15-30 minutes on a modern GPU.
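The photometric objective above can be sketched as a weighted L1 + D-SSIM term, with λ = 0.2 as in the original paper. The SSIM here is a simplified single-window version; real implementations use an 11×11 Gaussian-windowed SSIM:

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified whole-image SSIM (real 3DGS code uses a windowed SSIM)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def photometric_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1  +  lam * (1 - SSIM), as in Kerbl et al. 2023."""
    l1 = np.abs(rendered - target).mean()
    return (1 - lam) * l1 + lam * (1 - ssim_global(rendered, target))

rng = np.random.default_rng(0)
target = rng.random((64, 64))
perfect = photometric_loss(target, target)       # identical images -> loss ~ 0
noisy = photometric_loss(target + 0.1, target)   # biased render -> nonzero loss
```

In practice this loss is backpropagated through the differentiable rasterizer to every Gaussian parameter.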
Common Reconstruction Algorithms
- 3D Gaussian Splatting (original, Kerbl et al. 2023)
- Mip-Splatting (anti-aliased multi-scale Gaussian splatting)
- SuGaR (Surface-Aligned Gaussian Splatting for mesh extraction)
- Dynamic 3D Gaussians (for dynamic scenes / video)
- Compact-3DGS (compressed Gaussian representations)
Common Mistakes
- Insufficient initial SfM points causing sparse reconstruction
- Too few training views creating holes or floater artifacts in novel views
- Excessive Gaussian count (millions) consuming too much GPU memory
- Not using adaptive densification, leaving under-reconstructed regions
- Ignoring exposure variation between training images
How to Avoid Mistakes
- Use dense SfM initialization; increase COLMAP matching thoroughness if sparse
- Capture more views, especially in regions that are under-represented
- Apply periodic pruning of low-opacity Gaussians to control memory
- Enable adaptive densification and set proper gradient thresholds for splitting
- Apply per-image exposure compensation or normalize images before training
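The densification and pruning logic above can be sketched as boolean-mask bookkeeping over per-Gaussian statistics. The thresholds here are illustrative placeholders, not the paper's tuned values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
opacity = rng.random(n)            # per-Gaussian opacity
grad = rng.random(n) * 0.001       # accumulated view-space positional gradient
scale = rng.random(n) * 0.1        # largest axis of each Gaussian's extent

OPACITY_MIN = 0.005                # prune nearly transparent Gaussians
GRAD_THRESH = 0.0002               # densify where reconstruction error is high
SCALE_SPLIT = 0.05                 # large Gaussians are split, small ones cloned

prune = opacity < OPACITY_MIN
dense = grad > GRAD_THRESH
split = dense & (scale > SCALE_SPLIT)   # over-reconstruction: split into two
clone = dense & (scale <= SCALE_SPLIT)  # under-reconstruction: clone and offset

# After one control step the Gaussian count becomes:
new_count = n - prune.sum() + split.sum() + clone.sum()
```

Running this periodically keeps memory bounded (pruning) while filling under-reconstructed regions (splitting/cloning).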
Forward-Model Mismatch Cases
- The widefield fallback processes a single 2D (64,64) image, but Gaussian splatting renders multi-view images from a set of 3D Gaussian primitives — output shape (n_views, H, W) encodes view-dependent appearance
- Gaussian splatting is a nonlinear rendering process (alpha-compositing of projected 3D Gaussians sorted by depth) — the widefield linear blur cannot model 3D-to-2D projection, depth ordering, or view-dependent effects
How to Correct the Mismatch
- Use the Gaussian splatting operator that projects 3D Gaussian primitives onto each camera plane via differentiable rasterization with alpha compositing
- Optimize Gaussian parameters (position, covariance, opacity, color SH coefficients) to minimize rendering loss across training views using the correct splatting forward model
Key References
- Kerbl et al., '3D Gaussian Splatting for Real-Time Radiance Field Rendering', SIGGRAPH 2023
Canonical Datasets
- Mip-NeRF 360 (9 scenes)
- Tanks & Temples (Knapitsch et al.)
- Deep Blending (Hedman et al.)
Neural Radiance Fields (NeRF)
Description
Neural radiance fields (NeRF) represent a 3D scene as a continuous volumetric function F(x,y,z,θ,φ) → (RGB, σ) parameterized by a multi-layer perceptron that maps 5D coordinates (position + viewing direction) to color and volume density. Novel views are synthesized by marching camera rays through the volume and integrating color weighted by transmittance using quadrature. Training optimizes the MLP weights to minimize photometric loss between rendered and observed images. Primary challenges include slow training/rendering, view-dependent effects, and the need for accurate camera poses (from COLMAP).
Principle
Neural Radiance Fields (NeRF) represent a 3-D scene as a continuous volumetric function F(x,y,z,θ,φ) → (RGB, σ) parameterized by a multi-layer perceptron (MLP). The network maps 3-D position and viewing direction to color and volume density. Novel views are synthesized by differentiable volume rendering along camera rays, and the network is trained by minimizing photometric loss against a set of posed 2-D images.
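In practice the rendering integral is discretized as C ≈ Σᵢ Tᵢ (1 − exp(−σᵢ δᵢ)) cᵢ with Tᵢ = exp(−Σⱼ₍ⱼ₎₍₎ σⱼ δⱼ) over samples j < i. A minimal sketch for one ray, with hypothetical uniform density and constant color:

```python
import numpy as np

# Quadrature for the NeRF volume-rendering integral along one ray.
t = np.linspace(0.0, 4.0, 65)                    # sample depths along the ray
delta = np.diff(t)                               # spacing between samples
sigma = np.ones(64) * 0.5                        # hypothetical uniform density
color = np.tile([0.2, 0.6, 0.9], (64, 1))        # hypothetical constant color

alpha = 1.0 - np.exp(-sigma * delta)             # per-sample opacity
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # T_i
weights = trans * alpha                          # contribution of each sample
pixel = (weights[:, None] * color).sum(axis=0)   # rendered pixel color
opacity = weights.sum()                          # 1 - background transmittance
```

With total optical depth σ·(t_far − t_near) = 2, the accumulated opacity equals 1 − e⁻², and the pixel is that fraction of the sample color, as the integral predicts.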
How to Build the System
Capture 50-200 images of a scene from diverse viewpoints using a calibrated camera (known intrinsics) or estimate camera poses with COLMAP structure-from-motion. Images should cover the scene uniformly. Train a NeRF MLP (typically 8 layers, 256 units, with positional encoding of input coordinates) on a GPU (≥12 GB VRAM). Training takes 12-48 hours on a single V100. Use mip-NeRF, Instant-NGP, or TensoRF for faster convergence.
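The positional encoding mentioned above maps each coordinate p to γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^{L−1}πp), cos(2^{L−1}πp)); NeRF uses L = 10 for positions and L = 4 for viewing directions. A minimal sketch:

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """gamma(p): per-coordinate Fourier features, as in Mildenhall et al. 2020.
    p: array of shape (..., d); returns shape (..., 2 * num_freqs * d)."""
    p = np.asarray(p, dtype=float)
    freqs = 2.0 ** np.arange(num_freqs) * np.pi          # 2^k * pi
    scaled = p[..., None] * freqs                        # (..., d, num_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

xyz = np.array([[0.1, -0.3, 0.7]])                       # one 3D sample point
features = positional_encoding(xyz, num_freqs=10)        # shape (1, 60)
```

Without this encoding the MLP's spectral bias suppresses high-frequency detail, which is why it is applied to the raw 5D inputs before the first layer.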
Common Reconstruction Algorithms
- Vanilla NeRF (MLP + positional encoding)
- Instant-NGP (multi-resolution hash encoding; trains in minutes)
- mip-NeRF (anti-aliased cone tracing)
- Nerfacto (nerfstudio default combining multiple improvements)
- TensoRF (tensor factorization for compact radiance fields)
Common Mistakes
- Insufficient camera pose accuracy (SfM failure) causing blurry results
- Too few input views or views clustered in a narrow angular range
- Training only at one scale without mip-NeRF, causing aliasing at novel distances
- Floater artifacts in empty space from insufficient regularization
- Very slow training and rendering with vanilla NeRF (hours to train, seconds per frame)
How to Avoid Mistakes
- Verify COLMAP pose estimation quality; add more images if registration fails
- Capture views uniformly around the scene; include close-up and distant views
- Use mip-NeRF or multi-scale training for scale consistency
- Add distortion loss or density regularization to eliminate floater artifacts
- Use Instant-NGP or 3D Gaussian Splatting for real-time rendering requirements
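One common floater regularizer from the list above is the distortion loss of mip-NeRF 360, which penalizes rendering weight that is smeared along a ray instead of concentrated at a surface. A sketch, assuming normalized interval edges s and per-interval weights w (my reading of the published formula; consult the paper's reference code for the exact form):

```python
import numpy as np

def distortion_loss(s, w):
    """Distortion regularizer from mip-NeRF 360 (Barron et al. 2022):
    s: (n+1,) normalized sample-interval edges; w: (n,) rendering weights."""
    mid = 0.5 * (s[:-1] + s[1:])                     # interval midpoints
    intra = np.sum(w**2 * (s[1:] - s[:-1])) / 3.0    # spread within intervals
    inter = np.sum(w[:, None] * w[None, :] * np.abs(mid[:, None] - mid[None, :]))
    return inter + intra

s = np.linspace(0.0, 1.0, 9)                         # 8 equal intervals
concentrated = np.array([0, 0, 0, 1.0, 0, 0, 0, 0])  # all weight at one surface
spread = np.ones(8) / 8                              # weight smeared along the ray
```

A ray whose weight collapses onto a single interval scores much lower than one whose weight is spread out, so minimizing this loss pushes density toward surfaces and suppresses floaters.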
Forward-Model Mismatch Cases
- The widefield fallback processes a single 2D (64,64) image, but NeRF renders multiple views of a 3D scene from a volumetric radiance field — output shape (n_views, H, W) represents images from different camera poses
- NeRF is fundamentally nonlinear (volume rendering integral C(r) = ∫ T(t) σ(t) c(t) dt along each ray) — the widefield linear blur cannot model view-dependent appearance, occlusion, or 3D geometry
How to Correct the Mismatch
- Use the NeRF operator that performs differentiable volume rendering: for each pixel, cast a ray through the volumetric density/color field and integrate transmittance-weighted radiance
- Optimize the 3D radiance field (MLP or voxel grid) to minimize photometric loss across all training views using the correct volume rendering equation as the forward model
Key References
- Mildenhall et al., 'NeRF: Representing scenes as neural radiance fields for view synthesis', ECCV 2020
- Müller et al., 'Instant Neural Graphics Primitives with a Multiresolution Hash Encoding', SIGGRAPH 2022
Canonical Datasets
- NeRF Blender Synthetic (8 scenes)
- LLFF (8 forward-facing scenes)
- Mip-NeRF 360 (9 unbounded scenes)