Understanding and Exploring 3D Gaussian Splatting: A Comprehensive Overview

Loges Siva
Dec 28, 2023


Blending Pixels and Realities with no compromise in Speed

Figure 1: Rendered scene (Source: GIF by the author)

3D Gaussian splatting is a novel approach to scene representation and rendering in computer graphics. Unlike traditional methods that rely on structured grids or neural volumetric primitives, this technique uses the flexibility of 3D Gaussians. These methods, part of the Radiance Field class like NeRFs, offer faster training and rendering, maintaining similar or better quality. They are also more straightforward to understand and post-process.

Table of Contents 📋

  • Know the History
  • Architecture
    - Differentiable 3D Gaussian Splatting
    - Optimization
    - Fast Differentiable Rasterizer
  • Miscellaneous
    - Spherical Harmonics
  • Limitations
  • How to Run
  • Conclusion

Know the History 🏺

Photogrammetry vs. NeRF

Photogrammetry is a technique that reconstructs 3D structure from 2D images by analyzing their geometric relationships. A 3D model is built from multiple images taken from different perspectives, for example by rotating an object on a turntable while capturing images from various angles. However, photogrammetry has limitations: it struggles to capture intricate details, cannot easily handle non-rigid scenes, and is sensitive to lighting conditions. Neural Radiance Fields (NeRF) address these drawbacks by using deep learning to model complex 3D scenes directly from images, overcoming the limitations in detail capture and in handling dynamic or deformable objects. NeRFs provide a more flexible and accurate representation of 3D scenes, making them particularly effective where traditional photogrammetry struggles.

NeRF vs. 3D Gaussian Splatting

Neural Radiance Fields (NeRF) is a computer graphics method that uses neural networks to model and render complex 3D scenes directly from 2D images. By capturing both geometry and appearance in a unified model, NeRF achieves high-quality scene representation. Its drawback lies in its computational intensity, which makes it expensive for real-time applications and large-scale scenes. 3D Gaussian Splatting addresses this by modeling a scene with 3D Gaussians, whose parameters (anisotropic covariances, positions, and opacities) are learned through optimization. Rendering then requires no heavy processing: a fast tile-based rasterizer draws the Gaussians in real time.

Architecture 🏛️

Architecture Flow
Figure 2: Starting from a sparse Structure-from-Motion (SfM) point cloud, the optimization process uses a fast tile-based renderer to generate a set of 3D Gaussians whose density is adaptively controlled. (Source: Image taken from [1])

DIFFERENTIABLE 3D GAUSSIAN SPLATTING

The goal is to create a realistic 3D scene from a few sparse images. Traditional methods struggle with sparse data, especially when estimating surface normals. Instead, this approach uses 3D Gaussians — mathematical representations that capture the scene’s structure without needing normals.

A 3D Gaussian is defined by:

  • Position (Mean μ): location (XYZ)
  • Covariance Matrix (Σ): rotation and scaling
  • Opacity (𝛼): Transparency
  • Color (RGB) or Spherical Harmonics (SH) coefficients

Each Gaussian is a blob centered at a point (mean μ) and shaped by a 3D covariance matrix Σ defined in world space. To render them efficiently, these 3D shapes must be projected onto the 2D image plane. A covariance matrix has physical meaning only when it is positive semi-definite, and this constraint is difficult to impose on the matrix directly, so naive gradient descent on Σ is avoided because it can produce invalid matrices.

This is solved with a clever workaround: each blob is described by a scaling matrix S and a rotation matrix R, which can be optimized independently, adapting the blobs to different shapes in the scene. The result is a compact, realistic 3D scene that's visually appealing.

Σ = R S Sᵀ Rᵀ
  • S is a diagonal scaling matrix with 3 scale parameters
  • R is a 3x3 rotation matrix derived from a quaternion (4 parameters)

Spherical Harmonics (SH) coefficients represent the directional appearance (color) component of the radiance field, as explained in the Miscellaneous section below.
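The Σ = R S Sᵀ Rᵀ construction can be sketched in a few lines of NumPy; the quaternion convention and the example values here are illustrative, not taken from the paper's code:

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def build_covariance(scale, quat):
    """Sigma = R S S^T R^T: positive semi-definite by construction,
    so gradient descent on (scale, quat) can never produce an
    invalid covariance matrix."""
    S = np.diag(scale)            # 3 scale parameters
    R = quat_to_rotmat(quat)      # 4 quaternion parameters
    M = R @ S
    return M @ M.T

# Identity rotation: the covariance is just the squared scales.
Sigma = build_covariance(np.array([0.5, 1.0, 2.0]),
                         np.array([1.0, 0.0, 0.0, 0.0]))
print(Sigma)  # diag(0.25, 1.0, 4.0)
```

Because Σ is built as M Mᵀ, its eigenvalues are non-negative for any scale and quaternion values, which is exactly why the optimizer works on (S, R) instead of Σ.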

OPTIMIZATION

The optimization process involves crafting a dense assembly of 3D Gaussians to closely represent a scene for free-view synthesis.

To handle 3D to 2D projection ambiguities, the optimization method dynamically creates, alters, or removes geometry. Precise covariance parameters are pivotal for a compact representation, especially in capturing large, uniform areas with a minimal number of anisotropic Gaussians.

There are 3 main steps involved in proper optimization:

  • SfM Initialization
  • Gradient Descent for Parameter Optimization
  • Adaptive Densification

SfM Initialization

Structure from Motion (SfM) is a computer vision technique that reconstructs a three-dimensional scene from a set of two-dimensional images or video frames. The process involves camera motion estimation and 3D structure reconstruction.

It achieves this by first identifying and tracking distinct features across images, determining how they move between frames. Using this information, SfM then estimates the relative poses of the cameras at different points in time. Next, the technique triangulates the 3D positions of these features by finding where the corresponding rays from different camera viewpoints intersect. Finally, a refinement process, known as bundle adjustment, optimizes the camera poses and 3D points to minimize any discrepancies between the actual and projected feature locations.
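The triangulation step described above can be illustrated with a minimal linear (DLT) solver; the camera matrices and the 3D point below are toy examples, not part of any SfM library:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover the 3D point whose
    projections through cameras P1, P2 match pixels x1, x2."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)       # least-squares null vector of A
    X = Vt[-1]
    return X[:3] / X[3]               # de-homogenize

# Two toy cameras: one at the origin, one shifted 1 unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))   # recovers [0.5, 0.2, 4.0]
```

Real SfM pipelines such as COLMAP triangulate thousands of such points and then jointly refine them with the camera poses in bundle adjustment.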

Instead of random initialization, which can hamper the optimization process, initializing the 3D points from SfM is a good approach.

Figure 3: (Source: Image taken from ResearchGate)

Gradient Descent for Parameter Optimization

Stochastic gradient descent is used to fit the Gaussians to the scene by estimating their parameters. The positional learning rate follows an exponential decay schedule, and optimization is guided by a loss function combining L1 and D-SSIM (structural dissimilarity) terms.

ℒ = (1 − λ) ℒ₁ + λ ℒ_D-SSIM
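A sketch of the combined loss follows. For brevity the SSIM here is computed from global image statistics; the actual loss uses a windowed SSIM, and λ = 0.2 in the paper:

```python
import numpy as np

def d_ssim(a, b, c1=0.01**2, c2=0.03**2):
    """Structural dissimilarity (1 - SSIM) / 2, computed with global
    statistics for brevity instead of a sliding Gaussian window."""
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (a.var() + b.var() + c2))
    return (1.0 - ssim) / 2.0

def loss(rendered, gt, lam=0.2):
    """L = (1 - lambda) * L1 + lambda * L_D-SSIM."""
    l1 = np.abs(rendered - gt).mean()
    return (1 - lam) * l1 + lam * d_ssim(rendered, gt)

img = np.random.default_rng(0).random((32, 32))
print(loss(img, img))        # identical images give zero loss
```

The L1 term drives per-pixel accuracy, while the D-SSIM term penalizes structural differences that a pure per-pixel loss can miss.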

Adaptive Densification

After generating an initial set of sparse points using Structure from Motion (SfM), the adaptive densification step dynamically adjusts the number and density of 3D Gaussians to enhance the scene representation during free-view synthesis.

After an initial optimization warm-up, it consistently densifies every 100 iterations, eliminating Gaussians with nearly transparent opacity values. The adaptive control addresses both under-reconstructed regions and over-reconstructed areas with large Gaussian coverage by leveraging view-space positional gradients.

Densification involves cloning small Gaussians in under-reconstructed regions and splitting large Gaussians in high-variance zones. To balance the system’s total volume and Gaussian count, 𝛼 values are occasionally reset, allowing controlled growth or reduction. This strategic approach, combined with periodic removal of excessively large Gaussians, maintains control over the total Gaussian count without the need for space compaction or warping. This method ensures Gaussians retain their representation in Euclidean space throughout the optimization process.
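One densification pass can be sketched as below. The thresholds, the scale divisor of 1.6, and the two-child split are illustrative of the scheme described above, not a transcription of the paper's code:

```python
import numpy as np

# Illustrative thresholds -- the real values are tunable hyperparameters.
GRAD_THRESHOLD = 0.0002   # view-space positional gradient trigger
SCALE_THRESHOLD = 0.01    # cutoff between "small" and "large" Gaussians
MIN_OPACITY = 0.005       # prune nearly transparent Gaussians

def densify_step(positions, scales, opacities, grads, rng):
    """One adaptive-control pass: prune transparent Gaussians, then
    clone small ones (under-reconstruction) and split large ones
    (over-reconstruction) where the view-space gradient is high."""
    keep = opacities >= MIN_OPACITY
    positions, scales, grads = positions[keep], scales[keep], grads[keep]

    hot = grads > GRAD_THRESHOLD
    big = scales.max(axis=1) > SCALE_THRESHOLD
    clone, split = hot & ~big, hot & big

    out = [positions[~split]]                 # split parents are replaced
    out.append(positions[clone])              # clones keep their position
    for p, s in zip(positions[split], scales[split]):
        # two children sampled inside the parent, with reduced scale
        out.append(p + rng.standard_normal((2, 3)) * (s / 1.6))
    return np.concatenate(out)
```

In one pass, a pruned Gaussian disappears, a cloned one appears twice, and a split one is replaced by two smaller children, mirroring the two rows of Figure 4.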

Adaptive Gaussian Densification
Figure 4: Top row (under reconstruction) and Bottom row (over-reconstruction) (Source: Image taken from [1])

FAST DIFFERENTIABLE RASTERIZER

The rendering approach focuses on achieving fast overall rendering and sorting for efficient 𝛼-blending without strict limits on the number of splats. In short, the process involves sorting the Gaussians by depth and grouping them by screen tile.

  1. The approach employs a tile-based rasterizer which divides the screen into 16x16 tiles to manage rendering tasks.
  2. Gaussians are culled against the view frustum and each tile, retaining only those with a 99% confidence interval intersecting the frustum.
  3. A guard band helps reject Gaussians at extreme positions.
  4. Each Gaussian is instantiated based on the number of tiles they overlap, and a key is assigned considering view space depth and tile ID.
  5. Sorting of Gaussians is performed based on these keys using a fast GPU Radix sort.
  6. One thread block is launched for each tile independently, and each block loads packets of Gaussians into shared memory.
  7. For each pixel, color and 𝛼 values are accumulated by traversing lists of Gaussians front-to-back.
  8. Saturation of 𝛼 is the termination criterion (i.e., 𝛼 goes to 1), and threads are queried at intervals to stop processing when all pixels have saturated.
  9. During the backward pass, the traversal of per-tile lists occurs back-to-front. To compute gradients, intermediate opacities are obtained by dividing the final accumulated opacity by each point’s 𝛼 in the back-to-front traversal.
  10. This method allows an unlimited number of blended primitives to receive gradient updates, accommodating scenes with varied depth complexity without specific parameter tuning.
Rasterization Algorithm
Figure 5: (Source: Image taken from [1])
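The sort-then-blend core of the steps above can be sketched for a single pixel. The tile IDs, colors, and alphas are made-up values, and Python's stable sort stands in for the GPU radix sort:

```python
import numpy as np

def composite_pixel(splat_list):
    """Front-to-back alpha blending for one pixel; `splat_list` is a
    depth-sorted list of (rgb, alpha) pairs."""
    color, transmittance = np.zeros(3), 1.0
    for rgb, alpha in splat_list:
        color += transmittance * alpha * np.asarray(rgb, float)
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:          # alpha saturated: stop early
            break
    return color

# Keys combine tile ID and view-space depth, so a single sort orders
# every tile's Gaussians front-to-back at once.
splats = [(0, 2.0, (1.0, 0.0, 0.0), 0.6),   # (tile, depth, rgb, alpha)
          (0, 1.0, (0.0, 1.0, 0.0), 0.5),
          (1, 1.5, (0.0, 0.0, 1.0), 0.9)]
splats.sort(key=lambda s: (s[0], s[1]))
tile0 = [(rgb, a) for tile, depth, rgb, a in splats if tile == 0]
print(composite_pixel(tile0))   # green in front attenuates the red behind
```

The early-exit on saturated transmittance is what lets whole warps stop processing once every pixel in a tile is opaque.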

Miscellaneous 🥷🏼

Spherical Harmonics

Spherical harmonics (SH) are mathematical functions defined on the surface of a sphere. In computer graphics, they are commonly used for representing lighting information in a compact and efficient manner. SH functions are used to approximate complex lighting conditions across a spherical environment, allowing for realistic rendering in computer-generated images.

In simple terms, spherical harmonics decompose the incoming light into a set of coefficients, each associated with a specific SH function. These coefficients capture the lighting characteristics, such as intensity and color, across different directions on a spherical surface. By using a limited number of SH coefficients, the representation remains concise while providing a reasonable approximation of the lighting environment.

In computer graphics, SH is often utilized in techniques like precomputed radiance transfer (PRT) and global illumination to simulate realistic lighting effects in virtual scenes.

The formula is given by,

Yₗᵐ(θ, ϕ) = √( (2ℓ+1)/(4π) · (ℓ−m)!/(ℓ+m)! ) · Pₗᵐ(cos θ) · e^(imϕ)
  • ℓ is a non-negative integer, known as the degree of the spherical harmonic
  • m is an integer such that −ℓ≤ m ≤ℓ, known as the order
  • θ is the polar angle (colatitude)
  • ϕ is the azimuthal angle
  • Pₗᵐ​(cosθ) is the associated Legendre polynomial
Figure 6: Spherical harmonic functions for different values of l and m (Source: Image taken from here)
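Gaussian splatting stores real-valued SH coefficients per Gaussian and evaluates them along the viewing direction. A degree-1 evaluation can be sketched as follows; the constants are the standard real-SH normalization, and the sign convention follows common implementations:

```python
import numpy as np

# Real spherical-harmonics constants (standard normalization).
C0 = 0.28209479177387814   # degree 0 (DC term)
C1 = 0.4886025119029199    # degree 1

def sh_to_color(coeffs, direction):
    """Evaluate a degree-1 real SH expansion for one viewing direction.
    coeffs: (4, 3) array -- one row per basis function, RGB columns."""
    x, y, z = direction / np.linalg.norm(direction)
    basis = np.array([C0, -C1 * y, C1 * z, -C1 * x])
    return basis @ coeffs   # view-dependent RGB color

# With only the degree-0 (DC) coefficient set, the color is the same
# from every viewing direction.
dc_only = np.zeros((4, 3))
dc_only[0] = [1.0, 0.5, 0.2]
print(sh_to_color(dc_only, np.array([0.0, 0.0, 1.0])))
```

Higher degrees add more basis rows, letting the color vary more sharply with direction at the cost of more coefficients per Gaussian.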

Limitations

Despite the huge improvement compared to previous radiance field methods, this approach still faces limitations because of its simplicity.

Artifacts like elongated or “splotchy” Gaussians can occur, especially in challenging scenes. Popping artifacts may arise when creating large Gaussians, particularly in regions with view-dependent appearances.

In regions with inadequate coverage from training views, the random initialization method tends to introduce “floaters” that cannot be removed by optimization.

The rasterizer’s guard band and visibility algorithm may contribute to popping artifacts and sudden depth/blending order changes. Also, antialiasing is not implemented, leaving room for improvement in addressing abrupt depth/blending order changes.

GPU memory consumption, while more compact than previous methods, can peak over 20 GB during training for large scenes. There is still room for improvement in reducing memory usage.

How to Run 🏃‍♂️

  • In Google Colab, change the runtime type to GPU
  • Make sure the CUDA version is 12.2
  • Install a PyTorch version compatible with the CUDA version, along with the plyfile and tqdm packages
!pip install torch torchvision torchaudio
!pip install plyfile
!pip install tqdm
!git clone https://github.com/graphdeco-inria/gaussian-splatting --recursive
  • Install the necessary submodules
!pip install submodules/diff-gaussian-rasterization
!pip install submodules/simple-knn
!sudo apt-get install \
git \
cmake \
ninja-build \
build-essential \
libboost-program-options-dev \
libboost-filesystem-dev \
libboost-graph-dev \
libboost-system-dev \
libeigen3-dev \
libflann-dev \
libfreeimage-dev \
libmetis-dev \
libgoogle-glog-dev \
libgtest-dev \
libsqlite3-dev \
libglew-dev \
qtbase5-dev \
libqt5opengl5-dev \
libcgal-dev \
libceres-dev

!git clone https://github.com/colmap/colmap.git
cd colmap
!git checkout dev
mkdir build
cd build
!cmake .. -GNinja -DCMAKE_CUDA_ARCHITECTURES=native
!ninja
!sudo ninja install
  • Record a video around an object or scene
  • Create a directory /data/input for saving frames of the video.
  • Install ffmpeg. Split the video into frames with ffmpeg and save the frames in /data/input dir
!sudo apt install ffmpeg

!ffmpeg -i <VIDEO_PATH> -qscale:v 1 -qmin 1 -vf fps=2 /data/input/%04d.jpg
  • Construct Structure-from-Motion (SfM) of the scene with COLMAP library
!python convert.py -s <PATH_TO_SAVED_FRAMES>
  • Run the following command for training. It takes ~2–3 hours on a Tesla T4 GPU with 16 GB VRAM for a small scene. The final output will be saved in the /output directory
!python train.py -s <PATH_TO_SAVED_SfM_DATA>
  • Download the pre-built Windows binaries of the SIBR viewer from here and extract the downloaded .zip archive
  • In command prompt, run following commands for visualization
cd viewers/bin
SIBR_gaussianViewer_app.exe -m <Path_TO_OUTPUT_FROM_TRAINING>
  • Use the W, A, S, D, J, L, I, K, U, and O keys to navigate the rendered scene.
Navigate the rendered scene (Source: Video by the author)

For a beginner tutorial, refer to The NeRF Guru's video here.

Have fun playing with your splats!

Conclusion 📜

3D Gaussian Splatting emerges as a promising advancement in scene representation for novel view synthesis. The technology is evolving fast, with many researchers and engineers already adopting and improving the method. It is an exciting time for computer graphics, with advances in GPU rendering, AI techniques, and new optimization algorithms. Buckle up for a ride into the future of immersive digital experiences!

References 🔖

  1. Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G., “3D Gaussian Splatting for Real-Time Radiance Field Rendering”, arXiv:2308.04079, 2023
  2. Kate Yurkova, “A Comprehensive Overview of Gaussian Splatting”, Towards Data Science, 2023
  3. Spherical harmonics, Wikipedia
