Understanding Positional Encoding in 3D

Introduction

In 3D representation learning, raw coordinates are little more than location indices. For NeRFs, point-based networks, and coordinate-based models, the network has to resolve local geometric detail while maintaining stable global separation. If the encoding is too smooth, detail gets washed out; if it leans too heavily on periodic structure, distant positions can align spuriously in feature space.

Positional encoding therefore does more than transform coordinates once. It injects a spatial inductive bias at the input stage. It determines how the model perceives scale, how it compares positions, and which spatial relations become easier for downstream layers to exploit.

Fourier features are the clearest starting point for this story. They give coordinate networks a strong multi-scale prior, but their periodicity also introduces extra global ambiguity. PMPE can be read as a response to that weakness, while broader learnable spatial encodings push the question of how to organize space from hand-crafted design toward task adaptation.

Fourier Features: The Classical Baseline

Classic Fourier positional encoding (Mildenhall et al., 2020) maps a coordinate to sine and cosine responses at multiple frequencies:

$$\gamma(x) = [\cos(\omega_1 x),\ \sin(\omega_1 x),\ \cos(\omega_2 x),\ \sin(\omega_2 x),\ \dots]$$

This works because it expands a simple coordinate axis into a multi-scale feature representation. Low-frequency channels change slowly and are better suited to coarse spatial structure. High-frequency channels react more sharply and make small positional differences easier to separate. Instead of asking the network to build multi-scale structure from raw coordinates alone, the encoding writes that structure directly into the input.

The word frequency here does not mean that the geometry itself is oscillating. It only describes how quickly the encoded feature changes as the coordinate moves. Lower frequencies vary more gently and are better for overall shape. Higher frequencies vary more aggressively and are better at capturing boundaries, sharp corners, and thin structures. That is why Fourier features became the standard starting point for coordinate networks: they provide a direct multi-scale prior.
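As a minimal sketch of the mapping described above, the following NumPy snippet expands 1D positions into interleaved cosine and sine channels. The frequency schedule $\omega_k = 2^k \pi$ is an assumption borrowed from the common NeRF-style ladder, and the function name `fourier_encode` and the parameter `num_freqs` are illustrative rather than taken from any particular library.

```python
import numpy as np

def fourier_encode(x, num_freqs=6):
    """Map 1D positions to interleaved [cos(w_k x), sin(w_k x)] channels.

    Assumption: frequencies follow a geometric ladder w_k = 2**k * pi.
    """
    x = np.asarray(x, dtype=np.float64).reshape(-1, 1)          # (N, 1)
    omegas = (2.0 ** np.arange(num_freqs)) * np.pi               # (K,)
    angles = x * omegas                                          # (N, K)
    # Stack so that channel 2k is the cosine and channel 2k+1 the sine.
    feats = np.stack([np.cos(angles), np.sin(angles)], axis=-1)  # (N, K, 2)
    return feats.reshape(x.shape[0], 2 * num_freqs)

xs = np.linspace(-1.0, 1.0, 5)
enc = fourier_encode(xs, num_freqs=4)
print(enc.shape)  # (5, 8)
```

Low-index channels change slowly across `xs`, while high-index channels flip sign several times over the same range, which is the multi-scale behavior the text describes.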

The first visualization can be read from two angles. The Positional Encoding panel shows how a single coordinate is expanded into a multi-channel representation. The Dot-product Similarity panel shows the similarity structure induced by that encoding between different positions. For models that rely on dot-product similarity, especially attention-based ones, that second view matters because it shows how the encoding organizes spatial neighborhoods.

Traditional Fourier / NeRF Positional Encoding

This demo shows how a 1D position x in [-1, 1] is expanded into many sine/cosine channels with different frequencies. The left plot shows the encoded values. The right plot shows how similar two positions look after encoding.

  • Positional Encoding. Vertical axis: input position x. Horizontal axis: encoding channel. White or dark line: the currently selected position.
  • Dot-product Similarity. Horizontal axis: position x_j. Vertical axis: position x_i. The cross shows the selected position.
  • Encoded vector for the selected position. This sparkline is the encoded feature vector of the selected x. Each point corresponds to one cosine or sine channel.

How to read this
  • More frequencies means more fine-scale variation, so nearby positions become easier to distinguish.
  • The main diagonal in the similarity plot is bright because a position is always most similar to itself.
  • Stripe-like off-diagonal patterns come from periodicity. They hint that some far-away positions can still look similar after encoding.

The limitation of Fourier features comes from the same assumption that makes them useful. Sine and cosine are periodic, so distant positions can still produce similar responses in some channels. In the similarity view, that appears as repeated stripe structure away from the main diagonal. In other words, Fourier encoding can sharpen local detail while still struggling to separate global position cleanly.
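The stripe structure follows from a short identity: for a sine/cosine encoding, the dot product between two encoded positions is $\sum_k \cos(\omega_k (x_i - x_j))$, so it depends only on the displacement and repeats whenever the displacement hits a shared period. The sketch below checks this numerically. The frequency ladder $\pi, 2\pi, 4\pi, 8\pi$ is an assumed example and `encode_with_freqs` is a hypothetical helper.

```python
import numpy as np

def encode_with_freqs(x, omegas):
    """Encode 1D positions as [cos(w_k x) ..., sin(w_k x) ...]."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, 1)
    angles = x * omegas
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

omegas = (2.0 ** np.arange(4)) * np.pi    # assumed ladder: pi, 2pi, 4pi, 8pi
xs = np.linspace(-1.0, 1.0, 201)
enc = encode_with_freqs(xs, omegas)
sim = enc @ enc.T                          # (201, 201) dot-product similarity

# Closed form: sum_k cos(w_k * (x_i - x_j)), a function of displacement only.
delta = xs[:, None] - xs[None, :]
closed_form = np.cos(delta[..., None] * omegas).sum(axis=-1)
print(np.allclose(sim, closed_form))       # True

# x = -1 and x = +1 differ by 2, a whole number of periods for every channel
# in this ladder, so the two endpoints look identical after encoding.
print(np.isclose(sim[0, -1], sim[0, 0]))   # True
```

Because the similarity is constant along each diagonal of `sim`, any displacement that happens to score highly does so everywhere in the domain, which is what shows up as off-diagonal stripes.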

PMPE: Reducing Periodic Ambiguity

PMPE, short for Phase-Modulated Positional Encoding (Roblox Research, 2025), is meant to address exactly that periodic ambiguity in Fourier encoding. As long as the representation is built mainly from periodic functions, distant positions can end up looking similar when they should not.

Its central move is to add a phase-modulated branch alongside the original Fourier branch:

$$\text{embedding}(x) = [x,\ \cos(f_m) + \cos(p_m),\ \sin(f_m) + \sin(p_m)]$$

From that structure, the roles are clear. The Fourier branch continues to provide multi-scale detail sensitivity, while the phase-modulated branch is used to break the repeated alignments caused by purely periodic structure. PMPE is not trying to introduce more frequencies. It is trying to reduce long-range similarities that do not reflect true spatial proximity.

That is exactly what the comparison below makes visible. Relative to pure Fourier encoding, the strong response near the main diagonal remains, while the repeated stripes away from the diagonal become noticeably weaker. PMPE does not sacrifice local detail resolution. It reorganizes similarity geometry so it better matches actual spatial relationships.

Traditional Fourier vs PMPE

This demo compares the classic Fourier-style positional encoding with a phase-modulated version. The goal is to show how PMPE keeps structured variation while reducing misleading global periodic similarity.

Traditional Fourier / NeRF Encoding
  • Positional Encoding. Vertical axis: input position x. Horizontal axis: encoding channel.
  • Dot-product Similarity. Horizontal axis: position x_j. Vertical axis: position x_i.

Phase-Modulated Positional Encoding (PMPE)
  • Positional Encoding. Vertical axis: input position x. Horizontal axis: encoding channel.
  • Dot-product Similarity. Horizontal axis: position x_j. Vertical axis: position x_i.

What to look for
  • In the traditional Fourier version, repeated stripe-like off-diagonal patterns often appear in the similarity matrix because of periodicity.
  • In the PMPE version, the similarity structure is usually more spatially coherent, meaning distant positions are less likely to look accidentally similar.
  • The core goal of PMPE is not to remove detail, but to improve how global spatial separation is reflected in the encoded space.

That correction matters in 3D because positional encoding often sits upstream of neighborhood construction, similarity computation, and attention. If distant positions still look too similar after encoding, later layers can mix signals that should remain separate. PMPE is useful because it directly targets that structural ambiguity while preserving the local-detail advantages that made Fourier features attractive to begin with.

What Happens Once the Encoding Becomes Learnable

PMPE reduces periodic ambiguity, but once that idea enters a real model, the discussion keeps moving. The question is no longer only how to repair Fourier encoding. It becomes which frequencies, phases, and spatial structures should still be hand-designed, and which parts should be learned. In Cube3D (Roblox Research, 2025), the PMPE line is extended with learnable frequencies, turning the encoding from a design idea into a concrete engineering component.

Cube3D: An Engineering Realization of PMPE

In practice, Cube3D uses a composite positional encoding. It neither reverts to plain Fourier features nor relies on the phase-modulated branch alone. It sums the two branches channel by channel and prepends the raw coordinate to form the final embedding:

$$\text{embedding}(x) = [x,\ \cos(f_m) + \cos(p_m),\ \sin(f_m) + \sin(p_m)]$$

where

$$f_m = xW,\qquad p_m = x \cdot \frac{\pi}{2} + c$$

Here $W$ is a learnable parameter and $c$ is the carrier term.

The reason for this choice becomes clearer if we start from Cube's objective. Roblox's Cube uses this encoding at the entrance to its shape tokenizer: it samples points from a mesh surface and sends them into a perceiver-style transformer. If positional representations cannot reliably separate distant points in cross-attention, the later latent modeling and reconstruction stages will suffer.

$f_m = xW$ and $p_m = x \cdot \frac{\pi}{2} + c$ serve different purposes. The former keeps the basic expressive role of the Fourier branch. The latter introduces extra phase structure to break the repeated alignments caused by purely periodic functions.

In the official implementation, the carrier term varies across channels to shift the phase branch, and the paper explicitly notes that this helps avoid resonance between the branches.
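To make the two branches concrete, here is a minimal 1D sketch of the embedding formula in NumPy. It is not the official Cube3D code: `W` is a fixed random vector standing in for the learned projection, the per-channel `carrier` values are hypothetical, and the function name `pmpe_encode` is invented for illustration.

```python
import numpy as np

def pmpe_encode(x, W, carrier):
    """1D sketch of embedding(x) = [x, cos(f)+cos(p), sin(f)+sin(p)].

    f = x * W           Fourier branch (W is fixed here; learnable in Cube3D)
    p = x * pi/2 + c    phase branch (c is a per-channel carrier)
    """
    x = np.asarray(x, dtype=np.float64).reshape(-1, 1)    # (N, 1)
    f = x * W[None, :]                                    # (N, M)
    p = x * (np.pi / 2.0) + carrier[None, :]              # (N, M)
    return np.concatenate(
        [x, np.cos(f) + np.cos(p), np.sin(f) + np.sin(p)], axis=1
    )

rng = np.random.default_rng(0)
M = 8
W = rng.normal(scale=4.0, size=M)        # hypothetical stand-in for a learned W
carrier = np.linspace(0.0, np.pi, M)     # hypothetical carrier, one value per channel

xs = np.linspace(-1.0, 1.0, 101)
emb = pmpe_encode(xs, W, carrier)        # raw coordinate plus 2*M combined channels
sim = emb @ emb.T
print(emb.shape, sim.shape)              # (101, 17) (101, 101)
```

Expanding the dot product for one channel gives $\cos(f_i - f_j) + \cos(p_i - p_j) + \cos(f_i - p_j) + \cos(p_i - f_j)$. The first two terms depend only on the displacement $x_i - x_j$, but the cross terms depend on the absolute positions, so the similarity is no longer constant along diagonals. That is one way to read how the phase branch disturbs the repeated alignments of a purely periodic encoding.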

The demo below recreates the same logic in a simplified 1D setting. The first panel shows channel responses, and the second shows how the similarity field unfolds around a position. To make the idea easier to inspect, the demo uses a fixed frequency basis in place of a learned $W$, but the relationship to watch is still the same: the Fourier branch preserves detail, while the phase branch constrains global similarity.

Cube3D-Style PMPE

This demo visualizes a simplified 1D version of a Cube3D-style phase-modulated positional encoding. It combines a learnable Fourier-like branch with a structured phase branch, then adds them together to form the final embedding.

  • Positional Encoding. Vertical axis: input position x. Horizontal axis: encoding channel. The highlighted line marks the selected position.
  • Dot-product Similarity. Horizontal axis: position x_j. Vertical axis: position x_i. The cross shows the selected position.
  • Encoded vector for the selected position. This sparkline shows the full embedding vector of the selected x, including the raw coordinate channel and the combined cosine/sine channels.

How to read this
  • Frequencies control how many combined channels are used.
  • Seed changes the learnable Fourier-like projection, which slightly changes the detailed stripe pattern.
  • The main idea is that the final embedding is formed by adding a Fourier-like branch and a phase-modulated branch together.

Learnable Fourier Features

Methods with learnable frequency bases hand the basis itself over to training. Instead of using a preset set of frequencies, the model adjusts the basis to the target signal. That keeps the multi-scale flavor of Fourier features while making the representation fit the data more closely.

In the demo, the purple curve is the target signal and the green curve is the current fit. The model takes the form

$$y(x) = \sum_i \alpha_i \cos(\omega_i x + b_i).$$

The model learns not only the coefficients $\alpha_i$, but also the frequency $\omega_i$ and phase $b_i$ of each basis element. Basis count sets the number of basis terms, Learning rate sets the update speed, and Seed controls initialization. The frequency chips on the right show the magnitudes of the learned frequencies.
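A tiny training loop of this kind can be sketched in plain NumPy with hand-derived gradients. Everything specific here is an assumption made for illustration: the target signal, the initialization ranges, the learning rate, and the function name `fit_learnable_fourier`. It is not the demo's actual code.

```python
import numpy as np

def fit_learnable_fourier(x, y, num_basis=8, lr=0.02, steps=1500, seed=0):
    """Fit y(x) ~ sum_i alpha_i * cos(omega_i * x + b_i) by gradient descent."""
    rng = np.random.default_rng(seed)
    alpha = 0.1 * rng.standard_normal(num_basis)      # coefficients
    omega = rng.uniform(0.0, 10.0, size=num_basis)    # trainable frequencies
    b = rng.uniform(-np.pi, np.pi, size=num_basis)    # trainable phases
    losses = []
    for _ in range(steps):
        theta = x[:, None] * omega + b                # (N, K)
        c, s = np.cos(theta), np.sin(theta)
        resid = c @ alpha - y                         # prediction minus target
        losses.append(float(np.mean(resid ** 2)))
        # Analytic gradients of the mean squared error.
        g = 2.0 * resid[:, None] / x.size             # dL/dprediction per sample
        grad_alpha = (g * c).sum(axis=0)
        grad_b = (g * (-alpha * s)).sum(axis=0)
        grad_omega = (g * (-alpha * s) * x[:, None]).sum(axis=0)
        alpha -= lr * grad_alpha
        omega -= lr * grad_omega
        b -= lr * grad_b
    return alpha, omega, b, losses

x = np.linspace(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + 0.5 * np.cos(7.0 * x)           # toy target for illustration
alpha, omega, b, losses = fit_learnable_fourier(x, y)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print("learned |omega|:", np.round(np.abs(omega), 2))
```

The only difference from a fixed Fourier basis is the `grad_omega` and `grad_b` updates. Removing them turns the model back into linear regression on preset features.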

Relative Position Bias

Relative Position Bias does not learn where a point is. It learns how relative displacement should affect attention. It is often written as b(dx, dy), or more generally as $\theta(p_i - p_j)$, and then added directly to the attention score or used in both the attention and feature branches. The model no longer focuses on absolute coordinates, but on how far away another point is and in which direction.

Point Transformer feeds 3D relative coordinates $p_i - p_j$ into a small MLP, and in its experiments relative encoding clearly outperforms absolute encoding (Zhao et al., 2021). In that sense, what it learns is essentially a displacement preference map.
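The following sketch shows the basic mechanism in a scalar-attention simplification: a small MLP maps each displacement $p_i - p_j$ to a bias that is added to the attention score. This is not Point Transformer's vector attention, and the weights are random placeholders rather than trained parameters. The names `relative_bias` and `content_scores` are hypothetical.

```python
import numpy as np

def relative_bias(points, W1, b1, W2, b2):
    """Score bias theta(p_i - p_j) from a tiny two-layer MLP on displacements."""
    delta = points[:, None, :] - points[None, :, :]    # (N, N, 3)
    hidden = np.maximum(delta @ W1 + b1, 0.0)          # ReLU, (N, N, H)
    return (hidden @ W2 + b2)[..., 0]                  # (N, N)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)            # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, H = 5, 16
points = rng.uniform(-1.0, 1.0, size=(N, 3))
# Randomly initialized weights stand in for trained parameters.
W1, b1 = rng.standard_normal((3, H)), np.zeros(H)
W2, b2 = rng.standard_normal((H, 1)), np.zeros(1)

content_scores = rng.standard_normal((N, N))           # placeholder for q_i . k_j
bias = relative_bias(points, W1, b1, W2, b2)
attn = softmax(content_scores + bias, axis=-1)         # each row sums to 1

# Translating every point leaves the bias unchanged: only displacements enter.
shift = np.array([10.0, -3.0, 2.5])
shifted_bias = relative_bias(points + shift, W1, b1, W2, b2)
print(np.allclose(bias, shifted_bias))                 # True
```

The translation check is the property that separates a relative bias from an absolute encoding: moving the whole point cloud changes every absolute coordinate but none of the biases.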

The demo shows a toy version of exactly that kind of map. The axes are (dx, dy), and the colors indicate whether a certain relative displacement is amplified or suppressed. Locality controls the effective range, Anisotropy controls directional stretching, and Rotation controls the dominant orientation. It does not reproduce a full training process, but it is enough to make the key distinction visible: relative bias depends on displacement rather than world coordinates.

Hash Grids as Spatial Memory

Hash grids follow a different line. Instead of continuing to design functions, they learn a multi-resolution spatial storage. A point is first placed on grids at several resolutions, local features are retrieved through hashing and interpolation, and the results are concatenated into a representation. Compared with Fourier features, this is closer to a sparse, local, extensible spatial memory.

The most representative work in this family is the multi-resolution hash grid encoding in Instant-NGP (Müller et al., 2022). It maps coordinates to trainable features stored in multiple hash tables, then forms the final representation through interpolation and concatenation. Because much of the representational capacity sits in the tables, the network that follows can be small, which in practice avoids much of the cost of pushing a wide Fourier-expanded input through a large MLP. The learned tables can also adapt more flexibly to the spatial structure of different tasks.

In this tab, the cross marks the anchor point, and the heatmap color shows the cosine similarity between the anchor embedding and the embedding at every other location. Under the hood, the implementation is a simplified multi-resolution hash grid: each level has its own resolution, grid vertices are mapped through a hash into a finite table, each query point bilinearly interpolates the neighboring entries, and the results are concatenated across levels. Levels sets how many resolutions participate, Table size affects hash collisions, Anchor x/y moves the reference point, and Seed resamples the toy table.
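The lookup described above can be sketched in a few lines of NumPy. This is a toy 2D version under stated assumptions: the tables hold random values instead of trained features, the resolutions and table size are arbitrary, and the XOR-of-scaled-coordinates hash follows the general form of the Instant-NGP spatial hash without claiming to match any specific implementation. All function names are hypothetical.

```python
import numpy as np

PRIMES = (1, 2654435761)  # per-axis multipliers; treat the exact values as an assumption

def hash_vertex(ix, iy, table_size):
    """Map integer grid coordinates to a table slot via XOR of scaled coordinates."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % table_size

def hash_grid_encode(xy, tables, resolutions):
    """Encode points in [0, 1]^2: bilinear lookup per level, then concatenate."""
    xy = np.asarray(xy, dtype=np.float64)
    feats = []
    for table, res in zip(tables, resolutions):
        scaled = xy * res
        base = np.floor(scaled).astype(np.int64)       # lower-left cell vertex
        frac = scaled - base                           # position inside the cell
        level = np.zeros((xy.shape[0], table.shape[1]))
        for dx in (0, 1):
            for dy in (0, 1):
                idx = hash_vertex(base[:, 0] + dx, base[:, 1] + dy, table.shape[0])
                wx = frac[:, 0] if dx else 1.0 - frac[:, 0]
                wy = frac[:, 1] if dy else 1.0 - frac[:, 1]
                level += (wx * wy)[:, None] * table[idx]
        feats.append(level)
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
resolutions = [4, 8, 16]                  # toy levels
table_size, feat_dim = 64, 2              # small table, so collisions do occur
tables = [rng.uniform(-1.0, 1.0, size=(table_size, feat_dim)) for _ in resolutions]

queries = rng.uniform(0.0, 1.0, size=(1000, 2))
emb = hash_grid_encode(queries, tables, resolutions)
print(emb.shape)  # (1000, 6)

# Cosine similarity between an anchor embedding and every query embedding.
anchor = hash_grid_encode(np.array([[0.5, 0.5]]), tables, resolutions)[0]
cos_sim = emb @ anchor / (np.linalg.norm(emb, axis=1) * np.linalg.norm(anchor) + 1e-12)
```

Shrinking `table_size` makes more grid vertices share a slot, which is the collision effect the Table size control exposes. Adding entries to `resolutions` adds finer levels to the concatenated feature.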

Learned Spatial Encoding Examples

A compact browser-side demo of three learnable or task-adaptive encoding ideas: learnable Fourier features, learned relative bias, and a toy multi-resolution hash grid.

Tiny training loop: fitting a target signal

Purple is the target signal. Green is the learned Fourier model. The basis frequencies are trainable. Legend: Target, Prediction.

Learned frequencies

These are the current learned frequency magnitudes in cycles per unit interval.

Conclusion

If this article is reduced to one question, it is this: how do we let a model resolve local detail without globally mixing positions that should stay distinct? Fourier features were the first relatively systematic answer. They strengthened coordinate representations with multi-scale sinusoidal bases, but they also brought periodic global ambiguity into the model. Later PMPE designs, and engineering realizations such as Cube3D, continue along the same line: preserve Fourier's ability to express detail while making similarity structure better match actual spatial relationships.

Beyond that point, the question is no longer only which positional encoding to choose. It becomes which parts should remain hand-designed and which parts should be learned. The basis can be learned. Relative bias can be learned. Even the storage mechanism for space can be learned. At that stage, positional encoding is no longer just a fixed transform in front of coordinates. It becomes part of how a 3D model organizes space, exploits locality, and adapts to the task itself.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (Vol. 12346, pp. 405–421). Springer International Publishing. https://doi.org/10.1007/978-3-030-58452-8_24
Müller, T., Evans, A., Schied, C., & Keller, A. (2022). Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics (TOG), 41(4), 102:1–102:15. https://doi.org/10.1145/3528223.3530127
Roblox Research. (2025). Cube: A Roblox View of 3D Intelligence. arXiv Preprint arXiv:2503.15475. https://arxiv.org/abs/2503.15475
Zhao, H., Jiang, L., Jia, J., Torr, P. H. S., & Koltun, V. (2021). Point Transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 16259–16268.