Understanding Positional Encoding in 3D
Introduction
Raw coordinates tell a model where a point is, but they do not tell it how to organize space. In 3D learning, that gap matters quickly. The network has to separate nearby points that differ in fine geometry, while also keeping distant points from collapsing into misleadingly similar representations. A good positional encoding has to serve both goals at once: local detail sensitivity and global spatial separation.
This is why positional encoding is more than a technical add-on in NeRFs, point-based models, and coordinate networks. It is the mechanism that turns a plain coordinate such as x or (x, y, z) into a feature space where geometry becomes easier to learn. The interesting question is not whether to encode position, but what kind of spatial structure the encoding imposes.
Fourier Features as the Baseline
Traditional Fourier positional encoding maps a coordinate through sine and cosine functions at multiple frequencies:

γ(x) = [ sin(2⁰πx), cos(2⁰πx), sin(2¹πx), cos(2¹πx), …, sin(2^(L−1)πx), cos(2^(L−1)πx) ]
The reason this works is straightforward. Low-frequency channels vary slowly and carry coarse spatial structure; high-frequency channels vary quickly and make small spatial differences more visible. Instead of asking the network to build multi-scale structure from raw coordinates alone, the encoding exposes that structure in the input representation.
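In NumPy, this expansion is only a few lines. A minimal sketch, following the common NeRF convention of a 2^k·π frequency ladder (the function name and frequency count are illustrative):

```python
import numpy as np

def fourier_encode(x, num_freqs=6):
    """NeRF-style positional encoding of scalar coordinates in [-1, 1].

    Returns [sin(2^0*pi*x), ..., sin(2^(L-1)*pi*x), cos(2^0*pi*x), ..., cos(2^(L-1)*pi*x)].
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi            # 2^k * pi ladder
    angles = np.outer(np.atleast_1d(x), freqs)             # (N, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 2L)

enc = fourier_encode(np.array([0.10, 0.11]), num_freqs=6)
d = np.abs(enc[0] - enc[1])
# Low-frequency channels barely separate the two nearby points;
# the highest-frequency channel reacts far more strongly.
print(d[0], d[5])
```

The per-channel differences make the multi-scale behavior concrete: the slow channels treat 0.10 and 0.11 as nearly the same position, while the fast channels pull them apart.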
This is also why the language of frequency appears so often in 3D papers. It does not mean that the object is oscillating. It only describes how fast the feature changes as the coordinate moves through space. A smooth body is easier to describe with slowly varying channels, while thin parts, sharp corners, and narrow boundaries need channels that react much more quickly.
The first demo should be read in two passes. The Positional Encoding panel shows how each coordinate is expanded across channels. The Dot-product Similarity panel shows how similar two encoded positions become after that expansion, which is especially relevant for attention-based models. This second view is where the baseline starts to reveal both its strength and its weakness.
Traditional Fourier / NeRF Positional Encoding
This demo shows how a 1D position x in [-1, 1] is expanded into many sine/cosine channels with different frequencies. The left plot shows the encoded values. The right plot shows how similar two positions look after encoding.
Vertical axis: input position x. Horizontal axis: encoding channel. White or dark line: the currently selected position.
Horizontal axis: position x_j. Vertical axis: position x_i. The cross shows the selected position.
This sparkline is the encoded feature vector of the selected x. Each point corresponds to one cosine or sine channel.
- More frequencies means more fine-scale variation, so nearby positions become easier to distinguish.
- The main diagonal in the similarity plot is bright because a position is always most similar to itself.
- Stripe-like off-diagonal patterns come from periodicity. They hint that some far-away positions can still look similar after encoding.
Fourier features are powerful because they are simple, multi-scale, and effective in practice. But they are also periodic. Since sine and cosine repeat, two distant positions can still produce partially similar channel patterns. In similarity plots, this appears as off-diagonal stripe structure: the encoding captures detail well, but it does not always separate global position cleanly. That tradeoff is the point of departure for PMPE.
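The periodic ambiguity can be checked numerically. Because every frequency in a 2^k·π ladder shares the base period of 2, the endpoints x = −1 and x = 1 encode to (nearly) identical vectors; this is a worst-case instance of the off-diagonal stripes. A small sketch, with the encoder rewritten inline:

```python
import numpy as np

def fourier_encode(x, num_freqs=4):
    # Same sin/cos expansion as before, at frequencies 2^k * pi.
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = np.outer(np.atleast_1d(x), freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

xs = np.linspace(-1, 1, 201)
E = fourier_encode(xs)
E /= np.linalg.norm(E, axis=1, keepdims=True)   # normalize -> cosine similarity
S = E @ E.T                                     # full similarity matrix

# x = -1 and x = 1 differ by 2, the period shared by every channel,
# so they are maximally ambiguous despite being maximally far apart.
i, j = np.argmin(np.abs(xs + 1.0)), np.argmin(np.abs(xs - 1.0))
print(S[i, j])   # close to 1
```

The same matrix `S` is what the similarity panel of the demo visualizes; the stripes are rows and columns where this kind of near-coincidence recurs.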
PMPE: Reducing Periodic Ambiguity
PMPE, short for Phase-Modulated Positional Encoding, starts from a precise critique of the Fourier baseline. If the encoding is built entirely from periodic functions, then some of the resulting ambiguity is structural rather than accidental. The goal is therefore not to discard sinusoidal features, but to preserve their sensitivity to fine detail while making the resulting similarity geometry more faithful to actual spatial relationships.
Conceptually, PMPE does this by adding a phase-modulated branch on top of the Fourier-style branch:

e(x) = e_F(x) + e_P(x)
The important point is architectural rather than symbolic. PMPE is an additive refinement. The Fourier part still carries multi-scale detail; the phase-modulated part reshapes how positions line up globally. In practice, the desired effect is not “more frequency,” but fewer accidental similarities between positions that are far apart.
That is exactly what the comparison demo emphasizes. Relative to the Fourier baseline, the diagonal structure stays strong, but the repeated off-diagonal stripes are reduced. The representation still supports local sensitivity, yet its global similarity pattern becomes less misleading.
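A toy version of the additive construction can be sketched directly. The cubic phase term and the weight `alpha` below are illustrative stand-ins, not the published PMPE formulation; the point is only that a position-dependent phase inside the sinusoids breaks the shared periodicity:

```python
import numpy as np

def fourier_branch(x, num_freqs=4):
    # Standard multi-frequency sin/cos expansion (periodic in every channel).
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    a = np.outer(np.atleast_1d(x), freqs)
    return np.concatenate([np.sin(a), np.cos(a)], axis=-1)

def phase_branch(x, num_freqs=4, alpha=0.7):
    # Toy phase modulation: an odd, nonlinear phase term alpha * x**3 shifts
    # each sinusoid by a position-dependent amount, so distant positions no
    # longer share a common period. (alpha and the cubic form are illustrative.)
    x = np.atleast_1d(x)
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    a = np.outer(x + alpha * x**3, freqs)
    return np.concatenate([np.sin(a), np.cos(a)], axis=-1)

def pmpe(x):
    # Additive combination: Fourier detail plus phase-modulated global structure.
    return fourier_branch(x) + phase_branch(x)

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# x = -1 and x = 1 share every Fourier period, so the baseline confuses them;
# the phase-modulated sum keeps them apart.
f = fourier_branch(np.array([-1.0, 1.0]))
p = pmpe(np.array([-1.0, 1.0]))
print(cos_sim(f[0], f[1]), cos_sim(p[0], p[1]))
```

The diagonal behavior is unchanged (a position still matches itself perfectly); only the spurious far-apart matches are suppressed, which is the tradeoff the comparison demo illustrates.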
Traditional Fourier vs PMPE
This demo compares the classic Fourier-style positional encoding with a phase-modulated version. The goal is to show how PMPE keeps structured variation while reducing misleading global periodic similarity.
Traditional Fourier / NeRF Encoding
Vertical axis: input position x. Horizontal axis: encoding channel.
Horizontal axis: position x_j. Vertical axis: position x_i.
Phase-Modulated Positional Encoding (PMPE)
Vertical axis: input position x. Horizontal axis: encoding channel.
Horizontal axis: position x_j. Vertical axis: position x_i.
- In the traditional Fourier version, repeated stripe-like off-diagonal patterns often appear in the similarity matrix because of periodicity.
- In the PMPE version, the similarity structure is usually more spatially coherent, meaning distant positions are less likely to look accidentally similar.
- The core goal of PMPE is not to remove detail, but to improve how global spatial separation is reflected in the encoded space.
For 3D models, that shift matters because spatial encoding is often downstream of neighborhood selection, similarity scoring, or attention. If distant positions look spuriously alike, the model can mix signals that should remain separate. PMPE is interesting because it attacks that failure mode directly without giving up the useful bias that made Fourier features attractive in the first place.
From PMPE to a Cube3D-Style Embedding
Once PMPE is stated as a design idea, the next question is what it looks like in an actual model. A Cube3D-style formulation makes the answer concrete by turning the additive intuition into a specific embedding function:

e(x) = e_F(x) + e_P(x)

with

e_F(x) = [ sin(Wx), cos(Wx) ],  e_P(x) = [ sin(ωx + ψ(x)), cos(ωx + ψ(x)) ]

where W is a learnable projection and ωx is the carrier term.
These two branches play different roles. The Fourier-like term keeps the logic of sinusoidal encoding, but makes the projection adaptive rather than fixed. The model is no longer locked to a hand-chosen basis; it can learn which spatial directions or frequency mixtures matter for the task. The phase-modulated term contributes a more structured global phase pattern, so the representation is not organized only by repeated periodic responses.
The additive combination is the key design choice. Cube3D-style PMPE does not try to replace Fourier features with a completely different mechanism. It keeps their expressive bias and then corrects their global failure mode with an additional phase structure. Seen this way, the method is best understood as an engineering realization of the PMPE idea rather than as a separate family.
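A simplified 1D sketch of such an embedding is shown below. The random `W` standing in for a trained projection, the linearly spaced carrier frequencies, and the cubic phase offset are all illustrative assumptions, not the model's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                                    # channels per branch (illustrative size)
W = rng.normal(scale=4.0, size=dim)        # learnable projection (random stand-in here)
omega = np.pi * np.arange(1, dim + 1)      # fixed carrier frequencies (assumed form)

def cube3d_style_embed(x):
    x = np.atleast_1d(x)[:, None]
    # Fourier-like branch: the frequency mixture W is a trained parameter,
    # not a hand-chosen 2^k ladder.
    fourier_like = np.concatenate([np.sin(x * W), np.cos(x * W)], axis=-1)
    # Phase-modulated branch: carrier omega*x plus a structured phase offset.
    phase = x * omega + 0.5 * x**3 * omega
    phase_mod = np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)
    # Additive combination, prefixed with the raw coordinate channel.
    return np.concatenate([x, fourier_like + phase_mod], axis=-1)

emb = cube3d_style_embed(np.linspace(-1, 1, 5))
print(emb.shape)   # (5, 17): 1 raw channel + 2 * dim combined channels
```

In a real model, `W` (and possibly the phase parameters) would receive gradients during training; here the random draw just makes the adaptive-basis structure visible.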
The demo below shows the same idea in a simplified 1D setting. The first panel still shows channel responses; the second shows how the resulting similarity field is organized around a point. The goal is not to reproduce a full research implementation, but to make the interaction between learnable projection and phase structure visible.
Cube3D-Style PMPE
This demo visualizes a simplified 1D version of a Cube3D-style phase-modulated positional encoding. It combines a learnable Fourier-like branch with a structured phase branch, then adds them together to form the final embedding.
Vertical axis: input position x. Horizontal axis: encoding channel. The highlighted line marks the selected position.
Horizontal axis: position x_j. Vertical axis: position x_i. The cross shows the selected position.
This sparkline shows the full embedding vector of the selected x, including the raw coordinate channel and the combined cosine/sine channels.
- Frequencies controls how many combined channels are used.
- Seed changes the learnable Fourier-like projection, which slightly changes the detailed stripe pattern.
- The main idea is that the final embedding is formed by adding a Fourier-like branch and a phase-modulated branch together.
What Changes Once the Encoding Becomes Learnable
PMPE still lives close to the Fourier worldview: coordinates are pushed through structured functions, and the main question is how to shape those functions better. A broader shift happens once the encoding itself becomes explicitly learnable. At that point, the design space opens along several axes at once: fixed basis versus learned basis, absolute coordinates versus relative offsets, and function-based descriptions of space versus memory-based ones.
The last demo is useful precisely because it does not treat these ideas as one monolithic method. It shows three different ways in which “position encoding” can become task-adaptive, and each tab exposes a different object that the model is allowed to learn.
Learnable Fourier Features
The smallest step away from hand-designed encoding is to let the Fourier projection be learned. Instead of fixing the frequencies in advance, the model adapts the basis to the target signal. This keeps the multi-scale flavor of sinusoidal features, but turns the frequency basis from prior knowledge into trainable structure.
In the demo, this idea is implemented as a tiny browser-side training loop. The purple curve is a fixed target signal made from a few sinusoids of different frequencies, while the green curve is a learnable model of the form

f(x) = Σᵢ aᵢ cos(ωᵢ x + φᵢ)

What is trainable here is not only the coefficient aᵢ, but also the frequency ωᵢ and phase φᵢ of each basis element. Basis count changes how many learnable cosine terms the model has, Learning rate changes the size of each gradient step, and Seed changes the random initialization. The Pause and Reset controls make it easier to see that the model is literally being optimized in real time rather than replaying a precomputed animation.
The frequency chips on the right are also part of the explanation. They show the current magnitudes of the learned frequencies, so the point of the tab is not only that the green curve gets closer to the purple one. It is that the basis itself is moving during training.
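The same kind of training loop can be sketched outside the browser. The target signal, basis count, and learning rate below are illustrative choices; the essential point is that the frequencies `w` receive gradients just like the amplitudes:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 256)
# Fixed target built from a few sinusoids (stands in for the purple curve).
target = np.sin(2 * np.pi * xs) + 0.5 * np.sin(5 * np.pi * xs)

K = 8                                      # basis count
a = rng.normal(scale=0.1, size=K)          # amplitudes   (trainable)
w = rng.uniform(0.5, 8.0, size=K) * np.pi  # frequencies  (trainable: the key point)
p = rng.uniform(-np.pi, np.pi, size=K)     # phases       (trainable)

def model(a, w, p):
    return np.cos(np.outer(xs, w) + p) @ a

lr = 0.02
mse0 = np.mean((model(a, w, p) - target) ** 2)
for step in range(3000):
    angles = np.outer(xs, w) + p           # (N, K)
    basis = np.cos(angles)
    err = basis @ a - target               # (N,)
    # Analytic gradients of the mean squared error w.r.t. a, w, p.
    da = basis.T @ err / len(xs)
    dsin = -np.sin(angles) * a             # d pred / d angle, per basis element
    dw = (dsin * xs[:, None]).T @ err / len(xs)
    dp = dsin.T @ err / len(xs)
    a -= lr * da; w -= lr * dw; p -= lr * dp
mse1 = np.mean((model(a, w, p) - target) ** 2)
print(mse0, mse1)   # the error shrinks as the basis itself moves
```

Printing `np.abs(w)` before and after the loop shows the analogue of the frequency chips: the learned magnitudes drift away from their random initialization toward the structure of the target.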
Relative Position Bias
Another shift is to encode relative position rather than absolute location. In many geometric models, what matters first is not where a point sits in the world, but how it is displaced from another point. A learnable relative encoding lets the model shape that local bias field directly, which is especially natural for local attention and neighborhood-based 3D processing.
The relative-bias tab is therefore not showing points in world space. Its axes are (dx, dy): relative offsets around a reference point. The heatmap is a toy learned bias field over those offsets. Brighter or darker regions indicate that some relative displacements are preferred, suppressed, or treated differently, even though the absolute coordinates never enter the picture.
This tab is intentionally conceptual rather than train-from-data. Under the hood, the field is built from a rotated anisotropic Gaussian-like core plus a broader ring term and a directional oscillation. That is enough to mimic the kind of shaped locality pattern a learned relative bias might produce. Locality widens or narrows the useful neighborhood, Anisotropy changes how stretched the field is across directions, and Rotation turns the preferred orientation. The point of the demo is to make one idea visually unavoidable: a relative encoding is a function of displacement, not of absolute position.
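A sketch of such a field, following the construction described above; all constants, the function name, and the parameter names (`locality`, `aniso`, `rot`) are illustrative:

```python
import numpy as np

def relative_bias(dx, dy, locality=0.5, aniso=2.0, rot=0.6):
    """Toy learned-style bias over relative offsets (dx, dy), never absolute positions."""
    # Rotate the offset into the field's preferred frame.
    u =  np.cos(rot) * dx + np.sin(rot) * dy
    v = -np.sin(rot) * dx + np.cos(rot) * dy
    # Rotated anisotropic Gaussian-like core.
    core = np.exp(-(u**2 * aniso + v**2 / aniso) / (2 * locality**2))
    # Broader ring term around the core.
    r = np.hypot(dx, dy)
    ring = 0.3 * np.exp(-((r - 2 * locality) ** 2) / (2 * locality**2))
    # Directional oscillation across angles.
    osc = 0.1 * np.cos(4 * np.arctan2(dy, dx))
    return core + ring + osc

# Evaluate over a grid of displacements, as the heatmap does.
g = np.linspace(-1, 1, 65)
DX, DY = np.meshgrid(g, g)
field = relative_bias(DX, DY)
```

Because the function takes only `(dx, dy)`, translating both points by the same amount cannot change the bias, which is exactly the property the tab is built to make visually unavoidable.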
Hash Grids as Spatial Memory
Hash-grid encodings move even further from the original Fourier picture. Instead of describing a point only through waves, they store learned features in multiple grids at different resolutions and retrieve them through interpolation. The representation becomes partly a memory system: local, multi-scale, and parameter-efficient.
The hash-grid tab makes that memory-like behavior explicit. It picks an anchor point, encodes that point through several grid levels, and then compares the anchor feature with the feature of every other queried location. The cross marks the anchor. The heatmap color shows cosine similarity between the anchor embedding and the queried embedding.
Implementation-wise, the demo uses a deliberately simplified multi-resolution hash grid. Each level has a different resolution, each grid vertex is mapped through a hash function into a finite table, the table stores a short learned-style feature vector, and the queried point bilinearly interpolates the four neighboring entries. The final feature is the concatenation of those interpolated vectors across levels. Levels changes how many resolutions participate, Table size changes the capacity of the hashed storage and therefore the collision pressure, Anchor x/y moves the query center, and Seed redraws the random table values.
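A minimal version of that lookup can be sketched as follows; the specific hash function, level sizes, and random table values standing in for learned features are all illustrative, not the demo's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
LEVELS, TABLE, FDIM = 4, 64, 2
# Random features stand in for learned ones; in training these would get gradients.
tables = rng.normal(size=(LEVELS, TABLE, FDIM))

def hash_corner(ix, iy, level):
    # Simple integer hash into the finite table (illustrative choice of primes).
    return (ix * 73856093 ^ iy * 19349663 ^ level * 83492791) % TABLE

def encode(x, y):
    """Bilinear lookup at each resolution level, concatenated across levels."""
    feats = []
    for level in range(LEVELS):
        res = 4 * 2**level                        # per-level grid resolution
        gx, gy = x * res, y * res
        x0, y0 = int(np.floor(gx)), int(np.floor(gy))
        tx, ty = gx - x0, gy - y0                 # interpolation weights in the cell
        f = 0.0
        for dx, wx in ((0, 1 - tx), (1, tx)):
            for dy, wy in ((0, 1 - ty), (1, ty)):
                f = f + wx * wy * tables[level, hash_corner(x0 + dx, y0 + dy, level)]
        feats.append(f)
    return np.concatenate(feats)                  # shape (LEVELS * FDIM,)
```

Shrinking `TABLE` raises collision pressure, so distant cells start sharing entries; that is the capacity tradeoff the Table size control exposes.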
Taken together, these tabs show three different ways to make spatial encoding adaptive. Learnable Fourier features modify the basis, relative encodings modify the frame of reference, and hash grids modify the storage mechanism itself. The demo is not trying to reproduce a full training pipeline for all three. It is trying to isolate the object that each family learns, so the design differences are easier to see.
Learned Spatial Encoding Examples
A compact browser-side demo of three learnable or task-adaptive encoding ideas: learnable Fourier features, learned relative bias, and a toy multi-resolution hash grid.
Purple is the target signal. Green is the learned Fourier model. The basis frequencies are trainable.
These are the current learned frequency magnitudes in cycles per unit interval.
Conclusion
The arc from Fourier features to PMPE to Cube3D-style embeddings is really an arc about spatial inductive bias. Fourier features give coordinate networks a strong multi-scale prior, but periodicity leaves them vulnerable to global ambiguity. PMPE keeps the useful part of that prior and adds phase structure to make similarity geometry cleaner. Cube3D-style formulations show how that conceptual move becomes a practical embedding design.
Once the encoding becomes learnable, the space of options gets wider. The model can learn its basis, its relative bias field, or even a multi-resolution spatial memory. At that point, positional encoding in 3D is no longer one fixed recipe. It is a family of choices about how local detail, global organization, and task adaptation should interact.