Speeding Up Transforms, or How Everything Starts to Look Like a Nail

You know that feeling when you ignore a system for ages; you know the current implementation is bad, but it's good enough for your current situation, only for that same system to come and bite you in the arse later and force you to refactor large areas? Welcome to my transform system.

Tracy profiler showing a 190ms frame — A Tracy profile capture showing 190ms frame times, not enough scopes to identify why.

Each frame around 190ms. Lots of systems were generally destroying performance: the Transform update system, the instanced Material collection system, the asteroid spawning system. I'd been getting away with a rough implementation while the rest of the engine caught up, but with tens of thousands of asteroids orbiting a planet, each with their own hierarchy entry, it wasn't acceptable any more.

GatherMeshInstancesSimd before optimisation — Before: GatherMeshInstancesSimd at 207ms on the render thread.

GatherMeshInstancesSimd after optimisation — After: 760us. Same work, same entity count.

#Where it started

The original HierarchicalTransformSystem leaned on Arch's ECS query system. Eight component tag types, Transform_Level0 through Transform_Level7, each marking an entity's depth in the hierarchy. Each depth had its own query, processed in order: roots first, then children, then grandchildren.

public record struct Transform(LocalTransform Local, WorldTransform World) : IEcsComponent
{
    public ulong Version = 0;
    public TransformDirtyFlag DirtyFlag { get; set; }
}

public readonly record struct LocalTransform
{
    public Vector3 Position { get; init; }
    public Quaternion Rotation { get; init; }
    public Quaternion VisualRotation { get; init; }
    public Vector3 Scale { get; init; }
}

Transform.cs before the refactor

public record struct Transform_Level0;
public record struct Transform_Level1;
public record struct Transform_Level2;
public record struct Transform_Level3;
public record struct Transform_Level4;
public record struct Transform_Level5;
public record struct Transform_Level6;
public record struct Transform_Level7;

Eight tag components, one per hierarchy depth

That guaranteed parents were computed before their children needed them. Simple enough, but it had problems:

Maximum depth was hardcoded to eight levels.
Transform data lived in ECS archetypes, scattered across whichever chunk each entity landed in. Walking the hierarchy meant pointer-chasing between chunks.
Reparenting an entity meant removing and re-adding tagged components to shift it between depth queries.

#Pulling transforms out of the ECS

The first change was creating TransformHierarchy, a centralised manager that owns all transform data in flat, parallel arrays. That all moved into the hierarchy manager. The component shrank to an index:

public record struct Transform : IEcsComponent
{
    private TransformClusterIndex hierarchyIndex;
}

public readonly record struct LocalTransform
{
    internal readonly TransformClusterIndex hierarchyIndex;
    public Vector3 Position { get => Manager.GetLocalPosition(hierarchyIndex); }
    public Quaternion Rotation { get => Manager.GetLocalRotation(hierarchyIndex); }
    public Vector3 Scale { get => Manager.GetLocalScale(hierarchyIndex); }
}

After: Transform is just a handle into TransformHierarchy

LocalTransform and WorldTransform still exist as types, but their properties now proxy through to the hierarchy's arrays. Nothing is stored on the component itself.

This is a Structure-of-Arrays layout. Instead of each entity bundling its position, rotation, and scale together in a component, there's one array of all positions, one of all rotations, one of all scales. Seventeen arrays in total: local transforms, world transforms, dirty flags, parent indices, entity handles, version counters, and matrix caches.

Iterating one property across all entities (all world positions, say) now reads contiguous memory.

The raw state looked something like this:

private Vector3[] localPosition;
private Quaternion[] localRotation;
private Vector3[] localScale;
private Quaternion[] localVisualRotation;
private Vector3[] worldPosition;
private Quaternion[] worldRotation;
private Vector3[] worldScale;
private Matrix4x4[] worldMatrix;
private int[] matrixVersion;
private TransformDirtyFlag[] dirtyFlags;
private EntityFlags[] entityFlags;
private int[] depth;
private int[] parentIndex;
private int[] version;
private Entity[] entityHandle;

// Mirror buffers (render thread snapshot)
private Vector3[] worldPositionMirror;
private Quaternion[] worldRotationMirror;
private Vector3[] worldScaleMirror;
private Matrix4x4[] worldMatrixMirror;
private int[] matrixVersionMirror;
private int[] versionMirror;

I received some feedback here that using a more compacted TRS structure, AoS style, would perform just as well while being cleaner. Unfortunately I measured it and it was reliably 200us slower that way 😩. I'll revisit it at some point to see if it was a mistake on my end. For now: array spam.

But the arrays were flat, with no grouping by depth. I still needed to process parents before children. My first attempt was to sort all the arrays by the depth array, in descending order. That way the deepest children sat at the front and roots sat at the back. New entities almost always start at depth 0, so they'd append to the end without triggering a re-sort.

In the end, this was awful. It got me down to 100ms frames which while better, is no where near enough. Structural changes (reparenting, despawning) would result in a sort pass across every array that caused a massive blip on the framegraph. I needed a better solution.

#The clustered list

That solution was my ClusteredList.

CategorizedClusteredList<T> stores elements in fixed-size clusters of 128. Each cluster belongs to an integer category. For transforms, the category is the hierarchy depth. But the structure is generic; it only knows about categories and clusters.

#The master list

Clusters are doubly-linked in two overlapping chains:

A global list chains every active cluster, ordered by category.
Each category has its own sub-chain linking just its clusters.

Separate arrays store references to the head and tail clusters of each category, so accessing either end of any category's chain is O(1) without walking the global list.

CategorizedClusteredList master list structure

The global list gives category-ordered iteration, which for transforms means means you are able to walk across isolated transform depths. The per-category chains give targeted iteration over just one depth level. Both walk a linked list, touching only the clusters they need.

#Handles

Each element gets a ClusterIndex: a 64-bit value packing three fields.

Pool Index (32 bits)	Generation (25 bits)	Slot (7 bits)

Pool index identifies the cluster. Slot identifies the element's position within the cluster's 128-slot array. Generation is a version counter, incremented when a cluster gets freed and recycled. In debug builds, every access validates that the handle's generation matches the cluster's current one. Stale handles get caught immediately. We only need 7 bits for the slot with our clusters being only 128 slots.

#O(1) insertion

Adding an element needs a cluster with space in the right category. Searching the category chain for a non-full cluster would be O(n).

The fix is a simple enough. Within each category, partial clusters sit at the head and full clusters sit at the tail. Adding always targets the head. When an add fills a cluster to 128, that cluster is unlinked from the head position and re-linked at the tail. The next add hits the new head, or allocates a fresh cluster from the pool if none remain.

Cluster reordering: full cluster moves from head to tail

If the next add finds no partial cluster at the head, a fresh one is popped from the free pool (or allocated if the pool is empty), linked at the head of the category, and the element goes in at slot 0.

Cluster allocation: fresh cluster pulled from free pool into category chain

#O(1) removal, Swap-with-back removal

Removing an element from the middle of a cluster would leave a gap. Shifting elements to close it is O(n) and invalidates handles. Instead, the removed slot is overwritten by the last element in the cluster, and the count decrements.

Swap-with-back removal

One move, no shifting, no gaps. The cost is that the swapped element has changed position. Remove() returns the displaced element's new ClusterIndex so updating now stale ClusterIndexes is the responsibility of the caller. For TransformHierarchy, that means updating the entity-to-index dictionary.

If removal drops the cluster below full, it moves back to the head of its category chain, making it available for insertions again.

One thing to note. As far as ClusteredList is concerned the operation is O(1). Whether it actually is depends on the if/how you need to fixup stale keys as a result of the removal as the choice is passed onto the consumer. In my Transform system it results in dictionary write. In Material system documented later, the change is ignored.

#Compaction

When a cluster's count hits zero, it's unlinked from both chains and pushed onto the free pool stack. Its generation increments, invalidating any outstanding handles that might still reference the old contents. Recycling is O(1).

Cluster compaction: empty cluster returned to free pool

#Recategorisation

When an entity gets reparented, its depth changes, and it needs to move categories. Recategorize() handles this as a remove-from-old, add-to-new. It returns both the element's new index and the index of any element displaced by the swap.

For TransformHierarchy, reparenting is recursive. The entity moves, then each of its children must recategorise too. All seventeen parallel lists stay in lockstep.

#Everything in order

With categories as depths, iterating categories 0, 1, 2 gives breadth-first processing. The update loop walks all roots, then all their children, then grandchildren. At each depth, parent world transforms are already computed.

int maxDepth = localPosition.MaxCategory;

for (int depth = 0; depth <= maxDepth; depth++)
{
    foreach (ClusterRef cluster in localPosition.EnumerateCategory(depth))
    {
        Span<Vector3> positions = localPosition.GetBlockSpan(cluster);
        Span<Quaternion> rotations = localRotation.GetBlockSpan(cluster);

        for (int i = 0; i < cluster.Count; i++)
        {
            // Parent world data is already computed at this point.
            // Combine local and parent to produce world transform.
        }
    }
}

#Pushing it even further with parallelism

Within each category, clusters are 128-element contiguous spans. That's a natural unit for parallel dispatch; each cluster can be processed on its own thread with no synchronisation between them.

The job interface is minimal. You implement one method:

public interface IClusterJob
{
    void Execute(ClusterRef cluster);
}

Then hand it to ParallelClusteredListJob.Execute() along with an enumeration of clusters. The system collects every ClusterRef into a flat array, then schedules each cluster as its own individual job on the scheduler. One cluster, one job. The scheduler's work-stealing deques handle the rest: idle workers steal jobs from busy workers, so uneven cluster costs get absorbed automatically.

Parallel cluster dispatch across worker threads — Each cluster is scheduled as its own job. Workers pull from their own queues and steal from others when idle.

A ClusterRef from one list can index into any of the parallel lists since they're all structurally synchronised, so one dispatch drives access across all seventeen arrays.

If there's no scheduler available, or only one cluster to process, it falls back to sequential execution on the calling thread. No overhead for the simple case.

Going back to the transform system, the requirement there is stricter. Depth 0 must finish before depth 1 can start, depth 1 before depth 2, and so on; a parent's world transform has to be computed before any of its children read it. CategoryBarrieredParallelClusteredListJob handles this. It walks categories in order, dispatching all clusters within one category in parallel, then waits for them all to complete before moving to the next.

for (int cat = 0; cat <= list.MaxCategory; cat++)
{
    ParallelClusteredListJob.Execute(
        list.EnumerateCategory(cat), scheduler, ref job);
    // implicit barrier: Execute() waits for all jobs in this category
}

Category-barriered parallel dispatch for transform depths — Each depth level runs in parallel across workers, with a barrier before the next depth starts.

Within each depth, it's the same work-stealing dispatch as before. The barriers just make sure children never run ahead of their parents.

#Other applications

Another glaring issue in the original frame profile was GatherMeshesSimd. This method grabbed and CPU-frustum-culled all meshes while building instanced transform buffers. (The non-instanced path was equally problematic, but my asteroids were instanced).

I approached that problem in a similar way to the Transform issue. There is now a RenderableRegistry that uses the same structure. The category is no longer hierarchy depth but a unique index associated with each MaterialHost (a unique combination of Material+Mesh+Flags suitable as an instanced draw chunk). Iterating one category gives all renderables sharing a material, contiguous in memory and ready for instanced draw calls. This plays very nicely with SIMD as we can store the individual X/Y/Z/Radius values into separate arrays and load seamlessly into Vectors for my SIMD frustum cull.

I've also rolled it out to my AsteroidSpawnerSystem, for tracking spawned "cells" eligible for collection.

#What's next

I'm sure I'll find even more systems I can roll this out to. Practically any existing List where I do not need stable references or per-element ordering is a good candidate.

Once you have a good hammer, everything starts to look like a nail.