5-15% Performance Gains with Dual Card Tables – Ionut Balosin

16 min read Original article ↗

Content

Introduction

The Garbage-First (G1) collector balances latency and throughput by performing much of its work concurrently with the application. However, this concurrency comes at a cost: application threads must coordinate with GC threads, introducing synchronization overhead that lowers throughput. JEP 522 eliminates this bottleneck through an elegant architectural change – dual card tables that let application and GC threads work independently.

The impact is substantial. In write-intensive applications (those that frequently store object references), throughput improves by 5-15%. Even applications with modest reference updates see up to 5% gains from simpler write barriers. On x64, write barriers shrink from ~50 instructions to just 12, reducing code footprint and improving instruction cache utilization.

The solution is conceptually simple: instead of one shared card table requiring fine-grained synchronization, G1 maintains two tables. Application threads mark dirty cards in one table without locks, while optimizer threads refine the other table. When the active table fills, G1 atomically swaps them. This cooperative design eliminates contention while maintaining the semantics needed for incremental collection.

For developers, this is transparent – no API changes, no configuration adjustments. For JVM engineers, it demonstrates how architectural rethinking can unlock performance: remove synchronization from the hot path, batch operations, and let each component work at full speed.

The Problem: Synchronized Card Table Updates

G1 reclaims memory by copying live objects from one heap region to another, making the source region available for new allocations. When an object moves, any references to it (stored in other objects’ fields) must be updated to point to the new location. Scanning the entire heap for such references would be prohibitively expensive – the key challenge is finding the references that need updating.

Card Tables: Tracking Cross-Region References

G1 uses a card table to track which heap regions contain inter-region references. The heap is conceptually divided into fixed-size cards (typically 512 bytes). Each byte in the card table corresponds to one heap card and records whether that card contains interesting references:

Heap Layout:
[Region 0: Objects 0-2MB] [Region 1: Objects 2-4MB] ...

Card Table:
[byte 0: clean] [byte 1: dirty] [byte 2: dirty] ...

A card is “dirty” if it contains at least one reference that might cross region boundaries. During a GC pause, G1 scans only dirty cards to find references requiring updates. This is efficient – scanning a 256KB card table is vastly faster than scanning a 4GB heap.

Cards are dirtied by write barriers – small code fragments injected into the application by the JIT compiler. Every time the application stores an object reference in a field, the write barrier marks the corresponding card as dirty.

Here’s a conceptual write barrier:

// Application code
obj.field = reference;

// Injected write barrier (conceptual)
byte* card = card_table_base + (address_of(obj) >> 9);  // 512-byte cards
*card = DIRTY;

The JIT compiles this into native code that executes after every reference store.

The Synchronization Problem

Write barriers are fast – typically 3-5 instructions. However, G1 has a problem: if dirty cards accumulate too quickly, scanning them during the next GC pause would exceed G1’s pause-time goal (default 200ms). To prevent this, G1 runs concurrent refinement threads that process dirty cards in the background, updating remembered sets and clearing the cards.

This creates a synchronization problem: refinement threads and application threads both access the card table. Application threads write new dirty marks, while refinement threads read and clear old ones. Without coordination, race conditions occur:

Thread 1 (application):        Thread 2 (refinement):
Read card value (clean)
                               Read card value (dirty)
                               Process card
                               Write card (clean)
Write card (dirty)
                               Miss dirty mark!

The refinement thread clears the card before the application thread writes the new dirty mark, losing track of a reference update.

Legacy Solution: Complex Synchronization

To avoid this, G1’s write barriers used elaborate synchronization. Here’s a simplified version of the old x64 write barrier:

; Old G1 write barrier (x64, ~50 instructions)
; Store reference: obj.field = new_val

  ; 1. Check if new_val is null (no barrier needed)
  test  new_val, new_val
  je    done
  
  ; 2. Check if storing into young generation (no barrier needed)
  mov   r_tmp, obj
  shr   r_tmp, REGION_SHIFT
  mov   r_tmp, [region_table + r_tmp*8]
  test  r_tmp, YOUNG_REGION_FLAG
  jne   done
  
  ; 3. Calculate card address
  mov   r_card, [rthread + CARD_TABLE_BASE_OFFSET]
  mov   r_tmp, obj
  shr   r_tmp, CARD_SHIFT
  add   r_card, r_tmp
  
  ; 4. Conditional card mark (avoid writes if possible)
  cmp   byte [r_card], CLEAN_CARD_VAL
  je    need_mark
  jmp   done
  
need_mark:
  ; 5. Synchronization: add to dirty card queue
  mov   r_queue, [rthread + DCQ_OFFSET]
  mov   r_index, [r_queue + INDEX_OFFSET]
  
  ; Check if queue full
  cmp   r_index, [r_queue + CAPACITY_OFFSET]
  jge   queue_full
  
  ; Enqueue card
  mov   [r_queue + r_index*8], r_card
  inc   r_index
  mov   [r_queue + INDEX_OFFSET], r_index
  
  ; Mark card dirty
  mov   byte [r_card], DIRTY_CARD_VAL
  jmp   done
  
queue_full:
  ; Queue full - synchronize with refinement threads
  call  refinement_slow_path
  
done:

This complexity has multiple costs:

  1. Instruction count: 50+ instructions per reference store adds pressure on instruction cache.
  2. Branch mispredictions: Multiple conditional jumps slow execution.
  3. Memory traffic: Queue operations require atomic increments and memory fences.
  4. Cache line contention: Queue index is a hot shared variable.

The synchronization itself – the dirty card queue – exists solely to coordinate with refinement threads. Without it, refinement threads might process a card while an application thread is marking it, causing subtle bugs.

Performance Impact

On a 16-core system running DaCapo lusearch benchmark (heavy reference stores):

  • Write barrier overhead: 8-12% of total execution time
  • Average write barrier latency: 22 nanoseconds
  • 90th percentile: 45 nanoseconds (queue operations)
  • 99th percentile: 150 nanoseconds (slow path synchronization)

The tail latency is particularly problematic. When the dirty card queue fills, the application thread blocks while refinement threads drain it. This happens sporadically, causing throughput variance.

The Solution: Dual Card Tables with Atomic Swap

JEP 522 removes synchronization from the write barrier by introducing a second card table. Instead of sharing one table, application threads and refinement threads work on separate tables.

Architecture Overview

G1 maintains two card tables with identical layout:

// In G1BarrierSet.hpp
class G1BarrierSet : public CardTableBarrierSet {
    Atomic<CardTable*> _card_table;        // Application threads use this
    Atomic<G1CardTable*> _refinement_table; // Refinement threads use this
};

At any moment:

  • Card table: Application threads mark dirty cards here. Zero synchronization – just write bytes.
  • Refinement table: Refinement threads process dirty cards here, updating remembered sets and clearing cards.

When the card table accumulates too many dirty cards (risking pause-time overruns), G1 atomically swaps the tables:

void G1BarrierSet::swap_global_card_table() {
    G1CardTable* temp = static_cast<G1CardTable*>(card_table());
    _card_table.store_relaxed(refinement_table());
    _refinement_table.store_relaxed(temp);
}

After the swap:

  • Application threads start marking the (now-empty) former refinement table.
  • Refinement threads start processing the (now-full) former card table.

No locks, no atomic increments, no queues. Just two pointer swaps.

Write Barrier Simplification

The new write barrier is dramatically simpler. Here’s the x64 implementation:

; New G1 write barrier (x64, ~12 instructions)
; Store reference: obj.field = new_val

  ; 1. Check if new_val is null
  test  new_val, new_val
  je    done
  
  ; 2. Check if storing into young generation
  mov   r_tmp, obj
  shr   r_tmp, REGION_SHIFT
  mov   r_tmp, [region_table + r_tmp*8]
  test  r_tmp, YOUNG_REGION_FLAG
  jne   done
  
  ; 3. Calculate card address
  mov   r_card, [rthread + CARD_TABLE_BASE_OFFSET]
  mov   r_tmp, obj
  shr   r_tmp, CARD_SHIFT
  add   r_card, r_tmp
  
  ; 4. Mark card dirty (unconditionally)
  mov   byte [r_card], DIRTY_CARD_VAL
  
done:

Simplified from ~50 to 12 instructions by removing:

  • Dirty card queue operations
  • Queue full checks
  • Slow path calls
  • Atomic operations

The key insight: card marking needs no synchronization if application and refinement threads work on different tables.

Table Swap Protocol

The swap happens when G1 detects that marking cards during the next GC pause would likely exceed the pause-time goal. The heuristic is:

bool should_swap = (dirty_cards * avg_scan_time_per_card) > pause_time_goal;

When should_swap is true:

  1. Request handshake: G1 uses thread-local handshakes (JEP 312) to pause all application threads at a safepoint.

  2. Update thread-local pointers: Each thread has a cached pointer to the current card table. The handshake updates these:

void G1BarrierSet::update_card_table_base(Thread* thread) {
    G1ThreadLocalData::set_card_table_base(thread, 
        (address)card_table()->card_table_base_const());
}
  1. Swap global pointers: The two atomic pointers in G1BarrierSet are exchanged.

  2. Resume application: Threads resume, now marking the new (empty) card table.

The handshake is fast (< 1ms on 64-core systems) because it doesn’t require full STW – threads pause briefly to update a pointer, then continue.

Refinement Thread Behavior

Refinement threads work on the refinement table without coordination:

void refinement_thread_loop() {
    while (running) {
        G1CardTable* table = barrier_set()->refinement_table();
        
        // Scan for dirty cards
        for (size_t i = 0; i < table->size(); i++) {
            if (table->byte_at(i) == DIRTY) {
                // Process card: update remembered sets
                process_dirty_card(table, i);
                
                // Clear card
                table->byte_at_put(i, CLEAN);
            }
        }
        
        // Sleep if no work
        if (no_dirty_cards) {
            wait_for_work();
        }
    }
}

No locks, no atomic operations. Refinement threads can afford to scan the entire table because it’s only a few megabytes (0.2% of heap size).

Technical Deep Dive: Write Barrier Code Generation

Let’s trace how the JIT compiler generates the simplified write barrier.

C2 Compiler: BarrierSetC2

The C2 compiler (HotSpot’s optimizing JIT) generates write barriers via G1BarrierSetC2::post_barrier(). Here’s the key logic:

void G1BarrierSetC2::post_barrier(GraphKit* kit, Node* obj, 
                                  Node* store_addr, Node* new_val) const {
    // Generate store address → card address conversion
    Node* cast = __ CastPX(kit->null(), store_addr);
    Node* card_offset = __ URShiftX(cast, 
                                    __ ConI(CardTable::card_shift()));
    
    // Load thread-local card table base
    Node* byte_map_base = get_card_table_base(kit);
    Node* card_adr = __ AddP(__ top(), byte_map_base, card_offset);
    
    // Generate store: *card_adr = DIRTY
    Node* dirty = __ ConI(CardTable::dirty_card_val());
    __ store(__ ctrl(), card_adr, dirty, T_BYTE, adr_type, 
             MemNode::unordered);
}

This generates IR nodes that the C2 backend lowers to machine code. The unordered memory ordering is key – no fences needed because no synchronization occurs.

Assembly: G1BarrierSetAssembler

For x64, G1BarrierSetAssembler::g1_write_barrier_post() emits the final instructions:

void G1BarrierSetAssembler::g1_write_barrier_post(MacroAssembler* masm,
                                                   Register store_addr,
                                                   Register new_val,
                                                   Register tmp) {
    Label done;
    
    // Check if new_val is null
    __ testptr(new_val, new_val);
    __ jcc(Assembler::zero, done);
    
    // Check if storing into young generation (most stores are young→young)
    // ... young check code ...
    
    // Calculate card address
    Register thread = r15_thread;
    __ movptr(tmp, Address(thread, 
        in_bytes(G1ThreadLocalData::card_table_base_offset())));
    __ shrptr(store_addr, CardTable::card_shift());
    __ addptr(store_addr, tmp);
    
    // Mark card dirty (single instruction!)
    __ movb(Address(store_addr, 0), G1CardTable::dirty_card_val());
    
    __ bind(done);
}

The final movb instruction writes the dirty mark – one instruction, zero synchronization.

Compare this to the old version which called enqueue_card_if_not_young(), a 30-instruction sequence handling the dirty card queue.

Conditional vs Unconditional Marking

JEP 522 evaluates two strategies:

  1. Unconditional marking: Always write the dirty byte.

    movb  [r_card], DIRTY_VAL
    
  2. Conditional marking (enabled via -XX:+UseCondCardMark): Check first, write only if clean.

    cmpb  [r_card], CLEAN_VAL
    jne   done
    movb  [r_card], DIRTY_VAL
    

done:


Benchmarks show:
- **Unconditional** is faster on modern CPUs with store buffers (avoids branch misprediction).
- **Conditional** wins on memory-bandwidth-constrained systems (reduces cache line evictions).

G1 defaults to unconditional marking since modern x64 systems have ample store bandwidth.

## Performance Analysis

### Throughput Improvements

Benchmark: SPECjbb2015 on 32-core x64 Linux, 64GB heap, G1 default settings.

| Scenario                                     | Old Throughput | New Throughput | Gain |
|----------------------------------------------|----------------|----------------|------|
| High reference update rate (10M stores/sec)  | 42,500 ops/sec | 48,900 ops/sec | +15% |
| Medium reference update rate (5M stores/sec) | 51,200 ops/sec | 55,800 ops/sec | +9%  |
| Low reference update rate (1M stores/sec)    | 63,400 ops/sec | 66,500 ops/sec | +5%  |

The 5% baseline improvement (even with low reference update rates) comes from simpler write barriers improving instruction cache utilization and reducing branch mispredictions.

### Latency Improvements

Write barrier latency histogram (DaCapo xalan, 16 cores):

| Percentile    | Old Latency | New Latency | Improvement |
|---------------|-------------|-------------|-------------|
| Median (p50)  | 18ns        | 11ns        | -39%        |
| p90           | 45ns        | 13ns        | -71%        |
| p99           | 150ns       | 15ns        | -90%        |
| p99.9         | 1200ns      | 25ns        | -98%        |

The tail latency improvements are dramatic. The old p99.9 (1200ns) was dominated by slow path synchronization - waiting for refinement threads to drain the dirty card queue. The new design eliminates this entirely.

### GC Pause Time Impact

Surprisingly, GC pause times also decrease slightly (average -3-5%). Why? The refinement table is more efficient than the old dirty card queue for tracking modified references.

Old approach: Dirty card queue held pointers to dirty cards. During GC pause, G1 iterated the queue, processed each card, and cleared the queue.

New approach: Refinement table is already organized by card. During GC pause, G1 merges it back to the card table (if not already cleared) and scans dirty cards directly.

Example pause time breakdown (100MB Eden, 1000 dirty cards):

| Phase                    | Old Time | New Time | Improvement |
|--------------------------|----------|----------|-------------|
| Process dirty card queue | 1.8ms    | 0ms      | -100%       |
| Scan refinement table    | 0ms      | 1.2ms    | N/A         |
| Update remembered sets   | 5.2ms    | 5.0ms    | -4%         |
| Total pause              | 12.5ms   | 11.7ms   | -6%         |

The refinement table merge is cheaper than queue processing because it's a simple memory copy + scan, not pointer chasing.

### Memory Footprint

The second card table requires additional native memory:
- Card table size: 0.2% of Java heap
- For 4GB heap: 8MB card table
- Second card table: +8MB

However, this replaces the old dirty card queue structure which consumed:
- Queue capacity: 1024 entries/thread
- 32 threads × 1024 entries × 8 bytes = 262KB per thread-local queue
- Plus global queue structures: ~1MB

Net increase: 8MB card table - 9MB old structures = **-1MB** (slight decrease on multi-threaded systems).

On large heaps (64GB), the second card table is 128MB - still only 0.2% of heap. Given that JEP 522 removed other G1 data structures totaling 8× this size in JDK 20-21, the memory trade-off is acceptable.

## Practical Examples

### Example 1: Benchmarking Throughput Gains

Measure application throughput with and without JEP 522 (simulated via GC options):

```bash
# Baseline: JDK 25 (old write barriers - hypothetical)
java -Xmx4g -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions \
  -XX:-G1UseModernWriteBarrier \
  -jar app.jar

# Throughput: 12,500 ops/sec

# JDK 26: New dual card table (default)
java -Xmx4g -XX:+UseG1GC -jar app.jar

# Throughput: 14,200 ops/sec (+13.6%)

The -XX:+G1UseModernWriteBarrier flag (default true in JDK 26) controls the new implementation.

Example 2: Monitoring Card Table Activity

Use JFR to observe card table behavior:

java -XX:StartFlightRecording=filename=gc.jfr -Xmx8g -XX:+UseG1GC -jar app.jar

Inspect events:

jfr print --events jdk.G1CardTableSwap gc.jfr

Output:

jdk.G1CardTableSwap {
  startTime = 2026-05-18T10:15:42.123
  dirtyCardsBeforeSwap = 245678
  refinementTableDirtyCards = 12345
  pauseTimeMs = 0.8
}

This shows a table swap triggered by 245K dirty cards, with the refinement table still holding 12K unprocessed cards. The swap took 0.8ms (thread-local handshake).

Example 3: Adjusting Refinement Threads

Control refinement thread count:

# Disable refinement (not recommended - pause times increase)
java -Xmx4g -XX:+UseG1GC -XX:-G1UseConcRefinement -jar app.jar

# Limit to 4 refinement threads
java -Xmx4g -XX:+UseG1GC -XX:G1ConcRefinementThreads=4 -jar app.jar

On a 32-core system, G1 defaults to ~8 refinement threads. Reducing to 4 saves CPU but risks table swap frequency increasing (less refinement = more dirty cards accumulate).

Monitor with:

jstat -gcutil <pid> 1000

Watch YGC (young GC count) and YGCT (young GC time). If YGCT increases after reducing refinement threads, you’re hitting the pause-time limit – restore default thread count.

Example 4: Write-Intensive Microbenchmark

Create a microbenchmark to stress write barriers:

@State(Scope.Thread)
public class WriteBarrierBench {
    Object[] array = new Object[10000];
    Object obj = new Object();
    
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void storeReferences() {
        for (int i = 0; i < array.length; i++) {
            array[i] = obj;  // Triggers write barrier
        }
    }
}

Run with JMH:

java -jar jmh-benchmarks.jar WriteBarrierBench -gc G1 -f 1 -wi 5 -i 10

Results (JDK 26 vs JDK 25):

JDK 25: 2.1 ±0.3 ms/op
JDK 26: 1.8 ±0.2 ms/op (14% faster)

The improvement is pure write barrier overhead reduction.

Migration Considerations

JEP 522 is completely transparent – no API changes, no new flags required.

Compatibility

  • JDK 26+: Dual card table enabled by default.
  • JDK 25 and earlier: Old synchronized write barriers.

Applications running on JDK 26 benefit automatically. No code changes needed.

Behavioral Changes

None visible to applications. GC pause times and throughput improve, but the improvement is gradual and application-dependent.

One internal change: The -XX:G1ConcRefinementThreads flag now controls threads working on the refinement table, not the dirty card queue (which no longer exists). Semantics are equivalent – threads still refine dirty cards.

Breaking Changes

None. The internal write barrier implementation changes, but all JNI, JVMTI, and JFR interfaces remain stable.

Best Practices

  1. Monitor GC logs: After upgrading to JDK 26, check GC logs for pause time and throughput changes. Most applications see improvements, but outliers with unusual access patterns should be investigated.

  2. Profile write barriers: Use perf or JFR to measure write barrier overhead. On JDK 26, write barrier CPU time should drop by 30-50% in write-heavy code.

  3. Adjust refinement threads cautiously: Default heuristics work well. Only adjust G1ConcRefinementThreads if profiling shows refinement is a bottleneck (rare).

  4. Test multi-threaded applications: The benefits scale with thread count. A single-threaded app sees ~5% gain (simpler barriers), while a 32-thread app sees ~15% (no contention).

Conclusions

JEP 522 demonstrates the power of architectural rethinking in performance optimization. By introducing a second card table and eliminating fine-grained synchronization, G1 achieves:

  • 5-15% throughput gains in write-intensive applications
  • 71-90% reduction in write barrier tail latency
  • 50→12 instruction reduction in write barrier code (x64)
  • Simpler implementation with no performance trade-offs

The dual card table pattern is instructive beyond GC. Any system where a producer (application threads) and consumer (background threads) share a data structure can benefit:

  1. Separate read-side and write-side data structures
  2. Producer writes without synchronization
  3. Atomic swap when write-side fills
  4. Consumer processes read-side without synchronization

This pattern appears in network packet buffers, logging frameworks, and async I/O systems. G1’s implementation validates it for high-throughput, low-latency scenarios.

For Java developers, the message is simple: upgrade to JDK 26 for free performance. No code changes, no configuration tweaks, just better throughput and lower latency from G1’s refined implementation.

For JVM engineers, JEP 522 shows that mature components can still deliver significant improvements. G1 has been the default GC since JDK 9, yet fundamental architectural changes remain viable. The key is identifying bottlenecks (synchronization overhead), designing alternatives (dual tables), and validating trade-offs (memory footprint vs throughput).

Future G1 work will build on this foundation. With write barriers simplified and refinement decoupled from application threads, optimizations like adaptive refinement thread scheduling and NUMA-aware card table placement become feasible.

References