Performance Notes
Allocation-free hot paths
After Julia compilation warmup, the following are designed to run without heap allocations:
G(...)scalar callsG!(out, ...)scalar calls into a preallocatedRef{ComplexF64}G_batch!(out, ...)batched calls into a preallocated output vector
For high-throughput use, prefer G_batch! and reuse all buffers.
Recommended default: non-Cint integers with zero allocations (after warmup)
The C ABI expects m/len as Cint (Int32) for correctness. For convenience, HandyG.jl accepts Vector{Int} (or other integer types) and converts them internally.
To keep this conversion allocation-free in steady state:
- Make the integer arrays persistent (e.g.
constglobals or reused buffers) - Call the relevant function once during initialization to let internal scratch buffers size
Example (condensed form):
using HandyG
const m = [1, 2] # Vector{Int}
const z = [1.0, 0.5]
const y = 0.3
# warmup (compiles + sizes internal scratch)
G(m, z, y)
# steady-state: no allocations from the `m` conversion
@assert (@allocated G(m, z, y)) == 0Example (batch len):
using HandyG
const len = [4, 3] # Vector{Int}
# warmup
out = Vector{ComplexF64}(undef, 2)
g = zeros(Float64, 4, 2)
G_batch!(out, g, len)
# steady-state
@assert (@allocated G_batch!(out, g, len)) == 0Notes:
- If the internal scratch buffer needs to grow (e.g. you later pass a longer
m/len), a one-time allocation can occur at that moment. Pre-warm with the maximum expected length to avoid this. - For absolute clarity and predictability in tight loops, passing
Cint[...]/Vector{Cint}is still the “gold standard”, but the above pattern is a good default for ergonomic code.
Alternative (fastest): preconvert once to Cint
If m/len are reused many times and rarely change, the lowest-overhead approach is to convert them once up front and then call the Cint methods directly:
using HandyG
const m_int = [1, 2]
const m = Cint.(m_int) # one-time allocation
val = G(m, [1.0, 0.5], 0.3)This avoids the per-call conversion work entirely.
Input layout
For batch calls, use fixed-depth (depth_max, N) matrices with a len::Vector{Cint} per column.
This avoids:
- per-call allocations
- ragged array overhead
- pointer chasing and cache misses from “array of vectors”
Parallelism
The underlying Fortran library maintains global runtime options and caches. This generally implies the library is not re-entrant under Julia threads.
Recommended strategies:
- Use
G_batch!to maximize single-call throughput. - For true parallel scaling, prefer multi-process parallelism (
Distributed), where each process has its ownlibhandygstate.