> In a multithreaded program, a bump allocator requires locks. That kills their performance advantage.
Wait, what?
What's wrong with:
#include <atomic>
#include <cstddef>

constexpr size_t HEAP_SIZE = 0x1000000;
char theHeap[HEAP_SIZE];
std::atomic<size_t> bumpPtr;

void* bump_malloc(size_t size){
    size_t returnCandidate = bumpPtr.fetch_add(size, std::memory_order_relaxed);
    if(returnCandidate + size > HEAP_SIZE){
        // Garbage collect. Super complicated, let's ignore it lol.
        // Once garbage collection is done, try the alloc again. If it still fails, panic().
    }
    return &theHeap[returnCandidate];
}
------
You don't even need acq_release consistency here, as far as I can tell. Even purely relaxed memory ordering seems to work, which means you definitely don't need locks or even memory barriers.
The exception is maybe the garbage-collect routine. Locking for garbage collection is probably reasonable (other programmers accept that garbage collection is heavy and may incur large running + synchronization costs), but keeping locks out of the "hot path" is the goal.
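For anyone who wants to poke at this, here's a quick harness (hypothetical, not from the thread, and in Java since the rest of the discussion is about Java) checking the disjointness property: every concurrent fetch-and-add gets a unique, non-overlapping region. Caveat: Java's getAndAdd is sequentially consistent, which is stronger ordering than the relaxed fetch_add above — as far as I know Java has no relaxed fetch-add short of VarHandles — so this only demonstrates the no-locks-needed part, not the relaxed-ordering part.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.HashSet;
import java.util.Set;

public class BumpTest {
    static final int HEAP_SIZE = 0x1000000;
    static final byte[] heap = new byte[HEAP_SIZE];
    static final AtomicInteger bumpPtr = new AtomicInteger();

    // Returns an offset into heap[] (standing in for a raw pointer).
    static int bumpMalloc(int size) {
        int off = bumpPtr.getAndAdd(size);
        if (off + size > HEAP_SIZE) throw new OutOfMemoryError("GC elided");
        return off;
    }

    public static void main(String[] args) throws Exception {
        final int nThreads = 8, nAllocs = 1000, size = 64;
        final int[][] offs = new int[nThreads][nAllocs];
        Thread[] ts = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int tid = t;
            ts[t] = new Thread(() -> {
                for (int i = 0; i < nAllocs; i++) offs[tid][i] = bumpMalloc(size);
            });
            ts[t].start();
        }
        for (Thread th : ts) th.join();

        // The atomic fetch-and-add hands out non-overlapping offsets,
        // so all 8000 results must be distinct.
        Set<Integer> all = new HashSet<>();
        for (int[] row : offs) for (int o : row) all.add(o);
        System.out.println(all.size());
    }
}
```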
------
This is what I do with my GPU test programs, where locking is a very, very bad idea (but possible to do). Even atomics are kinda-bad in the GPU world but relaxed-atomics are quite fast.
------
> In Java, this requires 15 000 separate allocations, each producing a separate reference that must be managed.
Wait, what?
Point[] array = new Point[15000];
This is a singular alloc in Java: it creates the array in one go, with 15000 null slots. No constructors run until you actually populate the elements (my Java is rusty though, someone double-check me on that). Either way, that's not the same as 15000 allocs being called, not by a long shot.
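Since the comment asks for a double-check: creating the array itself doesn't run any constructors at all. A quick demo with a hypothetical Point class that counts its own constructor calls:

```java
public class ArrayAllocCheck {
    static int constructed = 0;

    static class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; constructed++; }
    }

    public static void main(String[] args) {
        Point[] array = new Point[15000]; // one allocation: an array of references
        System.out.println(constructed);  // no Point constructors have run
        System.out.println(array[0]);     // slots start out as null
    }
}
```

(The 15000 separate allocations would only show up later, if you filled each slot with `new Point(...)` — the array creation itself is a single alloc.)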
------
Frankly, this poster seems reasonably good at understanding Go performance, but doesn't seem to know much about Java performance or implementations.
Even with relaxed memory ordering, you’re still bouncing the cache line around between every CPU that tries to access bumpPtr. I would expect that to be significantly slower than using per-thread heaps (especially if you have a large number of CPUs).
Java bump-pointer allocations (i.e. normal allocations, excluding e.g. large array allocations) occur in a thread-local allocation buffer (TLAB). Only when a thread exhausts its current allocation buffer does it need to worry about potentially contending with other threads to acquire a new one.
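A minimal sketch of that TLAB idea (hypothetical names, with offsets into a shared byte[] standing in for heap addresses): the hot path bumps a plain, unsynchronized per-thread cursor, and the shared atomic is only touched once per buffer refill.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TlabSketch {
    static final int HEAP_SIZE = 0x1000000;
    static final int TLAB_SIZE = 64 * 1024;
    static final byte[] heap = new byte[HEAP_SIZE];
    static final AtomicInteger globalBump = new AtomicInteger();

    // Per-thread allocation state: {cursor, limit}. No atomics, no sharing.
    static final ThreadLocal<int[]> tlab =
        ThreadLocal.withInitial(() -> new int[] {0, 0});

    // Assumes size <= TLAB_SIZE; large allocations would bypass the TLAB.
    static int alloc(int size) {
        int[] t = tlab.get();
        if (t[0] + size > t[1]) {                   // buffer exhausted:
            t[0] = globalBump.getAndAdd(TLAB_SIZE); // contend once per TLAB
            t[1] = t[0] + TLAB_SIZE;
            if (t[1] > HEAP_SIZE) throw new OutOfMemoryError("GC elided");
        }
        int off = t[0];
        t[0] += size;                               // hot path: plain add
        return off;
    }

    public static void main(String[] args) {
        int a = alloc(16), b = alloc(16);
        System.out.println(b - a); // consecutive allocations are adjacent
    }
}
```

Each thread pays the cache-line-bouncing cost only once per 64 KiB of allocation instead of once per object, which is why this beats a single shared bump pointer under contention.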