EffCompute's comments

EffCompute · 2026-04-12T17:46:55 1776016015

It's refreshing to see Oberon getting some love on the Pi. There’s a certain 'engineering elegance' in the Wirthian school of thought that we’ve largely lost in modern systems.

While working on a C++ vector engine optimized for 5M+ documents in very tight RAM (240MB), I often find myself looking back at how Oberon handled resource management. In an era where a 'hello world' app can pull in 100MB of dependencies, the idea of a full OS that is both human-readable and fits into a few megabytes is more relevant than ever.

Rochus, since you’ve worked on the IDE and the kernel: do you think the strictness of Oberon’s type system and its lean philosophy still offers a performance advantage for modern high-density data tasks, or is it primarily an educational 'ideal' at this point?

Rochus · 2026-04-12T18:46:44 1776019604

I don't know. Unfortunately we don't have an Oberon compiler doing similar optimization as e.g. GCC, so we can only speculate. I did measurements some time ago to compare a typical Oberon compiler on x86 with GCC and the performance was roughly equivalent to that of GCC without optimizations (see https://github.com/rochus-keller/Are-we-fast-yet/tree/main/O...). The C++ type system is also pretty strict, and on the other hand it's possible and even unavoidable in the Oberon system 3 to do pointer arithmetics and other things common in C behind the compiler's back (via the SYSTEM module features which are not even type safe). So the original Oberon syntax and semantics is likely not on the sweet spot of systems programming. With my Micron (i.e. Micro Oberon, see https://github.com/rochus-keller/micron/) language currently in development I try for one part to get closer to C in terms of features and performance, but with stricter type safety, and on the other hand it also supports high-level applications e.g. with a garbage collector; the availabiltiy of features is controlled via language levels which are selected on module level. This design can be regarded as a consequence of many years of studying/working with Wirth languages and the Oberon system.

vidarh · 2026-04-12T20:01:35 1776024095

There was a couple of PhD theses at ETH Zurich in the 90s on optimizations for Oberon, as well as SSA support. I haven't looked at your language yet, but depending on how advanced your compiler is, and how similar to Oberon, they might be worth looking up.

Rochus · 2026-04-12T20:16:48 1776025008

I'm only aware of Brandis’s thesis who did optimizations on a subset of Oberon for the PPC architecture. There was also a JIT compiler, but not particularly optimized. OP2 was the prevalent compiler and continued to be extended and used for AOS, and it wasn't optimizing. To really assess whether a given language can achieve higher performance than other languages due to its special design features, we should actually implement it on the same optimizing infrastructure as the other languages (e.g. LLVM) so that both implementations have the same chance to get out the maximum possible benefit. Otherwise there are always alternative explanations for performance differences.

vidarh · 2026-04-12T22:52:51 1776034371

It might have been Brandis' thesis I was primarily thinking about. Of the PhD theses at EHTz on Oberon, I'm also a big fan of Michael Franz' thesis on Semantic Dictionary Encoding, but that only touched on optimization potential as a sidenote. I'm certain there was at least one other paper on optimization, but it might not have been a PhD thesis...

I get the motivation for wanting to use LLVM, but personally I don't like it (and have the luxury of ignoring it since I only do compilers as a hobby...) and prefer to aim for self-hosting whenever I work on a language. But LLVM is of course a perfectly fine choice if your goal doesn't include self-hosting - you get a lot for free.

cxr · 2026-04-24T02:39:55 1776998395

> This paper has presented a study of a system that provides code generation and continuous code optimization as a central system service[…]

> Our results have shown that–because of the profiling feedback loop–object code produced by continuous optimizations is often of a higher quality than can be achieved using static "off-line" compilation. Optimization at runtime, if performed judiciously, can often surpass optimizations performed at compile-time, independent of whether the latter are guided by profiling information or not. Our results have also given evidence that reoptimizing an already running program in response to changes in user behavior can give rise to real performance improvements.

Kistler, Thomas, and Michael Franz. "Continuous program optimization: Design and evaluation." IEEE Transactions on Computers 50, no. 6 (2002). <https://doi.org/10.1109/12.931893>

Rochus · 2026-04-12T23:03:49 1776035029

I don’t like LLVM either, because its size and complexity are simply spiraling out of control, and especially because I consider the IR to be a total design failure. If I use LLVM at all, it would be version 4.0.1 or 3.4 at most. But it is the standard, especially if you want to run tests related to the question the fellow asked above. The alternative would be to build a frontend for GCC, but that is no less complex or time-consuming (and ultimately, you’re still dependent on binutils). However, C on LLVM or GCC should probably be considered the “upper bound” when it comes to how well a program can be optimized, and thus the benchmark for any performance measurement.

guenthert · 2026-04-13T09:29:50 1776072590

> However, C on LLVM or GCC should probably be considered the “upper bound” when it comes to how well a program can be optimized, and thus the benchmark for any performance measurement.

Is it? Isn't it rather the case that C is too low level to express intent and (hence) offer room to optimize? I would expect that a language in which, e.g. matrix multiplication can be natively expressed, could be compiled to more efficient code for such.

I would rather expect, that for compilers which don't optimize well, C is the easiest to produce fairly efficient code for (well, perhaps BCPL would be even easier, but nobody wants to use that these days).

Rochus · 2026-04-13T09:46:08 1776073568

> I would expect that a language in which, e.g. matrix multiplication can be natively expressed, could be compiled to more efficient code for such.

That's exactly the question we would hope to answer with such an experiment. Given that your language received sufficient investments to implement an optimal LLVM adaptation (as C did), we would then expect your language to be significantly faster on a benchmark heavily depending on matrix multiplication. If not, this would mean that the optimizer can get away with any language and the specific language design features have little impact on performance (and we can use them without performance worries).

EffCompute · 2026-04-13T16:48:28 1776098908

Rochus, your point about LLVM and the 'upper bound' of C optimization is a bit of a bitter pill for systems engineers. In my own work, I often hit that wall where I'm trying to express high-level data intent (like vector similarity semantics) but end up fighting the optimizer because it can't prove enough about memory aliasing or data alignment to stay efficient.

I agree with guenthert that higher-level intent should theoretically allow for better optimization, but as you said, without the decades of investment that went into the C backends, it's a David vs. Goliath situation.

The 'spiraling complexity' of LLVM you mentioned is exactly why some of us are looking back at leaner designs. For high-density data tasks (like the 5.2M documents in 240MB I'm handling), I'd almost prefer a language that gives me more predictable, transparent control over the machine than one that relies on a million-line optimizer to 'guess' what I'm trying to do. It feels like we are at a crossroads between 'massive compilers' and 'predictable languages' again.

gobdovan · 2026-04-13T07:31:32 1776065492

When you call LLVM IR a design failure, do you mean its semantic model (e.g., memory/UB), or its role as a cross-language contract? Is there a specific IR propert that prevents clean mapping from Oberon?

Rochus · 2026-04-13T09:41:17 1776073277

Several historical design choices within the IR itself have created immense complexity, leading to unsound optimizations and severe compile-time bloat. It's not high-level enough so you e.g. don't have to care about ABI details, and it's not low-level enought to actually take care of those ABI details in a decent way. And it's a continuous moving target. You cannot implement something which then continus to work.

pjmlp · 2026-04-13T12:50:35 1776084635

To be fair they also kind of share that opinion, hence why MLIR came to be, first only for AI, nowadays for everything, even C is going to get its own MLIR (ongoing effort).

girvo · 2026-04-12T21:34:35 1776029675

Is anyone attempting to implement Oberon on LLVM IR? Sounds like a fun project

Rochus · 2026-04-12T22:27:14 1776032834

Threre are at least two projects I'm aware of, but I don't think they are ready yet to make serious measurements or to make optimal use of LLVM (just too big and complex for most people).

EffCompute · 2026-04-13T04:01:52 1776052912

That benchmark is a great data point, thanks for sharing. The performance parity with unoptimized GCC makes sense, given how much heavy lifting modern LLVM/GCC backends do for C++.

Your approach with Micron and the 'language levels' is particularly interesting. One of the biggest hurdles I face in C++ with these high-density vector tasks is exactly that: balancing the raw 'unsafe' pointer arithmetic needed for SIMD and custom memory layouts with the safety needed for the rest of the application.

Having those features controlled at the module level (like your Micron levels) sounds like a much cleaner architectural 'contract' than the scattered unsafe blocks or reinterpret_cast mess we often deal with in systems programming. I'll definitely keep an eye on the Micron repository—bridging that gap between Wirth-style safety and C-level performance is something the industry is still clearly struggling with (even with Rust's rise).

alaaalawi · 2026-04-12T20:56:03 1776027363

you can check also XDS modula2/oberon-2 programming system. is an optimizing complier https://github.com/excelsior-oss/xds

EffCompute · 2026-04-12T12:10:41 1775995841

I think EnPissant has a point regarding the overhead. Mapping semantic dependencies at the patch layer sounds great in theory, but the computational cost of resolving those graphs in a repository with thousands of changes is non-trivial.

In my work with high-performance engines, 'on-the-fly' graph resolution is usually the first thing to hit a performance wall compared to simple snapshot-based lookups. Pijul is a brilliant experiment in Category Theory applied to VCS, but until it can demonstrate that it doesn't degrade linearly with history size, Git's 'dumb' but fast snapshots will likely win the network effect battle.

EffCompute · 2026-04-11T18:02:03 1775930523

I really agree with jandrewrogers' point about the insularity of the database domain. While working on a custom C++ engine to handle 10M vectors in minimal RAM, I’ve noticed that many 'mainstream' concurrency patterns simply don't scale when cache-locality is your primary bottleneck.

In the DB world, we often trade complex locking for deterministic ordering or latch-free structures, but translating those to general-purpose app code (like what this Rust crate tries to do) is where the friction happens. It’s great to see more 'DB-style' rigour (like total ordering for locks) making its way into library design.

senderista · 2026-04-12T05:56:11 1775973371

An example of this is Linux adopting wait-die locks:

https://docs.kernel.org/locking/ww-mutex-design.html

EffCompute · 2026-03-18T11:37:14 1773833834

One thing I'm trying to better understand is where the real limits are.

At this point it feels like the bottleneck is less about raw compute and more about how efficiently data is represented and accessed on the GPU.

Curious if others have seen similar behavior when pushing large-scale vector search on consumer hardware.

EffCompute · 2026-03-17T16:27:09 1773764829

Quick update:

I've been iterating on the approach and managed to push the coarse search further.

Currently seeing ~100M vectors scanned in ~10ms on a single RTX 3090 (binary stage only).

Still experimenting with trade-offs between speed and recall, but it's interesting how far this can go on consumer hardware.

Curious what kind of numbers others are seeing for large-scale vector search on GPUs.

gus_massa · 2026-03-17T18:30:52 1773772252

Is it available somewhere?

EffCompute · 2026-03-17T19:15:55 1773774955

Not yet — it's still a personal prototype and I'm actively experimenting with different approaches and optimizations.

I’m trying to better understand the limits of what’s possible on consumer hardware before deciding how to package or share it.

Happy to share more high-level insights though.