
I've done a bit of HPC work, and Fortran is very much "still a thing" there. MPI[1] is pretty much the only game in town for between-node parallelism, and MPI comes with interfaces for C, C++ and Fortran. The only other language that's even been run at petascale is Julia, and as far as I can tell that's still using Julia's `ccall()` under the hood to interact with the MPI C libraries [e.g., 2].

Certainly legacy code is part of the picture, but not always as directly as one might think. Probably the biggest factor is that Fortran compilers tend to be very good -- partially a chicken-and-egg issue (e.g., Intel puts effort into ifort because its HPC customers want to use Fortran), but I think there's also at some level a tradeoff between language convenience and ease of optimizing into machine code. To give one concrete example, until C99 added the `restrict` keyword, it fundamentally wasn't possible for the compiler to optimize C code as heavily as it could optimize Fortran in certain common situations, because of pointer aliasing.

It's probably also worth noting that modern Fortran is a long way from f77.

[1] https://en.wikipedia.org/wiki/Message_Passing_Interface

[2] https://github.com/JuliaParallel/MPI.jl



> MPI[1] is pretty much the only game in town for between-node parallelism, and MPI comes with interfaces for C, C++ and Fortran.

I believe the MPI C++ interface has been deleted from the MPI standard. It was a bit pointless, since it was essentially just the C interface with slightly different wording, and C++ can of course call C just fine. For people wanting a higher level C++ MPI interface, I believe the common choice is Boost MPI.

> The only other language that's even been run at petascale is Julia, and as far as I can tell that's still using Julia's `ccall()` under the hood to interact with the MPI C libraries [e.g., 2].

I don't think that's a bad thing, per se. No need to reinvent the wheel.

And, it's the same thing for Fortran really. The most common MPI libraries are implemented in C, with the Fortran binding being a fairly thin wrapper that calls the C implementation. The only real difference is that the Fortran binding is an official part of the MPI standard.


Hah, I hadn't heard that! My impression has been that good C++ written for HPC tends to look a lot like plain C anyways. And ccall is pretty efficient, so I'm not complaining there either.


The chicken-and-egg thing also applies to GPUs, btw. Nvidia & PGI have supported GPU computing in Fortran for ~8 years, since the early days of CUDA.


That's a good point. Hierarchical parallelism is becoming increasingly important, so having one language that can be used both within-node and between-node is very convenient, and could add to the lock-in factor.


Good point, and this is btw. exactly where Nvidia is heading. There will be a point in the future where you just program kernels and/or map/reduce functions and/or library functions and then call them to execute on a GPU cluster, passing in a configuration for network topology, node-level topology (how many GPUs, how they are connected) and chip-level topology (grid + block size).

The address space will be shared across the whole cluster, supported by an interconnect that's so fast that most researchers can just stop caring about communication / data locality (see how DGX-2 works).


> The address space will be shared across the whole cluster, supported by an interconnect that's so fast that most researchers can just stop caring about communication / data locality

There will always be people who will care because locality will always matter (thanks, physics). Improvements in technology may make it easier and cheaper to solve today's problems, but as technology improves we simply begin to tackle new, more difficult problems.

Today's chips provide more performance than whole clusters from 20 years ago and can perform yesterday's jobs on a single chip. But that doesn't mean clusters stopped being a thing.

See also The Myth of RAM, http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html


I do think there’s a paradigm shift coming. It’s a combination of the ongoing shift from latency-oriented to throughput-oriented design with the capabilities shown in new interconnects, especially nvlink/nvswitch. This allows DGX-2 to already cover a fair amount of what would otherwise have to be programmed for midsized clusters -- if it can be made to scale one more order of magnitude (i.e. ~10 DGX) I think there’s not much left that wouldn’t fit there but would fit something like Titan. There’s not that much that's so embarrassingly parallel that the communication overhead doesn’t constrain it, and if it doesn’t, you again don’t care much about data locality as it becomes trivial (e.g. a compute-intensive map function).


C++ and Fortran support on CUDA was one of the big reasons why OpenCL was left behind.

They now at least support C++14, but driver support still doesn't seem to be quite there, from what I gather reading the interwebs.


If you are looking for an alternative to MPI, you should try Coarray Fortran. It supports parallel programming for both "within a node" and "between node" communication.

Coarray Fortran is now part of the Fortran programming language, and has a very simple syntax similar to array operations.

Based on my experience, the performance depends on the compiler implementation, and I would recommend the GCC compiler over the Intel compiler.


Interesting, I haven't tried that. Apparently it uses either MPI or GASNet for the between-node communication, depending on configuration? I don't know anything about the latter, but apparently it's an LBL product [1].

[1] https://gasnet.lbl.gov/


I don't think it's because MPI is better; it's just that most of the supercomputers I have access to require the use of MPI-like constructs.


And that's often because network hardware understands MPI and is able to optimize flows between nodes at far lower latency than TCP.


That's really cool. Source?


I used to work in HPC. The Mellanox gear, specifically InfiniBand is very good.

Fun fact: if you're working at a Saudi Arabian HPC center, say KAUST, your interconnects are purely Ethernet. Mellanox is (partially?) an Israeli company, and that's not very politically comfortable with procurement.


Better than what? Not necessarily disagreeing, but I'm not sure what the alternatives even are at the same level of abstraction. I mean, there's PNNL's Global Arrays [1], but that's higher level, or Sandia Portals [2], which is lower/transport level. Perhaps there are newer/alternative options I don't know about?

[1] http://hpc.pnl.gov/globalarrays/

[2] http://www.cs.sandia.gov/Portals/portals4-libs.html


Global arrays is normally used over MPI anyhow. I guess there's SHMEM, but that's integrated with at least OpenMPI (and others, I think). CHARM++ has been used at scale, but it's semi-proprietary.



