40 years of TeX! At this point it's worth emphasising that if you're a regular user of LaTeX - even and especially via Overleaf - and have a bit of spare change, please consider supporting the development of the project and the community by joining the TeX Users Group (TUG).
Have you ever installed a complete TeX distribution? It's about 1GB of packages. I find the DVD thing a bit funny - but I can imagine it makes installs go quite a bit faster. And considering it's a 40-year-old technology - it's not like you really need the latest commit.
The worst part of setting up a fresh OS installation for me over a rural satellite connection is inevitably installing TeX. It takes forever, and between the OS and it, I won't have any data left for the rest of the month.
Sometimes things in the package manager can be a bit out of date. For the great majority of LaTeX-ers it may well not matter, but some people want or need to stay up to date.
Some people also prefer the layout (where various configuration files are kept, for instance).
But the DVDs are once a year... I find it hard to believe that if you cared about the most up-to-date LaTeX distribution you would update once a year via DVD.
I'm sorry? TeX Live, MacTeX, and MiKTeX all have a utility that you use to make updates online. Something like this, for TeX Live.
tlmgr update --all
As for the CD distribution, which sometimes people point to as showing that TeX is somehow 90's-ish, I'll just say that the world is full of all kinds of people in all kinds of circumstances. People want them, so the developers provide them.
OT, but readers of this comment section are likely to know: why is a LaTeX distribution 5.8GB in size[0]? That feels far too large for a tool that's been around for decades and primarily deals with text.
It's 5GB because it includes virtually every extension package too, as well as their documentation (in PDF format, of course). There's a healthy TeX package ecosystem [0]
Windows users have the option of installing via MiKTeX instead of TeXLive, the benefit being that MiKTeX has a package management system which installs the base packages at start and then automatically downloads the others as and when they are needed. [1]
Both texlive and miktex allow selectively installing some packages only. AFAIK the difference is that miktex will show a popup saying «do you want to install foopackage?» when you try to compile a document which uses foopackage, and you don’t have foopackage on your system. Kinda like on ubuntu, you get «you can install this command by running apt-get install foopackage».
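With TeX Live the equivalent step is manual rather than automatic; a quick sketch (where `foopackage` is a placeholder name, as above):

```shell
# TeX Live's package manager, run by hand when a package is missing.
# 'foopackage' is a placeholder, not a real package name.
tlmgr install foopackage   # fetch the package (and its dependencies) from CTAN
tlmgr info foopackage      # inspect what the package contains
```

So both systems can install packages on demand; MiKTeX just triggers the step for you at compile time.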
But isn't that quite a big difference? MiKTeX will actively download and install the extra package for you there and then, whereas with TeXLive you'll need to rerun the installer (?) or download the packages from CTAN?
I must admit I speak as a Linux user who's installed TeX on Windows machines, but not really gone into detail myself, so please correct me if I'm wrong.
Do NOT install MiKTeX unless you enjoy finding new ways to waste your time. Yes, it's a tad more user friendly, but it's also literally ~twice as slow as TeXLive. I really wish people gave this heads-up when mentioning it... it should be a criminal offense not to! I wasted so much of my life just waiting for it to compile, and I only found out TeXLive was so much faster when it was years too late.
There used to be teTeX (https://www.tug.org/tetex/) which allowed installing packages separately, but that's no longer maintained.
I think KerTeX (http://www.kergis.com/en/kertex.html) now allows a similar style of pick-and-choose package installation, but it doesn't support a lot of packages (yet?).
If you just want to compile a TeX document without drowning your hard drive, I highly recommend Tectonic (https://tectonic-typesetting.github.io/en-US/), which behaves more like a typical development tool.
On macOS:
brew install tectonic
tectonic mypaper.tex
This will automatically download the exact set of packages needed to compile the paper, and clean up any intermediate files generated during compilation. And it's trivial to upgrade and uninstall.
Note though that Tectonic is based on a fork of the tangled XeTeX sources (i.e. not the actual WEB source code but the machine-generated C translation) as of some past date. This is a very odd choice, probably due to the author (understandably) not being able to make sense of the mess of TeX sources to figure out what the actual source code is. As a consequence, it has already begun to diverge; for instance, it does not contain the \expanded primitive that was added to XeTeX last year. Overall, Tectonic seems to have many good ideas, but suffers from its authors not being too familiar with TeX itself and the TeX ecosystem.
By default it includes pretty much everything anyone uses. Imagine installing the most popular 90% of PyPI or CPAN, and that probably gets the idea across. Without documentation (online is usually easier) and support for umpteen languages, it's less than half that amount of space. Even better, most of it is platform independent, so can be shared between operating systems; I use the Linux installer in WSL and tell it to install the Windows binaries, and it takes barely any more space but can be used transparently from either OS.
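The cross-OS sharing trick above can be sketched roughly like this (a hedged sketch, assuming a recent TeX Live; I believe the platform name was `win32` on older releases):

```shell
# From WSL, after installing TeX Live with the Linux installer:
tlmgr platform list          # show active and available binary platforms
tlmgr platform add windows   # add Windows binaries into the same shared tree
```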
I think it's in large part due to documentation and fonts. For example in my Fedora, I only have part of texlive installed. My /usr/share/texlive/texmf-dist is 1.3G, of which 428M are inside the doc subdirectory and 639M are in the fonts subdirectory.
Then there are many packages that contain pictures or logos. As a single random example, texmf-dist/tex/latex/fithesis is 2.4M, of which 2.1M are in fithesis/logo.
I wanted to ask the same thing. In my head ConTeXt is to LaTeX as XML is to HTML, a more general variant of the same language. Is that analogy plausible? In what use cases would ConTeXt make more sense than LaTeX?
LaTeX was written and basically frozen in the 80s/early 90s (LaTeX 2e is from 1994). The core ("kernel") of LaTeX cannot be changed as a lot of documents would break. Since then, a very large number of users of varying competence have written ad-hoc packages to extend or customize it. To change the appearance the recommendation often given is to write a .sty file, which is something of a dark art.
ConTeXt is from about a decade later, after some lessons were learned.
My impression: What you get with LaTeX is a larger ecosystem and greater likelihood that someone has already done what you need (in some package or other, though it may be only "close" and figuring out how to improve it may be hard). What you get with ConTeXt, at the cost of having to re-learn everything you learned about LaTeX and not being able to use any of the LaTeX packages, is a more consistently designed interface, a greater understanding of what's going on, more ease of modifying things (instead of the "there's a package for that" phenomenon), and best of all not having to "program" in the TeX macro "language" (you can use Lua).
Most of that is documentation (PDF files) and fonts (often in multiple formats for the same font). I skip the documentation and read them online on texdoc.net, and it trimmed down my installation size to 2.9GB (2.1GB of which are fonts).
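For anyone wanting the same trim on TeX Live, a sketch of telling the package manager to skip documentation and source files for future installs (note this doesn't remove files that are already installed):

```shell
# Tell tlmgr not to install documentation or source files from now on.
tlmgr option docfiles 0
tlmgr option srcfiles 0
```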
MacTeX 2019 doesn't appear to be available yet, and Homebrew isn't sufficiently wise to work with pre-releases in advance to get formulas working sooner rather than later.
What are you talking about? MacTeX is already available [1]. MacTeX is cask-only on Homebrew, so there's really not much to test in the formula. You should be able to bump the version in the old cask formula and it should work. Having said that, installing from a cask doesn't make much sense for a program that releases a new version once per year.
One technology that has stood the test of time: the amazingly well-designed TeX by Donald Knuth and LaTeX by Leslie Lamport.
Plain TeX can be a bit beginner-unfriendly, because most beginners use LaTeX and experienced people have their own 'coding style', but it's amazingly powerful.
I only wish somebody would add the minimum required primitives to plain TeX to render reflowable text in browsers and ebook readers.
TeX was designed for "beautiful" typesetting, but it gives you much more control than that IMHO. I'd urge everyone to try it out once. I use it offline, but I think overleaf.com allows for plain TeX too. (XeTeX may work best for Unicode; it's plain TeX + minimal additions for Unicode.)
On that note: I am not fond of PDFs because of their awfully poor Unicode search support. Does anybody knowledgeable know of a good target format I should use (and the appropriate drivers)?
Ironically, Computer Modern is terrible to read on computer displays. It was designed for ink and toner bleed, and looks great when printed on 1970s-era Xerox printers [1], but it's far too thin for digital rendering.
I might suggest Bitstream Charter for a well-designed and readable analogue to CMR for digital use.
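A minimal sketch of making the switch in a LaTeX document, assuming the XCharter package (an extended Charter from CTAN) is installed:

```latex
% Hedged sketch: use Bitstream Charter (via the XCharter package) instead of
% Computer Modern for screen-friendly body text.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{XCharter}
\begin{document}
Charter's heavier strokes hold up better on low-resolution displays.
\end{document}
```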
That is the Type 1 realization of it; the original Metafont outputs bitmaps tailored to the target device. Unfortunately, no widely used font rendering library supports Metafont directly, so the fonts are usually converted to Type 1 or OpenType and lose that capability.
The problem is that browsers choose not to implement high-quality justification algorithms like the Knuth-Plass algorithm that TeX uses, because it is computationally intensive. That's why justified text looks like garbage on the web.
There are some experimental JavaScript implementations, but without browser support reflowing high-quality justified text is a non-starter on the web.
TeX’s line-breaking algorithm is certainly not computationally intensive. On my 7-year-old MacBook Pro, it takes 0.34 seconds to run the entire 495-page The TeXbook on a single thread. That includes parsing, macro expansion, page-breaking, lots of (slower) math layout, and dvi output, which means that the line-breaking takes at most a few hundred microseconds per page.
Remember, TeX was written to be usable on a 1 megabyte, 10 megahertz machine, where it ran about a page a second. One of my contributions at the time was to modify the Pascal compiler on the Sail PDP10 to count cycles for every machine instruction executed by TeX over all users over a number of months, and Knuth fine-tuned the inner-loops of TeX here and there based on the results (the code that automatically inserts kerns and ligatures got the most attention, IIRC).
My comment was based on what I've read about this from multiple supposed authorities on web development. Intuitively it made sense that you wouldn't want to ship a computationally intensive algorithm in browsers on mobile devices, or where the content area changes size frequently, as on web pages. It's fascinating to have those assumptions overturned by someone so deeply involved in TeX.
Well, if you're going to be nice about it, here's some more info: The Mozilla discussion claims that TeX's line-breaking algorithm is "quadratic," which seems a bit far-fetched. So, I just pulled the raw text of Moby Dick off the web, removed the blank lines so it's all one paragraph, and ran it. TeX produces 112 pages (hmmm, it was just "Volume 1") in 2.1 seconds. So, 30x slower than "normal-size" paragraphs, but hardly quadratic, as the single-paragraph Moby Dick is 1000x as large as the average paragraph in The TeXbook. Of course, as pointed out elsewhere, with a little effort, one could make minor changes that would remove even this speed penalty.
I'm much more sympathetic to the point that, while TeX's line-breaking algorithm can easily handle paragraphs with different line lengths for each line, it needs to know at the start what the different line lengths are. It's not clear how to generalize it to be able to handle layouts where the length of the nth line of a paragraph depends on the earlier (or later!) line breaks. Think tall floating figures which impinge on the text area of the paragraph they're in. I'm guessing that was the real impediment in using it in Web-land.
My assumption was also that performance is the reason we aren't getting more esthetically pleasing line breaking. Until I read a comment[1] by Philip Walton, who works on WebRender at Mozilla, that is.
> ... it's not possible in the general case, at least not with the specs as they are today.
That's a fair response, but how about changing the (CSS) specs to allow better line breaking? Surely that would take less time than WebUSB, and Google or Mozilla could quickly push it through the W3C.
I'm not a web developer, so take this with a huge pinch of salt, but, if floats are the problem, does that imply that with layouts that use CSS Grid or Flexbox, we could have a decent justification algorithm?
I haven't tried this myself but I think ConTeXt can output PDFs with embedded XML markup and even epub files. You need to use \startsection and \stopsection instead of just \section, and maybe there are other limitations, but it's a small price to pay, isn't it?
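A hedged sketch of what that might look like in a ConTeXt source file (I haven't verified this myself; enabling the export via \setupbackend[export=yes] is my assumption):

```tex
% Hedged sketch of a ConTeXt MkIV document prepared for XML export.
% \setupbackend[export=yes] asks ConTeXt to emit an XML export alongside the PDF.
\setupbackend[export=yes]
\starttext
\startsection[title={Introduction}]
Reflowable text goes here.
\stopsection
\stoptext
```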
I fail to see any fundamental incompatibility. Every text renderer of any kind is about putting glyphs in positions. The layout would declare a few things not movable, and the algorithm needs to decide how to fit text around those constraints. The algorithms TeX uses are computationally expensive, but I don't see why, with faster computers, you can't reflow the text. About 15 years ago it used to take a dozen seconds to compile my typical PDF; now overleaf.com does it in just a second, in the browser.
If you are outputting a layout for a reflowable medium like HTML or EPUB, you are not putting glyphs in positions. You are constructing graphical objects and defining their relationships (and how those relationships change based on form factor), and you are permitting the output device to render glyphs and put them in positions.
The web browser also "puts glyphs into positions." Neither HTML nor TeX source specifies positioning in the source format though. The same TeX document can be re-rendered at different page sizes, font sizes etc. Really not sure what fundamental difference you are seeing.
It's still way too slow. Lots of the documents I've worked on recently have taken over ten seconds to compile on my 2017 MacBook Pro (touch bar). That's three orders of magnitude too slow for reflowing at 60Hz when you resize a window. I doubt a laptop will ever be three orders of magnitude faster (in sequential execution) than the one I have now, radical post-semiconductor computers notwithstanding.
XeTeX in particular does, but my problem is with the uncertainty around the whole process and the frustration when a search fails on a 300-page document.
3) Lines/Curves that happen to be in the shape of letters/text
If it's text, it's perfectly searchable. If it's one of the others, the creator has to also OCR it and add the text behind the image, or in front but invisible (no stroke or fill).
If you set the \XeTeXgenerateactualtext=1 option in a Xe(La)TeX document, the resulting PDF will include ActualText annotations to support searching even in complex scripts with ligatures, character reordering, etc.
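In a XeLaTeX document that might look like the following minimal sketch (the primitive is set in the preamble, before any affected text):

```latex
% Minimal XeLaTeX sketch: emit ActualText annotations so copy-and-search works
% even through ligatures and reordered complex-script glyphs.
\documentclass{article}
\XeTeXgenerateactualtext=1
\usepackage{fontspec}
\begin{document}
Efficient offices: the fi and ffi ligatures here stay searchable.
\end{document}
```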
Unicode search in PDF works fine, there are just a few issues that can cripple it for specific files, or specific files using specific PDF implementations. The following is advice for anyone working with the format, it should not be construed as making excuses for how PDF does things. It's a ginormous format with a ginormous spec that's been around for a very long time and has accumulated a lot of baroque qualities over the years. So I don't mean to condemn it, but I'm not exonerating it either. It is what it is, and if you have to work with it this might be useful. Anyhoo:
Regardless of what one thinks of their other qualities, if a PDF can be searched in Adobe Reader or Acrobat, then the file is probably OK but the PDF reader that you were trying to search it with has a bug on its end (or more likely, some unimplemented dark corner of the PDF spec).
On the other hand, if the file isn't searchable via Reader/Acrobat, then the problem is most likely with the authoring of the file itself. The most common thing that breaks searching is when instead of embedding all fonts used, a PDF refers to fonts by name from the local system. This can cause unpredictable issues when reading the PDF from another OS that can't resolve those font names.
Another common breaker of search that seems to be much more common with TeX are workflows which somehow produce PDFs that have embedded Type 3 fonts. Type 3 fonts represent glyphs using PDF drawing instructions. It's less a file format than something that only exists as an embedded font within a PDF, but I've only seen them in the wild when authored by pdflatex or similar. It seems like TrueType or OpenType are the most reliable formats to embed across the most commonly used PDF implementations, but that's an educated guess. Type 3 font support is spotty in non-Adobe implementations.
Finally, font subsetting might screw up search. Most software that knows how to produce PDFs can produce PDFs with embedded but subsetted fonts. This means the software that created the file embedded TrueType fonts (for example), but created a special version of the font for embedding that only contained the glyphs used in the PDF. Depending on the quality of the software used to do the font subsetting, the output PDF might not retain the mapping of glyphs back to Unicode characters. If that happens, text using that font in the PDF becomes unsearchable, but the PDF remains renderable.
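A quick way to diagnose the font issues described above, assuming poppler-utils is installed:

```shell
# List every font a PDF uses: the 'emb' column shows whether it is embedded,
# 'sub' whether it was subsetted, and 'uni' whether a ToUnicode map exists
# (no ToUnicode map usually means the text is unsearchable).
pdffonts paper.pdf
```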
None of this is meant to excuse the shortcomings of the format, it's clearly overcomplicated and fragile. But when I've seen problems with searchability of PDFs, it's most often been with the software that was used to create the files, or how that software was configured by the author. And when it is a problem with the authored file itself, it's almost always because of fonts. Either the fonts aren't embedded, or they're embedded in a slightly oddball format that's in spec for PDF but not perfectly, universally supported by all of the various non-Adobe PDF implementations.
I hit the comment limit earlier but thank you for writing in detail about what's actually going on behind the scenes so I can take care to not let the problem affect me much.
That said, and despite your having mentioned that there is no excuse for the shortcomings, I feel like this decision in particular is just plain inexcusable anyway, because if you're going to define a portable document format, at least use some form of UTF encoding rather than indexing into glyphs (or on top of it!)
https://www.tug.org/
Or a local group:
https://tug.org/usergroups.html
One of the perks of TUG membership is that you get free TeXLive install DVDs sent to your door every year (as many as you need, in my experience).
You also get a subscription to TUGboat, the TeX Users Group journal, which is full of curios for the TeX aficionado.
[Edited to replace UK TUG group link with link to global TUG directory]