SQL is one of the most amazing concepts I've ever experienced. It's nearly five decades old and there is no sign of a replacement. We've created countless other technologies to store and process data, and we always seem to try to re-create SQL in those technologies (e.g. Hive, Presto, KSQL, etc.).
I run an early-stage company that builds analytics infrastructure for companies. We are betting very heavily on SQL, and Craig's post rings true now more than ever.
Increasingly, more SQL is written in companies by analysts and data scientists than by typical software engineers.
The advent of the MPP data warehouse (Redshift, BigQuery, Snowflake, etc.) has given companies with even the most limited budget the ability to warehouse and query an enormous amount of data using just SQL. SQL is more powerful and valuable today than it ever has been.
When you look into a typical organization, most software engineers aren't very good at SQL. Why should they be? Most complex queries are analytics queries. ORMs can handle a majority of the basic functions application code needs to handle.
Perhaps going against Craig's point is the simple fact that we've abstracted SQL away from a lot of engineers across the backend and certainly frontend and mobile. You can be a great developer and not know a lot about SQL.
On the other end of the spectrum is the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data, or various other NoSQL technologies. They often know less about raw SQL than even junior engineers, because SQL sits below their high-power big data tools.
But if you really understand SQL, and it seems few people truly do today, you command an immense amount of power. Probably more than ever.
"we always seem to try to re-create SQL in those languages (e.g. Hive, Presto, KSQL, etc)."
This is largely because of the number of non-programmers who know SQL. Add an SQL layer on top of your non-SQL database and you instantly open up a wide variety of reporting & analytics functionality to PMs, data scientists, business analysts, finance people, librarians (seriously! I have a couple librarian-as-in-dead-trees friends who know SQL), scientists, etc.
In some ways this is too bad because SQL sucks as a language for many reasons (very clumsily compositional; verbose; duplicates math expressions & string manipulation of the host language, poorly; poor support for trees & graphs; easy to write insecure code; doesn't express the full relational algebra), but if it didn't suck in those ways it probably wouldn't have proven learnable by all those other professions that make it so popular.
SQL is amazing IF you understand it. You need to think in sets of things. It's like functional programming paradigms or recursion; once you really truly "get it" you start to feel like a Jedi master.
Unfortunately, the vast majority of SQL users aren't that proficient. It's also fairly hard to learn, because it's something you only pick up with experience, and specifically long-term experience with a sufficiently complex data model.
I personally would never have gotten good at SQL if I hadn't stayed in my first two jobs for 5 years each working daily with the data-models and the domain.
I was fortunate that my early SE career experience involved a lot of SQL (and my first experience with a SQL-like query language happened under PICK).
But as my career has gone forward, I'm touching it less and less; today, I hardly touch it at all outside of my personal usage.
Much of that has to do with the fact that I'm now employed building and maintaining an SPA using JavaScript and Node.js where the backend is accessed through a RESTful API; we never get to touch the actual database.
The few times I have seen some queries for that DB (albeit not in our API, though I could probably find them somewhere), all I can hope is that the SQL engine being used does some kind of on-the-fly query optimization, because there are so many cringeworthy nested inner selects it isn't funny (I wonder if they've heard of joins and such).
Before that, I was involved in a lot of PHP web apps and backend server automation, where I needed to use SQL a lot; I feel like I am getting rusty at it.
You could make a language that had all those benefits plus almost all the things nostrademons mentioned. For example, Apache Spark on top of Scala does this pretty well (but it's significantly more complex than SQL is). Pandas in Python and SAS also have many of these features (but also have a higher learning curve).
SQL certainly does not suck as a language; unfortunately, this is a very common perception, almost a (false) consensus within the developer community.
I changed careers from finance (I was an ACA, akin to a CPA, and a wizard at Excel VBA) to software development. After 10 years of learning, I'm now finally also proficient with SQL. I have learnt other languages, but SQL is by far my favourite. Yes, it has some mistakes in its design, the biggest being SELECT before FROM. Yes, some keywords, especially for window functions, have a silly length; this creates an impression of verbosity, but SQL is actually the opposite: it is far more expressive than any object-oriented language I can think of.
One line of code, say a window function with a FILTER clause in Postgres, is the equivalent of pages of Java code.
Lateral joins enable chaining; views, CTEs and functions provide simple, safely scoped composability. Postgres has incredible aggregation and analysis ability built in out of the box. Its string manipulation capabilities are wondrous: a one-line string_agg call can achieve the same as pages of VBA code. You may scoff, but VBA, with its ancient, horrible editor, is still necessary if you are in Excel; is JavaScript that much less verbose, though?
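As a rough illustration of that one-line aggregation, here is a minimal sketch using SQLite's group_concat (the closest analogue of Postgres's string_agg) via Python's sqlite3; the table and data are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", "pen"), ("alice", "ink"), ("bob", "paper")],
)

# One line of SQL collapses each customer's items into a single string,
# work that would take an explicit loop in VBA or Java.
rows = conn.execute(
    "SELECT customer, group_concat(item, ', ') FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # e.g. [('alice', 'pen, ink'), ('bob', 'paper')]
```

The same shape works in Postgres with `string_agg(item, ', ')`.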
Pure SQL Postgres functions are amazing: you can chain them, you can easily make them pure, and they can include fast recursion and conditional logic. I never delete data, just append; a bitemporal, immutable, functional DB is so powerful, enabling time travel and rock-solid data integrity. Materialized views, indexes or summary tables are easily created, and often automatically. They provide an efficient, reliable cache layer, which can address most performance issues that may arise.
You're confusing the language and the database engine. I don't think seasoned developers think that relational databases suck per se, since most of their constraints are technically justified, but they don't like SQL as a language because its grammar is completely awful; it's inconsistent (with itself and across database engines), filled with specific quirks, and clumsy.
To illustrate with `SELECT` queries: you start by listing the attributes and then the table, while with `UPDATE` you start by specifying the table and then the attributes to operate on.
This illustrates the grammar problem: in one case you start by naming the table you will use and then set the attributes; in the other you start with the attributes you need while keeping the table in mind, since you specify it afterwards.
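The asymmetry can be seen side by side in a tiny SQLite sketch (via Python's sqlite3; table and data invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

# SELECT: columns first, table second.
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]

# UPDATE: table first, columns second.
conn.execute("UPDATE users SET name = 'grace' WHERE id = 1")
updated = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
print(name, updated)  # ada grace
```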
It's not really a problem, but developers tend to hate any kind of cognitive load, and this is one source of it.
I'm not an ORM fan, but developers often use the programming language of their application to build SQL query strings, and with such tools you always start by specifying the target table.
Personally, I really wish RDBMSs would provide another intermediate language, or better, a data structure, to interface with them.
SQL is wildly powerful and important, but it's also got clear deficiencies for data traversal and manipulation.
I see it on a spectrum between declarative and imperative data structures, and where I think most people go wrong with it is in trying to create a monolithic solution to a broad spectrum of problems. I think you need a graduated approach where each data layer is simplifying and satisfying the next, so you're using tables, procs, views, and in-memory constructs in concert. The database is a powerful tool, and SQL is just part of that bigger puzzle :)
I can't agree more with this. As with anything we use in our applications, understanding and leveraging the strengths of our different tools is vitally important. If, for example, I needed a cartesian product of two datasets, SQL is the first thing that comes to mind due to how simple it is to write and the speed with which it will be processed in the database. On the flip side, I would never want to intermix frontend code in SQL, which crazy as that sounds, I have seen before. Everything has its place in a logical development flow.
It completely blows my mind when I see stuff like this. We're not all perfect programmers, I'm sure, but still. An alarm should go off in your head when the thought occurs to put front-end code into a query. The only exception I've encountered is if I want to inject some HTML formatting into a string for emails sent from SQL Server, but even then it's extremely limited in use, and I still think to myself, there must be a better way..
Starting with the end in mind (hi, Covey) is actually not such a bad idea. First state what result you want the query to produce, and then start describing where that data should come from.
Updates aren't different from select statements: first you state what you want (update a table with some new column values), and then you state where this data should come from and what data you want to update.
And this is why it sucks. Which is fine, the language comes from a different era when our understanding of computer languages was much more primitive. It is understandable that mistakes would be made. What is unfortunate is that there has been little to no progress in improving on those mistakes in this problem space.
In the procedural world, you could also say that C has some mistakes in its design. However, we've gone to great lengths to try and improve on C, for example in semi-recent times with Rust and Go. SQL could really benefit from the same. SQL gets a lot of things right. So does C. But we can do better on both fronts.
Unfortunately, it seems that people regularly confuse SQL with the underlying concepts that SQL is based on. Because there is basically no competition in this space, many assume that this type of problem can be expressed in no other way, and that you simply do not understand SQL well enough if you do not see it as perfect. I guess it comes from the same place as those who argue that if you make memory management mistakes in C, you just don't understand C well enough.
As a software engineer who later learned SQL, I could not disagree more. Within the parameters that it is designed for, SQL is a terrific language that makes exploring and manipulating data much easier than tools like python or Scala. That doesn't mean I have no place for python or Scala, but that I definitely see a class of problems where an SQL interface is far superior.
I use python/Pandas every day for data analysis and the like, and I would never dream of not writing most of the aggregation and filtering logic in SQL. If you're working with large datasets, there is absolutely no reason to pull unnecessary data into memory.
I'm not sure what it is today (I would hope the same mindset applies, but maybe not), but back when I was using SQL, the idea was to let the database engine do everything it could with the data before sending the results over the pipe.
That is, minimize the network bandwidth by putting the work on the DB engine.
This of course necessitated creating and understanding proper SQL query-building practices. It was real easy to mess up if you didn't know what you were doing (i.e. inner selects, improper joins, etc.) and cause a combinatorial explosion that would consume all the RAM on the server and grind it to a halt.
That, or bring back a load of data that you then filtered on the "client" - better to let the DB server do that if you can. Of course, this was back when the clients were 486s and early Pentiums with maybe 8-16 MB RAM. Today it's a bit different, but you still want to minimize the network traffic.
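The contrast above can be sketched in miniature with SQLite via Python's sqlite3 (data invented): both approaches compute the same answer, but the first ships one row across the "pipe" instead of the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.0), ("a", 3.0), ("b", 100.0)])

# Work done by the engine: only one row crosses the wire.
db_side = conn.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = 'a'"
).fetchone()[0]

# Work done on the client: every row crosses the wire first.
rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
vals = [v for s, v in rows if s == "a"]
client_side = sum(vals) / len(vals)
print(db_side, client_side)  # 2.0 2.0
```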
> I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
We're a machine learning shop, and this is absolutely one of our core design principles.
Pulling into memory is an attribute of implementation - you could write LINQ in C# (which is a great abstraction too) and not care about the fact that it's translated to SQL that runs server-side.
Depending on the use case, I'd argue Spark can be a better choice for aggregating/filtering than SQL. SQL is great for simple queries, but once my queries start getting into the many hundreds of lines, then I start to miss all of the complexity management features of a true programming language.
There are a few things, but not much, that I have had a better experience with in Spark as compared to using something like Apache Pig with UDFs. This part might be a matter of how things are set up where I work, but I find working with Tez for process management and debugging to be far easier than working with the process management built into Spark.
EDIT: when you read process management above, perhaps it's better to think task management.
I understand your perspective, but I look at it a different way.
SQL is troublesome to some programming types because it seems alien to ask for what you want instead of telling the computer what to do, and I find most programmers, especially ASD-types (who I think have an edge in some situations, like writing certain code in a huge org like Google), find this an unfamiliar and strange way of thinking.
You're right about some of it (especially string manipulation, which is brutal), but I think you look at it like most programmers. SQL engages more of a simulative mindset, one popular with analysts, than the acquisitive mindset that most software developers outside of data science employ.
Most programmers who are at all familiar with functional programming, DSLs, configuration languages, or optimizing compilers are very well versed with asking the computer for what you want rather than telling it what to do. At least when I was there, this was a very large percentage of Googlers.
My issue with SQL is that a programming language should allow you to compose and name building blocks, and then recombine them to build ever more useful software. SQL doesn't have this structure. If you have a query that almost does what you want but you need to add one more filter, you need to reissue the query (modulo views/temporary tables, which is why I said "very clumsily compositional" rather than "not compositional"). If you have an expression that computes some quantity and then you want to use it with slight modifications in a lot of places, you usually end up copy & pasting it (modulo stored procedures). Modern programming languages have made this sort of abstraction really easy, but it's quite clunky in SQL. You can do it (by using stored procedures, views, triggers, subqueries, etc.), but then most of your application ends up written in SQL and it starts to feel like something out of the 1970s.
My preferred interface would be something like the relational algebra where relations are represented as typed values in the host programming language and operators are normal method calls (or binary operators, depending on language flavor) and importantly, intermediate results can be assigned to variables. I don't care what the particular execution strategy is of the query, but I do think it should be possible to refine, join, project, and subquery using values you've already defined.
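A toy sketch of that interface in Python (hypothetical, not a real library; the relation contents and method names are invented): relations are values, operators are method calls, and intermediate results can be named and reused.

```python
class Relation:
    """A relation as a plain value: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, pred):          # selection
        return Relation(r for r in self.rows if pred(r))

    def project(self, *cols):       # projection
        return Relation({c: r[c] for c in cols} for r in self.rows)

    def join(self, other, on):      # join on an arbitrary predicate
        return Relation(
            {**a, **b} for a in self.rows for b in other.rows if on(a, b)
        )

employees = Relation([
    {"id": 1, "name": "ann", "dept": 10},
    {"id": 2, "name": "bob", "dept": 20},
])
depts = Relation([{"dept": 10, "dname": "eng"}, {"dept": 20, "dname": "ops"}])

# Intermediate results are named and reused; the "one more filter"
# case is just another method call, not a rewritten query.
staff = employees.join(depts, on=lambda a, b: a["dept"] == b["dept"])
eng = staff.where(lambda r: r["dname"] == "eng").project("name")
print(eng.rows)  # [{'name': 'ann'}]
```

A real implementation would build a query plan rather than materialize each step, but the composition surface is the point.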
> My issue with SQL is that a programming language should allow you to compose and name building blocks, and then recombine them to build ever more useful software.
When I write a complex query it's kind of like this.
I start with one table (viewed in my head as an excel style grid of results). I join another table then filter on the join / where clause. Again the output is effectively another table. Rinse repeat.
The difference is that SQL is very powerful at what it does, so you don't need to compose blocks in the way that you would in other languages. Just think of everything as a table. The result of every join or filtering clause is another table. That's your building block.
You are going to need significantly less SQL than Java / Python to get the same results from your data.
True, if you modulo all the features like views, stored procedures, functions, foreign keys, triggers there is no reusability in SQL...
On the serious side: it's not a bad idea to contain critical business logic in the database. It's shared by any app using the database. Foreign keys especially link tables and let you cascade changes with no extra code.
Less overall code. Higher guarantees.
Everyone agrees it's a good idea to use a `datetime` field for a `created` field instead of `varchar` and letting the application parse/extract its own format in there.
It's also a good idea to add triggers that do specific things to keep your db in a consistent state upon actions.
Maybe adding/deleting a row needs to update a counter in a statistics table for quick access because count(*) on that table takes too long.
You could do that in your app, but what happens if someone changes the data without going through your app?
Instead you add it as a trigger. Now you can be sure that the number in the stats table is always the same as the actual number.
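That counter-trigger pattern can be sketched in SQLite via Python's sqlite3 (table, trigger, and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stats (n INTEGER);
INSERT INTO stats VALUES (0);

-- Keep stats.n equal to COUNT(*) of items, no matter which client writes.
CREATE TRIGGER items_ins AFTER INSERT ON items
BEGIN UPDATE stats SET n = n + 1; END;
CREATE TRIGGER items_del AFTER DELETE ON items
BEGIN UPDATE stats SET n = n - 1; END;
""")

conn.execute("INSERT INTO items (name) VALUES ('a')")
conn.execute("INSERT INTO items (name) VALUES ('b')")
conn.execute("DELETE FROM items WHERE name = 'a'")
count = conn.execute("SELECT n FROM stats").fetchone()[0]
print(count)  # 1
```

Because the triggers fire inside the database, the invariant holds even for writes that bypass the application entirely.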
Database changes have been painful for me in every job I have worked at. They often took me longer to write, longer to test, and longer to fix when I introduced bugs. In my experience, doing things in the database is great for simple actions, transformations done in huge volumes, preprocessing large datasets, or logic that is unlikely to change. It's bad for doing things which don't have those properties, especially the last property due to database changes being painful.
I don't disagree with him (though I don't necessarily agree as strongly with respect to triggers), but the counter argument is that it's extremely easy to put up a web service that offers an API and interacts with JSON. Now all the complicated bits are in the API service, and the multiple apps don't need to know anything at all about the underlying data store.
The counter to that is that it can be very difficult to get an API that is as comprehensive and flexible as SQL is, and eventually you'd find yourself re-implementing SQL if you ever tried. Certain reporting scenarios or mass data manipulation scenarios really require actual data store access in order to perform remotely well.
Your counter counter argument on not being as flexible as SQL is spot on. A DAL that sits in front of storage mechanisms is not a bad idea though.
I've found moderate success in doing the reverse and using communication channels from SQL => services either via NOTIFY or queues.
ie. Dispatching an event that signals a non SQL service to do an action. eg saving a file to S3, then updating the database when the action is complete.
This doesn't really work when retrieving data synchronously but in most cases if the data is stored elsewhere - then you probably want to access it via non db methods anyway (ie. for an S3 file, you'd use the url)
I don't like it either, but for most scenarios that this sort of thing works for, a RESTful API or microservice or equivalent (or whatever you want to call it these days, since I think the expiration date on those terms has elapsed) applies, and not all data can be meaningfully encoded without a structure like JSON or XML. Or rather, you can, but you're reinventing the wheel just like you would with an API trying to be as flexible as SQL. You may run into various system limits on URL length, too.
Events and queues are great, but not all requests work that way. If you're USPS and you're providing an address and ZIP code resolver, you've got different requirements than getting data from one system to cascade across a series of systems with a dozen different asynchronous widgets. You'd have a burden of using a format that your customers would prefer, too.
Having worked with three large JSON HTTP microservice-based projects in the past, I abhor the day they got popular. Relative productivity has definitely gone down due to the project overheads they cause.
Not saying microservices don't work, they are great for specific use cases, but 9/10 people just want to microservice everything just to say they use microservices.
> If you're USPS and you're providing an address and ZIP code resolver
Not sure I see your point with this example - isn't a resolver like this just a fixed database of entries?
> Not sure I see your point with this example - isn't a resolver like this just a fixed database of entries?
Until you get to how well it handles misspellings and incomplete information. Addresses are also notoriously difficult to parse. There is some logic and ranking at work behind the scenes. The only times I've seen it consistently fail are when the city is incorrect, or it's a genuinely new address (new construction or address renumbering).
In any event, is a read-only service somehow less of a service? I'd wager read-only services see a lot higher demand than anything.
Not a down voter, but I imagine it could be because triggers are associated with write amplification and visibility concerns. They have their place, just like stored procedures.
This is a very real problem I bang my head against regularly. There just seems no way to achieve all three of readability, maintainability and performance in a large enough SQL codebase. You can piece together views right up until the moment the query planner forgets to push where clauses down. You can wrap a query in a function to guarantee the where clause is evaluated then and there, but now you have to maintain that _and_ your view/table. You can slowly rewrite your code bit by bit, adding complexity to force the query planner (in this supposedly declarative language) to behave the way you want it to, for identical results. You can wrap all of it in a materialized view so you don't have to care about performance anymore, right up until the point it takes 12 hours to refresh.
I'd be very much in the market for something more modular than SQL, with a much more customisable query planner, so you could say "I don't care if you go away and compile this for an hour as long as you make it as quick as physically possible and then save the query plan forever".
Yeah, some of it is based on statistics, although in my experience, the relative distributions of rows don't massively change in my databases, so an ostensibly optimal query plan on day one isn't likely to be rubbish later. You can obviously imagine other people's mileage varying.
> My preferred interface would be something like the relational algebra where relations are represented as typed values in the host programming language and operators are normal method calls (or binary operators, depending on language flavor) and importantly, intermediate results can be assigned to variables.
Much if not all of what you're asking for can be done today at the database level with modern SQL affordances (CTEs, etc) and vendor-specific stuff like stored procedures. You can write functions that return tables/resultsets, assign those results to variables, etc.
There's not a host programming language database library that exposes those things in quite the manner you seem to be asking for, but that's not a limitation of these SQL-based RDBMSs themselves.
In the realm of what's possible right now, you could of course write your stored procedures directly in .NET languages (MSSQL) Python (Postgres) etc.
'With'-clauses (aka Common Table Expressions) allows you to name and compose subqueries much more cleanly.
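A small sketch of that composition in SQLite via Python's sqlite3 (data invented): each CTE names one step, and the final SELECT combines them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 5), ("north", 7), ("south", 3)])

# per_region and big are named subqueries, readable top to bottom,
# instead of nested inside one another.
total = conn.execute("""
    WITH per_region AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    ),
    big AS (
        SELECT * FROM per_region WHERE total > 4
    )
    SELECT region, total FROM big ORDER BY region
""").fetchall()
print(total)  # [('north', 12)]
```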
Your preferred interface sounds like Linq. An IQueryable<T> interface represent a query and can be composed and assigned to variables, and the type T represent the type of the relation or projection. Linq composes a lot more elegantly than SQL itself. The instances does not represent intermediate results though, they represent composed queries, which are executed in one go on demand. But I think that is what you want anyway.
> 'With'-clauses (aka Common Table Expressions) allows you to name and compose subqueries much more cleanly.
These are great, but their implementation varies pretty significantly. Notably, MySQL didn't have them at all until v8.0, and PostgreSQL's CTEs are... wonky, for lack of a better term (I believe they use an "evaluate and store" method that's closer to a temp table, and as a result they don't optimize well and have been known to have strange side effects). Not all RDBMSs support recursive CTEs, either.
There are indeed a number of decent swipes at the idea, like Linq. But they all suffer from some impedance mismatch. It would be better if the native language underneath was written to support it more natively.
As you and others cite, there's a variety of features that have crept up on the idea over the years, but they're all bodging things on to the side of something they can't really change. I'd love to see the Rust of database querying to SQL's C++. I draw that comparison because I think a lot of the same structural problems are in play preventing it; SQL is just so good that it is actively prevented from being fundamentally improved. It can be incrementally improved, but not fundamentally.
Views: create a general top-level view, then build more specific views on views using more filters to go down (i.e. more granular), then just join these with yet more views to combine, or aggregate to go back up. How is that not composable? If you hit performance issues, they are easily solved by using a few materialized views.
Also CTEs and User defined Functions (I use pure SQL functions but in Postgres you can easily use Python or Javascript instead)
At least in the DB we're using, Sybase SQLAnywhere, materialized views comes with a hefty price tag. They must be dropped and recreated every time you touch any of the base tables, like adding a column, which in turn requires any indexes on the materialized views to be recreated.
For a few of our customers, that meant that a 15 minute DB change (adding a column) turned into a several hour DB change (rebuilding materialized view).
If they're refreshed immediately, they also incur a penalty on the underlying tables, since any data changes essentially runs a trigger to see if the materialized view needs changes as well.
In our case we had one such immediate refresh materialized view which caused such a performance degradation of an underlying table that we had to find an alternate solution.
Materialized views for sure add some complexity and overhead, but I think that's true of any caching mechanism. They should probably be one of the last tools you reach for when trying to optimize queries, but in certain critical situations they can be super effective.
Well, for immediate refresh materialized views, how do you keep your non-sql solution up to date?
For manual refresh, I agree, either non-sql or just plain temp tables is a decent alternative. In our case the views do a lot of joins and subqueries, so we've mostly used temp tables.
I think you make a valid point about compositionality. SQL lacks elegant composition at the source/module/syntax level. Views and stored procedures are stateful ways of achieving something similar, but using them also involves migrating that state. I don't think you're saying SQL is not expressive, which seems to be the rebuttal offered by siblings.
I am the author of a small parser/typechecker for a subset[0] of SQL, roughly matching that supported by SQLite, and I agree with you COMPLETELY. What follows is my rant directed at those who don't.
It is an ugly, verbose language. You can be very familiar with thinking in sets, and still not like SQL.
It's what we've got, and the SQL databases available are very, very good products. But I do wish a cleaner language could have won.
As you know, anyone who's used LINQ in C#, particularly by directly calling the extension methods .Select(...).Where(...).OrderBy(...), sees how much better it is from a composability standpoint.
SQL is the anti-Lisp. Lisp's design is about 4 pieces of syntax and a few fundamental operations from which all else is built.
Conversely, nearly every operation in SQL is a tacked-on special case to the ridiculously complex SELECT syntax.
Filtering results? That's a clause of SELECT.
Ordering? Clause of SELECT.
Filtering after aggregating? Oh, that's a different clause of SELECT.
Once this philosophy has infected the brain of a SQL implementer, it spreads like wildfire.
That's why you see custom syntax pop up even in good ol' function calls sometimes, like in Postgres: overlay('abcdef' placing 'wt' from 3 for 2).
SQL fans often talk about the beauty of relational algebra. Once you achieve relational enlightenment, SQL is supposed to be beautiful.
But if we wrote math like SQL, you wouldn't say 2 * 3 + 4. There would be a grand COMPUTE statement with clauses for each operation you could wish to perform. So you'd write COMPUTE MULTIPLY 2 ADD 4 FROM 3. Of course, the COMPUTE statement is a pipeline, and multiplication comes after addition in the pipeline, so if you wanted to represent 2 * (3 + 4) you'll push that into a sub-compute, like COMPUTE MULTIPLY 2 FROM (COMPUTE MULTIPLY 1 ADD 4 FROM 3).
SQL clauses could have been "functions" with well-defined input and output types, if the language designers had come up with a type system to match the relational algebra.
These could be pipelined, rather than nested, either with an OO-style method call syntax or a functional style pipe operator.
You understand the idea. Again, as you've mentioned, the LINQ methods[1] are a great resource for people not familiar with this style.
But, counterargument: languages with minimal syntax and great power are often claimed to be unreadable. Sometimes it's nice to have special syntax to help give a recognizable shape to what you're reading, instead of it being operator / function call soup.
So what did SQL accomplish by making everything a special case of SELECT?
Well, it's got this sensible flow to every statement. You see, the execution of SELECT logically flows as I've numbered the lines below.
SELECT
8 DISTINCT
7 TOP n
5 column expressions
1 FROM tables
2 WHERE predicate
3 GROUP BY expressions
4 HAVING expressions
6 ORDER BY ...
I sure am glad they cleared that up. If it had been a chain of individual operations I'd have been utterly baffled.
Ignoring completely the syntactical design, the lack of basic operations is a pain too.
Why can't I declare variables within queries?
I often would like to do something like:
select
let x = compute_something(...)
in
x + q as column1
x + r as column2
from ...
Instead, when I really need that, I end up wrapping the whole thing in an outer query and computing X in the inner query.
Hooray for SELECT, the answer to all problems!
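That wrap-it-in-an-outer-query workaround looks something like this, sketched in SQLite via Python's sqlite3, with `a * 2` standing in for the hypothetical compute_something (table and column names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, q INTEGER, r INTEGER)")
conn.execute("INSERT INTO t VALUES (10, 1, 2)")

# The inner query names x once; the outer query reuses it in several
# columns -- the closest SQL gets to a local `let` binding.
row = conn.execute("""
    SELECT x + q AS column1, x + r AS column2
    FROM (SELECT a * 2 AS x, q, r FROM t)
""").fetchone()
print(row)  # (21, 22)
```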
Oh and yes, views and stored procedures and UDFs are no answer to the need for one-off local composability within queries.
Then you have the sloppy design of the type system in virtually all SQL dialects. SQL Server doesn't even have a boolean type.
There is no type you can declare for `@x` that will let you `set @x = (1 <> 2)`.
And don't even get me started on the GROUP BY clause, the design of which contorts the whole rest of the language.
If you GROUP BY some columns, you must not refer to any non-grouped columns in the later parts of your query (refer to my table above for which parts are "later").
Unless, that is, you are referring to them in aggregates. Then the HAVING clause was tacked on so that you'd have a way to do a filter -- the same thing as WHERE -- after the GROUP BY.
Does it all make sense once you understand it? Yes, in that you can see how you'd end up with this system if you were adding things piece by piece and never went back to redesign from square one.
Wow, I have a lot of ranting to do about SQL. I feel like I haven't even scratched the surface.
And hell, I still pick SQL databases every time I start a project! The damn language is useful enough and the products work great.
But it has all the design elegance of the US tax code.
You make some good points about SQL's shortcomings, but some of the claims you make are factually wrong or seem intentionally misleading. Since you mentioned SQL Server, I have to chime in with some corrections to your statements.
While you can't declare new variables within a SELECT statement, you can modify them with functions:
DECLARE @foo INT;
SELECT
@foo = dbo.ComputeSomething(...),
@foo + @bar AS Val1
FROM
...
Additionally, SQL Server does have a boolean type by another name: bit. It is 1/0/NULL and you can convert the string values TRUE and FALSE into their respective numerical values.
To modify your example to use proper SQL:
DECLARE @x BIT;
SELECT @x = 'TRUE' WHERE 1 <> 2
Or, to use a more real-world example, you'd probably use a CASE statement:
SELECT
@x = CASE WHEN 1 <> 2 THEN 1 ELSE 0 END
Yes, it's verbose. But it's possible. (SQL's slogan, probably, and understandably the root of a lot of your remarks.)
Finally, your GROUP BY / HAVING / WHERE example lacks an important distinction. HAVING <> WHERE. HAVING is specifically meant to deal with aggregate functions, while WHERE is a simple filter. You wouldn't use "HAVING PostId = @foo" any more than you might use "WHERE COUNT(*) > @bar". This is why it comes after the GROUP BY clause -- which you should think of as a "key" for the data that's being aggregated.
Look, SQL is not perfect or near perfect. As with any language, it could use improvements. But to criticize a language based on ignorance, misleading statements, or use cases that are not meant for the language is fairly dishonest, in my opinion.
If I use your variable example I get:
A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations.
I'm assuming there is a way to make it work, but then it seems like it'd be kind of a messy imperative-style approach. Would the value of @foo persist between each row's evaluation, so now my query would be capable of (intentionally or not) stateful execution? Would that interfere with the optimizer's ability to re-order evaluation or would it be undefined behavior?
Other than the special-cased string value conversion, SQL Server treats bit as just a very short integer type. You cannot supply a bit value alone to WHERE or CASE; you must use a comparison operator. And likewise, you cannot assign the result of a comparison expression to a bit; you must wrap it in CASE as you have demonstrated. In fact my own project, in an attempt to squeeze various dialects of SQL into a common type system, fakes booleans in T-SQL by inserting CASE expressions and (<> 0) comparisons as needed.
If a normal programming language had special places where boolean expressions could appear (like in if and while loops), and boolean expressions did not produce values in and of themselves, it would stand out as being extremely hacky.
Of course that is a complaint only for T-SQL, other dialects do have true bools. But it is emblematic of design undertaken without effort to unify fundamentals (The question not asked by T-SQL's designers: "What if predicate expressions were just like any other expression, and had a value?") and that is what I find distasteful about SQL in general.
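For contrast, here is what "predicates are just expressions" looks like in a dialect that has it (a sketch on SQLite, which models booleans as 0/1 integers; Postgres goes further with a true boolean type):

```python
import sqlite3

# A comparison is an ordinary expression with a value...
conn = sqlite3.connect(":memory:")
b = conn.execute("SELECT 1 <> 2").fetchone()[0]
print(b)  # 1

# ...so it composes: count rows matching a predicate by summing its
# 0/1 value directly, no CASE wrapper required.
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (5,), (9,)])
n = conn.execute("SELECT SUM(v > 3) FROM t").fetchone()[0]
print(n)  # 2
```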
HAVING is another example of that lazy, slap-more-syntax-on design. As you correctly point out, HAVING evaluates after the GROUP BY, WHERE before, and that's why HAVING can refer to aggregate results. But if "WHERE" was just an operator it could be used in the order it was needed without hassle. In LINQ, you write:
users
.Where(u => u.AccountIsActivated)
.GroupBy(u => u.Name)
.Select(g => new { Name = g.Key, NumberOfUsersWithName = g.Count() })
.Where(g => g.NumberOfUsersWithName > 1)
.Select(g => $"The name {g.Name} is shared by {g.NumberOfUsersWithName} users")
The .Where before the grouping is the same function as the .Where after it. No need for a special case.
Of course you could do the same thing in SQL without HAVING using a subquery:
select 'The name ' + sq.Name + ' is shared by ' + cast(sq.NumberOfUsersWithName as varchar(10)) + ' users'
from
(select u.Name, count(*) as NumberOfUsersWithName
from Users u
where u.AccountIsActivated = 1
group by u.Name) sq
where sq.NumberOfUsersWithName > 1
But it's ugly, so they threw in HAVING and that lets you do it all in one query.
select 'The name ' + u.Name + ' is shared by ' + cast(count(*) as varchar(10)) + ' users'
from Users u
where u.AccountIsActivated = 1
group by u.Name
having count(*) > 1
Understandable choice, because subqueries start to get unreadable fast. I believe this is due to the middle-out (select outer-stuff from inner-queries where more-outer-stuff) syntax, which gets hard to follow when nested more than a level or two. I find the front-to-back pipeline approach of LINQ or F#'s Seq module far easier to track as data gets filtered and transformed in the same sequence that the code reads.
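As a sanity check, the single-query HAVING form runs essentially as-is on SQLite (which spells string concatenation `||` and casts the count to text automatically); a sketch with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Name TEXT, AccountIsActivated INTEGER);
    INSERT INTO Users VALUES
        ('Ann', 1), ('Ann', 1), ('Bob', 1), ('Cal', 0), ('Cal', 1);
""")
rows = conn.execute("""
    SELECT 'The name ' || u.Name || ' is shared by '
           || COUNT(*) || ' users'
    FROM Users u
    WHERE u.AccountIsActivated = 1     -- filters rows, before grouping
    GROUP BY u.Name
    HAVING COUNT(*) > 1                -- filters groups, after aggregation
""").fetchall()
print(rows)  # [('The name Ann is shared by 2 users',)]
```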
Let's go back to my needlessly cruel and exaggerated example of "SQL math", in which all expressions are written as a fill-in-the-clauses COMPUTE statement with each operation being its own optional clause, in a fixed order. Suppose it was decided that it's helpful to be able to MULTIPLY after adding, even though multiplying normally comes _before_ adding in the fixed order, and people were sick of writing:
COMPUTE MULTIPLY 3 FROM (COMPUTE MULTIPLY 2 ADD 1 FROM 3)
So they came up with a new keyword for multiplying that goes _after_ the addition step.
They'd use a different word so you could tell the operators apart, but it'd basically be a synonym.
COMPUTE MULTIPLY 2 ADD 1 PRODUCT WITH 3 FROM 3
That's how HAVING feels to me. It's a band-aid that wouldn't be necessary or even desired if WHERE was a standalone operator on table types, just as our hypothetical PRODUCT WITH clause wouldn't be desired if we had a standalone * operator.
I get that SQL works and is extremely useful. Remember, I spent quite a bit of time implementing a library for working with it. But there are languages that make you think things like, "Oh, that's simpler than I thought. I can use this kind of construct to iterate over anything that implements the right interface. That was smart of the language designers". Then there are languages that make you think, "Huh. Another special case. Couldn't they have thought this through a bit more carefully before speccing out the language?".
SQL feels like the latter and yet I see people call it a beautiful language or their favorite language and I get a strong "stop liking what I don't like" impulse.
The code snippet I posted doesn't adequately demonstrate it, but the DSL is strongly typed with full type inference at every step. So for example, it knows that the table after the "join" has a field "age" of type "int", and that filtering on it is okay. Trying to filter on a non-existent field will cause a compile (not runtime) error.
Second, the "steps" are actually plain functions. `#` is simply the reverse function application operator defined as `a # f = f a`. Which means you can abstract operations. For example, you can define a generic function that takes any relation with an age field and filters for age > 60. Thanks to strong types, using it incorrectly (i.e. on a relation without an age field of type integer) would cause a compile error.
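A rough Python emulation of that reverse-application style (spelled `|` here via a tiny wrapper class, since Python has no `#` operator; this is an illustration, not the actual DSL) shows why plain-function steps compose so well:

```python
# Each pipeline step is an ordinary function, so steps can be named,
# reused, and passed around -- the property the DSL gets from `a # f = f a`.
class Pipe:
    def __init__(self, value):
        self.value = value

    def __or__(self, fn):
        return Pipe(fn(self.value))

def over_60(rows):
    # A reusable "filter on age > 60" step for any relation with an age field.
    return [r for r in rows if r["age"] > 60]

def names(rows):
    return [r["name"] for r in rows]

people = [{"name": "Ada", "age": 82}, {"name": "Linus", "age": 49}]
result = (Pipe(people) | over_60 | names).value
print(result)  # ['Ada']
```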
Also, the language itself is a deeply embedded DSL. So "ordinary" code also compiles down to SQL. You can do things like loops etc. (for example, create a function that filters over a list of age limits rather than just one) and it will try to generate the best SQL possible.
of course the SQL generation is currently not implemented, so all this is theoretical, but I'm getting there!
There is also the approach of https://querycombinators.org which is inspired by functional programming but results in a declarative language. Queries are composed of functions that can be named and tested separately.
You may be interested in my project "Binate"; I have some incomplete notes at http://canonical.org/~kragen/binary-relations about it, exploring a possible design space in which we might find a more concise and compositional query language. Of course that's not in a host programming language; for that there do exist some things like SQLAlchemy and SQLObject which, as a bonus, can compile the resulting query to SQL. The most popular of these is Django's "ORM", which unfortunately is very limited.
I don't see views and subqueries as clumsy. Especially with sets, you think in terms of creating new sets and combining those into other sets.
In Postgres I've created financial-year reports, with monthly summaries per category, just by having a few layers of views.
I think it's really valuable that I can think in logical sets, and the database will take all those layers of views and combine them into one optimized query plan.
But views and functions still require you to persist those objects in the database first. There's no such thing as a query "variable" that you can then re-use in multiple subsequent statements. Of course, you can use table variables or temporary tables to hold intermediate data, but those are eagerly evaluated, whereas functions, views and CTE's are lazily evaluated, and that allows for a massive performance boost (due to optimization).
I can see the appeal of such a construct in SQL. The requirement that reusable objects are persisted in the database requires a top-down design approach, and that doesn't really blend well with modern coding practices.
There is no limit to reuse of views??
You can create as many intermediate or base views as you like. Within a view definition, CTEs have local scope and are similar to variables; if you need schema scope, just define the query block as a view instead.
UDFs are simple to create, are lazy, and can be easily reused and composed. In Postgres they are first-class objects: scalar UDFs can be called anywhere in a SELECT, and set-returning UDFs in the FROM clause; use more than one and they are also automatically lateral joined, which is incredibly powerful and composable once you get your head around them.
No, CTEs do not have local scope, at least not like variables. CTEs are statement-scoped; no matter how many SQL statements you have in your current scope, your CTE vanishes after the first statement terminates.
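This is easy to demonstrate (a quick sketch on SQLite from Python; the CTE name is made up):

```python
import sqlite3

# A CTE lives only for the single statement that defines it; the very
# next statement in the same session cannot see it.
conn = sqlite3.connect(":memory:")
total = conn.execute(
    "WITH nums(n) AS (VALUES (1), (2), (3)) SELECT SUM(n) FROM nums"
).fetchone()[0]
print(total)  # 6

try:
    conn.execute("SELECT * FROM nums")  # the CTE is already gone
except sqlite3.OperationalError as e:
    print(e)  # e.g. "no such table: nums"
```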
> SQL is troublesome to some programming types because it seems alien to ask what you want instead of telling the computer what to do and I find most programmers, especially ASD-types (who I think have an edge for some situations, like writing certain code in a huge org like Google) find this an unfamiliar and strange way of thinking.
That's basically how most tools work outside of software development. When you use Google, you aren't telling it how to find your result; you're telling it what you want.
Meh, not quite. We've gotten so accustomed to phrasing our requests in search-engine-friendly terms that we don't realize we're doing it (e.g. cutting out all the small words).
You’re definitely right that SQL is alien to many procedural / object oriented / functional / etc developers.
Even I, having worked with Oracle writing PL/SQL (which is an abomination, imo) in past jobs, still have to sit down and get into the right mindset before writing efficient SQL for more complex queries.
I think SQL is entrenched by network effects. It has a unique conceptual paradigm, which is why it is hard to learn. But any replacement is also likely to be conceptually strange, and no one is going to put in the time to learn something weird unless it has the adoption of SQL. Thus we are stuck in a local minimum with SQL. (A pretty good one, though.)
It is not hard to learn - non-programmers use it all the time.
Conceptually SQL is much simpler than programming; it basically reads like English:
SELECT customer, SUM(total)
FROM orders
WHERE created BETWEEN '2018-01-01' AND '2018-12-31'
GROUP BY customer
ORDER BY SUM(total) DESC
Compare that to the programming necessary to implement the above:
totals = {}
for row in rows:
    if row.created > '2018-12-31':
        continue
    if row.created < '2018-01-01':
        continue
    if row.customer not in totals:
        totals[row.customer] = 0
    totals[row.customer] = totals[row.customer] + row.total

for customer, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(customer, total)
Not to mention the SQL version gets first hand knowledge on available indexes in order to speed up the query.
Now imagine adding an AVG(total) to both the SQL and programming version...
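For what it's worth, the SQL side of that thought experiment is one extra expression, while the hand-rolled loop would need a second accumulator and a count. A quick sketch on SQLite from Python, with made-up orders data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, total REAL, created TEXT);
    INSERT INTO orders VALUES
        ('acme', 100, '2018-03-01'),
        ('acme', 200, '2018-06-01'),
        ('zeta',  50, '2018-07-01'),
        ('acme', 999, '2017-12-31');  -- outside the date range
""")
# Adding AVG(total) is just one more item in the select list.
rows = conn.execute("""
    SELECT customer, SUM(total), AVG(total)
    FROM orders
    WHERE created BETWEEN '2018-01-01' AND '2018-12-31'
    GROUP BY customer
    ORDER BY SUM(total) DESC
""").fetchall()
print(rows)  # [('acme', 300.0, 150.0), ('zeta', 50.0, 50.0)]
```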
I don't think it's non-programmers that keep SQL around; it's that there hasn't been a replacement for SQL that is better enough to replace it. I think C suffers from the same problem.
I don't disagree. I just don't think we've figured out the next step. Other paradigms (MapReduce and graph databases are perfect examples) have introduced very interesting and clever ways to querying data, but nothing has replaced SQL. If anything, it seems like it will be additive on top of what SQL has already done.
I know a lot about data, but I can't solve that problem. To anyone out there much smarter than me: this is a real problem. If you solve it, you'll cement yourself in the history of computer science.
I think graph databases with time will eventually be eating some of the RDBMs market (used for the same purpose), but they haven't had the time to mature as SQL/RDBMs.
I'm dubious. Postgres added OO features when that was the fashion, then XML, then JSON. To my mind graphs are still relations, just indirect ones. I use graphs a lot within Postgres with no problem; it has native recursive CTEs, which are surprisingly fast for most graph queries. For very large graphs I cache the graph with an automatically generated transitive closure table or materialized view. With Postgres you can do UNION DISTINCT inside the recursive CTE to prevent cycles, and it's remarkably fast: millions of rows in under a minute.
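The recursive-CTE-over-a-graph technique looks like this (a sketch on SQLite, which shares the WITH RECURSIVE syntax; the edge data is made up):

```python
import sqlite3

# Reachability over a cyclic graph. Plain UNION (not UNION ALL)
# deduplicates rows, which is exactly what stops the traversal from
# looping forever on the a -> b -> c -> a cycle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','a'), ('b','d');
""")
rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()
print([r[0] for r in rows])  # ['a', 'b', 'c', 'd']
```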
I agree with you and the parent comment. I run a data science department and we rely on SQL. We even embed SQL queries into R scripts to pull the latest data. It isn't worth working with ORM for an analytics project when a few dozen lines of SQL can get you everything you need.
> On the other end of the spectrum are the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data or various other NoSQL technologies. They often know less about raw SQL than even junior engineers because SQL is below their high-power big data tools.
I always ask a basic SQL question (involving sorting/grouping and one simple join) in my interviews for backend and/or data. IMO if you are calling yourself a Data Engineer and do not know SQL, then you haven't really worked with data, you've mostly just worked on data infrastructure.
(please excuse my typing. i have one hand to use ATM)
yup. we bet super hard on sql too. It's fantastic if you know what you're doing. We can import arbitrary data and expose it as normal tables to our customers for analysis/transform/export, along with having a siloed, access-controlled place for them to see just their data. sql is a great technology, and is extremely flexible.
I love it. If you're looking to learn it and the fundamentals, check out Jennifer Widom's course from Stanford. I can safely credit her w/ my career.
Certain subsets of the developer community are usually excellent with SQL. It's especially evident with people who have been working with C# for > 10 years, because Microsoft has always heavily pushed SQL Server with their developers.
There are plenty of use-cases for SQL with developers, especially in batch processes such as invoicing. A well-crafted SQL query can execute orders of magnitude faster than iterative code and takes a lot less time to implement.
This tells us that people think SQL is good, but it's unfortunate that this is attempted, because the good thing about SQL is how well it corresponds to its data structures. Attempting to imitate SQL's form, rather than its principle of correspondence, has been counter-productive and prevented the reproduction of its merits. Case in point: Cypher.
SQL is so long-lasting because it's math, sort of. It's a language for expressing data relationships in a very systematic, logical, and set-theoretic way, not a random ad hoc hack.
Its syntax is a bit old-fashioned, and I do think efforts to make it more native to programming environments rather than a bolt-on might be fruitful, but its concepts are timeless.
It seems to me that graph databases are far more efficient than relational ones for most tasks.
That’s because all lookups are O(1) instead of O(log N). That adds up. Also, copying a subgraph is far easier, and so is joining.
Think about it, when you shard you are essentially approaching graph databases because your hash or range by which you find your shard is basically the “pointer” half of the way there. And then you do a bunch of O(log N) lookups.
Also, data locality and caching would be better in graph databases. Especially when you have stuff distributed around the world, you aren’t going to want to do relational index lookups. You shard — in the limit you have a graph database. Why not just use one from the start?
So it seems to me that something like Neo4J and Cypher would have taken off and beat SQL. Why is it hardly heard of, and NoSQL is instead relegated to documents without schemas?
Chasing pointers (on the same medium) is usually slower than the access patterns that databases usually use.
First, a database can have hash indexes instead of btree indexes, so lookups can be O(1) too, but it turns out that btrees are often better because they can return range results efficiently, and finding the range in a btree is only logarithmic for the first lookup. If your index is clustered - if it covers the fields needed downstream in the plan - no further indirections are needed and locality is far better than a hash lookup. For example, a filter could be efficiently evaluated along with the index scan. And if your final results need to be sorted, major bonus if the index also covers this.
Second, it's best to think of databases doing operations on batches of data at a time. Depending on your database planner, different queries will tend to result in more row-based lookups (MySQL generally does everything that isn't a derived table, uncorrelated subquery or a filesort in a nested-loop fashion) but others build lookups for more efficiency (a derived table in MySQL, or a hash join in Postgres). The flexibility to mix and match hash lookups with scans and sequential access - which are usually a few orders of magnitude faster than iterated indirection - means it can outperform graph traversal that is constantly getting cache misses at every level of the hierarchy.
The reason NoSQL and document databases suck from a relational standpoint is that they are bad at joins, and work better if you denormalize your joins and embed child structures in your rows and documents. Add decent indexes to NoSQL and they gradually turn back into relational databases - indexes are a denormalization of your data that is automatically handled by the database, but have nonlocality and write amplification consequences which can slow things down.
In terms of distribution, a relational plan can be parallelized and distributed somewhat easily if your data is also replicated - sometimes join results need to be redundantly recalculated or shuffled / copied around - and most analytics databases use this approach, though usually with column stores rather than row stores, again because scanning sequential access is so much faster than indirection. Joins don't always distribute well, is the main catch.
Depends a lot on the actual access patterns of your data.
Many recent web & mobile apps have a lot of screens where you just want to grab one blob of heterogeneous data and format it with the UI toolkit of choice. Or if they do display multiple results, it's O(10) rather than O(1000) or O(1M). Chasing pointers is fine for use-cases like this, because you do it once and you have all the information you're looking for.
This is also behind the recent popularity of key/value stores and document databases. If all you need is a key/value lookup, well, just do a key/value lookup and don't pay the overhead of query parsing, query planning, predicate matching, joins, etc. When I was working on Google Search > 50% of features could get by with read-only datasets that supported only key/value lookup. You don't need a database for that, just binary search or hash indexes over a big file.
Oh I agree. Though I think having the rest of a relational DB is really nice should you need more than a simple key-value lookup, and not have to cobble it together through application-side joins and denormalization.
The application I work on in my day job does not match the key/value lookup idiom at all. User-defined sorts and filters over user-defined schema, and mass automated operations over data matching certain criteria. If you squint a bit, the app even looks a bit like a database in terms of user actions.
And even relational databases (at least row-oriented with primarily disk storage) have their limit here. With increasing volumes of data, it can't keep up. We can't index all the columns, and indexes can't span multiple tables. We increasingly need more denormalization solutions that convert hotter bits of data into e.g. in-memory caches that are faster for ad-hoc sorts and filters. Database first is a decent place to start, though having a first-class event feed for updates would certainly be nice...
Depends a lot on the actual access patterns of your data.
Yup. Thing is: with RDBMSs you are in control of both the storage patterns and the access patterns of your data. That's where a large part of the performance benefit comes from.
> 50% of features could get by with read-only datasets that supported only key/value lookup
Did you implement a storage pattern that was ordered by key (or hash(key) if you used hashing)?
A hash index is O(1) no matter what language it's called from. The reason most people don't use hash indices very often is that they do not allow you to do a lot of useful things that you often end up wanting to do, such as retrieving all values within a given range.
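A toy sketch of that trade-off in Python (a sorted list plus binary search standing in for a btree, a dict for a hash index):

```python
import bisect

# Both structures handle point lookups, but only the ordered one
# answers a range query with one logarithmic seek plus a sequential scan.
keys = sorted([5, 1, 9, 3, 7])               # ordered "index": [1, 3, 5, 7, 9]
hash_index = {k: "row%d" % k for k in keys}  # hash "index": O(1) point lookups

assert hash_index[7] == "row7"               # point lookup: both are fine

lo = bisect.bisect_left(keys, 3)             # O(log n) seek to range start
hi = bisect.bisect_left(keys, 8)
print(keys[lo:hi])  # [3, 5, 7] -- a range the hash index can't scan cheaply
```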
Graph databases are great for retrieving existing data with associated data.
The power of SQL comes from the fact that you can easily create new information out of the data: create new sets, group by certain features, aggregates on certain features.
It's a lot more powerful than just store and retrieve.
People forget that SQL isn't just a query language (DQL); it's also a data definition (DDL), data manipulation (DML), data control (DCL), and transaction control (TCL) language [0].
Check out the platform that a full-blown Oracle license can provide to your DBA... It's definitely more than just CRUD.
N4J's a lot slower at tons of common tasks. If you look into its underlying data model, it's clear why. It achieves the speed it does at certain graph-traversal operations by storing its data in such a way that it's highly specialized for those operations—one would expect this to come at a high cost for other operations, and sure enough, it does.
It also doesn't bother with tons of consistency guarantees and such that you (may) get from, say, PostgreSQL. Yet is still slower for many purposes.
In practice enterprise applications do a lot of relational operations.
Graph DBs might be valuable for a lot of problems, and it does feel like something like Neo4J would make a lot of sense for stuff like social networks, but for business records stuff isn't really that spread out
> SQL is one the most amazing concepts I've ever experienced. ...and we always seem to try to re-create SQL in those technologies (e.g. Hive, Presto, KSQL, etc).
It's not SQL that's the concept. The concept there is set theory/intersection/union and predicates.
That's why you think you are "recreating" those, because they can be mapped using the same concept
When people say "SQL" they often refer to a bunch of concepts, some better than others, all bundled together.
SQL gets you tables and joins, sure. But it also gets you queries that (some) non-programmers can write, outputs that are always a table, a variety of GUI programs to compose queries and display the results, analytic functions to do things like find percentiles, and tools to connect from MS Excel. And it often means you get transactions, ACID compliance, compound indexes, explicit table schemas, multiple join types, arbitrary-precision decimal data types, constraints, views, and a statistics-based query optimiser.
Of course it also gets you weird null handling, stored procedures, and triggers.
The people behind the NoSQL movement understand SQL better than anyone glorifying SQL here. In fact, I think this is a testament to how poorly most people actually understand SQL: they can't even see the problems sitting plainly on its surface.
All I heard from NoSQL proponents was that "relational databases don't scale". These same people were usually dealing with data sets that would fit easily onto a single server.
No, I'm pretty sure you only heard it from your fellow RDBMS friends. I have never argued for scale myself, but plenty of times tried to debunk this exact stereotype.
I think there are people who went with NoSQL because it fit their specific needs better than a regular RDBMS, and people who went NoSQL because they found it easier to work with than a RDBMS (at first).
The latter group often didn't want to learn SQL and went with NoSQL's as a shortcut, not because they had an intimate understanding of the tradeoffs between the approaches and decided NoSQL was the better option.
There was the third group of resume driven developers who bought into the MongoDB hype around 6 or 7 years ago, which probably overlapped with the second group a bit.
Five-decade-old SQL is nothing like modern SQL, with its tons of proprietary extensions: partitioning, window functions, collations, typecasts, JSON, and god knows what else. Your examples "Hive, Presto, KSQL, etc" are proof of this; they are so vastly different from each other that you cannot simply learn "SQL" and expect to use those tools in any serious manner.
This is precisely proof of the opposite: SQL has not stood the test of time.
I understood when OP said the "concept of SQL" that they referred to a pretty broad idea of querying tabular data in rows and columns with some structured language.
I find it hard to accept that people are talking about 'querying tabular data in rows and columns with some structured language' when they talk about SQL. They are surely talking about the specific syntax.
It would be insane to expect somebody to arrive at a perfect, feature-complete, working version of anything on the first try.
That's true for any piece of software. SQL is not done yet, and the fact that you can still put neat little features into it is proof of how well it was designed.
Heck we still have Lisp evolving today. A lot of things got done right in the early days.