> > default to HTML escaping on output
> ??? Is that a joke? Don't do that! That's just as bad as magic quotes.
Microsoft's Razor and Python's Django templates have both shown that HTML-escaping-by-default can be cleanly done in a way that is not magical and that is highly reliable. I'm honestly not sure what you're protesting here.
In a way, "default escaping" isn't even the right way to think of it. Every system has a function to emit raw characters and a function to emit them HTML-encoded. "Default" escaping just means that the function that emits HTML-encoded characters is easier to get at and smoother to use than the raw one.
I understand the debates Django had about changing what the unspecified output function did, but nowadays if you're building a new template language of any kind it's an absolute no-brainer: Make it easier to encode than to bypass encoding. The alternative is just awful.
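To make "easier to encode than to bypass" concrete, here is a minimal sketch of a toy template function. The names `render` and `Raw` and the `{{ name }}` slot syntax are made up for illustration; this is not how Razor or Django implement it.

```python
import html
import re


class Raw:
    """Hypothetical wrapper: the caller asserts this markup is already safe."""

    def __init__(self, markup: str) -> None:
        self.markup = markup


_SLOT = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, **values: object) -> str:
    """Fill {{ name }} slots, HTML-encoding every value not wrapped in Raw."""

    def substitute(match):
        value = values[match.group(1)]
        if isinstance(value, Raw):
            return value.markup         # explicit, visible bypass
        return html.escape(str(value))  # the short, default path

    return _SLOT.sub(substitute, template)


page = render(
    "<p>{{ comment }}</p>{{ footer }}",
    comment="<script>alert(1)</script>",
    footer=Raw("<hr>"),
)
print(page)
```

Doing the safe thing takes no extra typing, while skipping the encoding means writing `Raw(...)`, which is easy to spot in review.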
If I make a string: '<P>' . $var . '</P>' - does it know to encode the variable, and not the entire string?
What if I assign that string in a variable, and then output it later?
I'm not convinced this can be done well. Maybe if all you do is make some templates and fill them in you could do it. But I do a lot more than that, I output dynamically built html all the time.
Just give people a very easy and shortly named function for escaping.
Just needs a bit of type system magic. Don't use the same type for escaped and unescaped strings. (And don't use the same type for user generated input before and after it's scrubbed / escaped of any nastiness.)
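A rough sketch of the two-types idea, in Python rather than Haskell so it can be run as-is. `Escaped`, `escape` and `emit` are hypothetical names, and the check here happens at run time; a static checker such as mypy could flag the bad call earlier.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Escaped:
    """Text that has already been HTML-encoded. Deliberately not a str."""

    text: str


def escape(untrusted: str) -> Escaped:
    """The only intended route from a plain str to Escaped."""
    return Escaped(html.escape(untrusted))


def emit(fragment: Escaped) -> str:
    """Output function: refuses anything that did not go through escape()."""
    if not isinstance(fragment, Escaped):
        raise TypeError("emit() needs Escaped, got " + type(fragment).__name__)
    return fragment.text


print(emit(escape("<b>hi</b>")))
try:
    emit("<b>hi</b>")  # a plain str is the wrong type
except TypeError as err:
    print("rejected:", err)
```

Because `Escaped` is not a `str`, forgetting the conversion is a type error instead of a silent XSS hole.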
Ask any Haskell weeny for details.
Also, in your example you'd probably be better off if your language knew about the HTML structure, e.g. something like P($var), instead of putting the tags in as strings.
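As a sketch of what "knowing about the HTML structure" could look like: a tiny element tree where text children are only encoded at render time. `Element` and `P` are hypothetical names, and attributes are left out to keep it short.

```python
import html
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """A node in a minimal HTML tree; children are elements or plain text."""

    tag: str
    children: List[Union["Element", str]] = field(default_factory=list)

    def render(self) -> str:
        # Plain strings are text nodes, so they are always encoded here.
        inner = "".join(
            child.render() if isinstance(child, Element) else html.escape(child)
            for child in self.children
        )
        return f"<{self.tag}>{inner}</{self.tag}>"


def P(*children: Union[Element, str]) -> Element:
    """Shorthand constructor for a <p> element."""
    return Element("p", list(children))


var = "1 < 2 & <script>"
node = P(var, Element("em", ["ok"]))
print(node.render())
```

Tags never exist as strings in the program, so there is no way to confuse markup with data.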
I like this, although when you concatenate strings, does it actually concatenate them (and lose the type info)? Does it escape them, and then concatenate? Or does concatenation actually make a closure, which is only executed upon output? (And does doing that make things memory heavy.)
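One possible answer to the concatenation question is "escape eagerly, then concatenate, and keep the type". A minimal sketch, assuming a hypothetical `SafeHtml` subclass of `str`; no closures are involved, so memory use is that of an ordinary string.

```python
import html


def _safe(value):
    """Return value unchanged if it is SafeHtml, else its HTML-encoded text."""
    if isinstance(value, SafeHtml):
        return value
    return html.escape(str(value))


class SafeHtml(str):
    """A str subclass meaning 'already safe to output as HTML'."""

    def __add__(self, other):
        # safe + anything: encode the right side unless it is SafeHtml too
        return SafeHtml(str.__add__(self, _safe(other)))

    def __radd__(self, other):
        # plain + safe: encode the plain left side before joining
        return SafeHtml(str.__add__(_safe(other), self))


var = "<script>alert(1)</script>"
fragment = SafeHtml("<p>") + var + SafeHtml("</p>")
print(fragment)  # only var was encoded, not the tags
```

The result is still `SafeHtml`, so assigning it to a variable and outputting it later does not escape it a second time. One caveat of this sketch: other `str` methods (slicing, `join`, `format`) return a plain `str` and lose the marker.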
Haskell is on my next-language-to-learn list.
> something like P($var)
Ugh. I hate that. I've spent years learning HTML, I want to write HTML, not some other language that looks like it.
Yes, judging by your questions, Haskell is a good language for you to learn. To give you a sneak preview: you can use the type system in such a way that your source program safely discriminates between the two types of strings (e.g. tainted and untainted), but there won't be anything left in the compiled assembly (tags or wrapping or whatever).
Also about the HTML: You can of course use different syntax for what I proposed, one that's closer to actual HTML. But I think as long as the syntax tree is preserved, it's still close enough to HTML for me, and all your accumulated knowledge about HTML is still applicable. (Not to be derisive, but if your years of learning are no longer of any use upon such a cosmetic change, you should probably examine your level of comprehension.)
$foo = <p>$textvar</p>;
$foo->append(<span class='will-end-up-before-close-of-p'>Hello!</span>); // etc.
Mozilla proposed to add XML literals to JavaScript at one point, which didn't take off for security reasons, but server-side it's a different ballgame... maybe it could be worked out? Hmm.
Wow, that's even uglier. But you did not only propose an alternative syntax for HTML, but also for its manipulation. So that's a good enough excuse.
Have you looked at how Racket (Scheme, Lisp) deals with encoding HTML in S-expressions? I find that rather nice, and even prefer it to plain HTML or XML. Racket is a fine language for manipulating S-expressions, too.
Even with static typing, you might end up implementing it using run-time support. (Of course the holy grail is to compile away all type information. But that's not always attainable. Even Haskell's GHC compiler keeps some information around for runtime. Something to do with typeclasses, if you want to look up the details.)
> Don't use the same type for escaped and unescaped strings.
And if you can't extend your type system to make this work, do it in your head, mutating the names of variables to help you keep it straight. For example, esStr and unStr are not of the same type, and moving data from one to the other without conversion is always an error.
This reminds me of Charles Simonyi's classic article on Hungarian Notation. I know that style gets criticized a lot, but that's usually when it has been used inappropriately. If you have a language with a weak type system then a sensible variable prefix convention can help a lot.
I'd say don't do that, refactor into templates instead. With your preferred approach to generating markup, you're largely on your own in protecting against XSS. Hopefully all the developers using your code are awesome at spotting and pro-actively dealing with XSS issues.
Still, with an escape-everything-as-HTML default, you can either turn it off or use a raw "I know what I'm doing" method instead.
In Razor at least you'd create HtmlHelpers to output HTML, which return a string that the system knows is HTML you've already dealt with. It knows not to escape the string because you've explicitly said the string you're creating is markup.
You'd do something like @Html.SalesWidget("90% off today"), with the SalesWidget being responsible for escaping the string.
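The same division of labour, sketched in Python so it runs on its own. `HtmlString`, `sales_widget` and `write` are hypothetical stand-ins for the idea, not Razor's actual API.

```python
import html


class HtmlString:
    """Hypothetical marker type: the output function emits these verbatim."""

    def __init__(self, markup: str) -> None:
        self.markup = markup


def sales_widget(message: str) -> HtmlString:
    """Helper that owns its markup and encodes the caller's text itself."""
    return HtmlString('<div class="sale">' + html.escape(message) + "</div>")


def write(value: object) -> str:
    """Stand-in for the template's output expression: encode unless marked."""
    if isinstance(value, HtmlString):
        return value.markup
    return html.escape(str(value))


print(write(sales_widget("90% off <today>")))
print(write("90% off <today>"))
```

The helper is the only place that has to think about escaping; everything passed to `write` as a plain string still gets encoded by default.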
Also I don't really get your argument, why not give people a really short and easy way of not escaping a string instead? The opposite of what you're suggesting is just as easy and far safer. You're less liable to accidentally muck up.
The hopefully obvious answer to your question is to not generate markup in PHP (or any server-side language). These are the kinds of questions that Backbone, Spine, Ember, etc. attempt to solve. You should look toward separating view concerns from your business logic and stop procedurally generating HTML in PHP.
Perhaps what we need is a language/platform that has built in strings that track not just the code page type encoding, but some kind of "intent assertion" as well -- is the string intended to be encoded for a particular output? Combining an "unknown" string with an HTML (or SQL, or PostScript, or JSON/JavaScript, ...) string would produce an exception.
Such a mechanism would have to include encoding functions (and assertion override functions), of course.
It seems this would help alleviate many types of fill-in-the-blank injection problems as well.
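A rough sketch of what such intent-tracking strings might look like. `Tagged`, `encode_html` and `assert_html` are invented names, and only the "unknown" and "html" intents are handled here.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A string plus the output context it is intended for."""

    text: str
    intent: str  # e.g. "unknown", "html", "sql"

    def __add__(self, other):
        if not isinstance(other, Tagged):
            return NotImplemented
        if self.intent != other.intent:
            raise ValueError(
                f"cannot combine {self.intent!r} with {other.intent!r}"
            )
        return Tagged(self.text + other.text, self.intent)


def encode_html(value: Tagged) -> Tagged:
    """Encoding function: turns an 'unknown' string into an 'html' one."""
    if value.intent == "html":
        return value
    if value.intent != "unknown":
        raise ValueError(f"cannot encode {value.intent!r} as html")
    return Tagged(html.escape(value.text), "html")


def assert_html(text: str) -> Tagged:
    """Assertion override: the caller vouches that text is already HTML."""
    return Tagged(text, "html")


name = Tagged("Tom & <Jerry>", "unknown")
page = assert_html("<p>") + encode_html(name) + assert_html("</p>")
print(page.text)
try:
    assert_html("<p>") + name  # unknown mixed into html
except ValueError as err:
    print("rejected:", err)
```

An SQL or JSON intent would need its own encoding function along the same lines.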
This problem has already been solved in the Haskell ecosystem [1]. For example, you get typesafe URLs so that if you have a standard query like myapp.com/person/345 you can't mistakenly misuse 345 as an article id. Every input string is tracked by the type system so the possibility for escape issues, injection attacks or cross site scripting exploits to sneak in is minimal. Static types also make sure that internal links can not be broken - if for example you decide to change the above URL to myapp.com/getperson instead, your application won't compile until you've fixed every other part that still references the old link .../person, and so on.
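To illustrate just the "can't misuse 345 as an article id" part, here is a small sketch in Python. `PersonId`, `ArticleId` and `person_url` are hypothetical names, and unlike the Haskell approach described above, the mistake is caught at run time (or by a separate type checker) rather than by the compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonId:
    value: int


@dataclass(frozen=True)
class ArticleId:
    value: int


def person_url(person: PersonId) -> str:
    """Build the /person/<id> link; only a PersonId is accepted."""
    if not isinstance(person, PersonId):
        raise TypeError("expected PersonId, got " + type(person).__name__)
    return f"/person/{person.value}"


print(person_url(PersonId(345)))
try:
    person_url(ArticleId(345))  # same number, wrong kind of id
except TypeError as err:
    print("rejected:", err)
```

Routing every link through a function like `person_url` also means a URL change happens in one place instead of in scattered string literals.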
Not to mention the (also type safe) dead easy to use persistence framework.
I'm still in the process of evaluating different solutions for my next web project but so far I'm pretty sure this is gonna be my go-to framework in the future.