Peter Krautzberger · on the web

Towards a general method for accessible content trees or: deep aria-labels for equations revisited

Oh dear, that title is a mouthful. A while ago I wrote about two interesting results from the AIM workshop and promised to dive deeper. Well, take a deep breath.

A simple example

Here's a story. I think it was at the first web-standards-related event I ever attended, the W3C workshop on ebooks back in 2012. Someone (maybe Janina?) presented an example of an accessible SVG and I was blown away. My memory, flawed as it is, says it was the classic SVG tiger, but it was set up in a way that demonstrated amazing exploration features, providing non-visual representations that could dive into the entirety of the graphic, starting with high-level descriptions (something like "a tiger's head") all the way down to detailed nuances ("left whisker, 3 of 12").

I'm prone to getting the specifics wrong, so here's a different example:

[Image: clip art of a house]
By barretr (Open Clip Art Library), http://openclipart.org/media/files/barretr/2941, CC0

So this is a house. How would you describe it? Maybe: a house with a red chimney and a blue door? That's not bad, but there's more, so much more, to be said about this house!

These descriptions could of course all be put into one very long textual representation, e.g., as a <title> or an aria-labelledby construction. And that would be ok. But I find it rather limited.
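For the house above, that flat approach might look something like the sketch below (hypothetical; the ids and the description text are made up for illustration):

    <svg role="img" aria-labelledby="house-desc" viewBox="0 0 100 100">
      <title id="house-desc">
        A house with a red chimney, a gray roof, a blue door with a golden
        doorknob, and a window with four panes.
      </title>
      <!-- ...the actual shapes of the drawing... -->
    </svg>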

This is not how a human would describe things. Imagine I'd ask you to describe it. You would not start with the gradient of the doorknob on the first go. I bet you are much more inclined to provide some information at first and get into more detail if whoever is asking wants to dive deeper.

Sometimes we are in the position of having more information like this on the web, too.

Imagine this house was created in an authoring environment that specializes in such drawings; it may have been assembled by drag-and-drop from pre-fabricated components, each with a detailed description, integrating user changes such as shape or color modifications, and able to generate composited descriptions, perhaps combining them using simple rule sets (maybe even author-customizable rule sets).

The other thing you may notice is that the house is more than the sum of its parts, i.e., a description of the house (and of parts thereof) may not be sufficiently represented by stringing together the descriptions of the leaves; for example, where would the "with" in "a roof with a chimney" come from? For that matter, where would "house" come from? Depending on the content and context, there may be additional connecting words or phrases, and there may be details to drop or reveal. Maybe the fabric of the roof, or whether the door is locked, can be deduced from visual styling given other context.

If you are lucky and have more information, you may find yourself wanting to add differing textual representations at every level of the tree, just like you would in a real conversation, and you may want a way for users to access all those varying levels of representation - but not all at once, as that could be overwhelming.

The most important point: like all good web standards topics, this is about a general, low-level problem. (Although solving a more general problem might appeal to mathematicians, too.)

Deep aria labels on tree structures (or: it's not just about equations)

So let's try to outline what this is about. Imagine you have a piece of content whose rendering is backed by an underlying tree structure - a drawing like the house above, or an equation.

Now imagine that you have a meaningful textual representation not just for the content as a whole but for every node of that tree.

What to want

Let's start with some fairly standard observations on accessible rendering:

Unified rendering. Visual and non-visual rendering should not drift apart. Textual representation should be intentional, i.e., reflect the intention of the author. (This does not contradict that both graphical and textual representations will likely be created with tools, even tools leveraging heuristics.)

Progressive enhancement / graceful degradation. A solution should work in a way that allows content to be progressively enhanced. For example, a top-level textual description (e.g., using aria-label) is a robust fallback. You may lose some convenience if that's all there is - and even some information - but it certainly isn't terrible.

Performance. A solution must be performant, especially if it is applied to hundreds or thousands of content fragments.

From an author's point of view, the key affordance is precision/control. This is worth repeating: Accessibility inevitably starts with author control. If authors cannot create content in a way that they can trust to render reliably, i.e., with the precision they put into their content, then they will not care to do so.

If there's no control, the platform is failing its authors. And if it fails authors trying to create accessible content, then it fails users, because they will not receive accessible content.

This primarily means that content should be authorable in a way that does not require any heuristics on the side of rendering (visually or non-visually). Imagine AT had to guess how many items are in a list, or had to throw computer vision at each image to guess a description. That's ok for broken content but not acceptable for good content.

There are other useful things of course - ease of authoring comes to mind. But without a solution with tangible benefits, building authoring tools or practices is never going to happen.

From a screenreader user's point of view, there are more affordances that you probably don't want to ignore.

There are many more considerations beyond this but this would be a good start.

Towards a solution: mathjax-sre-walker

Note: this is not a complete solution to all of the above. But I feel like it's heading in the right direction.

The codebase for this lightweight walker, dubbed mathjax-sre-walker, is on GitHub, and for this first public summary we've tagged v2.0.0. As I mentioned in that earlier post, this work with Volker Sorge grew out of a demo that David Tseng, Volker Sorge, and Davide Cervone built at the AIM workshop in San Jose last year. A simplified demo in a CodePen is embedded below, alongside a recording of a quick demonstration.

what users get

For the visual user, it will provide a means of visually exploring the underlying (and often hard to discern) tree structure by putting the tree in focus and using the arrow keys.

For the non-visual user, it will additionally provide textual representations for each tree node, in sync with the visual representation. It doesn't, but could (should we get separate Braille streams in ARIA), also provide a simultaneous rendering in specialized formats such as Nemeth or UEB, chemical Braille, or others.

For the screenreader user, it will provide the top-level tree node in browse mode. When the tree's top-level DOM node is voiced, the screenreader should put it in focus, triggering visual highlighting; the screenreader should also announce the tree role, implying that further functionality is available.

The user can switch to the screenreader's focus mode to use keyboard exploration with the arrow keys which is matched visually by the highlighting. When the user switches back to browse mode, they can continue naturally browsing to the next piece of content.

how users get it

The first, not too relevant part: in a first step, we enrich the content with a secondary structure, storing lots of information in data- attributes in the DOM tree. Getting such information is of course not easy, and this step can be done server-side. Ultimately that's not the hardest part - domain experts can build such tools - and for equations we can already automate it thanks to Volker's speech-rule-engine (which is a marvel).
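To give a rough idea of what such an enrichment might look like (the attribute names below are made up for illustration; they are not the exact attributes speech-rule-engine emits): each node records its id, its children in the secondary tree, and a textual representation.

    <!-- secondary structure embedded in data- attributes (names are illustrative) -->
    <g data-node-id="roof-group" data-node-children="roof chimney"
       data-node-description="a roof with a red chimney">
      <polygon data-node-id="roof" data-node-description="a gray roof" points="10,40 50,10 90,40"/>
      <rect data-node-id="chimney" data-node-description="a red chimney" x="62" y="18" width="8" height="20"/>
    </g>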

Yet all the extra information won't help if we can't make use of it on the web platform.

So how is this realized in the DOM tree? As a bunch of aria-labels (to add textual representations) and aria-owns attributes (to carve out a tree structure that might differ from the DOM tree); we also add a role to most nodes. In particular, we immediately get a top-level aria-label which serves as a fallback.
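In markup, the result might look roughly like this for the house (a hypothetical, simplified sketch, not the actual output for MathJax's SVG; ids, labels, and geometry are made up):

    <svg id="house" role="tree" tabindex="0" aria-owns="roof body"
         aria-label="a house with a red chimney and a blue door" viewBox="0 0 100 100">
      <!-- aria-owns spells out the semantic tree, which may differ in order and
           structure from the DOM tree -->
      <g id="roof" role="treeitem" aria-owns="shingles chimney"
         aria-label="a roof with a red chimney">
        <polygon id="shingles" role="treeitem" aria-label="gray shingles" points="10,40 50,10 90,40"/>
        <rect id="chimney" role="treeitem" aria-label="a red chimney" x="62" y="18" width="8" height="20"/>
      </g>
      <g id="body" role="treeitem" aria-owns="door window"
         aria-label="white walls with a blue door and a window">
        <rect id="door" role="treeitem" aria-label="a blue door with a golden doorknob" x="45" y="60" width="15" height="30"/>
        <rect id="window" role="treeitem" aria-label="a window with four panes" x="15" y="60" width="20" height="20"/>
      </g>
    </svg>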

Now what we're missing is some kind of AT functionality that would give us an aria-owns tree walker. We already have built-in table walkers in screenreaders, so this does not seem like a massive stretch to imagine, especially given the evolution of the tree role so far. Sadly, we do not have general-purpose tree walkers (yet).

In the second part, we overcome this by adding such a walker in JS. The walker consists of a tree structure (the aria-owns tree, generated from the embedded data for performance) and a keyboard listener. It is very close to the DOM's TreeWalker API and the WCAG tree examples, except that we're working on the aria-owns tree, because that tree may have a different order and structure from the DOM. The walker is fairly minimal, probably ~100 lines of ES6 code if you strip it down to its minimum.
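To make the idea concrete, here is a minimal sketch of such a walker, assuming markup like the house example above (this is not the actual mathjax-sre-walker code; the class name, the CSS class, and the key mapping are made up for illustration): it reads the tree from the aria-owns attributes, keeps track of a current node, and maps the arrow keys to tree movements, updating a highlight class and aria-activedescendant as it goes.

    // minimal aria-owns walker sketch (illustrative only, not the actual mathjax-sre-walker)
    class OwnsWalker {
      constructor(root) {
        this.root = root;      // the element carrying role="tree" and tabindex="0"
        this.current = root;
        root.addEventListener('keydown', (event) => this.handleKey(event));
      }
      // children according to aria-owns, which may differ from DOM order/structure
      children(node) {
        const owns = node.getAttribute('aria-owns');
        return owns
          ? owns.split(/\s+/).map((id) => document.getElementById(id)).filter(Boolean)
          : [];
      }
      parent(node) {
        if (node === this.root) return null;
        return this.root.querySelector(`[aria-owns~="${node.id}"]`) || this.root;
      }
      sibling(node, offset) {
        const parent = this.parent(node);
        if (!parent) return null;
        const siblings = this.children(parent);
        return siblings[siblings.indexOf(node) + offset] || null;
      }
      move(target) {
        if (!target) return;
        this.current.classList.remove('walker-focus');
        this.current = target;
        this.current.classList.add('walker-focus');   // visual highlighting
        if (this.current === this.root) {
          this.root.removeAttribute('aria-activedescendant');
        } else {
          // point AT focus at the current node so its aria-label gets voiced
          this.root.setAttribute('aria-activedescendant', this.current.id);
        }
      }
      handleKey(event) {
        const moves = {
          ArrowDown:  () => this.children(this.current)[0],  // first child
          ArrowUp:    () => this.parent(this.current),       // parent
          ArrowRight: () => this.sibling(this.current, +1),  // next sibling
          ArrowLeft:  () => this.sibling(this.current, -1),  // previous sibling
        };
        if (moves[event.key]) {
          event.preventDefault();
          this.move(moves[event.key]());
        }
      }
    }

    // usage: new OwnsWalker(document.querySelector('[role="tree"]'));

Whether to move actual DOM focus (a roving tabindex) or to use aria-activedescendant as above is a design choice; the sketch uses the latter so the individual nodes never need to be focusable themselves.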

Here's a demo of v2 or you can look at the one in the repository.

See the Pen mathjax-sre-walker v2 demo by Peter Krautzberger (@pkra) on CodePen.

role role role your boat, gently down the stream

A side note on the chosen role attributes. The tree role and its related roles may appear to be a good fit, but they have been developed for specific application-like interfaces. It might be that it's smarter to use something different here; I honestly don't know.

Besides possibly being the right roles, they are also well supported across the accessibility tool chain, i.e., they happen to produce the effects we'd like to see.

What are those effects?

Other roles have too many negative side effects in practice. Perhaps they shouldn't, but it's often too hard to dissect whether a problem comes from the ARIA specs, browser implementations, OS APIs, or screenreaders. For example, some approaches didn't work well on MathJax's SVG output but worked well on the clip art house; this is probably due to the <use> elements.

Some other roles we've tested across screenreaders:

Maybe those issues are fixable, or maybe they are just due to my lack of understanding of the specs and implementations. Of course, the mythical role=static (text etc.) might be very appropriate, but, alas, it doesn't exist.

Personally, I don't care which role I use. Whatever role works, I'm happy to use it. The tree roles seem both adequate and semantically fitting, and they have a history of steady improvement.

In real life

Below is a recording with NVDA and Chrome on Windows 10.

Support

Overall, this works well in Firefox and Chrome, while Edge and Safari generally don't get you more than the top-level label, i.e., the fallback; I haven't taken the time to compile for IE11 to test it.

NVDA seems best so far, JAWS seems to have a problem tracking focus (it jumps away when getting back into browse mode / virtual cursor), and Orca struggles with CSS rendering (see update below). VoiceOver with Safari is doing its thing (treating everything as a group), but VO works well with Chrome on macOS. On iOS and Android we get the top-level labels (except VO with CSS rendering, for some reason). The current code lacks touch input because (as far as I know) neither TalkBack nor VoiceOver has a way to switch into (some form of) focus mode; it could be added, and perhaps the visual exploration alone is interesting enough. I'll be publishing more demo runs as we move along.

Overall, I'm excited about the robustness at this stage and I plan to use this at work soon(ish). I also hope to bring the discussion around standardization of tree walkers to the ARIA Working Group - it seems to align with the evolution of tree widgets (e.g., for tab focus management, positional information) and a lot of content could benefit from some defaults in AT (much like with table walkers). But first we really need separate Braille streams.

Update 2019-01-24: Joanmarie Diggs was kind enough to look into the issues with CSS layout (commits 9357aa9c and 87d78dad) and Orca now matches NVDA beautifully.