The Way AI is Destroying a Markup Language

So originally, I was going to make a devlog video about a tool I built for the Godot game engine that creates a navigation structure closer to what blind people expect from a screenreader - the software blind people use to navigate UI elements on a computer or phone. I decided against it, since it felt a bit redundant for my channel and frankly I didn’t really have the energy for it. I’ve included a link to it in the description if you’d like to read it, though. When I originally made this prototype, I used the computer’s native TTS as an initial base, but I knew I had to incorporate at least NVDA support. NVDA is an open source screenreader that is also the most popular screenreader on Windows. Thanks to contributions by NightBomb, I was able to get NVDA support working in my tool.

However, I noticed that my implementation of his work had some limitations compared to the OS TTS. Specifically, Godot’s TTS calls let you change voices, which is important for cross-language settings, as well as parameters like speed, pitch and volume. I later decided that letting developers modify pitch and volume could be problematic for players who are hard of hearing, but language and speed were still desirable. In particular, I wanted to build a flashcard app for practicing French, and a TTS voice would give me more practice without needing recorded voice samples. Sounding like a robot in French isn’t the worst thing anyways, I mean I’m already learning it from how Felix Guattari writes.
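For a sense of the knobs involved: Godot exposes these as parameters on its TTS calls, but here’s the same idea sketched in Python using the third-party pyttsx3 wrapper around the native OS synthesizer - my substitution purely for illustration, not what my tool actually calls (and pitch control depends on the driver, so it’s omitted):

```python
# Roughly the same knobs Godot's TTS calls expose, here via pyttsx3,
# a Python wrapper over the native OS synthesizer (e.g. SAPI5 on Windows).
import pyttsx3

engine = pyttsx3.init()
voices = engine.getProperty("voices")      # the OS's installed voices
engine.setProperty("voice", voices[0].id)  # pick one, e.g. a French voice
engine.setProperty("rate", 150)            # speaking speed, words per minute
engine.setProperty("volume", 0.8)          # 0.0 to 1.0
engine.say("Bonjour, le chat.")
engine.runAndWait()
```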

Anyways, I started to research how to do this. It turns out the most reasonable solution in NVDA was something called SSML, or Speech Synthesis Markup Language. See, in HTML you can easily switch the language a string is read in by using a lang attribute, but because you are working more directly with TTS and screenreaders than in web development, you have to find your own way to do this. SSML was created to give speech synthesizers a standard set of special tags for modifying speech, and engines like SAPI5 adopted it. SSML offered a reasonable solution and was said to be supported by NVDA. Supposedly, Godot also supports SSML, but I was never able to get it to work.
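On paper it’s elegant. Here’s a minimal sketch of a language switch as the W3C spec describes it - the <lang> element (added in SSML 1.1) being the closest analogue to HTML’s lang attribute; the flashcard sentence is my own toy example:

```python
# A language switch as the W3C SSML spec describes it: the document has a
# default language, and <lang> scopes a different one to a span of text.
ssml = (
    '<speak version="1.1"'
    ' xmlns="http://www.w3.org/2001/10/synthesis"'
    ' xml:lang="en-US">'
    'The French word for cat is '
    '<lang xml:lang="fr-FR">chat</lang>.'
    '</speak>'
)
```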

I figured that, being a standard documented by the W3C, SSML would be easy to pick up. This could not be further from the truth.

First off, as I found out by looking deep into the code, NVDA’s implementation of SSML is custom-made and separate from any other SSML engine. This is actually a good call, because it means NVDA can control how speech is synthesized based on its own settings and let users override potentially problematic SSML synthesis, such as disabling voice switching from automatic language detection. However, since this functionality wasn’t documented, it was a bit annoying to get working, and I had to dig through the source code just to get a language switch going.
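From my digging, this is roughly what NVDA’s SSML handling reduces to internally: the markup gets parsed into a “speech sequence” of plain strings interleaved with command objects. A minimal sketch based on my reading of the source, which only runs inside NVDA itself (say, from an add-on or its Python console), since speech is NVDA’s own module:

```python
# A rough sketch of NVDA's internal speech representation: SSML gets
# parsed down into a flat sequence of strings and command objects.
from speech import speak
from speech.commands import LangChangeCommand

speak([
    "The French word for cat is ",
    LangChangeCommand("fr_FR"),  # request a synth language switch
    "chat",
    LangChangeCommand(None),     # fall back to the default language
])
```

Because everything funnels through these command objects, NVDA’s own settings get the final say - as far as I can tell, if the user has automatic language switching disabled, the language change is simply dropped.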

Much more annoying, though, was that while researching implementation examples of SSML, I discovered that every single implementation of SSML is different, and built for use cases unrelated to just-in-time (JIT) text to speech like the voices used for accessibility. Google, for example, approaches language and voice completely differently than Microsoft does, and Microsoft’s own cloud syntax doesn’t even match the initial syntax used by its original speech API, SAPI - and this appears to be because of the market dynamics surrounding AI. Let’s examine Microsoft as an example.
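To make the fragmentation concrete, here’s the same flashcard sentence marked up two ways, sketched from my reading of the respective docs: first in classic SAPI 5 XML, which isn’t SSML at all (the language is a hexadecimal LCID), then as Microsoft’s Azure cloud service expects it, where everything must be wrapped in a <voice> element naming one of their neural voices (JennyNeural is just one example name). Google’s cloud API, for contrast, mostly selects the voice in the request body rather than in the markup at all:

```python
# The same sentence in two "SSML"s. First: classic SAPI 5 XML - not SSML,
# the language is a hex LCID (0x40C = French). Second: Azure's dialect,
# which requires the body wrapped in a <voice> naming a Microsoft voice.
sapi5_xml = (
    'The French word for cat is '
    '<lang langid="40C">chat</lang>.'
)

azure_ssml = (
    '<speak version="1.0"'
    ' xmlns="http://www.w3.org/2001/10/synthesis"'
    ' xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">'
    'The French word for cat is '
    '<lang xml:lang="fr-FR">chat</lang>.'
    '</voice>'
    '</speak>'
)
```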

Microsoft maintains separate SSML documentation for its Azure cloud services, for producing AI-generated text to speech trained on your own voice samples, and for an extremely limited-access embedded speech offering, which is the only one that doesn’t require an internet connection. This should raise red flags for any developer concerned about user control, because it marks yet another subtle way Microsoft is encouraging an always-connected approach to design, pushing these AI voices onto their own hardware and away from user control. It also demonstrates not only that the implementation isn’t shared with other implementations, but that the differences between them are driven by desires in the marketplace, not usability. These services are designed to produce voice samples from text to speech that sound like a human, with a wide range of expressions, to compete in the constantly growing AI space. I don’t know much about how this service is being used in industry right now, but it is extremely frustrating to see something literally defined in a coding standard completely ignored because of so-called “revolutions” in AI… more like revolutions in ripping people off and damaging the consistency of markup languages!

Of course, it is unreasonable to assume developers should be forced to work within the full confines of a markup language, and this presents a problem for the standardization of code implementations in the first place. We can analyze this problem with SSML through the territories surrounding it. SSML launched to be used largely with engines like SAPI5, becoming a W3C Recommendation in 2004, with version 1.1 following in 2010. This meant that the territory of what “SSML” was stayed very stable and well understood, and other implementations such as eSpeak could follow the same standard, keeping SSML consistent.

However, as time went on, the social machines surrounding SSML changed. Markets transformed; people started to see text to speech as more than just an accessibility tool or aesthetic choice, but as a way to emulate human voices. Google produced more and more desirable-sounding voices, and with more advanced voices came new means of trying to control them. Microsoft and other companies countered with new speech implementations of their own. You can really see the migration of motives in the paper trail: SAPI5’s documentation was last written over ten years ago, while the documentation for AI speech synthesis is far more recent and well-developed. As a result, SSML was no longer just one thing but many things - Microsoft’s implementation, Google’s implementation, NVDA’s implementation - and thus it became “deterritorialized”. And at the same time, it is instantly “reterritorialized” into all these different implementations, each competing for marketplace dominance in a field that places little to no value on previous TTS implementations. What SSML really is now is the rough idea of what SSML “should” be, a black box of an idea. The details are completely different in every implementation.

This is an excellent example of why standardizing code implementation is so difficult, and why forcing developers to confine themselves to a series of expectations is not possible. Even within more controlled environments such as iOS and web browsers, developers “hack” around these territories, creating ever more complicated environments that can interfere with settings, configurations, and accessibility software, or even create unpredictable security vulnerabilities. And what is a developer to do in such a “deterritorialized” environment? Is returning to the authority of standards really feasible if they keep being broken down faster than standards committees can keep up - committees directly composed of largely corporate interests?

Of course, this is not to say that standards don’t have value - I’m not a code anarchist. Believe me, I did my time and I have learned to value some structure! But at the same time, we also cannot design interfaces assuming that all standards are followed. Code standards can only provide a “reterritorializing” effect, cleaning up some of the mess by recapturing radical experimentation in development into standardized codes agreed upon within the industry. It’s not that standards are just stupid authoritarian structures that tell us how to do everything, but rather that what developers desire to create will always squeeze outside of them, little by little, no matter how rigorous the standards are.

Instead, we have to approach these questions in a vein similar to computer security - a field that uses standards not as a set of rules but as a tool for analysis, and understands deeply how those standards get toyed with and violated to achieve a hacker’s goals. A key aspect of designing secure interfaces is designing machines that prevent external flows from penetrating their surface, with logical gates that rule out exploitation. A system that cannot logically allow access to the database from a frontend-facing command cannot be penetrated by that vector. In a similar vein, NVDA’s SSML interface produces a surface that keeps control of the text to speech secured and contained within the screenreader, preventing developers from creating inaccessible or undesirable interfaces.
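To make that “logical gate” idea concrete with a stock example from outside TTS (my own illustration, nothing to do with NVDA’s code): a parameterized database query is exactly this kind of surface. User input can only ever arrive as data, never as SQL, so injection through that vector isn’t just discouraged - it’s structurally impossible:

```python
# A classic injection attempt bounced off a parameterized query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"  # tries to smuggle SQL in as a name
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?",  # placeholder: input stays data
    (user_input,),
).fetchall()
print(rows)  # [] - the payload was matched as a literal string, not executed
```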

What fascinates me to no end about this problem is that, with both security and inaccessible interfaces, we are analyzing the scope of possibilities to reduce penetration attempts. Where penetration actually occurs is where the language of the machine was left unenunciated. This does not merely mean the code itself - but the junction of social, political, economic and ecological machines intersecting at that very moment in space and time. It is a common error to believe that “bad code” is the only reason security or core functional parts of code fail; really, it’s always the codes the developer could never see that produce these conditions. After all, most security breaches start with social engineering - phishing credentials out of bored and tired employees - and politically charged social engineering is a critical aspect of cyber warfare.

Anyways, I still need to complete my implementation. But I had to share this madness with the world. It represents a problem I’ve been trying to explain for a while - one that happens constantly as software design integrates with market dynamics, and one that isn’t really discussed much outside of adapting standards to fit new market developments.

posted on 12:39:05 PM, 01/11/25 filed under: tech