Add Speech to Form Interactions
A few years ago I wrote a post about how to add speech to form interactions. Inspired by a blog post by Pamela Fox, I want to revisit this strategy and enhance it with additional techniques and technologies.
What is the problem? #
Additional forms of feedback can help improve usability and accessibility. For example, we can use audio feedback when a form field is left empty or when its value is too short. This is especially useful for users with disabilities, such as visual impairments, who may not be able to see the error messages.
We can also use speech recognition to dictate text into form fields. This is useful for users who have difficulty typing or for those who prefer to use their voice to input text.
Finally, we can use AI to improve both speech synthesis and speech recognition: it can help us create more natural-sounding voices and produce more accurate transcriptions.
This post seeks to answer two questions:
- How can we add audio feedback to form interactions?
- How can we add speech recognition to web pages?
Audio Feedback with the Speech Synthesis API #
The example provides audio cues when a field is left empty after the user has focused on it. The code is extensible; we can add more fields and messages as needed.
The first part of the code creates an object containing the error messages we want to make available to the user. The keys are the types of errors and the values are the messages we want to speak.
const errorMessages = {
  usernameEmpty: "The Username field cannot be empty",
  passwordEmpty: "The Password field cannot be empty",
  // add more keys/messages here as needed...
};
The core of the code is the speakError function, which takes a message and a language code as parameters. The function creates a new SpeechSynthesisUtterance object with the message and sets the language. Finally, it calls the speechSynthesis.speak method to play the audio.
function speakError(message, lang = "en-US") {
  const utterance = new SpeechSynthesisUtterance(message);
  utterance.lang = lang;
  speechSynthesis.speak(utterance);
}
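As an aside, the utterance uses the browser's default voice for the requested language. If you want more control over how the messages sound, a variant of speakError could look up a specific voice with speechSynthesis.getVoices(). This is only a sketch: the voice name below is a placeholder, available voices differ per browser and operating system, and getVoices() may return an empty list until the voiceschanged event has fired.
// Variant of speakError that prefers a named voice when it is available.
// "Samantha" is a placeholder; inspect speechSynthesis.getVoices() to see
// which voices your browser actually offers.
function speakErrorWithVoice(message, lang = "en-US", preferredVoice = "Samantha") {
  const utterance = new SpeechSynthesisUtterance(message);
  utterance.lang = lang;
  const voice = speechSynthesis
    .getVoices()
    .find((v) => v.name === preferredVoice);
  if (voice) {
    utterance.voice = voice;
  }
  speechSynthesis.speak(utterance);
}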
validateNotEmpty is a function that checks if the value of an input element is empty. If it is, it sets the border color to red and calls speakError with the appropriate message. If the value is not empty, it sets the border color to black.
Other errors can be added to the errorMessages object, and we can create functions to handle other error types; a sketch of one such function follows the code below.
function validateNotEmpty(element, messageKey) {
  if (element.value.trim().length === 0) {
    element.style.border = "1px solid red";
    speakError(errorMessages[messageKey]);
  } else {
    element.style.border = "1px solid black";
  }
}
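The same pattern covers the "field is too short" case mentioned at the start of the post. The validateMinLength function and the passwordTooShort key below are hypothetical additions, not part of the demo, but they show how little extra code each new check needs.
// Hypothetical extra validator: flag values shorter than a minimum length.
// Assumes a passwordTooShort entry has been added to errorMessages, e.g.
// passwordTooShort: "The Password must be at least 8 characters long"
function validateMinLength(element, minLength, messageKey) {
  if (element.value.trim().length < minLength) {
    element.style.border = "1px solid red";
    speakError(errorMessages[messageKey]);
  } else {
    element.style.border = "1px solid black";
  }
}

// Example wiring: validateMinLength(passwordField, 8, "passwordTooShort");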
We create an object that associates the fields with their corresponding error messages. The id property is the ID of the input element, and the messageKey property is the key in the errorMessages object.
We then loop through the fields and add a blur event listener to each input element. When the user leaves the field, the validateNotEmpty function is called with the element and its corresponding message key.
const fields = [
  { id: "username", messageKey: "usernameEmpty" },
  { id: "password", messageKey: "passwordEmpty" },
];
// Wire up blur listeners once DOM is ready
fields.forEach(({ id, messageKey }) => {
  const el = document.getElementById(id);
  if (!el) return;
  el.addEventListener("blur", () => validateNotEmpty(el, messageKey));
});
// Prevent form submission for demo purposes
const form = document.getElementById("loginForm");
if (form) {
  form.addEventListener("submit", (e) => e.preventDefault());
}
You can see the result of the audio-enhanced form validation in this CodePen demo:
Speech input with the SpeechRecognition API #
The SpeechRecognition API allows us to convert speech into text. This is useful for dictation and other longer-form text input.
This sample application will listen for speech input and display the recognized text in a div element. It will also handle errors and display them in a separate div.
In the first section we capture the browser and microphone capabilities into constants we'll use later.
The code first checks whether the SpeechRecognition API is supported by the browser (unprefixed or with the WebKit prefix). It then checks whether navigator.mediaDevices and navigator.mediaDevices.getUserMedia are available. If either check fails, it displays an error message and throws an error.
const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!SR) {
  document.getElementById('error').textContent =
    '⚠️ SpeechRecognition not supported by this browser.';
  throw new Error('SpeechRecognition not supported');
}

if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
  document.getElementById('error').textContent =
    '⚠️ getUserMedia not supported. Serve over HTTPS or localhost.';
  throw new Error('getUserMedia not supported');
}
In modern browsers, you could do the same thing with the following code that uses the optional-chaining operator:
if (!navigator.mediaDevices?.getUserMedia) {}
But I've decided to keep the more verbose code for compatibility with older browsers.
The requestMicAccess function wraps the getUserMedia call with a timeout. If the user does not respond to the microphone permission prompt within the specified timeout, it rejects the promise with an error message.
/** wrap getUserMedia with a timeout */
function requestMicAccess(timeoutMillis = 10000) {
  return new Promise((resolve, reject) => {
    const id = setTimeout(() => {
      reject(new Error('No response to mic‑permission prompt (timeout)'));
    }, timeoutMillis);
    navigator.mediaDevices.getUserMedia({ audio: true })
      .then(stream => {
        clearTimeout(id);
        resolve(stream);
      })
      .catch(err => {
        clearTimeout(id);
        reject(err);
      });
  });
}
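If you would rather not manage the timer by hand, the same behavior can be sketched with Promise.race, racing getUserMedia against a promise that rejects after the timeout. I'm keeping the explicit version above, but the alternative looks like this:
// Equivalent sketch using Promise.race instead of a manual clearTimeout
function requestMicAccessRace(timeoutMillis = 10000) {
  const timeout = new Promise((_, reject) =>
    setTimeout(
      () => reject(new Error('No response to mic-permission prompt (timeout)')),
      timeoutMillis
    )
  );
  // The loser of the race is simply ignored; the timer keeps running
  // until it fires, but its rejection has no further effect.
  return Promise.race([
    navigator.mediaDevices.getUserMedia({ audio: true }),
    timeout,
  ]);
}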
It then creates a new SpeechRecognition object and sets its properties:
- lang specifies the language for recognition
- interimResults determines whether to return interim results
- maxAlternatives specifies the maximum number of alternative transcriptions to return
const recognition = new SR();
recognition.lang = 'en-US';
recognition.interimResults = false;
recognition.maxAlternatives = 1;
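For longer dictation sessions you could tweak this configuration; the two lines below are a sketch of an alternative setup, not part of the demo.
// Alternative configuration for longer dictation sessions (a sketch)
// recognition.continuous = true;      // keep listening across pauses
// recognition.interimResults = true;  // show words as they are recognized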
The next step is to capture references to the DOM elements we'll use to display results and errors.
const startBtn = document.getElementById('start-btn');
const transcriptDiv = document.getElementById('transcript');
const errorDiv = document.getElementById('error');
The startRecognition function is called when the user clicks the "Start" button. It requests microphone access and starts the speech recognition process. If the user denies access, it displays an error message in the errorDiv element.
async function startRecognition() {
  transcriptDiv.textContent = 'Requesting mic access…';
  errorDiv.textContent = '';
  try {
    const stream = await requestMicAccess();
    transcriptDiv.textContent = 'Listening…';
    recognition.start();
    // stop raw audio tracks so they don’t linger in the background
    stream.getTracks().forEach(t => t.stop());
  } catch (err) {
    transcriptDiv.textContent = '—';
    errorDiv.textContent = `❌ ${err.message}`;
  }
}
The handleResult function is called when the speech recognition service returns a result. It does three things:
- Destructure the results list and the resultIndex of the current result
- From the list of SpeechRecognitionResult objects, pick the one at resultIndex, then grab its first (best-confidence) SpeechRecognitionAlternative at index 0
- Update transcriptDiv with the recognized transcript text
function handleResult(event) {
  const { results, resultIndex } = event;
  const best = results[resultIndex][0];
  transcriptDiv.textContent = best.transcript;
}
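If you set interimResults to true instead, the result event fires repeatedly while the user is still speaking, and the handler has to walk the results list and check each result's isFinal flag. A rough sketch, assuming the same transcriptDiv element:
// Sketch for interimResults = true: rebuild the transcript on every event
function handleInterimResult(event) {
  let finalText = '';
  let interimText = '';
  for (let i = 0; i < event.results.length; i++) {
    const alternative = event.results[i][0];
    if (event.results[i].isFinal) {
      finalText += alternative.transcript;
    } else {
      interimText += alternative.transcript;
    }
  }
  transcriptDiv.textContent = (finalText + interimText).trim();
}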
The last function handles errors. It updates errorDiv with the error message returned by the SpeechRecognition code.
function handleError(event) {
  errorDiv.textContent = `Error: ${event.error}`;
}
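The event.error value is a short machine-readable code such as 'not-allowed', 'no-speech', or 'network', which is not very friendly on its own. handleError could be rewritten with a small lookup table, sketched below with wording of my own choosing, that translates the most common codes before falling back to the raw value:
// Map common SpeechRecognition error codes to friendlier messages
const recognitionErrorMessages = {
  'not-allowed': 'Microphone access was denied.',
  'no-speech': 'No speech was detected. Please try again.',
  'audio-capture': 'No microphone was found.',
  'network': 'A network error interrupted recognition.',
};

function handleError(event) {
  errorDiv.textContent =
    recognitionErrorMessages[event.error] || `Error: ${event.error}`;
}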
Finally, we wire event listeners to the different elements we captured earlier.
startBtn.addEventListener('click', startRecognition);
recognition.addEventListener('result', handleResult);
recognition.addEventListener('error', handleError);
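The recognition object also fires an end event whenever it stops listening, for example after a stretch of silence or after an error. Listening for it is a convenient way to reset the UI; the placeholder text below is just an example.
// Reset the transcript placeholder once recognition stops listening
recognition.addEventListener('end', () => {
  if (transcriptDiv.textContent === 'Listening…') {
    transcriptDiv.textContent = '—';
  }
});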
Conclusion #
Adding speech to form interactions can greatly enhance the user experience, especially for users with disabilities. By using the Speech Synthesis API for audio feedback and the Speech Recognition API for dictation, we can create more accessible and user-friendly web applications without third-party tools and external services.
Both Caniuse and the Web platform features explorer show Speech Recognition as unsupported across browsers, but in my testing the code works as designed in Chrome, Safari hangs when starting dictation, and Firefox does not support the API at all. This should inform whether, and how, you use the Speech Recognition API in your web applications.