Aug 15, 2024

Role Play Language Learning with OpenAI's Whisper

Language learning has always been a challenge, and books and gamified apps can only get you so far. At some point you’re going to need to have a conversation. Speaking the language with native speakers gives one a leg up in becoming fluent, but not everyone has one of those lying around or can afford to hire one.

In this month’s OpenAI Application Explorers Meetup, Godfrey Nolan, President of RIIS LLC, demonstrated how you could use OpenAI’s Whisper and Gradio to build a role-playing language learning application. Short of hiring your own theater troupe, it’s probably the best low-stakes emulation of the real-world conversations you will need to practice to gain fluency.

You can follow along with the video or read the tutorial below, and you’ll be one step closer to your dream of language fluency.

Introducing OpenAI’s Whisper

Whisper, part of OpenAI’s suite of tools, is a powerful speech recognition model that handles transcription, translation into English, and language identification. It can turn audio files in many languages into text. This technology forms the backbone of the app we’ll be building today.

Understanding Whisper’s Capabilities

Whisper and OpenAI’s audio API offer several key functionalities:

Transcription: Whisper can convert speech to text accurately.
Translation: It can translate audio from other languages into English.
Text-to-Speech: The companion tts-1 model, available through the same audio API, can generate spoken audio from text input.
Multiple Language Support: It supports dozens of languages, from Afrikaans to Vietnamese.

Transcription

Here’s a basic example of using Whisper for transcription:

import openai
import config  # local config.py that defines OPENAI_API_KEY

openai.api_key = config.OPENAI_API_KEY

media_file_path = 'SteveJobsSpeech_64kb.mp3'

def transcribe_audio(audio_file):
    # Open the audio in binary mode and send it to the Whisper transcription endpoint
    with open(audio_file, "rb") as file:
        transcription = openai.audio.transcriptions.create(
            model="whisper-1",
            file=file
        )
    return transcription.text

print(transcribe_audio(media_file_path))

This function takes an audio file as input and returns the transcribed text using Whisper’s transcription capabilities. It uses a recording of a famous Steve Jobs speech. If you were to run it, you’d get the text of the speech back.

Text-to-Speech

OpenAI’s audio API also provides text-to-speech functionality through its tts-1 model:

from openai import OpenAI

client = OpenAI()

speech_file = 'french.mp3'

# Ask the tts-1 model to speak the French sentence with the "nova" voice
response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Le rapide renard brun sauta par-dessus le chien paresseux"
)

# Save the generated audio to french.mp3
response.stream_to_file(speech_file)

This snippet generates an audio file, french.mp3, of the input text being spoken.

Translation

The final piece of the puzzle is Whisper’s translation functionality, which turns speech in another language into English text.

from openai import OpenAI

client = OpenAI()

french_file = open('./french.mp3', 'rb')

response = client.audio.translations.create(
    model="whisper-1",
    file=french_file
)

print(response.text)

The resulting response of this should be “The quick brown fox jumped over the lazy dog.”

Building a Language Learning System

By combining ChatGPT’s conversational abilities with Whisper’s speech recognition and OpenAI’s text-to-speech model, we can create a powerful language learning tool. The basic flow of such a system would be:

The learner speaks a phrase in their target language.
Whisper transcribes and translates this input to English.
ChatGPT generates an appropriate response based on the context.
The response is translated back to the target language.
OpenAI’s text-to-speech model converts this text response to speech, which is played back to the learner.
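
The flow above can be sketched as a single function whose stages are passed in as callables. This is a minimal sketch under stated assumptions, not the implementation from the talk: run_turn and the parameter names translate_audio, reply, to_target and speak are hypothetical, and the stand-in lambdas only mimic the API calls so the control flow can be run offline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_turn(
    audio_path: str,
    history: List[Message],
    translate_audio: Callable[[str], str],   # step 2: audio -> English text
    reply: Callable[[List[Message]], str],   # step 3: history -> English reply
    to_target: Callable[[str], str],         # step 4: English -> target language
    speak: Callable[[str], None],            # step 5: text -> spoken audio
) -> str:
    """Run one learner turn through the flow (hypothetical helper)."""
    english_in = translate_audio(audio_path)
    history.append({"role": "user", "content": english_in})

    english_out = reply(history)
    history.append({"role": "assistant", "content": english_out})

    target_out = to_target(english_out)
    speak(target_out)
    return target_out


# Offline stand-ins for the real API calls, for illustration only.
spoken: List[str] = []
history: List[Message] = [{"role": "system", "content": "You are a hotel receptionist."}]
result = run_turn(
    "learner.wav",
    history,
    translate_audio=lambda path: "I would like to check in.",
    reply=lambda msgs: "Welcome! May I have your name?",
    to_target=lambda text: "Bienvenue ! Puis-je avoir votre nom ?",
    speak=spoken.append,
)
print(result)
```

Keeping the stages as parameters makes it easy to swap a stub for a real API call one step at a time while you build the app.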

Implementing the System

To implement this system, we’ll use Gradio, a Python library for creating simple web interfaces for machine learning models. The interface it produces is quite svelte for such an easy-to-implement framework.

If you haven’t already, use pip to install Gradio.

pip install gradio

Also, let’s install PyAudio, which, if you can’t already tell by the name, will be pretty helpful.

pip install PyAudio

Create a file and name it app.py. Then import the following at the top:

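import openai
import config  # local config.py that defines OPENAI_API_KEY
import pyaudio

openai.api_key = config.OPENAI_API_KEY
p = pyaudio.PyAudio()  # shared PyAudio instance used for playback in Step 4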

The next thing we are going to do is create a msgs list variable. We are going to pre-load it with our first item, the system directive. This puts up some guard rails for how the model interprets and responds to our user prompt, which comes later.

msgs=[
        {"role": "system", "content": "You are a hotel receptionist, your job is to help customers check in to their room. \
        You can only respond when a customer asks you a question. Do not start the conversation. Always wait for a question.\
        The customer does not speak your language, so keep your responses simple and clear."},
    ]

You’ll notice our role for this is set to system. Our next role and content pair will be the ‘user’, and the prompt.
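
If you want to rehearse a different scene, the same guard rails can be reused with another persona. The sketch below is only an illustration: scenario_messages and the cafe scenario are hypothetical and not part of the original app, though the wording of the directive follows the hotel example above.

```python
from typing import Dict, List


def scenario_messages(persona: str, task: str) -> List[Dict[str, str]]:
    """Build the initial message list for a role-play scenario (hypothetical helper)."""
    directive = (
        f"You are a {persona}, your job is to {task}. "
        "You can only respond when a customer asks you a question. "
        "Do not start the conversation. Always wait for a question. "
        "The customer does not speak your language, so keep your responses simple and clear."
    )
    return [{"role": "system", "content": directive}]


# Same guard rails as the hotel example, with a different (assumed) scenario.
cafe_msgs = scenario_messages("waiter in a cafe", "help customers order food and drinks")
print(cafe_msgs[0]["content"])
```

The returned list can stand in for msgs, since it has the same shape: a single system message that later user and assistant turns are appended to.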

Now let’s go step-by-step using the Whisper features we learned earlier to implement our role-playing language coach.

def transcribe(audio):
    global msgs

    media_file = open(audio, "rb")

    # Step 1 - read in the French audio and convert it into English
    translation = openai.audio.translations.create(
        model="whisper-1",
        file=media_file,
    )
    media_file.close()

    # Ignore very short clips (silence or noise) so they never reach the chat history
    if len(translation.text) > 10:
        msgs.append({"role": "user", "content": translation.text})
        print(translation.text)

        # Step 2 - use the English text to generate a response
        response2 = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=msgs
        )
        msgs.append({"role": "assistant", "content": response2.choices[0].message.content})
        print("step2: ", msgs)

        # Step 3 - convert the response into French text
        response3 = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                    {
                        "role": "user",
                        "content": "Translate the following text into French: " + response2.choices[0].message.content
                    }
            ] )
        print("step3: ", response3.choices[0].message.content)

In this code, the transcribe function takes us on a world tour of OpenAI's speech and chat APIs. First, it opens the audio file and uses the Whisper model to translate the French audio into English text. If the text is longer than 10 characters, it adds this text to a global list called msgs and prints it out for us. Then it generates a response with the gpt-3.5-turbo model, adds that to msgs, and prints it. Step 3 asks the same model to translate the English response into French.
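
Step 3 hard-codes French in its prompt. As a rough sketch of how that prompt could be parameterized for other target languages, the helper below builds the same one-message payload. The name translation_request and its target_language parameter are hypothetical, and the sample sentence is made up for illustration.

```python
from typing import Dict, List


def translation_request(text: str, target_language: str = "French") -> List[Dict[str, str]]:
    """Build the messages payload for a one-off translation prompt (hypothetical helper)."""
    prompt = f"Translate the following text into {target_language}: {text}"
    return [{"role": "user", "content": prompt}]


request = translation_request("Welcome! May I have your name?", "Spanish")
print(request[0]["content"])
```

The result could be passed as the messages argument in Step 3. Because the translation endpoint used in Step 1 always produces English, Step 3 is likely the main place the target language needs to change, though that has not been tested here.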

We’ve got two more steps to go, but we need to take a bit of a pause because Step 4 is one of the more complicated steps.

This is where we take the French text from Step 3 and turn it into audio using the tts-1 text-to-speech model, streaming the audio output as it arrives. Our PyAudio library comes in handy here. You’ll notice in the code we had to set up some parameters to make this happen: a mono, 24 kHz output stream to match the raw PCM audio we request from the model. Step 5 is rather easy. We just update our text on screen with the response.

        # Step 4 - convert the French text to audio and stream the response
        stream = p.open(format=pyaudio.paInt16,  # 16-bit PCM (the constant equals 8)
                    channels=1,
                    rate=24_000,
                    output=True)

        with openai.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=response3.choices[0].message.content,
            response_format="pcm"
        ) as response:
            # Play each chunk as soon as it arrives instead of waiting for the whole file
            for chunk in response.iter_bytes(1024):
                stream.write(chunk)

        p.close(stream)

        # step 5 - update the text on the screen
        chat = response3.choices[0].message.content
    else:
        chat = ''
    return chat
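
To get a feel for the stream parameters in Step 4, here is a small back-of-the-envelope calculation. It assumes the audio is 16-bit, mono, 24 kHz PCM, which is what the stream setup above asks PyAudio to play (format value 8 is PyAudio's 16-bit integer format). The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # matches rate=24_000 in the stream setup
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def chunk_duration_ms(chunk_bytes: int) -> float:
    """Milliseconds of audio held in a chunk of raw PCM bytes."""
    frames = chunk_bytes / (BYTES_PER_SAMPLE * CHANNELS)
    return 1000 * frames / SAMPLE_RATE_HZ


def bytes_per_second() -> int:
    """Raw PCM bytes needed for one second of playback."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


print(round(chunk_duration_ms(1024), 2))  # 21.33
print(bytes_per_second())                 # 48000
```

Under these assumptions each 1024-byte chunk holds roughly 21 ms of speech, which is why playback can start almost as soon as the first chunk arrives.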

Now, to take advantage of that Gradio install from earlier, we are going to create a simple interface for this system:

import gradio as gr
gr.Interface(
    fn=transcribe,
    live=True,
    inputs=gr.Audio(sources="microphone", type="filepath", streaming=True),
    outputs="text",
).launch()

Alright, let's dive into what this snippet is doing! Here, we're using Gradio to create an interface for our transcribe function. The gr.Interface is set up to call the transcribe function whenever it receives input. We've got it configured to take audio input directly from a microphone, with the recorded audio handed to the function as a file path. The output is then displayed as text. By setting live=True, we're making sure that the interface processes the audio input as it comes in rather than waiting for a submit button. Finally, the .launch() method kicks everything off, opening up the interface so users can start interacting with it.

Basically, this creates a web interface where users can speak into their microphone and receive text responses in real-time. Gradio is super powerful and this is only one of the things it does, so check it out when you have time.

Conclusion

Throughout this guide, you learned how to create an interactive language learning tool using OpenAI's Whisper, text-to-speech, and ChatGPT models, combined with Gradio for a user-friendly interface. You explored how to transcribe and translate spoken language, generate responses, and convert text back into speech, all while understanding the integration of these technologies. By implementing these steps, you gained practical experience in building a real-time application that allows users to practice language skills by speaking into a microphone and receiving spoken and text responses. On top of all that, you can now build upon these principles to create a fully customizable language coach!