Nov 5, 2024

Using ChatGPT to Make a Cyberdog Talk

Control an AI-powered Cyberdog that assists first responders, using ChatGPT for triage in emergencies.

In the rapidly evolving landscape of artificial intelligence and robotics, innovative applications are emerging that have the potential to revolutionize various industries. One such application, presented recently by Godfrey Nolan, President of RIIS, at the OpenAI Applications Explorers meetup, showcases the integration of OpenAI’s ChatGPT with a Xiaomi Cyberdog to create a system capable of assisting first responders in emergency situations.

Introduction

Today’s project aims to utilize a Cyberdog equipped with ChatGPT capabilities to interview injured individuals in catastrophic events, providing crucial information to medical personnel before they arrive on the scene.

Why Use AI and Robotics?

The implementation of AI and robotics in emergency response situations presents several advantages:

  1. Scalability: Multiple Cyberdogs can assess numerous people simultaneously, increasing the efficiency of triage operations.

  2. Safety: By sending robots into potentially hazardous areas, the risk to human responders is significantly reduced.

  3. Efficiency: The system enables faster medical response times, crucial in life-threatening situations.

  4. Resilience: Cyberdogs demonstrate impressive adaptability to challenging terrains and obstacles.

  5. Accuracy: AI-powered analysis can provide consistent and emotion-free evaluations of patient conditions.

The First Responder Challenge

First responders face numerous challenges in their line of work:

  • Immediate needs and urgent calls require quick decision-making and action.

  • Critical assessments must be made under pressure and with limited information.

  • Responders are often stretched thin, handling multiple emergencies simultaneously.

  • The need for rapid triage and response is paramount in saving lives.

Combining Robotics and AI for Triage

Having a Cyberdog powered by OpenAI technology can potentially revolutionize emergency response by navigating hazardous areas inaccessible to humans, collecting crucial audio-visual data, and communicating with injured individuals through pre-recorded questions. In this tutorial we’re going to utilize OpenAI's Whisper API for audio transcription and ChatGPT for analysis to enable rapid assessment of victims' conditions, allowing for efficient prioritization of medical assistance.

Why the Cyberdog, and why not Spot?

The Xiaomi Cyberdog 1.0 used in this project is a quadruped robot powered by NVIDIA’s Jetson Xavier AI platform. Key features include:

  • Price point of $2,500, significantly more affordable than alternatives like Boston Dynamics’ Spot ($75,000)

  • Maximum speed of 3.2 m/s

  • Multiple sensors including cameras, GPS, and an ultra-wide angle fisheye lens

  • Intel RealSense camera for depth perception

  • Compatibility with ROS 2 (Robot Operating System)

The Cyberdog’s balance of affordability and capability makes it an ideal platform for research and development in AI-assisted emergency response; with Spot priced somewhere around $75,000, the choice is easy for this research experiment.

Integration with OpenAI

Here’s the flow. First, we use OpenAI’s Text-to-Speech API to have the Cyberdog ask our first aid questions. Since the Cyberdog is essentially a walking Unix box with lots of sensors, we use it to collect our data, both visual and audio. Once we have our data, the patient’s audio is transcribed with the Whisper API (speech-to-text, the reverse of the first step). ChatGPT then analyzes the text to determine the person’s condition, and those who need immediate help are flagged for the operator.

Our pre-recorded questions aren’t recorded in the traditional sense. Instead, we use OpenAI’s Text-to-Speech API to turn a list of pre-written questions into audio files, which we then store on the Cyberdog for the injured person to hear.

The simple Python script below shows how you can send your text to the Text-to-Speech API and have it respond with a finished audio file.

import openai
import config

openai.api_key = config.OPENAI_API_KEY

speech_file = "recording1.wav"

# Stream the generated speech straight to a WAV file on disk
with openai.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    response_format="wav",  # the API defaults to mp3; request WAV so aplay can play it
    input="Are you having trouble breathing?") as response:
    response.stream_to_file(speech_file)
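
If you want to sanity-check the generated file before it ever reaches the dog, you can play it locally, for example with ALSA's aplay on a Linux machine (the same player the Cyberdog's playback script uses later on):

aplay recording1.wav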

Getting data on and off the dog

The Cyberdog wasn’t built for this use, so we have to take advantage of the available ports. Of the HDMI, charger, sensor-extension, and download ports, we are going to use the download port. We connect to it, then issue a simple Unix scp command to copy over the necessary files. SCP (Secure Copy) is a tool used in Unix-like operating systems to securely transfer files between hosts on a network.
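
As a rough sketch, copying the generated question files onto the dog over that connection might look like the following; the username, IP address, and destination directory here are placeholders, so substitute the details of your own Cyberdog:

# Copy the pre-generated question audio onto the Cyberdog (placeholder host and path)
scp recording*.wav mi@192.168.55.1:/home/mi/questions/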

Next, to capture and stream video and audio from the Cyberdog, we set up a Real-Time Streaming Protocol (RTSP) server:

  1. Setting up RTSP Feed: A Real-Time Streaming Protocol server is established on the Cyberdog to transmit video and audio data.

  2. Data Capture: An application running on the operator’s laptop captures the RTSP feed, recording the audio for further analysis.

  3. Transcription and Analysis: The recorded audio is transcribed using OpenAI’s Whisper API and then analyzed using ChatGPT to summarize the patient’s condition.

This integration allows the Cyberdog to function as an intelligent first responder, capable of gathering and analyzing critical information in emergency situations.

Cyberdog Questions

The system uses a set of pre-defined questions to assess the patient’s condition:

  • Breathing: Are you having trouble breathing?

  • Bleeding: Is there a lot of blood? Where is the blood?

  • Pain: Where does it hurt?

  • Movement: Can you move your arms and legs?

These questions are designed to quickly identify critical issues that may require immediate medical attention. The responses to these questions are analyzed by the AI to determine the severity of the patient’s condition and prioritize care.

Push Questions to Cyberdog

To enable the Cyberdog to communicate with injured individuals, pre-recorded questions are pushed to the robot’s onboard storage. This process involves using the OpenAI API to generate audio files of the questions, which are then transferred to the Cyberdog.
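
A minimal sketch of that generation step is shown below. It loops over the questions from the table above and writes one WAV file per question, using the filenames the playback script later in this post expects; note that the exact file-to-question mapping is an assumption here, not something taken from the project code.

import openai
import config

openai.api_key = config.OPENAI_API_KEY

# Questions from the table above, keyed by the filenames the playback
# script expects (the file-to-question mapping is assumed)
questions = {
    "recording1.wav": "Are you having trouble breathing?",
    "recording2.wav": "Is there a lot of blood?",
    "recording2a.wav": "Where is the blood?",
    "recording3.wav": "Where does it hurt?",
    "recording4.wav": "Can you move your arms and legs?",
}

for filename, question in questions.items():
    with openai.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        response_format="wav",  # WAV so aplay on the Cyberdog can play it
        input=question) as response:
        response.stream_to_file(filename)
    print(f"Wrote {filename}")

From there, the files can be copied onto the dog with scp as described earlier.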

RTSP Streaming

To enable real-time video and audio streaming from the Cyberdog, a Real-Time Streaming Protocol (RTSP) server is set up on the robot. RTSP is a network protocol designed for use in entertainment and communications systems to control streaming media servers.

The following Python script sets up an RTSP server on the Cyberdog:

import gi
gi.require_version('Gst', '1.0')
gi.require_version('GstRtspServer', '1.0')
gi.require_version('GLib', '2.0')
from gi.repository import Gst, GstRtspServer, GLib

VIDEO_DEVICE = '/dev/video0'  # Change to your video device
AUDIO_DEVICE = 'default'       # Change to your audio device
SERVER_IP = '10.5.2.31'     # Change to your IP address
PORT = 8554                   # RTSP server port

class VideoStreamServer(GstRtspServer.RTSPMediaFactory):
    def __init__(self):
        super(VideoStreamServer, self).__init__()
        self.set_shared(True)
    def do_create_element(self, url):
        pipeline_str = (
            f"v4l2src device={VIDEO_DEVICE} ! videoconvert ! video/x-raw,format=I420 ! "
            f"x264enc tune=zerolatency bitrate=500 speed-preset=ultrafast ! rtph264pay name=pay0 pt=96 "            
            f"alsasrc device={AUDIO_DEVICE} ! audioconvert ! audioresample ! opusenc ! rtpopuspay name=pay1 pt=97"        
        )
        return Gst.parse_launch(pipeline_str)
        
class Server:
    def __init__(self):
        Gst.init(None)
        self.server = GstRtspServer.RTSPServer()
        self.server.props.service = str(PORT)
        self.server.set_address(SERVER_IP)
        
        factory = VideoStreamServer()
        self.server.get_mount_points().add_factory("/video_stream", factory)
        self.server.attach(None)
        
    def run(self):
        print(f"RTSP server is running at rtsp://{SERVER_IP}:{PORT}/video_stream")
        loop = GLib.MainLoop()
        loop.run()
        
if __name__ == "__main__":
    server = Server()
    server.run()

This script creates an RTSP server that streams video and audio from the Cyberdog’s camera and microphone. The stream can be accessed by connecting to the specified IP address and port; note that you will have to change the IP address, video device, and audio device to your own. The script also allows multiple clients to connect to the stream simultaneously, making it suitable for real-time audio/video streaming applications. When you run it, it prints the URL where the stream can be accessed.
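
To check the feed from the operator's laptop, any RTSP-capable player will do; for example, with FFmpeg installed you can open the stream with ffplay, substituting the address your server printed:

ffplay -rtsp_transport tcp rtsp://10.5.2.31:8554/video_stream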

Data Capture Overview

The data capture process involves recording the audio and video feed from the Cyberdog for further analysis. This is crucial for the AI system to process the responses from injured individuals and assess their condition.

Data Capture Steps

To play the recordings to the patient, we implemented a simple bash script.

#!/bin/bash

wav_files=(
	"recording1.wav"
	"recording2.wav"
	"recording2a.wav"
	"recording3.wav"
	"recording4.wav"
)

play_audio() {
	local file_path="$1"
	echo "Playing: $file_path"
	aplay "$file_path"
}

play_audio "${wav_files[0]}"
sleep 10

play_audio "${wav_files[1]}"
sleep 10

play_audio "${wav_files[2]}"
sleep 10

play_audio "${wav_files[3]}"
sleep 10

play_audio "${wav_files[4]}"
sleep 10

echo "Script finished."

All this is doing is taking our recorded questions and issuing them at 10-second intervals. In a perfect world you would have the dog wait for each response before moving on, but playing the questions on a fixed schedule reduces the variables that could lead to failure.
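
If you would rather not repeat the play/sleep pair for every file, a functionally equivalent version can loop over the array:

#!/bin/bash

wav_files=(
	"recording1.wav"
	"recording2.wav"
	"recording2a.wav"
	"recording3.wav"
	"recording4.wav"
)

# Play each question, then pause to give the patient time to answer
for file in "${wav_files[@]}"; do
	echo "Playing: $file"
	aplay "$file"
	sleep 10
done

echo "Script finished."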

So that covers the prompts to the patient. Now we have to capture the data from them.

The next thing to do is check that you have FFmpeg installed; it’s going to be important for processing our audio and video streams.
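
A quick way to check on the operator's laptop, and to install it on a Debian/Ubuntu system if it's missing:

ffmpeg -version
# If the command isn't found, on Debian/Ubuntu:
sudo apt-get install ffmpeg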

We’ll implement these steps in a Python script that runs on the operator’s laptop. The script connects to the RTSP stream provided by the Cyberdog and records the incoming audio data. This is what it looks like:

import subprocess
import soundfile as sf
import config as config
import numpy as np
import threading
import signal

sample_rate = 44100
channels = 1
RTSP_URL = "rtsp://127.0.0.1:8554/video_stream"  # Change to your stream IP
audio_filename = 'recording.wav'
audio_file = None
process = None
stop_flag = threading.Event()

def capture_audio_stream():
    global process
    try:
        process = subprocess.Popen(
            ['ffmpeg', '-rtsp_transport', 'tcp', '-i', RTSP_URL, '-vn', 
            '-acodec', 'pcm_s16le', '-ar', '44100', '-ac', '1',
             '-f', 'wav', 'pipe:1', '-loglevel', 'error', '-buffer_size', 
             '1000000', '-tune', 'zerolatency'],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        return process
    except Exception as e:
        print(f"Error initializing FFmpeg: {e}")
        return None
        
def start_recording():
    global audio_file, process
    try:
        process = capture_audio_stream()
        if process is None or process.stdout is None:
            return     
               
        print("Recording started...")
        
        audio_file = sf.SoundFile(audio_filename, mode='w', samplerate=sample_rate, channels=channels)
        
        while not stop_flag.is_set():
            audio_data = process.stdout.read(16384)
            if not audio_data:
                break            
                
            try:
                audio_array = np.frombuffer(audio_data, dtype=np.int16)
                if audio_file is not None:
                    audio_file.write(audio_array)
            except Exception as e:
                break    
    except Exception as e:
        print(f"An error occurred during recording: {e}")
    finally:
        if audio_file is not None:
            audio_file.close()
            audio_file = None            
            print("Audio file closed.")
            
def handle_exit_signal(signum, frame):
    global stop_flag
    stop_flag.set()
    
    if process is not None:
        process.terminate()
        process.kill()
        
def main():
    signal.signal(signal.SIGTERM, handle_exit_signal)
    signal.signal(signal.SIGINT, handle_exit_signal)
    recording_thread = threading.Thread(target=start_recording)
    recording_thread.start()
    
    try:
        recording_thread.join()
    except KeyboardInterrupt:
        handle_exit_signal(None, None)

if __name__ == "__main__":
    main()

In the above code, the capture_audio_stream function uses FFmpeg to capture the audio stream from the RTSP URL. It sets up FFmpeg to output raw audio data to stdout, which we'll process later. Our start_recording function then starts the recording process: it captures audio data in chunks, converts it to a NumPy array, and writes it to a WAV file using the soundfile library. The handle_exit_signal function ensures the recording terminates gracefully when the process receives a termination signal. Finally, the main() function sets up the signal handlers, starts the recording in a separate thread, and waits for the recording to finish or for a keyboard interrupt.
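
Assuming you saved the capture script as capture_audio.py (the filename is arbitrary), install its dependencies and run it while the Cyberdog is working through its questions, then stop it once the patient has finished answering:

pip install soundfile numpy
python3 capture_audio.py
# Let it run while the questions play, then press Ctrl+C to stop;
# the audio is written to recording.wav next to the script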

Transcribing the Audio

The next block of code we are going to add is a very basic call to the Whisper API with our new captured and processed audio.

import openai
import config

openai.api_key =  config.OPENAI_API_KEY
media_file_path = 'recording_good.wav'
media_file = open(media_file_path, 'rb')

transcription = openai.audio.transcriptions.create(
    model="whisper-1",
    file=media_file,
)

#print(transcription.text)

response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", 
        "content": "You are a helpful first responder who is trying to see how \                            much in need of attention.  Here are the questions we asked the patient: \                            Are you having trouble breathing? Is there a lot of blood? \                            Where is the blood? and Can you move your arms and legs? \                            Summarize the conversation if you can, only use the responses."   
        },
        {
            "role": "user",
            "content": transcription.text,
        }
    ]
)

print(response.choices[0].message.content)

As you can see, there’s not a lot happening. The script loads an audio file, transcribes it, and then uses the AI model to interpret the transcription. Boom! If everything goes as planned, you should get a console output summarizing the patient’s condition.

If you want to troubleshoot and see what has been transcribed from the audio file, just uncomment the #print(transcription.text) line.

Conclusion

The integration of AI and robotics in emergency response situations, as demonstrated by the Cyberdog project, represents a significant advancement in first responder capabilities. By combining the mobility and resilience of the Cyberdog with the analytical power of ChatGPT, this system offers a unique solution to the challenges faced by emergency responders.

By following this guide, you've learned how to set up an RTSP server on the Cyberdog, capture audio data from a remote source, transcribe it using OpenAI's Whisper API, and analyze the content with GPT-3.5-turbo. The system you've built could change how first responders gather critical information in emergency situations, making triage more efficient and potentially saving lives.

If you had fun with this tutorial, be sure to join the OpenAI Application Explorers Meetup Group to learn more about awesome apps you can build with AI.