A Simple R2-D2 Voice Simulator

Parker King-Fournier

R2-D2

Oftentimes the most recognizable aspects of a favourite movie or television show are the sounds that accompany it. The world of Star Wars is a perfect example: it is home to the distinct sounds of lightsabers, blasters, and various speeding ships that give it its unique brand of originality.

The inspiration to make a seemingly simple simulator for R2-D2 came directly from the importance of the character in my love for the films. Throughout the films, the audience is forced to infer R2-D2's meaning. Clues are often provided by the way the sounds R2 makes mimic human speech. This inspired me to design a program that would take a recorded sound file, either from a user or from a pre-existing file, and produce an R2-D2 style replica of the recorded phrase that is similar in pitch, rhythm and volume.

Goals of the Project

Because of the prevalence of digital audio synthesis techniques used in movie and television soundtracks, this project used many of the techniques and methods for synthesis and analysis that were studied over the past months. The purpose of the project is summarized in the following goals:

Analyze both the audio of a speaker and of R2-D2 using sound analysis techniques learned in class
Gain an understanding of methods of pitch detection
Gain an understanding of methods of time-frequency analysis
Improve proficiency in Matlab programming
Create a program that effectively mimics R2-D2

An Overview of the Method

In aiming to create a program that would take in "human audio" and output "R2-D2 audio", I was able to divide the project into five distinct steps:

Intake of Audio
Time-Frequency Analysis
Representation of R2-D2 Sounds
Merging of time-frequency and R2-D2 sounds
Final processing and output

Intake of Audio

The method of audio intake was primarily determined by which software supported the best time-frequency analysis, as explained below. For this reason, the audio file containing the voice is read by a Max/MSP program that records the frequency and relative amplitude every one-hundredth of a second and writes them to a text file. The Max patch input.maxpat is shown in Figure 1.
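As a rough illustration of how such a text file could be read back in, here is a minimal Python sketch (Python stands in for the project's Matlab). The file layout is an assumption: one frame per line, with the frequency in Hz followed by the relative amplitude. The file actually written by input.maxpat may be laid out differently, and the name parse_pitch_log is hypothetical.

```python
def parse_pitch_log(text, hop_seconds=0.01):
    """Parse lines of '<frequency_hz> <amplitude>' into (time, freq, amp) tuples.

    ASSUMPTION: two numbers per line, one line per 1/100 s frame. The real
    output of the Max patch may use a different layout or separators.
    """
    frames = []
    for line in text.splitlines():
        # Tolerate commas or semicolons, which Max text output sometimes contains.
        parts = line.replace(",", " ").replace(";", " ").split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        freq, amp = float(parts[0]), float(parts[1])
        # Time stamp is derived from the frame index and the hop size.
        frames.append((len(frames) * hop_seconds, freq, amp))
    return frames
```

A frequency of 0 can then be treated as "no pitch detected" for that frame in the later steps.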

Max Patch

Figure 1

Time-Frequency Analysis

In order to create R2-D2 sounds that matched the frequency of the input audio, a reliable way of analyzing the frequency over time was necessary. The Harmonic Product Spectrum algorithm, the Short Time Fourier Transform and the Max/MSP fzero~ object were all explored as options to estimate the fundamental frequency of a sound.

The Harmonic Product Spectrum

The Harmonic Product Spectrum (HPS) algorithm uses Fourier theory to estimate the fundamental frequency of a signal. It takes the DFT/FFT of the signal and finds the greatest common divisor of the frequencies at which the spectrum peaks: for a harmonic signal this is, in theory, the fundamental frequency.

The HPS works in the following way:

Apply a window to the original signal
Take the DFT/FFT of the resulting signal
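Downsample the resulting spectrum by successive integer factors, and multiply the downsampled spectra together with the original
Take the frequency of the largest peak in the product as the estimate of the fundamental frequency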

The HPS algorithm is shown in Figure 2:

HPS Algorithm

Figure 2

For more information on the Harmonic Product Spectrum Algorithm click here or here.
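The steps of the HPS listed above can be sketched in a few lines. This is a minimal Python illustration, not the project's HPS.m: the function names, the Hann window, the choice of three harmonics, and the naive DFT (used here only to avoid external libraries; an FFT would be used in practice) are all assumptions.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitude for bins 0..N/2 (a stand-in for an FFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def hps_fundamental(frame, sample_rate, num_harmonics=3):
    """Estimate the fundamental frequency (Hz) of one frame with the HPS."""
    n = len(frame)
    # Step 1: apply a window (a Hann window is assumed here).
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    # Step 2: take the magnitude spectrum.
    mag = magnitude_spectrum(windowed)
    # Step 3: downsampling by h maps bin k*h onto bin k, so the product at
    # bin k is mag[k] * mag[2k] * ... * mag[num_harmonics * k].
    limit = len(mag) // num_harmonics
    product = []
    for k in range(limit):
        value = 1.0
        for h in range(1, num_harmonics + 1):
            value *= mag[k * h]
        product.append(value)
    # Step 4: the largest peak (ignoring the DC bin) marks the fundamental.
    peak_bin = max(range(1, limit), key=lambda k: product[k])
    return peak_bin * sample_rate / n
```

Because the product only stays large where every harmonic lines up, the estimate can land on the true fundamental even when a higher harmonic is the strongest peak in the spectrum. The resolution is limited to one DFT bin (sample_rate / n Hz).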

The Short Time Fourier Transform

The Short Time Fourier Transform (STFT) applies the DFT/FFT to short, successive sections of the signal to estimate how its frequency content changes over time. In Matlab, the STFT can be easily computed with the spectrogram function. Unlike the HPS, the STFT does not by itself calculate the fundamental frequency of the signal.
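To make the idea concrete, the following Python sketch computes a small STFT by hand. It is only an illustration of the technique, not what Matlab's spectrogram does internally: the frame size, hop size, Hann window and function names are assumptions, and a naive DFT replaces the FFT to keep the example free of external libraries.

```python
import cmath
import math


def stft_magnitudes(signal, frame_size=64, hop=64):
    """Return one magnitude spectrum (bins 0..frame_size/2) per frame."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
        for i in range(frame_size)
    ]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        # Window one short section of the signal...
        segment = [signal[start + i] * window[i] for i in range(frame_size)]
        # ...and take its DFT magnitude (naive O(N^2) form).
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / frame_size)
                    for i, x in enumerate(segment)))
            for k in range(frame_size // 2 + 1)
        ])
    return frames


def strongest_frequencies(frames, sample_rate, frame_size=64):
    """Frequency (Hz) of the largest bin in each frame."""
    return [
        max(range(len(mag)), key=lambda k: mag[k]) * sample_rate / frame_size
        for mag in frames
    ]
```

Note that strongest_frequencies only reports the loudest bin in each frame. For a voice, that bin is often a harmonic rather than the fundamental, which is one way to see why the STFT alone is not a pitch detector.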

Fundamental Frequency Estimation in Max/MSP

In Max/MSP there is a standard object called fzero~ that estimates the fundamental frequency of a signal in real time. It performs multiple layers of wavelet transforms on an incoming signal, comparing the spacing between the peaks in each.

Upon testing these three techniques for pitch detection and/or time-frequency analysis, I found that the HPS provided the most accurate estimate for simple signals, such as a single note or a basic R2-D2 sound like the first half of this simple whistle. The HPS becomes less accurate as the audio sample becomes more complicated or shorter. For this reason it was not optimal for analyzing the audio of a speaker.

I was originally inclined to use the spectrogram function in Matlab to read the voice input, but it did not prove useful for recorded voice, as shown in the spectrogram of this voice clip in Figure 3.

Spectrogram Image

Figure 3

For that reason I decided to use the HPS.m algorithm to determine the fundamental frequency of simple R2-D2 whistles, which could later be modified to follow the time-frequency plot of the voice input file produced by the fzero~ object in Max/MSP.

R2-D2 Sound Representations - To Synthesize or Not to Synthesize?

Originally, I wanted to synthesize the sounds of R2-D2 using additive synthesis. When I listened to R2-D2 sounds taken from the movies, I realized that many of the sounds that are characteristic of the robot would be quite hard to synthesize. Instead, I decided to try to find a technique to pitch shift the samples.

Merging of Time-Frequency and R2-D2 Sounds

In order to make the R2 sounds follow the time-frequency plot of the input file, two things need to happen:

The time-frequency plot needs to be shifted into the same general register as the R2-D2 sounds
A random R2-D2 sound from the library needs to be pitch shifted to the correct pitch

These problems seemed easy to solve. The time-frequency data from Max/MSP was inconsistent, so it was used to create a line of best fit representing the frequencies over time. A scalar could then be added to all non-zero values of the best-fit frequency plot to put the original audio in a range suitable for pitch shifting the R2-D2 sounds.
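A minimal Python sketch of this smoothing and shifting step is shown below. It assumes a straight-line least-squares fit over the frames where a pitch was detected (non-zero frequency); the degree of the fit used in the project, the offset value, and the function names are not given in the text and are assumptions here.

```python
def fit_line(times, freqs):
    """Least-squares straight line through the non-zero (pitched) frames.

    Returns (slope, intercept). ASSUMPTION: a degree-1 fit; the project may
    have used a different kind of best-fit curve.
    """
    points = [(t, f) for t, f in zip(times, freqs) if f > 0]
    if not points:
        return 0.0, 0.0
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_f = sum(f for _, f in points) / n
    spread = sum((t - mean_t) ** 2 for t, _ in points)
    if spread == 0:
        return 0.0, mean_f  # a single pitched frame: flat line at its value
    slope = sum((t - mean_t) * (f - mean_f) for t, f in points) / spread
    return slope, mean_f - slope * mean_t


def shifted_contour(times, freqs, offset_hz):
    """Best-fit frequency per frame, raised by offset_hz; silent frames stay 0."""
    slope, intercept = fit_line(times, freqs)
    return [
        slope * t + intercept + offset_hz if f > 0 else 0.0
        for t, f in zip(times, freqs)
    ]
```

Leaving the zero-valued frames at zero keeps pauses in the speech as pauses in the output, while the offset moves the voiced frames up toward the register of the R2-D2 samples.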

In order to pitch shift the individual R2-D2 sounds, the fundamental pitch of each short sound was found with the HPS.m algorithm. Using the phase vocoder example provided by Dan Ellis, the pitch could then be shifted up relatively easily.
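The amount of shift follows directly from the two frequencies involved. The Python sketch below computes that ratio and shows the resampling half of a pitch shift using plain linear interpolation. It is not the phase vocoder itself: resampling alone also changes the duration of the sound, and a phase vocoder is typically used to stretch or compress the sound in time first so that the final length is preserved. All function names are hypothetical.

```python
import math


def pitch_ratio(source_hz, target_hz):
    """Factor by which the sample's pitch must be multiplied."""
    return target_hz / source_hz


def ratio_in_semitones(ratio):
    """The same shift expressed in equal-tempered semitones."""
    return 12 * math.log2(ratio)


def resample_linear(samples, ratio):
    """Read through `samples` `ratio` times faster (linear interpolation).

    Played back at the original sample rate, the result is higher in pitch
    by `ratio` but also shorter by the same factor.
    """
    if not samples:
        return []
    last = len(samples) - 1
    out = []
    for n in range(int(last / ratio) + 1):
        pos = n * ratio
        i = min(int(pos), last)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append((1 - frac) * samples[i] + frac * nxt)
    return out
```

For example, moving a whistle whose fundamental was measured at 440 Hz up to a target of 880 Hz gives a ratio of 2, or 12 semitones.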

Additional Processing

In the Max patch shown in Figure 1, the amplitude of the recorded audio over time is also recorded and output to a text file. This was processed into an envelope in Matlab that is applied to the final signal so that the result more closely resembles the original sound.
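One simple way to turn amplitude values recorded every one-hundredth of a second into a per-sample envelope is to interpolate between them and multiply the result into the signal. The Python sketch below illustrates that idea; the linear interpolation and the function name are assumptions, and the Matlab code in the project may build its envelope differently.

```python
def apply_envelope(samples, amplitudes, sample_rate, hop_seconds=0.01):
    """Scale `samples` by an amplitude curve given once per `hop_seconds`.

    Amplitudes are linearly interpolated between frames (an assumption),
    and the last value is held if the signal outlasts the envelope.
    """
    if not amplitudes:
        return list(samples)
    samples_per_frame = sample_rate * hop_seconds
    last = len(amplitudes) - 1
    out = []
    for n, x in enumerate(samples):
        pos = n / samples_per_frame      # position measured in frames
        i = min(int(pos), last)
        j = min(i + 1, last)
        frac = pos - i if i < last else 0.0
        gain = (1 - frac) * amplitudes[i] + frac * amplitudes[j]
        out.append(x * gain)
    return out
```

Interpolating, rather than holding each amplitude for a whole frame, avoids a step in gain every 1/100 s, which could otherwise be audible as a faint buzz.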

Challenges Faced Throughout the Project

The R2-D2 Sounds

As previously mentioned, the sounds of the droid proved difficult to synthesize. When I first researched the sounds, I discovered that the sound designer Ben Burtt used an analog synthesizer, as well as his own vocalizations processed through other effects (wiki). Because of this, I decided against synthesis for the sake of brevity, although it would probably produce a better end result.

Pitch Detection and Communication Between Matlab and Max/MSP

Because of the constraints of the pitch detection methods I tried, I was forced to build the program from a combination of software packages. This brought to my attention the lack of compatibility between Max/MSP and Matlab. Originally, I wanted the project to be one neat package, but that proved harder than originally thought.

The pitch detection of small clips of R2-D2 noises proved hard as well. The HPS algorithm works well for harmonic sounds, such as some of the R2-D2 whistles, but the sounds more akin to barks or growls were harder to analyze, and pitch shifting them resulted in strange artifacts.

Potential Solutions

The output of the program(s) seems to be most affected by the pitch shifting algorithm and the accuracy of the Harmonic Product Spectrum algorithm. This is due to the nontrivial nature of the R2-D2 sounds: they are the result of combinations of analog, digital, voice and other modified sounds that make analysis difficult. A better solution would be to have a library consisting of simple R2-D2 sounds, such as whistles and other similar sounds, that would respond better to the HPS and pitch shifting algorithms.

A better method of selecting R2-D2 sounds may be corpus-based synthesis, which would select sounds that match the characteristics of the input sound. This may produce better results but would no doubt prove time-consuming.

Download

The software in its current state can be downloaded here.