
I haven't had a chance to read through the documentation yet, but I was wondering if FMOD Ex has any capabilities when it comes to voice recognition. I'm basing my work on the record example and writing to a wav. What I can see happening in my head is the user talking into a mic, having that stored in a channel, and then comparing it to that wav. To be accurate I would need to compare it to a number of wavs. Does FMOD have anything along these lines that would let me do this?

---

I'll be trying the comparison of differences today, but I also wanted to check whether there is a way to customize the highpass DSP and lowpass DSP settings; I want to customize the frequencies they cover. I brought my tests into a Logic session (a professional digital audio workstation), analyzed the .wav, played with the low- and high-pass settings in the EQ, and found a setting that removes most of the static.
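Something like this is what I'm picturing, assuming FMOD Ex's createDSPByType/setParameter API with the FMOD_DSP_HIGHPASS_CUTOFF and FMOD_DSP_LOWPASS_CUTOFF parameters from fmod_dsp.h. This is only a sketch: the cutoff values below are placeholders, not the ones I found in Logic, and depending on the FMOD Ex version addDSP may take an extra connection argument.

[code]
// Sketch: band-limit the channel with FMOD Ex's built-in filters.
// The cutoff values are placeholder guesses at a speech band.
FMOD::DSP *highpass = 0, *lowpass = 0;

system->createDSPByType(FMOD_DSP_TYPE_HIGHPASS, &highpass);
highpass->setParameter(FMOD_DSP_HIGHPASS_CUTOFF, 250.0f);   // cut rumble below ~250 Hz

system->createDSPByType(FMOD_DSP_TYPE_LOWPASS, &lowpass);
lowpass->setParameter(FMOD_DSP_LOWPASS_CUTOFF, 4000.0f);    // cut hiss above ~4 kHz

channel->addDSP(highpass);
channel->addDSP(lowpass);
[/code]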

---

OK, first off I'd like to thank everyone for their help; I appreciate it. I think I almost have this down. I record a sound and store it in a buffer.

Then, when playing the sound, I store all the numerical values from system->getSpectrum(…) in 2 arrays, 1 for the left speaker and 1 for the right.

Then I combine the two arrays into a stereo array like so:
StereoArray = (LeftArray + RightArray) / 2 (I got this from the documentation.)
This stereo array will be known as SpeechArray from now on.

I then proceed to write this file to disk and store its values in another 2 buffers for comparison, using system->getSpectrum(…) again.
I make this a stereo array as well.
This stereo array will be known as FileArray from now on.

Next I check whether the two are alike (they should be, since they are the same file) by checking if(((SpeechArray + FileArray) / FileArray) > 1.8), i.e. 20% down.

If it is, I check if(((SpeechArray + FileArray) / FileArray) < 2.2), i.e. 20% up.

If this is true then Match++, where Match starts at 0.

I then repeat these steps for every sample of the audio.

After it is done I check if(Match / numSamples > 0.8):
if it is, MATCH
else
NO MATCH
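In code, the check I'm describing is roughly this (a sketch only; SpeechArray and FileArray are the float spectrum arrays from above, numSamples is their length, and there is no guard here for FileArray[i] == 0):

[code]
// Sketch of the per-bin check described above: if the two values are
// equal, (a + b) / b == 2, so ratios inside 2 +/- 0.2 count as a match
// (that is, a/b within 20% either way).
int   Match     = 0;
float tolerance = 0.2f;

for (int i = 0; i < numSamples; i++)
{
    float ratio = (SpeechArray[i] + FileArray[i]) / FileArray[i];
    if (ratio > 2.0f - tolerance && ratio < 2.0f + tolerance)
        Match++;
}

if ((float)Match / numSamples > 0.8f)   // float cast: integer division would give 0 or 1
    cout << "MATCH" << endl;
else
    cout << "NO MATCH" << endl;
[/code]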

My problem is I'm getting a match no matter what, and I think it's because the static is putting it within that tolerance all the time. Is there a way to cut it out, or is there a flaw in my logic here? I took it up with an audio major and he said this should make a primitive voice recognition program, which is all I'm going for.

As far as I can tell, putting a lowpass DSP and a highpass DSP on does not help a lot; it just makes everything a lot quieter. Are there better filters for this that I can't find, or maybe a combo that works well to maintain volume but reduce static?

EDIT: OK, so I ran some numbers and printed the actual values to the screen. Turns out, no matter what I record or how loud I am, it has the same spectral numbers. What's up with that?

---

I tried the sum comparison and it's working, thanks for the suggestion. Now I need to compare these sums to zero and grab the closest one. This should not be hard, and I should be mostly done today. Thanks again for all the help.
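For the "closest one" part, I'm picturing something like this sketch (NUM_WORDS, NUM_BINS, WordTemplates and LiveSpectrum are made-up names for my stored references and the live input; fabsf needs <cmath>):

[code]
// Sketch: score each stored word by the sum of absolute differences
// against the live spectrum, then pick the smallest (closest to zero).
int   best      = -1;
float bestScore = 1e30f;

for (int w = 0; w < NUM_WORDS; w++)
{
    float score = 0.0f;
    for (int i = 0; i < NUM_BINS; i++)
        score += fabsf(WordTemplates[w][i] - LiveSpectrum[i]);

    if (score < bestScore)
    {
        bestScore = score;
        best = w;      // 'best' now indexes the closest-matching word
    }
}
[/code]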

---

You can capture the recorded data yourself, using the record position (getRecordPosition) and Sound::lock/unlock. There is an example called streamtodisk which does this.
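Roughly like this (a sketch based on that example; the exact getRecordPosition/lock signatures depend on your FMOD Ex version, and this assumes 16-bit stereo PCM with fp being your already-open data file):

[code]
// Sketch: poll the record cursor and copy any new PCM out of the
// record buffer with Sound::lock/unlock.
unsigned int recordpos = 0, lastpos = 0;
unsigned int soundlength;

sound->getLength(&soundlength, FMOD_TIMEUNIT_PCM);

system->getRecordPosition(0, &recordpos);            // record driver 0
if (recordpos != lastpos)
{
    void        *ptr1, *ptr2;
    unsigned int len1, len2;
    unsigned int blocklength = recordpos - lastpos;
    if (recordpos < lastpos)                         // record cursor wrapped
        blocklength += soundlength;

    // lock takes byte offsets; 4 = 2 bytes/sample * 2 channels
    sound->lock(lastpos * 4, blocklength * 4, &ptr1, &ptr2, &len1, &len2);
    if (ptr1 && len1) fwrite(ptr1, 1, len1, fp);
    if (ptr2 && len2) fwrite(ptr2, 1, len2, fp);
    sound->unlock(ptr1, ptr2, len1, len2);

    lastpos = recordpos;
}
[/code]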

---

Figured I'd post a sample of my code to show you how I'm getting the samples and making them stereo.

[code]
result = system->playSound(FMOD_CHANNEL_REUSE, sound, false, &channel);
channel->isPlaying(&playing);
while (playing == true)
{
    channel->isPlaying(&playing);
    cout << "Playing..... \r";
    system->getSpectrum(SpeechL, numValues, 0, FMOD_DSP_FFT_WINDOW_BLACKMANHARRIS);
    system->getSpectrum(SpeechR, numValues, 1, FMOD_DSP_FFT_WINDOW_BLACKMANHARRIS);

    system->update();
}
done = false;
I = 0;
while (!done)
{
    SpeechSTERIO[I] = (SpeechL[I] + SpeechR[I]) / 2;
    I++;
    if (I == 64)
        done = true;
}
[/code]

---

I could not find the streamtodisk example. Is it ripnetstream?

Well, I already have a program written that writes the wav file, so getting the data to disk is not the problem. What I'm more interested in is the method to compare the sounds and tell one from another.

---

Sorry, I meant recordtodisk, not streamtodisk, btw.

Hagnasty, that code doesn't work: all it does is overwrite the same array until the song finishes, so all you have is a small snapshot of the spectrum from the very end of the song. You're also doing it at full framerate, which is slow and unnecessary (i.e. no sleep). This doesn't give you a continuous stream of fft data, if that's what you're thinking.

---

Speech recognition is not that easy (I'm following a course about it at the moment).
Limited recognition is not too hard, though, with the following constraints:
-A limited number of words/phrases (around 20, maybe 100)
-A training phase where the speaker has to say the words once
-The same training set used by only one speaker

Comparing the sample data directly is probably not very useful, due to phase differences as well as speed/timing differences.
The basic method is to start with the spectral data (calculate the fft in 512-sample intervals for the training data, and do the same for the data you are receiving in real-time).
Then there's a simple algorithm to calculate a cost for going from your sampled data to your training data (this algorithm can take into account insertions and deletions of samples that were not in the training data).

I think getting this working properly and fast in real-time will be quite some work, though, and it requires a good understanding of what you are doing.

---

Your code does indeed seem wrong. Here are some more tips:

-Using stereo is most likely useless. It doubles the processing power you need and it doesn't gain you anything, since speech is mono anyway, and most microphones are mono as well. So just pick either the right or the left channel to do your processing with.
-I'm not sure if you understand this correctly, but getSpectrum gives you an array of values for one time frame only, and each value in the array gives you the loudness of a specific frequency.

Basically, to do a very simple comparison, you need to have an array of spectrum arrays of the source, and compare this to an array of spectrum arrays of the incoming sound.

Also, normalizing the data can make it easier to compare (since the speaker could speak louder/softer).
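Normalizing can be as simple as scaling each spectrum frame by its peak, e.g. this sketch:

[code]
// Sketch: scale one spectrum frame so its loudest bin is 1.0.
void normalize(float *spectrum, int numvalues)
{
    float peak = 0.0f;
    for (int i = 0; i < numvalues; i++)
        if (spectrum[i] > peak) peak = spectrum[i];

    if (peak > 0.0001f)                 // skip near-silent frames
        for (int i = 0; i < numvalues; i++)
            spectrum[i] /= peak;
}
[/code]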

---

All I need for what I'm doing is single-word recognition, mainly Left, Right, Jump.

Eventually I would like phrase recognition, but until I understand how to get a single word in there I'm going to put that on the back burner.

I've tried to look up snippets of code or some kind of online resource when it comes to voice recognition, but I just can't find anything. If anyone can point me in the right direction, I'm a quick study.

---

[quote="Adion":34uujkxx]-I’m not sure if you understand this correctly, but getSpectrum gives you an array of values for one time frame only, and each value in the array gives you the loudness of a specific frequency.

Basically, to do a very simple comparison, you need to have an array of spectrum arrays of the source, and compare this to an array of spectrum arrays of the incoming sound.

Also normalizing the data can also make it easier to compare the data (since the speaker could speak louder/softer)[/quote:34uujkxx]

-Another problem with getSpectrum is that it is tightly related to the length of you sound mix buffer, if I’m right. In this case, variable lengths can give different results.

As Adion says, it’s mandatory for you to have a full spectrum of the whole reference sounds, and maybe make a full spectrum of your recorded data.

Actually I am thinking of a graphical representation of the spectrum of a whole sound :
If you generate a bitmap of your spectrum, it’s likely to be cuttable into separate small parts, parts that are different (that is, in one part, a certain bandwidth is he most significant, and in another part, another bandwidth takes the lead)
Then you can compare your sounds comparing the number of parts, then comparing parts (length of a part is not too much different from one sample to another, bandwidth ‘settings’ are not too different, etc.)

I am going to see if I can picture this.

---

OK, an acquaintance of mine gave me some advice, which was to compare the waves based on their Dopplers. Does FMOD support this, and is it a viable option?

---

OK, so this image:
[img]http://members.tripod.com/Milaa/mp3Comparison/Camouflage-sample-new-68/lame-390a7-b192-mj-h.jpg[/img]
represents the spectrum of a whole sound (with its waveform representation on top). As we see in the image, the sound can clearly be cut into several parts that may be compared from one sample to another.
Maybe it is easier to compute something when you remember that the speech bandwidth is generally located below 4 kHz (hence the 8 kHz sample rate used for telephone speech).

---

For a simple comparison of your sound against a trained, limited set of words, your keywords are:
"Dynamic Time Warping", to calculate how well your recorded input matches your trained set of words.
"Mel spectrogram", a transformation of a regular spectrogram that better represents how humans perceive sound.
"Cepstrogram"; I don't know if this is already necessary for a basic comparison. It's basically the Fourier transform of the log of a Fourier transform, but it can give you additional information to compare your data with.

I don't have any specific online sources at the moment, but these keywords should help you find relevant information.
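As a rough idea of the Dynamic Time Warping part, here is a textbook sketch (not tuned for speech, and not FMOD-specific; frameDistance is just a stand-in for whatever per-frame distance you end up using):

[code]
#include <vector>
#include <cmath>
#include <algorithm>

// Euclidean distance between two spectrum frames (stand-in metric).
float frameDistance(const std::vector<float> &a, const std::vector<float> &b)
{
    float d = 0.0f;
    for (size_t i = 0; i < a.size(); i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

// Classic DTW: cost of aligning the input frames to the template frames,
// allowing frames to be stretched in time (insertions/deletions).
float dtwCost(const std::vector<std::vector<float> > &input,
              const std::vector<std::vector<float> > &tmpl)
{
    size_t n = input.size(), m = tmpl.size();
    std::vector<std::vector<float> > D(n + 1, std::vector<float>(m + 1, 1e30f));
    D[0][0] = 0.0f;

    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= m; j++)
            D[i][j] = frameDistance(input[i-1], tmpl[j-1])
                    + std::min(D[i-1][j-1], std::min(D[i-1][j], D[i][j-1]));

    return D[n][m];    // lower cost = better match; compare across word templates
}
[/code]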

---

[quote="brett":30paq2tb]
Hagnasty that code doesnt work, all it does it overwrite the same array until the song finishes, so all you have is a small snapshot of a spectrum for the very end of the song. You’re also doing it at full framerate which is slow and unecessary (ie no sleep). This doesnt give you a continuous stream of fft data if that’s what you’re thinking.[/quote:30paq2tb]

Ok,so ill throw in a sleep based on the length of the sound then from what I understand I should loose the stereo array and make a Sound info Vector Inside the sound info vector I will store these mono arrays (These are samples correct?) and then compare the vectors together.

So Ill try and normalize the sound and I’m going to input an adjustable tolerance level. Then it should from my understanding give me a series of snapshots (based on my sleep) of the sound. This explains why my numbers where comming out the same since the sound was the same at the end of the recording (static noise)

artscoop you kinda lost me in the whole picture thing. I try and think of sound as numbers not images. It I need a graphical representation I think wave or bar graph. I think I know what you where getting at though and thats where Dynamic Time Warping come into play. I’m going to try and figure it out.

Ill try this and thanks again.

---

I looked into the Dynamic Time Warping, and I don't think it will have to be as accurate as that. In FMOD there is a function called Channel::getSpectrum().

What I was thinking is that I can use this to take each millisecond and store a numeric value in the array (0.0-1.0), then create a function to see if you are within, say, 15% of the recorded function.

The problem I can see here is having words like Duck and Puck perform the same actions. I'm not sure if I will have a problem with Up, Down, Left, Right, Jump.

Any thoughts?

---

Here it is, the new code that should take a number of snapshots.

Right now it's getting the same numbers no matter what is recorded. I thought I had it this time. Any advice?

[code]

//RECORDING SOUND
if (recording == true)
{
    iter = 0;
    while (recording == true)
    {
        channel->isPlaying(&recording);

        cout << "Recording.....              \r";
        system->getSpectrum(SpeechSample, numValues, 0, FMOD_DSP_FFT_WINDOW_BLACKMANHARRIS);

        // NOTE: this copies a single float, and index 64 is one past the end
        // of a 64-element array; storing the whole snapshot needs a loop over
        // all 64 bins (the comparison below expects full rows).
        SpeechSoundArray[iter][64] = SpeechSample[64];

        iter++;

        Sleep((Soundlength / 5) * 1000);
    }
}

//PLAY_RECORDED_SOUND
else if (playing == true)
{
    result = system->playSound(FMOD_CHANNEL_REUSE, sound, false, &channel);
    while (playing == true)
    {
        channel->isPlaying(&playing);
        cout << "Playing.....                \r";

        system->update();
    }
}

//WRITING_SOUND_TO_DISK
else if (WPlaying == true)
{
    iter = 0;
    result = system->playSound(FMOD_CHANNEL_REUSE, sound, false, &channel);

    channel->isPlaying(&WPlaying);
    while (WPlaying == true)
    {
        channel->isPlaying(&WPlaying);
        cout << "Writing To Disk.....                                                    \r";

        system->update();
    }
    for (int iter = 0; iter < 5; iter++)
    {
        for (int iter2 = 0; iter2 < 64; iter2++)
        {
            StoredSoundArray[iter][iter2] = SpeechSoundArray[iter][iter2];
        }
    }
}
else
{
    cout << "Tolerance:" << tolerance * 100 << "% Idle.....                   \r";
}
}

//COMPARE_MULTI_DIMENSIONAL_ARRAYS
case 'c':
case 'C':
    Match = 0;
    for (int J = 0; J <= 6; J++)       // NOTE: 7 rows compared here, but only 5 stored above
    {
        for (int I = 0; I < 64; I++)
        {
            if (((StoredSoundArray[J][I] + SpeechSoundArray[J][I]) / StoredSoundArray[J][I]) > (2 - tolerance))
            {
                if (((StoredSoundArray[J][I] + SpeechSoundArray[J][I]) / StoredSoundArray[J][I]) < (2 + tolerance))
                {
                    Match++;
                }
            }
        }
    }
    cout << Match << " " << endl;

    // NOTE: if Match is an int, make sure this divides as floating point,
    // e.g. (float)Match / (numValues * 6), or the ratio is always 0 or 1.
    if (Match / (numValues * 6) > 0.8)
    {
        cout << (Match / (numValues * 6)) * 100 << "% MATCH                                          \r";
    }
    else
    {
        cout << (Match / (numValues * 6)) * 100 << "% NO MATCH                                       \r";
    }

    Sleep(500);

    break;

[/code]

---

getSpectrum is probably a good start, although I think in fmod it can't be configured.
That means that in fmod it will probably be a 1024-sample window with no overlapping (at 44.1 kHz that's about 23 ms, I believe).

Especially for short sounds, the fact that you only have a reading every 20-odd milliseconds might make detection a bit more difficult.

Anyway, with a 1024-sample fft you will get back a spectrum of 512 values, covering frequencies from 0 Hz up to half the sample rate (22 kHz in this case).
For speech you only need to compare data between 100 and 4000 Hz.

Just comparing simple distinct words might work, although finding the start of the word may be a bit difficult. (If there is no background noise, it might work to check for a sudden volume change, though.)
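Mapping those frequencies to getSpectrum bins is just bin = frequency * fftSize / samplerate, e.g. this sketch (assuming the 1024-point fft at 44100 Hz from above):

[code]
// Sketch: which getSpectrum bins cover the speech band, assuming a
// 1024-point fft at 44100 Hz (so 512 bins spanning 0..22050 Hz).
const int   FFT_SIZE   = 1024;
const float SAMPLERATE = 44100.0f;

int lowBin  = (int)(100.0f  * FFT_SIZE / SAMPLERATE);   // ~2
int highBin = (int)(4000.0f * FFT_SIZE / SAMPLERATE);   // ~92

// Only compare spectrum[lowBin] .. spectrum[highBin] between recordings.
[/code]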

---

OK, disregard the previous post, I got it to work finally. Thanks to all who helped. I just need a way to make it usable: saying the same phrase is only returning a 65% match at a 99.9999-repeating tolerance. I was hoping the normalizing would help, because my values are very quiet and are expressed in scientific notation.

EDIT: The normalize helped, but is there a filter that can filter out static that I haven't found yet?

---

In FMOD you can change the window. I'm using FMOD_DSP_FFT_WINDOW_BLACKMANHARRIS.
