CN1213399C - General A-Law format voice identifying method - Google Patents

General A-Law format voice identifying method

Info

Publication number
CN1213399C
CN1213399C CNB021287619A CN02128761A
Authority
CN
China
Prior art keywords
voice
speech
starting point
feature amount
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB021287619A
Other languages
Chinese (zh)
Other versions
CN1474377A (en)
Inventor
冯敬涛
刘丹亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CNB021287619A priority Critical patent/CN1213399C/en
Publication of CN1474377A publication Critical patent/CN1474377A/en
Application granted granted Critical
Publication of CN1213399C publication Critical patent/CN1213399C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention provides a general A-law format speech recognition method comprising the following steps: a speech template containing the speech feature quantities of all voices is generated from an original A-law format voice file and then loaded; the start and end points of the speech stream to be recognized are detected, and the speech feature quantities of the speech between those points are extracted; the speech feature quantities of the speech stream to be recognized are compared with those of the speech template, and the speech is segmented and recognized to obtain the recognition result. The technical scheme of the present invention recognizes speech quickly and accurately, has a moderate cost, occupies few resources, and is flexible and convenient to use. It is generally applicable to price-sensitive switching and non-switching fields that have performance requirements and in which the voices are confined to a limited set.

Description

General A-Law format voice identifying method
Technical field
The invention belongs to the field of speech recognition, and specifically relates to a general A-law format speech recognition method.
Background technology
With the continuous development and enrichment of intelligent-network products, their voice-related services have become increasingly rich, varied and flexible. In intelligent-network product testing, a key technique for achieving test automation is the recognition of service voices. At present the most primitive approach, manual calling, is generally used: a tester listens with his own ears to judge whether the voices are correct. Because this test mode depends heavily on the completeness and adequacy of the whole test and on the tester, its efficiency is rather low.
To solve the above problem, the prior art generally adopts the traditional ASR (Automatic Speech Recognition) technique, which first converts the speech into text and then compares and recognizes the text. However, it is very expensive and is normally charged per time slot, which is all the harder to justify when the voices are confined to a limited set; in addition, its recognition speed is rather slow. In an intelligent-network service there are generally only 3-5 seconds after an announcement is played; if speech recognition and dialling cannot be completed within this period, the service times out and enters the timeout branch. During performance testing in particular, the voices of many time slots must be recognized simultaneously. Considering the shortcomings of the ASR technique together with the requirements of intelligent-network services, the ASR scheme is unsatisfactory for intelligent-network services in both price and performance, and the problem is even more prominent in switching and non-switching fields that are sensitive to price and performance.
Summary of the invention
In view of the above problems, the present invention proposes a general A-law format speech recognition method that recognizes speech quickly and accurately, has a moderate cost, and can be widely applied in switching and non-switching fields in which the voices are confined to a limited set.
To achieve the above object, the concrete steps of the general A-law format speech recognition method of the present invention are:
A. Generate, from the original A-law format voice file, a speech template containing the speech feature quantities of all voices, and then load the speech template;
B. Detect the start and end points of the speech stream to be recognized, and extract the speech feature quantities of the speech between the start and end points;
C. Compare the speech feature quantities of the speech stream to be recognized with the speech feature quantities of the speech template, and segment and recognize the speech, thereby obtaining the recognition result.
The detection of the start and end points of the speech stream to be recognized in step B more specifically comprises the following steps:
B1. Determine the size of the speech data block and the speech energy threshold;
B2. Determine the speech start point: if the energy of several consecutive frames of the speech stream to be recognized is greater than the speech energy threshold, take the first frame whose energy exceeds the threshold as the candidate start point, and then take the position a number of speech data block lengths before the candidate start point as the speech start point;
B3. Determine the speech end point: if the energy of several consecutive frames of the speech stream to be recognized is less than the speech energy threshold, take the first frame whose energy falls below the threshold as the candidate end point, and then take the position a number of speech data block lengths before the candidate start point as the speech end point.
The speech segmentation and recognition in step C more specifically comprise the following steps: C1. segment the speech information; C2. analyze the sentence composition; C3. analyze the voice composition, obtaining the number of voice segments and the corresponding codes.
To further reduce the cost of the present invention and increase its usability, the speech feature quantities of the speech template in step A further comprise fast-matching speech feature quantities and precise-matching speech feature quantities; the speech feature quantities in step A and step C refer to time-domain analysis feature quantities and frequency-domain analysis feature quantities.
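As an illustrative reading of these feature quantities (not part of the patent's disclosure), a speech-template entry could be organized roughly as follows; all field names and sizes are assumptions made for the sketch.

```c
/* Hypothetical layout of one speech-template entry, grouping the fast-matching
 * quantities (energy series and threshold) and the precise-matching quantities
 * (feature-vector series and threshold) together with the file identification
 * parameters named in the description. Names and dimensions are illustrative. */
#define MAX_FRAMES 512
#define FEAT_DIM   12

typedef struct {
    char     file_name[64];                        /* original A-law voice file name   */
    unsigned file_code;                            /* voice file code, e.g. 0x0000000a */
    int      avg_segment_len;                      /* average segment length           */
    int      start_frame;                          /* voice start frame position       */
    int      num_frames;                           /* frames actually stored           */
    double   energy_threshold;                     /* fast-matching threshold          */
    double   energy_series[MAX_FRAMES];            /* fast-matching energy series      */
    double   feature_threshold;                    /* precise-matching threshold       */
    double   feature_series[MAX_FRAMES][FEAT_DIM]; /* precise-matching feature vectors */
} SpeechTemplateEntry;
```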
The technical scheme of the present invention has the following advantages:
1. Fast and accurate speech recognition: the recognition accuracy can reach 100%, and the speed is very high; when multi-channel recognition is performed in a multi-threaded application, the same accuracy and speed are maintained.
2. Moderate cost: it is generally applicable to price-sensitive switching and non-switching fields that have performance requirements and in which the voices are confined to a limited set.
3. Low resource usage.
4. Flexible and convenient to use, with partial support for fuzzy matching. For example, a running system may occasionally pick up the sound too late, so that the beginning (header) of the voice is missing; the present invention can still recognize such a voice and can tolerate a rather long missing portion at the beginning. As another example, for the polyphone problem, the character '分' (fen) has two readings: the 'fen' of hour-minute-second (minute) and the 'fen' of yuan-jiao-fen (cent); they are recorded separately, and existing ASR technology can only recognize them as one word, whereas the present invention distinguishes them by their voice codes.
The present invention is described in detail below in conjunction with the drawings and a specific embodiment.
Description of drawings
Fig. 1 is a method flow diagram of the present invention.
Specific implementation
The announcements of wireline intelligent-network services are a typical representative of general A-law format voices, and they essentially cover the announcement situations of other products' services; the present invention is therefore explained below using wireline intelligent-network service voices as an example.
First, the basic situation of wireline intelligent-network service voices is introduced. By purpose, such voices can be divided into two sub-types: service-flow voices and basic voices. The former control the service flow; they can be used alone or together with the latter, and different services have different service-flow voices. The latter must be used in combination with service-flow voices, and their content does not change when the service changes; they mainly include '0', '1'-'9', 'ten', 'hundred', 'thousand', 'ten thousand', 'hundred million', 'yuan', 'jiao', 'fen', 'year', 'month', 'day', 'hour', 'minute', 'second', etc. These sub-voices, separated by very short intervals, are combined to form statements. The original A-law format voice file may consist of a single statement or of several statements separated by long intervals. The present embodiment compares the voice to be recognized with such an original A-law format voice file; the concrete steps are as follows:
Step one: generate, from the original A-law format voice file, a speech template containing the speech feature quantities of all voices, and then load the speech template.
The service platform sends a message to the switch; the switch extracts the corresponding original A-law format voice file from the voice resource and sends it into a trunk voice-channel time slot. From this original A-law format voice file the service platform generates a speech template containing the speech feature quantities of all voices, i.e. it extracts the following main parameters: the average segment length, the voice start frame position, the energy threshold and energy parameter series used for fast-matching recognition, the feature-vector threshold and feature-vector series used for precise-matching recognition, and the original voice file name and voice file code. A speech template covering all service-flow voices and basic voices is thus generated and then loaded and initialized.
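A minimal sketch of how the per-frame quantities for such a template could be computed from an A-law file is given below, assuming standard G.711 A-law expansion and the frame-energy definition used in step two; the function names and parameters are illustrative assumptions, not the patent's interface.

```c
/* Standard G.711 A-law expansion to a linear sample. The sign convention is
 * irrelevant for the frame energy, which only uses |s(i)|. */
int alaw_to_linear(unsigned char a)
{
    a ^= 0x55;
    int mantissa = (a & 0x0F) << 4;
    int segment  = (a & 0x70) >> 4;
    int value    = (segment == 0) ? mantissa + 8
                                  : (mantissa + 0x108) << (segment - 1);
    return (a & 0x80) ? value : -value;
}

/* Fill ene[] with per-frame energies ENE = sum(|s(i)|)/L over an A-law buffer;
 * frame_len is the number of samples per frame (200 at 8 kHz for 25 ms frames).
 * Returns the number of complete frames computed. */
int frame_energies(const unsigned char *alaw, int n_samples,
                   int frame_len, double *ene, int max_frames)
{
    int n_frames = n_samples / frame_len;
    if (n_frames > max_frames) n_frames = max_frames;
    for (int f = 0; f < n_frames; f++) {
        double sum = 0.0;
        for (int i = 0; i < frame_len; i++) {
            int s = alaw_to_linear(alaw[f * frame_len + i]);
            sum += (s < 0) ? -s : s;
        }
        ene[f] = sum / frame_len;
    }
    return n_frames;
}
```

The same routines would also be run on the speech stream to be recognized in step two, so that the template and the input are compared in the same feature space.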
Step two: detect the start and end points of the speech stream to be recognized, and extract the speech feature quantities of the speech between them. The detection of the start and end points more specifically comprises the following steps (a code sketch follows this step):
a. Determine the size of the speech data block and the speech energy threshold. Assuming that the sampling rate of the speech to be recognized is 8 kHz, the frame period is 25 ms and the frame length is 25 ms, the size of the input speech data block is first set to 80 ms and the speech energy threshold to 30.
b. Determine the speech start point: if the energy of several consecutive frames of the speech stream to be recognized is greater than the speech energy threshold, take the first frame whose energy exceeds the threshold as the candidate start point, and then take the position a number of speech data block lengths before the candidate start point as the speech start point. The speech segment is judged according to the frame energy ENE = (Σ_{i=0}^{L-1} |s(i)|) / L, where s(i) is the speech signal and L is the frame length in samples. For example, if 3 consecutive frames of the speech stream to be recognized, i.e. a period of 3 × 25 = 75 ms, have energy greater than the speech energy threshold 30, the first frame whose energy exceeds the threshold is taken as the candidate start point, and the position 3 speech data block lengths, i.e. 3 × 80 = 240 ms, before the candidate start point is taken as the speech start point.
c. Determine the speech end point: if the energy of several consecutive frames of the speech stream to be recognized is less than the speech energy threshold, take the first frame whose energy falls below the threshold as the candidate end point, and then take the position a number of speech data block lengths before the candidate start point as the speech end point. The speech segment is again judged according to the frame energy ENE = (Σ_{i=0}^{L-1} |s(i)|) / L, where s(i) is the speech signal. For example, if 40 consecutive frames of the speech stream to be recognized, i.e. a period of 40 × 25 = 1000 ms, have energy below the speech energy threshold 30, the first frame whose energy falls below the threshold is taken as the candidate end point, and the position 2 speech data block lengths, i.e. 2 × 80 = 160 ms, before the candidate start point is taken as the speech end point.
Through the above steps it can be accurately detected whether a speech stream to be recognized exists and whether it has ended; with suitably configured parameters, the interval between short sentences will not be mistaken for the end of the speech.
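Under the example values above (25 ms frames, 80 ms data blocks, energy threshold 30, 3 consecutive loud frames to accept a start, 40 consecutive quiet frames to accept an end), the start/end rule could be sketched as follows; the names and the exact back-off handling are assumptions for illustration, not the patent's precise procedure.

```c
#define FRAME_MS      25
#define BLOCK_MS      80
#define ENERGY_THRESH 30.0
#define START_RUN     3     /* consecutive frames above threshold => start */
#define END_RUN       40    /* consecutive frames below threshold => end   */

/* Hypothetical endpoint detector over the per-frame energies produced by
 * frame_energies() above. Returns 1 and fills *start_frame / *end_frame when
 * a speech segment is found, 0 when no speech is present. */
int detect_endpoints(const double *ene, int n_frames,
                     int *start_frame, int *end_frame)
{
    int run = 0, candidate = -1;
    int start_backoff = (3 * BLOCK_MS) / FRAME_MS;   /* ~240 ms in frames */
    int end_backoff   = (2 * BLOCK_MS) / FRAME_MS;   /* ~160 ms in frames */

    /* start point: first run of START_RUN frames above the threshold */
    for (int f = 0; f < n_frames; f++) {
        if (ene[f] > ENERGY_THRESH) {
            if (candidate < 0) candidate = f;        /* first loud frame */
            if (++run >= START_RUN) break;
        } else {
            run = 0;
            candidate = -1;
        }
    }
    if (run < START_RUN) return 0;                   /* no speech found */
    *start_frame = (candidate > start_backoff) ? candidate - start_backoff : 0;

    /* end point: first run of END_RUN frames below the threshold */
    run = 0;
    int end_candidate = n_frames - 1;                /* fall back to stream end */
    for (int f = candidate; f < n_frames; f++) {
        if (ene[f] < ENERGY_THRESH) {
            if (run == 0) end_candidate = f;         /* first quiet frame of run */
            if (++run >= END_RUN) break;
        } else {
            run = 0;
        }
    }
    *end_frame = (end_candidate > end_backoff) ? end_candidate - end_backoff
                                               : end_candidate;
    return 1;
}
```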
Step three: compare the speech feature quantities of the speech stream to be recognized with those of the speech template, and segment and recognize the speech, thereby obtaining the recognition result. Suppose the speech stream to be recognized is: "Your balance is 10 yuan 5 jiao. To make a call, press 1; to query the balance, press 2." The speech is first cut at the detected start and end points, and the statements are analyzed as "Your balance is 10 yuan 5 jiao", "To make a call, press 1" and "To query the balance, press 2". The voices contained in these statements are "Your balance is", "10", "yuan", "5", "jiao", "To make a call, press 1" and "To query the balance, press 2", yielding 7 voice segments whose codes are respectively: 06800018, 00000001, 0000000a, 00000031, 00000045, 00000009, 0680000d... Each is then compared with the speech feature quantities of the speech template to find the closest one, giving the individual voices "Your balance is", "one" and "ten", "yuan", "five", "jiao", "To make a call, press 1" and "To query the balance, press 2", and thereby the recognition result.
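As a rough illustration of the comparison step (not the patent's actual matching algorithm), each cut segment could first be screened against the fast-matching energy series of every template entry and the closest entry accepted; the names below carry over from the earlier sketches and remain assumptions.

```c
#include <math.h>

/* Hypothetical nearest-template matcher: compare a segment's per-frame energy
 * series against each entry's fast-matching series using a mean absolute
 * difference, and return the index of the closest acceptable entry (whose
 * file_code then identifies the recognized voice), or -1 if none matches. */
int match_segment(const double *seg_ene, int seg_frames,
                  const SpeechTemplateEntry *templates, int n_templates)
{
    int best = -1;
    double best_score = 1e300;

    for (int t = 0; t < n_templates; t++) {
        int n = (seg_frames < templates[t].num_frames)
                    ? seg_frames : templates[t].num_frames;
        if (n == 0) continue;
        double score = 0.0;
        for (int f = 0; f < n; f++)
            score += fabs(seg_ene[f] - templates[t].energy_series[f]);
        score /= n;
        /* accept only entries within their fast-matching threshold */
        if (score < templates[t].energy_threshold && score < best_score) {
            best_score = score;
            best = t;
        }
    }
    return best;
}
```

In a full implementation the surviving candidates would then be re-scored with the precise-matching feature-vector series before the final voice code is reported.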
To further increase the usability of the present invention, it can be implemented in the form of a dynamic-link library (DLL). For example, the present invention can be divided into five functional parts: voice start-point detection, speech recognition, speech template creation, initialization, and speech-recognition termination. Specifically, the speech template creation and initialization functions perform step one above, the voice start-point detection function performs step two, the speech recognition function performs step three, and finally the speech-recognition termination function releases the occupied system resources. The five functional parts correspond to five functions in the DLL; they are very flexible to use and occupy very little of the system's CPU, memory and other resources.
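One possible shape for such a five-function DLL interface is sketched below; the exported names and signatures are purely illustrative assumptions, since the patent does not specify them.

```c
/* speech_recog.h - hypothetical sketch of the five DLL entry points described
 * above: template creation, initialization, start-point detection, recognition
 * and termination. Names and signatures are illustrative only. */
#ifndef SPEECH_RECOG_H
#define SPEECH_RECOG_H

typedef struct SpeechRecogHandle SpeechRecogHandle;   /* opaque recognizer context */

/* build a speech template from an original A-law voice file (step one) */
int SR_MakeTemplate(const char *alaw_file, const char *template_file);

/* load the template and allocate the recognition context (step one) */
SpeechRecogHandle *SR_Init(const char *template_file);

/* detect the start and end points of an A-law speech stream (step two) */
int SR_DetectStart(SpeechRecogHandle *h,
                   const unsigned char *alaw, int n_samples,
                   int *start_sample, int *end_sample);

/* segment and recognize the detected speech, returning voice codes (step three) */
int SR_Recognize(SpeechRecogHandle *h,
                 const unsigned char *alaw, int n_samples,
                 unsigned *voice_codes, int max_codes);

/* release all system resources held by the recognizer */
void SR_End(SpeechRecogHandle *h);

#endif /* SPEECH_RECOG_H */
```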

Claims (4)

1. A general A-law format speech recognition method, characterized in that the method comprises the following steps:
A. generating, from an original A-law format voice file, a speech template containing the speech feature quantities of all voices, and then loading the speech template;
B. detecting the start and end points of a speech stream to be recognized, and extracting the speech feature quantities of the speech between the start and end points;
C. comparing the speech feature quantities of the speech stream to be recognized with the speech feature quantities of the speech template, and segmenting and recognizing the speech, thereby obtaining a recognition result;
the detection of the start and end points of the speech stream to be recognized in step B specifically comprising the following steps:
B1. determining the size of the speech data block and the speech energy threshold;
B2. determining the speech start point: if the energy of several consecutive frames of the speech stream to be recognized is greater than the speech energy threshold, taking the first frame whose energy exceeds the threshold as the candidate start point, and then taking the position a number of speech data block lengths before the candidate start point as the speech start point;
B3. determining the speech end point: if the energy of several consecutive frames of the speech stream to be recognized is less than the speech energy threshold, taking the first frame whose energy falls below the threshold as the candidate end point, and then taking the position a number of speech data block lengths before the candidate start point as the speech end point.
2. The general A-law format speech recognition method of claim 1, characterized in that the speech feature quantities in step A and step C refer to time-domain analysis feature quantities and frequency-domain analysis feature quantities.
3. The general A-law format speech recognition method of claim 1, characterized in that the speech segmentation and recognition in step C more specifically comprise the following steps: C1. segmenting the speech information; C2. analyzing the sentence composition; C3. analyzing the voice composition, and obtaining the number of voice segments and the corresponding codes.
4. The general A-law format speech recognition method of claim 1, characterized in that the speech feature quantities of the speech template in step A further comprise fast-matching speech feature quantities and precise-matching speech feature quantities.
CNB021287619A 2002-08-07 2002-08-07 General A-Law format voice identifying method Expired - Fee Related CN1213399C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021287619A CN1213399C (en) 2002-08-07 2002-08-07 General A-Law format voice identifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB021287619A CN1213399C (en) 2002-08-07 2002-08-07 General A-Law format voice identifying method

Publications (2)

Publication Number Publication Date
CN1474377A CN1474377A (en) 2004-02-11
CN1213399C true CN1213399C (en) 2005-08-03

Family

ID=34143814

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021287619A Expired - Fee Related CN1213399C (en) 2002-08-07 2002-08-07 General A-Law format voice identifying method

Country Status (1)

Country Link
CN (1) CN1213399C (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581158A (en) * 2012-08-10 2014-02-12 百度在线网络技术(北京)有限公司 Method and system for processing voice data
CN105609118B (en) * 2015-12-30 2020-02-07 生迪智慧科技有限公司 Voice detection method and device

Also Published As

Publication number Publication date
CN1474377A (en) 2004-02-11

Similar Documents

Publication Publication Date Title
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
WO2020238209A1 (en) Audio processing method, system and related device
CN111798833B (en) Voice test method, device, equipment and storage medium
CN109840052B (en) Audio processing method and device, electronic equipment and storage medium
CN109326305B (en) Method and system for batch testing of speech recognition and text synthesis
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN101876887A (en) Voice input method and device
CN110784591A (en) Intelligent voice automatic detection method, device and system
CN103050116A (en) Voice command identification method and system
CN112331188A (en) Voice data processing method, system and terminal equipment
CN1333501A (en) Dynamic Chinese speech synthesizing method
CN113782026A (en) Information processing method, device, medium and equipment
CN1213399C (en) General A-Law format voice identifying method
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN108920500B (en) Time analysis method
CN104882146A (en) Method and device for processing audio popularization information
CN109862408B (en) User voice recognition control method for intelligent television voice remote controller
CN1198260C (en) Phonetic recognizing system
CN101160380A (en) Class quantization for distributed speech recognition
EP1632932A1 (en) Voice response system, voice response method, voice server, voice file processing method, program and recording medium
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
CN114861640A (en) Text abstract model training method and device
CN112714058A (en) Method, system and electronic equipment for instantly interrupting AI voice
CN106101573A (en) The grappling of a kind of video labeling and matching process
CN111986706A (en) Voice response time testing method based on audio analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20050803

Termination date: 20130807