CN1213399C - General A-Law format voice identifying method - Google Patents

General A-Law format voice identifying method

Info

Publication number
CN1213399C
CN1213399C CNB021287619A CN02128761A
Authority
CN
China
Prior art keywords
voice
speech
starting point
feature amount
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB021287619A
Other languages
Chinese (zh)
Other versions
CN1474377A (en)
Inventor
冯敬涛
刘丹亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CNB021287619A priority Critical patent/CN1213399C/en
Publication of CN1474377A publication Critical patent/CN1474377A/en
Application granted granted Critical
Publication of CN1213399C publication Critical patent/CN1213399C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention provides a general A-law format speech recognition method comprising the following steps: a speech template containing the speech feature quantities of all voices is generated from an original A-law format voice file and then loaded; the start and end points of the speech stream to be recognized are detected, and the speech feature quantities of the speech between those points are extracted; the speech feature quantities of the speech stream to be recognized are compared with those of the speech template, and the speech is segmented and recognized to obtain the recognition result. The technical scheme of the present invention recognizes speech quickly and accurately, has a moderate cost, occupies few resources, and is flexible and convenient to use. It is generally applicable to price-sensitive switching and non-switching fields that have performance requirements and in which the voices are confined to a limited set.

Description

General A-Law format voice identifying method
Technical field
The invention belongs to the field of speech recognition, and specifically relates to a general A-law format speech recognition method.
Background technology
With the continuous development and enrichment of intelligent-network products, their voice-related services have become increasingly rich, varied and flexible. In intelligent-network product testing, a key technique for achieving test automation is the recognition of service voices. At present the most primitive approach, manual calling, is generally used: a tester listens with his own ears to judge whether the voices are correct. Because this test mode depends heavily on the completeness and adequacy of the whole test and on the tester, its efficiency is rather low.
To solve the above problem, the prior art generally adopts the traditional ASR (Automatic Speech Recognition) technique, which first converts the speech into text and then compares and recognizes the text. However, it is very expensive and is normally charged per time slot, which is all the harder to justify when the voices are confined to a limited set; in addition, its recognition speed is rather slow. In an intelligent-network service there are generally only 3-5 seconds after an announcement is played; if speech recognition and dialling cannot be completed within this period, the service times out and enters the timeout branch. During performance testing in particular, the voices of many time slots must be recognized simultaneously. Considering the shortcomings of the ASR technique together with the requirements of intelligent-network services, the ASR scheme is unsatisfactory for intelligent-network services in both price and performance, and the problem is even more prominent in switching and non-switching fields that are sensitive to price and performance.
Summary of the invention
In view of the above problems, the present invention proposes a general A-law format speech recognition method that recognizes speech quickly and accurately, has a moderate cost, and can be widely applied in switching and non-switching fields in which the voices are confined to a limited set.
To achieve the above object, the concrete steps of the general A-law format speech recognition method of the present invention are:
A. Generate, from the original A-law format voice file, a speech template containing the speech feature quantities of all voices, and then load the speech template;
B. Detect the start and end points of the speech stream to be recognized, and extract the speech feature quantities of the speech between the start and end points;
C. Compare the speech feature quantities of the speech stream to be recognized with the speech feature quantities of the speech template, and segment and recognize the speech, thereby obtaining the recognition result.
The detection of the start and end points of the speech stream to be recognized in step B more specifically comprises the following steps:
B1. Determine the size of the speech data block and the speech energy threshold;
B2. Determine the speech start point: if the energy of several consecutive frames of the speech stream to be recognized is greater than the speech energy threshold, take the first frame whose energy exceeds the threshold as the candidate start point, and then take the position a number of speech data block lengths before the candidate start point as the speech start point;
B3. Determine the speech end point: if the energy of several consecutive frames of the speech stream to be recognized is less than the speech energy threshold, take the first frame whose energy falls below the threshold as the candidate end point, and then take the position a number of speech data block lengths before the candidate start point as the speech end point.
The speech segmentation and recognition in step C more specifically comprise the following steps: C1. segment the speech information; C2. analyze the sentence composition; C3. analyze the voice composition, obtaining the number of voice segments and the corresponding codes.
To further reduce the cost of the present invention and increase its usability, the speech feature quantities of the speech template in step A further comprise fast-matching speech feature quantities and precise-matching speech feature quantities; the speech feature quantities in step A and step C refer to time-domain analysis feature quantities and frequency-domain analysis feature quantities.
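As an illustrative reading of these feature quantities (not part of the patent's disclosure), a speech-template entry could be organized roughly as follows; all field names and sizes are assumptions made for the sketch.

```c
/* Hypothetical layout of one speech-template entry, grouping the fast-matching
 * quantities (energy series and threshold) and the precise-matching quantities
 * (feature-vector series and threshold) together with the file identification
 * parameters named in the description. Names and dimensions are illustrative. */
#define MAX_FRAMES 512
#define FEAT_DIM   12

typedef struct {
    char     file_name[64];                        /* original A-law voice file name   */
    unsigned file_code;                            /* voice file code, e.g. 0x0000000a */
    int      avg_segment_len;                      /* average segment length           */
    int      start_frame;                          /* voice start frame position       */
    int      num_frames;                           /* frames actually stored           */
    double   energy_threshold;                     /* fast-matching threshold          */
    double   energy_series[MAX_FRAMES];            /* fast-matching energy series      */
    double   feature_threshold;                    /* precise-matching threshold       */
    double   feature_series[MAX_FRAMES][FEAT_DIM]; /* precise-matching feature vectors */
} SpeechTemplateEntry;
```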
The technical scheme of the present invention has the following advantages:
1. Fast and accurate speech recognition: the recognition accuracy can reach 100%, and the speed is very high; when multi-channel recognition is performed in a multi-threaded application, the same accuracy and speed are maintained.
2. Moderate cost: it is generally applicable to price-sensitive switching and non-switching fields that have performance requirements and in which the voices are confined to a limited set.
3. Low resource usage.
4. Flexible and convenient to use, with partial support for fuzzy matching. For example, a running system may occasionally pick up the sound too late, so that the beginning (header) of the voice is missing; the present invention can still recognize such a voice and can tolerate a rather long missing portion at the beginning. As another example, for the polyphone problem, the character '分' (fen) has two readings: the 'fen' of hour-minute-second (minute) and the 'fen' of yuan-jiao-fen (cent); they are recorded separately, and existing ASR technology can only recognize them as one word, whereas the present invention distinguishes them by their voice codes.
The present invention is described in detail below in conjunction with the drawings and a specific embodiment.
Description of drawings
Fig. 1 is a method flow diagram of the present invention.
Specific implementation
The announcements of wireline intelligent-network services are a typical representative of general A-law format voices, and they essentially cover the announcement situations of other products' services; the present invention is therefore explained below using wireline intelligent-network service voices as an example.
First, the basic situation of wireline intelligent-network service voices is introduced. By purpose, such voices can be divided into two sub-types: service-flow voices and basic voices. The former control the service flow; they can be used alone or together with the latter, and different services have different service-flow voices. The latter must be used in combination with service-flow voices, and their content does not change when the service changes; they mainly include '0', '1'-'9', 'ten', 'hundred', 'thousand', 'ten thousand', 'hundred million', 'yuan', 'jiao', 'fen', 'year', 'month', 'day', 'hour', 'minute', 'second', etc. These sub-voices, separated by very short intervals, are combined to form statements. The original A-law format voice file may consist of a single statement or of several statements separated by long intervals. The present embodiment compares the voice to be recognized with such an original A-law format voice file; the concrete steps are as follows:
Step one: generate, from the original A-law format voice file, a speech template containing the speech feature quantities of all voices, and then load the speech template.
The service platform sends a message to the switch; the switch extracts the corresponding original A-law format voice file from the voice resource and sends it into a trunk voice-channel time slot. From this original A-law format voice file the service platform generates a speech template containing the speech feature quantities of all voices, i.e. it extracts the following main parameters: the average segment length, the voice start frame position, the energy threshold and energy parameter series used for fast-matching recognition, the feature-vector threshold and feature-vector series used for precise-matching recognition, and the original voice file name and voice file code. A speech template covering all service-flow voices and basic voices is thus generated and then loaded and initialized.
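A minimal sketch of how the per-frame quantities for such a template could be computed from an A-law file is given below, assuming standard G.711 A-law expansion and the frame-energy definition used in step two; the function names and parameters are illustrative assumptions, not the patent's interface.

```c
/* Standard G.711 A-law expansion to a linear sample. The sign convention is
 * irrelevant for the frame energy, which only uses |s(i)|. */
int alaw_to_linear(unsigned char a)
{
    a ^= 0x55;
    int mantissa = (a & 0x0F) << 4;
    int segment  = (a & 0x70) >> 4;
    int value    = (segment == 0) ? mantissa + 8
                                  : (mantissa + 0x108) << (segment - 1);
    return (a & 0x80) ? value : -value;
}

/* Fill ene[] with per-frame energies ENE = sum(|s(i)|)/L over an A-law buffer;
 * frame_len is the number of samples per frame (200 at 8 kHz for 25 ms frames).
 * Returns the number of complete frames computed. */
int frame_energies(const unsigned char *alaw, int n_samples,
                   int frame_len, double *ene, int max_frames)
{
    int n_frames = n_samples / frame_len;
    if (n_frames > max_frames) n_frames = max_frames;
    for (int f = 0; f < n_frames; f++) {
        double sum = 0.0;
        for (int i = 0; i < frame_len; i++) {
            int s = alaw_to_linear(alaw[f * frame_len + i]);
            sum += (s < 0) ? -s : s;
        }
        ene[f] = sum / frame_len;
    }
    return n_frames;
}
```

The same routines would also be run on the speech stream to be recognized in step two, so that the template and the input are compared in the same feature space.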
Step two: detect the start and end points of the speech stream to be recognized, and extract the speech feature quantities of the speech between them. The detection of the start and end points more specifically comprises the following steps (a code sketch follows this step):
a. Determine the size of the speech data block and the speech energy threshold. Assuming that the sampling rate of the speech to be recognized is 8 kHz, the frame period is 25 ms and the frame length is 25 ms, the size of the input speech data block is first set to 80 ms and the speech energy threshold to 30.
b. Determine the speech start point: if the energy of several consecutive frames of the speech stream to be recognized is greater than the speech energy threshold, take the first frame whose energy exceeds the threshold as the candidate start point, and then take the position a number of speech data block lengths before the candidate start point as the speech start point. The speech segment is judged according to the frame energy ENE = (Σ_{i=0}^{L-1} |s(i)|) / L, where s(i) is the speech signal and L is the frame length in samples. For example, if 3 consecutive frames of the speech stream to be recognized, i.e. a period of 3 × 25 = 75 ms, have energy greater than the speech energy threshold 30, the first frame whose energy exceeds the threshold is taken as the candidate start point, and the position 3 speech data block lengths, i.e. 3 × 80 = 240 ms, before the candidate start point is taken as the speech start point.
c. Determine the speech end point: if the energy of several consecutive frames of the speech stream to be recognized is less than the speech energy threshold, take the first frame whose energy falls below the threshold as the candidate end point, and then take the position a number of speech data block lengths before the candidate start point as the speech end point. The speech segment is again judged according to the frame energy ENE = (Σ_{i=0}^{L-1} |s(i)|) / L, where s(i) is the speech signal. For example, if 40 consecutive frames of the speech stream to be recognized, i.e. a period of 40 × 25 = 1000 ms, have energy below the speech energy threshold 30, the first frame whose energy falls below the threshold is taken as the candidate end point, and the position 2 speech data block lengths, i.e. 2 × 80 = 160 ms, before the candidate start point is taken as the speech end point.
Through the above steps it can be accurately detected whether a speech stream to be recognized exists and whether it has ended; with suitably configured parameters, the interval between short sentences will not be mistaken for the end of the speech.
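Under the example values above (25 ms frames, 80 ms data blocks, energy threshold 30, 3 consecutive loud frames to accept a start, 40 consecutive quiet frames to accept an end), the start/end rule could be sketched as follows; the names and the exact back-off handling are assumptions for illustration, not the patent's precise procedure.

```c
#define FRAME_MS      25
#define BLOCK_MS      80
#define ENERGY_THRESH 30.0
#define START_RUN     3     /* consecutive frames above threshold => start */
#define END_RUN       40    /* consecutive frames below threshold => end   */

/* Hypothetical endpoint detector over the per-frame energies produced by
 * frame_energies() above. Returns 1 and fills *start_frame / *end_frame when
 * a speech segment is found, 0 when no speech is present. */
int detect_endpoints(const double *ene, int n_frames,
                     int *start_frame, int *end_frame)
{
    int run = 0, candidate = -1;
    int start_backoff = (3 * BLOCK_MS) / FRAME_MS;   /* ~240 ms in frames */
    int end_backoff   = (2 * BLOCK_MS) / FRAME_MS;   /* ~160 ms in frames */

    /* start point: first run of START_RUN frames above the threshold */
    for (int f = 0; f < n_frames; f++) {
        if (ene[f] > ENERGY_THRESH) {
            if (candidate < 0) candidate = f;        /* first loud frame */
            if (++run >= START_RUN) break;
        } else {
            run = 0;
            candidate = -1;
        }
    }
    if (run < START_RUN) return 0;                   /* no speech found */
    *start_frame = (candidate > start_backoff) ? candidate - start_backoff : 0;

    /* end point: first run of END_RUN frames below the threshold */
    run = 0;
    int end_candidate = n_frames - 1;                /* fall back to stream end */
    for (int f = candidate; f < n_frames; f++) {
        if (ene[f] < ENERGY_THRESH) {
            if (run == 0) end_candidate = f;         /* first quiet frame of run */
            if (++run >= END_RUN) break;
        } else {
            run = 0;
        }
    }
    *end_frame = (end_candidate > end_backoff) ? end_candidate - end_backoff
                                               : end_candidate;
    return 1;
}
```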
Step three: compare the speech feature quantities of the speech stream to be recognized with those of the speech template, and segment and recognize the speech, thereby obtaining the recognition result. Suppose the speech stream to be recognized is: "Your balance is 10 yuan 5 jiao. To make a call, press 1; to query the balance, press 2." The speech is first cut at the detected start and end points, and the statements are analyzed as "Your balance is 10 yuan 5 jiao", "To make a call, press 1" and "To query the balance, press 2". The voices contained in these statements are "Your balance is", "10", "yuan", "5", "jiao", "To make a call, press 1" and "To query the balance, press 2", yielding 7 voice segments whose codes are respectively: 06800018, 00000001, 0000000a, 00000031, 00000045, 00000009, 0680000d... Each is then compared with the speech feature quantities of the speech template to find the closest one, giving the individual voices "Your balance is", "one" and "ten", "yuan", "five", "jiao", "To make a call, press 1" and "To query the balance, press 2", and thereby the recognition result.
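As a rough illustration of the comparison step (not the patent's actual matching algorithm), each cut segment could first be screened against the fast-matching energy series of every template entry and the closest entry accepted; the names below carry over from the earlier sketches and remain assumptions.

```c
#include <math.h>

/* Hypothetical nearest-template matcher: compare a segment's per-frame energy
 * series against each entry's fast-matching series using a mean absolute
 * difference, and return the index of the closest acceptable entry (whose
 * file_code then identifies the recognized voice), or -1 if none matches. */
int match_segment(const double *seg_ene, int seg_frames,
                  const SpeechTemplateEntry *templates, int n_templates)
{
    int best = -1;
    double best_score = 1e300;

    for (int t = 0; t < n_templates; t++) {
        int n = (seg_frames < templates[t].num_frames)
                    ? seg_frames : templates[t].num_frames;
        if (n == 0) continue;
        double score = 0.0;
        for (int f = 0; f < n; f++)
            score += fabs(seg_ene[f] - templates[t].energy_series[f]);
        score /= n;
        /* accept only entries within their fast-matching threshold */
        if (score < templates[t].energy_threshold && score < best_score) {
            best_score = score;
            best = t;
        }
    }
    return best;
}
```

In a full implementation the surviving candidates would then be re-scored with the precise-matching feature-vector series before the final voice code is reported.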
To further increase the usability of the present invention, it can be implemented in the form of a dynamic-link library (DLL). For example, the present invention can be divided into five functional parts: voice start-point detection, speech recognition, speech template creation, initialization, and speech-recognition termination. Specifically, the speech template creation and initialization functions perform step one above, the voice start-point detection function performs step two, the speech recognition function performs step three, and finally the speech-recognition termination function releases the occupied system resources. The five functional parts correspond to five functions in the DLL; they are very flexible to use and occupy very little of the system's CPU, memory and other resources.
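One possible shape for such a five-function DLL interface is sketched below; the exported names and signatures are purely illustrative assumptions, since the patent does not specify them.

```c
/* speech_recog.h - hypothetical sketch of the five DLL entry points described
 * above: template creation, initialization, start-point detection, recognition
 * and termination. Names and signatures are illustrative only. */
#ifndef SPEECH_RECOG_H
#define SPEECH_RECOG_H

typedef struct SpeechRecogHandle SpeechRecogHandle;   /* opaque recognizer context */

/* build a speech template from an original A-law voice file (step one) */
int SR_MakeTemplate(const char *alaw_file, const char *template_file);

/* load the template and allocate the recognition context (step one) */
SpeechRecogHandle *SR_Init(const char *template_file);

/* detect the start and end points of an A-law speech stream (step two) */
int SR_DetectStart(SpeechRecogHandle *h,
                   const unsigned char *alaw, int n_samples,
                   int *start_sample, int *end_sample);

/* segment and recognize the detected speech, returning voice codes (step three) */
int SR_Recognize(SpeechRecogHandle *h,
                 const unsigned char *alaw, int n_samples,
                 unsigned *voice_codes, int max_codes);

/* release all system resources held by the recognizer */
void SR_End(SpeechRecogHandle *h);

#endif /* SPEECH_RECOG_H */
```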

Claims (4)

1. A general A-law format speech recognition method, characterized in that the method comprises the following steps:
A. generating, from an original A-law format voice file, a speech template containing the speech feature quantities of all voices, and then loading the speech template;
B. detecting the start and end points of a speech stream to be recognized, and extracting the speech feature quantities of the speech between the start and end points;
C. comparing the speech feature quantities of the speech stream to be recognized with the speech feature quantities of the speech template, and segmenting and recognizing the speech, thereby obtaining a recognition result;
the detection of the start and end points of the speech stream to be recognized in step B specifically comprising the following steps:
B1. determining the size of the speech data block and the speech energy threshold;
B2. determining the speech start point: if the energy of several consecutive frames of the speech stream to be recognized is greater than the speech energy threshold, taking the first frame whose energy exceeds the threshold as the candidate start point, and then taking the position a number of speech data block lengths before the candidate start point as the speech start point;
B3. determining the speech end point: if the energy of several consecutive frames of the speech stream to be recognized is less than the speech energy threshold, taking the first frame whose energy falls below the threshold as the candidate end point, and then taking the position a number of speech data block lengths before the candidate start point as the speech end point.
2. The general A-law format speech recognition method of claim 1, characterized in that the speech feature quantities in step A and step C refer to time-domain analysis feature quantities and frequency-domain analysis feature quantities.
3. The general A-law format speech recognition method of claim 1, characterized in that the speech segmentation and recognition in step C more specifically comprise the following steps: C1. segmenting the speech information; C2. analyzing the sentence composition; C3. analyzing the voice composition, and obtaining the number of voice segments and the corresponding codes.
4. The general A-law format speech recognition method of claim 1, characterized in that the speech feature quantities of the speech template in step A further comprise fast-matching speech feature quantities and precise-matching speech feature quantities.
CNB021287619A 2002-08-07 2002-08-07 General A-Law format voice identifying method Expired - Fee Related CN1213399C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021287619A CN1213399C (en) 2002-08-07 2002-08-07 General A-Law format voice identifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB021287619A CN1213399C (en) 2002-08-07 2002-08-07 General A-Law format voice identifying method

Publications (2)

Publication Number Publication Date
CN1474377A CN1474377A (en) 2004-02-11
CN1213399C true CN1213399C (en) 2005-08-03

Family

ID=34143814

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021287619A Expired - Fee Related CN1213399C (en) 2002-08-07 2002-08-07 General A-Law format voice identifying method

Country Status (1)

Country Link
CN (1) CN1213399C (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581158A (en) * 2012-08-10 2014-02-12 百度在线网络技术(北京)有限公司 Method and system for processing voice data
CN105609118B (en) * 2015-12-30 2020-02-07 生迪智慧科技有限公司 Voice detection method and device

Also Published As

Publication number Publication date
CN1474377A (en) 2004-02-11

Similar Documents

Publication Publication Date Title
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
WO2020238209A1 (en) Audio processing method, system and related device
CN111798833B (en) Voice test method, device, equipment and storage medium
CN109840052B (en) Audio processing method and device, electronic equipment and storage medium
CN109326305B (en) Method and system for batch testing of speech recognition and text synthesis
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN101876887A (en) Voice input method and device
CN110784591A (en) Intelligent voice automatic detection method, device and system
CN103050116A (en) Voice command identification method and system
CN112331188A (en) Voice data processing method, system and terminal equipment
CN1333501A (en) Dynamic Chinese speech synthesizing method
CN113782026A (en) Information processing method, device, medium and equipment
CN1213399C (en) General A-Law format voice identifying method
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN108920500B (en) Time analysis method
CN104882146A (en) Method and device for processing audio popularization information
CN109862408B (en) User voice recognition control method for intelligent television voice remote controller
CN1198260C (en) Phonetic recognizing system
CN101160380A (en) Class quantization for distributed speech recognition
EP1632932A1 (en) Voice response system, voice response method, voice server, voice file processing method, program and recording medium
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
CN114861640A (en) Text abstract model training method and device
CN112714058A (en) Method, system and electronic equipment for instantly interrupting AI voice
CN106101573A (en) The grappling of a kind of video labeling and matching process
CN111986706A (en) Voice response time testing method based on audio analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20050803

Termination date: 20130807