CN109192192A - Language identification method, apparatus, translation machine, medium and device - Google Patents

Language identification method, apparatus, translation machine, medium and device Download PDF

Info

Publication number
CN109192192A
CN109192192A CN201810908924.XA CN201810908924A CN 109192192 A
Authority
CN
China
Prior art keywords
speech feature
feature sequence
time-domain signal
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810908924.XA
Other languages
Chinese (zh)
Inventor
李宝祥
吕安超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201810908924.XA priority Critical patent/CN109192192A/en
Publication of CN109192192A publication Critical patent/CN109192192A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of speech technology, and in particular to a language identification method, apparatus, translation machine, medium and device. The method comprises: determining the speech feature sequence corresponding to each frame of a collected speech time-domain signal, and pooling the per-frame speech feature sequences to determine a speech feature sequence corresponding to the collected signal as a whole. In this way, no matter how long the collected speech time-domain signal is, a feature sequence of fixed length is obtained, which can serve as the input of a pre-trained language identification model for language identification. Through this specific feature extraction scheme, together with a correspondingly trained language identification model, the accuracy of language identification is effectively improved.

Description

Language identification method, apparatus, translation machine, medium and device
Technical field
The present invention relates to the field of speech technology, and in particular to a language identification method, apparatus, translation machine, medium and device.
Background art
Language identification is the process of determining the language category (language) to which a segment of speech signal belongs. It is mainly used at the front end of multilingual speech signal processing systems to classify speech automatically before passing it to the subsystem of the corresponding language for subsequent processing; applications include portable translation machines and multilingual speech recognition systems. To achieve more intelligent interaction, judging the language category by means of language identification technology is necessary.
The mainstream language identification models currently include the Gaussian mixture model-universal background model (GMM-UBM), the Gaussian supervector-support vector machine model (GSV-SVM) and deep neural network (DNN) models. However, the accuracy of language identification with these models still needs to be improved.
Summary of the invention
Embodiments of the present invention provide a language identification method, apparatus, translation machine, medium and device, for solving the problem of low accuracy in language identification.
The present invention provides a language identification method, the method comprising:
collecting a speech time-domain signal, and determining the speech feature sequence corresponding to each speech frame in the collected speech time-domain signal;
pooling the speech feature sequences to obtain a speech feature sequence corresponding to the collected speech time-domain signal;
taking the speech feature sequence corresponding to the collected signal as input, and using a pre-trained language identification model to determine the language of the collected speech time-domain signal.
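The three method steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the patent's implementation: `identify_language` and `classify` are hypothetical names, and the classifier is a stand-in for the pre-trained language identification model.

```python
import numpy as np

def identify_language(frame_features, classify):
    """Sketch of the claimed method: frame_features is a (num_frames, dim)
    array holding one speech feature sequence per frame; pooling over the
    frame axis yields a single fixed-length vector regardless of how many
    frames the collected time-domain signal was divided into."""
    pooled = frame_features.max(axis=0)  # max pooling, the preferred variant
    return classify(pooled)              # stand-in for the pre-trained model

# A 5-frame and a 12-frame utterance both pool down to the same dimension,
# so one fixed-input classifier can serve signals of arbitrary length.
clf = lambda v: "zh" if v.sum() > 0 else "en"
short = np.ones((5, 3))
long = np.ones((12, 3))
assert identify_language(short, clf) == identify_language(long, clf)
```

The key design point is that pooling collapses the variable time axis, which is what lets a fixed-input model accept utterances of any duration.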
In one possible implementation, pooling the speech feature sequences to obtain the speech feature sequence corresponding to the collected signal comprises: selecting, from the speech feature sequences, a subset of sequences in each of which the number of feature values not greater than zero does not exceed a set value; and pooling the selected sequences to obtain the speech feature sequence corresponding to the collected speech time-domain signal.
In one possible implementation, the pooling is max pooling.
Further, after determining the speech feature sequence corresponding to each speech frame in the collected speech time-domain signal, the method further comprises: performing speech recognition on the collected signal according to the speech feature sequences, to obtain the text corresponding to the collected speech time-domain signal.
Further, performing speech recognition on the collected signal according to the speech feature sequences comprises: performing speech recognition on the collected speech time-domain signal according to the speech feature sequences and the determined language.
In one possible implementation, the language identification model is trained as follows:
selecting training samples corresponding to the languages included in the selected language environment;
and, for each training sample, performing the following operations:
determining the speech feature sequence corresponding to each speech frame in the training sample;
pooling the determined speech feature sequences to obtain the speech feature sequence corresponding to the training sample;
taking the speech feature sequence corresponding to the training sample as input, and training the corresponding language identification model.
The present invention also provides a language identification apparatus, the apparatus comprising:
a collection module, configured to collect a speech time-domain signal;
a feature determination module, configured to determine the speech feature sequence corresponding to each speech frame in the collected speech time-domain signal;
a pooling module, configured to pool the speech feature sequences to obtain a speech feature sequence corresponding to the collected speech time-domain signal;
a language identification module, configured to take the speech feature sequence corresponding to the collected signal as input and, using a pre-trained language identification model, determine the language of the collected speech time-domain signal.
In one possible implementation, the pooling module is specifically configured to select, from the speech feature sequences, a subset of sequences in each of which the number of feature values not greater than zero does not exceed a set value, and to pool the selected sequences to obtain the speech feature sequence corresponding to the collected speech time-domain signal.
In one possible implementation, the pooling module is specifically configured to perform max pooling on the speech feature sequences to obtain the speech feature sequence corresponding to the collected speech time-domain signal.
In one possible implementation, the language identification model is trained as follows:
selecting training samples corresponding to the languages included in the selected language environment;
and, for each training sample, performing the following operations:
determining the speech feature sequence corresponding to each speech frame in the training sample;
pooling the determined speech feature sequences to obtain the speech feature sequence corresponding to the training sample;
taking the speech feature sequence corresponding to the training sample as input, and training the corresponding language identification model.
The present invention also provides a translation machine comprising the apparatus described above.
The present invention also provides a non-volatile computer storage medium storing an executable program which, when executed by a processor, implements the steps of the method described above.
The present invention also provides a language identification device comprising a memory, a processor, and a computer program stored on the memory; the processor, when executing the program, implements the steps of the method described above.
According to the scheme provided by the embodiments of the present invention, the speech feature sequence corresponding to each frame of the collected speech time-domain signal can be determined, and the per-frame sequences pooled to determine a speech feature sequence for the collected signal as a whole. No matter how long the collected signal is, a fixed-length feature sequence is thus obtained, which can serve as the input of a pre-trained language identification model for language identification. Through this specific feature extraction scheme and the correspondingly trained language identification model, the accuracy of language identification is effectively improved.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flow diagram of the language identification method provided by Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of the language identification method provided by Embodiment 2 of the present invention;
Fig. 3 is a flow diagram of the language identification method provided by Embodiment 3 of the present invention;
Fig. 4 is a structural diagram of the language identification apparatus provided by Embodiment 4 of the present invention;
Fig. 5 is a structural diagram of the language identification device provided by Embodiment 6 of the present invention.
Detailed description of the embodiments
In the scheme of the present invention, language identification is performed using a pre-trained language identification model. To satisfy the model's requirement on the number of inputs while allowing language identification on speech time-domain signals of arbitrary length, the speech feature sequence corresponding to each frame of the collected signal is determined, and the per-frame sequences are pooled to obtain a speech feature sequence for the collected signal as a whole. Thus, regardless of the duration of the collected signal (that is, regardless of how many frames it is divided into), a fixed-length feature sequence is obtained that meets the input requirement of the language identification model.
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
Embodiment 1
Embodiment 1 of the present invention provides a language identification method; the flow of this method is shown in Fig. 1 and comprises:
Step 001: collect a speech time-domain signal.
In this step, the speech time-domain signal on which language identification is to be performed is collected by a microphone (MIC).
In implementation, the speech time-domain signal may be collected in real time by the MIC, or collection may be triggered by the user, for example by a physical button or a virtual key.
Step 002: determine the speech feature sequence corresponding to each speech frame. The speech feature sequence characterizes the essential features of the speech signal.
In one possible implementation, the collected speech time-domain signal is first divided into frames, and each resulting speech frame is processed as follows:
performing frequency-domain conversion on the speech frame to determine the corresponding speech frequency-domain signal;
extracting, from the determined frequency-domain signal, a filter-bank (fbank) feature sequence of a first specified dimension;
performing a predetermined number of difference operations on the fbank feature sequence of the first specified dimension to determine a speech feature sequence of a second specified dimension;
applying a nonlinear transformation to the speech feature sequence of the second specified dimension using a deep learning model, to determine the corresponding speech feature sequence of a third specified dimension.
The third specified dimension is determined according to the number of inputs required by the language identification model.
Preferably, in this embodiment, the deep learning model may include a convolutional neural network (CNN) model and a long short-term memory network (LSTM) model, so that the speech features are enhanced by the CNN and LSTM models, further ensuring the accuracy of language identification. The CNN model and the LSTM model may be applied in sequence to perform the nonlinear transformation. Of course, in this embodiment the deep learning model is not limited to CNN and LSTM models.
Step 003: pool the determined speech feature sequences.
In this step, the obtained speech feature sequences may be pooled; preferably, the pooling is max pooling (max-pooling), yielding the speech feature sequence corresponding to the collected speech time-domain signal.
In one possible embodiment, when extracting features, the feature value with the largest numerical value at each position across the pooled speech feature sequences is taken, forming a new speech feature sequence. Because feature values are extracted by max pooling, the new sequence better reflects the characteristics of the speech time-domain signal, improving the accuracy of language identification.
In implementation, pooling may be based on all the speech feature sequences determined in step 002, or on only part of them.
Specifically, a subset of speech feature sequences may be selected from those obtained in step 002, and the selected sequences pooled to obtain the speech feature sequence corresponding to the collected signal. Since only part of the sequences are pooled, the amount of computation is reduced and processing efficiency improves.
In implementation, the sequences may be selected in any manner, for example by randomly selecting a set number of sequences, selecting sequences at a set interval, or selecting the sequences of the first N speech frames.
Preferably, from all the speech feature sequences, those whose number of feature values not greater than zero does not exceed a set value are selected. The more feature values not greater than zero a sequence contains, the fewer effective feature values it has, which affects recognition accuracy; by screening out such sequences, the accuracy of language identification can be further improved.
In implementation, the selected speech feature sequences may correspond to consecutive speech frames or to non-consecutive speech frames, as needed.
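The preferred selection rule, keeping only sequences in which the count of feature values not greater than zero stays within the set value, can be sketched as follows. The function and variable names are illustrative, not from the patent:

```python
import numpy as np

def select_sequences(frame_features, max_nonpositive):
    """Keep only the per-frame speech feature sequences whose count of
    feature values not greater than zero does not exceed the set value,
    screening out sequences with few effective feature values before
    pooling (a sketch of the patent's preferred selection rule)."""
    return np.array([f for f in frame_features
                     if np.count_nonzero(f <= 0) <= max_nonpositive])

frames = np.array([[1.0, 2.0, 3.0],
                   [0.0, -1.0, 0.5],   # two non-positive values: screened out
                   [4.0, 0.2, 0.1]])
selected = select_sequences(frames, max_nonpositive=1)
assert selected.shape == (2, 3)        # only rows 0 and 2 survive
```

Only the surviving sequences are then passed to the pooling step, which is where the claimed reduction in computation comes from.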
Step 004: determine the language.
In this step, the pooled speech feature sequence (i.e. the speech feature sequence corresponding to the collected speech time-domain signal) may be taken as input to the pre-trained language identification model, which may use a fully-connected network (FC), to determine the language of the collected speech time-domain signal.
In this embodiment, as one possible implementation, the language identification model may be trained as follows:
selecting training samples corresponding to the languages included in the selected language environment;
and, for each training sample, performing the following operations:
determining the speech feature sequence corresponding to each speech frame in the training sample;
pooling the determined speech feature sequences to obtain the speech feature sequence corresponding to the training sample;
taking the speech feature sequence corresponding to the training sample as input, and training the corresponding language identification model.
It should be noted that, preferably, in this embodiment language identification may share a network with speech recognition, so as to save computing resources. Specifically, after the speech feature sequence corresponding to each speech frame in the collected speech time-domain signal is determined in step 002, speech recognition is further performed on the collected signal according to those sequences, to obtain the corresponding text.
In this embodiment, the speech recognition process may run in parallel with the language identification process. It should be understood that, if the speech time-domain signal to be recognized is long, language identification may be performed only on a shorter portion of it, for example only the first N (e.g. 10) speech frames or only the first M seconds. While language identification proceeds, speech recognition may be performed using the speech feature sequences determined in step 002, so as to minimize the latency of speech recognition. Of course, after the language has been identified, speech recognition on the remainder of the signal may further be combined with the identified language, to obtain the corresponding text.
It should also be noted that the speech recognition process may instead run serially with the language identification process: after the language is identified, speech recognition is performed using the speech feature sequences determined in step 002 in combination with the identified language. This reduces the number of decoders that speech recognition must use and saves computing resources.
It should further be noted that, when the two processes run serially and the speech time-domain signal to be recognized is long, language identification may likewise be performed only on a shorter portion. When performing speech recognition, since the speech feature sequences of the portion used for language identification have already been determined, speech recognition may, but need not, first be performed on those sequences and then continue with the remainder of the signal, so as to minimize the latency of the speech recognition process.
Since running the two processes serially introduces some latency into speech recognition, the duration of the shorter portion used for language identification may be set according to user needs, to meet different users' latency requirements.
If speech recognition is performed after the language is identified, this can be understood as follows: after step 004, speech recognition is performed on the collected speech time-domain signal according to the speech feature sequences determined in step 002 and the language determined in step 004.
The scheme provided by Embodiment 1 of the present invention is illustrated below with two specific examples.
Embodiment 2
Embodiment 2 of the present invention provides a language identification method, illustrated with a language identification model that uses a fully-connected (FC) network. The flow of this method is shown in Fig. 2 and comprises:
Step 101: determine the language environment.
In this step, the selected language environment is determined; it includes at least two languages. For example, the language environment may include Chinese and English, or Chinese, English and Japanese. That is, the range of languages that subsequently need to be identified can be determined from the language environment, and thereby the corresponding language identification model.
Specifically, the language environment may be determined according to the user's selection on an interactive interface. For example, several mutual-translation modes may be offered on the interface, such as Chinese-English, Chinese-Korean and Chinese-Japanese, and the user selects the mode needed. The language identification model corresponding to the determined language environment is then selected; for example, if the user selects Chinese-English translation, the model covering Chinese and English identification can be chosen. Since different models have different training capacities, if a model can only identify two languages, the language environment must be determined first and language identification performed afterwards; if the model can identify multiple languages, such as Chinese, English, Korean and Japanese, there is no need to determine the language environment first, and this step can be skipped, proceeding directly to step 102.
Step 102: collect a speech time-domain signal.
In this step, the speech time-domain signal on which language identification is to be performed is collected, for example by a speech collection device such as a microphone.
It should be noted that the collected speech time-domain signal may be a single sentence or even half a sentence. That is, the scheme provided by this embodiment can identify the language of short audio, and the identification accuracy can be effectively improved compared with the prior art.
Step 103: determine the speech feature sequence corresponding to each speech frame.
In this step, a speech feature sequence is determined for each frame of the collected speech time-domain signal.
As one possible implementation, the collected speech time-domain signal is first divided into frames, and for each speech frame:
the corresponding speech frequency-domain signal is determined, and a filter-bank (fbank) feature sequence of a first specified dimension is extracted from it;
a predetermined number of difference operations is performed on the fbank feature sequence of the first specified dimension to determine a speech feature sequence of a second specified dimension;
a nonlinear transformation is applied to the speech feature sequence of the second specified dimension using a deep learning model, to determine the corresponding speech feature sequence of a third specified dimension.
The third specified dimension is determined according to the number of inputs required by the fully-connected network. Assuming the network requires 512 inputs, the third specified dimension is 512. Preferably, with a third specified dimension of 512, the first specified dimension may be, but is not limited to, 80. Two difference operations may then be performed on the 80-dimensional fbank feature sequence to obtain a 240-dimensional speech feature sequence (the second specified dimension), which the deep learning model then nonlinearly transforms into a 512-dimensional speech feature sequence.
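One plausible reading of the two difference operations, consistent with the 80-to-240 arithmetic above, is to append first- and second-order frame differences (delta and delta-delta features) to the base fbank sequence. The patent does not spell out the exact operation, so the sketch below is an assumption:

```python
import numpy as np

def add_deltas(fbank):
    """Append first- and second-order frame differences to a (T, 80) fbank
    sequence, tripling the dimension to (T, 240) as in the embodiment's
    arithmetic. This interpretation of the 'two difference operations' is
    an assumption, not confirmed by the patent text."""
    d1 = np.diff(fbank, axis=0, prepend=fbank[:1])  # first-order difference
    d2 = np.diff(d1, axis=0, prepend=d1[:1])        # second-order difference
    return np.concatenate([fbank, d1, d2], axis=1)

feats = add_deltas(np.zeros((100, 80)))
assert feats.shape == (100, 240)   # the second specified dimension
```

The 240-dimensional output would then be fed to the deep learning model for the nonlinear transformation up to 512 dimensions.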
Because the feature selection above is better optimized, determining the speech feature sequence of each speech frame in this manner allows language identification to be performed with a fully-connected network, further improving the accuracy of subsequent language identification.
Preferably, in this embodiment, the deep learning model may include a CNN model and an LSTM model, so that the speech features are enhanced by the CNN and LSTM models, further ensuring the accuracy of language identification. The CNN and LSTM models may be applied in sequence to perform the nonlinear transformation. Of course, in this embodiment the deep learning model is not limited to CNN and LSTM models.
If the CNN and LSTM models are applied in sequence for the nonlinear transformation, the speech feature sequence corresponding to each speech frame can be output from one of the hidden layers of the LSTM model.
The CNN and LSTM models may be obtained by training with existing speech recognition modeling methods, for example a syllable-based method, a method based on Chinese and English, or a method based on other, more complex deep stacked hybrid network structures. By sharing the network with speech recognition modeling, the occupation of computing resources can be reduced.
Step 104: pool the determined speech feature sequences.
In this step, the speech feature sequences corresponding to the speech frames are pooled; in this embodiment, max pooling (max-pooling) is assumed, determining the speech feature sequence corresponding to the collected speech time-domain signal.
The length of the pooled speech feature sequence is the same as that of each per-frame speech feature sequence.
For example, suppose the collected speech time-domain signal is divided into 5 frames and a 3-dimensional speech feature sequence is determined for each frame, giving 5 groups of 3-dimensional sequences. Max pooling these 5 groups takes, at each position, the largest feature value across the 5 sequences as the feature value of the collected signal at that position, thereby determining a single 3-dimensional speech feature sequence for the collected signal.
Further, suppose the speech feature sequences of the 5 frames are {2, 4, 6}, {1, 2, 3}, {2, 3, 4}, {2, 5, 7} and {3, 5, 2}; after max pooling, the speech feature sequence corresponding to the collected signal is {3, 5, 7}.
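The worked example above can be reproduced directly with a position-wise maximum over the frame axis:

```python
import numpy as np

# The 5 per-frame speech feature sequences from the embodiment's example.
frames = np.array([[2, 4, 6],
                   [1, 2, 3],
                   [2, 3, 4],
                   [2, 5, 7],
                   [3, 5, 2]])

# Max pooling: keep the largest value at each position across all frames.
pooled = frames.max(axis=0)
assert pooled.tolist() == [3, 5, 7]   # matches the result in the text
```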
Step 105: determine the language.
In this step, the phonetic feature sequence obtained through pooling can be taken as input to a fully-connected network trained in advance for the given language environment, to determine the language corresponding to the collected voice time-domain signal.
That is, if the language environment determined in step 101 comprises Chinese and English, then in this step a fully-connected network trained in advance for a language environment comprising Chinese and English can be used to determine whether the language corresponding to the collected voice time-domain signal is Chinese or English.
Preferably, to guarantee the accuracy of language identification while also ensuring its speed, the fully-connected network can be a three-layer fully-connected network.
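A three-layer fully-connected classifier over the pooled features can be sketched as follows. The layer sizes, random initialization and NumPy implementation are illustrative assumptions; the patent fixes only the three-layer fully-connected structure:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ThreeLayerFC:
    """Three-layer fully-connected language classifier (untrained sketch)."""

    def __init__(self, feat_dim, hidden_dim, num_languages, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
        self.w2 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        self.w3 = rng.standard_normal((hidden_dim, num_languages)) * 0.1

    def predict(self, pooled_features):
        h1 = relu(pooled_features @ self.w1)
        h2 = relu(h1 @ self.w2)
        probs = softmax(h2 @ self.w3)
        return int(np.argmax(probs))  # index of the predicted language

# Example: a pooled 3-dimensional feature sequence, two candidate languages
# (e.g. 0 = Chinese, 1 = English in the Chinese-English language environment).
net = ThreeLayerFC(feat_dim=3, hidden_dim=8, num_languages=2)
language = net.predict(np.array([3.0, 5.0, 7.0]))
```

In a real system the weights would of course come from the training procedure described below, not from random initialization.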
In the present embodiment, when speech recognition is performed on the basis of the language-identification result, the phonetic feature sequence corresponding to each speech frame in the collected voice time-domain signal can be passed through a fully-connected output layer to obtain a syllable probability distribution; in combination with the language-identification result, the output of the fully-connected output layer is then fed into the corresponding decoder, e.g. a Chinese decoder or an English decoder, to perform speech recognition and obtain the corresponding text. Further, after speech recognition, processing such as speech translation can be applied to the obtained text.
When the fully-connected network is trained, the training samples can be selected according to the language environment. For example, if the targeted language environment comprises Chinese and Japanese, corresponding Chinese speech training samples and Japanese speech training samples can be chosen; all the selected training samples together form the training set. For instance, Chinese speech training samples with a total duration of 100 hours and Japanese speech training samples with a total duration of 135 hours may be selected to form the training set.
Specifically, after the training samples corresponding to the languages comprised by the language environment have been selected, the following operations can be performed for each training sample to train the fully-connected network:
determining the phonetic feature sequence corresponding to each speech frame of the training sample;
pooling the phonetic feature sequences of the individual speech frames to determine the phonetic feature sequence corresponding to the training sample;
training the fully-connected network with the phonetic feature sequence of the training sample as input.
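The three operations above can be sketched end to end as follows. This is a deliberately simplified stand-in: the framing is hypothetical, and a single softmax layer (with an appended bias feature) replaces the patent's fully-connected network; only the determine-pool-train pipeline itself comes from the text:

```python
import numpy as np

def frame_features(sample, frame_len=3):
    """Stand-in for step 1: split a raw value sequence into fixed-length
    frames (hypothetical framing; the patent extracts features with a CNN/LSTM)."""
    n = len(sample) // frame_len
    return np.asarray(sample[: n * frame_len], dtype=float).reshape(n, frame_len)

def pooled_input(sample):
    """Step 2: max-pool the per-frame features into one fixed-length vector,
    plus a constant bias feature for the classifier below."""
    return np.append(frame_features(sample).max(axis=0), 1.0)

def train_language_classifier(samples, labels, num_languages, lr=0.1, epochs=300):
    """Step 3: train a classifier on the pooled features. A single softmax
    layer stands in for the patent's fully-connected network."""
    x = np.stack([pooled_input(s) for s in samples])
    y = np.eye(num_languages)[labels]
    w = np.zeros((x.shape[1], num_languages))
    for _ in range(epochs):
        logits = x @ w
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (probs - y) / len(x)  # cross-entropy gradient step
    return w

# Toy training set: "language 0" samples have large feature values, "language 1" small.
samples = [[9, 8, 9, 7, 9, 8], [8, 9, 9, 9, 7, 8], [1, 2, 1, 2, 1, 1], [2, 1, 1, 1, 2, 2]]
labels = [0, 0, 1, 1]
w = train_language_classifier(samples, labels, num_languages=2)
pred = int(np.argmax(pooled_input([9, 9, 8, 8, 9, 9]) @ w))  # predicts language 0
```

The toy data and two-language setup are assumptions made only so the sketch runs; real training samples would be the Chinese/Japanese speech recordings described above.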
On thousands of professionally recorded short audio tests, the language-identification accuracy of a fully-connected network obtained by the above training method is greatly improved. For example, for a fully-connected network whose language environment comprises Chinese and English, the recognition accuracy can reach 99% or more; for a language environment comprising Chinese and Japanese, 97% or more; and for a language environment comprising Chinese and Korean, 92% or more.
Embodiment three
Embodiment three of the present invention provides a language-identification method in which language identification reuses the network of speech recognition. The steps of the method can be as shown in Figure 3, comprising:
Step 201: determine the language environment.
This step is optional. If the language-identification model can identify multiple languages, e.g. Chinese, English, Korean and Japanese, then the language environment need not be determined first; this step can be skipped and step 202 executed directly.
In this step, the selected language environment can be determined, the language environment comprising at least two languages. In the present embodiment the description takes a language environment comprising Chinese and English as an example; other cases are similar and are not repeated here.
Step 202: collect a voice time-domain signal.
In this step, the voice time-domain signal on which speech recognition is to be performed can be collected.
Step 203: determine the phonetic feature sequence corresponding to each speech frame.
Specifically, the collected voice time-domain signal is first divided into frames; each speech frame is then converted to the frequency domain, obtaining a voice frequency-domain signal; finally, for each speech frame, a phonetic feature sequence characterizing the essential features of the voice signal is extracted.
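The framing and frequency-domain conversion of step 203 can be sketched as follows. The frame length, hop size and the use of the windowed magnitude spectrum as the "feature" are illustrative assumptions; the patent does not fix a specific feature type:

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split the time-domain signal into (possibly overlapping) frames."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def frame_spectra(frames):
    """Convert each frame to the frequency domain (Hann-windowed magnitude
    spectrum). A real front end would go on to derive e.g. filter-bank or
    MFCC features from this, which is omitted here."""
    return np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))

# 0.5 s of a 440 Hz tone sampled at 16 kHz, 25 ms frames with a 10 ms hop.
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
signal = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(signal, frame_len=int(0.025 * sr), hop=int(0.010 * sr))
spectra = frame_spectra(frames)  # one spectrum per speech frame
```

With these settings the frequency resolution is 16000 / 400 = 40 Hz per bin, so the 440 Hz tone shows up as a peak in bin 11 of every frame's spectrum.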
After step 203, steps 2041 and 2042 are carried out in parallel.
Step 2041: perform speech recognition.
Since the language environment determined in step 201 comprises Chinese and English, in this step the phonetic feature sequences determined in step 203 can be passed through the fully-connected output layer to obtain syllable probability distributions, which are fed into the Chinese decoder and the English decoder respectively for speech recognition.
Step 2042: pool part of the phonetic feature sequences.
In an implementation, in this step the phonetic feature sequences of part of the speech frames in the collected voice time-domain signal are pooled, e.g. the phonetic feature sequences of the first N speech frames (covering a duration not less than the minimum required for language identification), or the phonetic feature sequences of the speech frames within the first M seconds (likewise not less than the minimum duration required for language identification).
For example, if the duration of the voice time-domain signal to be identified is relatively long (greater than the minimum signal duration required for language identification), e.g. 10 seconds, then in this step the phonetic feature sequences of the speech frames within a set duration (not less than the minimum signal duration required for language identification), e.g. within the first 2 seconds, can be pooled.
Preferably, the pooling can be max-pooling (Max-pooling), yielding the phonetic feature sequence corresponding to the portion of the voice time-domain signal used for language identification. Step 205 is then executed.
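Pooling only a prefix of the frame features, as described above, can be sketched as follows (the frame rate of 100 frames per second, i.e. a 10 ms hop, is an illustrative assumption):

```python
import numpy as np

def pool_prefix(frame_features, frames_per_second=100, prefix_seconds=2.0):
    """Max-pool only the features of the first `prefix_seconds` of speech,
    so language identification can start before the whole utterance arrives."""
    n = int(frames_per_second * prefix_seconds)
    return frame_features[:n].max(axis=0)

# 10 s of per-frame features (1000 frames, 3-dimensional); only the first
# 2 s (200 frames) are pooled for language identification.
feats = np.zeros((1000, 3))
feats[:200] = [[1, 2, 3]]
feats[200:] = [[9, 9, 9]]  # later frames must not influence the pooled result
pooled = pool_prefix(feats)
print(pooled.tolist())  # [1.0, 2.0, 3.0]
```

The later, larger values deliberately do not appear in the output, confirming that only the 2-second prefix drives the language decision.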
Step 205: determine the language.
In this step, the phonetic feature sequence obtained through pooling can be taken as input to the language-identification model trained in advance for the language environment, to determine the corresponding language.
Assuming the identified language is Chinese, the English decoder can be closed, so that the Chinese decoder can subsequently be used directly to perform speech recognition on the voice time-domain signal for which speech recognition has not yet been carried out: for that collected but not yet recognized signal, the phonetic feature sequence corresponding to each speech frame is determined, passed through the fully-connected output layer to obtain the syllable probability distribution, and fed into the Chinese decoder for speech recognition.
For example, assume the collected voice time-domain signal on which speech recognition is to be performed is 10 seconds long. After steps 203 to 205 have performed speech recognition and language identification on the first 2 seconds of the collected signal, the phonetic feature sequence corresponding to each speech frame can be determined for the remaining 8 seconds and, the language-identification result being Chinese, passed through the fully-connected output layer into the Chinese decoder for speech recognition.
The above description takes the case where the speech-recognition process and the language-identification process run in parallel as an example. If they run serially, it is to be understood that after step 203, steps 2042 and 205 are executed directly, and after step 205 speech recognition is performed according to the identified language.
If the speech-recognition process and the language-identification process run serially, then, assuming the identified language is Chinese, the English decoder can likewise be closed, so that the Chinese decoder is used directly for speech recognition of the voice time-domain signal.
For example, assume the collected voice time-domain signal on which speech recognition is to be performed is 10 seconds long. After steps 201 to 203, 2042 and 205 have performed language identification on the first 2 seconds of the collected signal, speech recognition can be performed on the phonetic feature sequences used for language identification (those corresponding to the 2-second portion), and for the remaining 8 seconds the phonetic feature sequence corresponding to each speech frame is determined and likewise passed through the fully-connected output layer into the Chinese decoder for speech recognition.
In the present embodiment, speech recognition may, but need not, first be performed on the phonetic feature sequences corresponding to the first 2 seconds of the signal and then on those corresponding to the remaining 8 seconds, so as to minimize the latency of speech recognition. Alternatively, but not exclusively, after the phonetic feature sequences of the remaining 8 seconds have been determined, the phonetic feature sequences of the entire 10 seconds can be passed together through the fully-connected output layer into the Chinese decoder for speech recognition.
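The serial flow just described (identify the language on a prefix, close the other decoders, then decode everything with the remaining decoder) can be sketched with stub components; the language tags, the threshold-based stub identifier and the string-returning stub decoders are all hypothetical:

```python
def identify_language(prefix_features):
    """Stub language identifier: stands in for prefix pooling plus the
    trained fully-connected network (assumption: returns a language tag)."""
    return "zh" if sum(prefix_features) > 0 else "en"

def serial_pipeline(features, prefix_frames, decoders):
    """Identify the language on the first `prefix_frames` frames, keep only
    the matching decoder, then decode the prefix and the remaining frames."""
    language = identify_language(features[:prefix_frames])
    decoder = decoders[language]  # the other decoders are simply not used
    text = decoder(features[:prefix_frames]) + decoder(features[prefix_frames:])
    return language, text

# Stub decoders that just report how many frames they were given.
decoders = {
    "zh": lambda frames: f"<zh:{len(frames)} frames>",
    "en": lambda frames: f"<en:{len(frames)} frames>",
}
language, text = serial_pipeline([1.0] * 100, prefix_frames=20, decoders=decoders)
print(language, text)  # zh <zh:20 frames><zh:80 frames>
```

In the parallel variant of embodiment three, both decoders would instead run from the start, and the non-matching one would be closed once the language is known.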
Based on the same inventive concept as embodiments one to three, the following device is provided.
Embodiment four
Embodiment four of the present invention provides a language-identification device whose structure can be as shown in Figure 4, comprising:
an acquisition module 11 for collecting a voice time-domain signal;
a feature determination module 12 for determining the phonetic feature sequence corresponding to each speech frame in the collected voice time-domain signal;
a pooling module 13 for pooling the phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal;
a language-identification module 14 for taking the phonetic feature sequence corresponding to the collected voice time-domain signal as input and determining, with a language-identification model trained in advance, the language corresponding to the collected voice time-domain signal.
In one possible implementation, the pooling module 13 is specifically configured to select part of the phonetic feature sequences, wherein in each selected phonetic feature sequence the number of feature values not greater than zero does not exceed a set value, and to pool the selected phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal.
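This selection rule can be sketched as follows (the limit of one non-positive value per sequence is an illustrative assumption for the "set value"):

```python
def select_informative(frames, max_nonpositive=1):
    """Keep only per-frame feature sequences in which the number of feature
    values not greater than zero stays within the set limit."""
    return [f for f in frames if sum(1 for v in f if v <= 0) <= max_nonpositive]

frames = [[2, 4, 6], [0, -1, 3], [2, 0, 4], [-2, 0, 0]]
kept = select_informative(frames)
print(kept)  # [[2, 4, 6], [2, 0, 4]]

# The selected sequences are then max-pooled as in the earlier example.
pooled = [max(values) for values in zip(*kept)]
print(pooled)  # [2, 4, 6]
```

Discarding frames dominated by non-positive feature values keeps mostly silent or uninformative frames out of the pooled representation.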
In one possible implementation, the pooling module 13 is specifically configured to max-pool the phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal.
In one possible implementation, the language-identification model is obtained by training in the following manner:
selecting corresponding training samples according to the languages comprised by the selected language environment;
for each training sample, performing the following operations:
determining the phonetic feature sequence corresponding to each speech frame in the training sample;
pooling the determined phonetic feature sequences to obtain the phonetic feature sequence corresponding to the training sample;
training the corresponding language-identification model with the phonetic feature sequence of the training sample as input.
With the solutions provided by embodiments one to four of the present invention, language identification is more accurate and faster, and sharing the network with the speech-recognition model reduces the occupancy of computing resources. Meanwhile, the multilingual output results also increase the concurrency of language identification.
Further, embodiment five of the present invention can also provide a translator comprising the device of embodiment four. Through fast and accurate language identification, the user can thus be offered a better and more convenient translation service.
Based on the same inventive concept, the following equipment and medium are provided by embodiments of the present invention.
Embodiment six
Embodiment six of the present invention provides a language-identification equipment whose structure can be as shown in Figure 5, comprising a memory 21, a processor 22 and a computer program stored on the memory, the processor 22 implementing the steps of the method of embodiment one of the present invention when executing the program.
Optionally, the processor 22 may specifically comprise a central processing unit (CPU) or an application-specific integrated circuit (ASIC), may be one or more integrated circuits for controlling program execution, may be a hardware circuit developed using a field-programmable gate array (FPGA), or may be a baseband processor.
Optionally, the processor 22 may include at least one processing core.
Optionally, the memory 21 may comprise read-only memory (ROM), random-access memory (RAM) and disk storage. The memory 21 stores the data required when the at least one processor 22 runs. There may be one or more memories 21.
Embodiment seven of the present invention provides a non-volatile computer storage medium storing an executable program which, when executed by a processor, implements the method provided by embodiment one of the present invention.
In a specific implementation, the computer storage medium may include various storage media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random-access memory (RAM), a magnetic disk or an optical disc.
In the embodiments of the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical or of other forms.
The functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be an independent physical module.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, all or part of the technical solutions of the embodiments of the present invention may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (e.g. a personal computer, server or network device) or processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk or an optical disc.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data-processing device to produce a machine, such that the instructions executed by the processor produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data-processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they know the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

1. A language identification method, characterized in that the method comprises:
collecting a voice time-domain signal, and determining the phonetic feature sequence corresponding to each speech frame in the collected voice time-domain signal;
pooling the phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal;
taking the phonetic feature sequence corresponding to the collected voice time-domain signal as input, and determining, with a language identification model trained in advance, the language corresponding to the collected voice time-domain signal.
2. The method according to claim 1, characterized in that pooling the phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal comprises:
selecting part of the phonetic feature sequences, wherein in each selected phonetic feature sequence the number of feature values not greater than zero does not exceed a set value;
pooling the selected phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal.
3. The method according to claim 1, characterized in that the pooling is max-pooling.
4. The method according to claim 1, characterized in that after determining the phonetic feature sequence corresponding to each speech frame in the collected voice time-domain signal, the method further comprises:
performing speech recognition on the collected voice time-domain signal according to the phonetic feature sequences, to obtain the text corresponding to the collected voice time-domain signal.
5. The method according to claim 4, characterized in that performing speech recognition on the collected voice time-domain signal according to the phonetic feature sequences comprises:
performing speech recognition on the collected voice time-domain signal according to the phonetic feature sequences and the determined language.
6. The method according to any one of claims 1 to 5, characterized in that the language identification model is obtained by training in the following manner:
selecting corresponding training samples according to the languages comprised by the selected language environment;
for each training sample, performing the following operations:
determining the phonetic feature sequence corresponding to each speech frame in the training sample;
pooling the determined phonetic feature sequences to obtain the phonetic feature sequence corresponding to the training sample;
training the corresponding language identification model with the phonetic feature sequence of the training sample as input.
7. A language identification device, characterized in that the device comprises:
an acquisition module for collecting a voice time-domain signal;
a feature determination module for determining the phonetic feature sequence corresponding to each speech frame in the collected voice time-domain signal;
a pooling module for pooling the phonetic feature sequences to obtain the phonetic feature sequence corresponding to the collected voice time-domain signal;
a language identification module for taking the phonetic feature sequence corresponding to the collected voice time-domain signal as input and determining, with a language identification model trained in advance, the language corresponding to the collected voice time-domain signal.
8. A translator, characterized in that the translator comprises the language identification device according to claim 7.
9. A non-volatile computer storage medium, characterized in that the computer storage medium stores an executable program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
10. A language identification equipment, characterized by comprising a memory, a processor and a computer program stored on the memory, the processor implementing the steps of the method according to any one of claims 1 to 6 when executing the program.
CN201810908924.XA 2018-08-10 2018-08-10 A kind of Language Identification, device, translator, medium and equipment Pending CN109192192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810908924.XA CN109192192A (en) 2018-08-10 2018-08-10 A kind of Language Identification, device, translator, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810908924.XA CN109192192A (en) 2018-08-10 2018-08-10 A kind of Language Identification, device, translator, medium and equipment

Publications (1)

Publication Number Publication Date
CN109192192A true CN109192192A (en) 2019-01-11

Family

ID=64920965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810908924.XA Pending CN109192192A (en) 2018-08-10 2018-08-10 A kind of Language Identification, device, translator, medium and equipment

Country Status (1)

Country Link
CN (1) CN109192192A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system
CN103761975A (en) * 2014-01-07 2014-04-30 苏州思必驰信息科技有限公司 Method and device for oral evaluation
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106878805A (en) * 2017-02-06 2017-06-20 广东小天才科技有限公司 Mixed language subtitle file generation method and device
CN106898350A (en) * 2017-01-16 2017-06-27 华南理工大学 A kind of interaction of intelligent industrial robot voice and control method based on deep learning
CN108305642A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN108335693A (en) * 2017-01-17 2018-07-27 腾讯科技(深圳)有限公司 A kind of Language Identification and languages identification equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI Ziyu et al., "Application of convolutional neural networks in language identification: dialect classification in Jiangsu Province as an example", Science & Technology Communication *
JIN Ma, "Research on language identification methods based on convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110797016A (en) * 2019-02-26 2020-02-14 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
WO2020182153A1 (en) * 2019-03-11 2020-09-17 腾讯科技(深圳)有限公司 Method for performing speech recognition based on self-adaptive language, and related apparatus
US12033621B2 (en) 2019-03-11 2024-07-09 Tencent Technology (Shenzhen) Company Limited Method for speech recognition based on language adaptivity and related apparatus
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium
CN110033756B (en) * 2019-04-15 2021-03-16 北京达佳互联信息技术有限公司 Language identification method and device, electronic equipment and storage medium
CN110648654A (en) * 2019-10-09 2020-01-03 国家电网有限公司客户服务中心 Speech recognition enhancement method and device introducing language vectors
CN110689875A (en) * 2019-10-28 2020-01-14 国家计算机网络与信息安全管理中心 Language identification method and device and readable storage medium
CN111326139A (en) * 2020-03-10 2020-06-23 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111326139B (en) * 2020-03-10 2024-02-13 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN112801239A (en) * 2021-01-28 2021-05-14 科大讯飞股份有限公司 Input recognition method and device, electronic equipment and storage medium
CN112801239B (en) * 2021-01-28 2023-11-21 科大讯飞股份有限公司 Input recognition method, input recognition device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109192192A (en) A kind of Language Identification, device, translator, medium and equipment
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN105427858B (en) Realize the method and system that voice is classified automatically
CN107492382B (en) Voiceprint information extraction method and device based on neural network
CN105976812B (en) A kind of audio recognition method and its equipment
CN110097894A (en) A kind of method and system of speech emotion recognition end to end
CN108986835B (en) Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network
CN107103903A (en) Acoustic training model method, device and storage medium based on artificial intelligence
Ye et al. Temporal modeling matters: A novel temporal emotional modeling approach for speech emotion recognition
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN107680597A (en) Audio recognition method, device, equipment and computer-readable recording medium
CN110211565A (en) Accent recognition method, apparatus and computer readable storage medium
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN106782501A (en) Speech Feature Extraction and device based on artificial intelligence
CN106683677A (en) Method and device for recognizing voice
CN108399923A (en) More human hairs call the turn spokesman's recognition methods and device
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN109147769B (en) Language identification method, language identification device, translation machine, medium and equipment
CN110517689A (en) A kind of voice data processing method, device and storage medium
CN102982811A (en) Voice endpoint detection method based on real-time decoding
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN106504768A (en) Phone testing audio frequency classification method and device based on artificial intelligence
CN108090038A (en) Text punctuate method and system
CN109036471A (en) Sound end detecting method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20190111