CN110288980A - Speech recognition method, model training method, apparatus, device and storage medium - Google Patents

Speech recognition method, model training method, apparatus, device and storage medium

Info

Publication number
CN110288980A
Authority
CN
China
Prior art keywords: training, vector, model, data, trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910522750.8A
Other languages
Chinese (zh)
Inventor
王健宗 (Wang Jianzong)
魏文琦 (Wei Wenqi)
贾雪丽 (Jia Xueli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910522750.8A priority Critical patent/CN110288980A/en
Publication of CN110288980A publication Critical patent/CN110288980A/en
Priority to PCT/CN2019/118227 priority patent/WO2020253060A1/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

This application relates to the field of biometric recognition, and in particular uses a Transformer network in combination with liveness detection. It discloses a speech recognition method, a model training method, an apparatus, a device, and a storage medium. The training method includes: obtaining a training pinyin corpus and a data label corresponding to the training pinyin corpus; performing word segmentation on the training pinyin corpus to obtain training segmented data; performing word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector; obtaining position data of the training segmented data within the training pinyin corpus, and performing vector conversion on the position data, to obtain a position vector; splicing the word embedding vector with the position vector, to obtain a spliced word vector; and, based on a Transformer network, performing model training according to the spliced word vector and the data label, to obtain a language conversion model.

Description

Speech recognition method, model training method, apparatus, device and storage medium
Technical field
This application relates to the technical field of model training, and in particular to a speech recognition method, a training method for a language conversion model, an apparatus, a device, and a storage medium.
Background
Speech recognition, also known as automatic speech recognition (Automatic Speech Recognition, ASR), refers to the technology by which a machine recognizes and understands a speech signal and converts it into text. It is widely used in fields such as smart homes and voice input, and greatly facilitates people's lives. However, most existing speech recognition is implemented with models such as recurrent neural networks (Recurrent Neural Networks, RNN), long short-term memory networks (Long Short-Term Memory, LSTM), or gated recurrent units (Gated Recurrent Unit, GRU). Speech recognition based on such models is a sequential computation, and sequential computation can lose information, which harms recognition accuracy and at the same time lowers recognition efficiency. How to improve the efficiency and accuracy of speech recognition has therefore become a problem to be solved urgently.
Summary of the invention
This application provides a speech recognition method, a training method for a language conversion model, an apparatus, a computer device, and a storage medium, so that when the language conversion model is applied to speech recognition, the accuracy and efficiency of speech recognition are improved.
In a first aspect, this application provides a training method for a language conversion model, the method comprising:
obtaining a training pinyin corpus and a data label corresponding to the training pinyin corpus;
performing word segmentation on the training pinyin corpus, to obtain training segmented data;
performing word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector;
obtaining position data of the training segmented data within the training pinyin corpus, and performing vector conversion on the position data, to obtain a position vector;
splicing the word embedding vector with the position vector, to obtain a spliced word vector;
based on a Transformer network, performing model training according to the spliced word vector and the data label, to obtain the language conversion model.
In a second aspect, this application provides a speech recognition method, the method comprising:
obtaining a target speech signal, and preprocessing the target speech signal according to a preset processing rule, to obtain a spectral vector corresponding to the target speech signal;
inputting the spectral vector into a preset phoneme model, to obtain a pinyin feature sequence;
inputting the pinyin feature sequence into a language conversion model, to obtain a target Chinese text, the language conversion model being obtained by training with the training method of the language conversion model described above.
In a third aspect, this application further provides a training apparatus for a language conversion model, the apparatus comprising:
a corpus acquisition unit, configured to obtain a training pinyin corpus and a data label corresponding to the training pinyin corpus;
a word segmentation unit, configured to perform word segmentation on the training pinyin corpus, to obtain training segmented data;
a vector conversion unit, configured to perform word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector;
a position acquisition unit, configured to obtain position data of the training segmented data within the training pinyin corpus, and to perform vector conversion on the position data, to obtain a position vector;
a vector splicing unit, configured to splice the word embedding vector with the position vector, to obtain a spliced word vector;
a model training unit, configured to perform, based on a Transformer network, model training according to the spliced word vector and the data label, to obtain the language conversion model.
In a fourth aspect, this application further provides a speech recognition apparatus, the apparatus comprising:
a signal acquisition unit, configured to obtain a target speech signal and to preprocess the target speech signal according to a preset processing rule, to obtain a spectral vector corresponding to the target speech signal;
a spectrum input unit, configured to input the spectral vector into a preset phoneme model, to obtain a pinyin feature sequence;
a text acquisition unit, configured to input the pinyin feature sequence into a language conversion model, to obtain a target Chinese text, the language conversion model being obtained by training with the training method of the language conversion model described above.
In a fifth aspect, this application further provides a computer device. The computer device includes a memory and a processor; the memory is configured to store a computer program; and the processor is configured to execute the computer program and, when executing the computer program, to implement the training method of the language conversion model described above or the speech recognition method described above.
In a sixth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the training method of the language conversion model described above or the speech recognition method described above.
This application discloses a speech recognition method, a model training method, an apparatus, a device, and a storage medium. A spliced word vector is obtained by splicing the word embedding vector with the position vector; based on a Transformer network, model training is performed according to the spliced word vector and the data label, to obtain a language conversion model. When this language conversion model is applied to speech recognition, it replaces the sequential computation of conventional speech recognition and avoids the loss of position information, thereby improving recognition accuracy and efficiency.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a training method for a language conversion model provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of sub-steps of the training method of the language conversion model in Fig. 1;
Fig. 3 is a schematic diagram of obtaining a spliced word vector, provided by an embodiment of this application;
Fig. 4 is a schematic flowchart of sub-steps of the training method of the language conversion model in Fig. 1;
Fig. 5 is a schematic flowchart of sub-steps of one embodiment of outputting training encoding information in Fig. 4;
Fig. 6 is a schematic flowchart of sub-steps of another embodiment of outputting training encoding information in Fig. 4;
Fig. 7 is a schematic flowchart of a speech recognition method provided by an embodiment of this application;
Fig. 8 is a schematic flowchart of sub-steps of the speech recognition method in Fig. 7;
Fig. 9 is a schematic block diagram of a training apparatus for a language conversion model provided by an embodiment of this application;
Fig. 10 is a schematic block diagram of sub-modules of the training apparatus of the language conversion model in Fig. 9;
Fig. 11 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of this application;
Fig. 12 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The flowcharts shown in the drawings are only illustrative; they need not include all of the content and operations/steps, nor must the steps be executed in the described order. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
An embodiment of this application provides a training method for a language conversion model, a speech recognition method, an apparatus, a computer device, and a storage medium. When the language conversion model is applied to speech recognition, recognition efficiency and accuracy can be improved.
Some embodiments of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
Referring to Fig. 1, Fig. 1 is a schematic step flowchart of a training method for a language conversion model provided by an embodiment of this application.
As shown in Fig. 1, the training method of the language conversion model specifically includes steps S101 to S106.
S101: obtain a training pinyin corpus and a data label corresponding to the training pinyin corpus.
Specifically, pinyin text can be collected according to the actual application scenario as the training pinyin corpus. For example, for news-domain speech, the pinyin of Chinese sentences commonly used in the news field can be collected as the training pinyin corpus.
The data label is the true Chinese text corresponding to the training pinyin corpus. For example, the true Chinese text corresponding to the training pinyin corpus "wo3xi3huan1bei3jing1" is "我喜欢北京" ("I like Beijing"), so the data label corresponding to this training pinyin corpus is "我喜欢北京".
S102: perform word segmentation on the training pinyin corpus, to obtain training segmented data.
For example, word segmentation can be performed on the training pinyin corpus with a dictionary-based segmentation algorithm or with a statistics-based machine learning algorithm.
In some embodiments, the specific process of performing word segmentation on the training pinyin corpus, i.e. step S102, specifically includes: performing word segmentation on the training pinyin corpus according to a preset dictionary, to obtain training segmented data.
The dictionary is a candidate set of common words. For a training pinyin corpus such as "wo3xi3huan1bei3jing1" ("I like Beijing"), the corpus is traversed from head to tail, and whenever a word from the dictionary occurs in the corpus, that word is cut off, so that "wo3xi3huan1bei3jing1" is segmented into the three pieces of training segmented data "wo3", "xi3huan1", and "bei3jing1". The digits "3" and "1" denote tones.
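The dictionary traversal described above can be sketched in a few lines of Python; the dictionary contents and function name here are illustrative assumptions, not taken from the patent, and longest-match-first stands in for whatever matching policy an implementation would actually use:

```python
# Hypothetical sketch of dictionary-based pinyin segmentation
# (forward maximum matching); the dictionary below is illustrative.
DICTIONARY = {"wo3", "xi3huan1", "bei3jing1", "xi3", "bei3"}

def segment(corpus, dictionary=DICTIONARY):
    """Traverse the corpus from head to tail, cutting off the
    longest dictionary word found at each position."""
    result, i = [], 0
    while i < len(corpus):
        # Try the longest candidate first so "xi3huan1" beats "xi3".
        for j in range(len(corpus), i, -1):
            if corpus[i:j] in dictionary:
                result.append(corpus[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return result

print(segment("wo3xi3huan1bei3jing1"))
# ['wo3', 'xi3huan1', 'bei3jing1']
```
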
In other embodiments, the specific process of performing word segmentation on the training pinyin corpus, i.e. step S102, specifically includes: performing one-hot encoding on the training pinyin corpus according to a preset dictionary, to obtain training segmented data.
One-hot encoding is an efficient encoding scheme: for the words of a given attribute, there are as many bits as there are states, and exactly one bit is 1 while all the others are 0.
For example, the preset dictionary contains the words corresponding to the attribute "season": the pinyin of spring "chun1tian1", the pinyin of summer "xia4tian1", the pinyin of autumn "qiu1tian1", the pinyin of winter "dong1tian1", and the "other" entry "qi2ta1". This attribute has 5 distinct values, so 5 bits are needed to indicate which value the attribute takes. For example, the one-hot code of "chun1tian1" is {10000}, the one-hot code of "xia4tian1" is {01000}, the one-hot code of "qiu1tian1" is {00100}, the one-hot code of "dong1tian1" is {00010}, and the one-hot code of "qi2ta1" is {00001}.
The preset dictionary may also contain attributes such as person, fruit, gender, and motion mode, i.e. the words and one-hot codes corresponding to each attribute.
If a pinyin corpus contains multiple words, their one-hot codes are concatenated in turn when encoding. For example, the one-hot code of the summer pinyin "xia4tian1" is {01000} and the one-hot code of the "hot" pinyin "re4" is {001}, so connecting the two head to tail gives the one-hot code {01000001}.
Processing the pinyin corpus with one-hot encoding makes the data sparse, and the one-hot-encoded data carries the word-attribute information of the pinyin corpus.
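A minimal sketch of the one-hot scheme above, reusing the season vocabulary from the example (the function names are illustrative assumptions):

```python
# One-hot encoding over the "season" attribute from the example above.
SEASON_WORDS = ["chun1tian1", "xia4tian1", "qiu1tian1", "dong1tian1", "qi2ta1"]

def one_hot(word, vocabulary):
    """One bit per state; exactly one bit is 1, the rest are 0."""
    return [1 if w == word else 0 for w in vocabulary]

def encode_corpus(words, vocabulary):
    """Concatenate the one-hot codes of consecutive words head to tail."""
    code = []
    for w in words:
        code.extend(one_hot(w, vocabulary))
    return code

print(one_hot("xia4tian1", SEASON_WORDS))  # [0, 1, 0, 0, 0]
```
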
After word segmentation is performed on the training pinyin corpus, the training segmented data corresponding to the training pinyin corpus is obtained.
For example, the training segmented data corresponding to a certain training pinyin corpus is: 100000001000000001000010010000.
S103: perform word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector.
After the training segmented data is obtained, word-vector conversion is performed on the training segmented data according to the preset word embedding model, to obtain a word embedding vector.
In one embodiment, the preset word embedding model may be a Word2vec (word to vector) word embedding model. Multiple pieces of training segmented data form a training segmented data set. According to the Word2vec word embedding model, each piece of training segmented data in the set can be represented by one word embedding vector. In one embodiment, the dimension of the word embedding vector is 512.
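As a sketch only: the patent relies on a trained Word2vec model, while the toy table below simply fixes one vector per word to show the lookup step (the vocabulary, seed, and 8-dimensional size are illustrative assumptions; the patent uses dimension 512):

```python
import random

# Illustrative stand-in for a trained Word2vec table: each segmented
# pinyin word maps to one fixed embedding vector.
random.seed(0)
EMB_DIM = 8
VOCAB = ["wo3", "xi3huan1", "bei3jing1"]
EMBEDDING = {w: [random.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for w in VOCAB}

def embed(words):
    """Look up one word embedding vector per piece of segmented data."""
    return [EMBEDDING[w] for w in words]

vectors = embed(["wo3", "xi3huan1", "bei3jing1"])
print(len(vectors), len(vectors[0]))  # 3 8
```
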
It should be understood that, in other embodiments, the preset word embedding model may also be another pre-trained neural network model, such as a deep neural network (Deep Neural Network, DNN) model.
S104: obtain position data of the training segmented data within the training pinyin corpus, and perform vector conversion on the position data, to obtain a position vector.
Specifically, after the position data corresponding to the training segmented data is obtained, vector conversion is performed on the position data, to obtain a position vector corresponding to the position data.
In one embodiment, obtaining the position data of the training segmented data within the training pinyin corpus comprises:
computing, based on a position calculation formula, the position data of the training segmented data within the training pinyin corpus, the position calculation formula being:
PE(pos, 2m) = sin(pos / 10000^(2m / d_g))
or
PE(pos, 2m+1) = cos(pos / 10000^(2m / d_g))
where pos is the position of the training segmented data, 2m or (2m+1) denotes the dimension of the word embedding vector corresponding to the training segmented data, and d_g is the vector dimension corresponding to the training pinyin corpus.
Specifically, when the dimension of the word embedding vector corresponding to the training segmented data is even, the first formula is used to compute the position data of the training segmented data within the training pinyin corpus; when the dimension of the word embedding vector is odd, the second formula is used.
For example, suppose d_g is 512, the position pos of training segmented data R in the training pinyin corpus is 20, and the dimension 2m of the word embedding vector corresponding to R is 128. By the above position calculation formula, the position data of R within the training pinyin corpus is sin(20 / 10000^(128/512)) = sin(2) ≈ 0.909.
As another example, suppose d_g is 512, the position pos of training segmented data R is 20, and the dimension 2m+1 of the corresponding word embedding vector is 129. By the above position calculation formula, the position data of R within the training pinyin corpus is cos(20 / 10000^(128/512)) = cos(2) ≈ -0.416.
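Assuming the standard sinusoidal form implied by the worked examples (sine for an even embedding dimension 2m, cosine for an odd dimension 2m + 1), the position calculation can be sketched as:

```python
import math

def position_data(pos, dim, d_g=512):
    """Sinusoidal position value: sine for an even embedding
    dimension 2m, cosine for an odd dimension 2m + 1."""
    m2 = dim if dim % 2 == 0 else dim - 1  # the 2m in the exponent
    angle = pos / (10000 ** (m2 / d_g))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

# The worked examples above: pos = 20, d_g = 512.
print(round(position_data(20, 128), 3))  # 0.909  (sin(2))
print(round(position_data(20, 129), 3))  # -0.416 (cos(2))
```
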
As shown in Fig. 2, in one embodiment, the step of performing vector conversion on the position data to obtain a position vector includes sub-steps S104a and S104b.
S104a: determine the order of the training segmented data within the training pinyin corpus.
For example, for the training pinyin corpus "wo3xi3huan1bei3jing1", the order of the training segmented data "wo3" within the corpus is 1, the order of "xi3huan1" is 2, and the order of "bei3jing1" is 3.
S104b: perform vector conversion on the position data according to the order, to obtain a position vector corresponding to the training segmented data.
Specifically, vector conversion is performed on each piece of position data according to the order of the training segmented data within the training pinyin corpus.
For example, the position data of the training segmented data "wo3" is 0.863 and its order is 1, so the position vector corresponding to "wo3" is (0.863, 0, 0); the position data of "xi3huan1" is 0.125 and its order is 2, so the position vector corresponding to "xi3huan1" is (0, 0.125, 0); the position data of "bei3jing1" is 0.928 and its order is 3, so the position vector corresponding to "bei3jing1" is (0, 0, 0.928).
S105: splice the word embedding vector with the position vector, to obtain a spliced word vector.
Specifically, after the word embedding vector and the position vector are obtained, the word embedding vector and the position vector are spliced to obtain the spliced word vector.
In one embodiment, splicing the word embedding vector with the position vector to obtain the spliced word vector specifically includes: summing the word embedding vector with the position vector, to obtain the spliced word vector.
For example, word segmentation is performed on the training pinyin corpus "wo3xi3huan1bei3jing1", obtaining the three pieces of training segmented data "wo3", "xi3huan1", and "bei3jing1", whose word embedding vectors are A1, A2, and A3 and whose position vectors are B1, B2, and B3 respectively. Suppose the word embedding vectors and position vectors are four-dimensional; the spliced word vectors corresponding to the three pieces of training segmented data are C1, C2, and C3, where, as shown in Fig. 3, C1 = A1 + B1, C2 = A2 + B2, and C3 = A3 + B3.
In another embodiment, splicing the word embedding vector with the position vector to obtain the spliced word vector specifically includes: connecting the word embedding vector and the position vector end to end, to obtain the spliced word vector.
In one embodiment, the word embedding vector is followed by the position vector. For example, if the word embedding vector is (1, 0, 0) and the position vector is (0, 0.125, 0), the resulting spliced word vector is (1, 0, 0, 0, 0.125, 0). In another embodiment, the position vector is followed by the word embedding vector; with the same vectors, the resulting spliced word vector is (0, 0.125, 0, 1, 0, 0).
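Both splicing embodiments, summation and end-to-end connection, can be sketched directly (the function names are illustrative):

```python
def splice_by_sum(word_vec, pos_vec):
    """Spliced word vector as the element-wise sum (first embodiment)."""
    return [w + p for w, p in zip(word_vec, pos_vec)]

def splice_by_concat(word_vec, pos_vec):
    """Spliced word vector as the concatenation (second embodiment)."""
    return word_vec + pos_vec

word_vec = [1, 0, 0]
pos_vec = [0, 0.125, 0]
print(splice_by_sum(word_vec, pos_vec))     # [1, 0.125, 0]
print(splice_by_concat(word_vec, pos_vec))  # [1, 0, 0, 0, 0.125, 0]
```
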
S106: based on a Transformer network, perform model training according to the spliced word vector and the data label, to obtain a language conversion model.
Specifically, the conversion neural network, i.e. the Transformer network (Transformer for short), is a highly parallelizable neural network. When model training is performed on the basis of the Transformer network according to the spliced word vector and the data label, the training speed is significantly improved.
As shown in Fig. 4, in one embodiment, the step of performing model training based on the Transformer network according to the spliced word vector and the data label corresponding to the training pinyin corpus, to obtain the language conversion model, includes steps S201 to S203.
S201: input the spliced word vector into an encoder of the Transformer network, to output training encoding information.
Specifically, the Transformer network includes an encoder and a decoder, and information can be transmitted and exchanged between them. The encoder and the decoder may each include multiple layers, and the dimensions of the encoder layers are the same as the dimensions of the decoder layers.
In one embodiment, the encoder includes a dot-product attention model and a feedforward neural network (Feed Forward). Attention represents the association between words. In one embodiment, the attention represents, in the language conversion process from the pinyin side to the Chinese side, the correspondence between words whose order may be inverted relative to each other.
Specifically, referring to Fig. 5, inputting the spliced word vector into the encoder of the Transformer network to output training encoding information, as described in step S201, specifically includes sub-steps S201a and S201b.
S201a: input the spliced word vector into the dot-product attention model, to output dot-product expressiveness information.
Specifically, the dot-product attention model is:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
where Q denotes the query, K denotes the key, V denotes the value, and d_k denotes the dimension of Q and K.
Specifically, three vectors are set in the dot-product attention model, namely the query vector, the key vector, and the value vector, abbreviated Q, K, and V respectively. The spliced word vector is input into the dot-product attention model, and the output dot-product expressiveness information Attention(Q, K, V) represents the expressiveness of the corresponding training segmented data at the current position; this process is highly parallelizable.
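A minimal pure-Python sketch of scaled dot-product attention, softmax(Q·K^T / sqrt(d_k))·V, with toy 2-by-2 inputs (the values are illustrative, not the patent's trained parameters):

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    exps = [math.exp(v - max(row)) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q*K^T / sqrt(d_k))*V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# Toy 2-position, 2-dimensional example.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print([[round(v, 3) for v in row] for row in out])
```
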
S201b: input the dot-product expressiveness information into the feedforward neural network model, to output training encoding information.
Specifically, the feedforward neural network model is:
FFN(Y) = max(0, Y·W1 + b1)·W2 + b2
where Y is the dot-product expressiveness information, W1 and W2 are weights, and b1 and b2 are bias terms.
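A sketch of the feedforward step for a single vector, assuming the usual ReLU form max(0, Y·W1 + b1)·W2 + b2 (the tiny 2-3-2 weights are illustrative assumptions):

```python
def feed_forward(Y, W1, b1, W2, b2):
    """FFN(Y) = max(0, Y*W1 + b1)*W2 + b2 for a single vector Y."""
    hidden = [max(0.0, sum(y * w for y, w in zip(Y, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Tiny illustrative weights mapping 2 -> 3 -> 2 dimensions.
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, 0.1]
print(feed_forward([1.0, 2.0], W1, b1, W2, b2))  # [2.1, 3.1]
```
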
In another embodiment, the encoder includes a multi-head attention model and a feedforward neural network (Feed Forward). Attention represents the association between words. In one embodiment, the attention represents, in the language conversion process from the pinyin side to the Chinese side, the correspondence between words whose order may be inverted relative to each other.
As shown in Fig. 6, inputting the spliced word vector into the encoder of the Transformer network to output training encoding information specifically includes sub-steps S201c and S201d.
S201c: input the spliced word vector into the multi-head attention model, to output multi-head expressiveness information.
Here the multi-head attention model is:
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W^0
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
where W_i^Q, W_i^K, W_i^V, and W^0 are learned projection matrices, and d_g is the dimension of the word embedding vector.
Specifically, multiple Q, K, and V matrices, together with the matrix of actual values, are set in the multi-head attention model. This gives the model more training parameters, which improves model capability, takes attention at different positions into account, and assigns multiple subspaces to the attention. The spliced word vector is input into the multi-head attention model; the output multi-head expressiveness information MultiHead(Q, K, V) represents the expressiveness of the corresponding training segmented data at the current position, and the process is highly parallelizable and runs fast.
S201d: input the multi-head expressiveness information into the feedforward neural network model, to output training encoding information.
It should be understood that the feedforward neural network model in this step is the same as the feedforward neural network model in step S201b, and details are not repeated here.
S202: input the training encoding information into a decoder of the Transformer network, to output a training Chinese text.
In one embodiment, the decoder and the encoder both have multiple layers, and each decoder layer has one more sub-network than an encoder layer, namely the encoder-decoder attention (Encoder-Decoder Attention), which represents the attention mechanism from the source side to the target side. Specifically, the encoder-decoder attention represents the dependency between the words on the pinyin side and the Chinese words generated from the pinyin side.
S203: verify the training Chinese text against the data label, and adjust the parameters in the encoder and the decoder until the training Chinese text passes verification, to obtain the language conversion model.
Specifically, a suitable loss function, such as the cross-entropy loss function, can be used to measure the inconsistency between the data label and the training Chinese text; the smaller the loss, the better the robustness of the model. For example, when the loss function falls below a preset threshold, the training Chinese text passes verification; model training is then stopped, and the language conversion model is obtained.
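For a single output position, the cross-entropy check mentioned above reduces to the negative log-probability that the model assigns to the labeled Chinese character; a sketch with an illustrative 4-character vocabulary:

```python
import math

def cross_entropy(predicted, label_index):
    """Cross-entropy between a predicted probability distribution and
    the one-hot data label: -log p(true class)."""
    return -math.log(predicted[label_index])

# Illustrative: the model's distribution over a 4-character vocabulary,
# where index 2 is the character given by the data label.
loss_bad = cross_entropy([0.25, 0.25, 0.25, 0.25], 2)
loss_good = cross_entropy([0.05, 0.05, 0.85, 0.05], 2)
print(round(loss_bad, 3), round(loss_good, 3))  # 1.386 0.163
```

A better prediction yields a smaller loss, which is what the preset-threshold stopping rule exploits.
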
In the training method of the language conversion model provided by the above embodiment, a spliced word vector is obtained by splicing the word embedding vector with the position vector, and, based on a Transformer network, model training is performed according to the spliced word vector and the data label to obtain a language conversion model. When this language conversion model is applied to speech recognition, it replaces the sequential computation of conventional speech recognition and avoids the loss of position information, thereby improving recognition accuracy and efficiency.
Referring to Fig. 7, Fig. 7 is a schematic flow diagram of the audio recognition method provided by an embodiment of the present application. The speech recognition method can be applied in a terminal or a server and is used for converting a voice signal into a Chinese text.
As shown in Fig. 7, the audio recognition method comprises steps S301 to S303.
S301: obtain a target voice signal, and pre-process the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal.
Specifically, "voice" refers to audio with linguistic properties; it can be produced by a human body or by an electronic device such as a loudspeaker.
In this embodiment, the corresponding voice signal can be collected by a recording device while chatting with the user. The recording device may be, for example, a recording pen, a smart phone, a tablet computer, a notebook computer or a smart wearable device such as a smart bracelet or a smart watch.
The preset processing rule is specifically used for converting the target voice signal into information in the frequency domain, for example by using a Fast Fourier Transform (FFT) rule or a wavelet transform rule to convert the target voice information collected in the time domain into information in the frequency domain.
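As a minimal sketch of such a time-to-frequency conversion (the frame length, hop size and Hann window here are illustrative assumptions, not the preset processing rule of this application):

```python
import numpy as np

def spectral_vectors(signal, frame_len=256, hop=128):
    """Frame the time-domain signal, window each frame, and FFT it,
    yielding one magnitude-spectrum vector per frame."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

sr = 8000                                  # sample rate (Hz)
t = np.arange(sr) / sr                     # one second of audio
sig = np.sin(2 * np.pi * 440 * t)          # a pure 440 Hz tone
spec = spectral_vectors(sig)
peak_hz = spec[0].argmax() * sr / 256      # frequency of the strongest bin
```

For the pure tone above, the strongest frequency bin of the first frame lands within one bin width (sr/256 = 31.25 Hz) of 440 Hz.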
S302: input the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence.
The preset phoneme model can be obtained by training an initial neural network with a large amount of spectral-vector/pinyin sample data. The initial neural network can be any of various neural networks, for example a convolutional neural network, a recurrent neural network or a long short-term memory network.
Specifically, as shown in Fig. 8, inputting the spectral vectors into the preset phoneme model to obtain the phonetic feature sequence comprises: S302a: identifying, according to the spectral vectors, the tone, initial consonant and final corresponding to the spectral vectors; S302b: integrating the tone, initial consonant and final to obtain the phonetic feature sequence of the Chinese text.
Specifically, the tones include the first tone (high level tone), the second tone (rising tone), the third tone (dipping tone), the fourth tone (falling tone) and the neutral tone. The neutral tone, the first tone, the second tone, the third tone and the fourth tone can be represented by the digits "0", "1", "2", "3" and "4" respectively.
For example, after the spectral vectors corresponding to the source voice data of "I like Beijing" are input into the preset phoneme model, the tones corresponding to the spectral vectors can be identified, in chronological order, as "3", "3", "1", "3", "1"; the corresponding initial consonants, in chronological order, as "w", "x", "h", "b", "j"; and the corresponding finals, in chronological order, as "o", "i", "uan", "ei", "ing".
After the tone, initial consonant and final corresponding to the spectral vectors are identified, they are integrated to obtain the phonetic feature sequence {wo3xi3huan1bei3jing1} of the Chinese text "I like Beijing".
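The integration of S302b — interleaving each syllable's initial consonant, final and tone digit into one pinyin feature sequence — can be sketched as follows (a pure-Python illustration; the function name is not from this application):

```python
def build_phonetic_sequence(tones, initials, finals):
    """Combine each syllable's initial, final and tone digit
    (e.g. 'w' + 'o' + '3' -> 'wo3'), then join all syllables."""
    return "".join(i + f + t for t, i, f in zip(tones, initials, finals))

# the "I like Beijing" example from the description
tones    = ["3", "3", "1", "3", "1"]
initials = ["w", "x", "h", "b", "j"]
finals   = ["o", "i", "uan", "ei", "ing"]
seq = build_phonetic_sequence(tones, initials, finals)
# seq == "wo3xi3huan1bei3jing1"
```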
S303: input the phonetic feature sequence into the language transformation model to obtain a target Chinese text.
Specifically, the language transformation model is obtained by training with the training method of the language transformation model described above. The language transformation model performs pinyin-to-Chinese conversion on the input phonetic feature sequence to obtain the target Chinese text.
In the above audio recognition method, a target voice signal is obtained and pre-processed according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal; the spectral vectors are input into a preset phoneme model to obtain a phonetic feature sequence; and the phonetic feature sequence is input into the language transformation model to obtain a target Chinese text. Since the language transformation model changes the sequence calculation process of speech recognition and avoids the loss of position information, speech recognition accuracy and efficiency are improved.
Referring to Fig. 9, Fig. 9 is a schematic block diagram of a training device for a language transformation model provided by an embodiment of the present application. The training device is used for executing the training method of any of the foregoing language transformation models, and can be configured in a server or a terminal.
The server can be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a laptop, a desktop computer, a personal digital assistant or a wearable device.
As shown in Fig. 9, the training device 400 of the language transformation model includes: a corpus acquiring unit 401, a word segmentation processing unit 402, a vector conversion unit 403, a position acquisition unit 404, a vector concatenation unit 405 and a model training unit 406.
The corpus acquiring unit 401 is configured to obtain a training phonetic corpus and a data label corresponding to the training phonetic corpus.
The word segmentation processing unit 402 is configured to perform word segmentation processing on the training phonetic corpus to obtain training participle data.
The vector conversion unit 403 is configured to perform word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector.
The position acquisition unit 404 is configured to obtain position data of the training participle data in the training phonetic corpus, and to perform vector conversion on the position data to obtain a position vector.
The vector concatenation unit 405 is configured to splice the word embedding vector with the position vector to obtain a splicing term vector.
The model training unit 406 is configured to, based on a conversion neural network, carry out model training according to the splicing term vector and the data label to obtain the language transformation model.
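As an illustration of the splicing performed by the vector concatenation unit 405 (the dimensions and random values are assumptions):

```python
import numpy as np

def splice(word_embedding, position_vector):
    """Concatenate each word embedding vector with its position vector
    along the feature axis to form the splicing term vectors."""
    return np.concatenate([word_embedding, position_vector], axis=-1)

rng = np.random.default_rng(0)
words = rng.standard_normal((5, 16))   # 5 training participle tokens
pos   = rng.standard_normal((5, 16))   # their position vectors
spliced = splice(words, pos)           # shape (5, 32)
```

Note that splicing doubles the feature dimension; some Transformer-style models instead add the two vectors, but the splicing described here keeps both components intact side by side.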
Referring to Fig. 9, in one embodiment, the position acquisition unit 404 includes a data computation subunit 4041. The data computation subunit 4041 is configured to: based on a position calculation formula, calculate the position data of the training participle data in the training phonetic corpus according to the training participle data.
Referring to Fig. 9, in one embodiment, the position acquisition unit 404 includes a sequence determination subunit 4042 and a vector transformation subunit 4043.
The sequence determination subunit 4042 is configured to determine the arrangement order of the training participle data in the training phonetic corpus.
The vector transformation subunit 4043 is configured to perform vector conversion on the position data according to the arrangement order to obtain the position vector corresponding to the training participle data.
Referring again to Fig. 10, in one embodiment, the model training unit 406 includes a coding output subunit 4061, a text output subunit 4062 and a text verification subunit 4063.
The coding output subunit 4061 is configured to input the splicing term vector into the encoder of the conversion neural network to output training encoded information.
The text output subunit 4062 is configured to input the training encoded information into the decoder of the conversion neural network to output a training Chinese text.
The text verification subunit 4063 is configured to verify the training Chinese text according to the data label, and to adjust the parameters in the encoder and the decoder until the training Chinese text passes verification, thereby obtaining the language transformation model.
Referring to Fig. 10, in one implementation, the encoder includes a dot-product attention model and a feedforward neural network model. The coding output subunit 4061 includes a dot-product output sub-module 4061a and an information output sub-module 4061b.
The dot-product output sub-module 4061a is configured to input the splicing term vector into the dot-product attention model to output dot-product expressive force information.
The information output sub-module 4061b is configured to input the dot-product expressive force information into the feedforward neural network model to output the training encoded information.
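The two sub-modules 4061a and 4061b can be sketched together as one encoder pass (a minimal single-head numpy version; the weight shapes and the ReLU activation are assumptions, not the implementation of this application):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encode(x, w1, w2):
    """Scaled dot-product self-attention (sub-module 4061a) followed by
    a position-wise feed-forward network (sub-module 4061b)."""
    d_k = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d_k)) @ x   # dot-product expressive force information
    hidden = np.maximum(0.0, attn @ w1)          # feed-forward layer, ReLU activation
    return hidden @ w2                           # training encoded information

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))    # 5 splicing term vectors, dim 8
w1 = rng.standard_normal((8, 32))
w2 = rng.standard_normal((32, 8))
encoded = encode(x, w1, w2)        # shape (5, 8)
```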
Referring to Fig. 11, Fig. 11 is a schematic block diagram of a speech recognition device also provided by an embodiment of the present application. The speech recognition device is used for executing the aforementioned speech recognition method and can be configured in a server or a terminal.
As shown in Fig. 11, the speech recognition device 500 comprises: a signal acquiring unit 501, a frequency spectrum input unit 502 and a text acquiring unit 503.
The signal acquiring unit 501 is configured to obtain a target voice signal, and to pre-process the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal.
The frequency spectrum input unit 502 is configured to input the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence.
The text acquiring unit 503 is configured to input the phonetic feature sequence into the language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model described above.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working process of the device and units described above may refer to the corresponding process in the foregoing method embodiments, and is not repeated here.
The above device can be implemented in the form of a computer program, which can run on the computer equipment shown in Fig. 12.
Referring to Fig. 12, Fig. 12 is a schematic block diagram of computer equipment provided by an embodiment of the present application. The computer equipment can be a server or a terminal.
Referring to Fig. 12, the computer equipment includes a processor, a memory and a network interface connected through a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to execute any one of the training methods of the language transformation model, or any one of the audio recognition methods.
The processor provides computing and control capability and supports the operation of the entire computer equipment.
The internal memory provides an environment for the running of the computer program in the non-volatile storage medium. When the computer program is executed by the processor, the processor is caused to execute a training method of the language transformation model, or any one of the audio recognition methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 12 is only a block diagram of the part of the structure relevant to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied; specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different component layout.
It should be understood that the processor can be a central processing unit (Central Processing Unit, CPU), and the processor can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc. The general-purpose processor can be a microprocessor, or the processor can also be any conventional processor, etc.
The processor is configured to run the computer program stored in the memory, so as to realize the following steps:
obtaining a training phonetic corpus and a data label corresponding to the training phonetic corpus; performing word segmentation processing on the training phonetic corpus to obtain training participle data; performing word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector; obtaining position data of the training participle data in the training phonetic corpus, and performing vector conversion on the position data to obtain a position vector; splicing the word embedding vector with the position vector to obtain a splicing term vector; and, based on a conversion neural network, carrying out model training according to the splicing term vector and the data label to obtain a language transformation model.
In one embodiment, when realizing the obtaining of the position data of the training participle data in the training phonetic corpus, the processor is configured to realize:
based on a position calculation formula, calculating the position data of the training participle data in the training phonetic corpus according to the training participle data; the position calculation formula is:
PE(pos, m) = sin(pos / 10000^(m/d_g)), when m is even;
Or,
PE(pos, m) = cos(pos / 10000^((m-1)/d_g)), when m is odd;
wherein pos is the position of the training participle data, m indicates the dimension of the word embedding vector corresponding to the training participle data, and d_g is the vector dimension corresponding to the training phonetic corpus.
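As an illustration, assuming the position calculation formula takes the standard sinusoidal form used by Transformer-style models (sine on even dimensions, cosine on odd dimensions — this specific form is an assumption), the position vector for one token can be sketched as:

```python
import numpy as np

def position_vector(pos, d_g):
    """Sinusoidal position encoding: sin(pos / 10000**(m/d_g)) on even
    dimensions m, cos with the preceding even exponent on odd dimensions."""
    pe = np.zeros(d_g)
    for m in range(0, d_g, 2):
        angle = pos / 10000 ** (m / d_g)
        pe[m] = np.sin(angle)
        if m + 1 < d_g:
            pe[m + 1] = np.cos(angle)
    return pe

pe0 = position_vector(0, 8)   # at pos 0: all sines are 0, all cosines are 1
pe3 = position_vector(3, 8)   # distinct positions yield distinct vectors
```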
In one embodiment, when realizing the vector conversion of the position data to obtain the position vector, the processor is configured to realize:
determining the arrangement order of the training participle data in the training phonetic corpus; and performing vector conversion on the position data according to the arrangement order to obtain the position vector corresponding to the training participle data.
In one embodiment, when realizing the model training based on the conversion neural network according to the splicing term vector and the data label corresponding to the training phonetic corpus to obtain the language transformation model, the processor is configured to realize:
inputting the splicing term vector into the encoder of the conversion neural network to output training encoded information; inputting the training encoded information into the decoder of the conversion neural network to output a training Chinese text; and verifying the training Chinese text according to the data label, and adjusting the parameters in the encoder and the decoder until the training Chinese text passes verification, thereby obtaining the language transformation model.
In one embodiment, the encoder includes a dot-product attention model and a feedforward neural network model; when realizing the inputting of the splicing term vector into the encoder to output the training encoded information, the processor is configured to realize:
inputting the splicing term vector into the dot-product attention model to output dot-product expressive force information; and inputting the dot-product expressive force information into the feedforward neural network model to output the training encoded information.
In another embodiment, the processor is configured to run the computer program stored in the memory, so as to realize the following steps:
obtaining a target voice signal, and pre-processing the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal; inputting the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence; and inputting the phonetic feature sequence into a language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model described in any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which includes program instructions; when the processor executes the program instructions, the training method of any language transformation model provided by the embodiments of the present application, or any one of the audio recognition methods, is realized.
The computer-readable storage medium can be an internal storage unit of the computer equipment described in the foregoing embodiment, such as the hard disk or memory of the computer equipment. The computer-readable storage medium can also be an external storage device of the computer equipment, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the computer equipment.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A training method of a language transformation model, characterized by comprising:
obtaining a training phonetic corpus and a data label corresponding to the training phonetic corpus;
performing word segmentation processing on the training phonetic corpus to obtain training participle data;
performing word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector;
obtaining position data of the training participle data in the training phonetic corpus, and performing vector conversion on the position data to obtain a position vector;
splicing the word embedding vector with the position vector to obtain a splicing term vector;
based on a conversion neural network, carrying out model training according to the splicing term vector and the data label to obtain the language transformation model.
2. The training method of the language transformation model according to claim 1, characterized in that obtaining the position data of the training participle data in the training phonetic corpus comprises:
based on a position calculation formula, calculating the position data of the training participle data in the training phonetic corpus according to the training participle data; the position calculation formula is:
PE(pos, m) = sin(pos / 10000^(m/d_g)), when m is even;
Or,
PE(pos, m) = cos(pos / 10000^((m-1)/d_g)), when m is odd;
wherein pos is the position of the training participle data, m indicates the dimension of the word embedding vector corresponding to the training participle data, and d_g is the vector dimension corresponding to the training phonetic corpus.
3. The training method of the language transformation model according to claim 1, characterized in that performing vector conversion on the position data to obtain the position vector comprises:
determining the arrangement order of the training participle data in the training phonetic corpus;
performing vector conversion on the position data according to the arrangement order to obtain the position vector corresponding to the training participle data.
4. The training method of the language transformation model according to any one of claims 1-3, characterized in that, based on the conversion neural network, carrying out model training according to the splicing term vector and the data label corresponding to the training phonetic corpus to obtain the language transformation model comprises:
inputting the splicing term vector into an encoder of the conversion neural network to output training encoded information;
inputting the training encoded information into a decoder of the conversion neural network to output a training Chinese text;
verifying the training Chinese text according to the data label, and adjusting the parameters in the encoder and the decoder until the training Chinese text passes verification, thereby obtaining the language transformation model.
5. The training method of the language transformation model according to claim 4, characterized in that the encoder includes a dot-product attention model and a feedforward neural network model; inputting the splicing term vector into the encoder to output the training encoded information comprises:
inputting the splicing term vector into the dot-product attention model to output dot-product expressive force information;
inputting the dot-product expressive force information into the feedforward neural network model to output the training encoded information.
6. An audio recognition method, characterized by comprising:
obtaining a target voice signal, and pre-processing the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal;
inputting the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence;
inputting the phonetic feature sequence into a language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model according to any one of claims 1-5.
7. A training device of a language transformation model, characterized by comprising:
a corpus acquiring unit, configured to obtain a training phonetic corpus and a data label corresponding to the training phonetic corpus;
a word segmentation processing unit, configured to perform word segmentation processing on the training phonetic corpus to obtain training participle data;
a vector conversion unit, configured to perform word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector;
a position acquisition unit, configured to obtain position data of the training participle data in the training phonetic corpus, and to perform vector conversion on the position data to obtain a position vector;
a vector concatenation unit, configured to splice the word embedding vector with the position vector to obtain a splicing term vector;
a model training unit, configured to, based on a conversion neural network, carry out model training according to the splicing term vector and the data label to obtain the language transformation model.
8. A speech recognition device, characterized by comprising:
a signal acquiring unit, configured to obtain a target voice signal, and to pre-process the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal;
a frequency spectrum input unit, configured to input the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence;
a text acquiring unit, configured to input the phonetic feature sequence into a language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model according to any one of claims 1-5.
9. Computer equipment, characterized in that the computer equipment includes a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to realize the training method of the language transformation model according to any one of claims 1 to 5 or the audio recognition method according to claim 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to realize the training method of the language transformation model according to any one of claims 1 to 5 or the audio recognition method according to claim 6.
CN201910522750.8A 2019-06-17 2019-06-17 Audio recognition method, the training method of model, device, equipment and storage medium Pending CN110288980A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910522750.8A CN110288980A (en) 2019-06-17 2019-06-17 Audio recognition method, the training method of model, device, equipment and storage medium
PCT/CN2019/118227 WO2020253060A1 (en) 2019-06-17 2019-11-13 Speech recognition method, model training method, apparatus and device, and storage medium

Publications (1)

Publication Number Publication Date
CN110288980A true CN110288980A (en) 2019-09-27

Family

ID=68005146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910522750.8A Pending CN110288980A (en) 2019-06-17 2019-06-17 Audio recognition method, the training method of model, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110288980A (en)
WO (1) WO2020253060A1 (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827816A (en) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and device, electronic equipment and storage medium
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111090886A (en) * 2019-12-31 2020-05-01 新奥数能科技有限公司 Desensitization data determination method and device, readable medium and electronic equipment
CN111144370A (en) * 2019-12-31 2020-05-12 科大讯飞华南人工智能研究院(广州)有限公司 Document element extraction method, device, equipment and storage medium
CN111222335A (en) * 2019-11-27 2020-06-02 上海眼控科技股份有限公司 Corpus correction method and device, computer equipment and computer-readable storage medium
CN111382340A (en) * 2020-03-20 2020-07-07 北京百度网讯科技有限公司 Information identification method, information identification device and electronic equipment
CN111681669A (en) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural network-based voice data identification method and equipment
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN111833849A (en) * 2020-03-10 2020-10-27 北京嘀嘀无限科技发展有限公司 Method for speech recognition and speech model training, storage medium and electronic device
CN111859994A (en) * 2020-06-08 2020-10-30 北京百度网讯科技有限公司 Method, device and storage medium for obtaining machine translation model and translating text
CN111881726A (en) * 2020-06-15 2020-11-03 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
WO2020253060A1 (en) * 2019-06-17 2020-12-24 平安科技(深圳)有限公司 Speech recognition method, model training method, apparatus and device, and storage medium
CN112132281A (en) * 2020-09-29 2020-12-25 腾讯科技(深圳)有限公司 Model training method, device, server and medium based on artificial intelligence
CN112133304A (en) * 2020-09-18 2020-12-25 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112417086A (en) * 2020-11-30 2021-02-26 深圳市欢太科技有限公司 Data processing method, device, server and storage medium
CN112528637A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 Text processing model training method and device, computer equipment and storage medium
CN112820269A (en) * 2020-12-31 2021-05-18 平安科技(深圳)有限公司 Text-to-speech method, device, electronic equipment and storage medium
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112951204A (en) * 2021-03-29 2021-06-11 北京大米科技有限公司 Speech synthesis method and device
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113297346A (en) * 2021-06-28 2021-08-24 中国平安人寿保险股份有限公司 Text intention recognition method, device, equipment and storage medium
CN113486671A (en) * 2021-07-27 2021-10-08 平安科技(深圳)有限公司 Data expansion method, device, equipment and medium based on regular expression coding
CN113761841A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150112679A1 (en) * 2013-10-18 2015-04-23 Via Technologies, Inc. Method for building language model, speech recognition method and electronic apparatus
CN107204184A (en) * 2017-05-10 2017-09-26 平安科技(深圳)有限公司 Audio recognition method and system
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109800298A (en) * 2019-01-29 2019-05-24 苏州大学 A kind of training method of Chinese word segmentation model neural network based
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN109817246A (en) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109684452A (en) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 A kind of neural network problem generation method based on answer Yu answer location information
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253060A1 (en) * 2019-06-17 2020-12-24 平安科技(深圳)有限公司 Speech recognition method, model training method, apparatus and device, and storage medium
CN110827816A (en) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and device, electronic equipment and storage medium
CN111222335A (en) * 2019-11-27 2020-06-02 上海眼控科技股份有限公司 Corpus correction method and device, computer equipment and computer-readable storage medium
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN110970031B (en) * 2019-12-16 2022-06-24 思必驰科技股份有限公司 Speech recognition system and method
CN111090886A (en) * 2019-12-31 2020-05-01 新奥数能科技有限公司 Desensitization data determination method and device, readable medium and electronic equipment
CN111144370A (en) * 2019-12-31 2020-05-12 科大讯飞华南人工智能研究院(广州)有限公司 Document element extraction method, device, equipment and storage medium
CN111144370B (en) * 2019-12-31 2023-08-04 科大讯飞华南人工智能研究院(广州)有限公司 Document element extraction method, device, equipment and storage medium
CN111833849A (en) * 2020-03-10 2020-10-27 北京嘀嘀无限科技发展有限公司 Method for speech recognition and speech model training, storage medium and electronic device
CN111382340A (en) * 2020-03-20 2020-07-07 北京百度网讯科技有限公司 Information identification method, information identification device and electronic equipment
US11113468B1 (en) * 2020-05-08 2021-09-07 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN111681669A (en) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural network-based voice data identification method and equipment
CN111859994B (en) * 2020-06-08 2024-01-23 北京百度网讯科技有限公司 Machine translation model acquisition and text translation method, device and storage medium
CN111859994A (en) * 2020-06-08 2020-10-30 北京百度网讯科技有限公司 Method, device and storage medium for obtaining machine translation model and translating text
CN111881726A (en) * 2020-06-15 2020-11-03 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN112002306B (en) * 2020-08-26 2024-04-05 阳光保险集团股份有限公司 Speech class recognition method and device, electronic equipment and readable storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
CN112133304A (en) * 2020-09-18 2020-12-25 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112133304B (en) * 2020-09-18 2022-05-06 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112132281A (en) * 2020-09-29 2020-12-25 腾讯科技(深圳)有限公司 Model training method, device, server and medium based on artificial intelligence
CN112417086B (en) * 2020-11-30 2024-02-27 深圳市与飞科技有限公司 Data processing method, device, server and storage medium
CN112417086A (en) * 2020-11-30 2021-02-26 深圳市欢太科技有限公司 Data processing method, device, server and storage medium
CN112528637B (en) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 Text processing model training method, device, computer equipment and storage medium
CN112528637A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 Text processing model training method and device, computer equipment and storage medium
CN112820269A (en) * 2020-12-31 2021-05-18 平安科技(深圳)有限公司 Text-to-speech method, device, electronic equipment and storage medium
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device
CN113129869B (en) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN112951204A (en) * 2021-03-29 2021-06-11 北京大米科技有限公司 Speech synthesis method and device
CN113761841A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features
CN113761841B (en) * 2021-04-19 2023-07-25 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features
CN112906403B (en) * 2021-04-25 2023-02-03 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113297346B (en) * 2021-06-28 2023-10-31 中国平安人寿保险股份有限公司 Text intention recognition method, device, equipment and storage medium
CN113297346A (en) * 2021-06-28 2021-08-24 中国平安人寿保险股份有限公司 Text intention recognition method, device, equipment and storage medium
CN113486671B (en) * 2021-07-27 2023-06-30 平安科技(深圳)有限公司 Regular expression coding-based data expansion method, device, equipment and medium
CN113486671A (en) * 2021-07-27 2021-10-08 平安科技(深圳)有限公司 Data expansion method, device, equipment and medium based on regular expression coding

Also Published As

Publication number Publication date
WO2020253060A1 (en) 2020-12-24

Similar Documents

Publication Publication Date Title
CN110288980A (en) Speech recognition method, model training method, device, equipment and storage medium
US11948066B2 (en) Processing sequences using convolutional neural networks
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110264991A (en) Speech synthesis model training method, speech synthesis method, device, equipment and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
US11823656B2 (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114882862A (en) Voice processing method and related equipment
CN115394321A (en) Audio emotion recognition method, device, equipment, storage medium and product
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN112580669B (en) Training method and device for voice information
CN112270184B (en) Natural language processing method, device and storage medium
CN108962228A (en) Model training method and device
CN112580325B (en) Rapid text matching method and device
US20230351752A1 (en) Moment localization in media stream
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
JP2022121386A (en) Speaker diarization correction method and system utilizing text-based speaker change detection
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113591472A (en) Lyric generation method, lyric generation model training method and device and electronic equipment
CN113469197A (en) Image-text matching method, device, equipment and storage medium
CN114093340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination