CN110288980A - Speech recognition method, model training method, apparatus, device and storage medium - Google Patents

Speech recognition method, model training method, apparatus, device and storage medium

Info

Publication number
CN110288980A
Authority
CN
China
Prior art keywords: training, vector, model, data, trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910522750.8A
Other languages
Chinese (zh)
Inventor
王健宗 (Wang Jianzong)
魏文琦 (Wei Wenqi)
贾雪丽 (Jia Xueli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910522750.8A priority Critical patent/CN110288980A/en
Publication of CN110288980A publication Critical patent/CN110288980A/en
Priority to PCT/CN2019/118227 priority patent/WO2020253060A1/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

This application relates to the field of biometric recognition, and in particular uses a Transformer network in combination with liveness detection. It discloses a speech recognition method, a model training method, an apparatus, a device, and a storage medium. The training method includes: obtaining a training pinyin corpus and a data label corresponding to the training pinyin corpus; performing word segmentation on the training pinyin corpus to obtain training segmented data; performing word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector; obtaining position data of the training segmented data within the training pinyin corpus, and performing vector conversion on the position data, to obtain a position vector; splicing the word embedding vector with the position vector, to obtain a spliced word vector; and, based on a Transformer network, performing model training according to the spliced word vector and the data label, to obtain a language conversion model.

Description

Speech recognition method, model training method, apparatus, device and storage medium
Technical field
This application relates to the technical field of model training, and in particular to a speech recognition method, a training method for a language conversion model, an apparatus, a device, and a storage medium.
Background
Speech recognition, also known as automatic speech recognition (Automatic Speech Recognition, ASR), refers to the technology by which a machine recognizes and understands a speech signal and converts it into text. It is widely used in fields such as smart homes and voice input, and greatly facilitates people's lives. However, most existing speech recognition is implemented with models such as recurrent neural networks (Recurrent Neural Networks, RNN), long short-term memory networks (Long Short-Term Memory, LSTM), or gated recurrent units (Gated Recurrent Unit, GRU). Speech recognition based on such models is a sequential computation, and sequential computation can lose information, which harms recognition accuracy and at the same time lowers recognition efficiency. How to improve the efficiency and accuracy of speech recognition has therefore become a problem to be solved urgently.
Summary of the invention
This application provides a speech recognition method, a training method for a language conversion model, an apparatus, a computer device, and a storage medium, so that when the language conversion model is applied to speech recognition, the accuracy and efficiency of speech recognition are improved.
In a first aspect, this application provides a training method for a language conversion model, the method comprising:
obtaining a training pinyin corpus and a data label corresponding to the training pinyin corpus;
performing word segmentation on the training pinyin corpus, to obtain training segmented data;
performing word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector;
obtaining position data of the training segmented data within the training pinyin corpus, and performing vector conversion on the position data, to obtain a position vector;
splicing the word embedding vector with the position vector, to obtain a spliced word vector;
based on a Transformer network, performing model training according to the spliced word vector and the data label, to obtain the language conversion model.
In a second aspect, this application provides a speech recognition method, the method comprising:
obtaining a target speech signal, and preprocessing the target speech signal according to a preset processing rule, to obtain a spectral vector corresponding to the target speech signal;
inputting the spectral vector into a preset phoneme model, to obtain a pinyin feature sequence;
inputting the pinyin feature sequence into a language conversion model, to obtain a target Chinese text, the language conversion model being obtained by training with the training method of the language conversion model described above.
In a third aspect, this application further provides a training apparatus for a language conversion model, the apparatus comprising:
a corpus acquisition unit, configured to obtain a training pinyin corpus and a data label corresponding to the training pinyin corpus;
a word segmentation unit, configured to perform word segmentation on the training pinyin corpus, to obtain training segmented data;
a vector conversion unit, configured to perform word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector;
a position acquisition unit, configured to obtain position data of the training segmented data within the training pinyin corpus, and to perform vector conversion on the position data, to obtain a position vector;
a vector splicing unit, configured to splice the word embedding vector with the position vector, to obtain a spliced word vector;
a model training unit, configured to perform, based on a Transformer network, model training according to the spliced word vector and the data label, to obtain the language conversion model.
In a fourth aspect, this application further provides a speech recognition apparatus, the apparatus comprising:
a signal acquisition unit, configured to obtain a target speech signal and to preprocess the target speech signal according to a preset processing rule, to obtain a spectral vector corresponding to the target speech signal;
a spectrum input unit, configured to input the spectral vector into a preset phoneme model, to obtain a pinyin feature sequence;
a text acquisition unit, configured to input the pinyin feature sequence into a language conversion model, to obtain a target Chinese text, the language conversion model being obtained by training with the training method of the language conversion model described above.
In a fifth aspect, this application further provides a computer device. The computer device includes a memory and a processor; the memory is configured to store a computer program; and the processor is configured to execute the computer program and, when executing the computer program, to implement the training method of the language conversion model described above or the speech recognition method described above.
In a sixth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the training method of the language conversion model described above or the speech recognition method described above.
This application discloses a speech recognition method, a model training method, an apparatus, a device, and a storage medium. A spliced word vector is obtained by splicing the word embedding vector with the position vector; based on a Transformer network, model training is performed according to the spliced word vector and the data label, to obtain a language conversion model. When this language conversion model is applied to speech recognition, it replaces the sequential computation of conventional speech recognition and avoids the loss of position information, thereby improving recognition accuracy and efficiency.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a training method for a language conversion model provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of sub-steps of the training method of the language conversion model in Fig. 1;
Fig. 3 is a schematic diagram of obtaining a spliced word vector, provided by an embodiment of this application;
Fig. 4 is a schematic flowchart of sub-steps of the training method of the language conversion model in Fig. 1;
Fig. 5 is a schematic flowchart of sub-steps of one embodiment of outputting training encoding information in Fig. 4;
Fig. 6 is a schematic flowchart of sub-steps of another embodiment of outputting training encoding information in Fig. 4;
Fig. 7 is a schematic flowchart of a speech recognition method provided by an embodiment of this application;
Fig. 8 is a schematic flowchart of sub-steps of the speech recognition method in Fig. 7;
Fig. 9 is a schematic block diagram of a training apparatus for a language conversion model provided by an embodiment of this application;
Fig. 10 is a schematic block diagram of sub-modules of the training apparatus of the language conversion model in Fig. 9;
Fig. 11 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of this application;
Fig. 12 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The flowcharts shown in the drawings are only illustrative; they need not include all of the content and operations/steps, nor must the steps be executed in the described order. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
An embodiment of this application provides a training method for a language conversion model, a speech recognition method, an apparatus, a computer device, and a storage medium. When the language conversion model is applied to speech recognition, recognition efficiency and accuracy can be improved.
Some embodiments of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
Referring to Fig. 1, Fig. 1 is a schematic step flowchart of a training method for a language conversion model provided by an embodiment of this application.
As shown in Fig. 1, the training method of the language conversion model specifically includes steps S101 to S106.
S101: obtain a training pinyin corpus and a data label corresponding to the training pinyin corpus.
Specifically, pinyin text can be collected according to the actual application scenario as the training pinyin corpus. For example, for news-domain speech, the pinyin of Chinese sentences commonly used in the news field can be collected as the training pinyin corpus.
The data label is the true Chinese text corresponding to the training pinyin corpus. For example, the true Chinese text corresponding to the training pinyin corpus "wo3xi3huan1bei3jing1" is "我喜欢北京" ("I like Beijing"), so the data label corresponding to this training pinyin corpus is "我喜欢北京".
S102: perform word segmentation on the training pinyin corpus, to obtain training segmented data.
For example, word segmentation can be performed on the training pinyin corpus with a dictionary-based segmentation algorithm or with a statistics-based machine learning algorithm.
In some embodiments, the specific process of performing word segmentation on the training pinyin corpus, i.e. step S102, specifically includes: performing word segmentation on the training pinyin corpus according to a preset dictionary, to obtain training segmented data.
The dictionary is a candidate set of common words. For a training pinyin corpus such as "wo3xi3huan1bei3jing1" ("I like Beijing"), the corpus is traversed from head to tail, and whenever a word from the dictionary occurs in the corpus, that word is cut off, so that "wo3xi3huan1bei3jing1" is segmented into the three pieces of training segmented data "wo3", "xi3huan1", and "bei3jing1". The digits "3" and "1" denote tones.
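The dictionary traversal described above can be sketched in a few lines of Python; the dictionary contents and function name here are illustrative assumptions, not taken from the patent, and longest-match-first stands in for whatever matching policy an implementation would actually use:

```python
# Hypothetical sketch of dictionary-based pinyin segmentation
# (forward maximum matching); the dictionary below is illustrative.
DICTIONARY = {"wo3", "xi3huan1", "bei3jing1", "xi3", "bei3"}

def segment(corpus, dictionary=DICTIONARY):
    """Traverse the corpus from head to tail, cutting off the
    longest dictionary word found at each position."""
    result, i = [], 0
    while i < len(corpus):
        # Try the longest candidate first so "xi3huan1" beats "xi3".
        for j in range(len(corpus), i, -1):
            if corpus[i:j] in dictionary:
                result.append(corpus[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return result

print(segment("wo3xi3huan1bei3jing1"))
# ['wo3', 'xi3huan1', 'bei3jing1']
```
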
In other embodiments, the specific process of performing word segmentation on the training pinyin corpus, i.e. step S102, specifically includes: performing one-hot encoding on the training pinyin corpus according to a preset dictionary, to obtain training segmented data.
One-hot encoding is an efficient encoding scheme: for the words of a given attribute, there are as many bits as there are states, and exactly one bit is 1 while all the others are 0.
For example, the preset dictionary contains the words corresponding to the attribute "season": the pinyin of spring "chun1tian1", the pinyin of summer "xia4tian1", the pinyin of autumn "qiu1tian1", the pinyin of winter "dong1tian1", and the "other" entry "qi2ta1". This attribute has 5 distinct values, so 5 bits are needed to indicate which value the attribute takes. For example, the one-hot code of "chun1tian1" is {10000}, the one-hot code of "xia4tian1" is {01000}, the one-hot code of "qiu1tian1" is {00100}, the one-hot code of "dong1tian1" is {00010}, and the one-hot code of "qi2ta1" is {00001}.
The preset dictionary may also contain attributes such as person, fruit, gender, and motion mode, i.e. the words and one-hot codes corresponding to each attribute.
If a pinyin corpus contains multiple words, their one-hot codes are concatenated in turn when encoding. For example, the one-hot code of the summer pinyin "xia4tian1" is {01000} and the one-hot code of the "hot" pinyin "re4" is {001}, so connecting the two head to tail gives the one-hot code {01000001}.
Processing the pinyin corpus with one-hot encoding makes the data sparse, and the one-hot-encoded data carries the word-attribute information of the pinyin corpus.
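A minimal sketch of the one-hot scheme above, reusing the season vocabulary from the example (the function names are illustrative assumptions):

```python
# One-hot encoding over the "season" attribute from the example above.
SEASON_WORDS = ["chun1tian1", "xia4tian1", "qiu1tian1", "dong1tian1", "qi2ta1"]

def one_hot(word, vocabulary):
    """One bit per state; exactly one bit is 1, the rest are 0."""
    return [1 if w == word else 0 for w in vocabulary]

def encode_corpus(words, vocabulary):
    """Concatenate the one-hot codes of consecutive words head to tail."""
    code = []
    for w in words:
        code.extend(one_hot(w, vocabulary))
    return code

print(one_hot("xia4tian1", SEASON_WORDS))  # [0, 1, 0, 0, 0]
```
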
After word segmentation is performed on the training pinyin corpus, the training segmented data corresponding to the training pinyin corpus is obtained.
For example, the training segmented data corresponding to a certain training pinyin corpus is: 100000001000000001000010010000.
S103: perform word-vector conversion on the training segmented data according to a preset word embedding model, to obtain a word embedding vector.
After the training segmented data is obtained, word-vector conversion is performed on the training segmented data according to the preset word embedding model, to obtain a word embedding vector.
In one embodiment, the preset word embedding model may be a Word2vec (word to vector) word embedding model. Multiple pieces of training segmented data form a training segmented data set. According to the Word2vec word embedding model, each piece of training segmented data in the set can be represented by one word embedding vector. In one embodiment, the dimension of the word embedding vector is 512.
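As a sketch only: the patent relies on a trained Word2vec model, while the toy table below simply fixes one vector per word to show the lookup step (the vocabulary, seed, and 8-dimensional size are illustrative assumptions; the patent uses dimension 512):

```python
import random

# Illustrative stand-in for a trained Word2vec table: each segmented
# pinyin word maps to one fixed embedding vector.
random.seed(0)
EMB_DIM = 8
VOCAB = ["wo3", "xi3huan1", "bei3jing1"]
EMBEDDING = {w: [random.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for w in VOCAB}

def embed(words):
    """Look up one word embedding vector per piece of segmented data."""
    return [EMBEDDING[w] for w in words]

vectors = embed(["wo3", "xi3huan1", "bei3jing1"])
print(len(vectors), len(vectors[0]))  # 3 8
```
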
It should be understood that, in other embodiments, the preset word embedding model may also be another pre-trained neural network model, such as a deep neural network (Deep Neural Network, DNN) model.
S104: obtain position data of the training segmented data within the training pinyin corpus, and perform vector conversion on the position data, to obtain a position vector.
Specifically, after the position data corresponding to the training segmented data is obtained, vector conversion is performed on the position data, to obtain a position vector corresponding to the position data.
In one embodiment, obtaining the position data of the training segmented data within the training pinyin corpus comprises:
computing, based on a position calculation formula, the position data of the training segmented data within the training pinyin corpus, the position calculation formula being:
PE(pos, 2m) = sin(pos / 10000^(2m / d_g))
or
PE(pos, 2m+1) = cos(pos / 10000^(2m / d_g))
where pos is the position of the training segmented data, 2m or (2m+1) denotes the dimension of the word embedding vector corresponding to the training segmented data, and d_g is the vector dimension corresponding to the training pinyin corpus.
Specifically, when the dimension of the word embedding vector corresponding to the training segmented data is even, the first formula is used to compute the position data of the training segmented data within the training pinyin corpus; when the dimension of the word embedding vector is odd, the second formula is used.
For example, suppose d_g is 512, the position pos of training segmented data R in the training pinyin corpus is 20, and the dimension 2m of the word embedding vector corresponding to R is 128. By the above position calculation formula, the position data of R within the training pinyin corpus is sin(20 / 10000^(128/512)) = sin(2) ≈ 0.909.
As another example, suppose d_g is 512, the position pos of training segmented data R is 20, and the dimension 2m+1 of the corresponding word embedding vector is 129. By the above position calculation formula, the position data of R within the training pinyin corpus is cos(20 / 10000^(128/512)) = cos(2) ≈ -0.416.
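Assuming the standard sinusoidal form implied by the worked examples (sine for an even embedding dimension 2m, cosine for an odd dimension 2m + 1), the position calculation can be sketched as:

```python
import math

def position_data(pos, dim, d_g=512):
    """Sinusoidal position value: sine for an even embedding
    dimension 2m, cosine for an odd dimension 2m + 1."""
    m2 = dim if dim % 2 == 0 else dim - 1  # the 2m in the exponent
    angle = pos / (10000 ** (m2 / d_g))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

# The worked examples above: pos = 20, d_g = 512.
print(round(position_data(20, 128), 3))  # 0.909  (sin(2))
print(round(position_data(20, 129), 3))  # -0.416 (cos(2))
```
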
As shown in Fig. 2, in one embodiment, the step of performing vector conversion on the position data to obtain a position vector includes sub-steps S104a and S104b.
S104a: determine the order of the training segmented data within the training pinyin corpus.
For example, for the training pinyin corpus "wo3xi3huan1bei3jing1", the order of the training segmented data "wo3" within the corpus is 1, the order of "xi3huan1" is 2, and the order of "bei3jing1" is 3.
S104b: perform vector conversion on the position data according to the order, to obtain a position vector corresponding to the training segmented data.
Specifically, vector conversion is performed on each piece of position data according to the order of the training segmented data within the training pinyin corpus.
For example, the position data of the training segmented data "wo3" is 0.863 and its order is 1, so the position vector corresponding to "wo3" is (0.863, 0, 0); the position data of "xi3huan1" is 0.125 and its order is 2, so the position vector corresponding to "xi3huan1" is (0, 0.125, 0); the position data of "bei3jing1" is 0.928 and its order is 3, so the position vector corresponding to "bei3jing1" is (0, 0, 0.928).
S105: splice the word embedding vector with the position vector, to obtain a spliced word vector.
Specifically, after the word embedding vector and the position vector are obtained, the word embedding vector and the position vector are spliced to obtain the spliced word vector.
In one embodiment, splicing the word embedding vector with the position vector to obtain the spliced word vector specifically includes: summing the word embedding vector with the position vector, to obtain the spliced word vector.
For example, word segmentation is performed on the training pinyin corpus "wo3xi3huan1bei3jing1", obtaining the three pieces of training segmented data "wo3", "xi3huan1", and "bei3jing1", whose word embedding vectors are A1, A2, and A3 and whose position vectors are B1, B2, and B3 respectively. Suppose the word embedding vectors and position vectors are four-dimensional; the spliced word vectors corresponding to the three pieces of training segmented data are C1, C2, and C3, where, as shown in Fig. 3, C1 = A1 + B1, C2 = A2 + B2, and C3 = A3 + B3.
In another embodiment, splicing the word embedding vector with the position vector to obtain the spliced word vector specifically includes: connecting the word embedding vector and the position vector end to end, to obtain the spliced word vector.
In one embodiment, the word embedding vector is followed by the position vector. For example, if the word embedding vector is (1, 0, 0) and the position vector is (0, 0.125, 0), the resulting spliced word vector is (1, 0, 0, 0, 0.125, 0). In another embodiment, the position vector is followed by the word embedding vector; with the same vectors, the resulting spliced word vector is (0, 0.125, 0, 1, 0, 0).
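Both splicing embodiments, summation and end-to-end connection, can be sketched directly (the function names are illustrative):

```python
def splice_by_sum(word_vec, pos_vec):
    """Spliced word vector as the element-wise sum (first embodiment)."""
    return [w + p for w, p in zip(word_vec, pos_vec)]

def splice_by_concat(word_vec, pos_vec):
    """Spliced word vector as the concatenation (second embodiment)."""
    return word_vec + pos_vec

word_vec = [1, 0, 0]
pos_vec = [0, 0.125, 0]
print(splice_by_sum(word_vec, pos_vec))     # [1, 0.125, 0]
print(splice_by_concat(word_vec, pos_vec))  # [1, 0, 0, 0, 0.125, 0]
```
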
S106: based on a Transformer network, perform model training according to the spliced word vector and the data label, to obtain a language conversion model.
Specifically, the conversion neural network, i.e. the Transformer network (Transformer for short), is a highly parallelizable neural network. When model training is performed on the basis of the Transformer network according to the spliced word vector and the data label, the training speed is significantly improved.
As shown in Fig. 4, in one embodiment, the step of performing model training based on the Transformer network according to the spliced word vector and the data label corresponding to the training pinyin corpus, to obtain the language conversion model, includes steps S201 to S203.
S201: input the spliced word vector into an encoder of the Transformer network, to output training encoding information.
Specifically, the Transformer network includes an encoder and a decoder, and information can be transmitted and exchanged between them. The encoder and the decoder may each include multiple layers, and the dimensions of the encoder layers are the same as the dimensions of the decoder layers.
In one embodiment, the encoder includes a dot-product attention model and a feedforward neural network (Feed Forward). Attention represents the association between words. In one embodiment, the attention represents, in the language conversion process from the pinyin side to the Chinese side, the correspondence between words whose order may be inverted relative to each other.
Specifically, referring to Fig. 5, inputting the spliced word vector into the encoder of the Transformer network to output training encoding information, as described in step S201, specifically includes sub-steps S201a and S201b.
S201a: input the spliced word vector into the dot-product attention model, to output dot-product expressiveness information.
Specifically, the dot-product attention model is:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
where Q denotes the query, K denotes the key, V denotes the value, and d_k denotes the dimension of Q and K.
Specifically, three vectors are set in the dot-product attention model, namely the query vector, the key vector, and the value vector, abbreviated Q, K, and V respectively. The spliced word vector is input into the dot-product attention model, and the output dot-product expressiveness information Attention(Q, K, V) represents the expressiveness of the corresponding training segmented data at the current position; this process is highly parallelizable.
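A minimal pure-Python sketch of scaled dot-product attention, softmax(Q·K^T / sqrt(d_k))·V, with toy 2-by-2 inputs (the values are illustrative, not the patent's trained parameters):

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    exps = [math.exp(v - max(row)) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q*K^T / sqrt(d_k))*V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# Toy 2-position, 2-dimensional example.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print([[round(v, 3) for v in row] for row in out])
```
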
S201b: input the dot-product expressiveness information into the feedforward neural network model, to output training encoding information.
Specifically, the feedforward neural network model is:
FFN(Y) = max(0, Y·W1 + b1)·W2 + b2
where Y is the dot-product expressiveness information, W1 and W2 are weights, and b1 and b2 are bias terms.
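A sketch of the feedforward step for a single vector, assuming the usual ReLU form max(0, Y·W1 + b1)·W2 + b2 (the tiny 2-3-2 weights are illustrative assumptions):

```python
def feed_forward(Y, W1, b1, W2, b2):
    """FFN(Y) = max(0, Y*W1 + b1)*W2 + b2 for a single vector Y."""
    hidden = [max(0.0, sum(y * w for y, w in zip(Y, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Tiny illustrative weights mapping 2 -> 3 -> 2 dimensions.
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, 0.1]
print(feed_forward([1.0, 2.0], W1, b1, W2, b2))  # [2.1, 3.1]
```
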
In another embodiment, the encoder includes a multi-head attention model and a feedforward neural network (Feed Forward). Attention represents the association between words. In one embodiment, the attention represents, in the language conversion process from the pinyin side to the Chinese side, the correspondence between words whose order may be inverted relative to each other.
As shown in Fig. 6, inputting the spliced word vector into the encoder of the Transformer network to output training encoding information specifically includes sub-steps S201c and S201d.
S201c: input the spliced word vector into the multi-head attention model, to output multi-head expressiveness information.
Here the multi-head attention model is:
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W^0
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
where W_i^Q, W_i^K, W_i^V, and W^0 are learned projection matrices, and d_g is the dimension of the word embedding vector.
Specifically, multiple Q, K, and V matrices, together with the matrix of actual values, are set in the multi-head attention model. This gives the model more training parameters, which improves model capability, takes attention at different positions into account, and assigns multiple subspaces to the attention. The spliced word vector is input into the multi-head attention model; the output multi-head expressiveness information MultiHead(Q, K, V) represents the expressiveness of the corresponding training segmented data at the current position, and the process is highly parallelizable and runs fast.
S201d: input the multi-head expressiveness information into the feedforward neural network model, to output training encoding information.
It should be understood that the feedforward neural network model in this step is the same as the feedforward neural network model in step S201b, and details are not repeated here.
S202: input the training encoding information into a decoder of the Transformer network, to output a training Chinese text.
In one embodiment, the decoder and the encoder both have multiple layers, and each decoder layer has one more sub-network than an encoder layer, namely the encoder-decoder attention (Encoder-Decoder Attention), which represents the attention mechanism from the source side to the target side. Specifically, the encoder-decoder attention represents the dependency between the words on the pinyin side and the Chinese words generated from the pinyin side.
S203: verify the training Chinese text against the data label, and adjust the parameters in the encoder and the decoder until the training Chinese text passes verification, to obtain the language conversion model.
Specifically, a suitable loss function, such as the cross-entropy loss function, can be used to measure the inconsistency between the data label and the training Chinese text; the smaller the loss, the better the robustness of the model. For example, when the loss function falls below a preset threshold, the training Chinese text passes verification; model training is then stopped, and the language conversion model is obtained.
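For a single output position, the cross-entropy check mentioned above reduces to the negative log-probability that the model assigns to the labeled Chinese character; a sketch with an illustrative 4-character vocabulary:

```python
import math

def cross_entropy(predicted, label_index):
    """Cross-entropy between a predicted probability distribution and
    the one-hot data label: -log p(true class)."""
    return -math.log(predicted[label_index])

# Illustrative: the model's distribution over a 4-character vocabulary,
# where index 2 is the character given by the data label.
loss_bad = cross_entropy([0.25, 0.25, 0.25, 0.25], 2)
loss_good = cross_entropy([0.05, 0.05, 0.85, 0.05], 2)
print(round(loss_bad, 3), round(loss_good, 3))  # 1.386 0.163
```

A better prediction yields a smaller loss, which is what the preset-threshold stopping rule exploits.
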
In the training method of the language conversion model provided by the above embodiment, a spliced word vector is obtained by splicing the word embedding vector with the position vector, and, based on a Transformer network, model training is performed according to the spliced word vector and the data label to obtain a language conversion model. When this language conversion model is applied to speech recognition, it replaces the sequential computation of conventional speech recognition and avoids the loss of position information, thereby improving recognition accuracy and efficiency.
Referring to Fig. 7, Fig. 7 is a schematic flow diagram of the audio recognition method provided by an embodiment of the present application. The speech recognition method can be applied in a terminal or a server and is used for converting a voice signal into a Chinese text.
As shown in Fig. 7, the audio recognition method comprises steps S301 to S303.
S301: obtain a target voice signal, and pre-process the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal.
Specifically, "voice" refers to audio with linguistic properties; it can be produced by a human body or by an electronic device such as a loudspeaker.
In this embodiment, the corresponding voice signal can be collected by a recording device while chatting with the user. The recording device may be, for example, a recording pen, a smart phone, a tablet computer, a notebook computer or a smart wearable device such as a smart bracelet or a smart watch.
The preset processing rule is specifically used for converting the target voice signal into information in the frequency domain, for example by using a Fast Fourier Transform (FFT) rule or a wavelet transform rule to convert the target voice information collected in the time domain into information in the frequency domain.
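As a minimal sketch of such a time-to-frequency conversion (the frame length, hop size and Hann window here are illustrative assumptions, not the preset processing rule of this application):

```python
import numpy as np

def spectral_vectors(signal, frame_len=256, hop=128):
    """Frame the time-domain signal, window each frame, and FFT it,
    yielding one magnitude-spectrum vector per frame."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

sr = 8000                                  # sample rate (Hz)
t = np.arange(sr) / sr                     # one second of audio
sig = np.sin(2 * np.pi * 440 * t)          # a pure 440 Hz tone
spec = spectral_vectors(sig)
peak_hz = spec[0].argmax() * sr / 256      # frequency of the strongest bin
```

For the pure tone above, the strongest frequency bin of the first frame lands within one bin width (sr/256 = 31.25 Hz) of 440 Hz.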
S302: input the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence.
The preset phoneme model can be obtained by training an initial neural network with a large amount of spectral-vector/pinyin sample data. The initial neural network can be any of various neural networks, for example a convolutional neural network, a recurrent neural network or a long short-term memory network.
Specifically, as shown in Fig. 8, inputting the spectral vectors into the preset phoneme model to obtain the phonetic feature sequence comprises: S302a: identifying, according to the spectral vectors, the tone, initial consonant and final corresponding to the spectral vectors; S302b: integrating the tone, initial consonant and final to obtain the phonetic feature sequence of the Chinese text.
Specifically, the tones include the first tone (high level tone), the second tone (rising tone), the third tone (dipping tone), the fourth tone (falling tone) and the neutral tone. The neutral tone, the first tone, the second tone, the third tone and the fourth tone can be represented by the digits "0", "1", "2", "3" and "4" respectively.
For example, after the spectral vectors corresponding to the source voice data of "I like Beijing" are input into the preset phoneme model, the tones corresponding to the spectral vectors can be identified, in chronological order, as "3", "3", "1", "3", "1"; the corresponding initial consonants, in chronological order, as "w", "x", "h", "b", "j"; and the corresponding finals, in chronological order, as "o", "i", "uan", "ei", "ing".
After the tone, initial consonant and final corresponding to the spectral vectors are identified, they are integrated to obtain the phonetic feature sequence {wo3xi3huan1bei3jing1} of the Chinese text "I like Beijing".
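The integration of S302b — interleaving each syllable's initial consonant, final and tone digit into one pinyin feature sequence — can be sketched as follows (a pure-Python illustration; the function name is not from this application):

```python
def build_phonetic_sequence(tones, initials, finals):
    """Combine each syllable's initial, final and tone digit
    (e.g. 'w' + 'o' + '3' -> 'wo3'), then join all syllables."""
    return "".join(i + f + t for t, i, f in zip(tones, initials, finals))

# the "I like Beijing" example from the description
tones    = ["3", "3", "1", "3", "1"]
initials = ["w", "x", "h", "b", "j"]
finals   = ["o", "i", "uan", "ei", "ing"]
seq = build_phonetic_sequence(tones, initials, finals)
# seq == "wo3xi3huan1bei3jing1"
```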
S303: input the phonetic feature sequence into the language transformation model to obtain a target Chinese text.
Specifically, the language transformation model is obtained by training with the training method of the language transformation model described above. The language transformation model performs pinyin-to-Chinese conversion on the input phonetic feature sequence to obtain the target Chinese text.
In the above audio recognition method, a target voice signal is obtained and pre-processed according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal; the spectral vectors are input into a preset phoneme model to obtain a phonetic feature sequence; and the phonetic feature sequence is input into the language transformation model to obtain a target Chinese text. Since the language transformation model changes the sequence calculation process of speech recognition and avoids the loss of position information, speech recognition accuracy and efficiency are improved.
Referring to Fig. 9, Fig. 9 is a schematic block diagram of a training device for a language transformation model provided by an embodiment of the present application. The training device is used for executing the training method of any of the foregoing language transformation models, and can be configured in a server or a terminal.
The server can be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a laptop, a desktop computer, a personal digital assistant or a wearable device.
As shown in Fig. 9, the training device 400 of the language transformation model includes: a corpus acquiring unit 401, a word segmentation processing unit 402, a vector conversion unit 403, a position acquisition unit 404, a vector concatenation unit 405 and a model training unit 406.
The corpus acquiring unit 401 is configured to obtain a training phonetic corpus and a data label corresponding to the training phonetic corpus.
The word segmentation processing unit 402 is configured to perform word segmentation processing on the training phonetic corpus to obtain training participle data.
The vector conversion unit 403 is configured to perform word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector.
The position acquisition unit 404 is configured to obtain position data of the training participle data in the training phonetic corpus, and to perform vector conversion on the position data to obtain a position vector.
The vector concatenation unit 405 is configured to splice the word embedding vector with the position vector to obtain a splicing term vector.
The model training unit 406 is configured to, based on a conversion neural network, carry out model training according to the splicing term vector and the data label to obtain the language transformation model.
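As an illustration of the splicing performed by the vector concatenation unit 405 (the dimensions and random values are assumptions):

```python
import numpy as np

def splice(word_embedding, position_vector):
    """Concatenate each word embedding vector with its position vector
    along the feature axis to form the splicing term vectors."""
    return np.concatenate([word_embedding, position_vector], axis=-1)

rng = np.random.default_rng(0)
words = rng.standard_normal((5, 16))   # 5 training participle tokens
pos   = rng.standard_normal((5, 16))   # their position vectors
spliced = splice(words, pos)           # shape (5, 32)
```

Note that splicing doubles the feature dimension; some Transformer-style models instead add the two vectors, but the splicing described here keeps both components intact side by side.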
Referring to Fig. 9, in one embodiment, the position acquisition unit 404 includes a data computation subunit 4041. The data computation subunit 4041 is configured to: based on a position calculation formula, calculate the position data of the training participle data in the training phonetic corpus according to the training participle data.
Referring to Fig. 9, in one embodiment, the position acquisition unit 404 includes a sequence determination subunit 4042 and a vector transformation subunit 4043.
The sequence determination subunit 4042 is configured to determine the arrangement order of the training participle data in the training phonetic corpus.
The vector transformation subunit 4043 is configured to perform vector conversion on the position data according to the arrangement order to obtain the position vector corresponding to the training participle data.
Referring again to Fig. 10, in one embodiment, the model training unit 406 includes a coding output subunit 4061, a text output subunit 4062 and a text verification subunit 4063.
The coding output subunit 4061 is configured to input the splicing term vector into the encoder of the conversion neural network to output training encoded information.
The text output subunit 4062 is configured to input the training encoded information into the decoder of the conversion neural network to output a training Chinese text.
The text verification subunit 4063 is configured to verify the training Chinese text according to the data label, and to adjust the parameters in the encoder and the decoder until the training Chinese text passes verification, thereby obtaining the language transformation model.
Referring to Fig. 10, in one implementation, the encoder includes a dot-product attention model and a feedforward neural network model. The coding output subunit 4061 includes a dot-product output sub-module 4061a and an information output sub-module 4061b.
The dot-product output sub-module 4061a is configured to input the splicing term vector into the dot-product attention model to output dot-product expressive force information.
The information output sub-module 4061b is configured to input the dot-product expressive force information into the feedforward neural network model to output the training encoded information.
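The two sub-modules 4061a and 4061b can be sketched together as one encoder pass (a minimal single-head numpy version; the weight shapes and the ReLU activation are assumptions, not the implementation of this application):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encode(x, w1, w2):
    """Scaled dot-product self-attention (sub-module 4061a) followed by
    a position-wise feed-forward network (sub-module 4061b)."""
    d_k = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d_k)) @ x   # dot-product expressive force information
    hidden = np.maximum(0.0, attn @ w1)          # feed-forward layer, ReLU activation
    return hidden @ w2                           # training encoded information

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))    # 5 splicing term vectors, dim 8
w1 = rng.standard_normal((8, 32))
w2 = rng.standard_normal((32, 8))
encoded = encode(x, w1, w2)        # shape (5, 8)
```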
Referring to Fig. 11, Fig. 11 is a schematic block diagram of a speech recognition device also provided by an embodiment of the present application. The speech recognition device is used for executing the aforementioned speech recognition method and can be configured in a server or a terminal.
As shown in Fig. 11, the speech recognition device 500 comprises: a signal acquiring unit 501, a frequency spectrum input unit 502 and a text acquiring unit 503.
The signal acquiring unit 501 is configured to obtain a target voice signal, and to pre-process the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal.
The frequency spectrum input unit 502 is configured to input the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence.
The text acquiring unit 503 is configured to input the phonetic feature sequence into the language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model described above.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working process of the device and units described above may refer to the corresponding process in the foregoing method embodiments, and is not repeated here.
The above device can be implemented in the form of a computer program, which can run on the computer equipment shown in Fig. 12.
Referring to Fig. 12, Fig. 12 is a schematic block diagram of computer equipment provided by an embodiment of the present application. The computer equipment can be a server or a terminal.
Referring to Fig. 12, the computer equipment includes a processor, a memory and a network interface connected through a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to execute any one of the training methods of the language transformation model, or any one of the audio recognition methods.
The processor provides computing and control capability and supports the operation of the entire computer equipment.
The internal memory provides an environment for the running of the computer program in the non-volatile storage medium. When the computer program is executed by the processor, the processor is caused to execute a training method of the language transformation model, or any one of the audio recognition methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 12 is only a block diagram of the part of the structure relevant to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied; specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different component layout.
It should be understood that the processor can be a central processing unit (Central Processing Unit, CPU), and the processor can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc. The general-purpose processor can be a microprocessor, or the processor can also be any conventional processor, etc.
The processor is configured to run the computer program stored in the memory, so as to realize the following steps:
obtaining a training phonetic corpus and a data label corresponding to the training phonetic corpus; performing word segmentation processing on the training phonetic corpus to obtain training participle data; performing word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector; obtaining position data of the training participle data in the training phonetic corpus, and performing vector conversion on the position data to obtain a position vector; splicing the word embedding vector with the position vector to obtain a splicing term vector; and, based on a conversion neural network, carrying out model training according to the splicing term vector and the data label to obtain a language transformation model.
In one embodiment, when realizing the obtaining of the position data of the training participle data in the training phonetic corpus, the processor is configured to realize:
based on a position calculation formula, calculating the position data of the training participle data in the training phonetic corpus according to the training participle data; the position calculation formula is:
PE(pos, m) = sin(pos / 10000^(m/d_g)), when m is even;
Or,
PE(pos, m) = cos(pos / 10000^((m-1)/d_g)), when m is odd;
wherein pos is the position of the training participle data, m indicates the dimension of the word embedding vector corresponding to the training participle data, and d_g is the vector dimension corresponding to the training phonetic corpus.
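As an illustration, assuming the position calculation formula takes the standard sinusoidal form used by Transformer-style models (sine on even dimensions, cosine on odd dimensions — this specific form is an assumption), the position vector for one token can be sketched as:

```python
import numpy as np

def position_vector(pos, d_g):
    """Sinusoidal position encoding: sin(pos / 10000**(m/d_g)) on even
    dimensions m, cos with the preceding even exponent on odd dimensions."""
    pe = np.zeros(d_g)
    for m in range(0, d_g, 2):
        angle = pos / 10000 ** (m / d_g)
        pe[m] = np.sin(angle)
        if m + 1 < d_g:
            pe[m + 1] = np.cos(angle)
    return pe

pe0 = position_vector(0, 8)   # at pos 0: all sines are 0, all cosines are 1
pe3 = position_vector(3, 8)   # distinct positions yield distinct vectors
```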
In one embodiment, when realizing the vector conversion of the position data to obtain the position vector, the processor is configured to realize:
determining the arrangement order of the training participle data in the training phonetic corpus; and performing vector conversion on the position data according to the arrangement order to obtain the position vector corresponding to the training participle data.
In one embodiment, when realizing the model training based on the conversion neural network according to the splicing term vector and the data label corresponding to the training phonetic corpus to obtain the language transformation model, the processor is configured to realize:
inputting the splicing term vector into the encoder of the conversion neural network to output training encoded information; inputting the training encoded information into the decoder of the conversion neural network to output a training Chinese text; and verifying the training Chinese text according to the data label, and adjusting the parameters in the encoder and the decoder until the training Chinese text passes verification, thereby obtaining the language transformation model.
In one embodiment, the encoder includes a dot-product attention model and a feedforward neural network model; when realizing the inputting of the splicing term vector into the encoder to output the training encoded information, the processor is configured to realize:
inputting the splicing term vector into the dot-product attention model to output dot-product expressive force information; and inputting the dot-product expressive force information into the feedforward neural network model to output the training encoded information.
In another embodiment, the processor is configured to run the computer program stored in the memory, so as to realize the following steps:
obtaining a target voice signal, and pre-processing the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal; inputting the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence; and inputting the phonetic feature sequence into a language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model described in any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which includes program instructions; when the processor executes the program instructions, the training method of any language transformation model provided by the embodiments of the present application, or any one of the audio recognition methods, is realized.
The computer-readable storage medium can be an internal storage unit of the computer equipment described in the foregoing embodiment, such as the hard disk or memory of the computer equipment. The computer-readable storage medium can also be an external storage device of the computer equipment, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the computer equipment.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A training method of a language transformation model, characterized by comprising:
obtaining a training phonetic corpus and a data label corresponding to the training phonetic corpus;
performing word segmentation processing on the training phonetic corpus to obtain training participle data;
performing word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector;
obtaining position data of the training participle data in the training phonetic corpus, and performing vector conversion on the position data to obtain a position vector;
splicing the word embedding vector with the position vector to obtain a splicing term vector;
based on a conversion neural network, carrying out model training according to the splicing term vector and the data label to obtain the language transformation model.
2. The training method of the language transformation model according to claim 1, characterized in that obtaining the position data of the training participle data in the training phonetic corpus comprises:
based on a position calculation formula, calculating the position data of the training participle data in the training phonetic corpus according to the training participle data; the position calculation formula is:
PE(pos, m) = sin(pos / 10000^(m/d_g)), when m is even;
Or,
PE(pos, m) = cos(pos / 10000^((m-1)/d_g)), when m is odd;
wherein pos is the position of the training participle data, m indicates the dimension of the word embedding vector corresponding to the training participle data, and d_g is the vector dimension corresponding to the training phonetic corpus.
3. The training method of the language transformation model according to claim 1, characterized in that performing vector conversion on the position data to obtain the position vector comprises:
determining the arrangement order of the training participle data in the training phonetic corpus;
performing vector conversion on the position data according to the arrangement order to obtain the position vector corresponding to the training participle data.
4. The training method of the language transformation model according to any one of claims 1-3, characterized in that, based on the conversion neural network, carrying out model training according to the splicing term vector and the data label corresponding to the training phonetic corpus to obtain the language transformation model comprises:
inputting the splicing term vector into an encoder of the conversion neural network to output training encoded information;
inputting the training encoded information into a decoder of the conversion neural network to output a training Chinese text;
verifying the training Chinese text according to the data label, and adjusting the parameters in the encoder and the decoder until the training Chinese text passes verification, thereby obtaining the language transformation model.
5. The training method of the language transformation model according to claim 4, characterized in that the encoder includes a dot-product attention model and a feedforward neural network model; inputting the splicing term vector into the encoder to output the training encoded information comprises:
inputting the splicing term vector into the dot-product attention model to output dot-product expressive force information;
inputting the dot-product expressive force information into the feedforward neural network model to output the training encoded information.
6. An audio recognition method, characterized by comprising:
obtaining a target voice signal, and pre-processing the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal;
inputting the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence;
inputting the phonetic feature sequence into a language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model according to any one of claims 1-5.
7. A training device of a language transformation model, characterized by comprising:
a corpus acquiring unit, configured to obtain a training phonetic corpus and a data label corresponding to the training phonetic corpus;
a word segmentation processing unit, configured to perform word segmentation processing on the training phonetic corpus to obtain training participle data;
a vector conversion unit, configured to perform word vector conversion on the training participle data according to a preset word embedding model to obtain a word embedding vector;
a position acquisition unit, configured to obtain position data of the training participle data in the training phonetic corpus, and to perform vector conversion on the position data to obtain a position vector;
a vector concatenation unit, configured to splice the word embedding vector with the position vector to obtain a splicing term vector;
a model training unit, configured to, based on a conversion neural network, carry out model training according to the splicing term vector and the data label to obtain the language transformation model.
8. A speech recognition device, characterized by comprising:
a signal acquiring unit, configured to obtain a target voice signal, and to pre-process the target voice signal according to a preset processing rule to obtain spectral vectors corresponding to the target voice signal;
a frequency spectrum input unit, configured to input the spectral vectors into a preset phoneme model to obtain a phonetic feature sequence;
a text acquiring unit, configured to input the phonetic feature sequence into a language transformation model to obtain a target Chinese text, the language transformation model being obtained by training with the training method of the language transformation model according to any one of claims 1-5.
9. Computer equipment, characterized in that the computer equipment includes a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to realize the training method of the language transformation model according to any one of claims 1 to 5 or the audio recognition method according to claim 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to realize the training method of the language transformation model according to any one of claims 1 to 5 or the audio recognition method according to claim 6.
CN201910522750.8A 2019-06-17 2019-06-17 Audio recognition method, the training method of model, device, equipment and storage medium Pending CN110288980A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910522750.8A CN110288980A (en) 2019-06-17 2019-06-17 Audio recognition method, the training method of model, device, equipment and storage medium
PCT/CN2019/118227 WO2020253060A1 (en) 2019-06-17 2019-11-13 Speech recognition method, model training method, apparatus and device, and storage medium

Publications (1)

Publication Number Publication Date
CN110288980A true CN110288980A (en) 2019-09-27

Family

ID=68005146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910522750.8A Pending CN110288980A (en) 2019-06-17 2019-06-17 Audio recognition method, the training method of model, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110288980A (en)
WO (1) WO2020253060A1 (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827816A (en) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and device, electronic equipment and storage medium
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111090886A (en) * 2019-12-31 2020-05-01 新奥数能科技有限公司 Desensitization data determination method and device, readable medium and electronic equipment
CN111144370A (en) * 2019-12-31 2020-05-12 科大讯飞华南人工智能研究院(广州)有限公司 Document element extraction method, device, equipment and storage medium
CN111222335A (en) * 2019-11-27 2020-06-02 上海眼控科技股份有限公司 Corpus correction method and device, computer equipment and computer-readable storage medium
CN111382340A (en) * 2020-03-20 2020-07-07 北京百度网讯科技有限公司 Information identification method, information identification device and electronic equipment
CN111681669A (en) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural network-based voice data identification method and equipment
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN111833849A (en) * 2020-03-10 2020-10-27 北京嘀嘀无限科技发展有限公司 Method for speech recognition and speech model training, storage medium and electronic device
CN111859994A (en) * 2020-06-08 2020-10-30 北京百度网讯科技有限公司 Method, device and storage medium for obtaining machine translation model and translating text
CN111881726A (en) * 2020-06-15 2020-11-03 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
WO2020253060A1 (en) * 2019-06-17 2020-12-24 平安科技(深圳)有限公司 Speech recognition method, model training method, apparatus and device, and storage medium
CN112132281A (en) * 2020-09-29 2020-12-25 腾讯科技(深圳)有限公司 Model training method, device, server and medium based on artificial intelligence
CN112133304A (en) * 2020-09-18 2020-12-25 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112417086A (en) * 2020-11-30 2021-02-26 深圳市欢太科技有限公司 Data processing method, device, server and storage medium
CN112528637A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 Text processing model training method and device, computer equipment and storage medium
CN112820269A (en) * 2020-12-31 2021-05-18 平安科技(深圳)有限公司 Text-to-speech method, device, electronic equipment and storage medium
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112951204A (en) * 2021-03-29 2021-06-11 北京大米科技有限公司 Speech synthesis method and device
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113297346A (en) * 2021-06-28 2021-08-24 中国平安人寿保险股份有限公司 Text intention recognition method, device, equipment and storage medium
CN113486671A (en) * 2021-07-27 2021-10-08 平安科技(深圳)有限公司 Data expansion method, device, equipment and medium based on regular expression coding
CN113761841A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150112679A1 (en) * 2013-10-18 2015-04-23 Via Technologies, Inc. Method for building language model, speech recognition method and electronic apparatus
CN107204184A (en) * 2017-05-10 2017-09-26 平安科技(深圳)有限公司 Audio recognition method and system
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109800298A (en) * 2019-01-29 2019-05-24 苏州大学 A kind of training method of Chinese word segmentation model neural network based
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN109817246A (en) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109684452A (en) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 A kind of neural network problem generation method based on answer Yu answer location information
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253060A1 (en) * 2019-06-17 2020-12-24 平安科技(深圳)有限公司 Speech recognition method, model training method, apparatus and device, and storage medium
CN110827816A (en) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and device, electronic equipment and storage medium
CN111222335A (en) * 2019-11-27 2020-06-02 上海眼控科技股份有限公司 Corpus correction method and device, computer equipment and computer-readable storage medium
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN110970031B (en) * 2019-12-16 2022-06-24 思必驰科技股份有限公司 Speech recognition system and method
CN111090886A (en) * 2019-12-31 2020-05-01 新奥数能科技有限公司 Desensitization data determination method and device, readable medium and electronic equipment
CN111144370A (en) * 2019-12-31 2020-05-12 科大讯飞华南人工智能研究院(广州)有限公司 Document element extraction method, device, equipment and storage medium
CN111144370B (en) * 2019-12-31 2023-08-04 科大讯飞华南人工智能研究院(广州)有限公司 Document element extraction method, device, equipment and storage medium
CN111833849A (en) * 2020-03-10 2020-10-27 北京嘀嘀无限科技发展有限公司 Method for speech recognition and speech model training, storage medium and electronic device
CN111382340A (en) * 2020-03-20 2020-07-07 北京百度网讯科技有限公司 Information identification method, information identification device and electronic equipment
US11113468B1 (en) * 2020-05-08 2021-09-07 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN111681669A (en) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural network-based voice data identification method and equipment
CN111859994B (en) * 2020-06-08 2024-01-23 北京百度网讯科技有限公司 Machine translation model acquisition and text translation method, device and storage medium
CN111859994A (en) * 2020-06-08 2020-10-30 北京百度网讯科技有限公司 Method, device and storage medium for obtaining machine translation model and translating text
CN111881726A (en) * 2020-06-15 2020-11-03 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN112002306B (en) * 2020-08-26 2024-04-05 阳光保险集团股份有限公司 Speech class recognition method and device, electronic equipment and readable storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
CN112133304A (en) * 2020-09-18 2020-12-25 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112133304B (en) * 2020-09-18 2022-05-06 中科极限元(杭州)智能科技股份有限公司 Low-delay speech recognition model based on feedforward neural network and training method
CN112132281A (en) * 2020-09-29 2020-12-25 腾讯科技(深圳)有限公司 Model training method, device, server and medium based on artificial intelligence
CN112417086B (en) * 2020-11-30 2024-02-27 深圳市与飞科技有限公司 Data processing method, device, server and storage medium
CN112417086A (en) * 2020-11-30 2021-02-26 深圳市欢太科技有限公司 Data processing method, device, server and storage medium
CN112528637B (en) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 Text processing model training method, device, computer equipment and storage medium
CN112528637A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 Text processing model training method and device, computer equipment and storage medium
CN112820269A (en) * 2020-12-31 2021-05-18 平安科技(深圳)有限公司 Text-to-speech method, device, electronic equipment and storage medium
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device
CN113129869B (en) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN112951204A (en) * 2021-03-29 2021-06-11 北京大米科技有限公司 Speech synthesis method and device
CN113761841A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features
CN113761841B (en) * 2021-04-19 2023-07-25 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features
CN112906403B (en) * 2021-04-25 2023-02-03 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113297346B (en) * 2021-06-28 2023-10-31 中国平安人寿保险股份有限公司 Text intention recognition method, device, equipment and storage medium
CN113297346A (en) * 2021-06-28 2021-08-24 中国平安人寿保险股份有限公司 Text intention recognition method, device, equipment and storage medium
CN113486671B (en) * 2021-07-27 2023-06-30 平安科技(深圳)有限公司 Regular expression coding-based data expansion method, device, equipment and medium
CN113486671A (en) * 2021-07-27 2021-10-08 平安科技(深圳)有限公司 Data expansion method, device, equipment and medium based on regular expression coding

Also Published As

Publication number Publication date
WO2020253060A1 (en) 2020-12-24

Similar Documents

Publication Publication Date Title
CN110288980A (en) Speech recognition method, model training method, device, equipment and storage medium
US11948066B2 (en) Processing sequences using convolutional neural networks
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110264991A (en) Speech synthesis model training method, speech synthesis method, device, equipment and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
US11823656B2 (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114882862A (en) Voice processing method and related equipment
CN115394321A (en) Audio emotion recognition method, device, equipment, storage medium and product
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN112580669B (en) Training method and device for voice information
CN112270184B (en) Natural language processing method, device and storage medium
CN108962228A (en) Model training method and device
CN112580325B (en) Rapid text matching method and device
US20230351752A1 (en) Moment localization in media stream
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
JP2022121386A (en) Speaker diarization correction method and system utilizing text-based speaker change detection
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113591472A (en) Lyric generation method, lyric generation model training method and device and electronic equipment
CN113469197A (en) Image-text matching method, device, equipment and storage medium
CN114093340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination