WO2020253060A1 - Speech recognition method, model training method, apparatus, device, and storage medium - Google Patents

Speech recognition method, model training method, apparatus, device, and storage medium Download PDF

Info

Publication number
WO2020253060A1
WO2020253060A1 (PCT/CN2019/118227)
Authority
WO
WIPO (PCT)
Prior art keywords
training
vector
model
pinyin
word
Prior art date
Application number
PCT/CN2019/118227
Other languages
English (en)
French (fr)
Inventor
王健宗
魏文琦
贾雪丽
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020253060A1 publication Critical patent/WO2020253060A1/zh

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This application relates to the technical field of model training, in particular to a speech recognition method, a training method, device, equipment and storage medium of a language conversion model.
  • Speech recognition technology, also known as Automatic Speech Recognition (ASR), refers to a technology by which machines recognize and understand speech signals and turn them into text. It is widely used in fields such as smart homes and voice input, greatly facilitating people's lives.
  • most of the existing speech recognition technologies are based on speech recognition models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, or Gated Recurrent Units (GRU); recognition based on such models is a sequential computation process, which loses information and thereby hurts both the accuracy and the efficiency of speech recognition.
  • This application provides a speech recognition method and a training method, device, computer equipment, and storage medium for a language conversion model.
  • when the language conversion model is applied to speech recognition, the accuracy and efficiency of speech recognition are improved.
  • this application provides a method for training a language conversion model, the method including: acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus; performing word segmentation on the training pinyin corpus to obtain training word segmentation data; performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector; acquiring position data information of the training word segmentation data in the training pinyin corpus and converting it into a position vector; splicing the word embedding vector and the position vector to obtain a spliced word vector; and
  • performing model training according to the spliced word vector and the data labels, based on a conversion neural network, to obtain a language conversion model.
  • this application provides a voice recognition method, the method including: acquiring a target voice signal and preprocessing it according to preset processing rules to obtain a spectrum vector corresponding to the target voice signal; inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence; and
  • inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the above-mentioned language conversion model training method.
  • this application also provides a language conversion model training device, which includes:
  • a corpus acquisition unit for acquiring training pinyin corpus and data labels corresponding to the training pinyin corpus
  • the word segmentation processing unit is configured to perform word segmentation processing on the training pinyin corpus to obtain training word segmentation data
  • the vector conversion unit is configured to perform word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
  • a location acquiring unit configured to acquire location data information of the training word segmentation data in the training pinyin corpus, and perform vector transformation on the location data information to obtain a location vector
  • a vector splicing unit for splicing the word embedding vector and the position vector to obtain a spliced word vector
  • the model training unit is configured to perform model training according to the spliced word vector and the data label based on the conversion neural network to obtain a language conversion model.
  • this application also provides a voice recognition device, which includes:
  • a signal acquisition unit configured to acquire a target voice signal, and preprocess the target voice signal according to preset processing rules to obtain a spectrum vector corresponding to the target voice signal;
  • a frequency spectrum input unit configured to input the frequency spectrum vector into a preset phoneme model to obtain a pinyin feature sequence
  • the text acquisition unit is configured to input the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the above-mentioned language conversion model training method.
  • the present application also provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and, when executing it, to implement the above-mentioned language conversion model training method or the above-mentioned speech recognition method.
  • the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the above-mentioned language conversion model training method or the above-mentioned speech recognition method.
  • This application discloses a speech recognition method, a model training method, an apparatus, a device, and a storage medium.
  • a spliced word vector is obtained by splicing the word embedding vector and the position vector; based on a conversion neural network, model training is performed according to the spliced word vectors and the data labels to obtain a language conversion model.
  • the language conversion model is applied to speech recognition, which changes the sequential computation process of speech recognition and avoids the loss of position information, thereby improving the accuracy and efficiency of speech recognition.
  • FIG. 1 is a schematic flowchart of a method for training a language conversion model provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of sub-steps of the training method of the language conversion model in FIG. 1;
  • FIG. 3 is a schematic diagram of the principle of obtaining spliced word vectors provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of sub-steps of the training method of the language conversion model in FIG. 1;
  • Fig. 5 is a schematic flowchart of sub-steps of an embodiment of outputting training coding information in Fig. 4;
  • Fig. 6 is a schematic flowchart of sub-steps of another embodiment of outputting training coding information in Fig. 4;
  • FIG. 7 is a schematic flowchart of a voice recognition method provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of sub-steps of the voice recognition method in FIG. 7;
  • Fig. 9 is a schematic block diagram of a training device for a language conversion model provided by an embodiment of the application.
  • FIG. 10 is a schematic block diagram of the sub-modules of the training device of the language conversion model in FIG. 9;
  • FIG. 11 is a schematic block diagram of a voice recognition device provided in an embodiment of the present application.
  • FIG. 12 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • the embodiments of the present application provide a training method for a language conversion model, a speech recognition method, a device, computer equipment, and a storage medium.
  • when the language conversion model is applied to speech recognition, it can improve the efficiency and accuracy of speech recognition.
  • FIG. 1 is a schematic flowchart of steps of a method for training a language conversion model provided by an embodiment of the present application.
  • the training method of the language conversion model specifically includes steps S101 to S106.
  • the Pinyin text can be collected according to actual application scenarios and used as training Pinyin corpus.
  • the pinyin of Chinese sentences commonly used in the news field can be collected as a training pinyin corpus.
  • the data label is the real Chinese text corresponding to the training Pinyin corpus.
  • the real Chinese text corresponding to the training pinyin corpus "wo3xi3huan1bei3jing1" is "I like Beijing"
  • the data label corresponding to this training pinyin corpus is therefore "I like Beijing".
  • S102 Perform word segmentation processing on the training pinyin corpus to obtain training word segmentation data.
  • word segmentation processing may be performed on the training Pinyin corpus based on a dictionary word segmentation algorithm or a statistics-based machine learning algorithm.
  • step S102 specifically includes: performing word segmentation processing on the training pinyin corpus according to a preset dictionary to obtain training word segmentation data.
  • the dictionary is a candidate set of commonly used words. For example, the training pinyin corpus for "I like Beijing" is "wo3xi3huan1bei3jing1"; the corpus is traversed from beginning to end, and whenever a word in the corpus appears in the dictionary that word is split off, so that "wo3xi3huan1bei3jing1" is segmented into the three training word segmentation data "wo3", "xi3huan1", and "bei3jing1". The digits "3" and "1" denote the tone.
  • step S102 specifically includes: performing one-hot encoding on the training pinyin corpus according to a preset dictionary to obtain training word segmentation data.
  • One-hot encoding, i.e., a one-hot or one-bit-effective code, is a code system in which a word of a given attribute is represented by as many bits as the attribute has states, with exactly one bit set to 1 and all the others 0.
  • for example, the preset dictionary includes the words corresponding to the attribute "season": the pinyin of spring, "chun1tian1"; the pinyin of summer, "xia4tian1"; the pinyin of autumn, "qiu1tian1"; the pinyin of winter, "dong1tian1"; and the pinyin for "other", "qi2ta1".
  • the attribute has 5 different classification values, and 5 bits are needed to indicate what value the attribute is.
  • the one-hot code for "chun1tian1" is {10000}
  • the one-hot code for "xia4tian1" is {01000}
  • the one-hot code for "qiu1tian1" is {00100}
  • the one-hot code for "dong1tian1" is {00010}
  • the one-hot code for "qi2ta1" is {00001}.
  • the preset dictionary may also include attributes such as person, fruit, gender, and movement mode, that is, words and one-hot codes corresponding to each attribute.
  • if a pinyin corpus contains several words that need one-hot encoding, the one-hot codes of the words are spliced together in turn: for example, the one-hot code of the summer pinyin "xia4tian1" is {01000} and the one-hot code of the hot pinyin "re4" is {001}, so connecting the two yields the final one-hot code {01000001}.
  • Using one-hot encoding to process the Pinyin corpus can make the data sparse, and the data obtained by one-hot encoding contains the information of the word attributes in the Pinyin corpus.
  • the training word segmentation data corresponding to the training pinyin corpus is obtained.
  • the training word segmentation data corresponding to a certain training Pinyin corpus is: 100000001000000001 000010 010000.
  • S103 Perform word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector.
  • word vector conversion is performed on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector.
  • the preset word embedding model may be a Word2vec (word to vector) word embedding model.
  • Multiple training word segmentation data form a training word segmentation data set.
  • each training word segmentation data in the training word segmentation data set can be represented by a word embedding vector.
  • the dimension of the word embedding vector is 512.
  • the preset word embedding model may also be another neural network model that is pre-trained, such as a deep neural network (Deep Neural Network, DNN) model.
  • after the position data information corresponding to the training word segmentation data is acquired, vector conversion is performed on the position data information to obtain the position vector corresponding to it.
  • acquiring the position data information of the training word segmentation data in the training pinyin corpus includes:
  • calculating, based on a position calculation formula and according to the training word segmentation data, the position data information of the training word segmentation data in the training pinyin corpus; the position calculation formula is PE(pos, 2m) = sin(pos / 10000^(2m/d_g)) or PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
  • where pos is the position of the training word segmentation data, 2m or (2m+1) denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus.
  • when the dimension of the word embedding vector corresponding to the training word segmentation data is even, the first (sine) formula is used to calculate the position data information; when it is odd, the second (cosine) formula is used.
  • for example, if d_g is 512, the position pos of the training word segmentation data R in the training pinyin corpus is 20, and the dimension 2m of its word embedding vector is 128, the position data information of R can be calculated as sin(20 / 10000^(128/512)); likewise, if the dimension 2m+1 is 129, the position data information of R is cos(20 / 10000^(128/512)).
  • the step of performing vector transformation on the position data information to obtain a position vector includes sub-steps S104a and S104b.
  • S104a Determine an arrangement sequence of the training word segmentation data in the training Pinyin corpus.
  • for example, for the training pinyin corpus "wo3xi3huan1bei3jing1", the arrangement order of the training word segmentation data "wo3" in the training speech data is 1, that of "xi3huan1" is 2, and that of "bei3jing1" is 3.
  • S104b Perform vector transformation on the position data information according to the arrangement order to obtain a position vector corresponding to the training word segmentation data.
  • each piece of position data information is vectorized according to the arrangement order of the training word segmentation data in the training pinyin corpus.
  • for example, the position data information of "wo3" in the training speech data is 0.863 and its arrangement order is 1, so the position vector corresponding to "wo3" is (0.863, 0, 0).
  • the position data information of "xi3huan1" is 0.125 and its arrangement order is 2, so the position vector corresponding to "xi3huan1" is (0, 0.125, 0).
  • the position data information of "bei3jing1" is 0.928 and its arrangement order is 3, so the position vector corresponding to "bei3jing1" is (0, 0, 0.928).
  • the word embedding vector and the position vector are obtained, the word embedding vector and the position vector are spliced to obtain the spliced word vector.
  • the splicing the word embedding vector and the position vector to obtain a spliced word vector specifically includes: summing the word embedding vector and the position vector to obtain the spliced word vector.
  • for example, word segmentation of the training pinyin corpus "wo3xi3huan1bei3jing1" yields the three training word segmentation data "wo3", "xi3huan1", and "bei3jing1".
  • the word embedding vectors corresponding to "wo3", "xi3huan1", and "bei3jing1" are A_1, A_2, and A_3
  • the position vectors corresponding to "wo3", "xi3huan1", and "bei3jing1" are B_1, B_2, and B_3.
  • the spliced word vectors corresponding to the three training word segmentation data are C_1, C_2, and C_3, where C_1 = A_1 + B_1, C_2 = A_2 + B_2, and C_3 = A_3 + B_3.
  • alternatively, splicing the word embedding vector and the position vector to obtain a spliced word vector specifically includes: connecting (concatenating) the word embedding vector and the position vector to obtain the spliced word vector.
  • in one implementation, the word embedding vector is followed by the position vector: for example, with word embedding vector (1, 0, 0) and position vector (0, 0.125, 0), the resulting spliced word vector is (1, 0, 0, 0, 0.125, 0).
  • in another implementation, the position vector is followed by the word embedding vector, giving the spliced word vector (0, 0.125, 0, 1, 0, 0).
  • S106 Based on a conversion neural network, perform model training according to the spliced word vector and the data labels to obtain a language conversion model.
  • the conversion neural network, namely the Transformer network, is a highly parallelized neural network; based on the conversion neural network, model training is performed according to the spliced word vector and the data labels, and the training speed is significantly improved.
  • the step of performing model training based on the transformation neural network according to the spliced word vector and the data label corresponding to the training pinyin corpus to obtain a language transformation model includes steps S201 to S203.
  • the transformation neural network includes an encoder and a decoder, and the encoder and the decoder can communicate and interact with each other.
  • Both the encoder and the decoder may include multiple layers, and the dimensions of the layers of the encoder and the layers of the decoder are the same.
  • the encoder includes a dot product attention model and a feedforward neural network (Feed Forward).
  • attention represents the association relationship between words.
  • attention represents the correspondence between the words that may be converted to each other from the pinyin end to the Chinese end in the language conversion process.
  • the step S201 of inputting the spliced word vector into the encoder of the conversion neural network to output training coding information specifically includes sub-steps S201a and S201b.
  • the dot-product attention model is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))·V
  • where Q represents a query, K represents a keyword, V represents a value, and d_k represents the dimension of Q and K.
  • in the dot-product attention model, three vectors are set, namely the query vector, key vector, and value vector, abbreviated Q, K, and V respectively.
  • the spliced word vector is input into the dot-product attention model, and the output dot-product expressiveness information Attention(Q, K, V) reflects the expressiveness of the corresponding training word segmentation data at the current position; the process is highly parallelized.
  • the feedforward neural network model is: FFN(Y) = max(0, Y·W_1 + b_1)·W_2 + b_2
  • where Y is the dot-product expressiveness information, W_1 and W_2 are weights, and b_1 and b_2 are bias terms.
  • the encoder includes a multi-head attention model and a feedforward neural network (Feed Forward).
  • attention represents the association relationship between words.
  • the attention represents the correspondence between the words that may be mutually converted from the pinyin end to the Chinese end in the language conversion process.
  • inputting the spliced word vector into the encoder of the conversion neural network to output training coding information specifically includes sub-steps S201c and S201d.
  • the multi-head attention model is: MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W_0
  • with head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), where the projection matrices W_i^Q, W_i^K, and W_i^V map the inputs into each head's subspace, and d_g is the dimension of the word embedding vector.
  • multiple Q, K, V matrices and value matrices are set in the multi-head attention model; the model has many trainable parameters, which improves its capability, takes the attention at different positions into account, and can assign more subspaces to the attention.
  • the spliced word vector is input into the multi-head attention model, and the output multi-head expressiveness information MultiHead(Q, K, V) reflects the expressiveness of the corresponding training word segmentation data at the current position; the process is highly parallelized and runs fast.
  • the feedforward neural network model in this step can refer to the feedforward neural network model in step S201b, which will not be repeated here.
  • both the decoder and the encoder have multiple layers, where each decoder layer has one more sub-network than an encoder layer, namely the encoder-decoder attention (Encoder-Decoder Attention), which represents the attention mechanism from the source end to the target end.
  • the encoder-decoder attention represents the dependency between the words on the pinyin side and the Chinese words generated from them.
  • a suitable loss function, such as a cross-entropy loss function, can be used to measure the degree of inconsistency between the data label and the training Chinese text.
  • the smaller the loss function, the better the robustness of the model.
  • when the loss function falls below a preset threshold, the training Chinese text has passed the verification; model training is then stopped, and the language conversion model is obtained.
  • in the training method of the language conversion model provided by the above embodiments, a spliced word vector is obtained by splicing the word embedding vector and the position vector; based on the conversion neural network, model training is performed according to the spliced word vector and the data labels to obtain a language conversion model. Applied to speech recognition, this model changes the sequential computation process of speech recognition and avoids the loss of position information, improving the accuracy and efficiency of speech recognition.
  • FIG. 7 is a schematic flowchart of a voice recognition method provided by an embodiment of the present application.
  • the voice recognition method can be applied to a terminal or a server to convert a voice signal into Chinese text.
  • the voice recognition method includes: steps S301 to S303.
  • voice refers to audio with language attributes, which can be emitted by the human body or by electronic devices such as speakers.
  • the voice signal corresponding to a chat with the user can be collected through a recording device such as a voice recorder, a smartphone, a tablet computer, a notebook, or a smart wearable device such as a smart bracelet or smart watch.
  • the preset processing rule is used to convert the target voice signal into information in the frequency domain, for example by using a fast Fourier transform rule or a wavelet transform rule to convert the target voice information collected in the time domain into frequency-domain information.
  • the preset phoneme model can be obtained by training the initial neural network using a large amount of frequency spectrum vector-Pinyin sample data.
  • the initial neural network can be various neural networks, for example, convolutional neural network, recurrent neural network, long-short-term memory neural network, and so on.
  • inputting the frequency spectrum vector into a preset phoneme model to obtain a pinyin feature sequence includes: S302a, identifying, according to the frequency spectrum vector, the tones, initials, and finals corresponding to the frequency spectrum vector; and S302b, integrating the tones, initials, and finals to obtain the pinyin feature sequence of the Chinese text.
  • the tones include the first tone (also known as yinping, the level tone), the second tone (yangping, the rising tone), the third tone (shangsheng, the dipping tone), the fourth tone (qusheng, the falling tone), and the neutral tone.
  • the neutral tone and the first, second, third, and fourth tones can be represented by the digits "0", "1", "2", "3", and "4" respectively.
  • for example, the tones corresponding to the frequency spectrum vector can be identified, in chronological order, as "3", "3", "1", "3", "1"; the corresponding initials, in chronological order, as "w", "x", "h", "b", "j"; and the corresponding finals, in chronological order, as "o", "i", "uan", "ei", "ing".
  • the tones, initials, and finals are integrated to obtain the pinyin feature sequence {wo3xi3huan1bei3jing1} of the Chinese text "I like Beijing".
  • the language conversion model is obtained by the above-mentioned training method for the language conversion model.
  • through this language model, the input pinyin feature sequence is converted from pinyin to Chinese to obtain the target Chinese text.
  • in the speech recognition method, a target speech signal is acquired and preprocessed according to a preset processing rule to obtain the spectrum vector corresponding to the target speech signal; the spectrum vector is input into a preset phoneme model to obtain a pinyin feature sequence; and the pinyin feature sequence is input into the language conversion model to obtain the target Chinese text. Because the language conversion model changes the sequential computation process of speech recognition and avoids the loss of position information, the accuracy and efficiency of speech recognition are improved.
  • FIG. 9 is a schematic block diagram of a training device for a language conversion model provided by an embodiment of the present application; the training device is used to perform any of the foregoing language conversion model training methods.
  • the training device can be configured in a server or a terminal.
  • the server can be an independent server or a server cluster.
  • the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, or a wearable device.
  • the training device 400 for the language conversion model includes: a corpus acquisition unit 401, a word segmentation processing unit 402, a vector conversion unit 403, a position acquisition unit 404, a vector splicing unit 405, and a model training unit 406.
  • the corpus acquisition unit 401 is configured to acquire training Pinyin corpus and data labels corresponding to the training Pinyin corpus.
  • the word segmentation processing unit 402 is configured to perform word segmentation processing on the training pinyin corpus to obtain training word segmentation data.
  • the vector conversion unit 403 is configured to perform word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector.
  • the location acquiring unit 404 is configured to acquire location data information of the training word segmentation data in the training Pinyin corpus, and perform vector transformation on the location data information to obtain a location vector.
  • the vector splicing unit 405 is configured to splice the word embedding vector and the position vector to obtain a spliced word vector.
  • the model training unit 406 is configured to perform model training according to the spliced word vector and the data label based on the conversion neural network to obtain a language conversion model.
  • the position obtaining unit 404 includes a data calculation subunit 4041.
  • the data calculation subunit 4041 is used to calculate the position data information of the training word segmentation data in the training Pinyin corpus based on the position calculation formula and the training word segmentation data.
  • the position acquisition unit 404 includes a sequence determination subunit 4042 and a vector transformation subunit 4043.
  • the sequence determination subunit 4042 is used to determine the sequence of the training word segmentation data in the training Pinyin corpus.
  • the vector conversion subunit 4043 is configured to perform vector conversion on the position data information according to the arrangement sequence to obtain a position vector corresponding to the training word segmentation data.
  • the model training unit 406 includes an encoding output subunit 4061, a text output subunit 4062, and a text verification subunit 4063.
  • the encoding output subunit 4061 is configured to input the spliced word vector into the encoder of the conversion neural network to output training encoding information.
  • the text output subunit 4062 is used to input the training coding information into the decoder of the transformation neural network to output training Chinese text.
  • the text verification subunit 4063 is configured to verify the training Chinese text according to the data tags, and adjust the parameters in the encoder and the decoder, until the training Chinese text is verified to obtain a language conversion model.
  • the encoder includes a dot-product attention model and a feedforward neural network model.
  • the encoding output subunit 4061 includes a dot product output submodule 4061a and an information output submodule 4061b.
  • the dot product output sub-module 4061a is used to input the spliced word vector into the dot product attention model to output dot product expressiveness information.
  • the information output sub-module 4061b is configured to input the dot product expressiveness information into the feedforward neural network model to output training coding information.
  • FIG. 11 is a schematic block diagram of a voice recognition device according to an embodiment of the present application, and the voice recognition device is used to execute the aforementioned voice recognition method.
  • the voice recognition device can be configured in a server or a terminal.
  • the speech recognition device 500 includes: a signal acquisition unit 501, a frequency spectrum input unit 502, and a text acquisition unit 503.
  • the signal acquisition unit 501 is configured to acquire a target voice signal, and preprocess the target voice signal according to preset processing rules to obtain a spectrum vector corresponding to the target voice signal.
  • the frequency spectrum input unit 502 is configured to input the frequency spectrum vector into a preset phoneme model to obtain a pinyin feature sequence.
  • the text acquisition unit 503 is configured to input the pinyin feature sequence into a language conversion model to obtain a target Chinese text, and the language conversion model is trained by the above-mentioned language conversion recognition model training method.
  • the foregoing apparatus may be implemented in the form of a computer program, and the computer program may be run on the computer device as shown in FIG. 12.
  • FIG. 12 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any method for training a language conversion model or execute any method for speech recognition.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute a language conversion model training method or execute any speech recognition method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 12 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied.
  • a specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in the memory to implement the following steps: acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus; performing word segmentation on the training pinyin corpus to obtain training word segmentation data; performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector; acquiring position data information of the training word segmentation data in the training pinyin corpus and converting it into a position vector; splicing the word embedding vector and the position vector to obtain a spliced word vector; and performing model training according to the spliced word vector and the data labels, based on the conversion neural network, to obtain a language conversion model.
  • when implementing the acquisition of the position data information of the training word segmentation data in the training pinyin corpus, the processor is configured to: calculate, based on a position calculation formula and according to the training word segmentation data, the position data information of the training word segmentation data in the training pinyin corpus; the position calculation formula is PE(pos, 2m) = sin(pos / 10000^(2m/d_g)) or PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
  • where pos is the position of the training word segmentation data, m denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus.
  • when implementing the vector conversion of the position data information to obtain a position vector, the processor is configured to: determine the arrangement order of the training word segmentation data in the training pinyin corpus, and perform vector conversion on the position data information according to that order to obtain the position vector corresponding to the training word segmentation data.
  • when implementing the conversion-neural-network-based model training according to the spliced word vector and the data labels corresponding to the training pinyin corpus to obtain a language conversion model, the processor is configured to: input the spliced word vector into the encoder of the conversion neural network to output training coding information; input the training coding information into the decoder to output a training Chinese text; and verify the training Chinese text against the data labels, adjusting the parameters in the encoder and the decoder until the verification passes and the language conversion model is obtained.
  • the encoder includes a dot-product attention model and a feedforward neural network model; when inputting the spliced word vector into the encoder to output training coding information, the processor is configured to:
  • input the spliced word vector into the dot-product attention model to output dot-product expressiveness information, and input the dot-product expressiveness information into the feedforward neural network model to output training coding information.
  • in another embodiment, the processor is used to run a computer program stored in the memory to implement the following steps: acquiring a target voice signal and preprocessing it according to preset processing rules to obtain a corresponding spectrum vector; inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence; and inputting the pinyin feature sequence into the language conversion model to obtain a target Chinese text.
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program; the computer program includes program instructions, and by executing the program instructions the processor implements any language conversion model training method or any speech recognition method provided in the embodiments of this application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A speech recognition method, a model training method, an apparatus, a device, and a storage medium. The training method includes: acquiring a training pinyin corpus and data labels (S101); performing word segmentation on the training pinyin corpus (S102); performing word vector conversion on the training word segmentation data (S103); acquiring position data information and converting it into vectors (S104); splicing the word embedding vector and the position vector (S105); and performing model training according to the spliced word vector and the data labels to obtain a language conversion model (S106).

Description

Speech recognition method, model training method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 17, 2019, with application number 201910522750.8 and invention title "Speech recognition method, model training method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of model training, and in particular to a speech recognition method and a training method, apparatus, device, and storage medium for a language conversion model.
Background Art
Speech recognition technology, also known as Automatic Speech Recognition (ASR), is a technology by which a machine recognizes and understands a speech signal and converts it into text. It is widely used in fields such as smart homes and voice input, greatly facilitating people's lives. However, most existing speech recognition technologies are implemented with speech recognition models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, or Gated Recurrent Units (GRU). Speech recognition based on such models is a sequential computation process, and sequential computation causes information loss, which affects recognition accuracy while also reducing recognition efficiency. How to improve the efficiency and accuracy of speech recognition has therefore become a problem that urgently needs to be solved.
Summary of the Invention
This application provides a speech recognition method and a training method, apparatus, computer device, and storage medium for a language conversion model; when the language conversion model is applied to speech recognition, the accuracy and efficiency of speech recognition are improved.
In a first aspect, this application provides a method for training a language conversion model, the method including:
acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus;
performing word segmentation on the training pinyin corpus to obtain training word segmentation data;
performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
acquiring position data information of the training word segmentation data in the training pinyin corpus, and performing vector conversion on the position data information to obtain a position vector;
splicing the word embedding vector and the position vector to obtain a spliced word vector;
based on a conversion neural network, performing model training according to the spliced word vector and the data labels to obtain a language conversion model.
In a second aspect, this application provides a speech recognition method, the method including:
acquiring a target speech signal, and preprocessing the target speech signal according to preset processing rules to obtain a spectrum vector corresponding to the target speech signal;
inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence;
inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the language conversion model training method described above.
In a third aspect, this application also provides a training apparatus for a language conversion model, the apparatus including:
a corpus acquisition unit for acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus;
a word segmentation processing unit for performing word segmentation on the training pinyin corpus to obtain training word segmentation data;
a vector conversion unit for performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
a position acquisition unit for acquiring position data information of the training word segmentation data in the training pinyin corpus and performing vector conversion on the position data information to obtain a position vector;
a vector splicing unit for splicing the word embedding vector and the position vector to obtain a spliced word vector;
a model training unit for performing model training according to the spliced word vector and the data labels, based on a conversion neural network, to obtain a language conversion model.
In a fourth aspect, this application also provides a speech recognition apparatus, the apparatus including:
a signal acquisition unit for acquiring a target speech signal and preprocessing it according to preset processing rules to obtain a spectrum vector corresponding to the target speech signal;
a spectrum input unit for inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence;
a text acquisition unit for inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the language conversion model training method described above.
In a fifth aspect, this application also provides a computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing it, to implement the language conversion model training method or the speech recognition method described above.
In a sixth aspect, this application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the language conversion model training method or the speech recognition method described above.
This application discloses a speech recognition method, a model training method, an apparatus, a device, and a storage medium. A spliced word vector is obtained by splicing the word embedding vector and the position vector; based on a conversion neural network, model training is performed according to the spliced word vector and the data labels to obtain a language conversion model. Applied to speech recognition, the language conversion model changes the sequential computation process of speech recognition and avoids the loss of position information, thereby improving the accuracy and efficiency of speech recognition.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from them without creative work.
FIG. 1 is a schematic flowchart of a method for training a language conversion model provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of sub-steps of the training method of the language conversion model in FIG. 1;
FIG. 3 is a schematic diagram of the principle of obtaining spliced word vectors provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of sub-steps of the training method of the language conversion model in FIG. 1;
FIG. 5 is a schematic flowchart of sub-steps of one embodiment of outputting training coding information in FIG. 4;
FIG. 6 is a schematic flowchart of sub-steps of another embodiment of outputting training coding information in FIG. 4;
FIG. 7 is a schematic flowchart of a speech recognition method provided by an embodiment of this application;
FIG. 8 is a schematic flowchart of sub-steps of the speech recognition method in FIG. 7;
FIG. 9 is a schematic block diagram of a training apparatus for a language conversion model provided by an embodiment of this application;
FIG. 10 is a schematic block diagram of the sub-modules of the training apparatus in FIG. 9;
FIG. 11 is a schematic block diagram of a speech recognition apparatus provided by an embodiment of this application;
FIG. 12 is a schematic block diagram of the structure of a computer device according to an embodiment of this application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of this application.
The flowcharts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
The embodiments of this application provide a training method for a language conversion model, a speech recognition method, an apparatus, a computer device, and a storage medium. When the language conversion model is applied to speech recognition, it can improve the efficiency and accuracy of speech recognition.
Some implementations of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
Please refer to FIG. 1, which is a schematic flowchart of the steps of a method for training a language conversion model provided by an embodiment of this application.
As shown in FIG. 1, the training method of the language conversion model specifically includes steps S101 to S106.
S101. Acquire a training pinyin corpus and data labels corresponding to the training pinyin corpus.
Specifically, pinyin text can be collected according to the actual application scenario and used as the training pinyin corpus. For example, for news speech, the pinyin of Chinese sentences commonly used in the news field can be collected as the training pinyin corpus.
The data label is the real Chinese text corresponding to the training pinyin corpus. For example, the real Chinese text corresponding to the training pinyin corpus "wo3xi3huan1bei3jing1" is "我喜欢北京" ("I like Beijing"), so the data label corresponding to this training pinyin corpus is "我喜欢北京".
S102. Perform word segmentation on the training pinyin corpus to obtain training word segmentation data.
For example, word segmentation can be performed on the training pinyin corpus based on a dictionary word segmentation algorithm or a statistics-based machine learning algorithm.
In some implementations, the specific process of performing word segmentation on the training pinyin corpus, i.e., step S102, specifically includes: performing word segmentation on the training pinyin corpus according to a preset dictionary to obtain training word segmentation data.
The dictionary is a candidate set of commonly used words. For example, the training pinyin corpus for "I like Beijing" is "wo3xi3huan1bei3jing1"; the corpus is traversed from beginning to end, and whenever a word in the corpus appears in the dictionary that word is split off, so that "wo3xi3huan1bei3jing1" is segmented into the three training word segmentation data "wo3", "xi3huan1", and "bei3jing1". The digits "3" and "1" denote the tone.
In other implementations, the specific process of performing word segmentation on the training pinyin corpus, i.e., step S102, specifically includes: performing one-hot encoding on the training pinyin corpus according to a preset dictionary to obtain training word segmentation data.
One-hot encoding, i.e., a one-hot or one-bit-effective code, is a code system in which a word of a given attribute is represented by as many bits as the attribute has states, with exactly one bit set to 1 and all the others 0.
For example, the preset dictionary includes the words corresponding to the attribute "season": the pinyin of spring, "chun1tian1"; the pinyin of summer, "xia4tian1"; the pinyin of autumn, "qiu1tian1"; the pinyin of winter, "dong1tian1"; and the pinyin for "other", "qi2ta1". The attribute has 5 different classification values, so 5 bits are needed to indicate which value the attribute takes. For example, the one-hot code for "chun1tian1" is {10000}, for "xia4tian1" it is {01000}, for "qiu1tian1" it is {00100}, for "dong1tian1" it is {00010}, and for "qi2ta1" it is {00001}.
The preset dictionary may also include attributes such as person, fruit, gender, and movement mode, i.e., the words and one-hot codes corresponding to each attribute.
If a pinyin corpus contains several words that need one-hot encoding, the one-hot codes of the words are spliced together in turn: for example, the one-hot code of the summer pinyin "xia4tian1" is {01000} and the one-hot code of the hot pinyin "re4" is {001}, so connecting the two yields the final one-hot code {01000001}.
Processing the pinyin corpus with one-hot encoding makes the data sparse, and the data obtained by one-hot encoding contains the information of the word attributes in the pinyin corpus.
After word segmentation, the training word segmentation data corresponding to the training pinyin corpus is obtained. For example, the training word segmentation data corresponding to a certain training pinyin corpus is: 100000001000000001 000010 010000.
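For illustration only (this sketch is not part of the original disclosure), the dictionary-based segmentation and one-hot encoding described above can be realized along the following lines in Python. The dictionary and corpus are the hypothetical examples from the text, and longest-match-first traversal is an assumption, since the disclosure only specifies traversing the corpus and splitting off dictionary words:

```python
# Hypothetical word dictionary taken from the example above.
DICTIONARY = ["wo3", "xi3huan1", "bei3jing1"]

def segment(corpus: str, dictionary=DICTIONARY) -> list:
    """Scan the corpus left to right, splitting off the longest dictionary match."""
    words, i = [], 0
    candidates = sorted(dictionary, key=len, reverse=True)
    while i < len(corpus):
        match = next((w for w in candidates if corpus.startswith(w, i)), None)
        if match is None:          # no dictionary hit: skip one character
            i += 1
            continue
        words.append(match)
        i += len(match)
    return words

def one_hot(word: str, vocabulary: list) -> list:
    """One bit per vocabulary entry; exactly one bit is 1, the rest are 0."""
    return [1 if word == entry else 0 for entry in vocabulary]

print(segment("wo3xi3huan1bei3jing1"))   # ['wo3', 'xi3huan1', 'bei3jing1']
print(one_hot("xia4tian1", ["chun1tian1", "xia4tian1", "qiu1tian1", "dong1tian1", "qi2ta1"]))
# [0, 1, 0, 0, 0]  -> the {01000} code from the season example above
```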
S103. Perform word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector.
After the training word segmentation data is obtained, word vector conversion is performed on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector.
In one embodiment, the preset word embedding model may be a Word2vec (word to vector) word embedding model. Multiple training word segmentation data form a training word segmentation data set; according to the Word2vec word embedding model, each training word segmentation datum in the set can be represented by a word embedding vector. In one implementation, the dimension of the word embedding vector is 512.
Understandably, in other embodiments the preset word embedding model may also be another pre-trained neural network model, such as a deep neural network (Deep Neural Network, DNN) model.
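The disclosure names Word2vec as one possible preset word embedding model but does not prescribe an implementation. Purely as an illustration (the library choice, toy corpus, and hyperparameters below are assumptions), such a model could be trained with gensim:

```python
# Illustrative sketch: a 512-dimensional Word2vec model over segmented pinyin
# corpora. The two-sentence corpus is a hypothetical toy example.
from gensim.models import Word2Vec

segmented_corpora = [
    ["wo3", "xi3huan1", "bei3jing1"],
    ["wo3", "xi3huan1", "chun1tian1"],
]
model = Word2Vec(sentences=segmented_corpora, vector_size=512, min_count=1)
embedding = model.wv["wo3"]   # a 512-dimensional word embedding vector
print(embedding.shape)        # (512,)
```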
S104. Acquire position data information of the training word segmentation data in the training pinyin corpus, and perform vector conversion on the position data information to obtain a position vector.
Specifically, after the position data information corresponding to the training word segmentation data is acquired, vector conversion is performed on the position data information to obtain the position vector corresponding to it.
In one embodiment, acquiring the position data information of the training word segmentation data in the training pinyin corpus includes:
calculating, based on a position calculation formula and according to the training word segmentation data, the position data information of the training word segmentation data in the training pinyin corpus; the position calculation formula is:
PE(pos, 2m) = sin(pos / 10000^(2m/d_g))
or,
PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
where pos is the position of the training word segmentation data, 2m or (2m+1) denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus.
Specifically, when the dimension of the word embedding vector corresponding to the training word segmentation data is even, the first formula is used to calculate the position data information of the training word segmentation data in the training pinyin corpus. When the dimension is odd, the second formula is used.
For example, suppose d_g is 512, the position pos of the training word segmentation data R in the training pinyin corpus is 20, and the dimension 2m of the word embedding vector corresponding to R is 128. Then, through the above position calculation formula, the position data information of R in the training pinyin corpus can be calculated as sin(20 / 10000^(128/512)).
As another example, suppose d_g is 512, the position pos of R is 20, and the dimension 2m+1 of the word embedding vector corresponding to R is 129. Then the position data information of R in the training pinyin corpus can be calculated as cos(20 / 10000^(128/512)).
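A minimal NumPy sketch of the position calculation formula above (illustrative only; it follows the sinusoidal form as reconstructed here, sine for even dimensions and cosine for odd dimensions):

```python
import numpy as np

def position_data(pos: int, dim: int, d_g: int = 512) -> float:
    """Position data information for a word at position `pos`, embedding dim `dim`."""
    m2 = dim if dim % 2 == 0 else dim - 1      # the 2m that enters the exponent
    angle = pos / (10000 ** (m2 / d_g))
    return np.sin(angle) if dim % 2 == 0 else np.cos(angle)

# The worked example above: pos = 20, d_g = 512
print(position_data(20, 128))   # even dimension 2m = 128 -> sin(20 / 10000**0.25)
print(position_data(20, 129))   # odd dimension 2m+1 = 129 -> cos(20 / 10000**0.25)
```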
As shown in FIG. 2, in one embodiment, the step of performing vector conversion on the position data information to obtain a position vector includes sub-steps S104a and S104b.
S104a. Determine the arrangement order of the training word segmentation data in the training pinyin corpus.
For example, for the training pinyin corpus "wo3xi3huan1bei3jing1", the arrangement order of the training word segmentation data "wo3" in the training speech data is 1, that of "xi3huan1" is 2, and that of "bei3jing1" is 3.
S104b. Perform vector conversion on the position data information according to the arrangement order to obtain the position vector corresponding to the training word segmentation data.
Specifically, each piece of position data information is vectorized according to the arrangement order of the training word segmentation data in the training pinyin corpus.
For example, the position data information of "wo3" in the training speech data is 0.863 and its arrangement order is 1, so the position vector corresponding to "wo3" is (0.863, 0, 0). The position data information of "xi3huan1" is 0.125 and its order is 2, so its position vector is (0, 0.125, 0). The position data information of "bei3jing1" is 0.928 and its order is 3, so its position vector is (0, 0, 0.928).
S105. Splice the word embedding vector and the position vector to obtain a spliced word vector.
Specifically, after the word embedding vector and the position vector are obtained, the word embedding vector and the position vector are spliced to obtain the spliced word vector.
In one embodiment, splicing the word embedding vector and the position vector to obtain the spliced word vector specifically includes: summing the word embedding vector and the position vector to obtain the spliced word vector.
For example, word segmentation of the training pinyin corpus "wo3xi3huan1bei3jing1" yields the three training word segmentation data "wo3", "xi3huan1", and "bei3jing1", whose word embedding vectors are A_1, A_2, and A_3 and whose position vectors are B_1, B_2, and B_3 respectively. Assuming the word embedding vectors and position vectors are four-dimensional, the spliced word vectors corresponding to the three training word segmentation data are C_1, C_2, and C_3, where, as shown in FIG. 3, C_1 = A_1 + B_1, C_2 = A_2 + B_2, and C_3 = A_3 + B_3.
In another embodiment, splicing the word embedding vector and the position vector to obtain the spliced word vector specifically includes: connecting (concatenating) the word embedding vector and the position vector to obtain the spliced word vector.
In one implementation, the word embedding vector is followed by the position vector: for example, with word embedding vector (1, 0, 0) and position vector (0, 0.125, 0), the resulting spliced word vector is (1, 0, 0, 0, 0.125, 0). In another implementation, the position vector is followed by the word embedding vector: with the same vectors, the resulting spliced word vector is (0, 0.125, 0, 1, 0, 0).
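The two splicing strategies can be illustrated with a short NumPy sketch (not part of the original disclosure), using the example vectors from the text:

```python
import numpy as np

word_embedding = np.array([1.0, 0.0, 0.0])
position_vec   = np.array([0.0, 0.125, 0.0])

spliced_by_sum    = word_embedding + position_vec              # element-wise sum
spliced_by_concat = np.concatenate([word_embedding, position_vec])

print(spliced_by_sum)     # [1.    0.125 0.   ]
print(spliced_by_concat)  # [1.    0.    0.    0.    0.125 0.   ]
```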
S106. Based on a conversion neural network, perform model training according to the spliced word vector and the data labels to obtain a language conversion model.
Specifically, the conversion neural network, namely the Transformer network (Transformer for short), is a highly parallelized neural network. Based on this conversion neural network, model training is performed according to the spliced word vector and the data labels, and the training speed is significantly improved.
As shown in FIG. 4, in one embodiment, the step of performing model training based on the conversion neural network according to the spliced word vector and the data labels corresponding to the training pinyin corpus to obtain a language conversion model includes steps S201 to S203.
S201. Input the spliced word vector into the encoder of the conversion neural network to output training coding information.
Specifically, the conversion neural network includes an encoder and a decoder, between which information can be transmitted and exchanged. Both the encoder and the decoder may include multiple layers, and the dimensions of the encoder layers and the decoder layers are the same.
In one embodiment, the encoder includes a dot-product attention model and a feedforward neural network (Feed Forward). Attention represents the association between words; in one embodiment, attention represents the correspondence between the words that may be converted into each other from the pinyin side to the Chinese side during language conversion.
Specifically, referring to FIG. 5, step S201, inputting the spliced word vector into the encoder of the conversion neural network to output training coding information, specifically includes sub-steps S201a and S201b.
S201a. Input the spliced word vector into the dot-product attention model to output dot-product expressiveness information.
Specifically, the dot-product attention model is:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))·V
where Q represents a query, K represents a keyword, V represents a value, and d_k represents the dimension of Q and K.
Specifically, three vectors are set in the dot-product attention model, namely the query vector, key vector, and value vector, abbreviated Q, K, and V respectively. The spliced word vector is input into the dot-product attention model, and the output dot-product expressiveness information Attention(Q, K, V) reflects the expressiveness of the corresponding training word segmentation data at the current position; the process is highly parallelized.
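As an illustrative sketch only (the shapes and random data are hypothetical), the dot-product attention model as reconstructed above can be written in NumPy as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of queries and keywords
    return softmax(scores) @ V        # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))   # one query per training word segmentation datum
K = rng.normal(size=(3, 64))
V = rng.normal(size=(3, 64))
print(attention(Q, K, V).shape)   # (3, 64): dot-product expressiveness information
```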
S201b. Input the dot-product expressiveness information into the feedforward neural network model to output training coding information.
Specifically, the feedforward neural network model is:
FFN(Y) = max(0, Y·W_1 + b_1)·W_2 + b_2
where Y is the dot-product expressiveness information, W_1 and W_2 are weights, and b_1 and b_2 are bias terms.
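A short NumPy sketch of this feedforward model (illustrative; the weight shapes below are assumptions, not taken from the disclosure):

```python
import numpy as np

def feed_forward(Y, W1, b1, W2, b2):
    # FFN(Y) = max(0, Y·W1 + b1)·W2 + b2: a ReLU layer followed by a linear layer
    return np.maximum(0.0, Y @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
Y = rng.normal(size=(3, 64))                  # dot-product expressiveness info
W1, b1 = rng.normal(size=(64, 256)), np.zeros(256)
W2, b2 = rng.normal(size=(256, 64)), np.zeros(64)
print(feed_forward(Y, W1, b1, W2, b2).shape)  # (3, 64): training coding information
```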
In another embodiment, the encoder includes a multi-head attention model and a feedforward neural network (Feed Forward). Attention represents the association between words; in one embodiment, attention represents the correspondence between the words that may be converted into each other from the pinyin side to the Chinese side during language conversion.
As shown in FIG. 6, inputting the spliced word vector into the encoder of the conversion neural network to output training coding information specifically includes sub-steps S201c and S201d.
S201c. Input the spliced word vector into the multi-head attention model to output multi-head expressiveness information.
The multi-head attention model is:
MultiHead(Q, K, V) = Concat(head_1, ..., head_n)·W_0
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
where the projection matrices W_i^Q, W_i^K, and W_i^V map the inputs into each head's subspace, and d_g is the dimension of the word embedding vector.
Specifically, multiple Q, K, V matrices and value matrices are set in the multi-head attention model. The model has many trainable parameters, which improves its capability; it takes the attention at different positions into account and can assign more subspaces to the attention. The spliced word vector is input into the multi-head attention model, and the output multi-head expressiveness information MultiHead(Q, K, V) reflects the expressiveness of the corresponding training word segmentation data at the current position; the process is highly parallelized and runs fast.
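An illustrative, self-contained sketch of the multi-head model (not from the disclosure; the head count, dimensions, and projection matrices are all hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V, WQ, WK, WV, W0):
    # WQ, WK, WV: one projection matrix per head; W0: output projection
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ W0   # Concat(head_1..head_n)·W_0

rng = np.random.default_rng(0)
d_g, d_k, n = 512, 64, 8
Q = K = V = rng.normal(size=(3, d_g))
WQ = [rng.normal(size=(d_g, d_k)) for _ in range(n)]
WK = [rng.normal(size=(d_g, d_k)) for _ in range(n)]
WV = [rng.normal(size=(d_g, d_k)) for _ in range(n)]
W0 = rng.normal(size=(n * d_k, d_g))
print(multi_head(Q, K, V, WQ, WK, WV, W0).shape)   # (3, 512)
```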
S201d. Input the multi-head expressiveness information into the feedforward neural network model to output training coding information.
Understandably, the feedforward neural network model in this step may refer to the one in step S201b, which is not repeated here.
S202. Input the training coding information into the decoder of the conversion neural network to output a training Chinese text.
In one embodiment, both the decoder and the encoder have multiple layers, where each decoder layer has one more sub-network than an encoder layer, namely the encoder-decoder attention (Encoder-Decoder Attention), which represents the attention mechanism from the source end to the target end. Specifically, the encoder-decoder attention represents the dependency between the words on the pinyin side and the Chinese words generated from them.
S203. Verify the training Chinese text against the data labels, and adjust the parameters in the encoder and the decoder until the training Chinese text passes the verification, obtaining the language conversion model.
Specifically, a suitable loss function, such as a cross-entropy loss function, can be used to measure the degree of inconsistency between the data label and the training Chinese text; the smaller the loss function, the better the robustness of the model. For example, when the loss function is smaller than a preset threshold, the training Chinese text has passed the verification; model training is then stopped, and the language conversion model is obtained.
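The verification-and-stop logic can be pictured with a hedged PyTorch-style sketch (illustrative only; the model, optimizer, batch format, and the threshold value 0.05 are assumptions, not part of the disclosure):

```python
import torch
import torch.nn.functional as F

def train_until_verified(model, optimizer, batches, threshold=0.05):
    """Train until the cross-entropy between label and training text is small."""
    for spliced_vectors, label_ids in batches:
        logits = model(spliced_vectors)             # (seq_len, vocab_size)
        loss = F.cross_entropy(logits, label_ids)   # data label vs. training text
        if loss.item() < threshold:                 # verification passed:
            return model                            # stop training
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```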
In the training method of the language conversion model provided by the above embodiments, a spliced word vector is obtained by splicing the word embedding vector and the position vector; based on the conversion neural network, model training is performed according to the spliced word vector and the data labels to obtain a language conversion model. Applied to speech recognition, this language conversion model changes the sequential computation process of speech recognition and avoids the loss of position information, thereby improving the accuracy and efficiency of speech recognition.
Please refer to FIG. 7, which is a schematic flowchart of a speech recognition method provided by an embodiment of this application. The speech recognition method can be applied in a terminal or a server to convert a speech signal into Chinese text.
As shown in FIG. 7, the speech recognition method includes steps S301 to S303.
S301. Acquire a target speech signal, and preprocess the target speech signal according to preset processing rules to obtain the spectrum vector corresponding to the target speech signal.
Specifically, "speech" refers to audio with language attributes, which can be produced by the human body or by electronic devices such as loudspeakers.
In this embodiment, the speech signal corresponding to a chat with the user can be collected through a recording device such as a voice recorder, a smartphone, a tablet computer, a notebook, or a smart wearable device such as a smart bracelet or smart watch.
The preset processing rule is used to convert the target speech signal into information in the frequency domain, for example by using a fast Fourier transform rule or a wavelet transform rule to convert the target speech information collected in the time domain into frequency-domain information.
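As a rough illustration of such a preprocessing rule (the frame length, hop size, windowing, and sample rate below are assumptions, not from the disclosure), a framed FFT magnitude spectrum can be computed with NumPy:

```python
import numpy as np

def spectrum_vectors(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split the time-domain signal into frames and FFT each frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] for s in starts])
    window = np.hanning(frame_len)                         # reduce spectral leakage
    return np.abs(np.fft.rfft(frames * window, axis=-1))   # magnitude spectra

# e.g. one second of hypothetical 16 kHz audio -> a sequence of spectrum vectors
signal = np.random.default_rng(0).normal(size=16000)
print(spectrum_vectors(signal).shape)   # (98, 201)
```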
S302. Input the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence.
The preset phoneme model can be obtained by training an initial neural network with a large amount of spectrum-vector/pinyin sample data. The initial neural network can be any of various neural networks, for example a convolutional neural network, a recurrent neural network, or a long short-term memory neural network.
Specifically, as shown in FIG. 8, inputting the spectrum vector into the preset phoneme model to obtain the pinyin feature sequence includes: S302a, identifying, according to the spectrum vector, the tones, initials, and finals corresponding to the spectrum vector; and S302b, integrating the tones, initials, and finals to obtain the pinyin feature sequence of the Chinese text.
Specifically, the tones include the first tone (also known as yinping, the level tone), the second tone (yangping, the rising tone), the third tone (shangsheng, the dipping tone), the fourth tone (qusheng, the falling tone), and the neutral tone. The neutral tone and the first, second, third, and fourth tones can be represented by the digits "0", "1", "2", "3", and "4" respectively.
For example, when the spectrum vector corresponding to the source speech of "I like Beijing" is input into the preset phoneme model, the tones corresponding to the spectrum vector can be identified, in chronological order, as "3", "3", "1", "3", "1"; the corresponding initials, in chronological order, as "w", "x", "h", "b", "j"; and the corresponding finals, in chronological order, as "o", "i", "uan", "ei", "ing".
After the tones, initials, and finals corresponding to the spectrum vector are identified, they are integrated to obtain the pinyin feature sequence {wo3xi3huan1bei3jing1} of the Chinese text "I like Beijing".
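The integration step S302b amounts to recombining per-syllable initials, finals, and tones; an illustrative Python sketch using the example above (not part of the original disclosure):

```python
tones    = ["3", "3", "1", "3", "1"]
initials = ["w", "x", "h", "b", "j"]
finals   = ["o", "i", "uan", "ei", "ing"]

# Each syllable is initial + final + tone digit; syllables are joined in order.
pinyin_sequence = "".join(i + f + t for i, f, t in zip(initials, finals, tones))
print(pinyin_sequence)   # wo3xi3huan1bei3jing1
```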
S303. Input the pinyin feature sequence into the language conversion model to obtain the target Chinese text.
Specifically, the language conversion model is trained by the language conversion model training method described above. Through this language model, the input pinyin feature sequence is converted from pinyin to Chinese, yielding the target Chinese text.
In the above speech recognition method, a target speech signal is acquired and preprocessed according to a preset processing rule to obtain its corresponding spectrum vector; the spectrum vector is input into a preset phoneme model to obtain a pinyin feature sequence; and the pinyin feature sequence is input into the language conversion model to obtain the target Chinese text. Because the language conversion model changes the sequential computation process of speech recognition and avoids the loss of position information, the accuracy and efficiency of speech recognition are improved.
Please refer to FIG. 9, a schematic block diagram of a training apparatus for a language conversion model provided by an embodiment of this application; the training apparatus is used to perform any of the foregoing language conversion model training methods and can be configured in a server or a terminal.
The server can be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, or a wearable device.
As shown in FIG. 9, the training apparatus 400 for the language conversion model includes: a corpus acquisition unit 401, a word segmentation processing unit 402, a vector conversion unit 403, a position acquisition unit 404, a vector splicing unit 405, and a model training unit 406.
The corpus acquisition unit 401 is configured to acquire a training pinyin corpus and data labels corresponding to the training pinyin corpus.
The word segmentation processing unit 402 is configured to perform word segmentation on the training pinyin corpus to obtain training word segmentation data.
The vector conversion unit 403 is configured to perform word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector.
The position acquisition unit 404 is configured to acquire position data information of the training word segmentation data in the training pinyin corpus and perform vector conversion on the position data information to obtain a position vector.
The vector splicing unit 405 is configured to splice the word embedding vector and the position vector to obtain a spliced word vector.
The model training unit 406 is configured to perform model training according to the spliced word vector and the data labels, based on a conversion neural network, to obtain a language conversion model.
Referring again to FIG. 9, in one embodiment the position acquisition unit 404 includes a data calculation subunit 4041, which is used to calculate, based on the position calculation formula and according to the training word segmentation data, the position data information of the training word segmentation data in the training pinyin corpus.
In one embodiment, the position acquisition unit 404 includes an order determination subunit 4042 and a vector conversion subunit 4043.
The order determination subunit 4042 is used to determine the arrangement order of the training word segmentation data in the training pinyin corpus.
The vector conversion subunit 4043 is used to perform vector conversion on the position data information according to the arrangement order to obtain the position vector corresponding to the training word segmentation data.
Referring to FIG. 10, in one embodiment the model training unit 406 includes a coding output subunit 4061, a text output subunit 4062, and a text verification subunit 4063.
The coding output subunit 4061 is used to input the spliced word vector into the encoder of the conversion neural network to output training coding information.
The text output subunit 4062 is used to input the training coding information into the decoder of the conversion neural network to output a training Chinese text.
The text verification subunit 4063 is used to verify the training Chinese text against the data labels and adjust the parameters in the encoder and the decoder until the training Chinese text passes the verification and the language conversion model is obtained.
Referring again to FIG. 10, in one implementation the encoder includes a dot-product attention model and a feedforward neural network model, and the coding output subunit 4061 includes a dot-product output submodule 4061a and an information output submodule 4061b.
The dot-product output submodule 4061a is used to input the spliced word vector into the dot-product attention model to output dot-product expressiveness information.
The information output submodule 4061b is used to input the dot-product expressiveness information into the feedforward neural network model to output training coding information.
Please refer to FIG. 11, a schematic block diagram of a speech recognition apparatus provided by an embodiment of this application; the speech recognition apparatus is used to perform the foregoing speech recognition method and can be configured in a server or a terminal.
As shown in FIG. 11, the speech recognition apparatus 500 includes: a signal acquisition unit 501, a spectrum input unit 502, and a text acquisition unit 503.
The signal acquisition unit 501 is configured to acquire a target speech signal and preprocess it according to preset processing rules to obtain the spectrum vector corresponding to the target speech signal.
The spectrum input unit 502 is configured to input the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence.
The text acquisition unit 503 is configured to input the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the language conversion model training method described above.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The apparatus described above may be implemented in the form of a computer program, which can run on a computer device as shown in FIG. 12.
Please refer to FIG. 12, a schematic block diagram of a computer device provided by an embodiment of this application. The computer device can be a server or a terminal.
Referring to FIG. 12, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, can cause the processor to perform any language conversion model training method or any speech recognition method.
The processor is used to provide computing and control capabilities and to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor can perform a language conversion model training method or any speech recognition method.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 12 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or it may be any conventional processor.
The processor is used to run a computer program stored in the memory to implement the following steps:
acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus; performing word segmentation on the training pinyin corpus to obtain training word segmentation data; performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector; acquiring position data information of the training word segmentation data in the training pinyin corpus and performing vector conversion on the position data information to obtain a position vector; splicing the word embedding vector and the position vector to obtain a spliced word vector; and, based on a conversion neural network, performing model training according to the spliced word vector and the data labels to obtain a language conversion model.
In one embodiment, when implementing the acquisition of the position data information of the training word segmentation data in the training pinyin corpus, the processor is configured to:
calculate, based on a position calculation formula and according to the training word segmentation data, the position data information of the training word segmentation data in the training pinyin corpus; the position calculation formula is:
PE(pos, 2m) = sin(pos / 10000^(2m/d_g))
or,
PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
where pos is the position of the training word segmentation data, m denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus.
In one embodiment, when implementing the vector conversion of the position data information to obtain a position vector, the processor is configured to:
determine the arrangement order of the training word segmentation data in the training pinyin corpus, and perform vector conversion on the position data information according to the arrangement order to obtain the position vector corresponding to the training word segmentation data.
In one embodiment, when implementing the conversion-neural-network-based model training according to the spliced word vector and the data labels corresponding to the training pinyin corpus to obtain a language conversion model, the processor is configured to:
input the spliced word vector into the encoder of the conversion neural network to output training coding information; input the training coding information into the decoder of the conversion neural network to output a training Chinese text; and verify the training Chinese text against the data labels, adjusting the parameters in the encoder and the decoder until the training Chinese text passes the verification and the language conversion model is obtained.
In one embodiment, the encoder includes a dot-product attention model and a feedforward neural network model; when inputting the spliced word vector into the encoder to output training coding information, the processor is configured to:
input the spliced word vector into the dot-product attention model to output dot-product expressiveness information, and input the dot-product expressiveness information into the feedforward neural network model to output training coding information.
In another embodiment, the processor is used to run a computer program stored in the memory to implement the following steps:
acquiring a target speech signal, and preprocessing the target speech signal according to preset processing rules to obtain the spectrum vector corresponding to the target speech signal; inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence; and inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by any of the language conversion model training methods described above.
The embodiments of this application also provide a computer-readable storage medium storing a computer program; the computer program includes program instructions, and by executing the program instructions the processor implements any language conversion model training method or any speech recognition method provided by the embodiments of this application.
The computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
The above are only specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily think of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A method for training a language conversion model, comprising:
    acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus;
    performing word segmentation on the training pinyin corpus to obtain training word segmentation data;
    performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
    calculating, based on a position calculation formula and according to the training word segmentation data, position data information of the training word segmentation data in the training pinyin corpus, and performing vector conversion on the position data information to obtain a position vector; the position calculation formula being:
    PE(pos, 2m) = sin(pos / 10000^(2m/d_g))
    or,
    PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
    where pos is the position of the training word segmentation data, m denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus;
    splicing the word embedding vector and the position vector to obtain a spliced word vector;
    based on a conversion neural network, performing model training according to the spliced word vector and the data labels to obtain a language conversion model.
  2. The method for training a language conversion model according to claim 1, wherein performing word segmentation on the training pinyin corpus to obtain training word segmentation data comprises:
    performing one-hot encoding on the training pinyin corpus according to a preset dictionary to obtain the training word segmentation data.
  3. The method for training a language conversion model according to claim 1, wherein performing vector conversion on the position data information to obtain a position vector comprises:
    determining the arrangement order of the training word segmentation data in the training pinyin corpus;
    performing vector conversion on the position data information according to the arrangement order to obtain the position vector corresponding to the training word segmentation data.
  4. The method for training a language conversion model according to any one of claims 1-3, wherein performing model training based on the conversion neural network according to the spliced word vector and the data labels corresponding to the training pinyin corpus to obtain a language conversion model comprises:
    inputting the spliced word vector into an encoder of the conversion neural network to output training coding information;
    inputting the training coding information into a decoder of the conversion neural network to output a training Chinese text;
    verifying the training Chinese text against the data labels, and adjusting parameters in the encoder and the decoder until the training Chinese text passes the verification, obtaining the language conversion model.
  5. The method for training a language conversion model according to claim 4, wherein the encoder comprises a dot-product attention model and a feedforward neural network model, and inputting the spliced word vector into the encoder to output training coding information comprises:
    inputting the spliced word vector into the dot-product attention model to output dot-product expressiveness information;
    inputting the dot-product expressiveness information into the feedforward neural network model to output the training coding information.
  6. A speech recognition method, comprising:
    acquiring a target speech signal, and preprocessing the target speech signal according to preset processing rules to obtain a spectrum vector corresponding to the target speech signal;
    inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence;
    inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the method for training a language conversion model according to any one of claims 1-5.
  7. A training apparatus for a language conversion model, comprising:
    a corpus acquisition unit for acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus;
    a word segmentation processing unit for performing word segmentation on the training pinyin corpus to obtain training word segmentation data;
    a vector conversion unit for performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
    a position acquisition unit for calculating, based on a position calculation formula and according to the training word segmentation data, position data information of the training word segmentation data in the training pinyin corpus, and performing vector conversion on the position data information to obtain a position vector; the position calculation formula being:
    PE(pos, 2m) = sin(pos / 10000^(2m/d_g))
    or,
    PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
    where pos is the position of the training word segmentation data, m denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus;
    a vector splicing unit for splicing the word embedding vector and the position vector to obtain a spliced word vector;
    a model training unit for performing model training according to the spliced word vector and the data labels, based on a conversion neural network, to obtain a language conversion model.
  8. A speech recognition apparatus, comprising:
    a signal acquisition unit for acquiring a target speech signal and preprocessing the target speech signal according to preset processing rules to obtain a spectrum vector corresponding to the target speech signal;
    a spectrum input unit for inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence;
    a text acquisition unit for inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the method for training a language conversion model according to any one of claims 1-5.
  9. A computer device, comprising a memory and a processor;
    the memory being used to store a computer program;
    the processor being used to execute the computer program and, when executing the computer program, to implement the following steps:
    acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus;
    performing word segmentation on the training pinyin corpus to obtain training word segmentation data;
    performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
    calculating, based on a position calculation formula and according to the training word segmentation data, position data information of the training word segmentation data in the training pinyin corpus, and performing vector conversion on the position data information to obtain a position vector; the position calculation formula being:
    PE(pos, 2m) = sin(pos / 10000^(2m/d_g))
    or,
    PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
    where pos is the position of the training word segmentation data, m denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus;
    splicing the word embedding vector and the position vector to obtain a spliced word vector;
    based on a conversion neural network, performing model training according to the spliced word vector and the data labels to obtain a language conversion model.
  10. The computer device according to claim 9, wherein performing word segmentation on the training pinyin corpus to obtain training word segmentation data comprises:
    performing one-hot encoding on the training pinyin corpus according to a preset dictionary to obtain the training word segmentation data.
  11. The computer device according to claim 9, wherein performing vector conversion on the position data information to obtain a position vector comprises:
    determining the arrangement order of the training word segmentation data in the training pinyin corpus;
    performing vector conversion on the position data information according to the arrangement order to obtain the position vector corresponding to the training word segmentation data.
  12. The computer device according to any one of claims 9-11, wherein performing model training based on the conversion neural network according to the spliced word vector and the data labels corresponding to the training pinyin corpus to obtain a language conversion model comprises:
    inputting the spliced word vector into an encoder of the conversion neural network to output training coding information;
    inputting the training coding information into a decoder of the conversion neural network to output a training Chinese text;
    verifying the training Chinese text against the data labels, and adjusting parameters in the encoder and the decoder until the training Chinese text passes the verification, obtaining the language conversion model.
  13. The computer device according to claim 12, wherein the encoder comprises a dot-product attention model and a feedforward neural network model, and inputting the spliced word vector into the encoder to output training coding information comprises:
    inputting the spliced word vector into the dot-product attention model to output dot-product expressiveness information;
    inputting the dot-product expressiveness information into the feedforward neural network model to output the training coding information.
  14. A computer device, comprising a memory and a processor;
    the memory being used to store a computer program;
    the processor being used to execute the computer program and, when executing the computer program, to implement the following steps:
    acquiring a target speech signal, and preprocessing the target speech signal according to preset processing rules to obtain a spectrum vector corresponding to the target speech signal;
    inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence;
    inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the method for training a language conversion model according to any one of claims 1-5.
  15. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the following steps:
    acquiring a training pinyin corpus and data labels corresponding to the training pinyin corpus;
    performing word segmentation on the training pinyin corpus to obtain training word segmentation data;
    performing word vector conversion on the training word segmentation data according to a preset word embedding model to obtain a word embedding vector;
    calculating, based on a position calculation formula and according to the training word segmentation data, position data information of the training word segmentation data in the training pinyin corpus, and performing vector conversion on the position data information to obtain a position vector; the position calculation formula being:
    PE(pos, 2m) = sin(pos / 10000^(2m/d_g))
    or,
    PE(pos, 2m+1) = cos(pos / 10000^(2m/d_g))
    where pos is the position of the training word segmentation data, m denotes the dimension of the word embedding vector corresponding to the training word segmentation data, and d_g is the vector dimension corresponding to the training pinyin corpus;
    splicing the word embedding vector and the position vector to obtain a spliced word vector;
    based on a conversion neural network, performing model training according to the spliced word vector and the data labels to obtain a language conversion model.
  16. The computer-readable storage medium according to claim 15, wherein performing word segmentation on the training pinyin corpus to obtain training word segmentation data comprises:
    performing one-hot encoding on the training pinyin corpus according to a preset dictionary to obtain the training word segmentation data.
  17. The computer-readable storage medium according to claim 15, wherein performing vector conversion on the position data information to obtain a position vector comprises:
    determining the arrangement order of the training word segmentation data in the training pinyin corpus;
    performing vector conversion on the position data information according to the arrangement order to obtain the position vector corresponding to the training word segmentation data.
  18. The computer-readable storage medium according to any one of claims 15-17, wherein performing model training based on the conversion neural network according to the spliced word vector and the data labels corresponding to the training pinyin corpus to obtain a language conversion model comprises:
    inputting the spliced word vector into an encoder of the conversion neural network to output training coding information;
    inputting the training coding information into a decoder of the conversion neural network to output a training Chinese text;
    verifying the training Chinese text against the data labels, and adjusting parameters in the encoder and the decoder until the training Chinese text passes the verification, obtaining the language conversion model.
  19. The computer-readable storage medium according to claim 18, wherein the encoder comprises a dot-product attention model and a feedforward neural network model, and inputting the spliced word vector into the encoder to output training coding information comprises:
    inputting the spliced word vector into the dot-product attention model to output dot-product expressiveness information;
    inputting the dot-product expressiveness information into the feedforward neural network model to output the training coding information.
  20. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the following steps:
    acquiring a target speech signal, and preprocessing the target speech signal according to preset processing rules to obtain a spectrum vector corresponding to the target speech signal;
    inputting the spectrum vector into a preset phoneme model to obtain a pinyin feature sequence;
    inputting the pinyin feature sequence into a language conversion model to obtain a target Chinese text, the language conversion model being trained by the method for training a language conversion model according to any one of claims 1-5.
PCT/CN2019/118227 2019-06-17 2019-11-13 Speech recognition method, model training method, apparatus, device, and storage medium WO2020253060A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910522750.8A CN110288980A (zh) 2019-06-17 2019-06-17 Speech recognition method, model training method, apparatus, device, and storage medium
CN201910522750.8 2019-06-17

Publications (1)

Publication Number Publication Date
WO2020253060A1 true WO2020253060A1 (zh) 2020-12-24

Family

ID=68005146

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118227 WO2020253060A1 (zh) 2019-06-17 2019-11-13 Speech recognition method, model training method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110288980A (zh)
WO (1) WO2020253060A1 (zh)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 语音识别方法、模型的训练方法、装置、设备及存储介质
CN110827816A (zh) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 语音指令识别方法、装置、电子设备及存储介质
CN111222335A (zh) * 2019-11-27 2020-06-02 上海眼控科技股份有限公司 语料修正方法、装置、计算机设备和计算机可读存储介质
CN110970031B (zh) * 2019-12-16 2022-06-24 思必驰科技股份有限公司 语音识别系统及方法
CN111144370B (zh) * 2019-12-31 2023-08-04 科大讯飞华南人工智能研究院(广州)有限公司 单据要素抽取方法、装置、设备及存储介质
CN111090886A (zh) * 2019-12-31 2020-05-01 新奥数能科技有限公司 脱敏数据确定方法、装置、可读介质及电子设备
CN111833849B (zh) * 2020-03-10 2024-06-11 北京嘀嘀无限科技发展有限公司 语音识别和语音模型训练的方法及存储介质和电子设备
CN111382340A (zh) * 2020-03-20 2020-07-07 北京百度网讯科技有限公司 信息识别方法、信息识别装置和电子设备
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN111681669A (zh) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 一种基于神经网络的语音数据的识别方法与设备
CN111859994B (zh) * 2020-06-08 2024-01-23 北京百度网讯科技有限公司 机器翻译模型获取及文本翻译方法、装置及存储介质
CN111881726B (zh) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 一种活体检测方法、装置及存储介质
CN112002306B (zh) * 2020-08-26 2024-04-05 阳光保险集团股份有限公司 语音类别的识别方法、装置、电子设备及可读存储介质
CN112133304B (zh) * 2020-09-18 2022-05-06 中科极限元(杭州)智能科技股份有限公司 基于前馈神经网络的低延时语音识别模型及训练方法
CN112132281B (zh) * 2020-09-29 2024-04-26 腾讯科技(深圳)有限公司 一种基于人工智能的模型训练方法、装置、服务器及介质
CN112417086B (zh) * 2020-11-30 2024-02-27 深圳市与飞科技有限公司 数据处理方法、装置、服务器及存储介质
CN112528637B (zh) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 文本处理模型训练方法、装置、计算机设备和存储介质
CN112820269B (zh) * 2020-12-31 2024-05-28 平安科技(深圳)有限公司 文本转语音方法、装置、电子设备及存储介质
CN113035231B (zh) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 关键词检测方法及装置
CN113129869B (zh) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 语音识别模型的训练与语音识别的方法、装置
CN112951204B (zh) * 2021-03-29 2023-06-13 北京大米科技有限公司 语音合成方法和装置
CN113761841B (zh) * 2021-04-19 2023-07-25 腾讯科技(深圳)有限公司 将文本数据转换为声学特征的方法
CN112906403B (zh) * 2021-04-25 2023-02-03 中国平安人寿保险股份有限公司 语义分析模型训练方法、装置、终端设备及存储介质
CN112951240B (zh) * 2021-05-14 2021-10-29 北京世纪好未来教育科技有限公司 模型训练、语音识别方法及装置、电子设备及存储介质
CN113297346B (zh) * 2021-06-28 2023-10-31 中国平安人寿保险股份有限公司 文本意图识别方法、装置、设备及存储介质
CN113377997B (zh) * 2021-06-30 2024-06-18 腾讯音乐娱乐科技(深圳)有限公司 一种歌曲检索方法、电子设备及计算机可读存储介质
CN113486671B (zh) * 2021-07-27 2023-06-30 平安科技(深圳)有限公司 基于正则表达式编码的数据扩展方法、装置、设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464A (zh) * 2013-10-18 2014-02-12 威盛电子股份有限公司 语言模型的建立方法、语音辨识方法及电子装置
CN109492232A (zh) * 2018-10-22 2019-03-19 内蒙古工业大学 一种基于Transformer的增强语义特征信息的蒙汉机器翻译方法
CN109684452A (zh) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 一种基于答案与答案位置信息的神经网络问题生成方法
CN109859760A (zh) * 2019-02-19 2019-06-07 成都富王科技有限公司 基于深度学习的电话机器人语音识别结果校正方法
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 语音识别方法、模型的训练方法、装置、设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204184B (zh) * 2017-05-10 2018-08-03 平安科技(深圳)有限公司 语音识别方法及系统
CN108549637A (zh) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 基于拼音的语义识别方法、装置以及人机对话系统
CN109800298B (zh) * 2019-01-29 2023-06-16 苏州大学 一种基于神经网络的中文分词模型的训练方法
CN109817246B (zh) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
CN109817198B (zh) * 2019-03-06 2021-03-02 广州多益网络股份有限公司 语音合成方法、装置及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464A (zh) * 2013-10-18 2014-02-12 威盛电子股份有限公司 语言模型的建立方法、语音辨识方法及电子装置
CN109492232A (zh) * 2018-10-22 2019-03-19 内蒙古工业大学 一种基于Transformer的增强语义特征信息的蒙汉机器翻译方法
CN109684452A (zh) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 一种基于答案与答案位置信息的神经网络问题生成方法
CN109859760A (zh) * 2019-02-19 2019-06-07 成都富王科技有限公司 基于深度学习的电话机器人语音识别结果校正方法
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 语音识别方法、模型的训练方法、装置、设备及存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JULIAN SALAZAR ET AL.: "Self-attention Networks for Connectionist Temporal Classification in Speech Recognition", 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 17 May 2019 (2019-05-17), XP033565120, ISSN: 2379-190X, DOI: 20200317111756X *
L. DONG ET AL.: "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition", 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 20 April 2018 (2018-04-20), XP033401817, ISSN: 2379-190X, DOI: 20200317111505X *

Also Published As

Publication number Publication date
CN110288980A (zh) 2019-09-27

Similar Documents

Publication Publication Date Title
WO2020253060A1 (zh) 语音识别方法、模型的训练方法、装置、设备及存储介质
US11816442B2 (en) Multi-turn dialogue response generation with autoregressive transformer models
US20230186912A1 (en) Speech recognition method, apparatus and device, and storage medium
WO2018133761A1 (zh) 一种人机对话的方法和装置
CN110516253B (zh) 中文口语语义理解方法及系统
US20200042613A1 (en) Processing an incomplete message with a neural network to generate suggested messages
CN112528637B (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
US20130346066A1 (en) Joint Decoding of Words and Tags for Conversational Understanding
WO2022252636A1 (zh) 基于人工智能的回答生成方法、装置、设备及存储介质
US20220083742A1 (en) Man-machine dialogue method and system, computer device and medium
JP2020042257A (ja) 音声認識方法及び装置
WO2023134067A1 (zh) 语音分类模型的训练方法、装置、设备及存储介质
JP2021081713A (ja) 音声信号を処理するための方法、装置、機器、および媒体
CN111027681B (zh) 时序数据处理模型训练方法、数据处理方法、装置及存储介质
US20220310065A1 (en) Supervised and Unsupervised Training with Contrastive Loss Over Sequences
KR20200041199A (ko) 챗봇 구동 방법, 장치 및 컴퓨터 판독가능 매체
WO2022257454A1 (zh) 一种合成语音的方法、装置、终端及存储介质
CN112487813B (zh) 命名实体识别方法及系统、电子设备及存储介质
CN111797220A (zh) 对话生成方法、装置、计算机设备和存储介质
CN116775873A (zh) 一种多模态对话情感识别方法
CN113515617B (zh) 一种对话生成模型的方法、装置以及设备
KR20200082240A (ko) 호칭 결정 장치, 이를 포함하는 대화 서비스 제공 시스템, 호칭 결정을 위한 단말 장치 및 호칭 결정 방법
CN111401069A (zh) 会话文本的意图识别方法、意图识别装置及终端
JP2022121386A (ja) テキストベースの話者変更検出を活用した話者ダイアライゼーション補正方法およびシステム
CN114373443A (zh) 语音合成方法和装置、计算设备、存储介质及程序产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19933688

Country of ref document: EP

Kind code of ref document: A1