WO2020073509A1 - Neural network-based speech recognition method, terminal device, and medium - Google Patents

Neural network-based speech recognition method, terminal device, and medium

Info

Publication number
WO2020073509A1
WO2020073509A1 · PCT/CN2018/124306
Authority
WO
WIPO (PCT)
Prior art keywords
speech
preset
sequence
probability
neural network
Prior art date
Application number
PCT/CN2018/124306
Other languages
French (fr)
Chinese (zh)
Inventor
王义文
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020073509A1 publication Critical patent/WO2020073509A1/en


Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
              • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
              • G10L2015/027 Syllables being the recognition units
            • G10L15/26 Speech to text systems
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
              • G10L25/24 the extracted parameters being the cepstrum
            • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
              • G10L25/30 using neural networks

Definitions

  • The present application belongs to the field of artificial intelligence technology, and particularly relates to a neural network-based speech recognition method, a terminal device, and a non-volatile readable storage medium.
  • Speech recognition is the process of converting a speech sequence into a text sequence.
  • With the rapid development of artificial intelligence technology, speech recognition models based on machine learning are widely used in various speech recognition scenarios.
  • However, when training a traditional machine-learning-based speech recognition model, the pronunciation phoneme corresponding to each frame of speech data must be known in advance, which requires that the speech sequence and the text sequence be frame-aligned before training.
  • The sample data used to train such a model is large, and performing frame alignment on the speech sequence and the text sequence contained in every sample consumes considerable manpower and time, so the labor cost and time cost are high.
  • Embodiments of the present application provide a neural network-based speech recognition method, a terminal device, and a non-volatile readable storage medium, to solve the problem that existing speech recognition methods based on traditional speech recognition models incur high labor and time costs.
  • A first aspect of the embodiments of the present application provides a neural network-based speech recognition method, including:
  • obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;
  • performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;
  • determining, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and
  • determining, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
  • A second aspect of the embodiments of the present application provides a terminal device, including:
  • a first dividing unit, configured to obtain a speech sequence to be recognized and divide the speech sequence into at least two frames of speech segments;
  • a feature extraction unit, configured to perform acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;
  • a first determining unit, configured to determine, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and
  • a second determining unit, configured to determine, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
  • A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the following steps are implemented:
  • obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;
  • performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;
  • determining, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and
  • determining, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
  • A fourth aspect of the embodiments of the present application provides a non-volatile readable storage medium storing computer-readable instructions; when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;
  • performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;
  • determining, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and
  • determining, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
  • In the neural network-based speech recognition method provided by the embodiments of the present application, the speech sequence to be recognized is divided into at least two frames of speech segments and the feature vector of each frame is extracted; the probability calculation layer of the preset neural network model determines the first probability vector of each speech segment based on its feature vector; and the joint time-series classification layer of the preset neural network model determines the text sequence corresponding to the speech sequence to be recognized based on the first probability vectors of all the speech segments.
  • Because the joint time-series classification layer can directly determine the text sequence corresponding to the speech sequence to be recognized from the first probability vectors of all its speech segments, training the preset neural network model of this embodiment does not require frame-aligning the speech sequences and text sequences in the sample data used for model training, thereby saving the time cost and labor cost of speech recognition.
  • FIG. 1 is an implementation flowchart of a neural network-based speech recognition method provided by the first embodiment of the present application
  • FIG. 2 is a specific implementation flowchart of S13 in a neural network-based speech recognition method provided by a second embodiment of the present application;
  • FIG. 3 is a specific implementation flowchart of S14 in a speech recognition method based on a neural network according to a third embodiment of the present application;
  • FIG. 4 is a flowchart of a neural network-based speech recognition method provided by a fourth embodiment of the present application.
  • FIG. 5 is a structural block diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 6 is a structural block diagram of a terminal device according to another embodiment of the present application.
  • FIG. 1 is an implementation flowchart of a neural network-based speech recognition method provided in the first embodiment of the present application.
  • In this embodiment, the execution subject of the neural network-based speech recognition method is a terminal device.
  • Terminal devices include, but are not limited to, smartphones, tablet computers, and desktop computers.
  • The neural network-based speech recognition method shown in FIG. 1 includes the following steps:
  • S11: Acquire a speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments.
  • The speech sequence refers to a piece of speech data whose duration is greater than a preset duration threshold, where the preset duration threshold is greater than zero.
  • The speech sequence to be recognized is a speech sequence that needs to be translated into a text sequence.
  • When a speech sequence needs to be translated into a corresponding text sequence, a speech recognition request for the speech sequence can be triggered on the terminal device, the request carrying the speech sequence to be recognized.
  • The speech recognition request is used to request the terminal device to translate the speech sequence to be recognized into a corresponding text sequence.
  • For example, when a user chats with a contact through an instant messaging application installed on the terminal device and receives a speech sequence sent by the other party, the user can have the speech sequence translated into a corresponding text sequence for viewing when needed.
  • Specifically, the user can long-press or right-click the display icon corresponding to the speech sequence to make the terminal device display a menu bar for the speech sequence, and trigger the speech recognition request for the speech sequence through the "Translate to text" option in that menu bar.
  • When the terminal device detects a speech recognition request for a certain speech sequence, it extracts the speech sequence to be recognized from the request and divides the extracted speech sequence into at least two frames of speech segments.
  • As an embodiment of the present application, the terminal device may divide the speech sequence to be recognized into at least two frames of speech segments in the following manner; that is, S11 may specifically include the following step:
  • framing the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, where the duration of each speech segment is the preset frame length.
  • The preset frame length identifies the duration of each frame of speech segment obtained by framing the speech sequence; the preset frame shift identifies the time step used when framing the speech sequence.
  • After obtaining the speech sequence to be recognized, the terminal device, starting from the starting time point of the sequence, intercepts a speech segment of the preset frame length every preset frame shift, thereby dividing the speech sequence to be recognized into at least two frames of speech segments; the duration of each frame of speech segment obtained by the framing operation is the preset frame length, and the starting time points of every two adjacent frames are separated by the preset frame shift.
  • It should be noted that, in the embodiments of the present application, the preset frame shift is smaller than the preset frame length; that is, every two adjacent frames of speech segments overlap in time, and the duration of the overlapping portion is the difference between the preset frame length and the preset frame shift. For example, with a preset frame length of 25 ms and a preset frame shift of 10 ms, adjacent segments overlap by 15 ms, as in the sketch below.
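  • As an illustration only (the patent does not prescribe an implementation), the following Python sketch frames a waveform with the 25 ms frame length and 10 ms frame shift of the example above; the function name and default values are assumptions:

```python
import numpy as np

def frame_speech(signal: np.ndarray, sample_rate: int,
                 frame_len_ms: float = 25.0, frame_shift_ms: float = 10.0):
    """Split a 1-D speech signal into overlapping fixed-length frames.

    Each frame lasts frame_len_ms; each frame starts frame_shift_ms after
    the previous one, so adjacent frames overlap by
    frame_len_ms - frame_shift_ms (15 ms with these defaults).
    """
    frame_len = int(sample_rate * frame_len_ms / 1000)      # samples per frame
    frame_shift = int(sample_rate * frame_shift_ms / 1000)  # samples per shift
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: 1 second of audio at 16 kHz -> 98 frames of 400 samples each
frames = frame_speech(np.zeros(16000), sample_rate=16000)
print(frames.shape)  # (98, 400)
```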
  • S12: Perform acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment.
  • Because a single frame of speech, taken in the time domain, has little descriptive power on its own, after dividing the speech sequence to be recognized into at least two frames of speech segments, the terminal device performs acoustic feature extraction on each frame of speech segment based on a preset feature extraction network, obtaining the feature vector of each frame.
  • The feature vector of a speech segment contains the acoustic feature information of that segment.
  • The preset feature extraction network can be set according to actual needs and is not limited here; for example, it may be a Mel-Frequency Cepstral Coefficient (MFCC) feature extraction network.
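  • For instance, per-frame MFCC feature vectors could be computed with an off-the-shelf library such as librosa; this sketch is an assumption for illustration (the patent does not name a library), and the file path is a placeholder:

```python
import librosa

# Load a waveform (path is a placeholder for illustration).
signal, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame; window and hop mirror the 25 ms / 10 ms framing above.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 25 ms window
                            hop_length=int(0.010 * sr))  # 10 ms shift
feature_vectors = mfcc.T  # shape: (num_frames, 13), one vector per frame
```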
  • S13: At the probability calculation layer of the preset neural network model, determine the first probability vector of the speech segment based on the feature vector of the speech segment; the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element.
  • The feature vectors of all speech segments obtained by framing the speech sequence to be recognized are imported into the preset neural network model.
  • The preset neural network model is obtained by training a pre-built original neural network model through a machine learning algorithm based on a preset number of sample data. Each sample consists of the feature vectors of all speech segments obtained by framing a speech sequence and the text sequence corresponding to that speech sequence.
  • The original neural network model includes a probability calculation layer and a joint time-series classification layer connected in sequence, where:
  • the probability calculation layer is used to determine the first probability vector of the speech segment based on the feature vector of the speech segment.
  • the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element.
  • the number of elements contained in the first probability vector is the same as the number of preset phonemes.
  • phonemes are the smallest phonetic units divided according to the natural attributes of speech. Phonemes usually include two major categories of vowel phonemes and consonant phonemes.
  • the preset phonemes can be set according to actual needs, without limitation here. In the embodiment of the present application, the preset phoneme includes at least one blank phoneme.
  • the preset phoneme may include a blank phoneme and all vowel phonemes and consonant phonemes in Chinese pinyin.
  • the joint timing classification layer is used to determine the text sequence corresponding to the speech sequence based on the first probability vectors of all speech segments obtained by framing the speech sequence to be recognized.
  • When training the original neural network model, the feature vectors of all speech segments obtained by framing the speech sequence contained in each sample are used as the input of the original neural network model, and the text sequence corresponding to that speech sequence is used as its output; the original neural network model that has completed training is determined as the preset neural network model in the embodiments of the present application. It should be noted that, during training, the terminal device learns, in the probability calculation layer, the probabilities of the feature vectors of all speech segments appearing in the sample data relative to each preset phoneme.
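  • The joint time-series classification layer corresponds to connectionist temporal classification (CTC). A minimal training sketch, assuming PyTorch and made-up tensor shapes (none of which come from the patent), shows why the sample data needs no frame alignment: the CTC loss takes the unaligned target phoneme sequence directly and marginalizes over all possible frame-level alignments:

```python
import torch
import torch.nn as nn

T, B, C = 100, 8, 50   # frames, batch size, number of phonemes incl. blank

# Per-frame log-probabilities from the probability calculation layer: (T, B, C)
logits = torch.randn(T, B, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Unaligned target phoneme indices (index 0 is reserved for the blank),
# padded to a maximum length of 30.
targets = torch.randint(1, C, (B, 30), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(10, 31, (B,), dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)  # sums over all frame-level alignments
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # during training, gradients flow back to the logits
```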
  • After the terminal device imports the feature vectors of all speech segments obtained by framing the speech sequence to be recognized into the preset neural network model, the first probability vector of each frame of speech segment can be determined through S131 and S132 shown in FIG. 2:
  • S131: In the probability calculation layer, determine the probability of the feature vector of each of the at least two frames of speech segments relative to each preset phoneme, based on the pre-learned probabilities of the feature vectors of the preset speech segments relative to each preset phoneme.
  • S132: Determine the first probability vector of the speech segment based on the probabilities of the feature vector of the speech segment relative to each preset phoneme.
  • The preset speech segments include all speech segments that have appeared in the sample data, and the pre-learned probabilities of the feature vectors of the preset speech segments relative to each preset phoneme are the probabilities, learned in advance, of the feature vectors of the speech segments that appeared in the sample data relative to each preset phoneme.
  • That is, based on these pre-learned probabilities, the probability calculation layer of the preset neural network model determines the probability of the feature vector of each frame of speech segment, obtained by framing the speech sequence to be recognized, relative to each preset phoneme; the probabilities of a segment's feature vector relative to all the preset phonemes constitute the first probability vector of that segment, as in the sketch below.
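  • One plausible form of the probability calculation layer, assumed here purely for illustration (the patent leaves its internal structure open), is a bidirectional LSTM followed by a per-frame softmax over the N preset phonemes, so each frame yields exactly one first probability vector:

```python
import torch
import torch.nn as nn

class ProbabilityLayer(nn.Module):
    """Maps per-frame feature vectors to per-frame phoneme probabilities."""
    def __init__(self, feat_dim: int = 13, hidden: int = 128, n_phonemes: int = 50):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) -> first probability vectors: (batch, T, N)
        out, _ = self.rnn(feats)
        return torch.softmax(self.proj(out), dim=-1)

layer = ProbabilityLayer()
probs = layer(torch.randn(1, 98, 13))  # e.g. 98 frames of 13-dim MFCCs
print(probs.shape, probs[0, 0].sum())  # (1, 98, 50); each row sums to 1.0
```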
  • S14: At the joint time-series classification layer of the preset neural network model, determine the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
  • After determining the first probability vector of each frame of speech segment obtained by framing the speech sequence to be recognized, the terminal device determines the text sequence corresponding to the speech sequence to be recognized based on these first probability vectors.
  • Assume the total number of preset phonemes is N and that T frames of speech segments are obtained by framing the speech sequence to be recognized. Since the pronunciation phoneme corresponding to each frame may be any one of the N preset phonemes, there are N^T possible pronunciation phoneme sequences corresponding to the speech sequence to be recognized. These N^T pronunciation phoneme sequences are taken as the preset pronunciation phoneme sequences; each preset pronunciation phoneme sequence is a sequence of length T composed of phonemes drawn from the preset phonemes.
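  • For small N and T the preset pronunciation phoneme sequences can be enumerated outright; the sketch below (illustrative values only) checks the N^T count for N = 4 phonemes and T = 3 frames:

```python
from itertools import product

phonemes = ["a", "i", "o", "-"]  # N = 4 preset phonemes, "-" = blank
T = 3                            # frames after the framing operation
sequences = list(product(phonemes, repeat=T))
print(len(sequences))            # 4**3 = 64 preset sequences
print(sequences[:3])             # [('a','a','a'), ('a','a','i'), ('a','a','o')]
```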
  • S14 may be implemented through S141 to S143 shown in FIG. 3, and the details are as follows:
  • S141: At the joint time-series classification layer of the preset neural network model, calculate the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula; the value of each element in the pronunciation phoneme probability vector is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element. The preset probability calculation formula is as follows:

    P_i = y_i1 × y_i2 × … × y_iT = ∏_{t=1}^{T} y_it,  i ∈ [1, N^T]

    where P_i is the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the i-th preset pronunciation phoneme sequence, and y_it is the a priori probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence; its value is determined according to the first probability vector of the t-th speech segment.
  • That is, at the joint time-series classification layer of the preset neural network model, the terminal device calculates the pronunciation phoneme probability vector of the speech sequence to be recognized based on the first probability vectors of all the speech segments obtained by framing it and the preset probability calculation formula above.
  • The value of each element in the pronunciation phoneme probability vector of the speech sequence is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element.
  • For example, suppose the preset phonemes include the following four phonemes: a, i, o, and the blank phoneme "-", that T = 5, and that the first preset pronunciation phoneme sequence is [a, a, i, -, -]. Then the a priori probability corresponding to the 1st pronunciation phoneme contained in the first preset pronunciation phoneme sequence is the probability, determined at the probability calculation layer, of the feature vector of the 1st frame of speech segment relative to the preset phoneme a; the a priori probability corresponding to the 2nd pronunciation phoneme is the probability of the feature vector of the 2nd frame relative to the preset phoneme a; the a priori probability corresponding to the 3rd pronunciation phoneme is the probability of the feature vector of the 3rd frame relative to the preset phoneme i; and the a priori probabilities corresponding to the 4th and 5th pronunciation phonemes are the probabilities of the feature vectors of the 4th and 5th frames relative to the blank phoneme.
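  • Continuing this example, the probability of one preset pronunciation phoneme sequence is the product of the per-frame a priori probabilities read out of the first probability vectors. The numbers below are invented for illustration; only the formula comes from the description:

```python
import numpy as np

phoneme_index = {"a": 0, "i": 1, "o": 2, "-": 3}

# First probability vectors of T = 5 frames over the 4 preset phonemes
# (illustrative values only; each row sums to 1).
first_prob_vectors = np.array([
    [0.7, 0.1, 0.1, 0.1],   # frame 1
    [0.6, 0.2, 0.1, 0.1],   # frame 2
    [0.1, 0.7, 0.1, 0.1],   # frame 3
    [0.1, 0.1, 0.1, 0.7],   # frame 4
    [0.1, 0.1, 0.1, 0.7],   # frame 5
])

def sequence_probability(seq):
    """P_i = product over t of y_it, the prior of seq[t] at frame t."""
    return np.prod([first_prob_vectors[t, phoneme_index[p]]
                    for t, p in enumerate(seq)])

print(sequence_probability(["a", "a", "i", "-", "-"]))  # 0.7*0.6*0.7*0.7*0.7
```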
  • S142: Determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector is used to identify the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, where the preset text sequences are obtained by compressing the preset pronunciation phoneme sequences.
  • A preset pronunciation phoneme sequence usually contains some blank phonemes, or some adjacent elements that correspond to the same phoneme. Therefore, after determining the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is each preset pronunciation phoneme sequence, the terminal device compresses each preset pronunciation phoneme sequence to obtain the text sequence corresponding to it, and then converts the probability that the pronunciation phoneme sequence corresponding to the speech sequence is each preset pronunciation phoneme sequence into the probability that the text sequence corresponding to the speech sequence is the text sequence corresponding to that preset pronunciation phoneme sequence.
  • Specifically, compressing a preset pronunciation phoneme sequence may include: retaining only one element of any run of consecutive identical elements, and removing the blank phonemes. For example, if a preset pronunciation phoneme sequence is [a, a, i, -, -], the text sequence obtained after compression is [a, i].
  • Note that different pronunciation phoneme sequences may compress to the same text sequence; for example, compressing [a, a, i, -, -] yields [a, i], and compressing [a, -, i, i, -] also yields [a, i]. Therefore, in the embodiments of the present application, when at least two preset pronunciation phoneme sequences share the same text sequence, the terminal device sums the probabilities that the text sequence corresponding to the speech sequence to be recognized is the text sequence corresponding to those preset pronunciation phoneme sequences, obtaining the probability that the text sequence corresponding to the speech sequence is each preset text sequence.
  • The preset text sequences consist of all the distinct text sequences obtained by compressing the preset pronunciation phoneme sequences.
  • The probabilities that the text sequence corresponding to the speech sequence to be recognized is each preset text sequence constitute the text sequence probability vector.
  • S143: Determine the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.
  • That is, the terminal device determines the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence to be recognized, as in the sketch below.
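  • The compression of S142 and the arg-max selection of S143 can be sketched as follows; this continues the previous sketch (reusing its first_prob_vectors and sequence_probability) and is illustrative only:

```python
from collections import defaultdict
from itertools import groupby, product

def compress(seq, blank="-"):
    """Keep one element of each run of identical phonemes, then drop
    blanks: ('a','a','i','-','-') -> ('a','i')."""
    return tuple(p for p, _ in groupby(seq) if p != blank)

# S142: sum the probabilities of all preset pronunciation phoneme sequences
# that compress to the same text sequence.
text_probs = defaultdict(float)
for path in product(["a", "i", "o", "-"], repeat=5):
    text_probs[compress(path)] += sequence_probability(path)

# S143: the preset text sequence with the largest probability is selected.
best_text = max(text_probs, key=text_probs.get)
print(best_text)  # ('a', 'i') with the illustrative numbers above
```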
  • As can be seen from the above, by dividing the speech sequence to be recognized into at least two frames of speech segments and extracting the feature vector of each segment, the probability calculation layer of the preset neural network model determines each segment's first probability vector from its feature vector, and the joint time-series classification layer determines the text sequence corresponding to the speech sequence to be recognized from the first probability vectors of all the segments. Because the joint time-series classification layer can directly determine the text sequence from these first probability vectors, training the preset neural network model of this embodiment does not require frame-aligning the speech sequences and text sequences in the sample data used for model training, which saves the time cost and labor cost of speech recognition.
  • FIG. 4 is an implementation flowchart of a neural network-based speech recognition method according to a fourth embodiment of the present application.
  • a neural network-based speech recognition method provided in this embodiment may include S01 to S04 before S11, as described in detail as follows:
  • S01: Obtain a preset sample data set, and divide the sample data set into a training set and a test set; each sample in the sample data set consists of the feature vectors of all speech segments obtained by framing a speech sequence and the text sequence corresponding to that speech sequence.
  • Before translating the speech sequence to be recognized into the corresponding text sequence, the original neural network model needs to be constructed first.
  • The original neural network model includes a probability calculation layer and a joint time-series classification layer connected in sequence.
  • For the specific structure and principle of the probability calculation layer and the joint time-series classification layer, refer to the relevant description in S13 of the first embodiment, which is not repeated here.
  • After constructing the original neural network model, the terminal device obtains a preset sample data set.
  • Each sample data in the sample data set is composed of the feature vectors of all the speech fragments obtained by framing a speech sequence and the text sequence corresponding to the speech sequence.
  • the terminal device may divide the sample data set into a training set and a test set based on the preset allocation ratio.
  • the training set is used to train the original neural network model
  • the test set is used to verify the accuracy of the trained original neural network model.
  • The preset allocation ratio can be set according to actual needs and is not limited here.
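  • As an illustration (the patent does not mandate any library or ratio), the split could be done with scikit-learn's train_test_split, here with an assumed 8:2 allocation ratio:

```python
from sklearn.model_selection import train_test_split

# samples: list of (feature_vectors, text_sequence) pairs built as in S01;
# the entries below are placeholders for illustration.
samples = [(f"features_{i}", f"text_{i}") for i in range(1000)]

# An 8:2 allocation ratio between training set and test set.
train_set, test_set = train_test_split(samples, test_size=0.2, random_state=0)
print(len(train_set), len(test_set))  # 800 200
```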
  • S02: Train the pre-built original neural network model based on the training set, and determine the values of the preset parameters included in the probability calculation layer and the joint time-series classification layer of the original neural network model.
  • Specifically, when the terminal device trains the pre-built original neural network model based on the training set, the feature vectors of all speech segments obtained by framing the speech sequence contained in each sample are used as the input of the original neural network model, the text sequence corresponding to that speech sequence is used as the output of the original neural network model, and the probabilities of the feature vectors of all speech segments appearing in the sample data relative to each preset phoneme are learned in the probability calculation layer, thereby completing the training of the original neural network model.
  • S03: After completing the training of the original neural network model based on the training set, verify the trained original neural network model based on the test set.
  • Specifically, the terminal device uses the feature vectors of all speech segments obtained by framing the speech sequence contained in each sample of the test set as the input of the trained original neural network model, and the trained original neural network model determines the predicted value of the text sequence corresponding to the speech sequence in each sample of the test set.
  • The terminal device then calculates the prediction error of the trained original neural network model based on the actual text sequence corresponding to the speech sequence in each sample of the test set and the predicted value of that text sequence.
  • The prediction error of the trained original neural network model identifies the speech recognition accuracy of the trained original neural network model.
  • Specifically, the terminal device compares the prediction error of the trained original neural network model with a preset error threshold, and determines the verification result of the trained original neural network model based on the comparison result.
  • The preset error threshold is the allowable error in speech recognition accuracy in practical applications.
  • If the comparison result is that the prediction error of the trained original neural network model is not greater than the preset error threshold, the terminal device determines the verification result as passed; if the prediction error is greater than the preset error threshold, the speech recognition accuracy of the trained original neural network model exceeds the allowable error range, and the terminal device determines the verification result as failed.
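  • Schematically, the verification amounts to comparing an aggregate prediction error on the test set against the preset error threshold. The sketch below uses a simple sequence-mismatch rate as the error; both the metric and the 0.05 threshold are assumptions, not taken from the patent:

```python
def verify(model_predict, test_set, error_threshold=0.05):
    """Pass iff the fraction of mispredicted text sequences on the test
    set does not exceed the preset error threshold."""
    errors = sum(1 for feats, text in test_set if model_predict(feats) != text)
    prediction_error = errors / len(test_set)
    return prediction_error <= error_threshold

# Usage: verified = verify(trained_model.predict, test_set)
```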
  • S04: If the verification is passed, the terminal device determines the trained original neural network model as the preset neural network model.
  • As can be seen, this embodiment trains the pre-built original neural network model on a training set containing a certain number of samples, verifies the recognition accuracy of the trained original neural network model on a test set containing a certain number of samples, and only after the verification is passed uses the trained original neural network model as the preset neural network model for subsequent speech recognition, thereby improving the accuracy of speech recognition.
  • FIG. 5 is a structural block diagram of a terminal device according to an embodiment of the present application.
  • Each unit included in the terminal device is used to execute each step in the embodiments corresponding to FIGS. 1 to 4.
  • Referring to FIG. 5, the terminal device 500 includes: a first segmentation unit 51, a feature extraction unit 52, a first determination unit 53, and a second determination unit 54, where:
  • the first division unit 51 is used to obtain a speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments.
  • the feature extraction unit 52 is configured to perform acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment.
  • The first determining unit 53 is configured to determine, at the probability calculation layer of the preset neural network model, the first probability vector of the speech segment based on the feature vector of the speech segment; the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element.
  • the second determining unit 54 is used to determine the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments in the joint time-series classification layer of the preset neural network model.
  • the first dividing unit 51 is specifically used for:
  • Frame the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.
  • Further, the first determination unit 53 includes a first probability determination unit and a second probability determination unit, where:
  • the first probability determination unit is configured to determine, in the probability calculation layer, the probability of the feature vector of each of the at least two frames of speech segments relative to each preset phoneme, based on the pre-learned probabilities of the feature vectors of the preset speech segments relative to each preset phoneme;
  • the second probability determination unit is configured to determine the first probability vector of the speech segment based on the probabilities of the feature vector of the speech segment relative to each preset phoneme.
  • Further, the second determination unit 54 includes: a first calculation unit, a third probability determination unit, and a text sequence determination unit, where:
  • the first calculation unit is configured to calculate, at the joint time-series classification layer of the preset neural network model, the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and the preset probability calculation formula;
  • the value of each element in the pronunciation phoneme probability vector is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to the element.
  • The preset probability calculation formula is as follows:

    P_i = y_i1 × y_i2 × … × y_iT = ∏_{t=1}^{T} y_it,  i ∈ [1, N^T]

  • where T represents the total number of speech segments obtained by framing the speech sequence; N represents the total number of preset phonemes; N^T represents the total number of preset pronunciation phoneme sequences of length T formed by combining the N preset phonemes; P_i represents the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the i-th preset pronunciation phoneme sequence; and y_it represents the a priori probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, t ∈ [1, T], whose value is determined according to the first probability vector of the t-th speech segment;
  • the third probability determination unit is configured to determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector is used to identify the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, where the preset text sequences are obtained by compressing the preset pronunciation phoneme sequences;
  • the text sequence determining unit is configured to determine the preset text sequence corresponding to the element with the largest value in the probability vector of the text sequence as the text sequence corresponding to the speech sequence.
  • Further, the terminal device 500 may also include: a first acquisition unit, a model training unit, a model verification unit, and a model generation unit, where:
  • the first acquisition unit is configured to obtain a preset sample data set and divide it into a training set and a test set; each sample in the sample data set consists of the feature vectors of all speech segments obtained by framing a speech sequence and the text sequence corresponding to that speech sequence;
  • the model training unit is configured to train the pre-built original neural network model based on the training set, and determine the values of the preset parameters included in the probability calculation layer and the joint time-series classification layer of the original neural network model;
  • the model verification unit is used to verify the original neural network model that has completed training based on the test set.
  • the model generating unit is configured to determine the original neural network model that has completed training as the preset neural network model if the verification is passed.
  • As shown in FIG. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61, and computer-readable instructions 62 stored in the memory 61 and executable on the processor 60, for example, a program implementing the neural network-based speech recognition method.
  • When the processor 60 executes the computer-readable instructions 62, the steps in the above embodiments of the neural network-based speech recognition method are implemented, for example, S11 to S14 shown in FIG. 1.
  • Exemplarily, the computer-readable instructions 62 may be divided into one or more units, and the one or more units are stored in the memory 61 and executed by the processor 60 to complete the present application.
  • The one or more units may be a series of instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6.
  • the computer-readable instructions 62 may be divided into a first segmentation unit, a feature extraction unit, a first determination unit, and a second determination unit, and the specific functions of each unit are as described above.
  • The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6.
  • The memory 61 may also be an external storage device of the terminal device 6, for example, a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 6.
  • the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is used to store the computer-readable instructions and other programs and data required by the terminal device.
  • the memory 61 can also be used to temporarily store data that has been or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application is applicable to the technical field of artificial intelligence, and provides a neural network-based speech recognition method, a terminal device, and a medium. The neural network-based speech recognition method comprises: obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments; performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment; determining a first probability vector of the speech segment on a probability calculation layer of a preset neural network model on the basis of the feature vector of the speech segment, the value of each element in the first probability vector being used for identifying a probability that the pronunciation of the speech segment is a preset phoneme corresponding to the element; and determining a text sequence corresponding to the speech sequence on a joint timing classification layer of the preset neural network model on the basis of the first probability vectors of all the speech segments. Therefore, time costs and labor costs of speech recognition are saved.

Description

Neural network-based speech recognition method, terminal device, and medium

This application claims priority to the Chinese patent application No. 201811182186.1, filed on October 11, 2018 and titled "Neural network-based speech recognition method, terminal device, and medium", the entire content of which is incorporated herein by reference.
Technical field

The present application belongs to the field of artificial intelligence technology, and particularly relates to a neural network-based speech recognition method, a terminal device, and a non-volatile readable storage medium.

Background

Speech recognition is the process of converting a speech sequence into a text sequence. With the rapid development of artificial intelligence technology, speech recognition models based on machine learning are widely used in various speech recognition scenarios.

However, when training a traditional machine-learning-based speech recognition model, the pronunciation phoneme corresponding to each frame of speech data in the speech sequence must be known in advance for the model to be trained effectively, which requires that the speech sequence and the text sequence be frame-aligned before training. The sample data used to train the model is large, and performing frame alignment on the speech sequence and the text sequence contained in every sample consumes considerable manpower and time, so the labor cost and time cost are high.
Technical problem

Embodiments of the present application provide a neural network-based speech recognition method, a terminal device, and a non-volatile readable storage medium, to solve the problem that existing speech recognition methods based on traditional speech recognition models incur high labor and time costs.

Technical solution

A first aspect of the embodiments of the present application provides a neural network-based speech recognition method, including:

obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;

performing acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

determining, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and

determining, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
A second aspect of the embodiments of the present application provides a terminal device, including:

a first dividing unit, configured to obtain a speech sequence to be recognized and divide the speech sequence into at least two frames of speech segments;

a feature extraction unit, configured to perform acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

a first determining unit, configured to determine, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and

a second determining unit, configured to determine, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the following steps are implemented:

obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;

performing acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

determining, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and

determining, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.

A fourth aspect of the embodiments of the present application provides a non-volatile readable storage medium storing computer-readable instructions; when the computer-readable instructions are executed by a processor, the following steps are implemented:

obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;

performing acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

determining, at the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element; and

determining, at the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
Beneficial effects

In the neural network-based speech recognition method provided by the embodiments of the present application, the speech sequence to be recognized is divided into at least two frames of speech segments and the feature vector of each frame is extracted; the probability calculation layer of the preset neural network model determines the first probability vector of each speech segment based on its feature vector; and the joint time-series classification layer of the preset neural network model determines the text sequence corresponding to the speech sequence to be recognized based on the first probability vectors of all the speech segments. Because the joint time-series classification layer can directly determine the text sequence corresponding to the speech sequence to be recognized from the first probability vectors of all its speech segments, training the preset neural network model of this embodiment does not require frame-aligning the speech sequences and text sequences in the sample data used for model training, thereby saving the time cost and labor cost of speech recognition.
Brief description of the drawings

FIG. 1 is an implementation flowchart of a neural network-based speech recognition method provided by the first embodiment of the present application;

FIG. 2 is a specific implementation flowchart of S13 in a neural network-based speech recognition method provided by the second embodiment of the present application;

FIG. 3 is a specific implementation flowchart of S14 in a neural network-based speech recognition method provided by the third embodiment of the present application;

FIG. 4 is an implementation flowchart of a neural network-based speech recognition method provided by the fourth embodiment of the present application;

FIG. 5 is a structural block diagram of a terminal device provided by an embodiment of the present application;

FIG. 6 is a structural block diagram of a terminal device provided by another embodiment of the present application.
Embodiments of the invention

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.

Referring to FIG. 1, FIG. 1 is an implementation flowchart of a neural network-based speech recognition method provided by the first embodiment of the present application. In this embodiment, the execution subject of the neural network-based speech recognition method is a terminal device. Terminal devices include, but are not limited to, smartphones, tablet computers, and desktop computers.

The neural network-based speech recognition method shown in FIG. 1 includes the following steps:

S11: Acquire a speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments.

The speech sequence refers to a piece of speech data whose duration is greater than a preset duration threshold, where the preset duration threshold is greater than zero. The speech sequence to be recognized is a speech sequence that needs to be translated into a text sequence.

When a speech sequence needs to be translated into a corresponding text sequence, a speech recognition request for the speech sequence can be triggered on the terminal device, the request carrying the speech sequence to be recognized. The speech recognition request is used to request the terminal device to translate the speech sequence to be recognized into a corresponding text sequence. For example, when a user chats with a contact through an instant messaging application installed on the terminal device and receives a speech sequence sent by the other party, the user can have the speech sequence translated into a corresponding text sequence for viewing when needed. Specifically, the user can long-press or right-click the display icon corresponding to the speech sequence to make the terminal device display a menu bar for the speech sequence, and trigger the speech recognition request for the speech sequence through the "Translate to text" option in that menu bar.

When the terminal device detects a speech recognition request for a certain speech sequence, it extracts the speech sequence to be recognized from the request and divides the extracted speech sequence into at least two frames of speech segments.
As an embodiment of the present application, the terminal device may divide the speech sequence to be recognized into at least two frames of speech segments in the following manner; that is, S11 may specifically include the following step:

Perform a framing operation on the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, where the duration of each frame of speech segment is the preset frame length.

In this embodiment, the preset frame length identifies the duration of each frame of speech segment obtained by framing the speech sequence, and the preset frame shift identifies the time step used in the framing operation.

After obtaining the speech sequence to be recognized, the terminal device, starting from the starting time point of the speech sequence, intercepts a speech segment of the preset frame length at every preset frame shift, thereby dividing the speech sequence into at least two frames of speech segments. Each frame of speech segment obtained by this framing operation lasts for the preset frame length, and the starting time points of every two adjacent frames of speech segments are spaced apart by the preset frame shift.

It should be noted that, in the embodiments of the present application, the preset frame shift is smaller than the preset frame length; that is, every two adjacent frames of speech segments overlap in time, and the duration of the overlap is the difference between the preset frame length and the preset frame shift. In practical applications, both the preset frame length and the preset frame shift can be set according to actual needs. For example, if the preset frame length is set to 25 milliseconds and the preset frame shift is set to 10 milliseconds, every two adjacent frames of speech segments obtained by the framing operation overlap by 25 - 10 = 15 milliseconds.
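As a minimal illustration of this framing operation, the following Python sketch splits a sampled waveform into overlapping frames; the function name, the use of NumPy, and the 16 kHz sampling rate are assumptions of this example, not part of the application:

```python
import numpy as np

def frame_speech(samples, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D waveform into overlapping frames: each frame lasts
    frame_len_ms, and consecutive frames start frame_shift_ms apart."""
    frame_len = int(sample_rate * frame_len_ms / 1000)      # 400 samples at 16 kHz
    frame_shift = int(sample_rate * frame_shift_ms / 1000)  # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(samples) - frame_len) // frame_shift)
    return np.stack([samples[i * frame_shift:i * frame_shift + frame_len]
                     for i in range(num_frames)])           # (num_frames, frame_len)
```

With these defaults, adjacent frames share 240 samples, which is the 15 ms overlap described above.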
S12: Perform acoustic feature extraction on the speech segments to obtain a feature vector of each speech segment.

In the embodiments of the present application, since each frame of speech segment obtained by framing the speech sequence to be recognized has almost no descriptive power for that segment in the time domain, after dividing the speech sequence into at least two frames of speech segments, the terminal device performs acoustic feature extraction on each frame of speech segment based on a preset feature extraction network to obtain the feature vector of each frame. The feature vector of a speech segment contains the acoustic feature information of that segment.

The preset feature extraction network can be set according to actual needs and is not limited here. For example, as an embodiment of the present application, the preset feature extraction network may be a Mel-frequency cepstral coefficients (MFCC) feature extraction network. Since MFCC feature extraction is an existing technique, its principle is not described in detail here.
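For illustration only, per-frame MFCC features can be computed with an off-the-shelf library; the sketch below uses librosa, which is an assumption of this example (the application only requires some preset feature extraction network), and the file name is hypothetical:

```python
import librosa

# Load a waveform at 16 kHz; the file name and sampling rate are assumptions.
samples, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame, with a 25 ms window and a 10 ms hop to match the
# framing parameters of the running example.
mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
feature_vectors = mfcc.T  # shape (num_frames, 13): one feature vector per frame
```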
S13: In a probability calculation layer of a preset neural network model, determine a first probability vector of each speech segment based on the feature vector of that segment; the value of each element in the first probability vector identifies the probability that the speech segment is pronounced as the preset phoneme corresponding to that element.

In the embodiments of the present application, after determining the feature vectors of the frames of speech segments obtained by framing the speech sequence to be recognized, the terminal device imports the feature vectors of all these speech segments into the preset neural network model. The preset neural network model is obtained by training a pre-built original neural network model with a machine learning algorithm on a preset amount of sample data. Each piece of sample data consists of the feature vectors of all speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence.

The original neural network model includes a probability calculation layer and a joint temporal classification layer connected in sequence, where:

The probability calculation layer is configured to determine the first probability vector of a speech segment based on the feature vector of that segment. The value of each element in the first probability vector identifies the probability that the speech segment is pronounced as the preset phoneme corresponding to that element, and the number of elements in the first probability vector equals the number of preset phonemes. A phoneme is the smallest phonetic unit obtained by dividing speech according to its natural attributes; phonemes generally fall into two broad categories, vowel phonemes and consonant phonemes. The preset phonemes can be set according to actual needs and are not limited here. In the embodiments of the present application, the preset phonemes include at least one blank phoneme; for example, the preset phonemes may include one blank phoneme plus all the vowel and consonant phonemes of Chinese pinyin. The joint temporal classification layer is configured to determine, based on the first probability vectors of all speech segments obtained by framing the speech sequence to be recognized, the text sequence corresponding to that speech sequence.

When training the original neural network model, the feature vectors of all speech segments obtained by framing the speech sequence in each piece of sample data are used as the input of the original neural network model, and the text sequence corresponding to that speech sequence is used as its output; the original neural network model that has completed training is determined as the preset neural network model in the embodiments of the present application. It should be noted that during training, the terminal device learns, in the probability calculation layer, the probabilities of the feature vectors of all speech segments appearing in the sample data relative to each preset phoneme.
As an embodiment of the present application, after importing the feature vectors of all speech segments obtained by framing the speech sequence to be recognized into the preset neural network model, the terminal device may determine the first probability vector of each frame of speech segment through S131 to S132 shown in FIG. 2:

S131: In the probability calculation layer, based on the pre-learned probabilities of the feature vectors of the preset speech segments relative to each preset phoneme, respectively determine the probabilities of the feature vectors of the at least two frames of speech segments relative to each preset phoneme.

S132: Determine the first probability vector of each speech segment based on the probabilities of the feature vector of that segment relative to each preset phoneme.

In this embodiment, the preset speech segments include all speech segments that have appeared in the sample data, and the pre-learned probabilities of the feature vectors of the preset speech segments relative to each preset phoneme are the probabilities, learned in advance, of the feature vectors of the speech segments appearing in the sample data relative to each preset phoneme.

After importing the feature vectors of all speech segments obtained by framing the speech sequence to be recognized into the preset neural network model, the terminal device determines, in the probability calculation layer and based on the pre-learned probabilities of all possible speech segment feature vectors relative to each preset phoneme, the probability of each frame's feature vector relative to each preset phoneme; the probabilities of one speech segment's feature vector relative to all the preset phonemes constitute the first probability vector of that segment.
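In a concrete realization, the probability calculation layer would typically end in a softmax over the preset phonemes, so that each frame's first probability vector sums to one. The sketch below is an assumed stand-in: a single linear layer with random weights in place of whatever learned mapping the probability calculation layer actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

PHONEMES = ["a", "i", "o", "-"]               # example preset phonemes; "-" is the blank
W = rng.standard_normal((13, len(PHONEMES)))  # stand-in for learned parameters
b = np.zeros(len(PHONEMES))

def first_probability_vectors(feature_vectors):
    """Map each frame's 13-dimensional feature vector to a probability
    vector over the preset phonemes (one softmax per frame)."""
    logits = feature_vectors @ W + b              # shape (T, N)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # each row sums to 1
```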
S14: In the joint temporal classification layer of the preset neural network model, determine the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.

In the embodiments of the present application, after determining the first probability vector of each frame of speech segment obtained by framing the speech sequence to be recognized, the terminal device determines, in the joint temporal classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.

In practical applications, suppose the total number of preset phonemes is $N$ and framing the speech sequence to be recognized yields $T$ frames of speech segments. Since the pronunciation phoneme of each frame may be any one of the $N$ preset phonemes, there are $N^T$ possible pronunciation phoneme sequences for the speech sequence to be recognized. In this embodiment, these $N^T$ pronunciation phoneme sequences are determined as the preset pronunciation phoneme sequences, each of which is a sequence of length $T$ composed of at least one of the preset phonemes.

Specifically, as an embodiment of the present application, S14 may be implemented through S141 to S143 shown in FIG. 3, detailed as follows:

S141: In the joint temporal classification layer of the preset neural network model, calculate the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula; the value of each element in the pronunciation phoneme probability vector identifies the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element. The preset probability calculation formula is as follows:
$$p_i = \prod_{t=1}^{T} y_{it}, \qquad i \in [1, N^T]$$

where $p_i$ denotes the value of the i-th element in the pronunciation phoneme probability vector; $T$ denotes the total number of speech segments obtained by framing the speech sequence; $N$ denotes the total number of preset phonemes; $N^T$ denotes the total number of preset pronunciation phoneme sequences of length $T$ formed by combining the $N$ preset phonemes; and $y_{it}$ denotes the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$, the value of which is determined according to the first probability vector of the t-th frame of speech segment.
In this embodiment, after determining the first probability vector of each frame of speech segment obtained by framing the speech sequence to be recognized, the terminal device calculates, in the joint temporal classification layer of the preset neural network model, the pronunciation phoneme probability vector of the speech sequence to be recognized based on the first probability vectors of all the speech segments and the above preset probability calculation formula. The value of each element in the pronunciation phoneme probability vector identifies the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element.

For example, if the preset phonemes are the following four: a, i, o, and the blank phoneme "-", and framing the speech sequence to be recognized yields 5 frames of speech segments, i.e., T = 5, then, since the pronunciation phoneme of each frame may be any one of the preset phonemes, there are 4^5 = 1024 possible pronunciation phoneme sequences for the speech sequence to be recognized. Among these 1024 preset pronunciation phoneme sequences, suppose the first one is [a, a, i, -, -]. The prior probability of its 1st pronunciation phoneme is the probability, determined at the probability calculation layer, of the feature vector of the first frame of speech segment relative to the preset phoneme a; the prior probability of its 2nd pronunciation phoneme is the probability of the feature vector of the second frame relative to the preset phoneme a; the prior probability of its 3rd pronunciation phoneme is the probability of the feature vector of the third frame relative to the preset phoneme i; and the prior probabilities of its 4th and 5th pronunciation phonemes are the probabilities of the feature vectors of the fourth and fifth frames relative to the blank phoneme "-". The terminal device multiplies the prior probabilities of all the elements of the first preset pronunciation phoneme sequence to obtain the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is the first preset pronunciation phoneme sequence, and so on for the other sequences. The probabilities that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is each preset pronunciation phoneme sequence constitute the pronunciation phoneme probability vector of the speech sequence.
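This example can be reproduced with a brute-force sketch that enumerates all $N^T$ pronunciation phoneme sequences and multiplies the per-frame priors; the probability values are illustrative assumptions, and enumeration is feasible only at toy sizes such as N = 4, T = 5:

```python
import itertools
import numpy as np

PHONEMES = ["a", "i", "o", "-"]  # "-" is the blank phoneme
T = 5

# First probability vectors of the 5 frames, one row per frame (illustrative
# numbers; in the application these come from the probability calculation layer).
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.2, 0.1, 0.6],
                  [0.1, 0.1, 0.1, 0.7]])

# p_i = prod_t y_it for every length-T pronunciation phoneme sequence.
path_probs = {}
for path in itertools.product(range(len(PHONEMES)), repeat=T):  # 4**5 = 1024 paths
    path_probs[path] = float(np.prod([probs[t, k] for t, k in enumerate(path)]))

seq = tuple(PHONEMES.index(p) for p in ["a", "a", "i", "-", "-"])
print(path_probs[seq])  # 0.7 * 0.6 * 0.7 * 0.6 * 0.7 ≈ 0.1235
```

In practice the exponential enumeration is replaced by a dynamic-programming recursion over the frames; the brute force above only serves to make the product formula concrete.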
S142: Determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, where the preset text sequences are obtained by compressing the preset pronunciation phoneme sequences.

In practical applications, a preset pronunciation phoneme sequence usually contains some blank phonemes, or some of its adjacent elements correspond to the same phoneme. Therefore, after determining the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is each preset pronunciation phoneme sequence, the terminal device compresses each preset pronunciation phoneme sequence to obtain the text sequence corresponding to it, and thereby converts the probability that the pronunciation phoneme sequence corresponding to the speech sequence is each preset pronunciation phoneme sequence into the probability that the text sequence corresponding to the speech sequence is the text sequence corresponding to that preset pronunciation phoneme sequence.

In this embodiment, the terminal device may compress a preset pronunciation phoneme sequence as follows: remove the blank phonemes from the sequence, and keep only one of every run of consecutive elements with the same value. For example, if a preset pronunciation phoneme sequence is [a, a, i, -, -], the text sequence obtained by compressing it is [a, i].
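A minimal sketch of this compression, assuming "-" denotes the blank phoneme; where the description is ambiguous about ordering, the sketch follows the standard convention of merging consecutive repeats before dropping blanks (so [a, -, a] would compress to [a, a], while [a, a] compresses to [a]):

```python
def compress(phoneme_seq, blank="-"):
    """Collapse a pronunciation phoneme sequence into its text sequence:
    merge consecutive repeats, then drop blank phonemes."""
    out, prev = [], None
    for p in phoneme_seq:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

assert compress(["a", "a", "i", "-", "-"]) == ["a", "i"]
assert compress(["a", "-", "i", "i", "-"]) == ["a", "i"]
```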
In practical applications, the text sequences obtained by compressing different pronunciation phoneme sequences may be identical; for example, compressing the pronunciation phoneme sequence [a, a, i, -, -] yields the text sequence [a, i], and compressing the pronunciation phoneme sequence [a, -, i, i, -] also yields [a, i]. Therefore, in the embodiments of the present application, when at least two preset pronunciation phoneme sequences correspond to the same text sequence, the terminal device sums the probabilities that the text sequence corresponding to the speech sequence to be recognized is the text sequence corresponding to those preset pronunciation phoneme sequences, thereby obtaining the probability that the text sequence corresponding to the speech sequence is each preset text sequence. The preset text sequences consist of all the distinct text sequences obtained by compressing the preset pronunciation phoneme sequences. The probabilities that the text sequence corresponding to the speech sequence to be recognized is each preset text sequence constitute the text sequence probability vector of the speech sequence.

S143: Determine the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.

The larger the value of an element in the text sequence probability vector, the higher the probability that the text sequence corresponding to the speech sequence to be recognized is the preset text sequence corresponding to that element. Therefore, in this embodiment, after determining the text sequence probability vector of the speech sequence to be recognized, the terminal device determines the preset text sequence corresponding to the element with the largest value in that vector as the text sequence corresponding to the speech sequence to be recognized.
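Continuing the toy example, S142 and S143 amount to summing path probabilities over each compressed text sequence and taking the argmax. The sketch below reuses PHONEMES, path_probs, and compress from the earlier sketches, so it is not self-contained on its own:

```python
from collections import defaultdict

# S142: sum the probabilities of all pronunciation phoneme sequences that
# compress to the same text sequence.
text_probs = defaultdict(float)
for path, p in path_probs.items():
    text_probs[tuple(compress([PHONEMES[k] for k in path]))] += p

# S143: the preset text sequence with the largest probability is the result.
best_text = max(text_probs, key=text_probs.get)
print(best_text, text_probs[best_text])
```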
As can be seen from the above, the speech sequence to be recognized is divided into at least two frames of speech segments and the feature vector of each frame is extracted; the probability calculation layer of the preset neural network model determines the first probability vector of each speech segment based on its feature vector; and the joint temporal classification layer of the preset neural network model determines the text sequence corresponding to the speech sequence to be recognized based on the first probability vectors of all the speech segments. Since the joint temporal classification layer can directly determine the text sequence corresponding to the speech sequence to be recognized from the first probability vectors of all its speech segments, training the preset neural network model of this embodiment does not require frame alignment between the speech sequences and the text sequences in the sample data used for model training, which saves the time cost and labor cost of speech recognition.

Please refer to FIG. 4, which is a flowchart of an implementation of a neural-network-based speech recognition method provided by a fourth embodiment of the present application. Compared with the embodiment corresponding to FIG. 1, the neural-network-based speech recognition method provided by this embodiment may further include S01 to S04 before S11, detailed as follows:
S01: Obtain a preset sample data set, and divide the sample data set into a training set and a test set; each piece of sample data in the sample data set consists of the feature vectors of all speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence.

Before the speech sequence to be recognized is translated into its corresponding text sequence, the original neural network model needs to be built. The original neural network includes a probability calculation layer and a joint temporal classification layer connected in sequence. For the specific structure and principles of the probability calculation layer and the joint temporal classification layer, please refer to the relevant description in S13 of the first embodiment, which is not repeated here.

After the original neural network model is built, the terminal device obtains a preset sample data set. Each piece of sample data in the sample data set consists of the feature vectors of all speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence.

After obtaining the preset sample data set, the terminal device may divide the sample data set into a training set and a test set based on a preset allocation ratio. The training set is used to train the original neural network model, and the test set is used to verify the accuracy of the trained original neural network model. The preset allocation ratio can be set according to actual needs and is not limited here; for example, it may be training set : test set = 3 : 1, that is, 3/4 of the sample data is used to train the original neural network model and 1/4 is used to verify the accuracy of the trained original neural network model.
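A minimal sketch of the 3 : 1 split; the shuffling, the seed, and the function name are assumptions for illustration:

```python
import random

def split_dataset(samples, train_ratio=0.75, seed=42):
    """Shuffle the sample data and split it at the preset allocation
    ratio (3:1 by default) into a training set and a test set."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```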
S02: Train the pre-built original neural network model based on the training set, and determine the values of the preset parameters contained in the feature extraction layer and the joint temporal classification layer of the original neural network model.

In this embodiment, the terminal device trains the pre-built original neural network model based on the training set. During training, the feature vectors of all speech segments obtained by framing the speech sequence in each piece of sample data are used as the input of the original neural network model, the text sequence corresponding to that speech sequence is used as the output of the original neural network model, and the probabilities of the feature vectors of all speech segments appearing in the sample data relative to each preset phoneme are learned in the probability calculation layer, thereby completing the training of the original neural network model.
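One way such training could be realized is sketched below in PyTorch, assuming the joint temporal classification layer is trained with the standard connectionist temporal classification criterion (nn.CTCLoss) and using a single linear layer as a stand-in for the probability calculation layer; the framework, architecture, and hyperparameters are all assumptions of this example:

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 4   # assumed indexing: blank = 0, a = 1, i = 2, o = 3
FEATURE_DIM = 13

model = nn.Linear(FEATURE_DIM, NUM_PHONEMES)  # stand-in probability calculation layer
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, target, target_len):
    """features: (T, 1, FEATURE_DIM) frame feature vectors of one speech sequence;
    target: (1, S) phoneme indices of its text sequence, with no frame alignment."""
    log_probs = model(features).log_softmax(dim=-1)     # (T, 1, NUM_PHONEMES)
    loss = ctc_loss(log_probs, target,
                    torch.tensor([features.shape[0]]),  # input length T
                    torch.tensor([target_len]))         # text sequence length S
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For instance, for a speech sequence whose text sequence is [a, i], one would pass target = torch.tensor([[1, 2]]) with target_len = 2 under the assumed indexing above; no frame-level alignment between the speech sequence and the text sequence is required.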
S03: Verify the trained original neural network model based on the test set.

In this embodiment, after completing the training of the original neural network model based on the training set, the terminal device verifies the trained original neural network model based on the test set.

Specifically, when verifying the trained original neural network model based on the test set, the terminal device uses the feature vectors of all speech segments obtained by framing the speech sequence in each piece of sample data as the input of the model, and determines, through the trained original neural network model, the predicted value of the text sequence corresponding to the speech sequence in each piece of sample data in the test set.

The terminal device calculates the prediction error of the trained original neural network model based on the text sequence corresponding to the speech sequence in each piece of sample data in the test set and the predicted value of that text sequence. The prediction error of the trained original neural network model identifies its speech recognition accuracy: the larger the prediction error, the lower the speech recognition accuracy.

In this embodiment, after obtaining the prediction error of the trained original neural network model, the terminal device compares the prediction error with a preset error threshold, and determines the verification result of the trained original neural network model based on the comparison result. The preset error threshold is the speech recognition accuracy error allowed in practical applications.

If the comparison result is that the prediction error of the trained original neural network model is less than or equal to the preset error threshold, the speech recognition accuracy of the trained original neural network model is within the allowable error range, and the terminal device determines the verification result of the trained original neural network model as passed; if the comparison result is that the prediction error is greater than the preset error threshold, the speech recognition accuracy of the trained original neural network model exceeds the allowable error range, and the terminal device determines the verification result as failed.
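The pass/fail decision itself is a simple threshold comparison; a sketch, assuming the prediction error has already been computed as a single number over the test set:

```python
def verification_result(prediction_error, error_threshold):
    """Verification passes when the trained model's prediction error does
    not exceed the preset error threshold."""
    return prediction_error <= error_threshold
```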
S04: If the verification is passed, determine the trained original neural network model as the preset neural network model.

In this embodiment, if the terminal device detects that the trained original neural network model has passed verification, it determines the trained original neural network model as the preset neural network model.

As can be seen from the above, this embodiment trains the pre-built original neural network model with a training set containing a certain amount of sample data, and verifies the speech recognition accuracy of the trained original neural network model with a test set containing a certain amount of sample data; only after the verification is passed is the trained original neural network model used as the preset neural network model for subsequent speech recognition, which improves the accuracy of speech recognition.
Please refer to FIG. 5, which is a structural block diagram of a terminal device provided by an embodiment of the present application. The units included in the terminal device are configured to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 4; for details, please refer to the relevant descriptions in the embodiments corresponding to FIG. 1 to FIG. 4. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 5, the terminal device 500 includes a first segmentation unit 51, a feature extraction unit 52, a first determination unit 53, and a second determination unit 54, where:

The first segmentation unit 51 is configured to acquire a speech sequence to be recognized and divide the speech sequence into at least two frames of speech segments.

The feature extraction unit 52 is configured to perform acoustic feature extraction on the speech segments to obtain the feature vector of each speech segment.

The first determination unit 53 is configured to determine, in the probability calculation layer of a preset neural network model, the first probability vector of each speech segment based on the feature vector of that segment; the value of each element in the first probability vector identifies the probability that the speech segment is pronounced as the preset phoneme corresponding to that element.

The second determination unit 54 is configured to determine, in the joint temporal classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
As an embodiment of the present application, the first segmentation unit 51 is specifically configured to:

perform a framing operation on the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, where the duration of each frame of speech segment is the preset frame length.

As an embodiment of the present application, the first determination unit 53 includes a first probability determination unit and a second probability determination unit, where:

The first probability determination unit is configured to determine, in the probability calculation layer and based on the pre-learned probabilities of the feature vectors of the preset speech segments relative to each preset phoneme, the probabilities of the feature vectors of the at least two frames of speech segments relative to each preset phoneme.

The second probability determination unit is configured to determine the first probability vector of each speech segment based on the probabilities of the feature vector of that segment relative to each preset phoneme.

As an embodiment of the present application, the second determination unit 54 includes a first calculation unit, a third probability determination unit, and a text sequence determination unit, where:

The first calculation unit is configured to calculate, in the joint temporal classification layer of the preset neural network model, the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula; the value of each element in the pronunciation phoneme probability vector identifies the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element. The preset probability calculation formula is as follows:
$$p_i = \prod_{t=1}^{T} y_{it}, \qquad i \in [1, N^T]$$

where $p_i$ denotes the value of the i-th element in the pronunciation phoneme probability vector; $T$ denotes the total number of speech segments obtained by framing the speech sequence; $N$ denotes the total number of preset phonemes; $N^T$ denotes the total number of preset pronunciation phoneme sequences of length $T$ formed by combining the $N$ preset phonemes; and $y_{it}$ denotes the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$, the value of which is determined according to the first probability vector of the t-th frame of speech segment.
The third probability determination unit is configured to determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, and the preset text sequences are obtained by compressing the preset pronunciation phoneme sequences.

The text sequence determination unit is configured to determine the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.

As an embodiment of the present application, the terminal device 500 may further include a first acquisition unit, a model training unit, a model verification unit, and a model generation unit, where:

The first acquisition unit is configured to obtain a preset sample data set and divide the sample data set into a training set and a test set; each piece of sample data in the sample data set consists of the feature vectors of all speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence.

The model training unit is configured to train the pre-built original neural network model based on the training set, and determine the values of the preset parameters contained in the feature extraction layer and the joint temporal classification layer of the original neural network model.

The model verification unit is configured to verify the trained original neural network model based on the test set.

The model generation unit is configured to determine, if the verification is passed, the trained original neural network model as the preset neural network model.
FIG. 6 is a structural block diagram of a terminal device provided by another embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes a processor 60, a memory 61, and computer-readable instructions 62 stored in the memory 61 and executable on the processor 60, for example, a program of a neural-network-based speech recognition method. When executing the computer-readable instructions 62, the processor 60 implements the steps in the embodiments of the neural-network-based speech recognition methods described above, for example, S11 to S14 shown in FIG. 1. Alternatively, when executing the computer-readable instructions 62, the processor 60 implements the functions of the units in the embodiment corresponding to FIG. 5 described above, for example, the functions of units 51 to 54 shown in FIG. 5; for details, please refer to the relevant descriptions in the embodiment corresponding to FIG. 5, which are not repeated here.

Exemplarily, the computer-readable instructions 62 may be divided into one or more units, and the one or more units are stored in the memory 61 and executed by the processor 60 to complete the present application. The one or more units may be a series of computer-readable instruction segments capable of performing specific functions, and these instruction segments are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6. For example, the computer-readable instructions 62 may be divided into a first segmentation unit, a feature extraction unit, a first determination unit, and a second determination unit, with the specific functions of each unit as described above.

The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6, for example, a hard disk or memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal device 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is configured to store the computer-readable instructions and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.

The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein; and such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

1. A neural-network-based speech recognition method, characterized by comprising:
    acquiring a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;
    performing acoustic feature extraction on the speech segments to obtain a feature vector of each speech segment;
    determining, in a probability calculation layer of a preset neural network model, a first probability vector of each speech segment based on the feature vector of the speech segment, wherein the value of each element in the first probability vector identifies a probability that the speech segment is pronounced as a preset phoneme corresponding to the element; and
    determining, in a joint temporal classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
2. The neural-network-based speech recognition method according to claim 1, wherein the acquiring a speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments comprises:
    performing a framing operation on the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, wherein the duration of each frame of speech segment is the preset frame length.
3. The neural-network-based speech recognition method according to claim 1, wherein the determining, in a probability calculation layer of a preset neural network model, a first probability vector of each speech segment based on the feature vector of the speech segment comprises:
    determining, in the probability calculation layer, based on pre-learned probabilities of feature vectors of preset speech segments relative to each preset phoneme, probabilities of the feature vectors of the at least two frames of speech segments relative to each preset phoneme; and
    determining the first probability vector of each speech segment based on the probabilities of the feature vector of the speech segment relative to each preset phoneme.
4. The neural-network-based speech recognition method according to claim 1, wherein the determining, in a joint temporal classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments comprises:
    calculating, in the joint temporal classification layer of the preset neural network model, a pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula, wherein the value of each element in the pronunciation phoneme probability vector identifies a probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to the element, and the preset probability calculation formula is:
    $$p_i = \prod_{t=1}^{T} y_{it}, \qquad i \in [1, N^T]$$
    where $p_i$ denotes the value of the i-th element in the pronunciation phoneme probability vector; $T$ denotes the total number of speech segments obtained by framing the speech sequence; $N$ denotes the total number of preset phonemes; $N^T$ denotes the total number of preset pronunciation phoneme sequences of length $T$ formed by combining the $N$ preset phonemes; and $y_{it}$ denotes the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$, the value of which is determined according to the first probability vector of the t-th frame of speech segment;
    determining a text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence, wherein the value of each element in the text sequence probability vector identifies a probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to the element, and the preset text sequences are obtained by compressing the preset pronunciation phoneme sequences; and
    determining the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.
5. The neural-network-based speech recognition method according to any one of claims 1 to 4, wherein before the acquiring a speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments, the method further comprises:
    obtaining a preset sample data set, and dividing the sample data set into a training set and a test set, wherein each piece of sample data in the sample data set consists of feature vectors of all speech segments obtained by framing one speech sequence and a text sequence corresponding to that speech sequence;
    training a pre-built original neural network model based on the training set, and determining values of preset parameters contained in a feature extraction layer and the joint temporal classification layer of the original neural network model;
    verifying the trained original neural network model based on the test set; and
    if the verification is passed, determining the trained original neural network model as the preset neural network model.
6. A terminal device, characterized by comprising:
    a first segmentation unit, configured to acquire a speech sequence to be recognized and divide the speech sequence into at least two frames of speech segments;
    a feature extraction unit, configured to perform acoustic feature extraction on the speech segments to obtain a feature vector of each speech segment;
    a first determination unit, configured to determine, in a probability calculation layer of a preset neural network model, a first probability vector of each speech segment based on the feature vector of the speech segment, wherein the value of each element in the first probability vector identifies a probability that the speech segment is pronounced as a preset phoneme corresponding to the element; and
    a second determination unit, configured to determine, in a joint temporal classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
7. The terminal device according to claim 6, wherein the first segmentation unit is specifically configured to:
    perform a framing operation on the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, wherein the duration of each frame of speech segment is the preset frame length.
8. The terminal device according to claim 6, wherein the first determination unit comprises:
    a first probability determination unit, configured to determine, in the probability calculation layer, based on pre-learned probabilities of feature vectors of preset speech segments relative to each preset phoneme, probabilities of the feature vectors of the at least two frames of speech segments relative to each preset phoneme; and
    a second probability determination unit, configured to determine the first probability vector of each speech segment based on the probabilities of the feature vector of the speech segment relative to each preset phoneme.
9. The terminal device according to claim 6, wherein the second determining unit comprises:
    a first calculation unit, configured to calculate, at the joint temporal classification layer of the preset neural network model, a pronunciation-phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula, wherein the value of each element in the pronunciation-phoneme probability vector identifies the probability that the pronunciation-phoneme sequence corresponding to the speech sequence is the preset pronunciation-phoneme sequence corresponding to that element; the preset probability calculation formula is

    p_i = \prod_{t=1}^{T} y_{it}

    where p_i denotes the value of the i-th element of the pronunciation-phoneme probability vector, i ∈ [1, N^T]; T denotes the total number of speech segments obtained by framing the speech sequence; N denotes the total number of preset phonemes; N^T denotes the total number of length-T preset pronunciation-phoneme sequences formed by combining at least one of the N preset phonemes; and y_{it} denotes the prior probability of the t-th pronunciation phoneme contained in the i-th preset pronunciation-phoneme sequence, t ∈ [1, T], whose value is determined from the first probability vector of the t-th frame of speech segment;
    a third probability determining unit, configured to determine a text sequence probability vector of the speech sequence based on the pronunciation-phoneme probability vector of the speech sequence, wherein the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, the preset text sequence being obtained by compressing the preset pronunciation-phoneme sequence;
    a text sequence determining unit, configured to determine the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.
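For illustration only, a brute-force rendering of this decoding step: every length-T phoneme sequence is scored as the product of the per-frame priors y_it, compressed into a text by merging consecutive repeats and dropping a blank symbol (one common compression scheme; the claim does not spell out the compression), and the highest-probability text is returned. Enumerating all N^T sequences is only feasible for toy sizes; practical decoders use dynamic programming instead.

    import numpy as np
    from itertools import product, groupby

    def decode_text(prob_vectors, phonemes, blank="-"):
        # prob_vectors: list of T first probability vectors, one per frame;
        # phonemes: the preset phoneme inventory, including the blank symbol.
        T = len(prob_vectors)
        text_probs = {}
        for path in product(range(len(phonemes)), repeat=T):
            # probability of one length-T preset pronunciation-phoneme sequence
            p = float(np.prod([prob_vectors[t][k] for t, k in enumerate(path)]))
            # compression: merge consecutive repeats, then drop the blank
            merged = [phonemes[k] for k, _ in groupby(path)]
            text = "".join(s for s in merged if s != blank)
            text_probs[text] = text_probs.get(text, 0.0) + p
        # element with the largest value in the text sequence probability vector
        return max(text_probs, key=text_probs.get)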
10. The terminal device according to any one of claims 6 to 9, further comprising:
    a first acquisition unit, configured to obtain a preset sample data set and divide the sample data set into a training set and a test set, wherein each piece of sample data in the sample data set consists of the feature vectors of all the speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence;
    a model training unit, configured to train a pre-built original neural network model based on the training set, and determine the values of the preset parameters contained in the feature extraction layer and the joint temporal classification layer of the original neural network model;
    a model verification unit, configured to verify, based on the test set, the original neural network model whose training has been completed;
    a model generating unit, configured to determine the trained original neural network model as the preset neural network model if the verification is passed.
11. A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;
    performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;
    determining, at a probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, wherein the value of each element in the first probability vector identifies the probability that the pronunciation of the speech segment is the preset phoneme corresponding to that element;
    determining, at a joint temporal classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
12. The terminal device according to claim 11, wherein the acquiring a speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments comprises:
    performing a framing operation on the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, wherein the duration of each frame of speech segment is the preset frame length.
13. The terminal device according to claim 11, wherein the determining, at the probability calculation layer of the preset neural network model, the first probability vector of the speech segment based on the feature vector of the speech segment comprises:
    determining, at the probability calculation layer, the probabilities of the feature vectors of the at least two frames of speech segments relative to each preset phoneme, based on the pre-learned probabilities of the feature vectors of preset speech segments relative to each preset phoneme;
    determining the first probability vector of a speech segment based on the probabilities of the feature vector of the speech segment relative to each preset phoneme.
14. The terminal device according to claim 11, wherein the determining, at the joint temporal classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments comprises:
    calculating, at the joint temporal classification layer of the preset neural network model, a pronunciation-phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula, wherein the value of each element in the pronunciation-phoneme probability vector identifies the probability that the pronunciation-phoneme sequence corresponding to the speech sequence is the preset pronunciation-phoneme sequence corresponding to that element; the preset probability calculation formula is

    p_i = \prod_{t=1}^{T} y_{it}

    where p_i denotes the value of the i-th element of the pronunciation-phoneme probability vector, i ∈ [1, N^T]; T denotes the total number of speech segments obtained by framing the speech sequence; N denotes the total number of preset phonemes; N^T denotes the total number of length-T preset pronunciation-phoneme sequences formed by combining at least one of the N preset phonemes; and y_{it} denotes the prior probability of the t-th pronunciation phoneme contained in the i-th preset pronunciation-phoneme sequence, t ∈ [1, T], whose value is determined from the first probability vector of the t-th frame of speech segment;
    determining a text sequence probability vector of the speech sequence based on the pronunciation-phoneme probability vector of the speech sequence, wherein the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, the preset text sequence being obtained by compressing the preset pronunciation-phoneme sequence;
    determining the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.
15. The terminal device according to any one of claims 11 to 14, wherein before the acquiring a speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments, the steps further comprise:
    obtaining a preset sample data set, and dividing the sample data set into a training set and a test set, wherein each piece of sample data in the sample data set consists of the feature vectors of all the speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence;
    training a pre-built original neural network model based on the training set, and determining the values of the preset parameters contained in the feature extraction layer and the joint temporal classification layer of the original neural network model;
    verifying, based on the test set, the original neural network model whose training has been completed;
    if the verification is passed, determining the trained original neural network model as the preset neural network model.
16. A non-volatile readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    acquiring a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;
    performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;
    determining, at a probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, wherein the value of each element in the first probability vector identifies the probability that the pronunciation of the speech segment is the preset phoneme corresponding to that element;
    determining, at a joint temporal classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.
17. The non-volatile readable storage medium according to claim 16, wherein the acquiring a speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments comprises:
    performing a framing operation on the speech sequence based on a preset frame length and a preset frame shift to obtain at least two frames of speech segments, wherein the duration of each frame of speech segment is the preset frame length.
18. The non-volatile readable storage medium according to claim 16, wherein the determining, at the probability calculation layer of the preset neural network model, the first probability vector of the speech segment based on the feature vector of the speech segment comprises:
    determining, at the probability calculation layer, the probabilities of the feature vectors of the at least two frames of speech segments relative to each preset phoneme, based on the pre-learned probabilities of the feature vectors of preset speech segments relative to each preset phoneme;
    determining the first probability vector of a speech segment based on the probabilities of the feature vector of the speech segment relative to each preset phoneme.
19. The non-volatile readable storage medium according to claim 16, wherein the determining, at the joint temporal classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments comprises:
    calculating, at the joint temporal classification layer of the preset neural network model, a pronunciation-phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula, wherein the value of each element in the pronunciation-phoneme probability vector identifies the probability that the pronunciation-phoneme sequence corresponding to the speech sequence is the preset pronunciation-phoneme sequence corresponding to that element; the preset probability calculation formula is

    p_i = \prod_{t=1}^{T} y_{it}

    where p_i denotes the value of the i-th element of the pronunciation-phoneme probability vector, i ∈ [1, N^T]; T denotes the total number of speech segments obtained by framing the speech sequence; N denotes the total number of preset phonemes; N^T denotes the total number of length-T preset pronunciation-phoneme sequences formed by combining at least one of the N preset phonemes; and y_{it} denotes the prior probability of the t-th pronunciation phoneme contained in the i-th preset pronunciation-phoneme sequence, t ∈ [1, T], whose value is determined from the first probability vector of the t-th frame of speech segment;
    determining a text sequence probability vector of the speech sequence based on the pronunciation-phoneme probability vector of the speech sequence, wherein the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, the preset text sequence being obtained by compressing the preset pronunciation-phoneme sequence;
    determining the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.
20. The non-volatile readable storage medium according to any one of claims 16 to 19, wherein before the acquiring a speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments, the steps further comprise:
    obtaining a preset sample data set, and dividing the sample data set into a training set and a test set, wherein each piece of sample data in the sample data set consists of the feature vectors of all the speech segments obtained by framing one speech sequence, together with the text sequence corresponding to that speech sequence;
    training a pre-built original neural network model based on the training set, and determining the values of the preset parameters contained in the feature extraction layer and the joint temporal classification layer of the original neural network model;
    verifying, based on the test set, the original neural network model whose training has been completed;
    if the verification is passed, determining the trained original neural network model as the preset neural network model.
PCT/CN2018/124306 2018-10-11 2018-12-27 Neural network-based speech recognition method, terminal device, and medium WO2020073509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811182186.1A CN109559735B (en) 2018-10-11 2018-10-11 Voice recognition method, terminal equipment and medium based on neural network
CN201811182186.1 2018-10-11

Publications (1)

Publication Number Publication Date
WO2020073509A1 true WO2020073509A1 (en) 2020-04-16

Family

ID=65864724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124306 WO2020073509A1 (en) 2018-10-11 2018-12-27 Neural network-based speech recognition method, terminal device, and medium

Country Status (2)

Country Link
CN (1) CN109559735B (en)
WO (1) WO2020073509A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862985B (en) * 2019-05-17 2024-05-31 北京嘀嘀无限科技发展有限公司 Speech recognition device, method, electronic equipment and storage medium
CN111696580B (en) * 2020-04-22 2023-06-16 广州多益网络股份有限公司 Voice detection method and device, electronic equipment and storage medium
CN111883109B (en) * 2020-07-01 2023-09-26 北京猎户星空科技有限公司 Voice information processing and verification model training method, device, equipment and medium
CN112309405A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Method and device for detecting multiple sound events, computer equipment and storage medium
CN113539242A (en) * 2020-12-23 2021-10-22 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN113539231B (en) * 2020-12-30 2024-06-18 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
CN113936643B (en) * 2021-12-16 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 Speech recognition method, speech recognition model, electronic device, and storage medium


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104575490B (en) * 2014-12-30 2017-11-07 苏州驰声信息科技有限公司 Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN108417202B (en) * 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Voice recognition method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107615308A (en) * 2015-05-11 2018-01-19 国立研究开发法人情报通信研究机构 The learning method of Recognition with Recurrent Neural Network and computer program and voice recognition device for the learning method
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-one phonetics transfer method based on voice posterior probability
US20180047388A1 (en) * 2016-08-10 2018-02-15 Conduent Business Services, Llc Modeling a class posterior probability of context dependent phonemes in a speech recognition system
CN107680582A (en) * 2017-07-28 2018-02-09 平安科技(深圳)有限公司 Acoustic training model method, audio recognition method, device, equipment and medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002305A (en) * 2020-07-29 2020-11-27 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113763932A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Voice processing method and device, computer equipment and storage medium
CN113763932B (en) * 2021-05-13 2024-02-13 腾讯科技(深圳)有限公司 Speech processing method, device, computer equipment and storage medium
CN118098206A (en) * 2024-04-18 2024-05-28 深圳市友杰智新科技有限公司 Command word score calculating method, device, equipment and medium

Also Published As

Publication number Publication date
CN109559735A (en) 2019-04-02
CN109559735B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2020073509A1 (en) Neural network-based speech recognition method, terminal device, and medium
JP6621536B2 (en) Electronic device, identity authentication method, system, and computer-readable storage medium
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN106887225B (en) Acoustic feature extraction method and device based on convolutional neural network and terminal equipment
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
WO2021179717A1 (en) Speech recognition front-end processing method and apparatus, and terminal device
CN110706690A (en) Speech recognition method and device
WO2020107834A1 (en) Verification content generation method for lip-language recognition, and related apparatus
WO2022116487A1 (en) Voice processing method and apparatus based on generative adversarial network, device, and medium
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
US20180349794A1 (en) Query rejection for language understanding
CN111816166A (en) Voice recognition method, apparatus, and computer-readable storage medium storing instructions
US20230031733A1 (en) Method for training a speech recognition model and method for speech recognition
CN114999463B (en) Voice recognition method, device, equipment and medium
CN109215647A (en) Voice awakening method, electronic equipment and non-transient computer readable storage medium
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
Flamary et al. Spoken WordCloud: Clustering recurrent patterns in speech
CN112397051A (en) Voice recognition method and device and terminal equipment
TWI818427B (en) Method and system for correcting speaker diarisation using speaker change detection based on text
CN115423904A (en) Mouth shape animation generation method and device, electronic equipment and storage medium
CN115132201A (en) Lip language identification method, computer device and storage medium
JP5342629B2 (en) Male and female voice identification method, male and female voice identification device, and program
CN113327584A (en) Language identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18936664

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18936664

Country of ref document: EP

Kind code of ref document: A1