WO2021228084A1 - 语音数据识别方法、设备及介质 - Google Patents

语音数据识别方法、设备及介质

Info

Publication number
WO2021228084A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
recognition method
data
preset
voice
Prior art date
Application number
PCT/CN2021/093033
Other languages
English (en)
French (fr)
Inventor
宋元峰
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司 filed Critical 深圳前海微众银行股份有限公司
Publication of WO2021228084A1 publication Critical patent/WO2021228084A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Definitions

  • This application relates to the field of artificial intelligence technology of financial technology (Fintech), and in particular to a voice data recognition method, device and medium.
  • ASR: Automatic Speech Recognition.
  • The main purpose of this application is to provide a voice data recognition method, apparatus, device, and medium, aiming to solve the technical problem of low speech recognition accuracy in the related art.
  • In order to achieve the above objective, the present application provides a voice data recognition method, and the voice data recognition method includes:
  • performing voice recognition on the voice data to be recognized to obtain each candidate result of the voice data to be recognized;
  • obtaining the initial ranking result of each candidate result, and obtaining the associated topic information of each candidate result;
  • re-ranking the candidate results based on the initial ranking result and the associated topic information to obtain a target ranking result;
  • selecting, according to the target ranking result, a target candidate result from the candidate results as the voice recognition result of the voice data to be recognized.
  • the present application also provides a voice data recognition device, which includes:
  • a recognition module configured to perform voice recognition on the voice data to be recognized to obtain each candidate result of the voice data to be recognized
  • the first obtaining module is configured to obtain the initial ranking result of each candidate result, and obtain related topic information of each candidate result;
  • a re-ranking module configured to re-rank the candidate results based on the initial ranking result and the associated topic information to obtain a target ranking result
  • the selection module is configured to select a target candidate result from each of the candidate results as the voice recognition result of the to-be-recognized voice data according to the target ranking result.
  • The present application also provides a voice data recognition device.
  • The voice data recognition device is a physical device and includes a memory, a processor, and a program of the voice data recognition method that is stored in the memory and can run on the processor.
  • When the program of the voice data recognition method is executed by the processor, the steps of the voice data recognition method can be implemented.
  • the present application also provides a medium on which a program for implementing the above-mentioned voice data recognition method is stored, and when the program of the voice data recognition method is executed by a processor, the steps of the above-mentioned voice data recognition method are implemented.
  • In the present application, voice recognition is performed on the voice data to be recognized to obtain each candidate result of the voice data to be recognized; the initial ranking result of each candidate result and the associated topic information of each candidate result are obtained; the candidate results are re-ranked based on the initial ranking result and the associated topic information to obtain a target ranking result; and a target candidate result is selected from the candidate results according to the target ranking result as the voice recognition result of the voice data to be recognized.
  • That is, after the candidate results of the voice data to be recognized are obtained, the associated topic information of the candidate results is also obtained, so that the phenomenon of word burstiness is taken into account through the associated topic information.
  • The candidate results are then re-ranked based on the initial ranking result and the associated topic information, which improves the accuracy of the target ranking result and therefore the accuracy of the speech recognition result.
  • FIG. 1 is a schematic flowchart of a first embodiment of a voice data recognition method according to this application;
  • FIG. 2 is a detailed flow diagram of the steps of performing voice recognition on the voice data to be recognized to obtain candidate results of the voice data to be recognized in the first embodiment of the voice data recognition method of this application;
  • FIG. 3 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the application;
  • FIG. 4 is a schematic diagram of the first scenario in the voice data recognition method of this application.
  • FIG. 5 is a schematic diagram of the second scenario in the voice data recognition method of this application.
  • the embodiment of the present application provides a voice data recognition method.
  • the voice data recognition method includes:
  • Step S10 performing voice recognition on the voice data to be recognized to obtain each candidate result of the voice data to be recognized;
  • Step S20 obtaining the initial ranking result of each candidate result, and obtaining related topic information of each candidate result
  • Step S30 Re-sort the candidate results based on the initial ranking result and the associated topic information to obtain a target ranking result
  • Step S40 selecting a target candidate result from each of the candidate results according to the target ranking result as the voice recognition result of the voice data to be recognized.
  • Step S10 performing voice recognition on the voice data to be recognized to obtain each candidate result of the voice data to be recognized;
  • In this embodiment, after the voice data to be recognized is obtained, voice recognition is performed on it to obtain the candidate results of the voice data to be recognized.
  • Specifically, a preset voice feature extraction model extracts the voice features of the voice data to be recognized; the voice features may be Mel-frequency cepstral coefficient (MFCC) features or the like.
  • As shown in Figure 4, after the voice features are obtained, they are processed by a preset voice (acoustic) model to obtain a speech recognition result, i.e., the state corresponding to each frame of speech.
  • The speech recognition result is then input into the language model to obtain text recognition results, the text recognition results are combined into a word lattice, and the candidate results are obtained from the word lattice.
  • the step of performing voice recognition on the voice data to be recognized to obtain candidate results of the voice data to be recognized includes:
  • Step S11 performing voice feature extraction on the voice data to be recognized to obtain voice feature data of the voice data to be recognized;
  • Before feature extraction, framing processing is performed on the voice data to be recognized; the length of each frame may be 25 milliseconds, and adjacent frames overlap to avoid information loss.
  • After framing, the speech becomes many short segments. According to the physiological characteristics of the human ear, each frame of the waveform is turned into a multi-dimensional vector that contains the content information of that frame of speech; this process is called acoustic feature extraction, and the voice feature data is obtained through it.
  • After extraction, the speech becomes a matrix of M rows and N columns (the voice feature data), where M is the dimension of the acoustic feature (for example 12, assuming the acoustic feature is 12-dimensional), N is the total number of frames, and the values of each dimensional vector differ (a minimal extraction sketch follows).
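  • As a minimal, hedged sketch of the framing and acoustic-feature-extraction step described above (assuming the open-source librosa library, 16 kHz audio, 25 ms frames with a 10 ms hop so that adjacent frames overlap; none of these specifics are fixed by this application):

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, n_mfcc: int = 12) -> np.ndarray:
    """Return an (n_mfcc x N) matrix of acoustic features, one column per frame."""
    signal, sr = librosa.load(wav_path, sr=16000)   # load speech at 16 kHz
    frame_length = int(0.025 * sr)                  # 25 ms frames
    hop_length = int(0.010 * sr)                    # 10 ms hop -> adjacent frames overlap
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    return mfcc                                     # shape: (12, total number of frames N)
```

  • With a 12-dimensional feature this yields exactly the M-row, N-column matrix described above.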
  • Step S12 using a preset voice model and a preset language model to recognize the voice feature data, and obtain each candidate result of the voice data to be recognized.
  • After the voice feature data is obtained, it is recognized by the preset voice model to obtain the speech recognition result, i.e., the possible state corresponding to each frame of speech.
  • Every three states are combined into a phoneme, and several phonemes are combined into a word (for Chinese, for example, initials and finals); in other words, once the state corresponding to each frame of speech is known, a (possible) speech recognition result can be derived through the mapping between phonemes and the words in the dictionary. It should be noted that there may be multiple speech recognition results.
  • After the speech recognition results (the words, e.g., initials and finals) are obtained, they are combined and ranked by the preset language model to obtain the candidate results; as shown in Figure 4, each sentence in the N-best list is one candidate result.
  • Specifically, the preset language model determines a decoding score for the word sequence formed by each speech recognition result; the decoding score is a score for that word sequence and characterizes the probability of each word sequence. A toy sketch of turning such scores into an ordered N-best list follows.
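  • A toy sketch of how an initial ordering of the N-best list can be formed from the scores: each candidate word sequence carries an acoustic-model score and a language-model decoding score, and a combined log-domain score orders the candidates. The interpolation weight and the exact combination are assumptions for illustration; this application only states that the language model outputs a score characterizing each word sequence's probability.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    acoustic_logprob: float   # score from the preset voice (acoustic) model
    lm_logprob: float         # decoding score from the preset language model

def initial_ranking(candidates: list[Candidate], lm_weight: float = 1.0) -> list[Candidate]:
    """Order the N-best list so the most probable word sequence comes first."""
    return sorted(candidates,
                  key=lambda c: c.acoustic_logprob + lm_weight * c.lm_logprob,
                  reverse=True)
```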
  • Step S20 obtaining the initial ranking result of each candidate result, and obtaining related topic information of each candidate result
  • In this embodiment, after the candidate results are obtained, the initial ranking result of the candidate results is obtained according to the occurrence probability of each candidate result, with the most probable candidate ranked first.
  • The associated topic information of the candidate results is also obtained. Specifically, the overall topic information of all candidate results is obtained first; there may be several pieces of topic information, and the one with the highest probability can be selected as the associated topic information.
  • The associated topic information can be determined by a preset Dialogue Speech Topic Model (DSTM): the candidate results are input into the trained preset speech dialogue topic model, which outputs the associated topic information.
  • DSTM: Dialogue Speech Topic Model.
  • the step of obtaining the related topic information of each candidate result includes:
  • Step S21: input the candidate results into a preset dialogue topic model optimized with non-manually annotated training sentence data, and perform topic feature extraction processing on the candidate results to obtain the associated topic information of each candidate result;
  • the non-manually labeled training sentence data is obtained based on a preset pre-training model optimized based on simulated label data
  • the simulated label data is obtained by conversion based on preset unlabeled original sentence data.
  • The preset dialogue topic model can accurately perform topic feature extraction on the candidate results and obtain their associated topic information because the model is optimized with non-manually annotated training sentence data.
  • The non-manually annotated training sentence data may be annotated manually or automatically. Since the training sentence data has already been annotated, during training the prediction produced by the basic training model for the training sentence data is compared with the corresponding annotation, and the parameters of the basic training model are adjusted accordingly.
  • The parameters of the basic training model are continuously adjusted on the non-manually annotated training sentence data until the preset loss function of the basic training model converges, or the number of training rounds of the basic training model reaches a first preset number, at which point the preset dialogue topic model is obtained.
  • It should be noted that, in this embodiment, the preset dialogue topic model is obtained by training the basic training model on non-manually annotated training sentence data rather than on preset training word data. Dialogue data is relatively short in structure: a sentence usually corresponds to one topic, rather than each word having its own topic.
  • Because the basic training model is trained on sentence-level data (rather than word-level data) to obtain the preset dialogue topic model, both the training accuracy (avoiding the topic dispersion caused by assigning a topic to every word) and the training efficiency (fewer topic judgments, since no per-word decision is needed) are improved. A minimal sketch of this sentence-level design follows.
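  • A minimal sketch of the sentence-level design choice (one topic distribution per sentence instead of one topic per word), using PyTorch; the mean-pooled embedding encoder and the number of topics are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class SentenceTopicModel(nn.Module):
    """Maps a whole sentence to a single topic distribution."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, n_topics: int = 20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.topic_head = nn.Linear(embed_dim, n_topics)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); one vector per sentence via mean pooling
        sentence_vec = self.embed(token_ids).mean(dim=1)
        return torch.softmax(self.topic_head(sentence_vec), dim=-1)
```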
  • Step S30 Re-sort the candidate results based on the initial ranking result and the associated topic information to obtain a target ranking result
  • In this embodiment, the candidate results are re-ranked based on the initial ranking result and the associated topic information to obtain the target ranking result.
  • Specifically, after the associated topic information is obtained, the association relationship between each candidate result and the associated topic information is determined; this relationship may be a topic-information similarity or a contribution probability.
  • For example, the contribution probability of each candidate result in producing the associated topic information is determined; based on this contribution probability and the initial ranking result (the occurrence probability of each candidate result), an overall probability is calculated for each candidate result according to a preset formula, and the candidates are ranked by this overall probability to obtain the target ranking result.
  • For example, if there are 5 candidate results, the 5 candidates are input into the preset dialogue topic model to obtain their overall associated topic information (the topic with the highest probability); the contribution probability of each candidate result, i.e., the proportion that its own topic information contributes to the associated topic information, is then determined and combined with its occurrence probability to obtain the target ranking result (see the sketch below).
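  • A minimal sketch of this re-ranking step under stated assumptions: occur_prob holds each candidate's occurrence probability (the basis of the initial ranking), topic_contrib holds its contribution probability to the shared associated topic, and the "preset formula" is assumed here to be a simple weighted log-linear combination, since this application does not fix its exact form:

```python
import math

def rerank(candidates: list[str],
           occur_prob: dict[str, float],
           topic_contrib: dict[str, float],
           alpha: float = 0.7) -> list[str]:
    """Return the candidates ordered by their combined (overall) probability."""
    def overall_score(c: str) -> float:
        # weighted combination of occurrence probability and topic contribution
        return alpha * math.log(occur_prob[c] + 1e-12) + \
               (1 - alpha) * math.log(topic_contrib[c] + 1e-12)
    return sorted(candidates, key=overall_score, reverse=True)
```

  • With such a combination, a candidate with a high topic contribution can overtake a candidate that the occurrence probability alone ranked first, which is how word burstiness influences the final result.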
  • Step S40 selecting a target candidate result from each of the candidate results according to the target ranking result as the voice recognition result of the voice data to be recognized.
  • In this embodiment, after the target ranking result is obtained, the target candidate result is selected from the candidate results according to the target ranking result, i.e., the candidate result with the largest overall probability is selected as the voice recognition result of the voice data to be recognized.
  • In the present application, voice recognition is performed on the voice data to be recognized to obtain each candidate result; the initial ranking result and the associated topic information of the candidate results are obtained; the candidate results are re-ranked based on the initial ranking result and the associated topic information to obtain the target ranking result; and the target candidate result is selected from the candidate results according to the target ranking result as the voice recognition result of the voice data to be recognized.
  • That is, after the candidate results are obtained, their associated topic information is also obtained, so that the phenomenon of word burstiness is taken into account; re-ranking the candidate results based on the initial ranking result and the associated topic information improves the accuracy of the target ranking result and, in turn, of the speech recognition result.
  • Further, based on the first embodiment, in another embodiment of this application, the simulated label data is obtained by partially replacing the preset unlabeled original sentence data with generated unlabeled sentence data, and the simulated label data includes at least data with true and false simulated labels.
  • It should be noted that the preset pre-training model has already been trained. Specifically, it is obtained based on the simulated label data, and the simulated label data is converted from the preset unlabeled original sentence data. When the preset training model is trained sufficiently, it learns a hidden representation of the original sentence data that contains information such as the speaker; because it is sufficiently trained, it can annotate data accurately.
  • In addition, because the preset training model is trained on unsupervised (unlabeled) data, it has strong generalization ability.
  • It should be noted that, in order to learn the hidden representation of the original sentence data, each piece of original sentence data needs to be encoded with multiple features; for example, the encoding of a piece of original sentence data may be (1, 0, 1, 0) or (1, 0, 1, 0, 1, 0), and so on.
  • Although each piece of original sentence data is encoded with multiple features, in order to obtain the simulated label data, each piece may at least include a first coding feature of 1 indicating that the original sentence data is real rather than synthesized.
  • Alternatively, with encodings such as (1, 0, 1, 0, 1, 0, 1, 0) or (1, 0, 1, 0, 1, 0), each piece of original sentence data may at least include a first coding feature of 1 indicating that the number of frames contained in the original sentence data equals a preset number.
  • It should be noted that the preset dialogue topic model may embed an annotation layer formed by the preset pre-training model, so that the labeled training data used for training is produced inside the preset dialogue topic model.
  • Alternatively, in this embodiment, the preset dialogue topic model may be optimized by sending the voice data to an external preset pre-training model for labeling and then obtaining the labeled training data.
  • It should be noted that the simulated label data is obtained by partially replacing the preset unlabeled original sentence data with generated unlabeled sentence data, and the simulated label data includes at least data with true and false simulated labels.
  • Alternatively, the simulated label data is obtained by deleting part of the frame data from the preset unlabeled original sentence data, and it likewise includes at least data with true and false simulated labels.
  • Specifically, in this embodiment, the unlabeled sentence data (in voice form) includes unlabeled random voice frame data or unlabeled random voice segment data.
  • Partially replacing the preset unlabeled original sentence data with the unlabeled sentence data to obtain the simulated label data includes: after unlabeled random voice frame data is generated, selecting multiple pieces of preset unlabeled original sentence data and replacing at least one frame of each selected piece with the unlabeled random voice frame data to obtain simulated label data; or, after unlabeled random voice segment data is generated, selecting multiple pieces of preset unlabeled original sentence data and replacing at least one segment of each selected piece with the unlabeled random voice segment data to obtain simulated label data.
  • It should be noted that each piece of preset unlabeled original sentence data may include multiple voice segments, and each voice segment may include multiple frames of data.
  • In addition, in this embodiment, after unlabeled random voice frame data is generated, multiple pieces of preset unlabeled original sentence data may be selected and unlabeled random voice frame data (or, after unlabeled random voice segment data is generated, at least two frames of unlabeled random voice frame data) may be added to each selected piece to obtain simulated label data.
  • Alternatively, multiple pieces of preset unlabeled original sentence data may be selected and one frame, or at least two frames, of unlabeled random voice frame data may be removed from each selected piece to obtain simulated label data. An illustrative sketch of these corruption operations follows.
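  • The sketch below illustrates the three corruption operations (replace, add, or delete a random frame) on sentences represented as frame matrices of shape (frames, feature_dim); corrupted sentences receive the fake label 0 and untouched sentences the real label 1. The Gaussian stand-in for a generated frame and the 50% corruption ratio are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(sentence: np.ndarray, mode: str) -> np.ndarray:
    """Replace, add, or delete one random frame of a (frames x feat_dim) sentence."""
    s = sentence.copy()
    idx = int(rng.integers(len(s)))
    random_frame = rng.normal(size=(1, s.shape[1]))      # stands in for a generated frame
    if mode == "replace":
        s[idx] = random_frame
    elif mode == "add":
        s = np.insert(s, idx, random_frame, axis=0)
    else:                                                 # "delete"
        s = np.delete(s, idx, axis=0)
    return s

def build_simulated_label_data(sentences: list[np.ndarray], fake_ratio: float = 0.5):
    """Return (data, labels): label 1 = real/untouched, label 0 = simulated fake."""
    data, labels = [], []
    for s in sentences:
        if rng.random() < fake_ratio:
            data.append(corrupt(s, mode=str(rng.choice(["replace", "add", "delete"]))))
            labels.append(0)
        else:
            data.append(s)
            labels.append(1)
    return data, labels
```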
  • the coding feature of each original sentence data may also include at least the coding feature of the speaker's voice. Therefore, the voice data marked with the voice characteristics of the speaker can be obtained.
  • a specific description is given by taking as an example a label layer formed by embedding the preset pre-training model in a preset dialogue topic model.
  • Before the step of inputting the candidate results into a preset dialogue topic model optimized with labeled training data, performing topic feature extraction processing on the candidate results, and obtaining the associated topic information of each candidate result, the method further includes:
  • Step A1 Obtain the original sentence data that is preset without labels
  • In this embodiment, the preset unlabeled original sentence data is obtained first. To ensure the training effect, the quantity of the original sentence data is greater than a preset quantity value, and the original sentence data is in voice form.
  • The original sentence data (in voice form) should be real, non-synthesized, non-generated voice data, so that simulated label data can be generated from it; non-synthesized (non-generated) sentence data refers to voice data uttered by recorded speakers rather than fitted by a machine.
  • Each piece of sentence data includes multiple voice files, such as z1, z2, z3, z4 in Figure 5, and each voice file includes multiple frames of voice data, such as X1, X2, X3, X4 in Figure 5.
  • The number of voice files in each piece of sentence data and the number of frames in each voice file are determined, so that simulated label data can be generated.
  • The specific content of the original sentence data is not limited in this embodiment, and the original data is unlabeled; that is, the preset pre-training model can be obtained through unsupervised training.
  • Step A2 generating unlabeled sentence data, and partially replacing the preset unlabeled original sentence data with the unlabeled sentence data to obtain simulated label data;
  • After the original sentence data is obtained, unlabeled sentence data is generated. Specifically, the unlabeled sentence data is produced by a preset generator (the generator in Figure 5), i.e., obtained through machine fitting; this generated or synthesized sentence data serves as simulated fake-label material. The unlabeled sentence data may have the length of a single frame or of a segment, and the specific length is not limited; to ensure that replacement can be performed at any position, the generator produces unlabeled sentence data of various lengths. The preset unlabeled original sentence data is then partially replaced with the unlabeled sentence data to obtain the simulated label data.
  • This embodiment also provides another way to obtain simulated label data: frame data in the preset unlabeled original sentence data is randomly deleted, or frame data is randomly added to the preset unlabeled original sentence data, to obtain the simulated label data.
  • Step A3 training a preset training model based on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model;
  • the step of training a preset training model based on the simulated label data to obtain a target model meeting preset conditions, and setting the target model as the preset pre-training model includes:
  • Step B1 determining simulated fake label data and simulated real label data in the simulated label data
  • In this embodiment, the original sentence data that has been replaced constitutes the known simulated fake-label data, and the other, unreplaced original sentence data constitutes the known simulated true-label data; together, the known simulated fake-label data and the known simulated true-label data form the simulated label data.
  • Alternatively, in this embodiment, the original sentence data from which frames have been deleted, or to which frames have been added, constitutes the known simulated fake-label data, and the other, unprocessed original sentence data constitutes the known simulated true-label data; together they form the simulated label data.
  • Step B2 input the simulated fake label data and simulated real label data into a preset training model to obtain a recognition result
  • Step B3: adjust the model parameters of the preset training model based on the recognition result and the true and false simulated labels in the simulated label data until a target model satisfying the preset conditions is obtained, and set the target model as the preset pre-training model.
  • After the known simulated fake-label data and known simulated true-label data are obtained, they are input into the preset training model to train it. Specifically, the recognition result produced by the preset training model for the simulated fake-label and true-label data is obtained; as shown in Figure 5, the recognition result predicts which pieces of the original sentence data are original (not replaced or otherwise processed) and which are replaced (not original: replaced, deleted, etc.).
  • Since the simulated fake-label data and simulated true-label data are known, i.e., it is known which pieces of the original sentence data are original and which are replaced, the recognition result is compared with the known labels to determine the error between them.
  • After the error is determined, the model parameters of the preset training model are adjusted accordingly until a target model satisfying the preset conditions is obtained, and the target model is set as the preset pre-training model. A minimal training sketch is given below.
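  • A minimal PyTorch sketch of this pre-training step: a small discriminator learns to tell original sentences from replaced or altered ones using the known true/fake simulated labels. The GRU architecture, optimizer, and hyper-parameters are assumptions; this application only requires training until the loss converges or a preset number of rounds is reached:

```python
import torch
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """Scores a whole sentence of frames as real (1) or simulated fake (0)."""
    def __init__(self, feat_dim: int = 12, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) -> one logit per sentence
        _, h = self.rnn(x)
        return self.head(h[-1]).squeeze(-1)

def pretrain(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):                  # or stop earlier once the loss converges
        for features, labels in loader:      # labels: known true/fake simulated labels
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels.float())
            loss.backward()
            optimizer.step()
    return model
```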
  • It should be noted that, because the preset unlabeled original sentence data is encoded with multiple features, other implicit representations of the preset unlabeled original sentence data can be learned in the process of distinguishing real sentences from fake ones.
  • Step A4 input the preset training sentence data into the preset pre-training model to obtain labeled training data
  • Step A5 training to obtain a preset dialogue topic model based on the labeled training data.
  • After the preset pre-training model is obtained, labeled training data is produced with it, and the preset dialogue topic model is then trained on the labeled training data. Specifically, a basic model is trained on the labeled training data until it satisfies a certain preset condition, and the model satisfying that condition is set as the preset dialogue topic model.
  • The certain preset condition may be that the preset loss function of the basic model converges, or that the number of training rounds of the basic model reaches a preset number.
  • In this embodiment, the preset unlabeled original sentence data is obtained; unlabeled sentence data is generated, and the preset unlabeled original sentence data is partially replaced with it to obtain simulated label data; the preset training model is trained on the simulated label data to obtain a target model satisfying the preset conditions, which is set as the preset pre-training model; the preset training sentence data is input into the preset pre-training model to obtain labeled training data; and the preset dialogue topic model is trained on the labeled training data.
  • In this way, training of the preset dialogue topic model can be completed quickly.
  • An embodiment of the present application provides a voice data recognition method. In another embodiment of the voice data recognition method of this application, the step of re-ranking the candidate results based on the initial ranking result and the associated topic information to obtain the target ranking result includes:
  • Step C1 extracting feature data corresponding to the candidate results, inputting the feature data and the associated topic information into a preset ranking model, and re-ranking the candidate results to obtain a target ranking result;
  • The ranking model is trained on a candidate feature set; a piece of training data in the candidate feature set includes the feature data corresponding to multiple candidate results, the associated topic information corresponding to those candidate results, and the ranking labels of those candidate results.
  • In this embodiment, such a preset ranking model exists. The feature data includes vector representations of the candidate results, or scoring data of the candidate results (obtained by inputting the candidate results into the preset voice model and/or the preset language model). In other words, a piece of training data either contains the vector representation data of multiple candidate results, their associated topic information, and their ranking labels, or contains the scoring data of multiple candidate results, their associated topic information, and their ranking labels, where the scoring data is obtained from the preset voice model and/or the preset language model. An illustrative example follows.
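  • An illustrative layout of one piece of training data for the preset ranking model is sketched below; the field names and toy numbers are assumptions made only for illustration:

```python
from dataclasses import dataclass

@dataclass
class RankingExample:
    candidate_features: list[list[float]]   # one feature vector (or score pair) per candidate
    topic_info: list[float]                 # associated topic information shared by the candidates
    ranking_labels: list[int]               # e.g. 0 = best candidate, 1 = second best, ...

example = RankingExample(
    candidate_features=[[0.91, -3.2], [0.85, -3.5], [0.60, -4.1]],  # e.g. [AM score, LM score]
    topic_info=[0.7, 0.2, 0.1],             # e.g. a distribution over three topics
    ranking_labels=[0, 1, 2],
)
```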
  • Because the ranking model has been trained accurately, it can correctly process the input feature data and the corresponding associated topic information. Therefore, once the candidate results are obtained, the feature data corresponding to the candidate results is extracted; after the feature data and the associated topic information are input into the preset ranking model, the candidate results can be re-ranked to obtain the target ranking result.
  • In this embodiment, the feature data corresponding to the candidate results is extracted, the feature data and the associated topic information are input into the preset ranking model, and the candidate results are re-ranked to obtain the target ranking result, where the ranking model is trained on the candidate feature set described above.
  • Because this model-prediction approach takes the phenomenon of word burstiness into account, the accuracy of the target ranking result, and hence of the speech recognition result, is improved.
  • FIG. 3 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the present application.
  • the voice data recognition device may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between the processor 1001 and the memory 1005.
  • the memory 1005 may be a high-speed RAM memory, or a stable memory (non-volatile memory), such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • Optionally, the voice data recognition device may also include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and so on.
  • The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard); optionally, the user interface may also include a standard wired interface and a wireless interface.
  • Optionally, the network interface may include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • Those skilled in the art can understand that the structure of the voice data recognition device shown in FIG. 3 does not limit the voice data recognition device; it may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the memory 1005 as a computer medium may include an operating system, a network communication module, and a voice data recognition program.
  • the operating system is a program that manages and controls the hardware and software resources of the voice data recognition device, and supports the operation of the voice data recognition program and other software and/or programs.
  • the network communication module is used to realize the communication between various components in the memory 1005 and the communication with other hardware and software in the voice data recognition system.
  • the processor 1001 is configured to execute the voice data recognition program stored in the memory 1005 to implement the steps of the voice data recognition method described in any one of the above.
  • the specific implementation of the voice data recognition device of the present application is basically the same as each embodiment of the voice data recognition method described above, and will not be repeated here.
  • the present application also provides a voice data recognition device, which includes:
  • a recognition module configured to perform voice recognition on the voice data to be recognized to obtain each candidate result of the voice data to be recognized
  • the first obtaining module is configured to obtain the initial ranking result of each candidate result, and obtain related topic information of each candidate result;
  • a re-ranking module configured to re-rank the candidate results based on the initial ranking result and the associated topic information to obtain a target ranking result
  • the selection module is configured to select a target candidate result from each of the candidate results as the voice recognition result of the to-be-recognized voice data according to the target ranking result.
  • the first obtaining module includes:
  • the first extraction unit is configured to input the candidate results into a preset dialogue topic model optimized by labeled training data, and perform topic feature extraction processing on the candidate results to obtain related topic information of the candidate results;
  • the labeled training data is obtained based on a preset pre-training model optimized based on simulated label data
  • the simulated label data is obtained by conversion based on preset unlabeled original sentence data.
  • the simulated label data is obtained by partially replacing the preset unlabeled original sentence data with the generated unlabeled sentence data, and the simulated label data includes at least true and false simulated label data.
  • the voice data recognition device further includes:
  • the second acquisition module is configured to acquire the preset unlabeled original sentence data;
  • the generation module is configured to generate unlabeled sentence data, and partially replace the preset unlabeled original sentence data with the unlabeled sentence data to obtain simulated label data;
  • the preset pre-training model generation module is configured to train the preset training model based on the simulated label data to obtain a target model that meets the preset conditions, and set the target model as the preset pre-training model;
  • the input module is used to input preset training sentence data into the preset pre-training model to obtain labeled training data
  • the training module is configured to train and obtain the preset dialogue topic model based on the labeled training data.
  • the generating module includes:
  • the determining unit is used to determine the simulated fake label data and the simulated real label data in the simulated label data
  • the input unit is used to input the simulated fake label data and the simulated real label data into a preset training model to obtain a recognition result;
  • the adjustment unit is configured to adjust the model parameters of the preset training model based on the recognition result and the true and false simulated tags in the simulated tag data until a target model that meets the preset conditions is obtained, and the target model is set to The preset pre-training model.
  • the reordering module includes:
  • the second extraction unit is used to extract feature data corresponding to the candidate results, input the feature data and the associated topic information into a preset ranking model, and re-rank the candidate results to obtain the target ranking result;
  • the ranking model is obtained by training using a candidate feature set, a piece of training data in the candidate feature set includes feature data corresponding to multiple candidate results, associated topic information corresponding to the multiple candidate results, and the multiple The ranking label of the candidate result.
  • the selection module includes:
  • An acquiring unit configured to perform voice feature extraction on the voice data to be recognized to obtain voice feature data of the voice data to be recognized
  • the recognition unit is configured to recognize the voice feature data using a preset voice model and a preset language model, and obtain each candidate result of the voice data to be recognized.
  • the specific implementation of the voice data recognition device of the present application is basically the same as each embodiment of the voice data recognition method described above, and will not be repeated here.
  • An embodiment of the present application provides a medium; the medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of any of the voice data recognition methods described above.

Abstract

A voice data recognition method, apparatus, device, and medium. The method includes: performing voice recognition on voice data to be recognized to obtain candidate results of the voice data to be recognized (S10); obtaining an initial ranking result of the candidate results and obtaining associated topic information of the candidate results (S20); re-ranking the candidate results based on the initial ranking result and the associated topic information to obtain a target ranking result (S30); and selecting, according to the target ranking result, a target candidate result from the candidate results as the voice recognition result of the voice data to be recognized (S40).

Description

语音数据识别方法、设备及介质
本申请要求2020年5月15日申请的,申请号为202010417957.1,名称为“语音数据识别方法、设备及介质”的中国专利申请的优先权,在此将其全文引入作为参考。
技术领域
本申请涉及金融科技(Fintech)的人工智能技术领域,尤其涉及一种语音数据识别方法、设备及介质。
背景技术
随着金融科技,尤其是互联网科技金融的不断发展,越来越多的技术(如分布式、区块链Blockchain、人工智能等)应用在金融领域,但金融业也对技术提出了更高的要求,如对金融业对语音数据识别也有更高的要求。
随着移动设备的发展,语音成了日常的输入沟通方式,其中,自动语音识别 (Automatic Speech Recognition, ASR) 技术是语音输入的重要前提,然而,目前,在对语音数据进行自动识别的过程中,未考虑词突发(burstiness)的现象,词突发(burstiness)的现象指的是一个词如"电影"出现之后,这个词("电影"本身)以及和它相关的词如"演员"出现的频率会增加,而未考虑词突发(burstiness)的现象,致使语音识别的准确性低。
技术问题
本申请的主要目的在于提供一种语音数据识别方法、装置、设备和介质,旨在解决相关技术中语音识别的准确性低的技术问题。
技术解决方案
为实现上述目的,本申请提供一种语音数据识别方法,所述语音数据识别方法包括:
对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
本申请还提供一种语音数据识别装置,所述语音数据识别装置包括:
识别模块,用于对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
第一获取模块,用于获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
重新排序模块,用于基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
选取模块,用于根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
本申请还提供一种语音数据识别设备,所述语音数据识别设备为实体设备,所述语音数据识别设备包括:存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的所述语音数据识别方法的程序,所述语音数据识别方法的程序被处理器执行时可实现如上述的语音数据识别方法的步骤。
本申请还提供一种介质,所述介质上存储有实现上述语音数据识别方法的程序,所述语音数据识别方法的程序被处理器执行时实现如上述的语音数据识别方法的步骤。
有益效果
本申请通过对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。在本申请中,在得到待识别语音数据的各候选结果后,还获取所述各候选结果的关联主题信息,即是基于关联主题信息考虑词突发(burstiness)的现象,进而基于所述初始排序结果以及所述关联主题信息对所述各候选结果进行重新排序,提升得到目标排序结果的准确性,以提升得到语音识别结果的准确性。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。
为了更清楚地说明本申请实施例或相关技术中的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍,显而易见地,对于本领域普通技术人员而言,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本申请语音数据识别方法第一实施例的流程示意图;
图2为本申请语音数据识别方法第一实施例中对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果的的步骤细化流程示意图;
图3为本申请实施例方案涉及的硬件运行环境的设备结构示意图;
图4为本申请语音数据识别方法中的第一场景示意图;
图5为本申请语音数据识别方法中的第一场景示意图。
本申请目的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
本发明的实施方式
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
本申请实施例提供一种语音数据识别方法,在本申请语音数据识别方法的第一实施例中,参照图1,所述语音数据识别方法包括:
步骤S10,对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
步骤S20,获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
步骤S30,基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
步骤S40,根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
具体步骤如下:
步骤S10,对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
在本实施例中,在获取待识别语音数据后,对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果,具体地,在获取待识别语音数据后,通过预设语音特征提取模型提取所述待识别语音数据的语音特征,该语音特征可以是梅尔频率倒谱MFCC特征等,如图4所示,在得到语音特征后,通过预设的语音模型对语音特征进行处理,得到语音识别结果,语音识别结果即是每帧语音对应的状态,在得到语音识别结果后,将语音识别结果输入至语言识别模型中,得到语音识别结果的文本识别结果,在得到文本识别结果后,基于各个文本识别结果组合得到词网格,基于词网格得到各候选结果。
具体地,参照图2,所述对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果的步骤,包括:
步骤S11,对所述待识别语音数据进行语音特征提取,得到所述待识别语音数据的语音特征数据;
在本实施例中,首先对所述待识别语音数据进行语音特征提取,在进行特征提取之前,对所述待识别语音数据进行分帧处理,其中,每帧的长度可以为25毫秒,每两帧之间有交叠,以避免信息流失,在分帧后,语音就变成了很多小段,为了描述,根据人耳的生理特性,把每一帧波形变成一个多维向量,该多维向量包含了这帧语音的内容信息,这个过程可以叫做声学特征提取,即是通过声学特征提取得到语音特征数据,提取后,声音就成了一个M行如12行(假设声学特征是12维)、N列的一个矩阵(语音特征数据),其中,N为总帧数,且每维向量大小不同。
步骤S12,采用预设语音模型和预设语言模型对所述语音特征数据进行识别,得到所述待识别语音数据的各候选结果。
在得到语音特征数据后,采用预设语音模型对所述语音特征数据进行识别,得到语音识别结果,语音识别结果即是每帧语音对应的可能状态,每三个状态组合成一个音素,若干个音素组合成一个比特位词如韵母声母,也就是说,只要知道每帧语音对应哪个状态了,语音识别的结果(可能的)也就出来了(通过因素与词典中词语的映射关系),需要说明的是,语音识别结果可能存在多个,在得到语音识别结果(各个比特位词如韵母声母)后,通过预设语言模型对语音识别结果进行组合排序处理,得到各候选结果,例如图4中N候选中的每句话都是一个候选结果。具体地,通过预设语言模型,确定各个语音识别结果构成的词序列的解码得分,该解码得分输出的是针对所述词序列的评分,其能够表征各个词序列的概率。
步骤S20,获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
在本实施例中,在得到各候选结果后,根据各候选结果出现的概率,得到所述各候选结果的初始排序结果,其中,出现概率最大的排序最靠前。
在本实施例中,还获取所述各候选结果的关联主题信息,具体地,首先获取各候选结果整体的主题信息,该主题信息可以是多个,可以从主题信选取概率最大的主题信息作为关联主题信息。其中,可以通过预设语音对话主题模型(Dialogue Speech Topic Model,DSTM)确定关联主题信息,也即,将各候选结果输入至预设语音对话主题模型中,通过已经训练好的预设语音对话主题模型得到关联主题信息。
其中,所述获取所述各候选结果的关联主题信息的步骤,包括:
步骤S21,将所述候选结果输入至非人工标注训练语句数据优化的预设对话主题模型中,对所述候选结果进行主题特征提取处理,得到所述各候选结果的关联主题信息;
其中,所述非人工标注训练语句数据是基于模拟标签数据优化的预设预训练模型得到的,所述模拟标签数据是基于预设无标签原始语句数据转换得到的。
在本实施例中,预设对话主题模型能够准确对所述候选结果进行主题特征提取处理,得到所述各候选结果的关联主题信息的原因在于:所述预设对话主题模型是基于非人工标注训练语句数据优化得到的,其中,非人工标注训练语句数据可以是人为标注的数据,也可以是非人为标注的数据,由于训练语句数据已经标注完成的,因而,在训练过程中,基于基础训练模型对非人工标注训练语句数据进行预测后的预测结果,与非人工标注训练语句数据对应的标注结果进行比对,进而,进行基础训练模型的参数调整,基于非人工标注训练语句数据对基础训练模型的参数进行持续调整,直至基础训练模型的预设损失函数收敛,或者对基础训练模型的训练次数达到第一预设次数,即得到预设对话主题模型。
需要说明的是,在本实施例中,预设对话主题模型是基于非人工标注训练语句数据而不是预设训练词语数据对基础训练模型进行训练得到的,这是因为对话数据的结构较短,通常一句话对应一个主题,而不是每个词都有不同的主题,由于基于非人工标注训练语句数据(以句子为单位)而不是预设训练词语数据(以词语为单位)对基础训练模型进行训练得到预设对话主题模型,因而,可以提升模型的训练准确性(避免每个词语对应有一个主题造成的主题分散)以及模型的训练效率(避免每个词语的主题判断,较少判断次数)。
步骤S30,基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
在本实施例中,基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果,具体地,在得到所述关联主题信息后,确定每个候选结果与该关联主题信息之间的关联关系,该关联关系可以是主题信息相似度或者是贡献度概率,具体地,例如,确定每个候选结果在得到该关联主题信息过程中的贡献度概率,基于该每个候选结果的贡献度概率以及所述初始排序结果(每个候选结果的出现概率),按照预设公式计算得到各候选结果的整体概率,并基于该整体概率进行排名,得到目标排序结果。
例如,候选结果有5个,将该5个候选结果输入至预设对话主题模型中,得到该5个候选结果整体的关联主题信息(概率最大),然后确定每个候选结果得到该关联主题信息的贡献度概率,即是得到每个候选结果的主题信息在得到该关联主题信息中的贡献占比,然后结合每个候选结果的出现概率,得到目标排序结果。
步骤S40,根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
在本实施例中,在得到目标排序结果后,根据所述目标排序结果从各所述候选结果中选取目标候选结果,即是选取整体概率最大的候选结果作为所述待识别语音数据的语音识别结果。
本申请通过对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。在本申请中,在得到待识别语音数据的各候选结果后,还获取所述各候选结果的关联主题信息,即是基于关联主题信息考虑词突发(burstiness)的现象,进而基于所述初始排序结果以及所述关联主题信息对所述各候选结果进行重新排序,提升得到目标排序结果的准确性,以提升得到语音识别结果的准确性。
进一步地,基于本申请中第一实施例,在本申请的另一实施例中,所述模拟标签数据为通过将预设无标签原始语句数据,部分替换为生成的无标签语句数据后,得到的,且所述模拟标签数据至少包括真假模拟标签的数据。
需要说明的是,所述预设预训练模型是已经训练完成的,具体地,所述预设预训练模型是基于模拟标签数据得到的,所述模拟标签数据是基于预设无标签原始语句数据转换得到的。由于对预设训练模型进行训练的数据是基于模拟标签数据得到的,所述模拟标签数据是基于预设无标签原始语句数据转换得到的,当预设训练模型训练充分时,学习到了原始语句数据的隐藏表示,隐藏表示的内部包含了说话人等信息,也即,预设训练模型是经过充分训练后得到的,因而,能够准确进行标注,另外,预设训练模型是基于无监督的数据(无标注)训练得到的,因而,具有强的泛化能力。需要说明的是,为了学习到到原始语句数据的隐藏表示,该原始语句数据是需要经过多个特征编码的如每条原始语句数据的编码为(1,0,1,0)或者是(1,0,1,0,1,0)等。虽然每条原始语句数据的编码的特征为多个,但是为了得到模拟标签数据,每条原始语句数据可以至少包括表示原始语句数据是真实的而非合成的首位编码特征1,另外,该原始语句数据是需要经过多个特征编码的如每条原始语句数据的编码为(1,0,1,0,1,0,1,0)或者是(1,0,1,0,1,0)等,虽然每条原始语句数据的编码的特征为多个,但是为了得到模拟标签数据,每条原始语句数据可以至少包括表示原始语句数据所包含的帧数据的数量为预设数量的首位编码特征1。
需要说明的是,预设对话主题模型可以内嵌该预设预训练模型构成的标注层,以在预设对话主题模型内得到已标注训练数据进行训练,进而进行训练,另外,在本实施例中,预设对话主题模型也可以基于将语音数据发送给外部的预设预训练模型进行标注后,得到已标注训练数据进行优化得到的。
需要说明的是,所述模拟标签数据为通过将预设无标签原始语句数据,部分替换为生成的无标签语句数据后,得到的,且所述模拟标签数据至少包括真假模拟标签的数据。
或者所述模拟标签数据是通过将预设无标签原始语句数据,部分删除帧数据后,得到的,且所述模拟标签数据至少包括真假模拟标签的数据。
具体地,在本实施例中,所述无标签语句数据(语音形式)包括无标签随机语音帧数据或者无标签随机语音片段数据,在生成无标签语句数据后,将所述预设无标签原始语句数据,部分替换为所述无标签语句数据,得到模拟标签数据包括:在生成无标签随机语音帧数据后,选取多条预设无标签原始语句数据,将该多条预设无标签原始语句数据中的每一条数据的至少一帧数据替换为该无标签随机语音帧数据,得到模拟标签数据,或者在生成无标签随机语音片段数据后,选取多条预设无标签原始语句数据,将该多条预设无标签原始语句数据中的每一条数据的至少一个片段数据替换为该无标签随机语音片段数据,得到模拟标签数据,需要说明的是,每条预设无标签原始语句数据可以包括多个语音片段,每个语音片段可以包括多帧数据。
另外,在本实施例中,在生成无标签随机语音帧数据后,选取多条预设无标签原始语句数据,将该多条预设无标签原始语句数据中的每一条数据增加无标签随机语音帧数据,得到模拟标签数据,或者在生成无标签随机语音片段数据后,选取多条预设无标签原始语句数据,将该多条预设无标签原始语句数据中的每一条数据增加至少两帧无标签随机语音帧数据,得到模拟标签数据。
或者,在本实施例中,选取多条预设无标签原始语句数据,将该多条预设无标签原始语句数据中的每一条数据减少无标签随机语音帧数据,得到模拟标签数据,或者选取多条预设无标签原始语句数据,将该多条预设无标签原始语句数据中的每一条数据减少至少两帧无标签随机语音帧数据,得到模拟标签数据。
在本实施例中,每条原始语句数据的编码特征还可以至少包括说话人声音的编码特征。因而,可以得到标注说话人声音特征的语音数据。
在本实施例中,以在预设对话主题模型中内嵌该预设预训练模型构成的标注层为例进行具体说明。
所述将所述候选结果输入至已标注训练数据优化的预设对话主题模型中,对所述候选结果进行主题特征提取处理,得到所述各候选结果的关联主题信息的步骤之前,所述方法还包括:
步骤A1,获取预设无标签的原始语句数据;
在本实施例中,首先获取预设无标签的原始语句数据,为了确保训练效果,该原始语句数据的数量大于预设数量值,需要说明的是,原始语句数据是语音形式的。且该原始语句数据(语音形式)可以是真实的非合成或者生成的语音数据,以便生成模拟标签数据,非合成或者生成的语句数据指的是采集的人发出的语音数据,而非通过机器拟合的语音数据,其中,需要说明的是,每条语句数据包括多个语音文件如图5中的z1,z2,z3,z4等,每个语音文件中包括多帧语音数据如图5中的X1,X2,X3,X4等。
或者,每条语句数据包括多个语音文件如图5中的z1,z2,z3,z4等,该多个语音文件的数目是确定的,每个语音文件中包括多帧语音数据如图5中的X1,X2,X3,X4等。该语音文件的帧的数目是确定的,以便生成模拟标签数据。
其中,本实施例中原始语句数据的具体数据内容不做限制,且该原始数据是无标签的数据。即实现通过无监督方式训练得到预设预训练模型。
步骤A2,生成无标签语句数据,将所述预设无标签原始语句数据,部分替换为所述无标签语句数据,得到模拟标签数据;
在得到原始语句数据后,生成无标签语句数据,具体地,通过预设的生成器(图5中的generator)生成无标签语句数据,即是通过机器拟合得到无标签语句数据,该无标签语句数据是生成的或者合成的模拟假标签数据,需要说明的是,无标签语句数据可以是每帧的数据长度,也可以是每个片段的数据长度,具体不做限定,为了确保可以随时进行替换,生成器是生成了各个数据长度的无标签语句数据,将所述预设无标签原始语句数据,部分替换为所述无标签语句数据,得到模拟标签数据。
在本实施例中,还提供另一种得到模拟标签数据的方式,在该另一种得到模拟标签数据的方式中,随机删除预设无标签的原始语句数据中的帧数据,或者是随机添加预设无标签原始语句数据中的帧数据,得到模拟标签数据。
步骤A3,基于所述模拟标签数据,对预设训练模型进行训练,得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型;
所述基于所述模拟标签数据,对预设训练模型进行训练,得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型的步骤,包括:
步骤B1,确定所述模拟标签数据中的模拟假标签数据以及模拟真标签数据;
在本实施例中,被替换的原始语句数据是已知的模拟假标签数据,其他未被替换的原始语句数据是已知的模拟真标签数据,即是该已知的模拟假标签数据以及已知的模拟真标签数据构成模拟标签数据。
或者,在本实施例中,被删除帧数据后或者添加帧数据后的原始语句数据是已知的模拟假标签数据,其他未被处理的原始语句数据是已知的模拟真标签数据,即是该已知的模拟假标签数据以及已知的模拟真标签数据构成模拟标签数据。
步骤B2,将所述模拟假标签数据以及模拟真标签数据输入至预设训练模型中,得到识别结果;
步骤B3,基于所述识别结果以及所述模拟标签数据中的真假模拟标签调整所述预设训练模型的模型参数,直至得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型。
在得到已知的模拟假标签数据以及已知的模拟真标签数据后,将所述已知的模拟假标签数据以及已知的模拟真标签数据输入至预设训练模型中,以对预设训练模型进行训练,具体地,获取预设训练模型对模拟假标签数据以及模拟真标签数据进行预测后的识别结果,如图5所示,该识别结果中,预测原始语句数据中哪些是original(原始的,非替换的或者是非处理的),哪些是replaced(非原始的,替换的或者是删除等处理后的),而由于模拟假标签数据以及模拟真标签数据都是已知的,也即,原始语句数据中哪些是original(原始的或者是非替换的),哪些是replaced(非原始的或者是非替换的)是已知的,因而,将识别结果与已知结果进行比对,确定两者之间的误差,在确定误差后,基于该误差有针对性地调整预设训练模型的模型参数,直至得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型。
需要说明的是,由于预设无标签原始语句数据是经过多个特征编码的,因而,在对预设无标签原始语句数据进行真假识别的过程中,能够学习到预设无标签原始语句数据的其他隐含表示。
步骤A4,将预设训练语句数据输入至所述预设预训练模型中,得到已标注训练数据;
步骤A5,基于所述已标注训练数据,训练得到预设对话主题模型。
在得到预设预训练模型后,基于预设预训练模型得到已标注训练数据;基于所述已标注训练数据,得到预设对话主题模型,具体地,基于已标注训练数据对基础模型进行训练,得到满足一定预设条件的模型,将所述满足一定预设条件的模型设置为所述预设对话主题模型。需要说明的是,该一定的预设条件可以是:基础模型的预设损失函数收敛或者是基础模型的训练此时达到预设设定的次数。
在本实施例中,获取预设无标签的原始语句数据;生成无标签语句数据,将所述预设无标签原始语句数据,部分替换为所述无标签语句数据,得到模拟标签数据;基于所述模拟标签数据,对预设训练模型进行训练,得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型;将预设训练语句数据输入至所述预设预训练模型中,得到已标注训练数据;基于所述已标注训练数据,训练得到预设对话主题模型。进而实现快速的完成预设对话主题模型的训练。
本申请实施例提供一种语音数据识别方法,在本申请语音数据识别方法的另一实施例中,所述基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果的步骤,包括:
步骤C1,提取所述候选结果对应的特征数据,将所述特征数据以及所述关联主题信息,输入至预设排序模型中,对所述各候选结果进行重新排序,得到目标排序结果;
其中,所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的特征数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签。
在本实施例中,存在预设排序模型,所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的特征数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签,其中,该特征数据包括候选结果的向量表示数据,或者候选结果的评分数据(通过将候选结果输入至预设语音模型和/预设语言模型中得到),具体地,所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的向量表示数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签,或者所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的评分数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签,其中,评分数据通过将候选结果输入至预设语音模型和/预设语言模型中得到。
由于准确训练得到排序模型,因而,可以准确对输入的特征数据以及对应关联主题信息进行识别处理,因而,在得到候选结果,提取所述候选结果对应的特征数据,并将所述特征数据以及所述关联主题信息,输入至预设排序模型中后,可以对所述各候选结果进行重新排序,得到目标排序结果。
在本实施例中,提取所述候选结果对应的特征数据,将所述特征数据以及所述关联主题信息,输入至预设排序模型中,对所述各候选结果进行重新排序,得到目标排序结果;其中,所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的特征数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签。本实施例中由于通过模型预测的方式,考虑词突发(burstiness)的现象,提升得到目标排序结果的准确性,提升了得到语音识别结果的准确性。
参照图3,图3是本申请实施例方案涉及的硬件运行环境的设备结构示意图。
如图3所示,该语音数据识别设备可以包括:处理器1001,例如CPU,存储器1005,通信总线1002。其中,通信总线1002用于实现处理器1001和存储器1005之间的连接通信。存储器1005可以是高速RAM存储器,也可以是稳定的存储器(non-volatile memory),例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储设备。
可选地,该语音数据识别设备还可以包括矩形用户接口、网络接口、摄像头、RF(Radio Frequency,射频)电路,传感器、音频电路、WiFi模块等等。矩形用户接口可以包括显示屏(Display)、输入子模块比如键盘(Keyboard),可选矩形用户接口还可以包括标准的有线接口、无线接口。网络接口可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。
本领域技术人员可以理解,图3中示出的语音数据识别设备结构并不构成对语音数据识别设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
如图3所示,作为一种计算机介质的存储器1005中可以包括操作系统、网络通信模块以及语音数据识别程序。操作系统是管理和控制语音数据识别设备硬件和软件资源的程序,支持语音数据识别程序以及其它软件和/或程序的运行。网络通信模块用于实现存储器1005内部各组件之间的通信,以及与语音数据识别系统中其它硬件和软件之间通信。
在图3所示的语音数据识别设备中,处理器1001用于执行存储器1005中存储的语音数据识别程序,实现上述任一项所述的语音数据识别方法的步骤。
本申请语音数据识别设备具体实施方式与上述语音数据识别方法各实施例基本相同,在此不再赘述。
本申请还提供一种语音数据识别装置,所述语音数据识别装置包括:
识别模块,用于对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
第一获取模块,用于获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
重新排序模块,用于基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
选取模块,用于根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
可选地,所述第一获取模块包括:
第一提取单元,用于将所述候选结果输入至已标注训练数据优化的预设对话主题模型中,对所述候选结果进行主题特征提取处理,得到所述各候选结果的关联主题信息;
其中,所述已标注训练数据是基于模拟标签数据优化的预设预训练模型得到的,所述模拟标签数据是基于预设无标签原始语句数据转换得到的。
可选地,所述模拟标签数据为通过将预设无标签原始语句数据,部分替换为生成的无标签语句数据后,得到的,且所述模拟标签数据至少包括真假模拟标签的数据。
可选地,所述语音数据识别装置还包括:
第二获取模块,用于获取预设无标签的原始语句数据;
生成模块,用于生成无标签语句数据,将所述预设无标签原始语句数据,部分替换为所述无标签语句数据,得到模拟标签数据;
预设预训练模型生成模块,用于基于所述模拟标签数据,对预设训练模型进行训练,得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型;
输入模块,用于将预设训练语句数据输入至所述预设预训练模型中,得到已标注训练数据;
训练模块,基于所述已标注训练数据,训练得到预设对话主题模型。
可选地,所述生成模块包括:
确定单元,用于确定所述模拟标签数据中的模拟假标签数据以及模拟真标签数据;
输入单元,用于将所述模拟假标签数据以及模拟真标签数据输入至预设训练模型中,得到识别结果;
调整单元,用于基于所述识别结果以及所述模拟标签数据中的真假模拟标签调整所述预设训练模型的模型参数,直至得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型。
可选地,所述重新排序模块包括:
第二提取单元,用于提取所述候选结果对应的特征数据,将所述特征数据以及所述关联主题信息,输入至预设排序模型中,对所述各候选结果进行重新排序,得到目标排序结果;
其中,所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的特征数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签。
可选地,所述选取模块包括:
获取单元,用于对所述待识别语音数据进行语音特征提取,得到所述待识别语音数据的语音特征数据;
识别单元,用于采用预设语音模型和预设语言模型对所述语音特征数据进行识别,得到所述待识别语音数据的各候选结果。
本申请语音数据识别装置的具体实施方式与上述语音数据识别方法各实施例基本相同,在此不再赘述。
本申请实施例提供了一种介质,且所述介质存储有一个或者一个以上程序,所述一个或者一个以上程序还可被一个或者一个以上的处理器执行以用于实现上述任一项所述的语音数据识别方法的步骤。
本申请介质具体实施方式与上述语音数据识别方法各实施例基本相同,在此不再赘述。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利处理范围内。

Claims (20)

  1. 一种语音数据识别方法,其中,所述语音数据识别方法包括:
    对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
    获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
    基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
    根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
  2. 如权利要求1所述语音数据识别方法,其中,所述获取所述各候选结果的关联主题信息的步骤,包括:
    将所述候选结果输入至非人工标注训练语句数据优化的预设对话主题模型中,对所述候选结果进行主题特征提取处理,得到所述各候选结果的关联主题信息;
    其中,所述非人工标注训练语句数据是基于模拟标签数据优化的预设预训练模型得到的,所述模拟标签数据是基于预设无标签原始语句数据转换得到的。
  3. 如权利要求2所述语音数据识别方法,其中,所述模拟标签数据为通过将预设无标签原始语句数据,部分替换为生成的无标签语句数据后,得到的,且所述模拟标签数据至少包括真假模拟标签的数据。
  4. 如权利要求3所述语音数据识别方法,其中,所述将所述候选结果输入至非人工标注训练语句数据优化的预设对话主题模型中,对所述候选结果进行主题特征提取处理,得到所述各候选结果的关联主题信息的步骤之前,所述方法还包括:
    获取预设无标签的原始语句数据;
    生成无标签语句数据,将所述预设无标签原始语句数据,部分替换为所述无标签语句数据,得到模拟标签数据;
    基于所述模拟标签数据,对预设训练模型进行训练,得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型;
    将预设训练语句数据输入至所述预设预训练模型中,得到非人工标注训练语句数据;
    基于所述非人工标注训练语句数据,训练得到预设对话主题模型。
  5. 如权利要求4所述语音数据识别方法,其中,所述基于所述模拟标签数据,对预设训练模型进行训练,得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型的步骤,包括:
    确定所述模拟标签数据中的模拟假标签数据以及模拟真标签数据;
    将所述模拟假标签数据以及模拟真标签数据输入至预设训练模型中,得到识别结果;
    基于所述识别结果以及所述模拟标签数据中的真假模拟标签调整所述预设训练模型的模型参数,直至得到满足预设条件的目标模型,将所述目标模型设置为所述预设预训练模型。
  6. 如权利要求1-5任一项所述语音数据识别方法,其中,所述基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果的步骤,包括:
    提取所述候选结果对应的特征数据,将所述特征数据以及所述关联主题信息,输入至预设排序模型中,对所述各候选结果进行重新排序,得到目标排序结果;
    其中,所述排序模型是采用候选特征集训练得到的,所述候选特征集中的一条训练数据包括多个候选结果对应的特征数据,所述多个候选结果对应的关联主题信息以及所述多个候选结果的排序标签。
  7. 如权利要求1所述的语音识别方法,其中,所述对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果的步骤,包括:
    对所述待识别语音数据进行语音特征提取,得到所述待识别语音数据的语音特征数据;
    采用预设语音模型和预设语言模型对所述语音特征数据进行识别,得到所述待识别语音数据的各候选结果。
  8. 一种语音数据识别装置,其中,所述语音数据识别装置包括:
    识别模块,用于对待识别语音数据进行语音识别得到所述待识别语音数据的各候选结果;
    第一获取模块,用于获取所述各候选结果的初始排序结果,并获取所述各候选结果的关联主题信息;
    重新排序模块,用于基于所述初始排序结果以及所述关联主题信息,对所述各候选结果进行重新排序,得到目标排序结果;
    选取模块,用于根据所述目标排序结果从各所述候选结果中选取目标候选结果作为所述待识别语音数据的语音识别结果。
  9. 一种语音数据识别设备,其中,所述语音数据识别设备包括:存储器、处理器以及存储在存储器上的用于实现所述语音数据识别方法的程序,
    所述存储器用于存储实现语音数据识别方法的程序;
    所述处理器用于执行实现所述语音数据识别方法的程序,以实现如权利要求1所述语音数据识别方法的步骤。
  10. 一种语音数据识别设备,其中,所述语音数据识别设备包括:存储器、处理器以及存储在存储器上的用于实现所述语音数据识别方法的程序,
    所述存储器用于存储实现语音数据识别方法的程序;
    所述处理器用于执行实现所述语音数据识别方法的程序,以实现如权利要求2所述语音数据识别方法的步骤。
  11. 一种语音数据识别设备,其中,所述语音数据识别设备包括:存储器、处理器以及存储在存储器上的用于实现所述语音数据识别方法的程序,
    所述存储器用于存储实现语音数据识别方法的程序;
    所述处理器用于执行实现所述语音数据识别方法的程序,以实现如权利要求3所述语音数据识别方法的步骤。
  12. 一种语音数据识别设备,其中,所述语音数据识别设备包括:存储器、处理器以及存储在存储器上的用于实现所述语音数据识别方法的程序,
    所述存储器用于存储实现语音数据识别方法的程序;
    所述处理器用于执行实现所述语音数据识别方法的程序,以实现如权利要求4所述语音数据识别方法的步骤。
  13. 一种语音数据识别设备,其中,所述语音数据识别设备包括:存储器、处理器以及存储在存储器上的用于实现所述语音数据识别方法的程序,
    所述存储器用于存储实现语音数据识别方法的程序;
    所述处理器用于执行实现所述语音数据识别方法的程序,以实现如权利要求5所述语音数据识别方法的步骤。
  14. 一种语音数据识别设备,其中,所述语音数据识别设备包括:存储器、处理器以及存储在存储器上的用于实现所述语音数据识别方法的程序,
    所述存储器用于存储实现语音数据识别方法的程序;
    所述处理器用于执行实现所述语音数据识别方法的程序,以实现如权利要求6所述语音数据识别方法的步骤。
  15. 一种介质,其中,所述介质上存储有实现语音数据识别方法的程序,所述实现语音数据识别方法的程序被处理器执行以实现如权利要求1所述语音数据识别方法的步骤。
  16. 一种介质,其中,所述介质上存储有实现语音数据识别方法的程序,所述实现语音数据识别方法的程序被处理器执行以实现如权利要求2所述语音数据识别方法的步骤。
  17. 一种介质,其中,所述介质上存储有实现语音数据识别方法的程序,所述实现语音数据识别方法的程序被处理器执行以实现如权利要求3所述语音数据识别方法的步骤。
  18. 一种介质,其中,所述介质上存储有实现语音数据识别方法的程序,所述实现语音数据识别方法的程序被处理器执行以实现如权利要求4所述语音数据识别方法的步骤。
  19. 一种介质,其中,所述介质上存储有实现语音数据识别方法的程序,所述实现语音数据识别方法的程序被处理器执行以实现如权利要求5所述语音数据识别方法的步骤。
  20. 一种介质,其中,所述介质上存储有实现语音数据识别方法的程序,所述实现语音数据识别方法的程序被处理器执行以实现如权利要求6所述语音数据识别方法的步骤。
PCT/CN2021/093033 2020-05-15 2021-05-11 语音数据识别方法、设备及介质 WO2021228084A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010417957.1A CN111613219B (zh) 2020-05-15 2020-05-15 语音数据识别方法、设备及介质
CN202010417957.1 2020-05-15

Publications (1)

Publication Number Publication Date
WO2021228084A1 true WO2021228084A1 (zh) 2021-11-18

Family

ID=72203423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093033 WO2021228084A1 (zh) 2020-05-15 2021-05-11 语音数据识别方法、设备及介质

Country Status (2)

Country Link
CN (1) CN111613219B (zh)
WO (1) WO2021228084A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613219B (zh) * 2020-05-15 2023-10-27 深圳前海微众银行股份有限公司 语音数据识别方法、设备及介质
CN113314099B (zh) * 2021-07-28 2021-11-30 北京世纪好未来教育科技有限公司 语音识别置信度的确定方法和确定装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244024A (zh) * 2015-09-02 2016-01-13 百度在线网络技术(北京)有限公司 一种语音识别方法及装置
US20160104478A1 (en) * 2014-10-14 2016-04-14 Sogang University Research Foundation Voice recognition method using machine learning
CN106328147A (zh) * 2016-08-31 2017-01-11 中国科学技术大学 语音识别方法和装置
CN106683677A (zh) * 2015-11-06 2017-05-17 阿里巴巴集团控股有限公司 语音识别方法及装置
CN108062954A (zh) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 语音识别方法和装置
CN110083837A (zh) * 2019-04-26 2019-08-02 科大讯飞股份有限公司 一种关键词生成方法及装置
CN111613219A (zh) * 2020-05-15 2020-09-01 深圳前海微众银行股份有限公司 语音数据识别方法、设备及介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1097276A (ja) * 1996-09-20 1998-04-14 Canon Inc 音声認識方法及び装置並びに記憶媒体
KR101727306B1 (ko) * 2014-06-24 2017-05-02 한국전자통신연구원 언어모델 군집화 기반 음성인식 장치 및 방법
CN104516986B (zh) * 2015-01-16 2018-01-16 青岛理工大学 一种语句识别方法及装置
JP6738436B2 (ja) * 2016-12-20 2020-08-12 日本電信電話株式会社 音声認識結果リランキング装置、音声認識結果リランキング方法、プログラム
WO2019216996A1 (en) * 2018-05-07 2019-11-14 Apple Inc. Raise to speak

Also Published As

Publication number Publication date
CN111613219B (zh) 2023-10-27
CN111613219A (zh) 2020-09-01

Similar Documents

Publication Publication Date Title
US10971135B2 (en) System and method for crowd-sourced data labeling
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
US11355113B2 (en) Method, apparatus, device and computer readable storage medium for recognizing and decoding voice based on streaming attention model
WO2020186712A1 (zh) 一种语音识别方法、装置及终端
WO2020228175A1 (zh) 多音字预测方法、装置、设备及计算机可读存储介质
US10224030B1 (en) Dynamic gazetteers for personalized entity recognition
WO2021136029A1 (zh) 重打分模型训练方法及装置、语音识别方法及装置
WO2021179701A1 (zh) 多语种语音识别方法、装置及电子设备
JP7266683B2 (ja) 音声対話に基づく情報検証方法、装置、デバイス、コンピュータ記憶媒体、およびコンピュータプログラム
WO2021228084A1 (zh) 语音数据识别方法、设备及介质
CN110264997A (zh) 语音断句的方法、装置和存储介质
CN114464182B (zh) 一种音频场景分类辅助的语音识别快速自适应方法
JP2024513778A (ja) 自己適応型蒸留
US11893813B2 (en) Electronic device and control method therefor
TW201225064A (en) Method and system for text to speech conversion
CN110853669B (zh) 音频识别方法、装置及设备
US10706086B1 (en) Collaborative-filtering based user simulation for dialog systems
CN115132170A (zh) 语种分类方法、装置及计算机可读存储介质
WO2021159756A1 (zh) 基于多模态的响应义务检测方法、系统及装置
CN114121010A (zh) 模型训练、语音生成、语音交互方法、设备以及存储介质
CN112185357A (zh) 一种同时识别人声和非人声的装置及方法
JP2020173441A (ja) 音声認識方法及び装置
US20230085161A1 (en) Automatic translation between sign language and spoken language
CN113555006B (zh) 一种语音信息识别方法、装置、电子设备及存储介质
US20230077874A1 (en) Methods and systems for determining missing slots associated with a voice command for an advanced voice interaction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21803293

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21803293

Country of ref document: EP

Kind code of ref document: A1