CN114203166B - Method, device and equipment for generating training data based on man-machine conversation - Google Patents

Method, device and equipment for generating training data based on man-machine conversation Download PDF

Info

Publication number
CN114203166B
CN114203166B (application CN202111504406.XA)
Authority
CN
China
Prior art keywords
text
segmented
voice
recognition model
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111504406.XA
Other languages
Chinese (zh)
Other versions
CN114203166A (en)
Inventor
王刚
曾文佳
陈新月
宋成业
冯梦盈
梁鹏斌
李航
韩亚昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lingxi Beijing Technology Co Ltd
Original Assignee
Lingxi Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lingxi Beijing Technology Co Ltd filed Critical Lingxi Beijing Technology Co Ltd
Priority to CN202111504406.XA priority Critical patent/CN114203166B/en
Publication of CN114203166A publication Critical patent/CN114203166A/en
Application granted granted Critical
Publication of CN114203166B publication Critical patent/CN114203166B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L2015/0638 - Interactive procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method, a device and equipment for generating training data based on man-machine conversation. The method for generating training data based on man-machine conversation comprises the following steps: acquiring an error text, a labeled text of the error text and a complete voice corresponding to the error text, wherein the error text is the recognition result of a voice that was recognized by a first voice recognition model and judged to have been recognized incorrectly, and the first voice recognition model is a voice recognition model that has not been trained yet; performing voice cutting on the complete voice to obtain a plurality of segmented voices; performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model; calculating the similarity between the error text or the labeled text and each segmented text to obtain the target segmented text corresponding to the maximum similarity; and combining the labeled text and the segmented voice corresponding to the target segmented text to obtain training data.

Description

Method, device and equipment for generating training data based on man-machine conversation
Technical Field
The application relates to the technical field of voice, in particular to a method, a device and equipment for generating training data based on man-machine conversation.
Background
The speech recognition model is a model that can recognize input speech, for example, a piece of audio is input into the speech recognition model, and the speech recognition model can output text corresponding to the piece of audio.
In existing training of a speech recognition model, a training audio and the correct text corresponding to that training audio are first obtained, and the training audio together with its correct text is then input into the speech recognition model to train it.
After the speech recognition model is trained, the trained model can be used to recognize speech in order to test its recognition effect. During testing, the model may be found to recognize certain speech incorrectly; the speech that is recognized incorrectly is treated as error audio, and the error audio is then used to continue training the speech recognition model so as to improve its recognition rate.
Then, how is the error audio determined during a multi-turn dialogue? In the existing approach, a person listens to a complete audio, determines the text corresponding to that complete audio, and then manually judges whether that text is the same as the text output by the speech recognition model. If the two are the same, the recognition is correct; if they are different, the recognition is wrong. The person then manually determines the start time and the end time, within the complete speech, of the audio whose text was recognized incorrectly, extracts that audio from the complete speech according to the start time and the end time, and finally uses it as the error audio. Because the start time and the end time of the error audio must be confirmed manually, the generation of training data suffers from the technical problem of low efficiency.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus and a device for generating training data based on human-computer interaction.
In a first aspect, a method for generating training data based on human-computer interaction is provided, including:
acquiring an error text, a label text of the error text and a complete voice corresponding to the error text, wherein the error text is a recognition result of the voice which is recognized by a first voice recognition model and is judged to be recognized wrongly, and the first voice recognition model is a voice recognition model which is not trained;
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model;
calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity;
and combining the labeled text and the segmented voice corresponding to the target segmented text to obtain training data.
According to the above method for generating training data based on man-machine conversation, the similarity between the error text (or the labeled text) and each segmented text is calculated to obtain the target segmented text corresponding to the maximum similarity, and the segmented voice corresponding to the target segmented text has already been obtained by performing voice cutting on the complete voice. Therefore, once the target segmented text is obtained, the segmented voice corresponding to it can be obtained directly, and the segmented voice together with the labeled text is used as a group of training data to train the voice recognition model. The start time and the end time of the error audio (the segmented voice corresponding to the target segmented text) no longer need to be determined manually, so the generation efficiency of the training data is improved to a certain extent.
In an embodiment, after the combining the segmented speech corresponding to the labeled text and the target segmented text to obtain training data, the method further includes:
the first speech recognition model is trained using the training data.
In one embodiment, the performing speech segmentation on the complete speech to obtain a plurality of segmented speech includes:
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
and obtaining the voice of the user from the plurality of segmented voices to obtain a plurality of segmented voices.
In one embodiment, the performing speech segmentation on the complete speech to obtain a plurality of segmented speech includes:
acquiring signal data corresponding to the complete voice;
performing framing processing on the signal data to obtain a first number of signal frames;
determining a category of each signal frame, wherein the category is a voiced category or an unvoiced category;
a second number of segmented voices is obtained based on the categories of the first number of signal frames.
In one embodiment, the calculating the similarity between the error text or the annotation text and each of the segmented texts comprises:
acquiring text feature vectors of the error text or the labeled text and text feature vectors of each segmented text;
and calculating the feature distance between the text feature vector of the error text or the labeled text and the text feature vector of the segmented text to obtain the similarity between the error text or the labeled text and each segmented text.
In one embodiment, the obtaining the text feature vector of the error text or the labeled text, and the text feature vector of each segmented text includes:
determining a one-hot code for each word in the error text or the annotation text or the segmented text;
multiplying the one-hot code of each word with a preset sharing matrix respectively to obtain a primary word vector of the word;
multiplying the preliminary word vector of the word by a preset weight matrix to obtain a target word vector of the word;
averaging the target word vectors of each word in the error text, the labeled text or the segmented text to obtain the text characteristic vector of the error text or the labeled text and the text characteristic vector of each segmented text.
In a second aspect, an apparatus for generating training data based on human-computer interaction is provided, including:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring an error text, a labeled text of the error text and a complete voice corresponding to the error text, the error text is a recognition result of the voice which is recognized by a first voice recognition model and is judged to be recognized wrongly, and the first voice recognition model is a voice recognition model which is not trained;
the cutting module is used for carrying out voice cutting on the complete voice to obtain a plurality of segmented voices;
the recognition module is used for performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model;
the calculation module is used for calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity;
and the obtaining module is used for combining the segmented speech corresponding to the label text and the target segmented text to obtain training data.
In one embodiment, the apparatus for generating training data based on human-computer interaction further comprises:
a training module to train the first speech recognition model using the training data.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method for generating training data based on human-computer conversation as described above when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, in which computer program instructions are stored, which, when read and executed by a processor, perform the steps of the method for generating training data based on human-computer interaction as described above.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a method for generating training data based on human-computer interaction in an embodiment of the present application;
FIG. 2 is a schematic illustration of one-hot encoding in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a device for generating training data based on human-computer interaction in an embodiment of the present application;
fig. 4 is a block diagram of an internal structure of a computer device in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In one embodiment, a method for generating training data based on human-computer interaction is provided. The execution subject of the method for generating training data based on human-computer conversation according to the embodiment of the present invention is a computer device capable of implementing the method for generating training data based on human-computer conversation according to the embodiment of the present invention, and the computer device may include, but is not limited to, a terminal and a server. The terminal comprises a desktop terminal and a mobile terminal, wherein the desktop terminal comprises but is not limited to a desktop computer and a vehicle-mounted computer; mobile terminals include, but are not limited to, cell phones, tablets, laptops, and smartwatches. The server includes a high performance computer and a cluster of high performance computers.
In one embodiment, as shown in fig. 1, there is provided a method for generating training data based on human-computer interaction, including:
step 100, obtaining an error text, a labeled text of the error text and a complete voice corresponding to the error text, wherein the error text is a recognition result of the voice recognized by a first voice recognition model and judged as a recognition error, and the first voice recognition model is a voice recognition model which is not trained yet.
The error text is a text that was found, by manual inspection, to have been recognized incorrectly; the labeled text of the error text is the correct text, labeled manually, that corresponds to the error text. The complete speech corresponding to the error text is a relatively long speech that may contain multiple sentences; when the complete speech is recognized by the first speech recognition model, a relatively long text is obtained, and the error text is only one part of that long text. The first speech recognition model is a speech recognition model whose training is not yet finished: although it has been trained, the recognition rate obtained after that training is not very high, so more training data needs to be acquired to retrain the first speech recognition model.
And 200, performing voice cutting on the complete voice to obtain a plurality of segmented voices.
A segmented voice is one section of the complete voice. For example, if the complete voice contains 5 sentences in total, performing voice cutting on the complete voice yields 5 segmented voices. In one example, the voice cutting is performed on the complete voice according to a preset duration, i.e. a duration set in advance, for example 0.5 second. It can be understood that when a user speaks two sentences, a certain time interval inevitably occurs between them, and the complete voice is cut wherever such an interval is detected; a sketch of this step is given below.
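Purely as an illustration of this silence-based cutting, the sketch below uses the pydub library; the library choice, the file name and the concrete thresholds are assumptions made for this sketch rather than requirements of the method.

```python
# Illustrative sketch: cut a complete voice into segmented voices at silent gaps.
# pydub, the file name and the thresholds below are assumptions.
from pydub import AudioSegment
from pydub.silence import split_on_silence

complete_speech = AudioSegment.from_file("complete_speech.wav")  # hypothetical file

segmented_voices = split_on_silence(
    complete_speech,
    min_silence_len=500,                        # preset duration of 0.5 second, in milliseconds
    silence_thresh=complete_speech.dBFS - 16,   # assumed relative silence threshold
    keep_silence=100,                           # keep a short margin around each segment
)

for i, segment in enumerate(segmented_voices):
    segment.export(f"segment_{i}.wav", format="wav")
```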
Step 300, performing speech recognition on each segmented speech by using a second speech recognition model to obtain a segmented text of each segmented speech, where the second speech recognition model is a trained speech recognition model.
The second speech recognition model is a trained speech recognition model, meaning one that already achieves a relatively high recognition rate. In practical applications, the right to use the second speech recognition model can be obtained from another enterprise, for example by purchase or lease, in order to train the enterprise's own first speech recognition model; of course, the first and second speech recognition models may both belong to the same enterprise. A segmented text is the text obtained after a segmented voice is recognized by the second speech recognition model.
Step 400, calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity.
And the target segmented text is the text which is most similar to the error text or the labeled text.
The calculation is explained here taking the error text and the segmented texts as an example. The error text is combined with each segmented text to obtain a plurality of combined texts; for each combined text, the similarity between the error text and the segmented text in that combined text is calculated, giving the similarity corresponding to the combined text; the maximum similarity is then taken from the plurality of similarities corresponding to the plurality of combined texts, and the segmented text in the combined text corresponding to the maximum similarity is used as the target segmented text, as in the sketch below.
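A minimal sketch of this selection step follows, assuming a similarity function is available (one concrete choice is described in steps 401 and 402 below); all names are illustrative.

```python
# Illustrative sketch: pick the segmented text most similar to the error text
# (or the labeled text) and pair its segmented voice with the labeled text.
def select_training_pair(error_text, labeled_text, segments, similarity):
    """segments: list of (segmented_voice, segmented_text) pairs.
    similarity: a callable returning a similarity score for two texts."""
    best_voice, best_score = None, float("-inf")
    for segmented_voice, segmented_text in segments:
        score = similarity(error_text, segmented_text)  # or similarity(labeled_text, ...)
        if score > best_score:
            best_score, best_voice = score, segmented_voice
    # Combine the labeled text with the selected segmented voice as one training pair.
    return {"speech": best_voice, "text": labeled_text}
```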
And 500, combining the segmented speech corresponding to the labeled text and the target segmented text to obtain training data.
The segmented voice corresponding to the target segmented text is regarded as the voice corresponding to the error text, i.e., the voice that was recognized by the first speech recognition model and judged to have been recognized incorrectly. The labeled text and the segmented voice corresponding to the target segmented text are combined into a group of training data, and the training data is then used to train the first speech recognition model.
According to the above method for generating training data based on man-machine conversation, the similarity between the error text (or the labeled text) and each segmented text is calculated to obtain the target segmented text corresponding to the maximum similarity, and the segmented voice corresponding to the target segmented text has already been obtained by performing voice cutting on the complete voice. Therefore, once the target segmented text is obtained, the segmented voice corresponding to it can be obtained directly, and the segmented voice together with the labeled text is used as a group of training data to train the voice recognition model. The start time and the end time of the error audio (the segmented voice corresponding to the target segmented text) no longer need to be determined manually, so the generation efficiency of the training data is improved to a certain extent.
In an embodiment, after the combining the segmented speech corresponding to the labeled text and the target segmented text to obtain the training data in step 500, the method further includes:
training the first speech recognition model using the training data.
The segmented voice corresponding to the target segmented text in the training data is input into the first speech recognition model to obtain the predicted text output by the first speech recognition model; the model loss is calculated from the predicted text and the labeled text in the training data; and the first speech recognition model is trained according to the model loss.
In the above embodiment, after the training data is obtained, the training data is used to train the first speech recognition model, so as to improve the recognition rate of the first speech recognition model.
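As an illustration only, the sketch below shows one possible form of this training step using PyTorch and a CTC loss; the application does not fix the architecture or the loss of the first speech recognition model, so the model, the vocabulary size and the feature shapes here are all assumptions.

```python
# Illustrative sketch (assumptions throughout): one training step of the first
# speech recognition model on a (segmented voice, labeled text) pair using CTC loss.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)  # placeholder acoustic model
proj = nn.Linear(256, 5000)                                        # 5000 = assumed vocabulary size
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(list(model.parameters()) + list(proj.parameters()), lr=1e-4)

features = torch.randn(1, 200, 80)     # segmented voice as acoustic features (stand-in)
target = torch.tensor([[5, 2, 3, 4]])  # labeled text as token ids (stand-in)

hidden, _ = model(features)
log_probs = proj(hidden).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab) for CTCLoss
loss = ctc_loss(log_probs, target,
                input_lengths=torch.tensor([200]),
                target_lengths=torch.tensor([4]))          # model loss from predicted vs labeled text
loss.backward()
optimizer.step()
optimizer.zero_grad()
```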
In one embodiment, the performing speech segmentation on the complete speech in step 200 to obtain a plurality of segmented speech includes:
step 201, performing voice cutting on the complete voice to obtain a plurality of segmented voices.
And performing voice cutting on the complete voice according to the preset duration to obtain a plurality of segmented voices. For example, if the complete speech contains 5 sentences in total, the complete speech is subjected to speech segmentation to obtain 5 segmented speech.
Step 202, obtaining the voice of the user from the plurality of segmented voices to obtain a plurality of segmented voices.
The complete voice may be generated by a user interacting with a robot, so the complete voice also contains the robot's voice; the user's voice is therefore obtained from the plurality of segmented voices and used as the segmented voices. Continuing the above example, if 2 of the 5 sentences are spoken by the user, those 2 sentences are treated as 2 segmented voices.
Compared with real human speech, robot speech generally has a slower speaking rate, clearer pronunciation and a higher probability of being recognized successfully, and it differs from real human speech. Since the purpose of speech recognition is usually to recognize human speech, the user's speech needs to be acquired to train the speech recognition model; one possible filtering step is sketched after the next paragraph.
In the above embodiment, the user's speech is obtained from the plurality of segmented speeches, so that the training is performed based on the user's speech during the subsequent model training.
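The sketch below shows one possible way to keep only the user's segmented voices; it assumes that the time intervals in which the robot spoke are known (for example from the dialogue log of the man-machine conversation system), which is an assumption made for this sketch rather than something specified above.

```python
# Illustrative sketch: keep only the segments spoken by the user, assuming the
# time intervals during which the robot spoke are known from the dialogue log.
def keep_user_segments(segments, robot_intervals):
    """segments: list of (start_sec, end_sec, segmented_voice);
    robot_intervals: list of (start_sec, end_sec) during which the robot spoke."""
    def overlaps_robot(start, end):
        return any(start < r_end and end > r_start for r_start, r_end in robot_intervals)
    return [voice for start, end, voice in segments if not overlaps_robot(start, end)]
```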
In one embodiment, the performing speech segmentation on the complete speech in step 200 to obtain a plurality of segmented speech includes:
and 200A, acquiring signal data corresponding to the complete voice.
The signal data is derived from the acoustic waveform of the complete voice; the waveform can be represented by a series of discrete data points, and these data points are the signal data of the complete voice, for example [1, 20, 30, 600, ...]. The complete voice may, for example, be a speech file in wav format, and in practical applications the signal data is obtained by calling a method of a function library.
Step 200B, performing framing processing on the signal data to obtain a first number of signal frames.
Assuming that the data length of the signal data is 200000, the time length of the complete speech is 10 seconds, and the first number is 2000, then the data length of each signal frame is 100, and the time length of each signal frame is 5 milliseconds.
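The following sketch reproduces this example (a 10-second recording of 200000 samples framed into 2000 signal frames of 100 samples, i.e. 5 milliseconds each); reading the wav file with scipy is one possible choice of function library and is assumed here, as is the file name.

```python
# Illustrative sketch: obtain signal data from a wav file and cut it into frames.
import numpy as np
from scipy.io import wavfile

sample_rate, signal_data = wavfile.read("complete_speech.wav")  # hypothetical file, assumed mono

frame_length = 100                                # samples per frame (5 ms at the example's rate)
first_number = len(signal_data) // frame_length   # 2000 frames for 200000 samples
signal_frames = np.reshape(
    signal_data[: first_number * frame_length],   # drop any trailing partial frame
    (first_number, frame_length),
)
print(signal_frames.shape)                        # (2000, 100) with the example numbers
```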
And step 200C, determining the category of each signal frame, wherein the category is a voiced category or an unvoiced category.
Each signal frame is input into a pre-trained class recognition model to obtain the category of the signal frame. The voiced category indicates that the frame contains a sound signal and that, most likely, the user is speaking; the unvoiced (silence) category indicates that the frame contains no sound signal and roughly corresponds to a pause between the user's utterances. The class recognition model needs to be trained in advance, which includes: acquiring category training data, where the category training data comprises training signal frames and the manually labeled categories of those frames; inputting a training signal frame into the class recognition model to obtain the predicted category output by the model; obtaining a loss from the predicted category and the manually labeled category of the training signal frame; and training the class recognition model according to the loss to obtain the trained class recognition model. A sketch of such a class recognition model is given below.
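As a sketch only, the class recognition model below is a logistic-regression classifier over a simple log-energy feature of each frame; the feature, the classifier type and the stand-in training data are assumptions, since the application does not fix the form of the class recognition model.

```python
# Illustrative sketch: train a simple class recognition model that labels each
# signal frame as voiced (1) or unvoiced/silent (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

def frame_features(frames):
    # frames: (num_frames, frame_length) integer samples; use log energy as a single feature
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)
    return np.log1p(energy).reshape(-1, 1)

# Stand-ins for the manually labeled category training data.
training_frames = np.random.randint(-3000, 3000, size=(500, 100))
training_labels = np.random.randint(0, 2, size=500)

class_model = LogisticRegression().fit(frame_features(training_frames), training_labels)

# At inference time, predict the category of each frame of the complete voice:
# categories = class_model.predict(frame_features(signal_frames))
```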
And 200D, obtaining a second quantity of segmented voices according to the types of the signal frames of the first quantity.
For example, the categories of the 1 st to 20 th signal frames in the 2000 signal frames are all voiced categories, the 1 st to 20 th signal frames are combined to obtain a1 st combined signal frame, the categories of the 21 st to 25 th signal frames in the 2000 signal frames are all unvoiced categories, the 21 st to 25 th signal frames are combined to obtain a2 nd combined signal frame, and the 1 st segmented speech is obtained according to the 1 st combined signal frame, the 2 nd combined signal frame and the complete speech; for another example, the 26 th to 100 th signal frames in the 2000 signal frames are all voiced frames, the 26 th to 100 th signal frames are combined to obtain a3 rd combined signal frame, the 101 th to 125 th signal frames in the 2000 signal frames are all unvoiced frames, the 101 th to 125 th signal frames are combined to obtain a4 th combined signal frame, and the 2 nd segmented speech is obtained according to the 3 rd combined signal frame, the 4 th combined signal frame and the complete speech. By analogy, a second number of segmented voices may be obtained.
The foregoing embodiment provides a method for obtaining segmented voices: two sections of speech separated by a silent interval are divided into two segmented voices at that interval, as in the sketch below.
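A minimal sketch of step 200D follows: consecutive voiced frames are merged into one run, and each run is mapped back to a slice of the signal data of the complete voice. The function and variable names are illustrative.

```python
# Illustrative sketch: merge consecutive frames of the same category and keep
# each voiced run as one segmented voice (a slice of the complete signal data).
def frames_to_segments(categories, signal_data, frame_length):
    """categories: per-frame labels, 1 for voiced and 0 for unvoiced/silent."""
    segments, run_start = [], None
    for i, category in enumerate(categories):
        if category == 1 and run_start is None:
            run_start = i                                 # a voiced run begins
        elif category == 0 and run_start is not None:
            segments.append(signal_data[run_start * frame_length : i * frame_length])
            run_start = None                              # the voiced run ended at a silent frame
    if run_start is not None:                             # a voiced run reaching the end of the voice
        segments.append(signal_data[run_start * frame_length :])
    return segments                                       # the second number of segmented voices
```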
In one embodiment, the calculating the similarity between the error text or the annotation text and each of the segmented texts in step 400 includes:
step 401, obtaining the text feature vector of the error text or the labeled text, and the text feature vector of each segmented text.
Text feature vectors, which are vector representations of text for computer devices to recognize. For example, the text feature vector is [0.01,0.02,0.05,0.1,0.32,0.08,0.01].
Step 402, calculating a feature distance between the text feature vector of the error text or the labeled text and the text feature vector of the segmented text to obtain a similarity between the error text or the labeled text and each segmented text.
And calculating the Euclidean distance between the text feature vector of the error text or the labeled text and the text feature vector of the segmented text to obtain the feature distance between the text feature vector of the error text or the labeled text and the text feature vector of the segmented text. For example, the text feature vector of the error text is [ A1, A2, A3, A4, A5, A6, A7], the text feature vector of the segmented text is [ B1, B2, B3, B4, B5, B6, B7], and then the feature distance between two texts is:
d = √((A1-B1)² + (A2-B2)² + (A3-B3)² + (A4-B4)² + (A5-B5)² + (A6-B6)² + (A7-B7)²)
and performing function transformation on the calculated characteristic distance to obtain the similarity between the error text and the segmented text.
It can be understood that the larger the distance, the less similar the two texts are considered, and the smaller the distance, the more similar the two texts are considered, so after the characteristic distance is obtained, a functional transformation is also needed to obtain the similarity between the two texts.
The above embodiment illustrates how to calculate the similarity between the error text or the labeled text and the segmented text.
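The sketch below illustrates steps 401 and 402 with a concrete function transformation; since the application does not fix the transformation, the mapping similarity = 1 / (1 + distance) is used here only as one example of a decreasing function of the feature distance.

```python
# Illustrative sketch: Euclidean feature distance and one possible function
# transformation into a similarity (larger distance -> smaller similarity).
import numpy as np

def similarity(text_vector_a, text_vector_b):
    distance = np.sqrt(np.sum((np.asarray(text_vector_a) - np.asarray(text_vector_b)) ** 2))
    return 1.0 / (1.0 + distance)   # assumed transform; any decreasing function would do

error_vec = np.array([0.01, 0.02, 0.05, 0.10, 0.32, 0.08, 0.01])
segment_vec = np.array([0.02, 0.01, 0.06, 0.11, 0.30, 0.07, 0.02])
print(similarity(error_vec, segment_vec))
```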
In one embodiment, the obtaining the text feature vector of the error text or the labeled text and the text feature vector of each segmented text in step 401 includes:
step 401A, determining a unique hot code of each word in the error text or the annotation text or the segmented text.
The one-hot code of a word is determined by the position of the word in the vocabulary library. For example, suppose the vocabulary library contains 7 words in total, including: you, in, drive, car, I, happy, and that the error text is "I am driving" (the words "I", "in", "drive" and "car"). The one-hot codes of the words "I", "in", "drive" and "car" in the error text are then respectively:
(each one-hot code is a 7-dimensional vector with a single 1 at the position of the corresponding word in the vocabulary library and 0 elsewhere)
and step 401B, multiplying the one-hot code of each word with a preset sharing matrix respectively to obtain a preliminary word vector of the word.
The preset sharing matrix is a preset matrix and contains the characteristics of each word in the vocabulary library.
For example, the one-hot code of a word is represented by T, the dimension of T is 1 × N, the preset shared matrix is represented by W1, and the dimension of W1 is N × M, where N represents the number of words in the vocabulary base, and M represents the dimension of the feature corresponding to the word, so that a preliminary word vector of the word can be obtained by multiplying the one-hot code of the word by the preset shared matrix, and the dimension of the preliminary word vector is 1 × M.
Step 401C, multiplying the preliminary word vector of the word by a preset weight matrix to obtain a target word vector of the word.
The preset weight matrix is a preset weight matrix, and the preliminary word vector of the word is processed through the preset weight matrix, so that a target word vector of the word can be obtained, for example, the dimension of the preset weight matrix is M × N, and then, the dimension of the target word vector is 1 × N.
Step 401D, averaging the target word vectors of each word in the error text, the labeled text, or the segmented text to obtain the text feature vectors of the error text or the labeled text, and the text feature vectors of each segmented text.
Continuing with the example, the target word vectors of the words "I", "in", "drive" and "car", each of dimension 1 × N, are added to obtain a sum vector of dimension 1 × N; the sum vector is divided by the total number of words in the error text (4 in this example) to obtain the text feature vector of the error text. The text feature vectors of the segmented texts and of the labeled text are obtained in the same way as the text feature vector of the error text, and the details are not repeated here.
The above embodiment illustrates how to obtain the text feature vectors of the error text, the labeled text and the segmented text.
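The sketch below walks through steps 401A to 401D; the 7-word vocabulary order, the feature dimension M and the random stand-ins for the preset shared matrix and preset weight matrix are assumptions made for illustration.

```python
# Illustrative sketch: text feature vector of a text following steps 401A-401D.
# The shared matrix W1 (N x M) and the weight matrix W2 (M x N) are random stand-ins.
import numpy as np

vocabulary = ["you", "in", "drive", "car", "I", "very", "happy"]  # assumed 7-word library
N, M = len(vocabulary), 4
W1 = np.random.rand(N, M)   # preset sharing matrix (stand-in)
W2 = np.random.rand(M, N)   # preset weight matrix (stand-in)

def one_hot(word):
    code = np.zeros(N)
    code[vocabulary.index(word)] = 1.0
    return code                                           # step 401A

def text_feature_vector(words):
    target_word_vectors = []
    for word in words:
        preliminary = one_hot(word) @ W1                  # step 401B, dimension 1 x M
        target_word_vectors.append(preliminary @ W2)      # step 401C, dimension 1 x N
    return np.mean(target_word_vectors, axis=0)           # step 401D: average over the words

print(text_feature_vector(["I", "in", "drive", "car"]))   # error text "I am driving"
```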
In one embodiment, the method for generating training data based on human-computer interaction further comprises:
and acquiring a preset sharing matrix and a preset weight matrix.
Acquire a vocabulary library; assume it contains 7 words in total, including: you, in, drive, car, I, happy. Acquire training corpora; assume the corpora are "you are driving" and "I like driving". Initialize the shared matrix and the weight matrix in the speech recognition model to obtain an initialized shared matrix and an initialized weight matrix, where the speech recognition model is the first speech recognition model or the second speech recognition model. Obtain a central word from a training corpus (assume the corpus "you are driving"); the central word is one of the words contained in that corpus, and is assumed here to be "in". Obtain the one-hot code of the central word as well as the one-hot codes of "you", "drive" and "car"; take the one-hot code of the central word as the input, and take the combined code obtained by combining the one-hot codes of "you", "drive" and "car" as the corresponding label, as shown in fig. 2. Train the model to obtain trained parameters, which include a trained shared matrix and a trained weight matrix; the trained shared matrix and the trained weight matrix are used as the preset shared matrix and the preset weight matrix, respectively.
The above embodiment illustrates how to obtain the preset sharing matrix and the preset weighting matrix.
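To make this training concrete, the sketch below learns the shared matrix W1 and the weight matrix W2 from the central-word input and the combined-context label described above (cf. fig. 2); the multi-label sigmoid loss, the optimizer and the vocabulary order are assumptions, since the application only fixes how the input and the label are constructed.

```python
# Illustrative sketch: learn the shared matrix W1 and the weight matrix W2 by
# feeding the one-hot code of a central word and using the combined one-hot
# codes of its context words as the label. Loss and optimizer are assumptions.
import torch
import torch.nn as nn

vocabulary = ["you", "in", "drive", "car", "I", "very", "happy"]  # assumed library
N, M = len(vocabulary), 4
W1 = nn.Parameter(torch.randn(N, M))   # shared matrix
W2 = nn.Parameter(torch.randn(M, N))   # weight matrix
optimizer = torch.optim.SGD([W1, W2], lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

def one_hot(words):
    code = torch.zeros(N)
    for w in words:
        code[vocabulary.index(w)] = 1.0
    return code

# Training pair from the corpus "you are driving": central word "in",
# label = combination of the one-hot codes of "you", "drive" and "car".
central = one_hot(["in"])
label = one_hot(["you", "drive", "car"])

for _ in range(100):
    logits = (central @ W1) @ W2       # 1 x N scores over the vocabulary
    loss = loss_fn(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# After training, W1 and W2 serve as the preset shared matrix and preset weight matrix.
```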
In one embodiment, there is provided an apparatus 300 for generating training data based on human-machine interaction, comprising:
an obtaining module 301, configured to obtain an error text, a labeled text of the error text, and a complete speech corresponding to the error text, where the error text is a recognition result of a speech recognized by a first speech recognition model and determined as a recognition error, and the first speech recognition model is a speech recognition model that has not been trained yet;
a cutting module 302, configured to perform voice cutting on the complete voice to obtain a plurality of segmented voices;
a recognition module 303, configured to perform speech recognition on each segmented speech by using a second speech recognition model to obtain a segmented text of each segmented speech, where the second speech recognition model is a trained speech recognition model;
a calculating module 304, configured to calculate a similarity between the error text or the tagged text and each of the segmented texts, so as to obtain a target segmented text corresponding to a maximum similarity;
an obtaining module 305, configured to combine the labeled text and the segmented speech corresponding to the target segmented text to obtain training data.
In one embodiment, the apparatus 300 for generating training data based on human-computer interaction further comprises:
a training module to train the first speech recognition model using the training data.
In an embodiment, the cutting module 302 is specifically configured to:
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
and acquiring the voice of the user from the plurality of segmented voices to obtain a plurality of segmented voices.
In an embodiment, the cutting module 302 is specifically configured to:
acquiring signal data corresponding to the complete voice;
performing framing processing on the signal data to obtain a first number of signal frames;
determining a category of each signal frame, wherein the category is a voiced category or an unvoiced category;
a second number of segmented voices is obtained based on the categories of the first number of signal frames.
In an embodiment, the calculating module 304 is specifically configured to:
acquiring text characteristic vectors of the error text or the labeled text and text characteristic vectors of each segmented text;
and calculating the feature distance between the text feature vector of the error text or the labeled text and the text feature vector of the segmented text to obtain the similarity between the error text or the labeled text and each segmented text.
In an embodiment, the calculating module 304 is specifically configured to:
determining a unique encoding for each word in the erroneous text or the annotated text or the segmented text;
multiplying the one-hot code of each word with a preset sharing matrix respectively to obtain a primary word vector of the word;
multiplying the preliminary word vector of the word by a preset weight matrix to obtain a target word vector of the word;
averaging the target word vectors of each word in the error text or the labeled text or the segmented text to obtain the text feature vector of the error text or the labeled text and the text feature vector of each segmented text.
In one embodiment, as shown in fig. 4, a computer device is provided, which may specifically be a terminal or a server. The computer device comprises a processor, a memory and a network interface connected through a system bus. The memory comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the method for generating training data based on man-machine conversation. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM). The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the method for generating training data based on man-machine conversation. Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the present solution and does not limit the computer devices to which the present solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The method for generating training data based on man-machine conversation provided by the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 4. The memory of the computer device may store the program modules that constitute the apparatus for generating training data based on man-machine conversation, such as the acquisition module 301, the cutting module 302 and the recognition module 303.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring an error text, a label text of the error text and a complete voice corresponding to the error text, wherein the error text is a recognition result of the voice which is recognized by a first voice recognition model and is judged to be recognized wrongly, and the first voice recognition model is a voice recognition model which is not trained;
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model;
calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity;
and combining the labeled text and the segmented voice corresponding to the target segmented text to obtain training data.
In one embodiment, a computer readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of:
acquiring an error text, a labeled text of the error text and a complete voice corresponding to the error text, wherein the error text is a recognition result of the voice which is recognized by a first voice recognition model and is judged to be recognized wrongly, and the first voice recognition model is a voice recognition model which is not trained;
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model;
calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity;
and combining the labeled text and the segmented voice corresponding to the target segmented text to obtain training data.
It should be noted that the above-mentioned method for generating training data based on human-computer interaction, device for generating training data based on human-computer interaction, computer device, and computer-readable storage medium belong to one general inventive concept, and the contents in the embodiments of the method for generating training data based on human-computer interaction, device for generating training data based on human-computer interaction, computer device, and computer-readable storage medium are mutually applicable.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only one kind of logical functional division, and other division manners are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical or in another form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for generating training data based on human-computer conversation is characterized by comprising the following steps:
acquiring an error text, a label text of the error text and a complete voice corresponding to the error text, wherein the error text is a recognition result of the voice which is recognized by a first voice recognition model and is judged to be recognized wrongly, and the first voice recognition model is a voice recognition model which is not trained;
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model;
calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity;
and combining the segmented speech corresponding to the label text and the target segmented text to obtain training data.
2. The method according to claim 1, wherein after the combining the segmented speech corresponding to the labeled text and the target segmented text to obtain training data, the method further comprises:
the first speech recognition model is trained using the training data.
3. The method of generating as claimed in claim 1, wherein said performing speech segmentation on said complete speech to obtain a plurality of segmented speech comprises:
performing voice cutting on the complete voice to obtain a plurality of segmented voices;
and obtaining the voice of the user from the plurality of segmented voices to obtain a plurality of segmented voices.
4. The method of generating as claimed in claim 1, wherein said performing speech segmentation on said complete speech to obtain a plurality of segmented speech comprises:
acquiring signal data corresponding to the complete voice;
performing framing processing on the signal data to obtain a first number of signal frames;
determining a category of each of the signal frames, wherein the category is a voiced category or an unvoiced category;
a second number of segmented voices is obtained based on the categories of the first number of signal frames.
5. The method of claim 1, wherein the calculating the similarity between the error text or the labeled text and each of the segmented texts comprises:
acquiring text feature vectors of the error text or the labeled text and text feature vectors of each segmented text;
and calculating the feature distance between the text feature vector of the error text or the labeled text and the text feature vector of the segmented text to obtain the similarity between the error text or the labeled text and each segmented text.
6. The method according to claim 5, wherein the obtaining the text feature vector of the error text or the labeled text, and the text feature vector of each segmented text comprises:
determining a one-hot code for each word in the error text or the annotation text or the segmented text;
multiplying the one-hot code of each word with a preset sharing matrix respectively to obtain a primary word vector of the word;
multiplying the preliminary word vector of the word by a preset weight matrix to obtain a target word vector of the word;
averaging the target word vectors of each word in the error text, the labeled text or the segmented text to obtain the text characteristic vector of the error text or the labeled text and the text characteristic vector of each segmented text.
7. An apparatus for generating training data based on human-computer interaction, comprising:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring an error text, a labeled text of the error text and a complete voice corresponding to the error text, the error text is a recognition result of the voice which is recognized by a first voice recognition model and is judged to be recognized wrongly, and the first voice recognition model is a voice recognition model which is not trained;
the cutting module is used for carrying out voice cutting on the complete voice to obtain a plurality of segmented voices;
the recognition module is used for performing voice recognition on each segmented voice by using a second voice recognition model to obtain a segmented text of each segmented voice, wherein the second voice recognition model is a trained voice recognition model;
the calculation module is used for calculating the similarity between the error text or the labeled text and each segmented text to obtain a target segmented text corresponding to the maximum similarity;
and the obtaining module is used for combining the labeled text and the segmented voice corresponding to the target segmented text to obtain training data.
8. The generation apparatus according to claim 7, characterized by further comprising:
a training module to train the first speech recognition model using the training data.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method for generating human-machine dialog based training data according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, having stored thereon computer program instructions, which, when read and executed by a processor, perform the steps of the method for generating human-machine-dialogue-based training data as recited in any one of claims 1 to 6.
CN202111504406.XA 2021-12-10 2021-12-10 Method, device and equipment for generating training data based on man-machine conversation Active CN114203166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111504406.XA CN114203166B (en) 2021-12-10 2021-12-10 Method, device and equipment for generating training data based on man-machine conversation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111504406.XA CN114203166B (en) 2021-12-10 2021-12-10 Method, device and equipment for generating training data based on man-machine conversation

Publications (2)

Publication Number Publication Date
CN114203166A CN114203166A (en) 2022-03-18
CN114203166B (en) 2023-03-31

Family

ID=80651954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111504406.XA Active CN114203166B (en) 2021-12-10 2021-12-10 Method, device and equipment for generating training data based on man-machine conversation

Country Status (1)

Country Link
CN (1) CN114203166B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974221B (en) * 2022-04-29 2024-01-19 中移互联网有限公司 Speech recognition model training method and device and computer readable storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02282800A (en) * 1989-04-25 1990-11-20 Nec Corp Sound encoding system
US6064957A (en) * 1997-08-15 2000-05-16 General Electric Company Improving speech recognition through text-based linguistic post-processing
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN110310626A (en) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing
CN111312219A (en) * 2020-01-16 2020-06-19 上海携程国际旅行社有限公司 Telephone recording marking method, system, storage medium and electronic equipment
CN111312224A (en) * 2020-02-20 2020-06-19 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN111402865A (en) * 2020-03-20 2020-07-10 北京达佳互联信息技术有限公司 Method for generating speech recognition training data and method for training speech recognition model
CN111754985A (en) * 2020-07-06 2020-10-09 上海依图信息技术有限公司 Method and device for training voice recognition model and voice recognition
CN112216284A (en) * 2020-10-09 2021-01-12 携程计算机技术(上海)有限公司 Training data updating method and system, voice recognition method and system, and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao Xiaoqun; Zhang Yang. A survey of acoustic model construction for spoken keyword recognition systems. Journal of Yanshan University, 2017, (06), full text. *
Long Yanhua; Mao Hongwei; Ye Hong. A semi-supervised automatic speech segmentation algorithm for speech recognition in TV dramas. Journal of Data Acquisition and Processing, 2019, (02), full text. *

Also Published As

Publication number Publication date
CN114203166A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN110287283B (en) Intention model training method, intention recognition method, device, equipment and medium
CN111862977B (en) Voice conversation processing method and system
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
US8126717B1 (en) System and method for predicting prosodic parameters
EP1696421B1 (en) Learning in automatic speech recognition
Yoon et al. Attentive modality hopping mechanism for speech emotion recognition
US9495955B1 (en) Acoustic model training
CN111402862B (en) Speech recognition method, device, storage medium and equipment
EP1647970A1 (en) Hidden conditional random field models for phonetic classification and speech recognition
US9484019B2 (en) System and method for discriminative pronunciation modeling for voice search
CN111613212A (en) Speech recognition method, system, electronic device and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN110570876A (en) Singing voice synthesis method and device, computer equipment and storage medium
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
CN110503956A (en) Audio recognition method, device, medium and electronic equipment
CN114203166B (en) Method, device and equipment for generating training data based on man-machine conversation
CN111370001B (en) Pronunciation correction method, intelligent terminal and storage medium
US6662158B1 (en) Temporal pattern recognition method and apparatus utilizing segment and frame-based models
AU2020103587A4 (en) A system and a method for cross-linguistic automatic speech recognition
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
Vetter et al. Unsupervised Phoneme Segmentation of Previously Unseen Languages.
US20230107493A1 (en) Predicting Word Boundaries for On-Device Batching of End-To-End Speech Recognition Models
JP6220733B2 (en) Voice classification device, voice classification method, and program
CN113053409B (en) Audio evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant