CN117219062A - Training data generation method and device, electronic equipment and storage medium - Google Patents


Publication number
CN117219062A
CN117219062A
Authority
CN
China
Prior art keywords
sequence
text
information
training data
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311426285.0A
Other languages
Chinese (zh)
Inventor
郭攀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202311426285.0A priority Critical patent/CN117219062A/en
Publication of CN117219062A publication Critical patent/CN117219062A/en
Pending legal-status Critical Current


Abstract

The application discloses a training data generation method and device, electronic equipment and a storage medium, and belongs to the technical field of model training. The generation method comprises the following steps: acquiring first training data, wherein the first training data comprises a first text sequence, a second text sequence and a third text sequence, the first text sequence is obtained by performing voice recognition on first voice data, the second text sequence is the standard text sequence corresponding to the first voice data, and the third text sequence is obtained by translating the second text sequence; generating a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence; generating a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence; and generating second training data based on the fourth text sequence and the third text sequence.

Description

Training data generation method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of model training, and particularly relates to a training data generation method and device, electronic equipment and a storage medium.
Background
In the related art, speech translation technology can translate the content a user speaks into text of a different language for display, which improves cross-language communication efficiency and reduces cross-language communication cost.
Speech translation typically adopts a cascaded scheme of automatic speech recognition (Automatic Speech Recognition, ASR) followed by a machine translation model. The training set used to train the machine translation model comprises text obtained by automatic speech recognition together with the correct translation, so that the machine translation model acquires text translation capability.
The machine translation model obtained by this training method suffers from error propagation in the cascade scheme: when an error occurs in the upstream automatic voice recognition result, the error is propagated to the downstream machine translation model. Common training data generally includes the speech recognition result, the corresponding standard text of the speech, and the correct translation result. In an actual scenario, various recognition errors may occur in the speech recognition result; for example, the user asks "Is it raining today?", but automatic speech recognition outputs "It is raining today" with the question particle at the end of the sentence missing, so that the original question is erroneously recognized as a statement and the final translation result deviates seriously from the user's original meaning. The training data set does not include such errors, so the diversity of the training data set is insufficient.
Disclosure of Invention
The embodiment of the application aims to provide a training data generation method, a training data generation device, electronic equipment and a storage medium, which can improve the diversity of training data.
In a first aspect, an embodiment of the present application provides a method for generating training data, where the method includes:
acquiring first training data, wherein the first training data comprises a first text sequence, a second text sequence and a third text sequence, the first text sequence is obtained by performing voice recognition on first voice data, the second text sequence is a standard text sequence corresponding to the first voice data, and the third text sequence is obtained by translating the second text sequence;
generating a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence, wherein the first phonetic information sequence comprises phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence comprises phonetic information representing the pronunciation of each character in the second text sequence;
generating a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence;
and generating second training data based on the fourth text sequence and the third text sequence.
In a second aspect, an embodiment of the present application provides a generating device for training data, where the generating device includes:
an acquisition module, configured to acquire first training data, wherein the first training data comprises a first text sequence, a second text sequence and a third text sequence, the first text sequence is obtained by performing voice recognition on first voice data, the second text sequence is the standard text sequence corresponding to the first voice data, and the third text sequence is obtained by translating the second text sequence;
a generation module, configured to:
generate a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence, wherein the first phonetic information sequence comprises phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence comprises phonetic information representing the pronunciation of each character in the second text sequence;
generate a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence;
and generate second training data based on the fourth text sequence and the third text sequence.
In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the method as in the first aspect when executed by the processor.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method as in the first aspect.
In a fifth aspect, embodiments of the present application provide a chip comprising a processor and a communication interface coupled to the processor for running a program or instructions implementing the steps of the method as in the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement a method as in the first aspect.
In the embodiment of the application, on the basis of the original first training data, a fourth text sequence is automatically generated from the pronunciation of the characters in the first text sequence obtained by voice recognition and in the standard second text sequence; by simulating, on the basis of character pronunciation, the text sequences that a voice recognition result may produce, additional training data is generated, so the diversity of the training data can be improved.
Drawings
FIG. 1 illustrates a flow chart of a method of generating training data in accordance with some embodiments of the application;
FIG. 2 shows a logical schematic of generating a fourth text sequence from phonetic information sequences;
FIG. 3 illustrates a block diagram of a training data generation apparatus of some embodiments of the present application;
FIG. 4 shows a block diagram of an electronic device according to an embodiment of the application;
fig. 5 is a schematic hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the objects identified by "first," "second," etc. are generally of a type not limited to the number of objects, for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The method, the device, the electronic equipment and the storage medium for generating training data provided by the embodiment of the application are described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
In some embodiments of the present application, a method for generating training data is provided, fig. 1 shows a flowchart of a method for generating training data according to some embodiments of the present application, and as shown in fig. 1, the method includes:
Step 102, acquiring first training data, wherein the first training data comprises a first text sequence, a second text sequence and a third text sequence;
the first text sequence is obtained by performing voice recognition on the first voice data, the second text sequence is a standard text sequence corresponding to the first voice data, and the third text sequence is obtained by translating the second text sequence.
In the embodiment of the application, the first training data is the initial training data and is used for training the machine translation model. The first training data comprises a first text sequence, a second text sequence and a third text sequence, wherein the first text sequence is obtained by performing voice recognition on the first voice data through ASR technology, and the first voice data may be a recording of any user speaking.
The second text sequence is specifically the standard text sequence corresponding to the first voice data, that is, the text sequence obtained when the first voice data is accurately recognized. By way of example, if the user says "There is an apple here" (in Chinese, a sentence of 7 characters glossed roughly as "this", "inside", "have", "one", a measure word, "apple", "fruit"), the second text sequence is those 7 characters in order. The first text sequence is obtained through ASR recognition, so it may be identical to the second text sequence or may differ from it; for example, owing to homophone confusion, the first text sequence might instead be the 7 characters glossed as "this", "inside", "composed", "coat", the measure word, "apple", "fruit".
The third text sequence is obtained by translating the second text sequence; for example, translating the second text sequence into English yields the word sequence "here", "is", "an", "apple".
Step 104, generating a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence;
The first phonetic information sequence comprises phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence comprises phonetic information representing the pronunciation of each character in the second text sequence.
In the embodiment of the application, the phonetic information is used for representing the pronunciation of a character. Illustratively, taking a Chinese character as an example, its pronunciation can be represented by a phoneme triplet built from the initial, the medial, the final and the tone of its pinyin.
The pinyin of a Chinese character may consist of one, two or three pronunciation components. For a character whose pinyin has no initial, the pinyin can be split into the final and the tone, with the missing initial defaulting to an empty string, and these parts are combined into a phoneme triplet. For example, the pinyin of a character pronounced "an1" is converted into the phoneme triplet ("", "an", "1") to obtain its phonetic information, where "" indicates that the initial is blank, "an" is the final, and "1" indicates the first tone.
For a character whose pinyin has an initial and a final, the pinyin can be split into the initial, the final and the tone to form a phoneme triplet. For example, the pinyin of the character meaning "north" is "bei3", which is converted into the phoneme triplet ("b", "ei", "3") to obtain its phonetic information, where "b" is the initial, "ei" is the final, and "3" indicates the third tone.
For a character whose pinyin has an initial, a medial and a final, the pinyin can be split into the initial, the medial, the final and the tone, with the medial and the final merged, giving a phoneme triplet of the form (initial, medial+final, tone). For example, the pinyin of the character in "hospital" is "yuan4", which is converted into the phoneme triplet ("y", "uan", "4") to obtain its phonetic information, where "y" is the initial, "uan" is the combination of the medial "u" and the final "an", and "4" indicates the fourth tone.
For example, the phonetic information of each character in a text sequence may be determined by the pypinyin tool.
In this way, phonetic information is generated for each character in the first text sequence and the second text sequence according to its pronunciation, and the phonetic information is arranged in the same order as the characters in the original text sequence, yielding the first phonetic information sequence corresponding to the first text sequence and the second phonetic information sequence corresponding to the second text sequence.
Illustratively, if the first text sequence is the 7 characters glossed as "this", "inside", "have", "one", the measure word, "apple", "fruit", then the first phonetic information sequence is ["zhe4", "li3", "you3", "yi1", "ge4", "ping2", "guo3"].
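By way of an illustrative sketch only (the function names and the initials list below are assumptions for illustration, not part of the application), the conversion of a text sequence into phoneme triplets described above could be carried out in Python with the pypinyin tool mentioned earlier:

```python
from pypinyin import pinyin, Style

# Standard pinyin initials, longest first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
_INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def to_triplet(syllable: str) -> tuple[str, str, str]:
    """Split a numbered pinyin syllable such as 'bei3' into (initial, final, tone)."""
    tone = syllable[-1] if syllable[-1].isdigit() else "5"   # treat neutral tone as "5"
    body = syllable[:-1] if syllable[-1].isdigit() else syllable
    for ini in _INITIALS:
        if body.startswith(ini):
            return (ini, body[len(ini):], tone)
    return ("", body, tone)          # zero-initial syllables such as "an1"

def text_to_phonetic_sequence(text: str) -> list[tuple[str, str, str]]:
    """Map each character of `text` to its phonetic-information triplet."""
    syllables = [item[0] for item in pinyin(text, style=Style.TONE3)]
    return [to_triplet(s) for s in syllables]

if __name__ == "__main__":
    print(text_to_phonetic_sequence("这里有一个苹果"))
    # e.g. [('zh', 'e', '4'), ('l', 'i', '3'), ('y', 'ou', '3'), ('y', 'i', '1'),
    #       ('g', 'e', '4'), ('p', 'ing', '2'), ('g', 'uo', '3')]
```

Note that this sketch treats "y" and "w" as initials, which matches the "yuan4" example above; the application may split medials differently.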
Step 106, generating a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence.
Step 108, generating second training data based on the fourth text sequence and the third text sequence.
In the embodiment of the application, in an actual speech recognition scenario, the user's accent, the precision of the pickup device and environmental noise can cause the pronunciation recognized from the audio to deviate from the phonetic information of the text the user actually spoke, which ultimately leads to errors in the ASR recognition result.
The training data in the related art is built from ASR recognition results, and such results do not always contain recognition errors; in most cases the user's voice is recognized accurately, so the training data in the related art can hardly cover all possible voice recognition results, and the robustness of a machine translation model trained on it is poor.
To address this problem, on the basis of the first text sequence obtained by the original ASR recognition and the second text sequence of standard text, the pronunciation of each character in each text sequence is used to generate the phonetic information sequence of the corresponding text sequence.
Based on the phonetic information sequences, and combining the probability that each piece of phonetic information corresponds to a given character, one or more fourth text sequences are generated; each fourth text sequence is obtained by predicting, from the original pronunciation, a recognition result that ASR might produce in error. The resulting fourth text sequences are then each combined with the translation result of the second text sequence, namely the third text sequence, so that multiple pieces of training data containing possible ASR errors are obtained; that is, one piece of original first training data can be expanded into N (N>1) pieces of second training data.
The embodiment of the application simulates various recognition results possibly generated by ASR automatic speech recognition based on the text pronunciation, trains the machine translation model through the simulated second training data generated by the text sequences containing the various recognition results, and improves the diversity of the training data.
The training data obtained by the embodiment of the application is used for training the machine translation model, so that the machine translation model obtained after training can have the capability of translating the automatic speech recognition result containing errors into the text of the correct target language, and the robustness of the speech translation technology is improved.
In some embodiments of the present application, generating the fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence includes:
determining, for each piece of phonetic information in the first phonetic information sequence, the homophone character set corresponding to the pronunciation it represents and the probability that the phonetic information represents each character in the homophone character set, to obtain first probability information;
determining the distribution probability of the voice recognition result corresponding to each character in the first text sequence based on the phonetic information of a first character in the first text sequence and the phonetic information of a second character adjacent to the first character, to obtain second probability information;
and determining the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence.
In the embodiment of the application, because homophones exist, one piece of phonetic information can represent a plurality of different characters at the same time. For example, the phoneme triplet contained in the phonetic information "ming2" is ("m", "ing", "2"), and the characters with this pronunciation include the characters glossed as "bright", "name", "ringing", "inscription", "meditation", "tea" and "snout moth"; the homophone character set corresponding to the phonetic information "ming2" is therefore the set of these characters.
For each piece of phonetic information appearing in the first text sequence, the corresponding homophone character set and the occurrence probability of each character in that set are counted, giving the first probability information.
For example, the occurrence probabilities of the characters in the homophone character set corresponding to the phonetic information "ming2" may be represented as {"bright": 0.34, "name": 0.26, "ringing": 0.15, "inscription": 0.14, "meditation": 0.05, "tea": 0.03, "snout moth": 0.03}, where "bright": 0.34 indicates that the probability of the "bright" character occurring is 34%, "name": 0.26 means that the probability of the "name" character occurring is 26%, and so on.
Illustratively, the first probability information may be represented in the form of Table 1, mapping each piece of phonetic information to the probabilities of its homophone characters.
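As a hedged illustration of how such a table could be estimated (the function name, record layout and corpus are assumptions, not the application's reference code), homophone probabilities can be counted from (character, phoneme triplet) pairs drawn from a reference corpus, such as those produced by the sketch shown earlier:

```python
from collections import Counter, defaultdict

def build_homophone_probabilities(char_triplet_pairs):
    """char_triplet_pairs: iterable of (character, phoneme_triplet) drawn from a corpus."""
    counts = defaultdict(Counter)
    for char, triplet in char_triplet_pairs:
        counts[triplet][char] += 1                     # count how often each character carries this pronunciation
    table = {}
    for triplet, char_counts in counts.items():
        total = sum(char_counts.values())
        table[triplet] = {ch: n / total for ch, n in char_counts.items()}
    return table

# One entry would then mirror the "ming2" illustration above:
# table[('m', 'ing', '2')] maps each homophonous character to its relative
# frequency, e.g. 0.34, 0.26, 0.15, ... depending on the corpus used.
```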
Further, the distribution probability of the ASR recognition result for each character in the first text sequence is determined.
Specifically, the essence of ASR is to map the continuous acoustic signal of the audio into the text sequence with the highest probability. Recognizing a complete text sequence is a complex process affected by many factors; a typical characteristic is that the recognition result of the current audio segment is affected not only by the acoustic signal of the segment itself but also by the recognition result of the previous segment.
In the embodiment of the application, the ASR recognition result is assumed to fall into four cases of correct, substitution, insertion and deletion, and the recognition result of the current character (the first character) is determined according to the phonetic information of the character adjacent to it (the second character).
After each Chinese character in the first text sequence is converted into a phoneme triplet, under a first-order Markov assumption the ASR recognition result of each phoneme triplet is determined only by the preceding phoneme triplet. Taking two adjacent phoneme triplets (namely the phonetic information of the first character and the phonetic information of the second character) as a unit, the method counts the probability of the four different ASR recognition cases between adjacent phoneme triplets in the first phonetic information sequence, and defines the probabilities respectively as:
Adjacent-correct probability: the probability that, given that the previous phoneme triplet is recognized correctly by the ASR, the next phoneme triplet is also recognized correctly.
Adjacent-substitution probability: the probability that, given that the previous phoneme triplet is recognized correctly by the ASR, the next phoneme triplet is replaced by an erroneous phoneme triplet in the recognition result.
Adjacent-insertion probability: the probability that, given that the previous phoneme triplet is recognized correctly by the ASR, an extra, superfluous phoneme triplet is additionally recognized after it.
Adjacent-deletion probability: the probability that, given that the previous phoneme triplet is recognized correctly by the ASR, the next phoneme triplet is missed in the recognition result.
Illustratively, the second probability information may be represented by Table 2:
TABLE 2
Wherein the entry 0.10 for [("t", "ong", "2"), ("x", "ie", "2")] represents that the probability that the triplet ("x", "ue", "2") adjacent to the phonetic information ("t", "ong", "2") is replaced with ("x", "ie", "2") is 10%.
The entry 0.09 for [("t", "ong", "2"), ("x", "un", "4")] means that the probability that the triplet ("x", "ue", "2") adjacent to ("t", "ong", "2") is replaced with ("x", "un", "4") is 9%, and so on.
The embodiment of the application thus obtains the distribution probability of the ASR recognition results in automatic speech recognition, namely the second probability information, and the probability of the homophones corresponding to the phonetic information of each recognition result, namely the first probability information.
By combining the first probability information and the second probability information, the various situations that may occur during ASR automatic speech recognition can be simulated: for example, the ASR recognizes wrong phonetic information (adjacent substitution, adjacent insertion, adjacent deletion, and so on), or the ASR recognizes the correct phonetic information but maps it to the wrong homophone, such as the character for "one" being recognized as the like-sounding character for "garment", or "answer" being recognized as "arrive".
The embodiment of the application combines the various situations that may occur during ASR automatic speech recognition, and expands the single original first text sequence into N (N>1) fourth text sequences covering more ASR recognition results, thereby forming N pieces of second training data, so the diversity of the training data can be improved.
The training data obtained by the embodiment of the application is used for training the machine translation model, so that the machine translation model obtained after training can have the capability of translating the automatic speech recognition result containing errors into the text of the correct target language, and the robustness of the speech translation technology is improved.
In some embodiments of the present application, determining the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence includes:
converting at least one piece of phonetic information in the second phonetic information sequence based on the second probability information, to obtain a third phonetic information sequence;
determining, based on the first probability information, the replacement character corresponding to the converted phonetic information in the third phonetic information sequence;
and replacing the corresponding character in the second text sequence with the replacement character according to the position of the converted phonetic information in the third phonetic information sequence, so as to obtain the fourth text sequence.
In the embodiment of the present application, fig. 2 shows a logic diagram of generating the fourth text sequence through the phonetic information sequences. As shown in fig. 2, assuming that the first text sequence is the 10-character Chinese sentence meaning "This question baffled countless candidates", the first phonetic information sequence is [("zh", "e", "4"), ("d", "ao", "4"), ("t", "i", "2"), ("m", "u", "4"), ("n", "an", "2"), ("d", "ao", "3"), ("w", "u", "2"), ("sh", "u", "4"), ("k", "ao", "3"), ("sh", "eng", "1")].
According to the second probability information, state transitions are performed as a first-order Markov process; that is, each pair of adjacent phoneme triplets in the phonetic information sequence is mapped in turn, generating pseudo phoneme-triplet sequences of different types.
Illustratively, if the adjacent triplet pair [("start position"), ("zh", "e", "4")] hits the adjacent-substitution probability, it is mapped to [("start position"), ("zh", "ao", "4")], and a pseudo phoneme-triplet sequence of the substitution type is generated: [("zh", "ao", "4"), ("d", "ao", "4"), ("t", "i", "2"), ("m", "u", "4"), ("n", "an", "2"), ("d", "ao", "3"), ("w", "u", "2"), ("sh", "u", "4"), ("k", "ao", "3"), ("sh", "eng", "1")].
If the adjacent triplet pair [("sh", "eng", "1"), ("end position")] hits the adjacent-insertion probability, it is mapped to [("sh", "eng", "1"), ("m", "a", "1")], and a pseudo phoneme-triplet sequence of the insertion type is generated: [("zh", "e", "4"), ("d", "ao", "4"), ("t", "i", "2"), ("m", "u", "4"), ("n", "an", "2"), ("d", "ao", "3"), ("w", "u", "2"), ("sh", "u", "4"), ("k", "ao", "3"), ("sh", "eng", "1"), ("m", "a", "1")].
If the adjacent triplet pair [("t", "i", "2"), ("m", "u", "4")] hits the adjacent-deletion probability, the latter triplet is deleted from the sequence, giving a pseudo phoneme-triplet sequence of the deletion type: [("zh", "e", "4"), ("d", "ao", "4"), ("t", "i", "2"), ("n", "an", "2"), ("d", "ao", "3"), ("w", "u", "2"), ("sh", "u", "4"), ("k", "ao", "3"), ("sh", "eng", "1")].
Further, according to the first probability information, the phoneme triplets that were changed by the mapping in the generated pseudo phoneme-triplet sequences are converted into homophonic Chinese characters, and the remaining phoneme triplets are restored to the corresponding Chinese characters of the original standard sentence, thereby generating a fourth text sequence.
As shown in fig. 2, in the substitution-type pseudo phoneme-triplet sequence, the modified triplet ("zh", "ao", "4") is mapped to the character glossed as "make" according to the probabilities in the homophone character probability table, while the remaining portion stays consistent with the original sentence, generating a fourth text sequence in which the first character of the sentence has been replaced by that similar-sounding character.
Similarly, the insertion-type pseudo phoneme-triplet sequence generates a fourth text sequence in which an extra character pronounced "ma1" is appended to the end of the original sentence.
Likewise, the deletion-type pseudo phoneme-triplet sequence generates a fourth text sequence in which the character corresponding to the deleted triplet ("m", "u", "4") is missing from the original sentence.
According to the embodiment of the application, the original first text sequence is expanded in different ways by combining the distribution probability of the various recognition results in the ASR process with the probability of the different homophones corresponding to the same phonetic information, so that fourth text sequences covering more ASR recognition situations are obtained; second training data is then generated by combining them with the translation result, so that the training data can cover more ASR recognition scenarios and recognition results.
In some embodiments of the present application, the number of the fourth text sequences is N, N being a positive integer;
generating second training data based on the fourth text sequence and the third text sequence, including:
generating second training data based on each of the N fourth text sequences and the third text sequence, respectively, to obtain N second training data.
In the embodiment of the application, from the phonetic information of the original text sequences, namely the first phonetic information sequence of the first text sequence and the second phonetic information sequence of the second text sequence, the distribution probability of the various recognition results in the ASR process, namely the second probability information, is obtained, as well as the probability of the different homophones represented by each piece of phonetic information, namely the first probability information.
By processing the original first text sequence in different ways based on the first probability information and the second probability information, the different recognition results that ASR might produce for the first voice data, namely N fourth text sequences, are obtained.
Each of the N fourth text sequences is then combined with the translation of the standard text sequence (the second text sequence) in the original training data, namely the third text sequence, so as to construct enhanced training data with stronger robustness.
For example, as shown in fig. 2, combining the three generated pseudo ASR texts, i.e., the three fourth text sequences, with the translation result "This question baffled countless candidates" corresponding to the original labeled text, i.e., the second text sequence, yields three robust pieces of second training data, each pairing one pseudo ASR text (the substitution, insertion and deletion variants of the original sentence) with the translation "This question baffled countless candidates".
The embodiment of the application can simulate various situations possibly occurring in the ASR automatic speech recognition process, enhance the robustness of the original training data and obtain more training data covering more situations.
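A small illustrative sketch of this pairing step (the (source, target) tuple layout is an assumption made for illustration):

```python
def build_second_training_data(fourth_sequences, third_sequence):
    """Pair every pseudo ASR text with the existing translation to form one training sample each."""
    return [(pseudo_src, third_sequence) for pseudo_src in fourth_sequences]

# e.g. three pseudo ASR variants of the example sentence would each be paired with
# "This question baffled countless candidates", giving three new samples.
```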
In some embodiments of the present application, after obtaining the N second training data, the method further comprises:
generating a training data set according to the first training data and the N second training data;
and training the translation model through the training data set to obtain a trained translation model.
In the embodiment of the application, the original first training data is subjected to robustness enhancement and expansion by combining the distribution probability of the various recognition results in the ASR process with the probability of the different homophones corresponding to the same phonetic information, so that N pieces of second training data with better robustness are obtained.
When the machine translation model is trained, the model to be trained is trained on a mixture of the original first training data and the robustness-enhanced second training data, so that the resulting machine translation model can translate an automatic speech recognition result containing errors into correct text in the target language, improving the robustness of the speech translation technology.
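A minimal sketch of assembling the mixed training set, assuming the (source, target) pair layout of the earlier sketches; the actual model training pipeline is framework-specific and therefore not shown:

```python
import random

def build_training_set(first_training_data, second_training_data, seed=0):
    """Mix the original samples with the N augmented samples and shuffle them."""
    dataset = list(first_training_data) + list(second_training_data)
    random.Random(seed).shuffle(dataset)            # deterministic shuffle for reproducibility
    return dataset

# The mixed dataset would then be fed to whatever machine translation training
# pipeline is in use (tokenization, batching and optimization are framework-specific).
```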
The execution subject of the training data generation method provided by the embodiment of the application may be a training data generation device. In the embodiment of the present application, the training data generation device provided by the embodiment of the application is described by taking, as an example, the case where the training data generation device executes the training data generation method.
In some embodiments of the present application, a training data generating apparatus is provided, fig. 3 shows a block diagram of a training data generating apparatus according to some embodiments of the present application, and as shown in fig. 3, a training data generating apparatus 300 includes:
the obtaining module 302 is configured to obtain first training data, where the first training data includes a first text sequence, a second text sequence, and a third text sequence, the first text sequence is obtained by performing speech recognition on the first speech data, the second text sequence is a standard text sequence corresponding to the first speech data, and the third text sequence is obtained by translating the second text sequence;
a generating module 304, configured to generate a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence, where the first phonetic information sequence includes phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence includes phonetic information representing the pronunciation of each character in the second text sequence; generate a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence; and generate second training data based on the fourth text sequence and the third text sequence.
The embodiment of the application simulates various recognition results possibly generated by ASR automatic speech recognition based on the text pronunciation, trains the machine translation model through the simulated second training data generated by the text sequences containing the various recognition results, and improves the diversity of the training data.
In some embodiments of the present application, the generating device further includes:
a determining module, configured to determine, for each piece of phonetic information in the first phonetic information sequence, the homophone character set corresponding to the pronunciation it represents and the probability that the phonetic information represents each character in the homophone character set, to obtain first probability information; determine the distribution probability of the voice recognition result corresponding to each character in the first text sequence based on the phonetic information of a first character in the first text sequence and the phonetic information of a second character adjacent to the first character, to obtain second probability information; and determine the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence.
The embodiment of the application combines various situations possibly occurring in the ASR automatic speech recognition process, expands the first text sequence which is originally only one into the N (N is more than 1) fourth text sequences which comprise more ASR recognition results, thereby forming N pieces of second training data, and therefore, the diversity of the training data can be improved.
In some embodiments of the present application, the generating device further includes:
a processing module, configured to convert at least one piece of phonetic information in the second phonetic information sequence based on the second probability information, to obtain a third phonetic information sequence;
the determining module is further configured to determine, based on the first probability information, the replacement character corresponding to the converted phonetic information in the third phonetic information sequence;
and a replacing module, configured to replace the corresponding character in the second text sequence with the replacement character according to the position of the converted phonetic information in the third phonetic information sequence, to obtain the fourth text sequence.
According to the embodiment of the application, the original first text sequence is expanded in different ways by combining the distribution probability of the various recognition results in the ASR process with the probability of the different homophones corresponding to the same phonetic information, so that fourth text sequences covering more ASR recognition situations are obtained; second training data is then generated by combining them with the translation result, so that the training data can cover more ASR recognition scenarios and recognition results.
In some embodiments of the present application, the number of the fourth text sequences is N, N being a positive integer;
The generating module is further configured to generate one piece of second training data based on each of the N fourth text sequences and the third text sequence, to obtain N pieces of second training data.
The embodiment of the application can simulate various situations possibly occurring in the ASR automatic speech recognition process, enhance the robustness of the original training data and obtain more training data covering more situations.
In some embodiments of the present application, the generating module is further configured to generate a training data set according to the first training data and the N second training data;
the generating device further includes:
and the training module is used for training the translation model through the training data set to obtain a trained translation model.
When the machine translation model is trained in the embodiment of the application, the original first training data and the robustness-enhanced second training data are used to perform mixed training on the translation model to be trained, so that the resulting machine translation model can translate an automatic speech recognition result containing errors into correct text in the target language, improving the robustness of the speech translation technology.
The training data generating device in the embodiment of the application can be electronic equipment or a component in the electronic equipment, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be other devices than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, mobile internet appliance (Mobile Internet Device, MID), augmented reality (augmented reality, AR)/Virtual Reality (VR) device, robot, wearable device, ultra-mobile personal computer, UMPC, netbook or personal digital assistant (personal digital assistant, PDA), etc., but may also be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and the embodiments of the present application are not limited in particular.
The training data generating device in the embodiment of the application may be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The training data generating device provided by the embodiment of the present application can implement each process implemented by the above method embodiment, and in order to avoid repetition, details are not repeated here.
Optionally, an embodiment of the present application further provides an electronic device, fig. 4 shows a block diagram of an electronic device according to an embodiment of the present application, as shown in fig. 4, an electronic device 400 includes a processor 402, a memory 404, and a program or an instruction stored in the memory 404 and capable of running on the processor 402, where the program or the instruction is executed by the processor 402 to implement each process of the foregoing method embodiment, and the same technical effects are achieved, and are not repeated herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 5 is a schematic hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 500 includes, but is not limited to: radio frequency unit 501, network module 502, audio output unit 503, input unit 504, sensor 505, display unit 506, user input unit 507, interface unit 508, memory 509, and processor 510.
Those skilled in the art will appreciate that the electronic device 500 may further include a power source (e.g., a battery) for powering the various components, and that the power source may be logically coupled to the processor 510 via a power management system to perform functions such as managing charging, discharging, and power consumption via the power management system. The electronic device structure shown in fig. 5 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
The processor 510 is configured to obtain first training data, where the first training data includes a first text sequence, a second text sequence, and a third text sequence, the first text sequence is obtained by performing speech recognition on the first speech data, the second text sequence is a standard text sequence corresponding to the first speech data, and the third text sequence is obtained by translating the second text sequence;
generate a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence, where the first phonetic information sequence includes phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence includes phonetic information representing the pronunciation of each character in the second text sequence; generate a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence; and generate second training data based on the fourth text sequence and the third text sequence.
According to the embodiment of the application, the machine translation model is trained with second training data generated by simulating automatic speech recognition results that contain ASR errors, so the diversity of the training data is improved.
Optionally, the processor 510 is further configured to determine, for each piece of phonetic information in the first phonetic information sequence, the homophone character set corresponding to the pronunciation it represents and the probability that the phonetic information represents each character in the homophone character set, to obtain first probability information; determine the distribution probability of the voice recognition result corresponding to each character in the first text sequence based on the phonetic information of a first character in the first text sequence and the phonetic information of a second character adjacent to the first character, to obtain second probability information; and determine the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence.
The embodiment of the application combines the various situations that may occur during ASR automatic speech recognition, and expands the single original first text sequence into N (N>1) fourth text sequences covering more ASR recognition results, thereby forming N pieces of second training data, so the diversity of the training data can be improved.
Optionally, the processor 510 is further configured to convert at least one piece of phonetic information in the second phonetic information sequence based on the second probability information, to obtain a third phonetic information sequence; determine, based on the first probability information, the replacement character corresponding to the converted phonetic information in the third phonetic information sequence; and replace the corresponding character in the second text sequence with the replacement character according to the position of the converted phonetic information in the third phonetic information sequence, to obtain the fourth text sequence.
According to the embodiment of the application, the original first text sequence is expanded in different ways by combining the distribution probability of the various recognition results in the ASR process with the probability of the different homophones corresponding to the same phonetic information, so that fourth text sequences covering more ASR recognition situations are obtained; second training data is then generated by combining them with the translation result, so that the training data can cover more ASR recognition scenarios and recognition results.
Optionally, the number of the fourth text sequences is N, N being a positive integer;
the processor 510 is further configured to generate one piece of second training data based on each of the N fourth text sequences and the third text sequence, to obtain N pieces of second training data.
The embodiment of the application can simulate various situations possibly occurring in the ASR automatic speech recognition process, enhance the robustness of the original training data and obtain more training data covering more situations.
Optionally, the processor 510 is further configured to generate a training data set according to the first training data and the N second training data; and training the translation model through the training data set to obtain a trained translation model.
When the machine translation model is trained in the embodiment of the application, the original first training data and the robustness-enhanced second training data are used to perform mixed training on the translation model to be trained, so that the resulting machine translation model can translate an automatic speech recognition result containing errors into correct text in the target language, improving the robustness of the speech translation technology.
It should be appreciated that in embodiments of the present application, the input unit 504 may include a graphics processor (Graphics Processing Unit, GPU) 5041 and a microphone 5042, the graphics processor 5041 processing image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 506 may include a display panel 5061, and the display panel 5061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 507 includes at least one of a touch panel 5071 and other input devices 5072. Touch panel 5071, also referred to as a touch screen. Touch panel 5071 may include two parts, a touch detection device and a touch controller. Other input devices 5072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.
The memory 509 may be used to store software programs as well as various data. The memory 509 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system, and application programs or instructions (such as a sound playing function, an image playing function, etc.) required for at least one function. Further, the memory 509 may include volatile memory or nonvolatile memory, or both. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), or a direct Rambus RAM (DRRAM). The memory 509 in embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 510 may include one or more processing units; optionally, the processor 510 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, etc., and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 510.
The embodiment of the application also provides a readable storage medium, and the readable storage medium stores a program or an instruction, which when executed by a processor, implements each process of the above method embodiment, and can achieve the same technical effects, so that repetition is avoided, and no further description is provided herein.
The processor is a processor in the electronic device in the above embodiment. Readable storage media include computer readable storage media such as Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic or optical disks, and the like.
The embodiment of the application further provides a chip, the chip comprises a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running programs or instructions, the processes of the embodiment of the method can be realized, the same technical effects can be achieved, and the repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
Embodiments of the present application provide a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the respective processes of the above method embodiments, and achieve the same technical effects, and for avoiding repetition, a detailed description is omitted herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware alone, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application may be embodied essentially, or in part, in the form of a computer software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative and not restrictive. Many other forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope protected by the claims, and all such forms fall within the protection of the present application.

Claims (12)

1. A method of generating training data, the method comprising:
acquiring first training data, wherein the first training data comprises a first text sequence, a second text sequence and a third text sequence, the first text sequence is obtained by performing voice recognition on first voice data, the second text sequence is a standard text sequence corresponding to the first voice data, and the third text sequence is obtained by performing translation on the second text sequence;
generating a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence, wherein the first phonetic information sequence comprises phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence comprises phonetic information representing the pronunciation of each character in the second text sequence;
generating a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence;
and generating second training data based on the fourth text sequence and the third text sequence.
2. The generation method of claim 1, wherein the generating a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence comprises:
determining a homophone text set corresponding to the pronunciation represented by each piece of phonetic information in the first phonetic information sequence and a probability of each piece of phonetic information representing each text in the homophone text set, to obtain first probability information;
determining a distribution probability of a voice recognition result corresponding to each character in the first text sequence based on the phonetic information of a first character in the first text sequence and the phonetic information of a second character adjacent to the first character, to obtain second probability information;
and determining the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence.
3. The generation method of claim 2, wherein the determining the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence comprises:
converting at least one piece of phonetic information in the second phonetic information sequence based on the second probability information, to obtain a third phonetic information sequence;
determining, based on the first probability information, replacement text corresponding to the converted phonetic information in the third phonetic information sequence;
and replacing the corresponding text in the second text sequence with the replacement text according to the position of the converted phonetic information in the third phonetic information sequence, to obtain the fourth text sequence.
4. The generation method of claim 1, wherein the number of the fourth text sequences is N, N being a positive integer; and
the generating second training data based on the fourth text sequence and the third text sequence comprises:
generating one piece of second training data based on each of the N fourth text sequences and the third text sequence respectively, to obtain N pieces of second training data.
5. The generation method of claim 4, wherein after the obtaining N pieces of second training data, the method further comprises:
generating a training data set according to the first training data and the N pieces of second training data;
and training a translation model with the training data set to obtain a trained translation model.
6. A training data generation apparatus, the generation apparatus comprising:
an acquisition module, configured to acquire first training data, wherein the first training data comprises a first text sequence, a second text sequence and a third text sequence, the first text sequence is obtained by performing voice recognition on first voice data, the second text sequence is a standard text sequence corresponding to the first voice data, and the third text sequence is obtained by performing translation on the second text sequence; and
a generation module, configured to:
generate a first phonetic information sequence corresponding to the first text sequence and a second phonetic information sequence corresponding to the second text sequence according to the pronunciation of each character in the first text sequence and the second text sequence, wherein the first phonetic information sequence comprises phonetic information representing the pronunciation of each character in the first text sequence, and the second phonetic information sequence comprises phonetic information representing the pronunciation of each character in the second text sequence;
generate a fourth text sequence according to the first phonetic information sequence and the second phonetic information sequence; and
generate second training data based on the fourth text sequence and the third text sequence.
7. The generation apparatus of claim 6, wherein the generation apparatus further comprises:
a determining module, configured to:
determine a homophone text set corresponding to the pronunciation represented by each piece of phonetic information in the first phonetic information sequence and a probability of each piece of phonetic information representing each text in the homophone text set, to obtain first probability information;
determine a distribution probability of a voice recognition result corresponding to each character in the first text sequence based on the phonetic information of a first character in the first text sequence and the phonetic information of a second character adjacent to the first character, to obtain second probability information; and
determine the fourth text sequence according to the first probability information, the second probability information and the second phonetic information sequence.
8. The generation apparatus of claim 7, wherein the generation apparatus further comprises:
a processing module, configured to convert at least one piece of phonetic information in the second phonetic information sequence based on the second probability information, to obtain a third phonetic information sequence;
wherein the determining module is further configured to determine, based on the first probability information, replacement text corresponding to the converted phonetic information in the third phonetic information sequence; and
a replacing module, configured to replace the corresponding text in the second text sequence with the replacement text according to the position of the converted phonetic information in the third phonetic information sequence, to obtain the fourth text sequence.
9. The generation apparatus of claim 6, wherein the number of the fourth text sequences is N, N being a positive integer; and
the generation module is further configured to generate one piece of second training data based on each of the N fourth text sequences and the third text sequence respectively, to obtain N pieces of second training data.
10. The generation apparatus of claim 9, wherein
the generation module is further configured to generate a training data set according to the first training data and the N pieces of second training data; and
the generation apparatus further comprises:
a training module, configured to train a translation model with the training data set to obtain a trained translation model.
11. An electronic device, comprising a processor and a memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the generation method of any one of claims 1 to 5.
12. A readable storage medium, wherein a program or instructions are stored on the readable storage medium, and the program or instructions, when executed by a processor, implement the steps of the generation method of any one of claims 1 to 5.
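For illustration only, the following Python sketch shows the data flow of the generation method of claims 1 to 5: a reference transcript is mapped to a phonetic information sequence, some characters are swapped for homophones according to probability tables, and each perturbed (fourth) text sequence is paired with the existing translation to form new training data. This is a minimal, hypothetical sketch, not the patented implementation: the pronunciation dictionary, the homophone probability table, the example sentences and all function names are invented for the example, and the adjacency-conditioned second probability information of claim 2 is collapsed here into a single conversion probability.

```python
# Minimal sketch of the claimed pipeline under simplifying assumptions:
# pronunciations come from a toy dictionary, and the probability table is a
# fixed illustrative stand-in for statistics derived from real ASR output.
import random

# Hypothetical pronunciation dictionary: character -> phonetic information (pinyin-like).
PRONUNCIATION = {
    "他": "ta1", "她": "ta1",
    "是": "shi4", "市": "shi4", "事": "shi4",
    "长": "zhang3",
}

# Stand-in for the first probability information: for each piece of phonetic
# information, the homophone set and the probability of each candidate character.
HOMOPHONE_PROB = {
    "ta1": {"他": 0.6, "她": 0.4},
    "shi4": {"是": 0.7, "市": 0.2, "事": 0.1},
}

def to_phonetic_sequence(text):
    """Generate the phonetic information sequence for a text sequence (claim 1)."""
    return [PRONUNCIATION.get(ch, ch) for ch in text]

def generate_fourth_text(second_text, conversion_prob=0.5, rng=random):
    """Perturb the reference (second) text sequence by homophone substitution
    (claims 2 and 3), returning a pseudo-ASR fourth text sequence."""
    phonetic_seq = to_phonetic_sequence(second_text)  # second phonetic information sequence
    out_chars = []
    for ch, ph in zip(second_text, phonetic_seq):
        candidates = HOMOPHONE_PROB.get(ph)
        if candidates and rng.random() < conversion_prob:
            # Pick a replacement character according to the homophone probabilities.
            chars, probs = zip(*candidates.items())
            out_chars.append(rng.choices(chars, weights=probs, k=1)[0])
        else:
            out_chars.append(ch)
    return "".join(out_chars)

def generate_second_training_data(first_training_data, n=3):
    """Produce N pieces of second training data from one piece of first training
    data (claims 4 and 5): each pairs a perturbed fourth text sequence with the
    third (translated) text sequence."""
    first_text, second_text, third_text = first_training_data
    # first_text (the ASR output) is unused in this simplified sketch; in the
    # claims it is the source from which the probability information is derived.
    return [(generate_fourth_text(second_text), third_text) for _ in range(n)]

if __name__ == "__main__":
    # (ASR result, reference transcript, translation) -- hypothetical example triple.
    first_data = ("他事市长", "他是市长", "He is the mayor")
    for src, tgt in generate_second_training_data(first_data):
        print(src, "->", tgt)
```

Because each perturbed source sentence keeps the original translation as its target, the augmented pairs expose a translation model to homophone errors typical of speech recognition output while still supervising it with the correct translation.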
CN202311426285.0A 2023-10-30 2023-10-30 Training data generation method and device, electronic equipment and storage medium Pending CN117219062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311426285.0A CN117219062A (en) 2023-10-30 2023-10-30 Training data generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311426285.0A CN117219062A (en) 2023-10-30 2023-10-30 Training data generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117219062A true CN117219062A (en) 2023-12-12

Family

ID=89051345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311426285.0A Pending CN117219062A (en) 2023-10-30 2023-10-30 Training data generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117219062A (en)

Similar Documents

Publication Publication Date Title
CN109036464B (en) Pronunciation error detection method, apparatus, device and storage medium
EP3469592B1 (en) Emotional text-to-speech learning system
US11404043B2 (en) Systems and methods for providing non-lexical cues in synthesized speech
CN107077841B (en) Superstructure recurrent neural network for text-to-speech
KR101255402B1 (en) Redictation 0f misrecognized words using a list of alternatives
CN103714048B (en) Method and system for correcting text
US11043213B2 (en) System and method for detection and correction of incorrectly pronounced words
CN111681642B (en) Speech recognition evaluation method, device, storage medium and equipment
JP2014145842A (en) Speech production analysis device, voice interaction control device, method, and program
KR101819458B1 (en) Voice recognition apparatus and system
CN110600002B (en) Voice synthesis method and device and electronic equipment
CN111523532A (en) Method for correcting OCR character recognition error and terminal equipment
CN113707124A (en) Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN112086094A (en) Method for correcting pronunciation, terminal equipment and computer readable storage medium
EP1475776B1 (en) Dynamic pronunciation support for speech recognition training
CN116229935A (en) Speech synthesis method, device, electronic equipment and computer readable medium
KR20170009486A (en) Database generating method for chunk-based language learning and electronic device performing the same
CN115099222A (en) Punctuation mark misuse detection and correction method, device, equipment and storage medium
CN117219062A (en) Training data generation method and device, electronic equipment and storage medium
CN115101042A (en) Text processing method, device and equipment
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
KR20170009487A (en) Chunk-based language learning method and electronic device to do this
KR101777141B1 (en) Apparatus and method for inputting chinese and foreign languages based on hun min jeong eum using korean input keyboard
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination