CN112037769A - Training data generation method and device and computer readable storage medium - Google Patents
Training data generation method and device and computer readable storage medium
- Publication number
- CN112037769A (application number CN202010738406.5A)
- Authority
- CN
- China
- Prior art keywords
- information
- text
- training
- audio
- text information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The invention discloses a training data generation method, a training data generation device, and a computer-readable storage medium. The training data generation method includes: receiving audio information and corresponding labeled text information; generating speech recognition text information and first timestamp information corresponding to the audio information; performing content matching between the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information. By acquiring original audio information and labeled text information, and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and the corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a training data generation method and apparatus, and a computer-readable storage medium.
Background
Training a speech recognition system requires a large amount of training data consisting of speech paired with labeled text. Existing schemes for acquiring such training data either collect speech and have annotators transcribe the corresponding text through a speech labeling system, or specify a large amount of text and have different speakers record speech according to that text. Training data for the speech recognition system can thus be obtained only through extensive manual recording and labeling. These schemes consume considerable manpower and time, and acquiring a large amount of high-quality speech training data is very difficult, so training sets for speech recognition systems remain deficient.
Disclosure of Invention
Embodiments of the invention provide a training data generation method, a training data generation device, and a computer-readable storage medium, which achieve the technical effect of efficiently acquiring a large amount of high-quality speech training data at reduced cost.
One aspect of the present invention provides a training data generation method, including: receiving audio information and corresponding labeled text information; generating speech recognition text information and first timestamp information corresponding to the audio information; performing content matching between the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In one embodiment, the content matching of the labeled text information and the speech recognition text information includes: performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm; and performing text alignment processing on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In one embodiment, generating the second timestamp information corresponding to the labeled text information according to the first timestamp information includes: acquiring, from the first timestamp information, the start timestamp information and end timestamp information corresponding to each character/word in the speech recognition text information; and, for each character/word in the labeled text information, copying the start timestamp information and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
In one possible embodiment, before the content matching of the labeled text information and the speech recognition text information, the method includes: acquiring, through a speech recognition system, a confidence corresponding to each character/word in the speech recognition text information; and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
In one implementation, acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information includes: dividing the labeled text information into multiple pieces of sub-text training information according to a set character count or designated characters, and acquiring, from the second timestamp information, the start timestamp information and end timestamp information corresponding to each piece of sub-text training information; and splitting the audio information into multiple pieces of sub-audio training information according to the start timestamp information and end timestamp information corresponding to the pieces of sub-text training information.
In one embodiment, before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further includes: inputting the labeled text information into a language model in the speech recognition system for training, or dynamically increasing the probability value of the labeled text information during decoding by the speech recognition system.
Another aspect of the present invention provides a training data generation apparatus, including: an information receiving module for receiving audio information and corresponding labeled text information; a first information generation module for generating speech recognition text information and first timestamp information corresponding to the audio information; a second information generation module for performing content matching between the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and a training data generation module for acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In one implementation, the second information generation module is specifically configured to: perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm; and perform text alignment processing on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In one implementation, the training data generation module is specifically configured to: divide the labeled text information into multiple pieces of sub-text training information according to a set character count or designated characters, and acquire, from the second timestamp information, the start timestamp information and end timestamp information corresponding to each piece of sub-text training information; and split the audio information into multiple pieces of sub-audio training information according to the start timestamp information and end timestamp information corresponding to the pieces of sub-text training information.
Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform any of the training data generation methods described above.
In the embodiments of the invention, original audio information and labeled text information are acquired, and the timestamp information of the audio is used to extract multiple pieces of sub-audio training information and the corresponding sub-text training information, so that a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flow chart illustrating an implementation of a training data generation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of a training data generation method according to an embodiment of the present invention.
As shown in fig. 1, in one aspect, the present invention provides a training data generation method, including:
Step 101, receiving audio information and corresponding labeled text information;
Step 102, generating speech recognition text information and first timestamp information corresponding to the audio information;
Step 103, performing content matching between the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
Step 104, acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In this embodiment, in step 101, the audio information and corresponding labeled text information are preferably long audio and long labeled text, such as audiobooks, speech audio, or interview recordings; they may be obtained by crawling the network using crawler technology or retrieved from a local database.
In step 102, the speech recognition text information and the first timestamp information may be obtained by feeding the received audio information into an existing speech recognition system, or by manual measurement; the first timestamp information includes the start and end timestamps corresponding to each character or word in the speech recognition text information. For example, suppose the labeled text information is "The weather is very hot; the glaciers at the Earth's south pole have all collapsed" and the speech recognition text information is "The weather is very hot, the glaciers at the Earth's south pole are all exposed"; the timestamp information may then be as follows (a data-structure sketch follows the list):
the weather is very hot: [weather, 19.83, 20.49], [very, 20.49, 20.79], [hot, 20.79, 21.00];
the Earth's south pole: [Earth, 21.90, 22.05], [south pole, 22.05, 22.62];
glaciers exposed: [glaciers, 23.67, 24.00], [exposed, 24.00, 24.24].
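Concretely, the first timestamp information can be represented as one (token, start, end) triple per recognized character/word. The following Python sketch mirrors the illustrative values above; the representation and the helper function are assumptions for illustration, not prescribed by the patent:

```python
# First timestamp information as (token, start_sec, end_sec) triples,
# one per recognized character/word; values mirror the example above.
first_timestamps = [
    ("weather", 19.83, 20.49),
    ("very", 20.49, 20.79),
    ("hot", 20.79, 21.00),
    ("Earth", 21.90, 22.05),
    ("south pole", 22.05, 22.62),
    ("glaciers", 23.67, 24.00),
    ("exposed", 24.00, 24.24),
]

def token_span(timestamps, i, j):
    """Start/end timestamps covering the token range [i, j)."""
    return timestamps[i][1], timestamps[j - 1][2]

print(token_span(first_timestamps, 0, 3))  # (19.83, 21.0) for "weather very hot"
```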
In step 103, content matching is performed between the labeled text information and the speech recognition text information, and second timestamp information corresponding to the labeled text information is thereby generated from the first timestamp information.
Next, in step 104, the sub-text training information in the labeled text information and the sub-audio training information in the audio information are acquired according to the second timestamp information.
In this way, original audio information and labeled text information are acquired, and the timestamp information of the audio is used to extract multiple pieces of sub-audio training information and the corresponding sub-text training information; a large amount of high-quality speech training data is thus obtained, the process is efficient, and the cost is reduced.
In one embodiment, the content matching of the labeled text information and the speech recognition text information includes:
performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and performing text alignment processing on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In this embodiment, the edit distance between two strings is the minimum number of edit operations required to transform one string into the other; the greater the distance, the lower the similarity between the strings.
During similarity matching, the labeled text information and the speech recognition text information may each be split into multiple long-sentence or word-level sentence units according to punctuation marks or a word segmentation tool; the sentence units of the labeled text and of the speech recognition text are then matched against each other by similarity, and the two sentence units with the highest similarity are considered a match, as in the sketch below.
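A minimal Python sketch of this step, assuming similarity is scored as one minus the length-normalized edit distance (the patent does not fix a particular normalization):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def best_match(labeled_sentence: str, recognized_sentences: list[str]) -> str:
    """Return the recognized sentence unit most similar to a labeled one."""
    def similarity(a: str, b: str) -> float:
        return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
    return max(recognized_sentences, key=lambda s: similarity(labeled_sentence, s))
```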
Text alignment processing is then carried out. In this process, an existing word segmentation tool may be used to segment the matched speech recognition text information into multiple character/word units; with the labeled text information as the reference, the characters/words in the speech recognition text are placed in correspondence with the characters/words in the labeled text, and unaligned positions may be padded by adding a specific symbol, so as to complete content matching. For the example given above, the alignment can be expressed as follows (a code sketch follows):
Labeled text: The weather is very hot; the glaciers at the Earth's south pole have all collapsed;
Recognized text: The weather is very hot __ the glaciers at the Earth's south pole are all exposed __ (the underscore symbol indicates a blank).
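One way to realize this alignment with standard-library tooling is shown below; difflib.SequenceMatcher is an assumed stand-in, since the patent does not name a specific alignment procedure. The function returns (labeled, recognized-or-blank) pairs, keeping the labeled text as the reference:

```python
import difflib

def align(labeled_tokens: list[str], recognized_tokens: list[str],
          blank: str = "__") -> list[tuple[str, str]]:
    """Align recognized tokens against labeled tokens; labeled positions
    with no matched recognized token are paired with a blank symbol."""
    sm = difflib.SequenceMatcher(a=labeled_tokens, b=recognized_tokens)
    aligned = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            aligned += list(zip(labeled_tokens[i1:i2], recognized_tokens[j1:j2]))
        else:
            # "replace"/"delete" spans: pair positionally, pad with blanks;
            # "insert" spans (i1 == i2) add nothing, since the labeled
            # text is the reference.
            for k in range(i1, i2):
                j = j1 + (k - i1)
                aligned.append((labeled_tokens[k],
                                recognized_tokens[j] if j < j2 else blank))
    return aligned
```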
In one embodiment, generating the second timestamp information corresponding to the labeled text information according to the first timestamp information includes:
acquiring, from the first timestamp information, the start timestamp information and end timestamp information corresponding to each character/word in the speech recognition text information;
and, for each character/word in the labeled text information, copying the start timestamp information and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
In this embodiment, the generation of the second timestamp information specifically proceeds as follows:
After content matching is completed, the start timestamp information and end timestamp information corresponding to each character/word in the speech recognition text information are acquired, and the acquired start and end timestamps are copied, by character index, onto the character/word at the corresponding index position in the labeled text information, thereby generating the second timestamp information corresponding to the labeled text information.
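A minimal sketch of the copy step, building on the alignment above; here aligned_idx (an assumed representation) gives, for each labeled token, the index of its matched recognized token, or None where the position was padded blank:

```python
def second_timestamps(labeled_tokens, aligned_idx, first_timestamps):
    """For each labeled token, copy the (start, end) timestamps of its
    matched recognized token from the first timestamp information."""
    second = []
    for tok, j in zip(labeled_tokens, aligned_idx):
        if j is None:
            second.append((tok, None, None))  # no match; may be filled from neighbours
        else:
            _, start, end = first_timestamps[j]
            second.append((tok, start, end))
    return second
```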
In one possible embodiment, before the content matching of the labeled text information and the speech recognition text information, the method includes:
acquiring, through a speech recognition system, a confidence corresponding to each character/word in the speech recognition text information;
and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
In this embodiment, the labeled text information obtained in step 101 may contain errors; for example, in the labeled text above, "collapsed" may be a transcription error where the audio actually says "exposed". Therefore, before step 103 is performed, the confidence of each character/word is obtained from the speech recognition system while the speech recognition text information is being generated.
If the confidence of a character/word exceeds a preset threshold, that character/word is judged highly accurate, and the corresponding character/word in the labeled text information is checked for consistency; if inconsistent, the character/word in the labeled text information is replaced, that is, "the glaciers at the Earth's south pole are all exposed" replaces "the glaciers at the Earth's south pole have all collapsed". This step reduces the amount of computation in the edit distance algorithm and further improves efficiency.
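A minimal sketch of this pre-correction, assuming per-token confidences aligned positionally with the labeled tokens and an illustrative threshold of 0.9 (the patent leaves the threshold unspecified):

```python
def correct_by_confidence(labeled_tokens: list[str],
                          recognized_tokens: list[str],
                          confidences: list[float],
                          threshold: float = 0.9) -> list[str]:
    """Where a recognized token is high-confidence but disagrees with
    the corresponding labeled token, trust the recognizer and replace
    the labeled token."""
    corrected = list(labeled_tokens)
    for i, (lab, rec, conf) in enumerate(
            zip(labeled_tokens, recognized_tokens, confidences)):
        if conf > threshold and rec != lab:
            corrected[i] = rec
    return corrected
```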
In one implementation, acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information includes:
dividing the labeled text information into multiple pieces of sub-text training information according to a set character count or designated characters, and acquiring, from the second timestamp information, the start timestamp information and end timestamp information corresponding to each piece of sub-text training information;
and splitting the audio information into multiple pieces of sub-audio training information according to the start timestamp information and end timestamp information corresponding to the pieces of sub-text training information.
In this embodiment, the specific process of step 104 is as follows:
After the second timestamp information is generated, the index positions of punctuation marks in the labeled text information are detected, or the index positions at which to cut are located according to a specified character count, and the labeled text information is divided into multiple pieces of sub-text training information at those index positions; for example, "The weather is very hot; the glaciers at the Earth's south pole have all collapsed" is divided into "The weather is very hot" and "the glaciers at the Earth's south pole have all collapsed".
The start and end timestamp information of each piece of sub-text training information is then obtained from the second timestamp information, such as [19.83, 21.00] for "The weather is very hot".
The audio information is split according to the start and end timestamp information of each piece of sub-text training information, yielding the multiple pieces of sub-audio training information corresponding to the sub-text training information; the sub-text training information and the corresponding sub-audio training information are then used as training data.
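A minimal sketch of both splits, using only the Python standard library; the PCM WAV input format and the punctuation set are assumptions for illustration:

```python
import re
import wave

def split_text(labeled_text: str) -> list[str]:
    """Divide the labeled text into sub-text training pieces at
    punctuation marks (a fixed character count works analogously)."""
    return [s.strip() for s in re.split(r"[,.;!?，。；！？]", labeled_text) if s.strip()]

def split_audio(wav_path: str, start_sec: float, end_sec: float,
                out_path: str) -> None:
    """Cut the sub-audio clip [start_sec, end_sec) out of a PCM WAV file."""
    with wave.open(wav_path, "rb") as src:
        rate = src.getframerate()
        params = src.getparams()
        src.setpos(int(start_sec * rate))
        frames = src.readframes(int((end_sec - start_sec) * rate))
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)   # nframes is corrected when the file is closed
        dst.writeframes(frames)

# e.g. split_audio("lecture.wav", 19.83, 21.00, "clip_0001.wav")
```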
In one embodiment, before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further includes:
inputting the labeled text information into a language model in the speech recognition system for training, or dynamically increasing the probability value of the labeled text information during decoding by the speech recognition system.
In this embodiment, considering that the accuracy of the speech recognition text produced by the speech recognition system may not be high, before step 102 is executed the obtained labeled text information is fed into a language model in the speech recognition system for training, or the probability value of generating the labeled text information is dynamically increased while the speech recognition system decodes the audio information, so as to improve the system's accuracy in recognizing this audio.
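As a toy stand-in for the decode-time biasing (the patent does not expose a decoder API, so this rescoring pass is an assumption for illustration): each decoder hypothesis receives a score bonus proportional to its bigram overlap with the labeled text, which pushes hypotheses consistent with the labeled text to the top of the ranking.

```python
def rescore_with_labeled_text(hypotheses: list[tuple[str, float]],
                              labeled_text: str,
                              bonus: float = 2.0) -> list[tuple[str, float]]:
    """Add a bonus to each (text, score) hypothesis for every bigram it
    shares with the labeled text, then re-rank the hypotheses."""
    words = labeled_text.split()
    ref_bigrams = set(zip(words, words[1:]))
    rescored = []
    for text, score in hypotheses:
        toks = text.split()
        overlap = sum(bg in ref_bigrams for bg in zip(toks, toks[1:]))
        rescored.append((text, score + bonus * overlap))
    return sorted(rescored, key=lambda p: p[1], reverse=True)
```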
Fig. 2 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention.
As shown in fig. 2, another aspect of the present invention provides a training data generating apparatus, including:
an information receiving module 201, configured to receive audio information and corresponding labeled text information;
a first information generation module 202, configured to generate speech recognition text information and first timestamp information corresponding to the audio information;
a second information generation module 203, configured to perform content matching between the labeled text information and the speech recognition text information and generate second timestamp information corresponding to the labeled text information according to the first timestamp information;
and a training data generation module 204, configured to acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In this embodiment, in the information receiving module 201, the audio information and corresponding labeled text information are preferably long audio and long labeled text, such as audiobooks, speech audio, or interview recordings; they may be obtained by crawling the network using crawler technology or retrieved from a local database.
In the first information generation module 202, the speech recognition text information and the first timestamp information may be obtained by feeding the received audio information into an existing speech recognition system, or by manual measurement; the first timestamp information includes the start and end timestamps corresponding to each character or word in the speech recognition text information. For example, suppose the labeled text information is "The weather is very hot; the glaciers at the Earth's south pole have all collapsed" and the speech recognition text information is "The weather is very hot, the glaciers at the Earth's south pole are all exposed"; the timestamp information may be:
the weather is very hot: [weather, 19.83, 20.49], [very, 20.49, 20.79], [hot, 20.79, 21.00];
the Earth's south pole: [Earth, 21.90, 22.05], [south pole, 22.05, 22.62];
glaciers exposed: [glaciers, 23.67, 24.00], [exposed, 24.00, 24.24].
In the second information generation module 203, content matching is performed between the labeled text information and the speech recognition text information, and second timestamp information corresponding to the labeled text information is thereby generated from the first timestamp information.
Next, in the training data generation module 204, the sub-text training information in the labeled text information and the sub-audio training information in the audio information are acquired according to the second timestamp information.
In this way, original audio information and labeled text information are acquired, and the timestamp information of the audio is used to extract multiple pieces of sub-audio training information and the corresponding sub-text training information; a large amount of high-quality speech training data is thus obtained, the process is efficient, and the cost is reduced.
In one implementation, the second information generation module 203 is specifically configured to:
perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and perform text alignment processing on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In this embodiment, the edit distance between two strings is the minimum number of edit operations required to transform one string into the other; the greater the distance, the lower the similarity between the strings.
During similarity matching, the labeled text information and the speech recognition text information may each be split into multiple long-sentence or word-level sentence units according to punctuation marks or a word segmentation tool; the sentence units of the labeled text and of the speech recognition text are then matched against each other by similarity, and the two sentence units with the highest similarity are considered a match.
Text alignment processing is then carried out. In this process, an existing word segmentation tool may be used to segment the matched speech recognition text information into multiple character/word units; with the labeled text information as the reference, the characters/words in the speech recognition text are placed in correspondence with the characters/words in the labeled text, and unaligned positions may be padded by adding a specific symbol, so as to complete content matching. For the example given above, the alignment can be expressed as:
Labeled text: The weather is very hot; the glaciers at the Earth's south pole have all collapsed;
Recognized text: The weather is very hot __ the glaciers at the Earth's south pole are all exposed __ (the underscore symbol indicates a blank).
In one implementation, the training data generation module 204 is specifically configured to:
divide the labeled text information into multiple pieces of sub-text training information according to a set character count or designated characters, and acquire, from the second timestamp information, the start timestamp information and end timestamp information corresponding to each piece of sub-text training information;
and split the audio information into multiple pieces of sub-audio training information according to the start timestamp information and end timestamp information corresponding to the pieces of sub-text training information.
In this embodiment, the training data generation module 204 operates specifically as follows:
After the second timestamp information is generated, the index positions of punctuation marks in the labeled text information are detected, or the index positions at which to cut are located according to a specified character count, and the labeled text information is divided into multiple pieces of sub-text training information at those index positions; for example, "The weather is very hot; the glaciers at the Earth's south pole have all collapsed" is divided into "The weather is very hot" and "the glaciers at the Earth's south pole have all collapsed".
The start and end timestamp information of each piece of sub-text training information is then obtained from the second timestamp information, such as [19.83, 21.00] for "The weather is very hot".
The audio information is split according to the start and end timestamp information of each piece of sub-text training information, yielding the multiple pieces of sub-audio training information corresponding to the sub-text training information; the sub-text training information and the corresponding sub-audio training information are then used as training data.
In another aspect, the present invention provides a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any of the above-described training data generation methods.
In an embodiment of the present invention, the computer-readable storage medium includes a set of computer-executable instructions that, when executed, receive audio information and corresponding labeled text information; generate speech recognition text information and first timestamp information corresponding to the audio information; perform content matching between the labeled text information and the speech recognition text information and generate second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information. In this way, original audio information and labeled text information are acquired, and the timestamp information of the audio is used to extract multiple pieces of sub-audio training information and the corresponding sub-text training information; a large amount of high-quality speech training data is thus obtained, the process is efficient, and the cost is reduced.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method of generating training data, the method comprising:
receiving audio information and corresponding labeled text information;
generating speech recognition text information and first timestamp information corresponding to the audio information;
performing content matching between the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
2. The method of claim 1, wherein the content matching of the labeled text information and the speech recognition text information comprises:
performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and performing text alignment processing on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
3. The method of claim 2, wherein generating the second timestamp information corresponding to the labeled text information from the first timestamp information comprises:
acquiring, from the first timestamp information, start timestamp information and end timestamp information corresponding to each character/word in the speech recognition text information;
and, for each character/word in the labeled text information, copying the start timestamp information and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
4. The method of claim 2, wherein before the content matching of the labeled text information and the speech recognition text information, the method comprises:
acquiring, through a speech recognition system, a confidence corresponding to each character/word in the speech recognition text information;
and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
5. The method of claim 1, wherein acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information comprises:
dividing the labeled text information into multiple pieces of sub-text training information according to a set character count or designated characters, and acquiring, from the second timestamp information, start timestamp information and end timestamp information corresponding to each piece of sub-text training information;
and splitting the audio information into multiple pieces of sub-audio training information according to the start timestamp information and end timestamp information corresponding to the pieces of sub-text training information.
6. The method of claim 1, wherein prior to generating speech recognition text information and first timestamp information corresponding to the audio information, the method further comprises:
and inputting the labeled text information into a language model in a speech recognition system for training, or dynamically increasing the probability value of the labeled text information during decoding by the speech recognition system.
7. An apparatus for generating training data, the apparatus comprising:
an information receiving module for receiving audio information and corresponding labeled text information;
a first information generation module for generating speech recognition text information and first timestamp information corresponding to the audio information;
a second information generation module for performing content matching between the labeled text information and the speech recognition text information and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
and a training data generation module for acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
8. The apparatus of claim 7, wherein the second information generation module is specifically configured to:
perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and perform text alignment processing on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
9. The apparatus of claim 8, wherein the training data generation module is specifically configured to:
divide the labeled text information into multiple pieces of sub-text training information according to a set character count or designated characters, and acquire, from the second timestamp information, start timestamp information and end timestamp information corresponding to each piece of sub-text training information;
and split the audio information into multiple pieces of sub-audio training information according to the start timestamp information and end timestamp information corresponding to the pieces of sub-text training information.
10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the training data generation method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010738406.5A CN112037769B (en) | 2020-07-28 | 2020-07-28 | Training data generation method and device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010738406.5A CN112037769B (en) | 2020-07-28 | 2020-07-28 | Training data generation method and device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112037769A true CN112037769A (en) | 2020-12-04 |
CN112037769B CN112037769B (en) | 2024-07-30 |
Family
ID=73583359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010738406.5A Active CN112037769B (en) | 2020-07-28 | 2020-07-28 | Training data generation method and device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112037769B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6260011B1 (en) * | 2000-03-20 | 2001-07-10 | Microsoft Corporation | Methods and apparatus for automatically synchronizing electronic audio files with electronic text files |
CN108389577A (en) * | 2018-02-12 | 2018-08-10 | 广州视源电子科技股份有限公司 | Method, system, device and storage medium for optimizing speech recognition acoustic model |
CN110310626A (en) * | 2019-05-23 | 2019-10-08 | 平安科技(深圳)有限公司 | Voice training data creation method, device, equipment and readable storage medium storing program for executing |
CN110516110A (en) * | 2019-07-22 | 2019-11-29 | 平安科技(深圳)有限公司 | Song generation method, device, computer equipment and storage medium |
CN111091834A (en) * | 2019-12-23 | 2020-05-01 | 科大讯飞股份有限公司 | Text and audio alignment method and related product |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112599152A (en) * | 2021-03-05 | 2021-04-02 | 北京智慧星光信息技术有限公司 | Voice data labeling method, system, electronic equipment and storage medium |
CN113129935A (en) * | 2021-06-16 | 2021-07-16 | 北京新唐思创教育科技有限公司 | Audio dotting data acquisition method and device, storage medium and electronic equipment |
CN113539241A (en) * | 2021-07-28 | 2021-10-22 | 广州华多网络科技有限公司 | Speech recognition correction method and corresponding device, equipment and medium |
CN113539241B (en) * | 2021-07-28 | 2023-04-25 | 广州华多网络科技有限公司 | Speech recognition correction method and corresponding device, equipment and medium thereof |
CN117594060A (en) * | 2023-10-31 | 2024-02-23 | 北京邮电大学 | Audio signal content analysis method, device, equipment and storage medium |
CN117975934A (en) * | 2023-12-31 | 2024-05-03 | 上海稀宇极智科技有限公司 | Method and device for acquiring audio text pairs, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112037769B (en) | 2024-07-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |