CN112037769A - Training data generation method and device and computer readable storage medium

Training data generation method and device and computer readable storage medium

Info

Publication number
CN112037769A
CN112037769A
Authority
CN
China
Prior art keywords
information
text
training
audio
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010738406.5A
Other languages
Chinese (zh)
Other versions
CN112037769B (en)
Inventor
陈晓宇
张彬彬
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202010738406.5A
Publication of CN112037769A
Application granted
Publication of CN112037769B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a training data generation method and device and a computer-readable storage medium. The training data generation method comprises the following steps: receiving audio information and corresponding labeled text information; generating speech recognition text information and first timestamp information corresponding to the audio information; performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information. By acquiring original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.

Description

Training data generation method and device and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a training data generation method and apparatus and a computer-readable storage medium.
Background
Training a speech recognition system requires a large amount of training data consisting of speech and labeled text. In existing schemes for acquiring such data, speech is first collected and annotators then label the corresponding text through a speech labeling system; alternatively, a large amount of text is specified and different speakers record speech according to the specified text. Training data for the speech recognition system is thus obtained through extensive manual recording and labeling. These schemes consume a great deal of manpower and time, and acquiring a large amount of high-quality speech training data is very difficult, so training sets for speech recognition systems remain scarce.
Disclosure of Invention
The embodiments of the present invention provide a training data generation method and device and a computer-readable storage medium, which achieve the technical effects of efficiently acquiring a large amount of high-quality speech training data while reducing cost.
One aspect of the present invention provides a training data generation method, including: receiving audio information and corresponding labeled text information; generating speech recognition text information and first timestamp information corresponding to the audio information; performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In one embodiment, performing content matching on the labeled text information and the speech recognition text information comprises: performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm; and performing text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In one embodiment, generating the second timestamp information corresponding to the labeled text information according to the first timestamp information includes: acquiring, from the first timestamp information, the start and end timestamp information corresponding to each character/word in the speech recognition text information; and, for each character/word in the labeled text information, copying the start and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
In one possible embodiment, before content matching is performed on the labeled text information and the speech recognition text information, the method includes: acquiring, through the speech recognition system, a confidence corresponding to each character/word in the speech recognition text information; and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
In one implementation, acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information includes: dividing the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquiring from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information; and splitting the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
In one embodiment, before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further comprises: inputting the labeled text information into a language model of the speech recognition system for training, or dynamically increasing the probability value of the labeled text information while the speech recognition system decodes.
Another aspect of the present invention provides a training data generation device, including: an information receiving module, configured to receive audio information and corresponding labeled text information; a first information generation module, configured to generate speech recognition text information and first timestamp information corresponding to the audio information; a second information generation module, configured to perform content matching on the labeled text information and the speech recognition text information and generate second timestamp information corresponding to the labeled text information according to the first timestamp information; and a training data generation module, configured to acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In one implementation, the second information generation module is specifically configured to: perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm; and perform text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In one implementation, the training data generation module is specifically configured to: divide the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquire from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information; and split the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform any of the training data generation methods described above.
In the embodiments of the present invention, the original audio information and labeled text information are acquired, and the timestamp information of the audio is used to extract multiple pieces of sub-audio training information and corresponding sub-text training information from them, so that a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flow chart illustrating an implementation of a training data generation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by those skilled in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of a training data generation method according to an embodiment of the present invention.
As shown in Fig. 1, one aspect of the present invention provides a training data generation method, including:
Step 101, receiving audio information and corresponding labeled text information;
Step 102, generating speech recognition text information and first timestamp information corresponding to the audio information;
Step 103, performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
Step 104, acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In this embodiment, in step 101, the audio information and corresponding labeled text information are preferably long audio with long labeled text, such as audiobooks, speech recordings, or interview transcripts; they may be crawled from the network through crawler technology or obtained from a local database.
In step 102, the speech recognition text information and the first timestamp information may be obtained by inputting the received audio information into an existing speech recognition system, or through manual measurement and annotation. The first timestamp information includes the start and end timestamp information corresponding to each character or word in the speech recognition text information. For example, suppose the labeled text information is "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" and the speech recognition text information is "the weather is very hot, the glaciers at the South Pole of the Earth are all exposed"; the timestamp information may then be:
the weather is very hot: [weather, 19.83, 20.49], [very, 20.49, 20.79], [hot, 20.79, 21.00];
the South Pole of the Earth: [Earth, 21.90, 22.05], [South Pole, 22.05, 22.62];
the glaciers are exposed: [glaciers, 23.67, 24.00], [exposed, 24.00, 24.24].
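For illustration, such word-level timestamp information could be represented as a simple list of (token, start, end) triples. This structure is an assumption made for the sketches that follow, not a format prescribed by the patent:

```python
# Hypothetical in-memory form of the "first timestamp information" above:
# one (token, start_seconds, end_seconds) triple per recognized token.
# The values are the illustrative times from the example.
first_timestamps = [
    ("weather", 19.83, 20.49), ("very", 20.49, 20.79), ("hot", 20.79, 21.00),
    ("Earth", 21.90, 22.05), ("South Pole", 22.05, 22.62),
    ("glaciers", 23.67, 24.00), ("exposed", 24.00, 24.24),
]
```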
In step 103, content matching is performed on the labeled text information and the speech recognition text information, and the second timestamp information corresponding to the labeled text information is generated accordingly.
Next, in step 104, the sub-text training information in the labeled text information and the sub-audio training information in the audio information are acquired according to the second timestamp information.
In this way, by acquiring the original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
In one embodiment, performing content matching on the labeled text information and the speech recognition text information includes:
performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and performing text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In this embodiment, the edit distance between two strings is the minimum number of edit operations required to transform one string into the other; the greater the distance, the lower the similarity between the strings.
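As a concrete illustration (the patent does not mandate a particular implementation), the standard dynamic-programming form of the edit distance, plus a derived similarity score, can be written as:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                  # cost of turning a[:i] into ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: larger distance means lower similarity."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
```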
For similarity matching, the labeled text information and the speech recognition text information may each be split into multiple long-sentence units or word-level sentence units according to punctuation marks or using a word segmentation tool; the units of the two texts are then matched pairwise by similarity, and the two units with the highest similarity are considered a match.
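A minimal sketch of this pairing step, reusing the edit_distance/similarity helpers above and assuming plain punctuation-based splitting (the patent leaves the splitting tool open):

```python
import re

def split_units(text):
    """Split a text into sentence-like units at common punctuation."""
    return [u.strip() for u in re.split(r"[,.;!?]", text) if u.strip()]

def match_units(labeled_text, recognized_text):
    """Pair each labeled unit with the most similar recognized unit
    (assumes both texts are non-empty)."""
    recognized_units = split_units(recognized_text)
    return [(lu, max(recognized_units, key=lambda ru: similarity(lu, ru)))
            for lu in split_units(labeled_text)]
```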
Text alignment is then performed. In this process, an existing word segmentation tool may be used to segment the matched speech recognition text into characters/words; then, taking the labeled text information as the reference, each character/word of the speech recognition text is mapped to the corresponding character/word of the labeled text, and unaligned positions may be filled by inserting specific symbols, thereby completing the content matching. For the example given above, the alignment can be expressed as:
Labeled text: the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed;
Recognized text: the weather is very hot __ the glaciers at the South Pole of the Earth are all exposed (the underline symbol "__" marks a blank at an unaligned position).
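One way to compute such an alignment at the token level is with difflib from the Python standard library. This is a sketch under the assumption of word-level tokens; it is not the tool prescribed by the patent:

```python
from difflib import SequenceMatcher

GAP = "__"  # stand-in for the "specific symbols" used to fill blanks

def align_tokens(labeled, recognized):
    """Return (gap-padded recognized tokens, labeled->recognized index map)."""
    sm = SequenceMatcher(a=labeled, b=recognized)
    padded = [GAP] * len(labeled)
    mapping = {}  # labeled index -> recognized index
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
            padded[block.a + k] = recognized[block.b + k]
    return padded, mapping

labeled = "the weather is very hot and the glaciers have all collapsed".split()
recognized = "the weather is very hot the glaciers are all exposed".split()
padded, mapping = align_tokens(labeled, recognized)
print(" ".join(padded))  # gaps appear where no recognized token matched
```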
In one embodiment, generating the second timestamp information corresponding to the labeled text information according to the first timestamp information includes:
acquiring, from the first timestamp information, the start and end timestamp information corresponding to each character/word in the speech recognition text information;
and, for each character/word in the labeled text information, copying the start and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
In this embodiment, the second timestamp information is generated as follows:
after content matching is completed, the start and end timestamp information corresponding to each character/word in the speech recognition text information is obtained, and the obtained start and end timestamps are copied, by character index, to the character/word at the corresponding index position in the labeled text information, thereby generating the second timestamp information corresponding to the labeled text information.
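Continuing the sketch above, and assuming the first_timestamps list and the mapping returned by align_tokens, the copy step could look like:

```python
def transfer_timestamps(mapping, n_labeled, first_timestamps):
    """Build the 'second timestamp information': for each labeled-text
    position, copy the (start, end) pair of its matched recognized
    token; positions without a match get None."""
    second = [None] * n_labeled
    for li, ri in mapping.items():
        _token, start, end = first_timestamps[ri]
        second[li] = (start, end)
    return second
```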
In one possible embodiment, before content matching is performed on the labeled text information and the speech recognition text information, the method includes:
acquiring, through the speech recognition system, a confidence corresponding to each character/word in the speech recognition text information;
and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
In this embodiment, the labeled text information obtained in step 101 may contain errors; for example, "exposed" in "the weather is very hot, and the glaciers at the South Pole of the Earth are all exposed" may be such an error. Therefore, before step 103 is performed, the confidence of each character/word is obtained from the speech recognition system while the speech recognition text information is generated.
If the confidence of a character/word exceeds a preset threshold, that character/word is considered highly reliable. It is then checked whether the corresponding character/word in the labeled text information is consistent with it; if not, the character/word in the labeled text information is replaced, e.g., "the weather is very hot, and the glaciers at the South Pole of the Earth are all exposed" is corrected to "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed". This step reduces the amount of computation in the subsequent edit distance algorithm and thus further improves efficiency.
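A sketch of this correction pass follows. How labeled positions are paired with recognized tokens before full alignment is not specified in the text, so this assumes some coarse correspondence is available (here reused as a labeled-index to recognized-index mapping); the 0.95 threshold is purely illustrative:

```python
def correct_labels(labeled_tokens, recognized_tokens, mapping,
                   confidences, threshold=0.95):
    """Replace a labeled token with its recognized counterpart when the
    recognizer is highly confident and the two tokens disagree.
    confidences[i] is the recognizer's confidence for recognized token i."""
    corrected = list(labeled_tokens)
    for li, ri in mapping.items():
        if confidences[ri] >= threshold and corrected[li] != recognized_tokens[ri]:
            corrected[li] = recognized_tokens[ri]
    return corrected
```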
In one implementation, acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information includes:
dividing the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquiring from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and splitting the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
In this embodiment, the specific process of step 104 is as follows:
After the second timestamp information is generated, the index positions of punctuation marks in the labeled text information are detected, or the index positions at which to cut are located according to a designated number of characters, and the labeled text information is divided into multiple pieces of sub-text training information at these index positions. For example, "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" is divided into "the weather is very hot" and "the glaciers at the South Pole of the Earth have all collapsed".
The start and end timestamp information of each piece of sub-text training information is then obtained from the second timestamp information, e.g., "the weather is very hot" [19.83, 21.00].
Finally, the audio information is split according to the start and end timestamp information of the pieces of sub-text training information, yielding multiple pieces of sub-audio information corresponding to them; each piece of sub-text training information together with its corresponding sub-audio information is used as training data.
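As one concrete possibility (assuming uncompressed WAV input; the patent does not fix an audio format or library), the cutting step can be done with the standard-library wave module, taking each segment's start from its first timestamped token and its end from its last:

```python
import wave

def split_wav(path, segments, out_prefix="clip"):
    """Cut a WAV file into sub-audio clips. segments is a list of
    (start_seconds, end_seconds) pairs derived from the second timestamp
    information, e.g. [(19.83, 21.00)] for "the weather is very hot"."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        bytes_per_frame = src.getsampwidth() * src.getnchannels()
        data = src.readframes(src.getnframes())
    for i, (start, end) in enumerate(segments):
        lo = int(start * rate) * bytes_per_frame
        hi = int(end * rate) * bytes_per_frame
        with wave.open(f"{out_prefix}_{i}.wav", "wb") as dst:
            dst.setparams(params)          # nframes is fixed up on close
            dst.writeframes(data[lo:hi])
```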
In one embodiment, before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further comprises:
inputting the labeled text information into a language model of the speech recognition system for training, or dynamically increasing the probability value of the labeled text information while the speech recognition system decodes.
In this embodiment, considering that the accuracy of the speech recognition text information obtained from the speech recognition system may not be high, before step 102 is executed, the obtained labeled text information is input into a language model of the speech recognition system for training, or the probability value of generating the labeled text information is dynamically increased while the speech recognition system decodes the audio information, thereby improving the accuracy with which the system recognizes the audio.
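The decoding-time biasing can be read as a shallow-fusion-style rescoring; the toy sketch below works under that assumption, rescoring n-best hypotheses by rewarding n-grams shared with the labeled text. The function name, bonus value, and n-best interface are all illustrative; a production system would bias the decoder's language model directly:

```python
def boost_labeled_hypotheses(hypotheses, labeled_text, bonus=2.0, n=2):
    """Rescore (text, log_score) decoding hypotheses: add a bonus for
    every n-gram a hypothesis shares with the labeled text, then sort
    best-first."""
    words = labeled_text.split()
    labeled_ngrams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    rescored = []
    for text, score in hypotheses:
        toks = text.split()
        hits = sum(tuple(toks[i:i + n]) in labeled_ngrams
                   for i in range(len(toks) - n + 1))
        rescored.append((text, score + bonus * hits))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```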
Fig. 2 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention.
As shown in fig. 2, another aspect of the present invention provides a training data generating apparatus, including:
an information receiving module 201, configured to receive audio information and corresponding labeled text information;
a first information generating module 202 for generating speech recognition text information and first time stamp information corresponding to the audio information;
the second information generation module 203 is configured to perform content matching on the labeled text information and the speech recognition text information, and generate second timestamp information corresponding to the labeled text information according to the first timestamp information;
the training data generating module 204 is configured to obtain sub-text training information in the labeled text information and sub-audio training information in the audio information according to the second timestamp information.
In this embodiment, in the information receiving module 201, the audio information and corresponding labeled text information are preferably long audio with long labeled text, such as audiobooks, speech recordings, or interview transcripts; they may be crawled from the network through crawler technology or obtained from a local database.
In the first information generation module 202, the speech recognition text information and the first timestamp information may be obtained by inputting the received audio information into an existing speech recognition system, or through manual measurement and annotation. The first timestamp information includes the start and end timestamp information corresponding to each character or word in the speech recognition text information. For example, suppose the labeled text information is "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" and the speech recognition text information is "the weather is very hot, the glaciers at the South Pole of the Earth are all exposed"; the timestamp information may then be:
the weather is very hot: [weather, 19.83, 20.49], [very, 20.49, 20.79], [hot, 20.79, 21.00];
the South Pole of the Earth: [Earth, 21.90, 22.05], [South Pole, 22.05, 22.62];
the glaciers are exposed: [glaciers, 23.67, 24.00], [exposed, 24.00, 24.24].
In the second information generation module 203, content matching is performed on the labeled text information and the speech recognition text information, and the second timestamp information corresponding to the labeled text information is generated accordingly.
Next, in the training data generating module 204, according to the second timestamp information, the sub-text training information in the labeled text information and the sub-audio training information in the audio information are obtained.
In this way, by acquiring the original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
In one implementation, the second information generation module 203 is specifically configured to:
perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and perform text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In this embodiment, the edit distance between two strings is the minimum number of edit operations required to transform one string into the other; the greater the distance, the lower the similarity between the strings.
For similarity matching, the labeled text information and the speech recognition text information may each be split into multiple long-sentence units or word-level sentence units according to punctuation marks or using a word segmentation tool; the units of the two texts are then matched pairwise by similarity, and the two units with the highest similarity are considered a match.
Text alignment is then performed. In this process, an existing word segmentation tool may be used to segment the matched speech recognition text into characters/words; then, taking the labeled text information as the reference, each character/word of the speech recognition text is mapped to the corresponding character/word of the labeled text, and unaligned positions may be filled by inserting specific symbols, thereby completing the content matching. For the example given above, the alignment can be expressed as:
Labeled text: the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed;
Recognized text: the weather is very hot __ the glaciers at the South Pole of the Earth are all exposed (the underline symbol "__" marks a blank at an unaligned position).
In one implementation, the training data generation module 204 is specifically configured to:
divide the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquire from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and split the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
In this embodiment, the training data generation module 204 operates as follows:
After the second timestamp information is generated, the index positions of punctuation marks in the labeled text information are detected, or the index positions at which to cut are located according to a designated number of characters, and the labeled text information is divided into multiple pieces of sub-text training information at these index positions. For example, "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" is divided into "the weather is very hot" and "the glaciers at the South Pole of the Earth have all collapsed".
The start and end timestamp information of each piece of sub-text training information is then obtained from the second timestamp information, e.g., "the weather is very hot" [19.83, 21.00].
Finally, the audio information is split according to the start and end timestamp information of the pieces of sub-text training information, yielding multiple pieces of sub-audio information corresponding to them; each piece of sub-text training information together with its corresponding sub-audio information is used as training data.
In another aspect, the present invention provides a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any of the above-described training data generation methods.
In an embodiment of the present invention, a computer-readable storage medium includes a set of computer-executable instructions that, when executed, receive audio information and corresponding labeled text information; generate speech recognition text information and first timestamp information corresponding to the audio information; perform content matching on the labeled text information and the speech recognition text information, and generate second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information. In this way, by acquiring the original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of generating training data, the method comprising:
receiving audio information and corresponding labeled text information;
generating speech recognition text information and first timestamp information corresponding to the audio information;
performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
2. The method of claim 1, wherein performing content matching on the labeled text information and the speech recognition text information comprises:
performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and performing text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
3. The method of claim 2, wherein generating the second timestamp information corresponding to the labeled text information according to the first timestamp information comprises:
acquiring, from the first timestamp information, the start and end timestamp information corresponding to each character/word in the speech recognition text information;
and, for each character/word in the labeled text information, copying the start and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
4. The method of claim 2, wherein before content matching is performed on the labeled text information and the speech recognition text information, the method comprises:
acquiring, through a speech recognition system, a confidence corresponding to each character/word in the speech recognition text information;
and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
5. The method of claim 1, wherein acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information comprises:
dividing the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquiring from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and splitting the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
6. The method of claim 1, wherein before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further comprises:
inputting the labeled text information into a language model of a speech recognition system for training, or dynamically increasing the probability value of the labeled text information while the speech recognition system decodes.
7. An apparatus for generating training data, the apparatus comprising:
an information receiving module, configured to receive audio information and corresponding labeled text information;
a first information generation module, configured to generate speech recognition text information and first timestamp information corresponding to the audio information;
a second information generation module, configured to perform content matching on the labeled text information and the speech recognition text information, and generate second timestamp information corresponding to the labeled text information according to the first timestamp information;
and a training data generation module, configured to acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
8. The apparatus of claim 7, wherein the second information generation module is specifically configured to:
perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and perform text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
9. The apparatus of claim 8, wherein the training data generation module is specifically configured to:
divide the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquire from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and split the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the training data generation method of any of claims 1-6.
CN202010738406.5A 2020-07-28 2020-07-28 Training data generation method and device and computer readable storage medium Active CN112037769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010738406.5A CN112037769B (en) 2020-07-28 2020-07-28 Training data generation method and device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN112037769A 2020-12-04
CN112037769B CN112037769B (en) 2024-07-30

Family

ID=73583359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010738406.5A Active CN112037769B (en) 2020-07-28 2020-07-28 Training data generation method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112037769B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
CN110310626A (en) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing
CN110516110A (en) * 2019-07-22 2019-11-29 平安科技(深圳)有限公司 Song generation method, device, computer equipment and storage medium
CN111091834A (en) * 2019-12-23 2020-05-01 科大讯飞股份有限公司 Text and audio alignment method and related product

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599152A (en) * 2021-03-05 2021-04-02 北京智慧星光信息技术有限公司 Voice data labeling method, system, electronic equipment and storage medium
CN113129935A (en) * 2021-06-16 2021-07-16 北京新唐思创教育科技有限公司 Audio dotting data acquisition method and device, storage medium and electronic equipment
CN113539241A (en) * 2021-07-28 2021-10-22 广州华多网络科技有限公司 Speech recognition correction method and corresponding device, equipment and medium
CN113539241B (en) * 2021-07-28 2023-04-25 广州华多网络科技有限公司 Speech recognition correction method and corresponding device, equipment and medium thereof
CN117594060A (en) * 2023-10-31 2024-02-23 北京邮电大学 Audio signal content analysis method, device, equipment and storage medium
CN117975934A (en) * 2023-12-31 2024-05-03 上海稀宇极智科技有限公司 Method and device for acquiring audio text pairs, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112037769B (en) 2024-07-30

Similar Documents

Publication Publication Date Title
CN112037769B (en) Training data generation method and device and computer readable storage medium
CN107657947B (en) Speech processing method and device based on artificial intelligence
CN109800407B (en) Intention recognition method and device, computer equipment and storage medium
CN101996631B (en) Method and device for aligning texts
CN106570180B (en) Voice search method and device based on artificial intelligence
US6975985B2 (en) Method and system for the automatic amendment of speech recognition vocabularies
CN110175334B (en) Text knowledge extraction system and method based on custom knowledge slot structure
JP2008148322A (en) Method for processing character encoding, and system
CN113626598B (en) Video text generation method, device, equipment and storage medium
CN110119510B (en) Relationship extraction method and device based on transfer dependency relationship and structure auxiliary word
CN111091834B (en) Text and audio alignment method and related product
CN111292751A (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN112259083B (en) Audio processing method and device
CN111881297A (en) Method and device for correcting voice recognition text
CN112101003B (en) Sentence text segmentation method, device and equipment and computer readable storage medium
CN113593522B (en) Voice data labeling method and device
CN112633001A (en) Text named entity recognition method and device, electronic equipment and storage medium
CN111931020A (en) Formula labeling method, device, equipment and storage medium
CN109213970B (en) Method and device for generating notes
CN113761137B (en) Method and device for extracting address information
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN104834740A (en) Full-automatic audio/video structuralized accurate searching method
CN116644737A (en) Proper noun error correction method based on automatic word stock updating and prefix tree structure
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN109344389A (en) A kind of construction method and system of the blind control bilingualism corpora of the Chinese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant