CN112037769A - Training data generation method and device and computer readable storage medium

Training data generation method and device and computer readable storage medium

Info

Publication number
CN112037769A
CN112037769A
Authority
CN
China
Prior art keywords
information
text
training
audio
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010738406.5A
Other languages
Chinese (zh)
Other versions
CN112037769B (en)
Inventor
陈晓宇
张彬彬
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202010738406.5A
Publication of CN112037769A
Application granted
Publication of CN112037769B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a training data generation method and device and a computer-readable storage medium. The training data generation method comprises the following steps: receiving audio information and corresponding labeled text information; generating speech recognition text information and first timestamp information corresponding to the audio information; performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information. By acquiring original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.

Description

Training data generation method and device and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a training data generation method and apparatus and a computer-readable storage medium.
Background
Training a speech recognition system requires a large amount of training data consisting of speech and labeled text. In existing schemes for acquiring such data, speech is first collected and annotators then label the corresponding text through a speech labeling system; alternatively, a large amount of text is specified and different speakers record speech according to the specified text. Training data for the speech recognition system is thus obtained through extensive manual recording and labeling. These schemes consume a great deal of manpower and time, and acquiring a large amount of high-quality speech training data is very difficult, so training sets for speech recognition systems remain scarce.
Disclosure of Invention
The embodiments of the present invention provide a training data generation method and device and a computer-readable storage medium, which achieve the technical effects of efficiently acquiring a large amount of high-quality speech training data while reducing cost.
One aspect of the present invention provides a training data generation method, including: receiving audio information and corresponding labeled text information; generating speech recognition text information and first timestamp information corresponding to the audio information; performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In one embodiment, performing content matching on the labeled text information and the speech recognition text information comprises: performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm; and performing text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In one embodiment, generating the second timestamp information corresponding to the labeled text information according to the first timestamp information includes: acquiring, from the first timestamp information, the start and end timestamp information corresponding to each character/word in the speech recognition text information; and, for each character/word in the labeled text information, copying the start and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
In one possible embodiment, before content matching is performed on the labeled text information and the speech recognition text information, the method includes: acquiring, through the speech recognition system, a confidence corresponding to each character/word in the speech recognition text information; and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
In one implementation, acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information includes: dividing the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquiring from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information; and splitting the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
In one embodiment, before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further comprises: inputting the labeled text information into a language model of the speech recognition system for training, or dynamically increasing the probability value of the labeled text information while the speech recognition system decodes.
Another aspect of the present invention provides a training data generation device, including: an information receiving module, configured to receive audio information and corresponding labeled text information; a first information generation module, configured to generate speech recognition text information and first timestamp information corresponding to the audio information; a second information generation module, configured to perform content matching on the labeled text information and the speech recognition text information and generate second timestamp information corresponding to the labeled text information according to the first timestamp information; and a training data generation module, configured to acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In one implementation, the second information generation module is specifically configured to: perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm; and perform text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In one implementation, the training data generation module is specifically configured to: divide the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquire from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information; and split the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform any of the training data generation methods described above.
In the embodiments of the present invention, the original audio information and labeled text information are acquired, and the timestamp information of the audio is used to extract multiple pieces of sub-audio training information and corresponding sub-text training information from them, so that a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flow chart illustrating an implementation of a training data generation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by those skilled in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of a training data generation method according to an embodiment of the present invention.
As shown in Fig. 1, one aspect of the present invention provides a training data generation method, including:
Step 101, receiving audio information and corresponding labeled text information;
Step 102, generating speech recognition text information and first timestamp information corresponding to the audio information;
Step 103, performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
Step 104, acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
In this embodiment, in step 101, the audio information and corresponding labeled text information are preferably long audio with long labeled text, such as audiobooks, speech recordings, or interview transcripts; they may be crawled from the network through crawler technology or obtained from a local database.
In step 102, the speech recognition text information and the first timestamp information may be obtained by inputting the received audio information into an existing speech recognition system, or through manual measurement and annotation. The first timestamp information includes the start and end timestamp information corresponding to each character or word in the speech recognition text information. For example, suppose the labeled text information is "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" and the speech recognition text information is "the weather is very hot, the glaciers at the South Pole of the Earth are all exposed"; the timestamp information may then be:
the weather is very hot: [weather, 19.83, 20.49], [very, 20.49, 20.79], [hot, 20.79, 21.00];
the South Pole of the Earth: [Earth, 21.90, 22.05], [South Pole, 22.05, 22.62];
the glaciers are exposed: [glaciers, 23.67, 24.00], [exposed, 24.00, 24.24].
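For illustration, such word-level timestamp information could be represented as a simple list of (token, start, end) triples. This structure is an assumption made for the sketches that follow, not a format prescribed by the patent:

```python
# Hypothetical in-memory form of the "first timestamp information" above:
# one (token, start_seconds, end_seconds) triple per recognized token.
# The values are the illustrative times from the example.
first_timestamps = [
    ("weather", 19.83, 20.49), ("very", 20.49, 20.79), ("hot", 20.79, 21.00),
    ("Earth", 21.90, 22.05), ("South Pole", 22.05, 22.62),
    ("glaciers", 23.67, 24.00), ("exposed", 24.00, 24.24),
]
```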
In step 103, content matching is performed on the labeled text information and the speech recognition text information, and the second timestamp information corresponding to the labeled text information is generated accordingly.
Next, in step 104, the sub-text training information in the labeled text information and the sub-audio training information in the audio information are acquired according to the second timestamp information.
In this way, by acquiring the original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
In one embodiment, performing content matching on the labeled text information and the speech recognition text information includes:
performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and performing text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In this embodiment, the edit distance between two strings is the minimum number of edit operations required to transform one string into the other; the greater the distance, the lower the similarity between the strings.
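As a concrete illustration (the patent does not mandate a particular implementation), the standard dynamic-programming form of the edit distance, plus a derived similarity score, can be written as:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                  # cost of turning a[:i] into ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: larger distance means lower similarity."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
```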
For similarity matching, the labeled text information and the speech recognition text information may each be split into multiple long-sentence units or word-level sentence units according to punctuation marks or using a word segmentation tool; the units of the two texts are then matched pairwise by similarity, and the two units with the highest similarity are considered a match.
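A minimal sketch of this pairing step, reusing the edit_distance/similarity helpers above and assuming plain punctuation-based splitting (the patent leaves the splitting tool open):

```python
import re

def split_units(text):
    """Split a text into sentence-like units at common punctuation."""
    return [u.strip() for u in re.split(r"[,.;!?]", text) if u.strip()]

def match_units(labeled_text, recognized_text):
    """Pair each labeled unit with the most similar recognized unit
    (assumes both texts are non-empty)."""
    recognized_units = split_units(recognized_text)
    return [(lu, max(recognized_units, key=lambda ru: similarity(lu, ru)))
            for lu in split_units(labeled_text)]
```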
Text alignment is then performed. In this process, an existing word segmentation tool may be used to segment the matched speech recognition text into characters/words; then, taking the labeled text information as the reference, each character/word of the speech recognition text is mapped to the corresponding character/word of the labeled text, and unaligned positions may be filled by inserting specific symbols, thereby completing the content matching. For the example given above, the alignment can be expressed as:
Labeled text: the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed;
Recognized text: the weather is very hot __ the glaciers at the South Pole of the Earth are all exposed (the underline symbol "__" marks a blank at an unaligned position).
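One way to compute such an alignment at the token level is with difflib from the Python standard library. This is a sketch under the assumption of word-level tokens; it is not the tool prescribed by the patent:

```python
from difflib import SequenceMatcher

GAP = "__"  # stand-in for the "specific symbols" used to fill blanks

def align_tokens(labeled, recognized):
    """Return (gap-padded recognized tokens, labeled->recognized index map)."""
    sm = SequenceMatcher(a=labeled, b=recognized)
    padded = [GAP] * len(labeled)
    mapping = {}  # labeled index -> recognized index
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
            padded[block.a + k] = recognized[block.b + k]
    return padded, mapping

labeled = "the weather is very hot and the glaciers have all collapsed".split()
recognized = "the weather is very hot the glaciers are all exposed".split()
padded, mapping = align_tokens(labeled, recognized)
print(" ".join(padded))  # gaps appear where no recognized token matched
```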
In one embodiment, generating the second timestamp information corresponding to the labeled text information according to the first timestamp information includes:
acquiring, from the first timestamp information, the start and end timestamp information corresponding to each character/word in the speech recognition text information;
and, for each character/word in the labeled text information, copying the start and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
In this embodiment, the second timestamp information is generated as follows:
after content matching is completed, the start and end timestamp information corresponding to each character/word in the speech recognition text information is obtained, and the obtained start and end timestamps are copied, by character index, to the character/word at the corresponding index position in the labeled text information, thereby generating the second timestamp information corresponding to the labeled text information.
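Continuing the sketch above, and assuming the first_timestamps list and the mapping returned by align_tokens, the copy step could look like:

```python
def transfer_timestamps(mapping, n_labeled, first_timestamps):
    """Build the 'second timestamp information': for each labeled-text
    position, copy the (start, end) pair of its matched recognized
    token; positions without a match get None."""
    second = [None] * n_labeled
    for li, ri in mapping.items():
        _token, start, end = first_timestamps[ri]
        second[li] = (start, end)
    return second
```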
In one possible embodiment, before content matching is performed on the labeled text information and the speech recognition text information, the method includes:
acquiring, through the speech recognition system, a confidence corresponding to each character/word in the speech recognition text information;
and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
In this embodiment, the labeled text information obtained in step 101 may contain errors; for example, "exposed" in "the weather is very hot, and the glaciers at the South Pole of the Earth are all exposed" may be such an error. Therefore, before step 103 is performed, the confidence of each character/word is obtained from the speech recognition system while the speech recognition text information is generated.
If the confidence of a character/word exceeds a preset threshold, that character/word is considered highly reliable. It is then checked whether the corresponding character/word in the labeled text information is consistent with it; if not, the character/word in the labeled text information is replaced, e.g., "the weather is very hot, and the glaciers at the South Pole of the Earth are all exposed" is corrected to "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed". This step reduces the amount of computation in the subsequent edit distance algorithm and thus further improves efficiency.
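A sketch of this correction pass follows. How labeled positions are paired with recognized tokens before full alignment is not specified in the text, so this assumes some coarse correspondence is available (here reused as a labeled-index to recognized-index mapping); the 0.95 threshold is purely illustrative:

```python
def correct_labels(labeled_tokens, recognized_tokens, mapping,
                   confidences, threshold=0.95):
    """Replace a labeled token with its recognized counterpart when the
    recognizer is highly confident and the two tokens disagree.
    confidences[i] is the recognizer's confidence for recognized token i."""
    corrected = list(labeled_tokens)
    for li, ri in mapping.items():
        if confidences[ri] >= threshold and corrected[li] != recognized_tokens[ri]:
            corrected[li] = recognized_tokens[ri]
    return corrected
```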
In one implementation, acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information includes:
dividing the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquiring from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and splitting the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
In this embodiment, the specific process of step 104 is as follows:
After the second timestamp information is generated, the index positions of punctuation marks in the labeled text information are detected, or the index positions at which to cut are located according to a designated number of characters, and the labeled text information is divided into multiple pieces of sub-text training information at these index positions. For example, "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" is divided into "the weather is very hot" and "the glaciers at the South Pole of the Earth have all collapsed".
The start and end timestamp information of each piece of sub-text training information is then obtained from the second timestamp information, e.g., "the weather is very hot" [19.83, 21.00].
Finally, the audio information is split according to the start and end timestamp information of the pieces of sub-text training information, yielding multiple pieces of sub-audio information corresponding to them; each piece of sub-text training information together with its corresponding sub-audio information is used as training data.
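As one concrete possibility (assuming uncompressed WAV input; the patent does not fix an audio format or library), the cutting step can be done with the standard-library wave module, taking each segment's start from its first timestamped token and its end from its last:

```python
import wave

def split_wav(path, segments, out_prefix="clip"):
    """Cut a WAV file into sub-audio clips. segments is a list of
    (start_seconds, end_seconds) pairs derived from the second timestamp
    information, e.g. [(19.83, 21.00)] for "the weather is very hot"."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        bytes_per_frame = src.getsampwidth() * src.getnchannels()
        data = src.readframes(src.getnframes())
    for i, (start, end) in enumerate(segments):
        lo = int(start * rate) * bytes_per_frame
        hi = int(end * rate) * bytes_per_frame
        with wave.open(f"{out_prefix}_{i}.wav", "wb") as dst:
            dst.setparams(params)          # nframes is fixed up on close
            dst.writeframes(data[lo:hi])
```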
In one embodiment, before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further comprises:
inputting the labeled text information into a language model of the speech recognition system for training, or dynamically increasing the probability value of the labeled text information while the speech recognition system decodes.
In this embodiment, considering that the accuracy of the speech recognition text information obtained from the speech recognition system may not be high, before step 102 is executed, the obtained labeled text information is input into a language model of the speech recognition system for training, or the probability value of generating the labeled text information is dynamically increased while the speech recognition system decodes the audio information, thereby improving the accuracy with which the system recognizes the audio.
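The decoding-time biasing can be read as a shallow-fusion-style rescoring; the toy sketch below works under that assumption, rescoring n-best hypotheses by rewarding n-grams shared with the labeled text. The function name, bonus value, and n-best interface are all illustrative; a production system would bias the decoder's language model directly:

```python
def boost_labeled_hypotheses(hypotheses, labeled_text, bonus=2.0, n=2):
    """Rescore (text, log_score) decoding hypotheses: add a bonus for
    every n-gram a hypothesis shares with the labeled text, then sort
    best-first."""
    words = labeled_text.split()
    labeled_ngrams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    rescored = []
    for text, score in hypotheses:
        toks = text.split()
        hits = sum(tuple(toks[i:i + n]) in labeled_ngrams
                   for i in range(len(toks) - n + 1))
        rescored.append((text, score + bonus * hits))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```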
Fig. 2 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention.
As shown in fig. 2, another aspect of the present invention provides a training data generating apparatus, including:
an information receiving module 201, configured to receive audio information and corresponding labeled text information;
a first information generating module 202 for generating speech recognition text information and first time stamp information corresponding to the audio information;
the second information generation module 203 is configured to perform content matching on the labeled text information and the speech recognition text information, and generate second timestamp information corresponding to the labeled text information according to the first timestamp information;
the training data generating module 204 is configured to obtain sub-text training information in the labeled text information and sub-audio training information in the audio information according to the second timestamp information.
In this embodiment, in the information receiving module 201, the audio information and corresponding labeled text information are preferably long audio with long labeled text, such as audiobooks, speech recordings, or interview transcripts; they may be crawled from the network through crawler technology or obtained from a local database.
In the first information generation module 202, the speech recognition text information and the first timestamp information may be obtained by inputting the received audio information into an existing speech recognition system, or through manual measurement and annotation. The first timestamp information includes the start and end timestamp information corresponding to each character or word in the speech recognition text information. For example, suppose the labeled text information is "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" and the speech recognition text information is "the weather is very hot, the glaciers at the South Pole of the Earth are all exposed"; the timestamp information may then be:
the weather is very hot: [weather, 19.83, 20.49], [very, 20.49, 20.79], [hot, 20.79, 21.00];
the South Pole of the Earth: [Earth, 21.90, 22.05], [South Pole, 22.05, 22.62];
the glaciers are exposed: [glaciers, 23.67, 24.00], [exposed, 24.00, 24.24].
In the second information generation module 203, content matching is performed on the labeled text information and the speech recognition text information, and the second timestamp information corresponding to the labeled text information is generated accordingly.
Next, in the training data generating module 204, according to the second timestamp information, the sub-text training information in the labeled text information and the sub-audio training information in the audio information are obtained.
In this way, by acquiring the original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
In one implementation, the second information generation module 203 is specifically configured to:
perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and perform text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
In this embodiment, the edit distance between two strings is the minimum number of edit operations required to transform one string into the other; the greater the distance, the lower the similarity between the strings.
For similarity matching, the labeled text information and the speech recognition text information may each be split into multiple long-sentence units or word-level sentence units according to punctuation marks or using a word segmentation tool; the units of the two texts are then matched pairwise by similarity, and the two units with the highest similarity are considered a match.
Text alignment is then performed. In this process, an existing word segmentation tool may be used to segment the matched speech recognition text into characters/words; then, taking the labeled text information as the reference, each character/word of the speech recognition text is mapped to the corresponding character/word of the labeled text, and unaligned positions may be filled by inserting specific symbols, thereby completing the content matching. For the example given above, the alignment can be expressed as:
Labeled text: the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed;
Recognized text: the weather is very hot __ the glaciers at the South Pole of the Earth are all exposed (the underline symbol "__" marks a blank at an unaligned position).
In one implementation, the training data generation module 204 is specifically configured to:
divide the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquire from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and split the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
In this embodiment, the training data generation module 204 operates as follows:
After the second timestamp information is generated, the index positions of punctuation marks in the labeled text information are detected, or the index positions at which to cut are located according to a designated number of characters, and the labeled text information is divided into multiple pieces of sub-text training information at these index positions. For example, "the weather is very hot, and the glaciers at the South Pole of the Earth have all collapsed" is divided into "the weather is very hot" and "the glaciers at the South Pole of the Earth have all collapsed".
The start and end timestamp information of each piece of sub-text training information is then obtained from the second timestamp information, e.g., "the weather is very hot" [19.83, 21.00].
Finally, the audio information is split according to the start and end timestamp information of the pieces of sub-text training information, yielding multiple pieces of sub-audio information corresponding to them; each piece of sub-text training information together with its corresponding sub-audio information is used as training data.
In another aspect, the present invention provides a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any of the above-described training data generation methods.
In an embodiment of the present invention, a computer-readable storage medium includes a set of computer-executable instructions that, when executed, receive audio information and corresponding labeled text information; generate speech recognition text information and first timestamp information corresponding to the audio information; perform content matching on the labeled text information and the speech recognition text information, and generate second timestamp information corresponding to the labeled text information according to the first timestamp information; and acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information. In this way, by acquiring the original audio information and labeled text information and using the timestamp information of the audio to extract multiple pieces of sub-audio training information and corresponding sub-text training information, a large amount of high-quality speech training data is obtained; the process is efficient and the cost is reduced.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of generating training data, the method comprising:
receiving audio information and corresponding labeled text information;
generating speech recognition text information and first timestamp information corresponding to the audio information;
performing content matching on the labeled text information and the speech recognition text information, and generating second timestamp information corresponding to the labeled text information according to the first timestamp information;
and acquiring sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
2. The method of claim 1, wherein performing content matching on the labeled text information and the speech recognition text information comprises:
performing text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and performing text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
3. The method of claim 2, wherein generating the second timestamp information corresponding to the labeled text information according to the first timestamp information comprises:
acquiring, from the first timestamp information, the start and end timestamp information corresponding to each character/word in the speech recognition text information;
and, for each character/word in the labeled text information, copying the start and end timestamp information of the matched character/word in the speech recognition text information, thereby generating the second timestamp information corresponding to the labeled text information.
4. The method of claim 2, wherein before content matching is performed on the labeled text information and the speech recognition text information, the method comprises:
acquiring, through a speech recognition system, a confidence corresponding to each character/word in the speech recognition text information;
and checking and replacing the corresponding character/word in the labeled text information according to the confidence of each character/word.
5. The method of claim 1, wherein acquiring the sub-text training information from the labeled text information and the sub-audio training information from the audio information according to the second timestamp information comprises:
dividing the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquiring from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and splitting the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
6. The method of claim 1, wherein before generating the speech recognition text information and the first timestamp information corresponding to the audio information, the method further comprises:
inputting the labeled text information into a language model of a speech recognition system for training, or dynamically increasing the probability value of the labeled text information while the speech recognition system decodes.
7. An apparatus for generating training data, the apparatus comprising:
an information receiving module, configured to receive audio information and corresponding labeled text information;
a first information generation module, configured to generate speech recognition text information and first timestamp information corresponding to the audio information;
a second information generation module, configured to perform content matching on the labeled text information and the speech recognition text information, and generate second timestamp information corresponding to the labeled text information according to the first timestamp information;
and a training data generation module, configured to acquire sub-text training information from the labeled text information and sub-audio training information from the audio information according to the second timestamp information.
8. The apparatus of claim 7, wherein the second information generation module is specifically configured to:
perform text similarity matching between the labeled text information and the speech recognition text information using an edit distance algorithm;
and perform text alignment on the characters/words in the matched speech recognition text information, with the labeled text information as the reference.
9. The apparatus of claim 8, wherein the training data generation module is specifically configured to:
divide the labeled text information into multiple pieces of sub-text training information according to a set number of characters or designated characters, and acquire from the second timestamp information the start and end timestamp information corresponding to each piece of sub-text training information;
and split the audio information into multiple pieces of sub-audio training information according to the start and end timestamp information corresponding to the pieces of sub-text training information.
10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the training data generation method of any of claims 1-6.
CN202010738406.5A 2020-07-28 2020-07-28 Training data generation method and device and computer readable storage medium Active CN112037769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010738406.5A CN112037769B (en) 2020-07-28 2020-07-28 Training data generation method and device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN112037769A 2020-12-04
CN112037769B CN112037769B (en) 2024-07-30

Family

ID=73583359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010738406.5A Active CN112037769B (en) 2020-07-28 2020-07-28 Training data generation method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112037769B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
CN110310626A (en) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing
CN110516110A (en) * 2019-07-22 2019-11-29 平安科技(深圳)有限公司 Song generation method, device, computer equipment and storage medium
CN111091834A (en) * 2019-12-23 2020-05-01 科大讯飞股份有限公司 Text and audio alignment method and related product

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599152A (en) * 2021-03-05 2021-04-02 北京智慧星光信息技术有限公司 Voice data labeling method, system, electronic equipment and storage medium
CN113129935A (en) * 2021-06-16 2021-07-16 北京新唐思创教育科技有限公司 Audio dotting data acquisition method and device, storage medium and electronic equipment
CN113539241A (en) * 2021-07-28 2021-10-22 广州华多网络科技有限公司 Speech recognition correction method and corresponding device, equipment and medium
CN113539241B (en) * 2021-07-28 2023-04-25 广州华多网络科技有限公司 Speech recognition correction method and corresponding device, equipment and medium thereof
CN117594060A (en) * 2023-10-31 2024-02-23 北京邮电大学 Audio signal content analysis method, device, equipment and storage medium
CN117975934A (en) * 2023-12-31 2024-05-03 上海稀宇极智科技有限公司 Method and device for acquiring audio text pairs, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112037769B (en) 2024-07-30

Similar Documents

Publication Publication Date Title
CN112037769B (en) Training data generation method and device and computer readable storage medium
CN107657947B (en) Speech processing method and device based on artificial intelligence
CN109800407B (en) Intention recognition method and device, computer equipment and storage medium
CN101996631B (en) Method and device for aligning texts
CN106570180B (en) Voice search method and device based on artificial intelligence
US6975985B2 (en) Method and system for the automatic amendment of speech recognition vocabularies
CN110175334B (en) Text knowledge extraction system and method based on custom knowledge slot structure
JP2008148322A (en) Method for processing character encoding, and system
CN113626598B (en) Video text generation method, device, equipment and storage medium
CN110119510B (en) Relationship extraction method and device based on transfer dependency relationship and structure auxiliary word
CN111091834B (en) Text and audio alignment method and related product
CN111292751A (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN112259083B (en) Audio processing method and device
CN111881297A (en) Method and device for correcting voice recognition text
CN112101003B (en) Sentence text segmentation method, device and equipment and computer readable storage medium
CN113593522B (en) Voice data labeling method and device
CN112633001A (en) Text named entity recognition method and device, electronic equipment and storage medium
CN111931020A (en) Formula labeling method, device, equipment and storage medium
CN109213970B (en) Method and device for generating notes
CN113761137B (en) Method and device for extracting address information
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN104834740A (en) Full-automatic audio/video structuralized accurate searching method
CN116644737A (en) Proper noun error correction method based on automatic word stock updating and prefix tree structure
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN109344389A (en) A kind of construction method and system of the blind control bilingualism corpora of the Chinese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant