CN111402865B - Method for generating voice recognition training data and method for training voice recognition model - Google Patents


Info

Publication number
CN111402865B
Authority
CN
China
Prior art keywords
data
voice recognition
text
segment
recognition data
Prior art date
Legal status
Active
Application number
CN202010201114.8A
Other languages
Chinese (zh)
Other versions
CN111402865A (en)
Inventor
单亚慧
李�杰
王晓瑞
李岩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010201114.8A
Publication of CN111402865A
Application granted
Publication of CN111402865B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 2015/0631: Creating reference templates; Clustering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure relates to a method for generating speech recognition training data and a method for training a speech recognition model. The generation method includes the following steps: acquiring initial speech recognition data uploaded by a client, where the initial speech recognition data includes speech data and text data corresponding to the speech data; comparing the text data corresponding to the speech data with preset text data and calculating the word error rate of the initial speech recognition data; screening the initial speech recognition data whose word error rate falls within a preset word error rate interval and determining it as weak tag speech recognition data; acquiring manually labeled speech recognition data; and merging the weak tag speech recognition data with the manually labeled speech recognition data to obtain the speech recognition training data. Because weak tag speech recognition data is easy to acquire, a large amount of effective weak tag speech recognition data can be collected in a short time, which shortens the generation time and reduces the generation cost of the speech recognition training data.

Description

Method for generating voice recognition training data and method for training voice recognition model
Technical Field
The disclosure relates to the technical field of voice recognition, and in particular to a method for generating voice recognition training data, a method and a device for training a voice recognition model, an electronic device, and a storage medium.
Background
With the development of artificial intelligence technology, speech recognition technology has made great progress and has begun to enter various fields such as home appliances, communications, automobiles, medical treatment, etc.
In the related art, when a voice recognition model is trained, in order to obtain a model with excellent performance, training samples are obtained only by manually labeling a large amount of voice recognition data, so as to guarantee the training effect.
However, obtaining a large number of training samples solely through manual labeling is time-consuming and incurs high labor costs.
Disclosure of Invention
The disclosure provides a method for generating voice recognition training data, a method and a device for training a voice recognition model, an electronic device, and a storage medium, so as to at least solve the problem that the manual labeling approach in the related art is time-consuming and incurs high labor costs. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a method for generating speech recognition training data, including:
Acquiring initial voice recognition data uploaded by a client, wherein the initial voice recognition data comprises voice data and text data corresponding to the voice data;
comparing the text data corresponding to the voice data with preset text data, and calculating the word error rate of the initial voice recognition data;
screening initial voice recognition data with the word error rate in a preset word error rate interval, and determining the initial voice recognition data as weak tag voice recognition data;
acquiring manually marked voice recognition data;
and combining the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the step of combining the weak tag speech recognition data and the artificially labeled speech recognition data to obtain speech recognition training data includes:
aligning the voice data in the weak tag voice recognition data with text data to obtain aligned weak tag voice recognition data;
and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the step of aligning the voice data in the weak tag voice recognition data with text data to obtain aligned weak tag voice recognition data includes:
Extracting audio features corresponding to voice data in the weak tag voice recognition data;
inputting the audio characteristics and text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating to obtain the score of each path in the search space;
searching paths with highest scores in a breadth-first search mode in a plurality of first paths in the search space, wherein the first paths are paths with the same phoneme sequence and different time points;
and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
In one embodiment, the step of combining the aligned weak tag speech recognition data and the artificially labeled speech recognition data to obtain speech recognition training data includes:
constructing a language model according to text data in the weak tag voice recognition data;
constructing a decoding diagram according to the language model and the acoustic model;
decoding the audio features through the decoding graph to obtain a decoded text;
Comparing the decoded text with text data in the weak tag voice recognition data, and reserving first text data, wherein the first text data is a text segment with the same words corresponding to the decoded text and the text data in the weak tag voice recognition data and the words number exceeding a preset word number;
screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data;
and combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the step of combining the aligned weak tag speech recognition data and the artificially labeled speech recognition data to obtain speech recognition training data includes:
acquiring silence segments in the aligned weak tag voice recognition data;
if the time length of the mute segment is greater than a preset threshold value, discarding the mute segment to obtain a first alignment result;
And combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, if the time length of the mute segment is greater than a preset threshold, the step of discarding the mute segment to obtain the first alignment result includes:
if the time length of the mute segment is greater than a preset threshold value, discarding a first segment in the mute segment, and reserving a second segment in the mute segment to obtain a first alignment result, wherein the second segment is a segment with two ends of the mute segment within the preset time length, and the first segment is a segment with the second segment removed from the mute segment.
According to a second aspect of embodiments of the present disclosure, there is provided a training method of a speech recognition model, including a method of generating speech recognition training data as described above, the method further comprising:
and training the voice recognition model according to the voice recognition training data.
According to a third aspect of the embodiments of the present disclosure, there is provided a generation apparatus of speech recognition training data, including:
a first data acquisition unit configured to perform acquisition of initial voice recognition data uploaded by a client, wherein the initial voice recognition data includes voice data and text data corresponding to the voice data;
A data comparison unit configured to perform comparison of text data corresponding to the voice data and preset text data, and calculate a word error rate of the initial voice recognition data;
a data screening unit configured to perform screening of initial voice recognition data of which the word error rate is within a preset word error rate interval, and determine the initial voice recognition data as weak tag voice recognition data;
a second data acquisition unit configured to perform acquisition of manually noted speech recognition data;
and the data merging unit is configured to merge the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit is specifically configured to perform alignment of the voice data in the weak tag voice recognition data with text data, so as to obtain aligned weak tag voice recognition data; and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit is specifically configured to perform extraction of audio features corresponding to voice data in the weak tag voice recognition data; inputting the audio characteristics and text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating to obtain the score of each path in the search space; searching paths with highest scores in a breadth-first search mode in a plurality of first paths in the search space, wherein the first paths are paths with the same phoneme sequence and different time points; and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
In one embodiment, the data merging unit is specifically configured to perform constructing a language model from text data in the weak tag speech recognition data; constructing a decoding diagram according to the language model and the acoustic model; decoding the audio features through the decoding graph to obtain a decoded text; comparing the decoded text with text data in the weak tag voice recognition data, and reserving first text data, wherein the first text data is a text segment with the same words corresponding to the decoded text and the text data in the weak tag voice recognition data and the words number exceeding a preset word number; screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data; and combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit is specifically configured to obtain silence segments in the aligned weak tag speech recognition data; if the time length of the mute segment is greater than a preset threshold value, discarding the mute segment to obtain a first alignment result; and combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit is specifically configured to execute, if the time length of the mute segment is greater than a preset threshold, discarding a first segment in the mute segment, and reserving a second segment in the mute segment to obtain a first alignment result, where the second segment is a segment with two ends of the mute segment being located within a preset time length, and the first segment is a segment with the second segment removed from the mute segment.
According to a fourth aspect of embodiments of the present disclosure, there is provided a generating apparatus of speech recognition training data, including the generating apparatus of speech recognition training data as described above, the apparatus further including:
and a training unit configured to perform training of the speech recognition model according to the speech recognition training data.
According to a fifth aspect of embodiments of the present disclosure, there is provided a server comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of generating speech recognition training data as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a storage medium, which when executed by a processor of a server, enables the server to perform the method of generating speech recognition training data as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, causing the device to perform the method of generating speech recognition training data as described in any of the embodiments of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
According to the method for generating voice recognition training data, the method and device for training a voice recognition model, the electronic device, and the storage medium described above, the voice recognition training data is generated from the weak tag voice recognition data and the manually labeled voice recognition data. Because the weak tag voice recognition data is easy to acquire, a large amount of effective weak tag voice recognition data can be obtained in a short time, and using it reduces the dependence on manually labeled voice recognition data, thereby saving the generation time and reducing the generation cost of the voice recognition training data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flowchart illustrating a method of generating speech recognition training data, according to an exemplary embodiment.
FIG. 2 is a flow chart illustrating a complementary scheme for combining weak tag speech recognition data with artificially labeled speech recognition data to arrive at speech recognition training data, according to an exemplary embodiment.
FIG. 3 is a flow chart illustrating a complementary scheme for merging aligned weak tag speech recognition data and manually labeled speech recognition data to arrive at speech recognition training data, according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating a device for generating speech recognition training data according to an exemplary embodiment.
Fig. 5 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
FIG. 1 is a flowchart illustrating a method of generating speech recognition training data, according to an exemplary embodiment. In one embodiment, referring to fig. 1, the method for generating speech recognition training data includes the following steps:
in step S21, initial speech recognition data uploaded by the client is acquired, wherein the initial speech recognition data includes speech data and text data corresponding to the speech data.
In step S22, the text data corresponding to the voice data is compared with the preset text data, and the word error rate of the initial voice recognition data is calculated.
In step S23, the initial speech recognition data with the word error rate within the preset word error rate range is selected and determined as the weak tag speech recognition data.
In step S24, manually noted speech recognition data is acquired.
In step S25, the weak tag speech recognition data and the artificially labeled speech recognition data are combined to obtain speech recognition training data.
The weak tag voice recognition data includes voice data and text data corresponding to the voice data. Weak tag voice recognition data refers to labeled data whose labels are not guaranteed to be accurate, for example, data in which the user has modified the text recognized by the voice recognition system.
Specifically, at least one set of initial voice recognition data uploaded by a client is first acquired, where each set consists of voice data and the corresponding text data. For example, the voice data in the initial voice recognition data may be the audio of a short video made by a user, and the text data may be a text subtitle entered by the user for the short video, or a text obtained after the user modifies the text automatically recognized by the voice recognition system. The text data in each set of initial voice recognition data is then compared with the preset text data, and the word error rate of each set of initial voice recognition data is calculated. Optionally, the preset text data may be the text automatically recognized by the voice recognition system from the input voice data. Thus, the word error rate of each set of voice recognition data can be calculated by comparing the automatically recognized text with the text modified by the user. It will be appreciated that the word error rate characterizes the accuracy of voice recognition: the lower the word error rate, the higher the accuracy of voice recognition; the higher the word error rate, the lower the accuracy.
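As an illustration of the comparison step above, the following is a minimal sketch (not the implementation used in this disclosure) that computes the word error rate between the automatically recognized text and the user-edited text by a word-level edit distance. Whitespace tokenization is an assumption made here for readability; Chinese text would normally be word-segmented or compared character by character first.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate = (substitutions + deletions + insertions) / number of reference words.

    `reference` is the text taken as ground truth (here, the user-edited subtitle),
    `hypothesis` is the text produced by the recognizer.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: automatically recognized subtitle vs. the subtitle after the user's edits.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words, about 0.33
```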
Then, whether the word error rate of each set of data falls within the preset word error rate interval is judged, and the voice recognition data whose word error rate lies within that interval is screened out to form the weak tag voice recognition data. Finally, the weak tag voice recognition data is combined with the manually labeled voice recognition data to jointly form the voice recognition training data. The voice recognition training data is used to train a voice recognition model; for example, the model is trained on the data characteristics of both the weak tag voice recognition data and the manually labeled voice recognition data, thereby improving the accuracy of voice recognition.
Optionally, the preset word error rate interval may be 10% to 30%. The lower bound of 10% is chosen because, below this value, recognition is already highly accurate, so such data contributes little to further improving the performance of the voice recognition system. The upper bound of 30% is chosen because, above this value, the low accuracy may be caused by the user speaking a dialect; training on such data could bias the voice recognition system toward the dialect, which is actually detrimental to its performance. Of course, in other embodiments the preset word error rate interval may be chosen according to different application requirements; for example, in order to further improve the performance of the voice recognition system, the interval may also be set to 30% to 45%, and so on.
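Building on the sketch above, the screening step can be illustrated as follows; the record structure (dictionaries with "recognized" and "edited" fields) and the default 10%-30% interval are illustrative assumptions, not a prescribed data format.

```python
def select_weak_tag_data(records, low=0.10, high=0.30):
    """Keep the records whose word error rate between the recognized text and the
    user-edited text falls inside the preset interval; these records become the
    weak tag voice recognition data. Reuses word_error_rate() from the sketch above."""
    weak_tag = []
    for record in records:
        wer = word_error_rate(record["edited"], record["recognized"])
        if low <= wer <= high:
            weak_tag.append(record)
    return weak_tag
```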
According to the method for generating the voice recognition training data, the voice recognition training data are generated according to the weak tag voice recognition data and the manually marked voice recognition data, a large amount of effective weak tag voice recognition data can be obtained in a short time due to the convenience in obtaining the weak tag voice recognition data, and the dependence on the manually marked voice recognition data can be reduced by utilizing the weak tag voice recognition data, so that the generation time of the voice recognition training data is saved, and the generation cost of the voice recognition training data is reduced.
FIG. 2 is a flow chart illustrating a complementary scheme for generating speech recognition training data based on weak tag speech recognition data and manually labeled speech recognition data, according to an exemplary embodiment. Referring to fig. 2, in one embodiment, step S25 may include the following steps:
in step S252, the voice data in the weak tag voice recognition data is aligned with the text data, and the aligned weak tag voice recognition data is obtained.
In step S254, the aligned weak tag speech recognition data and the artificially labeled speech recognition data are combined to obtain speech recognition training data.
Specifically, in order to facilitate the analysis and processing of large-scale voice data and text data and to guarantee processing accuracy when the data volume is large, after the weak tag voice recognition data is obtained, the voice data in the weak tag voice recognition data is aligned with the text data using a preset acoustic model, yielding the aligned weak tag voice recognition data. Optionally, a specific implementation of aligning the voice data with the text data may include the following steps:
First, the audio features corresponding to the voice data in the weak tag voice recognition data are extracted, and the audio features and the text data are input into a preset acoustic model; the acoustic model expands the text data into a search space composed of HMM state sequences, and the score of each path in the search space is calculated. Among the multiple first paths in the search space, the path with the highest score is found by breadth-first search, where the first paths are paths that have the same phoneme sequence but different time points. The audio features are then aligned with the HMM state sequence according to the highest-scoring path, yielding the aligned weak tag voice recognition data.
HMM stands for Hidden Markov Model. The acoustic model is built on hidden Markov models and is trained on voice recognition data samples. The phoneme states obtained after tying and clustering are triphone states, which correspond to the tied HMM states; the triphone is the modeling unit of the acoustic model. It should be noted that the HMM state sequence here refers to the tied HMM states, i.e., the triphone states. Specifically, after the audio features are extracted, the text data in the weak tag voice recognition data is expanded into a search space composed of HMM state sequences using the pre-trained acoustic model, and the token score of each path is calculated. Beam search is applied in this search space: the paths are ranked from high score to low score and only the top N paths are retained, i.e., the lower-scoring paths are pruned away. A breadth-first search is then used over the paths in the search space that have the same phoneme sequence but different time points, and the highest-scoring path is found so that the audio features can be aligned with the HMM state sequence to obtain the alignment result. This embodiment can process a large amount of data, align a large amount of weak tag voice recognition data, and improve alignment efficiency.
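The beam-pruned search described above is part of a full decoder; the following is only a simplified dynamic-programming sketch of the underlying idea of forced alignment: among all paths that traverse the same left-to-right state sequence at different time points, keep the highest-scoring one. The frame-by-state log-probability matrix is assumed to come from a pre-trained acoustic model, and a strict left-to-right topology with at least as many frames as states is assumed.

```python
import numpy as np

def force_align(log_probs):
    """Align T frames to S states in strict left-to-right order.

    log_probs: (T, S) array of frame-level log-probabilities for the states obtained
    by expanding the label text (e.g. tied HMM / triphone states), assuming T >= S.
    Returns the state index assigned to each frame on the highest-scoring monotonic path.
    """
    T, S = log_probs.shape
    NEG = -np.inf
    score = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, 0]                         # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                        # remain in the same state
            move = score[t - 1, s - 1] if s > 0 else NEG  # advance to the next state
            if stay >= move:
                score[t, s], back[t, s] = stay + log_probs[t, s], s
            else:
                score[t, s], back[t, s] = move + log_probs[t, s], s - 1
    # The path must end in the last state; trace it back to recover per-frame states.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```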
In order to improve the accuracy of weak tag voice recognition data selection, a confidence screening mechanism is adopted to screen the weak tag voice recognition data. In one embodiment, step S254 may include the steps of:
in step S2541, a language model is constructed from text data in the weak tag speech recognition data;
in step S2542, a decoding diagram is constructed from the language model and the acoustic model;
in step S2543, decoding the audio feature through the decoding graph to obtain a decoded text;
in step S2544, comparing the decoded text with text data in the weak tag voice recognition data, and retaining first text data, wherein the first text data is a text segment in which the decoded text is identical to text data in the weak tag voice recognition data, and the number of text words exceeds a preset number of words;
in step S2545, second text data identical to the first text data in the aligned weak tag voice recognition data is screened, and voice data corresponding to the second text data is extracted from the aligned weak tag voice recognition data;
in step S2546, the speech recognition data including the second text data and the corresponding speech data and the manually-labeled speech recognition data are combined to generate the speech recognition training data.
Specifically, a language model is first constructed from the text data in the weak tag voice recognition data; the language model may be a bigram, trigram, or 4-gram language model. A decoding graph is then constructed from the language model and the acoustic model, and the audio features corresponding to the voice data are decoded with this decoding graph to obtain the decoded text. The decoded text is compared with the text data, and the text segments in which the two contain the same consecutive words, with the number of consecutive words exceeding the preset word count, are retained as the first text data; for example, identical runs of more than 10 consecutive words are retained. The first text data is then compared with the text data in the aligned weak tag voice recognition data, the second text data contained in both is screened out, and the voice data corresponding to the second text data is extracted. Finally, the voice recognition training data is generated from the second text data, the corresponding voice data, and the manually labeled voice recognition data.
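A minimal sketch of the text-comparison step only is given below; constructing the language model and the decoding graph is normally done with a speech recognition toolkit and is not shown. The sketch keeps the word runs shared by the decoded text and the weak tag text when a run exceeds a preset number of words, using difflib from the Python standard library; whitespace tokenization is again an illustrative assumption.

```python
from difflib import SequenceMatcher

def common_segments(decoded_text, label_text, min_words=10):
    """Return the contiguous word runs shared by the decoded text and the weak tag
    text whose length exceeds `min_words`; these runs play the role of the
    'first text data' described above."""
    decoded, label = decoded_text.split(), label_text.split()
    matcher = SequenceMatcher(a=decoded, b=label, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size > min_words:
            segments.append(" ".join(label[block.b:block.b + block.size]))
    return segments
```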
FIG. 3 is a flow chart illustrating a complementary scheme for generating speech recognition training data based on alignment results and manually annotated speech recognition data, according to an exemplary embodiment. Referring to fig. 3, in one embodiment, step S254 may include the following steps:
In step S2542, a mute segment in the aligned weak tag voice recognition data is acquired.
In step S2544, if the duration of the mute segment is greater than the preset threshold, the mute segment is discarded to obtain a first alignment result.
In step S2546, the first alignment result and the manually labeled speech recognition data are combined to obtain speech recognition training data.
Typically, the aligned weak tag voice recognition data consists of voice segments and mute segments. Specifically, after the aligned weak tag voice recognition data is obtained, the mute segments in it are located; in general there are several of them. The time length of each mute segment is obtained, and if that length is greater than a preset threshold, the mute segment is considered not to meet the requirement and is discarded. The remaining segments of the alignment result form the first alignment result, which is then merged with the manually labeled voice recognition data to form the voice recognition training data.
In this embodiment, removing overly long mute segments as interference data improves the accuracy of the voice recognition data, reduces the error rate, and helps improve the performance of the voice recognition system.
Alternatively, in one embodiment, step S2544 may include the steps of: if the duration of the mute segment is greater than a preset threshold, discarding the first segment in the mute segment, and reserving the second segment in the mute segment to obtain a first alignment result, wherein the second segment is a segment with two ends of the mute segment being positioned in the preset duration, and the first segment is a segment except the second segment in the mute segment.
Specifically, a portion of the mute data at the two ends of the mute segment can be retained, for example, the 0.1 s of mute data at the beginning and at the end of the mute segment, while the mute data in the middle is discarded, so that the resulting first alignment result is formed by splicing the voice segments with the retained mute data.
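A minimal sketch of this silence-trimming rule is given below, assuming the alignment is available as (label, start, end) segments in seconds; the "sil" label, the 2-second threshold, and the 0.1 s margins are illustrative values only.

```python
def trim_silence(segments, max_sil=2.0, keep=0.1):
    """Drop the middle of any silence segment longer than `max_sil`, keeping `keep`
    seconds at each end so that the neighbouring speech is not cut off abruptly."""
    kept = []
    for label, start, end in segments:
        if label == "sil" and (end - start) > max_sil:
            kept.append((label, start, start + keep))  # leading silence margin
            kept.append((label, end - keep, end))      # trailing silence margin
        else:
            kept.append((label, start, end))
    return kept
```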
In this embodiment, the silence data at two ends of the silence segment are reserved, so that the sudden interruption of the voice can be avoided, and the performance of the voice recognition system is prevented from being affected.
Based on the same inventive concept, the disclosure further provides a training method of a speech recognition model, which includes the method for generating speech recognition training data according to the embodiment, and training the speech recognition model according to the speech recognition training data obtained by the method for generating speech recognition training data, so as to improve performance of a speech recognition system.
Alternatively, the method of the above embodiments may be applied to the short-video field. Short-video editing software provides an automatic subtitle recognition function and allows the user to modify the recognized subtitles; if the user is not satisfied with the automatically recognized subtitles, they can be edited again. In general, neither the automatically recognized subtitles nor the user-modified data are necessarily completely correct. Such imperfect data can serve as weak tag voice recognition data; with the method described above, this data can be fully utilized, the accuracy of the automatic subtitle recognition system can be improved, and a better subtitle recognition function can be provided to the user.
It should be understood that, although the steps in the flowcharts of FIGS. 1-3 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-3 may include multiple sub-steps or stages; these are not necessarily performed at the same moment but may be performed at different moments, and they are not necessarily performed sequentially but may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
FIG. 4 is a block diagram illustrating a device for generating speech recognition training data according to an exemplary embodiment. Referring to FIG. 4, the apparatus 30 includes a first data acquisition unit 302, a data comparison unit 304, a data filtering unit 306, a second data acquisition unit 308, and a data merging unit 310.
The first data obtaining unit 302 is configured to perform obtaining initial voice recognition data uploaded by the client, where the initial voice recognition data includes voice data and text data corresponding to the voice data;
the data comparison unit 304 is configured to perform comparison of text data corresponding to the voice data and preset text data, and calculate a word error rate of the initial voice recognition data;
the data filtering unit 306 is configured to perform filtering of initial speech recognition data with a word error rate within a preset word error rate range, and determine the initial speech recognition data as weak tag speech recognition data;
the second data acquisition unit 308 is configured to perform acquisition of manually labeled speech recognition data;
the data merging unit 310 is configured to perform merging of the weak tag speech recognition data and the artificially labeled speech recognition data to obtain speech recognition training data.
According to the voice recognition training data generating device, the voice recognition training data is generated according to the weak tag voice recognition data and the manually marked voice recognition data, a large amount of effective weak tag voice recognition data can be obtained in a short time due to the convenience in obtaining the weak tag voice recognition data, and the dependence on the manually marked voice recognition data can be reduced by utilizing the weak tag voice recognition data, so that the generation time of the voice recognition training data is saved, and the generation cost of the voice recognition training data is reduced.
In one embodiment, the data merging unit 310 is specifically configured to perform alignment of the voice data in the weak tag voice recognition data with text data, so as to obtain aligned weak tag voice recognition data; and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit 310 is specifically configured to perform extracting audio features corresponding to voice data in the weak tag voice recognition data; inputting the audio characteristics and text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating to obtain the score of each path in the search space; searching paths with highest scores in a breadth-first search mode in a plurality of first paths in the search space, wherein the first paths are paths with the same phoneme sequence and different time points; and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
In one embodiment, the data merging unit 310 is specifically configured to perform a language model construction from text data in the weak tag speech recognition data; constructing a decoding diagram according to the language model and the acoustic model; decoding the audio features through the decoding graph to obtain a decoded text; comparing the decoded text with text data in the weak tag voice recognition data, and reserving first text data, wherein the first text data is a text segment with the same words corresponding to the decoded text and the text data in the weak tag voice recognition data and the words number exceeding a preset word number; screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data; and combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit 310 is specifically configured to obtain silence segments in the aligned weak tag speech recognition data; if the time length of the mute segment is greater than a preset threshold value, discarding the mute segment to obtain a first alignment result; and combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the data merging unit 310 is specifically configured to execute, if the time length of the silence segment is greater than a preset threshold, discarding a first segment of the silence segment and retaining a second segment of the silence segment to obtain a first alignment result, where the second segment consists of the portions within a preset time length at the two ends of the silence segment, and the first segment is the portion of the silence segment excluding the second segment.
For specific limitations on the generation means of the speech recognition training data, reference may be made to the above limitations on the generation method of the speech recognition training data, and no further description is given here. The above-described respective modules in the generation apparatus of the speech recognition training data may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Fig. 5 is a block diagram of an electronic device, according to an example embodiment. The electronic device may be a server. The electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the electronic device is used to store data of user interactions with the item. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of generating speech recognition training data.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute instructions to implement the steps of:
acquiring initial voice recognition data uploaded by a client, wherein the initial voice recognition data comprises voice data and text data corresponding to the voice data;
comparing the text data corresponding to the voice data with preset text data, and calculating the word error rate of the initial voice recognition data;
screening initial voice recognition data with the word error rate in a preset word error rate interval, and determining the initial voice recognition data as weak tag voice recognition data;
acquiring manually marked voice recognition data;
and combining the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
According to the electronic equipment, the voice recognition training data are generated according to the weak tag voice recognition data and the manually marked voice recognition data, and a large amount of effective weak tag voice recognition data can be obtained in a short time due to the convenience in obtaining the weak tag voice recognition data.
In one embodiment, the processor when executing instructions further performs the steps of: aligning the voice data in the weak tag voice recognition data with text data to obtain aligned weak tag voice recognition data; and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the processor when executing instructions further performs the steps of: extracting audio features corresponding to voice data in the weak tag voice recognition data; inputting the audio characteristics and text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating to obtain the score of each path in the search space; searching paths with highest scores in a breadth-first search mode in a plurality of first paths in the search space, wherein the first paths are paths with the same phoneme sequence and different time points; and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
In one embodiment, the processor when executing instructions further performs the steps of: constructing a language model according to text data in the weak tag voice recognition data; constructing a decoding diagram according to the language model and the acoustic model; decoding the audio features through the decoding graph to obtain a decoded text; comparing the decoded text with text data in the weak tag voice recognition data, and reserving first text data, wherein the first text data is a text segment with the same words corresponding to the decoded text and the text data in the weak tag voice recognition data and the words number exceeding a preset word number; screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data; and combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the processor when executing instructions further performs the steps of: acquiring silence segments in the aligned weak tag voice recognition data; if the time length of the mute segment is greater than a preset threshold value, discarding the mute segment to obtain a first alignment result; and combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the processor when executing instructions further performs the steps of: if the duration of the mute segment is greater than a preset threshold, discarding the first segment in the mute segment, and reserving the second segment in the mute segment to obtain a first alignment result, wherein the second segment is a segment with two ends of the mute segment being positioned in the preset duration, and the first segment is a segment except the second segment in the mute segment.
In an exemplary embodiment, a storage medium is also provided, e.g., a memory, comprising instructions executable by a processor of the apparatus to perform the above-described method.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring initial voice recognition data uploaded by a client, wherein the initial voice recognition data comprises voice data and text data corresponding to the voice data;
comparing the text data corresponding to the voice data with preset text data, and calculating the word error rate of the initial voice recognition data;
screening initial voice recognition data with the word error rate in a preset word error rate interval, and determining the initial voice recognition data as weak tag voice recognition data;
Acquiring manually marked voice recognition data;
and combining the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the computer program when executed by the processor further performs the steps of: aligning the voice data in the weak tag voice recognition data with text data to obtain aligned weak tag voice recognition data; and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the computer program when executed by the processor further performs the steps of: extracting audio features corresponding to voice data in the weak tag voice recognition data; inputting the audio characteristics and text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating to obtain the score of each path in the search space; searching paths with highest scores in a breadth-first search mode in a plurality of first paths in the search space, wherein the first paths are paths with the same phoneme sequence and different time points; and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
In one embodiment, the computer program when executed by the processor further performs the steps of: constructing a language model according to text data in the weak tag voice recognition data; constructing a decoding diagram according to the language model and the acoustic model; decoding the audio features through the decoding graph to obtain a decoded text; comparing the decoded text with text data in the weak tag voice recognition data, and reserving first text data, wherein the first text data is a text segment with the same words corresponding to the decoded text and the text data in the weak tag voice recognition data and the words number exceeding a preset word number; screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data; and combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring silence segments in the aligned weak tag voice recognition data; if the time length of the mute segment is greater than a preset threshold value, discarding the mute segment to obtain a first alignment result; and combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
In one embodiment, the computer program when executed by the processor further performs the steps of: if the duration of the mute segment is greater than a preset threshold, discarding the first segment in the mute segment, and reserving the second segment in the mute segment to obtain a first alignment result, wherein the second segment is a segment with two ends of the mute segment being positioned in the preset duration, and the first segment is a segment except the second segment in the mute segment.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method for generating speech recognition training data, comprising:
acquiring initial voice recognition data uploaded by a client, wherein the initial voice recognition data comprises voice data and text data corresponding to the voice data;
comparing the text data corresponding to the voice data with preset text data, and calculating the word error rate of the initial voice recognition data;
Screening initial voice recognition data with the word error rate in a preset word error rate interval, and determining the initial voice recognition data as weak tag voice recognition data;
acquiring manually marked voice recognition data;
combining the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data;
the step of combining the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data comprises the following steps:
constructing a language model according to text data in the weak tag voice recognition data;
constructing a decoding diagram according to the language model and the acoustic model;
decoding the audio features through the decoding graph to obtain a decoded text;
comparing the decoded text with text data in the weak tag voice recognition data, and reserving first text data, wherein the first text data is a text segment with the same words corresponding to the decoded text and the text data in the weak tag voice recognition data and the words number exceeding a preset word number;
screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data;
And combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
2. The method of claim 1, wherein the step of merging the weak tag speech recognition data and the artificially labeled speech recognition data to obtain speech recognition training data comprises:
aligning the voice data in the weak tag voice recognition data with text data to obtain aligned weak tag voice recognition data;
and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
3. The method of generating voice recognition training data according to claim 2, wherein the step of aligning the voice data in the weak tag voice recognition data with text data to obtain the aligned weak tag voice recognition data comprises:
extracting audio features corresponding to voice data in the weak tag voice recognition data;
inputting the audio features and the text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating a score for each path in the search space;
searching, in a breadth-first manner, for the path with the highest score among a plurality of first paths in the search space, wherein the first paths are paths having the same phoneme sequence but different time points;
and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
4. The method for generating voice recognition training data according to claim 2, wherein the step of combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain the voice recognition training data comprises:
acquiring silence segments in the aligned weak tag voice recognition data;
if the time length of a silence segment is greater than a preset threshold, discarding the silence segment to obtain a first alignment result;
and combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
5. The method for generating voice recognition training data according to claim 4, wherein the step of discarding the silence segment to obtain a first alignment result if the time length of the silence segment is greater than a preset threshold comprises:
if the time length of the silence segment is greater than the preset threshold, discarding a first segment of the silence segment and retaining a second segment of the silence segment to obtain the first alignment result, wherein the second segment consists of the portions within a preset time length at the two ends of the silence segment, and the first segment is the silence segment with the second segment removed.
6. A method of training a speech recognition model, comprising a method of generating speech recognition training data according to any one of claims 1 to 5, the method further comprising:
and training the voice recognition model according to the voice recognition training data.
7. A speech recognition training data generating apparatus, comprising:
a first data acquisition unit configured to perform acquisition of initial voice recognition data uploaded by a client, wherein the initial voice recognition data includes voice data and text data corresponding to the voice data;
a data comparison unit configured to perform comparison of text data corresponding to the voice data and preset text data, and calculate a word error rate of the initial voice recognition data;
a data screening unit configured to perform screening of initial voice recognition data whose word error rate is within a preset word error rate interval, and determine the screened initial voice recognition data as weak tag voice recognition data;
a second data acquisition unit configured to perform acquisition of manually marked voice recognition data;
a data merging unit configured to merge the weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data;
the data merging unit is specifically configured to perform: constructing a language model according to text data in the weak tag voice recognition data; constructing a decoding graph according to the language model and the acoustic model; decoding the audio features through the decoding graph to obtain a decoded text; comparing the decoded text with the text data in the weak tag voice recognition data, and retaining first text data, wherein the first text data is a text segment in which the decoded text and the text data in the weak tag voice recognition data contain the same words and the number of words exceeds a preset word count; screening second text data which is the same as the first text data in the aligned weak tag voice recognition data, and extracting voice data corresponding to the second text data from the aligned weak tag voice recognition data; and combining the voice recognition data formed by the second text data and the voice data corresponding to the second text data with the manually marked voice recognition data to obtain voice recognition training data.
8. The apparatus according to claim 7, wherein the data merging unit is specifically configured to perform: aligning the voice data in the weak tag voice recognition data with the text data to obtain aligned weak tag voice recognition data; and combining the aligned weak tag voice recognition data and the manually marked voice recognition data to obtain voice recognition training data.
9. The apparatus according to claim 8, wherein the data merging unit is specifically configured to perform: extracting audio features corresponding to the voice data in the weak tag voice recognition data; inputting the audio features and the text data in the weak tag voice recognition data into a preset acoustic model, expanding the text data into a search space composed of HMM state sequences through the acoustic model, and calculating a score for each path in the search space; searching, in a breadth-first manner, for the path with the highest score among a plurality of first paths in the search space, wherein the first paths are paths having the same phoneme sequence but different time points; and aligning the audio features with the HMM state sequence according to the path with the highest score to obtain the aligned weak tag voice recognition data.
10. The apparatus according to claim 8, wherein the data merging unit is specifically configured to perform: acquiring silence segments in the aligned weak tag voice recognition data; if the time length of a silence segment is greater than a preset threshold, discarding the silence segment to obtain a first alignment result; and combining the first alignment result and the manually marked voice recognition data to obtain voice recognition training data.
11. The apparatus according to claim 10, wherein the data merging unit is specifically configured to, if the time length of the silence segment is greater than the preset threshold, discard a first segment of the silence segment and retain a second segment of the silence segment to obtain the first alignment result, wherein the second segment consists of the portions within a preset time length at the two ends of the silence segment, and the first segment is the silence segment with the second segment removed.
12. A training apparatus of a speech recognition model, characterized by comprising the speech recognition training data generating apparatus according to any one of claims 7 to 11, further comprising:
a training unit configured to perform training of the speech recognition model according to the speech recognition training data.
13. A server, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method of generating speech recognition training data according to any one of claims 1 to 5.
14. A storage medium, wherein instructions in the storage medium, when executed by a processor of a server, enable the server to perform the method of generating speech recognition training data according to any one of claims 1 to 5.
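The claims above are easier to follow with a few concrete sketches. The first one illustrates the word error rate filter of claim 1: the text uploaded with each clip is scored against the preset text, and only clips whose word error rate falls inside a preset interval are kept as weak tag data. This is a minimal Python illustration; the helper names, the tuple layout of the samples, and the interval bounds 0.0 to 0.2 are assumptions of this sketch, not values taken from the patent.

```python
# Minimal sketch of the WER-based weak-label filter in claim 1.
# word_error_rate, select_weak_label_data and the bounds are illustrative only.

def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance normalised by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j]: minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)


def select_weak_label_data(samples, wer_low=0.0, wer_high=0.2):
    """samples: iterable of (audio, uploaded_text, preset_text).
    Keeps pairs whose WER against the preset text lies inside the interval;
    these become the weak tag voice recognition data."""
    return [(audio, text)
            for audio, text, preset in samples
            if wer_low <= word_error_rate(text, preset) <= wer_high]
```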
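Claim 1's merge step first builds a language model from the weak tag text itself and compiles it with an acoustic model into a decoding graph. Production systems normally do this with an n-gram toolkit and a WFST compiler; the tiny add-one-smoothed bigram model below is only meant to convey the idea of an in-domain language model biased toward the weak tag text, and every name in it is an assumption of this sketch.

```python
# Toy bigram language model over the weak tag text data (illustrative only;
# not the graph-compilation pipeline a real recogniser would use).
from collections import Counter

def train_bigram_lm(sentences):
    """sentences: iterable of word lists taken from the weak tag text data.
    Returns a function prob(w1, w2) giving P(w2 | w1) with add-one smoothing."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded[:-1])                 # bigram contexts
        bigrams.update(zip(padded[:-1], padded[1:])) # adjacent word pairs
    vocab_size = len(vocab)

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

    return prob
```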
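After decoding, claim 1 keeps only the spans where the decoded text and the weak tag text agree for more than a preset number of words (the "first text data"). A simple way to picture this, assuming the two word sequences have already been aligned position by position, is the run-length comparison below; matching_spans and min_words are names invented for this sketch.

```python
# Sketch of the text-agreement filter inside claim 1's merge step.
# Assumes decoded and weak_label are word lists aligned one-to-one.

def matching_spans(decoded, weak_label, min_words=3):
    """Return (start, end) word-index spans where the two sequences agree
    for at least min_words consecutive positions."""
    spans, run_start = [], None
    n = min(len(decoded), len(weak_label))
    for i in range(n + 1):
        same = i < n and decoded[i] == weak_label[i]
        if same and run_start is None:
            run_start = i                      # a matching run begins
        elif not same and run_start is not None:
            if i - run_start >= min_words:     # keep only long-enough runs
                spans.append((run_start, i))
            run_start = None
    return spans
```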
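Claim 3 aligns the audio with the HMM state sequence expanded from the transcript by searching for the highest-scoring path. The sketch below uses a Viterbi-style dynamic program over a matrix of per-frame acoustic scores rather than the breadth-first search named in the claim, and it assumes those scores are already available as a NumPy array; it is a conceptual illustration of frame-to-state alignment, not the claimed procedure.

```python
# Conceptual frame-to-state alignment sketch (Viterbi-style, not the claimed
# breadth-first search). acoustic_scores[t, s] is the log-probability of frame
# t under the s-th HMM state of the expanded transcript; assumes T >= S.
import numpy as np

def align_frames_to_states(acoustic_scores: np.ndarray):
    """Return, for each frame, the index of the aligned HMM state."""
    T, S = acoustic_scores.shape
    NEG = -np.inf
    best = np.full((T, S), NEG)      # best path score ending at (t, s)
    back = np.zeros((T, S), dtype=int)
    best[0, 0] = acoustic_scores[0, 0]
    for t in range(1, T):
        for s in range(S):
            # a frame may stay in the same state or advance to the next one
            stay = best[t - 1, s]
            move = best[t - 1, s - 1] if s > 0 else NEG
            if stay >= move:
                best[t, s], back[t, s] = stay, s
            else:
                best[t, s], back[t, s] = move, s - 1
            best[t, s] += acoustic_scores[t, s]
    # trace back from the final state to recover the frame-to-state alignment
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))
```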
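Claims 4 and 5 handle over-long silence in the aligned data: a silence segment longer than a threshold is not dropped wholesale; a short margin at each end (the "second segment") is kept so that word boundaries stay intact, and only the interior (the "first segment") is discarded. The segment representation, the 1.0 s threshold, and the 0.2 s margin in the sketch below are assumptions for illustration.

```python
# Sketch of the silence handling of claims 4 and 5; times in seconds.

def trim_silence(segments, max_silence=1.0, keep_margin=0.2):
    """segments: list of (start, end, label) with label 'sil' or 'speech'.
    Returns segments with the interior of over-long silences removed."""
    trimmed = []
    for start, end, label in segments:
        if label == "sil" and end - start > max_silence:
            # keep the margins adjacent to speech (the 'second segment'),
            # discard the interior (the 'first segment')
            trimmed.append((start, start + keep_margin, "sil"))
            trimmed.append((end - keep_margin, end, "sil"))
        else:
            trimmed.append((start, end, label))
    return trimmed
```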
CN202010201114.8A 2020-03-20 2020-03-20 Method for generating voice recognition training data and method for training voice recognition model Active CN111402865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010201114.8A CN111402865B (en) 2020-03-20 2020-03-20 Method for generating voice recognition training data and method for training voice recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010201114.8A CN111402865B (en) 2020-03-20 2020-03-20 Method for generating voice recognition training data and method for training voice recognition model

Publications (2)

Publication Number Publication Date
CN111402865A CN111402865A (en) 2020-07-10
CN111402865B true CN111402865B (en) 2023-08-08

Family

ID=71413933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010201114.8A Active CN111402865B (en) 2020-03-20 2020-03-20 Method for generating voice recognition training data and method for training voice recognition model

Country Status (1)

Country Link
CN (1) CN111402865B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658586B (en) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Training method of voice recognition model, voice interaction method and device
CN114203166B (en) * 2021-12-10 2023-03-31 零犀(北京)科技有限公司 Method, device and equipment for generating training data based on man-machine conversation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
WO2019095782A1 (en) * 2017-11-20 2019-05-23 阿里巴巴集团控股有限公司 Data sample label processing method and apparatus
CN109887497A (en) * 2019-04-12 2019-06-14 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110085209A (en) * 2019-04-11 2019-08-02 广州多益网络股份有限公司 A kind of tone color screening technique and device
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7117153B2 (en) * 2003-02-13 2006-10-03 Microsoft Corporation Method and apparatus for predicting word error rates from text

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019095782A1 (en) * 2017-11-20 2019-05-23 阿里巴巴集团控股有限公司 Data sample label processing method and apparatus
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Optimize method, system, equipment and the storage medium of voice recognition acoustic model
CN110085209A (en) * 2019-04-11 2019-08-02 广州多益网络股份有限公司 A kind of tone color screening technique and device
CN109887497A (en) * 2019-04-12 2019-06-14 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘振宇; 李钦富; 杨硕; 邓应强; 刘芬; 赖新明; 白雪珂. A sentiment analysis model based on active learning and multiple supervised learning methods [一种基于主动学习和多种监督学习的情感分析模型]. Journal of China Academy of Electronics and Information Technology, 2020, No. 2, pp. 77-82. *

Also Published As

Publication number Publication date
CN111402865A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
US11664020B2 (en) Speech recognition method and apparatus
US10878803B2 (en) Speech conversion method, computer device, and storage medium
CN112102815B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN105654945B (en) Language model training method, device and equipment
US4813074A (en) Method of and device for segmenting an electric signal derived from an acoustic signal
CN108711422A (en) Audio recognition method, device, computer readable storage medium and computer equipment
CN111402895B (en) Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN111247584B (en) Voice conversion method, system, device and storage medium
CN110930993B (en) Specific domain language model generation method and voice data labeling system
CN109346056B (en) Speech synthesis method and device based on depth measurement network
CN109637521A (en) A kind of lip reading recognition methods and device based on deep learning
CN111402865B (en) Method for generating voice recognition training data and method for training voice recognition model
US5680509A (en) Method and apparatus for estimating phone class probabilities a-posteriori using a decision tree
CN110689881B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
US20160365090A1 (en) System and method for automatic language model generation
US20160180834A1 (en) Voice retrieval apparatus, voice retrieval method, and non-transitory recording medium
CN109522550A (en) Text information error correction method, device, computer equipment and storage medium
CN111369974A (en) Dialect pronunciation labeling method, language identification method and related device
Van Dalen et al. Improving multiple-crowd-sourced transcriptions using a speech recogniser
CN111223476A (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN112382272A (en) Speech synthesis method, apparatus, device and storage medium capable of controlling speech speed
CN113593522A (en) Voice data labeling method and device
CN115497484B (en) Voice decoding result processing method, device, equipment and storage medium
CN112905763B (en) Session system development method, device, computer equipment and storage medium
CN113921012A (en) Method, system, intelligent device and storage medium for recognizing synthetic speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant