CN112669814B - Data processing method, device, equipment and medium - Google Patents

Data processing method, device, equipment and medium

Info

Publication number
CN112669814B
CN112669814B (application CN202011496538.8A)
Authority
CN
China
Prior art keywords
data
audio
quality inspection
text
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011496538.8A
Other languages
Chinese (zh)
Other versions
CN112669814A (en)
Inventor
李旭 (Li Xu)
刘欢 (Liu Huan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN202011496538.8A
Publication of CN112669814A
Application granted
Publication of CN112669814B
Legal status: Active
Anticipated expiration


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a data processing method, apparatus, device and medium, which address the problem that the quality of annotation data cannot be checked by an electronic device. During quality inspection of annotation data, quality inspection data corresponding to the annotation data can be obtained through a speech synthesis model, based on the annotation data to be inspected and the audio data corresponding to it. The quality inspection data characterize the correspondence between each character in the annotation data and each audio frame in the corresponding audio data, and whether the annotation data are correct can be determined from the quality inspection data. Quality inspection of the annotation data is thus performed without manual work, which reduces the workload of quality inspection staff, reduces the influence of staff capability on inspection efficiency and accuracy, and makes it easy to trace and locate incorrectly annotated data.

Description

Data processing method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus, device, and medium.
Background
In the prior art, converting text information into speech information is generally realized with a speech synthesis model. Obtaining such a model typically requires a large number of speech samples, together with annotation data for each sample, to train the original speech synthesis model. For the same model structure, a high-accuracy speech synthesis model can be trained from a large volume of high-quality training speech samples and their annotation data. When testing the trained speech synthesis model, the number of test speech samples, the balance between the sample training set and the sample test set, and the quality of the annotation data of the test samples all strongly influence the test results. The quality of the annotation data corresponding to the speech samples (both test and training samples) is therefore one of the important factors affecting the accuracy of a speech synthesis model, and how to improve it has drawn increasing attention in recent years.
At present, the annotation and quality inspection of speech samples are mainly done manually. Although annotation tools, such as the speech annotation tool Praat, have appeared to assist manual annotation, the annotation data corresponding to the speech samples still need to be inspected by hand, and manual quality inspection is time-consuming, labor-intensive and costly. When there is a large amount of annotation data to be inspected, the workload of quality inspection staff is very heavy and errors inevitably appear in the inspection results, which degrades the quality inspection of the annotation data; this shortcoming of manual quality inspection is particularly obvious. A method that can complete the quality inspection of annotation data automatically is therefore urgently needed.
Disclosure of Invention
The embodiments of the invention provide a data processing method, apparatus, device and medium, which solve the problem that the quality of annotation data cannot be checked by an electronic device.
An embodiment of the invention provides a data processing method, which comprises the following steps:
acquiring any annotation data to be inspected and the audio data corresponding to the annotation data, wherein the annotation data comprise text data corresponding to the audio data and a first text feature of the text data;
determining, based on the annotation data and the audio data, quality inspection data corresponding to the annotation data through a decoder of a speech synthesis model, wherein the quality inspection data characterize the correspondence between each character in the annotation data and each audio frame in the audio data;
and judging whether the annotation data are correct according to the quality inspection data corresponding to the annotation data.
An embodiment of the invention provides a data processing apparatus, which comprises:
an acquiring unit, configured to acquire any annotation data to be inspected and the audio data corresponding to the annotation data, wherein the annotation data comprise text data corresponding to the audio data and a first text feature of the text data;
a determining unit, configured to determine, based on the annotation data and the audio data, quality inspection data corresponding to the annotation data through a decoder of a speech synthesis model, wherein the quality inspection data characterize the correspondence between each character in the annotation data and each audio frame in the audio data;
and a judging unit, configured to judge whether the annotation data are correct according to the quality inspection data corresponding to the annotation data.
An embodiment of the present invention provides an electronic device comprising a processor, the processor being configured to implement the steps of any one of the data processing methods described above when executing a computer program stored in a memory.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the data processing methods described above.
In the embodiments of the invention, quality inspection data corresponding to annotation data can be obtained through a speech synthesis model, based on the annotation data to be inspected and the audio data corresponding to it. The quality inspection data characterize the correspondence between each character in the annotation data and each audio frame in the corresponding audio data, and whether the annotation data are correct can be determined from the quality inspection data. Quality inspection of the annotation data is thus performed without manual work, which reduces the workload of quality inspection staff, reduces the influence of staff capability on inspection efficiency and accuracy, and makes it easy to trace and locate incorrectly annotated data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a data processing process according to an embodiment of the present invention;
FIG. 2 is an alignment chart of correct annotation data provided by an embodiment of the present invention;
FIG. 3 is an alignment chart of erroneous annotation data provided by an embodiment of the present invention;
FIG. 4 is an alignment chart of erroneous annotation data provided by an embodiment of the present invention;
FIG. 5 is an alignment chart of erroneous annotation data provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a specific data processing flow according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In order to improve the accuracy and efficiency of quality inspection of annotation data to be inspected, the embodiments of the invention provide a data processing method, apparatus, device and medium.
Example 1: fig. 1 is a schematic diagram of a data processing process according to an embodiment of the present invention, where the process includes:
S101: any marking data to be inspected and audio data corresponding to the marking data are acquired, wherein the marking data comprise text data corresponding to the audio data and first text features thereof.
The data processing method provided by the embodiment of the invention is applied to electronic equipment, and the electronic equipment can be intelligent equipment such as a robot or a server.
In the embodiment of the present invention, the labeling data corresponding to the audio data includes text data corresponding to the audio data and text features (for convenience of description, denoted as first text features) of the text data, where the first text features include an initial and final sequence corresponding to the text data, and the initial and final sequence includes an initial and final corresponding to each character in the text data and a corresponding tone. For example, the corresponding vowel of the character "today" in the text data is "jin" and the corresponding tone is "1", the corresponding vowel of the character "day" in the text data "today" is "tian" and the corresponding tone is "1", so the first text feature of the text data "today", i.e. the sequence of vowels of the text data "today" is "jin1 tian". In the embodiment of the invention, the marking data corresponding to the audio data can be obtained in a manual marking mode, and can also be determined by a voice marking tool, and in the specific implementation process, the marking data can be flexibly set according to actual requirements, and the marking data is not particularly limited.
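The first text feature described above can be sketched in code. The following is an illustrative Python sketch, not the patent's implementation: the character-to-pinyin table and the function name are assumptions introduced purely for illustration (a real system would use a full pronunciation lexicon).

```python
# Hypothetical character -> (initials-and-finals, tone) table; only the two
# characters from the example above are included.
PINYIN_TABLE = {
    "今": ("jin", 1),
    "天": ("tian", 1),
}

def first_text_feature(text: str) -> str:
    """Build the first text feature: one 'pinyin+tone' token per character,
    joined with spaces, e.g. '今天' -> 'jin1 tian1'."""
    tokens = []
    for ch in text:
        syllable, tone = PINYIN_TABLE[ch]
        tokens.append(f"{syllable}{tone}")
    return " ".join(tokens)

print(first_text_feature("今天"))  # -> jin1 tian1
```

In practice the table would be produced by an annotation tool or a grapheme-to-phoneme component rather than hand-written.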
Specifically, any audio data and the annotation data corresponding to the audio data are acquired, and the currently acquired annotation data are taken as the annotation data to be inspected. To facilitate subsequent processing, the annotation data to be inspected may be converted into a corresponding number sequence. Specifically, the target number of each character contained in the annotation data may be determined through a preconfigured correspondence between characters and numbers, and the number sequence of the annotation data determined from these target numbers in order; alternatively, the number sequence may be determined directly through a model, for example a BERT model. The number sequence corresponding to the annotation data may be obtained in advance, or obtained whenever certain annotation data need to be inspected.
When the number sequence corresponding to the annotation data is determined from the preconfigured correspondence between characters and numbers, different characters are assigned different numbers, so that each character contained in the annotation data can be distinguished.
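A minimal sketch of this preprocessing step, under the assumption that the preconfigured character-to-number table is built from the annotation text itself (all names here are illustrative, not the patent's identifiers):

```python
def build_vocab(characters):
    """Preconfigured correspondence between characters and numbers:
    distinct characters get distinct numbers (0 is reserved for unknowns)."""
    return {ch: i + 1 for i, ch in enumerate(dict.fromkeys(characters))}

def to_number_sequence(annotation: str, vocab: dict) -> list:
    """Map each character of the annotation data to its target number."""
    return [vocab.get(ch, 0) for ch in annotation]

vocab = build_vocab("jin1 tian1")
seq = to_number_sequence("jin1 tian1", vocab)
```

Because `dict.fromkeys` preserves first-occurrence order, repeated characters (here "i", "n", "1") reuse the same number, while every distinct character keeps a distinct one, as the paragraph above requires.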
S102: based on the labeling data and the audio data, quality inspection data corresponding to the labeling data is determined through a decoder of the speech synthesis model, and the quality inspection data characterizes the corresponding relation between each character in the labeling data and each audio frame in the audio data.
S103: and judging whether the labeling data is correct or not according to the quality inspection data corresponding to the labeling data.
After the digital sequence corresponding to any marking data to be inspected and the audio data corresponding to the marking data are obtained based on the embodiment, corresponding processing is performed based on the digital sequence and the audio data, so that whether the marking data are correct or not is determined.
In practical applications, the speech synthesis model can not only generate the acoustic features of synthesized audio data from the number sequence corresponding to the input annotation data, but can also determine, from that number sequence and the audio data, the correspondence between each character in the annotation data and each audio frame in the audio data, and output this correspondence through a decoder of the speech synthesis model; that is, for each character contained in the annotation data, the audio frames of the audio data to which that character corresponds. The characters contained in the annotation data may be the characters of the text data, or the initials and finals in the first text feature. This correspondence reflects, to a certain extent, whether the current annotation data and the audio data are aligned. On this basis, in the embodiment of the present invention, in order to conveniently judge whether the annotation data corresponding to each item of audio data are correct, the correspondence between each character in the annotation data and each audio frame in the audio data, as determined by the speech synthesis model, may be taken as the quality inspection data corresponding to the annotation data. In a specific implementation, the quality inspection data corresponding to certain annotation data are obtained through the speech synthesis model based on the acquired number sequence of the annotation data and the corresponding audio data.
Then, based on the scheme provided by the embodiment of the invention, the acquired quality inspection data corresponding to the annotation data are processed accordingly to determine whether the annotation data are correct. For example, if the correspondence between each character in the annotation data and each audio frame in the audio data, as determined through the speech synthesis model, is inaccurate, the annotation data are determined to be erroneous.
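The character-to-frame correspondence described above can be sketched as follows. This is a hedged illustration under assumed data shapes, not the patent's code: it assumes the decoder exposes one attention-weight row per audio frame, and assigns each frame to the character with the highest weight.

```python
def frames_per_character(attention, num_chars):
    """attention: list of rows, one per audio frame; each row holds one
    attention weight per character of the annotation data.
    Returns {character index: [audio frame indices]}."""
    mapping = {c: [] for c in range(num_chars)}
    for frame_idx, row in enumerate(attention):
        # The frame corresponds to the character with the largest weight.
        best_char = max(range(num_chars), key=lambda c: row[c])
        mapping[best_char].append(frame_idx)
    return mapping

# Toy 4-frame, 2-character example: frames 0-1 attend to character 0,
# frames 2-3 to character 1.
att = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
print(frames_per_character(att, 2))
```

The resulting mapping is exactly the kind of quality inspection data S102 refers to: for each character, the audio frames it corresponds to.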
In the embodiment of the invention, quality inspection data corresponding to annotation data can be obtained through the speech synthesis model, based on the annotation data to be inspected and the audio data corresponding to it. The quality inspection data characterize the correspondence between each character in the annotation data and each audio frame in the corresponding audio data, and whether the annotation data are correct can be determined from the quality inspection data. Quality inspection of the annotation data is thus performed without manual work, which reduces the workload of quality inspection staff, reduces the influence of staff capability on inspection efficiency and accuracy, and makes it easy to trace and locate incorrectly annotated data.
Example 2: in order to improve accuracy and efficiency of quality inspection of labeling data to be inspected, on the basis of the above embodiment, in the embodiment of the present invention, judging whether the labeling data is correct according to quality inspection data corresponding to the labeling data includes:
if the quality inspection data corresponding to the labeling data meets the pre-configured quality inspection requirement, the labeling data is determined to be correct; or alternatively
If the quality inspection data corresponding to the labeling data is determined not to meet the pre-configured quality inspection requirement, determining that the labeling data is in labeling error.
In general, each character contained in correctly annotated annotation data corresponds to at least one audio frame of the corresponding audio data, and the last character contained in the annotation data necessarily corresponds to the last audio frame of the audio data. On this basis, to facilitate judging whether annotation data are correct, a quality inspection requirement is preconfigured in the embodiment of the invention. After the quality inspection data corresponding to the annotation data are obtained as in the above embodiment, whether the quality inspection data meet the preconfigured quality inspection requirement is judged, so as to determine whether the annotation data are correct.
In one possible implementation, if the quality inspection data corresponding to certain currently acquired annotation data are determined to meet the preconfigured quality inspection requirement, indicating that the annotation data are most likely correct, the annotation data are determined to be correct.
In another possible implementation, if the quality inspection data corresponding to certain currently acquired annotation data are determined not to meet the preconfigured quality inspection requirement, indicating that the annotation data are most likely wrong, the annotation data are determined to be erroneous, and the annotation data need to be corrected by a staff member afterwards.
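The pass/fail decision sketched above can be written down directly. This is a minimal illustration under the two observations stated earlier (every character covers at least one frame; the last character reaches the last frame); the function and argument names are assumptions, not the patent's identifiers.

```python
def passes_quality_requirement(char_to_frames, num_chars, num_frames):
    """char_to_frames: {character index: [audio frame indices]}.
    Annotation data are treated as correct only if every character
    corresponds to at least one audio frame and the last character
    corresponds to the last audio frame."""
    every_char_covered = all(char_to_frames.get(c) for c in range(num_chars))
    last_frame = num_frames - 1
    last_char_hits_end = last_frame in char_to_frames.get(num_chars - 1, [])
    return every_char_covered and last_char_hits_end
```

For example, a 2-character annotation over 4 frames passes with `{0: [0, 1], 1: [2, 3]}` but fails with `{0: [0, 1, 2, 3], 1: []}`, where the second character is never reached.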
Further, in order to improve the accuracy of quality inspection of the annotation data to be inspected, whether the quality inspection data corresponding to the annotation data meet the preconfigured quality inspection requirement is determined in at least one of the following ways:
determining, based on position quality inspection data contained in the quality inspection data, whether the number of characters contained in the text data of the annotation data is equal to the number of characters corresponding to the audio data, so as to determine whether the quality inspection data meet a preconfigured first quality inspection requirement, wherein the position quality inspection data identify the correspondence between the position, in the text feature sequence, of each second text feature corresponding to the annotation data and the position, in the audio feature sequence, of each audio feature corresponding to the audio data; the second text features are obtained by encoding the annotation data through an encoder of the speech synthesis model;
determining, based on the probability vectors respectively corresponding to each audio feature contained in the audio feature sequence of the audio data, as contained in the quality inspection data, whether the characters contained in the text data of the annotation data are consistent with the characters corresponding to the audio data, so as to determine whether the quality inspection data meet a preconfigured second quality inspection requirement, wherein the probability vector corresponding to any audio feature contains, for each second text feature corresponding to the annotation data, the probability value that the audio feature corresponds to that second text feature.
In the embodiment of the invention, the speech synthesis model is a model with an attention-based encoder-decoder structure, such as the Tacotron model, which can convert annotation data into the acoustic features of synthesized audio data and, during operation, determines the correspondence between each character in the annotation data and each audio frame of the corresponding audio data. For convenience of description, this correspondence can be visualized as an alignment chart. FIG. 2 is an alignment chart of correct annotation data according to an embodiment of the present invention. As shown in FIG. 2, any value m on the abscissa of the alignment chart denotes the mth audio feature in the audio feature sequence corresponding to the audio data, where the audio feature sequence may be a mel-frequency cepstrum sequence; for example, an abscissa of 10 denotes the 10th audio feature in the mel-frequency cepstrum sequence of the audio data. Any value n on the ordinate denotes the nth text feature (for convenience of description, called the second text feature) in the text feature sequence corresponding to the annotation data, where the second text features are obtained by encoding the annotation data through an encoder of the speech synthesis model. In the alignment chart, the correspondence between the second text features of correct annotation data and the audio features of the audio data is not necessarily a straight diagonal; it may also be a curve running from the lower-left corner of FIG. 2 to the upper-right corner of FIG. 2. The brighter a pixel on the curve, the higher the probability that the audio feature at that abscissa corresponds to the second text feature at that ordinate.
Therefore, if annotation data are correct, the curve in the corresponding alignment chart that represents the correspondence between the second text features of the annotation data and the audio features of the audio data should run from the lower-left corner of the chart to the upper-right corner, as shown in FIG. 2. If the annotation data are wrong, the trend of this curve may differ from that shown in FIG. 2. Whether annotation data are erroneous can therefore be determined from the alignment chart of the second text features of the annotation data against the audio features of the corresponding audio data.
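The "lower-left to upper-right" criterion can be checked programmatically. The sketch below is illustrative only: it assumes the alignment curve has already been reduced to one text-feature index per audio feature (e.g. by taking the brightest ordinate per abscissa), and the names are not the patent's identifiers.

```python
def curve_is_well_formed(path, num_text_features):
    """path[m] = index of the second text feature aligned to the mth audio
    feature. A correct alignment curve should start at the first text
    feature, end at the last one, and never move backwards."""
    monotonic = all(b >= a for a, b in zip(path, path[1:]))
    starts_low = path[0] == 0
    ends_high = path[-1] == num_text_features - 1
    return monotonic and starts_low and ends_high
```

For instance, the path `[0, 0, 1, 2, 2, 3]` over 4 text features satisfies the criterion, while `[0, 2, 1, 3]` does not, since the curve moves backwards.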
In practical application scenarios, the main problems in the annotation data corresponding to audio data are: the number of characters contained in the text data of the annotation data is greater than the number of characters corresponding to the audio data; the number of characters contained in the text data is smaller than the number of characters corresponding to the audio data; and the characters contained in the text data are inconsistent with the characters corresponding to the audio data. The characters corresponding to the audio data are the characters of the text that matches the content of the audio data. On this basis, in order to improve the accuracy of quality inspection of the annotation data to be inspected, a first quality inspection requirement and a second quality inspection requirement are preconfigured in the embodiment of the invention. The first quality inspection requirement requires that the number of characters contained in the text data of the annotation data equal the number of characters corresponding to the audio data; the second quality inspection requirement requires that the characters contained in the text data of the annotation data be consistent with the characters corresponding to the audio data.
In the embodiment of the invention, the types of annotation errors are predefined. For example, the problem that the number of characters contained in the text data of the annotation data is greater than the number of characters corresponding to the audio data is defined as the first error type; the problem that this number is smaller than the number of characters corresponding to the audio data is defined as the second error type; and the problem that the characters contained in the text data are inconsistent with the characters corresponding to the audio data is defined as the third error type.
In a specific implementation, after the quality inspection data corresponding to the annotation data are obtained through the speech synthesis model based on the annotation data and the audio data as in the above embodiment, whether the current quality inspection data meet the preconfigured quality inspection requirement is judged, mainly in the following cases:
In the first case, in the embodiment of the present invention, the correspondence between the position of each second text feature of the annotation data in the text feature sequence and the position of each audio feature of the audio data in the audio feature sequence, for example that the 37th second text feature in the text feature sequence corresponds to the 127th audio feature in the audio feature sequence, may be determined through the speech synthesis model. Accordingly, the preconfigured quality inspection requirement is the first quality inspection requirement, and the quality inspection data obtained through the speech synthesis model from the annotation data and the audio data include position quality inspection data, which identify the correspondence between the position, in the text feature sequence, of each second text feature of the annotation data and the position, in the audio feature sequence, of each audio feature of the audio data. Based on the position quality inspection data, whether the number of characters contained in the text data of the annotation data equals the number of characters corresponding to the audio data can be determined, and hence whether the acquired quality inspection data meet the preconfigured first quality inspection requirement.
Specifically, if it is determined, based on the position quality inspection data, that the number of characters contained in the text data of the annotation data is not equal to the number of characters corresponding to the audio data, the currently acquired quality inspection data are determined not to meet the preconfigured first quality inspection requirement, that is, not to meet the requirement that these two character counts be equal.
In one possible implementation, determining, based on the position quality inspection data contained in the quality inspection data, whether the number of characters contained in the text data of the annotation data equals the number of characters corresponding to the audio data, so as to determine whether the quality inspection data meet the preconfigured first quality inspection requirement, comprises:
if the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data of the annotation data equals the number of characters corresponding to the audio data, and that the quality inspection data meet the preconfigured first quality inspection requirement; and/or
if the first audio position in the position quality inspection data is consistent with the second audio position, determining that the number of characters contained in the text data of the annotation data equals the number of characters corresponding to the audio data, and that the quality inspection data meet the preconfigured first quality inspection requirement;
wherein the first text position is the position, in the text feature sequence, of the last second text feature that has a correspondence with an audio feature in the audio feature sequence, and the second text position is the position of the last second text feature in the text feature sequence.
In an actual application scenario, there may be a problem that the number of characters included in text data in annotation data is greater than the number of characters corresponding to audio data, so that when a corresponding relation between the position of each second text feature corresponding to the annotation data in a text feature sequence and the position of each audio feature corresponding to the audio data in an audio feature sequence is determined through a speech synthesis model, the determined position (for convenience of description, the first text position is marked) of the last second text feature in the audio feature sequence corresponding to the audio data is inconsistent with the position (for convenience of description, the second text position is marked) of the last second text feature in the text feature sequence, that is, the last audio frame in the audio data cannot correspond to the last character in the annotation data. Based on this, in the embodiment of the present invention, it is determined whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data, by the first text position and the second text position in the position quality inspection data.
Specifically, when the position quality inspection data comprises a first text position and a second text position: if the first text position is inconsistent with the second text position, indicating that the text data in the annotation data may contain more characters than the audio data, it is determined that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, that the quality inspection data does not meet the pre-configured first quality inspection requirement, and hence that the annotation data is labeled in error; if the first text position is consistent with the second text position, indicating that this problem does not exist, it is determined that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, that the quality inspection data meets the pre-configured first quality inspection requirement, and hence that the annotation data is correct.
For convenience of explanation, the correspondence between the position of each second text feature corresponding to the annotation data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence is visualized; fig. 3 is an alignment chart of erroneous annotation data according to an embodiment of the present invention. As shown in fig. 3, any value m on the abscissa of the alignment chart represents the mth audio feature in the audio feature sequence corresponding to the audio data, and any value n on the ordinate represents the nth second text feature in the text feature sequence corresponding to the annotation data. In the alignment chart, 001001 (79,149) indicates that the text feature sequence corresponding to the annotation data contains 79 second text features and the audio feature sequence corresponding to the audio data contains 149 audio features, that is, the second text position is 79. The curve in the chart represents the correspondence between each character in the annotation data and each audio frame in the audio data; the vertical dashed line represents the position of the last audio feature in the audio feature sequence, and the horizontal dashed line represents the position of the last second text feature in the text feature sequence. The point where the curve meets the vertical dashed line gives the first text position: the second text feature corresponding to the 149th audio feature in the audio feature sequence is the 37th second text feature in the text feature sequence, so the first text position is 37. The curve never reaches the horizontal dashed line, meaning the later second text features have no corresponding audio features; accordingly, the curve in fig. 3 does not run from the lower left corner of fig. 3 to the upper right corner of fig. 3.
Therefore, since the first text position 37 is not equal to the second text position 79, the text data in the annotation data contains more characters than the audio data, and it is determined that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data.
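The text-position check described above can be sketched in a few lines. This is a hypothetical illustration, not code from the patent: `alignment` is assumed to be a list of 1-based `(audio_index, text_index)` pairs produced by the speech synthesis model's alignment, and the check simply compares the first text position with the second text position (the text-feature count).

```python
def check_text_position(alignment, num_text_features):
    """First quality-inspection check, text side (hypothetical sketch).

    alignment: list of (audio_index, text_index) pairs, one per audio
    feature, 1-based, as an attention-style alignment would yield.
    num_text_features: length of the text feature sequence, i.e. the
    second text position.
    """
    # First text position: the text feature aligned with the last audio feature.
    first_text_position = alignment[-1][1]
    second_text_position = num_text_features
    # Consistent positions mean every character is covered by the audio.
    return first_text_position == second_text_position

# The fig. 3 example: 79 text features, but the last (149th) audio
# feature aligns only with the 37th text feature -> annotation is wrong.
assert check_text_position([(149, 37)], 79) is False
assert check_text_position([(149, 79)], 79) is True
```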
Similarly, the text data in the annotation data may contain fewer characters than the audio data. In that case, when the correspondence between the position of each second text feature corresponding to the annotation data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence is determined through the speech synthesis model, the determined first text position and second text position are consistent, but the position, in the audio feature sequence, of the last audio feature that corresponds to a second text feature (denoted the first audio position for convenience of description) is inconsistent with the position of the last audio feature in the audio feature sequence (denoted the second audio position); that is, the last character in the annotation data does not correspond to the last audio frame in the audio data. Based on this, in the embodiment of the present invention, whether the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data may also be determined from the first audio position and the second audio position contained in the position quality inspection data.
Specifically, when the position quality inspection data comprises a first audio position and a second audio position: if the first audio position is inconsistent with the second audio position, indicating that the text data in the annotation data may contain fewer characters than the audio data, it is determined that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, that the quality inspection data does not meet the pre-configured first quality inspection requirement, and hence that the annotation data is labeled in error; if the first audio position is consistent with the second audio position, indicating that this problem does not exist, it is determined that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, that the quality inspection data meets the pre-configured first quality inspection requirement, and hence that the annotation data is correct.
For convenience of explanation, the correspondence between the position of each second text feature corresponding to the annotation data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence is visualized; fig. 4 is an alignment chart of erroneous annotation data according to an embodiment of the present invention. As shown in fig. 4, any value m on the abscissa of the alignment chart represents the mth audio feature in the audio feature sequence corresponding to the audio data, and any value n on the ordinate represents the nth second text feature in the text feature sequence corresponding to the annotation data. In the alignment chart, 001014 (31,181) indicates that the text feature sequence corresponding to the annotation data contains 31 second text features and the audio feature sequence corresponding to the audio data contains 181 audio features, that is, the second audio position is 181. The curve in the chart represents the correspondence between each character in the annotation data and each audio frame in the audio data; the vertical dotted line represents the position of the last audio feature in the audio feature sequence, and the horizontal dotted line represents the position of the last second text feature in the text feature sequence. The point where the curve meets the horizontal dotted line gives the first audio position: the audio feature corresponding to the 31st second text feature in the text feature sequence is the 125th audio feature in the audio feature sequence, so the first audio position is 125. The curve never reaches the vertical dotted line, meaning the later audio features have no corresponding second text features; accordingly, the curve in fig. 4 does not run from the lower left corner of fig. 4 to the upper right corner of fig. 4.
Since the first audio position 125 is not equal to the second audio position 181, the text data in the annotation data contains fewer characters than the audio data, and it is determined that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data.
Of course, the obtained position quality inspection data may include both the first and second text positions and the first and second audio positions. In that case, the first and second text positions are compared as above to determine whether the annotation data currently under quality inspection contains more characters than the audio data, and the first and second audio positions are compared as above to determine whether it contains fewer characters than the audio data. When either problem is found, that is, when the first text position is inconsistent with the second text position or the first audio position is inconsistent with the second audio position, it is determined that the quality inspection data corresponding to the annotation data does not meet the pre-configured first quality inspection requirement and that the annotation data is labeled in error; if the first text position is consistent with the second text position and the first audio position is consistent with the second audio position, it is determined that the quality inspection data corresponding to the annotation data meets the pre-configured first quality inspection requirement and that the annotation data is correct.
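Combining both comparisons, the decision procedure of the preceding paragraph can be sketched as follows. The helper name and the error-type strings are illustrative assumptions, not terms from the patent:

```python
def inspect_positions(first_text, second_text, first_audio, second_audio):
    """Combined first-quality-inspection check (hypothetical helper).

    Returns (passed, error_type): error_type is 'extra_characters' when
    the text contains more characters than the audio accounts for,
    'missing_characters' when the audio extends past the last labeled
    character, or None when the positions are consistent.
    """
    if first_text != second_text:
        # Text has more characters than the audio covers.
        return False, "extra_characters"
    if first_audio != second_audio:
        # Audio has frames with no corresponding character.
        return False, "missing_characters"
    return True, None

# The fig. 3 case (first text position 37 vs second text position 79)
# and the fig. 4 case (first audio position 125 vs second 181):
assert inspect_positions(37, 79, 149, 149) == (False, "extra_characters")
assert inspect_positions(31, 31, 125, 181) == (False, "missing_characters")
assert inspect_positions(79, 79, 149, 149) == (True, None)
```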
As one possible implementation manner, determining, based on the position quality inspection data included in the quality inspection data, whether the number of characters included in the text data in the labeling data is equal to the number of characters corresponding to the audio data, so as to determine whether the quality inspection data meets a first quality inspection requirement configured in advance includes:
if the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement; and/or
If the first audio position in the position quality inspection data is consistent with the second audio position, determining that the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement;
the first text position is the position, in the text feature sequence, of the last second text feature that has a correspondence with an audio feature in the audio feature sequence, and the second text position is the position of the last second text feature in the text feature sequence.
As another possible implementation manner, determining, based on the position quality inspection data contained in the quality inspection data, whether the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, so as to determine whether the quality inspection data meets the pre-configured first quality inspection requirement, further includes:
if the first text position and the second text position in the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement; or
if the first audio position and the second audio position in the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement;
the first audio position is the position, in the audio feature sequence, of the last audio feature that has a correspondence with a second text feature in the text feature sequence; the second audio position is the position of the last audio feature in the audio feature sequence.
In the second case, in an actual application scenario, the characters contained in the text data in the annotation data may be inconsistent with the characters corresponding to the audio data; that is, the content labeled by the annotation data is inconsistent with the content uttered in the audio data. In that case, when the correspondence between the position of each second text feature corresponding to the annotation data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence is determined through the speech synthesis model, at least one second text feature corresponding to the annotation data may have no corresponding audio feature in the audio feature sequence corresponding to the audio data. That is, according to the probability vector corresponding to a certain audio feature in the audio feature sequence, the second text feature corresponding to that audio feature cannot be determined in the text feature sequence corresponding to the annotation data, so the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data.
In this case, the pre-configured quality inspection requirement is a second quality inspection requirement, and the quality inspection data obtained through the speech synthesis model includes the probability vector corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data. Based on the magnitude of each probability value contained in the probability vector corresponding to each audio feature, it can be determined whether the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data, so as to determine whether the obtained quality inspection data meets the pre-configured second quality inspection requirement.
The probability vector corresponding to any audio feature contains, for each second text feature corresponding to the annotation data, the probability value that the audio feature corresponds to that second text feature; the meanings of the text feature sequence and the audio feature sequence are described in the above embodiments and are not repeated here.
In one possible implementation manner, determining, based on the probability vector corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data contained in the quality inspection data, whether the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data, so as to determine whether the quality inspection data meets the pre-configured second quality inspection requirement, includes:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature;
if the maximum probability value corresponding to any audio feature is smaller than a preset probability threshold, determining that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data, and determining that the quality inspection data does not meet the second quality inspection requirement; or
if the maximum probability value corresponding to each audio feature is not smaller than the preset probability threshold, determining that the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data, and determining that the quality inspection data meets the second quality inspection requirement.
In the embodiment of the invention, in order to accurately determine whether each audio feature has a corresponding second text feature in the text feature sequence corresponding to the annotation data, a probability threshold is preset. In the specific implementation process, the probability vector corresponding to each audio feature in the audio feature sequence is obtained through the speech synthesis model from the quality inspection data corresponding to the annotation data, and for each audio feature the maximum probability value in its probability vector is obtained. Each maximum probability value is then compared with the preset probability threshold. If the maximum probability value corresponding to any audio feature is smaller than the preset probability threshold, some audio feature of the audio data has no corresponding second text feature in the text feature sequence corresponding to the annotation data, and it is determined that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data; if the maximum probability value corresponding to each audio feature is not smaller than the preset probability threshold, every audio feature of the audio data has a corresponding second text feature in the text feature sequence corresponding to the annotation data, and it is determined that the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data.
For example, suppose the preset probability threshold is 0.8 and the maximum probability value corresponding to a certain audio feature of audio data A is 0.7. Comparing the maximum probability value 0.7 with the preset probability threshold 0.8, the maximum probability value is determined to be smaller than the threshold, so no second text feature corresponding to that audio feature exists in the text feature sequence corresponding to annotation data A, and it is determined that the characters contained in the text data in annotation data A are inconsistent with the characters corresponding to audio data A.
Continuing the above example, suppose the maximum probability value corresponding to a certain audio feature of audio data B is 0.9. Comparing 0.9 with the preset probability threshold 0.8, the maximum probability value is determined to be greater than the threshold, so a second text feature corresponding to that audio feature exists in the text feature sequence corresponding to annotation data B, and the maximum probability value corresponding to the next audio feature is obtained. When it is determined, based on the above steps, that the maximum probability value corresponding to each audio feature of audio data B is not smaller than the preset probability threshold, it is determined that the characters contained in the text data in annotation data B are consistent with the characters corresponding to audio data B.
It should be noted that any audio feature corresponds to a plurality of probability values, and the audio data corresponds to a plurality of audio features. One option is to stop as soon as the maximum probability value corresponding to some audio feature is determined to be smaller than the preset probability threshold: it is then determined that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data, and the remaining audio features are not checked further. Alternatively, after determining for every audio feature whether its maximum probability value is smaller than the preset probability threshold, it is determined from all the comparison results whether the maximum probability value corresponding to any audio feature is smaller than the preset probability threshold.
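The threshold comparison described above, with the early stop just mentioned, can be sketched as follows. The 0.8 threshold follows the worked example; the function name and return convention are assumptions for illustration:

```python
def check_probability_vectors(prob_vectors, threshold=0.8):
    """Second quality-inspection check (minimal sketch).

    prob_vectors: one probability vector per audio feature; each entry
    gives the probability that the audio feature corresponds to each
    second text feature. Short-circuits on the first failing frame.
    Returns (passed, index_of_first_failing_frame_or_None).
    """
    for i, vector in enumerate(prob_vectors):
        if max(vector) < threshold:
            # No text feature matches this audio feature confidently
            # enough: characters and audio content are inconsistent.
            return False, i
    return True, None

# Audio data A: one frame peaks at 0.7 < 0.8 -> inspection fails there.
assert check_probability_vectors([[0.1, 0.9], [0.7, 0.2]]) == (False, 1)
# Audio data B: every frame's maximum is >= 0.8 -> inspection passes.
assert check_probability_vectors([[0.9, 0.1], [0.05, 0.95]]) == (True, None)
```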
For convenience of explanation, the correspondence between each character in the annotation data and each audio frame in the audio data is visualized; fig. 5 is an alignment chart of erroneous annotation data according to an embodiment of the present invention. As shown in fig. 5, any value m on the abscissa represents the mth audio feature in the audio feature sequence corresponding to the audio data, and any value n on the ordinate represents the nth second text feature in the text feature sequence corresponding to the annotation data. The curve in the chart represents the correspondence between each character in the annotation data and each audio frame in the audio data, but in the area selected by the rectangular frame no clear curve representing this correspondence can be found, which indicates that for those audio features no second text feature with a correspondence can be found in the text feature sequence corresponding to the annotation data. It is therefore determined that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data.
In the third case, in an actual application scenario, the annotation data may simultaneously have the problem that the number of characters contained in the text data is not equal to the number of characters corresponding to the audio data and the problem that the characters contained in the text data are inconsistent with the characters corresponding to the audio data. To raise the standard of quality inspection of the annotation data, in the embodiment of the present invention the pre-configured quality inspection requirements include both the first quality inspection requirement and the second quality inspection requirement. In the implementation process, after the quality inspection data corresponding to the annotation data is obtained through the speech synthesis model, whether the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data is determined based on the position quality inspection data contained in the quality inspection data, and whether the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data is determined based on the probability vector corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data.
When the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and/or the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data, it is determined that the quality inspection data does not meet the pre-configured quality inspection requirements; when the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data, it is determined that the quality inspection data meets the pre-configured quality inspection requirements.
It should be noted that, to ensure the comprehensiveness of the quality inspection of the annotation data, after it is determined that the quality inspection data does not meet the pre-configured first quality inspection requirement, whether it meets the pre-configured second quality inspection requirement may still be determined; likewise, after it is determined that the quality inspection data does not meet the pre-configured second quality inspection requirement, whether it meets the pre-configured first quality inspection requirement may still be determined. In this way, all possible problems in the annotation data can be identified.
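Evaluating both requirements even after one has already failed, as described above, can be sketched by collecting every problem rather than returning at the first one. All names and problem strings here are hypothetical:

```python
def inspect_annotation(first_text, second_text, first_audio, second_audio,
                       prob_vectors, threshold=0.8):
    """Runs both pre-configured quality-inspection requirements and
    reports every failure, so all possible problems in the annotation
    data are identified (hypothetical sketch)."""
    problems = []
    # First requirement: character counts, via the position data.
    if first_text != second_text or first_audio != second_audio:
        problems.append("character count mismatch")
    # Second requirement: character content, via the probability vectors.
    if any(max(vector) < threshold for vector in prob_vectors):
        problems.append("character content mismatch")
    return problems  # empty list -> annotation data is correct

assert inspect_annotation(79, 79, 149, 149, [[0.9, 0.1]]) == []
assert inspect_annotation(37, 79, 149, 149, [[0.4, 0.3]]) == [
    "character count mismatch", "character content mismatch"]
```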
Example 3: in order to facilitate staff to modify the labeling data with labeling errors, in the embodiments of the present invention, after determining that quality inspection data corresponding to the labeling data does not meet a pre-configured quality inspection requirement, the method further includes: and outputting prompt information of the marked data errors.
In an actual application scenario, after it is determined that the quality inspection data corresponding to certain annotation data does not meet the pre-configured quality inspection requirement, the annotation data is determined to be labeled in error. The annotation data can then be modified according to its corresponding audio data so that it becomes correct; when the corrected annotation data is input into the speech synthesis model again, the obtained quality inspection data corresponding to it should meet the pre-configured quality inspection requirement.
To help the staff modify erroneous annotation data in time, in the embodiment of the invention, after it is determined that the quality inspection data does not meet the pre-configured quality inspection requirement, prompt information indicating the annotation error can be output, so as to prompt the staff that the annotation data is labeled in error and should be modified in time.
The output prompt information may be in audio format, for example voice-broadcasting the prompt "the annotation data currently under quality inspection is labeled in error"; or the corresponding prompt may be displayed in text form on a display interface, for example displaying the prompt "the annotation data currently under quality inspection is labeled in error", flashing a red light, or showing a pop-up box on the display interface; or the prompt may be sent to the smart terminal of the relevant staff by short message or e-mail. Of course, at least two of these output modes can be combined, for example simultaneously broadcasting the prompt in audio format and displaying it in text form on the display interface. The specific choice can be flexibly set according to actual requirements and is not limited here.
Which mode is used to output the prompt information can be preset according to user preference, or selected according to the capabilities of the electronic device; for example, for an electronic device that has no display interface capable of showing the prompt, the prompt can be broadcast in audio format.
In one possible implementation manner, outputting the prompt information indicating that the annotation data is labeled in error includes:
if the quality inspection data does not meet the first quality inspection requirement, outputting prompt information that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data; and/or
if the quality inspection data does not meet the second quality inspection requirement, outputting prompt information that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data.
In the embodiment of the invention, to further help the staff modify erroneous annotation data, the content of the output prompt information can be determined according to the type of quality inspection requirement that the quality inspection data corresponding to the current annotation data fails to meet. Specifically, this covers the following cases:
In the first case, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured first quality inspection requirement, the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data. So that the staff can modify the text data directly according to this problem, prompt information stating that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data can be output, for example: "The number of characters contained in the annotation data 'how to get there tomorrow' is not equal to the number of characters corresponding to the audio data; please check."
In one possible implementation manner, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured first quality inspection requirement and it is determined that the number of characters contained in the text data in the annotation data is greater than the number of characters corresponding to the audio data, the error type of the annotation data is determined to be a first error type, and prompt information indicating that the annotation data has the first error type is output.
In another possible implementation manner, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured first quality inspection requirement and it is determined that the number of characters contained in the text data in the annotation data is smaller than the number of characters corresponding to the audio data, the error type of the annotation data is determined to be a second error type, and prompt information indicating that the annotation data has the second error type is output.
In the second case, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured second quality inspection requirement, this indicates that the characters contained in the text data of the annotation data are inconsistent with the characters corresponding to the audio data. To enable staff to modify the text data directly according to the problem found in the annotation data, prompt information can be output stating that the characters contained in the text data are inconsistent with the characters corresponding to the audio data. For example, the prompt message "The characters contained in the annotation data 'how to get there tomorrow' are inconsistent with the characters corresponding to the audio data; please check." may be output.
In one possible implementation, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured second quality inspection requirement, indicating that the characters contained in the text data of the annotation data are inconsistent with the characters corresponding to the audio data, the error type of the annotation data is determined to be a third error type, and prompt information indicating that the annotation data has the third error type is output.
In order to reduce the workload required by staff to modify annotation data, in an embodiment of the present invention, the method further includes: determining a target audio frame in the audio data;
the prompt information also comprises the position of the target audio frame in the audio data.
In the embodiment of the invention, after it is determined that the quality inspection data corresponding to the annotation data does not meet the pre-configured second quality inspection requirement, any audio frame that has no corresponding character among the characters contained in the annotation data is determined to be a target audio frame, and prompt information is output carrying both the inconsistency between the characters contained in the text data and the characters corresponding to the audio data, and the position of each target audio frame in the audio data. Staff can subsequently locate the target audio segment in the audio data quickly according to the position of the target audio frame in the prompt information, and modify the annotation data accordingly.
Wherein determining the target audio frame in the audio data comprises:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature; and determining the audio frame corresponding to the audio feature with the maximum probability value smaller than the preset probability threshold as a target audio frame.
In the actual application process, the probability value of each second text feature corresponding to the annotation data for each audio feature in the audio feature sequence corresponding to the audio data can be determined through the speech synthesis model. The larger the probability value, the more likely the audio feature has a corresponding relation with the second text feature associated with that probability value; the smaller the probability value, the less likely such a corresponding relation exists. For correctly labeled annotation data, each audio feature in the audio feature sequence generally has a corresponding second text feature among the second text features of the annotation data, and that corresponding second text feature is generally the one associated with the maximum probability value in the probability vector of the audio feature. Based on this, in the embodiment of the invention, the maximum probability value in the probability vector corresponding to each audio feature is obtained. For each audio feature, it is determined whether its maximum probability value is smaller than a preset probability threshold; if so, the audio feature is determined to be a target audio feature, and the audio frame corresponding to the target audio feature is determined to be a target audio frame.
Based on the manner in the above embodiment, after each target audio frame in the audio data is determined, prompt information is output stating that the characters contained in the text data of the annotation data are inconsistent with the characters corresponding to the audio data, together with the position of each target audio frame in the audio data. Staff can then directly determine where and how to modify the annotation data according to the position information of each target audio frame carried in the prompt information, which reduces their workload and improves the efficiency of correcting incorrectly labeled annotation data.
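The target-audio-frame determination above can be sketched as follows, assuming the per-frame probability vectors are available as a matrix. The function name, matrix layout, and threshold value are illustrative assumptions, not elements fixed by the method.

```python
import numpy as np

def find_target_frames(attention_probs: np.ndarray, prob_threshold: float = 0.5):
    """Return indices of target audio frames.

    attention_probs: [num_audio_frames, num_text_features] matrix, where row i
    is the probability vector of audio feature i over the second text features.
    A frame whose maximum probability falls below the threshold has no
    confidently aligned character and is marked as a target audio frame.
    """
    max_per_frame = attention_probs.max(axis=1)  # maximum probability value per audio feature
    return np.where(max_per_frame < prob_threshold)[0].tolist()
```

The returned frame indices can then be converted into time positions (e.g. by multiplying by the frame shift) and carried in the prompt information.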
In the third case, in an actual application scenario the quality of the annotation data may be very poor, and the corresponding quality inspection data may meet neither the pre-configured first quality inspection requirement nor the pre-configured second quality inspection requirement. To enable staff to modify the annotation data accurately, after it is determined that the quality inspection data meets neither requirement, both the prompt information that the number of characters contained in the text data is not equal to the number of characters corresponding to the audio data and the prompt information that the characters contained in the text data are inconsistent with the characters corresponding to the audio data may be output.
Example 4: the following describes a data processing method provided by the embodiment of the present invention through a specific implementation manner, as shown in fig. 6, the flow includes:
S601: and acquiring a trained speech synthesis model.
In the embodiment of the present invention, the electronic device for training the speech synthesis model may be the same as or different from the electronic device for performing data processing in the above embodiment. Specifically, the setting may be performed according to actual requirements, which is not limited herein.
In order to train a speech synthesis model, for example a Tacotron model, sample audio data for training the speech synthesis model needs to be collected in advance, the annotation data corresponding to the sample audio data is determined, and that annotation data is used as sample data for training the original speech synthesis model.
In an actual application scenario, the amount of incorrectly labeled annotation data is generally smaller than the amount of correctly labeled annotation data. Therefore, in the embodiment of the invention, if there is enough annotation data to be inspected, it can be used directly as sample data to train the original speech synthesis model and obtain a trained model, which reduces the time consumed in collecting sample data; alternatively, a large amount of pre-collected, correctly labeled annotation data can be used as sample data to train the original speech synthesis model and obtain the trained model.
Of course, the pre-collected correctly labeled annotation data and the annotation data to be inspected may also both be used as sample data: the original speech synthesis model is first trained on the pre-collected correctly labeled annotation data to obtain a base speech synthesis model, and the base model is then further trained on the annotation data to be inspected to obtain the final trained speech synthesis model. The specific manner of training the speech synthesis model can be set flexibly according to actual requirements and is not specifically limited herein.
Specifically, training the speech synthesis model based on the sample data:
Any sample data and sample audio data corresponding to the sample data are acquired;
acquiring acoustic characteristic parameters corresponding to sample data through an original voice synthesis model;
and training the original speech synthesis model according to the acoustic characteristic parameters and the sample audio data.
Since there is typically a large amount of sample data for training the speech synthesis model, the above operations are performed for each sample data item, and training is complete when a preset convergence condition is met.
Meeting the preset convergence condition may mean, for example, that the loss value determined based on the acoustic characteristic parameters and the sample audio data corresponding to each sample data item is smaller than a preset loss threshold, that the loss value keeps decreasing and then flattens out, or that the number of iterations of training the original speech synthesis model reaches a set maximum, and so on. This may be set flexibly in implementation and is not specifically limited herein.
Any speech synthesis model meeting the preset convergence condition may be determined as the trained speech synthesis model for subsequent quality inspection of the annotation data to be inspected; for example, a speech synthesis model with a smaller loss value may be used.
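The convergence condition described above can be sketched as follows. All threshold values, window sizes, and the function name are illustrative assumptions; the method itself leaves these choices to the implementation.

```python
def training_converged(loss_history,
                       loss_threshold=0.01,
                       plateau_window=5,
                       plateau_delta=1e-4,
                       max_iterations=100_000):
    """Check the preset convergence condition: the latest loss is below a
    threshold, or the loss curve has flattened over a recent window, or the
    iteration count has reached the configured maximum."""
    if not loss_history:
        return False
    # Condition 1: loss smaller than the preset loss threshold.
    if loss_history[-1] < loss_threshold:
        return True
    # Condition 2: loss in a downward trend that has flattened out.
    if len(loss_history) >= plateau_window:
        recent = loss_history[-plateau_window:]
        if max(recent) - min(recent) < plateau_delta:
            return True
    # Condition 3: maximum number of training iterations reached.
    return len(loss_history) >= max_iterations
```

The same structure applies to the test-set loss (val_loss) check described below, with the training loss history replaced by the test-set loss history.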
As a possible implementation manner, when training a speech synthesis model, sample data may be divided into a training sample and a test sample, the original speech synthesis model is trained based on the training sample, and then the reliability of the trained speech synthesis model is verified based on the test sample.
When the trained speech synthesis model is tested based on the test samples, a test-set loss value (val_loss) is calculated during testing based on the acoustic characteristic parameters and the test audio data corresponding to each test sample. When the currently obtained test-set loss value is smaller than a preset test loss threshold, or keeps decreasing and then flattens out, the trained speech synthesis model is determined to be reliable and can be used for speech synthesis or for quality inspection of the annotation data to be inspected.
In the process of training a speech synthesis model, an offline mode is generally adopted: the original speech synthesis model is trained in advance on an electronic device using the sample data, so as to obtain the trained speech synthesis model.
The speech synthesis model trained as in the above embodiment is then stored in the electronic device used for subsequent data processing, and quality inspection of the annotation data to be inspected is performed through that device.
S602: any marking data to be inspected and the audio data corresponding to the marking data are acquired.
S603: and acquiring a digital sequence corresponding to the labeling data.
S604: and determining quality inspection data corresponding to the labeling data based on the digital sequence and the audio data through a voice synthesis model.
S605: and judging whether the quality inspection data meets the pre-configured quality inspection requirement, if so, executing S606, otherwise, executing S607.
Wherein the pre-configured quality inspection requirements comprise a first quality inspection requirement and/or a second quality inspection requirement. The specific determination of whether the quality inspection data meets the pre-configured quality inspection requirement is described in the above embodiment, and the repetition is not repeated.
S606: the annotation data is determined to be correct.
S607: determining the annotation data error and outputting prompt information of the annotation data error.
Specifically, the method for outputting the prompt information is also described in the above embodiments, and the repetition is not described in detail.
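The flow of steps S602-S607 above can be sketched as follows. The `model` object, its method names, and the two check callables are hypothetical stand-ins for the trained speech synthesis model and the two pre-configured quality inspection requirements; none of these names are defined by the method itself.

```python
def inspect_annotation(model, annotation_text, audio_data,
                       first_check, second_check):
    """Sketch of steps S602-S607 for one item of annotation data.

    first_check / second_check: callables applied to the quality inspection
    data, returning True when the corresponding requirement is met.
    """
    number_sequence = model.encode_text(annotation_text)   # S603: digital sequence
    qc_data = model.align(number_sequence, audio_data)     # S604: quality inspection data
    if first_check(qc_data) and second_check(qc_data):     # S605: both requirements met?
        return {"correct": True}                           # S606: annotation is correct
    return {"correct": False,                              # S607: output a prompt
            "prompt": f"Annotation '{annotation_text}' may be labeled incorrectly; please check."}
```

Running this per item over the whole batch of annotation data to be inspected yields the list of items that need manual correction.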
Example 5: an embodiment of the present invention provides a data processing apparatus, as shown in fig. 7, including:
an obtaining unit 71, configured to obtain any one of annotation data to be inspected and audio data corresponding to the annotation data, where the annotation data includes text data corresponding to the audio data and a first text feature thereof;
A determining unit 72, configured to determine, based on the labeling data and the audio data, quality inspection data corresponding to the labeling data through a decoder of the speech synthesis model, where the quality inspection data characterizes a correspondence between each character in the labeling data and each audio frame in the audio data;
And the judging unit 73 is configured to judge whether the labeling data is correct according to the quality inspection data corresponding to the labeling data.
In a possible embodiment, the judging unit 73 is specifically configured to:
If the quality inspection data corresponding to the annotation data meets the pre-configured quality inspection requirement, the annotation data is determined to be correct; or if the quality inspection data corresponding to the annotation data is determined not to meet the pre-configured quality inspection requirement, the annotation data is determined to be labeled incorrectly.
In one possible implementation manner, the determining unit 73 determines whether the quality inspection data corresponding to the labeling data meets the pre-configured quality inspection requirement according to at least one of the following manners:
Determining whether the number of characters contained in text data in the labeling data is equal to the number of characters corresponding to the audio data based on position quality inspection data contained in the quality inspection data so as to determine whether the quality inspection data meets a first quality inspection requirement configured in advance, wherein the position quality inspection data is used for identifying the corresponding relation between the position of each second text feature corresponding to the labeling data in a text feature sequence and the position of each audio feature corresponding to the audio data in an audio feature sequence; the second text feature is obtained by encoding the annotation data through an encoder in the speech synthesis model;
Based on probability vectors respectively corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data, contained in the quality inspection data, determining whether characters contained in the text data in the annotation data are consistent with characters corresponding to the audio data, so as to determine whether the quality inspection data meets a pre-configured second quality inspection requirement, wherein the probability vector corresponding to any audio feature contains a probability value of that audio feature for each second text feature corresponding to the annotation data.
In a possible embodiment, the judging unit 73 is specifically configured to:
If the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement; and/or if the first audio position and the second audio position in the position quality inspection data are consistent, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meet the first quality inspection requirement;
the first text position is the position, in the text feature sequence, of the last second text feature that has a corresponding relation with an audio feature in the audio feature sequence; the second text position is the position of the last second text feature in the text feature sequence.
In a possible embodiment, the judging unit 73 is specifically configured to:
If the first text position and the second text position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the labeling data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data do not meet the first quality inspection requirement; or if the first audio position and the second audio position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data do not meet the first quality inspection requirement;
The first audio position is the position, in the audio feature sequence, of the last audio feature that has a corresponding relation with a second text feature in the text feature sequence; the second audio position is the position of the last audio feature in the audio feature sequence.
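The position-based first quality inspection check above can be sketched as follows, assuming the position quality inspection data is available as a list of aligned (text position, audio position) index pairs. The function name and the 0-based indexing are illustrative assumptions; the method's "and/or" wording allows either flag to be used on its own.

```python
def check_position_requirement(alignment_pairs, num_text_features, num_audio_features):
    """Return (text_consistent, audio_consistent).

    alignment_pairs: (text_pos, audio_pos) index pairs from the position
    quality inspection data, 0-based for illustration.
    text_consistent: the first text position (last aligned text index) equals
    the second text position (last index in the text feature sequence).
    audio_consistent: likewise for the audio feature sequence.
    """
    last_aligned_text = max(t for t, _ in alignment_pairs)   # first text position
    last_aligned_audio = max(a for _, a in alignment_pairs)  # first audio position
    return (last_aligned_text == num_text_features - 1,
            last_aligned_audio == num_audio_features - 1)
```

When either flag is False, some trailing characters or trailing audio frames have no counterpart, i.e. the character counts of text and audio do not match and the first quality inspection requirement is not met.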
In a possible embodiment, the judging unit 73 is specifically configured to:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature;
If the maximum probability value corresponding to any audio feature is smaller than the preset probability threshold value, determining that characters contained in text data in the annotation data are inconsistent with characters corresponding to the audio data, and determining that the quality inspection data do not meet a second quality inspection requirement; or if the maximum probability value corresponding to each audio feature is not less than the preset probability threshold value, determining that the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data, and determining that the quality inspection data meet the second quality inspection requirement.
In one possible embodiment, the apparatus further comprises: an output unit;
And the output unit is configured to output prompt information of an annotation error after the judging unit 73 determines that the quality inspection data corresponding to the annotation data does not meet the pre-configured quality inspection requirement.
In one possible embodiment, the output unit is specifically configured to:
if the quality inspection data does not meet the first quality inspection requirement, outputting prompt information that the number of characters contained in the text data in the labeling data is unequal to the number of characters corresponding to the audio data; and/or outputting prompt information that characters contained in the text data in the annotation data are inconsistent with characters corresponding to the audio data if the quality inspection data do not meet the second quality inspection requirement.
In a possible implementation manner, the determining unit 72 is further configured to determine a target audio frame in the audio data, so that the prompt information output by the output unit further includes a position of the target audio frame in the audio feature sequence.
In a possible embodiment, the determining unit 72 is specifically configured to:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature; and determining the audio frame corresponding to the audio feature with the maximum probability value smaller than the preset probability threshold as a target audio frame.
In the embodiment of the invention, quality inspection data corresponding to annotation data can be obtained through the speech synthesis model based on the annotation data to be inspected and its corresponding audio data; the quality inspection data characterizes the correspondence between each character in the annotation data and each audio frame in the audio data, and whether the annotation data is correct can be determined according to the quality inspection data. Quality inspection of the annotation data to be inspected is thus achieved without manual work, which reduces the workload of quality inspection personnel, reduces the influence of their individual working ability on inspection efficiency and accuracy, and makes incorrectly labeled annotation data easy to trace and locate.
Example 6: fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and on the basis of the foregoing embodiments, the embodiment of the present invention further provides an electronic device, as shown in fig. 8, including: the processor 81, the communication interface 82, the memory 83 and the communication bus 84, wherein the processor 81, the communication interface 82 and the memory 83 complete communication with each other through the communication bus 84;
The memory 83 has stored therein a computer program which, when executed by the processor 81, causes the processor 81 to perform the steps of any of the data processing method embodiments described above.
Since the principle of the electronic device for solving the problem is similar to that of the data processing method, the implementation of the electronic device can refer to the implementation of the method, and the repetition is omitted.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface 82 is used for communication between the above-described electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processing unit, DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
Example 7: on the basis of the above embodiments, the present invention further provides a computer readable storage medium, in which a computer program executable by a processor is stored, which when executed on the processor causes the processor to implement the steps in any of the above data processing method embodiments.
Since the principle of the above-mentioned computer readable storage medium for solving the problem is similar to that of the above-mentioned data processing method, the implementation of the above-mentioned computer readable storage medium can refer to the implementation of the method, and the repetition is omitted.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (18)

1. A method of data processing, the method comprising:
Any marking data to be inspected and audio data corresponding to the marking data are acquired, wherein the marking data comprise text data corresponding to the audio data and first text features thereof;
Based on the labeling data and the audio data, determining quality inspection data corresponding to the labeling data through a decoder of a voice synthesis model, wherein the quality inspection data represents the corresponding relation between each character in the labeling data and each audio frame in the audio data;
If the quality inspection data corresponding to the labeling data meets the pre-configured quality inspection requirement, determining that the labeling data is correct; or alternatively
If the quality inspection data corresponding to the annotation data does not meet the pre-configured quality inspection requirement, determining that the annotation data is labeled incorrectly;
Wherein determining whether the quality inspection data corresponding to the labeling data meets the pre-configured quality inspection requirement according to at least one of the following modes comprises:
Determining whether the number of characters contained in text data in the labeling data is equal to the number of characters corresponding to the audio data based on position quality inspection data contained in the quality inspection data so as to determine whether the quality inspection data meets a first quality inspection requirement configured in advance, wherein the position quality inspection data is used for identifying the corresponding relation of the position of each second text feature corresponding to the labeling data in a text feature sequence and the position of each audio feature corresponding to the audio data in an audio feature sequence; the second text feature is obtained by encoding the annotation data through an encoder in the voice synthesis model;
And determining whether characters contained in text data in the labeling data are consistent with characters corresponding to the audio data or not based on probability vectors corresponding to each audio feature contained in an audio feature sequence corresponding to the audio data and contained in the quality inspection data, so as to determine whether the quality inspection data meet second quality inspection requirements configured in advance or not, wherein probability vectors corresponding to any audio feature contain probability values corresponding to each second text feature corresponding to the labeling data.
2. The method according to claim 1, wherein the determining, based on the position quality inspection data included in the quality inspection data, whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data to determine whether the quality inspection data meets a first quality inspection requirement configured in advance includes:
If the first text position and the second text position in the position quality inspection data are consistent, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meet the first quality inspection requirement; and/or
If the first audio position and the second audio position in the position quality inspection data are consistent, determining that the number of characters contained in text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meet the first quality inspection requirement;
The first text position is the position, in the text feature sequence, of the last second text feature that has a corresponding relation with an audio feature in the audio feature sequence; the second text position is the position of the last second text feature in the text feature sequence.
3. The method according to claim 1, wherein the determining, based on the position quality inspection data included in the quality inspection data, whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data to determine whether the quality inspection data meets a first quality inspection requirement configured in advance includes:
if the first text position and the second text position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement; or
if the first audio position and the second audio position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement;
wherein the first audio position is the position, in the audio feature sequence, of the last audio feature that has a correspondence with a second text feature in the text feature sequence; and the second audio position is the position of the last audio feature in the audio feature sequence.
4. The method according to claim 1, wherein the determining whether characters included in text data in the annotation data are consistent with characters corresponding to the audio data based on probability vectors respectively corresponding to each audio feature included in an audio feature sequence corresponding to the audio data included in the quality inspection data to determine whether the quality inspection data meets a second quality inspection requirement configured in advance includes:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature;
if the maximum probability value corresponding to any audio feature is smaller than a preset probability threshold, determining that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data, and determining that the quality inspection data does not meet the second quality inspection requirement; or
if the maximum probability value corresponding to each audio feature is not smaller than the preset probability threshold, determining that the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data, and determining that the quality inspection data meets the second quality inspection requirement.
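The probability check recited in claim 4 can be sketched as follows. This is an illustrative reading only: the row-per-audio-feature matrix layout, function name, and threshold value are assumptions not fixed by the claim.

```python
import numpy as np

def characters_consistent(attention, prob_threshold=0.5):
    """Probability-based check sketched from claim 4.

    Each row of `attention` is assumed to be the probability vector of one
    audio feature over the second text features of the annotation data.
    """
    # Maximum probability value in each audio feature's probability vector.
    max_per_frame = attention.max(axis=1)
    # Any maximum below the threshold -> characters judged inconsistent,
    # i.e. the second quality inspection requirement is not met.
    return bool((max_per_frame >= prob_threshold).all())
```

Intuitively, a frame whose probability mass is spread thinly over all characters has no confident character match, which the claim treats as evidence of a labeling error.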
5. The method of claim 1, wherein after the determining that the quality inspection data corresponding to the labeling data does not meet the pre-configured quality inspection requirement, the method further comprises:
outputting prompt information indicating a labeling error in the annotation data.
6. The method of claim 5, wherein the outputting the prompt information indicating a labeling error in the annotation data comprises:
if the quality inspection data does not meet the first quality inspection requirement, outputting prompt information that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data; and/or
if the quality inspection data does not meet the second quality inspection requirement, outputting prompt information that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data.
7. The method of claim 6, wherein the method further comprises: determining a target audio frame in the audio data;
The prompt information also comprises the position of the target audio frame in the audio data.
8. The method of claim 7, wherein the determining the target audio frame in the audio data comprises:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature;
determining the audio frame corresponding to an audio feature whose maximum probability value is smaller than the preset probability threshold as the target audio frame.
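The target-frame location recited in claim 8 can be sketched as follows, reusing the same assumed probability-matrix representation as above; the function name and threshold are hypothetical.

```python
import numpy as np

def target_audio_frames(attention, prob_threshold=0.5):
    """Target-frame location sketched from claim 8.

    Returns the indices of audio features whose maximum probability value
    falls below the threshold, i.e. the frames that would be flagged in
    the prompt information as likely labeling-error positions.
    """
    return np.where(attention.max(axis=1) < prob_threshold)[0].tolist()
```

Reporting these indices lets an annotator jump straight to the suspect audio frames rather than re-checking the whole clip.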
9. A data processing apparatus, the apparatus comprising:
an acquisition unit, used for acquiring any labeling data to be quality-inspected and the audio data corresponding to the labeling data, wherein the labeling data comprises text data corresponding to the audio data and a first text feature of the text data;
The determining unit is used for determining quality inspection data corresponding to the annotation data through a decoder of a voice synthesis model based on the annotation data and the audio data, wherein the quality inspection data represents the corresponding relation between each character in the annotation data and each audio frame in the audio data;
a judging unit, used for determining that the labeling data is correct if the quality inspection data corresponding to the labeling data meets the pre-configured quality inspection requirement, or determining that the labeling data contains a labeling error if the quality inspection data corresponding to the labeling data does not meet the pre-configured quality inspection requirement;
The judging unit is specifically configured to determine whether quality inspection data corresponding to the labeling data meets a pre-configured quality inspection requirement according to at least one of the following manners:
Determining whether the number of characters contained in text data in the labeling data is equal to the number of characters corresponding to the audio data based on position quality inspection data contained in the quality inspection data so as to determine whether the quality inspection data meets a first quality inspection requirement configured in advance, wherein the position quality inspection data is used for identifying the corresponding relation of the position of each second text feature corresponding to the labeling data in a text feature sequence and the position of each audio feature corresponding to the audio data in an audio feature sequence; the second text feature is obtained by encoding the annotation data through an encoder in the voice synthesis model;
And determining whether characters contained in text data in the labeling data are consistent with characters corresponding to the audio data or not based on probability vectors corresponding to each audio feature contained in an audio feature sequence corresponding to the audio data and contained in the quality inspection data, so as to determine whether the quality inspection data meet second quality inspection requirements configured in advance or not, wherein probability vectors corresponding to any audio feature contain probability values corresponding to each second text feature corresponding to the labeling data.
10. The apparatus according to claim 9, wherein the judging unit is specifically configured to:
if the first text position and the second text position in the position quality inspection data are consistent, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement; and/or if the first audio position and the second audio position in the position quality inspection data are consistent, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement;
wherein the first text position is the position, in the text feature sequence, of the last second text feature that has a correspondence with an audio feature in the audio feature sequence; and the second text position is the position of the last second text feature in the text feature sequence.
11. The apparatus according to claim 9, wherein the judging unit is specifically configured to:
if the first text position and the second text position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement; or if the first audio position and the second audio position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement;
wherein the first audio position is the position, in the audio feature sequence, of the last audio feature that has a correspondence with a second text feature in the text feature sequence; and the second audio position is the position of the last audio feature in the audio feature sequence.
12. The apparatus according to claim 9, wherein the judging unit is specifically configured to:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature;
if the maximum probability value corresponding to any audio feature is smaller than a preset probability threshold, determining that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data; or if the maximum probability value corresponding to each audio feature is not smaller than the preset probability threshold, determining that the characters contained in the text data in the annotation data are consistent with the characters corresponding to the audio data.
13. The apparatus of claim 9, wherein the apparatus further comprises:
an output unit, used for outputting prompt information of a labeling error in the labeling data after the judging unit determines that the quality inspection data corresponding to the labeling data does not meet the pre-configured quality inspection requirement.
14. The device according to claim 13, wherein the output unit is specifically configured to:
if the quality inspection data does not meet the first quality inspection requirement, outputting prompt information that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data; and/or if the quality inspection data does not meet the second quality inspection requirement, outputting prompt information that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data.
15. The apparatus of claim 14, wherein the determining unit is further configured to determine a target audio frame in the audio data, so that the prompt information output by the output unit further includes the position of the target audio frame in the audio data.
16. The apparatus according to claim 15, wherein the determining unit is specifically configured to:
respectively obtaining the maximum probability value in the probability vector corresponding to each audio feature;
determining the audio frame corresponding to an audio feature whose maximum probability value is smaller than the preset probability threshold as the target audio frame.
17. An electronic device, characterized in that it comprises a processor for implementing the steps of the data processing method according to any of claims 1-8 when executing a computer program stored in a memory.
18. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the data processing method according to any of claims 1-8.
CN202011496538.8A 2020-12-17 2020-12-17 Data processing method, device, equipment and medium Active CN112669814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011496538.8A CN112669814B (en) 2020-12-17 2020-12-17 Data processing method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN112669814A CN112669814A (en) 2021-04-16
CN112669814B true CN112669814B (en) 2024-06-14

Family

ID=75404846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011496538.8A Active CN112669814B (en) 2020-12-17 2020-12-17 Data processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112669814B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care
CN115905849A (en) * 2021-09-30 2023-04-04 华为技术有限公司 Data processing method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109166569A (en) * 2018-07-25 2019-01-08 北京海天瑞声科技股份有限公司 The detection method and device that phoneme accidentally marks

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7346506B2 (en) * 2003-10-08 2008-03-18 Agfa Inc. System and method for synchronized text display and audio playback
US20080221882A1 (en) * 2007-03-06 2008-09-11 Bundock Donald S System for excluding unwanted data from a voice recording
CN109036464B (en) * 2018-09-17 2022-02-22 腾讯科技(深圳)有限公司 Pronunciation error detection method, apparatus, device and storage medium
CN109599093B (en) * 2018-10-26 2021-11-26 北京中关村科金技术有限公司 Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN109729383B (en) * 2019-01-04 2021-11-02 深圳壹账通智能科技有限公司 Double-recording video quality detection method and device, computer equipment and storage medium
CN110070854A (en) * 2019-04-17 2019-07-30 北京爱数智慧科技有限公司 Voice annotation quality determination method, device, equipment and computer-readable medium
CN110532522A (en) * 2019-08-22 2019-12-03 深圳追一科技有限公司 Error-detecting method, device, computer equipment and the storage medium of audio mark
CN112069805A (en) * 2019-12-20 2020-12-11 北京来也网络科技有限公司 Text labeling method, device, equipment and storage medium combining RPA and AI
CN111414751A (en) * 2020-03-20 2020-07-14 深圳前海微众银行股份有限公司 Quality inspection optimization method, device, equipment and storage medium
CN111754978B (en) * 2020-06-15 2023-04-18 北京百度网讯科技有限公司 Prosodic hierarchy labeling method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN112669814A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN112669814B (en) Data processing method, device, equipment and medium
CN109473093B (en) Speech recognition method, device, computer equipment and storage medium
CN109635305B (en) Voice translation method and device, equipment and storage medium
CN111611797B (en) Method, device and equipment for marking prediction data based on Albert model
CN110969600A (en) Product defect detection method and device, electronic equipment and storage medium
CN112905459B (en) Service interface testing method and device, electronic equipment and storage medium
CN110610698B (en) Voice labeling method and device
WO2019056720A1 (en) Automated test case management method and apparatus, device, and storage medium
CN112700763B (en) Voice annotation quality evaluation method, device, equipment and storage medium
CN116594757B (en) Method and device for executing complex tasks by using large language model
CN110647523A (en) Data quality analysis method and device, storage medium and electronic equipment
CN110990222B (en) Cross-platform graphical operation monitoring method and device based on mainframe
CN116244202A (en) Automatic performance test method and device
CN113645357B (en) Call quality inspection method, device, computer equipment and computer readable storage medium
CN113094095B (en) Agile development progress determining method and device
CN115982272A (en) Data labeling method and device for urban big data management and computer storage medium
CN115908977A (en) Image data labeling method and device, electronic equipment and storage medium
CN115662402A (en) Audio labeling method, device and system
CN115375965A (en) Preprocessing method for target scene recognition and target scene recognition method
CN112505337B (en) Data processing method for assisting analysis of sample
CN114186090A (en) Intelligent quality inspection method and system for image annotation data
CN114141236A (en) Language model updating method and device, electronic equipment and storage medium
CN112527631A (en) bug positioning method, system, electronic equipment and storage medium
CN112487669A (en) System and method for verifying control logic design of nuclear power plant
CN112631930B (en) Dynamic system testing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant