CN112669814A - Data processing method, device, equipment and medium

Info

Publication number
CN112669814A
CN112669814A
Authority
CN
China
Prior art keywords
data
audio
quality inspection
text
characters
Prior art date
Legal status
Granted
Application number
CN202011496538.8A
Other languages
Chinese (zh)
Other versions
CN112669814B (en)
Inventor
李旭
刘欢
Current Assignee
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN202011496538.8A
Publication of CN112669814A
Application granted; publication of CN112669814B
Legal status: Active

Classifications

    • Y — General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02 — Technologies or applications for mitigation or adaptation against climate change
    • Y02P — Climate change mitigation technologies in the production or processing of goods
    • Y02P 90/00 — Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 — Computing systems specially adapted for manufacturing

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method, apparatus, device and medium for solving the problem that the quality of labeled data cannot be inspected by an electronic device. During quality inspection of labeled data, quality inspection data corresponding to the labeled data are obtained through a speech synthesis model, based on the labeled data to be inspected and its corresponding audio data. The quality inspection data represent the correspondence between each character in the labeled data and each audio frame in the corresponding audio data, and whether the labeled data are correct can be determined from the quality inspection data. Quality inspection of the labeled data is thus performed without manual work, which reduces the workload of quality inspectors, reduces the influence of inspectors' working capacity on inspection efficiency and accuracy, and makes it easy to trace and locate incorrectly labeled data.

Description

Data processing method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus, device, and medium.
Background
In the prior art, text information is generally converted into speech information by a speech synthesis model. To obtain such a model, a large number of speech samples and the labeled data corresponding to each sample are usually required to train the original speech synthesis model. Under the same model structure, a high-precision speech synthesis model can be trained from a large number of high-quality training speech samples and their corresponding labeled data. When testing the trained speech synthesis model, the number of test speech samples, the balance between the training set and the test set, and the quality of the labeled data corresponding to the test samples all have a great influence on the test results. The quality of the labeled data corresponding to the speech samples (both test and training samples) is therefore one of the important factors affecting the accuracy of the speech synthesis model, and how to improve it has received increasing attention in recent years.
At present, the labeling and quality inspection of speech samples are mainly done manually. Although labeling tools such as the speech annotation tool Praat can assist manual labeling, inspecting the labeled data corresponding to the speech samples still has to be performed by hand, and manual quality inspection is time-consuming and costly in labor and money. When a large amount of labeled data needs to be inspected, the workload of quality inspectors is very heavy and errors in the final inspection results are inevitable, which affects the quality inspection results of the labeled data and makes the drawbacks of manual quality inspection particularly obvious. A method for automatically performing quality inspection of labeled data is therefore urgently needed.
Disclosure of Invention
Embodiments of the invention provide a data processing method, apparatus, device and medium for solving the problem that the quality of labeled data cannot be inspected by an electronic device.
The embodiment of the invention provides a data processing method, which comprises the following steps:
acquiring any marked data to be subjected to quality inspection and audio data corresponding to the marked data, wherein the marked data comprises text data corresponding to the audio data and first text characteristics of the text data;
determining quality inspection data corresponding to the labeling data through a decoder of a speech synthesis model based on the labeling data and the audio data, wherein the quality inspection data represent the corresponding relation between each character in the labeling data and each audio frame in the audio data;
and judging whether the labeled data is correct or not according to the quality inspection data corresponding to the labeled data.
An embodiment of the present invention provides a data processing apparatus, where the apparatus includes:
an acquisition unit, configured to acquire any labeled data to be inspected and the audio data corresponding to the labeled data, where the labeled data includes text data corresponding to the audio data and a first text feature of the text data;
the determining unit is used for determining quality inspection data corresponding to the labeling data through a decoder of a speech synthesis model based on the labeling data and the audio data, and the quality inspection data represent the corresponding relation between each character in the labeling data and each audio frame in the audio data;
and the judging unit is used for judging whether the marked data is correct or not according to the quality inspection data corresponding to the marked data.
An embodiment of the present invention provides an electronic device, which includes a processor, and the processor is configured to implement the steps of any of the data processing methods described above when executing a computer program stored in a memory.
An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of any of the data processing methods described above.
In the embodiment of the invention, quality inspection data corresponding to labeled data can be obtained through a speech synthesis model, based on the labeled data to be inspected and its corresponding audio data. The quality inspection data represent the correspondence between each character in the labeled data and each audio frame in the corresponding audio data, and whether the labeled data are correct can be determined from the quality inspection data. Quality inspection of the labeled data is thus performed without manual work, which reduces the workload of quality inspectors, reduces the influence of inspectors' working capacity on inspection efficiency and accuracy, and makes it easy to trace and locate incorrectly labeled data.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a data processing process according to an embodiment of the present invention;
FIG. 2 is an alignment chart of correct annotation data according to an embodiment of the present invention;
FIG. 3 is an alignment chart of incorrect annotation data according to an embodiment of the present invention;
FIG. 4 is an alignment chart of incorrect annotation data according to an embodiment of the present invention;
FIG. 5 is an alignment chart of incorrect annotation data according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating a specific data processing procedure according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art without inventive effort on the basis of the embodiments of the present invention fall within the protection scope of the present invention.
In order to improve the accuracy and efficiency of quality inspection of labeled data to be subjected to quality inspection, the embodiment of the invention provides a data processing method, a device, equipment and a medium.
Example 1: fig. 1 is a schematic diagram of a data processing process provided in an embodiment of the present invention, where the process includes:
s101: and acquiring any marked data to be subjected to quality inspection and audio data corresponding to the marked data, wherein the marked data comprises text data corresponding to the audio data and first text characteristics of the text data.
The data processing method provided by the embodiment of the invention is applied to the electronic equipment, and the electronic equipment can be intelligent equipment such as a robot and the like and can also be a server.
In the embodiment of the present invention, the labeled data corresponding to the audio data includes text data corresponding to the audio data and text features of that text data (for convenience of description, denoted as the first text feature). The first text feature includes the initial-final sequence corresponding to the text data, which contains the initials and finals corresponding to each character in the text data together with the corresponding tones. For example, for the text data "今天" ("today"), the character "今" corresponds to the initial and final "jin" with the first tone, denoted "1", and the character "天" corresponds to the initial and final "tian" with the first tone, also denoted "1"; the first text feature of the text data, i.e. its initial-final sequence, is therefore "jin1 tian1". The labeled data corresponding to the audio data may be obtained by manual labeling or determined by a speech labeling tool; in a specific implementation this may be set flexibly according to actual requirements and is not specifically limited here.
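As a minimal sketch of how such an initial-final sequence with tones might be produced programmatically, the following example uses the open-source pypinyin package; the package choice and the helper name are illustrative assumptions and are not part of the disclosure.

```python
# Sketch: derive a first text feature (initial-final sequence with tone digits)
# from Chinese text. Assumes the pypinyin package; the helper name is illustrative.
from pypinyin import lazy_pinyin, Style


def text_to_pinyin_feature(text: str) -> str:
    """Return a space-separated initial-final sequence with tones, e.g. 'jin1 tian1'."""
    # Style.TONE3 appends the tone number to each syllable ("jin1", "tian1").
    syllables = lazy_pinyin(text, style=Style.TONE3)
    return " ".join(syllables)


if __name__ == "__main__":
    print(text_to_pinyin_feature("今天"))  # expected: "jin1 tian1"
```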
Specifically, any audio data and annotation data corresponding to the audio data are acquired, and the currently acquired annotation data is determined as the annotation data to be subjected to quality inspection. In order to facilitate subsequent processing of the labeled data to be subjected to quality inspection, the labeled data to be subjected to quality inspection can be processed to obtain a digital sequence corresponding to the labeled data. Specifically, a target number corresponding to each character included in the annotation data may be determined according to a preset correspondence between the character and the number, and a number sequence corresponding to the annotation data may be determined sequentially according to each target number, or a number sequence corresponding to the annotation data may be determined directly through a model, for example, a Bert model. The digital sequence corresponding to the label data may be obtained in advance, or the digital sequence corresponding to the label data may be obtained whenever a certain label data needs to be subjected to quality inspection.
When the number sequence corresponding to the labeled data is determined from the preset correspondence between characters and numbers, different characters contained in the labeled data correspond to different numbers.
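A minimal sketch of such a preset character-to-number mapping is shown below; the vocabulary and function names are illustrative assumptions, not the mapping actually used in the disclosure.

```python
# Sketch: map each symbol of the labeled data to a unique number, producing the
# number sequence fed to the speech synthesis model. The vocabulary below is
# illustrative; a real system would build it from its own symbol inventory.

def build_vocab(symbols):
    # Different symbols get different numbers; 0 is reserved for padding here.
    return {sym: idx for idx, sym in enumerate(symbols, start=1)}


def to_number_sequence(labeled_symbols, vocab):
    return [vocab[sym] for sym in labeled_symbols]


if __name__ == "__main__":
    vocab = build_vocab(["jin1", "tian1", "qi4", "hao3"])  # hypothetical symbol set
    print(to_number_sequence(["jin1", "tian1"], vocab))    # e.g. [1, 2]
```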
S102: and determining quality inspection data corresponding to the labeling data through a decoder of the speech synthesis model based on the labeling data and the audio data, wherein the quality inspection data represent the corresponding relation between each character in the labeling data and each audio frame in the audio data.
S103: and judging whether the marked data is correct or not according to the quality inspection data corresponding to the marked data.
After acquiring the digital sequence corresponding to any piece of annotation data to be tested and the audio data corresponding to the annotation data based on the above embodiment, corresponding processing is performed based on the digital sequence and the audio data, so as to determine whether the annotation data is correct.
In practical use, the speech synthesis model can not only generate the acoustic features of synthesized audio data from the number sequence corresponding to the input labeled data, but can also, based on that number sequence and the audio data corresponding to the labeled data, determine and output through its decoder the correspondence between each character in the labeled data and each audio frame in the audio data, i.e. which audio frames of the audio data each character of the labeled data corresponds to. The characters contained in the labeled data may be the characters of the text data, or the initials and finals in the first text feature. This correspondence reflects, to a certain degree, whether the labeled data is aligned with the audio data. Therefore, in the embodiment of the present invention, in order to determine whether the labeled data corresponding to each piece of audio data is correct, the correspondence between each character in the labeled data and each audio frame in the audio data determined by the speech synthesis model is taken as the quality inspection data corresponding to the labeled data. In a specific implementation, the quality inspection data corresponding to a piece of labeled data is obtained through the speech synthesis model from the number sequence corresponding to the labeled data and the audio data corresponding to the labeled data.
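In attention-based encoder-decoder models, this character-to-frame correspondence is commonly read off the decoder's attention weights. The following minimal sketch (Python and numpy are assumptions; the disclosure does not specify an implementation) shows how a per-frame character index could be derived from such an attention matrix exported from the model:

```python
import numpy as np

# Assume attention_weights has shape (num_audio_frames, num_text_features), where
# attention_weights[m, n] is the decoder's attention on text feature n when
# generating audio frame m. A random matrix stands in for real model output here.
rng = np.random.default_rng(0)
attention_weights = rng.random((149, 79))
attention_weights /= attention_weights.sum(axis=1, keepdims=True)  # rows are probability vectors

# For each audio frame, the most strongly attended text feature is taken as the
# character (second text feature) that the frame corresponds to.
frame_to_char = attention_weights.argmax(axis=1)
print(frame_to_char[:10])
```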
Subsequently, based on the scheme provided by the embodiment of the invention, the quality inspection data corresponding to the obtained labeling data is correspondingly processed, so that whether the labeling data is correct or not is determined. For example, if the correspondence between each character in the determined annotation data and each audio frame in the audio data is not accurate through the speech synthesis model, it is determined that the annotation data is erroneous.
In the embodiment of the invention, quality inspection data corresponding to labeled data can be obtained through a speech synthesis model, based on the labeled data to be inspected and its corresponding audio data. The quality inspection data represent the correspondence between each character in the labeled data and each audio frame in the corresponding audio data, and whether the labeled data are correct can be determined from the quality inspection data. Quality inspection of the labeled data is thus performed without manual work, which reduces the workload of quality inspectors, reduces the influence of inspectors' working capacity on inspection efficiency and accuracy, and makes it easy to trace and locate incorrectly labeled data.
Example 2: in order to improve the accuracy and efficiency of quality inspection of the labeled data to be inspected, on the basis of the above embodiment, in the embodiment of the present invention, judging whether the labeled data is correct according to the quality inspection data corresponding to the labeled data includes:
if the quality inspection data corresponding to the marked data meet the pre-configured quality inspection requirement, the marked data are determined to be correct; or
And if the quality inspection data corresponding to the labeling data do not meet the pre-configured quality inspection requirements, determining that the labeling of the labeling data is wrong.
Generally, each character included in the labeled data corresponds to at least one audio frame in the audio data corresponding to the labeled data, and the last character included in the labeled data must correspond to the last audio frame in the audio data. Based on this, in order to determine whether the annotation data is correct, in the embodiment of the present invention, a quality inspection requirement is configured in advance. After the quality inspection data corresponding to the annotation data is acquired based on the above embodiment, whether the quality inspection data meets the pre-configured quality inspection requirement is judged, so as to determine whether the annotation data is correct.
In a possible implementation manner, if it is determined that the quality inspection data corresponding to a certain currently acquired tagged data meets the pre-configured quality inspection requirement, which indicates that the tagged data is most likely to be correct, the tagged data is determined to be correct.
In another possible implementation manner, if it is determined that the quality inspection data corresponding to a certain currently acquired annotation data does not meet the pre-configured quality inspection requirement, which indicates that the annotation data is most likely to be wrong, it is determined that the annotation data is wrong, and a worker needs to perform subsequent modification on the annotation data.
Further, in order to improve the accuracy of quality inspection of the labeled data to be subjected to quality inspection, determining whether the quality inspection data corresponding to the labeled data meets the pre-configured quality inspection requirement according to at least one of the following modes, including:
determining whether the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data or not based on position quality inspection data contained in the quality inspection data so as to determine whether the quality inspection data meets a first pre-configured quality inspection requirement or not, wherein the position quality inspection data is used for identifying the corresponding relation between the position of each second text feature corresponding to the labeling data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence; the second text characteristic is obtained by encoding the labeled data through an encoder in the speech synthesis model;
and determining whether characters contained in the text data in the label data are consistent with characters corresponding to the audio data based on probability vectors corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data in the quality inspection data, so as to determine whether the quality inspection data meets a second pre-configured quality inspection requirement, wherein the probability vector corresponding to any audio feature contains probability values of each second text feature corresponding to the label data respectively corresponding to the audio feature.
In an embodiment of the present invention, the speech synthesis model is a model with an encoder-decoder structure and an attention mechanism, such as the Tacotron model. The speech synthesis model can convert labeled data into the acoustic features of synthesized audio data and, during operation, determines the correspondence between each character in the labeled data and each audio frame in the corresponding audio data. For convenience, this correspondence can be visualized as an alignment chart. FIG. 2 is an alignment chart of correct annotation data according to an embodiment of the present invention. As shown in FIG. 2, any value m on the abscissa represents the mth audio feature in the audio feature sequence corresponding to the audio data; the audio feature sequence may be, for example, a mel-frequency cepstrum sequence, so an abscissa of 10 represents the 10th audio feature in that sequence. Any value n on the ordinate represents the nth text feature in the text feature sequence corresponding to the labeled data (for convenience of description, denoted the second text feature); the second text features are obtained by encoding the labeled data with the encoder of the speech synthesis model. In the alignment chart, the correspondence between the second text features of correct labeled data and the audio features of the audio data is not necessarily a straight diagonal; it may also be a curve running from the lower left corner of FIG. 2 to its upper right corner. For each pixel on the curve, the brighter the pixel, the more strongly the corresponding second text feature corresponds to the audio feature. Therefore, if a piece of labeled data is correct, the curve representing the correspondence between its second text features and the audio features of the audio data should run from the lower left corner of the alignment chart to the upper right corner, as in FIG. 2; if the labeled data is wrong, the trend of this curve will differ from that shown in FIG. 2. Whether labeled data is labeled incorrectly can therefore be determined from the alignment chart between its second text features and the audio features of the corresponding audio data.
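A minimal sketch of how an alignment chart like FIG. 2 could be rendered from such an attention matrix is shown below; matplotlib and the synthetic, roughly diagonal matrix are illustrative assumptions standing in for real model output.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic attention matrix: rows = audio features (abscissa of the chart),
# columns = second text features (ordinate of the chart).
num_frames, num_text = 149, 79
alignment = np.zeros((num_frames, num_text))
for m in range(num_frames):
    n = int(m * num_text / num_frames)
    alignment[m, max(n - 1, 0):min(n + 2, num_text)] = 1.0  # near-diagonal band

plt.imshow(alignment.T, origin="lower", aspect="auto", cmap="gray")
plt.xlabel("audio feature index (m)")
plt.ylabel("second text feature index (n)")
plt.title("Alignment chart (brighter = stronger correspondence)")
plt.savefig("alignment.png")
```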
In practical application scenarios, the main problems in the labeled data corresponding to audio data are that the text data contains more characters than the audio data, that the text data contains fewer characters than the audio data, or that the characters contained in the text data are inconsistent with the characters corresponding to the audio data. Here, the characters corresponding to the audio data are the characters of the text that matches the actual content of the audio data. To improve the accuracy of quality inspection of the labeled data, embodiments of the invention therefore preconfigure a first quality inspection requirement and a second quality inspection requirement. The first quality inspection requirement requires that the number of characters contained in the text data of the labeled data equals the number of characters corresponding to the audio data; the second quality inspection requirement requires that the characters contained in the text data of the labeled data are consistent with the characters corresponding to the audio data.
In the embodiment of the invention, the type of the annotation data error is predefined. For example, a problem that the number of characters contained in text data in the labeled data is larger than the number of characters corresponding to audio data corresponding to the labeled data is defined as a first error type; for another example, the problem that the number of characters contained in the text data in the labeled data is less than the number of characters corresponding to the audio data corresponding to the labeled data is defined as a second error type; for another example, a problem that characters included in text data in the markup data are not consistent with characters corresponding to audio data corresponding to the markup data is defined as a third error type.
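For illustration, the three predefined error types could be represented as a simple enumeration; the Python representation and the names below are assumptions, not part of the disclosure.

```python
from enum import Enum


class LabelErrorType(Enum):
    # First error type: the text data contains more characters than the audio covers.
    TEXT_LONGER_THAN_AUDIO = 1
    # Second error type: the text data contains fewer characters than the audio covers.
    TEXT_SHORTER_THAN_AUDIO = 2
    # Third error type: the text characters do not match the audio content.
    TEXT_AUDIO_MISMATCH = 3
```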
In the specific implementation, on the basis of the above embodiment, after quality inspection data corresponding to the annotation data is obtained through a speech synthesis model based on the annotation data and the audio data, it is determined whether the current quality inspection data meets a pre-configured quality inspection requirement, which mainly includes the following conditions:
in the first case, in the embodiment of the present invention, through a speech synthesis model, a corresponding relationship between a position of each second text feature corresponding to the annotation data in the text feature sequence and a position of each audio feature corresponding to the audio data in the audio feature sequence may be determined, for example, a corresponding relationship exists between a 37 th second text feature in the text feature sequence corresponding to the annotation data and a 127 th audio feature in the audio feature sequence corresponding to the audio data. Therefore, in the embodiment of the present invention, the pre-configured quality inspection requirement is the first quality inspection requirement, and the quality inspection data corresponding to the annotation data is obtained by the speech synthesis model based on the annotation data and the audio data, and includes the position quality inspection data. The position quality inspection data is used for identifying the corresponding relation between the position of each second text feature corresponding to the marking data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence. Based on the position quality inspection data, whether the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data can be determined, and whether the acquired quality inspection data meets a first pre-configured quality inspection requirement is further determined.
Specifically, if it is determined that the number of characters included in the text data in the annotation data is not equal to the number of characters corresponding to the audio data based on the position quality inspection data, it is determined that the currently acquired quality inspection data does not meet a first pre-configured quality inspection requirement, that is, the currently acquired quality inspection data does not meet a quality inspection requirement that the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data.
In one possible implementation, determining whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data based on the position quality inspection data included in the quality inspection data to determine whether the quality inspection data meets a first pre-configured quality inspection requirement includes:
if the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets a first pre-configured quality inspection requirement; and/or
If the first audio position in the position quality inspection data is consistent with the second audio position, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets a first pre-configured quality inspection requirement;
the first text position is the position of the last second text feature in the text feature sequence, which has a corresponding relationship with the audio feature in the audio feature sequence, in the text feature sequence, and the second text position is the position of the last second text feature in the text feature sequence.
In practical application scenarios, the text data in the labeled data may contain more characters than the audio data covers. In that case, when the correspondence between the positions of the second text features in the text feature sequence and the positions of the audio features in the audio feature sequence is determined by the speech synthesis model, the position of the last second text feature that corresponds to an audio feature of the audio data (for convenience of description, the first text position) does not coincide with the position of the last second text feature in the text feature sequence (for convenience of description, the second text position); that is, the last audio frame of the audio data cannot correspond to the last character of the labeled data. Based on this, in the embodiment of the present invention, whether the number of characters contained in the text data equals the number of characters corresponding to the audio data is determined from the first text position and the second text position in the position quality inspection data.
Specifically, when the position quality inspection data includes a first text position and a second text position: if the first text position is inconsistent with the second text position, the text data in the labeled data may contain more characters than the audio data, so the number of characters contained in the text data is determined not to equal the number of characters corresponding to the audio data, the quality inspection data is determined not to meet the preconfigured first quality inspection requirement, and the labeled data is determined to be incorrect. If the first text position is consistent with the second text position, there is no such problem, the number of characters contained in the text data is determined to equal the number of characters corresponding to the audio data, the quality inspection data is determined to meet the preconfigured first quality inspection requirement, and the labeled data is determined to be correct.
For convenience of explanation, the correspondence between the positions of the second text features in the text feature sequence and the positions of the audio features in the audio feature sequence is visualized. FIG. 3 is an alignment chart of incorrect annotation data according to an embodiment of the present invention. As shown in FIG. 3, any value m on the abscissa represents the mth audio feature in the audio feature sequence corresponding to the audio data, and any value n on the ordinate represents the nth second text feature in the text feature sequence corresponding to the labeled data. The label 001001(79,149) indicates that the text feature sequence contains 79 second text features and the audio feature sequence contains 149 audio features, so the second text position is 79. The curve in the figure represents the correspondence between each character in the labeled data and each audio frame in the audio data; the vertical dashed line marks the position of the last audio feature in the audio feature sequence, and the horizontal dashed line marks the position of the last second text feature in the text feature sequence. The point where the curve crosses the vertical dashed line gives the first text position: the second text feature corresponding to the 149th audio feature is the 37th text feature in the text feature sequence, so the first text position is 37. The intersection of the horizontal and vertical dashed lines does not lie on the curve, so the last second text feature in the text feature sequence has no corresponding audio feature, and the curve in FIG. 3 does not run from the lower left corner to the upper right corner. Since the first text position 37 is not equal to the second text position 79, the text data in the labeled data contains more characters than the audio data, and the number of characters contained in the text data is determined not to equal the number of characters corresponding to the audio data.
Similarly, the text data in the labeled data may contain fewer characters than the audio data covers. In that case, when the correspondence between the positions of the second text features and the positions of the audio features is determined by the speech synthesis model, the first text position is consistent with the second text position, but the position of the last audio feature that corresponds to a second text feature in the text feature sequence (for convenience of description, the first audio position) does not coincide with the position of the last audio feature in the audio feature sequence (for convenience of description, the second audio position); that is, the last character of the labeled data cannot correspond to the last audio frame of the audio data. Based on this, in the embodiment of the present invention, whether the number of characters contained in the text data equals the number of characters corresponding to the audio data can also be determined from the first audio position and the second audio position in the position quality inspection data.
Specifically, when the position quality inspection data includes a first audio position and a second audio position: if the first audio position is inconsistent with the second audio position, the text data in the labeled data may contain fewer characters than the audio data, so the number of characters contained in the text data is determined not to equal the number of characters corresponding to the audio data, the quality inspection data is determined not to meet the preconfigured first quality inspection requirement, and the labeled data is determined to be incorrect. If the first audio position is consistent with the second audio position, there is no such problem, the number of characters contained in the text data is determined to equal the number of characters corresponding to the audio data, the quality inspection data is determined to meet the preconfigured first quality inspection requirement, and the labeled data is determined to be correct.
For convenience of explanation, the correspondence is again visualized. FIG. 4 is an alignment chart of incorrect annotation data according to an embodiment of the present invention. As shown in FIG. 4, any value m on the abscissa represents the mth audio feature in the audio feature sequence, and any value n on the ordinate represents the nth second text feature in the text feature sequence. The label 001014(31,181) indicates that the text feature sequence contains 31 second text features and the audio feature sequence contains 181 audio features, so the second audio position is 181. The curve represents the correspondence between each character in the labeled data and each audio frame in the audio data; the vertical dashed line marks the position of the last audio feature in the audio feature sequence, and the horizontal dashed line marks the position of the last second text feature in the text feature sequence. The point where the curve crosses the horizontal dashed line gives the first audio position: the audio feature corresponding to the 31st second text feature is the 125th audio feature in the audio feature sequence, so the first audio position is 125. The intersection of the horizontal and vertical dashed lines does not lie on the curve, so the last second text feature does not correspond to the last audio feature, and the curve in FIG. 4 does not run from the lower left corner to the upper right corner. Since the first audio position 125 is not equal to the second audio position 181, the text data in the labeled data contains fewer characters than the audio data, and the number of characters contained in the text data is determined not to equal the number of characters corresponding to the audio data.
Of course, the acquired position quality inspection data may include both the first and second text positions and the first and second audio positions. The text positions are used, in the manner described above, to determine whether the labeled data currently under inspection contains more characters than the audio data, and the audio positions are used to determine whether it contains fewer characters than the audio data. If either problem is found, i.e. the first text position is inconsistent with the second text position or the first audio position is inconsistent with the second audio position, the quality inspection data corresponding to the labeled data is determined not to meet the preconfigured first quality inspection requirement and the labeled data is determined to be incorrect. If the first text position is consistent with the second text position and the first audio position is consistent with the second audio position, the quality inspection data is determined to meet the preconfigured first quality inspection requirement and the labeled data is determined to be correct.
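The two position checks described above can be read directly off the attention alignment matrix. The sketch below derives the first/second text and audio positions from such a matrix and applies the first quality inspection requirement; thresholding attention weights at 0.5 to decide that a correspondence exists is an illustrative assumption, not a value given in the disclosure.

```python
import numpy as np


def check_first_requirement(alignment: np.ndarray, min_weight: float = 0.5) -> bool:
    """alignment[m, n]: attention of audio feature m on second text feature n.

    Returns True if the labeled data passes the first quality inspection
    requirement (character count matches the audio), False otherwise.
    The min_weight threshold for "has a correspondence" is an assumption.
    """
    num_frames, num_text = alignment.shape
    second_text_position = num_text      # position of the last second text feature
    second_audio_position = num_frames   # position of the last audio feature

    # First text position: last text feature that some audio feature corresponds to.
    corresponded_text = np.where(alignment.max(axis=0) >= min_weight)[0]
    first_text_position = int(corresponded_text[-1]) + 1 if corresponded_text.size else 0

    # First audio position: last audio feature that corresponds to some text feature.
    corresponded_audio = np.where(alignment.max(axis=1) >= min_weight)[0]
    first_audio_position = int(corresponded_audio[-1]) + 1 if corresponded_audio.size else 0

    return (first_text_position == second_text_position
            and first_audio_position == second_audio_position)
```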
As a possible implementation manner, determining whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data based on the position quality inspection data included in the quality inspection data to determine whether the quality inspection data meets a first pre-configured quality inspection requirement includes:
if the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets a first quality inspection requirement; and/or
If the first audio position in the position quality inspection data is consistent with the second audio position, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets a first quality inspection requirement;
the first text position is the position of the last second text feature in the text feature sequence, which has a corresponding relationship with the audio feature in the audio feature sequence, in the text feature sequence, and the second text position is the position of the last second text feature in the text feature sequence.
As another possible implementation manner, determining whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data based on the position quality inspection data included in the quality inspection data, so as to determine whether the quality inspection data meets a first pre-configured quality inspection requirement, further includes:
if the first text position of the position quality inspection data is inconsistent with the second text position, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement; or
If the first audio position and the second audio position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement;
the first audio position is the position of the last audio characteristic in the audio characteristic sequence, which has a corresponding relation with the second text characteristic in the text sequence, in the audio characteristic sequence; the second audio position is a second audio position of a last audio feature in the sequence of audio features.
In practical application scenarios, the characters contained in the text data of the labeled data may also be inconsistent with the characters corresponding to the audio data, that is, the content recorded in the labeled data may be inconsistent with the content actually spoken in the audio data. In that case, when the correspondence between the positions of the second text features and the positions of the audio features is determined by the speech synthesis model, at least one second text feature corresponding to the labeled data may have no corresponding audio feature in the audio feature sequence; equivalently, for some audio feature in the audio feature sequence, no corresponding second text feature can be determined from its probability vector, which indicates that the characters contained in the text data are inconsistent with the characters corresponding to the audio data. In this case, the preconfigured quality inspection requirement is the second quality inspection requirement. The quality inspection data obtained through the speech synthesis model includes the probability vector corresponding to each audio feature in the audio feature sequence, and from the probability values in these vectors it can be determined whether the characters contained in the text data are consistent with the characters corresponding to the audio data, and hence whether the acquired quality inspection data meets the preconfigured second quality inspection requirement.
The probability vector corresponding to any audio feature contains, for each second text feature corresponding to the labeled data, the probability that this audio feature corresponds to that second text feature. The meanings of the text feature sequence and the audio feature sequence are as explained in the above embodiments and are not repeated here.
In a possible implementation manner, determining whether characters included in text data in the annotation data are consistent with characters corresponding to the audio data based on a probability vector corresponding to each audio feature included in an audio feature sequence corresponding to the audio data included in the quality inspection data, so as to determine whether the quality inspection data meets a second pre-configured quality inspection requirement includes:
respectively acquiring the maximum probability value in the probability vector corresponding to each audio feature;
if the maximum probability value corresponding to any audio feature is smaller than a preset probability threshold value, determining that characters contained in the text data in the annotation data are inconsistent with characters corresponding to the audio data, and determining that the quality inspection data do not meet a second quality inspection requirement; or
And if the maximum probability value corresponding to each audio feature is not smaller than the preset probability threshold value, determining that characters contained in the text data in the labeling data are consistent with characters corresponding to the audio data, and determining that the quality inspection data meet a second quality inspection requirement.
In the embodiment of the present invention, a probability threshold is preset in order to determine accurately whether each audio feature has a corresponding second text feature in the text feature sequence. In a specific implementation, the probability vectors corresponding to the audio features of the audio data are obtained through the speech synthesis model as part of the quality inspection data, and for each audio feature the maximum probability value in its probability vector is taken. This maximum probability value is then compared with the preset probability threshold. If the maximum probability value of any audio feature is smaller than the threshold, some audio feature has no corresponding second text feature in the text feature sequence, and the characters contained in the text data are determined to be inconsistent with the characters corresponding to the audio data. If the maximum probability value of every audio feature is not smaller than the threshold, every audio feature has a corresponding second text feature, and the characters contained in the text data are determined to be consistent with the characters corresponding to the audio data.
For example, the preset probability threshold is 0.8, the maximum probability value corresponding to a certain audio feature corresponding to the obtained audio data a is 0.7, the maximum probability value 0.7 corresponding to the audio feature is compared with the preset probability threshold 0.8, it is determined that the maximum probability value 0.7 corresponding to the audio feature is smaller than the preset probability threshold 0.8, it is stated that there is no second text feature corresponding to the audio feature in the text feature sequence corresponding to the label data a, and it is determined that characters included in the text data in the label data a are inconsistent with characters corresponding to the audio data a.
Still taking the above as an example, the maximum probability value corresponding to a certain audio feature corresponding to the obtained audio data B is 0.9, the maximum probability value 0.9 corresponding to the audio feature is compared with the preset probability threshold value 0.8, it is determined that the maximum probability value 0.9 corresponding to the audio feature is greater than the preset probability threshold value 0.8, it is stated that the second text feature corresponding to the audio feature exists in the text feature sequence corresponding to the annotation data B, and then the maximum probability value corresponding to the next audio feature is obtained. When it is determined that the maximum probability value corresponding to each audio feature corresponding to the audio data B is not less than the preset probability threshold based on the above steps, it is determined that the characters included in the text data in the label data B are consistent with the characters corresponding to the audio data B.
Note that once the maximum probability value of some audio feature is found to be smaller than the preset probability threshold, the characters contained in the text data can immediately be determined to be inconsistent with the characters corresponding to the audio data, and the remaining audio features need not be checked. Alternatively, the comparison can first be performed for every audio feature, and whether any maximum probability value falls below the threshold is then determined from all of the comparison results.
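A minimal sketch of this second quality inspection check is shown below; the 0.8 threshold mirrors the example above, the early-stopping behaviour is one of the two alternatives just described, and the function name is an illustrative assumption.

```python
import numpy as np


def check_second_requirement(prob_vectors: np.ndarray, threshold: float = 0.8) -> bool:
    """prob_vectors[m] is the probability vector of audio feature m over the
    second text features. Returns True if every audio feature's maximum
    probability reaches the threshold (characters consistent with the audio)."""
    for probs in prob_vectors:
        if probs.max() < threshold:
            # Early stop: this audio feature has no sufficiently matching text feature.
            return False
    return True
```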
For convenience of explanation, the correspondence between each character in the labeled data and each audio frame in the audio data is visualized. FIG. 5 is an alignment chart of incorrect annotation data according to an embodiment of the present invention. As shown in FIG. 5, any value m on the abscissa represents the mth audio feature in the audio feature sequence, and any value n on the ordinate represents the nth second text feature in the text feature sequence. The curve represents the correspondence between each character in the labeled data and each audio frame in the audio data, but in the region selected by the rectangular box no clear curve can be found. This indicates that the audio features in that region have no corresponding second text features in the text feature sequence, so the characters contained in the text data of the labeled data are determined to be inconsistent with the characters corresponding to the audio data.
In a practical application scenario, one piece of annotation data may exhibit both problems at the same time: the number of characters contained in its text data is not equal to the number of characters corresponding to the audio data, and the characters contained in its text data are inconsistent with the characters corresponding to the audio data. In a specific implementation, after the quality inspection data corresponding to the annotation data is acquired through the speech synthesis model, whether the number of characters contained in the text data equals the number of characters corresponding to the audio data is determined based on the position quality inspection data contained in the quality inspection data, and whether the characters contained in the text data are consistent with the characters corresponding to the audio data is determined based on the probability vector of each audio feature in the audio feature sequence corresponding to the audio data. When the character counts are determined to be unequal and/or the characters are determined to be inconsistent, it is determined that the quality inspection data does not meet the pre-configured quality inspection requirement; when the character counts are equal and the characters are consistent, it is determined that the quality inspection data meets the pre-configured quality inspection requirement.
To ensure the comprehensiveness of quality inspection of the annotation data, after the quality inspection data is determined not to meet the pre-configured first quality inspection requirement, whether it meets the pre-configured second quality inspection requirement may still be determined, and vice versa. For example, if certain quality inspection data does not meet the pre-configured first quality inspection requirement, whether it meets the pre-configured second quality inspection requirement is still determined, so that all problems that may exist in the annotation data are identified; a combined check of this kind is sketched below.
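The following minimal sketch combines the two requirements and always evaluates both, so every problem in the annotation data is reported; the function name and message strings are illustrative assumptions, not part of the patented method.

```python
def inspect_annotation(meets_first_requirement, meets_second_requirement):
    """Combine both pre-configured quality inspection requirements into one verdict."""
    problems = []
    if not meets_first_requirement:
        problems.append("number of characters in the text data is not equal to the "
                        "number of characters corresponding to the audio data")
    if not meets_second_requirement:
        problems.append("characters in the text data are inconsistent with the "
                        "characters corresponding to the audio data")
    return len(problems) == 0, problems

# Example: annotation data that fails both checks reports both problems.
ok, problems = inspect_annotation(meets_first_requirement=False, meets_second_requirement=False)
```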
Example 3: to make it easier for staff to modify annotation data that is labeled incorrectly, on the basis of the foregoing embodiments, in the embodiment of the present invention, after it is determined that the quality inspection data corresponding to the annotation data does not meet the pre-configured quality inspection requirement, the method further includes: outputting prompt information indicating that the annotation data is labeled incorrectly.
In an actual application scenario, after it is determined that the quality inspection data corresponding to certain annotation data does not meet the pre-configured quality inspection requirement, the annotation data is determined to be labeled incorrectly and needs to be modified according to its corresponding audio data so that it becomes correct. Once the modified, correct annotation data is input to the speech synthesis model again, the quality inspection data obtained for it can meet the pre-configured quality inspection requirement.
To help staff modify erroneous annotation data in time, in the embodiment of the present invention, after the quality inspection data is determined not to meet the pre-configured quality inspection requirement, prompt information of the annotation error can be output, prompting the staff that the annotation data is labeled incorrectly so that it can be modified in time.
The output prompt information may be in audio form, for example a voice broadcast of the prompt "the annotation data currently under quality inspection is labeled incorrectly"; or prompt information in text form may be shown on a display interface, for example displaying the prompt "the annotation data currently under quality inspection is labeled incorrectly", flashing a red light, or popping up a dialog box; or the prompt information may be sent to the intelligent terminal of the relevant staff by short message or e-mail. Of course, at least two of these output modes may be combined, for example broadcasting the prompt information in audio form while displaying it in text form on the display interface. This can be set flexibly according to actual requirements and is not limited herein.
Which mode is used to output the prompt information may be preset according to user preference, or selected according to the capability of the electronic device. For example, some electronic devices have no display interface on which prompt information can be shown; for such devices the prompt information may be broadcast in audio form.
In a possible implementation, outputting prompt information indicating that the annotation data is labeled incorrectly includes the following (a sketch covering both cases follows the list):
if the quality inspection data do not meet the first quality inspection requirement, outputting prompt information that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data; and/or
And if the quality inspection data do not meet the second quality inspection requirement, outputting prompt information that the characters contained in the text data in the annotation data are inconsistent with the characters corresponding to the audio data.
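The sketch below maps the failed requirement to the corresponding prompt text and dispatches it over one or more output channels; the channel callables and message strings are illustrative assumptions reflecting the output modes described above.

```python
def output_prompt_information(meets_first, meets_second, channels):
    """channels: callables for voice broadcast, display interface, SMS, e-mail, etc."""
    prompts = []
    if not meets_first:
        prompts.append("The number of characters in the text data of the annotation data is "
                       "not equal to the number of characters corresponding to the audio data.")
    if not meets_second:
        prompts.append("The characters in the text data of the annotation data are "
                       "inconsistent with the characters corresponding to the audio data.")
    for prompt in prompts:
        for send in channels:      # at least two output modes may be combined
            send(prompt)

# Example: show the prompt on a text display only.
output_prompt_information(meets_first=True, meets_second=False, channels=[print])
```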
In the embodiment of the present invention, to further facilitate modification of erroneous annotation data by the staff, the content of the output prompt information can be determined according to the type of quality inspection requirement that the quality inspection data corresponding to the current annotation data fails to meet. Specifically, this covers the following cases:
in a first case, if the quality inspection data corresponding to a certain tagging data does not meet the pre-configured first quality inspection requirement, it is described that the number of characters included in the text data in the tagging data is not equal to the number of characters corresponding to the audio data, so that a worker can directly modify the text data according to the problem of the tagging data, and prompt information that the number of characters included in the text data in the tagging data is not equal to the number of characters corresponding to the audio data is output. For example, the number of characters included in "how the weather of the annotation data" tomorrow "is not equal to the number of characters corresponding to the audio data, and a prompt message" please check "is output.
In a possible implementation, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured first quality inspection requirement and the number of characters contained in the text data is determined to be greater than the number of characters corresponding to the audio data, the error type of the annotation data is determined to be the first error type, and prompt information that the first error type exists in the annotation data is output.
In another possible implementation, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured first quality inspection requirement and the number of characters contained in the text data is determined to be less than the number of characters corresponding to the audio data, the error type of the annotation data is determined to be the second error type, and prompt information that the second error type exists in the annotation data is output.
In a second case, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured second quality inspection requirement, the characters contained in the text data of the annotation data are inconsistent with the characters corresponding to the audio data. So that staff can modify the annotation data directly according to this problem, prompt information that the characters contained in the text data are inconsistent with the characters corresponding to the audio data is output, for example: "The characters contained in the annotation data 'how is the weather tomorrow' are inconsistent with the characters corresponding to the audio data, please check."
In a possible implementation, if the quality inspection data corresponding to certain annotation data does not meet the pre-configured second quality inspection requirement, that is, the characters contained in the text data are inconsistent with the characters corresponding to the audio data, the error type of the annotation data is determined to be the third error type, and prompt information that the third error type exists in the annotation data is output. The three error types are summarized in the sketch below.
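A minimal sketch of the error-type classification follows; the enum and function names are illustrative assumptions, and the character counts and consistency flag are taken to come from the two checks described above.

```python
from enum import Enum

class AnnotationError(Enum):
    TEXT_LONGER_THAN_AUDIO = 1    # first error type: text data has more characters than the audio
    TEXT_SHORTER_THAN_AUDIO = 2   # second error type: text data has fewer characters than the audio
    TEXT_AUDIO_MISMATCH = 3       # third error type: characters inconsistent with the audio

def classify_errors(num_text_chars, num_audio_chars, characters_consistent):
    """Map the two quality inspection outcomes onto the error types named above."""
    errors = []
    if num_text_chars > num_audio_chars:
        errors.append(AnnotationError.TEXT_LONGER_THAN_AUDIO)
    elif num_text_chars < num_audio_chars:
        errors.append(AnnotationError.TEXT_SHORTER_THAN_AUDIO)
    if not characters_consistent:
        errors.append(AnnotationError.TEXT_AUDIO_MISMATCH)
    return errors
```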
In order to reduce the workload required by the staff to modify the annotation data, in an embodiment of the present invention, the method further includes: determining a target audio frame in the audio data;
wherein, the prompt message also includes the position of the target audio frame in the audio data.
In the embodiment of the present invention, after it is determined that the quality inspection data corresponding to the annotation data does not meet the pre-configured second quality inspection requirement, an audio frame that has no corresponding character among the characters contained in the annotation data can be identified and determined to be a target audio frame, and prompt information carrying the characters contained in the text data of the annotation data, the characters corresponding to the audio data, and the position of the target audio frame in the audio data is output. The staff can then quickly locate the target audio segment in the audio data according to the position of the target audio frame carried in the prompt information, and modify the annotation data according to that segment.
Wherein determining a target audio frame in audio data comprises:
respectively acquiring the maximum probability value in the probability vector corresponding to each audio feature; and determining the audio frame corresponding to the audio feature with the maximum probability value smaller than the preset probability threshold value as the target audio frame.
In the practical application process, the speech synthesis model can determine, for each audio feature in the audio feature sequence corresponding to the audio data, a probability value for each second text feature corresponding to the annotation data. The larger the probability value, the more likely the audio feature corresponds to that second text feature; the smaller the probability value, the less likely it does. Generally, for correct annotation data, every audio feature in the audio feature sequence has a corresponding second text feature among the second text features of the annotation data, and that second text feature is normally the one with the maximum probability value in the probability vector of the audio feature. Based on this, in the embodiment of the present invention, the maximum probability value in the probability vector of each audio feature is acquired; for each audio feature, whether its maximum probability value is smaller than the preset probability threshold is determined, and if so, the audio feature is determined to be a target audio feature and the audio frame corresponding to that target audio feature is determined to be a target audio frame.
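A minimal sketch of locating the target audio frames follows; frame_index_of_feature, an assumed mapping from each audio feature to the audio frame it was extracted from, is illustrative and not part of the patented method.

```python
import numpy as np

def find_target_audio_frames(prob_vectors, frame_index_of_feature, threshold=0.8):
    """Return the positions of the target audio frames in the audio data."""
    targets = []
    for m, vector in enumerate(prob_vectors):
        if float(np.max(vector)) < threshold:            # no second text feature corresponds here
            targets.append(frame_index_of_feature[m])    # audio frame of this target audio feature
    return targets
```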
Based on the manner in the above embodiment, after each target audio frame present in the audio data is determined, prompt information carrying the characters corresponding to the audio data, the characters contained in the text data of the annotation data, and the position of each target audio frame in the audio data is output. The staff can then determine, directly from the position information of each target audio frame carried in the prompt information, where and how to modify the annotation data, which reduces their workload and improves the efficiency of correcting erroneous annotation data.
In a practical application scenario, the quality of the annotation data may be very poor, and its corresponding quality inspection data may meet neither the pre-configured first quality inspection requirement nor the pre-configured second quality inspection requirement. So that the staff can modify the annotation data accurately, after it is determined that the quality inspection data meets neither requirement, both prompt information that the number of characters contained in the text data is not equal to the number of characters corresponding to the audio data and prompt information that the characters contained in the text data are inconsistent with the characters corresponding to the audio data can be output.
Example 4: the following describes a data processing method according to an embodiment of the present invention by a specific implementation manner, and as shown in fig. 6, the flow includes:
s601: and acquiring the trained voice synthesis model.
In the embodiment of the present invention, the electronic device for training the speech synthesis model may be the same as or different from the electronic device for performing data processing in the above embodiment. Specifically, the setting may be performed according to actual requirements, and is not limited herein.
In order to train a speech synthesis model, for example a Tacotron model, sample audio data for training the speech synthesis model needs to be collected in advance, the annotation data corresponding to the sample audio data is determined, and that annotation data is taken as sample data with which to train the original speech synthesis model.
In an actual application scenario, the amount of incorrectly labeled annotation data is generally smaller than the amount of correctly labeled annotation data. Therefore, in the embodiment of the present invention, if there is enough annotation data to be quality inspected, that annotation data can be used directly as sample data to train the original speech synthesis model and obtain the trained speech synthesis model, which reduces the time spent collecting sample data; alternatively, a large amount of pre-collected, correctly labeled annotation data can be used as sample data to train the original speech synthesis model and obtain the trained speech synthesis model.
Of course, the pre-collected correctly labeled annotation data and the annotation data to be quality inspected can both be used as sample data: the original speech synthesis model is first trained on the pre-collected correctly labeled annotation data to obtain a basic speech synthesis model, which is then trained further on the annotation data to be quality inspected to obtain the trained speech synthesis model. The specific way of training the speech synthesis model can be set flexibly according to actual requirements and is not specifically limited herein.
Specifically, based on sample data, a speech synthesis model is trained:
acquiring any sample data and sample audio data corresponding to the sample data;
acquiring acoustic characteristic parameters corresponding to sample data through an original speech synthesis model;
and training the original speech synthesis model according to the acoustic characteristic parameters and the sample audio data.
Since there are many sample data for training the speech synthesis model, the above operation is performed on each sample data, and when a preset convergence condition is satisfied, the speech synthesis model training is completed.
The preset convergence condition may be, for example, that the loss value (loss) determined based on the acoustic characteristic parameters and the sample audio data corresponding to each sample data is smaller than a preset loss value threshold, that the loss value keeps decreasing and then flattens out, or that the number of iterations of training the original speech synthesis model reaches a set maximum number of iterations. This can be set flexibly and is not specifically limited herein; a training-loop sketch is given below.
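The following is a minimal training-loop sketch under stated assumptions: model.acoustic_features and model.update are assumed interfaces standing in for "acquire the acoustic characteristic parameters through the original speech synthesis model" and "train the model against the sample audio data"; they are not the patented training code.

```python
def train_speech_synthesis_model(model, samples, max_iters=100000, loss_threshold=0.01):
    """samples: list of (annotation_data, sample_audio_data) pairs."""
    for iteration in range(max_iters):
        total_loss = 0.0
        for annotation_data, sample_audio_data in samples:
            acoustic_params = model.acoustic_features(annotation_data)   # acoustic characteristic parameters
            total_loss += model.update(acoustic_params, sample_audio_data)
        avg_loss = total_loss / len(samples)
        if avg_loss < loss_threshold:      # preset convergence condition satisfied
            break
    return model
```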
Any speech synthesis model meeting the preset convergence condition is determined to be a trained speech synthesis model and is used for subsequent quality inspection of the annotation data to be inspected; for example, a speech synthesis model with a smaller loss value may be used for the subsequent quality inspection.
As a possible implementation manner, when performing speech synthesis model training, sample data may be divided into a training sample and a test sample, and an original speech synthesis model is trained based on the training sample, and then the reliability of the trained speech synthesis model is verified based on the test sample.
When the trained speech synthesis model is tested on the test samples, a test set loss value (val_loss), determined based on the acoustic characteristic parameters and the test audio data corresponding to each test sample, also needs to be calculated during testing. When the currently obtained test set loss value is determined to be smaller than a preset test loss value threshold, or keeps decreasing and then flattens out, the trained speech synthesis model is determined to be reliable, and it can subsequently be used for speech synthesis or for quality inspection of the annotation data to be inspected.
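A minimal sketch of the train/test split and the val_loss check follows; model.loss is an assumed interface returning the loss for one test sample, and the split ratio is illustrative.

```python
import random

def split_samples(samples, test_ratio=0.1, seed=0):
    """Divide sample data into training samples and test samples."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def model_is_reliable(model, test_samples, val_loss_threshold=0.05):
    """Compute the test set loss value (val_loss) and compare it with the preset threshold."""
    val_loss = sum(model.loss(annotation, audio) for annotation, audio in test_samples)
    val_loss /= len(test_samples)
    return val_loss < val_loss_threshold
```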
In the process of performing the speech synthesis model training, an offline mode is generally adopted, and the original speech synthesis model is trained in advance through the electronic device performing the model training and sample data to obtain a trained speech synthesis model.
The trained speech synthesis model obtained in the above embodiment is then stored in the electronic device that performs the subsequent data processing, and that electronic device performs quality inspection on the annotation data to be inspected.
S602: and acquiring any marking data to be subjected to quality inspection and audio data corresponding to the marking data.
S603: and acquiring a digital sequence corresponding to the labeling data.
S604: and determining quality inspection data corresponding to the marking data based on the digital sequence and the audio data through a speech synthesis model.
S605: and judging whether the quality inspection data meet the pre-configured quality inspection requirement, if so, executing S606, and otherwise, executing S607.
The pre-configured quality inspection requirement comprises a first quality inspection requirement and/or a second quality inspection requirement. The specific determination of whether the quality inspection data meets the pre-configured quality inspection requirement has been described in the above embodiments, and repeated details are not described herein.
S606: and determining that the annotation data is correct.
S607: and determining the error of the marked data and outputting prompt information of the error of the marked data.
Specifically, the method for outputting the prompt information is also described in the above embodiments, and repeated parts are not described again.
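Putting the flow of fig. 6 together, the following end-to-end sketch is illustrative only: to_digit_sequence, model.quality_inspection, meets_requirements and prompt_annotation_error are assumed names standing in for steps S603-S607 and are not the patented implementation.

```python
def data_processing_flow(model, annotation_data, audio_data):
    """Sketch of S602-S607 for one piece of annotation data to be quality inspected."""
    digit_sequence = to_digit_sequence(annotation_data)                  # S603
    qc_data = model.quality_inspection(digit_sequence, audio_data)       # S604
    if meets_requirements(qc_data):                                      # S605
        return "annotation data is correct"                              # S606
    prompt_annotation_error(qc_data)                                     # S607
    return "annotation data is labeled incorrectly"
```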
Example 5: an embodiment of the present invention provides a data processing apparatus, as shown in fig. 7, including:
the acquiring unit 71 is configured to acquire any piece of labeled data to be subjected to quality inspection and audio data corresponding to the labeled data, where the labeled data includes text data corresponding to the audio data and a first text feature of the text data;
a determining unit 72, configured to determine, based on the annotation data and the audio data, quality inspection data corresponding to the annotation data through a decoder of the speech synthesis model, where the quality inspection data represents a corresponding relationship between each character in the annotation data and each audio frame in the audio data;
and a judging unit 73, configured to judge whether the labeled data is correct according to the quality inspection data corresponding to the labeled data.
In a possible implementation, the judging unit 73 is specifically configured to:
if the quality inspection data corresponding to the marked data meet the pre-configured quality inspection requirement, the marked data are determined to be correct; or, if the quality inspection data corresponding to the marking data does not meet the pre-configured quality inspection requirement, determining that the marking of the marking data is wrong.
In a possible embodiment, the judging unit 73 determines whether the quality inspection data corresponding to the labeling data meets the pre-configured quality inspection requirement according to at least one of the following manners:
determining whether the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data or not based on position quality inspection data contained in the quality inspection data so as to determine whether the quality inspection data meets a first pre-configured quality inspection requirement or not, wherein the position quality inspection data is used for identifying the corresponding relation between the position of each second text feature corresponding to the labeling data in the text feature sequence and the position of each audio feature corresponding to the audio data in the audio feature sequence; the second text characteristic is obtained by encoding the labeled data through an encoder in the speech synthesis model;
and determining whether characters contained in the text data in the label data are consistent with characters corresponding to the audio data based on probability vectors corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data in the quality inspection data, so as to determine whether the quality inspection data meets a second pre-configured quality inspection requirement, wherein the probability vector corresponding to any audio feature contains probability values of each second text feature corresponding to the label data respectively corresponding to the audio feature.
In a possible implementation, the judging unit 73 is specifically configured to:
if the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets a first quality inspection requirement; and/or if the first audio position in the position quality inspection data is consistent with the second audio position, determining that the number of characters contained in the text data in the annotation data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement;
the first text position is the position, in the text feature sequence, of the last second text feature that has a corresponding relationship with an audio feature in the audio feature sequence, and the second text position is the position of the last second text feature in the text feature sequence.
In a possible implementation, the judging unit 73 is specifically configured to:
if the first text position of the position quality inspection data is inconsistent with the second text position, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement; or if the first audio position and the second audio position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement;
the first audio position is the position of the last audio characteristic in the audio characteristic sequence, which has a corresponding relation with the second text characteristic in the text sequence, in the audio characteristic sequence; the second audio position is a second audio position of a last audio feature in the sequence of audio features.
In a possible implementation, the judging unit 73 is specifically configured to:
respectively acquiring the maximum probability value in the probability vector corresponding to each audio feature;
if the maximum probability value corresponding to any audio feature is smaller than a preset probability threshold value, determining that characters contained in the text data in the annotation data are inconsistent with characters corresponding to the audio data, and determining that the quality inspection data do not meet a second quality inspection requirement; or if the maximum probability value corresponding to each audio feature is not smaller than the preset probability threshold value, determining that characters contained in the text data in the labeling data are consistent with characters corresponding to the audio data, and determining that the quality inspection data meet the second quality inspection requirement.
In one possible embodiment, the apparatus further comprises: an output unit;
and the output unit is used for outputting prompt information of the error of the labeled data after the judging unit 73 determines that the quality inspection data corresponding to the labeled data does not meet the pre-configured quality inspection requirement.
In a possible embodiment, the output unit is specifically configured to:
if the quality inspection data do not meet the first quality inspection requirement, outputting prompt information that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data; and/or if the quality inspection data does not meet the second quality inspection requirement, outputting prompt information that characters contained in the text data in the labeling data are inconsistent with characters corresponding to the audio data.
In a possible implementation manner, the determining unit 72 is further configured to determine the target audio frame in the audio data, so that the prompt information output by the output unit further includes a position of the target audio frame in the audio feature sequence.
In a possible implementation, the determining unit 72 is specifically configured to:
respectively acquiring the maximum probability value in the probability vector corresponding to each audio feature; and determining the audio frame corresponding to the audio feature with the maximum probability value smaller than the preset probability threshold value as the target audio frame.
In the embodiment of the invention, the quality inspection data corresponding to the marking data can be obtained through the voice synthesis model based on the marking data to be inspected and the audio data corresponding to the marking data, the quality inspection data represents the corresponding relation between each character in the marking data and each audio frame in the audio data corresponding to the marking data, and whether the marking data is correct can be determined according to the quality inspection data corresponding to the marking data, so that the quality inspection of the marking data to be inspected is realized without manual work, the workload of quality inspectors is reduced, the influence of the working capacity of the quality inspectors on the quality inspection efficiency and accuracy is reduced, and the marking data with wrong marking can be conveniently traced and positioned.
Example 6: fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and on the basis of the foregoing embodiments, an embodiment of the present invention further provides an electronic device, as shown in fig. 8, including: the system comprises a processor 81, a communication interface 82, a memory 83 and a communication bus 84, wherein the processor 81, the communication interface 82 and the memory 83 are communicated with each other through the communication bus 84;
the memory 83 stores therein a computer program which, when executed by the processor 81, causes the processor 81 to perform the steps of any of the data processing method embodiments described above.
Because the principle of the electronic device for solving the problem is similar to the data processing method, the implementation of the electronic device may refer to the implementation of the method, and repeated details are not repeated.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 82 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
Example 7: on the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and when the computer program runs on the processor, the computer program causes the processor to implement the steps in any of the data processing method embodiments described above.
Since the principle of solving the problem of the computer-readable storage medium is similar to that of the data processing method, the implementation of the computer-readable storage medium can refer to the implementation of the method, and repeated details are not repeated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of data processing, the method comprising:
acquiring any marked data to be subjected to quality inspection and audio data corresponding to the marked data, wherein the marked data comprises text data corresponding to the audio data and first text characteristics of the text data;
determining quality inspection data corresponding to the labeling data through a decoder of a speech synthesis model based on the labeling data and the audio data, wherein the quality inspection data represent the corresponding relation between each character in the labeling data and each audio frame in the audio data;
and judging whether the labeled data is correct or not according to the quality inspection data corresponding to the labeled data.
2. The method of claim 1, wherein the determining whether the labeled data is correct according to the quality inspection data corresponding to the labeled data comprises:
if the quality inspection data corresponding to the labeled data meet the pre-configured quality inspection requirements, determining that the labeled data are correct; or
And if the quality inspection data corresponding to the labeling data do not meet the pre-configured quality inspection requirements, determining that the labeling of the labeling data is wrong.
3. The method of claim 2, wherein determining whether the quality inspection data corresponding to the annotation data meets a pre-configured quality inspection requirement according to at least one of the following methods comprises:
determining whether the number of characters contained in the text data in the labeled data is equal to the number of characters corresponding to the audio data or not based on position quality inspection data included in the quality inspection data so as to determine whether the quality inspection data meets a first pre-configured quality inspection requirement or not, wherein the position quality inspection data is used for identifying the corresponding relation between the position of each second text feature corresponding to the labeled data in a text feature sequence and the position of each audio feature corresponding to the audio data in an audio feature sequence; the second text feature is obtained by encoding the labeling data through an encoder in the speech synthesis model;
and determining whether characters contained in the text data in the labeled data are consistent with characters corresponding to the audio data based on probability vectors corresponding to each audio feature contained in the audio feature sequence corresponding to the audio data in the quality inspection data, so as to determine whether the quality inspection data meets a pre-configured second quality inspection requirement, wherein the probability vector corresponding to any audio feature contains probability values corresponding to each second text feature corresponding to the labeled data.
4. The method of claim 3, wherein the determining whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data based on the position quality inspection data included in the quality inspection data to determine whether the quality inspection data meets a first pre-configured quality inspection requirement comprises:
if the first text position in the position quality inspection data is consistent with the second text position, determining that the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement; and/or
If the first audio position in the position quality inspection data is consistent with the second audio position, determining that the number of characters contained in the text data in the labeling data is equal to the number of characters corresponding to the audio data, and determining that the quality inspection data meets the first quality inspection requirement;
the first text position is the position, in the text feature sequence, of the last second text feature that has a corresponding relationship with an audio feature in the audio feature sequence, and the second text position is the position of the last second text feature in the text feature sequence.
5. The method of claim 3, wherein the determining whether the number of characters included in the text data in the annotation data is equal to the number of characters corresponding to the audio data based on the position quality inspection data included in the quality inspection data to determine whether the quality inspection data meets a first pre-configured quality inspection requirement comprises:
if the first text position of the position quality inspection data is inconsistent with the second text position, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement; or
If the first audio position and the second audio position of the position quality inspection data are inconsistent, determining that the number of characters contained in the text data in the annotation data is not equal to the number of characters corresponding to the audio data, and determining that the quality inspection data does not meet the first quality inspection requirement;
the first audio position is the position of the last audio feature in the audio feature sequence, which has a corresponding relation with the second text feature in the text sequence, in the audio feature sequence; the second audio position is a second audio position of a last audio feature in the sequence of audio features.
6. The method according to claim 3, wherein the determining whether characters included in the text data in the labeled data are consistent with characters corresponding to the audio data based on a probability vector corresponding to each audio feature included in an audio feature sequence corresponding to the audio data included in the quality inspection data to determine whether the quality inspection data meets a second pre-configured quality inspection requirement includes:
respectively acquiring the maximum probability value in the probability vector corresponding to each audio feature;
if the maximum probability value corresponding to any audio feature is judged to be smaller than a preset probability threshold value, determining that characters contained in the text data in the annotation data are inconsistent with characters corresponding to the audio data, and determining that the quality inspection data do not meet the second quality inspection requirement; or
And if the maximum probability value corresponding to each audio feature is not smaller than a preset probability threshold value, determining that characters contained in the text data in the labeled data are consistent with the characters corresponding to the audio data, and determining that the quality inspection data meet the second quality inspection requirement.
7. The method of claim 3, wherein after determining that the quality inspection data corresponding to the annotation data does not meet the pre-configured quality inspection requirement, the method further comprises:
and outputting prompt information of the error of the labeled data.
8. A data processing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition unit, a quality testing unit and a control unit, wherein the acquisition unit is used for acquiring any marking data to be subjected to quality testing and audio data corresponding to the marking data, and the marking data comprises text data corresponding to the audio data and a first text characteristic thereof;
the determining unit is used for determining quality inspection data corresponding to the labeling data through a decoder of a speech synthesis model based on the labeling data and the audio data, and the quality inspection data represent the corresponding relation between each character in the labeling data and each audio frame in the audio data;
and the judging unit is used for judging whether the marked data is correct or not according to the quality inspection data corresponding to the marked data.
9. An electronic device, characterized in that the electronic device comprises a processor for implementing the steps of the data processing method according to any one of claims 1-7 when executing a computer program stored in a memory.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 7.
CN202011496538.8A 2020-12-17 2020-12-17 Data processing method, device, equipment and medium Active CN112669814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011496538.8A CN112669814B (en) 2020-12-17 2020-12-17 Data processing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011496538.8A CN112669814B (en) 2020-12-17 2020-12-17 Data processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112669814A true CN112669814A (en) 2021-04-16
CN112669814B CN112669814B (en) 2024-06-14

Family

ID=75404846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011496538.8A Active CN112669814B (en) 2020-12-17 2020-12-17 Data processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112669814B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488057A (en) * 2021-08-18 2021-10-08 山东新一代信息产业技术研究院有限公司 Health-oriented conversation implementation method and system
WO2023051433A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Data processing method and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1522989A1 (en) * 2003-10-08 2005-04-13 Agfa Inc. System and method for synchronized text display and audio playback
US20080221882A1 (en) * 2007-03-06 2008-09-11 Bundock Donald S System for excluding unwanted data from a voice recording
CN109036464A (en) * 2018-09-17 2018-12-18 腾讯科技(深圳)有限公司 Pronounce error-detecting method, device, equipment and storage medium
CN109166569A (en) * 2018-07-25 2019-01-08 北京海天瑞声科技股份有限公司 The detection method and device that phoneme accidentally marks
CN109599093A (en) * 2018-10-26 2019-04-09 北京中关村科金技术有限公司 Keyword detection method, apparatus, equipment and the readable storage medium storing program for executing of intelligent quality inspection
CN110070854A (en) * 2019-04-17 2019-07-30 北京爱数智慧科技有限公司 Voice annotation quality determination method, device, equipment and computer-readable medium
CN110532522A (en) * 2019-08-22 2019-12-03 深圳追一科技有限公司 Error-detecting method, device, computer equipment and the storage medium of audio mark
WO2020140665A1 (en) * 2019-01-04 2020-07-09 深圳壹账通智能科技有限公司 Method and apparatus for quality detection of double-recorded video, and computer device and storage medium
CN111414751A (en) * 2020-03-20 2020-07-14 深圳前海微众银行股份有限公司 Quality inspection optimization method, device, equipment and storage medium
CN111754978A (en) * 2020-06-15 2020-10-09 北京百度网讯科技有限公司 Rhythm hierarchy marking method, device, equipment and storage medium
CN112069805A (en) * 2019-12-20 2020-12-11 北京来也网络科技有限公司 Text labeling method, device, equipment and storage medium combining RPA and AI


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ATSUNORI OGAWA: "Estimating Speech Recognition Accuracy Based on Error Type Classification", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》, 26 August 2016 (2016-08-26) *
李洺宇: "准书面语朝鲜语语音语料自动标注系统的研究与实现", 《中国优秀硕士学位论文全文数据库》, 15 January 2020 (2020-01-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488057A (en) * 2021-08-18 2021-10-08 山东新一代信息产业技术研究院有限公司 Health-oriented conversation implementation method and system
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care
WO2023051433A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Data processing method and system

Also Published As

Publication number Publication date
CN112669814B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN112669814B (en) Data processing method, device, equipment and medium
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN111611797B (en) Method, device and equipment for marking prediction data based on Albert model
CN111312209A (en) Text-to-speech conversion processing method and device and electronic equipment
CN110610698B (en) Voice labeling method and device
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
CN110969600A (en) Product defect detection method and device, electronic equipment and storage medium
CN112634866B (en) Speech synthesis model training and speech synthesis method, device, equipment and medium
CN109324956B (en) System testing method, apparatus and computer readable storage medium
CN110990222B (en) Cross-platform graphical operation monitoring method and device based on mainframe
CN111325031B (en) Resume analysis method and device
CN110853627B (en) Method and system for voice annotation
CN116521512A (en) Accurate test method and device for codes, electronic equipment and computer readable medium
CN113645357B (en) Call quality inspection method, device, computer equipment and computer readable storage medium
WO2021012495A1 (en) Method and device for verifying speech recognition result, computer apparatus, and medium
CN115982272A (en) Data labeling method and device for urban big data management and computer storage medium
CN115662402A (en) Audio labeling method, device and system
CN114141236A (en) Language model updating method and device, electronic equipment and storage medium
CN113453135A (en) Intelligent sound box optimization method, test method, device, equipment and storage medium
CN114446304A (en) Voice interaction method, data processing method and device and electronic equipment
CN113127635A (en) Data processing method, device and system, storage medium and electronic equipment
CN113763961B (en) Text processing method and device
CN111276121B (en) Voice alignment method and device, electronic equipment and storage medium
CN112631930B (en) Dynamic system testing method and related device
US20050131707A1 (en) Method and process to generate real time input/output in a voice XML run-time simulation environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant