CN113284509B - Method and device for obtaining accuracy of voice annotation and electronic equipment - Google Patents


Info

Publication number
CN113284509B
CN113284509B
Authority
CN
China
Prior art keywords
labeling
granularity
result
membership
voice
Prior art date
Legal status
Active
Application number
CN202110491593.6A
Other languages
Chinese (zh)
Other versions
CN113284509A (en)
Inventor
杨雪 (Yang Xue)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110491593.6A
Publication of CN113284509A
Application granted
Publication of CN113284509B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Signal Processing
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Image Analysis

Abstract

The invention discloses a method, an apparatus, and an electronic device for obtaining the accuracy of voice annotation, and relates to the field of artificial intelligence, in particular to computer vision and voice transcription. The specific implementation scheme is as follows: obtain a labeling result of a voice, wherein the labeling result includes at least one of a labeling result for the original voice and a labeling result for a voice segment obtained by segmenting the original voice; identify the labeling object of the labeling result, wherein the labeling object includes at least one of the original voice and the voice segment; determine the labeling granularity of the labeling result based on the labeling object; and obtain the labeling accuracy under the target voice feature dimension based on the labeling granularity of the labeling result. In this way, the labeling accuracy under different feature dimensions can be obtained based on the labeling granularity, which is highly flexible and improves the diversity of voice annotation accuracy measures.

Description

Method and device for obtaining accuracy of voice annotation and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, an electronic device, a storage medium, and a computer program product for obtaining the accuracy of voice annotation.
Background
At present, with the development of artificial intelligence technology, voice annotation is widely applied in fields such as intelligent customer service and smart home. For example, in an intelligent customer service scenario, collected user voice can be annotated with attributes such as timbre and transcription content. However, existing methods for obtaining the accuracy of voice annotation are one-dimensional and insufficiently flexible, and cannot comprehensively reflect the accuracy of voice annotation.
Disclosure of Invention
Provided are a method, an apparatus, an electronic device, a storage medium, and a computer program product for obtaining the accuracy of voice annotation.
According to a first aspect, there is provided a method for obtaining the accuracy of voice annotation, including: obtaining a labeling result of a voice, wherein the labeling result includes at least one of a labeling result for an original voice and a labeling result for a voice segment obtained by segmenting the original voice; identifying a labeling object of the labeling result, wherein the labeling object includes at least one of the original voice and the voice segment; determining a labeling granularity of the labeling result based on the labeling object; and obtaining the labeling accuracy under a target voice feature dimension based on the labeling granularity of the labeling result.
According to a second aspect, there is provided an apparatus for obtaining the accuracy of voice annotation, including: a first acquisition module configured to obtain a labeling result of a voice, wherein the labeling result includes at least one of a labeling result for an original voice and a labeling result for a voice segment obtained by segmenting the original voice; a first recognition module configured to identify a labeling object of the labeling result, wherein the labeling object includes at least one of the original voice and the voice segment; a determining module configured to determine a labeling granularity of the labeling result based on the labeling object; and a second acquisition module configured to obtain the labeling accuracy under a target voice feature dimension based on the labeling granularity of the labeling result.
According to a third aspect, there is provided an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for obtaining the accuracy of voice annotation according to the first aspect of the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for obtaining the accuracy of voice annotation according to the first aspect of the present disclosure.
According to a fifth aspect, there is provided a computer program product including a computer program, wherein the computer program, when executed by a processor, implements the method for obtaining the accuracy of voice annotation according to the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of a method for obtaining the accuracy of voice annotation according to a first embodiment of the present disclosure;
FIG. 2 is a flowchart of obtaining the labeling accuracy under a target voice feature dimension according to a second embodiment of the present disclosure;
FIG. 3 is a flowchart of obtaining the labeling accuracy under a target voice feature dimension according to a third embodiment of the present disclosure;
FIG. 4 is a flowchart of identifying the judgment result of a labeling result in the method for obtaining the accuracy of voice annotation according to a fourth embodiment of the present disclosure;
FIG. 5 is a flowchart of obtaining the labeling accuracy under a target voice feature dimension according to a fifth embodiment of the present disclosure;
FIG. 6 is a flowchart of obtaining the weight of at least one membership labeling granularity in the method for obtaining the accuracy of voice annotation according to a sixth embodiment of the present disclosure;
FIG. 7 is a flowchart of obtaining the labeling accuracy under a voice feature dimension according to a seventh embodiment of the present disclosure;
FIG. 8 is a block diagram of an apparatus for obtaining the accuracy of voice annotation according to a first embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing the method for obtaining the accuracy of voice annotation according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
AI (Artificial Intelligence) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. At present, AI technology has the advantages of a high degree of automation, high accuracy, and low cost, and is widely applied.
Computer Vision refers to machine vision in which a camera and a computer replace human eyes to recognize, track, and measure targets, followed by further graphics processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection. Computer vision is a comprehensive discipline spanning computer science and engineering, signal processing, physics, applied mathematics and statistics, neurophysiology, cognitive science, and the like.
Speech transcription is a technique that enables a machine to convert speech signals into corresponding text or commands through recognition and understanding, and mainly involves three aspects: feature extraction, pattern matching criteria, and model training.
Fig. 1 is a flowchart of a method for obtaining the accuracy of voice annotation according to a first embodiment of the present disclosure.
As shown in fig. 1, the method for obtaining the accuracy of voice annotation according to the first embodiment of the present disclosure includes:
S101, obtaining a labeling result of a voice, wherein the labeling result includes at least one of a labeling result for the original voice and a labeling result for a voice segment obtained by segmenting the original voice.
It should be noted that, the execution body of the voice marking accuracy obtaining method according to the embodiment of the present disclosure may be a hardware device with data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other intelligent devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
In the embodiment of the disclosure, a labeling result of a voice may be obtained, where the labeling result includes at least one of a labeling result for an original voice and a labeling result for a voice segment after the original voice is segmented. It should be noted that the original speech may be segmented, and the segmented speech segments may be labeled.
The labeling results for the original voice include, but are not limited to, labeling results for whether the original voice is clear, whether the original voice is annotatable, whether the original voice is transcribable, whether the timbre of the original voice is a male voice, a female voice, or a system voice, and the like.
The labeling results for the voice segments obtained by segmenting the original voice include, but are not limited to, labeling results for attributes such as the position, timbre, clarity, and transcription content of the voice segments. For example, the labeling results for the voice segments include, but are not limited to, labeling results for the number of voice segments, the start time of a voice segment, whether a voice segment can be transcribed, whether a voice segment is clear, whether the timbre of a voice segment is a male voice, a female voice, or a system voice, whether the transcription of a voice segment contains wrongly written characters, whether the transcription of a voice segment is missing content, whether the transcription of a voice segment is consistent with the spoken content, and the like.
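The disclosure does not prescribe any concrete data layout for labeling results. Purely as an illustrative, non-limiting Python sketch, a labeling result of the kind listed above could be represented as follows; all field names and example values are hypothetical, not patent terminology:

```python
from dataclasses import dataclass

@dataclass
class LabelingResult:
    """One voice labeling result (illustrative sketch only)."""
    labeling_object: str  # "original_voice" or "voice_segment"
    granularity: str      # e.g. "segment_count", "segment_start_time", "timbre"
    value: object         # the annotated value, e.g. 4, 3.5, "female_voice"

# Example results mirroring the attributes listed above (hypothetical values):
results = [
    LabelingResult("voice_segment", "segment_count", 4),
    LabelingResult("voice_segment", "segment_start_time", 3.5),
    LabelingResult("original_voice", "is_transcribable", True),
]
```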
S102, identifying the labeling object of the labeling result, wherein the labeling object includes at least one of the original voice and the voice segment.
In the embodiments of the present disclosure, the labeling object of the labeling result may be identified, wherein the labeling object includes at least one of the original voice and the voice segment.
S103, determining the marking granularity of the marking result based on the marking object.
In embodiments of the present disclosure, the annotation granularity of the annotation result may be determined based on the annotation object.
In one embodiment, determining the labeling granularity of the labeling result based on the labeling object may include obtaining the candidate labeling granularities corresponding to the labeling object according to a correspondence between labeling objects and labeling granularities, and determining the labeling granularity of the labeling result from the candidate labeling granularities based on the content of the labeling result. The correspondence between labeling objects and labeling granularities can be set according to the actual situation and is not unduly limited here.
For example, if the labeling result is a labeling result for whether the original voice is clear, the labeling object of the labeling result can be determined to be the original voice, and the candidate labeling granularities obtained for the original voice include whether the original voice is clear, whether the original voice is transcribable, whether the timbre of the original voice is a male voice, a female voice, or a system voice, and the like; based on the content of the labeling result, the labeling granularity of the labeling result is then determined from these candidates to be whether the original voice is clear.
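As a non-limiting sketch of this correspondence-based lookup, the following Python fragment maps a labeling object to its candidate labeling granularities and then matches the content of the labeling result against them; the mapping contents, function name, and matching rule are all assumptions for illustration:

```python
# Candidate labeling granularities per labeling object (hypothetical contents;
# the patent leaves the actual correspondence to be configured per application).
CANDIDATE_GRANULARITIES = {
    "original_voice": ["is_clear", "is_transcribable", "timbre"],
    "voice_segment": ["segment_count", "segment_start_time",
                      "transcription_has_typos", "transcription_missing"],
}

def determine_granularity(labeling_object: str, result_content: str) -> str:
    """Pick the candidate granularity matching the labeling result content."""
    for granularity in CANDIDATE_GRANULARITIES[labeling_object]:
        if granularity in result_content:  # simple content match (assumed rule)
            return granularity
    raise ValueError(f"no candidate granularity matches: {result_content!r}")

# A clarity annotation on the original voice maps to the "is_clear" granularity:
print(determine_granularity("original_voice", "is_clear=yes"))  # is_clear
```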
And S104, acquiring the labeling accuracy under the dimension of the target voice feature based on the labeling granularity of the labeling result.
In the embodiments of the present disclosure, the voice feature dimensions may be preset according to the actual situation, which is not limited here. For example, the voice feature dimensions include, but are not limited to, position, timbre, clarity, transcription content, attribute, element, data, topic, page, and batch. The attribute feature dimension is aggregated from the position, timbre, clarity, and transcription content feature dimensions; the element feature dimension is aggregated from multiple attribute feature dimensions; the data feature dimension is aggregated from multiple element feature dimensions; the topic feature dimension is aggregated from multiple data feature dimensions; the page feature dimension is aggregated from multiple topic feature dimensions; and the batch feature dimension is aggregated from multiple page feature dimensions.
In the embodiment of the disclosure, the labeling accuracy under the dimension of the target voice feature can be obtained based on the labeling granularity of the labeling result.
For example, when the target voice feature dimension is the position feature dimension, the labeling accuracy under the position feature dimension can be obtained from the labeling results whose labeling granularities are the number of voice segments and the start time of a voice segment.
For example, when the target voice feature dimension is the transcription content feature dimension, the labeling accuracy under the transcription content feature dimension can be obtained from the labeling results whose labeling granularities are whether the transcription of a voice segment contains wrongly written characters, whether the transcription of a voice segment is missing content, whether the transcription of a voice segment is consistent with the spoken content, and the like.
In summary, according to the method for obtaining the accuracy of voice annotation of the embodiments of the present disclosure, a labeling result of a voice is obtained, wherein the labeling result includes at least one of a labeling result for the original voice and a labeling result for a voice segment obtained by segmenting the original voice; the labeling object of the labeling result is identified; the labeling granularity of the labeling result is determined based on the labeling object; and the labeling accuracy under the target voice feature dimension is obtained based on the labeling granularity of the labeling result. In this way, the labeling accuracy under different feature dimensions can be obtained based on the labeling granularity, which is highly flexible and improves the diversity of voice annotation accuracy measures.
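To make the four steps concrete before they are refined in the embodiments below, here is a deliberately simplified end-to-end sketch: it selects labeling results through a granularity-to-dimension membership map and scores them against reference annotations by plain equality, without the weighting and hierarchy introduced later. All data layouts and names are illustrative assumptions:

```python
def obtain_accuracy(labeling_results, target_dimension, membership, references):
    """Simplified S101-S104: unweighted accuracy for one feature dimension."""
    # S102/S103: keep results whose labeling granularity is subordinate
    # to the target voice feature dimension.
    relevant = [r for r in labeling_results
                if membership.get(r["granularity"]) == target_dimension]
    if not relevant:
        return None
    # S104: fraction of results whose value matches the reference annotation.
    correct = sum(1 for r in relevant
                  if references[(r["granularity"], r["id"])] == r["value"])
    return correct / len(relevant)

membership = {"segment_count": "position", "segment_start_time": "position"}
results = [
    {"id": 1, "granularity": "segment_count", "value": 4},
    {"id": 1, "granularity": "segment_start_time", "value": 3.5},
]
references = {("segment_count", 1): 4, ("segment_start_time", 1): 3.0}
print(obtain_accuracy(results, "position", membership, references))  # 0.5
```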
On the basis of any of the above embodiments, as shown in fig. 2, the step S104 of obtaining the labeling accuracy under the dimension of the target voice feature based on the labeling granularity of the labeling result includes:
s201, at least one membership labeling granularity belonging to the target voice feature dimension is obtained from the labeling granularity, and a labeling result of the membership labeling granularity is obtained.
In embodiments of the present disclosure, the labeling granularity has a membership relationship with the speech feature dimension, and one or more labeling granularities may be affiliated with one speech feature dimension.
For example, the labeling granularities for the number of voice segments, the start time of a voice segment, and the like are subordinate to the position feature dimension.
For example, the labeling granularities for whether the transcription of a voice segment contains wrongly written characters, whether the transcription of a voice segment is missing content, whether the transcription of a voice segment is consistent with the spoken content, and the like are subordinate to the transcription content feature dimension.
For example, the labeling granularities for the number of voice segments, the start time of a voice segment, whether the transcription of a voice segment contains wrongly written characters, whether the transcription of a voice segment is missing content, whether the transcription of a voice segment is consistent with the spoken content, and the like are subordinate to the attribute feature dimension.
It should be noted that, the membership relationship between the labeling granularity and the dimension of the voice feature may also include other embodiments, which are not limited herein.
In embodiments of the present disclosure, at least one membership labeling granularity that is subordinate to the target speech feature dimension may be obtained from the labeling granularities, and labeling results of the membership labeling granularity may be obtained. The number of the target voice feature dimensions can be one or more, and at least one membership annotation granularity belonging to different target voice feature dimensions can be obtained respectively.
For example, when the target voice feature dimension is the position feature dimension, the membership labeling granularities obtained from the labeling granularities include the labeling granularities for the number of voice segments, the start time of a voice segment, and the like.
S202, obtaining the labeling accuracy under the dimension of the target voice feature according to the labeling result of the membership labeling granularity.
In the embodiment of the disclosure, the labeling accuracy under the dimension of the target voice feature can be obtained according to the labeling result of the membership labeling granularity.
For example, when the target voice feature dimension is the position feature dimension, the membership annotation granularity obtained from the annotation granularity comprises annotation granularities for the number of voice segments, the starting time of the voice segments and the like, and the annotation accuracy under the target voice feature dimension is obtained according to annotation results of the annotation granularities for the number of voice segments, the starting time of the voice segments and the like.
In this way, at least one membership labeling granularity subordinate to the target voice feature dimension is obtained from the labeling granularities, the labeling results of the membership labeling granularities are obtained, and the labeling accuracy under the target voice feature dimension is obtained from those labeling results.
On the basis of any of the above embodiments, as shown in fig. 3, in step S202, according to the labeling result of the membership labeling granularity, the labeling accuracy under the dimension of the target voice feature is obtained, including:
s301, identifying a judgment result of the labeling result.
In the embodiment of the disclosure, the judgment result of the labeling result may be correct or incorrect.
In one embodiment, the judgment result of the labeling result may be identified manually. For example, suppose the labeling result labels the number of voice segments as 4. If the actual number of voice segments is 5, that is, the labeling result is inconsistent with the manual judgment, the judgment result of the labeling result can be identified as incorrect; conversely, if the actual number of voice segments is 4, that is, the labeling result is consistent with the manual judgment, the judgment result of the labeling result can be identified as correct.
S302, obtaining the labeling accuracy under the dimension of the target voice feature according to the judgment result of the labeling result of the membership labeling granularity.
In the embodiment of the disclosure, the labeling accuracy under the dimension of the target voice feature can be obtained according to the judgment result of the labeling result of the membership labeling granularity.
For example, when the target voice feature dimension is the position feature dimension, the membership annotation granularity obtained from the annotation granularity comprises annotation granularities for the number of voice segments, the starting time of the voice segments and the like, and the annotation accuracy under the target voice feature dimension is obtained according to the judgment result of the annotation results of the annotation granularities for the number of voice segments, the starting time of the voice segments and the like.
Therefore, the method can identify the judgment result of the labeling result, and obtain the labeling accuracy under the dimension of the target voice feature according to the judgment result of the labeling result of the subordinate labeling granularity.
On the basis of any of the above embodiments, as shown in fig. 4, the determination result of the identification marking result in step S301 includes:
s401, obtaining a reference marking result corresponding to the marking result.
In the embodiment of the disclosure, the reference marking result corresponding to the marking result can be obtained.
In one embodiment, the voice can be marked manually, and the marking result of the voice can be used as a reference marking result.
S402, comparing the labeling result with the reference labeling result.
S403, in response to the labeling result being consistent with the reference labeling result, the judgment result of the identification labeling result is correct.
S404, in response to the fact that the labeling result is inconsistent with the reference labeling result, the judgment result of the identification labeling result is an error.
In embodiments of the present disclosure, the labeling results may be compared to reference labeling results.
In one embodiment, in response to the labeling result being consistent with the reference labeling result, the judgment result of the labeling result can be identified as correct. For example, if the labeling result labels the number of voice segments as 4 and the reference labeling result also labels the number of voice segments as 4, the labeling result is consistent with the reference labeling result, and the judgment result of the labeling result can be identified as correct.
In one embodiment, in response to the labeling result being inconsistent with the reference labeling result, the judgment result of the labeling result can be identified as incorrect. For example, if the labeling result labels the number of voice segments as 4 but the reference labeling result labels the number of voice segments as 5, the labeling result is inconsistent with the reference labeling result, and the judgment result of the labeling result can be identified as incorrect.
Therefore, the method can compare the labeling result with the reference labeling result, and identify the judgment result of the labeling result according to whether the labeling result is consistent with the reference labeling result.
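A minimal Python sketch of S401 to S404, assuming a labeling result and its reference are directly comparable values (the function name is illustrative):

```python
def judge(labeling_result, reference_result) -> bool:
    """S401-S404: a labeling result is judged correct if and only if it is
    consistent with the (e.g. manually produced) reference labeling result."""
    return labeling_result == reference_result

print(judge(4, 4))  # True: the labeled segment count matches the reference
print(judge(4, 5))  # False: the labeled segment count differs from the reference
```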
On the basis of any of the above embodiments, as shown in fig. 5, in step S302, according to the determination result of the labeling result of the membership labeling granularity, the labeling accuracy under the dimension of the target voice feature is obtained, including:
s501, obtaining the weight of at least one membership labeling granularity.
In the embodiments of the present disclosure, weights may be set for the labeling granularities in advance, and different labeling granularities may correspond to different weights. For example, the weight of the labeling granularity for the position attribute of a voice segment may be set to 50%, and the weight of the labeling granularity for the timbre attribute of a voice segment may be set to 10%.
In embodiments of the present disclosure, a weight of at least one membership labeling granularity may be obtained.
In one embodiment, a mapping relationship or mapping table between the labeling granularity and the weight may be established in advance, and after the membership labeling granularity is obtained, the weight of the membership labeling granularity may be obtained by querying the mapping relationship or mapping table. It should be noted that, the mapping relationship or the mapping table may be set according to the actual situation.
S502, obtaining the labeling accuracy of any membership labeling granularity according to the judgment result of the labeling result of any membership labeling granularity.
In the embodiment of the disclosure, the labeling accuracy of any membership labeling granularity can be obtained according to the judgment result of the labeling result of any membership labeling granularity.
In one embodiment, according to the judgment result of the labeling result of any membership labeling granularity, obtaining the labeling accuracy of any membership labeling granularity may include obtaining a first number of labeling results of any membership labeling granularity, obtaining a second number of labeling results of which the judgment result is correct in the labeling results of any membership labeling granularity, and obtaining a ratio of the second number to the first number as the labeling accuracy of any membership labeling granularity.
For example, if the first number of labeling results of a membership labeling granularity is 10, and the second number of those labeling results whose judgment result is correct is 4, the labeling accuracy of that membership labeling granularity is 40%.
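In code, the ratio described in this embodiment is simply the count of correct judgments over the total count; a sketch with illustrative names:

```python
def granularity_accuracy(judgments) -> float:
    """Labeling accuracy of one membership labeling granularity from its
    per-result judgments (True means the result was judged correct)."""
    first_number = len(judgments)   # total labeling results of this granularity
    second_number = sum(judgments)  # labeling results judged correct
    return second_number / first_number

# 4 of 10 labeling results judged correct gives 40%, matching the example above:
print(granularity_accuracy([True] * 4 + [False] * 6))  # 0.4
```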
And S503, obtaining the labeling accuracy under the dimension of the target voice feature according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity.
In the embodiment of the disclosure, the labeling accuracy under the dimension of the target voice feature can be obtained according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity.
In one embodiment, obtaining the labeling accuracy under the target voice feature dimension according to the labeling accuracies of the membership labeling granularities and their weights may include identifying that the membership labeling granularities include at least one target labeling granularity, and, in response to the judgment result of the labeling result of any target labeling granularity being incorrect, obtaining the labeling accuracy under the target voice feature dimension as 0.
The target labeling granularities may be set according to the actual situation; for example, target labeling granularities include, but are not limited to, whether the original voice is clear, whether the original voice is transcribable, and the like.
For example, if the membership labeling granularities include the labeling granularity for whether the original voice is transcribable, and the judgment result of its labeling result is incorrect, the labeling accuracy under the target voice feature dimension can be obtained as 0.
Therefore, when the membership labeling granularity comprises the target labeling granularity and the judgment result of the labeling result of any target labeling granularity is wrong, the labeling accuracy rate under the dimension of the target voice feature can be directly obtained to be 0.
In one embodiment, obtaining the labeling accuracy under the target voice feature dimension according to the labeling accuracies of the membership labeling granularities and their weights may include identifying that the membership labeling granularities include at least one target labeling granularity, and, in response to the judgment results of the labeling results of all target labeling granularities being correct, obtaining the sum of the products of the labeling accuracies and weights of the remaining membership labeling granularities as the labeling accuracy under the target voice feature dimension.
For example, if the membership labeling granularities include the labeling granularity for whether the original voice is transcribable and the judgment result of its labeling result is correct, and the labeling accuracies of the remaining membership labeling granularities are 80%, 50%, and 60% with weights of 10%, 50%, and 40% respectively, then the labeling accuracy under the target voice feature dimension is 80%×10% + 50%×50% + 60%×40% = 57%.
Therefore, when the membership labeling granularity contains the target labeling granularity and the judgment result of the labeling results of all the target labeling granularity is correct, the method can acquire the sum of products of the labeling accuracy rates and weights of the rest membership labeling granularity as the labeling accuracy rate under the dimension of the target voice feature.
In one embodiment, obtaining the labeling accuracy under the target speech feature dimension according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity may include identifying that the membership labeling granularity does not include the target labeling granularity, and obtaining a sum of products of the labeling accuracy of the membership labeling granularity and the weight as the labeling accuracy under the target speech feature dimension. Therefore, when the membership labeling granularity does not contain the target labeling granularity, the method can directly acquire the sum of products of the labeling accuracy rate and the weight of the membership labeling granularity, and the sum is used as the labeling accuracy rate under the dimension of the target voice characteristics.
It should be noted that, according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity, the labeling accuracy under the dimension of the target voice feature is obtained, and other possible embodiments may also be included, which are not limited too much.
Therefore, the method can obtain the labeling accuracy of any membership labeling granularity according to the judgment result of the labeling result of any membership labeling granularity, and obtain the labeling accuracy under the dimension of the target voice feature according to the labeling accuracy and the weight of the membership labeling granularity.
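The three cases above (gating on the target labeling granularities, then a weighted sum over the remaining membership labeling granularities) can be sketched as follows; the data layout and names are illustrative assumptions:

```python
def dimension_accuracy(granularities, target_names):
    """Sketch of S501-S503. `granularities` maps each membership labeling
    granularity to (accuracy, weight, judged_correct), where judged_correct
    is whether all of its labeling results were judged correct."""
    # Any target labeling granularity with an erroneous labeling result
    # forces the labeling accuracy under the dimension to 0.
    for name in target_names:
        if name in granularities and not granularities[name][2]:
            return 0.0
    # Otherwise: sum of accuracy * weight over the remaining granularities.
    return sum(acc * weight
               for name, (acc, weight, _) in granularities.items()
               if name not in target_names)

granularities = {
    "is_transcribable": (1.0, 0.0, True),  # target granularity, judged correct
    "position":         (0.8, 0.10, True),
    "timbre":           (0.5, 0.50, True),
    "transcription":    (0.6, 0.40, True),
}
# 0.8*0.10 + 0.5*0.50 + 0.6*0.40 = 0.57, matching the 57% example above:
print(round(dimension_accuracy(granularities, {"is_transcribable"}), 2))  # 0.57
```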
On the basis of any of the above embodiments, as shown in fig. 6, the obtaining the weight of the at least one membership label granularity in step S501 may include:
S601, historical weight of the membership labeling granularity, a first attention degree parameter and/or a second attention degree parameter are obtained, wherein the first attention degree parameter is used for representing the attention degree of a user to the membership labeling granularity, and the second attention degree parameter is used for representing the attention degree of a server to the membership labeling granularity.
In embodiments of the present disclosure, a historical weight, a first attention parameter, and/or a second attention parameter of a membership annotation granularity may be obtained.
In one embodiment, a larger first attention parameter indicates higher user attention to the membership labeling granularity, and a larger second attention parameter indicates higher server-side attention to the membership labeling granularity.
In one embodiment, a previously set weight of the membership labeling granularity may be used as its historical weight. For example, the set weights of labeling granularities may be saved in the storage space of the server, and the previously set weight of the membership labeling granularity can then be retrieved from that storage space as its historical weight. Alternatively, the average of the weights set in the previous N rounds for the membership labeling granularity can be used as the historical weight, which gives good timeliness, where N is a positive integer that can be set according to the actual situation.
In one embodiment, a mapping relation or a mapping table between the membership labeling granularity and the historical weight, the first attention parameter and the second attention parameter can be established in advance, and the historical weight, the first attention parameter and the second attention parameter corresponding to the membership labeling granularity are obtained by inquiring the mapping relation or the mapping table. It should be noted that, the mapping relationship or the mapping table may be set according to the actual situation.
S602, according to the first attention degree parameter and/or the second attention degree parameter, the adjustment parameter of the historical weight is determined.
In embodiments of the present disclosure, the adjustment parameters of the historical weights may be determined according to the first and/or second attention parameters.
In one embodiment, the adjustment direction and the adjustment value of the historical weight may be determined according to the first attention parameter and/or the second attention parameter. For example, the larger the first attention degree parameter and/or the second attention degree parameter, the higher the attention degree of the user and/or the server side to the membership labeling granularity is represented, the adjusting direction of the historical weight can be determined to be an increasing direction, and the adjusting value of the historical weight can be determined according to the first attention degree parameter and/or the second attention degree parameter.
And S603, adjusting the historical weight based on the adjustment parameter, and taking the adjusted historical weight as the weight of the membership labeling granularity.
In the embodiments of the present disclosure, the historical weight can be adjusted based on the adjustment parameter, and the adjusted historical weight is used as the weight of the membership labeling granularity. For example, if the adjustment direction of the historical weight of the membership labeling granularity is upward, the adjustment value is 10%, and the historical weight is 20%, then the adjusted historical weight is 30%, and the weight of the membership labeling granularity is 30%.
Therefore, the method can determine the adjustment parameters of the historical weights according to the first attention degree parameters and/or the second attention degree parameters, adjust the historical weights based on the adjustment parameters, and take the adjusted historical weights as weights of membership labeling granularity.
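The disclosure fixes the direction of the adjustment (higher attention raises the weight) but not a formula for the adjustment value, so the sketch below uses an assumed linear rule; every name and the scaling constant are hypothetical:

```python
def adjusted_weight(historical_weight: float,
                    user_attention: float = 0.0,
                    server_attention: float = 0.0) -> float:
    """S601-S603 sketch: larger attention parameters raise the historical
    weight; the linear rule below is an illustrative assumption only."""
    adjustment = 0.1 * (user_attention + server_attention)  # assumed scaling
    return min(1.0, max(0.0, historical_weight + adjustment))

# Historical weight 20%, attention pushes it up by 10%, giving 30%,
# matching the example above:
print(round(adjusted_weight(0.20, user_attention=0.5, server_attention=0.5), 2))
```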
On the basis of any of the above embodiments, as shown in fig. 7, obtaining the labeling accuracy under the dimension of the voice feature may include:
s701, identifying hierarchical relationships among voice feature dimensions.
In embodiments of the present disclosure, the voice feature dimensions have a hierarchical relationship therebetween, and the hierarchical relationship between the voice feature dimensions can be identified.
S702, for any first voice feature dimension belonging to the first level, acquiring each second voice feature dimension belonging to the second level and corresponding to the first voice feature dimension, and a labeling accuracy under the second voice feature dimension, wherein the second voice feature dimension belonging to the second level is used for aggregation into the first voice feature dimension of the first level.
In an embodiment of the present disclosure, the speech feature dimensions include a first speech feature dimension belonging to a first hierarchy and a second speech feature dimension belonging to a second hierarchy. The second voice feature dimensions belonging to the second level are used for being aggregated into first voice feature dimensions of the first level, the first voice feature dimensions and the second voice feature dimensions have corresponding relations, and different first voice feature dimensions can correspond to different second voice feature dimensions.
For example, the voice feature dimensions include, but are not limited to, position, timbre, clarity, transcription content, attribute, element, data, topic, page, and batch. The position, timbre, clarity, and transcription content feature dimensions are used for aggregation into the attribute feature dimension; multiple attribute feature dimensions are used for aggregation into an element feature dimension; multiple element feature dimensions are used for aggregation into a data feature dimension; multiple data feature dimensions are used for aggregation into a topic feature dimension; multiple topic feature dimensions are used for aggregation into a page feature dimension; and multiple page feature dimensions are used for aggregation into a batch feature dimension.
In the embodiment of the disclosure, each second voice feature dimension belonging to the second hierarchy corresponding to the first voice feature dimension and the labeling accuracy under the second voice feature dimension may be obtained for any one first voice feature dimension belonging to the first hierarchy.
For example, for an element feature dimension belonging to a first hierarchy, the acquired second speech feature dimension belonging to a second hierarchy includes an attribute feature dimension, and a labeling accuracy under the attribute feature dimension may be acquired.
S703, obtaining the labeling accuracy rate of the first voice feature dimension according to the labeling accuracy rate of each second voice feature dimension.
In the embodiment of the disclosure, the labeling accuracy under the first voice feature dimension can be obtained according to the labeling accuracy under each second voice feature dimension.
In one embodiment, according to the labeling accuracy rate in each second voice feature dimension, obtaining the labeling accuracy rate in the first voice feature dimension may include obtaining an average value of the labeling accuracy rates in all the second voice feature dimensions as the labeling accuracy rate in the first voice feature dimension.
For example, for a topic feature dimension belonging to the first level, the obtained second voice feature dimensions belonging to the second level are data feature dimensions. If the labeling accuracies under the data feature dimensions are 80%, 50%, and 60% respectively, the average of the labeling accuracies under all the data feature dimensions is 63.3%, and the labeling accuracy under the topic feature dimension is therefore 63.3%.
Therefore, the method can obtain the labeling accuracy under the first voice feature dimension of the first level according to the labeling accuracy under the second voice feature dimension of the second level, and can obtain the labeling accuracy by utilizing the level relation among the voice feature dimensions.
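A sketch of S701 to S703, with the aggregation chain from the description encoded as a parent map and the first-level accuracy taken as the mean of its second-level children; all names are illustrative:

```python
from statistics import mean

# Each second-level dimension and the first-level dimension it aggregates
# into, following the chain in the description (attribute -> element -> ...).
PARENT = {
    "position": "attribute", "timbre": "attribute",
    "clarity": "attribute", "transcription_content": "attribute",
    "attribute": "element", "element": "data",
    "data": "topic", "topic": "page", "page": "batch",
}

def first_level_accuracy(first_dim, second_level_accuracies):
    """S702/S703: the labeling accuracy under a first-level dimension is the
    average accuracy of the second-level dimensions aggregating into it."""
    children = [acc for dim, acc in second_level_accuracies
                if PARENT.get(dim) == first_dim]
    return mean(children)

# Three data feature dimensions aggregate into one topic feature dimension:
accuracies = [("data", 0.8), ("data", 0.5), ("data", 0.6)]
print(round(first_level_accuracy("topic", accuracies), 3))  # 0.633, i.e. 63.3%
```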
Fig. 8 is a block diagram of an apparatus for obtaining the accuracy of voice annotation according to a first embodiment of the present disclosure.
As shown in fig. 8, an apparatus 800 for obtaining the accuracy of voice annotation according to an embodiment of the present disclosure includes: a first acquisition module 801, a first recognition module 802, a determining module 803, and a second acquisition module 804.
A first obtaining module 801, configured to obtain a labeling result of a voice, where the labeling result includes at least one of a labeling result for an original voice and a labeling result of a voice segment after the original voice is segmented;
a first recognition module 802, configured to recognize a labeling object of the labeling result, where the labeling object includes at least one of the original speech and the speech segment;
a determining module 803, configured to determine a labeling granularity of the labeling result based on the labeling object;
the second obtaining module 804 is configured to obtain a labeling accuracy under the dimension of the target voice feature based on the labeling granularity of the labeling result.
In one embodiment of the present disclosure, the second obtaining module 804 includes: the first acquisition unit is used for acquiring at least one membership labeling granularity belonging to the target voice feature dimension from the labeling granularity and acquiring a labeling result of the membership labeling granularity; and the second acquisition unit is used for acquiring the labeling accuracy under the dimension of the target voice characteristic according to the labeling result of the membership labeling granularity.
In one embodiment of the present disclosure, the second obtaining unit includes: the identification subunit is used for identifying the judgment result of the labeling result; and the obtaining subunit is used for obtaining the labeling accuracy under the dimension of the target voice feature according to the judgment result of the labeling result of the membership labeling granularity.
In one embodiment of the present disclosure, the identification subunit is specifically configured to: obtaining a reference marking result corresponding to the marking result; comparing the labeling result with the reference labeling result; responding to the labeling result being consistent with the reference labeling result, and identifying that the judging result of the labeling result is correct; and in response to the inconsistent labeling result and the reference labeling result, recognizing that the judgment result of the labeling result is an error.
In one embodiment of the present disclosure, the acquiring subunit is specifically configured to: acquiring the weight of the at least one membership labeling granularity; obtaining the labeling accuracy of any membership labeling granularity according to the judgment result of the labeling result of any membership labeling granularity; and obtaining the labeling accuracy under the dimension of the target voice feature according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity.
In one embodiment of the present disclosure, the acquiring subunit is specifically configured to: acquiring a first number of labeling results of any membership labeling granularity; obtaining a second number of labeling results with correct judgment results in the labeling results of any membership labeling granularity; and obtaining the ratio of the second quantity to the first quantity as the labeling accuracy of any membership labeling granularity.
In one embodiment of the present disclosure, the acquiring subunit is specifically configured to: identify that the membership labeling granularities include at least one target labeling granularity; in response to the judgment result of the labeling result of any target labeling granularity being incorrect, obtain the labeling accuracy under the target voice feature dimension as 0; or, in response to the judgment results of the labeling results of all target labeling granularities being correct, obtain the sum of the products of the labeling accuracies and weights of the remaining membership labeling granularities as the labeling accuracy under the target voice feature dimension.
In one embodiment of the present disclosure, the acquiring subunit is specifically configured to: and identifying that the membership labeling granularity does not contain the target labeling granularity, and acquiring the sum of products of the labeling accuracy rate and the weight of the membership labeling granularity as the labeling accuracy rate under the dimension of the target voice characteristics.
In one embodiment of the present disclosure, the acquiring subunit is specifically configured to: acquiring historical weights of the membership labeling granularity, a first attention degree parameter and/or a second attention degree parameter, wherein the first attention degree parameter is used for representing the attention degree of a user to the membership labeling granularity, and the second attention degree parameter is used for representing the attention degree of a server to the membership labeling granularity; determining an adjustment parameter of the historical weight according to the first attention parameter and/or the second attention parameter; and adjusting the historical weight based on the adjustment parameter, and taking the adjusted historical weight as the weight of the membership labeling granularity.
In one embodiment of the present disclosure, the apparatus further comprises: the second recognition module is used for recognizing the hierarchical relationship among the voice feature dimensions; the third acquisition module is used for acquiring each second voice feature dimension belonging to a second level and corresponding to any first voice feature dimension belonging to a first level and the labeling accuracy under the second voice feature dimension, wherein the second voice feature dimension belonging to the second level is used for aggregation into the first voice feature dimension of the first level; and the fourth acquisition module is used for acquiring the labeling accuracy rate under the first voice feature dimension according to the labeling accuracy rate under each second voice feature dimension.
In summary, the apparatus for obtaining the accuracy of voice annotation according to the embodiments of the present disclosure obtains a labeling result of a voice, wherein the labeling result includes at least one of a labeling result for the original voice and a labeling result for a voice segment obtained by segmenting the original voice; identifies the labeling object of the labeling result; determines the labeling granularity of the labeling result based on the labeling object; and obtains the labeling accuracy under the target voice feature dimension based on the labeling granularity of the labeling result. In this way, the labeling accuracy under different feature dimensions can be obtained based on the labeling granularity, which is highly flexible and improves the diversity of voice annotation accuracy measures.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 901 performs the respective methods and processes described above, such as the method for obtaining the accuracy of voice annotation described with reference to figs. 1 to 7. For example, in some embodiments, the method for obtaining the accuracy of voice annotation may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method for obtaining the accuracy of voice annotation described above can be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method for obtaining the accuracy of voice annotation in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to an embodiment of the disclosure, the disclosure further provides a computer program product comprising a computer program which, when executed by a processor, implements the method for obtaining the accuracy of voice annotation according to the above embodiments of the disclosure.
It should be appreciated that the various forms of flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; this is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A method for obtaining the accuracy of voice annotation comprises the following steps:
obtaining a labeling result of a voice, wherein the labeling result comprises at least one of a labeling result of an original voice and a labeling result of a voice segment obtained by segmenting the original voice;
identifying a labeling object of the labeling result, wherein the labeling object comprises at least one of the original voice and the voice segment;
determining a labeling granularity of the labeling result based on the labeling object, wherein the labeling granularity has a membership relationship with a voice feature dimension, and one or more labeling granularities belong to one voice feature dimension;
obtaining, from the labeling granularity, at least one membership labeling granularity belonging to a target voice feature dimension, and obtaining a labeling result of the membership labeling granularity;
identifying the judgment result of the labeling result;
acquiring the weight of the at least one membership labeling granularity;
obtaining the labeling accuracy of any membership labeling granularity according to the judgment result of the labeling result of any membership labeling granularity;
obtaining the labeling accuracy under the target voice feature dimension according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity;
wherein the obtaining the weight of the at least one membership labeling granularity comprises:
acquiring a historical weight of the membership labeling granularity and a first attention parameter and/or a second attention parameter, wherein the first attention parameter is used for characterizing the attention of a user to the membership labeling granularity, and the second attention parameter is used for characterizing the attention of a server to the membership labeling granularity;
determining an adjustment parameter of the historical weight according to the first attention parameter and/or the second attention parameter;
and adjusting the historical weight based on the adjustment parameter, taking the adjusted historical weight as the weight of the membership labeling granularity.
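As a reading aid for the weight-adjustment steps at the end of claim 1, a minimal Python sketch follows. The function name adjust_weight and the averaging-then-multiplying rule are illustrative assumptions; the claim only requires deriving an adjustment parameter from the attention parameter(s) and applying it to the historical weight.

    from typing import Optional

    def adjust_weight(historical_weight: float,
                      user_attention: Optional[float] = None,
                      server_attention: Optional[float] = None) -> float:
        """Derive the weight of a membership labeling granularity from its
        historical weight and the first (user) and/or second (server)
        attention parameter. The adjustment rule below is an assumption."""
        # Collect whichever attention parameters were supplied.
        factors = [p for p in (user_attention, server_attention) if p is not None]
        if not factors:
            return historical_weight  # nothing to adjust with
        adjustment = sum(factors) / len(factors)  # the adjustment parameter
        return historical_weight * adjustment     # the adjusted weight becomes the weight

Under these assumptions, adjust_weight(0.4, user_attention=1.2) raises the weight of a granularity that users watch closely, while a granularity with no recorded attention keeps its historical weight.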
2. The method of claim 1, wherein the identifying the judgment result of the labeling result comprises:
obtaining a reference labeling result corresponding to the labeling result;
comparing the labeling result with the reference labeling result;
in response to the labeling result being consistent with the reference labeling result, identifying the judgment result of the labeling result as correct;
and in response to the labeling result being inconsistent with the reference labeling result, identifying the judgment result of the labeling result as an error.
3. The method of claim 1, wherein the obtaining the labeling accuracy of any membership labeling granularity according to the judgment result of the labeling result of any membership labeling granularity comprises:
acquiring a first number of labeling results of any membership labeling granularity;
obtaining a second number of labeling results with correct judgment results in the labeling results of any membership labeling granularity;
and obtaining the ratio of the second number to the first number as the labeling accuracy of any membership labeling granularity.
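A short Python sketch ties claims 2 and 3 together; the function names judge and labeling_accuracy, and the use of plain strings for labeling results, are assumptions made for illustration.

    def judge(labeling_result: str, reference_result: str) -> bool:
        """Claim 2: a labeling result is judged correct exactly when it is
        consistent with its reference labeling result."""
        return labeling_result == reference_result

    def labeling_accuracy(judgments: list[bool]) -> float:
        """Claim 3: accuracy of one membership labeling granularity, i.e. the
        ratio of the second number (results judged correct) to the first
        number (all labeling results of that granularity)."""
        first_number = len(judgments)
        second_number = sum(judgments)  # True counts as 1
        return second_number / first_number if first_number else 0.0

For example, judgments of [True, True, False, True] give a labeling accuracy of 0.75.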
4. The method of claim 1, wherein the obtaining the labeling accuracy under the target voice feature dimension according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity comprises:
identifying that the membership labeling granularity comprises at least one target labeling granularity;
in response to the judgment result of the labeling result of any target labeling granularity being an error, obtaining the labeling accuracy under the target voice feature dimension as 0; or,
and in response to the judgment results of the labeling results of all the target labeling granularities being correct, obtaining the sum of the products of the labeling accuracies and weights of the remaining membership labeling granularities as the labeling accuracy under the target voice feature dimension.
5. The method of claim 4, wherein the obtaining the labeling accuracy under the target voice feature dimension according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity further comprises:
and identifying that the membership labeling granularity does not contain a target labeling granularity, and obtaining the sum of the products of the labeling accuracies and weights of the membership labeling granularities as the labeling accuracy under the target voice feature dimension.
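The branching in claims 4 and 5 can be summarized in one hypothetical Python function; the dictionary keys accuracy, weight, is_target, and all_correct are an illustrative data layout, not the patent's own data model.

    def dimension_accuracy(granularities: list[dict]) -> float:
        """Labeling accuracy under a target voice feature dimension.

        Each dict describes one membership labeling granularity:
          accuracy    - its labeling accuracy (claim 3)
          weight      - its weight, possibly adjusted (claim 1)
          is_target   - whether it is a target labeling granularity
          all_correct - whether all of its labeling results were judged correct
        """
        targets = [g for g in granularities if g["is_target"]]
        if targets:
            # Claim 4: one wrong target granularity forces the accuracy to 0.
            if any(not g["all_correct"] for g in targets):
                return 0.0
            # All target granularities correct: weighted sum over the rest.
            rest = [g for g in granularities if not g["is_target"]]
            return sum(g["accuracy"] * g["weight"] for g in rest)
        # Claim 5: no target granularity present, so take the weighted sum
        # over all membership labeling granularities.
        return sum(g["accuracy"] * g["weight"] for g in granularities)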
6. The method of any of claims 1-5, wherein the method further comprises:
identifying hierarchical relationships between voice feature dimensions;
for any first voice feature dimension belonging to a first level, acquiring each second voice feature dimension belonging to a second level corresponding to the first voice feature dimension and a labeling accuracy under the second voice feature dimension, wherein the second voice feature dimension belonging to the second level is used for aggregation into the first voice feature dimension of the first level;
and obtaining the labeling accuracy under the first voice feature dimension according to the labeling accuracy under each second voice feature dimension.
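Claim 6 rolls accuracies up a hierarchy of voice feature dimensions but does not fix the aggregation rule; the plain average in the following sketch is therefore an assumption, and a weighted combination would satisfy the claim equally well.

    def first_level_accuracy(second_level_accuracies: list[float]) -> float:
        """Claim 6: aggregate the labeling accuracy of a first-level voice
        feature dimension from the accuracies of the second-level dimensions
        that aggregate into it. A simple mean is assumed here."""
        if not second_level_accuracies:
            raise ValueError("at least one second-level dimension is required")
        return sum(second_level_accuracies) / len(second_level_accuracies)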
7. A device for obtaining the accuracy of voice annotation comprises:
the first acquisition module is used for obtaining a labeling result of a voice, wherein the labeling result comprises at least one of a labeling result of the original voice and a labeling result of a voice segment obtained by segmenting the original voice;
the first recognition module is used for recognizing a labeling object of the labeling result, wherein the labeling object comprises at least one of the original voice and the voice segment;
the determining module is used for determining the labeling granularity of the labeling result based on the labeling object, wherein the labeling granularity has a membership relationship with a voice feature dimension, and one or more labeling granularities belong to one voice feature dimension;
the second acquisition module is used for acquiring, from the labeling granularity, at least one membership labeling granularity belonging to a target voice feature dimension and acquiring a labeling result of the membership labeling granularity; identifying the judgment result of the labeling result; acquiring the weight of the at least one membership labeling granularity; obtaining the labeling accuracy of any membership labeling granularity according to the judgment result of the labeling result of any membership labeling granularity; and obtaining the labeling accuracy under the target voice feature dimension according to the labeling accuracy of the membership labeling granularity and the weight of the membership labeling granularity;
wherein the second acquisition module comprises an acquisition subunit,
the acquisition subunit is configured to acquire a historical weight of the membership labeling granularity and a first attention parameter and/or a second attention parameter, wherein the first attention parameter is used for characterizing the attention of a user to the membership labeling granularity, and the second attention parameter is used for characterizing the attention of a server to the membership labeling granularity; determine an adjustment parameter of the historical weight according to the first attention parameter and/or the second attention parameter; and adjust the historical weight based on the adjustment parameter, taking the adjusted historical weight as the weight of the membership labeling granularity.
8. The apparatus of claim 7, wherein the second acquisition module further comprises an identification subunit, the identification subunit being specifically configured to:
obtaining a reference labeling result corresponding to the labeling result;
comparing the labeling result with the reference labeling result;
in response to the labeling result being consistent with the reference labeling result, identifying the judgment result of the labeling result as correct;
and in response to the labeling result being inconsistent with the reference labeling result, identifying the judgment result of the labeling result as an error.
9. The apparatus of claim 7, wherein the acquisition subunit is specifically configured to:
acquiring a first number of labeling results of any membership labeling granularity;
obtaining a second number of labeling results with correct judgment results in the labeling results of any membership labeling granularity;
and obtaining the ratio of the second number to the first number as the labeling accuracy of any membership labeling granularity.
10. The apparatus of claim 7, wherein the acquisition subunit is specifically configured to:
identifying that the membership labeling granularity comprises at least one target labeling granularity;
in response to the judgment result of the labeling result of any target labeling granularity being an error, obtaining the labeling accuracy under the target voice feature dimension as 0; or,
and in response to the judgment results of the labeling results of all the target labeling granularities being correct, obtaining the sum of the products of the labeling accuracies and weights of the remaining membership labeling granularities as the labeling accuracy under the target voice feature dimension.
11. The apparatus of claim 10, wherein the acquisition subunit is specifically configured to:
and identifying that the membership labeling granularity does not contain a target labeling granularity, and obtaining the sum of the products of the labeling accuracies and weights of the membership labeling granularities as the labeling accuracy under the target voice feature dimension.
12. The apparatus according to any one of claims 7-11, wherein the apparatus further comprises:
the second recognition module is used for recognizing the hierarchical relationship among the voice feature dimensions;
the third acquisition module is used for acquiring, for any first voice feature dimension belonging to a first level, each second voice feature dimension belonging to a second level corresponding to the first voice feature dimension and the labeling accuracy under the second voice feature dimension, wherein the second voice feature dimension belonging to the second level is used for aggregation into the first voice feature dimension of the first level;
and the fourth acquisition module is used for obtaining the labeling accuracy under the first voice feature dimension according to the labeling accuracy under each second voice feature dimension.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for obtaining the accuracy of voice annotation according to any one of claims 1-6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for obtaining the accuracy of voice annotation according to any one of claims 1-6.
CN202110491593.6A 2021-05-06 2021-05-06 Method and device for obtaining accuracy of voice annotation and electronic equipment Active CN113284509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110491593.6A CN113284509B (en) 2021-05-06 2021-05-06 Method and device for obtaining accuracy of voice annotation and electronic equipment

Publications (2)

Publication Number Publication Date
CN113284509A CN113284509A (en) 2021-08-20
CN113284509B (en) 2024-01-16

Family

ID=77278089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110491593.6A Active CN113284509B (en) 2021-05-06 2021-05-06 Method and device for obtaining accuracy of voice annotation and electronic equipment

Country Status (1)

Country Link
CN (1) CN113284509B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4587165B2 (en) * 2004-08-27 2010-11-24 キヤノン株式会社 Information processing apparatus and control method thereof
US8620658B2 (en) * 2007-04-16 2013-12-31 Sony Corporation Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program for speech recognition

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103065625A (en) * 2012-12-25 2013-04-24 广东欧珀移动通信有限公司 Method and device for adding digital voice tag
CN105260877A (en) * 2015-09-22 2016-01-20 世纪龙信息网络有限责任公司 E-mail-based method for acquiring data of user portrait
CN111354340A (en) * 2018-12-20 2020-06-30 北京嘀嘀无限科技发展有限公司 Data annotation accuracy verification method and device, electronic equipment and storage medium
CN110544467A (en) * 2019-09-04 2019-12-06 中国联合网络通信集团有限公司 Voice data auditing method, device, equipment and storage medium
CN111291567A (en) * 2020-02-05 2020-06-16 北京明略软件系统有限公司 Evaluation method and device for manual labeling quality, electronic equipment and storage medium
CN112270532A (en) * 2020-11-12 2021-01-26 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112435651A (en) * 2020-11-20 2021-03-02 昆明学院 Quality evaluation method for automatic voice data annotation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-level annotation and retrieval of Mandarin broadcast speech; Zhang Sen; Hua Shaohe; Journal of Chinese Information Processing (04); full text *

Also Published As

Publication number Publication date
CN113284509A (en) 2021-08-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant