CN117542349A - Data labeling method and device and voice recognition method and device - Google Patents


Info

Publication number
CN117542349A
Authority
CN
China
Prior art keywords
recognition result
text
data
recognition
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311602965.3A
Other languages
Chinese (zh)
Inventor
付立
范璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202311602965.3A priority Critical patent/CN117542349A/en
Publication of CN117542349A publication Critical patent/CN117542349A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/60 - Type of objects
    • G06V 20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/635 - Overlay text, e.g. embedded captions in a TV program
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Abstract

The disclosure relates to a data labeling method and device and a voice recognition method and device, and relates to the technical field of computers. The data labeling method comprises the following steps: performing voice recognition on audio stream data of a video by using a voice recognition model to obtain a voice recognition result and the confidence of the voice recognition result; performing text recognition on the subtitle region of the video by using a text recognition model to obtain a text recognition result; fusing the voice recognition result and the text recognition result according to the confidence to determine a final recognition result; and labeling the audio stream data according to the final recognition result. With this technical scheme, the labor cost of data labeling can be reduced and the efficiency of data labeling can be improved.

Description

Data labeling method and device and voice recognition method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data labeling method, a data labeling device, a voice recognition method, a voice recognition device, an electronic device, and a non-volatile computer readable storage medium.
Background
Currently, with the rapid development of artificial intelligence technology, general-purpose chat robot programs represented by ChatGPT show great application potential. ASR (Automatic Speech Recognition) technology aims to transcribe a segment of a speaker's audio into the corresponding text, and is one of the most important information interaction entry points of such a general-purpose chat robot system.
However, in many practical speech interaction scenario applications, there may be multiple interference factors in the speech to be recognized, such as inaccurate pronunciation, variable speech speed, dialect accent, background noise, far-field reverberation, etc., which present new challenges to the generalization and versatility of ASR models.
To further improve the performance of ASR models, one of the most effective approaches is to enrich the annotation data used in ASR model training. Typically, a training set of an ASR model that meets application requirements includes tens of thousands of hours (or even more) of annotated speech data, where each annotated sample consists of a segment of audio paired with the text of its content.
In the related art, such data are labeled mainly by hand.
Disclosure of Invention
The inventors of the present disclosure found that the above-described related art has the following problems: the labor cost of data annotation is high and the efficiency is low.
In view of this, the present disclosure proposes a data labeling technical solution, which can reduce labor cost of data labeling and improve efficiency of data labeling.
According to some embodiments of the present disclosure, there is provided a data labeling method, including: performing voice recognition on audio stream data of a video by using a voice recognition model to obtain a voice recognition result and the confidence of the voice recognition result; performing text recognition on the subtitle region of the video by using a text recognition model to obtain a text recognition result; fusing the voice recognition result and the text recognition result according to the confidence to determine a final recognition result; and labeling the audio stream data according to the final recognition result.
In some embodiments, fusing the speech recognition results with the text recognition results includes: and according to the degree and the confidence coefficient of the difference between the voice recognition result and the text recognition result, carrying out fusion processing on the voice recognition result and the text recognition result.
In some embodiments, fusing the speech recognition results with the text recognition results includes: determining a voice recognition result or a text recognition result as a candidate recognition result according to the confidence level and the degree of the difference; and correcting the candidate recognition results to determine a final recognition result.
In some embodiments, determining to use the speech recognition result or the text recognition result as a candidate recognition result comprises: in the case that the confidence is greater than the first confidence threshold and the degree of the difference is less than the difference threshold, determining that the speech recognition result or the text recognition result is used as a candidate recognition result according to the type of the difference.
In some embodiments, determining whether to take the speech recognition result or the text recognition result as a candidate recognition result comprises: taking the text recognition result as the candidate recognition result in the case that the number of insertion-class differences included in the differences is less than or equal to a number threshold, wherein an insertion-class difference characterizes a character that appears in the speech recognition result but not in the text recognition result.
In some embodiments, determining whether to take the speech recognition result or the text recognition result as a candidate recognition result comprises: taking the speech recognition result as the candidate recognition result in the case that the number of insertion-class differences included in the differences is greater than the number threshold and the confidence is greater than a second confidence threshold, wherein an insertion-class difference characterizes a character that appears in the speech recognition result but not in the text recognition result, and the second confidence threshold is greater than the first confidence threshold.
In some embodiments, determining the final recognition result includes: and determining the voice recognition result as a final recognition result in the case that the degree of difference is greater than or equal to a difference threshold and the confidence is greater than a second confidence threshold, wherein the second confidence threshold is greater than the first confidence threshold.
In some embodiments, correcting the candidate recognition results includes: establishing a corresponding relation between each of a plurality of voice recognition words in the voice recognition result and each of a plurality of text recognition words in the text recognition result; and correcting the candidate recognition result according to the corresponding relation.
In some embodiments, correcting the candidate recognition results includes: correcting the first voice recognition word by using the first text recognition word under the condition that the voice recognition result is taken as a candidate recognition result and the characters contained in the first voice recognition word and the first text recognition word with the corresponding relation are different; when the text recognition result is used as a candidate recognition result and the second speech recognition word having a correspondence relationship and the character included in the second text recognition word are different, the second text recognition word is corrected by using the second speech recognition word.
In some embodiments, correcting the candidate recognition results includes: correcting the candidate recognition result by using a shape-near word data sample under the condition that the voice recognition result is taken as the candidate recognition result, wherein the shape-near word data sample comprises the corresponding relation between characters with the similarity degree of the shapes exceeding a first threshold value; and correcting the candidate recognition result by using a near-pronunciation data sample when the text recognition result is taken as the candidate recognition result, wherein the near-pronunciation data sample comprises the corresponding relation between characters with the pronunciation similarity exceeding a second threshold value.
In some embodiments, the degree of difference includes ratio information of a portion of the text recognition result different from the speech recognition result to the text recognition result.
In some embodiments, the video includes a plurality of image frames, each of the plurality of image frames having at least one caption area, and text identifying the caption area of the video to obtain a text identification result includes: determining an association relationship between caption areas belonging to different image frames according to position information of at least one caption area of each of the plurality of image frames; fusion processing is carried out on text recognition result fragments in the caption area with the association relation; and acquiring a text recognition result according to the fusion processing result.
In some embodiments, determining the association between at least one caption region possessed by each of the plurality of image frames includes: and determining the association relation according to the intersection ratio between the caption areas belonging to different image frames.
In some embodiments, speech recognition of audio stream data of video includes: performing voice recognition on an audio fragment containing voice in the audio stream data; text recognition of a subtitle region of a video includes: determining a video clip in the image stream data associated with the audio clip according to the audio sampling rate of the audio stream data and the image sampling rate of the image stream data of the video; text recognition is performed on the subtitle region of the video clip.
In some embodiments, the annotated audio stream data is used to train a speech recognition model.
According to further embodiments of the present disclosure, there is provided a voice recognition method, including: performing voice recognition on audio to be recognized by using a voice recognition model, wherein the voice recognition model is trained using audio stream data, and the audio stream data is labeled by the data labeling method of any one of the above embodiments.
According to still further embodiments of the present disclosure, there is provided a data tagging device including: the first recognition unit is used for carrying out voice recognition on the audio stream data of the video by utilizing the voice recognition model so as to acquire a voice recognition result and the confidence coefficient of the voice recognition result; the second recognition unit is used for carrying out text recognition on the caption area of the video by utilizing the text recognition model so as to obtain a text recognition result; the determining unit is used for carrying out fusion processing on the voice recognition result and the text recognition result according to the confidence coefficient so as to determine a final recognition result; and the labeling unit is used for labeling the audio stream data according to the final recognition result.
According to still further embodiments of the present disclosure, there is provided a voice recognition apparatus including: the acquisition unit is used for acquiring the audio to be identified; the recognition unit is used for carrying out voice recognition on the audio to be recognized by utilizing a voice recognition model, the voice recognition model is trained by utilizing audio stream data, and the audio stream data is marked by the data marking method in any one embodiment.
According to still further embodiments of the present disclosure, there is provided an electronic device including: a memory; and a processor coupled to the memory, the processor configured to perform the data tagging method or the speech recognition method in any of the above embodiments based on instructions stored in the memory device.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data labeling method or the speech recognition method in any of the above embodiments.
In the above embodiment, the text recognition model and the voice recognition model that are pre-trained are used to respectively recognize the subtitle and the audio stream data in the video, and the two recognition results are fused according to the confidence level, so as to realize the annotation of the audio stream data. Therefore, the data can be automatically marked, the labor cost is reduced, and the marking efficiency is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flow chart of some embodiments of a data tagging method of the present disclosure;
FIG. 2a illustrates a schematic diagram of some embodiments of a data annotation method of the present disclosure;
FIG. 2b shows a schematic diagram of further embodiments of the data annotation method of the present disclosure;
FIG. 2c illustrates a flow chart of further embodiments of the data annotation method of the present disclosure;
FIG. 3 illustrates a flow chart of yet other embodiments of the data tagging method of the present disclosure;
FIG. 4a illustrates a block diagram of some embodiments of a data tagging device of the present disclosure;
FIG. 4b illustrates a block diagram of some embodiments of a speech recognition apparatus of the present disclosure;
FIG. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure;
fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
As previously described, manually annotated data can be collected and the ASR model trained from scratch.
This scheme obtains tens of thousands of hours (or even more) of annotated speech data by collection and manual annotation, and trains a model from scratch. However, large-scale annotation of speech data is very costly, because during manual annotation each piece of audio needs to be listened to at least once.
The distillation training may be performed on the ASR model to be trained using an open source pre-trained ASR model.
The scheme utilizes an open-source pre-training ASR model to carry out knowledge distillation, namely the open-source ASR model is used as a teacher model to guide the training of the ASR model to be trained. However, the main disadvantage of this type of technology is that the final model performance often depends on the design experience of the knowledge distillation process, increasing development and training costs; moreover, the performance of the trained model is often worse than that of an open-source 'teacher' model, and the practical application requirements are difficult to meet.
The ASR model may also be trained with the assistance of a pre-trained OCR (Optical Character Recognition) model for acquiring video speech annotation data.
This scheme exploits the rich audio resources of public video websites and performs a consistency check between the subtitle recognition result of the pre-trained OCR model and the audio recognition result of the pre-trained ASR model to automatically obtain a large number of audio labels. However, the consistency check mainly concerns the alignment of text and audio content and ignores how to fuse the results of the pre-trained OCR and ASR models, so the data acquisition range and accuracy of this scheme are limited. Moreover, this technique cannot be applied to video resources without subtitles.
In view of the above technical problems, video data has advantages such as rich scene content and a large user base, and makes it possible to collect large-scale speech data. Therefore, in order to obtain more accurate ASR model training data efficiently and at low cost, the present disclosure proposes a video speech automatic labeling method that fuses pre-trained OCR and ASR models.
For example, the ASR model can be used to compensate for the difficulty the OCR model has in distinguishing shape-near words, and the OCR model can be used to compensate for the difficulty the ASR model has in distinguishing sound-near words; the recognition results of the OCR and ASR models are aligned and fused, thereby improving the accuracy of automatic voice labeling.
To exploit the respective recognition strengths of the pre-trained OCR and ASR models and obtain ASR model training data accurately, efficiently and at low cost, a video speech automatic labeling method fusing the pre-trained OCR and ASR models is provided.
For example, the pre-trained OCR and ASR models may be used to recognize, respectively, the captions in the image frames and the spoken text in the audio; confidence filtering and alignment are then performed on the recognition results of the two models to reduce misrecognized samples. Combined with the shape-near-word and sound-near-word lists (data samples), the OCR and ASR results are fused to improve the accuracy of the automatic video speech labeling result.
For example, the technical solution of the present disclosure may be implemented by the following embodiments.
FIG. 1 illustrates a flow chart of some embodiments of a data tagging method of the present disclosure.
As shown in fig. 1, in step 110, voice recognition is performed on audio stream data of a video using a voice recognition model to obtain a voice recognition result and a confidence of the voice recognition result.
In some embodiments, video clips in the image stream data associated with audio clips in the audio stream data that contain speech are determined based on an audio sample rate of the audio stream data and an image sample rate of the image stream data of the video.
For example, each video (e.g., in mp4 format) is separated into image stream data I_{1:M} and audio stream data I_{1:N}, where the positive integer M is the image sampling rate (e.g., the number of sampled frames) in the video and the positive integer N is the audio sampling rate (e.g., the number of sampling points) in the video. The audio frame associated with each image frame can be determined from the ratio of the audio sampling rate to the image sampling rate.
For example, when the image sampling rate differs from the audio sampling rate in the video, if the image sampling rate is 25 frames/second and the audio sampling rate is 16000 Hz (the audio sampling rates can be unified in advance), the number of audio sampling points corresponding to each image frame is 16000/25 = 640; for the k-th image frame (k ∈ {1, 2, …, 25}) of each second of the image stream data, its associated audio frame may include the (k×640)-th to ((k+1)×640)-th audio sampling points.
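As a concrete illustration of this mapping, the following minimal Python sketch (not part of the disclosure; the 25 fps and 16 kHz figures are simply the example values above) computes the audio sampling points associated with a given image frame:

```python
# Minimal sketch of the image-frame / audio-sample alignment described above.
# Assumes the example values from the text: 25 frames/second video, 16000 Hz audio.

def audio_samples_for_frame(frame_index: int,
                            image_fps: int = 25,
                            audio_rate: int = 16000) -> range:
    """Return the indices of the audio sampling points associated with one image frame."""
    samples_per_frame = audio_rate // image_fps   # 16000 / 25 = 640
    start = frame_index * samples_per_frame
    return range(start, start + samples_per_frame)

# The 3rd image frame of a second maps to sampling points 1920..2559.
frame_samples = audio_samples_for_frame(3)
print(frame_samples[0], frame_samples[-1])   # 1920 2559
```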
In some embodiments, voice detection may be performed on the audio stream data to determine the audio clip containing the voice.
For example, a VAD (Voice Activity Detection) model (e.g., the open-source webrtcvad tool or a pre-trained VAD model) is used to perform voice detection and cutting on the audio stream data I_{1:N}.
For example, the entire audio stream data may be cut, according to audio pauses (audio segments that do not contain speech), into audio segments that contain spoken content (i.e., speech); based on the timestamps of the obtained speech-containing audio segments, the image stream data I_{1:M} is cut to obtain the video clips corresponding to the speech-containing audio segments.
For example, the start-stop time of a video clip may be determined using the start-stop time of an audio clip containing speech according to the association of an image frame with an audio frame.
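The VAD-based cutting described above could be sketched roughly as follows. This is only an illustration built on the open-source webrtcvad tool mentioned in the text; it assumes 16 kHz, 16-bit mono PCM audio, and the segment bookkeeping is simplified rather than the disclosure's exact procedure.

```python
# Rough sketch of VAD-based cutting of the audio stream into speech segments,
# using the open-source webrtcvad tool mentioned above. Assumes 16 kHz, 16-bit
# mono PCM audio; the pause-merging details of the disclosure are simplified.
import webrtcvad

def speech_segments(pcm: bytes, sample_rate: int = 16000,
                    frame_ms: int = 30, aggressiveness: int = 2):
    """Return (start_sec, end_sec) spans of audio judged to contain speech."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = sample_rate * frame_ms // 1000 * 2   # 2 bytes per 16-bit sample
    spans, start = [], None
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = offset / (2 * sample_rate)
        if vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate):
            if start is None:
                start = t
        elif start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(pcm) / (2 * sample_rate)))
    return spans

# The same (start, end) timestamps can then be used to cut the image stream,
# e.g. frames int(start * fps) .. int(end * fps) form the clip for each segment.
```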
In some embodiments, speech recognition is performed on audio clips that contain speech in the audio stream data.
For example, speech recognition is performed on the audio segment using a pre-trained ASR model to obtain a speech recognition result T_a and its confidence S_a.
In step 120, text recognition is performed on the subtitle region of the video using the text recognition model to obtain a text recognition result. For example, text recognition is performed on a subtitle region of a video clip.
In some embodiments, the video includes a plurality of image frames, each of the plurality of image frames having at least one subtitle region. Determining an association relationship between caption areas belonging to different image frames according to position information of at least one caption area of each of the plurality of image frames; fusion processing is carried out on text recognition result fragments in the caption area with the association relation; and acquiring a text recognition result according to the fusion processing result.
For example, the association relationship is determined according to the intersection-over-union (IoU) between subtitle regions belonging to different image frames.
In some embodiments, the text recognition result fragments may be fused by the embodiment of fig. 2 a.
Fig. 2a shows a schematic diagram of some embodiments of the data labeling method of the present disclosure.
As shown in fig. 2a, for text recognition results using pre-trained OCR models, text may appear at different locations on each image frame; the text in the same location may or may not be the same for different image frames.
For example, image frames 1, 2, and 3 may be adjacent frames that together constitute a video segment containing speech; the caption area a1 in the image frame 1, the caption area a2 in the image frame 2 and the caption area a3 in the image frame 3 are positioned at the same position and belong to the same caption area a, namely have an association relation; the caption area b1 in the image frame 1 and the caption area b2 in the image frame 2 are positioned at the same position and belong to the same caption area b, namely have an association relation; the caption area c1 in the image frame 1, the caption area c2 in the image frame 2 and the caption area c3 in the image frame 3 are positioned at the same position, and belong to the same caption area c, namely have an association relationship.
For example, the text content in the caption area a1 is the same as the text content in the caption area a2, and is different from the text content in the caption area a 3; the caption area b1 is different from the text content in the caption area b 2; the caption area c1 is identical to the text content in the caption area c2, and is different from the text content in the caption area c 3.
For example, considering that the OCR detection boxes (i.e., subtitle regions) of the same text may differ slightly across image frames, the IoU (Intersection over Union) of the detection boxes in two adjacent frames is used to determine whether they are the same detection box: if IoU > 0.9, they are judged to be the same detection box; otherwise, they are different detection boxes.
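A small sketch of this IoU check between caption detection boxes is given below; the (x1, y1, x2, y2) box format is an assumption for illustration, and the 0.9 threshold is the example value from the text.

```python
# Sketch of the IoU test used to decide whether caption detection boxes in two
# adjacent image frames are the same box. Boxes are assumed to be pixel
# coordinates (x1, y1, x2, y2); the 0.9 threshold is the example value above.

def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def same_caption_box(box_a, box_b, threshold: float = 0.9) -> bool:
    """True if the two detection boxes are treated as the same subtitle region."""
    return iou(box_a, box_b) > threshold
```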
In some embodiments, the texts belonging to the same subtitle region in all the image frames are fused (e.g., merged or concatenated) to obtain multiple text sequences T_{o,1}, T_{o,2}, …, T_{o,M} for each image data stream (where M is the number of text sequences).
For example, if the text contents (i.e., text recognition result fragments) in the caption areas a1 and a2 are the same, they are merged into one text content 1; if the text contents in the caption areas a1 and a3 are different, text content 1 and text content 5 are concatenated into a text sequence T_{o,1}.
For example, if the text contents in the subtitle regions b1 and b2 are different, text content 2 and text content 4 are concatenated into a text sequence T_{o,2}.
For example, if the text contents in the caption areas c1 and c2 are the same, they are merged into one text content 3; if the text contents in the caption areas c1 and c3 are different, text content 3 and text content 7 are concatenated into a text sequence T_{o,3}.
Thus, the following text sequences are obtained: a text sequence T_{o,1} consisting of text content 1 + text content 5, a text sequence T_{o,2} consisting of text content 2 + text content 4, and a text sequence T_{o,3} consisting of text content 3 + text content 7.
In some embodiments, text recognition is performed on each image frame in the image stream data, and text recognition result fragments of each subtitle region in each image frame are obtained; fusing text recognition result fragments belonging to the same subtitle region but not belonging to the same image frame into text sequences corresponding to the subtitle regions; and determining the text recognition result from each text sequence according to the similarity between each text sequence and the voice recognition result.
For example, the text sequences T_{o,1}, T_{o,2}, …, T_{o,M} recognized by the OCR model are each compared with the text T_a recognized by the ASR model to compute a similarity, and the text sequence T_o with the highest similarity is determined as the text recognition result.
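As a hedged illustration of this selection step, the snippet below picks the OCR text sequence most similar to the ASR transcript; difflib's ratio is used as a stand-in similarity measure, since the disclosure does not prescribe a particular metric.

```python
# Illustrative selection of the text recognition result: among the OCR text
# sequences T_o,1 ... T_o,M, keep the one most similar to the ASR text T_a.
# difflib's ratio is an assumed stand-in for the similarity measure.
import difflib

def pick_text_recognition_result(ocr_sequences: list[str], asr_text: str) -> str:
    return max(ocr_sequences,
               key=lambda seq: difflib.SequenceMatcher(None, seq, asr_text).ratio())
```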
In step 130, the speech recognition result and the text recognition result are fused according to the confidence level, so as to determine a final recognition result.
In some embodiments, a correspondence is established between each of the plurality of speech recognition words in the speech recognition result and each of the plurality of text recognition words in the text recognition result.
For example, a correspondence of each of the plurality of speech recognition words to each of the plurality of text recognition words is determined based on whether an object corresponding to each of the plurality of speech recognition words in the speech recognition result is related to an object corresponding to each of the plurality of text recognition words in the text recognition result.
In some embodiments, the correspondence and type of difference between the speech recognition word and the corresponding text recognition word may be determined according to the embodiment in fig. 2 b.
Fig. 2b shows a schematic diagram of further embodiments of the data annotation method of the present disclosure.
As shown in fig. 2b, the object corresponding to the text recognition word "Zhang Shan" is a person's name, and the object corresponding to the speech recognition word "Zhang San" is also a person's name, so there is a correspondence between the two; the object corresponding to the text recognition word "today" is a certain time, and the object corresponding to the speech recognition word "present" is also a certain time, so there is a correspondence between the two; and so on.
Each word in the text recognition result and the speech recognition result may be "aligned" according to the correspondence, and the result is shown in fig. 2 b.
As shown in fig. 2b, the characters contained in the corresponding words "Zhang Shan" and "Zhang San" are different, so the type of this difference can be determined to be a replacement difference type (i.e., a replacement error, whose count is S); where corresponding words contain different numbers of characters and the text recognition word has more characters than the speech recognition word, the type of the difference can be determined to be a deletion difference type (i.e., a deletion error, whose count is D); the characters contained in the corresponding words "too" and "large" are different, so the type of this difference is likewise a replacement difference type; where corresponding words contain different numbers of characters and the text recognition word has fewer characters than the speech recognition word, the type of the difference can be determined to be an insertion difference type (i.e., an insertion error, whose count is I).
In some embodiments, the degree of difference includes ratio information of a portion of the text recognition result different from the speech recognition result to the text recognition result. For example, the degree of the difference may be determined based on the proportion r of the number of characters having the difference (all types of differences) to the total number of characters of the text recognition result or the voice recognition result.
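The character-level alignment statistics discussed above (replacement count S, deletion count D, insertion count I, and the ratio r) could be computed along the following lines; difflib is used here as an assumed stand-in for the word alignment of fig. 2b, and the exact normalization of r follows the example description rather than a prescribed formula.

```python
# Illustrative computation of replacement (S), deletion (D) and insertion (I)
# differences and the difference ratio r, by aligning the OCR text against the
# ASR text at character level. difflib is an assumed stand-in for the alignment.
import difflib

def alignment_stats(ocr_text: str, asr_text: str):
    s = d = i = 0
    for op, o1, o2, a1, a2 in difflib.SequenceMatcher(None, ocr_text,
                                                      asr_text).get_opcodes():
        if op == "replace":      # corresponding characters differ
            s += max(o2 - o1, a2 - a1)
        elif op == "delete":     # characters of the OCR text missing from the ASR text
            d += o2 - o1
        elif op == "insert":     # characters of the ASR text absent from the OCR text
            i += a2 - a1
    r = (s + d + i) / max(len(ocr_text), 1)   # proportion of differing characters
    return s, d, i, r
```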
In some embodiments, when the difference ratio r is greater than the difference threshold p, the text recognition result of the OCR model differs significantly from the speech recognition result of the ASR model. In this case, if the confidence of the speech recognition result is higher than the first confidence threshold p_1, it can be determined that the text recognition result of the OCR model is invalid, and the speech recognition result T_a with confidence S_a is used as the final recognition result.
Therefore, when the voice recognition result is reliable and the text recognition result and the voice recognition result have large differences, the voice recognition result is adopted for marking, and the accuracy of data marking can be improved.
In some embodiments, in the case of a replacement error involving the sound-near words "so" and "three", the text recognition result of the OCR model is correct and the speech recognition result of the ASR model is wrong; in the case of a replacement error involving the shape-near words "too" and "big", the speech recognition result of the ASR model is correct and the text recognition result of the OCR model is wrong; in the case of a deletion error, the text recognition result of the OCR model is correct and the speech recognition result of the ASR model is wrong; in the case of an insertion error, since the text recognition result of the OCR model for the subtitle region may omit function words such as modal particles, the speech recognition result of the ASR model is more likely to be correct.
Based on the above, it may not be possible to determine whether to use text recognition results or speech recognition results to determine the final recognition results. In order to further improve the accuracy of video voice annotation, a fusion scheme of model recognition results is provided. For example, the following examples can be used for fusion.
In some embodiments, the speech recognition result and the text recognition result are fused according to the degree of difference and the confidence level between the speech recognition result and the text recognition result.
In some embodiments, determining a speech recognition result or a text recognition result as a candidate recognition result based on the confidence and the degree of the difference; and correcting the candidate recognition results to determine a final recognition result.
For example, the fusion process may be performed by the embodiment of fig. 2 c.
FIG. 2c illustrates a flow chart of further embodiments of the data annotation method of the present disclosure.
As shown in fig. 2c, in step 210, when the confidence S_a of the speech recognition result of the ASR model is relatively low (i.e., S_a is less than p_1, for example p_1 = 0.4), the speech is judged to be background noise or the speech recognition result is judged to be unreliable, and the sample (the audio clip and its corresponding video clip) is deleted; otherwise, step 220 is performed.
In step 220, it is determined whether the difference ratio r between the speech recognition result of the ASR model and the text recognition result of the OCR model is greater than p (e.g., p = 0.8). When r ≥ p, the text recognition result of the OCR model differs greatly from the speech recognition result of the ASR model, and the text recognition result of the OCR model is determined to be invalid (it may be blank, i.e., there is no subtitle or no subtitle can be recognized).
In some embodiments, when the confidence is greater than the first confidence threshold and the degree of difference is less than the difference threshold, whether to use the speech recognition result or the text recognition result as the candidate recognition result is determined according to the type of the difference.
For example, when r ≥ p, only the confidence S_a of T_a is used as the basis for determining the final recognition result, and step 230 is performed; otherwise (r < p), step 240 is performed.
In step 230, it is determined whether the confidence S_a of the speech recognition result of the ASR model is greater than a second confidence threshold p_2 (e.g., p_2 = 0.8).
In some embodiments, the speech recognition result is determined to be the final recognition result when the degree of difference is greater than or equal to the difference threshold and the confidence is greater than the second confidence threshold, where the second confidence threshold is greater than the first confidence threshold.
For example, if S_a > p_2, the speech recognition result T_a of the ASR model is used as the final recognition result; otherwise, T_a is judged to be background noise or the model result is judged to be unreliable, and the sample is deleted.
In step 240, when the number of insertion-class differences I = 0, the text recognition result T_o of the OCR model is adopted as the candidate recognition result; otherwise, step 250 is performed.
For example, in the case where the number of insertion class differences included in the differences is less than or equal to the number threshold, the text recognition result is used as a candidate recognition result, and the insertion class differences are used to characterize that the speech recognition result includes characters that are not present in the text recognition result.
In step 250, it is determined whether the confidence S_a of the speech recognition result of the ASR model is greater than p_2. If S_a > p_2, the speech recognition result T_a of the ASR model is used as the candidate recognition result; otherwise, T_a is judged to be background noise or the model result is judged to be unreliable, and this sample is deleted.
In some embodiments, where the difference includes an insertion class difference having a number greater than a number threshold and a confidence greater than a second confidence threshold, the speech recognition result is used as a candidate recognition result, the insertion class difference is used to characterize the speech recognition result as including characters not present in the text recognition result, the second confidence threshold being greater than the first confidence threshold.
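Pulling the above rules together, a minimal sketch of the candidate-selection flow (steps 210 to 250 of fig. 2c above) might look as follows; the thresholds p1 = 0.4, p2 = 0.8 and p = 0.8 are the example values from the text, and the string return values are purely illustrative labels.

```python
# Hedged sketch of the candidate-selection rules of fig. 2c (steps 210 to 250).
# Thresholds p1 = 0.4, p2 = 0.8 and p = 0.8 are the example values in the text.
# Returns "asr", "ocr", or None (sample discarded); these labels are illustrative.

def choose_candidate(conf_asr: float, diff_ratio: float, num_insertions: int,
                     p1: float = 0.4, p2: float = 0.8, p: float = 0.8):
    if conf_asr < p1:                 # step 210: background noise / unreliable ASR result
        return None
    if diff_ratio >= p:               # steps 220/230: OCR result invalid, rely on ASR alone
        return "asr" if conf_asr > p2 else None
    if num_insertions == 0:           # step 240: no insertion differences, trust the OCR text
        return "ocr"
    return "asr" if conf_asr > p2 else None   # step 250
```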
In step 260, the replacement errors in the candidate recognition result are processed using the pre-constructed shape-near-word list and sound-near-word list, thereby fusing the recognition results of the OCR model and the ASR model.
In some embodiments, the candidate recognition results are corrected based on correspondence between each of the plurality of speech recognition words in the speech recognition results and each of the plurality of text recognition words in the text recognition results.
In some embodiments, in a case where the speech recognition result is taken as a candidate recognition result and the first speech recognition word and the first text recognition word having the correspondence relationship contain different characters, correcting the first speech recognition word with the first text recognition word; when the text recognition result is used as a candidate recognition result and the second speech recognition word having a correspondence relationship and the character included in the second text recognition word are different, the second text recognition word is corrected by using the second speech recognition word.
In some embodiments, in the case of using the speech recognition result as a candidate recognition result, correcting the candidate recognition result using a shape-near-word data sample including a correspondence between characters whose degree of similarity of the shape exceeds a first threshold; and correcting the candidate recognition result by using a near-pronunciation data sample when the text recognition result is taken as the candidate recognition result, wherein the near-pronunciation data sample comprises the corresponding relation between characters with the pronunciation similarity exceeding a second threshold value.
For example, when T_a is the candidate recognition result, the recognition results in T_a that correspond to replacement errors in the alignment are replaced, using the shape-near-word list, with the corresponding recognition results of the aligned T_o; when T_o is the candidate recognition result, the recognition results in T_o that correspond to replacement errors in the alignment are replaced, using the sound-near-word list, with the corresponding recognition results of the aligned T_a, so as to obtain the final recognition result.
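A possible sketch of this replacement-error correction is given below; the {character: set of confusable characters} dictionary format, the list of substitution positions and the helper name are assumptions for illustration, not the disclosure's data structures.

```python
# Sketch of step 260: correcting replacement errors in the candidate result with
# a pre-built confusion list. The dict format {char: {confusable chars}} and the
# (candidate_index, other_index) substitution positions are assumptions.

def correct_substitutions(candidate: str, aligned_other: str,
                          substitution_positions: list[tuple[int, int]],
                          confusion_list: dict[str, set[str]]) -> str:
    """Replace a substituted character in the candidate with the aligned character
    from the other model's result when the pair appears in the confusion list."""
    chars = list(candidate)
    for cand_idx, other_idx in substitution_positions:
        a, b = chars[cand_idx], aligned_other[other_idx]
        if b in confusion_list.get(a, set()):
            chars[cand_idx] = b
    return "".join(chars)
```

Per the embodiment above, the shape-near-word list and the aligned T_o would be supplied when T_a is the candidate, and the sound-near-word list and the aligned T_a when T_o is the candidate.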
After the final identification result is obtained, the labeling can be continued through the remaining steps in fig. 1.
In step 140, the audio stream data is annotated according to the final recognition result.
In some embodiments, the annotated audio stream data is used to train a speech recognition model.
FIG. 3 illustrates a flow chart of yet other embodiments of the data tagging method of the present disclosure.
As shown in fig. 3, in step 310, preprocessing of video data is performed. For example, each video is separated into image stream data and audio stream data.
In step 320, a data stream cut is made. For example, using a VAD model, cutting the audio data stream to obtain audio containing portions of speech content; and cutting the image data stream by utilizing the time stamp of the audio cutting to obtain the corresponding image stream and audio stream fragments.
In step 330, the model recognition results are filtered and aligned. For example, subtitles in the image frames and speech in the audio frames are recognized using the pre-trained OCR and ASR models, respectively; confidence filtering and alignment are then performed on the recognition results of the two models to reduce misrecognized samples.
In step 340, the model identification results are fused. For example, the results of the OCR model and the ASR model are fused by constructing a list of 'shape near words' and 'sound near words', so as to obtain the automatic video voice labeling result.
In the above embodiment, a video voice automatic labeling method integrating a pre-training OCR model and an ASR model is provided. Respectively utilizing the pre-trained OCR model and the pre-trained ASR model to respectively recognize the caption in the image frame and the voice in the audio frame; confidence filtering and alignment are carried out on the identification of the two models, so that error identification samples are reduced; combining the 'shape near word' and the 'sound near word' list, fusing the results of OCR and ASR, and improving the accuracy of the automatic labeling result of video and voice.
In some embodiments, the audio to be recognized is speech-recognized using a speech recognition model, which is trained using audio stream data labeled by the data labeling method in any one of the above embodiments.
Fig. 4a illustrates a block diagram of some embodiments of a data tagging device of the present disclosure.
As shown in fig. 4a, the data marking apparatus 4a includes: a first recognition unit 41a for performing voice recognition on the audio stream data of the video using a voice recognition model to obtain a voice recognition result and a confidence level of the voice recognition result; a second recognition unit 42a, configured to perform text recognition on the subtitle region of the video using the text recognition model, so as to obtain a text recognition result; a determining unit 43a, configured to perform fusion processing on the speech recognition result and the text recognition result according to the confidence level, so as to determine a final recognition result; and the labeling unit 44a is configured to label the audio stream data according to the final recognition result.
In some embodiments, the determining unit 43a performs fusion processing on the speech recognition result and the text recognition result according to the degree of difference and the confidence between the speech recognition result and the text recognition result.
In some embodiments, the determining unit 43a determines the speech recognition result or the text recognition result as a candidate recognition result according to the confidence and the degree of the difference; and correcting the candidate recognition results to determine a final recognition result.
In some embodiments, the determining unit 43a determines, in a case where the confidence is greater than the first confidence threshold and the degree of difference is less than the difference threshold, whether to use the speech recognition result or the text recognition result as the candidate recognition result according to the type of difference.
In some embodiments, the determining unit 43a uses the text recognition result as a candidate recognition result in a case where the number of insertion class differences included in the differences is less than or equal to the number threshold, the insertion class differences being used to characterize that characters not present in the text recognition result are included in the speech recognition result.
In some embodiments, the determining unit 43a uses the speech recognition result as the candidate recognition result in a case where the number of insertion class differences included in the differences is greater than a number threshold and the confidence is greater than a second confidence threshold, the second confidence threshold being greater than the first confidence threshold, the insertion class differences being used to characterize characters included in the speech recognition result that are not present in the text recognition result.
In some embodiments, the determining unit 43a determines the speech recognition result as the final recognition result in the case where the degree of difference is greater than or equal to the difference threshold and the confidence is greater than a second confidence threshold, the second confidence threshold being greater than the first confidence threshold.
In some embodiments, the determining unit 43a establishes a correspondence between each of a plurality of speech recognition words in the speech recognition result and each of a plurality of text recognition words in the text recognition result; and correcting the candidate recognition result according to the corresponding relation.
In some embodiments, the determining unit 43a corrects the first speech recognition word with the first text recognition word in a case where the speech recognition result is taken as a candidate recognition result and the first speech recognition word having a correspondence relationship and the first text recognition word contain different characters; when the text recognition result is used as a candidate recognition result and the second speech recognition word having a correspondence relationship and the character included in the second text recognition word are different, the second text recognition word is corrected by using the second speech recognition word.
In some embodiments, the determining unit 43a corrects the candidate recognition result using a form-near word data sample including a correspondence relationship between characters whose degree of similarity of the shape exceeds a first threshold value, in the case where the speech recognition result is taken as the candidate recognition result; and correcting the candidate recognition result by using a near-pronunciation data sample when the text recognition result is taken as the candidate recognition result, wherein the near-pronunciation data sample comprises the corresponding relation between characters with the pronunciation similarity exceeding a second threshold value.
In some embodiments, the degree of difference includes ratio information of a portion of the text recognition result different from the speech recognition result to the text recognition result.
In some embodiments, the video includes a plurality of image frames, each of the plurality of image frames having at least one subtitle region. The second identifying unit 42a determines an association relationship between caption areas belonging to different image frames based on the position information of at least one caption area that each of the plurality of image frames has; fusion processing is carried out on text recognition result fragments in the caption area with the association relation; and acquiring a text recognition result according to the fusion processing result.
In some embodiments, the second recognition unit 42a determines the association relationship according to the cross-ratio between subtitle regions belonging to different image frames.
In some embodiments, the first recognition unit 41a performs speech recognition on an audio piece containing speech in the audio stream data; the second identifying unit 42a determines a video clip in the image stream data associated with the audio clip based on the audio sampling rate of the audio stream data and the image sampling rate of the image stream data of the video; text recognition is performed on the subtitle region of the video clip.
In some embodiments, the annotated audio stream data is used to train a speech recognition model.
Fig. 4b shows a block diagram of some embodiments of a speech recognition apparatus of the present disclosure.
As shown in fig. 4b, the voice recognition apparatus 4b includes: an acquisition unit 41b for acquiring audio to be recognized; the recognition unit 42b is configured to perform speech recognition on the audio to be recognized using a speech recognition model, where the speech recognition model is trained using audio stream data, and the audio stream data is labeled by the data labeling method in any of the above embodiments.
Fig. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure.
As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to perform the data tagging method or the speech recognition method in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, boot Loader, database, and other programs.
Fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
As shown in fig. 6, the electronic device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the data tagging method or the speech recognition method of any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed nonvolatile storage media, and the like. The system memory stores, for example, an operating system, application programs, boot Loader, and other programs.
The electronic device 6 may also include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650 and the memory 610 and processor 620 may be connected by, for example, a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, a microphone, and a speaker. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media including, but not limited to, disk storage, CD-ROM, optical storage, and the like, having computer-usable program code embodied therein.
Heretofore, a data labeling method, a data labeling apparatus, a voice recognition method, a voice recognition apparatus, an electronic device, and a non-volatile computer-readable storage medium according to the present disclosure have been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
The methods and systems of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (20)

1. A data labeling method, comprising:
performing speech recognition on audio stream data of a video by using a speech recognition model, to obtain a speech recognition result and a confidence of the speech recognition result;
performing text recognition on a subtitle region of the video by using a text recognition model, to obtain a text recognition result;
fusing the speech recognition result with the text recognition result according to the confidence, to determine a final recognition result; and
labeling the audio stream data according to the final recognition result.
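The flow recited in claim 1 can be pictured with the following minimal Python sketch. It is purely illustrative: the LabeledSample type and the asr, ocr, and fuse callables are assumed placeholders, not elements defined by the claims or by any particular library.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class LabeledSample:
    audio: bytes   # raw audio stream data of the video
    label: str     # final recognition result used as the annotation

def label_audio(audio: bytes,
                frames: List[object],
                asr: Callable[[bytes], Tuple[str, float]],
                ocr: Callable[[List[object]], str],
                fuse: Callable[[str, str, float], str]) -> LabeledSample:
    # Speech recognition on the audio stream: result plus its confidence.
    asr_text, confidence = asr(audio)
    # Text recognition on the subtitle regions of the video frames.
    ocr_text = ocr(frames)
    # Confidence-based fusion of the two results (detailed in claims 2 to 11).
    final_text = fuse(asr_text, ocr_text, confidence)
    # Label the audio stream data with the final recognition result.
    return LabeledSample(audio=audio, label=final_text)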
2. The data labeling method according to claim 1, wherein the fusing the speech recognition result with the text recognition result comprises:
fusing the speech recognition result with the text recognition result according to the confidence and a degree of difference between the speech recognition result and the text recognition result.
3. The data labeling method according to claim 2, wherein the fusing the speech recognition result with the text recognition result comprises:
determining the speech recognition result or the text recognition result as a candidate recognition result according to the confidence and the degree of difference; and
correcting the candidate recognition result to determine the final recognition result.
4. The data labeling method according to claim 3, wherein the determining the speech recognition result or the text recognition result as the candidate recognition result comprises:
in a case where the confidence is greater than a first confidence threshold and the degree of difference is less than a difference threshold, determining the speech recognition result or the text recognition result as the candidate recognition result according to a type of the difference.
5. The data labeling method according to claim 4, wherein the determining the speech recognition result or the text recognition result as the candidate recognition result comprises:
taking the text recognition result as the candidate recognition result in a case where the number of insertion-type differences included in the differences is less than or equal to a number threshold, wherein an insertion-type difference indicates a character that is present in the speech recognition result but absent from the text recognition result.
6. The data labeling method according to claim 4, wherein the determining the speech recognition result or the text recognition result as the candidate recognition result comprises:
taking the speech recognition result as the candidate recognition result in a case where the number of insertion-type differences included in the differences is greater than a number threshold and the confidence is greater than a second confidence threshold, wherein an insertion-type difference indicates a character that is present in the speech recognition result but absent from the text recognition result, and the second confidence threshold is greater than the first confidence threshold.
7. The data labeling method according to claim 4, wherein the determining the final recognition result comprises:
determining the speech recognition result as the final recognition result in a case where the degree of difference is greater than or equal to the difference threshold and the confidence is greater than a second confidence threshold, the second confidence threshold being greater than the first confidence threshold.
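Claims 4 to 7 amount to a small decision table over the confidence, the degree of difference, and the number of insertion-type differences. The sketch below is one possible reading; every threshold value in it is an assumed example, not a value taken from the disclosure.

from typing import Optional

def pick_candidate(asr_text: str, ocr_text: str, confidence: float,
                   diff_degree: float, n_insertions: int,
                   conf_thr_1: float = 0.80, conf_thr_2: float = 0.95,
                   diff_thr: float = 0.30, count_thr: int = 2) -> Optional[str]:
    # Claims 4-6: confidence above the first threshold and difference below the threshold.
    if confidence > conf_thr_1 and diff_degree < diff_thr:
        if n_insertions <= count_thr:
            return ocr_text          # claim 5: few insertion-type differences, trust the OCR text
        if confidence > conf_thr_2:
            return asr_text          # claim 6: many insertions but very confident ASR
    # Claim 7: the results diverge strongly but the ASR confidence is very high.
    elif diff_degree >= diff_thr and confidence > conf_thr_2:
        return asr_text
    return None                      # otherwise no candidate is selected by these rules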
8. The data labeling method according to claim 3, wherein the correcting the candidate recognition result comprises:
establishing a correspondence between each of a plurality of speech recognition words in the speech recognition result and a respective one of a plurality of text recognition words in the text recognition result; and
correcting the candidate recognition result according to the correspondence.
9. The data labeling method according to claim 8, wherein the correcting the candidate recognition result comprises:
in a case where the speech recognition result is taken as the candidate recognition result and a first speech recognition word and a first text recognition word having the correspondence contain different characters, correcting the first speech recognition word by using the first text recognition word; and
in a case where the text recognition result is taken as the candidate recognition result and a second speech recognition word and a second text recognition word having the correspondence contain different characters, correcting the second text recognition word by using the second speech recognition word.
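The word-level correspondence of claims 8 and 9 can be obtained with an ordinary sequence alignment. The following sketch uses Python's standard difflib for the alignment; it is only an assumed realization, and it corrects a word in the candidate only where the aligned reference word differs from it.

import difflib
from typing import List

def correct_by_alignment(candidate: List[str], reference: List[str]) -> List[str]:
    # Align candidate words to reference words and replace corresponding words
    # whose characters differ (claims 8 and 9).
    corrected = list(candidate)
    matcher = difflib.SequenceMatcher(a=candidate, b=reference)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "replace" and (a1 - a0) == (b1 - b0):
            corrected[a0:a1] = reference[b0:b1]
    return corrected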
10. The data labeling method according to claim 3, wherein the correcting the candidate recognition result comprises:
in a case where the speech recognition result is taken as the candidate recognition result, correcting the candidate recognition result by using visually similar character data, wherein the visually similar character data comprises correspondences between characters whose visual similarity exceeds a first threshold; and
in a case where the text recognition result is taken as the candidate recognition result, correcting the candidate recognition result by using phonetically similar character data, wherein the phonetically similar character data comprises correspondences between characters whose pronunciation similarity exceeds a second threshold.
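Claim 10 corrects the candidate with tables of visually similar or phonetically similar characters. The sketch below shows only the table-lookup idea; the two toy tables are invented examples, and a real system would load far larger confusion sets.

from typing import Dict

# Assumed toy confusion tables (invented examples, not from the disclosure).
VISUALLY_SIMILAR: Dict[str, str] = {"日": "目", "己": "已"}
PHONETICALLY_SIMILAR: Dict[str, str] = {"在": "再", "做": "作"}

def correct_with_table(candidate: str, reference: str, table: Dict[str, str]) -> str:
    # Replace a candidate character only when the character at the same position
    # in the reference is listed as its confusable counterpart in the given table.
    out = []
    for pos, ch in enumerate(candidate):
        ref_ch = reference[pos] if pos < len(reference) else ""
        out.append(ref_ch if table.get(ch) == ref_ch else ch)
    return "".join(out)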
11. The data labeling method according to claim 2, wherein the degree of difference comprises a ratio of the portion of the text recognition result that differs from the speech recognition result to the whole of the text recognition result.
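The ratio in claim 11 can be computed as the share of the text recognition result that has no matching counterpart in the speech recognition result. The sketch below is an assumption, not a formula taken from the claim; it derives the share from difflib's matching blocks.

import difflib

def difference_degree(ocr_text: str, asr_text: str) -> float:
    # Fraction of characters of the OCR text that differ from the ASR text.
    if not ocr_text:
        return 0.0
    matcher = difflib.SequenceMatcher(a=ocr_text, b=asr_text)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(ocr_text) - matched) / len(ocr_text)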
12. The data labeling method according to any one of claims 1 to 11, wherein the video comprises a plurality of image frames, each of the plurality of image frames having at least one subtitle region, and the performing text recognition on the subtitle region of the video to obtain the text recognition result comprises:
determining an association relationship between subtitle regions belonging to different image frames according to position information of the at least one subtitle region of each of the plurality of image frames;
fusing text recognition result fragments of the subtitle regions having the association relationship; and
obtaining the text recognition result according to a result of the fusing.
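The fragment fusion step of claim 12 is, in the simplest case, deduplication of the same subtitle text recognized in consecutive associated frames. The helper below is a minimal sketch under that assumption; how fragments are grouped and ordered in practice is not fixed by the claim.

from itertools import groupby
from typing import List, Tuple

def fuse_fragments(fragments: List[Tuple[int, str]]) -> str:
    # fragments: (frame_index, recognized_text) pairs from associated subtitle regions.
    ordered = [text for _, text in sorted(fragments)]
    # Collapse consecutive identical texts so each displayed subtitle counts once.
    deduped = [text for text, _ in groupby(ordered)]
    return "".join(deduped)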
13. The data labeling method according to claim 12, wherein the determining the association relationship between the subtitle regions belonging to different image frames comprises:
determining the association relationship according to an intersection-over-union between the subtitle regions belonging to different image frames.
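The intersection-over-union of claim 13 is the usual overlap measure between two boxes. A minimal sketch follows; the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions made only for the example.

from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0

def same_subtitle(a: Box, b: Box, threshold: float = 0.5) -> bool:
    # Associate subtitle regions from different frames when their IoU is high enough.
    return iou(a, b) >= threshold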
14. The data labeling method according to any one of claims 1 to 11, wherein the performing speech recognition on the audio stream data of the video comprises:
performing speech recognition on an audio clip, in the audio stream data, that contains speech;
and the performing text recognition on the subtitle region of the video comprises:
determining a video clip, in image stream data of the video, that is associated with the audio clip, according to an audio sampling rate of the audio stream data and an image sampling rate of the image stream data; and
performing text recognition on a subtitle region of the video clip.
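The association in claim 14 follows from the two sampling rates alone: the audio clip's sample indices are converted to seconds and then to frame indices. The parameter names below are assumptions made for the sketch.

def audio_clip_to_frame_range(start_sample: int, end_sample: int,
                              audio_sampling_rate: int,
                              image_sampling_rate: float) -> range:
    # Convert audio sample indices to time, then to the video frame indices
    # covering the same time span.
    start_time = start_sample / audio_sampling_rate    # seconds
    end_time = end_sample / audio_sampling_rate
    first_frame = int(start_time * image_sampling_rate)
    last_frame = int(end_time * image_sampling_rate)
    return range(first_frame, last_frame + 1)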
15. The data labeling method according to any one of claims 1 to 11, wherein the labeled audio stream data is used to train the speech recognition model.
16. A speech recognition method, comprising:
performing speech recognition on audio to be recognized by using a speech recognition model, wherein the speech recognition model is trained by using audio stream data labeled by the data labeling method according to any one of claims 1 to 15.
17. A data labeling apparatus, comprising:
a first recognition unit configured to perform speech recognition on audio stream data of a video by using a speech recognition model, to obtain a speech recognition result and a confidence of the speech recognition result;
a second recognition unit configured to perform text recognition on a subtitle region of the video by using a text recognition model, to obtain a text recognition result;
a determining unit configured to fuse the speech recognition result with the text recognition result according to the confidence, to determine a final recognition result; and
a labeling unit configured to label the audio stream data according to the final recognition result.
18. A speech recognition apparatus, comprising:
an acquisition unit configured to acquire audio to be recognized; and
a recognition unit configured to perform speech recognition on the audio to be recognized by using a speech recognition model, wherein the speech recognition model is trained by using audio stream data labeled by the data labeling method according to any one of claims 1 to 15.
19. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the data labeling method according to any one of claims 1 to 15 or the speech recognition method according to claim 16.
20. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data labeling method according to any one of claims 1 to 15 or the speech recognition method according to claim 16.
CN202311602965.3A 2023-11-28 2023-11-28 Data labeling method and device and voice recognition method and device Pending CN117542349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311602965.3A CN117542349A (en) 2023-11-28 2023-11-28 Data labeling method and device and voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311602965.3A CN117542349A (en) 2023-11-28 2023-11-28 Data labeling method and device and voice recognition method and device

Publications (1)

Publication Number Publication Date
CN117542349A true CN117542349A (en) 2024-02-09

Family

ID=89791571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311602965.3A Pending CN117542349A (en) 2023-11-28 2023-11-28 Data labeling method and device and voice recognition method and device

Country Status (1)

Country Link
CN (1) CN117542349A (en)

Similar Documents

Publication Publication Date Title
CN107622054B (en) Text data error correction method and device
CN106534548B (en) Voice error correction method and device
CN109410664B (en) Pronunciation correction method and electronic equipment
CN107305541A (en) Speech recognition text segmentation method and device
CN109145149B (en) Information alignment method, device, equipment and readable storage medium
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
JP2011002656A (en) Device for detection of voice recognition result correction candidate, voice transcribing support device, method, and program
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
CN113838460A (en) Video voice recognition method, device, equipment and storage medium
CN111292751A (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN113450774A (en) Training data acquisition method and device
CN110740275A (en) nonlinear editing systems
JP2018033048A (en) Metadata generation system
CN110853627B (en) Method and system for voice annotation
CN112114771A (en) Presentation file playing control method and device
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
WO2022143608A1 (en) Language labeling method and apparatus, and computer device and storage medium
CN114281948A (en) Summary determination method and related equipment thereof
CN109213970B (en) Method and device for generating notes
CN110046354B (en) Recitation guiding method, apparatus, device and storage medium
CN117542349A (en) Data labeling method and device and voice recognition method and device
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN112201225B (en) Corpus acquisition method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination