CN118038852A - Corpus acquisition method and device, electronic equipment, storage medium and program product - Google Patents

Corpus acquisition method and device, electronic equipment, storage medium and program product Download PDF

Info

Publication number
CN118038852A
Authority
CN
China
Prior art keywords
text
text content
image data
audio
data
Prior art date
Legal status
Pending
Application number
CN202410132418.1A
Other languages
Chinese (zh)
Inventor
周逸铭
康健
李�杰
Current Assignee
China Telecom Artificial Intelligence Technology Beijing Co ltd
Original Assignee
China Telecom Artificial Intelligence Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Artificial Intelligence Technology Beijing Co ltd filed Critical China Telecom Artificial Intelligence Technology Beijing Co ltd
Priority to CN202410132418.1A priority Critical patent/CN118038852A/en
Publication of CN118038852A publication Critical patent/CN118038852A/en
Pending legal-status Critical Current

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a corpus acquisition method, a corpus acquisition device, electronic equipment, a storage medium and a program product. Wherein the method comprises the following steps: acquiring audio and video data, wherein the audio and video data comprises audio data for representing variant languages and image data; determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; determining target text content matched with the audio data from the first text content and the second text content based on the matching information; and combining the audio data and the target text content to obtain a target corpus. The method solves the technical problem of low corpus obtaining efficiency.

Description

Corpus acquisition method and device, electronic equipment, storage medium and program product
Technical Field
The invention relates to the field of large models, in particular to a corpus acquisition method, a corpus acquisition device, electronic equipment, a storage medium and a program product.
Background
Currently, variant languages (e.g., dialects) are variations of natural language, typically used in a particular geographic area or community. Conventional speech recognition techniques often face challenges with dialect diversity because they are primarily trained and optimized for standard languages.
In the related art, corpora of dialects are often difficult to obtain, especially for dialects spoken by relatively small populations; if models are trained only through traditional recording and labeling methods, model training is costly and time-consuming, so there is the technical problem of low corpus acquisition efficiency.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a corpus acquisition method, a corpus acquisition device, electronic equipment, a storage medium and a program product, which are used for at least solving the technical problem of low corpus acquisition efficiency.
According to one aspect of the embodiment of the invention, a corpus acquisition method is provided. The method may include: acquiring audio and video data, wherein the audio and video data comprises audio data for representing variant languages and image data; determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; determining target text content matched with the audio data from the first text content and the second text content based on the matching information; and combining the audio data and the target text content to obtain a target corpus.
Optionally, determining the first text content corresponding to the audio data in the audio-video data includes: and recognizing human voice appearing in the audio data by using a voice recognition model to obtain first text content, wherein the voice recognition model is used for recognizing the variant language.
Optionally, determining the second text content corresponding to the image data in the audio-video data includes: determining a text region in the image data and position information corresponding to the text region; identifying the text region to obtain third text content in the text region and initial identification confidence corresponding to the third text content; and obtaining second text content based on the position information, the third text content and the initial recognition confidence.
Optionally, determining the text region in the image data includes: determining at least one sub-text area in the image data, wherein the sub-text area contains texts of which the deployment angles meet a horizontal threshold value; and clustering and merging at least one sub text area to obtain a text area.
Optionally, obtaining the second text content based on the location information, the third text content, and the initial recognition confidence includes: screening caption text in the image data from the third text content based on the position information; combining caption texts in the image data to obtain a combined text; determining target recognition confidence corresponding to the combined text based on the initial recognition confidence; and obtaining second text content based on the target recognition confidence, the time information corresponding to the image data and the combined text.
Optionally, obtaining the second text content based on the target recognition confidence, the time information corresponding to the image data and the combined text includes: determining adjacent frame images of the image data; determining a first similarity between the combined text in the image data and the combined text in the adjacent frame image; responding to the fact that the first similarity is larger than a similarity threshold, and determining the combined text with the target recognition confidence larger than the confidence threshold in the combined text corresponding to the combined text and the adjacent frame image as second text content; and updating the text content of the combined text in the image data and the text content of the combined text in the adjacent frame image into second text content.
Optionally, matching the first text content and the second text content to obtain matching information includes: performing voice positioning on the audio data to obtain a first start-stop time of the first text content corresponding to at least one voice in the audio data; determining a second start-stop time of the second text content; and matching the first text content with the second text content based on the first start-stop time and the second start-stop time to obtain the matching information.
Optionally, determining the start-stop time of the second text content to obtain the second start-stop time includes: determining, in the multi-frame image data of the video data, the first image data whose text content is the second text content and the last image data whose text content is the second text content; and determining the second start-stop time based on the time at which the first image data appears and the time at which the last image data appears.
Optionally, matching the first text content and the second text content based on the first start-stop time and the second start-stop time to obtain matching information includes: expanding the first start-stop time to obtain a third start-stop time; determining, among the second text contents, at least one matched text content whose second start-stop time falls within the third start-stop time; and respectively matching the at least one matched text content with the first text content to obtain the matching information.
According to another aspect of the embodiment of the invention, a corpus acquisition device is also provided. The apparatus may include: an acquisition unit configured to acquire audio-video data including audio data representing a variant language, and image data; the first determining unit is used for determining first text content corresponding to audio data in the audio-video data and second text content corresponding to image data in the audio-video data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; the processing unit is used for matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; a second determining unit configured to determine, based on the matching information, a target text content that matches the audio data from the first text content and the second text content; and the combining unit is used for combining the audio data and the target text content to obtain a target corpus.
According to another aspect of the embodiments of the present invention, there is further provided a non-volatile storage medium, where a plurality of instructions are stored, where the instructions are adapted to be loaded by a processor and execute any one of the corpus acquisition methods.
According to another aspect of the embodiment of the present invention, there is further provided an electronic device, including one or more processors and a memory, where the memory is configured to store one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement any one of the corpus acquiring methods.
According to another aspect of the embodiments of the present invention, there is also provided a computer program product, including a computer program, which when executed by a processor, implements the corpus acquisition method of any one of the above.
In the embodiment of the invention, audio-video data is acquired, wherein the audio-video data comprises audio data for representing variant languages and image data; determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; determining target text content matched with the audio data from the first text content and the second text content based on the matching information; and combining the audio data and the target text content to obtain a target corpus. That is, in the embodiment of the invention, audio and video data with variant language and text content are acquired, audio data in the audio and video data are processed to obtain first text information, multi-frame image data in the audio and video data are processed to obtain second text information, the first text information and the second text information are matched to determine target text content matched with the audio data, and the audio data and the target text content can be combined, so that dialect corpus which can be used as training data is obtained, further, the technical effect of improving the efficiency of acquiring corpus is realized, and the technical problem of low efficiency of acquiring corpus is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method of corpus acquisition according to an embodiment of the invention;
FIG. 2 is a flow diagram of an OCR and VAD based speech recognition dialect corpus process in accordance with an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a corpus acquiring device according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terminology appearing in the description of the embodiments of the application are explained as follows:
an optical character recognition technique (Optical Character Recognition, abbreviated OCR) for converting an image or scanned text into machine-editable text data; by means of image processing and character recognition techniques, printed or handwritten characters may be extracted from the image for further processing, editing or storage;
a voice end point detection (Voice Activity Detection, abbreviated as VAD) for finding the start and end points of a voice from a voice signal containing silence, noise, etc.;
Automatic speech recognition (Automatic Speech Recognition, abbreviated as ASR), which may be a technique for converting human speech into text or instructions; applications may include voice assistants (e.g., Siri, Alexa, Google Assistant), voice transcription, telephony automation systems, voice command control, and the like.
According to an embodiment of the present invention, there is provided an embodiment of a corpus acquisition method, it should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order different from that herein.
Fig. 1 is a flowchart of a corpus acquisition method according to an embodiment of the present invention, as shown in fig. 1, the method may include the following steps:
Step S102, audio and video data is acquired, wherein the audio and video data comprises audio data for representing variant languages and image data.
In this embodiment, when corpus is required to be acquired, audio-video data can be acquired. The audio-video data may be corpus for obtaining a training speech recognition model, and may include audio data and image data for representing a variant language, for example, audio-video data with subtitles and dialect audio. The variant language may be dialect, slang, accent, jargon, etc., and it should be noted that the description is merely illustrative, and the type of the variant language is not particularly limited. The audio data may be a speech signal and may contain utterances of a variant language. The image data may contain subtitles that may be used to further verify the text content corresponding to the pronunciation in the audio data.
Alternatively, the above-mentioned audio-video data may contain multi-frame image data. Audio-video data with subtitles and containing dialect audio may be downloaded, wherein the audio data in the audio-video data may contain dialect audio.
For example, various audio and video data with variant-language speech and subtitles can be obtained from sources such as variety shows, talk shows, film and television programs, the Internet, etc. The audio data and the multi-frame image data can be contained in the audio-video data. The audio data may be audio data containing dialects. The image data may include subtitles corresponding to the dialects. It should be noted that this is merely illustrative, and the source of the audio-video data is not particularly limited.
Step S104, determining a first text content corresponding to the audio data and a second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data.
In this embodiment, a first text content corresponding to audio data and a second text content corresponding to image data may be determined. The first text content may be a result of automatic speech recognition (Automatic Speech Recognition, abbreviated as ASR), may be used to determine text content corresponding to audio, may be used to represent content of a language uttered by a voice in the audio data, and may be initial text information. The second text content may be text presented from the image data and may be used to describe subtitle content in the image data.
Alternatively, when a voice appears in the audio data in the audio-video data, the content uttered by the voice may be determined to obtain the first text content. The audio-visual data may be composed of a plurality of frames of images. Therefore, the text displayed in each frame of image data can be extracted to obtain the second text content corresponding to the image data.
And step S106, matching the first text content and the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content.
In this embodiment, the first text content and the second text content may be matched to obtain matching information, and the matching degree between the first text content and the second text content may be determined through the matching information. Wherein the matching information may be used to characterize the degree of similarity of the text in the first text content and the second text content.
Optionally, the first text content and the second text content are matched, and a position where the same text exists in the first text content and the second text content can be determined to determine matching information.
Step S108, determining target text content matched with the audio data from the first text content and the second text content based on the matching information.
In this embodiment, the target text content that matches the audio data may be determined from the first text content and the second text content based on the matching information. The target text content may be text corresponding to a variant language.
Step S110, combining the audio data and the target text content to obtain a target corpus.
In this embodiment, the audio data and the target text content may be combined to obtain the target corpus. The target corpus may be a dialect corpus or a slang corpus, etc., and may be used for training an acoustic model to improve accuracy of the acoustic model in identifying the variant language, and it should be noted that the method is only illustrative and does not specifically limit the type of the target corpus. The acoustic model may be used to identify variant languages.
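For illustration, the following minimal Python sketch shows one way the combining step could be implemented, assuming the target corpus is stored as a JSON-lines manifest pairing each speech segment with its matched text; the manifest fields and file layout are assumptions and are not specified by the embodiment.

```python
import json

def build_corpus_entry(audio_path: str, start: float, end: float, text: str) -> dict:
    """Pair one speech segment with its matched target text content.

    The field names (audio path, start/end offsets, transcript) are assumed;
    the embodiment only states that audio data and target text are combined.
    """
    return {"audio": audio_path, "start": start, "end": end, "text": text}

def write_corpus(entries, manifest_path: str) -> None:
    # One JSON object per line, a common layout for speech-recognition training lists.
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```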
Because corpora of variant languages (such as dialects) are often difficult to obtain, especially for dialects spoken by relatively small populations, if only traditional recording and labeling methods are adopted, the training of an acoustic model is costly and time-consuming, and there is a technical problem of low accuracy of the model in recognizing variant languages. In order to solve the above technical problem, in this embodiment, audio-video data is acquired, and the first text content uttered in the audio data of the audio-video data and the subtitle content (second text content) in the multi-frame images of the audio-video data are determined. The first text content is matched with the second text content to obtain matching information, target text content with a high matching degree with the audio data is determined based on the matching information, and the audio data is combined with the target text content, so that a target corpus comprising pronunciations of the variant language and the corresponding text can be obtained. By this method, the training data of the acoustic model can be enriched, thereby improving the accuracy of the acoustic model in recognizing the variant language.
Through the above steps S102 to S110, audio-video data is acquired, first text content corresponding to the audio data and second text content corresponding to the image data are determined, the first text content and the second text content are matched to obtain matching information, target text content matched with the audio data is determined from the first text content and the second text content based on the matching information, and the audio data and the target text content are combined to obtain a target corpus. That is, in the embodiment of the invention, a dialect corpus that can be used as training data is obtained automatically from audio-video data carrying both variant-language speech and subtitles, thereby realizing the technical effect of improving the efficiency of acquiring corpus and solving the technical problem of low efficiency of acquiring corpus.
The above-described method of this embodiment is further described below.
As an optional embodiment, step S104, determining the first text content corresponding to the audio data in the audio-video data includes: and recognizing human voice appearing in the audio data by using a voice recognition model to obtain first text content, wherein the voice recognition model is used for recognizing the variant language.
In this embodiment, a person present in the audio data may be identified using a speech recognition model to obtain the first text content. The speech recognition model may be used to convert a variant language in the audio data into text content, for example, may be a dialect speech recognition model based on a neural network structure (conformer) of an attention mechanism, and it should be noted that the description is merely illustrative, and the type of the dialect speech recognition model is not particularly limited.
Alternatively, the embodiment may first train a Conformer-based dialect speech recognition model using a small amount of collected dialect data (i.e., a dialect corpus). The model can be constructed by combining a convolutional neural network (Convolutional Neural Network, CNN for short) structure for acoustic feature extraction and a Transformer structure for context modeling; with this construction, the model can effectively capture the time sequence and context information of the audio data, thereby improving recognition accuracy. The acoustic features extracted by the trained dialect speech recognition model may be Mel frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, abbreviated as MFCC), and an encoder-decoder structure may be adopted. The encoder may be configured to encode the acoustic features of the audio data into a context representation, and the decoder may be configured to convert the context representation into a text output to obtain the first text content.
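As an illustration of the acoustic front end described above, the following sketch extracts MFCC features with the open-source librosa library; the library choice, sample rate, and frame parameters are assumptions, and the Conformer encoder-decoder itself is not shown here.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Compute MFCC features for one utterance.

    The sample rate, coefficient count, and frame settings are illustrative
    assumptions; the embodiment only specifies MFCCs as the acoustic features.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    # Shape (n_mfcc, n_frames); transpose to (n_frames, n_mfcc) as encoder input.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=160, n_fft=400)
    return mfcc.T
```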
Optionally, after the audio-video data is acquired, the audio data in the audio-video data may be extracted, and the voice appearing in the audio may be processed by using the dialect voice recognition model, so as to obtain the first text content. The voice may be a voice speaking in a video, where the voice may include a variant language, such as mandarin, dialect, slang, etc., and it should be noted that the voice is only illustrated herein, and the type of voice is not particularly limited.
Since the dialect speech recognition model is obtained only by training with a small amount of corpus, the first text content obtained by using the model is not very accurate, and the first text content is corrected based on the second text content in a later step.
As an optional embodiment, step S104, determining the second text content corresponding to the image data in the audio/video data includes: determining a text region in the image data and position information corresponding to the text region; identifying the text region to obtain third text content in the text region and initial identification confidence corresponding to the third text content; and obtaining second text content based on the position information, the third text content and the initial recognition confidence.
In this embodiment, text may appear in multiple places in the image data, but some of that text may be irrelevant; the multiple text areas where text appears may therefore be processed by the following steps to ensure that the second text content is the required subtitle text:
Optionally, determining the text region in the image data and the position information corresponding to the text region, identifying the text region through a text identification model to obtain at least one third text content in the text region and an initial identification confidence corresponding to the third text content, and processing the third text content based on the position information and the initial identification confidence to obtain the second text content. The text region may be a region in which text exists in the image data, for example, a region in which subtitles exist or a region in which text subtitles exist. The location information may be used to determine an area in the image data where the text area is located, may be represented by (X, Y), and may include a deployment location and a deployment angle of the text area. The initial recognition confidence may be used to characterize the likelihood of the real text being included in the third text content in the text region. The text recognition model can be constructed based on a semantic visual text correlation algorithm (Semantic Visual Textual Relevance, abbreviated as SVTR) and can be used for recognizing text contents in a text region.
Alternatively, text may appear in a plurality of areas in the image data, and thus, a text area in the image area and position information corresponding to the text area may be determined. The text regions may be identified by a text recognition model to determine third text content in at least one text region, and an initial recognition confidence level corresponding to the third text content may be determined by the text recognition model. And screening and processing the third text content based on the position information and the initial recognition confidence level to obtain the second text content.
Alternatively, since much of the audio-video data containing dialect audio is relatively long and its definition is not very high, several frames of image data may be extracted per second for optical character recognition. To balance the efficiency of batch OCR processing, 4 frames per second can be extracted, i.e., adjacent frames are 0.25 seconds apart, and the audio-video data can be sampled at this rate. For example, for a 30-minute episode of video data, 7,200 pictures can be extracted in total in this way to obtain the multi-frame image data. It should be noted that these figures are merely illustrative, and the frame rate may be selected according to practical situations without being limited thereto.
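A minimal sketch of the frame sampling just described, using OpenCV; the use of OpenCV and the fallback for an unknown native frame rate are assumptions.

```python
import cv2

def sample_frames(video_path: str, fps_target: float = 4.0):
    """Yield (timestamp_seconds, frame) pairs at roughly 4 frames per second.

    4 fps (one frame every 0.25 s) follows the rate described above; the
    OpenCV-based implementation is an assumed sketch, not the embodiment's own code.
    """
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if the rate is unknown
    step = max(int(round(native_fps / fps_target)), 1)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / native_fps, frame
        index += 1
    cap.release()
```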
For example, the extracted image data may be subjected to OCR using a Chinese recognition model and a Chinese detection model. The text region in the image data and the location information corresponding to the text region may be determined using the Chinese detection model. The text region can be recognized by the text recognition model to determine the third text content in the text region and the initial recognition confidence corresponding to the third text content. The third text content is then adjusted based on the initial recognition confidence and the position information to obtain the second text content. The Chinese detection model may adopt a differentiable binarization algorithm (Differentiable Binarization, abbreviated as DB) and may be a DB module.
Alternatively, a Chinese detection model may be used for text detection, through which text regions in the image data, and location information corresponding to the text regions, may be located and extracted. The Chinese detection model may be a lightweight deep learning model, for example, a scene text detection model (Efficient and Accurate Scene Text Detector, abbreviated as EAST). It should be noted that this is merely illustrative, and the type of the text detection model is not particularly limited.
Alternatively, the text recognition model may be used to perform a text recognition task, i.e., the detected characters in the text region may be converted into readable text strings to obtain a third text content corresponding to the text region.
For example, the character recognition model may be constructed from a variational autoencoder and a Transformer model, and a sequence-to-sequence algorithm may be used. With this algorithm, the character recognition model may first cut each text region into individual characters and send the cut characters to the variational autoencoder for feature extraction and encoding. Then, the Transformer model is used to decode and generate the corresponding text sequence to obtain accurate third text content. Meanwhile, in order to improve robustness, the character recognition model adopts an attention mechanism to model the dependency relationships among different characters so as to improve recognition accuracy.
Alternatively, the text region and the position information of the text in the image data can be determined through the Chinese detection model, and specific content (third text content) in the text region and the initial recognition confidence degree can be recognized through the text recognition model.
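For illustration, the following sketch runs detection and recognition on a single frame using the open-source PaddleOCR toolkit, which bundles a DB-style detector and a text recognizer; the toolkit choice is an assumption (the embodiment does not name a specific implementation), and the result layout may differ between PaddleOCR versions.

```python
from paddleocr import PaddleOCR  # assumed toolkit; bundles a Chinese detector and recognizer

# lang="ch" loads Chinese detection + recognition models; parameters are illustrative.
ocr_engine = PaddleOCR(use_angle_cls=True, lang="ch")

def recognize_frame(image_path: str):
    """Return (box, text, confidence) triples for one frame.

    The result layout here follows recent PaddleOCR releases ([box, (text, score)]
    per line) and may differ between versions; treat this as a sketch only.
    """
    result = ocr_engine.ocr(image_path, cls=True)
    lines = result[0] or []  # result[0] may be None when no text is detected
    return [(box, text, float(score)) for box, (text, score) in lines]
```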
As an alternative embodiment, determining a text region in the image data includes: determining at least one sub-text area in the image data, wherein the sub-text area contains texts of which the deployment angles meet a horizontal threshold value; and clustering and merging at least one sub text area to obtain a text area.
In this embodiment, there may be a plurality of texts in the image data, and thus, at least one sub-text region in which the arrangement angle satisfies the horizontal threshold may be determined in the image data, and the at least one sub-text region is clustered to obtain a text region. The level threshold may be a preset value, for example, may be 180 degrees, and it should be noted that the level threshold is only illustrated herein and is not limited in size. The deployment angle may be the angle at which the sub-text region is placed. The sub-text region may include a text line or a text segment.
Since there may be non-caption text such as station logos, advertisements, etc. in part of the image data, more than one text may be recognized, but only caption text is required among the recognized texts; therefore, text regions with a horizontal or near-horizontal deployment angle may be detected to obtain at least one sub-text region. The detected text lines are then combined into a larger text region by performing a clustering and merging operation on them to obtain the final text detection result, i.e., the text region.
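A minimal sketch of the clustering and merging operation described above, assuming axis-aligned line boxes and a fixed vertical tolerance; both assumptions are illustrative and not specified by the embodiment.

```python
def merge_horizontal_lines(boxes, y_tol: float = 10.0):
    """Cluster near-horizontal line boxes into larger text regions.

    Each box is (x1, y1, x2, y2) with y1 < y2. Lines whose vertical centres lie
    within y_tol pixels of an existing cluster are merged into it; the tolerance
    and the axis-aligned box format are assumptions for illustration.
    """
    clusters = []  # each cluster: [x1, y1, x2, y2]
    for x1, y1, x2, y2 in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        cy = (y1 + y2) / 2
        for c in clusters:
            if abs(cy - (c[1] + c[3]) / 2) <= y_tol:
                c[0], c[1] = min(c[0], x1), min(c[1], y1)
                c[2], c[3] = max(c[2], x2), max(c[3], y2)
                break
        else:
            clusters.append([x1, y1, x2, y2])
    return [tuple(c) for c in clusters]
```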
As an alternative embodiment, step S104 obtains the second text content based on the location information, the third text content, and the initial recognition confidence, including: screening caption text in the image data from the third text content based on the position information; combining caption texts in the image data to obtain a combined text; determining target recognition confidence corresponding to the combined text based on the initial recognition confidence; and obtaining second text content based on the target recognition confidence, the time information corresponding to the image data and the combined text.
In this embodiment, since the caption text in the image data can be used to determine what the variant-language speech in the audio data actually expresses, the caption text of the image data can be screened out from the third text content. Meanwhile, since spaces may exist in the middle of the subtitle in the same piece of image data, the subtitle texts may be combined to obtain a combined text.
Optionally, based on the position information, the caption text in the image data may be screened from the third text content, and the caption text in the image data may be combined to obtain the combined text. The target recognition confidence corresponding to the combined text may be determined based on the initial recognition confidence, and the second text content may be obtained based on the target confidence, the time information corresponding to the image data, and the combined text. The time information corresponding to the image data may be used to determine the time when the image data appears in the audio/video data.
Since there may be non-caption extraneous text, such as station logos, advertisements, etc., in part of the pictures, more than one text region may be identified. Therefore, after the third text content in all frames of image data is recognized in batch, the positions where the third text content appears can be statistically processed. Since it can be assumed by default that subtitles of the same video appear at approximately the same position, the approximate position with the highest frequency of appearance of the third text content among the multiple frames of image data can be obtained statistically. Since the subtitles are all horizontal, the X-axis positions of the four endpoints of each subtitle are not fixed, but the Y-axis positions are relatively fixed, so non-subtitle text can be screened out by counting the fixed Y-axis interval of the subtitle appearance area. By this method, the caption text in the image data can be screened from the third text content based on the position information.
Alternatively, since the caption text in the same image data may have a space in the middle, the text region recognition result may be plural, and at this time, plural text regions may be combined to obtain a combined text. And determining the initial recognition confidence coefficient of each third text content in the combined text, and determining the average value of the initial recognition confidence coefficients in the plurality of third text contents to obtain the target recognition confidence coefficient corresponding to the combined text. By the step, the triplet data which corresponds to each frame of image data and contains time information, third text content and target recognition confidence coefficient can be obtained.
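The following sketch illustrates the two operations just described, screening by the statistically dominant Y-axis band and merging caption pieces with an averaged confidence, under the assumption of axis-aligned boxes and a fixed binning width.

```python
from collections import Counter

def dominant_subtitle_band(all_boxes, bin_px: int = 20):
    """Estimate the Y interval where subtitles most often appear across all frames.

    all_boxes: iterable of (x1, y1, x2, y2) boxes gathered from every sampled frame.
    The 20-pixel binning is an illustrative assumption.
    """
    counts = Counter(int((y1 + y2) / 2) // bin_px for _, y1, _, y2 in all_boxes)
    band = counts.most_common(1)[0][0]
    return band * bin_px, (band + 1) * bin_px

def merge_subtitle_pieces(pieces, band):
    """Keep pieces inside the subtitle Y band, join them left-to-right, and
    average their recognition confidences (the averaging rule follows the text above)."""
    lo, hi = band
    kept = [(box, text, conf) for box, text, conf in pieces
            if lo <= (box[1] + box[3]) / 2 < hi]
    kept.sort(key=lambda p: p[0][0])  # left-to-right by x1
    if not kept:
        return "", 0.0
    merged_text = "".join(text for _, text, _ in kept)
    merged_conf = sum(conf for _, _, conf in kept) / len(kept)
    return merged_text, merged_conf
```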
Optionally, the second text content may be further determined based on the target recognition confidence, the time information corresponding to the image data, and the combined text.
As an alternative embodiment, obtaining the second text content based on the target recognition confidence, the time information corresponding to the image data and the combined text includes: determining adjacent frame images of the image data; determining a first similarity between the combined text in the image data and the combined text in the adjacent frame image; responding to the fact that the first similarity is larger than a similarity threshold, and determining the combined text with the target recognition confidence larger than the confidence threshold in the combined text corresponding to the combined text and the adjacent frame image as second text content; and updating the text content of the combined text in the image data and the text content of the combined text in the adjacent frame image into second text content.
In this embodiment, several pieces of adjacent image data may correspond to the same subtitle text, but due to the low definition of the image data, errors in the results of the OCR models (text detection model + text recognition model), and the like, different subtitle texts may be recognized even though the subtitle is actually the same. Therefore, it is necessary to determine whether adjacent image data correspond to the same caption text; if so, the caption text in the multi-frame image data can be unified.
Optionally, adjacent frame images of the image data are determined. And determining a first similarity between the combined text in the image data and the combined text of the adjacent frame image, and determining the combined text with the target confidence degree larger than the confidence degree threshold value as the second text content in the combined text corresponding to the combined text and the adjacent frame image if the first similarity is larger than the similarity threshold value. The text content of the combined text in the image data and the text content of the combined text in the adjacent frame image may be updated to the second text content.
For example, a second frame image and a third frame image adjacent to the first image data are determined, wherein the second frame image is played before the first image data, and the third frame image is played before the second frame image. A first similarity between the merged text in the first image data and the merged text of the second frame image, and a first similarity between the merged text in the first image data and the merged text of the third frame image, are determined. If the first similarities are greater than the similarity threshold, it may be determined that the subtitles corresponding to the first image data, the second frame image, and the third frame image are identical; therefore, the merged text with the highest target confidence may be selected as the second text content from among the merged text of the first image data, the merged text of the second frame image, and the merged text of the third frame image. The merged text in the first image data, the merged text in the second frame image, and the merged text in the third frame image are then each updated to the second text content.
Optionally, it is determined whether the combined text (i.e., the finally displayed subtitle) in adjacent image data is the same; if so, the texts are merged, the start-stop time and content text of each subtitle are determined, and subtitles that are too short or whose OCR confidence is too low are screened out.
For example, a method of calculating the similarity of adjacent texts may be adopted, and if the first similarity of the combined text between adjacent image data is greater than the similarity threshold, the adjacent image data may be considered to correspond to the same subtitle. The similarity threshold may be 0.5 as determined experimentally. The frame image data from which the subtitle (which may be a combined text) appears first may be compared backward, and if the first similarity obtained by the comparison is greater than the threshold value, the combined text with higher confidence in the recognition result is updated to be the correct text (i.e., the second text content) corresponding to the subtitle of the two image data. And updates the time of the start position (i.e., start time) and the time of the end position (i.e., end time) corresponding to the second text content. If the condition that the combined text is not recognized by OCR is met in the middle, skipping and continuing to compare the similarity backwards until the first similarity with the correct text is smaller than a similarity threshold value, considering the current frame as the starting point of a new subtitle, considering the previous frame as the end of the previous subtitle, and then performing the combining operation on the next subtitle from the current frame. By the above processing, the triples including the start time, the end time, the second text content and the recognition confidence can be obtained.
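A simplified sketch of the adjacent-frame merging just described, using a character-level similarity from the Python standard library as a stand-in for whatever similarity measure is actually used; the 0.5 threshold and keeping the highest-confidence text follow the text above.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    # A simple character-level similarity; the embodiment does not fix a particular metric.
    return SequenceMatcher(None, a, b).ratio()

def _close_group(group):
    start, end = group[0][0], group[-1][0]
    _, best_text, best_conf = max(group, key=lambda g: g[2])
    return start, end, best_text, best_conf

def merge_adjacent_frames(frames, sim_threshold: float = 0.5):
    """Group consecutive frames whose merged subtitle text is similar enough.

    frames: list of (timestamp, text, confidence) per sampled frame, in time order.
    Returns (start_time, end_time, best_text, best_confidence) per subtitle, keeping
    the text with the highest confidence inside each group.
    """
    subtitles, group = [], []
    for ts, text, conf in frames:
        if not text:  # frame where OCR found nothing: skip and keep comparing
            continue
        if group and text_similarity(group[-1][1], text) >= sim_threshold:
            group.append((ts, text, conf))
        else:
            if group:
                subtitles.append(_close_group(group))
            group = [(ts, text, conf)]
    if group:
        subtitles.append(_close_group(group))
    return subtitles
```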
Optionally, triples in which the recognition confidence of the second text content is lower than 0.8 are screened out, because if the confidence of none of the frame image data corresponding to a certain caption reaches 0.8, the recognition result for that image data may be considered poor. At the same time, recognition results that are too short are highly likely to be erroneous, so triples with too short a duration, i.e., second text content appearing in only 2 frames or fewer, can also be screened out. A large portion of erroneous text can be filtered out by the above steps. It should be noted that the above numbers are merely illustrative, may be selected according to practical situations, and are not particularly limited herein.
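The screening rules above can be expressed as a small filter; mapping the "2 frames or fewer" rule onto the 0.25-second sampling gap is an assumption.

```python
def filter_subtitles(subtitles, frame_gap: float = 0.25,
                     min_conf: float = 0.8, min_frames: int = 3):
    """Drop subtitles with low OCR confidence or too short a duration.

    The 0.8 confidence floor and the "more than 2 frames" duration rule follow
    the text above; converting duration to a frame count via the assumed 0.25 s
    sampling gap is an illustrative choice.
    """
    kept = []
    for start, end, text, conf in subtitles:
        frames_spanned = int(round((end - start) / frame_gap)) + 1
        if conf >= min_conf and frames_spanned >= min_frames:
            kept.append((start, end, text, conf))
    return kept
```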
As an optional embodiment, step S106, matching the first text content and the second text content to obtain matching information, includes: performing voice positioning on the audio data to obtain a first start-stop time of the first text content corresponding to at least one voice in the audio data; determining a second start-stop time of the second text content; and matching the first text content with the second text content based on the first start-stop time and the second start-stop time to obtain the matching information.
In this embodiment, the audio data may be subjected to voice localization to obtain the first start-stop time of the first text content corresponding to at least one voice in the audio data. The audio data may include at least the first start-stop time corresponding to a human voice, and the first start-stop time may include a start time and an end time of the voice.
Optionally, in the multi-frame image data, there may be a case that the second text content is the same, and determining a time when the second text content first appears and an end time in the video data, so as to obtain a second start-stop time corresponding to the second text content.
For example, the audio information in the video data is extracted. Meanwhile, a voice endpoint detection (Voice Activity Detection, abbreviated as VAD) model with an attention-based neural network structure may be used to process the extracted audio information in batch, that is, the starting point and the ending point of the voice in the audio information are determined by using the voice endpoint detection model, so as to obtain the first start-stop time.
Alternatively, the input feature of the voice endpoint detection model may be the Mel frequency cepstral coefficients (MFCC), and the voice endpoint detection model may be composed of a linear layer and a plurality of core modules. Each core module consists of a feedforward network module, a convolution module, a multi-head attention mechanism module, and a layer normalization layer, and each module has a residual connection.
Optionally, the voice endpoint detection model may perform two-class classification between noise (including silence) and human voice, and each sentence of human voice in the audio can be located by the VAD and its start-stop time obtained, so as to obtain the first start-stop time. The first start-stop time comprises a voice start time and a voice end time.
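For illustration, the following PyTorch sketch mirrors the structure just described: a linear front end over MFCC frames followed by stacked core modules (feed-forward, convolution, multi-head attention, and layer normalization, each with a residual connection) and a two-class speech/non-speech head. All layer sizes and hyperparameters are assumptions, not values given by the embodiment.

```python
import torch
import torch.nn as nn

class CoreBlock(nn.Module):
    """One simplified core module: feed-forward, convolution, multi-head attention,
    and layer normalization, each wrapped in a residual connection (assumed sizes)."""
    def __init__(self, dim: int = 256, heads: int = 4, kernel: int = 15):
        super().__init__()
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.SiLU(), nn.Linear(4 * dim, dim))
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),
            nn.BatchNorm1d(dim), nn.SiLU())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + self.ff(x)                     # feed-forward + residual
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolution + residual
        h = self.attn_norm(x)
        a, _ = self.attn(h, h, h)
        x = x + a                              # attention + residual
        return self.out_norm(x)

class VADModel(nn.Module):
    """Linear front end over MFCC frames, stacked core blocks, and a per-frame
    two-class (speech / non-speech) classifier, per the description above."""
    def __init__(self, n_mfcc: int = 13, dim: int = 256, blocks: int = 4):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)
        self.blocks = nn.ModuleList([CoreBlock(dim) for _ in range(blocks)])
        self.head = nn.Linear(dim, 2)

    def forward(self, mfcc):                   # mfcc: (batch, time, n_mfcc)
        x = self.proj(mfcc)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)                    # per-frame speech/non-speech logits
```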
As an alternative embodiment, determining the start-stop time of the second text content to obtain the second start-stop time includes: determining, among the multi-frame image data of the video data, the first image data whose text content is the second text content and the last image data whose text content is the second text content; and determining the second start-stop time based on the time at which the first image data appears and the time at which the last image data appears.
In this embodiment, among the multi-frame image data of the video data, the first image data whose text content is the second text content and the last image data whose text content is the second text content may be determined, and the second start-stop time may be determined based on the time at which the first image data appears and the time at which the last image data appears.
It should be noted that no other second text content may exist between the time of the first image data and the last image data. For example, if the second text content in the multi-frame image data is the text content one, the text content two, and the text content one, respectively, it may be determined that the start-stop time of the text content one is the start time of the first second, and the stop time is the second, not the fourth second. If the second text content in the multi-frame image data is the text content one, the text content one and the text content two, respectively, the start-stop time of the text content one can be determined as the start time of the first second and the stop time of the third second.
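A minimal sketch of deriving the second start-stop time by grouping consecutive frames that carry the same second text content, consistent with the examples above (a text that reappears after a different text opens a new interval); the tuple layout is an assumption.

```python
def subtitle_intervals(frames):
    """Compute the start-stop time of each subtitle text.

    frames: list of (timestamp, text) in playback order. Consecutive frames with
    identical text share one interval; a repeated text separated by a different
    text starts a new interval, matching the worked example above.
    """
    intervals = []
    for ts, text in frames:
        if intervals and intervals[-1][2] == text:
            intervals[-1][1] = ts                  # extend the current interval
        else:
            intervals.append([ts, ts, text])       # open a new interval
    return [(start, end, text) for start, end, text in intervals]
```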
As an alternative embodiment, matching the first text content and the second text content based on the first start-stop time and the second start-stop time to obtain matching information includes: expanding the first start-stop time to obtain a third start-stop time; determining, among the plurality of second text contents, at least one second target text content whose second start-stop time falls within the third start-stop time; and respectively matching the at least one second target text content with the first text content to obtain the matching information.
In this embodiment, since the audio data and the multi-frame image data are extracted from the same audio-video data, the first text content and the second text content should be the same; therefore, the first text content and the second text content can be matched based on the first start-stop time and the second start-stop time to obtain matching information, and based on the matching information, a target text content matching the audio data can be determined from the first text content and the second text content.
Alternatively, the ASR result (i.e., the first text content) and the OCR result (i.e., the second text content) that are close in time are matched; if they match, the ASR result is replaced with the OCR result, i.e., OCR text carrying the VAD time stamps is obtained as a labeled speech recognition dialect corpus.
Since the OCR result and the ASR result are roughly aligned only by the time stamp, there is a problem in that the accuracy of recognition of the dialect is low; meanwhile, if the judgment and screening are performed manually, there is a problem that the efficiency of data processing is low. To solve the above problem, in this embodiment, the obtained VAD and ASR results are corrected using the obtained OCR results.
Alternatively, since the second start-stop time of the second text content (the subtitle) only appears around the time at which the voice occurs in the video, while the first start-stop time of the first text content is the start-stop time of each utterance obtained at the acoustic level, the time stamp given by the first start-stop time is more accurate, and the time stamp of the second start-stop time can only be used as a reference; the second text content is therefore used to correct the first text content. Thus, at least one matched text content whose second start-stop time falls within the third start-stop time is determined among the plurality of second text contents, and the at least one matched text content is respectively matched with the first text content to obtain the matching information.
Alternatively, based on the matching information, a target matching text content having a high matching degree with the first text content among the at least one matching text content may be selected, and the first text content may be replaced with the target matching text content.
For example, because text content that is too short may affect the accuracy of matching ASR text and OCR text, ASR results that are too short may be combined, e.g., into text corresponding to speech between 7 and 15 seconds in length. For each ASR recognition text, its corresponding caption presentation time should typically lie within three seconds before the start time obtained by VAD and three seconds after the end time obtained by VAD. Thus, the first start-stop time can be extended, resulting in a third start-stop time [start, end]. Since the start-stop time of each piece of second text content has already been determined, as long as part of the second start-stop time corresponding to a piece of second text content falls within [start, end], that OCR result caption (second text content) can be regarded as a possible caption matching the ASR result (first text content). All these possible subtitles are then combined into a long text, within which the OCR text corresponding to the current ASR text can be considered to exist. The specific OCR text corresponding to the ASR text is then accurately located through a matching algorithm. The first 5 characters and the last 5 characters of the ASR text are matched against the long text respectively, positions with a matching degree above 0.8 are found, and the span between (and including) the two positions is taken as the corresponding OCR text. The end of the previous ASR sentence and the first 5 characters of the next ASR sentence are also matched against the long text; if a matching position exists, that position is likewise a possible sentence start or end. If multiple positions meeting the matching requirement appear, similarity calculation is carried out between the multiple OCR texts obtained through these rules and the original ASR text, and the OCR text with the highest similarity whose OCR-to-ASR length ratio is within 1.5 is considered the correct text. If no position meeting the matching requirement is found, the match is abandoned, i.e., the ASR text is discarded. The original ASR text is then replaced with the OCR result to obtain the corrected target text content.
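The matching and correction procedure above can be sketched as follows; the ±3-second window, 5-character anchors, 0.8 matching degree, and 1.5 length-ratio ceiling follow the text, while the concrete similarity metric and data layout are assumptions.

```python
from difflib import SequenceMatcher

def _best_match_pos(snippet: str, long_text: str, threshold: float = 0.8):
    """Slide the short snippet over the long text and return the best start index
    whose local similarity reaches the threshold, else None."""
    best_pos, best_score = None, threshold
    for i in range(len(long_text) - len(snippet) + 1):
        score = SequenceMatcher(None, snippet, long_text[i:i + len(snippet)]).ratio()
        if score >= best_score:
            best_pos, best_score = i, score
    return best_pos

def correct_asr_with_ocr(asr_text, asr_start, asr_end, ocr_subtitles,
                         pad: float = 3.0, max_len_ratio: float = 1.5):
    """Replace one ASR hypothesis with the matching OCR subtitle text, if any.

    ocr_subtitles: list of (start, end, text). Returns the corrected text, or
    None when no acceptable match is found (the ASR text is then discarded).
    """
    lo, hi = asr_start - pad, asr_end + pad          # expanded third start-stop time
    window = [t for s, e, t in ocr_subtitles if e >= lo and s <= hi]
    if not window:
        return None
    long_text = "".join(window)                       # candidate captions joined
    head_snip, tail_snip = asr_text[:5], asr_text[-5:]
    head = _best_match_pos(head_snip, long_text)
    tail = _best_match_pos(tail_snip, long_text)
    if head is None or tail is None or tail + len(tail_snip) <= head:
        return None                                   # no acceptable anchors
    candidate = long_text[head:tail + len(tail_snip)]
    if len(candidate) / max(len(asr_text), 1) > max_len_ratio:
        return None                                   # length ratio beyond 1.5
    return candidate
```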
In the embodiment of the invention, the audio-video data with variant language and text content is obtained, the audio data in the audio-video data is processed to obtain the first text information, the multi-frame image data in the audio-video data is processed to obtain the second text information, the first text information and the second text information are matched to determine the target text content matched with the audio data, and the audio data and the target text content can be combined, so that dialect corpus which can be used as training data is obtained, the technical effect of improving the efficiency of obtaining the corpus is further realized, and the technical problem of low efficiency of obtaining the corpus is solved.
The embodiments of the present invention are further described below with reference to a specific implementation of the speech recognition dialect corpus processing method based on OCR and VAD.
Speech recognition techniques, which can be used to convert sound signals into readable text, are centered on acoustic models that are capable of recognizing phonemes, syllables, and sound features in speech. In general, the development of acoustic models relies on deep learning techniques such as deep neural networks (Deep Neural Network, abbreviated DNN) and convolutional neural networks to improve accuracy and robustness.
At the same time, speech recognition systems also need to incorporate language models in order to better understand and interpret the meaning of spoken language. The language model takes into account the relationships between words and phrases, helping the system more accurately translate speech input into text output. At the same time, large-scale training data is critical to improving the performance of speech recognition systems. These training data include speech samples and corresponding text transcriptions for machine learning models to learn and constantly optimize.
Dialects are variants of natural language, commonly used in a particular geographic area or community. Conventional speech recognition techniques often face dialect-diverse challenges because they are primarily trained and optimized for standard languages. To address this difficulty, a large amount of dialect material is required, including a broad coverage of dialect pronunciation, vocabulary, and grammar rules, to enable language recognition techniques to accurately recognize and understand the various dialects. The richness of the dialect corpus is critical to the training and performance of the language model, and needs to include dialect samples from various communities or geographical areas, covering various dialect variants and pronunciation differences, to train the language model. Meanwhile, the dialect corpus also needs to comprise representative samples of different ages, sexes, social groups and usage scenes so as to ensure the robustness and universality of the language model. The rich dialect corpus also helps to improve the acoustic and language models of the model, thereby improving the accuracy and usability of dialect speech recognition.
However, in the related art, dialect corpora are often difficult to obtain, and for some dialects spoken by relatively small populations in particular, training a model only through traditional recording and labeling methods is costly and time-consuming. Moreover, after subtitles are extracted by OCR, a suitable method is still needed to obtain the labeled text while taking both accuracy and retention into account.
In order to solve the above-mentioned problem in the language recognition process, in this embodiment, a speech recognition dialect corpus processing method based on OCR and VAD is provided, and the method can obtain speech recognition corpus which can be used for training a dialect speech recognition model from a dialect video with embedded subtitles, and correct the text of the corpus by utilizing the result of OCR to improve the quality of the corpus, and further improve the quality of the dialect speech recognition model, thereby realizing the technical effect of improving the accuracy of dialect recognition and solving the technical problem of low accuracy of dialect recognition.
FIG. 2 is a flow chart of an OCR- and VAD-based speech recognition dialect corpus processing method according to an embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:
step S201, audio/video data with subtitles is acquired.
In this embodiment, audio-video data that carries subtitles and whose audio data contains dialect speech may be downloaded.
Step S202, extracting the audio data of the audio-video data, and determining the first start-stop time of the human voice in the audio information.
In this embodiment, the audio information in the audio-video data is extracted. A voice endpoint detection (VAD) model based on an attention-mechanism neural network structure may then be used to process the extracted audio information in batches, that is, the voice endpoint detection model is used to determine the start point and end point of human voice in the audio information, which improves the smoothness of voice interaction dialogue and the user experience.
Optionally, the input features of the voice endpoint detection model may be Mel-frequency cepstral coefficients (MFCC), and the model may consist of a linear layer and a plurality of core modules. Each core module consists of a feedforward network module, a convolution module, a multi-head attention module and a layer normalization layer, and each module uses a residual connection.
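As an illustration of the structure just described, the following minimal PyTorch sketch stacks such core modules behind a linear projection of the MFCC features; the model dimension, number of heads, kernel width and number of blocks are assumed values chosen for illustration and are not specified by this disclosure.

```python
import torch
import torch.nn as nn

class CoreModule(nn.Module):
    """One VAD core module: feedforward, convolution, multi-head attention
    and layer normalization, each wrapped in a residual connection."""
    def __init__(self, dim=256, heads=4, kernel_size=15):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                     # x: (batch, frames, dim)
        x = x + self.ff(x)                                    # feedforward + residual
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolution + residual
        x = x + self.attn(x, x, x)[0]                         # self-attention + residual
        return self.norm(x)                                   # layer normalization

class VADModel(nn.Module):
    """Linear layer over MFCC frames, stacked core modules, and a
    two-way (speech / non-speech) frame classifier."""
    def __init__(self, n_mfcc=40, dim=256, n_blocks=4):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)
        self.blocks = nn.ModuleList(CoreModule(dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, 2)

    def forward(self, mfcc):                                  # mfcc: (batch, frames, n_mfcc)
        x = self.proj(mfcc)
        for block in self.blocks:
            x = block(x)
        return self.head(x)                                   # per-frame speech/noise logits
```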
Optionally, the voice endpoint detection model may perform a two-class classification between noise (including silence) and human voice, so that each spoken utterance in the audio can be located by the VAD and its start-stop time obtained, where the start-stop time includes a start time and an end time.
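The per-frame speech/noise decisions produced by such a model can then be collapsed into start-stop times. The sketch below assumes a 10 ms frame shift and a minimum segment duration; both values are illustrative choices rather than values given in this disclosure.

```python
def frames_to_segments(is_speech, frame_shift=0.01, min_dur=0.2):
    """Collapse a per-frame speech/noise decision sequence into
    (start_time, end_time) pairs, dropping very short segments."""
    segments, start = [], None
    for i, speech in enumerate(is_speech):
        if speech and start is None:
            start = i * frame_shift                # speech onset
        elif not speech and start is not None:
            end = i * frame_shift                  # speech offset
            if end - start >= min_dur:
                segments.append((start, end))
            start = None
    if start is not None:                          # speech runs to the end of the audio
        segments.append((start, len(is_speech) * frame_shift))
    return segments
```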
Step S203, voice recognition is carried out on the voice, and the first text content is determined.
In this embodiment, a Conformer-based dialect speech recognition model may first be trained using a small amount of collected dialect data (i.e., a dialect corpus). The model can be obtained by combining a CNN structure for acoustic feature extraction with a Transformer structure for context modeling; with this construction, the model can effectively capture the temporal and contextual information of the audio data, thereby improving recognition accuracy. The acoustic features extracted by the trained dialect speech recognition model may be Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC), and an encoder-decoder structure may be adopted, where the encoder may be configured to encode the acoustic features of the audio data into a context representation, and the decoder may be configured to convert the context representation into a text output to obtain the first text content.
Optionally, the speech segments whose start-stop times have been determined are converted using the trained dialect speech recognition model to obtain the initial text information, i.e., the first text content.
Alternatively, since the dialect speech recognition model is trained on only a small amount of corpus, the first text content obtained with it is not very accurate; the first text content is therefore corrected based on the second text content in a later step.
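A possible inference sketch for this step is shown below. MFCC extraction uses torchaudio as one convenient tool, and the model interface (model.encoder, model.decoder.greedy_decode) is a hypothetical encoder-decoder API standing in for the trained Conformer dialect model; neither choice is prescribed by this disclosure.

```python
import torch
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)

def recognize_segment(model, waveform, start, end, sample_rate=16000):
    """Run the trained dialect ASR model on one VAD segment and return its
    first text content. `model.encoder` / `model.decoder.greedy_decode`
    are a hypothetical interface, not an API defined by this disclosure."""
    chunk = waveform[:, int(start * sample_rate):int(end * sample_rate)]
    feats = mfcc_transform(chunk)                   # (channel, n_mfcc, frames)
    feats = feats.squeeze(0).transpose(0, 1)        # (frames, n_mfcc)
    with torch.no_grad():
        encoded = model.encoder(feats.unsqueeze(0))      # context representation
        text = model.decoder.greedy_decode(encoded)      # text output
    return text
```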
Step S204, intercepting the audio and video data to obtain multi-frame image data.
In this embodiment, since much of the audio-video data containing dialect audio is old, relatively long, and not very sharp, several frames of image data may be extracted per second for optical character recognition. To balance the efficiency of batch OCR processing, 4 frames per second may be extracted, i.e., two adjacent frames are 0.25 seconds apart, and the audio-video data is intercepted at this frequency. For example, from 30 minutes of video data, 7200 pictures can be extracted in this way to obtain the multi-frame image data. It should be noted that these values are merely illustrative, and the frame rate may be selected according to the actual situation without being limited thereto.
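Frame extraction at this rate can be sketched, for example, with OpenCV; the sampling is approximate because the video's native frame rate is rounded to a whole frame step.

```python
import cv2

def extract_frames(video_path, frames_per_second=4):
    """Sample the video at roughly 4 frames per second (adjacent samples
    about 0.25 s apart) and return (timestamp, image) pairs."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, image = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / native_fps, image))
        index += 1
    cap.release()
    return frames   # a 30-minute video yields roughly 30 * 60 * 4 = 7200 images
```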
Step S205, the text position and the second text content of the image data are identified.
In this embodiment, OCR may be performed on the intercepted image data using a Chinese detection model and a Chinese recognition model. The Chinese detection model may employ the DB (differentiable binarization) algorithm, and the Chinese recognition model may employ the SVTR algorithm.
Alternatively, the Chinese detection model may be used for text detection; with this model, text regions in the image data and the position information corresponding to the text regions can be located and extracted. The Chinese detection model may be a lightweight deep learning model, for example, the Efficient and Accurate Scene Text Detector (EAST). It should be noted that this text detection model is merely illustrative, and the type of text detection model is not particularly limited.
Alternatively, the text recognition model may be used to perform a text recognition task, i.e., the detected characters in the text region may be converted into readable text strings to obtain a third text content corresponding to the text region.
For example, the character recognition model may be constructed from a variational autoencoder and a Transformer model, using a sequence-to-sequence algorithm. With this algorithm, the character recognition model may first cut each text region into individual characters and send the cut characters to the variational autoencoder for feature extraction and encoding; the Transformer model is then used to decode and generate the corresponding text sequence, obtaining accurate third text content. Meanwhile, to improve robustness, the character recognition model adopts an attention mechanism to model the dependency relationships among different characters, thereby improving recognition accuracy.
Alternatively, the text region and the position information of the text in the image data can be determined by the Chinese detection model, and the specific content of the text region (the third text content) together with its initial recognition confidence can be recognized by the text recognition model.
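One open-source toolkit whose defaults correspond to a DB detector and an SVTR-style recognizer is PaddleOCR; the sketch below uses it purely as an example (this disclosure does not name a particular library), and the result format shown is an assumption that may differ between toolkit versions.

```python
from paddleocr import PaddleOCR  # assumption: a PaddleOCR 2.x style API

ocr = PaddleOCR(lang="ch")       # defaults: DB text detection, SVTR-style recognition

def ocr_frame(image):
    """Return (box, text, confidence) tuples for every text region in one frame;
    the nested result layout is the commonly documented one and may vary."""
    result = ocr.ocr(image)      # roughly: [[ [box, (text, score)], ... ]]
    regions = []
    for line in (result[0] or []):
        box, (text, score) = line
        regions.append((box, text, score))
    return regions
```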
Step S206, screening caption text from the second text content.
In this embodiment, since part of a picture may contain irrelevant, non-subtitle text such as a station logo or an advertisement, more than one text region may be identified. Therefore, after the third text content in all frames of image data has been recognized in batch, the positions where the third text content appears can be analysed statistically. Since the subtitles of the same video can be assumed to appear at approximately the same position, the approximate position with the highest frequency of occurrence across the multiple frames of image data can be obtained statistically. Because the subtitles are horizontal, the X-axis positions of the four endpoints of each subtitle are not fixed, but the Y-axis position is relatively fixed, so non-subtitle text can be screened out by determining the fixed Y-axis interval of the subtitle area. In this way, the subtitle text in the image data can be screened out of the third text content based on the position information.
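The position-based screening described above can be sketched as follows: the Y-axis centre of every recognized region is binned and the most frequent band is taken as the subtitle area. The 10-pixel bin tolerance is an assumed value, not one given in this disclosure.

```python
from collections import Counter

def subtitle_y_band(all_regions, tolerance=10):
    """Find the most frequent Y position of text regions across all frames;
    regions outside this band are treated as non-subtitle text
    (station logos, advertisements, and so on)."""
    y_centres = []
    for regions in all_regions:                        # one list of regions per frame
        for box, text, score in regions:
            ys = [p[1] for p in box]                   # box: four (x, y) corner points
            y_centres.append(round(sum(ys) / len(ys) / tolerance) * tolerance)
    most_common_y, _ = Counter(y_centres).most_common(1)[0]
    return most_common_y - tolerance, most_common_y + tolerance

def keep_subtitles(regions, y_band):
    """Keep only the regions whose Y centre falls inside the subtitle band."""
    low, high = y_band
    return [(box, text, score) for box, text, score in regions
            if low <= sum(p[1] for p in box) / len(box) <= high]
```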
Alternatively, since the subtitle text in one frame of image data may contain spaces, there may be multiple recognized text regions for the same subtitle; in this case, the multiple text regions can be merged to obtain a merged text. The initial recognition confidence of each third text content in the merged text is determined, and the average of these initial recognition confidences is taken as the target recognition confidence of the merged text. Through this step, triplet data containing the time information, the third text content and the target recognition confidence can be obtained for each frame of image data.
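Merging the remaining regions of one frame into a single subtitle string and averaging their confidences, as described above, might look like the following sketch, which returns the (time, text, confidence) triplet for that frame.

```python
def frame_to_triplet(timestamp, subtitle_regions):
    """Merge the subtitle regions of one frame (left to right) into a single
    text and average their recognition confidences, giving the
    (time, third text content, target recognition confidence) triplet."""
    if not subtitle_regions:
        return None
    ordered = sorted(subtitle_regions, key=lambda r: min(p[0] for p in r[0]))
    merged_text = "".join(text for _, text, _ in ordered)
    mean_conf = sum(score for _, _, score in ordered) / len(ordered)
    return (timestamp, merged_text, mean_conf)
```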
Step S207, the subtitles of the adjacent image data are processed.
In this embodiment, it is determined whether the subtitles of adjacent pictures are the same; if so, they are merged, and finally the start-stop time and content of each subtitle are obtained, while subtitles that are too short or whose OCR confidence is too low are screened out.
Optionally, the resulting series of triplet data may be further processed.
Alternatively, since several adjacent pictures actually correspond to the same subtitle, they need to be merged together. However, because of low definition, errors in the OCR model results and other factors, different pictures of the same subtitle are often recognized as different texts, so it is first necessary to determine whether adjacent pictures correspond to the same subtitle.
Alternatively, the similarity of adjacent texts is calculated, and if the similarity is greater than a threshold, the adjacent texts are considered to correspond to the same subtitle; after experiments, the threshold is set here to 0.5. Starting from the first frame in which a subtitle appears, comparison proceeds backwards: if the similarity is greater than the threshold, the text with the higher confidence in the recognition results is taken as the correct text for the current subtitle, and the start and end times are updated. If a frame in which OCR recognized no text is encountered, it is skipped and comparison continues backwards. When the similarity with the correct text falls below the threshold, the current frame is regarded as the start of a new subtitle and the previous frame as the end of the previous subtitle, and the merging operation then continues for the next subtitle from the current frame. Through the above processing, triplets including start time, end time, text and recognition confidence can be obtained. Any Arabic numerals contained therein are then removed, because a number such as 10987 recognized by OCR may be read aloud in different ways and therefore cannot be reliably aligned with the speech.
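A sketch of this merging step is given below; it uses difflib.SequenceMatcher as one possible similarity measure (this disclosure does not fix a particular measure) together with the 0.5 threshold mentioned above, and strips Arabic numerals at the end.

```python
import re
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; one possible choice of measure."""
    return SequenceMatcher(None, a, b).ratio()

def merge_adjacent_subtitles(triplets, threshold=0.5):
    """Merge consecutive (time, text, confidence) triplets that belong to the
    same subtitle: keep the highest-confidence text as the correct one and
    record the subtitle's start and end time; strip Arabic numerals."""
    subtitles, current = [], None
    for t, text, conf in triplets:
        if not text:                                    # OCR found nothing: skip the frame
            continue
        if current is None:
            current = {"start": t, "end": t, "text": text, "conf": conf}
        elif similarity(text, current["text"]) > threshold:
            current["end"] = t                          # same subtitle, extend it
            if conf > current["conf"]:                  # keep the more confident text
                current["text"], current["conf"] = text, conf
        else:                                           # a new subtitle starts here
            subtitles.append(current)
            current = {"start": t, "end": t, "text": text, "conf": conf}
    if current:
        subtitles.append(current)
    for sub in subtitles:
        sub["text"] = re.sub(r"\d+", "", sub["text"])   # remove Arabic numerals
    return subtitles
```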
Optionally, triplets with a recognition confidence below 0.8 are screened out, since if the confidence of every frame of a certain subtitle fails to reach 0.8, the recognition can be considered poor. Triplets that are too short in duration, i.e., whose OCR text appears in only 2 frames or fewer, are also screened out, since recognition results that are too short are very likely erroneous. A large portion of erroneous text can be filtered out by the above steps. It should be noted that the above numbers are merely illustrative and may be selected according to the actual situation; they are not particularly limited here.
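The screening of low-confidence and too-short subtitles can be sketched as follows; the 0.8 confidence floor and the 2-frame minimum follow the values mentioned above, while the 0.25 s frame gap assumes the 4 frames-per-second sampling of step S204.

```python
def filter_subtitles(subtitles, min_conf=0.8, min_frames=3, frame_gap=0.25):
    """Drop subtitles whose best confidence stays below 0.8 or that were
    seen on two frames or fewer (duration below about 0.5 s at 4 fps)."""
    kept = []
    for sub in subtitles:
        duration = sub["end"] - sub["start"]
        if sub["conf"] < min_conf:
            continue                                   # confidence too low
        if duration < (min_frames - 1) * frame_gap:
            continue                                   # too few frames / too short
        kept.append(sub)
    return kept
```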
Step S208, matching the ASR result and the OCR result in the near time.
In this embodiment, the ASR result (i.e., the first text content) and the OCR result (i.e., the second text content) that are close in time are matched; if they match, the ASR result is replaced with the OCR result, that is, OCR text time-stamped with the VAD result is obtained as the labeled speech recognition dialect corpus.
Since the OCR and ASR results are roughly aligned only by the time stamp, there is a problem in that the accuracy of dialect recognition is low; meanwhile, if the judgment and screening are performed manually, there is a problem that the efficiency of data processing is low. To solve the above problem, in this embodiment, the obtained VAD and ASR results are corrected using the obtained OCR results.
Alternatively, because subtitles appear in the video only around the time a person is speaking, while the VAD gives the start-stop time of each spoken utterance obtained at the acoustic level, the timestamp given by the VAD result is more accurate; the OCR timestamp can only serve as a reference to help correct the ASR text with the OCR result.
For example, because text that is too short may affect the accuracy of matching ASR text with OCR text, ASR results that are too short may first be merged, e.g., into text corresponding to speech between 7 and 15 seconds long. For each ASR recognition text, the corresponding subtitle generally appears between three seconds before the start time given by the VAD and three seconds after the end time given by the VAD; for convenience, this interval is denoted [start, end]. Since the start-stop time of each OCR result has been determined in step S205, an OCR subtitle is regarded as a candidate match for the ASR result as long as part of its on-screen time falls within [start, end]. All such candidate subtitles are then concatenated into one long text, within which the OCR text corresponding to the current ASR text can be assumed to exist. A matching algorithm is then used to locate the specific OCR text corresponding to the ASR text: the first 5 characters and the last 5 characters of the ASR text are each matched against the long text, positions with a matching degree of 0.8 or above are found, and the span between these two positions (inclusive) is taken as the corresponding OCR text. The end of the previous ASR sentence and the first 5 characters of the next ASR sentence are also matched against the long text; if a match is found, that position is likewise treated as a possible sentence start or end. If several positions satisfy the matching requirement, the similarity between each candidate OCR text obtained by these rules and the original ASR text is calculated, and the candidate with the highest similarity whose OCR-to-ASR length ratio is within 1.5 is considered the correct text. If no position satisfying the matching requirement is found, the match is abandoned, i.e., the ASR text is discarded. Finally, the original ASR text is replaced with the OCR result to obtain the corrected speech recognition dialect corpus.
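The matching procedure of this step can be sketched as below for a single ASR segment. The 3-second window, 5-character anchors, 0.8 matching degree and 1.5 length ratio follow the values above, while the anchor search and the length check (upper bound only) are simplifying assumptions of this sketch.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def match_asr_to_ocr(asr_start, asr_end, asr_text, subtitles,
                     pad=3.0, anchor_len=5, anchor_score=0.8, max_len_ratio=1.5):
    """Correct one ASR segment with the OCR subtitles whose on-screen time
    overlaps [asr_start - pad, asr_end + pad]; return the OCR text or None."""
    window = (asr_start - pad, asr_end + pad)
    candidates = [s for s in subtitles
                  if s["start"] < window[1] and s["end"] > window[0]]
    if not candidates:
        return None
    long_text = "".join(s["text"] for s in sorted(candidates, key=lambda s: s["start"]))

    def best_anchor(snippet):
        """Slide the short snippet over the long text and return the position
        with the best matching degree, or None below the threshold."""
        best_pos, best_score = None, 0.0
        for i in range(len(long_text) - len(snippet) + 1):
            score = similarity(snippet, long_text[i:i + len(snippet)])
            if score > best_score:
                best_pos, best_score = i, score
        return best_pos if best_score >= anchor_score else None

    head = best_anchor(asr_text[:anchor_len])        # first 5 characters of the ASR text
    tail = best_anchor(asr_text[-anchor_len:])       # last 5 characters of the ASR text
    if head is None or tail is None or tail < head:
        return None                                  # no reliable match: discard this ASR text
    ocr_text = long_text[head:tail + anchor_len]
    if len(ocr_text) > max_len_ratio * len(asr_text):
        return None                                  # length ratio check
    return ocr_text
```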
In this embodiment, audio-video data having both variant-language speech and text content is acquired; the audio data in the audio-video data is processed to obtain the first text content, and the multi-frame image data in the audio-video data is processed to obtain the second text content. The first text content and the second text content are matched to determine the target text content that matches the audio data, and the audio data and the target text content can be combined, so that a dialect corpus usable as training data is obtained, thereby achieving the technical effect of improving the efficiency of obtaining corpora and solving the technical problem of low corpus acquisition efficiency.
It should be still noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood and appreciated by those skilled in the art that the present invention is not limited by the order of acts, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the related art in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present invention.
In this embodiment, a corpus obtaining device is further provided, and this device is used to implement the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the terms "unit," "means" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
According to an embodiment of the present invention, there is further provided an embodiment of an apparatus for obtaining the corpus, and fig. 3 is a schematic structural diagram of an apparatus for obtaining the corpus according to an embodiment of the present invention, as shown in fig. 3, where the apparatus for obtaining the corpus includes: an acquisition unit 302, a first determination unit 304, a processing unit 306, a second determination unit 308, and a combination unit 310.
The acquiring unit 302 is configured to acquire audio and video data, where the audio and video data includes audio data for representing a variant language, and image data.
The first determining unit 304 is configured to determine a first text content corresponding to the audio data, where the first text content is used to describe the audio data, and a second text content corresponding to the image data, where the second text content is used to describe the image data.
The processing unit 306 is configured to match the first text content with the second text content to obtain matching information, where the matching information is used to represent a matching degree between the first text content and the second text content.
The second determining unit 308 is configured to determine, based on the matching information, a target text content that matches the audio data from the first text content and the second text content.
The combining unit 310 is configured to combine the audio data and the target text content to obtain a target corpus.
In the embodiment of the present invention, the above-mentioned obtaining unit 302 obtains audio-video data, where the audio-video data includes audio data for representing a variant language, and image data. Through the first determining unit 304, a first text content corresponding to the audio data and a second text content corresponding to the image data are determined, where the first text content is used to describe the audio data and the second text content is used to describe the image data. The processing unit 306 performs matching on the first text content and the second text content to obtain matching information, where the matching information is used to represent a matching degree between the first text content and the second text content. By the above-described second determination unit 308, the target text content that matches the audio data is determined from the first text content and the second text content based on the matching information. The combination unit 310 is configured to combine the audio data and the target text content to obtain the target corpus, thereby realizing the technical effect of improving the efficiency of obtaining the corpus and solving the technical problem of low efficiency of obtaining the corpus.
It should be noted that each of the above modules may be implemented by software or hardware, for example, in the latter case, it may be implemented by: the above modules may be located in the same processor; or the various modules described above may be located in different processors in any combination.
Here, it should be noted that the acquiring unit 302, the first determining unit 304, the processing unit 306, the second determining unit 308, and the combining unit 310 correspond to steps S102 to S110 in the embodiment, and the above-mentioned units are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above-mentioned embodiments. It should be noted that the above modules may be run in a computer terminal as part of the apparatus.
It should be noted that, the optional or preferred implementation manner of this embodiment may be referred to the related description in the embodiment, and will not be repeated herein.
The corpus acquisition device may further include a processor and a memory, where the acquisition unit 302, the first determination unit 304, the processing unit 306, the second determination unit 308, the combination unit 310, and the like are stored as program modules, and the processor executes the program modules stored in the memory to implement corresponding functions.
The processor comprises a kernel; the kernel accesses the memory to call the corresponding program module, and one or more kernels may be provided. The memory may include such forms of computer-readable media as volatile memory, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
According to an embodiment of the present invention, there is also provided an embodiment of a nonvolatile storage medium. Optionally, in this embodiment, the nonvolatile storage medium includes a stored program, where when the program runs, the device in which the nonvolatile storage medium is controlled to execute the method for acquiring any corpus.
Alternatively, in this embodiment, the above-mentioned nonvolatile storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network or in any one of the mobile terminals in the mobile terminal group, and the above-mentioned nonvolatile storage medium includes a stored program.
Optionally, the program, when running, controls the device in which the nonvolatile storage medium is located to perform the following functions: acquiring audio and video data, wherein the audio and video data comprises audio data for representing variant languages and image data; determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; determining target text content matched with the audio data from the first text content and the second text content based on the matching information; and combining the audio data and the target text content to obtain a target corpus.
According to an embodiment of the present invention, there is also provided an embodiment of a processor. Optionally, in this embodiment, the processor is configured to run a program, where the program runs to execute any one of the corpus acquiring methods.
According to an embodiment of the present invention, there is also provided an embodiment of a computer program product adapted to perform a program initializing the steps of the corpus acquisition method of any one of the above, when the program is executed on a data processing device.
Optionally, the computer program product mentioned above, when executed on a data processing device, is adapted to perform a program initialized with the method steps of: acquiring audio and video data, wherein the audio and video data comprises audio data for representing variant languages and image data; determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; determining target text content matched with the audio data from the first text content and the second text content based on the matching information; and combining the audio data and the target text content to obtain a target corpus.
Fig. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, an embodiment of the present invention provides an electronic device, where the electronic device 40 includes a processor, a memory, and a program stored on the memory and executable on the processor, and the processor implements the following steps when executing the program: acquiring audio and video data, wherein the audio and video data comprises audio data for representing variant languages and image data; determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data; matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content; determining target text content matched with the audio data from the first text content and the second text content based on the matching information; and combining the audio data and the target text content to obtain a target corpus.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present invention, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the modules may be a logic function division, and there may be another division manner when actually implemented, for example, a plurality of modules or components may be combined or may be integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with respect to each other may be through some interface, module or indirect coupling or communication connection of modules, electrical or otherwise.
The modules described above as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable non-volatile storage medium. Based on such understanding, the technical solution of the present invention may be essentially or a part contributing to the related art or all or part of the technical solution may be embodied in the form of a software product stored in a non-volatile storage medium, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. And the aforementioned nonvolatile storage medium includes: a usb disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (13)

1. The corpus acquisition method is characterized by comprising the following steps of:
Acquiring audio and video data, wherein the audio and video data comprises audio data for representing a variant language and image data;
Determining first text content corresponding to the audio data and second text content corresponding to the image data, wherein the first text content is used for describing the audio data, and the second text content is used for describing the image data;
Matching the first text content with the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content;
Determining target text content matched with the audio data from the first text content and the second text content based on the matching information;
and combining the audio data and the target text content to obtain a target corpus.
2. The method of claim 1, wherein determining the first text content corresponding to the audio data in the audio-video data comprises:
and recognizing human voice appearing in the audio data by using a voice recognition model to obtain the first text content, wherein the voice recognition model is used for recognizing the variant language.
3. The method of claim 1, wherein determining the second text content corresponding to the image data in the audio-video data comprises:
Determining a text region in the image data and position information corresponding to the text region;
identifying the text region to obtain third text content in the text region and initial identification confidence corresponding to the third text content;
and obtaining the second text content based on the position information, the third text content and the initial recognition confidence.
4. A method according to claim 3, wherein determining the text region in the image data comprises:
Determining at least one sub-text region in the image data, wherein the sub-text region comprises text whose arrangement angle satisfies a horizontal threshold;
And clustering and merging the at least one sub-text region to obtain the text region.
5. The method of claim 3, wherein obtaining the second text content based on the position information, the third text content and the initial recognition confidence comprises:
Screening caption text in the image data from the third text content based on the position information;
Merging the caption text in the image data to obtain a merged text;
Determining a target recognition confidence corresponding to the merged text based on the initial recognition confidence;
and obtaining the second text content based on the target recognition confidence, the time information corresponding to the image data and the merged text.
6. The method of claim 5, wherein obtaining the second text content based on the target recognition confidence, the time information corresponding to the image data, and the merged text comprises:
determining adjacent frame images of the image data;
Determining a first similarity of the merged text in the image data and the merged text in the adjacent frame image;
In response to the first similarity being greater than a similarity threshold, determining, in the merged text corresponding to the adjacent frame image, the merged text whose target recognition confidence is greater than a confidence threshold as the second text content;
and updating the text content of the merged text in the image data and the text content of the merged text in the adjacent frame image to the second text content.
7. The method of claim 1, wherein matching the first text content and the second text content to obtain matching information comprises:
Performing voice positioning on the audio data to obtain a first start-stop time of the first text content corresponding to at least one human voice in the audio data;
Determining a second start-stop time of the second text content;
and matching the first text content with the second text content based on the first start-stop time and the second start-stop time to obtain the matching information.
8. The method of claim 7, wherein determining the second start-stop time of the second text content comprises:
Determining, in the multi-frame image data of the audio-video data, the first image data whose text content is the second text content and the last image data whose text content is the second text content;
and determining the second start-stop time based on the time at which the first image data appears and the time at which the last image data appears.
9. The method of claim 7, wherein matching the first text content and the second text content based on the first start-stop time and the second start-stop time to obtain the matching information comprises:
Expanding the first start-stop time to obtain a third start-stop time;
determining, from a plurality of second text contents, at least one matching text content whose second start-stop time falls within the third start-stop time;
and matching the at least one matching text content with the first text content respectively to obtain the matching information.
10. The corpus acquisition device is characterized by comprising:
An acquisition unit configured to acquire audio-video data, wherein the audio-video data includes audio data for representing a variant language, and image data;
A first determining unit, configured to determine a first text content corresponding to the audio data, and a second text content corresponding to the image data, where the first text content is used to describe the audio data, and the second text content is used to describe the image data;
The processing unit is used for matching the first text content and the second text content to obtain matching information, wherein the matching information is used for representing the matching degree between the first text content and the second text content;
a second determining unit configured to determine, based on the matching information, a target text content that matches the audio data from the first text content and the second text content;
And the combination unit is used for combining the audio data and the target text content to obtain a target corpus.
11. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
12. A non-volatile storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-10.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-10.
CN202410132418.1A 2024-01-30 2024-01-30 Corpus acquisition method and device, electronic equipment, storage medium and program product Pending CN118038852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410132418.1A CN118038852A (en) 2024-01-30 2024-01-30 Corpus acquisition method and device, electronic equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410132418.1A CN118038852A (en) 2024-01-30 2024-01-30 Corpus acquisition method and device, electronic equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN118038852A true CN118038852A (en) 2024-05-14

Family

ID=90983410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410132418.1A Pending CN118038852A (en) 2024-01-30 2024-01-30 Corpus acquisition method and device, electronic equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN118038852A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination