CN111860389A - Data processing method, electronic device and computer readable medium


Info

Publication number
CN111860389A
Authority
CN
China
Prior art keywords: video frame, text, frame image, image, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010733797.1A
Other languages
Chinese (zh)
Inventor
秦勇
李兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202010733797.1A
Publication of CN111860389A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses a data processing method, electronic equipment and a computer readable medium, wherein the data processing method comprises the following steps: performing text detection on a first text image to obtain information of a text area in the first text image; according to the information of the text area, image interception is carried out on the first text image, and a corresponding first intercepted image which does not contain a text is obtained; acquiring a plurality of text sentences, and fusing the text sentences with the first captured image respectively to obtain a plurality of second text images; and constructing a training sample for training a text recognition model by taking the plurality of second text images as sample images and taking the text content of the text sentence corresponding to each second text image as the text label of the second text image. By the embodiment of the invention, the construction efficiency of the training sample for training the text recognition model is improved.

Description

Data processing method, electronic device and computer readable medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a data processing method, electronic equipment and a computer readable medium.
Background
With the development of machine learning technology, neural network models have made great progress in various applications. For example, neural network models are widely used in speech recognition, text recognition, and the like.
Although the recognition accuracy of neural network models based on machine learning techniques is already quite high in many respects, machine learning has inherent limitations, such as requiring a large amount of training data to train the neural network models and requiring a large amount of data processing. At present, training data are commonly obtained by manually collecting data and manually labeling them, and the larger the scale of the training data, the better the training effect. Taking speech recognition as an example, a speech recognition model takes speech segments as input and outputs recognized text sentences, so it requires a large number of speech segments and their corresponding text sentences as training data. The same is true of neural network models used for text recognition, which likewise require a large number of text images as training data for model training.
Meanwhile, in the application stage of these models, all data, such as all text images to be processed, needs to be processed, and the amount of data to be processed is huge.
Therefore, existing neural network models suffer either from low training efficiency, because training data must be manually collected and labeled, or from low data processing efficiency, because the amount of data to be processed is large. In either case, the overall processing efficiency of the neural network model is affected.
Disclosure of Invention
The present invention provides a data processing scheme to at least partially address one of the above-mentioned problems.
According to a first aspect of the embodiments of the present invention, there is provided a data processing method, including: performing text detection on a first text image to obtain information of a text area in the first text image; according to the information of the text area, image interception is carried out on the first text image, and a corresponding first intercepted image which does not contain a text is obtained; acquiring a plurality of text sentences, and fusing the text sentences with the first captured image respectively to obtain a plurality of second text images; and constructing a training sample for training a text recognition model by taking the plurality of second text images as sample images and taking the text content of the text sentence corresponding to each second text image as the text label of the second text image.
According to a second aspect of the embodiments of the present invention, there is provided another data processing method, including: acquiring a video frame image sequence from a video; respectively carrying out text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image; performing text recognition on video frame images according to the information of the subtitle region by using a text recognition model to obtain at least one video frame image set and subtitle content corresponding to the video frame image set, wherein the subtitle region corresponding to each video frame image in the video frame image set meets a preset similarity, and the text recognition model is obtained by training based on a training sample constructed by the data processing method in the first aspect; determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set; obtaining audio data corresponding to the video starting time point and the video ending time point from the video; and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
According to a third aspect of the embodiments of the present invention, there is provided another data processing method, including: acquiring a video frame image sequence from a video; respectively carrying out text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image; performing subtitle similarity judgment on video frame images in the video frame image sequence according to the information of the subtitle region, and acquiring at least one video frame image set according to a judgment result; selecting one video frame image from each video frame image set for subtitle identification to obtain subtitle content corresponding to each video frame image set; determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set; acquiring audio data corresponding to the video starting time point and the video ending time point from the video; and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, the apparatus including: one or more processors; a computer readable medium configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a data processing method according to the first aspect, or the second aspect, or the third aspect.
According to a fifth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements a data processing method as described in the first aspect, or the second aspect, or the third aspect.
According to the data processing scheme provided by the embodiment of the invention, image interception is performed on the first text image, and after the first intercepted image without text is obtained, new second text images are formed by using the pre-obtained text sentences. Therefore, the text image can be expanded based on a small number of text images to form a large number of text image training samples; in addition, the text sentences acquired in advance are in text form and can be directly used as the text labels of the new second text images without manual operation, so that the construction efficiency of training samples for training the text recognition model is greatly improved. Furthermore, the overall processing efficiency of the neural network model is indirectly improved.
Furthermore, in another data processing scheme, a video frame image sequence with information of a caption area obtained is processed through a text recognition model obtained by training a training sample constructed by the data processing scheme to obtain a video frame image set and corresponding caption content; further, audio data in the time period are obtained according to the video starting time point and the video ending time point of the video frame image set; after the audio data are obtained, training data can be constructed by combining the subtitle content obtained by recognition and used for training the voice recognition model, so that the training sample of the voice recognition model can be constructed quickly and at low cost. Furthermore, the overall processing efficiency of the speech recognition model is indirectly improved.
According to another data processing scheme provided by the embodiment of the invention, when text recognition such as subtitle recognition in a video is performed, each video frame image in the video is not recognized any more, and a video frame image set with a subtitle with a certain similarity, such as a video frame image set with the same subtitle, is determined from a plurality of video frame images in a video frame image sequence according to the similarity of the subtitle. Furthermore, one video frame image may be selected from the set to perform caption recognition, and caption content may be obtained. For a video frame image set, a plurality of video frame images are usually included, and the plurality of video frame images have the same subtitle, so that subtitle recognition of one of the plurality of video frame images can realize subtitle recognition of all the video frame images in the set. Therefore, the data processing burden of subtitle recognition is greatly reduced, and the data processing efficiency is improved. Particularly, when the neural network model is adopted for subtitle recognition, the data processing burden of the neural network model is greatly reduced, and the data processing efficiency of the neural network model is improved. Further, the audio data in the time period is obtained by taking the video starting time point and the video ending time point of the video frame image set as the basis; after the audio data are obtained, training data can be constructed by combining the subtitle content obtained by recognition and used for training the voice recognition model, so that the training sample of the voice recognition model can be constructed quickly and at low cost. Furthermore, the overall processing efficiency of the speech recognition model is indirectly improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a flow chart illustrating steps of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a data processing method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of steps of a data processing method according to a third embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a data processing method according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Example one
Referring to fig. 1, a flowchart illustrating steps of a data processing method according to a first embodiment of the present invention is shown.
In this embodiment, a data processing scheme provided in the embodiment of the present invention is described from the perspective of training sample construction, and the data processing method in this embodiment includes the following steps:
step S102: and performing text detection on the first text image to obtain information of a text area in the first text image.
In this embodiment, the first text image may be any suitable image containing text, including but not limited to: text images of plain text, various scene images containing text, video frame images with subtitles, and the like.
Text detection is a technology for detecting a text area in an image and marking its boundary, namely a text box. Text detection of the first text image can be realized in any appropriate manner, and at present many neural network models can realize fairly accurate text detection. In one possible approach, a DB (Differentiable Binarization) model may be used to perform text detection on the first text image, so as to obtain information of the text region in the first text image.
The DB model is a neural network model with ResNet18 as the basic network architecture. The input image is fed to a feature-pyramid backbone; the pyramid features are upsampled to the same size and concatenated to generate a feature F; a probability map (P) and a threshold map (T) are then simultaneously predicted from the feature F; finally, the text area is distinguished from the background area by a differentiable binarization function, so as to realize the detection of the text area. Compared with other text detection modes, using the DB model for text detection yields more accurate detection results. It will be apparent to those skilled in the art that other text detection methods, or other neural network models capable of text detection such as the PAN (Pixel Aggregation Network) model, may be equally suitable in practical applications.
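For illustration only, the differentiable binarization step described above can be sketched roughly as follows. This is a minimal NumPy sketch, not the patent's implementation; it assumes the probability map P and the threshold map T have already been predicted by the network, and the amplification factor k is an assumed default.

```python
import numpy as np

def differentiable_binarization(prob_map: np.ndarray,
                                thresh_map: np.ndarray,
                                k: float = 50.0) -> np.ndarray:
    """Approximate binary map: B = 1 / (1 + exp(-k * (P - T)))."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Hypothetical usage: prob_map and thresh_map are HxW arrays in [0, 1]
# predicted by a DB-style model; pixels above 0.5 are treated as text.
# binary_map = differentiable_binarization(prob_map, thresh_map) > 0.5
```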
Step S104: according to the information of the text area, image interception is carried out on the first text image, and a corresponding first intercepted image which does not contain the text is obtained; and acquiring a plurality of text sentences, and fusing the plurality of text sentences with the first captured image respectively to acquire a plurality of second text images.
In order to obtain text images as much as possible to construct a large-scale training sample, in this embodiment, an existing text image, i.e., a first text image, is used to perform a clipping process to obtain a portion that does not include text, i.e., a first clipped image, and a pre-obtained text sentence is added to the first clipped image to form a new text image. By the method, on one hand, a new image does not need to be acquired, and the existing image is utilized to the maximum extent; on the other hand, because the sentence in the text form is acquired, the sentence can be directly used as the text label of the second text image without manual labeling.
It should be noted that the image interception of the first text image may be implemented in any appropriate manner. In a feasible manner, an image-interception demarcation line may be determined according to the information of the text region, and image interception is then performed on the first text image along that demarcation line to obtain a first intercepted image without text. For example, if the text region is located at the bottom of the first text image, the horizontal line where its upper boundary is located may be used as the demarcation line, dividing the first text image into an upper portion and a lower portion, where the upper portion does not contain text and the lower portion contains text. For another example, if the text region is located at the top of the first text image, the horizontal line where its lower boundary is located may be used as the demarcation line, dividing the first text image into an upper portion containing text and a lower portion not containing text. For another example, if the text region is located on the left side of the first text image, the vertical line at its right boundary may be used as the demarcation line, dividing the first text image into a left portion containing text and a right portion not containing text. Similarly, if the text region is located on the right side of the first text image, after image interception the first text image is divided into a left portion that does not contain text and a right portion that contains text.
With this method, accurate matting of the text region is not required, the implementation is simple, and the image-interception efficiency is high. It should be apparent to those skilled in the art that a text-region matting approach is equally applicable to embodiments of the present invention.
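As a concrete illustration of the boundary-based interception described above, the following is a minimal sketch assuming the text region is given as an axis-aligned box (x, y, w, h) located at the bottom of the image; the function and variable names are hypothetical.

```python
import cv2  # only needed for reading the image in the usage example

def intercept_by_text_box(image, text_box):
    """Split an image along the upper boundary of a bottom text region.

    Returns (first_crop, second_crop): the part that contains no text and
    the part that contains the text region, respectively.
    """
    x, y, w, h = text_box          # e.g. output of a text detection model
    first_crop = image[:y, :]      # above the demarcation line: no text
    second_crop = image[y:, :]     # below the demarcation line: has text
    return first_crop, second_crop

# img = cv2.imread("first_text_image.jpg")
# no_text_part, text_part = intercept_by_text_box(img, (40, 620, 560, 60))
```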
The text sentences may be obtained in any suitable manner, for example from prepared novels or other texts, or by crawling novels or other texts from the network, so that text sentences are obtained quickly and at low cost. Further, adding a text sentence into the first intercepted image may be implemented in any appropriate manner according to actual needs by those skilled in the art, for example by the putText() function of OpenCV, and the embodiment of the present invention is not limited thereto.
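A minimal sketch of fusing a text sentence into the first intercepted image with OpenCV's putText(), as mentioned above. The position, font and colour are arbitrary illustrative choices; note also that cv2.putText only renders ASCII glyphs, so for Chinese sentences a different rendering path (e.g. PIL's ImageDraw) would be needed in practice; that caveat is an assumption, not part of the patent.

```python
import cv2

def fuse_sentence(no_text_image, sentence: str):
    """Draw a sentence near the bottom edge of a text-free crop (sketch)."""
    fused = no_text_image.copy()
    h = fused.shape[0]
    cv2.putText(fused, sentence,
                org=(10, h - 15),                   # bottom-left text anchor
                fontFace=cv2.FONT_HERSHEY_SIMPLEX,
                fontScale=1.0,
                color=(255, 255, 255),
                thickness=2)
    return fused

# second_text_image = fuse_sentence(no_text_part, "an example sentence")
# label = "an example sentence"   # the sentence doubles as the text label
```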
In specific implementation, a plurality of text sentences and the same first cut image can be fused to form a plurality of second text images, so that the image generation cost is saved; after the plurality of first intercepted images are obtained, different text sentences and different first intercepted images can be fused to form a plurality of second text images so as to improve the richness of the second text images; of course, a part of text sentences may be fused with the same first truncated image, and another part of text sentences may be fused with different first truncated images, which are all applicable to the embodiment of the present invention. In specific use, a person skilled in the art can flexibly select and use the method according to actual needs, and the embodiment of the invention is not limited to this.
And adding the text sentence into the image part which does not contain the text in the first text image, namely the first intercepted image, so as to obtain a second text image.
Step S106: and taking the plurality of second text images as sample images, and taking the text content of the text sentence corresponding to each second text image as the text label of the second text image to construct a training sample for training the text recognition model.
As mentioned above, since the text sentence is in text form, the text content thereof can be directly labeled as the text of the second text image. And forming a training sample which can be used for training the text recognition model through the second text image and the corresponding text label.
Although only the construction of one training sample is taken as an example in the present embodiment, it should be understood by those skilled in the art that a large scale training sample can be formed when the data processing operation of the present embodiment is performed on a large number of original text images. And, by image-capturing an original text image and adding different text sentences, a plurality of new second text images can also be formed. Accordingly, the number and size of training samples can be expanded. Further, the text recognition model may be trained using these training samples. The text recognition model may be any suitable text recognition model, including but not limited to a Convolutional Recurrent Neural Network (CRNN) model, and the like.
In addition, for the image capturing operation of the first text image, in addition to the first captured image, a second captured image containing a text may also be obtained. For this second truncated image, optionally, the following optional steps may also be performed.
Step S108: and constructing a training sample for training a text similarity model for text similarity judgment according to the result of image interception on the first text image.
The method comprises the following steps: acquiring a second intercepted image which comprises a text and corresponds to the first text image according to the result of image interception of the first text image; randomly combining the second intercepted image corresponding to the first text image and the second intercepted images corresponding to other text images in pairs; determining the image pair with the same text as a positive sample and determining the image pair with different texts as a negative sample in the second intercepted image pair obtained after random combination; and constructing a training sample for training a text similarity model for text similarity judgment according to the positive sample and the negative sample.
As described above, regardless of whether the first text image is cut into the upper and lower parts or the left and right parts, both parts of the image can be fully utilized, the part not containing the text can be added into the text sentence to form a new text image, and the part containing the text can form an image pair together with the second cut-out image corresponding to other text images to serve as a training sample for training the text similarity model.
For example, suppose 100 original text images (first text images) are obtained and 200 text sentences are acquired. With the image interception processing of this embodiment, on one hand, up to 100 × 200 + 100 text images can be obtained: the first intercepted image corresponding to each original text image can be combined with each of the 200 text sentences, so the 100 first intercepted images yield 100 × 200 new second text images, and the + 100 term is the original 100 text images. On the other hand, each of the 100 second intercepted images obtained by image interception of the 100 original text images may be combined with other such images to form image pairs; since the texts in the 100 original text images may be the same or different, a combined pair may be an image pair with the same text (positive sample) or an image pair with different texts (negative sample). It can be seen that in this way a large number of training samples can be obtained at low cost and with high efficiency.
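The pairing of text-containing crops into positive and negative samples could be sketched as follows; it assumes each second intercepted image carries its ground-truth text so that equality can be checked, and all names are illustrative.

```python
import itertools
import random

def build_similarity_pairs(crops):
    """crops: list of (image, text) tuples for the second intercepted images.

    Returns (positives, negatives): image pairs with identical text and
    image pairs with different text, respectively.
    """
    index_pairs = list(itertools.combinations(range(len(crops)), 2))
    random.shuffle(index_pairs)
    positives, negatives = [], []
    for i, j in index_pairs:
        (img_i, text_i), (img_j, text_j) = crops[i], crops[j]
        (positives if text_i == text_j else negatives).append((img_i, img_j))
    return positives, negatives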
It should be noted that, after the image interception is performed on the first text image, the branch in which the first intercepted image is obtained and processed through to the generation of the second text images, and the branch in which the second intercepted image is obtained and processed through to the formation of image pairs, may be performed in either order or in parallel.
When the scheme of the embodiment is applied to the video frame image, the first text image is a video frame image containing subtitles, the text region is a subtitle region, and the pre-acquired text sentence is a text sentence in a novel crawled from a network. A complete video or video clip usually contains a large number of video frame images, when the video or video clip has subtitles, the video or video clip becomes rich resources capable of performing model training, and text sentences in novels crawled from the network are text forms, and are combined with the video frame images subjected to image interception, so that a large number of sample images used for model training do not need to be collected, manual annotation does not need to be performed, the construction speed and efficiency of a training sample are greatly improved, and the construction cost of the training sample is reduced.
According to the embodiment, the image is intercepted based on the first text image, and after the first intercepted image without the text is obtained, a new second text image is formed by using the pre-acquired text sentence. Therefore, the text image can be expanded based on a small number of text images to form a large number of text image training samples; in addition, the text sentences acquired in advance are in a text form, and can be directly used as the text labels of the new second text images without manual operation, so that the construction efficiency of training samples for training the text recognition model is greatly improved. Furthermore, the overall processing efficiency of the neural network model is indirectly improved.
The data processing method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, and PCs, etc.
Example two
Referring to fig. 2, a flowchart of steps of a data processing method according to a second embodiment of the present invention is shown.
In this embodiment, a data processing method according to an embodiment of the present invention is described in terms of obtaining speech data and further constructing a training sample of a speech recognition model based on recognizing subtitles of a video frame image in an application scene of a video.
The data processing method of the embodiment comprises the following steps:
step S201: a sequence of video frame images is acquired from a video.
The video may be a complete video or a video clip, and each of the video clips includes a series of video frame images having a time sequence relationship. In this embodiment, the video frame image sequence means a plurality of video frame images having a time sequence relationship.
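One possible way to obtain such a time-ordered video frame image sequence, sketched with OpenCV; sampling every frame (step = 1) is the default here, and the sampling interval is an illustrative choice.

```python
import cv2

def read_frame_sequence(video_path: str, step: int = 1):
    """Return (frames, fps): frames in time order, keeping every `step`-th frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames, fps
```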
Step S203: and respectively carrying out text detection on each video frame image in the video frame image sequence to obtain the information of the subtitle area in each video frame image.
The video frame image containing the caption is similar to the text image, and the caption area in the video frame image can be detected by a text detection method to obtain the information of the corresponding caption area.
In one possible approach, the DB model may be used to perform text detection on each video frame image in the video frame image sequence, and obtain information of the subtitle region in each video frame image. And text detection is performed through the DB model, so that the detection result is more accurate.
But is not limited thereto, other ways of text detection are equally applicable to the embodiments of the present invention.
Step S205: and performing text recognition on the video frame images according to the information of the subtitle area by using a text recognition model to obtain at least one video frame image set and subtitle content corresponding to the video frame image set.
The text recognition model is obtained by training based on the training sample constructed by the data processing method in the first embodiment.
In this embodiment, the text recognition model may be any suitable data model that can implement the function of obtaining at least one video frame image set and the subtitle content corresponding to the video frame image set, including but not limited to a CRNN (Convolutional Recurrent Neural Network) model. In the text recognition model of this embodiment, on one hand, a plurality of video frame images can be collected into one video frame image set according to the similarity between the video frame images; on the other hand, text recognition can be performed on the subtitle region indicated by the information of the subtitle region of the video frame image, and subtitle content can be obtained.
In a feasible manner, information of a plurality of video frame images and subtitle regions corresponding to the video frame images can be input into a text recognition model; performing similarity identification on the subtitle regions of the plurality of video frame images according to the information of the subtitle regions through a text identification model; obtaining at least one video frame image set according to the result of the similarity identification; and selecting one video frame image from each video frame image set, and performing text recognition on the selected video frame image to obtain the subtitle content corresponding to each video frame image set. In this case, the text recognition model of the embodiment may include two parts, for example, the first part may perform similarity determination on the input video frame images and output at least one video frame image set; furthermore, one video frame image is taken out from each video frame image set and input into a second part; the second part can perform text recognition on the input video frame images according to the information of the subtitle area to obtain the subtitle content corresponding to each video frame image set. By the method, the number of video frame images is reduced, and the text recognition speed and efficiency are improved.
In another possible way, the information of at least one video frame image and the subtitle region corresponding to the video frame image can be input into a text recognition model; performing text recognition on the input video frame images through a text recognition model to obtain each video frame image and corresponding subtitle content; and obtaining at least one video frame image set and the subtitle content corresponding to the video frame image set according to the similarity between the obtained subtitle contents. In this case, the text recognition model of this embodiment may include two parts, for example, the first part may perform text recognition on the input video frame image according to the information of the subtitle region, and output the subtitle content corresponding to each video frame image; then, the video frame images and the corresponding caption contents are input into a second part, the second part judges the similarity of the caption contents, and aggregates a plurality of corresponding video frame images into a video frame image set according to the judgment result, and the caption contents are used as the caption contents corresponding to the video frame image set. By the method, the text recognition of the video frame image is comprehensive and the accuracy is higher.
The specific form and implementation of the two-part network structure may be appropriately selected or set by those skilled in the art according to actual needs, and the embodiment of the present invention is not limited thereto.
Step S207: and determining the video starting time point and the video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set.
In one possible approach, the video start time point and the video end time point of each video frame image set may be determined according to the time stamp of the video frame image in the video frame image set. By the method, the more accurate video starting time point and video ending time point can be obtained, and accurate audio data can be obtained subsequently.
In another possible way, the duration information of each video frame image can be determined according to the total duration and the total frame number of the video; and determining the video starting time point and the video ending time point of the video frame image set according to the duration information of each video frame image and the video sequence number of the video frame image in the video frame image set. In this way, even if the video frame image has no time stamp, an accurate video start time point and an accurate video end time point can be obtained, and thus accurate audio data can be obtained subsequently.
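A minimal sketch of the second approach described above: deriving the video start and end time points from the total duration, the total frame count, and the sequence numbers of the frames in a set. Variable names are illustrative.

```python
def time_span_of_set(frame_indices, total_duration_s: float, total_frames: int):
    """Derive (start_s, end_s) for a video frame image set given frame indices."""
    frame_duration = total_duration_s / total_frames
    start_s = min(frame_indices) * frame_duration
    end_s = (max(frame_indices) + 1) * frame_duration   # end of the last frame
    return start_s, end_s

# Example: frames 120..180 of a 120-second, 3000-frame video.
# start, end = time_span_of_set(range(120, 181), 120.0, 3000)
```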
Step S209: and obtaining audio data corresponding to the video starting time point and the video ending time point from the video.
In this step, the obtained audio data corresponds to the video image frame sequence in the video frame image set and also corresponds to the subtitle content corresponding to the video frame image set, and thus, the audio data can be used as a basis for subsequently constructing a training sample.
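Extracting the audio between the two time points could, for instance, be done with an ffmpeg subprocess call such as the sketch below; ffmpeg is assumed to be installed, and the codec choice is an illustrative assumption rather than something prescribed by the patent.

```python
import subprocess

def cut_audio(video_path: str, start_s: float, end_s: float, out_path: str):
    """Extract the audio between start_s and end_s into a WAV file."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,
        "-ss", f"{start_s:.3f}",
        "-to", f"{end_s:.3f}",
        "-vn",                       # drop the video stream
        "-acodec", "pcm_s16le",      # 16-bit PCM audio
        out_path,
    ], check=True)

# cut_audio("lesson.mp4", 4.8, 7.2, "clip_0001.wav")
# The resulting clip, paired with the recognised subtitle content,
# forms one training sample for the speech recognition model.
```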
Step S211: and constructing training data for training a voice recognition model according to the subtitle content and the audio data corresponding to the video frame image set.
As described above, the subtitle content and the audio data corresponding to the video frame image set form a training sample of the speech recognition model. The training sample is obtained without manual participation and can be generated at low cost, which greatly improves the efficiency of constructing training samples for the speech recognition model and further improves the overall processing efficiency of the speech recognition model.
According to the embodiment, a video frame image sequence with information of a caption area obtained is processed through a text recognition model obtained by training a training sample constructed by the data processing scheme to obtain a video frame image set and corresponding caption content; further, audio data in the time period are obtained according to the video starting time point and the video ending time point of the video frame image set; after the audio data are obtained, training data can be constructed by combining the subtitle content obtained by recognition and used for training the voice recognition model, so that the training sample of the voice recognition model can be constructed quickly and at low cost. Furthermore, the overall processing efficiency of the speech recognition model is indirectly improved.
The data processing method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, and PCs, etc.
EXAMPLE III
Referring to fig. 3, a flowchart of steps of a data processing method according to a third embodiment of the present invention is shown.
In the embodiment, a video is used as an application scene, and the data processing method in the embodiment of the present invention is explained from the perspective of text recognition of a text image, that is, the perspective of subtitle recognition in a video frame image, and the perspective of further construction of a training sample of a speech recognition model.
The data processing method of the embodiment comprises the following steps:
step S202: a sequence of video frame images is acquired from a video.
The video may be a complete video or a video clip, and each of the video clips includes a series of video frame images having a time sequence relationship. In this embodiment, the video frame image sequence means a plurality of video frame images having a time sequence relationship.
Step S204: and respectively carrying out text detection on each video frame image in the video frame image sequence to obtain the information of the subtitle area in each video frame image.
The video frame image containing the caption is similar to the text image, and the caption area in the video frame image can be detected by a text detection method to obtain the information of the corresponding caption area.
In one possible approach, the DB model may be used to perform text detection on each video frame image in the video frame image sequence, and obtain information of the subtitle region in each video frame image. And text detection is performed through the DB model, so that the detection result is more accurate.
But is not limited thereto, other ways of text detection are equally applicable to the embodiments of the present invention.
Step S206: and according to the information of the subtitle region, performing subtitle similarity judgment on video frame images in the video frame image sequence, and according to a judgment result, obtaining at least one video frame image set.
When the subtitle similarity is judged, the subtitle similarity can be judged on different video frame images mainly according to the image part of the subtitle area. Through the judgment of the similarity of the subtitles, whether different video frame images have the subtitles with the similarity meeting a certain similarity threshold value can be determined. Further, video frame images having the same subtitle may be classified into one set, whereby a sequence of video frame images may be classified into at least one set of video frame images.
In order to improve the efficiency of judging the similarity of the subtitles, in a feasible mode, each video frame image in the video frame image sequence can be subjected to image interception according to the information of the subtitle area of each video frame image, and a plurality of intercepted images which correspond to the video frame image sequence and comprise the subtitle areas are obtained; and carrying out subtitle similarity judgment on a plurality of intercepted images corresponding to the video frame image sequence. Namely, each video frame image is subjected to image interception, and a subtitle area part of each video frame image is intercepted to obtain an intercepted image. Since each video frame image corresponds to one truncated image, the video frame sequence corresponds to a plurality of truncated images accordingly. The specific implementation of the image capturing may refer to the relevant parts in the first embodiment as the relevant description in step S104, and is not described herein again.
The main part of the intercepted image is the subtitle area, and the subtitle similarity judgment is carried out on a plurality of intercepted images, so that excessive information irrelevant to the subtitle area does not need to be processed, and the similarity judgment efficiency is improved; on the other hand, since the subtitle region is mainly determined, the accuracy of determination is also improved.
In a feasible manner, image pairs can be constructed for the plurality of corresponding intercepted images according to the time sequence of the video frame images in the video frame image sequence; and inputting the images into a neural network model for image similarity judgment in sequence for similarity judgment. Through the mode of the neural network model, the similarity judgment can be more accurate, and particularly, the effect is better when whether subtitles in two intercepted images are the same or not is judged.
Alternatively, the neural network model for performing the image similarity determination may be a MatchNet model (MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching). The MatchNet model mainly comprises a feature network and a metric network. The feature network is a convolutional neural network comprising 2 branches, each branch containing 5 convolutional layers and 3 pooling layers, with the 2 branches sharing weights; the 2 branches respectively extract features from the 2 input images and output a feature pair. The metric network mainly comprises 3 fully-connected layers (the third fully-connected layer is followed by a softmax function) and measures the similarity of the 2 images according to the feature pair output by the feature network. Compared with other models, the MatchNet model is simpler, has fewer parameters, and offers a faster and more efficient similarity measurement.
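Purely to illustrate the two-part structure just described (weight-sharing feature branches followed by a fully connected metric network), here is a much-simplified PyTorch sketch; the layer sizes are placeholders and do not reproduce the original MatchNet configuration.

```python
import torch
import torch.nn as nn

class TinyMatchNet(nn.Module):
    """Simplified siamese feature network + metric network (illustrative)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # shared by both branches
            nn.Conv2d(1, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.metric = nn.Sequential(              # compares the feature pair
            nn.Linear(2 * 64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 2),                    # logits: same / different
        )

    def forward(self, a, b):
        fa, fb = self.features(a), self.features(b)
        return self.metric(torch.cat([fa, fb], dim=1))

# logits = TinyMatchNet()(crop_a, crop_b)  # crops as (N, 1, H, W) tensors
```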
Step S208: and selecting one video frame image from each video frame image set for subtitle identification to obtain subtitle content corresponding to each video frame image set.
Because the video frame images in each video frame image set have the same subtitles, one of the video frame images can be identified, so that the identification data volume can be greatly reduced, and the identification efficiency is improved. When selecting one video frame image from each video frame image set, a person skilled in the art may select the video frame image in any suitable manner, including but not limited to randomly selecting, selecting the first or last or middle video frame image, and the like. Similarly, one of the sets of images formed after the caption region is clipped may be selected and identified.
Alternatively, one video frame image may be selected from each video frame image set, and the selected video frame image is subjected to subtitle recognition using a CRNN (Convolutional Recurrent Neural Network) model for text recognition, so as to obtain the subtitle content corresponding to each video frame image set.
The CRNN model is composed of a convolutional neural network part, a recurrent neural network part and a transcription layer part. The convolutional neural network part is responsible for extracting features (in this embodiment mainly text features, namely caption features) from video frame images, the recurrent neural network part performs sequence prediction using the features extracted by the convolutional part, and the transcription layer part translates the sequence obtained by the recurrent part into a character sequence. Through the CRNN model, accurate and efficient subtitle recognition can be performed.
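Likewise for illustration, a much-simplified sketch of the three-part CRNN structure (convolutional feature extraction, recurrent sequence prediction, transcription into per-timestep character logits); real CRNN configurations differ, and the sizes here are placeholders.

```python
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Conv features -> bidirectional LSTM -> per-timestep character logits."""
    def __init__(self, num_classes: int, height: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 128 * (height // 4)            # channels x reduced height
        self.rnn = nn.LSTM(feat_dim, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)     # transcription into classes

    def forward(self, x):                         # x: (N, 1, height, width)
        f = self.conv(x)                          # (N, C, height/4, width/4)
        f = f.permute(0, 3, 1, 2).flatten(2)      # (N, width/4, C * height/4)
        out, _ = self.rnn(f)                      # sequence over width steps
        return self.fc(out)                       # (N, width/4, num_classes)
```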
Therefore, the subtitle recognition in the video frame image is realized. It should be noted that the above process can be implemented by using a combination of a plurality of neural network models, including: using a DB model to detect texts, and obtaining information of a subtitle area in each video frame image; after the video frame images are subjected to image interception according to the information of the subtitle region and a plurality of intercepted images are obtained, a MatchNet model can be used for carrying out subtitle similarity judgment on the plurality of intercepted images, and at least one video frame image set is obtained according to the judgment result; after one video frame image is selected from the video frame image set, the CRNN model can be used for carrying out subtitle recognition, and corresponding subtitle content is obtained. That is, the scheme of the present embodiment can be implemented by using the DB model + MatchNet model + CRNN model in combination.
In this case, these neural network models need to be trained in advance, for example, a large number of text image samples can be used to train the DB model so that it has accurate text detection performance; the MatchNet model can be trained by using a text image sample, so that the MatchNet model has more accurate similarity judgment performance; a large number of text image samples can be used for training the CRNN model, so that the CRNN model has more accurate text recognition performance.
Optionally, before the image pairs are sequentially input into the neural network model for image similarity judgment, such as the MatchNet model, a first training sample set for training that model and a second training sample set for training the CRNN model may also be constructed; the image-similarity model is then trained using the first training sample set, and the CRNN model is trained using the second training sample set.
Wherein, further optionally, the first training sample set may be constructed by: respectively carrying out text detection on a plurality of video frame sample images containing subtitles to obtain information of a plurality of corresponding subtitle areas; according to the information of the plurality of subtitle areas, image interception is carried out on the corresponding video frame sample images, and corresponding first intercepted images which do not contain subtitles are obtained; adding a pre-acquired text sentence into the first intercepted image to acquire a corresponding new video frame sample image; and constructing the first training sample set by taking the new video frame sample image as a sample image and taking the added text sentences as text labels of the new video frame sample image.
Optionally, the second training sample set is constructed by: respectively carrying out text detection on a plurality of video frame sample images containing subtitles to obtain information of a plurality of corresponding subtitle areas; according to the information of the caption area, image interception is carried out on the corresponding video frame sample image to obtain a corresponding second intercepted image containing the caption; combining the plurality of second intercepted images corresponding to the plurality of video frame images randomly in pairs; determining an image pair with the same caption as a positive sample and determining an image pair with different captions as a negative sample in a second intercepted image pair obtained after random combination; constructing the second set of training samples using the positive samples and the negative samples.
Therefore, two parts of results obtained after image interception of the video frame sample image are fully utilized, and the number of training samples is expanded at low cost and high efficiency.
Through the above process, when text recognition such as subtitle recognition in a video is performed, each video frame image in the video is not recognized any more, and a video frame image set of subtitles with a certain similarity, such as a video frame image set of the same subtitles, is determined from a plurality of video frame images in a video frame image sequence according to the similarity of the subtitles. Furthermore, one video frame image may be selected from the set to perform caption recognition, and caption content may be obtained. For a video frame image set, a plurality of video frame images are usually included, and the plurality of video frame images have the same subtitle, so that subtitle recognition of one of the plurality of video frame images can realize subtitle recognition of all the video frame images in the set. Therefore, the data processing burden of subtitle recognition is greatly reduced, and the data processing efficiency is improved. Particularly, when the neural network model is adopted for subtitle recognition, the data processing burden of the neural network model is greatly reduced, and the data processing efficiency of the neural network model is improved.
Furthermore, in some scenarios, such as a training scenario for a speech recognition model, a large number of text images and their corresponding audio data are required to construct training samples. For example, in a speech recognition model for performing speech recognition on video, both subtitle content in a video frame image and audio data corresponding to the video frame image are required, and a training sample is constructed in combination to train the speech recognition model. Based on this, the following optional steps may also be performed.
Step S210: determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set; acquiring audio data corresponding to the video starting time point and the video ending time point from the video; and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
When the video start time point and the video end time point of each video frame image set are determined according to the time information of the video frame images in each video frame image set, in a feasible manner, the video start time point and the video end time point of each video frame image set can be determined according to the time stamps of the video frame images in each video frame image set. In this way, the video frame images may have timestamps, and since the video frame images in the same video frame image set have the same subtitles, that is, the same audio corresponding to the subtitles, based on this, the audio data between the video start time point and the video end time point of the video frame image set may be obtained according to the timestamps of the video frame images and based on the video start time point and the video end time point of the video frame image set. After the audio data is obtained, training data can be constructed in combination with the subtitle content obtained by recognition for training the speech recognition model. By the method, the starting time point and the ending time point of the audio data can be obtained accurately, and the accurate audio data can be obtained.
In another possible way, when the video start time point and the video end time point of each video frame image set are determined according to the time information of the video frame images in each video frame image set, the time information of each video frame image can be determined according to the total time length and the total frame number of the video; and determining the video starting time point and the video ending time point of the video frame image set according to the duration information of each video frame image and the video sequence number of the video frame image in the video frame image set. Because the playing time of each frame of video frame image in the video is the same, the time length information of each frame of video image can be determined according to the total time length and the total frame number of the video. In this manner, the video frame images may or may not be time-stamped. Each video frame image has a corresponding sequence number in the video to identify the position thereof, and based on the sequence number, the playing start time and the playing end time of each video frame image can be determined after the video sequence number of the video frame image is determined. In a video frame image set, the playing start time and the playing end time of a corresponding video frame image sequence can be determined according to the video sequence numbers of a plurality of video frame images in the video frame image set. Further, the audio data corresponding to the start time and the end time can be acquired from the video accordingly. After the audio data is obtained, training data can be constructed in combination with the subtitle content obtained by recognition for training the speech recognition model. In this way, accurate audio data can be obtained even if the video frame image has no time stamp.
Through the process, the training data of the voice recognition model can be constructed efficiently and at low cost.
The data processing method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, and PCs, etc.
Example four
Referring to fig. 4, a flowchart illustrating steps of a data processing method according to a fourth embodiment of the present invention is shown.
The present embodiment explains the data processing method provided in the embodiment of the present invention in a specific example. The data processing method of the embodiment comprises the following steps:
Step S302: Multiple videos with subtitles are collected from the internet by means of web crawlers.
As previously described, a collected video may be a complete video or a video clip. By crawling the internet, a large number of subtitled videos can be collected at low cost.
Step S304: and respectively extracting the audio data and the video data in each collected video to obtain corresponding audio data and video data. Then, step S306 is executed; alternatively, step S312 is performed; alternatively, steps S306 and S312 are performed in parallel.
For example, using the FFMPEG tool, the audio data and the video data in the video are extracted respectively, and the corresponding audio data and video data are obtained.
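For illustration only, the two extractions could be driven from Python with standard FFMPEG options; the file names below are placeholders:

import subprocess

# Separate the audio track (resampled to 16 kHz mono WAV, convenient for speech models)
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)

# Separate the video track, copying the video stream without re-encoding
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-an", "-c:v", "copy", "video.mp4"],
    check=True,
)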
Step S306: randomly selecting a certain amount of video data from the collected videos, and cutting each selected video data into images according to video frames, namely video frame images.
The specific amount can be set by a person skilled in the art according to actual requirements; cutting the video data into images frame by frame yields the video frame images, for example as follows.
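A minimal frame-splitting sketch using OpenCV's VideoCapture; the file name and function name are illustrative:

import cv2

def split_into_frames(video_path):
    """Cut one piece of video data into individual video frame images."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()     # reads one BGR frame per call
        if not ok:                 # end of video reached
            break
        frames.append(frame)
    cap.release()
    return frames

frames = split_into_frames("video.mp4")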
Step S308: and manually marking the position of the subtitle area on the video frame image obtained by segmentation.
The video frame images, together with the annotations of their subtitle area positions, form the training samples for training the DB model.
Step S310: Training the DB model with the video frame images annotated with subtitle area positions to obtain a DB model for text detection. Once training is finished, the DB model is kept for subsequent use.
Through the training of the DB model, a text detection model for detecting the subtitle position on the video frame image, namely the DB model for text detection, can be obtained.
Step S312: Randomly selecting a certain amount of video data from the collected videos, and cutting each selected piece of video data into video frame images frame by frame. Each video frame image is then cropped to obtain, respectively, a part containing the subtitle and a part not containing the subtitle.
The video data selected in this step may be entirely the same as, entirely different from, or partially overlapping with the video data selected in step S306.
For example, the lower half of each video frame image (i.e., the first text image), which contains the subtitles, may be cut away using a built-in OpenCV function, retaining only the subtitle-free upper half. In this embodiment, the subtitle-free part of the image is referred to as a same-texture image (i.e., the first clipped image); a minimal cropping sketch is given below.
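In OpenCV's Python bindings the crop reduces to plain array slicing; the fixed 50 % split line and the file names are assumptions made for illustration, and in practice the split would follow the detected subtitle region:

import cv2

frame = cv2.imread("frame_0001.png")        # one video frame image containing a subtitle
h = frame.shape[0]
split = h // 2                              # assumed split line; in practice use the detected subtitle boundary
lower_with_subtitle = frame[split:, :]      # lower half, contains the subtitle
first_clipped_image = frame[:split, :]      # subtitle-free upper half, the "same-texture image"
cv2.imwrite("same_texture_0001.png", first_clipped_image)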
Step S314: Crawling novels from the internet by means of web crawlers.
The novels crawled from the internet provide a large number of text sentences in plain text form.
It should be noted that, in practical applications, this step may be performed at any time before step S316, and is not limited to the order of steps shown in this embodiment.
Step S316: Combining each text sentence of the novels with a same-texture image to generate a text-pasted image.
For example, each text sentence may be pasted along the lower edge of a same-texture image using a function provided by OpenCV, such as putText(), with the font, color, and size of the text randomly varied within a range specified by those skilled in the art. The resulting image is referred to as a text-pasted image (i.e., a second text image).
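A minimal pasting sketch using cv2.putText(); the value ranges and file names are illustrative assumptions:

import random
import cv2

img = cv2.imread("same_texture_0001.png")          # subtitle-free upper half from step S312
sentence = "an example text sentence"              # one sentence taken from a crawled novel
h = img.shape[0]

# Randomly vary appearance within an illustrative range
scale = random.uniform(0.8, 1.5)
color = tuple(random.randint(180, 255) for _ in range(3))
thickness = random.choice([1, 2])

# Paste the sentence along the lower edge of the image
cv2.putText(img, sentence, (10, h - 10), cv2.FONT_HERSHEY_SIMPLEX, scale, color, thickness)
cv2.imwrite("pasted_0001.png", img)

Note that OpenCV's built-in Hershey fonts only cover ASCII characters; for Chinese subtitles the sentence would typically be drawn with a CJK-capable library such as PIL and the result converted back to an OpenCV array.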
Step S318: Cutting out the image region where the text sentence is located in the text-pasted image and using it as a recognition image. Then, step S320 is performed; alternatively, step S322 is performed; alternatively, steps S320 and S322 are performed in parallel.
In this step, the region where the text sentence is located is cut out of the text-pasted image to form an image whose main content is the text sentence, referred to as a recognition image in this embodiment.
Step S320: Training the CRNN model using the recognition images and their corresponding text sentences as training samples to obtain a text recognition model capable of recognizing text content. Once training is finished, the CRNN model is kept for subsequent use.
In this embodiment, the text-pasted image is further cropped before the CRNN model is trained, which makes the training more targeted and reduces the amount of training data. It should be apparent to those skilled in the art that the CRNN model may also be trained on training samples constructed directly from the text-pasted images and their corresponding text sentences.
Step S322: Constructing a training data set for the MatchNet model from the recognition images, and training the MatchNet model. Once training is finished, the MatchNet model is kept for subsequent use.
For example, a training data set is constructed according to whether the text sentences in the recognition images are the same: two images carrying the same text sentence form a matching pair (i.e., a positive sample), and two images carrying different text sentences form a non-matching pair (i.e., a negative sample). The MatchNet model is then trained on these pairs to obtain a model capable of judging whether two text images are similar; a pair-construction sketch follows.
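A sketch of this pair construction (data structures and function name are hypothetical), assuming each sentence may have been pasted onto several same-texture images so that positive pairs exist:

import random
from collections import defaultdict

def build_matchnet_pairs(recognition_images):
    """recognition_images: list of (image, sentence) tuples; the same sentence may be
    pasted onto several different same-texture images."""
    by_sentence = defaultdict(list)
    for img, sent in recognition_images:
        by_sentence[sent].append(img)

    pairs = []
    sentences = list(by_sentence)
    for sent, imgs in by_sentence.items():
        # matching pairs (positive samples): two images carrying the same sentence
        for a, b in zip(imgs, imgs[1:]):
            pairs.append((a, b, 1))
        # non-matching pairs (negative samples): pair with an image carrying another sentence
        if len(sentences) > 1:
            other = random.choice([s for s in sentences if s != sent])
            pairs.append((imgs[0], by_sentence[other][0], 0))
    random.shuffle(pairs)
    return pairs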
Similar to the foregoing, in practical applications the text-pasted images and their corresponding text sentences may also be used directly to construct the training data set for the MatchNet model.
Thus, the training of the DB model, the MatchNet model, and the CRNN model is completed, and the corresponding functions of these models can be directly used subsequently, as described below.
Step S324: Dividing all the video data extracted from the collected videos into images frame by frame, namely video frame images; and feeding each video frame image into the DB model to obtain the position information of the subtitle area on each video frame image.
Step S326: Cropping out the subtitle-containing part of each video frame image according to the position information of its subtitle area, for example as sketched below.
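Given the detected subtitle area, the crop itself is again simple array slicing; the box format below is an assumption about the output of the detection step:

def crop_subtitle(frame, box):
    """Crop the detected subtitle region out of one video frame image.

    box: (x_min, y_min, x_max, y_max) in pixel coordinates; the exact output format
    of the DB-model wrapper is an assumption made for illustration.
    """
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]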
Step S328: Feeding every two consecutive subtitle images into the MatchNet model in sequence, judging whether the two images are similar, grouping consecutive similar images into one class, and recording the sequence number of each image of each class within the whole video data; a grouping sketch follows. Then, step S330 is performed; alternatively, step S332 is performed; alternatively, steps S330 and S332 are performed in parallel.
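A sketch of the sequential grouping; the is_similar wrapper around the MatchNet model is a hypothetical interface:

def group_by_subtitle(subtitle_crops, is_similar):
    """Group consecutive subtitle crops into classes of identical subtitles.

    subtitle_crops: subtitle images in frame order (from step S326).
    is_similar: a callable wrapping the trained MatchNet model that returns True when
                two crops show the same subtitle (its exact interface is an assumption).
    Returns a list of classes, each being a list of frame sequence numbers.
    """
    if not subtitle_crops:
        return []
    classes = [[0]]
    for i in range(1, len(subtitle_crops)):
        if is_similar(subtitle_crops[i - 1], subtitle_crops[i]):
            classes[-1].append(i)      # same subtitle continues
        else:
            classes.append([i])        # subtitle changed: start a new class
    return classes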
Step S330: Selecting one image from each class, feeding it into the CRNN model, and performing text recognition to obtain the specific content of the text sentence. Then go to step S336.
Step S332: Obtaining the start time point and the end time point at which a text sentence appears in the video data, according to the sequence numbers of the images of each class within the whole video data, combined with the total duration and the total frame number of the corresponding video data.
Step S334: Cutting out, from the audio of the video to which the video data belongs, the audio data between the start time point and the end time point of the text sentence, for example as follows.
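A minimal sketch of the audio cut using FFMPEG's -ss/-to options from Python; the paths and the example times are placeholders:

import subprocess

def cut_audio_segment(audio_path, start_s, end_s, out_path):
    """Cut the audio between one subtitle's start and end time points."""
    subprocess.run(
        ["ffmpeg", "-i", audio_path, "-ss", f"{start_s:.3f}", "-to", f"{end_s:.3f}",
         "-c", "copy", out_path],
        check=True,
    )

cut_audio_segment("audio.wav", 48.0, 51.0, "segment_0001.wav")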
Step S336: After all video frame images have been processed, obtaining training data for training the speech recognition model from the recognized text sentences and the corresponding audio data, for example as a collection of (audio, text) pairs such as the one sketched below.
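One possible way to materialize the resulting training data is a JSON-lines manifest of (audio, text) pairs; this format and the function name are assumptions for illustration, not something prescribed by the method:

import json

def write_manifest(samples, manifest_path):
    """samples: list of (subtitle_text, audio_segment_path) pairs, one per subtitle class.
    Writes one JSON object per line, a common format for speech recognition training sets."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for text, wav in samples:
            f.write(json.dumps({"audio": wav, "text": text}, ensure_ascii=False) + "\n")

write_manifest([("an example subtitle sentence", "segment_0001.wav")], "train_manifest.jsonl")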
Through the above process, a large number of subtitled videos, such as TV series, animations and variety shows, are collected, and video data and audio data are extracted from each video. A small amount of video data is randomly selected and cut into video frame images frame by frame, and the subtitle area positions of these video frame images are annotated manually; the DB model is then trained with the video frame images and the annotation data to obtain a text detection model capable of detecting text positions on images. A large amount of video data is also randomly selected and cut into video frame images frame by frame; the lower half of each video frame image, i.e., the part containing the subtitles, is cut away, and only the subtitle-free upper half is retained. A large number of novels are then crawled from the network, each text sentence of the novels (with randomly varied font and size) is pasted along the lower edge of the retained upper half of each image, and the image region carrying the text sentence is then cut out and used to train the CRNN model, yielding a text recognition model capable of recognizing text content on images. Meanwhile, because the content of each text sentence is known, the cut-out text sentence images can be combined into pairs accordingly (two images with the same text sentence form a matching pair, i.e., a positive sample; two images with different text sentences form a non-matching pair, i.e., a negative sample), and the MatchNet model is trained with these data to obtain a model capable of judging whether two text images are similar.
After the models are trained, each of the originally collected video data is cut into video frame images frame by frame, and each video frame image is fed into the DB model to obtain the position of the subtitle area on it; the subtitle part is then cropped from each image. Starting from the first cropped image, every two consecutive cropped images are fed into the MatchNet model to evaluate whether they are similar, the sequence number of each image within the corresponding video data is recorded, and the sequence numbers of similar images are grouped into one class in order. One image of each class is then fed into the CRNN model to recognize the subtitle content. Next, according to the order of each class of images and the total frame number and total duration of each video, the start time point and the end time point of each text sentence are obtained, and an audio segment is cut from the audio data corresponding to the video data according to those time points. One audio segment together with one text sentence forms one training sample for the speech recognition model; after all video data are processed, a large amount of training data for the speech recognition model is obtained.
If training data is to be collected for a Chinese speech recognition model, a large number of Chinese videos can be collected and a large number of Chinese novels crawled; if training data is to be collected for an English speech recognition model, a large number of English videos can be collected and a large number of English novels crawled. By analogy, videos and novels are collected in whichever language the speech recognition model to be trained targets. In this way, a large amount of training data for the speech recognition model can be collected almost fully automatically, greatly reducing the cost of manually collecting and labeling data. Furthermore, the three models effectively decouple the dependency between detection and recognition, so that they can work almost in parallel; at the same time, the workload of the text recognition model is reduced, which speeds up data collection.
In this embodiment, the DB model, the MatchNet model and the CRNN model are combined and applied to subtitled videos such as TV series, animations and variety shows to obtain a large amount of training data for the speech recognition model, thereby reducing the cost of manually acquiring and labeling data to obtain training data.
The data processing method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to servers, PCs, and the like.
EXAMPLE five
Fig. 5 shows a hardware structure of an electronic device according to a fifth embodiment of the present invention. As shown in Fig. 5, the electronic device may include: a processor 401, a communication interface 402, a memory 403, and a communication bus 404.
Wherein:
the processor 401, communication interface 402, and memory 403 communicate with each other via a communication bus 404.
A communication interface 402 for communicating with other electronic devices or servers.
The processor 401 is configured to execute the program 405, and may specifically execute relevant steps in the foregoing data processing method embodiment.
In particular, the program 405 may include program code comprising computer operating instructions.
The processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the invention. The electronic device comprises one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
A memory 403 for storing a program 405. The memory 403 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk storage.
In a first embodiment:
the program 405 may specifically be configured to cause the processor 401 to perform the following operations: performing text detection on a first text image to obtain information of a text area in the first text image; according to the information of the text area, image interception is carried out on the first text image, and a corresponding first intercepted image which does not contain a text is obtained; acquiring a plurality of text sentences, and fusing the plurality of text sentences respectively with the first intercepted image to obtain a plurality of second text images; and constructing a training sample for training a text recognition model by taking the plurality of second text images as sample images and taking the text content of the text sentence corresponding to each second text image as the text label of the second text image.
In an optional implementation, the program 405 is further configured to cause the processor 401, when performing image interception on the first text image according to the information of the text region to obtain a corresponding first intercepted image containing no text: determining an image interception boundary according to the information of the text region; and carrying out image interception on the first text image according to the image interception boundary to obtain a first intercepted image without a text.
In an optional implementation manner, the program 405 is further configured to enable the processor 401 to obtain, according to the result of the image interception, a second intercepted image containing text and corresponding to the first text image; combining the second intercepted image corresponding to the first text image and the second intercepted images corresponding to other text images randomly in pairs; determining the image pair with the same text as a positive sample and determining the image pair with different texts as a negative sample in the second intercepted image pair obtained after random combination; and constructing a training sample for training a text similarity model for text similarity judgment according to the positive sample and the negative sample.
In an alternative embodiment, the program 405 is further configured to cause the processor 401, when performing text detection on the first text image and obtaining information of a text region in the first text image: and performing text detection on the first text image by using a differentiable binary DB model to obtain the information of the text region in the first text image.
In an optional implementation manner, the first text image is a video frame image containing subtitles, and the text region is a subtitle region; the pre-acquired text sentences are text sentences in novels crawled from the network.
In a second embodiment:
the program 405 may specifically be configured to cause the processor 401 to perform the following operations: acquiring a video frame image sequence from a video; respectively carrying out text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image; performing text recognition on video frame images according to the information of the subtitle region by using a text recognition model to obtain at least one video frame image set and subtitle content corresponding to the video frame image set, wherein the subtitle region corresponding to each video frame image in the video frame image set meets a preset similarity, and the text recognition model is obtained by training based on a training sample constructed by the data processing method in the first embodiment; determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set; obtaining audio data corresponding to the video starting time point and the video ending time point from the video; and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
In an optional implementation manner, the program 405 is further configured to cause the processor 401, when performing text recognition on the video frame image according to the information of the subtitle region by using a text recognition model, to obtain at least one video frame image set and subtitle content corresponding to the video frame image set: inputting information of at least one video frame image and a subtitle region corresponding to the video frame image into the text recognition model; performing text recognition on the input video frame images through the text recognition model to obtain each video frame image and corresponding subtitle content; and obtaining at least one video frame image set and the subtitle content corresponding to the video frame image set according to the similarity between the obtained subtitle contents.
In an optional implementation manner, the program 405 is further configured to cause the processor 401, when performing text recognition on the video frame image according to the information of the subtitle region by using a text recognition model, to obtain at least one video frame image set and subtitle content corresponding to the video frame image set: inputting the information of a plurality of video frame images and subtitle areas corresponding to the video frame images into the text recognition model; performing similarity identification on the caption areas of the plurality of video frame images according to the information of the caption areas through the text identification model; obtaining at least one video frame image set according to the result of the similarity identification; selecting one video frame image from each video frame image set, and performing text recognition on the selected video frame image to obtain subtitle content corresponding to each video frame image set.
In an alternative embodiment, the program 405 is further configured to cause the processor 401, when determining the video start time point and the video end time point of each video frame image set according to the time information of the video frame image in the video frame image set: determining a video starting time point and a video ending time point of each video frame image set according to the time stamp of the video frame image in each video frame image set; or determining the time length information of each video frame image according to the total time length and the total frame number of the video; and determining the video starting time point and the video ending time point of the video frame image set according to the duration information of each video frame image and the video sequence number of the video frame image in the video frame image set.
In a third embodiment:
the program 405 may specifically be configured to cause the processor 401 to perform the following operations: acquiring a video frame image sequence from a video; respectively carrying out text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image; performing subtitle similarity judgment on video frame images in the video frame image sequence according to the information of the subtitle region, and acquiring at least one video frame image set according to a judgment result; selecting one video frame image from each video frame image set for subtitle identification to obtain subtitle content corresponding to each video frame image set; determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set; acquiring audio data corresponding to the video starting time point and the video ending time point from the video; and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
In an alternative embodiment, the program 405 is further configured to cause the processor 401, when determining the video start time point and the video end time point of each video frame image set according to the time information of the video frame image in the video frame image set: determining a video starting time point and a video ending time point of each video frame image set according to the time stamp of the video frame image in each video frame image set; or determining the time length information of each video frame image according to the total time length and the total frame number of the video; and determining the video starting time point and the video ending time point of the video frame image set according to the duration information of each video frame image and the video sequence number of the video frame image in the video frame image set.
In an alternative embodiment, the program 405 is further configured to enable the processor 401, when performing the subtitle similarity determination on the video frame images in the video frame image sequence according to the information of the subtitle region: according to the information of the caption area, carrying out image interception on each video frame image in the video frame image sequence to obtain a plurality of intercepted images which correspond to the video frame image sequence and comprise the caption area; and carrying out subtitle similarity judgment on a plurality of intercepted images corresponding to the video frame image sequence.
In an alternative embodiment, the program 405 is further configured to cause the processor 401, when performing text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image: and respectively carrying out text detection on each video frame image in the video frame image sequence by using a differentiable binarization DB model to obtain the information of the subtitle area in each video frame image.
In an alternative embodiment, the program 405 is further configured to enable the processor 401, when performing the subtitle similarity determination on the plurality of truncated images corresponding to the video frame image sequence: constructing image pairs for the plurality of corresponding intercepted images according to the time sequence of the video frame images in the video frame image sequence; and inputting the images into a neural network model for image similarity judgment in sequence for similarity judgment.
In an optional embodiment, the neural network model for performing image similarity determination is a MatchNet model.
In an alternative embodiment, the program 405 is further configured to enable the processor 401, when selecting one video frame image from each video frame image set for subtitle recognition, and obtaining subtitle content corresponding to each video frame image set: selecting one video frame image from each video frame image set, and performing subtitle recognition on the selected video frame image by using a Convolutional Recurrent Neural Network (CRNN) model for text recognition to obtain subtitle content corresponding to each video frame image set.
In an optional implementation manner, the program 405 is further configured to enable the processor 401 to construct a first training sample set for training the neural network model for image similarity determination and a second training sample set for training the CRNN model before sequentially inputting the images into the neural network model for image similarity determination for similarity determination; and respectively training the neural network model by using the first training sample set, and training the CRNN model by using the second training sample set.
In an alternative embodiment, program 405 is further configured to cause processor 401 to construct the first training sample set by: respectively carrying out text detection on a plurality of video frame sample images containing subtitles to obtain information of a plurality of corresponding subtitle areas; according to the information of the plurality of subtitle areas, image interception is carried out on the corresponding video frame sample images, and corresponding first intercepted images which do not contain subtitles are obtained; adding a pre-acquired text sentence into the first intercepted image to acquire a corresponding new video frame sample image; and constructing the first training sample set by taking the new video frame sample image as a sample image and taking the added text sentences as text labels of the new video frame sample image.
In an alternative embodiment, program 405 is further configured to cause processor 401 to construct the second training sample set by: respectively carrying out text detection on a plurality of video frame sample images containing subtitles to obtain information of a plurality of corresponding subtitle areas; according to the information of the caption area, image interception is carried out on the corresponding video frame sample image to obtain a corresponding second intercepted image containing the caption; combining the plurality of second intercepted images corresponding to the plurality of video frame images randomly in pairs; determining an image pair with the same caption as a positive sample and determining an image pair with different captions as a negative sample in a second intercepted image pair obtained after random combination; constructing the second set of training samples using the positive samples and the negative samples.
For specific implementation of each step in the program 405, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing data processing method embodiments, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.
The above-described method according to an embodiment of the present invention may be implemented in hardware, in firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the data processing methods described herein. Further, when a general-purpose computer accesses code for implementing the data processing method shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the data processing method shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The above embodiments are only for illustrating the embodiments of the present invention and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

Claims (20)

1. A data processing method, comprising:
performing text detection on a first text image to obtain information of a text area in the first text image;
according to the information of the text area, image interception is carried out on the first text image, and a corresponding first intercepted image which does not contain a text is obtained;
acquiring a plurality of text sentences, and fusing the text sentences respectively with the first intercepted image to obtain a plurality of second text images;
and constructing a training sample for training a text recognition model by taking the plurality of second text images as sample images and taking the text content of the text sentence corresponding to each second text image as the text label of the second text image.
2. The method of claim 1, further comprising:
acquiring a second intercepted image which comprises a text and corresponds to the first text image according to the image interception result;
combining the second intercepted image corresponding to the first text image and the second intercepted images corresponding to other text images randomly in pairs;
determining the image pair with the same text as a positive sample and determining the image pair with different texts as a negative sample in the second intercepted image pair obtained after random combination;
and constructing a training sample for training a text similarity model for text similarity judgment according to the positive sample and the negative sample.
3. The method according to claim 1 or 2, wherein the performing text detection on the first text image to obtain information of a text region in the first text image comprises:
and performing text detection on the first text image by using a differentiable binary DB model to obtain the information of the text region in the first text image.
4. The method according to claim 1 or 2, wherein the first text image is a video frame image containing subtitles, and the information of the text region is information indicating a subtitle region;
the obtaining a plurality of text sentences and fusing the plurality of text sentences with the first truncated image respectively to obtain a plurality of second text images includes:
crawling novels from the network and extracting text sentences from the novels to obtain a plurality of text sentences;
and respectively embedding the plurality of text sentences into the first intercepted image to obtain a plurality of second text images.
5. A data processing method, comprising:
acquiring a video frame image sequence from a video;
respectively carrying out text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image;
performing text recognition on video frame images according to the information of the subtitle region by using a text recognition model to obtain at least one video frame image set and subtitle contents corresponding to the video frame image set, wherein the subtitle region corresponding to each video frame image in the video frame image set meets a preset similarity, and the text recognition model is obtained by training based on a training sample constructed by the data processing method according to any one of claims 1-4;
determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set;
obtaining audio data corresponding to the video starting time point and the video ending time point from the video;
and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
6. The method according to claim 5, wherein the performing text recognition on the video frame image according to the information of the subtitle region by using a text recognition model to obtain at least one video frame image set and subtitle content corresponding to the video frame image set comprises:
inputting information of at least one video frame image and a subtitle region corresponding to the video frame image into the text recognition model;
performing text recognition on the input video frame images through the text recognition model to obtain each video frame image and corresponding subtitle content;
and obtaining at least one video frame image set and the subtitle content corresponding to the video frame image set according to the similarity between the obtained subtitle contents.
7. The method according to claim 5, wherein the performing text recognition on the video frame image according to the information of the subtitle region by using a text recognition model to obtain at least one video frame image set and subtitle content corresponding to the video frame image set comprises:
inputting the information of a plurality of video frame images and subtitle areas corresponding to the video frame images into the text recognition model;
performing similarity identification on the caption areas of the plurality of video frame images according to the information of the caption areas through the text identification model;
obtaining at least one video frame image set according to the result of the similarity identification;
selecting one video frame image from each video frame image set, and performing text recognition on the selected video frame image to obtain subtitle content corresponding to each video frame image set.
8. The method according to any one of claims 5-7, wherein determining the video start time point and the video end time point of each video frame image set according to the time information of the video frame images in the video frame image set comprises:
determining a video starting time point and a video ending time point of each video frame image set according to the time stamp of the video frame image in each video frame image set;
or,
determining the time length information of each video frame image according to the total time length and the total frame number of the video; and determining the video starting time point and the video ending time point of the video frame image set according to the duration information of each video frame image and the video sequence number of the video frame image in the video frame image set.
9. A data processing method, comprising:
acquiring a video frame image sequence from a video;
respectively carrying out text detection on each video frame image in the video frame image sequence to obtain information of a subtitle area in each video frame image;
performing subtitle similarity judgment on video frame images in the video frame image sequence according to the information of the subtitle region, and acquiring at least one video frame image set according to a judgment result;
selecting one video frame image from each video frame image set for subtitle identification to obtain subtitle content corresponding to each video frame image set;
determining a video starting time point and a video ending time point of each video frame image set according to the time information of the video frame images in each video frame image set;
acquiring audio data corresponding to the video starting time point and the video ending time point from the video;
and constructing training data for training a voice recognition model according to the subtitle content corresponding to the video frame image set and the audio data.
10. The method of claim 9, wherein determining the video start time point and the video end time point of each video frame image set according to the time information of the video frame images in the video frame image set comprises:
determining a video starting time point and a video ending time point of each video frame image set according to the time stamp of the video frame image in each video frame image set;
or,
determining the time length information of each video frame image according to the total time length and the total frame number of the video; and determining the video starting time point and the video ending time point of the video frame image set according to the duration information of each video frame image and the video sequence number of the video frame image in the video frame image set.
11. The method according to claim 9 or 10, wherein the determining the similarity of subtitles for the video frame images in the video frame image sequence according to the information of the subtitle region comprises:
according to the information of the caption area, carrying out image interception on each video frame image in the video frame image sequence to obtain a plurality of intercepted images which correspond to the video frame image sequence and comprise the caption area;
and carrying out subtitle similarity judgment on a plurality of intercepted images corresponding to the video frame image sequence.
12. The method according to claim 11, wherein the performing text detection on each video frame image in the sequence of video frame images separately to obtain information of a subtitle region in each video frame image comprises:
and respectively carrying out text detection on each video frame image in the video frame image sequence by using a differentiable binarization DB model to obtain the information of the subtitle area in each video frame image.
13. The method according to claim 12, wherein the determining the similarity of subtitles for the plurality of truncated images corresponding to the sequence of images of the video frame comprises:
constructing image pairs for the plurality of corresponding intercepted images according to the time sequence of the video frame images in the video frame image sequence;
and inputting the images into a neural network model for image similarity judgment in sequence for similarity judgment.
14. The method of claim 13, wherein the neural network model for image similarity determination is a MatchNet model.
15. The method of claim 13, wherein selecting one video frame image from each video frame image set for caption recognition to obtain caption content corresponding to each video frame image set comprises:
selecting one video frame image from each video frame image set, and performing subtitle recognition on the selected video frame image by using a Convolutional Recurrent Neural Network (CRNN) model for text recognition to obtain subtitle content corresponding to each video frame image set.
16. The method of claim 15, wherein prior to said sequentially inputting the images into a neural network model for image similarity determination, the method further comprises:
constructing a first training sample set for training the neural network model for image similarity judgment and a second training sample set for training the CRNN model;
and respectively training the neural network model by using the first training sample set, and training the CRNN model by using the second training sample set.
17. The method of claim 16, wherein the first training sample set is constructed by:
respectively carrying out text detection on a plurality of video frame sample images containing subtitles to obtain information of a plurality of corresponding subtitle areas;
according to the information of the plurality of subtitle areas, image interception is carried out on the corresponding video frame sample images, and corresponding first intercepted images which do not contain subtitles are obtained; fusing a pre-acquired text sentence with the first intercepted image to obtain a corresponding new video frame sample image;
and constructing the first training sample set by taking the new video frame sample image as a sample image and taking the text content of the text sentence as the text label of the new video frame sample image.
18. The method of claim 16, wherein the second training sample set is constructed by:
respectively carrying out text detection on a plurality of video frame sample images containing subtitles to obtain information of a plurality of corresponding subtitle areas;
according to the information of the caption area, image interception is carried out on the corresponding video frame sample image to obtain a corresponding second intercepted image containing the caption;
combining the plurality of second intercepted images corresponding to the plurality of video frame images randomly in pairs;
determining an image pair with the same caption as a positive sample and determining an image pair with different captions as a negative sample in a second intercepted image pair obtained after random combination;
constructing the second set of training samples using the positive samples and the negative samples.
19. An electronic device, characterized in that the device comprises:
one or more processors;
a computer readable medium configured to store one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to perform operations of the data processing method of any one of claims 1-4 or any one of claims 5-8 or any one of claims 9-18.
20. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, carries out the data processing method of any one of claims 1 to 4 or of any one of claims 5 to 8 or of any one of claims 9 to 18.
CN202010733797.1A 2020-07-27 2020-07-27 Data processing method, electronic device and computer readable medium Pending CN111860389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010733797.1A CN111860389A (en) 2020-07-27 2020-07-27 Data processing method, electronic device and computer readable medium

Publications (1)

Publication Number Publication Date
CN111860389A true CN111860389A (en) 2020-10-30

Family

ID=72947589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010733797.1A Pending CN111860389A (en) 2020-07-27 2020-07-27 Data processing method, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN111860389A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522975A (en) * 2018-09-18 2019-03-26 平安科技(深圳)有限公司 Handwriting samples generation method, device, computer equipment and storage medium
CN109255826A (en) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 Chinese training image generation method, device, computer equipment and storage medium
CN110059647A (en) * 2019-04-23 2019-07-26 杭州智趣智能信息技术有限公司 A kind of file classification method, system and associated component
CN111091811A (en) * 2019-11-22 2020-05-01 珠海格力电器股份有限公司 Method and device for processing voice training data and storage medium
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329777A (en) * 2021-01-06 2021-02-05 平安科技(深圳)有限公司 Character recognition method, device, equipment and medium based on direction detection
CN112633422A (en) * 2021-03-10 2021-04-09 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN112633422B (en) * 2021-03-10 2021-06-22 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN113869122A (en) * 2021-08-27 2021-12-31 国网浙江省电力有限公司 Distribution network engineering reinforced control method
CN113780485A (en) * 2021-11-12 2021-12-10 浙江大华技术股份有限公司 Image acquisition, target recognition and model training method and equipment

Similar Documents

Publication Publication Date Title
CN111860389A (en) Data processing method, electronic device and computer readable medium
US11605226B2 (en) Video data processing method and apparatus, and readable storage medium
CN111582241B (en) Video subtitle recognition method, device, equipment and storage medium
CN111813998B (en) Video data processing method, device, equipment and storage medium
CN112990175B (en) Method, device, computer equipment and storage medium for recognizing handwritten Chinese characters
CN111583180B (en) Image tampering identification method and device, computer equipment and storage medium
CN111798543B (en) Model training method, data processing method, device, equipment and storage medium
CN110880000A (en) Picture character positioning method and device, computer equipment and storage medium
CN113505781B (en) Target detection method, target detection device, electronic equipment and readable storage medium
CN112861861B (en) Method and device for recognizing nixie tube text and electronic equipment
CN111967545B (en) Text detection method and device, electronic equipment and computer storage medium
CN110727816A (en) Method and device for determining interest point category
CN112949408A (en) Real-time identification method and system for target fish passing through fish channel
CN111652142A (en) Topic segmentation method, device, equipment and medium based on deep learning
CN111652140A (en) Method, device, equipment and medium for accurately segmenting questions based on deep learning
CN114119949A (en) Method and system for generating enhanced text synthetic image
CN111652144A (en) Topic segmentation method, device, equipment and medium based on target region fusion
CN118366162B (en) Image segmentation method and system based on deep learning
CN114429636B (en) Image scanning identification method and device and electronic equipment
CN115797336A (en) Fault detection method and device of photovoltaic module, electronic equipment and storage medium
CN117275025A (en) Processing system for batch image annotation
CN111798542B (en) Model training method, data processing device, model training apparatus, and storage medium
CN112149564B (en) Face classification and recognition system based on small sample learning
CN111680691B (en) Text detection method, text detection device, electronic equipment and computer readable storage medium
CN117541546A (en) Method and device for determining image cropping effect, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201030