CN112101329A - Video-based text recognition method, model training method and model training device - Google Patents


Info

Publication number: CN112101329A
Authority: CN (China)
Prior art keywords: text, video frame, network, feature vector, value
Legal status: Granted
Application number: CN202011305590.0A
Other languages: Chinese (zh)
Other versions: CN112101329B (en)
Inventors: 宋浩, 黄珊
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011305590.0A
Publication of CN112101329A
Application granted
Publication of CN112101329B
Current legal status: Active

Classifications

    • G06V20/40 — Scenes; scene-specific elements in video content
    • G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06N3/045 — Neural networks; architectures; combinations of networks
    • G06V20/62 — Scenes; type of objects; text, e.g. of license plates, overlay texts or captions on TV images
    • G06V30/10 — Character recognition

Abstract

The application discloses a text recognition method implemented with artificial intelligence technology, which comprises the following steps: acquiring a first video frame and a second video frame; acquiring a first text probability value and a first feature vector based on the first video frame; acquiring a second text probability value and a second feature vector based on the second video frame; obtaining a similarity score based on the first feature vector and the second feature vector; if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold, determining a target video frame according to the first video frame and the second video frame; and performing text recognition on the target video frame. The application also provides a model training method and a model training device. By using a twin network to calculate the text similarity between video frames, frames with highly similar text can be identified, text recognition is performed on a single frame selected from them, and the efficiency of text detection for video is improved.

Description

Video-based text recognition method, model training method and model training device
Technical Field
The application relates to the technical field of computer vision, in particular to a text recognition method based on video, a model training method and a model training device.
Background
With the continuous development of information and communication technology, a large number of videos keep emerging. Subtitles, road signs, and other text may appear during video playback; such text presents the program content in an intuitive form and can effectively help viewers grasp the theme of a program and better understand the video content.
At present, Optical Character Recognition (OCR) technology can be used to detect and recognize text in a video frame: first, text regions are detected to find the areas that contain text; then the text in each region is recognized, and text regions are merged or distinguished according to the recognition results.
However, existing solutions require OCR to perform text detection and recognition on every video frame in the video, which makes text detection inefficient.
Disclosure of Invention
The embodiments of the application provide a video-based text recognition method, a model training method, and corresponding devices. A twin network is used to calculate the text similarity between video frames, so that frames with high similarity can be identified; any one frame is then extracted from these highly similar frames for text recognition, which improves the efficiency of text detection for video.
In view of the above, an aspect of the present application provides a text recognition method based on a video, including:
acquiring a first video frame and a second video frame from a video to be identified, wherein the video to be identified comprises at least two video frames, and the first video frame and the second video frame are two adjacent video frames;
based on the first video frame, acquiring a first text probability value and a first feature vector through a first recognition network included in a text recognition network, wherein the first text probability value represents the probability of text appearing in the first video frame;
based on the second video frame, acquiring a second text probability value and a second feature vector through a second recognition network included in the text recognition network, wherein the second text probability value represents the probability of text appearing in the second video frame, and the second recognition network shares weights with the first recognition network;
based on the first feature vector and the second feature vector, acquiring a similarity score through a full connection layer included in the text recognition network;
if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold and the similarity score is less than or equal to the similarity threshold, determining a target video frame according to the first video frame and the second video frame;
and performing text recognition on the target video frame.
Another aspect of the present application provides a method for model training, including:
acquiring a sample pair to be trained, wherein the sample pair to be trained comprises a first video frame sample and a second video frame sample, the first video frame sample corresponds to a first text label value, the second video frame sample corresponds to a second text label value, and the sample pair to be trained corresponds to a similarity label value;
based on a first video frame sample, acquiring a first text probability value and a first feature vector through a first recognition network included in a text recognition network to be trained, wherein the first text probability value represents the probability of text appearing in a first video frame;
based on a second video frame sample, acquiring a second text probability value and a second feature vector through a second recognition network included in the to-be-trained text recognition network, wherein the second text probability value represents the probability of text appearing in a second video frame, and the second recognition network shares weight with the first recognition network;
based on the first feature vector and the second feature vector, obtaining a similarity score through a full connection layer included in the text recognition network to be trained;
and training the text recognition network to be trained according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value and the similarity score, and outputting the text recognition network when a model training condition is met, wherein the text recognition network is the text recognition network described in the above aspect.
Another aspect of the present application provides a text recognition apparatus, including:
the device comprises an acquisition module, a determining module and a recognition module, wherein the acquisition module is used for acquiring a first video frame and a second video frame from a video to be identified, the video to be identified comprises at least two video frames, and the first video frame and the second video frame are two adjacent video frames;
the acquisition module is further used for acquiring a first text probability value and a first feature vector through a first identification network included in the text identification network based on the first video frame, wherein the first text probability value represents the probability of text appearing in the first video frame;
the acquisition module is further used for acquiring a second text probability value and a second feature vector through a second identification network included in the text identification network based on the second video frame, wherein the second text probability value represents the probability of text appearing in the second video frame, and the second identification network shares weight with the first identification network;
the acquisition module is further used for acquiring a similarity score through a full connection layer included in the text recognition network based on the first characteristic vector and the second characteristic vector;
the determining module is used for determining a target video frame according to the first video frame and the second video frame if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold;
and the recognition module is used for performing text recognition on the target video frame.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a first feature map through a convolutional network included in a first identification network based on a first video frame, wherein the first identification network belongs to a text identification network;
acquiring a first text probability value through an attention network included in a first recognition network based on the first feature map;
acquiring a first feature vector through an image feature extraction network included in a first identification network based on the first feature map;
the acquisition module is specifically used for acquiring a second feature map through a convolutional network included in a second identification network based on a second video frame, wherein the second identification network belongs to a text identification network;
acquiring a second text probability value through an attention network included in a second recognition network based on the second feature map;
and acquiring a second feature vector through an image feature extraction network included in the second recognition network based on the second feature map.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to generate a first feature vector set to be processed according to the first feature map, where the first feature vector set to be processed includes M first feature vectors to be processed, each first feature vector to be processed includes N elements, and both N and M are integers greater than 1;
generating a second feature vector set to be processed according to the first feature vector set to be processed, wherein the second feature vector set to be processed comprises N second feature vectors to be processed, and each second feature vector to be processed comprises M elements;
acquiring a first attention feature vector through an attention network included in the first identification network based on the second feature vector set to be processed;
acquiring a first text probability value through a full connection layer included in a first identification network based on the first attention feature vector;
the obtaining module is specifically configured to generate a third feature vector set to be processed according to the second feature map, where the third feature vector set to be processed includes M third feature vectors to be processed, and each third feature vector to be processed includes N elements;
generating a fourth feature vector set to be processed according to the third feature vector set to be processed, wherein the fourth feature vector set to be processed comprises N fourth feature vectors to be processed, and each fourth feature vector to be processed comprises M elements;
acquiring a second attention feature vector through an attention network included in a second identification network based on the fourth feature vector set to be processed;
and acquiring a second text probability value through a full connection layer included by the second identification network based on the second attention feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to obtain K first to-be-spliced feature vectors through an image feature extraction network included in the first identification network based on the first feature map, where the K first to-be-spliced feature vectors include a first to-be-spliced feature vector obtained through an average pooling layer, and K is an integer greater than 1;
acquiring a first feature vector through an image feature extraction network included in a first identification network according to the K first feature vectors to be spliced;
based on the second feature map, acquiring a second feature vector through an image feature extraction network included in the second recognition network, wherein the second feature vector comprises:
acquiring K second feature vectors to be spliced through an image feature extraction network included by a second identification network based on a second feature map, wherein the K second feature vectors to be spliced include second feature vectors to be spliced obtained through an average pooling layer;
and acquiring a second feature vector through an image feature extraction network included by the second identification network according to the K second feature vectors to be spliced.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically used for subtracting elements at the same position in the first feature vector and the second feature vector to obtain an intermediate feature vector;
carrying out absolute value taking processing on the intermediate characteristic vector to obtain a target characteristic vector;
and acquiring a similarity score through the full connection layer based on the target feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is further configured to obtain a first frame identifier corresponding to the first video frame and a second frame identifier corresponding to the second video frame if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold;
the determining module is further configured to determine, according to the first frame identifier, the second frame identifier, and the frame rate of the video to be recognized, an occurrence time of the first video frame in the video to be recognized, and an occurrence time of the second video frame in the video to be recognized.
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the text recognition apparatus further includes a processing module;
the processing module is used for eliminating the first video frame if the first text probability value is smaller than the text probability threshold value and the second text probability value is larger than or equal to the text probability threshold value;
the processing module is further used for rejecting the second video frame if the first text probability value is larger than or equal to the text probability threshold and the second text probability value is smaller than the text probability threshold;
the determining module is further configured to determine that the first video frame and the second video frame belong to the same text video frame interval if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold and the similarity score is less than or equal to the similarity threshold.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically configured to determine that the first video frame and the second video frame belong to the same text video frame interval, where the text video frame interval includes at least two video frames;
and selecting any one video frame from the text video frame interval as a target video frame.
In one possible design, in another implementation manner of another aspect of the embodiment of the present application, the text recognition apparatus further includes a display module;
the acquisition module is also used for acquiring a time interval corresponding to the text video frame interval and a text recognition result corresponding to the target video frame after the recognition module performs text recognition on the target video frame;
the display module is used for displaying a time interval corresponding to the text video frame interval and a text recognition result, wherein the time interval represents the time from the first video frame to the last video frame in the text video frame interval;
alternatively,
the acquisition module is also used for acquiring a time interval corresponding to the text video frame interval and a text recognition result corresponding to the target video frame after the recognition module performs text recognition on the target video frame;
and the display module is further used for sending the text recognition result and the time interval corresponding to the text video frame interval to the terminal equipment so that the terminal equipment can display the time interval corresponding to the text video frame interval and the text recognition result.
Another aspect of the present application provides a model training apparatus, including:
the system comprises an acquisition module, a comparison module and a comparison module, wherein the acquisition module is used for acquiring a sample pair to be trained, the sample pair to be trained comprises a first video frame sample and a second video frame sample, the first video frame sample corresponds to a first text label value, the second video frame sample corresponds to a second text label value, and the sample pair to be trained corresponds to a similarity label value;
the acquisition module is further used for acquiring a first text probability value and a first feature vector through a first recognition network included in a text recognition network to be trained based on a first video frame sample, wherein the first text probability value represents the probability of text appearing in a first video frame;
the acquisition module is further used for acquiring a second text probability value and a second feature vector through a second recognition network included in the to-be-trained text recognition network based on a second video frame sample, wherein the second text probability value represents the probability of text occurrence in a second video frame, and the second recognition network shares weight with the first recognition network;
the acquisition module is further used for acquiring a similarity score through a full connection layer included in the text recognition network to be trained based on the first characteristic vector and the second characteristic vector;
the training module is used for training the text recognition network to be trained according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value and the similarity score, and outputting the text recognition network when the model training condition is met, wherein the text recognition network is the text recognition network provided by the aspect.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the training module is specifically used for determining a first loss value by adopting a first loss function according to the first text label value and the first text probability value;
determining a second loss value by adopting a second loss function according to the second text label value and the second text probability value;
determining a third loss value by adopting a third loss function according to the similarity label value and the similarity score;
and updating the model parameters of the text recognition network to be trained according to the first loss value, the second loss value and the third loss value.
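For illustration only, a minimal Python/PyTorch sketch of one training step combining the three losses described above may look as follows. The excerpt does not specify the concrete loss functions or their weighting, so binary cross-entropy and equal weighting are assumptions, and the model interface is a placeholder.

```python
# Hedged sketch of one training step with the three losses described above.
# Binary cross-entropy and equal loss weighting are assumptions; the model is
# expected to return (first text prob, second text prob, similarity score).
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample_pair):
    frame_a, frame_b, label_a, label_b, sim_label = sample_pair
    p1, p2, sim_score = model(frame_a, frame_b)

    loss1 = F.binary_cross_entropy(p1, label_a)           # first text label vs first text probability
    loss2 = F.binary_cross_entropy(p2, label_b)           # second text label vs second text probability
    loss3 = F.binary_cross_entropy(sim_score, sim_label)  # similarity label vs similarity score

    loss = loss1 + loss2 + loss3                          # assumed equal weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```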
Another aspect of the present application provides a computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein, the memory is used for storing programs;
the processor is used for executing the program in the memory, including performing the methods provided by the above aspects;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
The embodiment of the application provides a method for text recognition: first, a first video frame and a second video frame, which are two adjacent video frames, are obtained from a video to be recognized; a first text probability value and a first feature vector are then obtained through a first recognition network included in a text recognition network based on the first video frame, and a second text probability value and a second feature vector are obtained through a second recognition network included in the text recognition network based on the second video frame; a similarity score is obtained through a full connection layer included in the text recognition network based on the first feature vector and the second feature vector; if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold, a target video frame is determined from the first video frame and the second video frame; finally, text recognition is performed on the target video frame. In this way, the twin network is used to judge whether the video frames contain text, so that text detection is performed only on frames that contain text; at the same time, the twin network calculates the text similarity between video frames, so that highly similar frames can be identified, any one of them is extracted for text recognition, and the efficiency of text detection for video is improved.
Drawings
Fig. 1 is a schematic view of an application scenario of a text recognition method in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating text recognition based on conventional learning in an embodiment of the present application;
FIG. 3 is a schematic flowchart of text recognition based on deep learning according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of a video-based text recognition method in the embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text recognition network in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of extracting a target video frame in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an identification network in an embodiment of the present application;
FIG. 8 is a schematic diagram of a convolutional network in the embodiment of the present application;
FIG. 9 is a schematic diagram of processing a feature map in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an image feature extraction network in an embodiment of the present application;
FIG. 11 is a schematic diagram of an interface for displaying text recognition results in an embodiment of the present application;
FIG. 12 is a schematic diagram of an embodiment of a method for model training in an embodiment of the present application;
FIG. 13 is a schematic diagram of an embodiment of a text recognition apparatus in an embodiment of the present application;
FIG. 14 is a schematic diagram of an embodiment of a model training apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a terminal device in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a server in the embodiment of the present application.
Detailed Description
The embodiments of the application provide a video-based text recognition method, a model training method, and corresponding devices. A twin network is used to calculate the text similarity between video frames, so that frames with high similarity can be identified; any one frame is then extracted from these highly similar frames for text recognition, which improves the efficiency of text detection for video.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Video subtitle extraction has become an important research topic in the field of video analysis. As the number of videos on the internet and the attention they receive keep increasing, video subtitle extraction technology has received wide attention. Its main task is to obtain the start and end times at which a subtitle appears in a video and to provide the specific text content of the subtitle. The recognized text can be used for model training or for editing text or formats; the specific use is not limited in the present application.
A video often includes a large number of video frames, and recognizing every video frame consumes a large amount of processing resources and time. Based on this, the present application provides a text recognition method implemented with Artificial Intelligence (AI) technology, and in particular with Computer Vision (CV) technology. Artificial intelligence covers the theories, methods, technologies and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, at both the hardware level and the software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology and machine learning/deep learning.
Computer vision technology is the science of how to make a machine "see": it uses cameras and computers in place of human eyes to identify, track and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies the theories and techniques for building artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specifically studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
For ease of understanding, refer to fig. 1, which is a schematic view of an application scenario of the text recognition method in an embodiment of the present application. As shown in the figure, a video to be recognized is first uploaded to a computer device, and the computer device then takes two adjacent video frames from the video for detection. If both frames are detected to contain text and the text in the two frames is highly similar, either of the two frames can be selected as the target video frame. The subtitle in the target video frame then needs to be located, i.e. the position of the subtitle within the frame is found; subtitles are usually arranged horizontally or vertically, and the localization result can be represented by a minimum bounding box as shown in fig. 1. Finally, the computer device performs text recognition on the subtitle by extracting image features of the subtitle region, recognizing the characters in that region, and outputting the text string, for example "this is the first time they are confronted at sea" as shown in fig. 1.
It should be noted that the computer device described in this application may be a terminal device, a server, or a system formed by a server and a terminal device. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a palm computer, a personal computer, a smart television, a smart watch, and the like.
Based on this, text recognition using OCR technology will be described below with reference to fig. 2 and fig. 3. Referring to fig. 2, a schematic flowchart of text recognition based on conventional learning in the embodiment of the present application, a target video frame is used as input. In step A1, text region localization is performed first, mainly based on connected-component analysis; specifically, the Maximally Stable Extremal Regions (MSER) algorithm and the Stroke Width Transform (SWT) algorithm may be used. In step A2, the character region image is corrected, mainly by rotation and affine transformations. In step A3, single characters are extracted by row-column segmentation: using the gaps between rows and columns of characters, the segmentation points are found through binarization and projection. In step A4, individual characters are recognized using Histogram of Oriented Gradients (HOG) features or Convolutional Neural Network (CNN) features combined with a classification model. In step A5, semantic error correction is performed based on a statistical language model or rules, and the corresponding text recognition result is finally output.
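For illustration only, step A1 of the conventional pipeline above can be sketched with OpenCV's MSER detector as follows; this is a minimal example under assumed default parameters, not part of the claimed method.

```python
# Hedged illustration of step A1 of the traditional pipeline: candidate text
# regions are located with the MSER algorithm via OpenCV.
import cv2

def locate_text_regions(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    # Each bounding box (x, y, w, h) is a candidate text region to be corrected,
    # segmented and classified in steps A2-A5.
    return bboxes
```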
Referring to fig. 3, a schematic flowchart of text recognition based on deep learning in the embodiment of the present application, a target video frame is used as input. In step A1, a CNN or a Recurrent Neural Network (RNN) may be used to locate text regions. In step A2, text line recognition is implemented using a CNN, an RNN, or Connectionist Temporal Classification (CTC), and the corresponding text recognition result is finally output.
With reference to the above description, a scheme provided in an embodiment of the present application relates to technologies such as artificial intelligence computer vision and machine learning, and a text recognition method based on a video in the present application is described below, with reference to fig. 4, an embodiment of the text recognition method based on a video in the embodiment of the present application includes:
101. acquiring a first video frame and a second video frame from a video to be identified, wherein the video to be identified comprises at least two video frames, and the first video frame and the second video frame are two adjacent video frames;
in this embodiment, the text recognition device obtains the video to be recognized. The format of the video to be recognized includes, but is not limited to, the Moving Picture Experts Group (MPEG) format, the Advanced Streaming Format (ASF), the Audio Video Interleaved (AVI) format, the RealMedia Variable Bitrate (RMVB) format, the Flash Video (FLV) format, and the like, which is not limited herein.
Specifically, after the video to be recognized is acquired, it may be decoded into consecutive video frames using FFmpeg, an open-source suite of computer programs for recording, converting and streaming digital audio and video. Two consecutive video frames are then taken from the video to be recognized as a video frame pair, each pair consisting of a first video frame and a second video frame. In practice, assuming the video to be recognized contains 5 video frames (video frame 1 through video frame 5), 4 video frame pairs are obtained: video frame 1 with video frame 2, video frame 2 with video frame 3, video frame 3 with video frame 4, and video frame 4 with video frame 5. Each video frame pair serves as input to the text recognition network.
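As an illustration of the above decoding and pairing, the following Python sketch dumps frames with the FFmpeg command-line tool and builds consecutive pairs; file names, paths and the frame-naming pattern are assumptions.

```python
# Hedged sketch (not from the patent): decode a video into frames with FFmpeg,
# then group consecutive frames into pairs to be fed to the text recognition network.
import subprocess
from pathlib import Path

def decode_to_frames(video_path: str, out_dir: str) -> list:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Dump every frame as a numbered PNG image.
    subprocess.run(["ffmpeg", "-i", video_path, f"{out_dir}/%06d.png"], check=True)
    return sorted(Path(out_dir).glob("*.png"))

def make_frame_pairs(frames: list) -> list:
    # Adjacent frames form pairs: (1,2), (2,3), (3,4), ...
    return list(zip(frames, frames[1:]))

frames = decode_to_frames("video_to_recognize.mp4", "frames")
pairs = make_frame_pairs(frames)   # e.g. 5 frames -> 4 pairs, as in the example above
```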
It should be noted that the text recognition apparatus is disposed in a computer device, and may be disposed in a terminal device or a server, or may be disposed in a system formed by the server and the terminal device, which is not limited herein.
102. Based on a first video frame, acquiring a first text probability value and a first feature vector through a first identification network included in a text identification network, wherein the first text probability value represents the probability of text appearing in the first video frame;
in this embodiment, taking a video frame pair formed by a first video frame and a second video frame as an example, the text recognition device inputs the first video frame into a first recognition network included in the text recognition network, which outputs a first text probability value and a first feature vector. The text recognition network adopts a twin network structure: the twin network comprises two sub-networks, namely the first recognition network and the second recognition network; each sub-network receives a different input, maps it into a high-dimensional feature space, and outputs a corresponding representation. By calculating the distance between the two representations (e.g. the Euclidean distance), the similarity of the two inputs can be compared, and the weights of the two sub-networks can be optimized through an energy function or a classification loss.
Specifically, an energy function is used at the top layer of the text recognition network; the energy function is the element-wise absolute difference of the two feature vectors, and a full connection layer connected after the energy function performs the text similarity calculation.
103. Based on the second video frame, acquiring a second text probability value and a second feature vector through a second identification network included in the text identification network, wherein the second text probability value represents the probability of text appearing in the second video frame, and the second identification network shares weight with the first identification network;
in this embodiment, following step 102, the text recognition apparatus also inputs the second video frame into a second recognition network included in the text recognition network, which outputs the second text probability value and the second feature vector. Thus, by using the first video frame and the second video frame as inputs to the text recognition network, both the degree of text similarity between the video frames and the likelihood that they contain text can be calculated. Since the text recognition network has a twin network structure, the network parameters of the two "twinned" networks (i.e. the first recognition network and the second recognition network) are shared; because each network computes the same function, two extremely similar images cannot be mapped to very different locations in the feature space. The twin network (i.e. the text recognition network) is symmetrical, which ensures that the top-layer energy function yields the same similarity regardless of which of the two "twin" networks (the first or the second recognition network) each of the two video frames is fed into.
104. Based on the first feature vector and the second feature vector, acquiring a similarity score through a full connection layer included in a text recognition network;
in this embodiment, the text recognition apparatus inputs the first feature vector and the second feature vector to a Full Connection (FC) layer included in the text recognition network, and outputs a similarity score of a text between the first video frame and the second video frame through the FC layer.
Specifically, for convenience of understanding, please refer to fig. 5, where fig. 5 is a schematic structural diagram of a text recognition network in an embodiment of the present application, and as shown in the figure, a video to be recognized is first obtained, a first video frame and a second video frame are then extracted from the video, and the first video frame and the second video frame are then input to the text recognition network. The text recognition network includes a first recognition network and a second recognition network, and therefore, the first video frame needs to be input to the first recognition network and the second video frame needs to be input to the second recognition network. The first recognition network has a similar network structure to the second recognition network, the first recognition network includes a convolutional network, an attention network, and an image feature extraction network, and the second recognition network also includes a convolutional network, an attention network, and an image feature extraction network. The first text probability value is output by an attention network and an FC layer included in the first recognition network, and the second text probability value is output by an attention network and an FC layer included in the second recognition network. The first feature vector is output by an image feature extraction network comprised by the first recognition network, and the second feature vector is output by an image feature extraction network comprised by the second recognition network. The first feature vector and the second feature vector are collectively input to the FC layer, thereby obtaining a similarity score.
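For illustration only, a minimal PyTorch-style sketch of the structure described for fig. 5 may look as follows. All module names, layer sizes and the sigmoid outputs are assumptions: the convolutional network, attention network and image feature extraction network are reduced to stand-in layers, and only the overall weight-shared twin arrangement and the absolute-difference energy function follow the description above.

```python
# Hedged sketch of the twin text recognition network of fig. 5 (assumed shapes;
# the real backbone, attention network and feature extraction network are
# described in later embodiments and are stubbed out here).
import torch
import torch.nn as nn

class RecognitionBranch(nn.Module):
    """One of the two weight-shared sub-networks (first/second recognition network)."""
    def __init__(self, feat_channels=512, feat_dim=1024):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_channels, 3, stride=32, padding=1)  # stand-in for the conv network
        self.attention_fc = nn.Linear(feat_channels, 1)          # stand-in for attention network + FC layer
        self.image_feature = nn.Linear(feat_channels, feat_dim)  # stand-in for image feature extraction network

    def forward(self, frame):
        fmap = self.backbone(frame)                    # feature map
        pooled = fmap.mean(dim=(2, 3))                 # (B, C) summary used by both branches here
        text_prob = torch.sigmoid(self.attention_fc(pooled)).squeeze(-1)
        feat_vec = self.image_feature(pooled)
        return text_prob, feat_vec

class TextRecognitionNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = RecognitionBranch()      # a single module => weights shared by both inputs
        self.similarity_fc = nn.Linear(1024, 1)

    def forward(self, frame_a, frame_b):
        p1, f1 = self.branch(frame_a)          # first recognition network
        p2, f2 = self.branch(frame_b)          # second recognition network (shared weights)
        energy = torch.abs(f1 - f2)            # element-wise absolute difference (energy function)
        score = torch.sigmoid(self.similarity_fc(energy)).squeeze(-1)
        return p1, p2, score
```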
105. If the first text probability value and the second text probability value are both greater than or equal to the text probability threshold and the similarity score is less than or equal to the similarity threshold, determining a target video frame according to the first video frame and the second video frame;
in this embodiment, the text recognition device needs to determine whether the first text probability value is greater than or equal to a text probability threshold, and determine whether the second text probability value is greater than or equal to the text probability threshold, if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold, it indicates that the first video frame and the second video frame both contain text content, and otherwise, it indicates that the video frame does not contain text content. Meanwhile, the text recognition device also needs to judge whether the similarity score is smaller than or equal to a similarity threshold, if so, the text of the first video frame is very similar to the text of the second video frame, otherwise, the difference between the two is large.
Based on this, if it is determined that the first video frame and the second video frame both include text and that the text of the first video frame and the text of the second video frame are very similar, then the first video frame and the second video frame are both candidate video frames, and finally, any one video frame is taken out from the candidate video frames as a target video frame.
It is understood that the text probability threshold may be set to 0.5 or 0.8, or may be set to other values, and the similarity threshold may be set to 0.5 or 0.3, or may be set to other values, which are not limited herein.
106. And performing text recognition on the target video frame.
In this embodiment, the text recognition device may use an OCR technology to recognize text content in the target video frame, so as to obtain a corresponding text recognition result.
For ease of understanding, the text recognition method provided by the present application is described below with reference to fig. 6, which shows an embodiment of extracting a target video frame. Taking six consecutive video frames of a video to be recognized as an example, the text probability values of video frame 1 and video frame 2 are both greater than the text probability threshold and their similarity score is less than or equal to the similarity threshold, so video frame 1 and video frame 2 are both candidate video frames. The text probability values of video frame 2 and video frame 3 are likewise both greater than the text probability threshold and their similarity score is less than or equal to the similarity threshold, so video frame 3 is also a candidate video frame. The text probability value of video frame 4 is less than the text probability threshold, so video frame 4 is not a candidate video frame, and for the same reason video frame 5 is not a candidate video frame either. The text probability value of video frame 6 is greater than the text probability threshold, but the similarity score between video frame 6 and video frame 5 is greater than the similarity threshold, so it is still necessary to evaluate the similarity score between video frame 6 and the next video frame and to judge whether the text probability value of that next frame is greater than or equal to the text probability threshold.
Therefore, video frame 1, video frame 2 and video frame 3 are candidate video frames; any one of these three frames is selected as the target video frame, and OCR is finally performed on the target video frame to obtain a text recognition result, for example "new sedan".
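For illustration only, the selection logic walked through above can be sketched as follows. The threshold values are placeholders, the `text_prob` and `similarity` callables are assumptions, and the comparison convention (a similarity score less than or equal to the threshold means the two frames carry very similar text) follows the claims.

```python
# Hedged sketch of grouping frames into text video frame intervals and picking
# one target frame per interval, as in the fig. 6 example.
import random

TEXT_PROB_THRESHOLD = 0.5     # placeholder
SIMILARITY_THRESHOLD = 0.5    # placeholder

def select_target_frames(frames, text_prob, similarity):
    """frames: ordered frame ids; text_prob(i) and similarity(i, j) are assumed callables."""
    intervals, current = [], []
    for prev, cur in zip(frames, frames[1:]):
        both_have_text = (text_prob(prev) >= TEXT_PROB_THRESHOLD and
                          text_prob(cur) >= TEXT_PROB_THRESHOLD)
        similar = similarity(prev, cur) <= SIMILARITY_THRESHOLD
        if both_have_text and similar:
            # prev and cur belong to the same text video frame interval
            if not current:
                current = [prev]
            current.append(cur)
        elif current:
            intervals.append(current)
            current = []
    if current:
        intervals.append(current)
    # Any one frame of each interval is taken as the target video frame for OCR.
    return [random.choice(interval) for interval in intervals]
```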
The embodiment of the application provides a method for text recognition. In the above manner, the twin network is used to judge whether a video frame contains text, so that text detection is performed only on video frames that actually contain text; at the same time, the twin network calculates the text similarity between video frames, so that highly similar frames can be identified, any one frame is extracted from them for text recognition, and the efficiency of text detection for video is improved.
Optionally, on the basis of the embodiments corresponding to fig. 4, in another optional embodiment, based on the first video frame, the method for acquiring the first text probability value and the first feature vector through the first recognition network included in the text recognition network specifically includes the following steps:
based on the first video frame, acquiring a first feature map through a convolutional network included in a first identification network, wherein the first identification network belongs to a text identification network;
acquiring a first text probability value through an attention network included in a first recognition network based on the first feature map;
acquiring a first feature vector through an image feature extraction network included in a first identification network based on the first feature map;
based on the second video frame, obtaining a second text probability value and a second feature vector through a second identification network included in the text identification network, and specifically including the following steps:
based on the second video frame, acquiring a second feature map through a convolutional network included in a second identification network, wherein the second identification network belongs to a text identification network;
acquiring a second text probability value through an attention network included in a second recognition network based on the second feature map;
and acquiring a second feature vector through an image feature extraction network included in the second recognition network based on the second feature map.
In this embodiment, a method for extracting a text probability value and a feature vector based on a twin network structure is described. As described in the foregoing embodiments, since the text recognition network belongs to the structure of the twin network, the first recognition network and the second recognition network included in the text recognition network have similar structures, each recognition network includes two branches, one branch is a text branch including a convolutional network, an attention network, and an FC layer, and the other branch is an image similarity branch including a convolutional network, an image feature extraction network, and an FC layer.
Specifically, referring to fig. 7, fig. 7 is a schematic structural diagram of an identification network according to an embodiment of the present disclosure, in which a video frame (e.g., a first video frame) is input to a convolution network included in the identification network (e.g., the first identification network), and a feature map (e.g., a first feature map) is output through the convolution network. Next, a feature map (e.g., a first feature map) is used as an input of the attention network and the image feature extraction network, respectively, an output result of the attention network is input to an FC layer in the recognition network (e.g., the first recognition network), thereby obtaining an output text probability value (e.g., a first text probability value), and a feature vector (e.g., a first feature vector) is output by the image feature extraction network.
Similarly, as can be seen from fig. 7, a video frame (e.g., a second video frame) is input to a convolution network included in an identification network (e.g., a second identification network), and a feature map (e.g., a second feature map) is output through the convolution network. Next, a feature map (e.g., a second feature map) is input as an input to the attention network and the image feature extraction network, respectively, an output result of the attention network is input to an FC layer in the recognition network (e.g., a second recognition network), thereby obtaining an output text probability value (e.g., a second text probability value), and a feature vector (e.g., a second feature vector) is output by the image feature extraction network.
More specifically, the text branch and the similarity branch share the same convolutional network. For ease of illustration, refer to fig. 8, a schematic structural diagram of the convolutional network in the embodiment of the present application: a video frame is scaled to a 224 × 224 image and then input into the convolutional network, which consists of all network layers of Residual Network 18 (ResNet-18) before the conv5_2 convolutional layer. Table 1 lists the network layers of ResNet-18 before conv5_2.
TABLE 1
(Table 1 is reproduced as an image in the original publication: the ResNet-18 layers up to conv5_2.)
As can be seen from Table 1, the network layers before the conv5_2 convolutional layer comprise 17 layers in total, so a 224 × 224 image passed through this network yields a feature map (i.e. the first feature map and the second feature map) of size 512 × 7 × 7. It should be noted that in practical applications the convolutional network may adopt other numbers and types of network layers, and the sizes of the first feature map and the second feature map may take other values; the above is only an illustration and should not be understood as a limitation of the present application.
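For illustration only, a truncated ResNet-18 backbone that maps a 224 × 224 frame to a 512 × 7 × 7 feature map can be sketched with torchvision as follows. Where exactly the conv5_2 cut falls is an assumption; here the standard torchvision ResNet-18 is simply truncated before its global pooling and classification layers.

```python
# Hedged sketch of a truncated ResNet-18 backbone producing a 512x7x7 feature map.
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18()
backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

frame = torch.randn(1, 3, 224, 224)          # a video frame scaled to 224x224
feature_map = backbone(frame)
print(feature_map.shape)                     # torch.Size([1, 512, 7, 7])
```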
Secondly, the embodiment of the application provides a method for extracting text probability values and feature vectors based on a twin network structure. With this method, the twin network structure can process the two video frames in parallel, which yields higher processing efficiency, while the convolutional network extracts the text information in the video frames for subsequent processing. Furthermore, each recognition network comprises an image feature extraction network and an attention network, so that the corresponding text probability value and feature vector can be output.
Optionally, on the basis of the embodiments corresponding to fig. 4, in another optional embodiment, the obtaining of the first text probability value through the attention network included in the first recognition network based on the first feature map specifically includes the following steps:
generating a first feature vector set to be processed according to the first feature map, wherein the first feature vector set to be processed comprises M first feature vectors to be processed, each first feature vector to be processed comprises N elements, and both N and M are integers greater than 1;
generating a second feature vector set to be processed according to the first feature vector set to be processed, wherein the second feature vector set to be processed comprises N second feature vectors to be processed, and each second feature vector to be processed comprises M elements;
acquiring a first attention feature vector through an attention network included in the first identification network based on the second feature vector set to be processed;
acquiring a first text probability value through a full connection layer included in a first identification network based on the first attention feature vector;
based on the second feature map, acquiring a second text probability value through an attention network included in a second recognition network, specifically including the following steps:
generating a third feature vector set to be processed according to the second feature map, wherein the third feature vector set to be processed comprises M third feature vectors to be processed, and each third feature vector to be processed comprises N elements;
generating a fourth feature vector set to be processed according to the third feature vector set to be processed, wherein the fourth feature vector set to be processed comprises N fourth feature vectors to be processed, and each fourth feature vector to be processed comprises M elements;
acquiring a second attention feature vector through an attention network included in a second identification network based on the fourth feature vector set to be processed;
and acquiring a second text probability value through a full connection layer included by the second identification network based on the second attention feature vector.
In this embodiment, a method for obtaining a text probability value based on an attention network is introduced. As described in the foregoing embodiments, since the text recognition network belongs to the structure of the twin network, the first recognition network and the second recognition network included in the text recognition network have similar structures. The way in which feature processing is based on text branching in each recognition network will be described below.
Specifically, for convenience of introduction, a feature map (for example, the first feature map or the second feature map) with a size of 512 × 7 × 7 is taken as an example. Please refer to fig. 9, which is a schematic diagram of processing the feature map in the embodiment of the present application. As shown in the figure, each 7 × 7 channel of the feature map is flattened into a 49-dimensional feature vector, so that a first set of to-be-processed feature vectors is obtained, which can be represented as 512 × 49; the first set of to-be-processed feature vectors includes M first to-be-processed feature vectors, each first to-be-processed feature vector includes N elements, where M is 512 and N is 49. Next, for the first set of to-be-processed feature vectors (i.e., 512 first to-be-processed feature vectors of 49 dimensions each), the elements corresponding to each dimension are taken out to form a second set of to-be-processed feature vectors, which can be represented as 49 × 512; the second set of to-be-processed feature vectors includes N second to-be-processed feature vectors, each second to-be-processed feature vector includes M elements, where N is 49 and M is 512.
Based on this, the second set of feature vectors to be processed is input to the attention network included in the first recognition network to obtain a first attention feature vector, and finally the first attention feature vector is input to the FC layer included in the first recognition network to obtain a first text probability value, where the first text probability value is a value greater than or equal to 0 and less than or equal to 1.
Similarly, as can be seen from fig. 9, each 7 × 7 channel of the second feature map is flattened into a 49-dimensional feature vector, so that a third set of to-be-processed feature vectors is obtained, which can be represented as 512 × 49; the third set of to-be-processed feature vectors includes M third to-be-processed feature vectors, each third to-be-processed feature vector includes N elements, where M is equal to 512 and N is equal to 49. Next, for the third set of to-be-processed feature vectors (i.e., 512 third to-be-processed feature vectors of 49 dimensions each), the elements corresponding to each dimension are respectively taken out to form a fourth set of to-be-processed feature vectors, which can be represented as 49 × 512; the fourth set of to-be-processed feature vectors includes N fourth to-be-processed feature vectors, each fourth to-be-processed feature vector includes M elements, where N is equal to 49 and M is equal to 512.
Based on this, the fourth set of feature vectors to be processed is input to the attention network included in the second recognition network to obtain a second attention feature vector, and finally the second attention feature vector is input to the FC layer included in the second recognition network to obtain a second text probability value, where the second text probability value is a value greater than or equal to 0 and less than or equal to 1.
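As a hedged illustration of the reshaping described above (assuming PyTorch; tensor names are illustrative, not taken from the patent), the 512 × 7 × 7 feature map can be flattened and transposed as follows:

```python
import torch

feature_map = torch.randn(512, 7, 7)       # first or second feature map
first_set = feature_map.reshape(512, 49)    # M = 512 vectors with N = 49 elements each
second_set = first_set.transpose(0, 1)      # N = 49 vectors with M = 512 elements each
print(second_set.shape)                     # torch.Size([49, 512]) -> input to the attention network
```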
It will be appreciated that the attention network based processing is as follows:
$$u_i = \tanh\left(W h_i\right)$$

$$\alpha_i = \frac{\exp\left(w^\top u_i\right)}{\sum_{j=1}^{L} \exp\left(w^\top u_j\right)}$$

$$c = \sum_{i=1}^{L} \alpha_i h_i$$

wherein $c$ represents the attention feature vector (e.g., the first attention feature vector or the second attention feature vector); that is, an attention mechanism is used to calculate the weights of the 49 feature vectors to be processed (e.g., the second feature vectors to be processed or the fourth feature vectors to be processed), and a weighted summation is performed to generate the final 512-dimensional attention feature vector. $L$ represents the total number of feature vectors to be processed, $h_i$ represents the $i$-th feature vector to be processed, $u_i$ represents the intermediate coded vector of the $i$-th feature vector to be processed, $\alpha_i$ represents the attention coding result (weight) of the $i$-th feature vector to be processed, and $W$ and $w$ represent network parameters of the attention network.
In order to recognize text more accurately by using the text information in the images, attention networks are introduced for the first feature map and the second feature map; that is, the first feature map and the second feature map output by the convolutional networks are first processed (reshaped) as described above, the processed results are input to the attention networks, and the attention networks output the corresponding attention feature vectors.
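A rough sketch of such an attention computation is given below, assuming PyTorch. The hidden size and the exact parameterization (a linear projection with tanh followed by a scalar scoring layer) are assumptions consistent with the reconstruction above, not the patent's exact network definition.

```python
import torch
import torch.nn as nn


class TextAttention(nn.Module):
    """Score 49 to-be-processed vectors, softmax the scores, return the weighted sum."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, hidden)   # W: produces the intermediate coded vectors
        self.score = nn.Linear(hidden, 1)    # w: produces a scalar attention score

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (L, dim), e.g. 49 x 512 to-be-processed feature vectors
        e = self.score(torch.tanh(self.proj(h)))   # (L, 1) scores
        alpha = torch.softmax(e, dim=0)             # (L, 1) attention weights
        return (alpha * h).sum(dim=0)                # (dim,) attention feature vector


attn = TextAttention()
vectors = torch.randn(49, 512)                      # second (or fourth) set of to-be-processed vectors
text_feature = attn(vectors)                        # then fed to the FC layer to get the text probability
print(text_feature.shape)                           # torch.Size([512])
```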
Optionally, on the basis of each embodiment corresponding to fig. 4, in another optional embodiment, the embodiment of the present application provides a method for obtaining a first feature vector through an image feature extraction network included in a first recognition network based on a first feature map, which specifically includes the following steps:
acquiring K first to-be-spliced feature vectors through an image feature extraction network included in a first identification network based on a first feature map, wherein the K first to-be-spliced feature vectors include first to-be-spliced feature vectors obtained through an average pooling layer, and K is an integer greater than 1;
acquiring a first feature vector through an image feature extraction network included in a first identification network according to the K first feature vectors to be spliced;
based on the second feature map, acquiring a second feature vector through an image feature extraction network included in the second recognition network, wherein the second feature vector comprises:
acquiring K second feature vectors to be spliced through an image feature extraction network included by a second identification network based on a second feature map, wherein the K second feature vectors to be spliced include second feature vectors to be spliced obtained through an average pooling layer;
and acquiring a second feature vector through an image feature extraction network included by the second identification network according to the K second feature vectors to be spliced.
In this embodiment, a method for extracting feature vectors based on an image feature extraction network is described. As described in the foregoing embodiments, since the text recognition network belongs to the structure of the twin network, the first recognition network and the second recognition network included in the text recognition network have similar structures. The following describes the feature processing method based on the image similarity branch in each recognition network. In the image similarity branch, in order to generate high-dimensional image features with more information content, an Inception-A module and an average pooling operation are adopted in the image feature extraction network.
Specifically, taking a feature map (i.e., the first feature map or the second feature map) of size 512 × 7 × 7 as an example, in order to combine more effectively with the network structure of the convolutional network, the number of input channels of the Inception-A module is modified from 384 to 512, so that the feature vectors output by the first recognition network and the second recognition network are 512-dimensional. For convenience of introduction, please refer to fig. 10, which is a schematic structural diagram of the image feature extraction network in the embodiment of the present application. As shown in the figure, a 512 × 7 × 7 feature map (e.g., the first feature map) is input to the image feature extraction network. A 128-dimensional first to-be-spliced feature vector is obtained through an average pooling layer in the image feature extraction network followed by a convolution layer with a 1 × 1 convolution kernel. Another 128-dimensional first to-be-spliced feature vector is obtained through another convolution layer with a 1 × 1 convolution kernel. A further 128-dimensional first to-be-spliced feature vector is obtained through a convolution layer with a 1 × 1 convolution kernel followed by a convolution layer with a 3 × 3 convolution kernel. A final 128-dimensional first to-be-spliced feature vector is obtained through a convolution layer with a 1 × 1 convolution kernel followed by two convolution layers with 3 × 3 convolution kernels. Based on the above, 4 first to-be-spliced feature vectors are obtained, and in this case K is equal to 4. The 4 first to-be-spliced feature vectors are input to the filter concatenation included in the image feature extraction network, so as to obtain the feature vector (i.e., the 512-dimensional first feature vector).
Similarly, as can be seen from fig. 10, a 512 × 7 × 7 feature map (e.g., the second feature map) is input to the image feature extraction network, and a 128-dimensional second to-be-spliced feature vector is obtained through an average pooling layer followed by a convolution layer with a 1 × 1 convolution kernel. Another 128-dimensional second to-be-spliced feature vector is obtained through another convolution layer with a 1 × 1 convolution kernel. A further 128-dimensional second to-be-spliced feature vector is obtained through a convolution layer with a 1 × 1 convolution kernel followed by a convolution layer with a 3 × 3 convolution kernel. A final 128-dimensional second to-be-spliced feature vector is obtained through a convolution layer with a 1 × 1 convolution kernel followed by two convolution layers with 3 × 3 convolution kernels. Based on the above, 4 second to-be-spliced feature vectors are obtained, and in this case K is equal to 4. The 4 second to-be-spliced feature vectors are input to the filter concatenation included in the image feature extraction network to obtain the feature vector (i.e., the 512-dimensional second feature vector).
In the embodiment of the application, a method for extracting a feature vector based on an image feature extraction network is provided. Through this method, in the image similarity branch, an Inception module and an average pooling operation are added in order to generate high-dimensional image features with more information content; because the Inception module is reasonably designed, the extracted image features are more meaningful.
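The following is a hedged sketch of an Inception-A style extraction module matching the description above, assuming PyTorch. The paddings and the placement of the final spatial pooling are assumptions not specified in the text, so this should be read as an illustration rather than the patented module.

```python
import torch
import torch.nn as nn


class ImageFeatureExtractor(nn.Module):
    """Four parallel 128-channel branches, filter concatenation, then a 512-dim vector."""

    def __init__(self, in_ch: int = 512, branch_ch: int = 128):
        super().__init__()
        self.b1 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, branch_ch, 1))            # avg pool + 1x1
        self.b2 = nn.Conv2d(in_ch, branch_ch, 1)                            # 1x1
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))  # 1x1 + 3x3
        self.b4 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))  # 1x1 + 3x3 + 3x3
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 512, 7, 7); filter concatenation of the four branches gives 512 channels
        out = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
        return self.pool(out).flatten(1)    # (B, 512) feature vector


vec = ImageFeatureExtractor()(torch.randn(1, 512, 7, 7))
print(vec.shape)                            # torch.Size([1, 512])
```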
Optionally, on the basis of each embodiment corresponding to fig. 4, in another optional embodiment, the method for obtaining the similarity score through the full-link layer included in the text recognition network based on the first feature vector and the second feature vector specifically includes the following steps:
subtracting elements at the same position in the first feature vector and the second feature vector to obtain an intermediate feature vector;
carrying out absolute value taking processing on the intermediate characteristic vector to obtain a target characteristic vector;
and acquiring a similarity score through the full connection layer based on the target feature vector.
In this embodiment, a method for generating a similarity score based on the FC layer is described. After a first feature vector is acquired based on the first recognition network and a second feature vector is acquired based on the second recognition network, the elements at the same position in the first feature vector and the second feature vector may be subtracted to obtain an intermediate feature vector. For convenience of understanding, the first feature vector and the second feature vector are both described as 4-dimensional in the following example; it should be noted that in practical applications, the first feature vector and the second feature vector may be 512-dimensional or may have other designed dimensions, and this is only an illustration and should not be construed as a limitation of the present application.
Specifically, assuming that the first feature vector is (3, 1, 8, 5) and the second feature vector is (5, 1, 10, 6), the elements at the same position in the first feature vector and the second feature vector are subtracted, i.e., 3-5= -2, 1-1=0, 8-10= -2, 5-6= -1, thereby obtaining an intermediate feature vector (-2, 0, -2, -1). Then, the absolute value of the intermediate feature vector is taken to obtain a target feature vector; for example, the target feature vector corresponding to the intermediate feature vector (-2, 0, -2, -1) is (2, 0, 2, 1). The target feature vector is input into the FC layer, and the FC layer, using a sigmoid activation function, maps the representation of the degree of similarity between the two video frames (derived from the first feature vector and the second feature vector) to [0,1]; the value in [0,1] is the similarity score.
It will be appreciated that if the similarity score is closer to 0, it indicates that the text within the first video frame is more similar to the text within the second video frame, whereas if the similarity score is closer to 1, it indicates that the text within the first video frame is more different from the text within the second video frame.
Secondly, in the embodiment of the present application, a method for generating a similarity score based on an FC layer is provided, and by the above method, after a first feature vector and a second feature vector are obtained, an absolute value subtraction operation is performed on the two feature vectors, so that a difference between two video frames is obtained, which is beneficial to outputting a more accurate determination result.
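A minimal sketch of this similarity head, assuming PyTorch and reusing the 4-dimensional example above (the full model would use 512-dimensional vectors and a trained FC layer):

```python
import torch
import torch.nn as nn

fc = nn.Linear(4, 1)                        # 512 -> 1 in the full model; 4 -> 1 to match the example
v1 = torch.tensor([3., 1., 8., 5.])         # first feature vector
v2 = torch.tensor([5., 1., 10., 6.])        # second feature vector
target = (v1 - v2).abs()                    # target feature vector: (2, 0, 2, 1)
score = torch.sigmoid(fc(target))           # similarity score in [0, 1]
print(float(score))
```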
Optionally, on the basis of the embodiments corresponding to fig. 4, in another optional embodiment provided by the embodiments of the present application, if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold, and the similarity score is less than or equal to the similarity threshold, the method further includes the following steps:
acquiring a first frame identifier corresponding to a first video frame and a second frame identifier corresponding to a second video frame;
and determining the appearance time of the first video frame in the video to be identified and the appearance time of the second video frame in the video to be identified according to the first frame identifier, the second frame identifier and the frame rate of the video to be identified.
In this embodiment, a method for determining the occurrence time of a video frame in the video to be identified is described. Each video frame corresponds to a frame identifier; for example, the first frame identifier corresponding to the first video frame is "010", the second frame identifier corresponding to the second video frame is "011", and so on. In the case that the first text probability value and the second text probability value are both greater than or equal to the text probability threshold and the similarity score is less than or equal to the similarity threshold, that is, it is determined that both the first video frame and the second video frame contain text content and the text content is highly similar, the frame rate of the video to be recognized also needs to be obtained. Assuming that the frame rate of the video to be recognized is 10 frames per second and the identifier of the very first frame of the video to be recognized is "001", the first frame identifier "010" corresponds to the 10 th frame of the video to be recognized, so the occurrence time of the first video frame in the video to be recognized is the 1 st second; the second frame identifier "011" corresponds to the 11 th frame, so the occurrence time of the second video frame in the video to be recognized is the 1.1 th second.
Further, according to the frame identifier and the frame rate of the video to be recognized, the starting time and the ending time of the text are obtained from the video to be recognized, that is, a text video frame interval is obtained, any one video frame is extracted from the text video frame interval to serve as a target video frame, and the content, the position and the like of the text in the target video frame are detected by combining an OCR technology.
Secondly, in the embodiment of the application, a way of determining the occurrence time of a video frame in the video to be recognized is provided. In this way, the starting position and the ending position of the same subtitle can be predicted according to the frame rate and the frame identifiers of the video to be recognized, with frame-level accuracy, so that the accuracy of video recognition is improved.
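As a small illustration of this mapping (assuming 1-based, consecutive numeric frame identifiers, which the example above suggests but does not state explicitly):

```python
def appearance_time(frame_id: str, first_id: str, fps: float) -> float:
    """Return the appearance time (in seconds) of a frame, given the video's first identifier and frame rate."""
    # number of frames from the start of the video, divided by the frame rate
    return (int(frame_id) - int(first_id) + 1) / fps


print(appearance_time("010", "001", 10.0))   # 1.0 second  (first video frame)
print(appearance_time("011", "001", 10.0))   # 1.1 seconds (second video frame)
```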
Optionally, on the basis of the foregoing embodiments corresponding to fig. 4, an embodiment of the present application provides another optional embodiment, which further includes the following steps:
if the first text probability value is smaller than the text probability threshold value and the second text probability value is larger than or equal to the text probability threshold value, rejecting the first video frame;
if the first text probability value is larger than or equal to the text probability threshold value and the second text probability value is smaller than the text probability threshold value, rejecting the second video frame;
and if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold value, and the similarity degree value is less than or equal to the similarity degree threshold value, determining that the first video frame and the second video frame belong to the same text video frame interval.
In this embodiment, a method for screening video frames based on text probability values is introduced. After the first text probability corresponding to the first video frame and the second text probability corresponding to the second video frame are obtained, whether the video frames contain text content or not can be judged preferentially, and if the video frames do not contain the text content, the influence of the similarity score does not need to be considered.
Specifically, if the first text probability value is smaller than the text probability threshold, it indicates that no text content is contained in the first video frame, and the first video frame is directly rejected; similarly, if the second text probability value is smaller than the text probability threshold, it indicates that no text content is contained in the second video frame, and the second video frame is directly rejected. If the first text probability value and the second text probability value are both greater than or equal to the text probability threshold, it is further judged whether the similarity score is less than or equal to the similarity threshold. If the similarity score is less than or equal to the similarity threshold, the degree of similarity between the first video frame and the second video frame is high, so it can be determined that the first video frame and the second video frame belong to the same text video frame interval. On the contrary, if the similarity score is larger than the similarity threshold, the degree of similarity between the first video frame and the second video frame is low, and therefore the first video frame and the second video frame belong to different text video frame intervals.
It should be noted that the text video frame interval includes video frames with text content, and the video frames included in different text video frame intervals often have different text content, for example, the text content included in the text video frame interval a is "new sedan", and the text content included in the text video frame interval B is "supermarket operation".
Secondly, in the embodiment of the application, a method for screening video frames based on text probability values is provided, through the method, whether the video frames contain text content or not can be preferentially judged, if the video frames do not contain the text content, the video frames are directly removed, and similarity scores among the video frames are not considered any more, so that the processing efficiency is improved.
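The screening logic can be summarized in the following sketch; the threshold values are hypothetical, since the text does not prescribe specific numbers, and a similarity score closer to 0 means more similar text, as described earlier.

```python
def screen(p1: float, p2: float, sim: float,
           text_thr: float = 0.5, sim_thr: float = 0.5) -> str:
    """Decide how to handle a pair of adjacent video frames from their text probabilities and similarity score."""
    if p1 < text_thr and p2 < text_thr:
        return "reject both video frames (no text content)"
    if p1 < text_thr:
        return "reject the first video frame"
    if p2 < text_thr:
        return "reject the second video frame"
    if sim <= sim_thr:
        return "same text video frame interval"
    return "different text video frame intervals"


print(screen(0.9, 0.8, 0.1))   # same text video frame interval
```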
Optionally, on the basis of each embodiment corresponding to fig. 4, in another optional embodiment provided by the embodiment of the present application, the determining the target video frame according to the first video frame and the second video frame specifically includes the following steps:
determining that a first video frame and a second video frame belong to the same text video frame interval, wherein the text video frame interval comprises at least two video frames;
and selecting any one video frame from the text video frame interval as a target video frame.
In this embodiment, a method for performing random sampling detection from a text video frame interval is introduced. As can be seen from the foregoing embodiments, after two adjacent video frames are detected, a text video frame interval may be determined, where the text video frame interval includes candidate video frames.
Specifically, taking the frame rate of the video to be recognized as 10 frames per second as an example, assuming that the text video frame interval is 3 seconds to 10 seconds, there are 7 seconds in total, that is, there are 70 candidate video frames. Therefore, each of the 70 candidate video frames contains text content, and the text similarity between every two adjacent video frames is high, so that any one video frame can be taken out from the 70 candidate video frames as a "representative" of the 70 candidate video frames, that is, a target video frame is selected.
It should be noted that one video frame may be randomly selected from the text video frame interval as the target video frame, the first video frame in the text video frame interval may also be used as the target video frame, or the last video frame in the text video frame interval may also be used as the target video frame, and other manners may also be used to select the target video frame, which is not limited herein.
Secondly, in the embodiment of the present application, a method for performing random sampling detection from a text video frame interval is provided, and through the above method, any one frame can be selected from the text video frame interval as a target video frame, so that only OCR recognition is performed on the target video frame, and a recognition result of each frame in the text video frame interval can be obtained, thereby improving efficiency of text recognition and reducing resources consumed by recognition.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 4, in another optional embodiment provided by the embodiments of the present application, after performing text recognition on a target video frame, the method further includes the following steps:
acquiring a time interval corresponding to a text video frame interval and a text recognition result corresponding to a target video frame;
displaying a time interval corresponding to the text video frame interval and a text recognition result, wherein the time interval represents the time from the first video frame to the last video frame in the text video frame interval;
alternatively, the first and second electrodes may be,
after text recognition is carried out on the target video frame, the method further comprises the following steps:
acquiring a time interval corresponding to a text video frame interval and a text recognition result corresponding to a target video frame;
and sending the text recognition result and the time interval corresponding to the text video frame interval to the terminal equipment so that the terminal equipment displays the time interval corresponding to the text video frame interval and the text recognition result.
In this embodiment, a manner of displaying a time interval and a text recognition result is described. The text recognition result includes text content and may also include a text position, where the text content may be a dialog subtitle or a text on a guideboard, and is not limited herein.
Specifically, after the target video frame is obtained, the OCR technology is adopted to identify the target video frame, so as to obtain a text identification result. For convenience of understanding, please refer to fig. 11, where fig. 11 is a schematic interface diagram illustrating a text recognition result in an embodiment of the present application, where as shown, the text recognition result includes text content, for example, "new car", and the text recognition result further includes a text position, for example, "width: 1500, height: 400, left spacing: 500, upper pitch: 10". In addition, a time interval in which the target video frame corresponds to the text video frame interval, for example, "2 minutes 52 seconds to 2 minutes 55 seconds" may also be displayed.
It should be noted that, if the text recognition apparatus is deployed in the server, the server may feed back the time interval corresponding to the text video frame interval and the text recognition result corresponding to the target video frame to the terminal device, and the text recognition result is displayed by the terminal device. If the text recognition device is deployed in the terminal equipment, the terminal equipment can directly display the time interval corresponding to the text video frame interval and the text recognition result corresponding to the target video frame.
In the embodiment of the application, a way of displaying the time interval and the text recognition result is provided. In this way, the scheme can be applied to a video subtitle extraction system and a text-based video understanding system, and can also be applied to intelligent labeling tasks for Automatic Speech Recognition (ASR): text information in a video can be extracted quickly and accurately, and, according to the time information of the extracted text, the corresponding speech in the video is obtained, thereby completing the process of aligning the speech with the text. In addition, a convenient subtitle text extraction method is provided for subtitle translation and subtitle processing work, which reduces the burden of subtitle work.
With reference to fig. 12, a method for training a model in the present application will be described below, and an embodiment of the method for training a model in the present application includes:
201. acquiring a sample pair to be trained, wherein the sample pair to be trained comprises a first video frame sample and a second video frame sample, the first video frame sample corresponds to a first text label value, the second video frame sample corresponds to a second text label value, and the sample pair to be trained corresponds to a similarity label value;
in this embodiment, the model training apparatus obtains a pair of samples to be trained, the pair of samples to be trained is derived from one or more videos, and the types of the videos include, but are not limited to, MPEG format, ASF, AVI format, RMVB format, FLV format, and the like, which is not limited herein.
Specifically, after the video is acquired, the video may be decoded into consecutive video frames using FFmpeg, and then pairs of samples to be trained for training are selected from the video, where each pair of samples to be trained includes two adjacent video frames, i.e., a first video frame sample and a second video frame sample. Based on this, the first video frame sample and the second video frame sample are respectively labeled, and the labeling mode may be manual labeling or machine automatic labeling, which is not limited here. The annotated content includes a text annotation value for each sample of the video frame, e.g., a text annotation value of "0" indicating that there is no text content in the sample of the video frame, and a text annotation value of "1" indicating that there is text content in the sample of the video frame. Moreover, the annotated content also includes a similarity annotation value for two adjacent samples of video frames, e.g., a similarity annotation value of "0" indicating that the two adjacent samples of video frames have similar text and a similarity annotation value of "1" indicating that the two adjacent samples of video frames have different text.
It should be noted that the model training apparatus is deployed in a computer device, and may be specifically deployed in a terminal device or a server, and may also be deployed in a system formed by the server and the terminal device, which is not limited herein.
202. Based on a first video frame sample, acquiring a first text probability value and a first feature vector through a first recognition network included in a text recognition network to be trained, wherein the first text probability value represents the probability of text appearing in a first video frame;
in this embodiment, taking the video frame pair formed by the first video frame sample and the second video frame sample as an example, the model training device inputs the first video frame sample to the first recognition network included in the text recognition network to be trained, and the first recognition network outputs the first text probability value and the first feature vector. The text recognition network to be trained adopts the network structure of a twin network. The twin network comprises two sub-networks, namely the first recognition network and the second recognition network; each sub-network receives a different input, maps it to a high-dimensional feature space, and outputs a corresponding representation. By calculating the distance between the two representations (e.g., the Euclidean distance), the degree of similarity of the two inputs can be compared, and the weights of the two sub-networks can be optimized by an energy function or a classification loss.
Specifically, an energy function is used at the top layer of the text recognition network to be trained; the energy function subtracts the two feature vectors element-wise and takes the absolute value, and a full connection layer is connected after the energy function to carry out the similarity calculation.
203. Based on a second video frame sample, acquiring a second text probability value and a second feature vector through a second recognition network included in the to-be-trained text recognition network, wherein the second text probability value represents the probability of text appearing in a second video frame, and the second recognition network shares weight with the first recognition network;
in this embodiment, as can be seen from the content described in step 202, the model training apparatus also needs to input the second video frame sample to the second recognition network included in the text recognition network to be trained, and the second recognition network outputs the second text probability value and the second feature vector. In this way, the first video frame sample and the second video frame sample serve as the inputs of the text recognition network to be trained, and both the degree of text similarity between the video frames and the possibility that the frames contain text can be calculated. Since the text recognition network to be trained has a twin network structure, the network parameters of the two "twin" networks (i.e., the first recognition network and the second recognition network) are shared; because each network computes the same function, two extremely similar images cannot be mapped to very different locations in the feature space by the respective networks. The twin network (i.e., the text recognition network to be trained) is symmetrical, which ensures that the energy function at the top layer obtains the same similarity regardless of which of the two "twin" networks (i.e., the first recognition network or the second recognition network) each of the two video frames is input to.
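For illustration, a hedged sketch of this weight sharing in PyTorch follows; DummyRecognitionNet is a hypothetical stand-in for the backbone, text branch and image similarity branch described earlier, not the patented network. Because a single module instance processes both inputs, the "first" and "second" recognition networks necessarily share all parameters.

```python
import torch
import torch.nn as nn


class DummyRecognitionNet(nn.Module):
    """Hypothetical stand-in: outputs a text probability and a 512-dim feature vector."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text_head = nn.Linear(8, 1)
        self.feature_head = nn.Linear(8, 512)

    def forward(self, x):
        h = self.backbone(x)
        return torch.sigmoid(self.text_head(h)), self.feature_head(h)


class TwinTextNet(nn.Module):
    def __init__(self, recognition_net: nn.Module):
        super().__init__()
        self.recognition = recognition_net     # one instance shared by both inputs
        self.similarity_fc = nn.Linear(512, 1)

    def forward(self, frame1, frame2):
        p1, v1 = self.recognition(frame1)       # text probability, feature vector
        p2, v2 = self.recognition(frame2)
        sim = torch.sigmoid(self.similarity_fc((v1 - v2).abs()))
        return p1, p2, sim


net = TwinTextNet(DummyRecognitionNet())
p1, p2, sim = net(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```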
204. Based on the first feature vector and the second feature vector, obtaining a similarity score through a full connection layer included in the text recognition network to be trained;
in this embodiment, the model training apparatus inputs the first feature vector and the second feature vector to an FC layer included in the text recognition network to be trained, and outputs a similarity score between the first video frame sample and the second video frame sample through the FC layer. It can be understood that the overall structure of the to-be-trained text recognition network is similar to the text recognition network shown in fig. 5, and therefore, the details are not repeated here.
205. And training the text recognition network to be trained according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value and the similarity score, and outputting the text recognition network when a model training condition is met, wherein the text recognition network is the text recognition network related to the embodiment.
In this embodiment, the first text label value is used as a text probability true value of the first video frame sample, the first text probability value is used as a text probability predicted value of the first video frame sample, the second text label value is used as a text probability true value of the second video frame sample, the second text probability value is used as a text probability predicted value of the second video frame sample, the similarity label value is used as a true value, and the similarity score is used as a predicted value.
Specifically, the text recognition network to be trained is trained based on the loss value between the true value and the predicted value. If the preset iteration number (for example, ten thousand times) is reached or the loss value has reached convergence, the model training condition is satisfied, so that the model parameters obtained by updating the last time are used as the model parameters of the text recognition network, that is, the training of the text recognition network to be trained is completed, and the text recognition network is obtained.
In the embodiment of the application, a method for text recognition is provided, and through the above manner, a twin network is used to judge whether a video frame contains a text, so that only the video frame with the text can be subjected to text detection, and meanwhile, the twin network is used to calculate the text similarity between the video frames, so that the video frame with higher similarity can be judged, and therefore, any one frame is extracted from the video frames with higher similarity to perform text recognition, so that the text detection efficiency for the video is improved.
Optionally, on the basis of each embodiment corresponding to fig. 4, in another optional embodiment provided by the embodiment of the present application, the training the to-be-trained text recognition network according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value, and the similarity score includes:
determining a first loss value by adopting a first loss function according to the first text label value and the first text probability value;
determining a second loss value by adopting a second loss function according to the second text label value and the second text probability value;
determining a third loss value by adopting a third loss function according to the similarity marking value and the similarity value;
and updating the model parameters of the text recognition network to be trained according to the first loss value, the second loss value and the third loss value.
In this embodiment, a method for training the text recognition network using a loss function is introduced. As described in the foregoing embodiments, since the text recognition network belongs to the structure of the twin network, the first recognition network and the second recognition network included in the text recognition network have similar structures, and each recognition network includes two branches: one branch is a text branch and the other branch is an image similarity branch. Therefore, both the text branch and the image similarity branch need to be trained.
Specifically, the output of the text recognition network to be trained includes two tasks, namely a task based on text branching (i.e. determining whether the video frames contain text) and a task based on image similarity branching (i.e. determining whether the text in the first video frame is sufficiently similar to the text in the second video frame). In order to accurately identify whether the video frame contains text or not, a cross entropy function can be introduced to train the tasks of text branches, and joint training of three different tasks is realized through a multi-task loss function. The multitask penalty function is calculated as follows:
$$\mathcal{L}(I_1, I_2) = \lambda_1 \mathcal{L}_1(y_1, p_1) + \lambda_2 \mathcal{L}_2(y_2, p_2) + \mathcal{L}_3(s, \hat{s})$$

wherein $I_1$ represents the first video frame sample, $I_2$ represents the second video frame sample, $\lambda_1$ represents the first weight value, $\lambda_2$ represents the second weight value, $y_1$ represents the first text label value, $p_1$ represents the first text probability value, $\mathcal{L}_1$ represents the first loss value, $y_2$ represents the second text label value, $p_2$ represents the second text probability value, $\mathcal{L}_2$ represents the second loss value, $s$ represents the similarity label value, $\hat{s}$ represents the similarity score, and $\mathcal{L}_3$ represents the third loss value.
Finally, a total loss value is determined according to the first loss value, the second loss value and the third loss value, and based on the total loss value the model parameters of the text recognition network to be trained are updated through gradient back propagation.
Secondly, in the embodiment of the application, a method for training the text recognition network by using a loss function is provided. Through this method, the multiple tasks in the text recognition network can be trained with a multi-task loss function, so that the shared-feature part and the task-specific part are considered at the same time: the generalized representation shared among the tasks needs to be learned to avoid overfitting, and the unique features of each task also need to be learned to avoid under-fitting. In the process of assigning a weight to the loss of each task, it is necessary and important to learn the weights automatically or to design a network that is robust to all the weights.
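A hedged sketch of such a multi-task loss, assuming PyTorch; using binary cross-entropy for the similarity term and leaving the third loss unweighted are assumptions consistent with, but not stated in, the description above.

```python
import torch
import torch.nn.functional as F


def multitask_loss(p1, y1, p2, y2, sim_score, sim_label,
                   lambda1: float = 1.0, lambda2: float = 1.0):
    """Weighted sum of the two text-branch losses and the similarity-branch loss."""
    loss1 = F.binary_cross_entropy(p1, y1)                  # first loss value (text branch 1)
    loss2 = F.binary_cross_entropy(p2, y2)                  # second loss value (text branch 2)
    loss3 = F.binary_cross_entropy(sim_score, sim_label)    # third loss value (similarity branch)
    return lambda1 * loss1 + lambda2 * loss2 + loss3


total = multitask_loss(torch.tensor([0.9]), torch.tensor([1.0]),
                       torch.tensor([0.8]), torch.tensor([1.0]),
                       torch.tensor([0.2]), torch.tensor([0.0]))
print(float(total))   # total loss used for gradient back propagation
```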
Referring to fig. 13, fig. 13 is a schematic view of an embodiment of a text recognition apparatus in an embodiment of the present application, and the text recognition apparatus 30 includes:
an obtaining module 301, configured to obtain a first video frame and a second video frame from a video to be identified, where the video to be identified includes at least two video frames, and the first video frame and the second video frame are two adjacent video frames;
the obtaining module 301 is further configured to obtain, based on the first video frame, a first text probability value and a first feature vector through a first identification network included in the text identification network, where the first text probability value represents a probability that a text appears in the first video frame;
the obtaining module 301 is further configured to obtain a second text probability value and a second feature vector through a second recognition network included in the text recognition network based on the second video frame, where the second text probability value represents a probability that a text appears in the second video frame, and the second recognition network shares a weight with the first recognition network;
the obtaining module 301 is further configured to obtain a similarity score through a full connection layer included in the text recognition network based on the first feature vector and the second feature vector;
a determining module 302, configured to determine a target video frame according to the first video frame and the second video frame if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold;
and the identification module 303 is configured to perform text identification on the target video frame.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application,
an obtaining module 301, configured to obtain a first feature map through a convolutional network included in a first identification network based on a first video frame, where the first identification network belongs to a text identification network;
acquiring a first text probability value through an attention network included in a first recognition network based on the first feature map;
acquiring a first feature vector through an image feature extraction network included in a first identification network based on the first feature map;
an obtaining module 301, configured to obtain a second feature map through a convolutional network included in a second identification network based on a second video frame, where the second identification network belongs to a text identification network;
acquiring a second text probability value through an attention network included in a second recognition network based on the second feature map;
and acquiring a second feature vector through an image feature extraction network included in the second recognition network based on the second feature map.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application,
an obtaining module 301, configured to generate a first set of to-be-processed feature vectors according to a first feature map, where the first set of to-be-processed feature vectors includes M first to-be-processed feature vectors, each first to-be-processed feature vector includes N elements, and both N and M are integers greater than 1;
generating a second feature vector set to be processed according to the first feature vector set to be processed, wherein the second feature vector set to be processed comprises N second feature vectors to be processed, and each second feature vector to be processed comprises M elements;
acquiring a first attention feature vector through an attention network included in the first identification network based on the second feature vector set to be processed;
acquiring a first text probability value through a full connection layer included in a first identification network based on the first attention feature vector;
an obtaining module 301, configured to generate a third set of feature vectors to be processed according to the second feature map, where the third set of feature vectors to be processed includes M third feature vectors to be processed, and each third feature vector to be processed includes N elements;
generating a fourth feature vector set to be processed according to the third feature vector set to be processed, wherein the fourth feature vector set to be processed comprises N fourth feature vectors to be processed, and each fourth feature vector to be processed comprises M elements;
acquiring a second attention feature vector through an attention network included in a second identification network based on the fourth feature vector set to be processed;
and acquiring a second text probability value through a full connection layer included by the second identification network based on the second attention feature vector.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application,
the obtaining module 301 is specifically configured to obtain K first to-be-spliced feature vectors through an image feature extraction network included in a first identification network based on a first feature map, where the K first to-be-spliced feature vectors include a first to-be-spliced feature vector obtained by averaging the pooling layers, and K is an integer greater than 1;
acquiring a first feature vector through an image feature extraction network included in a first identification network according to the K first feature vectors to be spliced;
based on the second feature map, acquiring a second feature vector through an image feature extraction network included in the second recognition network, wherein the second feature vector comprises:
acquiring K second feature vectors to be spliced through an image feature extraction network included by a second identification network based on a second feature map, wherein the K second feature vectors to be spliced include second feature vectors to be spliced obtained through an average pooling layer;
and acquiring a second feature vector through an image feature extraction network included by the second identification network according to the K second feature vectors to be spliced.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application,
an obtaining module 301, configured to subtract elements at the same position in the first feature vector and the second feature vector to obtain an intermediate feature vector;
carrying out absolute value taking processing on the intermediate characteristic vector to obtain a target characteristic vector;
and acquiring a similarity score through the full connection layer based on the target feature vector.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application,
the obtaining module 301 is further configured to obtain a first frame identifier corresponding to the first video frame and a second frame identifier corresponding to the second video frame if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold and the similarity score is less than or equal to the similarity threshold;
the determining module 302 is further configured to determine, according to the first frame identifier, the second frame identifier, and the frame rate of the video to be recognized, an occurrence time of the first video frame in the video to be recognized, and an occurrence time of the second video frame in the video to be recognized.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application, the text recognition device 30 further includes a processing module 304;
the processing module 304 is configured to, if the first text probability value is smaller than the text probability threshold and the second text probability value is greater than or equal to the text probability threshold, reject the first video frame;
the processing module 304 is further configured to reject the second video frame if the first text probability value is greater than or equal to the text probability threshold and the second text probability value is smaller than the text probability threshold;
the determining module 302 is further configured to determine that the first video frame and the second video frame belong to the same text video frame interval if the first text probability value and the second text probability value are both greater than or equal to the text probability threshold and the similarity score is less than or equal to the similarity threshold.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application,
the determining module 302 is specifically configured to determine that the first video frame and the second video frame belong to the same text video frame interval, where the text video frame interval includes at least two video frames;
and selecting any one video frame from the text video frame interval as a target video frame.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the text recognition device 30 provided in the embodiment of the present application, the text recognition device 30 further includes a display module 305;
the obtaining module 301 is further configured to obtain a time interval corresponding to the text video frame interval and a text recognition result corresponding to the target video frame after the recognition module 303 performs text recognition on the target video frame;
a display module 305, configured to display a time interval corresponding to a text video frame interval and a text recognition result, where the time interval represents a time from a first video frame to a last video frame in the text video frame interval;
alternatively, the first and second electrodes may be,
the obtaining module 301 is further configured to obtain a time interval corresponding to the text video frame interval and a text recognition result corresponding to the target video frame after the recognition module 303 performs text recognition on the target video frame;
the display module 305 is further configured to send the text recognition result and the time interval corresponding to the text video frame interval to the terminal device, so that the terminal device displays the time interval corresponding to the text video frame interval and the text recognition result.
Referring to fig. 14, fig. 14 is a schematic view of an embodiment of the model training apparatus in the embodiment of the present application, and the model training apparatus 40 includes:
an obtaining module 401, configured to obtain a to-be-trained sample pair, where the to-be-trained sample pair includes a first video frame sample and a second video frame sample, the first video frame sample corresponds to a first text label value, the second video frame sample corresponds to a second text label value, and the to-be-trained sample pair corresponds to a similarity label value;
the obtaining module 401 is further configured to obtain, based on a first video frame sample, a first text probability value and a first feature vector through a first recognition network included in a to-be-trained text recognition network, where the first text probability value represents a probability that a text appears in a first video frame;
the obtaining module 401 is further configured to obtain a second text probability value and a second feature vector through a second recognition network included in the to-be-trained text recognition network based on a second video frame sample, where the second text probability value represents a probability that a text appears in a second video frame, and a weight is shared between the second recognition network and the first recognition network;
the obtaining module 401 is further configured to obtain a similarity score through a full connection layer included in the text recognition network to be trained based on the first feature vector and the second feature vector;
the training module 402 is configured to train the text recognition network to be trained according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value, and the similarity score, and output the text recognition network when a model training condition is satisfied, where the text recognition network is a text recognition network provided in the foregoing aspect.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the model training apparatus 40 provided in the embodiment of the present application,
a training module 402, configured to determine a first loss value by using a first loss function according to the first text label value and the first text probability value;
determining a second loss value by adopting a second loss function according to the second text label value and the second text probability value;
determining a third loss value by adopting a third loss function according to the similarity marking value and the similarity value;
and updating the model parameters of the text recognition network to be trained according to the first loss value, the second loss value and the third loss value.
The embodiment of the application also provides another text recognition device and a model training device, either of which can be deployed in a terminal device. As shown in fig. 15, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiments of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS) terminal, a vehicle-mounted computer, and the like. Taking a mobile phone as an example of the terminal device:
fig. 15 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 15, the mobile phone includes: a Radio Frequency (RF) circuit 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (WiFi) module 570, a processor 580, and a power supply 590. Those skilled in the art will appreciate that the mobile phone structure shown in fig. 15 is not intended to be limiting; the mobile phone may include more or fewer components than those shown, some components may be combined, or the components may be arranged differently.
The following describes each component of the mobile phone in detail with reference to fig. 15:
The RF circuit 510 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, after downlink information of a base station is received, the downlink information is sent to the processor 580 for processing; in addition, uplink data is transmitted to the base station. In general, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 520 may be used to store software programs and modules, and the processor 580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect touch operations of a user on or near the touch panel 531 (for example, operations performed by the user on or near the touch panel 531 with any suitable object or accessory such as a finger or a stylus) and drive a corresponding connection device according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects a signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, sends the coordinates to the processor 580, and can receive and execute commands sent by the processor 580. In addition, the touch panel 531 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave types. In addition to the touch panel 531, the input unit 530 may include other input devices 532. Specifically, the other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 540 may include a display panel 541. Optionally, the display panel 541 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 531 may cover the display panel 541; when the touch panel 531 detects a touch operation on or near the touch panel 531, the touch operation is transmitted to the processor 580 to determine the type of the touch event, and then the processor 580 provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although the touch panel 531 and the display panel 541 are shown as two separate components in fig. 15 to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications that recognize the posture of the mobile phone (such as horizontal and vertical screen switching, related games, and magnetometer posture calibration), vibration-recognition related functions (such as a pedometer and tapping), and the like. The mobile phone may further be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.
The audio circuit 560, a speaker 561, and a microphone 562 may provide an audio interface between the user and the mobile phone. The audio circuit 560 may transmit an electrical signal converted from received audio data to the speaker 561, and the speaker 561 converts the electrical signal into a sound signal for output; on the other hand, the microphone 562 converts a collected sound signal into an electrical signal, which is received by the audio circuit 560 and converted into audio data; after being processed by the processor 580, the audio data is sent through the RF circuit 510 to, for example, another mobile phone, or is output to the memory 520 for further processing.
WiFi is a short-distance wireless transmission technology. Through the WiFi module 570, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, and the WiFi module 570 provides wireless broadband Internet access for the user. Although fig. 15 shows the WiFi module 570, it can be understood that the WiFi module 570 is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 580 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 520 and calling data stored in the memory 520, thereby performing overall monitoring of the mobile phone. Alternatively, processor 580 may include one or more processing units; optionally, processor 580 may integrate an application processor, which handles primarily the operating system, user interface, applications, etc., and a modem processor, which handles primarily the wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 580.
The handset also includes a power supply 590 (e.g., a battery) for powering the various components, which may optionally be logically connected to the processor 580 via a power management system, such that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
The steps performed by the terminal device in the above-described embodiment may be based on the terminal device configuration shown in fig. 15.
The embodiment of the application further provides another text recognition device and a model training device, either of which can be deployed in a server. Fig. 16 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) for storing applications 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 600, the series of instruction operations in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 16.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method for recognizing text based on video, comprising:
acquiring a first video frame and a second video frame from a video to be identified, wherein the video to be identified comprises at least two video frames, and the first video frame and the second video frame are two adjacent video frames;
based on the first video frame, acquiring a first text probability value and a first feature vector through a first recognition network included in a text recognition network, wherein the first text probability value represents the probability of text appearing in the first video frame;
based on the second video frame, obtaining a second text probability value and a second feature vector through a second recognition network included in the text recognition network, wherein the second text probability value represents the probability of text appearing in the second video frame, and the second recognition network shares weights with the first recognition network;
based on the first feature vector and the second feature vector, obtaining a similarity score through a full connection layer included in the text recognition network;
if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold value, and the similarity score is less than or equal to a similarity threshold value, determining a target video frame according to the first video frame and the second video frame;
and performing text recognition on the target video frame.
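For readers of the description, the following non-normative sketch illustrates the flow recited in claim 1 on a sequence of frames; the thresholds, the `ocr` callable, and the choice of the first frame of each qualifying pair as the target video frame are assumptions introduced for illustration only.

```python
def detect_text_in_video(frames, model, ocr, text_threshold=0.5, similarity_threshold=0.5):
    """Sketch of the claimed flow over adjacent frames.

    `frames` is a list of (3, H, W) tensors, `model` is assumed to be a twin network
    returning (prob_a, prob_b, similarity), and `ocr` is any text recognizer (placeholder).
    """
    results = []
    for frame_a, frame_b in zip(frames, frames[1:]):  # adjacent video frame pairs
        prob_a, prob_b, similarity = model(frame_a.unsqueeze(0), frame_b.unsqueeze(0))
        if prob_a >= text_threshold and prob_b >= text_threshold and similarity <= similarity_threshold:
            # Both frames are judged to contain text and to be similar enough,
            # so a single target frame (here the first of the pair) is recognized once.
            results.append(ocr(frame_a))
    return results
```

In practice, consecutive qualifying pairs would be merged into one text video frame interval so that each interval is recognized only once; the pair-by-pair version above is kept short for clarity.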
2. The method of claim 1, wherein the obtaining a first text probability value and a first feature vector through a first recognition network included in a text recognition network based on the first video frame comprises:
based on the first video frame, acquiring a first feature map through a convolutional network included in the first recognition network, wherein the first recognition network belongs to the text recognition network;
acquiring the first text probability value through an attention network included in the first recognition network based on the first feature map;
acquiring the first feature vector through an image feature extraction network included in the first recognition network based on the first feature map;
the obtaining, based on the second video frame, a second text probability value and a second feature vector through a second recognition network included in the text recognition network includes:
based on the second video frame, acquiring a second feature map through a convolutional network included in the second recognition network, wherein the second recognition network belongs to the text recognition network;
acquiring a second text probability value through an attention network included in the second recognition network based on the second feature map;
and acquiring the second feature vector through an image feature extraction network included in the second recognition network based on the second feature map.
3. The text recognition method of claim 2, wherein the obtaining the first text probability value through an attention network included in the first recognition network based on the first feature map comprises:
generating a first set of to-be-processed feature vectors according to the first feature map, wherein the first set of to-be-processed feature vectors includes M first to-be-processed feature vectors, each first to-be-processed feature vector includes N elements, and both N and M are integers greater than 1;
generating a second feature vector set to be processed according to the first feature vector set to be processed, wherein the second feature vector set to be processed comprises N second feature vectors to be processed, and each second feature vector to be processed comprises M elements;
acquiring a first attention feature vector through an attention network included in the first recognition network based on the second set of feature vectors to be processed;
acquiring the first text probability value through a full connection layer included in the first recognition network based on the first attention feature vector;
the obtaining, based on the second feature map, the second text probability value through an attention network included in the second recognition network includes:
generating a third feature vector set to be processed according to the second feature map, wherein the third feature vector set to be processed comprises M third feature vectors to be processed, and each third feature vector to be processed comprises N elements;
generating a fourth feature vector set to be processed according to the third feature vector set to be processed, wherein the fourth feature vector set to be processed comprises N fourth feature vectors to be processed, and each fourth feature vector to be processed comprises M elements;
acquiring a second attention feature vector through an attention network included in the second recognition network based on the fourth set of feature vectors to be processed;
and acquiring the second text probability value through a full connection layer included in the second recognition network based on the second attention feature vector.
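Purely as an illustration of the attention-based text-probability computation recited in claim 3, a minimal sketch follows; the shapes (M, N), the form of the attention network, and the use of softmax weighting are assumptions, since the claim does not fix them.

```python
import torch
import torch.nn as nn


class AttentionTextHead(nn.Module):
    """Pools M x N to-be-processed feature vectors into one attention feature vector and
    maps it to a text probability; M, N and the softmax attention are assumptions."""

    def __init__(self, m: int):
        super().__init__()
        self.score = nn.Linear(m, 1)       # attention network: one score per M-element vector
        self.classifier = nn.Linear(m, 1)  # full connection layer producing the text probability

    def forward(self, first_set: torch.Tensor) -> torch.Tensor:
        # first_set: (M, N) -- the first set of to-be-processed feature vectors
        second_set = first_set.t()                               # (N, M): the second set of vectors
        weights = torch.softmax(self.score(second_set), dim=0)   # one attention weight per vector
        attention_vector = (weights * second_set).sum(dim=0)     # attention feature vector, shape (M,)
        return torch.sigmoid(self.classifier(attention_vector)).squeeze(-1)
```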
4. The text recognition method according to claim 2, wherein the obtaining the first feature vector through an image feature extraction network included in the first recognition network based on the first feature map includes:
acquiring K first to-be-spliced feature vectors through an image feature extraction network included in the first recognition network based on the first feature map, wherein the K first to-be-spliced feature vectors include a first to-be-spliced feature vector obtained through an average pooling layer, and K is an integer greater than 1;
according to the K first to-be-spliced feature vectors, acquiring the first feature vector through an image feature extraction network included in the first recognition network;
the obtaining of the second feature vector through the image feature extraction network included in the second recognition network based on the second feature map includes:
acquiring K second to-be-spliced feature vectors through an image feature extraction network included in the second recognition network based on the second feature map, wherein the K second to-be-spliced feature vectors include a second to-be-spliced feature vector obtained through an average pooling layer;
and acquiring the second feature vector through an image feature extraction network included in the second recognition network according to the K second to-be-spliced feature vectors.
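A minimal sketch of the feature-vector splicing recited in claim 4 is given below; only the average-pooled vector is stated in the claim, so the max-pooled vector and K = 2 are assumptions added for illustration.

```python
import torch
import torch.nn.functional as F


def extract_feature_vector(feature_map: torch.Tensor) -> torch.Tensor:
    """feature_map: (C, H, W). Builds K = 2 to-be-spliced vectors -- one from an average
    pooling layer as recited, plus an assumed max-pooled one -- and splices them."""
    x = feature_map.unsqueeze(0)                     # (1, C, H, W)
    avg_vec = F.adaptive_avg_pool2d(x, 1).flatten()  # vector from the average pooling layer
    max_vec = F.adaptive_max_pool2d(x, 1).flatten()  # assumed additional to-be-spliced vector
    return torch.cat([avg_vec, max_vec], dim=0)      # spliced (concatenated) feature vector
```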
5. The method of claim 1, wherein obtaining a similarity score through a full connectivity layer included in the text recognition network based on the first feature vector and the second feature vector comprises:
subtracting elements at the same position in the first feature vector and the second feature vector to obtain an intermediate feature vector;
taking the absolute value of each element of the intermediate feature vector to obtain a target feature vector;
and acquiring the similarity score through the full-connection layer based on the target feature vector.
6. The method of claim 1, wherein if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold, the method further comprises:
acquiring a first frame identifier corresponding to the first video frame and a second frame identifier corresponding to the second video frame;
and determining the appearance time of the first video frame in the video to be identified and the appearance time of the second video frame in the video to be identified according to the first frame identifier, the second frame identifier and the frame rate of the video to be identified.
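As a simple worked example of the computation recited in claim 6 (values assumed for illustration): with a frame rate of 25 frames per second, a frame identifier of 150 corresponds to an appearance time of 150 / 25 = 6 seconds in the video to be identified.

```python
def appearance_time(frame_id: int, frame_rate: float) -> float:
    """Appearance time in seconds, assuming frame identifiers count from zero."""
    return frame_id / frame_rate


print(appearance_time(150, 25))  # 6.0 seconds into the video to be identified
```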
7. The text recognition method of claim 1, further comprising:
if the first text probability value is smaller than the text probability threshold value and the second text probability value is larger than or equal to the text probability threshold value, rejecting the first video frame;
if the first text probability value is larger than or equal to the text probability threshold value and the second text probability value is smaller than the text probability threshold value, rejecting the second video frame;
and if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold value, and the similarity score is less than or equal to the similarity threshold value, determining that the first video frame and the second video frame belong to the same text video frame interval.
8. The text recognition method of claim 1, wherein determining a target video frame from the first video frame and the second video frame comprises:
determining that the first video frame and the second video frame belong to the same text video frame interval, wherein the text video frame interval comprises at least two video frames;
and selecting any one video frame from the text video frame interval as the target video frame.
9. The text recognition method of claim 8, wherein after the text recognition of the target video frame, the method further comprises:
acquiring a time interval corresponding to the text video frame interval and a text recognition result corresponding to the target video frame;
displaying a time interval corresponding to the text video frame interval and the text recognition result, wherein the time interval represents the time from the first video frame to the last video frame in the text video frame interval;
or,
after the text recognition is performed on the target video frame, the method further comprises:
acquiring a time interval corresponding to the text video frame interval and a text recognition result corresponding to the target video frame;
and sending the text recognition result and the time interval corresponding to the text video frame interval to terminal equipment so that the terminal equipment displays the time interval corresponding to the text video frame interval and the text recognition result.
10. A method of model training, comprising:
obtaining a sample pair to be trained, wherein the sample pair to be trained comprises a first video frame sample and a second video frame sample, the first video frame sample corresponds to a first text label value, the second video frame sample corresponds to a second text label value, and the sample pair to be trained corresponds to a similarity label value;
based on the first video frame sample, acquiring a first text probability value and a first feature vector through a first recognition network included in a text recognition network to be trained, wherein the first text probability value represents the probability of text appearing in the first video frame;
based on the second video frame sample, acquiring a second text probability value and a second feature vector through a second recognition network included in the to-be-trained text recognition network, wherein the second text probability value represents the probability of text appearing in the second video frame, and the second recognition network shares weight with the first recognition network;
based on the first feature vector and the second feature vector, obtaining a similarity score through a full connection layer included in the text recognition network to be trained;
training the text recognition network to be trained according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value and the similarity score, and outputting the text recognition network when a model training condition is met, wherein the text recognition network is the text recognition network in any one of claims 1 to 9.
11. The method of claim 10, wherein the training the to-be-trained text recognition network according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value, and the similarity score comprises:
determining a first loss value by adopting a first loss function according to the first text label value and the first text probability value;
determining a second loss value by adopting a second loss function according to the second text label value and the second text probability value;
determining a third loss value by adopting a third loss function according to the similarity label value and the similarity score;
and updating the model parameters of the text recognition network to be trained according to the first loss value, the second loss value and the third loss value.
12. A text recognition apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first video frame and a second video frame from a video to be identified, the video to be identified comprises at least two video frames, and the first video frame and the second video frame are two adjacent video frames;
the obtaining module is further configured to obtain, based on the first video frame, a first text probability value and a first feature vector through a first recognition network included in a text recognition network, where the first text probability value represents a probability that a text appears in the first video frame;
the obtaining module is further configured to obtain, based on the second video frame, a second text probability value and a second feature vector through a second recognition network included in the text recognition network, where the second text probability value represents a probability that text appears in the second video frame, and the second recognition network shares a weight with the first recognition network;
the obtaining module is further configured to obtain a similarity score through a full connection layer included in the text recognition network based on the first feature vector and the second feature vector;
a determining module, configured to determine a target video frame according to the first video frame and the second video frame if the first text probability value and the second text probability value are both greater than or equal to a text probability threshold and the similarity score is less than or equal to a similarity threshold;
and the identification module is used for performing text identification on the target video frame.
13. A model training apparatus, comprising:
the system comprises an acquisition module, a comparison module and a comparison module, wherein the acquisition module is used for acquiring a sample pair to be trained, the sample pair to be trained comprises a first video frame sample and a second video frame sample, the first video frame sample corresponds to a first text label value, the second video frame sample corresponds to a second text label value, and the sample pair to be trained corresponds to a similarity label value;
the obtaining module is further configured to obtain, based on the first video frame sample, a first text probability value and a first feature vector through a first recognition network included in a to-be-trained text recognition network, where the first text probability value represents a probability that a text appears in the first video frame;
the obtaining module is further configured to obtain a second text probability value and a second feature vector through a second recognition network included in the to-be-trained text recognition network based on the second video frame sample, where the second text probability value represents a probability that a text appears in the second video frame, and the second recognition network shares a weight with the first recognition network;
the obtaining module is further configured to obtain a similarity score through a full connection layer included in the to-be-trained text recognition network based on the first feature vector and the second feature vector;
a training module, configured to train the to-be-trained text recognition network according to the first text label value, the first text probability value, the second text label value, the second text probability value, the similarity label value, and the similarity score, and output a text recognition network when a model training condition is met, where the text recognition network is the text recognition network according to any one of claims 1 to 9.
14. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute a program in the memory, including performing a text recognition method as claimed in any one of claims 1 to 9, or performing a method as claimed in any one of claims 10 to 11;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the text recognition method of any one of claims 1 to 9 or perform the method of any one of claims 10 to 11.
CN202011305590.0A 2020-11-19 2020-11-19 Video-based text recognition method, model training method and model training device Active CN112101329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011305590.0A CN112101329B (en) 2020-11-19 2020-11-19 Video-based text recognition method, model training method and model training device

Publications (2)

Publication Number Publication Date
CN112101329A 2020-12-18
CN112101329B (en) 2021-03-30

Family

ID=73785331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011305590.0A Active CN112101329B (en) 2020-11-19 2020-11-19 Video-based text recognition method, model training method and model training device

Country Status (1)

Country Link
CN (1) CN112101329B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695381A (en) * 2019-03-13 2020-09-22 杭州海康威视数字技术股份有限公司 Text feature extraction method and device, electronic equipment and readable storage medium
CN110147745A (en) * 2019-05-09 2019-08-20 深圳市腾讯计算机系统有限公司 A kind of key frame of video detection method and device
CN110598622A (en) * 2019-09-06 2019-12-20 广州华多网络科技有限公司 Video subtitle positioning method, electronic device, and computer storage medium
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN111368656A (en) * 2020-02-21 2020-07-03 华为技术有限公司 Video content description method and video content description device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4047944A4 (en) * 2020-12-22 2023-06-14 Beijing Dajia Internet Information Technology Co., Ltd. Video processing method and electronic device
CN112905840A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video processing method, device, storage medium and equipment
CN112966596A (en) * 2021-03-04 2021-06-15 北京秒针人工智能科技有限公司 Video optical character recognition system method and system
CN113011320A (en) * 2021-03-17 2021-06-22 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
CN113225624A (en) * 2021-04-08 2021-08-06 腾讯科技(深圳)有限公司 Time-consuming determination method and device for voice recognition
CN113255646A (en) * 2021-06-02 2021-08-13 北京理工大学 Real-time scene text detection method
CN113255646B (en) * 2021-06-02 2022-10-18 北京理工大学 Real-time scene text detection method
CN114926828A (en) * 2022-05-17 2022-08-19 北京百度网讯科技有限公司 Scene text recognition method and device, electronic equipment and storage medium
CN114926828B (en) * 2022-05-17 2023-02-24 北京百度网讯科技有限公司 Scene text recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112101329B (en) 2021-03-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40036260; Country of ref document: HK)