CN110728167A - Text detection method and device and computer readable storage medium - Google Patents

Text detection method and device and computer readable storage medium

Info

Publication number
CN110728167A
Authority
CN
China
Prior art keywords
text, video frame, current video, text detection, detection results
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810775852.6A
Other languages
Chinese (zh)
Inventor
徐博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810775852.6A priority Critical patent/CN110728167A/en
Publication of CN110728167A publication Critical patent/CN110728167A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text detection method, a text detection device, and a storage medium, belonging to the technical field of machine vision. The method comprises the following steps: extracting feature information of a current video frame, and acquiring a text region from the current video frame based on the extracted feature information, wherein the text region is an image region containing text in the current video frame; taking the text region as the input of a neural network, detecting characters contained in the text region by using the neural network, and outputting a text detection result, wherein the neural network comprises a plurality of neural network models; and determining the text contained in the current video frame based on the text detection result of the current video frame and the text detection results of a plurality of video frames other than the current video frame. In the method and device, the text region is not segmented; instead, it is input directly into the neural network for detection, and the detection result is output. Because the text region does not need to be segmented, the problem of low recognition accuracy caused by inaccurate segmentation is avoided.

Description

Text detection method and device and computer readable storage medium
Technical Field
The present invention relates to the field of machine vision technologies, and in particular, to a text detection method, a text detection device, and a computer-readable storage medium.
Background
Currently, videos often contain text. To better analyze and review the content of a video, the text in the video may be detected.
In the related art, a text region may be detected from a video frame. Then, the position information of each character in the text region can be determined, and the text region is segmented according to the position information of each character, so that a plurality of character regions are obtained. The characters in each of the plurality of character areas are recognized, thereby outputting a plurality of characters.
However, with the above method, the text region must be segmented according to the position information of each character. When that position information is inaccurate, for example because of poor imaging quality of the video frame, the segmented character regions are also inaccurate, and in this case the accuracy of recognizing the characters in each character region is correspondingly low.
Disclosure of Invention
The embodiments of the present invention provide a text detection method, a text detection device, and a computer-readable storage medium, which can solve the problem of low recognition accuracy caused by inaccurate character-region segmentation in the related art. The technical solution is as follows:
in a first aspect, a text detection method is provided, and the method includes:
extracting feature information of a current video frame, and acquiring a text region from the current video frame based on the extracted feature information, wherein the text region is an image region containing a text in the current video frame;
taking the text area as the input of a neural network, detecting characters contained in the text area by using the neural network, and outputting a text detection result, wherein the neural network comprises a plurality of neural network models;
determining text contained in the current video frame based on the text detection result of the current video frame and the text detection results of a plurality of video frames related to the current video frame.
Optionally, the detecting the characters included in the text region by using the neural network and outputting a text detection result includes:
extracting a plurality of feature maps from the text region by using a Convolutional Neural Network (CNN) model, and generating a plurality of feature vectors based on the plurality of feature maps, wherein each feature map in the plurality of feature maps is used for characterizing one pixel feature of the text region, and each feature vector in the plurality of feature vectors is used for characterizing the pixel feature in one sub-region of the text region;
determining a plurality of character detection results based on the plurality of feature vectors, each character detection result of the plurality of character detection results comprising at least one character and a probability value corresponding to each character of the at least one character;
outputting a text detection result based on the plurality of character detection results.
Optionally, the determining a plurality of character detection results based on the plurality of feature vectors includes:
sequencing the plurality of feature vectors according to the position information of the sub-region corresponding to each feature vector in the text region and the writing sequence of the text;
taking the plurality of feature vectors as input of a Recurrent Neural Network (RNN) model, and sequentially generating a plurality of intermediate vectors by using the RNN model based on the sequence of the plurality of feature vectors, wherein each intermediate vector in the plurality of intermediate vectors is used for representing character features of at least one character;
based on each of the plurality of intermediate vectors, at least one character corresponding to the each intermediate vector and a probability value corresponding to each of the at least one character are determined.
Optionally, the determining text included in the current video frame based on the text detection result of the current video frame and the text detection results of a plurality of video frames related to the current video frame includes:
acquiring a text detection result of each video frame in a plurality of continuous video frames before the current video frame, wherein the last video frame in the plurality of continuous video frames is adjacent to the current video frame;
determining an edit distance between the two text detection results corresponding to each pair of adjacent video frames among the current video frame and the plurality of consecutive video frames, wherein the edit distance refers to the minimum number of edit operations, among the numbers of edit operations required by different editing modes, needed to convert one of the two text detection results into the other;
determining the similarity probability between two text detection results corresponding to each two adjacent video frames based on the editing distance between the two text detection results corresponding to each two adjacent video frames and the number of characters contained in each text detection result in the two text detection results;
determining text contained in the current video frame based on the determined plurality of similarity probabilities.
Optionally, the determining text contained in the current video frame based on the determined plurality of similarity probabilities includes:
sequencing the plurality of similarity probabilities according to the sequence of the current video frame and the plurality of continuous video frames;
if a plurality of consecutive similarity probabilities, starting from the similarity probability corresponding to the current video frame, are all greater than a probability threshold, determining a plurality of similar text detection results based on the text detection results corresponding to the video frames corresponding to the plurality of consecutive similarity probabilities;
determining text contained in the current video frame based on the plurality of similar text detection results.
Optionally, the method further comprises:
and if the similarity probability corresponding to the current video frame is smaller than the probability threshold, taking the text detection result of the current video frame as the text contained in the current video frame.
Optionally, after obtaining the text region from the current video frame based on the extracted feature information, the method further includes:
acquiring the position of the text area in the current video frame;
accordingly, before determining the edit distance between the two text detection results corresponding to each pair of adjacent video frames among the current video frame and the plurality of consecutive video frames, the method further includes:
determining a deviation between the positions of the two text regions in each pair of adjacent video frames among the current video frame and the plurality of consecutive video frames;
if the determined deviations are all smaller than or equal to a deviation threshold value, executing the step of determining the edit distance between two text detection results corresponding to every two adjacent video frames in the current video frame and the plurality of continuous video frames;
if any deviation of the deviations is larger than the deviation threshold, determining a similarity probability between two text detection results corresponding to two video frames corresponding to any deviation larger than the deviation threshold as a first probability, wherein the first probability is smaller than the probability threshold.
In a second aspect, there is provided a text detection apparatus, the apparatus comprising:
the acquisition module is used for extracting the characteristic information of the current video frame and acquiring a text region from the current video frame based on the extracted characteristic information, wherein the text region is an image region containing a text in the current video frame;
the detection module is used for taking the text area as the input of a neural network, detecting characters contained in the text area by using the neural network and outputting a text detection result, wherein the neural network comprises a plurality of neural network models;
a determining module, configured to determine text included in the current video frame based on a text detection result of the current video frame and text detection results of a plurality of video frames related to the current video frame.
Optionally, the detection module includes:
a generation sub-module, configured to extract a plurality of feature maps from the text region by using a convolutional neural network CNN model, and generate a plurality of feature vectors based on the plurality of feature maps, where each feature map of the plurality of feature maps is used to characterize one pixel feature of the text region, and each feature vector of the plurality of feature vectors is used to characterize a pixel feature in one sub-region of the text region;
a first determining sub-module, configured to determine a plurality of character detection results based on the plurality of feature vectors, where each character detection result in the plurality of character detection results includes at least one character and a probability value corresponding to each character in the at least one character;
and the output sub-module is used for outputting a text detection result based on the plurality of character detection results.
Optionally, the first determining submodule is specifically configured to:
sequencing the plurality of feature vectors according to the position information of the sub-region corresponding to each feature vector in the text region and the writing sequence of the text;
taking the plurality of feature vectors as input of a Recurrent Neural Network (RNN) model, and sequentially generating a plurality of intermediate vectors by using the RNN model based on the sequence of the plurality of feature vectors, wherein each intermediate vector in the plurality of intermediate vectors is used for representing character features of at least one character;
based on each of the plurality of intermediate vectors, at least one character corresponding to the each intermediate vector and a probability value corresponding to each of the at least one character are determined.
Optionally, the determining module includes:
the obtaining submodule is used for obtaining a text detection result of each video frame in a plurality of continuous video frames before the current video frame, and the last video frame in the plurality of continuous video frames is adjacent to the current video frame;
a second determining sub-module, configured to determine an edit distance between the two text detection results corresponding to each pair of adjacent video frames among the current video frame and the plurality of consecutive video frames, where the edit distance is the minimum number of edit operations, among the numbers of edit operations required by different editing modes, needed to convert one of the two text detection results into the other;
the third determining submodule is used for determining the similarity probability between the two text detection results corresponding to each two adjacent video frames based on the editing distance between the two text detection results corresponding to each two adjacent video frames and the number of characters contained in each text detection result in the two text detection results;
a fourth determining sub-module, configured to determine, based on the determined plurality of similarity probabilities, a text included in the current video frame.
Optionally, the fourth determining submodule is specifically configured to:
sequencing the plurality of similarity probabilities according to the sequence of the current video frame and the plurality of continuous video frames;
if a plurality of consecutive similarity probabilities, starting from the similarity probability corresponding to the current video frame, are all greater than a probability threshold, determine a plurality of similar text detection results based on the text detection results corresponding to the video frames corresponding to the plurality of consecutive similarity probabilities;
determining text contained in the current video frame based on the plurality of similar text detection results.
Optionally, the fourth determining sub-module is further specifically configured to:
and if the similarity probability corresponding to the current video frame is smaller than the probability threshold, taking the text detection result of the current video frame as the text contained in the current video frame.
Optionally, the apparatus is further configured to:
acquiring the position of the text area in the current video frame;
accordingly, the determining module further comprises:
a fifth determining sub-module, configured to determine a deviation between the positions of the two text regions in each pair of adjacent video frames among the current video frame and the plurality of consecutive video frames;
the triggering sub-module is used for triggering the second determining sub-module to determine the edit distance between the two text detection results corresponding to each pair of adjacent video frames among the current video frame and the plurality of consecutive video frames if the determined deviations are all smaller than or equal to a deviation threshold value;
a sixth determining submodule, configured to determine, if any deviation of the plurality of deviations is greater than the deviation threshold, a similarity probability between two text detection results corresponding to two video frames corresponding to any deviation greater than the deviation threshold as a first probability, where the first probability is smaller than the probability threshold.
In a third aspect, there is provided a text detection apparatus, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any of the methods of the first aspect above.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out any of the methods of the first aspect.
The technical solutions provided by the embodiments of the present invention have at least the following beneficial effects: after the text region is acquired from the current video frame, the text region need not be segmented; instead, it is input directly into the neural network for detection, and a detection result is output. Because the text region does not need to be segmented, the problem of low recognition accuracy caused by inaccurate segmentation is avoided. In addition, in the embodiments of the present invention, after the neural network outputs the text detection result of the current video frame, the text detection result of the current video frame can be combined with the text detection results of a plurality of video frames related to the current video frame to determine the characters contained in the current video frame, which effectively improves the accuracy of character detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a text detection method provided in an embodiment of the present application;
fig. 2 is a flowchart of a text detection method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating association between text regions and feature vectors according to an embodiment of the present disclosure;
FIG. 4 is an expanded view of an LSTM network according to an embodiment of the present application;
fig. 5 is a flowchart for determining texts included in a current video frame based on a text detection result of the current video frame according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a text detection apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal for text detection according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, an application scenario related to the embodiments of the present application will be described.
Currently, whether it is real-time video or stored video, video often contains text content. For example, to more clearly describe an object in a video image, explanatory text may be labeled near the object. For another example, to help the user better understand the speech in a video, subtitles may be added for the speech content. With the development of Internet technology, the number of videos keeps growing while the quality of video content is uneven. To better supervise massive amounts of video, the text contained in videos can be detected, and the detected text can then be analyzed and audited. The text detection method provided by the present application can be used in this scenario to detect text in videos.
The following explains the text detection method provided in the embodiments of the present application in detail. In the embodiment of the present application, the terminal is mainly used as the execution subject to perform corresponding explanation, but this does not constitute a limitation to the execution subject of the text detection method provided in the embodiment of the present application.
Fig. 1 is a flowchart of a text detection method according to an embodiment of the present application. Referring to fig. 1, the method comprises the steps of:
step 101: and extracting the characteristic information of the current video frame, and acquiring a text region from the current video frame based on the extracted characteristic information.
The text area is an image area containing a text in the current video frame.
Step 102: and taking the text area as the input of the neural network, detecting characters contained in the text area by using the neural network, and outputting a text detection result.
The neural network may include a plurality of neural network models, for example, a CNN (Convolutional Neural Network) model and an RNN (Recurrent Neural Network) model.
Step 103: determining text contained in the current video frame based on the text detection result of the current video frame and the text detection results of the plurality of video frames related to the current video frame.
The plurality of video frames related to the current video frame may refer to a plurality of consecutive video frames before the current video frame, and a last video frame of the plurality of consecutive video frames and the current video frame are adjacent, that is, the current video frame and the plurality of video frames are consecutive. Alternatively, the plurality of video frames related to the current video frame may refer to a plurality of consecutive video frames following the current video frame, and a first video frame of the plurality of consecutive video frames is adjacent to the current video frame. Alternatively, the plurality of video frames related to the current video frame may also refer to a plurality of video frames before the current video frame and a plurality of video frames after the current video frame, and a last video frame of the plurality of video frames before the current video frame is adjacent to the current video frame, and a first video frame of the plurality of video frames after the current video frame is adjacent to the current video frame. Alternatively, the plurality of video frames related to the current video frame may refer to a portion of the video frames selected from a plurality of consecutive video frames before or after the current video frame, that is, the plurality of video frames may not be consecutive to the current video.
In the embodiment of the application, after the text region is acquired from the current video frame, the text region need not be segmented; instead, it is input directly into the neural network for detection, and a detection result is output. Because the text region does not need to be segmented, the problem of low recognition accuracy caused by inaccurate segmentation is avoided. In addition, in the embodiment of the invention, after the neural network outputs the text detection result of the current video frame, the text detection result of the current video frame can be combined with the text detection results of a plurality of video frames other than the current video frame to determine the characters contained in the current video frame, which effectively improves the accuracy of character detection.
Fig. 2 is a flowchart of a text detection method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
step 201: and extracting the characteristic information of the current video frame, and acquiring a text region from the current video frame based on the extracted characteristic information.
In the embodiment of the application, the terminal can sequentially acquire each video frame according to the sequence of the video frames in the video, and detect the text contained in the acquired video frames. The text contained in the video frame may be chinese characters, english characters, or other language characters. In the embodiment of the present application, a description will be given by taking an example in which a terminal detects a text included in a certain video frame in a video.
When a terminal acquires a certain video frame in video data, a text region can be acquired from a current video frame, wherein the text region refers to an image region which may contain a text in the current video frame. The terminal may extract feature information from the current video frame, and may obtain the text region based on the extracted feature information in the following manners.
In the first mode, the terminal may obtain a plurality of candidate image regions in the current video frame, use the plurality of candidate image regions as input of the CNN model or the RNN model, extract feature information of the plurality of candidate image regions through the CNN model or the RNN model, and classify the plurality of candidate image regions according to the extracted feature information, thereby obtaining a first probability and a second probability corresponding to each of the plurality of candidate image regions, where the first probability is a probability that a candidate image region is a text region, and the second probability is a probability that a candidate image region is not a text region. Thereafter, the terminal may take a corresponding candidate image region having a first probability greater than a first threshold value among the plurality of candidate image regions as a text region in the current video frame.
The candidate image area refers to an area determined according to a position where text is usually located in the video frame. For example, in general, subtitles in a video frame are each set at a bottom position of a video image, and in this case, the terminal may determine an area within a preset distance from a bottom edge of the video image as a candidate image area. In addition, in general, the video frame includes text indicating program information, and such text is usually disposed at the right edge or the left edge of the video image, in which case, the terminal may determine an area within a preset distance from the right edge of the video image as a candidate image area, and determine an area within a preset distance from the left edge of the video image as a candidate image area.
Optionally, in a possible implementation manner, the candidate image region may also be a candidate image region obtained by performing texture feature extraction on the video frame and determining according to the extracted texture feature. The specific implementation manner of determining the candidate image region by the terminal according to the extracted texture features may refer to related technologies, which are not described in detail in the embodiments of the present application.
It should be further noted that, in the embodiment of the present application, the multiple candidate image regions may be classified by a CNN model or an RNN model, or may be classified by another classifier. This is not particularly limited in the embodiments of the present application.
In addition, when text regions are selected from the candidate image regions according to the first threshold, the first probabilities of two or more candidate image regions may be greater than the first threshold; in this case, the terminal will acquire a plurality of text regions from the current video frame. Subsequently, the terminal may detect each of the plurality of text regions through steps 202 to 204 to obtain the text contained in the current video frame.
Optionally, the CNN model or the RNN model may determine the position information of each candidate image region in the video frame while determining the first probability and the second probability corresponding to each candidate image region. On this basis, if the terminal selects at least two candidate image regions from the plurality of candidate image regions according to the first threshold, the terminal may further select one candidate image region that meets the preset position information from the at least two candidate image regions as a text region according to the position information of each of the at least two candidate image regions, that is, the terminal may selectively detect a text in a candidate image region located at a certain position from the at least two candidate image regions.
Optionally, in another possible implementation, after determining the first probability and the second probability corresponding to each candidate image region, the CNN model or the RNN model may further select, from the plurality of candidate image regions, a candidate image region with a highest corresponding first probability as the text region.
In the second mode, the terminal may use the video frame as the input of the CNN model or the RNN model, extract pixel features from the video frame using the CNN model or the RNN model, and then classify the plurality of pixel points included in the video frame according to the extracted pixel features, thereby obtaining a first probability and a second probability corresponding to each of the plurality of pixel points. The first probability indicates the probability that the corresponding pixel point is a text pixel point, and the second probability indicates the probability that the corresponding pixel point is a non-text pixel point. Pixel points whose first probability is greater than or equal to a second threshold are determined to be text pixel points, and pixel points whose first probability is less than the second threshold are determined to be non-text pixel points. Then, a region of the video frame in which text pixel points are dense is determined as a text region by using a K-means method or another method.
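As an illustrative sketch of this second mode (not prescribed by the patent text), the following Python snippet thresholds an assumed per-pixel text-probability map, collects the text pixel points, and groups them into regions with K-means; the probability map, the threshold value, and the number of regions are all assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

def text_regions_from_pixel_probs(prob_map, threshold=0.5, num_regions=2):
    """Group pixels whose text probability exceeds `threshold` into
    `num_regions` clusters and return one bounding box per cluster.
    `prob_map` is an (H, W) array of per-pixel text probabilities,
    e.g. the output of a pixel classifier (assumed here)."""
    ys, xs = np.where(prob_map >= threshold)          # text pixel coordinates
    if len(xs) == 0:
        return []                                     # no text pixels found
    coords = np.stack([xs, ys], axis=1).astype(float)
    k = min(num_regions, len(coords))                 # cannot have more clusters than points
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(coords)
    boxes = []
    for c in range(k):
        pts = coords[labels == c]
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        boxes.append((int(x0), int(y0), int(x1), int(y1)))  # (left, top, right, bottom)
    return boxes

# Toy usage: a synthetic probability map with one bright horizontal band.
prob = np.zeros((32, 128))
prob[24:30, 10:100] = 0.9
print(text_regions_from_pixel_probs(prob, threshold=0.5, num_regions=1))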
After the text regions are obtained from the current video frame, if the terminal obtains a plurality of text regions, it may detect each of the plurality of text regions through steps 202 to 204, so as to determine the characters contained in the video frame according to the plurality of text detection results of the plurality of text regions. If the terminal obtains only one text region, it may detect that text region through steps 202 to 204 and determine the characters contained in the video frame according to the text detection result of that text region.
Step 202: a plurality of feature maps are extracted from the text region using the CNN model, and a plurality of feature vectors are generated based on the plurality of feature maps.
The terminal can convert the size of the current text area to be detected into a fixed size through a Resize function, and then can use the text area with the fixed size as the input of a CNN model, and extract a plurality of feature maps from the text area by using the CNN model, wherein each feature map is used for representing one pixel feature in the text area. Wherein the fixed size may be a size of an input image supported by the CNN model. For example, the fixed size may be 256 × 32, or the fixed size may be 128 × 28, etc., and the fixed size is not specifically limited in the embodiments of the present application.
It should be noted that the CNN model may include a plurality of convolutional layers and a plurality of pooling layers, arranged alternately. The convolutional layers perform convolution operations on each pixel channel of the text region, and the pooling layers reduce the dimensionality of the features obtained by the convolution operations. The output of the last layer of the CNN model is the extracted plurality of feature maps, and the number of feature maps is related to the number of pixel channels on which the convolution operations are performed. For example, assuming that convolution operations are performed on three channels of the text region, namely R (red), G (green), and B (blue), the number of feature maps finally output will be 3, and the 3 feature maps will indicate the R pixel features, the G pixel features, and the B pixel features within the text region, respectively.
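The following Python sketch, assuming a PyTorch implementation, illustrates such an alternating convolution/pooling backbone that maps a fixed-size text region to a set of feature maps; the layer counts, channel numbers, and kernel sizes are illustrative assumptions rather than values given in this application.

import torch
import torch.nn as nn

# Minimal CNN backbone with alternating convolution and pooling layers,
# as described above: it maps a fixed-size text-region image to C feature maps.
# The exact architecture (channels, kernel sizes, number of layers) is assumed.
class TextRegionCNN(nn.Module):
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # pooling layer (downsamples by 2)
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1), # convolution layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # pooling layer
        )

    def forward(self, x):
        # x: (batch, in_channels, H, W), e.g. a text region resized to 32 x 256
        return self.features(x)  # (batch, out_channels, H/4, W/4): C feature maps

# Example: a text region resized to the fixed size 256 x 32 (width x height).
maps = TextRegionCNN()(torch.randn(1, 3, 32, 256))
print(maps.shape)  # torch.Size([1, 64, 8, 64]) -> C = 64 feature maps of size 8 x 64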
After extracting a plurality of feature maps from within the text region, a plurality of feature vectors may be generated from the plurality of feature maps. When the number of the feature maps is C, the width of each feature map is W, and the height of each feature map is H, (W x H) feature vectors can be generated according to the C feature maps, and the dimension of each feature vector is C.
Illustratively, assume that 3 feature maps are obtained after the last layer of the CNN model and that each feature map has a size of 3 x 3. In this case, the pixel matrix of feature map 1 is as shown in the following equation (1), the pixel matrix of feature map 2 is as shown in the following equation (2), and the pixel matrix of feature map 3 is as shown in the following equation (3). Based on the pixel matrices of the 3 feature maps, 9 feature vectors can be generated, namely (a11, b11, c11), (a12, b12, c12), (a13, b13, c13), (a21, b21, c21), (a22, b22, c22), (a23, b23, c23), (a31, b31, c31), (a32, b32, c32) and (a33, b33, c33).

[a11 a12 a13]
[a21 a22 a23]    (1)
[a31 a32 a33]

[b11 b12 b13]
[b21 b22 b23]    (2)
[b31 b32 b33]

[c11 c12 c13]
[c21 c22 c23]    (3)
[c31 c32 c33]

Since the feature maps correspond to the text region, the 9 feature vectors can be used to characterize the pixel features of 9 sub-regions in the text region. As shown in fig. 3, the text region can be divided into 9 sub-regions according to the nine feature vectors and the width and height of the text region, where each feature vector characterizes the pixel features of the sub-region whose position corresponds to the position of that vector's elements in the pixel matrices. For example, the feature vector (a11, b11, c11) characterizes the pixel features of sub-region 1, located at the top left corner of the text region; (a12, b12, c12) characterizes the pixel features of sub-region 2; (a13, b13, c13) characterizes sub-region 3; (a21, b21, c21) characterizes sub-region 4; (a22, b22, c22) characterizes sub-region 5; (a23, b23, c23) characterizes sub-region 6; (a31, b31, c31) characterizes sub-region 7; (a32, b32, c32) characterizes sub-region 8; and (a33, b33, c33) characterizes sub-region 9.
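The following NumPy sketch mirrors the 3-map, 3 x 3 example above and shows one way (an assumption, not mandated by this application) to turn C feature maps of size H x W into H*W feature vectors of dimension C, one per sub-region in left-to-right, top-to-bottom order.

import numpy as np

# Turn C feature maps of size H x W into (H * W) feature vectors of
# dimension C, one per sub-region, as in the 3-map, 3 x 3 example above.
C, H, W = 3, 3, 3
feature_maps = np.arange(C * H * W).reshape(C, H, W)  # stand-in for the CNN output

# Moving the channel axis last and flattening the spatial axes yields one
# C-dimensional vector per spatial position (sub-region), in row-major order,
# i.e. left to right and top to bottom.
feature_vectors = feature_maps.transpose(1, 2, 0).reshape(H * W, C)

print(feature_vectors.shape)   # (9, 3): nine 3-dimensional vectors
print(feature_vectors[0])      # vector for sub-region 1, i.e. (a11, b11, c11)
print(feature_vectors[3])      # vector for sub-region 4, i.e. (a21, b21, c21)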
Step 203: a plurality of character detection results are determined based on the plurality of feature vectors, and a text detection result is output based on the plurality of character detection results.
After obtaining a plurality of feature vectors through the plurality of feature maps, the terminal may obtain a plurality of character detection results through the RNN model by using the plurality of feature vectors as an input of the RNN model.
For example, the terminal may sort the plurality of feature vectors according to the position information of the sub-region corresponding to each feature vector in the text region, according to the writing order of the text, use the plurality of feature vectors as the input of the RNN model, sequentially generate a plurality of intermediate vectors based on the order of the plurality of feature vectors by using the RNN model, and determine, based on each intermediate vector of the plurality of intermediate vectors, at least one character corresponding to each intermediate vector and a probability value corresponding to each character of the at least one character.
As can be seen from the description in step 202, each feature vector may be used to characterize the pixel features of a sub-region in the text region. Therefore, the terminal can rank the plurality of feature vectors according to the position of the sub-region represented by each feature vector in the text region and the order of writing the text. The sorting is performed according to the writing order of the text, so that each sub-area is detected in sequence according to the writing order of the text when detection is performed, and thus, the sequence of the output characters is consistent with the writing order of the text. For example, the current general text writing order is left to right and top to bottom, in which case the plurality of feature vectors are ordered in the sub-regions left to right and top to bottom order.
Still taking the text region shown in fig. 3 as an example, in the order from left to right and from top to bottom, the sub-region located at the leftmost and uppermost position is sub-region 1; therefore, the feature vector (a11, b11, c11) ranks first. Next, in left-to-right order, the sub-region following sub-region 1 is sub-region 2, so (a12, b12, c12) is the second feature vector, and, by analogy, (a13, b13, c13) is the third feature vector. Since there are no other sub-regions to the right of sub-region 3, the ordering moves to the second row in top-to-bottom order; in left-to-right order, sub-region 4 is then the next sub-region after sub-region 3, that is, (a21, b21, c21) is the fourth feature vector. The plurality of feature vectors are sorted by analogy with the above method.
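As a small illustration of this ordering step (the way the sub-region positions are represented is an assumption of the sketch), the feature vectors can be sorted by the row and column of their sub-regions:

# Illustrative only: sort feature vectors by the position of their sub-regions,
# following a left-to-right, top-to-bottom writing order. Each entry pairs a
# sub-region's (row, col) position in the text region with its feature vector.
sub_regions = [
    ((1, 0), [0.2, 0.1, 0.7]),   # sub-region 4 (second row, first column)
    ((0, 2), [0.5, 0.3, 0.2]),   # sub-region 3
    ((0, 0), [0.9, 0.0, 0.1]),   # sub-region 1
    ((0, 1), [0.4, 0.4, 0.2]),   # sub-region 2
]

# Sorting by (row, col) reproduces the writing order: the top row first,
# then the next row, each row read from left to right.
ordered = [vec for (pos, vec) in sorted(sub_regions, key=lambda item: item[0])]
print(ordered)  # vectors for sub-regions 1, 2, 3, 4 in that order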
After the plurality of feature vectors are sorted, the plurality of feature vectors may be input to the RNN model, and the RNN model may be used to process the plurality of feature vectors to sequentially generate a plurality of intermediate vectors. Wherein each intermediate vector may be used to characterize character features of at least one character.
Illustratively, the RNN model may be a neural network model based on an attention model, and the neural network structure of the RNN model may be an LSTM (Long Short-Term Memory) network structure. On this basis, after the plurality of feature vectors are input into the RNN model, at time t1 the RNN model may determine the intermediate semantic information of time t1 based on the weight vector of time t1 and the plurality of feature vectors, and input the intermediate semantic information of time t1 into the LSTM; the LSTM generates the intermediate vector of time t1 based on the intermediate semantic information of time t1 and the initial cell state information, updates the initial cell state information according to the intermediate vector of time t1 to obtain the cell state information of time t1, and outputs the intermediate vector of time t1 and the cell state information of time t1. Here, time t1 is the first time at which the LSTM processes the plurality of feature vectors and outputs the first intermediate vector. Then, at the time following t1, that is, time t2, the intermediate semantic information of time t2 is determined based on the weight vector of time t2 and the plurality of feature vectors; the intermediate semantic information of time t2 and the intermediate vector of time t1 are used as the input of the LSTM, which may generate the intermediate vector of time t2 based on the intermediate semantic information of time t2, the intermediate vector of time t1, and the cell state information of time t1, update the cell state information of time t1 according to the intermediate vector of time t2 to obtain the cell state information of time t2, and output the intermediate vector of time t2 and the cell state information of time t2. By analogy, at time tn, the intermediate semantic information of time tn can be determined based on the weight vector of time tn and the plurality of feature vectors; the intermediate semantic information of time tn and the intermediate vector of time tn-1 are input into the LSTM, and the LSTM determines the intermediate vector of time tn based on the intermediate semantic information of time tn, the intermediate vector of time tn-1, and the cell state information of time tn-1.
The number of weights included in the weight vector is the same as the number of feature vectors, and the weights correspond to the feature vectors one to one. Each weight included in the weight vector of time tn indicates the degree of importance of its corresponding feature vector in determining the intermediate semantic information of time tn: the larger the weight, the more important the corresponding feature vector is for determining the intermediate semantic information of time tn, and the greater its influence. On this basis, the feature vector corresponding to the largest weight among the weights included in the weight vector of time tn contributes the most to the intermediate semantic information of time tn. In this case, the intermediate semantic information of time tn can be regarded as the semantic information corresponding to the feature vector with the largest weight among the plurality of feature vectors.
Considering that the order of the plurality of feature vectors actually represents the order of the plurality of sub-regions in the text region, the size of each weight in the weight vector can be adjusted according to the order of the plurality of feature vectors, so that the intermediate semantic information corresponding to each feature vector is determined in turn according to that order. In other words, the intermediate semantic information of the characters possibly contained in the sub-regions is determined in turn according to the order of the sub-regions, so that the intermediate vectors used for characterizing character features, which are determined in turn from the intermediate semantic information, also conform to the writing order of the text.
Illustratively, since time t1 is the first time, the intermediate semantic information corresponding to the first feature vector among the plurality of feature vectors, that is, the intermediate semantic information corresponding to the character possibly contained in the first sub-region of the text region, may be determined first. In this case, because the first feature vector characterizes the pixel features of the first sub-region, the first feature vector has the greatest influence when the intermediate semantic information of time t1 is determined, and the weight corresponding to the first feature vector in the weight vector may therefore be set larger than all the other weights. Moreover, in the left-to-right, top-to-bottom writing order, the character possibly contained in the second sub-region, which follows the first sub-region, is most closely associated with the character possibly contained in the first sub-region; its influence on the intermediate semantic information of time t1 is therefore second only to that of the first sub-region, that is, the weight corresponding to the second feature vector may be smaller than the weight corresponding to the first feature vector and larger than the other weights. By analogy, according to the order of the plurality of feature vectors, the farther a feature vector is from the first feature vector, the smaller its weight. At time t2, because the intermediate semantic information corresponding to the second feature vector is to be determined, the weight corresponding to the second feature vector may be the largest among the weights included in the weight vector, and, according to the order of the plurality of feature vectors, the farther a feature vector is from the second feature vector, the smaller its weight. Following the above method, at time tn, because the intermediate semantic information corresponding to the n-th feature vector is to be determined, the weight corresponding to the n-th feature vector may be the largest among the weights included in the weight vector, and, according to the order of the plurality of feature vectors, the farther a feature vector is from the n-th feature vector, the smaller its weight. The sum of all weights included in the weight vector may be 1.
It should be noted that, at time tn, the intermediate semantic information of time tn can be determined from the weight vector of time tn and the plurality of feature vectors by the following equation (4):

gn = α1*C1 + α2*C2 + … + αn*Cn    (4)

where gn is the intermediate semantic information of time tn; α1, α2, …, αn are the weights included in the weight vector, that is, the weight vector is (α1, α2, …, αn); and C1, C2, …, Cn are the plurality of feature vectors.
Fig. 4 is a schematic diagram of an expanded LSTM network provided by an embodiment of the present application. As shown in fig. 4, at time t1, the intermediate semantic information g1 of time t1 is determined based on the plurality of feature vectors and the weight vector A1 of time t1; g1 is input into the LSTM, which generates the intermediate vector s1 according to the initial cell state information c0 and g1. The intermediate vector s1 may be used to characterize the character features of the character possibly contained in the first sub-region of the text region in left-to-right, top-to-bottom order. After the intermediate vector s1 is generated, c0 is updated according to s1 to obtain the cell state information c1 of time t1, and s1 and c1 are output. At time t2, the intermediate semantic information g2 of time t2 is determined based on the plurality of feature vectors and the weight vector A2 of time t2; g2, s1, and c1 are input into the LSTM, which generates the intermediate vector s2 according to g2, s1, and c1. The intermediate vector s2 may be used to characterize the character features of the character possibly contained in the second sub-region in left-to-right, top-to-bottom order within the text region; c1 is updated according to s2 to obtain c2, and s2 and c2 are output. By analogy, at time tn, the intermediate semantic information gn of time tn is determined based on the plurality of feature vectors and the weight vector An of time tn; gn, sn-1, and cn-1 are input into the LSTM, which generates the intermediate vector sn according to gn, sn-1, and cn-1.
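The following simplified Python sketch, assuming a PyTorch LSTMCell, illustrates one step of the decoding loop shown in fig. 4: an attention weight vector over the feature vectors yields the intermediate semantic information g_t as a weighted sum, which is fed to the LSTM together with the previous intermediate vector and cell state. How the attention scores are computed here (from the previous intermediate vector and the feature vectors) is an assumption; the application only states that the weight vector changes at each time.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified attention decoder step: at each time t, a weight vector over the
# feature vectors produces the intermediate semantic information g_t (a weighted
# sum), which is fed to an LSTM cell together with the previous intermediate
# vector s_{t-1} and cell state c_{t-1}.
class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)   # scores one feature vector (assumed scheme)
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)

    def forward(self, feats, s_prev, c_prev):
        # feats: (num_vectors, feat_dim); s_prev, c_prev: (1, hidden_dim)
        expanded = s_prev.expand(feats.size(0), -1)
        alpha = F.softmax(self.score(torch.cat([feats, expanded], dim=1)), dim=0)
        g_t = (alpha * feats).sum(dim=0, keepdim=True)     # g_t = sum_i alpha_i * C_i, as in equation (4)
        s_t, c_t = self.lstm(g_t, (s_prev, c_prev))        # new intermediate vector and cell state
        return s_t, c_t

feats = torch.randn(9, 64)                 # nine 64-dimensional feature vectors (assumed sizes)
step = AttentionDecoderStep(64, 128)
s, c = torch.zeros(1, 128), torch.zeros(1, 128)
for _ in range(3):                         # produce three intermediate vectors s1, s2, s3
    s, c = step(feats, s, c)
print(s.shape)                             # torch.Size([1, 128])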
Optionally, in this embodiment of the application, before determining the intermediate vector according to the plurality of feature vectors, the terminal may further encode the plurality of feature vectors, and then determine intermediate semantic information according to the encoded plurality of feature vectors and the weight vector different at each time, and further determine the intermediate vector according to the intermediate semantic information.
In this embodiment, the terminal may determine a plurality of intermediate vectors by the above method. Each time the LSTM outputs an intermediate vector, the RNN model may perform a fully connected operation and a softmax operation on the intermediate vector to obtain the character detection result corresponding to that intermediate vector, where the character detection result includes at least one character and a probability value for each of the at least one character. When the at least one character in a character detection result includes an end character and the probability value corresponding to the end character is greater than a third threshold, the determination of intermediate vectors, that is, the determination of character detection results, may be stopped, and the character detection results determined before the current time are taken as the finally detected plurality of character detection results. In other words, in the embodiment of the present application, the number of intermediate vectors generated, and hence the number of character detection results, need not equal the number of feature vectors; that is, the number of feature vectors may be greater than or equal to the number of determined character detection results.
After determining the plurality of character detection results, for each character detection result, the terminal may determine a character corresponding to a maximum probability value in probability values of at least one character included in each character detection result as a character corresponding to the corresponding character detection result, and then the terminal may sequentially arrange the plurality of characters according to the determination order of the character detection results to obtain a character string, where the character string is a text detection result of the current video frame.
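The following plain-Python sketch illustrates this decoding rule, keeping the most probable character of each detection result and stopping at the end character; the end marker token and the probability threshold are assumptions used only for the example.

# Illustrative decoding of character detection results into a text string:
# each result maps candidate characters to probability values; the character
# with the highest probability is kept, and decoding stops when an end marker
# (here the assumed token "<eos>") is the most probable character.
END_TOKEN = "<eos>"

def results_to_text(char_results, end_prob_threshold=0.5):
    chars = []
    for result in char_results:                       # results are already in output order
        best_char = max(result, key=result.get)       # character with the largest probability value
        if best_char == END_TOKEN and result[best_char] > end_prob_threshold:
            break                                      # end character detected; stop decoding
        chars.append(best_char)
    return "".join(chars)

detections = [
    {"H": 0.9, "N": 0.1},
    {"i": 0.8, "l": 0.2},
    {END_TOKEN: 0.95, "!": 0.05},
    {"x": 0.6, "y": 0.4},                              # ignored: decoding already stopped
]
print(results_to_text(detections))                     # -> "Hi"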
Of course, as can be seen from the description in step 201, the current video frame may include a plurality of text regions, in which case, for each text region, the terminal may determine to obtain a corresponding text detection result, that is, the terminal may detect to obtain a plurality of text detection results of the current video frame.
Step 204: determining text contained in the current video frame based on the text detection result of the current video frame and the text detection results of the plurality of video frames other than the current video frame.
After determining the text detection result of the current video frame, in one possible implementation, the terminal may directly determine the text detection result as the text contained in the detected current video frame. In step 201, the terminal may obtain the text region and also obtain the position information of the text region, so that if the current video frame includes a plurality of text regions, in this step, the terminal may output a plurality of text detection results and the position information of the text region corresponding to each text detection result in the current video frame.
Of course, if a certain text appears in only one video frame of the video data, that video frame is displayed for a very short time, so it is difficult for the user to see the text clearly. Therefore, in general, to enable the user to see the text contained in a video frame, a plurality of consecutive video frames contain the same text. On this basis, in this embodiment of the application, after determining the text detection result of the current video frame, the terminal may further fuse it with the text detection results of a plurality of video frames consecutive to the current video frame to determine the text contained in the current video frame.
For example, referring to fig. 5, the terminal may determine the text contained in the current video frame by fusing the text detection result of the current video frame with the text detection results of a plurality of consecutive video frames through the following steps.
2041: the method comprises the steps of obtaining a text detection result of each video frame in a plurality of continuous video frames before a current video frame, wherein the last video frame in the plurality of continuous video frames is adjacent to the current video frame.
The terminal may obtain text detection results of a plurality of consecutive video frames before the current video frame, and a last video frame of the plurality of consecutive video frames is adjacent to the current video frame, that is, the terminal may obtain a plurality of video frames before the current video frame and consecutive to the current video frame.
It should be noted that, for each video frame in the plurality of consecutive video frames before the current video frame, the terminal may detect the text content in that video frame with reference to the method provided in the embodiments of the present application, so as to obtain the text detection result of that video frame, and store each text detection result as it is obtained. The terminal may store the identifier of each video frame in correspondence with the text detection result of that video frame, thereby obtaining a correspondence between video frame identifiers and text detection results. The identifier of a video frame may be the timestamp of the video frame, or a video frame sequence number indicating the position of the video frame in the entire video stream. On this basis, in this step, after the terminal detects the text detection result of the current video frame, it may obtain the text detection results of the plurality of consecutive video frames before the current video frame from the stored text detection results according to the identifier of the current video frame.
For example, when the timestamp of a video frame is used as its identifier, the terminal may first obtain, from the stored correspondence according to the timestamp of the current video frame, a first timestamp that is located before and adjacent to the timestamp of the current video frame, and obtain the first text detection result corresponding to the first timestamp; then, according to the first timestamp, the terminal may obtain a second timestamp that is located before and adjacent to the first timestamp, and obtain the second text detection result corresponding to the second timestamp. By analogy, the terminal can obtain the text detection results of a plurality of consecutive video frames.
Illustratively, when the video frame sequence number is used as the identifier of the video frame, the terminal may obtain, from the stored correspondence according to the video frame sequence number N of the current video frame, the m text detection results corresponding to the video frame sequence numbers N-1, N-2, N-3, …, N-m, where the obtained m text detection results are the text detection results of the plurality of consecutive video frames before the current video frame.
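To make the bookkeeping in step 2041 concrete, here is a minimal sketch that stores each frame's text detection result under its frame sequence number and retrieves the results of the m preceding consecutive frames; the class name, the use of a plain dictionary, and the decision to simply skip frames without a stored result are illustrative assumptions rather than details given in the patent.

class DetectionStore:
    """Keeps the correspondence between frame identifiers and text detection results."""

    def __init__(self):
        self._results = {}  # frame sequence number -> text detection result

    def save(self, frame_no, detection):
        # Called each time the text of one video frame has been detected.
        self._results[frame_no] = detection

    def previous_results(self, current_frame_no, m):
        # Text detection results of the m consecutive frames N-1, N-2, ..., N-m
        # that precede the current frame N; frames with no stored result
        # (e.g. frames containing no text region) are skipped here.
        results = []
        for offset in range(1, m + 1):
            frame_no = current_frame_no - offset
            if frame_no in self._results:
                results.append(self._results[frame_no])
        return results

# Usage: store the result of each processed frame, then query the window.
store = DetectionStore()
store.save(10, "continuous high temperature")
store.save(11, "continuous high temperature")
print(store.previous_results(12, m=2))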
2042: and determining the editing distance between the two text detection results corresponding to each adjacent two video frames in the current video frame and the plurality of continuous video frames.
The edit distance refers to the minimum number of editing operations, among the numbers of editing operations required by the different possible editing manners, when one of the two text detection results is converted into the other text detection result.
It should be noted that, for any two adjacent video frames, the terminal may convert the text detection result of one video frame into the text detection result of the other video frame in different manners; for each manner the terminal may count the corresponding number of edits, and the smallest number of edits among the counted numbers is determined as the edit distance between the two text detection results of the two adjacent video frames.
For example, assume that the text detection results of two adjacent video frames are two nearly identical captions that differ only slightly, say in one substituted character and one character that is missing somewhere in the middle of the first result. When the first text detection result is converted into the second one, in one conversion manner the terminal compares the two results character by character starting from the first character: when the nth character of the first result differs from the nth character of the second result, the nth character of the first result is modified into the nth character of the second result, and when the two characters are the same, the character is kept unchanged. If the first result contains fewer characters than the second result, then after the last character of the first result has been processed, the remaining characters of the second result are appended in order after it. Because every character after the missing one is misaligned in this comparison, converting the first caption of the example into the second one in this manner requires modification and addition operations to be performed 10 times, that is, the number of edits corresponding to this conversion manner is 10.
However, the terminal may also convert in another manner. For the same example, it is enough to substitute the single differing character and insert the single missing character at the right position, so the first text detection result can be converted into the second one with only two editing operations, that is, the number of edits corresponding to this conversion manner is 2, and the edit distance between the two text detection results is therefore 2.
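The minimum over all editing manners does not have to be found by enumerating them; it can be computed with the standard dynamic-programming recurrence for the Levenshtein distance. The sketch below is a generic implementation of that recurrence (keeping one row of the table at a time) and is not code taken from the patent.

def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances for the empty prefix of a
    for i in range(1, m + 1):
        prev_diag = dp[0]            # value of dp[i-1][0]
        dp[0] = i                    # value of dp[i][0]
        for j in range(1, n + 1):
            saved = dp[j]            # value of dp[i-1][j], kept for the next diagonal
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[j] = min(dp[j] + 1,       # delete a character from a
                        dp[j - 1] + 1,   # insert a character into a
                        prev_diag + cost)  # substitute (or keep) a character
            prev_diag = saved
    return dp[n]

# e.g. edit_distance("continuous high temperature", "continuous high temperatures") == 1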
Optionally, in a possible implementation manner, for two adjacent video frames, because the sizes of the video frames are consistent, if the two video frames contain the same text, the positions of that text in the two video frames should be the same or similar, and if the positions of the texts in the respective video frames are far apart, the possibility that the texts are the same is very small. Based on this, before the terminal determines the edit distance between the two text detection results corresponding to each two adjacent video frames among the current video frame and the plurality of consecutive video frames, it may first determine, according to the position information of the two text regions in each two adjacent video frames, the deviation between the positions of the two text regions in each two adjacent video frames.
If the determined deviations are less than or equal to the deviation threshold, it is indicated that the positions of the text regions in the current video frame and the plurality of continuous video frames are similar, that is, the current video frame and the plurality of continuous video frames are likely to contain the same text, and at this time, the terminal may perform a step of determining an edit distance between two text detection results corresponding to every two adjacent video frames in the current video frame and the plurality of continuous video frames.
If any deviation of the determined deviations is greater than the deviation threshold, it indicates that the positions of the two text regions in the two video frames corresponding to that deviation are far apart, that is, the probability that the two video frames contain the same text is very low. In this case, the terminal may directly determine the similarity probability between the two text detection results corresponding to these two video frames as a first probability, without computing the edit distance. The first probability may be 0, or may be set to another value smaller than a certain value, which is used to indicate that the two text detection results are not similar.
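As a rough sketch of this position check, each text region can be represented by an axis-aligned box and the deviation measured as the distance between region centres; the deviation measure, the threshold value and the value chosen for the first probability are illustrative assumptions, since the text above does not fix them.

DEVIATION_THRESHOLD = 20.0  # pixels; illustrative value
FIRST_PROBABILITY = 0.0     # assigned when the regions are too far apart

def region_deviation(box_a, box_b):
    # Boxes are assumed to be (x, y, w, h); the deviation is taken here as
    # the Euclidean distance between the two region centres.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    dx = (ax + aw / 2.0) - (bx + bw / 2.0)
    dy = (ay + ah / 2.0) - (by + bh / 2.0)
    return (dx * dx + dy * dy) ** 0.5

def regions_comparable(box_a, box_b):
    # True: go on to compute the edit distance; False: assign FIRST_PROBABILITY directly.
    return region_deviation(box_a, box_b) <= DEVIATION_THRESHOLD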
2043: and determining the similarity probability between the two text detection results corresponding to each two adjacent video frames based on the editing distance between the two text detection results corresponding to each two adjacent video frames and the number of characters contained in each text detection result in the two text detection results.
After determining the edit distance between the two text detection results of every two adjacent video frames, the terminal may compare the numbers of characters contained in the two text detection results, calculate the ratio between the edit distance and the larger of the two character counts, and determine the ratio as the similarity probability between the two text detection results corresponding to the two adjacent video frames.
For a current video frame and any two adjacent video frames in a plurality of continuous video frames, the terminal can determine the similarity probability between two text detection results corresponding to the two video frames by the method, so as to obtain a plurality of similarity probabilities.
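As a rough illustration of step 2043, the sketch below derives a similarity score from the edit distance (reusing the edit_distance function from the sketch under step 2042) and the larger of the two character counts. Note that the text above defines the similarity probability as the ratio itself, while step 2044 treats larger values as more similar; the sketch therefore uses one minus the ratio, which is an assumption of this illustration rather than wording taken from the patent.

def similarity_probability(text_a, text_b):
    # Uses edit_distance from the step-2042 sketch above.
    longest = max(len(text_a), len(text_b))
    if longest == 0:
        return 1.0  # two empty detection results are treated as identical
    ratio = edit_distance(text_a, text_b) / longest
    # Assumption: a high value means "similar", hence 1 - ratio.
    return 1.0 - ratio

# For example, two captions that differ in a single character out of 28
# give a similarity of 1 - 1/28, roughly 0.96.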
2044: based on the determined plurality of similarity probabilities, text contained in the current video frame is determined.
The terminal may rank the determined plurality of similarity probabilities according to the sequence order of the current video frame and the plurality of consecutive video frames. If a plurality of consecutive similarity probabilities, starting from the similarity probability corresponding to the current video frame, are all greater than a probability threshold, the terminal determines a plurality of similar text detection results based on the text detection results of the video frames corresponding to those consecutive similarity probabilities, and then determines the text contained in the current video frame based on the plurality of similar text detection results. If the similarity probability corresponding to the current video frame is smaller than the probability threshold, the terminal takes the text detection result of the current video frame as the text contained in the current video frame. The probability threshold is greater than the first probability.
For example, assume that, in sequence order, the current video frame and the plurality of consecutive video frames are X1, X2, X3, …, Xn-1, Xn, where Xn is the current video frame and X1, X2, X3, …, Xn-1 are the plurality of consecutive video frames before the current video frame. The plurality of similarity probabilities are ranked in this order: the similarity probability between the two text detection results of X1 and X2 is ranked first, the similarity probability between the two text detection results of X2 and X3 is ranked second, the similarity probability between the two text detection results of X3 and X4 is ranked third, and so on, until the similarity probability between the two text detection results of Xn-1 and Xn is ranked (n-1)th.
After the plurality of similarity probabilities are ranked, the terminal may check them one by one in reverse order, starting from the last similarity probability, to determine whether each similarity probability is greater than the probability threshold. If, starting from the last similarity probability, a plurality of consecutive similarity probabilities are all greater than the probability threshold, the terminal may determine a plurality of similar text detection results based on the text detection results of the video frames corresponding to those consecutive similarity probabilities.
For example, continuing with the foregoing example, the terminal starts from the (n-1)th similarity probability and first judges whether it is greater than the probability threshold. If it is not, the similarity between the two text detection results of Xn-1 and Xn is small, and the terminal may directly determine the text detection result of the current video frame as the text contained in the current video frame. If it is, the terminal judges whether the (n-2)th similarity probability is greater than the probability threshold. If the (n-2)th similarity probability is smaller than the probability threshold, the similarity between the text detection results of Xn-1 and Xn is large while the similarity between the text detection results of Xn-2 and Xn-1 is small, and in this case the two text detection results of Xn-1 and Xn may be determined as the similar text detection results. Of course, if the (n-2)th similarity probability is still greater than the probability threshold, the terminal continues to judge whether the (n-3)th similarity probability is greater than the probability threshold, and so on, until a certain similarity probability is smaller than the probability threshold; the text detection result of the later video frame of the two video frames corresponding to that below-threshold similarity probability, the text detection results of all the video frames between that later video frame and the current video frame, and the text detection result of the current video frame are then determined as the plurality of similar text detection results.
After determining the plurality of similar text detection results, in one possible implementation manner, the terminal may divide identical text detection results among the plurality of similar text detection results into one group, where a text detection result that has no identical counterpart forms a group by itself, so as to obtain a plurality of groups, determine the number of text detection results contained in each group, and take the text detection result of the group containing the largest number of text detection results as the text contained in the current video frame.
In another possible implementation manner, if every group contains the same number of text detection results, the above method cannot determine which group of text detection results should be taken as the text contained in the current video frame. In this case, the terminal may analyze the plurality of similar text detection results with a language model, output the text detection result that best fits the context, and take that text detection result as the text contained in the current video frame.
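Putting steps 2041 to 2044 together, a minimal sketch of the fusion logic might look as follows; the probability threshold is an illustrative value and the language-model tie-breaking mentioned above is omitted.

from collections import Counter

PROBABILITY_THRESHOLD = 0.8  # illustrative value, greater than the first probability

def fuse_detections(detections, similarities):
    # detections: text detection results of frames X1 ... Xn (Xn is the current frame).
    # similarities: similarities[i] is the similarity probability between
    # detections[i] and detections[i + 1], ranked in frame order as in step 2044.
    current = detections[-1]
    similar = [current]
    # Walk backwards, starting from the similarity probability that involves
    # the current frame, while the probabilities stay above the threshold.
    for i in range(len(similarities) - 1, -1, -1):
        if similarities[i] > PROBABILITY_THRESHOLD:
            similar.append(detections[i])
        else:
            break
    if len(similar) == 1:
        # The similarity probability corresponding to the current frame is below
        # the threshold: keep the current frame's own detection result.
        return current
    # Group identical detection results and keep the largest group.
    counts = Counter(similar)
    return counts.most_common(1)[0][0]

# e.g. fuse_detections(["high temp", "high temp", "hiqh temp"], [0.95, 0.9])
# returns "high temp" even though the current frame was misread.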
In addition, it should be noted that the above embodiment is mainly explained by taking the case where the current video frame and each of the plurality of consecutive video frames contain one text detection result as an example. If the current video frame contains a plurality of text detection results and/or each of the plurality of consecutive video frames contains a plurality of text detection results, the terminal may first obtain, from the plurality of consecutive video frames, the text detection result corresponding to each text detection result of the current video frame according to the position information of the text region corresponding to that text detection result, take each text detection result of the current video frame together with the corresponding text detection results obtained from the plurality of consecutive video frames as one group, and process the text detection results in each group according to the foregoing steps 2041-2044 based on the video frames corresponding to the text detection results in that group, so as to obtain the text contained in the current video frame.
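For the multi-region case just described, one possible sketch matches each text region of the current frame with the positionally closest region of every preceding frame (reusing region_deviation from the earlier sketch) and collects the matched detections into one group, which can then be fused as above; the matching rule and all names are illustrative assumptions.

def build_group(current_region, previous_frames):
    # current_region: (text, box) of one text region in the current frame.
    # previous_frames: one list of (text, box) pairs per preceding frame,
    # ordered from the earliest frame to the frame adjacent to the current one.
    current_text, current_box = current_region
    group = []
    for regions in previous_frames:
        if not regions:
            continue  # that frame contributed no text region at this position
        text, _ = min(regions, key=lambda r: region_deviation(r[1], current_box))
        group.append(text)
    group.append(current_text)
    return group  # detection results to be fused as in steps 2041-2044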
Optionally, in a possible implementation manner, the terminal may also fuse the text detection result of the current video frame with the text detection results of a plurality of consecutive video frames after the current video frame to determine the text included in the current video frame, where a first video frame in the plurality of consecutive video frames is adjacent to the current video frame.
Or, in a possible implementation manner, the terminal may also fuse the text detection result of the current video frame, the text detection results of a plurality of consecutive video frames before the current video frame, and the text detection results of a plurality of consecutive video frames after the current video frame, so as to determine the text contained in the current video frame. The last video frame of the plurality of consecutive video frames before the current video frame is adjacent to the current video frame, and the first video frame of the plurality of consecutive video frames after the current video frame is also adjacent to the current video frame.
In the embodiment of the application, the terminal can acquire the text region from the current video frame by extracting the feature information of the current video frame, and after the text region is acquired, the terminal can directly input the text region into the neural network for detection without segmenting the text region and output a detection result. Because the text region does not need to be segmented, the problem of low identification accuracy caused by inaccurate segmentation when the imaging quality is poor is solved. In addition, in the embodiment of the present application, it is considered that a plurality of consecutive video frames often contain the same text, and therefore, after the neural network outputs the text detection result of the current video frame, the text detection results of the plurality of consecutive video frames including the current video frame can be combined to determine the text contained in the current video frame, thereby effectively improving the accuracy of text detection.
Next, a description will be given of a text detection device provided in an embodiment of the present application.
Fig. 6 is a block diagram of a text detection apparatus 600 according to an embodiment of the present application. Referring to fig. 6, the apparatus includes:
an obtaining module 601, configured to extract feature information of a current video frame, and obtain a text region from the current video frame based on the extracted feature information, where the text region is an image region containing a text in the current video frame;
the detecting module 602 is configured to use the text region as an input of a neural network, detect characters included in the text region by using the neural network, and output a text detection result, where the neural network includes a plurality of neural network models;
a determining module 603, configured to determine text included in the current video frame based on the text detection result of the current video frame and the text detection results of the plurality of video frames related to the current video frame.
Optionally, the detection module 602 includes:
the generation sub-module is used for extracting a plurality of feature maps from the text region by using the convolutional neural network CNN model and generating a plurality of feature vectors based on the plurality of feature maps, wherein each feature map in the plurality of feature maps is used for representing one pixel feature of the text region, and each feature vector in the plurality of feature vectors is used for representing the pixel feature in one sub-region of the text region;
a first determining submodule, configured to determine a plurality of character detection results based on the plurality of feature vectors, where each character detection result in the plurality of character detection results includes at least one character and a probability value corresponding to each character in the at least one character;
and the output sub-module is used for outputting a text detection result based on the plurality of character detection results.
Optionally, the first determining submodule is specifically configured to:
sequencing the plurality of feature vectors according to the position information of the sub-region corresponding to each feature vector in the text region and the writing sequence of the text;
taking the plurality of feature vectors as the input of a Recurrent Neural Network (RNN) model, and sequentially generating a plurality of intermediate vectors by using the RNN model based on the sequence of the plurality of feature vectors, wherein each intermediate vector in the plurality of intermediate vectors is used for representing character features of at least one character;
based on each of the plurality of intermediate vectors, at least one character corresponding to each intermediate vector and a probability value corresponding to each of the at least one character are determined.
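As an illustration of the generation sub-module and the first determining sub-module described above, the following is a minimal CRNN-style sketch in PyTorch: a small CNN produces feature maps from the text region, every column of the maps becomes one feature vector, an RNN turns the ordered feature vectors into intermediate vectors, and a linear layer maps each intermediate vector to character probabilities. The layer sizes, the input height, the number of character classes and the choice of PyTorch are assumptions; the patent does not prescribe a concrete topology.

import torch
import torch.nn as nn

class TextRecognizer(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # CNN: each output channel is one feature map of the text region.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # RNN: consumes the column feature vectors in writing order and
        # produces one intermediate vector per step.
        self.rnn = nn.LSTM(input_size=128 * 8, hidden_size=256,
                           batch_first=True, bidirectional=True)
        # Per-step character probabilities (e.g. decoded afterwards with CTC).
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, text_region):
        # text_region: (batch, 1, 32, width) grayscale crop of the text region.
        maps = self.cnn(text_region)                # (batch, 128, 8, width / 4)
        b, c, h, w = maps.shape
        # One feature vector per horizontal position (sub-region / column).
        columns = maps.permute(0, 3, 1, 2).reshape(b, w, c * h)
        hidden, _ = self.rnn(columns)               # intermediate vectors
        return self.classifier(hidden).softmax(-1)  # character probabilities

# e.g. TextRecognizer(num_classes=37)(torch.randn(1, 1, 32, 128)) yields a
# (1, 32, 37) tensor: 32 time steps, each with 37 character-class probabilities.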
Optionally, the determining module 603 includes:
the acquisition submodule is used for acquiring a text detection result of each video frame in a plurality of continuous video frames before the current video frame, and the last video frame in the plurality of continuous video frames is adjacent to the current video frame;
the second determining submodule is used for determining an editing distance between two text detection results corresponding to each two adjacent video frames in the current video frame and the plurality of continuous video frames, wherein the editing distance refers to the minimum editing time in different editing times required by different editing modes when one text detection result in the two text detection results is converted into the other text detection result according to the different editing modes;
the third determining submodule is used for determining the similarity probability between the two text detection results corresponding to each two adjacent video frames based on the editing distance between the two text detection results corresponding to each two adjacent video frames and the number of characters contained in each text detection result in the two text detection results;
and the fourth determining submodule is used for determining texts contained in the current video frame based on the plurality of determined similarity probabilities.
Optionally, the fourth determining submodule is specifically configured to:
sequencing the plurality of similar probabilities according to the sequence of the current video frame and a plurality of continuous video frames;
if a plurality of continuous similar probabilities from the similar probability corresponding to the current video frame are all larger than the probability threshold, determining a plurality of text detection results corresponding to a plurality of video frames corresponding to the continuous similar probabilities as a plurality of similar text detection results;
and determining the text contained in the current video frame based on a plurality of similar text detection results.
Optionally, the fourth determining sub-module is further specifically configured to:
and if the similarity probability corresponding to the current video frame is smaller than the probability threshold, taking the text detection result of the current video frame as the text contained in the current video frame.
Optionally, the apparatus 600 is further configured to:
acquiring the position of the text area in the current video frame;
accordingly, the determining module 603 further comprises:
a fifth determining sub-module for determining a deviation between positions of two text regions in each adjacent two of the current video frame and the plurality of consecutive video frames;
the triggering sub-module is used for triggering the fourth determining sub-module to determine the editing distance between two text detection results corresponding to each two adjacent video frames in the current video frame and the multiple continuous video frames if the determined multiple deviations are all smaller than or equal to the deviation threshold;
and the sixth determining submodule is used for determining the similarity probability between two text detection results corresponding to two video frames corresponding to any deviation larger than the deviation threshold as a first probability if any deviation of the deviations is larger than the deviation threshold, and the first probability is smaller than the probability threshold.
In the embodiment of the application, the terminal can acquire the text region from the current video frame by extracting the feature information of the current video frame, and after the text region is acquired, the terminal can directly input the text region into the neural network for detection without segmenting the text region and output a detection result. Because the text region does not need to be segmented, the problem of low identification accuracy caused by inaccurate segmentation when the imaging quality is poor is solved. In addition, in the embodiment of the present application, it is considered that a plurality of consecutive video frames often contain the same text, and therefore, after the neural network outputs the text detection result of the current video frame, the text detection results of the plurality of consecutive video frames including the current video frame can be combined to determine the text contained in the current video frame, thereby effectively improving the accuracy of text detection.
It should be noted that: in the text detection apparatus provided in the above embodiment, when detecting a text, only the division of the above functional modules is used for illustration, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the above described functions. In addition, the text detection apparatus and the text detection method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 7 is a block diagram illustrating a terminal 700 according to an exemplary embodiment of the present invention. The terminal may be the terminal in the system architecture described in fig. 1. Among them, the terminal 700 may be: industrial computers, industrial personal computers, notebook computers, desktop computers, smart phones or tablet computers, and the like. Terminal 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so on.
In general, terminal 700 includes: a processor 701 and a memory 702.
The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 702 is used to store at least one instruction for execution by the processor 701 to implement the text detection method provided by the method embodiments of the present application.
In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch screen display 705, camera 706, audio circuitry 707, positioning components 708, and power source 709.
The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 704 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 704 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over the surface of the display screen 705. The touch signal may be input to the processor 701 as a control signal for processing. At this point, the display 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 705 may be one, providing the front panel of the terminal 700; in other embodiments, the display 705 can be at least two, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display 705 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 700. Even more, the display 705 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The Display 705 may be made of LCD (liquid crystal Display), OLED (Organic Light-Emitting Diode), or the like.
The camera assembly 706 is used to capture images or video. Optionally, camera assembly 706 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 706 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing or inputting the electric signals to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 700. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic position of the terminal 700 to implement navigation or LBS (location based Service). The positioning component 708 may be a positioning component based on the GPS (global positioning System) in the united states, the beidou System in china, or the galileo System in the european union.
Power supply 709 is provided to supply power to various components of terminal 700. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 709 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 can detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 700. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the touch screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 712 may detect a body direction and a rotation angle of the terminal 700, and the gyro sensor 712 may cooperate with the acceleration sensor 711 to acquire a 3D motion of the terminal 700 by the user. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 713 may be disposed on a side bezel of terminal 700 and/or an underlying layer of touch display 705. When the pressure sensor 713 is disposed on a side frame of the terminal 700, a user's grip signal on the terminal 700 may be detected, and the processor 701 performs right-left hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at a lower layer of the touch display 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 705. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 714 is used for collecting a fingerprint of a user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the terminal 700. When a physical button or a vendor Logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical button or the vendor Logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 705 is increased; when the ambient light intensity is low, the display brightness of the touch display 705 is turned down. In another embodiment, processor 701 may also dynamically adjust the shooting parameters of camera assembly 706 based on the ambient light intensity collected by optical sensor 715.
A proximity sensor 716, also referred to as a distance sensor, is typically disposed on a front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front surface of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually decreases, the processor 701 controls the touch display 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually increases, the processor 701 controls the touch display 705 to switch from the screen-off state to the screen-on state.
That is, an embodiment of the present invention not only provides a text detection apparatus, which may be applied to the terminal 700 described above and includes a processor and a memory for storing executable instructions of the processor, where the processor is configured to execute the text detection method provided in the foregoing embodiments, but also provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, can implement the text detection method provided in the foregoing embodiments.
It will be understood by those skilled in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A text detection method, the method comprising:
extracting feature information of a current video frame, and acquiring a text region from the current video frame based on the extracted feature information, wherein the text region is an image region containing a text in the current video frame;
taking the text area as the input of a neural network, detecting characters contained in the text area by using the neural network, and outputting a text detection result, wherein the neural network comprises a plurality of neural network models;
determining text contained in the current video frame based on the text detection result of the current video frame and the text detection results of a plurality of video frames related to the current video frame.
2. The method of claim 1, wherein the detecting the characters contained in the text region by using the neural network and outputting a text detection result comprises:
extracting a plurality of feature maps from the text region by using a Convolutional Neural Network (CNN) model, and generating a plurality of feature vectors based on the plurality of feature maps, wherein each feature map in the plurality of feature maps is used for characterizing one pixel feature of the text region, and each feature vector in the plurality of feature vectors is used for characterizing the pixel feature in one sub-region of the text region;
determining a plurality of character detection results based on the plurality of feature vectors, each character detection result of the plurality of character detection results comprising at least one character and a probability value corresponding to each character of the at least one character;
outputting a text detection result based on the plurality of character detection results.
3. The method of claim 2, wherein determining a plurality of character detection results based on the plurality of feature vectors comprises:
sequencing the plurality of feature vectors according to the position information of the sub-region corresponding to each feature vector in the text region and the writing sequence of the text;
taking the plurality of feature vectors as input of a Recurrent Neural Network (RNN) model, and sequentially generating a plurality of intermediate vectors by using the RNN model based on the sequence of the plurality of feature vectors, wherein each intermediate vector in the plurality of intermediate vectors is used for representing character features of at least one character;
based on each of the plurality of intermediate vectors, at least one character corresponding to the each intermediate vector and a probability value corresponding to each of the at least one character are determined.
4. The method of claim 1, wherein the determining text included in the current video frame based on the text detection result of the current video frame and the text detection results of a plurality of video frames related to the current video frame comprises:
acquiring a text detection result of each video frame in a plurality of continuous video frames before the current video frame, wherein the last video frame in the plurality of continuous video frames is adjacent to the current video frame;
determining an editing distance between two text detection results corresponding to each two adjacent video frames in the current video frame and the plurality of continuous video frames, wherein the editing distance refers to the minimum editing time of different editing times required by different editing modes when one text detection result in the two text detection results is converted into the other text detection result according to the different editing modes;
determining the similarity probability between two text detection results corresponding to each two adjacent video frames based on the editing distance between the two text detection results corresponding to each two adjacent video frames and the number of characters contained in each text detection result in the two text detection results;
determining text contained in the current video frame based on the determined plurality of similarity probabilities.
5. The method of claim 4, wherein determining text contained in the current video frame based on the determined plurality of similarity probabilities comprises:
sequencing the plurality of similarity probabilities according to the sequence of the current video frame and the plurality of continuous video frames;
if a plurality of continuous similar probabilities from the similar probability corresponding to the current video frame are all larger than a probability threshold, determining a plurality of similar text detection results based on a plurality of text detection results corresponding to a plurality of video frames corresponding to the plurality of continuous similar probabilities;
determining text contained in the current video frame based on the plurality of similar text detection results.
6. The method of claim 5, further comprising:
and if the similarity probability corresponding to the current video frame is smaller than the probability threshold, taking the text detection result of the current video frame as the text contained in the current video frame.
7. The method according to claim 5 or 6, wherein after obtaining the text region from the current video frame based on the extracted feature information, the method further comprises:
acquiring the position of the text area in the current video frame;
accordingly, before determining the edit distance between the current video frame and two text detection results corresponding to each adjacent two video frames in the plurality of consecutive video frames, the method further includes:
determining a deviation between the current video frame and the locations of two text regions in each adjacent two of the plurality of consecutive video frames;
if the determined deviations are all smaller than or equal to a deviation threshold value, executing the step of determining the edit distance between two text detection results corresponding to every two adjacent video frames in the current video frame and the plurality of continuous video frames;
if any deviation of the deviations is larger than the deviation threshold, determining a similarity probability between two text detection results corresponding to two video frames corresponding to any deviation larger than the deviation threshold as a first probability, wherein the first probability is smaller than the probability threshold.
8. A text detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for extracting the characteristic information of the current video frame and acquiring a text region from the current video frame based on the extracted characteristic information, wherein the text region is an image region containing a text in the current video frame;
the detection module is used for taking the text area as the input of a neural network, detecting characters contained in the text area by using the neural network and outputting a text detection result, wherein the neural network comprises a plurality of neural network models;
a determining module, configured to determine text included in the current video frame based on a text detection result of the current video frame and text detection results of a plurality of video frames related to the current video frame.
9. The apparatus of claim 8, wherein the detection module comprises:
a generation sub-module, configured to extract a plurality of feature maps from the text region by using a convolutional neural network CNN model, and generate a plurality of feature vectors based on the plurality of feature maps, where each feature map of the plurality of feature maps is used to characterize one pixel feature of the text region, and each feature vector of the plurality of feature vectors is used to characterize a pixel feature in one sub-region of the text region;
a first determining sub-module, configured to determine a plurality of character detection results based on the plurality of feature vectors, where each character detection result in the plurality of character detection results includes at least one character and a probability value corresponding to each character in the at least one character;
and the output sub-module is used for outputting a text detection result based on the plurality of character detection results.
10. The apparatus of claim 9, wherein the first determination submodule is specifically configured to:
sequencing the plurality of feature vectors according to the position information of the sub-region corresponding to each feature vector in the text region and the writing sequence of the text;
taking the plurality of feature vectors as input of a Recurrent Neural Network (RNN) model, and sequentially generating a plurality of intermediate vectors by using the RNN model based on the sequence of the plurality of feature vectors, wherein each intermediate vector in the plurality of intermediate vectors is used for representing character features of at least one character;
based on each of the plurality of intermediate vectors, at least one character corresponding to the each intermediate vector and a probability value corresponding to each of the at least one character are determined.
11. The apparatus of claim 8, wherein the determining module comprises:
the obtaining submodule is used for obtaining a text detection result of each video frame in a plurality of continuous video frames before the current video frame, and the last video frame in the plurality of continuous video frames is adjacent to the current video frame;
a second determining sub-module, configured to determine an edit distance between two text detection results corresponding to each adjacent two video frames in the current video frame and the multiple consecutive video frames, where the edit distance is a minimum edit time of different edit times required by different edit modes when one text detection result of the two text detection results is converted into another text detection result according to the different edit modes;
the third determining submodule is used for determining the similarity probability between the two text detection results corresponding to each two adjacent video frames based on the editing distance between the two text detection results corresponding to each two adjacent video frames and the number of characters contained in each text detection result in the two text detection results;
a fourth determining sub-module, configured to determine, based on the determined plurality of similarity probabilities, a text included in the current video frame.
12. The apparatus of claim 11, wherein the fourth determination submodule is specifically configured to:
sequencing the plurality of similarity probabilities according to the sequence of the current video frame and the plurality of continuous video frames;
if a plurality of continuous similar probabilities from the similar probability corresponding to the current video frame are all larger than a probability threshold, determining a plurality of similar text detection results based on a plurality of text detection results corresponding to a plurality of video frames corresponding to the plurality of continuous similar probabilities;
determining text contained in the current video frame based on the plurality of similar text detection results.
13. The apparatus of claim 12, wherein the fourth determination submodule is further configured to:
and if the similarity probability corresponding to the current video frame is smaller than the probability threshold, taking the text detection result of the current video frame as the text contained in the current video frame.
14. The apparatus of claim 12 or 13, wherein the apparatus is further configured to:
acquiring the position of the text area in the current video frame;
accordingly, the determining module further comprises:
a fifth determining sub-module for determining a deviation between the positions of the two text regions in each of the current video frame and each of the adjacent two of the plurality of consecutive video frames;
the triggering sub-module is used for triggering the fourth determining sub-module to determine the editing distance between two text detection results corresponding to each two adjacent video frames in the current video frame and the multiple continuous video frames if the determined multiple deviations are all smaller than or equal to a deviation threshold value;
a sixth determining submodule, configured to determine, if any deviation of the plurality of deviations is greater than the deviation threshold, a similarity probability between two text detection results corresponding to two video frames corresponding to any deviation greater than the deviation threshold as a first probability, where the first probability is smaller than the probability threshold.
15. A text detection apparatus, characterized in that the apparatus comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any of the methods of claims 1-7.
16. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when being executed by a processor, carries out the method of any one of claims 1-7.
CN201810775852.6A 2018-07-16 2018-07-16 Text detection method and device and computer readable storage medium Pending CN110728167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810775852.6A CN110728167A (en) 2018-07-16 2018-07-16 Text detection method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810775852.6A CN110728167A (en) 2018-07-16 2018-07-16 Text detection method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110728167A true CN110728167A (en) 2020-01-24

Family

ID=69216748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810775852.6A Pending CN110728167A (en) 2018-07-16 2018-07-16 Text detection method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110728167A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101533474A (en) * 2008-03-12 2009-09-16 三星电子株式会社 Character and image recognition system based on video image and method thereof
CN105930836A (en) * 2016-04-19 2016-09-07 北京奇艺世纪科技有限公司 Identification method and device of video text
CN107784303A (en) * 2016-12-15 2018-03-09 平安科技(深圳)有限公司 Licence plate recognition method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582241A (en) * 2020-06-01 2020-08-25 腾讯科技(深圳)有限公司 Video subtitle recognition method, device, equipment and storage medium
CN111639657A (en) * 2020-06-03 2020-09-08 浪潮软件股份有限公司 Irregular character recognition method and device
CN113780038A (en) * 2020-06-10 2021-12-10 深信服科技股份有限公司 Picture auditing method and device, computing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination