CN114782941A - Video OSD character recognition method, device and medium


Info

Publication number
CN114782941A
CN114782941A
Authority
CN
China
Prior art keywords
character
osd
image
module
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210446762.9A
Other languages
Chinese (zh)
Inventor
凌康杰
陈利军
林焕凯
洪曙光
王祥雪
刘双广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gosuncn Technology Group Co Ltd
Original Assignee
Gosuncn Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gosuncn Technology Group Co Ltd filed Critical Gosuncn Technology Group Co Ltd
Priority to CN202210446762.9A priority Critical patent/CN114782941A/en
Publication of CN114782941A publication Critical patent/CN114782941A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention provides a video OSD character recognition method comprising the following steps: S1, obtaining an OSD video image; S2, inputting the OSD video image into a character position detection module to obtain character position information; S3, inputting the character position information and the OSD video image simultaneously into a character position cutting module to obtain a cut image, where the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths; and S4, inputting the cut image into a character content recognition module and outputting the recognized character string information. Because the position information of the OSD characters is used while recognizing the characters, the invention accelerates model convergence and improves recognition accuracy.

Description

Video OSD character recognition method, device and medium
Technical Field
The invention relates to the technical field of character recognition, in particular to a method, a device and a medium for recognizing video OSD characters.
Background
With the wide application of intelligent security systems based on video surveillance, a large number of OSD (On-Screen Display) videos are generated every day. In practice, the OSD character information in OSD video images needs to be recognized, for example to determine whether the characters the OSD video displays in real time are consistent with the characters actually configured for display, or for video archiving, database indexing, and so on.
Two solutions currently exist. The first is OCR (Optical Character Recognition) based on conventional image processing. It first uses computer vision techniques such as gradient feature extraction, HOG feature extraction, dilation, erosion, and image binarization to detect the OSD characters in the video and segment them into single characters, then performs template matching on each OSD character to obtain the recognized characters. Its main drawbacks are that it is affected by illumination changes in the video acquisition environment and by complex backgrounds in the image, so character segmentation is poor and prone to incomplete, missed, or erroneous segmentation; recognition accuracy is low; and online deployment is unstable. In particular, the transition between daytime and nighttime scenes makes the segmentation threshold hard to determine.
The second is based on deep learning: the positions of OSD characters are detected by a character detection network, and the character-segment images are then fed to a deep learning network for character recognition. Its main drawback is that it requires a large amount of training samples and computational resources and is still affected by image background and illumination.
Directly applying existing deep learning methods to OSD character recognition has the following three main defects.
First, the accuracy of current OSD character recognition models is low.
Second, current OCR methods do not fully exploit the information unique to OSD character recognition. This is mainly because conventional OCR training data comes from manual annotation, which is time-consuming and labor-intensive and generally cannot finely annotate the position of each character, whereas data synthesized with the OSD technique carries both position and content information.
Third, OCR-based deep learning models are not optimized for OSD character recognition. OSD characters are square, regular, contiguous block characters; within a video they exhibit no deformation and no per-frame scaling, and their color is usually a pure color such as black or white. The characters OCR targets are, by contrast, characters in natural scenes, such as outdoor billboards or paper documents photographed with a mobile phone, with varied colors and textures. If no mechanism is designed to strengthen the learning of the distinctive features of OSD characters and weaken the influence of background characters, directly applying an OCR method to OSD recognition often falsely detects background characters in the video frame, and passing such irrelevant background characters to the OSD recognition module easily degrades the performance of the whole OSD character recognition system. Moreover, although OSD character pixels are monochrome, their background comes directly from the video, so the background of OSD characters is complex and varied, which increases recognition difficulty; the character backgrounds in OCR research are often simple, which easily makes OCR methods insensitive to OSD characters and prone to misses.
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the material described in this section is not prior art to the claims in this application and is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
In view of the above technical problems in the related art, the present invention provides a video OSD character recognition method, which includes the following steps:
S1, obtaining an OSD video image;
S2, inputting the OSD video image into a character position detection module to obtain character position information;
S3, inputting the character position information and the OSD video image simultaneously into a character position cutting module to obtain a cut image; the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths;
and S4, inputting the cut image to a character content recognition module, and outputting the recognized character string information.
Specifically, the character content recognition module comprises: a first CNN feature map (101), a down-sampling module (102), a spatial attention module (103), an up-sampling module (104), a multiplier (105) and a second CNN feature map (106), where the multiplier (105) multiplies the output of the up-sampling module (104) with the first CNN feature map (101).
Specifically, the second CNN feature map (106) is input to a CRNN network for character recognition.
Specifically, the method for generating training data of the character position detection module is as follows:
The background area and the character area are simultaneously divided into blocks of k1 pixels in height and k2 pixels in width, denoted blocks; for each block, the single character background averaging method is applied to superimpose the characters, yielding an OSD character image of the synthesized characters.
Specifically, the OSD character image of the synthesized characters is edge-filtered.
In a second aspect, another embodiment of the present invention discloses an OSD character recognition apparatus, which includes the following units:
the OSD video acquisition unit is used for acquiring an OSD video image;
the character position acquisition unit is used for inputting the OSD video image into the character position detection module to obtain character position information;
the character cutting unit is used for simultaneously inputting the character position information and the OSD video image to a character position cutting module to obtain a cut image; the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths;
and the character recognition unit is used for inputting the cut image to a character content recognition module and outputting recognized character string information.
Specifically, the character content recognition module comprises: a first CNN feature map (101), a down-sampling module (102), a spatial attention module (103), an up-sampling module (104), a multiplier (105) and a second CNN feature map (106), where the multiplier (105) multiplies the output of the up-sampling module (104) with the first CNN feature map (101).
Specifically, the character recognition unit further inputs the second CNN feature map (106) into the CRNN network for character recognition.
Specifically, the method for generating training data of the character position detection module is as follows:
The background area and the character area are simultaneously divided into blocks of k1 pixels in height and k2 pixels in width, denoted blocks; for each block, the single character background averaging method is applied to superimpose the characters, yielding an OSD character image of the synthesized characters.
In a third aspect, another embodiment of the present invention discloses a non-volatile memory having instructions stored thereon which, when executed by a processor, implement the above OSD character recognition method.
Because the position information of the OSD characters is used while recognizing the characters, the invention accelerates model convergence and improves recognition accuracy. In the prior art, an auxiliary segmentation network is trained jointly with the recognition network to inject position information from outside; the segmentation network is removed after training and the recognition network is then used on its own. The invention instead integrates the position information directly into the recognition network through a two-stage position attention mechanism, guiding the recognition network to attend appropriately to character positions during recognition. The network architecture is thus identical in the training and deployment stages, the network learns and memorizes character position information to the greatest extent, and the resulting recognition network performs better with higher accuracy. Compared with the external approach of using an additional auxiliary segmentation network to guide attention to position, the method also yields a smaller model and consumes fewer training resources. The OSD synthesis method can realistically synthesize, at large scale, the OSD patterns found in real scenes, and can to some extent simulate the blurred character edges that appear in OSD images due to information degradation during signal acquisition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings described below are obviously only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for identifying OSD characters in a video according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an OSD character according to an embodiment of the invention;
FIG. 3 is a schematic diagram illustrating OSD character synthesis according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a character recognition module provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a CBAM module provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of an apparatus for recognizing OSD characters according to an embodiment of the invention;
fig. 7 is a schematic diagram of an OSD character recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
Example one
Referring to fig. 1, the present embodiment discloses a method for identifying OSD characters of a video, which includes the following steps:
s1, obtaining an OSD video image;
specifically, the embodiment obtains the video stream with the OSD from the monitoring camera or the cloud server storing the video stream.
In this embodiment, OSD video images are acquired from the video stream by video parsing: the corresponding frame objects in the video stream are converted into images.
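As a concrete illustration, the frame extraction step might look like the following minimal sketch. OpenCV, the sampling interval, and the RTSP URL format are assumptions; the patent only specifies that frame objects are parsed from the stream and converted into images.

```python
# Minimal sketch of step S1: pulling OSD video images from a video stream.
# OpenCV (cv2) and the stream URL format are assumptions, not part of the patent.
import cv2

def grab_osd_frames(stream_url: str, every_n: int = 25):
    """Yield decoded frames (BGR images) from a camera or cloud video stream."""
    cap = cv2.VideoCapture(stream_url)   # e.g. "rtsp://<camera-ip>/stream"
    idx = 0
    while True:
        ok, frame = cap.read()           # one frame object per iteration
        if not ok:
            break
        if idx % every_n == 0:
            yield frame                  # the OSD video image for later steps
        idx += 1
    cap.release()
```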
S2, the OSD video image is input into the character position detection module to obtain the position information of the character image block, represented as (x_center, y_center, h, w), where x_center is the abscissa of the center of the character image block, y_center is the ordinate of the center, h is the height of the block, and w is its width;
the character position detection module of the present embodiment is constructed based on a convolutional neural network, which needs to be trained.
The training process of the character position detection module of this embodiment is as follows:
In the training stage, a training data set is first synthesized at large scale, as follows:
In a video image, one or more character image blocks are randomly selected; the size of each character image block is random but must be large enough to completely accommodate several characters.
A character image block in this embodiment is an image block composed of one or more characters; in a block composed of multiple characters there is no obvious space between the characters, i.e., the characters are contiguous.
The character image blocks are typically located at the bottom of the video image, for example as commonly used subtitles, though they may also be located elsewhere in the video image.
For each selected character image block, the OSD technique is used to fill it with several randomly chosen characters so that the block is exactly filled. All character image blocks are filled in this way until done, yielding the filled OSD video image, the character codes, and the position information of the character image blocks, expressed as (x_center, y_center, h, w).
The OSD technique fills characters into the video image one character at a time. If a selected character image block is filled with multiple characters, its position information is referred to as the multi-character image block position information; within such a block, each character also has its own corresponding single-character image block position information.
In the process of filling characters into images by using the OSD technology, the colors, the font sizes and the fonts of the characters are all randomly selected.
The data synthesis tool in this embodiment is Python; data synthesis is performed by superimposing a font image on the video image.
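A compact sketch of this synthesis step is given below. Pillow, the font files, and the character pool are illustrative assumptions; the patent only requires random characters, colors, font sizes and fonts, plus the (x_center, y_center, h, w) label.

```python
# Sketch of the synthesis step: fill a character image block with random
# characters in a random font, size and colour, and record its position label.
# Pillow is an assumption; CHARSET and FONTS are hypothetical placeholders.
import random
from PIL import Image, ImageDraw, ImageFont

CHARSET = "0123456789ABCDEF"                 # hypothetical character pool
FONTS = ["simhei.ttf", "simsun.ttc"]         # hypothetical font files

def synthesize_osd(frame: Image.Image):
    draw = ImageDraw.Draw(frame)
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(16, 48))
    text = "".join(random.choices(CHARSET, k=random.randint(4, 12)))
    x = random.randint(0, frame.width // 2)
    y = random.randint(0, max(frame.height - 60, 1))
    color = random.choice([(255, 255, 255), (0, 0, 0)])   # mostly pure colours
    draw.text((x, y), text, font=font, fill=color)
    left, top, right, bottom = draw.textbbox((x, y), text, font=font)
    h, w = bottom - top, right - left
    label = (left + w / 2, top + h / 2, h, w)  # (x_center, y_center, h, w)
    return frame, text, label
```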
A common superposition method in the prior art is the single character background averaging method, which works as follows:
let the character height be h, the character width be w, the pixel value of the character region before superimposition be denoted as P, and the pixel value of the character region after superimposition be denoted as X. The background height is H, the background width is W, the pixel value of the background region before superimposition is denoted as Q, and the pixel value of the background region after superimposition is denoted as Y.
Before the character is superimposed on the background, the background is an image area, and the average value of pixels of the background image is calculated as:
Figure BDA0003617229840000071
the formula q is the value of each pixel point, and n is the total number of pixels, where n is H × W.
If Q_mean < T0, then X = 255. If Q_mean > T1, then X = 0. Otherwise the value of X is determined by a Bernoulli distribution with parameter p = 0.5: in one Bernoulli trial with success count r and failure count 1 − r, P(r) is computed as
P(r) = p^r · (1 − p)^(1 − r)
with p = 0.5; if P(r) is greater than 0.5, X is 255, otherwise X is 0.
T0 and T1 are thresholds derived from a Gaussian distribution with mean 0 and variance 1. Specifically, the Gaussian probability
P(x) = (1/√(2π)) · e^(−x²/2)
is computed for an x randomly sampled in [−1, 1] (the 68.27% confidence interval of the normal distribution), and T0 = T · P(x) and T1 = 255 − T0, where T is a hard threshold, set to 100 in this example.
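The following sketch implements the thresholding rule just described for a single block. NumPy is an assumption, as is packaging the rule as a per-block function; the patent describes the rule in prose only.

```python
# Sketch of the single character background averaging rule: decide the
# overlaid character pixel value X for one background block. numpy is assumed.
import numpy as np

def character_pixel_value(background_block: np.ndarray, T: int = 100) -> int:
    q_mean = background_block.mean()                    # Q_mean over the block
    x = np.random.uniform(-1.0, 1.0)                    # sample within [-1, 1]
    p_x = np.exp(-x * x / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal density
    t0 = T * p_x
    t1 = 255 - t0
    if q_mean < t0:
        return 255          # dark background -> white character pixel
    if q_mean > t1:
        return 0            # bright background -> black character pixel
    return 255 if np.random.rand() < 0.5 else 0         # fair Bernoulli draw
```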
In some cases the character background varies too much, and characters superimposed with the single character background averaging method may have local pixels that blend into the background, making the characters in the OSD video image unclear. This embodiment therefore uses an adaptive character pixel stacking technique: assuming the background region has width W and height H, both the background region and the character region are divided into blocks of k1 pixels in height and k2 pixels in width, denoted blocks, and the single character background averaging method is applied to each block to superimpose the characters. After one block area is computed, the next block is processed, until all partitioned block areas are done. Here k1 ranges over [3, H] and k2 over [3, W]. For example, in fig. 3 an image A of size H × W is selected as the background of the synthesized OSD video image and divided into 5 × 5 blocks, and the single character background averaging method is applied to each block. For block Q11, the corresponding position in the character image is a non-character region, so the background image block Q11 can be copied directly into the synthesized OSD character image. In background block Q24, the corresponding position in the character image does contain a character region; the pixels of background block Q24 are 255 (white) and the pixels of the character region are also 255 (white), so according to the single character background averaging method, the character region of that block is changed with a certain probability to pixel value 0 (black) in the synthesized OSD character image.
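A block-wise overlay consistent with this description might look as follows. The boolean character mask and the reuse of the per-block function from the previous sketch are assumptions.

```python
# Sketch of adaptive character pixel stacking: iterate k1 x k2 blocks and apply
# the single character background averaging rule only where characters exist.
# Assumes character_pixel_value() from the previous sketch and a boolean
# char_mask aligned with the background image.
import numpy as np

def overlay_characters(background: np.ndarray, char_mask: np.ndarray,
                       k1: int = 8, k2: int = 8) -> np.ndarray:
    H, W = background.shape[:2]
    out = background.copy()
    for top in range(0, H, k1):
        for left in range(0, W, k2):
            block = (slice(top, min(top + k1, H)), slice(left, min(left + k2, W)))
            if not char_mask[block].any():
                continue                      # non-character block: keep background
            X = character_pixel_value(background[block])
            out[block][char_mask[block]] = X  # overwrite only character pixels
    return out
```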
After the OSD characters are obtained by the adaptive character pixel stacking technique, the edges of the OSD characters are filtered so that the background and the font edges blend into each other. Specifically, a 3 × 3 filtering kernel is moved along the boundary between the character region and the background region in the OSD video image; the mean of the pixels in the area under the kernel is computed, and the whole kernel area is filled with that mean. After one position is processed, the kernel is moved to the next position with a step of 1 pixel, and the computation is repeated until the boundary between every character region and the background region has been processed.
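A direct transcription of this edge-blending step is sketched below; representing the boundary pixels as a coordinate list is an assumption.

```python
# Sketch of the edge-blending step: slide a 3x3 window along the
# character/background boundary with stride 1 and fill each window with its
# own pixel mean so that font edges fuse into the background.
import numpy as np

def blend_edges(image: np.ndarray, boundary_points) -> np.ndarray:
    """boundary_points: iterable of (row, col) pixels on the character edge."""
    out = image.astype(np.float32)
    H, W = image.shape[:2]
    for r, c in boundary_points:
        win = (slice(max(r - 1, 0), min(r + 2, H)),
               slice(max(c - 1, 0), min(c + 2, W)))
        out[win] = out[win].mean()       # fill the whole window with the mean
    return out.astype(image.dtype)
```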
The character position detection module is trained as follows: the synthesized data is augmented and fed to the character position detection module to obtain predicted character image block positions; these are compared with the positions of the actually synthesized character image blocks, and the loss function is computed to obtain an error gradient. The parameters of the whole character position detection module are then updated by back-propagation, until the error between the predicted and actual character image block positions reaches a set threshold, at which point training stops.
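Condensed into code, this training procedure might look like the sketch below. PyTorch, the optimizer, and the placeholder names `detector`, `box_loss` and `loader` are assumptions; the patent does not name a framework or a specific loss.

```python
# Condensed sketch of the detector training loop described above.
# PyTorch is an assumption; `detector`, `box_loss` and `loader` are placeholders.
import torch

def train_position_detector(detector, loader, box_loss, epochs=50, tol=1e-3):
    opt = torch.optim.Adam(detector.parameters(), lr=1e-4)
    for _ in range(epochs):
        for images, gt_boxes in loader:    # augmented synthetic images + labels
            pred_boxes = detector(images)  # predicted (xc, yc, h, w) per block
            loss = box_loss(pred_boxes, gt_boxes)
            opt.zero_grad()
            loss.backward()                # back-propagate the error gradient
            opt.step()
            if loss.item() < tol:          # stop once the error is small enough
                return detector
    return detector
```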
S3, the character position information and the OSD video image are input simultaneously into the character position cutting module to obtain a cut image; the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths. Specifically, M characters are randomly selected from all the OSD characters in the character image block; their widths are summed to obtain the abscissa offset, and the height of the tallest of the M characters is taken as the ordinate offset;
The character position cutting module cuts the OSD video image according to the position information of the character image block obtained in step S2 and finally outputs the cut image. The cutting method is an amplified random cropping method, intended to guide the character content recognition module to attend appropriately to changes in character position.
Referring to fig. 2, in this embodiment the center coordinates of the multi-character image block position information are randomly shifted by M character lengths, where M is 3, 5 or 8 in the implementation cases, and the width and height of the multi-character image block position information are enlarged or reduced accordingly, so that the cut character image block contains the complete character information plus a blank area of random size. Specifically, if the original center coordinates in the multi-character image block position information are (x_center, y_center), then after a random shift of M character lengths the center becomes (x_center ± α, y_center ± β), where α is the abscissa offset and β the ordinate offset. Correspondingly, the width in the multi-character image block position information is adjusted to w + 2|α| + Δ and the height to h + 2|β| + Δ, where Δ is a redundancy of random size ensuring that the cut character image block contains complete character information with blank areas. That is, the multi-character image block position information is originally (x_center, y_center, h, w), and the position cutting model adjusts it to (x_center ± α, y_center ± β, h + 2|β| + Δ, w + 2|α| + Δ) before cutting.
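The amplified random cropping rule translates into code along the following lines; the per-character size list and the bound on Δ are illustrative assumptions.

```python
# Sketch of amplified random cropping: shift the block centre by the widths of
# M randomly chosen characters and pad width/height by 2*offset plus a random
# redundancy delta. `char_sizes` and `max_delta` are illustrative placeholders.
import random

def augmented_crop_box(box, char_sizes, M=3, max_delta=10):
    """box = (x_center, y_center, h, w); char_sizes = [(w_i, h_i), ...]."""
    xc, yc, h, w = box
    chosen = random.sample(char_sizes, k=min(M, len(char_sizes)))
    alpha = sum(cw for cw, _ in chosen)      # abscissa offset: sum of M widths
    beta = max(ch for _, ch in chosen)       # ordinate offset: tallest of the M
    delta = random.randint(0, max_delta)     # random-size redundancy
    xc += random.choice([-1, 1]) * alpha
    yc += random.choice([-1, 1]) * beta
    return (xc, yc, h + 2 * beta + delta, w + 2 * alpha + delta)
```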
S4, inputting the cut image to a character content recognition module, and outputting recognized character string information;
The character content recognition module of this embodiment refers to a model designed with deep learning and neural networks. Such a network generally consists of CNN (Convolutional Neural Network) layers, an RNN (Recurrent Neural Network) or its variant LSTM (Long Short-Term Memory), and a self-attention layer, and the training loss function is usually CTC loss or a classification loss. Networks of this kind are represented by CRNN (Convolutional Recurrent Neural Network).
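For orientation, a minimal CRNN-style recognizer has the following shape. The layer widths, the pooling scheme, and the class count are assumptions; only the CNN + recurrent + CTC structure follows the description above.

```python
# Compact sketch of a CRNN-style recogniser: CNN features, a bidirectional
# LSTM over the width axis, and per-timestep logits for CTC training.
# Channel widths and pooling are illustrative, not the patent's architecture.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes, channels=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))   # collapse height, keep width
        self.rnn = nn.LSTM(channels, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)  # per-timestep character logits

    def forward(self, x):                      # x: (N, 3, H, W)
        f = self.cnn(x).squeeze(2).permute(0, 2, 1)   # (N, W, C) sequence
        seq, _ = self.rnn(f)
        return self.fc(seq)                    # feed to nn.CTCLoss in training
```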
The invention optimizes the character content recognition module with a position supervision mechanism, specifically designing a two-stage position attention module as shown in figure 4. The module can be embedded directly into a character recognition network, such as a CRNN network. Its calculation flow is as follows:
(1) CNN features 101 are obtained from the deep neural network and delivered to the downsampling module 102. In the implementation case, the CNN features are middle-layer features of a VGG13-based CRNN. It should be understood that the CNN features of the present invention are not limited to a specific network type or a specific layer: after suitable network transformation and replacement, any module whose output has a matrix format of C × H × W (C the number of channels, H the feature map height, W the feature map width) should be considered to produce CNN features. For example, in a Transformer network, the data output after a self-attention layer can be rearranged into a C × H × W matrix, and such features are also CNN features.
(2) The downsampling module 102 applies convolution, batch normalization, activation, pooling, downsampling and similar operations to the CNN feature map output by node 101 to obtain a feature map with one channel. In this embodiment, the final downsampling ratio is 0.25, meaning the length and width of the feature map output by 101 are each reduced by a factor of 4; this feature map then undergoes supervised learning against the multi-character image block position information. The supervision method computes the two-dimensional Gaussian probability distribution heatmap corresponding to the multi-character image block position information (x_center, y_center, h, w) in the image input to the network, and supervises the 2-D feature map output by 102 against it, with Focal Loss as the loss function. It should be understood that the 4× reduction of this embodiment is not a limitation, and that here the downsampling module means a module that may include convolution, batch normalization, activation, pooling, downsampling and similar operations.
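The Gaussian heatmap target used for this supervision can be generated as in the sketch below; the mapping of (h, w) to the Gaussian spread is an assumption, since the patent does not specify it.

```python
# Sketch of the 2-D Gaussian heatmap target for position supervision: a
# Gaussian centred on the block centre in downsampled feature-map coordinates.
# The spread (sx, sy) derived from (w, h) is an assumption.
import numpy as np

def gaussian_heatmap(shape, box, stride=4):
    """shape: (H, W) of the downsampled map; box: (xc, yc, h, w) in pixels."""
    H, W = shape
    xc, yc, h, w = (v / stride for v in box)    # map to feature-map coordinates
    ys, xs = np.mgrid[0:H, 0:W]
    sx, sy = max(w / 2, 1e-3), max(h / 2, 1e-3)
    return np.exp(-(((xs - xc) / sx) ** 2 + ((ys - yc) / sy) ** 2) / 2.0)
```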
(3) The output of 102 is input to the spatial attention module 103, which consists of convolution, pooling and normalization layers; it focuses attention on the feature map output by the downsampling module 102 and improves the position accuracy of single characters.
In one embodiment, the spatial attention module is a CBAM (Convolutional Block Attention Module) with the channel attention branch removed (see fig. 5).
(4) The upsampling module 104 obtains a feature map with one channel through convolution, batch normalization, activation, pooling and upsampling. In this embodiment, the final upsampling ratio is 4, matching the downsampling ratio so that the size of feature map 101 and the input size of multiplier 105 remain consistent; concretely, the length and width of the feature map output by 103 are each enlarged by a factor of 4, and the result undergoes supervised learning against single-character image block position information. The supervision method computes the two-dimensional Gaussian probability distribution heatmap corresponding to the single-character image block position information (x_center, y_center, h, w) in the image input to the network and supervises the 2-D feature map output after 103 against it, with Focal Loss as the loss function. The 4× upsampling of this embodiment is not a limitation, and here the upsampling module means a module that may include convolution, batch normalization, activation, pooling, upsampling and similar operations.
(5) The multiplier 105 multiplies the output of 104 with the CNN feature map 101 so that the network can focus on the position features of the characters, finally outputting the CNN feature map 106. The matrix of CNN feature map 106 and the matrix of CNN feature map 101 have the same number of dimensions and the same shape.
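Putting steps (1)–(5) together, a minimal rendering of the two-stage position attention module might look as follows. The layer widths, the 7×7 spatial-attention convolution, and the placement of the sigmoid are assumptions; only the dataflow 101 → 102 → 103 → 104 → 105 → 106 follows the description. During training, the intermediate one-channel maps would additionally be supervised by the Gaussian heatmaps with Focal Loss.

```python
# Minimal PyTorch sketch of the two-stage position attention module of Fig. 4.
# Only the dataflow follows the patent; concrete layers are assumptions.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):          # CBAM without channel attention
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.conv(s))

class PositionAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Sequential(           # 102: conv/BN/ReLU + 4x downsample
            nn.Conv2d(channels, 1, 3, padding=1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.MaxPool2d(4))
        self.attn = SpatialAttention()       # 103: spatial attention
        self.up = nn.Sequential(             # 104: 4x upsample back to 101's size
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, feat101):
        heat_multi = self.down(feat101)               # supervised: multi-char heatmap
        heat_single = self.up(self.attn(heat_multi))  # supervised: single-char heatmap
        return feat101 * heat_single                  # 105: multiplier -> map 106
```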
The OSD recognition method of this embodiment effectively improves OSD recognition accuracy in different scenes; the results are shown in Table 1.
[Table 1: OSD recognition accuracy in different scenes; the table image is not reproduced here.]
Because the position information of the OSD characters is used while recognizing the characters, this embodiment accelerates model convergence and improves recognition accuracy. In the prior art, an auxiliary segmentation network is trained jointly with the recognition network to inject position information from outside; the segmentation network is removed after training and the recognition network is then used on its own. This embodiment instead integrates the position information directly into the recognition network through a two-stage position attention mechanism, guiding the recognition network to attend appropriately to character positions during recognition, so that the network architecture is identical in the training and deployment stages, the network learns and memorizes character position information to the greatest extent, and the resulting recognition network performs better with higher accuracy. Compared with the external approach of using an additional auxiliary segmentation network to guide attention to position, the method of this embodiment also yields a smaller model and consumes fewer training resources. The OSD synthesis method can realistically synthesize, at large scale, the OSD patterns found in real scenes, and can to some extent simulate the blurred character edges that appear in OSD images due to information degradation during signal acquisition.
Example two
Referring to fig. 6, the present embodiment discloses a video OSD character recognition apparatus, which includes the following units:
the OSD video acquisition unit is used for acquiring an OSD video image;
the character position acquisition unit is used for inputting the OSD video image into the character position detection module to obtain character position information;
the character position detection module of the present embodiment is constructed based on a convolutional neural network, which needs to be trained.
The training process of the character position detection module of this embodiment is as follows:
In the training stage, a training data set is first synthesized at large scale, as follows:
In a video image, one or more character image blocks are randomly selected; the size of each character image block is random but must be large enough to completely accommodate several characters.
A character image block in this embodiment is an image block composed of one or more characters; in a block composed of multiple characters there is no obvious space between the characters, i.e., the characters are contiguous.
The character image blocks are typically located at the bottom of the video image, for example as commonly used subtitles, though they may also be located elsewhere in the video image.
For each selected character image block, the OSD technique is used to fill it with several randomly chosen characters so that the block is exactly filled. All character image blocks are filled in this way until done, yielding the filled OSD video image, the character codes, and the position information of the character image blocks, expressed as (x_center, y_center, h, w).
The OSD technique fills characters into the video image one character at a time. If a selected character image block is filled with multiple characters, its position information is referred to as the multi-character image block position information; within such a block, each character also has its own corresponding single-character image block position information.
In the process of filling characters into images by using the OSD technology, the colors, the font sizes and the fonts of the characters are all randomly selected.
The data synthesis tool in this embodiment is Python; data synthesis is performed by superimposing a font image on the video image.
A common superposition method in the prior art is the single character background averaging method, which works as follows:
Let the character height be h and the character width be w; denote the pixel value of the character region before superimposition as P and after superimposition as X. Let the background height be H and the background width be W; denote the pixel value of the background region before superimposition as Q and after superimposition as Y.
Before the character is superimposed, the background is a plain image area, and the mean pixel value of the background image is computed as
Q_mean = (1/n) · Σ q,
where q is the value of each pixel and n is the total number of pixels, n = H × W.
If Q_mean < T0, then X = 255. If Q_mean > T1, then X = 0. Otherwise the value of X is determined by a Bernoulli distribution with parameter p = 0.5: in one Bernoulli trial with success count r and failure count 1 − r, P(r) is computed as
P(r) = p^r · (1 − p)^(1 − r)
with p = 0.5; if P(r) is greater than 0.5, X is 255, otherwise X is 0.
T0 and T1 are thresholds derived from a Gaussian distribution with mean 0 and variance 1. Specifically, the Gaussian probability
P(x) = (1/√(2π)) · e^(−x²/2)
is computed for an x randomly sampled in [−1, 1] (the 68.27% confidence interval of the normal distribution), and T0 = T · P(x) and T1 = 255 − T0, where T is a hard threshold, set to 100 in this example.
In some cases the character background varies too much, and characters superimposed with the single character background averaging method may have local pixels that blend into the background, making the characters in the OSD video image unclear. This embodiment therefore uses an adaptive character pixel stacking technique: assuming the background region has width W and height H, both the background region and the character region are divided into blocks of k1 pixels in height and k2 pixels in width, denoted blocks, and the single character background averaging method is applied to each block to superimpose the characters. After one block area is computed, the next block is processed, until all partitioned block areas are done. Here k1 ranges over [3, H] and k2 over [3, W]. For example, in fig. 3 an image A of size H × W is selected as the background of the synthesized OSD video image and divided into 5 × 5 blocks, and the single character background averaging method is applied to each block. For block Q11, the corresponding position in the character image is a non-character region, so the background image block Q11 can be copied directly into the synthesized OSD character image. In background block Q24, the corresponding position in the character image does contain a character region; the pixels of background block Q24 are 255 (white) and the pixels of the character region are also 255 (white), so according to the single character background averaging method, the character region of that block is changed with a certain probability to pixel value 0 (black) in the synthesized OSD character image.
After the OSD characters are obtained by the adaptive character pixel stacking technique, the edges of the OSD characters are filtered so that the background and the font edges blend into each other. Specifically, a 3 × 3 filtering kernel is moved along the boundary between the character region and the background region in the OSD video image; the mean of the pixels in the area under the kernel is computed, and the whole kernel area is filled with that mean. After one position is processed, the kernel is moved to the next position with a step of 1 pixel, and the computation is repeated until the boundary between every character region and the background region has been processed.
The character cutting unit is used for simultaneously inputting the character position information and the OSD video image to a character position cutting module to obtain a cut image; the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths;
The character position cutting module cuts the OSD video image according to the position information of the character image block obtained by the character position acquisition unit and finally outputs the cut image. The cutting method is an amplified random cropping method, intended to guide the character content recognition module to attend appropriately to changes in character position.
Referring to fig. 2, in this embodiment the center coordinates of the multi-character image block position information are randomly shifted by M character lengths, where M is 3, 5 or 8 in the embodiment, and the width and height of the multi-character image block position information are enlarged or reduced accordingly, the width being adjusted to w + 2|α| + Δ and the height to h + 2|β| + Δ, so that the cut character image block contains the complete character information plus a blank area of random-size redundancy.
And the character recognition unit is used for inputting the cut image to a character content recognition module and outputting recognized character string information.
The character content recognition module comprises: a first CNN feature map (101), a down-sampling module (102), a spatial attention module (103), an up-sampling module (104), a multiplier (105) and a second CNN feature map (106), where the multiplier (105) multiplies the output of the up-sampling module (104) with the first CNN feature map (101).
The character recognition unit also inputs the second CNN characteristic diagram (106) into a CRNN network for character recognition.
Because the position information of the OSD characters is used while recognizing the characters, this embodiment accelerates model convergence and improves recognition accuracy. In the prior art, an auxiliary segmentation network is trained jointly with the recognition network to inject position information from outside; the segmentation network is removed after training and the recognition network is then used on its own. This embodiment instead integrates the position information directly into the recognition network through a two-stage position attention mechanism, guiding the recognition network to attend appropriately to character positions during recognition, so that the network architecture is identical in the training and deployment stages, the network learns and memorizes character position information to the greatest extent, and the resulting recognition network performs better with higher accuracy. Compared with the external approach of using an additional auxiliary segmentation network to guide attention to position, the method of this embodiment also yields a smaller model and consumes fewer training resources. The OSD synthesis method can realistically synthesize, at large scale, the OSD patterns found in real scenes, and can to some extent simulate the blurred character edges that appear in OSD images due to information degradation during signal acquisition.
EXAMPLE III
Referring to fig. 7, fig. 7 is a schematic structural diagram of a video OSD character recognition apparatus of the present embodiment. The OSD character recognition device 20 of this embodiment includes a processor 21, a memory 22, and a computer program stored in the memory 22 and executable on the processor 21. The processor 21 realizes the steps in the above-described method embodiments when executing the computer program. Alternatively, the processor 21 implements the functions of the modules/units in the above-described device embodiments when executing the computer program.
Illustratively, the computer program may be partitioned into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the OSD character recognition device 20. For example, the computer program may be divided into the modules in the second embodiment, and for the specific functions of the modules, reference is made to the working process of the apparatus in the foregoing embodiment, which is not described herein again.
The video OSD character recognition apparatus 20 may include, but is not limited to, a processor 21 and a memory 22. Those skilled in the art will appreciate that the schematic diagram is merely an example of the video OSD character recognition device 20 and does not constitute a limitation of it; the device may include more or fewer components than shown, combine some components, or use different components. For example, the video OSD character recognition device 20 may further include input-output devices, network access devices, buses, etc.
The processor 21 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor. The processor 21 is the control center of the video OSD character recognition device 20 and connects the various parts of the entire device using various interfaces and lines.
The memory 22 may be used to store the computer programs and/or modules, and the processor 21 implements the various functions of the OSD character recognition device 20 by running or executing the computer programs and/or modules stored in the memory 22 and calling the data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to use of the device (such as audio data, a phonebook, etc.). In addition, the memory 22 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
Wherein, the integrated modules/units of the OSD character recognition device 20 may be stored in a computer readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and used by the processor 21 to implement the steps of the above embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer-readable medium may contain suitable additions or subtractions depending on the requirements of legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunication signals in accordance with legislation and patent practice.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement without inventive effort.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A video OSD character recognition method comprises the following steps:
s1, obtaining an OSD video image;
s2, inputting the OSD video image to the character position detection module to obtain character position information;
s3, inputting the character position information and the OSD video image to a character position cutting module simultaneously to obtain a cut image; the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths;
and S4, inputting the cut image to a character content recognition module, and outputting the recognized character string information.
2. The method of claim 1, wherein the character content recognition module has a specific structure: the system comprises a first CNN feature map (101), a down-sampling module (102), a spatial attention module (103), an up-sampling module (104), a multiplier (105) and a second CNN feature map (106), wherein the multiplier (105) multiplies the output from the up-sampling module (104) and the CNN feature map of the first CNN feature map (101).
3. The method of claim 2, wherein the second CNN profile (106) is input to a CRNN network for character recognition.
4. The method of claim 1, wherein the training data of the character position detection module is generated as follows:
and simultaneously dividing the background area and the character area into blocks according to pixels with the height of k1 and the width of k2, recording the blocks as blocks, and superposing characters by applying a single character background averaging method aiming at each block to obtain an OSD character image of synthesized characters.
5. The method of claim 4, edge filtering an OSD character image of the synthesized character.
6. A video OSD character recognition device comprising the following elements:
the OSD video acquisition unit is used for acquiring an OSD video image;
the character position acquisition unit is used for inputting the OSD video image into the character position detection module to obtain character position information;
the character cutting unit is used for simultaneously inputting the character position information and the OSD video image to a character position cutting module to obtain a cut image; the character position cutting module randomly shifts the center coordinates of the character image blocks in the OSD video image by M character lengths;
and the character recognition unit is used for inputting the cut image to a character content recognition module and outputting recognized character string information.
7. The apparatus of claim 6, wherein the character content recognition module has a specific structure: the system comprises a first CNN feature map (101), a down-sampling module (102), a spatial attention module (103), an up-sampling module (104), a multiplier (105) and a second CNN feature map (106), wherein the multiplier (105) multiplies the output from the up-sampling module (104) and the CNN feature map of the first CNN feature map (101).
8. The apparatus of claim 7, the character recognition unit further comprising inputting the second CNN profile (106) into a CRNN network for character recognition.
9. The apparatus of claim 6, wherein the training data of the character position detection module is generated as follows:
and simultaneously dividing the background area and the character area into blocks according to pixels with the height of k1 and the width of k2, recording the blocks as blocks, and superposing the characters by applying a single character background averaging method aiming at each block to obtain an OSD character image of the synthesized characters.
10. A non-volatile memory having stored thereon instructions that, when executed by a processor, are operable to implement the OSD character recognition method of any one of claims 1-5.
CN202210446762.9A 2022-04-26 2022-04-26 Video OSD character recognition method, device and medium Pending CN114782941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210446762.9A CN114782941A (en) 2022-04-26 2022-04-26 Video OSD character recognition method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210446762.9A CN114782941A (en) 2022-04-26 2022-04-26 Video OSD character recognition method, device and medium

Publications (1)

Publication Number Publication Date
CN114782941A (en) 2022-07-22

Family

ID=82432862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210446762.9A Pending CN114782941A (en) 2022-04-26 2022-04-26 Video OSD character recognition method, device and medium

Country Status (1)

Country Link
CN (1) CN114782941A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination