CN111899202A

CN111899202A - Method for enhancing superimposed time characters in video image

Info

Publication number: CN111899202A
Application number: CN202010422327.3A
Authority: CN
Inventors: 聂晖; 杨小波; 李军
Original assignee: Wuhan Eastwit Technology Co ltd
Current assignee: Wuhan Eastwit Technology Co ltd
Priority date: 2020-05-19
Filing date: 2020-05-19
Publication date: 2020-11-06
Anticipated expiration: 2040-05-19
Also published as: CN111899202B

Abstract

The invention belongs to the field of computer vision, and particularly relates to a method for identifying and enhancing time annotation information of a video image. The invention comprises the following steps: training UNet (an image segmentation neural network) to realize a pixel-level time character extraction model in an image; and performing graying suppression on the original image background on the image to be detected by means of a time character extraction model so as to enhance the identifiability of the identified time characters. The invention aims at the character characteristics in the natural scene monitoring image, realizes a time marking information enhancement method, and overcomes the difficult problem to be solved when identifying the 'substrate-free' superposition time characters of the video image. The invention focuses on the separation-inhibition processing of the superposed characters and the image background, and is an image enhancement technology with great application value in the field of scene character recognition.

Description

Method for enhancing superimposed time characters in video image

Technical Field

The invention belongs to the field of computer vision, and is suitable for detecting the overlapping time characters in pictures of video monitoring systems in public security and related industries. In particular to a method for identifying and enhancing time marking information of video images.

Background

With the development of social security management, the identification of time marking information in massive video monitoring images has obvious and special application value for the technical investigation work of the public security industry, and is also one of the assessment contents of the public security department on the operation and maintenance work of national video image networking application platforms.

According to the implementation requirements of GA/T751-2008 video image text annotation specification, the time characters superposed in the natural scene image cannot use the 'substrate' picture block to cover the background. The characters are overlaid on the random scene image monitored outdoors, the background between the stroke gaps of the single characters and the adjacent character intervals is kept visible, the interference of random illumination distribution, background trivial objects and the like is easily caused, and the time character recognition is difficult.

Disclosure of Invention

The invention aims to solve the technical problem of providing a character enhancement technical scheme aiming at 'substrate-free' superposed time characters in a natural scene image, and overcoming the difficulty of identifying video image time annotation information in the prior art.

In order to solve the technical problem, the basic technical idea of the invention is to train UNet (an image segmentation neural network) to realize a pixel-level time character extraction model in an image; and performing graying suppression on the original image background on the image to be detected by means of a time character extraction model so as to enhance the identifiability of the identified time characters.

Therefore, the invention provides an enhancement method for superimposing time characters in a video image, which comprises the following steps:

step i, generating UNet training samples in a customized batch mode;

step ii, using an extraction model of time character pixels in the UNet training image;

and iii, suppressing the background of the image to be detected based on the mask obtained by the time character extraction model.

Preferably, the step i, the step of customizing the UNet batch training samples, includes:

1-1) taking batch random video images as a background, drawing time characters with black and white colors and various fonts, and superposing 'no substrate' on the time characters to be used as training input samples;

1-2) taking a black image with the same size as a background, and superposing time characters with white color and the same other contents and characteristics on the same coordinate position as an input sample to serve as one-to-one corresponding extraction target sample;

preferably, the step ii of training the extraction model of the temporal character pixel in the image by using UNet specifically includes:

2-1) setting feature extraction convolution network structure

M sets of 'convolution + pooled downsampling', where each set is convolved with N layers and contains BatchNormal and ReLU operations;

adjusting the number of channels by using a layer of single-layer convolution after the M groups of pooling so as to match with subsequent up-sampling;

k groups of 'upsampling + convolution', where each group of convolutions has L layers;

the output matrix of each layer of up-sampling is connected to the output matrix of the corresponding down-sampling convolution layer in sequence;

after K sets of upsampling, the number of channels is reduced to 1 by using a layer of single-layer convolution for outputting final characteristics.

2-2) defining training parameters and outputting a segmentation model

Convolutional layer configuration, outputting channel number maps, convolutional kernel size k, stride is s, padding is p;

pooling and upsampling configuration, the size of a sliding Window is Window, stride is s, padding is p;

the length and width of the original input image must be integral multiples of a;

using a sigmoid activation function for the finally output prediction matrix, adjusting the value of the prediction matrix to be in the range of 0 to 1, and when a certain characteristic value is greater than a threshold value S, indicating that the position is a character pixel;

preferably, in step iii, the specific step of suppressing the background of the image to be detected based on the mask obtained by the time character extraction model includes:

3-1) segmenting time characters and image backgrounds in the image to be detected by using a trained UNet time character extraction model to obtain a time character instance segmentation mask;

3-2) performing graying treatment on a background area of non-time character strokes (pixels) in the original image to be detected by a linear attenuation algorithm according to the mask.

Therefore, training of the time character extraction model is completed, the time characters and the background area in the image to be detected can be subjected to example segmentation, the time character identification degree is enhanced, and the technical purpose of the invention is achieved.

The beneficial effects of the invention include:

1) the method aims at the character characteristics in the natural scene monitoring image, realizes the time marking information enhancement method, and overcomes the difficult problem to be solved when identifying the 'substrate-free' superposition time characters of the video image.

2) The invention focuses on the separation-inhibition processing of the superposed characters and the image background, and is an image enhancement technology with great application value in the field of scene character recognition.

Drawings

The technical solution of the present invention will be further specifically described with reference to the accompanying drawings and the detailed description.

FIG. 1 is a basic flow diagram of the process of the present invention;

FIG. 2 is an example diagram of training input samples and extracted target samples of the UNet network extraction model;

FIG. 3 is an example of an original image of a time character in an original image to be detected by a UNet network extraction model;

FIG. 4 is an example of a mask diagram of a UNet network extraction model for time characters in an original image to be detected;

fig. 5 is an example of a mask-based background suppression map of the UNet network extraction model for the time characters in the original to-be-detected image.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, the present invention provides an overall flowchart of an enhancement method for superimposing time characters in a video image, which mainly comprises the following steps:

step i, customizing to generate UNet batch training samples;

Step i comprises the following subdivision steps:

1-2) taking batch random video images as a background, drawing time characters with black and white colors and various fonts, and superposing 'no substrate' on the time characters to be used as training input samples;

Ω_i＝{0,1，...，K}

drawing K characters, where Ω_iRepresents the pixel range to be drawn by the ith character (only including the stroke pixels of the character), omega₀Representing background pixels other than character pixels.

I[x,y]Representing a pixel [ x, y ] in an image]RGB average luminance value of [ omega ]_iRepresents the number of background pixels covered by the ith character, and d (i) represents the average brightness of the background pixels covered by the ith character.

The rendering function for character generation is as follows:

using RGBA encoding, [0,0,0,0] represents a colorless transparent pixel, [0,0,0,1] represents a black pixel, and [1,1,1,1] represents a white pixel.

f'＝(1.0-α)ob+f

And f and a background image b are mixed, wherein alpha represents an image formed by transparency channels in f, and o in the formula represents a matrix operator, and the matrix operator represents that two matrixes with the same size are subjected to element-by-element multiplication.

t＝α

and alpha is the white character image with the median value of 1 in the transparency channel in the previous operation, and is directly used as the extraction target sample t.

As shown in fig. 2, the upper and lower portions represent the input sample and the target sample, respectively.

Step ii comprises the following subdivision steps:

2-1) setting feature extraction convolution network structure

in this example, M is 4 and N is 2.

in this example, K is 4 and L is 2.

2-2) defining training parameters and outputting a segmentation model

in this embodiment, the maps start with 64 during downsampling, and end with 64 during upsampling; k. s and p take the values of 3 x 3, 1 and 1 respectively.

in this embodiment, Window, s, and p take values of 2 × 2, and 0, respectively.

in this example, a is 16.

in this embodiment, the threshold S is 0.5, and x is a prediction matrix output by the UNet network, and its value is within a real number range.

Step iii comprises the following subdivision steps:

let n be the original image and m be the resulting example split mask.

If m [ i, j ] ═ 1 (white) indicates that n [ i, j ] is a character pixel, and conversely, it is a background pixel.

3-2) performing graying treatment on a background area of non-time character strokes (pixels) in the original image to be detected by using a linear attenuation algorithm according to the mask shown in the figure 4;

wherein n' represents the image after the background attenuation, and o represents a matrix operator in the formula, which represents that two matrixes with the same size are subjected to element-by-element multiplication.

The attenuation control coefficient k > is 0, and the larger k is, the more obvious the attenuation effect is;

in this example, k is 8.

Finally, the original image is suppressed based on the time character mask, resulting in the output shown in FIG. 5.

Therefore, training of the time character extraction model is completed, the time characters and the background area in the image to be detected can be subjected to example segmentation, the time character identification degree is enhanced, and the technical scheme of the invention is realized.

It will be clear to those skilled in the art that the specific values of the above parameters or thresholds can be adjusted according to the strictness of the sample training method and the specification implementation, and do not limit the present invention.

Finally, it should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and not intended to limit the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications and equivalents can be made in the technical solutions described in the foregoing embodiments, or some technical features of the present invention may be substituted. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An enhancement method for superimposing a temporal character in a video image, comprising the steps of:

step i, generating UNet training samples in a customized batch mode;

2. The method according to claim 1, wherein the step i, the step of customizing the UNet batch training samples, comprises:

1-2) taking a black image with the same size as a background, and superposing time characters with white color and the same other contents and characteristics on the same coordinate position as the input sample to be used as one-to-one corresponding extraction target sample.

3. An enhancement method for superimposing temporal characters in video images according to claim 1, wherein the step ii of using UNet to train the extraction model of temporal character pixels in images comprises:

2-1) setting feature extraction convolution network structure

after the K groups of up-sampling, reducing the number of channels to 1 by using a layer of single-layer convolution for outputting final characteristics;

2-2) defining training parameters and outputting a segmentation model

and (3) using a sigmoid activation function for the finally output prediction matrix, adjusting the value of the prediction matrix to be in the range of 0 to 1, and when a certain characteristic value is greater than a threshold value S, indicating that the position is a character pixel.

4. The method for enhancing temporal character superposition in video images according to claim 1, wherein in the step iii, the specific step of suppressing the background of the image to be detected based on the mask obtained by the temporal character extraction model comprises: