CN115601688A - Video main content detection method and system based on deep learning - Google Patents

Video main content detection method and system based on deep learning

Info

Publication number
CN115601688A
Authority
CN
China
Prior art keywords
image
initial
scale
target
degree
Prior art date
Legal status
Granted
Application number
CN202211609427.2A
Other languages
Chinese (zh)
Other versions
CN115601688B (en)
Inventor
罗鑫凯
王新勇
杨笑
丁振
杨柳
高明亮
高天鸣
Current Assignee
Chinese Translation Entertainment Technology Qingdao Co ltd
Original Assignee
Chinese Translation Entertainment Technology Qingdao Co ltd
Priority date
Filing date
Publication date
Application filed by Chinese Translation Entertainment Technology Qingdao Co ltd filed Critical Chinese Translation Entertainment Technology Qingdao Co ltd
Priority to CN202211609427.2A priority Critical patent/CN115601688B/en
Publication of CN115601688A publication Critical patent/CN115601688A/en
Application granted granted Critical
Publication of CN115601688B publication Critical patent/CN115601688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention relates to the technical field of image processing, and provides a video main content detection method and system based on deep learning, comprising the following steps: acquiring video data; carrying out size adjustment on different targets in the video frame images to obtain a plurality of first initial images forming an expanded data set, and obtaining corresponding first label images; performing inference on the expanded data set to obtain first target images, obtaining the identification accuracy of each category under different scale information through the edge difference between the first target images and the first label images, and thereby obtaining the optimal inference scale of each category; obtaining an initial inference result through tile segmentation and recognition, obtaining the optimal inference scale corresponding to each connected domain, and adjusting the scale to obtain an optimal inference result; and calculating the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area to obtain its subject degree, and further obtaining the main content of the video. The invention aims to solve the problem of poor neural network recognition accuracy caused by different target scales.

Description

Video main content detection method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a video main content detection method and system based on deep learning.
Background
In today's intelligent development, video data often requires post-processing adjustments, and such post-processing should affect the video subject as little as possible. For example, to prevent excessive barrage (bullet-screen comments) from blocking the video, the main body of the video needs to be identified so that a corresponding mask can be generated and the barrage rendering adjusted; the barrage is then not displayed over the main body, occlusion of the main video content is avoided, and the viewing experience is improved. In the field of video compression, the pixel values of non-subject content are often adjusted, sacrificing non-subject pixel detail to increase the spatial redundancy of the video, thereby improving the compression rate and increasing the transmission speed of the video.
In the prior art, the main body in a video is usually identified frame by frame with a semantic segmentation neural network; however, the input data scale of such a network is fixed, while targets in the video appear at different resolution scales, which degrades the recognition accuracy.
Disclosure of Invention
The invention provides a video main content detection method and system based on deep learning, which aim to solve the problem of poor neural network identification accuracy caused by different target scales in the prior art, and adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a method for detecting video main content based on deep learning, including the following steps:
acquiring video data as a training data set and constructing a semantic segmentation recognition network;
acquiring a category label value and a fixed input scale in an identification network, taking the ratio of the scale of a minimum circumscribed rectangle of different targets in each video frame image in a training data set to the fixed input scale as initial scale information of each target, performing different-degree size adjustment on the pixels of each target according to the initial scale information to acquire first initial images of the targets under the information of a plurality of scales, forming an expanded data set by a plurality of first initial images of the different targets in all the video frame images, and acquiring a plurality of first label images by adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images;
reasoning is carried out on the expanded data set through an identification network to obtain a plurality of first target images, edge detection is carried out on a first label image and the first target images of the same first initial image respectively, segmentation difference degrees of all rows in the first initial image are obtained according to edge detection results, the sum of the segmentation difference degrees of all the rows is used as a segmentation error rate of the first initial image, the identification accuracy of each category on the current scale information is obtained according to the segmentation error rate of the first initial image under the same scale information of all targets in the same category, and the optimal reasoning scale of each category is obtained according to the scale information corresponding to the maximum identification accuracy in all the identification accuracy of each category;
performing overlapping tile segmentation on any video frame image according to a fixed input scale, inputting the video frame image into an identification network to obtain an initial reasoning result, obtaining a category label value corresponding to each connected domain in the initial reasoning result, obtaining an optimal reasoning scale of each connected domain according to the category label value, adjusting the pixel size of each initial area in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding category, and inputting the adjusted pixel size into the identification network to obtain an optimal reasoning result, wherein each initial area corresponds to one target area in the optimal reasoning result;
acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring a main body degree matrix corresponding to each video frame, and acquiring the main body content of the video according to the comparison result of the main body degree matrix and the first preset threshold.
Optionally, the obtaining and identifying the category label value and the fixed input scale in the network includes a specific method that:
the identification network manually marks different labels for the categories to be identified in the construction process, and each category label is represented in numerical form, namely as a category label value; the fixed input scale is the fixed transverse length and longitudinal width of the input image data in the identification network.
Optionally, the obtaining of the first initial image of each target under the information of the plurality of scales includes a specific method that:
the scale information comprises transverse length scale information and longitudinal width scale information; starting from the initial scale information, the scale information is changed by a first preset step length to adjust the size of the corresponding area of each target in the video frame image until a first initial image with scale information of 1 is obtained, and each adjustment obtains a first initial image under the corresponding scale information, namely the first initial images of each target under a plurality of scale information.
Optionally, the obtaining of the segmentation difference degree of each line in the first initial image includes a specific method that:
$W_i = \frac{\max(FB_i) \times \max(TB_i)}{2}$
wherein $W_i$ represents the segmentation integrity of the $i$-th row in the first initial image, $\max(FB_i)$ represents the maximum mark value of the $i$-th row in FB, FB representing the edge image of the first label image, with the edge pixels marked 2 and the non-edge pixels marked 0, and $\max(TB_i)$ represents the maximum mark value of the $i$-th row in TB, TB representing the edge image of the first target image, with the edge pixels marked 1 and the non-edge pixels marked 0;
the segmentation difference degree $C_i$ of an $i$-th row in which $\max(FB_i)$ and $\max(TB_i)$ are both 0 is recorded as 0; the segmentation difference degree $C_i$ of an $i$-th row in which $W_i = 0$ and $\max(FB_i)$, $\max(TB_i)$ are not both 0 is recorded as 1; when $W_i = 1$, the deviation degree $p$ of a target edge point in the $i$-th row of TB is obtained as:
$p = \frac{d(P, M)}{d(M, L)}$
wherein $d(P, M)$ denotes the one-dimensional distance between any target edge point $P$ in the $i$-th row of TB and the nearest label edge point $M$ in the same row of FB, and $d(M, L)$ denotes the one-dimensional distance between the label edge point $M$ and the nearest other label edge point $L$ in the same row of FB; the average value of the deviation degrees of all the target edge points in the $i$-th row of TB is taken as the segmentation difference degree $C_i$ of the $i$-th row.
Optionally, the obtaining of the identification accuracy of each category in the current scale information includes the specific method:
$F_b = \frac{1}{N_b}\sum_{x=1}^{N_b} f_x$
wherein $F_b$ represents the identification difference degree of any one category under scale information $b$, $N_b$ represents the number of first initial images of the category under scale information $b$, and $f_x$ denotes the segmentation error rate of the $x$-th of these first initial images;
$Z_b = \frac{F_{\max} - F_b}{F_{\max} - F_{\min}}$
wherein $Z_b$ represents the identification accuracy of the category under scale information $b$, $F_{\max}$ represents the maximum value of the identification difference degree of all the categories under all the scale information, and $F_{\min}$ represents the minimum value of the identification difference degree of all the categories under all the scale information.
Optionally, the obtaining of the initial inference result includes a specific method that:
and performing overlapping tile segmentation on any video frame image to obtain a plurality of tile areas, inputting each tile area into the identification network for reasoning, and overlapping the obtained output results according to the overlapping position relation of the tiles to obtain an initial reasoning result, wherein the scale of each tile area is a fixed input scale.
Optionally, the obtaining of the multi-frame occurrence degree of each target region includes a specific method that:
acquiring the same target area appearing in different video frame images according to the optimal inference result; the multi-frame occurrence degree $D$ of any one target area is calculated as:
$D = \mathrm{ReLU}\!\left(\frac{n}{N} - r\right)$
wherein $n$ indicates the number of frames in which the target area appears, $N$ represents the total number of frames of the video data, $r$ is a second preset threshold, and $\mathrm{ReLU}$ is the RELU function, defined so that values greater than 0 are unchanged and values less than or equal to 0 all become 0; the product of the second preset threshold and the total number of frames is taken as the minimum continuous frame number.
Optionally, the specific obtaining method of the comprehensive area ratio is as follows:
$S = \mathrm{ReLU}\!\left(\frac{1}{G}\sum_{j=1}^{G}\frac{s_j - s_{\min}}{s_{\max} - s_{\min}} - \beta\right)$
wherein $S$ represents the comprehensive area ratio of any one target region, $G$ represents the minimum continuous frame number, $s_j$ indicates the area proportion of the target region in the $j$-th of the minimum continuous frame images, $s_{\max}$ represents the largest area proportion of the target region in the minimum continuous frame images, $s_{\min}$ represents the smallest area proportion of the target region in the minimum continuous frame images, $\beta$ is a third preset threshold, and $\mathrm{ReLU}$ is the RELU function.
Optionally, the specific obtaining method of the comprehensive centering degree is as follows:
$E = 1 - \frac{1}{G}\sum_{j=1}^{G}\frac{\left|x_j - x_c\right|}{H/2}$
wherein $E$ indicates the comprehensive centering degree of any one target region, $G$ represents the minimum continuous frame number, $H$ represents the transverse length in the fixed input scale, $x_j$ indicates the abscissa of the center point of the target area in the $j$-th of the minimum continuous frame images, and $x_c$ represents the abscissa of the center point of the image.
In a second aspect, another embodiment of the present invention provides a system for detecting video subject content based on deep learning, including:
the network construction module is used for acquiring video data as a training data set and constructing a semantic segmentation recognition network;
an input scale module: acquiring class label values and fixed input scales in an identification network, taking the ratio of the scale of a minimum external rectangle of different targets in each video frame image in a training data set to the fixed input scale as initial scale information of each target, performing size adjustment of different degrees on pixels of each target according to the initial scale information to acquire first initial images of the targets under the information of a plurality of scales, forming an expanded data set by a plurality of first initial images of the different targets in all the video frame images, and adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images to acquire a plurality of first label images;
reasoning the enlarged data set through an identification network to obtain a plurality of first target images, respectively carrying out edge detection on a first label image and the first target images of the same first initial image, obtaining segmentation difference degrees of each row in the first initial image according to an edge detection result, taking the sum of the segmentation difference degrees of all the rows as a segmentation error rate of the first initial image, obtaining the identification accuracy of the category in the current scale information according to the segmentation error rate of the first initial image under the same scale information of all targets in the same category, and obtaining the optimal reasoning scale of the category according to the scale information corresponding to the maximum identification accuracy in all the identification accuracy of the category;
an inference recognition module: performing overlapping tile segmentation on any video frame image according to a fixed input scale, inputting the overlapping tile segmentation into a recognition network to obtain an initial reasoning result, obtaining a class label value corresponding to each connected domain in the initial reasoning result, obtaining an optimal reasoning scale of each connected domain corresponding to the class according to the class label value, adjusting the pixel size of each initial region in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding class, inputting the optimal reasoning result into the recognition network to perform secondary reasoning, and obtaining an optimal reasoning result, wherein each initial region corresponds to one target region in the optimal reasoning result;
a main body judging module: acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring a main body degree matrix corresponding to each video frame, and acquiring the main body content of the video according to the comparison result of the main body degree matrix and the first preset threshold.
The beneficial effects of the invention are: the identification accuracy of the target under the premise of different scales is obtained by means of scale adjustment, so that reference is provided for input data of the identification network, the input data can be adjusted in a self-adaptive manner, and the purpose of improving the identification accuracy is achieved; in the real-time reasoning process, first coarse reasoning is carried out in a tile overlapping mode, after an initial area is determined, the size of the identified initial area is adjusted, and more optimal input data are obtained, so that the output result of the network is more optimal, and a more accurate target area identification result is obtained; meanwhile, the main body degree is judged by combining the change expression of time sequence multiframes for the target area, and the video main body content is better detected.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for detecting video main content based on deep learning according to an embodiment of the present invention;
fig. 2 is a block diagram illustrating a video subject content detection system based on deep learning according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a first overlapping tile partition;
FIG. 4 is a schematic diagram of a second overlapping tile partition.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of a method for detecting video main content based on deep learning according to an embodiment of the present invention is shown, where the method includes the following steps:
and S001, acquiring video data and constructing a semantic segmentation recognition network as a training data set.
Constructing a semantic segmentation recognition network, wherein the specific contents are as follows:
(1) Collecting video data in the Internet, taking each video frame image in the video data as a training data set of the network, and manually marking a target to be identified with a marking tool according to a category to which the target belongs to obtain an original label image; it should be noted that the category labels are represented by different numbers, each object corresponds to one original label image according to the category to which the object belongs, each original label image is a binary image, the non-object part is marked as 0, and the object part is marked as a corresponding category label value;
(2) The network structure adopts a U-shaped structure;
(3) The loss function of the network adopts a cross entropy loss function;
The input data scale of the network is $H \times W \times C$, where $H$ is the transverse length of the input data, $W$ is the longitudinal width, and $C$ is the channel information of the input data; for a colour RGB image $C$ is 3, and for a grayscale image $C$ is 1.
Therefore, a semantic segmentation recognition network for recognizing and classifying the targets in the video frame image is constructed, and an original label image with label information of each target is obtained at the same time.
It should be noted that, in the prior art, a semantic segmentation neural network is used to perform pixel-level target identification, and the input of the neural network requires a fixed data scale; for example, if the input data scale of the network is 60 × 100 pixels, only images of that scale can be processed, and images that are too large or too small in an actual scene need to have their scale changed by pooling or interpolation to fit the input requirement of the network; for example, if the original image is 1000 × 600 but the input scale of the network is fixed at 60 × 100, the image needs to be segmented. In an actual scene the target scales are variable; not every video is like a live-streaming video in which the main character occupies more than 50% of the area of a video frame, and when the lens zooms in or out in some videos, the area of the main character in the video frame becomes too small or too large, making a good recognition result difficult to obtain. Meanwhile, when inference is carried out on the whole image, the segmentation of targets with a larger area proportion is generally better than that of targets with a smaller area proportion. The pooling and interpolation operations described above can change the scale of the input image and thus the proportion of the target in the input image, but which proportion ensures optimal recognition accuracy is difficult to determine.
Step S002, carrying out size adjustment of different degrees on different targets in all video frame images in the training data set to obtain a plurality of first initial images to form an enlarged data set, and obtaining a first label image corresponding to each first initial image.
It should be noted that the scales of different targets in each video frame image in the training data set are random and the labels are obtained by manual labeling; identifying the targets of a category only at their original, differing scales does not allow the network to learn the common features of that category during inference, so it is necessary to provide, for each target of each category, several input images at different scales in order to expand the input data set, and at the same time to obtain the corresponding label images for the different-scale input images of the same target.
Specifically, firstly, the different targets of each video frame image in the training data set are extracted according to the labels obtained by manual labeling, and each target is selected by its minimum circumscribed rectangle; the obtained target scale is $h \times w$, wherein $h$ represents the transverse length and $w$ represents the longitudinal width, while the fixed input scale of the recognition network is $H \times W$, so the initial scale information of the target is $(\frac{h}{H}, \frac{w}{W})$; the initial scale information is adjusted to obtain new scale information $(t\cdot\frac{h}{H}, t\cdot\frac{w}{W})$, wherein $t$ is the adjustment parameter. For example, if the original pixel scale of a target in the video frame image is 40 × 80 and the fixed input scale is 60 × 100, the initial scale information of the target is $(\frac{40}{60}, \frac{80}{100})$. In order to obtain input images of the same target at different scales, the pixels can be remapped by pooling, interpolation and similar operations to realize the size adjustment of the target.
It should be further noted that, while the input image is changed, the label image needs to be synchronously changed by using the same mapping relationship, so that the input image and the label image of the target under each scale information can be in one-to-one correspondence.
Specifically, in this embodiment a first preset step length a = 0.05 is adopted; starting from the initial scale information, the adjustment parameter is changed with the first preset step length to obtain new scale information and adjust the size of the corresponding region of each target in the video frame image, that is, t sequentially takes the values 1, 1+a, 1+2a, 1+3a, …, until a first initial image with scale information of 1 is obtained; each adjustment yields a first initial image under the corresponding scale information, namely the first initial images of each target under a plurality of scale information, and each first initial image also yields a corresponding first label image through the corresponding relationship; the several first initial images of all targets in all video frame images are combined into an expanded data set.
At this point, an expanded data set is obtained, the expanded data set includes a plurality of first initial images of each target under different scale information, and each first initial image has a corresponding first label image.
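As an illustration of the scale expansion described above, the following sketch (in Python, assuming OpenCV-style resizing; the function name expand_target_scales and the default step are illustrative and not taken from the embodiment) generates the first initial images and the matching first label images of one target:

import cv2

def expand_target_scales(target_patch, label_patch, fixed_hw, step=0.05):
    # target_patch: pixels inside the target's minimum circumscribed rectangle
    # label_patch:  the corresponding region of the original label image
    # fixed_hw:     (rows, cols) of the fixed input scale of the recognition network
    H, W = fixed_hw
    h, w = target_patch.shape[:2]
    init_h, init_w = h / H, w / W          # initial scale information
    samples = []
    t = 1.0
    while max(t * init_h, t * init_w) <= 1.0:
        new_h, new_w = max(1, round(t * h)), max(1, round(t * w))
        img = cv2.resize(target_patch, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
        # the label image is remapped with the same scaling; nearest keeps label values intact
        lab = cv2.resize(label_patch, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
        samples.append(((t * init_h, t * init_w), img, lab))
        t += step                          # first preset step length
    return samples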
And S003, reasoning the enlarged data set to obtain a plurality of first target images, obtaining the segmentation error rate of the first initial image through the difference of the edge detection result between the corresponding first target image and the first label image, further obtaining the identification accuracy of the same category under the corresponding scale information, and obtaining the optimal reasoning scale of each category according to the identification accuracy.
It should be noted that, the edge image comparison is performed on the first target image and the first label image of the same first initial image, the segmentation difference degree of each line is obtained first, and all the lines are accumulated as the segmentation error rate of the first initial image, at this time, the greater the difference of the edge image, the greater the segmentation error rate, the greater the difference of the first target image and the first label image, that is, the worse the recognition effect is.
Specifically, the method for obtaining a plurality of first target images by reasoning the enlarged data set specifically includes: and inputting each first initial image in the expanded data set into a trained semantic segmentation recognition network, and outputting a first target image of the first initial image.
At this time, the first target image and the first label image are binary images, the target part marked as the corresponding class label value and the non-target part marked as 0. Edge detection is performed on the first target image and the first label image respectively; the edge image of the first label image is marked as FB, with edge pixels marked as 2 and non-edge pixels marked as 0, and the edge image of the first target image is marked as TB, with edge pixels marked as 1 and non-edge pixels marked as 0. Taking the $i$-th row of any first initial image as an example, the segmentation integrity $W_i$ of the $i$-th row is first obtained; the calculation method is:
$W_i = \frac{\max(FB_i) \times \max(TB_i)}{2}$
wherein $\max(FB_i)$ represents the maximum mark value of the $i$-th row in FB and $\max(TB_i)$ represents the maximum mark value of the $i$-th row in TB. At this time, if the $i$-th row in FB has pixels with mark value 2 and the $i$-th row in TB has pixels with mark value 1, the target is identified in that row of the inference result and the segmentation integrity is 1; if the $i$-th row in FB has pixels with mark value 2 but the $i$-th row in TB has no pixel with mark value 1, no target is identified in that row of the inference result and the segmentation integrity is 0; similarly, if the $i$-th row in FB has no pixel with mark value 2 but the $i$-th row in TB has pixels with mark value 1, a part outside the target has been recognized as a target part in the inference result.
Further, if $\max(FB_i)$ and $\max(TB_i)$ are both 0, the row has no edge, and the segmentation difference degree $C_i$ of the $i$-th row is recorded as 0; when $W_i = 0$ and $\max(FB_i)$, $\max(TB_i)$ are not both 0, the difference in target edge recognition is largest, and the segmentation difference degree $C_i$ of the $i$-th row is recorded as 1; for the case of segmentation integrity $W_i = 1$, the deviation degree $p$ of each target edge point in the $i$-th row is first obtained; the specific calculation method is:
$p = \frac{d(P, M)}{d(M, L)}$
wherein $d(P, M)$ represents the one-dimensional distance between any target edge point $P$ in the $i$-th row of TB and the nearest label edge point $M$ in the same row of FB, and $d(M, L)$ represents the one-dimensional distance between the label edge point $M$ and the nearest other label edge point $L$ in the same row of FB. The average value of the deviation degrees of all the target edge points in the $i$-th row of TB is taken as the segmentation difference degree $C_i$ of the $i$-th row; the greater the segmentation difference degree, the poorer the target recognition of the $i$-th row, and when the segmentation difference degree reaches its maximum of 1, the edges of the $i$-th row appear in one of the first target image and the first label image but not the other, and the target recognition result of that row is worst.
Further, the sum of the segmentation difference degrees of all the lines in the first initial image is used as the segmentation error rate of the first initial image, and the greater the segmentation error rate is, the poorer the identification effect of the first initial image corresponding to the target under the current scale information is; the smaller the segmentation error rate is, the better the recognition effect of the corresponding target under the current scale information is.
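A minimal sketch of the row-wise comparison follows (Python/NumPy); it assumes the FB and TB edge maps have already been computed, and the un-clamped distance ratio mirrors the reconstruction of the deviation-degree formula above, so it should be read as an assumption rather than the patent's exact expression:

import numpy as np

def row_segmentation_difference(fb_row, tb_row):
    # fb_row: one row of FB (edge pixels marked 2, non-edge 0)
    # tb_row: one row of TB (edge pixels marked 1, non-edge 0)
    has_fb, has_tb = fb_row.max() == 2, tb_row.max() == 1
    if not has_fb and not has_tb:              # no edge in either image: difference 0
        return 0.0
    if has_fb != has_tb:                       # edge present in only one image: difference 1
        return 1.0
    fb_cols = np.flatnonzero(fb_row == 2)
    tb_cols = np.flatnonzero(tb_row == 1)
    deviations = []
    for p in tb_cols:                          # each target edge point P in the row of TB
        m = fb_cols[np.argmin(np.abs(fb_cols - p))]      # nearest label edge point M in FB
        others = fb_cols[fb_cols != m]
        if others.size == 0:                   # no second label edge point L in this row
            continue
        l = others[np.argmin(np.abs(others - m))]        # label edge point nearest to M
        deviations.append(abs(p - m) / abs(m - l))       # deviation degree of point P
    return float(np.mean(deviations)) if deviations else 0.0

def segmentation_error_rate(fb, tb):
    # sum of the row difference degrees over the whole first initial image
    return sum(row_segmentation_difference(fb[i], tb[i]) for i in range(fb.shape[0]))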
It should be further noted that the mean value of the segmentation error rates of the first initial images of all targets of the same category under the same scale information is used as the identification difference degree of that category under the current scale information; the greater the identification difference degree, the worse the identification effect. The identification accuracy is obtained as the inverse-proportion normalized value of the identification difference degree, and the result of multiplying the scale information with the maximum identification accuracy of the category by the fixed input scale is the optimal inference scale; adjusting all target images of the category to the optimal inference scale then yields the optimal target identification result.
Specifically, taking a certain category under scale information $b$ as an example, the identification difference degree $F_b$ of the category under the current scale information is obtained; the calculation method is:
$F_b = \frac{1}{N_b}\sum_{x=1}^{N_b} f_x$
wherein $N_b$ represents the number of first initial images of the category under scale information $b$, and $f_x$ denotes the segmentation error rate of the $x$-th first initial image; the larger the obtained identification difference degree, the poorer the recognition effect of the category under the current scale information. Meanwhile, to keep the identification accuracy within the interval $[0, 1]$, inverse proportion linear normalization is performed on the identification difference degree to obtain the identification accuracy; still taking the category under scale information $b$ as an example, the identification accuracy $Z_b$ of the category under the current scale information is calculated as:
$Z_b = \frac{F_{\max} - F_b}{F_{\max} - F_{\min}}$
wherein $F_{\max}$ represents the maximum value of the identification difference degree of all categories under all scale information and $F_{\min}$ represents the minimum value of the identification difference degree of all categories under all scale information; at this time, within the same category, the greater the identification accuracy under a given scale information, the better the target recognition effect of the first initial images of the different targets of that category under the corresponding scale information.
Selecting scale information corresponding to the maximum identification accuracy rate in all identification accuracy rates of the same category as the optimal reasoning scale information of the corresponding category, and taking the product of the optimal reasoning scale information and the fixed input scale as the optimal reasoning scale of the corresponding category; thus, the optimal inference scale of different targets in each category is obtained, and the first initial image of the target under the optimal inference scale is used as the input of the recognition network, so that the optimal target recognition result can be obtained.
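The selection of the optimal inference scale can be sketched as below; the nested-dictionary bookkeeping (category to scale information to list of segmentation error rates) is an assumed data layout, not something specified by the embodiment:

def optimal_inference_scales(error_rates, fixed_hw):
    # error_rates[c][s]: list of segmentation error rates of the first initial images of
    #                    category c under scale information s (s is a (th, tw) tuple)
    # fixed_hw:          the fixed input scale (H, W) of the recognition network
    diff = {c: {s: sum(v) / len(v) for s, v in scales.items()}     # identification difference degree
            for c, scales in error_rates.items()}
    all_vals = [d for scales in diff.values() for d in scales.values()]
    d_max, d_min = max(all_vals), min(all_vals)
    H, W = fixed_hw
    best = {}
    for c, scales in diff.items():
        # inverse-proportion linear normalization gives the identification accuracy
        acc = {s: (d_max - d) / (d_max - d_min + 1e-12) for s, d in scales.items()}
        s_best = max(acc, key=acc.get)                             # scale info with highest accuracy
        best[c] = (round(s_best[0] * H), round(s_best[1] * W))     # optimal inference scale in pixels
    return best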
And S004, obtaining an initial reasoning result through overlapping tile segmentation and recognition reasoning, obtaining a corresponding optimal reasoning scale for each connected domain in the initial reasoning result according to the corresponding class label value, and carrying out scale adjustment on the input image through the optimal reasoning scale to obtain an optimal reasoning result.
It should be noted that, in order to avoid omission of the recognition and inference results caused by a large scale of the video frame image, overlapping tiles are used for performing overlapping segmentation on the video frame image, the segmented tile regions are recognized through a recognition network and are overlapped according to the tile overlapping position relationship to obtain an initial inference result of each video frame image, and a connected domain corresponding to each category of targets in the video frame image is roughly determined; and then, adjusting according to the corresponding optimal inference scale of the class label value corresponding to the connected domain, so that different targets with different scales in each video frame image can achieve the optimal inference result.
Referring to FIG. 3, the first overlapping segmentation of the video frame image with the fixed input scale $H \times W$ is shown; the left image shows the scale and the moving direction of the tile, the arrow represents the moving direction, and the dashed box represents the moving tile; the right image is the first tile overlapping segmentation result, and the dashed boxes represent the segmented tile areas; to avoid large errors in the segmentation identification, new tile coverage is needed at the segmentation boundaries.
Referring to FIG. 4, the second overlapping segmentation of the video frame image with the fixed input scale $H \times W$ is shown; the boundary lines of the tiles from the first segmentation are taken as reference lines to determine the segmentation of the new tiles; the newly added dashed boxes in FIG. 4 represent the newly added tile segmentation areas, which cover the boundaries of the first overlapping segmentation to avoid large errors.
Inputting a plurality of tile areas obtained by two-time tile overlapping and dividing into an identification network for reasoning, and superposing obtained output results according to the overlapping position relation of the tiles to obtain an initial reasoning result; at the moment, the initial reasoning result of any video frame image comprises a plurality of connected domains, each connected domain is an initial recognition result of a target, and simultaneously due to the influence of a category label value, namely an original label image, the pixel value of the same target in the initial recognition result is represented as a corresponding category label value, so that a corresponding optimal reasoning scale can be obtained according to the category label value, the size of each initial region in the video frame image corresponding to each connected domain is adjusted according to the optimal reasoning scale, the initial regions are input into a recognition network to obtain an output result, and the output result is combined according to the corresponding position distribution of each initial region in the video frame image to obtain an optimal reasoning result; at the moment, each initial region respectively corresponds to one target region in the optimal reasoning result, and the optimal reasoning result realizes the identification reasoning result which is optimal for the targets with different scales in the video frame image.
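The two-pass overlapping tiling and the superposition into an initial inference result might look like the following sketch, where infer stands for the trained recognition network and the half-tile offset of the second pass is an assumption consistent with FIG. 3 and FIG. 4:

import numpy as np

def overlapped_tile_inference(frame, infer, tile_hw):
    # frame:   one video frame image, shape (H, W) or (H, W, C)
    # infer:   the trained recognition network, mapping a tile of shape tile_hw to a label map
    # tile_hw: the fixed input scale (rows, cols)
    th, tw = tile_hw
    H, W = frame.shape[:2]
    out = np.zeros((H, W), dtype=np.int64)

    def run_pass(y0, x0):
        for y in range(y0, H, th):
            for x in range(x0, W, tw):
                y1, x1 = min(y + th, H), min(x + tw, W)
                tile = np.zeros((th, tw) + frame.shape[2:], dtype=frame.dtype)
                tile[: y1 - y, : x1 - x] = frame[y:y1, x:x1]
                pred = infer(tile)                                 # class label value per pixel
                # superpose according to the overlapping position of the tiles
                out[y:y1, x:x1] = np.maximum(out[y:y1, x:x1], pred[: y1 - y, : x1 - x])

    run_pass(0, 0)               # first overlapping segmentation (FIG. 3)
    run_pass(th // 2, tw // 2)   # second segmentation offset to cover the first boundaries (FIG. 4)
    return out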
And S005, acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the three characteristics, and acquiring the main body content of the video according to the comparison result of the main body degree and the first preset threshold.
It should be noted that, in step S004, the target area of the video frame image is only different targets that appear in the video and need to be noticed, and is not all the main content in the video, and it is necessary to determine whether the target area is the main content of the video according to the appearance of the targets in the time sequence and the area distribution relationship; the main content in the video is always appeared for a long time, distributed in the center of the video and has a large area in the video frame image; the larger the occurrence degree of multiple frames in the target area is, the larger the comprehensive area proportion is, the larger the comprehensive centering degree is, and the higher the possibility that the target area is the main content of the video is, namely the larger the main degree is; and then, forming a main body degree matrix according to the main body degrees of different target areas in the video frame image, and comparing the main body degree matrix with a first preset threshold value, wherein the main body degree matrix is the main body content of the video if the main body degree matrix is larger than the first preset threshold value.
Specifically, the same target area appearing in the optimal inference results of different video frame images is obtained through the IOU method, and the multi-frame occurrence degree $D$ of any one target area is calculated as:
$D = \mathrm{ReLU}\!\left(\frac{n}{N} - r\right)$
wherein $n$ indicates the number of frames in which the target area appears, $N$ represents the total number of frames of the video data, $r$ is the second preset threshold (a fixed value is used in this embodiment), and $\mathrm{ReLU}$ is the RELU function, defined so that values greater than 0 are unchanged and values less than or equal to 0 all become 0; the product of the second preset threshold and the total number of frames is taken as the minimum continuous frame number. The more frames in which the target region appears in the video, the greater the multi-frame occurrence degree, and the longer the time of appearance in the video, the greater the possibility of being the subject content.
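Expressed in code (following the reconstructed formula above, so the exact form and the example threshold r = 0.3 are assumptions):

def multi_frame_occurrence(n_frames, total_frames, r=0.3):
    # n_frames:     number of frames in which the target area appears
    # total_frames: total number of frames of the video data
    # r:            second preset threshold (example value, not specified here)
    relu = lambda v: v if v > 0 else 0.0
    return relu(n_frames / total_frames - r)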
The comprehensive area ratio $S$ of any one target area is calculated as:
$S = \mathrm{ReLU}\!\left(\frac{1}{G}\sum_{j=1}^{G}\frac{s_j - s_{\min}}{s_{\max} - s_{\min}} - \beta\right)$
wherein $G$ represents the minimum continuous frame number, $s_j$ indicates the area proportion of the target area in the $j$-th of the minimum continuous frame images, $s_{\max}$ represents the largest area proportion of the target region in the minimum continuous frame images, $s_{\min}$ represents the smallest area proportion of the target region in the minimum continuous frame images, $\beta$ is the third preset threshold (a fixed value is used in this embodiment), and $\mathrm{ReLU}$ is the RELU function. The larger the mean area proportion of the target area over the minimum continuous frames, the larger the comprehensive area ratio, the larger the area occupied in the video frame image, and the higher the possibility of being the main content.
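A corresponding sketch (again following the reconstructed formula, with beta = 0.1 only as an example value):

def comprehensive_area_ratio(area_ratios, beta=0.1):
    # area_ratios: area proportion of the target region in each of the minimum continuous frame images
    # beta:        third preset threshold (example value)
    relu = lambda v: v if v > 0 else 0.0
    s_max, s_min = max(area_ratios), min(area_ratios)
    span = (s_max - s_min) or 1.0          # avoid division by zero when the area is constant
    norm_mean = sum((s - s_min) / span for s in area_ratios) / len(area_ratios)
    return relu(norm_mean - beta)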
The comprehensive centering degree $E$ of any one target area is calculated as:
$E = 1 - \frac{1}{G}\sum_{j=1}^{G}\frac{\left|x_j - x_c\right|}{H/2}$
wherein $G$ represents the minimum continuous frame number, $H$ represents the transverse length in the fixed input scale, $x_j$ indicates the abscissa of the center point of the target area in the $j$-th of the minimum continuous frame images, and $x_c$ represents the abscissa of the center point of the image. The larger the mean one-dimensional abscissa distance between the center of the target area and the center of the video frame image over the minimum continuous frames, the closer it is to the maximum distance of half the transverse length in the fixed input scale, the smaller the comprehensive centering degree, and the lower the possibility that the target area is the subject content.
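And the centering degree, following the reconstruction above:

def comprehensive_centering(center_xs, lateral_len):
    # center_xs:   abscissa of the target region's center point in each of the minimum continuous frames
    # lateral_len: transverse length H of the fixed input scale; H/2 is the maximum possible distance
    x_c = lateral_len / 2.0                # abscissa of the image center point
    mean_dist = sum(abs(x - x_c) for x in center_xs) / len(center_xs)
    return 1.0 - mean_dist / (lateral_len / 2.0)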
Further, the product of the multi-frame occurrence degree $D$, the comprehensive area ratio $S$ and the comprehensive centering degree $E$ of the target area is taken as the subject degree of the target region and linearly normalized, the subject degree being recorded as $Q$; the greater the subject degree, the greater the likelihood that the target area is the video subject content.
The pixel value of each target area in the optimal inference result of the video frame image is replaced with the corresponding subject degree value, while the pixel value of the non-target part in the optimal inference result is already 0; the pixel values of the points of the replaced video frame image form the subject degree matrix. A first preset threshold is given in this embodiment; the parts whose subject degree is greater than the first preset threshold are marked as 1, and the rest are marked as 0. The image obtained after this judgment and marking is binary mask data for the main content; the main content in the video frame image can be extracted by operating on the video frame image according to the binary mask data, and corresponding operations such as compression or barrage rendering can then be performed.
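The replacement of target-region pixels by their subject degree and the thresholding into binary mask data can be sketched as follows; representing each target region by a distinct region id and the threshold value 0.5 are assumptions for illustration:

import numpy as np

def subject_mask(region_map, region_degrees, threshold=0.5):
    # region_map:     per-pixel region id of the optimal inference result (0 for non-target pixels)
    # region_degrees: region id -> linearly normalized subject degree of that target region
    # threshold:      first preset threshold (example value)
    degree_matrix = np.zeros(region_map.shape, dtype=np.float32)
    for region_id, degree in region_degrees.items():
        degree_matrix[region_map == region_id] = degree   # subject degree matrix
    mask = (degree_matrix > threshold).astype(np.uint8)   # 1 where subject content, 0 elsewhere
    return degree_matrix, mask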
Referring to fig. 2, a block diagram of a system for detecting video subject content based on deep learning according to an embodiment of the present invention is shown, where the system includes:
and the network construction module S101 is used for acquiring video data and constructing a semantic segmentation recognition network as a training data set.
Input scale module S102:
(1) Carrying out size adjustment of different degrees on different targets in all video frame images in a training data set to obtain a plurality of first initial images to form an enlarged data set, and obtaining a first label image corresponding to each first initial image;
(2) Reasoning is carried out on the enlarged data set to obtain a plurality of first target images, the segmentation error rate of the first initial image is obtained through the difference of the edge detection result between the corresponding first target image and the corresponding first label image, the identification accuracy of the same category under the corresponding scale information is further obtained, and the optimal reasoning scale of each category is obtained according to the identification accuracy.
The inference recognition module S103: and obtaining an initial reasoning result through overlapping tile segmentation and recognition reasoning, obtaining a corresponding optimal reasoning scale for each connected domain in the initial reasoning result according to the corresponding class label value, and carrying out scale adjustment on the input image through the optimal reasoning scale to obtain an optimal reasoning result.
The main body judgment module S104: acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the three characteristics, and acquiring the main body content of the video according to the comparison result of the main body degree and the first preset threshold.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. The video main content detection method based on deep learning is characterized by comprising the following steps:
acquiring video data as a training data set and constructing a semantic segmentation recognition network;
acquiring a category label value and a fixed input scale in an identification network, taking the ratio of the scale of a minimum circumscribed rectangle of different targets in each video frame image in a training data set to the fixed input scale as initial scale information of each target, performing different-degree size adjustment on the pixels of each target according to the initial scale information to acquire first initial images of the targets under the information of a plurality of scales, forming an expanded data set by a plurality of first initial images of the different targets in all the video frame images, and acquiring a plurality of first label images by adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images;
reasoning is carried out on the expanded data set through an identification network to obtain a plurality of first target images, edge detection is carried out on a first label image and the first target image of the same first initial image respectively, segmentation difference degrees of all rows in the first initial image are obtained according to edge detection results, the sum of the segmentation difference degrees of all the rows is used as a segmentation error rate of the first initial image, the identification accuracy of each category in the current scale information is obtained according to the segmentation error rate of the first initial image under the same scale information of all targets in the same category, and the optimal reasoning scale of each category is obtained according to the scale information corresponding to the maximum identification accuracy in all the identification accuracy of each category;
performing overlapping tile segmentation on any video frame image according to a fixed input scale, inputting the video frame image into an identification network to obtain an initial reasoning result, obtaining a category label value corresponding to each connected domain in the initial reasoning result, obtaining an optimal reasoning scale of each connected domain according to the category label value, adjusting the pixel size of each initial area in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding category, and inputting the adjusted pixel size into the identification network to obtain an optimal reasoning result, wherein each initial area corresponds to one target area in the optimal reasoning result;
acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, obtaining the main body degree of each target area according to its multi-frame occurrence degree, comprehensive area ratio and comprehensive centering degree, obtaining the main body degree matrix corresponding to each video frame, and obtaining the main body content of the video according to the comparison result between the main body degree matrix and a first preset threshold.
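As a reading aid for the second step of claim 1, the sketch below derives the initial scale information of one target from its minimum circumscribed rectangle and the fixed input scale; it assumes an axis-aligned rectangle and one binary mask per target, which the claim does not mandate.

import numpy as np

def initial_scale_info(target_mask, fixed_hw):
    # Ratio of the target's minimum circumscribed rectangle to the fixed input
    # scale, returned as (transverse length, longitudinal width) scale information.
    ys, xs = np.nonzero(target_mask)
    rect_h = ys.max() - ys.min() + 1
    rect_w = xs.max() - xs.min() + 1
    fixed_h, fixed_w = fixed_hw
    return rect_w / fixed_w, rect_h / fixed_h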
2. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the category label values and the fixed input scale in the recognition network comprises the following specific method:
during construction of the recognition network, different labels are manually assigned to the categories to be recognized, and each category label is expressed in numerical form as a category label value; the fixed input scale is the fixed transverse length and longitudinal width scale of the image data input into the recognition network.
3. The method for detecting video main content based on deep learning according to claim 1, wherein the obtaining of the first initial images of each target under a plurality of pieces of scale information comprises the following specific method:
the scale information comprises transverse length scale information and longitudinal width scale information; starting from the initial scale information, the size of the area corresponding to each target in the video frame image is adjusted by changing the scale information by a first preset step length until a first initial image whose scale information is 1 is obtained, and each adjustment yields a first initial image under the corresponding scale information, namely the first initial images of each target under a plurality of pieces of scale information.
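A minimal sketch of the stepping described in claim 3, assuming the first preset step length is 0.1 and that both scale components are stepped toward 1; the step value, rounding and stopping test are illustrative choices, not values from the patent.

import numpy as np
import cv2

def first_initial_images(region, label, init_scale, fixed_hw, step=0.1):
    # Generate (scale info, image, label) triples from the initial scale
    # information up to scale information 1, resizing image and label together.
    fixed_h, fixed_w = fixed_hw
    sx, sy = init_scale
    triples = []
    while True:
        w = max(1, int(round(sx * fixed_w)))
        h = max(1, int(round(sy * fixed_h)))
        img = cv2.resize(region, (w, h))
        lab = cv2.resize(label, (w, h), interpolation=cv2.INTER_NEAREST)
        triples.append(((sx, sy), img, lab))
        if abs(sx - 1.0) < 1e-9 and abs(sy - 1.0) < 1e-9:
            break
        sx = sx + step * np.sign(1.0 - sx) if abs(sx - 1.0) > step else 1.0
        sy = sy + step * np.sign(1.0 - sy) if abs(sy - 1.0) > step else 1.0
    return triples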
4. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the segmentation difference degree of each row in the first initial image comprises the following specific steps:
recording the edge image of the first label image as FB, in which edge pixels are marked as 2 and non-edge pixels are marked as 0, and recording the edge image of the first target image as TB, in which edge pixels are marked as 1 and non-edge pixels are marked as 0; the segmentation completeness of the i-th row of the first initial image is determined from F_i, the maximum marked value of the i-th row of FB, and T_i, the maximum marked value of the i-th row of TB;
when F_i and T_i are both 0, the segmentation difference degree d_i of the i-th row is marked as 0; when F_i and T_i are not both 0 but at least one of them is 0, the segmentation difference degree d_i of the i-th row is marked as 1;
when F_i and T_i are both non-zero, the deviation degree of each target edge point P in the i-th row of TB is obtained as:
d(P) = |PM| / |ML|
wherein |PM| represents the one-dimensional distance between the edge point P of any target in the i-th row of TB and the nearest label edge point M in the same row of FB, and |ML| represents the one-dimensional distance between the label edge point M in the i-th row of FB and the nearest label edge point L in the same row; the mean value of the deviation degrees of all target edge points in the i-th row of TB is taken as the segmentation difference degree d_i of the i-th row.
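The row rule of claim 4 translates almost directly into code; the sketch below assumes FB and TB are 2-D arrays marked exactly as described (2/0 and 1/0) and uses the ratio form of the deviation degree given above, with a fallback when a row of FB contains only one label edge point.

import numpy as np

def row_segmentation_difference(FB, TB):
    # Per-row segmentation difference degree between the label edge image FB
    # and the target edge image TB; summing the result gives the segmentation
    # error rate of the first initial image.
    diffs = []
    for fb_row, tb_row in zip(FB, TB):
        f_max, t_max = fb_row.max(), tb_row.max()
        if f_max == 0 and t_max == 0:
            diffs.append(0.0)                      # neither image has an edge in this row
        elif f_max == 0 or t_max == 0:
            diffs.append(1.0)                      # an edge appears in only one of the two images
        else:
            label_x = np.flatnonzero(fb_row)       # columns of label edge points in this row
            devs = []
            for p in np.flatnonzero(tb_row):       # every target edge point P in this row
                m_idx = int(np.argmin(np.abs(label_x - p)))
                m = label_x[m_idx]
                others = np.delete(label_x, m_idx)
                if others.size == 0:               # no second label point: fall back to |PM|
                    devs.append(abs(int(m) - int(p)))
                else:
                    l = others[np.argmin(np.abs(others - m))]
                    devs.append(abs(int(m) - int(p)) / abs(int(l) - int(m)))
            diffs.append(float(np.mean(devs)))
    return np.array(diffs)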
5. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the recognition accuracy of each category under the current scale information comprises the following specific steps:
the recognition difference degree E(c, r) of any one category c under scale information r is:
E(c, r) = (1/K) * Σ_{k=1}^{K} e_k
wherein K represents the number of input first initial images of the category c under the scale information r, and e_k represents the segmentation error rate of the k-th first initial image;
the recognition accuracy A(c, r) of the category c under the scale information r is:
A(c, r) = (E_max - E(c, r)) / (E_max - E_min)
wherein E_max represents the maximum value of the recognition difference degree of all categories under all scale information, and E_min represents the minimum value of the recognition difference degree of all categories under all scale information.
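Following the reconstruction of claim 5 above, the sketch below averages the segmentation error rates per (category, scale information) pair, normalizes them across all pairs, and picks the optimal reasoning scale of each category; the averaging step stands in for a formula that is published only as an image.

import numpy as np

def optimal_scales(error_rates):
    # error_rates: {(category, scale_info): [segmentation error rate of each
    # first initial image of that category under that scale information]}.
    diff = {key: float(np.mean(v)) for key, v in error_rates.items()}   # recognition difference degree
    d_max, d_min = max(diff.values()), min(diff.values())
    span = (d_max - d_min) or 1.0                                       # guard against a zero span
    acc = {key: (d_max - d) / span for key, d in diff.items()}          # recognition accuracy
    best = {}
    for (cat, scale), a in acc.items():
        if cat not in best or a > best[cat][1]:
            best[cat] = (scale, a)
    return acc, {cat: scale for cat, (scale, _) in best.items()}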
6. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the initial reasoning result comprises the following specific method:
performing overlapping tile segmentation on any video frame image to obtain a plurality of tile areas, wherein the scale of each tile area is the fixed input scale; inputting each tile area into the recognition network for inference, and superimposing the obtained output results according to the overlapping position relation of the tiles to obtain the initial reasoning result.
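A sketch of the overlapping tile segmentation of claim 6, assuming frames are at least as large as the fixed input scale and a 50% overlap; the claim only says the tile outputs are superimposed according to their positions, so the merge rule here (later tiles overwrite earlier ones) is an assumption.

import numpy as np

def tile_and_stitch(frame, model, fixed_hw, overlap=0.5):
    # Run the recognition network on overlapping tiles of the fixed input scale
    # and assemble the tile outputs back into a full-frame label map.
    th, tw = fixed_hw
    H, W = frame.shape[:2]
    step_y = max(1, int(th * (1 - overlap)))
    step_x = max(1, int(tw * (1 - overlap)))
    ys = sorted(set(list(range(0, max(H - th, 0) + 1, step_y)) + [max(H - th, 0)]))
    xs = sorted(set(list(range(0, max(W - tw, 0) + 1, step_x)) + [max(W - tw, 0)]))
    result = np.zeros((H, W), dtype=np.int32)
    for y in ys:
        for x in xs:
            tile = frame[y:y + th, x:x + tw]
            result[y:y + th, x:x + tw] = model(tile)   # per-pixel category labels for this tile
    return result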
7. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the multi-frame occurrence degree of each target area comprises the following specific method:
acquiring the same target area appearing in different video frame images according to the optimal reasoning result;
(formula given as an image in the original publication)
wherein G represents the multi-frame occurrence degree of any one target area, n represents the number of frames in which the target area appears, N represents the total number of frames of the video data, T2 is a second preset threshold, and ReLU is the ReLU function, defined such that values greater than 0 are unchanged and values less than or equal to 0 become 0; the product of the second preset threshold and the total frame number is taken as the minimum continuous frame number.
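The multi-frame occurrence formula survives only as an image; from the variables listed (occurrence frame count n, total frame count N, second preset threshold T2, ReLU), one plausible form is ReLU(n/N - T2), and that assumed form is what the sketch below computes, together with the minimum continuous frame number T2 * N that the claim does define.

def multi_frame_occurrence(n_frames, total_frames, t2=0.1):
    # Assumed form ReLU(n/N - T2); the 0.1 default for the second preset
    # threshold is illustrative, not taken from the patent.
    relu = lambda v: v if v > 0 else 0.0
    min_continuous_frames = t2 * total_frames
    return relu(n_frames / total_frames - t2), min_continuous_frames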
8. The method for detecting video main content based on deep learning of claim 7, wherein the comprehensive area ratio is obtained as follows:
(formula given as an image in the original publication)
wherein Z represents the comprehensive area ratio of any one target area, m represents the minimum continuous frame number, s_j represents the area ratio of the target area in the j-th frame image of the minimum continuous frame number, s_max represents the largest area ratio of the target area in the minimum continuous frame number images, s_min represents the smallest area ratio of the target area in the minimum continuous frame number images, T3 is a third preset threshold, and ReLU is the ReLU function.
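The comprehensive area ratio formula is likewise published only as an image; given the listed variables (per-frame area ratios over the minimum continuous frame number, their maximum and minimum, a third preset threshold and ReLU), the sketch below uses one plausible reading, a thresholded mean damped by the max-min spread, purely as an assumption.

import numpy as np

def comprehensive_area_ratio(area_ratios, t3=0.05):
    # area_ratios: area ratio of one target area in each of the minimum
    # continuous frame number images.  Assumed combination, not the published formula.
    relu = lambda v: v if v > 0 else 0.0
    s = np.asarray(area_ratios, dtype=float)
    spread = s.max() - s.min()                 # a large spread means an unstable area
    return relu(s.mean() - t3) * (1.0 - spread)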
9. The method for detecting video main content based on deep learning of claim 7, wherein the comprehensive centering degree is obtained by:
(formula given as an image in the original publication)
wherein C represents the comprehensive centering degree of any one target area, m represents the minimum continuous frame number, W represents the transverse length dimension of the fixed input scale, x_j represents the abscissa of the center point of the target area in the j-th frame image of the minimum continuous frame number, and x_c represents the abscissa of the center point of the image.
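The comprehensive centering degree formula is also an image in the published text; its variables (the transverse length of the fixed input scale, the per-frame abscissa of the target center and the abscissa of the image center) suggest an average of how close the target stays to the horizontal center, which the sketch below computes under that assumption.

import numpy as np

def comprehensive_centering(center_xs, lateral_length):
    # center_xs: abscissa of the target area's center point in each of the
    # minimum continuous frame number images.  Assumed form, not the published one.
    x = np.asarray(center_xs, dtype=float)
    x_c = lateral_length / 2.0                 # abscissa of the image center point
    half = lateral_length / 2.0
    return float(np.mean(1.0 - np.abs(x - x_c) / half))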
10. A video main content detection system based on deep learning, characterized in that the system comprises:
a network construction module, used for acquiring video data as a training data set and constructing a semantic segmentation recognition network;
an input scale module: acquiring category label values and a fixed input scale in the recognition network, taking the ratio of the scale of the minimum circumscribed rectangle of each different target in each video frame image in the training data set to the fixed input scale as the initial scale information of that target, resizing the pixels of each target to different degrees according to the initial scale information to acquire first initial images of each target under a plurality of pieces of scale information, forming an expanded data set from the plurality of first initial images of the different targets in all the video frame images, and adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images to acquire a plurality of first label images;
performing inference on the expanded data set through the recognition network to obtain a plurality of first target images; performing edge detection on the first label image and the first target image of the same first initial image respectively; obtaining the segmentation difference degree of each row in the first initial image according to the edge detection results, and taking the sum of the segmentation difference degrees of all rows as the segmentation error rate of the first initial image; obtaining the recognition accuracy of the category under the current scale information according to the segmentation error rates of the first initial images of all targets of the same category under the same scale information, and taking the scale information corresponding to the maximum recognition accuracy among all recognition accuracies of the category as the optimal reasoning scale of that category;
an inference recognition module: performing overlapping tile segmentation on any video frame image according to the fixed input scale and inputting the tiles into the recognition network to obtain an initial reasoning result; obtaining the category label value corresponding to each connected domain in the initial reasoning result and obtaining the optimal reasoning scale of the corresponding category for each connected domain according to its category label value; adjusting the pixel size of each initial region in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding category, and inputting the adjusted initial regions into the recognition network for secondary inference to obtain an optimal reasoning result, wherein each initial region corresponds to one target area in the optimal reasoning result;
a main body judgment module: acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, obtaining the main body degree of each target area according to its multi-frame occurrence degree, comprehensive area ratio and comprehensive centering degree, obtaining the main body degree matrix corresponding to each video frame, and obtaining the main body content of the video according to the comparison result between the main body degree matrix and the first preset threshold.
CN202211609427.2A 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning Active CN115601688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211609427.2A CN115601688B (en) 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211609427.2A CN115601688B (en) 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN115601688A true CN115601688A (en) 2023-01-13
CN115601688B CN115601688B (en) 2023-02-21

Family

ID=84854198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211609427.2A Active CN115601688B (en) 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115601688B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116055722A (en) * 2023-03-06 2023-05-02 山东梁山酿酒总厂有限公司 Data storage method for automatic white spirit production system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539930A (en) * 2020-04-21 2020-08-14 浙江德尚韵兴医疗科技有限公司 Dynamic ultrasonic breast nodule real-time segmentation and identification method based on deep learning
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
US20210256252A1 (en) * 2020-02-19 2021-08-19 Kyocera Document Solutions Inc. Learning dataset generation system, learning dataset generation server, and computer readable non-temporary recording medium storing learning dataset generation program
CN113538227A (en) * 2020-04-20 2021-10-22 华为技术有限公司 Image processing method based on semantic segmentation and related equipment
CN114387506A (en) * 2021-12-11 2022-04-22 深圳供电局有限公司 Transmission tower monitoring method and device, computer equipment and storage medium
US20220215662A1 (en) * 2021-01-06 2022-07-07 Dalian University Of Technology Video semantic segmentation method based on active learning
CN114842430A (en) * 2022-07-04 2022-08-02 江苏紫琅汽车集团股份有限公司 Vehicle information identification method and system for road monitoring
CN114881665A (en) * 2021-09-30 2022-08-09 中国电力科学研究院有限公司 Method and system for identifying electricity stealing suspected user based on target identification algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120162449A1 (en) * 2010-12-23 2012-06-28 Matthias Braun Digital image stabilization device and method
CN105184271A (en) * 2015-09-18 2015-12-23 苏州派瑞雷尔智能科技有限公司 Automatic vehicle detection method based on deep learning
CN114898097A (en) * 2022-06-01 2022-08-12 首都师范大学 Image recognition method and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
US20210256252A1 (en) * 2020-02-19 2021-08-19 Kyocera Document Solutions Inc. Learning dataset generation system, learning dataset generation server, and computer readable non-temporary recording medium storing learning dataset generation program
CN113538227A (en) * 2020-04-20 2021-10-22 华为技术有限公司 Image processing method based on semantic segmentation and related equipment
CN111539930A (en) * 2020-04-21 2020-08-14 浙江德尚韵兴医疗科技有限公司 Dynamic ultrasonic breast nodule real-time segmentation and identification method based on deep learning
US20220215662A1 (en) * 2021-01-06 2022-07-07 Dalian University Of Technology Video semantic segmentation method based on active learning
CN114881665A (en) * 2021-09-30 2022-08-09 中国电力科学研究院有限公司 Method and system for identifying electricity stealing suspected user based on target identification algorithm
CN114387506A (en) * 2021-12-11 2022-04-22 深圳供电局有限公司 Transmission tower monitoring method and device, computer equipment and storage medium
CN114842430A (en) * 2022-07-04 2022-08-02 江苏紫琅汽车集团股份有限公司 Vehicle information identification method and system for road monitoring

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WU, ZF et al.: "A Comprehensive Study on Cross-View Gait Based Human Identification with Deep CNNs" *
付路瑶: "Research on Human Abnormal Behavior Recognition in Video Data under Scene Constraints" *
单玉刚 et al.: "A Survey of Scale and Orientation Adaptive Visual Target Tracking Methods" *
孙霄宇 et al.: "Railway Signpost Recognition Method Based on Convolutional Neural Network" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116055722A (en) * 2023-03-06 2023-05-02 山东梁山酿酒总厂有限公司 Data storage method for automatic white spirit production system
CN116055722B (en) * 2023-03-06 2023-06-16 山东梁山酿酒总厂有限公司 Data storage method for automatic white spirit production system

Also Published As

Publication number Publication date
CN115601688B (en) 2023-02-21

Similar Documents

Publication Publication Date Title
CN108460764B (en) Ultrasonic image intelligent segmentation method based on automatic context and data enhancement
CN107016647B (en) Panoramic picture color tone consistency correcting method and system
CN107679502A (en) A kind of Population size estimation method based on the segmentation of deep learning image, semantic
CN102883175B (en) Methods for extracting depth map, judging video scene change and optimizing edge of depth map
CN110598698B (en) Natural scene text detection method and system based on adaptive regional suggestion network
CN115601688B (en) Video main content detection method and system based on deep learning
CN102306307B (en) Positioning method of fixed point noise in color microscopic image sequence
CN106327488B (en) Self-adaptive foreground detection method and detection device thereof
CN109740563B (en) Moving object detection method for video monitoring
CN111768407B (en) Defect detection algorithm based on quick positioning
CN108615043B (en) Video classification method and system
CN111062381B (en) License plate position detection method based on deep learning
KR20150092546A (en) Harmless frame filter and apparatus for harmful image block having the same, method for filtering harmless frame
CN107977645A (en) A kind of news-video poster map generalization method and device
CN112215859A (en) Texture boundary detection method based on deep learning and adjacency constraint
CN110349070B (en) Short video watermark detection method
CN111401368A (en) News video title extraction method based on deep learning
CN113408550B (en) Intelligent weighing management system based on image processing
CN109871790B (en) Video decoloring method based on hybrid neural network model
CN111192213A (en) Image defogging adaptive parameter calculation method, image defogging method and system
CN111080723A (en) Image element segmentation method based on Unet network
CN111010605B (en) Method for displaying video picture-in-picture window
CN107766838B (en) Video scene switching detection method
CN109978858B (en) Double-frame thumbnail image quality evaluation method based on foreground detection
CN107704864A (en) Well-marked target detection method based on image object Semantic detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant