CN115601688A - Video main content detection method and system based on deep learning - Google Patents

Video main content detection method and system based on deep learning

Info

Publication number
CN115601688A
Authority
CN
China
Prior art keywords
image
initial
scale
target
degree
Prior art date
Legal status
Granted
Application number
CN202211609427.2A
Other languages
Chinese (zh)
Other versions
CN115601688B (en)
Inventor
罗鑫凯
王新勇
杨笑
丁振
杨柳
高明亮
高天鸣
Current Assignee
Chinese Translation Entertainment Technology Qingdao Co ltd
Original Assignee
Chinese Translation Entertainment Technology Qingdao Co ltd
Priority date
Filing date
Publication date
Application filed by Chinese Translation Entertainment Technology Qingdao Co ltd filed Critical Chinese Translation Entertainment Technology Qingdao Co ltd
Priority to CN202211609427.2A priority Critical patent/CN115601688B/en
Publication of CN115601688A publication Critical patent/CN115601688A/en
Application granted granted Critical
Publication of CN115601688B publication Critical patent/CN115601688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention relates to the technical field of image processing, and provides a video main content detection method and system based on deep learning, comprising the following steps: acquiring video data; carrying out size adjustment on different targets in the video frame images to obtain a plurality of first initial images forming an expanded data set, and obtaining corresponding first label images; performing inference on the expanded data set to obtain first target images, obtaining the identification accuracy of each category under different scale information through the edge difference between the first target images and the first label images, and thereby obtaining the optimal inference scale of each category; obtaining an initial inference result through tile segmentation and recognition, obtaining the optimal inference scale corresponding to each connected domain, and adjusting the scale to obtain an optimal inference result; and calculating the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area to obtain its subject degree, and further obtaining the main content of the video. The invention aims to solve the problem of poor neural network recognition accuracy caused by different target scales.

Description

Video main content detection method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a video main content detection method and system based on deep learning.
Background
In today's intelligent development, video data often requires post-processing adjustments, and such post-processing should affect the video subject as little as possible. For example, to prevent excessive barrage (bullet-screen comments) from blocking the video, the main body of the video needs to be identified so that a corresponding mask can be generated and the barrage rendering adjusted; the barrage is then not displayed over the main body, occlusion of the main video content is avoided, and the viewing experience is improved. In the field of video compression, the pixel values of non-subject content are often adjusted, sacrificing non-subject pixel detail to increase the spatial redundancy of the video, thereby improving the compression rate and increasing the transmission speed of the video.
In the prior art, the main body in a video is usually identified frame by frame with a semantic segmentation neural network; however, the input data scale of such a network is fixed, while targets in the video appear at different resolution scales, which degrades the recognition accuracy.
Disclosure of Invention
The invention provides a video main content detection method and system based on deep learning, which aim to solve the problem of poor neural network identification accuracy caused by different target scales in the prior art, and adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a method for detecting video main content based on deep learning, including the following steps:
acquiring video data as a training data set and constructing a semantic segmentation recognition network;
acquiring a category label value and a fixed input scale in an identification network, taking the ratio of the scale of a minimum circumscribed rectangle of different targets in each video frame image in a training data set to the fixed input scale as initial scale information of each target, performing different-degree size adjustment on the pixels of each target according to the initial scale information to acquire first initial images of the targets under the information of a plurality of scales, forming an expanded data set by a plurality of first initial images of the different targets in all the video frame images, and acquiring a plurality of first label images by adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images;
reasoning is carried out on the expanded data set through an identification network to obtain a plurality of first target images, edge detection is carried out on a first label image and the first target images of the same first initial image respectively, segmentation difference degrees of all rows in the first initial image are obtained according to edge detection results, the sum of the segmentation difference degrees of all the rows is used as a segmentation error rate of the first initial image, the identification accuracy of each category on the current scale information is obtained according to the segmentation error rate of the first initial image under the same scale information of all targets in the same category, and the optimal reasoning scale of each category is obtained according to the scale information corresponding to the maximum identification accuracy in all the identification accuracy of each category;
performing overlapping tile segmentation on any video frame image according to a fixed input scale, inputting the video frame image into an identification network to obtain an initial reasoning result, obtaining a category label value corresponding to each connected domain in the initial reasoning result, obtaining an optimal reasoning scale of each connected domain according to the category label value, adjusting the pixel size of each initial area in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding category, and inputting the adjusted pixel size into the identification network to obtain an optimal reasoning result, wherein each initial area corresponds to one target area in the optimal reasoning result;
acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring a main body degree matrix corresponding to each video frame, and acquiring the main body content of the video according to the comparison result of the main body degree matrix and the first preset threshold.
Optionally, the obtaining and identifying the category label value and the fixed input scale in the network includes a specific method that:
the identification network manually marks different labels for the categories to be identified in the construction process, and each category label is represented in numerical form, namely as a category label value; the fixed input scale is the fixed transverse length and longitudinal width of the input image data in the identification network.
Optionally, the obtaining of the first initial image of each target under the information of the plurality of scales includes a specific method that:
the scale information comprises transverse length scale information and longitudinal width scale information; starting from the initial scale information, the scale information is changed by a first preset step length to adjust the size of the corresponding area of each target in the video frame image until a first initial image with scale information of 1 is obtained, and each adjustment obtains a first initial image under the corresponding scale information, namely the first initial images of each target under a plurality of scale information.
Optionally, the obtaining of the segmentation difference degree of each line in the first initial image includes a specific method that:
$W_i = \frac{\max(FB_i) \times \max(TB_i)}{2}$
wherein $W_i$ represents the segmentation integrity of the $i$-th row in the first initial image, $\max(FB_i)$ represents the maximum mark value of the $i$-th row in FB, FB representing the edge image of the first label image, with the edge pixels marked 2 and the non-edge pixels marked 0, and $\max(TB_i)$ represents the maximum mark value of the $i$-th row in TB, TB representing the edge image of the first target image, with the edge pixels marked 1 and the non-edge pixels marked 0;
the segmentation difference degree $C_i$ of an $i$-th row in which $\max(FB_i)$ and $\max(TB_i)$ are both 0 is recorded as 0; the segmentation difference degree $C_i$ of an $i$-th row in which $W_i = 0$ and $\max(FB_i)$, $\max(TB_i)$ are not both 0 is recorded as 1; when $W_i = 1$, the deviation degree $p$ of a target edge point in the $i$-th row of TB is obtained as:
$p = \frac{d(P, M)}{d(M, L)}$
wherein $d(P, M)$ denotes the one-dimensional distance between any target edge point $P$ in the $i$-th row of TB and the nearest label edge point $M$ in the same row of FB, and $d(M, L)$ denotes the one-dimensional distance between the label edge point $M$ and the nearest other label edge point $L$ in the same row of FB; the average value of the deviation degrees of all the target edge points in the $i$-th row of TB is taken as the segmentation difference degree $C_i$ of the $i$-th row.
Optionally, the obtaining of the identification accuracy of each category in the current scale information includes the specific method:
$F_b = \frac{1}{N_b}\sum_{x=1}^{N_b} f_x$
wherein $F_b$ represents the identification difference degree of any one category under scale information $b$, $N_b$ represents the number of first initial images of the category under scale information $b$, and $f_x$ denotes the segmentation error rate of the $x$-th of these first initial images;
$Z_b = \frac{F_{\max} - F_b}{F_{\max} - F_{\min}}$
wherein $Z_b$ represents the identification accuracy of the category under scale information $b$, $F_{\max}$ represents the maximum value of the identification difference degree of all the categories under all the scale information, and $F_{\min}$ represents the minimum value of the identification difference degree of all the categories under all the scale information.
Optionally, the obtaining of the initial inference result includes a specific method that:
and performing overlapping tile segmentation on any video frame image to obtain a plurality of tile areas, inputting each tile area into the identification network for reasoning, and overlapping the obtained output results according to the overlapping position relation of the tiles to obtain an initial reasoning result, wherein the scale of each tile area is a fixed input scale.
Optionally, the obtaining of the multi-frame occurrence degree of each target region includes a specific method that:
acquiring the same target area appearing in different video frame images according to the optimal inference result; the multi-frame occurrence degree $D$ of any one target area is calculated as:
$D = \mathrm{ReLU}\!\left(\frac{n}{N} - r\right)$
wherein $n$ indicates the number of frames in which the target area appears, $N$ represents the total number of frames of the video data, $r$ is a second preset threshold, and $\mathrm{ReLU}$ is the RELU function, defined so that values greater than 0 are unchanged and values less than or equal to 0 all become 0; the product of the second preset threshold and the total number of frames is taken as the minimum continuous frame number.
Optionally, the specific obtaining method of the comprehensive area ratio is as follows:
$S = \mathrm{ReLU}\!\left(\frac{1}{G}\sum_{j=1}^{G}\frac{s_j - s_{\min}}{s_{\max} - s_{\min}} - \beta\right)$
wherein $S$ represents the comprehensive area ratio of any one target region, $G$ represents the minimum continuous frame number, $s_j$ indicates the area proportion of the target region in the $j$-th of the minimum continuous frame images, $s_{\max}$ represents the largest area proportion of the target region in the minimum continuous frame images, $s_{\min}$ represents the smallest area proportion of the target region in the minimum continuous frame images, $\beta$ is a third preset threshold, and $\mathrm{ReLU}$ is the RELU function.
Optionally, the specific obtaining method of the comprehensive centering degree is as follows:
$E = 1 - \frac{1}{G}\sum_{j=1}^{G}\frac{\left|x_j - x_c\right|}{H/2}$
wherein $E$ indicates the comprehensive centering degree of any one target region, $G$ represents the minimum continuous frame number, $H$ represents the transverse length in the fixed input scale, $x_j$ indicates the abscissa of the center point of the target area in the $j$-th of the minimum continuous frame images, and $x_c$ represents the abscissa of the center point of the image.
In a second aspect, another embodiment of the present invention provides a system for detecting video subject content based on deep learning, including:
the network construction module is used for acquiring video data as a training data set and constructing a semantic segmentation recognition network;
an input scale module: acquiring class label values and fixed input scales in an identification network, taking the ratio of the scale of a minimum external rectangle of different targets in each video frame image in a training data set to the fixed input scale as initial scale information of each target, performing size adjustment of different degrees on pixels of each target according to the initial scale information to acquire first initial images of the targets under the information of a plurality of scales, forming an expanded data set by a plurality of first initial images of the different targets in all the video frame images, and adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images to acquire a plurality of first label images;
reasoning the enlarged data set through an identification network to obtain a plurality of first target images, respectively carrying out edge detection on a first label image and the first target images of the same first initial image, obtaining segmentation difference degrees of each row in the first initial image according to an edge detection result, taking the sum of the segmentation difference degrees of all the rows as a segmentation error rate of the first initial image, obtaining the identification accuracy of the category in the current scale information according to the segmentation error rate of the first initial image under the same scale information of all targets in the same category, and obtaining the optimal reasoning scale of the category according to the scale information corresponding to the maximum identification accuracy in all the identification accuracy of the category;
an inference recognition module: performing overlapping tile segmentation on any video frame image according to a fixed input scale, inputting the overlapping tile segmentation into a recognition network to obtain an initial reasoning result, obtaining a class label value corresponding to each connected domain in the initial reasoning result, obtaining an optimal reasoning scale of each connected domain corresponding to the class according to the class label value, adjusting the pixel size of each initial region in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding class, inputting the optimal reasoning result into the recognition network to perform secondary reasoning, and obtaining an optimal reasoning result, wherein each initial region corresponds to one target region in the optimal reasoning result;
a main body judging module: acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring a main body degree matrix corresponding to each video frame, and acquiring the main body content of the video according to the comparison result of the main body degree matrix and the first preset threshold.
The beneficial effects of the invention are: the identification accuracy of the target under the premise of different scales is obtained by means of scale adjustment, so that reference is provided for input data of the identification network, the input data can be adjusted in a self-adaptive manner, and the purpose of improving the identification accuracy is achieved; in the real-time reasoning process, first coarse reasoning is carried out in a tile overlapping mode, after an initial area is determined, the size of the identified initial area is adjusted, and more optimal input data are obtained, so that the output result of the network is more optimal, and a more accurate target area identification result is obtained; meanwhile, the main body degree is judged by combining the change expression of time sequence multiframes for the target area, and the video main body content is better detected.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for detecting video main content based on deep learning according to an embodiment of the present invention;
fig. 2 is a block diagram illustrating a video subject content detection system based on deep learning according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a first overlapping tile partition;
FIG. 4 is a schematic diagram of a second overlapping tile partition.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of a method for detecting video main content based on deep learning according to an embodiment of the present invention is shown, where the method includes the following steps:
and S001, acquiring video data and constructing a semantic segmentation recognition network as a training data set.
Constructing a semantic segmentation recognition network, wherein the specific contents are as follows:
(1) Collecting video data in the Internet, taking each video frame image in the video data as a training data set of the network, and manually marking a target to be identified with a marking tool according to a category to which the target belongs to obtain an original label image; it should be noted that the category labels are represented by different numbers, each object corresponds to one original label image according to the category to which the object belongs, each original label image is a binary image, the non-object part is marked as 0, and the object part is marked as a corresponding category label value;
(2) The network structure adopts a U-shaped structure;
(3) The loss function of the network adopts a cross entropy loss function;
The input data scale of the network is $H \times W \times C$, where $H$ is the transverse length of the input data, $W$ is the longitudinal width, and $C$ is the channel information of the input data; for a colour RGB image $C$ is 3, and for a grayscale image $C$ is 1.
Therefore, a semantic segmentation recognition network for recognizing and classifying the targets in the video frame image is constructed, and an original label image with label information of each target is obtained at the same time.
It should be noted that, in the prior art, a semantic segmentation neural network is used to perform pixel-level target identification, and the input of the neural network requires a fixed data scale; for example, if the input data scale of the network is 60 × 100 pixels, only images of that scale can be processed, and images that are too large or too small in an actual scene need to have their scale changed by pooling or interpolation to fit the input requirement of the network; for example, if the original image is 1000 × 600 but the input scale of the network is fixed at 60 × 100, the image needs to be segmented. In an actual scene the target scales are variable; not every video is like a live-streaming video in which the main character occupies more than 50% of the area of a video frame, and when the lens zooms in or out in some videos, the area of the main character in the video frame becomes too small or too large, making a good recognition result difficult to obtain. Meanwhile, when inference is carried out on the whole image, the segmentation of targets with a larger area proportion is generally better than that of targets with a smaller area proportion. The pooling and interpolation operations described above can change the scale of the input image and thus the proportion of the target in the input image, but which proportion ensures optimal recognition accuracy is difficult to determine.
Step S002, carrying out size adjustment of different degrees on different targets in all video frame images in the training data set to obtain a plurality of first initial images to form an enlarged data set, and obtaining a first label image corresponding to each first initial image.
It should be noted that the scales of different targets in each video frame image in the training data set are random and the labels are obtained by manual labeling; identifying the targets of a category only at their original, differing scales does not allow the network to learn the common features of that category during inference, so it is necessary to provide, for each target of each category, several input images at different scales in order to expand the input data set, and at the same time to obtain the corresponding label images for the different-scale input images of the same target.
Specifically, firstly, the different targets of each video frame image in the training data set are extracted according to the labels obtained by manual labeling, and each target is selected by its minimum circumscribed rectangle; the obtained target scale is $h \times w$, wherein $h$ represents the transverse length and $w$ represents the longitudinal width, while the fixed input scale of the recognition network is $H \times W$, so the initial scale information of the target is $(\frac{h}{H}, \frac{w}{W})$; the initial scale information is adjusted to obtain new scale information $(t\cdot\frac{h}{H}, t\cdot\frac{w}{W})$, wherein $t$ is the adjustment parameter. For example, if the original pixel scale of a target in the video frame image is 40 × 80 and the fixed input scale is 60 × 100, the initial scale information of the target is $(\frac{40}{60}, \frac{80}{100})$. In order to obtain input images of the same target at different scales, the pixels can be remapped by pooling, interpolation and similar operations to realize the size adjustment of the target.
It should be further noted that, while the input image is changed, the label image needs to be synchronously changed by using the same mapping relationship, so that the input image and the label image of the target under each scale information can be in one-to-one correspondence.
Specifically, in this embodiment a first preset step length a = 0.05 is adopted; starting from the initial scale information, the adjustment parameter is changed with the first preset step length to obtain new scale information and adjust the size of the corresponding region of each target in the video frame image, that is, t sequentially takes the values 1, 1+a, 1+2a, 1+3a, …, until a first initial image with scale information of 1 is obtained; each adjustment yields a first initial image under the corresponding scale information, namely the first initial images of each target under a plurality of scale information, and each first initial image also yields a corresponding first label image through the corresponding relationship; the several first initial images of all targets in all video frame images are combined into an expanded data set.
At this point, an expanded data set is obtained, the expanded data set includes a plurality of first initial images of each target under different scale information, and each first initial image has a corresponding first label image.
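As an illustration of the scale expansion described above, the following sketch (in Python, assuming OpenCV-style resizing; the function name expand_target_scales and the default step are illustrative and not taken from the embodiment) generates the first initial images and the matching first label images of one target:

import cv2

def expand_target_scales(target_patch, label_patch, fixed_hw, step=0.05):
    # target_patch: pixels inside the target's minimum circumscribed rectangle
    # label_patch:  the corresponding region of the original label image
    # fixed_hw:     (rows, cols) of the fixed input scale of the recognition network
    H, W = fixed_hw
    h, w = target_patch.shape[:2]
    init_h, init_w = h / H, w / W          # initial scale information
    samples = []
    t = 1.0
    while max(t * init_h, t * init_w) <= 1.0:
        new_h, new_w = max(1, round(t * h)), max(1, round(t * w))
        img = cv2.resize(target_patch, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
        # the label image is remapped with the same scaling; nearest keeps label values intact
        lab = cv2.resize(label_patch, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
        samples.append(((t * init_h, t * init_w), img, lab))
        t += step                          # first preset step length
    return samples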
And S003, reasoning the enlarged data set to obtain a plurality of first target images, obtaining the segmentation error rate of the first initial image through the difference of the edge detection result between the corresponding first target image and the first label image, further obtaining the identification accuracy of the same category under the corresponding scale information, and obtaining the optimal reasoning scale of each category according to the identification accuracy.
It should be noted that, the edge image comparison is performed on the first target image and the first label image of the same first initial image, the segmentation difference degree of each line is obtained first, and all the lines are accumulated as the segmentation error rate of the first initial image, at this time, the greater the difference of the edge image, the greater the segmentation error rate, the greater the difference of the first target image and the first label image, that is, the worse the recognition effect is.
Specifically, the method for obtaining a plurality of first target images by reasoning the enlarged data set specifically includes: and inputting each first initial image in the expanded data set into a trained semantic segmentation recognition network, and outputting a first target image of the first initial image.
At this time, the first target image and the first label image are binary images, the target part marked as the corresponding class label value and the non-target part marked as 0. Edge detection is performed on the first target image and the first label image respectively; the edge image of the first label image is marked as FB, with edge pixels marked as 2 and non-edge pixels marked as 0, and the edge image of the first target image is marked as TB, with edge pixels marked as 1 and non-edge pixels marked as 0. Taking the $i$-th row of any first initial image as an example, the segmentation integrity $W_i$ of the $i$-th row is first obtained; the calculation method is:
$W_i = \frac{\max(FB_i) \times \max(TB_i)}{2}$
wherein $\max(FB_i)$ represents the maximum mark value of the $i$-th row in FB and $\max(TB_i)$ represents the maximum mark value of the $i$-th row in TB. At this time, if the $i$-th row in FB has pixels with mark value 2 and the $i$-th row in TB has pixels with mark value 1, the target is identified in that row of the inference result and the segmentation integrity is 1; if the $i$-th row in FB has pixels with mark value 2 but the $i$-th row in TB has no pixel with mark value 1, no target is identified in that row of the inference result and the segmentation integrity is 0; similarly, if the $i$-th row in FB has no pixel with mark value 2 but the $i$-th row in TB has pixels with mark value 1, a part outside the target has been recognized as a target part in the inference result.
Further, if $\max(FB_i)$ and $\max(TB_i)$ are both 0, the row has no edge, and the segmentation difference degree $C_i$ of the $i$-th row is recorded as 0; when $W_i = 0$ and $\max(FB_i)$, $\max(TB_i)$ are not both 0, the difference in target edge recognition is largest, and the segmentation difference degree $C_i$ of the $i$-th row is recorded as 1; for the case of segmentation integrity $W_i = 1$, the deviation degree $p$ of each target edge point in the $i$-th row is first obtained; the specific calculation method is:
$p = \frac{d(P, M)}{d(M, L)}$
wherein $d(P, M)$ represents the one-dimensional distance between any target edge point $P$ in the $i$-th row of TB and the nearest label edge point $M$ in the same row of FB, and $d(M, L)$ represents the one-dimensional distance between the label edge point $M$ and the nearest other label edge point $L$ in the same row of FB. The average value of the deviation degrees of all the target edge points in the $i$-th row of TB is taken as the segmentation difference degree $C_i$ of the $i$-th row; the greater the segmentation difference degree, the poorer the target recognition of the $i$-th row, and when the segmentation difference degree reaches its maximum of 1, the edges of the $i$-th row appear in one of the first target image and the first label image but not the other, and the target recognition result of that row is worst.
Further, the sum of the segmentation difference degrees of all the lines in the first initial image is used as the segmentation error rate of the first initial image, and the greater the segmentation error rate is, the poorer the identification effect of the first initial image corresponding to the target under the current scale information is; the smaller the segmentation error rate is, the better the recognition effect of the corresponding target under the current scale information is.
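A minimal sketch of the row-wise comparison follows (Python/NumPy); it assumes the FB and TB edge maps have already been computed, and the un-clamped distance ratio mirrors the reconstruction of the deviation-degree formula above, so it should be read as an assumption rather than the patent's exact expression:

import numpy as np

def row_segmentation_difference(fb_row, tb_row):
    # fb_row: one row of FB (edge pixels marked 2, non-edge 0)
    # tb_row: one row of TB (edge pixels marked 1, non-edge 0)
    has_fb, has_tb = fb_row.max() == 2, tb_row.max() == 1
    if not has_fb and not has_tb:              # no edge in either image: difference 0
        return 0.0
    if has_fb != has_tb:                       # edge present in only one image: difference 1
        return 1.0
    fb_cols = np.flatnonzero(fb_row == 2)
    tb_cols = np.flatnonzero(tb_row == 1)
    deviations = []
    for p in tb_cols:                          # each target edge point P in the row of TB
        m = fb_cols[np.argmin(np.abs(fb_cols - p))]      # nearest label edge point M in FB
        others = fb_cols[fb_cols != m]
        if others.size == 0:                   # no second label edge point L in this row
            continue
        l = others[np.argmin(np.abs(others - m))]        # label edge point nearest to M
        deviations.append(abs(p - m) / abs(m - l))       # deviation degree of point P
    return float(np.mean(deviations)) if deviations else 0.0

def segmentation_error_rate(fb, tb):
    # sum of the row difference degrees over the whole first initial image
    return sum(row_segmentation_difference(fb[i], tb[i]) for i in range(fb.shape[0]))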
It should be further noted that the mean value of the segmentation error rates of the first initial images of all targets of the same category under the same scale information is used as the identification difference degree of that category under the current scale information; the greater the identification difference degree, the worse the identification effect. The identification accuracy is obtained as the inverse-proportion normalized value of the identification difference degree, and the result of multiplying the scale information with the maximum identification accuracy of the category by the fixed input scale is the optimal inference scale; adjusting all target images of the category to the optimal inference scale then yields the optimal target identification result.
Specifically, taking a certain category under scale information $b$ as an example, the identification difference degree $F_b$ of the category under the current scale information is obtained; the calculation method is:
$F_b = \frac{1}{N_b}\sum_{x=1}^{N_b} f_x$
wherein $N_b$ represents the number of first initial images of the category under scale information $b$, and $f_x$ denotes the segmentation error rate of the $x$-th first initial image; the larger the obtained identification difference degree, the poorer the recognition effect of the category under the current scale information. Meanwhile, to keep the identification accuracy within the interval $[0, 1]$, inverse proportion linear normalization is performed on the identification difference degree to obtain the identification accuracy; still taking the category under scale information $b$ as an example, the identification accuracy $Z_b$ of the category under the current scale information is calculated as:
$Z_b = \frac{F_{\max} - F_b}{F_{\max} - F_{\min}}$
wherein $F_{\max}$ represents the maximum value of the identification difference degree of all categories under all scale information and $F_{\min}$ represents the minimum value of the identification difference degree of all categories under all scale information; at this time, within the same category, the greater the identification accuracy under a given scale information, the better the target recognition effect of the first initial images of the different targets of that category under the corresponding scale information.
Selecting scale information corresponding to the maximum identification accuracy rate in all identification accuracy rates of the same category as the optimal reasoning scale information of the corresponding category, and taking the product of the optimal reasoning scale information and the fixed input scale as the optimal reasoning scale of the corresponding category; thus, the optimal inference scale of different targets in each category is obtained, and the first initial image of the target under the optimal inference scale is used as the input of the recognition network, so that the optimal target recognition result can be obtained.
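The selection of the optimal inference scale can be sketched as below; the nested-dictionary bookkeeping (category to scale information to list of segmentation error rates) is an assumed data layout, not something specified by the embodiment:

def optimal_inference_scales(error_rates, fixed_hw):
    # error_rates[c][s]: list of segmentation error rates of the first initial images of
    #                    category c under scale information s (s is a (th, tw) tuple)
    # fixed_hw:          the fixed input scale (H, W) of the recognition network
    diff = {c: {s: sum(v) / len(v) for s, v in scales.items()}     # identification difference degree
            for c, scales in error_rates.items()}
    all_vals = [d for scales in diff.values() for d in scales.values()]
    d_max, d_min = max(all_vals), min(all_vals)
    H, W = fixed_hw
    best = {}
    for c, scales in diff.items():
        # inverse-proportion linear normalization gives the identification accuracy
        acc = {s: (d_max - d) / (d_max - d_min + 1e-12) for s, d in scales.items()}
        s_best = max(acc, key=acc.get)                             # scale info with highest accuracy
        best[c] = (round(s_best[0] * H), round(s_best[1] * W))     # optimal inference scale in pixels
    return best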
And S004, obtaining an initial reasoning result through overlapping tile segmentation and recognition reasoning, obtaining a corresponding optimal reasoning scale for each connected domain in the initial reasoning result according to the corresponding class label value, and carrying out scale adjustment on the input image through the optimal reasoning scale to obtain an optimal reasoning result.
It should be noted that, in order to avoid omission of the recognition and inference results caused by a large scale of the video frame image, overlapping tiles are used for performing overlapping segmentation on the video frame image, the segmented tile regions are recognized through a recognition network and are overlapped according to the tile overlapping position relationship to obtain an initial inference result of each video frame image, and a connected domain corresponding to each category of targets in the video frame image is roughly determined; and then, adjusting according to the corresponding optimal inference scale of the class label value corresponding to the connected domain, so that different targets with different scales in each video frame image can achieve the optimal inference result.
Referring to FIG. 3, the first overlapping segmentation of the video frame image with the fixed input scale $H \times W$ is shown; the left image shows the scale and the moving direction of the tile, the arrow represents the moving direction, and the dashed box represents the moving tile; the right image is the first tile overlapping segmentation result, and the dashed boxes represent the segmented tile areas; to avoid large errors in the segmentation identification, new tile coverage is needed at the segmentation boundaries.
Referring to FIG. 4, the second overlapping segmentation of the video frame image with the fixed input scale $H \times W$ is shown; the boundary lines of the tiles from the first segmentation are taken as reference lines to determine the segmentation of the new tiles; the newly added dashed boxes in FIG. 4 represent the newly added tile segmentation areas, which cover the boundaries of the first overlapping segmentation to avoid large errors.
Inputting a plurality of tile areas obtained by two-time tile overlapping and dividing into an identification network for reasoning, and superposing obtained output results according to the overlapping position relation of the tiles to obtain an initial reasoning result; at the moment, the initial reasoning result of any video frame image comprises a plurality of connected domains, each connected domain is an initial recognition result of a target, and simultaneously due to the influence of a category label value, namely an original label image, the pixel value of the same target in the initial recognition result is represented as a corresponding category label value, so that a corresponding optimal reasoning scale can be obtained according to the category label value, the size of each initial region in the video frame image corresponding to each connected domain is adjusted according to the optimal reasoning scale, the initial regions are input into a recognition network to obtain an output result, and the output result is combined according to the corresponding position distribution of each initial region in the video frame image to obtain an optimal reasoning result; at the moment, each initial region respectively corresponds to one target region in the optimal reasoning result, and the optimal reasoning result realizes the identification reasoning result which is optimal for the targets with different scales in the video frame image.
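The two-pass overlapping tiling and the superposition into an initial inference result might look like the following sketch, where infer stands for the trained recognition network and the half-tile offset of the second pass is an assumption consistent with FIG. 3 and FIG. 4:

import numpy as np

def overlapped_tile_inference(frame, infer, tile_hw):
    # frame:   one video frame image, shape (H, W) or (H, W, C)
    # infer:   the trained recognition network, mapping a tile of shape tile_hw to a label map
    # tile_hw: the fixed input scale (rows, cols)
    th, tw = tile_hw
    H, W = frame.shape[:2]
    out = np.zeros((H, W), dtype=np.int64)

    def run_pass(y0, x0):
        for y in range(y0, H, th):
            for x in range(x0, W, tw):
                y1, x1 = min(y + th, H), min(x + tw, W)
                tile = np.zeros((th, tw) + frame.shape[2:], dtype=frame.dtype)
                tile[: y1 - y, : x1 - x] = frame[y:y1, x:x1]
                pred = infer(tile)                                 # class label value per pixel
                # superpose according to the overlapping position of the tiles
                out[y:y1, x:x1] = np.maximum(out[y:y1, x:x1], pred[: y1 - y, : x1 - x])

    run_pass(0, 0)               # first overlapping segmentation (FIG. 3)
    run_pass(th // 2, tw // 2)   # second segmentation offset to cover the first boundaries (FIG. 4)
    return out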
And S005, acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the three characteristics, and acquiring the main body content of the video according to the comparison result of the main body degree and the first preset threshold.
It should be noted that, in step S004, the target area of the video frame image is only different targets that appear in the video and need to be noticed, and is not all the main content in the video, and it is necessary to determine whether the target area is the main content of the video according to the appearance of the targets in the time sequence and the area distribution relationship; the main content in the video is always appeared for a long time, distributed in the center of the video and has a large area in the video frame image; the larger the occurrence degree of multiple frames in the target area is, the larger the comprehensive area proportion is, the larger the comprehensive centering degree is, and the higher the possibility that the target area is the main content of the video is, namely the larger the main degree is; and then, forming a main body degree matrix according to the main body degrees of different target areas in the video frame image, and comparing the main body degree matrix with a first preset threshold value, wherein the main body degree matrix is the main body content of the video if the main body degree matrix is larger than the first preset threshold value.
Specifically, the same target area appearing in the optimal inference results of different video frame images is obtained through the IOU method, and the multi-frame occurrence degree $D$ of any one target area is calculated as:
$D = \mathrm{ReLU}\!\left(\frac{n}{N} - r\right)$
wherein $n$ indicates the number of frames in which the target area appears, $N$ represents the total number of frames of the video data, $r$ is the second preset threshold (a fixed value is used in this embodiment), and $\mathrm{ReLU}$ is the RELU function, defined so that values greater than 0 are unchanged and values less than or equal to 0 all become 0; the product of the second preset threshold and the total number of frames is taken as the minimum continuous frame number. The more frames in which the target region appears in the video, the greater the multi-frame occurrence degree, and the longer the time of appearance in the video, the greater the possibility of being the subject content.
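Expressed in code (following the reconstructed formula above, so the exact form and the example threshold r = 0.3 are assumptions):

def multi_frame_occurrence(n_frames, total_frames, r=0.3):
    # n_frames:     number of frames in which the target area appears
    # total_frames: total number of frames of the video data
    # r:            second preset threshold (example value, not specified here)
    relu = lambda v: v if v > 0 else 0.0
    return relu(n_frames / total_frames - r)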
The comprehensive area ratio $S$ of any one target area is calculated as:
$S = \mathrm{ReLU}\!\left(\frac{1}{G}\sum_{j=1}^{G}\frac{s_j - s_{\min}}{s_{\max} - s_{\min}} - \beta\right)$
wherein $G$ represents the minimum continuous frame number, $s_j$ indicates the area proportion of the target area in the $j$-th of the minimum continuous frame images, $s_{\max}$ represents the largest area proportion of the target region in the minimum continuous frame images, $s_{\min}$ represents the smallest area proportion of the target region in the minimum continuous frame images, $\beta$ is the third preset threshold (a fixed value is used in this embodiment), and $\mathrm{ReLU}$ is the RELU function. The larger the mean area proportion of the target area over the minimum continuous frames, the larger the comprehensive area ratio, the larger the area occupied in the video frame image, and the higher the possibility of being the main content.
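A corresponding sketch (again following the reconstructed formula, with beta = 0.1 only as an example value):

def comprehensive_area_ratio(area_ratios, beta=0.1):
    # area_ratios: area proportion of the target region in each of the minimum continuous frame images
    # beta:        third preset threshold (example value)
    relu = lambda v: v if v > 0 else 0.0
    s_max, s_min = max(area_ratios), min(area_ratios)
    span = (s_max - s_min) or 1.0          # avoid division by zero when the area is constant
    norm_mean = sum((s - s_min) / span for s in area_ratios) / len(area_ratios)
    return relu(norm_mean - beta)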
The comprehensive centering degree $E$ of any one target area is calculated as:
$E = 1 - \frac{1}{G}\sum_{j=1}^{G}\frac{\left|x_j - x_c\right|}{H/2}$
wherein $G$ represents the minimum continuous frame number, $H$ represents the transverse length in the fixed input scale, $x_j$ indicates the abscissa of the center point of the target area in the $j$-th of the minimum continuous frame images, and $x_c$ represents the abscissa of the center point of the image. The larger the mean one-dimensional abscissa distance between the center of the target area and the center of the video frame image over the minimum continuous frames, the closer it is to the maximum distance of half the transverse length in the fixed input scale, the smaller the comprehensive centering degree, and the lower the possibility that the target area is the subject content.
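And the centering degree, following the reconstruction above:

def comprehensive_centering(center_xs, lateral_len):
    # center_xs:   abscissa of the target region's center point in each of the minimum continuous frames
    # lateral_len: transverse length H of the fixed input scale; H/2 is the maximum possible distance
    x_c = lateral_len / 2.0                # abscissa of the image center point
    mean_dist = sum(abs(x - x_c) for x in center_xs) / len(center_xs)
    return 1.0 - mean_dist / (lateral_len / 2.0)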
Further, the product of the multi-frame occurrence degree $D$, the comprehensive area ratio $S$ and the comprehensive centering degree $E$ of the target area is taken as the subject degree of the target region and linearly normalized, the subject degree being recorded as $Q$; the greater the subject degree, the greater the likelihood that the target area is the video subject content.
The pixel value of each target area in the optimal inference result of the video frame image is replaced with the corresponding subject degree value, while the pixel value of the non-target part in the optimal inference result is already 0; the pixel values of the points of the replaced video frame image form the subject degree matrix. A first preset threshold is given in this embodiment; the parts whose subject degree is greater than the first preset threshold are marked as 1, and the rest are marked as 0. The image obtained after this judgment and marking is binary mask data for the main content; the main content in the video frame image can be extracted by operating on the video frame image according to the binary mask data, and corresponding operations such as compression or barrage rendering can then be performed.
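The replacement of target-region pixels by their subject degree and the thresholding into binary mask data can be sketched as follows; representing each target region by a distinct region id and the threshold value 0.5 are assumptions for illustration:

import numpy as np

def subject_mask(region_map, region_degrees, threshold=0.5):
    # region_map:     per-pixel region id of the optimal inference result (0 for non-target pixels)
    # region_degrees: region id -> linearly normalized subject degree of that target region
    # threshold:      first preset threshold (example value)
    degree_matrix = np.zeros(region_map.shape, dtype=np.float32)
    for region_id, degree in region_degrees.items():
        degree_matrix[region_map == region_id] = degree   # subject degree matrix
    mask = (degree_matrix > threshold).astype(np.uint8)   # 1 where subject content, 0 elsewhere
    return degree_matrix, mask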
Referring to fig. 2, a block diagram of a system for detecting video subject content based on deep learning according to an embodiment of the present invention is shown, where the system includes:
and the network construction module S101 is used for acquiring video data and constructing a semantic segmentation recognition network as a training data set.
Input scale module S102:
(1) Carrying out size adjustment of different degrees on different targets in all video frame images in a training data set to obtain a plurality of first initial images to form an enlarged data set, and obtaining a first label image corresponding to each first initial image;
(2) Reasoning is carried out on the enlarged data set to obtain a plurality of first target images, the segmentation error rate of the first initial image is obtained through the difference of the edge detection result between the corresponding first target image and the corresponding first label image, the identification accuracy of the same category under the corresponding scale information is further obtained, and the optimal reasoning scale of each category is obtained according to the identification accuracy.
The inference recognition module S103: and obtaining an initial reasoning result through overlapping tile segmentation and recognition reasoning, obtaining a corresponding optimal reasoning scale for each connected domain in the initial reasoning result according to the corresponding class label value, and carrying out scale adjustment on the input image through the optimal reasoning scale to obtain an optimal reasoning result.
The main body judgment module S104: acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, acquiring the main body degree of each target area according to the three characteristics, and acquiring the main body content of the video according to the comparison result of the main body degree and the first preset threshold.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. The video main content detection method based on deep learning is characterized by comprising the following steps:
acquiring video data as a training data set and constructing a semantic segmentation recognition network;
acquiring a category label value and a fixed input scale in an identification network, taking the ratio of the scale of a minimum circumscribed rectangle of different targets in each video frame image in a training data set to the fixed input scale as initial scale information of each target, performing different-degree size adjustment on the pixels of each target according to the initial scale information to acquire first initial images of the targets under the information of a plurality of scales, forming an expanded data set by a plurality of first initial images of the different targets in all the video frame images, and acquiring a plurality of first label images by adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images;
reasoning is carried out on the expanded data set through an identification network to obtain a plurality of first target images, edge detection is carried out on a first label image and the first target image of the same first initial image respectively, segmentation difference degrees of all rows in the first initial image are obtained according to edge detection results, the sum of the segmentation difference degrees of all the rows is used as a segmentation error rate of the first initial image, the identification accuracy of each category in the current scale information is obtained according to the segmentation error rate of the first initial image under the same scale information of all targets in the same category, and the optimal reasoning scale of each category is obtained according to the scale information corresponding to the maximum identification accuracy in all the identification accuracy of each category;
performing overlapping tile segmentation on any video frame image according to a fixed input scale, inputting the video frame image into an identification network to obtain an initial reasoning result, obtaining a category label value corresponding to each connected domain in the initial reasoning result, obtaining an optimal reasoning scale of each connected domain according to the category label value, adjusting the pixel size of each initial area in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding category, and inputting the adjusted pixel size into the identification network to obtain an optimal reasoning result, wherein each initial area corresponds to one target area in the optimal reasoning result;
acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, obtaining the main body degree of each target area according to its multi-frame occurrence degree, comprehensive area ratio and comprehensive centering degree, obtaining the main body degree matrix corresponding to each video frame, and obtaining the main body content of the video according to the comparison result between the main body degree matrix and a first preset threshold.
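As a reading aid for the second step of claim 1, the sketch below derives the initial scale information of one target from its minimum circumscribed rectangle and the fixed input scale; it assumes an axis-aligned rectangle and one binary mask per target, which the claim does not mandate.

import numpy as np

def initial_scale_info(target_mask, fixed_hw):
    # Ratio of the target's minimum circumscribed rectangle to the fixed input
    # scale, returned as (transverse length, longitudinal width) scale information.
    ys, xs = np.nonzero(target_mask)
    rect_h = ys.max() - ys.min() + 1
    rect_w = xs.max() - xs.min() + 1
    fixed_h, fixed_w = fixed_hw
    return rect_w / fixed_w, rect_h / fixed_h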
2. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the category label values and the fixed input scale in the recognition network comprises the following specific method:
during construction of the recognition network, different labels are manually assigned to the categories to be recognized, and each category label is expressed in numerical form as a category label value; the fixed input scale is the fixed transverse length and longitudinal width scale of the image data input into the recognition network.
3. The method for detecting video main content based on deep learning according to claim 1, wherein the obtaining of the first initial images of each target under a plurality of pieces of scale information comprises the following specific method:
the scale information comprises transverse length scale information and longitudinal width scale information; starting from the initial scale information, the size of the area corresponding to each target in the video frame image is adjusted by changing the scale information by a first preset step length until a first initial image whose scale information is 1 is obtained, and each adjustment yields a first initial image under the corresponding scale information, namely the first initial images of each target under a plurality of pieces of scale information.
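A minimal sketch of the stepping described in claim 3, assuming the first preset step length is 0.1 and that both scale components are stepped toward 1; the step value, rounding and stopping test are illustrative choices, not values from the patent.

import numpy as np
import cv2

def first_initial_images(region, label, init_scale, fixed_hw, step=0.1):
    # Generate (scale info, image, label) triples from the initial scale
    # information up to scale information 1, resizing image and label together.
    fixed_h, fixed_w = fixed_hw
    sx, sy = init_scale
    triples = []
    while True:
        w = max(1, int(round(sx * fixed_w)))
        h = max(1, int(round(sy * fixed_h)))
        img = cv2.resize(region, (w, h))
        lab = cv2.resize(label, (w, h), interpolation=cv2.INTER_NEAREST)
        triples.append(((sx, sy), img, lab))
        if abs(sx - 1.0) < 1e-9 and abs(sy - 1.0) < 1e-9:
            break
        sx = sx + step * np.sign(1.0 - sx) if abs(sx - 1.0) > step else 1.0
        sy = sy + step * np.sign(1.0 - sy) if abs(sy - 1.0) > step else 1.0
    return triples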
4. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the segmentation difference degree of each row in the first initial image comprises the following specific steps:
recording the edge image of the first label image as FB, in which edge pixels are marked as 2 and non-edge pixels are marked as 0, and recording the edge image of the first target image as TB, in which edge pixels are marked as 1 and non-edge pixels are marked as 0; the segmentation completeness of the i-th row of the first initial image is determined from F_i, the maximum marked value of the i-th row of FB, and T_i, the maximum marked value of the i-th row of TB;
when F_i and T_i are both 0, the segmentation difference degree d_i of the i-th row is marked as 0; when F_i and T_i are not both 0 but at least one of them is 0, the segmentation difference degree d_i of the i-th row is marked as 1;
when F_i and T_i are both non-zero, the deviation degree of each target edge point P in the i-th row of TB is obtained as:
d(P) = |PM| / |ML|
wherein |PM| represents the one-dimensional distance between the edge point P of any target in the i-th row of TB and the nearest label edge point M in the same row of FB, and |ML| represents the one-dimensional distance between the label edge point M in the i-th row of FB and the nearest label edge point L in the same row; the mean value of the deviation degrees of all target edge points in the i-th row of TB is taken as the segmentation difference degree d_i of the i-th row.
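The row rule of claim 4 translates almost directly into code; the sketch below assumes FB and TB are 2-D arrays marked exactly as described (2/0 and 1/0) and uses the ratio form of the deviation degree given above, with a fallback when a row of FB contains only one label edge point.

import numpy as np

def row_segmentation_difference(FB, TB):
    # Per-row segmentation difference degree between the label edge image FB
    # and the target edge image TB; summing the result gives the segmentation
    # error rate of the first initial image.
    diffs = []
    for fb_row, tb_row in zip(FB, TB):
        f_max, t_max = fb_row.max(), tb_row.max()
        if f_max == 0 and t_max == 0:
            diffs.append(0.0)                      # neither image has an edge in this row
        elif f_max == 0 or t_max == 0:
            diffs.append(1.0)                      # an edge appears in only one of the two images
        else:
            label_x = np.flatnonzero(fb_row)       # columns of label edge points in this row
            devs = []
            for p in np.flatnonzero(tb_row):       # every target edge point P in this row
                m_idx = int(np.argmin(np.abs(label_x - p)))
                m = label_x[m_idx]
                others = np.delete(label_x, m_idx)
                if others.size == 0:               # no second label point: fall back to |PM|
                    devs.append(abs(int(m) - int(p)))
                else:
                    l = others[np.argmin(np.abs(others - m))]
                    devs.append(abs(int(m) - int(p)) / abs(int(l) - int(m)))
            diffs.append(float(np.mean(devs)))
    return np.array(diffs)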
5. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the recognition accuracy of each category under the current scale information comprises the following specific steps:
the recognition difference degree E(c, r) of any one category c under scale information r is:
E(c, r) = (1/K) * Σ_{k=1}^{K} e_k
wherein K represents the number of input first initial images of the category c under the scale information r, and e_k represents the segmentation error rate of the k-th first initial image;
the recognition accuracy A(c, r) of the category c under the scale information r is:
A(c, r) = (E_max - E(c, r)) / (E_max - E_min)
wherein E_max represents the maximum value of the recognition difference degree of all categories under all scale information, and E_min represents the minimum value of the recognition difference degree of all categories under all scale information.
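Following the reconstruction of claim 5 above, the sketch below averages the segmentation error rates per (category, scale information) pair, normalizes them across all pairs, and picks the optimal reasoning scale of each category; the averaging step stands in for a formula that is published only as an image.

import numpy as np

def optimal_scales(error_rates):
    # error_rates: {(category, scale_info): [segmentation error rate of each
    # first initial image of that category under that scale information]}.
    diff = {key: float(np.mean(v)) for key, v in error_rates.items()}   # recognition difference degree
    d_max, d_min = max(diff.values()), min(diff.values())
    span = (d_max - d_min) or 1.0                                       # guard against a zero span
    acc = {key: (d_max - d) / span for key, d in diff.items()}          # recognition accuracy
    best = {}
    for (cat, scale), a in acc.items():
        if cat not in best or a > best[cat][1]:
            best[cat] = (scale, a)
    return acc, {cat: scale for cat, (scale, _) in best.items()}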
6. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the initial reasoning result comprises the following specific method:
performing overlapping tile segmentation on any video frame image to obtain a plurality of tile areas, wherein the scale of each tile area is the fixed input scale; inputting each tile area into the recognition network for inference, and superimposing the obtained output results according to the overlapping position relation of the tiles to obtain the initial reasoning result.
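A sketch of the overlapping tile segmentation of claim 6, assuming frames are at least as large as the fixed input scale and a 50% overlap; the claim only says the tile outputs are superimposed according to their positions, so the merge rule here (later tiles overwrite earlier ones) is an assumption.

import numpy as np

def tile_and_stitch(frame, model, fixed_hw, overlap=0.5):
    # Run the recognition network on overlapping tiles of the fixed input scale
    # and assemble the tile outputs back into a full-frame label map.
    th, tw = fixed_hw
    H, W = frame.shape[:2]
    step_y = max(1, int(th * (1 - overlap)))
    step_x = max(1, int(tw * (1 - overlap)))
    ys = sorted(set(list(range(0, max(H - th, 0) + 1, step_y)) + [max(H - th, 0)]))
    xs = sorted(set(list(range(0, max(W - tw, 0) + 1, step_x)) + [max(W - tw, 0)]))
    result = np.zeros((H, W), dtype=np.int32)
    for y in ys:
        for x in xs:
            tile = frame[y:y + th, x:x + tw]
            result[y:y + th, x:x + tw] = model(tile)   # per-pixel category labels for this tile
    return result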
7. The method for detecting video main content based on deep learning of claim 1, wherein the obtaining of the multi-frame occurrence degree of each target area comprises the following specific method:
acquiring the same target area appearing in different video frame images according to the optimal reasoning result;
(formula given as an image in the original publication)
wherein G represents the multi-frame occurrence degree of any one target area, n represents the number of frames in which the target area appears, N represents the total number of frames of the video data, T2 is a second preset threshold, and ReLU is the ReLU function, defined such that values greater than 0 are unchanged and values less than or equal to 0 become 0; the product of the second preset threshold and the total frame number is taken as the minimum continuous frame number.
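The multi-frame occurrence formula survives only as an image; from the variables listed (occurrence frame count n, total frame count N, second preset threshold T2, ReLU), one plausible form is ReLU(n/N - T2), and that assumed form is what the sketch below computes, together with the minimum continuous frame number T2 * N that the claim does define.

def multi_frame_occurrence(n_frames, total_frames, t2=0.1):
    # Assumed form ReLU(n/N - T2); the 0.1 default for the second preset
    # threshold is illustrative, not taken from the patent.
    relu = lambda v: v if v > 0 else 0.0
    min_continuous_frames = t2 * total_frames
    return relu(n_frames / total_frames - t2), min_continuous_frames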
8. The method for detecting video main content based on deep learning of claim 7, wherein the comprehensive area ratio is obtained as follows:
(formula given as an image in the original publication)
wherein Z represents the comprehensive area ratio of any one target area, m represents the minimum continuous frame number, s_j represents the area ratio of the target area in the j-th frame image of the minimum continuous frame number, s_max represents the largest area ratio of the target area in the minimum continuous frame number images, s_min represents the smallest area ratio of the target area in the minimum continuous frame number images, T3 is a third preset threshold, and ReLU is the ReLU function.
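The comprehensive area ratio formula is likewise published only as an image; given the listed variables (per-frame area ratios over the minimum continuous frame number, their maximum and minimum, a third preset threshold and ReLU), the sketch below uses one plausible reading, a thresholded mean damped by the max-min spread, purely as an assumption.

import numpy as np

def comprehensive_area_ratio(area_ratios, t3=0.05):
    # area_ratios: area ratio of one target area in each of the minimum
    # continuous frame number images.  Assumed combination, not the published formula.
    relu = lambda v: v if v > 0 else 0.0
    s = np.asarray(area_ratios, dtype=float)
    spread = s.max() - s.min()                 # a large spread means an unstable area
    return relu(s.mean() - t3) * (1.0 - spread)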
9. The method for detecting video main content based on deep learning of claim 7, wherein the comprehensive centering degree is obtained by:
(formula given as an image in the original publication)
wherein C represents the comprehensive centering degree of any one target area, m represents the minimum continuous frame number, W represents the transverse length dimension of the fixed input scale, x_j represents the abscissa of the center point of the target area in the j-th frame image of the minimum continuous frame number, and x_c represents the abscissa of the center point of the image.
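The comprehensive centering degree formula is also an image in the published text; its variables (the transverse length of the fixed input scale, the per-frame abscissa of the target center and the abscissa of the image center) suggest an average of how close the target stays to the horizontal center, which the sketch below computes under that assumption.

import numpy as np

def comprehensive_centering(center_xs, lateral_length):
    # center_xs: abscissa of the target area's center point in each of the
    # minimum continuous frame number images.  Assumed form, not the published one.
    x = np.asarray(center_xs, dtype=float)
    x_c = lateral_length / 2.0                 # abscissa of the image center point
    half = lateral_length / 2.0
    return float(np.mean(1.0 - np.abs(x - x_c) / half))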
10. A video main content detection system based on deep learning, characterized in that the system comprises:
a network construction module, used for acquiring video data as a training data set and constructing a semantic segmentation recognition network;
an input scale module: acquiring category label values and a fixed input scale in the recognition network, taking the ratio of the scale of the minimum circumscribed rectangle of each different target in each video frame image in the training data set to the fixed input scale as the initial scale information of that target, resizing the pixels of each target to different degrees according to the initial scale information to acquire first initial images of each target under a plurality of pieces of scale information, forming an expanded data set from the plurality of first initial images of the different targets in all the video frame images, and adjusting the sizes of the original label images corresponding to the targets according to the scale information of the first initial images to acquire a plurality of first label images;
performing inference on the expanded data set through the recognition network to obtain a plurality of first target images; performing edge detection on the first label image and the first target image of the same first initial image respectively; obtaining the segmentation difference degree of each row in the first initial image according to the edge detection results, and taking the sum of the segmentation difference degrees of all rows as the segmentation error rate of the first initial image; obtaining the recognition accuracy of the category under the current scale information according to the segmentation error rates of the first initial images of all targets of the same category under the same scale information, and taking the scale information corresponding to the maximum recognition accuracy among all recognition accuracies of the category as the optimal reasoning scale of that category;
an inference recognition module: performing overlapping tile segmentation on any video frame image according to the fixed input scale and inputting the tiles into the recognition network to obtain an initial reasoning result; obtaining the category label value corresponding to each connected domain in the initial reasoning result and obtaining the optimal reasoning scale of the corresponding category for each connected domain according to its category label value; adjusting the pixel size of each initial region in the video frame image corresponding to each connected domain according to the optimal reasoning scale of the corresponding category, and inputting the adjusted initial regions into the recognition network for secondary inference to obtain an optimal reasoning result, wherein each initial region corresponds to one target area in the optimal reasoning result;
a main body judgment module: acquiring the multi-frame occurrence degree, the comprehensive area ratio and the comprehensive centering degree of each target area, obtaining the main body degree of each target area according to its multi-frame occurrence degree, comprehensive area ratio and comprehensive centering degree, obtaining the main body degree matrix corresponding to each video frame, and obtaining the main body content of the video according to the comparison result between the main body degree matrix and the first preset threshold.
CN202211609427.2A 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning Active CN115601688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211609427.2A CN115601688B (en) 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211609427.2A CN115601688B (en) 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN115601688A true CN115601688A (en) 2023-01-13
CN115601688B CN115601688B (en) 2023-02-21

Family

ID=84854198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211609427.2A Active CN115601688B (en) 2022-12-15 2022-12-15 Video main content detection method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115601688B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116055722A (en) * 2023-03-06 2023-05-02 山东梁山酿酒总厂有限公司 Data storage method for automatic white spirit production system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539930A (en) * 2020-04-21 2020-08-14 浙江德尚韵兴医疗科技有限公司 Dynamic ultrasonic breast nodule real-time segmentation and identification method based on deep learning
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
US20210256252A1 (en) * 2020-02-19 2021-08-19 Kyocera Document Solutions Inc. Learning dataset generation system, learning dataset generation server, and computer readable non-temporary recording medium storing learning dataset generation program
CN113538227A (en) * 2020-04-20 2021-10-22 华为技术有限公司 Image processing method based on semantic segmentation and related equipment
CN114387506A (en) * 2021-12-11 2022-04-22 深圳供电局有限公司 Transmission tower monitoring method and device, computer equipment and storage medium
US20220215662A1 (en) * 2021-01-06 2022-07-07 Dalian University Of Technology Video semantic segmentation method based on active learning
CN114842430A (en) * 2022-07-04 2022-08-02 江苏紫琅汽车集团股份有限公司 Vehicle information identification method and system for road monitoring
CN114881665A (en) * 2021-09-30 2022-08-09 中国电力科学研究院有限公司 Method and system for identifying electricity stealing suspected user based on target identification algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120162449A1 (en) * 2010-12-23 2012-06-28 Matthias Braun Digital image stabilization device and method
CN105184271A (en) * 2015-09-18 2015-12-23 苏州派瑞雷尔智能科技有限公司 Automatic vehicle detection method based on deep learning
CN114898097A (en) * 2022-06-01 2022-08-12 首都师范大学 Image recognition method and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
US20210256252A1 (en) * 2020-02-19 2021-08-19 Kyocera Document Solutions Inc. Learning dataset generation system, learning dataset generation server, and computer readable non-temporary recording medium storing learning dataset generation program
CN113538227A (en) * 2020-04-20 2021-10-22 华为技术有限公司 Image processing method based on semantic segmentation and related equipment
CN111539930A (en) * 2020-04-21 2020-08-14 浙江德尚韵兴医疗科技有限公司 Dynamic ultrasonic breast nodule real-time segmentation and identification method based on deep learning
US20220215662A1 (en) * 2021-01-06 2022-07-07 Dalian University Of Technology Video semantic segmentation method based on active learning
CN114881665A (en) * 2021-09-30 2022-08-09 中国电力科学研究院有限公司 Method and system for identifying electricity stealing suspected user based on target identification algorithm
CN114387506A (en) * 2021-12-11 2022-04-22 深圳供电局有限公司 Transmission tower monitoring method and device, computer equipment and storage medium
CN114842430A (en) * 2022-07-04 2022-08-02 江苏紫琅汽车集团股份有限公司 Vehicle information identification method and system for road monitoring

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WU, ZF et al.: "A Comprehensive Study on Cross-View Gait Based Human Identification with Deep CNNs" *
付路瑶: "Research on Human Abnormal Behavior Recognition in Video Data under Scene Constraints" *
单玉刚 et al.: "A Survey of Scale and Orientation Adaptive Visual Target Tracking Methods" *
孙霄宇 et al.: "Railway Signpost Recognition Method Based on Convolutional Neural Network" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116055722A (en) * 2023-03-06 2023-05-02 山东梁山酿酒总厂有限公司 Data storage method for automatic white spirit production system
CN116055722B (en) * 2023-03-06 2023-06-16 山东梁山酿酒总厂有限公司 Data storage method for automatic white spirit production system

Also Published As

Publication number Publication date
CN115601688B (en) 2023-02-21

Similar Documents

Publication Publication Date Title
CN108460764B (en) Ultrasonic image intelligent segmentation method based on automatic context and data enhancement
CN107016647B (en) Panoramic picture color tone consistency correcting method and system
CN107679502A (en) A kind of Population size estimation method based on the segmentation of deep learning image, semantic
CN102883175B (en) Methods for extracting depth map, judging video scene change and optimizing edge of depth map
CN110598698B (en) Natural scene text detection method and system based on adaptive regional suggestion network
CN115601688B (en) Video main content detection method and system based on deep learning
CN102306307B (en) Positioning method of fixed point noise in color microscopic image sequence
CN106327488B (en) Self-adaptive foreground detection method and detection device thereof
CN109740563B (en) Moving object detection method for video monitoring
CN111768407B (en) Defect detection algorithm based on quick positioning
CN108615043B (en) Video classification method and system
CN111062381B (en) License plate position detection method based on deep learning
KR20150092546A (en) Harmless frame filter and apparatus for harmful image block having the same, method for filtering harmless frame
CN107977645A (en) A kind of news-video poster map generalization method and device
CN112215859A (en) Texture boundary detection method based on deep learning and adjacency constraint
CN110349070B (en) Short video watermark detection method
CN111401368A (en) News video title extraction method based on deep learning
CN113408550B (en) Intelligent weighing management system based on image processing
CN109871790B (en) Video decoloring method based on hybrid neural network model
CN111192213A (en) Image defogging adaptive parameter calculation method, image defogging method and system
CN111080723A (en) Image element segmentation method based on Unet network
CN111010605B (en) Method for displaying video picture-in-picture window
CN107766838B (en) Video scene switching detection method
CN109978858B (en) Double-frame thumbnail image quality evaluation method based on foreground detection
CN107704864A (en) Well-marked target detection method based on image object Semantic detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant