CN112906649A - Video segmentation method, device, computer device and medium - Google Patents


Info

Publication number
CN112906649A
CN112906649A (application CN202110314575.0A)
Authority
CN
China
Prior art keywords
behavior
video
feature vector
category
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110314575.0A
Other languages
Chinese (zh)
Other versions
CN112906649B (en)
Inventor
宋波 (Song Bo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp ltd filed Critical Beijing Moviebook Technology Corp ltd
Priority to CN202110314575.0A priority Critical patent/CN112906649B/en
Priority claimed from CN202110314575.0A external-priority patent/CN112906649B/en
Publication of CN112906649A publication Critical patent/CN112906649A/en
Application granted granted Critical
Publication of CN112906649B publication Critical patent/CN112906649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application discloses a video segmentation method, a device, a computer device and a medium. The method comprises the following steps: splitting a video into segments based on correlation coefficients between adjacent video frames in the video; for each video frame in a segment, recognizing the scene of the frame to obtain a scene feature vector; for each video frame in a segment, recognizing the local behavior features of the frame to obtain a local behavior feature vector; identifying the behavior category of the frame and a confidence corresponding to that category based on the scene feature vector and the local behavior feature vector; determining the behavior category of the segment based on the behavior categories and confidences of its frames; and merging adjacent segments with the same behavior category to obtain the segmentation result of the video. By fusing a two-branch model, the method jointly exploits the scene and local-behavior dimensions to extract overall behavior information and thus segments the video quickly.

Description

Video segmentation method, device, computer device and medium
Technical Field
The present application relates to the field of automated image processing, and in particular to a video segmentation method, device, computer device and medium.
Background
The rapid development of video compression algorithms and applications has produced massive amounts of video data. Video contains abundant information, but because the data volume is huge and abstract concepts are not expressed directly in text, extracting and structuring video information is relatively complex. A common approach is to segment the video and label each resulting segment with a category, which is one way of extracting and structuring video information. Segmenting video with traditional computer vision generally requires manually designed image features, and such hand-crafted features cannot adapt flexibly to the variations of different scenes. Most practically available methods segment video using only the color information of each frame: changes between two adjacent frames are detected with various traditional computer-vision transforms to determine segmentation points, and clustering algorithms from machine learning are then used to aggregate the resulting adjacent segments so that similar ones are grouped into the same class. However, these methods achieve only coarse, shallow segmentation and cannot recognize the semantics of each segment in the video.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to an aspect of the present application, there is provided a video segmentation method including:
a segment splitting step: splitting a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition step: for each video frame in a segment, recognizing the scene of the video frame to obtain a scene feature vector;
a local behavior feature recognition step: for each video frame in a segment, recognizing the local behavior features of the video frame to obtain a local behavior feature vector;
a video frame behavior category determination step: identifying the behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
a segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment; and
a segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
By fusing a two-branch model, the method jointly exploits the scene and local-behavior dimensions to extract overall behavior information and thus segments the video quickly.
Optionally, the segment splitting step includes:
a histogram calculation step: calculating a YCbCr histogram for each video frame of the video;
a correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of the video frame and the YCbCr histogram of the previous video frame; and
a threshold comparison step: when the correlation coefficient is smaller than a first predetermined threshold, taking the video frame as the starting frame of a new segment.
Optionally, the scene recognition step includes:
a resolution conversion step: converting the RGB channels of the video frame to a fixed-size resolution; and
a scene feature vector generation step: inputting the resolution-converted video frame into a first network model to obtain the scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Optionally, the local behavior feature recognition step includes:
a shortest-side fixing step: converting the RGB channels of the video frame to a resolution with a fixed shortest side length; and
a local behavior feature vector generation step: inputting the video frame with the fixed shortest side length into the first network model, inputting the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, computing an optimal detection category result from the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Optionally, the video frame behavior category determination step includes:
a video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector; and
a behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network consists of four fully connected layers and a Softmax classifier connected in sequence.
Optionally, the segment behavior category determination step includes: when the ratio of the number of video frames sharing the same behavior category to the total number of video frames in the segment is greater than a second predetermined threshold, taking that behavior category as the behavior category of the segment.
According to another aspect of the present application, there is also provided a video segmentation apparatus including:
a segment segmentation module configured to segment a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene identification module configured to identify, for a video frame in the segment, a scene of the video frame, resulting in a scene feature vector;
a local behavior feature identification module configured to identify, for a video frame in the segment, a local behavior feature of the video frame, resulting in a local behavior feature vector;
a video frame behavior category judgment module configured to identify a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
a segment behavior category determination module configured to determine a behavior category for the segment based on the behavior category and the confidence of the video frames of the segment; and
a segment merging module configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
By fusing a two-branch model, the apparatus jointly exploits the scene and local-behavior dimensions to extract overall behavior information and thus segments the video quickly.
According to another aspect of the present application, there is also provided a computer device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to another aspect of the application, there is also provided a computer-readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method as described above.
According to another aspect of the present application, there is also provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method as described above.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a video segmentation method in accordance with the present application;
FIG. 2 is a schematic block diagram of a behavior prediction network of the present application;
FIG. 3 is a schematic block diagram of training a behavior prediction network of the present application;
FIG. 4 is a schematic block diagram of one embodiment of a video segmentation apparatus in accordance with the present application;
FIG. 5 is a block diagram of one embodiment of a computing device of the present application;
FIG. 6 is a block diagram of one embodiment of a computer-readable storage medium of the present application.
Detailed Description
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Embodiments of the present application provide a video segmentation method, and fig. 1 is a schematic flow chart of one example of a video segmentation method according to the present application. The method can comprise the following steps:
S100, segment splitting step: splitting a video into segments based on correlation coefficients between adjacent video frames in the video;
S200, scene recognition step: for each video frame in a segment, recognizing the scene of the video frame to obtain a scene feature vector;
S300, local behavior feature recognition step: for each video frame in a segment, recognizing the local behavior features of the video frame to obtain a local behavior feature vector;
S400, video frame behavior category determination step: identifying the behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
S500, segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment;
S600, segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
By fusing a two-branch model, the method provided by the application jointly exploits the scene and local-behavior dimensions to extract overall behavior information and thus segments the video quickly. The application uses deep learning to segment the video along the dimension of human behavior categories. On the one hand, deep learning can extract more abstract and general features; on the other hand, the dynamic information and causal events in a video are mainly defined by human behavior, so segmenting the video by human behavior category is the most reasonable approach.
Optionally, the S100 segment splitting step may include:
S101, histogram calculation step: calculating a YCbCr histogram for each video frame of the video;
S102, correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of the video frame and the YCbCr histogram of the previous video frame; and
S103, threshold comparison step: when the correlation coefficient is smaller than a first predetermined threshold, taking the video frame as the starting frame of a new segment.
Candidate color spaces include RGB, CMY (cyan, magenta, yellow), HSV (hue, saturation, value), HSI (hue, saturation, intensity) and YCbCr, where Y of YCbCr denotes the luminance component, Cb the blue-difference chrominance component and Cr the red-difference chrominance component. Taking YCbCr as an example, in an alternative embodiment the video is segmented as follows:
Based on the YCbCr color space, the YCbCr data of the frame are normalized to construct a normalized YCbCr histogram, whose horizontal axis represents the normalization levels and whose vertical axis represents the number of pixels at each level. In the normalization, Y, Cb and Cr may be divided into 16, 9 and 9 levels respectively, i.e. a 16-9-9 scheme, so that the total number of normalization levels is 16 + 9 + 9 = 34. Normalization, i.e. quantization, is performed at unequal intervals according to the different ranges of the color components and subjective color perception, balancing human visual resolution against the processing speed of the computer.
The correlation coefficient d(H_fi, H_fj) between the current frame and the previous frame is calculated as:

d(H_{f_i}, H_{f_j}) = \frac{\sum_{l=1}^{bins1}\left(H_{f_i}(l)-\bar{H}_{f_i}\right)\left(H_{f_j}(l)-\bar{H}_{f_j}\right)}{\sqrt{\sum_{l=1}^{bins1}\left(H_{f_i}(l)-\bar{H}_{f_i}\right)^{2}\,\sum_{l=1}^{bins1}\left(H_{f_j}(l)-\bar{H}_{f_j}\right)^{2}}}

where l denotes the normalization level, bins1 the total number of normalization levels, and H_fi(l) and H_fj(l) the number of pixels at level l of the current frame and of the previous frame, respectively. The mean pixel counts of the two frames are

\bar{H}_{f_k} = \frac{1}{bins1}\sum_{l=1}^{bins1} H_{f_k}(l), \qquad k \in \{i, j\}.

Note that bins1 is the number of bins in the histogram, i.e. the total number of normalization levels of the YCbCr histogram. For each pixel, the Y channel is quantized into 16 levels and the Cb and Cr channels into 9 levels each, so bins1 = 16 + 9 + 9 = 34; preferably, bins1 is taken as 34. Since the human eye is more sensitive to luminance than to chrominance, the YCbCr color space allows the luminance and chrominance information to be processed separately and more appropriately.
The correlation coefficient is then compared with the first threshold. If it is smaller than the first threshold, the frame is likely to be the start of a new clip and is taken as the starting frame of a new segment. The first threshold can be determined through experiments and practical application; optionally, the first threshold is set to 0.85.
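As a concrete illustration of this splitting step, the following is a minimal Python/OpenCV sketch, not part of the patent text; the helper names histogram_34 and find_split_frames, and the use of per-channel numpy histograms for the 16-9-9 quantization, are assumptions of this example.

```python
import cv2
import numpy as np

def histogram_34(frame_bgr):
    """Normalized 34-level YCbCr histogram: 16 Y levels + 9 Cb levels + 9 Cr levels."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)                       # OpenCV orders the channels Y, Cr, Cb
    h_y = np.histogram(y, bins=16, range=(0, 256))[0]
    h_cb = np.histogram(cb, bins=9, range=(0, 256))[0]
    h_cr = np.histogram(cr, bins=9, range=(0, 256))[0]
    hist = np.concatenate([h_y, h_cb, h_cr]).astype(np.float64)
    return hist / hist.sum()                           # normalize by the pixel count

def correlation(h1, h2):
    """Correlation coefficient between two histograms, as in the formula above."""
    d1, d2 = h1 - h1.mean(), h2 - h2.mean()
    return float((d1 * d2).sum() / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum()))

def find_split_frames(video_path, first_threshold=0.85):
    """Indices of frames whose correlation with the previous frame falls below the threshold."""
    cap = cv2.VideoCapture(video_path)
    splits, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = histogram_34(frame)
        if prev_hist is not None and correlation(prev_hist, hist) < first_threshold:
            splits.append(idx)                         # this frame starts a new segment
        prev_hist, idx = hist, idx + 1
    cap.release()
    return splits
```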
For each clip(i) roughly obtained in step S103, where i denotes the index of the clip, one frame is sampled every second and fed into the behavior prediction network; the network outputs the identifier (id) of the behavior, denoted clip(i)_frame(j)_id, and the corresponding confidence clip(i)_frame(j)_confidence. The behavior prediction network is a network dedicated to behavior prediction, with each behavior corresponding to one id. The behavior prediction network may include a first network model, a second network model and a third network model. The process by which a single frame is mapped to a behavior category by the behavior prediction network is described below.
Optionally, the S200 scene recognition step may include:
S201, resolution conversion step: converting the RGB channels of the video frame to a fixed-size resolution; and
S202, scene feature vector generation step: inputting the resolution-converted video frame into a first network model to obtain the scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Fig. 2 is a schematic block diagram of the behavior prediction network of the present application. The RGB channels of the image are each converted to a fixed-size resolution, for example 224 × 224, and the converted video frame is input into the first network model, also called the scene recognition subnetwork. The first network model is a modified VGG16 network trained for scene recognition over several predefined scenes, with the last fully connected layer and the Softmax classifier removed. The output of the scene recognition subnetwork is a vector of dimension 1x1x25088, denoted the scene feature vector place_feature_vector.
It should be noted that the Visual Geometry Group (VGG) is a research group in the Department of Engineering Science at the University of Oxford; the model it established through deep learning on an expression database is the VGG model, characterized by VGG features, which may include the FC6-layer features of the VGG16 Net deep neural network architecture.
The VGG16 Net network structure comprises five stacked convolutional blocks (ConvNets), each consisting of several convolutional layers (Conv) followed by nonlinear mapping layers (ReLU), with a pooling layer after each block; these are followed by three fully connected layers (4096, 4096 and 1000 channels in the standard configuration) and a Softmax layer, and different output sizes can be chosen depending on the specific task. The network introduces smaller 3×3 convolution kernels, adds ReLU layers directly after the convolutional and fully connected layers, and applies a regularization method (Dropout) at fully connected layers fc6 and fc7; this structure greatly shortens training time, increases the flexibility of the network and prevents overfitting. Considering the learning and representation capability of the network model, the flexibility of its structure, the training time and other factors, the present application selects VGG16 Net as its feature extractor. The matrix adjustment function (Reshape) in this model readjusts the number of rows, columns and dimensions of a matrix.
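For concreteness, a minimal PyTorch sketch of such a scene recognition subnetwork is given below: a VGG16 with its classifier head (fully connected layers and Softmax) removed, whose flattened output for a 224 × 224 input has 7x7x512 = 25088 elements, matching the place_feature_vector dimension described above. The class name SceneSubnetwork and the use of torchvision's pretrained ImageNet weights are assumptions of this example.

```python
import torch
import torch.nn as nn
from torchvision import models

class SceneSubnetwork(nn.Module):
    """VGG16 feature extractor: convolutional blocks only, classifier head removed."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features            # five convolutional blocks with pooling
        self.flatten = nn.Flatten()              # 512 x 7 x 7 -> 25088

    def forward(self, x):                        # x: (N, 3, 224, 224) resolution-converted frames
        return self.flatten(self.features(x))    # place_feature_vector: (N, 25088)

# usage sketch
frame = torch.randn(1, 3, 224, 224)
place_feature_vector = SceneSubnetwork()(frame)
print(place_feature_vector.shape)                # torch.Size([1, 25088])
```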
Optionally, the S300 local behavior feature recognition step may include:
S301, shortest-side fixing step: converting the RGB channels of the video frame to a resolution with a fixed shortest side length; and
S302, local behavior feature vector generation step: inputting the video frame with the fixed shortest side length into the first network model, inputting the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, computing an optimal detection category result from the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Referring to fig. 2, the RGB channels of the video frame are converted so that the shortest side has a fixed length, for example 600 pixels, and the frame is input into a second network model, also called the local behavior detection subnetwork. The second network model is a local behavior detection network trained for several predefined local behaviors. It may include the first network model, a Faster R-CNN, an optimal detection module and a pooling layer. Its data flow is as follows: the output of the first network model is input into the Faster R-CNN model, the optimal detection module computes the optimal detection category result from the output of the region-based convolutional neural network, and the optimal detection category result is passed through a region-of-interest (ROI) pooling layer to obtain the local behavior feature vector. This second network model is based on Faster R-CNN but uses only the optimal detection category.
The optimal detection category is determined by the following quantitative formula: for each detection target and rectangular box output by Faster R-CNN, let softmax_max be the maximum probability value output by Softmax and S the area of the rectangular box; the optimal detection category score opt_detection is calculated as:
opt_detection = SCALE * softmax_max + WEIGHT * S
where SCALE is a coefficient that prevents softmax_max from being swamped by the value range of S, and WEIGHT is the weight applied to the area. Optionally, SCALE = 1000 and WEIGHT = 0.7, which means the local behavior is weighted slightly more heavily than the area.
The optimal detection category result is then converted by the region-of-interest pooling layer from a 7x7x512-dimensional output into a 1x1x25088 vector, denoted the local behavior feature vector local_action_feature_vector. In fig. 2, after the local behavior feature vector is obtained, the results produced through FC1, FC2, FC M and Softmax M, together with the output of FC2 fed into FC M×4 with the window regression function Bbox_Pred (where M is the number of local behavior categories), can be used to evaluate the recognition effect of the local behavior feature vector.
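The selection of the optimal detection and its conversion into the 25088-dimensional local_action_feature_vector could be sketched as follows; the tensor layouts of the detections, the assumption that the boxes are already expressed in feature-map coordinates, and the adaptive pooling used here in place of a full ROI pooling layer are simplifications introduced for this example.

```python
import torch
import torch.nn.functional as F

SCALE, WEIGHT = 1000.0, 0.7          # coefficients described above

def select_optimal_detection(scores, boxes):
    """scores: (K, M) Softmax probabilities per detection; boxes: (K, 4) as (x1, y1, x2, y2).
    Returns the index maximizing opt_detection = SCALE * softmax_max + WEIGHT * S."""
    softmax_max = scores.max(dim=1).values                              # (K,)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])   # (K,)
    return int((SCALE * softmax_max + WEIGHT * areas).argmax())

def local_action_feature(feature_map, boxes, scores):
    """feature_map: (512, H, W) backbone output, boxes in feature-map coordinates.
    Crops the optimal box, pools it to 7x7x512 and flattens to (1, 25088)."""
    k = select_optimal_detection(scores, boxes)
    x1, y1, x2, y2 = [int(v) for v in boxes[k]]
    region = feature_map[:, y1:max(y1 + 1, y2), x1:max(x1 + 1, x2)]     # crude ROI crop
    pooled = F.adaptive_max_pool2d(region.unsqueeze(0), (7, 7))         # (1, 512, 7, 7)
    return pooled.flatten(1)                                            # local_action_feature_vector
```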
Optionally, the S400 video frame behavior category determination step may include:
S401, video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector; and
S402, behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network consists of four fully connected layers and a Softmax classifier connected in sequence.
In S401, the scene feature vector place_feature_vector and the local behavior feature vector local_action_feature_vector are concatenated into a video frame feature vector of size 1x1x(25088+25088) = 50176, denoted feature_vector (see fig. 2).
Optionally, the S500 segment behavior category determination step may include: when the ratio of the number of video frames sharing the same behavior category to the total number of video frames in the segment is greater than a second predetermined threshold, taking that behavior category as the behavior category of the segment.
In S402, the video frame feature vector feature_vector passes through four fully connected layers, FC1 through FC4. FC1 outputs 4096 channels, FC2 outputs 4096 channels, FC3 outputs 1000 channels, and FC4 outputs scores for C classes (see fig. 2). C is chosen according to the number of behavior categories actually required, typically between 15 and 30. The output of FC4 is fed into the Softmax classifier, which finally outputs the prediction confidence for each behavior category. The behavior category with the highest confidence is selected as the frame's behavior category and recorded as clip(i)_frame(j)_id together with clip(i)_frame(j)_confidence.
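A minimal PyTorch sketch of the third network described above, four fully connected layers followed by a Softmax over C classes, is given below; the class name BehaviorHead and the choice C = 20 are assumptions of this example.

```python
import torch
import torch.nn as nn

class BehaviorHead(nn.Module):
    """Third network: FC1..FC4 over the 50176-dim concatenated frame feature, then Softmax."""
    def __init__(self, num_classes=20):           # C is typically chosen between 15 and 30
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(50176, 4096), nn.ReLU(),     # FC1
            nn.Linear(4096, 4096), nn.ReLU(),      # FC2
            nn.Linear(4096, 1000), nn.ReLU(),      # FC3
            nn.Linear(1000, num_classes),          # FC4: scores for C classes
        )

    def forward(self, feature_vector):             # (N, 50176) concatenated frame feature
        return torch.softmax(self.fc(feature_vector), dim=1)   # per-class confidences

# usage sketch: pick the most confident behavior id for one frame
feature_vector = torch.cat([torch.randn(1, 25088), torch.randn(1, 25088)], dim=1)
confidences = BehaviorHead()(feature_vector)
frame_id, frame_confidence = int(confidences.argmax(1)), float(confidences.max())
```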
In the segment behavior category determination step S500, the processing of steps S200 to S400 is applied to each frame sampled every second from clip(i), predicting the behavior category of each frame. The proportion of frames in clip(i) that share the same id, relative to the total number of predicted frames, is denoted same_id_percent. If there is an id for which same_id_percent > same_id_percent_thres, where same_id_percent_thres is a preset threshold (for example, more than 80% of the frames share the id with a confidence above 65%), that id is output as the behavior category of clip(i).
In the segment merging step S600, the above processing is applied to each segment roughly obtained in step S100, yielding a behavior category for each segment. If adjacent segments have the same behavior category, they are merged into one segment. Finally, the short videos obtained by segmenting the video according to behavior category are produced.
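The segment-level voting of S500 and the merging of S600 can be illustrated with the short Python sketch below; the helper names, the 0.8 voting threshold and the 0.65 per-frame confidence cut-off are assumptions used only for this example.

```python
from collections import Counter

def segment_behavior(frame_ids, frame_confs, same_id_percent_thres=0.8, conf_thres=0.65):
    """Vote a behavior id for one segment from its per-frame predictions, or return None."""
    if not frame_ids:
        return None
    confident = [i for i, c in zip(frame_ids, frame_confs) if c > conf_thres]
    if not confident:
        return None
    best_id, count = Counter(confident).most_common(1)[0]
    return best_id if count / len(frame_ids) > same_id_percent_thres else None

def merge_segments(segments):
    """segments: list of (start_frame, end_frame, behavior_id); merge adjacent equal ids."""
    merged = []
    for start, end, behavior_id in segments:
        if merged and merged[-1][2] == behavior_id:
            merged[-1] = (merged[-1][0], end, behavior_id)   # extend the previous segment
        else:
            merged.append((start, end, behavior_id))
    return merged

# usage sketch
print(merge_segments([(0, 120, 3), (121, 200, 3), (201, 400, 7)]))  # [(0, 200, 3), (201, 400, 7)]
```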
It should be understood that the scene recognition step S200 and the local behavior feature recognition step S300 need not be executed in a fixed order; they may be executed simultaneously or one after the other.
FIG. 3 is a schematic block diagram of training a behavior prediction network of the present application. Optionally, the method may further comprise a training step of the behavior prediction network.
For the first network model, i.e. the scene prediction network, VGG16 is used to classify N predefined scenes. The number of output scene categories N is chosen according to actual requirements, typically between 30 and 40. For example, scene categories may include restaurant, basketball court, concert hall, and so on. The training strategy is as follows: the weights w are initialized using the following equation:
w=np.random.randn(n)*sqrt(2.0/n)
Here np.random.randn(n) generates random numbers; that is, the n weights of each filter of each channel of each convolutional layer are initialized from a Gaussian distribution, which can be generated with numpy. The factor sqrt(2.0/n), computed with a square-root function, keeps the variance of the input of each neuron consistent across layers. Dropout is used for regularization to prevent overfitting; dropout temporarily drops neural network units from the network with a certain probability during the training of a deep network, and the activation probability of each neuron is the hyperparameter p. After passing through two FC4096 layers, FC N and Softmax N, the pooled result is fed into the cost function, which is the cross-entropy loss (Softmax). The weights are updated with SGD + Momentum (stochastic gradient descent with momentum), and the learning rate decreases over training time according to a step decay schedule.
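A hedged PyTorch sketch of this training recipe (randn-based initialization scaled by sqrt(2.0/n), dropout, SGD with momentum and step learning-rate decay) follows; the concrete hyperparameter values (p = 0.5, lr = 0.01, momentum = 0.9, step size 10, gamma 0.1) and N = 35 are assumptions, not values stated in the text.

```python
import math
import torch
import torch.nn as nn

def init_weights(module):
    """w = randn(n) * sqrt(2.0 / n): Gaussian initialization scaled by the fan-in."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        n = module.weight[0].numel()                   # fan-in of each output unit
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / n))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

head = nn.Sequential(                                   # stand-in for the FC4096/FC4096/FC N head
    nn.Linear(25088, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 35),                                # FC N for N predefined scenes
)
head.apply(init_weights)

criterion = nn.CrossEntropyLoss()                       # cross-entropy over Softmax N
optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)   # SGD + Momentum
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # step decay
```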
For the second network model, i.e. the local behavior prediction network, Faster R-CNN is used and trained with the standard Faster R-CNN training method. The number of output local behavior categories M is chosen according to actual requirements, typically between 15 and 30. For example, local behaviors may include eating, playing basketball, dating, and so on. After the local behavior feature vector is obtained, the prediction results produced through two FC4096 layers, FC M and Softmax M, together with the output of the second FC4096 fed into FC M×4 with the window regression function Bbox_Pred (where M is the number of local behavior categories), can be used to evaluate the recognition effect of the local behavior feature vector. The outputs of Softmax M and FC M×4 are fed into the cross-entropy loss defined by Faster R-CNN.
The third network is trained after training of the first and second network models is complete. In the scene network, the Softmax classifier and the last fully connected layers are removed, the parameters of the remaining layers are kept unchanged, and the output of the last pooling layer is reshaped to 1x1x25088 dimensions and recorded as the scene feature vector. The same treatment is applied to the local behavior recognition network. When the third network model is trained, each image passes through the local behavior recognition network, which predicts several local behaviors and their rectangular boxes; the optimal detection category is selected according to the optimal detection criterion, the corresponding 7x7x512-dimensional output of the region-of-interest pooling layer is obtained, and it is further converted into a 1x1x25088-dimensional local behavior feature vector. The scene feature vector and the local behavior feature vector are concatenated into 1x1x(25088+25088) = 50176 dimensions and recorded as the video frame feature vector. The video frame feature vector passes through the four fully connected layers FC1 through FC4, and the output of FC4 is connected in turn to Softmax C and the cross-entropy loss. For the third network model, all other parameters remain unchanged and only the parameters of the four fully connected layers are trained. The parameter training strategy follows that of the first network model.
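The staged training of the third network, with both feature extractors frozen and only the four fully connected layers updated, might look like the sketch below; the tiny stand-in feature extractors, the label value and the optimizer settings are assumptions introduced so the example stays self-contained.

```python
import torch
import torch.nn as nn

# Stand-ins for the trained first and second network models (frozen during this stage).
scene_net = nn.Sequential(nn.Conv2d(3, 512, 3, padding=1), nn.AdaptiveMaxPool2d((7, 7)), nn.Flatten())
local_net = nn.Sequential(nn.Conv2d(3, 512, 3, padding=1), nn.AdaptiveMaxPool2d((7, 7)), nn.Flatten())
behavior_head = nn.Sequential(                         # the third network: FC1..FC4
    nn.Linear(50176, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000), nn.ReLU(),
    nn.Linear(1000, 20),                               # scores for C behavior classes
)

for p in list(scene_net.parameters()) + list(local_net.parameters()):
    p.requires_grad = False                            # only the four FC layers are trained

optimizer = torch.optim.SGD(behavior_head.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()                      # Softmax C + cross-entropy loss

frame, label = torch.randn(1, 3, 224, 224), torch.tensor([5])
with torch.no_grad():                                  # frozen feature extractors
    feature_vector = torch.cat([scene_net(frame), local_net(frame)], dim=1)   # (1, 50176)
loss = criterion(behavior_head(feature_vector), label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```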
The C behavior categories predicted by the third network model, the M local behavior categories predicted by the second network model and the N scene categories predicted by the first network model may be chosen as follows. First, C overall behavior categories, such as eating, playing basketball and dating, are defined according to business requirements. Then, according to the C overall behaviors, the possible local behavior categories are defined; these can generally be kept consistent with the overall behaviors (e.g. eating, playing basketball, dating). Finally, N possible scenes are defined according to the overall behavior categories; for eating, for example, scenes such as restaurant and coffee shop can be defined.
There is also provided, in accordance with another embodiment of the present application, a video segmentation apparatus, and fig. 4 is a schematic block diagram of an example of a video segmentation apparatus in accordance with the present application. The apparatus may include:
a segment segmentation module 100 configured to segment a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition module 200 configured to, for a video frame in the segment, recognize a scene of the video frame, resulting in a scene feature vector;
a local behavior feature identification module 300 configured to identify, for a video frame in the segment, a local behavior feature of the video frame, resulting in a local behavior feature vector;
a video frame behavior category determination module 400 configured to identify a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
a segment behavior category determination module 500 configured to determine a behavior category for a segment based on a behavior category and a confidence of a video frame of the segment; and
a segment merging module 600 configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
By fusing a two-branch model, the apparatus provided by the application jointly exploits the scene and local-behavior dimensions to extract overall behavior information and thus segments the video quickly.
Optionally, the segment segmentation module 100 may include:
a histogram calculation module configured to calculate a YCbCr histogram for each video frame of the video;
a correlation coefficient calculation module configured to calculate a correlation coefficient of the YCbCr histogram of the video frame and the YCbCr histogram of a previous video frame; and
a threshold comparison module configured to take the video frame as a starting frame of a new segment when the correlation coefficient is smaller than a predetermined first threshold.
Optionally, the scene recognition module 200 may include:
a resolution conversion module configured to convert RGB channels of the video frame into fixed-size resolutions, respectively; and
a scene feature vector generating module, configured to input the video frame after resolution conversion into a first network model, to obtain a scene feature vector of the video frame, where the first network model is: the VGG16 network model of the last layer of fully connected layers and the Softmax classifier is removed.
Optionally, the local behavior feature recognition module 300 may include:
a shortest-side fixing module configured to convert the RGB channels of the video frame to a resolution with a fixed shortest side length; and
a local behavior feature vector generation module configured to input the video frame with the fixed shortest side length into the first network model, input the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, compute an optimal detection category result from the output of the region-based convolutional neural network, and pass the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Optionally, the video frame behavior class determination module 400 may include:
a video frame feature vector merging module configured to merge the scene feature vector and the local behavior feature vector into a video frame feature vector; and
a behavior category and confidence calculation module configured to input the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network consists of four fully connected layers and a Softmax classifier connected in sequence.
FIG. 5 is a block diagram of one embodiment of a computing device of the present application. Another embodiment of the present application also provides a computing device comprising a memory 1120, a processor 1110 and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, performs any of the method steps 1131 described herein.
Another embodiment of the present application also provides a computer-readable storage medium. FIG. 6 is a block diagram of one embodiment of a computer-readable storage medium of the present application, comprising a storage unit for program code provided with a program 1131' for performing the method steps described herein, the program being executed by a processor.
An embodiment of the present application also provides a computer program product containing instructions for performing the method described above.
In the above embodiments, the implementation may be realized wholly or partially in software, hardware, firmware or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server or data center to another website, computer, server or data center via a wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD) or a semiconductor medium (e.g. solid state disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (4)

1. A video segmentation method, comprising:
a segment splitting step: splitting a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition step: for each video frame in a segment, converting the RGB channels of the video frame to a fixed-size resolution, and inputting the resolution-converted video frame into a first network model to obtain a scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed;
a local behavior feature recognition step: converting the RGB channels of the video frame to a resolution with a fixed shortest side length, inputting the video frame with the fixed shortest side length into the first network model, inputting the output of the first network model into a region-based convolutional neural network model, computing an optimal detection category result from the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain a local behavior feature vector;
a video frame behavior category determination step: identifying the behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector, wherein the video frame behavior category determination step comprises:
a video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector, and
a behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network consists of four fully connected layers and a Softmax classifier connected in sequence;
a segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment, wherein the segment behavior category determination step comprises: taking a behavior category as the behavior category of the segment when the ratio of the number of video frames sharing that behavior category to the total number of video frames in the segment is greater than a second predetermined threshold; and
a segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
2. A video segmentation apparatus comprising:
a segment segmentation module configured to segment a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition module configured to, for each video frame in the segment, convert the RGB channels of the video frame to a fixed-size resolution and input the resolution-converted video frame into a first network model to obtain a scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed;
a local behavior feature recognition module configured to convert the RGB channels of the video frame to a resolution with a fixed shortest side length, input the video frame with the fixed shortest side length into the first network model, input the output of the first network model into a region-based convolutional neural network model, compute an optimal detection category result from the output of the region-based convolutional neural network, and pass the optimal detection category result through a region-of-interest pooling layer to obtain a local behavior feature vector;
a video frame behavior category determination module configured to identify the behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector, the video frame behavior category determination module comprising:
a video frame feature vector merging module configured to merge the scene feature vector and the local behavior feature vector into a video frame feature vector, and
a behavior category and confidence calculation module configured to input the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network consists of four fully connected layers and a Softmax classifier connected in sequence;
a segment behavior category determination module configured to determine the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment, wherein the segment behavior category determination module takes a behavior category as the behavior category of the segment when the ratio of the number of video frames sharing that behavior category to the total number of video frames in the segment is greater than a second predetermined threshold; and
a segment merging module configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
3. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of claim 1 when executing the computer program.
4. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of claim 1.
CN202110314575.0A 2018-05-10 Video segmentation method, device, computer device and medium Active CN112906649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110314575.0A CN112906649B (en) 2018-05-10 Video segmentation method, device, computer device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110314575.0A CN112906649B (en) 2018-05-10 Video segmentation method, device, computer device and medium
CN201810443505.3A CN108647641B (en) 2018-05-10 2018-05-10 Video behavior segmentation method and device based on two-way model fusion

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201810443505.3A Division CN108647641B (en) 2018-05-10 2018-05-10 Video behavior segmentation method and device based on two-way model fusion

Publications (2)

Publication Number Publication Date
CN112906649A 2021-06-04
CN112906649B 2024-05-14

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030058268A1 (en) * 2001-08-09 2003-03-27 Eastman Kodak Company Video structuring by probabilistic merging of video segments
CN102426705A (en) * 2011-09-30 2012-04-25 北京航空航天大学 Behavior splicing method of video scene
US20140328570A1 (en) * 2013-01-09 2014-11-06 Sri International Identifying, describing, and sharing salient events in images and videos
US20160098602A1 (en) * 2014-10-07 2016-04-07 Thomson Licensing Method for computing a similarity measure for video segments
CN106529467A (en) * 2016-11-07 2017-03-22 南京邮电大学 Group behavior identification method based on multi-feature fusion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HONG-BO ZHANG 等: "Probability-based method for boosting human action recognition using scene context", 《IET COMPUTER VISION》, vol. 10, no. 6, pages 528 - 536, XP006058204, DOI: 10.1049/iet-cvi.2015.0420 *
宋文青; 王英华; 时荔蕙; 刘宏伟; 保铮: "SAR target discrimination algorithm based on a multi-feature fusion bag-of-words model", Journal of Electronics & Information Technology, vol. 39, no. 11, pages 2705-2715
熊心雨; 潘伟; 唐超: "Action sequence segmentation based on intrinsic dimension and confidence", Journal of Xiamen University (Natural Science), vol. 52, no. 04, pages 479-485
钱惠敏; 茅耀斌; 王执铨; 叶曙光: "Action sequence segmentation and recognition in video surveillance", Journal of Image and Graphics, vol. 14, no. 11, pages 2416-2420
雷庆; 李绍滋; 陈锻生: "A method for classifying human actions in images combining pose and scene", Journal of Chinese Computer Systems, vol. 36, no. 05, pages 1098-1103

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569703A (en) * 2021-07-23 2021-10-29 上海明略人工智能(集团)有限公司 Method and system for judging true segmentation point, storage medium and electronic equipment
CN113569703B (en) * 2021-07-23 2024-04-16 上海明略人工智能(集团)有限公司 Real division point judging method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108647641B (en) 2021-04-27
CN112966646A (en) 2021-06-15
CN108647641A (en) 2018-10-12
CN112836687A (en) 2021-05-25
CN112966646B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN108647641B (en) Video behavior segmentation method and device based on two-way model fusion
CN108537134B (en) Video semantic scene segmentation and labeling method
EP4109392A1 (en) Image processing method and image processing device
US9418440B2 (en) Image segmenting apparatus and method
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
WO2021135500A1 (en) Vehicle loss detection model training method and apparatus, vehicle loss detection method and apparatus, and device and medium
CN110717953B (en) Coloring method and system for black-and-white pictures based on CNN-LSTM (computer-aided three-dimensional network-link) combination model
KR20180065889A (en) Method and apparatus for detecting target
JP4098021B2 (en) Scene identification method, apparatus, and program
CN111935479B (en) Target image determination method and device, computer equipment and storage medium
JP2002109525A (en) Method for changing image processing path based on image conspicuousness and appealingness
US20160189388A1 (en) Video segmentation method
CN108491856B (en) Image scene classification method based on multi-scale feature convolutional neural network
CN117893845A (en) Method for providing AI model, AI platform, computing device and storage medium
CN109657715B (en) Semantic segmentation method, device, equipment and medium
US20120224789A1 (en) Noise suppression in low light images
JP2014041476A (en) Image processing apparatus, image processing method, and program
AU2015201623A1 (en) Choosing optimal images with preference distributions
CN111815528A (en) Bad weather image classification enhancement method based on convolution model and feature fusion
CN108345835B (en) Target identification method based on compound eye imitation perception
CN111274988B (en) Multispectral-based vehicle weight identification method and device
CN113743378B (en) Fire monitoring method and device based on video
KR101833943B1 (en) Method and system for extracting and searching highlight image
CN113868457A (en) Image processing method based on image gathering and related device
WO2008124029A1 (en) Systems and methods for segmenting an image based on perceptual information

Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant