CN110276332A - Video feature processing method and apparatus, and three-dimensional convolutional neural network model - Google Patents

Video feature processing method and apparatus, and three-dimensional convolutional neural network model

Info

Publication number
CN110276332A
Authority
CN
China
Prior art keywords
convolution
processing
result
sub
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910577983.8A
Other languages
Chinese (zh)
Other versions
CN110276332B (en)
Inventor
张云桃
晋瑞锦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910577983.8A priority Critical patent/CN110276332B/en
Publication of CN110276332A publication Critical patent/CN110276332A/en
Application granted granted Critical
Publication of CN110276332B publication Critical patent/CN110276332B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4038: Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/20: Image enhancement or restoration by the use of local operators
    • G06T 5/30: Erosion or dilatation, e.g. thinning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00: Indexing scheme for image data processing or generation, in general
    • G06T 2200/32: Indexing scheme for image data processing or generation, in general, involving image mosaicing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence

Abstract

This application discloses a video feature processing method and apparatus and a three-dimensional convolutional neural network model, comprising: obtaining a video feature vector obtained after three-dimensional convolution processing; performing convolution on the video feature vector in the spatial domain to obtain a spatial processing result; dividing the spatial processing result into multiple groups of spatial processing sub-results and performing convolution in the temporal domain on at least two groups of spatial processing sub-results respectively, obtaining at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results; and concatenating the at least two groups of temporal processing sub-results to obtain a processed video feature vector. In the disclosed method, every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient, i.e. the video features are convolved at multiple scales, so the image information at the moments when changes occur can be captured more completely in time, which improves the accuracy of the temporal features.

Description

Video feature processing method and apparatus, and three-dimensional convolutional neural network model
Technical field
This application relates to the field of electronic information, and in particular to a video feature processing method and apparatus and a three-dimensional convolutional neural network model.
Background art
Video feature extraction is a basic step of video processing: in almost all video analysis and processing, video features must be extracted first. Three-dimensional convolutional neural network models are widely used because they can capture feature information in the temporal and spatial domains at the same time. Extracting video features with a three-dimensional convolutional neural network model actually performs convolution on the temporal and spatial dimensions simultaneously, so that, while the visual features of each image frame in the video are obtained, the correlation of adjacent image frames over time is obtained as well.
However, when extracting video features, current three-dimensional convolutional neural network models extract insufficient temporal feature information for certain videos, which leads to poor results in practical applications; for example, it directly affects the accuracy of a three-dimensional convolutional neural network model applied to video recognition scenarios.
Summary of the invention
This application provides a video feature processing method and apparatus, with the aim of fully extracting the information in a video, and thereby improving the accuracy of video classification, without consuming excessive resources.
To achieve the above aim, this application provides the following technical solutions:
In a first aspect, this application provides a video feature processing method, comprising:
obtaining a video feature vector obtained after three-dimensional convolution processing;
performing convolution on the video feature vector in the spatial domain to obtain a spatial processing result;
dividing the spatial processing result into multiple groups of spatial processing sub-results, and performing convolution in the temporal domain on at least two groups of spatial processing sub-results respectively to obtain at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results;
concatenating the at least two groups of temporal processing sub-results to obtain a processed video feature vector.
Preferably, performing convolution on the video feature vector in the spatial domain comprises:
dividing the video feature vector into at least two groups of sub-video feature vectors;
selecting two groups from the at least two groups of sub-video feature vectors for dimension-raising/dimension-reducing convolution, which comprises: performing dimension-raising convolution in the spatial domain on one of the two groups of sub-video feature vectors to obtain a dimension-raising convolution result, and performing dimension-reducing convolution in the spatial domain on the other group to obtain a dimension-reducing convolution result.
Preferably, obtaining the spatial processing result comprises:
concatenating the dimension-raising convolution result and the dimension-reducing convolution result.
Preferably, the method further comprises:
selecting one group from the sub-video feature vectors that have not undergone dimension-raising or dimension-reducing convolution and applying a pooling operation to it.
Preferably, obtaining the spatial processing result comprises:
concatenating the dimension-raising convolution result, the dimension-reducing convolution result, and the result of the pooling operation.
Preferably, the method further comprises:
selecting one group from the sub-video feature vectors that have undergone neither dimension-raising/dimension-reducing convolution nor the pooling operation, and concatenating it with the dimension-raising convolution result, the dimension-reducing convolution result, and the result of the pooling operation.
Preferably, concatenating the at least two groups of temporal processing sub-results comprises:
selecting one group from the spatial processing sub-results that have not undergone convolution, and concatenating it with the multiple groups of temporal processing sub-results.
Preferably, concatenating the at least two groups of temporal processing sub-results comprises:
concatenating the video feature vector obtained after three-dimensional convolution processing, any one group of the spatial processing sub-results that have not undergone convolution, and the at least two groups of temporal processing sub-results.
Preferably, performing convolution on the video feature vector in the spatial domain comprises:
performing multi-scale convolution on the video feature vector in the spatial domain.
In another aspect, this application discloses a video feature processing apparatus, comprising:
an obtaining module, configured to obtain a video feature vector obtained after three-dimensional convolution processing;
a spatial processing module, configured to perform convolution on the video feature vector in the spatial domain to obtain a spatial processing result;
a convolution module, configured to divide the spatial processing result into multiple groups of spatial processing sub-results and perform convolution in the temporal domain on at least two groups of spatial processing sub-results respectively to obtain at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results;
a concatenation module, configured to concatenate the at least two groups of temporal processing sub-results to obtain a processed video feature vector.
In another aspect, this application discloses a three-dimensional convolutional neural network model, comprising:
a convolutional layer, configured to perform convolution on a video sample to obtain a video feature vector obtained after three-dimensional convolution processing;
and a video feature processing apparatus, which receives and processes the video feature vector obtained after the three-dimensional convolution processing.
In the video feature vector processing method disclosed in this application, every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient; that is, the video features are convolved at multiple scales, so the image information at the moments when changes occur can be captured more completely in time, which improves the accuracy of the temporal features.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow chart of a video feature processing method disclosed in an embodiment of this application;
Fig. 2 is a schematic diagram of feature image frames disclosed in an embodiment of this application;
Fig. 3 is a schematic diagram of the principle of a processing module (block) implementing the video feature processing method disclosed in an embodiment of this application;
Fig. 4 is a flow chart of another video feature processing method disclosed in an embodiment of this application;
Fig. 5 is a schematic diagram of the principle of a processing module (block) implementing another video feature processing method disclosed in an embodiment of this application;
Fig. 6 is a schematic diagram of the principle of a processing module (block) implementing yet another video feature processing method disclosed in an embodiment of this application;
Fig. 7 is a structural schematic diagram of a video feature processing apparatus disclosed in an embodiment of this application.
Detailed description of the embodiments
The inventors found that when a three-dimensional convolutional neural network model extracts video features, the quality of the extracted temporal features varies from video to video. Further study showed that this quality is directly related to the video duration: for videos of moderate duration the results are good, while for rather short or rather long videos the temporal features are comparatively poor. The reason is that, for a convolutional layer with a single convolution kernel dilation coefficient, when the video is short it is difficult to capture the particular frame in which a change occurs, so key information is missed and the temporal features suffer; and when the video is long, if an action lasts several seconds it is difficult to capture the moment of change in a timely and effective way.
To solve the above problems, an embodiment of this application discloses a video feature processing method. The method can be implemented by a processing module (block) that can be applied in a 3D convolutional neural network: it receives the output of a convolutional layer, namely the video feature vector obtained after three-dimensional convolution processing, and processes it further, so that the processed video feature vector better reflects the temporal characteristics of the video. This further improves the accuracy of the extracted video feature vector and, in turn, the accuracy of the 3D convolutional neural network in video recognition scenarios.
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The flow of a video feature processing method disclosed in an embodiment of this application is shown in Fig. 1 and comprises:
Step S101: obtain the video feature vector obtained after three-dimensional convolution processing.
The video feature vector in this embodiment refers to the feature map obtained after processing by a convolutional layer of the three-dimensional convolutional neural network model; during processing it is in fact the feature vector used to characterize the feature image frames. In the embodiments of this application the video feature vector is a multi-dimensional H*W*C matrix, where W is the width of any one of the feature image frames, H is its height, and C is the number of channels. The value of C depends on the number of convolution kernels of the convolutional layer in the three-dimensional convolutional neural network model.
Step S102: perform convolution on the video feature vector in the spatial domain to obtain a spatial processing result.
In this embodiment, spatial convolution is applied to the video feature vector first. In a three-dimensional convolutional neural network model a convolution kernel is typically written as h*w*t, where t is the temporal parameter and h and w are the height and width of the spatial kernel. Performing convolution only in the spatial domain in this step means that the kernel is h*w*1: the last, temporal dimension of the kernel is 1, so no convolution is carried out in the temporal domain.
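For illustration only, such a spatial-only convolution can be sketched with tf.keras as below; the channels-last (N, T, H, W, C) tensor layout, the 3*3 spatial kernel and the channel counts are assumptions of the sketch, not values fixed by this embodiment.

```python
import tensorflow as tf

# Hypothetical feature vector: 2 clips, 64 frames, 56x56 feature maps, 64 channels.
x = tf.random.normal([2, 64, 56, 56, 64])        # (N, T, H, W, C)

# h*w*1 kernel: kernel_size = (t, h, w) = (1, 3, 3), so frames never mix in time.
spatial_conv = tf.keras.layers.Conv3D(filters=64, kernel_size=(1, 3, 3),
                                      padding="same")
spatial_out = spatial_conv(x)                    # shape unchanged: (2, 64, 56, 56, 64)
```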
As shown in Fig. 2, from the viewpoint of any one feature image frame, the multi-dimensional matrix consists of t layers of size h*w, each layer corresponding to one feature image. In Fig. 2, t0, t1, t2 and t3 are the timestamps of the respective feature images. Performing convolution only in the spatial domain means that feature points are selected and convolved only within each layer, without any association between different layers.
The spatial convolution in this embodiment can be single-scale convolution, i.e. the kernel size is kept the same; this introduces no additional parameters, keeps the computation low, and is suitable for scenarios with ordinary processor performance. If the processor performance allows, multi-scale convolution can be used instead, performing the convolution with kernels of different sizes; although this increases the number of parameters and the computation, multi-scale convolution improves the accuracy of the spatial features and hence of the final result.
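If the multi-scale variant is chosen, one possible (assumed) arrangement is to run spatial kernels of different sizes in parallel and concatenate their outputs along the channel axis, as in the following sketch.

```python
import tensorflow as tf

x = tf.random.normal([2, 64, 56, 56, 64])                      # (N, T, H, W, C)

# Two spatial scales; the kernel sizes and filter counts are illustrative only.
branch_3x3 = tf.keras.layers.Conv3D(32, (1, 3, 3), padding="same")(x)
branch_5x5 = tf.keras.layers.Conv3D(32, (1, 5, 5), padding="same")(x)
multi_scale = tf.concat([branch_3x3, branch_5x5], axis=-1)     # 64 output channels
```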
Step S103: divide the spatial processing result into multiple groups of spatial processing sub-results, and perform convolution in the temporal domain on at least two groups of spatial processing sub-results respectively to obtain at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results.
In this step the spatial processing result is divided into multiple groups, i.e. at least two groups, giving multiple groups of spatial processing sub-results, from which a predetermined number of groups are selected and convolved in the temporal domain respectively. In this embodiment the spatial processing result is grouped by channel; the groups may be equal or unequal. Suppose the spatial processing result has 64 channels: if it is divided into two equal groups, each group has 32 channels; if it is divided unequally, the channels are allocated according to the specific processing requirements, for example three groups of 20, 20 and 24 channels. Because the grouping is by channel, if equal groups are used, the number of groups must be fixed when the model is designed so that the number of output channels of the convolutional layer is divisible by that number. Taking a split into two groups as an example, the two groups are fed into different paths for temporal convolution, but with different convolution kernel dilation coefficients. In this embodiment at least two different dilation coefficients are introduced, so at least two groups of spatial processing sub-results are convolved in the temporal domain.
Here the convolution kernel dilation coefficient is denoted d. In the example of Fig. 2, when temporal convolution is performed, one layer in every d layers takes part in the convolution. In Fig. 2 the layers are adjacent; when d is not set it defaults to 1, i.e. every layer is convolved. If d is a positive integer greater than 1, a convolution is performed only after every d-1 layers are skipped. As shown in Fig. 2, d = 2 means that every other layer is convolved; if the temporal extent of the kernel is 2, then a point on the image at t0 is convolved with the point at the same position on the image at t2.
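As a sketch of a single dilated temporal convolution (tensor layout and channel counts again assumed), d maps directly onto the dilation rate of a 3D convolution whose spatial extent is 1:

```python
import tensorflow as tf

x = tf.random.normal([2, 64, 56, 56, 64])            # (N, T, H, W, C)

# 1*1*3 temporal kernel with d = 2: each output mixes frames t, t+2 and t+4.
temporal_conv = tf.keras.layers.Conv3D(filters=64,
                                       kernel_size=(3, 1, 1),    # time extent 3 only
                                       dilation_rate=(2, 1, 1),  # d = 2
                                       padding="same")
y = temporal_conv(x)                                  # (2, 64, 56, 56, 64)
```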
In this embodiment, every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient, i.e. the images are convolved at multiple temporal scales, and the larger the value of d, the longer the time span of the convolution.
Step S104: concatenate the at least two groups of temporal processing sub-results to obtain a processed video feature vector.
The temporal processing sub-results obtained by convolving with the different dilation coefficients are concatenated, finally yielding the processed video feature vector. Because the different groups are convolved with different kernel dilation coefficients, the processing results of all groups must be concatenated at the end. Since the grouping above was done by channel, concatenation here means stitching the video features of the groups back together along the channel dimension.
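Steps S103 and S104 together could be sketched as follows; the number of groups, the dilation coefficients (1, 2, 4) and the channel counts are illustrative assumptions rather than values fixed by the method.

```python
import tensorflow as tf

def multi_dilation_temporal(spatial_result, dilations=(1, 2, 4)):
    """Split the spatial processing result into groups along the channel axis,
    convolve each group in time with its own dilation coefficient, and
    concatenate the temporal processing sub-results back together."""
    groups = tf.split(spatial_result, num_or_size_splits=len(dilations), axis=-1)
    sub_results = []
    for g, d in zip(groups, dilations):
        conv = tf.keras.layers.Conv3D(filters=g.shape[-1],
                                      kernel_size=(3, 1, 1),      # 1*1*3 in time
                                      dilation_rate=(d, 1, 1),    # per-group d
                                      padding="same")
        sub_results.append(conv(g))
    return tf.concat(sub_results, axis=-1)            # step S104: concatenation

spatial_result = tf.random.normal([2, 64, 56, 56, 96])   # 96 channels, 3 groups of 32
processed = multi_dilation_temporal(spatial_result)      # (2, 64, 56, 56, 96)
```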
In the disclosed embodiments, every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient, i.e. the video features are convolved at multiple scales, so the image information at the moments when changes occur can be captured more completely in time, which improves the accuracy of the temporal features. For example, if an action in the video lasts only a few seconds, a large kernel dilation coefficient may miss the features of that action; if the action lasts a long time and the kernel dilation coefficient is small, the features of the action are not lost, but the sampling of its global characteristics is insufficient, which also harms the accuracy of the temporal features. Convolving the video features with different kernel dilation coefficients captures both very short-lived features and the global characteristics of long-lasting ones, achieving accurate feature extraction.
Moreover, before the multi-scale temporal processing, this embodiment first splits the three-dimensional convolution into a 2-dimensional spatial convolution and a 1-dimensional temporal convolution, which greatly reduces the number of parameters and the computation of the model, so the method does not add much computation and is suitable for processors of different computing capabilities.
Corresponding to the video feature processing method shown in Fig. 1, the schematic diagram of the processing module (block) disclosed in this application for implementing the method is shown in Fig. 3. In this embodiment the block is divided into two parts, a spatial processing part and a temporal processing part. The kernel of the spatial convolution is 3*3*1; in the temporal convolution part, the spatial processing result is divided into 3 groups, all convolved with 1*1*3 kernels but with different values of d, namely 1, 2 and 4, representing convolution of every layer, of every other layer, and of one layer in every four.
This embodiment is not limited to dividing the spatial processing result into 3 groups, and the kernel sizes and the values of d are also only illustrative: the spatial kernel could equally be 5*5*1 and the temporal kernel 1*1*5, and they can be set according to the circumstances before the three-dimensional model is trained. Although only one convolution kernel is drawn in the figure, this is not limiting; the number of kernels can also be set according to the actual situation.
In another embodiment disclosed in this application, step S102 can also adopt the idea of grouped processing used in step S103 to carry out the spatial convolution. The detailed flow is shown in Fig. 4 and comprises:
Step S401: divide the video feature vector into at least two groups of sub-video feature vectors.
Step S402: select two groups from the at least two groups of sub-video feature vectors for dimension-raising/dimension-reducing convolution.
The dimension-raising/dimension-reducing convolution comprises: performing dimension-raising convolution in the spatial domain on one of the two groups of sub-video feature vectors to obtain a dimension-raising convolution result, and performing dimension-reducing convolution in the spatial domain on the other group to obtain a dimension-reducing convolution result.
After the grouped convolution, the dimension-raising convolution result and the dimension-reducing convolution result also need to be concatenated to obtain the spatial processing result.
In this embodiment, applying dimension-raising and dimension-reducing processing to the video feature vector increases the nonlinear expressive power of the network.
In addition, in the above embodiment, the video feature vector can be divided equally into 3 groups, giving 3 groups of sub-video feature vectors; two of the groups undergo the processing shown in Fig. 4, and the remaining group undergoes a max pooling operation. In this application max pooling uses a 2*2 window with stride = 1, meaning that on the feature map the maximum of each 2*2 block of four values is taken. The pooling operation reduces the dimensionality in the spatial domain and simplifies the subsequent computation.
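A minimal sketch of this pooling branch is given below; the 'same' padding and the tensor layout are assumptions, as the text only fixes the 2*2 window and stride = 1.

```python
import tensorflow as tf

g = tf.random.normal([2, 64, 56, 56, 16])     # one group of sub-video feature vectors

# 2*2 max over the spatial plane with stride 1; the time axis is left untouched.
pool = tf.keras.layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 1, 1),
                                    padding="same")
pooled = pool(g)
```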
The result of the pooling operation can then be concatenated with the dimension-raising convolution result and the dimension-reducing convolution result to obtain the spatial processing result.
Because the disclosed embodiments first separate the temporal and spatial domains, some feature information is inevitably lost. To solve this problem, the above video feature vector is divided into 4 groups: two groups go through the flow shown in Fig. 4, one of the remaining two groups undergoes pooling, and the last group is left unprocessed and is concatenated directly with the processing results of the other 3 groups. That is, the original, unprocessed video feature vector is concatenated with the pooled result, the dimension-raising convolution result and the dimension-reducing convolution result. Because this path of the video feature vector undergoes no processing at all, it still contains all of the original information, so during concatenation it replenishes the feature information lost through the spatial convolution and the temporal-spatial separation, guaranteeing the completeness of the features.
Similarly, in step S103 the spatial processing result is divided into 3 groups of spatial processing sub-results, of which two groups are convolved and the remaining group is left unprocessed; therefore in step S104, when the multiple groups of temporal processing sub-results are concatenated, they are concatenated with the group of spatial processing sub-results that was not convolved. That is, one group is selected from the spatial processing sub-results that did not undergo convolution and is concatenated with the multiple groups of temporal processing sub-results.
Further, the video feature vector obtained after three-dimensional convolution processing can also be brought in directly from the input of the block and concatenated with the temporal processing sub-results as well; that is, the video feature vector obtained after three-dimensional convolution processing, any one group of the spatial processing sub-results that did not undergo convolution, and the at least two groups of temporal processing sub-results are all concatenated, so that the feature information of the final processing result is complete, further improving the accuracy of the processing result.
The schematic diagram of a block implementing the above video feature processing is shown in Fig. 5.
The object processed by the video feature processing method disclosed in this application is the video features obtained after the convolutional layer of the three-dimensional convolutional neural network convolves the original video data; in other words, the video features are processed further. The executing body can be regarded as a processing module (block) connected to the output of a convolutional layer of the three-dimensional convolutional neural network model. Because the model may contain multiple convolutional layers, multiple video feature processing blocks are arranged accordingly.
To better explain the video feature processing method disclosed in this application, it is further described below through a specific example, with reference to the principle of the video feature processing module shown in Fig. 6.
In this embodiment the experimental environment used to train the three-dimensional convolutional neural network model is Linux/CentOS 7.2, and the software platform is TensorFlow 1.9.
The video samples are preprocessed first. Since the width, height and aspect ratio of videos generally differ, for the convenience of subsequent processing each video is scaled so that its short side becomes 256, with the long side scaled in proportion to the original aspect ratio.
The number of image frames contained in one second also differs with the video frame rate; typical cases are 24, 25 or 30 frames per second. In this embodiment, 8 frames per second are first extracted uniformly at equal intervals, so a 10-second video yields 80 frames in total. A starting point is then chosen at random among the extracted frames, and 64 consecutive frames are taken from that starting point within the extracted frames; if the total is fewer than 64 frames, the extraction loops around.
The random cropping function random_crop of TensorFlow 1.9 is then called to randomly crop the 64 frames, and the horizontal flipping function is called to flip the pictures horizontally with a predetermined probability, which in this embodiment is 50%. Horizontally flipping the pictures achieves data augmentation, i.e. data enhancement, without substantially increasing the amount of data; processing the augmented pictures increases the generalization ability of the video feature processing method.
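The preprocessing described above could look roughly like the following sketch; the decoding step, the handling of short clips and the 224*224 crop size are assumptions made only to keep the example self-contained.

```python
import numpy as np
import tensorflow as tf

def sample_clip(frames, fps, frames_per_sec=8, clip_len=64):
    """frames: (N, H, W, 3) array of decoded frames, already resized so the
    short side is 256. Returns clip_len frames sampled as described above."""
    step = max(int(round(fps / frames_per_sec)), 1)
    picked = frames[::step]                               # 8 frames per second
    start = np.random.randint(len(picked))                # random starting point
    idx = [(start + i) % len(picked) for i in range(clip_len)]   # loop if too short
    return picked[idx]

def augment(clip, flip_prob=0.5):
    """Random 224*224 crop shared by all frames, then horizontal flip with
    the predetermined probability."""
    clip = tf.image.random_crop(clip, size=[clip.shape[0], 224, 224, 3])
    if np.random.rand() < flip_prob:
        clip = tf.image.flip_left_right(clip)             # flips every frame
    return clip
```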
In this embodiment the preset training parameters of the model are as follows:
initial learning rate = 0.01;
learning rate schedule: exponential decay,
decay rate = 0.96;
batch size = 32, on 4 GPUs;
weight decay parameter = 0.01.
The specific values of the above parameters can be set according to the requirements of the particular model training; the values in this embodiment are only illustrative and not limiting.
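Expressed in present-day tf.keras terms (the original embodiment used TensorFlow 1.9's tf.train utilities; the optimizer choice and the decay-step count below are assumptions), these settings correspond roughly to:

```python
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,     # initial learning rate 0.01
    decay_steps=10000,              # assumed; the text does not give a step count
    decay_rate=0.96)                # exponential decay factor 0.96

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# A weight decay of 0.01 can be applied as an L2 kernel regularizer on each layer.
weight_decay = tf.keras.regularizers.l2(0.01)

BATCH_SIZE = 32                     # spread over 4 GPUs in the described setup
```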
The 64-frame 224*224 clips described above are input into the three-dimensional convolutional neural network model for training. During training, the processing in the convolutional layers can follow existing practice; however, the video feature vector obtained after a convolutional layer is not taken directly as the feature extraction result, but instead enters the processing module (block) disclosed in this application for further processing. The three-dimensional convolutional neural network model can contain multiple processing blocks located at different positions in the model, but their purpose and effect are always to further process the output of each convolutional layer. Wherever they are located, their principle and structure can be explained by the block structural schematic shown in Fig. 5.
After the video feature vector enters the block, the overall processing is divided into two major parts, a spatial processing part and a temporal processing part. Separating space and time simplifies the complexity and the computation of the model and removes the coupling between the temporal and spatial domains, so that the newly added block does not introduce much extra computation, does not increase the overall complexity of the model or its GPU resource usage, and is suitable for processors of different computing capabilities.
In the spatial processing part, the convolved video feature vector is first divided into four groups of sub-video feature vectors at the input of this part; assuming the convolved video vector is 224*224*64, each group of sub-video feature vectors is 224*224*8, and the subsequent processing passes through 4 paths, each handling its group differently. It must be guaranteed here that the number of channels finally output by the convolutional layer is divisible by the number of groups preset for the block; if the preset number of groups of the block is 3, the number of channels finally output by the convolutional layer must be a multiple of 3.
One group of sub-video feature vectors goes through a max pooling path, which reduces the dimensionality of the sub-video feature vector in the spatial domain. Max pooling uses a 2*2 window with stride = 1, meaning that the maximum of each 2*2 block of four values on the feature map is taken. The window is expressed as 2*2 because a single picture or feature map is a 2D plane; stride is the step by which the window moves to the right and downward each time. The result then passes through a 1*1*1 convolution kernel into the first concatenation unit (concat).
Two other groups of sub-video feature vectors correspond to the other two paths: one path applies dimension-raising convolution to one group of sub-video vectors, and the other path applies dimension-reducing convolution to the other group. The raising and reducing are realised by setting different numbers of convolution kernels, and the kernel counts of the dimension-raising and dimension-reducing processing can be set as appropriate. Whether raising or reducing, the kernel is 3*3*1, i.e. convolution is performed only in the spatial domain and the temporal part is retained. Raising and reducing the dimensions of the grouped video vectors introduces some new parameters to the model, which improves its nonlinear expressiveness and benefits its stability.
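Sketched in the same assumed layout, the two branches differ only in their filter counts; the concrete numbers below are illustrative, not values fixed by the embodiment.

```python
import tensorflow as tf

g2 = tf.random.normal([2, 64, 56, 56, 16])    # group fed to the raising branch
g3 = tf.random.normal([2, 64, 56, 56, 16])    # group fed to the reducing branch

# Both branches use a 3*3*1 kernel, i.e. spatial-only convolution.
raised  = tf.keras.layers.Conv3D(32, (1, 3, 3), padding="same")(g2)  # 16 -> 32 channels
reduced = tf.keras.layers.Conv3D(8,  (1, 3, 3), padding="same")(g3)  # 16 -> 8 channels
```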
There is one more path whose corresponding group of sub-video vectors is not processed; it is connected directly to the first concatenation unit through a 1*1*1 convolution kernel, indicating that the original information of the input video vector is passed along so that feature information is not lost during the temporal-spatial separation.
In the first concatenation unit, the processing results of the four paths are concatenated to obtain the spatial processing result; at this point the spatial processing ends. The spatial convolution in this embodiment can be single-scale or multi-scale; with multi-scale convolution the accuracy of the spatial features is better, at the cost of a higher overall computation.
As shown in Fig. 6, in the temporal processing part the output of the concatenation unit is also divided into four groups, i.e. the spatial processing result is divided into four groups of spatial processing sub-results, which are then fed into four paths. The leftmost path is likewise left unprocessed and is connected to the second concatenation unit through a 1*1*1 convolution kernel, indicating that the original information of the spatial processing result is retained and loss of feature information is avoided.
The three paths on the right convolve the three remaining groups of spatial processing sub-results. The three groups all use the same kernel, 1*1*3, but each group has a different convolution dilation coefficient, d = 1, d = 2 and d = 4 from left to right, representing convolution of every layer, of every other layer, and of one layer in every four.
After the convolution, the three convolved groups are concatenated with the group of spatial processing sub-results passed along by the leftmost path, giving the temporal processing result.
There is also a path in the figure connected directly from the input to the second concatenation unit; that is, the original information of the video vector obtained after three-dimensional convolution of the input is passed here as well and concatenated with the result of the temporal processing part, further avoiding the loss of feature information and improving the performance and stability of the model.
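Putting the parts of Fig. 5 and Fig. 6 together, the whole block might be sketched as below. This is only an illustrative reading of the figures: the group sizes, the filter counts of the 1*1*1 channel-adjustment convolutions and the use of tf.keras are all assumptions, chosen here so that every channel split divides evenly.

```python
import tensorflow as tf
L = tf.keras.layers

def video_feature_block(x, dilations=(1, 2, 4)):
    """x: (N, T, H, W, C) output of a 3D convolutional layer, C divisible by 4."""
    c = x.shape[-1] // 4

    # Spatial part: four parallel branches over four channel groups.
    g1, g2, g3, g4 = tf.split(x, 4, axis=-1)
    pooled   = L.MaxPooling3D((1, 2, 2), strides=(1, 1, 1), padding="same")(g1)
    pooled   = L.Conv3D(c,      (1, 1, 1), padding="same")(pooled)  # 1*1*1 channel adjust
    raised   = L.Conv3D(2 * c,  (1, 3, 3), padding="same")(g2)      # dimension raising
    reduced  = L.Conv3D(c // 2, (1, 3, 3), padding="same")(g3)      # dimension reducing
    identity = L.Conv3D(c,      (1, 1, 1), padding="same")(g4)      # pass-through
    spatial  = tf.concat([pooled, raised, reduced, identity], axis=-1)

    # Temporal part: one pass-through group plus three dilated groups.
    t0, t1, t2, t3 = tf.split(spatial, 4, axis=-1)
    outs = [L.Conv3D(t0.shape[-1], (1, 1, 1), padding="same")(t0)]
    for g, d in zip((t1, t2, t3), dilations):
        outs.append(L.Conv3D(g.shape[-1], (3, 1, 1),
                             dilation_rate=(d, 1, 1), padding="same")(g))

    # The block input is concatenated as well, preserving the original information.
    return tf.concat(outs + [x], axis=-1)

x = tf.random.normal([2, 64, 28, 28, 64])
y = video_feature_block(x)       # multi-scale temporal features plus residual path
```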
In the video feature processing method disclosed in the embodiments of this application, when the temporal convolution is carried out, every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient; that is, temporal features can be extracted and captured from the images at multiple scales, so the image information at the moments when changes occur can be captured more completely in time, improving the accuracy of the temporal features.
In this embodiment, the groupings, channel counts, convolution kernels, dilation coefficients and other parameters mentioned above are all examples intended to explain the processing of the block; the specific parameter values are set according to the actual requirements of the three-dimensional convolutional neural network model to be trained.
Moreover, the purpose of all the 1*1*1 convolution kernels in the above structure is to change the number of channels of the processed video feature vector; that is, setting the number of 1*1*1 convolution kernels raises or reduces the dimensionality, which improves the nonlinear expressiveness of the model and benefits its stability. In the spatial processing part, the numbers of the four 1*1*1 convolution kernels can be the same or different.
In the whole three-dimensional convolutional neural network model, because each block sits at a different depth, the number of convolution kernels it contains can be the same for all blocks or can be adjusted with depth; for example, the number of 1*1*1 convolution kernels connected to the max pooling increases as the depth increases.
Of course, if no dimension raising or reducing is needed, the 1*1*1 convolution kernels need not be introduced.
Another embodiment of this application discloses a video feature processing apparatus, shown in Fig. 7, comprising:
an obtaining module 701, configured to obtain a video feature vector obtained after three-dimensional convolution processing;
a spatial processing module 702, configured to perform convolution on the video feature vector in the spatial domain to obtain a spatial processing result;
a convolution module 703, configured to divide the spatial processing result into multiple groups of spatial processing sub-results and perform convolution in the temporal domain on at least two groups of them respectively to obtain at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results;
a concatenation module 704, configured to concatenate the at least two groups of temporal processing sub-results to obtain a processed video feature vector.
When the video feature processing apparatus disclosed in this embodiment processes the video feature vector obtained after three-dimensional convolution, every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient, i.e. the video features are convolved at multiple scales, so the image information at the moments when changes occur can be captured more completely in time, improving the accuracy of the temporal features.
For the specific workflow of the apparatus, refer to the embodiments corresponding to Figs. 1 to 6; details are not repeated here.
This application also discloses a three-dimensional convolutional neural network model, including the video feature processing apparatus shown in Fig. 7.
The convolutional layer of the three-dimensional convolutional neural network model is configured to perform convolution on a video sample to obtain the video feature vector obtained after three-dimensional convolution processing;
and the video feature processing apparatus receives the video feature vector produced by the convolutional layer and then processes it.
Because the three-dimensional convolutional neural network model may contain multiple convolutional layers, multiple video feature processing apparatuses can be arranged accordingly.
Because it is provided with the video feature processing apparatus, the three-dimensional convolutional neural network model can further process the video feature vector obtained after convolution: the video feature vector is first convolved in the spatial domain to obtain a spatial processing result; the spatial processing result is then grouped, and some of the spatial processing sub-results are convolved in the temporal domain at multiple scales, i.e. every group of spatial processing sub-results is convolved with a different convolution kernel dilation coefficient, thereby convolving the video features at multiple scales, so the image information at the moments when changes occur can be captured more completely in time, improving the accuracy of the temporal features.
If the functions described in the methods of the embodiments of this application are implemented in the form of software functional units and sold or used as independent products, they can be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of this application that contributes over the prior art, or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including a number of instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may be referred to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A video feature processing method, characterized by comprising:
obtaining a video feature vector obtained after three-dimensional convolution processing;
performing convolution on the video feature vector in the spatial domain to obtain a spatial processing result;
dividing the spatial processing result into multiple groups of spatial processing sub-results, and performing convolution in the temporal domain on at least two groups of spatial processing sub-results respectively to obtain at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results;
concatenating the at least two groups of temporal processing sub-results to obtain a processed video feature vector.
2. The method according to claim 1, characterized in that performing convolution on the video feature vector in the spatial domain comprises:
dividing the video feature vector into at least two groups of sub-video feature vectors;
selecting two groups from the at least two groups of sub-video feature vectors for dimension-raising/dimension-reducing convolution, which comprises: performing dimension-raising convolution in the spatial domain on one of the two groups of sub-video feature vectors to obtain a dimension-raising convolution result, and performing dimension-reducing convolution in the spatial domain on the other group to obtain a dimension-reducing convolution result.
3. The method according to claim 2, characterized in that obtaining the spatial processing result comprises:
concatenating the dimension-raising convolution result and the dimension-reducing convolution result.
4. The method according to claim 2, characterized in that the method further comprises:
selecting one group from the sub-video feature vectors that have not undergone dimension-raising or dimension-reducing convolution and applying a pooling operation to it.
5. The method according to claim 4, characterized in that obtaining the spatial processing result comprises:
concatenating the dimension-raising convolution result, the dimension-reducing convolution result, and the result of the pooling operation.
6. The method according to claim 4, characterized in that the method further comprises:
selecting one group from the sub-video feature vectors that have undergone neither dimension-raising/dimension-reducing convolution nor the pooling operation, and concatenating it with the dimension-raising convolution result, the dimension-reducing convolution result, and the result of the pooling operation.
7. The method according to claim 1, characterized in that concatenating the at least two groups of temporal processing sub-results comprises:
selecting one group from the spatial processing sub-results that have not undergone convolution, and concatenating it with the multiple groups of temporal processing sub-results.
8. The method according to claim 1, characterized in that concatenating the at least two groups of temporal processing sub-results comprises:
concatenating the video feature vector obtained after three-dimensional convolution processing, any one group of the spatial processing sub-results that have not undergone convolution, and the at least two groups of temporal processing sub-results.
9. The method according to any one of claims 1 to 8, characterized in that performing convolution on the video feature vector in the spatial domain comprises:
performing multi-scale convolution on the video feature vector in the spatial domain.
10. A video feature processing apparatus, characterized by comprising:
an obtaining module, configured to obtain a video feature vector obtained after three-dimensional convolution processing;
a spatial processing module, configured to perform convolution on the video feature vector in the spatial domain to obtain a spatial processing result;
a convolution module, configured to divide the spatial processing result into multiple groups of spatial processing sub-results and perform convolution in the temporal domain on at least two groups of them respectively to obtain at least two groups of temporal processing sub-results, wherein the convolution kernel dilation coefficient of the temporal convolution differs for each group of spatial processing sub-results;
a concatenation module, configured to concatenate the at least two groups of temporal processing sub-results to obtain a processed video feature vector.
11. A three-dimensional convolutional neural network model, characterized by comprising:
a convolutional layer, configured to perform convolution on a video sample to obtain a video feature vector obtained after three-dimensional convolution processing;
and the video feature processing apparatus according to claim 10, which receives and processes the video feature vector obtained after the three-dimensional convolution processing.
CN201910577983.8A 2019-06-28 2019-06-28 Video feature processing method and device Active CN110276332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910577983.8A CN110276332B (en) 2019-06-28 2019-06-28 Video feature processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910577983.8A CN110276332B (en) 2019-06-28 2019-06-28 Video feature processing method and device

Publications (2)

Publication Number Publication Date
CN110276332A true CN110276332A (en) 2019-09-24
CN110276332B CN110276332B (en) 2021-12-24

Family

ID=67963840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910577983.8A Active CN110276332B (en) 2019-06-28 2019-06-28 Video feature processing method and device

Country Status (1)

Country Link
CN (1) CN110276332B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273835A (en) * 2017-06-07 2017-10-20 南京航空航天大学 Act of violence intelligent detecting method based on video analysis
CN109284782A (en) * 2018-09-13 2019-01-29 北京地平线机器人技术研发有限公司 Method and apparatus for detecting feature
CN109492612A (en) * 2018-11-28 2019-03-19 平安科技(深圳)有限公司 Fall detection method and its falling detection device based on skeleton point
CN109741340A (en) * 2018-12-16 2019-05-10 北京工业大学 Ice sheet radar image ice sheet based on FCN-ASPP network refines dividing method
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN109886341A (en) * 2019-02-25 2019-06-14 厦门美图之家科技有限公司 A kind of trained method for generating Face datection model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOAO CARREIRA等: "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
SHAOJIE BAI等: "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", 《ARXIV:1803.01271V2》 *

Also Published As

Publication number Publication date
CN110276332B (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN110032926B (en) Video classification method and device based on deep learning
CN110188795A (en) Image classification method, data processing method and device
CN107578453A (en) Compressed image processing method, apparatus, electronic equipment and computer-readable medium
WO2021248859A1 (en) Video classification method and apparatus, and device, and computer readable storage medium
CN110263909A (en) Image-recognizing method and device
CN109522874A (en) Human motion recognition method, device, terminal device and storage medium
CN106326985A (en) Neural network training method, neural network training device, data processing method and data processing device
CN110188863B (en) Convolution kernel compression method of convolution neural network suitable for resource-limited equipment
CN109063824B (en) Deep three-dimensional convolutional neural network creation method and device, storage medium and processor
CN109902723A (en) Image processing method and device
CN110062164A (en) Method of video image processing and device
CN110020639B (en) Video feature extraction method and related equipment
CN110222717A (en) Image processing method and device
CN109117742A (en) Gestures detection model treatment method, apparatus, equipment and storage medium
CN111860398A (en) Remote sensing image target detection method and system and terminal equipment
CN111951195A (en) Image enhancement method and device
CN107766932A (en) Image processing method and device based on neutral net
CN111310921A (en) FPGA implementation method of lightweight deep convolutional neural network
CN110210344A (en) Video actions recognition methods and device, electronic equipment, storage medium
WO2020248706A1 (en) Image processing method, device, computer storage medium, and terminal
CN110276332A (en) A kind of video features processing method, device and Three dimensional convolution neural network model
CN112016522A (en) Video data processing method, system and related components
CN109034059B (en) Silence type face living body detection method, silence type face living body detection device, storage medium and processor
CN110517200A (en) Acquisition methods, device, equipment and the storage medium that face grass is drawn
CN115661596A (en) Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant