CN109934188A - Slide transition detection method, system, terminal and storage medium

Publication number: CN109934188A
Application number: CN201910208617.5A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN109934188B (en)
Legal status: Granted; Active
Inventors: 马然, 刘致金, 李凯, 沈礼权, 安平
Current assignee: University of Shanghai for Science and Technology
Original assignee: University of Shanghai for Science and Technology
Priority: CN201910208617.5A
Prior art keywords: residual, spatial, module, network model, convolution
Classification: Image Analysis

Abstract

The present invention provides a slide transition detection method, comprising: connecting a three-class output layer after a convolutional neural network structure to obtain the classification information of video frame volumes, thereby obtaining a three-class convolutional neural network model; designing a spatio-temporal residual network model on the basis of the structure of the three-class convolutional neural network model, the 3D convolution module of the 3D ConvNet network and the residual module of the ResNet network; and extracting the spatio-temporal features of video frames with the 3D convolution module, fusing the residual module into the 3D convolution module to obtain a 3D convolution residual module, and constructing the spatio-temporal residual network model used to classify video frame volumes. The invention also provides a corresponding detection system, terminal and computer-readable storage medium. The invention overcomes the interference that camera movement, speaker movement and switching among multiple PTZ cameras introduce into lecture videos, and achieves higher accuracy than existing methods.

Description

A slide transition detection method, system, terminal and storage medium
Technical field
The present invention relates to a video information processing method, and in particular to a slide transition detection method and system based on a spatio-temporal residual deep learning network model.
Background
With the development of information technology and multimedia, the intuitiveness, certainty and efficiency of video information have made digital video increasingly widespread, and the Internet has brought this visual content closer to everyone. Online learning has become an important way to acquire knowledge: people record learning videos of all kinds in meeting rooms or classrooms with smart devices and distribute them to a wider audience over the Internet. These videos, however, receive no structuring at all, and learning websites present each video to the user as a whole. A user interested in one particular topic usually has to browse the entire video to find it, which costs considerable time and effort. According to statistics, about 400 hours of video are uploaded to YouTube every minute. If none of these videos is processed, learners will be overwhelmed by the volume of learning videos and lose interest in learning. For online education and other applications, it is therefore very important to extract representative information from lecture videos automatically and to summarize them. Slide transition detection is one of the key techniques in lecture video summarization and an important research topic.
A large share of lecture videos are presented with slides, and slide transition detection in such videos is an important research focus in lecture video summarization. The speaker, the projected slides and the audience are recorded into the lecture video by PTZ (pan-tilt-zoom) cameras. According to how they are recorded, lecture videos fall into three types: recorded with a static camera, recorded with a moving camera, and recorded with switching between cameras. Because a lecture video captures not only the projection area but also the speaker and the audience, this background content interferes with slide transition detection, for example through camera movement, camera switching and speaker movement. Moreover, a slide transition is usually a change of the projection-area content within a very short time, so the transition moment is difficult to identify manually. Slide transition detection in lecture videos is therefore a meaningful and challenging task.
Because of this complex interference, researchers in China and abroad have proposed detection methods for different types of video. Some methods measure the image similarity of adjacent frames using visual features such as color histograms, SIFT, HOG and wavelets. These methods, however, do not account for interference from speaker movement, camera movement and camera switching; for example, when the shot switches from the computer screen to the speaker, the video content changes as well. Other methods target specific video types, such as single-camera videos without shot changes or videos shot with a fixed camera. Each of these methods has its own limitations.
An earlier Chinese patent application by the present applicant, application No. 201710878115.4, discloses a slide transition detection method based on sparse time-varying graphs. For a lecture video that contains the speaker, the slides and the audience and is shot with multiple cameras, the video is first segmented by feature point detection and matching; each video segment is then treated as a node and a sparse graph is built for each time point, which converts the slide transition detection problem into the problem of inferring the graph adjacency matrices. Changes between adjacency matrices reflect slide transitions. That application works well on lecture videos with static shots and with shot changes, but its error is large when the lecture video contains complex camera motion, such as camera movement, zooming and switching occurring together. In addition, because it relies on traditional image feature points, it ignores the transition information between adjacent frames.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a slide transition detection method, system, terminal and storage medium based on a spatio-temporal residual network model, which can effectively handle slide transition detection under interference such as camera movement/zooming, speaker movement and camera switching. Compared with the prior art, detecting slide transitions with the spatio-temporal residual network model overcomes the interference that camera movement/zooming, speaker movement and switching among multiple PTZ cameras introduce into lecture videos; the method is accurate and handles a wide range of lecture video types.
The present invention uses the 3D ConvNet convolutional neural network, which extends convolution kernels from 2D to 3D, to extract the spatial and temporal features of the video. As more convolutional layers are stacked, 3D ConvNet consumes more memory, which makes the model harder to train. To solve this problem, the present invention adopts the residual network model (Residual Network, ResNet). The new convolutional network model proposed by the invention not only shortens training time but is also easier to train to a better slide transition detection result.
According to a first aspect of the present invention, a slide transition detection method based on a spatio-temporal residual network model is provided, comprising:
dividing a video that is recorded by a single camera or multiple cameras and contains slides, a speaker and/or an audience into multiple video frame volumes, each containing video frames;
designing a convolutional neural network structure according to the design principles of network structures that extract the spatial features of images; connecting a three-class output layer after the convolutional neural network structure, the three-class output layer being used to obtain the classification information of the video frame volumes, to obtain a three-class convolutional neural network model; and designing a spatio-temporal residual network model on the basis of the structure of the three-class network model, the 3D convolution module of the 3D ConvNet network and the residual module of the residual network model ResNet;
extracting the spatio-temporal features of the video frames with the 3D convolution module of the 3D ConvNet network, fusing the residual module of the residual network model ResNet into the 3D convolution module of the 3D ConvNet network to obtain a 3D convolution residual module, and constructing the spatio-temporal residual network model used to classify video frame volumes; wherein:
a training video is divided into multiple video frame volumes, each containing video frames, and the video frame volumes are classified and then fed into the spatio-temporal residual network model for training, to obtain a trained spatio-temporal residual network model;
the video frame volumes of a test video are fed into the trained spatio-temporal residual network model to obtain classification results, from which the slide transition moments are detected.
Preferably, the three-class convolutional neural network model has a 12-layer convolutional neural network structure comprising 8 convolutional layers and 4 fully connected layers. As the network deepens, the width and height of the image shrink according to a fixed rule: after each pooling the width and height of the image are halved and the number of channels is doubled. The final output layer is a three-class output layer used to obtain the classification information of the video frame volumes. The network structure is very regular and has few hyperparameters, focusing on building a simple network.
Preferably, the design principles of the network structure for extracting the spatial features of images consist mainly of the following two:
if the spatio-temporal feature maps at the input and the output of the 3D convolution residual module have the same size, the number of channels of the convolution kernels of the convolutional neural network does not change;
if the spatio-temporal feature map output by the 3D convolution residual module is half the size of the input spatio-temporal feature map, the number of channels of the convolution kernels of the convolutional neural network is doubled to keep the time complexity consistent.
Preferably, the 3D convolution module of the 3D ConvNet network uses 3D convolutional layers and 3D pooling layers to extract the spatio-temporal feature maps of the video frames, and the residual module of the residual network model ResNet uses shortcut connections and identity mapping to improve the learning efficiency of the model. The residual module of the residual network model ResNet is fused into the 3D convolution module of the 3D ConvNet network to obtain the 3D convolution residual module. The shortcut connection of the 3D convolution residual module contains a 1 × 1 3D convolutional layer, which ensures that the output of the 3D convolution residual module and the output of the 1 × 1 3D convolutional layer mapping have the same dimensions.
Preferably, the 1 × 1 3D convolutional layer in the shortcut connection of the 3D convolution residual module ensures that the output of the 3D convolution residual module and the output of the 1 × 1 3D convolutional layer mapping have the same dimensions as follows:
The 3D convolution residual module contains two convolutional layers; the residual mapping $F(x)$ is therefore expressed as
$$F(x) = \omega_2\,\sigma(\omega_1 x + b_1) + b_2$$
where $x$ denotes the input; $\omega_1$ denotes the weight coefficients of the first convolutional layer; $\omega_2$ denotes the weight coefficients of the second convolutional layer; $b_1$ denotes the bias of the first convolutional layer; and $b_2$ denotes the bias of the second convolutional layer;
$\sigma$ denotes the ReLU activation function:
$$\sigma(x) = \max(0, x)$$
where $x$ denotes the input;
To make the dimensions of the input $x$ and the residual mapping $F(x)$ identical, a 1 × 1 3D convolutional layer is added to the shortcut connection, giving the weighted mapping $H(x)$, expressed as
$$H(x) = W_s x$$
where $W_s$ is a weight matrix used to match the dimensions of the input $x$ and the residual mapping $F(x)$;
The mapping equation $Z(x)$ then becomes:
$$Z(x) = F(x) + H(x).$$
Preferably, the spatio-temporal residual network model has eight convolutional layers and four fully connected layers; the convolutional layers come first and the fully connected layers follow, the convolutional layers are connected in sequence, and the four fully connected layers are connected after the convolutional layers.
Preferably, in the spatio-temporal residual network model, the loss function is the cross-entropy loss function of classification networks.
Preferably, the cross-entropy loss function is defined in terms of the following quantities:
x denotes the input; class denotes the true class to which the input belongs; n is the number of input video frames; x[class] denotes the score that the input obtains for its true class; and x[j] denotes the score that the input obtains for the j-th class.
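The formula itself does not survive in this text. A standard softmax cross-entropy that is consistent with the symbols defined above would read as follows; this is an assumption, not the patent's confirmed expression, and it leaves open how n enters (for instance, as an average over the inputs).

```latex
\mathrm{loss}(x, \mathit{class})
  = -\log\frac{\exp\bigl(x[\mathit{class}]\bigr)}{\sum_{j}\exp\bigl(x[j]\bigr)}
  = -x[\mathit{class}] + \log\sum_{j}\exp\bigl(x[j]\bigr)
```

Under this form the loss is small when the score of the true class dominates the scores of the other classes.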
Preferably, the video frame volumes are classified and then fed into the spatio-temporal residual network model for training, wherein a single-channel network model extracts the spatio-temporal features of the two frames in each input video frame volume, and the spatio-temporal residual network model is trained with the Adam algorithm.
According to a second aspect of the present invention, a slide transition detection system based on a spatio-temporal residual network model is provided, comprising:
a segmentation module, which divides a video that is recorded by a single camera or multiple cameras and contains slides, a speaker and/or an audience into multiple video frame volumes, each containing video frames;
a classification network structure design module, which designs a convolutional neural network structure according to the design principles of network structures that extract the spatial features of images, wherein the design principles are: if the feature maps at the input and the output of the classification network structure have the same size, the number of channels of the convolution kernels of the convolutional neural network does not change; if the feature map output by the classification network structure is half the size of the input feature map, the number of channels of the convolution kernels of the convolutional neural network is doubled to keep the time complexity consistent; a three-class output layer is connected to the back end of the convolutional neural network to obtain the classification information of the video frame volumes, forming the classification network structure;
a spatio-temporal residual network model construction module, which extracts the spatio-temporal features of the video frames with the 3D convolution module of the 3D ConvNet network, fuses the residual module of the residual network model ResNet into the 3D convolution module of the 3D ConvNet network to obtain a 3D convolution residual module, and constructs the spatio-temporal residual network model used to classify video frame volumes;
a training module, which divides a training video into multiple video frame volumes, each containing video frames, classifies the video frame volumes and feeds them into the spatio-temporal residual network model for training, to obtain a trained spatio-temporal residual network model;
a detection module, which feeds the video frame volumes of a test video into the trained spatio-temporal residual network model, outputs the class to which each video frame volume belongs, and detects the slide transition moments.
According to a third aspect of the present invention, a terminal is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can perform the above slide transition detection method based on a spatio-temporal residual network model.
According to a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, can perform the above slide transition detection method based on a spatio-temporal residual network model.
The slide transition detection method, system, terminal and storage medium based on a spatio-temporal residual network model provided by the present invention constitute a slide transition detection technique built on 3D ConvNet and ResNet. Given a lecture video that is recorded by a single camera or multiple cameras and contains slides, a speaker and an audience, the goal of the invention is to detect the moments at which the slides change. Because the spatial and temporal features of the video are important for detection, 3D ConvNet is used to extract the spatial and temporal features. Because the videos are long, ResNet is combined to optimize the network model in order to reduce processing time and improve detection accuracy. The invention first divides the input training video into multiple video frame volumes, each containing video frames, assigns the video frame volumes to three classes and feeds them into the classification network model for training. The video frame volumes of a test video are then fed into the trained classification network model, and the slide transition moments are detected from the classes that the network model outputs for the video frame volumes. The invention overcomes the interference of camera movement, speaker movement and switching among multiple PTZ cameras in lecture videos with higher accuracy than existing methods, and broadens the range of lecture video types that can be handled.
The present invention designs the basic classification network model according to the design principles of the currently best-performing network structures for extracting the spatial features of images, and converts the slide transition detection problem into a video frame volume classification problem. On top of the basic model, 3D ConvNet (a deep 3-dimensional convolutional neural network) is added to extract the spatial and temporal features of the video frames, and the residual network model ResNet is fused into the network structure to improve training efficiency, yielding a new network model. During model training, every two video frames are fed into the network as one video frame volume. Because slide transition detection has been converted into a classification problem, the loss function is designed as the cross-entropy loss function.
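As a rough illustration of how frames might be grouped into two-frame volumes, the sketch below pairs consecutive frames. The function name `make_frame_volumes` and the `stride` parameter are hypothetical, and the text does not say whether neighbouring volumes share a frame, so the overlapping default is an assumption.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def make_frame_volumes(
    frames: Sequence[Frame], volume_size: int = 2, stride: int = 1
) -> List[Tuple[Frame, ...]]:
    """Group consecutive frames into fixed-size volumes.

    stride=1 gives overlapping volumes (an assumption); stride=volume_size
    gives disjoint volumes. Trailing frames that cannot fill a volume are dropped.
    """
    if volume_size < 1 or stride < 1:
        raise ValueError("volume_size and stride must be positive")
    last_start = len(frames) - volume_size
    return [tuple(frames[i:i + volume_size]) for i in range(0, last_start + 1, stride)]


# Frames are represented by placeholder names here; in practice they would be image arrays.
frames = ["f0", "f1", "f2", "f3", "f4"]
for volume in make_frame_volumes(frames):
    print(volume)
```

Overlapping volumes cover every adjacent frame pair, which suits a detector that looks for changes between neighbouring frames; disjoint volumes halve the number of inputs.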
Compared with the prior art, the present invention has the following beneficial effects:
By introducing the 3D convolution residual module into the spatio-temporal residual deep learning network model, the invention extracts not only the spatial features of video frames but also the visual change features between adjacent video frames, and therefore has a clear advantage in handling slide transitions in lecture videos with obvious visual changes. The lecture video is divided into multiple video frame volumes, each containing video frames, which are fed into the learning network for classification learning; this lets the invention learn a variety of interference features, such as camera movement/zooming, speaker movement and switching among multiple PTZ cameras, and thus handle many types of lecture video. The invention achieves higher detection accuracy and handles a wider range of lecture video types. In addition, the invention requires no additional data such as text, speech or electronic slides.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a flowchart of the detection method in an embodiment of the present invention;
Fig. 2 shows the input lecture video in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the three classes of video frame volumes in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the 3D convolution residual module in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the overall spatio-temporal residual network architecture in an embodiment of the present invention;
Fig. 6 is a schematic diagram of slide transition detection results in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any form. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention.
An embodiment of the present invention provides a slide transition detection technique based on a spatio-temporal residual network model built on 3D ConvNet and ResNet. Given a lecture video that is recorded by a single camera or multiple cameras and contains slides, a speaker and an audience, the goal is to detect the moments at which the slides change. Because the spatial and temporal features of the video are important for detection, the embodiment uses 3D ConvNet to extract the spatial and temporal features. Because the videos are long, ResNet is combined to optimize the network model in order to reduce processing time and improve detection accuracy. The embodiment divides the input training video into multiple video frame volumes, each containing video frames, assigns the video frame volumes to three classes through the three-class output layer and feeds them into the classification network model for training; the video frame volumes of a test video are then fed into the trained classification network model, and the slide transition moments are detected from the classes that the network model outputs for the video frame volumes.
The application environment of the following embodiment is as follows: the overall network model is shown in Fig. 5, and the simulation is programmed in an Ubuntu 16.04 and PyTorch environment.
Referring to Fig. 1, in the slide transition detection method based on a spatio-temporal residual network model, the embodiment of the present invention first divides the video to be detected, which is recorded by a single camera or multiple cameras and contains slides, a speaker and/or an audience, into multiple video frame volumes, each containing video frames, and then performs the following steps:
Step 1: design of the three-class convolutional neural network model structure: a 12-layer three-class convolutional neural network structure is designed, mainly following two design principles:
(1) if the input and output feature maps have the same size, the number of channels of the convolution kernels does not change;
(2) if the output feature map is half the size of the input feature map, the number of channels of the convolution kernels is doubled to keep the time complexity consistent.
As the network deepens, the width and height of the image shrink according to a fixed rule: they are halved after each pooling and the number of channels is doubled. For example, when the image is reduced from 224 × 224 to 112 × 112, the number of channels increases from 64 to 128. The final output layer is a three-class output used to obtain the classification information of the video frame volumes. The network structure is very regular and has few hyperparameters, focusing on building a simple network.
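The halve-and-double rule can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the cost estimate assumes a square 2D convolution whose input and output channel counts are equal, which is a simplification of the 3D layers described here.

```python
from typing import List, Tuple


def feature_map_schedule(size: int, channels: int, halvings: int) -> List[Tuple[int, int]]:
    """Return (spatial size, channel count) before and after each halving stage."""
    schedule = [(size, channels)]
    for _ in range(halvings):
        size //= 2       # width and height are halved
        channels *= 2    # channel count is doubled
        schedule.append((size, channels))
    return schedule


def relative_conv_cost(size: int, channels: int, kernel: int = 3) -> int:
    """Multiply-accumulate count of one square 2D convolution layer
    with equal input and output channel counts (simplifying assumption)."""
    return size * size * kernel * kernel * channels * channels


for size, channels in feature_map_schedule(224, 64, 3):
    cost = relative_conv_cost(size, channels)
    print(f"{size} x {size}, {channels} channels, cost {cost}")
```

Under that assumption the per-layer cost is the same at every stage, which is one way to read the "consistent time complexity" remark: quartering the number of spatial positions offsets the fourfold growth in channel pairs.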
Step 2:3D convolution residual error module design: 3D ConvNet network when, empty feature extracting method be using 3D volumes Volume module, it applies 3D convolutional layer and the pond 3D layer, and the short connection of the residual error module application of ResNet network and identical mapping mention High model learning efficiency.Residual error module in residual error network model ResNet is dissolved into the 3D convolution in 3D ConvNet network Module obtains 3D convolution residual error module.The 3D convolutional layer of addition 1 × 1 guarantees that input is consistent with the dimension of mapping in short connection.
The network structure of 3D convolution residual error module is as shown in figure 4, it contains two layers of 3D convolutional layer and one includes 1 × 1 3D convolutional layer short connection.Connection before and after first layer 3D convolutional layer and second layer 3D convolutional layer, so that input video frame volume warp It is exported after 3D convolution twice, size reduction port number increase.Input video frame rolls up the output after 1 × 1 3D convolutional layer It is identical as the Output Size after 3D convolution twice, final output is used as after being overlapped.
In this step, the input and output dimensions of the 3D convolutional layers are made identical as follows:
The 3D convolution residual module contains two convolutional layers; the residual mapping $F(x)$ is therefore expressed as
$$F(x) = \omega_2\,\sigma(\omega_1 x + b_1) + b_2$$
where $\sigma$ denotes the ReLU activation function, $\sigma(x) = \max(0, x)$.
To make the dimensions of the input $x$ and the residual mapping $F(x)$ identical, a 1 × 1 3D convolutional layer is added to the shortcut connection, which gives the weighted mapping $H(x)$, expressed as
$$H(x) = W_s x$$
where $W_s$ is a weight matrix used to match the dimensions of the input $x$ and the residual mapping $F(x)$.
The mapping equation therefore becomes:
$$Z(x) = F(x) + H(x)$$
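The three mappings can be traced on a toy example. In the sketch below, plain matrices stand in for the convolutions, which is a deliberate simplification; all weights are made-up values, and the helper names are hypothetical. The input has two components and the residual branch produces three, so the projection `ws` is what makes the final addition possible.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    return [ai + bi for ai, bi in zip(a, b)]


def relu(v: Vector) -> Vector:
    """ReLU activation: max(0, x) per element."""
    return [max(0.0, vi) for vi in v]


def residual_block(x: Vector, w1: Matrix, b1: Vector,
                   w2: Matrix, b2: Vector, ws: Matrix) -> Vector:
    """Z(x) = F(x) + H(x), with F(x) = w2 relu(w1 x + b1) + b2 and H(x) = ws x."""
    f = add(matvec(w2, relu(add(matvec(w1, x), b1))), b2)  # residual mapping F(x)
    h = matvec(ws, x)                                      # projected shortcut H(x)
    return add(f, h)


x = [1.0, -2.0]                                  # 2-dimensional input
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # first layer: 2 -> 3
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # second layer: 3 -> 3
b2 = [0.5, 0.5, 0.5]
ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]        # shortcut projection: 2 -> 3
z = residual_block(x, w1, b1, w2, b2, ws)
print(z)
```

Adding `x` directly to the residual branch would fail here because a 2-component vector cannot be summed with a 3-component one; the projection plays the role that the 1 × 1 3D convolution plays in the shortcut connection.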
Step 3: construction of the spatio-temporal residual network model: adding the convolution residual module designed in step 2 to the network structure designed in step 1 yields the spatio-temporal residual network model. This model has eight convolutional layers and four fully connected layers. The network structure is shown in Fig. 5, where "3D Convolution" marks a 3D convolutional layer, "64" means that the number of convolution channels is 64, "/2" means that the size of the video frame volume is halved, and "fully connected" marks a fully connected layer. A solid shortcut line over a 3D convolutional layer means that the size of the video frame volume is unchanged by that convolution, so the shortcut connection performs identity mapping directly. A dashed shortcut line over a 3D convolutional layer means that the video frame volume is halved in size and doubled in channel count by that convolution; a 1 × 1 3D convolutional layer is added to the shortcut connection to keep the output dimensions identical, which forms a 3D convolution residual module. Across the eight convolutional layers the size of the video frame volume is repeatedly halved and the number of channels repeatedly doubled. Because the video frame volume output by the eighth convolutional layer has many channels, four fully connected layers are used for dimensionality reduction; the final, fourth fully connected layer has three nodes, indicating that the final output has three classes, i.e. the network is a three-class network.
The loss function is the cross-entropy loss function commonly used in classification networks.
Specifically, the cross-entropy loss function is defined in terms of the following quantities:
x denotes the input, class denotes the true class of the input, and n is the number of input video frames.
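A minimal sketch of a softmax cross-entropy over three class scores is shown below. It assumes the common form −x[class] + log Σ exp(x[j]), which matches the quantities named above but is not confirmed by this text; the function names and the averaging over a batch are likewise assumptions.

```python
import math
from typing import Sequence


def cross_entropy(scores: Sequence[float], true_class: int) -> float:
    """Softmax cross-entropy for one input: -x[class] + log(sum_j exp(x[j]))."""
    peak = max(scores)  # subtract the maximum before exponentiating, for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum_exp - scores[true_class]


def mean_cross_entropy(batch_scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average the per-input loss over a batch (averaging is an assumption)."""
    losses = [cross_entropy(s, c) for s, c in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


# Three class scores for two hypothetical video frame volumes.
batch = [[4.0, 0.5, 0.1], [0.2, 0.3, 3.0]]
print(mean_cross_entropy(batch, labels=[0, 2]))  # confident and correct: small loss
print(mean_cross_entropy(batch, labels=[1, 0]))  # confident and wrong: large loss
```

The loss grows as the score of the true class falls behind the other scores, which is the behaviour a three-class classifier needs during training.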
Step 4: model training: the single-channel spatio-temporal residual network model extracts the spatial and temporal features of the two frames in each input video frame volume; the model is trained with the Adam algorithm.
In this step, the network parameter optimization algorithm is Adam. The batch size (mini-batch) is set to 128, and the parameters of the algorithm are set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$, where $\beta_1$ is the exponential decay rate of the first-moment estimate and $\beta_2$ is the exponential decay rate of the second-moment estimate. The penalty multiplier in weight decay is set to $5 \times 10^{-4}$; the initial learning rate is 0.001 and is reduced by a factor of 10 every 10 training epochs. The model is saved after training is complete.
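The training settings above can be collected into a small configuration together with a step-decay schedule and a single Adam update. This is a sketch under assumptions: the decay interval is read as "every 10 epochs", `eps` is a conventional value not given in the text, weight decay is listed in the configuration but not applied in `adam_step`, and all names are hypothetical.

```python
import math
from typing import Tuple

TRAINING_CONFIG = {
    "batch_size": 128,
    "beta1": 0.9,          # decay rate of the first-moment estimate
    "beta2": 0.999,        # decay rate of the second-moment estimate
    "weight_decay": 5e-4,  # penalty multiplier
    "initial_lr": 1e-3,
}


def learning_rate(epoch: int, initial_lr: float = 1e-3,
                  step: int = 10, factor: float = 10.0) -> float:
    """Step decay: divide the learning rate by `factor` every `step` epochs (assumed reading)."""
    return initial_lr / (factor ** (epoch // step))


def adam_step(param: float, grad: float, m: float, v: float, t: int, lr: float,
              beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch))

new_param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1, lr=learning_rate(0))
print(new_param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.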
Step 5: the video frame volumes to be detected are input into the spatio-temporal residual network model trained in step 4 to obtain the corresponding classification results, from which the corresponding slide transition moments are obtained.
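One way to turn per-volume class labels into transition moments is sketched below. The text does not say what the three classes are, so the label meanings here (0 for no change, 1 for a slide transition, 2 for other changes) are purely hypothetical, as are the function name, the frame rate and the choice to merge consecutive transition labels into a single event.

```python
from typing import List, Sequence

SLIDE_TRANSITION = 1  # hypothetical label id; the class meanings are not given in the text


def transition_times(labels: Sequence[int], fps: float, stride: int = 1,
                     transition_label: int = SLIDE_TRANSITION) -> List[float]:
    """Return the start time in seconds of each run of volumes labelled as a transition.

    labels[i] is the predicted class of the volume starting at frame i * stride.
    Consecutive transition labels are merged into one event (a design assumption).
    """
    times = []
    previous = None
    for index, label in enumerate(labels):
        if label == transition_label and previous != transition_label:
            times.append(index * stride / fps)
        previous = label
    return times


predicted = [0, 0, 1, 1, 0, 2, 0, 1, 0]  # made-up classifier output
print(transition_times(predicted, fps=25.0))
```

Merging runs avoids reporting the same transition several times when adjacent volumes all straddle it; a real system might also apply a minimum gap between events.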
Corresponding to the above method, an embodiment of the present invention also provides a slide switching detection system based on a spatio-temporal residual network model, which can be used to implement the above method. The system specifically includes:
Segmentation module: segments a video that is recorded by a single camera or multiple cameras and contains slides, a speaker and/or an audience into multiple video frame volumes, each containing video frames;
Classification network model design module: designs a convolutional neural network as the basic network model, following the design principle of network structures for extracting spatial features of images; a three-class output layer is connected after the basic network model to obtain the class information of a video frame volume, forming a three-class network model;
Here, the design principle of the network structure for extracting spatial features of images is as follows: if the feature maps input to and output from the three-class network model have the same size, the number of channels of the convolution kernels of the convolutional neural network is unchanged; if the output feature map is half the size of the input feature map, the number of channels of the convolution kernels is doubled so that the time complexity stays consistent;
Spatio-temporal residual network model construction module: uses the 3D convolution module of the 3D ConvNet network to extract the spatio-temporal features of the video frames, integrates the residual module of the residual network model ResNet into the 3D convolution module of the 3D ConvNet network to obtain a 3D convolution residual module, and builds the spatio-temporal residual network model used to classify video frame volumes;
Training module: segments the training video into multiple video frame volumes, each containing video frames, classifies these video frame volumes, and feeds them into the spatio-temporal residual network model for training to obtain the trained spatio-temporal residual network model;
Detection module: feeds the video frame volumes of the test video into the trained spatio-temporal residual network model, outputs the class to which each video frame volume belongs, and detects the slide switching moments.
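The channel rule in the design principle above (same feature-map size keeps the channel count, half the size doubles it) can be illustrated with a short Python sketch. The function names and the example values (an input size of 112 and 64 initial channels) are hypothetical and are not taken from the patent.

```python
def next_channels(in_size, out_size, in_channels):
    """Apply the stated rule to one stage of the network.

    Same feature-map size -> keep the channel count.
    Half the feature-map size -> double the channel count.
    """
    if out_size == in_size:
        return in_channels
    if out_size * 2 == in_size:
        return in_channels * 2
    raise ValueError("the rule only covers equal or halved feature-map sizes")

def channel_plan(input_size, first_channels, halvings):
    """List (size, channels) per stage when every stage halves the feature map."""
    plan = [(input_size, first_channels)]
    size, channels = input_size, first_channels
    for _ in range(halvings):
        new_size = size // 2
        channels = next_channels(size, new_size, channels)
        size = new_size
        plan.append((size, channels))
    return plan

if __name__ == "__main__":
    for size, channels in channel_plan(112, 64, 3):
        print(size, channels)
```

Doubling the channels whenever the feature map is halved keeps the amount of computation per layer roughly constant, which is the consistency of time complexity that the design principle refers to.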
An embodiment of the present invention also provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor can perform the above slide switching detection method based on a spatio-temporal residual network model.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program can perform the above slide switching detection method based on a spatio-temporal residual network model.
Slide switching detection based on a spatio-temporal residual network model was carried out with the method and system described above. In the embodiment, the input lecture video is shown in Figure 2, a video frame volume (a combination of multiple video frames) is shown in Figure 3, and the final result of slide switching detection is shown in Figure 6. In the input lecture video, camera movement, speaker movement, and shot changes between the speaker and the slides cause strong interference. The slide switching detection method based on a spatio-temporal residual network model handles this interference, so the detection results do not mistake switches between people, or between a person and a slide, for slide switches.
It should be noted that the steps of the method provided by the present invention can be implemented by the corresponding modules, devices, units, and so on in the system. Those skilled in the art can refer to the technical solution of the system to implement the step flow of the method; that is, the embodiments of the system can be regarded as preferred examples of implementing the method, and they are not described again here.
Those skilled in the art know that, besides implementing the system and its devices provided by the present invention purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system and its devices provided by the present invention can therefore be regarded as a hardware component, and the devices it contains for realizing various functions can also be regarded as structures within the hardware component; the devices for realizing various functions can likewise be regarded both as software modules implementing the method and as structures within the hardware component.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to these specific embodiments; those skilled in the art can make various variations or modifications within the scope of the claims, and this does not affect the substance of the present invention.

Claims (10)

1. A slide switching detection method based on a spatio-temporal residual network model, characterized by comprising:
segmenting a video that is recorded by a single camera or multiple cameras and contains slides, a speaker and/or an audience into multiple video frame volumes, each containing video frames;
designing a convolutional neural network structure following the design principle of network structures for extracting spatial features of images; connecting a three-class output layer after the convolutional neural network structure, the three-class output layer being used to obtain the class information of a video frame volume, to obtain a three-class convolutional neural network model; designing the spatio-temporal residual network model on the basis of the structure of the three-class convolutional neural network model, the 3D convolution module of the 3D ConvNet network, and the residual module of the residual network model ResNet;
extracting the spatio-temporal features of the video frames with the 3D convolution module of the 3D ConvNet network, integrating the residual module of the residual network model ResNet into the 3D convolution module of the 3D ConvNet network to obtain a 3D convolution residual module, and building the spatio-temporal residual network model for classifying video frame volumes; wherein:
the training video is segmented into multiple video frame volumes, each containing video frames, and these video frame volumes are classified and then fed into the spatio-temporal residual network model for training to obtain the trained spatio-temporal residual network model;
the video frame volumes of the test video are fed into the trained spatio-temporal residual network model to obtain classification results, from which the slide switching moments are detected.
2. The slide switching detection method based on a spatio-temporal residual network model according to claim 1, characterized in that the structure of the three-class convolutional neural network model is a 12-layer convolutional neural network structure comprising 8 convolutional layers and 4 fully connected layers; as the network deepens, the width and height of the image keep decreasing, with the width and height of the image halved and the number of channels doubled after each pooling; the final output layer is the three-class output layer;
and/or
the design principle of the network structure for extracting spatial features of images comprises:
if the spatio-temporal feature maps input to and output from the 3D convolution residual module have the same size, the number of channels of the convolution kernels of the convolutional neural network is unchanged;
if the spatio-temporal feature map output by the 3D convolution residual module is half the size of the input spatio-temporal feature map, the number of channels of the convolution kernels of the convolutional neural network is doubled to keep the time complexity consistent.
3. The slide switching detection method based on a spatio-temporal residual network model according to claim 1, characterized in that the 3D convolution module of the 3D ConvNet network applies 3D convolutional layers and 3D pooling layers to model and extract the spatio-temporal feature maps of the video frames; the residual module of the residual network model ResNet applies shortcut connections and identity mapping to improve the learning efficiency of the model; the residual module of the residual network model ResNet is integrated into the 3D convolution module of the 3D ConvNet network to obtain the 3D convolution residual module; the shortcut connection of the 3D convolution residual module contains a 1 × 1 3D convolutional layer, which ensures that the output of the 3D convolution residual module and the output of the 1 × 1 3D convolutional layer mapping have the same dimensions.
4. The slide switching detection method based on a spatio-temporal residual network model according to claim 3, characterized in that the shortcut connection of the 3D convolution residual module contains a 1 × 1 3D convolutional layer, and the dimensions of the output of the 3D convolution residual module and of the output of the 1 × 1 3D convolutional layer mapping are made consistent as follows:
the 3D convolution residual module contains two convolutional layers, so the residual mapping F(x) is expressed as
F(x) = ω_2 σ(ω_1 x + b_1) + b_2
where x denotes the input; ω_1 denotes the weight coefficients of the first convolutional layer; ω_2 denotes the weight coefficients of the second convolutional layer; b_1 denotes the bias of the first convolutional layer; b_2 denotes the bias of the second convolutional layer;
σ denotes the ReLU activation function:
σ(x) = max(0, x)
where x denotes the input;
to make the dimensions of the input x and the residual mapping F(x) identical, a 1 × 1 3D convolutional layer is added to the shortcut connection, giving the weighted mapping H(x), expressed as
H(x) = W_s x
where W_s is a weight matrix used to match the dimensions of the input x and the residual mapping F(x);
the mapping equation Z(x) then becomes:
Z(x) = F(x) + H(x).
5. The slide switching detection method based on a spatio-temporal residual network model according to claim 1, characterized in that the spatio-temporal residual network model is provided with eight sequentially connected convolutional layers and four fully connected layers connected after the last convolutional layer;
in the spatio-temporal residual network model, the loss function is the cross-entropy loss function of a classification network.
6. The slide switching detection method based on a spatio-temporal residual network model according to claim 5, characterized in that the cross-entropy loss function is:
where x denotes the input, class denotes the true class to which the input belongs, n is the number of input video frames, x[class] denotes the score the input obtains for its true class, and x[j] denotes the score the input obtains for the j-th class.
7. The slide switching detection method based on a spatio-temporal residual network model according to any one of claims 1-6, characterized in that, when the video frame volumes are classified and then fed into the spatio-temporal residual network model for training, a single-stream network model extracts spatio-temporal features from two frames of the input video frame volume, and the spatio-temporal residual network model is trained with the Adam algorithm.
8. A slide switching detection system based on a spatio-temporal residual network model for implementing the method of any one of claims 1-7, characterized by comprising:
a segmentation module: segments a video that is recorded by a single camera or multiple cameras and contains slides, a speaker and/or an audience into multiple video frame volumes, each containing video frames;
a classification network structure design module: designs a convolutional neural network structure following the design principle of network structures for extracting spatial features of images; wherein the design principle is as follows: if the feature maps input to and output from the classification network structure have the same size, the number of channels of the convolution kernels of the convolutional neural network is unchanged; if the feature map output by the classification network structure is half the size of the input feature map, the number of channels of the convolution kernels of the convolutional neural network is doubled to keep the time complexity consistent; a three-class output layer is connected to the back end of the convolutional neural network structure to obtain the class information of a video frame volume, forming the classification network structure;
a spatio-temporal residual network model construction module: uses the 3D convolution module of the 3D ConvNet network to extract the spatio-temporal features of the video frames, integrates the residual module of the residual network model ResNet into the 3D convolution module of the 3D ConvNet network to obtain a 3D convolution residual module, and builds the spatio-temporal residual network model for classifying video frame volumes;
a training module: segments the training video into multiple video frame volumes, each containing video frames, classifies these video frame volumes, and feeds them into the spatio-temporal residual network model for training to obtain the trained spatio-temporal residual network model;
a detection module: feeds the video frame volumes of the test video into the trained spatio-temporal residual network model, outputs the class to which each video frame volume belongs, and detects the slide switching moments.
9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, can perform the method of any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, can perform the method of any one of claims 1-7.
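The residual mapping described in claim 4, Z(x) = F(x) + H(x), can be sketched in plain Python for the channel vector at a single position of the feature map. This is a simplified illustration and not the patented network: the two convolutional layers of F(x) are reduced to matrix-vector products, which is exact only for the 1 × 1 shortcut layer, and all function names and example weights are hypothetical.

```python
def relu(vector):
    """sigma(x) = max(0, x), applied element-wise."""
    return [max(0.0, value) for value in vector]

def affine(weights, vector, bias):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2, ws):
    """Z(x) = F(x) + H(x), with F(x) = w2 * relu(w1 * x + b1) + b2 and H(x) = ws * x."""
    f = affine(w2, relu(affine(w1, x, b1)), b2)
    h = affine(ws, x, [0.0] * len(ws))   # projection shortcut, no bias
    return [fi + hi for fi, hi in zip(f, h)]

if __name__ == "__main__":
    # A 2-channel input mapped to 3 output channels; ws matches the dimensions.
    x = [1.0, -2.0]
    w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    b1 = [0.0, 0.0, 0.0]
    w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    b2 = [0.5, 0.5, 0.5]
    ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    print(residual_block(x, w1, b1, w2, b2, ws))
```

The matrix ws plays the role of W_s: without it, a 2-channel input could not be added to a 3-channel residual output.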
CN201910208617.5A 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium Active CN109934188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910208617.5A CN109934188B (en) 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910208617.5A CN109934188B (en) 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN109934188A true CN109934188A (en) 2019-06-25
CN109934188B CN109934188B (en) 2020-10-30

Family

ID=66987669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910208617.5A Active CN109934188B (en) 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN109934188B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830734A (en) * 2019-10-30 2020-02-21 新华智云科技有限公司 Abrupt change and gradual change lens switching identification method
WO2021062990A1 (en) * 2019-09-30 2021-04-08 北京沃东天骏信息技术有限公司 Video segmentation method and apparatus, device, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150254493A1 (en) * 2014-03-10 2015-09-10 Case Western Reserve University Tumor Plus Adjacent Benign Signature (TABS) For Quantitative Histomorphometry
CN105718130A (en) * 2014-12-01 2016-06-29 珠海金山办公软件有限公司 Page switching method and apparatus for lantern slides
CN105957531A (en) * 2016-04-25 2016-09-21 上海交通大学 Speech content extracting method and speech content extracting device based on cloud platform
CN107798687A (en) * 2017-09-26 2018-03-13 上海大学 A kind of lantern slide switching detection method based on sparse time-varying figure
CN107920280A (en) * 2017-03-23 2018-04-17 广州思涵信息科技有限公司 The accurate matched method and system of video, teaching materials PPT and voice content
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIJIN LIU ET AL: "Sparse Time-Varying Graphs for Slide Transition Detection in Lecture Videos", 《ICIG 2017》 *
裴颂文 等: "融合的三维卷积神经网络的视频流分类研究", 《小型微型计算机系统》 *

Also Published As

Publication number Publication date
CN109934188B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Chen et al. Global context-aware progressive aggregation network for salient object detection
Hu et al. Learning supervised scoring ensemble for emotion recognition in the wild
Li et al. Building-a-nets: Robust building extraction from high-resolution remote sensing images with adversarial networks
CN108171701B (en) Significance detection method based on U network and counterstudy
CN110942009B (en) Fall detection method and system based on space-time hybrid convolutional network
CN110516670A (en) Suggested based on scene grade and region from the object detection method for paying attention to module
CN107341462A (en) A kind of video classification methods based on notice mechanism
CN112418012B (en) Video abstract generation method based on space-time attention model
US9626585B2 (en) Composition modeling for photo retrieval through geometric image segmentation
CN107330364A (en) A kind of people counting method and system based on cGAN networks
CN113642621B (en) Zero sample image classification method based on generation countermeasure network
CN109919011A (en) A kind of action video recognition methods based on more duration informations
Zhuang et al. Marine Animal Detection and Recognition with Advanced Deep Learning Models.
CN111881716A (en) Pedestrian re-identification method based on multi-view-angle generation countermeasure network
CN110334718A (en) A kind of two-dimensional video conspicuousness detection method based on shot and long term memory
Li et al. Image manipulation localization using attentional cross-domain CNN features
CN114612832A (en) Real-time gesture detection method and device
CN110717863A (en) Single-image snow removing method based on generation countermeasure network
Tian et al. An attempt towards interpretable audio-visual video captioning
CN113627504A (en) Multi-mode multi-scale feature fusion target detection method based on generation of countermeasure network
CN103336835A (en) Image retrieval method based on weight color-sift characteristic dictionary
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
Wang et al. Basketball shooting angle calculation and analysis by deeply-learned vision model
CN109934188A (en) A kind of lantern slide switching detection method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant