WO2018019126A1 - Video category recognition method and apparatus, data processing apparatus and electronic device - Google Patents

Video category recognition method and apparatus, data processing apparatus and electronic device

Info

Publication number
WO2018019126A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
time domain
classification result
neural network
convolutional neural
Prior art date
Application number
PCT/CN2017/092597
Other languages
English (en)
French (fr)
Inventor
汤晓鸥
王利民
熊元骏
王喆
乔宇
林达华
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Publication of WO2018019126A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Definitions

  • The present disclosure belongs to the field of computer vision technology, and in particular relates to a video category recognition method and apparatus, a data processing apparatus, and an electronic device.
  • Motion recognition is a research hotspot in computer vision.
  • Motion recognition technology mainly recognizes the motion in a video by processing the video, which is composed of a sequence of color pictures.
  • The difficulty of motion recognition technology lies in how to process dynamically changing video content so as to overcome changes in distance, viewing angle, camera movement, and scene, and thereby correctly recognize the motion in the video.
  • The present disclosure provides a technical solution for video category recognition.
  • A video category recognition method includes: segmenting a video to obtain two or more segmented videos; sampling each of the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; processing the original image of each segmented video with a spatial convolutional neural network to obtain a spatial domain classification result of the video; processing the optical flow image of each segmented video with a time domain convolutional neural network to obtain a time domain classification result of the video; and fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
  • A video category recognition apparatus includes: a segmentation unit for segmenting a video to obtain two or more segmented videos; a sampling unit for sampling each of the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; a spatial domain classification processing unit for processing the original image of each segmented video with a spatial convolutional neural network to obtain the spatial domain classification result of the video; a time domain classification processing unit for processing the optical flow image of each segmented video with a time domain convolutional neural network to obtain the time domain classification result of the video; and a fusion unit for fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
  • A data processing apparatus includes the video category recognition apparatus described above.
  • An electronic device is provided with the data processing apparatus described above.
  • A computer storage medium is provided for storing computer-readable instructions, the instructions including: instructions for segmenting a video to obtain two or more segmented videos; instructions for sampling each of the two or more segmented videos to obtain the original image and the optical flow image of each segmented video; instructions for processing the original image of each segmented video with a spatial convolutional neural network to obtain the spatial domain classification result of the video; instructions for processing the optical flow image of each segmented video with a time domain convolutional neural network to obtain the time domain classification result of the video; and instructions for fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
  • A computer apparatus includes: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions so as to perform operations corresponding to the video category recognition method of the present disclosure.
  • In the present disclosure, two or more segmented videos are obtained by segmenting a video; each of the segmented videos is sampled to obtain its original image and optical flow image; the original image of each segmented video is then processed with a spatial convolutional neural network to obtain the spatial domain classification result of the video, and the optical flow image of each segmented video is processed with a time domain convolutional neural network to obtain the time domain classification result of the video; finally, the spatial domain classification result and the time domain classification result are fused to obtain the classification result of the video.
  • By dividing a video into two or more segmented videos and separately sampling a frame picture and the inter-frame optical flow for each segmented video, the present disclosure can model long-term motion when training the convolutional neural networks.
  • When the network model obtained by this training is subsequently used to recognize the video category, it helps to improve the accuracy of video category recognition and the recognition effect, at a small computational cost.
  • FIG. 1 shows a schematic diagram of an application scenario of the present disclosure.
  • FIG. 2 is a flow chart of one embodiment of the video category recognition method of the present disclosure.
  • FIG. 3 is a flow chart of another embodiment of the video category recognition method of the present disclosure.
  • FIG. 4 is a flow chart of still another embodiment of the video category recognition method of the present disclosure.
  • FIG. 5 is a flow chart of still another embodiment of the video category recognition method of the present disclosure.
  • FIG. 6 is a flow chart of one embodiment of training the initial spatial convolutional neural network in the present disclosure.
  • FIG. 7 is a flow chart of one embodiment of training the initial time domain convolutional neural network in the present disclosure.
  • FIG. 8 is a schematic structural diagram of one embodiment of the video category recognition apparatus of the present disclosure.
  • FIG. 9 is a schematic structural diagram of another embodiment of the video category recognition apparatus of the present disclosure.
  • FIG. 10 is a schematic structural diagram of still another embodiment of the video category recognition apparatus of the present disclosure.
  • FIG. 11 is a schematic structural diagram of still another embodiment of the video category recognition apparatus of the present disclosure.
  • FIG. 12 is a schematic structural diagram of still another embodiment of the video category recognition apparatus of the present disclosure.
  • FIG. 13 is a diagram of an application example of the video category recognition apparatus of the present disclosure.
  • FIG. 14 is a schematic structural diagram of one embodiment of the electronic device of the present disclosure.
  • the technical solutions provided by the present disclosure can be applied to computer systems/servers that can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments including any of the above, and the like.
  • The computer system/server can be described in the general context of computer-system-executable instructions (such as program modules) being executed by a computer system.
  • Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • The computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • In a distributed cloud computing environment, program modules may be located on storage media of local or remote computing systems including storage devices.
  • The Two-Stream Convolutional Neural Network is a representative network model.
  • The two-stream convolutional neural network uses two convolutional neural networks, namely a spatial convolutional neural network and a time domain convolutional neural network, to model the frame picture and the inter-frame optical flow respectively, and recognizes the motion in the video by fusing the classification results of the two convolutional neural networks.
  • Although the two-stream convolutional neural network can model the frame picture and the inter-frame optical flow, that is, short-term motion information, it lacks the ability to model long-term motion, so the accuracy of motion recognition cannot be guaranteed.
  • FIG. 1 schematically illustrates an application scenario in which the video category recognition technical solution provided by the present disclosure may be implemented.
  • At least one electronic device on the terminal side (such as one or more of the electronic device A1, the electronic device A2, ..., and the electronic device Am) is an electronic device having Internet access capability.
  • Videos are stored in one or more of the electronic device A1, the electronic device A2, ..., and the electronic device Am.
  • The video stored in the electronic device may be a video captured by the user using the electronic device, a video stored in the electronic device by the user through data transmission between electronic devices, or a video downloaded by the user from the network using the electronic device, etc.
  • The user can upload or send the videos stored in the electronic device to the corresponding server, or to other electronic devices on the terminal side, through the Internet.
  • The electronic device on the server side or the terminal side can classify and store the videos thus obtained.
  • The server may be formed by a single electronic device such as a server on the service side, or by multiple electronic devices such as servers. The present disclosure does not limit the specific form of the electronic devices on the server or terminal side.
  • The technical solutions provided by the present disclosure enable the electronic device on the server side or the terminal side to automatically analyze the content of the videos it obtains and recognize the category to which each video belongs, so that each obtained video can be automatically placed into the video set of the first category, the video set of the second category, ..., or the video set of the z-th category according to its category.
  • By automatically placing each video into the video set of its corresponding category, the present disclosure facilitates video classification management for the electronic device on the server side or the terminal side.
  • The present disclosure can also be applied to other application scenarios; that is, the applicable scenarios are not limited to the one described above. For example, the present disclosure may be carried out within a single electronic device (such as a processor of the electronic device), or between electronic devices in peer-to-peer communication of a non-terminal-server structure, and the like.
  • In step 102, the video is segmented to obtain two or more segmented videos.
  • Step 102 may be performed by a processor invoking instructions stored in a memory, or by a segmentation unit run by the processor.
  • When the segmentation unit segments the video, it may divide the video equally to obtain two or more segmented videos of the same length; for example, it may divide the video into 3 or 5 segmented videos of equal length, the number of segments being determined according to the actual effect. The segmentation unit may also segment the video randomly, or extract several segments from the video as the segmented videos.
  • Optionally, the length of the video may be acquired first, and the segmentation unit determines the length of each segment according to the length of the video and a preset number of segments; on this basis, the segmentation unit can divide the received video equally into two or more segmented videos of the same length, as in the sketch below.
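A rough sketch of this equal segmentation and per-segment sampling, assuming a video is represented simply by its frame count (the function names are illustrative, not taken from the patent):

```python
import random

def segment_indices(num_frames, num_segments=3):
    """Divide the frame indices of a video equally into `num_segments` parts."""
    seg_len = num_frames // num_segments
    return [range(i * seg_len, (i + 1) * seg_len) for i in range(num_segments)]

def sample_original_frames(num_frames, num_segments=3):
    """Randomly pick one frame index per segment as that segment's original image."""
    return [random.choice(seg) for seg in segment_indices(num_frames, num_segments)]

# Example: a 90-frame video split into 3 segments of 30 frames each.
print(sample_original_frames(90))  # e.g. [12, 47, 65]
```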
  • On this basis, the processor can train the network model of the convolutional neural network on long videos, which simplifies the training process of the network model; and when the trained convolutional neural network is used for video category recognition, the time required to recognize each segmented video is similar, which improves the overall efficiency of video category recognition.
  • Step 104 may be performed by a processor invoking instructions stored in a memory, or by a sampling unit run by the processor.
  • Optionally, one frame image may be randomly extracted from each segmented video as the original image of that segmented video.
  • Optionally, multiple consecutive frames may be randomly extracted from each segmented video to obtain the optical flow image of that segmented video.
  • Optionally, the optical flow image may be a grayscale image based on an 8-bit bitmap with 256 discrete gray levels, whose median value is 128.
  • Since the optical flow field is a vector field, two scalar-field pictures are required to represent an optical flow image, corresponding to the magnitudes along the X direction and the Y direction of the image coordinate axes.
  • The optical flow sampling module randomly extracts consecutive multi-frame images from each segmented video to obtain the optical flow image of each segmented video; this may be implemented, for each segmented video separately, as follows:
  • The optical flow sampling module randomly extracts N consecutive frames from the segmented video, where N is an integer greater than one;
  • The optical flow sampling module performs a calculation on each pair of adjacent frames among the N frames to obtain N-1 groups of optical flow images, where each of the N-1 groups includes one frame of horizontal optical flow image and one frame of vertical optical flow image.
  • For example, the optical flow sampling module randomly extracts 6 consecutive frames from each segmented video and performs a calculation on each pair of adjacent frames among the 6 frames.
  • The optical flow sampling module thus obtains five groups of optical flow grayscale images, where each of the five groups includes one frame of horizontal optical flow grayscale image and one frame of vertical optical flow grayscale image; that is, the optical flow sampling module obtains 10 frames of optical flow grayscale images, which can be used as one 10-channel image, as sketched below.
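A minimal sketch of assembling such a 10-channel optical flow stack. It assumes OpenCV's Farneback dense flow as the concrete flow algorithm and a scale factor of 16 for the 8-bit encoding centered at 128; the patent prescribes neither, so both are illustrative choices:

```python
import cv2
import numpy as np

def flow_stack(gray_frames):
    """Turn N consecutive grayscale frames into a (2*(N-1))-channel uint8 stack:
    one horizontal and one vertical flow image per adjacent frame pair."""
    channels = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)  # H x W x 2 vector field
        for c in range(2):  # 0: X-direction magnitude, 1: Y-direction magnitude
            # Map the signed flow values onto [0, 255] around a median of 128.
            q = np.clip(flow[..., c] * 16 + 128, 0, 255).astype(np.uint8)
            channels.append(q)
    return np.stack(channels, axis=-1)  # e.g. 6 frames -> 10-channel image

# Example with 6 synthetic 64x64 frames -> a 64x64x10 stack.
frames = [np.random.randint(0, 256, (64, 64), np.uint8) for _ in range(6)]
print(flow_stack(frames).shape)  # (64, 64, 10)
```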
  • Step 106 may be performed by a processor invoking instructions stored in a memory, or by a spatial domain classification processing unit and a time domain classification processing unit run by the processor; for example, the spatial domain classification processing unit processes the original image of each segmented video with a spatial convolutional neural network to obtain the spatial domain classification result of the video, and the time domain classification processing unit processes the optical flow image of each segmented video with a time domain convolutional neural network to obtain the time domain classification result of the video.
  • The spatial domain classification result and the time domain classification result of the video are each a classification result vector whose dimension equals the number of classification categories.
  • For example, if the classification categories are running, high jump, walking, pole vault, long jump, and triple jump, 6 categories in total, then the spatial domain classification result and the time domain classification result are each a 6-dimensional classification result vector.
  • Step 108 may be performed by a processor invoking instructions stored in a memory, or by a fusion unit run by the processor.
  • The classification result of the video is a classification result vector whose dimension equals the number of classification categories.
  • For example, with the 6 categories of running, high jump, walking, pole vault, long jump, and triple jump, the classification result of the video is a 6-dimensional classification result vector.
  • The fusion unit may perform the fusion as follows: it multiplies the spatial domain classification result and the time domain classification result by preset weight coefficients respectively and then sums them to obtain the classification result of the video. The weight coefficients are determined by the fusion unit according to the classification accuracy of the corresponding network models on a verification data set; the network model with the higher classification accuracy receives the higher weight. The verification data set is composed of videos that are labeled with their real categories but did not participate in network training.
  • The verification data set can be obtained in any feasible way, for example by searching a search engine for videos of the corresponding categories.
  • The ratio of the weight coefficients between the spatial domain classification result and the time domain classification result may be any ratio between 1:1 and 1:3; in an optional implementation, the ratio may be 1:1.5, as in the sketch below.
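A minimal sketch of this weighted fusion under the 1:1.5 ratio; the score vectors and category list are illustrative only:

```python
import numpy as np

CATEGORIES = ["running", "high jump", "walking", "pole vault", "long jump", "triple jump"]

def fuse(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
    """Weighted sum of the two classification result vectors."""
    return w_spatial * np.asarray(spatial_scores) + w_temporal * np.asarray(temporal_scores)

spatial = [0.2, 1.1, 0.1, 0.4, 0.7, 0.3]   # illustrative spatial domain result
temporal = [0.1, 1.4, 0.2, 0.3, 0.5, 0.2]  # illustrative time domain result
scores = fuse(spatial, temporal)
print(CATEGORIES[int(np.argmax(scores))])  # highest-scoring category, here "high jump"
```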
  • In summary, two or more segmented videos are obtained by segmenting the video; each of the segmented videos is sampled to obtain its original image and optical flow image; the original image of each segmented video is processed with the spatial convolutional neural network to obtain the spatial domain classification result of the video, and the optical flow image of each segmented video is processed with the time domain convolutional neural network to obtain the time domain classification result of the video; finally, the spatial domain classification result and the time domain classification result are fused to obtain the classification result of the video.
  • Because the present disclosure samples the frame picture and the inter-frame optical flow separately for each segmented video, it can model long-term motion; when the network model obtained by training is used to recognize the video category, this helps improve the accuracy of video category recognition and the recognition effect, at a small computational cost.
  • In step 202, the video is segmented to obtain two or more segmented videos.
  • Step 202 may be performed by a processor invoking instructions stored in a memory, or by a segmentation unit run by the processor.
  • When the segmentation unit segments the video, it may divide the video equally to obtain two or more segmented videos of the same length, which simplifies the training process of the network model of the convolutional neural network and improves the overall efficiency of video category recognition.
  • For example, the segmentation unit divides the video into 3 or 5 segmented videos of equal length, the number of segments being determined according to the actual effect.
  • The segmentation unit may also segment the video randomly, or extract several segments from the video as segmented videos. As shown in FIG. 13, in one application example of the disclosed video category recognition method, the segmentation unit divides the video equally into 3 segmented videos.
  • Step 204 may be performed by a processor invoking instructions stored in a memory, or by a sampling unit run by the processor.
  • For example, the image sampling module in the sampling unit may randomly extract one frame image from each segmented video as the original image of that segmented video, and the optical flow sampling module in the sampling unit may randomly extract consecutive multi-frame images from each segmented video to obtain the optical flow image of that segmented video.
  • In this application example, the sampling unit samples the three segmented videos separately to obtain one frame of original image and the inter-frame optical flow images of each of the three segmented videos.
  • The original image may be an RGB color image, and the optical flow image may be a grayscale image.
  • Step 206 may be performed by a processor invoking instructions stored in a memory, or by a spatial domain classification processing module and a first time domain classification processing module run by the processor; for example, the spatial domain classification processing module processes the original image of each segmented video with the spatial convolutional neural network to obtain the spatial domain preliminary classification result of each segmented video, and the first time domain classification processing module processes the optical flow image of each segmented video with the time domain convolutional neural network to obtain the time domain preliminary classification result of each segmented video.
  • The spatial domain preliminary classification result and the time domain preliminary classification result are each a classification result vector whose dimension equals the number of classification categories.
  • For example, if the classification categories are running, high jump, walking, pole vault, long jump, and triple jump, 6 categories in total, then the spatial domain preliminary classification result and the time domain preliminary classification result are each a 6-dimensional classification result vector.
  • In the application example of FIG. 13, the spatial domain classification processing module processes the original images of the three segmented videos with the spatial convolutional neural network to obtain the 3 spatial domain preliminary classification results of the three segmented videos, and the time domain classification processing module processes the optical flow images of the three segmented videos with the time domain convolutional neural network to obtain the 3 time domain preliminary classification results of the three segmented videos.
  • The spatial convolutional neural network and/or the time domain convolutional neural network may first obtain a feature representation of the image through a combination of convolutional layers, nonlinear layers, pooling layers, and the like, and then obtain the score of each category through a linear classification layer; this score vector is the preliminary classification result of each segmented video.
  • For example, with the 6 categories of running, high jump, walking, pole vault, long jump, and triple jump, the spatial domain preliminary classification result and the time domain preliminary classification result of each segmented video are each a 6-dimensional vector containing the classification scores of the video for these 6 categories.
  • Step 208 may be performed by a processor invoking instructions stored in a memory, or by a first integrated processing module and a second integrated processing module run by the processor; for example, the first integrated processing module comprehensively processes the spatial domain preliminary classification results of the segmented videos with a spatial domain consensus function to obtain the spatial domain classification result of the video, and the second integrated processing module comprehensively processes the time domain preliminary classification results of the segmented videos with a time domain consensus function to obtain the time domain classification result of the video.
  • The spatial domain classification result and the time domain classification result of the video may each be a classification result vector whose dimension equals the number of classification categories.
  • The spatial domain consensus function and/or the time domain consensus function may be an average function, a maximum function, or a weighted average function.
  • The present disclosure may select, as the spatial domain consensus function, whichever of the average function, the maximum function, or the weighted average function achieves the highest classification accuracy on the verification data set, and likewise for the time domain consensus function.
  • The average function averages the category scores of the same category across the different segmented videos and outputs the average as that category's score; the maximum function selects the maximum of the category scores of the same category across the different segmented videos as the output score; the weighted average function outputs a weighted average of the category scores of the same category across the different segmented videos, where every category uses the same set of weights, and this set of weights is optimized as a network model parameter during training.
  • In the application example of FIG. 13, the processor may select the average function as both the spatial domain consensus function and the time domain consensus function: the first integrated processing module uses the spatial domain consensus function to compute, for each category, the average of the three scores of that category in the three spatial domain preliminary classification results of the three segmented videos, taking the average as the category score; the resulting set of category scores over all categories is the spatial domain classification result of the video. The second integrated processing module likewise uses the time domain consensus function to compute, for each category, the average of the three scores of that category in the three time domain preliminary classification results, obtaining the set of category scores over all categories as the time domain classification result of the video.
  • For example, with the 6 categories of running, high jump, walking, pole vault, long jump, and triple jump, the spatial domain classification result and the time domain classification result of the video are each a 6-dimensional vector containing the category scores of the video for the six categories. A sketch of these consensus functions follows.
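The three consensus functions can be sketched as follows; the per-segment scores are made up, and the per-segment weights of the weighted variant stand in for weights that would be learned as network model parameters during training:

```python
import numpy as np

def consensus(segment_scores, mode="avg", weights=None):
    """Combine per-segment preliminary results (shape: segments x categories)
    into one video-level classification result vector."""
    s = np.asarray(segment_scores)
    if mode == "avg":
        return s.mean(axis=0)            # average function
    if mode == "max":
        return s.max(axis=0)             # maximum function
    if mode == "weighted":
        w = np.asarray(weights)          # one weight per segment, shared by all categories
        return w @ s / w.sum()           # weighted average function
    raise ValueError(mode)

# Three segments, six categories (running, high jump, ...).
seg = [[0.2, 1.3, 0.1, 0.4, 0.6, 0.3],
       [0.1, 1.0, 0.3, 0.2, 0.8, 0.2],
       [0.3, 1.2, 0.2, 0.5, 0.4, 0.1]]
print(consensus(seg, "avg"))
print(consensus(seg, "weighted", weights=[0.5, 0.3, 0.2]))
```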
  • Step 210 may be performed by a processor invoking instructions stored in a memory, or by a fusion unit run by the processor.
  • The classification result of the video is a classification result vector whose dimension equals the number of classification categories.
  • In the application example of FIG. 13, the fusion unit multiplies the spatial domain classification result and the time domain classification result of the video by weight coefficients in the ratio 1:1.5 respectively, and sums them to obtain the classification result of the video.
  • For example, with the 6 categories of running, high jump, walking, pole vault, long jump, and triple jump, the classification result of the video is a 6-dimensional vector containing the classification scores of the video for the 6 categories.
  • The category with the highest score is the category to which the video belongs; in this example, the highest-scoring category is high jump, so the category of the video is recognized as high jump.
  • In this embodiment, the preliminary classification results of the segmented videos are synthesized by the consensus functions to obtain the classification result of the video. Because the consensus function places no restriction on the convolutional neural network model used for each segmented video, the different segmented videos can share the parameters of one network model, so the network model has fewer parameters, and a network model with fewer parameters can recognize the category of a video of any length. During training, a video of any length is segmented and passed through the segmented network; by supervising the classification result of the whole video against its real label, training supervision at the full-video level can be achieved without a limit on video length.
  • In step 302, the video is segmented to obtain two or more segmented videos.
  • Step 302 may be performed by a processor invoking instructions stored in a memory, or by a segmentation unit run by the processor.
  • In step 304, each of the two or more segmented videos is sampled to obtain the original image and the original optical flow image of each segmented video.
  • Step 304 may be performed by a processor invoking instructions stored in a memory, or by a sampling unit run by the processor; for example, the image sampling module in the sampling unit obtains the original image of each segmented video, and the optical flow sampling module obtains the original optical flow image of each segmented video.
  • In step 306, a deformed optical flow image is acquired by deforming the original optical flow image.
  • Step 306 may be performed by a processor invoking instructions stored in a memory, or by an optical flow processing unit run by the processor.
  • The optical flow processing unit obtains the deformed optical flow image as follows: it performs a calculation on each pair of adjacent frames to obtain the homography transformation matrix between the two adjacent frames; it then applies an affine transformation to the latter frame of each adjacent pair according to that homography transformation matrix; finally, it performs the optical flow calculation between the former frame of each adjacent pair and the affine-transformed latter frame to obtain the deformed optical flow image.
  • The optical flow between the former frame and the affine-transformed latter frame is thus used as input information for video category recognition, which helps reduce the influence of camera movement on the video category recognition effect.
  • The calculation that the optical flow processing unit performs on each pair of adjacent frames includes inter-frame feature point matching based on the Speeded-Up Robust Features (SURF) feature point descriptor. A sketch of this warping step follows.
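A minimal sketch of computing such a deformed (warped) optical flow with OpenCV. SURF is assumed to come from the opencv-contrib package (it may be unavailable in some builds, in which case another descriptor such as ORB could be substituted), and Farneback flow stands in for whatever flow algorithm an implementation actually uses:

```python
import cv2
import numpy as np

def warped_flow(prev_gray, next_gray):
    """Estimate camera motion between two frames via SURF matching + homography,
    warp the latter frame to cancel it, then compute the residual optical flow."""
    surf = cv2.xfeatures2d.SURF_create()          # requires opencv-contrib-python
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).match(des1, des2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # maps next -> prev coords
    h, w = prev_gray.shape
    warped_next = cv2.warpPerspective(next_gray, H, (w, h))
    # Remaining motion after cancelling the camera movement (H x W x 2 field).
    return cv2.calcOpticalFlowFarneback(prev_gray, warped_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```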
  • In step 308, the original image of each segmented video is processed with the spatial convolutional neural network to obtain the spatial domain preliminary classification result of each segmented video; the original optical flow image of each segmented video is processed with the first time domain convolutional neural network to obtain the first time domain preliminary classification result of each segmented video; and the deformed optical flow image of each segmented video is processed with the second time domain convolutional neural network to obtain the second time domain preliminary classification result of each segmented video.
  • Step 308 may be performed by a processor invoking instructions stored in a memory, or by a spatial domain classification processing module, a first time domain classification processing module, and a second time domain classification processing module run by the processor: the spatial domain classification processing module processes the original image of each segmented video with the spatial convolutional neural network to obtain the spatial domain preliminary classification result of each segmented video; the first time domain classification processing module processes the original optical flow image of each segmented video with the first time domain convolutional neural network to obtain the first time domain preliminary classification result of each segmented video; and the second time domain classification processing module processes the deformed optical flow image of each segmented video with the second time domain convolutional neural network to obtain the second time domain preliminary classification result of each segmented video.
  • Step 310 may be performed by a processor invoking instructions stored in a memory, or by a first integrated processing module, a second integrated processing module, and a third integrated processing module run by the processor; for example, the first integrated processing module comprehensively processes the spatial domain preliminary classification results of the segmented videos with a spatial domain consensus function to obtain the spatial domain classification result of the video; the second integrated processing module comprehensively processes the first time domain preliminary classification results of the segmented videos with a first time domain consensus function to obtain the first time domain classification result of the video; and the third integrated processing module comprehensively processes the second time domain preliminary classification results of the segmented videos with a second time domain consensus function to obtain the second time domain classification result of the video.
  • In step 312, the spatial domain classification result, the first time domain classification result, and the second time domain classification result are fused to obtain the classification result of the video.
  • Step 312 may be performed by a processor invoking instructions stored in a memory, or by a fusion unit run by the processor.
  • The fusion unit fuses the spatial domain classification result, the first time domain classification result, and the second time domain classification result as follows: it multiplies each of the three results by a preset weight coefficient and sums them to obtain the classification result of the video.
  • The weight coefficients are determined according to the classification accuracy of the corresponding network models on the verification data set, and the network model with the higher classification accuracy receives the higher weight.
  • For example, the ratio of the weight coefficients among the spatial domain classification result, the first time domain classification result, and the second time domain classification result may be 1:a:b, where the sum of a and b is not less than 1 and not more than 3; for example, the ratio may be 1:1:0.5, as in the sketch below.
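Extending the earlier two-stream sketch, the three-result fusion under the 1:1:0.5 ratio might look like this (score vectors again illustrative):

```python
import numpy as np

def fuse3(spatial, temporal1, temporal2, w=(1.0, 1.0, 0.5)):
    """Weighted sum of the spatial, original-flow, and deformed-flow results."""
    return (w[0] * np.asarray(spatial)
            + w[1] * np.asarray(temporal1)
            + w[2] * np.asarray(temporal2))

print(fuse3([0.2, 1.1, 0.1, 0.4, 0.7, 0.3],
            [0.1, 1.4, 0.2, 0.3, 0.5, 0.2],
            [0.2, 1.2, 0.1, 0.2, 0.6, 0.1]))
```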
  • In this embodiment, the deformed optical flow is used as an additional representation of short-term motion information, and the input of video category recognition is expanded to three kinds of information: the frame picture, the inter-frame optical flow, and the deformed optical flow. Because the deformed optical flow removes the influence of camera movement, it helps reduce the influence of camera movement on the video category recognition effect.
  • The three kinds of input information, i.e., the frame picture, the inter-frame optical flow, and the deformed optical flow, are also used to train the network model, which helps reduce the impact of camera movement on the network model and improves the robustness of the video category recognition technology against camera movement.
  • In step 3020, the video is segmented to obtain two or more segmented videos.
  • Step 3020 may be performed by a processor invoking instructions stored in a memory, or by a segmentation unit run by the processor.
  • In step 3040, each of the two or more segmented videos is sampled to obtain the original image and the original optical flow image of each segmented video.
  • Step 3040 may be performed by a processor invoking instructions stored in a memory, or by a sampling unit run by the processor; for example, the image sampling module in the sampling unit obtains the original image of each segmented video, and the optical flow sampling module obtains the original optical flow image of each segmented video.
  • In step 3060, a deformed optical flow image is acquired by deforming the original optical flow image.
  • Step 3060 may be performed by a processor invoking instructions stored in a memory, or by an optical flow processing unit run by the processor.
  • The optical flow processing unit obtains the deformed optical flow image as follows: it performs a calculation on each pair of adjacent frames to obtain the homography transformation matrix between the two; it then applies an affine transformation to the latter frame of each adjacent pair according to that homography transformation matrix; finally, it performs the optical flow calculation between the former frame of each adjacent pair and the affine-transformed latter frame to obtain the deformed optical flow image.
  • The calculation that the optical flow processing unit performs on each pair of adjacent frames includes inter-frame feature point matching based on the Speeded-Up Robust Features (SURF) feature point descriptor.
  • Step 3080 may be performed by a processor invoking instructions stored in a memory, or by a spatial domain classification processing module and a second time domain classification processing module run by the processor; for example, the spatial domain classification processing module processes the original image of each segmented video with the spatial convolutional neural network to obtain the spatial domain preliminary classification result of each segmented video, and the second time domain classification processing module processes the deformed optical flow image of each segmented video with the second time domain convolutional neural network to obtain the second time domain preliminary classification result of each segmented video.
  • Step 3100 may be performed by a processor invoking instructions stored in a memory, or by a first integrated processing module and a third integrated processing module run by the processor; for example, the first integrated processing module comprehensively processes the spatial domain preliminary classification results of the segmented videos with the spatial domain consensus function to obtain the spatial domain classification result of the video, and the third integrated processing module comprehensively processes the second time domain preliminary classification results of the segmented videos with the second time domain consensus function to obtain the second time domain classification result of the video.
  • In step 3120, the spatial domain classification result and the second time domain classification result are fused to obtain the classification result of the video.
  • Step 3120 may be performed by a processor invoking instructions stored in a memory, or by a fusion unit run by the processor.
  • The fusion unit fuses the spatial domain classification result and the second time domain classification result as follows: it multiplies the two results by preset weight coefficients respectively and then sums them to obtain the classification result of the video.
  • The weight coefficients are determined according to the classification accuracy of the corresponding network models on the verification data set, and the network model with the higher classification accuracy receives the higher weight.
  • For example, the ratio of the weight coefficients between the spatial domain classification result and the second time domain classification result may be any ratio between 1:1 and 1:3; in an optional implementation, the ratio may be 1:1.5, etc.
  • The above video category recognition technology of the present disclosure can be applied in the training phase of the convolutional neural network models, and also in the test phase and the subsequent application phase of the models.
  • When the video category recognition technology is applied in the test phase and the subsequent application phase, after step 108, 210, 312, or 3120, the classification result vector obtained by the fusion processing may be normalized with the Softmax function to obtain the classification probability vector of the video belonging to each category, as sketched below.
  • The normalization operation in this step may be performed by a processor invoking instructions stored in a memory, or by a first normalization processing unit run by the processor.
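A minimal sketch of this Softmax normalization; the fused score vector is illustrative:

```python
import numpy as np

def softmax(scores):
    """Normalize a classification result vector into a probability vector."""
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

fused = np.array([0.55, 5.3, 0.45, 1.0, 2.1, 0.65])
print(softmax(fused))  # probabilities over the 6 categories, summing to 1
```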
  • An initial spatial convolutional neural network and an initial time domain convolutional neural network are preset. As an alternative example, the operations of presetting the initial spatial convolutional neural network and the initial time domain convolutional neural network may be performed by a processor invoking instructions stored in a memory, and the preset initial networks may be stored in a network training unit.
  • The initial spatial convolutional neural network is trained by stochastic gradient descent (SGD) to obtain the spatial convolutional neural network of each of the above embodiments, and the initial time domain convolutional neural network is likewise trained by SGD to obtain the time domain convolutional neural network of each of the above embodiments.
  • This step may be performed by a processor invoking instructions stored in a memory, or by a network training unit run by the processor.
  • The videos used as samples are pre-labeled with standard spatial domain classification result information.
  • Stochastic gradient descent iteratively updates the network model with each sample.
  • The network training unit uses stochastic gradient descent to train the initial spatial convolutional neural network and the initial time domain convolutional neural network; the training is fast, which helps improve training efficiency. A sketch of such an SGD update follows.
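A minimal sketch of such per-sample SGD training, written here in PyTorch; the tiny placeholder network, learning rate, and tensor shapes are assumptions for illustration, not the patent's architecture:

```python
import torch
import torch.nn as nn

# Placeholder spatial network: a conv layer, pooling, and a 6-way linear classifier.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Illustrative (sample, label) pairs: RGB frames and their standard categories.
samples = [(torch.randn(1, 3, 224, 224), torch.tensor([1])) for _ in range(4)]

for frame, label in samples:     # one SGD update per training sample
    optimizer.zero_grad()
    loss = criterion(model(frame), label)
    loss.backward()              # backpropagate the classification error
    optimizer.step()             # adjust the network parameters
```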
  • As shown in FIG. 6, in step 402, for a video used as a sample, the flow shown in the above-described alternative embodiments of the present disclosure is run until the spatial domain classification result of the video is obtained.
  • For example, the processor performs the spatial-domain-related operations among operations 102-106, 202-208, 302-310, or 3020-3100 to obtain the spatial domain classification result of the video.
  • In step 404, if the difference between the spatial domain classification result and the pre-labeled standard result is not less than a preset range, operation 406 is performed; if it is less than the preset range, the training process for the initial spatial convolutional neural network ends, the current initial spatial convolutional neural network is taken as the final spatial convolutional neural network, and the subsequent flow of this embodiment is not performed. In step 406, the network parameters of the initial spatial convolutional neural network are adjusted; in step 408, the spatial convolutional neural network with the adjusted parameters is taken as the new initial spatial convolutional neural network, and step 402 is performed again for the next sample video.
  • Steps 404, 406, and 408 may be performed by a processor invoking instructions stored in a memory, or by a network training unit run by the processor.
  • As shown in FIG. 7, in step 502, for a video used as a sample, the flow starting from segmenting the video is run until the time domain classification result of the video is obtained.
  • For example, the processor performs the time-domain-related operations among operations 102-106, 202-208, 302-310, or 3020-3100 to obtain the time domain classification result of the video.
  • In step 504, if the difference between the time domain classification result and the pre-labeled standard result is not less than a preset range, operation 506 is performed, in which the network parameters of the initial time domain convolutional neural network are adjusted; if it is less than the preset range, the training process for the initial time domain convolutional neural network ends, the current initial time domain convolutional neural network is taken as the final time domain convolutional neural network, and the subsequent flow of this embodiment is not performed.
  • In step 508, the time domain convolutional neural network with the adjusted network parameters is taken as the new initial time domain convolutional neural network, and step 502 is performed again for the next sample video.
  • Steps 504, 506, and 508 may be performed by a processor invoking instructions stored in a memory, or by a network training unit run by the processor.
  • In this embodiment, the initial time domain convolutional neural network may be the first initial time domain convolutional neural network or the second initial time domain convolutional neural network; the time domain classification result correspondingly includes the first time domain classification result or the second time domain classification result, and the time domain convolutional neural network correspondingly includes the first time domain convolutional neural network or the second time domain convolutional neural network. That is, the training of the first and second initial time domain convolutional neural networks may be implemented separately or simultaneously by the embodiment shown in FIG. 7.
  • Optionally, the following operations may also be included: normalizing the spatial domain classification result of the video with the Softmax function to obtain the spatial domain classification probability vector of the video belonging to each category, and normalizing the time domain classification result of the video with the Softmax function to obtain the time domain classification probability vector of the video belonging to each category.
  • This operation may be performed by a processor invoking instructions stored in a memory, or by a second normalization processing unit run by the processor.
  • The spatial domain classification result and the time domain classification result in FIG. 6 and FIG. 7 may be the unnormalized classification results or the normalized classification probability vectors.
  • The time domain convolutional neural network may be the first time domain convolutional neural network or the second time domain convolutional neural network, or both the first time domain convolutional neural network and the second time domain convolutional neural network may be included.
  • The present disclosure also provides a data processing apparatus including the video category recognition apparatus of the present disclosure.
  • The data processing apparatus provided by the above embodiment of the present disclosure is provided with the video category recognition apparatus of the above embodiment, which divides the video into two or more segmented videos and samples the frame picture and the inter-frame optical flow separately for each segmented video; long-term motion can thus be modeled, so that when the network model obtained by training is used to recognize the video category, the accuracy of video category recognition is improved compared with the prior art, the recognition effect is improved, and the computational cost is small.
  • The data processing apparatus of the embodiments of the present disclosure may be any device having a data processing function, which may include, but is not limited to, an ARM (Advanced RISC Machine) processor, a Central Processing Unit (CPU), or a Graphics Processing Unit (GPU).
  • The present disclosure also provides an electronic device, such as a mobile terminal, a personal computer (PC), a tablet computer, a server, etc., which is provided with the data processing apparatus of the present disclosure.
  • The electronic device provided by the above embodiment of the present disclosure is provided with the data processing apparatus of the above embodiment, which divides the video into two or more segmented videos and samples the frame picture and the inter-frame optical flow separately for each segmented video; long-term motion can thus be modeled when the convolutional neural network is trained, so that when the network model obtained by the training is used to recognize the video category, it helps improve the accuracy of video category recognition and the recognition effect, at a small computational cost.
  • An electronic device for implementing an embodiment of the present disclosure includes a central processing unit (CPU), which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage portion into a random access memory (RAM).
  • The central processing unit can communicate with the read-only memory and/or the random access memory to execute the executable instructions so as to perform operations corresponding to the video category recognition method provided by the present disclosure, for example: segmenting the video to obtain two or more segmented videos; sampling each of the two or more segmented videos to obtain the original image and the optical flow image of each segmented video; processing the original image of each segmented video with a spatial convolutional neural network to obtain the spatial domain classification result of each segmented video; processing the optical flow image of each segmented video with a time domain convolutional neural network to obtain the time domain classification result of each segmented video; and fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
  • The CPU, ROM, and RAM are connected to each other through a bus.
  • An input/output (I/O) interface is also connected to the bus.
  • The following components are connected to the I/O interface: an input portion including a keyboard, a mouse, and the like; an output portion including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion including a hard disk and the like; and a communication portion including a network interface card such as a LAN card, a modem, and the like.
  • The communication portion performs communication processing via a network such as the Internet.
  • A drive is also connected to the I/O interface as needed.
  • A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read from it can be installed into the storage portion as needed.
  • The processes described above with reference to the flowcharts can be implemented as computer software programs.
  • The technical solution of the present disclosure includes a computer program product, which may include a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for executing the method illustrated in the flowchart, and the program code may include executable instructions corresponding to the steps of any video classification method provided by the present disclosure, for example: executable instructions for segmenting a video to obtain two or more segmented videos; executable instructions for sampling each of the two or more segmented videos to obtain the original image and the optical flow image of each segmented video; executable instructions for processing the original image of each segmented video with a spatial convolutional neural network to obtain the spatial domain preliminary classification result of each segmented video; executable instructions for processing the optical flow image of each segmented video with a time domain convolutional neural network to obtain the time domain preliminary classification result of each segmented video; executable instructions for comprehensively processing the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; executable instructions for comprehensively processing the time domain preliminary classification results of the segmented videos to obtain the time domain classification result of the video; and executable instructions for fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
  • the computer program can be downloaded and installed from the network via the communication portion, and/or installed from a removable medium.
  • the functions defined in the method of the present disclosure are performed when the computer program is executed by a central processing unit (CPU).
  • Embodiments of the present disclosure further provide a computer storage medium for storing computer-readable instructions, the instructions including: an executable instruction for segmenting a video to obtain two or more segmented videos; an executable instruction for sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; an executable instruction for processing the original image of each segmented video by using a spatial convolutional neural network respectively, to obtain a spatial domain preliminary classification result of each segmented video; an executable instruction for processing the optical flow image of each segmented video by using a temporal convolutional neural network respectively, to obtain a time domain preliminary classification result of each segmented video; an executable instruction for comprehensively processing the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; an executable instruction for comprehensively processing the time domain preliminary classification results of the segmented videos to obtain the time domain classification result of the video; and an executable instruction for fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
  • the present disclosure also provides a computer device comprising: a memory storing executable instructions; and one or more processors communicating with the memory to execute the executable instructions, so as to perform operations corresponding to the video category identification method of any of the above examples of the present disclosure.
  • each example in the present application is described in a progressive manner; the description of each example focuses on its differences from the other examples, and for the same or similar parts among the examples, reference may be made to one another. For the apparatus and system examples, since they substantially correspond to the method examples, the description is relatively simple; for relevant parts, reference may be made to the description of the method examples.
  • the methods, apparatuses, and devices of the present disclosure may be implemented in many ways, for example in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present disclosure are not limited to the order described above unless otherwise specifically stated.
  • the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine readable instructions for implementing a method in accordance with the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video category identification method and apparatus, a data processing apparatus, and an electronic device. The method includes: segmenting a video to obtain two or more segmented videos; sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; processing the original image of each segmented video by using a spatial convolutional neural network to obtain a spatial domain classification result of the video; processing the optical flow image of each segmented video by using a temporal convolutional neural network to obtain a time domain classification result of the video; and fusing the spatial domain classification result and the time domain classification result to obtain a classification result of the video.

Description

Video category identification method and apparatus, data processing apparatus, and electronic device
The present disclosure claims priority to Chinese Patent Application No. 201610619654.1, entitled "Video category identification method and apparatus, data processing apparatus and electronic device", filed with the Chinese Patent Office on July 29, 2016, the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure belongs to the field of computer vision technology, and in particular relates to a video category identification method and apparatus, a data processing apparatus, and an electronic device.
Background
Action recognition is a popular direction in computer vision research. Action recognition technology mainly identifies the actions in a video by processing the video, which consists of a sequence of color pictures. The difficulty of action recognition technology lies in how to process dynamically changing video content so as to overcome changes in distance and viewing angle, camera movement, and scene changes, and thereby correctly recognize the actions in the video.
Summary
The present disclosure provides a video category identification technical solution.
According to one aspect of the present disclosure, a video category identification method is provided, including: segmenting a video to obtain two or more segmented videos; sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; processing the original image of each segmented video by using a spatial convolutional neural network to obtain a spatial domain classification result of the video; processing the optical flow image of each segmented video by using a temporal convolutional neural network to obtain a time domain classification result of the video; and fusing the spatial domain classification result and the time domain classification result to obtain a classification result of the video.
According to another aspect of the present disclosure, a video category identification apparatus is provided, including: a segmentation unit configured to segment a video to obtain two or more segmented videos; a sampling unit configured to sample each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; a spatial domain classification processing unit configured to process the original image of each segmented video by using a spatial convolutional neural network to obtain a spatial domain classification result of the video; a time domain classification processing unit configured to process the optical flow image of each segmented video by using a temporal convolutional neural network respectively, to obtain a time domain classification result of each segmented video; and a fusion unit configured to fuse the spatial domain classification result and the time domain classification result to obtain a classification result of the video.
According to yet another aspect of the present disclosure, a data processing apparatus is provided, including the above video category identification apparatus.
According to still another aspect of the present disclosure, an electronic device is provided, which is equipped with the above data processing apparatus.
According to still another aspect of the present disclosure, a computer storage medium is provided for storing computer-readable instructions, the instructions including: an instruction for segmenting a video to obtain two or more segmented videos; an instruction for sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; an instruction for processing the original image of each segmented video by using a spatial convolutional neural network to obtain a spatial domain classification result of the video; an instruction for processing the optical flow image of each segmented video by using a temporal convolutional neural network to obtain a time domain classification result of the video; and an instruction for fusing the spatial domain classification result and the time domain classification result to obtain a classification result of the video.
According to still another aspect of the present disclosure, a computer device is provided, including: a memory storing executable instructions; and one or more processors communicating with the memory to execute the executable instructions so as to perform operations corresponding to the above video category identification method of the present disclosure.
Based on the video category identification method and apparatus, data processing apparatus, and electronic device provided by the present disclosure, a video is segmented to obtain two or more segmented videos; each segmented video in the two or more segmented videos is sampled to obtain an original image and an optical flow image of each segmented video; the original image of each segmented video is then processed by a spatial convolutional neural network to obtain the spatial domain classification result of the video, and the optical flow image of each segmented video may be processed by a temporal convolutional neural network to obtain the time domain classification result of the video; finally, the spatial domain classification result and the time domain classification result are fused to obtain the classification result of the video. By dividing the video into two or more segmented videos and sampling frame pictures and inter-frame optical flow from each segmented video, the present disclosure enables modeling of long-duration actions when training the convolutional neural networks, so that when the trained network models are subsequently used to identify video categories, the accuracy of video category identification is improved, the identification effect is enhanced, and the computational cost is low.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of an application scenario of the present disclosure.
FIG. 2 is a flowchart of one embodiment of the video category identification method of the present disclosure.
FIG. 3 is a flowchart of another embodiment of the video category identification method of the present disclosure.
FIG. 4 is a flowchart of yet another embodiment of the video category identification method of the present disclosure.
FIG. 5 is a flowchart of still another embodiment of the video category identification method of the present disclosure.
FIG. 6 is a flowchart of one embodiment of training an initial spatial convolutional neural network in the present disclosure.
FIG. 7 is a flowchart of one embodiment of training an initial temporal convolutional neural network in the present disclosure.
FIG. 8 is a schematic structural diagram of one embodiment of the video category identification apparatus of the present disclosure.
FIG. 9 is a schematic structural diagram of another embodiment of the video category identification apparatus of the present disclosure.
FIG. 10 is a schematic structural diagram of yet another embodiment of the video category identification apparatus of the present disclosure.
FIG. 11 is a schematic structural diagram of yet another embodiment of the video category identification apparatus of the present disclosure.
FIG. 12 is a schematic structural diagram of still another embodiment of the video category identification apparatus of the present disclosure.
FIG. 13 is a schematic diagram of an application example of the video category identification apparatus of the present disclosure.
FIG. 14 is a schematic structural diagram of one embodiment of the electronic device of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure. It should also be understood that, for ease of description, the sizes of the parts shown in the drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or uses. Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
The technical solutions provided by the present disclosure can be applied to a computer system/server, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, and the like.
The computer system/server can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
Among deep-learning-based action recognition technologies, the two-stream convolutional neural network (Two-Stream Convolutional Neural Network) is a representative network model. The two-stream convolutional neural network uses two convolutional neural networks, namely a spatial convolutional neural network and a temporal convolutional neural network, to model frame pictures and inter-frame optical flow respectively, and recognizes the actions in a video by fusing the classification results of the two convolutional neural networks.
However, although the two-stream convolutional neural network can simultaneously model frame pictures and inter-frame optical flow, i.e., short-duration action information, it lacks the ability to model long-duration actions, which means the accuracy of action recognition cannot be guaranteed.
FIG. 1 schematically shows an application scenario in which the video category identification technical solution provided by the present disclosure can be implemented.
In FIG. 1, at least one electronic device (such as one or more of electronic device A1, electronic device A2, ..., and electronic device Am on the terminal side) is an electronic device with Internet access capability. Videos are stored in one or more of electronic device A1, electronic device A2, ..., and electronic device Am. A video stored in an electronic device may be a video shot by the user with the electronic device, a video stored in the electronic device through data transmission between electronic devices, or a video downloaded by the user from the network with the electronic device, and so on. A user can upload or send the videos stored in his or her electronic device via the Internet to a corresponding server or to other electronic devices on the terminal side, and both the server and the electronic devices on the terminal side can classify, store, and manage the videos they obtain. The above server may be formed by a single electronic device such as a server on the service side, or by multiple electronic devices such as servers. The present disclosure does not limit the specific form of the electronic devices on the server side or the terminal side.
The technical solutions provided by the present disclosure enable an electronic device on the server or terminal side to automatically analyze the content of each video it obtains and identify the category to which each video belongs, so that the electronic device on the server or terminal side can automatically sort the videos it obtains into a video collection of a first category, a video collection of a second category, ..., or a video collection of a z-th category according to the categories to which they belong. By automatically sorting videos into video collections of the corresponding categories, the present disclosure facilitates classified video management for electronic devices on the server or terminal side.
However, those skilled in the art can understand that the present disclosure is also applicable to other application scenarios; that is, the scenarios to which the present disclosure applies are not limited by the examples given above. For example, the present disclosure may be executed in an electronic device that is not connected to the Internet (such as a processor in the electronic device), or in an electronic device (such as a processor of the electronic device) engaged in point-to-point communication that does not use a terminal-server structure, and so on.
The video category identification technical solutions provided by the present disclosure are described below with reference to FIG. 2 to FIG. 14.
In FIG. 2, at 102, a video is segmented to obtain two or more segmented videos.
As an optional example, step 102 may be executed by a processor invoking instructions stored in a memory, or may be executed by a segmentation unit run by the processor.
As an optional example, when segmenting the video, the segmentation unit may segment the video evenly to obtain two or more segmented videos of the same length. For example, the segmentation unit evenly divides the video into 3 or 5 segmented videos of the same length, with the number of segments determined according to the actual effect. In addition, the segmentation unit may also segment the video randomly, or extract several segments from the video as the segmented videos.
In an optional example, after receiving the video, the segmentation unit may obtain the length of the video, determine the length of each segment according to the video length and a preset number of segments, and accordingly divide the received video evenly into two or more segmented videos of the same length.
When the segmentation unit segments the video evenly, the resulting segmented videos have the same length. This simplifies the training process of the network model when the processor (for example, a network training unit run by the processor) trains the convolutional neural network based on long videos; and when the trained convolutional neural network is used for video category identification, since the time required to process each segmented video is similar, the overall efficiency of video category identification is improved.
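As a purely illustrative sketch (not the wording of the original disclosure), the even segmentation and per-segment random frame sampling described above might look as follows in Python; the function names and the choice of 3 segments are assumptions made for this example:

```python
import random

def split_into_segments(num_frames, num_segments=3):
    """Evenly divide frame indices [0, num_frames) into equal-length segments.
    Any remainder frames at the end of the video are simply dropped."""
    seg_len = num_frames // num_segments
    return [(i * seg_len, (i + 1) * seg_len) for i in range(num_segments)]

def sample_frame_index(segment):
    """Randomly pick one frame index inside a segment (the 'original image')."""
    start, end = segment
    return random.randrange(start, end)

segments = split_into_segments(num_frames=300, num_segments=3)
rgb_indices = [sample_frame_index(seg) for seg in segments]
print(segments, rgb_indices)
```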
At 104, each segmented video in the two or more segmented videos is sampled to obtain an original image and an optical flow image of each segmented video.
As an optional example, step 104 may be executed by a processor invoking instructions stored in a memory, or may be executed by a sampling unit run by the processor.
Illustratively, when obtaining the original image of each segmented video, an image sampling module in the sampling unit may randomly extract one frame of image from each segmented video as the original image of that segmented video.
Illustratively, when obtaining the optical flow image of each segmented video, an optical flow sampling module in the sampling unit may randomly extract multiple consecutive frames of images from each segmented video to obtain the optical flow image of that segmented video.
In an optional implementation, the optical flow image may be a grayscale image based on an 8-bit bitmap with 256 discrete levels, and the median value of the grayscale image is 128.
Since the optical flow field is a vector field, when a grayscale image is used to represent an optical flow image, two scalar-field pictures are needed, i.e., two scalar-field pictures corresponding to the magnitudes in the X direction and the Y direction of the optical flow image coordinate axes, respectively.
Optionally, the optical flow sampling module randomly extracting multiple consecutive frames of images from each segmented video to obtain the optical flow image of each segmented video may be implemented as follows, for each segmented video respectively:
the optical flow sampling module randomly extracts N consecutive frames of images from each segmented video, where N is an integer greater than 1; and
the optical flow sampling module performs computation based on each pair of adjacent frames among the N frames of images, obtaining N-1 groups of optical flow images, where each group of the N-1 groups of optical flow images includes one horizontal optical flow image and one vertical optical flow image.
For example, for each segmented video respectively: the optical flow sampling module randomly extracts 6 consecutive frames from each segmented video; the optical flow sampling module performs computation based on each pair of adjacent frames among the 6 frames, and obtains 5 groups of optical flow grayscale images, where each of the 5 groups includes one horizontal optical flow grayscale image and one vertical optical flow grayscale image. That is, the optical flow sampling module obtains 10 optical flow grayscale image frames, and these 10 frames can be treated as one 10-channel image.
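The 6-frame example above can be sketched as follows, assuming OpenCV's Farneback algorithm as a stand-in for the unspecified optical flow computation; the clipping bound of 20 pixels is likewise an assumption, chosen only to make the 8-bit grayscale encoding centered at 128 concrete:

```python
import cv2
import numpy as np

def flow_to_gray(channel, bound=20.0):
    """Discretize one flow component to an 8-bit grayscale image centered at 128."""
    scaled = 128.0 + 128.0 * channel / bound
    return np.clip(scaled, 0, 255).astype(np.uint8)

def stack_flow(gray_frames):
    """6 consecutive grayscale frames -> 5 (x, y) flow pairs -> one 10-channel image."""
    channels = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)  # H x W x 2 float flow
        channels.append(flow_to_gray(flow[..., 0]))     # horizontal component
        channels.append(flow_to_gray(flow[..., 1]))     # vertical component
    return np.stack(channels, axis=-1)                  # H x W x 10

frames = [np.random.randint(0, 256, (120, 160), np.uint8) for _ in range(6)]
print(stack_flow(frames).shape)  # (120, 160, 10)
```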
At 106, the original image of each segmented video is processed by using the spatial convolutional neural network to obtain the spatial domain classification result of the video; and the optical flow image of each segmented video is processed by using the temporal convolutional neural network to obtain the time domain classification result of the video.
As an optional example, step 106 may be executed by a processor invoking instructions stored in a memory, or may be executed by a spatial domain classification processing unit and a time domain classification processing unit run by the processor. For example, the spatial domain classification processing unit processes the original image of each segmented video with the spatial convolutional neural network to obtain the spatial domain classification result of the video, while the time domain classification processing unit processes the optical flow image of each segmented video with the temporal convolutional neural network to obtain the time domain classification result of the video.
Here, the spatial domain classification result and the time domain classification result of the video are each a classification result vector whose dimension equals the number of classification categories. For example, if the classification results include six categories in total — running, high jump, race walking, pole vault, long jump, and triple jump — then the spatial domain classification result and the time domain classification result are each a classification result vector of dimension 6.
At 108, the spatial domain classification result and the time domain classification result are fused to obtain the classification result of the video.
As an optional example, step 108 may be executed by a processor invoking instructions stored in a memory, or may be executed by a fusion unit run by the processor.
Here, the classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, if the classification results include six categories in total — running, high jump, race walking, pole vault, long jump, and triple jump — then the classification result of the video is a classification result vector of dimension 6.
As an optional example, the fusion unit may fuse the spatial domain classification result and the time domain classification result as follows: the fusion unit multiplies the spatial domain classification result and the time domain classification result by preset weight coefficients respectively and then sums them, to obtain the classification result of the video. The weight coefficients are determined by the fusion unit according to the classification accuracy of the corresponding convolutional neural network models on a validation dataset; a network model with higher classification accuracy receives a higher weight. The validation dataset consists of videos annotated with ground-truth categories that did not participate in network training. The validation dataset can be obtained in any feasible way, for example, by searching a search engine for videos of the corresponding categories.
In an optional application, the weight coefficient ratio between the spatial domain classification result and the time domain classification result may be any ratio between 1:1 and 1:3; in one optional implementation, the ratio may be 1:1.5.
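A minimal sketch of the weighted fusion with the optional 1:1.5 ratio and the six example categories (illustrative only; the variable names and toy scores are assumptions):

```python
import numpy as np

CATEGORIES = ["running", "high jump", "race walking",
              "pole vault", "long jump", "triple jump"]

def fuse(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
    """Weighted sum of the two classification result vectors."""
    return w_spatial * np.asarray(spatial_scores) + w_temporal * np.asarray(temporal_scores)

spatial = np.array([0.1, 2.3, 0.4, 0.8, 1.1, 0.2])   # toy spatial-domain scores
temporal = np.array([0.3, 1.9, 0.2, 0.5, 0.9, 0.1])  # toy time-domain scores
fused = fuse(spatial, temporal)
print(CATEGORIES[int(np.argmax(fused))])  # the highest-scoring category wins
```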
Based on the video category identification method provided by the present disclosure, a video is segmented to obtain two or more segmented videos; each segmented video in the two or more segmented videos is sampled to obtain an original image and an optical flow image of each segmented video; the original image of each segmented video is processed with the spatial convolutional neural network to obtain the spatial domain classification result of the video; the optical flow image of each segmented video is processed with the temporal convolutional neural network to obtain the time domain classification result of the video; finally, the spatial domain classification result and the time domain classification result are fused to obtain the classification result of the video. By dividing the video into segmented videos and sampling frame pictures and inter-frame optical flow from each segmented video, the present disclosure enables modeling of long-duration actions when training the convolutional neural networks, which helps to improve the accuracy and effect of video category identification with the subsequently trained network models, at a low computational cost.
In FIG. 3, at 202, the video is segmented to obtain two or more segmented videos.
As an optional example, step 202 may be executed by a processor invoking instructions stored in a memory, or may be executed by a segmentation unit run by the processor.
As an optional example, when segmenting the video, the segmentation unit may segment the video evenly to obtain two or more segmented videos of the same length, so as to simplify the training process of the convolutional neural network model and improve the overall efficiency of video category identification. For example, the segmentation unit evenly divides the video into 3 or 5 segmented videos of the same length, with the number of segments determined according to the actual effect.
In addition, the segmentation unit may also segment the video randomly, or extract several segments from the video as the segmented videos. As shown in FIG. 13, in one application embodiment of the video category identification method of the present disclosure, the segmentation unit evenly divides the video into 3 segmented videos.
At 204, each segmented video in the two or more segmented videos is sampled to obtain an original image and an optical flow image of each segmented video.
As an optional example, step 204 may be executed by a processor invoking instructions stored in a memory, or may be executed by a sampling unit run by the processor.
For example, an image sampling module in the sampling unit may randomly extract one frame of image from each segmented video as the original image of that segmented video; an optical flow sampling module in the sampling unit may randomly extract multiple consecutive frames of images from each segmented video to obtain the optical flow image of that segmented video.
As shown in FIG. 13, in one application embodiment of the video category identification method of the present disclosure, the sampling unit samples the 3 segmented videos respectively, obtaining one original image frame and inter-frame optical flow images for each of the 3 segmented videos. In one optional implementation, the original image may be an RGB color image, and the optical flow image may be a grayscale image.
At 206, the original image of each segmented video is processed by using the spatial convolutional neural network respectively, to obtain a spatial domain preliminary classification result of each segmented video; and the optical flow image of each segmented video is processed by using the temporal convolutional neural network respectively, to obtain a time domain preliminary classification result of each segmented video.
As an optional example, step 206 may be executed by a processor invoking instructions stored in a memory, or may be executed by a spatial domain classification processing module and a first time domain classification processing module run by the processor. For example, the spatial domain classification processing module processes the original image of each segmented video with the spatial convolutional neural network respectively to obtain the spatial domain preliminary classification result of each segmented video, while the first time domain classification processing module processes the optical flow image of each segmented video with the temporal convolutional neural network respectively to obtain the time domain preliminary classification result of each segmented video.
Here, the spatial domain preliminary classification result and the time domain preliminary classification result are each a classification result vector whose dimension equals the number of classification categories. For example, if the classification results include six categories in total — running, high jump, race walking, pole vault, long jump, and triple jump — then the spatial domain preliminary classification result and the time domain preliminary classification result are each a classification result vector of dimension 6.
As shown in FIG. 13, in an optional example of the video category identification technology of the present disclosure, the spatial domain classification processing module processes the original images of the 3 segmented videos with the spatial convolutional neural network, obtaining 3 spatial domain preliminary classification results for the 3 segmented videos; the first time domain classification processing module processes the optical flow images of the 3 segmented videos with the temporal convolutional neural network, obtaining 3 time domain preliminary classification results for the 3 segmented videos. The spatial convolutional neural network and/or the temporal convolutional neural network may first obtain a feature representation of the image through a combination of convolutional layers, nonlinear layers, pooling layers, and the like, and then obtain the score for each category, i.e., the preliminary classification result of each segmented video, through a linear classification layer. For example, if the classification results include six categories in total — running, high jump, race walking, pole vault, long jump, and triple jump — then the spatial domain preliminary classification result and the time domain preliminary classification result of each segmented video are each a 6-dimensional vector containing the classification scores of the video for these 6 categories.
At 208, a spatial domain consensus function is used to comprehensively process the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; and a time domain consensus function is used to comprehensively process the time domain preliminary classification results of the segmented videos to obtain the time domain classification result of the video.
As an optional example, step 208 may be executed by a processor invoking instructions stored in a memory, or may be executed by a first comprehensive processing module and a second comprehensive processing module run by the processor. For example, the first comprehensive processing module may use the spatial domain consensus function to comprehensively process the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video, while the second comprehensive processing module may use the time domain consensus function to comprehensively process the time domain preliminary classification results of the segmented videos to obtain the time domain classification result of the video.
Here, the spatial domain classification result and the time domain classification result of the video may each be a classification result vector whose dimension equals the number of classification categories.
In an optional example, the spatial domain consensus function and/or the time domain consensus function include: an average function, a maximum function, or a weighted average function. The present disclosure may select, as the spatial domain consensus function, the average function, maximum function, or weighted average function with the highest classification accuracy on the validation dataset; likewise, it may select, as the time domain consensus function, the average function, maximum function, or weighted average function with the highest classification accuracy on the validation dataset.
In an optional example, the average function takes the average of the category scores of the same category across the different segmented videos as the output category score for that category; the maximum function selects the maximum among the category scores of the same category across the different segmented videos as the output category score; the weighted average function takes a weighted average of the category scores of the same category across the different segmented videos as the output category score for that category, where all categories share one set of weights, and this set of weights is obtained through optimization as network model parameters during training.
For example, in the application embodiment shown in FIG. 13, the processor may select the average function as both the spatial domain consensus function and the time domain consensus function. The first comprehensive processing module uses the spatial domain consensus function to compute, for each category, the average of the 3 scores of that category among the 3 spatial domain preliminary classification results of the 3 segmented videos, as the category score of that category; this yields a set of category scores over all categories as the spatial domain classification result of the video. The second comprehensive processing module uses the time domain consensus function to compute, for each category, the average of the 3 scores of that category among the 3 time domain preliminary classification results of the 3 segmented videos, as the category score of that category; this yields a set of category scores over all categories as the time domain classification result of the video. For example, if the classification results include six categories in total — running, high jump, race walking, pole vault, long jump, and triple jump — then the spatial domain classification result and the time domain classification result of the video are each a 6-dimensional vector containing the category scores of the video for these 6 categories.
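For illustration, the three candidate consensus functions can be written down directly; the array shapes and names below are assumptions, with each row holding one segment's preliminary score vector:

```python
import numpy as np

def average_consensus(segment_scores):
    """Mean of each category's scores across segments."""
    return np.mean(segment_scores, axis=0)

def max_consensus(segment_scores):
    """Maximum of each category's scores across segments."""
    return np.max(segment_scores, axis=0)

def weighted_average_consensus(segment_scores, weights):
    """Weighted mean; one weight per segment, shared by all categories
    (per the text, these weights would be learned as model parameters)."""
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * segment_scores).sum(axis=0) / w.sum()

scores = np.array([[0.1, 2.0, 0.3],    # segment 1, 3 toy categories
                   [0.2, 1.6, 0.5],    # segment 2
                   [0.0, 2.4, 0.1]])   # segment 3
print(average_consensus(scores), max_consensus(scores))
```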
At 210, the spatial domain classification result and the time domain classification result are fused to obtain the classification result of the video.
As an optional example, step 210 may be executed by a processor invoking instructions stored in a memory, or may be executed by a fusion unit run by the processor.
Here, the classification result of the video is a classification result vector whose dimension equals the number of classification categories.
As shown in FIG. 13, in one application embodiment of the video category identification method of the present disclosure, the fusion unit multiplies the spatial domain classification result and the time domain classification result of the video by weight coefficients in the ratio 1:1.5 respectively and then sums them, obtaining the classification result of the video. For example, if the classification results include six categories in total — running, high jump, race walking, pole vault, long jump, and triple jump — then the classification result of the video is a 6-dimensional vector containing the classification scores of the video for these 6 categories. The category with the highest score is the category to which the video belongs; in this embodiment the highest-scoring category is high jump, so the video is identified as belonging to the high jump category.
Based on the video category identification technical solution provided by the present disclosure, a consensus function is used across the segmented videos to synthesize the preliminary classification results of the segmented videos into the classification result of the video. Since the consensus function places no restriction on the convolutional neural network model used for each segmented video, the different segmented videos can share the parameters of the network model, so the network model has fewer parameters, and a network model with fewer parameters can identify the category of a video of arbitrary length. During training, by segmenting videos of arbitrary length and performing segment-wise network training, and by comparing the classification result of the whole video with the ground-truth label for supervised learning, training supervision at the whole-video level can be achieved without being limited by video length.
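The parameter sharing across segments described above can be made concrete with a toy sketch; a single weight matrix stands in for the shared convolutional network, and the names and sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=(6, 32))        # ONE parameter set shared by all segments

def segment_scores(features):
    """Apply the same (shared-parameter) model to every segment's feature vector."""
    return np.stack([W @ f for f in features])  # K x 6 preliminary results

def video_scores(features):
    """Segmental consensus (average) turns K per-segment results into one video result."""
    return segment_scores(features).mean(axis=0)

features = [rng.normal(size=32) for _ in range(3)]  # 3 segments, any video length
print(video_scores(features))  # video-level supervision/loss applies to THIS vector
```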
In FIG. 4, at 302, the video is segmented to obtain two or more segmented videos. As an optional example, step 302 may be executed by a processor invoking instructions stored in a memory, or may be executed by a segmentation unit run by the processor.
At 304, each segmented video in the two or more segmented videos is sampled to obtain an original image and an original optical flow image of each segmented video. As an optional example, step 304 may be executed by a processor invoking instructions stored in a memory, or may be executed by a sampling unit run by the processor; for example, an image sampling module in the sampling unit obtains the original image of each segmented video, and an optical flow sampling module obtains the original optical flow image of each segmented video.
At 306, a warped optical flow image obtained by warping the original optical flow image is acquired. As an optional example, step 306 may be executed by a processor invoking instructions stored in a memory, or may be executed by an optical flow processing unit run by the processor.
In an optional example, the optical flow processing unit acquiring the warped optical flow image from the original optical flow image includes: the optical flow processing unit performs computation on each pair of adjacent frames respectively, obtaining a homography transformation matrix between each pair of adjacent frames; the optical flow processing unit performs an affine transformation on the latter frame of each pair of adjacent frames according to the homography transformation matrix between them; and the optical flow processing unit performs computation on the former frame of each pair of adjacent frames and the affine-transformed latter frame respectively, obtaining the warped optical flow image.
Since no homography transformation remains between the feature points on the affine-transformed latter frame and the corresponding feature points on the former frame serving as the reference, the warped optical flow image computed from the former frame and the affine-transformed latter frame, when used as input information for video category identification, helps to reduce the influence of camera movement on the identification effect.
In an optional example, the optical flow processing unit performing computation on each pair of adjacent frames includes: the optical flow processing unit performs inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.
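An illustrative sketch of the homography estimation and frame warping follows; ORB features stand in here for the SURF descriptors named in the text (SURF is only available in opencv-contrib builds), and the matcher settings are assumptions:

```python
import cv2
import numpy as np

def warp_next_frame(prev_gray, next_gray):
    """Estimate a homography between adjacent frames from matched keypoints,
    then warp the latter frame onto the former to cancel camera motion.
    Requires textured frames with at least 4 reliable matches."""
    detector = cv2.ORB_create(1000)
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(next_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # maps next -> prev
    h, w = prev_gray.shape
    return cv2.warpPerspective(next_gray, H, (w, h))

# The warped optical flow would then be computed between prev_gray and the
# returned frame, exactly as for the ordinary inter-frame flow.
```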
At 308, the original image of each segmented video is processed with the spatial convolutional neural network respectively, to obtain the spatial domain preliminary classification result of each segmented video; the original optical flow image of each segmented video is processed with a first temporal convolutional neural network respectively, to obtain a first time domain preliminary classification result of each segmented video; and the warped optical flow image of each segmented video is processed with a second temporal convolutional neural network respectively, to obtain a second time domain preliminary classification result of each segmented video.
As an optional example, step 308 may be executed by a processor invoking instructions stored in a memory, or may be executed by a spatial domain classification processing module, a first time domain classification processing module, and a second time domain classification processing module run by the processor. For example, the spatial domain classification processing module processes the original image of each segmented video with the spatial convolutional neural network respectively to obtain the spatial domain preliminary classification result of each segmented video; the first time domain classification processing module processes the original optical flow image of each segmented video with the first temporal convolutional neural network respectively to obtain the first time domain preliminary classification result of each segmented video; and the second time domain classification processing module processes the warped optical flow image of each segmented video with the second temporal convolutional neural network respectively to obtain the second time domain preliminary classification result of each segmented video.
At 310, the spatial domain consensus function is used to comprehensively process the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; a first time domain consensus function is used to comprehensively process the first time domain preliminary classification results of the segmented videos to obtain a first time domain classification result of the video; and a second time domain consensus function is used to comprehensively process the second time domain preliminary classification results of the segmented videos to obtain a second time domain classification result of the video.
As an optional example, step 310 may be executed by a processor invoking instructions stored in a memory, or may be executed by a first comprehensive processing module, a second comprehensive processing module, and a third comprehensive processing module run by the processor. For example, the first comprehensive processing module uses the spatial domain consensus function to comprehensively process the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; the second comprehensive processing module uses the first time domain consensus function to comprehensively process the first time domain preliminary classification results of the segmented videos to obtain the first time domain classification result of the video; and the third comprehensive processing module uses the second time domain consensus function to comprehensively process the second time domain preliminary classification results of the segmented videos to obtain the second time domain classification result of the video.
At 312, the spatial domain classification result, the first time domain classification result, and the second time domain classification result are fused to obtain the classification result of the video. As an optional example, step 312 may be executed by a processor invoking instructions stored in a memory, or may be executed by a fusion unit run by the processor.
As an optional example, the fusion unit fusing the spatial domain classification result, the first time domain classification result, and the second time domain classification result includes: the fusion unit multiplies the spatial domain classification result, the first time domain classification result, and the second time domain classification result by preset weight coefficients respectively and then sums them, to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on the validation dataset; a network model with higher classification accuracy receives a higher weight.
For example, in an optional application, the weight coefficient ratio among the spatial domain classification result, the first time domain classification result, and the second time domain classification result may be 1:a:b, where the sum of a and b is not less than 1 and not greater than 3; in one optional implementation, the ratio may be, for example, 1:1:0.5.
Since the currently widely used two-stream convolutional neural network uses short-duration motion information to represent optical flow images and does not take camera movement into account when extracting optical flow images, it may fail to recognize the actions in a video when the camera moves substantially, which degrades the recognition effect.
Based on the video category identification technology provided by the present disclosure, in addition to frame pictures and inter-frame optical flow, warped optical flow is used as an additional short-duration motion representation, extending the input of video category identification to three kinds of information: frame pictures, inter-frame optical flow, and warped optical flow. Since the warped optical flow removes the influence of camera movement, it helps to reduce the impact of camera movement on the video category identification effect. During training, the same three kinds of input information — frame pictures, inter-frame optical flow, and warped optical flow — are used to train the network models, which helps to reduce the influence of camera movement on the network models and thereby improves the robustness of the video category identification technology to camera movement.
In FIG. 5, at 3020, the video is segmented to obtain two or more segmented videos. As an optional example, step 3020 may be executed by a processor invoking instructions stored in a memory, or may be executed by a segmentation unit run by the processor.
At 3040, each segmented video in the two or more segmented videos is sampled to obtain an original image and an original optical flow image of each segmented video. As an optional example, step 3040 may be executed by a processor invoking instructions stored in a memory, or may be executed by a sampling unit run by the processor; for example, an image sampling module in the sampling unit obtains the original image of each segmented video, and an optical flow sampling module obtains the original optical flow image of each segmented video.
At 3060, the warped optical flow image obtained by warping the original optical flow image is acquired. As an optional example, step 3060 may be executed by a processor invoking instructions stored in a memory, or may be executed by an optical flow processing unit run by the processor.
In an optional example, the optical flow processing unit acquiring the warped optical flow image from the original optical flow image includes: the optical flow processing unit performs computation on each pair of adjacent frames respectively, obtaining a homography transformation matrix between each pair of adjacent frames; performs an affine transformation on the latter frame of each pair of adjacent frames according to the homography transformation matrix between them; and performs computation on the former frame of each pair of adjacent frames and the affine-transformed latter frame respectively, obtaining the warped optical flow image. The optical flow processing unit performing computation on each pair of adjacent frames includes: the optical flow processing unit performs inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.
At 3080, the original image of each segmented video is processed with the spatial convolutional neural network respectively, to obtain the spatial domain preliminary classification result of each segmented video; and the warped optical flow image of each segmented video is processed with the second temporal convolutional neural network respectively, to obtain the second time domain preliminary classification result of each segmented video.
As an optional example, step 3080 may be executed by a processor invoking instructions stored in a memory, or may be executed by a spatial domain classification processing module and a second time domain classification processing module run by the processor. For example, the spatial domain classification processing module processes the original image of each segmented video with the spatial convolutional neural network respectively to obtain the spatial domain preliminary classification result of each segmented video, and the second time domain classification processing module processes the warped optical flow image of each segmented video with the second temporal convolutional neural network respectively to obtain the second time domain preliminary classification result of each segmented video.
At 3100, the spatial domain consensus function is used to comprehensively process the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; and the second time domain consensus function is used to comprehensively process the second time domain preliminary classification results of the segmented videos to obtain the second time domain classification result of the video.
As an optional example, step 3100 may be executed by a processor invoking instructions stored in a memory, or may be executed by a first comprehensive processing module and a third comprehensive processing module run by the processor. For example, the first comprehensive processing module uses the spatial domain consensus function to comprehensively process the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video, and the third comprehensive processing module uses the second time domain consensus function to comprehensively process the second time domain preliminary classification results of the segmented videos to obtain the second time domain classification result of the video.
At 3120, the spatial domain classification result and the second time domain classification result are fused to obtain the classification result of the video.
As an optional example, step 3120 may be executed by a processor invoking instructions stored in a memory, or may be executed by a fusion unit run by the processor.
As an optional example, the fusion unit fusing the spatial domain classification result and the second time domain classification result includes: the fusion unit multiplies the spatial domain classification result and the second time domain classification result by preset weight coefficients respectively and then sums them, to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on the validation dataset; a network model with higher classification accuracy receives a higher weight.
In an optional example, the weight coefficient ratio between the spatial domain classification result and the second time domain classification result may be any ratio between 1:1 and 1:3; in one optional implementation, the ratio may be, for example, 1:1.5.
The video category identification technology of the present disclosure described above can be applied to the training stage of the convolutional neural network models, as well as to the testing stage and subsequent application stage of the models.
In an optional embodiment of the video category identification technology of the present disclosure, when the technology is applied to the testing stage and subsequent application stage of the convolutional neural network models, after the classification result of the video is obtained in step 108, 210, 312, or 3120, a Softmax function may be used to normalize the classification result vector obtained by the fusion processing, yielding a classification probability vector of the video over the categories. As an optional example, the normalization operation in this step may be executed by a processor invoking instructions stored in a memory, or may be executed by a first normalization processing unit run by the processor.
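For reference, the Softmax normalization of a fused classification result vector can be sketched as follows (the toy scores are assumptions for the example):

```python
import numpy as np

def softmax(scores):
    """Normalize a classification result vector into class probabilities."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

fused = np.array([0.55, 5.15, 0.7, 1.55, 2.45, 0.35])  # toy fused scores
probs = softmax(fused)
print(probs, probs.sum())  # probabilities over the categories, summing to 1
```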
In an optional embodiment of the video category identification technology of the present disclosure, when the above video category identification technology is applied to the training stage of the convolutional neural network models, the following operations may also be included:
presetting an initial spatial convolutional neural network and an initial temporal convolutional neural network; as an optional example, the operation of presetting the initial spatial convolutional neural network and the initial temporal convolutional neural network may be executed by a processor invoking instructions stored in a memory, and the preset initial spatial and temporal convolutional neural networks may be stored in a network training unit;
training the initial spatial convolutional neural network using stochastic gradient descent (SGD) based on each video serving as a sample, to obtain the spatial convolutional neural network of the above embodiments; and training the initial temporal convolutional neural network using stochastic gradient descent, to obtain the temporal convolutional neural network of the above embodiments. As an optional example, this step may be executed by a processor invoking instructions stored in a memory, or may be executed by a network training unit run by the processor.
Here, each video serving as a sample is annotated in advance with standard spatial domain classification result information.
Stochastic gradient descent iteratively updates the network model once per sample; the network training unit trains the initial spatial convolutional neural network and the initial temporal convolutional neural network with stochastic gradient descent, which is fast and helps to improve network training efficiency.
In FIG. 6, at 402, for one video serving as a sample, the operations of the procedures shown in the above optional embodiments of the present disclosure start to be executed, until the spatial domain classification result of the video is obtained. For example, the processor executes the spatial-domain-related operations among operations 102-106, 202-208, 302-310, or 3020-3100 to obtain the spatial domain classification result of the video.
At 404, it is compared whether the deviation of the spatial domain classification result of the video from the preset standard spatial domain classification result of the video is less than a preset range.
If it is not less than the preset range, operation 406 is executed. If it is less than the preset range, the training procedure for the initial spatial convolutional neural network ends, the current initial spatial convolutional neural network is taken as the final spatial convolutional neural network, and the subsequent procedure of this embodiment is not executed. At 406, the network parameters of the initial spatial convolutional neural network are adjusted.
At 408, the spatial convolutional neural network with adjusted network parameters is taken as a new initial spatial convolutional neural network, and operation 402 starts to be executed for the next video serving as a sample. As an optional example, steps 404, 406, and 408 may be executed by a processor invoking instructions stored in a memory, or may be executed by a network training unit run by the processor.
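The per-sample training loop of FIG. 6 can be caricatured with a toy linear model standing in for the spatial convolutional neural network; the feature size, learning rate, and stopping threshold below are all assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 128))    # toy linear "spatial network"

def sgd_step(x, target, lr=0.1):
    """One per-sample update, as in FIG. 6: adjust the parameters while the
    deviation from the standard (ground-truth) result is too large."""
    global W
    scores = W @ x
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    W -= lr * np.outer(probs - target, x)    # softmax cross-entropy gradient
    return -np.log(probs[target.argmax()])   # per-sample loss ("deviation")

x = rng.normal(size=128)                     # stand-in for one sample's features
target = np.eye(6)[1]                        # standard result: category 1
loss = float("inf")
while loss > 0.05:                           # "preset range" on the deviation
    loss = sgd_step(x, target)
print("final loss:", loss)
```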
In FIG. 7, at 502, for one video serving as a sample, the operation of segmenting the video starts to be executed, until the time domain classification result of the video is obtained. For example, the processor executes the time-domain-related operations among operations 102-106, 202-208, 302-310, or 3020-3100 to obtain the time domain classification result of the video.
At 504, it is compared whether the deviation of the time domain classification result of the video from the preset standard time domain classification result of the video is less than a preset range.
If it is not less than the preset range, operation 506 is executed. If it is less than the preset range, the training procedure for the initial temporal convolutional neural network ends, the current initial temporal convolutional neural network is taken as the final temporal convolutional neural network, and the subsequent procedure of this embodiment is not executed.
At 506, the network parameters of the initial temporal convolutional neural network are adjusted.
At 508, the temporal convolutional neural network with adjusted network parameters is taken as a new initial temporal convolutional neural network, and operation 502 starts to be executed for the next video serving as a sample.
As an optional example, steps 504, 506, and 508 may be executed by a processor invoking instructions stored in a memory, or may be executed by a network training unit run by the processor.
In the optional embodiment shown in FIG. 7, the initial temporal convolutional neural network may be a first initial temporal convolutional neural network or a second initial temporal convolutional neural network; the time domain classification result correspondingly includes a first time domain classification result or a second time domain classification result; and the temporal convolutional neural network correspondingly includes a first temporal convolutional neural network and a second temporal convolutional neural network. That is, training of the first initial temporal convolutional neural network and the second initial temporal convolutional neural network can be carried out separately or simultaneously through the embodiment shown in FIG. 7.
Further, when the initial spatial convolutional neural network and the initial temporal convolutional neural network are trained through the embodiments shown in FIG. 6 and FIG. 7, the following operations may also be included: normalizing the spatial domain classification result of the video with a Softmax function to obtain a spatial domain classification probability vector of the video over the categories; and normalizing the time domain classification result of the video with a Softmax function to obtain a time domain classification probability vector of the video over the categories. As an optional example, this operation may be executed by a processor invoking instructions stored in a memory, or may be executed by a second normalization processing unit run by the processor. Accordingly, the spatial domain classification results and time domain classification results shown in FIG. 6 and FIG. 7 may be un-normalized classification results or normalized classification probability vectors.
FIG. 13 shows an optional application example of the video category identification apparatus of the present disclosure, in which the temporal convolutional neural network may be the first temporal convolutional neural network, or the second temporal convolutional neural network, or may include both the first temporal convolutional neural network and the second temporal convolutional neural network.
In addition, the present disclosure further provides a data processing apparatus, which includes the video category identification apparatus of the present disclosure.
The data processing apparatus provided by the above embodiments of the present disclosure is equipped with the video category identification apparatus of the above embodiments. By dividing a video into two or more segmented videos and sampling frame pictures and inter-frame optical flow from each segmented video, modeling of long-duration actions can be achieved when training the convolutional neural networks, so that when the trained network models are subsequently used to identify video categories, the accuracy of video category identification is improved relative to the prior art, the identification effect is enhanced, and the computational cost is low.
The data processing apparatus of the embodiments of the present disclosure may be any apparatus with a data processing function, and may include, but is not limited to: an advanced RISC machine (ARM), a central processing unit (CPU), or a graphics processing unit (GPU), etc. In addition, the present disclosure further provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet, a server, or the like; the electronic device is equipped with the data processing apparatus of the present disclosure.
The electronic device provided by the above embodiments of the present disclosure is equipped with the data processing apparatus of the above embodiments. By dividing a video into two or more segmented videos and sampling frame pictures and inter-frame optical flow from each segmented video, modeling of long-duration actions can be achieved when training the convolutional neural networks, which helps to improve the accuracy of video category identification with the subsequently trained network models, enhances the identification effect, and keeps the computational cost low.
FIG. 14 is a schematic structural diagram of one embodiment of the electronic device of the present disclosure. As shown in FIG. 14, the electronic device for implementing an embodiment of the present disclosure includes a central processing unit (CPU), which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage portion into a random access memory (RAM). The central processing unit can communicate with the read-only memory and/or the random access memory to execute the executable instructions, so as to perform operations corresponding to the video category identification method provided by the present disclosure, for example: segmenting a video to obtain two or more segmented videos; sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; processing the original image of each segmented video with the spatial convolutional neural network respectively, to obtain the spatial domain classification result of each segmented video; processing the optical flow image of each segmented video with the temporal convolutional neural network respectively, to obtain the time domain classification result of each segmented video; and fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
In addition, the RAM may also store various programs and data required for system operation. The CPU, the ROM, and the RAM are connected to one another through a bus. An input/output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input portion including a keyboard, a mouse, and the like; an output portion including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage portion including a hard disk and the like; and a communication portion including a network interface card such as a LAN card or a modem. The communication portion performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read therefrom is installed into the storage portion as needed.
In particular, according to an optional example of the present disclosure, the processes described above with reference to the flowcharts can be implemented as a computer software program. For example, the technical solution of the present disclosure includes a computer program product, which may include a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for executing the method illustrated in the flowchart, and the program code may include executable instructions corresponding to the steps of any video classification method provided by the present disclosure, for example: an executable instruction for segmenting a video to obtain two or more segmented videos; an executable instruction for sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; an executable instruction for processing the original image of each segmented video with the spatial convolutional neural network respectively, to obtain a spatial domain preliminary classification result of each segmented video; an executable instruction for processing the optical flow image of each segmented video with the temporal convolutional neural network respectively, to obtain a time domain preliminary classification result of each segmented video; an executable instruction for comprehensively processing the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; an executable instruction for comprehensively processing the time domain preliminary classification results of the segmented videos to obtain the time domain classification result of the video; and an executable instruction for fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video. The computer program can be downloaded and installed from a network via the communication portion and/or installed from the removable medium. When the computer program is executed by the central processing unit (CPU), the functions defined in the method of the present disclosure are performed.
Embodiments of the present disclosure further provide a computer storage medium for storing computer-readable instructions, the instructions including: an executable instruction for segmenting a video to obtain two or more segmented videos; an executable instruction for sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video; an executable instruction for processing the original image of each segmented video with the spatial convolutional neural network respectively, to obtain a spatial domain preliminary classification result of each segmented video; an executable instruction for processing the optical flow image of each segmented video with the temporal convolutional neural network respectively, to obtain a time domain preliminary classification result of each segmented video; an executable instruction for comprehensively processing the spatial domain preliminary classification results of the segmented videos to obtain the spatial domain classification result of the video; an executable instruction for comprehensively processing the time domain preliminary classification results of the segmented videos to obtain the time domain classification result of the video; and an executable instruction for fusing the spatial domain classification result and the time domain classification result to obtain the classification result of the video.
In addition, the present disclosure further provides a computer device, including: a memory storing executable instructions; and one or more processors communicating with the memory to execute the executable instructions, so as to perform operations corresponding to the video category identification method of any of the above examples of the present disclosure.
Each example in the present application is described in a progressive manner; the description of each example focuses on what may differ from the other examples, and for the same or similar parts among the examples, reference may be made to one another. For the apparatus/system examples, since they substantially correspond to the method examples, the description is relatively simple; for relevant parts, reference may be made to the description of the method examples.
The methods, apparatuses, and devices of the present disclosure may be implemented in many ways. For example, the methods, apparatuses, and devices of the present disclosure may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specified. In addition, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing programs for executing the methods according to the present disclosure.
The description of the present disclosure is given for the sake of example and description, and is not exhaustive or intended to limit the present disclosure to the disclosed form. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described to better illustrate the principles and practical applications of the present disclosure, and to enable those of ordinary skill in the art to understand the present disclosure so as to design various embodiments, with various modifications, suited to particular uses.

Claims (49)

  1. A video category identification method, characterized by comprising:
    segmenting a video to obtain two or more segmented videos;
    sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video;
    processing the original image of each segmented video by using a spatial convolutional neural network to obtain a spatial domain classification result of the video; and processing the optical flow image of each segmented video by using a temporal convolutional neural network to obtain a time domain classification result of the video; and
    fusing the spatial domain classification result and the time domain classification result to obtain a classification result of the video.
  2. The method according to claim 1, characterized in that the segmenting the video comprises:
    segmenting the video evenly to obtain two or more segmented videos of the same length.
  3. The method according to claim 1 or 2, characterized in that the sampling each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video comprises:
    randomly extracting one frame of image from each segmented video as the original image of each segmented video; and/or
    randomly extracting multiple consecutive frames of images from each segmented video, and obtaining the optical flow image of each segmented video according to the multiple frames of images.
  4. The method according to claim 1, 2 or 3, characterized in that the optical flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, and the median value of the grayscale image is 128.
  5. The method according to claim 3 or 4, characterized in that the randomly extracting multiple consecutive frames of images from each segmented video and obtaining the optical flow image of each segmented video according to the multiple frames of images comprises:
    for each segmented video respectively: randomly extracting N consecutive frames of images from each segmented video, where N is an integer greater than 1; and
    performing computation based on each pair of adjacent frames among the N frames of images to obtain N-1 groups of optical flow images, each group of the N-1 groups of optical flow images comprising one horizontal optical flow image and one vertical optical flow image.
  6. The method according to any one of claims 1 to 5, characterized in that
    the processing the original image of each segmented video by using the spatial convolutional neural network to obtain the spatial domain classification result of the video comprises:
    processing the original image of each segmented video by using the spatial convolutional neural network respectively, to obtain a spatial domain preliminary classification result of each segmented video;
    comprehensively processing the spatial domain preliminary classification results of the segmented videos by using a spatial domain consensus function, to obtain the spatial domain classification result of the video;
    and/or
    the processing the optical flow image of each segmented video by using the temporal convolutional neural network to obtain the time domain classification result of the video comprises:
    processing the optical flow image of each segmented video by using the temporal convolutional neural network respectively, to obtain a time domain preliminary classification result of each segmented video;
    comprehensively processing the time domain preliminary classification results of the segmented videos by using a time domain consensus function, to obtain the time domain classification result of the video.
  7. The method according to claim 6, characterized in that the spatial domain consensus function and/or the time domain consensus function comprise: an average function, a maximum function, or a weighted average function.
  8. The method according to claim 7, characterized in that the average function, the maximum function, or the weighted average function is specifically: the average function, the maximum function, or the weighted average function having the highest classification accuracy on a validation dataset.
  9. The method according to any one of claims 6 to 8, characterized in that the spatial domain preliminary classification result and the time domain preliminary classification result are each a classification result vector whose dimension equals the number of classification categories;
    the spatial domain classification result of the video and the time domain classification result of the video are each a classification result vector whose dimension equals the number of classification categories; and
    the classification result of the video is a classification result vector whose dimension equals the number of classification categories.
  10. The method according to any one of claims 1 to 9, characterized in that the fusing the spatial domain classification result and the time domain classification result comprises:
    multiplying the spatial domain classification result and the time domain classification result by preset weight coefficients respectively and then summing them, to obtain the classification result of the video.
  11. The method according to claim 10, characterized in that the weight coefficient ratio between the spatial domain classification result and the time domain classification result is any ratio from 1:1 to 1:3.
  12. The method according to any one of claims 1 to 11, characterized in that the optical flow image is specifically an original optical flow image, and the temporal convolutional neural network is specifically a first temporal convolutional neural network;
    and the processing the optical flow image of each segmented video by using the temporal convolutional neural network to obtain the time domain classification result of the video comprises:
    processing the original optical flow image of each segmented video by using the first temporal convolutional neural network respectively, to obtain a first time domain preliminary classification result of each segmented video;
    comprehensively processing the first time domain preliminary classification results of the segmented videos by using a first time domain consensus function, to obtain a first time domain classification result of the video;
    and the fusing the spatial domain classification result and the time domain classification result comprises: fusing the spatial domain classification result and the first time domain classification result to obtain the classification result of the video.
  13. The method according to any one of claims 1 to 11, characterized in that the optical flow image is specifically a warped optical flow image of an original optical flow image, and the temporal convolutional neural network is specifically a second temporal convolutional neural network;
    the method further comprises: acquiring the warped optical flow image obtained by warping the original optical flow image;
    and the processing the optical flow image of each segmented video by using the temporal convolutional neural network to obtain the time domain classification result of the video comprises:
    processing the warped optical flow image of each segmented video by using the second temporal convolutional neural network respectively, to obtain a second time domain preliminary classification result of each segmented video;
    comprehensively processing the second time domain preliminary classification results of the segmented videos by using a second time domain consensus function, to obtain a second time domain classification result of the video;
    and the fusing the spatial domain classification result and the time domain classification result comprises: fusing the spatial domain classification result and the second time domain classification result to obtain the classification result of the video.
  14. The method according to any one of claims 1 to 11, characterized in that the optical flow image is specifically an original optical flow image and a warped optical flow image, and the temporal convolutional neural network is specifically a first temporal convolutional neural network and a second temporal convolutional neural network;
    the method further comprises: acquiring the warped optical flow image obtained by warping the original optical flow image;
    and the processing the optical flow image of each segmented video by using the temporal convolutional neural network to obtain the time domain classification result of the video comprises:
    processing the original optical flow image of each segmented video by using the first temporal convolutional neural network respectively, to obtain a first time domain preliminary classification result of each segmented video;
    comprehensively processing the first time domain preliminary classification results of the segmented videos by using a first time domain consensus function, to obtain a first time domain classification result of the video;
    processing the warped optical flow image of each segmented video by using the second temporal convolutional neural network respectively, to obtain a second time domain preliminary classification result of each segmented video;
    comprehensively processing the second time domain preliminary classification results of the segmented videos by using a second time domain consensus function, to obtain a second time domain classification result of the video;
    and the fusing the spatial domain classification result and the time domain classification result comprises: fusing the spatial domain classification result, the first time domain classification result, and the second time domain classification result to obtain the classification result of the video.
  15. The method according to claim 13 or 14, characterized in that the acquiring the warped optical flow image obtained by warping the original optical flow image comprises:
    performing computation on each pair of adjacent frames respectively, to obtain a homography transformation matrix between each pair of adjacent frames;
    performing an affine transformation on the latter frame of each pair of adjacent frames according to the homography transformation matrix between the pair of adjacent frames; and
    performing computation on the former frame of each pair of adjacent frames and the affine-transformed latter frame respectively, to obtain the warped optical flow image.
  16. The method according to claim 15, characterized in that the performing computation on each pair of adjacent frames comprises: performing inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.
  17. The method according to any one of claims 14 to 16, characterized in that fusing the spatial domain classification result, the first time domain classification result, and the second time domain classification result comprises:
    multiplying the spatial domain classification result, the first time domain classification result, and the second time domain classification result by preset weight coefficients respectively and then summing them, to obtain the classification result of the video.
  18. The method according to claim 17, characterized in that the weight coefficient ratio among the spatial domain classification result, the first time domain classification result, and the second time domain classification result is 1:a:b, where the sum of a and b is not less than 1 and not greater than 3.
  19. The method according to any one of claims 1 to 18, characterized in that the classification result of the video is a classification result vector whose dimension equals the number of classification categories;
    the method further comprises:
    normalizing the classification result vector of the video by using a Softmax function, to obtain a classification probability vector of the video over the categories; or
    normalizing the spatial domain classification result of the video by using a Softmax function, to obtain a spatial domain classification probability vector of the video over the categories; and normalizing the time domain classification result of the video by using a Softmax function, to obtain a time domain classification probability vector of the video over the categories.
  20. The method according to any one of claims 1 to 18, characterized by further comprising:
    presetting an initial spatial convolutional neural network and an initial temporal convolutional neural network;
    training the initial spatial convolutional neural network by stochastic gradient descent based on each video serving as a sample, to obtain the spatial convolutional neural network; and training the initial temporal convolutional neural network by stochastic gradient descent, to obtain the temporal convolutional neural network.
  21. The method according to claim 20, characterized in that training the initial spatial convolutional neural network by stochastic gradient descent to obtain the spatial convolutional neural network comprises:
    for one video serving as a sample, starting to execute the operation of segmenting the video, until the spatial domain classification result of the video is obtained;
    comparing whether the deviation of the spatial domain classification result of the video from a preset standard spatial domain classification result of the video is less than a preset range;
    if not less than the preset range, adjusting the network parameters of the initial spatial convolutional neural network; taking the spatial convolutional neural network with adjusted network parameters as the initial spatial convolutional neural network, and, for the next video serving as a sample, starting to execute the operation of segmenting the video;
    if less than the preset range, taking the current initial spatial convolutional neural network as the spatial convolutional neural network.
  22. The method according to claim 20, characterized in that training the initial temporal convolutional neural network by stochastic gradient descent to obtain the temporal convolutional neural network comprises:
    for one video serving as a sample, starting to execute the operation of segmenting the video, until the time domain classification result of the video is obtained;
    comparing whether the deviation of the time domain classification result of the video from a preset standard time domain classification result of the video is less than a preset range;
    if not less than the preset range, adjusting the network parameters of the initial temporal convolutional neural network; taking the temporal convolutional neural network with adjusted network parameters as the initial temporal convolutional neural network, and, for the next video serving as a sample, starting to execute the operation of segmenting the video;
    if less than the preset range, taking the current initial temporal convolutional neural network as the temporal convolutional neural network;
    wherein the initial temporal convolutional neural network comprises a first initial temporal convolutional neural network or a second initial temporal convolutional neural network, the time domain classification result correspondingly comprises a first time domain classification result or a second time domain classification result, and the temporal convolutional neural network correspondingly comprises a first temporal convolutional neural network and a second temporal convolutional neural network.
  23. A video category identification apparatus, characterized by comprising:
    a segmentation unit, configured to segment a video to obtain two or more segmented videos;
    a sampling unit, configured to sample each segmented video in the two or more segmented videos to obtain an original image and an optical flow image of each segmented video;
    a spatial domain classification processing unit, configured to process the original image of each segmented video by using a spatial convolutional neural network to obtain a spatial domain classification result of the video;
    a time domain classification processing unit, configured to process the optical flow image of each segmented video by using a temporal convolutional neural network respectively, to obtain a time domain classification result of each segmented video; and
    a fusion unit, configured to fuse the spatial domain classification result and the time domain classification result to obtain a classification result of the video.
  24. The apparatus according to claim 23, characterized in that the segmentation unit is specifically configured to segment the video evenly to obtain two or more segmented videos of the same length.
  25. The apparatus according to claim 23 or 24, characterized in that the sampling unit comprises:
    an image sampling module, configured to randomly extract one frame of image from each segmented video as the original image of each segmented video; and/or
    an optical flow sampling module, configured to randomly extract multiple consecutive frames of images from each segmented video, and obtain the optical flow image of each segmented video according to the multiple frames of images.
  26. The apparatus according to claim 23, 24 or 25, characterized in that the optical flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, and the median value of the grayscale image is 128.
  27. The apparatus according to claim 25 or 26, characterized in that the optical flow sampling module is specifically configured to:
    for each segmented video respectively: randomly extract N consecutive frames of images from each segmented video, where N is an integer greater than 1; and perform computation based on each pair of adjacent frames among the N frames of images to obtain N-1 groups of optical flow images, each group of the N-1 groups of optical flow images comprising one horizontal optical flow image and one vertical optical flow image.
  28. The apparatus according to any one of claims 23 to 27, characterized in that the spatial domain classification processing unit comprises:
    a spatial domain classification processing module, configured to process the original image of each segmented video by using the spatial convolutional neural network respectively, to obtain a spatial domain preliminary classification result of each segmented video; and
    a first comprehensive processing module, configured to comprehensively process the spatial domain preliminary classification results of the segmented videos by using a spatial domain consensus function, to obtain the spatial domain classification result of the video;
    and the time domain classification processing unit comprises:
    a first time domain classification processing module, configured to process the optical flow image of each segmented video by using the temporal convolutional neural network respectively, to obtain a time domain preliminary classification result of each segmented video; and
    a second comprehensive processing module, configured to comprehensively process the time domain preliminary classification results of the segmented videos by using a time domain consensus function, to obtain the time domain classification result of the video.
  29. The apparatus according to claim 28, characterized in that the spatial domain consensus function and/or the time domain consensus function comprise: an average function, a maximum function, or a weighted average function.
  30. The apparatus according to claim 29, characterized in that the spatial domain consensus function is specifically the average function, the maximum function, or the weighted average function having the highest classification accuracy on a validation dataset;
    and the time domain consensus function is specifically the average function, the maximum function, or the weighted average function having the highest classification accuracy on a validation dataset.
  31. The apparatus according to any one of claims 28 to 30, characterized in that the spatial domain preliminary classification result and the time domain preliminary classification result are each a classification result vector whose dimension equals the number of classification categories;
    the spatial domain classification result of the video and the time domain classification result of the video are each a classification result vector whose dimension equals the number of classification categories; and
    the classification result of the video is a classification result vector whose dimension equals the number of classification categories.
  32. The apparatus according to any one of claims 23 to 31, characterized in that the fusion unit is specifically configured to multiply the spatial domain classification result and the time domain classification result by preset weight coefficients respectively and then sum them, to obtain the classification result of the video.
  33. The apparatus according to claim 32, characterized in that the weight coefficient ratio between the spatial domain classification result and the time domain classification result is any ratio from 1:1 to 1:3.
  34. The apparatus according to any one of claims 28 to 33, characterized in that the optical flow image is specifically an original optical flow image, and the temporal convolutional neural network is specifically a first temporal convolutional neural network;
    the first time domain classification processing module is specifically configured to process the original optical flow image of each segmented video by using the first temporal convolutional neural network respectively, to obtain a first time domain preliminary classification result of each segmented video;
    the second comprehensive processing module is specifically configured to comprehensively process the first time domain preliminary classification results of the segmented videos by using a first time domain consensus function, to obtain a first time domain classification result of the video;
    and the fusion unit is specifically configured to fuse the spatial domain classification result and the first time domain classification result to obtain the classification result of the video.
  35. The apparatus according to any one of claims 23 to 33, characterized in that the optical flow image is specifically a warped optical flow image of an original optical flow image, and the temporal convolutional neural network is specifically a second temporal convolutional neural network;
    the apparatus further comprises: an optical flow processing unit, configured to acquire the warped optical flow image obtained by warping the original optical flow image;
    and the time domain classification processing unit comprises:
    a second time domain classification processing module, configured to process the warped optical flow image of each segmented video by using the second temporal convolutional neural network respectively, to obtain a second time domain preliminary classification result of each segmented video; and
    a third comprehensive processing module, configured to comprehensively process the second time domain preliminary classification results of the segmented videos by using a second time domain consensus function, to obtain a second time domain classification result of the video;
    and the fusion unit is specifically configured to fuse the spatial domain classification result and the second time domain classification result to obtain the classification result of the video.
  36. The apparatus according to any one of claims 23 to 33, characterized in that the optical flow image is specifically an original optical flow image and a warped optical flow image, and the temporal convolutional neural network is specifically a first temporal convolutional neural network and a second temporal convolutional neural network;
    the apparatus further comprises:
    an optical flow processing unit, configured to acquire the warped optical flow image obtained by warping the original optical flow image;
    and the time domain classification processing unit comprises:
    a first time domain classification processing module, configured to process the original optical flow image of each segmented video by using the first temporal convolutional neural network respectively, to obtain a first time domain preliminary classification result of each segmented video;
    a second comprehensive processing module, specifically configured to comprehensively process the first time domain preliminary classification results of the segmented videos by using a first time domain consensus function, to obtain a first time domain classification result of the video;
    a second time domain classification processing module, configured to process the warped optical flow image of each segmented video by using the second temporal convolutional neural network respectively, to obtain a second time domain preliminary classification result of each segmented video; and
    a third comprehensive processing module, configured to comprehensively process the second time domain preliminary classification results of the segmented videos, to obtain a second time domain classification result of the video;
    and the fusion unit is specifically configured to fuse the spatial domain classification result, the first time domain classification result, and the second time domain classification result to obtain the classification result of the video.
  37. The apparatus according to claim 35 or 36, characterized in that the optical flow processing unit is specifically configured to:
    perform computation on each pair of adjacent frames respectively, to obtain a homography transformation matrix between each pair of adjacent frames;
    perform an affine transformation on the latter frame of each pair of adjacent frames according to the homography transformation matrix between the pair of adjacent frames; and
    perform computation on the former frame of each pair of adjacent frames and the affine-transformed latter frame respectively, to obtain the warped optical flow image.
  38. The apparatus according to claim 37, characterized in that, when performing computation on each pair of adjacent frames, the optical flow processing unit is specifically configured to perform inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.
  39. The apparatus according to any one of claims 36 to 38, characterized in that the fusion unit is specifically configured to multiply the spatial domain classification result, the first time domain classification result, and the second time domain classification result by preset weight coefficients respectively and then sum them, to obtain the classification result of the video.
  40. The apparatus according to claim 39, characterized in that the weight coefficient ratio among the spatial domain classification result, the first time domain classification result, and the second time domain classification result is 1:a:b, where the sum of a and b is not less than 1 and not greater than 3.
  41. The apparatus according to any one of claims 23 to 40, characterized by further comprising:
    a first normalization processing unit, configured to normalize the classification result vector of the video by using a Softmax function, to obtain a classification probability vector of the video over the categories; or
    a second normalization processing unit, configured to normalize the spatial domain classification result of the video by using a Softmax function, to obtain a spatial domain classification probability vector of the video over the categories; and to normalize the time domain classification result of the video by using a Softmax function, to obtain a time domain classification probability vector of the video over the categories.
  42. The apparatus according to any one of claims 23 to 40, characterized by further comprising:
    a network training unit, configured to store a preset initial spatial convolutional neural network and a preset initial temporal convolutional neural network; to train the initial spatial convolutional neural network by stochastic gradient descent based on each video serving as a sample, to obtain the spatial convolutional neural network; and to train the initial temporal convolutional neural network by stochastic gradient descent, to obtain the temporal convolutional neural network.
  43. The apparatus according to claim 42, characterized in that, when training the initial spatial convolutional neural network by stochastic gradient descent, the network training unit is specifically configured to:
    for one video serving as a sample, compare whether the spatial domain classification result of the video obtained by the spatial domain classification processing unit is the same as a preset standard spatial domain classification result of the video;
    if not the same, adjust the network parameters of the initial spatial convolutional neural network; take the spatial convolutional neural network with adjusted network parameters as the initial spatial convolutional neural network, and then, for the next video serving as a sample, start to execute the operation of comparing whether the spatial domain classification result of the video obtained by the spatial domain classification processing unit is the same as the preset standard spatial domain classification result of the video;
    if the same, take the current initial spatial convolutional neural network as the spatial convolutional neural network.
  44. The apparatus according to claim 42, characterized in that, when training the initial temporal convolutional neural network by stochastic gradient descent, the network training unit is specifically configured to:
    for one video serving as a sample, compare whether the time domain classification result of the video obtained by the time domain classification processing unit is the same as a preset standard time domain classification result of the video;
    if not the same, adjust the network parameters of the initial temporal convolutional neural network; take the temporal convolutional neural network with adjusted network parameters as the initial temporal convolutional neural network, and then, for the next video serving as a sample, start to execute the operation of comparing whether the time domain classification result of the video obtained by the time domain classification processing unit is the same as the preset standard time domain classification result of the video;
    if the same, take the current initial temporal convolutional neural network as the temporal convolutional neural network;
    wherein the initial temporal convolutional neural network comprises a first initial temporal convolutional neural network or a second initial temporal convolutional neural network, the time domain classification result correspondingly comprises a first time domain classification result or a second time domain classification result, and the temporal convolutional neural network correspondingly comprises a first temporal convolutional neural network and a second temporal convolutional neural network.
  45. A data processing apparatus, characterized by comprising the video category identification apparatus according to any one of claims 23 to 44.
  46. The data processing apparatus according to claim 45, characterized in that the data processing apparatus comprises an advanced RISC machine (ARM), a central processing unit (CPU), or a graphics processing unit (GPU).
  47. An electronic device, characterized by being provided with the data processing apparatus according to claim 45 or 46.
  48. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in a device, a processor in the device executes executable instructions for implementing the steps of the video category identification method according to any one of claims 1 to 22.
  49. A computer-readable medium, configured to store the computer program according to claim 48.
PCT/CN2017/092597 2016-07-29 2017-07-12 视频类别识别方法和装置、数据处理装置和电子设备 WO2018019126A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610619654 2016-07-29
CN201610619654.1 2016-07-29

Publications (1)

Publication Number Publication Date
WO2018019126A1 true WO2018019126A1 (zh) 2018-02-01

Family

ID=58592577

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/092597 WO2018019126A1 (zh) 2016-07-29 2017-07-12 视频类别识别方法和装置、数据处理装置和电子设备

Country Status (2)

Country Link
CN (1) CN106599789B (zh)
WO (1) WO2018019126A1 (zh)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120932A (zh) * 2018-07-12 2019-01-01 东华大学 Hevc压缩域双svm模型的视频显著性预测方法
CN111027482A (zh) * 2019-12-10 2020-04-17 浩云科技股份有限公司 基于运动向量分段分析的行为分析方法及装置
CN111050219A (zh) * 2018-10-12 2020-04-21 奥多比公司 用于定位视频内容中的目标对象的空间-时间记忆网络
CN111104553A (zh) * 2020-01-07 2020-05-05 中国科学院自动化研究所 一种高效运动互补神经网络系统
CN111753574A (zh) * 2019-03-26 2020-10-09 顺丰科技有限公司 抛扔区域定位方法、装置、设备及存储介质
CN111783713A (zh) * 2020-07-09 2020-10-16 中国科学院自动化研究所 基于关系原型网络的弱监督时序行为定位方法及装置
CN112307821A (zh) * 2019-07-29 2021-02-02 顺丰科技有限公司 一种视频流处理方法、装置、设备及存储介质
CN112528780A (zh) * 2019-12-06 2021-03-19 百度(美国)有限责任公司 通过混合时域自适应的视频动作分割
CN112580589A (zh) * 2020-12-28 2021-03-30 国网上海市电力公司 基于双流法考虑非均衡数据的行为识别方法、介质及设备
CN112731359A (zh) * 2020-12-31 2021-04-30 无锡祥生医疗科技股份有限公司 超声探头的速度确定方法、装置及存储介质
CN112926549A (zh) * 2021-04-15 2021-06-08 华中科技大学 基于时间域-空间域特征联合增强的步态识别方法与系统
CN113128354A (zh) * 2021-03-26 2021-07-16 中山大学中山眼科中心 一种洗手质量检测方法及装置
CN113395542A (zh) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 基于人工智能的视频生成方法、装置、计算机设备及介质
CN114373194A (zh) * 2022-01-14 2022-04-19 南京邮电大学 基于关键帧与注意力机制的人体行为识别方法
CN114756115A (zh) * 2020-12-28 2022-07-15 阿里巴巴集团控股有限公司 交互控制方法、装置及设备

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599789B (zh) * 2016-07-29 2019-10-11 北京市商汤科技开发有限公司 视频类别识别方法和装置、数据处理装置和电子设备
CN107330362B (zh) * 2017-05-25 2020-10-09 北京大学 一种基于时空注意力的视频分类方法
CN107463949B (zh) * 2017-07-14 2020-02-21 北京协同创新研究院 一种视频动作分类的处理方法及装置
CN108229290B (zh) * 2017-07-26 2021-03-02 北京市商汤科技开发有限公司 视频物体分割方法和装置、电子设备、存储介质
CN107943849B (zh) * 2017-11-03 2020-05-08 绿湾网络科技有限公司 视频文件的检索方法及装置
CN108010538B (zh) * 2017-12-22 2021-08-24 北京奇虎科技有限公司 音频数据处理方法及装置、计算设备
CN108230413B (zh) * 2018-01-23 2021-07-06 北京市商汤科技开发有限公司 图像描述方法和装置、电子设备、计算机存储介质
CN108171222B (zh) * 2018-02-11 2020-08-25 清华大学 一种基于多流神经网络的实时视频分类方法及装置
CN110321761B (zh) * 2018-03-29 2022-02-11 中国科学院深圳先进技术研究院 一种行为识别方法、终端设备及计算机可读存储介质
CN108764084B (zh) * 2018-05-17 2021-07-27 西安电子科技大学 基于空域分类网络和时域分类网络融合的视频分类方法
CN110598504B (zh) * 2018-06-12 2023-07-21 北京市商汤科技开发有限公司 图像识别方法及装置、电子设备和存储介质
CN109271840A (zh) * 2018-07-25 2019-01-25 西安电子科技大学 一种视频手势分类方法
CN109325430B (zh) * 2018-09-11 2021-08-20 苏州飞搜科技有限公司 实时行为识别方法及系统
CN109325435B (zh) * 2018-09-15 2022-04-19 天津大学 基于级联神经网络的视频动作识别及定位方法
CN109376603A (zh) * 2018-09-25 2019-02-22 北京周同科技有限公司 一种视频识别方法、装置、计算机设备及存储介质
CN109657546A (zh) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 基于神经网络的视频行为识别方法及终端设备
CN109376696B (zh) * 2018-11-28 2020-10-23 北京达佳互联信息技术有限公司 视频动作分类的方法、装置、计算机设备和存储介质
CN109740670B (zh) 2019-01-02 2022-01-11 京东方科技集团股份有限公司 视频分类的方法及装置
CN109726765A (zh) 2019-01-02 2019-05-07 京东方科技集团股份有限公司 一种视频分类问题的样本提取方法及装置
CN109886165A (zh) * 2019-01-23 2019-06-14 中国科学院重庆绿色智能技术研究院 一种基于运动目标检测的动作视频提取和分类方法
CN109840917B (zh) * 2019-01-29 2021-01-26 北京市商汤科技开发有限公司 图像处理方法及装置、网络训练方法及装置
CN109871828B (zh) 2019-03-15 2022-12-02 京东方科技集团股份有限公司 视频识别方法和识别装置、存储介质
CN110020639B (zh) * 2019-04-18 2021-07-23 北京奇艺世纪科技有限公司 视频特征提取方法及相关设备
CN111820947B (zh) * 2019-04-19 2023-08-29 无锡祥生医疗科技股份有限公司 超声心脏反流自动捕捉方法、系统及超声成像设备
CN110062248B (zh) * 2019-04-30 2021-09-28 广州酷狗计算机科技有限公司 推荐直播间的方法和装置
CN112288345A (zh) * 2019-07-25 2021-01-29 顺丰科技有限公司 装卸口状态检测方法、装置、服务器及存储介质
CN110602527B (zh) * 2019-09-12 2022-04-08 北京小米移动软件有限公司 视频处理方法、装置及存储介质
CN111125405A (zh) * 2019-12-19 2020-05-08 国网冀北电力有限公司信息通信分公司 电力监控图像异常检测方法和装置、电子设备及存储介质
CN111898458B (zh) * 2020-07-07 2024-07-12 中国传媒大学 基于注意力机制的双模态任务学习的暴力视频识别方法
CN111860353A (zh) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 基于双流神经网络的视频行为预测方法、装置及介质
CN113139467B (zh) * 2021-04-23 2023-04-25 西安交通大学 基于分级式结构的细粒度视频动作识别方法
CN113395537B (zh) * 2021-06-16 2023-05-16 北京百度网讯科技有限公司 用于推荐直播间的方法和装置
CN113870040B (zh) * 2021-09-07 2024-05-21 天津大学 融合不同传播模式的双流图卷积网络微博话题检测方法
CN116645917A (zh) * 2023-06-09 2023-08-25 浙江技加智能科技有限公司 Led显示屏亮度调节系统及其方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN103218831A (zh) * 2013-04-21 2013-07-24 北京航空航天大学 一种基于轮廓约束的视频运动目标分类识别方法
CN104217214A (zh) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 基于可配置卷积神经网络的rgb-d人物行为识别方法
CN104966104A (zh) * 2015-06-30 2015-10-07 孙建德 一种基于三维卷积神经网络的视频分类方法
CN105550699A (zh) * 2015-12-08 2016-05-04 北京工业大学 一种基于cnn融合时空显著信息的视频识别分类方法
CN105740773A (zh) * 2016-01-25 2016-07-06 重庆理工大学 基于深度学习和多尺度信息的行为识别方法
CN106599789A (zh) * 2016-07-29 2017-04-26 北京市商汤科技开发有限公司 视频类别识别方法和装置、数据处理装置和电子设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129691B (zh) * 2011-03-22 2014-06-18 北京航空航天大学 一种采用Snake轮廓模型的视频对象跟踪分割方法
CN102289795B (zh) * 2011-07-29 2013-05-22 上海交通大学 基于融合思想的视频时空联合增强方法
US8917948B2 (en) * 2011-09-16 2014-12-23 Adobe Systems Incorporated High-quality denoising of an image sequence

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN103218831A (zh) * 2013-04-21 2013-07-24 北京航空航天大学 一种基于轮廓约束的视频运动目标分类识别方法
CN104217214A (zh) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 基于可配置卷积神经网络的rgb-d人物行为识别方法
CN104966104A (zh) * 2015-06-30 2015-10-07 孙建德 一种基于三维卷积神经网络的视频分类方法
CN105550699A (zh) * 2015-12-08 2016-05-04 北京工业大学 一种基于cnn融合时空显著信息的视频识别分类方法
CN105740773A (zh) * 2016-01-25 2016-07-06 重庆理工大学 基于深度学习和多尺度信息的行为识别方法
CN106599789A (zh) * 2016-07-29 2017-04-26 北京市商汤科技开发有限公司 视频类别识别方法和装置、数据处理装置和电子设备

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120932A (zh) * 2018-07-12 2019-01-01 东华大学 Hevc压缩域双svm模型的视频显著性预测方法
CN109120932B (zh) * 2018-07-12 2021-10-26 东华大学 Hevc压缩域双svm模型的视频显著性预测方法
CN111050219A (zh) * 2018-10-12 2020-04-21 奥多比公司 用于定位视频内容中的目标对象的空间-时间记忆网络
CN111753574A (zh) * 2019-03-26 2020-10-09 顺丰科技有限公司 抛扔区域定位方法、装置、设备及存储介质
CN112307821A (zh) * 2019-07-29 2021-02-02 顺丰科技有限公司 一种视频流处理方法、装置、设备及存储介质
CN112528780B (zh) * 2019-12-06 2023-11-21 百度(美国)有限责任公司 通过混合时域自适应的视频动作分割
CN112528780A (zh) * 2019-12-06 2021-03-19 百度(美国)有限责任公司 通过混合时域自适应的视频动作分割
CN111027482B (zh) * 2019-12-10 2023-04-14 浩云科技股份有限公司 基于运动向量分段分析的行为分析方法及装置
CN111027482A (zh) * 2019-12-10 2020-04-17 浩云科技股份有限公司 基于运动向量分段分析的行为分析方法及装置
CN111104553B (zh) * 2020-01-07 2023-12-12 中国科学院自动化研究所 一种高效运动互补神经网络系统
CN111104553A (zh) * 2020-01-07 2020-05-05 中国科学院自动化研究所 一种高效运动互补神经网络系统
CN111783713A (zh) * 2020-07-09 2020-10-16 中国科学院自动化研究所 基于关系原型网络的弱监督时序行为定位方法及装置
CN111783713B (zh) * 2020-07-09 2022-12-02 中国科学院自动化研究所 基于关系原型网络的弱监督时序行为定位方法及装置
CN113395542A (zh) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 基于人工智能的视频生成方法、装置、计算机设备及介质
CN113395542B (zh) * 2020-10-26 2022-11-08 腾讯科技(深圳)有限公司 基于人工智能的视频生成方法、装置、计算机设备及介质
CN114756115A (zh) * 2020-12-28 2022-07-15 阿里巴巴集团控股有限公司 交互控制方法、装置及设备
CN112580589A (zh) * 2020-12-28 2021-03-30 国网上海市电力公司 基于双流法考虑非均衡数据的行为识别方法、介质及设备
CN112731359A (zh) * 2020-12-31 2021-04-30 无锡祥生医疗科技股份有限公司 超声探头的速度确定方法、装置及存储介质
CN112731359B (zh) * 2020-12-31 2024-04-09 无锡祥生医疗科技股份有限公司 超声探头的速度确定方法、装置及存储介质
CN113128354B (zh) * 2021-03-26 2022-07-19 中山大学中山眼科中心 一种洗手质量检测方法及装置
CN113128354A (zh) * 2021-03-26 2021-07-16 中山大学中山眼科中心 一种洗手质量检测方法及装置
CN112926549B (zh) * 2021-04-15 2022-06-24 华中科技大学 基于时间域-空间域特征联合增强的步态识别方法与系统
CN112926549A (zh) * 2021-04-15 2021-06-08 华中科技大学 基于时间域-空间域特征联合增强的步态识别方法与系统
CN114373194A (zh) * 2022-01-14 2022-04-19 南京邮电大学 基于关键帧与注意力机制的人体行为识别方法

Also Published As

Publication number Publication date
CN106599789A (zh) 2017-04-26
CN106599789B (zh) 2019-10-11

Similar Documents

Publication Publication Date Title
WO2018019126A1 (zh) 视频类别识别方法和装置、数据处理装置和电子设备
WO2018192570A1 (zh) 时域动作检测方法和系统、电子设备、计算机存储介质
US11436739B2 (en) Method, apparatus, and storage medium for processing video image
US11455782B2 (en) Target detection method and apparatus, training method, electronic device and medium
US11227147B2 (en) Face image processing methods and apparatuses, and electronic devices
CN108898186B (zh) 用于提取图像的方法和装置
CN108229296B (zh) 人脸皮肤属性识别方法和装置、电子设备、存储介质
CN107578017B (zh) 用于生成图像的方法和装置
CN108446390B (zh) 用于推送信息的方法和装置
WO2018177379A1 (zh) 手势识别、控制及神经网络训练方法、装置及电子设备
WO2018121737A1 (zh) 关键点预测、网络训练及图像处理方法和装置、电子设备
CN110431560B (zh) 目标人物的搜索方法和装置、设备和介质
WO2018099473A1 (zh) 场景分析方法和系统、电子设备
WO2019001481A1 (zh) 车辆外观特征识别及车辆检索方法、装置、存储介质、电子设备
CN108230291B (zh) 物体识别系统训练方法、物体识别方法、装置和电子设备
WO2018054329A1 (zh) 物体检测方法和装置、电子设备、计算机程序和存储介质
US20170161591A1 (en) System and method for deep-learning based object tracking
US10643063B2 (en) Feature matching with a subspace spanned by multiple representative feature vectors
US20220147735A1 (en) Face-aware person re-identification system
CN108491872B (zh) 目标再识别方法和装置、电子设备、程序和存储介质
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
CN113971751A (zh) 训练特征提取模型、检测相似图像的方法和装置
WO2019029459A1 (zh) 用于识别面部年龄的方法、装置和电子设备
CN108229494B (zh) 网络训练方法、处理方法、装置、存储介质和电子设备
WO2019241346A1 (en) Visual tracking by colorization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17833429

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17833429

Country of ref document: EP

Kind code of ref document: A1