WO2021093468A1 - Video classification method, model training method, apparatus, device and storage medium - Google Patents

Video classification method, model training method, apparatus, device and storage medium

Info

Publication number
WO2021093468A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
feature
image frame
image frames
feature information
Prior art date
Application number
PCT/CN2020/117358
Other languages
English (en)
French (fr)
Inventor
李岩
史欣田
纪彬
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP20886816.6A priority Critical patent/EP3989111A4/en
Publication of WO2021093468A1 publication Critical patent/WO2021093468A1/zh
Priority to US17/515,164 priority patent/US11967151B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • the embodiments of the present application relate to the field of computer vision technology, and in particular to a video classification method, model training method, device, equipment, and storage medium.
  • a corresponding video classification tag is usually set for each video.
  • In the related art, the video classification label is set as follows: the video is divided equally into multiple sub-videos; one image frame is extracted from each sub-video to obtain multiple image frames; a 3D (three-dimensional) convolution operation is performed on the multiple image frames in the time dimension, so that each image frame fuses the feature information of the other image frames; and the video classification label of the video is determined according to the feature information of each image frame.
  • the embodiments of the present application provide a video classification method, model training method, device, equipment, and storage medium, which shorten the time for finally obtaining the classification result of the video.
  • the technical solutions are as follows:
  • an embodiment of the present application provides a video classification method, which is applied to a computer device, and the method includes:
  • extracting, through a feature extraction network according to a learned feature fusion strategy, the feature information of each of the n image frames; wherein the feature fusion strategy is used to indicate, when a first image frame among the n image frames fuses the feature information of other image frames among the n image frames, the proportion of the feature information of each image frame;
  • determining the classification result of the video according to the feature information of each of the n image frames.
  • an embodiment of the present application provides a method for training a video classification model, which is applied to a computer device, and the method includes:
  • extracting, through a feature extraction network in the video classification model according to a feature fusion strategy, the feature information of each of the n sample image frames; wherein the feature fusion strategy is used to indicate, when a first sample image frame among the n sample image frames fuses the feature information of other sample image frames among the n sample image frames, the proportion of the feature information of each sample image frame;
  • an embodiment of the present application provides a video classification device, and the device includes:
  • a video acquisition module, configured to acquire a video;
  • an image frame selection module, configured to select n image frames from the video, where n is a positive integer;
  • a feature extraction module, configured to extract the feature information of each of the n image frames through a feature extraction network according to a learned feature fusion strategy; wherein the feature fusion strategy is used to indicate, when a first image frame among the n image frames fuses the feature information of other image frames among the n image frames, the proportion of the feature information of each image frame;
  • a video classification module, configured to determine the classification result of the video according to the feature information of each of the n image frames.
  • an embodiment of the present application provides a video classification model training device, the device includes:
  • a data acquisition module for acquiring training data of a video classification model, where the training data includes at least one sample video
  • An image frame selection module for selecting n sample image frames from the sample video, where n is a positive integer
  • a feature extraction module, configured to extract the feature information of each of the n sample image frames through a feature extraction network in the video classification model according to a feature fusion strategy; wherein the feature fusion strategy is used to indicate, when a first sample image frame among the n sample image frames fuses the feature information of other sample image frames among the n sample image frames, the proportion of the feature information of each sample image frame;
  • a video classification module configured to determine the predicted classification result of the sample video according to the respective feature information of the n sample image frames
  • the model training module is used to train the video classification model according to the predicted classification result and the standard classification result of the sample video.
  • an embodiment of the present application provides a computer device.
  • the computer device includes a processor and a memory.
  • the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the video classification method described above, or the training method of the video classification model described above.
  • an embodiment of the present application provides a computer-readable storage medium that stores at least one instruction, at least one program, a code set, or an instruction set; the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the video classification method described above, or the training method of the video classification model described above.
  • In the technical solutions provided by the embodiments of the present application, the feature information of the image frames is extracted according to the learned feature fusion strategy; the feature fusion strategy indicates the proportion of the feature information of each image frame when each image frame fuses the feature information of other image frames; and the classification result of the video is determined according to the feature information of the image frames.
  • The feature fusion strategy only needs to achieve simple information fusion between adjacent image frames, instead of convolving in both the spatial and temporal dimensions as 3D convolution does; it replaces complex, repeated 3D convolution operations with simple feature information fusion, so the workload is small, and the final classification result of the video is obtained in a shorter time and with higher efficiency.
  • Figure 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a video classification method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of an offset strategy in the related art.
  • Figure 4 is a schematic diagram of a residual structure provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a video classification method provided by another embodiment of the present application.
  • FIG. 6 is a schematic diagram of image frame selection provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a local image frame enhancement strategy provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a video classification method in the related art.
  • FIG. 9 is a flowchart of a method for training a video classification model provided by an embodiment of the present application.
  • FIG. 10 is a block diagram of a video classification device provided by an embodiment of the present application.
  • FIG. 11 is a block diagram of a video classification device provided by another embodiment of the present application.
  • FIG. 12 is a block diagram of a training device for a video classification model provided by an embodiment of the present application.
  • Fig. 13 is a block diagram of a training device for a video classification model provided by an embodiment of the present application.
  • Fig. 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer Vision (CV) is a science that studies how to make machines "see"; that is, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs graphics processing so that the processed images become more suitable for human eyes to observe or for transmission to instruments for detection.
  • Computer vision studies related theories and technologies trying to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping Construction and other technologies also include common face recognition, fingerprint recognition and other biometric recognition technologies.
  • Natural language processing (Nature Language Processing, NLP) is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. Specializing in the study of how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve its own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • FIG. 1 shows a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment may include: a terminal 10 and a server 20.
  • a client is installed and running in the terminal 10, and the client refers to a client for uploading videos, for example, a video client.
  • the terminal 10 may be an electronic device such as a mobile phone, a tablet computer, a wearable device, and a PC (Personal Computer, personal computer).
  • the server 20 may be one server, a server cluster composed of multiple servers, or a cloud computing center.
  • the server 20 may communicate with the terminal 10 through a wired or wireless network.
  • the server 20 may obtain the video uploaded by the terminal 10, classify the video, determine the classification result of the video, and send the determined classification result to the client in the terminal 10 for display, and perform video recommendation based on the classification result;
  • The server 20 can also review and filter the videos uploaded by the terminal 10 to determine whether a video uploaded by the terminal 10 is a bad video.
  • Bad videos can be violent videos, pornographic or vulgar videos, and the like; if the server 20 determines that the video uploaded by the terminal 10 is a bad video, it confirms that the review is not passed, filters the video, and feeds back a message that the review is not passed to the terminal 10.
  • FIG. 2 shows a flowchart of a video classification method provided by an embodiment of the present application.
  • the execution subject of the method may be a computer device, and the computer device may be any electronic device with computing and processing capabilities, such as a terminal or a server as described in FIG. 1.
  • the method can include the following steps.
  • Step 201 Obtain a video.
  • the video can be any video.
  • the video may be a video uploaded by a terminal, for example, the video may be a video uploaded by a terminal with a video client installed.
  • Step 202 Select n image frames from the video, where n is a positive integer.
  • a video is a collection of multiple image frames arranged in an orderly manner.
  • the computer device selects n image frames from the video through frame extraction, where n is the total number of frames obtained after the video frame extraction.
  • step 203 the feature information of each of the n image frames is extracted according to the learned feature fusion strategy through the feature extraction network.
  • the feature extraction network is used to extract feature information of n image frames.
  • In some embodiments, the feature extraction network is used to fuse the feature information between each image frame and other image frames according to the feature fusion strategy to obtain the fused feature information of each image frame, then process the fused feature information through the network structure, and output the final feature information of each image frame.
  • the feature fusion strategy is used to indicate the proportion of the feature information of each image frame when the first image frame among the n image frames merges the feature information of other image frames in the n image frames.
  • the first image frame is any one of the n image frames, and the other image frames are all or part of the n image frames except the first image frame.
  • In some embodiments, the other image frames are image frames adjacent to the first image frame. For example, there are 5 image frames: image frame 1, image frame 2, image frame 3, image frame 4, and image frame 5; if the first image frame is image frame 2, the other image frames can be image frame 1 and image frame 3.
  • the feature information of the image frame 2 extracted according to the feature fusion strategy is fused with the feature information of the image frame 1 and the image frame 3.
  • The proportion of the feature information of each image frame, used when the first image frame indicated by the feature fusion strategy fuses the feature information of other image frames among the n image frames, is obtained through learning and is not a fixed pattern.
  • In some embodiments, the feature information of an image frame can be represented by the gray values of the pixels in the image frame. For example, the gray value of a certain pixel of image frame 2 is 2 and the gray value of the corresponding pixel of image frame 3 is 3, and the learned feature fusion strategy instructs image frame 2 to fuse the feature information of image frame 1 and image frame 3, where the proportion of image frame 1 is 0.2, the proportion of image frame 2 is 0.4, and the proportion of image frame 3 is 0.4; the fused gray value of that pixel in image frame 2 is then the correspondingly weighted sum of the gray values of that pixel in image frames 1, 2, and 3.
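  • The weighted fusion in the example above can be written out directly, as in the following minimal sketch. Only the proportions (0.2, 0.4, 0.4) and the gray values 2 and 3 come from the example text; the gray values of image frame 1 and of the second pixel are assumed purely for illustration.

```python
# Purely illustrative sketch of the proportion-weighted fusion described above.
# frame2 is the "first image frame"; frames 1 and 3 are its adjacent frames.
frame1_feature = [1.0, 5.0]      # hypothetical gray values of two pixels in image frame 1
frame2_feature = [2.0, 6.0]      # gray values of the same pixels in image frame 2
frame3_feature = [3.0, 7.0]      # gray values of the same pixels in image frame 3

proportions = [0.2, 0.4, 0.4]    # learned proportions for image frames 1, 2, 3

fused_frame2 = [
    proportions[0] * g1 + proportions[1] * g2 + proportions[2] * g3
    for g1, g2, g3 in zip(frame1_feature, frame2_feature, frame3_feature)
]
print(fused_frame2)              # [2.2, 6.2]
```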
  • The feature fusion strategy is simple in design and highly effective, and can be embedded into existing feature extraction networks; extracting the feature information of image frames through the feature fusion strategy enables flexible information exchange and information fusion with other image frames.
  • the feature fusion strategy replaces complex and repetitive 3D convolution operations through simple feature information fusion, and the workload is small.
  • the exchange of feature information between image frames and image frames is dynamic and not a fixed mode.
  • the feature fusion strategy provided by the embodiments of the present application can automatically learn a suitable information exchange mode more effectively.
  • Step 204 Determine the classification result of the video according to the respective feature information of the n image frames.
  • the feature information of the n image frames can represent the feature information of the video, and the classification result of the video can be determined according to the respective feature information of the n image frames.
  • the classification result of the video is used to characterize the classification of the video, for example, the video is a violent video, a pornographic video, an animated video, a science fiction video, and so on.
  • In the technical solution provided by the embodiments of the present application, the feature information of the image frames is extracted according to the learned feature fusion strategy; the feature fusion strategy indicates the proportion of the feature information of each image frame when each image frame fuses the feature information of other image frames; and the classification result of the video is determined according to the feature information of the image frames.
  • The feature fusion strategy only needs to achieve simple information fusion between adjacent image frames, instead of convolving in both the spatial and temporal dimensions as 3D convolution does; it replaces complex, repeated 3D convolution operations with simple feature information fusion, so the workload is small, and the final classification result of the video is obtained in a shorter time and with higher efficiency.
  • the feature fusion strategy in the embodiment of the present application is obtained based on learning, which can perform feature information fusion more efficiently and flexibly.
  • the feature extraction network includes m cascaded network structures, and m is a positive integer.
  • In some embodiments, the computer device can extract the feature information of an image frame in the following manner: for the k-th network structure among the m network structures, feature fusion processing is performed on the first feature information of the first image frame according to the feature fusion strategy to obtain processed first feature information.
  • The processed first feature information is fused with the feature information of the first image frame and the other image frames.
  • the first feature information includes features of c channels, and c is a positive integer.
  • the computer device may obtain the processed first characteristic information through the following sub-steps:
  • the convolution kernel is used to define the feature fusion strategy corresponding to the feature of the i-th channel in the first image frame.
  • the feature fusion strategies corresponding to the features of the different channels in the first image frame may be different, that is, the convolution kernels corresponding to the different channels in the first image frame are different.
  • the feature of the i-th channel in the first image frame may be represented by the gray value of the pixel of the i-th channel in the first image frame.
  • For example, there are 5 image frames: image frame 1, image frame 2, image frame 3, image frame 4, and image frame 5.
  • The first image frame is image frame 2, and the other image frames are image frame 1 and image frame 3.
  • Suppose the size of the convolution kernel is 3 and its parameters are [0.2, 0.4, 0.4], the gray value of the pixel in the i-th channel of image frame 1 is 3, the gray value of the pixel in the i-th channel of image frame 2 is 2, and the gray value of the pixel in the i-th channel of image frame 3 is 4; then the features of the i-th channel of image frames 1, 2, and 3 are convolved with this kernel to obtain the processed feature of the i-th channel in image frame 2, i.e., 0.2×3 + 0.4×2 + 0.4×4 = 3.
  • the processed first feature information is obtained according to the processed features of the 256 channels in the first image frame.
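  • The per-channel convolution described above can be sketched as follows. This is only an illustration using the example numbers (a size-3 kernel [0.2, 0.4, 0.4] and i-th channel gray values 3, 2, 4 for image frames 1, 2, and 3); in the application each of the c channels has its own learned kernel parameters, and numpy is used here purely for demonstration.

```python
import numpy as np

c = 256                                       # number of channels in the example
# One size-3 kernel per channel; here every kernel is tied to the example values,
# whereas in practice the parameters are learned and may differ per channel.
kernels = np.tile([0.2, 0.4, 0.4], (c, 1))

# i-th channel features of image frame 1, image frame 2 (first image frame), image frame 3.
channel_i_feats = np.array([3.0, 2.0, 4.0])

i = 0                                         # any channel index
processed_frame2_channel_i = float(kernels[i] @ channel_i_feats)
print(processed_frame2_channel_i)             # 0.2*3 + 0.4*2 + 0.4*4 = 3.0
```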
  • FIG. 3 shows a schematic diagram of the offset strategy in the related art.
  • It depicts the respective feature information of 4 image frames (image frame 1, image frame 2, image frame 3, and image frame 4), where the feature information of each image frame includes the features of 6 channels (represented by c).
  • The offset strategy in the related art can be regarded as using a convolution kernel of size 3 with fixed parameters to convolve the feature of the i-th channel in the first image frame with the features of the i-th channel in adjacent image frames.
  • The convolution kernel [0, 0, 1] is used to define that the feature of a channel in an image frame moves in the opposite direction along the time dimension, and the convolution kernel [1, 0, 0] is used to define that the feature of a channel in an image frame moves in the positive direction along the time dimension.
  • As a result, the feature information of image frame 2 is fused with the feature information of image frame 1 and image frame 3, and the feature information of image frame 3 is fused with the feature information of image frame 2 and image frame 4.
  • The offset strategy in the related art is too rigid: the mode of information exchange between image frames is fixed, and in the new feature information obtained after fusing adjacent image frames, the proportions of the original features and of the offset features are also fixed.
  • Therefore, the strategies in the related art are not flexible enough.
  • the convolution kernel parameters of the convolution kernel provided in the embodiment of the present application are not fixed and are obtained through learning.
  • the feature fusion strategy provided in the embodiment of the present application is more flexible.
  • the processed first feature information is processed through the k-th network structure to generate the second feature information of the first image frame.
  • the network structure is used to perform spatial convolution processing on the first feature information.
  • the feature fusion processing and network structure corresponding to the feature fusion strategy constitute a residual structure.
  • the kth network structure is any one of the m network structures. As shown in FIG. 4, it shows a schematic diagram of the residual structure provided by an embodiment of the present application.
  • the residual structure includes feature fusion processing and a network structure.
  • The network structure may include a spatial 1×1 convolution, a spatial 3×3 convolution, and a spatial 1×1 convolution.
  • The input of the residual structure is the first feature information.
  • Feature fusion processing is first performed on the first feature information according to the feature fusion strategy to obtain the processed first feature information; the processed first feature information is then passed through the spatial 1×1 convolution, the spatial 3×3 convolution, and the spatial 1×1 convolution to obtain the convolved first feature information; and the convolved first feature information and the first feature information are added to obtain the second feature information.
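  • A minimal PyTorch-style sketch of such a residual structure is given below, under the assumption that the learned feature fusion is implemented as a per-channel (depthwise) convolution of size 3 along the time dimension; the class name, channel sizes, and activation choices are illustrative and are not taken from the application.

```python
import torch
import torch.nn as nn

class FusionResidualBlock(nn.Module):
    def __init__(self, channels: int, mid_channels: int, n_frames: int):
        super().__init__()
        self.n = n_frames
        # One size-3 kernel per channel, learned during training (feature fusion strategy).
        self.temporal_fusion = nn.Conv1d(channels, channels, kernel_size=3,
                                         padding=1, groups=channels, bias=False)
        # Spatial 1x1 -> 3x3 -> 1x1 convolutions (the "network structure").
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1),
        )

    def forward(self, x):                       # x: (N*n, c, H, W) first feature information
        nn_, c, h, w = x.shape
        batch = nn_ // self.n
        # Reshape so the time dimension is last, then fuse adjacent frames per channel.
        t = x.view(batch, self.n, c, h, w).permute(0, 3, 4, 2, 1)    # (N, H, W, c, n)
        t = t.reshape(batch * h * w, c, self.n)                      # (N*H*W, c, n)
        t = self.temporal_fusion(t)
        t = t.reshape(batch, h, w, c, self.n).permute(0, 4, 3, 1, 2) # back to (N, n, c, H, W)
        fused = t.reshape(nn_, c, h, w)                              # processed first feature info
        # Residual addition: convolved features + original first feature information.
        return self.spatial(fused) + x                               # second feature information
```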
  • the feature extraction network may be a residual network that includes multiple cascaded network structures.
  • In one possible implementation, before inputting the feature information of an image frame into each network structure, the computer device performs feature fusion processing on the feature information according to the feature fusion strategy. In another possible implementation, feature fusion processing is performed on the feature information only before it is input into some of the network structures.
  • the second feature information is the feature information of the first image frame output by the feature extraction network, or the intermediate feature information of the first image frame generated by the feature extraction network.
  • the second feature information is the intermediate feature information of the first image frame generated by the feature extraction network
  • the intermediate feature information is processed through other network structures after the kth network structure to obtain the feature information of the first image frame.
  • the feature information of the image frame is processed by using the learned convolution kernel to perform feature fusion processing, which is simple in operation and small in workload.
  • the video classification method provided in the embodiment of the present application may further include the following steps:
  • Step 501 Obtain a video.
  • Step 502 Extract image frames from the video according to the preset frame rate to obtain a sequence of video frames.
  • the preset frame rate can be 24 frames per second, the preset frame rate can be the default frame rate, or the frame rate set by the researcher according to actual needs.
  • Step 503 Divide the video frame sequence equally into n subsequences.
  • The length of each subsequence is 1/n of the video frame sequence.
  • For example, n can be 8, 16, or 24.
  • The embodiment of the present application does not limit the value of n. In practical applications, n is generally chosen to be a multiple of 2.
  • Step 504 Extract one image frame from each of the n sub-sequences to obtain n image frames.
  • the above steps 502 to 504 are to select image frames from the video by adopting an image frame extraction strategy based on sparse sampling.
  • the feature information of each image frame represents the feature information of a sub-sequence, and a video frame sequence of any length is converted into n image frames that cover the entire video as much as possible so as to retain time information as much as possible.
  • In some embodiments, the computer device can randomly extract one image frame from each subsequence to obtain n image frames; the computer device can also select the image frame at a fixed position in each subsequence (for example, the first image frame or the last image frame of each subsequence). The embodiment of the present application does not limit the manner of extracting an image frame from a subsequence. Exemplarily, the image frame selection is shown in FIG. 6.
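  • A minimal sketch of this sparse-sampling selection (steps 502 to 504) is shown below; the function name and the fixed-position example (always taking the first frame of each subsequence) are assumptions for illustration.

```python
import random
from typing import List

def sparse_sample(num_frames: int, n: int, random_choice: bool = True) -> List[int]:
    """Split the frame sequence into n equal subsequences and pick one index from each."""
    seg_len = num_frames // n                     # length of each subsequence (1/n of sequence)
    indices = []
    for seg in range(n):
        start = seg * seg_len
        if random_choice:
            indices.append(start + random.randrange(seg_len))   # random frame in subsequence
        else:
            indices.append(start)                               # fixed position: first frame
    return indices

# Example: a 240-frame sequence (e.g. 10 s at the preset 24 fps) sampled into n = 8 frames.
print(sparse_sample(240, 8, random_choice=False))   # [0, 30, 60, 90, 120, 150, 180, 210]
```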
  • step 505 the feature information of each of the n image frames is extracted according to the learned feature fusion strategy through the feature extraction network.
  • the feature fusion strategy is used to indicate the proportion of the feature information of each image frame when the first image frame among the n image frames merges the feature information of other image frames in the n image frames.
  • Step 506 Obtain n classification results corresponding to the n image frames according to the respective characteristic information of the n image frames.
  • the computer device can obtain the classification result corresponding to each image frame in the following manner:
  • the j-th classifier among the n classifiers obtains the classification result corresponding to the j-th image frame, where j is a positive integer less than or equal to n.
  • Step 507 Determine the classification result of the video according to the n classification results.
  • In one possible implementation, the weighted sum of the n classification results, i.e., the sum of each classification result multiplied by its corresponding weight, is determined as the classification result of the video.
  • In another possible implementation, the n classification results can be averaged, and the average is used as the classification result of the video.
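  • The per-frame classifiers and their aggregation (steps 506 and 507) can be sketched as follows; the module name, the linear classifiers, and the learnable per-frame weights are illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn

class PerFrameClassification(nn.Module):
    def __init__(self, n_frames: int, feat_dim: int, num_classes: int):
        super().__init__()
        # The j-th classifier handles the feature information of the j-th image frame.
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(n_frames)])
        # Per-frame weights for the weighted-sum variant (could also be fixed to 1/n).
        self.frame_weights = nn.Parameter(torch.full((n_frames,), 1.0 / n_frames))

    def forward(self, frame_feats):               # frame_feats: (n, feat_dim)
        per_frame = torch.stack([clf(f) for clf, f in zip(self.classifiers, frame_feats)])
        # Weighted sum over the n per-frame classification results -> video-level result.
        return (self.frame_weights.unsqueeze(1) * per_frame).sum(dim=0)

# Usage: video_scores = PerFrameClassification(8, 2048, 400)(torch.randn(8, 2048))
```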
  • The strategy corresponding to the above step 506 and step 507 may be referred to as a local image frame enhancement strategy.
  • Compared with the related art shown in FIG. 8, in which the feature information of the n image frames is simply averaged, the local image frame enhancement strategy enhances the feature expression capability of each image frame, especially of edge image frames (image frames at the beginning and end of the video).
  • In the embodiment of the present application, the video classification task is also imposed on local features, so that local features must be mined with sufficient discriminative power; this enhances the expressive ability of local features, especially of the feature information of edge image frames, which in turn makes the final video classification result more accurate.
  • Although each image frame can fuse the feature information of multiple other image frames to obtain information over a longer time range, edge image frames still capture large-range temporal information insufficiently; they are inadequate in modeling the feature information of the video, that is, their ability to express the video is insufficient.
  • If, when the classification result is finally predicted, the feature information of edge image frames is treated the same as that of other image frames, the averaged video expression ability will be dragged down by the feature information of the edge image frames, and the ultimate video modeling capability will be affected.
  • In the embodiment of the present application, a separate classifier is used for each individual image frame to perform the recognition task, forcing edge image frames to mine more useful information when their information is insufficient, and thereby enhancing the information expression ability of these edge image frames.
  • FIG. 9 shows a flowchart of a method for training a video classification model provided by an embodiment of the present application.
  • The execution subject of the method may be a computer device, such as the server or the terminal introduced in FIG. 1.
  • the method can include the following steps.
  • Step 901 Obtain training data of a video classification model.
  • the video classification model is used to determine the classification result of the video.
  • the training data includes at least one sample video.
  • the classification results of the sample videos included in the training data are consistent.
  • Step 902 Select n sample image frames from the sample video, where n is a positive integer.
  • For details of step 902, please refer to the description of steps 502 to 504 above, which is not repeated here.
  • step 903 the feature information of each of the n sample image frames is extracted according to the feature fusion strategy through the feature extraction network in the video classification model.
  • the feature fusion strategy is used to indicate the feature of each sample image frame when the first sample image frame among the n sample image frames is fused with the feature information of other sample image frames in the n sample image frames. The proportion of information.
  • For example, each iteration of training includes N video frame sequences, and n image frames are selected from each video frame sequence.
  • The size of each image frame is H×W, where H represents the height of the image frame and W represents the width of the image frame.
  • The feature information includes the features of c channels, that is, the number of feature channels is c.
  • For such a batch, the input X of a residual structure in the video classification model has the size (N·n)×c×H×W.
  • The process of feature fusion processing is as follows: first, the input X is reshaped so that its form becomes (N×H×W)×c×n, with the time dimension n as the last dimension.
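  • The reshape described above can be checked with a short shape-only sketch; the concrete sizes (N = 2, n = 8, c = 64, H = W = 56) are assumed for illustration, and the tensor values are arbitrary.

```python
import torch

N, n, c, H, W = 2, 8, 64, 56, 56
X = torch.randn(N * n, c, H, W)                    # residual-structure input, (N*n) x c x H x W

X5 = X.view(N, n, c, H, W)                         # separate batch and time dimensions
X5 = X5.permute(0, 3, 4, 2, 1).contiguous()        # (N, H, W, c, n)
X_fuse = X5.view(N * H * W, c, n)                  # (N*H*W) x c x n, ready for per-channel temporal conv

print(X.shape, '->', X_fuse.shape)                 # torch.Size([16, 64, 56, 56]) -> torch.Size([6272, 64, 8])
```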
  • Step 904 Determine the predictive classification result of the sample video according to the characteristic information of each of the n sample image frames.
  • For the introduction and description of step 904, refer to the above embodiments, which is not repeated here.
  • Step 905 Train the video classification model according to the predicted classification result and the standard classification result of the sample video.
  • the predicted classification result is used to characterize the classification of the video predicted by the video classification model, and the standard classification result of the sample video may be a manually labeled classification result.
  • In a possible implementation, the video classification model can be trained according to the distance between the predicted classification result and the standard classification result.
  • For example, the video classification model is trained according to the cosine distance, Euclidean distance, Manhattan distance, or another distance between the predicted classification result and the standard classification result.
  • The smaller the distance between the predicted classification result and the standard classification result, the more accurate the video classification model; when a stop-training condition is met, the training of the video classification model is stopped.
  • the computer device may calculate the loss function value corresponding to the video classification model according to the predicted classification result and the standard classification result, and perform training according to the loss function value corresponding to the video classification model.
  • The loss function value is used to characterize the degree of inconsistency between the predicted classification result and the standard classification result. If the loss function value is small, it indicates that the predicted classification result is very close to the standard classification result and the performance of the video classification model is good; if the loss function value is large, it indicates that the predicted classification result is far from the standard classification result and the performance of the video classification model is poor.
  • the computer device can adjust the feature fusion strategy according to the value of the loss function.
  • the parameters of the convolution kernel are adjusted according to the loss function value.
  • the convolution kernel is used to define a feature fusion strategy corresponding to the feature of the i-th channel in the first sample image frame, and i is a positive integer.
  • The feature fusion strategy is adjusted to realize the training of the video classification model, and multiple rounds of adjustment can be performed.
  • When the first stop-training condition is met, the training of the feature fusion strategy is stopped.
  • In some embodiments, the first stop-training condition may include any one of the following: when the loss function value meets a preset threshold, the training of the feature fusion strategy is stopped; or, when the number of training iterations reaches a preset number, for example 100,000, the training of the feature fusion strategy is stopped; or, when the difference between the loss function value calculated in the (k+1)-th round and the loss function value calculated in the k-th round is less than a preset difference, for example less than 10⁻⁹, the training of the feature fusion strategy is stopped.
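  • A minimal sketch of this training procedure and the stop-training conditions is given below, assuming a cross-entropy loss and an SGD optimizer; the threshold, iteration cap, and difference bound are placeholders standing in for the preset values mentioned above.

```python
import torch
import torch.nn as nn

def train(model, data_loader, max_steps=100_000, loss_threshold=1e-3, eps=1e-9):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()          # compares predicted vs. standard classification
    prev_loss = None
    for step, (sample_frames, standard_label) in enumerate(data_loader):
        predicted = model(sample_frames)       # predicted classification result
        loss = criterion(predicted, standard_label)
        optimizer.zero_grad()
        loss.backward()                        # adjusts fusion-kernel and classifier parameters
        optimizer.step()
        # Stop-training conditions described in the application:
        if loss.item() <= loss_threshold:                          # loss meets preset threshold
            break
        if step + 1 >= max_steps:                                  # iteration count reaches preset number
            break
        if prev_loss is not None and abs(prev_loss - loss.item()) < eps:
            break                                                  # successive losses differ by < preset difference
        prev_loss = loss.item()
    return model
```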
  • the video classification model further includes n classifiers.
  • In some embodiments, the computer device can adjust the parameters of the n classifiers according to the loss function value, where the h-th classifier among the n classifiers is used to obtain the predicted classification result corresponding to the first sample image frame according to the feature information of the first sample image frame, and h is a positive integer less than or equal to n.
  • The parameters of the classifiers are adjusted to realize the training of the video classification model, and multiple rounds of adjustment can be performed.
  • When the second stop-training condition is met, the training of the parameters of the classifiers is stopped.
  • In some embodiments, the second stop-training condition may include any one of the following: when the loss function value meets a preset threshold, the training of the parameters of the classifiers is stopped; or, when the number of training iterations reaches a preset number, for example 100,000, the training of the parameters of the classifiers is stopped; or, when the difference between the loss function value calculated in the (k+1)-th round and the loss function value calculated in the k-th round is less than a preset difference, for example less than 10⁻⁹, the training of the parameters of the classifiers is stopped.
  • the server adjusts the feature fusion strategy and the classifier according to the loss function value.
  • When the first stop-training condition is met, the training of the feature fusion strategy is stopped; when the second stop-training condition is met, the training of the parameters of the classifiers is stopped.
  • the first stop training condition and the second stop training condition may be the same or different, which is not limited in the embodiment of the present application.
  • In the technical solution provided by the embodiments of the present application, the feature information of the image frames is extracted according to the learned feature fusion strategy; the feature fusion strategy indicates the proportion of the feature information of each image frame when each image frame fuses the feature information of other image frames; and the classification result of the video is determined according to the feature information of the image frames.
  • The feature fusion strategy only needs to achieve simple information fusion between adjacent image frames, instead of convolving in both the spatial and temporal dimensions as 3D convolution does; it replaces complex, repeated 3D convolution operations with simple feature information fusion, so the workload is small, and the final classification result of the video is obtained in a shorter time and with higher efficiency.
  • FIG. 10 shows a block diagram of a video classification device provided by an embodiment of the present application.
  • the device has the function of realizing the above example of the video classification method, and the function can be realized by hardware, or by hardware executing corresponding software.
  • the device can be a computer device, or it can be set in a computer device.
  • the device 1000 may include: a video acquisition module 1010, an image frame selection module 1020, a feature extraction module 1030, and a video classification module 1040.
  • the video acquisition module 1010 is used to acquire videos.
  • the image frame selection module 1020 is used to select n image frames from the video, where n is a positive integer.
  • The feature extraction module 1030 is configured to extract the respective feature information of the n image frames through the feature extraction network according to the learned feature fusion strategy; wherein the feature fusion strategy is used to indicate, when the first image frame among the n image frames fuses the feature information of other image frames among the n image frames, the proportion of the feature information of each image frame.
  • the video classification module 1040 is configured to determine the classification result of the video according to the respective characteristic information of the n image frames.
  • In the technical solution provided by the embodiments of the present application, the feature information of the image frames is extracted according to the learned feature fusion strategy; the feature fusion strategy indicates the proportion of the feature information of each image frame when each image frame fuses the feature information of other image frames; and the classification result of the video is determined according to the feature information of the image frames.
  • The feature fusion strategy only needs to achieve simple information fusion between adjacent image frames, instead of convolving in both the spatial and temporal dimensions as 3D convolution does; it replaces complex, repeated 3D convolution operations with simple feature information fusion, so the workload is small, and the final classification result of the video is obtained in a shorter time and with higher efficiency.
  • the feature extraction network includes m cascaded network structures, and the m is a positive integer
  • the feature extraction module 1030 is used for:
  • performing, for the k-th network structure among the m network structures, feature fusion processing on the first feature information of the first image frame according to the feature fusion strategy to obtain processed first feature information, wherein the processed first feature information is fused with the feature information of the first image frame and the other image frames, and k is a positive integer less than or equal to m; and processing the processed first feature information through the k-th network structure to generate second feature information of the first image frame;
  • the second feature information is feature information of the first image frame output by the feature extraction network, or intermediate feature information of the first image frame generated by the feature extraction network.
  • the first characteristic information includes the characteristics of c channels, and the c is a positive integer
  • the feature extraction module 1030 is used for:
  • performing, using a learned convolution kernel, a convolution operation on the feature of the i-th channel in the first image frame and the features of the i-th channel in the other image frames, to obtain the processed feature of the i-th channel in the first image frame, where i is a positive integer less than or equal to c;
  • the convolution kernel is used to define a feature fusion strategy corresponding to the feature of the i-th channel in the first image frame.
  • the video classification module 1040 includes: a result obtaining unit 1041 and a video classification unit 1042.
  • the result obtaining unit 1041 is configured to obtain n classification results corresponding to the n image frames according to the respective characteristic information of the n image frames.
  • the video classification unit 1042 is configured to determine the classification result of the video according to the n classification results.
  • the result obtaining unit 1041 is configured to:
  • obtain, through the j-th classifier among the n classifiers, the classification result corresponding to the j-th image frame, where j is a positive integer less than or equal to n.
  • the video classification unit 1042 is configured to:
  • determine the weighted sum of the n classification results, i.e., the sum of each of the n classification results multiplied by its corresponding weight, as the classification result of the video.
  • The image frame selection module 1020 is configured to: extract image frames from the video according to a preset frame rate to obtain a video frame sequence; divide the video frame sequence equally into n subsequences; and extract one image frame from each of the n subsequences to obtain the n image frames.
  • FIG. 12 shows a block diagram of a training device for a video classification model provided by an embodiment of the present application.
  • the device has the function of realizing the example of the training method of the video classification model, and the function can be realized by hardware, or by hardware executing corresponding software.
  • the device can be a computer device, or it can be set in a computer device.
  • the device 1200 may include: a data acquisition module 1210, an image frame selection module 1220, a feature extraction module 1230, a video classification module 1240, and a model training module 1250.
  • the data acquisition module 1210 is configured to acquire training data of a video classification model, where the training data includes at least one sample video.
  • the image frame selection module 1220 is configured to select n sample image frames from the sample video, where n is a positive integer.
  • The feature extraction module 1230 is configured to extract the feature information of each of the n sample image frames through the feature extraction network in the video classification model according to a feature fusion strategy; wherein the feature fusion strategy is used to indicate, when the first sample image frame among the n sample image frames fuses the feature information of other sample image frames among the n sample image frames, the proportion of the feature information of each sample image frame.
  • the video classification module 1240 is configured to determine the prediction classification result of the sample video according to the characteristic information of each of the n sample image frames.
  • the model training module 1250 is configured to train the video classification model according to the predicted classification result and the standard classification result of the sample video.
  • In the technical solution provided by the embodiments of the present application, the feature information of the image frames is extracted according to the learned feature fusion strategy; the feature fusion strategy indicates the proportion of the feature information of each image frame when each image frame fuses the feature information of other image frames; and the classification result of the video is determined according to the feature information of the image frames.
  • The feature fusion strategy only needs to achieve simple information fusion between adjacent image frames, instead of convolving in both the spatial and temporal dimensions as 3D convolution does; it replaces complex, repeated 3D convolution operations with simple feature information fusion, so the workload is small, and the final classification result of the video is obtained in a shorter time and with higher efficiency.
  • the model training module 1250 includes: a function calculation unit 1251 and a strategy adjustment unit 1252.
  • the function calculation unit 1251 is configured to calculate a loss function value corresponding to the video classification model according to the predicted classification result and the standard classification result.
  • the strategy adjustment unit 1252 is configured to adjust the feature fusion strategy according to the loss function value.
  • the strategy adjustment unit 1252 is configured to:
  • the parameter of the convolution kernel is adjusted according to the loss function value.
  • the convolution kernel is used to define a feature fusion strategy corresponding to the feature of the i-th channel in the first sample image frame, and the i is a positive integer.
  • the video classification model further includes n classifiers
  • the model training module 1250 further includes: a classifier adjustment unit 1253.
  • The classifier adjustment unit 1253 is configured to adjust the parameters of the n classifiers according to the loss function value, where the h-th classifier among the n classifiers is used to obtain the predicted classification result corresponding to the first sample image frame according to the feature information of the first sample image frame, and h is a positive integer less than or equal to n.
  • FIG. 14 shows a schematic structural diagram of a computer device 1400 according to an embodiment of the present application.
  • the computer device 1400 can be used to implement the methods provided in the foregoing embodiments.
  • the computer device 1400 may be the terminal 10 or the server 20 introduced in the embodiment in FIG. 1. Specifically:
  • the computer equipment 1400 includes a central processing unit (CPU) 1401, a system memory 1404 including RAM (Random Access Memory) 1402 and ROM (Read-Only Memory) 1403, and A system bus 1405 connecting the system memory 1404 and the central processing unit 1401.
  • The computer device 1400 also includes a basic input/output system (I/O system) 1406 that helps transfer information between the various components within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
  • the basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409 such as a mouse and a keyboard for the user to input information.
  • the display 1408 and the input device 1409 are both connected to the central processing unit 1401 through the input and output controller 1410 connected to the system bus 1405.
  • the basic input/output system 1406 may also include an input and output controller 1410 for receiving and processing input from multiple other devices such as a keyboard, a mouse, or an electronic stylus.
  • the input and output controller 1410 also provides output to a display screen, a printer, or other types of output devices.
  • the mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405.
  • the mass storage device 1407 and its associated computer-readable medium provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
  • the computer-readable media may include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory (Flash Memory) or other solid-state storage technologies, CD-ROM, DVD (Digital Versatile Disc, Digital Versatile Disc) or other optical storage, tape cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • according to various embodiments of the present application, the computer device 1400 may also be connected to a remote computer on a network, such as the Internet, for operation. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405; in other words, the network interface unit 1411 may also be used to connect to other types of networks or remote computer systems (not shown).
  • the memory also includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by one or more processors.
  • the above-mentioned one or more programs include instructions for implementing the above-mentioned video classification method or the above-mentioned training method for the video classification model.
  • in an exemplary embodiment, a computer device is also provided. The computer device includes a processor and a memory, and the memory stores at least one instruction, at least one program, a code set, or an instruction set.
  • the at least one instruction, the at least one program, the code set, or the instruction set is configured to be executed by one or more processors to implement the foregoing video classification method or the foregoing training method for the video classification model.
  • in an exemplary embodiment, a computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set, when executed by a processor of a computer device, implements the foregoing video classification method or the foregoing training method for the video classification model.
  • optionally, the aforementioned computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • in an exemplary embodiment, a computer program product is also provided. When the computer program product is executed, it is used to implement the foregoing video classification method or the foregoing training method for the video classification model.
  • a person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

The embodiments of this application disclose a video classification method, a model training method, an apparatus, a device, and a storage medium, which belong to the field of computer vision technology. The method includes: obtaining a video; selecting n image frames from the video; extracting feature information of each of the n image frames through a feature extraction network according to a learned feature fusion strategy; and determining a classification result of the video according to the feature information of each of the n image frames. The feature fusion strategy in the embodiments of this application only needs to implement simple information fusion between adjacent image frames, and does not need to perform convolution in the spatial dimension and the temporal dimension at the same time as 3D convolution does. The feature fusion strategy replaces complex, repeated 3D convolution operations with simple fusion of feature information, which involves a small workload, so that the time needed to finally obtain the classification result of the video is short and the efficiency is high.

Description

视频分类方法、模型训练方法、装置、设备及存储介质
本申请要求于2019年11月15日提交的、申请号为201911121362.5、发明名称为“视频分类方法、模型训练方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机视觉技术领域,特别涉及一种视频分类方法、模型训练方法、装置、设备及存储介质。
背景技术
为使用户更快地获知视频内容,通常会为每个视频设置相应的视频分类标签。
在相关技术中,通过如下方式设置视频分类标签:将视频平均分成多段子视频;从上述多段子视频中各自抽取一个图像帧,得到多个图像帧;对该多个图像帧分别在空间维度和时间维度上进行3D(three-dimension,三维)卷积操作,得到每一个图像帧融合了其它图像帧的特征信息;根据上述每一个图像帧的特征信息,确定上述视频的视频分类标签。
然而,由于上述相关技术中的3D卷积操作计算量大,使得最终得到视频分类标签的时间较长。
发明内容
本申请实施例提供了一种视频分类方法、模型训练方法、装置、设备及存储介质,缩短了最终得到视频的分类结果的时间。技术方案如下:
一方面,本申请实施例提供一种视频分类方法,应用于计算机设备中,所述方法包括:
获取视频;
从所述视频中选取n个图像帧,所述n为正整数;
通过特征提取网络根据学习到的特征融合策略,提取所述n个图像帧各自 的特征信息;其中,所述特征融合策略用于指示所述n个图像帧中的第一图像帧在融合所述n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例;
根据所述n个图像帧各自的特征信息,确定所述视频的分类结果。
另一方面,本申请实施例提供一种视频分类模型的训练方法,应用于计算机设备中,所述方法包括:
获取视频分类模型的训练数据,所述训练数据包括至少一个样本视频;
从所述样本视频中选取n个样本图像帧,所述n为正整数;
通过所述视频分类模型中的特征提取网络根据特征融合策略,提取所述n个样本图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个样本图像帧中的第一样本图像帧在融合所述n个样本图像帧中的其它样本图像帧的特征信息时,各个样本图像帧的特征信息所占的比例;
根据所述n个样本图像帧各自的特征信息,确定所述样本视频的预测分类结果;
根据所述预测分类结果和所述样本视频的标准分类结果,对所述视频分类模型进行训练。
另一方面,本申请实施例提供一种视频分类装置,所述装置包括:
视频获取模块,用于获取视频;
图像帧选取模块,用于从所述视频中选取n个图像帧,所述n为正整数;
特征提取模块,用于通过特征提取网络根据学习到的特征融合策略,提取所述n个图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个图像帧中的第一图像帧在融合所述n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例;
视频分类模块,用于根据所述n个图像帧各自的特征信息,确定所述视频的分类结果。
再一方面,本申请实施例提供一种视频分类模型的训练装置,所述装置包括:
数据获取模块,用于获取视频分类模型的训练数据,所述训练数据包括至少一个样本视频;
图像帧选取模块,用于从所述样本视频中选取n个样本图像帧,所述n为正整数;
特征提取模块,用于通过所述视频分类模型中的特征提取网络根据特征融合策略,提取所述n个样本图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个样本图像帧中的第一样本图像帧在融合所述n个样本图像帧中的其它样本图像帧的特征信息时,各个样本图像帧的特征信息所占的比例;
视频分类模块,用于根据所述n个样本图像帧各自的特征信息,确定所述样本视频的预测分类结果;
模型训练模块,用于根据所述预测分类结果和所述样本视频的标准分类结果,对所述视频分类模型进行训练。
又一方面,本申请实施例提供一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现上述视频分类方法,或实现上述视频分类模型的训练方法。
又一方面,本申请实施例提供一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述视频分类方法,或实现上述视频分类模型的训练方法。
本申请实施例提供的技术方案可以带来如下有益效果:
通过根据学习到的特征融合策略,提取图像帧的特征信息,特征融合策略指示了每个图像帧在融合其它图像帧的特征信息时,各个图像帧的特征信息所占的比例,根据图像帧的特征信息,确定视频的分类结果。特征融合策略只需要实现简单的相邻图像帧之间的信息融合,而不需要像3D卷积同时在空间维度和时间维度上进行卷积,特征融合策略通过简单的特征信息融合,替换复杂的、重复的3D卷积操作,工作量小,使得最终得到视频的分类结果的时间较短,效率高。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下, 还可以根据这些附图获得其他的附图。
图1是本申请一个实施例提供的实施环境的示意图;
图2是本申请一个实施例提供的视频分类方法的流程图;
图3示出了相关技术中的偏移策略的示意图;
图4是本申请一个实施例提供的残差结构的示意图;
图5是本申请另一个实施例提供的视频分类方法的流程图;
图6是本申请一个实施例提供的图像帧选取的示意图;
图7是本申请一个实施例提供的局部图像帧增强策略的示意图;
图8是相关技术中的视频分类方法的示意图;
图9是本申请一个实施例提供的视频分类模型的训练方法的流程图;
图10是本申请一个实施例提供的视频分类装置的框图;
图11是本申请另一个实施例提供的视频分类装置的框图;
图12是本申请一个实施例提供的视频分类模型的训练装置的框图;
图13是本申请一个实施例提供的视频分类模型的训练装置的框图;
图14是本申请一个实施例提供的计算机设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
计算机视觉技术(Computer Vision,CV)计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。
自然语言处理(Nature Language processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技术。
本申请实施例提供的方案涉及人工智能的计算机视觉技术、自然语言处理、机器学习等技术,具体通过如下实施例进行说明。
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
请参考图1,其示出了本申请一个实施例提供的实施环境的示意图。该实施环境可以包括:终端10和服务器20。
终端10中安装运行有客户端,客户端是指用于上传视频的客户端,例如,视频客户端。终端10可以是诸如手机、平板电脑、可穿戴设备、PC(Personal Computer,个人计算机)等电子设备。
服务器20可以是一台服务器,也可以是由多台服务器组成的服务器集群,或者是一个云计算中心。服务器20可以通过有线或者无线网络与终端10进行通信。服务器20可以获取终端10上传的视频,对该视频进行分类,确定出该视频的分类结果,从而将确定出的分类结果发送给终端10中的客户端进行显示,并基于分类结果进行视频推荐;服务器20还可以对终端10上传的视频进行审核与过滤,判断终端10上传的视频是否是不良视频,例如,不良视频可以是暴力视频、色情低俗视频等;若服务器20判断终端10上传的视频是不良视频,则确认审核不通过,并过滤该视频,向终端10反馈审核不通过消息。
请参考图2,其示出了本申请一个实施例提供的视频分类方法的流程图。该方法的执行主体可以是计算机设备,该计算机设备可以是任何具备计算和处理能力的电子设备,如图1中介绍的终端或服务器。该方法可以包括以下几个步骤。
步骤201,获取视频。
视频可以是任意一个视频。视频可以是终端上传的视频,例如,该视频可以是安装有视频客户端的终端上传的视频。
步骤202,从视频中选取n个图像帧,n为正整数。
视频是多个有序排列图像帧的合集。可选地,计算机设备通过抽帧的方式从视频中选取n个图像帧,n为视频抽帧后得到的总帧数。
步骤203,通过特征提取网络根据学习到的特征融合策略,提取n个图像帧各自的特征信息。
特征提取网络用于提取n个图像帧的特征信息。示例性地,特征提取网络用于根据特征融合策略融合每个图像帧和其它图像帧之间的特征信息,得到每个图像帧的融合后的特征信息,然后通过网络结构对该融合后的特征信息进行处理,输出每个图像帧的最终的特征信息。
在本申请实施例中,特征融合策略用于指示n个图像帧中的第一图像帧在融合n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例。
第一图像帧是n个图像帧中的任意一个图像帧,其它图像帧是n个图像帧中除第一图像帧之外的全部或部分图像帧。可选地,其它图像帧是与第一图像帧相邻的图像帧。例如,有5个图像帧:图像帧1、图像帧2、图像帧3、图像 帧4和图像帧5,假设第一图像帧是图像帧2,则其它图像帧可以是图像帧1和图像帧3,此时,根据特征融合策略提取的图像帧2的特征信息中融合有图像帧1和图像帧3的特征信息。特征融合策略指示的第一图像帧在融合n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例是通过学习得到的,不是固定的模式。可选地,图像帧的特征信息可以用图像帧中的像素的灰度值表示,仍然以上述示例为例,假设图像帧2的某一像素的灰度值为2,图像帧1某一像素的灰度值为3,图像帧3某一像素的灰度值为4,学习到的特征融合策略用于指示图像帧2在融合图像帧1和图像帧3的特征信息时,图像帧1所占的比例为0.2,图像帧2所占的比例为0.4,图像帧3所占的比例为0.4,则通过特征提取网络根据特征融合策略,提取的图像帧2的某一像素的灰度值为(0.2*3+0.4*2+0.4*4)=3。
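The arithmetic of the example above can be reproduced in a few lines. The following minimal sketch is illustrative only; the gray values 3, 2, 4 and the proportions 0.2, 0.4, 0.4 are the example numbers from the preceding paragraph, and in the trained network the proportions are learned parameters rather than constants.

```python
import numpy as np

# Gray value of the same pixel position in image frame 1, frame 2 and frame 3.
gray_values = np.array([3.0, 2.0, 4.0])
# Learned proportions of frame 1, frame 2 and frame 3 in the fusion (example values).
learned_ratios = np.array([0.2, 0.4, 0.4])

fused_value_for_frame_2 = float(learned_ratios @ gray_values)
print(fused_value_for_frame_2)  # 3.0, matching 0.2*3 + 0.4*2 + 0.4*4
```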
特征融合策略设计简单,实现高效,可以嵌入进已有的特征提取网络中;通过特征融合策略提取图像帧的特征信息,能够实现灵活地在与其它图像帧之间进行信息交换与信息融合。特征融合策略通过简单的特征信息融合,替换复杂的、重复的3D卷积操作,工作量小。图像帧与图像帧之间的特征信息的交换是动态的,不是固定的模式,本申请实施例提供的特征融合策略可以更加有效地自动学习到适合的信息交换模式。
步骤204,根据n个图像帧各自的特征信息,确定视频的分类结果。
n个图像帧的特征信息可以代表视频的特征信息,根据n个图像帧各自的特征信息,可以确定出视频的分类结果。视频的分类结果用于表征该视频的分类,例如,该视频是暴力视频、色情视频、动画视频、科幻视频等等。
综上所述,本申请实施例提供的技术方案中,通过根据学习到的特征融合策略,提取图像帧的特征信息,特征融合策略指示了每个图像帧在融合其它图像帧的特征信息时,各个图像帧的特征信息所占的比例,根据图像帧的特征信息,确定视频的分类结果。特征融合策略只需要实现简单的相邻图像帧之间的信息融合,而不需要像3D卷积同时在空间维度和时间维度上进行卷积,特征融合策略通过简单的特征信息融合,替换复杂的、重复的3D卷积操作,工作量小,使得最终得到视频的分类结果的时间较短,效率高。
另外,本申请实施例中的特征融合策略是根据学习得到的,可以更加高效灵活地进行特征信息融合。
在示意性实施例中,特征提取网络包括m个级联的网络结构,m为正整数。计算机设备可以通过如下方式提取图像帧的特征信息:
第一、对于第一图像帧,在将第一图像帧的第一特征信息输入至特征提取网络的第k个网络结构之前,根据特征融合策略对第一特征信息进行特征融合处理,得到处理后的第一特征信息,k为小于或等于m的正整数;
在本申请实施例中,处理后的第一特征信息中融合有第一图像帧以及其它图像帧的特征信息。
示例性地,第一特征信息包括c个通道的特征,c为正整数。
示例性地,计算机设备可以通过如下几个子步骤得到处理后的第一特征信息:
1、对于第一特征信息中第i个通道的特征,采用学习到的卷积核对第一图像帧中第i个通道的特征,以及其它图像帧中第i个通道的特征进行卷积操作,得到第一图像帧中第i个通道的处理后的特征,i为小于或等于c的正整数;
在本申请实施例中,卷积核用于定义第一图像帧中第i个通道的特征所对应的特征融合策略。第一图像帧中不同通道的特征所对应的特征融合策略可以不一样,也即,第一图像帧中不同通道对应的卷积核不一样。
第一图像帧中第i个通道的特征可以用第一图像帧中第i个通道的像素的灰度值表示。假设存在5个图像帧:图像帧1、图像帧2、图像帧3、图像帧4和图像帧5,第一图像帧是图像帧2,其它图像帧是图像帧1和图像帧3,学习到的卷积核大小为3,卷积核参数为[0.2,0.4,0.4],图像帧1在第i个通道的像素的灰度值为3,图像帧2在第i个通道的像素的灰度值为2,图像帧3在第i个通道的像素的灰度值为4,则采用上述卷积核对图像帧2中第i个通道的像素的灰度值,以及图像帧1和图像帧3中第i个通道的像素的灰度值进行卷积操作,得到图像帧2中第i个通道的处理后的像素的灰度值为(0.2*3+0.4*2+0.4*4)=3。
2、根据第一图像帧中各个通道的处理后的特征,得到处理后的第一特征信息。
假设第一特征信息包括256个通道的特征,根据第一图像帧中256个通道处理后的特征,得到处理后的第一特征信息。
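One plausible realization of the learned per-channel kernels described in sub-steps 1 and 2 above, offered here as an assumption rather than as the implementation disclosed in this application, is a grouped 1D convolution along the frame axis; freezing a kernel at [0, 0, 1] or [1, 0, 0] would collapse it to the rigid shift of the related art discussed next.

```python
import torch
import torch.nn as nn

n_frames, channels = 8, 256
# Features of one spatial position across the n image frames: (batch, channels, frames).
x = torch.randn(1, channels, n_frames)

# One learnable size-3 kernel per channel (groups=channels), so every channel
# of the first image frame has its own feature fusion strategy.
temporal_fusion = nn.Conv1d(channels, channels, kernel_size=3,
                            padding=1, groups=channels, bias=False)

# A fixed kernel such as [0, 0, 1] or [1, 0, 0] would reproduce the one-direction
# shift of the related art (Fig. 3); here the kernel parameters are learned.
fused = temporal_fusion(x)
print(fused.shape)  # torch.Size([1, 256, 8])
```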
如图3所示,其示出了相关技术中的偏移策略的示意图。假设4个图像帧(图像帧1、图像帧2、图像帧3和图像帧4)各自的特征信息分别包括6个通 道(用c表示)的特征,相关技术中,在c=1通道,计算机设备将图像帧1中的特征平移到图像帧2中,将图像帧2中的特征平移到图像帧3中,将图像帧3中的特征平移到图像帧4中;在c=2通道,计算机设备将图像帧2中的特征平移到图像帧1中,将图像帧3中的特征平移到图像帧2中,将图像帧4的特征平移到图像帧3中;在c=3和c=4通道,4个图像帧中的特征保持不变。相关技术中的偏移策略可以看做是采用卷积核大小为3,卷积核参数固定的卷积核对第一图像帧中第i个通道的特征,以及相邻图像帧中第i个通道的特征进行卷积操作,如采用卷积核[001]定义图像帧中某一个通道的特征在时间维度上向反方向移动,用卷积核[100]定义图像帧中某一个通道的特征在时间维度上向正方向移动。对于偏移后的图像帧2和图像帧3的特征信息而言,图像帧2的特征信息中融合了图像帧1和图像帧3的特征信息,图像帧3的特征信息中融合了图像帧2和图像帧4的特征信息。然而,相关技术中的偏移策略过于死板,图像帧与图像帧之间的信息交换的模式是固定不变的,对于相邻图像帧融合之后得到的新特征信息,每一个图像帧的原有特征在偏移后的特征信息中所占的比重也是固定的,显然,相关技术中的策略不够灵活。然而,本申请实施例提供的卷积核的卷积核参数是不固定的,是通过学习得到的,本申请实施例提供的特征融合策略更为灵活。
第二、通过第k个网络结构对处理后的第一特征信息进行处理,生成第一图像帧的第二特征信息。
网络结构用于对第一特征信息做空间卷积处理。示例性地,特征融合策略对应的特征融合处理和网络结构构成了一个残差结构。第k个网络结构是m个网络结构中的任意一个网络结构。如图4所示,其示出了本申请一个实施例提供的残差结构的示意图。该残差结构包括特征融合处理和网络结构,网络结构可以包括空间1x1卷积、空间3x3卷积和空间1x1卷积。残差结构的输入是第一特征信息,通过对第一特征信息进行特征融合处理,得到处理后的第一特征信息;然后对处理后的第一特征信息分别做空间1x1卷积、空间3x3卷积和空间1x1卷积,得到卷积后的第一特征信息;将卷积后的第一特征信息和第一特征信息相加,得到第二特征信息。
特征提取网络可以是一个残差网络,其包括多个级联的网络结构,在将图像帧的特征信息输入至各个网络结构之前,计算机设备都可以根据特征融合策略对特征信息进行特征融合处理。在可能的实现方式中,在将图像帧的特征信 息输入部分网络结构之前,根据特征融合策略对特征信息进行特征融合处理。
在本申请实施例中,第二特征信息为特征提取网络输出的第一图像帧的特征信息,或者特征提取网络生成的第一图像帧的中间特征信息。当第二特征信息为特征提取网络生成的第一图像帧的中间特征信息时,通过第k个网络结构之后的其它网络结构对该中间特征信息进行处理,得到第一图像帧的特征信息。
综上所述,本申请实施例提供的技术方案中,通过采用学习到的卷积核对图像帧的特征信息进行特征融合处理,操作简单,工作量小。
在示意性实施例中,如图5所示,本申请实施例提供的视频分类方法还可以包括如下几个步骤:
步骤501,获取视频。
步骤502,按照预设帧率从视频中抽取图像帧,得到视频帧序列。
预设帧率可以是24帧/秒,预设帧率可以是默认帧率,也可以是研究人员根据实际需求进行设置的帧率。
步骤503,将视频帧序列平均分成n个子序列。
每个子序列的长度为视频帧序列的1/n。n可以是8、16或24,本申请实施例对n的大小不作限定,在实际应用中,n一般选取2的倍数。
步骤504,从n个子序列中的每一个序列中抽取一个图像帧,得到n个图像帧。
上述步骤502至步骤504是采取基于稀疏采样的图像帧抽取策略从视频中选取图像帧的。每个图像帧的特征信息代表一个子序列的特征信息,把任意长度的视频帧序列转化为了n个尽可能地覆盖整个视频从而尽可能保留时间信息的图像帧。可选地,计算机设备可以从每个序列中随机抽取一个图像帧,得到n个图像帧;计算机设备也可以选取每个序列中的固定位置处的图像帧(例如,计算机设备可以选取每个序列中的第一个图像帧或最后一个图像帧),本申请实施例对如何从序列中抽取图像帧的方式不作限定。示例性地,如图6所示,计算机设备按照预设帧率从视频中抽取图像帧,得到视频帧序列,以n为8为例进行介绍说明,计算机设备将上述视频帧序列平均分成8个子序列:段1、段2、段3……段8,从该8个子序列中的每一个序列中随机抽取一个图像帧,得到8个图像帧。
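The sparse-sampling selection of steps 502 to 504 can be sketched as a small helper; the function below and its name are illustrative rather than part of the disclosure, random in-segment sampling is one of the options mentioned above, and the 24 frames per second in the example call matches the preset frame rate of step 502.

```python
import random

def sample_frames(frame_sequence, n=8, seed=None):
    """Split the frame sequence into n equal sub-sequences and draw one frame
    at random from each sub-sequence, as in steps 503 and 504."""
    rng = random.Random(seed)
    segment_len = len(frame_sequence) / n
    picks = []
    for i in range(n):
        start = int(i * segment_len)
        end = int((i + 1) * segment_len)
        picks.append(frame_sequence[rng.randrange(start, max(end, start + 1))])
    return picks

# A 10-second video decoded at the preset frame rate of 24 frames per second.
frame_sequence = list(range(240))
print(sample_frames(frame_sequence, n=8, seed=0))
```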
步骤505,通过特征提取网络根据学习到的特征融合策略,提取n个图像帧各自的特征信息。
在本申请实施例中,特征融合策略用于指示n个图像帧中的第一图像帧在融合n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例。
步骤506,根据n个图像帧各自的特征信息,得到n个图像帧对应的n个分类结果。
n个图像帧和n个分类结果一一对应,也即每一个图像帧对应于一个分类结果。
示例性地,如图7所示,计算机设备可以通过如下方式得到每个图像帧对应的分类结果:
1、对于n个图像帧中的第j个图像帧,对第j个图像帧的特征信息做降维处理,得到降维后的第j个图像帧的特征信息;
2、通过n个分类器中的第j个分类器根据降维后的第j个图像帧的特征信息,得到第j个图像帧对应的分类结果,j为小于或等于n的正整数。
先对图像帧的特征信息做降维处理,根据降维后的图像帧的特征信息训练分类器,有利于分类器的优化,减轻计算机设备的计算压力。
步骤507,根据n个分类结果,确定视频的分类结果。
可选地,将n个分类结果与n个分类结果各自对应的权重乘积之和,确定为视频的分类结果。当然,在其它可能的实现方式中,可以将n个分类结果求平均,将平均值作为视频的分类结果。
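Steps 506 and 507 can be illustrated with the sketch below; the feature size, reduced size, class count and equal weights are placeholder assumptions, and with equal weights the weighted sum reduces to the plain average mentioned as an alternative.

```python
import torch
import torch.nn as nn

n_frames, feat_dim, reduced_dim, num_classes = 8, 2048, 256, 400  # placeholder sizes

# Step 506: reduce the dimension of each frame's feature information, then feed
# the j-th frame's reduced feature to the j-th of the n classifiers.
dim_reduce = nn.Linear(feat_dim, reduced_dim)
classifiers = nn.ModuleList([nn.Linear(reduced_dim, num_classes) for _ in range(n_frames)])

frame_features = torch.randn(n_frames, feat_dim)           # feature info of the n frames
frame_results = torch.stack([classifiers[j](dim_reduce(frame_features[j]))
                             for j in range(n_frames)])    # (n_frames, num_classes)

# Step 507: weighted sum of the n classification results.
weights = torch.full((n_frames, 1), 1.0 / n_frames)
video_result = (weights * frame_results).sum(dim=0)
print(int(video_result.argmax()))                           # predicted class index
```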
上述步骤506至步骤507对应的策略可以称为局部图像帧增强策略。通过局部图像帧增强策略对图像帧进行处理,加强了图像帧,特别是边缘图像帧(位于视频开始和末尾的图像帧)的特征表达能力,相较于如图8所示的相关技术中对n个图像帧各自的特征信息进行平均操作,本申请实施例是通过要求局部特征来实现视频分类指标,使得局部特征也需要挖掘出具有足够判别力的特征,进而增强了局部特征,尤其是边缘图像帧的特征信息的表达能力,进而使得最终确定的视频的分类结果更准确。
即使利用多次特征融合策略对n个图像帧进行时间特征融合,每个图像帧能够融合多个其它图像帧的特征信息,获得更长时间范围内的信息,但是边缘图像帧对于大范围时间信息的捕捉仍然是欠缺的、不充分的,相应的,它们对 于视频的特征信息的建模是不足的,也即对于视频的表达能力是不足的。这种情况下简单地使用如图8所示的相关技术中的平均策略将n个图像帧各自的特征信息整合成视频的特征信息,继而使用一个分类器根据该视频的特征信息,对视频的分类结果进行最终预测,把边缘图像帧的特征信息和其它图像帧的特征信息同等看待,最终平均得到的视频表达能力也会被边缘图像帧的特征信息所拖累,边缘图像帧的特征信息会影响最终的视频建模能力。在本申请实施例中,对于每个单独的图像帧,都各自使用一个分类器进行动作识别任务,强迫边缘图像帧在信息不充足的情况下挖掘更多的有用信息,增强这些边缘图像帧的信息表达能力。
综上所述,本申请实施例提供的技术方案中,通过为每个图像帧的特征信息设置一个分类器,增强了图像帧的特征表达能力。
如图9所示,其示出了本申请一个实施例提供的视频分类模型的训练方法的流程图,该方法的执行主体可以是计算机设备,如可以是图1中介绍的服务器或终端,该方法可以包括如下几个步骤。
步骤901,获取视频分类模型的训练数据。
视频分类模型用于确定视频的分类结果。在本申请实施例中,训练数据包括至少一个样本视频。训练数据中包括的样本视频的分类结果是一致的。
步骤902,从样本视频中选取n个样本图像帧,n为正整数。
步骤902的介绍说明可参见上文步骤502至步骤504的介绍说明,此处不再赘述。
步骤903,通过视频分类模型中的特征提取网络根据特征融合策略,提取n个样本图像帧各自的特征信息。
在本申请实施例中,特征融合策略用于指示n个样本图像帧中的第一样本图像帧在融合n个样本图像帧中的其它样本图像帧的特征信息时,各个样本图像帧的特征信息所占的比例。
假设训练过程中批大小(batch_size)为N,即每次迭代训练包括N个视频帧序列,分别从每个视频帧序列中选取n个图像帧,每个图像帧的大小为HxW,H代表图像帧的高度,W代表图像帧的宽度,同时假设特征信息包括c个通道的特征,即特征通道数为c,每个视频帧序列对应一个视频分类模型,一个残差结构的输入X,其大小即为(Nn)xcxHxW,特征融合处理的过程为: 首先对输入X进行重塑(reshape)处理,使其表达形式变为(NxHxW)xcxn,在这种情况下,我们可以近似认为,对于N个视频帧序列中的每一个空间位置(H,W),其特征表达由n个图像帧的特征信息组成,其中,每个图像帧的特征通道数为c,采用学习到的卷积核对各个图像帧进行卷积操作,得到卷积处理后的特征信息,再对卷积处理后的特征信息做重塑处理,使得卷积处理后的特征信息的表达形式变为(Nn)xcxHxW。通过网络结构对处理后的特征信息进行空间卷积处理,得到空间卷积后的特征信息,将输入和空间卷积后的特征信息相加,得到最终的特征信息。
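The bookkeeping described in the preceding paragraph, reshaping an input of size (N·n) x c x H x W to (N·H·W) x c x n, applying the learned kernels along the frame axis, reshaping back, applying the spatial convolutions of the network structure and adding the input, might be organized as follows. This extends the earlier Conv1d sketch; it is an illustrative reading with placeholder sizes, not the released implementation.

```python
import torch
import torch.nn as nn

class FusionResidual(nn.Module):
    """One residual structure: temporal feature fusion, spatial 1x1 -> 3x3 -> 1x1
    convolutions, and a skip connection back to the input (cf. Fig. 4)."""

    def __init__(self, c: int, n: int):
        super().__init__()
        self.n = n
        self.fuse = nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c, bias=False)
        self.spatial = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False),
            nn.Conv2d(c, c, 3, padding=1, bias=False),
            nn.Conv2d(c, c, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*n, c, H, W), i.e. N video frame sequences with n frames each.
        total, c, h, w = x.shape
        batch = total // self.n
        # (N*n, c, H, W) -> (N*H*W, c, n): put the frame axis last, per spatial position.
        t = x.reshape(batch, self.n, c, h, w).permute(0, 3, 4, 2, 1).reshape(-1, c, self.n)
        t = self.fuse(t)                                   # learned temporal fusion
        # Reshape back to (N*n, c, H, W) before the spatial convolutions.
        t = t.reshape(batch, h, w, c, self.n).permute(0, 4, 3, 1, 2).reshape(total, c, h, w)
        return x + self.spatial(t)                         # residual addition

block = FusionResidual(c=64, n=8)
out = block(torch.randn(2 * 8, 64, 14, 14))                # batch_size N = 2
print(out.shape)                                           # torch.Size([16, 64, 14, 14])
```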
步骤904,根据n个样本图像帧各自的特征信息,确定样本视频的预测分类结果。
步骤904的介绍说明可参见上文实施例,此处不再赘述。
步骤905,根据预测分类结果和样本视频的标准分类结果,对视频分类模型进行训练。
预测分类结果用于表征视频分类模型预测的视频的分类,样本视频的标准分类结果可以是人工标注的分类结果。示例性地,可以根据预测分类结果和标准分类结果之间的距离,对视频分类模型进行训练。例如,根据预测分类结果和标准分类结果之间的余弦距离、欧式距离、曼哈顿距离或其它距离等,对视频分类模型进行训练。当预测分类结果和标准分类结果之间的距离小于预设距离时,停止对视频分类模型的训练。预测分类结果和标准分类结果之间的距离越小,说明视频分类模型越精确。
示例性地,计算机设备可以根据预测分类结果和标准分类结果,计算视频分类模型对应的损失函数值,根据损失函数值对应视频分类模型进行训练。在可能的实现方式中,当损失函数值小于预设阈值时,停止对视频分类模型的训练。损失函数值用于表征预测分类结果和标准分类结果之间的不一致程度。若损失函数值较小,则表明预测分类结果和标准分类结果很接近,视频分类模型性能良好;若损失函数值较大,则表明预测分类结果和标准分类结果差距很大,视频分类模型性能不佳。
在可能的实现方式中,计算机设备可以根据损失函数值调整特征融合策略。
可选地,根据损失函数值调整卷积核的参数,卷积核用于定义第一样本图像帧中第i个通道的特征所对应的特征融合策略,i为正整数。
根据损失函数值,对特征融合策略进行调整,实现对视频分类模型的训练,可以进行多轮调整,当满足第一停止训练条件时,停止对特征融合策略进行训练。
第一停止训练条件可以包括以下任意一项:当损失函数值满足预设阈值时,停止对特征融合策略进行训练;或者,当训练次数达到预设次数时,例如,达到10万次时,停止对特征融合策略进行训练;或者,当第k+1轮计算得到的损失函数值与第k轮计算得到的损失函数值之间的差值小于预设差值时,例如,小于10 -9时,停止对特征融合策略进行训练。
在可能的实现方式中,视频分类模型还包括n个分类器。计算机设备可以根据损失函数值调整n个分类器的参数,n个分类器中的第h个分类器用于根据第一样本图像帧的特征信息,得到第一样本图像帧对应的预测分类结果,h为小于或等于n的正整数。
根据损失函数值,对分类器的参数进行调整,实现对视频分类模型的训练,可以进行多轮调整,当满足第二停止训练条件时,停止对分类器的参数进行训练。
第二停止训练条件可以包括以下任意一项:当损失函数值满足预设阈值时,停止对分类器的参数进行训练;或者,当训练次数达到预设次数时,例如,达到10万次时,停止对分类器的参数进行训练;或者,当第k+1轮计算得到的损失函数值与第k轮计算得到的损失函数值之间的差值小于预设差值时,例如,小于10 -9时,停止对分类器的参数进行训练。
在可能的实现方式中,服务器根据损失函数值调整特征融合策略和分类器。当满足第一停止训练条件时,停止对特征融合策略进行训练;当满足第二停止训练条件时,停止对分类器的参数进行训练。
第一停止训练条件和第二停止训练条件可以相同,也可以不同,本申请实施例对此不作限定。
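Step 905 together with the stopping conditions above can be summarized as a short training loop. Cross-entropy is used here as one plausible choice of loss, since the description only requires a measure of the gap between the predicted and the standard classification results; the stand-in model, batch and thresholds are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for the video classification model (the real model contains the learned
# fusion kernels and the n classifiers illustrated in the earlier sketches).
n_frames, feat_dim, num_classes = 8, 256, 10
model = nn.Sequential(nn.Flatten(), nn.Linear(n_frames * feat_dim, num_classes))

criterion = nn.CrossEntropyLoss()                        # one plausible loss choice
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

max_steps = 1_000            # the text mentions e.g. 100,000 iterations as a cap
loss_threshold, delta_threshold = 1e-3, 1e-9
previous_loss = None
for step in range(max_steps):
    frames = torch.randn(4, n_frames, feat_dim)          # placeholder sample batch
    labels = torch.randint(0, num_classes, (4,))         # standard classification results
    loss = criterion(model(frames), labels)

    optimizer.zero_grad()
    loss.backward()          # gradients adjust the fusion-kernel and classifier parameters
    optimizer.step()

    # Stopping conditions described above: small loss, or a negligible change in loss.
    if loss.item() < loss_threshold:
        break
    if previous_loss is not None and abs(previous_loss - loss.item()) < delta_threshold:
        break
    previous_loss = loss.item()
```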
综上所述,本申请实施例提供的技术方案中,通过根据学习到的特征融合策略,提取图像帧的特征信息,特征融合策略指示了每个图像帧在融合其它图像帧的特征信息时,各个图像帧的特征信息所占的比例,根据图像帧的特征信息,确定视频的分类结果。特征融合策略只需要实现简单的相邻图像帧之间的信息融合,而不需要像3D卷积同时在空间维度和时间维度上进行卷积,特征融合策略通过简单的其它图像帧的特征信息融合,替换复杂的、重复的3D卷 积操作,工作量小,使得最终得到视频的分类结果的时间较短,效率高。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图10,其示出了本申请一个实施例提供的视频分类装置的框图。该装置具有实现上述视频分类方法示例的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是计算机设备,也可以设置在计算机设备中。该装置1000可以包括:视频获取模块1010、图像帧选取模块1020、特征提取模块1030和视频分类模块1040。
视频获取模块1010,用于获取视频。
图像帧选取模块1020,用于从所述视频中选取n个图像帧,所述n为正整数。
特征提取模块1030,用于通过特征提取网络根据学习到的特征融合策略,提取所述n个图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个图像帧中的第一图像帧在融合所述n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例。
视频分类模块1040,用于根据所述n个图像帧各自的特征信息,确定所述视频的分类结果。
综上所述,本申请实施例提供的技术方案中,通过根据学习到的特征融合策略,提取图像帧的特征信息,特征融合策略指示了每个图像帧在融合其它图像帧的特征信息时,各个图像帧的特征信息所占的比例,根据图像帧的特征信息,确定视频的分类结果。特征融合策略只需要实现简单的相邻图像帧之间的信息融合,而不需要像3D卷积同时在空间维度和时间维度上进行卷积,特征融合策略通过简单的特征信息融合,替换复杂的、重复的3D卷积操作,工作量小,使得最终得到视频的分类结果的时间较短,效率高。
在示意性实施例中,所述特征提取网络包括m个级联的网络结构,所述m为正整数;
所述特征提取模块1030,用于:
对于所述第一图像帧,在将所述第一图像帧的第一特征信息输入至所述特征提取网络的第k个网络结构之前,根据所述特征融合策略对所述第一特征信息进行特征融合处理,得到处理后的第一特征信息;其中,所述处理后的第一 特征信息中融合有所述第一图像帧以及所述其它图像帧的特征信息,所述k为小于或等于所述m的正整数;
通过所述第k个网络结构对所述处理后的第一特征信息进行处理,生成所述第一图像帧的第二特征信息;
其中,所述第二特征信息为所述特征提取网络输出的所述第一图像帧的特征信息,或者所述特征提取网络生成的所述第一图像帧的中间特征信息。
在示意性实施例中,所述第一特征信息包括c个通道的特征,所述c为正整数;
所述特征提取模块1030,用于:
对于所述第一特征信息中第i个通道的特征,采用学习到的卷积核对所述第一图像帧中所述第i个通道的特征,以及所述其它图像帧中所述第i个通道的特征进行卷积操作,得到所述第一图像帧中所述第i个通道的处理后的特征,所述i为小于或等于所述c的正整数;
根据所述第一图像帧中各个通道的处理后的特征,得到所述处理后的第一特征信息;
其中,所述卷积核用于定义所述第一图像帧中所述第i个通道的特征所对应的特征融合策略。
在示意性实施例中,如图11所示,所述视频分类模块1040,包括:结果获取单元1041和视频分类单元1042。
结果获取单元1041,用于根据所述n个图像帧各自的特征信息,得到所述n个图像帧对应的n个分类结果。
视频分类单元1042,用于根据所述n个分类结果,确定所述视频的分类结果。
在示意性实施例中,所述结果获取单元1041,用于:
对于所述n个图像帧中的第j个图像帧,对所述第j个图像帧的特征信息做降维处理,得到降维后的第j个图像帧的特征信息;
通过n个分类器中的第j个分类器根据所述降维后的第j个图像帧的特征信息,得到所述第j个图像帧对应的分类结果,所述j为小于或等于所述n的正整数。
在示意性实施例中,所述视频分类单元1042,用于:
将所述n个分类结果与所述n个分类结果各自对应的权重乘积之和,确定 为所述视频的分类结果。
在示意性实施例中,所述图像帧选取模块1020,用于:
按照预设帧率从所述视频中抽取图像帧,得到视频帧序列;
将所述视频帧序列平均分成n个子序列;
从所述n个子序列中的每一个序列中抽取一个图像帧,得到所述n个图像帧。
请参考图12,其示出了本申请一个实施例提供的视频分类模型的训练装置的框图。该装置具有实现上述视频分类模型的训练方法示例的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是计算机设备,也可以设置在计算机设备中。该装置1200可以包括:数据获取模块1210、图像帧选取模块1220、特征提取模块1230、视频分类模块1240和模型训练模块1250。
数据获取模块1210,用于获取视频分类模型的训练数据,所述训练数据包括至少一个样本视频。
图像帧选取模块1220,用于从所述样本视频中选取n个样本图像帧,所述n为正整数。
特征提取模块1230,用于通过所述视频分类模型中的特征提取网络根据特征融合策略,提取所述n个样本图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个样本图像帧中的第一样本图像帧在融合所述n个样本图像帧中的其它样本图像帧的特征信息时,各个样本图像帧的特征信息所占的比例。
视频分类模块1240,用于根据所述n个样本图像帧各自的特征信息,确定所述样本视频的预测分类结果。
模型训练模块1250,用于根据所述预测分类结果和所述样本视频的标准分类结果,对所述视频分类模型进行训练。
综上所述,本申请实施例提供的技术方案中,通过根据学习到的特征融合策略,提取图像帧的特征信息,特征融合策略指示了每个图像帧在融合其它图像帧的特征信息时,各个图像帧的特征信息所占的比例,根据图像帧的特征信息,确定视频的分类结果。特征融合策略只需要实现简单的相邻图像帧之间的信息融合,而不需要像3D卷积同时在空间维度和时间维度上进行卷积,特征 融合策略通过简单的其它图像帧的特征信息融合,替换复杂的、重复的3D卷积操作,工作量小,使得最终得到视频的分类结果的时间较短,效率高。
在示意性实施例中,如图13所示,所述模型训练模块1250,包括:函数计算单元1251和策略调整单元1252。
函数计算单元1251,用于根据所述预测分类结果和所述标准分类结果,计算所述视频分类模型对应的损失函数值。
策略调整单元1252,用于根据所述损失函数值调整所述特征融合策略。
在示意性实施例中,所述策略调整单元1252,用于:
根据所述损失函数值调整卷积核的参数,所述卷积核用于定义所述第一样本图像帧中第i个通道的特征所对应的特征融合策略,所述i为正整数。
在示意性实施例中,所述视频分类模型还包括n个分类器;
所述模型训练模块1250,还包括:分类器调整单元1253。
分类器调整单元1253,用于根据所述损失函数值调整所述n个分类器的参数,所述n个分类器中的第h个分类器用于根据所述第一样本图像帧的特征信息,得到所述第一样本图像帧对应的预测分类结果,所述h为小于或等于所述n的正整数。
需要说明的是,上述实施例提供的装置,在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内容结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图14,其示出了本申请一个实施例提供的计算机设备1400的结构示意图。该计算机设备1400可用于实施上述实施例中提供的方法。该计算机设备1400可以是图1实施例中介绍的终端10或服务器20。具体来讲:
所述计算机设备1400包括中央处理单元(Central Processing Unit,CPU)1401、包括RAM(Random Access Memory,随机存取存储器)1402和ROM(Read-Only Memory,只读存储器)1403的系统存储器1404,以及连接系统存储器1404和中央处理单元1401的系统总线1405。所述计算机设备1400还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统, Input/Output系统)1406,和用于存储操作系统1413、应用程序1414和其他程序模块1415的大容量存储设备1407。
所述基本输入/输出系统1406包括有用于显示信息的显示器1408和用于用户输入信息的诸如鼠标、键盘之类的输入设备1409。其中所述显示器1408和输入设备1409都通过连接到系统总线1405的输入输出控制器1410连接到中央处理单元1401。所述基本输入/输出系统1406还可以包括输入输出控制器1410以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1410还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备1407通过连接到系统总线1405的大容量存储控制器(未示出)连接到中央处理单元1401。所述大容量存储设备1407及其相关联的计算机可读介质为计算机设备1400提供非易失性存储。也就是说,所述大容量存储设备1407可以包括诸如硬盘或者CD-ROM(Compact Disc Read-Only Memory,只读光盘)驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM(Erasable Programmable Read-Only Memory,可擦除可编程只读存储器)、EEPROM(Electrically Erasable Programmable Read-Only Memory,电可擦可编程只读存储器)、闪存(Flash Memory)或其他固态存储其技术,CD-ROM、DVD(Digital Versatile Disc,数字通用光盘)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1404和大容量存储设备1407可以统称为存储器。
根据本申请的各种实施例,所述计算机设备1400还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备1400可以通过连接在所述系统总线1405上的网络接口单元1411连接到网络1412,或者说,也可以使用网络接口单元1411来连接到其他类型的网络或远程计算机系统(未示出)。
所述存储器还包括一个或者一个以上的程序,所述一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行。上述一个或者 一个以上程序包含用于实现上述视频分类方法,或实现上述视频分类模型的训练方法。
在示意性实施例中,还提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集。所述至少一条指令、至少一段程序、代码集或指令集经配置以由一个或者一个以上处理器执行,以实现上述视频分类方法,或实现上述视频分类模型的训练方法。
在示例性实施例中,还提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或所述指令集在被计算机设备的处理器执行时实现上述视频分类方法,或实现上述视频分类模型的训练方法。
可选地,上述计算机可读存储介质可以是ROM(Read-Only Memory,只读存储器)、RAM(Random Access Memory,随机存取存储器)、CD-ROM(Compact Disc Read-Only Memory,只读光盘)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品,当该计算机程序产品被执行时,其用于实现上述视频分类方法,或实现上述视频分类模型的训练方法。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (15)

  1. 一种视频分类方法,应用于计算机设备中,所述方法包括:
    获取视频;
    从所述视频中选取n个图像帧,所述n为正整数;
    通过特征提取网络根据学习到的特征融合策略,提取所述n个图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个图像帧中的第一图像帧在融合所述n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例;
    根据所述n个图像帧各自的特征信息,确定所述视频的分类结果。
  2. 根据权利要求1所述的方法,其中,所述特征提取网络包括m个级联的网络结构,所述m为正整数;
    所述通过特征提取网络根据学习到的特征融合策略,提取所述n个图像帧各自的特征信息,包括:
    对于所述第一图像帧,在将所述第一图像帧的第一特征信息输入至所述特征提取网络的第k个网络结构之前,根据所述特征融合策略对所述第一特征信息进行特征融合处理,得到处理后的第一特征信息;其中,所述处理后的第一特征信息中融合有所述第一图像帧以及所述其它图像帧的特征信息,所述k为小于或等于所述m的正整数;
    通过所述第k个网络结构对所述处理后的第一特征信息进行处理,生成所述第一图像帧的第二特征信息;
    其中,所述第二特征信息为所述特征提取网络输出的所述第一图像帧的特征信息,或者所述特征提取网络生成的所述第一图像帧的中间特征信息。
  3. 根据权利要求2所述的方法,其中,所述第一特征信息包括c个通道的特征,所述c为正整数;
    所述根据所述特征融合策略对所述第一特征信息进行特征融合处理,得到处理后的第一特征信息,包括:
    对于所述第一特征信息中第i个通道的特征,采用学习到的卷积核对所述第一图像帧中所述第i个通道的特征,以及所述其它图像帧中所述第i个通道的特 征进行卷积操作,得到所述第一图像帧中所述第i个通道的处理后的特征,所述i为小于或等于所述c的正整数;
    根据所述第一图像帧中各个通道的处理后的特征,得到所述处理后的第一特征信息;
    其中,所述卷积核用于定义所述第一图像帧中所述第i个通道的特征所对应的特征融合策略。
  4. 根据权利要求1所述的方法,其中,所述根据所述n个图像帧各自的特征信息,确定所述视频的分类结果,包括:
    根据所述n个图像帧各自的特征信息,得到所述n个图像帧对应的n个分类结果;
    根据所述n个分类结果,确定所述视频的分类结果。
  5. 根据权利要求4所述的方法,其中,所述根据所述n个图像帧各自的特征信息,得到所述n个图像帧对应的n个分类结果,包括:
    对于所述n个图像帧中的第j个图像帧,对所述第j个图像帧的特征信息做降维处理,得到降维后的第j个图像帧的特征信息;
    通过n个分类器中的第j个分类器根据所述降维后的第j个图像帧的特征信息,得到所述第j个图像帧对应的分类结果,所述j为小于或等于所述n的正整数。
  6. 根据权利要求4所述的方法,其中,所述根据所述n个分类结果,确定所述视频的分类结果,包括:
    将所述n个分类结果与所述n个分类结果各自对应的权重乘积之和,确定为所述视频的分类结果。
  7. 根据权利要求1至6任一项所述的方法,其中,所述从所述视频中选取n个图像帧,包括:
    按照预设帧率从所述视频中抽取图像帧,得到视频帧序列;
    将所述视频帧序列平均分成n个子序列;
    从所述n个子序列中的每一个序列中抽取一个图像帧,得到所述n个图像帧。
  8. 一种视频分类模型的训练方法,应用于计算机设备中,所述方法包括:
    获取视频分类模型的训练数据,所述训练数据包括至少一个样本视频;
    从所述样本视频中选取n个样本图像帧,所述n为正整数;
    通过所述视频分类模型中的特征提取网络根据特征融合策略,提取所述n个样本图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个样本图像帧中的第一样本图像帧在融合所述n个样本图像帧中的其它样本图像帧的特征信息时,各个样本图像帧的特征信息所占的比例;
    根据所述n个样本图像帧各自的特征信息,确定所述样本视频的预测分类结果;
    根据所述预测分类结果和所述样本视频的标准分类结果,对所述视频分类模型进行训练。
  9. 根据权利要求8所述的方法,其中,所述根据所述预测分类结果和所述样本视频的标准分类结果,对所述视频分类模型进行训练,包括:
    根据所述预测分类结果和所述标准分类结果,计算所述视频分类模型对应的损失函数值;
    根据所述损失函数值调整所述特征融合策略。
  10. 根据权利要求9所述的方法,其中,所述根据所述损失函数值调整所述特征融合策略,包括:
    根据所述损失函数值调整卷积核的参数,所述卷积核用于定义所述第一样本图像帧中第i个通道的特征所对应的特征融合策略,所述i为正整数。
  11. 根据权利要求9所述的方法,其中,所述视频分类模型还包括n个分类器;
    所述根据所述预测分类结果和所述标准分类结果,计算所述视频分类模型对应的损失函数值之后,还包括:
    根据所述损失函数值调整所述n个分类器的参数,所述n个分类器中的第h个分类器用于根据所述第一样本图像帧的特征信息,得到所述第一样本图像帧对应的预测分类结果,所述h为小于或等于所述n的正整数。
  12. 一种视频分类装置,所述装置包括:
    视频获取模块,用于获取视频;
    图像帧选取模块,用于从所述视频中选取n个图像帧,所述n为正整数;
    特征提取模块,用于通过特征提取网络根据学习到的特征融合策略,提取所述n个图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个图像帧中的第一图像帧在融合所述n个图像帧中的其它图像帧的特征信息时,各个图像帧的特征信息所占的比例;
    视频分类模块,用于根据所述n个图像帧各自的特征信息,确定所述视频的分类结果。
  13. 一种视频分类模型的训练装置,所述装置包括:
    数据获取模块,用于获取视频分类模型的训练数据,所述训练数据包括至少一个样本视频;
    图像帧选取模块,用于从所述样本视频中选取n个样本图像帧,所述n为正整数;
    特征提取模块,用于通过所述视频分类模型中的特征提取网络根据特征融合策略,提取所述n个样本图像帧各自的特征信息;其中,所述特征融合策略用于指示所述n个样本图像帧中的第一样本图像帧在融合所述n个样本图像帧中的其它样本图像帧的特征信息时,各个样本图像帧的特征信息所占的比例;
    视频分类模块,用于根据所述n个样本图像帧各自的特征信息,确定所述样本视频的预测分类结果;
    模型训练模块,用于根据所述预测分类结果和所述样本视频的标准分类结果,对所述视频分类模型进行训练。
  14. 一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、 所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至7任一项所述的方法,或实现如权利要求8至11任一项所述的方法。
  15. 一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至7任一项所述的方法,或实现如权利要求8至11任一项所述的方法。
PCT/CN2020/117358 2019-11-15 2020-09-24 视频分类方法、模型训练方法、装置、设备及存储介质 WO2021093468A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20886816.6A EP3989111A4 (en) 2019-11-15 2020-09-24 VIDEO CLASSIFICATION METHOD AND DEVICE, MODEL TRAINING METHOD AND DEVICE, DEVICE AND STORAGE MEDIA
US17/515,164 US11967151B2 (en) 2019-11-15 2021-10-29 Video classification method and apparatus, model training method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911121362.5A CN110929622B (zh) 2019-11-15 2019-11-15 视频分类方法、模型训练方法、装置、设备及存储介质
CN201911121362.5 2019-11-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/515,164 Continuation US11967151B2 (en) 2019-11-15 2021-10-29 Video classification method and apparatus, model training method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021093468A1 true WO2021093468A1 (zh) 2021-05-20

Family

ID=69853121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117358 WO2021093468A1 (zh) 2019-11-15 2020-09-24 视频分类方法、模型训练方法、装置、设备及存储介质

Country Status (4)

Country Link
US (1) US11967151B2 (zh)
EP (1) EP3989111A4 (zh)
CN (1) CN110929622B (zh)
WO (1) WO2021093468A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449784A (zh) * 2021-06-18 2021-09-28 宜通世纪科技股份有限公司 基于先验属性图谱的图像多分类方法、装置、设备及介质
CN114491272A (zh) * 2022-02-14 2022-05-13 北京有竹居网络技术有限公司 一种多媒体内容推荐方法及装置

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929622B (zh) 2019-11-15 2024-01-05 腾讯科技(深圳)有限公司 视频分类方法、模型训练方法、装置、设备及存储介质
CN111444878B (zh) * 2020-04-09 2023-07-18 Oppo广东移动通信有限公司 一种视频分类方法、装置及计算机可读存储介质
CN111881726B (zh) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 一种活体检测方法、装置及存储介质
CN111901639A (zh) * 2020-07-31 2020-11-06 上海博泰悦臻电子设备制造有限公司 多媒体视频上传方法、装置及系统、存储介质及平台
CN112016683B (zh) * 2020-08-04 2023-10-31 杰创智能科技股份有限公司 数据增强学习、训练方法、电子设备、可读存储介质
CN112132070B (zh) * 2020-09-27 2024-06-04 上海高德威智能交通系统有限公司 驾驶行为分析方法、装置、设备及存储介质
US11736545B2 (en) * 2020-10-16 2023-08-22 Famous Group Technologies Inc. Client user interface for virtual fan experience
CN112507920B (zh) * 2020-12-16 2023-01-24 重庆交通大学 一种基于时间位移和注意力机制的考试异常行为识别方法
CN112784734A (zh) 2021-01-21 2021-05-11 北京百度网讯科技有限公司 一种视频识别方法、装置、电子设备和存储介质
CN112926472A (zh) * 2021-03-05 2021-06-08 深圳先进技术研究院 视频分类方法、装置及设备
CN112862005B (zh) * 2021-03-19 2023-08-01 北京百度网讯科技有限公司 视频的分类方法、装置、电子设备和存储介质
CN113435270A (zh) * 2021-06-10 2021-09-24 上海商汤智能科技有限公司 目标检测方法、装置、设备及存储介质
CN113449148B (zh) * 2021-06-24 2023-10-20 北京百度网讯科技有限公司 视频分类方法、装置、电子设备及存储介质
CN113516080A (zh) * 2021-07-16 2021-10-19 上海高德威智能交通系统有限公司 一种行为检测方法和装置
CN113449700B (zh) * 2021-08-30 2021-11-23 腾讯科技(深圳)有限公司 视频分类模型的训练、视频分类方法、装置、设备及介质
CN115205763B (zh) * 2022-09-09 2023-02-17 阿里巴巴(中国)有限公司 视频处理方法及设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390154A (zh) * 2013-07-31 2013-11-13 中国人民解放军国防科学技术大学 基于进化多特征提取的人脸识别方法
US20140328537A1 (en) * 2011-09-14 2014-11-06 Eads Deutschland Gmbh Automatic Learning Method for the Automatic Learning of Forms of Appearance of Objects in Images
CN106599907A (zh) * 2016-11-29 2017-04-26 北京航空航天大学 多特征融合的动态场景分类方法与装置
CN108898067A (zh) * 2018-06-06 2018-11-27 北京京东尚科信息技术有限公司 确定人和物关联度的方法、装置及计算机可读存储介质
CN109886951A (zh) * 2019-02-22 2019-06-14 北京旷视科技有限公司 视频处理方法、装置及电子设备
CN109919166A (zh) * 2017-12-12 2019-06-21 杭州海康威视数字技术股份有限公司 获取属性的分类信息的方法和装置
CN110929622A (zh) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 视频分类方法、模型训练方法、装置、设备及存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514432B (zh) * 2012-06-25 2017-09-01 诺基亚技术有限公司 人脸特征提取方法、设备和计算机程序产品
US10068171B2 (en) * 2015-11-12 2018-09-04 Conduent Business Services, Llc Multi-layer fusion in a convolutional neural network for image classification
US11687770B2 (en) * 2018-05-18 2023-06-27 Synaptics Incorporated Recurrent multimodal attention system based on expert gated networks
CN109214375B (zh) * 2018-11-07 2020-11-24 浙江大学 一种基于分段采样视频特征的胚胎妊娠结果预测装置
WO2020102988A1 (zh) * 2018-11-20 2020-05-28 西安电子科技大学 基于特征融合和稠密连接的红外面目标检测方法
CN109977793B (zh) * 2019-03-04 2022-03-04 东南大学 基于变尺度多特征融合卷积网络的路侧图像行人分割方法
CN110070511B (zh) * 2019-04-30 2022-01-28 北京市商汤科技开发有限公司 图像处理方法和装置、电子设备及存储介质
CN110222700A (zh) * 2019-05-30 2019-09-10 五邑大学 基于多尺度特征与宽度学习的sar图像识别方法及装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140328537A1 (en) * 2011-09-14 2014-11-06 Eads Deutschland Gmbh Automatic Learning Method for the Automatic Learning of Forms of Appearance of Objects in Images
CN103390154A (zh) * 2013-07-31 2013-11-13 中国人民解放军国防科学技术大学 基于进化多特征提取的人脸识别方法
CN106599907A (zh) * 2016-11-29 2017-04-26 北京航空航天大学 多特征融合的动态场景分类方法与装置
CN109919166A (zh) * 2017-12-12 2019-06-21 杭州海康威视数字技术股份有限公司 获取属性的分类信息的方法和装置
CN108898067A (zh) * 2018-06-06 2018-11-27 北京京东尚科信息技术有限公司 确定人和物关联度的方法、装置及计算机可读存储介质
CN109886951A (zh) * 2019-02-22 2019-06-14 北京旷视科技有限公司 视频处理方法、装置及电子设备
CN110929622A (zh) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 视频分类方法、模型训练方法、装置、设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3989111A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449784A (zh) * 2021-06-18 2021-09-28 宜通世纪科技股份有限公司 基于先验属性图谱的图像多分类方法、装置、设备及介质
CN113449784B (zh) * 2021-06-18 2024-04-05 宜通世纪科技股份有限公司 基于先验属性图谱的图像多分类方法、装置、设备及介质
CN114491272A (zh) * 2022-02-14 2022-05-13 北京有竹居网络技术有限公司 一种多媒体内容推荐方法及装置
CN114491272B (zh) * 2022-02-14 2023-09-12 北京有竹居网络技术有限公司 一种多媒体内容推荐方法及装置

Also Published As

Publication number Publication date
EP3989111A4 (en) 2022-08-31
US11967151B2 (en) 2024-04-23
CN110929622B (zh) 2024-01-05
EP3989111A1 (en) 2022-04-27
US20220051025A1 (en) 2022-02-17
CN110929622A (zh) 2020-03-27

Similar Documents

Publication Publication Date Title
WO2021093468A1 (zh) 视频分类方法、模型训练方法、装置、设备及存储介质
WO2021017606A1 (zh) 视频处理方法、装置、电子设备及存储介质
CN110533097B (zh) 一种图像清晰度识别方法、装置、电子设备及存储介质
US11403531B2 (en) Factorized variational autoencoders
CN110889672B (zh) 一种基于深度学习的学生打卡及上课状态的检测系统
CN108229280A (zh) 时域动作检测方法和系统、电子设备、计算机存储介质
WO2019023500A1 (en) PERCEPTUAL APPARATUS IMPLEMENTED BY COMPUTER
CN112418292B (zh) 一种图像质量评价的方法、装置、计算机设备及存储介质
CN111368672A (zh) 一种用于遗传病面部识别模型的构建方法及装置
CN112052837A (zh) 基于人工智能的目标检测方法以及装置
CN110598638A (zh) 模型训练方法、人脸性别预测方法、设备及存储介质
WO2021184754A1 (zh) 视频对比方法、装置、计算机设备和存储介质
CA3193958A1 (en) Processing images using self-attention based neural networks
CN110569814A (zh) 视频类别识别方法、装置、计算机设备及计算机存储介质
CN112257665A (zh) 图像内容的识别方法、图像识别模型的训练方法及介质
WO2021169642A1 (zh) 基于视频的眼球转向确定方法与系统
CN113254491A (zh) 一种信息推荐的方法、装置、计算机设备及存储介质
KR20190128933A (ko) 시공간 주의 기반 감정 인식 장치 및 방법
CN113763385A (zh) 视频目标分割方法、装置、设备及介质
CN114333049A (zh) 猪只攻击行为识别方法、系统、计算机设备和存储介质
CN115578770A (zh) 基于自监督的小样本面部表情识别方法及系统
CN113570689A (zh) 人像卡通化方法、装置、介质和计算设备
CN110457523B (zh) 封面图片的选取方法、模型的训练方法、装置及介质
CN112115744A (zh) 点云数据的处理方法及装置、计算机存储介质、电子设备
CN113705293A (zh) 图像场景的识别方法、装置、设备及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20886816

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 20886816.6

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2020886816

Country of ref document: EP

Effective date: 20220124

NENP Non-entry into the national phase

Ref country code: DE