WO2023016290A1 - Video classification method and apparatus, readable medium and electronic device - Google Patents

Video classification method and apparatus, readable medium and electronic device

Info

Publication number
WO2023016290A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
transformation
target
transformed
videos
Prior art date
Application number
PCT/CN2022/109470
Other languages
English (en)
Chinese (zh)
Inventor
佘琪
沈铮阳
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023016290A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Definitions

  • the present disclosure relates to the technical field of video processing, and in particular, to a video classification method, device, readable medium and electronic equipment.
  • video classification usually uses an end-to-end CNN (Convolutional Neural Network) model to learn the implicit spatiotemporal relationships in a video for video classification.
  • the present disclosure provides a video classification method, the method comprising:
  • the video to be classified is transformed through the target transformation group to obtain multiple transformed videos;
  • the video classification model is used to determine the target video feature corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video feature; the target video feature is a feature of the transformed videos that has transformation invariance.
  • the present disclosure provides a video classification device, the device comprising:
  • the transformation module is used to transform the video to be classified by the target transformation group to obtain a plurality of transformed videos
  • a determining module configured to determine the video classification result of the video to be classified through the trained video classification model according to a plurality of transformed videos
  • the video classification model is used to determine the target video feature corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video feature; the target video feature is a feature of the transformed videos that has transformation invariance.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in the first aspect of the present disclosure are implemented.
  • an electronic device including:
  • a storage device on which a computer program is stored;
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect of the present disclosure.
  • Fig. 1 is a flowchart of a video classification method according to an exemplary embodiment;
  • Fig. 2 is a flowchart of step 101 according to the embodiment shown in Fig. 1;
  • Fig. 3 is a flowchart of step 102 according to the embodiment shown in Fig. 1;
  • Fig. 4 is a flowchart of training a video classification model according to an exemplary embodiment;
  • Fig. 5 is a block diagram of a video classification device according to an exemplary embodiment;
  • Fig. 6 is a block diagram of a transformation module according to the embodiment shown in Fig. 5;
  • Fig. 7 is a block diagram of a determination module according to the embodiment shown in Fig. 5;
  • Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” means “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flowchart of a video classification method according to an exemplary embodiment. As shown in Figure 1, the method may include the following steps:
  • Step 101: Transform the video to be classified through the target transformation group to obtain multiple transformed videos.
  • before step 101, a preset transformation group corresponding to each specified type of transformation is generated in advance.
  • the preset transformation group corresponding to each specified type of transformation is a group formed by multiple transformations of the specified type.
  • when the specified type of transformation is a rotation transformation, the preset transformation group can be a rotation group composed of multiple rotation transformations; when the specified type of transformation is a scaling transformation, the preset transformation group can be a scaling group composed of multiple scaling transformations.
  • when the specified type of transformation is an affine transformation, the preset transformation group can be an affine transformation group composed of multiple affine transformations.
  • the video to be classified may be acquired, and a target transformation group is determined from a plurality of pre-generated preset transformation groups according to the video to be classified.
  • One possible way is to select the target transformation group from the plurality of preset transformation groups according to the characteristics of the video to be classified. For example, if the object in the video to be classified (the object may be, for example, a person or a thing) is relatively large in size, a scaling group may be selected as the target transformation group.
  • the video to be classified can be transformed respectively through multiple transformations included in the target transformation group to obtain the transformed video corresponding to each transformation.
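  • As an illustrative sketch of this step (not part of the disclosure), the Python snippet below applies every element of a hypothetical rotation group to a video clip; the (T, C, H, W) tensor layout and the restriction to 90° multiples (so that torch.rot90 can be applied exactly, without interpolation) are assumptions made for brevity.

```python
import torch

# A video clip as a tensor of shape (T, C, H, W): T frames, C channels.
video = torch.randn(16, 3, 112, 112)

# A hypothetical rotation group: rotations by 90°, 180°, 270° and 360°
# (the identity), applied exactly with torch.rot90 on the H, W dims.
rotation_group = [lambda v, k=k: torch.rot90(v, k, dims=(-2, -1))
                  for k in range(1, 5)]

# Transform the video to be classified with each element of the group,
# yielding one transformed video per transformation.
transformed_videos = [g(video) for g in rotation_group]
print(len(transformed_videos))  # 4
```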
  • Step 102: Determine the video classification result of the video to be classified through the trained video classification model according to the multiple transformed videos.
  • the video classification model is used to determine the target video feature corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video feature; the target video feature is a transformation-invariant feature of the transformed videos.
  • the multiple transformed videos may be input into a trained video classification model.
  • the video features of each transformed video are extracted by the video classification model, and maximum pooling is performed over the video features of all the transformed videos to obtain the target video features, which have transformation invariance across the transformed videos.
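  • A minimal sketch of this element-wise max pooling, assuming each sub-network has already produced a 256-dimensional feature vector per transformed video (the dimension is an arbitrary choice). Because the set of features computed over the whole group of transformed copies is the same regardless of which copy came first, their element-wise maximum does not depend on the transformation applied to the input:

```python
import torch

# Hypothetical 256-dimensional features, one per transformed video,
# e.g. the outputs of the weight-shared sub-networks.
features = [torch.randn(256) for _ in range(4)]

# Element-wise maximum over the group dimension: the result no longer
# depends on which transformation produced which feature vector.
target_feature = torch.stack(features, dim=0).max(dim=0).values
print(target_feature.shape)  # torch.Size([256])
```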
  • For example, if the target transformation group is a rotation group, the multiple transformed videos are obtained by applying different rotation transformations to the video to be classified, so the resulting target video features are features of the video to be classified that are invariant to rotation transformations.
  • the video classification model can determine the video classification result of the video to be classified from a plurality of preset video types according to the target video features.
  • the multiple preset video types may include a normal video type and multiple abnormal video types.
  • the present disclosure treats the video to be classified as a whole, and obtains the video classification result of the video to be classified through the target transformation group and the video classification model.
  • the video classification method of the present disclosure may not only be applied to classify videos, but also may be applied to classify images, which is not specifically limited in the present disclosure.
  • the present disclosure first transforms the video to be classified through the target transformation group to obtain multiple transformed videos, and then determines the video classification result of the video to be classified according to the multiple transformed videos through the trained video classification model, wherein the video classification model is used to determine the target video features corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video features.
  • the target video features are features with transformation invariance in the transformed videos.
  • the disclosure uses a video classification model to extract transformation-invariant target video features from multiple transformed videos and performs video classification through these target video features, which can avoid the influence of rotation, scaling, or affine transformations on video classification and improves the accuracy of video classification.
  • Fig. 2 is a flowchart of step 101 according to the embodiment shown in Fig. 1.
  • step 101 may include the following steps:
  • Step 1011: Determine a target transformation group from a plurality of preset transformation groups.
  • the plurality of preset transformation groups include a rotation group, a scaling group, and an affine transformation group, and each preset transformation group includes a plurality of transformation matrices.
  • the object information corresponding to the target object in the video to be classified may be determined first by using a preset recognition algorithm. Then, the target transformation group can be determined from multiple preset transformation groups according to the object information.
  • the object information may include the position, direction and size of the target object, and each preset transformation group may include multiple transformations, and each transformation corresponds to a transformation matrix.
  • a standard image can be preset. If the direction of the target object in the video to be classified differs greatly from the direction of the object in the standard image, the rotation group can be selected as the target transformation group; if the size of the target object in the video to be classified differs greatly from the size of the object in the standard image, the scaling group can be selected as the target transformation group (a heuristic of this kind is sketched below).
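  • The sketch below illustrates such a selection heuristic in Python; the object_info/standard_info fields mirror the position, direction, and size information mentioned above, while the field names and thresholds are illustrative assumptions rather than values from the disclosure.

```python
def select_target_group(object_info, standard_info,
                        angle_thresh_deg=30.0, size_ratio_thresh=1.5):
    """Pick a preset transformation group by comparing the target object
    against the object in a preset standard image (illustrative only)."""
    angle_diff = abs(object_info["direction"] - standard_info["direction"])
    ratio = object_info["size"] / standard_info["size"]
    size_ratio = max(ratio, 1.0 / ratio)

    if angle_diff > angle_thresh_deg:
        return "rotation_group"    # directions differ a lot -> rotations
    if size_ratio > size_ratio_thresh:
        return "scaling_group"     # sizes differ a lot -> scalings
    return "affine_group"          # fall back to the most general group
```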
  • Step 1012: Transform the video to be classified through each target transformation matrix in the target transformation group to obtain the transformed video corresponding to each target transformation matrix.
  • Take as an example that the target transformation group is the rotation group and that the rotation group includes 4 rotation transformations which rotate the video to be classified clockwise by 45°, 90°, 135°, and 180°, respectively. After the target transformation group is determined, the video to be classified can be transformed through the target transformation matrix corresponding to each rotation transformation in the group (the transformation here is a rotation transformation), yielding 4 transformed videos rotated clockwise by 45°, 90°, 135°, and 180°.
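  • A sketch that makes this example concrete: it builds the 2×3 affine matrix for each rotation and applies it to every frame via torch's grid sampling. The sign convention (whether a positive angle is clockwise) depends on the coordinate system, and zero padding at the borders is an arbitrary choice; both are assumptions of the sketch, not of the disclosure.

```python
import math
import torch
import torch.nn.functional as F

def rotate_video(video, angle_deg):
    """Rotate every frame of a (T, C, H, W) clip by angle_deg degrees,
    using an explicit rotation matrix as the target transformation matrix."""
    a = math.radians(angle_deg)
    theta = torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                          [math.sin(a),  math.cos(a), 0.0]])
    theta = theta.unsqueeze(0).expand(video.size(0), -1, -1)  # one per frame
    grid = F.affine_grid(theta, video.size(), align_corners=False)
    return F.grid_sample(video, grid, align_corners=False)

video = torch.randn(16, 3, 112, 112)
# One transformed video per target transformation matrix in the group.
transformed = [rotate_video(video, a) for a in (45.0, 90.0, 135.0, 180.0)]
```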
  • Fig. 3 is a flowchart of step 102 according to the embodiment shown in Fig. 1.
  • the video classification model includes a Siamese network, a maximum pooling layer, and a classifier.
  • the Siamese network includes multiple neural sub-networks, and the transformed videos are in one-to-one correspondence with the neural sub-networks.
  • Step 102 may include the following steps:
  • Step 1021: Input each transformed video into the neural sub-network corresponding to the transformed video, so as to extract video features of the transformed video.
  • Step 1022: Perform maximum pooling processing on the video features of each transformed video through the maximum pooling layer to obtain target video features.
  • Step 1023: Use the classifier to determine the video classification result according to the target video features.
  • the video classification model can be constructed from a Siamese network, a maximum pooling layer, and a classifier.
  • the Siamese network includes multiple neural sub-networks, and these neural sub-networks share network weights and network parameters.
  • the Siamese network can use a 3D-CNN or a two-stream CNN, and the classifier can use a linear classifier.
  • each transformed video may be input into a neural sub-network corresponding to the transformed video to obtain n-dimensional video features of the transformed video. Then, the video features of each transformed video can be input into the maximum pooling layer, and the maximum pooling layer performs element-by-element maximum pooling operation on the video features of the transformed video obtained by each neural sub-network, and outputs the target video features. Afterwards, the target video feature can be input into the classifier, and the classifier determines the video classification result according to the target video feature.
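  • A compact PyTorch sketch of such a model is shown below. The layer sizes are placeholders and the sub-network is reduced to a toy 3D-CNN; the essential points taken from the description are that one weight-shared sub-network processes each transformed video, an element-wise max pooling merges the resulting features, and a linear classifier produces the result.

```python
import torch
import torch.nn as nn

class InvariantVideoClassifier(nn.Module):
    """Sketch: weight-shared ("Siamese") 3D-CNN sub-network per transformed
    video, element-wise max pooling, then a linear classifier."""

    def __init__(self, feature_dim=256, num_classes=10):
        super().__init__()
        # One sub-network; reusing it for every input realizes the
        # weight/parameter sharing of the Siamese network.
        self.subnet = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feature_dim),
        )
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, transformed_videos):
        # transformed_videos: list of (B, C, T, H, W) tensors, one per
        # element of the target transformation group.
        feats = torch.stack([self.subnet(v) for v in transformed_videos])
        target_feature = feats.max(dim=0).values  # the max pooling layer
        return self.classifier(target_feature)    # the linear classifier
```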
  • Fig. 4 is a flowchart showing a training video classification model according to an exemplary embodiment. As shown in Figure 4, the video classification model is obtained by:
  • Step 201: Acquire a training sample set.
  • the training sample set includes training videos and training video classification results corresponding to the training videos.
  • Step 202: Transform the training video through each preset transformation group to obtain a plurality of training transformed videos corresponding to each preset transformation group.
  • Step 203: Train the preset model according to the multiple training transformed videos and the training video classification results to obtain the video classification model.
  • videos may be collected from business lines, and the collected videos may be divided into a training sample set and a test sample set.
  • the training sample set includes training videos and training video classification results corresponding to the training videos
  • the testing sample set includes testing videos and testing video classification results corresponding to the testing videos.
  • the training video can be transformed through each preset transformation group to obtain multiple training transformed videos corresponding to each preset transformation group.
  • the preset model can include a Siamese network, a maximum pooling layer, and a classifier.
  • the Siamese network can include multiple neural sub-networks.
  • the test sample set can be used to perform a performance test on the obtained video classification model (for example, the performance of the video classification model can be judged by the accuracy of the video classification results it outputs); if the performance of the video classification model does not meet requirements, the video classification model is retrained until its performance meets the requirements.
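  • Under the same assumptions as the sketches above (a hypothetical train_loader yielding (B, C, T, H, W) clips with integer labels, and the rot90-based rotation group), one training iteration could look like this:

```python
import torch
import torch.nn as nn

model = InvariantVideoClassifier(num_classes=5)   # class count is assumed
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Rotation group as callables; rot90 acts on the trailing H, W dims.
group = [lambda v, k=k: torch.rot90(v, k, dims=(-2, -1)) for k in range(1, 5)]

for videos, labels in train_loader:            # hypothetical DataLoader
    transformed = [g(videos) for g in group]   # step 202: transform
    logits = model(transformed)                # forward pass
    loss = loss_fn(logits, labels)             # step 203: train
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```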
  • the present disclosure first transforms the video to be classified through the target transformation group to obtain multiple transformed videos, and then determines the video classification result of the video to be classified according to the multiple transformed videos through the trained video classification model, wherein the video classification model is used to determine the target video features corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video features.
  • the target video features are features with transformation invariance in the transformed videos.
  • the disclosure uses a video classification model to extract transformation-invariant target video features from multiple transformed videos and performs video classification through these target video features, which can avoid the influence of rotation, scaling, or affine transformations on video classification and improves the accuracy of video classification.
  • Fig. 5 is a block diagram of a device for classifying videos according to an exemplary embodiment. As shown in Figure 5, the device 300 includes:
  • the transformation module 301 is configured to transform the video to be classified through the target transformation group to obtain multiple transformed videos.
  • the determination module 302 is configured to determine the video classification result of the video to be classified through the trained video classification model according to the multiple transformed videos.
  • the video classification model is used to determine the target video feature corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video feature; the target video feature is a transformation-invariant feature of the transformed videos.
  • Fig. 6 is a block diagram of a transformation module according to the embodiment shown in Fig. 5 .
  • the transformation module 301 includes:
  • the determination sub-module 3011 is used to determine a target transformation group from a plurality of preset transformation groups; the plurality of preset transformation groups include a rotation group, a scaling group, and an affine transformation group, and each preset transformation group includes a plurality of transformation matrices.
  • the transformation sub-module 3012 is configured to respectively transform the video to be classified through each target transformation matrix in the target transformation group to obtain the transformed video corresponding to each target transformation matrix.
  • the determining submodule 3011 is used for:
  • Object information corresponding to the target object in the video to be classified is determined, and the object information includes the position, direction, and size of the target object.
  • According to the object information, a target transformation group is determined from the plurality of preset transformation groups.
  • Fig. 7 is a block diagram of a determining module according to the embodiment shown in Fig. 5 .
  • the video classification model includes a Siamese network, a maximum pooling layer, and a classifier; the Siamese network includes a plurality of neural sub-networks, and the transformed videos are in one-to-one correspondence with the neural sub-networks.
  • the determination module 302 includes:
  • the feature extraction sub-module 3021 is configured to input each transformed video into a neural sub-network corresponding to the transformed video, so as to extract video features of the transformed video.
  • the pooling sub-module 3022 is configured to perform maximum pooling processing on the video features of each transformed video through the maximum pooling layer to obtain target video features.
  • the classification sub-module 3023 is configured to use the classifier to determine the video classification result according to the target video features.
  • the determination module 302 is used to train the video classification model in the following manner:
  • a training sample set is acquired; the training sample set includes training videos and training video classification results corresponding to the training videos.
  • the training video is transformed through each preset transformation group to obtain a plurality of training transformed videos corresponding to each preset transformation group.
  • the preset model is trained according to the multiple training transformed videos and the training video classification results to obtain the video classification model.
  • the present disclosure first transforms the video to be classified through the target transformation group to obtain multiple transformed videos, and then determines the video classification result of the video to be classified according to the multiple transformed videos through the trained video classification model, wherein the video classification model is used to determine the target video features corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video features.
  • the target video features are features with transformation invariance in the transformed videos.
  • the disclosure uses a video classification model to extract transformation-invariant target video features from multiple transformed videos and performs video classification through these target video features, which can avoid the influence of rotation, scaling, or affine transformations on video classification and improves the accuracy of video classification.
  • FIG. 8 shows a schematic structural diagram of an electronic device 400 (such as a terminal device or a server) suitable for implementing the embodiments of the present disclosure.
  • the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 8 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • an electronic device 400 may include a processing device 401 (such as a central processing unit or a graphics processing unit) that can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403.
  • various programs and data necessary for the operation of the electronic device 400 are also stored in the RAM 403.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • the following devices can be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 408 including, for example, a magnetic tape and a hard disk; and a communication device 409.
  • the communication means 409 may allow the electronic device 400 to perform wireless or wired communication with other devices to exchange data. While FIG. 8 shows electronic device 400 having various means, it should be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 409, or from storage means 408, or from ROM 402.
  • the processing device 401 When the computer program is executed by the processing device 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: transform the video to be classified through the target transformation group to obtain multiple transformed videos; and determine the video classification result of the video to be classified through the trained video classification model according to the multiple transformed videos; wherein the video classification model is used to determine the target video feature corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video feature; the target video feature is a feature with transformation invariance in the transformed videos.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as “C” or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, connected via the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not constitute a limitation of the module itself.
  • the transformation module can also be described as "a module for transforming the video to be classified to obtain multiple transformed videos".
  • the functions described above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), and Complex Programmable Logic Devices (CPLDs).
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • more specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • Example 1 provides a video classification method, including: transforming the video to be classified through the target transformation group to obtain multiple transformed videos; and determining, according to the multiple transformed videos, the video classification result of the video to be classified through a trained video classification model; wherein the video classification model is used to determine the target video features corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video features; the target video features are features with transformation invariance in the transformed videos.
  • Example 2 provides the method of Example 1.
  • the transforming of the video to be classified through the target transformation group to obtain multiple transformed videos includes: determining the target transformation group from a plurality of preset transformation groups, the plurality of preset transformation groups including a rotation group, a scaling group, and an affine transformation group, and each of the preset transformation groups including a plurality of transformation matrices; and respectively transforming the video to be classified through each target transformation matrix in the target transformation group to obtain a transformed video corresponding to each target transformation matrix.
  • Example 3 provides the method of Example 2, wherein the determining of the target transformation group from a plurality of preset transformation groups includes: determining object information corresponding to the target object in the video to be classified, the object information including the position, direction, and size of the target object; and determining, according to the object information, the target transformation group from the plurality of preset transformation groups.
  • Example 4 provides the method of Example 1, wherein the video classification model includes a Siamese network, a maximum pooling layer, and a classifier, the Siamese network includes a plurality of neural sub-networks, and the transformed videos are in one-to-one correspondence with the neural sub-networks; the determining, according to the multiple transformed videos, of the video classification result of the video to be classified through the trained video classification model includes: inputting each transformed video into the neural sub-network corresponding to the transformed video to extract the video features of the transformed video; performing, through the maximum pooling layer, maximum pooling processing on the video features of each transformed video to obtain the target video features; and using the classifier to determine the video classification result according to the target video features.
  • Example 5 provides the method of Example 1, wherein the video classification model is obtained by: acquiring a training sample set, the training sample set including training videos and training video classification results corresponding to the training videos; transforming the training videos through each preset transformation group to obtain a plurality of training transformed videos corresponding to each of the preset transformation groups; and training the preset model according to the multiple training transformed videos and the training video classification results to obtain the video classification model.
  • Example 6 provides a video classification device, the device including: a transformation module used to transform the video to be classified through a target transformation group to obtain multiple transformed videos; and a determination module used to determine the video classification result of the video to be classified through a trained video classification model according to the multiple transformed videos; wherein the video classification model is used to determine the target video feature corresponding to the transformed videos according to the transformed videos, and to determine the video classification result according to the target video feature; the target video feature is a feature with transformation invariance in the transformed videos.
  • Example 7 provides the device of Example 6, wherein the transformation module includes: a determining sub-module configured to determine the target transformation group from a plurality of preset transformation groups, the plurality of preset transformation groups including a rotation group, a scaling group, and an affine transformation group, and each of the preset transformation groups including a plurality of transformation matrices; and a transformation sub-module used to respectively transform the video to be classified through each target transformation matrix in the target transformation group to obtain transformed videos corresponding to each of the target transformation matrices.
  • Example 8 provides the device of Example 6, wherein the video classification model includes a Siamese network, a maximum pooling layer, and a classifier, the Siamese network includes a plurality of neural sub-networks, and the transformed videos are in one-to-one correspondence with the neural sub-networks; the determination module includes: a feature extraction sub-module used to input each transformed video into the neural sub-network corresponding to the transformed video, so as to extract the video features of the transformed video; a pooling sub-module used to perform maximum pooling processing on the video features of each transformed video through the maximum pooling layer to obtain the target video features; and a classification sub-module used to determine, through the classifier, the video classification result according to the target video features.
  • Example 9 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in any one of Examples 1 to 5 are implemented.
  • Example 10 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device to implement the steps of the method described in any one of Examples 1 to 5.

Abstract

The present disclosure belongs to the technical field of video processing, and relates to a video classification method and apparatus, a readable medium, and an electronic device. The method comprises: transforming, by means of a target transformation group, a video to be classified to obtain a plurality of transformed videos; and determining, according to the plurality of transformed videos, a video classification result of the video by means of a trained video classification model, the video classification model being used to determine, according to the transformed videos, target video features corresponding to the transformed videos, and to determine the video classification result according to the target video features, the target video features being features that have transformation invariance in the transformed videos. In the present disclosure, the video classification model is used to extract, from the plurality of transformed videos, the target video features having transformation invariance, and video classification is performed by means of the target video features, which can prevent rotation, scaling, or affine transformation from affecting video classification, and can improve the accuracy of video classification.
PCT/CN2022/109470 2021-08-12 2022-08-01 Video classification method and apparatus, readable medium and electronic device WO2023016290A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110926870.1 2021-08-12
CN202110926870.1A CN113705386A (zh) 2021-08-12 2021-08-12 视频分类方法、装置、可读介质和电子设备

Publications (1)

Publication Number Publication Date
WO2023016290A1 (fr)

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/109470 WO2023016290A1 (fr) 2021-08-12 2022-08-01 Procédé et appareil de classification de vidéo, support lisible et dispositif électronique

Country Status (2)

Country Link
CN (1) CN113705386A (fr)
WO (1) WO2023016290A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705386A (zh) * 2021-08-12 2021-11-26 北京有竹居网络技术有限公司 视频分类方法、装置、可读介质和电子设备

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145927A (zh) * 2017-06-16 2019-01-04 杭州海康威视数字技术股份有限公司 一种对形变图像的目标识别方法及装置
CN109840556B (zh) * 2019-01-24 2020-10-23 浙江大学 一种基于孪生网络的图像分类识别方法
CN110287836B (zh) * 2019-06-14 2021-10-15 北京迈格威科技有限公司 图像分类方法、装置、计算机设备和存储介质
CN110377787B (zh) * 2019-06-21 2022-03-25 北京奇艺世纪科技有限公司 一种视频分类方法、装置及计算机可读存储介质
CN110347876A (zh) * 2019-07-12 2019-10-18 Oppo广东移动通信有限公司 视频分类方法、装置、终端设备及计算机可读存储介质
CN111612093A (zh) * 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 一种视频分类方法、视频分类装置、电子设备及存储介质

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160358A (zh) * 2015-09-07 2015-12-16 苏州大学张家港工业技术研究院 一种图像分类方法及系统
US20210034913A1 (en) * 2018-05-23 2021-02-04 Beijing Sensetime Technology Development Co., Ltd. Method and device for image processing, and computer storage medium
CN108985217A (zh) * 2018-07-10 2018-12-11 常州大学 一种基于深度空间网络的交通标志识别方法及系统
WO2020221278A1 (fr) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Procédé d'entraînement de modèles, procédé de classification de vidéos, appareil associé, et dispositif électronique
CN111353548A (zh) * 2020-03-11 2020-06-30 中国人民解放军军事科学院国防科技创新研究院 一种基于对抗空间变换网络的鲁棒特征深度学习方法
CN111401452A (zh) * 2020-03-17 2020-07-10 北京大学 一种基于偏微分算子的等变卷积网络模型的图像分类方法
CN112257753A (zh) * 2020-09-23 2021-01-22 北京大学 基于偏微分算子的广义等变卷积网络模型的图像分类方法
CN112990315A (zh) * 2021-03-17 2021-06-18 北京大学 基于偏微分算子的等变3d卷积网络的3d形状图像分类方法
CN113033677A (zh) * 2021-03-30 2021-06-25 北京有竹居网络技术有限公司 视频分类方法、装置、电子设备和存储介质
CN113705386A (zh) * 2021-08-12 2021-11-26 北京有竹居网络技术有限公司 视频分类方法、装置、可读介质和电子设备

Also Published As

Publication number Publication date
CN113705386A (zh) 2021-11-26

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE