WO2023147778A1 - Action recognition method and apparatus, electronic device, and storage medium - Google Patents

Action recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023147778A1
WO2023147778A1 (PCT/CN2023/074537, CN2023074537W)
Authority
WO
WIPO (PCT)
Prior art keywords
video
action recognition
augmented
videos
segment
Prior art date
Application number
PCT/CN2023/074537
Other languages
English (en)
French (fr)
Inventor
吴捷 (Wu Jie)
Original Assignee
北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.)
Publication of WO2023147778A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Definitions

  • The present disclosure relates to the field of computer technology, for example, to an action recognition method and apparatus, an electronic device, and a storage medium.
  • Action recognition is a fundamental computer vision task that plays a key role in video structure analysis and potential downstream applications. Due to the diversity of action categories, training an action recognition model relies on a sufficient amount of video data covering many types of actions, which incurs enormous costs in screening and annotating video data.
  • The present disclosure provides an action recognition method and apparatus, an electronic device, and a storage medium, which can complete model training based on a small amount of data while ensuring model accuracy, thereby greatly reducing training costs.
  • An embodiment of the present disclosure provides an action recognition method, including: performing augmentation processing on an original video to obtain multiple augmented videos; and extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and outputting an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  • An embodiment of the present disclosure also provides an action recognition device, including:
  • an augmentation module, configured to perform augmentation processing on an original video to obtain multiple augmented videos;
  • a recognition module, configured to extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  • An embodiment of the present disclosure also provides an electronic device, and the electronic device includes:
  • one or more processors;
  • a storage apparatus, configured to store one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the action recognition method according to any embodiment of the present disclosure.
  • Embodiments of the present disclosure also provide a storage medium containing computer-executable instructions, and the computer-executable instructions are used to execute the action recognition method described in any one of the embodiments of the present disclosure when executed by a computer processor.
  • FIG. 1 is a schematic flowchart of an action recognition method provided by Embodiment 1 of the present disclosure
  • FIG. 2 is a schematic flow chart of model training in an action recognition method provided in Embodiment 1 of the present disclosure
  • FIG. 3 is a flowchart of an action recognition method provided in Embodiment 2 of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an action recognition device provided by Embodiment 3 of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by Embodiment 4 of the present disclosure.
  • the term "comprise" and its variations are open-ended, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a schematic flowchart of an action recognition method provided by Embodiment 1 of the present disclosure.
  • The embodiment is applicable to scenarios of performing action recognition on a video, for example, performing action recognition on a video with a model that has been trained on a small amount of data.
  • the method can be executed by an action recognition device, which can be implemented in the form of software and/or hardware, and the device can be configured in an electronic device, such as a computer.
  • the action recognition method provided by this embodiment may include the following steps.
  • Performing action recognition on a video may include identifying the action categories of actions performed by objects in the video, for example, recognizing action categories such as "running, jumping, throwing, and waving" of people in the video.
  • In action recognition, the way different video frames from the same video are input can significantly improve model predictions.
  • Different augmentation methods can be used to obtain multiple augmented videos from the original video, so as to input different video frames of the original video into the action recognition model.
  • Augmenting the original video may include sampling, cropping, rotating, mirroring, and adjusting the brightness of the video frames of the original video.
  • Augmenting the original video to obtain multiple augmented videos may include: cropping multiple video frames of the original video according to a preset cropping rule, and generating multiple spatially augmented videos from the cropped video frames; and/or extracting a preset number of video clips from the original video and using the multiple video clips as multiple temporally augmented videos.
  • The video frames of the original video are cropped according to a preset cropping rule; for example, Three Crop, Ten Crop, or other cropping rules can be used.
  • In the Three Crop setting, each video frame of the original video can be cropped in the same way into three equal-area random regions, and an augmented video can be generated from the same cropped region across the multiple video frames, yielding three video sub-segments.
  • In the Ten Crop setting, five equal-area regions can be cropped from the upper-left corner, lower-left corner, upper-right corner, lower-right corner, and center of each video frame of the original video and then horizontally flipped; an augmented video is generated for each combination of cropped region and flip state across the multiple video frames, yielding ten video sub-segments.
  • Cropping can extract regions of a video frame that represent different spatial semantics, so the original video can be augmented in the spatial dimension, yielding multiple spatially augmented videos.
  • the preset number can be set according to experience values or experimental values, for example, it can be 10 or 24.
  • the number of video frames in each intercepted video segment may be a fixed value, and the fixed value may be determined according to the value of a single batch of video frames processed by the action recognition model. For example, when the action recognition model can process 8 video frames in a single batch, the number of video frames of the intercepted video segment may be 8 frames.
  • The original video can be augmented in the spatial dimension and/or the temporal dimension as described above to obtain multiple augmented videos, which helps the action recognition model mine the characteristics of the action category in the original video from different angles based on these augmented videos and can improve the accuracy of action recognition.
  • Action recognition can include identifying spatial semantics and identifying action rhythms.
  • Spatial semantics can describe information such as the outline and shape of an action; the rhythm of an action can represent the dynamics and time scale of an action, for example, it can be divided into fast rhythm and slow rhythm.
  • Some action categories have similar spatial semantics but different action rhythms. For example, the action categories "walk" and "run" have similar spatial semantics, but "walk" can belong to a slow rhythm while "run" can belong to a fast rhythm. Therefore, recognizing action rhythm is crucial in action recognition.
  • the action recognition model can use a series of temporal convolutions to extract multi-level video features from low to high from each augmented video.
  • fast-paced information and slow-paced information can be captured through video features of different depths, which can help fine-grained distinction of action categories.
  • Outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos may include: for the multiple augmented videos, fusing the multi-level video features that characterize different action rhythms and outputting action category labels of the original video (which can, for example, be represented by numerical values); and determining the action category of the original video according to the multiple action category labels, for example, performing a weighted sum of the values of the multiple action category labels to determine a final category label and taking the action category corresponding to the final category label as the action category of the original video, that is, obtaining the action recognition result of the original video.
  • If the augmented videos include multiple spatially augmented videos and multiple temporally augmented videos, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos may include: outputting the action recognition result of the original video according to the multi-level video features of the spatially augmented videos and the multi-level video features of the temporally augmented videos.
  • Action category labels may be determined based on the multi-level video features of the spatially augmented videos and the multi-level video features of the temporally augmented videos, respectively.
  • The two types of category labels can then be weighted and summed to determine the action category of the original video. Predicting by integrating the spatially and temporally augmented videos facilitates predicting the action category based on spatio-temporal consistency and improves recognition accuracy.
  • By encoding video features of different depths to capture fast-paced and slow-paced information in the video, and aggregating information of multiple rhythms at the feature level, the accuracy of action recognition can be improved.
  • Performing action recognition by combining, through the action recognition model, the multi-level video features of the multiple augmented videos can ensure the consistency of action recognition and improve its accuracy.
  • FIG. 2 is a schematic flowchart of model training in an action recognition method provided by Embodiment 1 of the present disclosure.
  • the action recognition model can be trained based on the following steps.
  • The sample videos may be video data for multiple types of action categories obtained from open-source libraries, and/or video data of multiple action categories collected from subjects with their authorization.
  • Manual annotation may be used to label the action category of the action performed by the object in the video data.
  • S220: Augment the sample videos to obtain multiple augmented sample videos.
  • The process of augmenting the sample videos during training may be consistent with the process of augmenting the original video in actual application.
  • Multiple augmented sample videos can be obtained by applying the above spatial-dimension and/or temporal-dimension augmentation to the sample videos.
  • The process of extracting the multi-level video features of the multiple augmented sample videos during training may be consistent with the process of extracting the multi-level video features of the multiple augmented videos in actual application.
  • The multi-level video features of the multiple augmented sample videos can be extracted through a series of temporal convolutions.
  • Likewise, the multi-level video features characterizing different action rhythms can be fused to output action category labels of the sample video, and the action category of the sample video is determined according to the multiple action category labels.
  • After the action category of a sample video is output, the deviation between the output action category and the action annotation can be calculated.
  • With the goal of making the calculated deviation smaller than a preset value, multiple parameters in the action recognition model can be trained so that the action recognition model learns the logical relationship between multi-level video features and action categories.
  • In action recognition model training, due to the ambiguity and instability of videos and the diversity of action categories, each type of action requires a sufficient amount of video data for video feature modeling.
  • Acquiring such a large amount of training data requires extensive manual screening and annotation of samples.
  • In the training method of the action recognition model disclosed in this embodiment, by extracting and encoding sample video features of different depths, different action rhythms can be captured, which helps capture multi-granularity and task-oriented cues in a data-efficient setting, improving recognition accuracy with few samples; by augmenting the samples and integrating the video features of multiple augmented samples for prediction, few-sample recognition is facilitated and the consistency of action recognition is ensured, which guarantees recognition accuracy.
  • On this basis, model training can be completed with a small amount of data while ensuring model accuracy, which greatly reduces training costs.
  • In the technical solution of the embodiments of the present disclosure, the original video is augmented to obtain multiple augmented videos; multi-level video features of the multiple augmented videos are extracted based on a pre-trained action recognition model, and the action recognition result of the original video is output according to those multi-level video features.
  • Because the action recognition model can extract multi-level video features and can combine the multi-level video features of multiple augmented videos for action recognition, the following can be achieved during training: encoding sample video features of different depths captures different action rhythms, which helps capture multi-granularity and task-oriented cues in a data-efficient setting and improves recognition accuracy with few samples; augmenting the samples and integrating the video features of multiple augmented samples for prediction facilitates few-sample recognition and ensures the consistency of action recognition, which guarantees recognition accuracy.
  • On this basis, model training can be completed with a small amount of data while ensuring model accuracy, which greatly reduces training costs.
  • the embodiments of the present disclosure may be combined with multiple optional solutions in the action recognition method provided in the above embodiments.
  • The action recognition method provided in this embodiment describes the steps of extracting multi-level video features of multiple augmented videos, and the steps of performing action recognition based on the multi-level video features of the multiple augmented videos.
  • Processing video frame features at a lower frame rate and a slower refresh speed helps capture the semantic information provided by a small number of sparse frames.
  • Complementary annotations of multi-granularity motion cues can be captured through subsequent spatial and/or temporal modulation of the multi-level video features.
  • By fusing the outputs of multiple models, the consistency of action recognition can be guaranteed, and recognition robustness and accuracy improved.
  • Extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model includes: sampling each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extracting multi-level video features of the multiple sampled video frames.
  • The preset frame-rate interval can be preset with low-frame-rate sampling as the goal; for example, a video clip of thirty frames per second can be sampled at a frame-rate interval of two frames per second. After the augmented videos are sampled at the preset frame-rate interval, the multi-level video features of the multiple video frames can be extracted by a residual-style feature extraction network to capture fast-paced and slow-paced information.
  • A complete Slow-Fast network can be used to sample the augmented videos and extract the multi-level video features of the sampled video frames.
  • the Slow-Fast network can include a slow path (Slow branch) for processing low frame rate samples and a fast path (Fast branch) for processing high frame rate samples, so that multi-granularity action rhythm information can be extracted.
  • the Fast branch brings little improvement in the performance of action recognition, but it will significantly reduce the speed of model training. Therefore, the network can be simplified by truncating the Fast branch to form a network containing only the Slow branch (called the Slow-Only network) to achieve multi-level video feature extraction with lower frame rates and slower refresh rates.
  • By sampling the augmented videos at a lower frame rate and a slower refresh speed, a small number of sparse frames can be captured, which helps extract richer semantic feature information and improves the accuracy of action recognition.
  • After the multi-level video features of the multiple augmented videos are extracted based on the pre-trained action recognition model, the method also includes: performing spatial modulation and/or temporal modulation on the multi-level video features of the multiple augmented videos. Correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes: outputting the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
  • Spatial modulation of the multi-level video features may include: processing video features of different levels into feature maps of equal size through convolution and similar operations, achieving spatial semantic alignment.
  • Temporal modulation of the multi-level video features may include: sampling the features of different levels along the channel dimension according to a set of different sampling-rate parameters, so as to downsample the features.
  • the feature information of different action rhythms can be aligned and obtained through the spatial modulation layer and the temporal modulation layer.
  • The action recognition model may adopt the network structure of a Temporal Pyramid Network (TPN).
  • The TPN can be a plug-and-play network, and can include a backbone encoding layer, a spatial modulation layer, a temporal modulation layer, an information flow layer, and a prediction layer.
  • The backbone encoding network can be an ordinary residual network or the above Slow-Only network, and can be used to extract video features at different levels; the spatial modulation layer and the temporal modulation layer can align and obtain feature information of different action rhythms; video features at different levels can be fused through the information flow layer to aggregate information of different action rhythms at the feature level; and the prediction layer can pool the video features at different levels and concatenate them along the channel dimension to predict the action category.
  • the multi-level video features can be subjected to subsequent spatial modulation and/or temporal modulation processing to capture supplementary annotations of multi-grained motion cues.
  • During training, spatial modulation and/or temporal modulation can also be performed on the multi-level video features of the multiple augmented sample videos, and the action recognition results can be output based on the modulated multi-level video features, which helps capture multi-granularity, multi-level, and task-oriented supplementary information from a small number of samples.
  • The action recognition model includes at least one model; correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes: determining initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos; and fusing the multiple initial action recognition results to obtain the action recognition result of the original video.
  • Each augmented video can be input into at least one action recognition model for action recognition, or can be input into some action recognition models for action recognition. Moreover, when each augmented video is input into at least one action recognition model for action recognition, an initial action recognition result output by at least one action recognition model according to the multi-level video features of each augmented video can be obtained. When there is an augmented video input to a partial action recognition model, the initial action recognition result of each action recognition model on the input augmented video may be recorded. Wherein, the initial action recognition result may be an initial action category label (for example, may be represented by a numerical value).
  • Fusing the multiple initial action recognition results to obtain the action recognition result of the original video may include, for example: performing a weighted sum of the values of the multiple initial action category labels to determine a final category label, and taking the action category corresponding to the final category label as the action category of the original video, thereby obtaining the action recognition result of the original video.
  • The initial action recognition results can be fused by the following formula to predict the action recognition result of the original video:
  • pred = \frac{1}{N_a M_a + N_b M_b} \left( \sum_{i=1}^{N_a} \sum_{j=1}^{M_a} s(a_{i,j}) + \sum_{p=1}^{N_b} \sum_{q=1}^{M_b} s(b_{p,q}) \right)    (Formula 1)
  • where pred denotes the final action recognition result of the original video and can be output in the form of a label; i is the index of the class-A action recognition models with different network depths, and N_a denotes the total number of class-A action recognition models; j is the index of the different augmented videos input into the class-A action recognition models, and M_a denotes the total number of augmented videos input into the class-A models; s(a_{i,j}) denotes the initial action recognition result output by the i-th class-A model for the j-th augmented video; p is the index of the class-B action recognition models with different network depths, and N_b denotes the total number of class-B action recognition models; q is the index of the different augmented videos input into the class-B action recognition models, and M_b denotes the total number of augmented videos input into the class-B models; and s(b_{p,q}) denotes the initial action recognition result output by the p-th class-B model for the q-th augmented video.
  • By averaging the multiple initial action recognition results, the action recognition result of the original video can be obtained.
  • FIG. 3 is a flowchart of an action recognition method provided in Embodiment 2 of the present disclosure. As shown in FIG. 3, the action recognition method provided by this embodiment may include the following steps.
  • The original video is augmented in the spatial dimension and the temporal dimension to obtain multiple augmented videos.
  • The multiple augmented videos can be selectively input into six action recognition models, which output initial action recognition results respectively. The action recognition models can include two classes: one class can be models of different network depths that extract multi-level video features through Slow-Only (such as Slow-Only101, Slow-Only152, and Slow-Only200); the other class is models of different network depths that spatially and temporally modulate the extracted multi-level video features through TPN (such as TPN101, TPN152, and TPN200).
  • the suffix value of each model can indicate the network depth, for example, the suffix 101 of the Slow-Only101 model can indicate that the network depth of the model is 101 layers.
  • With reference to (Formula 1) above, the initial action recognition results of the augmented videos input into action recognition models 1-3 and the initial action recognition results of the augmented videos input into action recognition models 4-6 are weighted and summed to determine the final category label.
  • Action recognition results can be output.
  • The action category corresponding to the final category label is taken as the action category of the original video, that is, the action recognition result of the original video is obtained; the action category corresponding to the fused value can be taken as the action category of the original video.
  • the technical solutions of the embodiments of the present disclosure describe the steps of extracting multi-level video features of multiple augmented videos, and the steps of performing action recognition based on the multi-level video features of multiple augmented videos.
  • Processing video frame features at a lower frame rate and a slower refresh speed helps capture the semantic information provided by a small number of sparse frames.
  • Complementary annotations of multi-granularity motion cues can be captured through subsequent spatial and/or temporal modulation of the multi-level video features.
  • By fusing the outputs of multiple models, the consistency of action recognition can be guaranteed, and recognition robustness and accuracy improved.
  • The action recognition method provided by the embodiments of the present disclosure and the action recognition method provided by the above embodiments belong to the same disclosed concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and the same technical features have the same effects in this embodiment as in the above embodiments.
  • FIG. 4 is a schematic structural diagram of an action recognition device provided by Embodiment 3 of the present disclosure.
  • The action recognition device provided in this embodiment is applicable to scenarios of performing action recognition on a video, for example, performing action recognition on a video with a model that has been trained on a small amount of data.
  • The action recognition device may include the following modules.
  • The augmentation module 410 is configured to perform augmentation processing on the original video to obtain multiple augmented videos;
  • the recognition module 420 is configured to extract multi-level video features of the multiple augmented videos based on the pre-trained action recognition model, and output the action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  • The augmentation module 410 can be configured to:
  • crop multiple video frames of the original video according to a preset cropping rule, and generate multiple spatially augmented videos from the cropped video frames; and/or,
  • extract a preset number of video clips from the original video, and use the multiple video clips as multiple temporally augmented videos.
  • If the augmented videos include multiple spatially augmented videos and multiple temporally augmented videos, the recognition module 420 can be configured to:
  • output the action recognition result of the original video according to the multi-level video features of the spatially augmented videos and the multi-level video features of the temporally augmented videos.
  • The recognition module 420 may be configured to:
  • sample each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extract multi-level video features of the multiple sampled video frames.
  • The recognition module 420 may also be configured to: after extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model, perform spatial modulation and/or temporal modulation on the multi-level video features of the multiple augmented videos;
  • the recognition module 420 can then be configured to:
  • output the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
  • The action recognition model includes at least one model; correspondingly, the recognition module 420 can be configured to:
  • determine initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos; and
  • fuse the multiple initial action recognition results to obtain the action recognition result of the original video.
  • The action recognition device may also include:
  • a model training module, which can be configured to train the action recognition model based on the following steps:
  • obtaining sample videos and an action annotation for each sample video;
  • augmenting the sample videos to obtain multiple augmented sample videos;
  • extracting multi-level video features of the multiple augmented sample videos based on the action recognition model, and outputting action recognition results of the sample videos according to those features; and
  • training the action recognition model according to the action recognition results of the sample videos and the action annotations.
  • the action recognition device provided by the embodiment of the present disclosure can execute the action recognition method provided by any embodiment of the present disclosure, and has corresponding functional modules and effects for executing the method.
  • The multiple units and modules included in the above device are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the names of the multiple functional units are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of the embodiments of the present disclosure.
  • Referring to FIG. 5, it shows a schematic structural diagram of an electronic device 500 (such as the terminal device or server in FIG. 5) suitable for implementing an embodiment of the present disclosure.
  • The terminal devices in the embodiments of the present disclosure may include mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (PMPs), and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers.
  • the electronic device shown in FIG. 5 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • The electronic device 500 may include a processing device 501 (such as a central processing unit or a graphics processing unit), which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from the storage device 508 into a random access memory (RAM) 503.
  • Various programs and data necessary for the operation of the electronic device 500 are also stored in the RAM 503.
  • the processing device 501, ROM 502, and RAM 503 are connected to each other through a bus 504.
  • An input/output (Input/Output, I/O) interface 505 is also connected to the bus 504 .
  • The following devices can be connected to the I/O interface 505: an input device 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; an output device 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 508 including, for example, a magnetic tape and a hard disk; and a communication device 509.
  • The communication device 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 5 shows the electronic device 500 with various devices, it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 509, or from storage means 508, or from ROM 502.
  • When the computer program is executed by the processing device 501, the above functions defined in the action recognition method of the embodiments of the present disclosure are executed.
  • The electronic device provided by the embodiments of the present disclosure and the action recognition method provided by the above embodiments belong to the same disclosed concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same effects as the above embodiments.
  • An embodiment of the present disclosure provides a computer storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the action recognition method provided in the foregoing embodiments is implemented.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus or device, or any combination thereof.
  • The computer-readable storage medium may include: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM), a flash memory (FLASH), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • A computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution apparatus or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution apparatus or device.
  • The program code contained on the computer-readable medium can be transmitted over any suitable medium, including: wires, optical cables, radio frequency (RF), etc., or any suitable combination of the above.
  • The client and the server can communicate using any currently known or future-developed network protocol, such as the Hypertext Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (for example, a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any network currently known or developed in the future.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • The above computer-readable medium carries one or more programs, and when the above one or more programs are executed by the electronic device, the electronic device is caused to: perform augmentation processing on an original video to obtain multiple augmented videos; extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  • Computer program code for carrying out the operations of the present disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a LAN or WAN, or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The names of the units and modules do not, in one case, constitute limitations on the units and modules themselves.
  • The functions described herein above may be executed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution apparatus or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may comprise an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus or device, or any suitable combination of the foregoing.
  • Machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard drives, RAM, ROM, EPROM, flash memory, optical fiber, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • the storage medium may be a non-transitory storage medium.
  • Example 1 provides an action recognition method, which includes: performing augmentation processing on an original video to obtain multiple augmented videos; and extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and outputting an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  • Example 2 provides an action recognition method, further comprising:
  • In some optional implementations, performing augmentation processing on the original video to obtain multiple augmented videos includes:
  • cropping multiple video frames of the original video according to a preset cropping rule, and generating multiple spatially augmented videos from the cropped video frames; and/or,
  • extracting a preset number of video clips from the original video, and using the multiple video clips as multiple temporally augmented videos.
  • Example 3 provides an action recognition method, which further includes:
  • In some optional implementations, if the augmented videos include multiple spatially augmented videos and multiple temporally augmented videos, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes: outputting the action recognition result of the original video according to the multi-level video features of the spatially augmented videos and the multi-level video features of the temporally augmented videos.
  • Example 4 provides an action recognition method, further comprising:
  • In some optional implementations, extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model includes:
  • sampling each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extracting multi-level video features of the multiple sampled video frames.
  • Example 5 provides an action recognition method, further comprising:
  • In some optional implementations, after extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model, the method further includes: performing spatial modulation and/or temporal modulation on the multi-level video features of the multiple augmented videos; correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes: outputting the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
  • Example 6 provides an action recognition method, further comprising:
  • In some optional implementations, the action recognition model includes at least one model; correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes:
  • determining initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos; and
  • fusing the multiple initial action recognition results to obtain the action recognition result of the original video.
  • Example 7 provides an action recognition method, further comprising:
  • In some optional implementations, the action recognition model is trained based on the following steps: obtaining sample videos and an action annotation for each sample video; performing augmentation processing on the sample videos to obtain multiple augmented sample videos; extracting multi-level video features of the multiple augmented sample videos based on the action recognition model, and outputting action recognition results of the sample videos according to the multi-level video features of the multiple augmented sample videos; and training the action recognition model according to the action recognition results of the sample videos and the action annotations.
  • Example 8 provides an action recognition device, which includes:
  • an augmentation module, configured to perform augmentation processing on an original video to obtain multiple augmented videos;
  • a recognition module, configured to extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an action recognition method and apparatus, an electronic device, and a storage medium. The action recognition method includes: performing augmentation processing on an original video to obtain multiple augmented videos; extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and outputting an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.

Description

Action recognition method and apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202210116704.X, filed with the Chinese Patent Office on February 7, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, for example, to an action recognition method and apparatus, an electronic device, and a storage medium.
Background
Action recognition is a fundamental computer vision task that plays a key role in video structure analysis and potential downstream applications. Due to the diversity of action categories, training an action recognition model relies on a sufficient amount of video data covering many types of actions, which incurs enormous costs in screening and annotating video data.
Summary
The present disclosure provides an action recognition method and apparatus, an electronic device, and a storage medium, which can complete model training based on a small amount of data while ensuring model accuracy, thereby greatly reducing training costs.
An embodiment of the present disclosure provides an action recognition method, including:
performing augmentation processing on an original video to obtain multiple augmented videos;
extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and outputting an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
An embodiment of the present disclosure further provides an action recognition apparatus, including:
an augmentation module, configured to perform augmentation processing on an original video to obtain multiple augmented videos;
a recognition module, configured to extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
An embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a storage apparatus, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the action recognition method according to any embodiment of the present disclosure.
An embodiment of the present disclosure further provides a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, are used to execute the action recognition method according to any embodiment of the present disclosure.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an action recognition method provided by Embodiment 1 of the present disclosure;
FIG. 2 is a schematic flowchart of model training in an action recognition method provided by Embodiment 1 of the present disclosure;
FIG. 3 is a flow block diagram of an action recognition method provided by Embodiment 2 of the present disclosure;
FIG. 4 is a schematic structural diagram of an action recognition apparatus provided by Embodiment 3 of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device provided by Embodiment 4 of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be implemented in many forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for understanding the present disclosure. The drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
The steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. In addition, the method implementations may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
As used herein, the term "comprise" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
Concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order or interdependence of the functions performed by these apparatuses, modules, or units.
The modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and should be understood as "one or more" unless the context clearly indicates otherwise.
Embodiment 1
FIG. 1 is a schematic flowchart of an action recognition method provided by Embodiment 1 of the present disclosure. The embodiments of the present disclosure are applicable to scenarios of performing action recognition on a video, for example, to the scenario in which an action recognition model is trained based on a small amount of data and the trained model is then used to perform action recognition on a video. The method may be executed by an action recognition apparatus, which may be implemented in the form of software and/or hardware, and the apparatus may be configured in an electronic device, such as a computer.
As shown in FIG. 1, the action recognition method provided by this embodiment may include the following steps.
S110. Perform augmentation processing on an original video to obtain multiple augmented videos.
Performing action recognition on a video may include identifying the action category of an action performed by an object in the video, for example, recognizing action categories such as "running, jumping, throwing, and waving" of a person in the video. In action recognition, the way different video frames from the same video are input can significantly improve model predictions. In this embodiment, different augmentation methods may be used to obtain multiple augmented videos from the original video, so that different video frames of the original video are input into the action recognition model. Augmenting the original video may include sampling, cropping, rotating, mirroring, and adjusting the brightness of the video frames of the original video.
In some optional implementations, performing augmentation processing on the original video to obtain multiple augmented videos may include: cropping multiple video frames of the original video according to a preset cropping rule, and generating multiple spatially augmented videos from the cropped video frames; and/or extracting a preset number of video clips from the original video and using the multiple video clips as multiple temporally augmented videos.
The multiple video frames of the original video are cropped according to a preset cropping rule; for example, Three Crop, Ten Crop, or other cropping rules may be used. Exemplarily, in the Three Crop setting, each video frame of the original video may be cropped in the same way into three equal-area random regions, and an augmented video may be generated from the same cropped region across the multiple video frames, yielding three video sub-segments. In the Ten Crop setting, five equal-area regions may be cropped from the upper-left corner, lower-left corner, upper-right corner, lower-right corner, and center of each video frame of the original video and then horizontally flipped, and an augmented video may be generated for each combination of cropped region and flip state across the multiple video frames, yielding ten video sub-segments. Cropping can extract regions of a video frame that represent different spatial semantics, thereby augmenting the original video in the spatial dimension and obtaining multiple spatially augmented videos.
The preset number may be set according to empirical or experimental values, for example, 10 or 24. The number of video frames in each extracted video clip may be a fixed value, and the fixed value may be determined according to the number of video frames the action recognition model processes in a single batch. For example, when the action recognition model can process 8 video frames in a single batch, each extracted video clip may contain 8 frames. By extracting video clips from the original video, the original video can be augmented in the temporal dimension, obtaining multiple temporally augmented videos.
In these optional implementations, the original video can be augmented in the spatial dimension and/or the temporal dimension as described above to obtain multiple augmented videos, which helps the action recognition model mine the characteristics of the action category in the original video from different angles based on these augmented videos and can improve the accuracy of action recognition.
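As a non-authoritative illustration of the augmentation just described, the following Python sketch shows one way a Three Crop rule and fixed-length clip extraction could be implemented; the function names, the NumPy array representation, and the 8-frame clip length are illustrative assumptions, not part of the patent.

```python
import numpy as np

def three_crop(frames: np.ndarray, crop: int) -> list:
    """Spatial augmentation: cut the same three equal-area regions from every
    frame of a clip, yielding three spatially augmented videos.
    frames: array of shape (T, H, W, C)."""
    T, H, W, C = frames.shape
    y = (H - crop) // 2
    xs = [0, (W - crop) // 2, W - crop]  # left / center / right regions
    return [frames[:, y:y + crop, x:x + crop, :] for x in xs]

def temporal_clips(frames: np.ndarray, num_clips: int, clip_len: int = 8) -> list:
    """Temporal augmentation: extract num_clips fixed-length clips whose start
    points are spread evenly over the video (8 frames matches the single-batch
    example above)."""
    T = frames.shape[0]
    starts = np.linspace(0, max(T - clip_len, 0), num_clips).astype(int)
    return [frames[s:s + clip_len] for s in starts]

# Example: one 30-frame 224x224 RGB video -> 3 spatial + 10 temporal augmented videos
video = np.zeros((30, 224, 224, 3), dtype=np.uint8)
augmented = three_crop(video, crop=192) + temporal_clips(video, num_clips=10)
```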
S120. Extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
Action recognition may include recognizing spatial semantics and recognizing action rhythm. Spatial semantics can describe information such as the outline and shape of an action; action rhythm can characterize the dynamics and time scale of an action, and may, for example, be divided into fast rhythm and slow rhythm. Some action categories have similar spatial semantics but different action rhythms; for example, the action categories "walk" and "run" have similar spatial semantics, but "walk" may belong to a slow rhythm while "run" may belong to a fast rhythm. Therefore, recognizing action rhythm is crucial in action recognition.
In this embodiment, the action recognition model may use a series of temporal convolutions to extract multi-level video features, from low level to high level, from each augmented video. Thus, within a single model, fast-rhythm and slow-rhythm information can be captured through video features of different depths, which helps distinguish action categories at a fine granularity. In addition, input frames can be fed to the action recognition model at a single rate, without first classifying videos by action rhythm before feeding them to the model, which simplifies the recognition operation and improves recognition efficiency.
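The sketch below illustrates the idea of extracting low-to-high multi-level features with a series of temporal (3D) convolutions; the PyTorch framing, stage widths, and kernel sizes are illustrative assumptions rather than the patent's concrete architecture.

```python
import torch
import torch.nn as nn

class MultiLevelTemporalEncoder(nn.Module):
    """Extract one feature map per level with a stack of temporal convolutions.
    Shallow levels retain fine temporal detail (fast rhythm); deeper levels
    summarize longer context (slow rhythm)."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        widths = [16, 32, 64, 128]  # assumed channel widths, one per level
        stages, c = [], in_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv3d(c, w, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.BatchNorm3d(w),
                nn.ReLU(inplace=True),
            ))
            c = w
        self.stages = nn.ModuleList(stages)

    def forward(self, clip: torch.Tensor) -> list:
        # clip: (N, C, T, H, W); returns the multi-level features, low to high
        feats = []
        for stage in self.stages:
            clip = stage(clip)
            feats.append(clip)
        return feats

features = MultiLevelTemporalEncoder()(torch.randn(1, 3, 8, 224, 224))
```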
Outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos may include: for the multiple augmented videos, fusing the multi-level video features that characterize different action rhythms and outputting action category labels of the original video (which may, for example, be represented by numerical values); and determining the action category of the original video according to the multiple action category labels, for example, performing a weighted sum of the values of the multiple action category labels to determine a final category label, and taking the action category corresponding to the final category label as the action category of the original video, thereby obtaining the action recognition result of the original video.
In some optional implementations, if the augmented videos include multiple spatially augmented videos and multiple temporally augmented videos, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos may include: outputting the action recognition result of the original video according to the multi-level video features of the multiple spatially augmented videos and the multi-level video features of the multiple temporally augmented videos.
In these optional implementations, action category labels may be determined based on the multi-level video features of the multiple spatially augmented videos and the multi-level video features of the multiple temporally augmented videos, respectively. Moreover, the two types of category labels may be weighted and summed to determine the action category of the original video. Making predictions by integrating the spatially and temporally augmented videos facilitates predicting the action category based on spatio-temporal consistency and can improve recognition accuracy.
In the embodiments of the present disclosure, encoding video features of different depths to capture fast-rhythm and slow-rhythm information in the video, and aggregating information of multiple rhythms at the feature level, can improve action recognition accuracy. Performing action recognition by combining, through the action recognition model, the multi-level video features of multiple augmented videos can ensure the consistency of action recognition and improve its accuracy.
Exemplarily, FIG. 2 is a schematic flowchart of model training in an action recognition method provided by Embodiment 1 of the present disclosure. Referring to FIG. 2, in some implementations, the action recognition model may be trained based on the following steps.
S210. Obtain sample videos and an action annotation for each sample video.
The sample videos may be video data for multiple types of action categories obtained from open-source libraries, and/or video data of multiple action categories collected from subjects with their authorization. Manual annotation may be used to label the action category of the action performed by the object in the video data.
S220. Perform augmentation processing on the sample videos to obtain multiple augmented sample videos.
The augmentation process applied to the sample videos during training may be consistent with the augmentation process applied to the original video in actual application. For example, the sample videos may be augmented in the spatial dimension and/or the temporal dimension as described above to obtain multiple augmented sample videos.
S230. Extract multi-level video features of the multiple augmented sample videos based on the action recognition model, and output action recognition results of the sample videos according to the multi-level video features of the multiple augmented sample videos.
The process of extracting the multi-level video features of the multiple augmented sample videos during training may be consistent with the process of extracting the multi-level video features of the multiple augmented videos in actual application. For example, the multi-level video features of the multiple augmented sample videos may be extracted through a series of temporal convolutions. Likewise, for the multiple augmented sample videos, the multi-level video features characterizing different action rhythms may be fused to output action category labels of the sample video, and the action category of the sample video may be determined according to the multiple action category labels.
S240. Train the action recognition model according to the action recognition results of the sample videos and the action annotations.
After the action category of a sample video is output, the deviation between the output action category and the action annotation may be calculated. With the goal of making the calculated deviation smaller than a preset value, multiple parameters in the action recognition model may be trained so that the action recognition model can learn the logical relationship between multi-level video features and action categories.
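A minimal sketch of one such training step follows; cross-entropy is assumed here as the deviation measure between the output action category and the annotation (the text above only speaks of a deviation value), and all names are illustrative.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               clips: torch.Tensor, labels: torch.Tensor) -> float:
    """One step of S210-S240: compare the predicted action category with the
    manual annotation and update the model parameters to reduce the deviation."""
    model.train()
    logits = model(clips)                                # per-class scores
    loss = nn.functional.cross_entropy(logits, labels)   # deviation vs. annotation
    optimizer.zero_grad()
    loss.backward()                                      # propagate the deviation
    optimizer.step()                                     # adjust model parameters
    return loss.item()
```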
In action recognition model training, due to the ambiguity and instability of videos and the diversity of action categories, each type of action requires a sufficient amount of video data for video feature modeling. Acquiring such a large amount of training data requires extensive manual screening and annotation of samples. In the training method of the action recognition model disclosed in this embodiment, however, extracting and encoding sample video features of different depths can capture different action rhythms, which helps capture multi-granularity and task-oriented cues in a data-efficient setting, thereby improving recognition accuracy with few samples; augmenting the samples and integrating the video features of multiple augmented samples for prediction facilitates few-sample recognition and ensures the consistency of action recognition, which can guarantee recognition accuracy. In summary, model training can be completed based on a small amount of data while ensuring model accuracy, greatly reducing training costs.
In the technical solution of the embodiments of the present disclosure, the original video is augmented to obtain multiple augmented videos; multi-level video features of the multiple augmented videos are extracted based on a pre-trained action recognition model, and the action recognition result of the original video is output according to the multi-level video features of the multiple augmented videos. Extracting video features at different levels through the action recognition model yields information characterizing different action rhythms, which can improve action recognition accuracy; performing action recognition by combining the multi-level video features of multiple augmented videos through the action recognition model can ensure the consistency of action recognition and improve its accuracy.
Because the action recognition model can extract multi-level video features and can combine the multi-level video features of multiple augmented videos for action recognition, the following can be achieved during training: encoding sample video features of different depths captures different action rhythms, which helps capture multi-granularity and task-oriented cues in a data-efficient setting and improves recognition accuracy with few samples; augmenting the samples and integrating the video features of multiple augmented samples for prediction facilitates few-sample recognition and ensures the consistency of action recognition, which can guarantee recognition accuracy. In summary, model training can be completed based on a small amount of data while ensuring model accuracy, greatly reducing training costs.
Embodiment 2
The embodiments of the present disclosure may be combined with the multiple optional solutions in the action recognition method provided in the above embodiments. The action recognition method provided in this embodiment describes the steps of extracting multi-level video features of multiple augmented videos and the steps of performing action recognition based on those multi-level video features. Processing video frame features at a lower frame rate and a slower refresh speed helps capture the semantic information provided by a small number of sparse frames. Subsequent spatial and/or temporal modulation of the multi-level video features can capture complementary annotations of multi-granularity motion cues. Fusing the outputs of multiple models ensures the consistency of action recognition and improves recognition robustness and accuracy.
In some optional implementations, extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model includes: sampling each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extracting multi-level video features of the multiple sampled video frames.
The preset frame-rate interval may be set in advance with low-frame-rate sampling as the goal; for example, a thirty-frames-per-second video clip may be sampled at a frame-rate interval of two frames per second. After the augmented videos are sampled at the preset frame-rate interval, the multi-level video features of the multiple video frames may be extracted by a residual-style feature extraction network to capture fast-rhythm and slow-rhythm information.
Exemplarily, a complete Slow-Fast network may be used to sample the augmented videos and extract the multi-level video features of the sampled video frames. The Slow-Fast network may include a slow pathway (Slow branch) that processes low-frame-rate samples and a fast pathway (Fast branch) that processes high-frame-rate samples, so that multi-granularity action rhythm information can be extracted. Research has found that the Fast branch brings only a small improvement in action recognition performance but significantly slows down model training. Therefore, the network may be simplified by truncating the Fast branch to form a network containing only the Slow branch (which may be called a Slow-Only network), achieving multi-level video feature extraction at a lower frame rate and a slower refresh speed.
In these optional implementations, sampling the augmented videos at a lower frame rate and a slower refresh speed captures a small number of sparse frames, which helps extract richer semantic feature information and can improve action recognition accuracy.
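The following sketch illustrates sampling at a preset frame-rate interval using the example rates given above (two frames per second from a thirty-frames-per-second clip); the helper name and array representation are assumptions for illustration.

```python
import numpy as np

def sample_at_interval(frames: np.ndarray, src_fps: int = 30,
                       sample_fps: int = 2) -> np.ndarray:
    """Keep sample_fps frames per second from a src_fps clip: low-frame-rate
    sampling that feeds the Slow-Only path a few sparse frames."""
    stride = src_fps // sample_fps   # keep every 15th frame with the defaults
    return frames[::stride]

clip = np.zeros((30, 224, 224, 3), dtype=np.uint8)  # one second at 30 fps
sparse_frames = sample_at_interval(clip)            # 2 sparse frames remain
```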
In some optional implementations, after extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model, the method further includes: performing spatial modulation and/or temporal modulation on the multi-level video features of the multiple augmented videos; correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes: outputting the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
Spatially modulating the multi-level video features may include: processing the video features of different levels into feature maps of equal size through convolution and similar operations, achieving spatial semantic alignment. Temporally modulating the multi-level video features may include: sampling the features of different levels along the channel dimension according to a set of different sampling-rate parameters, so as to downsample the features. The spatial modulation layer and the temporal modulation layer can align and obtain feature information of different action rhythms. After the multi-level video features are processed by spatial modulation and/or temporal modulation, more granular supplementary information oriented to the action recognition task can be provided, which helps improve the accuracy of action recognition.
Exemplarily, the action recognition model may adopt the network structure of a Temporal Pyramid Network (TPN). The TPN may be a plug-and-play network and may include a backbone encoding layer, a spatial modulation layer, a temporal modulation layer, an information flow layer, and a prediction layer. The backbone encoding network may be an ordinary residual network or the above Slow-Only network, and may be used to extract video features at different levels; the spatial modulation layer and the temporal modulation layer can align and obtain feature information of different action rhythms; the information flow layer can fuse video features at different levels to aggregate information of different action rhythms at the feature level; and the prediction layer can pool the video features at different levels and concatenate them along the channel dimension to predict the action category.
In these optional implementations, subsequent spatial modulation and/or temporal modulation of the multi-level video features can capture complementary annotations of multi-granularity motion cues. In addition, during training of the action recognition model, spatial modulation and/or temporal modulation may also be applied to the multi-level video features of the multiple augmented sample videos, and the action recognition result may be output based on the modulated multi-level video features, which helps capture multi-granularity, multi-level, and task-oriented supplementary information from a small number of samples during training.
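A minimal sketch of the spatial and temporal modulation described above is given below, in the spirit of TPN-style modulation; the concrete layer choices (a 1x3x3 convolution for spatial alignment and temporal max-pooling for downsampling) are assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class SpatialTemporalModulation(nn.Module):
    """Spatial modulation maps a level's features to a common channel width and
    spatial size; temporal modulation downsamples along time at a per-level rate,
    so features of different levels can be aligned before fusion."""
    def __init__(self, in_channels: int, out_channels: int,
                 spatial_stride: int, temporal_rate: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3),
                                 stride=(1, spatial_stride, spatial_stride),
                                 padding=(0, 1, 1))
        self.temporal = nn.MaxPool3d(kernel_size=(temporal_rate, 1, 1),
                                     stride=(temporal_rate, 1, 1))

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(feature))

# Align a shallow (N, 32, 8, 56, 56) feature and a deep (N, 128, 8, 14, 14)
# feature to the same (N, 256, 4, 14, 14) shape before fusion.
modulate_shallow = SpatialTemporalModulation(32, 256, spatial_stride=4, temporal_rate=2)
modulate_deep = SpatialTemporalModulation(128, 256, spatial_stride=1, temporal_rate=2)
```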
In some optional implementations, there is at least one action recognition model; correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes: determining initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos; and fusing the multiple initial action recognition results to obtain the action recognition result of the original video.
Different action recognition models may differ in how they extract multi-level video features, in how they determine action recognition results, and in network depth. Performing action recognition on the original video with different action recognition models can improve the robustness of the recognition results.
Each augmented video may be input into at least one action recognition model for action recognition, or into some of the action recognition models. When each augmented video is input into at least one action recognition model, an initial action recognition result output by the at least one action recognition model according to the multi-level video features of that augmented video can be obtained. When an augmented video is input into only some of the action recognition models, the initial action recognition result of each such model for the input augmented video may be recorded. The initial action recognition result may be an initial action category label (for example, represented by a numerical value).
Fusing the multiple initial action recognition results to obtain the action recognition result of the original video may include, for example: performing a weighted sum of the values of the multiple initial action category labels to determine a final category label, and taking the action category corresponding to the final category label as the action category of the original video, thereby obtaining the action recognition result of the original video.
Exemplarily, the initial action recognition results may be fused by the following formula to predict the action recognition result of the original video:
pred = \frac{1}{N_a M_a + N_b M_b} \left( \sum_{i=1}^{N_a} \sum_{j=1}^{M_a} s(a_{i,j}) + \sum_{p=1}^{N_b} \sum_{q=1}^{M_b} s(b_{p,q}) \right)    (Formula 1)
where pred denotes the final action recognition result of the original video and may be output in the form of a label; i is the index of the class-A action recognition models with different network depths, and N_a denotes the total number of class-A action recognition models; j is the index of the different augmented videos input into the class-A action recognition models, and M_a denotes the total number of augmented videos input into the class-A models; s(a_{i,j}) denotes the initial action recognition result output by the i-th class-A model for the j-th augmented video; p is the index of the class-B action recognition models with different network depths, and N_b denotes the total number of class-B action recognition models; q is the index of the different augmented videos input into the class-B action recognition models, and M_b denotes the total number of augmented videos input into the class-B models; and s(b_{p,q}) denotes the initial action recognition result output by the p-th class-B model for the q-th augmented video.
By averaging the multiple initial action recognition results, the action recognition result of the original video can be obtained.
In these optional implementations, fusing the outputs of multiple models ensures the consistency of action recognition and improves recognition robustness and accuracy.
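The averaging fusion of (Formula 1) can be sketched as follows; representing each initial action recognition result as a vector of per-class scores is an assumption made for illustration.

```python
import numpy as np

def fuse_initial_results(class_a_scores: list, class_b_scores: list) -> int:
    """Average the initial results s(a_{i,j}) and s(b_{p,q}) of every
    model/augmented-video pair, then take the best class as the final label."""
    all_scores = np.stack(class_a_scores + class_b_scores)  # (Na*Ma + Nb*Mb, classes)
    pred = all_scores.mean(axis=0)   # the average in (Formula 1)
    return int(pred.argmax())        # final category label

# Example: 3 Slow-Only-style models and 3 TPN-style models, 2 augmented videos each
a = [np.random.rand(5) for _ in range(6)]
b = [np.random.rand(5) for _ in range(6)]
final_label = fuse_initial_results(a, b)
```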
Exemplarily, FIG. 3 is a flow block diagram of an action recognition method provided by Embodiment 2 of the present disclosure. As shown in FIG. 3, the action recognition method provided by this embodiment may include the following steps.
The original video is augmented in the spatial dimension and the temporal dimension to obtain multiple augmented videos.
The multiple augmented videos may be selectively input into six action recognition models to output initial action recognition results respectively. The action recognition models may include two classes: one class may be models of different network depths that extract multi-level video features through Slow-Only (for example, Slow-Only101, Slow-Only152, and Slow-Only200); the other class is models of different network depths that spatially and temporally modulate the extracted multi-level video features through TPN (for example, TPN101, TPN152, and TPN200). The numeric suffix of each model may indicate the network depth; for example, the suffix 101 of the Slow-Only101 model may indicate that the model has 101 network layers.
The multiple initial action recognition results may be fused. For example, with reference to (Formula 1) above, the initial action recognition results of the augmented videos input into action recognition models 1-3 may be weighted and summed with the initial action recognition results of the augmented videos input into action recognition models 4-6 to determine the final category label.
The action recognition result may be output. The action category corresponding to the final category label is taken as the action category of the original video, that is, the action recognition result of the original video is obtained; the action category corresponding to the fused value may be taken as the action category of the original video.
The technical solution of the embodiments of the present disclosure describes the steps of extracting multi-level video features of multiple augmented videos and the steps of performing action recognition based on those multi-level video features. Processing video frame features at a lower frame rate and a slower refresh speed helps capture the semantic information provided by a small number of sparse frames. Subsequent spatial and/or temporal modulation of the multi-level video features can capture complementary annotations of multi-granularity motion cues. Fusing the outputs of multiple models ensures the consistency of action recognition and improves recognition robustness and accuracy.
In addition, the action recognition method provided by the embodiments of the present disclosure and the action recognition method provided by the above embodiments belong to the same disclosed concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and the same technical features have the same effects in this embodiment as in the above embodiments.
Embodiment 3
FIG. 4 is a schematic structural diagram of an action recognition apparatus provided by Embodiment 3 of the present disclosure. The action recognition apparatus provided by this embodiment is applicable to scenarios of performing action recognition on a video, for example, performing action recognition on a video with a model that has been trained on a small amount of data.
As shown in FIG. 4, the action recognition apparatus provided by this embodiment may include the following modules.
An augmentation module 410, configured to perform augmentation processing on an original video to obtain multiple augmented videos;
a recognition module 420, configured to extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
In some optional implementations, the augmentation module 410 may be configured to:
crop multiple video frames of the original video according to a preset cropping rule, and generate multiple spatially augmented videos from the cropped video frames; and/or,
extract a preset number of video clips from the original video, and use the multiple video clips as multiple temporally augmented videos.
In some optional implementations, if the augmented videos include multiple spatially augmented videos and multiple temporally augmented videos, the recognition module 420 may be configured to:
output the action recognition result of the original video according to the multi-level video features of the multiple spatially augmented videos and the multi-level video features of the multiple temporally augmented videos.
In some optional implementations, the recognition module 420 may be configured to:
sample each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extract multi-level video features of the multiple sampled video frames.
In some optional implementations, the recognition module 420 may further be configured to:
after extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model, perform spatial modulation and/or temporal modulation on the multi-level video features of the multiple augmented videos;
the recognition module 420 may then be configured to:
output the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
In some optional implementations, there is at least one action recognition model; correspondingly, the recognition module 420 may be configured to:
determine initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos;
fuse the multiple initial action recognition results to obtain the action recognition result of the original video.
In some optional implementations, the action recognition apparatus may further include:
a model training module, which may be configured to train the action recognition model based on the following steps:
obtaining sample videos and an action annotation for each sample video;
performing augmentation processing on the sample videos to obtain multiple augmented sample videos;
extracting multi-level video features of the multiple augmented sample videos based on the action recognition model, and outputting action recognition results of the sample videos according to the multi-level video features of the multiple augmented sample videos;
training the action recognition model according to the action recognition results of the sample videos and the action annotations.
The action recognition apparatus provided by the embodiments of the present disclosure can execute the action recognition method provided by any embodiment of the present disclosure, and has the functional modules and effects corresponding to the executed method.
The multiple units and modules included in the above apparatus are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the names of the multiple functional units are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of the embodiments of the present disclosure.
Embodiment 4
Referring now to FIG. 5, it shows a schematic structural diagram of an electronic device 500 (such as the terminal device or server in FIG. 5) suitable for implementing embodiments of the present disclosure. The terminal devices in the embodiments of the present disclosure may include mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (PMPs), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers. The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device 500 may include a processing device 501 (such as a central processing unit or a graphics processing unit), which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. Various programs and data required for the operation of the electronic device 500 are also stored in the RAM 503. The processing device 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following devices may be connected to the I/O interface 505: an input device 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; an output device 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 508 including, for example, a magnetic tape and a hard disk; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 5 shows the electronic device 500 with a variety of devices, it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, where the computer program contains program code for executing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 509, or installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above functions defined in the action recognition method of the embodiments of the present disclosure are executed.
The electronic device provided by the embodiments of the present disclosure and the action recognition method provided by the above embodiments belong to the same disclosed concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same effects as the above embodiments.
Embodiment 5
An embodiment of the present disclosure provides a computer storage medium on which a computer program is stored, and when the computer program is executed by a processor, the action recognition method provided in the above embodiments is implemented.
The above computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus or device, or any combination of the above. The computer-readable storage medium may include: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM), a flash memory (FLASH), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution apparatus or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it may send, propagate, or transmit a program used by or in combination with an instruction execution apparatus or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including: wires, optical cables, radio frequency (RF), etc., or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as the Hypertext Transfer Protocol (HTTP), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any network currently known or developed in the future.
The above computer-readable medium may be included in the above electronic device, or it may exist independently without being assembled into the electronic device.
The above computer-readable medium carries one or more programs, and when the above one or more programs are executed by the electronic device, the electronic device is caused to:
perform augmentation processing on an original video to obtain multiple augmented videos; extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
Computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the methods and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The names of the units and modules do not, in one case, constitute limitations on the units and modules themselves.
The functions described herein above may be executed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus or device, or any suitable combination of the above. Machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fiber, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the above. The storage medium may be a non-transitory storage medium.
According to one or more embodiments of the present disclosure, [Example 1] provides an action recognition method, including:
performing augmentation processing on an original video to obtain multiple augmented videos;
extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and outputting an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
According to one or more embodiments of the present disclosure, [Example 2] provides an action recognition method, further including:
In some optional implementations, performing augmentation processing on the original video to obtain multiple augmented videos includes:
cropping multiple video frames of the original video according to a preset cropping rule, and generating multiple spatially augmented videos from the cropped video frames; and/or,
extracting a preset number of video clips from the original video, and using the multiple video clips as multiple temporally augmented videos.
According to one or more embodiments of the present disclosure, [Example 3] provides an action recognition method, further including:
In some optional implementations, if the augmented videos include multiple spatially augmented videos and multiple temporally augmented videos, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes:
outputting the action recognition result of the original video according to the multi-level video features of the multiple spatially augmented videos and the multi-level video features of the multiple temporally augmented videos.
According to one or more embodiments of the present disclosure, [Example 4] provides an action recognition method, further including:
In some optional implementations, extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model includes:
sampling each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extracting multi-level video features of the multiple sampled video frames.
According to one or more embodiments of the present disclosure, [Example 5] provides an action recognition method, further including:
In some optional implementations, after extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model, the method further includes:
performing spatial modulation and/or temporal modulation on the multi-level video features of the multiple augmented videos;
correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes:
outputting the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
According to one or more embodiments of the present disclosure, [Example 6] provides an action recognition method, further including:
In some optional implementations, the action recognition model includes at least one model; correspondingly, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos includes:
determining initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos;
fusing the multiple initial action recognition results to obtain the action recognition result of the original video.
According to one or more embodiments of the present disclosure, [Example 7] provides an action recognition method, further including:
In some optional implementations, the action recognition model is trained based on the following steps:
obtaining sample videos and an action annotation for each sample video;
performing augmentation processing on the sample videos to obtain multiple augmented sample videos;
extracting multi-level video features of the multiple augmented sample videos based on the action recognition model, and outputting action recognition results of the sample videos according to the multi-level video features of the multiple augmented sample videos;
training the action recognition model according to the action recognition results of the sample videos and the action annotations.
According to one or more embodiments of the present disclosure, [Example 8] provides an action recognition apparatus, including:
an augmentation module, configured to perform augmentation processing on an original video to obtain multiple augmented videos;
a recognition module, configured to extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
The disclosed scope involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present disclosure.
In addition, although multiple operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains multiple implementation details, these should not be construed as limiting the scope of the present disclosure. Some features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, multiple features described in the context of a single embodiment may also be implemented in multiple embodiments, individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or method logical actions, the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.

Claims (10)

  1. An action recognition method, comprising:
    performing augmentation processing on an original video to obtain multiple augmented videos;
    extracting multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and outputting an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  2. The method according to claim 1, wherein performing augmentation processing on the original video to obtain multiple augmented videos comprises at least one of the following:
    cropping multiple video frames of the original video according to a preset cropping rule, and generating multiple spatially augmented videos from the cropped video frames; or,
    extracting a preset number of video clips from the original video, and using the multiple video clips as multiple temporally augmented videos.
  3. The method according to claim 2, wherein, in a case where the multiple augmented videos comprise multiple spatially augmented videos and multiple temporally augmented videos, outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos comprises:
    outputting the action recognition result of the original video according to the multi-level video features of the multiple spatially augmented videos and the multi-level video features of the multiple temporally augmented videos.
  4. The method according to claim 1, wherein extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model comprises:
    sampling each augmented video at a preset frame-rate interval based on the pre-trained action recognition model, and extracting multi-level video features of the multiple sampled video frames.
  5. The method according to claim 1, further comprising, after extracting the multi-level video features of the multiple augmented videos based on the pre-trained action recognition model:
    performing at least one of the following on the multi-level video features of the multiple augmented videos: spatial modulation or temporal modulation;
    wherein outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos comprises:
    outputting the action recognition result of the original video according to the modulated multi-level video features of the multiple augmented videos.
  6. The method according to claim 1, wherein there is at least one action recognition model; and outputting the action recognition result of the original video according to the multi-level video features of the multiple augmented videos comprises:
    determining initial action recognition results output by at least one action recognition model according to the multi-level video features of the multiple augmented videos;
    fusing multiple initial action recognition results to obtain the action recognition result of the original video.
  7. The method according to claim 1, wherein the action recognition model is trained based on the following steps:
    obtaining sample videos and an action annotation for each sample video;
    performing augmentation processing on the sample videos to obtain multiple augmented sample videos;
    extracting multi-level video features of the multiple augmented sample videos based on the action recognition model, and outputting action recognition results of the sample videos according to the multi-level video features of the multiple augmented sample videos;
    training the action recognition model according to the action recognition results of the sample videos and the action annotations.
  8. An action recognition apparatus, comprising:
    an augmentation module, configured to perform augmentation processing on an original video to obtain multiple augmented videos;
    a recognition module, configured to extract multi-level video features of the multiple augmented videos based on a pre-trained action recognition model, and output an action recognition result of the original video according to the multi-level video features of the multiple augmented videos.
  9. An electronic device, comprising:
    at least one processor;
    a storage apparatus, configured to store at least one program,
    wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the action recognition method according to any one of claims 1-7.
  10. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to execute the action recognition method according to any one of claims 1-7.
PCT/CN2023/074537 2022-02-07 2023-02-06 Action recognition method and apparatus, electronic device, and storage medium WO2023147778A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210116704.X 2022-02-07
CN202210116704.XA CN116612524A (zh) 2022-02-07 2022-02-07 Action recognition method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023147778A1 true WO2023147778A1 (zh) 2023-08-10

Family

ID=87553160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074537 WO2023147778A1 (zh) 2022-02-07 2023-02-06 Action recognition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN116612524A (zh)
WO (1) WO2023147778A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280406A (zh) * 2017-12-30 2018-07-13 Guangzhou Haisheng Computer Technology Co., Ltd. Behavior recognition method, system and device based on a segmented two-stream model
CN110321761A (zh) * 2018-03-29 2019-10-11 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Behavior recognition method, terminal device and computer-readable storage medium
CN113128395A (zh) * 2021-04-16 2021-07-16 Chongqing University of Posts and Telecommunications Video action recognition method and system based on a multi-level feature fusion model with hybrid convolution
CN113723169A (zh) * 2021-04-26 2021-11-30 Institute of Automation, Chinese Academy of Sciences SlowFast-based behavior recognition method, system and device
WO2022012239A1 (en) * 2020-07-16 2022-01-20 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Action recognition method and related device, storage medium


Also Published As

Publication number Publication date
CN116612524A (zh) 2023-08-18

Similar Documents

Publication Publication Date Title
CN110321958B (zh) Neural network model training method and video similarity determination method
JP7222008B2 (ja) Video clip search method and apparatus
US11490168B2 (en) Method and apparatus for selecting video clip, server and medium
CN111399729A (zh) Image drawing method and apparatus, readable medium, and electronic device
WO2020107625A1 (zh) Video classification method and apparatus, electronic device, and computer-readable storage medium
WO2022252881A1 (zh) Image processing method and apparatus, readable medium, and electronic device
WO2023185391A1 (zh) Interactive segmentation model training method, annotation data generation method, and device
WO2023103897A1 (zh) Image processing method, apparatus, device, and storage medium
CN113177450A (zh) Behavior recognition method and apparatus, electronic device, and storage medium
WO2023035935A1 (zh) Data processing method and apparatus, electronic device, and storage medium
CN113610034B (zh) Method and apparatus for recognizing person entities in videos, storage medium, and electronic device
CN111680799A (zh) Method and apparatus for processing model parameters
CN113919320A (zh) Early rumor detection method, system, and device based on heterogeneous graph neural networks
WO2023138441A1 (zh) Video generation method, apparatus, device, and storage medium
WO2023147778A1 (zh) Action recognition method and apparatus, electronic device, and storage medium
WO2024012251A1 (zh) Semantic segmentation model training method and apparatus, electronic device, and storage medium
CN112907628A (zh) Video target tracking method and apparatus, storage medium, and electronic device
CN117171573A (zh) Multimodal model training method, apparatus, device, and storage medium
WO2023165390A1 (zh) Zoom special effect generation method, apparatus, device, and storage medium
CN112000842A (zh) Video processing method and apparatus
WO2023202543A1 (zh) Text processing method and apparatus, electronic device, and storage medium
WO2023202361A1 (zh) Video generation method and apparatus, medium, and electronic device
CN111783632A (zh) Face detection method and apparatus for video streams, electronic device, and storage medium
CN111666449B (zh) Video retrieval method and apparatus, electronic device, and computer-readable medium
CN113033552B (zh) Text recognition method and apparatus, and electronic device

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23749357

Country of ref document: EP

Kind code of ref document: A1