WO2021169604A1 - Action information recognition method and device, electronic device, and storage medium - Google Patents

Action information recognition method and device, electronic device, and storage medium

Info

Publication number
WO2021169604A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
level
feature
fusion
convolutional neural
Prior art date
Application number
PCT/CN2020/142510
Other languages
English (en)
French (fr)
Inventor
杨策元
徐英豪
戴勃
石建萍
周博磊
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to KR1020227008074A priority Critical patent/KR20220042467A/ko
Priority to JP2021545743A priority patent/JP2022525723A/ja
Publication of WO2021169604A1 publication Critical patent/WO2021169604A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 20/00 Scenes; scene-specific elements
    • G06V 20/40 Scenes; scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • The present disclosure relates to the technical field of neural networks, and in particular to an action information recognition method, device, electronic device, and storage medium.
  • Action recognition is an important part of computer vision and video understanding.
  • The main goal of action recognition is to determine the behavior category of an object in a video.
  • The present disclosure provides at least an action information recognition method, device, electronic device, and storage medium.
  • In a first aspect, an action information recognition method is provided, including:
  • performing feature extraction on a video to be recognized to obtain multi-level first feature maps;
  • obtaining a second feature map corresponding to the first feature map of each level by adjusting parameter information of the first feature maps, where the parameter information of the second feature maps corresponding to the first feature maps of different levels is the same;
  • adjusting the parameter information of the second feature maps of each level respectively to obtain a third feature map corresponding to the second feature map of each level, where the ratio of the time dimension values of the third feature maps of the levels matches a preset ratio;
  • determining, based on the third feature maps, the action information in the video to be recognized.
  • With the above method, second feature maps corresponding to the first feature maps of each level are obtained by adjusting the parameter information of the first feature maps, and the time dimension values of the second feature maps of each level are adjusted so that the time dimension values of the third feature maps corresponding to the second feature maps of each level follow a proportional relationship. The action information in the video to be recognized can then be determined based on third feature maps with different time dimensions (the different time dimensions reflect different rhythms of the action, yielding action features at different rhythms). This allows the action information to be determined from the video at its original frame rate; since there is no need to adjust the frame rate of the video to be recognized, the amount of computation is reduced and recognition efficiency is improved while recognition accuracy is maintained.
  • In a second aspect, an action information recognition device is provided, including:
  • a feature extraction module, configured to perform feature extraction on a video to be recognized to obtain multi-level first feature maps;
  • a parameter adjustment module, configured to obtain a second feature map corresponding to the first feature map of each level by adjusting parameter information of the first feature maps, where the parameter information of the second feature maps corresponding to the first feature maps of different levels is the same;
  • a time dimension adjustment module, configured to adjust the parameter information of the second feature maps of each level respectively to obtain a third feature map corresponding to the second feature map of each level, where the ratio of the time dimension values of the third feature maps of the levels matches a preset ratio;
  • a determining module, configured to determine, based on the third feature maps, the action information in the video to be recognized.
  • In a third aspect, the present disclosure provides an electronic device, including a processor, a memory, and a bus.
  • The memory stores machine-readable instructions executable by the processor.
  • When the electronic device runs, the processor and the memory communicate through the bus, and when the machine-readable instructions are executed by the processor, the steps of the action information recognition method according to the first aspect or any one of its embodiments are executed.
  • In a fourth aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the action information recognition method according to the first aspect or any one of its embodiments are executed.
  • In a fifth aspect, the present disclosure provides a computer program product including program instructions that, when executed by a processor, cause the processor to execute the steps of the action information recognition method according to the first aspect or any one of its embodiments.
  • FIG. 1 shows a schematic flowchart of an action information recognition method provided by an embodiment of the present disclosure;
  • FIG. 2 shows a schematic flowchart, in an action information recognition method provided by an embodiment of the present disclosure, of a manner of obtaining second feature maps corresponding to the first feature maps of each level by adjusting parameter information of the first feature maps;
  • FIG. 3 shows a schematic flowchart, in an action information recognition method provided by an embodiment of the present disclosure, of a manner of adjusting the parameter information of the second feature maps of each level respectively to obtain third feature maps corresponding to the second feature maps of each level;
  • FIG. 4 shows a schematic flowchart, in an action information recognition method provided by an embodiment of the present disclosure, of a manner of determining, based on the third feature maps, the action information in the video to be recognized;
  • FIG. 5a shows a schematic flowchart, in an action information recognition method provided by an embodiment of the present disclosure, of a manner of sequentially fusing, according to a set fusion order, the third feature maps corresponding to the second feature maps of each level to obtain an intermediate feature map after each fusion;
  • FIG. 5b shows a schematic flowchart of another such manner of sequentially fusing the third feature maps according to a set fusion order to obtain an intermediate feature map after each fusion;
  • FIG. 5c shows a schematic flowchart of another such manner of sequentially fusing the third feature maps according to a set fusion order to obtain an intermediate feature map after each fusion;
  • FIG. 5d shows a schematic flowchart of another such manner of sequentially fusing the third feature maps according to a set fusion order to obtain an intermediate feature map after each fusion;
  • FIG. 6 shows a schematic flowchart, in an action information recognition method provided by an embodiment of the present disclosure, of a manner of obtaining a fourth feature map based on the intermediate feature map after each fusion;
  • FIG. 7 shows a schematic structural diagram of an action information recognition device provided by an embodiment of the present disclosure;
  • FIG. 8 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • The video to be recognized may include dynamic objects.
  • For example, the video to be recognized may include walking humans, running humans, animals hunting prey, and so on.
  • A designed convolutional neural network can be used to recognize the video to be recognized; or, when multiple videos to be recognized are detected, the designed convolutional neural network can cluster the multiple videos based on the categories of actions included in each video.
  • There are rhythm factors in the execution of an action.
  • For example, the rhythm of running is faster than the rhythm of walking; that is, when the same object performs different actions, the corresponding rhythms differ. At the same time, because the physical condition, age, and other factors of the object performing the action differ, different objects performing the same action will also have different rhythms. Therefore, the rhythm of an action plays a vital role in action detection.
  • In one approach, the frame rate of the video to be recognized can be adjusted based on different sampling frequencies to obtain multiple videos with different frame rates corresponding to the video to be recognized.
  • For example, the original frame rate of the video to be recognized is 24 frames per second.
  • The original frame rate of the video to be recognized is adjusted to obtain multiple videos with different frame rates corresponding to the video to be recognized, for example a video with a frame rate of 24 frames per second, a video with a frame rate of 18 frames per second, a video with a frame rate of 12 frames per second, a video with a frame rate of 6 frames per second, and so on. The multiple videos with different frame rates can then be input into an action recognition neural network to determine the detection result corresponding to each video, and the category of the action included in the video to be recognized is determined based on the detection result corresponding to each video.
  • In view of this, the embodiments of the present disclosure propose an action information recognition method that adjusts the parameter information and time dimension values of the feature maps and recognizes the video to be recognized based on the adjusted feature maps. The video can be recognized based only on its original frame rate to determine its action information, which reduces the amount of computation required for recognition and improves recognition efficiency.
  • FIG. 1 is a schematic flowchart of an action information recognition method provided by an embodiment of the present disclosure.
  • The method includes S101-S104.
  • S101: Perform feature extraction on a video to be recognized to obtain multi-level first feature maps.
  • S102: Obtain a second feature map corresponding to the first feature map of each level by adjusting parameter information of the first feature maps; the parameter information of the second feature maps corresponding to the first feature maps of different levels is the same.
  • S103: Adjust the parameter information of the second feature maps of each level respectively to obtain a third feature map corresponding to the second feature map of each level, where the ratio of the time dimension values of the third feature maps of the levels matches a preset ratio.
  • S104: Determine the action information in the video to be recognized based on the third feature maps.
  • With the above method, second feature maps corresponding to the first feature maps of each level are obtained by adjusting the parameter information of the first feature maps, and the time dimension values of the second feature maps of each level are adjusted so that the time dimension values of the third feature maps corresponding to the second feature maps of each level follow a proportional relationship. The action information in the video to be recognized can then be determined based on third feature maps with different time dimensions (the different time dimensions reflect different rhythms of the action, yielding action features at different rhythms). This allows the action information to be determined from the video at its original frame rate; since there is no need to adjust the frame rate of the video to be recognized, the amount of computation is reduced and recognition efficiency is improved while recognition accuracy is maintained.
  • Feature extraction is performed on the video to be recognized to obtain multi-level first feature maps, where the first feature map of the first level is obtained by feature extraction on the video to be recognized, and, for the first feature maps of two adjacent levels, the first feature map of the latter level is obtained by feature extraction on the first feature map of the former level.
  • The feature extraction of the video to be recognized can be performed through multi-level first convolutional neural networks to obtain the first feature map output by the first convolutional neural network of each level.
  • The neural network formed by the multi-level first convolutional neural networks can be any neural network that recognizes the action information contained in the video to be recognized.
  • The neural network that recognizes the action information contained in the video to be recognized can be divided into a multi-stage convolutional neural network, and each stage of the convolutional neural network corresponds to one level of the first convolutional neural network.
  • The structure of the multi-level first convolutional neural network can be set according to actual needs, which is not specifically limited in the embodiments of the present disclosure.
  • For example, the multi-level first convolutional neural network includes a first-level first convolutional neural network, a second-level first convolutional neural network, and a third-level first convolutional neural network.
  • The first-level first convolutional neural network can perform convolution processing on the video to be recognized to obtain the first feature map output by the first-level first convolutional neural network; the first feature map output by the first-level first convolutional neural network is sent to the second-level first convolutional neural network, and the second-level first convolutional neural network performs convolution processing on the received first feature map to obtain the first feature map output by the second-level first convolutional neural network.
  • The first feature map output by the second-level first convolutional neural network is then sent to the third-level first convolutional neural network, and the third-level first convolutional neural network performs convolution processing on the received first feature map to obtain the first feature map output by the third-level first convolutional neural network.
  • Because the first feature map output by the first-level first convolutional neural network has undergone fewer convolution operations, it contains more detailed features and fewer spatial semantic features; the first feature map output by the third-level first convolutional neural network has undergone more convolution operations, so it contains more spatial semantic features (that is, more features related to the action information) and fewer detailed features.
  • The video to be recognized may be any video containing action information, and its duration may be arbitrary; for example, the duration of the video to be recognized may be 10 seconds, 20 seconds, and so on.
  • A video detection duration can be determined based on the multi-level first convolutional neural network.
  • The video to be recognized can then be divided into multiple videos so that the duration of each divided video equals the video detection duration. For example, if the duration of the video to be recognized is 1 minute and the determined video detection duration is 10 seconds, the video to be recognized can be divided into six 10-second videos.
  • Feature extraction is performed on each 10-second video, the action information corresponding to each 10-second video is determined, and the action information of the video to be recognized is then obtained.
  • The first feature map may include four-dimensional parameter information, and the four-dimensional parameter information may be length value × width value × time dimension value × number of channels, where length value × width value is the size of the first feature map.
  • The time dimension value of a neural network represents the number of images that the neural network can process at one time.
  • If the multi-level first convolutional neural network is a three-dimensional convolutional neural network, the first feature map of the video to be recognized may be obtained directly, and the first feature map may include the four-dimensional parameter information.
  • If the multi-level first convolutional neural network is a two-dimensional convolutional neural network, feature extraction can be performed by the multi-level first convolutional neural network to obtain the feature map corresponding to each frame of the video to be recognized, and the feature maps are then combined along the time dimension to obtain the first feature map corresponding to the video to be recognized.
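  • As an illustration of the multi-level feature extraction described above, the following is a minimal sketch. It assumes a PyTorch-style three-dimensional convolutional backbone with three stages whose output channel counts (256/512/1024) and spatial sizes follow the example parameter information used later in this description; the method itself is not limited to this structure, and the tensors are written in N × C × T × H × W layout (the description writes the same information as length × width × time dimension × channels).

```python
# Minimal sketch of a 3-level "first convolutional neural network" (assumed
# stage widths and strides; only for illustrating multi-level first feature maps).
import torch
import torch.nn as nn

class FirstStage(nn.Module):
    """One level: halves height/width, keeps the time dimension, changes channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              stride=(1, 2, 2), padding=1)  # (T, H, W) strides
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

video = torch.randn(1, 3, 24, 400, 400)   # 24 frames of a 400x400 RGB clip

stage1 = FirstStage(3, 256)      # -> (1, 256, 24, 200, 200), i.e. 200x200x24x256
stage2 = FirstStage(256, 512)    # -> (1, 512, 24, 100, 100), i.e. 100x100x24x512
stage3 = FirstStage(512, 1024)   # -> (1, 1024, 24, 50, 50),  i.e. 50x50x24x1024

f1 = stage1(video)   # first feature map of level 1 (extracted from the video)
f2 = stage2(f1)      # first feature map of level 2 (extracted from f1)
f3 = stage3(f2)      # first feature map of level 3 (extracted from f2)
first_feature_maps = [f1, f2, f3]
```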
  • The parameter information of the first feature maps can be adjusted to obtain the second feature map corresponding to the first feature map of each level.
  • That is, the parameter information of the first feature map of the first level, the parameter information of the first feature map of the second level, and the parameter information of the first feature map of the third level are adjusted to be consistent: the length value, and/or width value, and/or time dimension value, and/or number of channels of the first feature maps of the levels are adjusted so that the length value, width value, time dimension value, and number of channels of the resulting second feature maps of the levels are the same.
  • Obtaining the second feature map corresponding to the first feature map of each level includes:
  • determining the first feature map with the smallest size among the parameter information corresponding to the first feature maps of the levels, adjusting the other first feature maps to feature maps with the same parameter information as the first feature map with the smallest size, and taking the first feature map with the smallest size and the adjusted feature maps as the second feature maps; or,
  • adjusting the first feature maps output by the first convolutional neural networks of the levels to feature maps under preset parameter information, and taking the feature maps under the preset parameter information as the second feature maps.
  • For example, if the multi-level first feature maps include the first feature map of the first level, the first feature map of the second level, and the first feature map of the third level, the first feature map with the smallest size (that is, the first feature map with the smallest length value × width value) is determined among them.
  • The parameter information of the first feature map of the first level may be 200 × 200 × 24 × 256;
  • the parameter information of the first feature map of the second level may be 100 × 100 × 24 × 512;
  • the parameter information of the first feature map of the third level may be 50 × 50 × 24 × 1024;
  • then the first feature map of the third level has the smallest size, and the parameter information of each second feature map is 50 × 50 × 24 × 1024.
  • When the preset parameter information is used, the size in the preset parameter information is smaller than or equal to that of the first feature map with the smallest size among the parameter information corresponding to the first feature maps output by the first convolutional neural networks of the levels.
  • For example, the preset parameter information may be 25 × 25 × 24 × 1024, or the preset parameter information may be 50 × 50 × 24 × 1024; the preset parameter information can be set according to actual conditions.
  • By adjusting the first feature maps of the levels to a smaller size, the amount of computation required for recognition can be reduced and recognition efficiency can be improved.
  • Performing feature extraction on the video to be recognized to obtain the multi-level first feature maps includes: performing feature extraction on the video to be recognized through multi-level first convolutional neural networks to obtain the first feature map output by the first convolutional neural network of each level.
  • Obtaining the second feature map corresponding to the first feature map of each level then includes:
  • S201: Based on the determined adjusted parameter information and the parameter information of the first feature map output by the first convolutional neural network of each level, determine the network parameter information of the second convolutional neural network corresponding to the first convolutional neural network of that level;
  • For example, the determined adjusted parameter information may be 50 × 50 × 24 × 1024; the parameter information of the first feature map corresponding to the first-level first convolutional neural network may be 200 × 200 × 24 × 256, the parameter information of the first feature map corresponding to the second-level first convolutional neural network may be 100 × 100 × 24 × 512, and the parameter information of the first feature map corresponding to the third-level first convolutional neural network may be 50 × 50 × 24 × 1024. Based on the determined adjusted parameter information and the parameter information of the first feature map output by each level of the first convolutional neural network, the network parameter information of the second convolutional neural network corresponding to each level of the first convolutional neural network can then be determined, that is, the length × width × time dimension value × number of channels of the convolution kernel in each level of the second convolutional neural network, and the corresponding length movement step × width movement step × time dimension movement step and other information.
  • The parameter information of the second feature map can be determined from the parameter information of the first feature map and the network parameter information of the second convolutional neural network according to formula (1): O = (I - K + 2P) / S + 1, applied to each of the length, width, and time dimensions, where O is the parameter information (size) of the second feature map, I is the parameter information (size) of the first feature map, K is the network parameter information of the convolution kernel of the second convolutional neural network, S is the movement step, and P is the padding number. Therefore, after the parameter information of the first feature map and the parameter information of the second feature map are determined, the network parameters of the corresponding second convolutional neural network can be determined. For example, by setting different length and/or width movement steps for each level of the second convolutional neural network, the parameter information of the second feature maps output by the second convolutional neural networks of the levels can be made the same.
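  • As a quick check of formula (1), the following helper (an illustrative sketch, not part of the claimed method; the kernel, stride, and padding values are assumed) computes the output size of a single dimension and verifies that level-specific spatial strides map the example sizes 200, 100, and 50 onto the common size 50.

```python
def conv_output_size(i: int, k: int, s: int, p: int) -> int:
    """Formula (1): O = (I - K + 2P) / S + 1, applied to one dimension
    (length, width, or time) of a feature map."""
    return (i - k + 2 * p) // s + 1

# Assumed kernel/stride choices that equalize the spatial size of all levels to 50.
assert conv_output_size(200, k=4, s=4, p=0) == 50   # level 1: 200 -> 50
assert conv_output_size(100, k=2, s=2, p=0) == 50   # level 2: 100 -> 50
assert conv_output_size(50,  k=1, s=1, p=0) == 50   # level 3:  50 -> 50
```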
  • The second convolutional neural network carrying the network parameter information corresponding to the first-level first convolutional neural network performs convolution processing on the first feature map corresponding to the first-level first convolutional neural network to obtain the second feature map output by the second convolutional neural network of that level.
  • The second convolutional neural network carrying the network parameter information corresponding to the second-level first convolutional neural network performs convolution processing on the first feature map corresponding to the second-level first convolutional neural network to obtain the second feature map output by the second convolutional neural network of that level.
  • The second convolutional neural network carrying the network parameter information corresponding to the third-level first convolutional neural network performs convolution processing on the first feature map corresponding to the third-level first convolutional neural network to obtain the second feature map output by the second convolutional neural network of that level.
  • This processing adjusts the size in the parameter information of the first feature maps output by the first convolutional neural networks of the levels to a smaller size, so that the amount of computation is reduced and recognition efficiency is improved when the video to be recognized is recognized.
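  • The following sketch shows one possible realization of the per-level second convolutional neural networks described above; it assumes, purely for illustration, that each is a single Conv3d whose spatial kernel and stride were chosen with formula (1) so that every level outputs the same 50 × 50 × 24 × 1024 parameter information.

```python
import torch
import torch.nn as nn

# First feature maps of the three levels (example parameter information from above).
f1 = torch.randn(1, 256, 24, 200, 200)    # 200x200x24x256
f2 = torch.randn(1, 512, 24, 100, 100)    # 100x100x24x512
f3 = torch.randn(1, 1024, 24, 50, 50)     # 50x50x24x1024

second_convs = nn.ModuleList([
    nn.Conv3d(256, 1024, kernel_size=(1, 4, 4), stride=(1, 4, 4)),   # level 1: spatial stride 4
    nn.Conv3d(512, 1024, kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # level 2: spatial stride 2
    nn.Conv3d(1024, 1024, kernel_size=1, stride=1),                  # level 3: 1x1x1 conv
])

second_feature_maps = [conv(f) for conv, f in zip(second_convs, (f1, f2, f3))]
# every element now has shape (1, 1024, 24, 50, 50), i.e. 50x50x24x1024
```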
  • The parameter information of the second feature maps of each level can then be adjusted to obtain the third feature map corresponding to the second feature map of each level, so that the ratio of the time dimension values of the obtained third feature maps of the levels matches the preset ratio.
  • The time dimension value of the third feature map of each level is related to its receptive field. Specifically, the fewer convolution operations a feature map has undergone, the smaller its receptive field, and the larger its corresponding time dimension value can be set, which allows the action information in the video to be recognized to be determined more accurately; conversely, the more convolution operations a feature map has undergone, the larger its receptive field.
  • In that case the corresponding time dimension value can be made smaller, so that the amount of computation is reduced and recognition efficiency is improved while the accuracy of recognizing the video is maintained.
  • For example, the ratio of the time dimension values between the third feature map of the first level and the third feature map of the second level can be set to 1:2, or 2:4, or 3:9.
  • Adjusting the parameter information of the second feature maps of each level respectively to obtain the third feature map corresponding to the second feature map of each level includes:
  • S301: Determine the time dimension value of the third feature map corresponding to each level of the first convolutional neural network based on the ratio of the time dimension values between the first convolutional neural networks of different levels and the time dimension value of the second feature map corresponding to each level of the first convolutional neural network;
  • The ratio of the time dimension values between the first convolutional neural networks of different levels can be set according to actual needs. For example, if the multi-level first convolutional neural network includes a first-level first convolutional neural network, a second-level first convolutional neural network, and a third-level first convolutional neural network, the ratio of the time dimension values between the first convolutional neural networks of the different levels can be 1:2:4, or 1:3:9, and so on.
  • If the ratio is 1:2:4 and the time dimension value of each second feature map is 24, it can be determined that the time dimension value of the third feature map corresponding to the first-level first convolutional neural network is 6, the time dimension value of the third feature map corresponding to the second-level first convolutional neural network is 12, and the time dimension value of the third feature map corresponding to the third-level first convolutional neural network is 24.
  • The network parameter information of the third convolutional neural network corresponding to each level of the first convolutional neural network can be determined according to formula (1) above. For example, by setting a different time dimension movement step for each level of the third convolutional neural network, the time dimension values of the third feature maps output by the third convolutional neural networks of the levels can be made consistent with the set ratio.
  • The third convolutional neural network carrying the network parameter information corresponding to the first-level first convolutional neural network performs convolution processing on the second feature map corresponding to that level to obtain the third feature map output by the third convolutional neural network of that level. By analogy, the third convolutional neural network carrying the network parameter information corresponding to the second-level first convolutional neural network performs convolution processing on the second feature map corresponding to that level to obtain the third feature map output by the third convolutional neural network of that level, and the third convolutional neural network carrying the network parameter information corresponding to the third-level first convolutional neural network performs convolution processing on the second feature map corresponding to that level to obtain the third feature map output by the third convolutional neural network of that level.
  • The time dimension values of the second feature maps corresponding to the levels of the first convolutional neural network are thus adjusted so that the time dimension values of the third feature maps output by the third convolutional neural networks of the levels are consistent with the set ratio (which is equivalent to adjusting the rhythm of the action information included in the video to be recognized). Based on the third feature maps with adjusted time dimension values, the action information included in the video to be recognized can be recognized more accurately, which improves the accuracy of recognition.
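  • The following sketch illustrates one way the per-level third convolutional neural networks could adjust only the time dimension; it assumes, for illustration, a single Conv3d per level whose temporal kernel and stride are chosen with formula (1) so that the time dimension values become 6, 12, and 24, matching the 1:2:4 preset ratio of the example.

```python
import torch
import torch.nn as nn

# Second feature maps of the three levels, each 50x50x24x1024.
second_feature_maps = [torch.randn(1, 1024, 24, 50, 50) for _ in range(3)]

third_convs = nn.ModuleList([
    nn.Conv3d(1024, 1024, kernel_size=(4, 1, 1), stride=(4, 1, 1)),  # 24 -> 6
    nn.Conv3d(1024, 1024, kernel_size=(2, 1, 1), stride=(2, 1, 1)),  # 24 -> 12
    nn.Conv3d(1024, 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1)),  # 24 -> 24
])

third_feature_maps = [conv(x) for conv, x in zip(third_convs, second_feature_maps)]
print([t.shape[2] for t in third_feature_maps])   # [6, 12, 24], ratio 1:2:4
```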
  • The third feature maps corresponding to the first convolutional neural networks of the levels can be fused, and the feature map obtained after fusing the third feature maps can be input into a prediction neural network to obtain the action information included in the video to be recognized. If the video to be recognized includes multiple pieces of action information, each piece of action information included in the video to be recognized can be obtained.
  • Determining the action information in the video to be recognized based on the third feature maps includes:
  • S401: Perform fusion processing on the third feature maps corresponding to the second feature maps of the levels to obtain a fused fourth feature map;
  • S402: Determine the action information in the video to be recognized based on the fourth feature map.
  • The third feature maps of the levels can be fused to obtain a fused fourth feature map, and the action information in the video to be recognized is determined based on the fourth feature map.
  • Because the third feature maps corresponding to the second feature maps of the levels are fused, the obtained fourth feature map can include the features of third feature maps with different time dimension values; recognizing the action information based on the fourth feature map can therefore improve the accuracy of recognition.
  • Fusing the third feature maps corresponding to the second feature maps of the levels to obtain a fused fourth feature map includes:
  • sequentially fusing, according to a set fusion order, the third feature maps corresponding to the second feature maps of the levels to obtain an intermediate feature map after each fusion;
  • obtaining the fourth feature map based on the intermediate feature map after each fusion.
  • A fusion order of the third feature maps can be set, and the third feature maps corresponding to the second feature maps of the levels are sequentially fused according to the set fusion order to obtain the intermediate feature map after each fusion.
  • For example, the set fusion order is: the third feature map corresponding to the first-level first convolutional neural network, the third feature map corresponding to the second-level first convolutional neural network, and the third feature map corresponding to the third-level first convolutional neural network.
  • The third feature map corresponding to the first-level first convolutional neural network can first be fused with the third feature map corresponding to the second-level first convolutional neural network to obtain the intermediate feature map after the first fusion; the obtained intermediate feature map is then fused with the third feature map corresponding to the third-level first convolutional neural network to obtain the intermediate feature map after the second fusion.
  • A fourth feature map can then be obtained based on the intermediate feature map after each fusion.
  • When the third feature map corresponding to the first-level first convolutional neural network is fused with the third feature map corresponding to the second-level first convolutional neural network, the third feature map corresponding to the first-level first convolutional neural network can first be subjected to image interpolation processing, and the interpolated third feature map corresponding to the first-level first convolutional neural network is then fused with the third feature map corresponding to the second-level first convolutional neural network to obtain the intermediate feature map after the first fusion.
  • For subsequent fusion processes, reference can be made to the process of fusing the third feature map corresponding to the first-level first convolutional neural network with the third feature map corresponding to the second-level first convolutional neural network, which is not repeated here.
  • For example, the parameter information of the third feature map corresponding to the first-level first convolutional neural network is 7 × 7 × 1 × 512, and the parameter information of the third feature map corresponding to the second-level first convolutional neural network is 7 × 7 × 2 × 512.
  • The third feature map corresponding to the first-level first convolutional neural network can be subjected to image interpolation processing so that its parameter information after interpolation is 7 × 7 × 2 × 512; the value of each feature point in the interpolated third feature map corresponding to the first-level first convolutional neural network is then summed with the value of the corresponding feature point in the third feature map corresponding to the second-level first convolutional neural network to obtain the intermediate feature map after the first fusion, whose parameter information is 7 × 7 × 2 × 512.
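  • The following sketch reproduces this interpolate-and-sum fusion on the example shapes above; it assumes N × C × T × H × W tensors and nearest-neighbour interpolation along the time dimension, which is only one possible choice of image interpolation processing.

```python
import torch
import torch.nn.functional as F

level1_third = torch.randn(1, 512, 1, 7, 7)   # parameter information 7x7x1x512
level2_third = torch.randn(1, 512, 2, 7, 7)   # parameter information 7x7x2x512

# Interpolate the level-1 map to time dimension 2 so the two maps can be summed.
level1_up = F.interpolate(level1_third, size=(2, 7, 7), mode="nearest")

first_intermediate = level1_up + level2_third  # intermediate map after the first fusion, 7x7x2x512
```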
  • The third feature maps corresponding to the second feature maps of the levels are denoted as the third feature map of the first level to the third feature map of the Nth level, where the time dimension value of the third feature map of the Nth level is greater than the time dimension value of the third feature map of the (N-1)th level, and N is a positive integer greater than 1.
  • Sequentially fusing, according to the set fusion order, the third feature maps corresponding to the second feature maps of the levels to obtain the intermediate feature map after each fusion includes the following modes:
  • Mode 1: According to the fusion order from the third feature map of the first level to the third feature map of the Nth level, the third feature maps of the levels are sequentially fused to obtain the feature map after each fusion, and the third feature map of the first level together with the feature maps after each fusion are taken as the obtained intermediate feature maps.
  • Mode 2: According to the fusion order from the third feature map of the Nth level to the third feature map of the first level, the third feature maps of the levels are sequentially fused to obtain the feature map after each fusion, and the third feature map of the Nth level together with the feature maps after each fusion are taken as the obtained intermediate feature maps.
  • Mode 3: According to the fusion order from the third feature map of the first level to the third feature map of the Nth level, the third feature maps of the levels are sequentially fused to obtain the feature map after each fusion from the first level to the Nth level; the third feature map of the first level and each fused feature map are then separately convolved to obtain the fusion feature map of the first level to the fusion feature map of the Nth level, where the parameter information of the fusion feature map of each level is the same as the parameter information of the corresponding feature map before convolution processing; the fusion feature maps of the levels are then fused in the order from the fusion feature map of the Nth level to the fusion feature map of the first level, and the feature map after each of these fusions together with the fusion feature map of the Nth level are taken as the obtained intermediate feature maps.
  • Mode 4: According to the fusion order from the third feature map of the first level to the third feature map of the Nth level, the third feature maps of the levels are fused to obtain the feature map after each fusion, and the third feature map of the first level together with the feature maps after each fusion in the order from the first level to the Nth level are taken as the obtained first intermediate feature maps; according to the fusion order from the third feature map of the Nth level to the third feature map of the first level, the third feature maps of the levels are fused to obtain the feature map after each fusion, and the third feature map of the Nth level together with the feature maps after each of these fusions are taken as the obtained second intermediate feature maps; the first intermediate feature maps and the second intermediate feature maps are together taken as the obtained intermediate feature maps.
  • As shown in FIG. 5a, the embodiment of the present disclosure illustrates Mode 1 above.
  • The third feature map 501 of the first level and the third feature map 502 of the second level can first be fused to obtain the feature map after the first fusion; the feature map obtained from the first fusion is then fused with the third feature map 503 of the third level to obtain the feature map after the second fusion, and so on, until the feature map after the (N-2)th fusion is fused with the third feature map 504 of the Nth level to obtain the feature map after the (N-1)th fusion. The feature map after the first fusion (the feature map obtained by fusing the third feature map of the first level and the third feature map of the second level), the feature map after the second fusion, ..., the feature map after the (N-1)th fusion, and the third feature map of the first level are taken as the obtained intermediate feature maps.
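  • As a simple illustration of Mode 1, the following sketch (which reuses the assumed interpolate-and-sum fusion from the previous sketch) fuses the third feature maps level by level and collects the level-1 map plus the map after each fusion as the intermediate feature maps.

```python
import torch
import torch.nn.functional as F

def fuse(a, b):
    """Interpolate a to b's time/space size and add element-wise."""
    a = F.interpolate(a, size=b.shape[2:], mode="nearest")
    return a + b

# Third feature maps of levels 1..N with time dimension values 1, 2, 4 (example).
thirds = [torch.randn(1, 512, t, 7, 7) for t in (1, 2, 4)]

intermediates = [thirds[0]]          # the level-1 third feature map is kept as-is
current = thirds[0]
for nxt in thirds[1:]:
    current = fuse(current, nxt)     # fuse with the next level in the set order
    intermediates.append(current)    # keep the feature map after each fusion
# intermediates: level-1 map plus the maps after each of the N-1 fusions
```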
  • As shown in FIG. 5b, the embodiment of the present disclosure illustrates Mode 2 above, in which the fusion proceeds from the third feature map of the Nth level to the third feature map of the first level.
  • As shown in FIG. 5c, the embodiment of the present disclosure illustrates Mode 3 above.
  • The third feature map of the first level and the third feature map of the second level can first be fused to obtain the feature map after the first fusion; the feature map after the first fusion is then fused with the third feature map of the third level to obtain the feature map after the second fusion, and so on, until the feature maps after the first to the (N-1)th fusions are obtained. The third feature map of the first level, the feature map after the first fusion, the feature map after the second fusion, ..., and the feature map after the (N-1)th fusion are then input into the corresponding intermediate convolutional neural networks 505 for convolution processing, to obtain the fusion feature map of the first level corresponding to the third feature map of the first level, the fusion feature map of the second level corresponding to the feature map after the first fusion, the fusion feature map of the third level corresponding to the feature map after the second fusion, ..., and the fusion feature map of the Nth level corresponding to the feature map after the (N-1)th fusion.
  • The parameter information of the fusion feature map of each level is the same as the parameter information of the corresponding feature map before the convolution processing.
  • For example, if the parameter information of the third feature map of the first level is 7 × 7 × 1 × 512, the parameter information of the fusion feature map of the first level is also 7 × 7 × 1 × 512;
  • if the parameter information of the feature map after the first fusion is 7 × 7 × 2 × 512, then after the intermediate convolutional neural network corresponding to the feature map after the first fusion performs convolution processing on it, the parameter information of the obtained fusion feature map of the second level is also 7 × 7 × 2 × 512.
  • The fusion feature maps of the levels are then sequentially fused in the order from the fusion feature map of the Nth level to the fusion feature map of the first level, and the feature map after each of these fusions together with the fusion feature map of the Nth level are taken as the obtained intermediate feature maps.
  • As shown in FIG. 5d, the embodiment of the present disclosure illustrates Mode 4 above.
  • The third feature maps of the levels can be fused through Mode 1 above, and the third feature map of the first level together with the feature maps after each fusion in the order from the first level to the Nth level are taken as the obtained first intermediate feature maps; at the same time, the third feature maps of the levels can be fused through Mode 2 above, and the third feature map of the Nth level together with the feature maps after each fusion in the order from the Nth level to the first level are taken as the obtained second intermediate feature maps.
  • The first intermediate feature maps and the second intermediate feature maps together constitute the intermediate feature maps obtained by Mode 4.
  • Sequentially fusing the third feature maps of the levels in different orders enriches the fusion methods of the feature maps.
  • Obtaining the fourth feature map based on the intermediate feature map after each fusion includes:
  • S601 Perform convolution processing on the intermediate feature map after each fusion to obtain a fifth feature map corresponding to the intermediate feature map; wherein, the time dimension value of the fifth feature map corresponding to each intermediate feature map is the same.
  • S602 Combine fifth feature maps corresponding to the respective intermediate feature maps to obtain a fourth feature map.
  • For example, the intermediate feature maps after each fusion include an intermediate feature map with parameter information of 7 × 7 × 1 × 512, an intermediate feature map of 7 × 7 × 2 × 512, and an intermediate feature map of 7 × 7 × 4 × 512, and the determined fusion time dimension value is 1 (the fusion time dimension value can be set according to actual needs). The network parameter information of the fourth convolutional neural network corresponding to each intermediate feature map can then be determined: the parameter information determined for the intermediate feature map of 7 × 7 × 1 × 512 is the network parameter information of fourth convolutional neural network A, the parameter information determined for the intermediate feature map of 7 × 7 × 2 × 512 is the network parameter information of fourth convolutional neural network B, and the parameter information determined for the intermediate feature map of 7 × 7 × 4 × 512 is the network parameter information of fourth convolutional neural network C. Fourth convolutional neural network A carrying its network parameter information performs convolution processing on the intermediate feature map with parameter information of 7 × 7 × 1 × 512 to obtain the corresponding fifth feature map; the fifth feature map corresponding to the intermediate feature map of 7 × 7 × 2 × 512 and the fifth feature map corresponding to the intermediate feature map of 7 × 7 × 4 × 512 can be obtained in the same way. The parameter information of the fifth feature map corresponding to each intermediate feature map is 7 × 7 × 1 × 512.
  • The fifth feature maps corresponding to the respective intermediate feature maps are then combined to obtain the fourth feature map; the parameter information of the obtained fourth feature map is 7 × 7 × 1 × 1536.
  • When the fifth feature maps corresponding to the respective intermediate feature maps are merged, the fifth feature maps can be concatenated through a concatenate operation to obtain the fourth feature map.
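  • The following sketch illustrates this step on the example shapes above; it assumes, for illustration, that each fourth convolutional neural network is a single Conv3d whose temporal kernel and stride collapse the time dimension to the fusion time dimension value 1, and that the fifth feature maps are then concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

# Intermediate feature maps after each fusion: 7x7x1x512, 7x7x2x512, 7x7x4x512.
intermediates = [torch.randn(1, 512, t, 7, 7) for t in (1, 2, 4)]

# One "fourth convolutional neural network" per intermediate map (A, B, C).
fourth_convs = nn.ModuleList([
    nn.Conv3d(512, 512, kernel_size=(t, 1, 1), stride=(t, 1, 1)) for t in (1, 2, 4)
])

fifths = [conv(x) for conv, x in zip(fourth_convs, intermediates)]
# each fifth feature map now has parameter information 7x7x1x512

fourth_feature_map = torch.cat(fifths, dim=1)   # concatenate along channels: 7x7x1x1536
```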
  • Convolution processing is performed on the intermediate feature map after each fusion, and the fifth feature maps obtained after the convolution processing are merged to obtain the fourth feature map, so that the fourth feature map includes both feature information with strong semantic features and feature information with strong detailed features, and also includes feature information with different time dimension values. Recognizing the action information included in the video to be recognized based on the fourth feature map can therefore improve the accuracy of recognition.
  • The writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • Referring to FIG. 7, a schematic structural diagram of an action information recognition device provided by an embodiment of the present disclosure is shown, which includes a feature extraction module 701, a parameter adjustment module 702, a time dimension adjustment module 703, and a determining module 704.
  • The feature extraction module 701 is configured to perform feature extraction on a video to be recognized to obtain multi-level first feature maps;
  • the parameter adjustment module 702 is configured to obtain a second feature map corresponding to the first feature map of each level by adjusting parameter information of the first feature maps, where the parameter information of the second feature maps corresponding to the first feature maps of different levels is the same;
  • the time dimension adjustment module 703 is configured to adjust the parameter information of the second feature maps of each level respectively to obtain a third feature map corresponding to the second feature map of each level, where the ratio of the time dimension values of the third feature maps of the levels matches a preset ratio;
  • the determining module 704 is configured to determine, based on the third feature maps, the action information in the video to be recognized.
  • When obtaining the second feature map corresponding to the first convolutional neural network of each level by adjusting the parameter information of the first feature maps, the parameter adjustment module 702 is configured to:
  • determine the first feature map with the smallest size among the parameter information corresponding to the first feature maps output by the first convolutional neural networks of the levels, adjust the other first feature maps to feature maps with the same parameter information as the first feature map with the smallest size, and take the first feature map with the smallest size and the adjusted feature maps as the second feature maps; or,
  • adjust the first feature maps respectively output by the first convolutional neural networks of the levels to feature maps under preset parameter information, and take the feature maps under the preset parameter information as the second feature maps.
  • When performing feature extraction on the video to be recognized to obtain multi-level first feature maps, the feature extraction module is configured to: perform feature extraction on the video to be recognized through multi-level first convolutional neural networks to obtain the first feature map output by the first convolutional neural network of each level.
  • When obtaining the second feature map corresponding to the first feature map of each level by adjusting the parameter information of the first feature maps, the parameter adjustment module 702 is configured to:
  • determine, based on the determined adjusted parameter information and the parameter information of the first feature map output by the first convolutional neural network of each level, the network parameter information of the second convolutional neural network corresponding to the first convolutional neural network of that level; and perform convolution processing on the first feature map output by the first convolutional neural network of each level with the second convolutional neural network of that level carrying the network parameter information, to obtain the second feature map output by the second convolutional neural network of that level.
  • When performing feature extraction on the video to be recognized to obtain multi-level first feature maps, the feature extraction module is configured to: perform feature extraction on the video to be recognized through multi-level first convolutional neural networks to obtain the first feature map output by the first convolutional neural network of each level.
  • When adjusting the parameter information of the second feature maps of each level respectively to obtain the third feature map corresponding to the second feature map of each level, the time dimension adjustment module 703 is configured to:
  • determine, based on the ratio of the time dimension values between the first convolutional neural networks of different levels and the time dimension value of the second feature map corresponding to each level of the first convolutional neural network, the time dimension value of the third feature map corresponding to each level of the first convolutional neural network; and
  • perform convolution processing on the second feature map corresponding to each level with the third convolutional neural network of that level to obtain the third feature map output by the third convolutional neural network of that level.
  • When determining the action information in the video to be recognized based on the third feature maps, the determining module 704 is configured to: perform fusion processing on the third feature maps corresponding to the second feature maps of the levels to obtain a fused fourth feature map; and determine, based on the fourth feature map, the action information in the video to be recognized.
  • When performing fusion processing on the third feature maps corresponding to the second feature maps of the levels to obtain the fused fourth feature map, the determining module 704 is configured to:
  • sequentially fuse, according to a set fusion order, the third feature maps corresponding to the second feature maps of the levels to obtain an intermediate feature map after each fusion; and
  • obtain the fourth feature map based on the intermediate feature map after each fusion.
  • The third feature maps corresponding to the second feature maps of the levels are denoted as the third feature map of the first level to the third feature map of the Nth level, where the time dimension value of the third feature map of the Nth level is greater than the time dimension value of the third feature map of the (N-1)th level, and N is a positive integer greater than 1. When sequentially fusing, according to the set fusion order, the third feature maps corresponding to the second feature maps of the levels to obtain the intermediate feature map after each fusion, the determining module 704 is configured to:
  • sequentially fuse the third feature maps of the levels according to the fusion order from the third feature map of the first level to the third feature map of the Nth level to obtain the feature map after each fusion, and take the third feature map of the first level and the feature maps after each fusion as the obtained intermediate feature maps; or,
  • sequentially fuse the third feature maps of the levels according to the fusion order from the third feature map of the Nth level to the third feature map of the first level to obtain the feature map after each fusion, and take the third feature map of the Nth level and the feature maps after each fusion as the obtained intermediate feature maps; or,
  • fuse the third feature maps of the levels according to the fusion order from the third feature map of the first level to the third feature map of the Nth level to obtain the feature map after each fusion from the first level to the Nth level; separately perform convolution processing on the third feature map of the first level and each fused feature map to obtain the fusion feature map of the first level to the fusion feature map of the Nth level, where the parameter information of the fusion feature map of each level is the same as the parameter information of the corresponding feature map before convolution processing; sequentially fuse the fusion feature maps of the levels in the order from the fusion feature map of the Nth level to the fusion feature map of the first level, and take the feature map after each of these fusions and the fusion feature map of the Nth level as the obtained intermediate feature maps; or,
  • fuse the third feature maps of the levels according to the fusion order from the third feature map of the first level to the third feature map of the Nth level to obtain the feature map after each fusion, and take the third feature map of the first level and the feature maps after each fusion in the order from the first level to the Nth level as the obtained first intermediate feature maps; fuse the third feature maps of the levels according to the fusion order from the third feature map of the Nth level to the third feature map of the first level to obtain the feature map after each fusion, and take the third feature map of the Nth level and the feature maps after each fusion in the order from the Nth level to the first level as the obtained second intermediate feature maps; and take the first intermediate feature maps and the second intermediate feature maps as the obtained intermediate feature maps.
  • When obtaining the fourth feature map based on the intermediate feature map after each fusion, the determining module 704 is configured to:
  • perform convolution processing on the intermediate feature map after each fusion to obtain a fifth feature map corresponding to each intermediate feature map, where the time dimension values of the fifth feature maps corresponding to the intermediate feature maps are the same; and merge the fifth feature maps corresponding to the respective intermediate feature maps to obtain the fourth feature map.
  • For the functions or modules contained in the device provided in the embodiments of the present disclosure, reference may be made to the description of the methods in the above method embodiments, which is not repeated here.
  • Referring to FIG. 8, a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure is shown, which includes a processor 801, a memory 802, and a bus 803.
  • The memory 802 is used to store execution instructions and includes a memory 8021 and an external memory 8022; the memory 8021 here, also called internal memory, is used to temporarily store operational data in the processor 801 and data exchanged with the external memory 8022, such as a hard disk.
  • The processor 801 exchanges data with the external memory 8022 through the memory 8021.
  • When the electronic device runs, the processor 801 and the memory 802 communicate through the bus 803, so that the processor 801 executes the following instructions:
  • performing feature extraction on a video to be recognized to obtain multi-level first feature maps;
  • obtaining a second feature map corresponding to the first feature map of each level by adjusting parameter information of the first feature maps, where the parameter information of the second feature maps corresponding to the first feature maps of different levels is the same;
  • adjusting the parameter information of the second feature maps of each level respectively to obtain a third feature map corresponding to the second feature map of each level, where the ratio of the time dimension values of the third feature maps of the levels matches a preset ratio;
  • determining, based on the third feature maps, the action information in the video to be recognized.
  • The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the action information recognition method described in the above method embodiments are executed.
  • The computer program product of the action information recognition method provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code, and the instructions included in the program code can be used to execute the steps of the action information recognition method described in the above method embodiments; for details, reference may be made to the above method embodiments, which are not repeated here.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer readable storage medium executable by a processor.
  • the technical solution of the present disclosure essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an action information recognition method and apparatus, an electronic device, and a storage medium. The method includes: performing feature extraction on a video to be recognized to obtain multiple levels of first feature maps; obtaining, by adjusting parameter information of the first feature maps, a second feature map corresponding to the first feature map of each level, wherein the parameter information of the second feature maps corresponding to the first feature maps of different levels is the same; adjusting the parameter information of the second feature maps of the respective levels to obtain a third feature map corresponding to the second feature map of each level, wherein the ratio of the temporal dimension values of the third feature maps of the respective levels conforms to a preset ratio; and determining, based on the third feature maps, the action information in the video to be recognized.

Description

动作信息识别方法、装置、电子设备及存储介质
本申请要求于2020年02月28日提交中国国家知识产权局、申请号为202010128428.X、申请名称为“动作信息识别方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及神经网络技术领域,具体而言,涉及一种动作信息识别方法、装置、电子设备及存储介质。
背景技术
动作识别是计算机视觉以及视频理解中的重要环节,动作识别的主要目标是判断一段视频中对象的行为类别。
不同的对象在执行同一动作时,会因年龄、身体素质等因素的影响,以不同的节奏执行;同一对象在执行不同的动作时,节奏也存在差异,使得动作的识别较为复杂。
发明内容
有鉴于此,本公开至少提供一种动作信息识别方法、装置、电子设备及存储介质。
第一方面,本公开提供了一种动作信息识别方法,包括:
对待识别视频进行特征提取,得到多级第一特征图;
通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同;
分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符;
基于所述第三特征图,确定所述待识别视频中的动作信息。
采用上述方法,通过对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图,并对各级第二特征图的时间维度值进行调整,使得得到的各级第二特征图对应的第三特征图的时间维度值存在比例关系,进而可以基于时间维度不同的第三特征图(通过不同的时间维度来体现动作的不同节奏,进而得到不同节奏下的动作特征),确定待识别视频中的动作信息,实现了基于原始帧率的待识别视频,确定待识别视频的动作信息,由于不需要调节待识别视频的帧率,在保证识别准确率的同时,降低了识别的运算量,提高 了识别的效率。
第二方面,本公开提供了一种动作信息识别装置,包括:
特征提取模块,用于对待识别视频进行特征提取,得到多级第一特征图;
参数调整模块,用于通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同;
时间维度调整模块,用于分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符;
确定模块,用于基于所述第三特征图,确定所述待识别视频中的动作信息。
第三方面,本公开提供一种电子设备,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当电子设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行如上述第一方面或任一实施方式所述的动作信息识别方法的步骤。
第四方面,本公开提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行如上述第一方面或任一实施方式所述的动作信息识别方法的步骤。
第五方面,本公开提供了一种计算机程序产品,该计算机程序产品包括程序指令,所述程序指令当被处理器执行时使所述处理器执行如上述第一方面或任一实施方式所述的动作信息识别方法的步骤。
为使本公开的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附附图,作详细说明如下。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,此处的附图被并入说明书中并构成本说明书中的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。应当理解,以下附图仅示出了本公开的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1示出了本公开实施例所提供的一种动作信息识别方法的流程示意图;
图2示出了本公开实施例所提供的一种动作信息识别方法中,通过对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图的方式的流程示意图;
图3示出了本公开实施例所提供的一种动作信息识别方法中,分别调整各级第二特征 图的参数信息,得到各级第二特征图对应的第三特征图的方式的流程示意图;
图4示出了本公开实施例所提供的一种动作信息识别方法中,基于第三特征图,确定待识别视频中的动作信息的方式的流程示意图;
图5a示出了本公开实施例所提供的一种动作信息识别方法中,按照设定的融合顺序,将各级第二特征图对应的第三特征图依次进行融合处理,得到每一次融合后的中间特征图的方式的流程示意图;
图5b示出了本公开实施例所提供的一种动作信息识别方法中,按照设定的融合顺序,将各级第二特征图对应的第三特征图依次进行融合处理,得到每一次融合后的中间特征图的方式的流程示意图;
图5c示出了本公开实施例所提供的一种动作信息识别方法中,按照设定的融合顺序,将各级第二特征图对应的第三特征图依次进行融合处理,得到每一次融合后的中间特征图的方式的流程示意图;
图5d示出了本公开实施例所提供的一种动作信息识别方法中,按照设定的融合顺序,将各级第二特征图对应的第三特征图依次进行融合处理,得到每一次融合后的中间特征图的方式的流程示意图;
图6示出了本公开实施例所提供的一种动作信息识别方法中,基于每一次融合后的中间特征图,得到第四特征图的方式的流程示意图;
图7示出了本公开实施例所提供的一种动作信息识别装置的架构示意图;
图8示出了本公开实施例所提供的一种电子设备的结构示意图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本公开的实施例的详细描述并非旨在限制要求保护的本公开的范围,而是仅仅表示本公开的选定实施例。基于本公开的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。
待识别视频中可以包括动态的对象,例如,待识别视频中可以包括行走的人类、跑步的人类、捕食的动物等。在对待识别视频进行检测,判断待识别视频中包括的动作的类别时,可以通过设计的卷积神经网络对待识别视频进行识别;或者,在对多个待识别视频进行检测,还可以通过设计的卷积神经网络,基于每个待识别视频中包括的动作的类别,将 多个待识别视频进行聚类。
示例性的,动作的执行过程中存在节奏的因素,比如,跑步的节奏快于走步的节奏,即同一对象在执行不同动作时,对应的节奏不同;同时,由于执行动作的对象的身体状况、年龄状况等因素的不同,不同的对象执行同一动作时的节奏也会不同,故动作的节奏对动作的检测起到至关重要的作用。
一般的,可以基于不同的采样频率,对待识别视频的帧率进行调整,可以得到待识别视频对应的多个不同帧率的视频,比如,待识别视频的原始帧率为24帧/秒,可以对待识别视频的原始帧率进行调整,得到待识别视频对应的多个不同帧率的视频,即得到帧率为24帧/秒的视频、帧率为18帧/秒的视频、帧率为12帧/秒的视频、帧率为6帧/秒的视频等;然后可以将待识别视频对应的多个不同帧率的视频分别输入至动作识别神经网络中,确定每个视频对应的检测结果,并基于每个视频对应的检测结果,确定待识别视频中包括的动作的类别。但是,通过基于不同帧率的视频,确定待识别视频中包括的动作的类别时,识别过程较复杂,运算量较高,使得识别的效率较低。因此,本公开实施例,提出了一种动作信息识别方法,可以通过调节特征图的参数信息以及时间维度值,并基于调整后的特征图对待识别视频进行识别,可以仅基于原始帧率的待识别视频,确定待识别视频的动作信息,降低了识别的运算量,提高了识别的效率。
为便于对本公开实施例进行理解,首先对本公开实施例所公开的一种动作信息识别方法进行详细介绍。
参见图1所示,为本公开实施例所提供的一种动作信息识别方法的流程示意图,该方法包括S101-S104。
S101,对待识别视频进行特征提取,得到多级第一特征图。
S102,通过对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同。
S103,分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符。
S104,基于第三特征图,确定待识别视频中的动作信息。
上述步骤中,通过对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图,并对各级第二特征图的时间维度值进行调整,使得得到的各级第二特征图对应的第三特征图的时间维度值存在比例关系,进而可以基于时间维度不同的第三特征图(通过不同的时间维度来体现动作的不同节奏,进而得到不同节奏下的动作特征),确定待识别视频中的动作信息,实现了基于原始帧率的待识别视频,确定待识别视频的动作信息,由于不需要调节待识别视频的帧率,在保证识别准确率的同时,降低了识别的运算量,提高了识别的效率。
以下对S101-S104进行详细说明。
针对S101:
本公开实施例中,对待识别视频进行特征提取,得到多级第一特征图,其中,第一级第一特征图是对待识别视频进行特征提取得到的,相邻两级第一特征图中的后一级第一特征图是对相邻两级第一特征图中的前一级第一特征图进行特征提取得到的。
本公开实施例中,对待识别视频进行特征提取,得到多级第一特征图时,可以通过多级第一卷积神经网络对待识别视频进行特征提取,得到每一级第一卷积神经网络输出的第一特征图。其中,多级第一卷积神经网络构成的神经网络可以为对待识别视频中包含的动作信息进行识别的任一神经网络,具体的,对待检测视频中包含的动作信息进行识别的神经网络可以划分为多个阶段的卷积神经网络,每一阶段的卷积神经网络对应一级第一卷积神经网络。其中,多级第一卷积神经网络的结构可以根据实际需要进行设置,本公开实施例对此不进行具体限定。
示例性的,若多级第一卷积神经网络包括第一级第一卷积神经网络、第二级第一卷积神经网路、第三级第一卷积神经网络,则第一级第一卷积神经网络可以对待识别视频进行卷积处理,得到第一级第一卷积神经网络输出的第一特征图;并将第一级第一卷积神经网络输出的第一特征图发送给第二级第一卷积神经网络,第二级第一卷积神经网络对接收到的第一特征图进行卷积处理,得到第二级第一卷积神经网络输出的第一特征图;再将第二级第一卷积神经网络输出的第一特征图发送给第三级第一卷积神经网络,第三级第一卷积神经网络对接收到的第一特征图进行卷积处理,得到第三级第一卷积神经网络输出的第一特征图,进而得到了每一级第一卷积神经网络输出的第一特征图。其中,由于第一级第一卷积神经网络输出的第一特征图经过的卷积处理的次数较少,故第一级第一卷积神经网络输出的第一特征图的细节特征较多、空间语义特征较少;而第三级第一卷积神经网络输出的第一特征图经过的卷积处理的次数较多,故第三级第一卷积神经网络输出的第一特征图的空间语义特征较多(即第一特征图中包含的与动作信息相关的特征较多)、细节特征较少。
本公开实施例中,待识别视频可以为包含动作信息的任一视频,其中,待识别视频的时长可以为任一时长,比如,待识别视频的时长可以为10秒、20秒等。具体的,可以基于多级第一卷积神经网络确定视频检测时长,在待识别视频的时长大于视频检测时长时,可以将待识别视频划分为多个视频,使得划分后的每个视频的时长与视频检测时长相同。比如,若待识别视频的时长为1分钟时,确定的视频检测时长为10秒,则可以将待识别视频划分为6个时长为10秒的视频,多级第一卷积神经网络分别对每个10秒的视频进行特征提取,确定每个10秒视频对应的动作信息,进而得到该待识别视频的动作信息。
本公开实施例中，第一特征图可以包括四维参数信息，该四维参数信息可以为长度值×宽度值×时间维度值×通道数，其中，长度值×宽度值为第一特征图的尺寸，神经网络的时间维度值表征神经网络一次能够处理的图像的数量。示例性的，若多级第一卷积神经网络为三维卷积神经网络，则可以得到待识别视频的第一特征图，该第一特征图可以包括四维参数信息；若多级第一卷积神经网络为二维卷积神经网络，则可以通过多级第一卷积神经网络进行特征提取，得到待识别视频中每帧图像对应的特征图，将得到的待识别视频中每帧图像对应的特征图按照时间维度进行组合，得到待识别视频对应的第一特征图。
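To make the multi-level feature extraction above concrete, the following is a minimal, hedged sketch and not the implementation of this disclosure: it assumes a PyTorch environment, and the stage depths, channel counts (256/512/1024), strides and the small input size are illustrative choices that merely mirror the numeric example used later in this description. Each stage output plays the role of one level of first feature map, in the N x C x T x H x W layout (channels, temporal dimension, height, width).

    # Illustrative sketch only (assumed PyTorch environment), not the patented implementation:
    # a three-stage 3D CNN whose per-stage outputs act as the multi-level "first feature maps".
    import torch
    import torch.nn as nn

    class Stage(nn.Module):
        def __init__(self, c_in, c_out, spatial_stride):
            super().__init__()
            self.conv = nn.Conv3d(c_in, c_out, kernel_size=3,
                                  stride=(1, spatial_stride, spatial_stride), padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.conv(x))

    class MultiLevelBackbone(nn.Module):
        """Later stages are spatially smaller but semantically richer, as described above."""
        def __init__(self):
            super().__init__()
            self.stage1 = Stage(3, 256, spatial_stride=2)     # first-level first feature map
            self.stage2 = Stage(256, 512, spatial_stride=2)   # second-level first feature map
            self.stage3 = Stage(512, 1024, spatial_stride=2)  # third-level first feature map

        def forward(self, video):                  # video: N x 3 x T x H x W
            f1 = self.stage1(video)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            return [f1, f2, f3]

    if __name__ == "__main__":
        clip = torch.randn(1, 3, 8, 112, 112)      # a short clip; sizes are only for the demo
        for i, f in enumerate(MultiLevelBackbone()(clip), 1):
            print(f"level {i} first feature map:", tuple(f.shape))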
针对S102:
本公开实施例中,可以对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图。比如,将第一级第一特征图的参数信息、第二级第一特征图的参数信息、以及第三级第一特征图的参数信息调整为一致。即将各级第一特征图的长度值、和/或宽度值、和/或时间维度值、和/或通道数进行调整,使得得到的各级第二特征图的长度值、宽度值、时间维度值、以及通道数均相同。
一种可能的实施方式中,通过对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图,包括:
确定各级第一特征图对应的参数信息中尺寸最小的第一特征图,并将除尺寸最小的第一特征图外的其它第一特征图,调整为与该尺寸最小的第一特征图相同参数信息的特征图,将尺寸最小的第一特征图,以及调整后与该尺寸最小的第一特征图相同参数信息的特征图作为第二特征图;或者,
将各级第一特征图调整为预设参数信息下的特征图,将该预设参数信息下的特征图作为第二特征图。
示例性的,若多级第一特征图包括第一级第一特征图、第二级第一特征图、第三级第一特征图,则确定第一级第一特征图、第二级第一特征图、第三级第一特征图中,尺寸最小的第一特征图(即确定长度值×宽度值最小的第一特征图),比如,第一级第一特征图的参数信息可以为:200×200×24×256,第二级第一特征图的参数信息可以为:100×100×24×512,第三级第一特征图的参数信息可以为:50×50×24×1024,则确定第三级第一特征图对应的参数信息中尺寸最小,则分别将第一级第一特征图以及第二级第一特征图的参数信息进行调整,使得调整后的各级第二特征图的参数信息均为:50×50×24×1024。
或者,确定一个预设参数信息,将各级第一特征图调整为预设参数信息下的特征图,将该预设参数信息下的特征图作为第二特征图。一般的,预设参数信息中的尺寸小于或等于各级第一卷积神经网络输出的第一特征图对应的参数信息中尺寸最小的第一特征图的参数信息。承接上述实施例继续说明,在第三级第一特征图(即该第一特征图对应的参数信息中尺寸最小)的参数信息为:50×50×24×1024时,则预设参数信息可以为25×25×24×1024,或者,预设参数信息也可以为50×50×24×1024。其中,预设参数信息可以根据实际情况进行设置。
上述实施方式中，将各级第一特征图调整为较小的尺寸，在对待识别视频中包含的动作信息进行识别时，可以降低识别的运算量，提高识别的效率。
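As a hedged illustration of this shape-alignment idea (every level adjusted to the parameter information of the smallest first feature map), the sketch below assumes PyTorch and uses adaptive pooling plus a 1x1x1 convolution; this is a simplification chosen for clarity, not the convolution-based mechanism of the second convolutional neural networks that the embodiment itself describes next. All tensor sizes are scaled-down stand-ins.

    # Hedged alternative sketch: resize every first feature map to the smallest one's parameters.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    first_feats = [torch.randn(1, c, 8, s, s)
                   for c, s in ((256, 56), (512, 28), (1024, 14))]   # scaled-down demo sizes
    target = first_feats[-1]                          # smallest-sized first feature map
    channel_proj = nn.ModuleList([nn.Conv3d(c, target.shape[1], kernel_size=1)
                                  for c in (256, 512, 1024)])

    second_feats = []
    for proj, feat in zip(channel_proj, first_feats):
        feat = F.adaptive_avg_pool3d(feat, output_size=target.shape[2:])  # match T x H x W
        second_feats.append(proj(feat))                                   # match channel count
    print([tuple(f.shape) for f in second_feats])     # all levels now share one parameter set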
一种可能的实施方式中,对待识别视频进行特征提取,得到多级第一特征图,包括:
通过多级第一卷积神经网络对待识别视频进行特征提取,得到每一级第一卷积神经网络输出的第一特征图。
参见图2所示,通过对第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图,包括:
S201,基于确定的调整后的参数信息,以及每一级第一卷积神经网络输出的第一特征图的参数信息,确定该级第一卷积神经网络对应的第二卷积神经网络的网络参数信息;
S202,基于携带有确定的网络参数信息的每一级第二卷积神经网络,对该级第二卷积神经网络对应的第一卷积神经网络输出的第一特征图进行卷积处理,得到该级第二卷积神经网络输出的第二特征图。
承接上述实施例继续说明,确定的调整后的参数信息可以为50×50×24×1024,第一级第一卷积神经网络对应的第一特征图的参数信息可以为:200×200×24×256,第二级第一卷积神经网络对应的第一特征图的参数信息可以为:100×100×24×512,第三级第一卷积神经网络对应的第一特征图的参数信息可以为:50×50×24×1024;则可以基于确定的调整后的参数信息,以及每一级第一卷积神经网络输出的第一特征图的参数信息,分别确定第一级第一卷积神经网络对应的第二卷积神经网络的网络参数信息、第二级第一卷积神经网络对应的第二卷积神经网络的网络参数信息、第三级第一卷积神经网络对应的第二卷积神经网络的网络参数信息,即确定每一级第二卷积神经网络中卷积核的长度×宽度×时间维度值×通道数,以及对应的长度移动步长×宽度移动步长×时间维度移动步长等信息。
示例性的,第一特征图的参数信息、与第二卷积神经网络对应的网络参数信息、以及与第二特征图对应的参数信息之间存在的关系如下公式(1)所示:
O = (I - K + 2P) / S + 1        （1）
其中,O为第二特征图的参数信息,I为第一特征图的参数信息,K为第二卷积神经网络对应的卷积核的网络参数信息,S为移动步长,P为填充数。因此,在确定第一特征图的参数信息、第二特征图的参数信息后,可以确定第二卷积神经网络对应的网络参数。比如,可以通过为每一级第二卷积神经网络设置不同的长度移动步长、和/或宽度移动步长,使得每一级第二卷积神经网络输出的第二特征图的参数信息相同。
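A small hedged sketch of formula (1) in code, using integer (floor) division as is common in convolution frameworks; the helper name stride_for_target and the specific kernel/padding values are illustrative assumptions, used only to show how a per-level stride can be picked so that every level's output matches the shared second-feature-map size (for example 200/100/50 all mapping to 50, as in the example above).

    # Formula (1): O = (I - K + 2P) / S + 1, computed with floor division as in practice.
    def conv_output_size(i: int, k: int, s: int, p: int) -> int:
        return (i - k + 2 * p) // s + 1

    def stride_for_target(i: int, k: int, p: int, target: int) -> int:
        """Smallest stride whose output size equals the target (assumes one exists)."""
        for s in range(1, i + 1):
            if conv_output_size(i, k, s, p) == target:
                return s
        raise ValueError("no integer stride reaches the target size")

    if __name__ == "__main__":
        for level, size in enumerate([200, 100, 50], start=1):   # sizes from the example above
            s = stride_for_target(size, k=3, p=1, target=50)
            print(f"level {level}: input {size} -> stride {s} -> output "
                  f"{conv_output_size(size, 3, s, 1)}")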
示例性的,第一级第一卷积神经网络对应的携带有网络参数信息的第二卷积神经网络,对第一级第一卷积神经网络对应的第一特征图进行卷积处理,得到该级第二卷积神经网络输出的第二特征图。依次类推,第二级第一卷积神经网络对应的携带有网络参数信息的第二卷积神经网络,对第二级第一卷积神经网络对应的第一特征图进行卷积处理,得到该级第二卷积神经网络输出的第二特征图。第三级第一卷积神经网络对应的携带有网络参数信息的第二卷积神经网络,对第三级第一卷积神经网络对应的第一特征图进行卷积处理,得到该级第二卷积神经网络输出的第二特征图。
上述实施方式中,通过确定各级第二卷积神经网络的网络参数信息,并基于携带有确定的网络参数信息的每一级第二卷积神经网络,对对应的第一特征图进行卷积处理,实现了将各级第一卷积神经网络输出的第一特征图的参数信息中的尺寸调整为较小的尺寸,进而使得对待识别视频进行识别时,降低了运算量,提高了识别的效率。
针对S103:
本公开实施例中，可以对各级第二特征图的参数信息进行调整，得到各级第二特征图对应的第三特征图，使得得到的各级第三特征图的时间维度值的比例与预设比例相符。其中，每一级第三特征图的时间维度值与其感受野相关。具体的，特征图经过卷积处理的次数越少，感受野越小，对应的时间维度值需要设置得较大，才能较准确地确定待识别视频中的动作信息；反之，特征图经过卷积处理的次数越多，感受野越大，为了降低运算量，可以将对应的时间维度值设置得较小，从而在保证待识别视频识别的准确度的同时，降低运算量，提高识别效率。比如，第一级第三特征图与第二级第三特征图之间的时间维度值的比例可以设置为1:2、或2:4、或3:9等。
一种可能的实施方式中,参见图3所示,分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,包括:
S301,基于不同级第一卷积神经网络之间的时间维度值的比例,以及每一级第一卷积神经网络对应的第二特征图的时间维度值,确定各级第一卷积神经网络分别对应的第三特征图的时间维度值;
S302,基于确定的各级第一卷积神经网络分别对应的第三特征图的时间维度值,以及每一级第一卷积神经网络对应的第二特征图的时间维度值,确定该级第一卷积神经网络对应的第三卷积神经网络的网络参数信息;
S303,基于携带有确定的网络参数信息的每一级第三卷积神经网络,对该级第三卷积神经网络对应的第二特征图进行卷积处理,得到该级第三卷积神经网络输出的第三特征图。
本公开实施例中,不同级第一卷积神经网络之间的时间维度值的比例可以根据实际需要进行设置,比如,若多级第一卷积神经网络包括第一级第一卷积神经网络、第二级第一卷积神经网络、第三级第一卷积神经网络,则不同级第一卷积神经网络之间的时间维度值的比例可以为1:2:4,也可以为1:3:9等。进一步的,若每一级第一卷积神经网络对应的第二特征图的时间维度值为24,时间维度值的比例为1:2:4,则可以确定第一级第一卷积神经网络对应的第三特征图的时间维度值为6,第二级第一卷积神经网络对应的第三特征图的时间维度值为12,第三级第一卷积神经网络对应的第三特征图的时间维度值为24。
本公开实施例中,可以根据上述描述的公式(1)确定每一级第一卷积神经网络对应的第三卷积神经网络的网络参数信息。比如,可以通过为每一级第三卷积神经网络设置不同的时间维度移动步长,使得各级第三卷积神经网络输出的第三特征图的时间维度值与设置的比例相同。
示例性的,第一级第一卷积神经网络对应的携带有网络参数信息的第三卷积神经网络,对该级对应的第二特征图进行卷积处理,得到该级第三卷积神经网络输出的第三特征图。依次类推,第二级第一卷积神经网络对应的携带有网络参数信息的第三卷积神经网络,对该级对应的第二特征图进行卷积处理,得到该级第三卷积神经网络输出的第三特征图。第三级第一卷积神经网络对应的携带有网络参数信息的第三卷积神经网络,对该级对应的第二特征图进行卷积处理,得到该级第三卷积神经网络输出的第三特征图。
上述实施方式中,通过调节每一级第一卷积神经网络对应的第二特征图的时间维度值,使得得到的每一级第三卷积神经网络输出的第三特征图的时间维度值与设置的比例相符(相当于调节了待识别视频中包括的动作信息的节奏),使得基于调整时间维度值后的第三特征图,能够较准确的对待识别视频中包括的动作信息进行识别,提高了识别的准确度。
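The following hedged sketch (assuming PyTorch; the channel counts, kernel sizes and the reduced 7x7 spatial size are illustrative assumptions) shows how level-specific temporal strides, chosen with formula (1), can turn second feature maps that all have a temporal dimension of 24 into third feature maps whose temporal dimensions follow the preset 1:2:4 ratio (6, 12, 24).

    # Hedged sketch of S301-S303: per-level temporal strides give time dimensions 6, 12, 24.
    import torch
    import torch.nn as nn

    temporal_strides = [4, 2, 1]                  # one stride per level, picked via formula (1)
    third_convs = nn.ModuleList([
        nn.Conv3d(1024, 512, kernel_size=(3, 1, 1), stride=(s, 1, 1), padding=(1, 0, 0))
        for s in temporal_strides
    ])

    second_feats = [torch.randn(1, 1024, 24, 7, 7) for _ in temporal_strides]  # small demo size
    third_feats = [conv(feat) for conv, feat in zip(third_convs, second_feats)]
    for level, feat in enumerate(third_feats, 1):
        print(f"level {level} third feature map:", tuple(feat.shape))   # T = 6, 12, 24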
针对S104:
本公开实施例中,可以将各级第一卷积神经网络对应的第三特征图进行融合,并将第三特征图融合后得到的特征图输入至预测神经网络中,得到待识别视频中包括的动作信息。若待识别视频中包括多个动作信息,则可以得到待识别视频中包括的每一动作信息。
一种可能的实施方式中,参见图4所示,基于第三特征图,确定待识别视频中的动作信息,包括:
S401,将各级第二特征图对应的第三特征图进行融合处理,得到融合后的第四特征图;
S402,基于第四特征图,确定待识别视频中的动作信息。
本公开实施例中，在得到各级第二特征图对应的第三特征图之后，可以将各级第三特征图进行融合处理，得到融合后的第四特征图，再基于第四特征图，确定待识别视频中的动作信息。
上述实施方式中,将得到的各级第二特征图对应的第三特征图进行融合处理,使得得到的第四特征图可以包括时间维度值不同的第三特征图的特征,进而基于第四特征图确定待识别视频中的动作信息时,可以提高识别的准确度。
一种可能的实施方式中,将各级第二特征图对应的第三特征图进行融合处理,得到融合后的第四特征图,包括:
按照设定的融合顺序,将各级第二特征图对应的第三特征图依次进行融合处理,得到每一次融合后的中间特征图;
基于每一次融合后的中间特征图,得到第四特征图。
本公开实施例中,可以设定第三特征图的融合顺序,将各级第二特征图对应的第三特征图按照设定的融合顺序,依次进行融合处理,得到每一次融合后的中间特征图。
比如，若设定的融合顺序为：第一级第一卷积神经网络对应的第三特征图、第二级第一卷积神经网络对应的第三特征图、第三级第一卷积神经网络对应的第三特征图，则可以先将第一级第一卷积神经网络对应的第三特征图与第二级第一卷积神经网络对应的第三特征图进行融合，得到第一次融合后的中间特征图；再将得到的融合后的中间特征图与第三级第一卷积神经网络对应的第三特征图进行融合，得到第二次融合后的中间特征图。然后可以基于每一次融合后的中间特征图，得到第四特征图。
示例性的，第一级第一卷积神经网络对应的第三特征图与第二级第一卷积神经网络对应的第三特征图进行融合时，可以先将第一级第一卷积神经网络对应的第三特征图进行图像插值处理，再将图像插值处理后的第一级第一卷积神经网络对应的第三特征图与第二级第一卷积神经网络对应的第三特征图进行融合，得到第一次融合后的中间特征图。每一次的融合过程，可以参考上述第一级第一卷积神经网络对应的第三特征图与第二级第一卷积神经网络对应的第三特征图进行融合的过程，本公开实施例对此不再进行赘述。
比如,若第一级第一卷积神经网络对应的第三特征图的参数信息为7×7×1×512,第二级第一卷积神经网络对应的第三特征图的参数信息为7×7×2×512,则可以先将第一级第一卷积神经网络对应的第三特征图进行图像插值处理,插值处理后的第一级第一卷积神经网络对应的第三特征图的参数信息为7×7×2×512;然后将插值处理后的第一级第一卷积神经网络对应的第三特征图中每一特征点的值,与第二级第一卷积神经网络对应的第三特征图中对应的特征点的值求和,得到第一次融合后的中间特征图,其中,该第一次融合后的中间特征图的参数信息为7×7×2×512。
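A minimal hedged sketch of the single fusion step in this example, assuming PyTorch and nearest-neighbour interpolation along the temporal dimension (the text only says image interpolation, so the interpolation mode here is an assumption):

    # Hedged sketch: interpolate the 7x7x1x512 map to 7x7x2x512, then sum element-wise.
    import torch
    import torch.nn.functional as F

    level1_third = torch.randn(1, 512, 1, 7, 7)   # N x C x T x H x W, T = 1
    level2_third = torch.randn(1, 512, 2, 7, 7)   # T = 2

    upsampled = F.interpolate(level1_third, size=(2, 7, 7), mode="nearest")
    fused = upsampled + level2_third              # first fused intermediate feature map
    print(tuple(fused.shape))                     # (1, 512, 2, 7, 7)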
一种可能的实施方式中,将各级第二特征图对应的第三特征图作为第一级第三特征图至第N级第三特征图,其中第N级第三特征图的时间维度值大于第N-1级第三特征图的时间维度值,N为大于1的正整数。则按照设定的融合顺序,将各级第二特征图对应的第三特征图依次进行融合处理,得到每一次融合后的中间特征图,包括下述几种方式:
方式一:按照从第一级第三特征图到第N级第三特征图的融合顺序,依次将各级第三特征图进行融合处理,分别得到每一次融合后的特征图,将第一级第三特征图以及每一次融合后的特征图,作为得到的中间特征图。
方式二:按照从第N级第三特征图到第一级第三特征图的融合顺序,依次将各级第三特征图进行融合处理,分别得到每一次融合后的特征图,将第N级第三特征图以及每一次融合后的特征图,作为得到中间特征图。
方式三:按照从第一级第三特征图到第N级第三特征图的融合顺序,将各级第三特征图进行融合处理,分别得到从第一级第三特征图到第N级第三特征图进行融合处理时每一次融合后的特征图,分别对第一级第三特征图以及每一次融合后的特征图进行卷积处理,得到第一级融合特征图至第N级融合特征图,其中,每一级融合特征图的参数信息与卷积处理前对应的特征图的参数信息相同;按照从第N级融合特征图到第一级融合特征图的融合顺序,依次将各级融合特征图进行融合处理,分别得到从第N级融合特征图到第一级融 合特征图进行融合处理时每一次融合后的特征图,将每一次融合后的特征图以及第N级融合特征图,作为得到的中间特征图。
方式四:按照从第一级第三特征图到第N级第三特征图的融合顺序,将各级第三特征图进行融合处理,分别得到每一次融合后的特征图,将第一级第三特征图以及从第一级第三特征图到第N级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第一中间特征图,并按照从第N级第三特征图到第一级第三特征图的融合顺序,将各级第三特征图进行融合处理,分别得到每一次融合后的特征图,将第N级第三特征图以及从第N级第三特征图到第一级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第二中间特征图;将第一中间特征图和第二中间特征图作为得到的中间特征图。
参见图5a所示,本公开实施例,对上述方式一进行说明,对各级第三特征图进行融合时,可以先将第一级第三特征图501与第二级第三特征图502进行融合,得到第一次融合后的特征图;再将第一次得到的融合后的特征图与第三级第三特征图503进行融合,得到第二次融合后的特征图,以此类推,直至第N-2次融合后的特征图与第N级第三特征图504进行融合,得到第N-1次融合后的特征图为止;将第一次融合后的特征图(第一级第三特征图与第二级第三特征图融合后得到的特征图)、第二次融合后的特征图、…、第N-1次融合后的特征图以及第一级第三特征图,作为得到的中间特征图。
参见图5b所示,本公开实施例,对上述方式二进行说明,对各级第三特征图进行融合时,可以先将第N级第三特征图504与第N-1级第三特征图进行融合,得到第一次融合后的特征图;再将第一次融合后得到的特征图与第N-2级第三特征图进行融合,得到第二次融合后的特征图,以此类推,直至将第N-2次融合后的特征图与第一级第三特征图501进行融合,得到第N-1次融合后的特征图为止;将第一次融合后的特征图(第N级第三特征图与第N-1级第三特征图融合后得到的特征图)、第二次融合后的特征图、…、第N-1次融合后的特征图以及第N级第三特征图,作为得到的中间特征图。
参见图5c所示,本公开实施例,对上述方式三进行说明,对各级第三特征图进行融合时,可以先将第一级第三特征图与第二级第三特征图进行融合,得到第一次融合后的特征图;再将第一次得到的融合后的特征图与第三级第三特征图进行融合,得到第二次融合后的特征图,以此类推,可以得到第N-1次融合后的特征图;分别将第一级第三特征图、第一次融合后的特征图、第二次融合后的特征图、…、第N-1次融合后的特征图输入至对应的中间卷积神经网络505中进行卷积处理,得到第一级第三特征图对应的第一级融合特征图、第一次融合后的特征图对应的第二级融合特征图、第二次融合后的特征图对应的第三级融合特征图、…、第N-1次融合后的特征图对应的第N级融合特征图。其中,每一级融合特征图的参数信息与卷积处理前对应的特征图的参数信息相同,比如,若第一级第三特征图的参数信息为7×7×1×512,则第一级第三特征图对应的中间卷积神经网络505对第一级第三特征图进行卷积处理后,得到的第一级融合特征图的参数信息也为7×7×1×512;若第一次融合后的特征图的参数信息为7×7×2×512,则第一次融合后的特征图对应的中间卷积神经网络对第一次融合后的特征图进行卷积处理后,得到的第二级融合特征图的参数信 息也为7×7×2×512。
继续对上述方式三进行说明,按照从第N级融合特征图到第一级融合特征图的融合顺序,依次将各级融合特征图进行融合处理,分别得到从第N级融合特征图到第一级融合特征图进行融合处理时每一次融合后的特征图,将每一次融合后的特征图以及第N级融合特征图,作为得到的中间特征图。
参见图5d所示,本公开实施例,对上述方式四进行说明,对各级第三特征图进行融合时,可以通过上述方式一将各级第三特征图进行融合处理,将第一级第三特征图以及从第一级第三特征图到第N级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第一中间特征图;同时,可以通过上述方式二将各级第三特征图进行融合处理,将第N级第三特征图以及从第N级第三特征图到第一级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第二中间特征图;其中,第一中间特征图以及第二中间特征图构成了通过方式四得到的中间特征图。
上述实施方式中,通过设定多种不同的融合顺序,依次将各级第三特征图进行融合处理,丰富了特征图的融合方式。
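As a hedged illustration of fusion ways one and two above (assuming PyTorch; the helper name fuse_in_order and the nearest interpolation mode are assumptions), the sketch fuses the third feature maps level by level in a chosen order and keeps the first map plus every fused result as the intermediate feature maps; ways three and four can be composed from these two passes together with the per-level convolutions described for way three.

    # Hedged sketch of fusion ways one and two: sequential fusion in a chosen order.
    import torch
    import torch.nn.functional as F

    def fuse_in_order(third_feats):
        """third_feats: list of N x C x T x H x W tensors, fused in the given order.
        Returns the first map plus the feature map after each fusion (the intermediate maps)."""
        intermediates = [third_feats[0]]
        running = third_feats[0]
        for nxt in third_feats[1:]:
            running = F.interpolate(running, size=nxt.shape[2:], mode="nearest") + nxt
            intermediates.append(running)
        return intermediates

    feats = [torch.randn(1, 512, t, 7, 7) for t in (1, 2, 4)]   # time dims in ratio 1:2:4
    way_one = fuse_in_order(feats)            # way one: level 1 -> level N
    way_two = fuse_in_order(feats[::-1])      # way two: level N -> level 1
    print([tuple(f.shape) for f in way_one])
    print([tuple(f.shape) for f in way_two])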
一种可能的实施方式中,参见图6所示,基于每一次融合后的中间特征图,得到第四特征图,包括:
S601,对每一次融合后的中间特征图进行卷积处理,得到该中间特征图对应的第五特征图;其中,每个中间特征图对应的第五特征图的时间维度值相同。
S602,将各个中间特征图对应的第五特征图进行合并,得到第四特征图。
示例性的,若每一次融合后的中间特征图包括参数信息为7×7×1×512的中间特征图、7×7×2×512的中间特征图、7×7×4×512的中间特征图,确定的融合后的时间维度值为1,其中,融合后的时间维度值可以根据实际需要进行设置,则可以确定每一中间特征图对应的第四卷积神经网络的网络参数信息,即可以确定参数信息为7×7×1×512的中间特征图对应的第四卷积神经网络A的网络参数信息、确定参数信息为7×7×2×512的中间特征图对应的第四卷积神经网络B的网络参数信息、确定参数信息为7×7×4×512的中间特征图对应的第四卷积神经网络C的网络参数信息;基于携带网络参数信息的第四卷积神经网络A对参数信息为7×7×1×512的中间特征图进行卷积处理,得到参数信息为7×7×1×512的中间特征图对应的第五特征图;进而可以得到参数信息为7×7×2×512的中间特征图对应的第五特征图、以及参数信息为7×7×4×512的中间特征图对应的第五特征图,其中,各个中间特征图对应的第五特征图的参数信息均为7×7×1×512。
进一步的，将各个中间特征图对应的第五特征图进行合并，得到第四特征图，即得到的第四特征图的参数信息为7×7×1×1536。其中，对各个中间特征图对应的第五特征图进行合并时，可以通过Concatenate操作将第五特征图沿通道维度进行串联，得到第四特征图。
上述实施方式中,通过对每一次融合后的中间特征图进行卷积处理,并将卷积处理后 得到的第五特征图进行合并,得到第四特征图,使得第四特征图中既包括语义特征较强的特征信息,也包括细节特征较强的特征信息,且得到的第四特征图中还包括不同时间维度值的特征信息,使得基于第四特征图对待识别视频中包括的动作信息进行识别时,可以提高识别的准确度。
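A hedged sketch of this last step, assuming PyTorch; the temporal strides (1, 2, 4) are illustrative values obtained from formula (1) so that every fifth feature map ends up with a temporal dimension of 1, and the concatenation is taken along the channel dimension, giving 3 x 512 = 1536 channels.

    # Hedged sketch of S601-S602: unify the time dimension per intermediate map, then concatenate.
    import torch
    import torch.nn as nn

    intermediates = [torch.randn(1, 512, t, 7, 7) for t in (1, 2, 4)]   # fused intermediate maps
    fifth_convs = nn.ModuleList([
        nn.Conv3d(512, 512, kernel_size=(3, 1, 1), stride=(s, 1, 1), padding=(1, 0, 0))
        for s in (1, 2, 4)                          # strides picked so every output has T = 1
    ])

    fifth = [conv(x) for conv, x in zip(fifth_convs, intermediates)]    # each 1 x 512 x 1 x 7 x 7
    fourth = torch.cat(fifth, dim=1)                # channel-wise concatenation ("Concatenate")
    print(tuple(fourth.shape))                      # (1, 1536, 1, 7, 7)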
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
基于相同的构思,本公开实施例还提供了一种动作信息识别装置,参见图7所示,为本公开实施例提供的一种动作信息识别的架构示意图,包括特征提取模块701、参数调整模块702、时间维度调整模块703、确定模块704,具体的:
特征提取模块701,用于对待识别视频进行特征提取,得到多级第一特征图;
参数调整模块702,用于通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同;
时间维度调整模块703,用于分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符;
确定模块704,用于基于所述第三特征图,确定所述待识别视频中的动作信息。
一种可能的实施方式中,所述参数调整模块702,在通过对所述第一特征图进行参数信息调整,得到各级第一卷积神经网络对应的第二特征图的情况下,用于:
确定所述各级第一卷积神经网络输出的第一特征图对应的参数信息中尺寸最小的第一特征图,并将除所述尺寸最小的第一特征图外的其它第一特征图,调整为与该尺寸最小的第一特征图相同参数信息的特征图,将所述尺寸最小的第一特征图,以及调整后所述与该尺寸最小的第一特征图相同参数信息的特征图作为所述第二特征图;或者,
将所述各级第一卷积神经网络分别输出的所述第一特征图调整为预设参数信息下的特征图,将该预设参数信息下的特征图作为所述第二特征图。
一种可能的实施方式中,所述特征提取模块,在对待识别视频进行特征提取,得到多级第一特征图的情况下,用于:
通过多级第一卷积神经网络对待识别视频进行特征提取,得到每一级第一卷积神经网络输出的第一特征图;
所述参数调整模块702,在通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图的情况下,用于:
基于确定的调整后的参数信息,以及每一级第一卷积神经网络输出的所述第一特征图 的参数信息,确定该级第一卷积神经网络对应的第二卷积神经网络的网络参数信息;
基于携带有确定的网络参数信息的所述每一级第二卷积神经网络,对该级第二卷积神经网络对应的第一卷积神经网络输出的第一特征图进行卷积处理,得到该级第二卷积神经网络输出的所述第二特征图。
一种可能的实施方式中,所述特征提取模块,在对待识别视频进行特征提取,得到多级第一特征图的情况下,用于:
通过多级第一卷积神经网络对待识别视频进行特征提取,得到每一级第一卷积神经网络输出的第一特征图;
所述时间维度调整模块703,在分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图的情况下,用于:
基于不同级第一卷积神经网络之间的时间维度值的比例,以及每一级第一卷积神经网络对应的所述第二特征图的时间维度值,确定各级第一卷积神经网络分别对应的第三特征图的时间维度值;
基于确定的各级第一卷积神经网络分别对应的第三特征图的时间维度值,以及每一级第一卷积神经网络对应的所述第二特征图的时间维度值,确定该级第一卷积神经网络对应的第三卷积神经网络的网络参数信息;
基于携带有确定的网络参数信息的所述每一级第三卷积神经网络,对该级第三卷积神经网络对应的第二特征图进行卷积处理,得到该级第三卷积神经网络输出的所述第三特征图。
一种可能的实施方式中,所述确定模块704,在基于所述第三特征图,确定所述待识别视频中的动作信息的情况下,用于:
将各级第二特征图对应的所述第三特征图进行融合处理,得到融合后的第四特征图;
基于所述第四特征图,确定所述待识别视频中的动作信息。
一种可能的实施方式中,所述确定模块704,在将各级第二特征图对应的所述第三特征图进行融合处理,得到融合后的第四特征图的情况下,用于:
按照设定的融合顺序,将各级第二特征图对应的所述第三特征图依次进行融合处理,得到每一次融合后的中间特征图;
基于每一次融合后的中间特征图,得到所述第四特征图。
一种可能的实施方式中,将各级第二特征图对应的第三特征图作为第一级第三特征图至第N级第三特征图,其中第N级第三特征图的时间维度值大于第N-1级第三特征图的时间维度值,N为大于1的正整数,则所述确定模块704,在按照设定的融合顺序,将各级 第二特征图对应的所述第三特征图依次进行融合处理,得到每一次融合后的中间特征图的情况下,用于:
按照从第一级第三特征图到所述第N级第三特征图的融合顺序,依次将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第一级第三特征图以及每一次融合后的特征图,作为得到的所述中间特征图;或者,
按照从第N级第三特征图到所述第一级第三特征图的融合顺序,依次将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第N级第三特征图以及每一次融合后的特征图,作为得到所述中间特征图;或者,
按照从第一级第三特征图到所述第N级第三特征图的融合顺序,将各级所述第三特征图进行融合处理,分别得到从第一级第三特征图到所述第N级第三特征图进行融合处理时每一次融合后的特征图,分别对第一级第三特征图以及每一次融合后的特征图进行卷积处理,得到第一级融合特征图至第N级融合特征图,其中,每一级所述融合特征图的参数信息与卷积处理前对应的特征图的参数信息相同;按照从第N级融合特征图到所述第一级融合特征图的融合顺序,依次将各级所述融合特征图进行融合处理,分别得到从第N级融合特征图到所述第一级融合特征图进行融合处理时每一次融合后的特征图,将每一次融合后的特征图以及第N级融合特征图,作为得到的所述中间特征图;或者,
按照从第一级第三特征图到所述第N级第三特征图的融合顺序,将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第一级第三特征图以及从第一级第三特征图到所述第N级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第一中间特征图,并按照从第N级第三特征图到所述第一级第三特征图的融合顺序,将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第N级第三特征图以及从第N级第三特征图到所述第一级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第二中间特征图;将所述第一中间特征图和所述第二中间特征图作为得到的所述中间特征图。
一种可能的实施方式中,所述确定模块704,在基于每一次融合后的中间特征图,得到所述第四特征图的情况下,用于:
对每一次融合后的中间特征图进行卷积处理,得到该中间特征图对应的第五特征图;其中,每个中间特征图对应的第五特征图的时间维度值相同;
将各个中间特征图对应的第五特征图进行合并,得到所述第四特征图。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模板可以用于执行上文方法实施例描述的方法,其具体实现可以参照上文方法实施例的描述,为了简洁,这里不再赘述。
基于同一技术构思,本公开实施例还提供了一种电子设备。参照图8所示,为本公开实施例提供的电子设备的结构示意图,包括处理器801、存储器802、和总线803。其中, 存储器802用于存储执行指令,包括内存8021和外部存储器8022;这里的内存8021也称内存储器,用于暂时存放处理器801中的运算数据,以及与硬盘等外部存储器8022交换的数据,处理器801通过内存8021与外部存储器8022进行数据交换,当电子设备800运行时,处理器801与存储器802之间通过总线803通信,使得处理器801在执行以下指令:
对待识别视频进行特征提取,得到多级第一特征图;
通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同;
分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符;
基于所述第三特征图,确定所述待识别视频中的动作信息。
此外,本公开实施例还提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述方法实施例中所述的动作信息识别方法的步骤。
本公开实施例所提供的动作信息识别方法的计算机程序产品,包括存储了程序代码的计算机可读存储介质,所述程序代码包括的指令可用于执行上述方法实施例中所述的动作信息识别方法的步骤,具体可参见上述方法实施例,在此不再赘述。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统和装置的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。在本公开所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,又例如,多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些通信接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个处理器可执行的非易失的计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计 算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以权利要求的保护范围为准。

Claims (12)

  1. 一种动作信息识别方法,其特征在于,包括:
    对待识别视频进行特征提取,得到多级第一特征图;
    通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同;
    分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符;
    基于所述第三特征图,确定所述待识别视频中的动作信息。
  2. 根据权利要求1所述的方法,其特征在于,所述通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图,包括:
    确定所述各级第一特征图对应的参数信息中尺寸最小的第一特征图,并将除所述尺寸最小的第一特征图外的其它第一特征图,调整为与该尺寸最小的第一特征图相同参数信息的特征图,将所述尺寸最小的第一特征图,以及调整后所述与该尺寸最小的第一特征图相同参数信息的特征图作为所述第二特征图;或者,
    将所述各级第一特征图调整为预设参数信息下的特征图,将该预设参数信息下的特征图作为所述第二特征图。
  3. 根据权利要求1所述的方法,其特征在于,所述对待识别视频进行特征提取,得到多级第一特征图,包括:
    通过多级第一卷积神经网络对待识别视频进行特征提取,得到每一级第一卷积神经网络输出的第一特征图;
    所述通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图,包括:
    基于确定的调整后的参数信息,以及每一级第一卷积神经网络输出的所述第一特征图的参数信息,确定该级第一卷积神经网络对应的第二卷积神经网络的网络参数信息;
    基于携带有确定的网络参数信息的所述每一级第二卷积神经网络,对该级第二卷积神经网络对应的第一卷积神经网络输出的第一特征图进行卷积处理,得到该级第二卷积神经网络输出的所述第二特征图。
  4. 根据权利要求1~3任一所述的方法,其特征在于,所述对待识别视频进行特征提取,得到多级第一特征图,包括:
    通过多级第一卷积神经网络对待识别视频进行特征提取,得到每一级第一卷积神经网络输出的第一特征图;
    所述分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,包括:
    基于不同级第一卷积神经网络之间的时间维度值的比例,以及每一级第一卷积神经网络对应的所述第二特征图的时间维度值,确定各级第一卷积神经网络分别对应的第三特征图的时间维度值;
    基于确定的各级第一卷积神经网络分别对应的第三特征图的时间维度值,以及每一级 第一卷积神经网络对应的所述第二特征图的时间维度值,确定该级第一卷积神经网络对应的第三卷积神经网络的网络参数信息;
    基于携带有确定的网络参数信息的所述每一级第三卷积神经网络,对该级第三卷积神经网络对应的第二特征图进行卷积处理,得到该级第三卷积神经网络输出的所述第三特征图。
  5. 根据权利要求1所述的方法,其特征在于,所述基于所述第三特征图,确定所述待识别视频中的动作信息,包括:
    将各级第二特征图对应的所述第三特征图进行融合处理,得到融合后的第四特征图;
    基于所述第四特征图,确定所述待识别视频中的动作信息。
  6. 根据权利要求5所述的方法,其特征在于,将各级第二特征图对应的所述第三特征图进行融合处理,得到融合后的第四特征图,包括:
    按照设定的融合顺序,将各级第二特征图对应的所述第三特征图依次进行融合处理,得到每一次融合后的中间特征图;
    基于每一次融合后的中间特征图,得到所述第四特征图。
  7. 根据权利要求6所述的方法,其特征在于,将各级第二特征图对应的第三特征图作为第一级第三特征图至第N级第三特征图,其中第N级第三特征图的时间维度值大于第N-1级第三特征图的时间维度值,N为大于1的正整数,则按照设定的融合顺序,将各级第二特征图对应的所述第三特征图依次进行融合处理,得到每一次融合后的中间特征图,包括:
    按照从第一级第三特征图到所述第N级第三特征图的融合顺序,依次将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第一级第三特征图以及每一次融合后的特征图,作为得到的所述中间特征图;或者,
    按照从第N级第三特征图到所述第一级第三特征图的融合顺序,依次将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第N级第三特征图以及每一次融合后的特征图,作为得到的所述中间特征图;或者,
    按照从第一级第三特征图到所述第N级第三特征图的融合顺序,将各级所述第三特征图进行融合处理,分别得到从第一级第三特征图到所述第N级第三特征图进行融合处理时每一次融合后的特征图,分别对第一级第三特征图以及每一次融合后的特征图进行卷积处理,得到第一级融合特征图至第N级融合特征图,其中,每一级所述融合特征图的参数信息与卷积处理前对应的特征图的参数信息相同;按照从第N级融合特征图到所述第一级融合特征图的融合顺序,依次将各级所述融合特征图进行融合处理,分别得到从第N级融合特征图到所述第一级融合特征图进行融合处理时每一次融合后的特征图,将每一次融合后的特征图以及第N级融合特征图,作为得到的所述中间特征图;或者,
    按照从第一级第三特征图到所述第N级第三特征图的融合顺序,将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第一级第三特征图以及从第一级第三特征图到所述第N级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第一中间特征图,并按照从第N级第三特征图到所述第一级第三特征图的融合顺序,将各级所述第三特征图进行融合处理,分别得到每一次融合后的特征图,将第N级第三特征图以 及从第N级第三特征图到所述第一级第三特征图进行融合处理时每一次融合后的特征图,作为得到的第二中间特征图;将所述第一中间特征图和所述第二中间特征图作为得到的所述中间特征图。
  8. 根据权利要求6或7所述的方法,其特征在于,所述基于每一次融合后的中间特征图,得到所述第四特征图,包括:
    对每一次融合后的中间特征图进行卷积处理,得到该中间特征图对应的第五特征图;其中,每个中间特征图对应的第五特征图的时间维度值相同;
    将各个中间特征图对应的第五特征图进行合并,得到所述第四特征图。
  9. 一种动作信息识别装置,其特征在于,包括:
    特征提取模块,用于对待识别视频进行特征提取,得到多级第一特征图;
    参数调整模块,用于通过对所述第一特征图进行参数信息调整,得到各级第一特征图对应的第二特征图;其中,不同级第一特征图对应的第二特征图的参数信息相同;
    时间维度调整模块,用于分别调整各级第二特征图的参数信息,得到各级第二特征图对应的第三特征图,其中,各级第三特征图的时间维度值的比例与预设比例相符;
    确定模块,用于基于所述第三特征图,确定所述待识别视频中的动作信息。
  10. 一种电子设备,其特征在于,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当电子设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行如权利要求1至8任一所述的动作信息识别方法的步骤。
  11. 一种计算机可读存储介质,其特征在于,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行如权利要求1至8任一所述的动作信息识别方法的步骤。
  12. 一种计算机程序产品,其特征在于,所述计算机程序产品包括程序指令,所述程序指令被处理器运行时该处理器执行如权利要求1至8任一所述的动作信息识别方法的步骤。
PCT/CN2020/142510 2020-02-28 2020-12-31 动作信息识别方法、装置、电子设备及存储介质 WO2021169604A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020227008074A KR20220042467A (ko) 2020-02-28 2020-12-31 동작 정보 인식 방법, 장치, 전자 디바이스 및 저장매체
JP2021545743A JP2022525723A (ja) 2020-02-28 2020-12-31 動作情報識別方法、装置、電子機器及び記憶媒体

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010128428.XA CN111353428B (zh) 2020-02-28 2020-02-28 动作信息识别方法、装置、电子设备及存储介质
CN202010128428.X 2020-02-28

Publications (1)

Publication Number Publication Date
WO2021169604A1 true WO2021169604A1 (zh) 2021-09-02

Family

ID=71195824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/142510 WO2021169604A1 (zh) 2020-02-28 2020-12-31 动作信息识别方法、装置、电子设备及存储介质

Country Status (4)

Country Link
JP (1) JP2022525723A (zh)
KR (1) KR20220042467A (zh)
CN (1) CN111353428B (zh)
WO (1) WO2021169604A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353428B (zh) * 2020-02-28 2022-05-24 北京市商汤科技开发有限公司 动作信息识别方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110304541A1 (en) * 2010-06-11 2011-12-15 Navneet Dalal Method and system for detecting gestures
CN106897714A (zh) * 2017-03-23 2017-06-27 北京大学深圳研究生院 一种基于卷积神经网络的视频动作检测方法
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN108875931A (zh) * 2017-12-06 2018-11-23 北京旷视科技有限公司 神经网络训练及图像处理方法、装置、系统
CN109165562A (zh) * 2018-07-27 2019-01-08 深圳市商汤科技有限公司 神经网络的训练方法、横向控制方法、装置、设备及介质
CN111353428A (zh) * 2020-02-28 2020-06-30 北京市商汤科技开发有限公司 动作信息识别方法、装置、电子设备及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710847B (zh) * 2018-05-15 2020-11-27 北京旷视科技有限公司 场景识别方法、装置及电子设备
CN109086690B (zh) * 2018-07-13 2021-06-22 北京旷视科技有限公司 图像特征提取方法、目标识别方法及对应装置
CN109697434B (zh) * 2019-01-07 2021-01-08 腾讯科技(深圳)有限公司 一种行为识别方法、装置和存储介质
CN110324664B (zh) * 2019-07-11 2021-06-04 南开大学 一种基于神经网络的视频补帧方法及其模型的训练方法
CN110533119B (zh) * 2019-09-04 2022-12-27 北京迈格威科技有限公司 标识识别方法及其模型的训练方法、装置及电子系统
CN110633700B (zh) * 2019-10-21 2022-03-25 深圳市商汤科技有限公司 视频处理方法及装置、电子设备和存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110304541A1 (en) * 2010-06-11 2011-12-15 Navneet Dalal Method and system for detecting gestures
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN106897714A (zh) * 2017-03-23 2017-06-27 北京大学深圳研究生院 一种基于卷积神经网络的视频动作检测方法
CN108875931A (zh) * 2017-12-06 2018-11-23 北京旷视科技有限公司 神经网络训练及图像处理方法、装置、系统
CN109165562A (zh) * 2018-07-27 2019-01-08 深圳市商汤科技有限公司 神经网络的训练方法、横向控制方法、装置、设备及介质
CN111353428A (zh) * 2020-02-28 2020-06-30 北京市商汤科技开发有限公司 动作信息识别方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
JP2022525723A (ja) 2022-05-19
KR20220042467A (ko) 2022-04-05
CN111353428A (zh) 2020-06-30
CN111353428B (zh) 2022-05-24

Similar Documents

Publication Publication Date Title
CN108205655B (zh) 一种关键点预测方法、装置、电子设备及存储介质
US10726244B2 (en) Method and apparatus detecting a target
WO2017193906A1 (zh) 一种图像处理方法及处理系统
US11132575B2 (en) Combinatorial shape regression for face alignment in images
US20170150235A1 (en) Jointly Modeling Embedding and Translation to Bridge Video and Language
CN109522945B (zh) 一种群体情感识别方法、装置、智能设备及存储介质
WO2020253127A1 (zh) 脸部特征提取模型训练方法、脸部特征提取方法、装置、设备及存储介质
CN109522902B (zh) 空-时特征表示的提取
US11704563B2 (en) Classifying time series image data
KR102263017B1 (ko) 3d cnn을 이용한 고속 영상 인식 방법 및 장치
WO2021169641A1 (zh) 人脸识别方法和系统
WO2022206729A1 (zh) 视频封面选择方法、装置、计算机设备和存储介质
WO2021169604A1 (zh) 动作信息识别方法、装置、电子设备及存储介质
CN111382791A (zh) 深度学习任务处理方法、图像识别任务处理方法和装置
CN116630630B (zh) 语义分割方法、装置、计算机设备及计算机可读存储介质
US20230343137A1 (en) Method and apparatus for detecting key point of image, computer device and storage medium
CN112990176A (zh) 书写质量评价方法、装置和电子设备
CN115731442A (zh) 图像处理方法、装置、计算机设备和存储介质
CN111383245A (zh) 视频检测方法、视频检测装置和电子设备
CN117441195A (zh) 纹理补全
CN113128277A (zh) 一种人脸关键点检测模型的生成方法及相关设备
WO2022141092A1 (zh) 模型生成方法、图像处理方法、装置及可读存储介质
CN113191316A (zh) 图像处理方法、装置、电子设备及存储介质
CN113569684A (zh) 短视频场景分类方法、系统、电子设备及存储介质
CN111666908A (zh) 视频用户的兴趣画像生成方法、装置、设备和存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021545743

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921230

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20227008074

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921230

Country of ref document: EP

Kind code of ref document: A1