WO2021185317A1 - Action recognition method and device, and storage medium - Google Patents
- Publication number
- WO2021185317A1 (PCT/CN2021/081556)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- human body
- face
- scene image
- person
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/469—Contour-based spatial representations, e.g. vector-coding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/165—Detection; Localisation; Normalisation using facial parts and geometric relationships
Definitions
- the present disclosure relates to the field of computer vision, and in particular to an action recognition method and device, and storage medium.
- the present disclosure provides an action recognition method, device, and storage medium.
- an action recognition method, comprising: acquiring a scene image; and performing, on the scene image, detection of different parts of an object, association of different parts belonging to the same object, and action recognition of the object, so as to determine at least one object included in the scene image and a target action type of each object in the at least one object.
- the object includes a person, and the different parts of the object include the person's face and human body; performing, on the scene image, the detection of different parts of the object, the association of different parts belonging to the same object, and the action recognition of the object to determine the at least one object included in the scene image and the target action type of each object includes: performing feature extraction on the scene image to obtain a feature map; determining at least one face position and at least one human body position in the feature map; determining at least one person included in the scene image according to the at least one face position and/or the at least one human body position; associating the face position and the human body position belonging to the same person; and determining the target action type of each person in the at least one person in the scene image according to the associated face position and human body position.
- the associating the face position and the human body position belonging to the same person includes: for each person in the at least one person, determining a reference human body position corresponding to the face position of the person; and associating the face position and the human body position belonging to the same person according to the reference human body position and the at least one human body position.
- the determining the reference human body position corresponding to each face position includes: determining a first coordinate value of the face position of the person on the feature map; determining a second coordinate value according to a preset vector and the first coordinate value, wherein the preset vector is a vector pointing from the face position to the human body position; and taking the second coordinate value as the reference human body position.
- the associating the face position and the human body position belonging to the same person according to the reference human body position and the at least one human body position includes: associating the human body position with the smallest distance from the reference human body position with the face position corresponding to the reference human body position.
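The offset-and-nearest-neighbour association in the two items above can be sketched as follows. This is only an illustrative reading of the claims, not the patent's implementation; the function names, example coordinates, and the fixed preset vector are all assumptions:

```python
# Illustrative sketch of the claimed association step (names and the
# preset face-to-body vector are assumptions, not the patent's code).

def reference_body_position(face_pos, preset_vector):
    """Second coordinate value = face coordinate shifted by the preset
    face-to-body vector (the 'reference human body position')."""
    return (face_pos[0] + preset_vector[0], face_pos[1] + preset_vector[1])

def associate(face_positions, body_positions, preset_vector):
    """Pair each face with the detected body position closest to its
    reference human body position, as described in the claim."""
    pairs = {}
    for face in face_positions:
        ref = reference_body_position(face, preset_vector)
        pairs[face] = min(
            body_positions,
            key=lambda b: (b[0] - ref[0]) ** 2 + (b[1] - ref[1]) ** 2,
        )
    return pairs

faces = [(10, 5), (40, 5)]     # face-frame centres on the feature map
bodies = [(41, 20), (11, 20)]  # body-frame centres
print(associate(faces, bodies, preset_vector=(0, 15)))
# face (10, 5) pairs with body (11, 20); face (40, 5) with (41, 20)
```

Squared distance suffices here because only the minimum is needed, not the distance value itself.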
- the determining, according to the associated face position and human body position, the at least one person included in the scene image and the target action type of each person in the at least one person includes: for each person in the at least one person, determining a plurality of feature vectors according to the face position and the human body position associated with the person; and determining the target action type of each person based on the plurality of feature vectors.
- the determining a plurality of feature vectors according to the face position and the human body position associated with the person includes: determining a plurality of feature vectors that respectively correspond to at least one preset action type and point from the face position to the associated human body position.
- the determining the target action type of each person based on the plurality of feature vectors includes: normalizing the plurality of feature vectors corresponding to the person to obtain a normalized value of each feature vector; taking the feature vector corresponding to the maximum normalized value as the target feature vector of the person; and taking the action type corresponding to the target feature vector as the target action type of the person.
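The normalize-then-argmax step above can be sketched as follows, assuming a softmax over per-action responses as the normalization (the claim does not fix a particular normalization, so this choice is an assumption):

```python
# Sketch of the claimed step: normalize per-action responses, then take the
# action with the maximum normalized value. Softmax is an assumed choice.
import math

def target_action(action_scores):
    """action_scores: {action_type: raw response}; returns the target
    action type and the normalized values."""
    exps = {a: math.exp(s) for a, s in action_scores.items()}
    total = sum(exps.values())
    normalized = {a: v / total for a, v in exps.items()}
    return max(normalized, key=normalized.get), normalized

action, probs = target_action({"raise_hand": 2.1, "stand": 0.3, "none": -1.0})
print(action)  # raise_hand
```

The normalized values sum to one, so the maximum can also be read as a confidence for the chosen action type.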
- performing, on the scene image, the detection of different parts of the object, the association of different parts belonging to the same object, and the action recognition of the object to determine the at least one object included in the scene image and the target action type of each object includes: determining, through an object detection model, the target position of each part of each object on the scene image; associating the target positions of different parts belonging to the same object; and then determining, through the object detection model, the at least one object included in the scene image and the target action type of each object according to the associated target positions of the different parts.
- the object detection model is trained by the following steps: determining the label types in a sample image set, wherein the label types include at least one of a face position label, a human body position label, an association label between a face position and a human body position, and an action label between a human body position and an action type; and using the sample image set to separately train the branches of a preset model corresponding to the label types, so as to obtain the object detection model.
- the object detection model includes at least a positioning branch, an association branch, and an action recognition branch.
- the positioning branch is used to determine the position of each person's face and the position of each person's body;
- the association branch is used to associate the face position and the human body position that belong to the same person;
- the action recognition branch is used to determine, according to the associated face position and human body position, the at least one person included in the scene image and the target action type corresponding to each person in the at least one person.
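Training supervises only the branches for which the sample set actually carries labels; a hypothetical sketch of that label-to-branch selection (the label names and branch names below are assumptions loosely based on the branches just described):

```python
# Hypothetical mapping from label types present in a sample set to the
# model branches that those labels can supervise.

BRANCH_FOR_LABEL = {
    "face_position": "positioning",
    "body_position": "positioning",
    "association": "association",
    "action": "action_recognition",
}

def branches_to_train(label_types):
    """Return the (deduplicated, sorted) branches trainable with the
    given label types."""
    return sorted({BRANCH_FOR_LABEL[t] for t in label_types})

print(branches_to_train(["face_position", "action"]))
```

A sample set labeled only with face positions and action labels would thus update the positioning and action recognition branches while leaving the association branch untouched.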
- the method further includes: determining a cumulative detection result of an action matching the target action type made by each object within a set time period.
- the scene image includes a scene image collected in a classroom
- the object includes a teaching object
- the target action type includes at least one action type in a teaching task.
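The cumulative detection result over a set time period, e.g. counting per student how often a teaching-task action was detected, could be tallied as in this sketch; the frame-tuple format and identifiers are assumptions:

```python
# Illustrative tally of the cumulative detection result: per object, how
# many detections inside a time window match the target action type.
from collections import Counter

def cumulative_detections(frames, window_start, window_end, action):
    """frames: iterable of (timestamp, object_id, detected_action)."""
    counts = Counter(
        obj for ts, obj, act in frames
        if window_start <= ts <= window_end and act == action
    )
    return dict(counts)

frames = [(0, "s1", "raise_hand"), (1, "s1", "raise_hand"),
          (1, "s2", "stand"), (2, "s1", "stand")]
print(cumulative_detections(frames, 0, 2, "raise_hand"))  # {'s1': 2}
```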
- an action recognition device, including: an image acquisition module for acquiring a scene image; and an action recognition module for performing, on the scene image, detection of different parts of an object, association of different parts belonging to the same object, and action recognition of the object, so as to determine at least one object included in the scene image and a target action type of each object in the at least one object.
- the object includes a person, and different parts of the object include the person's face and human body;
- the action recognition module includes: a feature extraction sub-module for performing feature extraction on the scene image to obtain a feature map; a first determining sub-module for determining at least one face position and at least one human body position in the feature map; a second determining sub-module for determining at least one person included in the scene image according to the at least one face position and/or the at least one human body position; an association sub-module for associating the face position and the human body position belonging to the same person; and a third determining sub-module for determining, according to the associated face position and human body position, the target action type of each person in the at least one person in the scene image.
- the association sub-module includes: a first determining unit configured to determine, for each person in the at least one person, a reference human body position corresponding to the face position of the person; and an association unit configured to associate, according to the reference human body position and the at least one human body position, the face position and the human body position belonging to the same person.
- the first determining unit includes: determining a first coordinate value of the face position of the person on the feature map; determining a second coordinate value according to a preset vector and the first coordinate value, wherein the preset vector is a vector pointing from the face position to the human body position; and taking the second coordinate value as the reference human body position.
- the associating unit includes: associating the human body position with the smallest distance from the reference human body position and the face position corresponding to the reference human body position.
- the second determining sub-module includes: a second determining unit configured to determine, for each person in the at least one person, a plurality of feature vectors according to the face position and the human body position associated with the person; and a third determining unit configured to determine the target action type of each person based on the plurality of feature vectors.
- the second determining unit includes: determining multiple feature vectors respectively corresponding to at least one preset action type and pointing from the face position to the associated human body position.
- the third determining unit includes: normalizing the plurality of feature vectors corresponding to the person to obtain a normalized value of each feature vector; taking the feature vector corresponding to the maximum normalized value as the target feature vector of the person; and taking the action type corresponding to the target feature vector as the target action type of the person.
- the action recognition module includes: a second association sub-module for determining, through the object detection model, the target position of each part of each object on the scene image and then associating the target positions of different parts belonging to the same object; and a third determining sub-module for determining, through the object detection model, the at least one object included in the scene image and the target action type of each object according to the associated target positions of the different parts.
- the device further includes: a label type determination module for determining the label types in a sample image set, wherein the label types include at least one of a face position label, a human body position label, an association label between a face position and a human body position, and an action label between a human body position and an action type; and a training module for using the sample image set to separately train the branches of a preset model corresponding to the label types, so as to obtain the object detection model.
- the device further includes: a matching determination module, configured to determine the cumulative detection result of the action matching the target action type made by each object within a set time period.
- the scene image includes a scene image collected in a classroom
- the object includes a teaching object
- the target action type includes at least one action type in a teaching task.
- a computer-readable storage medium stores a computer program, and the computer program is used to execute the action recognition method of any one of the first aspects.
- an action recognition device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the executable instructions stored in the memory to implement the action recognition method described in any one of the first aspect.
- the scene image can be subjected to detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object, so as to determine at least one object included in the scene image and the target action type of each object in the at least one object. The duration of the above action recognition is independent of the number of objects included in the scene image, and an increase in the number of objects does not increase the computation time, which greatly saves computing resources, shortens the duration of action recognition, and effectively improves detection efficiency.
- Fig. 1 is a flow chart of an action recognition method according to an exemplary embodiment of the present disclosure
- Fig. 2 is a flowchart of another method for action recognition according to an exemplary embodiment of the present disclosure
- Fig. 3 is a flowchart of another method for action recognition according to an exemplary embodiment of the present disclosure
- Fig. 4 is a flowchart of another method for action recognition according to an exemplary embodiment of the present disclosure.
- Fig. 5 is a schematic diagram showing preset vectors according to an exemplary embodiment of the present disclosure.
- Fig. 6 is a flowchart of another method for action recognition according to an exemplary embodiment of the present disclosure.
- Fig. 7 is a flowchart of another method for action recognition according to an exemplary embodiment of the present disclosure.
- Fig. 8 is a schematic structural diagram of an object detection model according to an exemplary embodiment of the present disclosure.
- Fig. 9 is a flowchart of another action recognition method according to an exemplary embodiment of the present disclosure.
- Fig. 10 is a schematic diagram showing a training scene of an object detection model according to an exemplary embodiment of the present disclosure
- Fig. 11 is a flowchart of another method for action recognition according to an exemplary embodiment of the present disclosure.
- Fig. 12 is a block diagram showing an action recognition device according to an exemplary embodiment of the present disclosure.
- Fig. 13 is a schematic structural diagram of a device for action recognition according to an exemplary embodiment of the present disclosure.
- first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
- first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
- the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
- the embodiments of the present disclosure provide an action recognition solution, which is exemplary and can be applied to terminal devices in different scenarios. Different scenarios include but are not limited to classrooms, places where advertisements are played, or other indoor or outdoor scenes that require action recognition of at least one object.
- the terminal device can be any terminal device with a camera, or the terminal device can be connected to an external camera device.
- the terminal device detects the different parts of the object, the association of different parts in the same object, and the action recognition of the object on the acquired scene image, thereby determining at least one object included in the scene image and the target action type of each object in the at least one object .
- the terminal equipment can be a teaching multimedia device with a camera deployed in the classroom, including but not limited to teaching projectors, monitoring equipment in the classroom, etc.
- the terminal device obtains the scene image in the classroom, performs detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object in the classroom, and quickly obtains a detection result, which may include at least one object included in the scene image and the target action type of each object.
- the target action type may include raising a hand, standing, or performing other interactive actions.
- the terminal device can obtain a scene image in an elevator, and the elevator is playing an advertisement.
- the target action type corresponding to the object in the elevator can be determined while the elevator is playing the advertisement; the target action type may include, but is not limited to, turning the head, paying attention to the advertisement placement, turning sideways, and the like.
- the action recognition solution provided by the embodiments of the present disclosure can also be applied to cloud servers in different scenarios.
- the cloud server can be equipped with an external camera.
- the external camera collects scene images and sends them to the cloud server through devices such as routers or gateways.
- the cloud server performs detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object on the scene image, and determines at least one object included in the scene image and the target action type of each object in the at least one object.
- an external camera is set in the classroom. After the external camera collects the scene image in the classroom, it is sent to the cloud server through a router or gateway.
- the cloud server performs, on the scene image, detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object, and determines the at least one object included in the scene image and the target action type of each object. Further, the cloud server can feed back the above results to a corresponding teaching task analysis server as required, so as to remind the teacher to adjust the teaching content and better carry out teaching activities.
- the place is an elevator
- an external camera is set in the elevator.
- the external camera collects scene images in the elevator.
- the scene images can be sent to the cloud server through routers or gateways, and the cloud server determines at least one object included in the scene image and the target action type of each object in the at least one object.
- the statistical results of the target actions of the objects in the elevator can be fed back to the corresponding advertiser server as needed, and the advertiser can adjust the advertising content.
- the terminal device or the cloud server can also perform further processing according to the above detection result, for example, outputting a target image and marking, on the target image, at least one object included in the scene image and the target action type of each object, so as to better understand the objects in the current scene and the action type of each object.
- the terminal device or the cloud server can also determine the cumulative detection result of each object included in the scene image within a set time period that matches the target action type.
- the target action type may include at least one action type in the teaching task.
- the teacher is teaching
- the target action types include but are not limited to raising hands, standing up to answer questions, interacting with the teacher, paying attention to the blackboard, and writing with your head down.
- after obtaining the cumulative detection results, the terminal device can display them so that the teacher can better perform teaching tasks; alternatively, the cloud server can send the cumulative detection results to a designated terminal device for display, which likewise helps the teacher carry out teaching tasks.
- Fig. 1 shows an action recognition method according to an exemplary embodiment, which includes the following steps:
- step 101 a scene image is acquired.
- scene images in the current scene can be collected.
- the scenes of the present disclosure include, but are not limited to, any scene that requires action recognition of objects in the scene, such as classrooms and places where advertisements are played.
- step 102: the scene image is subjected to detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object, so as to determine at least one object included in the scene image and the target action type of each object in the at least one object.
- the object may include but is not limited to a person, and different parts may include, but are not limited to, a human face and a human body.
- the detection of different parts of the object on the scene image may include detecting the positions of each person's face and human body on the scene image.
- the association of different parts in the same object may require associating the position of the face and the position of the human body belonging to the same person.
- the action recognition of the object may be to determine the target action type of each person included in the scene image from at least one preset action type.
- the preset action types can be set according to the needs of the scene, including but not limited to raising a hand, bending over, jumping, turning around, etc.; alternatively, the preset action types may also include a type indicating that no new action has been performed, i.e., the person keeps the previous action type unchanged.
- the scene image is subjected to detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object, so as to determine at least one object included in the scene image and the target action type of each object in the at least one object. The above action recognition time is independent of the number of objects included in the scene image and does not increase as the number of objects grows, which greatly saves computing resources, shortens the time for action recognition, and improves detection efficiency.
- step 102 may include:
- Feature extraction is performed on the scene image, and after the feature map is obtained, detection of different parts of the object, association of different parts in the same object, and motion recognition of the object are performed on the feature map.
- the neural network backbone model (backbone) trained in advance can be used to extract the image features in the scene image to obtain the feature map.
- the backbone model of the neural network may adopt, but is not limited to, models such as Visual Geometry Group Network (VGG Net).
- the dimension of the feature map obtained by extracting the image features through the neural network backbone model is smaller than the dimension of the scene image. For example, inputting a scene image with a dimension of 640×480 into the neural network backbone model can yield a feature map with a dimension of 80×60.
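The 640×480 to 80×60 example corresponds to a backbone whose total downsampling stride is 8 (an inference from the numbers, not something the text states); a trivial sketch:

```python
def feature_map_size(image_w, image_h, stride):
    """Spatial size of the backbone output for a given total stride."""
    return image_w // stride, image_h // stride

# 640x480 input with total stride 8 gives the 80x60 feature map above:
print(feature_map_size(640, 480, 8))  # (80, 60)
```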
- the extracted image features may include, but are not limited to, color features, texture features, shape features, and so on.
- Color feature is a global feature that describes the surface color attributes of the object corresponding to the image; texture feature is also a global feature, describing the surface texture attributes of the object corresponding to the image. Shape features have two types of representation: contour features and regional features. The contour feature of an image mainly concerns the outer boundary of the object, while the regional feature relates to the shape of the image region.
- the subsequent detection of different parts of the object, association of different parts belonging to the same object, and action recognition of the object are performed on the feature map, so that the at least one object included in the scene image and the target action type of each object can be quickly determined based on the image features; this is easy to implement and highly usable.
- the object includes a person, and different parts of the object include the person's face and human body.
- step 102 may include:
- step 102-0 at least one face position and at least one human body position in the feature map are determined.
- the human face area belonging to the human face and the human body area belonging to the human body on the feature map corresponding to the scene image can be detected through the area prediction network.
- the face area can be identified by the face recognition frame
- the human body area can be identified by the human body recognition frame.
- the face recognition frame can be determined by the center position of the face recognition frame together with the length and width of the face recognition frame.
- the face position can be determined by the center position of the face recognition frame.
- the size of the human body recognition frame can be determined by the center position of the human body recognition frame, the length and width of the human body recognition frame, and the position of the human body can be represented by the center position of the human body recognition frame.
- the above-mentioned position description information of the human face and the human body can be respectively represented through different channels.
- for example, the dimension of the feature map is 80×60; after detection, a first feature map of 80×60×6 can be obtained. The 6 channels of the first feature map respectively output the center position of the face recognition frame, the length of the face recognition frame, the width of the face recognition frame, the center position of the human body recognition frame, the length of the human body recognition frame, and the width of the human body recognition frame. The channels corresponding to the center position of the face recognition frame and the center position of the human body recognition frame may be taken from the first feature map, so as to determine the face position and the human body position respectively.
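One plausible per-cell decoding of those six channels is sketched below; the channel order and the use of a per-cell centre score are assumptions, since the text does not specify the exact encoding:

```python
# Hypothetical decoding of one cell of the 80x60x6 first feature map.
# Assumed channel order: face-centre score, face length, face width,
# body-centre score, body length, body width.

def decode_cell(x, y, channels):
    """channels: the six channel values at feature-map cell (x, y)."""
    face_score, face_l, face_w, body_score, body_l, body_w = channels
    return {
        "face": {"centre": (x, y), "score": face_score, "size": (face_l, face_w)},
        "body": {"centre": (x, y), "score": body_score, "size": (body_l, body_w)},
    }

box = decode_cell(12, 7, [0.9, 4.0, 3.0, 0.8, 10.0, 6.0])
print(box["face"]["size"])  # (4.0, 3.0)
```

Taking only the two centre-score channels, as the text describes, would then suffice to locate faces and bodies without decoding the size channels.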
- In step 102-1, at least one person included in the scene image is determined according to the at least one face position and/or the at least one human body position.
- each person can be represented by the face and/or human body corresponding to that person, so that at least one person included in the scene image can be determined.
- the position of each person can be determined by the face position, which can be the center position of the face recognition frame.
- assuming the face positions include A1, A2, and A3, it can be determined that the feature map includes 3 persons, whose positions are A1, A2, and A3 respectively.
- In step 102-2, the face position and the human body position belonging to the same person are associated.
- suppose the center positions of two face recognition frames, A1 and A2, and the center positions of two human body recognition frames, B1 and B2, are determined on the feature map.
- the center positions of the face recognition frames can then be associated with the center positions of the human body recognition frames, finally obtaining the associated pair of face-frame center A1 and body-frame center B2, and the associated pair of face-frame center A2 and body-frame center B1.
- the position of the face and the position of the human body associated with the position of the face can be respectively represented through two channels.
- assuming the dimension of the feature map is 80×60, a first feature map with a dimension of 80×60×6 is obtained.
- after the face position is associated with the human body position, a second feature map with a dimension of 80×60×2 is obtained.
- the second feature map includes two channels: one channel corresponds to the face position of each person, and the other channel corresponds to the human body position associated with that face position.
- In step 102-3, the target action type of each person in the at least one person in the scene image is determined according to the associated face position and human body position.
- multiple feature vectors can be determined according to the associated face position and human body position; these feature vectors are respectively obtained according to preset action types, and the target action type can then be determined based on these feature vectors.
- the target action type may be at least one of the preset action types; assuming the number of preset action types is n, n channels are needed to respectively correspond to the different preset action types.
- the preset action types include the various types of actions a person may perform, as well as a type indicating that the person performs no action.
- assuming the dimension of the feature map is 80×60, a first feature map with a dimension of 80×60×6 is obtained; further, the associated face position and human body position are determined, and a second feature map with a dimension of 80×60×2 is obtained.
- on this basis, a third feature map with a dimension of 80×60×n needs to be determined, and the final target action type is determined according to the third feature map.
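A minimal sketch of how the final target action type could be read out of the n action channels at a person's position (the 80×60×n shape follows the description; n = 5, the cell indices, and all names are illustrative assumptions):

```python
import numpy as np

H, W, n = 80, 60, 5                       # n preset action types (n = 5 is illustrative)
rng = np.random.default_rng(0)
third_feature_map = rng.random((H, W, n)).astype(np.float32)

# Suppose a person's associated position falls at map cell (row, col);
# the n channel values there score the preset action types for that person.
row, col = 12, 34
action_scores = third_feature_map[row, col]
target_action_type = int(np.argmax(action_scores))   # index of the winning action type
```

Taking the argmax over the n channels at the person's position yields one preset action type per person, which matches the idea of determining the final target action type from the third feature map.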
- in the above embodiment, the face position and the human body position can first be determined on the feature map; further, the face position and the human body position belonging to the same person are associated, so that, based on the associated face position and human body position, the at least one person included in the scene image and the target action type corresponding to each person in the at least one person are determined.
- in this way, the target action type corresponding to each person can be determined quickly, which reduces the requirement on the computing power of the device, shortens the time for action recognition, and improves the competitiveness of the device.
- step 102-2 may include:
- In step 102-21, a reference human body position corresponding to the face position of the same person is determined.
- the center position of the most likely human body recognition frame corresponding to the center position of a person's face recognition frame can be predicted based on that face position, and this position is used as the reference human body position.
- In step 102-22, the associated face position and human body position are determined according to the reference human body position and the human body position.
- each reference human body position can be associated with a human body position, so that the face position and the human body position corresponding to the same reference human body position are associated.
- the reference human body position corresponding to each human face position may be determined first according to the human face position of each person, so as to associate the human face position with the human body position, which is simple to implement and has high usability.
- step 102-21 may include:
- In step 201, the first coordinate value corresponding to the face position of the same person on the scene image is determined.
- the face position of each person has already been determined on the feature map corresponding to the scene image, and can be represented by the center position of the face recognition frame; the coordinate value corresponding to the center position of each face recognition frame can then be determined in the image coordinate system corresponding to the feature map, and this coordinate value is the first coordinate value.
- In step 202, a second coordinate value is determined according to a preset vector and the first coordinate value.
- the preset vector is a preset vector pointing from the position of the face to the position of the human body, for example from the face position to the estimated center position of the human body recognition frame; then, according to the first coordinate value of the face position and the preset vector, a second coordinate value can be determined.
- In step 203, the second coordinate value is used as the reference human body position.
- that is, the second coordinate value is directly used as the reference human body position.
- the reference human body position corresponding to each human face position can be determined according to the human face position and the preset vector of each person, so that the face position and the human body position can be subsequently associated with high usability.
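The three steps above reduce to a single vector addition; a minimal sketch (the coordinates and the preset vector value are illustrative assumptions, not values from the patent):

```python
import numpy as np

# First coordinate value: the center of a face recognition frame on the feature map.
face_position = np.array([25.0, 40.0])

# Preset vector pointing from the face position toward the expected body position
# (the actual vector would be fixed in advance; this value is illustrative).
preset_vector = np.array([0.0, 10.0])

# Second coordinate value, used directly as the reference human body position.
reference_body_position = face_position + preset_vector
```

The same computation run with the body-to-face vector instead gives the "reference face position" variant mentioned later in the description.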
- step 102-22 may include:
- the human body position with the smallest distance to the reference human body position and the face position corresponding to that reference human body position are taken as the face position and human body position having an association relationship.
- that is, the human body position closest to the reference human body position and the face position corresponding to that reference human body position are regarded as the face position and human body position of the same person.
- in this way, the associated face position and human body position are obtained.
- for example, suppose the reference body positions include C1 and C2, where C1 is determined according to face position A1 and C2 according to face position A2, and the human body positions include B1 and B2.
- among the human body positions, B2 is closest to C1 and B1 is closest to C2; it can therefore be determined that A1 and B2 have an association relationship, and A2 and B1 have an association relationship.
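The nearest-neighbor matching in this example can be sketched as follows (all coordinates are made-up values chosen so the example reproduces the A1/B2 and A2/B1 pairing; names are illustrative):

```python
import numpy as np

# Reference body positions C1, C2 derived from face positions A1, A2.
references = {"A1": np.array([10.0, 20.0]), "A2": np.array([40.0, 45.0])}
# Detected human body recognition frame centers B1, B2.
bodies = {"B1": np.array([42.0, 44.0]), "B2": np.array([11.0, 19.0])}

# Associate each face with the detected body position closest to its reference position.
associations = {face_id: min(bodies, key=lambda b: np.linalg.norm(bodies[b] - ref))
                for face_id, ref in references.items()}
```

With these coordinates the result is the pairing described in the text: A1 with B2 and A2 with B1.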
- alternatively, the reference face position corresponding to each human body position can be determined according to the human body position of each person and another preset vector; the face position with the smallest distance to the reference face position and the human body position corresponding to that reference face position are then regarded as the face position and human body position having an association relationship.
- the other preset vector may be a preset vector pointing from the position of the human body to the position of the human face.
- the method of determining the position of the reference human face is the same as the method of determining the position of the reference human body described above, and will not be repeated here.
- step 102-3 includes:
- In step 102-31, at least one of the associated face position and human body position is used as the position of each person included in the scene image, the scene image including the at least one person.
- the position of each person can be represented by the face position and/or the human body position corresponding to that person, so that the persons included in the scene image can be determined.
- In step 102-32, a plurality of feature vectors are determined according to the associated face position and human body position.
- feature vectors that respectively correspond to at least one preset action type and point from the face position to the associated human body position are determined, so as to obtain the multiple feature vectors corresponding to the same person.
- In step 102-33, the target action type corresponding to each person is determined based on the multiple feature vectors.
- the most likely action type of the person can be determined based on the multiple feature vectors, and this action type is used as the target action type.
- step 102-33 may include:
- In step 301, the multiple feature vectors corresponding to each person are normalized to obtain a normalized value corresponding to each feature vector.
- a normalization function, such as a softmax function, may be used to normalize the multiple feature vectors corresponding to each person, so as to obtain the normalized value corresponding to each feature vector.
- In step 302, the feature vector corresponding to the maximum normalized value for each person is used as the target feature vector of that person.
- that is, the feature vector corresponding to the maximum normalized value is used as the target feature vector of each person.
- In step 303, the action type corresponding to the target feature vector is used as the target action type corresponding to each person.
- the action type corresponding to the target feature vector is the most likely action type of the person and can accordingly be used as the target action type of the person.
- in this way, the most likely action type of each person is determined and used as the target action type, thereby achieving the purpose of action recognition of the object.
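Steps 301 to 303 amount to a softmax followed by an argmax. A hedged sketch, assuming each candidate feature vector has already been reduced to one scalar score per preset action type (the score values are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

# One scalar score per preset action type for a person (values illustrative).
scores = np.array([0.2, 2.5, 1.0, -0.3])
normalized = softmax(scores)                  # normalized value per feature vector
target_index = int(np.argmax(normalized))     # index of the target feature vector
```

The action type at `target_index` is then reported as the person's target action type; since softmax is monotonic, the argmax of the normalized values equals the argmax of the raw scores.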
- in some embodiments, the scene image may be input to a pre-trained object detection model, and the object detection model determines the target position of each part of each object on the scene image.
- the target positions of different parts belonging to the same object are then associated, and at least one object included in the scene image and the target action type of each object in the at least one object are determined according to the associated target positions of the different parts.
- the structure of the object detection model is shown in Figure 8. After acquiring the scene image, the scene image is input into the object detection model.
- the object detection model first uses the pre-trained neural network backbone model to extract the features of the scene image to obtain the feature map.
- the object detection model includes at least a positioning branch, an association branch, and an action recognition branch.
- the object detection model determines the position of each person's face and the position of each person's body on the feature map through positioning branches.
- the object detection model associates the position of the face and the position of the human body that belong to the same person through an association branch.
- the action recognition branch is then used to determine the at least one character included in the scene image and the target action type corresponding to each character in the at least one character according to the associated face position and the human body position.
- finally, the object detection model may output the above-mentioned action detection result, which includes at least one person included in the scene image and the target action type corresponding to each person in the at least one person.
- the object detection model can also directly output a target image on which at least one object included in the scene image and the target action type of each object in the at least one object are identified simultaneously, intuitively reflecting the object detection result.
- in the above embodiment, detection of different parts of an object, association of different parts of the same object, and action recognition of the object can be performed on the scene image, so as to determine at least one object included in the scene image and the target action type of each object in the at least one object.
- the above action recognition duration is independent of the number of objects included in the scene image, and the calculation duration does not increase as the number of objects increases, which greatly saves computing resources, shortens the duration of action recognition, and effectively improves detection efficiency.
- in practice, sample image sets in which the face position label, the human body position label, the association label between the face position and the human body position, and the action identification label between the human body position and the action type are all annotated at the same time are relatively scarce; for sample image sets with only part of the labels, annotating the remaining labels would take considerable extra time.
- the method may further include:
- In step 100-1, the label types in the sample image set are determined.
- an existing sample image set may be used; the label types included in the sample images comprise at least one of a face position label, a human body position label, an association label between the face position and the human body position, and an action identification label between the human body position and the action type.
- In step 100-2, the sample image set is used to separately train the branches in the preset model corresponding to each of the label types, to obtain the object detection model.
- the structure of the preset model may also be as shown in FIG. 8, including positioning branch, association branch and action recognition branch.
- the sample image set is used to separately train the branches in the preset model corresponding to the label types, and when the loss function of the corresponding branch is minimized, the trained object detection model is obtained.
- the positioning branch may also include a face positioning branch and a human body positioning branch (not shown in FIG. 9).
- if the label type only includes the face position label, the sample image set is used to train the face positioning branch in the positioning branch of the preset model; in each training iteration, the other branches are not processed, that is, only the loss function of the face positioning branch (the first loss function) is determined, and the second, third, and fourth loss functions can be set to 0, for example.
- similarly, if the label type only includes the human body position label, the sample image set is used to train the human body positioning branch in the positioning branch of the preset model; if the label types in the sample image set include both the face position label and the human body position label, the sample image set can be used to train the positioning branch directly.
- if the label type includes the association label, the sample image set can be used to train the association branch of the preset model, with the loss functions corresponding to the other branches set to 0, for example.
- if the label type includes the action identification label, the sample image set can be used to train the action recognition branch of the preset model, with the loss functions corresponding to the other branches set to 0, for example.
- in general, the sample image set can be used to train the corresponding branch of the preset model, with the loss functions corresponding to the other branches set to 0, for example.
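The branch-wise training idea, zeroing the losses of branches whose labels are absent from the sample set, could be sketched as follows; the function, the branch names, and the example loss values are assumptions for illustration, not the patent's exact formulation:

```python
def total_loss(branch_losses, available_labels):
    """Sum only the losses of branches whose label type is present in the
    sample image set; losses of the other branches are effectively set to 0."""
    return sum(loss for name, loss in branch_losses.items() if name in available_labels)

# Illustrative per-branch losses for one training iteration.
branch_losses = {"face_positioning": 0.7, "body_positioning": 0.5,
                 "association": 0.9, "action_recognition": 1.1}

# A sample image set annotated only with face position labels:
loss = total_loss(branch_losses, {"face_positioning"})
```

Masking losses this way lets partially labeled sample sets still contribute gradients to the branches their labels cover, which is why mixed label types can be used without re-annotating every image.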
- in the above embodiment, the sample image set is used to train the branches of the preset model corresponding to the label types of the sample image set to obtain the object detection model, which improves the detection performance and generalization performance of the object detection model.
- the method may further include:
- In step 103, the cumulative detection result of actions matching the target action type made by each object within a set time period is determined.
- the scene image includes a scene image collected in a classroom
- the object includes a teaching object
- the target action type includes at least one action type in the teaching task; the action types matching the teaching task include, but are not limited to, raising hands, interacting with the teacher, standing up to answer questions, looking at the blackboard, writing with the head down, and so on.
- teaching multimedia equipment with cameras deployed in the classroom can be used to obtain scene images collected in the classroom.
- the cumulative detection result of actions matching the target action type made by each teaching object, for example each student, can be determined: for instance, how many times each student raised a hand in a class, how long the student looked at the blackboard, how long the student wrote with the head down, how many times the student stood up to answer questions, the number of interactions with the teacher, and so on.
- the above results can be displayed through teaching multimedia equipment, so that teachers can better carry out teaching tasks.
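Aggregating per-frame detections into a cumulative result over a set time period is a simple counting step; a sketch (the student IDs, action names, and detection list are illustrative assumptions):

```python
from collections import Counter

# Per-frame detections over a set time period: (student_id, detected_action_type).
detections = [("s1", "raise_hand"), ("s2", "write_head_down"),
              ("s1", "raise_hand"), ("s1", "answer_question"),
              ("s2", "raise_hand")]

# Cumulative detection result: how many matching actions each student made.
cumulative = Counter(detections)
s1_hand_raises = cumulative[("s1", "raise_hand")]
```

Duration-type statistics (time spent looking at the blackboard, for example) could be accumulated the same way by summing per-frame time deltas instead of counts.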
- the present disclosure also provides an embodiment of the device.
- FIG. 12 is a block diagram of an action recognition device according to an exemplary embodiment of the present disclosure.
- the device includes: an image acquisition module 410, configured to acquire scene images; and an action recognition module 420, configured to perform, on the scene image, detection of different parts of an object, association of different parts of the same object, and action recognition of the object, and to determine at least one object included in the scene image and a target action type of each object in the at least one object.
- the object includes a person, and different parts of the object include the person's face and human body;
- the action recognition module includes: a feature extraction sub-module, configured to perform feature extraction on the scene image to obtain a feature map; a first determining sub-module, configured to determine at least one face position and at least one human body position in the feature map; a second determining sub-module, configured to determine at least one person included in the scene image according to the at least one face position and/or the at least one human body position; an association sub-module, configured to associate the face position and the human body position belonging to the same person; and a third determining sub-module, configured to determine the target action type of each person in the at least one person in the scene image according to the associated face position and human body position.
- the association sub-module includes: a first determining unit, configured to determine, for each person in the at least one person, a reference human body position corresponding to the face position of that person; and an association unit, configured to associate the face position and the human body position belonging to the same person according to the reference human body position and the at least one human body position.
- the first determining unit is configured to: determine the first coordinate value of the person's face position on the feature map; determine a second coordinate value according to a preset vector and the first coordinate value, wherein the preset vector is a vector pointing from the position of the face to the position of the human body; and use the second coordinate value as the reference human body position.
- the association unit is configured to associate the human body position with the smallest distance to the reference human body position with the face position corresponding to that reference human body position.
- the third determining sub-module includes: a second determining unit, configured to determine, for each person in the at least one person, multiple feature vectors according to the face position and the human body position associated with that person; and a third determining unit, configured to determine the target action type of each person in the at least one person based on the multiple feature vectors.
- the second determining unit is configured to determine multiple feature vectors that respectively correspond to at least one preset action type and point from the face position to the associated human body position.
- the third determining unit is configured to: normalize the multiple feature vectors corresponding to the person to obtain a normalized value of each feature vector; use the feature vector corresponding to the maximum normalized value as the target feature vector of the person; and use the action type corresponding to the target feature vector as the target action type of the person.
- the action recognition module includes: a second association sub-module, configured to determine, through the object detection model, the target position of each part of each object on the scene image and then associate the target positions of different parts belonging to the same object; and a third determining sub-module, configured to determine, through the object detection model and according to the associated target positions of the different parts, at least one object included in the scene image and the target action type of each object in the at least one object.
- the device further includes: a label type determining module, configured to determine the label types in the sample image set, wherein the label types include at least one of a face position label, a human body position label, an association label between the face position and the human body position, and an action identification label between the human body position and the action type; and a training module, configured to use the sample image set to separately train the branches of the preset model corresponding to the label types, to obtain the object detection model.
- the device further includes: a matching determination module, configured to determine the cumulative detection result of the action matching the target action type made by each object within a set time period.
- the scene image includes a scene image collected in a classroom
- the object includes a teaching object
- the target action type includes at least one action type in a teaching task.
- for the relevant parts, reference may be made to the description of the method embodiments.
- the device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the present disclosure. Those of ordinary skill in the art can understand and implement them without creative work.
- the embodiment of the present disclosure also provides a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to execute any of the above-mentioned action recognition methods.
- the embodiments of the present disclosure also provide a computer program product including computer-readable code; when the computer-readable code runs on a device, the processor in the device executes instructions for implementing the action recognition method provided in any of the foregoing embodiments.
- the embodiments of the present disclosure also provide another computer program product for storing computer-readable instructions, which when executed, cause the computer to perform the operations of the action recognition method provided in any of the foregoing embodiments.
- the computer program product can be specifically implemented by hardware, software, or a combination thereof.
- the computer program product is specifically embodied as a computer storage medium.
- in another optional embodiment, the computer program product is specifically embodied as a software product, such as a software development kit (SDK), and so on.
- An embodiment of the present disclosure also provides an action recognition device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the executable instructions stored in the memory to implement any of the foregoing action recognition methods.
- FIG. 13 is a schematic diagram of the hardware structure of an action recognition device provided by an embodiment of the disclosure.
- the action recognition device 510 includes a processor 511, and may also include an input device 512, an output device 513, and a memory 514.
- the input device 512, the output device 513, the memory 514, and the processor 511 are connected to each other through a bus.
- the memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for storing related instructions and data.
- the input device is used to input data and/or signals
- the output device is used to output data and/or signals.
- the output device and the input device can be independent devices or a whole device.
- the processor may include one or more processors, such as one or more central processing units (CPU).
- the CPU may be a single-core CPU or a multi-core CPU.
- the memory is used to store the program code and data of the network device.
- the processor is used to call the program code and data in the memory to execute the steps in the foregoing method embodiment.
- for details, please refer to the description in the method embodiments, which will not be repeated here.
- FIG. 13 only shows a simplified design of the action recognition device.
- in practical applications, the action recognition device may also include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, etc., and all action recognition devices that can implement the embodiments of the present disclosure fall within the protection scope of the present disclosure.
Claims (17)
- 1. An action recognition method, characterized by comprising: acquiring a scene image; and performing, on the scene image, detection of different parts of an object, association of different parts of the same object, and action recognition of the object, to determine at least one object included in the scene image and a target action type of each object in the at least one object.
- 2. The method according to claim 1, wherein the object comprises a person, and the different parts of the object comprise the person's face and human body; and the performing, on the scene image, detection of different parts of an object, association of different parts of the same object, and action recognition of the object, to determine at least one object included in the scene image and a target action type of each object in the at least one object, comprises: performing feature extraction on the scene image to obtain a feature map; determining at least one face position and at least one human body position in the feature map; determining, according to the at least one face position and/or the at least one human body position, at least one person included in the scene image; associating the face position and the human body position belonging to the same person; and determining, according to the associated face position and human body position, the target action type of each person in the at least one person in the scene image.
- 3. The method according to claim 2, wherein the associating the face position and the human body position belonging to the same person comprises: for each person in the at least one person, determining a reference human body position corresponding to the face position of the person; and associating, according to the reference human body position and the at least one human body position, the face position and the human body position belonging to the same person.
- 4. The method according to claim 3, wherein the determining the reference human body position corresponding to the face position of the person comprises: determining a first coordinate value of the face position of the person on the feature map; determining a second coordinate value according to a preset vector and the first coordinate value, wherein the preset vector is a vector pointing from the position of the face to the position of the human body; and using the second coordinate value as the reference human body position.
- 5. The method according to claim 3 or 4, wherein the associating, according to the reference human body position and the at least one human body position, the face position and the human body position belonging to the same person comprises: associating the human body position with the smallest distance to the reference human body position with the face position corresponding to the reference human body position.
- 6. The method according to any one of claims 2-5, wherein the determining, according to the associated face position and human body position, the target action type of each person in the at least one person in the scene image comprises: for each person in the at least one person, determining multiple feature vectors according to the face position and the human body position associated with the person; and determining, based on the multiple feature vectors, the target action type of the person.
- 7. The method according to claim 6, wherein the determining multiple feature vectors according to the face position and the human body position associated with the person comprises: determining multiple feature vectors that respectively correspond to at least one preset action type and point from the face position to the associated human body position.
- 8. The method according to claim 6 or 7, wherein the determining, based on the multiple feature vectors, the target action type of the person comprises: normalizing the multiple feature vectors corresponding to the person respectively to obtain a normalized value of each feature vector; using the feature vector corresponding to the maximum normalized value as the target feature vector of the person; and using the action type corresponding to the target feature vector as the target action type of the person.
- 9. The method according to any one of claims 1-8, wherein the performing, on the scene image, detection of different parts of an object, association of different parts of the same object, and action recognition of the object, to determine at least one object included in the scene image and a target action type of each object in the at least one object, comprises: determining, through an object detection model, the target position of each part of each object on the scene image, and then associating the target positions of different parts belonging to the same object; and determining, through the object detection model and according to the associated target positions of the different parts, at least one object included in the scene image and the target action type of each object in the at least one object.
- 10. The method according to claim 9, wherein the object detection model is trained by: determining label types in a sample image set, wherein the label types comprise at least one of a face position label, a human body position label, an association label between a face position and a human body position, and an action identification label between a human body position and an action type; and using the sample image set to separately train the branches of a preset model corresponding to the label types, to obtain the object detection model.
- 11. The method according to claim 10, wherein the object detection model comprises at least a positioning branch, an association branch, and an action recognition branch; the positioning branch is configured to determine the face position and the human body position of each person; the association branch is configured to associate the face position and the human body position belonging to the same person; and the action recognition branch is configured to determine, according to the associated face position and human body position, at least one person included in the scene image and the target action type corresponding to each person in the at least one person.
- 12. The method according to any one of claims 1-11, further comprising: determining a cumulative detection result of actions matching the target action type made by each object within a set time period.
- 13. The method according to claim 12, wherein the scene image comprises a scene image collected in a classroom, the object comprises a teaching object, and the target action type comprises at least one action type in a teaching task.
- 14. An action recognition device, characterized by comprising: an image acquisition module configured to acquire a scene image; and an action recognition module configured to perform, on the scene image, detection of different parts of an object, association of different parts of the same object, and action recognition of the object, to determine at least one object included in the scene image and a target action type of each object in the at least one object.
- 15. A computer-readable storage medium storing a computer program, wherein the computer program is used to execute the action recognition method according to any one of claims 1-13.
- 16. An action recognition device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the executable instructions stored in the memory to implement the action recognition method according to any one of claims 1-13.
- 17. A computer program product comprising computer-readable code, wherein when the computer-readable code runs on a device, a processor in the device executes the action recognition method according to any one of claims 1-13.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020227003914A KR20220027241A (ko) | 2020-03-19 | 2021-03-18 | 동작 인식 방법, 장치 및 저장 매체 |
JP2022506372A JP2022543032A (ja) | 2020-03-19 | 2021-03-18 | 動作認識方法、動作認識装置、コンピュータ可読記憶媒体、電子機器及びコンピュータプログラム製品 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010196461.6A CN113496143B (zh) | 2020-03-19 | 2020-03-19 | 动作识别方法及装置、存储介质 |
CN202010196461.6 | 2020-03-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021185317A1 true WO2021185317A1 (zh) | 2021-09-23 |
Family
ID=77770162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/081556 WO2021185317A1 (zh) | 2020-03-19 | 2021-03-18 | 动作识别方法及装置、存储介质 |
Country Status (5)
Country | Link |
---|---|
JP (1) | JP2022543032A (zh) |
KR (1) | KR20220027241A (zh) |
CN (1) | CN113496143B (zh) |
TW (1) | TWI776429B (zh) |
WO (1) | WO2021185317A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463850B (zh) * | 2022-02-08 | 2022-12-20 | 南京科源视觉技术有限公司 | 一种适用于多种应用场景的人体动作识别系统 |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050084141A1 (en) * | 2003-08-29 | 2005-04-21 | Fuji Xerox Co., Ltd. | Action recognition apparatus and apparatus for recognizing attitude of object |
US7110569B2 (en) * | 2001-09-27 | 2006-09-19 | Koninklijke Philips Electronics N.V. | Video based detection of fall-down and other events |
CN102179048A (zh) * | 2011-02-28 | 2011-09-14 | Wuhan Guide Electric Co., Ltd. | Method for realizing a real-scene game based on action decomposition and behavior analysis |
CN108229324A (zh) * | 2017-11-30 | 2018-06-29 | Beijing SenseTime Technology Development Co., Ltd. | Gesture tracking method and apparatus, electronic device, and computer storage medium |
US10037458B1 (en) * | 2017-05-02 | 2018-07-31 | King Fahd University Of Petroleum And Minerals | Automated sign language recognition |
CN109829435A (zh) * | 2019-01-31 | 2019-05-31 | Shenzhen SenseTime Technology Co., Ltd. | Video image processing method and apparatus, and computer-readable medium |
CN110347246A (zh) * | 2019-06-19 | 2019-10-18 | Shenzhen Qianhai CloudMinds Intelligent Technology Co., Ltd. | Human-computer interaction method and apparatus, storage medium, and electronic device |
CN110647807A (zh) * | 2019-08-14 | 2020-01-03 | Ping An Life Insurance Company of China, Ltd. | Abnormal behavior determination method and apparatus, computer device, and storage medium |
CN110781843A (zh) * | 2019-10-29 | 2020-02-11 | Capital Normal University | Classroom behavior detection method and electronic device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190213792A1 (en) * | 2018-01-11 | 2019-07-11 | Microsoft Technology Licensing, Llc | Providing Body-Anchored Mixed-Reality Experiences |
CN110659544A (zh) * | 2018-06-28 | 2020-01-07 | South China Agricultural University | Dairy cow behavior recognition method based on a non-parametric spatio-temporal context trajectory model |
CN108960209B (zh) * | 2018-08-09 | 2023-07-21 | Tencent Technology (Shenzhen) Co., Ltd. | Identity recognition method and apparatus, and computer-readable storage medium |
CN110135246B (zh) * | 2019-04-03 | 2023-10-20 | Ping An Technology (Shenzhen) Co., Ltd. | Human body action recognition method and device |
CN110096964B (zh) * | 2019-04-08 | 2021-05-04 | Xiamen Meitu Zhijia Technology Co., Ltd. | Method for generating an image recognition model |
2020
- 2020-03-19 CN CN202010196461.6A patent/CN113496143B/zh active Active

2021
- 2021-03-18 KR KR1020227003914A patent/KR20220027241A/ko not_active Application Discontinuation
- 2021-03-18 JP JP2022506372A patent/JP2022543032A/ja not_active Withdrawn
- 2021-03-18 TW TW110109832A patent/TWI776429B/zh active
- 2021-03-18 WO PCT/CN2021/081556 patent/WO2021185317A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN113496143B (zh) | 2024-07-16 |
KR20220027241A (ko) | 2022-03-07 |
CN113496143A (zh) | 2021-10-12 |
TWI776429B (zh) | 2022-09-01 |
JP2022543032A (ja) | 2022-10-07 |
TW202139061A (zh) | 2021-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10776970B2 (en) | Method and apparatus for processing video image and computer readable medium | |
US11443557B2 (en) | Monitoring and analyzing body language with machine learning, using artificial intelligence systems for improving interaction between humans, and humans and robots | |
US9349076B1 (en) | Template-based target object detection in an image | |
JP2022505762A (ja) | Training method and apparatus for an image semantic segmentation network, device, and computer program | |
CN113807276B (zh) | Smoking behavior recognition method based on an optimized YOLOv4 model | |
CN110851641B (zh) | Cross-modal retrieval method and apparatus, and readable storage medium | |
WO2021218671A1 (zh) | Target tracking method and apparatus, storage medium, and computer program | |
CN109063587B (zh) | Data processing method, storage medium, and electronic device | |
US20140198954A1 (en) | Systems and methods of detecting body movements using globally generated multi-dimensional gesture data | |
CN106874826A (zh) | Face key point tracking method and apparatus | |
CN110942011B (zh) | Video event recognition method, system, electronic device, and medium | |
CN110175528B (zh) | Human body tracking method and apparatus, computer device, and readable medium | |
CN109522883A (zh) | Face detection method, system, apparatus, and storage medium | |
CN110287848A (zh) | Video generation method and apparatus | |
CN111160134A (zh) | Human-subject-oriented video shot-scale analysis method and apparatus | |
Balasuriya et al. | Learning platform for visually impaired children through artificial intelligence and computer vision | |
US20230274145A1 (en) | Method and system for symmetric recognition of handed activities | |
CN111767831A (zh) | Method, apparatus, device, and storage medium for processing images | |
CN114782901A (zh) | Sand table projection method, apparatus, device, and medium based on visual change analysis | |
CN109063790A (zh) | Object recognition model optimization method, apparatus, and electronic device | |
WO2021185317A1 (zh) | Action recognition method and apparatus, and storage medium | |
CN112087590A (zh) | Image processing method, apparatus, system, and computer storage medium | |
CN117218703A (zh) | Intelligent learning-situation analysis method and system | |
CN111652045B (zh) | Classroom teaching quality evaluation method and system | |
CN112446360A (zh) | Target behavior detection method and apparatus, and electronic device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21772235; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2022506372; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 20227003914; Country of ref document: KR; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21772235; Country of ref document: EP; Kind code of ref document: A1 |