CN113553959A - Action recognition method and device, computer readable medium and electronic equipment


Info

Publication number
CN113553959A
CN113553959A (application CN202110852012.7A)
Authority
CN
China
Prior art keywords
limb line
limb
type
line
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110852012.7A
Other languages
Chinese (zh)
Inventor
车宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Douku Software Technology Co Ltd
Original Assignee
Hangzhou Douku Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Douku Software Technology Co Ltd filed Critical Hangzhou Douku Software Technology Co Ltd
Priority to CN202110852012.7A priority Critical patent/CN113553959A/en
Publication of CN113553959A publication Critical patent/CN113553959A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The disclosure provides a motion recognition method, a motion recognition device, a computer readable medium, and an electronic device, and relates to the technical field of image processing. The method comprises the following steps: identifying key points corresponding to a target object in a target image, and connecting the key points to obtain limb lines; determining the limb line type of each limb line according to the limb type of the target object, and determining the position identifier corresponding to each limb line; and performing action recognition according to the limb line type and the position identifier of each limb line to obtain the action category of the target object. The method and device can reduce the data required during action recognition, thereby avoiding the large amount of computation caused by excessive redundant input data; at the same time, they avoid the loss of recognition accuracy caused by redundant data affecting the recognition result.

Description

Action recognition method and device, computer readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a motion recognition method, a motion recognition apparatus, a computer-readable medium, and an electronic device.
Background
With the continuous development of computer vision technology, human motion recognition, as one direction of computer vision, has come into wide use. For example, it has great application prospects in fields such as motion-sensing games, virtual reality, security, and surveillance.
In the related art, motion recognition is usually performed with dedicated equipment. For example, in a virtual reality scenario, a large number of sensors usually need to be arranged in a wearable device to monitor the motion of the various parts of the human body, so that the motion can then be recognized by a specific algorithm. However, such dedicated devices tend to be expensive and are therefore not suitable for all scenarios.
Disclosure of Invention
The present disclosure is directed to a motion recognition method, a motion recognition apparatus, a computer-readable medium, and an electronic device, which avoid redundant data at least to some extent and improve recognition accuracy.
According to a first aspect of the present disclosure, there is provided an action recognition method including: identifying key points corresponding to a target object in a target image, and connecting the key points to obtain a limb line; determining the limb line type of the limb line according to the limb type of the target object, and determining the position identification corresponding to each limb line; and performing motion recognition according to the limb line type of each limb line and the position identification corresponding to each limb line to obtain the motion category of the target object.
According to a second aspect of the present disclosure, there is provided a motion recognition apparatus including: the limb line acquisition module is used for identifying key points corresponding to the target object in the target image and connecting the key points to obtain a limb line; the limb line classification module is used for determining the limb line type of the limb line according to the limb type of the target object and determining the position identification corresponding to each limb line; and the action recognition module is used for carrying out action recognition according to the limb line type of each limb line and the position identification corresponding to each limb line to obtain the action category of the target object.
According to a third aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the method described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device, comprising: a processor; and a memory storing one or more programs that, when executed by the processor, cause the processor to implement the method described above.
According to the action recognition method provided by the embodiments of the present disclosure, the key points corresponding to the target object in the target image are recognized, and the key points are connected to obtain limb lines; the limb lines are then classified according to the limb types of the target object to obtain the limb line types, and the position identifier corresponding to each limb line is determined; motion recognition is then performed using the limb line types and the corresponding position identifiers to output the motion category of the target object. By splitting the target object contained in the target image into limbs, this technical scheme can reduce the data required during motion recognition and thus avoid the large amount of computation caused by excessive redundant input data; meanwhile, because redundant data is not referenced during recognition, the reduction of recognition accuracy caused by redundant data affecting the recognition result can also be avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 3 schematically illustrates a flow chart of a method of motion recognition in an exemplary embodiment of the disclosure;
FIG. 4 is a schematic diagram illustrating a distribution of key points of a human body in an exemplary embodiment of the disclosure;
FIG. 5 schematically illustrates the determination of a position identifier based on two different reference datums in an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating the determination of the position identifier corresponding to a limb line of a human body in an exemplary embodiment of the disclosure;
FIG. 7 is a schematic diagram illustrating the structure of a motion recognition model in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a method for motion recognition based on a motion recognition model in an exemplary embodiment of the disclosure;
FIG. 9 is a schematic diagram illustrating a principle of motion recognition based on a motion recognition model in an exemplary embodiment of the disclosure;
FIG. 10 schematically illustrates a schematic diagram of training a motion recognition model in an exemplary embodiment of the disclosure;
fig. 11 schematically illustrates a composition diagram of a motion recognition apparatus in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a motion recognition method and apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having an image processing function, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The motion recognition method provided by the embodiment of the present disclosure is generally executed by the terminal devices 101, 102, and 103, and accordingly, the motion recognition apparatus is generally disposed in the terminal devices 101, 102, and 103. However, it is easily understood by those skilled in the art that the motion recognition method provided in the embodiment of the present disclosure may also be executed by the server 105, and accordingly, the motion recognition device may also be disposed in the server 105, which is not particularly limited in the exemplary embodiment. For example, in an exemplary embodiment, the terminal devices 101, 102, and 103 may perform motion recognition by using the acquired images as target images, or the terminal devices 101, 102, and 103 may perform motion recognition on the acquired target images after acquiring the target images from other terminal devices through a network.
An exemplary embodiment of the present disclosure provides an electronic device for implementing a motion recognition method, which may be the terminal device 101, 102, 103 or the server 105 in fig. 1. The electronic device comprises at least a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the action recognition method via execution of the executable instructions.
The configuration of the electronic device is described below by taking the mobile terminal 200 in fig. 2 as an example. It will be appreciated by those skilled in the art that, apart from the components specifically intended for mobile use, the configuration of fig. 2 can also be applied to fixed devices. In other embodiments, the mobile terminal 200 may include more or fewer components than shown, or some components may be combined, some components may be split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the components is only schematically illustrated and does not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also adopt an interfacing arrangement different from that of fig. 2, or a combination of multiple interfacing arrangements.
As shown in fig. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display 290, a camera module 291, an indicator 292, a motor 293, a button 294, and a Subscriber Identity Module (SIM) card interface 295. The sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyroscope sensor 2803, and the like.
Processor 210 may include one or more processing units, such as: the Processor 210 may include an Application Processor (AP), a modem Processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband Processor, and/or a Neural-Network Processing Unit (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors.
The NPU is a Neural-Network (NN) computing processor. By drawing on the structure of biological neural networks, for example the way signals are transferred between neurons of the human brain, it processes input information quickly and can also learn continuously by itself. The NPU can implement intelligent-recognition applications of the mobile terminal 200, such as image recognition, face recognition, speech recognition, and text understanding. In some embodiments, the key point recognition and the motion recognition may be performed by the NPU.
A memory is provided in the processor 210. The memory may store instructions for implementing six modular functions: detection instructions, connection instructions, information management instructions, analysis instructions, data transmission instructions, and notification instructions, and execution is controlled by processor 210.
The charge management module 240 is configured to receive a charging input from a charger. The power management module 241 is used for connecting the battery 242, the charging management module 240 and the processor 210. The power management module 241 receives the input of the battery 242 and/or the charging management module 240, and supplies power to the processor 210, the internal memory 221, the display screen 290, the camera module 291, the wireless communication module 260, and the like.
The wireless communication function of the mobile terminal 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, a modem processor, a baseband processor, and the like. Wherein, the antenna 1 and the antenna 2 are used for transmitting and receiving electromagnetic wave signals; the mobile communication module 250 may provide a solution including wireless communication of 2G/3G/4G/5G, etc. applied to the mobile terminal 200; the modem processor may include a modulator and a demodulator; the Wireless communication module 260 may provide a solution for Wireless communication including a Wireless Local Area Network (WLAN) (e.g., a Wireless Fidelity (Wi-Fi) network), Bluetooth (BT), and the like, applied to the mobile terminal 200. In some embodiments, antenna 1 of the mobile terminal 200 is coupled to the mobile communication module 250 and antenna 2 is coupled to the wireless communication module 260, such that the mobile terminal 200 may communicate with networks and other devices via wireless communication techniques.
The mobile terminal 200 implements a display function through the GPU, the display screen 290, the application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 290 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information. In some embodiments, the target image may be displayed and rendered by the GPU.
The mobile terminal 200 may implement a photographing function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like. The ISP is used for processing data fed back by the camera module 291; the camera module 291 is used for capturing still images or videos; the digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals; the video codec is used to compress or decompress digital video, and the mobile terminal 200 may also support one or more video codecs. In some embodiments, the target image to be identified may be acquired by the camera module 291, or the video may be acquired by the camera module 291, and each frame of image in the video may be used as the target image for motion identification.
Internal memory 221 may be used to store computer-executable program code, which includes instructions. The internal memory 221 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (e.g., audio data, a phonebook, etc.) created during use of the mobile terminal 200, and the like. In addition, the internal memory 221 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk Storage device, a Flash memory device, a Universal Flash Storage (UFS), and the like. The processor 210 executes various functional applications of the mobile terminal 200 and data processing by executing instructions stored in the internal memory 221 and/or instructions stored in a memory provided in the processor.
The mobile terminal 200 may implement an audio function through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the earphone interface 274, the application processor, and the like. Such as music playing, recording, etc.
The depth sensor 2801 is used to acquire depth information of a scene. The pressure sensor 2802 is used to sense a pressure signal and convert the pressure signal into an electrical signal. The gyro sensor 2803 may be used to determine a motion gesture of the mobile terminal 200. In addition, other functional sensors, such as an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc., may be provided in the sensor module 280 according to actual needs.
Other devices for providing auxiliary functions may also be included in mobile terminal 200. For example, the keys 294 include a power-on key, a volume key, and the like, and a user can generate key signal inputs related to user settings and function control of the mobile terminal 200 through key inputs. Further examples include indicator 292, motor 293, SIM card interface 295, etc.
In the related art, in addition to motion recognition by means of dedicated devices, motion recognition may also be performed directly on an image containing an object through a machine learning or deep learning network so as to recognize the object's motion. For example, in the related art, a graph structure of motion data may be constructed from node data and the edge data of connecting edges, and the graph structure is then input into a motion recognition model to obtain a motion recognition classification result.
However, when an image or a graph structure is used directly as the input of a machine learning or deep learning network, a large amount of redundant data is provided to the network, so training the network may take a long time; meanwhile, the redundant data may affect the recognition process and reduce the accuracy of the recognition result.
Based on one or more of the problems described above, the present example embodiment provides an action recognition method. The motion recognition method may be applied to the server 105, and may also be applied to one or more of the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. Referring to fig. 3, the motion recognition method may include the following steps S310 to S330:
in step S310, key points corresponding to the target object in the target image are identified, and the key points are connected to obtain a limb line.
The target image may be a video frame extracted from a video, or an image directly acquired by a camera.
In an exemplary embodiment, after the target image is acquired, object recognition may be performed on the target image to obtain a target area of the target object in the target image, and then, key point detection is performed on the target area to further identify a key point corresponding to the target object.
In an exemplary embodiment, when the target object is a human body, human image detection may be performed on the target image through an algorithm or model for human image detection to obtain a human body frame (target area). For example, the human body frame can be obtained by performing frame selection on the target image through a Single Shot MultiBox Detector (SSD). When frame detection is carried out through the SSD algorithm, the human body frame can be obtained through operations such as frame selection, frame normalization, CNN training and feature extraction, frame regression, classifier classification, and data post-processing on the target image.
The frame selection mainly comprises steps such as manually marking samples, screening the samples, and selecting samples from suitable scenes for training; the CNN training network adopts an SSD algorithm model, and the model can be optimized by removing the smaller frame prediction branches and applying processing methods such as quantization to the network model. The optimized network can automatically carry out regression and classification on the frames, and finally all human body frames contained in the target image are determined.
Further, after the target area (human body frame) corresponding to each target object is obtained, key point detection may be performed directly on the target area corresponding to each target object to obtain the key points corresponding to that target object. Specifically, a top-down approach may be used, such as the simple baseline approach proposed by MSRA.
It should be noted that in methods such as the simple baseline method, which detect key points based on machine learning or deep learning, key point detection mainly comprises CNN training and feature extraction, key point regression, key point classification, and data post-processing performed on the target region, after which the key points corresponding to the target object are obtained.
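For illustration only, a minimal sketch of this detect-then-estimate pipeline is given below; detect_person_boxes and detect_keypoints are hypothetical stand-ins for an SSD-style detector and a simple-baseline-style pose estimator (they are not APIs defined by this disclosure), and the 17-key-point layout follows fig. 4.

```python
import numpy as np

def detect_person_boxes(image: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Hypothetical SSD-style detector returning person boxes as (x1, y1, x2, y2)."""
    raise NotImplementedError  # stand-in for the SSD model described above

def detect_keypoints(crop: np.ndarray) -> np.ndarray:
    """Hypothetical top-down pose estimator returning a (17, 2) array of (x, y) key points."""
    raise NotImplementedError  # stand-in for a simple-baseline-style model

def extract_keypoints(image: np.ndarray) -> list[np.ndarray]:
    """Detect person boxes first, then estimate key points inside each box."""
    results = []
    for x1, y1, x2, y2 in detect_person_boxes(image):
        keypoints = detect_keypoints(image[y1:y2, x1:x2])
        keypoints += np.array([x1, y1], dtype=keypoints.dtype)  # map back to image coordinates
        results.append(keypoints)
    return results
```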
It should be noted that, in order to ensure that accurate limb lines can be obtained by subsequently connecting the key points, the key points to be identified on the target object may be set in advance. For example, when the target object is a human body, as shown in fig. 4, the joint positions of the human body may be set as key points so that accurate limb lines can be determined from the key point connections. The joint positions corresponding to the key points 0-16 shown in fig. 4 are, respectively: hip, right hip, right knee, right ankle, left hip, left knee, left ankle, spine, chest, neck, head, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist.
In an exemplary embodiment, after obtaining the key points, the key points may be connected to each other according to the positions of the key points in the target object, so as to obtain corresponding limb lines. For example, as shown in FIG. 4 for keypoints 0-16, based on the locations of the keypoints on the target object, the following limb lines can be derived: 1-2, 2-3, 4-5, 5-6, 11-12, 12-13, 14-15, 15-16, 0-8 and 8-10.
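As a minimal sketch, assuming the key-point indexing of fig. 4, the listed limb lines can be obtained by pairing key-point indices:

```python
import numpy as np

# Key-point index pairs forming the limb lines listed above (fig. 4 indexing).
LIMB_LINES = [
    (1, 2), (2, 3), (4, 5), (5, 6),          # lower limbs
    (11, 12), (12, 13), (14, 15), (15, 16),  # upper limbs
    (0, 8),                                  # trunk
    (8, 10),                                 # head
]

def build_limb_lines(keypoints: np.ndarray) -> dict[tuple[int, int], np.ndarray]:
    """keypoints: (17, 2) array of (x, y). Returns each limb line as a (2, 2) segment."""
    return {(a, b): np.stack([keypoints[a], keypoints[b]]) for (a, b) in LIMB_LINES}
```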
In step S320, the limb line type of the limb line is determined according to the limb type of the target object, and the position identifier corresponding to each limb line is determined.
In an exemplary embodiment, after the limb lines are obtained, they may be classified according to the limb types of the target object to obtain the limb line types. Specifically, the key points contained in each class of limb line may be set in advance so that the limb lines can be classified accordingly. For example, the limb lines in fig. 4 may be classified by the limb, trunk, or head of the human body to which they belong: limb lines 1-2, 2-3, 4-5, and 5-6 may be assigned to the lower limbs; limb lines 11-12, 12-13, 14-15, and 15-16 to the upper limbs; limb line 0-8 to the trunk; and limb line 8-10 to the head.
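This classification can be sketched as a simple lookup; the string labels below are illustrative assumptions, the disclosure only fixes which key-point pairs belong to which part.

```python
# Limb line type for each key-point pair, following the grouping described above.
LIMB_LINE_TYPE = {
    (1, 2): "lower_limb", (2, 3): "lower_limb", (4, 5): "lower_limb", (5, 6): "lower_limb",
    (11, 12): "upper_limb", (12, 13): "upper_limb", (14, 15): "upper_limb", (15, 16): "upper_limb",
    (0, 8): "trunk",
    (8, 10): "head",
}

def classify_limb_line(pair: tuple[int, int]) -> str:
    """Return the limb line type (upper limb, lower limb, trunk, or head) for a key-point pair."""
    return LIMB_LINE_TYPE[pair]
```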
In an exemplary embodiment, when the type of a limb line is determined, the position identifier corresponding to the limb line also needs to be determined. Specifically, when determining the position identifier corresponding to a limb line, the reference datum corresponding to the limb line may first be determined according to the position of the limb line on the target object, and the position identifier corresponding to the limb line may then be determined based on the relative position between the limb line and its reference datum.
The reference datum corresponding to the limb line may be set differently according to different target objects and different limb lines, and may be a reference point, a reference line or a reference plane, and the like, which is not particularly limited in the present disclosure.
In an exemplary embodiment, when determining the reference datum, it can be observed that a limb of the target object typically moves about a center of rotation. For example, the upper arm of the human body usually moves around a joint such as the shoulder as its rotation center. Therefore, the rotation center of a limb line can be determined according to the position of the limb line on the target object, and the reference datum corresponding to the limb line can then be determined based on that rotation center.
In an exemplary embodiment, the reference datums of different limb lines may be set differently based on the target object and the positions of the limb lines on the target object. For the human body, for example, the upper arms and thighs usually rotate about the shoulders and hips, whereas the forearms and lower legs depend not only on the elbow or knee joints but also move together with the upper arms or thighs. Therefore, for an upper arm or a thigh, the key point corresponding to the shoulder or the hip can be taken as the rotation center and used directly as the reference datum; for a forearm or a lower leg, an extension line may be drawn along the upper arm or thigh through the elbow or knee joint (the rotation center), and this extension line may be used as the reference datum.
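A sketch of these two cases follows; which key point plays the role of shoulder, hip, elbow, or knee is read from fig. 4 and is otherwise an assumption.

```python
import numpy as np

def reference_for_upper_segment(rotation_center: np.ndarray):
    """Upper arm / thigh: the shoulder or hip key point (the rotation center) is the reference point."""
    return ("point", rotation_center)

def reference_for_lower_segment(upper_start: np.ndarray, joint: np.ndarray):
    """Forearm / lower leg: the reference is the extension line of the upper segment through the
    elbow or knee joint (the rotation center), represented here as (origin, unit direction)."""
    direction = joint - upper_start
    return ("line", (joint, direction / np.linalg.norm(direction)))

# Example (fig. 4 indices): the left upper arm 11-12 uses key point 11 (left shoulder) as its
# reference point; the left forearm 12-13 uses the line from key point 11 through key point 12.
```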
In an exemplary embodiment, after the reference datum is determined, the position identifier corresponding to the limb line may be determined based on the relative position between the limb line and its reference datum. It should be noted that, for each reference datum, the position identifier corresponding to each relative position between the limb line and the reference datum may be set in advance. The position identifier may be a number, a symbol, or the like that identifies a unique relative position.
In an exemplary embodiment, after the reference datum is determined, the space in which the limb line can move may be divided into a plurality of regions based on the reference datum, and the position identifier corresponding to the limb line is then determined according to which of the divided regions the limb line occupies relative to the reference datum.
For example, referring to fig. 5A, if the reference datum is a reference point, a vertical line and a horizontal line can be drawn through the reference point, dividing the plane of the target image into four regions. These four regions may be identified as 1, 2, 3, and 4, respectively. If a certain limb line lies in region 1 relative to the reference point, its position identifier can be determined to be 1. As another example, referring to fig. 5B, if the reference datum is a reference line, the plane of the target image may be divided into three regions based on the straight line on which the reference line lies: the region to the left of the line, the line itself, and the region to the right of the line. These three regions may be identified as a, b, and c, respectively. If a certain limb line lies in region a relative to the reference line, its position identifier can be determined to be a.
For example, referring to fig. 6, taking limb line 0-8 of the trunk of the human body as an example and key point 8 as the reference point, the plane of the target image is divided into four regions, which may be identified as 1, 2, 3, and 4, respectively. If limb line 0-8 in the target image lies in the lower-right region, its position identifier may be determined to be 2.
It should be noted that, for different reference datums, the same identifiers may be used as position identifiers. In that case, although the identifiers are the same, the regions they represent still differ because the reference datums differ. Alternatively, different identifiers may be used for different reference datums, such as 1, 2, 3, 4 or a, b, c in the examples above.
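The region coding of figs. 5 and 6 can be sketched as follows, assuming image coordinates (x grows to the right, y grows downward). The text only fixes the lower-right region as identifier 2 in the fig. 6 example, so the remaining quadrant numbers, the a/b/c side assignment, and the reduction of the limb line to its midpoint are assumptions.

```python
import numpy as np

def position_id_from_point(limb_midpoint: np.ndarray, ref_point: np.ndarray) -> int:
    """Quadrant of the limb line (represented by its midpoint) around a reference point (fig. 5A)."""
    dx, dy = limb_midpoint - ref_point
    if dx >= 0:
        return 2 if dy >= 0 else 1   # lower-right -> 2 (as in the fig. 6 example), upper-right -> 1 (assumed)
    return 3 if dy >= 0 else 4       # lower-left -> 3, upper-left -> 4 (assumed numbering)

def position_id_from_line(limb_midpoint: np.ndarray, origin: np.ndarray, direction: np.ndarray) -> str:
    """Region relative to a reference line (fig. 5B): 'b' on the line, 'a'/'c' for the two sides.
    Which side maps to 'a' depends on the chosen line orientation and is an assumption."""
    rel = limb_midpoint - origin
    cross = direction[0] * rel[1] - direction[1] * rel[0]  # 2-D cross product
    if np.isclose(cross, 0.0):
        return "b"
    return "a" if cross > 0 else "c"
```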
In step S330, motion recognition is performed according to the limb line type of each limb line and the position identifier corresponding to each limb line, so as to output the motion category of the target object.
In an exemplary embodiment, after the limb line type and the position identifier corresponding to the limb line are obtained, motion recognition may be performed on each limb line according to the limb line type or the position identifier, so as to output a motion category corresponding to the entire target object.
In an exemplary embodiment, when performing motion recognition, the limb line type and the position identifier corresponding to each limb line may be input into a motion recognition model for motion recognition, and then the motion category of the target object is output.
In an exemplary embodiment, since the target object is generally a whole, the action category of the target object can be obtained with a scheme that proceeds from local recognition to global recognition, and the accuracy of action recognition is effectively improved by refining the relationship between the local and the global. Specifically, recognizing the action category of the target object from the limb lines may be implemented by a motion recognition model. The motion recognition model may include a first recognition network and a second recognition network connected in series, where the first recognition network includes one or more sub-networks connected in parallel and is used to perform local motion recognition on different local limb lines of the target object, and the second recognition network is used to perform global motion recognition by combining the local motion recognition results. For example, when the first recognition network includes two sub-networks, the structure of the motion recognition model may be as shown in fig. 7.
In an exemplary embodiment, when the motion recognition model includes a first recognition network and a second recognition network connected in series, where the first recognition network includes one sub-network or a plurality of sub-networks connected in parallel, performing motion recognition on the limb line type of each limb line and the position identifier corresponding to each limb line based on the motion recognition model to output the action category of the target object may, as shown in fig. 8, include the following steps:
Step S810, determining a first limb line corresponding to the target type and a second limb line corresponding to a reference type except the target type in each limb line according to the limb line type of each limb line;
step S820, for the second limb line corresponding to the reference type, inputting the position identifier corresponding to the second limb line of each limb line type in the reference type into a sub-network to obtain a first action category corresponding to each limb line type included in the reference type;
step S830, inputting the first action type and the position identification corresponding to the first limb line into a second identification network to obtain a second action type;
in step S840, the motion category of the target object is output based on the first motion category and the second motion category.
In an exemplary embodiment, in order to effectively refine the relationship between the local and the global, the type of limb line that has a larger influence on the motion of the target object may be determined from the target object as the first limb line corresponding to the target type, and the remaining limb lines are determined as the second limb lines corresponding to the reference type. The second limb lines corresponding to the reference type and their position identifiers are then input into the first recognition network for local motion recognition to obtain the first action categories; specifically, the second limb line of each limb line type in the reference type and its position identifier may be input into a sub-network of the first recognition network to obtain the first action category (local action) corresponding to that reference type. The first action category, together with the first limb line of the target type and its corresponding position identifier, is then input into the second recognition network to obtain the second action category. Finally, the action category of the target object is output based on the first action category and the second action category.
The following describes the motion recognition process in detail, taking the torso portion and the head of the human body as target types, the upper limbs and the lower limbs as reference types, and taking the motion recognition model shown in fig. 7 as an example, as shown in fig. 9:
referring to fig. 9, based on the limb line types, the upper limb data (the limb lines whose type is upper limb and their corresponding position identifiers) and the lower limb data (the limb lines whose type is lower limb and their corresponding position identifiers) are input into the parallel sub-networks 1 and 2 of the first recognition network, respectively, to obtain a first action category 1 and a first action category 2; the first action category 1, the first action category 2, the trunk data (the limb lines whose type is trunk and their corresponding position identifiers), and the head data (the limb lines whose type is head and their corresponding position identifiers) are then input into the second recognition network to obtain a second action category; the action category is then output based on the first action category 1, the first action category 2, and the second action category.
The confidence of the action category obtained by the motion recognition model may be determined based on the first action category 1, the first action category 2, and the second action category, together with a preset weight parameter for each category. For example, the action category confidence = first action category 1 × 0.2 + first action category 2 × 0.2 + second action category × 0.6.
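A simplified sketch of the model of figs. 7 and 9 is given below. The disclosure describes resnet18 sub-networks and a resnet50 second recognition network; here small fully connected networks and flat feature vectors are used as stand-ins, how the limb line types and position identifiers are encoded into those vectors is left open, and the 0.2/0.2/0.6 weights are the example values above.

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """Stand-in local recognizer (the disclosure suggests a resnet18 backbone)."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ActionRecognitionModel(nn.Module):
    """First recognition network: two parallel sub-networks (upper / lower limbs).
    Second recognition network: fuses the two local results with trunk and head data."""
    def __init__(self, limb_dim: int, num_classes: int):
        super().__init__()
        self.upper_net = SubNet(limb_dim, num_classes)    # sub-network 1
        self.lower_net = SubNet(limb_dim, num_classes)    # sub-network 2
        self.global_net = SubNet(2 * num_classes + 2 * limb_dim, num_classes)  # second recognition network

    def forward(self, upper, lower, trunk, head):
        c1 = self.upper_net(upper)    # first action category 1 (upper limbs)
        c2 = self.lower_net(lower)    # first action category 2 (lower limbs)
        c3 = self.global_net(torch.cat([c1, c2, trunk, head], dim=-1))  # second action category
        # Weighted fusion of local and global scores, using the example weights 0.2 / 0.2 / 0.6.
        return 0.2 * c1 + 0.2 * c2 + 0.6 * c3
```

For inference under these assumptions, model(upper, lower, trunk, head).argmax(dim=-1) would give the predicted action category.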
In an exemplary embodiment, before performing the motion recognition, the preset recognition model needs to be trained based on the sample data to obtain the motion recognition model. The sample data comprises a limb line type of the limb line, a position identifier corresponding to the limb line and an action type corresponding to the limb line.
Further, when the motion recognition model includes a first recognition network and a second recognition network connected in series, where the first recognition network includes one sub-network or a plurality of sub-networks connected in parallel, the training process for the motion recognition model may include: specifying a target type and a reference type among the limb line types; for the reference type, inputting the position identifier corresponding to the second limb line of each limb line type in the reference type into a sub-network to obtain a first action category corresponding to each limb line type in the reference type; inputting the first action category and the position identifier corresponding to the first limb line of the target type into the second recognition network to obtain a second action category; and outputting the action category of the target object based on the first action category and the second action category.
It should be noted that the training process of the model is similar to the application process; the difference is that, during training, the target type that has a large influence on the action category of the target object, that is, the target type to be input into the second recognition network, needs to be specified.
The above training process is explained in detail below with reference to fig. 10, taking as examples the key points 0-16 as shown in fig. 4, and the limb lines 1-2, 2-3, 4-5, 5-6, 11-12, 12-13, 14-15, 15-16, 0-8, 8-10 determined based on the key points 0-16:
after a sample image is obtained, key point recognition can be performed on the sample image to finally obtain the human body key point information corresponding to the sample image. Based on the key point information, the limbs are split according to the key points, and the split parts are labeled as shown in Table 1.
Referring to fig. 10, during training, data such as the limb line type and the corresponding position identifier of each limb line in Table 1 are input into the network. The type data identifying the limb line type can be input in the following format: partial classification - limb line key point, where both the partial classification and the limb line key point can be labeled in a simple way. For example, for the partial classification, the labels corresponding to the upper limb, the lower limb, the trunk, and the head may be set in advance to 0, 1, 2, and 3; for the limb line key point, one key point of a limb line may be set in advance as the key point representing that limb line, for example, limb line 1-2 may be identified by key point 1, and limb line 2-3 may be identified by key point 2. In this case, the type data corresponding to the limb line 1-2 may be 0-1 as shown in Table 1.
TABLE 1 Classification results for Limb lines in sample data
During training, the trunk and the head, namely the limb lines with partial classification labels 2 and 3, are designated as limb lines of the target type; data with partial classification label 0 (upper limb) is input into the first sub-network, data with partial classification label 1 (lower limb) is input into the second sub-network, and the data with partial classification labels 2 and 3 is then input into the second recognition network.
The first sub-network and the second sub-network may use a resnet18 structure, the second recognition network may use a resnet50 structure, and Loss1, Loss2, and Loss3 may all adopt softmax loss.
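A training sketch under the same simplifying assumptions follows, reusing the ActionRecognitionModel stand-in from the earlier sketch; cross-entropy plays the role of the softmax loss, the three terms correspond to Loss1, Loss2, and Loss3 in fig. 10, and summing them with equal weight as well as reusing one action label for all three losses are assumptions.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, lr: float = 1e-3) -> None:
    """One training pass; `loader` is assumed to yield (upper, lower, trunk, head, label) batches."""
    criterion = nn.CrossEntropyLoss()                # softmax loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for upper, lower, trunk, head, label in loader:
        c1 = model.upper_net(upper)                  # sub-network 1 output -> Loss1
        c2 = model.lower_net(lower)                  # sub-network 2 output -> Loss2
        c3 = model.global_net(torch.cat([c1, c2, trunk, head], dim=-1))  # second recognition network -> Loss3
        loss = criterion(c1, label) + criterion(c2, label) + criterion(c3, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```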
In summary, in this exemplary embodiment, human body key points are automatically divided into different types, which implements automatic labeling of the data set, and the network is then trained on the data sets of the different types. Compared with the training of network models in other schemes, this effectively reduces manual involvement, shortens the time needed to generate model data samples, and allows a labeled data set to be generated automatically. The region-division approach reduces the network's input data and improves the network's convergence and generalization capability. In addition, the design of the network structure effectively classifies the upper limbs and the lower limbs while also obtaining a global classification, which can effectively reduce the instability of the network's prediction results.
It should be noted that when different types of key points are classified, the key points may be classified based on the above classification method, and may be classified in more types according to different requirements. In addition, the network structures of the sub-network included in the first recognition network and the second recognition network may be network structures such as resnet and efficientnet, and may also be trained by other types of loss functions besides softmax loss, which is not limited in this disclosure.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 11, the embodiment of the present example also provides a motion recognition apparatus 1100, which includes a limb line obtaining module 1110, a limb line classifying module 1120, and a motion recognition module 1130. Wherein:
the limb line obtaining module 1110 may be configured to identify key points corresponding to a target object in a target image, and connect the key points to obtain a limb line.
The limb line classification module 1120 may be configured to determine a limb line type of a limb line according to a limb type of the target object, and determine a position identifier corresponding to each limb line.
The motion recognition module 1130 may be configured to perform motion recognition according to the limb line type of each limb line and the position identifier corresponding to each limb line, so as to obtain a motion category of the target object.
In an exemplary embodiment, the limb line classification module 1120 may be configured to determine a reference datum corresponding to a limb line according to a position of each limb line on the target object; and determining the position identification corresponding to the limb line based on the relative position between the limb line and the reference datum corresponding to the limb line.
In an exemplary embodiment, the limb line classification module 1120 may be configured to determine a center of rotation on the limb line according to a position of each limb line on the target object; and determining a reference datum corresponding to the limb line based on the rotation center on the limb line.
In an exemplary embodiment, the motion recognition module 1130 may be configured to perform motion recognition on the limb line type of each limb line and the position identifier corresponding to each limb line based on the motion recognition model to output the motion category of the target object.
In an exemplary embodiment, when the motion recognition model includes a first recognition network and a second recognition network connected in series, and the first recognition network includes one sub-network or a plurality of sub-networks connected in parallel, the motion recognition module 1130 may be configured to determine, among the limb lines, a first limb line corresponding to the target type and a second limb line corresponding to a reference type other than the target type according to the limb line type of each limb line; for the second limb lines corresponding to the reference type, input the position identifier corresponding to the second limb line of each limb line type in the reference type into a sub-network to obtain a first action category corresponding to each limb line type in the reference type; input the first action category and the position identifier corresponding to the first limb line into the second recognition network to obtain a second action category; and output the action category of the target object based on the first action category and the second action category.
In an exemplary embodiment, the motion recognition apparatus may further include a model training module, and the model training module may be configured to train the preset recognition model based on the sample data to obtain the motion recognition model.
The sample data comprises the limb line type of the limb line, the position identifier corresponding to the limb line, and the action category corresponding to the limb line. Training the preset recognition model based on the sample data to obtain the action recognition model comprises: specifying a target type and a reference type among the limb line types of the limb lines; for the reference type, inputting the position identifier corresponding to the second limb line of each limb line type in the reference type into a sub-network to obtain a first action category corresponding to each limb line type in the reference type; inputting the first action category and the position identifier corresponding to the first limb line of the target type into the second recognition network to obtain a second action category; and outputting the action category of the target object based on the first action category and the second action category.
The specific details of each module in the above apparatus have been described in detail in the method section, and details that are not disclosed may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in this specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product including program code which, when the program product runs on a terminal device, causes the terminal device to perform the steps according to the various exemplary embodiments of the disclosure described in the "exemplary method" section of this specification; for example, any one or more of the steps in fig. 3 and fig. 8 may be performed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (10)

1. A motion recognition method, comprising:
identifying key points corresponding to a target object in a target image, and connecting the key points to obtain a limb line;
determining the limb line type of the limb line according to the limb type of the target object, and determining the position identifier corresponding to each limb line;
and performing action recognition according to the limb line type of each limb line and the position identification corresponding to each limb line to obtain the action category of the target object.
2. The method of claim 1, wherein the determining the location identifier corresponding to each limb line comprises:
determining a reference datum corresponding to each limb line according to the position of each limb line on the target object;
and determining the position identification corresponding to the limb line based on the relative position between the limb line and the reference datum corresponding to the limb line.
3. The method of claim 2, wherein the determining the reference datum corresponding to each of the limb lines according to the position of the limb line on the target object comprises:
determining a rotation center on each limb line according to the position of the limb line on the target object;
and determining a reference datum corresponding to the limb line based on the rotation center on the limb line.
4. The method according to claim 1, wherein performing action recognition according to the limb line type of each limb line and the position identifier corresponding to each limb line to obtain the action category of the target object comprises:
performing action recognition on the limb line type of each limb line and the position identifier corresponding to each limb line based on an action recognition model to output the action category of the target object.
5. The method according to claim 4, wherein the action recognition model comprises a first recognition network and a second recognition network connected in series, the first recognition network comprising one sub-network or a plurality of sub-networks connected in parallel with each other;
wherein performing action recognition on the limb line type of each limb line and the position identifier corresponding to each limb line based on the action recognition model to output the action category of the target object comprises:
determining, according to the limb line type of each limb line, a first limb line corresponding to a target type and a second limb line corresponding to a reference type other than the target type among the limb lines;
for the second limb line corresponding to the reference type, inputting the position identifier corresponding to the second limb line of each limb line type in the reference type into the sub-network to obtain a first action category corresponding to each limb line type included in the reference type;
inputting the first action category and the position identifier corresponding to the first limb line into the second recognition network to obtain a second action category;
and outputting the action category of the target object based on the first action category and the second action category.
6. The method of claim 5, further comprising:
training a preset recognition model based on sample data to obtain the action recognition model; wherein the sample data comprises the limb line type of each limb line, the position identifier corresponding to each limb line, and the action category corresponding to each limb line;
wherein training the preset recognition model based on the sample data to obtain the action recognition model comprises:
specifying the target type and the reference type among the limb line types of the limb lines;
for the reference type, inputting the position identifier corresponding to the second limb line of each limb line type in the reference type into the sub-network to obtain a first action category corresponding to each limb line type included in the reference type;
inputting the first action category and the position identifier corresponding to the first limb line of the target type into the second recognition network to obtain a second action category;
and outputting the action category of the target object based on the first action category and the second action category.
7. The method of claim 1, wherein identifying the key points corresponding to the target object in the target image comprises:
performing object recognition on the target image to identify a target area of the target object in the target image;
and performing key point detection on the target area to identify the key points corresponding to the target object.
8. An action recognition device, comprising:
a limb line acquisition module, configured to identify key points corresponding to a target object in a target image and connect the key points to obtain limb lines;
a limb line classification module, configured to determine a limb line type of each limb line according to the limb type of the target object and determine a position identifier corresponding to each limb line;
and an action recognition module, configured to perform action recognition according to the limb line type of each limb line and the position identifier corresponding to each limb line to obtain an action category of the target object.
9. A computer-readable medium having a computer program stored thereon which, when executed by a processor, implements the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method according to any one of claims 1 to 7 via execution of the executable instructions.
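
The following sketches are editorial illustrations of the claimed steps and are not part of the claims. Claims 1 to 3 describe connecting detected key points into limb lines, assigning each limb line a type, and encoding a position identifier from the relative position between the limb line and a reference datum anchored at the line's rotation center. A minimal Python sketch of that geometry follows; the skeleton layout, the vertical reference direction, and the angle-bucket encoding are assumptions introduced for illustration only.

```python
import numpy as np

# Hypothetical limb-line table: each entry connects two key-point indices
# and is labelled with an assumed limb line type (not taken from the patent).
LIMB_LINES = {
    "left_upper_arm":  (5, 7),    # left shoulder -> left elbow
    "left_forearm":    (7, 9),    # left elbow    -> left wrist
    "right_upper_arm": (6, 8),
    "right_forearm":   (8, 10),
    "left_thigh":      (11, 13),
    "left_shank":      (13, 15),
}

def limb_line_vectors(keypoints):
    """Connect key points (N x 2 array of x, y) into named limb-line vectors,
    each anchored at its proximal key point (used here as the rotation center)."""
    return {name: keypoints[end] - keypoints[start]
            for name, (start, end) in LIMB_LINES.items()}

def position_identifier(vec, reference=np.array([0.0, -1.0]), n_buckets=12):
    """Quantize the signed angle between a limb-line vector and a reference
    datum (here an assumed vertical direction) into a small integer identifier."""
    cross = reference[0] * vec[1] - reference[1] * vec[0]
    dot = reference[0] * vec[0] + reference[1] * vec[1]
    angle = np.arctan2(cross, dot)                      # signed angle in [-pi, pi]
    return int(((angle + np.pi) / (2 * np.pi)) * n_buckets) % n_buckets

# Usage with placeholder key points from any pose estimator, shape (17, 2).
keypoints = np.random.rand(17, 2)
identifiers = {name: position_identifier(vec)
               for name, vec in limb_line_vectors(keypoints).items()}
print(identifiers)
```

Quantizing the relative position leaves one compact scalar per limb line, so the recognition stage receives only the limb line types and these identifiers rather than raw coordinates.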
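Claims 4 to 6 describe an action recognition model in which a first recognition network, made of one or more parallel sub-networks (one per reference limb line type), produces first action categories, and a second recognition network connected in series fuses those outputs with the position identifier of the target-type limb line to produce the second action category. The PyTorch sketch below mirrors that topology under assumed layer sizes, embedding dimensions, and counts of limb line types; none of these values come from the patent.

```python
import torch
import torch.nn as nn

class ActionRecognitionModel(nn.Module):
    """Illustrative two-stage model: parallel sub-networks (first recognition
    network) for the reference-type limb lines, followed in series by a second
    recognition network that also receives the target-type position identifier."""

    def __init__(self, n_position_ids=12, n_reference_types=3,
                 n_first_actions=8, n_actions=10):
        super().__init__()
        # One sub-network per reference limb line type.
        self.sub_networks = nn.ModuleList([
            nn.Sequential(
                nn.Embedding(n_position_ids, 16),
                nn.Flatten(),
                nn.Linear(16, n_first_actions),
            )
            for _ in range(n_reference_types)
        ])
        # Second recognition network fuses the first action categories with the
        # position identifier of the first (target-type) limb line.
        fused_dim = n_reference_types * n_first_actions + 1
        self.second_network = nn.Sequential(
            nn.Linear(fused_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, reference_ids, target_id):
        # reference_ids: list of (batch, 1) long tensors, one per reference type.
        # target_id:     (batch, 1) float tensor with the target-type identifier.
        first_actions = [net(ids) for net, ids in zip(self.sub_networks, reference_ids)]
        fused = torch.cat(first_actions + [target_id], dim=1)
        return self.second_network(fused)  # logits over action categories

# Usage with random placeholder inputs (batch of 4).
model = ActionRecognitionModel()
reference_ids = [torch.randint(0, 12, (4, 1)) for _ in range(3)]
target_id = torch.randint(0, 12, (4, 1)).float()
logits = model(reference_ids, target_id)
print(logits.shape)  # torch.Size([4, 10])
```

The training step of claim 6 would fit the usual supervised loop over such a model: the sample data's limb line types and position identifiers are fed forward as above and a classification loss against the labelled action category updates both stages.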
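Claim 7 separates object recognition, which locates the target object's area, from key point detection, which then runs only on that area. The sketch below uses stock torchvision detectors to illustrate that two-step order; the model choices, the COCO person label, the confidence threshold, and the cropping logic are assumptions for illustration, not the patent's implementation.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    keypointrcnn_resnet50_fpn,
)

# Step 1: object recognition to find the target area.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
# Step 2: key point detection restricted to that area.
keypoint_net = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_keypoints(image, person_label=1, score_threshold=0.8):
    """image: float tensor (3, H, W) with values in [0, 1]."""
    detections = detector([image])[0]
    keep = (detections["labels"] == person_label) & (detections["scores"] > score_threshold)
    if not keep.any():
        return None
    # Crop the highest-scoring target area and run key point detection on it.
    x1, y1, x2, y2 = detections["boxes"][keep][0].int().tolist()
    crop = image[:, y1:y2, x1:x2]
    results = keypoint_net([crop])[0]
    if results["keypoints"].numel() == 0:
        return None
    keypoints = results["keypoints"][0, :, :2]            # (17, 2) x, y in crop coordinates
    return keypoints + torch.tensor([x1, y1], dtype=keypoints.dtype)  # back to image coordinates
```

The returned (17, 2) array of key points could then feed the limb-line construction sketched above.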
CN202110852012.7A 2021-07-27 2021-07-27 Action recognition method and device, computer readable medium and electronic equipment Pending CN113553959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110852012.7A CN113553959A (en) 2021-07-27 2021-07-27 Action recognition method and device, computer readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113553959A true CN113553959A (en) 2021-10-26

Family

ID=78132989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110852012.7A Pending CN113553959A (en) 2021-07-27 2021-07-27 Action recognition method and device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113553959A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629306A (en) * 2018-04-28 2018-10-09 北京京东金融科技控股有限公司 Human posture recognition method and device, electronic equipment, storage medium
CN110139115A (en) * 2019-04-30 2019-08-16 广州虎牙信息科技有限公司 Virtual image attitude control method, device and electronic equipment based on key point
US20190279045A1 (en) * 2016-12-16 2019-09-12 Beijing Sensetime Technology Development Co., Ltd Methods and apparatuses for identifying object category, and electronic devices
WO2020253716A1 (en) * 2019-06-18 2020-12-24 北京字节跳动网络技术有限公司 Image generation method and device
CN112347884A (en) * 2020-10-27 2021-02-09 深圳Tcl新技术有限公司 Human body posture recognition method and device, terminal equipment and computer readable storage medium
CN112465037A (en) * 2020-11-30 2021-03-09 深圳供电局有限公司 Target detection method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109902659B (en) Method and apparatus for processing human body image
WO2021244217A1 (en) Method for training expression transfer model, and expression transfer method and apparatus
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
EP3876140A1 (en) Method and apparatus for recognizing postures of multiple persons, electronic device, and storage medium
CN111476783B (en) Image processing method, device and equipment based on artificial intelligence and storage medium
CN110544272A (en) face tracking method and device, computer equipment and storage medium
CN111930964B (en) Content processing method, device, equipment and storage medium
CN115699082A (en) Defect detection method and device, storage medium and electronic equipment
CN112927363A (en) Voxel map construction method and device, computer readable medium and electronic equipment
CN111967515A (en) Image information extraction method, training method and device, medium and electronic equipment
CN113744286A (en) Virtual hair generation method and device, computer readable medium and electronic equipment
CN110555102A (en) media title recognition method, device and storage medium
CN108491881A (en) Method and apparatus for generating detection model
US20220358662A1 (en) Image generation method and device
CN114239717A (en) Model training method, image processing method and device, electronic device and medium
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
CN111107278A (en) Image processing method and device, electronic equipment and readable storage medium
CN114328945A (en) Knowledge graph alignment method, device, equipment and storage medium
CN111104827A (en) Image processing method and device, electronic equipment and readable storage medium
CN112037305B (en) Method, device and storage medium for reconstructing tree-like organization in image
CN113284206A (en) Information acquisition method and device, computer readable storage medium and electronic equipment
CN113257412A (en) Information processing method, information processing device, computer equipment and storage medium
CN111447379B (en) Method and device for generating information
CN112991208A (en) Image processing method and device, computer readable medium and electronic device
CN113570645A (en) Image registration method, image registration device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination