WO2020088433A1 - Multi-person pose recognition method and apparatus, electronic device, and storage medium - Google Patents

Multi-person pose recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020088433A1
Authority: WIPO (PCT)
Prior art keywords: network, layer, feature, image, recognized
Application number: PCT/CN2019/113899
Other languages: English (en), French (fr)
Inventors: 黄浩智, 龚新宇, 罗镜民, 朱晓龙, 刘威
Original Assignee: Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to EP19878618.8A (EP3876140B1)
Publication of WO2020088433A1
Priority to US17/073,441 (US11501574B2)

Classifications

    • G06T7/70: Image analysis; determining position or orientation of objects or cameras
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253: Fusion techniques of extracted features
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06V10/454: Local feature extraction by integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06T2207/10016: Video; image sequence
    • G06T2207/20076: Probabilistic image processing
    • G06T2207/20081: Training; learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/30196: Human being; person
    • H04L12/4675: Dynamic sharing of VLAN information amongst network nodes
    • H04L61/103: Mapping addresses of different types across network layers, e.g. address resolution protocol [ARP]
    • H04L67/125: Protocols for special-purpose networking environments, involving control of end-device applications over a network
    • H04L69/325: Intralayer communication protocols or PDU definitions in the network layer [OSI layer 3], e.g. X.25
    • H04N19/126: Details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • H04N19/33: Hierarchical video coding techniques, e.g. scalability in the spatial domain
    • H04N19/543: Motion estimation other than block-based, using regions
    • H04N19/55: Motion estimation with spatial constraints, e.g. at image or region borders
    • H04N19/59: Predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H04N19/70: Syntax aspects related to video coding, e.g. related to compression standards
    • H04N19/85: Pre-processing or post-processing specially adapted for video compression
    • H04N19/87: Pre-/post-processing involving scene cut or scene change detection in combination with video compression

Definitions

  • This application relates to the field of computer technology, and in particular to a multi-person pose recognition method and apparatus, an electronic device, and a storage medium.
  • Multi-person pose recognition technology includes two schemes: a top-down scheme and a bottom-up scheme.
  • The top-down scheme first detects each person in the image to be recognized in the form of a bounding box, and then performs human key point detection on each person within the bounding box.
  • The bottom-up scheme detects the human key points of everyone in the image to be recognized at once, and simultaneously determines the person to which each human key point belongs. The bottom-up scheme is therefore more efficient than the top-down scheme, but its accuracy is insufficient.
  • In the prior art, a multi-person pose recognition method based on a stacked hourglass network has been proposed to make up for the bottom-up scheme's lack of accuracy.
  • In that network, however, feature propagation depends on convolution operations, which forms a feature propagation bottleneck.
  • Accordingly, various embodiments of the present application provide a multi-person pose recognition method and apparatus, an electronic device, and a storage medium.
  • A multi-person pose recognition method includes:
  • acquiring an image to be recognized, and constructing a roundabout pyramid network, where the roundabout pyramid network includes several parallel stages, each stage includes the layers of a down-sampling network, the layers of an up-sampling network, and a first residual connection layer connected between them, and different stages are connected by a second residual connection layer;
  • traversing the stages of the roundabout pyramid network: in the feature map extraction performed at the current stage, propagating features between the layers of the down-sampling network and the layers of the up-sampling network via the first residual connection layer to obtain the output feature map of the current stage, and propagating features between the layers of the up-sampling network at the current stage and the layers of the down-sampling network at the following stage via the second residual connection layer, so that the following stage can extract its corresponding feature map;
  • after the traversal of all stages is complete, using the output feature map of the last stage as the feature map corresponding to the image to be recognized; and
  • performing multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
  • A multi-person pose recognition apparatus includes:
  • an image acquisition module, configured to acquire an image to be recognized;
  • a traversal module, configured to construct a roundabout pyramid network, where the roundabout pyramid network includes several parallel stages, each stage includes the layers of a down-sampling network, the layers of an up-sampling network, and a first residual connection layer connected between them, and different stages are connected by a second residual connection layer;
  • the traversal module is further configured to traverse the stages of the roundabout pyramid network, performing the following processing: in the feature map extraction performed at the current stage, propagating features between the layers of the down-sampling network and the layers of the up-sampling network at the current stage via the first residual connection layer to obtain the output feature map of the current stage; and propagating features between the layers of the up-sampling network at the current stage and the layers of the down-sampling network at the following stage via the second residual connection layer, so that the following stage can extract its corresponding feature map;
  • the traversal module is further configured to, once the traversal of all stages of the roundabout pyramid network is complete, use the output feature map of the last stage as the feature map corresponding to the image to be recognized; and
  • a pose recognition module, configured to perform multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
  • An electronic device includes a processor and a memory, the memory storing computer-readable instructions that, when executed by the processor, implement the multi-person pose recognition method described above.
  • A computer-readable storage medium stores a computer program that, when executed by a processor, implements the multi-person pose recognition method described above.
  • FIG. 1 is a schematic diagram of an implementation environment involved in this application.
  • FIG. 2 is a block diagram of the hardware structure of an electronic device according to an exemplary embodiment.
  • FIG. 3 is a flowchart of a multi-person pose recognition method according to an exemplary embodiment.
  • FIG. 4 is a schematic structural diagram of the roundabout pyramid network involved in the embodiment corresponding to FIG. 3.
  • FIG. 5 is a flowchart of step 340 of the embodiment corresponding to FIG. 3, in one embodiment.
  • FIG. 6 is a schematic diagram of the current stage of the roundabout pyramid network involved in the embodiment corresponding to FIG. 5.
  • FIG. 7 is a flowchart of another multi-person pose recognition method according to an exemplary embodiment.
  • FIG. 8 is a schematic structural diagram of the propagation paths constructed for the roundabout pyramid network according to the embodiment corresponding to FIG. 7.
  • FIG. 9 is a flowchart of step 360 of the embodiment corresponding to FIG. 3, in one embodiment.
  • FIG. 10 is a schematic diagram of a heat map identifying the positions of nose key points according to the embodiment corresponding to FIG. 9.
  • FIG. 11 is a schematic diagram of a heat map identifying the positions of wrist key points according to the embodiment corresponding to FIG. 9.
  • FIG. 12 is a schematic diagram of a grouping map identifying the grouping of nose key points according to the embodiment corresponding to FIG. 9.
  • FIG. 13 is a schematic diagram of a grouping map identifying the grouping of wrist key points according to the embodiment corresponding to FIG. 9.
  • FIG. 14 is a schematic diagram of the pose recognition result of the image to be recognized according to the embodiment corresponding to FIG. 9.
  • FIG. 15 is a flowchart of another multi-person pose recognition method according to an exemplary embodiment.
  • FIG. 16 is a block diagram of a multi-person pose recognition apparatus according to an exemplary embodiment.
  • FIG. 17 is a block diagram of an electronic device according to an exemplary embodiment.
  • The embodiments of the present application propose a multi-person pose recognition method that removes the feature propagation bottleneck and thus effectively improves pose recognition accuracy. Accordingly, the method is applicable to a multi-person pose recognition apparatus.
  • In some embodiments, the multi-person pose recognition apparatus is deployed in an electronic device with a von Neumann architecture.
  • The electronic device may be a user terminal, a server, and so on.
  • FIG. 1 is a schematic diagram of an implementation environment involved in the pose recognition method.
  • The implementation environment includes a recognition terminal 110 and an interaction terminal 130.
  • The recognition terminal 110 may be a desktop computer, a notebook computer, a tablet computer, a smartphone, a palmtop computer, a personal digital assistant, or any other electronic device on which the pose recognition apparatus can be deployed, for example, a server that provides a pose recognition service to users; this is not limited here.
  • The interaction terminal 130 refers to somatosensory devices, smart home devices, and other electronic devices that can carry out somatosensory interaction with users.
  • The interaction terminal 130 is deployed in the same gateway as the recognition terminal 110 through communication methods such as 2G/3G/4G/5G and Wi-Fi, to facilitate somatosensory interaction between the user and the interaction terminal 130.
  • At the recognition terminal 110, pose recognition can be performed on the image to be recognized using a roundabout pyramid network, to obtain a pose recognition result of the image to be recognized.
  • If the recognition terminal 110 is a server, the image to be recognized obtained by the server may come from a camera device disposed in the environment where the user performs actions; the camera device may collect images or videos of the user in real time as the user performs actions, and upload them to the server.
  • The actions in the image to be recognized are then recognized to generate a corresponding interaction instruction, and the execution of a specified event is controlled through the interaction instruction.
  • For example, suppose the interaction terminal 130 is a smart speaker. Through the interaction between the recognition terminal 110 and the smart speaker, the smart speaker can receive the interaction instruction and execute the specified event according to it: if the specified event is a startup event, the smart speaker is activated for the user when the action performed by the user matches the specified pose.
  • Of course, the pose recognition apparatus can also be deployed directly on the interaction terminal 130; that is, the interaction terminal 130 also serves as the recognition terminal.
  • After acquiring the image to be recognized, the interaction terminal 130 performs pose recognition on it and then executes a specified event according to the pose recognition result. For example, if the interaction terminal 130 is a dance machine, it recognizes whether the user has performed a series of specified dance actions in sequence by matching the actions performed by the user against the specified dance actions, then generates an interaction instruction and executes a scoring event according to that instruction, that is, it scores the actions performed by the user.
  • FIG. 2 is a block diagram of the hardware structure of an electronic device according to an exemplary embodiment.
  • This type of electronic device is suitable for the recognition terminal 110 in the implementation environment shown in FIG. 1, and may be a user terminal such as a desktop computer, a notebook computer, a tablet computer, a palmtop computer, a personal digital assistant, a smartphone, or a wearable device, or may be a server.
  • This type of electronic device is merely an example adapted to this application and cannot be considered as limiting the scope of use of this application; nor can it be interpreted as depending on or requiring one or more components of the exemplary electronic device 200 shown in FIG. 2.
  • The hardware structure of the electronic device 200 may vary greatly depending on configuration or performance. As shown in FIG. 2, the electronic device 200 includes a power supply 210, an interface 230, at least one memory 250, at least one central processing unit (CPU) 270, and a camera component 290.
  • The power supply 210 provides an operating voltage for each component of the electronic device 200.
  • The interface 230 includes at least one wired or wireless network interface 231, at least one serial-parallel conversion interface 233, at least one input-output interface 235, at least one USB interface 237, and so on, for communicating with external devices, for example, interacting with the interaction terminal 130 in the implementation environment shown in FIG. 1.
  • The memory 250 may be a read-only memory, a random access memory, a magnetic disk, or an optical disk. The resources stored on the memory 250 include an operating system 251, application programs 253, and data 255, stored either temporarily or permanently.
  • The operating system 251 manages and controls the components and application programs 253 on the electronic device 200, enabling the central processing unit 270 to compute and process the massive data 255; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
  • The application program 253 is a computer program that completes at least one specific job based on the operating system 251. It may include at least one module (not shown in FIG. 2), and each module may contain a series of computer-readable instructions for the electronic device 200. For example, the multi-person pose recognition apparatus can be regarded as an application program 253 deployed on the electronic device 200 to implement the multi-person pose recognition method.
  • The data 255 may be photos, pictures, or images to be recognized, stored in the memory 250.
  • The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 through a communication bus to read the computer-readable instructions stored there, thereby computing and processing the massive data 255 in the memory 250. For example, the multi-person pose recognition method is carried out by the central processing unit 270 reading a series of computer-readable instructions stored in the memory 250.
  • The camera component 290, such as a camera, captures images or videos. The captured images or videos can be stored in the memory 250 or transmitted to external devices through the interface 230, for example, images or videos collected in real time while users perform actions.
  • In addition, the present application can also be implemented through hardware circuits or a combination of hardware circuits and software; therefore, implementing the present application is not limited to any specific hardware circuit, software, or combination of the two.
  • In one embodiment, a multi-person pose recognition method is applied to an electronic device, for example, the recognition terminal of the implementation environment shown in FIG. 1, whose structure may be as shown in FIG. 2.
  • This multi-person pose recognition method may be executed by the recognition terminal, or, equivalently, by the multi-person pose recognition apparatus deployed in the recognition terminal.
  • For convenience of description, the execution body of each step is described as the multi-person pose recognition apparatus, but this does not constitute a specific limitation.
  • This multi-person pose recognition method may include the following steps.
  • Step 310: Acquire an image to be recognized.
  • The image to be recognized is generated by photographing multiple people, so that subsequent multi-person pose recognition can be performed on an image containing multiple people.
  • The image to be recognized may come from images collected by the recognition terminal in real time, for example, when the recognition terminal is a smartphone equipped with a camera; or it may be an image pre-stored by the recognition terminal, for example, when the recognition terminal is a server that reads the image locally or obtains it by way of network transmission.
  • In other words, the image to be recognized may be acquired as it is collected in real time, so that multi-person pose recognition can be performed on it in real time; or images collected within a historical time period may be acquired, so that multi-person pose recognition can be performed when there are fewer processing tasks, or at the instruction of an operator. This is not specifically limited in this embodiment.
  • As for the camera component configured at the recognition terminal, if it can be used as an independent device, such as a camera or a video recorder, it can be placed around the environment where the multiple people are located, so that they can be photographed from different angles. This yields images to be recognized that reflect the multiple people from different angles, which helps to ensure the accuracy of subsequent pose recognition.
  • The shooting can be a single shot or continuous shooting. For a single shot, the obtained image to be recognized is a single picture; for continuous shooting, a video containing several images to be recognized is obtained. Therefore, in the embodiments of the present application, the image to be recognized for multi-person pose recognition may be a picture taken at one time, or a particular image in a continuously shot video; no specific restriction is made here.
  • Step 320: Construct a roundabout pyramid network.
  • The roundabout pyramid network includes several stages connected in parallel. Each stage includes the layers of a down-sampling network, the layers of an up-sampling network, and a first residual connection layer connected between them; different stages are connected by a second residual connection layer.
  • In other words, the roundabout pyramid network connects several stages in parallel in a "roundabout" fashion, for extracting the feature map corresponding to each stage.
  • Each stage includes the layers of the down-sampling network and the layers of the up-sampling network.
  • The layers of the down-sampling network perform down-sampling to obtain lower-resolution features while reducing the computational complexity of pose recognition.
  • The layers of the up-sampling network perform up-sampling to gradually restore the resolution of the features, which in turn helps to ensure the accuracy of pose recognition.
  • A first residual connection layer is established between the layers of the down-sampling network and the layers of the up-sampling network, so that features can propagate between them within each stage; that is, the features extracted by the layers of the down-sampling network are transmitted through the first residual connection layer to the layers of the up-sampling network, where feature fusion is performed to obtain the feature map corresponding to each stage.
  • As shown in FIG. 4, the roundabout pyramid network includes stage 0, stage 1, stage 2, and so on.
  • Each stage includes a down-sampling network 401 and an up-sampling network 402, whose network layers 4051, 4052, 4053, and 4054 are ordered from low to high in the network hierarchy.
  • A second residual connection layer is established between the layers of the up-sampling network at the current stage and the layers of the down-sampling network at the following stage, to facilitate feature propagation between different stages; the following stage can then extract its corresponding feature map based on the propagated features.
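  • The overall topology can be illustrated in code. The following PyTorch-style sketch is illustrative only and is not the patented implementation: the module names, channel counts, pooling/interpolation choices, and stage depth are all assumptions. It shows one stage with first residual (lateral) connections inside the stage and second residual (cross-stage) connections handing every resolution on to the following stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # Pre-activation order (BN -> ReLU -> Conv), matching the propagation
    # paths described later in this document.
    return nn.Sequential(nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
                         nn.Conv2d(c_in, c_out, 3, padding=1))

class DetourStage(nn.Module):
    """One stage: an hourglass-like down-sampling/up-sampling pair."""
    def __init__(self, channels, depth=4):
        super().__init__()
        self.depth = depth
        self.down = nn.ModuleList(conv_block(channels, channels) for _ in range(depth))
        self.up = nn.ModuleList(conv_block(channels, channels) for _ in range(depth))
        # "First residual connection layer": one lateral link per resolution.
        self.lateral = nn.ModuleList(conv_block(channels, channels) for _ in range(depth))

    def forward(self, x, cross_stage=None):
        # cross_stage: per-resolution features handed over from the previous
        # stage via the "second residual connection layer" (None for stage 0).
        skips, feats = [], x
        for i in range(self.depth):
            if cross_stage is not None:
                feats = feats + cross_stage[i]        # second residual connection
            feats = self.down[i](feats)
            skips.append(feats)
            if i < self.depth - 1:
                feats = F.max_pool2d(feats, 2)        # down-sample
        outs = []
        for idx in range(self.depth - 1, -1, -1):
            feats = self.up[idx](feats + self.lateral[idx](skips[idx]))  # first residual connection
            outs.insert(0, feats)
            if idx > 0:
                feats = F.interpolate(feats, scale_factor=2.0)           # up-sample
        # outs[0] is the stage's output feature map; the full per-resolution
        # list feeds the next stage's down-sampling layers.
        return outs[0], outs

# Chaining stages: the output feature map of the current stage is the input
# of the following stage, and the last stage's output serves as the feature
# map corresponding to the image to be recognized.
stages = nn.ModuleList(DetourStage(64) for _ in range(3))
x = torch.randn(1, 64, 128, 128)   # assumed stem features of the image
cross = None
for stage in stages:
    x, cross = stage(x, cross)
```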
  • Step 330: Traverse the stages of the roundabout pyramid network, performing the processing described in steps 340 and 350 below.
  • Step 340: In the feature map extraction performed at the current stage, propagate features between the layers of the down-sampling network and the layers of the up-sampling network at the current stage via the first residual connection layer, to obtain the output feature map of the current stage.
  • As shown in FIG. 4, feature propagation between the layers of the down-sampling network 401 and the corresponding layers of the up-sampling network 402 is performed through the multiple layers of the first residual connection layer 403.
  • In this way, stage 0 of the roundabout pyramid network outputs the output feature map 406 of stage 0.
  • Step 350: Via the second residual connection layer, propagate features between the layers of the up-sampling network at the current stage and the layers of the down-sampling network at the following stage, so that the following stage can extract its corresponding feature map.
  • At stage 1, feature extraction by the layers of the down-sampling network 409, feature propagation through the first residual connection layer 411, and feature fusion by the layers of the up-sampling network 410 yield the output feature map 412 of stage 1.
  • After the traversal of all stages is complete, the output feature map of the last stage is used as the feature map corresponding to the image to be recognized.
  • In this process, feature propagation depends on the first and second residual connection layers rather than on convolution operations, which avoids the feature propagation bottleneck.
  • Moreover, with feature fusion performed in the up-sampling network within each stage, and with the output feature map of the current stage serving as the input of the following stage between stages, features of different resolutions and scales in the roundabout pyramid network are related to each other rather than isolated, which effectively improves the accuracy of pose recognition.
  • Step 360: Perform multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
  • Pose recognition based on the roundabout pyramid network thus not only meets the accuracy requirements of pose recognition, but also propagates features through convenient skip shortcuts, effectively solving the feature propagation bottleneck and improving the effectiveness of the features in the roundabout pyramid network.
  • In one embodiment, step 340 may include the following steps.
  • Step 331: Perform feature extraction on the input feature map of the current stage through the layers of the down-sampling network.
  • The down-sampling network includes several higher network layers and several lower network layers.
  • The input feature map is the input of the current stage, and the output feature map is obtained after the current stage's processing.
  • As shown in FIG. 6, the down-sampling network 501 includes lower network layers 5011 and 5012 and higher network layers 5013 and 5014.
  • The lower network layers extract local features, which can also be understood as semantic features, that is, accurate descriptions of key parts of the human body such as the eyes, nose, ears, mouth, shoulders, elbows, wrists, hip joints, knees, and ankles; the higher network layers extract global features, which are accurate descriptions of the human body contour.
  • Step 333: Transmit the extracted features from the layers of the down-sampling network to the layers of the up-sampling network through the first residual connection layer, and perform feature fusion in the layers of the up-sampling network to obtain the output feature map.
  • After all layers of the down-sampling network have completed feature extraction, feature fusion is completed through the layers of the up-sampling network. Specifically, in the up-sampling network, the following processing is performed for each layer in order from the highest to the lowest network layer: the feature received from the first residual connection layer is fused with the feature passed from the previous layer, up-sampling is performed on the fused feature, and the processed fused feature is passed to the next layer. The processed fused feature obtained at the last layer is used as the output feature map.
  • In the layers of the up-sampling network 501', the feature corresponding to the highest network layer 5014 is first up-sampled to obtain the feature 5022 to be fused.
  • The feature corresponding to the next-highest network layer 5013 is transmitted to up-sampling network layer 5031 via the first residual connection layer 5023, fused with the feature 5022, and the resulting fused feature is up-sampled.
  • This continues layer by layer: after further up-sampling the feature 5026 is obtained, and the feature corresponding to network layer 5011 is transmitted to up-sampling network layer 5033 via the first residual connection layer 5027, fused, and up-sampled again.
  • The processed fused feature obtained at the last layer 5034 is used as the output feature map of the current stage.
  • The resolution of the output feature map obtained through the above processing is only 1/2 of the resolution of the image to be recognized. According to the actual needs of different application scenarios, and to facilitate subsequent pose recognition, the output feature map of the current stage may also be interpolated so that its resolution matches that of the image to be recognized.
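  • As a hedged illustration of this resolution alignment (the tensor sizes below are assumptions, not values taken from the patent), the stage output at 1/2 resolution can be interpolated back to the input resolution:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 512, 512)        # image to be recognized (assumed size)
stage_out = torch.randn(1, 256, 256, 256)  # stage output at 1/2 resolution
aligned = F.interpolate(stage_out, size=image.shape[-2:],
                        mode='bilinear', align_corners=False)
assert aligned.shape[-2:] == image.shape[-2:]  # now matches the input resolution
```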
  • In one embodiment, after step 340, the method described above may further include the following steps.
  • Step 510: Perform pose pre-recognition on the feature map corresponding to the current stage to obtain an intermediate recognition result.
  • Step 530: Fuse the intermediate recognition result with the output feature map, and transmit the processed feature map to the following stage via the second residual connection layer.
  • That is, the feature map corresponding to each stage is subjected to intermediate supervision, to correct deviations in the intermediate stages of the pose recognition process.
  • Intermediate supervision essentially pre-recognizes the pose from the feature map corresponding to the current stage, so that the obtained intermediate recognition result is pushed close to the set intermediate supervision signal.
  • The intermediate supervision signal is set during the network training of the roundabout pyramid network; for example, it may refer to the loss value of a loss function.
  • As shown in FIG. 4, for stage 0, an intermediate recognition result 4072 is obtained from the feature map 4071 corresponding to stage 0.
  • The intermediate recognition result 4072 is constrained to be close to the given intermediate supervision signal 4074, and is then fused with the feature map 4071 corresponding to stage 0, as shown at 4073 in FIG. 4, finally forming the output feature map 406 of stage 0, which is used as the input feature map of stage 1.
  • In this way, the roundabout pyramid network can learn higher-level semantic features as early as possible, and as the traversal of the stages proceeds, the intermediate recognition results are continuously fused back into the roundabout pyramid network, which is thereby repeatedly optimized to make up for the shortcomings of the intermediate recognition results, correcting deviations in the intermediate stages of the pose recognition process and further guaranteeing the accuracy of pose recognition.
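  • A minimal sketch of this intermediate supervision follows, under common stacked-network conventions; the module names, the 1×1 projections, and the mean-squared-error loss are assumptions rather than the patent's specification:

```python
import torch.nn as nn

class IntermediateSupervision(nn.Module):
    def __init__(self, channels, num_keypoints):
        super().__init__()
        self.predict = nn.Conv2d(channels, num_keypoints, 1)  # pose pre-recognition
        self.remap = nn.Conv2d(num_keypoints, channels, 1)    # back to feature space

    def forward(self, feature_map, target=None):
        pred = self.predict(feature_map)                      # intermediate recognition result
        loss = None
        if target is not None:                                # intermediate supervision signal
            loss = nn.functional.mse_loss(pred, target)
        fused = feature_map + self.remap(pred)                # fuse the result back in
        return fused, pred, loss                              # fused map feeds the next stage
```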
  • The feature propagation process is the same via the first residual connection layer as via the second residual connection layer; the only difference is the processing layers connected on either side.
  • The first and second residual connection layers are therefore defined and explained together below, to better describe what their feature propagation processes have in common.
  • In one embodiment, the method described above may further include the following step: constructing propagation paths for the roundabout pyramid network.
  • The propagation paths include a path corresponding to each layer of the first residual connection layer and/or the second residual connection layer when feature propagation is performed through those layers.
  • As shown in FIG. 8, on each propagation path the feature to be propagated is first dimensionally compressed; that is, the dimensions of the input feature map are compressed from H × W × C_in to H × W × C_out/e, reducing the computational complexity on the propagation path and the amount of calculation during feature propagation.
  • The feature compression unit 601 includes a normalization layer (BN), an activation layer (ReLU), and a convolution layer (Conv 1×1) connected in sequence.
  • Each atrous convolution pyramid unit 602 includes a normalization layer (BN), an activation layer (ReLU), and either a convolution layer (Conv 1×1) or an atrous convolution layer (Atrous 3×3); the compressed feature is input to multiple such units in parallel, and their outputs are spliced together.
  • The feature expansion unit 604 performs dimension expansion on the spliced features, restoring the dimensions from H × W × C_out/e to H × W × C_out.
  • The feature expansion unit 604 includes a normalization layer (BN), an activation layer (ReLU), and a convolution layer (Conv 1×1) connected in sequence.
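  • A minimal sketch of one such propagation path, assuming a reduction factor e of 4 and dilation rates of 1, 2, and 4 (both are illustrative choices; the patent does not fix them here):

```python
import torch
import torch.nn as nn

def pre_act(c_in, c_out, k=1, dilation=1):
    # Pre-activation: BN and ReLU before the convolution.
    pad = dilation if k == 3 else 0
    return nn.Sequential(nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
                         nn.Conv2d(c_in, c_out, k, padding=pad, dilation=dilation))

class PropagationPath(nn.Module):
    def __init__(self, c_in, c_out, e=4, rates=(1, 2, 4)):
        super().__init__()
        c_mid = c_out // e
        self.compress = pre_act(c_in, c_mid)  # H x W x C_in -> H x W x C_out/e
        self.pyramid = nn.ModuleList(
            [pre_act(c_mid, c_mid)] +                                 # Conv 1x1 branch
            [pre_act(c_mid, c_mid, k=3, dilation=r) for r in rates])  # Atrous 3x3 branches
        self.expand = pre_act(c_mid * len(self.pyramid), c_out)       # splice -> H x W x C_out

    def forward(self, x):
        x = self.compress(x)                                          # feature compression unit
        x = torch.cat([branch(x) for branch in self.pyramid], dim=1)  # splicing unit
        return self.expand(x)                                         # feature expansion unit
```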
  • In this way, both the first residual connection layer and the second residual connection layer introduce the pre-activation technique (BN and ReLU before the convolution), which further helps to improve the accuracy of pose recognition.
  • The rapid propagation of features within the same stage and between different stages of the roundabout pyramid network through the propagation paths facilitates the extraction of each stage's corresponding feature map, reducing both the time required for pose recognition in the bottom-up scheme and the difficulty of learning features of the same scale. This also effectively improves the accuracy of pose recognition, so that the recognition accuracy in the embodiments of the present application reaches 70.2% or more, better than the 65.6% achievable by the stacked hourglass network proposed in the prior art.
  • As shown in FIG. 8, the propagation paths also include an inter-stage jump path 605.
  • That is, an inter-stage jump path is established between the stages of the roundabout pyramid network and added to the propagation paths.
  • Through the inter-stage jump path, the image to be recognized may undergo no operation at all, or only an original-scale convolution operation, before being merged into stage 0 of the roundabout pyramid network.
  • The inter-stage jump path can therefore be regarded as an identity mapping path, which ensures that the roundabout pyramid network is easy to train and reduces the difficulty of the network training process.
  • In one embodiment, step 360 may include the following steps.
  • Step 371: Locate human key points according to the feature map corresponding to the image to be recognized, obtaining several heat maps that identify the positions of the human key points; each heat map corresponds to one category of human key point.
  • Human key points refer to key positions of the human body, including the nose, shoulders, wrists, elbows, hip joints, knees, ankles, and so on.
  • A category refers to a type of human key point; for example, wrist key points and nose key points are considered to belong to different categories. For different categories, the human key points present in the image to be recognized, and their positions, differ.
  • The heat map corresponding to a category identifies the positions, in the image to be recognized, of the human key points of that category, and is obtained by locating the human key points on the feature map corresponding to the image to be recognized.
  • As shown in FIG. 10, the heat map 701 corresponding to the nose key point category identifies the positions of the nose key points 7011 of two different persons in the image to be recognized.
  • Likewise, as shown in FIG. 11, a heat map identifies the positions of the wrist key points 7021 of two different persons in the image to be recognized.
  • Human key point localization is implemented by a classifier based on the roundabout pyramid network; that is, the classifier is used to calculate the probability of human key points appearing at different positions in the image to be recognized.
  • Specifically, for each category, the probability of that category's human key points occurring at different positions in the image to be recognized is calculated.
  • From these probabilities, the heat map corresponding to the category is generated.
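  • Reading key point positions back out of such a heat map can be sketched as follows; the non-maximum-suppression window and confidence threshold are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, threshold=0.3):
    # heatmap: (H, W) tensor of per-pixel probabilities for one category.
    pooled = F.max_pool2d(heatmap[None, None], 3, stride=1, padding=1)[0, 0]
    is_peak = (heatmap == pooled) & (heatmap > threshold)  # keep local maxima only
    ys, xs = torch.nonzero(is_peak, as_tuple=True)
    # One (x, y, confidence) triple per detected key point of this category.
    return [(int(x), int(y), float(heatmap[y, x])) for x, y in zip(xs, ys)]
```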
  • Step 373: Group the human key points according to the feature map corresponding to the image to be recognized, obtaining several grouping maps that identify the grouping of the human key points; each grouping map corresponds to one category of human key point.
  • The grouping map corresponding to a category identifies the grouping of the human key points of that category.
  • The grouping of human key points is likewise implemented by a classifier based on the roundabout pyramid network; that is, the classifier is used to calculate the probability that human key points belong to different groups.
  • Specifically, for each category, the probability that the category's human key points belong to each group is calculated, and the group to which each key point belongs is determined from the calculated probabilities: the greater the calculated probability, the more likely it is that the key point belongs to that group.
  • For example, suppose the probability that a human key point of category A belongs to group B1 is P1, and the probability that it belongs to group B2 is P2. If P1 > P2, the human key point of category A belongs to group B1; otherwise, if P1 < P2, it belongs to group B2.
  • The image to be recognized is then marked according to the determined grouping to generate the grouping map corresponding to the category. That is, in the grouping map corresponding to the category, different marks indicate that the human key points belong to different groups, i.e., to different people.
  • A mark may be a color, a line style (for example, a dotted or solid line), etc.; this embodiment is not specifically limited in this regard.
  • Still taking as an example an image to be recognized that contains two persons (that is, two groups: a girl and a boy), as shown in FIG. 12, in the grouping map 701 the nose key point 7011 belonging to the girl is marked in gray, and the nose key point 7011 belonging to the boy is marked in black.
  • Similarly, as shown in FIG. 13, in the grouping map 702 the wrist key points 7021 belonging to the girl are marked in gray, and the wrist key points 7021 belonging to the boy are marked in black.
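  • The P1/P2 rule above generalizes directly to any number of groups; a tiny sketch (the function name and probabilities are illustrative):

```python
def assign_group(group_probs):
    # group_probs: one probability per candidate group (person), e.g. [P1, P2].
    # The key point joins the group with the greatest probability.
    return max(range(len(group_probs)), key=lambda g: group_probs[g])

print(assign_group([0.7, 0.3]))  # -> 0, i.e. the key point belongs to group B1
```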
  • It should be noted that steps 371 and 373 have no fixed execution order.
  • In some embodiments, the heat maps and the grouping maps are output simultaneously.
  • Step 375: According to the positions and groups of the human key points identified by the heat maps and the grouping maps, establish connections in the image to be recognized between the positions of human key points of the same group but different categories, to obtain the pose recognition result of the image to be recognized.
  • That is, once the position and group of each human key point are known, connections between human key points of different categories can be established, according to a set connection relationship, between their corresponding positions in the image to be recognized, thereby obtaining the pose recognition result of the image to be recognized.
  • For example, for human key points of the same group, such as the nose, shoulder, wrist, elbow, hip joint, knee, and ankle key points, connections are established between their positions in the image to be recognized, thereby obtaining the pose recognition result of the image to be recognized.
  • The pose recognition result reflects the connection relationships among the human key points of each person in the image to be recognized, and each person's pose is represented by those connection relationships.
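  • A sketch of this assembly step follows. The skeleton of category pairs shown here is a common pose-estimation convention, not necessarily the patent's exact connection relationship:

```python
# Assumed fixed connection relationship between key point categories.
SKELETON = [("nose", "shoulder"), ("shoulder", "elbow"), ("elbow", "wrist"),
            ("hip", "knee"), ("knee", "ankle")]

def build_poses(keypoints):
    # keypoints: dict mapping (category, group) -> (x, y) position, combining
    # the heat map output (positions) and grouping map output (groups).
    groups = {g for (_, g) in keypoints}
    poses = {}
    for g in groups:
        # Connect same-group key points of different categories along the skeleton.
        poses[g] = [(keypoints[(a, g)], keypoints[(b, g)])
                    for a, b in SKELETON
                    if (a, g) in keypoints and (b, g) in keypoints]
    return poses  # per person: the limb segments that represent the pose
```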
  • Through the above process, multi-person pose recognition based on the roundabout pyramid network can determine not only the positions in the image to be recognized of the human key points of different individuals, but also the different groups to which those human key points belong, which greatly improves the processing efficiency of pose recognition, especially multi-person pose recognition.
  • For the roundabout pyramid network, during the network training process, human key point localization information and human key point grouping information are used as supervision signals participating in the training, so as to ensure that, once training is complete, the roundabout pyramid network can locate and group human key points at the same time, guaranteeing the processing efficiency of pose recognition.
  • The human key point localization information relates to image samples annotated with the positions of human key points of different categories; the human key point grouping information relates to image samples annotated with the groups of human key points of different categories.
  • In one embodiment, after step 360, the method described above may further include the following steps.
  • Step 810: Recognize the action in the image to be recognized by matching the pose recognition result of the image to be recognized against a specified pose.
  • Step 830: Generate a corresponding interaction instruction according to the recognized action, and control the execution of a specified event through the interaction instruction.
  • For example, suppose the recognition terminal is a smart TV and the interaction terminal is a somatosensory device.
  • An interactive application, such as a two-person tennis somatosensory game client, that is, a pose recognition apparatus, is deployed on the smart TV.
  • When the interactive application runs on the smart TV, the tennis game scene is displayed to the user through the display screen configured on the smart TV.
  • As the user performs actions, multi-person pose recognition is performed on the collected image to be recognized. If the pose recognition result indicates that the user's pose matches the specified swing pose, it is recognized that the user has performed a swing action.
  • Through the above recognition, the interactive application can generate an interaction instruction indicating that the user has performed a swing action, thereby controlling the smart TV to execute the display event.
  • Specifically, the virtual character in the tennis game scene is controlled to perform the corresponding swing action according to the interaction instruction, thereby realizing somatosensory interaction between the user and the somatosensory device.
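  • A hedged sketch of steps 810 and 830; the joint-angle matching metric, tolerance, and instruction format here are assumptions for illustration only, not the patent's matching method:

```python
def match_pose(result_angles, template_angles, tolerance=15.0):
    # Compare joint angles (degrees) of the recognized pose against the
    # specified pose template.
    return all(abs(a - b) <= tolerance
               for a, b in zip(result_angles, template_angles))

def interact(result_angles, swing_template):
    # On a match, emit an interaction instruction that controls the
    # execution of the specified event (here, the display event).
    if match_pose(result_angles, swing_template):
        return {"action": "swing", "event": "display"}
    return None
```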
  • In the above application scenario, the pose recognition service provides a basis for interactive applications based on human poses, greatly enriching the user's entertainment experience.
  • The following is an apparatus embodiment of the present application, which can be used to execute the multi-person pose recognition method involved in the present application.
  • For details not disclosed in the apparatus embodiment, please refer to the method embodiments of the multi-person pose recognition method involved in the present application.
  • A multi-person pose recognition apparatus 900 includes, but is not limited to, an image acquisition module 910, a traversal module 930, and a pose recognition module 950.
  • The image acquisition module 910 is configured to acquire the image to be recognized.
  • The traversal module 930 is configured to construct a roundabout pyramid network, where the roundabout pyramid network includes several parallel stages, each stage includes the layers of a down-sampling network, the layers of an up-sampling network, and a first residual connection layer connected between them, and different stages are connected by a second residual connection layer.
  • The traversal module 930 is further configured to traverse the stages of the roundabout pyramid network, performing the following processing: in the feature map extraction performed at the current stage, propagating features between the layers of the down-sampling network and the layers of the up-sampling network at the current stage via the first residual connection layer to obtain the output feature map of the current stage; and propagating features between the layers of the up-sampling network at the current stage and the layers of the down-sampling network at the following stage via the second residual connection layer, so that the following stage can extract its corresponding feature map.
  • The traversal module 930 is further configured to, once the traversal of all stages of the roundabout pyramid network is complete, use the output feature map of the last stage as the feature map corresponding to the image to be recognized.
  • The pose recognition module 950 is configured to perform multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
  • In one embodiment, the traversal module 930 includes, but is not limited to, a feature extraction unit and a feature fusion unit.
  • The feature extraction unit is configured to perform feature extraction on the input feature map of the current stage through the layers of the down-sampling network.
  • The feature fusion unit is configured to transmit the extracted features from the layers of the down-sampling network to the layers of the up-sampling network through the first residual connection layer, and to perform feature fusion in the layers of the up-sampling network to obtain the output feature map.
  • In one embodiment, the down-sampling network includes several higher network layers and several lower network layers.
  • the feature extraction unit includes but is not limited to: a local feature extraction subunit and a global feature extraction subunit.
  • the local feature extraction subunit is used to extract several local features of the input feature map through the several network lower layers, and each local feature corresponds to a network lower layer.
  • the global feature extraction subunit is used to extract a plurality of global features of the input feature map through the plurality of network upper layers, and each global feature corresponds to a network upper layer.
  • the feature fusion unit includes but is not limited to: a fusion subunit and a feature map acquisition subunit.
  • the fusion sub-unit is used to perform the following processing for each layer in the order of network hierarchy from high to low in each layer of the up-sampling network: features received from the first residual connection layer It is fused with the features passed from the previous layer, up-sampling the fused features, and passing the processed fused features to the next layer.
  • the feature map acquisition subunit is configured to use the processed fusion feature obtained in the last layer as the output feature map.
  • the apparatus 900 further includes but is not limited to: a pre-recognition module and a result fusion module.
  • the pre-recognition module is configured to perform pose pre-recognition on the output feature map to obtain an intermediate recognition result.
  • the result fusion module is configured to fuse the intermediate recognition result with the output feature map, and to transmit the processed feature map to the following stage via the second residual connection layers.
  • the apparatus 900 further includes but is not limited to: a propagation path construction module, configured to construct a propagation path for the detour pyramid network, the propagation path including the path corresponding to each layer when feature propagation is performed via the layers of the first residual connection layers and/or the second residual connection layers.
  • the propagation path construction module includes but is not limited to: a feature compression unit, an atrous convolution unit, and a feature expansion unit.
  • the feature compression unit is configured to dimensionally compress the feature to be propagated.
  • the atrous convolution unit is configured to input the compressed feature to multiple parallel atrous convolution pyramid units, with feature concatenation performed by a concatenation unit.
  • the feature expansion unit is configured to dimensionally expand the concatenated features and restore the pre-compression feature dimension for propagation.
  • both the feature compression unit and the feature expansion unit include a normalization layer, an activation layer, and a convolution layer connected in sequence.
  • the propagation path construction module further includes but is not limited to: a jump path establishment unit.
  • the jump path establishment unit is configured to establish inter-stage jump paths between the stages of the detour pyramid network, and to add the inter-stage jump paths to the propagation path.
  • the pose recognition module 950 includes but is not limited to: a heatmap acquisition unit, a grouping map acquisition unit, and a keypoint position connection unit.
  • the heatmap acquisition unit is configured to localize human keypoints according to the feature map corresponding to the image to be recognized, and to obtain several heatmaps identifying the positions of the human keypoints.
  • the grouping map acquisition unit is configured to group human keypoints according to the feature map corresponding to the image to be recognized, and to obtain several grouping maps identifying the groups of the human keypoints, each grouping map corresponding to one category of human keypoints.
  • the keypoint position connection unit is configured to, according to the positions and groups of the human keypoints identified by the heatmaps and grouping maps, establish connections in the image to be recognized between the positions of human keypoints of the same group and different categories, to obtain the pose recognition result of the image to be recognized.
  • the heatmap acquisition unit includes but is not limited to: a position probability calculation subunit and a heatmap generation subunit.
  • the position probability calculation subunit is configured to calculate, for a category and according to the feature map corresponding to the image to be recognized, the probability that the category's human keypoints appear at different positions in the image to be recognized.
  • the heatmap generation subunit is configured to generate the heatmap corresponding to the category using the calculated probabilities as heat values.
  • the grouping map acquisition unit includes but is not limited to: a grouping probability calculation subunit, a group determination subunit, and a grouping map generation subunit.
  • the grouping probability calculation subunit is configured to calculate, for a category and according to the feature map corresponding to the image to be recognized, the probability that the category's human keypoints belong to different groups.
  • the group determination subunit is configured to determine the group to which the category's human keypoints belong according to the calculated probabilities.
  • the grouping map generation subunit is configured to mark the determined groups in the image to be recognized and generate the grouping map corresponding to the category.
  • the apparatus 900 further includes but is not limited to: an action recognition module and a control interaction module.
  • the action recognition module is configured to recognize the action in the image to be recognized by matching the pose recognition result of the image to be recognized against a specified pose.
  • the control interaction module is configured to generate a corresponding interaction instruction according to the recognized action, and to control the execution of a specified event through the interaction instruction.
  • when the multi-person pose recognition apparatus provided in the above embodiments performs multi-person pose recognition, the division into the functional modules above is only used as an example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above.
  • the multi-person pose recognition apparatus and the multi-person pose recognition method embodiments provided above belong to the same concept; the specific manner in which each module performs operations has been described in detail in the method embodiments and will not be repeated here.
  • an electronic device 1000 includes at least one processor 1001, at least one memory 1002, and at least one communication bus 1003.
  • the memory 1002 stores computer-readable instructions, and the processor 1001 reads the computer-readable instructions stored in the memory 1002 through the communication bus 1003.
  • a computer-readable storage medium has a computer program stored thereon which, when executed by a processor, implements the multi-person pose recognition method in the above embodiments.

Abstract

This application discloses a multi-person pose recognition method and apparatus, an electronic device, and a storage medium. The multi-person pose recognition method includes: acquiring an image to be recognized; constructing a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, different stages being connected through second residual connection layers; traversing the stages of the detour pyramid network, which includes performing the following processing: in the feature map extraction performed in the current stage, propagating features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage; propagating features, via the second residual connection layers, between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage; until the traversal of all stages of the detour pyramid network is completed, taking the output feature map of the last stage as the feature map corresponding to the image to be recognized; and performing multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.

Description

Multi-person pose recognition method and apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 201811275350.3, entitled "Multi-person pose recognition method, apparatus and electronic device", filed with the China National Intellectual Property Administration on October 30, 2018.
Technical Field
This application relates to the field of computer technology, and in particular to a pose recognition method and apparatus, an electronic device, and a storage medium.
Background
At present, multi-person pose recognition techniques fall into two schemes: top-down and bottom-up. The top-down scheme first detects every person in the image to be recognized in the form of bounding boxes, and then detects human keypoints for the person in each bounding box. The bottom-up scheme detects the human keypoints of all persons in the image to be recognized in one pass, while simultaneously determining to which person each human keypoint belongs. Compared with the top-down scheme, the bottom-up scheme is therefore more efficient in processing but insufficient in accuracy.
To this end, a multi-person pose recognition method based on a stacked hourglass network has been proposed to compensate for the insufficient accuracy of the bottom-up scheme. In such a stacked hourglass network, however, feature propagation relies on convolution operations, which creates a feature propagation bottleneck.
Summary
To solve the feature propagation bottleneck in the related art, embodiments of this application provide a multi-person pose recognition method and apparatus, an electronic device, and a storage medium.
The technical solutions adopted by this application are as follows:
In a first aspect, a multi-person pose recognition method includes:
acquiring an image to be recognized;
constructing a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, different stages being connected through second residual connection layers;
traversing the stages of the detour pyramid network, including performing the following processing:
in the feature map extraction performed in the current stage, propagating features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage;
propagating features, via the second residual connection layers, between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage;
until the traversal of all stages of the detour pyramid network is completed, taking the output feature map of the last stage as the feature map corresponding to the image to be recognized; and
performing multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
In a second aspect, a multi-person pose recognition apparatus includes:
an image acquisition module, configured to acquire an image to be recognized;
a traversal module, configured to construct a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, different stages being connected through second residual connection layers;
the traversal module being further configured to traverse the stages of the detour pyramid network, including performing the following processing: in the feature map extraction performed in the current stage, propagating features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage; and propagating features, via the second residual connection layers, between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage;
the traversal module being further configured to, until the traversal of all stages of the detour pyramid network is completed, take the output feature map of the last stage as the feature map corresponding to the image to be recognized; and
a pose recognition module, configured to perform multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
In a third aspect, an electronic device includes a processor and a memory, the memory storing computer-readable instructions that, when executed by the processor, implement the multi-person pose recognition method described above.
In a fourth aspect, a computer-readable storage medium has a computer program stored thereon which, when executed by a processor, implements the multi-person pose recognition method described above.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of this application.
FIG. 1 is a schematic diagram of an implementation environment involved in this application.
FIG. 2 is a block diagram of the hardware structure of an electronic device according to an exemplary embodiment.
FIG. 3 is a flowchart of a multi-person pose recognition method according to an exemplary embodiment.
FIG. 4 is a schematic structural diagram of the detour pyramid network involved in the embodiment corresponding to FIG. 3.
FIG. 5 is a flowchart of one embodiment of step 340 in the embodiment corresponding to FIG. 3.
FIG. 6 is a schematic structural diagram of the current stage in the detour pyramid network involved in the embodiment corresponding to FIG. 5.
FIG. 7 is a flowchart of another multi-person pose recognition method according to an exemplary embodiment.
FIG. 8 is a schematic structural diagram of the propagation path constructed for the detour pyramid network involved in the embodiment corresponding to FIG. 7.
FIG. 9 is a flowchart of one embodiment of step 360 in the embodiment corresponding to FIG. 3.
FIG. 10 is a schematic diagram of a heatmap identifying nose keypoint positions involved in the embodiment corresponding to FIG. 9.
FIG. 11 is a schematic diagram of a heatmap identifying wrist keypoint positions involved in the embodiment corresponding to FIG. 9.
FIG. 12 is a schematic diagram of a grouping map identifying nose keypoint groups involved in the embodiment corresponding to FIG. 9.
FIG. 13 is a schematic diagram of a grouping map identifying wrist keypoint groups involved in the embodiment corresponding to FIG. 9.
FIG. 14 is a schematic diagram of the pose recognition result of the image to be recognized involved in the embodiment corresponding to FIG. 9.
FIG. 15 is a flowchart of another multi-person pose recognition method according to an exemplary embodiment.
FIG. 16 is a block diagram of a multi-person pose recognition apparatus according to an exemplary embodiment.
FIG. 17 is a block diagram of an electronic device according to an exemplary embodiment.
The foregoing drawings show explicit embodiments of this application, which are described in more detail below. These drawings and the textual description are not intended to limit the scope of the concept of this application in any way, but rather to explain the concept of this application to those skilled in the art by reference to specific embodiments.
Description of Embodiments
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
Embodiments of this application provide a multi-person pose recognition method that resolves the feature propagation bottleneck and thereby effectively improves pose recognition accuracy. Accordingly, the method is applicable to a multi-person pose recognition apparatus deployed in an electronic device with a von Neumann architecture; for example, the electronic device may be a user terminal, a server, or the like.
FIG. 1 is a schematic diagram of the implementation environment involved in a pose recognition method. The implementation environment includes a recognition end 110 and an interaction end 130.
The recognition end 110 may be a desktop computer, a laptop, a tablet, a smartphone, a handheld computer, a personal digital assistant, or any other electronic device on which a pose recognition apparatus can be deployed, for example a server that provides a pose recognition service for users; no limitation is imposed here.
The interaction end 130 refers to electronic devices, such as somatosensory devices and smart home devices, that can realize somatosensory interaction with users. The interaction end 130 is deployed on the same gateway as the recognition end 110 through communication means such as 2G/3G/4G/5G or Wi-Fi, so as to facilitate somatosensory interaction between the user and the interaction end 130.
After the recognition end 110 acquires the image to be recognized, it can perform pose recognition on the image by means of the detour pyramid network to obtain a pose recognition result of the image to be recognized.
It is worth mentioning that, when the recognition end 110 is a server, the image to be recognized acquired by the server may come from a camera device deployed in the environment where the user performs actions; the camera device can capture images or videos of the user performing actions in real time and upload them to the server.
Further, through the pose recognition result of the image to be recognized, the action in the image is recognized so as to generate a corresponding interaction instruction, and the execution of a specified event is then controlled through the interaction instruction.
For example, if the interaction end 130 is a smart speaker, then as the recognition end 110 interacts with the smart speaker, the smart speaker receives interaction instructions and executes specified events according to them. For instance, if the specified event is a start-up event, the smart speaker is started for the user when the action performed by the user matches the specified pose.
Of course, according to the actual needs of the application scenario, in another implementation environment the pose recognition apparatus may also be deployed directly on the interaction end 130; that is, the interaction end 130 also serves as the recognition end.
Specifically, after acquiring the image to be recognized, the interaction end 130 performs pose recognition on the image and then executes a specified event according to the pose recognition result. For example, if the interaction end 130 is a dance machine, it recognizes whether a series of actions performed by the user match specified dance moves, thereby determining whether the user performed the series of specified dance moves in order, generates an interaction instruction accordingly, and executes a scoring event according to the interaction instruction, that is, scores the actions performed by the user.
FIG. 2 is a block diagram of the hardware structure of an electronic device according to an exemplary embodiment. This electronic device is suitable for the recognition end 110 in the implementation environment shown in FIG. 1, and may be a user terminal such as a desktop computer, laptop, tablet, handheld computer, personal digital assistant, smartphone, or wearable device, or a server-side device such as a server.
It should be noted that this electronic device is merely an example adapted to this application and must not be considered as providing any limitation on the scope of use of this application. Nor should it be interpreted as needing to depend on, or having to include, one or more components of the exemplary electronic device 200 shown in FIG. 2.
The hardware structure of the electronic device 200 may vary considerably depending on configuration or performance. As shown in FIG. 2, the electronic device 200 includes a power supply 210, an interface 230, at least one memory 250, at least one central processing unit (CPU) 270, and a camera component 290.
Specifically, the power supply 210 provides the operating voltage for the components of the electronic device 200.
The interface 230 includes at least one wired or wireless network interface 231, at least one serial-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, and the like, for communicating with external devices, for example interacting with the interaction end 130 in the implementation environment shown in FIG. 1.
The memory 250, as a carrier of resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like. The resources stored thereon include an operating system 251, application programs 253, data 255, and so on, and the storage mode may be transient or permanent.
The operating system 251 is used to manage and control the components of the electronic device 200 and the application programs 253, so as to realize the computation and processing of the massive data 255 by the central processing unit 270; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
An application program 253 is a computer program that performs at least one specific task on top of the operating system 251 and may include at least one module (not shown in FIG. 2), each of which may contain a series of computer-readable instructions for the electronic device 200. For example, the multi-person pose recognition apparatus can be regarded as an application program 253 deployed on the electronic device 200 to implement the multi-person pose recognition method.
The data 255 may be photos or pictures, or may be the image to be recognized, stored in the memory 250.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 through a communication bus, to read the computer-readable instructions stored in the memory 250 and thereby realize the computation and processing of the massive data 255 in the memory 250. For example, the multi-person pose recognition method is completed by the central processing unit 270 reading a series of computer-readable instructions stored in the memory 250.
The camera component 290, for example a camera, is used to capture images or videos. The captured images or videos may be stored in the memory 250 and may also be communicated to external devices through the interface 230, for example to capture in real time images or videos of the user performing actions.
In addition, this application can equally be implemented by hardware circuits or by hardware circuits combined with software; therefore, implementing this application is not limited to any specific hardware circuit, software, or combination of the two.
Referring to FIG. 3, in an exemplary embodiment, a multi-person pose recognition method is applied to an electronic device, for example the recognition end of the implementation environment shown in FIG. 1, whose structure may be as shown in FIG. 2.
This multi-person pose recognition method may be executed by the recognition end, which may also be understood as execution by the multi-person pose recognition apparatus deployed on the recognition end. In the following method embodiments, for ease of description, the executing subject of each step is described as the multi-person pose recognition apparatus, but this does not constitute a specific limitation.
The multi-person pose recognition method may include the following steps:
Step 310: acquire an image to be recognized.
The image to be recognized is generated by photographing multiple persons, so that multi-person pose recognition can subsequently be performed on the image containing multiple persons.
The image to be recognized may come from an image captured by the recognition end in real time, for example when the recognition end is a smartphone equipped with a camera; or it may be an image pre-stored on the recognition end, for example when the recognition end is a server, obtained by local reading or network transmission.
In other words, the multi-person pose recognition apparatus deployed on the recognition end may acquire images to be recognized captured in real time, so as to perform multi-person pose recognition on them in real time; it may also acquire images to be recognized captured within a historical period, so as to perform multi-person pose recognition when there are fewer processing tasks, or under the instruction of an operator. This embodiment imposes no specific limitation on this.
Further, regarding the camera component configured on the recognition end: if the camera component can serve as an independent device, for example a camera or video recorder, it may be deployed around the environment where the multiple persons are located, so as to photograph them from different angles and thereby obtain images to be recognized reflecting the multiple persons from different angles, which helps ensure the accuracy of subsequent pose recognition.
It should be noted that shooting may be a single shot or continuous shooting. Accordingly, for a single shot, what is obtained is the image to be recognized, that is, one picture; for continuous shooting, what is obtained is a video containing several images to be recognized. Therefore, in the embodiments of this application, the image on which multi-person pose recognition is performed may be a single picture taken in one shot, or a particular image in a continuously shot video; this application imposes no specific limitation on this.
Step 320: construct a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, with different stages connected through second residual connection layers.
The detour pyramid network comprises several stages connected in parallel in a "detour" fashion and is used to extract the feature map corresponding to each stage.
Specifically, each stage includes the layers of a downsampling network and the layers of an upsampling network. The layers of the downsampling network perform downsampling to obtain lower-resolution features while reducing the computational complexity of pose recognition. The layers of the upsampling network perform upsampling to progressively increase feature resolution, which helps guarantee the accuracy of pose recognition.
In each stage, first residual connection layers are established between the layers of the downsampling network and the layers of the upsampling network, so that features can be propagated between them within the stage; that is, the features extracted by the layers of the downsampling network are transmitted through the first residual connection layers to the layers of the upsampling network, where feature fusion is further performed to obtain the feature map corresponding to each stage.
As shown in FIG. 4, the detour pyramid network includes stage 0, stage 1, stage 2, and so on.
Taking stage 0 as an example, stage 0 includes the layers of a downsampling network 401 and the layers of an upsampling network 402, which, in order from lower to higher network level, are network layers 4051, 4052, 4053, and 4054.
Between different stages, second residual connection layers are established between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so that features can be propagated between stages. The following stage can then extract its corresponding feature map based on this feature propagation.
As shown in FIG. 4, features are propagated between the layers of the upsampling network 402 in stage 0 and the layers of the downsampling network 409 in stage 1 through several second residual connection layers 404.
Step 330: traverse the stages of the detour pyramid network, which includes performing the processing described in steps 340 and 350 below.
Step 340: in the feature map extraction performed in the current stage, propagate features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage.
As shown in FIG. 4, the corresponding features are propagated between the layers of the downsampling network 401 and the layers of the upsampling network 402 through the layers of the first residual connection layer 403.
Thus, via stage 0 of the detour pyramid network, the output feature map 406 of stage 0 can be output.
Step 350: via the second residual connection layers, propagate features between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage.
In stage 1, the output feature map 412 of stage 1 is then obtained through feature extraction by the layers of the downsampling network 409, feature propagation through the first residual connection layer 411, and feature fusion by the layers of the upsampling network 410.
By traversing the stages of the detour pyramid network, the feature map corresponding to each stage is obtained accordingly.
This continues until the traversal of all stages of the detour pyramid network is completed, and the output feature map of the last stage is taken as the feature map corresponding to the image to be recognized.
As can be seen from the above, in the detour pyramid network feature propagation relies on the first and second residual connection layers rather than on convolution operations, thereby avoiding the feature propagation bottleneck.
In addition, the feature fusion performed by the upsampling network within a stage, and the use of the current stage's output feature map as the input of the following stage between stages, mean that features of different resolutions and scales in the detour pyramid network are correlated with one another rather than isolated, which effectively improves the accuracy of pose recognition.
Step 360: perform multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
Through the above process, pose recognition based on the detour pyramid network not only meets the accuracy requirements of pose recognition; feature propagation is also carried out through convenient jump shortcuts, which effectively solves the feature propagation bottleneck and improves the effectiveness of feature propagation in the detour pyramid network.
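By way of illustration only, and not as the patented implementation, the stage traversal described above might be sketched in PyTorch roughly as follows. The module names (DetourStage, DetourPyramid), the layer depth, and the channel width are assumptions of the sketch, and the layer-wise second residual connections are collapsed into a single bridging 1×1 convolution per stage for brevity:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DetourStage(nn.Module):
        # one stage: downsampling layers, upsampling path, and first
        # residual connections between corresponding layers
        def __init__(self, channels: int, depth: int = 4):
            super().__init__()
            self.down = nn.ModuleList(
                [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                 for _ in range(depth)])
            self.lateral = nn.ModuleList(      # first residual connections
                [nn.Conv2d(channels, channels, 1) for _ in range(depth)])

        def forward(self, x):
            feats = []
            for down in self.down:             # feature extraction; deeper
                x = F.relu(down(x))            # layers see coarser scales
                feats.append(x)
            out = self.lateral[-1](feats[-1])
            for i in reversed(range(len(feats) - 1)):
                out = F.interpolate(out, size=feats[i].shape[-2:])  # upsample
                out = out + self.lateral[i](feats[i])               # fuse
            return F.interpolate(out, scale_factor=2.0)  # back to input size

    class DetourPyramid(nn.Module):
        def __init__(self, channels: int = 64, num_stages: int = 3):
            super().__init__()
            self.stages = nn.ModuleList(
                [DetourStage(channels) for _ in range(num_stages)])
            self.bridges = nn.ModuleList(      # second residual connections,
                [nn.Conv2d(channels, channels, 1)  # one bridge per stage pair
                 for _ in range(num_stages - 1)])

        def forward(self, x):
            out = x
            for i, stage in enumerate(self.stages):  # traverse the stages
                out = stage(x)
                if i < len(self.bridges):
                    x = x + self.bridges[i](out)     # propagate to next stage
            return out               # output feature map of the last stage

    features = DetourPyramid()(torch.randn(1, 64, 128, 128))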
Referring to FIG. 5, in an exemplary embodiment, with the traversed stage as the current stage, step 340 may include the following steps:
Step 331: perform feature extraction on the input feature map of the current stage through the layers of the downsampling network.
The downsampling network includes several higher network layers and several lower network layers.
With reference to FIG. 6, the feature extraction process performed in the current stage is explained. The input to the current stage is the input feature map; after the current stage finishes processing, the output feature map is obtained.
As shown in FIG. 6, the current stage includes a downsampling network 501 and an upsampling network 501'. The downsampling network 501 includes lower network layers 5011 and 5012 and higher network layers 5013 and 5014.
Several local features of the input feature map are extracted through the lower network layers 5011 and 5012 of the downsampling network 501, each local feature corresponding to one lower network layer.
Global features of the input feature map are extracted through the higher network layers 5013 and 5014 of the downsampling network 501, each global feature corresponding to one higher network layer.
That is, in the current stage, as the network level deepens, the feature extraction performed on the image to be recognized gradually abstracts from local feature descriptions to global feature descriptions, thereby describing the image to be recognized more accurately and helping improve the accuracy of pose recognition.
Taking an image to be recognized containing multiple persons as an example: local features, which may also be understood as semantic features, accurately describe key human body parts such as the eyes, nose, ears, mouth, shoulders, elbows, wrists, hip joints, knees, and ankles, while global features accurately describe the human body contour.
Step 333: through the first residual connection layers, transmit the extracted features from the layers of the downsampling network to the layers of the upsampling network, and perform feature fusion at the layers of the upsampling network to obtain the output feature map.
After feature extraction is completed at the layers of the downsampling network, feature fusion needs to be completed at the layers of the upsampling network. Specifically, among the layers of the upsampling network, in order from the highest network level to the lowest, the following processing is performed for each layer: the feature received from the first residual connection layer is fused with the feature passed down from the layer above, the fused feature is upsampled, and the processed fused feature is passed to the next layer; the processed fused feature obtained at the last layer is taken as the output feature map.
The feature fusion process in the current stage is described below with reference to FIG. 6.
Features are propagated between the layers of the downsampling network 501 and the layers of the upsampling network 501' in the current stage via the first residual connection layers 5021, 5023, 5025, and 5027, respectively.
Among the layers of the upsampling network 501', the feature corresponding to the highest network layer 5014 is upsampled to obtain the feature to be fused 5022.
After the feature corresponding to the second-highest network layer 5013 is transmitted via the first residual connection layer 5023 to upsampling network layer 5031, it is fused with the feature to be fused 5022; the resulting fused feature is upsampled, completing the update of the fused feature, and the fusion-processed feature 5024 is then passed to the next layer 5032.
Similarly, the updated fused feature 5032 is upsampled to obtain feature 5026, which is fused with the feature corresponding to network layer 5011 after the latter is transmitted via the first residual connection layer 5027 to upsampling network layer 5033, and upsampling is then performed.
This continues until the traversal of the features corresponding to the remaining network layers is completed, and the processed fused feature obtained at the last layer 5034 is taken as the output feature map of the current stage.
It is worth mentioning that the resolution of the output feature map of the current stage obtained through the above processing is only 1/2 of the resolution of the image to be recognized. According to the actual needs of different application scenarios, and to facilitate subsequent pose recognition, the feature map corresponding to the current stage also needs to be interpolated so that the resolution of the output feature map of the current stage matches the resolution of the image to be recognized.
Through the above embodiment, the repeated upsampling and downsampling not only reduces the computational complexity of pose recognition but also enlarges the network's receptive field, fully guaranteeing the accuracy of pose recognition.
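For illustration, the top-down fusion order just described can be written out as the following minimal sketch, assuming down_feats is ordered from the lowest network layer to the highest and lateral holds the first residual connection layers as 1×1 convolutions; the final interpolation restores the resolution of the image to be recognized, as noted above:

    from typing import List
    import torch
    import torch.nn.functional as F

    def fuse_top_down(down_feats: List[torch.Tensor],
                      lateral: List[torch.nn.Module]) -> torch.Tensor:
        # start from the highest network layer (coarsest resolution)
        fused = lateral[-1](down_feats[-1])
        for i in reversed(range(len(down_feats) - 1)):
            # upsample the fused feature passed down from the layer above
            fused = F.interpolate(fused, size=down_feats[i].shape[-2:])
            # fuse it with the feature received over the first residual
            # connection from the corresponding downsampling layer
            fused = fused + lateral[i](down_feats[i])
        # the stage output is at 1/2 the input resolution, so interpolate
        # once more to match the resolution of the image to be recognized
        return F.interpolate(fused, scale_factor=2.0)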
Referring to FIG. 7, in an exemplary embodiment, after step 340 the method described above may further include the following steps:
Step 510: perform pose pre-recognition on the feature map corresponding to the current stage to obtain an intermediate recognition result.
Step 530: fuse the intermediate recognition result with the output feature map, and transmit the processed feature map to the following stage via the second residual connection layers.
To enable the detour pyramid network to learn higher-level semantic features as early as possible, in this embodiment intermediate supervision is applied to the feature map corresponding to each stage, thereby correcting deviations in the intermediate stages of the pose recognition process.
Intermediate supervision is, in essence, pose pre-recognition performed on the feature map corresponding to the current stage, so that the obtained intermediate recognition result approaches a set intermediate supervision signal. The intermediate supervision signal is set during the network training of the detour pyramid network; for example, the intermediate supervision signal may be the loss value of a loss function.
Referring back to FIG. 4, the intermediate prediction process is illustrated with stage 0 as the current stage.
As shown in FIG. 4, assume that after feature fusion at the layers of the upsampling network 402, the feature map 4071 corresponding to stage 0 is initially obtained.
Through pose pre-recognition, the intermediate recognition result 4072 is further obtained; by comparison with the intermediate supervision signal 4074, the intermediate recognition result 4072 is constrained to approach the given intermediate supervision signal 4074, and the intermediate recognition result 4072 is then fused with the feature map 4071 corresponding to stage 0, as shown at 4073 in FIG. 4, finally forming the output feature map 406 of stage 0, which serves as the input feature map of stage 1.
Under the effect of the above embodiment, together with intermediate supervision, the detour pyramid network can learn higher-level semantic features as early as possible; as the stages are traversed, the intermediate recognition results are continually fused into the detour pyramid network, repeatedly optimizing it, thereby compensating for the deficiencies of the intermediate recognition results and for deviations in the intermediate stages of the pose recognition process, further fully guaranteeing the accuracy of pose recognition.
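A minimal sketch of such intermediate supervision follows, under the assumption that the pre-recognition head and the remapping of the intermediate result back into the feature space are 1×1 convolutions and that the supervision signal is a mean-squared-error loss; both are illustrative choices, not the patented design:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IntermediateSupervision(nn.Module):
        def __init__(self, channels: int, num_keypoints: int):
            super().__init__()
            self.pre_recognize = nn.Conv2d(channels, num_keypoints, 1)
            self.remap = nn.Conv2d(num_keypoints, channels, 1)

        def forward(self, feature_map, target=None):
            # pose pre-recognition on the current stage's feature map
            intermediate = self.pre_recognize(feature_map)
            # constrain the intermediate result toward the supervision signal
            loss = F.mse_loss(intermediate, target) if target is not None else None
            # fuse the intermediate result back into the output feature map,
            # which then serves as the input of the following stage
            fused = feature_map + self.remap(intermediate)
            return fused, loss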
In an exemplary embodiment, it should be understood that the feature propagation carried out via the first residual connection layers and the second residual connection layers proceeds in the same manner; the difference lies only in the processing layers connected on their two sides. Therefore, the following definitions are given for the first and second residual connection layers, so as to better describe below what the feature propagation processes have in common.
Accordingly, the method described above may further include the following step: constructing a propagation path for the detour pyramid network, the propagation path including the path corresponding to each layer when feature propagation is performed via the layers of the first residual connection layers and/or the second residual connection layers.
The feature propagation process along the propagation path is explained with reference to FIG. 8.
Specifically, the feature to be propagated is dimensionally compressed through a feature compression unit 601; that is, the dimensions of the input feature map are compressed from H×W×C_in to H×W×C_out/e, so as to reduce the computational complexity along the propagation path and the amount of computation during feature propagation. The feature compression unit 601 includes a normalization layer (BN), an activation layer (ReLU), and a convolution layer (Conv 1×1) connected in sequence.
The compressed feature is input to multiple (for example, four) parallel atrous convolution pyramid units 602, and feature concatenation is performed by a concatenation unit 603, so that while the network's receptive field is enlarged, feature loss during propagation is avoided, effectively guaranteeing the effectiveness of feature propagation and avoiding the feature propagation bottleneck. Each atrous convolution pyramid unit 602 includes a normalization layer (BN), an activation layer (ReLU), and a convolution layer (Conv 1×1) or atrous convolution layer (Atrous 3×3).
The concatenated features are dimensionally expanded via a feature expansion unit 604, restored from dimension H×W×C_out/e to the pre-compression feature dimension H×W×C_out. The feature expansion unit 604 includes a normalization layer (BN), an activation layer (ReLU), and a convolution layer (Conv 1×1) connected in sequence.
It is worth mentioning that in the detour pyramid network, both the first residual connection layers and the second residual connection layers adopt the pre-activation technique, which further helps improve the accuracy of pose recognition.
Through the above process, the propagation path enables fast propagation of features within a stage and between stages of the detour pyramid network, facilitating the extraction of the feature map corresponding to each stage. This not only reduces the difficulty of learning same-scale features during pose recognition in the bottom-up scheme, but also effectively improves pose recognition accuracy, bringing the pose recognition accuracy in the embodiments of this application above 70.2%, better than the 65.6% achievable by the stacked hourglass network proposed in the prior art.
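For illustration, the propagation path just described (dimension compression, a parallel atrous convolution pyramid, concatenation, and dimension expansion, each built from pre-activated BN, ReLU, and convolution units) might be sketched as follows; the reduction factor e = 4 and the dilation rates are assumed values:

    import torch
    import torch.nn as nn

    def pre_act_conv(in_ch, out_ch, kernel=1, dilation=1):
        # BN -> ReLU -> Conv: the pre-activation unit used throughout
        padding = dilation if kernel == 3 else 0
        return nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel, padding=padding, dilation=dilation))

    class PropagationPath(nn.Module):
        def __init__(self, c_in: int, c_out: int, e: int = 4):
            super().__init__()
            mid = c_out // e
            self.compress = pre_act_conv(c_in, mid)     # H x W x C_in -> C_out/e
            self.pyramid = nn.ModuleList([              # four parallel branches
                pre_act_conv(mid, mid, kernel=1),
                pre_act_conv(mid, mid, kernel=3, dilation=1),
                pre_act_conv(mid, mid, kernel=3, dilation=2),
                pre_act_conv(mid, mid, kernel=3, dilation=4)])
            self.expand = pre_act_conv(4 * mid, c_out)  # restore H x W x C_out

        def forward(self, x):
            z = self.compress(x)                        # dimension compression
            z = torch.cat([branch(z) for branch in self.pyramid], dim=1)
            return self.expand(z)                       # dimension expansion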
Further, as shown in FIG. 8, the propagation path also includes an inter-stage jump path 605.
Specifically, inter-stage jump paths are established between the stages of the detour pyramid network and added to the propagation path.
Referring back to FIG. 4, in stage 0, through the inter-stage jump path 408, the image to be recognized can be fused into stage 0 of the detour pyramid network without undergoing any operation, or after only a convolution at the original scale.
In other words, the inter-stage jump path can be regarded as an identity mapping path, which ensures that the detour pyramid network is easy to train during network training and reduces the difficulty of the training process.
Referring to FIG. 9, in an exemplary embodiment, step 360 may include the following steps:
Step 371: perform human keypoint localization according to the feature map corresponding to the image to be recognized, to obtain several heatmaps identifying the positions of human keypoints, each heatmap corresponding to one category of human keypoints.
Human keypoints refer to key positions of the human body, including the nose, shoulders, wrists, elbows, hip joints, knees, ankles, and other key positions. Accordingly, a category refers to a kind of human keypoint; for example, wrist keypoints and nose keypoints are regarded as belonging to different categories. For different categories, then, the human keypoints present in the image to be recognized and their positions differ.
Thus, a heatmap corresponding to a category identifies the positions of that category's human keypoints in the image to be recognized, and is obtained by performing human keypoint localization on the feature map corresponding to the image to be recognized.
Taking an image to be recognized containing two persons as an example, as shown in FIG. 10, the heatmap 701 corresponding to the nose keypoint category identifies the positions of the nose keypoints 7011 of the two different persons in the image to be recognized.
As shown in FIG. 11, the heatmap 702 for the wrist keypoint category identifies the positions of the wrist keypoints 7021 of the two different persons in the image to be recognized.
In one embodiment, human keypoint localization is realized by a classifier implemented by the detour pyramid network; that is, the classifier is used to compute the probability that a human keypoint appears at different positions in the image to be recognized.
Specifically, for a category, the probability that the category's human keypoints appear at different positions in the image to be recognized is calculated according to the feature map corresponding to the image to be recognized, and the calculated probabilities are used as heat values to generate the heatmap corresponding to the category.
That is, the larger the heat value of a position in the heatmap, the higher the probability that a human keypoint of that category appears at the corresponding position in the image to be recognized.
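A minimal sketch of this localization step, assuming a 1×1 convolutional classifier head over the feature map and a sigmoid to turn scores into per-position probabilities (both illustrative choices only):

    import torch
    import torch.nn as nn

    num_categories = 14          # nose, shoulders, wrists, ... (illustrative)
    classifier = nn.Conv2d(64, num_categories, kernel_size=1)

    feature_map = torch.randn(1, 64, 128, 128)          # stand-in feature map
    heatmaps = torch.sigmoid(classifier(feature_map))   # probability per position
    # the position with the largest heat value is the most likely location
    # of a keypoint of that category
    flat = heatmaps.flatten(2)            # (batch, category, H * W)
    peak_positions = flat.argmax(dim=2)   # flat index of the hottest cell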
Step 373: perform human keypoint grouping according to the feature map corresponding to the image to be recognized, to obtain several grouping maps identifying the groups of human keypoints, each grouping map corresponding to one category of human keypoints.
A grouping map corresponding to a category identifies the groups to which that category's human keypoints belong.
In one embodiment, human keypoint grouping is also realized by the classifier implemented by the detour pyramid network; that is, the classifier is used to compute the probability that a human keypoint belongs to different groups.
Specifically, for a category, the probability that the category's human keypoints belong to different groups is calculated according to the feature map corresponding to the image to be recognized.
The group to which the category's human keypoints belong is determined according to the calculated probabilities; that is, the larger the calculated probability, the more likely it is that the category's human keypoint belongs to that group. For example, if the probability that a category-A human keypoint belongs to group B1 is P1 and the probability that it belongs to group B2 is P2, then P1 > P2 indicates that the category-A human keypoint belongs to group B1; conversely, P1 < P2 indicates that it belongs to group B2.
Marks are made in the image to be recognized according to the determined groups, generating the grouping map corresponding to the category. That is, in the grouping map corresponding to a category, different marks indicate that human keypoints belong to different groups, i.e., to different persons in the grouping map. A mark may be a color, a line style (for example dashed or solid), or the like; this embodiment imposes no specific limitation here.
Still taking an image to be recognized containing two persons (i.e., two groups: a girl and a boy) as an example, as shown in FIG. 12, the nose keypoint 7011 belonging to the girl in the grouping map 701 is marked in gray, while the nose keypoint 7011 belonging to the boy in the grouping map 701 is marked in black.
As shown in FIG. 13, the wrist keypoint 7021 belonging to the girl in the grouping map 702 is marked in gray, while the wrist keypoint 7021 belonging to the boy in the grouping map 702 is marked in black.
It should be noted that steps 371 and 373 have no order of execution; for the detour pyramid network, the heatmaps and grouping maps are output simultaneously.
Step 375: according to the positions and groups of human keypoints respectively identified by the several heatmaps and several grouping maps, establish connections between the positions of human keypoints of the same group and different categories in the image to be recognized, to obtain the pose recognition result of the image to be recognized.
After the heatmaps and grouping maps are obtained, human keypoints belonging to the same group, that is, to the same person, but to different categories can be connected at their corresponding positions in the image to be recognized according to a set connection relationship, thereby obtaining the pose recognition result of the image to be recognized.
For example, as shown in FIG. 14, for each person, connections are established between the positions in the image to be recognized of human keypoints such as the nose, shoulder, wrist, elbow, hip, knee, and ankle keypoints, thereby obtaining the pose recognition result of the image to be recognized.
It can also be understood that the pose recognition result reflects the connection relationships among the human keypoints of each person in the image to be recognized, and these connection relationships represent the pose of the corresponding human body.
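Purely as an illustration of how heatmaps and grouping maps combine into a pose, the following sketch places each keypoint category at its hottest position, reads off its group from the grouping probabilities at that position, and connects same-group keypoints along a fixed skeleton. The single-peak simplification handles one keypoint per category; a multi-person implementation would take several local maxima per heatmap. All names and the two-bone skeleton are illustrative:

    import torch

    def assemble_pose(heatmaps, group_maps, skeleton):
        # heatmaps:   (K, H, W)    per-position keypoint probabilities
        # group_maps: (K, G, H, W) per-position probabilities of belonging
        #                          to each of G groups (persons)
        K, H, W = heatmaps.shape
        positions, groups = [], []
        for k in range(K):
            flat_idx = heatmaps[k].flatten().argmax().item()
            y, x = divmod(flat_idx, W)
            positions.append((x, y))
            groups.append(group_maps[k, :, y, x].argmax().item())
        # connect keypoints of the same group and different categories
        return [(positions[a], positions[b]) for a, b in skeleton
                if groups[a] == groups[b]]

    # hypothetical 3-category skeleton: 0-nose, 1-shoulder, 2-elbow
    limbs = assemble_pose(torch.rand(3, 64, 64), torch.rand(3, 2, 64, 64),
                          skeleton=[(0, 1), (1, 2)])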
Through the above process, combining heatmaps and grouping maps, multi-person pose recognition based on the detour pyramid network can determine not only the positions of different persons' human keypoints in the image to be recognized but also, at the same time, the different groups to which those keypoints belong, which greatly improves the processing efficiency of pose recognition, especially multi-person pose recognition.
It is added here that, during network training, the detour pyramid network uses human keypoint localization information and human keypoint grouping information as supervision signals participating in training, thereby ensuring that after training the network can simultaneously localize and group human keypoints, guaranteeing the processing efficiency of pose recognition.
The human keypoint localization information relates to image samples annotated with the positions of human keypoints of different categories; the human keypoint grouping information relates to image samples annotated with the groups to which human keypoints of different categories belong.
Referring to FIG. 15, in an exemplary embodiment, after step 360 the method described above may further include the following steps:
Step 810: recognize the action in the image to be recognized by matching the pose recognition result of the image to be recognized against a specified pose.
Step 830: generate a corresponding interaction instruction according to the recognized action, and control the execution of a specified event through the interaction instruction.
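A minimal sketch of the matching in step 810, assuming both the recognition result and the specified pose are given as arrays of normalized keypoint coordinates and that a mean keypoint distance below an arbitrary threshold counts as a match; the matching criterion and the instruction payload are illustrative assumptions:

    import torch

    def matches_specified_pose(pose: torch.Tensor,
                               template: torch.Tensor,
                               threshold: float = 0.1) -> bool:
        # pose, template: (K, 2) normalized keypoint coordinates
        distance = torch.linalg.norm(pose - template, dim=1).mean()
        return bool(distance < threshold)

    if matches_specified_pose(torch.rand(17, 2), torch.rand(17, 2)):
        # step 830: generate the corresponding interaction instruction
        interaction_instruction = {"action": "swing"}  # illustrative payload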
In one application scenario, the recognition end is a smart TV and the interaction end is a somatosensory device.
An interactive application, for example a two-player tennis somatosensory game client, that is, the pose recognition apparatus, is deployed on the smart TV. As the interactive application runs on the smart TV, a tennis game scene is displayed to the user through the display screen of the smart TV.
Suppose the user performs a swing action by means of a tennis-racket somatosensory device. The interactive application running on the smart TV performs multi-person pose recognition on the captured image to be recognized; if the user pose represented by the pose recognition result matches the specified swing pose, it recognizes that the user has performed a swing action.
Further, through the above recognition, the interactive application can then generate an interaction instruction indicating that the user has performed the swing action, thereby controlling the smart TV to execute a display event.
Specifically, in the tennis game scene shown on the display screen, the virtual user character in the tennis game scene is controlled according to the interaction instruction to perform the corresponding swing action, thereby realizing somatosensory interaction between the user and the somatosensory device.
In the above application scenario, the pose recognition service provides the basis for interactive applications based on human poses, greatly enriching the user's entertainment experience.
The following are apparatus embodiments of this application, which can be used to execute the multi-person pose recognition method involved in this application. For details not disclosed in the apparatus embodiments of this application, please refer to the method embodiments of the multi-person pose recognition method involved in this application.
Referring to FIG. 16, in an exemplary embodiment, a multi-person pose recognition apparatus 900 includes but is not limited to: an image acquisition module 910, a traversal module 930, and a pose recognition module 950.
The image acquisition module 910 is configured to acquire an image to be recognized.
The traversal module 930 is configured to construct a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, with different stages connected through second residual connection layers.
The traversal module 930 is further configured to traverse the stages of the detour pyramid network, including performing the following processing: in the feature map extraction performed in the current stage, propagating features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage; and propagating features, via the second residual connection layers, between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage.
The traversal module 930 is further configured to, until the traversal of all stages of the detour pyramid network is completed, take the output feature map of the last stage as the feature map corresponding to the image to be recognized.
The pose recognition module 950 is configured to perform pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
In an exemplary embodiment, the traversal module 930 includes but is not limited to: a feature extraction unit and a feature fusion unit.
The feature extraction unit is configured to perform feature extraction on the input feature map of the current stage through the layers of the downsampling network.
The feature fusion unit is configured to transmit, through the first residual connection layers, the extracted features from the layers of the downsampling network to the layers of the upsampling network, and to perform feature fusion at the layers of the upsampling network to obtain the output feature map.
In an exemplary embodiment, the downsampling network includes several higher network layers and several lower network layers.
The feature extraction unit includes but is not limited to: a local feature extraction subunit and a global feature extraction subunit.
The local feature extraction subunit is configured to extract several local features of the input feature map through the several lower network layers, each local feature corresponding to one lower network layer.
The global feature extraction subunit is configured to extract several global features of the input feature map through the several higher network layers, each global feature corresponding to one higher network layer.
In an exemplary embodiment, the feature fusion unit includes but is not limited to: a fusion subunit and a feature map acquisition subunit.
The fusion subunit is configured to perform, among the layers of the upsampling network and in order from the highest network level to the lowest, the following processing for each layer: fusing the feature received from the first residual connection layer with the feature passed down from the layer above, upsampling the fused feature, and passing the processed fused feature to the next layer.
The feature map acquisition subunit is configured to take the processed fused feature obtained at the last layer as the output feature map.
In an exemplary embodiment, the apparatus 900 further includes but is not limited to: a pre-recognition module and a result fusion module.
The pre-recognition module is configured to perform pose pre-recognition on the output feature map to obtain an intermediate recognition result.
The result fusion module is configured to fuse the intermediate recognition result with the output feature map, and to transmit the processed feature map to the following stage via the second residual connection layers.
In an exemplary embodiment, the apparatus 900 further includes but is not limited to: a propagation path construction module, configured to construct a propagation path for the detour pyramid network, the propagation path including the path corresponding to each layer when feature propagation is performed via the layers of the first residual connection layers and/or the second residual connection layers.
Specifically, the propagation path construction module includes but is not limited to: a feature compression unit, an atrous convolution unit, and a feature expansion unit.
The feature compression unit is configured to dimensionally compress the feature to be propagated.
The atrous convolution unit is configured to input the compressed feature to multiple parallel atrous convolution pyramid units and to perform feature concatenation through a concatenation unit.
The feature expansion unit is configured to dimensionally expand the concatenated features, restoring them to the pre-compression feature dimension for propagation.
Further, the feature compression unit and the feature expansion unit each include a normalization layer, an activation layer, and a convolution layer connected in sequence.
In an exemplary embodiment, the propagation path construction module further includes but is not limited to: a jump path establishment unit.
The jump path establishment unit is configured to establish inter-stage jump paths between the stages of the detour pyramid network and to add the inter-stage jump paths to the propagation path.
In an exemplary embodiment, the pose recognition module 950 includes but is not limited to: a heatmap acquisition unit, a grouping map acquisition unit, and a keypoint position connection unit.
The heatmap acquisition unit is configured to perform human keypoint localization according to the feature map corresponding to the image to be recognized, to obtain several heatmaps identifying the positions of human keypoints, each heatmap corresponding to one category of human keypoints.
The grouping map acquisition unit is configured to perform human keypoint grouping according to the feature map corresponding to the image to be recognized, to obtain several grouping maps identifying the groups of human keypoints, each grouping map corresponding to one category of human keypoints.
The keypoint position connection unit is configured to, according to the positions and groups of human keypoints respectively identified by the several heatmaps and several grouping maps, establish connections between the positions of human keypoints of the same group and different categories in the image to be recognized, to obtain the pose recognition result of the image to be recognized.
In an exemplary embodiment, the heatmap acquisition unit includes but is not limited to: a position probability calculation subunit and a heatmap generation subunit.
The position probability calculation subunit is configured to calculate, for a category and according to the feature map corresponding to the image to be recognized, the probability that the category's human keypoints appear at different positions in the image to be recognized.
The heatmap generation subunit is configured to generate the heatmap corresponding to the category using the calculated probabilities as heat values.
In an exemplary embodiment, the grouping map acquisition unit includes but is not limited to: a grouping probability calculation subunit, a group determination subunit, and a grouping map generation subunit.
The grouping probability calculation subunit is configured to calculate, for a category and according to the feature map corresponding to the image to be recognized, the probability that the category's human keypoints belong to different groups.
The group determination subunit is configured to determine the group to which the category's human keypoints belong according to the calculated probabilities.
The grouping map generation subunit is configured to mark the image to be recognized according to the determined groups and to generate the grouping map corresponding to the category.
In an exemplary embodiment, the apparatus 900 further includes but is not limited to: an action recognition module and a control interaction module.
The action recognition module is configured to recognize the action in the image to be recognized by matching the pose recognition result of the image to be recognized against a specified pose.
The control interaction module is configured to generate a corresponding interaction instruction according to the recognized action, and to control the execution of a specified event through the interaction instruction.
It should be noted that when the multi-person pose recognition apparatus provided in the above embodiments performs multi-person pose recognition, the division into the functional modules above is only used as an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the multi-person pose recognition apparatus may be divided into different functional modules to complete all or part of the functions described above.
In addition, the multi-person pose recognition apparatus provided in the above embodiments and the embodiments of the multi-person pose recognition method belong to the same concept; the specific manner in which each module performs operations has been described in detail in the method embodiments and will not be repeated here.
Referring to FIG. 17, in an exemplary embodiment, an electronic device 1000 includes at least one processor 1001, at least one memory 1002, and at least one communication bus 1003.
The memory 1002 stores computer-readable instructions, and the processor 1001 reads the computer-readable instructions stored in the memory 1002 through the communication bus 1003.
When executed by the processor 1001, the computer-readable instructions implement the multi-person pose recognition method in the above embodiments.
In an exemplary embodiment, a computer-readable storage medium has a computer program stored thereon which, when executed by a processor, implements the multi-person pose recognition method in the above embodiments.
The foregoing is merely a preferred exemplary embodiment of this application and is not intended to limit its implementations. Those of ordinary skill in the art can conveniently make corresponding variations or modifications according to the main concept and spirit of this application; therefore, the protection scope of this application shall be subject to the protection scope claimed in the claims.

Claims (16)

  1. A multi-person pose recognition method, executed by an electronic device, comprising:
    acquiring an image to be recognized;
    constructing a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, different stages being connected through second residual connection layers;
    traversing the stages of the detour pyramid network, including performing the following processing:
    in the feature map extraction performed in the current stage, propagating features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage;
    propagating features, via the second residual connection layers, between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage;
    until the traversal of all stages of the detour pyramid network is completed, taking the output feature map of the last stage as the feature map corresponding to the image to be recognized; and
    performing multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
  2. The method according to claim 1, wherein, in the feature map extraction performed in the current stage, the propagating of features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers to obtain the output feature map of the current stage comprises:
    performing feature extraction on the input feature map of the current stage through the layers of the downsampling network;
    transmitting, through the first residual connection layers, the extracted features from the layers of the downsampling network to the layers of the upsampling network, and performing feature fusion at the layers of the upsampling network to obtain the output feature map.
  3. The method according to claim 2, wherein the downsampling network comprises several higher network layers and several lower network layers;
    the performing of feature extraction on the input feature map of the current stage through the layers of the downsampling network comprises:
    extracting several local features of the input feature map through the several lower network layers, each local feature corresponding to one lower network layer;
    extracting several global features of the input feature map through the several higher network layers, each global feature corresponding to one higher network layer.
  4. The method according to claim 2, wherein the performing of feature fusion at the layers of the upsampling network to obtain the output feature map comprises:
    performing, among the layers of the upsampling network and in order from the highest network level to the lowest, the following processing for each layer: fusing the feature received from the first residual connection layer with the feature passed down from the layer above, upsampling the fused feature, and passing the processed fused feature to the next layer;
    taking the processed fused feature obtained at the last layer as the output feature map.
  5. The method according to claim 1, wherein, after obtaining the output feature map of the current stage in the feature map extraction performed in the current stage, in which features are propagated between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, the method further comprises:
    performing pose pre-recognition on the output feature map to obtain an intermediate recognition result;
    fusing the intermediate recognition result with the output feature map, and transmitting the processed feature map to the following stage via the second residual connection layers.
  6. The method according to any one of claims 1 to 5, further comprising:
    constructing a propagation path for the detour pyramid network, the propagation path including the path corresponding to each layer when feature propagation is performed via the layers of the first residual connection layers and/or the second residual connection layers.
  7. The method according to claim 6, wherein the constructing of a propagation path for the detour pyramid network comprises:
    dimensionally compressing, through a feature compression unit, the feature to be propagated;
    inputting the compressed feature to multiple parallel atrous convolution pyramid units, and performing feature concatenation through a concatenation unit;
    dimensionally expanding, via a feature expansion unit, the concatenated features to restore the pre-compression feature dimension.
  8. The method according to claim 7, wherein the feature compression unit and the feature expansion unit each comprise a normalization layer, an activation layer, and a convolution layer connected in sequence.
  9. The method according to claim 6, further comprising:
    establishing inter-stage jump paths between the stages of the detour pyramid network, and adding the inter-stage jump paths to the propagation path.
  10. The method according to claim 1, wherein the performing of multi-person pose recognition according to the feature map corresponding to the image to be recognized to obtain the pose recognition result of the image to be recognized comprises:
    performing human keypoint localization according to the feature map corresponding to the image to be recognized, to obtain several heatmaps identifying the positions of human keypoints, each heatmap corresponding to one category of human keypoints;
    performing human keypoint grouping according to the feature map corresponding to the image to be recognized, to obtain several grouping maps identifying the groups of human keypoints, each grouping map corresponding to one category of human keypoints;
    according to the positions and groups of human keypoints respectively identified by the several heatmaps and several grouping maps, establishing connections between the positions of human keypoints of the same group and different categories in the image to be recognized, to obtain the pose recognition result of the image to be recognized.
  11. The method according to claim 10, wherein the performing of human keypoint localization according to the feature map corresponding to the image to be recognized to obtain several heatmaps identifying the positions of human keypoints comprises:
    for a category, calculating, according to the feature map corresponding to the image to be recognized, the probability that the category's human keypoints appear at different positions in the image to be recognized;
    generating the heatmap corresponding to the category using the calculated probabilities as heat values.
  12. The method according to claim 10, wherein the performing of human keypoint grouping according to the feature map corresponding to the image to be recognized to obtain several grouping maps identifying the groups of human keypoints comprises:
    for a category, calculating, according to the feature map corresponding to the image to be recognized, the probability that the category's human keypoints belong to different groups;
    determining, according to the calculated probabilities, the group to which the category's human keypoints belong;
    marking the image to be recognized according to the determined groups to generate the grouping map corresponding to the category.
  13. The method according to any one of claims 1, 10, 11, and 12, wherein, after performing pose recognition according to the feature map corresponding to the image to be recognized to obtain the pose recognition result of the image to be recognized, the method further comprises:
    recognizing the action in the image to be recognized by matching the pose recognition result of the image to be recognized against a specified pose;
    generating a corresponding interaction instruction according to the recognized action, and controlling the execution of a specified event through the interaction instruction.
  14. A multi-person pose recognition apparatus, comprising:
    an image acquisition module, configured to acquire an image to be recognized;
    a traversal module, configured to construct a detour pyramid network, the detour pyramid network comprising several stages in parallel, each stage comprising layers of a downsampling network, layers of an upsampling network, and first residual connection layers connected between the layers of the downsampling and upsampling networks, different stages being connected through second residual connection layers;
    the traversal module being further configured to traverse the stages of the detour pyramid network, including performing the following processing: in the feature map extraction performed in the current stage, propagating features between the layers of the downsampling network and the layers of the upsampling network in the current stage through the first residual connection layers, to obtain the output feature map of the current stage; and propagating features, via the second residual connection layers, between the layers of the upsampling network in the current stage and the layers of the downsampling network in the following stage, so as to extract the feature map corresponding to the following stage;
    the traversal module being further configured to, until the traversal of all stages of the detour pyramid network is completed, take the output feature map of the last stage as the feature map corresponding to the image to be recognized; and
    a pose recognition module, configured to perform multi-person pose recognition according to the feature map corresponding to the image to be recognized, to obtain a pose recognition result of the image to be recognized.
  15. An electronic device, comprising:
    a processor; and
    a memory, the memory storing computer-readable instructions that, when executed by the processor, implement the multi-person pose recognition method according to any one of claims 1 to 13.
  16. A computer-readable storage medium storing computer-readable instructions capable of causing at least one processor to execute the multi-person pose recognition method according to any one of claims 1 to 13.
PCT/CN2019/113899 2018-10-30 2019-10-29 Multi-person pose recognition method and apparatus, electronic device, and storage medium WO2020088433A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19878618.8A EP3876140B1 (en) 2018-10-30 2019-10-29 Method and apparatus for recognizing postures of multiple persons, electronic device, and storage medium
US17/073,441 US11501574B2 (en) 2018-10-30 2020-10-19 Multi-person pose recognition method and apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811275350.3A 2018-10-30 Multi-person pose recognition method, apparatus and electronic device
CN201811275350.3 2018-10-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/073,441 Continuation US11501574B2 (en) 2018-10-30 2020-10-19 Multi-person pose recognition method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020088433A1 true WO2020088433A1 (zh) 2020-05-07

Family

ID=67645190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/113899 WO2020088433A1 (zh) 2018-10-30 2019-10-29 多人姿态识别方法、装置、电子设备及存储介质

Country Status (4)

Country Link
US (1) US11501574B2 (zh)
EP (1) EP3876140B1 (zh)
CN (1) CN110163059B (zh)
WO (1) WO2020088433A1 (zh)

Also Published As

Publication number Publication date
EP3876140B1 (en) 2024-02-28
CN110163059A (zh) 2019-08-23
EP3876140A4 (en) 2021-12-22
CN110163059B (zh) 2022-08-23
US20210073527A1 (en) 2021-03-11
EP3876140A1 (en) 2021-09-08
US11501574B2 (en) 2022-11-15
