CN112183314A - Expression information acquisition device and expression identification method and system

Expression information acquisition device and expression identification method and system

Info

Publication number
CN112183314A
CN112183314A
Authority
CN
China
Prior art keywords
edge
expression
node
data
nodes
Prior art date
Legal status
Granted
Application number
CN202011030333.0A
Other languages
Chinese (zh)
Other versions
CN112183314B (en)
Inventor
王勃然
姜京池
刘劼
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202011030333.0A
Publication of CN112183314A
Application granted
Publication of CN112183314B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01B MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B7/00 Measuring arrangements characterised by the use of electric or magnetic techniques
    • G01B7/16 Measuring arrangements characterised by the use of electric or magnetic techniques for measuring the deformation in a solid, e.g. by resistance strain gauge
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/10 Image acquisition
    • G06V10/12 Details of acquisition arrangements; Constructional details thereof
    • G06V10/14 Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/147 Details of sensors, e.g. sensor lenses

Abstract

The invention provides an expression information acquisition device, an expression recognition method and an expression recognition system. The expression information acquisition device includes a flexible mask substrate configured to be attached to the face, and a plurality of piezoelectric film sensors arranged on the mask substrate and configured to detect facial expression actions. The expression recognition method includes acquiring node data of all nodes in a preset facial node set, the node data being collected by the expression information acquisition device and including the spatial positions of the nodes and the time sequences of the node expression data; and performing facial expression recognition according to the node data by using a pre-trained graph convolutional neural network expression recognition model. Because the node data directly capture facial muscle and skin action information with sensors, the information loss and distortion caused by dimensionality reduction into image form are avoided, and the data are more accurate and carry a larger amount of information. The graph topological distribution of the sensors inherently matches the data-processing mode of the graph neural network in data structure, so the GCN can obtain a better expression recognition result.

Description

Expression information acquisition device and expression identification method and system
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an expression information acquisition device, an expression recognition method and an expression recognition system.
Background
At present, expression recognition is mainly based on time sequences of face images in acquired pictures or videos, with expression classification realized through deep neural networks, particularly deep convolutional neural networks. However, when identifying expressions, especially micro-expressions and micro-motions, from images and video sequences, convolutional neural network algorithms are limited by the accuracy of expression capture and can hardly achieve high accuracy. Micro-expression classification is more difficult than macro-expression (emotion) classification, mainly because the data are sparse, local regions of the face need to be located, and the duration is short, making the expressions hard to capture.
Disclosure of Invention
In order to solve at least one aspect of the above technical problems and obtain a better expression recognition result, the present invention provides an expression information collecting device, an expression recognition method and system, and a non-transitory computer readable storage medium.
According to a first aspect of the present invention, there is provided an expression information collecting apparatus, comprising:
a mask substrate, made of a flexible material and configured to be attached to a human face; and
a plurality of piezoelectric film sensors, arranged on the mask substrate and configured to detect facial expression actions.
In some embodiments, a plurality of mask substrates are provided, each having a different size and/or shape, so that different mask substrates match different users' facial morphologies.
In some embodiments, the expression information collecting apparatus further includes:
and the control module is used for receiving the signals output by the piezoelectric film sensors, preprocessing the signals to generate expression data and sending the expression data to the expression recognition device.
In some embodiments, the control module comprises:
the signal amplification unit is used for receiving the signals output by the piezoelectric film sensors and amplifying the signals;
the central processing unit is used for preprocessing the amplified signals output by the signal amplification unit, generating expression data and controlling the operation of the signal amplification unit, the wireless communication unit and the power supply module;
the wireless communication unit is used for communicating with an expression recognition device and sending the expression data to the expression recognition device;
the power supply module is used for supplying electric energy to the signal amplification unit, the central processing unit and the wireless communication unit; and
a control box for accommodating the signal amplification unit, the central processing unit, the wireless communication unit, and the power supply module.
In some embodiments, the expression information collecting apparatus further includes:
the support part is connected with the mask substrate and the control module and used for being worn on the head of a user to form a support, so that the mask substrate and the control module are worn on the head of the user.
With the expression information acquisition device of the invention, because the film sensors are matched with the flexible mask substrate, the device fits the face better than traditional facial electrical-signal acquisition devices, and the piezoelectric film sensors arranged at a plurality of specific preset positions can directly and finely capture facial muscle and epidermal actions, so the device captures various facial actions, particularly micro-expression actions, more effectively.
An embodiment of a second aspect of the present invention provides an expression recognition method, including:
acquiring node data of all nodes in a preset face node set, wherein the node data comprises the spatial positions of the nodes and the time sequence of node expression data; and
according to the node data, facial expression recognition is carried out by using a pre-trained graph convolution neural network expression recognition model;
the node expression data is acquired by using the expression information acquisition device according to the first aspect of the present invention, and the setting positions of the plurality of piezoelectric thin film sensors correspond to the node positions in the preset facial node set.
In some embodiments, the performing facial expression recognition using a pre-trained graph convolutional neural network expression recognition model according to the node data includes:
according to the node data, calculating a connection edge set of all nodes in the node set and edge data of each connecting edge, wherein the edge data characterize the positions of the nodes whose expression data change relative to a reference value and the changes of the node expression data;
constructing a graph structure of expression data according to the node data and the edge data of the connecting edges; and
and inputting the graph structure of the expression data into a pre-trained graph convolution neural network expression recognition model to obtain an expression recognition classification result output by the model.
In some embodiments, the calculating, according to the node data, the set of connection edges of all nodes in the node set and the edge data of each connection edge includes:
determining a connection edge set according to the spatial positions of all nodes in the node set and the time sequence of the expression data of the nodes, wherein the connection edge set specifically comprises the following steps:
for each time point in the time sequence of the node expression data, acquiring the nodes in the node set whose node expression data change at that time point by more than a preset threshold, and taking them as active nodes, wherein any two active nodes i and j are connected to form a connecting edge e_ij;
for each connecting edge e^{t1}_ij at time t1, acquiring its spatial adjacent edges in a preset spatial neighborhood and its temporal adjacent edges in a preset temporal neighborhood to form an adjacent edge set N(e^{t1}_ij), wherein a spatial adjacent edge in the preset spatial neighborhood means: two adjacent edges are connected through no more than a preset number d of nodes, d being a natural number; and a temporal adjacent edge in the preset temporal neighborhood means: for the connecting edge e^{t1}_ij, the connecting edge e^{t2}_ij at any time t2 whose interval from t1 does not exceed a predetermined time range is also considered an adjacent edge of e^{t1}_ij; and
and calculating a connection edge set and edge data of each connection edge according to the time domain adjacent edge and the space adjacent edge of each connection edge.
In some embodiments, the calculating, according to the node data, a set of connection edges of all nodes in the node set and edge data of each connection edge further includes:
calculating a marking function L of the adjacent edge set of each connecting edge, and distributing weight to each adjacent edge in the adjacent edge set according to the marking function L;
wherein the labeling function L is used for characterizing the association degree of each adjacent edge in the adjacent edge set of the connecting edge with the connecting edge.
In some embodiments, the value of the labeling function L is a predetermined number of discrete values, and the value of the labeling function L is determined according to the relative position relationship between each adjacent edge and the connecting edge; and
assigning a weight to each adjacent edge in the set of adjacent edges according to the labeling function L comprises: and determining a weight coefficient according to the position relation between each adjacent edge and the connecting edge and the value of the marking function L, so that the edges with the same marking function value have the same weight.
In some embodiments, calculating the set of connection edges of all nodes in the set of nodes and the edge data of each connection edge according to the node data includes:
calculating the central coordinate and the direction vector of each connecting edge, wherein the central coordinate and the direction vector are obtained according to the three-dimensional position information of two nodes connected by the edges; and
and recording the central coordinate and the direction vector of the connecting side into the side data of the connecting side.
In some embodiments, the preset graph convolutional neural network comprises, connected in sequence: a data input layer, a graph convolution layer, a fully connected layer and an output layer,
wherein the structure of the graph convolution layer comprises: a first sublayer and a second sublayer connected in parallel with their outputs cascaded, the first sublayer comprising a first batch normalization layer, an edge convolution layer and a first global pooling layer connected in sequence, and the second sublayer comprising a second batch normalization layer, a node convolution layer and a second global pooling layer connected in sequence; or
the structure of the graph convolution layer comprises: a graph-structure convolution sublayer, a shared convolution sublayer and a global pooling sublayer connected in sequence, wherein the graph-structure convolution sublayer comprises a third sublayer and a fourth sublayer connected in parallel with their outputs cascaded, the third sublayer comprising a third batch normalization layer and an edge convolution layer connected in sequence, and the fourth sublayer comprising a fourth batch normalization layer and a node convolution layer connected in sequence.
By acquiring expression data with the flexible piezoelectric film sensors and performing expression recognition with the graph neural network, the structural information of the sensor arrangement positions on the mask substrate can be fully utilized, combined with the time-sequence characteristics of facial muscle movement, to obtain better expression recognition and/or emotion recognition results.
Because the node data directly capture facial muscle and skin action information with the sensors, and the raw data are not converted through photographing, video recording or the like, the information loss and distortion caused by reducing three-dimensional information to two-dimensional information are avoided, and accurate raw data are obtained more easily. Moreover, the graph topological distribution of the sensors inherently matches the data-processing mode of the graph neural network in data structure, so better expression recognition results can be obtained after the collected data are processed by the GCN. The method therefore has broad application prospects for scenarios where the prior art cannot obtain good results, such as micro-expression recognition.
An embodiment according to the third aspect of the present invention provides a non-transitory computer readable storage medium, in which computer instructions are stored, wherein the computer instructions, when executed, implement the expression recognition method according to the second aspect of the present invention.
An embodiment according to a fourth aspect of the present invention provides an expression recognition system, comprising:
the facial expression information acquisition device is used for acquiring facial expression data, and the facial expression information acquisition device is the facial expression information acquisition device according to the first aspect of the invention;
the expression recognition device is used for recognizing expressions according to the facial expression data collected by the expression information collection device, and comprises:
the data acquisition module is used for acquiring node data of all nodes in a preset facial node set according to the facial expression data acquired by the expression information acquisition device, wherein the node data comprises the spatial positions of the nodes and the time sequence of the node expression data;
the graph convolution expression recognition module is used for performing facial expression recognition according to the node data by using a pre-trained graph convolutional neural network expression recognition model;
the setting positions of a plurality of piezoelectric film sensors in the expression information acquisition device correspond to the node positions in the preset facial node set.
The storage medium according to the third aspect of the present invention and the expression recognition system according to the fourth aspect of the present invention have similar advantageous effects to the method according to the second aspect, and are not described herein again.
Drawings
Fig. 1 is a schematic structural diagram of an expression information acquisition device according to an embodiment of the present invention;
fig. 2 is a schematic view of an expression information acquisition device according to another embodiment of the present invention;
FIG. 3 is a diagram of a face morphology classification model;
FIG. 4 is a schematic diagram of facial landmark positions according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an expression recognition method according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an expression recognition model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an expression recognition model according to another embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below.
In the related art, when expression recognition is performed based on a graph neural network, a video or a picture is usually used as the source of expression data. Because the human face is three-dimensional and muscle movements are hidden under the epidermis when an expression is made, it is difficult to obtain muscle movement information through direct observation. In addition, information loss is inevitable when the three-dimensional form of the human face is projected into two dimensions, so expression recognition using image data places high requirements on image acquisition precision, illumination conditions and the like, and easily leads to poor recognition results. The related art does not describe using sensor data collected on the face in a graph neural network.
Some devices for collecting facial information already exist, but most of them are used for medical purposes such as epidermal reconstruction after collecting facial morphology, muscle responses, and the like. In these devices the sensors are attached directly to the face with patches; limited by the size of the face, the number of sampling points is small, the connecting wires are complicated and hang loosely around the face, the wearing experience is poor, and the devices are very inconvenient to use.
In order to obtain a better facial expression recognition result, the invention provides a facial information acquisition device for directly and physically acquiring facial expression actions of a plurality of nodes of a face, and provides a method, a system and a storage medium for realizing expression recognition by combining a graph neural network to process and recognize data based on data acquired by the facial information acquisition device.
First, an expression information acquisition apparatus according to the first aspect of the present invention will be described with reference to Figs. 1 to 4. It should be noted that, for convenience of description, "expression recognition" in this disclosure uniformly refers to the classification of the outward appearance of basic expressions such as disgust, anger, fear, happiness, sadness and surprise; it may also refer to the classification of the underlying emotional changes embodied by changes in facial motion characteristics (i.e., expressions). Since both use the same expression data and similar classification methods, they are not distinguished in the description of the present disclosure.
Fig. 1 is a schematic structural diagram of an expression information acquisition device according to an embodiment of the present invention. The expression information acquisition device provided by the invention comprises a mask substrate 102 and a plurality of piezoelectric film sensors 101 arranged on the mask substrate. The mask substrate is configured to be attached to a human face, is made of a flexible material, and together with the piezoelectric film sensors forms an expression recognition mask 1 for detecting facial expression actions.
The mask substrate can be a film substrate compatible with the human body, such as silicone rubber or hydrogel. In use, the mask is attached to the face. Depending on the requirements, the piezoelectric film sensors can be distributed over different parts of the face; for example, guided by neurophysiology, the sensors can be placed where facial expressions are rich, particularly where facial muscle movements have large amplitude. For example, the 68-, 106- or 240-point facial feature (landmark) positions commonly used in conventional artificial intelligence (AI) technology may be used as the sensor placement positions. Fig. 4 is a schematic diagram of facial feature point positions according to an embodiment of the present invention, where the points on the face are arranged with reference to the feature points commonly used in image-based recognition, and their positions can be adjusted according to the three-dimensional features of the human face.
Due to the size and shape of different people, the faces can be greatly different. In order to achieve better detection results for different human faces, in some embodiments, the mask substrate is multiple, each mask substrate having a different size and/or shape, and the different mask substrates are used to match different user facial morphologies.
With regard to adaptation to different face sizes and shapes, in 2003 the National Institute for Occupational Safety and Health (NIOSH) conducted an anthropometric survey of 3997 subjects nationwide. Using conventional measurement methods and three-dimensional (3D) scanning systems, researchers built an anthropometric database from the obtained head and face measurements, detailing the facial size distribution of respirator users. This database is used to build appropriate test panels incorporated into NIOSH respirator certification and international standards. One of the panels, called the Principal Component Analysis (PCA) panel, uses the first two principal components obtained from a set of 10 face dimensions (adjusted for age and ethnicity). As shown in Fig. 3, the user population can be divided into 5 face-size categories according to the physiological characteristics of different ethnicities. The five persons most representative of each size category were scanned in three dimensions and averaged to construct a representative head shape for each category (small, medium, large, long/narrow, short/wide), called the NIOSH digital headforms; the five common NIOSH digital headforms are shown in Fig. 3.
The NIOSH digital headforms are symmetrical and represent the size and shape distribution of current users' faces. Furthermore, the ears have been placed at the average position matching the selected head size. For actual wearing of the mask, a 3D scanner can be used to register the user's face on site, and flexible 3D printing can be used to produce a customized mask; alternatively, standard sizes can be made according to the 5 standard facial forms obtained from the big-data analysis, and the appropriate size can be selected according to the wearer's head form when in use.
When the facial expression of the user changes, the piezoelectric film sensor outputs voltage signals which change. Because the piezoelectric film sensor is a passive sensor, power supply is not needed at the expression recognition mask part, and only sensor data needs to be transmitted to a receiving device. The receiving device can be a part of the expression information acquisition device or an external device.
The sensor data can be transmitted in a wired mode, the electric connecting wires are embedded into the mask substrate, and the wires can be arranged as close to the edge of the mask as possible for wearing.
In some embodiments, wireless communication may also be used to transmit the sensor data to a corresponding receiving device. For example, a miniature RFID device may be connected to each piezoelectric film sensor, and an RFID reader may be provided outside the mask, so that the mask side remains unpowered while the sensor data are transmitted wirelessly, achieving the dual benefits of power saving and wearing convenience.
Fig. 2 is a schematic diagram of an expression information acquisition device according to another embodiment of the present invention. The expression information acquisition device further comprises a control module 2 and a support part 3.
The control module 2 is used for receiving the signals output by the piezoelectric film sensors, preprocessing the signals to generate expression data, and sending the expression data to the expression recognition device.
The support part is connected with the mask substrate and the control module and used for being worn on the head of a user to form a support, so that the mask substrate and the control module are worn on the head of the user. For example, the support portion may be in various forms such as a band, a ring, a hat band, and a hat.
In this embodiment, for active (wired) connections, each node voltage may be routed through wires along the edge of the face to the control circuitry at the back of the head. For passive access, the RFID reader may be located in the control module.
Optionally, the control module may include a signal amplification unit, a central processing unit, a wireless communication unit and a power supply module, and may further include a control box for accommodating the above units.
The signal amplification unit is used for receiving the signals output by the piezoelectric film sensors and amplifying the signals. Specifically, the signal amplification unit may include two functions of AD conversion signal sampling and signal amplification, and a specific circuit implementation thereof may use a common sampling circuit and an amplification circuit.
For expression recognition, since the piezoelectric film is very sensitive and readily responds to facial expressions with pressure signals, under current technical conditions the piezoelectric signals can only be analyzed qualitatively rather than quantitatively, so the requirement on the sampling rate is not high. For the purpose of expression recognition, good results can be obtained by choosing a sampling interval of 1/10 to 1/100 of the total duration of the action. For example, the duration of a typical expression or micro-expression is roughly on the order of 0.1 s, so the sampling interval may be chosen to be 1-10 ms.
Of course, if the performance of the piezoelectric film sensor is better as the technology advances, the computational power of the processing unit chip and the computational power of the expression recognition device are further improved, and higher sampling frequency and sensor accuracy are obviously applicable to the invention. The data presented at present are only suitable choices obtained by considering the trade-off between the operation time and the cost under the existing related art conditions, and do not limit the protection scope of the present invention.
The central processing unit is used for preprocessing the amplified signals output by the signal amplification unit, generating expression data, and controlling the operation of the signal amplification unit, the wireless communication unit and the power supply module. The preprocessing can be simple operations such as denoising and filtering, while the recognition work that requires large computing power is handed to an external expression recognition device. In some embodiments, after the neural network pre-training is completed by an external processor, the trained parameters and algorithms may also be stored in the central processing unit, and local expression recognition results may be output through on-chip computation by the central processing unit.
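As a purely illustrative sketch of such preprocessing (the moving-average filter, window length and function names below are assumptions, not prescribed by this disclosure), denoising one sensor channel might look like:

```python
import numpy as np

def preprocess_sensor_signal(raw_voltage, window=5):
    """Moving-average denoising of one piezoelectric sensor channel.

    raw_voltage: 1-D array of amplified voltage samples. The window length
    is an illustrative assumption; the disclosure only states that the
    preprocessing can be simple denoising/filtering.
    """
    kernel = np.ones(window) / window
    return np.convolve(raw_voltage, kernel, mode="same")  # keep time alignment
```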
The wireless communication unit is used for communicating with the expression recognition device and sending the expression data to it. Various common wireless communication modes such as WIFI and Bluetooth can be adopted; the invention places no limitation on the wireless communication mode. Although wireless communication is more convenient for wearing, a wired communication mode may equally be used to communicate with the expression recognition device, which is likewise not limited by the invention.
And the power supply module is used for supplying electric energy to the signal amplification unit, the central processing unit and the wireless communication unit. Because the piezoelectric sensor is a passive device, the power supply module only needs to supply power to the signal processing and central processing unit when in work, and can be realized by using common commercially available lithium batteries.
The control box accommodates the signal amplification unit, the central processing unit, the wireless communication unit, and the power supply module. To meet the requirements of wearable equipment, the control box can be made of a lightweight material such as plastic.
With the expression information acquisition device of the invention, because the film sensors are matched with the flexible mask substrate, the device fits the face better than traditional facial electrical-signal acquisition devices, and the piezoelectric film sensors arranged at a plurality of specific preset positions can directly and finely capture facial muscle and epidermal actions, so the device captures various facial actions, particularly micro-expression actions, more effectively.
Further expression recognition can then be carried out on the data acquired by the expression information acquisition device. Because the distribution of the sensors on the mask substrate has a topological structure, the data are particularly suitable for processing with a graph convolutional neural network (GCN) for expression recognition. Using facial sensor data in a GCN to achieve expression recognition is also the core technical idea of the present invention, and is the first initiative of its kind in the GCN application area.
An embodiment of a second aspect of the present invention provides an expression recognition method, including: acquiring node data of all nodes in a preset face node set, wherein the node data comprises the spatial positions of the nodes and the time sequence of node expression data; the node expression data is acquired by using the expression information acquisition device according to the first aspect of the present invention, and the setting positions of the plurality of piezoelectric thin film sensors correspond to the node positions in the preset facial node set. And then according to the node data, carrying out facial expression recognition by using a pre-trained graph convolution neural network expression recognition model.
Various conventional node-based convolution modes can be used with the method disclosed by the invention to achieve expression recognition. They are not described in detail here; those skilled in the art can implement the method of the present invention according to various publications.
When convolution is computed over nodes, the adjacency matrix representing the connection relationships between nodes is usually a 0/1 matrix, with a label of 1 marking the presence of a connecting edge; some directed-graph representations also introduce -1 to represent direction. Overall, however, the adjacency matrix is static during the convolution computation. Moreover, in the process of updating the node data, the adjacency matrix participates in the convolution calculation as part of the convolution weights and likewise does not change. Thus, the amount of information in the graph structure is not fully utilized.
The inventors of the present application have noted that, in a graph structure, besides representing the connection relationships between nodes, the edges themselves, if given proper values, can also represent many features of the graph structure. Whereas a node can only represent one-dimensional information in the graph structure, an edge, as a two-dimensional feature, can carry more information than a node. In particular, when the connection relationships of the graph structure are time-varying, using edge information to represent the dynamic changes of the graph structure can express the graph more accurately. Therefore, performing the convolution calculation with edge information further improves the recognition precision of the graph neural network and correspondingly reduces the precision required of the sample data. Based on this inventive concept, the invention also provides an expression recognition model based on edge convolution, a training method therefor, and an expression recognition method.
First, the expression recognition model training method is introduced. It should be noted that the expression data and their processing are similar in model training and in expression recognition, so the descriptions of the related steps can be referred to each other.
The model training method includes steps S110 to S140.
In step S110, node data of all nodes in the preset action node set is obtained, where the node data includes spatial positions of the nodes and time sequences of expression data of the nodes.
The node expression data are acquired with the expression information acquisition device described above, and the placement positions 11 of the piezoelectric film sensors correspond to the node positions in the preset facial node set. Voltage data collected by the piezoelectric sensors arranged at the preset facial nodes on the facial skin are acquired and preprocessed to obtain the node data of all nodes in the preset facial node set. Every connection between two nodes constitutes an edge 12.
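A minimal sketch of how the node data of step S110 might be organized is given below; the class and array shapes are illustrative assumptions rather than a required implementation:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NodeData:
    """Node data for one node of the preset facial node set."""
    node_id: int
    position: np.ndarray        # (3,) spatial position of the sensor/node
    expression_seq: np.ndarray  # (T,) time sequence of preprocessed sensor data

def build_node_set(positions, voltage_matrix):
    """positions: (N, 3) sensor coordinates on the mask substrate;
    voltage_matrix: (N, T) preprocessed voltage time series, one row per sensor."""
    return [NodeData(i, positions[i], voltage_matrix[i])
            for i in range(len(positions))]
```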
In step S120, according to the node data, a connection edge set of all nodes in the node set and edge data of each connection edge are calculated, where the edge data represents a node position where the node action data changes with respect to a reference value and a change of the node action data.
When considering motion recognition, important influencing factors include the sequence and correlation of the motion between the nodes where the motion occurs. Therefore, when the connection edge sets of all nodes in the node set and the edge data of each connection edge are calculated according to the node data, the values and calculation modes of the connection edges can be designed according to the principle. The set of connected edges E is constructed from two dimensions, time and space. And performs data processing by temporal convolution and spatial convolution.
Specifically, for each time point in the time sequence of the node expression data, the nodes in the node set whose node expression data change at that time point by more than a preset threshold are obtained and taken as active nodes, and any two active nodes i and j are connected to form a connecting edge e_ij. As for judging whether the change of the node expression data exceeds the preset threshold: when standard no-action posture data of the target object are available, the node data can be compared with the standard data in the absence of action. When a standard expression of the target object cannot be obtained as ground truth, the threshold must be selected according to the expression motion to be judged and its relative motion amplitude, for example by comparing the voltage value output by the sensor with a preset value.
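The following sketch illustrates the active-node and connecting-edge construction just described; the baseline handling, names and comparison scheme are illustrative assumptions:

```python
import itertools
import numpy as np

def connecting_edges_at_time(expression_seq, baseline, threshold, t):
    """Active nodes and connecting edges at time index t.

    expression_seq: (N, T) node expression data; baseline: (N,) reference
    values (e.g. the no-expression posture, when available); threshold:
    preset change threshold.
    """
    change = np.abs(expression_seq[:, t] - baseline)
    active = np.flatnonzero(change > threshold)              # active nodes
    # any two active nodes i and j are connected to form an edge e_ij
    return {(int(i), int(j)) for i, j in itertools.combinations(active, 2)}
```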
The value of a connecting edge can be calculated from the values of its nodes, for example as an algebraic weighted average or a geometric average of the node values of the two nodes at its ends. Taking the algebraic weighted average as an example, the center coordinate and direction vector of each connecting edge are calculated from the three-dimensional position information of the two nodes it connects, and the center coordinate and direction vector of the connecting edge are recorded in its edge data.
When convolution operations are performed on the node expression data of each node and the edge data of the connecting edges to achieve feature extraction, the influences of the time dimension and the space dimension need to be considered. For this purpose, for each connecting edge e^{t1}_ij corresponding to a time t1, its spatial adjacent edges in a preset spatial neighborhood and its temporal adjacent edges in a preset temporal neighborhood can be obtained to form an adjacent edge set N(e^{t1}_ij). A spatial adjacent edge in the preset spatial neighborhood means: two adjacent edges are connected through no more than a preset number d of nodes, d being a natural number. A temporal adjacent edge in the preset temporal neighborhood means: for the connecting edge e^{t1}_ij, the connecting edge e^{t2}_ij at any time t2 whose interval from t1 does not exceed a predetermined time range is also considered an adjacent edge of e^{t1}_ij; the preset time range is greater than or equal to zero. The connection edge set and the edge data of each connecting edge are then calculated according to the temporal and spatial adjacent edges of each connecting edge, and the calculated edge data will be used to perform the temporal convolution and the spatial convolution.
The temporal convolution mainly considers the action-characterizing features of the time sequence. For a connecting edge e_ij, its value at time t1 is denoted e^{t1}_ij, and the adjacent edges of e^{t1}_ij are defined in two ways: 1) at time t1, an edge connected to e^{t1}_ij through no more than d nodes is defined as a spatial adjacent edge, where d can be chosen according to factors such as node density, the detection precision required for the action amplitude, and the available computing power; a natural number of 1-4 generally gives good results, and when the number of nodes is relatively small, e.g. on the order of 10, d can be taken as 1-2; 2) at time t2, if the interval between t2 and t1 is within a predetermined range, e^{t2}_ij is also considered an adjacent edge of e^{t1}_ij, called a temporal adjacent edge of e^{t1}_ij. The node data at times t1 and t2 may differ. The predetermined interval can be expressed by defining a time kernel K_t, an integer representing the number of acquisition intervals between times t1 and t2; the adjacent edges of e^{t1}_ij must not be farther than K_t in time, and the constraint D(e^{t2}_kn, e^{t2}_ij) ≤ d defines the number of connecting-edge layers of the spatial neighborhood, thereby obtaining
N(e^{t1}_ij) = { e^{t2}_kn | e^{t2}_kn ∈ E and |t2 - t1| ≤ K_t and D(e^{t2}_kn, e^{t2}_ij) ≤ d }
where N(e^{t1}_ij) denotes the adjacent edge set of e^{t1}_ij, and E denotes the set of all connecting edges.
For expression recognition, considering the calculation amount required by neural network data processing and the size of a sensor, the number of nodes is in the order of tens to hundreds, which is a feasible scheme. At this time, the maximum extent of the time neighborhood should not exceed 4-7 sampling intervals.
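A hedged sketch of computing the adjacent edge set N(e^{t1}_ij) according to the formula above is shown below; the spatial_distance helper (the hop count D between two edges on the face graph) is an assumed function, not defined in this disclosure:

```python
def adjacent_edge_set(edges_per_t, t1, edge_ij, K_t, d, spatial_distance):
    """Sketch of N(e^{t1}_ij).

    edges_per_t: dict mapping time index t -> set of connecting edges (i, j)
    present at that time; spatial_distance(e1, e2): assumed helper returning
    the number of nodes separating two edges in the face graph.
    """
    neighborhood = set()
    for t2, edges in edges_per_t.items():
        if abs(t2 - t1) > K_t:          # outside the time kernel
            continue
        for e_kn in edges:
            if spatial_distance(e_kn, edge_ij) <= d:   # spatial constraint
                neighborhood.add((t2, e_kn))
    return neighborhood
```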
In some embodiments, using the set of adjacent edges as the parameters of the convolution calculation of the connected edges, a more compact algorithm is to perform the subsequent convolution calculation directly in a manner of weighted average of the adjacent edges, and mark the adjacent edges according to the "influence" of the actions of the adjacent edges (the term influence here does not refer to the effect on the actual action, but refers to the degree of association when characterizing the action).
It is noted that because expressions are actually represented by a series of movements of muscles, skin, etc., and thus expression data is essentially a representation of motion, expression data is sometimes referred to as motion data or expression motion data in this disclosure to match the emphasis on its motion attributes in the context of the context, for the convenience of the reader.
Calculating a marking function L of the adjacent edge set of each connecting edge, and distributing weight to each adjacent edge in the adjacent edge set according to the marking function L; wherein the labeling function L is used for characterizing the association degree of each adjacent edge in the adjacent edge set of the connecting edge with the connecting edge. The marking function L can calculate its respective value for each adjacent edge individually, but this results in a large amount of computation, and the performance improvement effect is not proportional to the amount of computation. Thus, a simplified way of calculation can be introduced.
In order to simplify the calculation, in some embodiments, the value of the labeling function L is a predetermined number of discrete values, and the value of the labeling function L is determined according to a relative position relationship between each adjacent edge and the connecting edge. Assigning a weight to each adjacent edge in the set of adjacent edges according to the labeling function L comprises: grouping the marking functions into values according to the position relation between each adjacent edge and the connecting edge; and determining a weight coefficient according to the value of the marking function L, so that the edges with the same marking function value have the same weight.
For example, the subset of groups may be divided according to the relative positional relationship of the adjoining and connecting edges with respect to the "center" of the action. For facial expression recognition, the geometric center of the face can be selected as the center of the action, and for the action of limbs, the gravity center or the geometric center of the human body can be selected as the center of the action.
When the temporal neighborhood is not considered, L is the labeling function of a single adjacent edge in the spatial neighborhood, also serving as a spatial configuration label; K_t is the time kernel size, and K is the number of subsets into which the labeling function L divides the adjacent edges. For example, according to the distance relationship between different positions and the center G_c of the action, motion can be roughly classified as concentric, equidistant or eccentric. Specifically, for a connecting edge e_ij, the labeling function divides its adjacent edges into three subsets: 1) edges closer to the center than e_ij; 2) edges equidistant from the center with e_ij; 3) edges farther from the center than e_ij. Thus, the labeling function can be expressed as:
L(e_kn) = 0, if d(e_kn, G_c) < d(e_ij, G_c); 1, if d(e_kn, G_c) = d(e_ij, G_c); 2, if d(e_kn, G_c) > d(e_ij, G_c)   (Formula 1)
where G_c is the reference center, which can be taken as the geometric mean of the coordinates of the parts of the human body; when studying the dynamic characteristics of facial expressions, the geometric center or the physical center of gravity of the face can be selected, or the position of the reference point can be changed as required. d(e_ij, G_c) is the distance from the connecting edge e_ij to G_c, and d(e_kn, G_c) is the distance from the adjacent edge to the reference center.
It is noted that 0, 1 and 2 are only the values used in one embodiment; it is obvious to those skilled in the art that L(e_kn) may be given other values, and the adjacent edges may also be divided into a different number K of subsets according to other principles.
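The spatial labeling of Formula 1 might be sketched as follows (coordinate inputs, the tolerance and the function name are illustrative assumptions):

```python
import numpy as np

def spatial_label(e_kn_center, e_ij_center, g_c, tol=1e-6):
    """Formula 1 sketch: label an adjacent edge e_kn relative to the
    connecting edge e_ij by their distances to the reference center G_c.
    Returns 0 (closer than e_ij), 1 (equidistant), 2 (farther); the
    tolerance used for the equidistant case is an assumption.
    """
    d_kn = np.linalg.norm(np.asarray(e_kn_center) - np.asarray(g_c))
    d_ij = np.linalg.norm(np.asarray(e_ij_center) - np.asarray(g_c))
    if d_kn < d_ij - tol:
        return 0
    if abs(d_kn - d_ij) <= tol:
        return 1
    return 2
```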
Considering the temporal neighborhood, the marker function can be further modified as:
L'(e^{t2}_kn) = L(e^{t2}_kn) + (t2 - t1 + K_t) × K   (Formula 2)
where K_t is added to t2 - t1 to ensure that (t2 - t1 + K_t) is non-negative, and the final multiplication by K ensures that the label values of the temporal neighborhood differ from those of the spatial neighborhood.
In the invention, in the model of the subsequent neural network calculation, the spatial convolution of the model can be divided into two modes of edge convolution and node convolution.
Node convolution is a mode frequently used in current graph neural networks, and various common methods can be used for the node convolution part. For example, the expression data of each node may be used as the node data and the straight-line distance between two adjacent nodes as the edge data, and weights are then assigned to different expression classifications according to the relationships between edges and nodes to complete the expression recognition of the target. The node convolution methods used in various related GCN (graph convolutional neural network) techniques can also be used in the node convolution calculation of the present invention.
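For reference, one conventional node-convolution propagation step is sketched below; the normalized-adjacency form is a common choice from the GCN literature, not a rule prescribed by this disclosure:

```python
import numpy as np

def node_convolution(node_features, adjacency, weight):
    """One conventional graph-convolution step over nodes.

    node_features: (N, F_in) matrix of node expression features;
    adjacency: (N, N) 0/1 adjacency matrix of the face graph;
    weight: (F_in, F_out) learnable weight matrix.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))    # D^{-1/2}
    out = d_inv_sqrt @ a_hat @ d_inv_sqrt @ node_features @ weight
    return np.maximum(out, 0.0)                               # ReLU activation
```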
In edge convolution, the position data of each edge can be calculated from the spatial coordinates of its two end nodes: for each edge, the coordinates of the edge center are obtained by averaging the coordinates of the two nodes, and a vector is obtained by subtracting the coordinates of one end node from those of the other, the length and direction of this vector representing the length and direction between the two nodes. For example, for an edge e_ij whose two end nodes are n_i and n_j, the data of e_ij take its center coordinate as the spatial coordinate and its direction vector as the edge vector. The center and direction of edge e_ij can be calculated according to the following equations:
x_c(e_ij) = 1/2 × (x(n_i) + x(n_j))
y_c(e_ij) = 1/2 × (y(n_i) + y(n_j))
z_c(e_ij) = 1/2 × (z(n_i) + z(n_j))
Direction(e_ij) = (x(n_j) - x(n_i), y(n_j) - y(n_i), z(n_j) - z(n_i))
where x(n_i), y(n_i), z(n_i) are the three-axis coordinates of n_i, and x(n_j), y(n_j), z(n_j) are the three-axis coordinates of n_j; x_c(e_ij), y_c(e_ij), z_c(e_ij) are the coordinate values of the spatial coordinate in the x-, y- and z-axis directions; and Direction(e_ij) represents the direction of the edge vector. Thus, the target expression can be abstracted into spatial vectors of edges, each edge represented by the coordinates of its center and a vector encoding its length and direction. The edge value of each edge may then be calculated from the node values of the nodes at its two ends.
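The edge center and direction computation above reduces to a few lines; the sketch below is illustrative:

```python
import numpy as np

def edge_geometry(x_i, x_j):
    """Center coordinate and direction vector of edge e_ij, following the
    equations above. x_i, x_j: (3,) positions of end nodes n_i and n_j."""
    center = 0.5 * (np.asarray(x_i) + np.asarray(x_j))   # (x_c, y_c, z_c)
    direction = np.asarray(x_j) - np.asarray(x_i)        # Direction(e_ij)
    return center, direction
```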
Spatial convolution also involves the problem of assigning weights to adjacent edges. In this embodiment, an index function l is defined that specifies an order on the adjacent edges. For a connecting edge e_ij, each adjacent edge e_kn in its neighborhood is assigned a label value l(e_kn) indicating the order of the edge, and the weight assigned to e_kn depends on the label value l(e_kn). Because the number of adjacent edges is time-varying and may differ at different times and for different connecting edges, assigning a fixed number of weights according to the dimension of a fully connected matrix would, first, waste computation since most connecting edges do not exist, and second, make sparse-matrix operations inconvenient during data fitting. Therefore, the index function l of this embodiment does not give each adjacent edge a unique label value but maps the adjacent edges into a fixed number of subsets, the edges in the same subset sharing the same label value.
Writing l(e_kn): N(e_ij) → {1, …, K}, each edge in the neighborhood is labeled with an integer from 1 to K, which determines which weight value is assigned to that edge. Thus, even if the number of adjacent edges is not fixed, they can always be assigned K weights, since the edges are always divided into K subsets.
The labeling function L and the index function l may adopt similar definitions, which are not repeated here; throughout the description, when they are defined similarly, the two may in some cases be used in place of each other. In particular, when the temporal convolution and the spatial convolution are combined and implemented with a unified convolution calculation, the labeling function L and the index function l are also combined into one function that uniformly groups all adjacent edges; this disclosure will also use L or l for that case.
And step S130, constructing a graph structure of expression data according to the node data and the edge data of the connecting edge.
Recording the data collected and calculated in the above steps S120 and S110 into the graph structure constitutes the main data structure of the graph structure of the present invention.
Of course, according to the requirement, the data structure of the nodes and edges of the graph structure may further record the time information of the expression, the data of more dimensions of the nodes, and the like.
In step S140, the graph structure of the expression data is used as the model input, the expression recognition classification result is used as the model output, and a preset graph convolutional neural network expression recognition model is trained with supervision; wherein using the graph structure of the expression data as the model input includes: using the edge data of the connecting edges in the graph structure of the expression data as model input.
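A hedged sketch of the supervised training in step S140 is given below; the optimizer, loss, hyperparameters and the assumption that the model takes edge and node feature tensors are illustrative choices, not requirements of this disclosure:

```python
import torch
import torch.nn as nn

def train_expression_model(model, data_loader, epochs=10, lr=1e-3):
    """Supervised training sketch: graph-structure features in, expression
    class scores out.

    data_loader is assumed to yield (edge_x, node_x, label) batches built
    from the graph structure of the expression data.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()   # expression classification loss
    model.train()
    for _ in range(epochs):
        for edge_x, node_x, label in data_loader:
            optimizer.zero_grad()
            logits = model(edge_x, node_x)   # class scores for each sample
            loss = criterion(logits, label)
            loss.backward()
            optimizer.step()
    return model
```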
The convolutional network extracts a group of high-level features from the expression sequence, and the node convolutional network and the edge convolutional network extract features from different angles, so that the two groups of features (output of the convolutional layer) represent the same action sequence from different angles. Both edge convolution and node convolution have their own advantages. Edge convolutional networks utilize the dynamics of edges, while nodal convolutional networks utilize the dynamics of nodes. Because the dynamics of the nodes and the edges supplement each other, a model can be designed to simultaneously utilize the two groups of characteristics, so that the human muscle dynamics can be utilized from the angles of the nodes and the edges, and the performance of the action recognition task is further improved. Because the edge convolution simultaneously embodies the characteristics of a time neighborhood and a space neighborhood, the method has better capturing and identifying capabilities for time sequence actions such as facial expressions and the like, particularly for identifying micro expressions, and has obvious improvement on the identifying capability compared with the existing node-based convolution mode; and the requirement on the acquisition precision of the face data is greatly reduced. Good recognition results can be obtained at low accuracy data sets.
Therefore, the invention further designs two different mixed models, and combines the edge convolution model and the node convolution model according to the characteristics of different layers.
It is noted that the present invention can be implemented in the form of an embodiment in which edge convolution alone is performed without node convolution, except that the point convolution portion is removed in the hybrid model. The corresponding network structure can be derived by the person skilled in the art on his own from the teaching of the present invention. This technical solution also falls within the scope of the present invention.
From the perspective of the deep learning pipeline, the preset graph convolutional neural network comprises, connected in sequence: input layers (601, 701), graph convolution layers, fully connected layers (606, 707), and output layers (607, 708). The input layer can take two forms, nodes and edges, completing the position and vector input of the nodes and/or edges. The graph convolution layer includes a batch normalization layer, a node/edge convolution layer, and a global pooling layer. The batch normalization layer provides regularization that helps prevent overfitting, improves the generalization capability of the model, and allows higher learning rates, thereby accelerating convergence. The graph convolution layer is mainly used to complete feature extraction; the global pooling layer is used to reduce dimensionality and the number of network parameters; finally, classification is performed by the fully connected layer and the classification result is passed to the output layer.
From the perspective of the deep learning architecture, in a single edge-convolution or node-convolution network there is only one set of features, he_Seq or hn_Seq; global pooling is applied to it to obtain a representation of the entire sequence, which is then input into a fully connected layer that outputs final class scores representing the probability of the sequence being classified into each class. Alternatively, the two streams can be combined to provide two sets of features, i.e., two different representations of the same sequence. By concatenating these two representations, a tensor is formed as the input of the last fully connected layer. By connecting the outputs of the edge and node convolution streams, the features extracted from both networks contribute to the final classification result, i.e., the dynamics of both the nodes and the edges (muscles) are exploited in the classification.
This graph convolutional neural network mixing edges and nodes can take two forms. In the first form, the edge-based graph convolutional neural network and the node-based graph convolutional neural network produce their outputs separately, these outputs are concatenated after passing through their respective global pooling layers, and the result is fed into a fully connected layer for classification. In the second form, the edge-based and node-based graph convolution streams are concatenated directly, passed into the global pooling layer, and finally classified through the fully connected layer.
Referring to fig. 6 and fig. 7, two different implementations of the neural network of the present invention are shown, respectively.
In fig. 6, the expression recognition model of the graph convolutional neural network is structured as follows. The graph convolution layer comprises a first sublayer and a second sublayer connected in parallel: the first sublayer comprises a first batch regularization layer 602, an edge convolution layer 603 and a first global pooling layer 605 connected in sequence, and the second sublayer comprises a second batch regularization layer 608, a node convolution layer 604 and a second global pooling layer 609 connected in sequence. The edge convolution and the node convolution are pooled separately, their output results are then concatenated into an overall tensor, and this tensor is input to the fully connected layer and the output layer for classified output. This embodiment is the combined application of the edge convolution information and the node convolution information.
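A minimal sketch of this first hybrid form, under the same illustrative assumptions as above (PyTorch, placeholder 1x1 convolutions standing in for the edge and node convolutions, assumed channel sizes):

```python
import torch
import torch.nn as nn

class Stream(nn.Module):
    """One parallel sublayer: batch regularization -> (edge or node) convolution -> global pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # stand-in for the edge/node convolution
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                                      # x: (batch, channels, time, elements)
        return self.pool(torch.relu(self.conv(self.norm(x)))).flatten(1)

class HybridLateFusion(nn.Module):
    """Fig. 6 style: pool the edge and node streams separately, concatenate, then classify."""
    def __init__(self, in_ch=3, hidden=64, num_classes=7):
        super().__init__()
        self.edge_stream = Stream(in_ch, hidden)               # 602 -> 603 -> 605
        self.node_stream = Stream(in_ch, hidden)               # 608 -> 604 -> 609
        self.fc = nn.Linear(2 * hidden, num_classes)           # 606 -> 607

    def forward(self, edge_seq, node_seq):
        fused = torch.cat([self.edge_stream(edge_seq), self.node_stream(node_seq)], dim=1)
        return self.fc(fused)

# 40 connecting edges and 20 nodes per frame (illustrative sizes).
print(HybridLateFusion()(torch.randn(2, 3, 16, 40), torch.randn(2, 3, 16, 20)).shape)  # torch.Size([2, 7])
```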
In the structure of fig. 7, the graph convolution layer comprises a graph-structure convolution sublayer, a shared convolution sublayer and a global pooling sublayer connected in sequence. The graph-structure convolution sublayer comprises a third sublayer and a fourth sublayer connected in parallel: the third sublayer comprises a third batch regularization layer 702 and an edge convolution layer 703 connected in sequence, and the fourth sublayer comprises a fourth batch regularization layer 709 and a node convolution layer 704 connected in sequence. Convolution and pooling are then performed by the shared convolution layer 705 and the global pooling layer 706, and the result is output to the fully connected layer 707 and the output layer 708. In the scheme of fig. 7, the results of the edge convolution and the node convolution are concatenated and combined, input to the shared convolution layer and the global pooling layer, where a comprehensive convolution operation performs further feature extraction, and classification output is then produced through the fully connected layer and the output layer. This embodies the idea of extracting comprehensive information after the edge convolution and node convolution operations.
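A corresponding sketch of the second hybrid form, again under illustrative assumptions; the interpolation used to align the edge and node feature grids before concatenation is an example choice, not something the patent specifies:

```python
import torch
import torch.nn as nn

class HybridSharedConv(nn.Module):
    """Fig. 7 style: concatenate the edge and node features first, then apply a shared
    convolution, global pooling and a fully connected classifier."""
    def __init__(self, in_ch=3, hidden=64, num_classes=7):
        super().__init__()
        self.edge_branch = nn.Sequential(nn.BatchNorm2d(in_ch),                        # 702
                                         nn.Conv2d(in_ch, hidden, 1), nn.ReLU())       # 703 (stand-in)
        self.node_branch = nn.Sequential(nn.BatchNorm2d(in_ch),                        # 709
                                         nn.Conv2d(in_ch, hidden, 1), nn.ReLU())       # 704 (stand-in)
        self.shared_conv = nn.Sequential(nn.Conv2d(2 * hidden, hidden, 1), nn.ReLU())  # 705
        self.pool = nn.AdaptiveAvgPool2d(1)                                            # 706
        self.fc = nn.Linear(hidden, num_classes)                                       # 707 -> 708

    def forward(self, edge_seq, node_seq):
        e = self.edge_branch(edge_seq)
        n = self.node_branch(node_seq)
        # Align the node feature grid to the edge feature grid before concatenation (example choice).
        n = nn.functional.interpolate(n, size=e.shape[-2:])
        x = torch.cat([e, n], dim=1)
        return self.fc(self.pool(self.shared_conv(x)).flatten(1))

print(HybridSharedConv()(torch.randn(2, 3, 16, 40), torch.randn(2, 3, 16, 20)).shape)  # torch.Size([2, 7])
```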
The convolution calculation of the edge convolution layer comprises: acquiring the edge data of all connecting edges, the edge data of a connecting edge comprising an edge value that represents the node expression data of the two nodes at the two ends of the connecting edge; and calculating the output of the edge convolution layer as a weighted summation of the edge values of the connecting edges.
The temporal convolution and the spatial convolution can be simultaneously embodied in the edge convolution layer. For example, the edge convolution layer may be calculated according to the following formula:
$$f_{out}(e_{ij}) = \sum_{e_{kn} \in N(e_{ij})} \omega\bigl(l(e_{kn})\bigr)\, v_{kn}$$

where f_out(e_ij) represents the convolution output corresponding to the edge e_ij, and v_kn denotes the edge value of the adjacent connecting edge e_kn; for example, the arithmetic mean, geometric mean or weighted mean of the node values of the two end points of the edge may be taken. ω(l(e_kn)) represents the weight corresponding to the edge e_kn, obtained by classifying and weighting the adjacent edges according to the label function l of formula 3; of course, if necessary, ω(e_kn) may be used directly as the weight of the edge, so that each adjacent edge is assigned its own weight.
Optionally, to simplify the computation, the adjacent edges e_kn of a connecting edge are grouped using K = 3 subsets into three different subsets, namely centrifugal, eccentric and concentric edges, and a corresponding weight coefficient is set for each group. In this embodiment the convolution layer calculation formula can be further written as:

$$f_{out}(e_{ij}) = \sum_{P=1}^{K} \sum_{e_{kn} \in N_P(e_{ij})} \frac{1}{Z_{ij}(e_{kn})}\, \omega\bigl(l(e_{kn})\bigr)\, v_{kn}$$

$$Z_{ij}(e_{kn}) = \bigl|N_P(e_{ij})\bigr|, \quad e_{kn} \in N_P(e_{ij})$$

where ω(l(e_kn)) is the weight function, l(e_kn) is the label value calculated from the labeling function, and the weight function assigns a weight to each edge according to its label value. N(e_ij) denotes the set of adjacent edges of the edge e_ij, which comprises the spatial adjacent edges, or the spatial adjacent edges together with the temporal adjacent edges. When the set N(e_ij) is divided into K subsets, N_P(e_ij) denotes its P-th subset, P ∈ {1, 2, …, K}. Z_ij(e_kn) denotes the number of adjacent edges contained in the subset N_P(e_ij) to which e_kn belongs. The coefficient 1/Z_ij(e_kn) is introduced to balance the contributions of the adjacent edges under different label values.
Optionally, for convenience of processing, the weights take values in the range [0, 1]. That is, the label values l(e_kn) are divided into 3 subsets, each subset is assigned a weight, and each weight lies in the range [0, 1]; for example, the three weights are 0.2, 0.3 and 0.5 respectively.
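The following is an illustrative numerical sketch of the edge convolution defined above, assuming the adjacent edges of each connecting edge have already been grouped into the K = 3 label subsets and given the example weights 0.2, 0.3 and 0.5; the data layout is an assumption of this example:

```python
# Illustrative weights for the three label subsets (centrifugal, eccentric, concentric).
SUBSET_WEIGHTS = {0: 0.2, 1: 0.3, 2: 0.5}

def edge_convolution(edge_values, neighbors, labels):
    """f_out(e_ij) = sum over subsets P and adjacent edges e_kn in N_P(e_ij) of
    (1 / Z_ij(e_kn)) * w(l(e_kn)) * v_kn, for every connecting edge.

    edge_values: dict edge id -> edge value v_kn (e.g. mean of its two node values)
    neighbors:   dict edge id -> list of adjacent edge ids, the set N(e_ij)
    labels:      dict edge id -> subset label in {0, 1, 2} given by the label function l
    """
    out = {}
    for e_ij, adj in neighbors.items():
        # Z_ij: size of each label subset, used to balance the contributions.
        subset_size = {p: sum(1 for e in adj if labels[e] == p) for p in SUBSET_WEIGHTS}
        out[e_ij] = sum(SUBSET_WEIGHTS[labels[e]] * edge_values[e] / subset_size[labels[e]] for e in adj)
    return out

# Tiny hypothetical example: edge "a" has three adjacent edges falling into two subsets.
values = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1}
adjacency = {"a": ["b", "c", "d"]}
edge_labels = {"b": 0, "c": 2, "d": 2}
print(edge_convolution(values, adjacency, edge_labels))  # approximately {'a': 0.28}
```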
For node convolution, a similar calculation to edge convolution can be used:
$$f_{out}(n_i) = \sum_{x_n \in N(x_i)} \omega\bigl(l(x_n)\bigr)\, v_n \qquad (6)$$

where f_out(n_i) represents the convolution output corresponding to the node n_i, v_n denotes the value of the neighboring node x_n, ω(l(x_n)) represents the corresponding weight, and N(x_i) is the set of neighboring nodes of the node n_i. The neighboring node set may be defined in a manner similar to the adjacent edge set and is not described again here. Equation 6 considers the case of classifying and weighting according to the label function l; of course, if necessary, ω(x_n) may be used directly as the weight of the node, so that each neighboring node is assigned its own weight.
Likewise, optionally, to simplify the computation, the adjacent edges e_kn of a connecting edge are grouped using K = 3 subsets into the three different subsets of centrifugal, eccentric and concentric edges, and the nodes are grouped according to the subset of the connecting edge to which they belong, with a corresponding weight coefficient set for each group; alternatively, the nodes may be grouped directly according to the distance or positional relationship of each node and the node n_i with respect to the center point of the action. The convolution layer calculation formula can then be further written as:

$$f_{out}(n_i) = \sum_{P=1}^{K} \sum_{x_n \in N_P(x_i)} \frac{1}{Z_i(x_n)}\, \omega\bigl(l(x_n)\bigr)\, v_n$$

$$Z_i(x_n) = \bigl|N_P(x_i)\bigr|, \quad x_n \in N_P(x_i)$$

where ω(l(x_n)) is the weight function, l(x_n) is the label value calculated from the labeling function, and the weight function assigns a weight to each neighboring node according to its label value. N(x_i) denotes the set of neighboring nodes of the node n_i. When the set N(x_i) is divided into K subsets, N_P(x_i) denotes its P-th subset, P ∈ {1, 2, …, K}. Z_i(x_n) denotes the number of neighboring nodes contained in the subset N_P(x_i) to which x_n belongs. The coefficient 1/Z_i(x_n) is introduced to balance the contributions of the neighboring nodes under different label values.
Optionally, for convenience of processing, the weights take values in the range [0, 1]. That is, the label values l(x_n) are divided into 3 subsets, each subset is assigned a weight, and each weight lies in the range [0, 1]; for example, the three weights are 0.2, 0.3 and 0.5 respectively.
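The node convolution admits the same kind of sketch; under the same illustrative assumptions, only the objects being aggregated change from adjacent edges to neighboring nodes:

```python
SUBSET_WEIGHTS = {0: 0.2, 1: 0.3, 2: 0.5}   # illustrative weights, one per label subset

def node_convolution(node_values, neighbors, labels):
    """f_out(n_i) = sum over subsets P and neighboring nodes x_n in N_P(x_i) of
    (1 / Z_i(x_n)) * w(l(x_n)) * v_n, for every node."""
    out = {}
    for n_i, adj in neighbors.items():
        subset_size = {p: sum(1 for x in adj if labels[x] == p) for p in SUBSET_WEIGHTS}
        out[n_i] = sum(SUBSET_WEIGHTS[labels[x]] * node_values[x] / subset_size[labels[x]] for x in adj)
    return out

# Hypothetical example: node 0 has neighbors 1, 2, 3 falling into two label subsets.
print(node_convolution({1: 0.5, 2: 0.3, 3: 0.8}, {0: [1, 2, 3]}, {1: 0, 2: 1, 3: 1}))  # approximately {0: 0.265}
```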
Each convolution layer may be a single convolution layer or a stack of multiple convolution layers (convolution layer 1, convolution layer 2, and so on), and the fully connected layer may likewise consist of multiple layers. Different activation functions such as ReLU or tanh may be used for each layer, a softmax function is finally used for classification, and means such as dropout may be employed where necessary to prevent overfitting. Under the guidance of the spirit of the present invention, a person skilled in the art can modify and adapt the invention according to the various existing convolutional neural network structures and optimization means so as to meet the requirements of different scenarios and data volumes, and such modifications and changes fall within the protection scope of the present invention.
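For instance, a classification head combining several of the options just mentioned might be sketched as follows (illustrative layer sizes, not part of the invention):

```python
import torch
import torch.nn as nn

# Stacked fully connected layers with ReLU/tanh activations, dropout against overfitting,
# and a final softmax over the expression classes (illustrative sizes).
classifier = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 32), nn.Tanh(),
    nn.Linear(32, 7),
    nn.Softmax(dim=-1),
)
print(classifier(torch.randn(4, 128)).sum(dim=-1))  # each row of class probabilities sums to 1
```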
According to the model training method, expression data are recorded and computed on the basis of the graph structure, and deep learning is performed using the spatial and temporal data of the edges of the graph structure, which increases the amount of effective data participating in deep learning, yields better recognition accuracy, and reduces the dependence on the precision of the sample data. In addition, the processing of the node convolution layer and the edge convolution layer can be combined in the deep learning model to further improve expression recognition performance, in particular facial expression recognition performance.
The expression recognition method based on edge convolution according to the present invention is described in detail below, and referring to fig. 5, the method includes the following steps S210 to S240.
In step S210, node data of all nodes in the preset node set is obtained, where the node data includes spatial positions of the nodes and time sequences of expression data of the nodes.
The node expression data are acquired by the expression information acquisition device, with the positions of the piezoelectric film sensors corresponding to the node positions in the preset facial node set. Voltage data collected by the piezoelectric sensors arranged at the preset facial nodes on the skin of the face are acquired and preprocessed to obtain the node data of all nodes in the preset facial node set.
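A minimal preprocessing sketch under assumptions that the patent does not prescribe (baseline removal against a neutral-face reference and per-channel scaling):

```python
import numpy as np

def preprocess_voltages(voltage, positions, baseline_samples=10):
    """voltage:   (T, num_nodes) raw piezoelectric readings sampled over time
    positions: (num_nodes, 3) sensor coordinates on the mask
    Returns, per node, its spatial position plus a time series of expression data."""
    baseline = voltage[:baseline_samples].mean(axis=0)      # assume the first samples show a neutral face
    signal = voltage - baseline                             # change relative to the reference value
    signal = signal / (np.abs(signal).max(axis=0) + 1e-8)   # normalize each sensor channel
    return [{"position": positions[i], "series": signal[:, i]} for i in range(voltage.shape[1])]

nodes = preprocess_voltages(np.random.randn(100, 20), np.random.rand(20, 3))
print(len(nodes), nodes[0]["series"].shape)  # 20 (100,)
```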
In step S220, according to the node data, a connection edge set of all nodes in the node set and edge data of each connection edge are calculated, where the edge data represents a node position where node expression data changes with respect to a reference value and a change of the node expression data.
Determining the connection edge set according to the spatial positions of all nodes in the node set and the time sequence of the node expression data specifically comprises the following steps. For each time point in the time sequence of the node expression data, the nodes in the node set whose expression data change at that time point is greater than a preset threshold are acquired and taken as active nodes, and any two active nodes i and j are connected to form a connecting edge e_ij. For each connecting edge e_ij^{t1} at time t1, the spatial adjacent edges within a preset spatial neighborhood and the temporal adjacent edges within a preset temporal neighborhood are acquired to form the adjacent edge set N(e_ij^{t1}). A spatial adjacent edge within the preset spatial neighborhood means that the two adjacent edges are connected through no more than a preset number d of nodes, d being a natural number. A temporal adjacent edge within the preset temporal neighborhood means that, for the connecting edge e_ij^{t1}, the connecting edge e_ij^{t2} at any time t2 within the temporal neighborhood whose interval from t1 does not exceed the predetermined time range is considered, and this connecting edge e_ij^{t2} is also regarded as a spatial adjacent edge of the connecting edge e_ij^{t1}. The connection edge set and the edge data of each connecting edge are then calculated according to the temporal adjacent edges and spatial adjacent edges of each connecting edge.
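An illustrative sketch of this graph-construction step; the activity threshold, the data layout and the simplified spatial-neighborhood test (sharing an endpoint, a special case of being connected through no more than d nodes) are assumptions made only for the example:

```python
import itertools
import numpy as np

def build_connecting_edges(series, threshold=0.2):
    """series: (T, num_nodes) preprocessed node expression data.
    For each time step, connect every pair of active nodes (change above the threshold)."""
    edges_per_t = []
    for t in range(series.shape[0]):
        active = np.flatnonzero(np.abs(series[t]) > threshold)
        edges_per_t.append([(t, int(i), int(j)) for i, j in itertools.combinations(active, 2)])
    return edges_per_t

def adjacent_edges(edge, edges_per_t, time_range=2):
    """Adjacent edge set N(e^{t1}_{ij}): edges sharing an endpoint at the same time step
    (spatial neighbors) or at time steps within the time neighborhood (temporal neighbors)."""
    t1, i, j = edge
    neighbors = []
    for t2 in range(max(0, t1 - time_range), min(len(edges_per_t), t1 + time_range + 1)):
        for other in edges_per_t[t2]:
            if other != edge and ({i, j} & {other[1], other[2]}):
                neighbors.append(other)
    return neighbors

series = np.random.randn(16, 10) * 0.3
edges = build_connecting_edges(series)
if edges[5]:
    print(len(adjacent_edges(edges[5][0], edges)))
```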
A labeling function L of the adjacent edge set of each connecting edge is calculated, and a weight is assigned to each adjacent edge in the adjacent edge set according to the labeling function L, where the labeling function L characterizes the degree of association between each adjacent edge in the adjacent edge set of the connecting edge and the connecting edge itself.
The value of the labeling function L is one of a preset number of discrete values, determined according to the relative positional relationship between each adjacent edge and the connecting edge. Assigning a weight to each adjacent edge in the adjacent edge set according to the labeling function L comprises: determining a weight coefficient according to the positional relationship between each adjacent edge and the connecting edge and the value of the labeling function L, so that edges with the same labeling function value have the same weight.
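One possible labeling function is sketched below under assumptions of this example (three discrete labels derived from the position of an adjacent edge relative to the connecting edge and a face center point, with one shared weight per label); the actual labeling criterion of formula 3 is not reproduced here:

```python
import numpy as np

LABEL_WEIGHTS = [0.2, 0.3, 0.5]   # one weight per discrete label value, as in the example above

def label_adjacent_edge(edge_center, adj_center, face_center):
    """Discrete label in {0, 1, 2} from the relative position of an adjacent edge:
    0 if it lies farther from the face center than the connecting edge,
    1 if it lies closer, 2 if the distances are comparable."""
    d_edge = np.linalg.norm(edge_center - face_center)
    d_adj = np.linalg.norm(adj_center - face_center)
    if d_adj > d_edge + 1e-3:
        return 0
    if d_adj < d_edge - 1e-3:
        return 1
    return 2

def weight_of(edge_center, adj_center, face_center):
    """Adjacent edges with the same label value receive the same weight."""
    return LABEL_WEIGHTS[label_adjacent_edge(edge_center, adj_center, face_center)]

print(weight_of(np.array([0.3, 0.0, 0.0]), np.array([0.1, 0.0, 0.0]), np.zeros(3)))  # 0.3
```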
The central coordinate and the direction vector of each connecting edge are calculated, both being obtained from the three-dimensional position information of the two nodes connected by the edge, and are recorded into the edge data of that connecting edge.
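For example, a sketch of this geometric computation:

```python
import numpy as np

def edge_geometry(p_i, p_j):
    """Central coordinate and direction vector of the connecting edge between nodes i and j,
    computed from their three-dimensional positions; both are recorded in the edge data."""
    center = (p_i + p_j) / 2.0
    direction = p_j - p_i
    return center, direction

center, direction = edge_geometry(np.array([0.0, 1.0, 0.5]), np.array([0.2, 0.8, 0.5]))
print(center, direction)  # [0.1 0.9 0.5] [ 0.2 -0.2  0. ]
```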
In step S230, a graph structure of expression data is constructed according to the node data and the edge data of the connection edge;
in step S240, the graph structure of the expression data is input into a pre-trained graph convolutional neural network expression recognition model to obtain the expression recognition classification result output by the model. The pre-trained graph convolutional neural network expression recognition model is obtained according to the model training method described above.
Inputting the graph structure of the expression data into the pre-trained graph convolutional neural network expression recognition model may mean using only the edge data in the graph structure of the expression data as the model input. To obtain a better recognition effect, the edge data and the node data in the graph structure of the expression data may also be used together as the model input.
For a specific implementation manner of steps S210 to S240, reference may be made to the description of relevant portions of steps S110 to S140, which is not described herein again.
According to the expression recognition method, expression data are recorded and computed on the basis of the graph structure, and deep learning is performed using the spatial and temporal data of the edges of the graph structure, which increases the amount of effective data participating in deep learning, yields better recognition accuracy, and reduces the dependence on the precision of the sample data. In addition, the processing of the node convolution layer and the edge convolution layer can be combined in the deep learning model to further improve the facial expression recognition performance. Because the node data are collected by sensors that directly capture the muscle and skin movements of the face, rather than being converted from photographs or video recordings, the information loss and distortion caused by projecting three-dimensional information onto two dimensions are avoided, and accurate raw data are obtained more easily. Moreover, the graph-like topological distribution of the sensors inherently matches the data-processing mode of the graph neural network in terms of data structure, so better expression recognition results can be obtained after the collected data are processed by the GCN. The method therefore has broad application prospects, especially for scenarios in which the prior art cannot obtain good results, such as micro-expression recognition.
The facial mask is worn on the face of a user and can be used to observe the user's expression response under a given paradigm, so as to judge the user's emotional changes and the potential risk of mental illnesses such as depression, bipolar disorder and schizophrenia. The piezoelectric film sensors are distributed over the face in a multi-point array; they capture the dynamic pressure/tension changes of the facial muscles and thereby the facial expressions of the user. The sensors are mainly placed in areas where the face protrudes or is recessed and where the muscles have a large dynamic range, and the data of the multi-point sensors are used to feed back the user's emotional changes. Unlike the traditional image-based method of recognizing facial expressions, the piezoelectric films record the muscle changes of the face, from which expressions and emotional changes are inferred, so that facial expressions can be classified accurately on the basis of more accurate facial expression data and the recognition accuracy of micro-expressions can be improved.
Embodiments according to the third aspect of the present invention also provide a non-transitory computer readable storage medium, in which computer instructions are stored, and when executed, the computer instructions implement the method according to the second aspect of the present invention.
There is also provided, in accordance with an embodiment of the fourth aspect of the present invention, an expression recognition system, including: expression information acquisition device and expression recognition device.
The expression information acquisition device is used for acquiring facial expression data, wherein the expression information acquisition device is the expression information acquisition device according to the first aspect of the invention. See the above related description.
The expression recognition device is used for recognizing expressions according to the facial expression data collected by the expression information acquisition device. It can be any computing device that meets the communication and computing-power requirements, such as a personal computer, a mobile terminal or a cloud server.
The expression recognition device comprises a data acquisition module and a graph convolution expression recognition module.
The data acquisition module is used for acquiring node data of all nodes in a preset facial node set according to the facial expression data acquired by the expression information acquisition device, wherein the node data comprises the spatial positions of the nodes and the time sequence of the node expression data.
The graph convolution expression recognition module is used for performing facial expression recognition by using a pre-trained graph convolutional neural network expression recognition model. The positions of the plurality of piezoelectric film sensors in the expression information acquisition device correspond to the node positions in the preset facial node set.
The specific implementation details of each component of the expression recognition system may refer to the description of the expression recognition model training method in connection with steps S110 to S140 and of the expression recognition method in connection with steps S210 to S240, and are not repeated here.
The storage medium according to the third aspect of the present invention and the expression recognition system according to the fourth aspect of the present invention have similar advantageous effects to the method according to the second aspect, and are not described herein again.
Although the present disclosure has been described above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present disclosure, and these changes and modifications are intended to be within the scope of the present disclosure.

Claims (14)

1. An expression information acquisition device, comprising:
a mask substrate, which is used for being attached to the face of a person and is made of a flexible material; and
a plurality of piezoelectric film sensors, which are arranged on the mask substrate and used for detecting facial expressions and actions.
2. The expression information acquisition device according to claim 1, wherein
there are a plurality of mask substrates, each mask substrate having a different size and/or shape, the different mask substrates being adapted to match different facial morphologies of users.
3. The expression information acquisition device according to claim 1, further comprising:
and the control module is used for receiving the signals output by the piezoelectric film sensors, preprocessing the signals to generate expression data and sending the expression data to the expression recognition device.
4. The expression information acquisition device according to claim 3, wherein the control module includes:
the signal amplification unit is used for receiving the signals output by the piezoelectric film sensors and amplifying the signals;
the central processing unit is used for preprocessing the amplified signals output by the signal amplification unit, generating expression data and controlling the operation of the signal amplification unit, the wireless communication unit and the power supply module;
the wireless communication unit is used for communicating with an expression recognition device and sending the expression data to the expression recognition device;
the power supply module is used for supplying electric energy to the signal amplification unit, the signal processing unit and the wireless communication unit; and
a control box for accommodating the signal amplification unit, the signal processing unit, the wireless communication unit, and the power supply module.
5. The expression information acquisition device according to claim 3, further comprising:
the support part is connected with the mask substrate and the control module and used for being worn on the head of a user to form a support, so that the mask substrate and the control module are worn on the head of the user.
6. An expression recognition method, comprising:
acquiring node data of all nodes in a preset face node set, wherein the node data comprises the spatial positions of the nodes and the time sequence of node expression data; and
according to the node data, facial expression recognition is carried out by using a pre-trained graph convolution neural network expression recognition model;
the node expression data is acquired by using the expression information acquisition device according to any one of claims 1 to 5, and the arrangement positions of the piezoelectric thin film sensors correspond to the node positions in the preset facial node set.
7. The expression recognition method of claim 6, wherein performing facial expression recognition using the pre-trained graph convolutional neural network expression recognition model according to the node data comprises:
according to the node data, calculating a connection edge set of all nodes in the node set and edge data of each connection edge, wherein the edge data represent node positions of node expression data which change relative to a reference value and changes of the node expression data;
constructing a graph structure of expression data according to the node data and the edge data of the connecting edges; and
and inputting the graph structure of the expression data into a pre-trained graph convolution neural network expression recognition model to obtain an expression recognition classification result output by the model.
8. The expression recognition method according to claim 6, wherein the calculating, according to the node data, the set of connection edges of all nodes in the set of nodes and the edge data of each connection edge comprises:
determining a connection edge set according to the spatial positions of all nodes in the node set and the time sequence of the expression data of the nodes, wherein the connection edge set specifically comprises the following steps:
for each time point in the time sequence of the node expression data, acquiring the nodes in the node set whose expression data change at that time point is greater than a preset threshold value, and taking them as active nodes, wherein any two active nodes i and j are connected to form a connecting edge e_ij;
for each connecting edge e_ij^{t1} at time t1, acquiring the spatial adjacent edges within a preset spatial neighborhood and the temporal adjacent edges within a preset temporal neighborhood to form an adjacent edge set N(e_ij^{t1}), wherein a spatial adjacent edge within the preset spatial neighborhood means that the two adjacent edges are connected through no more than a preset number d of nodes, d being a natural number, and a temporal adjacent edge within the preset temporal neighborhood means that, for the connecting edge e_ij^{t1}, the connecting edge e_ij^{t2} at any time t2 within the temporal neighborhood whose interval from t1 does not exceed the predetermined time range is considered, and the connecting edge e_ij^{t2} is also regarded as a spatial adjacent edge of the connecting edge e_ij^{t1};
and calculating a connection edge set and edge data of each connection edge according to the time domain adjacent edge and the space adjacent edge of each connection edge.
9. The expression recognition method according to claim 8, wherein the calculating, based on the node data, the set of connection edges of all nodes in the set of nodes and the edge data of each connection edge further comprises:
calculating a marking function L of the adjacent edge set of each connecting edge, and distributing weight to each adjacent edge in the adjacent edge set according to the marking function L;
wherein the labeling function L is used for characterizing the association degree of each adjacent edge in the adjacent edge set of the connecting edge with the connecting edge.
10. The expression recognition method according to claim 9,
the value of the marking function L is a discrete value with a preset number, and is determined according to the relative position relationship between each adjacent edge and the connecting edge; and
assigning a weight to each adjacent edge in the set of adjacent edges according to the labeling function L comprises: and determining a weight coefficient according to the position relation between each adjacent edge and the connecting edge and the value of the marking function L, so that the edges with the same marking function value have the same weight.
11. The expression recognition method according to claim 6, wherein calculating the set of connection edges of all nodes in the node set and the edge data of each connection edge according to the node data comprises:
calculating the central coordinate and the direction vector of each connecting edge, wherein the central coordinate and the direction vector are obtained according to the three-dimensional position information of two nodes connected by the edges; and
and recording the central coordinate and the direction vector of the connecting edge into the edge data of the connecting edge.
12. The expression recognition method of claim 6, wherein the preset graph convolutional neural network comprises, connected in sequence: a data input layer, a graph convolution layer, a fully connected layer and an output layer,
wherein the structure of the graph convolution layer comprises: a first sublayer and a second sublayer which are connected in parallel and cascaded, the first sublayer comprising a first batch regularization layer, an edge convolution layer and a first global pooling layer connected in sequence, and the second sublayer comprising a second batch regularization layer, a node convolution layer and a second global pooling layer connected in sequence; or
the structure of the graph convolution layer comprises: a graph-structure convolution sublayer, a shared convolution sublayer and a global pooling sublayer connected in sequence, wherein the graph-structure convolution sublayer comprises a third sublayer and a fourth sublayer which are connected in parallel and cascaded, the third sublayer comprising a third batch regularization layer and an edge convolution layer connected in sequence, and the fourth sublayer comprising a fourth batch regularization layer and a node convolution layer connected in sequence.
13. A non-transitory computer readable storage medium having stored therein computer instructions, wherein the computer instructions, when executed, implement the expression recognition method of any one of claims 6-12.
14. An expression recognition system, comprising:
the expression information acquisition device is used for acquiring facial expression data, wherein the expression information acquisition device is the expression information acquisition device according to any one of claims 1-5;
the expression recognition device is used for recognizing expressions according to the facial expression data collected by the expression information acquisition device, and comprises:
the data acquisition module is used for acquiring node data of all nodes in a preset facial node set according to the facial expression data acquired by the expression information acquisition device, wherein the node data comprises the spatial positions of the nodes and the time sequence of the node expression data;
the graph convolution expression recognition module is used for recognizing facial expressions by using a pre-trained graph convolutional neural network expression recognition model;
the setting positions of a plurality of piezoelectric film sensors in the expression information acquisition device correspond to the node positions in the preset facial node set.
CN202011030333.0A 2020-09-27 2020-09-27 Expression information acquisition device, expression recognition method and system Active CN112183314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011030333.0A CN112183314B (en) 2020-09-27 2020-09-27 Expression information acquisition device, expression recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011030333.0A CN112183314B (en) 2020-09-27 2020-09-27 Expression information acquisition device, expression recognition method and system

Publications (2)

Publication Number Publication Date
CN112183314A true CN112183314A (en) 2021-01-05
CN112183314B CN112183314B (en) 2023-12-12

Family

ID=73945016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011030333.0A Active CN112183314B (en) 2020-09-27 2020-09-27 Expression information acquisition device, expression recognition method and system

Country Status (1)

Country Link
CN (1) CN112183314B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593465A (en) * 2009-07-06 2009-12-02 北京派瑞根科技开发有限公司 Electronics with expression shape change is drawn
US20120229248A1 (en) * 2011-03-12 2012-09-13 Uday Parshionikar Multipurpose controller for electronic devices, facial expressions management and drowsiness detection
CN106390318A (en) * 2016-10-18 2017-02-15 京东方科技集团股份有限公司 Intelligent mask and control method thereof
CN206470693U (en) * 2017-01-24 2017-09-05 广州幻境科技有限公司 A kind of Emotion identification system based on wearable device
CN110008819A (en) * 2019-01-30 2019-07-12 武汉科技大学 A kind of facial expression recognizing method based on figure convolutional neural networks
CN110390305A (en) * 2019-07-25 2019-10-29 广东工业大学 The method and device of gesture identification based on figure convolutional neural networks
CN111339847A (en) * 2020-02-14 2020-06-26 福建帝视信息科技有限公司 Face emotion recognition method based on graph convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Conghui: "Research on Facial Expression Analysis Based on Salient Features and Graph Convolution", China Master's Theses Full-text Database, Information Science and Technology Series, pages 138-803 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112606022A (en) * 2020-12-28 2021-04-06 苏州源睿尼科技有限公司 Use method of facial expression acquisition mask
CN113391706A (en) * 2021-06-25 2021-09-14 浙江工业大学 Sensor array-based motion capture device and attitude identification method thereof
WO2023128847A1 (en) * 2021-12-30 2023-07-06 Telefonaktiebolaget Lm Ericsson (Publ) Face mask for capturing speech produced by a wearer

Also Published As

Publication number Publication date
CN112183314B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
CN111127441B (en) Multi-modal brain image depression recognition method and system based on graph node embedding
Bevilacqua et al. Human activity recognition with convolutional neural networks
CN112183314B (en) Expression information acquisition device, expression recognition method and system
Kumar et al. Multimodal gait recognition with inertial sensor data and video using evolutionary algorithm
Goulart et al. Visual and thermal image processing for facial specific landmark detection to infer emotions in a child-robot interaction
CN105393252A (en) Physiologic data acquisition and analysis
Mekruksavanich et al. A multichannel cnn-lstm network for daily activity recognition using smartwatch sensor data
CN111728609A (en) Electroencephalogram signal classification method, classification model training method, device and medium
CN109919085A (en) Health For All Activity recognition method based on light-type convolutional neural networks
Wei et al. Real-time facial expression recognition for affective computing based on Kinect
Botzheim et al. Human gesture recognition for robot partners by spiking neural network and classification learning
Ehatisham-Ul-Haq et al. C2FHAR: Coarse-to-fine human activity recognition with behavioral context modeling using smart inertial sensors
CN109685148A (en) Multi-class human motion recognition method and identifying system
Kumar et al. Fusion of neuro-signals and dynamic signatures for person authentication
Al-Qaderi et al. A multi-modal person recognition system for social robots
CN112183315B (en) Action recognition model training method and action recognition method and device
Saeed et al. Automated facial expression recognition framework using deep learning
Zheng et al. Lightweight fall detection algorithm based on AlphaPose optimization model and ST-GCN
Nasir et al. Fuzzy triangulation signature for detection of change in human emotion from face video image sequence
CN111382807B (en) Image processing method, image processing device, computer equipment and storage medium
Kwaśniewska et al. Real-time facial features detection from low resolution thermal images with deep classification models
Jantawong et al. Time series classification using deep learning for har based on smart wearable sensors
Serrão et al. Human activity recognition from accelerometer with convolutional and recurrent neural networks
Adibuzzaman et al. In situ affect detection in mobile devices: a multimodal approach for advertisement using social network
Compagnon et al. Personalized posture and fall classification with shallow gated recurrent units

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant