CN116311423A - Cross-attention mechanism-based multi-mode emotion recognition method - Google Patents

Cross-attention mechanism-based multi-mode emotion recognition method

Info

Publication number
CN116311423A
CN116311423A (application number CN202310104197.2A)
Authority
CN
China
Prior art keywords
emotion
facial
voice
recognition
facial expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310104197.2A
Other languages
Chinese (zh)
Inventor
刘婷婷
周蔚博
刘海
杨兵
赵莉
张昭理
陈胜勇
李友福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Central China Normal University
Original Assignee
Hubei University
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University, Central China Normal University filed Critical Hubei University
Priority to CN202310104197.2A priority Critical patent/CN116311423A/en
Publication of CN116311423A publication Critical patent/CN116311423A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/24 Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
    • A61B5/316 Modalities, i.e. specific diagnostic methods
    • A61B5/389 Electromyography [EMG]
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Surgery (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Veterinary Medicine (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Mathematical Physics (AREA)
  • Hospice & Palliative Care (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Social Psychology (AREA)
  • Psychology (AREA)
  • Educational Technology (AREA)
  • Developmental Disabilities (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Physiology (AREA)
  • Oral & Maxillofacial Surgery (AREA)

Abstract

The invention relates to the technical field of computer vision, and in particular to a multi-mode emotion recognition method based on a cross-attention mechanism, which comprises the following steps: collecting facial expression images, voice signals and facial electromyographic signals of a user at the same moment; preprocessing each signal separately; inputting the preprocessed signals into a trained emotion recognition model to obtain single-mode emotion recognition results; and carrying out a weighted summation of the recognized facial expression type, voice emotion type and myoelectricity emotion type, and outputting the fused multi-modal emotion type. By using facial expression images, voice signals and facial electromyographic signals simultaneously in multi-modal emotion recognition and combining explicit behavior with implicit state, the invention can truly reflect the consultant's psychological and emotional state, overcomes the single-dimension and strongly subjective limitations of traditional detection methods, and is of practical significance for improving the quality of online psychological consultation and assisting it.

Description

Cross-attention mechanism-based multi-mode emotion recognition method
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-mode emotion recognition method based on a cross attention mechanism.
Background
With the national call to "improve mental health and psychological health services", the public's demand for mental health care keeps increasing. However, when in-person psychological consultation is inconvenient, accurately acquiring the emotional state of the consultant in a remote environment and providing a high-quality online psychological consultation service becomes an urgent problem. Adopting a real-time online emotion recognition method to assist psychological consultation therefore helps the counselor grasp the consultant's state in an online environment and carry out psychological support efficiently, which is of great significance.
In general, the consultant's emotion type is reflected in facial expressions, speech emotion and the like. In an online psychological consultation scenario, however, the following problems exist: (1) visual information is frequently lost because of various kinds of occlusion, so the consultant's current facial expression and emotional state cannot be judged accurately; (2) it is difficult to judge the consultant's current emotional state accurately and objectively by capturing one-sided information alone, and explicit behaviors such as facial expressions can be deceptive, which reduces the accuracy of the detection result; (3) traditional multi-modal information fusion has poor interactivity and often suffers from conflicts between different sources of information, so its judgment performance is poor; (4) for psychological counselors, auxiliary functions such as real-time feedback on the online consultant's emotional state are lacking.
Therefore, the consultant's emotion cannot be identified effectively in current online psychological consultation scenarios, and the objectivity and accuracy of recognition do not meet diagnosis and treatment requirements.
Disclosure of Invention
The invention provides a multi-mode emotion recognition method based on a cross-attention mechanism, which is used to overcome the defect in the prior art that the consultant's current emotional state cannot be accurately recognized in an online psychological consultation scenario. Through the complementarity of multi-modal information, the objectivity and accuracy of emotion recognition in the online psychological consultation scenario are improved comprehensively, so that online psychological consultation can be assisted efficiently and with high quality.
The invention provides a multi-mode emotion recognition method based on a cross attention mechanism, which comprises the following steps:
collecting facial expression images, voice signals and facial electromyographic signals of a user at the same moment;
preprocessing the facial expression image, the voice signal and the facial electromyographic signal respectively;
respectively inputting the preprocessed facial expression image, the voice signal and the facial electromyographic signal into a trained emotion recognition model to obtain a single-mode emotion recognition result, and outputting a facial expression type, a voice emotion type and an electromyographic emotion type obtained by recognition;
and based on a preset weight, carrying out weighted summation on the facial expression type, the voice emotion type and the myoelectricity emotion type which are obtained through recognition, and outputting the multimodal fusion emotion type after fusion of the characteristics.
According to the multi-modal emotion recognition method based on the cross-attention mechanism provided by the invention, the recognized facial expression type, voice emotion type and myoelectricity emotion type are compared pairwise, the comparison results between each single-modal emotion recognition result and the other two single-modal emotion recognition results are output, and if any comparison result matches a preset abnormal condition, the current user is judged to have an emotional abnormality.
According to the multi-mode emotion recognition method based on the cross attention mechanism provided by the invention, the preprocessing of the facial expression image, the voice signal and the facial electromyographic signal respectively comprises the following steps:
sequentially carrying out shielding removal, image restoration reconstruction and image enhancement operation on the facial expression image, extracting effective facial information and eliminating useless information;
sequentially carrying out pre-emphasis, framing and windowing operation on the voice signals, extracting effective voice information and eliminating useless information;
and denoising the facial electromyographic signals, extracting effective myoelectricity information, and eliminating useless information.
According to the multi-mode emotion recognition method based on the cross-attention mechanism provided by the invention, when the preprocessed facial expression image and voice signal are recognized by the trained emotion recognition model, cross-modal interaction between the facial expression image and the voice signal is performed during recognition based on the cross-attention mechanism, so that the vectors that originally carried only single-modal information for the facial expression image and the voice signal contain both image information and dialogue voice information.
According to the multi-modal emotion recognition method based on the cross attention mechanism, the preprocessed facial expression image, the preprocessed voice signal and the preprocessed facial electromyographic signal are respectively input into a trained emotion recognition model to obtain a single-modal emotion recognition result: the emotion recognition model comprises an expression recognition sub-network, a voice emotion recognition sub-network and a facial myoelectricity emotion recognition sub-network;
inputting the preprocessed facial expression image into the pre-trained expression recognition sub-network, dividing the facial expression image into a plurality of image blocks with preset dimensions, merging the image blocks until a feature map with preset dimensions is output, performing auxiliary classification by using label space topology information and label distribution learning, and outputting a classification result obtained by recognition;
inputting the preprocessed voice signals into the pre-trained voice emotion recognition sub-network, convolving the voice signals in two layers, extracting local features, and outputting a classification result obtained by recognition through an output layer;
inputting the preprocessed facial electromyographic signals into the pre-trained facial electromyographic emotion recognition sub-network, and outputting and obtaining a classification result by respectively corresponding to each feature of different sampling points of the face through a preset number of feature vectors.
According to the multi-mode emotion recognition method based on the cross-attention mechanism provided by the invention, the expression recognition sub-network is obtained by training with at least one facial expression image sample carrying various facial emotion labels, the voice processing sub-network is obtained by training with at least one audio sample carrying various voice emotion labels, and the myoelectricity recognition sub-network is obtained by training with at least one myoelectricity sample carrying myoelectricity emotion labels.
According to the multi-modal emotion recognition method based on the cross-attention mechanism provided by the invention, the recognized facial expression type, voice emotion type and myoelectricity emotion type are weighted and summed, and the fused multi-modal emotion type is output through a multi-modal fusion weighted-sum function:

$$p = w_f \cdot p_f + w_s \cdot p_s + w_e \cdot p_e ,$$

where $p_f$ is the facial expression probability distribution, $p_s$ is the speech emotion probability distribution, and $p_e$ is the myoelectricity emotion probability distribution; $w_f$, $w_s$ and $w_e$ are the preset weights of the facial expression modality, the voice modality and the facial myoelectric modality, respectively, with $w_f + w_s + w_e = 1$.

The emotion class with the maximum value in the distribution calculated by the multi-modal fusion weighted-sum function is taken as the multi-modal fusion emotion recognition result.
In another aspect, the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the steps of any one of the above-mentioned multi-modal emotion recognition methods are implemented when the processor executes the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multimodal emotion recognition method as described in any of the above.
The multi-mode emotion recognition method based on the cross attention mechanism provided by the invention has at least the following technical effects:
(1) The invention uses facial expression images, voice signals and facial electromyographic signals simultaneously in multi-modal emotion recognition, combining expression-related signals from three sources and combining explicit behavior with implicit state. This effectively reduces deception and truly reflects the consultant's psychological and emotional state; it overcomes the single-dimension and strongly subjective limitations of traditional detection methods, reduces missed and false detections, and is of practical significance for improving the quality of online psychological consultation and assisting it.
(2) Performing cross-attention interaction between facial expression images and voice signals during recognition in the expression recognition sub-network and the voice emotion recognition sub-network allows multi-modal information to be fused at a finer granularity and improves the accuracy of multi-modal fusion, which is significant for improving the multi-modal interaction scheme and the accuracy of emotion recognition.
(3) The emotional-state feedback provided by the invention returns three single-modal emotion recognition results and one multi-modal fusion result in real time, giving the psychological counselor more comprehensive information; the consultant's current emotional state can thus be evaluated in a timely, accurate and objective way, providing accurate and efficient diagnostic assistance to the counselor.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a multi-modal emotion recognition method based on a cross-attention mechanism provided by the invention;
FIG. 2 is a schematic diagram of signal acquisition of a multi-modal emotion recognition method based on a cross-attention mechanism provided by the invention;
FIG. 3 is a schematic diagram of an emotion recognition model of a multi-modal emotion recognition method based on a cross-attention mechanism.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the foregoing drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to only those steps or modules but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the term "first/second" related to the present invention is merely to distinguish similar objects, and does not represent a specific order for the objects, and it should be understood that "first/second" may interchange a specific order or precedence where allowed. It is to be understood that the "first\second" distinguishing aspects may be interchanged where appropriate to enable embodiments of the invention described herein to be implemented in sequences other than those described or illustrated herein.
In one embodiment, as shown in fig. 1, the present invention provides a multi-modal emotion recognition method based on a cross-attention mechanism, including:
collecting facial expression images, voice signals and facial electromyographic signals of a user at the same moment;
preprocessing the facial expression image, the voice signal and the facial electromyographic signal respectively;
respectively inputting the preprocessed facial expression image, the voice signal and the facial electromyographic signal into a trained emotion recognition model to obtain a single-mode emotion recognition result, and outputting a facial expression type, a voice emotion type and an electromyographic emotion type obtained by recognition;
and based on a preset weight, carrying out weighted summation on the facial expression type, the voice emotion type and the myoelectricity emotion type which are obtained through recognition, and outputting the multimodal fusion emotion type after fusion of the characteristics.
Further, after the three single-modal emotion recognition results and the multi-modal emotion recognition result are obtained, the recognized facial expression type, voice emotion type and myoelectricity emotion type are compared pairwise, the comparison results between each single-modal emotion recognition result and the other two are output, and if any comparison result matches a preset abnormal condition, the current user is judged to have an emotional abnormality.
In one embodiment, as shown in fig. 2, the online psychological consultant is a subject undergoing online psychological consultation, and a professional online psychological consultation recording device is used to collect the consultant's facial expression RGB image IMG, dialogue voice signal VOI and facial electromyographic signal EMG; for example, a camera and recorder capture the consultant's audio and video, and surface electrodes collect electromyographic data from the frown-muscle region and the cheekbone region of the face.
As an example, the professional online psychological consultation recording device includes, but is not limited to, an RGB camera, a voice recorder, a myoelectricity detection device, a computer, and an integrated device packaging a comprehensive psychological consultation analysis system.
As an example, the frame rate of the recording device is set to 30 frames/second, and one facial expression RGB image is extracted every third frame, i.e., 10 facial expression RGB images per second; for speech, the input raw audio signal is sampled at a rate of 11025 Hz; for myoelectricity, the input raw electromyographic signal is sampled at a rate of 32 Hz.
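As a purely illustrative sketch (the capture interfaces, file paths and the exact frame-skipping policy are assumptions, not part of the disclosure), the acquisition settings above might be prototyped in Python as follows:

```python
# Hypothetical acquisition helpers; the frame-skip value and sampling rates
# follow the example figures above (30 fps video, 11025 Hz audio, 32 Hz EMG).
import cv2          # video frame extraction
import librosa      # audio loading / resampling

def extract_face_frames(video_path: str, keep_every: int = 3):
    """Keep one RGB frame out of every `keep_every` frames (about 10 per second at 30 fps)."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % keep_every == 0:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def load_speech(wav_path: str, target_sr: int = 11025):
    """Load the dialogue recording resampled to the 11025 Hz rate used above."""
    signal, _ = librosa.load(wav_path, sr=target_sr, mono=True)
    return signal
```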
According to the multi-mode emotion recognition method based on the cross attention mechanism provided by the invention, the preprocessing of the facial expression image, the voice signal and the facial electromyographic signal respectively comprises the following steps:
sequentially carrying out shielding removal, image restoration reconstruction and image enhancement operation on the facial expression image, extracting effective facial information and eliminating useless information;
sequentially carrying out pre-emphasis, framing and windowing operation on the voice signals, extracting effective voice information and eliminating useless information;
and denoising the facial electromyographic signals, extracting effective myoelectricity information, and eliminating useless information.
In one embodiment, preprocessing the facial expression RGB image IMG of the online psychological consultant includes the steps of:
firstly, removing shielding, repairing and reconstructing images and enhancing images of a facial expression RGB image IMG, thereby effectively extracting facial information and eliminating useless information;
As an example, U-Net is used as the basic architecture, and the facial expression RGB image IMG is processed with a Laplacian prior network LLC and a symmetric matching module SYM; it should be noted that U-Net is a convolutional network structure for fast and accurate image segmentation.
First, the facial expression RGB image IMG is processed with the Laplace operator Δ to recover the facial expression information of the occluded image. The Laplacian prior network LLC is adopted; this sub-network contains three convolution layers, where the convolution kernel of the first layer is the Laplace operator Δ. The resulting Laplacian edge map is normalized and passed to the following two AlexNet convolution layers to further extract edge information.
After this processing, the obtained first facial expression RGB image IMG1 is sent to the next module, the symmetric matching module, to further extract edge information; the symmetric matching module performs de-occlusion, image restoration and reconstruction, and image enhancement on the first facial expression RGB image IMG1, and is embedded into the successive convolution and deconvolution layers of the U-Net for image processing, yielding the processed second facial expression RGB image IMG2.
Further, the symmetric matching module uses the similarity of the symmetric regions of the left and right halves of the face: according to the similarity of the symmetric matches in the target image, regions with high similarity are enhanced and regions with low similarity are suppressed, realizing symmetric smoothing. Specifically, the similarity is SSIM (structural similarity), which ranges from 0 to 1; high and low regions are separated by a preset similarity threshold.
The pixel loss $L_p$ is the normalized Euclidean distance between the actual output image $\hat{y}$ and the target image $y$, where $H$, $W$ and $C$ denote the height, width and number of channels of the image. $L_p$ is specifically defined as

$$L_p = \frac{1}{H W C}\,\lVert \hat{y} - y \rVert_2 .$$
The symmetry loss $L_{sym}$ is calculated next. First, the differences between the left-right symmetric regions are computed for the processed face image and for the target face image; according to the similarity of the symmetric matches in the target image, regions with high similarity are enhanced and regions with low similarity are suppressed, realizing symmetric smoothing.

Specifically, let $\Phi(\cdot)$ denote an average pooling operation and $\psi(\cdot)$ a symmetric (left-right) inversion of the image. The difference $I$ between the bilaterally symmetric regions of the target face image, and the corresponding difference $\hat{I}$ for the output image, are expressed as

$$I = \lvert \Phi(y) - \Phi(\psi(y)) \rvert, \qquad \hat{I} = \lvert \Phi(\hat{y}) - \Phi(\psi(\hat{y})) \rvert ,$$

and the symmetry loss $L_{sym}$ is calculated from the deviation between $\hat{I}$ and $I$.
The smoothing loss $L_s$ obtains its weights from the neighborhood differences of the target image: the larger the difference, the lower the weight. Here $\lambda_H$ and $\lambda_W$ denote the normalized difference weights between adjacent pixels in the H and W directions, and $\delta_H$ and $\delta_W$ denote the differences between adjacent pixels in the H and W directions.

The normalized difference weights and the adjacent-pixel differences are computed from the target image, and the smoothing loss applies these weights to the adjacent-pixel differences of the output image:

$$L_s = \sum \left( \lambda_H \odot \lvert \delta_H(\hat{y}) \rvert + \lambda_W \odot \lvert \delta_W(\hat{y}) \rvert \right).$$
In summary, the total loss combining the three loss terms is defined as

$$L_{all} = \alpha_1 L_p + \alpha_2 L_{sym} + \alpha_3 L_s ,$$

where the parameters $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the three loss terms; optionally, their ranges are set to $\alpha_1 \in [0.9, 1.1]$, $\alpha_2 \in [0.1, 0.2]$ and $\alpha_3 \in [0.1, 0.2]$.
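The three loss terms above can be sketched in PyTorch as follows; the exact forms of the symmetry and smoothing losses here are assumptions based on the textual description above:

```python
import torch
import torch.nn.functional as F

def pixel_loss(pred, target):
    # L_p: Euclidean distance between output and target, normalized by H*W*C.
    _, c, h, w = target.shape
    dist = torch.sqrt(((pred - target) ** 2).sum(dim=(1, 2, 3)) + 1e-12)
    return dist.mean() / (h * w * c)

def symmetry_loss(pred, target, pool=4):
    # Average-pool each image (Phi), compare it with its left-right flip (psi),
    # and penalize the deviation between the two difference maps.
    def sym_diff(x):
        x = F.avg_pool2d(x, pool)
        return (x - torch.flip(x, dims=[3])).abs()
    return F.l1_loss(sym_diff(pred), sym_diff(target))

def smooth_loss(pred, target):
    # Edge-aware smoothing: neighborhood differences of the target image give
    # the weights (larger difference -> lower weight), applied to the output.
    def diffs(x):
        return ((x[..., :, 1:] - x[..., :, :-1]).abs(),
                (x[..., 1:, :] - x[..., :-1, :]).abs())
    dw_p, dh_p = diffs(pred)
    dw_t, dh_t = diffs(target)
    return (torch.exp(-dw_t) * dw_p).mean() + (torch.exp(-dh_t) * dh_p).mean()

def total_loss(pred, target, a1=1.0, a2=0.15, a3=0.15):
    # L_all = a1*L_p + a2*L_sym + a3*L_s, weights chosen inside the stated ranges.
    return (a1 * pixel_loss(pred, target)
            + a2 * symmetry_loss(pred, target)
            + a3 * smooth_loss(pred, target))

print(total_loss(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)))
```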
In one embodiment, the pre-emphasis and framing windowing operations are sequentially performed on the voice signal, effective voice information is extracted, and useless information is eliminated, and the method specifically comprises the following steps:
First, a digital filter is used for pre-emphasis to obtain a first voice signal VOI1. The transfer function of the filter is defined as

$$H(z) = 1 - \mu z^{-1} ,$$

where $\mu$ is the pre-emphasis coefficient, with $\mu \in [0.9, 1]$.

Further, the first voice signal VOI1 is framed and windowed based on a Hamming window, with the window function

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 .$$

After zero correction of the volume, endpoint detection is performed using 0.1 of the maximum volume as the threshold.
As an example, after endpoint-detection segmentation each sample is split into 300 frames of 150 sampling points each, and the one-dimensional sample signals are uniformly converted into two-dimensional signals of size (300, 150) by zero padding and truncation. This is given only by way of example and should not be construed as limiting the invention.
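A minimal Python sketch of this speech preprocessing (pre-emphasis, endpoint detection, Hamming-window framing, zero padding/truncation to a (300, 150) array) is given below; the concrete value mu = 0.97 and the non-overlapping framing are assumptions within the ranges stated above:

```python
import numpy as np

def preprocess_speech(signal, mu=0.97, frame_len=150, n_frames=300, vol_thresh=0.1):
    # Pre-emphasis: y[n] = x[n] - mu * x[n-1], i.e. H(z) = 1 - mu * z^-1.
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Zero correction followed by a simple endpoint detection that keeps the
    # span whose magnitude exceeds 10% of the maximum volume.
    centred = emphasized - emphasized.mean()
    active = np.abs(centred) > vol_thresh * np.abs(centred).max()
    if active.any():
        centred = centred[active.argmax(): len(active) - active[::-1].argmax()]

    # Framing with a Hamming window, then zero padding / truncation to (300, 150).
    window = np.hamming(frame_len)
    frames = [centred[s:s + frame_len] * window
              for s in range(0, len(centred) - frame_len + 1, frame_len)]
    frames = np.asarray(frames) if frames else np.zeros((0, frame_len))
    out = np.zeros((n_frames, frame_len), dtype=np.float32)
    out[:min(n_frames, len(frames))] = frames[:n_frames]
    return out

print(preprocess_speech(np.random.randn(11025 * 5)).shape)   # (300, 150)
```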
in one embodiment, denoising the facial electromyographic signals, extracting effective electromyographic information, and eliminating useless information, which specifically includes:
First, the electromyographic signal is denoised using the wavelet transform to obtain a first facial electromyographic signal EMG1. The wavelet transform is expressed as

$$W(a, b) = \frac{1}{\sqrt{a}} \int f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt ,$$

where $\psi(t)$ is the basic wavelet function, $\psi^{*}$ is its complex conjugate, $f(t)$ is the input signal EMG, $a$ is the transformation scale, and $b$ is the displacement parameter.
optionally, selecting a D5 wavelet for wavelet transformation;
Seven features are then extracted from the processed first facial electromyographic signal EMG1: the time-domain root mean square RMS, mean absolute value MAV, variance VAR, standard deviation STD, zero crossings ZC, waveform length WS and integrated electromyographic value IEMG.
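The wavelet denoising and the seven time-domain features listed above can be sketched as follows; the soft-threshold rule and the use of PyWavelets' db5 wavelet for the 'D5' wavelet named in the text are assumptions:

```python
import numpy as np
import pywt

def denoise_emg(x, wavelet="db5", level=5):
    # Wavelet-threshold denoising sketch (universal threshold, soft rule).
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(x)))
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

def emg_features(x):
    # The seven time-domain features named above, for one channel.
    return {
        "RMS": np.sqrt(np.mean(x ** 2)),
        "MAV": np.mean(np.abs(x)),
        "VAR": np.var(x),
        "STD": np.std(x),
        "ZC": int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:]))),  # zero crossings
        "WS": np.sum(np.abs(np.diff(x))),                            # waveform length
        "IEMG": np.sum(np.abs(x)),                                   # integrated EMG
    }

emg = np.random.randn(320)          # e.g. 10 s of EMG at the 32 Hz rate above
print(emg_features(denoise_emg(emg)))
```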
According to the multi-mode emotion recognition method based on the cross-attention mechanism, when the preprocessed facial expression image and voice signal are recognized by the trained emotion recognition model, cross-modal interaction between the facial expression image and the voice signal is performed during recognition based on the cross-attention mechanism, so that the vectors that originally carried only single-modal information for the facial expression image and the voice signal contain both image information and dialogue voice information.
According to the multi-modal emotion recognition method based on the cross attention mechanism, the preprocessed facial expression image, the preprocessed voice signal and the preprocessed facial electromyographic signal are respectively input into a trained emotion recognition model to obtain a single-modal emotion recognition result: the emotion recognition model comprises an expression recognition sub-network, a voice emotion recognition sub-network and a facial myoelectricity emotion recognition sub-network;
inputting the preprocessed facial expression image into the pre-trained expression recognition sub-network, dividing the facial expression image into a plurality of image blocks with preset dimensions, merging the image blocks until a feature map with preset dimensions is output, performing auxiliary classification by using label space topology information and label distribution learning, and outputting a classification result obtained by recognition;
inputting the preprocessed voice signals into the pre-trained voice emotion recognition sub-network, convolving the voice signals in two layers, extracting local features, and outputting a classification result obtained by recognition through an output layer;
inputting the preprocessed facial electromyographic signals into the pre-trained facial electromyographic emotion recognition sub-network, and outputting and obtaining a classification result by respectively corresponding to each feature of different sampling points of the face through a preset number of feature vectors;
In one embodiment, as shown in fig. 3, which is a schematic diagram of the FVEmo-Transformer network structure of the emotion recognition model provided in this embodiment, the FVEmo-Transformer recognizer comprises an expression recognition sub-network SubNet-I, a voice emotion recognition sub-network SubNet-V, and a facial myoelectricity emotion recognition sub-network SubNet-E. The expression recognition sub-network SubNet-I is obtained by training with at least one facial expression RGB image sample carrying a facial expression label, the voice processing sub-network SubNet-V with at least one audio sample carrying an emotion label, and the myoelectricity recognition sub-network SubNet-E with at least one myoelectricity sample carrying an emotion label.
In the FVEmo-Transformer, Transformer modules are used in the expression recognition sub-network SubNet-I and the voice emotion recognition sub-network SubNet-V. After Transformer encoding, cross-modal interaction is performed using the cross-attention mechanism, so that the vectors that originally carried only single-modal information in the two models contain both expression image information and voice information.
In the FVEmo-Transformer, the expression recognition sub-network SubNet-I, the voice emotion recognition sub-network SubNet-V and the facial myoelectricity emotion recognition sub-network SubNet-E can each output a single-modal emotion result.
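As an illustration of the cross-modal interaction described above, the following PyTorch sketch lets the image token sequence attend to the speech token sequence and vice versa; the embedding dimension, head count and sequence lengths are assumptions, not values taken from the disclosure:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Bidirectional cross-attention between image tokens and speech tokens."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.img_from_speech = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.speech_from_img = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, img_tokens, speech_tokens):
        # Image tokens query the speech sequence, and vice versa, so each
        # modality's vectors end up carrying information from the other.
        img_ctx, _ = self.img_from_speech(img_tokens, speech_tokens, speech_tokens)
        spe_ctx, _ = self.speech_from_img(speech_tokens, img_tokens, img_tokens)
        return img_tokens + img_ctx, speech_tokens + spe_ctx

fused_img, fused_speech = CrossModalAttention()(torch.randn(2, 49, 256),
                                                torch.randn(2, 300, 256))
print(fused_img.shape, fused_speech.shape)
```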
Specifically, the expression recognition sub-network SubNet-I extracts facial expression RGB image features FEA-I by using a Swin-Trans architecture, and performs auxiliary classification by using tag space topology information LST and tag distribution learning LDL.
Extracting facial expression RGB image features FEA-I of facial expression RGB image data by a Swin-Trans architecture of an expression recognition sub-network; four stages are constructed, and the specific flow is as follows:
as an example, a facial expression image IMG with a size of 224×224×3 after preprocessing is taken as an input, and is divided into a plurality of image block sets with a size of 4×4 after block partitioning;
In the first stage, the set of image blocks is linearly embedded to convert the original feature dimension of the image blocks, and a self-attention module, a two-layer window Transformer block, is then applied on this basis; after processing, an output of fixed dimension is obtained. The two-layer window Transformer block consists of two consecutive window Transformer basic blocks. The first basic block comprises LN layer normalization, a window-based multi-head self-attention module (W-MSA) and a two-layer multilayer perceptron (MLP) with the nonlinear activation function GELU; an LN layer normalization module is applied before each MSA and MLP module, and a residual connection is adopted after each module. The second basic block comprises LN layer normalization, a shifted-window multi-head self-attention module (SW-MSA) and a two-layer MLP with the nonlinear activation function GELU; LN layer normalization is likewise applied before each MSA and MLP module, and each module is followed by a residual connection.
With the shifted-window partitioning, the computation of two consecutive window Transformer blocks is as follows:

$$\hat{z}^{l} = \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \qquad z^{l} = \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \qquad z^{l+1} = \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1},$$

where $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and of the MLP module, respectively.
In the second stage, block merging is performed again on the result of the first stage, and feature transformation is carried out with a window Transformer block.
To obtain multi-scale feature information, a hierarchical Transformer is constructed: the second-stage operation is repeated in the third and fourth stages, changing the feature map dimensions. Specifically, the adjacent block features of every 2x2 group are connected through a block-merging layer, a linear layer is applied to the resulting 4x96-dimensional concatenation to achieve two-fold downsampling, and a 1x1 convolution is used on the channel dimension to control the output channel dimension to 2x96.
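The block-merging step above can be sketched as follows; a linear projection stands in here for the 1x1 channel convolution mentioned in the text (the two are equivalent on the channel dimension), which is an implementation assumption:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of patch features (C -> 4C), normalize, then
    project to 2C, halving the spatial resolution (here 4*96 -> 192)."""
    def __init__(self, dim=96):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduce = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                                   # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduce(self.norm(x))                    # (B, H/2, W/2, 2C)

tokens = torch.randn(1, 56, 56, 96)        # 224/4 = 56 patches per side after stage 1
print(PatchMerging()(tokens).shape)        # torch.Size([1, 28, 28, 192])
```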
In one embodiment, the voice processing sub-network SubNet-V is formed by combining a Transformer with a support vector machine SVM, called the Tran-SVM architecture; the current voice emotion features of the online psychological consultant are extracted through the voice processing sub-network as follows.
First, two convolution layers are applied to the audio signal to extract local features; the input sizes of the two convolution layers are (300, 128, 32) and (300, 128, 1), respectively. The signal then enters a Transformer module layer, whose basic module comprises LN layer normalization, a multi-head attention module MHA and a feed-forward network module FNN, with residual connections between the modules.
The classification result and the model parameter settings are obtained through a fully connected layer (input dimension FC-dim = 64), Dropout (input dimension DP-dim = 64) and a Softmax layer (input dimension SM-dim = 4). Finally, this model is used as a pre-training model: the original audio signal is input to extract features, and a support vector machine SVM classifier is trained to obtain the SVM classification result.
The convolution layers use Xavier initialization for the convolution kernels, the activation function is ReLU, and the loss function is the cross-entropy loss, defined as

$$L = -\sum_{i} y_i \log \hat{y}_i .$$
The audio sequence position information is encoded using sine and cosine functions:

$$PE(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \qquad PE(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d}}\right),$$

where $p$ represents the actual position of the frame in the sequence and $d$ is the feature dimension of each frame vector; in one specific example, $d$ takes the value 256.
A Gaussian kernel function is selected as the data mapping of the support vector machine SVM. The function is expressed as

$$K(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right),$$

where the parameter $\sigma$ controls the radial range of action of the function.
the above is merely exemplary of embodiments of the present invention and is not to be construed as a further limitation of the present invention.
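In the same spirit, the Tran-SVM idea (a pretrained feature extractor followed by an RBF-kernel support vector machine) can be sketched with scikit-learn as below; the encode() function is a stand-in for the pretrained Transformer branch, and its 64-dimensional output is an assumption:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def encode(audio_batch: np.ndarray) -> np.ndarray:
    # Placeholder for the pretrained Transformer feature extractor.
    return audio_batch.reshape(len(audio_batch), -1)[:, :64]

X_train = encode(np.random.randn(32, 300, 150))   # dummy preprocessed audio
y_train = np.arange(32) % 4                       # four emotion classes (SM-dim = 4)

# Gaussian (RBF) kernel SVM; gamma plays the role of 1/(2*sigma^2).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
clf.fit(X_train, y_train)
print(clf.predict(encode(np.random.randn(2, 300, 150))))
```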
In one embodiment, facial myoelectricity emotion recognition sub-network SubNet-E extracts the current myoelectricity emotion characteristics FEA-E of the on-line psychological consultant, and myoelectricity processing is mainly implemented by using BP neural network, for example:
setting the number of network input layer nodes n=28, the number of network output layer nodes k=7, the number of hidden layers l=8, and selecting an activation function as Softmax;
The 28 input feature vectors correspond to the 7 features of the 4 sampling points in the frown-muscle and cheekbone regions, and the 7 output nodes correspond to the seven emotion classification results: happiness (Hap), sadness (Sad), anger (Ang), neutral (Neu), surprise (Sur), fear (Fer) and aversion (Dis). The example is given only to further explain the invention and is not intended to limit its scope.
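A minimal sketch of such a BP (fully connected) network is shown below; the value l = 8 is read here as a single hidden layer of 8 units, and the hidden-layer activation is an assumption:

```python
import torch
import torch.nn as nn

# 28 inputs (7 features x 4 facial sampling points) -> 8 hidden units -> 7 emotions.
emg_net = nn.Sequential(
    nn.Linear(28, 8),
    nn.ReLU(),
    nn.Linear(8, 7),
    nn.Softmax(dim=-1),     # Softmax output over the seven emotion classes
)

probs = emg_net(torch.randn(1, 28))
print(probs.argmax(dim=-1))   # index of the predicted emotion class
```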
according to the multi-mode emotion recognition method based on the cross attention mechanism, the expression recognition sub-network is obtained by training at least one facial expression image sample with various facial emotion labels, the voice processing sub-network is obtained by training at least one audio sample with various voice emotion labels, and the myoelectricity recognition sub-network is obtained by training at least one myoelectricity sample with myoelectricity emotion labels;
in one embodiment, the training process for the expression recognition sub-network SubNet-I includes: acquiring a facial expression RGB image of an online psychological consultant, extracting facial expression RGB image characteristics FEA-I of the online psychological consultant by using a moving window Transformer (Swin-Trans) architecture, and performing expression recognition by using label space topology information LST and label distribution learning LDL;
wherein the label space topology information LST and the label distribution learning LDL are trained using at least one facial expression RGB image sample carrying facial expression labels of happiness (Hap), sadness (Sad), anger (Ang), neutral (Neu), surprise (Sur), fear (Fer), aversion (Dis) and anxiety (Anx), comprising the following steps.
Action units and facial feature points are extracted effectively with the OpenFace method, and a K-nearest-neighbor graph corresponding to the facial expression RGB images IMG is constructed with the K-nearest-neighbor algorithm. Each image in the training set and its two neighboring images are indexed and stored, together with the local similarity, in an index-similarity list, and the backbone network is trained using this list.
The model f(x|θ) with a Softmax output maps an input x_i to a label distribution using the logical label l_i and the auxiliary label l_i^t, where l_i is the logical label of the i-th instance and l_i^t is the label of the vector x_i in the t-th auxiliary task. Assuming that the logical label is close enough to the original label to represent the true value, the loss function L is the JS divergence between the true label and the predicted label:

$$L = \sum_{i} JS\!\left(l_i \,\Vert\, f(x_i|\theta)\right).$$

The network prediction f(x_i|θ) of the i-th image and the network predictions f(x_j|θ) of its neighboring images j in the auxiliary tasks are used to guide the updating of the network parameters; a local similarity is estimated to describe the relative importance of the neighboring images' predictions, where N_t denotes the set of K nearest neighbors of x_i in the label space of auxiliary task t.

It is assumed here that images close to each other in the auxiliary label space are more likely to have similar label distributions, so the larger the local similarity, the closer f(x_i|θ) and f(x_j|θ) should be.

The task-guidance loss Ω_t(f(x|θ)) is built from these similarity-weighted discrepancies between the predictions of neighboring images.
The classification loss and the task-guidance loss Ω_t(f(x|θ)), computed using the output distributions of all input images, are minimized to optimize the expression recognition sub-network SubNet-I and make the model more robust. In practical application, after training and computation on a certain number of application samples, the parameters of the expression recognition sub-network can be further adjusted according to the loss function, so that the model keeps improving.
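The two loss terms above (a JS-divergence classification loss and a similarity-weighted neighbor-guidance term) can be sketched as follows; the exact weighting scheme is an assumption reconstructed from the description:

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    # Jensen-Shannon divergence between batches of label distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.add(eps).log() - b.add(eps).log())).sum(dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ldl_losses(pred, target, neighbor_pred, similarity):
    # Classification loss against the (soft) labels, plus a guidance term that
    # pulls each prediction towards those of its K nearest neighbors, weighted
    # by the local similarity.
    cls_loss = js_divergence(target, pred).mean()
    guide = (similarity * js_divergence(pred.unsqueeze(1).expand_as(neighbor_pred),
                                        neighbor_pred)).mean()
    return cls_loss, guide

B, K, C = 4, 2, 8
pred = F.softmax(torch.randn(B, C), dim=-1)
target = F.softmax(torch.randn(B, C), dim=-1)
neighbors = F.softmax(torch.randn(B, K, C), dim=-1)   # predictions of K neighbors
similarity = torch.rand(B, K)                         # local similarity weights
print(ldl_losses(pred, target, neighbors, similarity))
```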
According to the multi-modal emotion recognition method based on the cross-attention mechanism provided by the invention, the recognized facial expression type, voice emotion type and myoelectricity emotion type are weighted and summed, and the fused multi-modal emotion type is output through a multi-modal fusion weighted-sum function:

$$p = w_f \cdot p_f + w_s \cdot p_s + w_e \cdot p_e ,$$

where $p_f$ is the facial expression probability distribution, $p_s$ is the speech emotion probability distribution, and $p_e$ is the myoelectricity emotion probability distribution; $w_f$, $w_s$ and $w_e$ are the preset weights of the facial expression modality, the voice modality and the facial myoelectric modality, respectively, with $w_f + w_s + w_e = 1$.

The emotion class with the maximum value in the distribution calculated by the multi-modal fusion weighted-sum function is taken as the multi-modal fusion emotion recognition result.
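A minimal sketch of this weighted late fusion is shown below; the particular weight values are assumptions, since the method only requires that they sum to 1:

```python
import numpy as np

def fuse_emotions(p_face, p_speech, p_emg, w_f=0.4, w_s=0.3, w_e=0.3):
    # Weighted sum of the three single-modality probability distributions.
    assert abs(w_f + w_s + w_e - 1.0) < 1e-6
    p = w_f * np.asarray(p_face) + w_s * np.asarray(p_speech) + w_e * np.asarray(p_emg)
    return int(np.argmax(p)), p        # fused emotion class index + distribution

label, dist = fuse_emotions([0.7, 0.1, 0.2], [0.5, 0.3, 0.2], [0.6, 0.2, 0.2])
print(label, dist)
```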
Further, after the consultant's emotion results are obtained, the four results at that moment, namely facial expression recognition, voice emotion state recognition, facial myoelectricity emotion recognition and multi-modal fusion recognition, are provided to the psychological counselor in real time. As an example, the preset emotional-state anomalies may be as shown in the following tables.
TABLE 1 Expression-myoelectricity modality combination emotional anomalies
TABLE 2 Expression-speech modality combination emotional anomalies
TABLE 3 Myoelectricity-speech modality combination emotional anomalies
For example, as shown in Table 1, when the single-modal recognition result from the expression is happy but the single-modal result obtained by myoelectricity recognition is sad, the two single-modal recognition results conflict with each other. When the results of the three single-modal emotion recognition modules differ greatly, the system issues an emotional-state early warning, prompting the psychological counselor to note that the consultant's current emotional state is unstable or that the consultant shows symptoms of a specific emotional disorder, so that guidance and treatment measures such as emotional intervention can be taken in time.
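A sketch of the pairwise consistency check is given below; the anomaly table here is a hypothetical stand-in for the preset combinations of Tables 1-3:

```python
# Hypothetical preset anomaly combinations (stand-ins for Tables 1-3).
ANOMALY_PAIRS = {
    ("expression", "emg"): {("happy", "sad")},
    ("expression", "speech"): {("happy", "angry")},
    ("emg", "speech"): {("neutral", "fear")},
}

def emotion_anomaly(results: dict) -> bool:
    """results maps modality name -> predicted emotion label."""
    for (m1, m2), bad in ANOMALY_PAIRS.items():
        pair = (results[m1], results[m2])
        if pair in bad or pair[::-1] in bad:
            return True                 # conflicting single-modality results
    return False

print(emotion_anomaly({"expression": "happy", "emg": "sad", "speech": "happy"}))  # True
```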
The present invention also provides an electronic device, which may include: a processor (processor), a communication interface (Communications Interface), a memory (memory) and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus. The processor may invoke logic instructions in the memory to perform the steps of the multimodal emotion recognition method provided by the methods described above.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the steps of the multimodal emotion recognition method provided by the methods described above.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multimodal emotion recognition method provided by the methods described above.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for multi-modal emotion recognition based on a cross-attention mechanism, comprising:
collecting facial expression images, voice signals and facial electromyographic signals of a user at the same moment;
preprocessing the facial expression image, the voice signal and the facial electromyographic signal respectively;
respectively inputting the preprocessed facial expression image, the voice signal and the facial electromyographic signal into a trained emotion recognition model to obtain a single-mode emotion recognition result, and outputting a facial expression type, a voice emotion type and an electromyographic emotion type obtained by recognition;
and based on a preset weight, carrying out weighted summation on the facial expression type, the voice emotion type and the myoelectricity emotion type which are obtained through recognition, and outputting the multimodal fusion emotion type after fusion of the characteristics.
2. The method for identifying multi-modal emotion based on a cross-attention mechanism according to claim 1, wherein the recognized facial expression type, voice emotion type and myoelectricity emotion type are compared pairwise, the comparison results between each single-modal emotion recognition result and the other two single-modal emotion recognition results are output, and if any comparison result matches a preset abnormal condition, the current user is judged to have an emotional abnormality.
3. The method of claim 1, wherein preprocessing the facial expression image, the speech signal, and the facial electromyographic signal, respectively, comprises:
sequentially carrying out shielding removal, image restoration reconstruction and image enhancement operation on the facial expression image, extracting effective facial information and eliminating useless information;
sequentially carrying out pre-emphasis, framing and windowing operation on the voice signals, extracting effective voice information and eliminating useless information;
and denoising the facial electromyographic signals, extracting effective myoelectricity information, and eliminating useless information.
4. A multi-modal emotion recognition method based on a cross-attention mechanism as set forth in claim 1 or 3, wherein, when the preprocessed facial expression image and the voice signal are recognized by the trained emotion recognition model, cross-modal interaction between the facial expression image and the voice signal is performed during recognition based on the cross-attention mechanism, so that the vectors of the original single-modal information corresponding to the facial expression image and the voice signal contain both image information and dialogue voice information.
5. The method for multi-modal emotion recognition based on cross-attention mechanism of claim 3, wherein the preprocessed facial expression image, the speech signal and the facial electromyographic signal are respectively input into a trained emotion recognition model to obtain a single-modal emotion recognition result: the emotion recognition model comprises an expression recognition sub-network, a voice emotion recognition sub-network and a facial myoelectricity emotion recognition sub-network;
inputting the preprocessed facial expression image into the pre-trained expression recognition sub-network, dividing the facial expression image into a plurality of image blocks with preset dimensions, merging the image blocks until a feature map with preset dimensions is output, performing auxiliary classification by using label space topology information and label distribution learning, and outputting a classification result obtained by recognition;
inputting the preprocessed voice signals into the pre-trained voice emotion recognition sub-network, convolving the voice signals in two layers, extracting local features, and outputting a classification result obtained by recognition through an output layer;
inputting the preprocessed facial electromyographic signals into the pre-trained facial electromyographic emotion recognition sub-network, and outputting and obtaining a classification result by respectively corresponding to each feature of different sampling points of the face through a preset number of feature vectors.
6. The method of claim 5, wherein the expression recognition sub-network is trained using at least one facial expression image sample with various facial emotion tags, the voice processing sub-network is trained using at least one audio sample with various voice emotion tags, and the myoelectricity recognition sub-network is trained using at least one myoelectricity sample with myoelectricity emotion tags.
7. The method for identifying multimodal emotion based on a cross-attention mechanism according to claim 1, wherein the recognized facial expression type, voice emotion type and myoelectricity emotion type are weighted and summed, and the fused multi-modal emotion type is output through a multi-modal fusion weighted-sum function:

$$p = w_f \cdot p_f + w_s \cdot p_s + w_e \cdot p_e ,$$

wherein $p_f$ is the facial expression probability distribution, $p_s$ is the speech emotion probability distribution, and $p_e$ is the myoelectricity emotion probability distribution; $w_f$, $w_s$ and $w_e$ are the preset weights of the facial expression modality, the voice modality and the facial myoelectric modality, respectively, and $w_f + w_s + w_e = 1$;

and the maximum value of the distribution calculated by the multi-modal fusion weighted-sum function is taken as the multi-modal fusion emotion recognition result.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the cross-attention mechanism based multimodal emotion recognition method of any of claims 1 to 7 when the program is executed.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the cross-attention mechanism based multimodal emotion recognition method of any of claims 1 to 7.
CN202310104197.2A 2023-02-07 2023-02-07 Cross-attention mechanism-based multi-mode emotion recognition method Pending CN116311423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310104197.2A CN116311423A (en) 2023-02-07 2023-02-07 Cross-attention mechanism-based multi-mode emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310104197.2A CN116311423A (en) 2023-02-07 2023-02-07 Cross-attention mechanism-based multi-mode emotion recognition method

Publications (1)

Publication Number Publication Date
CN116311423A true CN116311423A (en) 2023-06-23

Family

ID=86835150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310104197.2A Pending CN116311423A (en) 2023-02-07 2023-02-07 Cross-attention mechanism-based multi-mode emotion recognition method

Country Status (1)

Country Link
CN (1) CN116311423A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543445A (en) * 2023-06-29 2023-08-04 新励成教育科技股份有限公司 Method, system, equipment and storage medium for analyzing facial expression of speaker
CN116543445B (en) * 2023-06-29 2023-09-26 新励成教育科技股份有限公司 Method, system, equipment and storage medium for analyzing facial expression of speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination