WO2023098912A1 - Image processing method and apparatus, storage medium, and electronic device - Google Patents

Image processing method and apparatus, storage medium, and electronic device

Info

Publication number
WO2023098912A1
WO2023098912A1 (PCT/CN2022/136363, CN2022136363W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
feature
classification network
emotional
Prior art date
Application number
PCT/CN2022/136363
Other languages
English (en)
Chinese (zh)
Inventor
赵鹤
陈奕名
Original Assignee
新东方教育科技集团有限公司
Priority date
Filing date
Publication date
Application filed by 新东方教育科技集团有限公司
Publication of WO2023098912A1

Classifications

    • G06N 3/0464 Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Computing arrangements based on biological models; Neural networks; Architecture; Activation functions
    • G06N 3/08 Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V 10/44 Arrangements for image or video recognition or understanding; Extraction of image or video features; Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 40/16 Recognition of biometric, human-related or animal-related patterns in image or video data; Human faces, e.g. facial parts, sketches or expressions
    • Y02D 10/00 Climate change mitigation technologies in information and communication technologies; Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of image processing, and in particular, to an image processing method, device, storage medium, and electronic equipment.
  • Emotion recognition is an inevitable part of any interpersonal communication. People observe the emotional changes of others to confirm whether their actions are reasonable and effective. With the continuous advancement of technology, emotion recognition can be performed using different features, such as the face, voice, EEG signals, and even speech content. Among these features, facial expressions are usually easier to observe.
  • the present disclosure provides an image processing method, device, storage medium and electronic equipment.
  • the first aspect of the present disclosure provides an image processing method, the method comprising: acquiring a target image including facial information; and inputting the target image into a pre-trained emotion classification network to obtain emotional information represented by the facial information in the target image;
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the obtaining the emotional information based on the feature image includes:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the training of the emotion classification network includes:
  • the training set includes a plurality of training images, each training image in the plurality of training images includes facial information and an emotional label corresponding to the facial information;
  • the target training image is input into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • the feature image of the target training image is input into the Transformer encoder to obtain the corresponding feature vector of the target training image;
  • the feature vector corresponding to the target training image is input into the fully connected layer, and the prediction label corresponding to the emotional information represented by the facial information in the target training image is obtained;
  • according to the prediction label and the pre-marked emotional label, parameters in the emotion classification network are adjusted to obtain a trained emotion classification network.
  • the fully connected layer includes an attention factor
  • the input of the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image includes:
  • the feature vector corresponding to the target training image is input into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image, as well as the weight information of the target training image;
  • the parameters in the emotion classification network are adjusted based on a cross-entropy loss function and a regularization loss.
  • the method also includes:
  • the test set includes a plurality of test images, each test image in the plurality of test images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the target test image is input into the RedNet feature extractor in the emotion classification network after the training, to obtain the feature image of the target test image;
  • the feature image of the target test image is input to the Transformer encoder to obtain the corresponding feature vector of the target test image;
  • a second aspect of the present disclosure provides an image processing device, the device comprising:
  • An acquisition module configured to acquire a target image including facial information
  • the emotion determination module is used to input the target image into the pre-trained emotional classification network to obtain the emotional information represented by facial information in the target image;
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the emotion determination module is specifically used for:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the device includes:
  • the second acquisition module is used to acquire a training set, the training set includes a plurality of training images, and each training image in the plurality of training images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • a feature extraction module configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • a feature vector determination module configured to input the feature image of the target training image into the Transformer encoder to obtain the corresponding feature vector of the target training image
  • a prediction module configured to input a feature vector corresponding to the target training image into a fully connected layer, to obtain a prediction label corresponding to emotional information represented by facial information in the target training image;
  • the adjustment module is used to adjust the parameters in the emotion classification network according to the predicted label and the emotion label pre-marked in the target training image, so as to obtain the trained emotion classification network.
  • a third aspect of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of any one of the methods described in the first aspect of the present disclosure are implemented.
  • a fourth aspect of the present disclosure provides an electronic device, including:
  • a memory on which a computer program is stored; and
  • a processor configured to execute the computer program in the memory to implement the steps of any one of the methods in the first aspect of the present disclosure.
  • by using the RedNet structure composed of involution operators as the feature extractor, the image input to the emotion classification network is pre-processed, the local details of the image are extracted, and the obtained feature image is input into the downstream module of the emotion classification network, which effectively improves the final accuracy of the emotional information output by the emotion classification network.
  • Fig. 1 is a flowchart of an image processing method shown according to an exemplary embodiment
  • Fig. 2 is a schematic diagram of an emotion classification network in a training phase according to an exemplary embodiment
  • Fig. 3 is a schematic diagram of an emotion classification network in a testing phase according to an exemplary embodiment
  • Fig. 4 is a block diagram of an image processing device according to an exemplary embodiment
  • Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment
  • Fig. 6 is another block diagram of an electronic device according to an exemplary embodiment.
  • Emotion recognition is an inevitable part of any interpersonal communication. People observe the emotional changes of others to confirm whether their actions are reasonable and effective. With the continuous advancement of technology, emotion recognition can be performed using different features, such as the face, voice, EEG signals, and even speech content. Among these features, facial expressions are usually easier to observe.
  • the facial expression recognition system mainly consists of three stages, namely face detection, feature extraction and expression recognition.
  • In the face detection stage, multiple face detectors, such as the MTCNN network and the RetinaFace network, are used to locate faces in complex scenes, and the detected faces can be further aligned.
  • For feature extraction, past studies have proposed various methods for capturing facial geometry and appearance features induced by facial expressions.
  • According to the feature type, these methods can be divided into engineered features and learning-based features.
  • Engineered features can be further divided into texture-based features, geometry-based global features, and so on.
  • Fig. 1 is a flow chart of an image processing method shown according to an exemplary embodiment.
  • the execution subject of the method may be a terminal such as a mobile phone, a computer, a notebook computer, or a server.
  • the method includes:
  • S101 Acquire a target image including facial information.
  • the facial information in the target image may include the face information of only one person, or the face information of multiple people.
  • S102 Input the target image into a pre-trained emotion classification network to obtain emotional information represented by facial information in the target image.
  • the emotion information may represent probability values of emotions such as happiness, sadness, crying, laughing, etc., corresponding to the facial information of the person in the target image.
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the involution operator has channel invariance and spatial specificity; its design is opposite to that of convolution, that is, the kernel is shared in the channel dimension, while a position-specific kernel is used in the spatial dimension for more flexible modeling.
  • the involution kernel pays different degrees of attention to different positions in space, which makes it possible to mine diverse target features more effectively without increasing the amount of parameters or computation.
  • adapting the feature weights to different spatial positions, rather than sharing and migrating the same weights everywhere, is exactly what the space-specific design principle pursues.
  • This redesign from convolution to involution reallocates computing power, directing the limited computing capacity to the positions where it yields the greatest performance; therefore RedNet, composed of involution operators, is used as the feature extractor and obtains better results than ResNet with a smaller number of parameters.
  • in this way, the image input to the emotion classification network is pre-processed, the local details of the picture are extracted, and the obtained feature image is input into the downstream module of the emotion classification network, which effectively improves the final accuracy of the emotional information output by the emotion classification network.
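  • For illustration only (not part of the claimed method), the following is a minimal PyTorch sketch of an involution-style operator of the kind RedNet is built from; the class name and hyperparameters (kernel size, group count, reduction ratio) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Illustrative involution layer (stride 1): a K x K kernel is generated
    from the feature at every spatial position and shared across the channels
    of each group -- the opposite of convolution's channel-specific,
    spatially shared kernels."""
    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16, reduction: int = 4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # Kernel-generation branch: produces K*K weights per group per position.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        kernels = self.span(self.reduce(x))                       # (B, K*K*G, H, W)
        kernels = kernels.view(b, self.groups, 1, self.k * self.k, h, w)
        patches = self.unfold(x).view(b, self.groups, c // self.groups,
                                      self.k * self.k, h, w)      # K*K neighbourhoods
        out = (kernels * patches).sum(dim=3)                       # position-specific weighting
        return out.view(b, c, h, w)

# Example: a 64-channel feature map keeps its shape after involution.
y = Involution2d(64)(torch.randn(2, 64, 56, 56))   # -> (2, 64, 56, 56)
```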
  • the obtaining the emotional information based on the feature image includes:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the feature image may include multiple feature sub-image patches
  • inputting the feature image into the Transformer encoder includes: stretching (flattening) the multiple feature sub-image patches into vectors and inputting them into the Transformer encoder respectively.
  • the multi-head self-attention module linearly projects the outputs of multiple attention heads to the desired dimension; using multiple attention heads makes it possible to learn both local and global dependencies in an image.
  • the multi-layer perceptron contains two layers with Gaussian Error Linear Unit (GELU) activations and is used together with Layer Normalization (LN), which can improve training time and generalization performance. Residual connections are applied after each module because they allow gradients to flow directly through the network without passing through the non-linear layers.
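  • As a hedged illustration of the components just listed, a Transformer encoder block with layer normalization, multi-head self-attention, a two-layer GELU perceptron, and residual connections could be sketched as follows; the embedding dimension, head count, and dropout rate are assumed values.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Illustrative encoder block: LayerNorm -> multi-head self-attention -> residual,
    then LayerNorm -> two-layer MLP with GELU -> residual."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_ratio: int = 4, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim), nn.Dropout(dropout),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # (batch, num_patches, dim)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]   # attention + residual
        return tokens + self.mlp(self.norm2(tokens))                  # MLP + residual

# Example: 49 flattened feature patches of dimension 256 for one image.
out = TransformerEncoderBlock()(torch.randn(1, 49, 256))   # -> (1, 49, 256)
```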
  • In facial expression recognition systems based on convolutional neural networks (CNNs), key features can be extracted and learned through training on data sets.
  • many cues come from certain parts of the face, such as the mouth and eyes, while other parts, such as the background and hair, contribute little to the output. This means that an ideal model framework should focus only on the important parts of the face, be less sensitive to other facial areas, and generalize better to special cases such as occlusion and blur.
  • Accordingly, a Transformer-based framework for facial expression recognition is adopted, which takes the above observations into account and utilizes an attention mechanism to focus on salient parts of the face; using a Transformer encoder instead of deep convolutional models can achieve very high accuracy.
  • the training of the emotion classification network includes:
  • the training set includes a plurality of training images, each training image in the plurality of training images includes facial information and an emotional label corresponding to the facial information;
  • the target training image is input into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • the feature image of the target training image is input into the Transformer encoder to obtain the corresponding feature vector of the target training image;
  • the feature vector corresponding to the target training image is input into the fully connected layer, and the prediction label corresponding to the emotional information represented by the facial information in the target training image is obtained;
  • according to the prediction label and the pre-marked emotional label, parameters in the emotion classification network are adjusted to obtain a trained emotion classification network.
  • In this way, the untrained initial emotion classification network is trained into an emotion classification network that can accurately identify and classify the emotions represented by the facial information in an image.
  • the fully connected layer includes an attention factor
  • inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the predicted label corresponding to the emotional information represented by the facial information in the target training image includes:
  • inputting the feature vector corresponding to the target training image into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image, as well as the weight information of the target training image;
  • By adding an attention factor to the fully connected layer, the true quality of the samples in the training set can be estimated.
  • A high weight indicates that the sample performs well, its accuracy is high, and it plays a large role during training; otherwise, the sample performs poorly, its accuracy is low, and its effect during training is not ideal.
  • In this way, the neural network focuses on the samples whose actual effect is better and more reliable, which can effectively improve the accuracy of training.
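  • One possible sketch of such a fully connected head, returning class logits together with a per-sample weight produced by an attention factor, is shown below; the two-branch layout and the sigmoid weight are illustrative assumptions rather than the exact structure of the disclosure.

```python
import torch
import torch.nn as nn

class WeightedClassifierHead(nn.Module):
    """Illustrative fully connected head: returns class logits together with a
    per-sample importance weight alpha in (0, 1) playing the role of the
    'attention factor' described above."""
    def __init__(self, dim: int = 256, num_classes: int = 7):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)          # prediction branch
        self.attention = nn.Sequential(                # importance-weight branch
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor):             # feat: (batch, dim)
        logits = self.fc(feat)
        alpha = self.attention(feat).squeeze(-1)       # one weight per sample
        return logits, alpha

# Example: logits for 7 emotion classes plus a reliability weight per image.
logits, alpha = WeightedClassifierHead()(torch.randn(8, 256))
```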
  • the training of the emotion classification network also includes a method of inputting the training set into the SCN network (Self-Cure Network) to automatically repair the wrong label in the sample.
  • the SCN network includes a self-attention importance weighting module (Self-Attention Importance Weighting) and a relabeling module.
  • the self-attention importance weighting module is used to generate a weight ⁇ i for each sample xi in the training set as a measure of the importance of the sample xi in the training set.
  • the self-attention importance weighting module is trained using the RR-loss (rank regularization loss).
  • the calculation of the RR-loss includes: sorting a batch of samples according to α_i, and dividing the samples into a high-score group and a low-score group according to the ratio β.
  • L_RR denotes the RR-loss, α_H denotes the average weight of the high group, and α_L denotes the average weight of the low group.
  • L_RR, α_H and α_L satisfy the following formula: L_RR = max(0, δ1 - (α_H - α_L)),
  • where δ1 is a fixed or learnable margin used to separate the mean weight of the high group from that of the low group.
  • dividing the samples into a high-score group and a low-score group according to the ratio β includes:
  • the best grouping should satisfy argmax_M distance(α_H, α_L), where α_H here denotes the set of high-group sample weights and α_L denotes the set of low-group sample weights;
  • the distance can be instantiated as argmax_M (min_{i∈[0,M)} α_i - max_{i∈[M,N)} α_i), that is, the split point M is chosen to maximize the gap between the smallest weight in the high group and the largest weight in the low group.
  • grouping is performed according to the actual weight of each batch of samples, which can avoid instability in training while realizing adaptive grouping.
  • an adaptive grouping method is proposed, grouping according to the actual weight of each batch of samples, which effectively improves the accuracy of the weight output by the model.
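  • Assuming the loss form L_RR = max(0, δ1 - (α_H - α_L)) and the gap-maximizing split described above, the rank regularization with adaptive grouping could be sketched as follows; the default values of β and δ1 are illustrative.

```python
import torch

def rank_regularization_loss(alpha: torch.Tensor, beta: float = 0.7,
                             delta1: float = 0.15, adaptive: bool = True) -> torch.Tensor:
    """Illustrative rank-regularization loss over a batch of sample weights alpha
    (shape [N]): sort the weights, split them into a high group and a low group,
    and require mean(high) to exceed mean(low) by at least the margin delta1."""
    alpha_sorted, _ = torch.sort(alpha, descending=True)
    n = alpha_sorted.numel()
    if n < 2:
        return alpha.new_zeros(())                     # need at least two samples to rank
    if adaptive:
        gaps = alpha_sorted[:-1] - alpha_sorted[1:]    # gap between neighbouring ranks
        m = int(torch.argmax(gaps).item()) + 1         # split at the largest gap
    else:
        m = max(1, int(beta * n))                      # fixed-ratio split
    alpha_h = alpha_sorted[:m].mean()                  # mean weight of the high group
    alpha_l = alpha_sorted[m:].mean()                  # mean weight of the low group
    return torch.clamp(delta1 - (alpha_h - alpha_l), min=0.0)
```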
  • the method also includes:
  • the test set includes a plurality of test images, each test image in the plurality of test images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the target test image is input into the RedNet feature extractor in the trained emotional classification network to obtain the feature image of the target test image;
  • the feature image of the target test image is input to the Transformer encoder to obtain the corresponding feature vector of the target test image;
  • the CNN model, the attention model, and the Transformer model are all mathematically maximum likelihood estimation models.
  • the maximum likelihood estimation model is unbiased and the weights are fixed.
  • any model weights in the real world should tend to be Gaussian rather than fixed. Therefore, the maximum likelihood estimation cannot effectively estimate the uncertainty of the data.
  • Human expressions are extremely complex; for example, panic mixed with surprise, or tears brought on by laughter, are blends of different expressions rather than a single expression. Therefore, using a model with fixed weights to estimate an uncertain task is a contradiction in itself.
  • MC-dropout is a way of understanding dropout based on Bayesian theory, which interprets dropout as a Bayesian approximation of a Gaussian process; with it, ordinary models gain the ability to evaluate uncertainty in the way Bayesian neural networks do.
  • Using the MC-dropout layer only requires an input to be tested n times during testing to obtain a set of sampling points, from which the mean and variance are calculated; the variance is used to predict the uncertainty of the samples in the test set. The larger the variance, the higher the uncertainty of the prediction.
  • When testing, the backbone outputs the features O_b ∈ R^(1×p).
  • The fully connected weight W_fc is sampled n times with dropout; the weights obtained by sampling may be denoted Ŵ_fc^(1), ..., Ŵ_fc^(n). The MC-dropout layer can then be defined by the following formulas: O_mean = mean_n(O_b Ŵ_fc^(i)) and O_var = variance_n(O_b Ŵ_fc^(i)).
  • The variance_n(·) function computes the variance over the n sampled outputs, i.e. the sample variance corresponding to O_mean.
  • The uncertainty of the prediction results can be measured based on the maximum value of O_var: the larger the variance, the higher the uncertainty.
  • Dropout can also be implemented in other layers; it is only necessary to ensure that the computation before that layer is run only once, so that when the input reaches the MC-dropout layer the repeated sampling reduces to a matrix operation.
  • Bayesian estimation can be used for uncertainty analysis by replacing the fully connected layer with the MC-dropout layer during the testing phase.
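  • A minimal sketch of such test-time MC-dropout is given below; it applies dropout to the backbone feature before the fully connected layer, which is the usual practical stand-in for resampling W_fc itself, and the number of samplings n and the dropout rate are assumed values.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(feature: torch.Tensor, fc: nn.Linear, n: int = 30, p: float = 0.5):
    """Illustrative MC-dropout at test time: the backbone feature O_b (1 x p_dim)
    is computed once, dropout followed by the fully connected layer is run n times,
    and the mean and variance of the n outputs give the prediction and its uncertainty."""
    drop = nn.Dropout(p)
    drop.train()                                        # keep dropout active during testing
    samples = torch.stack([fc(drop(feature)) for _ in range(n)])   # (n, 1, num_classes)
    o_mean = samples.mean(dim=0)                        # averaged class scores
    o_var = samples.var(dim=0)                          # per-class sample variance
    uncertainty = o_var.max()                           # larger variance => less certain
    return o_mean, uncertainty
```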
  • The emotion classification network 20 comprises an input module 21, a RedNet feature extractor 22, a Transformer encoder 23, a fully connected layer 24 and a classifier 25 connected in series;
  • the training of the emotion classification network 20 includes: inputting the training set through the input module 21 into the RedNet feature extractor 22 to obtain multiple feature image patches of any training image in the training set; inputting the multiple feature image patches into the Transformer encoder 23 to obtain the feature vector of that training image; inputting the feature vector into the fully connected layer 24 to obtain the probability value of each emotional category represented by the facial information in the image; inputting the probability values into the classifier 25 to obtain the emotional category with the highest probability; and, according to this emotional category and the pre-marked label information in the training set, adjusting the parameters in the emotion classification network 20 based on the cross-entropy loss function and the regularization loss to obtain the trained emotion classification network.
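  • Putting these pieces together, one possible training step for the pipeline of Fig. 2 is sketched below, reusing the WeightedClassifierHead and rank_regularization_loss sketches above; weighting the logits by alpha and the loss weighting factor lambda_rr are assumptions, not necessarily the exact scheme of the disclosure.

```python
import torch.nn.functional as F

def training_step(images, labels, rednet, encoder, head, optimizer, lambda_rr: float = 1.0):
    """Illustrative training step: RedNet features -> Transformer encoder
    (assumed to return one pooled vector per image) -> weighted fully connected
    head, optimised with cross-entropy on alpha-weighted logits plus the
    rank-regularization loss on the sample weights."""
    feats = encoder(rednet(images))                     # (batch, dim) pooled features
    logits, alpha = head(feats)                         # class logits + sample weights
    ce = F.cross_entropy(alpha.unsqueeze(-1) * logits, labels)      # weighted prediction loss
    loss = ce + lambda_rr * rank_regularization_loss(alpha)         # add the regularization loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```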
  • the present disclosure also provides a schematic diagram of an emotion classification network in a test phase according to an exemplary embodiment as shown in FIG. 3 .
  • the emotion classification network 30 includes a trained input module 31, RedNet feature extractor 32, Transformer encoder 33, MC-dropout layer 34, and classifier 35.
  • the testing of the emotion classification network 30 includes: inputting the test set through the input module 31 into the RedNet feature extractor 32 to obtain multiple feature image patches of any test image in the test set; inputting the multiple feature image patches into the Transformer encoder 33 to obtain the feature vector of that test image; and inputting the feature vector into the MC-dropout layer 34 for multiple samplings to obtain the output of each sampling, from which the mean and variance are calculated.
  • In the present disclosure, RedNet and a Transformer are used together as the feature extractor for the first time, in combination with Bayesian-based MC-dropout. In addition, in order to deal with the blurred pictures and ambiguous labels contained in the training set, the training method of SCN is utilized and further improved.
  • FIG. 4 is a block diagram of an image processing device 40 according to an exemplary embodiment.
  • the device 40 can be used as a part of a terminal such as a mobile phone, or as a part of a server.
  • the device 40 includes:
  • the first obtaining module 41 is used to obtain the target image comprising facial information
  • the emotion determination module 42 is used to input the target image into the emotional classification network that has been trained in advance to obtain the emotional information represented by facial information in the target image;
  • the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image according to the target image, so as to obtain the emotion information based on the feature image.
  • the emotion determination module 42 is specifically used for:
  • the feature image is input into a Transformer encoder to obtain a feature vector corresponding to the target image, and the Transformer encoder includes a multi-head self-attention module, a multi-layer perceptron and a layer normalization module;
  • the feature vector is input into the fully connected layer to obtain the emotional information represented by the facial information in the target image.
  • the device 40 also includes:
  • the second acquisition module is used to acquire a training set, the training set includes a plurality of training images, and each training image in the plurality of training images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the first feature extraction module is configured to, for any target training image in the training set, input the target training image into the RedNet feature extractor in the initial emotion classification network to obtain the feature image of the target training image;
  • the first feature vector determination module is used to input the feature image of the target training image into the Transformer encoder to obtain the feature vector corresponding to the target training image;
  • a prediction module configured to input a feature vector corresponding to the target training image into a fully connected layer, to obtain a prediction label corresponding to emotional information represented by facial information in the target training image;
  • the adjustment module is used to adjust the parameters in the emotion classification network according to the predicted label and the emotion label pre-marked in the target training image, so as to obtain the trained emotion classification network.
  • the fully connected layer includes an attention factor
  • the prediction module is specifically used for:
  • the feature vector corresponding to the target training image is input into the fully connected layer to obtain the prediction label corresponding to the emotional information represented by the facial information in the target training image, as well as the weight information of the target training image;
  • the adjustment module is specifically used for:
  • the parameters in the emotion classification network are adjusted based on a cross-entropy loss function and a regularization loss.
  • the device 40 also includes:
  • the third acquisition module is used to acquire a test set, the test set includes a plurality of test images, and each test image in the plurality of test images includes facial information and a pre-marked emotional label corresponding to the facial information;
  • the second feature extraction module is configured to, for any target test image in the test set, input the target test image into the RedNet feature extractor in the trained emotion classification network to obtain the feature image of the target test image;
  • the second feature vector determination module is configured to input the feature image of the target test image into the Transformer encoder to obtain the corresponding feature vector of the target test image;
  • the first determination module is used to input the feature vector corresponding to the target test image into the MC-dropout layer, and determine the uncertainty information of the target test image;
  • the second determination module is used to determine whether the uncertainty information of the plurality of test images satisfies a preset rule, and if the preset rule is satisfied, to take the trained emotion classification network as the final emotion classification network.
  • Fig. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment.
  • the electronic device 500 may include: a processor 501 and a memory 502 .
  • the electronic device 500 may also include one or more of a multimedia component 503 , an input/output (I/O) interface 504 , and a communication component 505 .
  • I/O input/output
  • the processor 501 is used to control the overall operation of the electronic device 500, so as to complete all or part of the steps in the above-mentioned image processing method.
  • the memory 502 is used to store various types of data to support the operation of the electronic device 500; for example, these data may include instructions for any application or method operating on the electronic device 500 and application-related data, such as images in the training set and the test set.
  • the memory 502 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Multimedia components 503 may include screen and audio components.
  • the screen can be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals.
  • an audio component may include a microphone for receiving external audio signals.
  • the received audio signal may be further stored in the memory 502 or sent through the communication component 505 .
  • the audio component also includes at least one speaker for outputting audio signals.
  • the I/O interface 504 provides an interface between the processor 501 and other interface modules, which may be a keyboard, a mouse, buttons, and the like. These buttons can be virtual buttons or physical buttons.
  • the communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices.
  • the communication component 505 may include: a Wi-Fi module, a Bluetooth module, an NFC module and the like.
  • the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above-mentioned image processing method.
  • a computer-readable storage medium including program instructions, and when the program instructions are executed by a processor, the steps of the above-mentioned image processing method are realized.
  • the computer-readable storage medium may be the above-mentioned memory 502 including program instructions, and the above-mentioned program instructions can be executed by the processor 501 of the electronic device 500 to complete the above-mentioned image processing method.
  • Fig. 6 is a block diagram of an electronic device 600 according to an exemplary embodiment.
  • the electronic device 600 may be provided as a server.
  • the electronic device 600 includes a processor 622 , the number of which may be one or more, and a memory 632 for storing computer programs executable by the processor 622 .
  • the computer program stored in memory 632 may include one or more modules each corresponding to a set of instructions.
  • the processor 622 may be configured to execute the computer program to perform the above-mentioned image processing method.
  • the electronic device 600 may further include a power supply component 626 and a communication component 650, the power supply component 626 may be configured to perform power management of the electronic device 600, and the communication component 650 may be configured to implement communication of the electronic device 600, for example, wired or wireless communication.
  • the electronic device 600 may further include an input/output (I/O) interface 658 .
  • the electronic device 600 can operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, and so on.
  • a computer-readable storage medium including program instructions, and when the program instructions are executed by a processor, the steps of the above-mentioned image processing method are implemented.
  • the non-transitory computer-readable storage medium may be the above-mentioned memory 632 including program instructions, and the above-mentioned program instructions can be executed by the processor 622 of the electronic device 600 to implement the above-mentioned image processing method.
  • a computer program product is also provided, comprising a computer program executable by a programmable device, the computer program having a code portion for performing the above-mentioned image processing method when executed by the programmable device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Image processing method and apparatus, storage medium, and electronic device, relating to the field of image processing. The method comprises: acquiring a target image including facial information (S101); and inputting the target image into a pre-trained emotion classification network to obtain emotional information represented by the facial information in the target image (S102), wherein the emotion classification network includes a RedNet feature extractor composed of involution operators, and the RedNet feature extractor is used to obtain a feature image from the target image so as to obtain the emotional information on the basis of the feature image. A RedNet structure composed of involution operators is used as the feature extractor, the image input into the emotion classification network is pre-processed, local details of the image are extracted, and the obtained feature image is input into a downstream module of the emotion classification network, so that the final accuracy of the emotional information output by the emotion classification network is effectively improved.
PCT/CN2022/136363 2021-12-02 2022-12-02 Image processing method and apparatus, storage medium, and electronic device WO2023098912A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111473999.8A CN116229530A (zh) 2021-12-02 2021-12-02 图像处理方法、装置、存储介质及电子设备
CN202111473999.8 2021-12-02

Publications (1)

Publication Number Publication Date
WO2023098912A1 true WO2023098912A1 (fr) 2023-06-08

Family

ID=86579171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136363 WO2023098912A1 (fr) 2021-12-02 2022-12-02 Image processing method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN116229530A (fr)
WO (1) WO2023098912A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058405A (zh) * 2023-07-04 2023-11-14 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端
CN117079324A (zh) * 2023-08-17 2023-11-17 厚德明心(北京)科技有限公司 一种人脸情绪识别方法、装置、电子设备及存储介质
CN117611933A (zh) * 2024-01-24 2024-02-27 卡奥斯工业智能研究院(青岛)有限公司 基于分类网络模型的图像处理方法、装置、设备和介质
CN117689998A (zh) * 2024-01-31 2024-03-12 数据空间研究院 非参数自适应的情绪识别模型、方法、系统和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194347A (zh) * 2017-05-19 2017-09-22 深圳市唯特视科技有限公司 一种基于面部动作编码系统进行微表情检测的方法
CN107423707A (zh) * 2017-07-25 2017-12-01 深圳帕罗人工智能科技有限公司 一种基于复杂环境下的人脸情绪识别方法
CN113221639A (zh) * 2021-04-01 2021-08-06 山东大学 一种基于多任务学习的代表性au区域提取的微表情识别方法
CN113591718A (zh) * 2021-07-30 2021-11-02 北京百度网讯科技有限公司 目标对象识别方法、装置、电子设备和存储介质
CN113705541A (zh) * 2021-10-21 2021-11-26 中国科学院自动化研究所 基于Transformer的标记选择和合并的表情识别方法及系统


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI DUO, HU JIE, WANG CHANGHU, LI XIANGTAI, SHE QI, ZHU LEI, ZHANG TONG, CHEN QIFENG: "Involution: Inverting the Inherence of Convolution for Visual Recognition", ARXIV.ORG, 10 March 2021 (2021-03-10), pages 1 - 12, XP093070355 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058405A (zh) * 2023-07-04 2023-11-14 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端
CN117058405B (zh) * 2023-07-04 2024-05-17 首都医科大学附属北京朝阳医院 一种基于图像的情绪识别方法、系统、存储介质及终端
CN117079324A (zh) * 2023-08-17 2023-11-17 厚德明心(北京)科技有限公司 一种人脸情绪识别方法、装置、电子设备及存储介质
CN117079324B (zh) * 2023-08-17 2024-03-12 厚德明心(北京)科技有限公司 一种人脸情绪识别方法、装置、电子设备及存储介质
CN117611933A (zh) * 2024-01-24 2024-02-27 卡奥斯工业智能研究院(青岛)有限公司 基于分类网络模型的图像处理方法、装置、设备和介质
CN117689998A (zh) * 2024-01-31 2024-03-12 数据空间研究院 非参数自适应的情绪识别模型、方法、系统和存储介质
CN117689998B (zh) * 2024-01-31 2024-05-03 数据空间研究院 非参数自适应的情绪识别模型、方法、系统和存储介质

Also Published As

Publication number Publication date
CN116229530A (zh) 2023-06-06

Similar Documents

Publication Publication Date Title
WO2023098912A1 (fr) Procédé et appareil de traitement d'image, support de stockage, et dispositif électronique
TWI773189B (zh) 基於人工智慧的物體檢測方法、裝置、設備及儲存媒體
KR20190081243A (ko) 정규화된 표현력에 기초한 표정 인식 방법, 표정 인식 장치 및 표정 인식을 위한 학습 방법
US20230119593A1 (en) Method and apparatus for training facial feature extraction model, method and apparatus for extracting facial features, device, and storage medium
CN111860362A (zh) 生成人脸图像校正模型及校正人脸图像的方法和装置
US11681923B2 (en) Multi-model structures for classification and intent determination
CN111133453A (zh) 人工神经网络
WO2020238353A1 (fr) Procédé et appareil de traitement de données, support de stockage et dispositif électronique
Liu et al. Real-time facial expression recognition based on cnn
CN110598638A (zh) 模型训练方法、人脸性别预测方法、设备及存储介质
CN112712068B (zh) 一种关键点检测方法、装置、电子设备及存储介质
US20230036338A1 (en) Method and apparatus for generating image restoration model, medium and program product
WO2021217937A1 (fr) Procédé et dispositif d'apprentissage de modèle de posture, procédé et dispositif de reconnaissance de posture
Krishnan et al. Detection of alphabets for machine translation of sign language using deep neural net
CN113221695B (zh) 训练肤色识别模型的方法、识别肤色的方法及相关装置
CN110717407A (zh) 基于唇语密码的人脸识别方法、装置及存储介质
WO2024071884A1 (fr) Appareil et procédé de génération d'image de personne à tête chauve, appareil d'expérience de coiffure virtuelle comprenant un appareil de génération d'image de personne à tête chauve, et procédé de coiffure virtuelle l'utilisant
Tewari et al. Real Time Sign Language Recognition Framework For Two Way Communication
RU2768797C1 (ru) Способ и система для определения синтетически измененных изображений лиц на видео
CN115116117A (zh) 一种基于多模态融合网络的学习投入度数据的获取方法
CN115457365A (zh) 一种模型的解释方法、装置、电子设备及存储介质
WO2022178833A1 (fr) Procédé d'entraînement de réseau de détection de cible, procédé de détection de cible et appareil
CN112101185A (zh) 一种训练皱纹检测模型的方法、电子设备及存储介质
CN117576279B (zh) 基于多模态数据的数字人驱动方法及系统
CN113610064B (zh) 笔迹识别方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22900716

Country of ref document: EP

Kind code of ref document: A1