CN113312942A - Data processing method and equipment and converged network architecture - Google Patents

Data processing method and equipment and converged network architecture

Info

Publication number
CN113312942A
CN113312942A (application CN202010122774.7A)
Authority
CN
China
Prior art keywords
face
feature
features
state
human face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010122774.7A
Other languages
Chinese (zh)
Other versions
CN113312942B (en)
Inventor
杨攸奕
刘力哲
古鉴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010122774.7A priority Critical patent/CN113312942B/en
Publication of CN113312942A publication Critical patent/CN113312942A/en
Application granted granted Critical
Publication of CN113312942B publication Critical patent/CN113312942B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/169 Holistic features and representations, i.e. based on the facial image taken as a whole
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a data processing method and device and a converged network architecture. Through the converged network, the application can be flexibly adapted to different application scenarios, and its operating efficiency is further improved.

Description

Data processing method and equipment and converged network architecture
Technical Field
The present application relates to, but is not limited to, artificial intelligence technology, and in particular to a data processing method and apparatus and a converged network architecture.
Background
It is increasingly common to use various sensors for state detection, for example ADAS systems that perform fatigue detection on drivers, or systems that assess students' mental states for learning-concentration evaluation. However, the sensor types, setups and specific detection methods of these two current application scenarios are highly scene-dependent and cannot be copied to other scenarios; that is, the detection methods provided in the related art are limited to a particular application scenario and cannot be applied across different application scenarios.
Disclosure of Invention
The application provides a data processing method and device and a converged network architecture, which can be flexibly applied to different application scenarios.
An embodiment of the present application provides a data processing method, which comprises the following steps:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the human face overall features and the human face local features, wherein the features except the main features are auxiliary features;
and calculating the physiological state corresponding to the face included in the image data by fusing a neural network based on the main feature and the auxiliary feature.
In an exemplary instance, the acquiring of the whole face features and the at least one local face feature corresponding to the image data includes:
acquiring a face state sequence, at least one face characteristic state sequence and a face deep learning characteristic sequence corresponding to a plurality of frames of images in the image data;
the selecting of the main features from the whole human face features and the local human face features comprises:
and in the human face state sequence and at least one facial feature state sequence, one of the human face state sequence and the at least one facial feature state sequence is taken as the main feature, and the rest are taken as the auxiliary features.
In an exemplary instance, the calculating the physiological state corresponding to the face included in the image data includes:
performing fusion processing through the fusion neural network based on the face state sequence and at least one face feature state sequence to obtain more than two optimized main feature sequences;
taking more than two optimized main feature sequences as auxiliary weights, and taking the face deep learning feature sequences as main features to perform fusion processing to obtain optimized face deep learning feature sequences;
and classifying the optimized human face deep learning feature sequence to obtain the detection result of the physiological state.
In an illustrative example, the face deep learning feature sequence includes a face Convolutional Neural Network (CNN) feature sequence.
In one illustrative example, the multi-frame image includes N consecutive frame images.
In one illustrative example, frame skipping is allowed in the consecutive N-frame images.
In one illustrative example, the at least one facial feature state sequence comprises: a first facial feature state sequence, a second facial feature state sequence, …, an Mth facial feature state sequence, where M is an integer greater than or equal to 1;
the acquiring more than two optimized main feature sequences comprises:
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the first facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized first facial feature state sequences after fusion processing;
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the second facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized second facial feature state sequences after fusion processing;
by analogy, inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the Mth facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized Mth facial feature state sequences after fusion processing;
and inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the human face state sequence as the main feature, into the fusion network; outputting (M+1) optimized human face state sequences after fusion processing.
In an exemplary embodiment, the optimized face deep learning feature sequence is processed by a neural network classifier.
In one illustrative example, the method further comprises:
detecting a single frame image, and extracting a face image and at least one face local image;
and detecting the face image and at least one face local image, and temporarily storing the obtained face state information, at least one face characteristic state information and face deep learning characteristic information.
In an exemplary instance, the face state sequence includes the face states respectively corresponding to the plurality of frames of images;
the face feature state sequence comprises the face feature states respectively corresponding to the multiple frames of images;
the face deep learning feature sequence comprises the face deep learning features respectively corresponding to the multiple frames of images.
In one illustrative example, the method further comprises:
detecting fatigue expression according to the calculated physiological state corresponding to the human face included in the image data; or,
detecting drunk driving according to the calculated physiological state corresponding to the face included in the image data.
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing any of the data processing methods described above.
The application also provides a data processing device, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the data processing method described above.
The present application further provides a converged network architecture, comprising: a convolutional neural network (CNN) processing module and a fusion processing module; wherein:
the CNN processing module is used for performing CNN operation on the input multiple auxiliary weights;
and the fusion processing module is used for carrying out element multiplication on the CNN operated result and the input main characteristic to output more than two optimized main characteristic sequences.
In one illustrative example, the plurality of auxiliary weights employ a concatenated input.
In an exemplary embodiment, the dimension of the output result after the plurality of auxiliary weights perform CNN operation is the same as the dimension of the main feature of the input.
In one illustrative example, the CNN structure is a backend converged network LAF-Net.
The present application further provides a data processing apparatus, comprising: a first acquisition unit, a second acquisition unit, a first processing unit and a second processing unit; wherein:
the first acquisition unit is used for acquiring image data to be recognized, wherein the image data comprises a human face;
the second acquisition unit is used for acquiring the whole human face features and at least one local human face feature corresponding to the image data;
the first processing unit is used for selecting main features from the whole human face features and the local human face features, wherein the features except the main features are auxiliary features;
and the second processing unit is used for calculating the physiological state corresponding to the face included in the image data through a fusion neural network based on the main feature and the auxiliary feature.
In an exemplary embodiment, the second obtaining unit is specifically configured to: acquiring a face state sequence, at least one face characteristic state sequence and a face deep learning characteristic sequence corresponding to a plurality of frames of images in the image data;
the first processing unit is specifically configured to: and in the human face state sequence and at least one facial feature state sequence, one of the human face state sequence and the at least one facial feature state sequence is taken as the main feature, and the rest are taken as the auxiliary features.
In one illustrative example, the second processing unit includes a first fusion module, a second fusion module, and a classification module, wherein,
the first fusion module is used for performing fusion processing through the fusion neural network based on the human face state sequence and at least one facial feature state sequence to obtain more than two optimized main feature sequences;
the second fusion module is used for taking the more than two optimized main feature sequences as auxiliary weights and the human face deep learning feature sequence as the main feature to perform fusion processing, so as to obtain an optimized human face deep learning feature sequence;
and the classification module is used for processing the optimized human face deep learning feature sequence by using a neural network classifier to obtain the physiological state detection result.
In one illustrative example, further comprising:
the detection unit is used for detecting the single-frame image and extracting a face image and at least one face local image; and detecting the face image and at least one face local image, and temporarily storing the obtained face state information, at least one face characteristic state information and face deep learning characteristic information.
In an illustrative example, the first fusion module or the second fusion module comprises the converged network described in any of the above examples.
The application also provides a data processing method, which comprises the following steps:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the human face overall features and the human face local features, wherein the features except the main features are auxiliary features;
and calculating the emotional state corresponding to the face included in the image data by fusing a neural network based on the main feature and the auxiliary feature.
The present application further provides a data processing method, including:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the whole features and the local features of the human face, wherein the features except the main features are auxiliary features;
acquiring more than two optimized main feature information by fusing a neural network based on the main feature and the auxiliary feature;
and performing makeup recommendation according to the more than two optimized main characteristic information.
The embodiments of the application achieve flexible application to different application scenarios through the design of the fusion network, without being limited to a cockpit, a classroom, an office or the like; moreover, the operating efficiency of the application is further improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification. They illustrate embodiments of the subject matter and, together with the description, serve to explain its principles without limiting it.
FIG. 1 is a flow chart of a data processing method of the present application;
FIG. 2 is a schematic diagram illustrating the components of the converged network architecture of the present application;
FIG. 3 is a schematic process diagram illustrating an embodiment of drowsiness detection according to the present application;
fig. 4 is a schematic diagram of a configuration of a data processing apparatus according to the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one described here.
Fig. 1 is a flowchart of a data processing method according to the present application, as shown in fig. 1, including:
step 100: acquiring image data to be recognized, wherein the image data comprises a human face.
Step 101: and acquiring the whole face features and at least one face local feature corresponding to the image data.
In one illustrative example, the step may include:
and acquiring a face state sequence, at least one face characteristic state sequence and a face deep learning characteristic sequence corresponding to a plurality of frames of images in the image data.
In an exemplary instance, the face deep learning feature sequence may include, but is not limited to, a face Convolutional Neural Network (CNN) feature sequence. A CNN is a kind of Feedforward Neural Network that includes convolution computation and has a deep structure, and is one of the representative algorithms of deep learning.
In one illustrative example, the facial feature state of each frame of image includes, but is not limited to:
a first facial feature state, such as an eye state, including a left eye state, a right eye state, such as open eye, closed eye, drowsiness, etc.;
and/or, a second facial feature state, such as a mouth state, e.g., normal, yawning, laughing, etc.;
and/or, an Mth facial feature state, such as an eyebrow state, etc., where M is an integer greater than or equal to 0.
In one illustrative example, the face states include, but are not limited to: normal, head lowered, etc.
In an exemplary example, the face state sequence is composed of face states corresponding to multiple frames of images respectively, the face feature state sequence is composed of face feature states corresponding to multiple frames of images respectively, and the face deep learning feature sequence is composed of face deep learning features corresponding to multiple frames of images respectively, such as face CNN features.
In one illustrative example, the multi-frame image may be a continuous N-frame image.
In one illustrative example, frame skipping is allowed in the consecutive N frame images. Here, frame skipping may be implemented by changing the number of frames transmitted per second (FPS).
Wherein the value of N can be changed by changing the duration of the input image sequence.
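As a hedged illustration (not part of the application), the following Python sketch shows how an N-frame sequence might be assembled with frame skipping implemented by changing the effective frames per second; the function name, parameters and the use of OpenCV are illustrative assumptions.

    # Illustrative sketch: sampling N frames with frame skipping controlled by an
    # effective FPS, where N also grows or shrinks with the sequence duration.
    import cv2

    def sample_frames(video_path, target_fps=15.0, duration_s=2.0):
        cap = cv2.VideoCapture(video_path)
        native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
        step = max(int(round(native_fps / target_fps)), 1)  # frame skipping factor
        n_frames = int(target_fps * duration_s)             # N follows the duration
        frames, idx = [], 0
        while len(frames) < n_frames:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                frames.append(frame)
            idx += 1
        cap.release()
        return frames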
Step 102: and selecting main features from the whole features and the local features of the human face, wherein the features except the main features are auxiliary features.
In one illustrative example, the step may include:
and in the human face state sequence and at least one facial feature state sequence, one of the human face state sequence and the at least one facial feature state sequence is used as a main feature, and the rest are used as auxiliary features.
Step 103: and calculating the physiological state corresponding to the face included in the image data by fusing the neural network based on the obtained main features and the auxiliary features.
In one illustrative example, the step may include:
performing fusion processing through a fusion neural network based on the face state sequence and at least one face characteristic state sequence to obtain more than two optimized main characteristic sequences;
taking more than two optimized main feature sequences as auxiliary weights, and taking the face deep learning feature sequence as a main feature to carry out fusion processing to obtain an optimized face deep learning feature sequence;
and classifying the optimized human face deep learning feature sequence to obtain a detection result of the physiological state.
In an exemplary example, the obtaining more than two optimized main feature sequences may include:
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the first facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized first facial feature state sequences after fusion processing;
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the second facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized second facial feature state sequences after fusion processing;
and so on, until:
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the Mth facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized Mth facial feature state sequences after fusion processing;
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the human face state sequence as the main feature, into the fusion network; outputting (M+1) optimized human face state sequences after fusion processing.
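As a hedged sketch of this rotation of the main feature (assuming each state sequence is a tensor of shape (batch, channels, N) and fusion_net is a placeholder callable implementing the converged network described below), the loop might look as follows:

    # Illustrative sketch: every state sequence takes a turn as the main feature,
    # while all state sequences together serve as the concatenated auxiliary weights.
    import torch

    def optimize_state_sequences(state_seqs, fusion_net):
        # state_seqs: [feature_1_seq, ..., feature_M_seq, face_state_seq]
        aux = torch.cat(state_seqs, dim=1)           # concatenated auxiliary weights
        optimized = []
        for main in state_seqs:                      # M facial-feature sequences + face state
            optimized.append(fusion_net(aux, main))  # optimized main feature sequence
        return optimized                             # (M + 1) optimized sequences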
In an illustrative example, the fusion network in the embodiments of the present application may include a plurality of auxiliary-weight inputs and one main-feature input, where all inputs have the same sequence length N, that is, each input is derived from the same number of frames. As shown in Fig. 2, the converged network may include: a CNN processing module and a fusion processing module; wherein:
the CNN processing module is used for performing CNN operation on the input multiple auxiliary weights;
and the fusion processing module is used for carrying out element multiplication on the CNN operated result and the input main characteristic to output more than two optimized main characteristic sequences.
In an exemplary embodiment, the plurality of auxiliary weights may be input in concatenated form, that is, the plurality of auxiliary weights are concatenated and then input as a single auxiliary weight; the three parts separated by bold black lines in the leftmost box of Fig. 2 indicate that different auxiliary weights are concatenated together and input to the converged network.
In an exemplary embodiment, the dimension of the output result after the CNN operation is performed on the plurality of auxiliary weights is the same as the dimension of the input main feature.
In an exemplary embodiment, the CNN that processes the auxiliary weights can be any type of CNN, and can also be replaced by a sequence network such as a Recurrent Neural Network (RNN), for example a Long Short-Term Memory (LSTM) network. An RNN is a type of temporal network and performs better on time-sequenced input.
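A minimal PyTorch sketch of such a fusion module is given below; the 1-D convolution layers, their sizes and the sigmoid gating are illustrative assumptions, the only constraint taken from the description being that the CNN output has the same dimensions as the main feature so that element-wise multiplication is possible. Replacing aux_cnn with an LSTM would realize the RNN variant mentioned above.

    # Illustrative sketch of the fusion module: the concatenated auxiliary weights
    # pass through a small CNN whose output matches the main-feature shape, and the
    # result is combined with the main feature by element-wise multiplication.
    import torch
    import torch.nn as nn

    class FusionModule(nn.Module):
        def __init__(self, aux_channels, main_channels):
            super().__init__()
            self.aux_cnn = nn.Sequential(
                nn.Conv1d(aux_channels, main_channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(main_channels, main_channels, kernel_size=3, padding=1),
                nn.Sigmoid(),  # assumed gating-style output; not specified in the text
            )

        def forward(self, aux, main):
            # aux:  (batch, aux_channels, N)  concatenated auxiliary weight sequences
            # main: (batch, main_channels, N) main feature sequence of length N
            weights = self.aux_cnn(aux)
            return main * weights          # element-wise multiplication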
In an exemplary example, based on the embodiments of the present application, those skilled in the art will readily understand that the overall framework of the converged network shown in Fig. 2 can also perform face tracking in multi-person situations and perform the fusion processing on each individual separately.
In an illustrative example, the CNN structure may be, but is not limited to, a back-end fusion network (LAF-Net). LAF-Net is a CNN based on Late Fusion, a structural design of CNNs for video in which each frame of the video is processed separately in a backbone CNN and the contents of the frames are fused only when they reach the main network. A typical complete CNN structure consists of a backbone plus a main network; the backbone is itself a CNN architecture and is responsible for extracting features for use by the main network. It should be noted that the CNN in the present application may also be an existing network, which is not intended to limit the scope of the present application.
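As a hedged illustration of the late-fusion idea only (the actual LAF-Net structure is not given here), a backbone can process each frame independently and the per-frame features can then be fused before the main network:

    # Illustrative late-fusion sketch: a shared backbone CNN processes each frame
    # separately; the per-frame features are fused afterwards for the main network.
    # This is a generic pattern, not the LAF-Net of the application.
    import torch
    import torch.nn as nn

    class LateFusionNet(nn.Module):
        def __init__(self, backbone, feat_dim, num_outputs):
            super().__init__()
            self.backbone = backbone                      # per-frame feature extractor
            self.head = nn.Linear(feat_dim, num_outputs)  # simplified main network

        def forward(self, frames):
            # frames: (batch, N, C, H, W)
            b, n = frames.shape[:2]
            per_frame = self.backbone(frames.flatten(0, 1))  # (b * n, feat_dim)
            per_frame = per_frame.view(b, n, -1)
            fused = per_frame.mean(dim=1)                    # late fusion across frames
            return self.head(fused)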
The application optimizes the existing detection results through the fusion network and the relationships among the serialized detection results. In this way, an operating efficiency of 15 FPS can be achieved even on low-end equipment.
In an exemplary embodiment, the optimized human face deep learning feature sequence is classified to obtain a physiological state detection result.
In an exemplary embodiment, the optimized human face deep learning feature sequence can be processed by a neural network classifier to obtain a physiological state detection result.
In the present application, the form of the neural network classifier is not limited; for example, average pooling followed by two fully connected layers (Average Pooling + 2FC) may be adopted.
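A possible sketch of such a classifier head, with assumed feature and class dimensions, is shown below.

    # Illustrative sketch of the classifier head: temporal average pooling over the
    # optimized face deep-learning feature sequence followed by two fully connected
    # layers. Dimensions are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class StateClassifier(nn.Module):
        def __init__(self, feat_dim=256, num_states=2):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool1d(1)   # average pooling over the N frames
            self.fc = nn.Sequential(
                nn.Linear(feat_dim, 64),
                nn.ReLU(),
                nn.Linear(64, num_states),        # e.g. drowsy / not drowsy
            )

        def forward(self, seq):
            # seq: (batch, feat_dim, N) optimized face deep-learning feature sequence
            pooled = self.pool(seq).squeeze(-1)   # (batch, feat_dim)
            return self.fc(pooled)                # class logits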
In one illustrative example, the method of the present application further comprises:
detecting a single frame image, and extracting a face image and at least one face local image;
and detecting the face image and at least one face local image, and temporarily storing the obtained face state information, at least one face characteristic state information and face deep learning characteristic information.
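A minimal sketch of this per-frame stage is given below; the detector and classifier callables bundled in the detectors object are hypothetical placeholders, not components defined by the application.

    # Illustrative sketch: per-frame detection and temporary buffering of the face
    # state, the facial-feature states and the face deep-learning (CNN) feature.
    from collections import deque

    class FrameBuffer:
        def __init__(self, n_frames):
            self.face_states = deque(maxlen=n_frames)
            self.eye_states = deque(maxlen=n_frames)
            self.mouth_states = deque(maxlen=n_frames)
            self.deep_features = deque(maxlen=n_frames)

        def push(self, frame, detectors):
            face_img, eye_img, mouth_img = detectors.crop_face_and_parts(frame)
            self.face_states.append(detectors.face_state(face_img))     # e.g. normal / head lowered
            self.eye_states.append(detectors.eye_state(eye_img))        # e.g. open / closed
            self.mouth_states.append(detectors.mouth_state(mouth_img))  # e.g. normal / yawning
            self.deep_features.append(detectors.cnn_feature(face_img))  # face CNN feature

        def ready(self):
            return len(self.face_states) == self.face_states.maxlen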
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing any of the data processing methods described above.
The present application further provides a data processing apparatus comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any of the data processing methods described above.
ADAS systems in the related art rely on various non-visual sensors and can only be used in a vehicle, whereas the technical solution provided by the application can also be used outside the vehicle;
smart-campus solutions in the related art emphasize motion detection and cannot detect fine facial expressions such as eye closure and dense blinking, whereas the technical solution provided by the application uses the mutual complementarity between consecutive frames to improve the detection accuracy of fine expressions (such as eye closure, continuous blinking, squinting and the like) as well as of large-amplitude expressions (such as yawning, dozing, resting the chin on a hand and the like);
learning-concentration detection devices in the related art require the user to wear contact equipment, whereas the technical solution provided by the application can be conveniently implemented with only an infrared camera or an RGB camera;
intelligent driving solutions in the related art only perform single-frame detection with simple filtering, whereas the technical solution provided by the application performs further fusion optimization on top of single-frame detection by using time-series information and the fusion network, thereby further improving the recognition effect;
stand-alone early-warning devices (ear-hung type) in the related art can only detect fatigue actions involving the whole head, whereas the technical solution provided by the application detects fatigue performance related to the eyes, the mouth and the like;
stand-alone early-warning devices (infrared camera) in the related art are simpler and can only detect eye fatigue actions, whereas the technical solution provided by the application detects a wider range of fatigue performance.
In an exemplary embodiment, drunk-driving detection may also be implemented according to the obtained physiological state, that is, it is determined whether the person corresponding to the face in the image data is driving while intoxicated.
In summary, the application achieves flexible application to different application scenarios through the design of the fusion network, without being limited to a cockpit, a classroom, an office or the like; moreover, the operating efficiency of the application is further improved.
Fig. 3 is a schematic process diagram of an embodiment of drowsiness state detection according to the present application. In this embodiment, the facial feature states include two facial feature states, namely the eyes (left eye and right eye) and the mouth. As shown in Fig. 3, the detection process includes:
First, N frames of images are detected to obtain an eye state sequence consisting of N eye states, a mouth state sequence consisting of N mouth states and a face state sequence consisting of N face states; CNN processing is performed on the N frames of images to obtain a face deep learning feature sequence consisting of N face deep learning features.
Then, the eye state sequence, the mouth state sequence and the face state sequence are used as auxiliary weights (thin solid arrows in Fig. 3) and the eye state sequence is used as the main feature (thick solid arrow in Fig. 3), and they are input into the fusion network; (2+1) = 3 optimized eye state sequences are output after fusion processing. The eye state sequence, the mouth state sequence and the face state sequence are likewise used as auxiliary weights (thin solid arrows in Fig. 3) with the mouth state sequence as the main feature (thick solid arrow in Fig. 3) and input into the fusion network; (2+1) = 3 optimized mouth state sequences are output after fusion processing. The eye state sequence, the mouth state sequence and the face state sequence are again used as auxiliary weights (thin solid arrows in Fig. 3) with the face state sequence as the main feature (thick solid arrow in Fig. 3) and input into the fusion network; (2+1) = 3 optimized face state sequences are output after fusion processing.
Next, the optimized eye state sequence, the optimized mouth state sequence and the optimized face state sequence are used as auxiliary weights (thin solid arrows in Fig. 3) and the face deep learning feature sequence is used as the main feature (thick solid arrow in Fig. 3), and they are input into the fusion network; (2+1) = 3 optimized face deep learning feature sequences are output after fusion processing.
Finally, the 3 optimized face deep learning feature sequences are input into a neural network classifier and processed to obtain the physiological state detection result, namely the drowsiness state. The drowsiness state in this embodiment is a binary yes/no state, i.e., drowsy or not drowsy.
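The stages above can be wired together roughly as in the following sketch, reusing placeholder components in the spirit of the earlier sketches (two fusion networks and an average-pooling + 2FC classifier); all names and shapes are illustrative assumptions.

    # Illustrative end-to-end sketch of the drowsiness-detection embodiment.
    # eye_seq, mouth_seq, face_seq, deep_seq are (batch, channels, N) tensors;
    # fuse_states and fuse_deep are converged networks as sketched above, and
    # classifier is an average-pooling + 2FC head. All are placeholders.
    import torch

    def detect_drowsiness(eye_seq, mouth_seq, face_seq, deep_seq,
                          fuse_states, fuse_deep, classifier):
        # Stage 1: optimize each state sequence, with all state sequences as auxiliary weights.
        state_aux = torch.cat([eye_seq, mouth_seq, face_seq], dim=1)
        opt_eye = fuse_states(state_aux, eye_seq)
        opt_mouth = fuse_states(state_aux, mouth_seq)
        opt_face = fuse_states(state_aux, face_seq)

        # Stage 2: the optimized state sequences become auxiliary weights for the
        # face deep-learning feature sequence, which is now the main feature.
        deep_aux = torch.cat([opt_eye, opt_mouth, opt_face], dim=1)
        opt_deep = fuse_deep(deep_aux, deep_seq)

        # Stage 3: classify the optimized deep-learning feature sequence.
        logits = classifier(opt_deep)             # drowsy vs. not drowsy
        return logits.argmax(dim=-1)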
In an illustrative example, the present application further provides a data processing method, comprising:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the whole features and the local features of the human face, wherein the features except the main features are auxiliary features;
and calculating, based on the main features and the auxiliary features and through the fusion neural network, the emotional state corresponding to the face included in the image data, such as calm, depressed, excited, angry and the like.
In an exemplary example, whether the person corresponding to the face is prone to extreme emotions can be evaluated by aggregating the obtained emotional states corresponding to the face, so that, for example, when the person purchases car insurance, the insured amount can be adjusted appropriately according to the person's emotional state, and so on.
In an illustrative example, the present application further provides a data processing method, comprising:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the whole features and the local features of the human face, wherein the features except the main features are auxiliary features;
acquiring more than two optimized main feature information by fusing a neural network based on the main feature and the auxiliary feature;
and performing makeup recommendation according to the more than two optimized main characteristic information.
In this embodiment, the feature information obtained by fusing different parts of the face is optimized, which further improves the reliability of the makeup recommendation.
In an exemplary embodiment, the data processing method provided by the application can also be applied to a live-streaming scenario, in which the physiological state, the emotional state and the like corresponding to the face included in the image data are detected.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to the present application. As shown in Fig. 4, the apparatus at least includes: a first acquisition unit, a second acquisition unit, a first processing unit and a second processing unit; wherein:
the first acquisition unit is used for acquiring image data to be recognized, wherein the image data comprises a human face;
the second acquisition unit is used for acquiring the whole human face features and at least one local human face feature corresponding to the image data;
the first processing unit is used for selecting main features from the whole features and the local features of the human face, wherein the features except the main features are auxiliary features;
and the second processing unit is used for calculating the physiological state corresponding to the face included in the image data through fusing the neural network based on the main characteristic and the auxiliary characteristic.
In an exemplary embodiment, the second obtaining unit is specifically configured to:
acquiring a face state sequence, at least one face characteristic state sequence and a face deep learning characteristic sequence corresponding to a plurality of frames of images in the image data;
the first processing unit is specifically configured to: and in the human face state sequence and at least one facial feature state sequence, one of the human face state sequence and the at least one facial feature state sequence is taken as the main feature, and the rest are taken as the auxiliary features.
In an illustrative example, the second processing unit includes a first fusion module, a second fusion module, and a classification module, wherein,
the first fusion module is used for carrying out fusion processing through a fusion neural network based on the human face state sequence and at least one facial feature state sequence to obtain more than two optimized main feature sequences;
the second fusion module is used for taking more than two optimized main feature sequences as auxiliary weights and taking the face deep learning feature sequence as a main feature to carry out fusion processing to obtain the optimized face deep learning feature sequence;
and the classification module is used for processing the optimized human face deep learning feature sequence by using a neural network classifier to obtain a detection result of the physiological state.
In an illustrative example, the data processing apparatus of the present application further includes:
the detection unit is used for detecting the single-frame image and extracting a face image and at least one face local image; and detecting the face image and at least one face local image, and temporarily storing the obtained face state information, at least one face characteristic state information and face deep learning characteristic information.
In one illustrative example, the first or second fusion module comprises:
the CNN processing module is used for performing CNN operation on the input multiple auxiliary weights;
and the fusion processing module is used for carrying out element multiplication on the CNN operated result and the input main characteristic to output more than two optimized main characteristic sequences.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (24)

1. A method of data processing, comprising:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the human face overall features and the human face local features, wherein the features except the main features are auxiliary features;
and calculating the physiological state corresponding to the face included in the image data by fusing a neural network based on the main feature and the auxiliary feature.
2. The data processing method according to claim 1, wherein the acquiring of the whole face features and the at least one local face feature corresponding to the image data comprises:
acquiring a face state sequence, at least one face characteristic state sequence and a face deep learning characteristic sequence corresponding to a plurality of frames of images in the image data;
the selecting of the main features from the whole human face features and the local human face features comprises:
and in the human face state sequence and at least one facial feature state sequence, one of the human face state sequence and the at least one facial feature state sequence is taken as the main feature, and the rest are taken as the auxiliary features.
3. The data processing method according to claim 2, wherein the calculating the physiological state corresponding to the face included in the image data comprises:
performing fusion processing through the fusion neural network based on the face state sequence and at least one face feature state sequence to obtain more than two optimized main feature sequences;
taking more than two optimized main feature sequences as auxiliary weights, and taking the face deep learning feature sequences as main features to perform fusion processing to obtain optimized face deep learning feature sequences;
and classifying the optimized human face deep learning feature sequence to obtain the detection result of the physiological state.
4. The data processing method of claim 2, wherein the face deep learning feature sequence comprises a face Convolutional Neural Network (CNN) feature sequence.
5. The data processing method according to claim 2, wherein the multi-frame image includes N consecutive frame images.
6. The data processing method of claim 5, wherein frame skipping is allowed in the consecutive N-frame images.
7. The data processing method of claim 3, wherein the at least one facial feature state sequence comprises: a first facial feature state sequence, a second facial feature state sequence, …, an Mth facial feature state sequence, M being an integer greater than or equal to 1;
the acquiring more than two optimized main feature sequences comprises:
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the first facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized first facial feature state sequences after fusion processing;
inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the second facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized second facial feature state sequences after fusion processing;
by analogy, inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the Mth facial feature state sequence as the main feature, into the fusion network; outputting (M+1) optimized Mth facial feature state sequences after fusion processing;
and inputting the first facial feature state sequence, the second facial feature state sequence, …, the Mth facial feature state sequence and the human face state sequence as auxiliary weights, and the human face state sequence as the main feature, into the fusion network; outputting (M+1) optimized human face state sequences after fusion processing.
8. The data processing method of claim 3, wherein the optimized face deep learning feature sequence is processed using a neural network classifier.
9. A data processing method according to any of claims 1 to 8, the method further comprising:
detecting a single frame image, and extracting a face image and at least one face local image;
and detecting the face image and at least one face local image, and temporarily storing the obtained face state information, at least one face characteristic state information and face deep learning characteristic information.
10. The data processing method of claim 9,
the face state sequence comprises the face states respectively corresponding to the multiple frames of images;
the face feature state sequence comprises the face feature states respectively corresponding to the multiple frames of images;
the face deep learning feature sequence comprises the face deep learning features respectively corresponding to the multiple frames of images.
11. The data processing method of claim 1, the method further comprising:
detecting fatigue expression according to the calculated physiological state corresponding to the human face included in the image data; or,
and detecting drunk driving according to the calculated physiological state corresponding to the face included in the image data.
12. A computer-readable storage medium storing computer-executable instructions for performing the data processing method of any one of claims 1 to 11.
13. A data processing apparatus comprising a memory and a processor, wherein the memory has stored therein instructions executable by the processor for performing the steps of the data processing method of any one of claims 1 to 11.
14. A converged network architecture, comprising: a convolutional neural network (CNN) processing module and a fusion processing module; wherein:
the CNN processing module is used for performing CNN operation on the input multiple auxiliary weights;
and the fusion processing module is used for carrying out element multiplication on the CNN operated result and the input main characteristic to output more than two optimized main characteristic sequences.
15. The converged network architecture of claim 14, wherein the plurality of auxiliary weights employ a concatenated input.
16. The converged network architecture of claim 14, wherein the plurality of secondary weights perform CNN operations with output results having the same dimensions as the input primary features.
17. The converged network architecture of claim 14, wherein the CNN structure is a backend converged network LAF-Net.
18. A data processing apparatus, comprising: a first acquisition unit, a second acquisition unit, a first processing unit and a second processing unit; wherein:
the first acquisition unit is used for acquiring image data to be recognized, wherein the image data comprises a human face;
the second acquisition unit is used for acquiring the whole human face features and at least one local human face feature corresponding to the image data;
the first processing unit is used for selecting main features from the whole human face features and the local human face features, wherein the features except the main features are auxiliary features;
and the second processing unit is used for calculating the physiological state corresponding to the face included in the image data through a fusion neural network based on the main feature and the auxiliary feature.
19. The data processing device according to claim 18, wherein the second obtaining unit is specifically configured to: acquiring a face state sequence, at least one face characteristic state sequence and a face deep learning characteristic sequence corresponding to a plurality of frames of images in the image data;
the first processing unit is specifically configured to: and in the human face state sequence and at least one facial feature state sequence, one of the human face state sequence and the at least one facial feature state sequence is taken as the main feature, and the rest are taken as the auxiliary features.
20. The data processing device of claim 19, wherein the second processing unit comprises a first fusion module, a second fusion module, and a classification module, wherein,
the first fusion module is used for performing fusion processing through the fusion neural network based on the human face state sequence and at least one facial feature state sequence to obtain more than two optimized main feature sequences;
the second fusion module is used for taking the more than two optimized main feature sequences as auxiliary weights and the human face deep learning feature sequence as the main feature to perform fusion processing, so as to obtain an optimized human face deep learning feature sequence;
and the classification module is used for processing the optimized human face deep learning feature sequence by using a neural network classifier to obtain the result of the physiological state detection.
21. The data processing device of claim 18, further comprising:
the detection unit is used for detecting the single-frame image and extracting a face image and at least one face local image; and detecting the face image and at least one face local image, and temporarily storing the obtained face state information, at least one face characteristic state information and face deep learning characteristic information.
22. The physiological state detection device of claim 18, the first or second fusion unit comprising the fusion network of any one of claims 13-16.
23. A method of data processing, comprising:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the human face overall features and the human face local features, wherein the features except the main features are auxiliary features;
and calculating the emotional state corresponding to the face included in the image data by fusing a neural network based on the main feature and the auxiliary feature.
24. A method of data processing, comprising:
acquiring image data to be recognized, wherein the image data comprises a human face;
acquiring a human face overall characteristic and at least one human face local characteristic corresponding to the image data;
selecting main features from the whole features and the local features of the human face, wherein the features except the main features are auxiliary features;
acquiring more than two optimized main feature information by fusing a neural network based on the main feature and the auxiliary feature;
and performing makeup recommendation according to the more than two optimized main characteristic information.
CN202010122774.7A 2020-02-27 2020-02-27 Data processing method and device and converged network architecture system Active CN113312942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122774.7A CN113312942B (en) 2020-02-27 2020-02-27 Data processing method and device and converged network architecture system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122774.7A CN113312942B (en) 2020-02-27 2020-02-27 Data processing method and device and converged network architecture system

Publications (2)

Publication Number Publication Date
CN113312942A true CN113312942A (en) 2021-08-27
CN113312942B CN113312942B (en) 2024-05-17

Family

ID=77370136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122774.7A Active CN113312942B (en) 2020-02-27 2020-02-27 Data processing method and device and converged network architecture system

Country Status (1)

Country Link
CN (1) CN113312942B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017084186A1 (en) * 2015-11-18 2017-05-26 华南理工大学 System and method for automatic monitoring and intelligent analysis of flexible circuit board manufacturing process
CN108509880A (en) * 2018-03-21 2018-09-07 南京邮电大学 A kind of video personage behavior method for recognizing semantics
KR20190130808A (en) * 2018-05-15 2019-11-25 연세대학교 산학협력단 Emotion Classification Device and Method using Convergence of Features of EEG and Face
CN109740536A (en) * 2018-06-12 2019-05-10 北京理工大学 A kind of relatives' recognition methods based on Fusion Features neural network
CN110111848A (en) * 2019-05-08 2019-08-09 南京鼓楼医院 A kind of human cyclin expressing gene recognition methods based on RNN-CNN neural network fusion algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孔繁盛; 蒋周良; 胡斌; 张欢: "基于PCA融合神经网络的移动设备威胁研究" [Research on mobile device threats based on a PCA fusion neural network], 电信工程技术与标准化 [Telecom Engineering Technics and Standardization], no. 11 *

Also Published As

Publication number Publication date
CN113312942B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
Sun et al. A visual attention based ROI detection method for facial expression recognition
CN111401177B (en) End-to-end behavior recognition method and system based on adaptive space-time attention mechanism
Dewan et al. A deep learning approach to detecting engagement of online learners
Mitra et al. A machine learning based approach for deepfake detection in social media through key video frame extraction
Rangesh et al. Driver gaze estimation in the real world: Overcoming the eyeglass challenge
CN109063626B (en) Dynamic face recognition method and device
Zhao et al. Scale-aware crowd counting via depth-embedded convolutional neural networks
Ghosh et al. Spatiotemporal filtering for event-based action recognition
CN110858316A (en) Classifying time series image data
Khan et al. Classification of human's activities from gesture recognition in live videos using deep learning
Abu-Ein et al. Analysis of the current state of deepfake techniques-creation and detection methods
Ma et al. Convolutional three-stream network fusion for driver fatigue detection from infrared videos
CN113312942B (en) Data processing method and device and converged network architecture system
Mustafa et al. Gender classification and age prediction using CNN and ResNet in real-time
Revi et al. Gan-generated fake face image detection using opponent color local binary pattern and deep learning technique
Kumar et al. Facial emotion recognition and detection using cnn
Rao et al. Non-local attentive temporal network for video-based person re-identification
CN110969109B (en) Blink detection model under non-limited condition and construction method and application thereof
KoÇak et al. Deepfake generation, detection and datasets: a rapid-review
Supriya et al. Affective music player for multiple emotion recognition using facial expressions with SVM
Mallavarapu et al. Image Based Sentiment Analysis using Bayesian Networks and Deep Learning
Shit et al. Real-time emotion recognition using end-to-end attention-based fusion network
Kao et al. Activity recognition using first-person-view cameras based on sparse optical flows
CN113988260B (en) Data processing method, device, equipment and system
Adnan et al. Deepfake video detection based on convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant