CN116994209A - Image data processing system and method based on artificial intelligence - Google Patents

Image data processing system and method based on artificial intelligence

Info

Publication number
CN116994209A
Authority
CN
China
Prior art keywords
feature
feature vector
classification
tracking
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311092448.6A
Other languages
Chinese (zh)
Inventor
吴学凯
吴佳峰
吕珏
徐雪梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haining Xinchen Network Technology Co ltd
Original Assignee
Haining Xinchen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haining Xinchen Network Technology Co ltd filed Critical Haining Xinchen Network Technology Co ltd
Priority to CN202311092448.6A priority Critical patent/CN116994209A/en
Publication of CN116994209A publication Critical patent/CN116994209A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the field of image data processing, and particularly discloses an image data processing system and method based on artificial intelligence.

Description

Image data processing system and method based on artificial intelligence
Technical Field
The present application relates to the field of image data processing, and more particularly, to an artificial intelligence-based image data processing system and method thereof.
Background
The video monitoring system is the core of a community's technical security system: it can intuitively display images of the community's key areas on a large screen and store footage for long-term playback. Through such a system, security personnel can, without leaving their post, keep track of the movement of people, equipment, and vehicles in the community, and can discover and stop crimes in time, thereby preventing accidents. This is of great significance for protecting the personal and property safety of community residents.
Existing communities cover large areas with long perimeters, scattered buildings, frequent entry and exit of people and vehicles, and a wide monitoring range. A community usually relies on a guard to register visitors entering and leaving. When someone enters or leaves the community by other means, the community cannot collect any information about that person; when such intruders endanger the security of the community or the persons and property of its residents, the relevant personnel can only reconstruct events after the fact by reviewing video recordings and conducting interviews. The video monitoring system is therefore merely passive.
Therefore, an image data processing system based on artificial intelligence and a method thereof are desired that can give a timely alert when abnormal personnel enter a residential community, improving the initiative of the video monitoring system.
Disclosure of Invention
The present application has been made to solve the above-mentioned technical problems. The embodiments of the application provide an image data processing system and method based on artificial intelligence that comprehensively utilize the facial images of the users in a residential community and the data of the monitoring video to extract facial features and target features, and combine image processing, feature extraction, and classification techniques to compute and compare these features. In this way the system can judge whether abnormal personnel are present and take timely measures, improving the initiative of the video monitoring system.
According to one aspect of the present application, there is provided an artificial intelligence based image data processing system and method thereof, including:
the data acquisition module is used for acquiring facial images of all users and monitoring videos acquired by the cameras;
the image processing module is used for passing the acquired facial image of each user through a first convolutional neural network model using a spatial attention mechanism to obtain a plurality of facial image feature maps;
the pooling module is used for performing global mean pooling along the channel dimension on each feature matrix of the plurality of facial image feature maps to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix in time order;
the target monitoring module is used for extracting a plurality of key frames from the monitoring video, and then obtaining a plurality of target object region of interest graphs through a target detection network based on an anchor-free window respectively;
the target extraction module is used for enabling the region-of-interest graphs of the plurality of target objects to pass through a two-dimensional convolutional neural network model to obtain a tracking feature graph, and carrying out global mean pooling on each feature matrix of the tracking feature graph along the channel dimension to obtain a tracking feature vector;
the query module is used for calculating the matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector;
the optimizing module is used for performing affine subspace probabilization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector;
the abnormality detection module is used for passing the optimized classification feature vector through a classifier to obtain a classification result, wherein the classification result is used for indicating whether abnormal personnel are present, and for sending an alarm signal to the staff or the owners.
In the above image data processing system based on artificial intelligence, the image processing module includes: a convolutional encoding unit, configured to pass each user's face image through the convolutional encoding portion of the first convolutional neural network to obtain a plurality of high-dimensional feature maps; a spatial attention unit, configured to input each of the plurality of high-dimensional feature maps into the spatial attention portion of the first convolutional neural network to obtain a plurality of spatial attention maps; and an attention applying unit, configured to multiply, position-wise, each spatial attention map with its corresponding high-dimensional feature map to obtain the plurality of facial image feature maps.
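The position-wise multiplication performed by the attention applying unit can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patent's implementation; the sigmoid normalization of the attention map is an assumption, since the text does not specify how the spatial attention map is produced.

```python
import numpy as np

def apply_spatial_attention(feature_map: np.ndarray,
                            attention_logits: np.ndarray) -> np.ndarray:
    """Weight a [C, H, W] feature map by an [H, W] spatial attention map.

    The attention logits are squashed to (0, 1) with a sigmoid (an assumed
    normalization) and multiplied into every channel position-wise, as the
    attention applying unit describes.
    """
    attention = 1.0 / (1.0 + np.exp(-attention_logits))  # sigmoid -> (0, 1)
    return feature_map * attention[None, :, :]           # broadcast over channels
```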
In the above image data processing system based on artificial intelligence, the feature extraction module includes: performing global mean pooling along the channel dimension on each feature matrix of the plurality of facial image feature maps to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix in time order.
In the above image data processing system based on artificial intelligence, the object monitoring module includes: a multi-layer convolution unit, configured to pass each of the plurality of key frames through multiple convolutional layers to obtain a plurality of shallow feature maps; and an anchor-free window detection unit, configured to pass the plurality of shallow feature maps through the anchor-free-window-based target detection network to obtain a plurality of target object region-of-interest maps.
In the above image data processing system based on artificial intelligence, the multi-layer convolution unit is characterized in that: the multiple convolutional layers comprise N convolutional layers, where 1 ≤ N < 6; each convolutional layer performs convolution, pooling, and nonlinear activation based on a two-dimensional convolution kernel on its input data during forward propagation, so that the last of the convolutional layers outputs the shallow feature map; and the nonlinear activation function used by each of the convolutional layers is the Mish activation function.
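The Mish activation named above can be written directly from its definition, x · tanh(softplus(x)). A minimal NumPy sketch:

```python
import numpy as np

def mish(x: np.ndarray) -> np.ndarray:
    """Mish activation: x * tanh(softplus(x)).

    Applied after each convolution in the multi-layer convolution unit;
    it is smooth and non-monotonic, unlike ReLU.
    """
    softplus = np.log1p(np.exp(x))  # softplus(x) = ln(1 + e^x)
    return x * np.tanh(softplus)
```

For large positive inputs Mish approaches the identity, while large negative inputs are pushed toward zero.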
In the above image data processing system based on artificial intelligence, the target extraction module is characterized in that: the plurality of target object region-of-interest maps are arranged into a three-dimensional tensor and passed through a two-dimensional convolutional neural network model to obtain a tracking feature map, and global mean pooling is performed along the channel dimension on each feature matrix of the tracking feature map to obtain a tracking feature vector, including: performing convolution, pooling, and nonlinear activation based on a two-dimensional convolution kernel on the input data during forward propagation through the layers of the two-dimensional convolutional neural network, so that the last layer of the two-dimensional convolutional neural network outputs the tracking feature map.
In the above image data processing system based on artificial intelligence, the query module is configured to: calculate the matrix product between the tracking feature vector and the facial feature matrix to obtain the classification feature vector according to the following formula:

V1 = M ⊗ V2

wherein V1 represents the classification feature vector, V2 represents the tracking feature vector, M represents the facial feature matrix, and ⊗ represents matrix multiplication.
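With hypothetical toy dimensions, the matrix product step can be checked numerically: a facial feature matrix M of shape (3, 4) times a length-4 tracking feature vector V2 yields a length-3 classification feature vector V1. The dimensions here are illustrative assumptions only.

```python
import numpy as np

# Hypothetical toy dimensions for illustration only.
M = np.arange(12, dtype=float).reshape(3, 4)   # facial feature matrix
V2 = np.ones(4)                                # tracking feature vector
V1 = M @ V2                                    # classification feature vector
# Each entry of V1 is the sum of the corresponding row of M:
# V1 == [6., 22., 38.]
```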
In the above image data processing system based on artificial intelligence, the optimizing module includes: a normalization subunit, configured to normalize the classification feature vector and the tracking feature vector to obtain a normalized classification feature vector and a normalized tracking feature vector; a covariance calculation subunit, configured to calculate a covariance matrix between the normalized classification feature vector and the normalized tracking feature vector; an eigenvalue decomposition subunit, configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and the eigenvectors corresponding to them; an eigenvector extraction subunit, configured to extract the eigenvectors corresponding to the two largest eigenvalues as a first-dimension feature vector and a second-dimension feature vector, which serve as the basis of an affine subspace; a vector arrangement subunit, configured to arrange the first-dimension feature vector and the second-dimension feature vector as column vectors to obtain an affine subspace matrix; an association subunit, configured to multiply the classification feature vector and the tracking feature vector each with the affine subspace matrix, mapping them into the affine subspace to obtain an affine transformation classification feature vector and an affine transformation tracking feature vector; and a position-wise addition subunit, configured to calculate the position-wise sum of the affine transformation classification feature vector and the transpose of the affine transformation tracking feature vector to obtain the optimized classification feature vector.
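The chain of subunits above can be sketched end-to-end. This is one possible reading of the patent's prose, with its assumptions made explicit: the covariance is taken over the two normalized vectors stacked as rows (giving a 2×2 matrix whose two eigenvectors form the subspace basis), and the final fusion is a position-wise sum.

```python
import numpy as np

def affine_subspace_fuse(v_cls: np.ndarray, v_trk: np.ndarray) -> np.ndarray:
    """Sketch of the affine subspace probabilization described above.

    Normalize both vectors, build a covariance matrix from the stacked
    pair, keep the eigenvectors of the two largest eigenvalues as the
    affine subspace basis, project both vectors onto that basis, and add
    the projections position-wise. Interpretation of the patent text;
    dimensions are assumptions.
    """
    def standardize(v: np.ndarray) -> np.ndarray:
        return (v - v.mean()) / (v.std() + 1e-8)

    z = np.stack([standardize(v_cls), standardize(v_trk)])  # shape (2, n)
    cov = np.cov(z)                                         # shape (2, 2)
    eigvals, eigvecs = np.linalg.eigh(cov)                  # ascending order
    basis = eigvecs[:, np.argsort(eigvals)[::-1][:2]]       # top-2 as columns
    proj = basis.T @ z                                      # map into subspace
    return proj[0] + proj[1]                                # position-wise sum
```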
In the above image data processing system based on artificial intelligence, the abnormality detection module includes: a fully-connected encoding unit, configured to perform fully-connected encoding on the optimized classification feature vector using a fully-connected layer of the classifier to obtain a fully-connected encoded feature vector; a probability obtaining unit, configured to pass the fully-connected encoded feature vector through a Softmax classification function of the classifier to obtain a first probability that there is a risk of abnormal personnel and a second probability that there is no such risk; and a classification result determining unit, configured to determine the classification result based on a comparison between the first probability and the second probability.
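The Softmax comparison of the abnormality detection module can be sketched with placeholder weights. W and b here stand in for the trained parameters of the fully-connected layer (assumptions, not the patent's values); the two output logits correspond to the "abnormal" and "normal" classes.

```python
import numpy as np

def classify(feature: np.ndarray, W: np.ndarray, b: np.ndarray):
    """Fully-connected layer followed by a two-class Softmax.

    Returns (p_abnormal, p_normal, is_abnormal), mirroring the first
    probability / second probability comparison described above.
    """
    logits = W @ feature + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    p_abnormal, p_normal = probs[0], probs[1]
    return p_abnormal, p_normal, bool(p_abnormal > p_normal)
```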
According to another aspect of the present application, there is also provided an image data processing method based on artificial intelligence, characterized in that:
acquiring face images of all users, and acquiring a monitoring video through a camera;
acquiring a plurality of facial image feature images of the acquired facial image of the user through a first convolution neural network model using a spatial attention mechanism;
performing global mean pooling along the channel dimension on each feature matrix of the plurality of facial image feature maps to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix in time order;
Extracting a plurality of key frames from the monitoring video, and then respectively obtaining a plurality of target object region-of-interest graphs from the plurality of key frames through a target detection network based on an anchor-free window;
arranging the multiple target object region-of-interest graphs into three-dimensional tensors, obtaining a tracking feature graph through a two-dimensional convolutional neural network model, and carrying out global mean pooling on each feature matrix of the tracking feature graph along the channel dimension to obtain a tracking feature vector;
calculating the matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector;
performing affine subspace probabilization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector;
and the optimized classification feature vector passes through a classifier to obtain a classification result, wherein the classification result is used for indicating whether abnormal personnel exist or not and sending an alarm signal to a staff or an owner.
In the above image data processing method based on artificial intelligence, passing the acquired facial image of the user through the first convolutional neural network with the spatial attention mechanism to obtain a plurality of facial image feature maps includes: passing each user's face image through the convolutional encoding portion of the first convolutional neural network to obtain a plurality of high-dimensional feature maps; inputting each of the plurality of high-dimensional feature maps into the spatial attention portion of the first convolutional neural network to obtain a plurality of spatial attention maps; and multiplying, position-wise, each spatial attention map with its corresponding high-dimensional feature map to obtain the plurality of facial image feature maps.
In the above image data processing method based on artificial intelligence, performing affine subspace probabilization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector includes: normalizing the classification feature vector and the tracking feature vector to obtain a normalized classification feature vector and a normalized tracking feature vector; calculating a covariance matrix between the normalized classification feature vector and the normalized tracking feature vector; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and their corresponding eigenvectors; extracting the eigenvectors corresponding to the two largest eigenvalues as a first-dimension feature vector and a second-dimension feature vector, which serve as the basis of an affine subspace; arranging the first-dimension feature vector and the second-dimension feature vector as column vectors to obtain an affine subspace matrix; multiplying the classification feature vector and the tracking feature vector each with the affine subspace matrix, mapping them into the affine subspace to obtain an affine transformation classification feature vector and an affine transformation tracking feature vector; and calculating the position-wise sum of the affine transformation classification feature vector and the transpose of the affine transformation tracking feature vector to obtain the optimized classification feature vector.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the artificial intelligence based image data processing method as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the artificial intelligence based image data processing method as described above.
Compared with the prior art, the embodiments of the application provide an image data processing system and method based on artificial intelligence that comprehensively utilize the facial images of the users in a residential community and the data of the monitoring video to extract facial features and target features, and combine image processing, feature extraction, and classification techniques to compute and compare these features. In this way the system can judge whether abnormal personnel are present and take timely measures to handle them, improving the initiative of the video monitoring system.
Drawings
The above and other objects, features, and advantages of the present application will become more apparent from the following detailed description of embodiments of the application with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and constitute a part of this specification; they serve to explain the application together with its embodiments and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like components or steps.
FIG. 1 illustrates a block diagram schematic of an artificial intelligence based image data processing system in accordance with an embodiment of the application.
FIG. 2 illustrates a block diagram of an image processing module in an artificial intelligence based image data processing system according to an embodiment of the application.
FIG. 3 illustrates a block diagram of an anomaly detection module in an artificial intelligence based image data processing system in accordance with an embodiment of the present application.
FIG. 4 illustrates a flow chart of an artificial intelligence based image data processing method according to an embodiment of the application.
FIG. 5 illustrates an architectural diagram of an artificial intelligence based image data processing method according to an embodiment of the present application.
Fig. 6 illustrates a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above, a community usually relies on a guard to register people entering and leaving, and when abnormal personnel enter the community by other means, the community cannot collect information about them. Therefore, an image data processing system based on artificial intelligence and a method thereof are desired that can give a timely alert when abnormal personnel enter the community, improving the initiative of the monitoring video system.
At present, deep learning and neural networks have been widely used in the fields of computer vision, natural language processing, speech signal processing, and the like. In addition, deep learning and neural networks have also shown levels approaching and even exceeding humans in the fields of image classification, object detection, semantic segmentation, text translation, and the like.
The development of deep learning and neural networks provides new solutions and schemes for image data processing.
Accordingly, in the technical scheme of the application, the inventors seek to comprehensively utilize the data of each user's face image and the monitoring video of the community, combining image processing, feature extraction, and classification techniques to realize an anomaly detection function: by extracting facial features and target features and computing and comparing them, the system can judge whether abnormal personnel are present, and once an anomaly is detected, an alarm signal can be sent to the staff or owners so that measures can be taken in time.
Specifically, in the technical scheme of the application, facial images of each user are first obtained and passed through a first convolutional neural network model using a spatial attention mechanism to obtain a plurality of facial image feature maps. It should be understood that the spatial attention mechanism enables the network to automatically learn which regions are more critical to facial feature extraction when processing the images, providing better input for the subsequent feature extraction and classification steps. Then, global mean pooling along the channel dimension is performed on each feature matrix of the plurality of facial image feature maps to obtain a plurality of facial image feature vectors, which are arranged into a facial feature matrix in time order. By mean pooling and time-series arrangement of the facial image feature maps, the features of the facial image can be reduced and integrated into a more compact and representative facial feature representation.
Further, a plurality of key frames are extracted from the surveillance video. It should be understood that surveillance video is typically a continuous video stream containing a large amount of redundant information and irrelevant background; to reduce the amount of data to be processed and improve processing efficiency, key frames can be selected from the video for analysis. Specifically, to extract the plurality of key frames, the surveillance video can be sampled at a fixed time interval, selecting one frame from each period as a sampling frame.
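The sampling strategy in the last sentence (one frame per fixed time interval) is a one-liner; a minimal sketch, assuming frames are addressed by index:

```python
def sample_key_frames(num_frames: int, interval: int) -> list:
    """Select the first frame of every `interval`-frame period as a key frame."""
    return list(range(0, num_frames, interval))
```

For a 30 fps stream, for example, interval=30 keeps one key frame per second.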
Still further, it should be understood that deep-learning-based target detection methods are classified into two categories according to whether an anchor window is used in the network: anchor-based and anchor-free methods. Anchor-based methods include Fast R-CNN, Faster R-CNN, RetinaNet, etc.; anchor-free methods include CenterNet, ExtremeNet, RepPoints, etc. The anchor-free approach avoids the drawbacks introduced by anchor windows, such as difficulty in recognizing targets with large scale variation, imbalance between positive and negative samples during training, and high memory consumption, and is the current mainstream direction of development.
In addition, anchor-free methods are subdivided into two main categories: center-point-based and keypoint-based methods. Center-point-based methods such as YOLOv1, FCOS, and CenterNet directly detect the center point of an object and then regress its boundary information. Keypoint-based methods such as CornerNet, ExtremeNet, and RepPoints obtain the bounding box by predicting the keypoints of the object. Keypoint-based methods generally achieve slightly higher detection accuracy than center-point-based methods, but at a larger computational cost.
Therefore, in the technical scheme of the application, when performing video target detection, in order to focus on anomalies of the video target and highlight the difference between an abnormal video target and a normal user, the plurality of key frames are further passed through a target detection network based on an anchor-free window to obtain a plurality of target object region-of-interest maps. Specifically, in the embodiment of the present application, each of the plurality of key frames is first passed through multiple convolutional layers to obtain a plurality of shallow feature maps, where the multiple convolutional layers comprise N convolutional layers with 1 ≤ N ≤ 6, and the nonlinear activation function used by each layer is the Mish activation function; then, the plurality of shallow feature maps are passed through the anchor-free-window-based target detection network to obtain the plurality of target object region-of-interest maps. It should be appreciated that inputting the key frames into the anchor-free-window-based object detection network yields the target object region-of-interest maps, which locate and identify key target objects and extract their important information.
Then, the plurality of target object region-of-interest maps are arranged into a three-dimensional tensor and passed through a two-dimensional convolutional neural network model to obtain a tracking feature map, and global mean pooling along the channel dimension is performed on each feature matrix of the tracking feature map to obtain a tracking feature vector. It should be appreciated that by arranging the target object region-of-interest maps into a three-dimensional tensor and inputting it into a two-dimensional convolutional neural network model, local features in the image, including texture, shape, and structural information of the target, can be extracted. The input feature map typically has spatial dimensions of height and width and a feature dimension of multiple channels; by global mean pooling, the feature matrix of each channel is converted into a single feature value, reducing the feature dimension to the channel level.
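The global mean pooling step reduces each channel's H×W feature matrix to one scalar; a minimal NumPy sketch, assuming a [C, H, W] layout:

```python
import numpy as np

def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Collapse a [C, H, W] tracking feature map to a length-C vector by
    averaging each channel's feature matrix over its spatial dimensions."""
    return feature_map.mean(axis=(1, 2))
```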
Thus, the classification feature vector is obtained by calculating the matrix product of the facial feature matrix and the tracking feature vector. The facial feature matrix and the tracking feature vector each represent different feature information. By performing a matrix product, the two feature information can be combined to form a new feature vector. This helps to combine the feature information from different sources to provide a more comprehensive and rich representation of the features.
Further, considering that tracking feature vectors are extracted from the surveillance video, the tracking feature vectors contain feature information of the target object on different time frames, and can capture the motion and track of the target object. That is, the tracking feature vector can provide motion and track information of the target object, which is important for the classification task of whether abnormal people exist, and meanwhile, the classification feature vector provides associated features of facial feature information and target image feature information, so that different users can be identified and distinguished. By fusing the two, the information of the monitoring video and the facial image of the user can be fully utilized, so that more optimized and comprehensive classification feature vectors are obtained, and the performance and accuracy of the classification model are improved.
In particular, since the classification feature vector is obtained by calculating a matrix product of the facial feature matrix and the tracking feature vector, the classification feature vector and the tracking feature vector may contain some similar information, resulting in data redundancy during fusion. This may increase the dimension of the feature vector and introduce unnecessary redundant information, complicating the feature vector. Meanwhile, the classification feature vector and the tracking feature vector adopt different feature representation methods or feature spaces, so their feature representations are inconsistent. During fusion, this inconsistency needs to be resolved to ensure that the fused feature vector accurately represents the features of the target object. Thus, affine subspace probabilization is performed on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector.
In the technical scheme of the application, the classification feature vector and the tracking feature vector are first standardized to map them into a probability space. The basis vectors of an affine subspace are then obtained based on the idea of principal component analysis and eigenvalue decomposition. After the basis vectors of the affine subspace are obtained, the classification feature vector and the tracking feature vector are respectively mapped into a common affine subspace, and the affine-transformed classification feature vector and tracking feature vector are then densely connected using a position-wise association response to obtain the optimized classification feature vector. In this way, the main information and structure of the data in the classification feature vector and the tracking feature vector can be effectively extracted, the complexity and redundancy of the data are reduced, and the dimension and transformation of the data can be flexibly adjusted to suit different data types and scenarios.
The optimized classification feature vector is then passed through a classifier to obtain a classification result, where the classification result is used to indicate whether abnormal personnel exist, and an alarm signal is sent to the staff or the owner.
Based on the above, the present application proposes an image data processing system and method based on artificial intelligence, which includes: the data acquisition module is used for acquiring facial images of all users and monitoring videos acquired by the cameras; the image processing module is used for acquiring a plurality of facial image feature images of the acquired facial image of the user through a first convolution neural network model using a spatial attention mechanism; the pooling module is used for carrying out global mean pooling on each feature matrix of the plurality of facial image feature images along the channel dimension to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix according to time sequence; the target monitoring module is used for extracting a plurality of key frames from the monitoring video, and then obtaining a plurality of target object region of interest graphs through a target detection network based on an anchor-free window respectively; the target extraction module is used for arranging the multiple target object region-of-interest graphs into a three-dimensional tensor, obtaining a tracking feature graph through a two-dimensional convolutional neural network model, and carrying out global mean pooling on each feature matrix of the tracking feature graph along the channel dimension to obtain a tracking feature vector; the query module is used for calculating the matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector; the optimizing module performs affine subspace probability on the classifying feature vector and the tracking feature vector to obtain an optimized classifying feature vector; the abnormality detection module is used for enabling the optimized classification feature vector to pass through the classifier to obtain a classification result, wherein the 
classification result is used for indicating whether abnormal personnel exist or not and sending an alarm signal to the staff or the owner.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.
Exemplary System
FIG. 1 illustrates a block diagram schematic of an artificial intelligence based image data processing system in accordance with an embodiment of the application. As shown in fig. 1, an artificial intelligence based image data processing system 100 according to an embodiment of the present application includes: the data acquisition module 110 is used for acquiring facial images of each user and monitoring videos acquired by the camera; an image processing module 120, configured to acquire a plurality of facial image feature maps from the acquired facial image of the user through a first convolutional neural network model using a spatial attention mechanism; a pooling module 130, configured to globally average and pool feature matrices of the plurality of facial image feature maps along a channel dimension to obtain a plurality of facial image feature vectors, and arrange the plurality of facial image feature vectors into a facial feature matrix according to a time sequence; the target monitoring module 140 is configured to extract a plurality of key frames from the surveillance video, and then obtain a plurality of target object interested area diagrams from the plurality of key frames through a target detection network based on an anchor-free window; the target extraction module 150 is configured to arrange the multiple target object region of interest maps into a three-dimensional tensor, obtain a tracking feature map through a two-dimensional convolutional neural network model, and perform global mean pooling on each feature matrix of the tracking feature map along a channel dimension to obtain a tracking feature vector; a query module 160, configured to calculate a matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector; the optimizing module 170 performs affine subspace probability on the classification feature vector and the tracking feature vector to obtain an optimized classification feature 
vector; the anomaly detection module 180 is configured to pass the optimized classification feature vector through a classifier to obtain a classification result, where the classification result is used to indicate whether an abnormal person exists, and send an alarm signal to a staff or a homeowner.
In the embodiment of the present application, the data acquisition module 110 is configured to acquire facial images of each user and a monitoring video acquired by a camera. It should be understood that by comprehensively utilizing the facial image data of each user in the residential community together with the monitoring video, and combining image processing, feature extraction, and classification techniques, an anomaly detection function can be realized: by extracting facial features and target features and calculating and comparing them, whether abnormal personnel exist can be judged, and once the system detects an abnormality, an alarm signal can be sent to the property staff or owners so that measures can be taken and the situation handled in time.
In an embodiment of the present application, the image processing module 120 is configured to acquire a plurality of facial image feature maps from the acquired facial image of the user through a first convolutional neural network model using a spatial attention mechanism. It will be appreciated that the spatial attention mechanism may enable the network to automatically learn which regions are more critical to facial feature extraction when processing images, providing better input for subsequent feature extraction and classification steps.
In one embodiment of the application, FIG. 2 illustrates a block diagram of an image processing module in an artificial intelligence based image data processing system according to an embodiment of the application. As shown in fig. 2, in the artificial intelligence based image data processing system 100, the image processing module 120 includes: a convolutional encoding unit 121, configured to pass the face image of each user through the convolutional encoding portion of the first convolutional neural network to obtain a plurality of high-dimensional feature maps; a spatial attention unit 122, configured to input each of the plurality of high-dimensional feature maps into the spatial attention portion of the first convolutional neural network to obtain a plurality of spatial attention maps; and an attention applying unit 123, configured to multiply, position-wise, each spatial attention map with its corresponding high-dimensional feature map to obtain the plurality of facial image feature maps.
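For illustration only, the pipeline of the image processing module 120 (convolutional encoding, spatial attention, position-wise multiplication) can be sketched in a few lines of numpy. The naive convolution, the channel-mean-plus-sigmoid attention, and all shapes and weights below are hypothetical stand-ins for the trained first convolutional neural network, not the patent's actual model.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same'-padding 2-D convolution over a (C_in, H, W) input.
    w has shape (C_out, C_in, k, k); a didactic loop, not an optimized kernel."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def spatial_attention(feature_map):
    """Squeeze the channel dimension of a (C, H, W) map and apply a sigmoid
    to obtain a (1, H, W) spatial attention map."""
    squeezed = feature_map.mean(axis=0, keepdims=True)
    return 1.0 / (1.0 + np.exp(-squeezed))

rng = np.random.default_rng(0)
face = rng.standard_normal((3, 8, 8))             # toy 3-channel "face image"
weights = rng.standard_normal((4, 3, 3, 3)) * 0.1 # hypothetical learned filters
high_dim = conv2d(face, weights)                  # convolutional encoding part
attn = spatial_attention(high_dim)                # spatial attention part
attended = high_dim * attn                        # position-wise multiplication
print(attended.shape)
```

The position-wise multiplication broadcasts the single-channel attention map over every channel, which is the usual way a spatial attention mask re-weights a feature map.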
In the embodiment of the present application, the pooling module 130 is configured to perform global average pooling on each feature matrix of the plurality of facial image feature maps along the channel dimension to obtain a plurality of facial image feature vectors, and to arrange the plurality of facial image feature vectors into a facial feature matrix according to time sequence. It should be appreciated that by pooling the multiple facial image feature maps and arranging the results in time order, the many features of a facial image can be reduced and integrated, yielding a more compact and more representative facial feature representation.
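The pooling module's two steps, global average pooling per channel and time-ordered stacking, can be sketched as follows. The shapes (five feature maps of 16 channels at 8x8) are illustrative assumptions, not dimensions from the patent.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse a (C, H, W) feature map to a length-C vector by averaging
    each channel's spatial plane."""
    return feature_map.mean(axis=(1, 2))

rng = np.random.default_rng(1)
# Hypothetical: 5 facial image feature maps, each 16 channels of 8x8
feature_maps = [rng.standard_normal((16, 8, 8)) for _ in range(5)]
vectors = [global_average_pool(fm) for fm in feature_maps]
# Arrange the per-image vectors into a facial feature matrix in time order
facial_feature_matrix = np.stack(vectors, axis=0)   # shape (5, 16)
print(facial_feature_matrix.shape)
```

Each row of the resulting matrix is one time step's pooled facial feature vector, so the matrix jointly encodes per-frame appearance and temporal order.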
In the embodiment of the present application, the target monitoring module 140 is configured to extract a plurality of key frames from the surveillance video, and then pass the plurality of key frames through a target detection network based on an anchor-free window to obtain a plurality of target object region-of-interest maps. It should be appreciated that surveillance video is typically a continuous stream that includes a large amount of redundant information and extraneous background, and key frames may be selected from the video for analysis in order to reduce the amount of data processed and improve efficiency. Furthermore, deep-learning-based target detection methods are divided into two categories, anchor-based and anchor-free, according to whether an anchor window is used in the network. Anchor-window-based methods include Faster R-CNN, Fast R-CNN, RetinaNet, etc.; anchor-free methods include CenterNet, ExtremeNet, RepPoints, etc. The anchor-free approach avoids the drawbacks caused by anchor windows, such as difficulty in recognizing targets with large scale variation, imbalance between positive and negative samples during training, and high memory consumption, and is the current mainstream direction of development.
In addition, anchor-free methods are subdivided into two main categories: center-point-based methods and key-point-based methods. Center-point-based methods such as YOLOv1, FCOS, and CenterNet directly detect the center point of the object and then regress the boundary information of the object. Key-point-based methods such as CornerNet, ExtremeNet, and RepPoints obtain the bounding box by predicting the key points of the object. Key-point-based methods generally achieve slightly higher detection accuracy than center-point-based methods, but at a larger computational cost. Therefore, in the technical scheme of the application, when performing video target detection, in order to focus on abnormal video targets and highlight the difference between an abnormal video target and a normal user, the plurality of key frames are further respectively passed through a target detection network based on an anchor-free window to obtain a plurality of target object region-of-interest maps. Specifically, in the embodiment of the present application, first, each key frame in the plurality of key frames is respectively passed through a plurality of convolution layers to obtain a plurality of shallow feature maps, where the plurality of convolution layers include N convolution layers, with N greater than or equal to 1 and less than or equal to 6, and the nonlinear activation function used by each of the convolution layers is the Mish activation function; then, the plurality of shallow feature maps are respectively passed through the anchor-free-window-based target detection network to obtain the plurality of target object region-of-interest maps.
It should be appreciated that inputting the key frames into the anchor-free-window-based object detection network yields multiple target object region-of-interest maps, which can locate and identify key target objects and extract their important information. The target monitoring module 140 is further configured to: pass each key frame in the plurality of key frames through a plurality of convolution layers to obtain a plurality of shallow feature maps; and pass the plurality of shallow feature maps through the anchor-free-window-based target detection network to obtain the plurality of target object region-of-interest maps.
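The shallow-feature stage described above (N stacked convolution layers, 1 <= N <= 6, each followed by the Mish activation) can be sketched as follows. The naive 3x3 convolution, random weights, and N = 2 are illustrative assumptions; the detection head that would consume these shallow feature maps is omitted.

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

def conv3x3_same(x, w):
    """Naive 3x3 'same' convolution on a (C_in, H, W) tensor with
    weights of shape (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * w[o])
    return out

def shallow_features(key_frame, layer_weights):
    """Pass one key frame through N convolution layers (1 <= N <= 6),
    each followed by Mish, to obtain a shallow feature map."""
    x = key_frame
    for w in layer_weights:
        x = mish(conv3x3_same(x, w))
    return x

rng = np.random.default_rng(2)
frame = rng.standard_normal((3, 8, 8))               # toy RGB key frame
weights = [rng.standard_normal((8, 3, 3, 3)) * 0.1,  # layer 1
           rng.standard_normal((8, 8, 3, 3)) * 0.1]  # layer 2, so N = 2
fmap = shallow_features(frame, weights)
print(fmap.shape)
```

In the patent's pipeline the resulting shallow feature maps would then be fed to the anchor-free detection network to produce the region-of-interest maps.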
In the embodiment of the present application, the target extraction module 150 is configured to arrange the multiple target object region-of-interest maps into a three-dimensional tensor, obtain a tracking feature map through a two-dimensional convolutional neural network model, and perform global average pooling on each feature matrix of the tracking feature map along the channel dimension to obtain a tracking feature vector. It should be appreciated that by arranging the region-of-interest maps of the target object into a three-dimensional tensor and inputting the tensor into a two-dimensional convolutional neural network model, local features in the image can be extracted, including texture, shape, and structural information of the target. Global averaging of each feature matrix along the channel dimension of the tracking feature map further reduces the feature dimension and extracts a more representative feature vector. The two-dimensional convolutional neural network model is used to: perform convolution processing, pooling processing, and nonlinear activation processing based on a two-dimensional convolution kernel on the input data in the forward pass of each layer, so that the last layer of the two-dimensional convolutional neural network outputs the tracking feature map.
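The target extraction module's flow can be sketched as below. The channel-axis stacking of the region-of-interest maps, the pooling stand-in for the trained 2-D CNN, and all shapes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical: 6 target-object region-of-interest maps, each 3 x 16 x 16
roi_maps = [rng.standard_normal((3, 16, 16)) for _ in range(6)]
# Arrange the ROI maps into a three-dimensional tensor (frames stacked
# along the channel axis, one common way to feed a 2-D CNN)
tensor_3d = np.concatenate(roi_maps, axis=0)        # shape (18, 16, 16)

def pool2x2(x):
    """2x2 average pooling; a fixed stand-in for one learned CNN stage."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

tracking_feature_map = pool2x2(pool2x2(tensor_3d))  # shape (18, 4, 4)
# Global average pooling of each feature matrix along the channel dimension
tracking_feature_vector = tracking_feature_map.mean(axis=(1, 2))
print(tracking_feature_vector.shape)
```

Note that because the stand-in stages are average pools over a uniform partition, each channel's final value equals that channel's overall mean; a trained CNN would of course produce learned, non-trivial features.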
In the embodiment of the present application, the query module 160 is configured to calculate a matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector. Specifically, the matrix product between the tracking feature vector and the facial feature matrix is calculated according to the following formula to obtain the classification feature vector:

V1 = M ⊗ V2

wherein V1 represents the classification feature vector, V2 represents the tracking feature vector, M represents the facial feature matrix, and ⊗ represents matrix multiplication.
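The matrix product step is simple enough to show directly. The dimensions below are hypothetical (a square facial feature matrix is chosen so that the resulting classification feature vector has the same length as the tracking feature vector, as the later fusion step implies):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16                              # hypothetical feature dimension
M = rng.standard_normal((d, d))     # facial feature matrix
V2 = rng.standard_normal(d)         # tracking feature vector
V1 = M @ V2                         # classification feature vector, V1 = M (x) V2
print(V1.shape)
```

Each entry of V1 is the inner product of one row of the facial feature matrix with the tracking feature vector, which is how the two sources of feature information are combined.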
In the embodiment of the present application, the optimizing module 170 is configured to perform affine subspace probabilization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector. It should be appreciated that, since the tracking feature vector is extracted from the surveillance video, it contains feature information of the target object at different time frames and can therefore capture the motion and trajectory of the target object. That is, the tracking feature vector provides motion and trajectory information of the target object, which is important for the task of classifying whether abnormal personnel exist; meanwhile, the classification feature vector provides the associated features of facial feature information and target image feature information, so that different users can be identified and distinguished. By fusing the two, the information of the surveillance video and the user's facial image can be fully utilized, yielding a more optimized and comprehensive classification feature vector and improving the performance and accuracy of the classification model.
In particular, since the classification feature vector is obtained by calculating a matrix product of the facial feature matrix and the tracking feature vector, the classification feature vector and the tracking feature vector may contain some similar information, resulting in data redundancy during fusion. This may increase the dimension of the feature vector and introduce unnecessary redundant information, complicating the feature vector. Meanwhile, the classification feature vector and the tracking feature vector adopt different feature representation methods or feature spaces, so their feature representations are inconsistent. During fusion, this inconsistency needs to be resolved to ensure that the fused feature vector accurately represents the features of the target object. Thus, affine subspace probabilization is performed on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector.
In the technical scheme of the application, the classification feature vector and the tracking feature vector are first standardized to map them into a probability space. The basis vectors of an affine subspace are then obtained based on the idea of principal component analysis and eigenvalue decomposition. After the basis vectors of the affine subspace are obtained, the classification feature vector and the tracking feature vector are respectively mapped into a common affine subspace, and the affine-transformed classification feature vector and tracking feature vector are then densely connected using a position-wise association response to obtain the optimized classification feature vector. In this way, the main information and structure of the data in the classification feature vector and the tracking feature vector can be effectively extracted, the complexity and redundancy of the data are reduced, and the dimension and transformation of the data can be flexibly adjusted to suit different data types and scenarios.
Specifically, the optimizing module 170 includes the following subunits: a normalization subunit, configured to perform standardization processing on the classification feature vector and the tracking feature vector to obtain a normalized classification feature vector and a normalized tracking feature vector; a covariance calculation subunit, configured to calculate a covariance matrix between the normalized classification feature vector and the normalized tracking feature vector; an eigenvalue decomposition subunit, configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and a plurality of eigenvectors corresponding to the plurality of eigenvalues; an eigenvalue extraction subunit, configured to extract, from the plurality of eigenvectors, the eigenvectors corresponding to the two largest eigenvalues as a first-dimension feature vector and a second-dimension feature vector, which serve as the basis of an affine subspace; a vector arrangement subunit, configured to arrange the first-dimension feature vector and the second-dimension feature vector as column vectors to obtain an affine subspace matrix; an association subunit, configured to perform matrix multiplication of the classification feature vector and the tracking feature vector with the affine subspace matrix, respectively, so as to map each into the affine subspace and obtain an affine-transformed classification feature vector and an affine-transformed tracking feature vector; and a position-wise computation subunit, configured to obtain the optimized classification feature vector by computing the position-wise association between the affine-transformed classification feature vector and the transpose of the affine-transformed tracking feature vector.
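The subunit chain above can be sketched end to end. Several details the patent leaves open are filled in here with loudly labeled assumptions: z-score standardization stands in for the "standardization processing", the two vectors are treated as two observations of a d-dimensional variable when forming the covariance matrix, the subspace mapping projects into and back out of the 2-D span so dimensions are preserved, and the position-wise association is taken as an element-wise product.

```python
import numpy as np

def standardize(v):
    """Z-score standardization (one plausible reading of the patent's
    'standardization processing')."""
    return (v - v.mean()) / (v.std() + 1e-8)

def affine_subspace_fuse(v_cls, v_trk):
    a = standardize(v_cls)                         # normalization subunit
    b = standardize(v_trk)
    # Covariance matrix between the two standardized vectors, treating
    # them as two observations of a d-dimensional variable (assumption)
    cov = np.cov(np.stack([a, b]), rowvar=False)   # shape (d, d)
    # Eigenvalue decomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Basis of the affine subspace: eigenvectors of the two largest eigenvalues
    basis = eigvecs[:, np.argsort(eigvals)[-2:]]   # shape (d, 2)
    # Map both vectors into (and back out of) the common subspace
    a_aff = basis @ (basis.T @ a)
    b_aff = basis @ (basis.T @ b)
    # Position-wise association of the two affine-transformed vectors
    return a_aff * b_aff

rng = np.random.default_rng(5)
v_cls = rng.standard_normal(16)   # toy classification feature vector
v_trk = rng.standard_normal(16)   # toy tracking feature vector
optimized = affine_subspace_fuse(v_cls, v_trk)
print(optimized.shape)
```

The projection through a rank-2 basis discards directions outside the shared subspace, which is one way redundancy between the two vectors could be suppressed before fusion.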
In this embodiment of the present application, the anomaly detection module 180 is configured to pass the optimized classification feature vector through a classifier to obtain a classification result, where the classification result is used to indicate whether an abnormal person exists, and send an alarm signal to a staff or an owner.
In one embodiment of the application, FIG. 3 illustrates a block diagram of an anomaly detection module in an artificial intelligence based image data processing system in accordance with an embodiment of the present application. As shown in fig. 3, in the artificial intelligence based image data processing system 100, the anomaly detection module 180 includes: a full-connection encoding unit 181, configured to perform full-connection encoding on the optimized classification feature vector by using a full-connection layer of the classifier to obtain a fully-connected encoded feature vector; a probability obtaining unit 182, configured to pass the fully-connected encoded feature vector through a Softmax classification function of the classifier to obtain a first probability that there is a risk of abnormal personnel and a second probability that there is no such risk; and a classification result determination unit 183, configured to determine the classification result based on a comparison between the first probability and the second probability.
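The classifier's three units map to a few lines of code: a fully connected layer, a two-way Softmax, and a comparison of the two probabilities. The weights and dimensions below are hypothetical; a real system would use trained parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(optimized_vec, W, b):
    """Fully-connected encoding followed by a two-way Softmax.
    By convention here, index 0 is the probability that abnormal personnel
    are present and index 1 that none are (an assumption of this sketch)."""
    logits = W @ optimized_vec + b        # full-connection encoding unit
    probs = softmax(logits)               # probability obtaining unit
    label = int(np.argmax(probs))         # classification result determination
    return probs, label

rng = np.random.default_rng(6)
W = rng.standard_normal((2, 16)) * 0.1    # hypothetical learned weights
b = np.zeros(2)
vec = rng.standard_normal(16)             # toy optimized classification vector
probs, label = classify(vec, W, b)
print(probs, label)
```

In the system described above, label 0 (abnormal personnel present) would be the condition that triggers the alarm signal to the staff or owner.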
In summary, according to the artificial-intelligence-based image data processing system and method provided by the embodiment of the application, facial features and target features are extracted by comprehensively utilizing the facial image of each user in the residential community together with the monitoring video data, and by combining image processing, feature extraction, and classification techniques, the extracted features are calculated and compared so that whether abnormal personnel exist can be judged and measures can be taken in time, improving the proactiveness of the video surveillance system.
As described above, the image data processing system 100 based on artificial intelligence according to an embodiment of the present application may be implemented in various terminal devices, for example, a server of the image data processing system based on artificial intelligence, or the like. In one example, the image data processing system 100 based on artificial intelligence may be integrated into the terminal device as a software module and/or hardware module. For example, the artificial intelligence based image data processing system 100 may be a software module in the operating system of the terminal device or may be an application developed for the terminal device; of course, the artificial intelligence based image data processing system 100 may equally be one of a number of hardware modules of the terminal device.
Alternatively, in another example, the artificial intelligence based image data processing system 100 and the terminal device may be separate devices, and the artificial intelligence based image data processing system 100 may be connected to the terminal device through a wired and/or wireless network and transmit the interactive information in an agreed data format.
Exemplary method
FIG. 4 illustrates a flow chart of an artificial intelligence based image data processing method according to an embodiment of the application. As shown in fig. 4, an artificial intelligence-based image data processing method according to an embodiment of the present application includes: s110, collecting facial images of all users, and acquiring a monitoring video through a camera; s120, acquiring a plurality of facial image feature images of the acquired facial image of the user through a first convolutional neural network model using a spatial attention mechanism; s130, carrying out global averaging pooling on each feature matrix of the plurality of facial image feature images along the channel dimension to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix according to time sequence; s140, extracting a plurality of key frames from the monitoring video, and then respectively obtaining a plurality of target object interested region diagrams through a target detection network based on an anchor-free window by the plurality of key frames; s150, arranging the multiple target object region-of-interest graphs into a three-dimensional tensor, obtaining a tracking feature graph through a two-dimensional convolutional neural network model, and carrying out global averaging pooling on each feature matrix of the tracking feature graph along a channel dimension to obtain a tracking feature vector; s160, calculating a matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector; s170, carrying out affine subspace probability on the classification characteristic vector and the tracking characteristic vector to obtain an optimized classification characteristic vector; and S180, the optimized classification feature vector passes through a classifier to obtain a classification result, wherein the classification result is used for 
indicating whether abnormal personnel exist or not and sending an alarm signal to the staff or the owner.
FIG. 5 illustrates an architectural diagram of an artificial intelligence based image data processing method according to an embodiment of the present application. As shown in fig. 5, in the embodiment of the present application, first, face images of each user are collected, and a monitoring video is obtained through a camera; then, acquiring a plurality of facial image feature images of the acquired facial image of the user through a first convolutional neural network model using a spatial attention mechanism; then, carrying out global mean pooling on each feature matrix of the plurality of facial image feature images along the channel dimension to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix according to time sequence; secondly, extracting a plurality of key frames from the monitoring video, and then respectively obtaining a plurality of target object interested region diagrams through a target detection network based on an anchor-free window by the plurality of key frames; then, arranging the multiple target object region-of-interest graphs into a three-dimensional tensor, obtaining a tracking feature graph through a two-dimensional convolutional neural network model, and carrying out global average pooling on each feature matrix of the tracking feature graph along the channel dimension to obtain a tracking feature vector; then, calculating the matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector; then, carrying out affine subspace probability on the classification characteristic vector and the tracking characteristic vector to obtain an optimized classification characteristic vector; and finally, the optimized classification feature vector passes through a classifier to obtain a classification result, wherein the classification result is used for indicating whether abnormal personnel exist or not and 
sending an alarm signal to a staff or an owner.
In one embodiment of the present application, acquiring a plurality of facial image feature maps from the acquired facial images of the user through the first convolutional neural network using a spatial attention mechanism includes: passing the face image of each user through the convolutional encoding portion of the first convolutional neural network to obtain a plurality of high-dimensional feature maps; inputting each of the plurality of high-dimensional feature maps into the spatial attention portion of the first convolutional neural network to obtain a plurality of spatial attention maps; and multiplying, position-wise, each spatial attention map with its corresponding high-dimensional feature map to obtain the plurality of facial image feature maps.
In one embodiment of the present application, performing affine subspace probabilization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector includes: performing standardization processing on the classification feature vector and the tracking feature vector to obtain a standardized classification feature vector and a standardized tracking feature vector; calculating a covariance matrix between the standardized classification feature vector and the standardized tracking feature vector; performing eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and a plurality of corresponding eigenvectors; extracting, from the plurality of eigenvectors, the eigenvectors corresponding to the two largest eigenvalues as a first-dimension feature vector and a second-dimension feature vector, which serve as the basis of an affine subspace; arranging the first-dimension feature vector and the second-dimension feature vector as column vectors to obtain an affine subspace matrix; performing matrix multiplication of the classification feature vector and the tracking feature vector with the affine subspace matrix, respectively, to map each into the affine subspace and obtain an affine-transformed classification feature vector and an affine-transformed tracking feature vector; and computing the position-wise association between the affine-transformed classification feature vector and the transpose of the affine-transformed tracking feature vector to obtain the optimized classification feature vector.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described artificial intelligence-based image data processing method have been described in detail in the above description of the artificial intelligence-based image data processing system with reference to fig. 1 to 3, and thus, repetitive descriptions thereof will be omitted.
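As a small illustration of the query step of the method referred to above, the classification feature vector is the matrix product of the facial feature matrix and the tracking feature vector. All shapes in the sketch below are assumptions chosen for illustration, not values fixed by the disclosure.

```python
import numpy as np

# Facial feature matrix M: one pooled facial feature vector per time step
# (rows); tracking feature vector v2 comes from the surveillance branch.
rng = np.random.default_rng(2)
M = rng.standard_normal((6, 32))    # 6 face frames x 32-dim features (assumed)
v2 = rng.standard_normal(32)        # 32-dim tracking feature vector (assumed)

# Classification feature vector: matrix product of M and v2
v1 = M @ v2                         # shape (6,), one score per face frame
print(v1.shape)
```

Each entry of v1 is the inner product of one facial feature vector with the tracking feature vector, so the classification vector measures how strongly each face frame responds to the tracked target.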
Exemplary electronic device
Next, an electronic device according to an embodiment of the present application is described with reference to fig. 6.
Fig. 6 illustrates a block diagram of an electronic device according to an embodiment of the application.
As shown in fig. 6, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the instructions to implement the artificial intelligence based image data processing and/or other desired functions of the various embodiments of the application described above. Various contents, such as the facial image data of each user and the monitoring video, may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
The input means 13 may comprise, for example, a keyboard, a mouse, etc.
The output device 14 may output various information including the classification result and the like to the outside. The output means 14 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the artificial intelligence based image data processing method according to various embodiments of the application described in the "exemplary methods" section of this specification.
Program code for carrying out operations of embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the artificial intelligence based image data processing method according to various embodiments of the present application described in the above "exemplary method" section of the present specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be considered as essential to the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in the present application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open words meaning "including but not limited to," and may be used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (10)

1. An artificial intelligence based image data processing system comprising:
the data acquisition module is used for acquiring facial images of all users and monitoring videos acquired by the cameras;
the image processing module is used for obtaining a plurality of facial image feature maps from the acquired facial images of the users through a first convolutional neural network model using a spatial attention mechanism;
the pooling module is used for carrying out global mean pooling on each feature matrix of the plurality of facial image feature maps along the channel dimension to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix according to the time sequence;
the target monitoring module is used for extracting a plurality of key frames from the monitoring video, and then obtaining a plurality of target object region of interest graphs through a target detection network based on an anchor-free window respectively;
The target extraction module is used for arranging the multiple target object region-of-interest graphs into a three-dimensional tensor, obtaining a tracking feature graph through a two-dimensional convolutional neural network model, and carrying out global mean pooling on each feature matrix of the tracking feature graph along the channel dimension to obtain a tracking feature vector;
the query module is used for calculating the matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector;
the optimization module is used for performing affine subspace optimization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector;
the abnormality detection module is used for passing the optimized classification feature vector through a classifier to obtain a classification result, wherein the classification result is used for indicating whether abnormal personnel are present, and for sending an alarm signal to staff or owners.
2. The artificial intelligence based image data processing system of claim 1 wherein the image processing module comprises:
a convolution encoding unit, configured to pass face images of each user in the face images of the plurality of users through a convolution encoding portion of the first convolution neural network to obtain a plurality of high-dimensional feature maps;
A spatial attention unit, configured to input each of the plurality of high-dimensional feature maps into a spatial attention portion of the first convolutional neural network to obtain a plurality of spatial attention patterns;
and the attention applying unit is used for multiplying, position-wise, each spatial attention map in the plurality of spatial attention maps with its corresponding high-dimensional feature map to obtain the plurality of facial image feature maps.
3. The artificial intelligence based image data processing system of claim 2 wherein the object monitoring module comprises:
the multi-layer convolution layer unit is used for enabling each key frame in the plurality of key frames to pass through the multi-layer convolution layers respectively to obtain a plurality of shallow feature images;
and the anchor-free window detection unit is used for respectively passing the plurality of shallow feature maps through the anchor-free window-based target detection network to obtain a plurality of target object region-of-interest maps.
4. The artificial intelligence based image data processing system of claim 3 wherein the multi-layer convolution layer unit is characterized by:
the multi-layer convolution layer comprises N convolutional layers, wherein 1 ≤ N < 6;
each of the multiple convolutional layers performs, in the forward pass, convolution processing based on a two-dimensional convolution kernel, pooling processing, and nonlinear activation processing on its input data, such that the last convolutional layer of the multiple convolutional layers outputs the shallow feature map;
the nonlinear activation function used by each of the multiple convolutional layers is a Mish activation function.
5. The artificial intelligence based image data processing system according to claim 4, wherein the object extraction module is characterized in that: arranging the multiple target object region of interest maps into three-dimensional tensors, obtaining a tracking feature map through a two-dimensional convolutional neural network model, and carrying out global mean pooling on each feature matrix of the tracking feature map along a channel dimension to obtain a tracking feature vector, wherein the method comprises the following steps:
and respectively carrying out convolution processing, pooling processing and nonlinear activation processing based on a two-dimensional convolution kernel on input data in forward transfer of layers of the two-dimensional convolution neural network so as to output the tracking feature map by the last layer of the two-dimensional convolution kernel.
6. The artificial intelligence based image data processing system according to claim 5, wherein the query module is configured to calculate the matrix product between the facial feature matrix and the tracking feature vector to obtain the classification feature vector according to the following formula:

V1 = M ⊗ V2

wherein V1 represents the classification feature vector, V2 represents the tracking feature vector, M represents the facial feature matrix, and ⊗ represents matrix multiplication.
7. The artificial intelligence based image data processing system according to claim 6, wherein the optimization module comprises:
the normalization subunit performs normalization processing on the classification feature vector and the tracking feature vector to obtain a normalized classification feature vector and a normalized tracking feature vector;
a covariance calculation subunit for calculating a covariance matrix between the normalized classification feature vector and the normalized tracking feature vector;
a eigenvalue decomposition subunit, configured to perform eigenvalue decomposition on the covariance matrix to obtain a plurality of eigenvalues and a plurality of eigenvalue vectors corresponding to the plurality of eigenvalues;
the extraction feature value subunit extracts feature value vectors corresponding to the first two largest feature values from the plurality of feature vectors as a first-dimension feature vector and a second-dimension feature vector, wherein the first-dimension feature vector and the second-dimension feature vector are used as the basis of an affine subspace;
an arrangement vector subunit, configured to arrange the first dimension feature vector and the second dimension feature vector according to a column vector to obtain an affine subspace matrix;
the association subunit, which performs matrix multiplication on the classification feature vector and the tracking feature vector respectively with the affine subspace matrix, thereby mapping the classification feature vector and the tracking feature vector into the affine subspace to obtain an affine-transformed classification feature vector and an affine-transformed tracking feature vector;
and the position-wise multiplication subunit, which calculates the position-wise product between the affine-transformed classification feature vector and the transpose of the affine-transformed tracking feature vector to obtain the optimized classification feature vector.
8. The artificial intelligence based image data processing system of claim 7 wherein the anomaly detection module comprises:
the full-connection coding unit is used for carrying out full-connection coding on the optimized classification feature vector by using a full-connection layer of the classifier so as to obtain a full-connection coding feature vector;
the probability obtaining unit is used for passing the full-connection coding feature vector through a Softmax classification function of the classifier to obtain a first probability of risk of abnormal personnel and a second probability of no risk of abnormal personnel;
and a classification result determining unit configured to determine the classification result based on a comparison between the first probability and the second probability.
9. An artificial intelligence based image data processing method, comprising:
acquiring face images of all users, and acquiring a monitoring video through a camera;
acquiring a plurality of facial image feature images of the acquired facial image of the user through a first convolution neural network model using a spatial attention mechanism;
carrying out global mean pooling on each feature matrix of the plurality of facial image feature maps along the channel dimension to obtain a plurality of facial image feature vectors, and arranging the plurality of facial image feature vectors into a facial feature matrix according to the time sequence;
extracting a plurality of key frames from the monitoring video, and then respectively obtaining a plurality of target object region-of-interest graphs from the plurality of key frames through a target detection network based on an anchor-free window;
arranging the multiple target object region-of-interest graphs into three-dimensional tensors, obtaining a tracking feature graph through a two-dimensional convolutional neural network model, and carrying out global mean pooling on each feature matrix of the tracking feature graph along the channel dimension to obtain a tracking feature vector;
calculating the matrix product of the facial feature matrix and the tracking feature vector to obtain a classification feature vector;
performing affine subspace optimization on the classification feature vector and the tracking feature vector to obtain an optimized classification feature vector;
and passing the optimized classification feature vector through a classifier to obtain a classification result, wherein the classification result is used for indicating whether abnormal personnel are present, and sending an alarm signal to staff or owners.
10. The artificial intelligence based image data processing method of claim 9, wherein obtaining a plurality of facial image feature maps from the acquired facial images of the users through the first convolutional neural network using a spatial attention mechanism comprises:
the face images of all users in the face images of the users are respectively passed through a convolution coding part of the first convolution neural network to obtain a plurality of high-dimensional feature images;
inputting each high-dimensional feature map in the plurality of high-dimensional feature maps into a space attention part of the first convolutional neural network respectively to obtain a plurality of space attention maps;
and multiplying, position-wise, each spatial attention map in the plurality of spatial attention maps with its corresponding high-dimensional feature map to obtain the plurality of facial image feature maps.
CN202311092448.6A 2023-08-28 2023-08-28 Image data processing system and method based on artificial intelligence Withdrawn CN116994209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311092448.6A CN116994209A (en) 2023-08-28 2023-08-28 Image data processing system and method based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN116994209A true CN116994209A (en) 2023-11-03

Family

ID=88521387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311092448.6A Withdrawn CN116994209A (en) 2023-08-28 2023-08-28 Image data processing system and method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN116994209A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274270A (en) * 2023-11-23 2023-12-22 吉林大学 Digestive endoscope real-time auxiliary system and method based on artificial intelligence
CN117274270B (en) * 2023-11-23 2024-01-26 吉林大学 Digestive endoscope real-time auxiliary system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
US20190258866A1 (en) Human presence detection in edge devices
CN103324919B (en) Video monitoring system and data processing method thereof based on recognition of face
CN113313037A (en) Method for detecting video abnormity of generation countermeasure network based on self-attention mechanism
CN116994209A (en) Image data processing system and method based on artificial intelligence
CN115731513B (en) Intelligent park management system based on digital twinning
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
CN117392604A (en) Real-time information monitoring and management system and method for Internet of things
CN115471216A (en) Data management method of intelligent laboratory management platform
Hermina et al. A Novel Approach to Detect Social Distancing Among People in College Campus
Rezaee et al. Deep-Transfer-learning-based abnormal behavior recognition using internet of drones for crowded scenes
Santos et al. Car recognition based on back lights and rear view features
CN112258707A (en) Intelligent access control system based on face recognition
US20230267742A1 (en) Method and system for crowd counting
CN116797814A (en) Intelligent building site safety management system
Supangkat et al. Moving Image Interpretation Models to Support City Analysis
CN113920470B (en) Pedestrian retrieval method based on self-attention mechanism
WO2022228325A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN112541403A (en) Indoor personnel falling detection method utilizing infrared camera
Ghosh et al. Pedestrian counting using deep models trained on synthetically generated images
Ding et al. An Intelligent System for Detecting Abnormal Behavior in Students Based on the Human Skeleton and Deep Learning
CN113627383A (en) Pedestrian loitering re-identification method for panoramic intelligent security
Bao et al. Multiobjects Association and Abnormal Behavior Detection for Massive Data Analysis in Multisensor Monitoring Network
CN110738692A (en) spark cluster-based intelligent video identification method
CN111680674A (en) Hall personnel monitoring method based on self-integrated attention mechanism
Shuai et al. Traffic modeling and prediction using sensor networks: Who will go where and when?

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20231103
