CN117576766A - Cross-space-time compatibility unsupervised self-learning face recognition method and system - Google Patents
- Publication number
- CN117576766A (application CN202410059018.2A)
- Authority
- CN
- China
- Prior art keywords
- face
- features
- face image
- feature
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 49
- 238000012549 training Methods 0.000 claims abstract description 56
- 230000000007 visual effect Effects 0.000 claims abstract description 11
- 238000000605 extraction Methods 0.000 claims description 15
- 230000008859 change Effects 0.000 claims description 12
- 238000004458 analytical method Methods 0.000 claims description 11
- 230000004927 fusion Effects 0.000 claims description 11
- 238000005457 optimization Methods 0.000 claims description 11
- 238000004590 computer program Methods 0.000 claims description 10
- 238000005286 illumination Methods 0.000 claims description 7
- 230000002776 aggregation Effects 0.000 claims description 5
- 238000004220 aggregation Methods 0.000 claims description 5
- 238000003860 storage Methods 0.000 claims description 5
- 238000009499 grossing Methods 0.000 claims description 3
- 230000000903 blocking effect Effects 0.000 abstract description 6
- 230000008901 benefit Effects 0.000 abstract description 5
- 238000010586 diagram Methods 0.000 description 10
- 230000008569 process Effects 0.000 description 9
- 238000005516 engineering process Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 7
- 230000006399 behavior Effects 0.000 description 6
- 230000001815 facial effect Effects 0.000 description 4
- 230000007246 mechanism Effects 0.000 description 3
- 238000012512 characterization method Methods 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000007499 fusion processing Methods 0.000 description 2
- 238000012423 maintenance Methods 0.000 description 2
- 238000004519 manufacturing process Methods 0.000 description 2
- 238000012827 research and development Methods 0.000 description 2
- 230000003238 somatosensory effect Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 108010001267 Protein Subunits Proteins 0.000 description 1
- 230000019771 cognition Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000014509 gene expression Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 239000013598 vector Substances 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
Compared with traditional face recognition methods, the cross-space-time compatibility unsupervised self-learning face recognition method provided by the embodiments of the present application has the following advantages. First, an unsupervised training and optimization scheme based on a vision transformer network structure is adopted that considers local and global features at the same time, so that the similarity between the local and global features of the same face can be maximized and the model gains stronger face representation capability. Second, by clustering faces in cross-space-time scenes and, according to the clustering results, selecting target face photo features in different dimensions to fuse with the original features, a feature bank containing multiple feature samples of different dimensions in cross-space-time scenes is dynamically constructed; face feature comparison is then performed against this feature bank, so no separate model needs to be trained for different scene factors, the domain-shift problems caused by age gaps/aging, light-field/field-of-view changes, behavior changes and the like in cross-space-time scenes can be resolved, and the complex and diverse recognition requirements of cross-space-time multi-scene applications are met.
Description
Technical Field
The application relates to the technical field of face recognition, in particular to a cross-space-time compatibility unsupervised self-learning face recognition method and system.
Background
Face recognition is a biometric recognition technology that identifies or verifies an individual's identity by extracting and analyzing features in a face image or video. The technology is widely applied in many fields, including security, authentication, access control, social media and mobile devices.
In the related art, face features vary greatly across different space-time scenes; for example, the difference between the live face captured by a camera and the registered face can be large due to age, makeup, shooting angle, illumination and the like, which leads to recognition failure or false recognition.
Moreover, the training and inference schemes of existing face recognition models cannot accommodate the recognition requirements of cross-space-time multi-scene applications: different models have to be built for different challenging scenes in terms of data, training, inference and so on, which greatly increases development and maintenance costs, and even a model dedicated to a single scene struggles to reach an ideally high pass rate in that scene, giving users a poor experience.
Disclosure of Invention
The embodiments of the application provide a cross-space-time compatibility unsupervised self-learning face recognition method, device, system, computer equipment and computer readable storage medium, which are used to at least solve the problem of poor face recognition accuracy in cross-space-time scenes in the related art.
In a first aspect, an embodiment of the present application provides a cross-space-time compatibility unsupervised self-learning face recognition method, where the method includes:
acquiring original face data, and training a pre-constructed vision transformer network based on the original face data to obtain a face representation model;
performing supervised training on the face representation model to obtain a face feature extraction model, and extracting a plurality of groups of face features through the face feature extraction model to construct a face feature library;
judging whether the matching score of the face features of the target user in the face feature library is larger than a preset threshold value, if so, collecting a multi-dimensional face image set of the target user, and clustering all face images in the multi-dimensional face image set to obtain a plurality of face image subclasses;
determining a target face image subclass from the face image subclasses, and acquiring target face images in different dimensions from the target face image subclasses;
and respectively extracting features from the target face images in different dimensions, fusing the extracted features with the original face features to obtain fused features, and performing space-time face recognition based on the fused features.
In some embodiments, training a pre-constructed vision transformer network based on the original face data to obtain a face representation model includes:
obtaining a local face image and a global face image by carrying out random change on the original face data;
extracting features from the local face image and the global face image through a local visual network and a global visual network respectively to obtain local face features and global face features respectively;
and iteratively training the vision transformer network by taking the cross entropy between the minimized local face features and the global face features as a constraint condition, and storing the global view network after training as the face representation model.
In some of these embodiments, in iteratively training the vision transformer network, the method further comprises:
updating weight parameters of the local view network by performing back propagation on gradients of the local view network;
and updating the weight parameters of the global view network through exponential movement smoothing based on the weight parameters of the local view network.
In some of these embodiments, the set of multi-dimensional face images includes face images of the target user under different illumination, different fields of view, different age groups, and different occlusion states.
In some embodiments, clustering each image in the multi-dimensional face image set to obtain a plurality of face image subclasses includes:
extracting face features of each face image, and constructing feature pairs based on any two face features;
obtaining a comparison score of each feature pair by calculating the similarity between two face features in any feature pair, and determining a target feature pair from the feature pair according to the comparison score of the feature pair;
and carrying out recursion aggregation on the target feature pairs based on cross index analysis to obtain a plurality of face image subclasses.
In some of these embodiments, determining the target face image subclass from the plurality of face image subclasses includes:
acquiring the number of pictures and the average recognition score of each face image subclass, and determining the target face image subclass from the plurality of face image subclasses according to the number of pictures and the average recognition score;
the target face image under different dimensions comprises: target face images under different shielding states, target face images under different illumination, and target face images under different shooting visual angles.
In some of these embodiments, the extracted features are fused with the original face features by the following formula to obtain the fused features:

fts_update = (fts_center * n + ∑fts) / (n + m)

where fts_update is the post-fusion feature and fts_center is the pre-fusion feature (in the initial state, the pre-fusion feature is the original face feature); ∑fts is the sum of the features extracted in the current round, n is the total number of extracted features, and m is the number of features fused in previous rounds.
In a second aspect, embodiments of the present application provide a cross-space-time compatibility unsupervised self-learning face recognition system, the system comprising: the system comprises an unsupervised training module, a supervised training module and an inference optimization module, wherein:
the unsupervised training module is used for acquiring original face data, training a pre-constructed vision transformer network based on the original face data, and obtaining a face representation model;
the supervised training module is used for performing supervised training on the face representation model to obtain a face feature extraction model, and extracting a plurality of groups of face features through the face feature extraction model to construct a face feature library;
the reasoning optimization module is used for judging whether the matching score of the face features of the target user in the face feature library is larger than a preset threshold value, if so, collecting a multi-dimensional face image set of the target user, clustering each face image in the multi-dimensional face image set to obtain a plurality of face image subclasses,
and determining a target face image subclass from the plurality of face image subclasses, acquiring target face images in different dimensions from the target face image subclasses, extracting features from the target face images in different dimensions, fusing the extracted features with the original face features to obtain fused features, and performing space-time cross face recognition based on the fused features.
In a third aspect, embodiments of the present application provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect described above when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described in the first aspect above.
Compared with traditional face recognition methods, the cross-space-time compatibility unsupervised self-learning face recognition method provided by the embodiments of the present application adopts an unsupervised training and optimization scheme based on a vision transformer network structure that considers local and global features at the same time, so that the similarity between the local and global features of the same face can be maximized and the model gains stronger face representation capability. Furthermore, by clustering faces in cross-space-time scenes and, according to the clustering results, selecting target face photo features in different dimensions to fuse with the original features, a feature bank containing multiple feature samples of different dimensions in cross-space-time scenes can be dynamically constructed; the feature bank can then be used for face feature comparison, so no separate model needs to be trained for different scene factors, the domain-shift problems caused by age gaps/aging, light-field/field-of-view changes, behavior changes and the like in cross-space-time scenes can be resolved, and the complex and diverse recognition requirements of cross-space-time multi-scene applications are met.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a cross-space-time compatibility unsupervised self-learning face recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training vision transformer network according to an embodiment of the present application;
FIG. 3 is a block diagram of a cross-space-time compatibility unsupervised self-learning face recognition system in accordance with an embodiment of the present application;
fig. 4 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden on the person of ordinary skill in the art based on the embodiments provided herein, are intended to be within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as an insufficiency of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
Although face recognition technology is well developed, it still faces many challenges in complex practical applications, mainly in two kinds of scenes: cross-time and cross-space. In cross-time scenes the challenges arise in two cases: first, the age gap, i.e., a large age span between the registered photo and the current face; second, the facial features of growing children change quickly as they age. In cross-space scenes the challenges arise in three cases: first, field-of-view changes, from the eye-level view of cooperative access control to the overhead view of non-intrusive surveillance; second, light-field changes, from a dim basement to a strongly exposed outdoor environment and the like; third, behavior changes of the user in the spatial dimension, such as feature loss caused by occlusion from helmets, masks and the like, or feature deviation caused by makeup, photo retouching (PS) and the like.
How to meet the high-pass-rate requirements of these challenging cross-space-time scenes simultaneously, while keeping the false recognition rate acceptable in practical applications, is a problem recognized across the industry. The currently mainstream face recognition models based on deep convolutional neural networks are showing their limits. On one hand, their face representation capability has been stuck in a long bottleneck; on the other hand, their traditional training and inference schemes cannot accommodate the recognition requirements of cross-space-time multi-scene applications: different models must be built for different challenging scenes in terms of data, training, inference and so on, which greatly increases development and maintenance costs, and even a model dedicated to a single scene struggles to reach an ideally high pass rate in that scene, giving users a poor experience.
The application provides a cross-space-time compatibility unsupervised self-learning face recognition method. Fig. 1 is a flowchart of the cross-space-time compatibility unsupervised self-learning face recognition method according to an embodiment of the application; as shown in fig. 1, the flow includes the following steps:
s101, acquiring original face data, and training a pre-constructed vision transformer network based on the original face data to obtain a face representation model;
the original face data is composed of massive non-tag face image data, and optionally comprises two parts, wherein one part is an open source face image without a tag, and the other part is a face image obtained by removing the tag from the open source face image with the identification tag.
In this embodiment, through the above manner, the data richness for training the face feature characterization model can be improved, so that the generalization capability of the model in the training process is improved.
Further, fig. 2 is a schematic diagram of training the vision transformer network according to an embodiment of the present application. As shown in fig. 2, training the vision transformer network to obtain the face representation model includes the following Step1 to Step3:
step1, obtaining a local face image and a global face image by randomly changing original face data;
It will be appreciated that the random variation augments the data set by applying transformations to the original face data, with the aim of improving the generalization performance of the model. In this embodiment, the random variation includes, but is not limited to: rotation, cropping, scaling, brightness/contrast adjustment, and the like.
Step2, extracting features from the local face image and the global face image through a local visual network and a global visual network respectively to obtain local face features and global face features respectively;
In this embodiment, a Vision Transformer (ViT) is used as the base network. ViT is a deep learning model based on the self-attention mechanism; unlike conventional convolutional neural networks, the self-attention mechanism of the ViT network used in this application can better capture global relations in images.
Specifically, local and Global information of an image is processed through a Local View network (Local View Net) and a Global View network (Global View Net), and it can be understood that the two networks have the same structure but different weight parameters.
Further, the local view network and the global view network output normalized K-dimensional feature vectors F1 and F2, respectively.
Step3, iteratively training the vision transformer network by taking minimizing the cross entropy between the local face features and the global face features as a constraint condition, and saving the trained global view network as the face representation model.
During model training, the weights of the local view network are updated through back propagation of its gradients, which is the usual optimization process in deep learning;
further, the weight parameters of the global view network are updated from the weight parameters of the local view network using an exponential moving average (Exponential Moving Average, EMA for short). EMA is a method for smoothing data: performing an exponential moving average over the weights reduces the variance of the updates and improves training stability, and this way of updating the weights helps adjust the parameters of the global view network more steadily during training.
Further, the loss function adopted in model training may be expressed as L = -F2·log F1: F1 is fed into the logarithmic function, multiplied by F2, and the result is negated. The goal of model training is to minimize this value, i.e., to adjust the network parameters so that F2 and F1 become more similar, thereby minimizing the cross-entropy loss.
Based on the above Step1 to Step3, it can be seen that the model training described above is unsupervised learning without negative sampling: the local view network updates its own weights through back propagation of gradients, and these weights are then smoothed into the global view network via EMA, i.e., the update of the global view network's weight parameters depends on the local view network.
In this embodiment, the model structure is based on the Vision Transformer. By combining the local view network and the global view network under the optimization objective of the cross-entropy loss, and by considering local and global features at the same time, the similarity between the local and global features of the same face can be maximized, so that the network better understands the structure and appearance of a face, gains stronger face representation capability, and produces more effective face feature representations.
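As an illustration, the following is a minimal PyTorch-style sketch of the training loop described in Step1 to Step3: random local/global views, the objective L = -F2·log F1, back propagation for the local view network, and an EMA update for the global view network. The ViT backbone, the crop functions, the softmax normalization of F1/F2, and the EMA decay value are assumptions for the sketch, not details fixed by the patent.

```python
import copy
import torch
import torch.nn.functional as F

def build_networks(vit_encoder):
    # Local View Net and Global View Net share the same structure but not the same weights
    local_view_net = vit_encoder()                    # updated by back propagation
    global_view_net = copy.deepcopy(local_view_net)   # updated only by EMA
    for p in global_view_net.parameters():
        p.requires_grad = False
    return local_view_net, global_view_net

def training_step(local_net, global_net, optimizer, face_batch,
                  random_local_crop, random_global_crop, ema_decay=0.996):
    # Step1: random changes (crop/scale/brightness, ...) produce the two views
    local_view = random_local_crop(face_batch)
    global_view = random_global_crop(face_batch)

    # Step2: normalized K-dimensional features F1 (local) and F2 (global)
    f1 = F.softmax(local_net(local_view), dim=-1)
    with torch.no_grad():
        f2 = F.softmax(global_net(global_view), dim=-1)

    # Step3: minimize the cross entropy L = -F2 * log(F1)
    loss = -(f2 * torch.log(f1 + 1e-8)).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()            # gradients only update the local view network
    optimizer.step()

    # smooth the local weights into the global view network by EMA
    with torch.no_grad():
        for p_g, p_l in zip(global_net.parameters(), local_net.parameters()):
            p_g.mul_(ema_decay).add_(p_l, alpha=1.0 - ema_decay)
    return loss.item()
```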
S102, performing supervised training on a face representation model to obtain a face feature extraction model, and extracting a plurality of groups of face features through the face feature extraction model to construct a face feature library;
the face representation model obtained in the step S101 is universal, and can obtain good local and global feature representation through the unlabeled face data;
however, the above-described face representation model is learned without supervision, so it is more focused on general features in the data distribution than on specific requirements of a specific task; therefore, the above general face characterization model needs to be further adjusted to be more suitable for specific face recognition tasks.
Optionally, the global view network obtained by the training is used as an initialization model, and on the basis of the initialization model, the supervised training is performed by using the labeled data so as to fine tune the parameters of the model, so that the model is more suitable for specific face recognition tasks. The face feature extraction model obtained by the step can learn more specialized feature representation so as to better perform face recognition.
Optionally, the general face recognition ArcFace loss function is applied to train the supervised model. Specifically, the ArcFace loss function can be expressed by the following formula:

L = -(1/N) · Σ_{i=1}^{N} log( e^{s·cos(θ_{yi} + m)} / ( e^{s·cos(θ_{yi} + m)} + Σ_{j≠yi} e^{s·cos θ_j} ) )

where N is the number of samples in the batch, s is a scale parameter, θ_{yi} is the angle between the features of the i-th sample and its true class, θ_j is the angle between the features of the i-th sample and each of the other classes, and m is the margin parameter.

The goal of this loss function is to maximize the gap between the true-class feature angle (θ_{yi}) and the feature angles of the other classes: the scale parameter s controls the range of the angular distribution, and the margin parameter m controls the size of the inter-class differences.
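For reference, a minimal PyTorch sketch of this ArcFace loss follows; the scale s = 64 and margin m = 0.5 are common defaults and, like the embedding dimension, are assumptions rather than values stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """ArcFace loss for supervised fine-tuning of the face representation model."""
    def __init__(self, embed_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class centre
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the true-class angle theta_yi
        target = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        # scale by s and apply the usual softmax cross entropy
        return F.cross_entropy(self.s * logits, labels)
```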
In the step, the model can learn the characteristics related to the task better by performing supervision training on the face recognition data with the labels, so that the performance of the model on specific tasks such as identity recognition and the like is improved. Further, face features of a plurality of users are extracted based on the trained face feature extraction model, and the face features are stored in a face feature library.
S103, judging whether the matching score of the face features of the target user in the face feature library is larger than a preset threshold value, if so, collecting a multi-dimensional face image set of the target user, and clustering all face images in the multi-dimensional face image set to obtain a plurality of face image subclasses;
in this step, the face features of any one target user are collected and matched in the feature library constructed in the above step S102, and if it is determined that the matching score of the face features of the target user in the face feature library is greater than a preset threshold, the face features are input to the subsequent step.
It can be understood that the preset threshold acts as a buffer parameter; in this way, features that fall within a certain similarity range but are still judged as failed recognitions can also serve as raw data for the subsequent feature fusion;
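As an illustration of this buffering, the sketch below separates a strict recognition threshold from the lower preset (buffer) threshold, so that near-miss faces are still collected as raw material for later fusion; the two-threshold structure and the numeric values are assumptions drawn from this description, not values given by the patent.

```python
def match_and_collect(match_score, buffer_threshold=0.45, recognize_threshold=0.60):
    """Return (recognized, collect): a face can fail strict recognition yet still be
    collected for the multi-dimensional image set when it clears the buffer threshold."""
    recognized = match_score >= recognize_threshold
    collect = match_score >= buffer_threshold
    return recognized, collect
```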
further, a multi-dimensional face image set of each target user is obtained, wherein the face image set is a plurality of face images accumulated over a certain period of time, including but not limited to: and the target user is provided with face images under different illumination, different view fields, different age groups and different shielding states.
Further, clustering each face image in the multi-dimensional face image set to obtain a plurality of face image subclasses, including the following steps:
step1, extracting face features of each face image, and constructing feature pairs based on any two face features;
Specifically, by matching the face features in pairs, C(n, 2) = n(n−1)/2 feature pairs can be obtained from n face features;
step2, obtaining a comparison score of each feature pair by calculating the similarity between two face features in any feature pair, and determining a target feature pair from the feature pair according to the comparison score of the feature pair;
specifically, a score threshold may be set, a feature pair with a score exceeding the threshold is considered to be in accordance with a matching relationship, and the feature pair in accordance with the matching relationship is saved as a target feature pair;
step3, performing recursion aggregation on the target feature pairs based on cross index analysis to obtain multiple facial image subclasses.
In this embodiment, association information between data can be obtained through cross index analysis, so as to provide information support for clustering; for example, if features A and B are similar, features B and E are similar, and features E and G are similar, then combining cross index analysis with recursive aggregation yields that A, B, E and G are similar, i.e., A, B, E and G belong to the same face image subclass.
In this embodiment, in the context of face clustering, cross index analysis may be used to compare face features of the same user in different time periods, and by simultaneously examining face features of multiple time points, a relationship between faces may be more accurately established, and for faces of different categories, similarity between them may be found through cross index analysis. This helps to find similar features between different users, or similar face patterns in the time series.
Optionally, a graph structure may be constructed based on the result of the cross index analysis, where each node represents a feature, each side represents a matching relationship of the face features, and further performing aggregation based on the analysis result by using a clustering algorithm to obtain a plurality of face image subclasses;
the clustering process considers the relation among different subclasses, combines subclasses with high similarity, and each subclass represents an aggregated face group, namely a face set with shared similar characteristics and minimum difference in a time period.
Specifically, the facial features in each subclass come from the same user, but due to factors such as different times and scenes, they may show different appearance characteristics such as pose, expression, and illumination conditions; in addition, the faces in each subclass share a higher similarity, i.e., their features have less variance under some particular metric.
Through the steps Step1 to Step3, the face changes of the same user in different scenes can be better understood and processed through the clustering result, and the faces of each subclass share higher similarity, so that a more accurate data basis is provided for subsequent feature fusion.
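The following is a minimal sketch of the Step1 to Step3 clustering, using cosine similarity as the comparison score and a union-find pass as a simple stand-in for the recursive aggregation driven by cross index analysis; the score threshold value is an assumption.

```python
import numpy as np

def cluster_faces(features, score_threshold=0.6):
    """Group face features into subclasses: pair every two features, keep the pairs
    whose similarity exceeds the threshold, then aggregate them transitively
    (A~B, B~E, E~G  =>  A, B, E, G end up in the same subclass)."""
    feats = np.asarray(features, dtype=np.float32)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    n = len(feats)

    parent = list(range(n))                     # union-find forest
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]       # path compression
            i = parent[i]
        return i

    for i in range(n):                          # n(n-1)/2 feature pairs
        for j in range(i + 1, n):
            score = float(feats[i] @ feats[j])  # comparison score of the pair
            if score > score_threshold:         # target feature pair
                parent[find(i)] = find(j)       # aggregate the pair

    subclasses = {}
    for i in range(n):
        subclasses.setdefault(find(i), []).append(i)
    return list(subclasses.values())            # indices of each face image subclass
```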
S104, determining a target face image subclass from the plurality of face image subclasses, and acquiring target face images in different dimensions from the target face image subclasses;
the method comprises the steps of respectively extracting features from target face images in different dimensions, fusing the extracted features with original face features to obtain fused features, and performing space-time face recognition based on the fused features.
Specifically, this step obtains the total number of pictures and the average recognition score of each face image subclass, and selects the subclass with the largest number of pictures and the highest average recognition score as the target face image subclass, where the average recognition score is the mean of the recognition scores corresponding to all face images in that subclass;
The selected subclass contains the largest number of captured face images, so the selected images cover a longer span of time and give a more comprehensive picture of the user's face. In addition, the average recognition score of the face images in each subclass is computed, and the subclass with the highest average is selected, which means the face images in that subclass are more easily and accurately recognized.
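A minimal sketch of this subclass selection is shown below; ranking first by picture count and then by average recognition score is only one possible reading, since the patent states that both criteria are considered without fixing their precedence.

```python
def select_target_subclass(subclasses, recognition_scores):
    """subclasses: list of lists of image indices (output of the clustering step);
    recognition_scores: per-image recognition scores, indexable by image index."""
    def rank(indices):
        avg_score = sum(recognition_scores[i] for i in indices) / len(indices)
        return (len(indices), avg_score)        # most snapshots, then highest average
    return max(subclasses, key=rank)
```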
Further, based on a face quality model and a face attribute model, each snapshot in the selected subclass is analyzed, and target face images conforming to the cross-space-time challenge scenes are selected, such as high-quality, partially occluded, dim-light/over-exposed, and head-down images;
it should be noted that, after determining the sub-category of the target, the face image with the highest quality needs to be further screened from the sub-category, and the steps specifically include:
face quality analysis: the quality of each face image is evaluated using a face quality model. Selecting a high quality image may involve detecting and filtering out some images of poor quality due to blur, pose problems, or other problems;
face attribute analysis: the attributes of each face image are analyzed using a face attribute model. Images conforming to the cross-space-time challenge scenes are selected, which may include partial occlusion, dim light/over-exposure, head-down poses, etc.
Through the steps, the face image of the target in terms of time, recognition accuracy, image quality and scene span can be selected from the face clustering result of each user for the subsequent application scene.
S105, extracting features from target face images in different dimensions, fusing the extracted features with original face features to obtain fused features, and performing space-time face recognition based on the fused features.
The step is to extract the features of the target face image in each dimension through the face feature extraction model obtained in the step S102;
Further, the extracted features are fused with the face features of the user in the current scene extracted in step S103, so as to construct a face feature bank containing multiple sample features. The feature fusion process can be realized by the following formula:

fts_update = (fts_center * n + ∑fts) / (n + m)
where fts_update is the post-fusion feature and fts_center is the pre-fusion feature (in the initial state, the pre-fusion feature is the original feature); ∑fts is the sum of the features extracted in the current round, n is the total number of extracted features, and m is the number of features fused in previous rounds.
The feature fusion operation is an incremental process, and each iteration updates the features fused previously. The feature fusion process of the embodiment is a dynamic gradual updating process; by considering features that have been fused before in each new fusion, historical information can be preserved, making the fused features more comprehensive and representative.
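A minimal sketch of one round of this incremental fusion follows. Because the patent's description of n and m is ambiguous, the sketch reads n as the number of features already represented by fts_center and m as the number of features extracted in the current round, i.e. a weighted running mean; this interpretation is an assumption.

```python
import numpy as np

def fuse_features(fts_center, n_fused, new_features):
    """One fusion round of fts_update = (fts_center * n + sum(fts)) / (n + m)."""
    new_features = np.asarray(new_features, dtype=np.float32)
    m = len(new_features)                                   # features extracted this round
    fts_update = (fts_center * n_fused + new_features.sum(axis=0)) / (n_fused + m)
    return fts_update, n_fused + m                          # new centre and its weight

# Usage: start from the original face feature, then fold in each round's target features.
# center, count = original_face_feature, 1
# center, count = fuse_features(center, count, round_features)
```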
After the scheme is deployed in practice, this dynamic updating mechanism helps the model better adapt to face changes across space-time scenes and keeps the representation sufficiently rich: each fusion takes more than just the current features into account, so the model can follow the dynamic changes of the user's facial features.
Through the above steps S101 to S105, compared with traditional face recognition methods, the present method adopts an unsupervised training and optimization scheme based on a vision transformer network structure, exploits massive amounts of unlabeled data, and considers local and global features at the same time, so that the similarity between the local and global features of the same face can be maximized and the model gains stronger face representation capability. Furthermore, by clustering faces in cross-space-time scenes and, according to the clustering results, selecting target face photo features in different dimensions to fuse with the original features, a feature bank containing multiple feature samples of different dimensions in cross-space-time scenes can be dynamically constructed, so no separate model needs to be trained for different scene factors, the domain-shift problems caused by age gaps/aging, light-field/field-of-view changes, behavior changes and the like in cross-space-time scenes can be resolved, and the complex and diverse recognition requirements of cross-space-time multi-scene applications are met.
In addition, experiments by the developers show that, for challenging face recognition in cross-space-time scenes, the cross-space-time compatibility unsupervised self-learning face recognition method provided by the application achieves at least the following improvements over the traditional approach:
1. the average recognition pass rate in the age-gap scene increases from 62% to 91%;
2. users do not need to frequently replace their registered photos in the aging scene, and a high recognition pass rate is maintained throughout;
3. the average recognition pass rate in the field-of-view change scene increases from 74% to 93%;
4. the average recognition pass rate in the light-field change scene increases from 87% to 99%;
5. the average pass rate under large-area occlusion, such as wearing helmets or masks, increases from 56% to 98%;
6. the average pass rate under feature deviation caused by makeup, photo retouching (PS) and the like increases from 83% to 98%.
it should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
This embodiment also provides a cross-space-time compatibility unsupervised self-learning face recognition system, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the terms "module," "unit," "sub-unit," and the like may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a cross-space-time compatibility unsupervised self-learning face recognition system according to an embodiment of the present application, the system including: an unsupervised training module 30, a supervised training module 31 and an inference optimization module 32, wherein:
the unsupervised training module 30 is configured to collect original face data, and train a pre-constructed vision transformer network based on the original face data to obtain a face representation model;
the supervised training module 31 is configured to perform supervised training on the face representation model to obtain a face feature extraction model, and extract a plurality of groups of face features through the face feature extraction model to construct a face feature library;
the inference optimization module 32 is configured to determine whether a matching score of a face feature of a target user in a face feature library is greater than a preset threshold, if so, collect a multi-dimensional face image set of the target user, cluster each face image in the multi-dimensional face image set to obtain a plurality of face image subclasses,
and determining a target face image subclass from the plurality of face image subclasses, acquiring target face images in different dimensions from the target face image subclasses, extracting features from the target face images in different dimensions, fusing the extracted features with the original face features to obtain fused features, and performing space-time cross face recognition based on the fused features.
With this system, the unsupervised training module 30 exploits massive amounts of unlabeled data and considers the local and global features of the same face at the same time, so that the similarity between the local and global features of the same face can be maximized and the model gains stronger face representation capability. With the inference optimization module 32, by clustering faces in cross-space-time scenes and, according to the clustering results, selecting target face photo features in different dimensions to fuse with the original features, a feature bank containing multiple feature samples of different dimensions in cross-space-time scenes can be dynamically constructed, so no separate model needs to be trained for different scene factors, the domain-shift problems caused by age gaps/aging, light-field/field-of-view changes, behavior changes and the like in cross-space-time scenes can be resolved, and the complex and diverse recognition requirements of cross-space-time multi-scene applications are met.
In one embodiment, an electronic device, which may be a server, is provided; fig. 4 is a schematic diagram of its internal structure according to an embodiment of the present application. The electronic device includes a processor, a network interface, an internal memory, and a non-volatile memory connected by an internal bus, where the non-volatile memory stores an operating system, a computer program, and a database. The processor provides computing and control capability, the network interface communicates with external terminals over a network connection, and the internal memory provides the environment for running the operating system and the computer program; when executed by the processor, the computer program implements a cross-space-time compatibility unsupervised self-learning face recognition method, and the database is used for storing data.
It will be appreciated by those skilled in the art that the structure shown in fig. 4 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (10)
1. An unsupervised self-learning face recognition method with cross space-time compatibility, which is characterized by comprising the following steps:
acquiring original face data, and training a pre-constructed vision transformer network based on the original face data to obtain a face representation model;
performing supervised training on the face representation model to obtain a face feature extraction model, and extracting a plurality of groups of face features through the face feature extraction model to construct a face feature library;
judging whether the matching score of the face features of the target user in the face feature library is larger than a preset threshold value, if so, collecting a multi-dimensional face image set of the target user, and clustering all face images in the multi-dimensional face image set to obtain a plurality of face image subclasses;
determining a target face image subclass from the face image subclasses, and acquiring target face images in different dimensions from the target face image subclasses;
and respectively extracting features from the target face images in different dimensions, fusing the extracted features with the original face features to obtain fused features, and performing space-time face recognition based on the fused features.
2. The method of claim 1, wherein training a pre-constructed vision transformer network based on the raw face data to obtain a face representation model comprises:
obtaining a local face image and a global face image by carrying out random change on the original face data;
extracting features from the local face image and the global face image through a local visual network and a global visual network respectively to obtain local face features and global face features respectively;
and iteratively training the vision transformer network by taking the cross entropy between the minimized local face features and the global face features as a constraint condition, and storing the global view network after training as the face representation model.
3. The method of claim 2, wherein during iterative training of the vision transformer network, the method further comprises:
updating weight parameters of the local view network by performing back propagation on gradients of the local view network;
and updating the weight parameters of the global view network through exponential movement smoothing based on the weight parameters of the local view network.
4. The method of claim 1, wherein the set of multi-dimensional face images includes face images of the target user under different illumination, different fields of view, different age groups, and different occlusion states.
5. The method of claim 4, wherein clustering each image in the set of multi-dimensional face images to obtain a plurality of face image subclasses comprises:
extracting face features of each face image, and constructing feature pairs based on any two face features;
obtaining a comparison score of each feature pair by calculating the similarity between two face features in any feature pair, and determining a target feature pair from the feature pair according to the comparison score of the feature pair;
and carrying out recursion aggregation on the target feature pairs based on cross index analysis to obtain a plurality of face image subclasses.
6. The method of claim 5, wherein determining a target face image subclass from the plurality of face image subclasses comprises:
acquiring the number of pictures and the average recognition score of each face image subclass, and determining the target face image subclass from the plurality of face image subclasses according to the number of pictures and the average recognition score;
the target face image under different dimensions comprises: target face images under different shielding states, target face images under different illumination, and target face images under different shooting visual angles.
7. The method of claim 6, wherein the extracted features are fused with the original face features by the following formula to obtain fused features:
fts_update = (fts_center * n + ∑fts) / (n+m)
where fts_update is the post-fusion feature and fts_center is the pre-fusion feature (in the initial state, the pre-fusion feature is the original face feature); ∑fts is the sum of the features extracted in the current round, n is the total number of extracted features, and m is the number of features fused in previous rounds.
8. A cross-space-time compatibility unsupervised self-learning face recognition system, comprising an unsupervised training module, a supervised training module and an inference optimization module, wherein:
the unsupervised training module is used for acquiring original face data, training a pre-constructed vision transformer network based on the original face data, and obtaining a face representation model;
the supervised training module is used for performing supervised training on the face representation model to obtain a face feature extraction model, and extracting a plurality of groups of face features through the face feature extraction model to construct a face feature library;
the inference optimization module is used for judging whether the matching score of the face features of the target user in the face feature library is larger than a preset threshold value, and if so, collecting a multi-dimensional face image set of the target user, clustering each face image in the multi-dimensional face image set to obtain a plurality of face image subclasses,
and determining a target face image subclass from the plurality of face image subclasses, acquiring target face images in different dimensions from the target face image subclass, extracting features from the target face images in different dimensions, fusing the extracted features with the original face features to obtain fused features, and performing cross-space-time face recognition based on the fused features.
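The inference optimization module gates the self-learning update on the matching score against the face feature library. The sketch below shows that gating step only, assuming cosine-similarity matching against a dictionary-style library; identifiers and the threshold are illustrative, and the downstream collection, clustering, and fusion steps correspond to the sketches after claims 5 to 7.

```python
import numpy as np

def match_and_gate(query_feature, feature_library, threshold=0.5):
    """Return the matched user id if the best matching score exceeds the
    preset threshold (triggering multi-dimensional collection and fusion),
    otherwise None.

    feature_library: dict mapping user id -> L2-normalised library feature.
    """
    ids = list(feature_library)
    library = np.stack([feature_library[i] for i in ids])
    scores = library @ np.asarray(query_feature)   # matching scores
    best = int(np.argmax(scores))
    if scores[best] <= threshold:
        return None
    return ids[best]
```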
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410059018.2A CN117576766B (en) | 2024-01-16 | 2024-01-16 | Cross-space-time compatibility unsupervised self-learning face recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117576766A true CN117576766A (en) | 2024-02-20 |
CN117576766B CN117576766B (en) | 2024-04-26 |
Family
ID=89888522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410059018.2A Active CN117576766B (en) | 2024-01-16 | 2024-01-16 | Cross-space-time compatibility unsupervised self-learning face recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117576766B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104463172A (en) * | 2014-12-09 | 2015-03-25 | 中国科学院重庆绿色智能技术研究院 | Face feature extraction method based on face feature point shape drive depth model |
CN109934197A (en) * | 2019-03-21 | 2019-06-25 | 深圳力维智联技术有限公司 | Training method, device and the computer readable storage medium of human face recognition model |
WO2021174880A1 (en) * | 2020-09-01 | 2021-09-10 | 平安科技(深圳)有限公司 | Feature extraction model training method, facial recognition method, apparatus, device and medium |
CN112418074A (en) * | 2020-11-20 | 2021-02-26 | 重庆邮电大学 | Coupled posture face recognition method based on self-attention |
KR20230088616A (en) * | 2021-12-11 | 2023-06-20 | 계명대학교 산학협력단 | Vision transformer-based facial expression recognition apparatus and method |
CN114120430A (en) * | 2022-01-26 | 2022-03-01 | 杭州魔点科技有限公司 | Mask face recognition method based on double-branch weight fusion homology self-supervision |
CN114677646A (en) * | 2022-04-06 | 2022-06-28 | 上海电力大学 | Vision transform-based cross-domain pedestrian re-identification method |
CN117011907A (en) * | 2022-11-03 | 2023-11-07 | 腾讯科技(深圳)有限公司 | Cross-age face recognition method and related device |
CN115953650A (en) * | 2023-03-01 | 2023-04-11 | 杭州海康威视数字技术股份有限公司 | Training method and device of feature fusion model |
Non-Patent Citations (4)
Title |
---|
NGUYEN, X. B. et al.: "Clusformer: A Transformer based Clustering Approach to Unsupervised Large-scale Face and Visual Landmark Recognition", WEB OF SCIENCE, 31 December 2021 (2021-12-31) *
REN, Pengzhen: "Research on Model Performance Optimization for Visual Analysis Tasks", China Doctoral Dissertations Full-text Database, Information Science and Technology, vol. 2023, no. 2, 15 February 2023 (2023-02-15) *
JI, Ruirui et al.: "Face Recognition Method Based on an Improved Vision Transformer", Computer Engineering and Applications, vol. 59, no. 8, 31 August 2023 (2023-08-31) *
ZHANG, Yan'an; WANG, Hongyu; XU, Fang: "Face Recognition Based on Deep Convolutional Neural Network and Center Loss", Science Technology and Engineering, no. 35, 18 December 2017 (2017-12-18) *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118172834A (en) * | 2024-03-14 | 2024-06-11 | 甘肃甘大厨电子商务有限责任公司 | User behavior recognition method and device for E-commerce transaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |