CN112016524A - Model training method, face recognition method, device, equipment and medium

Info

Publication number
CN112016524A
CN112016524A (application CN202011027568.4A)
Authority
CN
China
Prior art keywords
face recognition
features
modality
Prior art date
Legal status
Granted
Application number
CN202011027568.4A
Other languages
Chinese (zh)
Other versions
CN112016524B (en)
Inventor
杨馥魁 (Yang Fukui)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011027568.4A
Publication of CN112016524A
Application granted
Publication of CN112016524B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method, a face recognition method, a device, equipment and a medium, relating to artificial intelligence technologies such as computer vision and deep learning. The model training scheme is as follows: extracting the features of each modality of each sample image with a pre-established face recognition model; performing feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities; and performing supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss function corresponding to each. According to the embodiments of the application, the features of the different modalities in an image can be fully mined and face recognition performed on that basis, greatly improving the accuracy of multi-modal face recognition.

Description

Model training method, face recognition method, device, equipment and medium
Technical Field
The application relates to the field of artificial intelligence, in particular to computer vision and deep learning technology, and specifically to a model training method, a face recognition method, a device, equipment and a medium.
Background
Most existing face recognition technologies recognize faces from RGB images alone. In scenes with extremely high accuracy requirements, however, such as door locks and finance, the single RGB modality is often insufficient, which has motivated multi-modal RGBD (RGB + depth) face recognition.
Prior-art RGBD face recognition models mostly adopt a four-channel fusion scheme that fuses the RGB modality and the depth modality directly: the face RGB image and the depth image are combined into a four-channel RGBD input, RGBD features are extracted by the model for feature matching, and the recognition result is output. The accuracy obtained with this four-channel fusion, however, still falls short of current requirements.
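For context, this four-channel fusion amounts to a channel-wise concatenation before a single shared backbone. The following is a minimal sketch under assumed tensor shapes and names; the patent itself provides no code:

```python
# A minimal sketch of the prior-art four-channel fusion described above;
# shapes and variable names are illustrative assumptions.
import torch

rgb = torch.randn(1, 3, 112, 112)    # face RGB image, N x 3 x H x W
depth = torch.randn(1, 1, 112, 112)  # aligned depth map, N x 1 x H x W

# Channel-wise concatenation produces one four-channel RGBD input that a
# single backbone then processes, leaving the modalities entangled.
rgbd = torch.cat([rgb, depth], dim=1)  # N x 4 x H x W
```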
Disclosure of Invention
The application provides a model training method, a face recognition method, a device, equipment and a medium to improve the accuracy of multi-modal face recognition.
In a first aspect, the present application provides a model training method for multi-modal face recognition, the method including:
extracting the features of each modality of each sample image respectively by using a pre-established face recognition model;
performing feature fusion on the features of the modalities to obtain the unique features of each modality and the common features between any at least two modalities;
and performing supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss function corresponding to each.
In a second aspect, the present application further provides a multi-modal face recognition method, including:
extracting the features of each modality of a face image to be recognized by using a face recognition model trained with the model training method provided in the first aspect, and performing feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities;
and performing face recognition on the face image to be recognized with the face recognition model, according to the unique features of each modality and the common features between any at least two modalities.
In a third aspect, the present application further provides a model training apparatus for multi-modal face recognition, the apparatus including:
a feature extraction module, configured to extract the features of each modality of each sample image respectively by using a pre-established face recognition model;
a feature fusion module, configured to perform feature fusion on the features of the modalities to obtain the unique features of each modality and the common features between any at least two modalities;
and a supervised training module, configured to perform supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss function corresponding to each.
In a fourth aspect, the present application further provides a multi-modal face recognition apparatus, including:
a feature processing module, configured to extract the features of each modality of a face image to be recognized by using a face recognition model trained with the model training method provided in the first aspect, and to perform feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities;
and a face recognition module, configured to perform face recognition on the face image to be recognized with the face recognition model, according to the unique features of each modality and the common features between any at least two modalities.
In a fifth aspect, the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method for multi-modal face recognition described in any of the embodiments of the present application.
In a sixth aspect, the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multi-modal face recognition method described in any of the embodiments of the present application.
In a seventh aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the model training method for multimodal face recognition according to any of the embodiments of the present application.
In an eighth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the multimodal face recognition method according to any of the embodiments of the present application.
According to the technical scheme of the application, feature fusion is performed on the features of the different modalities of each sample image, and the fused features are divided into the unique features of each modality and the common features between different modalities; during the training stage, the face recognition model is supervised separately for the unique features and for the common features, so that the trained face recognition model fully mines the features of the different modalities in an image, improving its accuracy in multi-modal face recognition.
It should be understood that this section is not intended to identify key or critical features of the application, nor to limit its scope. Other features of the application will become readily apparent from the following description, and further effects of the above alternatives are described below in conjunction with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow diagram of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a model training method according to an embodiment of the present application;
FIG. 3 is a flow chart diagram of a multi-modal face recognition method according to an embodiment of the application;
FIG. 4 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a multi-modal face recognition apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram of an electronic device for implementing a model training method according to an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the application with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary; those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 is a schematic flow diagram of a model training method according to an embodiment of the present application, which is applicable to training a multi-modal face recognition model so as to perform multi-modal face recognition by using the trained face recognition model, and relates to the technical field of artificial intelligence such as computer vision and deep learning. The method may be performed by a model training apparatus, which is implemented in software and/or hardware, and is preferably configured in an electronic device, such as a server or a computer device. As shown in fig. 1, the method specifically includes the following steps:
s101, respectively extracting each modal characteristic of each sample image by using a pre-established face recognition model.
Specifically, a face recognition model can be established by using any deep learning algorithm, and various modal characteristics can be extracted by using the model. In one embodiment, the established face recognition model may include a feature extraction network of each modality, that is, the face recognition model extracts features of each modality by using the feature extraction network of each modality respectively. As for the specific network structure of the face recognition model, the application is not limited at all.
Different modalities of the image may include, for example, an RGB modality, a depth modality, an infrared modality, a 3D point cloud modality, or the like. The features of each modality of the extracted image may be features of at least two modalities as described above, for example, features of an RGB modality and a depth modality, or features of an RGB modality, a depth modality, and an infrared modality, or any other multi-modality combination. That is, the present application does not limit the specific modality.
In addition, before feature extraction, the sample image may be preprocessed, for example, the input depth image is normalized to remove background noise and the like. Through preprocessing, a high-quality sample image can be obtained, so that a good data base is provided for the subsequent feature extraction and model training.
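As a concrete, hedged reading of the per-modality extraction, each modality can be given its own branch. The tiny convolutional backbone below is an illustrative stand-in, since the application fixes no network structure:

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """One feature extraction branch per modality; the small convolutional
    backbone here is an illustrative assumption, not the patent's network."""
    def __init__(self, in_channels: int, feat_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

# One branch per modality, e.g. a 3-channel RGB branch and a 1-channel depth branch.
rgb_branch = ModalityBranch(in_channels=3)
depth_branch = ModalityBranch(in_channels=1)

rgb_feat = rgb_branch(torch.randn(8, 3, 112, 112))      # shape (8, 256)
depth_feat = depth_branch(torch.randn(8, 1, 112, 112))  # shape (8, 256)
```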
S102, performing feature fusion on the features of the modalities to obtain the unique features of each modality and the common features between any at least two modalities.
A common feature is a combination of those parts of the features of any two or more modalities that are not taken as the modalities' unique features. For example, when the modalities are RGB and depth, the unique features are a part of the RGB features and a part of the depth features respectively, and the common feature is the combination of the remaining RGB features and the remaining depth features.
It should be noted that the application places no limitation on which part of each modality's features serves as its unique feature during fusion: any part of a modality's features may be taken as that modality's unique feature, and the combination of the remaining parts of the features of any at least two modalities may be taken as the common feature between those modalities. One possible realization is sketched below.
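For the two-modality case, this split can be realized by slicing each feature vector and regrouping the slices. The half-and-half boundary in this sketch is one permissible choice, not a requirement of the application:

```python
import torch

def split_features(rgb_feat: torch.Tensor, depth_feat: torch.Tensor):
    """Divide each modality's features into a unique part and a remainder,
    then combine the remainders into the common feature. The half/half
    boundary is an illustrative assumption; any partition is allowed."""
    d = rgb_feat.shape[1] // 2
    rgb_unique = rgb_feat[:, d:]              # e.g. second half of the RGB features
    depth_unique = depth_feat[:, :d]          # e.g. first half of the depth features
    common = torch.cat([rgb_feat[:, :d],      # remaining RGB part
                        depth_feat[:, d:]],   # remaining depth part
                       dim=1)
    return rgb_unique, depth_unique, common

rgb_unique, depth_unique, common = split_features(torch.randn(8, 256),
                                                  torch.randn(8, 256))
# rgb_unique: (8, 128), depth_unique: (8, 128), common: (8, 256)
```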
S103, performing supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss functions corresponding to them.
Specifically, each modality's unique feature corresponds to its own loss function, and each common feature likewise corresponds to a loss function; different features have different loss functions. During training, every loss function supervises the feature it corresponds to, thereby updating the parameters of the feature extraction network or networks from which that feature was derived.
Note that different modalities, as different expressions of one image, have both common and distinct characteristics. In the prior art, the features of the different modalities are simply superposed and concatenated, so during training the model cannot determine which part of the concatenated features represents which structural information in the image; the features are therefore learned poorly and the recognition accuracy of the learned model falls short of requirements. In the present application, the features are divided during fusion into the unique features of the different modalities and the common features between modalities, supervised learning is performed separately with the corresponding loss functions, and the parameters of the different modalities' feature extraction networks are updated. This finer-grained, separate supervision lets the model learn the structural information of each feature better, fully mines the features of the image's different modalities, and ultimately improves the model's face recognition accuracy. A training-step sketch follows.
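The sketch continues the branch and split sketches above (rgb_branch, depth_branch, split_features). The classification heads, the identity count and the plain cross-entropy losses are assumptions; the application equally allows metric losses:

```python
import torch
import torch.nn as nn

num_ids, feat_dim = 1000, 256          # identity count is an assumption
heads = nn.ModuleDict({
    "rgb_unique": nn.Linear(feat_dim // 2, num_ids),
    "depth_unique": nn.Linear(feat_dim // 2, num_ids),
    "common": nn.Linear(feat_dim, num_ids),
})
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    list(rgb_branch.parameters()) + list(depth_branch.parameters())
    + list(heads.parameters()), lr=0.01)

def train_step(rgb: torch.Tensor, depth: torch.Tensor, labels: torch.Tensor) -> float:
    rgb_u, depth_u, common = split_features(rgb_branch(rgb), depth_branch(depth))
    # One loss per feature group: each unique-feature loss reaches only its
    # own branch, while the common-feature loss updates both branches.
    loss = (criterion(heads["rgb_unique"](rgb_u), labels)
            + criterion(heads["depth_unique"](depth_u), labels)
            + criterion(heads["common"](common), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(8, 3, 112, 112), torch.randn(8, 1, 112, 112),
                  torch.randint(0, num_ids, (8,)))
```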
According to this technical scheme, feature fusion is performed on the features of the different modalities of each sample image, the fused features are divided into the unique features of each modality and the common features between different modalities, and the face recognition model is supervised separately for each of them during training, so that the trained model fully mines the features of the different modalities in an image and achieves higher accuracy in multi-modal face recognition.
Fig. 2 is a schematic flow chart of a model training method according to an embodiment of the present application; this embodiment is further optimized on the basis of the embodiment above. As shown in fig. 2, the method specifically includes the following steps:
s201, respectively extracting each modal characteristic of each sample image by using a pre-established face recognition model.
The pre-established face recognition model at least includes a feature extraction network of each modality, for example, in an embodiment, the face recognition model is a multi-branch network structure, and each network branch is used for extracting features of one modality from an input image of the modality. And the adoption of the multi-branch network structure has good effect and low cost.
When each mode comprises an RGB mode and a depth mode, the multi-branch network is a dual-branch network, that is, the multi-branch network comprises an RGB image feature extraction network and a depth image feature extraction network, and is respectively used for extracting RGB features and depth features.
S202, taking any first part of each modality's features of each sample image as that modality's unique feature, and taking the combination of the second parts, i.e., the parts other than the first parts, of the features of any at least two modalities as the common feature between those modalities.
In one embodiment where the modalities are RGB and depth, the first half of the depth features may be taken as the depth unique feature, the second half of the RGB features as the RGB unique feature, and the combination of the second half of the depth features and the first half of the RGB features as the common feature of depth and RGB.
This division into unique and common features is merely an example and is in no way limiting; divisions other than the one above may be used. For example, any portion of the depth features and any portion of the RGB features may be taken as the depth unique feature and the RGB unique feature respectively, with the remaining portions combined as the common feature of depth and RGB. That is, in the present application the unique features of each modality and the common features between at least two modalities are determined during the fusion of the modality features.
For another example, in an embodiment where the modalities are RGB, depth and infrared, the unique features are the RGB, depth and infrared unique features respectively. The common features may be a single combination of the remaining RGB, depth and infrared features excluding the respective unique features, or there may be a common feature for each pair of modalities, namely: the combination of the remaining RGB and depth features, the combination of the remaining RGB and infrared features, and the combination of the remaining infrared and depth features.
That is, when face recognition involves three or more modalities, the common feature determined during fusion may be a single combination of the non-unique parts of all modalities' features, or a combination of the non-unique parts of each pair of modalities' features, as sketched below.
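A minimal sketch of both three-modality variants; the slice positions and dimensions are illustrative assumptions:

```python
import torch

def fuse_three(rgb: torch.Tensor, depth: torch.Tensor, ir: torch.Tensor, d: int):
    """One unique slice per modality, plus either a single common feature over
    all three modalities or one per modality pair, mirroring the two variants
    in the text. Taking the first d dimensions as unique is an assumption."""
    feats = {"rgb": rgb, "depth": depth, "ir": ir}
    unique = {m: f[:, :d] for m, f in feats.items()}
    rest = {m: f[:, d:] for m, f in feats.items()}
    # Variant 1: a single common feature across all three modalities.
    common_all = torch.cat([rest["rgb"], rest["depth"], rest["ir"]], dim=1)
    # Variant 2: a common feature for each pair of modalities.
    common_pairs = {
        ("rgb", "depth"): torch.cat([rest["rgb"], rest["depth"]], dim=1),
        ("rgb", "ir"): torch.cat([rest["rgb"], rest["ir"]], dim=1),
        ("ir", "depth"): torch.cat([rest["ir"], rest["depth"]], dim=1),
    }
    return unique, common_all, common_pairs
```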
S203, performing supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss functions corresponding to them.
During supervised training, supervising a modality's unique feature updates the parameters of that modality's feature extraction network, while supervising the common feature between any at least two modalities updates the parameters of all of those modalities' feature extraction networks. For example, with RGB and depth modalities, supervising the depth unique feature updates the depth image feature extraction network, supervising the RGB unique feature updates the RGB image feature extraction network, and supervising the common feature of depth and RGB updates both networks simultaneously.
The supervised training may follow a classification learning approach or a metric learning approach. For example, under classification learning the model may be trained with a softmax cross-entropy loss; under metric learning, with a center loss; training with a combination of center loss and softmax is also possible. The application does not limit the specific training approach.
It should be noted that if training is based on the center loss, a corresponding center loss function is set for each modality's unique feature and for each common feature between any at least two modalities, each feature is supervised by its own loss function, and the network parameters are updated accordingly. Supervised training with the center loss makes the features extracted by the networks more aggregated and the common features between different modalities more compact, yielding a better face recognition effect; a minimal sketch follows.
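The sketch below follows the standard center loss formulation (Wen et al., 2016), which the center loss mentioned here commonly refers to; the dimensions and the weighting against softmax are assumptions:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: penalizes the squared distance between each feature and
    the learned center of its class, pulling same-identity features together.
    num_classes and feat_dim below are illustrative assumptions."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return 0.5 * ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

center_loss = CenterLoss(num_classes=1000, feat_dim=256)
feats, labels = torch.randn(8, 256), torch.randint(0, 1000, (8,))
loss = center_loss(feats, labels)
# Typically combined with a softmax cross-entropy term, e.g.
# total = ce + lam * center_loss(feats, labels), with a small assumed lam.
```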
According to this technical scheme, feature fusion is performed on the features of the different modalities of each sample image, the features are divided during fusion into the unique features of the different modalities and the common features between modalities, supervised learning is performed separately with the corresponding loss functions, and the parameters of the different modalities' feature extraction networks are updated. This finer-grained, separate supervision lets the model learn the structural information of the features better, fully mines the features of the image's different modalities, and ultimately improves the model's face recognition accuracy.
Fig. 3 is a schematic flow diagram of a multi-modal face recognition method according to an embodiment of the present application, which is applicable to a situation where a trained face recognition model is used to perform multi-modal face recognition, and relates to the technical fields of artificial intelligence such as computer vision and deep learning. The method can be executed by a multi-modal face recognition apparatus, which is implemented by software and/or hardware, and is preferably configured in an electronic device, such as a server or a computer device. As shown in fig. 3, the method specifically includes the following steps:
s301, respectively extracting each modal characteristic of each sample image by using a pre-established face recognition model.
And S302, performing feature fusion on the features of each mode to obtain the own features of each mode and the common features between any at least two modes.
S303, performing supervision training of the face recognition model respectively aiming at the own features of each mode and the common features between at least two arbitrary modes by utilizing the loss functions corresponding to the own features of each mode and the common features between at least two arbitrary modes.
The specific implementation of S301 to S303 is the same as that of the previous embodiment, and is not described herein again.
S304, extracting the features of each modality of the face image to be recognized by using the trained face recognition model, and performing feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities.
S305, performing face recognition on the face image to be recognized with the face recognition model, according to the unique features of each modality and the common features between any at least two modalities.
For example, when the modalities are RGB and depth, the face image to be recognized is input into the trained face recognition model, the model extracts the image's RGB features and depth features, fuses them and divides them into the RGB unique feature, the depth unique feature and the common feature of RGB and depth, and finally outputs the face recognition result based on those features, as sketched below.
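A hedged illustration of this inference flow, continuing the earlier sketches (rgb_branch, depth_branch, split_features). Matching against a gallery by cosine similarity and the acceptance threshold are assumed details not specified by the application:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(rgb: torch.Tensor, depth: torch.Tensor,
              gallery_feats: torch.Tensor, gallery_ids: list,
              threshold: float = 0.5):
    """Extract and fuse the probe's features, then match the concatenated
    unique and common features (here 128 + 128 + 256 = 512 dims) against a
    gallery by cosine similarity; returns an identity or None per probe."""
    rgb_u, depth_u, common = split_features(rgb_branch(rgb), depth_branch(depth))
    probe = F.normalize(torch.cat([rgb_u, depth_u, common], dim=1), dim=1)
    sims = probe @ F.normalize(gallery_feats, dim=1).t()  # (N_probe, N_gallery)
    scores, idx = sims.max(dim=1)
    return [gallery_ids[i] if s >= threshold else None
            for s, i in zip(scores.tolist(), idx.tolist())]
```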
According to this technical scheme, during model training the features of the different modalities of each sample image are fused and divided into the unique features of each modality and the common features between different modalities, and the face recognition model is supervised separately for the unique and common features, so that the trained model fully mines the features of the different modalities in an image and achieves higher accuracy in multi-modal face recognition.
Fig. 4 is a schematic structural diagram of a model training device according to an embodiment of the present application, which is applicable to training a multi-modal face recognition model so as to perform multi-modal face recognition by using the trained face recognition model, and relates to the technical field of artificial intelligence such as computer vision and deep learning. The device can realize the model training method in any embodiment of the application and is used for multi-modal face recognition. As shown in fig. 4, the apparatus 400 specifically includes:
a feature extraction module 401, configured to extract the features of each modality of each sample image respectively by using a pre-established face recognition model;
a feature fusion module 402, configured to perform feature fusion on the features of the modalities to obtain the unique features of each modality and the common features between any at least two modalities;
and a supervised training module 403, configured to perform supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss function corresponding to each.
Optionally, the feature fusion module 402 includes:
a unique feature determination unit, configured to take any first part of each modality's features of each sample image as that modality's unique feature;
and a common feature determination unit, configured to take the combination of the second parts, other than the first parts, of the features of any at least two modalities as the common feature between those modalities.
Optionally, the face recognition model includes at least a feature extraction network for each modality.
Optionally, during the supervised training performed by the supervised training module, supervising a modality's unique feature updates the parameters of that modality's feature extraction network, while supervising the common feature between any at least two modalities updates the parameters of all of those modalities' feature extraction networks.
Optionally, the loss function is a center loss function.
Optionally, the face recognition model is a multi-branch network structure.
Optionally, the supervised training follows a classification learning approach or a metric learning approach.
Optionally, the modalities include at least a depth modality and an RGB modality.
The model training device 400 provided by the embodiment of the present application can execute the model training method provided by any embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method. Reference may be made to the description of any method embodiment of the present application for details not explicitly described in this embodiment.
Fig. 5 is a schematic structural diagram of a multi-modal face recognition device according to an embodiment of the present application, which is applicable to a situation where a trained face recognition model is used to perform multi-modal face recognition, and relates to the technical fields of artificial intelligence, such as computer vision and deep learning. The device can realize the multi-mode face recognition method in any embodiment of the application. As shown in fig. 5, the apparatus 500 specifically includes:
a feature processing module 501, configured to extract the features of each modality of a face image to be recognized by using a face recognition model trained with the model training method of any embodiment above, and to perform feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities;
and a face recognition module 502, configured to perform face recognition on the face image to be recognized with the face recognition model, according to the unique features of each modality and the common features between any at least two modalities.
The multi-modal face recognition device 500 provided by the embodiment of the application can execute the multi-modal face recognition method provided by any embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method. Reference may be made to the description of any method embodiment of the present application for details not explicitly described in this embodiment.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, together with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the model training methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the model training method provided herein.
The memory 602, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the model training method in the embodiments of the present application (e.g., the feature extraction module 401, the feature fusion module 402, and the supervised training module 403 shown in fig. 4). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, that is, implementing the model training method in the above method embodiment.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device that implements the model training method of the embodiment of the present application, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and such remote memory may be connected over a network to an electronic device implementing the model training methods of embodiments of the present application. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for implementing the model training method of the embodiment of the present application may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic apparatus implementing the model training method of the embodiments of the present application, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may be implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network; the client-server relationship arises from computer programs running on the respective computers. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.
According to an embodiment of the present application, there is also provided another electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the multimodal face recognition method of any of the embodiments of the present application. The hardware structure and functions of the electronic device can be explained with reference to the content of the above-mentioned embodiment shown in fig. 6.
There is further provided, in accordance with an embodiment of the present application, another non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the multimodal face recognition method according to any embodiment of the present application. The introduction of the storage medium is explained with reference to the above-described embodiment shown in fig. 6.
According to the technical scheme of the embodiments of the application, the features of the different modalities of each sample image are fused and divided into the unique features of each modality and the common features between different modalities, and the face recognition model is supervised separately for the unique and common features during training, so that the trained model fully mines the features of the different modalities in an image and achieves higher accuracy in multi-modal face recognition.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the application may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (22)

1. A model training method for multi-modal face recognition, the method comprising:
extracting the features of each modality of each sample image respectively by using a pre-established face recognition model;
performing feature fusion on the features of the modalities to obtain the unique features of each modality and the common features between any at least two modalities;
and performing supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss function corresponding to each.
2. The method according to claim 1, wherein performing feature fusion on the features of each modality of each sample image to obtain the unique features of each modality and the common features between any at least two modalities comprises:
taking any first part of each modality's features of each sample image as that modality's unique feature;
and taking the combination of the second parts, other than the first parts, of the features of any at least two modalities as the common feature between those modalities.
3. The method of claim 1, wherein the face recognition model includes at least a feature extraction network for each modality.
4. The method according to claim 3, wherein, during the supervised training, supervising a modality's unique feature updates the parameters of that modality's feature extraction network, and supervising the common feature between any at least two modalities updates the parameters of all of those modalities' feature extraction networks.
5. The method of claim 1, wherein the loss function is a center loss function.
6. The method of claim 1, wherein the face recognition model is a multi-branch network structure.
7. The method of claim 1, wherein the supervised training follows a classification learning approach or a metric learning approach.
8. The method according to any of claims 1-7, wherein the modalities include at least a depth modality and an RGB modality.
9. A multi-modal face recognition method, comprising:
extracting the features of each modality of a face image to be recognized by using a face recognition model trained with the model training method according to any one of claims 1-8, and performing feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities;
and performing face recognition on the face image to be recognized with the face recognition model, according to the unique features of each modality and the common features between any at least two modalities.
10. A model training apparatus for multi-modal face recognition, the apparatus comprising:
a feature extraction module, configured to extract the features of each modality of each sample image respectively by using a pre-established face recognition model;
a feature fusion module, configured to perform feature fusion on the features of the modalities to obtain the unique features of each modality and the common features between any at least two modalities;
and a supervised training module, configured to perform supervised training of the face recognition model separately for the unique features of each modality and for the common features between any at least two modalities, using the loss function corresponding to each.
11. The apparatus of claim 10, wherein the feature fusion module comprises:
a unique feature determination unit, configured to take any first part of each modality's features of each sample image as that modality's unique feature;
and a common feature determination unit, configured to take the combination of the second parts, other than the first parts, of the features of any at least two modalities as the common feature between those modalities.
12. The apparatus of claim 10, wherein the face recognition model includes at least a feature extraction network for each modality.
13. The apparatus according to claim 12, wherein, during the supervised training performed by the supervised training module, supervising a modality's unique feature updates the parameters of that modality's feature extraction network, and supervising the common feature between any at least two modalities updates the parameters of all of those modalities' feature extraction networks.
14. The apparatus of claim 10, wherein the loss function is a center loss function.
15. The apparatus of claim 10, wherein the face recognition model is a multi-branch network structure.
16. The apparatus of claim 10, wherein the supervised training follows a classification learning approach or a metric learning approach.
17. The apparatus according to any of claims 10-16, wherein the modalities include at least a depth modality and an RGB modality.
18. A multi-modal face recognition apparatus, comprising:
a feature processing module, configured to extract the features of each modality of a face image to be recognized by using a face recognition model trained with the model training method according to any one of claims 1-8, and to perform feature fusion on those features to obtain the unique features of each modality and the common features between any at least two modalities;
and a face recognition module, configured to perform face recognition on the face image to be recognized with the face recognition model, according to the unique features of each modality and the common features between any at least two modalities.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method for multimodal face recognition of any one of claims 1-8.
20. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multimodal face recognition method of claim 9.
21. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the model training method for multimodal face recognition of any one of claims 1-8.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the multimodal face recognition method of claim 9.
CN202011027568.4A (priority date 2020-09-25, filing date 2020-09-25): Model training method, face recognition method, device, equipment and medium. Active; granted as CN112016524B.

Priority Applications (1)

Application number: CN202011027568.4A (granted as CN112016524B). Title: Model training method, face recognition method, device, equipment and medium.

Applications Claiming Priority (1)

Application number: CN202011027568.4A (granted as CN112016524B). Title: Model training method, face recognition method, device, equipment and medium.

Publications (2)

Publication number and publication date:
CN112016524A: 2020-12-01
CN112016524B: 2023-08-08

Family

ID=73528349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011027568.4A Active CN112016524B (en) 2020-09-25 2020-09-25 Model training method, face recognition device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112016524B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749679A (en) * 2021-01-22 2021-05-04 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and medium
CN113192639A (en) * 2021-04-29 2021-07-30 平安科技(深圳)有限公司 Training method, device and equipment of information prediction model and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
US20170039357A1 (en) * 2015-08-03 2017-02-09 Samsung Electronics Co., Ltd. Multi-modal fusion method for user authentication and user authentication method
CN106909905A (en) * 2017-03-02 2017-06-30 中科视拓(北京)科技有限公司 A kind of multi-modal face identification method based on deep learning
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN111291740A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Training method of face recognition model, face recognition method and hardware
CN111382683A (en) * 2020-03-02 2020-07-07 东南大学 Target detection method based on feature fusion of color camera and infrared thermal imager

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
US20170039357A1 (en) * 2015-08-03 2017-02-09 Samsung Electronics Co., Ltd. Multi-modal fusion method for user authentication and user authentication method
CN106909905A (en) * 2017-03-02 2017-06-30 中科视拓(北京)科技有限公司 A kind of multi-modal face identification method based on deep learning
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN111382683A (en) * 2020-03-02 2020-07-07 东南大学 Target detection method based on feature fusion of color camera and infrared thermal imager
CN111291740A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Training method of face recognition model, face recognition method and hardware

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TING-YU FAN et al.: "Multi-modality recognition of human face and ear based on deep learning", IEEE *
KONG Jun: "Biometric recognition based on two-layer feature fusion", Journal of Beihua University (Natural Science Edition), no. 01
ZHANG Yan'an; WANG Hongyu; XU Fang: "Face recognition based on deep convolutional neural network and center loss", Science Technology and Engineering, no. 35

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749679A (en) * 2021-01-22 2021-05-04 北京百度网讯科技有限公司 Model training method, face recognition device, face recognition equipment and medium
CN112749679B (en) * 2021-01-22 2023-09-05 北京百度网讯科技有限公司 Model training method, face recognition method, device, equipment and medium
CN113192639A (en) * 2021-04-29 2021-07-30 平安科技(深圳)有限公司 Training method, device and equipment of information prediction model and storage medium
CN113192639B (en) * 2021-04-29 2023-07-11 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of information prediction model

Also Published As

Publication number Publication date
CN112016524B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111221984A (en) Multimodal content processing method, device, equipment and storage medium
CN111931591B (en) Method, device, electronic equipment and readable storage medium for constructing key point learning model
CN112001180A (en) Multi-mode pre-training model acquisition method and device, electronic equipment and storage medium
CN112001366A (en) Model training method, face recognition device, face recognition equipment and medium
CN111859997B (en) Model training method and device in machine translation, electronic equipment and storage medium
CN112466280B (en) Voice interaction method and device, electronic equipment and readable storage medium
CN113344089B (en) Model training method and device and electronic equipment
CN112149741A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN112016524B (en) Model training method, face recognition device, equipment and medium
CN111158666A (en) Entity normalization processing method, device, equipment and storage medium
CN110543558A (en) question matching method, device, equipment and medium
CN111767321A (en) Node relation network determining method and device, electronic equipment and storage medium
CN112215243A (en) Image feature extraction method, device, equipment and storage medium
CN111967591A (en) Neural network automatic pruning method and device and electronic equipment
CN111640103A (en) Image detection method, device, equipment and storage medium
CN112382291B (en) Voice interaction processing method and device, electronic equipment and storage medium
CN112016523B (en) Cross-modal face recognition method, device, equipment and storage medium
CN111738325A (en) Image recognition method, device, equipment and storage medium
CN111160552B (en) News information recommendation processing method, device, equipment and computer storage medium
CN112561059A (en) Method and apparatus for model distillation
CN111767990A (en) Neural network processing method and device
CN111680599A (en) Face recognition model processing method, device, equipment and storage medium
CN111177479A (en) Method and device for acquiring feature vectors of nodes in relational network graph
CN111539225B (en) Searching method and device for semantic understanding framework structure
CN112446728B (en) Advertisement recall method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant