CN112668509A - Training method and recognition method of social relationship recognition model and related equipment - Google Patents


Info

Publication number
CN112668509A
CN112668509A
Authority
CN
China
Prior art keywords
sample
training
network
network model
body frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011634101.6A
Other languages
Chinese (zh)
Other versions
CN112668509B (en)
Inventor
邢玲
余意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Intellifusion Technologies Co Ltd filed Critical Shenzhen Intellifusion Technologies Co Ltd
Priority to CN202011634101.6A priority Critical patent/CN112668509B/en
Publication of CN112668509A publication Critical patent/CN112668509A/en
Application granted granted Critical
Publication of CN112668509B publication Critical patent/CN112668509B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image relationship recognition, and provides a training method and a recognition method for a social relationship recognition model, together with related equipment. The training method of the social relationship recognition model comprises the following steps: performing self-supervised training on a first network model with a preset first training set, based on the prior distance reflecting the relationship between two sample persons in a training image, to obtain a second network model; adjusting the fully connected layer of the second network model to suit social relationship recognition and classification, obtaining a third network model; and fine-tuning the third network model with a preset second training set to obtain a target network model. The method reduces the amount of labeled data required, and the trained target network model recognizes social relationships with higher accuracy.

Description

Training method and recognition method of social relationship recognition model and related equipment
Technical Field
The invention relates to the technical field of image relationship recognition, in particular to a training method and a recognition method of a social relationship recognition model and related equipment.
Background
Social Relations (SR) are the foundation of human society, so understanding the social relationships across society is very important for both governance and scientific research. Identifying the social relationships between people in an image allows an agent to better understand human behavior and emotion. The task of image-based social relationship recognition is to classify a pair of people in an image into one of a set of predefined relationship types, such as friends or family. It has many important applications, for example personal photo-collection mining and social event understanding.
Image-based sociological analysis has been an active topic over the past decade, and in recent years the release of large-scale data sets has drawn researchers to image-based social relationship recognition. Performance has improved markedly, from traditional hand-crafted features to recent deep-learning-based algorithms. However, recognizing social relationships from images still faces several challenges. First, there is a large domain gap between visual features and social relationships. Second, both hand-crafted-feature methods and deep-learning methods require some supervision, and obtaining labels for social relationship types is very difficult: the two public data sets proposed in prior work, PIPA and PISC, are both small-scale, each annotation had to be completed independently by five people, and the final label was generated by a voting mechanism, consuming considerable labeling manpower; deep-learning methods, moreover, typically require a large amount of data, which further raises the difficulty of applying social relationship recognition. The vision-based social relationship recognition task therefore suffers from scarce, hard-to-collect labeled data, which leads to low recognition accuracy.
Disclosure of Invention
The embodiment of the invention provides a training method of a social relationship recognition model, which can reduce the use of labeled data and improve the accuracy of social relationship recognition.
In a first aspect, an embodiment of the present invention provides a method for training a social relationship recognition model, including:
performing self-supervision training on the first network model through a preset first training set based on the prior distance of the relationship between two sample persons in the training image to obtain a second network model;
adjusting the full connection layer of the second network model to adapt to social relationship recognition classification to obtain a third network model;
and adjusting and training the third network model through a preset second training set to obtain a target network model.
In a second aspect, an embodiment of the present invention further provides a social relationship identification method, including the following steps:
acquiring an image to be identified, wherein the image to be identified comprises a first target person and a second target person;
and performing social relationship recognition on the first target person and the second target person in the image to be recognized through a target network model obtained through training in any embodiment.
In a third aspect, an embodiment of the present invention provides a training apparatus for a social relationship recognition model, including:
the first training module is used for carrying out self-supervision training on the first network model through a preset first training set based on the prior distance of the relationship between two sample persons in the training image to obtain a second network model;
the adjusting module is used for adjusting the full connection layer of the second network model to adapt to the training classification of the social relationship recognition model to obtain a third network model;
and the second training module is used for adjusting and training the third network model through a preset second training set to obtain a target network model.
In a fourth aspect, an embodiment of the present invention provides a social relationship identifying apparatus, including:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring an image to be recognized, and the image to be recognized comprises a first target person and a second target person;
and the recognition module is used for recognizing the social relationship between the first target person and the second target person in the image to be recognized through a trained target network model.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the training method for a social relationship recognition model provided by the embodiment of the present invention.
In a sixth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps in the training method for a social relationship recognition model provided in the embodiment of the present invention.
In the embodiment of the invention, a second network model is obtained by performing self-supervised training on a first network model with a preset first training set, based on the prior distance of the relationship between two sample persons in a training image; the fully connected layer of the second network model is adjusted to suit social relationship recognition and classification, obtaining a third network model; and the third network model is fine-tuned with a preset second training set to obtain a target network model. Model training is thus performed with the preset first training set as unsupervised data and the preset second training set as supervised data; after training, the amount of supervised data required is reduced while the recognition capability of the target network model is improved. In practical applications, a large amount of unsupervised data can be used to improve model performance, reducing the amount of labeled data needed and thus the labor cost of annotation.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for training a social relationship recognition model according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for training a social relationship recognition model according to an embodiment of the present invention;
FIG. 2a is a diagram illustrating a second network model training scheme according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of target network model training according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for training a social relationship recognition model according to an embodiment of the present invention;
FIG. 4 is a flowchart of another social relationship identifying method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a training apparatus for a social relationship recognition model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a training apparatus for a social relationship recognition model according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a training apparatus for a social relationship recognition model according to another embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a training apparatus for a social relationship recognition model according to another embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a training apparatus for a social relationship recognition model according to another embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a method for training a social relationship recognition model according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
101. and carrying out self-supervision training on the first network model through a preset first training set based on the prior distance of the relationship between two sample persons in the training image to obtain a second network model.
In the embodiment of the invention, the training method of the social relationship recognition model can be applied to person-relationship recognition under video surveillance. The electronic device running the training method can acquire images and transmit data over a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi (Wireless Fidelity), Bluetooth, WiMAX (Worldwide Interoperability for Microwave Access), ZigBee (a low-power local area network protocol), UWB (ultra-wideband), and other wireless connection manners known now or developed in the future.
The above-mentioned self-supervised training belongs to self-supervised learning, which can be regarded as an "ideal state" of machine learning: the model learns directly from unlabeled data, with no annotation required. Through self-supervised learning the model first acquires certain prior knowledge, and is then fine-tuned on the downstream task, so that with only a small amount of labeled data it reaches the performance that previously required a large amount of labeled data.
The training images are images in a training set and are used for model training. They can be captured in real time by an image acquisition device, or manually uploaded from a terminal, among other options. The image acquisition device may be a camera, or an electronic device equipped with a camera and capable of capturing images. Each image is an image for which relationship analysis is required, containing at least two persons, i.e., the two sample persons. It may also contain distractors, such as buildings, unrelated people, or trees.
The prior distance above is an initial distance value determined from prior experience. The preset first training set may consist of a large amount of unannotated image data, i.e., unlabeled sample images. The first network model is the initial recognition model for social relationship recognition, and the second network model is obtained by self-supervised training on a large amount of unlabeled data. The first training set may include the PISC data set together with the unlabeled data in a self-built data set; the self-built data set contains approximately 200,000 unlabeled samples and 169,900 labeled samples, covering three relationships: family, friend, and stranger. The PISC data set is one of the large social relationship data sets, mainly consisting of images of common social relationships in daily life, and includes 3 coarse-grained relationships: close, general, and no relation.
It should be noted that the terminal may include, but is not limited to, a mobile phone, a tablet personal computer, a laptop computer, a personal digital assistant (PDA), a mobile Internet device (MID), a desktop computer, or a notebook computer.
102. And adjusting the full connection layer of the second network model to adapt to the identification and classification of the social relation, so as to obtain a third network model.
The adjusting of the fully connected layer (fc) of the second network model may mean adjusting the number of output neurons according to parameters such as the relationship types to be recognized and the number of relationship categories. The fully connected layer mainly serves as a classifier: each of its nodes is connected to all nodes of the previous layer and integrates the extracted features. Because of this full connectivity, the fully connected layers usually hold the most parameters. Each fully connected layer is composed of many neurons, and every neuron of a later layer connects to every neuron of the previous layer, integrating the class-discriminative local information produced by the preceding convolution or pooling layers. The output of the last fully connected layer is then passed to the output stage, for example adjusted with a loss function and normalization.
103. And adjusting and training the third network model through a preset second training set to obtain the target network model.
The second training set may include the labeled data (labeled sample images) in the PISC data set. After the second network model has been trained on unlabeled data, the resulting third network model continues training on labeled data, so that less labeled data is needed and manual annotation cost is reduced. The model obtained by training in turn on unlabeled and labeled data is the target network model, through which the social relationship between people can be recognized.
Specifically, in the embodiment of the present invention, the results obtained with different proportions of the labeled data in the PISC data set, for ResNet50 pretraining versus self-supervised task pretraining, are shown in Table 1 below:
Labeled data used | ResNet50 pretraining | Self-supervised task pretraining
5%                | 65.5%                | 66.9%
10%               | 67.7%                | 68.8%
20%               | 69.7%                | 71.4%

TABLE 1
Specifically, in the embodiment of the present invention, the results obtained with the unlabeled data in the self-built data set, for ResNet50 pretraining versus self-supervised task pretraining, are shown in Table 2 below:
ResNet50 pretraining | Self-supervised task pretraining
76.5%                | 79.9%

TABLE 2
These results show that using a large amount of unlabeled data improves model performance, and that the amount of unlabeled data used correlates with the size of the improvement.
In the embodiment of the invention, a second network model is obtained by performing self-supervised training on the first network model with the preset first training set, based on the prior distance of the relationship between two sample persons in a training image; the fully connected layer of the second network model is adjusted to suit social relationship recognition and classification, obtaining a third network model; and the third network model is fine-tuned with the preset second training set to obtain the target network model. Model training is performed with the first training set as unsupervised data and the second training set as supervised data, reducing the amount of supervised data required while improving the recognition capability of the model. In practical applications, a large amount of unsupervised data can be used to improve model performance, reducing the amount of labeled data needed and thus the labor cost of annotation.
Referring to fig. 2, the first network model further includes a target detection network, fig. 2 is a flowchart of a training method of another social relationship recognition model provided in the embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:
201. and inputting the unlabeled sample image into the target detection network, and detecting, through the target detection network, a first sample human body frame of a first sample person, a second sample human body frame of a second sample person, and a first sample human body union frame of the first and second sample persons in the sample image.
Wherein the first training set includes unlabeled sample images, and each unlabeled sample image includes at least two sample persons. The first network model provided in the above embodiment may include a first extraction network, a second extraction network, a third extraction network, a first fully connected layer, and a first loss function (L1 loss), where the network parameters of the first extraction network and the second extraction network are shared. The loss function is computed from forward propagation and is also the starting point for backward propagation. The first, second, and third extraction networks are used to extract feature information of the two sample persons in the unlabeled sample image, and the first loss function is used to adjust the output of the first fully connected layer. The target detection network locates the two sample persons in the unlabeled sample image with sample frames, and the coordinate position of each marked sample frame is taken as the coordinate position of the corresponding sample person. Specifically, the two sample persons comprise a first sample person and a second sample person.
Referring to fig. 2a, after the unlabeled sample image in the first training set is input to the target detection network, target detection outputs the first sample human body frame of the first sample person, the second sample human body frame of the second sample person, and the first sample human body union frame of the two sample persons. The first sample human body union frame covers both persons: it is the smallest frame containing the first sample human body frame and the second sample human body frame.
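The union frame can be computed directly from the two detected body frames. A minimal sketch, assuming frames are given as (x1, y1, x2, y2) corner coordinates (the coordinate convention is not specified in the text):

```python
def union_box(box1, box2):
    """Smallest axis-aligned frame containing both body frames.

    Each frame is (x1, y1, x2, y2), with (x1, y1) the top-left corner
    and (x2, y2) the bottom-right corner.
    """
    return (min(box1[0], box2[0]), min(box1[1], box2[1]),
            max(box1[2], box2[2]), max(box1[3], box2[3]))
```

For example, the union of frames (0, 0, 2, 2) and (1, 1, 5, 3) is (0, 0, 5, 3), the tightest frame enclosing both people.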
202. And calculating the prior distance between the two sample persons according to the first sample body frame and the second sample body frame.
Considering that the degree of closeness between two people is reflected in the distance between them, distance-related data is added in the embodiment of the present application to assist recognition, especially in outdoor scenes. For example, lovers tend to be intimate, holding hands or embracing, while strangers usually keep their distance.
Optionally, step 202 may specifically include the steps of:
and calculating the center distance between the first sample human body frame and the second sample human body frame.
Wherein the centers of the first and second sample human body frames are first located, and the center distance is then computed from the two center coordinates, denoted box1 and box2, where box1 is the center position of the first sample human body frame and box2 is the center position of the second. The center of each sample human body frame is the intersection of its two diagonals.
A first diagonal length value of the first sample body frame is calculated, and a second diagonal length value of the second sample body frame is calculated.
The first and second sample human body frames are both rectangular, and the two diagonals of a rectangle are equal in length, so either diagonal of a frame may be used. The first diagonal length value diag1 is computed from the coordinates of two opposite corners of the first sample human body frame, and the second diagonal length value diag2 of the second sample human body frame is computed in the same way.
And selecting the larger value of the first diagonal length value and the second diagonal length value as a reference value, and calculating the ratio of the central distance to the reference value as the prior distance between the two sample persons.
When the reference value is selected, the larger of the first and second diagonal length values is used. The ratio of the center distance to this reference value is then taken as the prior distance between the two sample persons: the larger the computed prior distance, the farther apart the first and second sample persons are. Using this ratio as the prior distance alleviates, to some extent, the inconsistency of raw pixel distances across images of different scales. Specifically, the prior distance is computed by the following equation (1):
dis(box1, box2) = ||box1 - box2|| / max(diag1, diag2)    (1)

where box1 and box2 are the center coordinates of the two sample human body frames, ||box1 - box2|| is the center distance, and diag1 and diag2 are the first and second diagonal length values.
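The prior-distance computation described above (center distance divided by the larger diagonal) can be sketched as follows, again assuming (x1, y1, x2, y2) corner coordinates:

```python
import math

def prior_distance(box1, box2):
    """Prior distance between two body frames.

    Each frame is (x1, y1, x2, y2). The result is the distance between
    the frame centers divided by the larger frame diagonal, which
    normalizes away the absolute pixel scale of the image.
    """
    cx1, cy1 = (box1[0] + box1[2]) / 2, (box1[1] + box1[3]) / 2
    cx2, cy2 = (box2[0] + box2[2]) / 2, (box2[1] + box2[3]) / 2
    center_dist = math.hypot(cx1 - cx2, cy1 - cy2)
    diag1 = math.hypot(box1[2] - box1[0], box1[3] - box1[1])
    diag2 = math.hypot(box2[2] - box2[0], box2[3] - box2[1])
    return center_dist / max(diag1, diag2)
```

For two frames each with a 3:4 aspect (diagonal 5) whose centers are 10 pixels apart, the prior distance is 10 / 5 = 2.0; scaling the whole image up leaves the value unchanged, which is the point of the normalization.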
203. and extracting a first feature of the first sample human body frame through a first extraction network, extracting a second feature of the second sample human body frame through a second extraction network, and extracting a third feature of the first sample human body combination frame through a third extraction network.
The first extraction network may be a ResNet50 (a residual network), and the first feature of the first sample human body frame is extracted by the ResNet50 backbone; likewise, the second extraction network may be a ResNet50, whose backbone extracts the second feature of the second sample human body frame. The branches for the first and second sample human body frames share network parameters, while the branch for the first sample human body union frame does not share parameters with them. The feature dimension of the extraction is 2048. The third extraction network may also be a ResNet50, through which the third feature of the first sample human body union frame is extracted.
204. And splicing the first characteristic, the second characteristic and the third characteristic to obtain a first splicing characteristic.
The extracted first, second, and third features each have dimension 2048; concatenating (splicing) them yields a 2048 × 3 one-dimensional feature, i.e., the first splicing feature.
205. And carrying out full-connection calculation on the first splicing characteristics through the first full-connection layer to obtain a predicted distance.
After the first splicing feature is extracted, it is passed through the fully connected layer. In practice, a fully connected layer can be implemented as a convolution: a fully connected layer whose preceding layer is also fully connected can be converted into a convolution with a 1 × 1 kernel, while a fully connected layer whose preceding layer is a convolutional layer can be converted into a global convolution with an h × w kernel, where h and w are the height and width of the preceding layer's output. The predicted distance is output after the fully connected computation.
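The equivalence between a fully connected layer over a convolutional feature map and a global h × w convolution can be verified numerically. A toy sketch (sizes are illustrative, not the model's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w, out = 4, 2, 3, 5          # toy sizes: channels, height, width, outputs
x = rng.normal(size=(c, h, w))     # feature map entering the layer
weight = rng.normal(size=(out, c, h, w))

# Fully connected view: flatten the feature map and multiply by a weight matrix.
y_fc = weight.reshape(out, -1) @ x.reshape(-1)

# Convolution view: one global h x w kernel per output neuron; with a single
# valid position, the convolution reduces to the same dot product.
y_conv = np.array([(weight[k] * x).sum() for k in range(out)])
```

The two outputs are identical up to floating-point rounding, which is why the conversion described in the text preserves the network's behavior.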
206. A first error between the prior distance and the predicted distance is calculated by a first loss function.
After the prior distance and the predicted distance are obtained, the first loss function is applied to their difference, calculating the first error between the prior distance and the predicted distance.
207. And according to the first error, the network parameters of the first extraction network, the second extraction network and the third extraction network are adjusted through back propagation to obtain a second network model.
Back-propagating the first error to adjust the network parameters of the first, second, and third extraction networks improves the recognition accuracy of the resulting second network model.
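Steps 204-207 (splice the three features, predict a distance through the fully connected layer, take the L1 error against the prior distance, and back-propagate) can be sketched with random stand-in features. The backbones are ResNet50s in the patent; here random 2048-dimensional vectors stand in for their outputs, and only the fully connected head is updated, purely to illustrate the step:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 2048                                     # feature dimension per branch

# Stand-ins for the three extraction-network outputs.
f1, f2, f3 = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
spliced = np.concatenate([f1, f2, f3])         # 2048*3 first splicing feature

w = rng.normal(scale=0.01, size=spliced.size)  # first fully connected layer
b = 0.0
prior = 1.7                                    # prior distance label, Eq. (1)
lr = 1e-5

pred = spliced @ w + b                         # predicted distance
loss = abs(pred - prior)                       # first error (L1 loss)
grad = np.sign(pred - prior)                   # dLoss/dPred
w = w - lr * grad * spliced                    # back-propagate into the head
b = b - lr * grad
new_pred = spliced @ w + b
```

In the real model the gradient also flows into the three extraction networks; the single-step update above only shows the direction of the adjustment.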
208. And adjusting the full connection layer of the second network model to adapt to the identification and classification of the social relation, so as to obtain a third network model.
209. And adjusting and training the third network model through a preset second training set to obtain the target network model.
In the embodiment of the invention, the model is trained on both unlabeled and labeled sample images, so that after training the amount of supervised data required is reduced and the recognition capability of the model is improved. In practical applications, a large amount of unsupervised data can be used to improve model performance, reducing the amount of labeled data needed and thus the labor cost of annotation. Moreover, using the ratio of the center distance of the sample human body frames to the diagonal length value as the prior distance alleviates, to some extent, the inconsistency of raw pixel distances.
Referring to fig. 3, fig. 3 is a flowchart of another method for training a social relationship recognition model provided in the embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:
301. and carrying out self-supervision training on the first network model through a preset first training set based on the prior distance of the relationship between two sample persons in the training image to obtain a second network model.
302. And adjusting the full connection layer of the second network model to adapt to the identification and classification of the social relation, so as to obtain a third network model.
Specifically, the relationship category to be identified and the number of relationship categories are determined.
The first training set involves a plurality of preset relationship types. After the first training set has been used for self-supervised training of the first network model to obtain the second network model, the relationship categories to be recognized and their number are determined first; the number of samples per relationship category may differ.
And adjusting the number of output neurons of the full connection layer of the second network model according to the number of relation categories to obtain a second full connection layer.
The full connection layer plays the role of a classifier, so the number of its output neurons can be adjusted according to the number of relation categories, ensuring that the layer has one output neuron for each relation category; this yields the second full connection layer.
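As a minimal sketch of this head replacement (the `FullyConnected` class is an illustrative stand-in for a framework layer, not the patent's implementation): the input width 2048 × 3 matches the spliced feature described later, and the output width is set to the number of relation categories.

```python
import numpy as np

class FullyConnected:
    """Illustrative linear layer (weights stored as (out, in))."""
    def __init__(self, in_features, out_features, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((out_features, in_features)) * 0.01
        self.bias = np.zeros(out_features)

    def __call__(self, x):
        return x @ self.weight.T + self.bias

def adapt_head(old_head, num_relation_classes):
    # Keep the input width, change the output width to one neuron per relation category.
    return FullyConnected(old_head.weight.shape[1], num_relation_classes)

pretext_head = FullyConnected(2048 * 3, 1)   # self-supervised head: predicts one scalar distance
second_fc = adapt_head(pretext_head, 6)      # e.g. 6 social-relation categories (assumed count)
```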
And determining the type of the second loss function according to the relation category.
Wherein the type of the second loss function may include: a 0-1 loss function, a square loss function, an absolute value loss function, and the like. After the relationship categories are obtained, the variables of the second loss function can be parameterized according to the relationship categories, thereby determining the type of the second loss function.
And obtaining a third network model according to the second full connection layer and the second loss function.
And after the second loss function is determined, it is used for back-propagation adjustment of the second full connection layer, yielding the third network model.
303. And inputting the labeled sample image into a target detection network, and detecting and outputting, through the target detection network, a third sample human body frame of a third sample person, a fourth sample human body frame of a fourth sample person, and a second sample human body combined frame of the third sample person and the fourth sample person in the labeled sample image.
The second training set comprises labeled sample images, and the labeled sample images comprise at least two sample persons and relationship labels. Relationship labels refer to the category of relationships between two sample people. The third network model comprises a fourth extraction network, a fifth extraction network, a sixth extraction network, a second full connection layer and a second loss function, wherein the network parameters of the fourth extraction network and the fifth extraction network are shared. The fourth extraction network, the fifth extraction network and the sixth extraction network are used for feature extraction, and the second loss function can be used for adjusting the second full connection layer.
Referring to fig. 2b, after the labeled sample image in the second training set is input to the target detection network, the third sample human body frame of the third sample person, the fourth sample human body frame of the fourth sample person, and the second sample human body combined frame of the third sample person and the fourth sample person in the labeled sample image may be output through target detection. The second sample human body combined frame contains both the third sample person and the fourth sample person, and is the minimum box including the third sample person and the fourth sample person.
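A minimal sketch of the combined frame, assuming boxes are given as `(x1, y1, x2, y2)` tuples: the second sample human body combined frame is simply the smallest axis-aligned box enclosing both person boxes.

```python
def union_box(box_a, box_b):
    """Smallest axis-aligned box containing both person boxes (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
```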
304. And extracting a fourth feature of the third sample human body frame through a fourth extraction network, extracting a fifth feature of the fourth sample human body frame through a fifth extraction network, and extracting a sixth feature of the second sample human body combined frame through a sixth extraction network.
The fourth extraction network, the fifth extraction network, and the sixth extraction network may each be a ResNet50, and the fourth feature of the third sample human body frame, the fifth feature of the fourth sample human body frame, and the sixth feature of the second sample human body combined frame may be extracted by the ResNet50 backbone. The sixth extraction network, which processes the second sample human body combined frame, does not share network parameters with the fourth and fifth extraction networks. The dimension of each extracted feature is 2048.
305. And splicing the fourth feature, the fifth feature and the sixth feature to obtain a second spliced feature.
The extracted fourth feature, fifth feature and sixth feature are each 2048-dimensional; splicing them yields a one-dimensional feature of 2048 × 3 = 6144 dimensions, that is, the second splicing feature.
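Assuming three 2048-dimensional feature vectors have already been produced by the fourth, fifth and sixth extraction networks (random placeholders here), the splicing step is a simple concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder 2048-dim outputs of the fourth, fifth and sixth extraction networks.
f4, f5, f6 = (rng.standard_normal(2048) for _ in range(3))

second_splicing_feature = np.concatenate([f4, f5, f6])  # 2048 * 3 = 6144 dims
```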
306. And carrying out full-connection calculation on the second splicing characteristics through the second full-connection layer to obtain a relation classification result.
After the second splicing feature is obtained, it is input to the second full connection layer for full-connection calculation. In practice, the full connection layer can be implemented by a convolution operation; after this calculation, a relationship classification result is output, which includes the relationship recognition result between the third sample person and the fourth sample person.
307. And calculating a second error between the relation classification result and the relation label through a second loss function.
The relationship classification result can be represented by a numerical value, and the relationship label can also be represented by a numerical value. The second loss function may be a function for calculating a difference, and the value corresponding to the relationship classification result and the value corresponding to the relationship label may be subtracted by the second loss function to obtain the second error.
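A hedged sketch of the second error: the patent only requires a difference-measuring loss, but with an integer relation label and a logit vector over relation categories, a cross-entropy loss is one common concrete choice:

```python
import numpy as np

def second_error(logits, label):
    """Cross-entropy between a logit vector over relation categories and
    an integer relation label (one common difference-measuring choice)."""
    shifted = logits - logits.max()            # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

loss = second_error(np.array([2.0, 0.1, -1.0]), 0)  # small when the labeled class scores highest
```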
308. And adjusting the network parameters of the fourth extraction network, the fifth extraction network and the sixth extraction network according to the second error by back propagation.
Back-propagating the second error to adjust the network parameters of the fourth extraction network, the fifth extraction network and the sixth extraction network enhances the recognition accuracy of the resulting target network model.
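To illustrate the back-propagation adjustment on a single linear head (real training updates all layers of the extraction networks the same way; the dimensions, label and learning rate are assumptions), one gradient-descent step driven by the second error looks like:

```python
import numpy as np

def ce_loss(logits, label):
    s = logits - logits.max()
    return float(np.log(np.exp(s).sum()) - s[label])

rng = np.random.default_rng(0)
x = rng.standard_normal(6144)            # the second splicing feature (placeholder)
w, b = np.zeros((6, 6144)), np.zeros(6)  # linear head over 6 relation categories (assumed)
label, lr = 2, 1e-4

loss_before = ce_loss(w @ x + b, label)

logits = w @ x + b
probs = np.exp(logits - logits.max()); probs /= probs.sum()
grad = probs.copy(); grad[label] -= 1.0     # d(loss)/d(logits) for cross-entropy
w -= lr * np.outer(grad, x)                 # parameter update driven by the second error
b -= lr * grad

loss_after = ce_loss(w @ x + b, label)
```

After the step, the loss on this sample decreases, which is the mechanism by which the second error improves recognition accuracy.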
In the embodiment of the invention, the model is trained through the image pairs of the unlabeled sample image and the labeled sample image, so that the requirement for supervised data is reduced and the recognition capability of the model is improved after the model is trained; and in practical application, a large amount of unsupervised data can be used to improve the performance of the model, so that the requirement on the quantity of the labeled data is reduced, and the labor cost of labeling is reduced.
Referring to fig. 4, fig. 4 is a flowchart of a social relationship identifying method according to an embodiment of the present invention. The method comprises the following steps:
401. and acquiring an image to be recognized, wherein the image to be recognized comprises a first target person and a second target person.
The image to be recognized refers to an image needing social relationship recognition. The image to be recognized comprises a first target person and a second target person, and the first target person and the second target person are persons needing to be subjected to social relationship judgment in the same image to be recognized.
402. And performing social relationship recognition on the first target person and the second target person in the image to be recognized through the target network model obtained by training in any of the above embodiments.
After the model is trained to obtain the target network model, that is, after the self-supervised pretext task is completed, the image to be recognized is input into the target network model, which extracts a plurality of pieces of feature information for the first target person and the second target person in the image to be recognized and analyzes them; an output is then obtained from the full connection layer according to the result of this analysis, and the social relationship between the first target person and the second target person can be recognized by mapping this output to the preset set of social relationships, thereby completing the social relationship recognition task (downstream task).
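A high-level sketch of this recognition flow, with stub detector and extractor functions standing in for the trained target network model (the relation label set and all stub names are hypothetical):

```python
import numpy as np

RELATIONS = ["friend", "couple", "family", "colleague", "stranger", "other"]  # assumed label set

def recognize(image, detector, extractor, head_w, head_b):
    """Detect the two target persons and their combined frame, extract and
    splice per-box features, then classify with the full connection head."""
    box_a, box_b, box_union = detector(image)
    feats = np.concatenate([extractor(image, box) for box in (box_a, box_b, box_union)])
    logits = head_w @ feats + head_b
    return RELATIONS[int(np.argmax(logits))]

# Stubs standing in for the trained target network model.
rng = np.random.default_rng(0)
stub_detector = lambda img: ((0, 0, 3, 4), (2, 1, 6, 8), (0, 0, 6, 8))
stub_extractor = lambda img, box: rng.standard_normal(2048)

relation = recognize(None, stub_detector, stub_extractor,
                     rng.standard_normal((6, 2048 * 3)), np.zeros(6))
```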
In the embodiment of the present invention, the social relationship recognition method is used for recognizing a target network model obtained by training through a training method of a social relationship recognition model, so that the effects that can be achieved by each embodiment of the training method of the social relationship recognition model can also be achieved in this embodiment, and are not described again to avoid repetition.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a training apparatus for a social relationship recognition model according to an embodiment of the present invention, and as shown in fig. 5, an apparatus 500 includes:
the first training module 501 is configured to perform self-supervision training on a first network model through a preset first training set based on a priori distance between two sample persons in a training image to obtain a second network model;
an adjusting module 502, configured to adjust a full connection layer of the second network model to adapt to training classification of the social relationship recognition model, so as to obtain a third network model;
the second training module 503 is configured to perform adjustment training on the third network model through a preset second training set to obtain a target network model.
Optionally, the first network model includes a first extraction network, a second extraction network, a third extraction network, a first full connection layer, and a first loss function, where network parameters of the first extraction network and the second extraction network are shared.
Optionally, the first training set includes unlabeled sample images, each unlabeled sample image includes at least two sample persons, fig. 6 is a schematic structural diagram of a training apparatus for another social relationship recognition model provided in an embodiment of the present invention, as shown in fig. 6, where the first training module 501 includes:
the first output sub-module 5011 is configured to input the unlabeled sample image into the target detection network, and detect and output a first sample human body frame of a first sample person, a second sample human body frame of a second sample person, and a first sample human body combination frame of the first sample person and the second sample person in the unlabeled sample image through the target detection network;
the first calculating submodule 5012 is used for calculating a priori distance between two sample persons according to the first sample person frame and the second sample person frame;
the first extraction submodule 5013 is configured to extract a first feature of the first sample human body frame through a first extraction network, extract a second feature of the second sample human body frame through a second extraction network, and extract a third feature of the first sample human body combination frame through a third extraction network;
the first splicing submodule 5014 is configured to splice the first feature, the second feature and the third feature to obtain a first spliced feature;
the second calculation submodule 5015 is configured to perform full connection calculation on the first splicing feature through the first full connection layer to obtain a predicted distance;
a third calculating submodule 5016 for calculating a first error between the priori distance and the predicted distance by using the first loss function;
the first adjusting submodule 5017 is configured to perform back propagation adjustment on network parameters of the first extraction network, the second extraction network, and the third extraction network according to the first error, so as to obtain a second network model.
Optionally, fig. 7 is a schematic structural diagram of another training apparatus for a social relationship recognition model according to an embodiment of the present invention, and as shown in fig. 7, the first computing submodule 5012 includes:
a first calculating subunit 50121 configured to calculate a center distance between the first sample body frame and the second sample body frame;
a second calculating subunit 50122, configured to calculate a first diagonal length value of the first sample body frame and calculate a second diagonal length value of the second sample body frame;
the third computing subunit 50123 is configured to select a larger value of the first diagonal length value and the second diagonal length value as a reference value, and calculate a ratio of the center distance to the reference value as a prior distance between two sample persons.
Optionally, the third network model includes a fourth extraction network, a fifth extraction network, a sixth extraction network, a second full connection layer, and a second loss function, where network parameters of the fourth extraction network and the fifth extraction network are shared.
Optionally, the second training set includes a labeled sample image, the labeled sample image includes at least two sample persons and a relationship label, fig. 8 is a schematic structural diagram of a training apparatus of another social relationship recognition model provided in an embodiment of the present invention, and as shown in fig. 8, the second training module 503 includes:
the second output sub-module 5031 is configured to input the labeled sample image into the target detection network, and detect and output a third sample human body frame of a third sample person, a fourth sample human body frame of a fourth sample person, and a second sample human body combined frame of the third sample person and the fourth sample person in the labeled sample image through the target detection network;
the fourth calculation submodule 5032 is configured to extract a fourth feature of the third sample human body frame through a fourth extraction network, extract a fifth feature of the fourth sample human body frame through a fifth extraction network, and extract a sixth feature of the second sample human body combined frame through a sixth extraction network;
the second splicing sub-module 5033 is configured to splice the fourth feature, the fifth feature and the sixth feature to obtain a second spliced feature;
the fifth calculation submodule 5034 is configured to perform full-connection calculation on the second splicing feature through the second full-connection layer to obtain a relationship classification result;
a sixth calculating sub-module 5035, configured to calculate, through the second loss function, a second error between the relationship classification result and the relationship label;
a seventh calculation submodule 5036 configured to perform back propagation adjustment on the network parameters of the fourth extraction network, the fifth extraction network, and the sixth extraction network according to the second error.
Optionally, fig. 9 is a schematic structural diagram of another training apparatus for a social relationship recognition model according to an embodiment of the present invention, and as shown in fig. 9, the adjusting module 502 includes:
the first determining submodule 5021 is used for determining the relationship types to be identified and the number of the relationship types;
the adjusting submodule 5022 is used for adjusting the number of output neurons and the number of relation categories of the second network model full-connection layer to obtain a second full-connection layer;
the second determining submodule 5023 is used for determining the type of the second loss function according to the relation category;
and the third determining submodule 5024 is used for obtaining a third network model according to the second full connection layer and the second loss function.
The invention further provides an electronic device 1000, and the electronic device 1000 provided in the embodiment of the invention can implement each process implemented by the training method of the social relationship recognition model in the above method embodiments, and for avoiding repetition, details are not repeated here. And the same beneficial effects can be achieved.
As shown in fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device 1000 includes: the social relationship recognition model training system comprises a processor 1001, a memory 1002, a network interface 1003 and a computer program which is stored on the memory 1002 and can run on the processor 1001, wherein the processor 1001 executes the computer program to realize the steps in the social relationship recognition model training method provided by the embodiment. Specifically, the processor 1001 is configured to call the computer program stored in the memory 1002, and execute the following steps:
performing self-supervision training on the first network model through a preset first training set based on the prior distance of the relationship between two sample persons in the training image to obtain a second network model;
adjusting the full connection layer of the second network model to adapt to training classification of the social relationship recognition model to obtain a third network model;
and adjusting and training the third network model through a preset second training set to obtain the target network model.
Optionally, in the step executed by the processor 1001, the first network model includes a first extraction network, a second extraction network, a third extraction network, a first full connection layer, and a first loss function, where network parameters of the first extraction network and the second extraction network are shared.
Optionally, the first network model further includes a target detection network, the first training set includes unlabeled sample images, each unlabeled sample image includes at least two sample persons, and the processor 1001 performs, based on a priori distance between two sample persons in the training image, an auto-supervised training of the first network model through a preset first training set to obtain a second network model, including:
inputting the unlabeled sample image into a target detection network, and detecting and outputting a first sample human body frame of a first sample person, a second sample human body frame of a second sample person and a first sample person body combined frame of the first sample person and the second sample person in the unlabeled sample image through the target detection network;
calculating the prior distance between two sample persons according to the first sample body frame and the second sample body frame;
extracting a first feature of the first sample human body frame through a first extraction network, extracting a second feature of the second sample human body frame through a second extraction network, and extracting a third feature of the first sample human body combination frame through a third extraction network;
splicing the first characteristic, the second characteristic and the third characteristic to obtain a first splicing characteristic;
performing full-connection calculation on the first splicing characteristics through the first full-connection layer to obtain a predicted distance;
calculating a first error between the prior distance and the predicted distance by a first loss function;
and according to the first error, the network parameters of the first extraction network, the second extraction network and the third extraction network are adjusted through back propagation to obtain a second network model.
Optionally, the calculating, by the processor 1001, the prior distance between two sample persons according to the first sample human body frame and the second sample human body frame includes:
calculating the center distance between the first sample human body frame and the second sample human body frame;
calculating a first diagonal length value of the first sample body frame and calculating a second diagonal length value of the second sample body frame;
and selecting the larger value of the first diagonal length value and the second diagonal length value as a reference value, and calculating the ratio of the central distance to the reference value as the prior distance between the two sample persons.
Optionally, in the step executed by the processor 1001, the third network model includes a fourth extraction network, a fifth extraction network, a sixth extraction network, a second full connection layer, and a second loss function, where network parameters of the fourth extraction network and the fifth extraction network are shared.
Optionally, the second training set includes a labeled sample image, the labeled sample image includes at least two sample persons and a relationship label, and the processor 1001 adjusts and trains the third network model through a preset second training set to obtain the target network model, including:
inputting the labeled sample image into a target detection network, and detecting and outputting a third sample human body frame of a third sample person, a fourth sample human body frame of a fourth sample person and a second sample human body combined frame of the third sample person and the fourth sample person in the labeled sample image through the target detection network;
extracting a fourth feature of the third sample human body frame through a fourth extraction network, extracting a fifth feature of the fourth sample human body frame through a fifth extraction network, and extracting a sixth feature of the second sample human body combined frame through a sixth extraction network;
splicing the fourth feature, the fifth feature and the sixth feature to obtain a second spliced feature;
performing full-connection calculation on the second splicing characteristics through a second full-connection layer to obtain a relation classification result;
calculating a second error between the relationship classification result and the relationship label through a second loss function;
and adjusting the network parameters of the fourth extraction network, the fifth extraction network and the sixth extraction network according to the second error by back propagation.
Optionally, the adjusting, performed by the processor 1001, the full connection layer of the second network model to adapt to the training classification of the social relationship recognition model to obtain a third network model includes:
determining the relation category to be identified and the number of the relation categories;
adjusting the number of output neurons and the number of relation categories of the second network model full-connection layer to obtain a second full-connection layer;
determining the type of the second loss function according to the relation category;
and obtaining a third network model according to the second full connection layer and the second loss function.
Further, an embodiment of the present invention provides a social relationship identifying apparatus, including:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring an image to be recognized, and the image to be recognized comprises a first target person and a second target person;
and the recognition module is used for recognizing the social relationship between the first target person and the second target person in the image to be recognized through the trained target network model.
The electronic device 1000 provided by the embodiment of the present invention can implement each implementation manner in the embodiment of the training method for a social relationship recognition model, and has corresponding beneficial effects, and for avoiding repetition, details are not repeated here.
It should be noted that only components 1001-1003 are shown, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the electronic device 1000 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device 1000 may be a desktop computer, a notebook, a palm computer, or other computing devices. The electronic device 1000 may interact with a user through a keyboard, a mouse, a remote control, a touch pad, or a voice-controlled device.
The memory 1002 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 1002 may be an internal storage unit of the electronic device 1000, such as a hard disk or a memory of the electronic device 1000. In other embodiments, the memory 1002 may also be an external storage device of the electronic device 1000, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the electronic device 1000. Of course, the memory 1002 may also include both internal and external memory units of the electronic device 1000. In this embodiment, the memory 1002 is generally configured to store an operating system installed in the electronic device 1000 and various types of application software, such as program codes of a training method for a social relationship recognition model. In addition, the memory 1002 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 1001 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 1001 generally serves to control the overall operation of the electronic device 1000. In this embodiment, the processor 1001 is configured to execute the program code stored in the memory 1002 or process data, for example, execute the program code of a training method of a social relationship recognition model.
The network interface 1003 may include a wireless network interface or a wired network interface, and the network interface 1003 is generally used for establishing a communication connection between the electronic device 1000 and other electronic devices.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by the processor 1001, the computer program implements each process of the training method for a social relationship recognition model provided in the embodiment of the present invention, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program to instruct associated hardware, and a program of a method for training a social relationship recognition model may be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. And the terms "first," "second," and the like in the description and claims of the present application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The above disclosure describes only preferred embodiments of the present invention and certainly cannot be construed as limiting the scope of the claims; equivalent changes made according to the claims of the present application still fall within the scope covered by the present invention.

Claims (12)

1. A training method of a social relationship recognition model is characterized by comprising the following steps:
performing self-supervision training on the first network model through a preset first training set based on the prior distance of the relationship between two sample persons in the training image to obtain a second network model;
adjusting the full connection layer of the second network model to adapt to the identification and classification of social relations, and obtaining a third network model;
and adjusting and training the third network model through a preset second training set to obtain a target network model.
2. The method of claim 1, wherein the first network model comprises a first extraction network, a second extraction network, a third extraction network, a first full connection layer, and a first loss function, and wherein the first extraction network is shared with the second extraction network in terms of network parameters.
3. The method for training the social relationship recognition model according to claim 2, wherein the first network model further includes a target detection network, the first training set includes unlabeled sample images, each unlabeled sample image includes at least two sample persons, and the self-supervised training of the first network model through a preset first training set based on the a priori distance between two sample persons in the training images to obtain the second network model includes:
inputting the unlabeled sample image into the target detection network, and detecting and outputting a first sample human body frame of a first sample person, a second sample human body frame of a second sample person and a first sample human body combined frame of the first sample person and the second sample person in the unlabeled sample image through the target detection network;
calculating the prior distance between two sample persons according to the first sample body frame and the second sample body frame;
extracting a first feature of the first sample body frame through the first extraction network, extracting a second feature of the second sample body frame through the second extraction network, and extracting a third feature of the first sample body combination frame through the third extraction network;
splicing the first feature, the second feature and the third feature to obtain a first spliced feature;
performing a fully connected computation on the first spliced feature through the first fully connected layer to obtain a predicted distance;
calculating a first error between the prior distance and the predicted distance by the first loss function;
and performing back propagation according to the first error to adjust the network parameters of the first extraction network, the second extraction network and the third extraction network, obtaining the second network model.
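As an illustration of the forward pass recited in the steps above, the following NumPy sketch splices three stand-in features, regresses a predicted distance through a single fully connected layer, and computes the first error. The 128-dimensional features, the batch size, and the mean-squared-error loss are illustrative assumptions: the claims leave the feature dimensions and the form of the first loss function unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: each extraction network yields a 128-d feature.
feat_dim, batch = 128, 4

# Stand-ins for the outputs of the first/second/third extraction networks
# (in the claimed model the first two would share parameters).
f1 = rng.standard_normal((batch, feat_dim))   # first sample body frame
f2 = rng.standard_normal((batch, feat_dim))   # second sample body frame
f3 = rng.standard_normal((batch, feat_dim))   # body combination frame

# Splice (concatenate) the three features into the first spliced feature.
spliced = np.concatenate([f1, f2, f3], axis=1)        # (batch, 384)

# First fully connected layer: regress a single predicted distance.
W = rng.standard_normal((3 * feat_dim, 1)) * 0.01
b = np.zeros(1)
predicted_distance = (spliced @ W + b).squeeze(-1)    # (batch,)

# First loss: assumed here to be mean squared error against the prior
# distance computed from the two detected body frames.
prior = rng.uniform(0.0, 3.0, size=batch)
first_error = np.mean((predicted_distance - prior) ** 2)
```

In practice the first error would be back-propagated through the three extraction networks, which is the step that makes the pretraining self-supervised: the target is derived from the image geometry itself, not from human labels.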
4. The method for training the social relationship recognition model according to claim 3, wherein the calculating of the prior distance between the two sample persons according to the first sample body frame and the second sample body frame comprises:
calculating the center distance between the first sample human body frame and the second sample human body frame;
calculating a first diagonal length value of the first sample body frame and calculating a second diagonal length value of the second sample body frame;
selecting the larger of the first diagonal length value and the second diagonal length value as a reference value, and taking the ratio of the center distance to the reference value as the prior distance between the two sample persons.
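The distance computation recited in claim 4 can be sketched as follows, under the assumption that each body frame is an axis-aligned box given as (x1, y1, x2, y2). Normalising the center distance by the larger diagonal makes the prior distance invariant to image scale:

```python
import math

def prior_distance(box_a, box_b):
    """Prior distance between two persons, per the steps of claim 4.

    Each box is an axis-aligned (x1, y1, x2, y2) body frame. The centre
    distance is divided by the larger of the two box diagonals.
    """
    # Centers of the two body frames.
    ax = (box_a[0] + box_a[2]) / 2.0
    ay = (box_a[1] + box_a[3]) / 2.0
    bx = (box_b[0] + box_b[2]) / 2.0
    by = (box_b[1] + box_b[3]) / 2.0
    center_dist = math.hypot(ax - bx, ay - by)

    # Diagonal lengths; the larger one is the reference value.
    diag_a = math.hypot(box_a[2] - box_a[0], box_a[3] - box_a[1])
    diag_b = math.hypot(box_b[2] - box_b[0], box_b[3] - box_b[1])
    reference = max(diag_a, diag_b)
    return center_dist / reference
```

For example, two 3x4 boxes whose centers are 10 units apart both have diagonal 5, so the prior distance is 10 / 5 = 2.0.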
5. The method for training the social relationship recognition model according to any one of claims 1 to 4, wherein the third network model comprises a fourth extraction network, a fifth extraction network, a sixth extraction network, a second fully connected layer and a second loss function, and wherein the fourth extraction network and the fifth extraction network share network parameters.
6. The method for training the social relationship recognition model according to claim 5, wherein the second training set comprises labeled sample images, each labeled sample image contains at least two sample persons and a relationship label, and the adjusting and training of the third network model through the preset second training set to obtain the target network model comprises:
inputting the labeled sample image into the target detection network, and detecting and outputting, through the target detection network, a third sample body frame of a third sample person, a fourth sample body frame of a fourth sample person, and a second sample body combination frame covering the third sample person and the fourth sample person in the labeled sample image;
extracting a fourth feature of the third sample body frame through the fourth extraction network, extracting a fifth feature of the fourth sample body frame through the fifth extraction network, and extracting a sixth feature of the second sample body combination frame through the sixth extraction network;
splicing the fourth feature, the fifth feature and the sixth feature to obtain a second spliced feature;
performing a fully connected computation on the second spliced feature through the second fully connected layer to obtain a relationship classification result;
calculating a second error between the relationship classification result and the relationship label through the second loss function;
and performing back propagation according to the second error to adjust the network parameters of the fourth extraction network, the fifth extraction network and the sixth extraction network.
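The supervised fine-tuning pass of claim 6 mirrors the pretraining pass, except that the head now outputs one logit per relationship category. The sketch below assumes softmax cross-entropy as the second loss and illustrative sizes (128-d features, 5 categories), neither of which is fixed by the claims:

```python
import numpy as np

rng = np.random.default_rng(1)

feat_dim, batch, num_classes = 128, 4, 5   # hypothetical sizes

# Stand-ins for the fourth/fifth/sixth extraction-network outputs
# (the fourth and fifth would share parameters in the claimed model).
f4 = rng.standard_normal((batch, feat_dim))
f5 = rng.standard_normal((batch, feat_dim))
f6 = rng.standard_normal((batch, feat_dim))
spliced = np.concatenate([f4, f5, f6], axis=1)         # (batch, 384)

# Second fully connected layer: one logit per relationship category.
W = rng.standard_normal((3 * feat_dim, num_classes)) * 0.01
logits = spliced @ W                                   # (batch, num_classes)

# Second loss, assumed to be softmax cross-entropy against the labels.
labels = rng.integers(0, num_classes, size=batch)
shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
second_error = -log_probs[np.arange(batch), labels].mean()
```

Because the extraction networks start from the self-supervised weights of the second network model, this stage needs far fewer labeled images than training from scratch, which is the data-efficiency benefit claimed in the abstract.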
7. The method for training the social relationship recognition model according to claim 6, wherein the adjusting of the fully connected layer of the second network model to adapt it to the identification and classification of social relationships to obtain the third network model comprises:
determining the relationship categories to be identified and the number of relationship categories;
adjusting the number of output neurons of the fully connected layer of the second network model to match the number of relationship categories, obtaining a second fully connected layer;
determining the type of a second loss function according to the relationship categories;
and obtaining a third network model according to the second fully connected layer and the second loss function.
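The head replacement of claim 7 amounts to re-sizing the output layer and choosing a loss to match the category set. A hypothetical sketch, where the binary-versus-multi-class loss rule is an illustrative assumption (the claims only say the loss type follows from the categories):

```python
import numpy as np

def build_second_head(in_features, relation_categories, rng=None):
    """Replace the pretraining regression head (one output neuron) with a
    classification head sized to the relationship categories.

    Returns the new (weights, bias) pair and the assumed loss type.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    num_classes = len(relation_categories)

    # New second fully connected layer: one output neuron per category.
    W = rng.standard_normal((in_features, num_classes)) * 0.01
    b = np.zeros(num_classes)

    # Illustrative rule for the second loss function's type.
    loss_type = ("binary_cross_entropy" if num_classes == 2
                 else "softmax_cross_entropy")
    return (W, b), loss_type
```

Only the head is re-initialised; the extraction networks keep the weights learned during self-supervised pretraining, so the distance-aware features are carried over into the classification task.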
8. A social relationship recognition method is characterized by comprising the following steps:
acquiring an image to be identified, wherein the image to be identified comprises a first target person and a second target person;
and performing social relationship identification on the first target person and the second target person in the image to be identified through the target network model obtained by training according to any one of claims 1 to 7.
9. A training device for a social relationship recognition model is characterized by comprising:
the first training module is used for performing self-supervised training on the first network model through a preset first training set, based on the prior distance between two sample persons in the training image, to obtain a second network model;
the adjusting module is used for adjusting the fully connected layer of the second network model to adapt it to the identification and classification of social relationships, obtaining a third network model;
and the second training module is used for adjusting and training the third network model through a preset second training set to obtain a target network model.
10. An apparatus for identifying social relationships, comprising:
an acquisition module, configured to acquire an image to be identified, wherein the image to be identified comprises a first target person and a second target person;
and an identification module, configured to perform social relationship identification on the first target person and the second target person in the image to be identified through the target network model obtained by training according to any one of claims 1 to 7.
11. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for training a social relationship recognition model according to any one of claims 1 to 7 and the steps of the social relationship recognition method according to claim 8.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a social relationship recognition model according to any one of claims 1 to 7 and the steps of the method for social relationship recognition according to claim 8.
CN202011634101.6A 2020-12-31 2020-12-31 Training method and recognition method of social relation recognition model and related equipment Active CN112668509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011634101.6A CN112668509B (en) 2020-12-31 2020-12-31 Training method and recognition method of social relation recognition model and related equipment

Publications (2)

Publication Number Publication Date
CN112668509A true CN112668509A (en) 2021-04-16
CN112668509B CN112668509B (en) 2024-04-02

Family

ID=75413394


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643241A (en) * 2021-07-15 2021-11-12 北京迈格威科技有限公司 Interaction relation detection method, interaction relation detection model training method and device
CN115100717A (en) * 2022-06-29 2022-09-23 腾讯科技(深圳)有限公司 Training method of feature extraction model, and cartoon object recognition method and device

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130006879A1 (en) * 2011-06-28 2013-01-03 Microsoft Corporation Guiding Interactions Between Users of Social Networking Services Based on Business Relationships
US20140270407A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Associating metadata with images in a personal image collection
US20150262037A1 (en) * 2014-03-17 2015-09-17 Yahoo! Inc. Visual recognition using social links
US20170140240A1 (en) * 2015-07-27 2017-05-18 Salesforce.Com, Inc. Neural network combined image and text evaluator and classifier
CN106815653A (en) * 2016-12-19 2017-06-09 烟台中科网络技术研究所 A kind of social network relationships Forecasting Methodology and system based on apart from game
CN108428232A (en) * 2018-03-20 2018-08-21 合肥工业大学 A kind of blind appraisal procedure of cartoon image quality
CN109344759A (en) * 2018-06-12 2019-02-15 北京理工大学 A kind of relatives' recognition methods based on angle loss neural network
CN109684478A (en) * 2018-12-18 2019-04-26 腾讯科技(深圳)有限公司 Disaggregated model training method, classification method and device, equipment and medium
CN109829072A (en) * 2018-12-26 2019-05-31 深圳云天励飞技术有限公司 Construct atlas calculation and relevant apparatus
WO2019100724A1 (en) * 2017-11-24 2019-05-31 华为技术有限公司 Method and device for training multi-label classification model
CN109993026A (en) * 2017-12-29 2019-07-09 华为技术有限公司 The training method and device of relatives' identification network model
CN110334588A (en) * 2019-05-23 2019-10-15 北京邮电大学 Kinship recognition methods and the device of network are paid attention to based on local feature
CN110516095A (en) * 2019-08-12 2019-11-29 山东师范大学 Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN110781836A (en) * 2019-10-28 2020-02-11 深圳市赛为智能股份有限公司 Human body recognition method and device, computer equipment and storage medium
US20200134375A1 (en) * 2017-08-01 2020-04-30 Beijing Sensetime Technology Development Co., Ltd. Semantic segmentation model training methods and apparatuses, electronic devices, and storage media
CN111539452A (en) * 2020-03-26 2020-08-14 深圳云天励飞技术有限公司 Image recognition method and device for multitask attributes, electronic equipment and storage medium
CN111626119A (en) * 2020-04-23 2020-09-04 北京百度网讯科技有限公司 Target recognition model training method, device, equipment and storage medium
CN111739651A (en) * 2020-06-16 2020-10-02 南京众智未来人工智能研究院有限公司 Multi-body space detection system and method based on group identification
WO2020258077A1 (en) * 2019-06-26 2020-12-30 深圳大学 Pedestrian detection method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曾令伟;伍振兴;杜文才;: "High-performance clustering algorithm based on improved self-supervised learning collective intelligence (ISLCI)", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 01 *
汤红忠;张小刚;陈华;李骁;王翔;: "Research on a face verification method combining weighted subspace and similarity metric learning", Journal of Hunan University (Natural Science Edition), no. 02, 25 February 2018 (2018-02-25) *


Also Published As

Publication number Publication date
CN112668509B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN107679513B (en) Image processing method and device and server
CN112329659B (en) Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
CN112052792B (en) Cross-model face recognition method, device, equipment and medium
Vishwakarma et al. Integrated approach for human action recognition using edge spatial distribution, direction pixel and ℜ-transform
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN112668509B (en) Training method and recognition method of social relation recognition model and related equipment
CN110335206B (en) Intelligent filter method, device and computer readable storage medium
CN112328823A (en) Training method and device for multi-label classification model, electronic equipment and storage medium
CN113239807B (en) Method and device for training bill identification model and bill identification
CN113705534A (en) Behavior prediction method, behavior prediction device, behavior prediction equipment and storage medium based on deep vision
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN110909578A (en) Low-resolution image recognition method and device and storage medium
CN110750673B (en) Image processing method, device, equipment and storage medium
CN116246287B (en) Target object recognition method, training device and storage medium
CN112381118A (en) Method and device for testing and evaluating dance test of university
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN114637831A (en) Data query method based on semantic analysis and related equipment thereof
CN114943872A (en) Training method and device of target detection model, target detection method and device, medium and equipment
CN112668520A (en) Social relationship identification method and device, electronic equipment and storage medium
CN112633244A (en) Social relationship identification method and device, electronic equipment and storage medium
CN114187448A (en) Document image recognition method and device, electronic equipment and computer readable medium
Lu et al. Lightweight green citrus fruit detection method for practical environmental applications
CN113139490B (en) Image feature matching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant