CN116310474A - End-to-end relationship identification method, model training method, device, equipment and medium - Google Patents

End-to-end relationship identification method, model training method, device, equipment and medium

Info

Publication number
CN116310474A
CN116310474A (application number CN202211413098.4A)
Authority
CN
China
Prior art keywords
vector
bounding box
label
matching
sample
Prior art date
Legal status
Pending
Application number
CN202211413098.4A
Other languages
Chinese (zh)
Inventor
唐锲
余晓填
蚁韩羚
王孝宇
Current Assignee
Hangzhou Lifei Software Technology Co ltd
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Hangzhou Lifei Software Technology Co ltd
Shenzhen Intellifusion Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Lifei Software Technology Co Ltd and Shenzhen Intellifusion Technologies Co Ltd
Priority to CN202211413098.4A
Publication of CN116310474A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the field of artificial intelligence technologies, and in particular to an end-to-end relationship identification method, a model training method, a device, equipment, and a medium. The method comprises: obtaining a sample image and its label vector; inputting the sample image into a feature extraction model to obtain sample features; inputting the sample features into a prediction model to obtain N prediction vectors; matching the N prediction vectors with the label vector by a preset matching algorithm and determining the prediction vector with the largest matching degree as a reference vector; calculating an overall matching error from the reference vector and the label vector; training the feature extraction model and the prediction model according to the overall matching error; and determining that the trained feature extraction model and the trained prediction model form a trained relationship recognition model. By adopting the end-to-end relationship recognition model, the influence of subjective human factors is reduced, the fitting effect of the model is improved, and the accuracy of the relationship recognition model is further improved.

Description

End-to-end relationship identification method, model training method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to an end-to-end relationship identification method, a model training method, a device, equipment, and a medium.
Background
With the development of artificial intelligence technology, the accuracy of relationship identification between objects to be identified has become higher and higher. Relationship identification technology is widely applied in scenes such as social security and user recommendation, where it has generated great social and commercial value, and as its accuracy continues to rise it can gradually be applied in scenes such as intelligent security and accurate delivery.
However, the existing relationship recognition models still adopt a two-stage form: the bounding box of each object to be recognized is first detected in the image, and relationship recognition is then performed based on those bounding boxes. This two-stage architecture requires parts of the parameter tuning to be determined manually during training, which makes it difficult for the relationship recognition model to fit to an optimal recognition effect, so the accuracy of the trained relationship recognition model is low. Therefore, how to improve the accuracy of the relationship recognition model is a problem to be solved.
Disclosure of Invention
In view of the above, the embodiment of the invention provides an end-to-end relationship identification method, a model training method, a device, equipment and a medium, so as to solve the problem of low accuracy of a relationship identification model.
In a first aspect, an embodiment of the present invention provides a training method for an end-to-end relationship identification model, where the training method includes:
acquiring a sample image and at least one corresponding label vector thereof, inputting the sample image into a feature extraction model for feature extraction to obtain sample features, wherein the label vector comprises a first label bounding box, a second label bounding box and a label relation class;
inputting the sample characteristics into a prediction model to obtain N prediction vectors, wherein the prediction vectors comprise a first sample bounding box, a second sample bounding box and a sample relation class, and N is an integer greater than zero;
matching the N predicted vectors with the at least one tag vector by adopting a preset matching algorithm, and determining the predicted vector with the largest matching degree with any tag vector as a reference vector of the corresponding tag vector;
traversing each label vector, and calculating a class matching error of a sample relation class in the reference vector and a label relation class in the label vector;
calculating a first bounding box matching error between a first sample bounding box in the reference vector and a first label bounding box in the label vector, and calculating a second bounding box matching error between a second sample bounding box in the reference vector and a second label bounding box in the label vector;
and according to the category matching error, the first bounding box matching error and the second bounding box matching error, determining the overall matching error of the tag vectors, training the feature extraction model and the prediction model according to the overall matching error of all the tag vectors, and determining that the obtained trained feature extraction model and the trained prediction model form a trained relation recognition model.
In a second aspect, an embodiment of the present invention provides an end-to-end relationship identifying method, where the end-to-end relationship identifying method includes:
acquiring an image to be identified, wherein the image to be identified comprises at least two objects to be identified;
inputting the image to be identified into a trained relation identification model to obtain P relation identification results, wherein the relation identification results comprise respective bounding boxes of two objects to be identified and identification relation categories between the two objects to be identified, P is an integer greater than zero, and the trained relation identification model is obtained based on the training method of the end-to-end relation identification model according to any one of claims 1-6.
In a third aspect, an embodiment of the present invention provides a training apparatus for an end-to-end relationship identification model, where the training apparatus includes:
the feature extraction module is used for acquiring a sample image and at least one corresponding label vector thereof, inputting the sample image into the feature extraction model for feature extraction to obtain sample features, wherein the label vector comprises a first label bounding box, a second label bounding box and a label relation class;
the vector prediction module is used for inputting the sample characteristics into a prediction model to obtain N prediction vectors, wherein the prediction vectors comprise a first sample bounding box, a second sample bounding box and a sample relation class, and N is an integer greater than zero;
the vector matching module is used for matching the N predicted vectors with the at least one tag vector respectively by adopting a preset matching algorithm, and determining the predicted vector with the largest matching degree with any tag vector as a reference vector of the corresponding tag vector;
the first calculation module is used for traversing each label vector and calculating a class matching error of the sample relation class in the reference vector and the label relation class in the label vector;
the second calculating module is used for calculating a first bounding box matching error between a first sample bounding box in the reference vector and a first label bounding box in the label vector, and calculating a second bounding box matching error between a second sample bounding box in the reference vector and a second label bounding box in the label vector;
the model training module is used for determining the overall matching error of the tag vectors according to the category matching error, the first bounding box matching error and the second bounding box matching error, training the feature extraction model and the prediction model according to the overall matching errors of all the tag vectors, and determining that the obtained trained feature extraction model and the trained prediction model form a trained relation recognition model.
In a fourth aspect, an embodiment of the present invention provides a computer device comprising a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the training method according to the first aspect when executing the computer program.
In a fifth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the training method according to the first aspect.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
A sample image and at least one corresponding label vector are obtained; the sample image is input into a feature extraction model for feature extraction to obtain sample features; the sample features are input into a prediction model to obtain N prediction vectors; the N prediction vectors are matched with the at least one label vector by a preset matching algorithm, and the prediction vector with the largest matching degree with any label vector is determined as the reference vector of the corresponding label vector. For each label vector, the category matching error between the sample relationship category in the reference vector and the label relationship category in the label vector is calculated, the first bounding box matching error between the first sample bounding box in the reference vector and the first label bounding box in the label vector is calculated, and the second bounding box matching error between the second sample bounding box in the reference vector and the second label bounding box in the label vector is calculated; the overall matching error of the label vector is then determined from the category matching error, the first bounding box matching error and the second bounding box matching error. The feature extraction model and the prediction model are trained according to the overall matching errors of all label vectors, and the trained feature extraction model and the trained prediction model form the trained relationship recognition model. By calculating the category matching error and the bounding box matching errors between the prediction vectors and the label vectors of the sample image, and training the relationship recognition model end to end with a loss computed from the matched reference vectors, manual participation in model tuning is avoided, the influence of subjective factors is reduced, the fitting effect of the relationship recognition model is improved, and the accuracy of the relationship recognition model is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a training method of an end-to-end relationship identification model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method of an end-to-end relationship identification model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a training method of an end-to-end relationship recognition model according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a training device for an end-to-end relationship recognition model according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
The training method of the end-to-end relationship recognition model provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, in which a client communicates with a server. The client includes, but is not limited to, a palmtop computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cloud terminal device, a personal digital assistant (PDA), and other computer devices. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The client can be applied to scenes such as image classification, intelligent security and user recommendation, and receives a relationship identification request. The relationship identification request may be sent by a user or generated automatically at a preset time point; for example, in an intelligent security scene, the request can be generated automatically at a preset time point so that abnormal intrusion is detected in real time from a monitoring image, i.e., according to the identified social relationship, the case in which persons who do not satisfy a preset social relationship pass through an access control point at the same time is determined to be an abnormal intrusion. The server is connected to the image acquisition device of the application scene, for example a monitoring camera in the intelligent security scene or a camera in the user recommendation scene, so as to obtain the images collected by the image acquisition device. In this embodiment, one relationship recognition result corresponds to two objects to be recognized in an image; when more than two objects exist in the image, the relationships among them can be determined from the relationship recognition results between every two objects, and the objects to be recognized are generally persons.
Referring to fig. 2, a flowchart of a training method of an end-to-end relationship recognition model according to an embodiment of the present invention is provided, where the training method may be applied to a client in fig. 1, a computer device corresponding to the client is connected to a server to obtain a sample image and a tag vector thereof, a feature extraction model and a prediction model are deployed in the client, the feature extraction model may be used to extract feature information of the sample image, and the prediction model may be used to perform social relationship type prediction according to the feature information of the sample image. As shown in fig. 2, the training method may include the steps of:
step S201, a sample image and at least one label vector corresponding to the sample image are obtained, and the sample image is input into a feature extraction model to perform feature extraction, so that sample features are obtained.
The label vector includes a first label bounding box, a second label bounding box, and a label relation category, and the sample image may refer to an image containing objects to be identified.
The feature extraction model may combine a visual semantic model with a Transformer model: the sample image is input into the visual semantic model for processing to obtain a semantic feature vector, and the semantic feature vector is then input into the Transformer model to obtain the sample features.
The visual semantic model may be a residual network model. A residual network refers to a multi-layer convolution model based on residual connections, where a residual connection combines the output of each convolution layer with the output of the previous convolution layer as the input of the next convolution layer, so that the residual network model can play the role of an attention mechanism for extracting the important features in the semantic feature vector.
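As a minimal sketch of such a residual connection (written in PyTorch; the module structure, kernel size and names are illustrative assumptions rather than the patent's reference implementation):

    import torch.nn as nn

    class ResidualConvBlock(nn.Module):
        # One convolution layer whose output is added to its input (the output of
        # the previous layer) before being passed to the next layer.
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.conv(x) + x)  # residual connection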
Optionally, the feature extraction model includes a semantic extraction layer, an encoder and a decoder;
inputting the sample image into a feature extraction model for feature extraction, wherein obtaining sample features comprises the following steps:
inputting the sample image into a semantic extraction layer to obtain a semantic feature vector, and multiplying the semantic feature vector by a preset first vector to obtain a value vector;
coding the sample image by adopting a preset coding mode to obtain a position code, adding semantic feature vectors and the position code point by point, and multiplying the added result by a preset second vector and a preset third vector respectively to obtain a query vector and a key value vector;
inputting the value vector, the query vector and the key value vector into an encoder to obtain a global information feature vector;
and carrying out preset initialization on the preset learnable vector, and inputting the initialized learnable vector, the global information feature vector and the position code into a decoder to obtain sample features.
The semantic extraction layer may adopt a visual backbone network, for example the encoder of a semantic segmentation model. The preset first vector, second vector and third vector may be weight vectors with learnable parameters. The position code refers to the position information encoding of each sub-image block in the image, which may be obtained by one-hot coding, sine-cosine coding, or the position-coding part of a BERT model, and the preset initialization may adopt random initialization.
The value vector, the query vector and the key value vector can be used to map the low-dimensional semantic feature vector into a high-dimensional space. The dimension of the learnable vector is a hyperparameter, i.e. it is preset by the implementer; it should be noted that, when setting this dimension, it needs to be larger than the maximum number of relationships in a sample image.
Specifically, in this embodiment, the sample image is divided into K sub-image blocks of identical size, and each sub-image block is input into the semantic extraction layer for semantic feature extraction to obtain K semantic feature vectors. A serial number is assigned to each sub-image block to obtain K sub-image block serial numbers, and position encoding (for example one-hot encoding) is performed based on these serial numbers to obtain K position codes whose dimension is consistent with that of the semantic feature vectors. The semantic feature vector of each sub-image block is added point by point to the position code of the same sub-image block, and the addition results are concatenated into a vector of size K×D, where D refers to the dimension of the semantic feature vector.
In the embodiment, the query vector, the key value vector and the value vector are constructed through the weight vector of the learnable parameter, so that the context information of the sub-image block in the sample image is effectively extracted, the follow-up feature weighting according to the context information is facilitated, the effect of an attention mechanism is achieved, the feature integration efficiency is improved, and the important feature loss is avoided.
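The construction of the value, query and key value vectors described above can be sketched as follows (PyTorch; the number of sub-image blocks, the feature dimension and the one-hot position codes are assumptions for illustration):

    import torch
    import torch.nn as nn

    K, D = 49, 256                          # number of sub-image blocks, feature dimension
    semantic = torch.randn(K, D)            # K semantic feature vectors from the extraction layer
    position = torch.eye(K, D)              # one-hot position codes per sub-image block (assumes K <= D)

    w_value = nn.Linear(D, D, bias=False)   # preset first vector (learnable weights)
    w_query = nn.Linear(D, D, bias=False)   # preset second vector
    w_key = nn.Linear(D, D, bias=False)     # preset third vector

    value = w_value(semantic)               # value vector: semantic features only
    added = semantic + position             # point-by-point addition with the position code
    query = w_query(added)
    key = w_key(added)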
Optionally, inputting the value vector, the query vector, and the key-value vector into the encoder includes:
multiplying the query vector by the transpose of the key value vector, dividing the multiplication result by a preset coefficient, and determining the division result as an attention vector;
and carrying out normalization processing on the attention vector through a normalization exponential function, multiplying the normalization processing result by the value vector, and inputting the multiplication result into the encoder.
The preset coefficient can be used to scale the multiplication result so that it does not become excessively large, the attention vector refers to a vector containing the weighting parameters, and the normalization process is used to normalize the weighting parameters into probability values, i.e. the normalized values of each column of the attention vector sum to 1.
Specifically, in the present embodiment, the preset coefficient may be set to \sqrt{D}, i.e. the square root of the dimension D of the semantic feature vector.
In this embodiment, the value vector is weighted by the normalization result of the attention vector, so as to play a role in giving attention to the feature, so that the subsequent global feature processing focuses on the important feature, and the validity of the global information feature vector is ensured.
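Under these definitions, the computation fed to the encoder reduces to standard scaled dot-product attention; a sketch, assuming the query, key value and value tensors built in the previous step:

    import math
    import torch

    def scaled_attention(query, key, value):
        d = query.shape[-1]
        scores = query @ key.transpose(-2, -1) / math.sqrt(d)  # divide by the preset coefficient sqrt(D)
        weights = torch.softmax(scores, dim=-1)                # normalised weighting parameters
        return weights @ value                                 # weighted value vector fed to the encoder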
Optionally, inputting the sample image into a semantic extraction layer, and obtaining the semantic feature vector includes:
inputting the sample image into a semantic extraction layer to obtain an initial feature vector;
and adopting a preset convolution layer to reduce the dimension of the initial feature vector, carrying out flattening operation on the dimension reduction result, and determining the flattening operation result as a semantic feature vector.
The initial feature vector may be a vector obtained by directly extracting features from a sample image, and the convolution layer may include a convolution kernel with a preset n×n size.
For example, a 5×5 convolution kernel is adopted in this embodiment. Let the initial feature vector size be H×W×C, where H and W are the height and width of the vector and C is the number of channels, with an initial value of 2048; after convolution with the 5×5 kernel, C is reduced to 256. The flattening operation compresses the high-dimensional vector into 1×M dimensions, where M may be equal to H×W×C, converting the high-dimensional semantic feature vector into a low-dimensional representation that can be regarded as the word embedding vector in a conventional Transformer, for subsequent combination with the position codes.
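A sketch of this dimension-reduction and flattening step (PyTorch; the 5×5 kernel and the 2048-to-256 channel reduction follow the example above, while the spatial size is an assumption), as a further illustration before the benefit discussion below:

    import torch
    import torch.nn as nn

    reduce_channels = nn.Conv2d(2048, 256, kernel_size=5, padding=2)  # 5x5 kernel, C: 2048 -> 256

    initial = torch.randn(1, 2048, 7, 7)      # initial feature vector, channels C = 2048
    reduced = reduce_channels(initial)        # 1 x 256 x 7 x 7
    semantic = reduced.flatten(1)             # flattened to 1 x M, here M = 256 * 7 * 7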
According to the embodiment, the feature vector is subjected to dimension reduction processing, so that the calculated amount of subsequent processing is reduced, the calculation efficiency is improved, social relationship identification can be applied to a real-time scene, and the applicability of a relationship identification model is improved.
The step of obtaining the sample image and at least one corresponding label vector thereof, namely inputting the sample image into a feature extraction model for feature extraction to obtain sample features, and obtaining the sample features with stronger characterization capability through the feature extraction model, thereby being convenient for improving the fitting accuracy and efficiency of the subsequent model training process, and further improving the accuracy of the trained relationship recognition model.
Step S202, inputting the sample characteristics into a prediction model to obtain N prediction vectors.
The prediction vector comprises a first sample bounding box, a second sample bounding box and a sample relation class, N is an integer larger than zero, the prediction model can comprise a classification branch and two regression branches, each branch can be realized by adopting a full connection layer, the input of the classification branch is a sample characteristic, the output of the classification branch is the sample relation class, the input of the two regression branches is a sample characteristic, and the output of the two regression branches is the first sample bounding box and the second sample bounding box respectively.
The first sample bounding box may refer to a positioning box of an object to be identified in the sample image, the second sample bounding box may refer to a positioning box of another object to be identified in the sample image, the sample relationship type may refer to a relationship recognition result between the object to be identified corresponding to the first sample bounding box and the object to be identified corresponding to the second sample bounding box, and the relationship recognition result may include a relatives type, a friends type, a neighbors type, and the like.
Specifically, the classification branch outputs a predicted value of each relationship category, normalizes the predicted value of each relationship category to obtain a predicted probability of each relationship category, and determines a relationship category corresponding to the maximum value in the predicted probabilities of all relationship categories as a sample relationship category.
The regression branches output bounding box coordinate points, which may include bounding box upper left corner coordinates and bounding box lower right corner coordinates.
Since the sample image includes at least two objects to be identified and each pair of identified objects corresponds to one prediction vector, the output of the prediction model is N prediction vectors; for example, when the number of objects to be identified is Q, the value of N corresponds to the factorial of Q-1.
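The prediction model described here can be sketched as one classification branch and two regression branches, each implemented with a fully connected layer (PyTorch; the feature dimension and the number of relationship categories are illustrative assumptions):

    import torch
    import torch.nn as nn

    class PredictionModel(nn.Module):
        # One classification branch and two regression branches over the sample features.
        def __init__(self, feat_dim=256, num_classes=3):
            super().__init__()
            self.cls_branch = nn.Linear(feat_dim, num_classes)   # sample relationship category
            self.box1_branch = nn.Linear(feat_dim, 4)            # first sample bounding box (x1, y1, x2, y2)
            self.box2_branch = nn.Linear(feat_dim, 4)            # second sample bounding box

        def forward(self, sample_features):                      # sample_features: N x feat_dim
            probs = torch.softmax(self.cls_branch(sample_features), dim=-1)  # normalised class probabilities
            category = probs.argmax(dim=-1)                      # class with the largest predicted probability
            return self.box1_branch(sample_features), self.box2_branch(sample_features), category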
This step of inputting the sample features into the prediction model to obtain the N prediction vectors adopts an end-to-end model architecture and directly outputs the first sample bounding box, the second sample bounding box and the sample relationship category from the sample features of the input sample image. This improves the convenience of training and using the model and allows it to be widely applied in specific scenes: on the premise of ensuring model accuracy, an implementer only needs to input the acquired data directly into the model to obtain the corresponding results, so the model has high applicability and convenience.
In step S203, a preset matching algorithm is adopted to match the N prediction vectors with at least one label vector, and the prediction vector with the greatest matching degree with any label vector is determined as the reference vector of the corresponding label vector.
The preset matching algorithm may be the Hungarian algorithm, the KM algorithm, or the like; in this embodiment, each prediction vector is matched with a label vector to obtain N matching pairs.
In this embodiment, for any matching pair, the matching loss may be calculated from the first sample bounding box and the second sample bounding box of the prediction vector in the matching pair and the first label bounding box and the second label bounding box in the label vector, and the generalized intersection over union may be adopted as the loss calculation method between bounding boxes.
The generalized intersection over union of two bounding boxes is calculated by first obtaining the minimum enclosing rectangle of the two bounding boxes, calculating the ratio of the area of the minimum enclosing rectangle minus the area of the union of the two bounding boxes to the area of the minimum enclosing rectangle, and subtracting this ratio from the intersection over union of the two bounding boxes.
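This calculation can be sketched as follows for a single pair of axis-aligned boxes given as (x1, y1, x2, y2) (a simplified illustration, not the patent's reference implementation):

    def generalized_iou(box_a, box_b):
        # GIoU = IoU - (area(C) - area(union)) / area(C),
        # where C is the minimum enclosing rectangle of the two boxes.
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        iou = inter / union
        cx1, cy1 = min(ax1, bx1), min(ay1, by1)
        cx2, cy2 = max(ax2, bx2), max(ay2, by2)
        enclosing = (cx2 - cx1) * (cy2 - cy1)
        return iou - (enclosing - union) / enclosing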
The smaller the generalized intersection over union of the bounding boxes in a matching pair, the larger the matching loss of the bounding boxes and the smaller the matching degree; the larger the generalized intersection over union, the smaller the matching loss and the larger the matching degree. The prediction vector in the matching pair with the largest matching degree is determined as the reference vector.
In this step, the N prediction vectors are matched with the at least one label vector by the preset matching algorithm, and the prediction vector with the largest matching degree with any label vector is determined as the reference vector of the corresponding label vector. Determining the correspondence according to the matching result allows the method to adapt effectively to sample images containing multiple objects to be identified and avoids fitting errors of the relationship recognition model caused by label vectors that do not correspond to prediction vectors, thereby improving the fitting accuracy of the relationship recognition model.
Step S204, traversing each label vector, and calculating the class matching error of the sample relation class in the reference vector and the label relation class in the label vector.
The label relationship category may be a preset relationship category, and the relationship categories may include a couple relationship category, a friend relationship category, a family relationship category and the like. A relationship category may be expressed as a one-hot code; for example, if the total number of relationship categories is set to 3, the one-hot code of the couple relationship category may be [1, 0, 0]. The category matching error can be used to represent the difference between the sample relationship category and the label relationship category.
Specifically, the category matching error may be calculated with a classification error, for example a cross entropy loss function: the larger the category matching error, the larger the difference between the sample relationship category and the label relationship category, and the smaller the error, the smaller the difference. Since the category matching error is calculated with the reference vector, the model parameters are adjusted in the subsequent training process so that the output reference vector is close to the label vector; accordingly, the sample relationship category output by the adjusted model is kept consistent with the label relationship category as far as possible.
And traversing each label vector, calculating the class matching error of the sample relation class in the reference vector and the label relation class in the label vector, and ensuring that the sample relation class output by the model is consistent with the label relation class in the label vector as much as possible through the supervision of the class matching error during subsequent training, so that the model can be trained in an end-to-end mode, and the training efficiency and the recognition accuracy of the model are improved.
In step S205, a first bounding box matching error between the first sample bounding box in the reference vector and the first label bounding box in the label vector is calculated, and a second bounding box matching error between the second sample bounding box in the reference vector and the second label bounding box in the label vector is calculated.
The sample bounding box and the label bounding box are both represented in the form of bounding box locating points, the bounding box locating points can comprise an upper left corner point and a lower right corner point of the bounding box, and a bounding box matching error can be used for representing the distance difference between the sample bounding box and the label bounding box.
Specifically, the bounding box matching error may be calculated in the generalized intersection-over-union manner described above; in an embodiment, a regression error may also be adopted, for example a mean square error loss function. The larger the bounding box matching error, the farther the sample bounding box is from the label bounding box, and the smaller the error, the closer they are. Since the matching loss is calculated with the reference vector, the model parameters are adjusted in the subsequent training process so that the output reference vector is sufficiently close to the label vector; accordingly, the sample bounding box output by the adjusted model is also sufficiently close to the label bounding box.
The steps of calculating the first bounding box matching error between the first sample bounding box in the reference vector and the first label bounding box in the label vector, and the second bounding box matching error between the second sample bounding box in the reference vector and the second label bounding box in the label vector, allow the model to learn the positions of the objects to be identified under the supervision of the bounding box matching errors, thereby providing feature information for the relationship identification task and improving the accuracy of the model's relationship recognition.
Step S206, determining the overall matching error of the tag vectors according to the category matching error, the first bounding box matching error and the second bounding box matching error, training the feature extraction model and the prediction model according to the overall matching error of all the tag vectors, and forming a trained relation recognition model by the determined trained feature extraction model and the trained prediction model.
The trained feature extraction model may be used to perform feature extraction on the acquired image to be identified, and the trained prediction model may be used to map a feature tensor corresponding to the image to be identified to an output space.
Specifically, the loss function employed for training is:
L = \sum_{i} L_{match}(g_i, p_{\sigma(i)})

where L denotes the prediction loss, L_{match}(g_i, p_{\sigma(i)}) denotes the overall matching error, g_i denotes the i-th label vector, p_{\sigma(i)} denotes the prediction vector matched to the i-th label vector, and \sigma(i) denotes the index of the prediction vector matched to the i-th label vector.
Optionally, the category matching error corresponds to a first weight, and the first bounding box matching error and the second bounding box matching error correspond to a second weight;
determining the overall match error of the tag vector based on the category match error, the first bounding box match error, and the second bounding box match error includes:
multiplying the category matching error with a first weight to obtain a first matching error;
multiplying the first bounding box matching error and the second bounding box matching error with a second weight respectively, and then adding to obtain a second matching error;
and determining the sum of the first matching error and the second matching error as an overall matching error.
The first weight can be used for controlling the influence degree of the category matching error on the whole matching error, and the second weight can be used for controlling the influence degree of the bounding box matching error on the whole matching error.
Specifically, the overall matching error is:

L_{match}(g_i, p_{\sigma(i)}) = \beta_1 \sum_{j=1}^{r} L_{cls}^{j} + \beta_2 \sum_{k=1}^{h} L_{box}^{k}

where L_{cls}^{j} denotes the category matching error of the j-th sample relationship category, r denotes the number of sample relationship categories, L_{box}^{k} denotes the bounding box matching error of the k-th bounding box, h is the number of bounding boxes, \beta_1 denotes the first weight of the category matching error, and \beta_2 denotes the second weight of the k-th bounding box matching error.
Specifically, in this embodiment, the value of h is 2, and the relationship category may include r number of categories such as friends, couples, relatives, and the like.
In this embodiment, the first weight and the second weight are both set to 0.5. It should be noted that an implementer may adjust the values of the first weight and the second weight according to the actual situation, for example setting the first weight to 0.3 and the second weight to 0.7, so that the bounding box matching errors are given more emphasis.
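A sketch of this weighted combination (plain Python; the category and bounding box errors are assumed to have been computed already, e.g. with cross entropy and the generalized intersection over union):

    def overall_matching_error(class_error, box1_error, box2_error,
                               first_weight=0.5, second_weight=0.5):
        # beta1 * category error + beta2 * (first box error + second box error)
        return first_weight * class_error + second_weight * (box1_error + box2_error)

    # Shifting emphasis between the two kinds of error by changing the weights:
    error = overall_matching_error(0.9, 0.2, 0.3, first_weight=0.3, second_weight=0.7)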
According to the method and the device, the overall matching error is calculated through the category matching error and the bounding box matching error in a weighted mode, so that an effective fitting direction can be provided for the training of the relation recognition model, and the accuracy of the trained relation recognition model is improved.
In this step, the overall matching error is determined according to the category matching error, the first bounding box matching error and the second bounding box matching error, the feature extraction model and the prediction model are trained according to the overall matching errors of all the label vectors, and the obtained trained feature extraction model and trained prediction model are determined to form the trained relationship recognition model. The feature extraction model and the prediction model are jointly trained in an end-to-end manner with no need for manual intervention in the training process; the learnable parameters of both models are adjusted during training, avoiding accuracy errors caused by manual adjustment, so that more accurate model fitting, and therefore more accurate relationship recognition, can be achieved.
In the embodiments of the present invention, the category matching error and the bounding box matching errors are calculated between the prediction vectors and the label vectors of the corresponding sample image, which improves the recognition accuracy of the spatial associations; the loss function is calculated from the matched reference vectors and the relationship recognition model is trained in an end-to-end manner, which avoids manual participation in model tuning, reduces the influence of subjective factors, improves the fitting effect of the relationship recognition model, and further improves the accuracy of the relationship recognition model.
Referring to fig. 3, a flowchart of a training method of an end-to-end relationship recognition model according to a second embodiment of the present invention is shown, in the training method, the matching of N prediction vectors with at least one label vector by a preset matching algorithm, and the determining that the prediction vector with the greatest matching degree with any label vector is the reference vector of the corresponding label vector includes:
step S301, adopting the Hungarian algorithm to match the N prediction vectors with the at least one label vector respectively to obtain N matching pairs;
step S302, for any matching pair, calculating the overall matching error of the matching pair, traversing N matching pairs to obtain N overall matching errors;
In step S303, for any of the tag vectors, a prediction vector corresponding to the minimum value of the overall matching errors corresponding to the matching pair including the tag vector is determined as the prediction vector having the greatest matching degree with the tag vector, and the prediction vector having the greatest matching degree is used as the reference vector of the corresponding tag vector.
The Hungarian algorithm can be used for bipartite matching. To improve matching efficiency, after the N matching pairs are obtained, the overall matching error is calculated directly from the prediction vector and the label vector in each matching pair, so that N overall matching errors are obtained.
The smaller the matching error is, the higher the matching degree of the prediction vector and the label vector is, and the minimum value in the N overall matching errors can represent that the prediction vector used for calculation is close enough to the label vector, so that the prediction vector corresponding to the minimum value in the N overall matching errors is directly taken as the prediction vector with the largest matching degree, namely the reference vector.
Specifically, in performing the overall matching error calculation, the second weight of the bounding box matching error may be set to be larger, for example, set to 0.8 in the present embodiment, so that the matching process is more focused on the distance between the sample bounding box and the tag bounding box.
In this embodiment, the N prediction vectors are matched with the at least one label vector by the Hungarian algorithm, and the overall matching loss is calculated directly from the matching pairs obtained by the matching, so that the overall matching loss used for training the relationship recognition model is determined. This simplifies the flow of the training process, improves the training efficiency of the relationship recognition model, and ensures its training accuracy.
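A sketch of this matching step using the Hungarian algorithm implementation in SciPy (the cost matrix of overall matching errors between every label vector and prediction vector is assumed to have been computed beforehand):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def assign_reference_vectors(cost):
        # cost[i, j]: overall matching error between label vector i and prediction vector j.
        label_idx, pred_idx = linear_sum_assignment(cost)   # Hungarian algorithm
        return dict(zip(label_idx.tolist(), pred_idx.tolist()))

    # Two label vectors, four prediction vectors: each label vector gets the
    # prediction vector with the smallest overall matching error as its reference vector.
    cost = np.array([[0.9, 0.2, 0.7, 0.8],
                     [0.4, 0.6, 0.1, 0.9]])
    print(assign_reference_vectors(cost))   # {0: 1, 1: 2}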
The third embodiment of the invention provides an end-to-end relationship identification method, which comprises the following steps:
acquiring an image to be identified, wherein the image to be identified comprises at least two objects to be identified;
inputting the images to be identified into a trained relationship identification model to obtain P relationship identification results, wherein the relationship identification results comprise respective bounding boxes of two objects to be identified and identification relationship categories between the two objects to be identified.
Wherein P is an integer greater than zero, and the trained relationship recognition model is an end-to-end model, that is, after the acquired image to be recognized, the trained relationship recognition model is directly input, so as to obtain a relationship recognition result, and the trained relationship recognition model is obtained based on the training method of the end-to-end relationship recognition model provided in the first embodiment.
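Usage at inference time can be sketched as a single forward pass through the two trained sub-models (the wrapper function and the output format are hypothetical and only illustrate the end-to-end call):

    import torch

    def recognize_relationships(feature_model, prediction_model, image):
        # image: the image to be identified; returns the bounding boxes of both objects
        # and the identified relationship category for each predicted pair.
        with torch.no_grad():
            features = feature_model(image)
            box1, box2, category = prediction_model(features)
        return list(zip(box1.tolist(), box2.tolist(), category.tolist()))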
For example, in an accurate delivery scene, the corresponding delivery content can be determined from the relationship category identified in the image; if the relationship category of two objects is identified as a couple, couple products can be pushed as auxiliary information to realize accurate delivery. In an intelligent security scene, the relationship categories identified from images can be combined with pre-deployed sensors to realize security policies such as access-control tailgating recognition and automatic opening and closing of access control. If the relationship category of two objects is identified as relatives, friends or a couple, then when a sensor senses that one object has passed through the access control but the other has not, the access control can be opened directly without a verification process; correspondingly, when the relationship category of the two objects is identified as not being a preset category, the access control is automatically closed after the sensor senses that one object has passed through, and is opened for the other object only after a verification process.
In this embodiment, the image to be identified is processed with the end-to-end trained relationship recognition model, without first obtaining bounding boxes through preprocessing and then performing relationship recognition, which improves the deployment efficiency and recognition efficiency of the trained relationship recognition model and thus its practicability.
Fig. 4 shows a structural block diagram of a training device for an end-to-end relationship recognition model according to a fourth embodiment of the present invention, where the training device is applied to a client, a computer device corresponding to the client is connected to a server to obtain a sample image and a tag vector thereof, a feature extraction model and a prediction model are deployed in the client, the feature extraction model may be used to extract feature information of the sample image, and the prediction model may be used to perform social relationship type prediction according to the feature information of the sample image. For convenience of explanation, only portions relevant to the embodiments of the present invention are shown.
Referring to fig. 4, the training device includes:
the feature extraction module 41 is configured to obtain a sample image and at least one corresponding tag vector thereof, input the sample image into a feature extraction model to perform feature extraction, and obtain a sample feature, where the tag vector includes a first tag bounding box, a second tag bounding box, and a tag relationship class;
the vector prediction module 42 is configured to input the sample characteristics into a prediction model to obtain N prediction vectors, where the prediction vectors include a first sample bounding box, a second sample bounding box, and a sample relationship class, and N is an integer greater than zero;
The vector matching module 43 is configured to match the N predicted vectors with at least one tag vector by using a preset matching algorithm, and determine a predicted vector with the greatest matching degree with any tag vector as a reference vector of the corresponding tag vector;
a first calculation module 44, configured to traverse each tag vector, and calculate a category matching error between the sample relationship category in the reference vector and the tag relationship category in the tag vector;
a second calculating module 45, configured to calculate a first bounding box matching error between a first sample bounding box in the reference vector and a first label bounding box in the label vector, and calculate a second bounding box matching error between a second sample bounding box in the reference vector and a second label bounding box in the label vector;
the model training module 46 is configured to determine an overall match error of the tag vector according to the category match error, the first bounding box match error, and the second bounding box match error, train the feature extraction model and the prediction model according to the overall match error of all the tag vectors, and form a trained relational recognition model from the determined trained feature extraction model and the trained prediction model.
Optionally, the feature extraction model includes a semantic extraction layer, an encoder and a decoder;
The feature extraction module 41 includes:
the semantic extraction unit is used for inputting the sample image into the semantic extraction layer to obtain a semantic feature vector, and multiplying the semantic feature vector by a preset first vector to obtain a value vector;
the vector calculation unit is used for coding the sample image by adopting a preset coding mode to obtain a position code, adding semantic feature vectors and the position code point by point, and multiplying the added result with a preset second vector and a preset third vector respectively to obtain a query vector and a key value vector;
the global extraction unit is used for inputting the value vector, the query vector and the key value vector into the encoder to obtain a global information feature vector;
and the feature decoding unit is used for carrying out preset initialization on the preset learnable vector, and inputting the initialized learnable vector, the global information feature vector and the position code into the decoder to obtain sample features.
Optionally, the global extraction unit includes:
the attention subunit is used for multiplying the query vector by the transpose of the key value vector, dividing the multiplication result by a preset coefficient, and determining the division result as an attention vector;
and the normalization subunit is used for carrying out normalization processing on the attention vector through a normalization exponential function, multiplying the normalization processing result by the value vector and inputting the multiplication result into the encoder.
Optionally, the semantic extraction unit includes:
the initial extraction subunit is used for inputting the sample image into the semantic extraction layer to obtain an initial feature vector;
the dimension reduction subunit is used for reducing the dimension of the initial feature vector by adopting a preset convolution layer, carrying out flattening operation on the dimension reduction result, and determining the flattening operation result as a semantic feature vector.
Optionally, the vector matching module 43 includes:
the matching pair acquisition unit is used for matching the N predicted vectors with at least one tag vector respectively by adopting a Hungarian algorithm to obtain N matching pairs;
the error traversing unit is used for calculating the overall matching error of the matching pair aiming at any matching pair, traversing N matching pairs and obtaining N overall matching errors;
and the reference vector determining unit is used for determining, for any label vector, the prediction vector corresponding to the minimum value among the overall matching errors of the matching pairs containing that label vector, where this prediction vector is the prediction vector with the largest matching degree with the label vector and is used as the reference vector of the corresponding label vector.
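The matching described by these units can be sketched with the Hungarian algorithm as implemented in scipy; the cost matrix of overall matching errors is assumed to have been computed beforehand (for example with the weighted sum sketched below in connection with the model training module), and the example values are illustrative only.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix):
    # cost_matrix[i, j] is the overall matching error between prediction i and label vector j
    pred_idx, label_idx = linear_sum_assignment(cost_matrix)
    # each label vector receives the prediction assigned to it at minimum cost,
    # i.e. the prediction with the largest matching degree, as its reference vector
    return dict(zip(label_idx.tolist(), pred_idx.tolist()))

costs = np.array([[0.9, 0.2],     # N = 4 predictions, 2 label vectors
                  [0.1, 0.8],
                  [0.7, 0.6],
                  [0.5, 0.4]])
print(hungarian_match(costs))     # {1: 0, 0: 1}: label 0 -> prediction 1, label 1 -> prediction 0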
Optionally, the category matching error corresponds to a first weight, and the first bounding box matching error and the second bounding box matching error correspond to a second weight;
The model training module 46 includes:
the first weighting unit is used for multiplying the category matching error by the first weight to obtain a first matching error;
the second weighting unit is used for multiplying the first bounding box matching error and the second bounding box matching error by the second weight respectively, and adding the two products to obtain a second matching error;
and the summation calculation unit is used for determining the sum of the first matching error and the second matching error as an integral matching error.
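The weighted combination of the three errors can be sketched as follows. The weight values and the use of cross-entropy for the category matching error and L1 distance for the bounding box matching errors are assumptions made for illustration; this embodiment fixes only the weighting structure, not the particular loss functions.

import torch
import torch.nn.functional as F

def overall_match_error(class_logits, label_class,
                        pred_box1, label_box1, pred_box2, label_box2,
                        first_weight=1.0, second_weight=5.0):
    class_err = F.cross_entropy(class_logits.unsqueeze(0), label_class.unsqueeze(0))  # category matching error
    box1_err = F.l1_loss(pred_box1, label_box1)      # first bounding box matching error
    box2_err = F.l1_loss(pred_box2, label_box2)      # second bounding box matching error
    first_match = first_weight * class_err                              # category error times the first weight
    second_match = second_weight * box1_err + second_weight * box2_err  # box errors times the second weight
    return first_match + second_match                                   # overall matching error

err = overall_match_error(torch.randn(10), torch.tensor(3),
                          torch.rand(4), torch.rand(4),
                          torch.rand(4), torch.rand(4))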
It should be noted that, because the content of information interaction, execution process and the like between the modules, units and sub-units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. As shown in fig. 5, the computer device of this embodiment includes: at least one processor (only one shown in fig. 5), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor executing the computer program to perform the steps of any of the various training method embodiments described above.
The computer device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that fig. 5 is merely an example of a computer device and does not constitute a limitation of the computer device; the computer device may include more or fewer components than shown, may combine certain components, or may have different components, and may, for example, also include a network interface, a display screen, an input device, and the like.
The processor may be a CPU, or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The memory includes a readable storage medium, an internal memory, etc., where the internal memory may be the memory of the computer device, the internal memory providing an environment for the execution of an operating system and computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of a computer device, and in other embodiments may be an external storage device of the computer device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. that are provided on the computer device. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs such as program codes of computer programs, and the like. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing them from each other and are not intended to limit the protection scope of the present invention. For the specific working process of the units and modules in the above apparatus, reference may be made to the corresponding process in the foregoing method embodiments, which is not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program. The computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying the computer program code, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, or the like. In some jurisdictions, in accordance with legislation and patent practice, the computer readable medium may not include electrical carrier signals and telecommunications signals.
The present invention may also be implemented as a computer program product which, when run on a computer device, causes the computer device to execute all or part of the steps of the method embodiments described above.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A training method of an end-to-end relationship recognition model, the training method comprising:
acquiring a sample image and at least one corresponding label vector thereof, inputting the sample image into a feature extraction model for feature extraction to obtain sample features, wherein the label vector comprises a first label bounding box, a second label bounding box and a label relation class;
Inputting the sample characteristics into a prediction model to obtain N prediction vectors, wherein the prediction vectors comprise a first sample bounding box, a second sample bounding box and a sample relation class, and N is an integer greater than zero;
matching the N predicted vectors with the at least one tag vector by adopting a preset matching algorithm, and determining the predicted vector with the largest matching degree with any tag vector as a reference vector of the corresponding tag vector;
traversing each label vector, and calculating a class matching error of a sample relation class in the reference vector and a label relation class in the label vector;
calculating a first bounding box matching error of a first sample bounding box in the reference vector and a first label bounding box in the label vector, and calculating a second bounding box matching error of a second sample bounding box in the reference vector and a second label bounding box in the label vector;
and determining an overall matching error of the tag vector according to the category matching error, the first bounding box matching error and the second bounding box matching error, training the feature extraction model and the prediction model according to the overall matching errors of all the tag vectors, and forming a trained relation recognition model from the obtained trained feature extraction model and trained prediction model.
2. The training method of claim 1, wherein the feature extraction model comprises a semantic extraction layer, an encoder, and a decoder;
inputting the sample image into a feature extraction model for feature extraction, wherein obtaining sample features comprises the following steps:
inputting the sample image into the semantic extraction layer to obtain a semantic feature vector, and multiplying the semantic feature vector by a preset first vector to obtain a value vector;
coding the sample image by adopting a preset coding mode to obtain a position code, adding the semantic feature vector and the position code point by point, and multiplying an added result by a preset second vector and a preset third vector respectively to obtain a query vector and a key value vector;
inputting the value vector, the query vector and the key value vector into the encoder to obtain a global information feature vector;
and carrying out preset initialization on the preset learnable vector, and inputting the initialized learnable vector, the global information feature vector and the position code into the decoder to obtain the sample feature.
3. The training method of claim 2, wherein said inputting the value vector, the query vector, and the key-value vector into the encoder comprises:
Multiplying the query vector by the transpose of the key value vector, dividing the multiplication result by a preset coefficient, and determining the division result as an attention vector;
and carrying out normalization processing on the attention vector through a normalization exponential function, multiplying the normalization processing result by the value vector, and inputting the multiplication result into the encoder.
4. The training method of claim 2, wherein the inputting the sample image into the semantic extraction layer to obtain a semantic feature vector comprises:
inputting the sample image into the semantic extraction layer to obtain an initial feature vector;
and adopting a preset convolution layer to reduce the dimension of the initial feature vector, carrying out flattening operation on the dimension reduction result, and determining the flattening operation result as the semantic feature vector.
5. The training method according to claim 1, wherein the matching the N prediction vectors with the at least one label vector by using a preset matching algorithm, and determining the prediction vector with the greatest matching degree with any label vector as the reference vector of the corresponding label vector includes:
adopting a Hungarian algorithm to match the N predicted vectors with the at least one tag vector respectively to obtain N matching pairs;
For any matching pair, calculating the overall matching error of the matching pair, traversing the N matching pairs to obtain N overall matching errors;
and for any label vector, determining a prediction vector corresponding to the minimum value among the overall matching errors corresponding to the matching pairs containing the label vector, wherein the prediction vector is the prediction vector with the largest matching degree with the label vector, and taking the prediction vector with the largest matching degree as a reference vector of the corresponding label vector.
6. The training method of claim 1, wherein the class match error corresponds to a first weight, and the first bounding box match error and the second bounding box match error correspond to a second weight;
the determining the overall match error of the tag vector according to the category match error, the first bounding box match error, and the second bounding box match error includes:
multiplying the category matching error by the first weight to obtain a first matching error;
multiplying the first bounding box matching error and the second bounding box matching error by the second weight respectively, and adding the products to obtain a second matching error;
And determining the sum of the first matching error and the second matching error as the integral matching error.
7. An end-to-end relationship identification method, characterized in that the end-to-end relationship identification method comprises:
acquiring an image to be identified, wherein the image to be identified comprises at least two objects to be identified;
inputting the image to be identified into a trained relation identification model to obtain P relation identification results, wherein the relation identification results comprise respective bounding boxes of two objects to be identified and identification relation categories between the two objects to be identified, P is an integer greater than zero, and the trained relation identification model is obtained based on the training method of the end-to-end relation identification model according to any one of claims 1-6.
8. A training device for an end-to-end relationship identification model, the training device comprising:
the feature extraction module is used for acquiring a sample image and at least one corresponding label vector thereof, inputting the sample image into the feature extraction model for feature extraction to obtain sample features, wherein the label vector comprises a first label bounding box, a second label bounding box and a label relation class;
The vector prediction module is used for inputting the sample features into a prediction model to obtain N prediction vectors, wherein the prediction vectors comprise a first sample bounding box, a second sample bounding box and a sample relation class, and N is an integer greater than zero;
the vector matching module is used for matching the N predicted vectors with the at least one tag vector respectively by adopting a preset matching algorithm, and determining the predicted vector with the largest matching degree with any tag vector as a reference vector of the corresponding tag vector;
the first calculation module is used for traversing each label vector and calculating a class matching error of the sample relation class in the reference vector and the label relation class in the label vector;
the second calculating module is used for calculating a first bounding box matching error of a first sample bounding box in the reference vector and a first label bounding box in the label vector, and calculating a second bounding box matching error of a second sample bounding box in the reference vector and a second label bounding box in the label vector;
the model training module is used for determining an overall matching error of the tag vector according to the category matching error, the first bounding box matching error and the second bounding box matching error, training the feature extraction model and the prediction model according to the overall matching errors of all the tag vectors, and forming a trained relation recognition model from the obtained trained feature extraction model and trained prediction model.
9. A computer device, characterized in that it comprises a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202211413098.4A 2022-11-11 2022-11-11 End-to-end relationship identification method, model training method, device, equipment and medium Pending CN116310474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211413098.4A CN116310474A (en) 2022-11-11 2022-11-11 End-to-end relationship identification method, model training method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211413098.4A CN116310474A (en) 2022-11-11 2022-11-11 End-to-end relationship identification method, model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116310474A true CN116310474A (en) 2023-06-23

Family

ID=86811893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211413098.4A Pending CN116310474A (en) 2022-11-11 2022-11-11 End-to-end relationship identification method, model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116310474A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580254A (en) * 2023-07-12 2023-08-11 菲特(天津)检测技术有限公司 Sample label classification method and system and electronic equipment
CN116580254B (en) * 2023-07-12 2023-10-20 菲特(天津)检测技术有限公司 Sample label classification method and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination