CN115035342A - Machine learning model training method and device and visual relation detection method and device


Info

Publication number
CN115035342A
Authority
CN
China
Prior art keywords
loss function
target
semantic
features
machine learning
Prior art date
Legal status
Pending
Application number
CN202210689986.2A
Other languages
Chinese (zh)
Inventor
潘滢炜
李业豪
姚霆
梅涛
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210689986.2A
Publication of CN115035342A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The disclosure provides a machine learning model training method and device and a visual relationship detection method and device, and relates to the field of artificial intelligence. The machine learning model training method comprises the following steps: processing a sample image with a first machine learning model to obtain, for a target relationship triple, the semantic features and spatial features of the target subject and the target object, a predicate probability distribution result, and the visual features of a target area containing the target subject and the target object; determining a first loss function according to the predicate probability distribution result; predicting, with a second machine learning model, a first semantic vector and a first spatial vector of the target subject and a first semantic vector and a first spatial vector of the target object; determining a second loss function according to the prediction results; determining a first target loss function according to the first loss function and the second loss function; and training the first machine learning model and the second machine learning model with the first target loss function.

Description

Machine learning model training method and device, and visual relation detection method and device
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to a machine learning model training method and device and a visual relationship detection method and device.
Background
In the related art, a visual relationship between a subject and an object is detected by using the semantic features and spatial features of the subject and the object, together with the visual features of a target area including the subject and the object. That is, the visual relationship prediction problem is converted into a multi-modal feature fusion classification problem.
Disclosure of Invention
The inventors have noticed that existing visual relationship detection schemes do not exploit the internal supervision signals derivable from structured semantic understanding of the image, and therefore cannot obtain visual relationship detection results with stable performance.
Accordingly, the present disclosure provides a machine learning model training scheme, which can obtain a visual relationship detection result with stable performance.
According to a first aspect of the embodiments of the present disclosure, there is provided a machine learning model training method, including: processing a sample image by using a first machine learning model to obtain, for a target relationship triple, the semantic features and spatial features of a target subject, the semantic features and spatial features of a target object, a predicate probability distribution result, and the visual features of a target area comprising the target subject and the target object; determining a first loss function according to the predicate probability distribution result and a predicate annotation result; predicting, by using a second machine learning model, a first semantic vector of the target subject according to the spatial features of the target subject, a first spatial vector of the target subject according to the semantic features of the target subject, a first semantic vector of the target object according to the spatial features of the target object, and a first spatial vector of the target object according to the semantic features of the target object; determining a second loss function according to the prediction results; determining a first target loss function according to the first loss function and the second loss function; and training the first machine learning model and the second machine learning model with the first target loss function.
In some embodiments, said determining a second loss function from the prediction comprises: determining a first sub-loss function according to the first space vector of the target subject, the space labeling result of the target subject, the first space vector of the target object and the space labeling result of the target object; determining a second sub-loss function according to the first semantic vector of the target subject, the semantic annotation result of the target subject, the first semantic vector of the target object and the semantic annotation result of the target object; determining the second loss function according to the first sub-loss function and the second sub-loss function.
In some embodiments, the first sub-loss function is positively correlated with the sum of the deviation of the first spatial vector of the target subject and the spatial labeling result of the target subject and the deviation of the first spatial vector of the target object and the spatial labeling result of the target object; the second sub-loss function is inversely related to the sum of the cross entropy of the first semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the first semantic vector of the target object and the semantic annotation result of the target object.
In some embodiments, the second loss function is a weighted sum of the first sub-loss function and the second sub-loss function.
In some embodiments, the predicting a first semantic vector of the target subject from spatial features of the target subject comprises: fusing the spatial feature and the visual feature of the target subject to obtain a first fused feature; compressing the first fusion feature to obtain a first compression feature; and processing the first compression characteristic by utilizing a multilayer perceptron to obtain a first semantic vector of the target subject.
In some embodiments, the predicting the first spatial vector of the target subject from semantic features of the target subject comprises: and performing reconstruction processing by using the semantic features and the visual features of the target subject to obtain a first space vector of the target subject.
In some embodiments, the predicting the first semantic vector of the target object according to the spatial features of the target object comprises: fusing the spatial features and the visual features of the target object to obtain second fused features; compressing the second fusion feature to obtain a second compression feature; and processing the second compression characteristic by utilizing a multilayer perceptron to obtain a first semantic vector of the target object.
In some embodiments, the obtaining the first spatial vector of the target object according to the semantic features of the target object includes: and performing reconstruction processing by using the semantic features and the visual features of the target object to obtain a first space vector of the target object.
In some embodiments, the first loss function is negatively correlated with the cross entropy of the predicate probability distribution result and the predicate annotation result.
In some embodiments, the first target loss function is a weighted sum of the first loss function and the second loss function.
In some embodiments, a predicate feature of the target relationship triple is determined from the predicate probability distribution result.
In some embodiments, the above method further comprises: performing multi-mode fusion on the semantic features, the spatial features and the visual features of the target object by utilizing the second machine learning model to obtain first object features; performing inter-object reconstruction by using the second machine learning model according to the predicate feature and the first object feature to obtain a second semantic vector and a second spatial vector of the target subject; performing multi-mode fusion on the semantic features, the spatial features and the visual features of the target subject by using the second machine learning model to obtain second object features; performing inter-object reconstruction by using the second machine learning model according to the predicate feature and the second object feature to obtain a second semantic vector and a second space vector of the target object; determining a third loss function according to the reconstruction result between the objects; determining a second target loss function according to the first loss function, the second loss function and the third loss function; training the first machine learning model and the second machine learning model using the second objective loss function.
In some embodiments, the determining a third loss function from the inter-object reconstruction results comprises: determining a third sub-loss function according to the second space vector of the target subject, the space labeling result of the target subject, the second space vector of the target object and the space labeling result of the target object; determining a fourth sub-loss function according to the second semantic vector of the target subject, the semantic annotation result of the target subject, the second semantic vector of the target object and the semantic annotation result of the target object; determining the third loss function according to the third sub-loss function and the fourth sub-loss function.
In some embodiments, the third sub-loss function positively correlates to a sum of a deviation of the second spatial vector of the target subject and the spatial labeling result of the target subject and a deviation of the second spatial vector of the target object and the spatial labeling result of the target object; the fourth sub-loss function is inversely related to the sum of the cross entropy of the second semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the second semantic vector of the target object and the semantic annotation result of the target object.
In some embodiments, the third loss function is a weighted sum of the third sub-loss function and the fourth sub-loss function.
In some embodiments, the second target loss function is a weighted sum of the first, second, and third loss functions.
In some embodiments, semantic features, spatial features, predicate features, and visual features of relationship triples other than the target relationship triplet are extracted from the sample image using the first machine learning model; performing inter-relationship reconstruction by using the second machine learning model according to the semantic features, the spatial features, the predicate features and the visual features of the other relationship triples, the spatial features of the target subject and the spatial features of the target object to obtain a third semantic vector of the target subject, a third semantic vector of the target object and a predicate probability prediction distribution result; determining a fourth loss function according to the inter-relationship reconstruction result; determining a third target loss function according to the first loss function, the second loss function, the third loss function, and the fourth loss function; training the first machine learning model and the second machine learning model with the third objective loss function.
In some embodiments, said determining a fourth loss function from the inter-relationship reconstruction result comprises: determining a fifth sub-loss function according to the third semantic vector and the semantic annotation result of the target subject and the third semantic vector and the semantic annotation result of the target object; determining a sixth sub-loss function according to the predicate probability prediction distribution result and the predicate annotation result; determining the fourth loss function according to the fifth sub-loss function and the sixth sub-loss function.
In some embodiments, the fifth sub-loss function is inversely related to the sum of the cross entropy of the third semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the third semantic vector of the target object and the semantic annotation result of the target object; the sixth sub-loss function is negatively correlated with the cross entropy of the predicate probability prediction distribution result and the predicate annotation result.
In some embodiments, the fourth loss function is a weighted sum of the fifth sub-loss function and the sixth sub-loss function.
In some embodiments, the third target loss function is a weighted sum of the first, second, third, and fourth loss functions.
In some embodiments, said processing the sample image with the first machine learning model comprises: processing the sample image by using a first machine learning model to obtain the semantic features and the spatial features of a target subject, the semantic features and the spatial features of a target object and the visual features of the target area in the target relation triple; and determining the predicate probability distribution result by utilizing the semantic features and the spatial features of the target subject, the semantic features and the spatial features of the target object and the visual features of the target area.
In some embodiments, the determining the predicate probability distribution result comprises: fusing the semantic features and the spatial features of the target subject to obtain third fusion features; compressing the visual features of the target area to obtain third compressed features; fusing the third fused feature and the third compressed feature to obtain a fourth fused feature; and processing the fourth fusion feature by using a multilayer perceptron to obtain the predicate probability distribution result.
According to a second aspect of the embodiments of the present disclosure, there is provided a machine learning model training apparatus including: the system comprises a first training module, a second training module and a third training module, wherein the first training module is configured to process a sample image by using a first machine learning model to obtain the semantic features and the spatial features of a target subject, the semantic features and the spatial features of a target object, a predicate probability distribution result and the visual features of a target area comprising the target subject and the target object in a target relation triple, and determine a first loss function according to the predicate probability distribution result and the predicate annotation result; the second training module is configured to predict a first semantic vector of the target subject according to the spatial features of the target subject by using a second machine learning model, predict the first spatial vector of the target subject according to the semantic features of the target subject, predict the first semantic vector of the target object according to the spatial features of the target object, predict the first spatial vector of the target object according to the semantic features of the target object, and determine a second loss function according to a prediction result; a third training module configured to determine a first target loss function from the first and second loss functions, train the first and second machine learning models with the first target loss function.
According to a third aspect of the embodiments of the present disclosure, there is provided a machine learning model training apparatus including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method implementing any of the embodiments described above based on instructions stored by the memory.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a visual relationship detection method, including: inputting an image to be processed into a first machine learning model so that the first machine learning model outputs a predicate probability distribution result associated with a subject to be processed and an object to be processed in a relation triple to be processed, wherein the first machine learning model is obtained by training with the machine learning model training method of any one of the embodiments; and predicting the predicate probability distribution result to obtain the visual relationship between the subject to be processed and the object to be processed.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a visual relationship detecting apparatus including: a first detection module, configured to input an image to be processed into a first machine learning model, so that the first machine learning model outputs a predicate probability distribution result associated with a subject to be processed and an object to be processed in a triplet of a relation to be processed, where the first machine learning model is obtained by training using the machine learning model training method described in any of the embodiments above; and the second detection module is configured to predict the predicate probability distribution result so as to obtain the visual relationship between the subject to be processed and the object to be processed.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a visual relationship detecting apparatus including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method implementing any of the embodiments described above based on instructions stored by the memory.
According to a seventh aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer instructions are stored, and when executed by a processor, the computer-readable storage medium implements the method according to any of the embodiments described above.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart diagram of a machine learning model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of a machine learning model training method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of a machine learning model training method according to yet another embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a machine learning model training apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a machine learning model training apparatus according to another embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating a visual relationship detection method according to an embodiment of the disclosure;
FIG. 7 is a schematic structural diagram of a visual relationship detection apparatus according to an embodiment of the disclosure;
fig. 8 is a schematic structural diagram of a visual relationship detection apparatus according to another embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
Fig. 1 is a schematic flow chart of a machine learning model training method according to an embodiment of the present disclosure. In some embodiments, the following machine learning model training method is performed by a machine learning model training apparatus.
In step 101, a first machine learning model is used to process a sample image to obtain a semantic feature and a spatial feature of a target subject, a semantic feature and a spatial feature of a target object, a predicate probability distribution result, and a visual feature of a target area including the target subject and the target object in a target relationship triple.
A relationship triple may be represented as <subject, predicate, object>. For example, if the image includes a car and a stop sign and the car is in front of the stop sign, then with the car as the subject and the stop sign as the object, the predicate is "front". The corresponding relationship triple may be denoted as <car, front, stop sign>. The target area is the minimal union region including the car and the stop sign.
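For illustration only, annotations of this kind could be organized as in the following Python sketch; the class and field names (RelationTriple, subject_box, union_box, and so on) are assumptions made for this example and are not terminology from the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class RelationTriple:
    """One <subject, predicate, object> annotation of a sample image (illustrative)."""
    subject_name: str                                    # e.g. "car"
    predicate: str                                       # e.g. "front"
    object_name: str                                     # e.g. "stop sign"
    subject_box: Tuple[float, float, float, float]       # (x, y, w, h) of the subject
    object_box: Tuple[float, float, float, float]        # (x, y, w, h) of the object

def union_box(a, b):
    """Minimal union region covering both boxes, used as the target area."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)

triple = RelationTriple("car", "front", "stop sign",
                        subject_box=(30, 40, 120, 80), object_box=(200, 35, 40, 90))
target_area = union_box(triple.subject_box, triple.object_box)
```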
In some embodiments, the sample images are processed using a first machine learning model to obtain the semantic features and spatial features of the target subject, the semantic features and spatial features of the target object, and the visual features of the target region in the target relationship triples.
For example, a visual representation is extracted by slightly enlarging the joint region including the subject and the object and feeding it into a pre-trained CNN (Convolutional Neural Network). The obtained visual representation captures not only the visual appearance of the subject and the object but also their relationship with the surrounding environment. A spatial mask module encodes a two-layer spatial mask, which consists of the binary masks of the subject and the object, through an hourglass network to obtain an encoding result. The encoding result and the visual representation are then concatenated to obtain the visual features.
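As a rough illustration of the visual-feature pipeline just described, the following PyTorch-style sketch runs a CNN backbone over the slightly enlarged union region, encodes the two-layer binary mask with a small convolutional stand-in for the hourglass network, and concatenates the two results; the backbone choice (ResNet-50), channel sizes, and mask-encoder structure are assumptions, not the disclosure's concrete network.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualFeatureExtractor(nn.Module):
    """Illustrative sketch: visual features of the union region of subject and object."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)                    # pre-trained weights would be loaded in practice
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map
        # Simple convolutional stand-in for the hourglass network that encodes the two-layer binary mask.
        self.mask_encoder = nn.Sequential(
            nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )

    def forward(self, union_crop, dual_mask):
        # union_crop: (B, 3, 224, 224) slightly enlarged union region
        # dual_mask:  (B, 2, 224, 224) binary masks of subject and object
        appearance = self.cnn(union_crop)                  # (B, 2048, 7, 7)
        mask_code = self.mask_encoder(dual_mask)           # (B, 256, 7, 7)
        return torch.cat([appearance, mask_code], dim=1)   # concatenated visual feature (B, 2304, 7, 7)
```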
For the semantic features of the subject and the object, the names of the subject and the object may first be converted into word vectors (for example, using a Word2Vec model), then encoded with a GRU (Gated Recurrent Unit) to obtain high-dimensional (for example, 300-dimensional) feature vectors, and the high-dimensional feature vectors are then converted into the semantic features of the subject and the semantic features of the object through a fully connected network.
For the spatial features of the subject and the object, the 4-dimensional coordinate vector of the subject and the 4-dimensional coordinate vector of the object may be converted, respectively, into the spatial features of the subject and the spatial features of the object through a fully connected layer.
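The semantic and spatial branches described in the two preceding paragraphs could be sketched as follows; the hidden sizes, the single-layer GRU, and the output dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SemanticSpatialEncoder(nn.Module):
    """Illustrative sketch of the semantic and spatial feature branches."""
    def __init__(self, word_dim=300, sem_dim=512, spa_dim=512):
        super().__init__()
        self.gru = nn.GRU(word_dim, 300, batch_first=True)    # encodes the word vectors of the object name
        self.sem_fc = nn.Linear(300, sem_dim)                  # fully connected mapping to the semantic feature
        self.spa_fc = nn.Linear(4, spa_dim)                    # fully connected mapping of the 4-d box coordinates

    def encode_semantic(self, word_vectors):
        # word_vectors: (B, T, 300) word vectors of the subject/object name (e.g. from Word2Vec)
        _, h = self.gru(word_vectors)                          # h: (1, B, 300)
        return self.sem_fc(h.squeeze(0))                       # (B, sem_dim)

    def encode_spatial(self, boxes):
        # boxes: (B, 4) normalized (x, y, w, h)
        return self.spa_fc(boxes)                              # (B, spa_dim)
```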
And then, determining a predicate probability distribution result by using the semantic feature and the spatial feature of the target subject, the semantic feature and the spatial feature of the target object and the visual feature of the target area.
For example, the semantic features and the spatial features of the target subject are fused (e.g., concatenated) to obtain a fused feature. The visual features of the target area are compressed (e.g., using an AdaptiveAvgPool layer) to obtain a compression result. The fused feature and the compression result are then fused (e.g., concatenated) to obtain a fusion result. Finally, the fusion result is processed by an MLP (Multi-Layer Perceptron) serving as the predicate classifier to obtain the predicate probability distribution result \hat{p}, as shown in equation (1):

\hat{p} = E_{cls}( [ W_f [ f_s^{sem} ; f_s^{spa} ; f_o^{sem} ; f_o^{spa} ] ; \mathrm{AvgPool}(f_{vis}) ] )        (1)

where [ · ; · ] denotes concatenation, f_s^{sem} is the semantic feature of the subject, f_s^{spa} is the spatial feature of the subject, f_o^{sem} is the semantic feature of the object, f_o^{spa} is the spatial feature of the object, W_f is the fusion transformation matrix, f_{vis} is the visual feature, \mathrm{AvgPool} is the compression function, and E_{cls} is the predicate classifier.
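A minimal sketch of the fusion-and-classification step of equation (1) is given below; the feature dimensions, the concatenation-based fusion, and the two-layer MLP are assumptions rather than the disclosure's concrete configuration.

```python
import torch
import torch.nn as nn

class PredicateClassifier(nn.Module):
    """Illustrative sketch of equation (1): fuse subject/object features with pooled visual features."""
    def __init__(self, sem_dim=512, spa_dim=512, vis_channels=2304, num_predicates=51):
        super().__init__()
        self.fuse = nn.Linear(2 * (sem_dim + spa_dim), 1024)   # W_f: fusion transformation
        self.pool = nn.AdaptiveAvgPool2d(1)                    # AvgPool compression of the visual feature map
        self.mlp = nn.Sequential(                              # E_cls: predicate classifier
            nn.Linear(1024 + vis_channels, 1024), nn.ReLU(),
            nn.Linear(1024, num_predicates),
        )

    def forward(self, sub_sem, sub_spa, obj_sem, obj_spa, vis_map):
        fused = self.fuse(torch.cat([sub_sem, sub_spa, obj_sem, obj_spa], dim=-1))
        vis = self.pool(vis_map).flatten(1)                    # (B, vis_channels)
        logits = self.mlp(torch.cat([fused, vis], dim=-1))
        return logits.softmax(dim=-1)                          # predicate probability distribution
```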
In step 102, a first loss function is determined according to the predicate probability distribution result and the predicate annotation result.
In some embodiments, the first loss function is negatively correlated with the cross entropy of the predicate probability distribution result and the predicate annotation result. For example, the first loss function L_Base is shown in equation (2):

L_Base = -\sum_x p(x) \log \hat{p}(x)        (2)

where \hat{p} is the predicate probability distribution result, p is the predicate annotation result, and x denotes the x-th dimension of the vector.
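Assuming the predicate annotation result is given as a one-hot (or soft) probability vector, equation (2) could be computed as in the following sketch; the function name base_loss is a placeholder introduced for this example.

```python
import torch

def base_loss(pred_dist, label_dist, eps=1e-8):
    """Equation (2) sketch: cross entropy between predicted and annotated predicate distributions."""
    return -(label_dist * (pred_dist + eps).log()).sum(dim=-1).mean()
```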
In step 103, a second machine learning model is used to predict a first semantic vector of the target subject according to the spatial features of the target subject, predict a first spatial vector of the target subject according to the semantic features of the target subject, predict a first semantic vector of the target object according to the spatial features of the target object, and predict a first spatial vector of the target object according to the semantic features of the target object. I.e. to perform intra-object vector prediction.
In some embodiments, the spatial features and the visual features of the target subject are fused (e.g., concatenated) to obtain a first fused feature. The first fused feature is compressed to obtain a first compressed feature, and the first compressed feature is processed with a multilayer perceptron to obtain the first semantic vector of the target subject.
In some embodiments, a reconstruction process is performed using the semantic features and the visual features of the target subject to obtain the first spatial vector of the target subject.
In some embodiments, the spatial features and the visual features of the target object are fused (e.g., concatenated) to obtain a second fused feature. The second fused feature is compressed to obtain a second compressed feature, and the second compressed feature is processed with a multilayer perceptron to obtain the first semantic vector of the target object.
In some embodiments, a reconstruction process is performed using the semantic features and the visual features of the target object to obtain the first spatial vector of the target object.
For example, the first spatial vector of the target subject and the first spatial vector of the target object are of dimension 7 × 7 × 5: the image is divided into a 7 × 7 grid, and the 5 dimensions per grid cell represent the predicted 4-dimensional coordinate information (\hat{x}, \hat{y}, \hat{w}, \hat{h}) plus a 1-dimensional confidence \hat{c}.
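The intra-object prediction described in the preceding paragraphs could be sketched as follows; folding the compression step into the MLP, the hidden sizes, and the class count are assumptions, and the spatial head emits the 7 × 7 × 5 grid described above.

```python
import torch
import torch.nn as nn

class IntraObjectHead(nn.Module):
    """Illustrative sketch of intra-object vector prediction for one object (subject or object)."""
    def __init__(self, sem_dim=512, spa_dim=512, vis_dim=2304, num_classes=151, grid=7):
        super().__init__()
        self.grid = grid
        self.sem_mlp = nn.Sequential(              # spatial + visual features -> first semantic vector
            nn.Linear(spa_dim + vis_dim, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )
        self.spa_head = nn.Sequential(             # semantic + visual features -> first spatial vector
            nn.Linear(sem_dim + vis_dim, 512), nn.ReLU(),
            nn.Linear(512, grid * grid * 5),       # 7 x 7 grid, 4 coordinates + 1 confidence per cell
        )

    def forward(self, sem, spa, vis):
        # sem: (B, sem_dim), spa: (B, spa_dim), vis: (B, vis_dim) pooled visual feature of the target area
        sem_vec = self.sem_mlp(torch.cat([spa, vis], dim=-1)).softmax(dim=-1)
        spa_vec = self.spa_head(torch.cat([sem, vis], dim=-1)).view(-1, self.grid, self.grid, 5)
        return sem_vec, spa_vec
```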
In step 104, a second loss function is determined based on the prediction.
In some embodiments, the step of determining the second loss function comprises the following:
1) Determining a first sub-loss function according to the first spatial vector of the target subject, the spatial labeling result of the target subject, the first spatial vector of the target object, and the spatial labeling result of the target object.
For example, the first sub-loss function has a positive correlation with the sum of the deviation between the first space vector of the target subject and the spatial labeling result of the target subject and the deviation between the first space vector of the target object and the spatial labeling result of the target object.
Let the first spatial vector of the target subject be \hat{b}_s, the first spatial vector of the target object be \hat{b}_o, the spatial labeling result of the target subject be b_s, and the spatial labeling result of the target object be b_o. The first sub-loss function loss_1 is shown in equation (3):

loss_1 = \|\hat{b}_s - b_s\| + \|\hat{b}_o - b_o\|        (3)
For example, the first spatial vector of the target subject carries the predicted 4-dimensional coordinate information (\hat{x}, \hat{y}, \hat{w}, \hat{h}), and the spatial labeling result of the target subject is the 4-dimensional coordinate information (x, y, w, h); the deviation \|\hat{b}_s - b_s\| is then computed as shown in equation (4):

\|\hat{b}_s - b_s\| = \sum_i [ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 + (c_i - \hat{c}_i)^2 ]        (4)

where the confidence c of a grid cell containing the target is 1, and i denotes the i-th grid cell. Correspondingly, \|\hat{b}_o - b_o\| can also be calculated using equation (4); an illustrative computation of this grid-based deviation is sketched after equation (6) below.
2) Determining a second sub-loss function according to the first semantic vector of the target subject, the semantic annotation result of the target subject, the first semantic vector of the target object, and the semantic annotation result of the target object.
For example, the second sub-loss function is inversely related to the sum of the cross entropy of the first semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the first semantic vector of the target object and the semantic annotation result of the target object.
Let the first semantic vector of the target subject be \hat{n}_s, the semantic annotation result of the target subject be n_s, the first semantic vector of the target object be \hat{n}_o, and the semantic annotation result of the target object be n_o. The second sub-loss function loss_2 is shown in equation (5):

loss_2 = -\sum_x n_s(x) \log \hat{n}_s(x) - \sum_x n_o(x) \log \hat{n}_o(x)        (5)
3) A second loss function is determined from the first sub-loss function and the second sub-loss function.
In some embodiments, the second loss function is a weighted sum of the first sub-loss function and the second sub-loss function.
For example, the second loss function L_Intra-object is shown in equation (6):

L_Intra-object = loss_1 + loss_2        (6)
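Assuming a plain squared-error form for the grid-cell deviation of equation (4) and one-hot semantic annotations, the intra-object loss of equations (3) to (6) could be sketched as follows; the helper names and the equal weighting of the coordinate and confidence terms are assumptions.

```python
import torch

def grid_deviation(pred_grid, gt_grid):
    """Equation (4) sketch: pred_grid and gt_grid have shape (B, 7, 7, 5) holding
    (x, y, w, h, c) per cell; the annotated confidence c is 1 only in cells containing the target."""
    obj_mask = gt_grid[..., 4:5]                                   # 1 for cells containing the target
    coord_err = ((pred_grid[..., :4] - gt_grid[..., :4]) ** 2).sum(dim=-1, keepdim=True)
    conf_err = (pred_grid[..., 4:5] - gt_grid[..., 4:5]) ** 2
    return (obj_mask * coord_err + conf_err).sum(dim=(1, 2, 3)).mean()

def semantic_ce(pred_dist, label_dist, eps=1e-8):
    """Cross entropy between a predicted distribution and a (one-hot) annotation."""
    return -(label_dist * (pred_dist + eps).log()).sum(dim=-1).mean()

def intra_object_loss(sub_spa, sub_spa_gt, obj_spa, obj_spa_gt,
                      sub_sem, sub_sem_gt, obj_sem, obj_sem_gt):
    loss_1 = grid_deviation(sub_spa, sub_spa_gt) + grid_deviation(obj_spa, obj_spa_gt)   # equation (3)
    loss_2 = semantic_ce(sub_sem, sub_sem_gt) + semantic_ce(obj_sem, obj_sem_gt)         # equation (5)
    return loss_1 + loss_2                                                               # equation (6)
```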
In step 105, a first target loss function is determined from the first loss function and the second loss function.
In some embodiments, the first target loss function is a weighted sum of the first loss function and the second loss function.
For example, the first target loss function is shown in equation (7):

L1 = L_Base + L_Intra-object        (7)
At step 106, the first machine learning model and the second machine learning model are trained using the first target loss function.
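A single training step on the first target loss function might then look like the following sketch, reusing the illustrative helpers base_loss and intra_object_loss defined above; model_1, model_2, the optimizer choice, and the batch field names are placeholders, not the disclosure's concrete interfaces.

```python
import torch

def train_step(model_1, model_2, optimizer, batch):
    """One illustrative optimization step on L1 = L_Base + L_Intra-object."""
    feats = model_1(batch["image"], batch["triple"])                                  # assumed feature dict
    l_base = base_loss(feats["pred_dist"], batch["predicate_label"])                  # equation (2)
    sub_sem, sub_spa = model_2(feats["sub_sem"], feats["sub_spa"], feats["vis"])      # intra-object prediction
    obj_sem, obj_spa = model_2(feats["obj_sem"], feats["obj_spa"], feats["vis"])
    l_intra = intra_object_loss(sub_spa, batch["sub_grid"], obj_spa, batch["obj_grid"],
                                sub_sem, batch["sub_label"], obj_sem, batch["obj_label"])  # equations (3)-(6)
    l1 = l_base + l_intra                                                             # equation (7)
    optimizer.zero_grad()
    l1.backward()
    optimizer.step()
    return l1.item()

# Example wiring (assumed):
# optimizer = torch.optim.Adam(list(model_1.parameters()) + list(model_2.parameters()), lr=1e-4)
```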
In the machine learning model training method provided by the above embodiment of the present disclosure, on the basis that the sample image is processed by the first machine learning model to obtain the relevant features of the target relationship triplet, the second machine learning model is used to perform vector prediction in the object, and the first machine learning model and the second machine learning model are trained according to the prediction result. Because the internal supervision signal is utilized in the training process, the visual relation detection result with stable performance can be effectively obtained.
Fig. 2 is a schematic flow chart of a machine learning model training method according to another embodiment of the present disclosure. In some embodiments, the following machine learning model training method is performed by a machine learning model training apparatus.
In step 201, a first machine learning model is used to process a sample image to obtain a semantic feature and a spatial feature of a target subject, a semantic feature and a spatial feature of a target object, a predicate probability distribution result, and a visual feature of a target area including the target subject and the target object in a target relationship triple.
In step 202, a first loss function is determined according to the predicate probability distribution result and the predicate annotation result.
For example, the first loss function is as shown in the above equation (2).
In step 203, the predicate features of the target relationship triple are determined according to the predicate probability distribution result.
For example, the predicate features are represented in one-hot encoding form.
At step 204, vector prediction within the object is performed.
That is, the second machine learning model is used to predict the first semantic vector of the target subject based on the spatial features of the target subject, predict the first spatial vector of the target subject based on the semantic features of the target subject, predict the first semantic vector of the target object based on the spatial features of the target object, and predict the first spatial vector of the target object based on the semantic features of the target object.
In step 205, a second loss function is determined based on the prediction.
For example, the second loss function is as shown in the above equation (6).
In step 206, inter-object vector reconstruction is performed.
Multi-modal fusion is performed on the semantic features, the spatial features, and the visual features of the target object by using the second machine learning model to obtain a first object feature. Inter-object reconstruction is then performed by using the second machine learning model according to the predicate feature and the first object feature to obtain a second semantic vector and a second spatial vector of the target subject. Similarly, multi-modal fusion is performed on the semantic features, the spatial features, and the visual features of the target subject by using the second machine learning model to obtain a second object feature, and inter-object reconstruction is performed according to the predicate feature and the second object feature to obtain a second semantic vector and a second spatial vector of the target object.
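A minimal sketch of one direction of this inter-object reconstruction (recovering the target subject's second semantic and spatial vectors from the target object's fused feature and the predicate feature) is given below; the dimensions and the concatenation-based multi-modal fusion are assumptions.

```python
import torch
import torch.nn as nn

class InterObjectReconstructor(nn.Module):
    """Illustrative sketch: reconstruct the counterpart object from the predicate and one object feature."""
    def __init__(self, sem_dim=512, spa_dim=512, vis_dim=2304, pred_dim=51,
                 num_classes=151, grid=7):
        super().__init__()
        self.grid = grid
        self.fuse = nn.Linear(sem_dim + spa_dim + vis_dim, 1024)     # multi-modal fusion -> object feature
        self.sem_head = nn.Linear(1024 + pred_dim, num_classes)      # second semantic vector of the counterpart
        self.spa_head = nn.Linear(1024 + pred_dim, grid * grid * 5)  # second spatial vector of the counterpart

    def forward(self, sem, spa, vis, predicate_onehot):
        obj_feat = torch.relu(self.fuse(torch.cat([sem, spa, vis], dim=-1)))
        joint = torch.cat([obj_feat, predicate_onehot], dim=-1)
        sem_vec = self.sem_head(joint).softmax(dim=-1)
        spa_vec = self.spa_head(joint).view(-1, self.grid, self.grid, 5)
        return sem_vec, spa_vec
```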
In step 207, a third loss function is determined from the inter-object reconstruction results.
In some embodiments, the step of determining the third loss function comprises the following:
1) Determining a third sub-loss function according to the second spatial vector of the target subject, the spatial labeling result of the target subject, the second spatial vector of the target object, and the spatial labeling result of the target object.
For example, the third sub-loss function positively correlates to a sum of a deviation of the second space vector of the target subject and the spatial labeling result of the target subject and a deviation of the second space vector of the target object and the spatial labeling result of the target object.
Let the second spatial vector of the target subject be \hat{b}_s^{inter}, the second spatial vector of the target object be \hat{b}_o^{inter}, the spatial labeling result of the target subject be b_s, and the spatial labeling result of the target object be b_o. The third sub-loss function loss_3 is shown in equation (8):

loss_3 = \|\hat{b}_s^{inter} - b_s\| + \|\hat{b}_o^{inter} - b_o\|        (8)

where \|\hat{b}_s^{inter} - b_s\| and \|\hat{b}_o^{inter} - b_o\| can be calculated by equation (4) above.
2) Determining a fourth sub-loss function according to the second semantic vector of the target subject, the semantic annotation result of the target subject, the second semantic vector of the target object, and the semantic annotation result of the target object.
For example, the fourth sub-loss function is inversely related to the sum of the cross entropy of the second semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the second semantic vector of the target object and the semantic annotation result of the target object.
Let the second semantic vector of the target subject be \hat{n}_s^{inter}, the semantic annotation result of the target subject be n_s, the second semantic vector of the target object be \hat{n}_o^{inter}, and the semantic annotation result of the target object be n_o. The fourth sub-loss function loss_4 is shown in equation (9):

loss_4 = -\sum_x n_s(x) \log \hat{n}_s^{inter}(x) - \sum_x n_o(x) \log \hat{n}_o^{inter}(x)        (9)
3) A third loss function is determined from the third sub-loss function and the fourth sub-loss function.
In some embodiments, the third loss function is a weighted sum of the third sub-loss function and the fourth sub-loss function.
For example, the third loss function L_Inter-object is shown in equation (10):

L_Inter-object = loss_3 + loss_4        (10)
At step 208, a second target loss function is determined based on the first, second, and third loss functions.
In some embodiments, the second target loss function is a weighted sum of the first loss function, the second loss function, and the third loss function.
For example, the second target loss function is shown in equation (11):

L2 = L_Base + L_Intra-object + L_Inter-object        (11)
At step 209, the first machine learning model and the second machine learning model are trained using a second target loss function.
In the machine learning model training method provided by the above embodiment of the present disclosure, on the basis that the sample image is processed by the first machine learning model to obtain the relevant features of the target relationship triplet, the second machine learning model is used to perform intra-object vector prediction and inter-object vector prediction, and the first machine learning model and the second machine learning model are trained according to the prediction result. Because the internal supervision signals with multiple granularities are utilized in the training process, the visual relation detection result with stable performance can be effectively obtained.
Fig. 3 is a flowchart illustrating a machine learning model training method according to yet another embodiment of the present disclosure. In some embodiments, the following machine learning model training method is performed by a machine learning model training apparatus.
In step 301, the sample image is processed by using a first machine learning model to obtain a semantic feature and a spatial feature of a target subject, a semantic feature and a spatial feature of a target object, a predicate probability distribution result, and a visual feature of a target area including the target subject and the target object in a target relationship triple.
In step 302, a first loss function is determined according to the predicate probability distribution result and the predicate annotation result.
For example, the first loss function is as shown in the above equation (2).
In step 303, the predicate features of the target relationship triple are determined according to the predicate probability distribution result.
For example, the predicate features are represented in one-hot encoding form.
At step 304, vector prediction within the object is performed.
That is, the second machine learning model is used to predict the first semantic vector of the target subject based on the spatial features of the target subject, predict the first spatial vector of the target subject based on the semantic features of the target subject, predict the first semantic vector of the target object based on the spatial features of the target object, and predict the first spatial vector of the target object based on the semantic features of the target object.
In step 305, a second penalty function is determined based on the prediction.
For example, the second loss function is as shown in the above equation (6).
In step 306, inter-object vector reconstruction is performed.
Multi-modal fusion is performed on the semantic features, the spatial features, and the visual features of the target object by using the second machine learning model to obtain a first object feature. Inter-object reconstruction is then performed by using the second machine learning model according to the predicate feature and the first object feature to obtain a second semantic vector and a second spatial vector of the target subject. Similarly, multi-modal fusion is performed on the semantic features, the spatial features, and the visual features of the target subject by using the second machine learning model to obtain a second object feature, and inter-object reconstruction is performed according to the predicate feature and the second object feature to obtain a second semantic vector and a second spatial vector of the target object.
In step 307 a third loss function is determined from the inter-object reconstruction results.
For example, the third loss function L_Inter-object is as shown in equation (10) above.
In step 308, a vector reconstruction between the relationships is performed.
Firstly, semantic features, spatial features, predicate features and visual features of relation triples other than the target relation triples are extracted from a sample image by using a first machine learning model.
For example, if the image includes a car and a stop sign, and the car is located in front of the stop sign, the corresponding relationship triple is <car, front, stop sign>. In addition, the image may include other objects such as a house and the ground, and the corresponding relationship triples may include <car, in ...> and so on. By using these relationship triples, a relatively complete scene graph can be constructed, which helps the model understand the scene.
Then, inter-relationship reconstruction is performed by using the second machine learning model according to the semantic features, spatial features, predicate features, and visual features of the other relationship triples, together with the spatial features of the target subject and the spatial features of the target object, to obtain a third semantic vector of the target subject, a third semantic vector of the target object, and a predicate probability prediction distribution result.
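A minimal sketch of this inter-relationship reconstruction is given below; mean-pooling the other relationship triples and the single-layer prediction heads are assumptions made for the example.

```python
import torch
import torch.nn as nn

class InterRelationReconstructor(nn.Module):
    """Illustrative sketch: reconstruct the target triple from the other relationship triples."""
    def __init__(self, sem_dim=512, spa_dim=512, vis_dim=2304, pred_dim=51, num_classes=151):
        super().__init__()
        triple_dim = 2 * sem_dim + 2 * spa_dim + pred_dim + vis_dim
        self.ctx = nn.Linear(triple_dim, 1024)                    # encodes one other relationship triple
        self.sub_sem = nn.Linear(1024 + 2 * spa_dim, num_classes)
        self.obj_sem = nn.Linear(1024 + 2 * spa_dim, num_classes)
        self.pred = nn.Linear(1024 + 2 * spa_dim, pred_dim)

    def forward(self, other_triples, sub_spa, obj_spa):
        # other_triples: (B, M-1, triple_dim) concatenated features of the other relationship triples
        ctx = torch.relu(self.ctx(other_triples)).mean(dim=1)     # pool the context triples
        joint = torch.cat([ctx, sub_spa, obj_spa], dim=-1)
        return (self.sub_sem(joint).softmax(-1),                  # third semantic vector of the subject
                self.obj_sem(joint).softmax(-1),                  # third semantic vector of the object
                self.pred(joint).softmax(-1))                     # predicate probability prediction distribution
```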
In step 309, a fourth loss function is determined based on the inter-relationship reconstruction result.
In some embodiments, the step of determining the fourth loss function comprises the following.
1) Determining a fifth sub-loss function according to the third semantic vector and the semantic annotation result of the target subject, and the third semantic vector and the semantic annotation result of the target object.
For example, the fifth sub-loss function is inversely related to the sum of the cross entropy of the third semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the third semantic vector of the target object and the semantic annotation result of the target object.
Suppose the sample image includes M relationship triples, and the target relationship triple is the i-th relationship triple R_i. In R_i, let the third semantic vector of the target subject be \hat{n}_{si}, the semantic annotation result of the target subject be n_{si}, the third semantic vector of the target object be \hat{n}_{oi}, and the semantic annotation result of the target object be n_{oi}. The fifth sub-loss function loss_5 is shown in equation (12):

loss_5 = -\sum_x n_{si}(x) \log \hat{n}_{si}(x) - \sum_x n_{oi}(x) \log \hat{n}_{oi}(x)        (12)
2) Determining a sixth sub-loss function according to the predicate probability prediction distribution result and the predicate annotation result.
For example, the sixth sub-loss function is negatively correlated with the cross entropy of the predicate probability prediction distribution result and the predicate annotation result.
Suppose the sample image includes M relationship triples, and the target relationship triple is the i-th relationship triple R_i. In R_i, let the predicate probability prediction distribution result be \hat{p}_{si} and the predicate annotation result be p_{si}. The sixth sub-loss function loss_6 is shown in equation (13):

loss_6 = -\sum_x p_{si}(x) \log \hat{p}_{si}(x)        (13)
3) A fourth loss function is determined from the fifth sub-loss function and the sixth sub-loss function.
In some embodiments, the fourth loss function is a weighted sum of the fifth sub-loss function and the sixth sub-loss function.
For example, the fourth loss function L_Inter-relation is shown in equation (14):

L_Inter-relation = loss_5 + loss_6        (14)
At step 310, a third target loss function is determined based on the first, second, third, and fourth loss functions.
In some embodiments, the third target loss function is a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
For example, the third target loss function is shown in equation (15):

L3 = L_Base + L_Intra-object + L_Inter-object + L_Inter-relation        (15)
In step 311, the first machine learning model and the second machine learning model are trained using the third target loss function.
In the machine learning model training method provided by the above embodiment of the present disclosure, on the basis that the sample image is processed by the first machine learning model to obtain the relevant features of the target relationship triplet, the second machine learning model is used to perform intra-object vector prediction, inter-object vector prediction, and inter-relationship vector prediction, and the first machine learning model and the second machine learning model are trained according to the prediction results. Because the internal supervision signals with multiple granularities are utilized in the training process, the visual relation detection result with stable performance can be effectively obtained.
Fig. 4 is a schematic structural diagram of a machine learning model training apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the machine learning model training apparatus includes a first training module 41, a second training module 42, and a third training module 43.
The first training module 41 is configured to process the sample image by using a first machine learning model to obtain a semantic feature and a spatial feature of the target subject, the semantic feature and the spatial feature of the target object, a predicate probability distribution result, and a visual feature of a target area including the target subject and the target object in the target relationship triplet, and determine a first loss function according to the predicate probability distribution result and the predicate annotation result.
In some embodiments, first training module 41 processes the sample images using the first machine learning model to obtain the semantic features and spatial features of the target subject, the semantic features and spatial features of the target object, and the visual features of the target region in the target relationship triples. And determining a predicate probability distribution result by using the semantic features and the spatial features of the target subject, the semantic features and the spatial features of the target object and the visual features of the target area.
In some embodiments, the first training module 41 fuses the semantic features and the spatial features of the target subject to obtain a third fused feature, compresses the visual features of the target region to obtain a third compressed feature, fuses the third fused feature and the third compressed feature to obtain a fourth fused feature, and processes the fourth fused feature by using the multi-layer perceptron to obtain the predicate probability distribution result.
In some embodiments, the first loss function is negatively correlated with the cross entropy of the predicate probability distribution result and the predicate annotation result. For example, the first loss function is as shown in equation (2) above.
The second training module 42 is configured to predict a first semantic vector of the target subject according to the spatial features of the target subject, predict a first spatial vector of the target subject according to the semantic features of the target subject, predict a first semantic vector of the target object according to the spatial features of the target object, predict a first spatial vector of the target object according to the semantic features of the target object, and determine a second loss function according to the prediction result using the second machine learning model.
In some embodiments, second training module 42 fuses the spatial features and the visual features of the target subject to obtain first fused features, compresses the first fused features to obtain first compressed features, and processes the first compressed features using the multi-layered perceptron to obtain a first semantic vector of the target subject.
In some embodiments, second training module 42 performs a reconstruction process using the semantic features and the visual features of the target subject to obtain a first spatial vector of the target subject.
In some embodiments, the second training module 42 fuses the spatial feature and the visual feature of the target object to obtain a second fused feature, compresses the second fused feature to obtain a second compressed feature, and processes the second compressed feature by using the multi-layered perceptron to obtain the first semantic vector of the target object.
In some embodiments, second training module 42 performs a reconstruction process using the semantic features and the visual features of the target object to obtain a first spatial vector of the target object.
In some embodiments, second training module 42 determines the first sub-loss function according to the first space vector of the target subject and the spatial labeling result of the target subject, the first space vector of the target object and the spatial labeling result of the target object; determining a second sub-loss function according to the first semantic vector of the target subject, the semantic annotation result of the target subject, the first semantic vector of the target object and the semantic annotation result of the target object; a second loss function is determined from the first sub-loss function and the second sub-loss function.
In some embodiments, the first sub-loss function positively correlates to a sum of a deviation of the first spatial vector of the target subject and the spatial labeling result of the target subject and a deviation of the first spatial vector of the target object and the spatial labeling result of the target object. The second sub-loss function is in negative correlation with the sum of the cross entropy of the first semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the first semantic vector of the target object and the semantic annotation result of the target object.
In some embodiments, the second loss function is a weighted sum of the first sub-loss function and the second sub-loss function. For example, the second loss function is as shown in the above equation (6).
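Read this way, the second loss could be sketched as below. The L1 form of the spatial deviation and the unit weights are assumptions standing in for equation (6), which is not reproduced here.

```python
import torch.nn.functional as F

def second_loss(subj_spa_pred, subj_spa_gt, obj_spa_pred, obj_spa_gt,
                subj_sem_pred, subj_sem_label, obj_sem_pred, obj_sem_label,
                w_spatial=1.0, w_semantic=1.0):
    # First sub-loss: deviation between predicted and annotated spatial vectors.
    l_spatial = F.l1_loss(subj_spa_pred, subj_spa_gt) + F.l1_loss(obj_spa_pred, obj_spa_gt)
    # Second sub-loss: cross-entropy between predicted semantic vectors and class labels.
    l_semantic = (F.cross_entropy(subj_sem_pred, subj_sem_label)
                  + F.cross_entropy(obj_sem_pred, obj_sem_label))
    # Second loss: weighted sum of the two sub-losses.
    return w_spatial * l_spatial + w_semantic * l_semantic
```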
The third training module 43 is configured to determine a first target loss function from the first and second loss functions, with which the first and second machine learning models are trained.
In some embodiments, the first target loss function is a weighted sum of the first loss function and the second loss function. For example, the first objective loss function is as shown in the above equation (7).
In some embodiments, the first training module 41 determines a predicate feature of the target relationship triple from the predicate probability distribution result.
In some embodiments, the second training module 42 performs multi-modal fusion on the semantic features, the spatial features, and the visual features of the target object by using the second machine learning model to obtain a first object feature, and performs inter-object reconstruction by using the second machine learning model according to the predicate feature and the first object feature to obtain a second semantic vector and a second spatial vector of the target subject. Likewise, it performs multi-modal fusion on the semantic features, the spatial features, and the visual features of the target subject to obtain a second object feature, performs inter-object reconstruction according to the predicate feature and the second object feature to obtain a second semantic vector and a second spatial vector of the target object, and determines the third loss function according to the inter-object reconstruction results.
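One possible realization of this inter-object reconstruction is sketched below for a single direction (reconstructing the subject from the object; the opposite direction is symmetric). The concatenation-based fusion and all dimensions, including the predicate feature size, are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class InterObjectReconstruction(nn.Module):
    """Reconstructs one entity of the pair from the predicate feature and the
    multi-modal feature of the other entity; dimensions are placeholders."""
    def __init__(self, sem_dim=300, spa_dim=8, vis_dim=2048, pred_dim=50,
                 hid_dim=512, num_classes=150):
        super().__init__()
        self.fuse = nn.Linear(sem_dim + spa_dim + vis_dim, hid_dim)   # multi-modal fusion
        self.sem_head = nn.Sequential(nn.Linear(hid_dim + pred_dim, hid_dim), nn.ReLU(),
                                      nn.Linear(hid_dim, num_classes))
        self.spa_head = nn.Sequential(nn.Linear(hid_dim + pred_dim, hid_dim), nn.ReLU(),
                                      nn.Linear(hid_dim, spa_dim))

    def forward(self, pred_feat, other_sem, other_spa, other_vis):
        obj_feat = torch.relu(self.fuse(torch.cat([other_sem, other_spa, other_vis], dim=-1)))
        x = torch.cat([pred_feat, obj_feat], dim=-1)
        return self.sem_head(x), self.spa_head(x)   # second semantic vector, second spatial vector
```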
In some embodiments, the second training module 42 determines a third sub-loss function according to the second spatial vector of the target subject, the spatial labeling result of the target subject, the second spatial vector of the target object, and the spatial labeling result of the target object; determines a fourth sub-loss function according to the second semantic vector of the target subject, the semantic annotation result of the target subject, the second semantic vector of the target object, and the semantic annotation result of the target object; and determines the third loss function according to the third sub-loss function and the fourth sub-loss function.
In some embodiments, the third sub-loss function is positively correlated with the sum of the deviation between the second spatial vector of the target subject and the spatial labeling result of the target subject and the deviation between the second spatial vector of the target object and the spatial labeling result of the target object. The fourth sub-loss function is negatively correlated with the sum of the cross entropy of the second semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the second semantic vector of the target object and the semantic annotation result of the target object.
In some embodiments, the third loss function is a weighted sum of the third sub-loss function and the fourth sub-loss function. For example, the third loss function is as shown in the above equation (10).
In some embodiments, the third training module 43 determines a second target loss function from the first, second, and third loss functions, with which the first and second machine learning models are trained.
In some embodiments, the second target loss function is a weighted sum of the first loss function, the second loss function, and the third loss function. For example, the second target loss function is as shown in the above equation (11).
In some embodiments, the first training module 41 extracts semantic, spatial, predicate, and visual features of relationship triples other than the target relationship triple from the sample image using the first machine learning model.
The second training module 42 performs inter-relationship reconstruction according to the semantic features, the spatial features, the predicate features, and the visual features of the other relationship triples, the spatial features of the target subject, and the spatial features of the target object by using the second machine learning model to obtain a third semantic vector of the target subject, a third semantic vector of the target object, and a predicate probability prediction distribution result, and determines a fourth loss function according to the inter-relationship reconstruction result.
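A minimal sketch of this inter-relationship branch follows, in which the features of the other triples are summarized by mean pooling before being combined with the spatial features of the target subject and object. The pooling choice, the single shared context vector, and all names and dimensions are assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

class InterRelationReconstruction(nn.Module):
    """Reconstructs the target subject/object semantics and the predicate from the
    remaining relationship triples plus the target pair's spatial features."""
    def __init__(self, ctx_dim=512, spa_dim=8, num_classes=150, num_predicates=50):
        super().__init__()
        self.sem_head = nn.Linear(ctx_dim + 2 * spa_dim, num_classes * 2)  # subject + object semantics
        self.pred_head = nn.Linear(ctx_dim + 2 * spa_dim, num_predicates)
        self.num_classes = num_classes

    def forward(self, other_triple_feats, subj_spa, obj_spa):
        # other_triple_feats: (num_other, ctx_dim) fused semantic/spatial/predicate/visual features
        ctx = other_triple_feats.mean(dim=0)                   # pool the other triples (assumption)
        x = torch.cat([ctx, subj_spa, obj_spa], dim=-1)
        sem = self.sem_head(x)
        subj_sem = sem[: self.num_classes]                     # third semantic vector of the subject
        obj_sem = sem[self.num_classes:]                       # third semantic vector of the object
        return subj_sem, obj_sem, self.pred_head(x)            # + predicate probability prediction
```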
In some embodiments, the second training module 42 determines a fifth sub-loss function according to the third semantic vector and the semantic annotation result of the target subject, the third semantic vector and the semantic annotation result of the target object, determines a sixth sub-loss function according to the predicate probability prediction distribution result and the predicate annotation result, and determines a fourth loss function according to the fifth sub-loss function and the sixth sub-loss function.
In some embodiments, the fifth sub-loss function is negatively correlated with the sum of the cross entropy of the third semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the third semantic vector of the target object and the semantic annotation result of the target object. The sixth sub-loss function is negatively correlated with the cross entropy of the predicate probability prediction distribution result and the predicate annotation result.
In some embodiments, the fourth loss function is a weighted sum of the fifth sub-loss function and the sixth sub-loss function. For example, the fourth loss function is as shown in the above equation (14).
In some embodiments, the third training module 43 determines a third target loss function from the first, second, third, and fourth loss functions, with which the first and second machine learning models are trained.
In some embodiments, the third target loss function is a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function. For example, the third objective loss function is as shown in equation (15) above.
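Whichever target loss function is used, a training step reduces to forming a weighted sum of the branch losses and back-propagating it through both models. A minimal sketch under that reading is given below; the shared optimizer and the treatment of the weights as free hyperparameters are assumptions.

```python
def train_step(optimizer, branch_losses, weights):
    """Combine the branch losses (first, second, third, fourth, ...) into a target
    loss as a weighted sum (cf. equations (7)/(11)/(15)) and update both models
    through a shared optimizer. branch_losses are assumed to be scalar tensors."""
    target_loss = sum(w * l for w, l in zip(weights, branch_losses))
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return float(target_loss.detach())
```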
Fig. 5 is a schematic structural diagram of a machine learning model training apparatus according to another embodiment of the present disclosure. As shown in fig. 5, the machine learning model training apparatus includes a memory 51 and a processor 52.
The memory 51 is used for storing instructions, the processor 52 is coupled to the memory 51, and the processor 52 is configured to execute the method according to any one of the embodiments in fig. 1-3 based on the instructions stored in the memory.
As shown in fig. 5, the machine learning model training apparatus further includes a communication interface 53 for information interaction with other devices. In addition, the machine learning model training apparatus further includes a bus 54, and the processor 52, the communication interface 53, and the memory 51 communicate with one another through the bus 54.
The memory 51 may include high-speed RAM and may also include non-volatile memory, such as at least one disk storage device. The memory 51 may also be a memory array. The memory 51 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules.
Further, the processor 52 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also relates to a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement a method according to any one of the embodiments in fig. 1-3.
Fig. 6 is a flowchart illustrating a visual relationship detection method according to an embodiment of the disclosure. In some embodiments, the following visual relationship detection method is performed by a visual relationship detection apparatus.
In step 601, the image to be processed is input into the first machine learning model, so that the first machine learning model outputs a predicate probability distribution result associated with the subject to be processed and the object to be processed in the relationship triple to be processed. The first machine learning model is obtained by training with the machine learning model training method according to any one of the embodiments in fig. 1 to fig. 3.
In step 602, a prediction is made based on the predicate probability distribution result to obtain the visual relationship between the subject to be processed and the object to be processed.
It should be noted that, because multi-granularity internal supervision signals are used in the training process, the trained first machine learning model can output visual relationship detection results with stable performance.
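For illustration, inference with the trained first model might look like the sketch below. The dictionary-style model output, the single unbatched candidate pair, and the argmax decoding are assumptions; the disclosure only specifies that the predicate probability distribution result is predicted to yield the visual relationship.

```python
import torch

@torch.no_grad()
def detect_visual_relationship(first_model, image, predicate_names):
    """Run the trained first model on an image to be processed and take the most
    probable predicate as the visual relationship between subject and object."""
    outputs = first_model(image)                        # assumed to return a dict of results
    probs = torch.softmax(outputs["predicate_logits"], dim=-1)  # unbatched logit vector assumed
    best = int(probs.argmax())
    return predicate_names[best], float(probs[best])
```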
Fig. 7 is a schematic structural diagram of a visual relationship detection apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the visual relationship detecting apparatus includes a first detecting module 71 and a second detecting module 72.
The first detection module 71 is configured to input the image to be processed into the first machine learning model, so that the first machine learning model outputs a predicate probability distribution result associated with the subject to be processed and the object to be processed in the relationship triple to be processed. The first machine learning model is trained by using the machine learning model training method according to any one of the embodiments of fig. 1 to 3.
The second detection module 72 is configured to make a prediction based on the predicate probability distribution result to obtain the visual relationship between the subject to be processed and the object to be processed.
Fig. 8 is a schematic structural diagram of a visual relationship detection apparatus according to another embodiment of the disclosure. As shown in fig. 8, the visual relationship detecting means includes a memory 81, a processor 82, a communication interface 83, and a bus 84. Fig. 8 differs from fig. 5 in that, in the embodiment shown in fig. 8, the processor 82 is configured to perform the method referred to in any of the embodiments of fig. 6 based on instructions stored in the memory.
The present disclosure also relates to a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement a method according to any one of the embodiments in fig. 6.
In some embodiments, the functional unit modules described above can be implemented as a general-purpose processor, a programmable logic controller (PLC), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any suitable combination thereof for performing the functions described in this disclosure.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure and its various embodiments with such modifications as are suited to the particular use contemplated.

Claims (29)

1. A machine learning model training method, comprising:
processing the sample image by using a first machine learning model to obtain the semantic features and the spatial features of a target subject, the semantic features and the spatial features of a target object, a predicate probability distribution result and the visual features of a target area comprising the target subject and the target object in a target relation triple;
determining a first loss function according to the predicate probability distribution result and the predicate annotation result;
predicting a first semantic vector of the target subject according to the spatial features of the target subject, predicting a first spatial vector of the target subject according to the semantic features of the target subject, predicting a first semantic vector of the target object according to the spatial features of the target object, and predicting a first spatial vector of the target object according to the semantic features of the target object by using a second machine learning model;
determining a second loss function according to the prediction result;
determining a first target loss function according to the first loss function and the second loss function;
training the first machine learning model and the second machine learning model using the first objective loss function.
2. The method of claim 1, wherein said determining a second loss function from the prediction comprises:
determining a first sub-loss function according to the first spatial vector of the target subject, the spatial labeling result of the target subject, the first spatial vector of the target object and the spatial labeling result of the target object;
determining a second sub-loss function according to the first semantic vector of the target subject, the semantic annotation result of the target subject, the first semantic vector of the target object and the semantic annotation result of the target object;
determining the second loss function according to the first sub-loss function and the second sub-loss function.
3. The method of claim 2, wherein,
the first sub-loss function is positively correlated with the sum of the deviation between the first spatial vector of the target subject and the spatial labeling result of the target subject and the deviation between the first spatial vector of the target object and the spatial labeling result of the target object;
the second sub-loss function is inversely related to the sum of the cross entropy of the first semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the first semantic vector of the target object and the semantic annotation result of the target object.
4. The method of claim 2, wherein,
the second loss function is a weighted sum of the first sub-loss function and the second sub-loss function.
5. The method of claim 1, wherein the predicting the first semantic vector of the target subject according to the spatial features of the target subject comprises:
fusing the spatial feature and the visual feature of the target subject to obtain a first fused feature;
compressing the first fusion feature to obtain a first compression feature;
and processing the first compression characteristic by utilizing a multilayer perceptron to obtain a first semantic vector of the target subject.
6. The method of claim 1, wherein the predicting the first spatial vector of the target subject from the semantic features of the target subject comprises:
and performing reconstruction processing by using the semantic features and the visual features of the target subject to obtain a first spatial vector of the target subject.
7. The method of claim 1, wherein the predicting the first semantic vector of the target object according to the spatial features of the target object comprises:
fusing the spatial feature and the visual feature of the target object to obtain a second fused feature;
compressing the second fusion feature to obtain a second compression feature;
and processing the second compression characteristic by utilizing a multilayer perceptron to obtain a first semantic vector of the target object.
8. The method of claim 1, wherein the predicting the first spatial vector of the target object according to the semantic features of the target object comprises:
and performing reconstruction processing by using the semantic features and the visual features of the target object to obtain a first spatial vector of the target object.
9. The method of claim 1, wherein,
the first loss function is negatively correlated with the cross entropy of the predicate probability distribution result and the predicate annotation result.
10. The method of claim 1, wherein,
the first target loss function is a weighted sum of the first loss function and the second loss function.
11. The method of claim 1, further comprising:
and determining the predicate characteristics of the target relation triple according to the predicate probability distribution result.
12. The method of claim 11, further comprising:
performing multi-mode fusion on the semantic features, the spatial features and the visual features of the target object by utilizing the second machine learning model to obtain first object features;
performing inter-object reconstruction by using the second machine learning model according to the predicate feature and the first object feature to obtain a second semantic vector and a second spatial vector of the target subject;
performing multi-mode fusion on the semantic features, the spatial features and the visual features of the target subject by using the second machine learning model to obtain second object features;
performing inter-object reconstruction by using the second machine learning model according to the predicate feature and the second object feature to obtain a second semantic vector and a second spatial vector of the target object;
determining a third loss function according to the reconstruction result between the objects;
determining a second target loss function according to the first loss function, the second loss function and the third loss function;
training the first machine learning model and the second machine learning model using the second objective loss function.
13. The method of claim 12, wherein said determining a third loss function from inter-object reconstruction results comprises:
determining a third sub-loss function according to the second spatial vector of the target subject, the spatial labeling result of the target subject, the second spatial vector of the target object and the spatial labeling result of the target object;
determining a fourth sub-loss function according to the second semantic vector of the target subject, the semantic annotation result of the target subject, the second semantic vector of the target object and the semantic annotation result of the target object;
determining the third loss function according to the third sub-loss function and the fourth sub-loss function.
14. The method of claim 13, wherein,
the third sub-loss function is positively correlated with the sum of the deviation between the second spatial vector of the target subject and the spatial labeling result of the target subject and the deviation between the second spatial vector of the target object and the spatial labeling result of the target object;
the fourth sub-loss function is inversely related to the sum of the cross entropy of the second semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the second semantic vector of the target object and the semantic annotation result of the target object.
15. The method of claim 13, wherein,
the third loss function is a weighted sum of the third sub-loss function and the fourth sub-loss function.
16. The method of claim 12, wherein,
the second target loss function is a weighted sum of the first, second, and third loss functions.
17. The method of claim 12, further comprising:
extracting semantic features, spatial features, predicate features and visual features of relation triples other than the target relation triples from the sample image by using the first machine learning model;
performing inter-relationship reconstruction by using the second machine learning model according to the semantic features, the spatial features, the predicate features and the visual features of the other relationship triples, the spatial features of the target subject and the spatial features of the target object to obtain a third semantic vector of the target subject, a third semantic vector of the target object and a predicate probability prediction distribution result;
determining a fourth loss function according to the inter-relationship reconstruction result;
determining a third target loss function according to the first loss function, the second loss function, the third loss function, and the fourth loss function;
training the first machine learning model and the second machine learning model with the third objective loss function.
18. The method of claim 17, wherein said determining a fourth loss function from the inter-relationship reconstruction results comprises:
determining a fifth sub-loss function according to the third semantic vector and the semantic annotation result of the target subject and the third semantic vector and the semantic annotation result of the target object;
determining a sixth sub-loss function according to the predicate probability prediction distribution result and the predicate annotation result;
determining the fourth loss function according to the fifth sub-loss function and the sixth sub-loss function.
19. The method of claim 18, wherein,
the fifth sub-loss function is in negative correlation with the sum of the cross entropy of the third semantic vector of the target subject and the semantic annotation result of the target subject and the cross entropy of the third semantic vector of the target object and the semantic annotation result of the target object;
the sixth sub-loss function is negatively correlated with the cross entropy of the predicate probability prediction distribution result and the predicate annotation result.
20. The method of claim 19, wherein,
the fourth loss function is a weighted sum of the fifth sub-loss function and the sixth sub-loss function.
21. The method of claim 17, wherein,
the third target loss function is a weighted sum of the first loss function, the second loss function, the third loss function, and the fourth loss function.
22. The method of any of claims 1-21, wherein the processing the sample image with the first machine learning model comprises:
processing the sample image by utilizing a first machine learning model to obtain the semantic features and the spatial features of a target subject, the semantic features and the spatial features of a target object and the visual features of the target area in a target relation triple;
and determining the predicate probability distribution result by utilizing the semantic features and the spatial features of the target subject, the semantic features and the spatial features of the target object and the visual features of the target area.
23. The method of claim 22, wherein the determining the predicate probability distribution result comprises:
fusing the semantic features and the spatial features of the target subject to obtain a third fused feature;
compressing the visual features of the target area to obtain a third compressed feature;
fusing the third fused feature and the third compressed feature to obtain a fourth fused feature;
and processing the fourth fused feature by using a multilayer perceptron to obtain the predicate probability distribution result.
24. A machine learning model training apparatus comprising:
the first training module is configured to process a sample image by using a first machine learning model to obtain a semantic feature and a spatial feature of a target subject, a semantic feature and a spatial feature of a target object, a predicate probability distribution result and a visual feature of a target area including the target subject and the target object in a target relationship triple, and determine a first loss function according to the predicate probability distribution result and the predicate annotation result;
a second training module configured to predict a first semantic vector of the target subject according to the spatial features of the target subject, predict a first spatial vector of the target subject according to the semantic features of the target subject, predict a first semantic vector of the target object according to the spatial features of the target object, predict a first spatial vector of the target object according to the semantic features of the target object, and determine a second loss function according to a prediction result by using a second machine learning model;
a third training module configured to determine a first target loss function from the first and second loss functions, train the first and second machine learning models with the first target loss function.
25. A machine learning model training apparatus, comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor being configured to implement the method of any one of claims 1 to 23 based on instructions stored by the memory.
26. A visual relationship detection method, comprising:
inputting an image to be processed into a first machine learning model so that the first machine learning model outputs a predicate probability distribution result associated with a subject to be processed and an object to be processed in a relation triple to be processed, wherein the first machine learning model is obtained by training through a machine learning model training method of any one of claims 1 to 23;
and predicting the predicate probability distribution result to obtain the visual relationship between the subject to be processed and the object to be processed.
27. A visual relationship detection apparatus comprising:
a first detection module configured to input an image to be processed into a first machine learning model so that the first machine learning model outputs a predicate probability distribution result associated with a subject to be processed and an object to be processed in a relation triple to be processed, wherein the first machine learning model is trained by the machine learning model training method according to any one of claims 1 to 23;
and the second detection module is configured to predict the predicate probability distribution result so as to obtain the visual relationship between the subject to be processed and the object to be processed.
28. A visual relationship detection apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor being configured to implement the method of claim 26 based on instructions stored by the memory.
29. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions which, when executed by a processor, implement the method of any one of claims 1-23, 26.