CN116977105A - Method, apparatus, device, storage medium and program product for determining a tweet label - Google Patents

Method, apparatus, device, storage medium and program product for determining a tweet label

Info

Publication number
CN116977105A
CN116977105A (application number CN202210404503.XA)
Authority
CN
China
Prior art keywords
text
image
features
feature extraction
extraction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210404503.XA
Other languages
Chinese (zh)
Inventor
李菁
徐纯蒲
刘乐茂
史树明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210404503.XA priority Critical patent/CN116977105A/en
Publication of CN116977105A publication Critical patent/CN116977105A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and an apparatus for determining a tweet label, and belongs to the technical field of machine learning. The method comprises the following steps: acquiring a target image and a target text in a target tweet; obtaining an image description text based on the target image and an image description text generation model; obtaining image features based on the target image and an image feature extraction model, obtaining text features based on the target text and a first text feature extraction model, and obtaining image description text features based on the image description text and a second text feature extraction model; obtaining image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model; and determining the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model. By adopting the method and the apparatus, the efficiency of determining tweet labels can be improved and the cost can be reduced.

Description

Method, apparatus, device, storage medium and program product for determining a tweet label
Technical Field
The present application relates to the field of machine learning, and in particular, to a method, apparatus, device, storage medium and program product for determining a tweet label.
Background
With the development of social media, more and more people publish tweets on social applications to share their lives, and a social application can analyze a tweet published by a user to determine the label of the tweet. Tweet labels may include emotion labels, category labels, topic labels, and the like. An emotion label can be used to represent the user emotion reflected by the tweet, such as positive or negative; a category label can be used to represent the vertical domain to which the tweet belongs, such as "games" or "movies"; a topic label can be used to represent a hot topic associated with the tweet, such as "#a worker's new day#" or "#postgraduate exam results#".
Currently, tweet labels are typically determined manually by staff of the social application.
However, determining tweet labels manually is inefficient and costly.
Disclosure of Invention
The application provides a method, an apparatus, a device, a storage medium and a program product for determining a tweet label, which can solve the problems of low efficiency and high cost caused by manually determining tweet labels.
In a first aspect, a method for determining a tweet label is provided, the method comprising: acquiring a target image and a target text in a target tweet; obtaining an image description text based on the target image and an image description text generation model; obtaining image features based on the target image and an image feature extraction model, obtaining text features based on the target text and a first text feature extraction model, and obtaining image description text features based on the image description text and a second text feature extraction model; obtaining image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model; and determining the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model.
In one possible implementation, the image feature extraction model includes a convolutional neural network CNN and a fully-connected network, and the first text feature extraction model and the second text feature extraction model each include a Bi-directional long-short-term memory Bi-LSTM network and a fully-connected network.
In one possible implementation, the obtaining image-text relationship features based on the image features, the text features, the image description text features and the image-text relationship feature extraction model includes: obtaining image attention features based on the image features, the text features and an image attention feature extraction model; obtaining image description text attention features based on the image description text features, the text features and an image description text attention feature extraction model; and obtaining the image-text relationship features based on the text features, the image attention features, the image description text attention features and the image-text relationship feature extraction model.
In one possible implementation, the image attention feature extraction model and the image description text attention feature extraction model are both multi-head attention mechanism models.
In one possible implementation, before acquiring the target image and the target text in the target tweet, the method further includes: obtaining a sample relationship type of a sample image and a sample text, wherein the sample image and the sample text belong to the same sample tweet; obtaining a predicted image description text based on the sample image and an image description text generation model to be trained; obtaining predicted image features based on the sample image and an image feature extraction model to be trained, obtaining predicted text features based on the sample text and a first text feature extraction model to be trained, and obtaining predicted image description text features based on the predicted image description text and a second text feature extraction model to be trained; obtaining predicted image-text relationship features based on the predicted image features, the predicted text features, the predicted image description text features and an image-text relationship feature extraction model to be trained; determining a predicted relationship type of the sample image and the sample text based on the predicted image-text relationship features; and training and adjusting parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the sample relationship type and the predicted relationship type of the sample image and the sample text.
In one possible implementation, the training and adjusting parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the sample relationship type and the predicted relationship type of the sample image and the sample text includes: determining a target loss value based on the sample relationship type of the sample image and the sample text, the predicted relationship type and a weighted cross entropy loss function; and training and adjusting parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the target loss value.
In a second aspect, an apparatus for determining a tweet label is provided, the apparatus comprising: an acquiring module, configured to acquire a target image and a target text in a target tweet; and a determining module, configured to obtain an image description text based on the target image and an image description text generation model; obtain image features based on the target image and an image feature extraction model, obtain text features based on the target text and a first text feature extraction model, and obtain image description text features based on the image description text and a second text feature extraction model; obtain image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model; and determine the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model.
In one possible implementation, the image feature extraction model includes a convolutional neural network CNN and a fully-connected network, and the first text feature extraction model and the second text feature extraction model each include a Bi-directional long-short-term memory Bi-LSTM network and a fully-connected network.
In one possible implementation, the determining module is configured to obtain image attention features based on the image features, the text features and an image attention feature extraction model; obtain image description text attention features based on the image description text features, the text features and an image description text attention feature extraction model; and obtain the image-text relationship features based on the text features, the image attention features, the image description text attention features and the image-text relationship feature extraction model.
In one possible implementation, the image attention feature extraction model and the image description text attention feature extraction model are both multi-head attention mechanism models.
In one possible implementation, the acquiring module is further configured to obtain a sample relationship type of a sample image and a sample text, wherein the sample image and the sample text belong to the same sample tweet; and the determining module is further configured to obtain a predicted image description text based on the sample image and an image description text generation model to be trained; obtain predicted image features based on the sample image and an image feature extraction model to be trained, obtain predicted text features based on the sample text and a first text feature extraction model to be trained, and obtain predicted image description text features based on the predicted image description text and a second text feature extraction model to be trained; obtain predicted image-text relationship features based on the predicted image features, the predicted text features, the predicted image description text features and an image-text relationship feature extraction model to be trained; determine a predicted relationship type of the sample image and the sample text based on the predicted image-text relationship features; and train and adjust parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the sample relationship type and the predicted relationship type of the sample image and the sample text.
In one possible implementation, the determining module is configured to determine a target loss value based on the sample relationship type of the sample image and the sample text, the predicted relationship type and a weighted cross entropy loss function; and train and adjust parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the target loss value.
In a third aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory for storing computer instructions, the processor for executing the computer instructions stored by the memory to cause the computer device to perform the method of the first aspect and possible implementations thereof.
In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing computer program code which, when executed by a computer device, performs the method of the first aspect and possible implementations thereof.
In a fifth aspect, a computer program product is provided, the computer program product comprising computer program code for, when executed by a computer device, performing the method of the first aspect and possible implementations thereof.
In the embodiment of the application, the label of the target tweet is determined based on the target image and the target text in the target tweet together with a series of machine learning models. The work is completed by machine learning models rather than by staff manually determining the tweet label, so the efficiency of determining tweet labels is improved and the cost is reduced. In addition, in this embodiment the image description text is obtained by the image description text generation model, and the image description text features are obtained by the second text feature extraction model. The image description text features can be better aligned with the text features in the feature space, so the image-text relationship features obtained based on the image description text features, the text features and the image features can better reflect the relationship between the image and the text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a computer device according to an embodiment of the present application;
FIG. 2 is a flowchart of determining a tweet label according to an embodiment of the present application;
FIG. 3 is a flowchart of determining a tweet label according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process for determining a tweet label according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a target image according to an embodiment of the present application;
FIG. 6 is a schematic diagram of image features used as inputs according to an embodiment of the present application;
FIG. 7 is a schematic diagram of image description text features used as inputs according to an embodiment of the present application;
FIG. 8 is a flowchart of training machine learning models according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an apparatus for determining a tweet label according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
A user can edit a tweet on a client of a social application and then click a publish control; the social application client sends the tweet to a server of the social application, and the server can analyze the tweet to determine its label. The tweet label has many possible uses: for example, the tweet may be pushed to users interested in the label, or the label may be added to the tweet so that other users can find the tweet by searching for the label, and so on.
The embodiment of the application provides a method for determining a text label, which can be realized by computer equipment. The computer device may be a server or a terminal or the like. The terminal may be a desktop computer, notebook computer, tablet computer, cell phone, etc. The server may be a single server or a server group formed by a plurality of servers.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present application. From a hardware perspective, the structure of a computer device 100 may be as shown in fig. 1, including a processor 101, a memory 102, and a communication component 103.
The processor 101 may be a central processing unit (CPU), a system on chip (SoC), or the like. The processor 101 may be configured to obtain the image description text based on the target image and the image description text generation model, obtain the image-text relationship features based on the image features, the text features, the image description text features and the image-text relationship feature extraction model, determine the label of the target tweet based on the image features, the text features, the image-text relationship features and the label analysis model, and so on.
The memory 102 may include various volatile or non-volatile memories, such as a solid state disk (SSD), dynamic random access memory (DRAM), and the like. The memory 102 may be used to store pre-stored data, intermediate data and result data involved in determining the tweet label, for example, the target image and the target text in the target tweet, the image description text, the image features, the text features, the image description text features, the label of the target tweet, and the like.
In addition to the processor 101, the memory 102, the computer device 100 may also comprise a communication component 103.
The communication component 103 may be a wired network connector, a wireless fidelity (WiFi) module, a Bluetooth module, a cellular network communication module, or the like. The communication component 103 may be used for data transmission with other devices, which may be servers or terminals. For example, the computer device 100 may receive the target image and the target text in the target tweet, and the computer device 100 may also send the label of the target tweet to a server for storage.
Fig. 2 is a flowchart of a process for determining a tweet label according to an embodiment of the present application. Referring to fig. 2, the process may include the following steps:
201, acquiring a target image and a target text in a target tweet.
202, obtaining an image description text based on the target image and an image description text generation model.
203, obtaining image features based on the target image and an image feature extraction model, obtaining text features based on the target text and a first text feature extraction model, and obtaining image description text features based on the image description text and a second text feature extraction model.
204, obtaining image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model.
205, determining the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model.
Fig. 3 is a flowchart of a process for determining a tweet label according to an embodiment of the present application, and fig. 4 is a schematic diagram of the process. Referring to fig. 3, the process may include the following steps:
301, acquiring a target image and a target text in a target tweet.
302, obtaining an image description text based on the target image and an image description text generation model.
The image description text generation model may be a machine learning model, and the type of the image description text generation model may be various, for example, a recurrent neural network (recurrent neural network, RNN) model, a Bottom-Up and Top-Down model, and the like.
In implementations, the target image may be input into the image description text generation model, which may output the image description text of the target image. The image description text may be used to describe the meaning of the target image. For example, if the target image is as shown in fig. 5 and expresses the meaning "a dog is swimming", the image description text may be "a dog is swimming".
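For illustration, a minimal sketch of an image description text generation model is given below. It is not the model used in this application: the toy CNN encoder, the GRU decoder, the vocabulary size and the greedy decoding loop are all illustrative assumptions; in practice a pretrained captioning model such as an RNN-based or Bottom-Up and Top-Down model would be used.

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                  # toy CNN encoder for the target image
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden_dim))
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def generate(self, image, bos_id=1, eos_id=2, max_len=20):
        h = self.encoder(image).unsqueeze(0)           # image feature as the initial hidden state
        token = torch.full((image.size(0), 1), bos_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            out, h = self.decoder(self.embed(token), h)
            token = self.out(out[:, -1]).argmax(-1, keepdim=True)   # greedy decoding
            caption.append(token)
            if (token == eos_id).all():
                break
        return torch.cat(caption, dim=1)               # token ids of the image description text

model = CaptionGenerator(vocab_size=10000)
caption_ids = model.generate(torch.randn(1, 3, 224, 224))  # would be detokenized into a sentence
```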
303, obtaining image features based on the target image and the image feature extraction model, obtaining text features based on the target text and the first text feature extraction model, and obtaining image description text features based on the image description text and the second text feature extraction model.
The image feature extraction model may include a convolutional neural network (convolutional neural network, CNN) and a fully-connected network, and the first text feature extraction model and the second text feature extraction model may each include a two-way long-short-term memory (bidirectional long-short term memory, bi-LSTM) network and a fully-connected network.
In implementations, the target image may be input into an image feature extraction model, which may output image features of the target image, which may be used to reflect the composite characteristics of the target image, including the visual characteristics of the target image. The image feature extraction model can comprise a plurality of full-connection layer networks, the convolution neural network in the image feature extraction model can output intermediate image features of the target image, and the intermediate image features can be input into the plurality of full-connection layer networks to obtain a plurality of image features. The network parameters of the plurality of full-connection layer networks may be different, and the plurality of image features may be different from each other.
The target text may be input into a first text feature extraction model, which may output text features of the target text, which may be used to reflect comprehensive characteristics of the target text, including characteristics between contexts of the target text. The first text feature extraction model may include a plurality of full-connection layer networks, the two-way long-short-term memory network in the first text feature extraction model may output intermediate text features of the target text, and the intermediate text features may be input into the plurality of full-connection layer networks to obtain a plurality of text features. The network parameters of the plurality of full-connectivity layer networks may be different and the plurality of text features may be different from one another.
The image description text may be input into a second text feature extraction model, which may output image description text features, which may be used to reflect the integrated characteristics of the image description text, including characteristics between the contexts of the image description text. The second text feature extraction model may include a plurality of fully-connected layer networks, the two-way long-short-term memory network in the second text feature extraction model may output intermediate image description text features of the image description text, and the intermediate image description text features may be input into the plurality of fully-connected layer networks to obtain a plurality of image description text features. The network parameters of the plurality of full-connection layer networks may be different, and the plurality of image description text features may be different from each other.
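The sketch below illustrates the two kinds of feature extraction models described above: a CNN backbone followed by several fully-connected networks for the image, and a Bi-LSTM followed by several fully-connected networks for the text (the second text feature extraction model that processes the image description text would follow the same structure). Layer sizes, the pooling choices and the number of heads are assumptions, not values taken from this application.

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        self.cnn = nn.Sequential(                      # backbone producing an intermediate image feature
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # one fully-connected network per output; different parameters give different image features
        self.heads = nn.ModuleList([nn.Linear(128, feat_dim) for _ in range(num_heads)])

    def forward(self, image):
        mid = self.cnn(image)                          # intermediate image feature
        return [head(mid) for head in self.heads]      # several distinct image features

class TextFeatureExtractor(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128,
                 feat_dim=256, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.heads = nn.ModuleList([nn.Linear(2 * hidden_dim, feat_dim) for _ in range(num_heads)])

    def forward(self, token_ids):
        out, _ = self.bilstm(self.embed(token_ids))    # intermediate text features per token
        pooled = out.mean(dim=1)                       # simple pooling over the sequence (assumption)
        return [head(pooled) for head in self.heads]   # several distinct text features

img_feats = ImageFeatureExtractor()(torch.randn(1, 3, 224, 224))
txt_feats = TextFeatureExtractor()(torch.randint(0, 10000, (1, 12)))
```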
304, obtaining the image attention feature based on the image feature, the text feature and the image attention feature extraction model.
The image Attention feature extraction model may be a machine learning model, and the type of the image Attention feature extraction model is various, for example, a Multi-head Attention (Multi-head Attention) mechanism model. The image attention feature extraction model may be used to extract image attention features, which may be used to reflect characteristics of the related content between the image features and the text features.
In implementations, the image features and the text features may be input into the image attention feature extraction model, which may output the image attention features.
Taking an example that the image attention feature extraction model is a multi-head attention mechanism model, the number of attention heads of the model is n, the number of full connection layers included in the image feature extraction model is 2n corresponding to the number of attention heads of the model, and the number of full connection layers included in the first text feature extraction model is n.
The inputs of the multi-head attention mechanism model include a query (abbreviated as Q), a key (abbreviated as K), and a value (abbreviated as V). The text features may be set as Q and the image features as K and V; referring to fig. 6, the image features used as K and the image features used as V are different from each other. The computation of the multi-head attention mechanism model may be expressed as:

MA = [head_1, head_2, \ldots, head_n] W_o, \quad head_i = \theta\left(\frac{(Q_i W_i^{Q})(K_i W_i^{K})^{T}}{\sqrt{d_k}}\right)(V_i W_i^{V})

where MA represents the image attention feature output by the multi-head attention mechanism model, [·] represents the concatenation of the vectors in brackets, θ(·) represents the softmax activation function, Q_i represents the i-th Q (i.e. text feature), W_i^{Q} represents the network parameters of the fully-connected network outputting the i-th Q, K_i represents the i-th K, W_i^{K} represents the network parameters of the fully-connected network outputting the i-th K, (·)^T represents the matrix transpose, d_k represents the normalization factor, V_i represents the i-th V, W_i^{V} represents the network parameters of the fully-connected network outputting the i-th V, and W_o represents the model parameters of the multi-head attention mechanism model.

Q, K and V are input into the image attention feature extraction model, which can output the image attention features.
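A minimal sketch of this attention step is shown below: the n text features serve as Q, the 2n image features serve as K and V, each head performs scaled dot-product attention, and the concatenated heads are projected by W_o. The per-head projection layers, sequence lengths and dimensions are illustrative assumptions. The image description text attention feature extraction model in step 305 below can follow the same pattern, with the image description text features taking the place of the image features as K and V.

```python
import math
import torch
import torch.nn as nn

class MultiHeadCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.w_q = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_heads)])
        self.w_k = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_heads)])
        self.w_v = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_heads)])
        self.w_o = nn.Linear(num_heads * dim, dim)
        self.scale = math.sqrt(dim)                    # normalization factor d_k

    def forward(self, q_list, k_list, v_list):
        # q_list: n text features as Q; k_list / v_list: n image features each as K and V,
        # every feature of shape (batch, seq_len, dim)
        heads = []
        for q, k, v, wq, wk, wv in zip(q_list, k_list, v_list, self.w_q, self.w_k, self.w_v):
            scores = wq(q) @ wk(k).transpose(-2, -1) / self.scale   # (batch, Lq, Lk)
            heads.append(torch.softmax(scores, dim=-1) @ wv(v))     # (batch, Lq, dim)
        return self.w_o(torch.cat(heads, dim=-1))                   # image attention feature

attn = MultiHeadCrossAttention()
text_q = [torch.randn(1, 8, 256) for _ in range(4)]     # n text features as Q
img_kv = [torch.randn(1, 16, 256) for _ in range(8)]    # 2n image features as K and V
image_attention_feature = attn(text_q, img_kv[:4], img_kv[4:])
```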
305, obtaining the image description text attention feature based on the image description text feature, the text feature and the image description text attention feature extraction model.
The image description text attention feature extraction model may be a machine learning model, and various types are possible, for example, a multi-head attention mechanism model. The image description text attention feature extraction model may be used to extract image description text attention features, which may be used to reflect the characteristics of the related content between the image description text features and the text features.
In implementations, the image description text features and the text features may be input into the image description text attention feature extraction model, which may output the image description text attention features.
Taking an example that the image description text attention feature extraction model is a multi-head attention mechanism model, the number of attention heads of the model is n, the number of full connection layers included in the second text feature extraction model is 2n corresponding to the number of attention heads of the model, and the number of full connection layers included in the first text feature extraction model is n.
The text features may be set as Q and the image description text features as K and V; referring to fig. 7, the image description text features used as K and the image description text features used as V are different from each other. The computation of the multi-head attention mechanism model may be expressed as:

MA = [head_1, head_2, \ldots, head_n] W_o, \quad head_i = \theta\left(\frac{(Q_i W_i^{Q})(K_i W_i^{K})^{T}}{\sqrt{d_k}}\right)(V_i W_i^{V})

where MA represents the image description text attention feature output by the multi-head attention mechanism model, Q_i represents the i-th Q (i.e. text feature), K_i and V_i represent the i-th K and the i-th V (i.e. image description text features), W_i^{Q}, W_i^{K} and W_i^{V} represent the network parameters of the fully-connected networks outputting the i-th Q, K and V respectively, θ(·) represents the softmax activation function, d_k represents the normalization factor, [·] represents the concatenation of the vectors in brackets, (·)^T represents the matrix transpose, and W_o represents the model parameters of the multi-head attention mechanism model.

Q, K and V are input into the image description text attention feature extraction model, which can output the image description text attention features.
306, extracting a model based on the text feature, the image attention feature, the image description text attention feature and the image-text relation feature to obtain the image-text relation feature.
The image-text relation feature extraction model can be a machine learning model, and the type of the image-text relation feature extraction model is various, for example, a full-connection network model, a multi-layer perceptron model, or the like. The image-text relation feature extraction model can be used for extracting image-text relation features, and the image-text relation features can be used for reflecting relation features between the target image and the target text.
In implementations, the text features, the image attention features and the image description text attention features can be concatenated, and the concatenated feature vector is input into the image-text relationship feature extraction model, which can output the image-text relationship features.
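For illustration, a minimal sketch of this step, assuming a simple two-layer fully-connected network as the image-text relationship feature extraction model and 256-dimensional features (both assumptions, not values from this application), could be:

```python
import torch
import torch.nn as nn

# hypothetical image-text relationship feature extraction model: concatenation + small MLP
relation_extractor = nn.Sequential(
    nn.Linear(3 * 256, 512), nn.ReLU(),
    nn.Linear(512, 256))

text_feature = torch.randn(1, 256)                 # text feature
image_attention_feature = torch.randn(1, 256)      # image attention feature
caption_attention_feature = torch.randn(1, 256)    # image description text attention feature

relation_feature = relation_extractor(
    torch.cat([text_feature, image_attention_feature, caption_attention_feature], dim=-1))
```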
307, determining the label of the target tweet based on the image features, the text features, the image-text relationship features and the label analysis model.
In implementations, the image features, the text features and the image-text relationship features may be input into the label analysis model, which may output the label of the target tweet. The label of the target tweet may be of various types, for example, an emotion label, a category label, a topic label, and so on. When the label type of the target tweet is an emotion label, the label can be positive, negative, neutral, happy, sad, or the like; when the label type is a category label, the label can be game, movie, current affairs, finance, or the like; when the label type is a topic label, the label can be "#a worker's new day#", "#postgraduate exam results#", "#the first snow of 2022#", or the like.
After the server determines the label of the target tweet, the label may be added to the target tweet.
Some machine learning models are involved in the above processing procedure, and an embodiment of the present application provides a training method for a machine learning model, as shown in fig. 8, the method may include the following steps:
801, a sample image, a sample text, and a sample relationship type of the sample image and the sample text are acquired.
The sample image and the sample text belong to the same sample tweet.
802, obtaining a predicted image description text based on the sample image and the image description text generation model to be trained.
In implementations, the sample image may be input into the image description text generation model to be trained, which may output the predicted image description text of the sample image.
803, obtaining predicted image features based on the sample image and the image feature extraction model to be trained, obtaining predicted text features based on the sample text and the first text feature extraction model to be trained, and obtaining predicted image description text features based on the predicted image description text and the second text feature extraction model to be trained.
The image feature extraction model to be trained can comprise a convolutional neural network to be trained and a fully-connected network, and the first text feature extraction model to be trained and the second text feature extraction model to be trained can comprise a two-way long-short-term memory network to be trained and a fully-connected network to be trained.
In implementations, the sample image may be input to an image feature extraction model to be trained, which may output predicted image features of the sample image. The sample text may be input into a first text feature extraction model to be trained, which may output predicted text features of the sample text. The predictive image description text may be input into a second text feature extraction model to be trained, which may output predictive image description text features.
804, obtaining predicted image-text relationship features based on the predicted image features, the predicted text features, the predicted image description text features and the image-text relationship feature extraction model to be trained.
In implementations, the predicted image features and the predicted text features may be input into an image attention feature extraction model to be trained, which may output predicted image attention features. The predicted image description text features and the predicted text features may be input into an image description text attention feature extraction model to be trained, which may output predicted image description text attention features.
And then, inputting the predicted text feature, the predicted image attention feature and the predicted image description text attention feature into a to-be-trained image-text relation feature extraction model to obtain the predicted image-text relation feature.
And 805, determining the predicted relationship type of the sample image and the sample text based on the predicted image-text relationship features.
In implementation, the predicted teletext feature may be input into a relationship classification function to obtain a confidence level for each candidate relationship type of the sample text. The type of relationship classification function may be a variety of possibilities, such as a softmax function.
Candidate relationship types may include both "include" and "expand" relationship types. The "include" relationship type indicates that the content of the image is contained in the meaning expressed by the text and does not exceed the range of that meaning. For example, if the content of the image is a picture of a dog swimming and the text expresses the meaning "a golden retriever is swimming in a pond", the content of the image does not exceed the meaning expressed by the text, and the image-text relationship type is an "include" relationship type.
The "expand" relationship type indicates that the content of the image goes beyond the meaning expressed by the text, i.e., the image expresses meaning not expressed by the text. For example, if the content of the image is "a dog swimming" and the text expresses the meaning "what is the dog doing", the image expresses the meaning "the dog is swimming" that the text does not express; the image and the text together express "the dog is swimming", so the image expands the meaning expressed by the text, and the image-text relationship type is an "expand" relationship type.
The "include" relationship type may be further divided into an "entity-containing" relationship type and an "event-containing" relationship type. An "entity-containing" relationship type indicates that the content of the image is one or more entities in the meaning expressed by the text, i.e., the image embodies an entity expressed by the text. For example, the text expresses the meaning "a golden retriever is swimming", the content of the image is only "a photograph of a golden retriever", and the image embodies the entity "golden retriever" expressed in the text. An "event-containing" relationship type indicates that the content of the image is one or more events in the meaning expressed by the text, i.e., the image embodies an event expressed by the text. For example, the text expresses the meaning "a golden retriever is swimming", the content of the image is "a photograph of a golden retriever swimming", and the image embodies the event "the golden retriever swimming" expressed in the text.
The "expand" relationship type may be further divided into an "entity-expanding" relationship type and an "event-expanding" relationship type. An "entity-expanding" relationship type indicates that the content of the image is one or more entities outside the meaning expressed by the text, i.e., the image expands an entity expressed by the text. For example, the text expresses the meaning "who wouldn't love it", the content of the image is "a photograph of a golden retriever", and the image expands the entity referred to in the text into "golden retriever". An "event-expanding" relationship type indicates that the content of the image is one or more events outside the meaning expressed by the text, i.e., the image expands an event expressed by the text. For example, the text expresses the meaning "what is the golden retriever doing", the content of the image is a photograph of "a golden retriever swimming", and the image expands the "doing what" event in the text into "swimming".
Based on the above description, candidate relationship types may include an "entity-containing" relationship type, an "event-containing" relationship type, an "entity-expanding" relationship type, and an "event-expanding" relationship type. The candidate relationship type with the highest confidence level may be selected as the predicted relationship type.
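A minimal sketch of the relationship classification step, assuming a single linear layer followed by softmax as the relationship classification function and a 256-dimensional relationship feature (the linear layer and the dimension are assumptions), could be:

```python
import torch
import torch.nn as nn

candidate_types = ["entity-containing", "event-containing", "entity-expanding", "event-expanding"]

classifier = nn.Linear(256, len(candidate_types))       # hypothetical relationship classification head
relation_feature = torch.randn(1, 256)                  # predicted image-text relationship feature
confidence = torch.softmax(classifier(relation_feature), dim=-1)   # confidence of each candidate type
predicted_type = candidate_types[confidence.argmax(-1).item()]     # most confident type is the prediction
```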
806, training and parameter adjustment is performed on the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relation feature extraction model to be trained based on the sample relation type and the prediction relation type of the sample image and the sample text.
In implementations, the sample relationship type and the prediction relationship type may be input into a loss function to obtain the target loss value, and the loss function may be various types of loss functions, such as a square loss function (quadratic loss function), an absolute value loss function (absolute loss function), a cross entropy loss function (cross entropy function), a weighted cross entropy loss function (weighted cross entropy function), and so on. Then, parameter adjustment can be performed on the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relation feature extraction model to be trained based on the target loss value.
Alternatively, the sample relationship types of a plurality of sample tweets and the corresponding predicted relationship types may be input together into a weighted cross entropy loss function, and the weighted cross entropy loss function may output a loss value corresponding to each relationship type. Then, the loss value corresponding to each relationship type may be multiplied by the weight corresponding to that relationship type to obtain a weighted loss value for each relationship type, and the weighted loss values may be added to obtain the target loss value. The weight corresponding to each relationship type is inversely related to the proportion of predictions of that relationship type among all predicted relationship types. Parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained are then adjusted based on the target loss value.
For example, the sample relationship types of 100 sample tweets and the corresponding predicted relationship types can be input into a weighted cross entropy loss function to obtain loss values corresponding to the "entity-containing", "event-containing", "entity-expanding" and "event-expanding" relationship types. If, among the predicted relationship types corresponding to the 100 sample tweets, 40 are "entity-containing", 30 are "event-containing", 10 are "entity-expanding" and 20 are "event-expanding", the corresponding weights can be 3, 4, 12 and 6 respectively. Then, the loss value corresponding to each relationship type may be multiplied by the weight corresponding to that relationship type to obtain a weighted loss value for each relationship type, and the weighted loss values may be added to obtain the target loss value. Parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained are then adjusted based on the target loss value.
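The sketch below illustrates the weighted cross-entropy computation under the example above: per-class weights are taken inversely proportional to the prediction proportions (the scaling constant 1.2 is an assumption chosen only so that the counts 40/30/10/20 yield the example weights 3, 4, 12 and 6), and the weighted loss drives a backward pass. The dummy logits and shapes are illustrative.

```python
import torch
import torch.nn as nn

# predicted-type counts for the four relationship types, as in the example above:
# entity-containing, event-containing, entity-expanding, event-expanding
counts = torch.tensor([40.0, 30.0, 10.0, 20.0])
proportions = counts / counts.sum()
weights = 1.2 / proportions                       # inversely related to the proportions: 3, 4, 12, 6

logits = torch.randn(100, 4, requires_grad=True)  # dummy model outputs for 100 sample tweets
targets = torch.randint(0, 4, (100,))             # sample relationship types of the 100 tweets
loss = nn.CrossEntropyLoss(weight=weights)(logits, targets)   # weighted cross entropy -> target loss
loss.backward()  # in training this would back-propagate through all five models being trained
```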
Fig. 9 shows an apparatus for determining a tweet label according to an embodiment of the present application, the apparatus comprising: an acquiring module 901, configured to acquire a target image and a target text in a target tweet; and a determining module 902, configured to obtain an image description text based on the target image and an image description text generation model; obtain image features based on the target image and an image feature extraction model, obtain text features based on the target text and a first text feature extraction model, and obtain image description text features based on the image description text and a second text feature extraction model; obtain image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model; and determine the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model.
In one possible implementation, the image feature extraction model includes a convolutional neural network CNN and a fully-connected network, and the first text feature extraction model and the second text feature extraction model each include a Bi-directional long-short-term memory Bi-LSTM network and a fully-connected network.
In one possible implementation, the determining module 902 is configured to obtain image attention features based on the image features, the text features and an image attention feature extraction model; obtain image description text attention features based on the image description text features, the text features and an image description text attention feature extraction model; and obtain the image-text relationship features based on the text features, the image attention features, the image description text attention features and the image-text relationship feature extraction model.
In one possible implementation, the image attention feature extraction model and the image description text attention feature extraction model are both multi-head attention mechanism models.
In one possible implementation, the acquiring module 901 is further configured to obtain a sample relationship type of a sample image and a sample text, wherein the sample image and the sample text belong to the same sample tweet; and the determining module 902 is further configured to obtain a predicted image description text based on the sample image and an image description text generation model to be trained; obtain predicted image features based on the sample image and an image feature extraction model to be trained, obtain predicted text features based on the sample text and a first text feature extraction model to be trained, and obtain predicted image description text features based on the predicted image description text and a second text feature extraction model to be trained; obtain predicted image-text relationship features based on the predicted image features, the predicted text features, the predicted image description text features and an image-text relationship feature extraction model to be trained; determine a predicted relationship type of the sample image and the sample text based on the predicted image-text relationship features; and train and adjust parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the sample relationship type and the predicted relationship type of the sample image and the sample text.
In one possible implementation, the determining module 902 is configured to determine a target loss value based on the sample relationship type of the sample image and the sample text, the predicted relationship type and a weighted cross entropy loss function; and train and adjust parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the target loss value.
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device 1000 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 1001 and one or more memories 1002, where at least one instruction is stored in the memories 1002, and the at least one instruction is loaded and executed by the processors 1001 to implement the methods provided in the foregoing method embodiments. Of course, the computer device may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, for example, a memory comprising instructions executable by a processor in a terminal to perform the method of determining a tweet label in the above embodiments. The computer readable storage medium may be non-transitory. For example, the computer readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is illustrative of the present application and is not to be construed as limiting thereof, but rather as various modifications, equivalent arrangements, improvements, etc., which fall within the spirit and principles of the present application.

Claims (10)

1. A method of determining a tweet label, the method comprising:
acquiring a target image and a target text in a target tweet;
obtaining an image description text based on the target image and an image description text generation model;
obtaining image features based on the target image and an image feature extraction model, obtaining text features based on the target text and a first text feature extraction model, and obtaining image description text features based on the image description text and a second text feature extraction model;
obtaining image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model;
and determining the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model.
2. The method of claim 1, wherein the image feature extraction model comprises a convolutional neural network CNN and a fully-connected network, and the first text feature extraction model and the second text feature extraction model each comprise a Bi-directional long-term short-term memory Bi-LSTM network and a fully-connected network.
3. The method of claim 1, wherein the obtaining image-text relationship features based on the image features, the text features, the image description text features and the image-text relationship feature extraction model comprises:
obtaining image attention features based on the image features, the text features and an image attention feature extraction model;
obtaining image description text attention features based on the image description text features, the text features and an image description text attention feature extraction model;
and obtaining the image-text relationship features based on the text features, the image attention features, the image description text attention features and the image-text relationship feature extraction model.
4. The method according to claim 3, wherein the image attention feature extraction model and the image description text attention feature extraction model are both multi-head attention mechanism models.
5. The method of claim 1, further comprising, before acquiring the target image and the target text in the target tweet:
obtaining a sample relationship type of a sample image and a sample text, wherein the sample image and the sample text belong to the same sample tweet;
obtaining a predicted image description text based on the sample image and an image description text generation model to be trained;
obtaining predicted image features based on the sample image and an image feature extraction model to be trained, obtaining predicted text features based on the sample text and a first text feature extraction model to be trained, and obtaining predicted image description text features based on the predicted image description text and a second text feature extraction model to be trained;
obtaining predicted image-text relationship features based on the predicted image features, the predicted text features, the predicted image description text features and an image-text relationship feature extraction model to be trained;
determining a predicted relationship type of the sample image and the sample text based on the predicted image-text relationship features;
and training and adjusting parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the sample relationship type and the predicted relationship type of the sample image and the sample text.
6. The method of claim 5, wherein the training and adjusting parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the sample relationship type and the predicted relationship type of the sample image and the sample text comprises:
determining a target loss value based on the sample relationship type of the sample image and the sample text, the predicted relationship type and a weighted cross entropy loss function;
and training and adjusting parameters of the image description text generation model to be trained, the image feature extraction model to be trained, the first text feature extraction model to be trained, the second text feature extraction model to be trained and the image-text relationship feature extraction model to be trained based on the target loss value.
7. An apparatus for determining a tweet label, the apparatus comprising:
an acquiring module, configured to acquire a target image and a target text in a target tweet;
and a determining module, configured to obtain an image description text based on the target image and an image description text generation model; obtain image features based on the target image and an image feature extraction model, obtain text features based on the target text and a first text feature extraction model, and obtain image description text features based on the image description text and a second text feature extraction model; obtain image-text relationship features based on the image features, the text features, the image description text features and an image-text relationship feature extraction model; and determine the label of the target tweet based on the image features, the text features, the image-text relationship features and a label analysis model.
8. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to perform the operations performed by the method of determining a tweet label according to any of claims 1 to 7.
9. A computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the method of determining a tweet label according to any of claims 1 to 7.
10. A computer program product comprising at least one instruction that is loaded and executed by a processor to implement the operations performed by the method of determining a tweet label according to any of claims 1 to 7.
CN202210404503.XA 2022-04-18 2022-04-18 Method, apparatus, device, storage medium and program product for determining a pushmark Pending CN116977105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404503.XA CN116977105A (en) 2022-04-18 2022-04-18 Method, apparatus, device, storage medium and program product for determining a pushmark

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210404503.XA CN116977105A (en) 2022-04-18 2022-04-18 Method, apparatus, device, storage medium and program product for determining a pushmark

Publications (1)

Publication Number Publication Date
CN116977105A true CN116977105A (en) 2023-10-31

Family

ID=88480087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210404503.XA Pending CN116977105A (en) 2022-04-18 2022-04-18 Method, apparatus, device, storage medium and program product for determining a pushmark

Country Status (1)

Country Link
CN (1) CN116977105A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination