WO2024065645A1 - Training method, apparatus, device and storage medium for an image-text matching model

Training method, apparatus, device and storage medium for an image-text matching model

Info

Publication number
WO2024065645A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
entity
positive
matching model
Prior art date
Application number
PCT/CN2022/123188
Other languages
English (en)
French (fr)
Inventor
冀潮
欧歌
钟楚千
张鹏飞
姜博然
魏书琪
Original Assignee
北京京东方技术开发有限公司
京东方科技集团股份有限公司
Priority date
Filing date
Publication date
Application filed by 北京京东方技术开发有限公司, 京东方科技集团股份有限公司
Priority to PCT/CN2022/123188 (WO2024065645A1)
Priority to CN202280003411.9A (CN118119935A)
Publication of WO2024065645A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning

Definitions

  • the present invention relates to the field of artificial intelligence technology, and in particular to a training method, device, equipment and storage medium for an image-text matching model.
  • the present invention provides a training method, device, equipment and storage medium for an image-text matching model to solve the deficiencies in the related art.
  • a method for training an image-text matching model comprising:
  • the positive samples include text and images, the included text is used to describe the content in the included image
  • the negative samples include text and images, the included text description content does not match the content in the included image
  • the image-text matching model is trained based on contrastive learning
  • the image-text matching model is used to predict, for input text and image, whether the input text is used to describe the content in the input image.
  • the image-text matching model includes: a text representation layer and an image representation layer;
  • the image-text matching model is used to: for input text and image, use the text representation layer to obtain text features of the input text, use the image representation layer to obtain image features of the input image, and then based on the obtained text features and image features, predict whether the input text is used to describe the content in the input image.
  • obtaining positive samples and negative samples includes:
  • Positive samples and negative samples are generated according to the correspondence set; the positive samples include: text and images belonging to the same correspondence; the negative samples include: text and images belonging to different correspondences; the text description content in the negative samples does not match the image content in the same negative samples.
  • generating positive samples and negative samples according to the correspondence set includes:
  • a negative sample is generated based on the text in any one of the corresponding relationships and the image in any other corresponding relationship in the multiple groups of corresponding relationships.
  • generating positive samples and negative samples according to the correspondence set includes:
  • a positive sample is generated based on the text and image in each group of correspondences; based on the text in each group of correspondences and the images in the other N-1 groups of correspondences in the N groups of correspondences, N-1 negative samples are generated.
  • the text included in the positive sample is used to describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in the negative sample does not match the entity in the included image.
  • the texts are used to describe entity categories and entity attributes in the corresponding images;
  • Generating positive samples and negative samples according to the corresponding relationship set includes:
  • the correspondences in the correspondence set with the same entity category and different entity attributes are determined as a first subset, and a first positive and negative sample set is generated according to the first subset; the correspondences with different entity categories and the same entity attributes are determined as a second subset, and a second positive and negative sample set is generated according to the second subset; the correspondences with different entity categories and different entity attributes are determined as a third subset, and a third positive and negative sample set is generated according to the third subset.
  • the first loss weight is less than the second loss weight, and the second loss weight is less than the third loss weight; the first loss weight is the loss function weight when the image-text matching model is trained using the first positive and negative sample sets; the second loss weight is the loss function weight when the image-text matching model is trained using the second positive and negative sample sets; the third loss weight is the loss function weight when the image-text matching model is trained using the third positive and negative sample sets.
  • the text representation layer is used to extract entity feature information from the input text.
  • the text representation layer is used to perform text encoding on the input text, and then extract entity feature information from the text encoding result.
  • before training the image-text matching model, the method further includes:
  • the image to be restored is used as a sample feature, and the image to be trained is used as a sample label to pre-train the image representation layer.
  • before training the image-text matching model, the method further includes:
  • the obtained text to be restored is used as a sample feature, and the first text is used as a sample label to pre-train a text representation layer.
  • a training device for an image-text matching model comprising:
  • a sample unit used to obtain positive samples and negative samples;
  • the positive samples include text and images, the included text is used to describe the content in the included image;
  • the negative samples include text and images, the included text description content is inconsistent with the content in the included image;
  • a training unit used for training an image-text matching model based on contrastive learning by using the acquired positive and negative samples
  • the image-text matching model is used to predict, for input text and image, whether the input text is used to describe the content in the input image.
  • the image-text matching model includes: a text representation layer and an image representation layer;
  • the image-text matching model is used to: for input text and image, use the text representation layer to obtain text features of the input text, use the image representation layer to obtain image features of the input image, and then based on the obtained text features and image features, predict whether the input text is used to describe the content in the input image.
  • sample unit is used to:
  • Positive samples and negative samples are generated according to the correspondence set; the positive samples include: text and images belonging to the same correspondence; the negative samples include: text and images belonging to different correspondences; the text description content in the negative samples does not match the image content in the same negative samples.
  • sample unit is used to:
  • a negative sample is generated based on the text in any one of the corresponding relationships and the image in any other corresponding relationship in the multiple groups of corresponding relationships.
  • sample unit is used to:
  • a positive sample is generated based on the text and image in each group of correspondences; based on the text in each group of correspondences and the images in the other N-1 groups of correspondences in the N groups of correspondences, N-1 negative samples are generated.
  • the text included in the positive sample is used to describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in the negative sample does not match the entity in the included image.
  • the text is used to describe entity categories and entity attributes in the corresponding image
  • the correspondences in the correspondence set with the same entity category and different entity attributes are determined as a first subset, and a first positive and negative sample set is generated according to the first subset; the correspondences with different entity categories and the same entity attributes are determined as a second subset, and a second positive and negative sample set is generated according to the second subset; the correspondences with different entity categories and different entity attributes are determined as a third subset, and a third positive and negative sample set is generated according to the third subset.
  • the first loss weight is less than the second loss weight, and the second loss weight is less than the third loss weight; the first loss weight is the loss function weight when the image-text matching model is trained using the first positive and negative sample sets; the second loss weight is the loss function weight when the image-text matching model is trained using the second positive and negative sample sets; the third loss weight is the loss function weight when the image-text matching model is trained using the third positive and negative sample sets.
  • the text representation layer is used to extract entity feature information from the input text.
  • the text representation layer is used to perform text encoding on the input text, and then extract entity feature information from the text encoding result.
  • the device further comprises an image pre-training unit, which is used to: before training the image-text matching model, determine in advance the position of at least one entity in the image to be trained according to entity information contained in a first text used to describe the content of the image to be trained;
  • the image to be restored is used as a sample feature, and the image to be trained is used as a sample label to pre-train the image representation layer.
  • the device further comprises a text pre-training unit, which is used to: before training the image-text matching model, for any masked entity in the image to be restored, mask the information of the targeted entity in the first text to obtain the text to be restored;
  • a text pre-training unit which is used to: before training the image-text matching model, for any masked entity in the image to be restored, mask the information of the targeted entity in the first text to obtain the text to be restored;
  • the obtained text to be restored is used as a sample feature, and the first text is used as a sample label to pre-train a text representation layer.
  • the image-text matching model is trained by using positive and negative samples and contrastive learning, thereby increasing the number of samples by introducing negative samples and improving the training effect of the image-text matching model.
  • FIG. 1 is a schematic flow chart of a training method for an image-text matching model according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the structure of a training device for an image-text matching model according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a hardware structure of a computer device configured with a method according to an embodiment of the present invention.
  • the embodiment of the invention discloses a training method for an image-text matching model.
  • the image-text matching model can be trained using contrastive learning, which requires obtaining positive samples and negative samples for contrastive learning.
  • the positive sample includes text and image, and the included text can be used to describe the content in the included image, so that it can be determined that there is a correlation between the text and the image in the positive sample.
  • a positive sample may include an image with the content "a dog and a person playing together” and a text with the content "a person playing with a dog”.
  • the negative sample includes text and image, and the included text does not match the content in the included image, so it can be determined that there is no association between the text and the image in the negative sample.
  • for example, a negative sample may include an image of a dog playing with a person and text describing a store restocking goods.
  • the image-text matching model can be trained by using positive and negative samples and contrastive learning, which can easily increase the number of negative samples, thereby increasing the number of samples and improving the training effect of the image-text matching model.
  • FIG. 1 is a flow chart of a training method for an image-text matching model according to an embodiment of the present invention.
  • the embodiment of the present invention does not limit the execution subject of the method flow.
  • the execution subject can be any computing device, for example, a server for image text matching.
  • the method may include the following steps.
  • the positive sample includes text and an image.
  • the text included in the positive sample can be used to describe the content in the included image.
  • the negative sample includes text and an image.
  • the text description content included in the negative sample may not match the content in the included image.
  • the image-text matching model can be used to: predict, with respect to input text and image, whether the input text is used to describe the content in the input image.
  • the above method flow can train the image-text matching model by using positive and negative samples and contrastive learning, and increase the number of samples by introducing negative samples, thereby improving the training effect of the image-text matching model.
  • the method flow does not limit the number of positive samples and negative samples obtained.
  • at least one positive sample and at least one negative sample can be obtained.
  • the text included in any positive sample may be used to describe the content in the included image.
  • the text description content included in any negative sample may not match the content in the included image.
  • the text included in each positive sample may be used to describe the content in the included image.
  • the text description content included in each negative sample may not match the content in the included image.
  • the process of this method does not limit the specific way of obtaining positive samples and negative samples, as long as the text included in the positive sample can be used to describe the content of the included image, and the text description content included in the negative sample may not match the content of the included image.
  • the positive sample can be obtained from the Internet.
  • the corresponding image can be searched on the Internet based on the text as the image included in the positive sample.
  • positive samples can also be obtained directly from an existing image-text matching dataset.
  • the text included in each piece of data in the dataset can be used to describe the content of the included image.
  • text for describing the content of an image may also be manually written, so that a positive sample can be obtained by combining the image and the written text.
  • for example, for an image with the content "kitten eating", the text "kitten eating" can be manually written.
  • negative samples can be generated based on positive samples.
  • the text included in the positive sample can be directly replaced with other texts with completely different content, so as to obtain negative samples.
  • the text “kitten eating” can be directly obtained and combined with the image with the content “puppy swimming” to obtain a negative sample.
  • the obtained text can also be “fish swimming”, “clouds”, “opening the window” and other texts that are completely unrelated to "puppy swimming”.
  • negative samples may also be generated manually, specifically, for an image, text that does not match the content of the image may be manually edited and generated.
  • the text “a fish swimming” or “a puppy swimming” etc. can be manually edited, so that the image and the manually edited text can be combined to obtain a negative sample.
  • association relationship between text and image may be obtained first, so that positive samples and negative samples may be generated based on the association relationship, thereby improving the efficiency of sample generation.
  • obtaining positive samples and negative samples may include: obtaining a set of correspondences between texts and images; in any set of correspondences between texts and images, the text is used to describe the content in the corresponding image. Generating positive samples and negative samples according to the set of correspondences.
  • the positive sample may include: text and image belonging to the same corresponding relationship;
  • the negative sample may include: text and image belonging to different corresponding relationships; and the text description content in the negative sample does not match the image content in the same negative sample.
  • association relationship between text and image may include: text “puppy swimming” and image with content “puppy swimming”, text “kitten eating” and image with content “kitten eating”, text “yellow dog playing with ball” and image with content “yellow dog playing with ball”, etc.
  • generating positive samples and negative samples based on a set of correspondence relationships may include: determining multiple sets of correspondence relationships from the set of correspondence relationships, wherein the images and texts between any two sets of correspondence relationships in the determined multiple sets of correspondence relationships are different; for any one of the multiple sets of correspondence relationships, generating a positive sample based on the text and image in the correspondence relationship; and generating a negative sample based on the text in the correspondence relationship and the image in any other correspondence relationship in the multiple sets of correspondence relationships.
  • multiple negative samples may also be generated directly for the same group of correspondences among the determined multiple groups, by pairing its text with the images in several other groups of correspondences.
  • generating positive samples and negative samples according to a correspondence set may include: determining N groups of correspondences from the correspondence set, wherein the images and texts between any two groups of correspondences in the determined N groups of correspondences are different; for each group of correspondences in the N groups of correspondences, generating a positive sample based on the text and image in each group of correspondences; and generating N-1 negative samples based on the text in each group of correspondences and the images in the other N-1 groups of correspondences in the N groups of correspondences.
  • This embodiment can increase the number and efficiency of generating negative samples, and facilitate improving the training effect of subsequent image-text matching models.
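  • As a rough illustration of this generation scheme (a sketch, not the patented implementation), the following Python snippet builds one positive sample and N-1 negative samples for each of N text-image correspondences; the dictionary field names are hypothetical.

```python
def build_samples(correspondences):
    """correspondences: list of (text, image) pairs whose texts and images are
    assumed to be pairwise different (the N groups of correspondences)."""
    positives, negatives = [], []
    for i, (text, image) in enumerate(correspondences):
        # positive sample: text and image belonging to the same correspondence
        positives.append({"text": text, "image": image, "label": 1})
        # N-1 negative samples: this text paired with the images of the other groups
        for j, (_, other_image) in enumerate(correspondences):
            if j != i:
                negatives.append({"text": text, "image": other_image, "label": 0})
    return positives, negatives

# usage: 3 correspondences -> 3 positive samples and 3 * 2 = 6 negative samples
corr = [("a puppy swimming", "img_puppy.jpg"),
        ("a kitten eating", "img_kitten.jpg"),
        ("a yellow dog playing with a ball", "img_yellow_dog.jpg")]
positives, negatives = build_samples(corr)
print(len(positives), len(negatives))               # 3 6
```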
  • the image content described by the text may include information related to entities in the image.
  • the text can be limited to describing entity-related information in the corresponding image, thereby facilitating improvement of the training effect of the image-text matching model.
  • the texts may be used to describe entity categories and entity attributes in the corresponding images.
  • the entity category may specifically be a classification of entities, such as animals, plants, objects, etc.; or cats, dogs, birds, etc.
  • the entity attribute may specifically be an attribute of the entity itself, such as the color and size of the entity.
  • the text included in the positive sample can be used to describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in the negative sample may not match the entity in the included image.
  • the text included in any positive sample may be used to describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in any negative sample may not match the entity in the included image.
  • the text included in each positive sample can be used to describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in each negative sample may not match the entity in the included image.
  • the entity categories in the text included in the negative sample may be different from the entity categories in the included image; or the entity attributes in the text included in the negative sample may be different from the entity attributes in the included image; or the entity categories and entity attributes in the text included in the negative sample may be different from the entity categories and entity attributes in the included image.
  • the core idea of contrastive learning may include: shortening the distance between positive samples and increasing the distance between positive samples and negative samples.
  • it is therefore desirable for the image-text matching model to pay more attention to entity categories that differ from the image entities than to entity attributes that differ from the image entities.
  • for example, the negative sample consisting of an image with the content "yellow dog swimming" and the text "black dog swimming" can be closer to the positive sample consisting of an image with the content "yellow dog swimming" and the text "yellow dog swimming" than the negative sample consisting of the same image and the text "small fish swimming" is.
  • to this end, negative samples can be divided into different types and the model trained with different loss function weights for each type, thereby increasing the image-text matching model's attention to and recognition sensitivity for "entity categories" and improving the training effect of the image-text matching model.
  • generating positive samples and negative samples according to the correspondence set may include: determining correspondences in the correspondence set with the same entity category and different entity attributes as a first subset; and generating a first positive and negative sample set according to the first subset.
  • the corresponding relationships in the corresponding relationship set with different entity categories and the same entity attributes are determined as a second subset; and a second positive and negative sample set is generated according to the second subset.
  • the corresponding relationships in the corresponding relationship set with different entity categories and different entity attributes are determined as a third subset; and a third positive and negative sample set is generated according to the third subset.
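  • The three subsets can be pictured with the following sketch, which compares the entity category and entity attribute of two correspondences and reports which positive and negative sample set the pair would feed; the field names "category" and "attribute" are illustrative assumptions.

```python
def assign_subset(corr_a, corr_b):
    """corr_a, corr_b: two correspondences, each described here by the entity
    'category' and 'attribute' of its text/image (hypothetical field names).
    Returns which of the three positive and negative sample sets the pair feeds."""
    same_category = corr_a["category"] == corr_b["category"]
    same_attribute = corr_a["attribute"] == corr_b["attribute"]
    if same_category and not same_attribute:
        return "first"    # same entity category, different entity attributes
    if not same_category and same_attribute:
        return "second"   # different entity categories, same entity attributes
    if not same_category and not same_attribute:
        return "third"    # different entity categories and different entity attributes
    return None           # identical category and attribute: not a negative pairing

# usage
a = {"category": "dog", "attribute": "yellow"}
b = {"category": "dog", "attribute": "black"}
c = {"category": "fish", "attribute": "yellow"}
print(assign_subset(a, b), assign_subset(a, c))     # first second
```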
  • model training may be performed for the above classified positive and negative sample sets respectively.
  • different loss functions may be used to train the image-text matching models respectively.
  • the first loss weight is smaller than the second loss weight, and the second loss weight is smaller than the third loss weight.
  • the first loss weight is the weight of the loss function when the image-text matching model is trained using the first positive and negative sample set.
  • the second loss weight is the weight of the loss function when the image-text matching model is trained using the second positive and negative sample sets.
  • the third loss weight is the weight of the loss function when the image-text matching model is trained using the third positive and negative sample set.
  • the embodiment does not limit the manner in which the positive and negative sample sets are generated according to the first subset, the second subset and the third subset.
  • the first subset, the second subset and the third subset can be regarded as the "multiple groups of corresponding relationships" determined in the above embodiment to generate positive samples and negative samples.
  • the negative samples in the first positive and negative sample set include entity categories in the text that are the same as those in the image entities, but entity attributes in the text that are different from those in the image entities.
  • the negative samples in the second positive and negative sample set include entity categories in the text that are different from those in the image entities, but entity attributes in the text that are the same as those in the image entities.
  • the negative samples in the third positive and negative sample set include entity categories in the included text that are different from the image entities, and entity attributes in the included text that are different from the image entities.
  • This embodiment can improve the image-text matching model's attention to and recognition sensitivity of "entity categories" by distinguishing different types of negative samples and using different loss function weights to train the image-text matching model, thereby improving the training effect of the image-text matching model.
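  • One possible way to realize this weighting (a sketch, assuming a standard cross-entropy matching loss per batch and hypothetical weight values satisfying first < second < third) is to scale the loss of each batch by the weight of the subset it was drawn from.

```python
import torch
import torch.nn.functional as F

# hypothetical loss weights satisfying first < second < third
SUBSET_WEIGHTS = {"first": 0.5, "second": 1.0, "third": 2.0}

def weighted_matching_loss(logits, labels, subset):
    """logits: (B, 2) match/mismatch scores from the image-text matching model;
    labels: (B,) 1 for positive samples, 0 for negative samples;
    subset: which of the three positive and negative sample sets the batch came from."""
    return SUBSET_WEIGHTS[subset] * F.cross_entropy(logits, labels)

# usage with random tensors standing in for model outputs
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(weighted_matching_loss(logits, labels, "third").item())
```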
  • the process of this method does not limit the specific structure of the image-text matching model.
  • the image-text matching model may be a neural network model or other types of models.
  • the input of the image-text matching model may include the image and text to be matched. Therefore, the image-text matching model may set representation layers to extract features for inputs of two data types, image and text.
  • the image-text matching model may include: a text representation layer and an image representation layer.
  • the image-text matching model can be used to: for the input text and image, use the text representation layer to obtain the text features of the input text, use the image representation layer to obtain the image features of the input image, and then based on the obtained text features and image features, predict whether the input text is used to describe the content in the input image.
  • the image-text matching model may further include an intermediate layer and an output layer.
  • the obtained text features and image features may be input into the intermediate layer for processing, and then the processing result may be input into the output layer, and the output layer may output the prediction result.
  • the prediction result may indicate whether the input text is used to describe the content in the input image.
  • based on the obtained text features and image features, it is predicted whether the input text is used to describe the content in the input image. Specifically, the obtained text features and image features are first integrated, and then prediction is performed based on the integrated feature result.
  • This embodiment does not limit the way of integrating text features and image features.
  • the text features and image features can be concatenated, the product of the text features and image features can be calculated, the sum of the text features and image features can be calculated, and so on.
  • the intermediate layer of the image-text matching model can be used to integrate the input text features and image features.
  • This embodiment integrates text features and image features, and can fuse text feature information and image feature information through a model, thereby facilitating learning and mining the association between text and image.
  • the process of this method does not limit the output of the image-text matching model, as long as it can be used to characterize whether the input text is used to describe the content in the input image.
  • the classification results of positive and negative samples may be output, the probabilities of positive and negative samples may be output, the results or probabilities of whether the image and text match may be output, the matching degree or similarity between the "input image” and the "input text” may be output, and so on.
  • the process of this method does not limit the intermediate layer and output layer structure in the image-text matching model.
  • the intermediate layer may specifically include a fully connected layer
  • the output layer may specifically include a softmax layer
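  • A minimal PyTorch sketch of the structure described above is given below: a text representation layer, an image representation layer, a fully connected intermediate layer over the concatenated (integrated) features, and a softmax output layer. The concrete encoders (averaged word embeddings, a small CNN) and the layer sizes are illustrative assumptions, not the patent's design.

```python
import torch
import torch.nn as nn

class ImageTextMatcher(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=128, img_dim=128):
        super().__init__()
        # text representation layer (illustrative: averaged word embeddings)
        self.text_layer = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
        # image representation layer (illustrative: small CNN + pooling)
        self.image_layer = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, img_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # intermediate layer: fully connected layer over the integrated features
        self.intermediate = nn.Sequential(nn.Linear(text_dim + img_dim, 256), nn.ReLU())
        # output layer: two logits (mismatch / match), softmax gives probabilities
        self.output = nn.Linear(256, 2)

    def forward(self, token_ids, images):
        text_feat = self.text_layer(token_ids)              # (B, text_dim)
        image_feat = self.image_layer(images)               # (B, img_dim)
        fused = torch.cat([text_feat, image_feat], dim=1)   # feature integration
        logits = self.output(self.intermediate(fused))
        return logits.softmax(dim=1)                        # P(mismatch), P(match)

# usage
model = ImageTextMatcher()
probs = model(torch.randint(0, 10000, (2, 6)), torch.randn(2, 3, 64, 64))
print(probs.shape)                                          # torch.Size([2, 2])
```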
  • the process of this method does not limit the specific structure of the text representation layer and the image representation layer in the image-text matching model; specific explanations can be found in the following text.
  • in order to improve the training effect and prediction accuracy of the image-text matching model, the image-text matching model can be limited to focus on identifying entity-related information in the input text and whether it matches the entities in the input image.
  • it may include whether the entity categories and entity attributes in the input text are the same as the entity categories and entity attributes in the input image.
  • the output result of the image-text matching model can also be used to characterize whether the entity categories and entity attributes in the input text are the same as the entity categories and entity attributes in the input image.
  • the output results of the image-text matching model may include entity categories and entity attributes in the input text that are the same as the entity categories and entity attributes in the input image; the entity categories in the input text are the same as the entity categories in the input image, but the entity attributes are different; the entity attributes in the input text are the same as the entity attributes in the input image, but the entity categories are different; the entity categories and entity attributes in the input text are different from the entity categories and entity attributes in the input image.
  • a fixed template text may be preset, and then entity related information may be filled in for the fixed text template, such as "There is an xx in the picture".
  • xx may include entity related information, specifically entity category, or entity category and attribute.
  • the above texts, each filled with information related to a different entity, can be matched against the image to be matched respectively, so that the text matching the image can be conveniently determined based on the matching results (for example, the similarity between the image and each text), and the relevant information of the entities in the image to be matched can thus be conveniently determined.
  • the entity category can be filled in the text template first, input into the image text matching model, and matched with the image to be matched to determine whether there is a matching entity category.
  • the entity attribute can be filled in the text template again, and then input into the image text matching model, and matched with the image to be matched to determine whether there is a matching entity attribute.
  • This embodiment can improve the efficiency of the image-text matching model by matching entity categories and entity attributes respectively.
  • the method flow does not limit the specific manner of contrastive learning.
  • the image-text matching model may be trained by clustering positive and negative samples respectively.
  • the loss function of the image-text matching model may include a cross entropy function, and thus corresponding positive and negative sample classifications may be determined for the input image text.
  • the image-text matching model may be trained with the goal of reducing the distance between mapping results of different positive samples and increasing the distance between mapping results of positive samples and negative samples.
  • the value of the loss function is set to be positively correlated with the distance between the mapping results of the positive samples, so that the distance between the mapping results of the positive samples can be reduced by lowering the value of the loss function.
  • multiple positive samples and multiple negative samples that are input can be mapped to a vector space through an image-text matching model, and the value of the loss function can be set to be positively correlated with the distance between the positive sample mapping results, and negatively correlated with the distance between the positive sample mapping results and the negative sample mapping results.
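  • The relationship between the loss value and the two kinds of distances can be sketched as follows; the Euclidean-distance formulation and the margin are illustrative assumptions, since the embodiment does not fix a particular formula.

```python
import torch

def contrastive_loss(pos_emb, neg_emb, margin=1.0):
    """pos_emb: (P, D) mapping results of positive samples in the vector space;
    neg_emb: (Q, D) mapping results of negative samples.
    The loss grows with the distances among positive mapping results and shrinks
    as positives move away from negatives (up to a margin)."""
    # positively correlated term: pairwise distances among positive mapping results
    pos_dist = torch.cdist(pos_emb, pos_emb).mean()
    # negatively correlated term: distances between positive and negative mapping results
    pos_neg_dist = torch.cdist(pos_emb, neg_emb).mean()
    return pos_dist + torch.clamp(margin - pos_neg_dist, min=0)

# usage with random embeddings
loss = contrastive_loss(torch.randn(4, 16), torch.randn(6, 16))
print(loss.item())
```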
  • the process of this method does not limit the structure and training method of the text representation layer and the image representation layer.
  • the image representation layer may include several convolutional layers; the text representation layer may also include several convolutional layers, and may also include a self-attention mechanism layer, etc.
  • the image representation layer and the text representation layer can be trained directly along with the training of the image-text matching model, or can be trained in advance through samples to determine relatively good initial parameters, thereby improving the overall training effect of the image-text matching model.
  • the process of this method does not limit the pre-training method of the image representation layer and the text representation layer.
  • an image model meeting business requirements may be trained using image samples, so that the representation layer therein may be extracted and determined as the image representation layer.
  • image samples with labeled detection frames can be used to train image object detection models, and then the representation layer can be extracted.
  • Image samples with labeled entity content labels can also be used to train image recognition models, and then the representation layer can be extracted.
  • the text samples may be used to train a text model that meets business requirements, so that the representation layer therein may be extracted and determined as the text representation layer.
  • text samples with annotated entity labels can be used to train a text entity information extraction model, and then the representation layer can be extracted.
  • Text samples with annotated content labels can also be used to train a text content extraction model, and then the representation layer can be extracted.
  • the text representation layer can specifically use static encoding, such as word2vec (small parameter model), or dynamic encoding, such as BERT (large parameter model).
  • the text representation layer and the image representation layer can be used to extract entity-related information.
  • the text representation layer can be used to extract entity feature information from the input text.
  • the text representation layer can be used to extract entity feature information for triple information (head entity, relationship and tail entity) in the input text.
  • it can include head entity feature information and tail entity feature information.
  • extracting entity feature information can include encoding entity information.
  • the entity feature information can include entity information encoding results.
  • the text representation layer can be used to encode triple information (head entity, relationship and tail entity) in the input text, so as to obtain the head entity encoding result and the tail entity encoding result, and then the head entity encoding result and the tail entity encoding result can be determined as entity feature information.
  • the text representation layer can be used to encode entity information in the input text.
  • entity information can include triple information (head entity, relation and tail entity).
  • the image representation layer can be used to extract entity feature information in the input image.
  • the image representation layer may be used to extract features of the input image, wherein the extracted image features may include feature information of entities in the image, or the extracted image features may include associations with entities in the image.
  • This embodiment does not limit the specific structure of the text representation layer.
  • the text representation layer can be used to perform text encoding on the input text, and then extract entity feature information from the text encoding result.
  • the text representation layer can be used to perform text encoding on the input text, determine entity information in the input text, and then extract entity feature information from the encoded part corresponding to the entity information in the text encoding result.
  • This embodiment does not limit the text encoding method.
  • any text encoding model may be used for text encoding.
  • static encoding such as word2vec (small parameter model) or dynamic encoding such as bert (large parameter model) may be used, or RNN, CNN, LSTM, self-attention model and other models may be used for text encoding.
  • This embodiment does not limit the method of determining entity information in the input text.
  • the entity information can be determined by using a knowledge graph or by the text representation layer itself.
  • the knowledge graph can be used to determine the triples (head entity, relationship, tail entity) in the input text.
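  • An illustrative sketch of this two-step procedure, i.e., encoding the text and then pooling the encoded positions that correspond to an entity mention, is shown below; the toy embedding encoder, whitespace tokenizer, and entity spans stand in for whatever encoder (word2vec, BERT, etc.) and entity recognition are actually used.

```python
import torch
import torch.nn as nn

def tokenize(text):
    return text.lower().split()                     # stand-in tokenizer

vocab = {"<unk>": 0, "a": 1, "yellow": 2, "dog": 3, "plays": 4, "with": 5, "ball": 6}
encoder = nn.Embedding(len(vocab), 32)              # stand-in text encoder

def encode_text(text):
    ids = torch.tensor([vocab.get(t, 0) for t in tokenize(text)])
    return encoder(ids)                             # (L, 32) per-token text encoding

def entity_features(text, entity_spans):
    """entity_spans: token index ranges of entity mentions, e.g. located via a knowledge graph."""
    encoded = encode_text(text)
    # pool the encoded part corresponding to each entity mention
    return [encoded[start:end].mean(dim=0) for start, end in entity_spans]

feats = entity_features("a yellow dog plays with a ball", [(1, 3), (6, 7)])
print(len(feats), feats[0].shape)                   # 2 torch.Size([32])
```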
  • This embodiment does not limit the method of extracting entity feature information.
  • a knowledge graph embedding model, i.e., a translation-based (Translate) algorithm, can be used to extract entity feature information.
  • extracting entity feature information may include encoding entity information in the input text, and specifically may include encoding triples (head entity, relation, tail entity) in the input text.
  • the specific determination of the triples in the input text may be performed through a knowledge graph.
  • the encoding result (that is, the entity feature) can contain the information of the knowledge graph.
  • the information of the knowledge graph can include the information of the triples, and specifically can include the relationship information between the head entity and the tail entity.
  • triples can be extracted from text to generate a knowledge graph.
  • the triples are represented as (head entity, relationship, tail entity).
  • the extraction result can be (dog, play, ball).
  • Alaskan Malamute is an attribute of dog.
  • the text representation layer can perform text encoding on the input text, and then determine the text encoding part corresponding to the triple.
  • the pre-trained TransR model can be used to project the head entity and the tail entity into the relational space through the projection matrix for the text encoding part corresponding to the triple, and the head entity mapping result and the tail entity mapping result are obtained as the entity feature information input to the text representation layer.
  • the training method of the TransR model is as follows: for each triple (h, r, t), the head entity and the tail entity are projected into the relational space through the projection matrix to obtain the head entity mapping result and the tail entity mapping result.
  • the final evaluation function is f_r(h, t) = ||h_r + r - t_r||_2^2, where h_r = hM_r and t_r = tM_r are the projections of the head entity and the tail entity into the relation space; the model is trained to minimize the evaluation function.
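  • A small numerical sketch of the TransR projection and evaluation function follows; the embedding sizes and random values are illustrative, and in practice the embeddings and projection matrices are learned by minimizing the evaluation function over the knowledge-graph triples.

```python
import torch

d_e, d_r = 16, 8                                    # entity / relation embedding sizes
h = torch.randn(d_e)                                # head entity embedding
t = torch.randn(d_e)                                # tail entity embedding
r = torch.randn(d_r)                                # relation embedding
M_r = torch.randn(d_e, d_r)                         # projection matrix of relation r

h_r = h @ M_r                                       # head entity projected into the relation space
t_r = t @ M_r                                       # tail entity projected into the relation space

# TransR evaluation function f_r(h, t) = ||h_r + r - t_r||_2^2; training minimizes it
score = torch.sum((h_r + r - t_r) ** 2)
print(score.item())
```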
  • the entity feature representation can be made more accurate by pre-training the image representation layer and the text representation layer.
  • the text representation layer can be trained by annotating entity labels on the text, which can be done by annotating through a knowledge graph.
  • the above method flow may also include: pre-determining entity information contained in the text to be trained; determining corresponding entity labels based on the determined entity information; and pre-training a text representation layer using the text to be trained and the corresponding entity labels.
  • the specific pre-trained text representation layer may be a pre-trained text entity information extraction model, and then a representation part is extracted therefrom to be determined as the text representation layer.
  • This embodiment does not limit the form of entity information.
  • it may include entity category and/or entity attribute.
  • determining entity information contained in the text to be trained and determining corresponding entity labels based on the determined entity information may include: determining the head entity, relationship, and tail entity contained in the text to be trained; using a pre-trained mapping model to map the determined head entity and tail entity into the relationship space to obtain head entity features and tail entity features; and determining the obtained head entity features and tail entity features as corresponding entity labels.
  • This embodiment uses the knowledge graph to obtain the head entity, relationship and tail entity in the text, determines the entity information in the text, and then obtains the entity features as labels through feature mapping.
  • triples can be extracted from text to generate a knowledge graph.
  • the representation of a triple is (head entity, relation, tail entity).
  • the extraction result may be (dog, play, ball), where Alaskan Malamute is an attribute of dog.
  • the TransR algorithm can be used to train triple embedding.
  • the specific training method is as follows: for each triple (h, r, t), the head entity and the tail entity are projected into the relational space through the projection matrix to obtain the head entity mapping result and the tail entity mapping result.
  • the final evaluation function is f_r(h, t) = ||h_r + r - t_r||_2^2, where h_r = hM_r and t_r = tM_r are the projections of the head entity and the tail entity into the relation space; the model is trained to minimize the evaluation function.
  • the trained TransR model can be used to output entity embedding representations that integrate knowledge graph information for the head entity and tail entity in the text, and then determine them as entity information labels to train the text representation layer.
  • the image representation layer may be trained by annotating entity labels on the image.
  • an image of a detection frame marked with entity information may be directly acquired to train the image representation layer, specifically, a model for extracting image entity information may be trained to extract the representation part and determine it as the image representation layer.
  • the above method flow may further include: determining the position of at least one entity in the image to be trained in advance based on the entity information contained in the first text used to describe the content of the image to be trained; masking at least one entity in the image to be trained to obtain at least one image to be restored; and pre-training the image representation layer using the image to be restored as a sample feature and the image to be trained as a sample label.
  • the image representation layer can be used to extract entity feature information in the input image.
  • This embodiment does not limit the form of entity information.
  • it may include entity category and/or entity attribute.
  • the pre-trained image representation layer may include a pre-trained image restoration model, so that the representation part in the trained image restoration model can be extracted and determined as the image representation layer.
  • the backbone network in the image representation layer can be ResNet (a small-parameter model) or ViT (a large-parameter model), or it can be a CNN model, a Transformer model, or a self-attention model.
  • the first text may also be masked to train the text representation layer.
  • the information of the targeted entity can be masked in the first text to obtain the text to be restored; the obtained text to be restored is used as sample features and the first text is used as sample labels to pre-train the text representation layer; the text representation layer can be used to extract entity feature information in the input text.
  • the pre-trained text representation layer may include a pre-trained text recovery model, so that the representation part in the trained text recovery model can be extracted and determined as the text representation layer.
  • This embodiment masks the same entity in the associated image and text for training, which can improve the correlation between the entity feature information extracted by the text representation layer and that extracted by the image representation layer, thereby improving the correlation between the image representation results and the text representation results and improving the training effect and the accuracy of the image-text matching model.
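  • To make the paired masking concrete, the sketch below prepares an (image to be restored, image to be trained) pair and a (text to be restored, first text) pair by masking the same entity in both modalities; the entity box coordinates, mask token, and array shapes are purely illustrative.

```python
import numpy as np

def mask_entity_in_image(image, box, fill=0.0):
    """image: (H, W, C) array of the image to be trained; box: (x0, y0, x1, y1)
    position of an entity, located from the entity information in the first text."""
    masked = image.copy()
    x0, y0, x1, y1 = box
    masked[y0:y1, x0:x1, :] = fill                  # image to be restored (sample feature)
    return masked                                   # the original image is the sample label

def mask_entity_in_text(first_text, entity, mask_token="[MASK]"):
    """Mask the targeted entity's information in the first text."""
    return first_text.replace(entity, mask_token)   # text to be restored (sample feature)

first_text = "a yellow dog plays with a ball"
image = np.random.rand(64, 64, 3)
dog_box = (10, 10, 40, 40)                          # assumed position of the "yellow dog" entity

image_to_restore = mask_entity_in_image(image, dog_box)
text_to_restore = mask_entity_in_text(first_text, "yellow dog")
print(text_to_restore)                              # a [MASK] plays with a ball
```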
  • the above embodiment can obtain relatively good initial parameters by pre-training the image representation layer and the text representation layer, thereby improving the overall training effect of the image-text matching model compared to the case where the initial parameters are randomly determined.
  • the embodiment of the present invention does not limit the specific method of using the image-text matching model.
  • an image-text matching model may be used to determine whether the input image and the input text match.
  • the image-text matching model may also be used to determine, among the input texts, the text that matches the input image and thus describes its content, thereby extracting the content information in the image as text.
  • an image to be matched may be obtained; at least one preset text containing preset content information may be obtained; the preset content information in different preset texts is different.
  • the image to be matched and the at least one preset text may be input into an image-text matching model.
  • the image-text matching model is trained based on the above method embodiment.
  • the preset content information may specifically include entity information.
  • entity information may specifically include entity category and/or entity attribute.
  • the preset text may include text obtained by filling in entity information based on a preset text template.
  • the preset text template is, for example, "the image includes an xx”.
  • This embodiment does not limit the form of the output result of the image-text matching model.
  • the model output result may include the degree of matching between the input text and the input image, and may also include a prediction result indicating whether the input text and the input image match.
  • This embodiment does not limit the method of determining whether there is a preset text for describing the content in the to-be-matched image based on the model output result.
  • for example, a preset text whose matching degree output by the model is higher than a preset matching threshold and is the highest may be determined, and the preset content information in the determined preset text may then be determined to be content information contained in the image to be matched.
  • alternatively, the matching preset text may be determined based on the prediction result output by the model indicating whether the input text and the input image match, and the preset content information in the determined preset text may then be determined to be content information contained in the image to be matched.
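  • A sketch of how this usage might look in code is given below; the template string follows the example above, while the candidate entity list and the model's scoring interface are assumptions for illustration.

```python
def describe_image(model, image, entity_candidates,
                   template="the image includes an {}", threshold=0.5):
    """Fill the preset text template with each candidate's entity information,
    score every (image, preset text) pair with the image-text matching model,
    and keep the best match above the preset matching threshold."""
    best_text, best_score = None, threshold
    for entity in entity_candidates:
        preset_text = template.format(entity)
        score = model(image, preset_text)           # assumed to return a matching degree in [0, 1]
        if score > best_score:
            best_text, best_score = preset_text, score
    return best_text, best_score

# usage with a dummy model that prefers texts mentioning "dog"
def dummy_model(image, text):
    return 0.9 if "dog" in text else 0.1

text, score = describe_image(dummy_model, image=None,
                             entity_candidates=["dog", "cat", "fish"])
print(text, score)                                  # the image includes an dog 0.9
```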
  • the embodiment of the present invention also provides a device embodiment.
  • FIG. 2 is a schematic diagram of the structure of a training device for an image-text matching model according to an embodiment of the present invention.
  • the apparatus may include the following units.
  • the sample unit 201 is used to obtain positive samples and negative samples; the positive samples include text and images, and the included text is used to describe the content in the included image; the negative samples include text and images, and the included text description content does not match the content in the included image.
  • a training unit 202 used to train an image-text matching model based on contrastive learning by using the acquired positive and negative samples
  • the image-text matching model is used to predict whether the input text is used to describe the content in the input image, given the input text and image.
  • the image-text matching model includes: a text representation layer and an image representation layer;
  • the image-text matching model is used to: for the input text and image, use the text representation layer to obtain the text features of the input text, use the image representation layer to obtain the image features of the input image, and then based on the obtained text features and image features, predict whether the input text is used to describe the content in the input image.
  • sample unit 201 is used for:
  • Positive samples and negative samples are generated according to the correspondence set; positive samples include: text and images belonging to the same correspondence; negative samples include: text and images belonging to different correspondences; the text description content in the negative sample does not match the image content in the same negative sample.
  • sample unit 201 is used for:
  • a positive sample is generated based on the text and image in any one of the correspondences;
  • a negative sample is generated based on the text in that correspondence and the image in any other correspondence in the multiple groups of correspondences.
  • sample unit 201 is used for:
  • a positive sample is generated based on the text and image in each group of correspondences; based on the text in each group of correspondences and the images in the other N-1 groups of correspondences in the N groups of correspondences, N-1 negative samples are generated.
  • the text included in the positive sample is used to describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in the negative sample does not match the entity in the included image.
  • the text is used to describe entity categories and entity attributes in the corresponding image;
  • the sample unit 201 is used to:
  • the correspondences in the correspondence set with the same entity category and different entity attributes are determined as a first subset, and a first positive and negative sample set is generated according to the first subset; the correspondences with different entity categories and the same entity attributes are determined as a second subset, and a second positive and negative sample set is generated according to the second subset; the correspondences with different entity categories and different entity attributes are determined as a third subset, and a third positive and negative sample set is generated according to the third subset.
  • the first loss weight is less than the second loss weight, and the second loss weight is less than the third loss weight; the first loss weight is the loss function weight when the image-text matching model is trained using the first positive and negative sample sets; the second loss weight is the loss function weight when the image-text matching model is trained using the second positive and negative sample sets; the third loss weight is the loss function weight when the image-text matching model is trained using the third positive and negative sample sets.
  • the text representation layer is used to extract entity feature information from the input text.
  • the text representation layer is used to perform text encoding on the input text and then extract entity feature information from the text encoding result.
  • the above device further comprises a text pre-training unit 203, which is used to pre-determine entity information contained in the text to be trained before training the image-text matching model;
  • the text pre-training unit 203 is used to:
  • the determined head entity and tail entity are mapped into the relational space to obtain head entity features and tail entity features;
  • the obtained head entity features and tail entity features are determined as corresponding entity labels.
  • the apparatus further comprises an image pre-training unit 204, which is used to: before training the image-text matching model, determine the position of at least one entity in the image to be trained according to the entity information contained in the first text used to describe the content of the image to be trained;
  • the image to be restored is used as the sample feature, and the image to be trained is used as the sample label to pre-train the image representation layer;
  • the image representation layer is used to extract entity feature information from the input image.
  • the apparatus further comprises a text pre-training unit 203, which is used to: before training the image-text matching model, for any masked entity in the image to be restored, mask the information of the targeted entity in the first text to obtain the text to be restored;
  • a text pre-training unit 203 which is used to: before training the image-text matching model, for any masked entity in the image to be restored, mask the information of the targeted entity in the first text to obtain the text to be restored;
  • the text representation layer is used to extract entity feature information from the input text.
  • An embodiment of the present invention further provides a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, any one of the above method embodiments is implemented.
  • An embodiment of the present invention also provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any of the above-mentioned method embodiments.
  • FIG. 3 is a schematic diagram of the hardware structure of a computer device configured with a method according to an embodiment of the present invention, and the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050.
  • the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are connected to each other in communication within the device through the bus 1050.
  • the processor 1010 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, etc., to execute relevant programs to implement the technical solutions provided in the embodiments of the present invention.
  • the memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc.
  • the memory 1020 may store an operating system and other application programs.
  • the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
  • the input/output interface 1030 is used to connect the input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or it can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, etc.
  • the communication interface 1040 is used to connect a communication module (not shown) to realize communication interaction between the device and other devices.
  • the communication module can realize communication through a wired mode (such as USB, network cable, etc.) or a wireless mode (such as mobile network, WIFI, Bluetooth, etc.).
  • the bus 1050 includes a path that transmits information between the various components of the device (eg, the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
  • the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in the specific implementation process, the device may also include other components necessary for normal operation.
  • the above device may also only include the components necessary for implementing the embodiments of the present invention, and does not necessarily include all the components shown in the figure.
  • An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements any one of the above method embodiments.
  • An embodiment of the present invention further provides a computer-readable storage medium storing a computer program, wherein the computer program implements any one of the above method embodiments when executed by a processor.
  • Computer readable media include permanent and non-permanent, removable and non-removable media that can be implemented by any method or technology to store information.
  • Information can be computer readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device.
  • as defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • the embodiments of the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence, or the part that makes a contribution, can be embodied in the form of a software product; the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or in some parts of the embodiments of the present invention.
  • a typical implementation device is a computer, which may be in the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet computer, a wearable device or a combination of any of these devices.
  • each embodiment in this specification is described in a progressive manner; the same and similar parts between the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments.
  • in particular, since the apparatus embodiment is basically similar to the method embodiment, its description is relatively simple, and for relevant parts reference may be made to the description of the method embodiment.
  • the apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separated, and the functions of the modules may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present invention. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the embodiments. Those of ordinary skill in the art can understand and implement this without creative effort.
  • the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
  • "plurality" refers to two or more than two, unless otherwise clearly defined.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a training method, apparatus, device and storage medium for an image-text matching model. The method includes: obtaining positive samples and negative samples, where a positive sample includes a text and an image and the included text describes the content of the included image, and a negative sample includes a text and an image and the content described by the included text does not match the content of the included image; and training the image-text matching model based on contrastive learning, using the obtained positive and negative samples. The image-text matching model is used to predict, for an input text and an input image, whether the input text describes the content of the input image.

Description

Training method, apparatus, device and storage medium for an image-text matching model
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular to a training method, apparatus, device and storage medium for an image-text matching model.
Background
At present, in multimodal learning there is usually a need to match images and texts: for an image and a text, it can be judged whether the text is close to a description of the image content, so as to associate the image with the text.
However, when training an image-text matching model, it is often difficult to collect training samples. A matching text usually has to be written manually for each image, so the number of training samples is small, and the training effect of the image-text matching model is poor.
Summary
The present invention provides a training method, apparatus, device and storage medium for an image-text matching model, to address the above deficiencies in the related art.
According to a first aspect of the embodiments of the present invention, a training method for an image-text matching model is provided, including:
obtaining positive samples and negative samples; a positive sample includes a text and an image, and the included text describes the content of the included image; a negative sample includes a text and an image, and the content described by the included text does not match the content of the included image;
training the image-text matching model based on contrastive learning, using the obtained positive and negative samples;
the image-text matching model is used to predict, for an input text and an input image, whether the input text describes the content of the input image.
Optionally, the image-text matching model includes a text representation layer and an image representation layer;
the image-text matching model is used to: for an input text and an input image, obtain text features of the input text using the text representation layer, obtain image features of the input image using the image representation layer, and then predict, based on the obtained text features and image features, whether the input text describes the content of the input image.
Optionally, obtaining positive samples and negative samples includes:
obtaining a set of correspondences between texts and images, where in any correspondence between a text and an image, the text describes the content of the corresponding image;
generating positive samples and negative samples according to the set of correspondences; a positive sample includes a text and an image belonging to the same correspondence; a negative sample includes a text and an image belonging to different correspondences; the content described by the text in a negative sample does not match the content of the image in the same negative sample.
Optionally, generating positive samples and negative samples according to the set of correspondences includes:
determining multiple correspondences from the set of correspondences, where between any two of the determined correspondences the images and the texts are different;
for any one of the multiple correspondences, generating a positive sample based on the text and the image in that correspondence;
generating a negative sample based on the text in that correspondence and the image in any other one of the multiple correspondences.
Optionally, generating positive samples and negative samples according to the set of correspondences includes:
determining N correspondences from the set of correspondences, where between any two of the determined N correspondences the images and the texts are different;
for each of the N correspondences, generating one positive sample based on the text and the image in that correspondence, and generating N-1 negative samples based on the text in that correspondence and the images in the other N-1 correspondences.
Optionally, the text included in a positive sample describes the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in a negative sample does not match the entity in the included image.
Optionally, in any correspondence between a text and an image, the text describes the entity category and the entity attribute in the corresponding image;
generating positive samples and negative samples according to the set of correspondences includes:
determining, in the set of correspondences, the correspondences with the same entity category but different entity attributes as a first subset, and generating a first positive-negative sample set according to the first subset;
determining, in the set of correspondences, the correspondences with different entity categories but the same entity attribute as a second subset, and generating a second positive-negative sample set according to the second subset;
determining, in the set of correspondences, the correspondences with different entity categories and different entity attributes as a third subset, and generating a third positive-negative sample set according to the third subset.
Optionally, a first loss weight is smaller than a second loss weight, and the second loss weight is smaller than a third loss weight; the first loss weight is the loss function weight used when training the image-text matching model with the first positive-negative sample set; the second loss weight is the loss function weight used when training the image-text matching model with the second positive-negative sample set; the third loss weight is the loss function weight used when training the image-text matching model with the third positive-negative sample set.
Optionally, the text representation layer is used to extract entity feature information from the input text.
Optionally, the text representation layer is used to encode the input text, and then extract entity feature information from the text encoding result.
Optionally, before training the image-text matching model, the method further includes:
determining in advance, according to entity information contained in a first text describing the content of an image to be trained, the position of at least one entity in the image to be trained;
masking at least one entity in the image to be trained to obtain at least one image to be restored;
pre-training the image representation layer with the image to be restored as the sample feature and the image to be trained as the sample label.
Optionally, before training the image-text matching model, the method further includes:
for an entity masked in any image to be restored, masking the information of that entity in the first text to obtain a text to be restored;
pre-training the text representation layer with the obtained text to be restored as the sample feature and the first text as the sample label.
According to a second aspect of the embodiments of the present invention, a training apparatus for an image-text matching model is provided, including:
a sample unit, configured to obtain positive samples and negative samples; a positive sample includes a text and an image, and the included text describes the content of the included image; a negative sample includes a text and an image, and the content described by the included text does not match the content of the included image;
a training unit, configured to train the image-text matching model based on contrastive learning, using the obtained positive and negative samples;
the image-text matching model is used to predict, for an input text and an input image, whether the input text describes the content of the input image.
Optionally, the image-text matching model includes a text representation layer and an image representation layer;
the image-text matching model is used to: for an input text and an input image, obtain text features of the input text using the text representation layer, obtain image features of the input image using the image representation layer, and then predict, based on the obtained text features and image features, whether the input text describes the content of the input image.
Optionally, the sample unit is configured to:
obtain a set of correspondences between texts and images, where in any correspondence between a text and an image, the text describes the content of the corresponding image;
generate positive samples and negative samples according to the set of correspondences; a positive sample includes a text and an image belonging to the same correspondence; a negative sample includes a text and an image belonging to different correspondences; the content described by the text in a negative sample does not match the content of the image in the same negative sample.
Optionally, the sample unit is configured to:
determine multiple correspondences from the set of correspondences, where between any two of the determined correspondences the images and the texts are different;
for any one of the multiple correspondences, generate a positive sample based on the text and the image in that correspondence;
generate a negative sample based on the text in that correspondence and the image in any other one of the multiple correspondences.
Optionally, the sample unit is configured to:
determine N correspondences from the set of correspondences, where between any two of the determined N correspondences the images and the texts are different;
for each of the N correspondences, generate one positive sample based on the text and the image in that correspondence, and generate N-1 negative samples based on the text in that correspondence and the images in the other N-1 correspondences.
Optionally, the text included in a positive sample describes the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in a negative sample does not match the entity in the included image.
Optionally, in any correspondence between a text and an image, the text describes the entity category and the entity attribute in the corresponding image;
the sample unit is configured to:
determine, in the set of correspondences, the correspondences with the same entity category but different entity attributes as a first subset, and generate a first positive-negative sample set according to the first subset;
determine, in the set of correspondences, the correspondences with different entity categories but the same entity attribute as a second subset, and generate a second positive-negative sample set according to the second subset;
determine, in the set of correspondences, the correspondences with different entity categories and different entity attributes as a third subset, and generate a third positive-negative sample set according to the third subset.
Optionally, a first loss weight is smaller than a second loss weight, and the second loss weight is smaller than a third loss weight; the first loss weight is the loss function weight used when training the image-text matching model with the first positive-negative sample set; the second loss weight is the loss function weight used when training the image-text matching model with the second positive-negative sample set; the third loss weight is the loss function weight used when training the image-text matching model with the third positive-negative sample set.
Optionally, the text representation layer is used to extract entity feature information from the input text.
Optionally, the text representation layer is used to encode the input text, and then extract entity feature information from the text encoding result.
Optionally, the apparatus further includes an image pre-training unit, configured to: before training the image-text matching model, determine in advance, according to entity information contained in a first text describing the content of an image to be trained, the position of at least one entity in the image to be trained;
mask at least one entity in the image to be trained to obtain at least one image to be restored;
pre-train the image representation layer with the image to be restored as the sample feature and the image to be trained as the sample label.
Optionally, the apparatus further includes a text pre-training unit, configured to: before training the image-text matching model, for an entity masked in any image to be restored, mask the information of that entity in the first text to obtain a text to be restored;
pre-train the text representation layer with the obtained text to be restored as the sample feature and the first text as the sample label.
According to the above embodiments, by training the image-text matching model with positive and negative samples and contrastive learning, the number of samples is increased through the introduction of negative samples, and the training effect of the image-text matching model is improved.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present invention.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.
FIG. 1 is a schematic flowchart of a training method for an image-text matching model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a training apparatus for an image-text matching model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the hardware structure of a computer device configured with the method of an embodiment of the present invention.
Detailed Description of the Embodiments
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
In multimodal learning, there is usually a need to match images and texts: for an image and a text, it can be judged whether the text is close to a description of the image content, so as to associate the image with the text.
However, when training an image-text matching model, it is often difficult to collect training samples. A matching text usually has to be written manually for each image, so the number of training samples is small, and the training effect of the image-text matching model is poor.
An embodiment of the present invention discloses a training method for an image-text matching model.
In this method, the image-text matching model can be trained by contrastive learning, so positive samples and negative samples need to be obtained for the contrastive learning.
A positive sample includes a text and an image, and the included text can describe the content of the included image, so it can be determined that the text and the image in the positive sample are associated.
For example, a positive sample may contain an image whose content is "a dog and a person playing together" and the text "a person playing with a dog".
A negative sample includes a text and an image, and the included text does not match the content of the included image, so it can be determined that the text and the image in the negative sample are not associated.
For example, a negative sample may contain an image whose content is "a dog and a person playing" and the text "a store receiving goods".
Compared with positive samples, negative samples are relatively easy to obtain.
For example, for one positive sample, the included text can be replaced with multiple different texts to generate multiple negative samples.
Therefore, the image-text matching model can be trained using positive and negative samples and contrastive learning, so that the number of negative samples, and thus the total number of samples, can be conveniently increased, improving the training effect of the image-text matching model.
A training method for an image-text matching model provided by an embodiment of the present invention is explained in detail below.
As shown in FIG. 1, FIG. 1 is a schematic flowchart of a training method for an image-text matching model according to an embodiment of the present invention.
The embodiment of the present invention does not limit the execution subject of this method flow. Optionally, the execution subject may be any computing device, for example, a server used for image-text matching.
The method may include the following steps.
S101: obtaining positive samples and negative samples.
Optionally, a positive sample includes a text and an image, and the text included in the positive sample may describe the content of the included image.
Optionally, a negative sample includes a text and an image, and the content described by the text included in the negative sample may not match the content of the included image.
S102: training the image-text matching model based on contrastive learning, using the obtained positive and negative samples.
Optionally, the image-text matching model may be used to predict, for an input text and an input image, whether the input text describes the content of the input image.
In the above method flow, the image-text matching model is trained using positive and negative samples and contrastive learning; by introducing negative samples, the number of samples is increased and the training effect of the image-text matching model is improved.
Since negative samples are easy to obtain, their number can be conveniently increased, further increasing the number of samples and improving the training effect of the image-text matching model.
Each step is explained in detail below.
1. S101: obtaining positive samples and negative samples.
This method flow does not limit the numbers of positive samples and negative samples obtained. Optionally, at least one positive sample and at least one negative sample may be obtained.
Optionally, the text included in any positive sample may describe the content of the included image; the content described by the text included in any negative sample may not match the content of the included image.
Optionally, the text included in each positive sample may describe the content of the included image; the content described by the text included in each negative sample may not match the content of the included image.
This method flow does not limit the specific way of obtaining positive and negative samples, as long as the text included in a positive sample describes the content of the included image, and the content described by the text included in a negative sample does not match the content of the included image.
Optionally, positive samples may be obtained from the Internet, specifically by searching the Internet for images corresponding to a text and using the found images as the images contained in positive samples.
For example, for the text "a puppy swimming", that text may be searched directly on the Internet to obtain an image containing the content "a puppy swimming", and a positive sample may then be generated by combining the text and the obtained image.
Optionally, positive samples may also be obtained directly from a dataset of matched images and texts, in which the text included in each data item describes the content of the included image.
Optionally, a text describing the content of an image may also be written manually for the image, and a positive sample may be obtained by combining the image and the generated text.
For example, for an image containing the content "a kitten eating", the text "a kitten eating" may be written manually.
Optionally, negative samples may be generated based on positive samples, specifically by replacing the text included in a positive sample with another text whose content is completely different, so as to obtain a negative sample.
For example, for the text "a puppy swimming" and an image whose content is "a puppy swimming", the text "a kitten eating" may be taken directly and combined with the image whose content is "a puppy swimming" to obtain a negative sample. The taken text may also be "a fish swimming", "clouds", "opening a window", or any other text completely unrelated to "a puppy swimming".
Optionally, negative samples may also be generated manually, specifically by manually writing, for an image, a text that does not match the content of that image.
For example, for an image containing the content "a kitten eating", the text "a fish swimming" or "a puppy swimming" may be written manually, and a negative sample may be obtained by combining the image with the manually written text.
In an optional embodiment, the association between texts and images may be obtained first, so that positive and negative samples can be generated based on the association, improving the efficiency of sample generation.
Optionally, obtaining positive samples and negative samples may include: obtaining a set of correspondences between texts and images, where in any correspondence between a text and an image, the text describes the content of the corresponding image; and generating positive samples and negative samples according to the set of correspondences.
A positive sample may include a text and an image belonging to the same correspondence; a negative sample may include a text and an image belonging to different correspondences; and the content described by the text in a negative sample does not match the content of the image in the same negative sample.
For example, the correspondences between texts and images may include: the text "a puppy swimming" and an image whose content is "a puppy swimming"; the text "a kitten eating" and an image whose content is "a kitten eating"; the text "a yellow dog playing with a ball" and an image whose content is "a yellow dog playing with a ball"; and so on.
Optionally, generating positive samples and negative samples according to the set of correspondences may include: determining multiple correspondences from the set, where between any two of the determined correspondences the images and the texts are different; for any one of the multiple correspondences, generating a positive sample based on the text and the image in that correspondence; and generating a negative sample based on the text in that correspondence and the image in any other one of the multiple correspondences.
Since a correspondence between a text and an image already indicates that, in any correspondence, the text describes the content of the corresponding image, the text and the image contained in any correspondence can be directly combined into a positive sample.
As for negative sample generation, to improve efficiency, when the images and texts of the determined correspondences are all different from one another, an image and a text belonging to different correspondences can be directly combined into a negative sample.
Optionally, to conveniently increase the number of negative samples, multiple negative samples may be generated for the same one of the determined correspondences.
Optionally, generating positive samples and negative samples according to the set of correspondences may include: determining N correspondences from the set, where between any two of the determined N correspondences the images and the texts are different; and for each of the N correspondences, generating one positive sample based on the text and the image in that correspondence, and generating N-1 negative samples based on the text in that correspondence and the images in the other N-1 correspondences.
This embodiment can increase the number and efficiency of negative sample generation, which facilitates improving the subsequent training effect of the image-text matching model.
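In a specific example, the sample-construction scheme described above can be written as the following minimal Python sketch; the data structures and the function name are illustrative assumptions and are not part of the original disclosure.

```python
def build_samples(correspondences):
    """correspondences: list of (text, image) pairs in which each text
    describes the paired image, and all texts and images are distinct."""
    positives, negatives = [], []
    for i, (text, image) in enumerate(correspondences):
        # The pair itself forms one positive sample.
        positives.append({"text": text, "image": image, "label": 1})
        # Pairing the text with every other image yields N-1 negative samples.
        for j, (_, other_image) in enumerate(correspondences):
            if j != i:
                negatives.append({"text": text, "image": other_image, "label": 0})
    return positives, negatives

# Example: 3 correspondences yield 3 positive samples and 3 * 2 = 6 negative samples.
corr = [("a puppy swimming", "img_dog_swim.jpg"),
        ("a kitten eating", "img_cat_eat.jpg"),
        ("a yellow dog playing with a ball", "img_dog_ball.jpg")]
pos, neg = build_samples(corr)
print(len(pos), len(neg))  # 3 6
```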
In addition, in an optional embodiment, to improve the training effect of the image-text matching model, the image content described by the text may include entity-related information in the image.
Since entity features in an image are relatively easy to extract, the text can be restricted to describing entity-related information in the corresponding image, which facilitates improving the training effect of the image-text matching model.
Optionally, in any correspondence between a text and an image, the text may describe the entity category and the entity attribute in the corresponding image.
Optionally, the entity category may be the classification of an entity, for example animal, plant or object, or, more specifically, cat, dog, bird, and so on.
Optionally, the entity attribute may be a property of the entity itself, for example the color or size of the entity.
Optionally, the text included in a positive sample may describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in a negative sample may not match the entity in the included image.
Optionally, the text included in any positive sample may describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in any negative sample may not match the entity in the included image.
Optionally, the text included in each positive sample may describe the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in each negative sample may not match the entity in the included image.
Specifically, the entity category in the text included in a negative sample may differ from the entity category in the included image; or the entity attribute in the text may differ from the entity attribute in the image; or both the entity category and the entity attribute in the text may differ from those in the image.
Optionally, since the obtained positive and negative samples are used for subsequent contrastive learning, the core idea of contrastive learning may include: pulling positive samples closer together and pushing positive samples away from negative samples.
For negative samples, the image-text matching model pays more attention to an entity category that differs from the image entity than to an entity attribute that differs from the image entity.
For example, the distance between the negative sample consisting of an image containing "a yellow dog swimming" and the text "a black dog swimming", and the positive sample consisting of the same image and the text "a yellow dog swimming", may be smaller than the distance for the negative sample consisting of the same image and the text "a little fish swimming". This is because the texts "a black dog swimming" and "a yellow dog swimming" share the same entity category but differ in entity attribute, whereas "a little fish swimming" and "a yellow dog swimming" differ in both entity category and entity attribute.
Therefore, negative samples can be divided into types and trained with different loss function weights, which can increase the model's attention to, and sensitivity in identifying, the entity category, and improve the training effect of the image-text matching model.
Optionally, generating positive samples and negative samples according to the set of correspondences may include: determining, in the set of correspondences, the correspondences with the same entity category but different entity attributes as a first subset, and generating a first positive-negative sample set according to the first subset.
The correspondences with different entity categories but the same entity attribute are determined as a second subset, and a second positive-negative sample set is generated according to the second subset.
The correspondences with different entity categories and different entity attributes are determined as a third subset, and a third positive-negative sample set is generated according to the third subset.
For example, correspondences whose texts all have the entity category "dog" but different entity attributes, such as "yellow dog", "big dog" and "black dog", can be collected from the set of correspondences to obtain the first subset, from which positive and negative samples can be generated.
Correspondences whose texts all have the entity attribute "yellow" but different entity categories, such as "yellow dog", "yellow cat" and "yellow fish", can also be collected to obtain the second subset, from which positive and negative samples can be generated.
Correspondences whose texts differ in both entity category and entity attribute, such as "yellow dog", "black fish" and "white cat", can also be collected to obtain the third subset, from which positive and negative samples can be generated.
In an optional embodiment, the model may be trained separately on the above classified positive-negative sample sets. Optionally, the image-text matching model may be trained with different loss functions for the different sets.
Optionally, the first loss weight is smaller than the second loss weight, and the second loss weight is smaller than the third loss weight.
The first loss weight is the loss function weight used when training the image-text matching model with the first positive-negative sample set.
The second loss weight is the loss function weight used when training the image-text matching model with the second positive-negative sample set.
The third loss weight is the loss function weight used when training the image-text matching model with the third positive-negative sample set.
This embodiment does not limit the way of generating the positive-negative sample sets from the first, second and third subsets. Optionally, each of the first, second and third subsets may be regarded as the "multiple correspondences" determined in the above embodiments, from which positive and negative samples are generated.
A negative sample in the first positive-negative sample set contains a text whose entity category is the same as the image entity but whose entity attribute differs from the image entity.
A negative sample in the second positive-negative sample set contains a text whose entity category differs from the image entity but whose entity attribute is the same as the image entity.
A negative sample in the third positive-negative sample set contains a text whose entity category differs from the image entity and whose entity attribute also differs from the image entity.
By distinguishing different kinds of negative samples and training the image-text matching model with different loss function weights, this embodiment can increase the model's attention to, and sensitivity in identifying, the entity category, and improve the training effect of the image-text matching model.
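In a specific example, applying different loss function weights to the three positive-negative sample sets can be sketched as follows. The weight values, the binary cross-entropy matching loss and the model call signature are illustrative assumptions; the original text only requires that the first weight is smaller than the second and the second smaller than the third.

```python
import torch.nn.functional as F

# Assumed weights satisfying: first < second < third.
LOSS_WEIGHTS = {"same_category_diff_attribute": 0.5,   # first sample set
                "diff_category_same_attribute": 1.0,   # second sample set
                "diff_category_diff_attribute": 1.5}   # third sample set

def weighted_matching_loss(model, batches):
    """batches maps each sample-set name to (texts, images, labels) tensors,
    where labels are 1 for positive samples and 0 for negative samples."""
    total = 0.0
    for name, (texts, images, labels) in batches.items():
        logits = model(texts, images)  # one matching score per text-image pair
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        total = total + LOSS_WEIGHTS[name] * loss  # larger weight, larger penalty
    return total
```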
2. S102: training the image-text matching model based on contrastive learning, using the obtained positive and negative samples.
(1) About the image-text matching model.
This method flow does not limit the specific structure of the image-text matching model.
Optionally, the image-text matching model may be a neural network model, or another type of model.
Optionally, in order to match images and texts, the input of the model may include the image and the text to be matched. Therefore, the image-text matching model may set up, for the two input data types image and text, separate representation layers to extract features.
Optionally, the image-text matching model may include a text representation layer and an image representation layer.
Optionally, the image-text matching model may be used to: for an input text and an input image, obtain text features of the input text using the text representation layer, obtain image features of the input image using the image representation layer, and then predict, based on the obtained text features and image features, whether the input text describes the content of the input image.
Optionally, the image-text matching model may further include an intermediate layer and an output layer. Specifically, the obtained text features and image features may be input to the intermediate layer for processing, and the processing result may be input to the output layer, which outputs a prediction result indicating whether the input text describes the content of the input image.
Optionally, predicting, based on the obtained text features and image features, whether the input text describes the content of the input image may specifically be: first fusing the obtained text features and image features, and then making the prediction on the fused feature result.
This embodiment does not limit the way of fusing the text features and the image features. Optionally, the text features and the image features may be concatenated, multiplied, summed, and so on.
Optionally, the intermediate layer of the image-text matching model may be used to fuse the input text features and image features.
By fusing the text features and the image features, this embodiment can fuse text feature information and image feature information within the model, which facilitates learning and mining the association between the text and the image.
This method flow does not limit the output of the image-text matching model, as long as it can indicate whether the input text describes the content of the input image.
Optionally, the output may be the classification result of positive/negative samples, the probability of positive/negative samples, the result or probability of whether the image and the text match, or the matching degree or similarity between the input image and the input text, and so on.
This method flow also does not limit the structures of the intermediate layer and the output layer of the image-text matching model.
Optionally, the intermediate layer may include a fully connected layer, and the output layer may include a softmax layer.
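In a specific example, the structure described above (a text representation layer, an image representation layer, fusion of the two features, a fully connected intermediate layer and a softmax output layer) can be sketched as follows. The encoders, the feature dimension and the choice of concatenation for fusion are illustrative assumptions; the original text also allows fusion by product or sum.

```python
import torch
import torch.nn as nn

class ImageTextMatcher(nn.Module):
    def __init__(self, text_encoder, image_encoder, feat_dim=512):
        super().__init__()
        self.text_encoder = text_encoder    # text representation layer
        self.image_encoder = image_encoder  # image representation layer
        self.fc = nn.Linear(feat_dim * 2, 256)  # intermediate fully connected layer
        self.out = nn.Linear(256, 2)            # two classes: match / no match

    def forward(self, text, image):
        t = self.text_encoder(text)            # text features
        v = self.image_encoder(image)          # image features
        fused = torch.cat([t, v], dim=-1)      # fuse the two features by concatenation
        h = torch.relu(self.fc(fused))
        return torch.softmax(self.out(h), dim=-1)  # prediction over match / no match
```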
This method flow does not limit the text representation layer and the image representation layer of the image-text matching model; a specific explanation is given later.
In addition, in an optional embodiment, to improve the training effect and prediction accuracy of the image-text matching model, the model may be restricted to focus on identifying whether the entity-related information in the input text is consistent with the entity in the input image.
Specifically, this may include whether the entity category and entity attribute in the input text are the same as the entity category and entity attribute in the input image.
Correspondingly, and optionally, the output of the image-text matching model may also indicate whether the entity category and entity attribute in the input text are the same as the entity category and entity attribute in the input image.
For example, the output of the image-text matching model may include: the entity category and entity attribute in the input text are the same as those in the input image; the entity category in the input text is the same as that in the input image but the entity attribute is different; the entity attribute is the same but the entity category is different; or both the entity category and the entity attribute in the input text differ from those in the input image.
In this embodiment, since features of entities in images and texts are relatively easy to extract for recognition and matching, whether the input text describes the content of the input image can be determined by focusing on whether the entity information in the text and in the image is the same, which improves the training effect and prediction accuracy of the image-text matching model.
In a specific example, a text with a fixed template may be set in advance, and entity-related information may be filled into the fixed text template, for example "there is a xx in the picture", where xx may include entity-related information, specifically the entity category, or both the entity category and the entity attribute.
For example, "there is a cat in the picture", "there is a pig in the picture", "there is a dog in the picture", "there is a yellow cat in the picture", "there is a black dog in the picture", "there is a big dog in the picture", "there is a small cat in the picture", and so on.
The same image to be matched may be matched against the above texts containing different entity-related information, so that the text matching the image to be matched can be determined conveniently from the matching results, for example the similarity between the image and each text, and the entity-related information in the image to be matched can then be determined.
Of course, and optionally, the entity category may first be filled into the text template and input into the image-text matching model together with the image to be matched, to determine whether a matching entity category exists. After the matching entity category is determined, the entity attribute may be filled into the text template, which is again input into the image-text matching model and matched against the image to be matched, to determine whether a matching entity attribute exists.
By matching the entity category and the entity attribute separately, this embodiment can improve the efficiency of the image-text matching model.
(2) About contrastive learning.
This method flow does not limit the way of contrastive learning.
Optionally, the image-text matching model may be trained by clustering the positive samples and the negative samples separately. Correspondingly, the loss function of the image-text matching model may include a cross-entropy function, so that the positive/negative classification of an input image-text pair can be determined.
Optionally, the image-text matching model may be trained with the goal of reducing the distance between the mapping results of different positive samples and increasing the distance between the mapping results of positive samples and negative samples.
Optionally, specifically, multiple input positive samples may be mapped into a vector space by the image-text matching model, and the value of the loss function may be set to be positively correlated with the distance between the mapping results of the positive samples, so that reducing the loss value reduces the distance between the mapping results of the positive samples.
Optionally, multiple input positive samples and multiple negative samples may also be mapped into a vector space by the image-text matching model, and the value of the loss function may be set to be positively correlated with the distance between the mapping results of the positive samples and negatively correlated with the distance between the mapping results of the positive samples and those of the negative samples, so that reducing the loss value reduces the distance between positive samples and increases the distance between positive samples and negative samples.
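In a specific example, a loss with the properties described above (positively correlated with the distance between positive-sample mapping results and negatively correlated with the distance between positive-sample and negative-sample mapping results) can be sketched as follows; the margin form is an assumption, since the original text does not fix a concrete formula.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_emb, neg_emb, margin=1.0):
    """pos_emb: (P, d) mapping results of positive samples in the vector space;
    neg_emb: (N, d) mapping results of negative samples."""
    pos_dist = torch.cdist(pos_emb, pos_emb).mean()  # distance among positives (to be reduced)
    neg_dist = torch.cdist(pos_emb, neg_emb).mean()  # positive-negative distance (to be increased)
    # The loss grows with pos_dist and shrinks as neg_dist grows, up to the margin.
    return pos_dist + F.relu(margin - neg_dist)
```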
(3) About the text representation layer and the image representation layer.
This method flow does not limit the structure and training method of the text representation layer and the image representation layer.
Optionally, the image representation layer may include several convolutional layers; the text representation layer may also include several convolutional layers, or a self-attention layer, and so on.
Optionally, the image representation layer and the text representation layer may start training directly along with the training of the image-text matching model, or may be pre-trained on samples in advance to determine relatively good initial parameters, thereby improving the overall training effect of the image-text matching model.
This method flow does not limit the pre-training method of the image representation layer and the text representation layer.
Optionally, an image model with a business requirement may be trained with image samples, and its representation layer may be extracted and determined as the image representation layer.
For example, an image object detection model may be trained with image samples annotated with detection boxes, and its representation layer may then be extracted; or an image recognition model may be trained with image samples annotated with entity content labels, and its representation layer may then be extracted.
Optionally, a text model with a business requirement may also be trained with text samples, and its representation layer may be extracted and determined as the text representation layer.
For example, a text entity information extraction model may be trained with text samples annotated with entity labels, and its representation layer may then be extracted; or a text content extraction model may be trained with text samples annotated with content labels, and its representation layer may then be extracted.
Optionally, the text representation layer may use static encoding, such as word2vec (a small-parameter model), or dynamic encoding, such as BERT (a large-parameter model).
In an optional embodiment, since the subsequent image-text matching model may pay more attention to the entity-related information in the input image and the input text, the text representation layer and the image representation layer may be used to extract entity-related information.
Optionally, the text representation layer may be used to extract entity feature information from the input text.
Optionally, the text representation layer may be used to extract entity feature information from the triple information (head entity, relation, tail entity) in the input text, specifically including head entity feature information and tail entity feature information. Extracting the entity feature information may include encoding the entity information, and the entity feature information may include the entity information encoding result.
Optionally, the text representation layer may be used to encode the triple information (head entity, relation, tail entity) in the input text, obtaining a head entity encoding result and a tail entity encoding result, which may then be determined as the entity feature information.
Optionally, the text representation layer may be used to encode the entity information in the input text, and the entity information may include triple information (head entity, relation, tail entity).
Optionally, the image representation layer may be used to extract entity feature information from the input image.
Optionally, the image representation layer may be used to extract features of the input image, where the extracted image features may include entity feature information in the image, or may be related to the entities in the image.
This embodiment does not limit the specific structure of the text representation layer.
Optionally, the text representation layer may be used to encode the input text, and then extract entity feature information from the text encoding result.
Optionally, the text representation layer may be used to encode the input text, determine the entity information in the input text, and then extract entity feature information from the part of the text encoding result corresponding to the entity information.
This embodiment does not limit the text encoding method; any text encoding model may be used. Optionally, static encoding such as word2vec (a small-parameter model) or dynamic encoding such as BERT (a large-parameter model) may be used, and models such as RNN, CNN, LSTM or self-attention models may also be used for text encoding.
This embodiment does not limit the way of determining the entity information in the input text. Optionally, it may be determined using a knowledge graph, or determined by the text representation layer itself.
A knowledge graph may be used to determine the triple (head entity, relation, tail entity) in the input text.
This embodiment does not limit the way of extracting the entity feature information. Optionally, a knowledge graph embedding model (a translate-based algorithm) may be used to extract the entity feature information.
Knowledge graph embedding models (translate-based algorithms) may include TransE, TransH, TransR, TransD, and so on.
Optionally, extracting the entity feature information may include encoding the entity information in the input text, specifically encoding the triple (head entity, relation, tail entity) in the input text.
Optionally, the triple in the input text may be determined using a knowledge graph.
In this embodiment, by encoding the triple in the input text, the encoding result (that is, the entity features) can contain the information of the knowledge graph. The knowledge graph information may include the triple information, specifically the relation information between the head entity and the tail entity.
In a specific example, triples may be extracted from texts to generate a knowledge graph. A triple is represented as (head entity, relation, tail entity). For the example "an Alaskan Malamute playing with a ball", the extraction result may be (dog, play, ball), where "Alaskan Malamute" is an attribute of "dog".
The text representation layer may encode the input text and then determine the part of the text encoding corresponding to the triple; a pre-trained TransR model may then be used, for that part of the text encoding, to project the head entity and the tail entity into the relation space through a projection matrix, obtaining the head entity mapping result and the tail entity mapping result as the entity feature information of the text representation layer.
The TransR model is trained as follows: for each triple (h, r, t), the head entity and the tail entity are projected into the relation space through a projection matrix, obtaining the head entity mapping result and the tail entity mapping result.
The final evaluation function is:
f_r(h, t) = ‖hM_r + r − tM_r‖₂², where M_r is the projection matrix of the relation r, and hM_r and tM_r are the mapping results of the head entity h and the tail entity t in the relation space.
The model is trained so that the evaluation function reaches its minimum value.
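In a specific example, the TransR projection and evaluation function described above can be sketched as follows; the standard TransR score and the variable names are assumptions made for illustration.

```python
import numpy as np

def transr_score(h, t, r, M_r):
    """h, t: head/tail entity embeddings of shape (d_e,); r: relation embedding
    of shape (d_r,); M_r: projection matrix of relation r, shape (d_e, d_r)."""
    h_r = h @ M_r  # head entity mapping result in the relation space
    t_r = t @ M_r  # tail entity mapping result in the relation space
    return np.sum((h_r + r - t_r) ** 2)  # smaller value, more plausible triple

# Training drives this score toward its minimum for valid triples such as
# (dog, play, ball) extracted from "an Alaskan Malamute playing with a ball".
```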
In an optional embodiment, pre-training the image representation layer and the text representation layer can make the entity feature representation more accurate.
Optionally, the text representation layer may be trained by annotating texts with entity labels, specifically by annotating through a knowledge graph.
Optionally, before training the image-text matching model, the above method flow may further include: determining in advance the entity information contained in a text to be trained; determining the corresponding entity label according to the determined entity information; and pre-training the text representation layer using the text to be trained and the corresponding entity label.
Specifically, pre-training the text representation layer may be pre-training a text entity information extraction model, from which the representation part is extracted and determined as the text representation layer.
This embodiment does not limit the form of the entity information. Optionally, it may include the entity category and/or the entity attribute.
Optionally, determining the entity information contained in the text to be trained and determining the corresponding entity label according to the determined entity information may include: determining the head entity, the relation and the tail entity contained in the text to be trained; mapping the determined head entity and tail entity into the relation space using a pre-trained mapping model, to obtain head entity features and tail entity features; and determining the obtained head entity features and tail entity features as the corresponding entity labels.
This embodiment uses a knowledge graph to obtain the head entity, relation and tail entity in the text, determines the entity information in the text, and then obtains the entity features as labels through feature mapping.
In a specific example, triples may be extracted from texts to generate a knowledge graph.
A triple is represented as (head entity, relation, tail entity).
For the example "an Alaskan Malamute playing with a ball", the extraction result may be (dog, play, ball), where "Alaskan Malamute" is an attribute of "dog".
The TransR algorithm may then be used to train the triple embeddings, specifically as follows: for each triple (h, r, t), the head entity and the tail entity are projected into the relation space through a projection matrix, obtaining the head entity mapping result and the tail entity mapping result.
The final evaluation function is:
f_r(h, t) = ‖hM_r + r − tM_r‖₂², where M_r is the projection matrix of the relation r, and hM_r and tM_r are the mapping results of the head entity h and the tail entity t in the relation space.
The model is trained so that the evaluation function reaches its minimum value.
The trained TransR model may then be used to output, for the head entity and the tail entity in a text, entity embedding representations that fuse the knowledge graph information; these are determined as entity information labels for training the text representation layer.
Optionally, the image representation layer may also be trained by annotating images with entity labels.
Optionally, images with detection boxes annotated with entity information may be obtained directly to train the image representation layer, specifically by training an image entity information extraction model and extracting its representation part as the image representation layer.
Optionally, before training the image-text matching model, the above method flow may further include: determining in advance, according to the entity information contained in a first text describing the content of an image to be trained, the position of at least one entity in the image to be trained; masking at least one entity in the image to be trained to obtain at least one image to be restored; and pre-training the image representation layer with the image to be restored as the sample feature and the image to be trained as the sample label. The image representation layer may be used to extract entity feature information from the input image.
This embodiment does not limit the form of the entity information. Optionally, it may include the entity category and/or the entity attribute.
Optionally, pre-training the image representation layer may include pre-training an image restoration model, and the representation part of the trained image restoration model may be extracted and determined as the image representation layer.
Optionally, the backbone network in the image representation layer may be ResNet (a small-parameter model) or ViT (a large-parameter model), or a CNN model, a Transformer model, or a self-attention model.
Optionally, corresponding to the above masking of the image to be trained, the first text may also be masked to train the text representation layer.
Optionally, before training the image-text matching model, for an entity masked in any image to be restored, the information of that entity may be masked in the first text to obtain a text to be restored; the text representation layer may then be pre-trained with the obtained text to be restored as the sample feature and the first text as the sample label. The text representation layer may be used to extract entity feature information from the input text.
Optionally, pre-training the text representation layer may include pre-training a text restoration model, and the representation part of the trained text restoration model may be extracted and determined as the text representation layer.
By masking the same entity contained in an associated image and text for training, this embodiment can improve the correlation between the entity feature information extracted by the text representation layer and by the image representation layer, thereby improving the correlation between the image representation result and the text representation result, the training effect, and the accuracy of the image-text matching model.
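In a specific example, masking the same entity in an associated image and text can be sketched as follows. It is assumed, for illustration only, that the entity position is given as a bounding box in the image and as a character span in the first text; the helper names are hypothetical.

```python
import numpy as np

def mask_entity_in_image(image, box):
    """image: HxWxC array; box: (x1, y1, x2, y2) of the entity whose position was
    determined from the entity information in the first text."""
    masked = image.copy()
    x1, y1, x2, y2 = box
    masked[y1:y2, x1:x2, :] = 0  # mask the entity region to get the image to be restored
    return masked

def mask_entity_in_text(first_text, span, mask_token="[MASK]"):
    """first_text: string; span: (start, end) of the same entity's mention."""
    start, end = span
    return first_text[:start] + mask_token + first_text[end:]  # text to be restored

# Pre-training pairs: (image to be restored -> image to be trained) for the image
# representation layer, and (text to be restored -> first text) for the text
# representation layer, with the same entity masked in both modalities.
```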
By pre-training the image representation layer and the text representation layer, the above embodiments can obtain relatively good initial parameters, and thus improve the overall training effect of the image-text matching model compared with the case of randomly determined initial parameters.
After the above image-text matching model is trained, the embodiment of the present invention does not limit the specific way of using the image-text matching model.
Optionally, the image-text matching model may be used to determine whether an input image and an input text match. It may also be used, through an input text that matches the input image, to obtain a text describing the content of the input image, thereby extracting the content information in the image as text.
In an optional embodiment, an image to be matched may be obtained; at least one preset text containing preset content information may be obtained, where the preset content information differs between different preset texts; and the image to be matched and the at least one preset text are input into the image-text matching model.
Optionally, the image-text matching model is trained based on the above method embodiments.
Whether there exists a preset text describing the content of the image to be matched may then be determined according to the output of the image-text matching model.
Optionally, the preset content information may include entity information, and the entity information may include the entity category and/or the entity attribute.
Optionally, a preset text may be a text obtained by filling entity information into a preset text template, for example "the image contains a xx".
This embodiment does not limit the form of the output of the image-text matching model.
Optionally, the model output may include the matching degree between the input text and the input image, or a prediction result indicating whether the input text and the input image match.
This embodiment does not limit the way of determining, according to the model output, whether there exists a preset text describing the content of the image to be matched.
Optionally, according to the matching degrees output by the model, the preset text whose matching degree is higher than a preset matching threshold and is the highest may be determined; according to the preset content information in the determined preset text, it may then be determined that the image to be matched contains that preset content information.
Optionally, the matching preset text may also be determined according to the model's prediction result indicating whether the input text and the input image match; according to the preset content information in the determined preset text, it may then be determined that the image to be matched contains that preset content information.
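In a specific example, matching an image to be matched against preset texts can be sketched as follows. The template string, the threshold value and the model call signature are illustrative assumptions.

```python
def describe_image(model, image, entity_infos,
                   template="the image contains a {}", threshold=0.5):
    """entity_infos: candidate entity categories/attributes, e.g. ["cat", "yellow cat"].
    Returns the preset content information judged to be in the image, or None."""
    best_info, best_score = None, threshold
    for info in entity_infos:
        preset_text = template.format(info)  # fill the preset text template
        score = model(preset_text, image)    # matching degree output by the model
        if score > best_score:               # keep the highest score above the threshold
            best_info, best_score = info, score
    return best_info
```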
Corresponding to the above method embodiments, an embodiment of the present invention further provides an apparatus embodiment.
As shown in FIG. 2, FIG. 2 is a schematic structural diagram of a training apparatus for an image-text matching model according to an embodiment of the present invention.
The apparatus may include the following units.
A sample unit 201, configured to obtain positive samples and negative samples; a positive sample includes a text and an image, and the included text describes the content of the included image; a negative sample includes a text and an image, and the content described by the included text does not match the content of the included image.
A training unit 202, configured to train the image-text matching model based on contrastive learning, using the obtained positive and negative samples;
the image-text matching model is used to predict, for an input text and an input image, whether the input text describes the content of the input image.
Optionally, the image-text matching model includes a text representation layer and an image representation layer;
the image-text matching model is used to: for an input text and an input image, obtain text features of the input text using the text representation layer, obtain image features of the input image using the image representation layer, and then predict, based on the obtained text features and image features, whether the input text describes the content of the input image.
Optionally, the sample unit 201 is configured to:
obtain a set of correspondences between texts and images, where in any correspondence between a text and an image, the text describes the content of the corresponding image;
generate positive samples and negative samples according to the set of correspondences; a positive sample includes a text and an image belonging to the same correspondence; a negative sample includes a text and an image belonging to different correspondences; the content described by the text in a negative sample does not match the content of the image in the same negative sample.
Optionally, the sample unit 201 is configured to:
determine multiple correspondences from the set of correspondences, where between any two of the determined correspondences the images and the texts are different;
for any one of the multiple correspondences, generate a positive sample based on the text and the image in that correspondence;
generate a negative sample based on the text in that correspondence and the image in any other one of the multiple correspondences.
Optionally, the sample unit 201 is configured to:
determine N correspondences from the set of correspondences, where between any two of the determined N correspondences the images and the texts are different;
for each of the N correspondences, generate one positive sample based on the text and the image in that correspondence; and generate N-1 negative samples based on the text in that correspondence and the images in the other N-1 correspondences.
Optionally, the text included in a positive sample describes the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in a negative sample does not match the entity in the included image.
Optionally, in any correspondence between a text and an image, the text describes the entity category and the entity attribute in the corresponding image; the sample unit 201 is configured to:
determine, in the set of correspondences, the correspondences with the same entity category but different entity attributes as a first subset, and generate a first positive-negative sample set according to the first subset;
determine, in the set of correspondences, the correspondences with different entity categories but the same entity attribute as a second subset, and generate a second positive-negative sample set according to the second subset;
determine, in the set of correspondences, the correspondences with different entity categories and different entity attributes as a third subset, and generate a third positive-negative sample set according to the third subset.
Optionally, a first loss weight is smaller than a second loss weight, and the second loss weight is smaller than a third loss weight; the first loss weight is the loss function weight used when training the image-text matching model with the first positive-negative sample set; the second loss weight is the loss function weight used when training the image-text matching model with the second positive-negative sample set; the third loss weight is the loss function weight used when training the image-text matching model with the third positive-negative sample set.
Optionally, the text representation layer is used to extract entity feature information from the input text.
Optionally, the text representation layer is used to encode the input text, and then extract entity feature information from the text encoding result.
Optionally, the apparatus further includes a text pre-training unit 203, configured to determine in advance, before training the image-text matching model, the entity information contained in a text to be trained;
determine the corresponding entity label according to the determined entity information;
pre-train the text representation layer using the text to be trained and the corresponding entity label.
Optionally, the text pre-training unit 203 is configured to:
determine the head entity, the relation and the tail entity contained in the text to be trained;
map the determined head entity and tail entity into the relation space using a pre-trained mapping model, to obtain head entity features and tail entity features;
determine the obtained head entity features and tail entity features as the corresponding entity labels.
Optionally, the apparatus further includes an image pre-training unit 204, configured to: before training the image-text matching model, determine in advance, according to the entity information contained in a first text describing the content of an image to be trained, the position of at least one entity in the image to be trained;
mask at least one entity in the image to be trained to obtain at least one image to be restored;
pre-train the image representation layer with the image to be restored as the sample feature and the image to be trained as the sample label;
the image representation layer is used to extract entity feature information from the input image.
Optionally, the apparatus further includes a text pre-training unit 203, configured to: before training the image-text matching model, for an entity masked in any image to be restored, mask the information of that entity in the first text to obtain a text to be restored;
pre-train the text representation layer with the obtained text to be restored as the sample feature and the first text as the sample label;
the text representation layer is used to extract entity feature information from the input text.
For specific explanations, reference may be made to the above method embodiments.
An embodiment of the present invention further provides a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, any one of the above method embodiments is implemented.
An embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any one of the above method embodiments.
FIG. 3 is a schematic diagram of the hardware structure of a computer device configured with the method of an embodiment of the present invention. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050, where the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 are communicatively connected to one another within the device through the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of the present invention.
The memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, and so on. The memory 1020 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present invention are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and so on; the output device may include a display, a speaker, a vibrator, an indicator light, and so on.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to realize communication between this device and other devices. The communication module may communicate in a wired manner (for example, USB or a network cable) or in a wireless manner (for example, a mobile network, WiFi or Bluetooth).
The bus 1050 includes a path that transmits information between the components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040).
It should be noted that although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art can understand that the above device may also include only the components necessary for implementing the solutions of the embodiments of the present invention, and need not include all the components shown in the figure.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements any one of the above method embodiments.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements any one of the above method embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, in which information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
From the description of the above implementations, those skilled in the art can clearly understand that the embodiments of the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence, or the part that makes a contribution, can be embodied in the form of a software product; the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in some parts of the embodiments of the present invention.
The systems, apparatuses, modules or units set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiment is basically similar to the method embodiment, its description is relatively simple, and for relevant parts reference may be made to the description of the method embodiment. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separated, and when implementing the solutions of the embodiments of the present invention, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are only specific implementations of the embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can also make several improvements and refinements without departing from the principles of the embodiments of the present invention, and these improvements and refinements shall also be regarded as falling within the protection of the embodiments of the present invention.
In the present invention, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance. The term "plurality" means two or more, unless otherwise clearly defined.
Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the disclosure herein. The present invention is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed by the present invention. The specification and the embodiments are to be regarded as exemplary only, and the true scope and spirit of the present invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (16)

  1. A training method for an image-text matching model, characterized by comprising:
    obtaining positive samples and negative samples; a positive sample comprises a text and an image, and the included text describes the content of the included image; a negative sample comprises a text and an image, and the content described by the included text does not match the content of the included image;
    training the image-text matching model based on contrastive learning, using the obtained positive and negative samples;
    wherein the image-text matching model is used to predict, for an input text and an input image, whether the input text describes the content of the input image.
  2. The method according to claim 1, wherein the image-text matching model comprises a text representation layer and an image representation layer;
    the image-text matching model is used to: for an input text and an input image, obtain text features of the input text using the text representation layer, obtain image features of the input image using the image representation layer, and then predict, based on the obtained text features and image features, whether the input text describes the content of the input image.
  3. The method according to claim 1, wherein obtaining positive samples and negative samples comprises:
    obtaining a set of correspondences between texts and images, wherein in any correspondence between a text and an image, the text describes the content of the corresponding image;
    generating positive samples and negative samples according to the set of correspondences; a positive sample comprises a text and an image belonging to the same correspondence; a negative sample comprises a text and an image belonging to different correspondences; the content described by the text in a negative sample does not match the content of the image in the same negative sample.
  4. The method according to claim 3, wherein generating positive samples and negative samples according to the set of correspondences comprises:
    determining multiple correspondences from the set of correspondences, wherein between any two of the determined correspondences the images and the texts are different;
    for any one of the multiple correspondences, generating a positive sample based on the text and the image in that correspondence;
    generating a negative sample based on the text in that correspondence and the image in any other one of the multiple correspondences.
  5. The method according to claim 3, wherein generating positive samples and negative samples according to the set of correspondences comprises:
    determining N correspondences from the set of correspondences, wherein between any two of the determined N correspondences the images and the texts are different;
    for each of the N correspondences, generating one positive sample based on the text and the image in that correspondence; and generating N-1 negative samples based on the text in that correspondence and the images in the other N-1 correspondences.
  6. The method according to claim 1, wherein the text included in a positive sample describes the entity category and/or entity attribute in the included image; the entity category and/or entity attribute in the text included in a negative sample does not match the entity in the included image.
  7. The method according to claim 3, wherein in any correspondence between a text and an image, the text describes the entity category and the entity attribute in the corresponding image;
    generating positive samples and negative samples according to the set of correspondences comprises:
    determining, in the set of correspondences, the correspondences with the same entity category but different entity attributes as a first subset; generating a first positive-negative sample set according to the first subset;
    determining, in the set of correspondences, the correspondences with different entity categories but the same entity attribute as a second subset; generating a second positive-negative sample set according to the second subset;
    determining, in the set of correspondences, the correspondences with different entity categories and different entity attributes as a third subset; generating a third positive-negative sample set according to the third subset.
  8. The method according to claim 7, wherein a first loss weight is smaller than a second loss weight, and the second loss weight is smaller than a third loss weight;
    the first loss weight is the loss function weight used when training the image-text matching model with the first positive-negative sample set; the second loss weight is the loss function weight used when training the image-text matching model with the second positive-negative sample set; the third loss weight is the loss function weight used when training the image-text matching model with the third positive-negative sample set.
  9. The method according to claim 2, wherein the text representation layer is used to extract entity feature information from the input text.
  10. The method according to claim 9, wherein the text representation layer is used to encode the input text and then extract entity feature information from the text encoding result.
  11. The method according to claim 2, wherein before training the image-text matching model, the method further comprises:
    determining in advance, according to entity information contained in a first text describing the content of an image to be trained, the position of at least one entity in the image to be trained;
    masking at least one entity in the image to be trained to obtain at least one image to be restored;
    pre-training the image representation layer with the image to be restored as the sample feature and the image to be trained as the sample label.
  12. The method according to claim 11, wherein before training the image-text matching model, the method further comprises:
    for an entity masked in any image to be restored, masking the information of that entity in the first text to obtain a text to be restored;
    pre-training the text representation layer with the obtained text to be restored as the sample feature and the first text as the sample label.
  13. An image-text matching method, characterized by comprising:
    obtaining an image to be matched;
    obtaining at least one preset text containing preset content information, wherein the preset content information differs between different preset texts;
    inputting the image to be matched and the at least one preset text into an image-text matching model, wherein the image-text matching model is obtained based on the training method for an image-text matching model according to any one of claims 1 to 12;
    determining, according to an output result of the image-text matching model, whether there exists a preset text describing the content of the image to be matched.
  14. A training apparatus for an image-text matching model, characterized by comprising:
    a sample unit, configured to obtain positive samples and negative samples; the text included in a positive sample describes the content of the included image; the content described by the text included in a negative sample does not match the content of the included image;
    a training unit, configured to train the image-text matching model based on contrastive learning, using the obtained positive and negative samples;
    wherein the image-text matching model is used to predict, for an input text and an input image, whether the input text describes the content of the input image.
  15. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method according to any one of claims 1 to 13.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 13.
PCT/CN2022/123188 2022-09-30 2022-09-30 图像文本匹配模型的训练方法、装置、设备及存储介质 WO2024065645A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/123188 WO2024065645A1 (zh) 2022-09-30 2022-09-30 图像文本匹配模型的训练方法、装置、设备及存储介质
CN202280003411.9A CN118119935A (zh) 2022-09-30 2022-09-30 图像文本匹配模型的训练方法、装置、设备及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/123188 WO2024065645A1 (zh) 2022-09-30 2022-09-30 图像文本匹配模型的训练方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2024065645A1 true WO2024065645A1 (zh) 2024-04-04

Family

ID=90475547

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/123188 WO2024065645A1 (zh) 2022-09-30 2022-09-30 图像文本匹配模型的训练方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN118119935A (zh)
WO (1) WO2024065645A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990297A (zh) * 2021-03-10 2021-06-18 北京智源人工智能研究院 多模态预训练模型的训练方法、应用方法及装置
CN114091427A (zh) * 2021-11-19 2022-02-25 海信电子科技(武汉)有限公司 一种图像文本相似度模型训练方法及显示设备
CN114782722A (zh) * 2022-04-29 2022-07-22 北京百度网讯科技有限公司 图文相似度的确定方法、装置及电子设备
CN114841243A (zh) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 跨模态检索模型训练方法、跨模态检索方法、设备及介质
US20220284321A1 (en) * 2021-03-03 2022-09-08 Adobe Inc. Visual-semantic representation learning via multi-modal contrastive training

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220284321A1 (en) * 2021-03-03 2022-09-08 Adobe Inc. Visual-semantic representation learning via multi-modal contrastive training
CN112990297A (zh) * 2021-03-10 2021-06-18 北京智源人工智能研究院 多模态预训练模型的训练方法、应用方法及装置
CN114091427A (zh) * 2021-11-19 2022-02-25 海信电子科技(武汉)有限公司 一种图像文本相似度模型训练方法及显示设备
CN114841243A (zh) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 跨模态检索模型训练方法、跨模态检索方法、设备及介质
CN114782722A (zh) * 2022-04-29 2022-07-22 北京百度网讯科技有限公司 图文相似度的确定方法、装置及电子设备

Also Published As

Publication number Publication date
CN118119935A (zh) 2024-05-31

Similar Documents

Publication Publication Date Title
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
US11574122B2 (en) Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN113283551B (zh) 多模态预训练模型的训练方法、训练装置及电子设备
US11140446B2 (en) Sensitivity assessment for media production using artificial intelligence
WO2021139191A1 (zh) 数据标注的方法以及数据标注的装置
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
CN110851641B (zh) 跨模态检索方法、装置和可读存储介质
WO2018196718A1 (zh) 图像消歧方法、装置、存储介质和电子设备
US20200012862A1 (en) Multi-model Techniques to Generate Video Metadata
CN112632226B (zh) 基于法律知识图谱的语义搜索方法、装置和电子设备
CN108985133B (zh) 一种人脸图像的年龄预测方法及装置
CN112163099A (zh) 基于知识图谱的文本识别方法、装置、存储介质和服务器
CN113627151B (zh) 跨模态数据的匹配方法、装置、设备及介质
CN113656660A (zh) 跨模态数据的匹配方法、装置、设备及介质
JP2023536773A (ja) テキスト品質評価モデルのトレーニング方法及びテキスト品質の決定方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN115861995A (zh) 一种视觉问答方法、装置及电子设备和存储介质
CN109408175B (zh) 通用高性能深度学习计算引擎中的实时交互方法及系统
CN110867225A (zh) 字符级临床概念提取命名实体识别方法及系统
WO2024065645A1 (zh) 图像文本匹配模型的训练方法、装置、设备及存储介质
CN112241470A (zh) 一种视频分类方法及系统
CN109657710B (zh) 数据筛选方法、装置、服务器及存储介质
CN117009570A (zh) 一种基于位置信息与置信度感知的图文检索方法及装置
WO2022237065A1 (zh) 分类模型的训练方法、视频分类方法及相关设备
CN111460206B (zh) 图像处理方法、装置、电子设备和计算机可读存储介质
CN115700790A (zh) 用于对象属性分类模型训练的方法、设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22960268

Country of ref document: EP

Kind code of ref document: A1