CN112148839A - Image-text matching method and device and storage medium - Google Patents
Image-text matching method and device and storage medium
- Publication number
- CN112148839A (application CN202011052223.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- vector
- training
- coding
- Prior art date
- Legal status: Pending (the status is an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The disclosure relates to an image-text matching method and device and a storage medium. The image-text matching method includes: acquiring an image to be subjected to image-text matching; inputting the image into a pre-trained image-text coding model and encoding it to obtain an image vector; determining, from pre-stored text vectors, text vectors similar to the image vector, where the image-text coding model also comprises a text coding sub-network used to encode text during training and the pre-stored text vectors are produced by that sub-network encoding preset texts; and determining the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image. The method improves the image-text matching efficiency of the matching server and reduces its system latency.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for matching images and texts, and a storage medium.
Background
Multi-modal retrieval is a retrieval mode that enables data retrieval across modalities. For example, a user may input an image to search for descriptive text matching that image, or input a text to search for the images the text describes.
Taking text retrieval from an image as an example, the server first performs a coarse search of a text library for candidate texts highly associated with the input image, then encodes each candidate text with a text encoding model to obtain its text vector, and finally determines the degree of match with the input image from those text vectors.
Matching images and texts in this way requires the server to perform several steps (preliminary retrieval, encoding, and match-degree calculation), so its processing efficiency is low.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, and a storage medium for matching images and texts.
According to a first aspect of the embodiments of the present disclosure, there is provided an image-text matching method, including: acquiring an image to be subjected to image-text matching; inputting the image into a pre-trained image-text coding model and coding the image to obtain an image vector; determining text vectors similar to the image vector from pre-stored text vectors, where the image-text coding model further comprises a text coding sub-network used for coding text during training, and the pre-stored text vectors are determined by the text coding sub-network encoding preset texts; and determining the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image.
In one example, determining a text vector similar to the image vector from pre-stored text vectors includes:
determining the cosine distance between the image vector and each of the pre-stored text vectors;
obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and
determining those text vectors as the text vectors similar to the image vector.
In one example, the pre-stored text vectors are determined by the text coding sub-network encoding preset texts as follows: calling the image-text coding model; inputting the preset texts into the image-text coding model; and encoding the input preset texts through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vectors.
In an example, the image-text coding model is trained based on a first training sample pair and a second training sample pair; the first training sample pair includes the preset text and an image sample that matches the preset text, and the second training sample pair includes the preset text and an image sample that does not match the preset text.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for training an image-text coding model, the method comprising: determining preset text samples; determining images matching the preset text samples to obtain first training sample pairs, and determining images not matching the preset text samples to obtain second training sample pairs; and training on the first and second training sample pairs to obtain the image-text coding model.
In an example, the image-text coding model includes an image coding sub-network and a text coding sub-network, and training on the first and second training sample pairs proceeds as follows: extracting the image vectors of the image samples in the first and second training sample pairs through the image coding sub-network, and the text vectors of the text samples through the text coding sub-network; determining a first cosine distance (between the text vector and the image vector in a first training sample pair) and a second cosine distance (between the text vector and the image vector in a second training sample pair); and adjusting the training parameters of both sub-networks according to the first cosine distance, the second cosine distance, and the loss function until the loss value is satisfied. The resulting image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
According to a third aspect of the embodiments of the present disclosure, there is provided an image-text matching apparatus including: an acquisition unit configured to acquire an image to be subjected to image-text matching; a processing unit configured to input the image into a pre-trained image-text coding model and code the image to obtain an image vector; and a determination unit configured to determine text vectors similar to the image vector from pre-stored text vectors, where the image-text coding model further comprises a text coding sub-network used for coding text during training, and the pre-stored text vectors are determined by the text coding sub-network encoding preset texts. The determination unit is further configured to determine the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image.
In one example, the determination unit determines a text vector similar to the image vector from pre-stored text vectors in the following manner: determining the cosine distance between the image vector and each of the pre-stored text vectors; obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and determining those text vectors as the text vectors similar to the image vector.
In one example, the determination unit determines the pre-stored text vector by having the text coding sub-network encode a preset text in the following manner: calling the image-text coding model; inputting the preset text into the image-text coding model; and coding the input preset text through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vector.
In an example, the image-text coding model is trained based on a first training sample pair and a second training sample pair; the first training sample pair includes the preset text and an image sample that matches the preset text, and the second training sample pair includes the preset text and an image sample that does not match the preset text.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for an image-text coding model, the training apparatus comprising: a determining unit configured to determine preset text samples, determine images matching the preset text samples to obtain first training sample pairs, and determine images not matching the preset text samples to obtain second training sample pairs; and a training unit configured to train the image-text coding model based on the first and second training sample pairs.
In an example, the image-text coding model comprises an image coding sub-network and a text coding sub-network, and the training unit trains the model on the first and second training sample pairs as follows: extracting the image vectors of the image samples in the first and second training sample pairs through the image coding sub-network, and the text vectors of the text samples through the text coding sub-network; determining a first cosine distance (between the text vector and the image vector in a first training sample pair) and a second cosine distance (between the text vector and the image vector in a second training sample pair); and adjusting the training parameters of both sub-networks according to the first cosine distance, the second cosine distance, and the loss function until the loss value is satisfied. The resulting image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
According to a fifth aspect of the present disclosure, there is provided an image-text matching device, including: a memory configured to store instructions; and a processor configured to invoke the instructions to perform the image-text matching method of the first aspect or any example of the first aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the image-text matching method of the first aspect or any example of the first aspect.
According to a seventh aspect of the present disclosure, there is provided an image-text coding model training device, including: a memory configured to store instructions; and a processor configured to invoke the instructions to perform the image-text coding model training method of the second aspect or any example of the second aspect.
According to an eighth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the image-text coding model training method of the second aspect or any example of the second aspect.
The technical solutions provided by the embodiments of the disclosure may have the following beneficial effects: the image-text matching server acquires an image to be matched, inputs it into a pre-trained image-text coding model to encode it into an image vector, determines text vectors similar to that image vector among pre-stored text vectors, and determines the texts corresponding to a preset number of similar text vectors as the texts matching the image. The server therefore neither searches the preset texts for candidate texts matching the user's image nor encodes candidate texts in real time to obtain text vectors, which improves its image-text matching efficiency and reduces its system latency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating an image-text matching method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an image-text matching method according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating training of an image-text coding model according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating an image-text matching apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating an image-text coding model training apparatus according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating an apparatus in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The technical solutions of the exemplary embodiments of the disclosure can be applied to an image-text matching system, in application scenarios where texts matching an image are returned according to the image a user inputs to the system. In such a scenario, the image-text matching system may include a terminal for inputting the image to be matched and a server that performs text matching on that image, i.e., the image-text matching server. The user terminal includes, but is not limited to, fixed or mobile electronic devices such as smartphones, tablet computers, notebook computers, desktop computers, and e-book readers. The image-text matching server may be an independent application service device or a service cluster formed by several servers; in practice it may be a cloud server, a cloud host, a virtualization center, or the like.
In the related art, with multi-modal retrieval, for example retrieving descriptive text matching an input image, the user inputs the image on a terminal. After receiving it, the image-text matching server must search a text library for candidate texts matching the image, encode those candidate texts with a network model to obtain their text vectors, determine from those vectors which candidate texts match the input image, and feed the matching texts back to the terminal for the user to select.
The image-text matching server can only accomplish matching through all of these steps, so it is inefficient and suffers significant system latency.
The embodiment of the disclosure provides an image-text matching method. In this method, the image-text matching server acquires an image to be matched, inputs it into a pre-trained image-text coding model to encode it into an image vector, determines text vectors similar to that image vector among pre-stored text vectors, and determines the texts corresponding to a preset number of similar text vectors as the texts matching the image. The server therefore neither searches the preset texts for candidate texts matching the user's image nor encodes candidate texts in real time to obtain text vectors, which improves its image-text matching efficiency and reduces its system latency.
Fig. 1 is a flowchart illustrating an image-text matching method according to an exemplary embodiment; as shown in fig. 1, the method comprises the following steps.
In step S11, an image to be subjected to image-text matching is acquired.
In the present disclosure, the image to be matched may be one or more images input by a user, or image frames extracted from a video the user inputs.
In step S12, the image is input to a pre-trained image-text coding model, and the image is coded to obtain an image vector.
The image-text coding model in this disclosure can encode input images and/or texts and output the corresponding image vectors and/or text vectors.
Specifically, the image to be matched is input into the image-text coding model and encoded by its image coding sub-network, yielding the image vector of that image.
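As an illustrative sketch only (not the claimed implementation), the image coding sub-network could be realized as follows in PyTorch/torchvision. The ResNet-50 backbone, the 512-dimensional embedding size, and the file name query.jpg are assumptions for this sketch; the disclosure names ResNet and VGG as possible backbones only later, in the training description.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Image coding sub-network: a ResNet backbone whose classifier head is
# replaced by a projection into the shared image-text embedding space.
# ResNet-50, the 512-dim embedding, and "query.jpg" are assumptions.
class ImageEncoder(torch.nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = torch.nn.Identity()      # drop the 1000-class head
        self.backbone = backbone
        self.proj = torch.nn.Linear(2048, embed_dim)

    def forward(self, pixels):
        feats = self.backbone(pixels)          # (batch, 2048)
        vec = self.proj(feats)                 # (batch, embed_dim)
        # L2-normalize so cosine similarity reduces to a dot product.
        return torch.nn.functional.normalize(vec, dim=-1)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

encoder = ImageEncoder().eval()
with torch.no_grad():
    image = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)
    image_vector = encoder(image)              # the image vector of step S12
```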
In step S13, a text vector similar to the image vector is determined from the text vectors stored in advance.
In practical applications, images and texts come from two heterogeneous spaces. To measure their similarity directly, both can be mapped into a single space, where the similarity between an image and a text is measured with the image vector and the text vector.
Based on the principle of vector similarity, in one embodiment the disclosure may retrieve text vectors similar to the image vector from the pre-stored text vectors and determine the texts corresponding to a preset number of those vectors as the texts matching the image.
The pre-stored text vectors are obtained by encoding preset texts with the text coding sub-network included in the image-text coding model.
In one embodiment, the text coding sub-network may be, for example, a neural network using Bidirectional Encoder Representations from Transformers (BERT).
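A minimal sketch of such a BERT-based text coding sub-network, assuming the Hugging Face transformers library; the bert-base-chinese checkpoint, the [CLS]-token pooling, and the 512-dimensional projection are illustrative assumptions, and in the trained model the projection would be learned jointly with the image coding sub-network.

```python
import torch
from transformers import BertModel, BertTokenizer

# Text coding sub-network built on BERT; checkpoint and pooling choices
# are assumptions for illustration only.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()
proj = torch.nn.Linear(bert.config.hidden_size, 512)

def encode_texts(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    cls = out.last_hidden_state[:, 0]          # [CLS] representation
    return torch.nn.functional.normalize(proj(cls), dim=-1)

text_vectors = encode_texts(["A poem about mountains and rivers."])
```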
In step S14, text corresponding to a preset number of text vectors similar to the image vector is determined as text matching the image.
According to the method and the device of the present disclosure, after the image to be matched is obtained, text vectors similar to its image vector can be retrieved from the pre-stored text vectors according to that image vector, and the texts corresponding to a preset number of similar text vectors are returned to the user for selection.
For example, a user inputs a landscape image. The image-text coding model encodes it into an image vector, text vectors similar to that vector are retrieved from the pre-stored text vectors, and the texts corresponding to a preset number of similar vectors, such as classical poems and elegant prose, are returned to the user for selection. Returning such texts greatly shortens the time the user spends matching words to the scenery, and the user can also apply the returned texts to further image processing at a later stage.
In an exemplary embodiment of the disclosure, the image-text matching server acquires the image to be matched, inputs it into the pre-trained image-text coding model to encode it into an image vector, determines text vectors similar to that image vector among the pre-stored text vectors, and determines the texts corresponding to a preset number of similar text vectors as the texts matching the image. The server therefore neither searches the preset texts for candidate texts matching the user's image nor encodes candidate texts in real time to obtain text vectors, which improves its image-text matching efficiency and reduces its system latency.
Fig. 2 is a flowchart illustrating an image-text matching method according to an exemplary embodiment; as shown in fig. 2, the method comprises the following steps.
In step S21, an image to be subjected to image-text matching is acquired.
In step S22, the image is input to a pre-trained image-text coding model, and the image is coded to obtain an image vector.
In step S23, the cosine distance between the image vector and each pre-stored text vector is determined; the text vectors whose cosine distance to the image vector is greater than a set distance threshold are obtained from the calculated distances; and those text vectors are determined to be the text vectors similar to the image vector.
Based on the principle of vector similarity, in one embodiment the similarity between vectors may be measured by the cosine distance between two vectors, i.e., the cosine of the angle between them (their cosine similarity). When the cosine distance between two vectors approaches 1, the vectors are similar; when it approaches 0, they are dissimilar.
Given these characteristics, a distance threshold for similarity to the image vector can be set, for example 0.98: any pre-stored text vector whose cosine distance to the image vector exceeds 0.98 is determined to be a text vector similar to the image vector. The texts corresponding to a preset number of such text vectors may then be determined as the texts matching the image.
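A minimal sketch of this thresholding step, assuming all vectors are L2-normalized so that the cosine distance above reduces to a dot product; the 0.98 threshold and the preset number top_k follow the example.

```python
import numpy as np

# Step S23 as a sketch: cosine similarity between the image vector and
# every pre-stored text vector, then thresholding at 0.98.
def similar_texts(image_vec, text_vecs, texts, threshold=0.98, top_k=5):
    sims = text_vecs @ image_vec                  # one similarity per text
    keep = np.where(sims > threshold)[0]          # above-threshold vectors
    keep = keep[np.argsort(-sims[keep])][:top_k]  # best `top_k` of them
    return [(texts[i], float(sims[i])) for i in keep]
```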
For example, an image of mountains and rivers input by a user is obtained. The image-text coding model is called to encode it into an image vector; the cosine distance between each pre-stored text vector and this image vector is determined; the text vectors whose cosine distance exceeds 0.98 are obtained and determined to be similar to the image vector; and the texts corresponding to a preset number of those vectors, such as classical poems about mountains and rivers and elegant prose describing them, are returned to the user for selection. Returning texts matched to the input image greatly shortens the time the user spends finding matching words, and the returned texts can also support further image processing at a later stage.
For another example, a video input by a user is obtained. Because adjacent video frames differ little in content, the input video may be sampled every preset number of frames to obtain image frames. The image-text coding model is called to encode each frame into an image vector; the cosine distance between each pre-stored text vector and each frame's image vector is determined; the text vectors whose cosine distance exceeds 0.98 are determined to be similar to that frame's vector; and the texts corresponding to a preset number of similar vectors, such as elegant prose related to the frames, are returned to the user for selection.
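A sketch of the frame sampling described here, assuming OpenCV for video decoding (the disclosure does not name a library); every_n is the preset sampling interval.

```python
import cv2

# Sample every N-th frame from the input video.
def sample_frames(path, every_n=30):
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                 # end of video
            break
        if idx % every_n == 0:     # keep one frame per sampling interval
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```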
In step S24, text corresponding to a preset number of text vectors similar to the image vector is determined as text matching the image.
In an exemplary embodiment of the disclosure, the image-text matching server acquires the image to be matched, encodes it into an image vector by calling the pre-trained image-text coding model, determines the cosine distance between each pre-stored text vector and the image vector, and determines the text vectors whose cosine distance exceeds the set distance threshold as the text vectors similar to the image vector. The server therefore neither searches the text library for candidate texts matching the user's image nor encodes candidate texts in real time, which improves its image-text matching efficiency and reduces the processing latency of matching.
To improve the accuracy of image-text matching and reduce its processing latency, in one implementation the disclosure may train the image-text coding model in an end-to-end manner.
FIG. 3 is a flowchart illustrating training of the image-text coding model according to an exemplary embodiment; as shown in FIG. 3, training includes the following steps.
In step S31, a first training sample pair and a second training sample pair are acquired.
In one embodiment, the disclosure may train the image-text matching model based on first training sample pairs and second training sample pairs. A first training sample pair comprises a preset text and an image sample matching it; a second training sample pair comprises a preset text and an image sample not matching it. For example, 70% of the sample pairs may be used as the training data set and the remaining 30% as the test data set for verifying the model, as sketched below.
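A minimal sketch of that 70/30 split; the pairs list of (text, image, label) tuples and the fixed random seed are assumptions for illustration.

```python
import random

# Split the sample pairs into a 70% training set and a 30% test set.
def split_pairs(pairs, train_frac=0.7, seed=0):
    pairs = pairs[:]                       # don't shuffle the caller's list
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]        # training set, test set
```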
For convenience of description, a training sample pair composed of a preset text sample and an image sample matching that text is called a first training sample pair, and a training sample pair composed of a preset text and an image sample not matching it is called a second training sample pair.
In one embodiment, the first training sample pair and the second training sample pair in the present disclosure are determined, for example, by:
acquiring preset texts, where a preset text may be a sentence or a semantically complete combination of sentences. The sentences may be in any natural language and may include large bodies of text such as Tang and Song dynasty poetry, song lyrics, famous quotations, and classic film lines.
constructing images that match the preset texts to obtain first training sample pairs, and constructing images that do not match the preset texts to obtain second training sample pairs.
In step S32, the image-text matching model is trained and optimized.
In the present disclosure, the image-text matching model may be a model comprising an image coding sub-network and a text coding sub-network, where the image coding sub-network may be a ResNet or a VGG network, and the text coding sub-network may be, for example, a neural network using Bidirectional Encoder Representations from Transformers (BERT).
To avoid omissions and incomplete results when searching a text library for candidate texts matching the user's image, and to avoid the processing delay that such a search introduces, in one embodiment the image-text matching model may be trained end to end: the image coding sub-network and the text coding sub-network are trained jointly, so that matching of the image is performed from a global perspective. This improves the accuracy and processing efficiency of image-text matching and reduces the matching server's processing latency.
Since a first training sample pair consists of a text from the text library and an image sample matching that text, the cosine distance between its text vector and image vector should approach 1; since a second training sample pair consists of a text from the text library and an image sample not matching it, the cosine distance between its text vector and image vector should approach 0.
Therefore, when the image-text matching model comprising the image coding sub-network and the text coding sub-network is trained end to end, the first and second training sample pairs are input into the model; the image samples in both kinds of pairs are encoded by the image coding sub-network, which outputs the image vectors of the image samples, and the text samples are encoded by the text coding sub-network, which outputs the text vectors of the text samples.
Then, based on the image vectors of the image samples and the text vectors of the text samples, the cosine distance between the text vector and the image vector is determined for each first training sample pair and for each second training sample pair.
For convenience, this disclosure refers to the cosine distance between the text vector and the image vector in a first training sample pair as the first cosine distance, and the cosine distance between the text vector and the image vector in a second training sample pair as the second cosine distance.
The parameters of the image-text coding model, i.e., the training parameters of the image coding sub-network and the text coding sub-network it comprises, are adjusted according to the first cosine distance, the second cosine distance, and the loss function until the model satisfies the condition that the first cosine distance is greater than a preset first distance threshold and the second cosine distance is smaller than a preset second distance threshold.
For example, let the preset first distance threshold be 0.97 and the preset second distance threshold be 0.02. The training parameters of the image coding sub-network and the text coding sub-network are adjusted with a cross-entropy loss function until the image vectors and text vectors output by the model satisfy a first cosine distance greater than 0.97 for first training sample pairs and a second cosine distance smaller than 0.02 for second training sample pairs. A well-trained image-text coding model is thus obtained.
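A sketch of one joint training step under stated assumptions: both encoders output L2-normalized vectors, text inputs arrive in whatever tensor form the text encoder accepts, and the cross-entropy loss of the example is realized as binary cross-entropy on the cosine similarity, with target 1.0 for first (matching) pairs and 0.0 for second (non-matching) pairs.

```python
import torch

# One end-to-end training step over a batch of sample pairs.
def training_step(image_encoder, text_encoder, optimizer,
                  images, text_inputs, labels):
    img_vecs = image_encoder(images)            # (batch, dim), normalized
    txt_vecs = text_encoder(text_inputs)        # (batch, dim), normalized
    cos = (img_vecs * txt_vecs).sum(dim=-1)     # cosine similarity per pair
    # Clamp into (0, 1) so the similarity can act as a probability.
    prob = cos.clamp(1e-6, 1 - 1e-6)
    loss = torch.nn.functional.binary_cross_entropy(prob, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```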
In step S33, the test data set is input into the trained image-text matching model for verification, so as to obtain a verified image-text matching model.
Thus, after the image-text coding model is trained, all preset texts are encoded with the text coding sub-network included in the trained model. When a user later needs to match texts to an image, the trained model encodes the image to be matched into an image vector; text vectors similar to that vector can then be retrieved directly from the pre-stored text vectors based on cosine distance, and the texts corresponding to a preset number of similar text vectors are determined as the texts matching the image and output for the user to select.
In addition, to retrieve text vectors similar to the image vector quickly, after the text coding sub-network has encoded all preset texts into text vectors, a vector retrieval library may be built over the encoded text vectors with a similarity-search engine such as faiss. Because faiss supports billion-scale retrieval, a faiss-based vector retrieval library can greatly improve retrieval efficiency while preserving search accuracy, further improving image-text matching efficiency.
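A sketch of building and querying such a faiss retrieval library. The flat inner-product index, the 512-dimensional vectors, and the text_vectors.npy file are assumptions; with L2-normalized vectors the inner product equals the cosine similarity, which is why IndexFlatIP suffices here.

```python
import faiss
import numpy as np

# Build the vector retrieval library once, offline.
dim = 512
text_vecs = np.load("text_vectors.npy").astype("float32")   # (N, dim)
index = faiss.IndexFlatIP(dim)
index.add(text_vecs)                        # register all pre-stored vectors

# Query it online with the image vector of the image to be matched.
def retrieve(image_vec, top_k=5):
    query = image_vec.reshape(1, -1).astype("float32")
    sims, ids = index.search(query, top_k)  # similarities and row indices
    return list(zip(ids[0].tolist(), sims[0].tolist()))
```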
In an exemplary embodiment of the present disclosure, an end-to-end image-text matching model comprising an image coding sub-network and a text coding sub-network is trained so that, for first training sample pairs, the cosine distance between the text vector and the image vector is greater than a preset first distance threshold and, for second training sample pairs, it is smaller than a preset second distance threshold. To match texts to an acquired image, the image only needs to be input into the image-text coding model and encoded into an image vector; text vectors similar to it are then determined among all the text vectors that the text coding sub-network has produced from the preset texts. There is no need to first select candidate texts from a text retrieval library, encode them in real time, and match their vectors against the image vector. This improves the server's image-text matching efficiency and reduces the processing latency of matching.
Based on the same concept, an embodiment of the disclosure further provides an image-text matching apparatus.
It is understood that, to implement the above functions, the image-text matching apparatus provided in the embodiments of the disclosure includes corresponding hardware structures and/or software modules for performing the respective functions. Combined with the exemplary units and algorithm steps disclosed herein, the disclosed embodiments can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Fig. 4 is a block diagram illustrating an image-text matching apparatus according to an exemplary embodiment. Referring to fig. 4, the image-text matching apparatus 400 comprises an acquisition unit 401, a processing unit 402 and a determination unit 403.
The acquiring unit 401 is configured to acquire an image to be subjected to image-text matching; the processing unit 402 is configured to input the image into a pre-trained image-text coding model and code the image to obtain an image vector; the determining unit 403 is configured to determine text vectors similar to the image vector from pre-stored text vectors, where the image-text coding model further comprises a text coding sub-network used for coding text in the training process, and the pre-stored text vectors are determined by the text coding sub-network encoding preset texts; the determining unit 403 is further configured to determine the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image.
In one example, the determining unit 403 determines a text vector similar to the image vector from pre-stored text vectors in the following manner: determining the cosine distance between the image vector and each of the pre-stored text vectors; obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and determining those text vectors as the text vectors similar to the image vector.
In one example, the determining unit 403 determines the pre-stored text vector by having the text coding sub-network encode a preset text in the following manner: calling the image-text coding model; inputting the preset text into the image-text coding model; and coding the input preset text through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vector.
In an example, the image-text coding model is trained based on a first training sample pair and a second training sample pair; the first training sample pair includes the preset text and an image sample that matches the preset text, and the second training sample pair includes the preset text and an image sample that does not match the preset text.
Fig. 5 is a block diagram illustrating an image-text coding model training apparatus according to an exemplary embodiment. Referring to fig. 5, the training apparatus 500 comprises a determination unit 501 and a training unit 502.
The determining unit 501 is configured to determine preset text samples, determine images matching the preset text samples to obtain first training sample pairs, and determine images not matching the preset text samples to obtain second training sample pairs; the training unit 502 is configured to train an image-text coding model based on the first and second training sample pairs.
In an example, the image-text coding model includes an image coding sub-network and a text coding sub-network, and the training unit 502 trains the model on the first and second training sample pairs as follows: extracting the image vectors of the image samples in the first and second training sample pairs through the image coding sub-network, and the text vectors of the text samples through the text coding sub-network; determining a first cosine distance (between the text vector and the image vector in a first training sample pair) and a second cosine distance (between the text vector and the image vector in a second training sample pair); and adjusting the training parameters of both sub-networks according to the first cosine distance, the second cosine distance, and the loss function until the loss value is satisfied. The resulting image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating an apparatus 600 for image-text matching according to an exemplary embodiment. For example, the apparatus 600 may be provided as a server. Referring to fig. 6, the apparatus 600 includes a processing component 622, which further includes one or more processors, and memory resources, represented by memory 632, for storing instructions executable by the processing component 622, such as application programs. The application programs stored in memory 632 may include one or more modules, each corresponding to a set of instructions. The processing component 622 is configured to execute the instructions to perform the image-text matching method described above.
The apparatus 600 may also include a power component 626 configured to perform power management of the apparatus 600, a wired or wireless network interface 660 configured to connect the apparatus 600 to a network, and an input/output (I/O) interface 668. The apparatus 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
It is understood that "a plurality" in this disclosure means two or more, and other words are analogous. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that, unless otherwise specified, "connected" includes direct connections between the two without the presence of other elements, as well as indirect connections between the two with the presence of other elements.
It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (16)
1. A method for matching images and texts, the method comprising:
acquiring an image to be subjected to image-text matching;
inputting the image into a pre-trained image-text coding model, and coding the image to obtain an image vector;
determining a text vector similar to the image vector from pre-stored text vectors;
the image-text coding model further comprises a text coding sub-network used for coding text in the training process, and the pre-stored text vector is determined by the text coding sub-network encoding a preset text;
and determining the texts corresponding to the preset number of text vectors similar to the image vectors as texts matched with the image.
2. The image-text matching method according to claim 1, wherein determining a text vector similar to the image vector from pre-stored text vectors comprises:
determining the cosine distance between the image vector and each of the pre-stored text vectors;
obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and
determining the text vectors whose cosine distance is greater than the set distance threshold as the text vectors similar to the image vector.
3. The image-text matching method according to claim 1, wherein the pre-stored text vector is determined by the text coding sub-network encoding a preset text, comprising:
and inputting the preset text into the image-text coding model, and coding the input preset text through a text coding sub-network included in the image-text coding model to obtain the pre-stored text vector.
4. The method of claim 1, wherein the image-text coding model is trained on a first training sample pair and a second training sample pair, the first training sample pair including the preset text and an image sample matching the preset text, and the second training sample pair including the preset text and an image sample not matching the preset text.
5. A method for training an image-text coding model, the method comprising:
determining a preset text sample;
determining images matched with the preset text samples to obtain a first training sample pair, and determining images not matched with the preset text samples to obtain a second training sample pair;
and training based on the first training sample pair and the second training sample pair to obtain an image-text coding model.
6. The method of claim 5, wherein the image-text coding model comprises an image coding sub-network and a text coding sub-network, and wherein the training based on the first training sample pair and the second training sample pair to obtain the image-text coding model comprises:
respectively extracting image vectors of the image samples in the first training sample pair and the second training sample pair through an image coding sub-network, and
respectively extracting text vectors of the text samples in the first training sample pair and the second training sample pair through a text coding sub-network;
determining a first cosine distance and a second cosine distance based on the image vector of the image sample and the text vector of the text sample, wherein the first cosine distance is the cosine distance between the text vector and the image vector in the first training sample pair, and the second cosine distance is the cosine distance between the text vector and the image vector in the second training sample pair;
adjusting training parameters of the image coding sub-network and the text coding sub-network according to the first cosine distance, the second cosine distance, and a loss function to obtain the image-text coding model satisfying the loss value;
wherein the image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
7. An apparatus for matching graphics and text, the apparatus comprising:
an acquisition unit configured to acquire an image to be subjected to image-text matching;
the processing unit is configured to input the image into a pre-trained image-text coding model, and code the image to obtain an image vector;
a determination unit configured to determine a text vector similar to the image vector from among pre-stored text vectors;
the image-text coding model further comprises a text coding sub-network used for coding text in the training process, and the pre-stored text vector is determined by the text coding sub-network encoding a preset text;
the determining unit is further configured to determine a text corresponding to a preset number of text vectors similar to the image vector as a text matching the image.
8. The image-text matching apparatus according to claim 7, wherein the determination unit determines a text vector similar to the image vector from pre-stored text vectors in the following way:
determining the cosine distance between the image vector and each of the pre-stored text vectors;
obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and
determining the text vectors whose cosine distance is greater than the set distance threshold as the text vectors similar to the image vector.
9. The image-text matching apparatus according to claim 7, wherein the determination unit determines the pre-stored text vectors by encoding a preset text through the text coding sub-network in the following manner:
calling the image-text coding model; and
inputting the preset text into the image-text coding model, and coding the input preset text through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vectors.
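A sketch of the offline step of claim 9, assuming hypothetical `text_encoder` and `tokenize` stand-ins for the text coding sub-network and its preprocessing:

```python
import torch

@torch.no_grad()
def precompute_text_vectors(preset_texts, text_encoder, tokenize):
    # Encode each preset text once (encoder output assumed shape (1, D));
    # the stacked result is saved as the pre-stored text vectors.
    vectors = [text_encoder(tokenize(t)).squeeze(0) for t in preset_texts]
    return torch.stack(vectors)  # (N, D)
```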
10. The apparatus according to claim 7, wherein the image-text coding model is trained on a first training sample pair and a second training sample pair, the first training sample pair comprising the preset text and an image sample matching the preset text, and the second training sample pair comprising the preset text and an image sample not matching the preset text.
11. An apparatus for training an image-text coding model, the apparatus comprising:
a determination unit configured to determine a preset text sample, to determine images matching the preset text sample to obtain a first training sample pair, and to determine images not matching the preset text sample to obtain a second training sample pair;
a training unit configured to train, based on the first training sample pair and the second training sample pair, to obtain an image-text coding model.
12. The apparatus of claim 11, wherein the image-text coding model comprises an image coding sub-network and a text coding sub-network, and wherein the training unit is configured to train the image-text coding model based on the first training sample pair and the second training sample pair in the following manner:
extracting, through the image coding sub-network, image vectors of the image samples in the first training sample pair and the second training sample pair, and
extracting, through the text coding sub-network, text vectors of the text samples in the first training sample pair and the second training sample pair;
determining a first cosine distance and a second cosine distance based on the image vectors of the image samples and the text vectors of the text samples, wherein the first cosine distance is the cosine distance between the text vector and the image vector in the first training sample pair, and the second cosine distance is the cosine distance between the text vector and the image vector in the second training sample pair;
adjusting training parameters of the image coding sub-network and the text coding sub-network according to the first cosine distance, the second cosine distance and a loss function, to obtain an image-text coding model whose loss value satisfies the training requirement;
wherein the image-text coding model makes the first cosine distance larger than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
13. An image-text matching device, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the image-text matching method of any one of claims 1-4.
14. A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the image-text matching method of any one of claims 1-4.
15. A device for training an image-text coding model, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the teletext code model training method of any one of claims 5-6.
16. A non-transitory computer readable storage medium having instructions that, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the image-text coding model training method of any one of claims 5-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011052223.4A | 2020-09-29 | 2020-09-29 | Image-text matching method and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011052223.4A | 2020-09-29 | 2020-09-29 | Image-text matching method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112148839A (en) | 2020-12-29 |
Family
ID=73894229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011052223.4A (CN112148839A, pending) | Image-text matching method and device and storage medium | 2020-09-29 | 2020-09-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112148839A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106997387A (en) * | 2017-03-28 | 2017-08-01 | 中国科学院自动化研究所 | Multi-modal automatic abstracting based on text-image matching |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343664A (en) * | 2021-06-29 | 2021-09-03 | 京东数科海益信息科技有限公司 | Method and device for determining matching degree between image texts |
CN113343664B (en) * | 2021-06-29 | 2023-08-08 | 京东科技信息技术有限公司 | Method and device for determining matching degree between image texts |
CN113642673A (en) * | 2021-08-31 | 2021-11-12 | 北京字跳网络技术有限公司 | Image generation method, device, equipment and storage medium |
CN113642673B (en) * | 2021-08-31 | 2023-12-22 | 北京字跳网络技术有限公司 | Image generation method, device, equipment and storage medium |
CN114357228A (en) * | 2021-12-17 | 2022-04-15 | 有米科技股份有限公司 | Data processing method and device for creative document generation |
WO2023173547A1 (en) * | 2022-03-16 | 2023-09-21 | 平安科技(深圳)有限公司 | Text image matching method and apparatus, device, and storage medium |
CN115880697A (en) * | 2023-02-07 | 2023-03-31 | 天翼云科技有限公司 | Image searching method and device, readable storage medium and electronic equipment |
CN115880697B (en) * | 2023-02-07 | 2024-01-09 | 天翼云科技有限公司 | Image searching method and device, readable storage medium and electronic equipment |
Similar Documents
Publication | Title |
---|---|
CN112148839A (en) | Image-text matching method and device and storage medium |
CN113313022B (en) | Training method of character recognition model and method for recognizing characters in image |
EP3885966B1 (en) | Method and device for generating natural language description information |
KR102124466B1 (en) | Apparatus and method for generating conti for webtoon |
WO2023065731A1 (en) | Method for training target map model, positioning method, and related apparatuses |
CN110990533B (en) | Method and device for determining standard text corresponding to query text |
CN114861889B (en) | Deep learning model training method, target object detection method and device |
US20230143452A1 (en) | Method and apparatus for generating image, electronic device and storage medium |
CN110097010A (en) | Picture and text detection method, device, server and storage medium |
CN110263218B (en) | Video description text generation method, device, equipment and medium |
CN110781413A (en) | Interest point determining method and device, storage medium and electronic equipment |
US20230215203A1 (en) | Character recognition model training method and apparatus, character recognition method and apparatus, device and storage medium |
CN105989067A (en) | Method for generating text abstract from image, user equipment and training server |
CN115269913A (en) | Video retrieval method based on attention fragment prompt |
CN114973229B (en) | Text recognition model training, text recognition method, device, equipment and medium |
CN117315249A (en) | Image segmentation model training and segmentation method, system, equipment and medium |
CN115687664A (en) | Chinese image-text retrieval method and data processing method for Chinese image-text retrieval |
CN114758330A (en) | Text recognition method and device, electronic equipment and storage medium |
CN115098722B (en) | Text and image matching method and device, electronic equipment and storage medium |
CN116340479A (en) | Knowledge base construction method, data retrieval method, device and cloud equipment |
CN114299074A (en) | Video segmentation method, device, equipment and storage medium |
CN110209878B (en) | Video processing method and device, computer readable medium and electronic equipment |
CN118155270B (en) | Model training method, face recognition method and related equipment |
CN116383428B (en) | Graphic encoder training method, graphic matching method and device |
CN113722444B (en) | Text processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |