CN113822427A - Model training method, image matching device and storage medium

Info

Publication number: CN113822427A
Application number: CN202110866443.9A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 汪翔, 黄珊
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by: Tencent Technology Shenzhen Co Ltd
Priority application: CN202110866443.9A
Prior art keywords: image, sample image, trained, model, sample
Legal status: Pending

Classifications

    • G06N 3/08 — Physics; Computing; Computing arrangements based on biological models; Neural networks; Learning methods
    • G06N 3/045 — Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06F 18/214 — Electric digital data processing; Pattern recognition; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 — Pattern recognition; Matching criteria, e.g. proximity measures
    • G06F 18/2414 — Pattern recognition; Classification techniques based on distances to training or reference patterns (distances to prototypes, distances to cluster centroids); Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F 18/2415 — Pattern recognition; Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The application discloses a model training method based on artificial intelligence technology, comprising the following steps: acquiring a first image to be trained; acquiring a first region segmentation result through a semantic segmentation model based on the first image to be trained; obtaining a first sample image and a second sample image from the first image to be trained according to the first region segmentation result, wherein the region proportion of the region of interest included in each of the first sample image and the second sample image is greater than or equal to a proportion threshold; and updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image until the model training conditions are met, so as to obtain the image matching model. The application also provides an image matching method, apparatus and storage medium. By distinguishing the region of interest from the background, image blocks can be framed in a more targeted manner and representative image blocks can be taken as sample images, which helps to train a more robust image matching model.

Description

Model training method, image matching device and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a model training method, an image matching device, and a storage medium.
Background
With the rapid development of internet technology, it is increasingly difficult for users to find the content they need in the vast ocean of images on the internet. Therefore, using computers to extract the visual content of images and organize them, so as to improve the efficiency of image retrieval and recommendation, has been a research hotspot in the field of Computer Vision (CV).
At present, the similarity between two images is generally determined using an image matching model. Specifically, the image matching model can be trained by self-supervised learning. During self-supervised learning, two image blocks are randomly framed from the same image as a positive sample pair, and one image block is randomly framed from each of two different images to form a negative sample pair.
However, image blocks randomly framed from an image are often not representative. For example, one framed block may cover a core area (i.e., an area containing text or pictures) while the other covers a blank area, or both framed blocks may be blank. Training with such unrepresentative samples therefore yields a poor model.
Disclosure of Invention
The embodiments of the application provide a model training method, an image matching method and device, and a storage medium. By dividing an image into a region of interest and a background region, image blocks can be framed from the image in a more targeted manner when they are randomly selected, so that representative image blocks are taken as sample images, which helps to train a more robust image matching model.
In view of the above, an aspect of the present application provides a method for model training, including:
acquiring a first image to be trained;
acquiring a first region segmentation result through a semantic segmentation model based on the first image to be trained, wherein the first region segmentation result is used for determining a region of interest and a background region in the first image to be trained, and the region of interest comprises at least one of a text region or a picture region;
acquiring a first sample image and a second sample image derived from a first image to be trained according to a first region segmentation result, wherein the region proportion of a region of interest included in the first sample image is greater than or equal to a proportion threshold value, and the region proportion of the region of interest included in the second sample image is greater than or equal to the proportion threshold value;
and updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image until the model training conditions are met, so as to obtain the image matching model.
Another aspect of the present application provides an image matching method, including:
acquiring a first image to be matched;
acquiring a second image to be matched;
and determining a matching result through an image matching model based on the first image to be matched and the second image to be matched, wherein the image matching model is obtained by adopting the method in the aspect.
Another aspect of the present application provides a model training apparatus, including:
the acquisition module is used for acquiring a first image to be trained;
the acquisition module is further used for acquiring a first region segmentation result through a semantic segmentation model based on the first image to be trained, wherein the first region segmentation result is used for determining a region of interest and a background region in the first image to be trained, and the region of interest comprises at least one of a text region or a picture region;
the acquisition module is further used for acquiring a first sample image and a second sample image which are derived from the first image to be trained according to the first region segmentation result, wherein the region proportion of the region of interest included in the first sample image is greater than or equal to a proportion threshold value, and the region proportion of the region of interest included in the second sample image is greater than or equal to the proportion threshold value;
and the training module is used for updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image until the model training conditions are met, so as to obtain the image matching model.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for randomly selecting a first image and a second image from a first image to be trained;
determining the region proportion of the region of interest in the first image according to the first region segmentation result, and determining the region proportion of the region of interest in the second image;
if the area proportion of the interest area in the first image is larger than or equal to the proportion threshold value, taking the first image as a first sample image;
and if the area proportion of the interest area in the second image is larger than or equal to the proportion threshold value, taking the second image as a second sample image.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for randomly selecting a first image to be processed and a second image to be processed from the first image to be trained;
according to the first region segmentation result, determining the region proportion of the region of interest in the first image to be processed, and determining the region proportion of the region of interest in the second image to be processed;
if the area proportion of the interest area in the first image to be processed is larger than or equal to the proportion threshold value, carrying out data augmentation processing on the first image to be processed to obtain a first sample image;
and if the area proportion of the interest area in the second image to be processed is greater than or equal to the proportion threshold, performing data augmentation processing on the second image to be processed to obtain a second sample image.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the training module is specifically used for acquiring a first feature map through a coding network included in an image matching model to be trained on the basis of the first sample image;
based on the first feature map, acquiring a first feature vector through a projection network included in an image matching model to be trained;
based on the first feature vector, obtaining a target feature vector through a prediction network included in the image matching model to be trained;
acquiring a second feature map through a coding network included in the target model based on the second sample image;
acquiring a second feature vector through a projection network included in the target model based on the second feature map;
and updating the model parameters of the image matching model to be trained through the first loss function according to the target characteristic vector and the second characteristic vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is also used for acquiring a second image to be trained;
the acquisition module is further used for acquiring a second region segmentation result through the semantic segmentation model based on a second image to be trained, wherein the second region segmentation result is used for determining an interested region and a background region in the second image to be trained;
the acquisition module is further used for acquiring a third sample image from the second image to be trained according to the second region segmentation result, wherein the ratio of the region of interest included in the third sample image is greater than or equal to a ratio threshold;
and the training module is specifically used for updating model parameters of the image matching model to be trained according to the first sample image, the second sample image and the third sample image until model training conditions are met, so as to obtain the image matching model.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the training module is specifically used for obtaining a feature vector corresponding to the first sample image and a feature vector corresponding to the second sample image through an image matching model to be trained based on the first sample image and the second sample image, wherein the first sample image and the second sample image belong to a positive sample image pair, and the positive sample image pair corresponds to a positive sample label;
based on the first sample image and the third sample image, acquiring a feature vector corresponding to the first sample image and a feature vector corresponding to the third sample image through an image matching model to be trained, wherein the first sample image and the third sample image belong to a negative sample image pair, and the negative sample image pair corresponds to a negative sample label;
determining a first characteristic distance according to the characteristic vector corresponding to the first sample image and the characteristic vector corresponding to the second sample image;
determining a second characteristic distance according to the characteristic vector corresponding to the first sample image and the characteristic vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a second loss function according to the first characteristic distance, the second characteristic distance, the positive sample label and the negative sample label.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the training module is specifically used for acquiring an embedded vector corresponding to the first sample image through the image matching model to be trained based on the first sample image;
based on a second sample image, acquiring an embedded vector corresponding to the second sample image through an image matching model to be trained, wherein the second sample image belongs to a positive sample image;
based on a third sample image, obtaining an embedded vector corresponding to the third sample image through an image matching model to be trained, wherein the third sample image belongs to a negative sample image;
determining a first embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the second sample image;
determining a second embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a third loss function according to the first embedding distance and the second embedding distance.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the training module is specifically used for obtaining a prediction classification result through an image matching model to be trained based on a first sample image and a second sample image, and the first sample image and the second sample image correspond to the labeling classification labels;
and updating the model parameters of the image matching model to be trained by adopting a fourth loss function according to the prediction classification result and the labeling classification label.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is further used for acquiring a training sample image, wherein the training sample image comprises M pixel points, each pixel point corresponds to a category label, the category label is used for indicating that the pixel point belongs to an interested area or a background area, and M is an integer greater than 1;
the acquisition module is also used for acquiring the class prediction probability of each pixel point through the semantic segmentation model to be trained based on the training sample image;
and the training module is also used for updating the model parameters of the semantic segmentation model to be trained by adopting a fifth loss function according to the class prediction probability of each pixel point and the class label of each pixel point until the model training conditions are met, so as to obtain the semantic segmentation model.
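For illustration, the pixel-wise training described above can be sketched as follows. The segmentation network, optimizer, two-class labelling (background vs. region of interest) and the use of cross-entropy as the "fifth loss function" are all assumptions made for this sketch, not details stated in the patent.

```python
import torch
import torch.nn.functional as F

def segmentation_training_step(segmentation_model, optimizer, images, pixel_labels):
    # images: [B, 3, H, W]; pixel_labels: [B, H, W] long tensor
    # with 0 = background pixel, 1 = region-of-interest pixel
    logits = segmentation_model(images)            # [B, 2, H, W] class scores per pixel
    loss = F.cross_entropy(logits, pixel_labels)   # per-pixel classification loss (assumed form)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```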
Another aspect of the present application provides an image matching apparatus, including:
the acquisition module is used for acquiring a first image to be matched;
the acquisition module is also used for acquiring a second image to be matched;
and the determining module is used for determining a matching result through an image matching model based on the first image to be matched and the second image to be matched, wherein the image matching model is obtained by adopting the method in the aspect.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically used for acquiring a feature vector corresponding to the first image to be matched through an image matching model based on the first image to be matched;
based on the second image to be matched, acquiring a feature vector corresponding to the second image to be matched through an image matching model;
determining a target characteristic distance according to the characteristic vector corresponding to the first image to be matched and the characteristic vector corresponding to the second image to be matched;
if the target characteristic distance is smaller than or equal to the distance threshold, determining that the matching result is successful;
and if the target characteristic distance is greater than the distance threshold, determining that the matching result is matching failure.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically used for acquiring a similarity score through an image matching model based on the first image to be matched and the second image to be matched;
if the similarity score is larger than or equal to the similarity threshold, determining that the matching result is successful;
and if the similarity value is smaller than the similarity threshold value, determining that the matching result is matching failure.
Another aspect of the present application provides a computer device, comprising: a memory, a processor, and a bus system;
wherein, the memory is used for storing programs;
a processor for executing the program in the memory, the processor for performing the above-described aspects of the method according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a model training method, which includes the steps of firstly obtaining a first image to be trained, then obtaining a first region segmentation result through a semantic segmentation model based on the first image to be trained, wherein the first region segmentation result is used for determining an interested region and a background region in the first image to be trained, obtaining a first sample image and a second sample image derived from the first image to be trained based on the first region segmentation result, the region proportion of the interested region included in the first sample image is larger than or equal to a proportion threshold value, and the region proportion of the interested region included in the second sample image is larger than or equal to the proportion threshold value. And finally, updating model parameters of the image matching model to be trained according to the first sample image and the second sample image until model training conditions are met, so as to obtain the image matching model. By the mode, the image is divided into the region of interest and the background region, so that when the image blocks are randomly selected from the image, the image can be more specifically selected, and representative image blocks are taken as sample images, so that the image matching model with higher robustness can be obtained through training.
Drawings
FIG. 1 is a block diagram of an embodiment of an image matching system;
FIG. 2 is a schematic flow chart of model training and image matching according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of model training based on a semantic segmentation model in an embodiment of the present application;
FIG. 5 is a schematic diagram of randomly framing image blocks in an embodiment of the present application;
FIG. 6 is another diagram illustrating randomly framing image blocks in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an image matching model to be trained in an embodiment of the present application;
FIG. 8 is another diagram illustrating randomly framing image blocks in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an image matching model to be trained in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an image matching model to be trained in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an image matching model to be trained in an embodiment of the present application;
FIG. 12 is a schematic diagram illustrating region of interest and background region labeling in an embodiment of the present application;
FIG. 13 is a flowchart illustrating an image matching method according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of an image matching model in an embodiment of the present application;
FIG. 15 is a schematic diagram of another structure of an image matching model in the embodiment of the present application;
FIG. 16 is a schematic view of a model training apparatus according to an embodiment of the present application;
FIG. 17 is a schematic diagram of an image matching apparatus according to an embodiment of the present application;
FIG. 18 is a schematic structural diagram of a server in an embodiment of the present application;
fig. 19 is a schematic structural diagram of a terminal device in the embodiment of the present application.
Detailed Description
The embodiments of the application provide a model training method, an image matching method and device, and a storage medium. By dividing an image into a region of interest and a background region, image blocks can be framed from the image in a more targeted manner when they are randomly selected, so that representative image blocks are taken as sample images, which helps to train a more robust image matching model.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In recent years, the field of Artificial Intelligence (AI) has received unprecedented attention from all sectors of society, and machines continually simulate, and to some extent even surpass, abilities and skills that were once specific to humans. Computer Vision (CV) is a science that studies how to make machines "see"; it refers to using cameras and computers instead of human eyes to identify, track and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision research attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, smart transportation and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition. CV techniques are used here to realize image matching, and some application scenarios of the image matching method provided in the present application are described below.
An application scenario of searching for pictures with a picture:
(1) E-commerce shopping: a user may search for the same or similar items to compare prices. That is, the user uploads a commodity picture A, and the server calls the image matching model to match the commodity picture A pairwise with other commodity pictures stored in the background, so as to find the commodity pictures with higher similarity.
(2) Entertainment: a user can find the name of a movie or TV show from a screenshot. The user uploads a screenshot A, and the server calls the image matching model to match the screenshot A pairwise with other screenshots stored in the background, so as to find the screenshot with the highest similarity.
(3) Image retrieval: a user uploads a picture A on an image search website, and the server calls the image matching model to match the picture A pairwise with other pictures stored in the background, so as to find pictures with higher similarity as the retrieval results.
An application scenario of identifying image-text content:
Identifying content of interest: an image-text image is captured from a webpage, where the image-text image comprises at least one of a text image or a picture, and a text image refers to text presented in image form. The server calls the image matching model to match this image pairwise with the image-text images stored in the background, so as to find the image-text images with higher similarity.
In order to reduce image mismatch generated during retrieval in the above scenario, an image matching model with better robustness needs to be trained, and the present application proposes a model training method, where the method is applied to the image matching system shown in fig. 1, as shown in the figure, the image matching system includes a server and a terminal device, and a client is deployed on the terminal device, where the client may run on the terminal device in the form of a browser, or may run on the terminal device in the form of an independent Application (APP), and a specific presentation form of the client is not limited herein. The server related to the application can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, safety service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal device may be a smart phone, a tablet computer, a notebook computer, a palm computer, a personal computer, a smart television, a smart watch, a vehicle-mounted device, a wearable device, and the like, but is not limited thereto. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. The number of servers and terminal devices is not limited. The scheme provided by the application can be independently completed by the terminal device, can also be independently completed by the server, and can also be completed by the cooperation of the terminal device and the server, so that the application is not particularly limited.
Based on this, in conjunction with the image matching system shown in fig. 1, model training and image matching can be achieved. For ease of understanding, please refer to fig. 2, fig. 2 is a schematic flowchart of a process of model training and image matching in an embodiment of the present application, and specifically as shown in the figure:
In step S1, training sample images are first collected in a target scene, where the target scene may be a web page or a public account, and the training sample images are images captured in that scene.
In step S2, some of these training sample images are selected for manual labeling. For example, 10,000 training sample images are selected and labeled, that is, the text regions, picture regions and background regions in these images are labeled respectively.
In step S3, a semantic segmentation model is trained with the labeled images, so as to segment the text region, the picture region and the background region in an image to be trained. All images to be trained are then segmented with the trained semantic segmentation model, thereby obtaining the segmented regions of each image to be trained.
In step S4, during random sampling, it is checked whether the sum of the proportions of the text area and the picture area in a sample image is greater than or equal to a proportion threshold; if so, the sample image meets the requirement. The sample images meeting the requirement are then used to train the image matching model.
In step S5, image feature vectors of two images are extracted with the trained image matching model, and the similarity between the two images is measured by the Euclidean distance between the two feature vectors.
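For illustration only, step S5 might look roughly like the following sketch; the model interface, preprocessing and names are assumptions rather than anything specified in the patent.

```python
import torch

def image_similarity_distance(image_matching_model, image_a, image_b):
    # image_a, image_b: preprocessed image tensors of shape [1, 3, H, W]
    with torch.no_grad():
        feature_a = image_matching_model(image_a)   # feature vector of image A
        feature_b = image_matching_model(image_b)   # feature vector of image B
    # A smaller Euclidean distance indicates a higher similarity (step S5)
    return torch.dist(feature_a, feature_b, p=2).item()
```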
With reference to fig. 3, a method for training a model in the present application will be described below, and an embodiment of the method for training a model in the present application includes:
110. acquiring a first image to be trained;
in one or more embodiments, the model training apparatus may select one image to be trained from the image set to be trained as the first image to be trained, where the image to be trained included in the image set to be trained may be a page screenshot, or an application screenshot, and the like, which is not limited herein.
It should be noted that the model training apparatus provided in the present application may be deployed in a server, or may be deployed in a terminal device, or may be deployed in a system formed by a server and a terminal device, which is not limited herein.
120. Acquiring a first region segmentation result through a semantic segmentation model based on the first image to be trained, wherein the first region segmentation result is used for determining a region of interest and a background region in the first image to be trained, and the region of interest comprises at least one of a text region or a picture region;
in one or more embodiments, the model training apparatus inputs the first image to be trained into a semantic segmentation model, and outputs a first region segmentation result by the semantic segmentation model, wherein the first region segmentation result is used for segmenting the first image to be trained into a region of interest and a background region. It is to be understood that the region of interest includes at least one of a text region or a picture region, the text region including textual content presented in the form of a picture.
130. Acquiring a first sample image and a second sample image derived from a first image to be trained according to a first region segmentation result, wherein the region proportion of a region of interest included in the first sample image is greater than or equal to a proportion threshold value, and the region proportion of the region of interest included in the second sample image is greater than or equal to the proportion threshold value;
in one or more embodiments, the model training apparatus divides the first image to be trained into an area of interest and a background area according to the first region segmentation result, and when the first sample image and the second sample image are selected, the area proportion of the area of interest included in the first sample image needs to be greater than or equal to a proportion threshold, and the area proportion of the area of interest included in the second sample image also needs to be greater than or equal to a proportion threshold. There are various ways to extract the first sample image and the second sample image, which will be described separately below.
For example, a first sample image that is satisfactory (i.e., the region proportion of the region of interest is greater than or equal to the proportion threshold) may be randomly cut out from the first image to be trained, and another second sample image that is satisfactory may be randomly cut out from the first image to be trained.
For example, a first sample image that meets the requirement (i.e., the area ratio of the region of interest is greater than or equal to the ratio threshold) may be randomly intercepted from the first image to be trained, and then the first sample image is subjected to data augmentation processing to obtain a second sample image.
140. And updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image until the model training conditions are met, so as to obtain the image matching model.
In one or more embodiments, the model training apparatus inputs the first sample image and the second sample image into the image matching model to be trained, and the image matching model to be trained outputs a predicted value. A loss function is used to estimate the degree of inconsistency between the predicted value and the true value, and during training the loss is made smaller and smaller through continuous iterative computation with a gradient-descent optimization algorithm.
It is understood that the model training conditions include "exhaustion criteria" as well as "observation criteria". Illustratively, taking "exhaustion criteria" as an example, if the number of model iterations reaches an iteration number threshold (e.g., 10000), it indicates that the model training condition is satisfied, and thus, the updated model parameters are taken as the model parameters of the image matching model. Illustratively, taking "observation criteria" as an example, if the loss value has converged, it indicates that the model training condition is satisfied, and thus, the updated model parameters are taken as the model parameters of the image matching model.
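A minimal sketch of how these two criteria might be combined in a training loop is shown below; the iteration threshold and convergence tolerance are example values, not values prescribed by the patent.

```python
def train_until_done(training_step, max_iterations=10000, tolerance=1e-4):
    # training_step() performs one parameter update and returns the loss value
    previous_loss = float("inf")
    for iteration in range(max_iterations):          # "exhaustion criterion": iteration threshold
        loss = training_step()
        if abs(previous_loss - loss) < tolerance:    # "observation criterion": loss has converged
            return iteration + 1
        previous_loss = loss
    return max_iterations
```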
Based on this, for convenience of introduction, please refer to fig. 4, where fig. 4 is a schematic diagram of model training based on a semantic segmentation model in the embodiment of the present application. As shown in the figure, the first image to be trained is input into the semantic segmentation model, and the semantic segmentation model outputs the first region segmentation result. The first region segmentation result includes the framed region of interest, which comprises the picture region indicated by A1 and the text region indicated by A2. Thus, two image patches, namely the first sample image and the second sample image, can be randomly framed from the first image to be trained, where the first sample image and the second sample image belong to one positive sample image pair. The image matching model to be trained is then trained based on the positive sample image pair until the model training conditions are met, and the image matching model is obtained.
The application thus provides a model training method in which the image is divided into a region of interest and a background region, so that when image blocks are randomly selected from the image, they can be framed in a more targeted manner and representative image blocks are taken as sample images, which helps to train a more robust image matching model.
Optionally, on the basis of the various embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, the obtaining, according to the first region segmentation result, the first sample image and the second sample image derived from the first image to be trained may specifically include:
randomly selecting a first image and a second image from a first image to be trained;
determining the region proportion of the region of interest in the first image according to the first region segmentation result, and determining the region proportion of the region of interest in the second image;
if the area proportion of the interest area in the first image is larger than or equal to the proportion threshold value, taking the first image as a first sample image;
and if the area proportion of the interest area in the second image is larger than or equal to the proportion threshold value, taking the second image as a second sample image.
In one or more embodiments, a way to directly extract a positive sample image pair is presented. As can be seen from the foregoing embodiments, the first sample image and the second sample image that meet the requirements can be directly taken from the image to be trained (e.g., the first image to be trained).
Specifically, for easy understanding, please refer to fig. 5, where fig. 5 is a schematic diagram of a randomly framed image block in the embodiment of the present application, and as shown in the figure, a first region segmentation result can be obtained after a first image to be trained passes through a semantic segmentation model. Thus, two image blocks are randomly selected from the first image to be trained, the two image blocks being the first image indicated by B1 and the second image indicated by B2, respectively. Based on this, from the first region segmentation result, the region proportion of the region of interest in the first image and the region proportion of the region of interest in the second image can be determined. Taking fig. 5 as an example, it is assumed that the area occupancy ratio of the region of interest in the first image is 100%, the area occupancy ratio of the region of interest in the second image is 80%, and the occupancy ratio threshold value is 70%, so that it can be seen that the area occupancy ratio of the region of interest in the first image is greater than or equal to the occupancy ratio threshold value, and therefore, the first image is taken as the first sample image. Similarly, the area occupancy ratio of the region of interest in the second image is greater than or equal to the occupancy threshold value, and therefore the second image is taken as the second sample image.
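The selection rule in this example can be sketched as follows; the mask convention, patch size handling and function names are assumptions, and the 70% threshold simply mirrors the example above.

```python
import random
import numpy as np

def sample_satisfactory_patch(roi_mask, patch_h, patch_w, ratio_threshold=0.7, max_tries=50):
    # roi_mask: H x W array with 1 for region-of-interest pixels (text or picture)
    # and 0 for background pixels, i.e. a binary form of the region segmentation result.
    H, W = roi_mask.shape
    for _ in range(max_tries):
        top = random.randint(0, H - patch_h)
        left = random.randint(0, W - patch_w)
        patch = roi_mask[top:top + patch_h, left:left + patch_w]
        roi_ratio = float(patch.mean())              # region proportion of the region of interest
        if roi_ratio >= ratio_threshold:             # the patch meets the requirement
            return top, left, roi_ratio
    return None                                      # no satisfactory patch found within max_tries attempts
```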
Secondly, in the embodiment of the present application, a manner of directly extracting a positive sample image pair is provided, and through the manner, sample images meeting requirements can be directly extracted randomly from images to be trained, so that subsequent model training can be performed by using the sample images, and thus, the feasibility and operability of the scheme are increased.
Optionally, on the basis of the various embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, the obtaining, according to the first region segmentation result, the first sample image and the second sample image derived from the first image to be trained may specifically include:
randomly selecting a first image to be processed and a second image to be processed from the first image to be trained;
according to the first region segmentation result, determining the region proportion of the region of interest in the first image to be processed, and determining the region proportion of the region of interest in the second image to be processed;
if the area proportion of the interest area in the first image to be processed is larger than or equal to the proportion threshold value, carrying out data augmentation processing on the first image to be processed to obtain a first sample image;
and if the area proportion of the interest area in the second image to be processed is greater than or equal to the proportion threshold, performing data augmentation processing on the second image to be processed to obtain a second sample image.
In one or more embodiments, a way to extract positive sample image pairs based on data augmentation is presented. As can be seen from the foregoing embodiments, the first to-be-processed image and the second to-be-processed image that meet the requirements can be taken out from the to-be-trained image (e.g., the first to-be-trained image), and then the data expansion processing is performed on the to-be-processed images, respectively, so as to obtain the first sample image and the second sample image.
Specifically, for ease of understanding, please refer to fig. 6, where fig. 6 is another schematic diagram of randomly framing image blocks in the embodiment of the present application. As shown in the figure, the first region segmentation result can be obtained after the first image to be trained passes through the semantic segmentation model. Thus, two image blocks are randomly selected from the first image to be trained, namely the first image to be processed indicated by C1 and the second image to be processed indicated by C2. Based on this, from the first region segmentation result, the region proportion of the region of interest in the first image to be processed and the region proportion of the region of interest in the second image to be processed can be determined. Taking fig. 6 as an example, assume that the region proportion of the region of interest in the first image to be processed is 100%, the region proportion in the second image to be processed is 80%, and the proportion threshold is 70%. Both region proportions are therefore greater than or equal to the proportion threshold, so data augmentation processing is performed on the first image to be processed to obtain the first sample image, and on the second image to be processed to obtain the second sample image.
The data augmentation processing includes, but is not limited to, Gaussian blur, color transform, gray scale transform, rotation, and the like. Fig. 6 uses rotation as the form of data augmentation, but this should not be construed as limiting the present application.
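As an illustrative sketch (not the patent's implementation), such an augmentation could be assembled with torchvision transforms; the parameter values below are assumptions.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.GaussianBlur(kernel_size=5),                                 # Gaussian blur
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),   # color transform
    transforms.RandomGrayscale(p=0.2),                                      # gray scale transform
    transforms.RandomRotation(degrees=15),                                  # rotation
])

# e.g. second_sample_image = augment(second_image_to_be_processed)
```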
Secondly, in the embodiment of the application, a manner of extracting a positive sample image pair based on data augmentation is provided. In this manner, a sample image meeting the requirement can be randomly extracted from the image to be trained, and data augmentation is then performed to obtain the sample images, so that subsequent model training can be performed with these sample images, which increases the feasibility and operability of the scheme.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image may specifically include:
based on the first sample image, acquiring a first feature map through a coding network included in an image matching model to be trained;
based on the first feature map, acquiring a first feature vector through a projection network included in an image matching model to be trained;
based on the first feature vector, obtaining a target feature vector through a prediction network included in the image matching model to be trained;
acquiring a second feature map through a coding network included in the target model based on the second sample image;
acquiring a second feature vector through a projection network included in the target model based on the second feature map;
and updating the model parameters of the image matching model to be trained through the first loss function according to the target characteristic vector and the second characteristic vector.
In one or more embodiments, a way of training the image matching model based on the Bootstrap Your Own Latent (BYOL) framework is introduced. In the BYOL framework, for an input image (e.g., the first image to be trained), two images (e.g., the first sample image and the second sample image) can be obtained by two random image augmentation strategies, respectively.
Specifically, for convenience of understanding, please refer to fig. 7, and fig. 7 is a schematic structural diagram of an image matching model to be trained in the embodiment of the present application, and as shown in the figure, the BYOL framework includes an online network (online network) and a target network (target network), where the image matching model to be trained shown in fig. 7 is an online network. The first sample image is input to the coding network (i.e., coding network 1) included in the image matching model to be trained, thereby outputting a first feature map. The first feature map is then input to a projection network (i.e., projection network 1) included in the image matching model to be trained, thereby outputting a first feature vector. And inputting the first feature vector to a prediction network included in the image matching model to be trained, thereby outputting a target feature vector.
Similarly, the second sample image is input to the encoding network (i.e., the encoding network 2) included in the image matching model to be trained, thereby outputting the second feature map. The second feature map is then input to a projection network (i.e., projection network 2) included in the image matching model to be trained, thereby outputting a second feature vector.
Based on the above, the first loss function can be used to calculate the loss value between the target feature vector and the second feature vector, and to update the model parameters of the image matching model to be trained. The first loss function may be an L2 loss function, specifically:

L(\theta, \xi) = \left\| q_{\theta}(z_{\theta}) - z'_{\xi} \right\|_2^2

where \theta represents the model parameters of the online network (i.e., the image matching model to be trained), \xi represents the model parameters of the target network, z_{\theta} represents the first feature vector, q_{\theta}(z_{\theta}) represents the target feature vector, and z'_{\xi} represents the second feature vector.
It should be noted that the coding network may be a 50-layer Residual Network (ResNet-50). The projection network may include a fully connected (FC) layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU).
After the model training is completed, the image matching model to be trained (i.e., the online network) can be used as an image matching model for subsequent image matching.
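A condensed PyTorch sketch of the structure described above is given below for illustration. It assumes a ResNet-50 encoder producing 2048-dimensional features (the encoder itself is omitted); the head sizes, the EMA momentum and all helper names are assumptions rather than values stated in the patent, and the loss shown is the normalized L2 form used in the original BYOL paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionMLP(nn.Module):
    # Projection/prediction head built from FC -> BN -> ReLU -> FC layers,
    # matching the FC/BN/ReLU components mentioned above; dimensions are assumed.
    def __init__(self, in_dim=2048, hidden_dim=4096, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def byol_loss(online_prediction, target_projection):
    # Normalized L2 loss between the target feature vector q_theta(z_theta) and
    # the second feature vector z'_xi (equivalent to 2 - 2 * cosine similarity).
    p = F.normalize(online_prediction, dim=-1)
    z = F.normalize(target_projection, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

@torch.no_grad()
def update_target_network(online_net, target_net, momentum=0.99):
    # The target network (parameters xi) follows the online network (parameters theta)
    # by an exponential moving average; the momentum value is an assumption.
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.data.mul_(momentum).add_(p_online.data, alpha=1.0 - momentum)

# Assumed forward pass for one positive pair (first_sample, second_sample):
#   z_theta = online_projector(online_encoder(first_sample))              # first feature vector
#   q_theta = online_predictor(z_theta)                                   # target feature vector
#   z_prime = target_projector(target_encoder(second_sample)).detach()    # second feature vector
#   loss    = byol_loss(q_theta, z_prime)
```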
Secondly, the embodiment of the application provides a way of training the image matching model based on the BYOL framework. In this way, relying on the interaction and mutual learning between the two networks, high classification accuracy can be reached without negative sample pairs, thereby improving the robustness of the image matching model.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiments of the present application may further include:
acquiring a second image to be trained;
acquiring a second region segmentation result through a semantic segmentation model based on a second image to be trained, wherein the second region segmentation result is used for determining an interested region and a background region in the second image to be trained;
obtaining a third sample image from the second image to be trained according to the second region segmentation result, wherein the ratio of the region of interest included in the third sample image is greater than or equal to a ratio threshold;
according to the first sample image and the second sample image, updating model parameters of the image matching model to be trained until model training conditions are met, and obtaining the image matching model, wherein the method specifically comprises the following steps:
and updating the model parameters of the image matching model to be trained according to the first sample image, the second sample image and the third sample image until the model training conditions are met, so as to obtain the image matching model.
In one or more embodiments, a manner of training with positive and negative sample image pairs together is described. As in the foregoing embodiments, a first sample image and a second sample image that meet the requirement can be taken from one image to be trained (e.g., the first image to be trained), and together they form a positive sample image pair. Similarly, the first sample image (or the second sample image, or a fourth sample image) and a third sample image that meet the requirement can be taken from two different images to be trained (e.g., the first image to be trained and the second image to be trained), respectively, and together they form a negative sample image pair. It will be appreciated that the two sample images in a negative sample image pair are each derived from a different image to be trained.
Specifically, for ease of understanding, please refer to fig. 8, where fig. 8 is another schematic diagram of randomly framing image blocks in the embodiment of the present application. As shown in the figure, the first region segmentation result can be obtained after the first image to be trained passes through the semantic segmentation model. Thus, two satisfactory image blocks are randomly selected from the first image to be trained, for example the first sample image indicated by D1 and the second sample image indicated by D2. The first sample image and the second sample image together form a positive sample image pair. Similarly, the second region segmentation result can be obtained after the second image to be trained passes through the semantic segmentation model. Thus, a satisfactory image block is randomly selected from the second image to be trained, for example the third sample image indicated by D3. In addition, a satisfactory image block is taken from the first image to be trained to pair with it: for example, the first sample image may be used directly, so that the first sample image and the third sample image together form a negative sample image pair; or the second sample image may be used directly, so that the second sample image and the third sample image together form a negative sample image pair; or a fourth sample image may be newly extracted, so that the fourth sample image and the third sample image together form a negative sample image pair.
It is understood that a satisfactory sample image specifically means that the proportion of the region of interest in the image is greater than or equal to a proportion threshold.
It is assumed that a positive sample image pair is formed by the first sample image and the second sample image together, and a negative sample image pair is formed by the first sample image and the third sample image together. Based on the method, model parameters of the image matching model to be trained are updated by combining the positive sample image pair and the negative sample image pair until model training conditions are met, and the image matching model is obtained.
Secondly, in the embodiment of the application, a mode of training the positive and negative image sample pairs together is provided, and by the above mode, the positive sample image pair can be extracted from the same image to be trained, and the negative sample image pair can be extracted from different images to be trained respectively.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, updating the model parameters of the image matching model to be trained according to the first sample image, the second sample image, and the third sample image may specifically include:
based on a first sample image and a second sample image, acquiring a feature vector corresponding to the first sample image and a feature vector corresponding to the second sample image through an image matching model to be trained, wherein the first sample image and the second sample image belong to a positive sample image pair, and the positive sample image pair corresponds to a positive sample label;
based on the first sample image and the third sample image, acquiring a feature vector corresponding to the first sample image and a feature vector corresponding to the third sample image through an image matching model to be trained, wherein the first sample image and the third sample image belong to a negative sample image pair, and the negative sample image pair corresponds to a negative sample label;
determining a first characteristic distance according to the characteristic vector corresponding to the first sample image and the characteristic vector corresponding to the second sample image;
determining a second characteristic distance according to the characteristic vector corresponding to the first sample image and the characteristic vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a second loss function according to the first characteristic distance, the second characteristic distance, the positive sample label and the negative sample label.
In one or more embodiments, a manner of training an image matching model using a contrastive loss function in conjunction with positive and negative sample image pairs is presented. In the foregoing embodiment, in the process of the self-supervised network training, the first sample image, the second sample image and the third sample image that meet the requirements may be randomly selected for the image to be trained. It is understood that the present application describes the first sample image and the second sample image as belonging to a positive sample image pair, and the first sample image and the third sample image as belonging to a negative sample image pair.
Specifically, for ease of understanding, please refer to fig. 9, which is another schematic structural diagram of the image matching model to be trained in the embodiment of the present application. As shown in the figure, the image matching model to be trained includes a sub-model 1 and a sub-model 2. Sample image 1 and sample image 2 form a sample image pair (for example, a positive sample image pair composed of the first sample image and the second sample image, or a negative sample image pair composed of the first sample image and the third sample image). Sample image 1 is input to sub-model 1, and sub-model 1 outputs feature vector 1; sample image 2 is input to sub-model 2, and sub-model 2 outputs feature vector 2. According to feature vector 1 and feature vector 2, the Euclidean distance between the two sample images can be calculated. In practical application, N groups of sample image pairs can be respectively input to the image matching model to be trained, so that the Euclidean distance of each sample image pair is calculated from the predicted feature vectors, and the model parameters are then updated.
Based on the above, the second loss function can be adopted to calculate the loss value between the Euclidean distance and the real label, and the model parameters of the image matching model to be trained are updated. The second loss function may adopt a contrastive loss (Contrastive Loss) function, specifically:

LC = y·d² + (1 − y)·[max(margin − d, 0)]²;

wherein LC represents the second loss function, y represents the sample pair label, e.g., the positive sample label is 1 and the negative sample label is 0, d represents the Euclidean distance between the two feature vectors, margin represents a preset distance, e.g., 1, and max(·) represents taking the maximum value.
It can be understood that the contrastive loss function is mainly used for feature extraction, that is, sample images that are originally similar remain similar in the feature space after feature extraction, and sample images that are originally dissimilar remain dissimilar in the feature space after feature extraction. Moreover, this loss function can well express the matching degree between a pair of sample images.
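As an illustration only, the contrastive loss above might be implemented as follows, assuming PyTorch, batched feature vectors of shape (N, D), and a margin of 1 as in the example above; this is a sketch rather than the implementation of the application.

import torch
import torch.nn.functional as F

def contrastive_loss(feat_1, feat_2, pair_label, margin=1.0):
    # feat_1, feat_2: feature vectors output by the two sub-models, shape (N, D)
    # pair_label: 1.0 for a positive sample image pair, 0.0 for a negative pair
    d = F.pairwise_distance(feat_1, feat_2)                        # Euclidean distance d
    loss_pos = pair_label * d.pow(2)                               # y * d^2
    loss_neg = (1.0 - pair_label) * torch.clamp(margin - d, min=0).pow(2)
    return (loss_pos + loss_neg).mean()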
It should be noted that the image matching model to be trained belongs to a twin (Siamese) neural network (SNN), and therefore the image matching model to be trained may include two branches, that is, two sub-models. The network structures of the two sub-models are the same, and the model parameters are shared. The twin neural network takes two sample images at a time as one sample image pair. Each input sample image is mapped to a feature vector, and the distance between the two feature vectors is used to represent the semantic difference between the inputs. A twin neural network may thus be used to determine the similarity between two sample images.
After the model training is finished, any sub-model can be extracted from the image matching model to be trained to serve as the image matching model for subsequent image matching.
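A minimal sketch of such a twin structure is given below, assuming PyTorch and torchvision, with a ResNet-18 backbone chosen purely for illustration (the actual backbone and feature dimension are not specified here); because the two branches share parameters, extracting "any sub-model" amounts to reusing the shared encoder.

import torch.nn as nn
import torchvision.models as models

class TwinMatcher(nn.Module):
    # Two branches with identical structure and shared model parameters.
    def __init__(self, feat_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)                    # illustrative backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)  # feature vector head
        self.encoder = backbone                                     # one set of weights serves both branches

    def forward(self, sample_image_1, sample_image_2):
        return self.encoder(sample_image_1), self.encoder(sample_image_2)

# After training, the shared encoder itself can serve as the image matching model:
# image_matching_model = TwinMatcher().encoder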
In the embodiment of the application, a mode of training an image matching model by using a contrastive loss function in combination with positive and negative sample image pairs is provided. In this mode, the network structure of a twin neural network is used as the network structure of the image matching model, so that the contrastive loss function can effectively handle the relation between paired sample images in the twin neural network and well express the matching degree of the paired sample images, and a model for feature extraction is thereby well trained.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, updating the model parameters of the image matching model to be trained according to the first sample image, the second sample image, and the third sample image may specifically include:
based on the first sample image, acquiring an embedded vector corresponding to the first sample image through an image matching model to be trained;
based on a second sample image, acquiring an embedded vector corresponding to the second sample image through an image matching model to be trained, wherein the second sample image belongs to a positive sample image;
based on a third sample image, obtaining an embedded vector corresponding to the third sample image through an image matching model to be trained, wherein the third sample image belongs to a negative sample image;
determining a first embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the second sample image;
determining a second embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a third loss function according to the first embedding distance and the second embedding distance.
In one or more embodiments, a way to combine positive and negative sample images and train a twin network using a triplet loss function is presented. In the foregoing embodiment, in the process of the self-supervised network training, the first sample image, the second sample image and the third sample image that meet the requirements may be randomly selected for the image to be trained. It is understood that the present application takes, as an example, the first sample image as a reference image (anchor), the second sample image as a positive sample image, and the third sample image as a negative sample image.
Specifically, for ease of understanding, please refer to fig. 10, which is a schematic structural diagram of an image matching model to be trained in an embodiment of the present application. The image matching model to be trained includes a sub-model 1, a sub-model 2, and a sub-model 3. Sample image 2 is the first sample image (i.e., the reference image), sample image 1 is the second sample image (i.e., the positive sample image), and sample image 3 is the third sample image (i.e., the negative sample image). Sample image 1 is input to sub-model 1, and sub-model 1 outputs embedded vector 1; sample image 2 is input to sub-model 2, and sub-model 2 outputs embedded vector 2; sample image 3 is input to sub-model 3, and sub-model 3 outputs embedded vector 3. From embedded vector 1 and embedded vector 2, the embedding distance between the two sample images can be calculated; from embedded vector 2 and embedded vector 3, the embedding distance between those two sample images can also be calculated. In practical application, N triplet sample images can be respectively input to the image matching model to be trained, the Euclidean distance within each triplet is calculated based on the predicted embedded vectors, and the model parameters are then updated.
To improve the robustness of the model, it is necessary not only to distinguish positive and negative sample images, but also to make samples within a class more compact and samples between classes farther apart. Therefore, the image matching model to be trained may take three inputs, using either one positive sample image and two negative sample images in training, or one negative sample image and two positive sample images. The present application takes the case of one negative sample image and two positive sample images as an example, which should not be construed as limiting the application.
Based on the above, the third loss function can be adopted to calculate the loss value between the embedding distances, and the model parameters of the image matching model to be trained are updated. The third loss function may adopt a triplet loss (Triplet Loss) function, specifically:

LT = max(0, D(A, P) − D(A, N) + margin);

D(A, P) = ||Net(A) − Net(P)||2;

D(A, N) = ||Net(A) − Net(N)||2;

wherein LT represents the third loss function. A denotes the reference image (e.g., the first sample image). P denotes the positive sample image (e.g., the second sample image). N denotes the negative sample image (e.g., the third sample image). margin represents a preset distance, e.g., 1. D(A, P) represents the embedding distance between the reference image and the positive sample image. D(A, N) represents the embedding distance between the reference image and the negative sample image. max(·) represents taking the maximum value. Net(A) represents the embedded vector of the reference image output by one sub-model, Net(P) represents the embedded vector of the positive sample image output by another sub-model, and Net(N) represents the embedded vector of the negative sample image output by the third sub-model.
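A possible implementation of the triplet loss above, again as a sketch assuming PyTorch and batched embedded vectors; the margin of 1 follows the example given above.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # anchor, positive, negative: embedded vectors of shape (N, D)
    d_ap = F.pairwise_distance(anchor, positive)   # D(A, P)
    d_an = F.pairwise_distance(anchor, negative)   # D(A, N)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

# PyTorch also provides an equivalent built-in: torch.nn.TripletMarginLoss(margin=1.0)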
It should be noted that the image matching model to be trained belongs to a twin neural network, and therefore the image matching model to be trained may include three branches, that is, three sub-models. The network structures of the three sub-models are the same, and the model parameters are shared. The twin neural network takes three sample images at a time as one triplet sample image.
After the model training is finished, any sub-model can be extracted from the image matching model to be trained to serve as the image matching model for subsequent image matching.
In the embodiment of the application, a method of training the twin network by combining positive and negative sample images and adopting the triplet loss function is provided. In this manner, the network structure of the twin neural network is used as the network structure of the image matching model, so that the triplet loss function can effectively handle the relationship among the three associated sample images in the twin neural network and well express their matching degree, and a model for feature extraction is thereby well trained.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image may specifically include:
obtaining a prediction classification result through an image matching model to be trained based on a first sample image and a second sample image, wherein the first sample image and the second sample image correspond to labeling classification labels;
and updating the model parameters of the image matching model to be trained by adopting a fourth loss function according to the prediction classification result and the labeling classification label.
In one or more embodiments, a way to combine positive sample image pairs and train a twin network with a loss function is presented. In the foregoing embodiment, in the network training process, the first sample image and the second sample image that meet the requirements may be randomly selected for the image to be trained. It is understood that the present application describes the first sample image and the second sample image as belonging to a positive sample image pair.
Specifically, for ease of understanding, please refer to fig. 11, which is a schematic structural diagram of an image matching model to be trained in the embodiment of the present application. As shown in the figure, the image matching model to be trained includes a sub-model 1 and a sub-model 2. Sample image 1 and sample image 2 form a sample image pair (for example, a positive sample image pair composed of the first sample image and the second sample image, or a negative sample image pair composed of the first sample image and the third sample image). Sample image 1 is input to sub-model 1, and sub-model 1 outputs feature vector 1; sample image 2 is input to sub-model 2, and sub-model 2 outputs feature vector 2. Feature vector 1 and feature vector 2 are then input to a classification network together, and the classification network outputs a prediction classification result. The classification network includes a fully connected layer. The prediction classification result can be expressed as a similarity score that is greater than or equal to 0 and less than or equal to 1, where a larger similarity score indicates a higher similarity.
In practical application, N groups of sample image pairs can be respectively input to the image matching model to be trained, and the model parameters are then updated according to the error between the prediction classification result and the labeling classification label.
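For illustration only, the two shared-weight branches plus the classification network might be sketched as follows, assuming PyTorch; the encoder and feature dimension are placeholders, and the sigmoid simply squeezes the output into the range [0, 1].

import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    # Two shared-weight branches plus a classification network that outputs
    # a similarity score between 0 and 1 for a sample image pair.
    def __init__(self, encoder, feat_dim=128):
        super().__init__()
        self.encoder = encoder                      # shared sub-model (e.g., a CNN backbone)
        self.classifier = nn.Sequential(            # classification network with a fully connected layer
            nn.Linear(feat_dim * 2, 1),
            nn.Sigmoid(),                           # similarity score in [0, 1]
        )

    def forward(self, sample_image_1, sample_image_2):
        feat_1 = self.encoder(sample_image_1)
        feat_2 = self.encoder(sample_image_2)
        joint = torch.cat([feat_1, feat_2], dim=1)  # feed both feature vectors to the classifier
        return self.classifier(joint).squeeze(1)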
Based on the above, the fourth loss function can be adopted to calculate the loss value between the prediction classification result and the labeling classification label, and the model parameters of the image matching model to be trained are updated. In one case, the fourth loss function may be a mean squared error (MSE) loss function, specifically:
LM = (1/N)·Σi (ŷi − yi)²;

wherein LM represents the fourth loss function, N represents the number of sample image pairs, i denotes the i-th sample image pair, ŷi represents the labeling classification label of the i-th sample image pair, where the labeling classification label of a positive sample image pair is 1 and that of a negative sample image pair is 0, and yi represents the prediction classification result of the i-th sample image pair, i.e., a similarity score that is greater than or equal to 0 and less than or equal to 1.
In another case, the fourth loss function may use a Mean Absolute Error (MAE) loss function, specifically:
LM = (1/N)·Σi |ŷi − yi|;

wherein LM represents the fourth loss function, N represents the number of sample image pairs, i denotes the i-th sample image pair, ŷi represents the labeling classification label of the i-th sample image pair, where the labeling classification label of a positive sample image pair is 1 and that of a negative sample image pair is 0, and yi represents the prediction classification result of the i-th sample image pair, i.e., a similarity score that is greater than or equal to 0 and less than or equal to 1.
It is understood that the fourth loss function may also be other types of loss functions, and is not limited herein.
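As a short sketch assuming PyTorch, the fourth loss function could be computed over a batch of predicted similarity scores as follows; which variant is used (MSE or MAE) is selected by a flag here purely for illustration.

import torch.nn.functional as F

def fourth_loss(pred_scores, pair_labels, kind="mse"):
    # pred_scores: prediction classification results in [0, 1], shape (N,)
    # pair_labels: 1.0 for positive sample image pairs, 0.0 for negative pairs
    if kind == "mse":
        return F.mse_loss(pred_scores, pair_labels)  # mean squared error
    return F.l1_loss(pred_scores, pair_labels)       # mean absolute error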
It should be noted that the image matching model to be trained belongs to a twin neural network, and therefore the image matching model to be trained may include two branches and one classification network, that is, two sub-models and one classification network. The network structures of the two sub-models are the same, and the model parameters are shared. The twin neural network takes two sample images at a time as one sample image pair. After the two input sample images are mapped into feature vectors, the feature vectors are input to the classification network together, and the classification network outputs the prediction classification result. A twin neural network may thus be used to determine the similarity between two sample images.
After the model training is completed, the image matching model to be trained can be used as an image matching model for subsequent image matching.
Secondly, in the embodiment of the application, a method of training the twin network with positive sample image pairs and a loss function is provided. In this manner, the network structure of the twin neural network is used as the network structure of the image matching model, so that the relation between paired sample images in the twin neural network can be effectively processed using the cross entropy loss function, the matching degree of the paired sample images is well expressed, and a model that outputs an image similarity score is thereby well trained.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiments of the present application may further include:
acquiring a training sample image, wherein the training sample image comprises M pixel points, each pixel point corresponds to a category label, the category label is used for indicating that the pixel point belongs to an interested region or a background region, and M is an integer greater than 1;
based on a training sample image, acquiring the class prediction probability of each pixel point through a semantic segmentation model to be trained;
and updating the model parameters of the semantic segmentation model to be trained by adopting a fifth loss function according to the class prediction probability of each pixel point and the class label of each pixel point until the model training condition is met, so as to obtain the semantic segmentation model.
In one or more embodiments, a manner of training a semantic segmentation model is presented. With the foregoing embodiments, it can be seen that the image can be divided into the region of interest and the background region by using the semantic segmentation model, and how to train the semantic segmentation model will be described below.
Specifically, a certain number (e.g., 50000 or more) of training sample images need to be collected first, and the training sample images should be selected as randomly as possible to cover various cases in the target scene as much as possible. In order to better select representative image blocks in the process of self-supervision learning of similarity, the regions of a training sample image need to be divided and labeled in advance in the training process, so as to obtain a region of interest or a background region, and further, if the region of interest includes a text region and a picture region, three types of regions, namely the text region, the picture region and the background region, need to be labeled. For convenience of understanding, please refer to fig. 12, where fig. 12 is a schematic diagram illustrating a region of interest and a background region in an embodiment of the present application, and as shown in the drawing, a region indicated by E1 is a picture region, a region indicated by E2 is a text region, and the remaining portion is a background region.
It should be noted that, labeling all training sample images often requires a lot of manpower, so it may be considered to label only a part of the training sample images (for example, 10000 sheets), train a semantic segmentation model using these labeled data to learn a text region, a picture region, and a background region, and then segment the images by using the semantic segmentation model.
The labeled training sample images are taken as the input of the semantic segmentation model to be trained, and the labeling results are taken as the supervision information. The category label of each pixel point in the text region may be set to "1", the category label of each pixel point in the picture region may be set to "2", and the category label of each pixel point in the background region may be set to "0". Based on this, a fifth loss function can be adopted to calculate the loss value between the class prediction probability and the class label of each pixel point, and the model parameters of the semantic segmentation model to be trained are updated. The fifth loss function may adopt a cross entropy (Cross Entropy) loss function, specifically:

LE = − Σc Σu∈Sc log fu,c(I), where the outer sum runs over all C region categories;

wherein LE represents the fifth loss function, I denotes the training sample image, C denotes the total number of region categories (e.g., three categories), c denotes the c-th region category, Sc denotes the set of pixel points labeled as the c-th region category, and fu,c(I) represents the probability, given by the semantic segmentation model to be trained, that the label yu of the pixel point u in the image I belongs to the c-th region category, i.e., fu,c(I) = p(yu = c | I).
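A minimal sketch of this pixel-wise cross entropy, assuming PyTorch and that the semantic segmentation model to be trained outputs per-pixel scores for the three region categories; the tensor shapes are illustrative.

import torch.nn.functional as F

def fifth_loss(seg_logits, pixel_labels):
    # seg_logits: (N, 3, H, W) raw scores for background (0), text (1), picture (2)
    # pixel_labels: (N, H, W) integer category label of every pixel point
    return F.cross_entropy(seg_logits, pixel_labels)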
It should be noted that the semantic segmentation model provided in the present application includes, but is not limited to, a fully convolutional network (FCN), a segmentation network (SegNet), DeepLab, a pyramid scene parsing network (PSPNet), a high-resolution network (HRNet), and the like, which is not limited herein.
Secondly, in the embodiment of the application, a method for training a semantic segmentation model is provided. Through this method, not only can the region of interest and the background region be segmented from the image, but the text region and the picture region within the region of interest can also be segmented, so that randomly sampled image blocks cover more effective content. Image blocks can therefore be framed in a more targeted manner, and representative image blocks are taken as sample images, which is favorable for training an image matching model with higher robustness.
With reference to fig. 13, an embodiment of an image matching method in the present application is described below, where the method includes:
210. acquiring a first image to be matched;
in one or more embodiments, the image matching apparatus obtains the first image to be matched, where the first image to be matched may be an image uploaded by a user, or the first image to be matched may be an image captured from a web page, or the first image to be matched is an image obtained through another route, which is not limited herein.
The image matching apparatus provided in the present application may be deployed in a server, or may be deployed in a terminal device, or may be deployed in a system formed by a server and a terminal device, which is not limited herein.
220. Acquiring a second image to be matched;
in one or more embodiments, the image matching apparatus obtains a second image to be matched, where the second image to be matched may be an image uploaded by a user, or the second image to be matched may be an image stored in the background, or the second image to be matched is an image obtained through another way, which is not limited herein.
230. And determining a matching result through an image matching model based on the first image to be matched and the second image to be matched, wherein the image matching model is obtained by adopting the method provided by the embodiment for training.
In one or more embodiments, the image matching apparatus takes the first image to be matched and the second image to be matched as input of the image matching model. In one case, the image matching model may support two inputs simultaneously, i.e., the first image to be matched and the second image to be matched are input to the image matching model together. In another case, the image matching model may support one input, that is, the first image to be matched is input to the image matching model, and then the second image to be matched is input to the image matching model. And generating matching results corresponding to the two images to be matched based on the result output by the image matching model.
In the image retrieval scenario, if the user inputs one image (i.e., the first image to be matched), another image similar thereto may be retrieved. In an image classification scene, images with higher similarity can be clustered by using an image matching model, so that the efficiency and the accuracy of image classification are improved.
In the embodiment of the application, an image matching method is provided, and the image is divided into an interesting region and a background region through the method, so that when an image block is randomly selected from the image, the image can be more specifically selected, and a representative image block is taken as a sample image, thereby being beneficial to training to obtain an image matching model with higher robustness. Based on this, in the model reasoning process, the occurrence of mismatching can be reduced.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in this embodiment of the present application, determining a matching result through an image matching model based on the first image to be matched and the second image to be matched specifically may include:
acquiring a feature vector corresponding to the first image to be matched through an image matching model based on the first image to be matched;
based on the second image to be matched, acquiring a feature vector corresponding to the second image to be matched through an image matching model;
determining a target characteristic distance according to the characteristic vector corresponding to the first image to be matched and the characteristic vector corresponding to the second image to be matched;
if the target characteristic distance is smaller than or equal to the distance threshold, determining that the matching result is successful;
and if the target characteristic distance is greater than the distance threshold, determining that the matching result is matching failure.
In one or more embodiments, a manner of determining the matching result based on the distance between features is presented. As can be seen from the foregoing embodiment, the image matching model may be one sub-model of the image matching model to be trained; based on this, the first image to be matched and the second image to be matched need to be input to the image matching model respectively, and two feature vectors are output. Then, the Euclidean distance formula may be used to calculate the target feature distance between the two feature vectors. It will be appreciated that the target feature distance may be a Euclidean distance (also called the Euclidean metric), which is a common distance metric used to measure the absolute distance between two objects in a multidimensional space. The smaller the target feature distance is, the greater the similarity of the two images to be matched is; conversely, the greater the target feature distance is, the smaller the similarity of the two images to be matched is.
Specifically, for convenience of understanding, please refer to fig. 14, where fig. 14 is a schematic structural diagram of an image matching model in the embodiment of the present application, as shown in the figure, a first image to be matched is input to a trained image matching model, and a feature vector corresponding to the first image to be matched is output by the image matching model. Similarly, the second image to be matched is input into the trained image matching model, and the feature vector corresponding to the second image to be matched is output by the image matching model. Based on the above, a calculation formula of the euclidean distance may be adopted to calculate the target feature distance between the feature vector corresponding to the first image to be matched and the feature vector corresponding to the second image to be matched.
And if the target feature distance is smaller than or equal to the distance threshold, determining that the matching result is successful. Otherwise, if the target feature distance is larger than the distance threshold, the matching result is determined to be matching failure. Since the smaller the target feature distance is, the more similar the images are, another image to be matched that is closest to the image to be matched can be determined according to the target feature distance.
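For illustration, the distance-based matching decision might look as follows, assuming PyTorch feature vectors of shape (N, D); the distance threshold here is an arbitrary illustrative value, not one specified by this application.

import torch.nn.functional as F

def match_by_distance(feat_a, feat_b, distance_threshold=0.5):
    # Matching succeeds when the target feature distance is less than or
    # equal to the distance threshold.
    target_distance = F.pairwise_distance(feat_a, feat_b)  # Euclidean distance, shape (N,)
    return target_distance <= distance_threshold            # True: matching succeeds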
Secondly, in the embodiment of the application, a mode for determining a matching result based on the distance between the features is provided, and through the mode, the feature vectors of two images to be matched are directly output by the image matching model, and then the distance between the features is determined according to the feature vectors of the two images to be matched. The smaller the distance between the features is, the higher the similarity is, so that the feature vector can be output based on the image matching model to determine the matching result of the two images to be matched, thereby improving the accuracy of image matching.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in this embodiment of the present application, determining a matching result through an image matching model based on the first image to be matched and the second image to be matched specifically may include:
acquiring a similarity score through an image matching model based on the first image to be matched and the second image to be matched;
if the similarity score is larger than or equal to the similarity threshold, determining that the matching result is successful;
and if the similarity value is smaller than the similarity threshold value, determining that the matching result is matching failure.
In one or more embodiments, a manner of determining the matching result based on a similarity score is presented. As can be seen from the foregoing embodiments, the image matching model may be the image matching model to be trained itself, that is, it includes the two sub-models; based on this, the first image to be matched and the second image to be matched need to be input to the image matching model together, and a similarity score is output. A cosine similarity formula can be used to calculate the similarity score between the two images to be matched. It can be understood that the similarity score may be the cosine similarity: the images to be matched are first vectorized, and the similarity of the two feature vectors is evaluated by calculating the cosine of the angle between them. Cosine similarity is generally used in the positive space, so in general its value lies between 0 and 1, where the cosine of an angle of 0 degrees is 1 and the cosine of any other angle is not greater than 1. That is, the greater the similarity score is, the greater the similarity of the two images to be matched is; conversely, the smaller the similarity score is, the smaller the similarity of the two images to be matched is.
Specifically, for the convenience of understanding, please refer to fig. 15, where fig. 15 is another structural schematic diagram of the image matching model in the embodiment of the present application, as shown in the figure, the first image to be matched and the second image to be matched are input to the trained image matching model together, and the image matching model outputs the similarity score between the first image to be matched and the second image to be matched.
And if the similarity score is larger than or equal to the similarity threshold, determining that the matching result is successful. Otherwise, if the similarity score is smaller than the similarity threshold, the matching result is determined to be matching failure. Since the images are more similar as the similarity score is larger, another image to be matched which is closest to the image to be matched can be determined according to the similarity score.
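A corresponding sketch for the similarity-score branch, assuming the score is obtained as the cosine similarity of two feature vectors as described above; the similarity threshold is an illustrative value.

import torch.nn.functional as F

def match_by_similarity(feat_a, feat_b, similarity_threshold=0.8):
    # Matching succeeds when the similarity score is greater than or equal
    # to the similarity threshold.
    score = F.cosine_similarity(feat_a, feat_b, dim=-1)  # in [0, 1] for features in the positive space
    return score >= similarity_threshold                  # True: matching succeeds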
Secondly, in the embodiment of the application, a way of determining the matching result based on the similarity score is provided, through the way, the image matching model directly outputs the similarity score between the two images to be matched, and the similarity score is larger, which indicates that the similarity is higher, so that the similarity score can be directly output based on the image matching model to determine the matching result of the two images to be matched, thereby improving the accuracy of image matching.
Referring to fig. 16, fig. 16 is a schematic diagram of an embodiment of the model training device in the embodiment of the present application, and the model training device 30 includes:
an obtaining module 310, configured to obtain a first image to be trained;
the obtaining module 310 is further configured to obtain a first region segmentation result through a semantic segmentation model based on the first image to be trained, where the first region segmentation result is used to determine a region of interest and a background region in the first image to be trained, and the region of interest includes at least one of a text region or a picture region;
the obtaining module 310 is further configured to obtain a first sample image and a second sample image derived from the first image to be trained according to the first region segmentation result, where a region proportion of a region of interest included in the first sample image is greater than or equal to a proportion threshold, and a region proportion of a region of interest included in the second sample image is greater than or equal to the proportion threshold;
the training module 320 is configured to update the model parameters of the image matching model to be trained according to the first sample image and the second sample image until the model training conditions are met, so as to obtain the image matching model.
The application provides a model training device, which divides an image into a region of interest and a background region, so that when image blocks are randomly framed from the image, they can be selected in a more targeted manner, and representative image blocks are taken as sample images, which is favorable for training an image matching model with higher robustness.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to randomly select a first image and a second image from a first image to be trained;
determining the region proportion of the region of interest in the first image according to the first region segmentation result, and determining the region proportion of the region of interest in the second image;
if the area proportion of the interest area in the first image is larger than or equal to the proportion threshold value, taking the first image as a first sample image;
and if the area proportion of the interest area in the second image is larger than or equal to the proportion threshold value, taking the second image as a second sample image.
The application provides a model training device, and by adopting the device, sample images meeting requirements can be directly and randomly extracted from images to be trained, so that subsequent model training can be carried out by utilizing the sample images, and the feasibility and operability of a scheme are improved.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to randomly select a first image to be processed and a second image to be processed from the first image to be trained;
according to the first region segmentation result, determining the region proportion of the region of interest in the first image to be processed, and determining the region proportion of the region of interest in the second image to be processed;
if the area proportion of the interest area in the first image to be processed is larger than or equal to the proportion threshold value, carrying out data augmentation processing on the first image to be processed to obtain a first sample image;
and if the area proportion of the interest area in the second image to be processed is greater than or equal to the proportion threshold, performing data amplification processing on the second image to be processed to obtain a second sample image.
The application provides a model training device, and by adopting the device, a sample image meeting requirements can be randomly extracted from an image to be trained, data amplification is carried out on the sample image, and another sample image is obtained, so that subsequent model training can be carried out by using the sample images, and the feasibility and operability of a scheme are improved.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the training module 320 is specifically configured to obtain a first feature map through a coding network included in the image matching model to be trained based on the first sample image;
based on the first feature map, acquiring a first feature vector through a projection network included in an image matching model to be trained;
based on the first feature vector, obtaining a target feature vector through a prediction network included in the image matching model to be trained;
acquiring a second feature map through a coding network included in the target model based on the second sample image;
acquiring a second feature vector through a projection network included in the target model based on the second feature map;
and updating the model parameters of the image matching model to be trained through the first loss function according to the target characteristic vector and the second characteristic vector.
The application provides a model training device. With the above device, by relying on the interaction and mutual learning between the two networks, a higher classification accuracy can be achieved without negative samples, thereby improving the robustness of the image matching model.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain a second image to be trained;
the obtaining module 310 is further configured to obtain a second region segmentation result through the semantic segmentation model based on a second image to be trained, where the second region segmentation result is used to determine a region of interest and a background region in the second image to be trained;
the obtaining module 310 is further configured to obtain a third sample image derived from the second image to be trained according to the second region segmentation result, where a ratio of a region of interest included in the third sample image is greater than or equal to a ratio threshold;
the training module 320 is specifically configured to update a model parameter of the image matching model to be trained according to the first sample image, the second sample image, and the third sample image until a model training condition is met, so as to obtain the image matching model.
The application provides a model training device. With the above device, the positive sample image pair can be extracted from the same image to be trained, and the two images of the negative sample image pair can be extracted from different images to be trained, respectively. Based on this, training jointly with the positive sample image pair and the negative sample image pair can improve the robustness of the image matching model.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the training module 320 is specifically configured to obtain, based on a first sample image and a second sample image, a feature vector corresponding to the first sample image and a feature vector corresponding to the second sample image through an image matching model to be trained, where the first sample image and the second sample image belong to a positive sample image pair, and the positive sample image pair corresponds to a positive sample label;
based on the first sample image and the third sample image, acquiring a feature vector corresponding to the first sample image and a feature vector corresponding to the third sample image through an image matching model to be trained, wherein the first sample image and the third sample image belong to a negative sample image pair, and the negative sample image pair corresponds to a negative sample label;
determining a first characteristic distance according to the characteristic vector corresponding to the first sample image and the characteristic vector corresponding to the second sample image;
determining a second characteristic distance according to the characteristic vector corresponding to the first sample image and the characteristic vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a second loss function according to the first characteristic distance, the second characteristic distance, the positive sample label and the negative sample label.
The application provides a model training device. With the above device, the network structure of a twin neural network is used as the network structure of the image matching model, so that the relation between paired sample images in the twin neural network can be effectively processed using the contrastive loss function, the matching degree of the paired sample images is well expressed, and a model for feature extraction is thereby well trained.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the training module 320 is specifically configured to obtain, based on the first sample image, an embedded vector corresponding to the first sample image through the image matching model to be trained;
based on a second sample image, acquiring an embedded vector corresponding to the second sample image through an image matching model to be trained, wherein the second sample image belongs to a positive sample image;
based on a third sample image, obtaining an embedded vector corresponding to the third sample image through an image matching model to be trained, wherein the third sample image belongs to a negative sample image;
determining a first embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the second sample image;
determining a second embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a third loss function according to the first embedding distance and the second embedding distance.
The application provides a model training device. With the above device, the network structure of a twin neural network is used as the network structure of the image matching model, so that the relationship among the three associated sample images in the twin neural network can be effectively processed using the triplet loss function, their matching degree is well expressed, and a model for feature extraction is thereby well trained.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the training module 320 is specifically configured to obtain a prediction classification result through an image matching model to be trained based on a first sample image and a second sample image, where the first sample image and the second sample image correspond to an annotation classification label;
and updating the model parameters of the image matching model to be trained by adopting a fourth loss function according to the prediction classification result and the labeling classification label.
The application provides a model training device. With the above device, the network structure of a twin neural network is used as the network structure of the image matching model, so that the relation between paired sample images in the twin neural network can be effectively processed using the cross entropy loss function, the matching degree of the paired sample images is well expressed, and a model that outputs image similarity scores is thereby well trained.
Alternatively, on the basis of the embodiment corresponding to fig. 16, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain a training sample image, where the training sample image includes M pixel points, each pixel point corresponds to a category label, the category label is used to indicate that the pixel point belongs to an interested area or a background area, and M is an integer greater than 1;
the obtaining module 310 is further configured to obtain, based on the training sample image, a category prediction probability of each pixel point through the to-be-trained semantic segmentation model;
the training module 320 is further configured to update a model parameter of the semantic segmentation model to be trained by using a fifth loss function according to the class prediction probability of each pixel point and the class label of each pixel point until a model training condition is met, so as to obtain the semantic segmentation model.
By adopting the device, the region of interest and the background region can be segmented from the image, and the text region and the picture region in the region of interest can be segmented, so that the randomly sampled image block can cover more effective contents.
Referring to fig. 17, fig. 17 is a schematic diagram of an embodiment of an image matching apparatus in an embodiment of the present application, where the image matching apparatus 40 includes:
an obtaining module 410, configured to obtain a first image to be matched;
the obtaining module 410 is further configured to obtain a second image to be matched;
the determining module 420 is configured to determine a matching result through an image matching model based on the first image to be matched and the second image to be matched, where the image matching model is obtained by training using the method provided in the foregoing embodiment.
The application provides an image matching device. With the above device, an image is divided into a region of interest and a background region, so that when image blocks are randomly framed from the image, they can be selected in a more targeted manner, and representative image blocks are taken as sample images, which is favorable for training an image matching model with higher robustness. Based on this, mismatching can be reduced in the model inference process.
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the image matching apparatus 40 provided in the embodiment of the present application,
the determining module 420 is specifically configured to obtain, based on the first image to be matched, a feature vector corresponding to the first image to be matched through the image matching model;
based on the second image to be matched, acquiring a feature vector corresponding to the second image to be matched through an image matching model;
determining a target characteristic distance according to the characteristic vector corresponding to the first image to be matched and the characteristic vector corresponding to the second image to be matched;
if the target characteristic distance is smaller than or equal to the distance threshold, determining that the matching result is successful;
and if the target characteristic distance is greater than the distance threshold, determining that the matching result is matching failure.
The application provides an image matching device, which is characterized in that an image matching model directly outputs the characteristic vectors of two images to be matched and then determines the distance between the characteristics according to the characteristic vectors of the two images to be matched. The smaller the distance between the features is, the higher the similarity is, so that the feature vector can be output based on the image matching model to determine the matching result of the two images to be matched, thereby improving the accuracy of image matching.
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the image matching apparatus 40 provided in the embodiment of the present application,
the determining module 420 is specifically configured to obtain a similarity score through an image matching model based on the first image to be matched and the second image to be matched;
if the similarity score is larger than or equal to the similarity threshold, determining that the matching result is successful;
and if the similarity value is smaller than the similarity threshold value, determining that the matching result is matching failure.
The application provides an image matching device. With the above device, the image matching model directly outputs the similarity score between the two images to be matched. Since a larger similarity score indicates a higher similarity, the matching result of the two images to be matched can be determined directly from the similarity score output by the image matching model, thereby improving the accuracy of image matching.
The embodiment of the application also provides another model training device and an image matching device, and the model training device and the image matching device can be deployed in a server. Fig. 18 is a schematic diagram of a server structure provided by an embodiment of the present application, where the server 500 may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 522 (e.g., one or more processors) and a memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) for storing applications 542 or data 544. Memory 532 and storage media 530 may be, among other things, transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 522 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the server 500.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input-output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 18.
The embodiment of the application also provides another model training device and an image matching device, and the model training device and the image matching device can be deployed on terminal equipment. As shown in fig. 19, for convenience of explanation, only the portions related to the embodiments of the present application are shown, and details of the specific techniques are not disclosed, please refer to the method portion of the embodiments of the present application. In the embodiment of the present application, a terminal device is taken as an example to explain:
fig. 19 is a block diagram illustrating a partial structure of a smartphone related to a terminal device provided in an embodiment of the present application. Referring to fig. 19, the smart phone includes: radio Frequency (RF) circuitry 610, memory 620, input unit 630, display unit 640, sensor 650, audio circuitry 660, wireless fidelity (WiFi) module 670, processor 680, and power supply 690. Those skilled in the art will appreciate that the smartphone configuration shown in fig. 19 is not limiting and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The following describes each component of the smartphone in detail with reference to fig. 19:
the RF circuit 610 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, it receives downlink information of a base station and passes it to the processor 680 for processing, and in addition, transmits uplink data to the base station. In general, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), etc.
The memory 620 may be used to store software programs and modules, and the processor 680 may execute various functional applications and data processing of the smart phone by operating the software programs and modules stored in the memory 620. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smartphone. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations of a user on or near it (for example, operations performed by the user on or near the touch panel 631 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 680, and can also receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. The input unit 630 may include other input devices 632 in addition to the touch panel 631. In particular, the other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 640 may be used to display information input by or provided to the user and the various menus of the smartphone. The display unit 640 may include a display panel 641; optionally, the display panel 641 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 631 may cover the display panel 641; when the touch panel 631 detects a touch operation on or near it, the touch operation is transmitted to the processor 680 to determine the type of the touch event, and the processor 680 then provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 19 the touch panel 631 and the display panel 641 are shown as two separate components implementing the input and output functions of the smartphone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the smartphone.
The smartphone may also include at least one sensor 650, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 641 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 641 and/or the backlight when the smartphone is moved to the ear. As one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the attitude of the smartphone (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tap detection); other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the smartphone, and are not further described here.
The audio circuit 660, speaker 661, and microphone 662 can provide an audio interface between the user and the smartphone. The audio circuit 660 may convert received audio data into an electrical signal and transmit it to the speaker 661, which converts it into an audible signal for output; conversely, the microphone 662 converts collected sound signals into electrical signals, which the audio circuit 660 receives and converts into audio data; the audio data are processed by the processor 680 and then sent via the RF circuit 610 to, for example, another smartphone, or output to the memory 620 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 670, the smartphone can help the user to receive and send e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access for the user. Although fig. 19 shows the WiFi module 670, it is understood that it is not an essential component of the smartphone and may be omitted as needed without changing the essence of the invention.
The processor 680 is the control center of the smartphone; it connects the various parts of the entire smartphone using various interfaces and lines, and performs the various functions of the smartphone and processes data by running or executing the software programs and/or modules stored in the memory 620 and invoking data stored in the memory 620, thereby monitoring the smartphone as a whole. Optionally, the processor 680 may include one or more processing units; optionally, the processor 680 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 680.
The smartphone also includes a power supply 690 (e.g., a battery) that provides power to the various components. Optionally, the power supply may be logically connected to the processor 680 via a power management system, so that functions such as charging, discharging, and power consumption management are implemented via the power management system.
Although not shown, the smart phone may further include a camera, a bluetooth module, and the like, which are not described herein.
The steps performed by the terminal device in the above-described embodiment may be based on the terminal device configuration shown in fig. 19.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of model training, comprising:
acquiring a first image to be trained;
acquiring a first region segmentation result through a semantic segmentation model based on the first image to be trained, wherein the first region segmentation result is used for determining a region of interest and a background region in the first image to be trained, and the region of interest comprises at least one of a text region or a picture region;
obtaining a first sample image and a second sample image derived from the first image to be trained according to the first region segmentation result, wherein the region proportion of the region of interest included in the first sample image is greater than or equal to a proportion threshold value, and the region proportion of the region of interest included in the second sample image is greater than or equal to the proportion threshold value;
and updating model parameters of the image matching model to be trained according to the first sample image and the second sample image until model training conditions are met, so as to obtain the image matching model.
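For illustration only and not as part of the claim language, the proportion comparison underlying claims 1 to 3 can be sketched in Python/PyTorch as follows; the label convention (1 = region of interest, 0 = background) and the threshold value 0.5 are assumptions:

import torch

def roi_fraction(mask: torch.Tensor) -> float:
    # mask: integer tensor of shape (H, W) produced by the semantic segmentation
    # model; label 1 is assumed to mark the region of interest (text or picture
    # region) and label 0 the background region.
    return (mask == 1).float().mean().item()

# A candidate image would qualify as a sample image when, for example,
# roi_fraction(mask) >= 0.5 (the proportion threshold value is an assumption).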
2. The method according to claim 1, wherein the obtaining a first sample image and a second sample image derived from the first image to be trained according to the first region segmentation result comprises:
randomly selecting a first image and a second image from the first image to be trained;
determining the region proportion of the region of interest in the first image and determining the region proportion of the region of interest in the second image according to the first region segmentation result;
if the region proportion of the region of interest in the first image is greater than or equal to the proportion threshold, taking the first image as the first sample image;
and if the region proportion of the region of interest in the second image is greater than or equal to the proportion threshold, taking the second image as the second sample image.
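As a minimal sketch of the random selection in claim 2, assuming the first and second images are random crops taken from the image to be trained (the crop-based selection, the helper name pick_sample, and the retry limit are illustrative assumptions, not claim requirements):

import random
import torch

def pick_sample(image: torch.Tensor, mask: torch.Tensor, crop: int,
                threshold: float, max_tries: int = 20):
    # image: (C, H, W); mask: (H, W) with label 1 marking the region of interest.
    # Assumes the image is at least crop x crop pixels.
    _, h, w = image.shape
    for _ in range(max_tries):
        top = random.randint(0, h - crop)
        left = random.randint(0, w - crop)
        sub_mask = mask[top:top + crop, left:left + crop]
        if (sub_mask == 1).float().mean().item() >= threshold:
            return image[:, top:top + crop, left:left + crop]
    return None  # no crop meeting the proportion threshold was found

# Two independent calls would yield the first sample image and the second sample image.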
3. The method according to claim 1, wherein the obtaining a first sample image and a second sample image derived from the first image to be trained according to the first region segmentation result comprises:
randomly selecting a first image to be processed and a second image to be processed from the first image to be trained;
according to the first region segmentation result, determining the region proportion of the region of interest in the first image to be processed, and determining the region proportion of the region of interest in the second image to be processed;
if the region proportion of the region of interest in the first image to be processed is greater than or equal to the proportion threshold, performing data augmentation processing on the first image to be processed to obtain the first sample image;
and if the region proportion of the region of interest in the second image to be processed is greater than or equal to the proportion threshold, performing data augmentation processing on the second image to be processed to obtain the second sample image.
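Claim 3 leaves the concrete augmentation operations open; as one possible sketch, standard torchvision transforms could be applied to a qualifying image to be processed (the specific operations and parameter values below are assumptions):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=9),
])

# first_sample_image = augment(first_image_to_be_processed)
# second_sample_image = augment(second_image_to_be_processed)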
4. The method of claim 1, wherein updating model parameters of an image matching model to be trained according to the first sample image and the second sample image comprises:
based on the first sample image, acquiring a first feature map through a coding network included in the image matching model to be trained;
based on the first feature map, acquiring a first feature vector through a projection network included in the image matching model to be trained;
based on the first feature vector, obtaining a target feature vector through a prediction network included in the image matching model to be trained;
acquiring a second feature map through a coding network included in the target model based on the second sample image;
based on the second feature map, acquiring a second feature vector through a projection network included in the target model;
and updating the model parameters of the image matching model to be trained through a first loss function according to the target feature vector and the second feature vector.
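A sketch of the asymmetric two-branch computation in claim 4, with a coding, projection and prediction network on the trained side and a coding and projection network on the target-model side; negative cosine similarity is used as an assumed form of the first loss function, and the attribute names encoder/projector/predictor are hypothetical:

import torch
import torch.nn.functional as F

def first_loss(online, target, first_sample, second_sample):
    feature_map_1 = online.encoder(first_sample)          # first feature map
    feature_vec_1 = online.projector(feature_map_1)        # first feature vector
    target_vec = online.predictor(feature_vec_1)           # target feature vector

    with torch.no_grad():                                  # target model is not back-propagated
        feature_map_2 = target.encoder(second_sample)      # second feature map
        feature_vec_2 = target.projector(feature_map_2)    # second feature vector

    # Negative cosine similarity between the target feature vector and the
    # second feature vector (assumed form of the first loss function).
    return 2 - 2 * F.cosine_similarity(target_vec, feature_vec_2.detach(), dim=-1).mean()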
5. The method of claim 1, further comprising:
acquiring a second image to be trained;
acquiring a second region segmentation result through the semantic segmentation model based on the second image to be trained, wherein the second region segmentation result is used for determining a region of interest and a background region in the second image to be trained;
obtaining a third sample image derived from the second image to be trained according to the second region segmentation result, wherein the region proportion of the region of interest included in the third sample image is greater than or equal to the proportion threshold;
the updating the model parameters of the image matching model to be trained according to the first sample image and the second sample image until the model training conditions are met to obtain the image matching model comprises the following steps:
and updating the model parameters of the image matching model to be trained according to the first sample image, the second sample image and the third sample image until model training conditions are met to obtain the image matching model.
6. The method of claim 5, wherein the updating the model parameters of the image matching model to be trained according to the first sample image, the second sample image and the third sample image comprises:
based on the first sample image and the second sample image, obtaining a feature vector corresponding to the first sample image and a feature vector corresponding to the second sample image through the image matching model to be trained, wherein the first sample image and the second sample image belong to positive sample images and correspond to a positive sample label;
based on the first sample image and the third sample image, obtaining a feature vector corresponding to the first sample image and a feature vector corresponding to the third sample image through the image matching model to be trained, wherein the first sample image and the third sample image belong to negative sample images and correspond to a negative sample label;
determining a first feature distance according to the feature vector corresponding to the first sample image and the feature vector corresponding to the second sample image;
determining a second feature distance according to the feature vector corresponding to the first sample image and the feature vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a second loss function according to the first feature distance, the second feature distance, the positive sample label and the negative sample label.
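One common realization of the second loss function in claim 6 is a margin-based contrastive loss; the sketch below applies it once to the positive pair (label 1, first feature distance) and once to the negative pair (label 0, second feature distance). The margin value and the squared-distance form are assumptions:

import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, label, margin=1.0):
    # feat_a, feat_b: feature vectors of shape (N, D); label: tensor of shape (N,)
    # with 1 for positive pairs and 0 for negative pairs.
    distance = F.pairwise_distance(feat_a, feat_b)
    positive_term = label * distance.pow(2)
    negative_term = (1 - label) * torch.clamp(margin - distance, min=0).pow(2)
    return 0.5 * (positive_term + negative_term).mean()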
7. The method of claim 5, wherein the updating the model parameters of the image matching model to be trained according to the first sample image, the second sample image and the third sample image comprises:
based on the first sample image, obtaining an embedding vector corresponding to the first sample image through the image matching model to be trained;
based on the second sample image, obtaining an embedding vector corresponding to the second sample image through the image matching model to be trained, wherein the second sample image belongs to a positive sample image;
based on the third sample image, obtaining an embedding vector corresponding to the third sample image through the image matching model to be trained, wherein the third sample image belongs to a negative sample image;
determining a first embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the second sample image;
determining a second embedding distance according to the embedding vector corresponding to the first sample image and the embedding vector corresponding to the third sample image;
and updating the model parameters of the image matching model to be trained by adopting a third loss function according to the first embedding distance and the second embedding distance.
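The third loss function in claim 7 can be realized, for example, as a margin-based triplet loss over the three embedding vectors (anchor = first sample image, positive = second sample image, negative = third sample image); the margin and norm below are assumed values:

import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)
# loss = triplet_loss(first_embedding, second_embedding, third_embedding)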
8. The method of claim 1, wherein updating model parameters of an image matching model to be trained according to the first sample image and the second sample image comprises:
obtaining a predicted classification result through the image matching model to be trained based on the first sample image and the second sample image, wherein the first sample image and the second sample image correspond to an annotated classification label;
and updating the model parameters of the image matching model to be trained by adopting a fourth loss function according to the predicted classification result and the annotated classification label.
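A sketch of the classification formulation in claim 8, assuming the annotated classification label indicates whether the two sample images come from the same source image and that the fourth loss function is cross entropy (both are assumptions; the claim fixes neither):

import torch
import torch.nn.functional as F

def fourth_loss(model, first_sample, second_sample, label):
    # label: long tensor of shape (N,) holding the annotated classification label,
    # e.g. 1 = same source image, 0 = different source images (assumed semantics).
    logits = model(first_sample, second_sample)   # predicted classification result
    return F.cross_entropy(logits, label)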
9. The method according to any one of claims 1 to 8, further comprising:
acquiring a training sample image, wherein the training sample image comprises M pixel points, each pixel point corresponds to a category label, the category label is used for indicating whether the pixel point belongs to a region of interest or a background region, and M is an integer greater than 1;
based on the training sample image, acquiring the class prediction probability of each pixel point through a semantic segmentation model to be trained;
and updating the model parameters of the semantic segmentation model to be trained by adopting a fifth loss function according to the class prediction probability of each pixel point and the class label of each pixel point until model training conditions are met, so as to obtain the semantic segmentation model.
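The fifth loss function in claim 9 can be sketched as a per-pixel cross entropy between the class prediction probabilities and the category labels of the M pixel points (cross entropy is a common but assumed choice):

import torch
import torch.nn.functional as F

def fifth_loss(seg_model, images, pixel_labels):
    # images: (N, C, H, W); pixel_labels: (N, H, W) with class 1 = region of
    # interest and class 0 = background region.
    logits = seg_model(images)                    # (N, num_classes, H, W)
    return F.cross_entropy(logits, pixel_labels)  # averaged over all pixel points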
10. A method of image matching, comprising:
acquiring a first image to be matched;
acquiring a second image to be matched;
determining a matching result through an image matching model based on the first image to be matched and the second image to be matched, wherein the image matching model is obtained by training through the method of any one of claims 1 to 9.
11. The method according to claim 10, wherein the determining a matching result by an image matching model based on the first image to be matched and the second image to be matched comprises:
based on the first image to be matched, acquiring a feature vector corresponding to the first image to be matched through the image matching model;
based on the second image to be matched, acquiring a feature vector corresponding to the second image to be matched through the image matching model;
determining a target feature distance according to the feature vector corresponding to the first image to be matched and the feature vector corresponding to the second image to be matched;
if the target feature distance is smaller than or equal to a distance threshold, determining that the matching result is a matching success;
and if the target feature distance is greater than the distance threshold, determining that the matching result is a matching failure.
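For illustration, the distance-based decision in claim 11 might look as follows; the Euclidean distance, the single-image batch, and the threshold value are assumptions:

import torch
import torch.nn.functional as F

def match_by_distance(model, image_a, image_b, distance_threshold=1.0):
    # image_a, image_b: batches of one image each, shape (1, C, H, W).
    with torch.no_grad():
        feat_a = model(image_a)                   # feature vector of the first image to be matched
        feat_b = model(image_b)                   # feature vector of the second image to be matched
    target_distance = F.pairwise_distance(feat_a, feat_b).item()
    return target_distance <= distance_threshold  # True = matching success, False = matching failure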
12. The method according to claim 10, wherein the determining a matching result by an image matching model based on the first image to be matched and the second image to be matched comprises:
acquiring a similarity score through the image matching model based on the first image to be matched and the second image to be matched;
if the similarity score is larger than or equal to a similarity threshold, determining that the matching result is a matching success;
and if the similarity score is smaller than the similarity threshold, determining that the matching result is a matching failure.
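Similarly, the score-based decision in claim 12 might be sketched with cosine similarity standing in for the similarity score (the score definition and the threshold value are assumptions):

import torch
import torch.nn.functional as F

def match_by_similarity(model, image_a, image_b, similarity_threshold=0.8):
    # image_a, image_b: batches of one image each, shape (1, C, H, W).
    with torch.no_grad():
        score = F.cosine_similarity(model(image_a), model(image_b), dim=-1).item()
    return score >= similarity_threshold          # True = matching success, False = matching failure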
13. A model training apparatus, comprising:
an obtaining module, configured to obtain a first image to be trained;
the obtaining module is further configured to obtain a first region segmentation result through a semantic segmentation model based on the first image to be trained, where the first region segmentation result is used to determine a region of interest and a background region in the first image to be trained, and the region of interest includes at least one of a text region or a picture region;
the obtaining module is further configured to obtain a first sample image and a second sample image derived from the first image to be trained according to the first region segmentation result, where a region proportion of a region of interest included in the first sample image is greater than or equal to a proportion threshold, and a region proportion of a region of interest included in the second sample image is greater than or equal to the proportion threshold;
and the training module is used for updating model parameters of the image matching model to be trained according to the first sample image and the second sample image until model training conditions are met to obtain the image matching model.
14. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory and, according to instructions in the program, perform the method of any one of claims 1 to 9 or the method of any one of claims 10 to 12;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 9, or perform the method of any one of claims 10 to 12.
CN202110866443.9A 2021-07-29 2021-07-29 Model training method, image matching device and storage medium Pending CN113822427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110866443.9A CN113822427A (en) 2021-07-29 2021-07-29 Model training method, image matching device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110866443.9A CN113822427A (en) 2021-07-29 2021-07-29 Model training method, image matching device and storage medium

Publications (1)

Publication Number Publication Date
CN113822427A true CN113822427A (en) 2021-12-21

Family

ID=78924031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110866443.9A Pending CN113822427A (en) 2021-07-29 2021-07-29 Model training method, image matching device and storage medium

Country Status (1)

Country Link
CN (1) CN113822427A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359594A (en) * 2022-03-17 2022-04-15 杭州觅睿科技股份有限公司 Scene matching method and device, electronic equipment and storage medium
CN114842930A (en) * 2022-06-30 2022-08-02 苏州景昱医疗器械有限公司 Data acquisition method, device and system and computer readable storage medium
CN115270754A (en) * 2022-09-19 2022-11-01 科大讯飞(苏州)科技有限公司 Cross-modal matching method, related device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2020199932A1 (en) Model training method, face recognition method, device and apparatus, and storage medium
EP3944147A1 (en) Target detection method, model training method, device, apparatus and storage medium
CN111476306B (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN111652121A (en) Training method of expression migration model, and expression migration method and device
CN111209423B (en) Image management method and device based on electronic album and storage medium
CN113822427A (en) Model training method, image matching device and storage medium
CN111444826B (en) Video detection method, device, storage medium and computer equipment
CN112990390B (en) Training method of image recognition model, and image recognition method and device
CN110766081B (en) Interface image detection method, model training method and related device
CN111709398A (en) Image recognition method, and training method and device of image recognition model
CN112203115B (en) Video identification method and related device
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN114418069A (en) Method and device for training encoder and storage medium
CN111784776A (en) Visual positioning method and device, computer readable medium and electronic equipment
CN111950570A (en) Target image extraction method, neural network training method and device
CN114092920B (en) Model training method, image classification method, device and storage medium
CN114722937A (en) Abnormal data detection method and device, electronic equipment and storage medium
CN113269279B (en) Multimedia content classification method and related device
CN112995757B (en) Video clipping method and device
CN110991325A (en) Model training method, image recognition method and related device
CN112270238A (en) Video content identification method and related device
CN114612531A (en) Image processing method and device, electronic equipment and storage medium
CN111914106B (en) Texture and normal library construction method, texture and normal map generation method and device
CN114462539A (en) Training method of content classification model, and content classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination