CN116187439A - Graph searching model building method, graph searching method, system, equipment and medium - Google Patents


Info

Publication number
CN116187439A
CN116187439A
Authority
CN
China
Prior art keywords
model
image
training
graph
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211711389.1A
Other languages
Chinese (zh)
Inventor
喻晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qizhi Technology Co ltd
Original Assignee
Qizhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qizhi Technology Co ltd filed Critical Qizhi Technology Co ltd
Priority to CN202211711389.1A priority Critical patent/CN116187439A/en
Publication of CN116187439A publication Critical patent/CN116187439A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Feedback Control In General (AREA)
  • Image Analysis (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The application relates to a graph searching model building method, a graph searching method, a system, equipment and a medium, belonging to the technical field of image retrieval. The model building method comprises the following steps: pre-training a teacher model constructed based on a Transformer model; determining a loss function based on distance calculated in hyperbolic space; and performing knowledge distillation training on a student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain a graph searching model. The method and the device can solve the problems of low precision and low efficiency when searching images by image.

Description

Graph searching model building method, graph searching method, system, equipment and medium
Technical Field
The present disclosure relates to the field of image retrieval technologies, and in particular, to a graph searching model building method, and a graph searching method, system, device, and medium.
Background
With the increasing abundance of image information on the Internet, users' demand for network image searching keeps growing, and various search-by-image engines have emerged. Graph searching (searching images by image) is a specialized search engine service that provides users with relevant graphic and image data retrieval on the Internet by matching image text or visual features, and it is widely applied in fields such as online shopping, patent retrieval, and paper retrieval.
At present, the models commonly adopted by graph searching systems have low precision and easily retrieve irrelevant pictures, while some models with better precision have excessively long inference times, resulting in low search efficiency.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present application provides a graph searching model building method, a graph searching method, a system, a device and a medium.
In a first aspect, the present application provides a method for building a graph searching model, which adopts the following technical scheme:
a method for establishing a graph searching model comprises the following steps:
pre-training a teacher model constructed based on a Transformer model;
determining a loss function based on distance calculated in hyperbolic space;
and performing knowledge distillation training on a student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain a graph searching model.
Optionally, the performing knowledge distillation training on the student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain the graph searching model includes:
acquiring an image training set;
performing first data augmentation processing on sample images in the image training set, and encoding parameters in the first data augmentation processing into a storage space by using an encoder;
taking the sample image subjected to the first data augmentation process as an input of the pre-trained teacher model;
taking the sparse codes and indexes output by the pre-trained teacher model as sparse soft labels for training the student model;
performing second data augmentation processing on sample images in the image training set, and transmitting parameters in the first data augmentation processing encoded into the storage space to the second data augmentation processing by using a decoder;
taking the sample image subjected to the second data augmentation process as an input of the student model;
and performing knowledge distillation training on the student model based on the sparse soft label, the prediction output of the student model, and the loss function to obtain the graph searching model.
Optionally, if there are a plurality of pre-trained teacher models, the taking the sparse codes and indexes output by the pre-trained teacher models as the sparse soft labels for training the student model includes:
aligning the dimensions of the sparse codes and indexes output by the plurality of pre-trained teacher models by using a fully connected layer;
and calculating an average value of a plurality of sparse codes and indexes after dimension alignment, and taking the average value as the sparse soft label.
Optionally, before the first data augmentation processing is performed on the image data in the image training set, and before the second data augmentation processing is performed on the image data in the image training set, the method further includes:
dividing an original image in the image training set, and taking a foreground image obtained by division as a target object;
if the proportion of the black pixel values of the target object exceeds a first preset threshold value, embedding the target object into a pure white background image;
if the proportion of the black pixel values of the target object does not exceed a first preset threshold value, embedding the target object into a pure black background image;
and taking the embedded image as a sample image in the image training set.
Optionally, the determining a loss function based on distance calculated in hyperbolic space includes:
inputting two sample images into a backbone network for feature extraction;
classifying the extracted features through a fully connected layer FC;
realizing a mapping from Euclidean space to a compact sphere model space through exponential mapping;
mapping the infinite hyperboloid into a finite circular area by using a hyperbolic conformal transformation through a conformal disk model;
and comparing the two sample images through a pairwise model, where if one sample image is ranked higher than the other in relevance order, the two form a positive example pair, and otherwise a negative example pair, and determining the loss function of the pairwise model to be the multi-similarity loss.
Optionally, the pre-training the teacher model constructed based on the Transformer model includes:
first pre-training the teacher model with the public data set CLIP LAION-400M, and then fine-tuning the teacher model with the public data set ImageNet 22k to obtain an optimal pre-trained model, the optimal pre-trained model being used as the pre-trained teacher model.
In a second aspect, the present application provides a graph searching method, which adopts the following technical scheme:
a method for searching a graph in a graph, comprising:
inputting an image to be predicted into a graph search model established according to the method of any one of the first aspect, and outputting a prediction result.
In a third aspect, the present application provides a system for building a graph searching model, which adopts the following technical scheme:
a graph search model building system, comprising:
the pre-training module is used for pre-training a teacher model constructed based on a Transformer model;
a loss function determination module for determining a loss function based on distance calculated in hyperbolic space;
and the knowledge distillation module is used for performing knowledge distillation training on a student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain a graph searching model.
In a fourth aspect, the present application provides an electronic device, which adopts the following technical scheme:
an electronic device includes a memory and a processor; the memory has stored thereon a computer program capable of being loaded by the processor and performing the method of any of the first or second aspects.
In a fifth aspect, the present application provides a computer readable storage medium, which adopts the following technical scheme:
a computer readable storage medium storing a computer program capable of being loaded by a processor and performing the method of any one of the first or second aspects.
By adopting the above technical schemes, a pre-trained teacher model is used to distill a student model that approaches the teacher model in accuracy while requiring far less inference time; a data augmentation method and a teacher-model sparse coding and index storage method are provided to support a fast model distillation process; and the similarity between pictures and the related similarity loss function are measured based on distance calculated in hyperbolic space, so that the precision and efficiency of searching images by image can be improved.
Drawings
Fig. 1 is a schematic flow chart of a graph search model building method according to an embodiment of the application.
Fig. 2 is a block diagram of an architecture for knowledge distillation training in accordance with an embodiment of the present application.
Fig. 3 is a block diagram of an architecture of loss function derivation of an embodiment of the present application.
Fig. 4 is a block diagram of a graph searching model building system according to an embodiment of the present application.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Fig. 1 is a schematic flow chart of a method for creating a graph searching model according to the present embodiment. The method can be applied to a server and also can be applied to terminal equipment. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing cloud computing service; the terminal device may be a smart phone, a tablet computer, a desktop computer, etc., and the embodiment is not particularly limited.
As shown in fig. 1, the main flow of the method is described as follows (steps S101 to S103):
step S101, pre-training a teacher model constructed based on a transducer model;
in this embodiment, the teacher model may be a large model such as a vitae_v2 model, a maxvit model, a swin transducer_v2 model, and the like, which is not particularly limited.
Step S102, determining a loss function based on distance calculated in hyperbolic space;
and step S103, performing knowledge distillation training on the student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain a graph searching model.
It should be noted that there is no required execution order between step S101 and step S102, as long as both are performed before step S103; fig. 1 merely shows an example in which step S101 is executed before step S102.
Fig. 2 shows a block diagram of an architecture for knowledge distillation training. As shown in fig. 2, first, an image training set is acquired; the image training set is divided into two paths, one path is subjected to first data augmentation treatment, and parameters in the first data augmentation treatment are encoded into a storage space by an encoder; and then taking the sample image subjected to the first data augmentation process as the input of a pre-trained teacher model, and taking the sparse code and the index output by the pre-trained teacher model as the sparse soft label for training the student model.
The other path is subjected to second data augmentation processing, and parameters in the first data augmentation processing coded into the storage space are transmitted to the second data augmentation processing by using a decoder; the sample image subjected to the second data augmentation process is then used as an input to the student model.
And finally, carrying out knowledge distillation training on the student model based on the sparse soft label, the prediction output of the student model and the loss function to obtain a final graph searching model.
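A minimal sketch of one such training step in PyTorch-style code is given below; `teacher`, `student`, and `ms_loss` are placeholder callables, and the sparse-soft-label interface is assumed for illustration rather than being the patent's actual implementation:

```python
import torch

def distillation_step(teacher, student, batch_t, batch_s, ms_loss, optimizer):
    """One knowledge-distillation step following the Fig. 2 flow.

    batch_t: sample images after the first data augmentation (teacher input)
    batch_s: the same samples after the second data augmentation (student input)
    """
    with torch.no_grad():
        # The teacher's sparse code and index are reused as the
        # sparse soft label (interface assumed for illustration).
        soft_label = teacher(batch_t)

    pred = student(batch_s)

    # Distillation loss: compare the student's prediction output with
    # the sparse soft label under the determined loss function.
    loss = ms_loss(pred, soft_label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```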
In this embodiment, the first data augmentation process and the second data augmentation process are mainly used to enlarge the sample image data in the image training set, so that the sample image data are as diverse as possible and the trained model has stronger generalization capability. For example, data augmentation may be implemented in the form of rotation, data parsing, size transformation, mirror transformation, color space transformation, ToTensor conversion, normalization, contrast adjustment, noise, and the like.
The data augmentation methods used in the first data augmentation process and the second data augmentation process may be the same or different, and the embodiment is not particularly limited.
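One plausible realization of the encoder and decoder described above is simply to record the random parameters sampled in the first pipeline and replay them in the second. The sketch below assumes torchvision and uses a single rotation parameter to stand in for the full parameter set; all names are illustrative:

```python
import random
import torchvision.transforms.functional as TF

class AugmentationStore:
    """Records ('encodes') the random parameters drawn during the first
    augmentation and replays ('decodes') them for the second augmentation."""

    def __init__(self):
        self.params = {}  # the storage space used by the encoder/decoder

    def first_augmentation(self, img, key):
        angle = random.uniform(-30.0, 30.0)   # randomly sampled parameter
        self.params[key] = {"angle": angle}   # encode into the storage space
        return TF.rotate(img, angle)

    def second_augmentation(self, img, key):
        angle = self.params[key]["angle"]     # decode from the storage space
        # The second pipeline may apply the same or different operations;
        # sharing the parameter keeps the teacher and student views aligned.
        return TF.rotate(img, angle)
```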
Further, in order to reduce the volatility caused by misjudgments of any single teacher model, further improve the effectiveness of the knowledge distillation training, and avoid replicating the overfitting of a particular teacher model, a plurality of different teacher models may be set, and pre-training is performed on each teacher model; each pre-trained teacher model outputs sparse codes and the corresponding indexes. Because the teacher models differ, the dimensions of the sparse codes and corresponding indexes they output also differ; in this case, a fully connected layer is required to align the dimensions of the sparse codes and indexes output by the plurality of pre-trained teacher models. After dimension alignment, the average value of the sparse codes and indexes output by all teacher models can be calculated and used as the sparse soft label.
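A sketch of this dimension alignment and averaging in PyTorch, with the per-teacher output dimensions assumed purely for illustration:

```python
import torch
import torch.nn as nn

class TeacherAggregator(nn.Module):
    """Aligns the sparse codes and indexes output by several teachers to a
    common dimension with one fully connected layer per teacher, then
    averages the aligned outputs into a single sparse soft label."""

    def __init__(self, teacher_dims, common_dim):
        super().__init__()
        self.align = nn.ModuleList(
            [nn.Linear(d, common_dim) for d in teacher_dims])

    def forward(self, teacher_codes):
        # teacher_codes: list of tensors, one per pre-trained teacher,
        # whose last dimensions may differ across teachers.
        aligned = [fc(code) for fc, code in zip(self.align, teacher_codes)]
        # The mean over teachers serves as the sparse soft label.
        return torch.stack(aligned, dim=0).mean(dim=0)
```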
Further, in order to enhance the subsequent training effect of the student model, preprocessing that enhances the image contrast of the original images in the image training set is required before the first data augmentation processing and the second data augmentation processing are performed on the image data.
The original images in the image training set are first segmented, and the foreground image obtained by segmentation is taken as the target object. The image segmentation model may be a Mask DINO model or another image segmentation model, which is not specifically limited in this embodiment.
If the proportion of the black pixel values of the target object exceeds a first preset threshold value, embedding the target object into a pure white background image; if the proportion of the black pixel values of the target object does not exceed the first preset threshold value, embedding the target object into the pure black background image. The first preset threshold may be set by user, for example, 80%.
And finally, taking the image of the target object embedded in the background image as a sample image in the image training set.
Of course, if the original image cannot be segmented, the original image is directly input to the knowledge distillation training stage.
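This segmentation-based preprocessing can be sketched as follows, assuming NumPy arrays and treating a pixel as black when all of its channel values fall below 10 (the exact black criterion, the 80% default, and the function name are assumptions):

```python
import numpy as np

def embed_on_background(image, mask, black_ratio_threshold=0.8):
    """Embeds the segmented target object on a pure white or pure black
    background, depending on the proportion of black pixels in the object.

    image: HxWx3 uint8 original image
    mask:  HxW boolean array marking the segmented foreground (target object)
    """
    obj_pixels = image[mask]
    # Proportion of (near-)black pixels among the object's pixels.
    black_ratio = float(np.mean(np.all(obj_pixels < 10, axis=-1)))

    if black_ratio > black_ratio_threshold:
        background = np.full_like(image, 255)  # pure white background
    else:
        background = np.zeros_like(image)      # pure black background

    background[mask] = image[mask]             # embed the target object
    return background
```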
Fig. 3 shows a block diagram of the architecture of the loss function derivation. As shown in fig. 3, two sample images are input into a backbone network for feature extraction; the extracted features then enter a fully connected layer FC for classification; a mapping from Euclidean space to a compact sphere model space is realized through an exponential mapping, i.e., a flat space is mapped into a curved space; the result then enters a conformal disk model, which maps the infinite hyperboloid into a finite circular area using a hyperbolic conformal transformation; finally, the two sample images are compared through a pairwise model: if one sample image is ranked higher than the other in relevance order, they form a positive example pair, otherwise a negative example pair, and the loss function of the pairwise model is determined to be the multi-similarity loss.
In the exponential mapping phase, the conformal factor $\lambda_x$ of the mapping from Euclidean space to the compact sphere model space is calculated as:

$$\lambda_x = \frac{2}{1 - c\,\|x\|^2} \qquad (1)$$

In formula (1), $\|x\|^2$ refers to the squared norm of the input vector $x$, and $c$ is an adjustable parameter.

For each input vector $v$, the mapped result vector is:

$$\exp_x(v) = x \oplus_c \left( \tanh\!\Big(\frac{\sqrt{c}\,\lambda_x\,\|v\|}{2}\Big)\,\frac{v}{\sqrt{c}\,\|v\|} \right) \qquad (2)$$

In formula (2), the $\oplus_c$ operation is the addition rule in hyperbolic space. Assuming that the vectors $x, m$ belong to the hyperbolic space,

$$x \oplus_c m = \frac{\big(1 + 2c\,\langle x, m\rangle + c\,\|m\|^2\big)\,x + \big(1 - c\,\|x\|^2\big)\,m}{1 + 2c\,\langle x, m\rangle + c^2\,\|x\|^2\,\|m\|^2} \qquad (3)$$

where $\langle x, m\rangle$ is the dot product of the vectors $x$ and $m$.

Assuming that the vectors $x, m$ belong to the conformal disk space, the distance $d_c(x, m)$ between them is calculated as:

$$d_c(x, m) = \frac{2}{\sqrt{c}}\,\operatorname{arctanh}\!\big(\sqrt{c}\,\|{-x} \oplus_c m\|\big) \qquad (4)$$

The loss function in this space is the multi-similarity loss:

$$\mathcal{L}_{MS} = \frac{1}{n}\sum_{i=1}^{n}\left\{ \frac{1}{\alpha}\,\log\!\Big[1 + \sum_{k \in P_i} e^{-\alpha\,(S_{ik} - b)}\Big] + \frac{1}{\beta}\,\log\!\Big[1 + \sum_{k \in N_i} e^{\beta\,(S_{ik} - b)}\Big] \right\} \qquad (5)$$

where $\alpha$, $\beta$ and $b$ are hyperparameters, $n$ denotes the number of extracted positive samples, and $S_{ik}$ denotes the pairwise similarity, here derived from the hyperbolic distance $d_c$.

A negative example pair holds for the class in which $i$ is located when $k \in N_i$ and the distance between samples of different classes is smaller than the minimum distance between samples of the same class. For example, for a negative example pair $(x_i, x_j)$:

$$d_c(i, j) < \min_{y_i = y_k} d_c(i, k) - \varepsilon$$

A positive example pair holds for the class in which $i$ is located when $k \in P_i$ and the distance between samples of the same class is greater than the maximum distance between samples of different classes. For example, for a positive example pair $(x_i, x_j)$:

$$d_c(i, j) > \max_{y_i \neq y_k} d_c(i, k) - \varepsilon$$

where $y$ denotes a category label, $N_i$ denotes the negative sample set, $P_i$ denotes the positive sample set, and $\varepsilon$ is a hyperparameter, e.g. 0.1, 0.5 or 3.
Calculating distances in hyperbolic space has a clear advantage over the Euclidean distance method in pushing embeddings away from the boundary and preventing vanishing gradients.
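For illustration, formulas (1) through (4) can be transcribed into PyTorch as follows; this is a minimal sketch, not the patent's implementation, and the small clamping constants are stability assumptions not stated in the text:

```python
import torch

def lambda_x(x, c):
    """Conformal factor of formula (1): 2 / (1 - c * ||x||^2)."""
    return 2.0 / (1.0 - c * x.pow(2).sum(dim=-1, keepdim=True))

def mobius_add(x, m, c):
    """Addition rule in hyperbolic space, formula (3)."""
    xm = (x * m).sum(dim=-1, keepdim=True)      # <x, m>
    x2 = x.pow(2).sum(dim=-1, keepdim=True)     # ||x||^2
    m2 = m.pow(2).sum(dim=-1, keepdim=True)     # ||m||^2
    num = (1 + 2 * c * xm + c * m2) * x + (1 - c * x2) * m
    den = 1 + 2 * c * xm + c ** 2 * x2 * m2
    return num / den

def expmap(x, v, c, eps=1e-7):
    """Exponential map of formula (2): maps a Euclidean tangent vector v
    at point x into the compact sphere model space."""
    sqrt_c = c ** 0.5
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    t = torch.tanh(sqrt_c * lambda_x(x, c) * v_norm / 2)
    return mobius_add(x, t * v / (sqrt_c * v_norm), c)

def dist_c(x, m, c, eps=1e-7):
    """Hyperbolic distance of formula (4)."""
    sqrt_c = c ** 0.5
    arg = (sqrt_c * mobius_add(-x, m, c).norm(dim=-1)).clamp(max=1 - eps)
    return (2.0 / sqrt_c) * torch.atanh(arg)
```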
In some embodiments, the public data set CLIP LAION-400M is used to pre-train the teacher model, and then the public data set ImageNet 22k is used to fine-tune the teacher model to obtain an optimal pre-trained model, which is used as the pre-trained teacher model.
The fine-tuning involves adjusting the learning strategy, dynamically scheduling the optimizer according to the learning plan, setting the number of training iterations, and the like.
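A sketch of such a fine-tuning loop follows; the optimizer choice (AdamW), the cosine learning-rate schedule, the learning rate, and the epoch count are all assumptions, since the text does not fix them:

```python
import torch

def finetune(teacher, train_loader, epochs=30, lr=1e-5):
    """Illustrative fine-tuning of a pre-trained teacher on ImageNet 22k."""
    optimizer = torch.optim.AdamW(teacher.parameters(), lr=lr,
                                  weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=epochs)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            loss = loss_fn(teacher(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # dynamic learning-plan adjustment
    return teacher
```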
Based on the graph searching model constructed by the embodiment, the embodiment of the application also provides a graph searching method. The method can be applied to a server and also can be applied to terminal equipment. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing cloud computing service; the terminal device may be a smart phone, a tablet computer, a desktop computer, etc., and the embodiment is not particularly limited.
First, a plurality of images to be predicted are obtained; the images to be predicted are then input into the graph searching model for probability prediction; finally, the prediction result, namely the similarity probability among the images to be predicted, is output. The images can be ranked by similarity probability: the higher the similarity probability, the higher the ranking priority, so that the most similar images can be conveniently displayed to users, as illustrated in the sketch below.
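In the following sketch, `model` is assumed to return the similarity probability of two images; all names are illustrative:

```python
import torch

def rank_by_similarity(model, query, gallery):
    """Scores gallery images against a query image and returns
    (index, probability) pairs sorted most-similar first."""
    with torch.no_grad():
        probs = [float(model(query, img)) for img in gallery]
    order = sorted(range(len(gallery)), key=probs.__getitem__, reverse=True)
    return [(i, probs[i]) for i in order]
```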
Fig. 4 is a block diagram of a graph searching model building system 400 according to an embodiment of the present application. As shown in fig. 4, the system 400 mainly includes:
a pre-training module 401, configured to pre-train a teacher model constructed based on a Transformer model;
a loss function determination module 402, configured to determine a loss function based on distance calculated in hyperbolic space;
and a knowledge distillation module 403, configured to perform knowledge distillation training on the student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function, so as to obtain a graph searching model.
In some embodiments, knowledge distillation module 403 includes:
the acquisition module is used for acquiring an image training set;
the augmentation encoding module is used for performing first data augmentation processing on the sample images in the image training set and encoding parameters in the first data augmentation processing into a storage space by using an encoder;
the first input module is used for taking the sample image subjected to the first data augmentation process as the input of a pre-trained teacher model; the soft label setting module is used for taking the sparse codes and indexes output by the pre-trained teacher model as sparse soft labels for training the student model;
the augmentation decoding module is used for carrying out second data augmentation processing on the sample images in the image training set, and transmitting parameters in the first data augmentation processing which are coded into the storage space to the second data augmentation processing by utilizing the decoder;
the second input module is used for taking the sample image subjected to the second data augmentation process as the input of the student model;
and the training module is used for performing knowledge distillation training on the student model based on the sparse soft label, the prediction output of the student model, and the loss function to obtain the graph searching model.
Optionally, if there are a plurality of pre-trained teacher models, the soft label setting module is specifically configured to align the dimensions of the sparse codes and indexes output by the plurality of pre-trained teacher models by using a fully connected layer, calculate the average value of the dimension-aligned sparse codes and indexes, and take the average value as the sparse soft label.
In some embodiments, the system 400 further comprises:
the image preprocessing module is used for dividing an original image in the image training set before performing first data augmentation processing on the image data in the image training set and before performing second data augmentation processing on the image data in the image training set, and taking a foreground image obtained by division as a target object; if the proportion of the black pixel values of the target object exceeds a first preset threshold value, embedding the target object into a pure white background image; if the proportion of the black pixel values of the target object does not exceed the first preset threshold value, embedding the target object into a pure black background image; and taking the embedded image as a sample image in the image training set.
In some embodiments, the loss function determination module 402 is specifically configured to input two sample images into a backbone network for feature extraction; classify the extracted features through a fully connected layer FC; realize a mapping from Euclidean space to a compact sphere model space through exponential mapping; map the infinite hyperboloid into a finite circular area by using a hyperbolic conformal transformation through a conformal disk model; and compare the two sample images through a pairwise model, where if one sample image is ranked higher than the other in relevance order, the two form a positive example pair, and otherwise a negative example pair, and determine the loss function of the pairwise model to be the multi-similarity loss.
In some embodiments, the pre-training module 401 is specifically configured to pre-train the teacher model with the public data set CLIP LAION-400M, and then fine-tune the teacher model with the public data set ImageNet 22k to obtain an optimal pre-trained model, the optimal pre-trained model being used as the pre-trained teacher model.
The functional modules in the embodiments of the present application may be integrated together to form a single unit, for example, integrated in a processing unit, or each module may exist alone physically, or two or more modules may be integrated into a single unit. The integrated units may be implemented in hardware or as software functional units. If implemented in the form of software functional modules and sold or used as a stand-alone product, the functions may be stored on a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
The various changes and specific examples in the method provided in the foregoing embodiments are also applicable to the graph searching model building system of this embodiment. Through the foregoing detailed description of the graph searching model building method, those skilled in the art can clearly understand how the graph searching model building system of this embodiment is implemented, so the details are not described here for brevity.
Fig. 5 is a block diagram of an electronic device 500 according to an embodiment of the present application. As shown in fig. 5, the electronic device 500 includes a memory 501, a processor 502, and a communication bus 503; the memory 501 and the processor 502 are connected by a communication bus 503.
Memory 501 may be used to store instructions, programs, code sets, or instruction sets. The memory 501 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function, and instructions for implementing the graph searching model building method, the graph searching method, and the like provided by the above embodiments; the data storage area may store data related to the graph searching method and the like provided in the above embodiments.
The processor 502 may include one or more processing cores. The processor 502 performs various functions of the present application and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 501, invoking data stored in the memory 501. The processor 502 may be at least one of an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronics for implementing the functions of the processor 502 described above may be other for different devices, and embodiments of the present application are not specifically limited.
Communication bus 503 may include a path to transfer information between the above components. The communication bus 503 may be a PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus or an EISA (Extended Industry Standard Architecture ) bus, or the like. The communication bus 503 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one double arrow is shown in FIG. 5, but not only one bus or one type of bus. And the electronic device shown in fig. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
The embodiment of the application provides a computer readable storage medium storing a computer program that can be loaded by a processor to execute the graph searching model building method and the graph searching method provided by the above embodiments.
In this embodiment, the computer-readable storage medium may be a tangible device that holds and stores instructions for use by an instruction execution device. The computer readable storage medium may be, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination of the preceding. In particular, the computer readable storage medium may be a portable computer disk, a hard disk, a USB flash disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, an optical disk, a magnetic disk, a mechanical coding device, or any combination of the foregoing.
The computer program in this embodiment contains program code for executing the method shown in fig. 1, and the program code may include instructions corresponding to the steps of the method provided in the above embodiments. The computer program may be downloaded from a computer readable storage medium to the respective computing/processing device, or to an external computer or external storage device via a network (e.g., the Internet, a local area network, a wide area network, and/or a wireless network). The computer program may also execute entirely on the user's computer as a stand-alone software package.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, it is to be understood that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. The method for establishing the graph searching model is characterized by comprising the following steps of:
pre-training a teacher model constructed based on a Transformer model;
determining a loss function based on distance calculated in hyperbolic space;
and performing knowledge distillation training on a student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain a graph searching model.
2. The method of claim 1, wherein the performing knowledge distillation training on the student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain the graph searching model comprises:
acquiring an image training set;
performing first data augmentation processing on sample images in the image training set, and encoding parameters in the first data augmentation processing into a storage space by using an encoder;
taking the sample image subjected to the first data augmentation process as an input of the pre-trained teacher model;
taking the sparse codes and indexes output by the pre-trained teacher model as sparse soft labels for training the student model;
performing second data augmentation processing on sample images in the image training set, and transmitting parameters in the first data augmentation processing encoded into the storage space to the second data augmentation processing by using a decoder;
taking the sample image subjected to the second data augmentation process as an input of the student model;
and performing knowledge distillation training on the student model based on the sparse soft label, the prediction output of the student model, and the loss function to obtain the graph searching model.
3. The method of claim 2, wherein if there are a plurality of pre-trained teacher models, the taking the sparse codes and indexes output by the pre-trained teacher models as the sparse soft labels for training the student model comprises:
aligning the dimensions of the sparse codes and indexes output by the plurality of pre-trained teacher models by using a fully connected layer;
and calculating an average value of a plurality of sparse codes and indexes after dimension alignment, and taking the average value as the sparse soft label.
4. A method according to claim 2 or 3, characterized in that, before the first data augmentation processing is performed on the image data in the image training set and before the second data augmentation processing is performed on the image data in the image training set, the method further comprises:
dividing an original image in the image training set, and taking a foreground image obtained by division as a target object;
if the proportion of the black pixel values of the target object exceeds a first preset threshold value, embedding the target object into a pure white background image;
if the proportion of the black pixel values of the target object does not exceed a first preset threshold value, embedding the target object into a pure black background image;
and taking the embedded image as a sample image in the image training set.
5. A method according to any one of claims 1 to 3, wherein the determining a loss function based on distance calculated in hyperbolic space comprises:
inputting two sample images into a backbone network for feature extraction;
classifying the extracted features through a fully connected layer FC;
realizing a mapping from Euclidean space to a compact sphere model space through exponential mapping;
mapping the infinite hyperboloid into a finite circular area by using a hyperbolic conformal transformation through a conformal disk model;
and comparing the two sample images through a pairwise model, wherein if one sample image is ranked higher than the other in relevance order, the two form a positive example pair, and otherwise a negative example pair, and determining the loss function of the pairwise model to be the multi-similarity loss.
6. A method according to any one of claims 1 to 3, wherein the pre-training a teacher model constructed based on a Transformer model comprises:
first pre-training the teacher model with the public data set CLIP LAION-400M, and then fine-tuning the teacher model with the public data set ImageNet 22k to obtain an optimal pre-trained model, the optimal pre-trained model being used as the pre-trained teacher model.
7. A graph searching method, comprising:
inputting an image to be predicted into a graph search model established according to the method of any one of claims 1 to 6, and outputting a prediction result.
8. A graph searching model building system, comprising:
a pre-training module for pre-training a teacher model constructed based on a Transformer model;
a loss function determination module for determining a loss function based on distance calculated in hyperbolic space;
and a knowledge distillation module for performing knowledge distillation training on a student model constructed based on the Transformer model by using the pre-trained teacher model and the loss function to obtain a graph searching model.
9. An electronic device comprising a memory and a processor; the memory has stored thereon a computer program that can be loaded by the processor and that performs the method according to any of claims 1 to 7.
10. A computer readable storage medium, characterized in that a computer program is stored which can be loaded by a processor and which performs the method according to any of claims 1 to 7.
CN202211711389.1A 2022-12-29 2022-12-29 Graph searching model building method, graph searching method, system, equipment and medium Pending CN116187439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211711389.1A CN116187439A (en) 2022-12-29 2022-12-29 Graph searching model building method, graph searching method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211711389.1A CN116187439A (en) 2022-12-29 2022-12-29 Graph searching model building method, graph searching method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN116187439A (en) 2023-05-30

Family

ID=86439543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211711389.1A Pending CN116187439A (en) 2022-12-29 2022-12-29 Graph searching model building method, graph searching method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN116187439A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph
CN113326864A (en) * 2021-04-06 2021-08-31 上海海洋大学 Image retrieval model and training method
CN114385843A (en) * 2020-10-21 2022-04-22 顺丰科技有限公司 Classification network construction method and image retrieval method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114385843A (en) * 2020-10-21 2022-04-22 顺丰科技有限公司 Classification network construction method and image retrieval method
CN112287126A (en) * 2020-12-24 2021-01-29 中国人民解放军国防科技大学 Entity alignment method and device suitable for multi-mode knowledge graph
CN113326864A (en) * 2021-04-06 2021-08-31 上海海洋大学 Image retrieval model and training method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEKSANDR ERMOLOV et al.: "Hyperbolic Vision Transformers: Combining Improvements in Metric Learning", 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 24 June 2022 (2022-06-24), pages 7409-7419 *
MIAO Zhuang et al.: "A Robust Dual-Teacher Self-Supervised Distillation Hashing Learning Method", Computer Science, vol. 49, no. 10, 9 June 2022 (2022-06-09), pages 159-168 *

Similar Documents

Publication Publication Date Title
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
WO2017075939A1 (en) Method and device for recognizing image contents
EP3707622A1 (en) Generation of text from structured data
WO2023134082A1 (en) Training method and apparatus for image caption statement generation module, and electronic device
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN115953665B (en) Target detection method, device, equipment and storage medium
CN115443490A (en) Image auditing method and device, equipment and storage medium
CN111783767B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113435499B (en) Label classification method, device, electronic equipment and storage medium
TWI803243B (en) Method for expanding images, computer device and storage medium
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN112364198B (en) Cross-modal hash retrieval method, terminal equipment and storage medium
CN113627151B (en) Cross-modal data matching method, device, equipment and medium
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN115565177B (en) Character recognition model training, character recognition method, device, equipment and medium
Lu et al. Domain-aware se network for sketch-based image retrieval with multiplicative euclidean margin softmax
CN118334489A (en) Vision language model field self-adaption method based on countermeasure double-prompt learning, terminal and readable storage medium
CN116089648A (en) File management system and method based on artificial intelligence
CN113836929B (en) Named entity recognition method, device, equipment and storage medium
CN111373393A (en) Image retrieval method and device and image library generation method and device
CN115373697A (en) Data processing method and data processing device
CN113159053A (en) Image recognition method and device and computing equipment
CN113298892A (en) Image coding method and device, and storage medium
CN113762459A (en) Model training method, text generation method, device, medium and equipment
CN116503670A (en) Image classification and model training method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination