CN112990296A - Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation - Google Patents

Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation

Info

Publication number
CN112990296A
Authority
CN
China
Prior art keywords
image
similarity
network model
text
distillation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110261563.6A
Other languages
Chinese (zh)
Other versions
CN112990296B (en)
Inventor
王亮
黄岩
王聿铭
袁辉
纪文峰
李凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cas Artificial Intelligence Research Qingdao Co ltd
Original Assignee
Cas Artificial Intelligence Research Qingdao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cas Artificial Intelligence Research Qingdao Co ltd
Priority to CN202110261563.6A
Publication of CN112990296A
Application granted
Publication of CN112990296B
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G06F 18/41 Interactive pattern learning with a human teacher

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for compressing and accelerating an image-text matching model based on orthogonal similarity distillation, wherein the method comprises the following steps: S1: acquiring an image-text matching data set, and constructing a student network model and a teacher network model; S2: preprocessing and data loading are carried out on the image-text matching data set; S3: calculating a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculating singular values based on the difference similarity matrix; constructing an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular values; calculating a joint loss function; training the student network model based on the joint loss function; S4: performing a performance test on the trained student network model to obtain a performance evaluation result of the image-text matching data set and the trained student network model; S5: inputting the image or the text to be detected into the trained student network model, and outputting the text or the image.

Description

Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an orthogonal similarity distillation-based image-text matching model compression and acceleration method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The image-text matching technology has wide application requirements in many fields, such as: natural-language-based pedestrian/identity/behavior/event/attribute/target retrieval in security monitoring scenarios, voice-image cross-modal retrieval in human-computer interaction, cross-modal matching of commodity text descriptions and photos on Internet e-commerce platforms, recommendation of related products, and the like. In addition, progress in image-text matching can also promote the common progress of many vision-language multi-modal tasks such as referring expression, visual question answering, image description, multi-turn dialogue in interactive three-dimensional visual scenes, vision-assisted cross-language translation, vision-language navigation, and language-based image synthesis.
The image-text matching task has always faced a huge challenge from the vision-language "semantic understanding gap", which stems from the large differences in data structure between images and text. Although image-text matching has seen important research progress in recent years, such as bottom-up attention mechanisms, pre-trained language models, and image-text fusion modeling, and many published works keep raising its performance to unprecedented heights, the accompanying large model parameter counts and long matching times pose great challenges for applying image-text matching on pure-CPU platforms such as ordinary home computers and on low-power mobile and embedded platforms such as smartphones, limiting the wide deployment of vision-language cross-modal analysis and understanding capabilities.
Moreover, traditional model compression and acceleration methods designed for single-domain CV or NLP tasks cannot adequately solve the model compression and computation acceleration problem of vision-language multi-modal tasks that span both fields. Although these conventional single-modal methods can reduce the parameters and computation of the visual and text encoders in an image-text matching model to some extent, the model's overall cross-modal retrieval performance suffers a relatively serious loss, so inference efficiency and retrieval performance cannot both be achieved. For the model compression and acceleration problem of image-text matching, neither academia nor industry, at home or abroad, has yet offered an effective solution.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides an orthogonal similarity distillation-based image-text matching model compression and acceleration method and system. The basic principle is to apply the core technique of orthogonal similarity distillation training: a teacher network model with stronger performance serves as the source of high-performance knowledge, and that knowledge is "distilled" and taught to a smaller, more efficient student network model, so that the student network model attains both high efficiency and high performance. The method addresses the long-standing difficulty, at home and abroad, of balancing efficiency and accuracy in the image-text matching task, and reaches an internationally leading level.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a compression and acceleration method of an image-text matching model based on orthogonal similarity distillation.
An orthogonal similarity distillation-based image-text matching model compression and acceleration method comprises the following steps:
s1: acquiring an image-text matching data set, and constructing a student network model and a teacher network model;
s2: preprocessing and data loading are carried out on the image-text matching data set;
s3: calculating a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculating a singular value based on the difference similarity matrix; constructing an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular value; calculating a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function; training a student network model based on a joint loss function;
s4: performing performance test on the trained student network model to obtain a performance evaluation result of the image-text matching data set and the trained student network model;
s5: and inputting the image or text to be detected into the trained student network model, and outputting the text corresponding to the image or the image corresponding to the text.
The second aspect of the invention provides a graph-text matching model compression and acceleration system based on orthogonal similarity distillation.
An orthogonal similarity distillation-based image-text matching model compression and acceleration system comprises:
a model building module configured to: acquire an image-text matching data set and construct a student network model and a teacher network model;
a pre-processing and data loading module configured to: preprocessing and data loading are carried out on the image-text matching data set;
a training module configured to: calculating a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculating a singular value based on the difference similarity matrix; constructing an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular value; calculating a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function; training a student network model based on a joint loss function;
a result diagnostic module configured to: performing performance test on the trained student network model to obtain a performance evaluation result of the image-text matching data set and the trained student network model;
an output module configured to: and inputting the image or text to be detected into the trained student network model, and outputting the text corresponding to the image or the image corresponding to the text.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method for compressing and accelerating an image-text matching model based on orthogonal similarity distillation as defined in the first aspect.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for compressing and accelerating an image-text matching model based on orthogonal similarity distillation as described in the first aspect when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the invention applies the model compression and acceleration technology to the image-text matching task, realizes the further improvement of the image-text matching performance under the condition of smaller model parameter quantity and calculation cost, finally realizes the miniaturization and high-efficiency inference of the image-text matching model and the CPU platform deployment, achieves the international leading level in the three aspects of parameter compression, inference acceleration and matching performance, and has the characteristics of small size, high speed and accuracy.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow chart of the compression and acceleration method of the image-text matching model based on orthogonal similarity distillation according to the present invention;
FIG. 2 is a core technology flow diagram of the present invention;
fig. 3 is a schematic diagram of an implementation framework of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising" used in this specification specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Interpretation of terms:
Image-Text Matching (ITM, also known as Image-Text Retrieval or Image-Text Alignment) is a classic task in the vision-language multi-modal field at the cross-disciplinary intersection of Computer Vision (CV) and Natural Language Processing (NLP), and one of the hallmarks of an artificial intelligence system with cross-modal analysis and understanding capability. It is the common model foundation of other tasks in the same vision-language multi-modal field, such as Referring Expression, Visual Question Answering (VQA) and Image Captioning; it is an important bridge between CV and NLP, as well as a representative task of brain-inspired intelligence research in the multi-modal field, so its significance is self-evident.
The meaning of the image-text matching task is as follows: given a visual-language multimodal database containing images with corresponding text descriptions, if an image (or a piece of text) is input as the query, a piece of text (or an image) semantically related to the query can be output as the retrieval result. This task of cross-modal semantic retrieval and matching between images and text is image-text matching.
Aiming at the problems in the background technology, the invention applies the model compression and acceleration technology to the image-text matching task, takes the teacher network model with stronger performance as the source of high-performance knowledge, and teaches the high-performance knowledge 'distillation' of the teacher network model to the student network model with smaller model and higher efficiency, so that the student network model has high efficiency and high performance.
Specifically, the invention mainly comprises 1 core technology, 3 supporting technologies and 1 implementation framework. The 1 core technology is the orthogonal similarity distillation training technique: it is the core of the invention, the key that distinguishes the invention from other model compression and acceleration methods, and the key to ensuring a small, fast and accurate image-text matching model.
The key link in the core technology is the calculation of the orthogonal similarity distillation loss function. Its basic principle: the similarity matrix serves as the knowledge carrier, SVD orthogonal decomposition serves as the tool for analyzing the knowledge, a whitening-like transformation serves as the means of reducing the correlation and variance of the difference similarity matrix, and the squared F-norm after singular value attenuation serves as the measure of the teaching effect of knowledge distillation.
The 3 supporting technologies are: (1) preparation of the student and teacher networks; (2) calculation of the joint loss function; (3) two-stage training of the student network model. The 1 implementation framework covers the complete implementation flow from data and model preparation, through the training-verification-test stages, to CPU-platform inference acceleration. The 3 supporting technologies and the 1 implementation framework cooperate with the core technology to guarantee that the knowledge distillation training effect and the CPU deployment inference efficiency of the core technology are brought into full play.
Example one
The embodiment provides a compression and acceleration method of an image-text matching model based on orthogonal similarity distillation.
According to the summary of the present invention, the corresponding relationship between 1 core technology, 3 matching technologies and 1 set of implementation framework and the specific steps of the method is as follows. S3 corresponds to the aforementioned 1 core technique (orthogonal similarity distillation training technique), as shown in fig. 2. S1.4, S3.4, S3.5 correspond to the 3 supporting technologies described above (preparation of student network model and teacher network model, calculation of joint loss function, two-stage training of student network model), respectively. S1 to S5 correspond to the aforementioned 1-set implementing frame, as shown in fig. 3. The detailed meanings of the specific steps of the above process are further described in the detailed description of the embodiments which follow.
As shown in fig. 1, an orthogonal similarity distillation-based image-text matching model compression and acceleration method includes:
s1: acquiring an image-text matching data set, and constructing a student network model and a teacher network model;
the preparation work of the data and the model aims at selecting/constructing a graph-text matching data set which meets the task requirement, and selecting/training a network model of students and teachers which is suitable for orthogonal similarity distillation training, and specifically comprises the following 4 sub-steps:
s1.1: acquiring an image-text matching data set;
the graph-text matching data set (dataset) is a sample source for model training and testing, images in the graph-text matching data set and corresponding text descriptions are correctly matched in semantic content, such as a public data set Flickr30k or MSCOCO, and each image in the graph-text matching data set is provided with 5 sentences of artificially labeled English descriptions. And images can be collected and manual text labels can be given by self according to the specific requirements of the task, or other image-text multi-modal public data sets can be selected.
S1.2: performing word segmentation on the text in the image-text matching data set by adopting a word segmentation device, and matching corresponding integer numbers according to the appearance sequence of words to construct a bidirectional dictionary set;
the lexicon (vocarbulariy) is the basis for modeling text, whose total number of entries represents the vocabulary recognized by the text encoder, from which words should be extracted at least from the training set of the S1 data set. The preparation of the dictionary requires that a word cutter (tokenizer) is used for cutting words of texts (generally phrases/sentences) in a data set, and corresponding integer numbers are matched according to the appearance sequence of the words to construct a bidirectional dictionary set. The forward dictionary takes words as keys (keys) and numbers as values (values) and is used for translating character string sentences into number sequences; the reverse dictionary has numbers as keys and words as values for translating the number sequence into character string sentences. If the total number of entries of the dictionary is too large and the dictionary has long tail distribution, words with word frequency exceeding a certain threshold value can be selected to construct a smaller dictionary, rare words are removed, and high-frequency words are reserved.
S1.3: respectively extracting image features and text features of the image-text matching data set by adopting an image encoder and a text encoder;
an image encoder (image encoder) and a text encoder (text encoder) are one of the necessary components in the image-text matching model, and are used for performing feature extraction on images and texts respectively.
The image encoder can be a convolutional neural network (CNN) pre-trained on an image classification data set such as ImageNet (e.g., ResNet152, parameter count 60M); or a regional convolutional neural network (RCNN, e.g., Faster-RCNN) pre-trained on a visual-language multimodal data set such as Visual Genome; or a small CNN model obtained by further applying model compression techniques such as lightweighting (e.g., ResNeXt50, parameter count 25M). The last fully-connected layer (FC layer) of these CNN/RCNN models is then removed to obtain the final image encoder.
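For instance, a sketch of stripping the final FC layer from a pre-trained CNN to obtain the image encoder; torchvision is used here as an assumed vehicle, not one prescribed by the patent:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained backbone; ResNeXt50 32x4d matches the
# small student encoder mentioned in the text (an illustrative choice).
backbone = models.resnext50_32x4d(weights="IMAGENET1K_V1")

# Drop the final FC classification layer so the network outputs
# 2048-dimensional image features (F = 2048 in the text).
backbone.fc = nn.Identity()

images = torch.randn(8, 3, 224, 224)   # a dummy batch
features = backbone(images)            # shape: (8, 2048)
```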
The text encoder can be a recurrent neural network (RNN, e.g., GRU or LSTM, in unidirectional or bidirectional form) paired with word embeddings (e.g., Word2vec, GloVe); or a language model (e.g., Transformer, BERT) pre-trained on corpora such as the WMT 2014 English-German data set (4.5M sentence pairs), the WMT 2014 English-French data set (36M sentence pairs), or BookCorpus (800M words) plus English Wikipedia (2.5B words); or a lightweight pre-trained language model (e.g., ALBERT, TinyBERT) obtained by further applying model compression techniques such as parameter sharing and knowledge distillation.
S1.4: and constructing a student network model and a teacher network model, and teaching the knowledge of the teacher network model to the student network model, wherein the student network model and the teacher network model comprise an image encoder and a text encoder.
Because the method needs orthogonal similarity distillation training, a student network model and a teacher network model need to be constructed, and the knowledge of the teacher network model is taught to the student network model. The important components of the student network model and the teacher network model are the image and text encoder of S1.3.
First, a group of image and text encoders is selected for the student network model and for the teacher network model, and each is loaded with its pre-prepared pre-trained model parameter file. The image and text encoders of the student network model use small or medium models with small parameter counts/computation and weaker performance (e.g., ResNeXt50 and ALBERT); the image and text encoders of the teacher network model use medium or large models with large parameter counts/computation and stronger performance (e.g., Faster-RCNN and BERT).
Then, the image and text encoders of the teacher network model are jointly trained. The training set is identical to the one on which the student network model will undergo orthogonal similarity distillation training, so that the knowledge taught by the teacher network model and the knowledge learned by the student network model come from the same image-text matching data set. The joint training of the image encoder and text encoder finally yields a teacher network model (e.g., VSRN, SAEM) capable of performing orthogonal similarity distillation training on the student network model.
Finally, the gradients of the learnable parameters of the teacher network model need to be turned off, ensuring that the teacher network model performs no back-propagation or gradient updates during training and that its high-performance knowledge is protected. The gradients of the learnable parameters of the student network model can be set according to the specific fine-tuning requirements and are generally turned on; however, this patent adopts a two-stage training mode for the student network model (see S3.5 for details), so the gradient on/off schedule differs from the general case.
S2: preprocessing and data loading are carried out on the image-text matching data set;
the data set preprocessing and data loading aim to obtain a graph-text matching training set, a verification set and a test set which meet the requirements of orthogonal similarity distillation training, and specifically comprises the following three substeps.
S2.1: according to task needs, preprocessing the image in the image-text matching data set, wherein the image preprocessing at least comprises the following steps: one of normalization, scaling, random clipping, and random flipping;
s2.2: according to task needs, preprocessing the text in the image-text matching data set, wherein the text preprocessing at least comprises: segmenting sentences into words with a tokenizer, mapping each word from a character string to an integer number using the bidirectional dictionary set of S1.2 and then mapping the integer number to a one-hot code, zero-padding sentences of insufficient length, and sorting sentence lengths in descending order;
the text (phrase/sentence) needs to use word cutter to cut the sentence into words, and use dictionary of S1.2 to map each word from character string to integer number, and then further map the integer number to one-bit effective (one-hot) code, fill in zero for the sentence with insufficient length, sort the sentence length in descending order, and so on.
S2.3: splitting, shuffling and batching the image-text matching data set to complete the loading of the image-text matching training set, verification set and test set.
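A sketch of this loading step; the PyTorch DataLoader and the stand-in random dataset are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# A stand-in dataset of (image, caption-id-sequence) pairs; in practice
# this is the preprocessed image-text matching data set from S2.1-S2.2.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224),
                        torch.randint(0, 5000, (1000, 20)))

train_set, val_set, test_set = random_split(dataset, [800, 100, 100])
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader   = DataLoader(val_set,   batch_size=128, shuffle=False)
test_loader  = DataLoader(test_set,  batch_size=128, shuffle=False)
```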
S3: calculating a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculating a singular value based on the difference similarity matrix; constructing an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular value; calculating a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function; training a student network model based on a joint loss function;
the orthogonal similarity distillation training technology is the core technology of the method provided by the invention, is also the key of the method which is different from other model compression and acceleration methods, and is also the key for ensuring the small, fast and accurate image-text matching model.
Wherein S3 includes:
s3.1: acquiring a batch of image-text matching training set;
from the training set (train _ set) of S2.3, a batch (batch) of image and text data is taken. If all samples in the training set are taken, the training set is reloaded. If the training set has less than one batch size (batch size) of remaining samples, the remaining samples are either taken away or optionally ignored and the next round of data set loading is continued.
Equation (1) represents the loading of one batch of data, where $I_i$ and $C_i$ are the $i$-th image and the $i$-th sentence in the loaded batch and $N$ is the batch size ($N = 128$ in the experiments):

$$\{I_i, C_i\}_{i=1,\ldots,N} \qquad (1)$$
S3.2: processing the student network model and the teacher network model by adopting forward propagation to obtain a similarity scoring matrix;
forward-propagation (forward-propagation) of a student and teacher network model needs three stages of feature extraction, combined semantic embedding and similarity matching, and aims to obtain a similarity scoring matrix which is used as an analysis basis of an orthogonal similarity distillation loss function.
S3.2.1: respectively extracting the feature vector code of each image and each sentence from the images and texts of a batch by adopting an image coder and a text coder;
and (3) respectively using the images and texts of the student network model and the teacher network model of S1.4 for a batch of images and texts of S3.1, and extracting a feature (feature) vector code of each image and each sentence by a text coder.
The encoding process of the image feature vector comprises the following steps: for CNN type (classification task) image encoders, it is necessary to extract the feature vectors of the entire image directly from the images in the current batch; for an image encoder of the RCNN type (object detection task), object detection needs to be performed from images in a current batch, region-level feature vectors are extracted from a plurality of detected objects, and finally, the region-level feature vectors are integrated into a feature vector of the whole image by means of Average Pooling (Average Pooling) and the like.
Encoding of the text feature vector: first, the word embedding of S1.3 maps the one-hot code of each word of each sentence in the current batch to a one-dimensional continuous word embedding vector; these word embedding vectors are then fed into the text encoder in sentence order for context encoding, yielding word-level feature vectors; finally, the word-level feature vectors are aggregated into the feature vector of the whole sentence by average pooling or similar means.
Formulas (2) and (3) represent the feature extraction of the student network model's image and text encoders respectively, where the image encoder is the 32x4d version of ResNeXt50 and the text encoder is ALBERT; $I_i$ and $C_j$ are the $i$-th image and the $j$-th sentence, $v_i$ and $c_j$ are their feature vectors, and $F$ and $E$ denote the dimensions of the image and text feature vectors ($F = 2048$ and $E = 1024$ in the experiments):

$$v_i = \mathrm{ResNeXt}(I_i) \in \mathbb{R}^{F} \qquad (2)$$

$$c_j = \mathrm{ALBERT}(C_j) \in \mathbb{R}^{E} \qquad (3)$$
S3.2.2: respectively embedding the image and text feature vectors into respective united semantic spaces by using respective full connection layers for the student network and the teacher network, and normalizing to obtain united semantic embedded vectors;
and respectively embedding the image and text feature vectors into respective joint semantic spaces by using respective full connection layers for the student network model and the teacher network model, and normalizing to obtain joint semantic embedding (embedding) vectors, so that the image and text embedding vectors of the student network model and the teacher network model have the same dimension and the vector model length is 1.
$Wv_i$ in formula (4) represents the joint semantic embedding of the image feature vector by the image encoder, where $W \in \mathbb{R}^{E \times F}$ denotes the learnable parameters of the fully-connected layer used by the image encoder.
S3.2.3: and performing similarity matching on the respective image and text embedded vectors of the student network model and the teacher network model in the current batch by using cosine similarity to obtain respective cosine similarity scoring matrixes of the student network model and the teacher network model.
Similarity matching is performed with cosine similarity on the image and text embedding vectors of the student network model and the teacher network model in the current batch, yielding an N x N cosine similarity scoring matrix for each of the student and teacher networks, where N is the batch size.
Formula (4) represents the cosine similarity computation between an image and a sentence of text, where $s_{ij}$ is the cosine similarity between the $i$-th image and the $j$-th sentence and $\cos(\cdot,\cdot)$ denotes cosine similarity (implying normalization of the input vectors and a dot-product computation), with range $[-1, 1]$. $S$ and $T$ in formulas (5) and (6) represent the similarity scoring matrices of the student network model and the teacher network model, respectively. The similarity matrix elements $s_{ij}$ in formula (5) are the cosine similarity results from formula (4); the teacher network model's similarity scoring matrix $T$ in formula (6) is obtained analogously to the student's, by loading a batch of data as in formula (1) of S3.1, extracting features with the image and text encoders as in formulas (2) and (3) of S3.2, and computing cosine similarity as in formula (4).

$$s_{ij} = \cos(Wv_i, c_j) \in [-1, 1] \qquad (4)$$

$$S = [s_{ij}] \in [-1, 1]^{N \times N} \qquad (5)$$

$$T = [t_{ij}] \in [-1, 1]^{N \times N} \qquad (6)$$
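A sketch of the similarity matching step as plain tensor algebra (the encoder outputs are assumed to be given):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(img_feats, txt_embs, W):
    """Compute the N x N cosine similarity scoring matrix of eqs. (4)-(6).

    img_feats: (N, F) image feature vectors v_i
    txt_embs:  (N, E) text embedding vectors c_j
    W:         (E, F) FC-layer weight of the image encoder
    """
    img_embs = F.normalize(img_feats @ W.t(), dim=1)  # W v_i, unit norm
    txt_embs = F.normalize(txt_embs, dim=1)
    return img_embs @ txt_embs.t()                    # s_ij = cos(W v_i, c_j)

# S = similarity_matrix(v_student, c_student, W_student)
# T = similarity_matrix(v_teacher, c_teacher, W_teacher)  # teacher analogous
```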
S3.3: calculating an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the similarity matrix of the student network and the similarity matrix of the teacher network;
the key link in the orthogonal similarity distillation technology is the calculation of an orthogonal similarity distillation loss function, and the method specifically comprises the following five substeps, aiming at obtaining the orthogonal similarity distillation loss. The basic principle of the method is that a similarity matrix is used as a knowledge carrier, SVD orthogonal decomposition is used as a tool for knowledge analysis, similar whitening transformation is used as a means for reducing the correlation and variance of a difference similarity matrix, and the F norm square after singular value attenuation is used as a measure for the teaching effect of knowledge distillation.
S3.3.1: subtracting the similarity matrix of the teacher network model from the similarity matrix of the student network model element by element to obtain a difference similarity matrix;
and subtracting the similarity matrix of the teacher network element by using the similarity matrix of the student network to obtain a difference similarity matrix.
Equation (7) represents the calculation of the difference similarity matrix $D$:

$$D = S - T \qquad (7)$$
S3.3.2: multiplying the transpose of the difference similarity matrix by the difference similarity matrix to obtain a semi-positive definite similarity matrix;
and multiplying the difference similarity matrix by the transpose (transpose) of the difference similarity matrix to obtain a semi-positive definite (SPD) similarity matrix.
At this point, the trace of the semi-positive definite similarity matrix $D^{T}D$ is the distillation loss value of the difference similarity matrix in the squared-F-norm sense. Moreover, this value equals the sum of squares of the singular values obtained by Singular Value Decomposition (SVD) of the difference similarity matrix. For convenience of description, the singular values are arranged in descending order of their squared magnitude. For the SVD orthogonal decomposition, the choice of orthonormal basis is not unique and may be chosen arbitrarily, which does not affect the orthogonal similarity distillation of this patent.
Equation (8) shows the equivalence between the trace of the above semi-positive definite similarity matrix, the squared-F-norm distillation loss, and the squares of the singular values, where $\|\cdot\|_{F}^{2}$ denotes the squared F-norm of a matrix, $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, and $\sigma_i^{2}$ is the square of the $i$-th singular value obtained by SVD decomposition of the difference similarity matrix:

$$\mathrm{Tr}(D^{T}D) = \|D\|_{F}^{2} = \sum_{i=1}^{N} \sigma_i^{2} \qquad (8)$$
Equation (9) represents the definition of the squared-F-norm distillation loss:

$$\|D\|_{F}^{2} = \|S - T\|_{F}^{2} = \sum_{i=1}^{N}\sum_{j=1}^{N}\left(s_{ij} - t_{ij}\right)^{2} \qquad (9)$$
Furthermore, this value is also approximately equal to the variance component obtained after Bias-Variance Decomposition of the squared-F-norm distillation loss of the difference similarity matrix (experiments show that Bias ≈ 0).

Equation (10) represents the bias-variance decomposition, where $E[\cdot]$, $\mathrm{Var}(\cdot)$ and $\mathrm{Cov}(\cdot,\cdot)$ denote the mean, variance and covariance respectively, taken over the entries of the batch similarity matrices:

$$\frac{1}{N^{2}}\|S-T\|_{F}^{2} = \underbrace{\left(E[S]-E[T]\right)^{2}}_{\mathrm{Bias}^{2}} + \underbrace{\mathrm{Var}(S)+\mathrm{Var}(T)-2\,\mathrm{Cov}(S,T)}_{\mathrm{Variance}} \qquad (10)$$

Equation (11) represents the new equivalence resulting from a bias approximately equal to 0:

$$\frac{1}{N^{2}}\|S-T\|_{F}^{2} \approx \mathrm{Var}(S)+\mathrm{Var}(T)-2\,\mathrm{Cov}(S,T) \qquad (11)$$
S3.3.3: performing SVD on the difference similarity matrix to obtain its singular values, which equal the square roots of the singular values obtained by SVD of the semi-positive definite similarity matrix;
the difference similarity matrix is subjected to SVD to obtain singular values (S3.3.2 has been arranged in descending order according to the square magnitude of the singular values).
Equation (12) represents the singular value decomposition of the differential similarity matrix DWhere Σ ═ diag ([ σ ])1,...,σN]) Is a diagonal matrix in which the singular values are arranged in descending order of their squared magnitude, i.e.:
Figure BDA0002970252890000111
1≤i<and j is less than or equal to N, U and V are a left singular matrix and a right singular matrix obtained by decomposition, all orthogonal vectors corresponding to the row/column dimensions of the difference similarity matrix D are included in the matrix, and T is the matrix transposition.
D=UΣVT (12)
For the semi-positive definite similarity matrix $D^{T}D$ of S3.3.2, the singular values $\Sigma^{T}\Sigma$ obtained by SVD equal the squares of the singular values $\Sigma$ obtained by the SVD of the difference similarity matrix in equation (12) of this step (S3.3.3).

Equation (13) represents this squared relation between the singular values of the semi-positive definite similarity matrix $D^{T}D$ and the singular values of the difference similarity matrix $D$; equation (14) is the derivation, showing that $\Sigma^{T}\Sigma$ is the singular value result of the SVD decomposition of $D^{T}D$:

$$\sigma_i\!\left(D^{T}D\right) = \sigma_i^{2}(D), \quad i = 1, \ldots, N \qquad (13)$$

$$D^{T}D = \left(V\Sigma^{T}U^{T}\right)\left(U\Sigma V^{T}\right) = V\left(\Sigma^{T}\Sigma\right)V^{T} \qquad (14)$$
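These identities are straightforward to verify numerically; a small self-contained sketch (purely illustrative):

```python
import torch

torch.manual_seed(0)
S = torch.rand(128, 128, dtype=torch.float64) * 2 - 1  # student scores in [-1, 1]
T = torch.rand(128, 128, dtype=torch.float64) * 2 - 1  # teacher scores in [-1, 1]
D = S - T                                              # eq. (7)

sigma = torch.linalg.svdvals(D)            # singular values of D, descending

# eq. (8): Tr(D^T D) = ||D||_F^2 = sum of squared singular values
assert torch.allclose(torch.trace(D.t() @ D), (sigma ** 2).sum())
assert torch.allclose(D.norm(p="fro") ** 2, (sigma ** 2).sum())

# eqs. (13)-(14): the singular values of D^T D are the squares of those of D
assert torch.allclose(torch.linalg.svdvals(D.t() @ D), sigma ** 2, atol=1e-6)
```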
S3.3.4: carrying out an equalizing, whitening-like transformation in the squared sense on the singular values obtained by SVD (singular value decomposition) of the difference similarity matrix, the whitening-like transformation comprising two modes: soft whitening-like transformation and hard whitening-like transformation;
the Whitening-like transform specifically includes two transform modes, namely a soft transform and a hard transform. The common idea of the two methods is to perform similar whitening transformation of equalization processing under the square meaning on singular values (S3.3.2 are arranged in descending order according to the square size of the singular values) obtained by SVD decomposition of a differential similarity matrix, so as to greatly reduce the correlation between rows/columns of the differential similarity matrix where the singular values are located and reduce the absolute value difference between the singular values. In fact, the core idea of the Whitening-like transform is consistent with ZCA Whitening (Zero-phase Component Analysis Whitening), but there is also an improvement specific to the method, so it is called the Whitening-like transform.
The soft whitening-like transformation works as follows: a logarithmic function $\log_b(\cdot)$ (where $b$ is the base of the log function, chosen experimentally) is used to attenuate the squares of the largest singular values in the top $k$ percent (corresponding to the $\log_b(\sigma_i^{2})$ terms in equation (15)), while the remaining singular values stay unchanged (the $\sigma_i^{2}$ terms in equation (15)). This is equivalent to equalizing all the singular values.

The hard whitening-like transformation works as follows: the largest singular values in the top $k$ percent are attenuated by being set directly to 0 (the zeroed terms in equation (16)), while the remaining singular values stay unchanged (the $\sigma_i^{2}$ terms in equation (16)). This is equivalent to equalizing the magnitudes of the singular values in the remaining $1-k$ percent.
S3.3.5: summing the squares of singular values obtained by soft-class whitening transformation to obtain orthogonal similarity soft distillation loss; and summing the squares of singular values obtained by the hard whitening transformation to obtain the orthogonal similarity hard distillation loss.
The squares of the singular values obtained by the soft whitening-like transformation are summed to obtain the orthogonal similarity soft distillation loss; the squares of the singular values obtained by the hard whitening-like transformation are summed to obtain the orthogonal similarity hard distillation loss.

Equations (15) and (16) represent the calculation of the orthogonal similarity soft distillation loss $L_{soft}$ and the orthogonal similarity hard distillation loss $L_{hard}$, where $D^{*}$ and $D^{-}$ are the soft-transformed and hard-transformed difference similarity matrices respectively and $\lceil kN \rceil$ is the number of singular values in the top $k$ percent. In the experiments, the best range for $k$ is 5%-15%.

$$L_{soft} = \|D^{*}\|_{F}^{2} = \sum_{i \leq \lceil kN \rceil} \log_b\!\left(\sigma_i^{2}\right) + \sum_{i > \lceil kN \rceil} \sigma_i^{2} \qquad (15)$$

$$L_{hard} = \|D^{-}\|_{F}^{2} = \sum_{i > \lceil kN \rceil} \sigma_i^{2} \qquad (16)$$
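A minimal PyTorch rendition of equations (15) and (16); the default log base and the ceiling convention for the top-k cut-off are assumptions:

```python
import math
import torch

def orthogonal_similarity_distill_loss(S, T, k=0.10, b=math.e, hard=False):
    """Orthogonal similarity distillation loss, eqs. (15)-(16).

    S, T: (N, N) student / teacher similarity matrices.
    k:    fraction of the largest singular values to attenuate.
    b:    log base for the soft attenuation (an assumed default).
    hard: if True, zero out the top-k singular values instead.
    """
    D = S - T.detach()                       # teacher carries no gradient
    sigma2 = torch.linalg.svdvals(D) ** 2    # squared singular values, descending
    m = math.ceil(k * sigma2.numel())        # number of attenuated values

    if hard:
        head = sigma2.new_zeros(())          # eq. (16): top-k set to 0
    else:
        head = (torch.log(sigma2[:m]) / math.log(b)).sum()  # eq. (15)
    tail = sigma2[m:].sum()                  # remaining values unchanged
    return head + tail

# loss_soft = orthogonal_similarity_distill_loss(S, T, k=0.10)
# loss_hard = orthogonal_similarity_distill_loss(S, T, k=0.10, hard=True)
```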
S3.4: calculating a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function;
although the orthogonal similarity distillation penalty obtained at S3.3 is better training than the ranking penalty commonly used for teletext matching, a better training result will be obtained if the two penalty functions can be combined to guide the training.
S3.4.1: calculating a ranking loss function of the student network model by using the similarity scoring matrix of the student network model obtained in the S3.2;
the student network similarity scoring matrix obtained in the step S3.2 is used to calculate Ranking Loss (Ranking Loss, also called triple Loss or Max Loss, belonging to contrast Loss) of the student network, and is used to fully dig guidance potential of hard-negative samples (hard-negative samples), separate the distance between the hard-negative sample pairs and the positive sample pairs by at least m intervals (margin), and finally perform Loss calculation only on the hard-negative samples (hard-negative samples) which do not meet the minimum interval requirement, so as to ensure discrimination and generalization.
Equation (17) represents the calculation of the ranking loss, where $m \in [0, 1]$ is the margin, $s_{ii}$ is the similarity of a positive pair, $s_{ik}$ and $s_{ki}$ are the similarities of hard negative pairs, and $\mathrm{ReLU}(\cdot) = \max\{0, \cdot\}$; the maximum over $k$ screens out the hardest negative among the negatives. In the experiments, $m = 0.2$.

$$L_{rank} = \sum_{i=1}^{N}\left[\max_{k \neq i} \mathrm{ReLU}\!\left(m - s_{ii} + s_{ik}\right) + \max_{k \neq i} \mathrm{ReLU}\!\left(m - s_{ii} + s_{ki}\right)\right] \qquad (17)$$
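A compact sketch of this hard-negative ranking loss, a standard max-hinge formulation consistent with equation (17):

```python
import torch

def ranking_loss(S, margin=0.2):
    """Max-hinge triplet loss with hardest negatives, eq. (17).

    S: (N, N) similarity matrix; diagonal entries are the positive pairs.
    """
    N = S.size(0)
    pos = S.diag().view(N, 1)                       # s_ii
    mask = torch.eye(N, dtype=torch.bool, device=S.device)

    cost_i2t = (margin - pos + S).clamp(min=0)      # uses s_ik (row negatives)
    cost_t2i = (margin - pos.t() + S).clamp(min=0)  # uses s_ki (column negatives)
    cost_i2t = cost_i2t.masked_fill(mask, 0)        # exclude the positives
    cost_t2i = cost_t2i.masked_fill(mask, 0)

    # keep only the hardest negative per row / per column
    return cost_i2t.max(dim=1).values.sum() + cost_t2i.max(dim=0).values.sum()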
S3.4.2: the ranking loss is scaled using a balance coefficient and summed with the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function as the joint loss function.
The ranking loss is scaled by a balance coefficient θ > 0 and summed with the orthogonal similarity distillation loss (soft or hard) to serve as the joint loss, so that the student network model simultaneously receives double guidance: the image-text contrastive labeling information of the data set and the unsupervised knowledge distillation information of the teacher network. This maximizes the performance improvement of the student network model.
Equation (18) represents the calculation of the joint loss function; in the experiments, $\theta = 1$.

$$L = L_{soft|hard} + \theta L_{rank} \qquad (18)$$
S3.5: performing two-stage training on the student network model based on the joint loss function;
two-stage training of student network models is divided into general training (training) and fine training (training).
In the general training phase, 30 epochs are trained with a learning rate of 2e-4. The learnable parameters of the student network's image encoder remain fixed throughout (no fine-tuning), while the learnable parameters of the student network's text encoder are fixed during the first 15 epochs and have their gradients enabled (fine-tuned) during the last 15 epochs, improving the student network model's performance as much as possible while avoiding training collapse of the text encoder. The gradients may also be enabled at an earlier epoch if it can be ensured that text encoder training does not collapse.
In the fine-tuning phase, another 15 epochs are trained with a learning rate of 2e-5. The gradients of all learnable parameters of the student network's image and text encoders are enabled (fine-tuned) throughout, so the student network model learns most fully and training performance reaches its highest level.
Both training phases use the back-propagation (BP) algorithm to compute the model gradients, and both use the Adaptive Moment Estimation (Adam) optimizer to perform gradient-descent parameter updates.
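A schematic of the two-stage schedule; the epoch boundaries and learning rates follow the numbers above, while the model attributes (image_encoder, text_encoder, similarity) and helper names are illustrative assumptions:

```python
import torch

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def run_epoch(student, teacher, loader, joint_loss_fn, opt):
    for images, texts in loader:
        S = student.similarity(images, texts)       # eq. (5)
        with torch.no_grad():
            T = teacher.similarity(images, texts)   # eq. (6), no teacher grads
        loss = joint_loss_fn(S, T)                  # eq. (18)
        opt.zero_grad()
        loss.backward()                             # back-propagation
        opt.step()                                  # Adam update

def train_two_stage(student, teacher, loader, joint_loss_fn):
    set_requires_grad(teacher, False)               # protect teacher knowledge
    set_requires_grad(student.image_encoder, False)
    set_requires_grad(student.text_encoder, False)

    # Stage 1: general training, 30 epochs, lr = 2e-4
    opt = torch.optim.Adam(student.parameters(), lr=2e-4)
    for epoch in range(30):
        if epoch == 15:                             # unfreeze the text encoder
            set_requires_grad(student.text_encoder, True)
        run_epoch(student, teacher, loader, joint_loss_fn, opt)

    # Stage 2: fine-tuning, 15 epochs, lr = 2e-5, everything trainable
    set_requires_grad(student.image_encoder, True)
    opt = torch.optim.Adam(student.parameters(), lr=2e-5)
    for epoch in range(15):
        run_epoch(student, teacher, loader, joint_loss_fn, opt)
```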
S3.6: performing performance evaluation on the trained student network model by adopting a graph-text matching verification set, and if the evaluation result obtains new optimal precision, keeping a parameter file of the current student network model; otherwise, not storing; and if the training times reach the maximum and the new optimal precision is not obtained during verification, quitting the training.
After training for a certain number of iterations (1 forward plus 1 backward pass is 1 iteration), the performance of the student network model needs to be evaluated on the verification set data, for three purposes: checking (and logging) the model's verification-set performance, deciding whether to save (or update) the model's learnable-parameter file, and deciding whether to stop training early (early stopping).
The forward propagation of the student network during verification is the same as S3.2 and yields a similarity scoring matrix, with 2 differences: first, the gradients of all learnable parameters of the student network are turned off to reduce unnecessary GPU memory consumption; second, the model's operating mode is switched from training mode to evaluation mode, eliminating the nondeterministic forward-propagation behavior of certain special neural network layers (e.g., BatchNormalization and Dropout) in training mode.
Then the similarity scoring matrix is used to calculate the performance evaluation indices, including the top-1/5/10 recall (Recall@1/5/10) for each retrieval direction, image-to-text (i2t) and text-to-image (t2i), as well as the mean rank (meanr), median rank (medr) and recall sum (rsum), which are recorded into a log file.
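For illustration, a sketch of computing Recall@K from the similarity scoring matrix, assuming the ground-truth pairs lie on the diagonal (i.e., the i-th image matches the i-th text; public data sets with 5 captions per image need the usual adjustment):

```python
import torch

def recall_at_k(S, ks=(1, 5, 10)):
    """Recall@K for both retrieval directions of an (N, N) similarity
    matrix whose diagonal holds the ground-truth pairs."""
    N = S.size(0)
    gt = torch.arange(N, device=S.device)

    # rank of the ground-truth item in each row (i2t) and column (t2i)
    rank_i2t = (S.argsort(dim=1, descending=True) == gt.view(N, 1)) \
        .float().argmax(dim=1)
    rank_t2i = (S.t().argsort(dim=1, descending=True) == gt.view(N, 1)) \
        .float().argmax(dim=1)

    out = {}
    for k in ks:
        out[f"i2t_R@{k}"] = (rank_i2t < k).float().mean().item() * 100
        out[f"t2i_R@{k}"] = (rank_t2i < k).float().mean().item() * 100
    out["rsum"] = sum(out.values())   # recall sum across all six entries
    return out
```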
Then, according to the quality of the verification-set evaluation result, it is decided whether to save the current student network's model parameter file and whether to continue training. If a new optimal precision (optimal rsum) is obtained, the current model parameter file is saved; otherwise it is not. If no new optimal precision is obtained for a certain number of consecutive verifications, stop early, quit training and execute S4; if the maximum number of training epochs is exceeded, quit training and execute S4; otherwise, continue training and repeat S3.
S4: performing performance test on the trained student network model to obtain a performance evaluation result of the image-text matching data set and the trained student network model;
the performance test of the student network model comprises a single model test and an integrated model test.
Single model testing: after the student network model training is completed, test set data needs to be loaded for evaluation. The forward propagation mode and the evaluation index of the student network during testing are the same as S3.6, and only the data set is replaced. And finally obtaining the performance evaluation result of the test set.
Integrated (ensemble) model testing: 2 different student network models are trained (for example, with different random seeds), the similarity scoring matrices obtained by forward propagation of the 2 student networks are averaged, and the test set is used for performance evaluation. The forward propagation mode and evaluation indices are the same as in single model testing.
S5: and inputting the image or text to be detected into the trained student network model, and outputting the text corresponding to the image or the image corresponding to the text.
Example two
An orthogonal similarity distillation-based image-text matching model compression and acceleration system is characterized by comprising:
a model building module configured to: acquire an image-text matching data set and construct a student network model and a teacher network model;
a pre-processing and data loading module configured to: preprocessing and data loading are carried out on the image-text matching data set;
a training module configured to: calculating a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculating a singular value based on the difference similarity matrix; constructing an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular value; calculating a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function; training a student network model based on a joint loss function;
a result diagnostic module configured to: performing performance test on the trained student network model to obtain a performance evaluation result of the image-text matching data set and the trained student network model;
an output module configured to: and inputting the image or text to be detected into the trained student network model, and outputting the text corresponding to the image or the image corresponding to the text.
It should be noted here that the model building module, the preprocessing and data loading module, the training module, the result diagnosis module and the output module correspond to steps S1 to S5 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
EXAMPLE III
The present embodiment provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps in the method for compressing and accelerating the orthogonal similarity distillation-based image-text matching model according to the first embodiment.
Example four
The embodiment provides a computer device, which includes a memory, a processor and a computer program stored in the memory and executable on the processor; the processor executes the program to implement the steps of the method for compressing and accelerating the orthogonal similarity distillation-based image-text matching model according to the first embodiment.
In order to further improve the inference efficiency of the image-text matching model obtained by the method, so that it can be efficiently deployed on a CPU platform, pre-computed embedding vectors can be extracted for the images and texts of the image-text matching data set.
The extraction method of the pre-computed embedding vectors is as follows: using the image-text matching student network model trained by the orthogonal similarity distillation of S3, the joint semantic embedding feature vectors of the images and texts of a given data set requiring image-text matching cross-modal retrieval are extracted by forward propagation and saved into a pre-computed embedding vector file (such as the .npy format).
Therefore, when cosine similarity comparison of the embedded vector is carried out again later, the precomputed embedded vector file can be directly loaded without carrying out forward propagation calculation of the model, retrieval time and storage cost are reduced, and deployment and inference efficiency of the CPU platform are improved.
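A sketch of this precompute-and-reload workflow; the file name and the model's embedding methods are illustrative assumptions:

```python
import numpy as np
import torch

@torch.no_grad()
def precompute_embeddings(student, loader, path="embeddings.npz"):
    """Extract the joint semantic embeddings once and save them to disk."""
    student.eval()
    img_embs, txt_embs = [], []
    for images, texts in loader:
        img_embs.append(student.embed_images(images).cpu().numpy())
        txt_embs.append(student.embed_texts(texts).cpu().numpy())
    np.savez(path, images=np.concatenate(img_embs),
                   texts=np.concatenate(txt_embs))

def retrieve(path="embeddings.npz"):
    """Later retrieval needs no model forward pass: load the vectors and
    compare cosine similarity directly (vectors are already unit-norm)."""
    data = np.load(path)
    return data["images"] @ data["texts"].T   # full similarity matrix
```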
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An orthogonal similarity distillation-based image-text matching model compression and acceleration method is characterized by comprising the following steps:
S1: acquiring an image-text matching data set, and constructing a student network model and a teacher network model;
S2: preprocessing the image-text matching data set and loading the data;
S3: calculating a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculating singular values based on the difference similarity matrix; constructing an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular values; calculating a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function; and training the student network model based on the joint loss function;
S4: performing a performance test on the trained student network model to obtain its performance evaluation result on the image-text matching data set;
S5: inputting an image or text to be retrieved into the trained student network model, and outputting the text corresponding to the image or the image corresponding to the text.
2. The method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model according to claim 1, wherein the step S1 comprises:
S1.1: acquiring an image-text matching data set;
S1.2: performing word segmentation on the text in the image-text matching data set with a word segmenter, and assigning a corresponding integer number to each word according to its order of appearance, so as to construct a bidirectional dictionary set;
S1.3: extracting the image features and text features of the image-text matching data set with an image encoder and a text encoder, respectively;
S1.4: constructing a student network model and a teacher network model, and transferring the knowledge of the teacher network model to the student network model, wherein both the student network model and the teacher network model comprise an image encoder and a text encoder.
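By way of a non-limiting illustration of S1.2, the bidirectional dictionary set can be built by numbering words in order of first appearance and keeping both the word-to-number and number-to-word mappings. Whitespace tokenization stands in here for the word segmenter, which the claim does not name.

```python
# Hedged sketch of S1.2: integer numbers are assigned by order of first
# appearance; both directions of the mapping are stored.
def build_bidirectional_dict(sentences):
    word2id, id2word = {}, {}
    for sent in sentences:
        for word in sent.lower().split():       # assumed stand-in tokenizer
            if word not in word2id:
                idx = len(word2id)
                word2id[word] = idx
                id2word[idx] = word
    return word2id, id2word

word2id, id2word = build_bidirectional_dict(["a dog runs", "a cat sleeps"])
assert id2word[word2id["dog"]] == "dog"         # the mapping is invertible
```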
3. The method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model according to claim 2, wherein the step S2 comprises:
S2.1: preprocessing, according to task needs, the images in the image-text matching data set, wherein the image preprocessing at least comprises one of: normalization, scaling, random cropping, and random flipping;
S2.2: preprocessing, according to task needs, the text in the image-text matching data set, wherein the text preprocessing at least comprises one of the following: splitting each sentence into single words with a word segmenter and using the bidirectional dictionary set of S1.2 to map each word from a character string to an integer number, which is then mapped to a one-hot code; zero-padding sentences of insufficient length; or arranging sentences in descending order of length;
S2.3: splitting, shuffling and batching the image-text matching data set to complete the loading of the image-text matching training set, verification set and test set.
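A hedged sketch of the text preprocessing of S2.2 follows; the maximum length, the array layout and the whitespace tokenizer are illustrative assumptions rather than requirements of the claim.

```python
# Hedged sketch of S2.2: strings -> integer numbers via the S1.2 dictionary,
# zero padding for short sentences, and descending sort by sentence length.
import numpy as np

def load_text_batch(sentences, word2id, max_len=20):
    ids = [[word2id[w] for w in s.lower().split() if w in word2id]
           for s in sentences]
    ids.sort(key=len, reverse=True)             # descending order of length
    lengths = [min(len(seq), max_len) for seq in ids]
    batch = np.zeros((len(ids), max_len), dtype=np.int64)  # zero padding
    for i, seq in enumerate(ids):
        batch[i, :lengths[i]] = seq[:max_len]
    return batch, lengths

# If a downstream encoder expects one-hot codes instead of an embedding
# lookup, they can be obtained as: onehot = np.eye(len(word2id))[batch]
```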
4. The method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model according to claim 3, wherein S3 comprises:
S3.1: acquiring a batch from the image-text matching training set;
S3.2: running forward propagation through the student network model and the teacher network model to obtain their respective similarity scoring matrices;
S3.3: calculating an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the similarity matrix of the student network and the similarity matrix of the teacher network;
S3.4: calculating a joint loss function based on the orthogonal similarity soft distillation loss function and the orthogonal similarity hard distillation loss function;
S3.5: performing two-stage training on the student network model based on the joint loss function;
S3.6: performing performance evaluation on the trained student network model with the image-text matching verification set; if the evaluation achieves a new optimal precision, saving the parameter file of the current student network model, and otherwise not saving it; and if the maximum number of training iterations is reached without a new optimal precision being obtained during verification, terminating the training.
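The outer loop of S3.5-S3.6 can be pictured as follows. The similarity() interface, the optimizer, the hyperparameters and the evaluate_fn hook are all assumptions; the two-stage schedule of S3.5 would wrap this loop and is not shown.

```python
# Hedged sketch of S3.5-S3.6: train on the joint loss, validate every epoch,
# keep the parameter file only on a new best score.
import torch

def train_student(student, teacher, loaders, joint_loss_fn, evaluate_fn,
                  max_epochs=30, ckpt="student_best.pth"):
    opt = torch.optim.Adam(student.parameters(), lr=2e-4)
    best = float("-inf")
    for epoch in range(max_epochs):
        student.train()
        for images, texts, lengths in loaders["train"]:
            s_sim = student.similarity(images, texts, lengths)       # student scores
            with torch.no_grad():
                t_sim = teacher.similarity(images, texts, lengths)   # teacher scores
            loss = joint_loss_fn(s_sim, t_sim)
            opt.zero_grad()
            loss.backward()
            opt.step()
        score = evaluate_fn(student, loaders["val"])  # e.g. summed recall@K
        if score > best:                              # new optimal precision
            best = score
            torch.save(student.state_dict(), ckpt)    # keep current parameter file
    return best
```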
5. The method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model according to claim 4, wherein S3.2 comprises:
S3.2.1: extracting the feature vector of each image and each sentence from a batch of images and texts with the image encoder and the text encoder, respectively;
S3.2.2: embedding the image and text feature vectors into the joint semantic space with the respective fully connected layers of the student network and the teacher network, and normalizing them to obtain joint semantic embedding vectors;
S3.2.3: performing similarity matching, using cosine similarity, on the image and text embedding vectors of the student network model and the teacher network model in the current batch, to obtain the respective cosine similarity scoring matrices of the student network model and the teacher network model.
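S3.2.2-S3.2.3 amount to a fully connected projection, L2 normalization, and a matrix product; once both sides are unit-normalized, the matrix product is exactly the cosine similarity scoring matrix. A sketch with assumed feature and embedding dimensions:

```python
# Hedged sketch of S3.2.2-S3.2.3: project into the joint semantic space,
# normalize, and score all image-text pairs of the batch at once.
import torch
import torch.nn.functional as F

def similarity_matrix(img_feat, txt_feat, img_fc, txt_fc):
    img_emb = F.normalize(img_fc(img_feat), dim=-1)   # joint semantic embedding
    txt_emb = F.normalize(txt_fc(txt_feat), dim=-1)
    return img_emb @ txt_emb.t()                      # (B, B) cosine scoring matrix

img_fc = torch.nn.Linear(2048, 512)   # assumed feature / embedding sizes
txt_fc = torch.nn.Linear(1024, 512)
S = similarity_matrix(torch.randn(8, 2048), torch.randn(8, 1024), img_fc, txt_fc)
```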
6. The method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model according to claim 4, wherein S3.3 comprises:
S3.3.1: subtracting the similarity matrix of the teacher network model from the similarity matrix of the student network model element by element to obtain a difference similarity matrix;
S3.3.2: multiplying the transpose of the difference similarity matrix by the difference similarity matrix to obtain a positive semi-definite similarity matrix;
S3.3.3: performing SVD on the difference similarity matrix to obtain its singular values, each of which equals the square root of the corresponding singular value obtained by SVD of the positive semi-definite similarity matrix;
S3.3.4: performing equalization and a whitening-like transformation in the squared sense on the singular values obtained by SVD of the difference similarity matrix, wherein the whitening-like transformation comprises: a soft whitening-like transformation and a hard whitening-like transformation;
S3.3.5: summing the squares of the singular values obtained by the soft whitening-like transformation to obtain the orthogonal similarity soft distillation loss; and summing the squares of the singular values obtained by the hard whitening-like transformation to obtain the orthogonal similarity hard distillation loss.
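The computation in S3.3.1-S3.3.5 can be sketched compactly: the loss is a sum of squared (transformed) singular values of the student-teacher difference matrix. Because the exact soft and hard whitening-like transformations are not spelled out in this claim, they are represented by a caller-supplied function; with the identity transform the loss reduces to the squared Frobenius norm of the difference.

```python
# Hedged sketch of S3.3: singular values of the difference similarity matrix
# drive the distillation loss. The whitening-like transform is a placeholder.
import torch

def orthogonal_similarity_distill_loss(s_sim, t_sim, whiten=lambda sv: sv):
    diff = s_sim - t_sim                    # element-wise difference (S3.3.1)
    sv = torch.linalg.svdvals(diff)         # singular values via SVD (S3.3.3)
    # Equivalently, sv**2 are the eigenvalues of diff.t() @ diff, the
    # positive semi-definite matrix of S3.3.2.
    return (whiten(sv) ** 2).sum()          # sum of squared values (S3.3.5)

loss = orthogonal_similarity_distill_loss(torch.rand(8, 8), torch.rand(8, 8))
```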
7. The method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model according to claim 4, wherein S3.4 comprises:
S3.4.1: calculating a ranking loss function of the student network model by using the similarity scoring matrix of the student network model obtained in S3.2;
S3.4.2: scaling the ranking loss with a balance coefficient and summing it with the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function to form the joint loss function.
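Since the claim does not define the ranking loss itself, the bidirectional hinge-based triplet ranking loss commonly used in image-text matching (in the style of VSE++) is assumed in the sketch below; the balance coefficient then weights it against the chosen distillation loss.

```python
# Hedged sketch of S3.4: an assumed VSE++-style bidirectional ranking loss on
# the student scoring matrix, combined with the distillation loss.
import torch

def ranking_loss(sim, margin=0.2):
    diag = sim.diag().view(-1, 1)                    # matched-pair scores
    cost_t = (margin + sim - diag).clamp(min=0)      # image -> text direction
    cost_i = (margin + sim - diag.t()).clamp(min=0)  # text -> image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()

def joint_loss(s_sim, t_sim, distill_fn, balance=1.0):
    # balance scales the ranking term against the soft or hard distillation term
    return balance * ranking_loss(s_sim) + distill_fn(s_sim, t_sim)
```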
8. An orthogonal similarity distillation-based image-text matching model compression and acceleration system is characterized by comprising:
a model building module configured to acquire an image-text matching data set and construct a student network model and a teacher network model;
a preprocessing and data loading module configured to preprocess the image-text matching data set and load the data;
a training module configured to: calculate a difference similarity matrix based on the similarity matrix of the student network model and the similarity matrix of the teacher network model; calculate singular values based on the difference similarity matrix; construct an orthogonal similarity soft distillation loss function and an orthogonal similarity hard distillation loss function based on the singular values; calculate a joint loss function based on the orthogonal similarity soft distillation loss function or the orthogonal similarity hard distillation loss function; and train the student network model based on the joint loss function;
a result diagnosis module configured to perform a performance test on the trained student network model to obtain its performance evaluation result on the image-text matching data set;
an output module configured to input an image or text to be retrieved into the trained student network model and output the text corresponding to the image or the image corresponding to the text.
9. A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, carries out the steps of the method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model as claimed in any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for compressing and accelerating an orthogonal similarity distillation-based image-text matching model as claimed in any one of claims 1 to 7.