CN115544277A - Rapid knowledge graph embedded model compression method based on iterative distillation - Google Patents

Rapid knowledge graph embedded model compression method based on iterative distillation Download PDF

Info

Publication number
CN115544277A
CN115544277A CN202211535321.2A CN202211535321A CN115544277A CN 115544277 A CN115544277 A CN 115544277A CN 202211535321 A CN202211535321 A CN 202211535321A CN 115544277 A CN115544277 A CN 115544277A
Authority
CN
China
Prior art keywords
model
distillation
teacher
knowledge
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211535321.2A
Other languages
Chinese (zh)
Inventor
Wang Peng
Liu Jiajun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202211535321.2A
Publication of CN115544277A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/027 Frames

Abstract

A rapid knowledge graph embedding model compression method based on iterative distillation comprises: 1) pre-training a high-dimensional teacher knowledge graph embedding model; 2) soft-label-weight adaptive distillation; 3) iterative distillation; 4) prediction with the low-dimensional student knowledge graph embedding model. The method preserves the excellent performance of the knowledge graph embedding model after distillation and compression while maintaining inference speed, reduces training time by 50%, has the advantage of fast training, and can meet the requirement of rapidly updating large-scale knowledge graph embedding models in practical applications.

Description

Rapid knowledge graph embedded model compression method based on iterative distillation
Technical Field
The invention belongs to the field of artificial intelligence knowledge graphs, and particularly relates to a rapid knowledge graph embedding model compression method based on iterative distillation.
Background
A knowledge graph is a structure that describes concepts and facts using a graph model, in which knowledge is stored in the form of triples. A knowledge graph embedding model aims to embed the triples of a knowledge graph into a continuous vector space. With the rapid growth of knowledge graph scale, efficient knowledge graph embedding models play a key role in downstream applications such as knowledge-based question answering, recommendation systems and knowledge graph completion. Most knowledge graph embedding models perform better as the embedding dimension increases; however, in practical applications higher-dimensional models tend to have longer inference times and require more storage. Specifically, a 512-dimensional model has 7-15 times more parameters and 2-6 times longer inference time than a 32-dimensional model. How to compress a high-dimensional knowledge graph embedding model into a low-dimensional one is therefore an important problem. Knowledge distillation is a common model compression method: a larger model serves as the teacher model, a smaller model serves as the student model, and the student model is trained to imitate the output of the teacher model. Although several knowledge-distillation-based compression methods for knowledge graph embedding models have recently been proposed, they still suffer from three drawbacks. First, existing work distills a high-dimensional teacher model directly into a low-dimensional student model, but a better teacher cannot always teach a better student. Previous studies have shown that too large a gap between the teacher and student models harms distillation performance. A similar phenomenon exists for knowledge-distillation-based knowledge graph embedding models, where the gap manifests as the difference in embedding dimensions. Specifically, when a low-dimensional student model is distilled directly from a high-dimensional teacher model, the excessive dimension gap makes it difficult for a better teacher to teach a better student: the dimension gap implies a performance gap, which further leads to significant differences between the output distributions of the teacher and student models, making the student model hard to train. Second, another significant challenge of existing methods is that the difference in optimization direction between the distillation objective and the original task objective makes the model difficult to converge. Third, current work mainly focuses on improving inference efficiency but suffers from time-consuming training. In particular, existing distillation methods for knowledge graph embedding models tend to use a multi-teacher distillation framework or to train the teacher and student models together, which makes the training time several times longer than direct distillation.
Chinese patent application CN202111152202.4, entitled "Knowledge graph embedding compression method based on knowledge graph distillation", fully captures the triple information and embedding structure information in a high-dimensional knowledge graph embedding model (teacher model) and distills them into a low-dimensional knowledge graph embedding model (student model). It improves the expressive ability of the student model while preserving its storage and inference efficiency, considers the mutual influence between the teacher model and the student model during distillation, proposes a soft label evaluation mechanism to distinguish the quality of the soft labels of different triples, and proposes a training scheme that first fixes the teacher model and then releases it, so as to improve the adaptability of the student model to the teacher model and ultimately improve the performance of the student model.
However, it differs from the present invention in the following respects:
1) The emphasis of the problem being solved is different: the "knowledge graph embedding compression method based on knowledge graph distillation" focuses on the mutual influence between the teacher model and the student model. The present invention aims to solve the degradation of distillation caused by an excessive embedding-dimension gap between the teacher model and the student model, and additionally focuses on training efficiency.
2) The method of solving the problem is different: the "knowledge graph embedding compression method based on knowledge graph distillation" proposes a two-stage distillation method that directly distills a high-dimensional teacher model into a low-dimensional student model. The present invention proposes a rapid knowledge graph embedding method based on iterative distillation, which gradually reduces the dimension of the teacher model so that knowledge can be transferred smoothly, finally obtaining a better-performing student model and solving the problem of an excessive dimension gap between the teacher model and the student model.
3) The efficiency of the solution is different: to accelerate the training process and further address the inconsistent optimization directions of the hard labels and soft labels during distillation, the present invention proposes a dynamic soft-label-weight adjustment mechanism that adjusts the weight between the soft labels and the hard labels according to the loss during training, giving a large advantage in training time.
The invention provides a novel iterative distillation method for knowledge graph embedding models. Unlike methods that directly distill a low-dimensional student model from a high-dimensional teacher model, the iterative distillation method gradually reduces the model size so as to reduce the dimension gap between the teacher and student models of each distillation. Specifically, a trained high-dimensional knowledge graph embedding model is used as the teacher model and is compressed into a smaller student model at a specific compression rate by knowledge distillation. This process is then iterated: the student model generated in the previous iteration is used as the teacher model of the next iteration to guide the training of a lower-dimensional student model. Iteration stops when the dimension of the resulting student model reaches the required dimension. The iterative distillation method reduces the excessive dimension gap between the student model and the teacher model; in other words, it allows knowledge to be transferred smoothly from the high-dimensional teacher model to the low-dimensional student model and allows the student model of the last iteration to better inherit the performance of the teacher model of the first iteration. To obtain better distillation performance, the invention further provides a novel soft-label-weight adaptive adjustment method. In the early stage of training, the hard label loss is given a larger weight and the soft label loss a smaller weight, so the hard labels dominate the optimization of the model. As training proceeds, the weight of the soft label loss is gradually increased, so that the soft label loss dominates the optimization of the model in the later stage of training. This soft-label-weight adaptive method solves the problem that the model cannot converge well because the optimization directions of the hard label loss and the soft label loss differ. To further speed up the training process, the invention uses a single teacher for distillation and fixes the parameters of the teacher model during distillation.
Disclosure of Invention
In order to solve the above problems, the invention provides a rapid knowledge graph embedding model compression method based on iterative distillation. The invention provides a soft-label-weight adaptive distillation mechanism: while the teacher model guides the training of the student model, the weight of the soft label loss is gradually increased according to the change of the distillation loss, which resolves the inconsistency between the optimization direction of the hard label loss and that of the soft label loss and improves training efficiency. Furthermore, the invention provides an iterative distillation framework in which a knowledge graph embedding model alternately serves as student model and teacher model during the iterative distillation process: the student model generated in the previous iteration becomes the teacher model of the next iteration and guides the training of a lower-dimensional student model. Knowledge can thus migrate smoothly between the high-dimensional teacher model and the low-dimensional student model while the student models retain good performance. Finally, to further speed up the training process, the invention uses a single teacher for distillation and fixes the parameters of the teacher model during distillation. The method preserves the excellent performance of the knowledge graph embedding model after distillation and compression while maintaining inference speed, reduces training time by 50%, has the advantage of fast training, and can meet the requirement of rapidly updating large-scale knowledge graph embedding models in the real world.
In order to achieve the purpose, the invention adopts the technical scheme that:
a rapid knowledge graph embedding model compression method based on iterative distillation comprises the following specific steps:
1) Pre-training a knowledge graph embedding model of a high-dimensional teacher;
training a high-embedding dimension teacher model to prepare for guiding a low-embedding dimension student model;
2) Self-adaptive distillation of soft label weight;
providing a soft label weight self-adaptive distillation mechanism, and gradually increasing the weight of soft label loss according to the change of distillation loss to solve the problem that the optimization direction of hard label loss is inconsistent with the optimization direction of soft label loss in the process of guiding the training of a student model by a teacher model;
3) Carrying out iterative distillation;
providing an iterative distillation frame, enabling a knowledge graph embedded model to alternately become a student model and a teacher model in an iterative distillation process, accelerating a training process, using a single teacher to distill, and fixing parameters of the teacher model in the distillation process;
4) And (5) embedding the low-dimensional student knowledge graph into a model for prediction.
As a further improvement of the invention, the process of training the high-embedding-dimension teacher model in step 1) is as follows:
first, given a set of entities E and relations R, a knowledge graph G is represented as a set of triples (h, r, t), where h, r and t denote the head entity, the relation and the tail entity; the knowledge graph embedding model takes the triples formed by head entities, relations and tail entities in the knowledge graph as positive triples (h, r, t), and randomly replaces the head entity or tail entity of a positive triple to construct negative triples (h', r, t');
the knowledge graph embedding model then embeds each triple as vectors and uses a scoring function S to compute a score for each triple's vector representation;
different knowledge graph embedding models have different scoring functions; after the score of each triple is obtained, the loss function adopts the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{KGE}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function;
after the training of the high-dimensional teacher model is complete, the trained high-dimensional teacher model is saved.
As a further improvement of the invention, the specific steps of step 2) are as follows:
given a triple (h, r, t), it is first input into the teacher model and the student model simultaneously and encoded by each of them; the score produced by the teacher model's scoring function is defined as S_T(h, r, t) and the score produced by the student model's scoring function as S_S(h, r, t);
the hard label loss during distillation is the original loss of the student model, defined as follows:

$$\mathcal{L}_{\mathrm{hard}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function; the soft label loss uses the Huber loss to measure the difference between the output distributions of the teacher model and the student model, defined as follows:

$$\mathcal{L}_{\mathrm{soft}} = \mathrm{Huber}\big(S_T(h,r,t),\, S_S(h,r,t)\big)$$

finally, the total distillation loss is the weighted sum of the hard label loss and the soft label loss:

$$\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{hard}} + \lambda\, \mathcal{L}_{\mathrm{soft}}$$

where λ is the soft label weight that balances the soft label loss and the hard label loss.
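A minimal sketch of how the hard label loss, the Huber soft label loss and the soft label weight might be combined, assuming the weighting form L_hard + λ·L_soft reconstructed above and PyTorch's built-in Huber loss; the function name distillation_loss is illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores, labels, lam):
    """Weighted sum of the student's hard label loss and a Huber soft label loss.

    student_scores, teacher_scores: triple scores from the two models;
    labels: 1 for positive triples, 0 for negatives; lam: soft label weight.
    """
    hard = F.binary_cross_entropy_with_logits(student_scores, labels)
    # Soft label loss: Huber distance between the teacher and student scores.
    soft = F.huber_loss(student_scores, teacher_scores.detach())
    return hard + lam * soft

# Example: the teacher is frozen (detached), only the student receives gradients.
s = torch.randn(2048, requires_grad=True)   # student scores
t = torch.randn(2048)                       # teacher scores
y = (torch.rand(2048) > 0.5).float()        # hard labels
distillation_loss(s, t, y, lam=0.1).backward()
```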
As a further improvement of the invention, in the step 2), only the student model is trained in the distillation process of the soft label weight adaptive distillation mechanism, and the model parameters of the teacher model are fixed.
As a further improvement of the invention, in step 2) the soft label weight λ is dynamically adjusted during the distillation process of the soft-label-weight adaptive distillation mechanism, and the complete training process is divided into two stages;
in the first stage, the hard label loss dominates, and the soft label loss weight is assigned a small initial value and gradually increases;
in the second stage, the soft label weight is fixed;
let the total number of training rounds be M; the soft label weight of the m-th round is as follows:

$$\lambda_m = \begin{cases} \lambda_0 + k \cdot m, & m \le M/p \\ \lambda_{M/p}, & m > M/p \end{cases}$$

where the parameter k is adjusted dynamically during training so that the value of λ_m stays within (0, 1), the soft label time control parameter p controls the fraction of training over which the soft label weight is adjusted, and λ_0 is the initial soft label weight.
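Since the exact weight formula appears only as an image in the source, the following Python sketch shows one possible realization of the two-stage schedule described above, assuming a linear ramp up to round M/p followed by a fixed value; in the patent the slope parameter k is adjusted dynamically according to the training loss, which is not reproduced here.

```python
def soft_label_weight(m: int, M: int, lam0: float = 0.1,
                      lam_max: float = 0.9, p: float = 2.0) -> float:
    """Two-stage soft label weight schedule (illustrative reconstruction).

    Stage 1 (m <= M/p): the weight ramps linearly from lam0 towards lam_max.
    Stage 2 (m >  M/p): the weight stays fixed at lam_max.
    p controls what fraction of training is spent adjusting the weight.
    """
    cutoff = M / p
    k = (lam_max - lam0) / cutoff   # slope chosen so that lambda_m stays inside (0, 1)
    return lam0 + k * min(m, cutoff)

# Example: 1000 rounds, weight adjusted during the first half of training (p = 2).
print(soft_label_weight(0, 1000), soft_label_weight(500, 1000), soft_label_weight(999, 1000))
```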
As a further improvement of the invention, the specific steps of step 3) are as follows:
let the embedding dimension of the teacher model in the k-th iteration be d_t^k and the embedding dimension of the student model be d_s^k; the compression rate a of each iteration is defined as follows:

$$a = \frac{d_t^{k}}{d_s^{k}}$$

model compression is then performed in every iteration with this fixed compression rate a; in the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration is used as the teacher of the k-th iteration; the hard label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{hard}}^{k} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S^{k}(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S^{k}(h,r,t)\big)\big) \,\Big]$$

where S_S^k(h, r, t) is the score of the student model's scoring function in the k-th iteration; the soft label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{soft}}^{k} = \mathrm{Huber}\big(S_T^{k}(h,r,t),\, S_S^{k}(h,r,t)\big)$$

where S_T^k(h, r, t) is the score of the teacher model's scoring function in the k-th iteration; the total loss of the k-th round is defined as follows:

$$\mathcal{L}^{k} = \mathcal{L}_{\mathrm{hard}}^{k} + \lambda\, \mathcal{L}_{\mathrm{soft}}^{k}$$

let the total number of iterations be N; the final compression rate A is then defined as follows:

$$A = a^{N} = \frac{d_t^{1}}{d_s^{N}}$$

the final compression rate A is preset, so the condition for stopping iteration is that the student dimension d_s satisfies the following relationship:

$$d_s \le \frac{d_t^{1}}{A}$$
as a further improvement of the invention, the required student model dimension and the compression ratio of each iteration are preset during model compression in the step 3), and then the iteration times are determined by the teacher model dimension, the student model dimension and the compression ratio of each iteration.
As a further improvement of the invention, the specific steps of step 4) are as follows:
through step 3), a low-dimensional student model of final dimension d_s is generated, which is the final result of compressing the knowledge graph embedding model; after the distilled low-dimensional model is obtained, low-dimensional student model prediction can be carried out.
As a further improvement of the invention, the evaluation indices adopted for the low-dimensional student model prediction in step 4) are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1) and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is used directly to evaluate the speed of the model.
As a further improvement of the invention, in the prediction phase of the low-dimensional student model prediction in step 4), given a query q = (h, r, ?), where h and r denote the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation.
The specific prediction process is as follows: first, the head entity and the relation in the query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors;
then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring, the scores of all triples are sorted, and the index Hits@n is computed as follows:

$$Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(rank_i \le n\big)$$

where N is the number of all triples and the indicator function takes the value 1 when the condition rank_i ≤ n is satisfied and 0 otherwise.
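A sketch of how the link prediction metrics described above could be computed, assuming a model object with a score(h, r, t) method that returns higher scores for more plausible triples; all names and the toy scorer are illustrative.

```python
import torch

def link_prediction_metrics(model, queries, all_entities, true_tails):
    """Rank every candidate tail for each query (h, r, ?) and compute MRR / Hits@n."""
    ranks = []
    for (h, r), t_true in zip(queries, true_tails):
        h_batch = torch.full_like(all_entities, h)
        r_batch = torch.full_like(all_entities, r)
        scores = model.score(h_batch, r_batch, all_entities)     # score every candidate tail
        rank = int((scores > scores[t_true]).sum().item()) + 1   # 1-based rank of the true tail
        ranks.append(rank)
    ranks = torch.tensor(ranks, dtype=torch.float)
    return {
        "MRR": (1.0 / ranks).mean().item(),
        "Hits@1": (ranks <= 1).float().mean().item(),
        "Hits@10": (ranks <= 10).float().mean().item(),
    }

if __name__ == "__main__":
    class _ToyScorer:
        def score(self, h, r, t):
            # Toy scorer that prefers the tail id equal to (h + r) mod 5.
            return -(t - (h + r) % 5).abs().float()

    ents = torch.arange(5)
    print(link_prediction_metrics(_ToyScorer(), queries=[(0, 1), (2, 2)],
                                  all_entities=ents, true_tails=[1, 4]))
```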
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a rapid knowledge map embedded model compression method based on iterative distillation, which reduces the dimension difference between a teacher model and a student model of each distillation by the iterative distillation method so as to enable knowledge to be smoothly transferred from the teacher model to the student model. Meanwhile, the soft label weight self-adaptive mechanism solves the problem that training efficiency is influenced due to inconsistent optimization directions between hard label loss and soft label loss by dynamically adjusting the weight of the soft label. Furthermore, a single teacher distillation strategy is adopted in the distillation process, parameters of a teacher model are fixed, and the distillation speed is obviously improved. Verification is carried out on a link prediction task, and the rapid knowledge map embedded model compression method based on iterative distillation provided by the invention is proved to have better universality and can ensure high efficiency in practical application. Therefore, the invention has good application prospect and popularization range.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention;
FIG. 2 is a model flow diagram of the method of the present invention;
fig. 3 is a diagram of a training arrangement for the method of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a rapid knowledge map embedded model compression method based on iterative distillation, which reduces the dimension difference between a teacher model and a student model of each distillation by the iterative distillation method so as to enable knowledge to be smoothly transferred from the teacher model to the student model. Meanwhile, the soft label weight self-adaptive mechanism solves the problem that training efficiency is influenced due to inconsistent optimization directions between hard label loss and soft label loss by dynamically adjusting the weight of the soft label. Furthermore, a single teacher distillation strategy is adopted in the distillation process, parameters of a teacher model are fixed, and the distillation speed is obviously improved. Verification is carried out on a link prediction task, and the rapid knowledge map embedded model compression method based on iterative distillation provided by the invention is proved to have better universality and can ensure high efficiency in practical application. Therefore, the invention has good application prospect and popularization range.
As a specific embodiment of the present invention, a logic flow chart is shown in fig. 1, a model flow chart in fig. 2 and a training configuration chart in fig. 3. The rapid knowledge graph embedding model compression method based on iterative distillation comprises the following steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model.
Knowledge graph embedding means that, given the set of triples of a knowledge graph, all entities and relations in it are embedded as vectors in a continuous space. When training the high-dimensional teacher model, a knowledge graph embedding model with a higher embedding dimension is selected for pre-training. A higher-dimensional model is chosen as the teacher model because, for most knowledge graph embedding models, the higher the embedding dimension, the stronger the expressive ability of the model.
The process of training the high-dimensional teacher model is as follows. First, given a set of entities E and relations R, a knowledge graph G can be represented as a set of triples. A triple is generally written (h, r, t), denoting the head entity, the relation and the tail entity. The knowledge graph embedding model takes the original triples formed by head entities, relations and tail entities in the knowledge graph as positive triples (h, r, t), and randomly replaces the head entity or tail entity of a positive triple to construct negative triples (h', r, t'). The knowledge graph embedding model then embeds each triple as vectors and uses a scoring function S to compute a score for each triple's vector representation. Different knowledge graph embedding models have different scoring functions. After the score of each triple is obtained, the loss function adopts the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{KGE}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function. After the training of the high-dimensional teacher model is complete, the trained high-dimensional teacher model is saved. By training the high-dimensional teacher model, a knowledge graph embedding model with better performance is obtained for distilling the low-dimensional knowledge graph embedding model.
2) Soft label weight adaptive distillation.
The soft-label-weight adaptive distillation uses the high-dimensional knowledge graph embedding model trained in step 1) as the teacher model; its purpose is to let the low-dimensional student model learn knowledge by imitating the output of the high-dimensional teacher model, so that the student model reaches performance matching that of the teacher model.
In the distillation process of traditional knowledge graph embedding models, the inconsistent optimization directions of the soft label loss and the hard label loss can make the model difficult to converge during training. To solve this problem, the soft-label-weight adaptive distillation dynamically adjusts the weight of the soft labels during distillation, changing it according to the change of the total loss: in the early stage of training the soft labels have a small weight, so the hard labels dominate the training of the model; as training proceeds, the soft label weight is gradually increased, so that the optimization of the student model gradually shifts to the direction dominated by the soft labels. Through this strategy, the model converges in a better direction and the convergence speed of the model is accelerated.
Specifically, given a triple (h, r, t), it is first input into the teacher model and the student model simultaneously and encoded by each of them. The score produced by the teacher model's scoring function is defined as S_T(h, r, t) and the score produced by the student model's scoring function as S_S(h, r, t). The hard label loss during distillation is the original loss of the student model, defined as follows:

$$\mathcal{L}_{\mathrm{hard}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function. The soft label loss uses the Huber loss to measure the difference between the output distributions of the teacher model and the student model, defined as follows:

$$\mathcal{L}_{\mathrm{soft}} = \mathrm{Huber}\big(S_T(h,r,t),\, S_S(h,r,t)\big)$$

Finally, the total distillation loss is the weighted sum of the hard label loss and the soft label loss:

$$\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{hard}} + \lambda\, \mathcal{L}_{\mathrm{soft}}$$

where λ is the soft label weight that balances the soft label loss and the hard label loss. Note that only the student model is trained during distillation; the model parameters of the teacher model are fixed.
To further balance the hard label loss and the soft label loss during distillation, the soft label weight λ is adjusted dynamically. At the beginning of training, the weight of the soft label loss should be as small as possible; as training proceeds, the soft label weight is continuously increased and finally fixed. Specifically, the complete training process is divided into two stages. In the first stage, the hard label loss dominates; the soft label loss weight is assigned a small initial value and gradually increases. In the second stage, the soft label weight is fixed. Let the total number of training rounds be M; the soft label weight of the m-th round is as follows:

$$\lambda_m = \begin{cases} \lambda_0 + k \cdot m, & m \le M/p \\ \lambda_{M/p}, & m > M/p \end{cases}$$

where the parameter k is adjusted dynamically during training so that the value of λ_m stays within (0, 1), the soft label time control parameter p controls the fraction of training over which the soft label weight is adjusted, and λ_0 is the initial soft label weight. The soft-label-weight adaptive distillation solves the problem that the model is difficult to converge in the early stage of training because the optimization directions of the hard label loss and the soft label loss differ.
3) Iterative distillation.
Iterative distillation repeats the operations of step 1) and step 2), using the student model generated in the previous iteration as the teacher model of the next iteration, which reduces the dimension gap between the student model and the teacher model in each distillation so that knowledge can be transferred more smoothly. Previous methods distill a low-dimensional student model directly from a high-dimensional teacher model; however, when distilling knowledge graph embedding models, a better-performing teacher model does not necessarily teach a better-performing student model, because the embedding-dimension gap between the teacher and student models is too large, the gap between their outputs is too large, and it is difficult to transfer knowledge directly from the teacher model to the student model. The invention proposes an iterative distillation strategy that does not distill the original teacher model directly into the final student model but gradually reduces the embedding dimension of the model for distillation, thereby reducing the dimension gap between the teacher and student models in each distillation and allowing knowledge to be transferred smoothly from the teacher model to the student model. Through this iterative distillation method, the finally generated student model inherits the performance of the original teacher model well.
Specifically, let the embedding dimension of the teacher model in the k-th iteration be d_t^k and the embedding dimension of the student model be d_s^k. The compression rate a of each iteration is defined as follows:

$$a = \frac{d_t^{k}}{d_s^{k}}$$

Model compression is then performed in every iteration with this fixed compression rate a. In the first iteration, the first student model is trained using the pre-trained teacher model. In the k-th iteration, the student generated in the (k-1)-th iteration is used as the teacher of the k-th iteration. The hard label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{hard}}^{k} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S^{k}(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S^{k}(h,r,t)\big)\big) \,\Big]$$

where S_S^k(h, r, t) is the score of the student model's scoring function in the k-th iteration. The soft label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{soft}}^{k} = \mathrm{Huber}\big(S_T^{k}(h,r,t),\, S_S^{k}(h,r,t)\big)$$

where S_T^k(h, r, t) is the score of the teacher model's scoring function in the k-th iteration. The total loss of the k-th round is defined as follows:

$$\mathcal{L}^{k} = \mathcal{L}_{\mathrm{hard}}^{k} + \lambda\, \mathcal{L}_{\mathrm{soft}}^{k}$$

Let the total number of iterations be N; the final compression rate A is then defined as follows:

$$A = a^{N} = \frac{d_t^{1}}{d_s^{N}}$$

The final compression rate A is preset, so the condition for stopping iteration is that the student dimension d_s satisfies the following relationship:

$$d_s \le \frac{d_t^{1}}{A}$$

In practical applications, the required student model dimension and the per-iteration compression rate are preset when compressing the model; the number of iterations is then determined by the teacher model dimension, the student model dimension and the per-iteration compression rate.
4) Prediction with the low-dimensional student knowledge graph embedding model.
Through step 3), a low-dimensional student model of final dimension d_s is generated, which is the final result of compressing the knowledge graph embedding model. After the distilled low-dimensional model is obtained, low-dimensional student model prediction can be carried out. The quality of a knowledge graph embedding model is generally evaluated by its performance on the link prediction task; the evaluation indices adopted are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1) and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is used directly to evaluate the speed of the model. Specifically, in the prediction phase, given a query q = (h, r, ?), where h and r denote the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation. The specific prediction process is as follows. First, the head entity and the relation in the query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors. Then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring. The scores of all triples are sorted and the index Hits@n is computed as follows:

$$Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(rank_i \le n\big)$$

where N is the number of all triples and the indicator function takes the value 1 when the condition rank_i ≤ n is satisfied and 0 otherwise.
[ example 1 ]
In this embodiment, the iterative-distillation-based rapid knowledge graph embedding model compression method is applied to the general-purpose datasets FB15K-237 and WN18RR for training and prediction; the other examples use the same data as this example. The FB15K-237 dataset contains 14541 entities and 237 relations, with 272115 triples in the training set, 17535 triples in the validation set and 20466 triples in the test set. The WN18RR dataset contains 40943 entities and 11 relations, with 86835 triples in the training set, 3034 triples in the validation set and 3134 triples in the test set. Because the model has good robustness and generalization, the same hyperparameter settings can be used across different general-purpose datasets.
The specific implementation is as follows: TransE, ComplEx, SimplE and RotatE are selected as the knowledge graph embedding models. The embedding dimension of the initial teacher model embedding layer is 512, and the embedding dimension of the final student model obtained by distillation is 32. The compression rate a of each distillation layer is set to 2, the number of iterations N is set to 4, and the hyperparameter p is set to 2. In the training phase, Adagrad is used as the optimizer with a learning rate of 0.1. The batch size of each training run is set to 1024 and the number of training rounds for each iteration is set to 1000. Because the stability of the model is good, at most 1000 training rounds are needed for convergence in each iteration, so the same batch and round settings can be adopted for different models and datasets.
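For reference, the hyperparameters of this embodiment can be collected into a single configuration sketch; the dictionary layout and key names are illustrative and not part of the patent.

```python
# Values taken from the text of this embodiment; keys are illustrative.
CONFIG = {
    "kge_models": ["TransE", "ComplEx", "SimplE", "RotatE"],
    "datasets": ["FB15K-237", "WN18RR"],
    "teacher_dim": 512,         # embedding dimension of the initial teacher model
    "student_dim": 32,          # embedding dimension of the final student model
    "compression_ratio": 2,     # per-iteration compression rate a
    "num_iterations": 4,        # N: 512 -> 256 -> 128 -> 64 -> 32
    "p": 2,                     # soft label time control parameter
    "optimizer": "Adagrad",
    "learning_rate": 0.1,
    "batch_size": 1024,
    "epochs_per_iteration": 1000,
}
```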
The effectiveness of the iterative-distillation-based rapid knowledge graph embedding model compression method is tested with different knowledge graph embedding models, and the final prediction results of the student model are compared with student models obtained by other distillation methods. The method achieves the highest performance on the three indices MRR, Hits@1 and Hits@10, which shows that the iterative-distillation-based rapid knowledge graph embedding model compression method can reach the optimal level in practical application scenarios.
[ example 2 ]
The iterative-distillation-based rapid knowledge graph embedding model compression method has a training advantage. First, distillation is performed by a single teacher, which greatly reduces training time compared with a multi-teacher distillation framework. Second, the teacher parameters are fixed during the distillation stage and a dynamic soft-label-weight adjustment mechanism is adopted to accelerate the training process, so the training time is reduced by 50% on average compared with the best existing method. The more complex the knowledge graph embedding model, the more significantly the training time is shortened; for example, for the more complex knowledge graph embedding model RotatE, the training time can be reduced by 63%.
In practical applications, real knowledge graphs often need to be updated, so existing knowledge graph embedding models need to be retrained each time a new entity appears. Compared with traditional distillation methods, the iterative-distillation-based rapid knowledge graph embedding model compression method therefore ensures that the knowledge graph embedding model is updated efficiently while greatly saving resource consumption.
[ example 3 ]
Compared with an undistilled model of the same dimension, the student model generated by each iteration of the iterative-distillation-based rapid knowledge graph embedding model compression method performs well. Specifically, a 512-dimensional knowledge graph embedding model is distilled by the iterative distillation method with the per-layer compression rate set to 2, producing 256-, 128-, 64- and 32-dimensional models in turn; testing each intermediate model against an undistilled model of the same dimension shows that the performance of the distilled model can be improved by 5%-90%, and the lower the embedding dimension of the model, the more obvious the performance improvement. This shows that the iterative distillation method is highly flexible: in practical applications, the dimension to distill to can be chosen according to actual requirements, and for different required dimensions the iterative distillation method can obtain a high-performance low-dimensional model.
[ example 4 ]
The compression rate of each iteration of the iterative-distillation-based rapid knowledge graph embedding model compression method is set to 2, which balances distillation performance and efficiency. If the per-iteration compression rate is increased, the required student model can be obtained with fewer iterations, but performance is affected to some extent. For example, setting the per-layer compression rate to 4 or 16 reduces the number of training iterations by 50% or 75%, but performance drops by 5% to 10%. It can be seen that the smaller the per-layer compression rate, the more remarkable the effect of iterative distillation. Meanwhile, thanks to the large training time advantage, different compression rates can be chosen flexibly according to requirements in practical applications.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims (10)

1. A rapid knowledge graph embedding model compression method based on iterative distillation, specifically comprising the following steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model:
training a teacher model with a high embedding dimension in preparation for guiding a student model with a low embedding dimension;
2) Soft-label-weight adaptive distillation:
providing a soft-label-weight adaptive distillation mechanism which, while the teacher model guides the training of the student model, gradually increases the weight of the soft label loss according to the change of the distillation loss, solving the inconsistency between the optimization directions of the hard label loss and the soft label loss;
3) Iterative distillation:
providing an iterative distillation framework in which the knowledge graph embedding model alternately serves as student model and teacher model during the iterative distillation process; to accelerate the training process, a single teacher is used for distillation and the parameters of the teacher model are fixed during distillation;
4) Prediction with the low-dimensional student knowledge graph embedding model.
2. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 1, wherein the process of training the high-embedding-dimension teacher model in step 1) is as follows:
first, given a set of entities E and relations R, a knowledge graph G is represented as a set of triples (h, r, t), where h, r and t denote the head entity, the relation and the tail entity; the knowledge graph embedding model takes the triples formed by head entities, relations and tail entities in the knowledge graph as positive triples (h, r, t), and randomly replaces the head entity or tail entity of a positive triple to construct negative triples (h', r, t');
the knowledge graph embedding model then embeds each triple as vectors and uses a scoring function S to compute a score for each triple's vector representation;
different knowledge graph embedding models have different scoring functions; after the score of each triple is obtained, the loss function adopts the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{KGE}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function;
after the training of the high-dimensional teacher model is complete, the trained high-dimensional teacher model is saved.
3. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 2, wherein the specific steps of step 2) are as follows:
given a triple (h, r, t), it is first input into the teacher model and the student model simultaneously and encoded by each of them; the score produced by the teacher model's scoring function is defined as S_T(h, r, t) and the score produced by the student model's scoring function as S_S(h, r, t);
the hard label loss during distillation is the original loss of the student model, defined as follows:

$$\mathcal{L}_{\mathrm{hard}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function; the soft label loss uses the Huber loss to measure the difference between the output distributions of the teacher model and the student model, defined as follows:

$$\mathcal{L}_{\mathrm{soft}} = \mathrm{Huber}\big(S_T(h,r,t),\, S_S(h,r,t)\big)$$

finally, the total distillation loss is the weighted sum of the hard label loss and the soft label loss:

$$\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{hard}} + \lambda\, \mathcal{L}_{\mathrm{soft}}$$

where λ is the soft label weight that balances the soft label loss and the hard label loss.
4. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 3, wherein: in step 2), only the student model is trained during the distillation process of the soft-label-weight adaptive distillation mechanism, and the model parameters of the teacher model are fixed.
5. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 3, wherein: in step 2), the soft label weight λ is dynamically adjusted during the distillation process of the soft-label-weight adaptive distillation mechanism, and the complete training process is divided into two stages;
in the first stage, the hard label loss dominates, and the soft label loss weight is assigned a small initial value and gradually increases;
in the second stage, the soft label weight is fixed;
let the total number of training rounds be M; the soft label weight of the m-th round is as follows:

$$\lambda_m = \begin{cases} \lambda_0 + k \cdot m, & m \le M/p \\ \lambda_{M/p}, & m > M/p \end{cases}$$

where the parameter k is adjusted dynamically during training so that the value of λ_m stays within (0, 1), the soft label time control parameter p controls the fraction of training over which the soft label weight is adjusted, and λ_0 is the initial soft label weight.
6. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 3, wherein the specific steps of step 3) are as follows:
let the embedding dimension of the teacher model in the k-th iteration be d_t^k and the embedding dimension of the student model be d_s^k; the compression rate a of each iteration is defined as follows:

$$a = \frac{d_t^{k}}{d_s^{k}}$$

model compression is then performed in every iteration with this fixed compression rate a; in the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration is used as the teacher of the k-th iteration; the hard label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{hard}}^{k} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S^{k}(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S^{k}(h,r,t)\big)\big) \,\Big]$$

where S_S^k(h, r, t) is the score of the student model's scoring function in the k-th iteration; the soft label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{soft}}^{k} = \mathrm{Huber}\big(S_T^{k}(h,r,t),\, S_S^{k}(h,r,t)\big)$$

where S_T^k(h, r, t) is the score of the teacher model's scoring function in the k-th iteration; the total loss of the k-th round is defined as follows:

$$\mathcal{L}^{k} = \mathcal{L}_{\mathrm{hard}}^{k} + \lambda\, \mathcal{L}_{\mathrm{soft}}^{k}$$

let the total number of iterations be N; the final compression rate A is then defined as follows:

$$A = a^{N} = \frac{d_t^{1}}{d_s^{N}}$$

the final compression rate A is preset, so the condition for stopping iteration is that the student dimension d_s satisfies the following relationship:

$$d_s \le \frac{d_t^{1}}{A}$$
7. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 6, wherein: in step 3), the required student model dimension and the per-iteration compression rate are preset when compressing the model, and the number of iterations is then determined by the teacher model dimension, the student model dimension and the per-iteration compression rate.
8. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 6, wherein the specific steps of step 4) are as follows:
through step 3), a low-dimensional student model of final dimension d_s is generated, which is the final result of compressing the knowledge graph embedding model; after the distilled low-dimensional model is obtained, low-dimensional student model prediction is performed.
9. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 7, wherein: the evaluation indices adopted for the low-dimensional student model prediction in step 4) are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1) and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is used directly to evaluate the speed of the model.
10. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 9, wherein: in the prediction phase of step 4), given a query q = (h, r, ?), where h and r denote the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation;
the specific prediction process is as follows: first, the head entity and the relation in the query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors;
then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring, the scores of all triples are sorted, and the index Hits@n is computed as follows:

$$Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(rank_i \le n\big)$$

where N is the number of all triples and the indicator function takes the value 1 when the condition rank_i ≤ n is satisfied and 0 otherwise.
CN202211535321.2A 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation Pending CN115544277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211535321.2A CN115544277A (en) 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211535321.2A CN115544277A (en) 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation

Publications (1)

Publication Number Publication Date
CN115544277A true CN115544277A (en) 2022-12-30

Family

ID=84722403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211535321.2A Pending CN115544277A (en) 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation

Country Status (1)

Country Link
CN (1) CN115544277A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402116A (en) * 2023-06-05 2023-07-07 山东云海国创云计算装备产业创新中心有限公司 Pruning method, system, equipment, medium and image processing method of neural network
CN116415005A (en) * 2023-06-12 2023-07-11 中南大学 Relationship extraction method for academic network construction of scholars

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880A (en) * 2019-09-27 2020-01-10 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation
CN113987196A (en) * 2021-09-29 2022-01-28 浙江大学 Knowledge graph embedding compression method based on knowledge graph distillation
CN114386409A (en) * 2022-01-17 2022-04-22 深圳大学 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880A (en) * 2019-09-27 2020-01-10 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation
CN113987196A (en) * 2021-09-29 2022-01-28 浙江大学 Knowledge graph embedding compression method based on knowledge graph distillation
CN114386409A (en) * 2022-01-17 2022-04-22 深圳大学 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ge Shiming et al.: "Face Recognition Based on Deep Feature Distillation", Journal of Beijing Jiaotong University *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402116A (en) * 2023-06-05 2023-07-07 山东云海国创云计算装备产业创新中心有限公司 Pruning method, system, equipment, medium and image processing method of neural network
CN116402116B (en) * 2023-06-05 2023-09-05 山东云海国创云计算装备产业创新中心有限公司 Pruning method, system, equipment, medium and image processing method of neural network
CN116415005A (en) * 2023-06-12 2023-07-11 中南大学 Relationship extraction method for academic network construction of scholars
CN116415005B (en) * 2023-06-12 2023-08-18 中南大学 Relationship extraction method for academic network construction of scholars

Similar Documents

Publication Publication Date Title
CN115544277A (en) Rapid knowledge graph embedded model compression method based on iterative distillation
CN110647619B (en) General knowledge question-answering method based on question generation and convolutional neural network
CN111199242A (en) Image increment learning method based on dynamic correction vector
CN110413785A (en) A kind of Automatic document classification method based on BERT and Fusion Features
CN112000772B (en) Sentence-to-semantic matching method based on semantic feature cube and oriented to intelligent question and answer
CN112000770B (en) Semantic feature graph-based sentence semantic matching method for intelligent question and answer
CN110110140A (en) Video summarization method based on attention expansion coding and decoding network
CN115294407B (en) Model compression method and system based on preview mechanism knowledge distillation
CN111178093B (en) Neural machine translation system training acceleration method based on stacking algorithm
CN113204633B (en) Semantic matching distillation method and device
CN114398976A (en) Machine reading understanding method based on BERT and gate control type attention enhancement network
CN115424177A (en) Twin network target tracking method based on incremental learning
CN113239209A (en) Knowledge graph personalized learning path recommendation method based on RankNet-transformer
CN113869055A (en) Power grid project characteristic attribute identification method based on deep learning
CN113971367A (en) Automatic design method of convolutional neural network framework based on shuffled frog-leaping algorithm
CN113191445A (en) Large-scale image retrieval method based on self-supervision countermeasure Hash algorithm
CN112950414A (en) Legal text representation method based on decoupling legal elements
CN116610789A (en) Accurate low-cost large language model using method and system
CN116894120A (en) Unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation
CN113626537B (en) Knowledge graph construction-oriented entity relation extraction method and system
CN112966527B (en) Method for generating relation extraction model based on natural language reasoning
CN115455162A (en) Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion
CN111259860B (en) Multi-order characteristic dynamic fusion sign language translation method based on data self-driving
CN113222142A (en) Channel pruning and quick connection layer pruning method and system
Zhang et al. Anonymous model pruning for compressing deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination