CN115544277A - Rapid knowledge graph embedded model compression method based on iterative distillation - Google Patents

Rapid knowledge graph embedded model compression method based on iterative distillation Download PDF

Info

Publication number
CN115544277A
CN115544277A CN202211535321.2A CN202211535321A CN115544277A CN 115544277 A CN115544277 A CN 115544277A CN 202211535321 A CN202211535321 A CN 202211535321A CN 115544277 A CN115544277 A CN 115544277A
Authority
CN
China
Prior art keywords
model
distillation
teacher
knowledge
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211535321.2A
Other languages
Chinese (zh)
Inventor
Wang Peng
Liu Jiajun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202211535321.2A
Publication of CN115544277A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/027 Frames

Abstract

A rapid knowledge graph embedding model compression method based on iterative distillation comprises: 1) pre-training a high-dimensional teacher knowledge graph embedding model; 2) soft-label-weight adaptive distillation; 3) iterative distillation; 4) prediction with the low-dimensional student knowledge graph embedding model. The method preserves the excellent performance of the knowledge graph embedding model after distillation and compression while maintaining inference speed, reduces training time by 50%, has the advantage of fast training, and can meet the requirement of rapidly updating large-scale knowledge graph embedding models in practical applications.

Description

Rapid knowledge graph embedded model compression method based on iterative distillation
Technical Field
The invention belongs to the field of artificial intelligence knowledge graphs, and particularly relates to a rapid knowledge graph embedding model compression method based on iterative distillation.
Background
A knowledge graph is a structure that describes concepts and facts using a graph model, in which knowledge is stored in the form of triples. A knowledge graph embedding model aims to embed the triples of a knowledge graph into a continuous vector space. With the rapid growth of knowledge graph scale, efficient knowledge graph embedding models play a key role in downstream applications such as knowledge-based question answering, recommendation systems and knowledge graph completion. Most knowledge graph embedding models perform better as the embedding dimension increases; however, in practical applications higher-dimensional models tend to have longer inference times and require more storage. Specifically, a 512-dimensional model has 7-15 times more parameters and 2-6 times longer inference time than a 32-dimensional model. How to compress a high-dimensional knowledge graph embedding model into a low-dimensional one is therefore an important problem. Knowledge distillation is a common model compression method: a larger model serves as the teacher model, a smaller model serves as the student model, and the student model is trained to imitate the output of the teacher model. Although several knowledge-distillation-based compression methods for knowledge graph embedding models have recently been proposed, they still suffer from three drawbacks. First, existing work distills a high-dimensional teacher model directly into a low-dimensional student model, but a better teacher cannot always teach a better student. Previous studies have shown that too large a gap between the teacher and student models harms distillation performance. A similar phenomenon exists for knowledge-distillation-based knowledge graph embedding models, where the gap manifests as the difference in embedding dimensions. Specifically, when a low-dimensional student model is distilled directly from a high-dimensional teacher model, the excessive dimension gap makes it difficult for a better teacher to teach a better student: the dimension gap implies a performance gap, which further leads to significant differences between the output distributions of the teacher and student models, making the student model hard to train. Second, another significant challenge of existing methods is that the difference in optimization direction between the distillation objective and the original task objective makes the model difficult to converge. Third, current work mainly focuses on improving inference efficiency but suffers from time-consuming training. In particular, existing distillation methods for knowledge graph embedding models tend to use a multi-teacher distillation framework or to train the teacher and student models together, which makes the training time several times longer than direct distillation.
Chinese patent application CN202111152202.4, entitled "Knowledge graph embedding compression method based on knowledge graph distillation", fully captures the triple information and embedding structure information in a high-dimensional knowledge graph embedding model (teacher model) and distills them into a low-dimensional knowledge graph embedding model (student model). It improves the expressive ability of the student model while preserving its storage and inference efficiency, considers the mutual influence between the teacher model and the student model during distillation, proposes a soft label evaluation mechanism to distinguish the quality of the soft labels of different triples, and proposes a training scheme that first fixes the teacher model and then releases it, so as to improve the adaptability of the student model to the teacher model and ultimately improve the performance of the student model.
However, it differs from the present invention in the following respects:
1) The emphasis of the problem being solved is different: the "knowledge graph embedding compression method based on knowledge graph distillation" focuses on the mutual influence between the teacher model and the student model. The present invention aims to solve the degradation of distillation caused by an excessive embedding-dimension gap between the teacher model and the student model, and additionally focuses on training efficiency.
2) The method of solving the problem is different: the "knowledge graph embedding compression method based on knowledge graph distillation" proposes a two-stage distillation method that directly distills a high-dimensional teacher model into a low-dimensional student model. The present invention proposes a rapid knowledge graph embedding method based on iterative distillation, which gradually reduces the dimension of the teacher model so that knowledge can be transferred smoothly, finally obtaining a better-performing student model and solving the problem of an excessive dimension gap between the teacher model and the student model.
3) The efficiency of the solution is different: to accelerate the training process and further address the inconsistent optimization directions of the hard labels and soft labels during distillation, the present invention proposes a dynamic soft-label-weight adjustment mechanism that adjusts the weight between the soft labels and the hard labels according to the loss during training, giving a large advantage in training time.
The invention provides a novel iterative distillation method for knowledge graph embedding models. Unlike methods that directly distill a low-dimensional student model from a high-dimensional teacher model, the iterative distillation method gradually reduces the model size so as to reduce the dimension gap between the teacher and student models of each distillation. Specifically, a trained high-dimensional knowledge graph embedding model is used as the teacher model and is compressed into a smaller student model at a specific compression rate by knowledge distillation. This process is then iterated: the student model generated in the previous iteration is used as the teacher model of the next iteration to guide the training of a lower-dimensional student model. Iteration stops when the dimension of the resulting student model reaches the required dimension. The iterative distillation method reduces the excessive dimension gap between the student model and the teacher model; in other words, it allows knowledge to be transferred smoothly from the high-dimensional teacher model to the low-dimensional student model and allows the student model of the last iteration to better inherit the performance of the teacher model of the first iteration. To obtain better distillation performance, the invention further provides a novel soft-label-weight adaptive adjustment method. In the early stage of training, the hard label loss is given a larger weight and the soft label loss a smaller weight, so the hard labels dominate the optimization of the model. As training proceeds, the weight of the soft label loss is gradually increased, so that the soft label loss dominates the optimization of the model in the later stage of training. This soft-label-weight adaptive method solves the problem that the model cannot converge well because the optimization directions of the hard label loss and the soft label loss differ. To further speed up the training process, the invention uses a single teacher for distillation and fixes the parameters of the teacher model during distillation.
Disclosure of Invention
In order to solve the above problems, the invention provides a rapid knowledge graph embedding model compression method based on iterative distillation. The invention provides a soft-label-weight adaptive distillation mechanism: while the teacher model guides the training of the student model, the weight of the soft label loss is gradually increased according to the change of the distillation loss, which resolves the inconsistency between the optimization direction of the hard label loss and that of the soft label loss and improves training efficiency. Furthermore, the invention provides an iterative distillation framework in which a knowledge graph embedding model alternately serves as student model and teacher model during the iterative distillation process: the student model generated in the previous iteration becomes the teacher model of the next iteration and guides the training of a lower-dimensional student model. Knowledge can thus migrate smoothly between the high-dimensional teacher model and the low-dimensional student model while the student models retain good performance. Finally, to further speed up the training process, the invention uses a single teacher for distillation and fixes the parameters of the teacher model during distillation. The method preserves the excellent performance of the knowledge graph embedding model after distillation and compression while maintaining inference speed, reduces training time by 50%, has the advantage of fast training, and can meet the requirement of rapidly updating large-scale knowledge graph embedding models in the real world.
In order to achieve the purpose, the invention adopts the technical scheme that:
a rapid knowledge graph embedding model compression method based on iterative distillation comprises the following specific steps:
1) Pre-training a knowledge graph embedding model of a high-dimensional teacher;
training a high-embedding dimension teacher model to prepare for guiding a low-embedding dimension student model;
2) Self-adaptive distillation of soft label weight;
providing a soft label weight self-adaptive distillation mechanism, and gradually increasing the weight of soft label loss according to the change of distillation loss to solve the problem that the optimization direction of hard label loss is inconsistent with the optimization direction of soft label loss in the process of guiding the training of a student model by a teacher model;
3) Carrying out iterative distillation;
providing an iterative distillation frame, enabling a knowledge graph embedded model to alternately become a student model and a teacher model in an iterative distillation process, accelerating a training process, using a single teacher to distill, and fixing parameters of the teacher model in the distillation process;
4) And (5) embedding the low-dimensional student knowledge graph into a model for prediction.
As a further improvement of the invention, the process of training the high-embedding-dimension teacher model in step 1) is as follows:
first, given a set of entities E and relations R, a knowledge graph G is represented as a set of triples (h, r, t), where h, r and t denote the head entity, the relation and the tail entity; the knowledge graph embedding model takes the triples formed by head entities, relations and tail entities in the knowledge graph as positive triples (h, r, t), and randomly replaces the head entity or tail entity of a positive triple to construct negative triples (h', r, t');
the knowledge graph embedding model then embeds each triple as vectors and uses a scoring function S to compute a score for each triple's vector representation;
different knowledge graph embedding models have different scoring functions; after the score of each triple is obtained, the loss function adopts the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{KGE}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function;
after the training of the high-dimensional teacher model is complete, the trained high-dimensional teacher model is saved.
As a further improvement of the invention, the specific steps of step 2) are as follows:
given a triple (h, r, t), it is first input into the teacher model and the student model simultaneously and encoded by each of them; the score produced by the teacher model's scoring function is defined as S_T(h, r, t) and the score produced by the student model's scoring function as S_S(h, r, t);
the hard label loss during distillation is the original loss of the student model, defined as follows:

$$\mathcal{L}_{\mathrm{hard}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function; the soft label loss uses the Huber loss to measure the difference between the output distributions of the teacher model and the student model, defined as follows:

$$\mathcal{L}_{\mathrm{soft}} = \mathrm{Huber}\big(S_T(h,r,t),\, S_S(h,r,t)\big)$$

finally, the total distillation loss is the weighted sum of the hard label loss and the soft label loss:

$$\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{hard}} + \lambda\, \mathcal{L}_{\mathrm{soft}}$$

where λ is the soft label weight that balances the soft label loss and the hard label loss.
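A minimal sketch of how the hard label loss, the Huber soft label loss and the soft label weight might be combined, assuming the weighting form L_hard + λ·L_soft reconstructed above and PyTorch's built-in Huber loss; the function name distillation_loss is illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores, teacher_scores, labels, lam):
    """Weighted sum of the student's hard label loss and a Huber soft label loss.

    student_scores, teacher_scores: triple scores from the two models;
    labels: 1 for positive triples, 0 for negatives; lam: soft label weight.
    """
    hard = F.binary_cross_entropy_with_logits(student_scores, labels)
    # Soft label loss: Huber distance between the teacher and student scores.
    soft = F.huber_loss(student_scores, teacher_scores.detach())
    return hard + lam * soft

# Example: the teacher is frozen (detached), only the student receives gradients.
s = torch.randn(2048, requires_grad=True)   # student scores
t = torch.randn(2048)                       # teacher scores
y = (torch.rand(2048) > 0.5).float()        # hard labels
distillation_loss(s, t, y, lam=0.1).backward()
```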
As a further improvement of the invention, in the step 2), only the student model is trained in the distillation process of the soft label weight adaptive distillation mechanism, and the model parameters of the teacher model are fixed.
As a further improvement of the invention, in step 2) the soft label weight λ is dynamically adjusted during the distillation process of the soft-label-weight adaptive distillation mechanism, and the complete training process is divided into two stages;
in the first stage, the hard label loss dominates, and the soft label loss weight is assigned a small initial value and gradually increases;
in the second stage, the soft label weight is fixed;
let the total number of training rounds be M; the soft label weight of the m-th round is as follows:

$$\lambda_m = \begin{cases} \lambda_0 + k \cdot m, & m \le M/p \\ \lambda_{M/p}, & m > M/p \end{cases}$$

where the parameter k is adjusted dynamically during training so that the value of λ_m stays within (0, 1), the soft label time control parameter p controls the fraction of training over which the soft label weight is adjusted, and λ_0 is the initial soft label weight.
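Since the exact weight formula appears only as an image in the source, the following Python sketch shows one possible realization of the two-stage schedule described above, assuming a linear ramp up to round M/p followed by a fixed value; in the patent the slope parameter k is adjusted dynamically according to the training loss, which is not reproduced here.

```python
def soft_label_weight(m: int, M: int, lam0: float = 0.1,
                      lam_max: float = 0.9, p: float = 2.0) -> float:
    """Two-stage soft label weight schedule (illustrative reconstruction).

    Stage 1 (m <= M/p): the weight ramps linearly from lam0 towards lam_max.
    Stage 2 (m >  M/p): the weight stays fixed at lam_max.
    p controls what fraction of training is spent adjusting the weight.
    """
    cutoff = M / p
    k = (lam_max - lam0) / cutoff   # slope chosen so that lambda_m stays inside (0, 1)
    return lam0 + k * min(m, cutoff)

# Example: 1000 rounds, weight adjusted during the first half of training (p = 2).
print(soft_label_weight(0, 1000), soft_label_weight(500, 1000), soft_label_weight(999, 1000))
```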
As a further improvement of the invention, the specific steps of step 3) are as follows:
let the embedding dimension of the teacher model in the k-th iteration be d_t^k and the embedding dimension of the student model be d_s^k; the compression rate a of each iteration is defined as follows:

$$a = \frac{d_t^{k}}{d_s^{k}}$$

model compression is then performed in every iteration with this fixed compression rate a; in the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration is used as the teacher of the k-th iteration; the hard label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{hard}}^{k} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S^{k}(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S^{k}(h,r,t)\big)\big) \,\Big]$$

where S_S^k(h, r, t) is the score of the student model's scoring function in the k-th iteration; the soft label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{soft}}^{k} = \mathrm{Huber}\big(S_T^{k}(h,r,t),\, S_S^{k}(h,r,t)\big)$$

where S_T^k(h, r, t) is the score of the teacher model's scoring function in the k-th iteration; the total loss of the k-th round is defined as follows:

$$\mathcal{L}^{k} = \mathcal{L}_{\mathrm{hard}}^{k} + \lambda\, \mathcal{L}_{\mathrm{soft}}^{k}$$

let the total number of iterations be N; the final compression rate A is then defined as follows:

$$A = a^{N} = \frac{d_t^{1}}{d_s^{N}}$$

the final compression rate A is preset, so the condition for stopping iteration is that the student dimension d_s satisfies the following relationship:

$$d_s \le \frac{d_t^{1}}{A}$$
as a further improvement of the invention, the required student model dimension and the compression ratio of each iteration are preset during model compression in the step 3), and then the iteration times are determined by the teacher model dimension, the student model dimension and the compression ratio of each iteration.
As a further improvement of the invention, the specific steps of step 4) are as follows:
through step 3), a low-dimensional student model of final dimension d_s is generated, which is the final result of compressing the knowledge graph embedding model; after the distilled low-dimensional model is obtained, low-dimensional student model prediction can be carried out.
As a further improvement of the invention, the evaluation indices adopted for the low-dimensional student model prediction in step 4) are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1) and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is used directly to evaluate the speed of the model.
As a further improvement of the invention, in the prediction phase of the low-dimensional student model prediction in step 4), given a query q = (h, r, ?), where h and r denote the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation.
The specific prediction process is as follows: first, the head entity and the relation in the query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors;
then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring, the scores of all triples are sorted, and the index Hits@n is computed as follows:

$$Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(rank_i \le n\big)$$

where N is the number of all triples and the indicator function takes the value 1 when the condition rank_i ≤ n is satisfied and 0 otherwise.
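A sketch of how the link prediction metrics described above could be computed, assuming a model object with a score(h, r, t) method that returns higher scores for more plausible triples; all names and the toy scorer are illustrative.

```python
import torch

def link_prediction_metrics(model, queries, all_entities, true_tails):
    """Rank every candidate tail for each query (h, r, ?) and compute MRR / Hits@n."""
    ranks = []
    for (h, r), t_true in zip(queries, true_tails):
        h_batch = torch.full_like(all_entities, h)
        r_batch = torch.full_like(all_entities, r)
        scores = model.score(h_batch, r_batch, all_entities)     # score every candidate tail
        rank = int((scores > scores[t_true]).sum().item()) + 1   # 1-based rank of the true tail
        ranks.append(rank)
    ranks = torch.tensor(ranks, dtype=torch.float)
    return {
        "MRR": (1.0 / ranks).mean().item(),
        "Hits@1": (ranks <= 1).float().mean().item(),
        "Hits@10": (ranks <= 10).float().mean().item(),
    }

if __name__ == "__main__":
    class _ToyScorer:
        def score(self, h, r, t):
            # Toy scorer that prefers the tail id equal to (h + r) mod 5.
            return -(t - (h + r) % 5).abs().float()

    ents = torch.arange(5)
    print(link_prediction_metrics(_ToyScorer(), queries=[(0, 1), (2, 2)],
                                  all_entities=ents, true_tails=[1, 4]))
```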
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a rapid knowledge map embedded model compression method based on iterative distillation, which reduces the dimension difference between a teacher model and a student model of each distillation by the iterative distillation method so as to enable knowledge to be smoothly transferred from the teacher model to the student model. Meanwhile, the soft label weight self-adaptive mechanism solves the problem that training efficiency is influenced due to inconsistent optimization directions between hard label loss and soft label loss by dynamically adjusting the weight of the soft label. Furthermore, a single teacher distillation strategy is adopted in the distillation process, parameters of a teacher model are fixed, and the distillation speed is obviously improved. Verification is carried out on a link prediction task, and the rapid knowledge map embedded model compression method based on iterative distillation provided by the invention is proved to have better universality and can ensure high efficiency in practical application. Therefore, the invention has good application prospect and popularization range.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention;
FIG. 2 is a model flow diagram of the method of the present invention;
fig. 3 is a diagram of a training arrangement for the method of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a rapid knowledge map embedded model compression method based on iterative distillation, which reduces the dimension difference between a teacher model and a student model of each distillation by the iterative distillation method so as to enable knowledge to be smoothly transferred from the teacher model to the student model. Meanwhile, the soft label weight self-adaptive mechanism solves the problem that training efficiency is influenced due to inconsistent optimization directions between hard label loss and soft label loss by dynamically adjusting the weight of the soft label. Furthermore, a single teacher distillation strategy is adopted in the distillation process, parameters of a teacher model are fixed, and the distillation speed is obviously improved. Verification is carried out on a link prediction task, and the rapid knowledge map embedded model compression method based on iterative distillation provided by the invention is proved to have better universality and can ensure high efficiency in practical application. Therefore, the invention has good application prospect and popularization range.
As a specific embodiment of the present invention, a logic flow chart is shown in fig. 1, a model flow chart in fig. 2 and a training configuration chart in fig. 3. The rapid knowledge graph embedding model compression method based on iterative distillation comprises the following steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model.
Knowledge graph embedding means that, given the set of triples of a knowledge graph, all entities and relations in it are embedded as vectors in a continuous space. When training the high-dimensional teacher model, a knowledge graph embedding model with a higher embedding dimension is selected for pre-training. A higher-dimensional model is chosen as the teacher model because, for most knowledge graph embedding models, the higher the embedding dimension, the stronger the expressive ability of the model.
The process of training the high-dimensional teacher model is as follows. First, given a set of entities E and relations R, a knowledge graph G can be represented as a set of triples. A triple is generally written (h, r, t), denoting the head entity, the relation and the tail entity. The knowledge graph embedding model takes the original triples formed by head entities, relations and tail entities in the knowledge graph as positive triples (h, r, t), and randomly replaces the head entity or tail entity of a positive triple to construct negative triples (h', r, t'). The knowledge graph embedding model then embeds each triple as vectors and uses a scoring function S to compute a score for each triple's vector representation. Different knowledge graph embedding models have different scoring functions. After the score of each triple is obtained, the loss function adopts the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{KGE}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function. After the training of the high-dimensional teacher model is complete, the trained high-dimensional teacher model is saved. By training the high-dimensional teacher model, a knowledge graph embedding model with better performance is obtained for distilling the low-dimensional knowledge graph embedding model.
2) Soft label weight adaptive distillation.
The soft-label-weight adaptive distillation uses the high-dimensional knowledge graph embedding model trained in step 1) as the teacher model; its purpose is to let the low-dimensional student model learn knowledge by imitating the output of the high-dimensional teacher model, so that the student model reaches performance matching that of the teacher model.
In the distillation process of traditional knowledge graph embedding models, the inconsistent optimization directions of the soft label loss and the hard label loss can make the model difficult to converge during training. To solve this problem, the soft-label-weight adaptive distillation dynamically adjusts the weight of the soft labels during distillation, changing it according to the change of the total loss: in the early stage of training the soft labels have a small weight, so the hard labels dominate the training of the model; as training proceeds, the soft label weight is gradually increased, so that the optimization of the student model gradually shifts to the direction dominated by the soft labels. Through this strategy, the model converges in a better direction and the convergence speed of the model is accelerated.
Specifically, given a triple (h, r, t), it is first input into the teacher model and the student model simultaneously and encoded by each of them. The score produced by the teacher model's scoring function is defined as S_T(h, r, t) and the score produced by the student model's scoring function as S_S(h, r, t). The hard label loss during distillation is the original loss of the student model, defined as follows:

$$\mathcal{L}_{\mathrm{hard}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function. The soft label loss uses the Huber loss to measure the difference between the output distributions of the teacher model and the student model, defined as follows:

$$\mathcal{L}_{\mathrm{soft}} = \mathrm{Huber}\big(S_T(h,r,t),\, S_S(h,r,t)\big)$$

Finally, the total distillation loss is the weighted sum of the hard label loss and the soft label loss:

$$\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{hard}} + \lambda\, \mathcal{L}_{\mathrm{soft}}$$

where λ is the soft label weight that balances the soft label loss and the hard label loss. Note that only the student model is trained during distillation; the model parameters of the teacher model are fixed.
To further balance the hard label loss and the soft label loss during distillation, the soft label weight λ is adjusted dynamically. At the beginning of training, the weight of the soft label loss should be as small as possible; as training proceeds, the soft label weight is continuously increased and finally fixed. Specifically, the complete training process is divided into two stages. In the first stage, the hard label loss dominates; the soft label loss weight is assigned a small initial value and gradually increases. In the second stage, the soft label weight is fixed. Let the total number of training rounds be M; the soft label weight of the m-th round is as follows:

$$\lambda_m = \begin{cases} \lambda_0 + k \cdot m, & m \le M/p \\ \lambda_{M/p}, & m > M/p \end{cases}$$

where the parameter k is adjusted dynamically during training so that the value of λ_m stays within (0, 1), the soft label time control parameter p controls the fraction of training over which the soft label weight is adjusted, and λ_0 is the initial soft label weight. The soft-label-weight adaptive distillation solves the problem that the model is difficult to converge in the early stage of training because the optimization directions of the hard label loss and the soft label loss differ.
3) Iterative distillation.
Iterative distillation repeats the operations of step 1) and step 2), using the student model generated in the previous iteration as the teacher model of the next iteration, which reduces the dimension gap between the student model and the teacher model in each distillation so that knowledge can be transferred more smoothly. Previous methods distill a low-dimensional student model directly from a high-dimensional teacher model; however, when distilling knowledge graph embedding models, a better-performing teacher model does not necessarily teach a better-performing student model, because the embedding-dimension gap between the teacher and student models is too large, the gap between their outputs is too large, and it is difficult to transfer knowledge directly from the teacher model to the student model. The invention proposes an iterative distillation strategy that does not distill the original teacher model directly into the final student model but gradually reduces the embedding dimension of the model for distillation, thereby reducing the dimension gap between the teacher and student models in each distillation and allowing knowledge to be transferred smoothly from the teacher model to the student model. Through this iterative distillation method, the finally generated student model inherits the performance of the original teacher model well.
Specifically, let the embedding dimension of the teacher model in the k-th iteration be d_t^k and the embedding dimension of the student model be d_s^k. The compression rate a of each iteration is defined as follows:

$$a = \frac{d_t^{k}}{d_s^{k}}$$

Model compression is then performed in every iteration with this fixed compression rate a. In the first iteration, the first student model is trained using the pre-trained teacher model. In the k-th iteration, the student generated in the (k-1)-th iteration is used as the teacher of the k-th iteration. The hard label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{hard}}^{k} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S^{k}(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S^{k}(h,r,t)\big)\big) \,\Big]$$

where S_S^k(h, r, t) is the score of the student model's scoring function in the k-th iteration. The soft label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{soft}}^{k} = \mathrm{Huber}\big(S_T^{k}(h,r,t),\, S_S^{k}(h,r,t)\big)$$

where S_T^k(h, r, t) is the score of the teacher model's scoring function in the k-th iteration. The total loss of the k-th round is defined as follows:

$$\mathcal{L}^{k} = \mathcal{L}_{\mathrm{hard}}^{k} + \lambda\, \mathcal{L}_{\mathrm{soft}}^{k}$$

Let the total number of iterations be N; the final compression rate A is then defined as follows:

$$A = a^{N} = \frac{d_t^{1}}{d_s^{N}}$$

The final compression rate A is preset, so the condition for stopping iteration is that the student dimension d_s satisfies the following relationship:

$$d_s \le \frac{d_t^{1}}{A}$$

In practical applications, the required student model dimension and the per-iteration compression rate are preset when compressing the model; the number of iterations is then determined by the teacher model dimension, the student model dimension and the per-iteration compression rate.
4) Prediction with the low-dimensional student knowledge graph embedding model.
Through step 3), a low-dimensional student model of final dimension d_s is generated, which is the final result of compressing the knowledge graph embedding model. After the distilled low-dimensional model is obtained, low-dimensional student model prediction can be carried out. The quality of a knowledge graph embedding model is generally evaluated by its performance on the link prediction task; the evaluation indices adopted are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1) and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is used directly to evaluate the speed of the model. Specifically, in the prediction phase, given a query q = (h, r, ?), where h and r denote the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation. The specific prediction process is as follows. First, the head entity and the relation in the query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors. Then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring. The scores of all triples are sorted and the index Hits@n is computed as follows:

$$Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(rank_i \le n\big)$$

where N is the number of all triples and the indicator function takes the value 1 when the condition rank_i ≤ n is satisfied and 0 otherwise.
[ example 1 ]
In this embodiment, the iterative-distillation-based rapid knowledge graph embedding model compression method is applied to the general-purpose datasets FB15K-237 and WN18RR for training and prediction; the other examples use the same data as this example. The FB15K-237 dataset contains 14541 entities and 237 relations, with 272115 triples in the training set, 17535 triples in the validation set and 20466 triples in the test set. The WN18RR dataset contains 40943 entities and 11 relations, with 86835 triples in the training set, 3034 triples in the validation set and 3134 triples in the test set. Because the model has good robustness and generalization, the same hyperparameter settings can be used across different general-purpose datasets.
The specific implementation is as follows: TransE, ComplEx, SimplE and RotatE are selected as the knowledge graph embedding models. The embedding dimension of the initial teacher model embedding layer is 512, and the embedding dimension of the final student model obtained by distillation is 32. The compression rate a of each distillation layer is set to 2, the number of iterations N is set to 4, and the hyperparameter p is set to 2. In the training phase, Adagrad is used as the optimizer with a learning rate of 0.1. The batch size of each training run is set to 1024 and the number of training rounds for each iteration is set to 1000. Because the stability of the model is good, at most 1000 training rounds are needed for convergence in each iteration, so the same batch and round settings can be adopted for different models and datasets.
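For reference, the hyperparameters of this embodiment can be collected into a single configuration sketch; the dictionary layout and key names are illustrative and not part of the patent.

```python
# Values taken from the text of this embodiment; keys are illustrative.
CONFIG = {
    "kge_models": ["TransE", "ComplEx", "SimplE", "RotatE"],
    "datasets": ["FB15K-237", "WN18RR"],
    "teacher_dim": 512,         # embedding dimension of the initial teacher model
    "student_dim": 32,          # embedding dimension of the final student model
    "compression_ratio": 2,     # per-iteration compression rate a
    "num_iterations": 4,        # N: 512 -> 256 -> 128 -> 64 -> 32
    "p": 2,                     # soft label time control parameter
    "optimizer": "Adagrad",
    "learning_rate": 0.1,
    "batch_size": 1024,
    "epochs_per_iteration": 1000,
}
```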
The effectiveness of the iterative-distillation-based rapid knowledge graph embedding model compression method is tested with different knowledge graph embedding models, and the final prediction results of the student model are compared with student models obtained by other distillation methods. The method achieves the highest performance on the three indices MRR, Hits@1 and Hits@10, which shows that the iterative-distillation-based rapid knowledge graph embedding model compression method can reach the optimal level in practical application scenarios.
[ example 2 ]
The iterative-distillation-based rapid knowledge graph embedding model compression method has a training advantage. First, distillation is performed by a single teacher, which greatly reduces training time compared with a multi-teacher distillation framework. Second, the teacher parameters are fixed during the distillation stage and a dynamic soft-label-weight adjustment mechanism is adopted to accelerate the training process, so the training time is reduced by 50% on average compared with the best existing method. The more complex the knowledge graph embedding model, the more significantly the training time is shortened; for example, for the more complex knowledge graph embedding model RotatE, the training time can be reduced by 63%.
In practical applications, real knowledge graphs often need to be updated, so existing knowledge graph embedding models need to be retrained each time a new entity appears. Compared with traditional distillation methods, the iterative-distillation-based rapid knowledge graph embedding model compression method therefore ensures that the knowledge graph embedding model is updated efficiently while greatly saving resource consumption.
[ example 3 ]
Compared with an undistilled model of the same dimension, the student model generated by each iteration of the iterative-distillation-based rapid knowledge graph embedding model compression method performs well. Specifically, a 512-dimensional knowledge graph embedding model is distilled by the iterative distillation method with the per-layer compression rate set to 2, producing 256-, 128-, 64- and 32-dimensional models in turn; testing each intermediate model against an undistilled model of the same dimension shows that the performance of the distilled model can be improved by 5%-90%, and the lower the embedding dimension of the model, the more obvious the performance improvement. This shows that the iterative distillation method is highly flexible: in practical applications, the dimension to distill to can be chosen according to actual requirements, and for different required dimensions the iterative distillation method can obtain a high-performance low-dimensional model.
[ example 4 ]
The compression rate of each iteration of the iterative-distillation-based rapid knowledge graph embedding model compression method is set to 2, which balances distillation performance and efficiency. If the per-iteration compression rate is increased, the required student model can be obtained with fewer iterations, but performance is affected to some extent. For example, setting the per-layer compression rate to 4 or 16 reduces the number of training iterations by 50% or 75%, but performance drops by 5% to 10%. It can be seen that the smaller the per-layer compression rate, the more remarkable the effect of iterative distillation. Meanwhile, thanks to the large training time advantage, different compression rates can be chosen flexibly according to requirements in practical applications.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims (10)

1. A rapid knowledge graph embedding model compression method based on iterative distillation, specifically comprising the following steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model:
training a teacher model with a high embedding dimension in preparation for guiding a student model with a low embedding dimension;
2) Soft-label-weight adaptive distillation:
providing a soft-label-weight adaptive distillation mechanism which, while the teacher model guides the training of the student model, gradually increases the weight of the soft label loss according to the change of the distillation loss, solving the inconsistency between the optimization directions of the hard label loss and the soft label loss;
3) Iterative distillation:
providing an iterative distillation framework in which the knowledge graph embedding model alternately serves as student model and teacher model during the iterative distillation process; to accelerate the training process, a single teacher is used for distillation and the parameters of the teacher model are fixed during distillation;
4) Prediction with the low-dimensional student knowledge graph embedding model.
2. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 1, wherein the process of training the high-embedding-dimension teacher model in step 1) is as follows:
first, given a set of entities E and relations R, a knowledge graph G is represented as a set of triples (h, r, t), where h, r and t denote the head entity, the relation and the tail entity; the knowledge graph embedding model takes the triples formed by head entities, relations and tail entities in the knowledge graph as positive triples (h, r, t), and randomly replaces the head entity or tail entity of a positive triple to construct negative triples (h', r, t');
the knowledge graph embedding model then embeds each triple as vectors and uses a scoring function S to compute a score for each triple's vector representation;
different knowledge graph embedding models have different scoring functions; after the score of each triple is obtained, the loss function adopts the binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{KGE}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function;
after the training of the high-dimensional teacher model is complete, the trained high-dimensional teacher model is saved.
3. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 2, wherein the specific steps of step 2) are as follows:
given a triple (h, r, t), it is first input into the teacher model and the student model simultaneously and encoded by each of them; the score produced by the teacher model's scoring function is defined as S_T(h, r, t) and the score produced by the student model's scoring function as S_S(h, r, t);
the hard label loss during distillation is the original loss of the student model, defined as follows:

$$\mathcal{L}_{\mathrm{hard}} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S(h,r,t)\big)\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the Softmax function; the soft label loss uses the Huber loss to measure the difference between the output distributions of the teacher model and the student model, defined as follows:

$$\mathcal{L}_{\mathrm{soft}} = \mathrm{Huber}\big(S_T(h,r,t),\, S_S(h,r,t)\big)$$

finally, the total distillation loss is the weighted sum of the hard label loss and the soft label loss:

$$\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{hard}} + \lambda\, \mathcal{L}_{\mathrm{soft}}$$

where λ is the soft label weight that balances the soft label loss and the hard label loss.
4. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 3, wherein: in step 2), only the student model is trained during the distillation process of the soft-label-weight adaptive distillation mechanism, and the model parameters of the teacher model are fixed.
5. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 3, wherein: in step 2), the soft label weight λ is dynamically adjusted during the distillation process of the soft-label-weight adaptive distillation mechanism, and the complete training process is divided into two stages;
in the first stage, the hard label loss dominates, and the soft label loss weight is assigned a small initial value and gradually increases;
in the second stage, the soft label weight is fixed;
let the total number of training rounds be M; the soft label weight of the m-th round is as follows:

$$\lambda_m = \begin{cases} \lambda_0 + k \cdot m, & m \le M/p \\ \lambda_{M/p}, & m > M/p \end{cases}$$

where the parameter k is adjusted dynamically during training so that the value of λ_m stays within (0, 1), the soft label time control parameter p controls the fraction of training over which the soft label weight is adjusted, and λ_0 is the initial soft label weight.
6. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 3, wherein the specific steps of step 3) are as follows:
let the embedding dimension of the teacher model in the k-th iteration be d_t^k and the embedding dimension of the student model be d_s^k; the compression rate a of each iteration is defined as follows:

$$a = \frac{d_t^{k}}{d_s^{k}}$$

model compression is then performed in every iteration with this fixed compression rate a; in the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration is used as the teacher of the k-th iteration; the hard label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{hard}}^{k} = -\sum_{(h,r,t)} \Big[\, y \,\log \sigma\big(S_S^{k}(h,r,t)\big) + (1-y)\,\log\big(1-\sigma\big(S_S^{k}(h,r,t)\big)\big) \,\Big]$$

where S_S^k(h, r, t) is the score of the student model's scoring function in the k-th iteration; the soft label loss of the k-th iteration is defined as follows:

$$\mathcal{L}_{\mathrm{soft}}^{k} = \mathrm{Huber}\big(S_T^{k}(h,r,t),\, S_S^{k}(h,r,t)\big)$$

where S_T^k(h, r, t) is the score of the teacher model's scoring function in the k-th iteration; the total loss of the k-th round is defined as follows:

$$\mathcal{L}^{k} = \mathcal{L}_{\mathrm{hard}}^{k} + \lambda\, \mathcal{L}_{\mathrm{soft}}^{k}$$

let the total number of iterations be N; the final compression rate A is then defined as follows:

$$A = a^{N} = \frac{d_t^{1}}{d_s^{N}}$$

the final compression rate A is preset, so the condition for stopping iteration is that the student dimension d_s satisfies the following relationship:

$$d_s \le \frac{d_t^{1}}{A}$$
7. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 6, wherein: in step 3), the required student model dimension and the per-iteration compression rate are preset when compressing the model, and the number of iterations is then determined by the teacher model dimension, the student model dimension and the per-iteration compression rate.
8. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 6, wherein the specific steps of step 4) are as follows:
through step 3), a low-dimensional student model of final dimension d_s is generated, which is the final result of compressing the knowledge graph embedding model; after the distilled low-dimensional model is obtained, low-dimensional student model prediction is performed.
9. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 7, wherein: the evaluation indices adopted for the low-dimensional student model prediction in step 4) are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1) and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is used directly to evaluate the speed of the model.
10. The iterative-distillation-based rapid knowledge graph embedding model compression method according to claim 9, wherein: in the prediction phase of step 4), given a query q = (h, r, ?), where h and r denote the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation;
the specific prediction process is as follows: first, the head entity and the relation in the query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors;
then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring, the scores of all triples are sorted, and the index Hits@n is computed as follows:

$$Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(rank_i \le n\big)$$

where N is the number of all triples and the indicator function takes the value 1 when the condition rank_i ≤ n is satisfied and 0 otherwise.
CN202211535321.2A 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation Pending CN115544277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211535321.2A CN115544277A (en) 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211535321.2A CN115544277A (en) 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation

Publications (1)

Publication Number Publication Date
CN115544277A true CN115544277A (en) 2022-12-30

Family

ID=84722403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211535321.2A Pending CN115544277A (en) 2022-12-02 2022-12-02 Rapid knowledge graph embedded model compression method based on iterative distillation

Country Status (1)

Country Link
CN (1) CN115544277A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402116A (en) * 2023-06-05 2023-07-07 山东云海国创云计算装备产业创新中心有限公司 Pruning method, system, equipment, medium and image processing method of neural network
CN116415005A (en) * 2023-06-12 2023-07-11 中南大学 Relationship extraction method for academic network construction of scholars

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880A (en) * 2019-09-27 2020-01-10 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation
CN113987196A (en) * 2021-09-29 2022-01-28 浙江大学 Knowledge graph embedding compression method based on knowledge graph distillation
CN114386409A (en) * 2022-01-17 2022-04-22 深圳大学 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880A (en) * 2019-09-27 2020-01-10 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation
CN113987196A (en) * 2021-09-29 2022-01-28 浙江大学 Knowledge graph embedding compression method based on knowledge graph distillation
CN114386409A (en) * 2022-01-17 2022-04-22 深圳大学 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ge Shiming et al.: "Face Recognition Based on Deep Feature Distillation", Journal of Beijing Jiaotong University *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402116A (en) * 2023-06-05 2023-07-07 山东云海国创云计算装备产业创新中心有限公司 Pruning method, system, equipment, medium and image processing method of neural network
CN116402116B (en) * 2023-06-05 2023-09-05 山东云海国创云计算装备产业创新中心有限公司 Pruning method, system, equipment, medium and image processing method of neural network
CN116415005A (en) * 2023-06-12 2023-07-11 中南大学 Relationship extraction method for academic network construction of scholars
CN116415005B (en) * 2023-06-12 2023-08-18 中南大学 Relationship extraction method for academic network construction of scholars

Similar Documents

Publication Publication Date Title
CN115544277A (en) Rapid knowledge graph embedded model compression method based on iterative distillation
CN110647619B (en) General knowledge question-answering method based on question generation and convolutional neural network
CN111199242A (en) Image increment learning method based on dynamic correction vector
CN110413785A (en) A kind of Automatic document classification method based on BERT and Fusion Features
CN112000772B (en) Sentence-to-semantic matching method based on semantic feature cube and oriented to intelligent question and answer
CN112000770B (en) Semantic feature graph-based sentence semantic matching method for intelligent question and answer
CN110110140A (en) Video summarization method based on attention expansion coding and decoding network
CN115294407B (en) Model compression method and system based on preview mechanism knowledge distillation
CN111178093B (en) Neural machine translation system training acceleration method based on stacking algorithm
CN113204633B (en) Semantic matching distillation method and device
CN114398976A (en) Machine reading understanding method based on BERT and gate control type attention enhancement network
CN115424177A (en) Twin network target tracking method based on incremental learning
CN113239209A (en) Knowledge graph personalized learning path recommendation method based on RankNet-transformer
CN113869055A (en) Power grid project characteristic attribute identification method based on deep learning
CN113971367A (en) Automatic design method of convolutional neural network framework based on shuffled frog-leaping algorithm
CN113191445A (en) Large-scale image retrieval method based on self-supervision countermeasure Hash algorithm
CN112950414A (en) Legal text representation method based on decoupling legal elements
CN116610789A (en) Accurate low-cost large language model using method and system
CN116894120A (en) Unsupervised cross-modal hash retrieval method based on dynamic multi-expert knowledge distillation
CN113626537B (en) Knowledge graph construction-oriented entity relation extraction method and system
CN112966527B (en) Method for generating relation extraction model based on natural language reasoning
CN115455162A (en) Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion
CN111259860B (en) Multi-order characteristic dynamic fusion sign language translation method based on data self-driving
CN113222142A (en) Channel pruning and quick connection layer pruning method and system
Zhang et al. Anonymous model pruning for compressing deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination