CN115544277A - Rapid knowledge graph embedding model compression method based on iterative distillation
- Publication number
- CN115544277A (application CN202211535321.2A)
- Authority
- CN
- China
- Prior art keywords
- model
- distillation
- teacher
- knowledge
- student
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/027—Frames
Abstract
A fast knowledge graph embedding model compression method based on iterative distillation: 1) pre-train a high-dimensional teacher knowledge graph embedding model; 2) soft-label weight adaptive distillation; 3) iterative distillation; 4) prediction with the low-dimensional student knowledge graph embedding model. The method preserves the excellent performance of the distilled and compressed knowledge graph embedding model while maintaining inference speed, reduces training time by 50%, has the advantage of fast training, and can meet the requirement of rapidly updating large-scale knowledge graph embedding models in practical applications.
Description
Technical Field
The invention belongs to the field of artificial intelligence knowledge graphs, and particularly relates to a fast knowledge graph embedding model compression method based on iterative distillation.
Background
A knowledge graph is a structure that describes concepts and facts using a graph model, where knowledge is stored in the form of triples. A knowledge graph embedding model aims to embed the triples of a knowledge graph into a continuous vector space. With the rapid growth of knowledge graph scale, efficient knowledge graph embedding models play a key role in downstream applications such as knowledge question answering, recommendation systems, and knowledge graph completion. Most knowledge graph embedding models perform better as the embedding dimension increases; however, in practical applications, higher-dimensional models tend to have longer inference times and require more storage. Specifically, a 512-dimensional model has 7-15 times more parameters and 2-6 times longer inference time than a 32-dimensional model. How to compress a high-dimensional knowledge graph embedding model into a low-dimensional one is therefore an important problem. Knowledge distillation is a common model compression method: a larger model serves as the teacher, a smaller model serves as the student, and the student is trained to imitate the output of the teacher. Although several distillation-based knowledge graph embedding model compression methods have recently been proposed, they still suffer from three drawbacks. First, existing work distills high-dimensional teacher models directly into low-dimensional student models, but a better teacher cannot always teach a better student. Previous studies have shown that too large a gap between teacher and student affects distillation performance; for knowledge graph embedding models, this gap is the difference in embedding dimensions. When a low-dimensional student is distilled directly from a high-dimensional teacher, the excessive dimension gap implies different performance levels and leads to significant differences in output distributions between teacher and student, which makes the student model difficult to train. Second, the difference in optimization direction between the distillation objective and the original task objective makes the model hard to converge. Third, current work focuses mainly on improving inference efficiency, but the training process remains time-consuming: existing knowledge graph embedding distillation methods tend to use a multi-teacher framework or to train teacher and student models together, which makes training several times longer than direct distillation.
Application CN202111152202.4, entitled "Knowledge graph embedding compression method based on knowledge graph distillation", fully captures the triple information and embedding structure information in a high-dimensional knowledge graph embedding model (teacher model) and distills them into a low-dimensional knowledge graph embedding model (student model). It improves the expressive capability of the student model while preserving its storage and inference efficiency, considers the mutual influence between teacher and student during distillation, proposes a soft label evaluation mechanism to distinguish the quality of the soft labels of different triples, and adopts a training scheme that first fixes the teacher model and later releases it, improving the student's adaptability to the teacher and ultimately its performance.
However, it differs from the present invention as follows:
1) Different problem emphasis: the "knowledge graph embedding compression method based on knowledge graph distillation" focuses on the mutual influence between the teacher model and the student model. The present invention instead addresses the degradation of distillation caused by an excessive embedding-dimension gap between the teacher model and the student model, and further focuses on training efficiency.
2) Different solution methods: the cited method proposes a two-stage distillation that distills a high-dimensional teacher directly into a low-dimensional student. The present invention proposes a fast knowledge graph embedding method based on iterative distillation that gradually reduces the dimension of the teacher model, so that knowledge is transferred smoothly, a better-performing student model is finally obtained, and the problem of an excessive dimension gap between teacher and student is solved.
3) Different efficiency: to accelerate training and resolve the inconsistent optimization directions of hard labels and soft labels during distillation, the present invention provides a dynamic soft-label weight adjustment mechanism that adjusts the weight between soft and hard labels according to the loss during training, giving a large advantage in training time.
The invention provides a novel iterative distillation method for knowledge graph embedding models. Unlike methods that distill a low-dimensional student directly from a high-dimensional teacher, iterative distillation gradually reduces the model size to narrow the dimension gap between the teacher and student of each distillation. Specifically, a trained high-dimensional knowledge graph embedding model serves as the teacher model, which is compressed into a smaller student model at a specific compression ratio by knowledge distillation. The process then iterates: the student model generated in one iteration serves as the teacher model of the next iteration to guide the training of a lower-dimensional student. Iteration stops when the student model reaches the required dimension. This iterative distillation reduces the excessive dimension gap between student and teacher; in other words, it allows knowledge to transfer smoothly from the high-dimensional teacher to the low-dimensional student, and lets the student of the last iteration better inherit the performance of the teacher of the first iteration. To obtain better distillation performance, the invention further provides a novel soft-label weight adaptive adjustment method. Early in training, hard-label loss is given greater weight and soft-label loss lesser weight, so the hard labels dominate the optimization of the model. As training proceeds, the weight of the soft-label loss gradually increases, so that it dominates the optimization in the later period. This adaptive method solves the problem that the model cannot converge well because hard-label and soft-label losses optimize in different directions. To further speed up the training process, the invention uses a single teacher for distillation and fixes the teacher model's parameters during distillation.
Disclosure of Invention
In order to solve the above problems, the invention provides a fast knowledge graph embedding model compression method based on iterative distillation. First, the invention provides a soft-label weight adaptive distillation mechanism: while the teacher model guides the training of the student model, the weight of the soft-label loss is gradually increased according to the change of the distillation loss, resolving the inconsistency between the optimization directions of hard-label loss and soft-label loss and improving training efficiency. Furthermore, the invention provides an iterative distillation framework in which a knowledge graph embedding model alternately acts as student model and teacher model during iterative distillation: the student model generated in the previous iteration becomes the teacher model of the next iteration and guides the training of a lower-dimensional student. Knowledge can thus migrate smoothly between the high-dimensional teacher model and the low-dimensional student model while the student models maintain good performance. Finally, to further accelerate training, the invention uses a single teacher for distillation and fixes the teacher's parameters during distillation. The method preserves the excellent performance of the distilled and compressed knowledge graph embedding model while maintaining inference speed, reduces training time by 50%, has the advantage of fast training, and can meet the real-world requirement of rapidly updating large-scale knowledge graph embedding models.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a rapid knowledge graph embedding model compression method based on iterative distillation comprises the following specific steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model;
training a high-embedding-dimension teacher model in preparation for guiding a low-embedding-dimension student model;
2) Soft-label weight adaptive distillation;
providing a soft-label weight adaptive distillation mechanism: while the teacher model guides the training of the student model, the weight of the soft-label loss is gradually increased according to the change of the distillation loss, solving the inconsistency between the optimization directions of hard-label loss and soft-label loss;
3) Iterative distillation;
providing an iterative distillation framework in which the knowledge graph embedding model alternately becomes student model and teacher model during the iterative distillation process; to accelerate training, a single teacher is used for distillation and the parameters of the teacher model are fixed during distillation;
4) Prediction with the low-dimensional student knowledge graph embedding model.
As a further improvement of the invention, the process of training the high-embedding-dimension teacher model in step 1) is as follows:

First, given a set of entities E and relations R, a knowledge graph G is expressed as a set of triples, with (h, r, t) representing a triple, i.e., head entity, relation, and tail entity. The knowledge graph embedding model takes the triples formed by head entities, relations, and tail entities in the knowledge graph as positive triples, while randomly replacing the head entity or tail entity of a positive triple to form negative triples.

The knowledge graph embedding model then embeds each triple as a vector, and uses a scoring function S to calculate a score for each triple's vector representation.

Different knowledge graph embedding models have different scoring functions. After the score of each triple is obtained, the loss function adopts binary cross-entropy loss:

$$\mathcal{L}_{hard} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S(h,r,t)\big) + (1-y) \log\big(1-\sigma(S(h,r,t))\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the softmax function that normalizes triple scores into probabilities.

After the training of the high-dimensional teacher model is finished, the trained high-dimensional teacher model is saved.
As a further improvement of the invention, the specific steps of step 2) are as follows:

Given a triple (h, r, t), it is first input simultaneously into the teacher model and the student model and encoded by each. Let the scoring result of the teacher model's scoring function be S_T(h,r,t) and that of the student model be S_S(h,r,t). The hard-label loss during distillation is the original loss of the student model:

$$\mathcal{L}_{hard} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S_S(h,r,t)\big) + (1-y) \log\big(1-\sigma(S_S(h,r,t))\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the softmax function. The soft-label loss adopts the Huber loss to measure the difference between the teacher and student score distributions:

$$\mathcal{L}_{soft} = \begin{cases} \dfrac{1}{2}\big(S_T - S_S\big)^2, & |S_T - S_S| \le \delta \\[4pt] \delta\big(|S_T - S_S| - \tfrac{1}{2}\delta\big), & \text{otherwise} \end{cases}$$

Finally, the total distillation loss is the weighted sum of the hard-label loss and the soft-label loss:

$$\mathcal{L} = (1-\alpha)\,\mathcal{L}_{hard} + \alpha\,\mathcal{L}_{soft}$$

where α is the soft-label weight that balances the soft-label loss and the hard-label loss.
As a further improvement of the invention, in step 2) only the student model is trained during the distillation process of the soft-label weight adaptive distillation mechanism, and the model parameters of the teacher model are fixed.
As a further improvement of the invention, in step 2) the weight α of the soft label is dynamically adjusted during the distillation process of the soft-label weight adaptive distillation mechanism, dividing the complete training process into two stages:

in the first stage, hard-label loss is dominant; the soft-label loss weight is assigned a small initial value and gradually increases;

in the second stage, the soft-label weight is fixed;

define the number of complete training rounds as M; the soft-label weight of the m-th round is:

$$\alpha_m = \begin{cases} \alpha_0 + k\,m, & m \le M/p \\ \alpha_{M/p}, & m > M/p \end{cases}$$

where the value of the parameter k is dynamically adjusted during training to ensure that α_m remains within [α_0, 1]; the soft-label time control parameter p controls the fraction of training over which the soft-label weight is adjusted, and α_0 is the initial soft-label weight.
As a further improvement of the invention, the specific steps of step 3) are as follows:

Define the embedding dimension of the teacher model in the k-th iteration as d_{k-1} and the embedding dimension of the student model as d_k; the compression ratio c of each iteration is defined as:

$$c = \frac{d_{k-1}}{d_k}$$

This fixed compression ratio c is then used for model compression in every iteration. In the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration serves as the teacher of the k-th iteration. The hard-label loss of the k-th iteration is defined as:

$$\mathcal{L}_{hard}^{(k)} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S_S^{(k)}(h,r,t)\big) + (1-y) \log\big(1-\sigma(S_S^{(k)}(h,r,t))\big) \,\Big]$$

where S_S^{(k)} is the scoring result of the student model's scoring function in the k-th iteration. The soft-label loss of the k-th iteration is defined as:

$$\mathcal{L}_{soft}^{(k)} = \mathrm{Huber}\big(S_T^{(k)}(h,r,t),\, S_S^{(k)}(h,r,t)\big)$$

where S_T^{(k)} is the scoring result of the teacher model's scoring function in the k-th iteration. The total loss of round k is defined as:

$$\mathcal{L}^{(k)} = (1-\alpha)\,\mathcal{L}_{hard}^{(k)} + \alpha\,\mathcal{L}_{soft}^{(k)}$$

Define the total number of iterations as N; the final compression ratio A is then:

$$A = c^{N}$$

The final compression ratio A is preset, so the condition for stopping iteration is that the student dimension d_N satisfies:

$$d_N = \frac{d_0}{A}$$
as a further improvement of the invention, the required student model dimension and the compression ratio of each iteration are preset during model compression in the step 3), and then the iteration times are determined by the teacher model dimension, the student model dimension and the compression ratio of each iteration.
As a further improvement of the invention, the specific steps of step 4) are as follows:

Through step 3), a low-dimensional student model of final dimension d_N is generated; this is the final result of the compressed knowledge graph embedding model, and once the distilled low-dimensional model is obtained, prediction with the low-dimensional student model can be carried out.
As a further improvement of the invention, the evaluation indexes adopted for low-dimensional student model prediction in step 4) are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1), and the average proportion of triples ranked less than or equal to 10 (Hits@10); the time required for model training and prediction is directly adopted to evaluate the model's speed.
As a further improvement of the invention, step 4) carries out a prediction phase of low-dimensional student model prediction. Given a query q = (h, r, ?), where h and r respectively represent the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation.

The specific prediction process is as follows. First, the head entity and the relation in the given query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors.

Then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring, the scores of all triples are sorted, and the indexes are calculated as:

$$MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}, \qquad Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(rank_i \le n\big)$$

where N indicates the number of all triples and 𝟙(·) is an indicator function whose value is 1 when the condition is satisfied and 0 otherwise.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a rapid knowledge map embedded model compression method based on iterative distillation, which reduces the dimension difference between a teacher model and a student model of each distillation by the iterative distillation method so as to enable knowledge to be smoothly transferred from the teacher model to the student model. Meanwhile, the soft label weight self-adaptive mechanism solves the problem that training efficiency is influenced due to inconsistent optimization directions between hard label loss and soft label loss by dynamically adjusting the weight of the soft label. Furthermore, a single teacher distillation strategy is adopted in the distillation process, parameters of a teacher model are fixed, and the distillation speed is obviously improved. Verification is carried out on a link prediction task, and the rapid knowledge map embedded model compression method based on iterative distillation provided by the invention is proved to have better universality and can ensure high efficiency in practical application. Therefore, the invention has good application prospect and popularization range.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention;
FIG. 2 is a model flow diagram of the method of the present invention;
FIG. 3 is a training configuration diagram of the method of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a rapid knowledge map embedded model compression method based on iterative distillation, which reduces the dimension difference between a teacher model and a student model of each distillation by the iterative distillation method so as to enable knowledge to be smoothly transferred from the teacher model to the student model. Meanwhile, the soft label weight self-adaptive mechanism solves the problem that training efficiency is influenced due to inconsistent optimization directions between hard label loss and soft label loss by dynamically adjusting the weight of the soft label. Furthermore, a single teacher distillation strategy is adopted in the distillation process, parameters of a teacher model are fixed, and the distillation speed is obviously improved. Verification is carried out on a link prediction task, and the rapid knowledge map embedded model compression method based on iterative distillation provided by the invention is proved to have better universality and can ensure high efficiency in practical application. Therefore, the invention has good application prospect and popularization range.
As a specific embodiment of the present invention, a logic flow chart is shown in FIG. 1, a model flow chart in FIG. 2, and a training configuration chart in FIG. 3. The fast knowledge graph embedding model compression method based on iterative distillation comprises the following steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model.
The process of knowledge graph embedding is, given the set of triples in a knowledge graph, to embed all entities and relations in it as vectors in a continuous space. When training the high-dimensional teacher model, a knowledge graph embedding model with a higher embedding dimension is selected for pre-training. A higher-dimensional model is selected as the teacher because, for most knowledge graph embedding models, the higher the embedding dimension, the stronger the expressive capability of the model.
The process of training the high-dimensional teacher model is as follows. First, given a set of entities E and relations R, a knowledge graph G may be represented as a set of triples; generally, (h, r, t) denotes a triple (head entity, relation, tail entity). The knowledge graph embedding model takes the original triples formed by head entities, relations, and tail entities in the knowledge graph as positive triples, while randomly replacing the head entity or tail entity of a positive triple to form negative triples. Then, the knowledge graph embedding model embeds each triple as a vector and uses a scoring function S to calculate a score for each triple's vector representation. Different knowledge graph embedding models have different scoring functions. After the score of each triple is obtained, the loss function adopts binary cross-entropy loss:

$$\mathcal{L}_{hard} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S(h,r,t)\big) + (1-y) \log\big(1-\sigma(S(h,r,t))\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the softmax function. After the training of the high-dimensional teacher model is finished, the trained high-dimensional teacher model is saved. By training the high-dimensional teacher model, a knowledge graph embedding model with better performance is obtained for distilling the low-dimensional knowledge graph embedding model.
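By way of illustration only, a minimal sketch of this pre-training objective follows. The TransE-style scoring function, the class and function names, and the use of a sigmoid-based binary cross-entropy are assumptions made for the sketch, not details prescribed by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KGEModel(nn.Module):
    """Minimal TransE-style knowledge graph embedding model (illustrative)."""
    def __init__(self, num_entities: int, num_relations: int, dim: int):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def score(self, h, r, t):
        # Higher score = more plausible triple (negative TransE distance).
        return -torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=1, dim=-1)

def hard_label_loss(model, pos_triples, neg_triples):
    """Binary cross-entropy over positive triples (y = 1) and randomly
    corrupted negative triples (y = 0), as in step 1)."""
    pos_scores = model.score(*pos_triples)
    neg_scores = model.score(*neg_triples)
    logits = torch.cat([pos_scores, neg_scores])
    labels = torch.cat([torch.ones_like(pos_scores), torch.zeros_like(neg_scores)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```

A negative triple is produced by replacing the head- or tail-entity id of a positive triple with a random entity id before calling `hard_label_loss`.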
2) Soft label weight adaptive distillation.
Soft-label weight adaptive distillation uses the high-dimensional knowledge graph embedding model trained in step 1) as the teacher model; the aim is to have the low-dimensional student model learn by imitating the output of the high-dimensional teacher, so that the student achieves performance matching the teacher.
In the distillation of traditional knowledge graph embedding models, the inconsistent optimization directions of soft-label loss and hard-label loss make the model hard to converge during training. To solve this problem, soft-label weight adaptive distillation dynamically adjusts the weight of the soft label during distillation according to the change of the total loss: the soft label carries a small weight early in training, so the hard label dominates the model's optimization, and as training progresses the soft-label weight is gradually increased so that the soft-label loss gradually comes to dominate the optimization. This strategy makes the model converge in a better direction and accelerates its convergence.
Specifically, a given triple (h, r, t) is first input simultaneously into the teacher model and the student model and encoded by each. Let the scoring result of the teacher model's scoring function be S_T(h,r,t) and that of the student model be S_S(h,r,t). The hard-label loss during distillation is the original loss of the student model:

$$\mathcal{L}_{hard} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S_S(h,r,t)\big) + (1-y) \log\big(1-\sigma(S_S(h,r,t))\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the softmax function. The soft-label loss adopts the Huber loss to measure the difference between the teacher and student score distributions:

$$\mathcal{L}_{soft} = \begin{cases} \dfrac{1}{2}\big(S_T - S_S\big)^2, & |S_T - S_S| \le \delta \\[4pt] \delta\big(|S_T - S_S| - \tfrac{1}{2}\delta\big), & \text{otherwise} \end{cases}$$

Finally, the total distillation loss is the weighted sum of the hard-label loss and the soft-label loss:

$$\mathcal{L} = (1-\alpha)\,\mathcal{L}_{hard} + \alpha\,\mathcal{L}_{soft}$$

where α is the soft-label weight that balances the soft-label loss and the hard-label loss. Note that only the student model is trained during distillation; the model parameters of the teacher model are fixed.
To further balance the hard-label loss and the soft-label loss during distillation, the weight of the soft label is dynamically adjusted. At the beginning of training, the weight of the soft-label loss should be as small as possible; then, as training proceeds, the soft-label weight is continuously increased and finally fixed. Specifically, the complete training process is divided into two stages. In the first stage, hard-label loss dominates: the soft-label loss weight is assigned a small initial value and gradually increases. In the second stage, the soft-label weight is fixed. Define the number of complete training rounds as M; the soft-label weight of the m-th round is:

$$\alpha_m = \begin{cases} \alpha_0 + k\,m, & m \le M/p \\ \alpha_{M/p}, & m > M/p \end{cases}$$

where the value of the parameter k is dynamically adjusted during training so that α_m remains within [α_0, 1]; the soft-label time control parameter p controls the fraction of training over which the soft-label weight is adjusted, and α_0 is the initial soft-label weight. Soft-label weight adaptive distillation solves the problem that the model is difficult to converge at the start of training because hard-label and soft-label losses optimize in different directions.
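For concreteness, here is a sketch of the distillation loss with the adaptive soft-label weight. The linear warm-up schedule and the (1 − α)/α convex weighting are one reading of the description above and should be treated as assumptions; for simplicity the Huber loss is applied directly to the raw teacher and student scores.

```python
import torch.nn.functional as F

def soft_label_weight(m: int, M: int, alpha0: float = 0.1, p: float = 2.0) -> float:
    """Two-stage schedule: ramp alpha linearly from alpha0 toward 1 over the
    first M/p rounds, then hold it fixed for the remaining rounds (assumed form)."""
    ramp = M / p
    k = (1.0 - alpha0) / ramp        # chosen so alpha stays within [alpha0, 1]
    return alpha0 + k * min(m, ramp)

def distillation_loss(student_scores, teacher_scores, labels, alpha: float,
                      delta: float = 1.0):
    """Weighted sum of hard-label BCE and soft-label Huber loss. Teacher scores
    are detached: only the student is trained, the teacher stays frozen."""
    hard = F.binary_cross_entropy_with_logits(student_scores, labels)
    soft = F.huber_loss(student_scores, teacher_scores.detach(), delta=delta)
    return (1.0 - alpha) * hard + alpha * soft
```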
3) Carrying out iterative distillation.
Iterative distillation repeats the operations of steps 1) and 2), taking the student model generated in the previous iteration as the teacher model of the next iteration, so that the dimension gap between student and teacher in each distillation is reduced and knowledge can be transferred more smoothly. Previous methods distilled low-dimensional students directly from high-dimensional teachers; however, when distilling knowledge graph embedding models, the better-performing teacher does not necessarily teach a better-performing student, because an excessive embedding-dimension gap between teacher and student produces output distributions that differ too much, making it difficult to transfer knowledge directly from teacher to student. The invention's iterative distillation strategy does not distill the original teacher directly into the final student; instead, it gradually reduces the embedding dimension of the model, shrinking the dimension gap in each distillation so that knowledge transfers smoothly from teacher to student. Through this iterative distillation, the finally generated student model inherits the performance of the original teacher model well.
Specifically, define the embedding dimension of the teacher model in the k-th iteration as d_{k-1} and the embedding dimension of the student model as d_k; the compression ratio c of each iteration is defined as:

$$c = \frac{d_{k-1}}{d_k}$$

This fixed compression ratio c is used for model compression in every iteration. In the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration serves as the teacher of the k-th iteration. The hard-label loss of the k-th iteration is defined as:

$$\mathcal{L}_{hard}^{(k)} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S_S^{(k)}(h,r,t)\big) + (1-y) \log\big(1-\sigma(S_S^{(k)}(h,r,t))\big) \,\Big]$$

where S_S^{(k)} is the scoring result of the student model's scoring function in the k-th iteration. The soft-label loss of the k-th iteration is defined as:

$$\mathcal{L}_{soft}^{(k)} = \mathrm{Huber}\big(S_T^{(k)}(h,r,t),\, S_S^{(k)}(h,r,t)\big)$$

where S_T^{(k)} is the scoring result of the teacher model's scoring function in the k-th iteration. The total loss of round k is defined as:

$$\mathcal{L}^{(k)} = (1-\alpha)\,\mathcal{L}_{hard}^{(k)} + \alpha\,\mathcal{L}_{soft}^{(k)}$$

Define the total number of iterations as N; the final compression ratio A is then:

$$A = c^{N}$$

The final compression ratio A is preset, so the condition for stopping iteration is that the student dimension d_N satisfies:

$$d_N = \frac{d_0}{A}$$
In practical application, the required student model dimension and the per-iteration compression ratio are preset before compression; the number of iterations is then determined by the teacher model dimension, the student model dimension, and the per-iteration compression ratio.
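A sketch of the outer iterative-distillation loop follows; `train_teacher` and `distill_once` are hypothetical, caller-supplied procedures standing in for steps 1) and 2). With d_teacher = 512, d_student = 32 and c = 2 this gives N = log2(512/32) = 4 iterations (512 → 256 → 128 → 64 → 32), matching the embodiment below.

```python
import math
from typing import Callable

def iterative_distillation(train_teacher: Callable[[int], object],
                           distill_once: Callable[[object, int], object],
                           d_teacher: int, d_student: int, c: int = 2):
    """Compress a d_teacher-dimensional model to d_student dimensions,
    dividing the embedding dimension by c in every iteration."""
    N = round(math.log(d_teacher / d_student, c))  # total iterations; A = c**N
    model = train_teacher(d_teacher)               # step 1): pre-train the teacher
    dim = d_teacher
    for _ in range(N):
        dim //= c                                  # student dimension of this iteration
        model = distill_once(model, dim)           # step 2): single-teacher distillation,
                                                   # teacher parameters kept frozen;
                                                   # the student becomes the next teacher
    return model                                   # final low-dimensional student
```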
4) Prediction with the low-dimensional student knowledge graph embedding model.
Through step 3), a low-dimensional student model of final dimension d_N is generated as the final result of the compressed knowledge graph embedding model. Once the distilled low-dimensional model is obtained, prediction with the low-dimensional student model can be carried out. The performance of a knowledge graph embedding model on the link prediction task is generally adopted to evaluate its quality; the adopted evaluation indexes are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1), and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is directly adopted to evaluate the model's speed. Specifically, in the prediction phase, given a query q = (h, r, ?), where h and r respectively represent the head entity and the relation, the goal is to predict the tail entity t. The specific prediction process is as follows: first, the head entity and the relation in the given query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors; then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring; finally, the scores of all triples are sorted and the indexes are calculated as:

$$MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}, \qquad Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(rank_i \le n\big)$$

where N indicates the number of all triples and 𝟙(·) is an indicator function whose value is 1 when the condition is satisfied and 0 otherwise.
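A compact sketch of this ranking-based evaluation follows; the function and variable names are chosen for illustration, and `model.score` is assumed to behave like the scoring function in the earlier sketch.

```python
import torch

def evaluate_link_prediction(model, test_triples, num_entities, n: int = 10):
    """Compute MRR and Hits@n by ranking every candidate tail entity
    for each (h, r, ?) query in test_triples (integer-id triples)."""
    all_tails = torch.arange(num_entities)
    mrr, hits = 0.0, 0.0
    for h, r, t in test_triples:
        h_batch = torch.full((num_entities,), h)
        r_batch = torch.full((num_entities,), r)
        scores = model.score(h_batch, r_batch, all_tails)  # score (h, r, t') for every t'
        # rank of the true tail = 1 + number of candidates scored strictly higher
        rank = 1 + int((scores > scores[t]).sum())
        mrr += 1.0 / rank
        hits += 1.0 if rank <= n else 0.0
    N = len(test_triples)
    return mrr / N, hits / N
```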
[ Example 1 ]
In an embodiment, the iterative-distillation-based fast knowledge graph embedding model compression method is applied to the general datasets FB15K-237 and WN18RR, and training and prediction are performed as described above; all other examples use the same data. The FB15K-237 dataset contains 14,541 entities and 237 relations, with 272,115 triples in the training set, 17,535 in the validation set, and 20,466 in the test set. The WN18RR dataset contains 40,943 entities and 11 relations, with 86,835 triples in the training set, 3,034 in the validation set, and 3,134 in the test set. Because the model has good robustness and generalization, the same hyper-parameter settings can be used across different general datasets.
The specific implementation is as follows. The knowledge graph embedding models selected are TransE, ComplEx, SimplE, and RotatE. The embedding dimension of the initial teacher model's embedding layer is 512, and the embedding dimension of the final distilled student model is 32. The compression ratio c of each distillation layer is set to 2, and the number of iterations N is set to 4. The value of the hyper-parameter p is set to 2. In the training phase, Adagrad is used as the optimizer with a learning rate of 0.1. The batch size of each training run is set to 1024, and the number of training rounds per iteration is set to 1000. Because the model is stable, at most 1000 training rounds are needed for convergence in each iteration, so the same batch and round settings can be adopted for different models and datasets.
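These settings can be collected into a single configuration sketch (the dictionary and its key names are illustrative, not part of the embodiment):

```python
# Hyper-parameters of Example 1, gathered for reference (illustrative names)
config = {
    "models": ["TransE", "ComplEx", "SimplE", "RotatE"],
    "teacher_dim": 512,          # initial teacher embedding dimension
    "student_dim": 32,           # final student embedding dimension
    "compression_ratio": 2,      # c: per-iteration dimension reduction
    "iterations": 4,             # N, since 512 / 2**4 = 32
    "p": 2,                      # soft-label time control parameter
    "optimizer": "Adagrad",
    "learning_rate": 0.1,
    "batch_size": 1024,
    "epochs_per_iteration": 1000,
}
```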
The effectiveness of the iterative-distillation-based fast knowledge graph embedding model compression method is tested with different knowledge graph embedding models, and the final prediction results of the student model are compared with those of student models obtained by other distillation methods. The method reaches the highest performance on all three indexes (MRR, Hits@1, Hits@10), showing that it attains the optimal level in practical application scenarios.
[ Example 2 ]
The iterative-distillation-based fast knowledge graph embedding model compression method has a training-time advantage. First, distillation is performed by a single teacher, which greatly reduces training time compared with a multi-teacher distillation framework. Second, teacher parameters are fixed during the distillation stage, and the dynamic soft-label weight adjustment mechanism accelerates the training process; training time is reduced by 50% on average compared with the best existing method. The more complex the knowledge graph embedding model, the more significantly training time is shortened: taking the more complex embedding model RotatE as an example, training time can be reduced by 63%.
In practical applications, real knowledge graphs often need to be updated, so existing knowledge graph embedding models must be retrained whenever new entities appear. Compared with traditional distillation methods, the iterative-distillation-based fast compression method therefore ensures efficient updating of the knowledge graph embedding model while greatly reducing resource consumption.
[ Example 3 ]
Compared with an undistilled model of the same dimension, the student model generated by each iteration of the iterative-distillation-based fast knowledge graph embedding model compression method performs well. Specifically, a 512-dimensional knowledge graph embedding model is distilled by the iterative method with a per-layer compression ratio of 2 into 256-, 128-, 64-, and 32-dimensional models, and each intermediate model is tested against an undistilled model of the same dimension: the distilled models improve performance by 5%-90%, and the lower the embedding dimension, the more pronounced the improvement. This shows that iterative distillation is highly flexible; in practical applications, the dimension to distill to can be chosen according to actual requirements, and for different target dimensions the iterative distillation method yields a high-performance low-dimensional model.
[ Example 4 ]
The per-iteration compression ratio of the iterative-distillation-based fast knowledge graph embedding model compression method is set to 2, which balances distillation performance and efficiency. Increasing the per-iteration compression ratio yields the required student model in fewer iterations but affects performance to some extent: for example, setting the per-layer compression ratio to 4 or 16 reduces the number of training iterations by 50% or 75%, respectively, but degrades performance by 5%-10%. The smaller the per-layer compression ratio, the more pronounced the effect of iterative distillation. Meanwhile, thanks to the large training-time advantage, different compression ratios can be chosen flexibly according to actual requirements.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.
Claims (10)
1. A fast knowledge graph embedding model compression method based on iterative distillation, comprising the following steps:
1) Pre-training a high-dimensional teacher knowledge graph embedding model;
training a high-embedding-dimension teacher model in preparation for guiding a low-embedding-dimension student model;
2) Soft-label weight adaptive distillation;
providing a soft-label weight adaptive distillation mechanism: while the teacher model guides the training of the student model, the weight of the soft-label loss is gradually increased according to the change of the distillation loss, solving the inconsistency between the optimization directions of hard-label loss and soft-label loss;
3) Iterative distillation;
providing an iterative distillation framework in which the knowledge graph embedding model alternately becomes student model and teacher model during the iterative distillation process; to accelerate training, a single teacher is used for distillation and the parameters of the teacher model are fixed during distillation;
4) Prediction with the low-dimensional student knowledge graph embedding model.
2. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 1, characterized in that the process of training the high-embedding-dimension teacher model in step 1) is as follows:

first, given a set of entities E and relations R, a knowledge graph G is expressed as a set of triples, with (h, r, t) representing a triple, i.e., head entity, relation, and tail entity; the knowledge graph embedding model takes the triples formed by head entities, relations, and tail entities in the knowledge graph as positive triples, while randomly replacing the head entity or tail entity of a positive triple to form negative triples;

the knowledge graph embedding model then embeds each triple as a vector, and uses a scoring function S to calculate a score for each triple's vector representation;

different knowledge graph embedding models have different scoring functions; after the score of each triple is obtained, the loss function adopts binary cross-entropy loss:

$$\mathcal{L}_{hard} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S(h,r,t)\big) + (1-y) \log\big(1-\sigma(S(h,r,t))\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the softmax function;

and after the training of the high-dimensional teacher model is finished, the trained high-dimensional teacher model is saved.
3. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 2, characterized in that the specific steps of step 2) are as follows:

given a triple (h, r, t), it is first input simultaneously into the teacher model and the student model and encoded by each; let the scoring result of the teacher model's scoring function be S_T(h,r,t) and that of the student model be S_S(h,r,t); the hard-label loss during distillation is the original loss of the student model:

$$\mathcal{L}_{hard} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S_S(h,r,t)\big) + (1-y) \log\big(1-\sigma(S_S(h,r,t))\big) \,\Big]$$

where y = 1 for a positive triple, y = 0 for a negative triple, and σ is the softmax function; the soft-label loss adopts the Huber loss to measure the difference between the teacher and student score distributions:

$$\mathcal{L}_{soft} = \begin{cases} \dfrac{1}{2}\big(S_T - S_S\big)^2, & |S_T - S_S| \le \delta \\[4pt] \delta\big(|S_T - S_S| - \tfrac{1}{2}\delta\big), & \text{otherwise} \end{cases}$$

finally, the total distillation loss is the weighted sum of the hard-label loss and the soft-label loss:

$$\mathcal{L} = (1-\alpha)\,\mathcal{L}_{hard} + \alpha\,\mathcal{L}_{soft}$$

where α is the soft-label weight that balances the soft-label loss and the hard-label loss.
4. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 3, characterized in that in step 2) only the student model is trained during the distillation process of the soft-label weight adaptive distillation mechanism, and the model parameters of the teacher model are fixed.
5. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 3, characterized in that in step 2) the weight α of the soft label is dynamically adjusted during the distillation process of the soft-label weight adaptive distillation mechanism, dividing the complete training process into two stages:

in the first stage, hard-label loss is dominant; the soft-label loss weight is assigned a small initial value and gradually increases;

in the second stage, the soft-label weight is fixed;

define the number of complete training rounds as M; the soft-label weight of the m-th round is:

$$\alpha_m = \begin{cases} \alpha_0 + k\,m, & m \le M/p \\ \alpha_{M/p}, & m > M/p \end{cases}$$

where the value of the parameter k is dynamically adjusted during training so that α_m remains within [α_0, 1]; the soft-label time control parameter p controls the fraction of training over which the soft-label weight is adjusted, and α_0 is the initial soft-label weight.
6. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 3, characterized in that the specific steps of step 3) are as follows:

define the embedding dimension of the teacher model in the k-th iteration as d_{k-1} and the embedding dimension of the student model as d_k; the compression ratio c of each iteration is defined as:

$$c = \frac{d_{k-1}}{d_k}$$

this fixed compression ratio c is then used for model compression in every iteration; in the first iteration, the first student model is trained using the pre-trained teacher model; in the k-th iteration, the student generated in the (k-1)-th iteration serves as the teacher of the k-th iteration; the hard-label loss of the k-th iteration is defined as:

$$\mathcal{L}_{hard}^{(k)} = -\sum_{(h,r,t)} \Big[\, y \log \sigma\big(S_S^{(k)}(h,r,t)\big) + (1-y) \log\big(1-\sigma(S_S^{(k)}(h,r,t))\big) \,\Big]$$

where S_S^{(k)} is the scoring result of the student model's scoring function in the k-th iteration; the soft-label loss of the k-th iteration is defined as:

$$\mathcal{L}_{soft}^{(k)} = \mathrm{Huber}\big(S_T^{(k)}(h,r,t),\, S_S^{(k)}(h,r,t)\big)$$

where S_T^{(k)} is the scoring result of the teacher model's scoring function in the k-th iteration; the total loss of round k is defined as:

$$\mathcal{L}^{(k)} = (1-\alpha)\,\mathcal{L}_{hard}^{(k)} + \alpha\,\mathcal{L}_{soft}^{(k)}$$

define the total number of iterations as N; the final compression ratio A is then:

$$A = c^{N}$$

the final compression ratio A is preset, so the condition for stopping iteration is that the student dimension d_N satisfies:

$$d_N = \frac{d_0}{A}$$
7. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 6, characterized in that the required student model dimension and the per-iteration compression ratio are preset before model compression in step 3), and the number of iterations is then determined by the teacher model dimension, the student model dimension, and the per-iteration compression ratio.
8. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 6, characterized in that the specific steps of step 4) are as follows: through step 3), a low-dimensional student model of final dimension d_N is generated as the final result of the compressed knowledge graph embedding model, and after the distilled low-dimensional model is obtained, prediction with the low-dimensional student model is carried out.
9. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 7, characterized in that the evaluation indexes adopted for low-dimensional student model prediction in step 4) are the mean reciprocal rank MRR, the average proportion of triples ranked less than or equal to 1 (Hits@1), and the average proportion of triples ranked less than or equal to 10 (Hits@10), and the time required for model training and prediction is directly adopted to evaluate the model's speed.
10. The iterative-distillation-based fast knowledge graph embedding model compression method according to claim 9, characterized in that step 4) carries out a prediction phase of low-dimensional student model prediction: given a query q = (h, r, ?), where h and r respectively represent the head entity and the relation, the goal of prediction is to predict the tail entity t given the head entity and the relation;

the specific prediction process is as follows: first, the head entity and the relation in the given query q are embedded as vectors, and all candidate tail entities t are simultaneously embedded as vectors;

then, the triples formed by the query q and all candidate tail entities are input into the scoring function for scoring, the scores of all triples are sorted, and the indexes are calculated as:

$$MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}, \qquad Hits@n = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(rank_i \le n\big)$$

where N indicates the number of all triples and 𝟙(·) is an indicator function whose value is 1 when the condition is satisfied and 0 otherwise.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211535321.2A CN115544277A (en) | 2022-12-02 | 2022-12-02 | Rapid knowledge graph embedded model compression method based on iterative distillation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211535321.2A CN115544277A (en) | 2022-12-02 | 2022-12-02 | Rapid knowledge graph embedded model compression method based on iterative distillation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115544277A true CN115544277A (en) | 2022-12-30 |
Family
ID=84722403
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211535321.2A Pending CN115544277A (en) | 2022-12-02 | 2022-12-02 | Rapid knowledge graph embedded model compression method based on iterative distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115544277A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674880A (en) * | 2019-09-27 | 2020-01-10 | 北京迈格威科技有限公司 | Network training method, device, medium and electronic equipment for knowledge distillation |
CN112418343A (en) * | 2020-12-08 | 2021-02-26 | 中山大学 | Multi-teacher self-adaptive joint knowledge distillation |
CN113987196A (en) * | 2021-09-29 | 2022-01-28 | 浙江大学 | Knowledge graph embedding compression method based on knowledge graph distillation |
CN114386409A (en) * | 2022-01-17 | 2022-04-22 | 深圳大学 | Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium |
Non-Patent Citations (1)
Title |
---|
Ge Shiming et al.: "Face Recognition Based on Deep Feature Distillation", Journal of Beijing Jiaotong University * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116402116A (en) * | 2023-06-05 | 2023-07-07 | 山东云海国创云计算装备产业创新中心有限公司 | Pruning method, system, equipment, medium and image processing method of neural network |
CN116402116B (en) * | 2023-06-05 | 2023-09-05 | 山东云海国创云计算装备产业创新中心有限公司 | Pruning method, system, equipment, medium and image processing method of neural network |
CN116415005A (en) * | 2023-06-12 | 2023-07-11 | 中南大学 | Relationship extraction method for academic network construction of scholars |
CN116415005B (en) * | 2023-06-12 | 2023-08-18 | 中南大学 | Relationship extraction method for academic network construction of scholars |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |