CN114386409A - Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium - Google Patents

Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Info

Publication number
CN114386409A
CN114386409A (application CN202210051393.3A)
Authority
CN
China
Prior art keywords
model
word segmentation
attention
distillation
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210051393.3A
Other languages
Chinese (zh)
Inventor
蔡树彬
何日安
明仲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202210051393.3A priority Critical patent/CN114386409A/en
Publication of CN114386409A publication Critical patent/CN114386409A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a self-distillation Chinese word segmentation method based on an attention mechanism, a terminal and a storage medium. The method comprises the following steps: introducing the preprocessed training set into a pre-training model; obtaining a teacher model through iterative training; acquiring an attention weight matrix through the teacher model and the student model; introducing the attention weight matrix into the knowledge distillation process and training the student model in a targeted manner; and verifying the overall trained model on a validation set to obtain the distilled Chinese word segmentation model. The invention optimizes the model's own parameter structure through self-distillation and further uses the attention mechanism to learn the distilled knowledge with emphasis, thereby improving the model's ability to recognize out-of-vocabulary (unregistered) words and better solving the problem that existing Chinese word segmentation models cannot process such words accurately.

Description

Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a self-distillation Chinese word segmentation method based on an attention mechanism, a terminal and a storage medium.
Background
Most existing Chinese word segmentation models treat segmentation as a sequence labeling task. From input to final output, the data passes through three main components: an embedding layer, an encoder and a decoder. The embedding layer (information embedding) converts each input character into a distributed representation that a computer can understand and operate on; the encoder encodes the embedding information so that the computer can capture the relations between the embeddings through specific operations; and the decoder decodes the encoded information and restores it to character information that humans can understand.
However, most research work focuses on enriching the input information (for example, with n-gram word information) and on designing more complex model structures to further improve performance on specific tasks; few researchers design methods that optimize the parameter structure of the model itself for this purpose. In addition, although existing models achieve high precision and recall (F1 score), their ability to recognize out-of-vocabulary words (i.e., words that are not included in the segmentation vocabulary but must nevertheless be segmented out, including proper nouns, abbreviations, newly coined words and the like) still needs to be improved.
for example, a sentence of "a heavy load is difficult to meet an heaven and a visitor" needs to be participled, and an unknown word of the heaven and the visitor "is difficult to be accurately divided into the word" heaven and the visitor "through the existing model; or the word segmentation processing is required to be carried out on the 'swallow in the sky of the building', and the unlisted word 'swallow in the building' is also difficult to be accurately segmented into 'swallow in the building' through the existing model.
Therefore, the prior art has yet to be improved.
Disclosure of Invention
The invention aims to solve the technical problem that the existing Chinese word segmentation model cannot accurately process unregistered words.
The technical scheme adopted by the invention for solving the technical problem is as follows:
in a first aspect, the present invention provides a self-distillation Chinese word segmentation method based on an attention mechanism, which comprises the following steps:
introducing the preprocessed training set into a pre-training model;
obtaining a teacher model through iterative training;
acquiring an attention weight matrix through the teacher model and the student model;
introducing the attention weight matrix into a knowledge distillation process, and performing targeted learning training on the student model;
and verifying the whole model obtained by learning and training through a verification set to obtain the distilled Chinese word segmentation model.
In one implementation, before the preprocessed training set is introduced into the pre-training model, the method includes:
acquiring a Chinese word segmentation dictionary from an original training set;
randomly extracting data of a first proportion from the original training set to serve as the training set;
randomly extracting a second proportion of data from the original training set as the validation set.
In one implementation, the introducing of the preprocessed training set into the pre-training model further includes:
and converting the character strings in the training set into character vectors, and combining the character vectors with position vectors for expressing character positions to obtain the preprocessed training set.
In one implementation, the obtaining of a teacher model through iterative training includes:
verifying the student model on the validation set, and judging whether the F1 score obtained by the student model reaches a new historical best;
if so, saving the student model as a teacher model in the next iteration process.
In one implementation, the obtaining of the attention weight matrix by the teacher model and the student model includes:
calculating first difference information between a predicted word segmentation result and a real word segmentation result output by the student model;
calculating second difference information between the predicted word segmentation result and the real word segmentation result output by the teacher model;
obtaining the attention weight matrix through the first difference information and the second difference information.
In one implementation, the introducing of the attention weight matrix into the knowledge distillation process for targeted learning training of the student model includes:
calculating the overall loss of the student model through the attention weight matrix;
and reversely propagating the overall loss to the student model so as to update the parameter information of each node in the student model.
In one implementation, the calculating of the overall loss of the student model by the attention weight matrix includes:
interacting the attention weight matrix with a predicted word segmentation result output by the student model and a predicted word segmentation result output by the teacher model respectively to obtain interaction information;
carrying out knowledge distillation according to the interactive information, and calculating distillation loss;
calculating the overall loss of the student model by the distillation loss and the conventional cross entropy loss.
In one implementation, the attention-based self-distillation Chinese word segmentation method further comprises:
and performing a Chinese word segmentation test on the overall model after the iterative training is finished.
In a second aspect, the present invention provides a terminal, comprising: a processor and a memory, the memory storing an attention-based self-distillation Chinese word segmentation program which, when executed by the processor, implements the attention-based self-distillation Chinese word segmentation method according to the first aspect.

In a third aspect, the present invention provides a storage medium, which is a computer-readable storage medium storing an attention-based self-distillation Chinese word segmentation program which, when executed by a processor, implements the attention-based self-distillation Chinese word segmentation method according to the first aspect.
The invention adopts the technical scheme and has the following effects:
according to the self-distillation learning system, an attention mechanism is introduced into a self-distillation process, difference information between knowledge obtained by a model and actual knowledge of a corresponding sample is compared firstly, and then the difference information is converted into a specific attention value through a plurality of attention strategies, so that the student model can learn the knowledge from a teacher model in a targeted manner, the internal parameter structure of the student model is further optimized, and meanwhile, label information of the sample is integrated into a training process of a specific task, and the knowledge expression capacity of the model is further improved. The invention achieves the aim of optimizing the structure of the model by self-distillation, and further utilizes the attention system to study the distilled knowledge with emphasis, thereby improving the capability of the model for identifying the unregistered words and better solving the problem that the traditional Chinese word segmentation model cannot accurately process the unregistered words.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 is a flow diagram of a method for self-distilling Chinese word segmentation based on an attention mechanism in one implementation of the invention.
FIG. 2 is a schematic structural diagram of a self-distillation Chinese word segmentation model based on an attention mechanism in an implementation mode of the invention.
FIG. 3 is a diagram illustrating character weights in an implementation of the present invention.
FIG. 4 is a schematic illustration of the effect of self-distillation on the overall model knowledge representation capability in one implementation of the present invention.
FIG. 5 is a schematic diagram illustrating the effect of optimizing a model structure in one implementation of the invention.
Fig. 6 is a functional schematic of a terminal in one implementation of the invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
Exemplary method
The inventors have found that tasks in the field of natural language processing are usually completed with neural networks based on deep learning, whose strong learning capability can solve large-scale, accurate classification problems. Chinese sentences are harder to segment accurately than English sentences because there are no spaces between words, and most existing Chinese word segmentation models rely on additional data such as word information (n-grams) to further improve the segmentation effect; few researchers design methods that optimize the parameter structure of the model itself for this purpose. In addition, although existing models achieve high precision and recall (F1 score), their ability to recognize out-of-vocabulary words (i.e., words that are not included in the segmentation vocabulary but must nevertheless be segmented out, including proper nouns, abbreviations, newly coined words and the like) still needs to be improved.
for example, a sentence of "a heavy load is difficult to meet an heaven and a visitor" needs to be participled, and an unknown word of the heaven and the visitor "is difficult to be accurately divided into the word" heaven and the visitor "through the existing model; or the word segmentation processing is required to be carried out on the 'swallow in the sky of the building', and the unlisted word 'swallow in the building' is also difficult to be accurately segmented into 'swallow in the building' through the existing model.
In order to solve these problems, in the embodiments of the present application an attention mechanism is introduced into the self-distillation process: the difference information between the knowledge acquired by a model and the actual knowledge of the corresponding sample is first compared, and this difference information is then converted into concrete attention values through several attention strategies, so that the student model can learn knowledge from the teacher model in a targeted manner and its internal parameter structure is further optimized; at the same time, the label information of the samples is incorporated into the training process of the specific task, further improving the knowledge representation capability of the model.
As shown in fig. 1, an embodiment of the present invention provides a self-distillation Chinese word segmentation method based on an attention mechanism, which includes the following steps:
and S100, introducing the preprocessed training set into a pre-training word segmentation model.
Specifically, a Chinese word segmentation dictionary is obtained from the original training set and is used for evaluating and labeling out-of-vocabulary words. At the same time, data in a first proportion (for example, 90%) is randomly extracted from the original training set as the training set used to train the model, and data in a second proportion (100% minus the first proportion, for example 10%) is randomly extracted as the validation set; note that the training set and the validation set do not overlap. In the deep learning process of a neural network, the training set serves as the data samples for model fitting and can therefore be regarded as the learning "textbook"; the validation set is a sample set held out during model training that can be used to tune the hyper-parameters of the model and to make a preliminary evaluation of its capability, and can therefore be regarded as the "exercises" done after a stage of learning. A minimal sketch of this data preparation is given after the steps below.
That is, in one implementation manner of the present embodiment, step S100 includes the following steps before:
Step S001, acquiring a Chinese word segmentation dictionary for evaluating out-of-vocabulary words from the original training set;
Step S002, randomly extracting data in a first proportion from the original training set as the training set;
Step S003, randomly extracting data in a second proportion from the original training set as the validation set.
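The following is a minimal sketch of the dictionary extraction and the train/validation split; the whitespace-segmented corpus format, the 90%/10% split and the helper names used here are assumptions made for illustration, not details given in the text:

```python
import random

def prepare_data(original_path, first_proportion=0.9, seed=0):
    """Build a segmentation dictionary and a non-overlapping train/validation split."""
    with open(original_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]

    # Chinese word segmentation dictionary: every word seen in the original training set,
    # later used to evaluate and label out-of-vocabulary words.
    dictionary = {word for sent in sentences for word in sent.split()}

    random.seed(seed)
    random.shuffle(sentences)
    cut = int(len(sentences) * first_proportion)                  # first proportion, e.g. 90%
    train_set, validation_set = sentences[:cut], sentences[cut:]  # disjoint by construction
    return dictionary, train_set, validation_set
```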
After the training set is obtained, it needs to be preprocessed. In one implementation of this embodiment, the character strings in the training set are converted into character vectors; since the position of a character in a sentence is also relevant to semantic judgment, each character vector is combined with a position vector expressing the character's position, thereby obtaining the preprocessed training set.
That is, in an implementation manner of the present embodiment, step S100 further includes the following steps:
and S003, converting the character strings in the training set into character vectors, and combining the character vectors with position vectors for expressing character positions to obtain the preprocessed training set.
As shown in fig. 2, in one implementation of this embodiment, the preprocessed training set is introduced into a pre-training model, and the pre-training model adopted here is a BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model; instead of pre-training with a traditional unidirectional language model or a shallow concatenation of two unidirectional language models, it uses a masked language model (MLM) to generate deep bidirectional language representations that fuse left and right context information, and it is therefore widely used in the field of natural language processing. A sketch of the character-level preprocessing and encoding is given below.
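The text does not name a concrete BERT implementation; purely as an illustration, the sketch below assumes the HuggingFace `transformers` library, the `bert-base-chinese` checkpoint and a simple 4-tag (BMES) labeling head — the label scheme is likewise an assumption:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

sentence = "深圳大学发布新模型"
# Character-level input: each Chinese character becomes one token; BERT combines the
# character (token) embedding with a position embedding internally, which corresponds
# to the "character vector + position vector" preprocessing described above.
inputs = tokenizer(list(sentence), is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state     # (1, seq_len, hidden_size)

# A per-character tagging head maps each contextual vector to a segmentation label.
tagging_head = torch.nn.Linear(encoder.config.hidden_size, 4)  # B / M / E / S
logits = tagging_head(hidden)                        # (1, seq_len, 4)
```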
By introducing the preprocessed training set into the pre-trained Chinese word segmentation model, iterative training can be performed by using the pre-trained Chinese word segmentation model and the preprocessed training set, so that a required teacher model is obtained.
As shown in fig. 1, in one implementation manner of the embodiment of the present invention, the self-distillation Chinese word segmentation method based on the attention mechanism further includes the following steps:
and step S200, obtaining a teacher model through iterative training.
In this embodiment, knowledge distillation, which is now widely used in deep learning, employs a teacher model to guide the learning of a student model, so that the student model can learn "dark knowledge" that cannot be learned from the training data alone and the accuracy of the model is improved. For example, suppose a picture of a cat is fed into a classification model: the true label distribution of the input picture is p and the distribution output by the model is q. With conventional training alone, the learning task is simply to minimize the loss of q relative to p. After knowledge distillation is introduced, the picture is also fed into a teacher model whose output distribution is q1; if q1 says the picture is a cat with very high probability, the distillation process smooths this distribution into q2, which expresses that the picture is mostly a cat but also looks a little like a dog. The classification distribution produced by the student model is still q; the difference between q and q2 gives the soft loss (Loss soft), the difference between p and q gives the hard loss (Loss hard), and the overall loss combines Loss soft and Loss hard. Minimizing this overall loss lets the student model keep learning from the teacher model during iterative training, which to some extent reduces the risk of over-fitting (the student merely memorizing the answers).
Knowledge distillation in the prior art falls into three categories: self-distillation, offline distillation and online distillation. The present application adopts self-distillation, in which the teacher model and the student model are the same model.
In this embodiment, because self-distillation learning is used, there is no teacher model during the first full iteration, so that iteration is no different from conventional training. In one implementation of this embodiment, after each complete iteration, whether the current student model should be persisted and used as the teacher model is decided by the following condition: only if the student model's performance on the validation set in the current iteration exceeds the historical best, i.e. its F1 score reaches a new historical high, is the student model saved as the teacher model for the next stage, and the teacher model is not updated again until a better student model appears. Note that if the difference between the current F1 score and the historical best F1 score is less than 0.0001, the model is not saved, i.e. the model is not considered to have improved; this prevents the teacher model from being updated too frequently. A sketch of this update rule is given after the steps below.
In another implementation of this embodiment, when deciding whether the current student model is to be saved as the next-stage teacher model, the student model of the current iteration process is saved as the teacher model of the next iteration process regardless of its performance on the verification set.
That is, in an implementation manner of this embodiment, the step S200 specifically includes the following steps:
step S201, verifying the student model on the validation set, and judging whether the F1 score obtained by the student model reaches a new historical best;
and step S202, if so, saving the student model as a teacher model in the next iteration process.
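A minimal sketch of the teacher-update rule described above, assuming a hypothetical `evaluate_f1` helper that returns the F1 score on the validation set:

```python
import copy

MIN_F1_GAIN = 1e-4  # gains below 0.0001 are not counted as an improvement

def maybe_update_teacher(student, teacher, val_loader, best_f1):
    """Persist the student as the next-stage teacher only on a new historical best F1."""
    f1 = evaluate_f1(student, val_loader)       # hypothetical helper: F1 on the validation set
    if f1 - best_f1 >= MIN_F1_GAIN:
        teacher = copy.deepcopy(student)        # saved as the teacher for the next iteration
        teacher.eval()                          # the teacher only provides soft targets
        best_f1 = f1
    return teacher, best_f1
```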
As shown in fig. 1, in one implementation manner of the embodiment of the present invention, the self-distillation Chinese word segmentation method based on the attention mechanism further includes the following steps:
and step S300, acquiring an attention weight matrix through the teacher model and the student model.
In this embodiment, deep learning must be performed on a large amount of training data, and the input training data contains both information that helps the task and information that does not; the latter is referred to as "noise". Under current limits on computing resources, the attention mechanism is an effective means of improving efficiency when training a model: its core goal is to select, from a large amount of information, the information that is most critical to the current task and to place attention on it.
In the knowledge distillation applied in the solution of the present application, the student model should not accept everything taught by the teacher without question, since the content taught by the teacher model may itself be incorrect. The knowledge from the teacher model therefore needs to be treated differentially, and the attention mechanism is introduced into the knowledge distillation process to further grade the importance of the knowledge so that the student model can learn in a targeted manner.
The predicted segmentation result output by the student model and the predicted segmentation result output by the teacher model are each compared with the real segmentation result, and the resulting difference information is normalized through a step function F (which maps a zero difference to 0 and any non-zero difference to 1), where t denotes the teacher model and s the student model:

η_m = |y_m − y|, m = t, s;

η_m = 1 − F(η_m).

After this normalization, η_m equals 1 for a character whose label model m predicted correctly and 0 otherwise. A small sketch of this computation is given below.
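A toy sketch of the difference/normalization step above; the integer label encoding (a 4-tag scheme) and the variable names are assumptions made purely for illustration:

```python
import torch

def mastery_indicator(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """eta_m: 1.0 where model m predicted the character's label correctly, else 0.0."""
    diff = (y_pred - y_true).abs().float()   # eta_m = |y_m - y|
    step = (diff > 0).float()                # step function F: 1 for any mistake, 0 otherwise
    return 1.0 - step                        # eta_m = 1 - F(eta_m)

# Toy per-character label ids for one sentence, purely for illustration.
y_true    = torch.tensor([0, 1, 2, 3, 0])
y_teacher = torch.tensor([0, 1, 2, 0, 0])    # teacher wrong on the 4th character
y_student = torch.tensor([0, 2, 2, 3, 0])    # student wrong on the 2nd character

eta_t = mastery_indicator(y_teacher, y_true)  # tensor([1., 1., 1., 0., 1.])
eta_s = mastery_indicator(y_student, y_true)  # tensor([1., 0., 1., 1., 1.])
```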
Four different views of knowledge importance lead to four different ways of computing the attention weight from η_t and η_s; each view corresponds to a weighting formula over η_t and η_s (the formulas themselves appear only as equation images in the source and are not reproduced here).

In one implementation of this embodiment, the attention weight is based on the view that "knowledge that is easier to learn is more important, and knowledge that is harder to learn is less important".

In another implementation of this embodiment, the attention weight is based on the view that "knowledge that is easier to learn is more important, knowledge that is harder to learn is less important, and the knowledge mastered by the teacher model is more important than the knowledge mastered by the student model".

In another implementation of this embodiment, the attention weight is based on the view that "knowledge that is easier to learn is more important, knowledge that is harder to learn is less important, and the knowledge mastered by the student model is more important than the knowledge mastered by the teacher model".

In another implementation of this embodiment, the attention weight is based on the view that "knowledge mastered by the teacher model but not by the student model is the most important, while knowledge mastered by the student model but not by the teacher model is the least important".

In summary, four different attention weight calculation methods correspond to the four views of knowledge importance, and all possible weight values for a single character (denoted k) are shown in fig. 3.
That is, in an implementation manner of this embodiment, the step S300 specifically includes the following steps:
step S301, calculating first difference information between a predicted word segmentation result and a real word segmentation result output by the student model;
step S302, calculating second difference information between the predicted word segmentation result and the real word segmentation result output by the teacher model;
step S303, obtaining the attention weight matrix through the first difference information and the second difference information.
By calculating the difference information between the pseudo labels predicted by the student model and the real labels, and between the pseudo labels predicted by the teacher model and the real labels, the student model and the teacher model can fully exchange information and the final weight vector is obtained.
As shown in fig. 1, in one implementation manner of the embodiment of the present invention, the self-distillation Chinese word segmentation method based on the attention mechanism further includes the following steps:
and S400, introducing the attention weight matrix into a knowledge distillation process, and performing targeted learning training on the student model.
In this embodiment, the prediction information output by the student model and by the teacher model is combined with the obtained attention weight matrix, and the distillation loss (corresponding to Loss soft in the example above) is calculated from z^(T) and z^(S) weighted by the attention matrix, where z^(T) is the predicted segmentation result output by the teacher model and z^(S) is the predicted segmentation result output by the student model; the exact loss formula appears only as an equation image in the source. A sketch of a conventional attention-weighted distillation loss is given below.
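Since the source's exact formula is not reproduced, the following is only a sketch of a conventional temperature-scaled, attention-weighted soft-target loss of the kind commonly used in knowledge distillation; the temperature `tau` and the KL-divergence form are assumptions:

```python
import torch
import torch.nn.functional as F

def attention_weighted_distill_loss(z_student, z_teacher, attn_weights, tau=1.0):
    """
    z_student, z_teacher: (batch, seq_len, num_tags) segmentation logits.
    attn_weights:         (batch, seq_len) per-character attention weights.
    """
    log_p_student = F.log_softmax(z_student / tau, dim=-1)
    p_teacher = F.softmax(z_teacher / tau, dim=-1)
    # Per-character divergence between the teacher and student label distributions.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)  # (batch, seq_len)
    # Emphasize the characters that the attention strategy considers important.
    return (attn_weights * kl).mean() * (tau ** 2)
```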
That is, in an implementation manner of this embodiment, the step S400 specifically includes the following steps:
step S401, calculating the overall loss of the student model through the attention weight matrix;
and step S402, reversely transmitting the overall loss to the student model so as to update the parameter information of each node in the student model.
As described in the example above, the main task in the knowledge distillation process is to minimize the overall loss, which combines Loss soft (i.e. the distillation loss described above) and the conventional loss Loss hard obtained without passing the input data through the teacher model. In one implementation of this embodiment, the overall loss L_KD is calculated by the following formula:

L_KD = (1 − α) · L_CE + α · L_Distill

where L_CE is the cross-entropy loss, used to measure the difference between the predicted segmentation distribution output by the student model and the real segmentation distribution, and α is a balance factor. A sketch of this combination is given below.
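Continuing the sketch above (the value α = 0.3 comes from the experimental settings mentioned later; the tensor shapes and helper names are assumptions):

```python
import torch
import torch.nn.functional as F

def overall_loss(z_student, z_teacher, y_true, attn_weights, alpha=0.3):
    # Hard loss L_CE: conventional cross-entropy against the real per-character labels.
    ce = F.cross_entropy(z_student.reshape(-1, z_student.size(-1)), y_true.reshape(-1))
    # Soft loss L_Distill: attention-weighted distillation loss (see the sketch above).
    distill = attention_weighted_distill_loss(z_student, z_teacher.detach(), attn_weights)
    return (1 - alpha) * ce + alpha * distill  # L_KD = (1 - alpha) * L_CE + alpha * L_Distill

# The overall loss is then back-propagated to update every node of the student model:
#   loss = overall_loss(z_s, z_t, labels, weights); loss.backward(); optimizer.step()
```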
That is, in an implementation manner of this embodiment, step S401 specifically includes the following steps:
step S401a, respectively interacting the attention weight matrix with the predicted word segmentation result output by the student model and the predicted word segmentation result output by the teacher model to obtain interaction information;
step S401b, carrying out knowledge distillation according to the interaction information, and calculating the distillation loss;
and step S401c, calculating the overall loss of the student model through the distillation loss and the conventional loss.
Specifically, the student model obtained in one round of training is verified on the validation set, and whether the student model is retained is determined according to the verification result, as described in step S201, thereby completing one full iteration.
Finally, the trained model needs to be tested. When the scheme of this embodiment was applied in experiments, the applicant set some parameters in advance: the balance factor α is 0.3, the number of iterations is 50, the batch size is 16 and the learning rate is 0.00002; in addition, the patience (epochs) parameter is set to 3, i.e. if the difference between the current iteration number and the iteration at which the best student model was last saved is greater than or equal to 3, the training process is terminated early. A sketch of this configuration and the early-stopping rule follows.
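Only the numeric settings below come from the text; the loop structure and the helper names (`run_one_epoch`, `evaluate_f1`) are assumptions made for illustration:

```python
import copy

CONFIG = {
    "alpha": 0.3,          # balance factor between cross-entropy and distillation loss
    "max_epochs": 50,      # number of full iterations
    "batch_size": 16,
    "learning_rate": 2e-5,
    "patience": 3,         # stop early after 3 iterations without a new best student model
    "min_f1_gain": 1e-4,   # improvements below 0.0001 do not count as a new best
}

def train(student, train_loader, val_loader, cfg=CONFIG):
    teacher, best_f1, best_epoch = None, 0.0, 0
    for epoch in range(cfg["max_epochs"]):
        run_one_epoch(student, teacher, train_loader, cfg)   # assumed training routine
        f1 = evaluate_f1(student, val_loader)                # assumed evaluation routine
        if f1 - best_f1 >= cfg["min_f1_gain"]:
            teacher, best_f1, best_epoch = copy.deepcopy(student), f1, epoch
        elif epoch - best_epoch >= cfg["patience"]:
            break                                            # early termination
    return teacher if teacher is not None else student
```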
In one implementation manner of the embodiment of the present invention, the self-distillation Chinese word segmentation method based on the attention mechanism further includes the following steps:
and step S500, performing Chinese word segmentation test on the integral model after the iterative training is finished.
As shown in fig. 4, fig. 4 shows the test results of the models trained with the above parameter settings. It can be seen that the model based on the knowledge-importance view that "knowledge mastered by the teacher model but not by the student model is the most important, while knowledge mastered by the student model but not by the teacher model is the least important" performs better. Overall, saving the student model as the next-stage teacher only after its performance on the validation set reaches a new historical best gives better results than saving the student model as the next-stage teacher regardless of its performance on the validation set.
As shown in fig. 5, fig. 5 compares the model trained by this method with related state-of-the-art techniques of recent years. It can be seen that, by optimizing only the structure of the model itself, this method achieves results close to, or even better than, those of recent state-of-the-art techniques; on this basis, adding extra segmentation-auxiliary information or a segmentation-oriented pre-training model designed by other researchers could yield even better results.
In this embodiment, an attention mechanism is introduced into the self-distillation process: the difference information between the knowledge acquired by a model and the actual knowledge of the corresponding sample is first compared, and this difference information is then converted into concrete attention values through several attention strategies, so that the student model can learn knowledge from the teacher model in a targeted manner and its internal parameter structure is further optimized; at the same time, the label information of the samples is incorporated into the training process of the specific task, further improving the knowledge representation capability of the model. The method thus achieves the goal of optimizing the model's own structure through self-distillation and further uses the attention mechanism to learn the distilled knowledge with emphasis, thereby improving the model's ability to recognize out-of-vocabulary words and well solving the problem that existing Chinese word segmentation models cannot process such words accurately.
Exemplary device
Based on the above embodiments, the present invention further provides a terminal, and a schematic block diagram thereof may be as shown in fig. 6.
The terminal includes a processor, a memory, an interface, a display screen and a communication module, which are connected through a system bus. The processor of the terminal provides computing and control capabilities; the memory of the terminal comprises a storage medium and an internal memory, the storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program stored in the storage medium; the interface is used for connecting external terminal devices, such as mobile terminals and computers; the display screen is used for displaying the corresponding attention-mechanism-based self-distillation Chinese word segmentation information; and the communication module is used for communicating with a cloud server or a mobile terminal.
The computer program is executed by a processor to implement a self-distillation Chinese word segmentation method based on an attention mechanism.
It will be understood by those skilled in the art that the block diagram of fig. 6 shows only a portion of the structure associated with the inventive arrangements and is not intended to limit the terminals to which the inventive arrangements may be applied; a particular terminal may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a terminal is provided, which includes a processor and a memory, wherein the memory stores a self-distillation Chinese word segmentation program based on an attention mechanism, and the program, when executed by the processor, is used for implementing the self-distillation Chinese word segmentation method based on an attention mechanism described above.
In one embodiment, a storage medium is provided, wherein the storage medium is a computer-readable storage medium that stores a self-distillation Chinese word segmentation program based on an attention mechanism, and the program, when executed by a processor, is used for implementing the self-distillation Chinese word segmentation method based on an attention mechanism described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a non-volatile computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory.
In summary, the present invention provides a self-distillation Chinese word segmentation method based on an attention mechanism, a terminal and a storage medium, wherein the method comprises: introducing the preprocessed training set into a pre-trained Chinese word segmentation model; obtaining a teacher model through iterative training; acquiring an attention weight matrix through the teacher model and the student model; introducing the attention weight matrix into the knowledge distillation process and performing targeted learning training on the student model; and verifying the overall trained model on a validation set to obtain the distilled Chinese word segmentation model. The invention achieves the goal of optimizing the model's own structure through self-distillation and further uses the attention mechanism to learn the distilled knowledge with emphasis, thereby improving the model's ability to recognize out-of-vocabulary words and better solving the problem that existing Chinese word segmentation models cannot process such words accurately.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A self-distillation Chinese word segmentation method based on an attention mechanism is characterized by comprising the following steps:
introducing the preprocessed training set into a pre-training model;
obtaining a teacher model through iterative training;
acquiring an attention weight matrix through the teacher model and the student model;
introducing the attention weight matrix into a knowledge distillation process, and performing targeted learning training on the student model;
and verifying the whole model obtained by learning and training through a verification set to obtain the distilled Chinese word segmentation model.
2. The self-distillation Chinese word segmentation method based on attention mechanism of claim 1, wherein before the preprocessed training set is introduced into the pre-training model, the method comprises:
acquiring a Chinese word segmentation dictionary from an original training set;
randomly extracting data of a first proportion from the original training set to serve as the training set;
randomly extracting a second proportion of data from the original training set as the validation set.
3. The method for self-distillation Chinese word segmentation based on attention mechanism of claim 1, wherein the introducing of the preprocessed training set into the pre-training model further comprises:
and converting the character strings in the training set into character vectors, and combining the character vectors with position vectors for expressing character positions to obtain the preprocessed training set.
4. The attention-based self-distilling Chinese word segmentation method of claim 1, wherein the obtaining of the teacher model through iterative training comprises:
verifying the student model on the validation set, and judging whether the F1 score obtained by the student model reaches a new historical best;
if so, saving the student model as a teacher model in the next iteration process.
5. The method for self-distillation Chinese word segmentation based on attention mechanism of claim 1, wherein the obtaining of the attention weight matrix through the teacher model and the student model comprises:
calculating first difference information between a predicted word segmentation result and a real word segmentation result output by the student model;
calculating second difference information between the predicted word segmentation result and the real word segmentation result output by the teacher model;
obtaining the attention weight matrix through the first difference information and the second difference information.
6. The self-distillation Chinese word segmentation method based on the attention mechanism as claimed in claim 1, wherein the attention weight matrix is introduced into a knowledge distillation process, and the student model is subjected to targeted learning training, and the method comprises the following steps:
calculating the overall loss of the student model through the attention weight matrix;
and reversely propagating the overall loss to the student model so as to update the parameter information of each node in the student model.
7. The attention-based self-distillation Chinese word segmentation method of claim 6, wherein the calculating of the overall loss of the student model by the attention weight matrix comprises:
interacting the attention weight matrix with a predicted word segmentation result output by the student model and a predicted word segmentation result output by the teacher model respectively to obtain interaction information;
carrying out knowledge distillation according to the interactive information, and calculating distillation loss;
calculating the overall loss of the student model by the distillation loss and the conventional cross entropy loss.
8. The attention-based self-distillation Chinese word segmentation method as recited in claim 1, wherein the attention-based self-distillation Chinese word segmentation method further comprises:
and performing a Chinese word segmentation test on the overall model after the iterative training is finished.
9. A terminal, comprising: a processor and a memory, the memory storing an attention-based self-distillation Chinese word segmentation program, the attention-based self-distillation Chinese word segmentation program, when executed by the processor, being configured to implement the attention-based self-distillation Chinese word segmentation method according to any one of claims 1 to 8.
10. A storage medium, wherein the storage medium is a computer-readable storage medium, and wherein the storage medium stores an attention-based self-distillation Chinese word segmentation program, and wherein the attention-based self-distillation Chinese word segmentation program, when executed by a processor, is configured to implement the attention-based self-distillation Chinese word segmentation method according to any one of claims 1 to 8.
CN202210051393.3A 2022-01-17 2022-01-17 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium Pending CN114386409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210051393.3A CN114386409A (en) 2022-01-17 2022-01-17 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210051393.3A CN114386409A (en) 2022-01-17 2022-01-17 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114386409A true CN114386409A (en) 2022-04-22

Family

ID=81202756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210051393.3A Pending CN114386409A (en) 2022-01-17 2022-01-17 Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114386409A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544277A (en) * 2022-12-02 2022-12-30 东南大学 Rapid knowledge graph embedded model compression method based on iterative distillation
CN116304029A (en) * 2023-02-22 2023-06-23 北京麦克斯泰科技有限公司 Deep learning model distillation method and system using knowledge isomerism
CN116304029B (en) * 2023-02-22 2023-10-13 北京麦克斯泰科技有限公司 Deep learning model distillation method and system using knowledge isomerism
CN117116408A (en) * 2023-10-25 2023-11-24 湖南科技大学 Relation extraction method for electronic medical record analysis
CN117116408B (en) * 2023-10-25 2024-01-26 湖南科技大学 Relation extraction method for electronic medical record analysis

Similar Documents

Publication Publication Date Title
CN111444340B (en) Text classification method, device, equipment and storage medium
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
WO2022142041A1 (en) Training method and apparatus for intent recognition model, computer device, and storage medium
CN110674880A (en) Network training method, device, medium and electronic equipment for knowledge distillation
CN110717039A (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN111428021A (en) Text processing method and device based on machine learning, computer equipment and medium
CN114386409A (en) Self-distillation Chinese word segmentation method based on attention mechanism, terminal and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110457718B (en) Text generation method and device, computer equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN112115687A (en) Problem generation method combining triples and entity types in knowledge base
CN111062217A (en) Language information processing method and device, storage medium and electronic equipment
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN113435208B (en) Training method and device for student model and electronic equipment
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN110866113A (en) Text classification method based on sparse self-attention mechanism fine-tuning Bert model
CN115168541A (en) Chapter event extraction method and system based on frame semantic mapping and type perception
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN110929532B (en) Data processing method, device, equipment and storage medium
CN115062718A (en) Language model training method and device, electronic equipment and storage medium
CN112434166A (en) Text classification method, device and equipment based on timeliness and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN115066690A (en) Search normalization-activation layer architecture
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN117216544A (en) Model training method, natural language processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination