CN113407720A - Classification system expansion method based on pre-training text coding model


Info

Publication number
CN113407720A
CN113407720A
Authority
CN
China
Prior art keywords
classification system
training
model
path
data
Prior art date
Legal status
Granted
Application number
CN202110711017.8A
Other languages
Chinese (zh)
Other versions
CN113407720B (en)
Inventor
Yuan Xiaojie (袁晓洁)
Liu Zichen (刘子晨)
Wen Yanlong (温延龙)
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202110711017.8A priority Critical patent/CN113407720B/en
Publication of CN113407720A publication Critical patent/CN113407720A/en
Application granted granted Critical
Publication of CN113407720B publication Critical patent/CN113407720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a classification system expansion method based on a pre-training text coding model. The invention uses the classification system to be expanded and the definition texts of the words in the classification system as input data, and obtains a judgment model that scores candidates according to the taxonomy path and the word definition by fine-tuning a model pre-trained on a broad domain through self-supervised training. In the self-supervised training process, the invention uses a dynamic margin loss function and designs a corresponding task-based dynamic margin calculation method. Compared with most existing methods, which require large amounts of related corpora for training and prediction, the method reduces the corpora needed in the training and prediction process. Experimental results show that the judgment accuracy of the method is significantly better than that of other existing methods.

Description

Classification system expansion method based on pre-training text coding model
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a classification system expansion method based on a pre-training text coding model: a classification system expansion technique that performs self-supervised training from an existing classification system and the definitions of the related words.
Background
With the development of the internet and the arrival of the big data era, society has accumulated a large amount of available data resources. How to organize heterogeneous, distributed, massive information, mine deeper knowledge content, and provide efficient and accurate information services for users has become a problem of great concern.
A classification system (taxonomy) is a semantic hierarchy composed of hypernym-hyponym relations and is an important component of the knowledge graph. Constructing an accurate and complete classification system helps to solve various problems in fields such as natural language processing and information retrieval, for example query understanding, question answering systems, and personalized recommendation. The boundaries of network resources and human knowledge keep expanding and new knowledge emerges rapidly, but existing classification systems can hardly keep up with this growth; the problem of insufficient coverage is especially prominent in special fields such as medicine and law. Expanding a classification system over large-scale data is impractical in a purely manual fashion because of the huge workload and the difficulty of expert curation, so it depends heavily on automatic construction techniques. This makes it important to extract hypernym-hyponym relations from large-scale corpora automatically by computer.
In recent years, significant progress has been made in methods for expanding a classification system from text corpora. Existing methods can mainly be divided into two types. Traditional methods adopt pattern-based lexical matching: by defining specific grammatical patterns, hypernym-hyponym relations are extracted from a large-scale corpus to expand the classification system; such methods achieve a certain effect on English corpora but perform poorly on Chinese corpora. More recently, with the development of deep learning and pre-trained models in natural language processing, learning text representation vectors with deep learning models has become the mainstream way to improve the effect of classification system expansion.
However, this task is currently far from solved, mainly for three reasons: 1. The size, topic, and quality of text corpora vary, and existing approaches fail to provide a generalized solution for all situations. 2. The task has not been fully studied for emerging and specific fields, or for non-English and low-resource languages. 3. Most existing automatic methods for expanding a classification system have low accuracy, because the linguistic regularities of hypernym-hyponym relations are difficult to obtain from free text.
In conclusion, the classification system expansion based on the pre-training text coding model is an innovative research problem and has important research significance and application value.
Disclosure of Invention
The invention aims to solve the problems that existing automatic classification system expansion requires a large amount of related corpora and has low accuracy. The method generates self-supervised training data with only an existing classification system, the entity words to be inserted, and a set of entity word definitions as input, and fine-tunes an existing pre-training text coding model through a dynamic margin loss function supported by a task-specific similarity-based margin function, so that the classification system is expanded more accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the classification system expansion method based on the pre-training text coding model comprises the following steps:
step 1, generating self-supervision training data
Generating data for subsequent self-supervision training without external data according to a given existing classification system and the definition of words in the classification system, wherein the part of data consists of a classification system path and the definition of words;
step 2, classification system path sampling
Through the generation of the self-supervision training data in the step 1, classification system path data which can be used for training already exists, but sampling is needed according to the training requirement to obtain a positive and negative sampling data set;
step 3, fine tuning the pre-training text coding model
Respectively inputting the positive and negative sampling data sets of step 2 into the pre-training text coding model, and updating the model parameters with the dynamic margin loss and the back-propagation algorithm, so that the model is fine-tuned to have the capability of judging whether a classification system path is appropriate or not;
step 4, judging the position of the entity word to be inserted in the classification system
And (3) generating a candidate classification system path according to the entity word to be inserted and the word definition thereof, inputting the candidate classification system path into the model trained in the step (3), and sequencing and judging the position according to the score given by the model.
In the further optimization of the technical scheme, the specific definition of the classification system path in step 1 is as follows:
in a given taxonomy, a set of nodes with an order is formed by all nodes on the shortest path from a node to the root node.
In the further optimization of the technical scheme, the specific method for generating the self-supervision training data in the step 1 is as follows:
generating a classification system path as the correct classification system path for each node in the given classification system, and combining each non-root node with the remaining (|V| - 2) classification system paths to generate (|V| - 2) erroneous classification system paths, wherein V is the set of all nodes in the classification system.
Further optimization of the technical scheme, the specific method for sampling the classification system path in the step 2 is as follows:
in each sampling process, every non-root node is sampled: its corresponding classification system path is taken as the positive sample, and an erroneous path randomly extracted from its corresponding erroneous classification system paths is taken as the negative sample;
the sampling process repeats the random extraction in every round of training, so that different erroneous paths are sampled.
In a further optimization of the technical scheme, the data input in step 3 consists of two parts: one part is a classification system path, and the other is the word definition text of the entity word corresponding to that path. When input to the model, the two parts are combined into one string of text: the front is the [CLS] classification symbol, the first half is the word definition text, the middle is divided by the [SEP] symbol, and the second half is the text formed by arranging all entity words on the classification system path in order.
The technical scheme is further optimized, and the specific method for fine tuning the pre-training text coding model in the step 3 is as follows:
the vector corresponding to the [CLS] classification symbol output by the pre-training text coding model is input into a multilayer fully-connected neural network, which outputs a scalar representing the score of whether the classification system path is appropriate; the two scores yield a loss according to the dynamic margin loss described above, and the parameters of the model are then updated with the back-propagation algorithm, thereby fine-tuning the pre-training text coding model and the multilayer fully-connected neural network added on top of it.
In a further optimization of the technical solution, the dynamic margin loss is specifically defined as follows:
first, the dynamic margin function is defined as:

\gamma(P, P') = k \cdot \left(1 - \frac{|P \cap P'|}{|P \cup P'|}\right)

wherein P and P' represent the correct classification system path and the wrong classification system path respectively, and k is a hyper-parameter of the model, set according to the size of the given classification system;
the dynamic margin loss is:

\mathcal{L} = \sum_{P \in \mathcal{P}} \sum_{P' \in \mathcal{N}(P)} \max\left(0,\; \gamma(P, P') - s(P) + s(P')\right)

wherein \mathcal{P} represents the set of positively sampled classification system paths, \mathcal{N}(P) represents the corresponding set of negatively sampled classification system paths, s(P) and s(P') respectively represent the output scores of the model under training for the classification system paths P and P', and the max function selects the larger of its two arguments as output.
In the further optimization of the technical scheme, the specific method for judging the position of the entity word to be inserted in the classification system in the step 4 is as follows:
given an entity word to be inserted and its noun definition, the entity word is combined with all correct classification system paths on the existing classification system to obtain |V| possible classification system paths; all of these paths are combined with the noun definition and input into the fine-tuned model, the path ranked first according to the resulting scores is taken as the correct classification system path, and the position of the entity word to be inserted in the classification system is judged from that path.
Different from the prior art, the technical scheme has the advantages and positive effects that:
the invention creatively provides a self-supervision classification system expansion method utilizing pre-training text coding model fine tuning, which creatively solves the problem of classification system expansion into the judgment of the appropriateness degree of a classification system path, inputs word definition and the classification system path into a pre-training text coding model together, and fine-tunes the model through a specially designed difference calculation formula and a dynamic difference loss function. In particular, in order to better enable the model to learn the difference between the classification system paths, the method designs a specific difference calculation formula which can reflect the similarity between two different classification system paths so as to form the difference for calculating the loss. The method effectively improves the accuracy and other judgment standards of the traditional classification system expansion method, and greatly reduces the corpus texts required in the training and prediction processes.
Drawings
FIG. 1 is a flow chart of the classification system expansion method based on the pre-training text coding model;
FIG. 2 is a schematic diagram of the definition of the classification system expansion problem;
FIG. 3 is a schematic diagram of the overall fine-tuning process;
FIG. 4 is a schematic diagram of the scoring and fine-tuning of the model;
FIG. 5 is a schematic diagram of the experimental results on the SemEval-2016 Task 13 datasets.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
The invention provides a classification system expansion method based on a pre-training text coding model, the overall flow of which is shown in FIG. 1.
The invention addresses the research problem of classification system expansion. FIG. 2 is a schematic definition of the problem: the left side is the given existing classification system that needs to be expanded, the middle is an entity word that needs to be added to the classification system together with its word definition, and the right side shows the possible expansion positions. For example, on the given taxonomy shown in the figure, the new entity word "Deixis" to be expanded should be inserted under the "Semantic" node.
In the implementation stage, the classification system expansion method based on the pre-training text coding model adopts the dataset provided in Task 13 of the International Workshop on Semantic Evaluation (SemEval) 2016. The dataset provides classification systems for three different fields: food, science, and environment. When dividing the dataset, 80% of the nodes in each classification system are used as the training set, i.e., the known classification system that needs to be expanded; leaf nodes accounting for 20% of the total number of nodes are randomly selected as the test set, i.e., the new entities that need to be inserted into the classification system.
In the specific implementation process, the pre-training text coding model selected by the invention is a commonly used natural language understanding model BERT. By fine-tuning the BERT, an extended model can be obtained that is suitable for the classification system currently provided.
Step 1, self-supervised training data generation
The goal of this stage is to generate, from an existing classification system with a hierarchical structure, self-supervised training data that can be used to fine-tune the pre-training text coding model. The method is self-supervised: the training data are not taken from other classification systems, but are generated from the existing classification system that currently needs to be expanded and are used to fine-tune the pre-training text coding model.
Constructing a classification system path;
a taxonomy is a directed acyclic graph consisting of a set of hypernym-hyponym relations and a set of entities, denoted T = (V, E), where T denotes the taxonomy, each node u \in V is an entity word, and each edge (u, v) \in E denotes the hypernym-hyponym relation between a child node u \in V and its parent node v \in V.
The classification system has a hierarchical structure: there is a root node r \in V that has no parent node, i.e., there is no v \in V with (r, v) \in E; and every node in the classification system is reachable from the root, i.e., for any v \in V there exist v_1, v_2, \ldots, v_{D-1} \in V such that (v_1, r), (v_2, v_1), \ldots, (v, v_{D-1}) \in E. P_v = [r, v_1, v_2, \ldots, v_{D-1}, v] is then called the taxonomy path of node v. When the number of nodes in P_v is minimal, D is the depth of node v in the classification system.
For a given existing classification system, the method generates a classification system path for each node (when a plurality of classification system paths exist in a certain node, the shortest one of the classification system paths is selected).
In the implementation process, firstly, word definitions of all entity words are acquired, including words on a known classification system and new entity words to be inserted. The method for acquiring the word definition comprises the step of capturing sentences related to the entry in the first section of each entry from the Wikipedia website. For words formed by combining multiple words, the invention forms word definitions of the complete words by simply combining the word definitions of all the combined words.
Thereafter, according to the structure of the taxonomy that needs to be extended, a correct taxonomy path is generated for each node on it (including the root node). Then, to generate the sample data, an erroneous classification system path must be generated as negative sample data for each non-root node by combining it with other classification system paths. That is, each non-root node has one correct taxonomy path and (|V| - 2) erroneous taxonomy paths (excluding the paths of the node itself and of its parent node).
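For illustration only (this sketch is not part of the patent text), the path-generation step can be realized in a few lines of Python; it assumes the taxonomy is supplied as a child-to-parent mapping with a single root, and all names are illustrative:

```python
from typing import Dict, List

def taxonomy_path(node: str, parent: Dict[str, str]) -> List[str]:
    """Path from the root down to `node`, i.e. [r, v1, ..., node]."""
    path = [node]
    while path[-1] in parent:          # the root has no parent entry
        path.append(parent[path[-1]])
    return list(reversed(path))

def build_training_paths(parent: Dict[str, str], root: str):
    """Correct path per node, plus (|V| - 2) erroneous paths per non-root node."""
    nodes = set(parent) | {root}
    correct = {v: taxonomy_path(v, parent) for v in nodes}
    # An erroneous path appends v to the path of any other node u,
    # excluding v itself and v's true parent.
    wrong = {v: [correct[u] + [v] for u in nodes if u != v and u != parent[v]]
             for v in nodes if v != root}
    return correct, wrong
```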
During training:
step 2, classification system path sampling
Through the generation of the self-supervision training data in the step 1, classification system path data which can be used for training already exists, but sampling is needed according to the training requirement;
for a given classification system T ═ V, E, the method generates a classification system path by combining word definitions of entity words represented by each non-root node V ∈ V, V ≠ r, and V is a set of all nodes on the classification system. Meanwhile, all other classification system paths are linked with the node and are combined with the word definition of the node to serve as negative sampling data. The specific combination method is that u belongs to V, u is not equal to V, and u is not equal to VD-1,PNegative pole=[r,u1,u2...uD-1,u,v]。
In the course of fine tuning the model, multiple rounds of training are performed. In each training round, resampling is needed, and the sampling steps are as follows:
1. In sequence, select the correct classification system path of a non-root node as a positive sample;
2. Randomly extract one path from the set of erroneous classification system paths corresponding to that correct path as a negative sample;
3. Repeat the above steps until all non-root nodes have been sampled.
The above steps constitute one training round and must be repeated in every training round, as sketched below.
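A minimal sketch of one such sampling round, reusing `correct` and `wrong` from the generation sketch above (illustrative only, not part of the patent text):

```python
import random

def sample_epoch(correct, wrong, root):
    """Each non-root node yields (node, positive path, freshly drawn negative path)."""
    samples = []
    for v, pos in correct.items():
        if v == root:
            continue
        neg = random.choice(wrong[v])   # re-drawn in every training round
        samples.append((v, pos, neg))
    return samples
```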
Step 3, fine tuning the pre-training text coding model
The goal of this stage is to fine-tune the pre-training text coding model so that it has the capability of judging whether a classification system path is appropriate: the model scores an inappropriate input path low and an appropriate input path high.
Step 3.1, inputting a sampling data pair;
when fine-tuning the pre-training text coding model, the loss function is the dynamic margin loss function, so each set of training input data must contain equal amounts of positive and negative sampling data, and the positive and negative samples must correspond to the same node.
At each training round, all positive samples will be input, while negative samples will be randomly drawn from all optional negative samples of the corresponding node.
When inputting the pre-training text coding model, the entity word definition and the classification system path are used as the combination input of two sentences, and the middle part is separated by a segmentation symbol. Inputting a single sample, the pre-trained text coding model will return a string of vectors:
Model(S, P) = x_{[CLS]}, x_1, \ldots, x_{[SEP]}, x_v, \ldots, x_r, x_{[SEP]}

wherein x denotes a vector, [CLS] denotes the classification symbol commonly used in pre-training text coding models, and [SEP] denotes the segmentation symbol used in the middle and at the tail of the text; the first half encodes the word definition sentence, and the second half encodes each entity word in the classification system path.
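As an illustration of this input format, the two segments can be encoded with the Hugging Face transformers tokenizer, which inserts the [CLS] and [SEP] symbols exactly as described; the leaf-to-root ordering of the path words mirrors x_v, ..., x_r in the formula above, and the model name and maximum length are placeholder assumptions:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(definition: str, path: list, max_len: int = 128):
    path_text = " ".join(reversed(path))   # entity words, leaf to root
    # Produces: [CLS] definition tokens [SEP] path tokens [SEP]
    return tokenizer(definition, path_text, truncation=True,
                     max_length=max_len, padding="max_length",
                     return_tensors="pt")
```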
Step 3.2, calculating a dynamic difference;
in order to distinguish the difference between different negative samples, the method adjusts the traditional difference loss function and converts the fixed difference into the dynamic difference. The following two main benefits are achieved: (1) previous research results show that the model can be better trained and learned to slight and deep differences of upper and lower relations in a classification system. (2) In a classification system, the similarity between all nodes is not the same, different loss amounts are obtained by setting different differences, and the model can better learn the difference between the nodes.
The specific margin formula is:

\gamma(P, P') = k \cdot \left(1 - \frac{|P \cap P'|}{|P \cup P'|}\right)

wherein P and P' respectively denote the two path node sets whose margin is computed, and k is a hyper-parameter of the model, set according to the size of the given classification system. This margin calculation reflects the similarity of the negative sample to the positive sample in the classification system.
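In code, the margin amounts to a scaled Jaccard distance between the node sets of the two paths, so that more similar paths receive a smaller margin; this sketch follows the formula above as reconstructed here, and the value of k is a placeholder assumption:

```python
def dynamic_margin(path_p, path_q, k: float = 0.1):
    """gamma(P, P'): scaled Jaccard distance between two taxonomy paths."""
    p, q = set(path_p), set(path_q)
    return k * (1.0 - len(p & q) / len(p | q))
```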
3.3, fine-tuning the pre-training text coding model with the dynamic margin loss function;
the [CLS] vector output by the pre-training text coding model is input into a multi-layer perceptron to obtain a numerical value as the score of the currently input classification system path, i.e., s(P) = MLP(x_{[CLS]}), where s(P) denotes the score of the currently input classification system path and MLP denotes the multi-layer perceptron.
The dynamic margin loss function is:

\mathcal{L} = \sum_{P \in \mathcal{P}} \sum_{P' \in \mathcal{N}(P)} \max\left(0,\; \gamma(P, P') - s(P) + s(P')\right)

wherein \mathcal{P} denotes the set of positively sampled classification system paths, \mathcal{N}(P) denotes the corresponding set of negatively sampled classification system paths, s(P) and s(P') respectively denote the output scores of the model under training for the classification system paths P and P', and the max function selects the larger of its two arguments as output.
Using this loss function together with the Adam optimizer, inputting a certain amount of data each time and iterating over multiple rounds, the pre-training text coding model can be fine-tuned so that it assigns a score to any given input classification system path.
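A minimal PyTorch sketch of one fine-tuning step, assuming a bert-base encoder and a small scoring MLP (the hidden sizes, learning rate, and layer shapes are illustrative assumptions, not the patent's prescribed architecture); `gamma` would come from the `dynamic_margin` sketch above:

```python
import torch
import torch.nn as nn
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")
mlp = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(mlp.parameters()),
                             lr=2e-5)

def score(inputs):
    """s(P): the [CLS] vector of the encoder fed through the scoring MLP."""
    cls = encoder(**inputs).last_hidden_state[:, 0]
    return mlp(cls).squeeze(-1)

def train_step(pos_inputs, neg_inputs, gamma):
    """One step of the dynamic margin loss: max(0, gamma - s(P) + s(P'))."""
    loss = torch.clamp(gamma - score(pos_inputs) + score(neg_inputs), min=0).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```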
Referring to FIG. 3, the overall fine-tuning process is illustrated. The sampled data are used to fine-tune the pre-training text coding model. When input into the pre-training text coding model, the entity word definition and the classification system path are combined as two sentences, separated in the middle by a segmentation symbol. In each set of training data, half are positive samples and half are negative samples.
Referring to FIG. 4, the last layer of the neural network of the pre-training text coding model outputs a representation vector for [CLS], which is input into a fully-connected neural network layer to obtain a scalar output used to score the classification system path input to the model. The loss is calculated from the classification system paths of the positive samples and their corresponding negative samples according to the dynamic margin loss function, and the parameters of the model are updated with the Adam optimizer and the back-propagation algorithm, thereby fine-tuning the model.
The training process is performed for multiple rounds, with the specific number of rounds adjusted per dataset: in this implementation, the Science dataset is trained for 50 rounds, the Environment dataset for 45 rounds, and the Food dataset for 55 rounds, chosen with reference to the dataset size, the depth of the classification system, and the convergence speed of the loss during training. The fine-tuning is fairly robust: within a range of about ±5 training rounds, the training effect does not differ significantly.
Step 4, aiming at the entity words to be inserted, judging the positions of the entity words in the classification system
After the above steps are completed, i.e., the pre-training text coding model has been fine-tuned on the given classification system and the definitions of its nodes, the task of this step is to use the fine-tuned model to judge the position at which a new entity word that needs to be expanded should be inserted into the given classification system.
The new entity word to be expanded is combined with all classification system paths (including the root node) on the given classification system to generate |V| possible classification system paths; each path is paired with the word definition of the new entity word and input into the fine-tuned model to obtain a score, the scores are sorted, and the node corresponding to the highest-scoring classification system path is the insertion position.
The pre-trained text coding model has the capability of judging whether the classification system path is appropriate or not by finely adjusting the model according to the training data generated by the existing classification system. At this time, the model can obtain a score for judging whether the classification system path is suitable or not for each input classification system path.
Given an entity word to be inserted and its noun definition, the entity word is combined with all correct classification system paths on the existing classification system to obtain |V| possible classification system paths; all of these paths are combined with the noun definition and input into the fine-tuned model, the path ranked first according to the resulting scores is taken as the correct classification system path, and the position of the entity word to be inserted in the classification system is judged from that path.
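Combining the earlier sketches, prediction reduces to scoring the |V| candidate paths and taking the argmax; this fragment reuses the illustrative `encode` and `score` helpers defined above (not part of the patent text):

```python
def predict_position(word, definition, correct):
    """Return the existing node under which `word` is predicted to be inserted."""
    best_node, best_score = None, float("-inf")
    for u, path in correct.items():
        s = score(encode(definition, path + [word])).item()
        if s > best_score:
            best_node, best_score = u, s
    return best_node
```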
The classification system expansion method based on the pre-training text coding model provided by the invention is verified on the data set of SemEval-2016 Task 13. Three evaluation indexes were used in the experiment:
accuracy (Acc): measuring the fraction of correctly predicted taxonomy paths
Acc = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(\hat{y}_i = y_i)
Mean Reciprocal Rank (MRR): calculating the average of the inverse ranking of the correct classification system path
MRR = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{rank}(y_i)}
Wu & Palmer similarity (Wu & P): evaluating semantic similarity between the predicted classification system path and the real classification system path:
Wu\&P = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \cdot \mathrm{depth}(\mathrm{lca}(\hat{y}_i, y_i))}{\mathrm{depth}(\hat{y}_i) + \mathrm{depth}(y_i)}

In the above formulas, the correct classification system path of the i-th of the n tested entity words to be inserted is denoted y_i, and the highest-ranked classification system path given by the model is denoted \hat{y}_i.
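A sketch of these three metrics, assuming each test case provides the model's full ranking of candidate parents, the gold parent, and helper functions `depth` and `lca_depth` over the gold taxonomy (all names illustrative, not from the patent):

```python
def evaluate(ranked, gold, depth, lca_depth):
    """Acc, MRR and Wu&P over n test insertions."""
    n = len(gold)
    acc = sum(r[0] == g for r, g in zip(ranked, gold)) / n
    mrr = sum(1.0 / (r.index(g) + 1) for r, g in zip(ranked, gold)) / n
    wup = sum(2.0 * lca_depth(r[0], g) / (depth(r[0]) + depth(g))
              for r, g in zip(ranked, gold)) / n
    return acc, mrr, wup
```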
Referring to FIG. 5, the effect of the method on the three English datasets of SemEval-2016 Task 13 and a comparison with other currently existing methods are shown. The baseline BERT + MLP is a model trained simply by combining the embedding output by BERT with a multilayer neural network; TaxoExpan and STEAM are methods proposed in two papers published in 2020, whose results are taken from those papers. The experiments show that, compared with STEAM, the best existing method, the method provided by the invention achieves average improvements of 13.0%, 14.0%, and 9.8% on the three indexes Acc, MRR, and Wu&P. This comparison fully shows that the method provided by the invention performs excellently on the self-supervised classification system expansion task.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or terminal that comprises the element. Further, herein, "greater than", "less than", "more than", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it.
Although the embodiments have been described, those skilled in the art can make other variations and modifications once the basic inventive concept is known. The above embodiments are therefore only examples of the present invention and do not limit its scope; all equivalent structures or equivalent processes derived from the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (8)

1. The classification system expansion method based on the pre-training text coding model is characterized by comprising the following steps:
step 1, generating self-supervision training data
Generating data for subsequent self-supervision training without external data according to a given existing classification system and the definition of words in the classification system, wherein the part of data consists of a classification system path and the definition of words;
step 2, classification system path sampling
Through the generation of the self-supervision training data in the step 1, classification system path data which can be used for training already exists, but sampling is needed according to the training requirement to obtain a positive and negative sampling data set;
step 3, fine tuning the pre-training text coding model
Respectively inputting the positive and negative sampling data sets of step 2 into the pre-training text coding model, and updating the model parameters with the dynamic margin loss and the back-propagation algorithm, so that the model is fine-tuned to have the capability of judging whether a classification system path is appropriate or not;
step 4, judging the position of the entity word to be inserted in the classification system
And (3) generating a candidate classification system path according to the entity word to be inserted and the word definition thereof, inputting the candidate classification system path into the model trained in the step (3), and sequencing and judging the position according to the score given by the model.
2. The method for extending classification system based on pre-trained text coding model according to claim 1, wherein the specific definition of the classification system path in step 1 is:
in a given taxonomy, a set of nodes with an order is formed by all nodes on the shortest path from a node to the root node.
3. The method for extending classification system based on pre-trained text coding model according to claim 1, wherein the step 1 self-supervised training data generation method is as follows:
generating a classification system path as the correct classification system path for each node in the given classification system, and combining each non-root node with the remaining (|V| - 2) classification system paths to generate (|V| - 2) erroneous classification system paths, wherein V is the set of all nodes in the classification system.
4. The method for extending classification system based on pre-trained text coding model according to claim 1, wherein the step 2 is a specific method for sampling classification system paths:
in each sampling process, every non-root node is sampled: its corresponding classification system path is taken as the positive sample, and an erroneous path randomly extracted from its corresponding erroneous classification system paths is taken as the negative sample;
the sampling process repeats the random extraction in every round of training, so that different erroneous paths are sampled.
5. The method as claimed in claim 1, wherein the data input in step 3 consists of two parts: one part is a classification system path, and the other is the word definition text of the entity word corresponding to that path; when input to the model, the two parts are combined into one string of text: the front is the [CLS] classification symbol, the first half is the word definition text, the middle is divided by the [SEP] symbol, and the second half is the text formed by arranging all entity words on the classification system path in order.
6. The method for expanding the classification system based on the pre-trained text coding model according to claim 5, wherein the specific method for fine-tuning the pre-trained text coding model in the step 3 is as follows:
the vector corresponding to the [CLS] classification symbol output by the pre-training text coding model is input into a multilayer fully-connected neural network, which outputs a scalar representing the score of whether the classification system path is appropriate; the two scores yield a loss according to the dynamic margin loss described above, and the parameters of the model are then updated with the back-propagation algorithm, thereby fine-tuning the pre-training text coding model and the multilayer fully-connected neural network added on top of it.
7. The method for extending a classification system based on a pre-trained text coding model according to claim 1 or 6, wherein the dynamic margin loss is specifically defined as:
first, the dynamic margin function is defined as:

\gamma(P, P') = k \cdot \left(1 - \frac{|P \cap P'|}{|P \cup P'|}\right)

wherein P and P' represent the correct classification system path and the wrong classification system path respectively, and k is a hyper-parameter of the model, set according to the size of the given classification system;
the dynamic margin loss is:

\mathcal{L} = \sum_{P \in \mathcal{P}} \sum_{P' \in \mathcal{N}(P)} \max\left(0,\; \gamma(P, P') - s(P) + s(P')\right)

wherein \mathcal{P} represents the set of positively sampled classification system paths, \mathcal{N}(P) represents the corresponding set of negatively sampled classification system paths, s(P) and s(P') respectively represent the output scores of the model under training for the classification system paths P and P', and the max function selects the larger of its two arguments as output.
8. The classification system expansion method based on the pre-trained text coding model according to claim 1, wherein the specific method for judging the position of the entity word to be inserted in the classification system in the step 4 is as follows:
given an entity word to be inserted and its noun definition, the entity word is combined with all correct classification system paths on the existing classification system to obtain |V| possible classification system paths; all of these paths are combined with the noun definition and input into the fine-tuned model, the path ranked first according to the resulting scores is taken as the correct classification system path, and the position of the entity word to be inserted in the classification system is judged from that path.
CN202110711017.8A 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model Active CN113407720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110711017.8A CN113407720B (en) 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110711017.8A CN113407720B (en) 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model

Publications (2)

Publication Number Publication Date
CN113407720A true CN113407720A (en) 2021-09-17
CN113407720B CN113407720B (en) 2023-04-25

Family

ID=77679432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110711017.8A Active CN113407720B (en) 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model

Country Status (1)

Country Link
CN (1) CN113407720B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073677A (en) * 2017-11-02 2018-05-25 中国科学院信息工程研究所 A kind of multistage text multi-tag sorting technique and system based on artificial intelligence
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110502643A (en) * 2019-08-28 2019-11-26 南京璇玑信息技术有限公司 A kind of next model autocreating technology of the prediction based on BERT model
CN111444305A (en) * 2020-03-19 2020-07-24 浙江大学 Multi-triple combined extraction method based on knowledge graph embedding
CN111538848A (en) * 2020-04-29 2020-08-14 华中科技大学 Knowledge representation learning method fusing multi-source information
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN112214599A (en) * 2020-10-20 2021-01-12 电子科技大学 Multi-label text classification method based on statistics and pre-training language model
CN112329463A (en) * 2020-11-27 2021-02-05 上海汽车集团股份有限公司 Training method of remote monitoring relation extraction model and related device
CN112434513A (en) * 2020-11-24 2021-03-02 杭州电子科技大学 Word pair up-down relation training method based on dependency semantic attention mechanism
CN112560486A (en) * 2020-11-25 2021-03-26 国网江苏省电力有限公司电力科学研究院 Power entity identification method based on multilayer neural network, storage medium and equipment


Also Published As

Publication number Publication date
CN113407720B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN107291693B (en) Semantic calculation method for improved word vector model
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112699216A (en) End-to-end language model pre-training method, system, device and storage medium
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN112925918B (en) Question-answer matching system based on disease field knowledge graph
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN112214989A (en) Chinese sentence simplification method based on BERT
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN114428850A (en) Text retrieval matching method and system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN113032559B (en) Language model fine tuning method for low-resource adhesive language text classification
CN112989803B (en) Entity link prediction method based on topic vector learning
CN110516240A (en) A kind of Semantic Similarity Measurement model DSSM technology based on Transformer
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN109815497A (en) Based on the interdependent character attribute abstracting method of syntax
CN116757188A (en) Cross-language information retrieval training method based on alignment query entity pairs
CN113407720B (en) Classification system expansion method based on pre-training text coding model
CN116029300A (en) Language model training method and system for strengthening semantic features of Chinese entities
CN113111136B (en) Entity disambiguation method and device based on UCL knowledge space
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant