CN113407720A - Classification system expansion method based on pre-training text coding model


Info

Publication number
CN113407720A
CN113407720A
Authority
CN
China
Prior art keywords
classification system
training
model
path
data
Prior art date
Legal status
Granted
Application number
CN202110711017.8A
Other languages
Chinese (zh)
Other versions
CN113407720B (en)
Inventor
Yuan Xiaojie (袁晓洁)
Liu Zichen (刘子晨)
Wen Yanlong (温延龙)
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202110711017.8A priority Critical patent/CN113407720B/en
Publication of CN113407720A publication Critical patent/CN113407720A/en
Application granted granted Critical
Publication of CN113407720B publication Critical patent/CN113407720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a classification system expansion method based on a pre-training text coding model. The invention uses the classification system to be expanded and the definition texts of the words in the classification system as input data, and obtains a judgment model that scores candidates according to the taxonomy path and the word definition by fine-tuning a model pre-trained on a broad domain through self-supervised training. In the self-supervised training process, the invention uses a dynamic margin loss function and designs a corresponding task-based dynamic margin calculation method. Compared with most existing methods, which require large amounts of related corpora for training and prediction, the method reduces the corpora needed in the training and prediction process. Experimental results show that the judgment accuracy of the method is significantly better than that of other existing methods.

Description

Classification system expansion method based on pre-training text coding model
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a classification system expansion method based on a pre-training text coding model: a classification system expansion technique that performs self-supervised training from an existing classification system and the definitions of the related words.
Background
With the development of the internet and the arrival of the big data era, society has accumulated a large amount of available data resources. How to organize heterogeneous, distributed, massive information, mine deeper knowledge content, and provide efficient and accurate information services for users has become a problem of great concern.
A classification system (taxonomy) is a semantic hierarchy composed of hypernym-hyponym relations and is an important component of the knowledge graph. Constructing an accurate and complete classification system helps to solve various problems in fields such as natural language processing and information retrieval, for example query understanding, question answering systems, and personalized recommendation. The boundaries of network resources and human knowledge keep expanding and new knowledge emerges rapidly, but existing classification systems can hardly keep up with this growth; the problem of insufficient coverage is especially prominent in special fields such as medicine and law. Expanding a classification system over large-scale data is impractical in a purely manual fashion because of the huge workload and the difficulty of expert curation, so it depends heavily on automatic construction techniques. This makes it important to extract hypernym-hyponym relations from large-scale corpora automatically by computer.
In recent years, significant progress has been made in methods for expanding a classification system from text corpora. Existing methods can mainly be divided into two types. Traditional methods adopt pattern-based lexical matching: by defining specific grammatical patterns, hypernym-hyponym relations are extracted from a large-scale corpus to expand the classification system; such methods achieve a certain effect on English corpora but perform poorly on Chinese corpora. More recently, with the development of deep learning and pre-trained models in natural language processing, learning text representation vectors with deep learning models has become the mainstream way to improve the effect of classification system expansion.
However, this task is currently far from solved, mainly for three reasons: 1. The size, topic, and quality of text corpora vary, and existing approaches fail to provide a generalized solution for all situations. 2. The task has not been fully studied for emerging and specific fields, or for non-English and low-resource languages. 3. Most existing automatic methods for expanding a classification system have low accuracy, because the linguistic regularities of hypernym-hyponym relations are difficult to obtain from free text.
In conclusion, the classification system expansion based on the pre-training text coding model is an innovative research problem and has important research significance and application value.
Disclosure of Invention
The invention aims to solve the problems that existing automatic classification system expansion requires a large amount of related corpora and has low accuracy. The method generates self-supervised training data with only an existing classification system, the entity words to be inserted, and a set of entity word definitions as input, and fine-tunes an existing pre-training text coding model through a dynamic margin loss function supported by a task-specific similarity-based margin function, so that the classification system is expanded more accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
the classification system expansion method based on the pre-training text coding model comprises the following steps:
step 1, generating self-supervision training data
Generating data for subsequent self-supervision training without external data according to a given existing classification system and the definition of words in the classification system, wherein the part of data consists of a classification system path and the definition of words;
step 2, classification system path sampling
Through the generation of the self-supervision training data in the step 1, classification system path data which can be used for training already exists, but sampling is needed according to the training requirement to obtain a positive and negative sampling data set;
step 3, fine tuning the pre-training text coding model
Respectively inputting the positive and negative sampling data sets of step 2 into the pre-training text coding model, and updating the model parameters with the dynamic margin loss and the back-propagation algorithm, so that the model is fine-tuned to have the capability of judging whether a classification system path is appropriate or not;
step 4, judging the position of the entity word to be inserted in the classification system
And (3) generating a candidate classification system path according to the entity word to be inserted and the word definition thereof, inputting the candidate classification system path into the model trained in the step (3), and sequencing and judging the position according to the score given by the model.
In the further optimization of the technical scheme, the specific definition of the classification system path in step 1 is as follows:
in a given taxonomy, a set of nodes with an order is formed by all nodes on the shortest path from a node to the root node.
In the further optimization of the technical scheme, the specific method for generating the self-supervision training data in the step 1 is as follows:
generating a classification system path as the correct classification system path for each node in the given classification system, and combining each non-root node with the remaining (|V| - 2) classification system paths to generate (|V| - 2) erroneous classification system paths, wherein V is the set of all nodes in the classification system.
Further optimization of the technical scheme, the specific method for sampling the classification system path in the step 2 is as follows:
in each sampling process, every non-root node is sampled: its corresponding classification system path is taken as the positive sample, and an erroneous path randomly extracted from its corresponding erroneous classification system paths is taken as the negative sample;
the sampling process repeats the random extraction in every round of training, so that different erroneous paths are sampled.
In a further optimization of the technical scheme, the data input in step 3 consists of two parts: one part is a classification system path, and the other is the word definition text of the entity word corresponding to that path. When input to the model, the two parts are combined into one string of text: the front is the [CLS] classification symbol, the first half is the word definition text, the middle is divided by the [SEP] symbol, and the second half is the text formed by arranging all entity words on the classification system path in order.
The technical scheme is further optimized, and the specific method for fine tuning the pre-training text coding model in the step 3 is as follows:
the vector corresponding to the [CLS] classification symbol output by the pre-training text coding model is input into a multilayer fully-connected neural network, which outputs a scalar representing the score of whether the classification system path is appropriate; the two scores yield a loss according to the dynamic margin loss described above, and the parameters of the model are then updated with the back-propagation algorithm, thereby fine-tuning the pre-training text coding model and the multilayer fully-connected neural network added on top of it.
In a further optimization of the technical solution, the dynamic margin loss is specifically defined as follows:
first, the dynamic margin function is defined as:

\gamma(P, P') = k \cdot \left(1 - \frac{|P \cap P'|}{|P \cup P'|}\right)

wherein P and P' represent the correct classification system path and the wrong classification system path respectively, and k is a hyper-parameter of the model, set according to the size of the given classification system;
the dynamic margin loss is:

\mathcal{L} = \sum_{P \in \mathcal{P}} \sum_{P' \in \mathcal{N}(P)} \max\left(0,\; \gamma(P, P') - s(P) + s(P')\right)

wherein \mathcal{P} represents the set of positively sampled classification system paths, \mathcal{N}(P) represents the corresponding set of negatively sampled classification system paths, s(P) and s(P') respectively represent the output scores of the model under training for the classification system paths P and P', and the max function selects the larger of its two arguments as output.
In the further optimization of the technical scheme, the specific method for judging the position of the entity word to be inserted in the classification system in the step 4 is as follows:
given an entity word to be inserted and its noun definition, the entity word is combined with all correct classification system paths on the existing classification system to obtain |V| possible classification system paths; all of these paths are combined with the noun definition and input into the fine-tuned model, the path ranked first according to the resulting scores is taken as the correct classification system path, and the position of the entity word to be inserted in the classification system is judged from that path.
Different from the prior art, the technical scheme has the advantages and positive effects that:
the invention creatively provides a self-supervision classification system expansion method utilizing pre-training text coding model fine tuning, which creatively solves the problem of classification system expansion into the judgment of the appropriateness degree of a classification system path, inputs word definition and the classification system path into a pre-training text coding model together, and fine-tunes the model through a specially designed difference calculation formula and a dynamic difference loss function. In particular, in order to better enable the model to learn the difference between the classification system paths, the method designs a specific difference calculation formula which can reflect the similarity between two different classification system paths so as to form the difference for calculating the loss. The method effectively improves the accuracy and other judgment standards of the traditional classification system expansion method, and greatly reduces the corpus texts required in the training and prediction processes.
Drawings
FIG. 1 is a flow chart of the classification system expansion method based on the pre-training text coding model;
FIG. 2 is a schematic diagram of the definition of the classification system expansion problem;
FIG. 3 is a schematic diagram of the overall fine-tuning process;
FIG. 4 is a schematic diagram of the scoring and fine-tuning of the model;
FIG. 5 is a schematic diagram of the experimental results on the SemEval-2016 Task 13 datasets.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
The invention provides a classification system expansion method based on a pre-training text coding model, the overall flow of which is shown in FIG. 1.
The invention addresses the research problem of classification system expansion. FIG. 2 is a schematic definition of the problem: the left side is the given existing classification system that needs to be expanded, the middle is an entity word that needs to be added to the classification system together with its word definition, and the right side shows the possible expansion positions. For example, on the given taxonomy shown in the figure, the new entity word "Deixis" to be expanded should be inserted under the "Semantic" node.
In the implementation stage, the classification system expansion method based on the pre-training text coding model adopts the dataset provided in Task 13 of the International Workshop on Semantic Evaluation (SemEval) 2016. The dataset provides classification systems for three different fields: food, science, and environment. When dividing the dataset, 80% of the nodes in each classification system are used as the training set, i.e., the known classification system that needs to be expanded; leaf nodes accounting for 20% of the total number of nodes are randomly selected as the test set, i.e., the new entities that need to be inserted into the classification system.
In the specific implementation process, the pre-training text coding model selected by the invention is a commonly used natural language understanding model BERT. By fine-tuning the BERT, an extended model can be obtained that is suitable for the classification system currently provided.
Step 1, self-supervised training data generation
The goal of this stage is to generate, from an existing classification system with a hierarchical structure, self-supervised training data that can be used to fine-tune the pre-training text coding model. The method is self-supervised: the training data are not taken from other classification systems, but are generated from the existing classification system that currently needs to be expanded and are used to fine-tune the pre-training text coding model.
Constructing a classification system path;
a taxonomy is a directed acyclic graph consisting of a set of hypernym-hyponym relations and a set of entities, denoted T = (V, E), where T denotes the taxonomy, each node u \in V is an entity word, and each edge (u, v) \in E denotes the hypernym-hyponym relation between a child node u \in V and its parent node v \in V.
The classification system has a hierarchical structure: there is a root node r \in V that has no parent node, i.e., there is no v \in V with (r, v) \in E; and every node in the classification system is reachable from the root, i.e., for any v \in V there exist v_1, v_2, \ldots, v_{D-1} \in V such that (v_1, r), (v_2, v_1), \ldots, (v, v_{D-1}) \in E. P_v = [r, v_1, v_2, \ldots, v_{D-1}, v] is then called the taxonomy path of node v. When the number of nodes in P_v is minimal, D is the depth of node v in the classification system.
For a given existing classification system, the method generates a classification system path for each node (when a plurality of classification system paths exist in a certain node, the shortest one of the classification system paths is selected).
In the implementation process, firstly, word definitions of all entity words are acquired, including words on a known classification system and new entity words to be inserted. The method for acquiring the word definition comprises the step of capturing sentences related to the entry in the first section of each entry from the Wikipedia website. For words formed by combining multiple words, the invention forms word definitions of the complete words by simply combining the word definitions of all the combined words.
Thereafter, according to the structure of the taxonomy that needs to be extended, a correct taxonomy path is generated for each node on it (including the root node). Then, to generate the sample data, an erroneous classification system path must be generated as negative sample data for each non-root node by combining it with other classification system paths. That is, each non-root node has one correct taxonomy path and (|V| - 2) erroneous taxonomy paths (excluding the paths of the node itself and of its parent node).
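For illustration only (this sketch is not part of the patent text), the path-generation step can be realized in a few lines of Python; it assumes the taxonomy is supplied as a child-to-parent mapping with a single root, and all names are illustrative:

```python
from typing import Dict, List

def taxonomy_path(node: str, parent: Dict[str, str]) -> List[str]:
    """Path from the root down to `node`, i.e. [r, v1, ..., node]."""
    path = [node]
    while path[-1] in parent:          # the root has no parent entry
        path.append(parent[path[-1]])
    return list(reversed(path))

def build_training_paths(parent: Dict[str, str], root: str):
    """Correct path per node, plus (|V| - 2) erroneous paths per non-root node."""
    nodes = set(parent) | {root}
    correct = {v: taxonomy_path(v, parent) for v in nodes}
    # An erroneous path appends v to the path of any other node u,
    # excluding v itself and v's true parent.
    wrong = {v: [correct[u] + [v] for u in nodes if u != v and u != parent[v]]
             for v in nodes if v != root}
    return correct, wrong
```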
During training:
step 2, classification system path sampling
Through the generation of the self-supervision training data in the step 1, classification system path data which can be used for training already exists, but sampling is needed according to the training requirement;
for a given classification system T ═ V, E, the method generates a classification system path by combining word definitions of entity words represented by each non-root node V ∈ V, V ≠ r, and V is a set of all nodes on the classification system. Meanwhile, all other classification system paths are linked with the node and are combined with the word definition of the node to serve as negative sampling data. The specific combination method is that u belongs to V, u is not equal to V, and u is not equal to VD-1,PNegative pole=[r,u1,u2...uD-1,u,v]。
In the course of fine tuning the model, multiple rounds of training are performed. In each training round, resampling is needed, and the sampling steps are as follows:
1. In sequence, select the correct classification system path of a non-root node as a positive sample;
2. Randomly extract one path from the set of erroneous classification system paths corresponding to that correct path as a negative sample;
3. Repeat the above steps until all non-root nodes have been sampled.
The above steps constitute one training round and must be repeated in every training round, as sketched below.
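A minimal sketch of one such sampling round, reusing `correct` and `wrong` from the generation sketch above (illustrative only, not part of the patent text):

```python
import random

def sample_epoch(correct, wrong, root):
    """Each non-root node yields (node, positive path, freshly drawn negative path)."""
    samples = []
    for v, pos in correct.items():
        if v == root:
            continue
        neg = random.choice(wrong[v])   # re-drawn in every training round
        samples.append((v, pos, neg))
    return samples
```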
Step 3, fine tuning the pre-training text coding model
The goal of this stage is to fine-tune the pre-training text coding model so that it has the capability of judging whether a classification system path is appropriate: the model scores an inappropriate input path low and an appropriate input path high.
Step 3.1, inputting a sampling data pair;
when fine-tuning the pre-training text coding model, the loss function is the dynamic margin loss function, so each set of training input data must contain equal amounts of positive and negative sampling data, and the positive and negative samples must correspond to the same node.
At each training round, all positive samples will be input, while negative samples will be randomly drawn from all optional negative samples of the corresponding node.
When inputting the pre-training text coding model, the entity word definition and the classification system path are used as the combination input of two sentences, and the middle part is separated by a segmentation symbol. Inputting a single sample, the pre-trained text coding model will return a string of vectors:
Model(S, P) = x_{[CLS]}, x_1, \ldots, x_{[SEP]}, x_v, \ldots, x_r, x_{[SEP]}

wherein x denotes a vector, [CLS] denotes the classification symbol commonly used in pre-training text coding models, and [SEP] denotes the segmentation symbol used in the middle and at the tail of the text; the first half encodes the word definition sentence, and the second half encodes each entity word in the classification system path.
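As an illustration of this input format, the two segments can be encoded with the Hugging Face transformers tokenizer, which inserts the [CLS] and [SEP] symbols exactly as described; the leaf-to-root ordering of the path words mirrors x_v, ..., x_r in the formula above, and the model name and maximum length are placeholder assumptions:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(definition: str, path: list, max_len: int = 128):
    path_text = " ".join(reversed(path))   # entity words, leaf to root
    # Produces: [CLS] definition tokens [SEP] path tokens [SEP]
    return tokenizer(definition, path_text, truncation=True,
                     max_length=max_len, padding="max_length",
                     return_tensors="pt")
```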
Step 3.2, calculating a dynamic difference;
in order to distinguish the difference between different negative samples, the method adjusts the traditional difference loss function and converts the fixed difference into the dynamic difference. The following two main benefits are achieved: (1) previous research results show that the model can be better trained and learned to slight and deep differences of upper and lower relations in a classification system. (2) In a classification system, the similarity between all nodes is not the same, different loss amounts are obtained by setting different differences, and the model can better learn the difference between the nodes.
The specific margin formula is:

\gamma(P, P') = k \cdot \left(1 - \frac{|P \cap P'|}{|P \cup P'|}\right)

wherein P and P' respectively denote the two path node sets whose margin is computed, and k is a hyper-parameter of the model, set according to the size of the given classification system. This margin calculation reflects the similarity of the negative sample to the positive sample in the classification system.
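In code, the margin amounts to a scaled Jaccard distance between the node sets of the two paths, so that more similar paths receive a smaller margin; this sketch follows the formula above as reconstructed here, and the value of k is a placeholder assumption:

```python
def dynamic_margin(path_p, path_q, k: float = 0.1):
    """gamma(P, P'): scaled Jaccard distance between two taxonomy paths."""
    p, q = set(path_p), set(path_q)
    return k * (1.0 - len(p & q) / len(p | q))
```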
3.3, fine-tuning the pre-training text coding model with the dynamic margin loss function;
the [CLS] vector output by the pre-training text coding model is input into a multi-layer perceptron to obtain a numerical value as the score of the currently input classification system path, i.e., s(P) = MLP(x_{[CLS]}), where s(P) denotes the score of the currently input classification system path and MLP denotes the multi-layer perceptron.
The dynamic margin loss function is:

\mathcal{L} = \sum_{P \in \mathcal{P}} \sum_{P' \in \mathcal{N}(P)} \max\left(0,\; \gamma(P, P') - s(P) + s(P')\right)

wherein \mathcal{P} denotes the set of positively sampled classification system paths, \mathcal{N}(P) denotes the corresponding set of negatively sampled classification system paths, s(P) and s(P') respectively denote the output scores of the model under training for the classification system paths P and P', and the max function selects the larger of its two arguments as output.
Using this loss function together with the Adam optimizer, inputting a certain amount of data each time and iterating over multiple rounds, the pre-training text coding model can be fine-tuned so that it assigns a score to any given input classification system path.
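A minimal PyTorch sketch of one fine-tuning step, assuming a bert-base encoder and a small scoring MLP (the hidden sizes, learning rate, and layer shapes are illustrative assumptions, not the patent's prescribed architecture); `gamma` would come from the `dynamic_margin` sketch above:

```python
import torch
import torch.nn as nn
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")
mlp = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(mlp.parameters()),
                             lr=2e-5)

def score(inputs):
    """s(P): the [CLS] vector of the encoder fed through the scoring MLP."""
    cls = encoder(**inputs).last_hidden_state[:, 0]
    return mlp(cls).squeeze(-1)

def train_step(pos_inputs, neg_inputs, gamma):
    """One step of the dynamic margin loss: max(0, gamma - s(P) + s(P'))."""
    loss = torch.clamp(gamma - score(pos_inputs) + score(neg_inputs), min=0).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```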
Referring to FIG. 3, the overall fine-tuning process is illustrated. The sampled data are used to fine-tune the pre-training text coding model. When input into the pre-training text coding model, the entity word definition and the classification system path are combined as two sentences, separated in the middle by a segmentation symbol. In each set of training data, half are positive samples and half are negative samples.
Referring to FIG. 4, the last layer of the neural network of the pre-training text coding model outputs a representation vector for [CLS], which is input into a fully-connected neural network layer to obtain a scalar output used to score the classification system path input to the model. The loss is calculated from the classification system paths of the positive samples and their corresponding negative samples according to the dynamic margin loss function, and the parameters of the model are updated with the Adam optimizer and the back-propagation algorithm, thereby fine-tuning the model.
The training process is performed for multiple rounds, with the specific number of rounds adjusted per dataset: in this implementation, the Science dataset is trained for 50 rounds, the Environment dataset for 45 rounds, and the Food dataset for 55 rounds, chosen with reference to the dataset size, the depth of the classification system, and the convergence speed of the loss during training. The fine-tuning is fairly robust: within a range of about ±5 training rounds, the training effect does not differ significantly.
Step 4, aiming at the entity words to be inserted, judging the positions of the entity words in the classification system
After the above steps are completed, i.e., the pre-training text coding model has been fine-tuned on the given classification system and the definitions of its nodes, the task of this step is to use the fine-tuned model to judge the position at which a new entity word that needs to be expanded should be inserted into the given classification system.
The new entity word to be expanded is combined with all classification system paths (including the root node) on the given classification system to generate |V| possible classification system paths; each path is paired with the word definition of the new entity word and input into the fine-tuned model to obtain a score, the scores are sorted, and the node corresponding to the highest-scoring classification system path is the insertion position.
The pre-trained text coding model has the capability of judging whether the classification system path is appropriate or not by finely adjusting the model according to the training data generated by the existing classification system. At this time, the model can obtain a score for judging whether the classification system path is suitable or not for each input classification system path.
Given an entity word to be inserted and its noun definition, the entity word is combined with all correct classification system paths on the existing classification system to obtain |V| possible classification system paths; all of these paths are combined with the noun definition and input into the fine-tuned model, the path ranked first according to the resulting scores is taken as the correct classification system path, and the position of the entity word to be inserted in the classification system is judged from that path.
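Combining the earlier sketches, prediction reduces to scoring the |V| candidate paths and taking the argmax; this fragment reuses the illustrative `encode` and `score` helpers defined above (not part of the patent text):

```python
def predict_position(word, definition, correct):
    """Return the existing node under which `word` is predicted to be inserted."""
    best_node, best_score = None, float("-inf")
    for u, path in correct.items():
        s = score(encode(definition, path + [word])).item()
        if s > best_score:
            best_node, best_score = u, s
    return best_node
```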
The classification system expansion method based on the pre-training text coding model provided by the invention is verified on the data set of SemEval-2016 Task 13. Three evaluation indexes were used in the experiment:
accuracy (Acc): measuring the fraction of correctly predicted taxonomy paths
Acc = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(\hat{y}_i = y_i)
Mean Reciprocal Rank (MRR): calculating the average of the inverse ranking of the correct classification system path
MRR = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{rank}(y_i)}
Wu & Palmer similarity (Wu & P): evaluating semantic similarity between the predicted classification system path and the real classification system path:
Wu\&P = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \cdot \mathrm{depth}(\mathrm{lca}(\hat{y}_i, y_i))}{\mathrm{depth}(\hat{y}_i) + \mathrm{depth}(y_i)}

In the above formulas, the correct classification system path of the i-th of the n tested entity words to be inserted is denoted y_i, and the highest-ranked classification system path given by the model is denoted \hat{y}_i.
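A sketch of these three metrics, assuming each test case provides the model's full ranking of candidate parents, the gold parent, and helper functions `depth` and `lca_depth` over the gold taxonomy (all names illustrative, not from the patent):

```python
def evaluate(ranked, gold, depth, lca_depth):
    """Acc, MRR and Wu&P over n test insertions."""
    n = len(gold)
    acc = sum(r[0] == g for r, g in zip(ranked, gold)) / n
    mrr = sum(1.0 / (r.index(g) + 1) for r, g in zip(ranked, gold)) / n
    wup = sum(2.0 * lca_depth(r[0], g) / (depth(r[0]) + depth(g))
              for r, g in zip(ranked, gold)) / n
    return acc, mrr, wup
```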
Referring to FIG. 5, the effect of the method on the three English datasets of SemEval-2016 Task 13 and a comparison with other currently existing methods are shown. The baseline BERT + MLP is a model trained simply by combining the embedding output by BERT with a multilayer neural network; TaxoExpan and STEAM are methods proposed in two papers published in 2020, whose results are taken from those papers. The experiments show that, compared with STEAM, the best existing method, the method provided by the invention achieves average improvements of 13.0%, 14.0%, and 9.8% on the three indexes Acc, MRR, and Wu&P. This comparison fully shows that the method provided by the invention performs excellently on the self-supervised classification system expansion task.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or terminal that comprises the element. Further, herein, "greater than", "less than", "more than", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it.
Although the embodiments have been described, those skilled in the art can make other variations and modifications once the basic inventive concept is known. The above embodiments are therefore only examples of the present invention and do not limit its scope; all equivalent structures or equivalent processes derived from the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (8)

1. The classification system expansion method based on the pre-training text coding model is characterized by comprising the following steps:
step 1, generating self-supervision training data
Generating data for subsequent self-supervision training without external data according to a given existing classification system and the definition of words in the classification system, wherein the part of data consists of a classification system path and the definition of words;
step 2, classification system path sampling
Through the generation of the self-supervision training data in the step 1, classification system path data which can be used for training already exists, but sampling is needed according to the training requirement to obtain a positive and negative sampling data set;
step 3, fine tuning the pre-training text coding model
Respectively inputting the positive and negative sampling data sets of step 2 into the pre-training text coding model, and updating the model parameters with the dynamic margin loss and the back-propagation algorithm, so that the model is fine-tuned to have the capability of judging whether a classification system path is appropriate or not;
step 4, judging the position of the entity word to be inserted in the classification system
And (3) generating a candidate classification system path according to the entity word to be inserted and the word definition thereof, inputting the candidate classification system path into the model trained in the step (3), and sequencing and judging the position according to the score given by the model.
2. The method for extending classification system based on pre-trained text coding model according to claim 1, wherein the specific definition of the classification system path in step 1 is:
in a given taxonomy, a set of nodes with an order is formed by all nodes on the shortest path from a node to the root node.
3. The method for extending classification system based on pre-trained text coding model according to claim 1, wherein the step 1 self-supervised training data generation method is as follows:
generating a classification system path as the correct classification system path for each node in the given classification system, and combining each non-root node with the remaining (|V| - 2) classification system paths to generate (|V| - 2) erroneous classification system paths, wherein V is the set of all nodes in the classification system.
4. The method for extending classification system based on pre-trained text coding model according to claim 1, wherein the step 2 is a specific method for sampling classification system paths:
in each sampling process, every non-root node is sampled: its corresponding classification system path is taken as the positive sample, and an erroneous path randomly extracted from its corresponding erroneous classification system paths is taken as the negative sample;
the sampling process repeats the random extraction in every round of training, so that different erroneous paths are sampled.
5. The method as claimed in claim 1, wherein the data input in step 3 consists of two parts: one part is a classification system path, and the other is the word definition text of the entity word corresponding to that path; when input to the model, the two parts are combined into one string of text: the front is the [CLS] classification symbol, the first half is the word definition text, the middle is divided by the [SEP] symbol, and the second half is the text formed by arranging all entity words on the classification system path in order.
6. The method for expanding the classification system based on the pre-trained text coding model according to claim 5, wherein the specific method for fine-tuning the pre-trained text coding model in the step 3 is as follows:
the vector corresponding to the [CLS] classification symbol output by the pre-training text coding model is input into a multilayer fully-connected neural network, which outputs a scalar representing the score of whether the classification system path is appropriate; the two scores yield a loss according to the dynamic margin loss described above, and the parameters of the model are then updated with the back-propagation algorithm, thereby fine-tuning the pre-training text coding model and the multilayer fully-connected neural network added on top of it.
7. The method for extending a classification system based on a pre-trained text coding model according to claim 1 or 6, wherein the dynamic margin loss is specifically defined as:
first, the dynamic margin function is defined as:

\gamma(P, P') = k \cdot \left(1 - \frac{|P \cap P'|}{|P \cup P'|}\right)

wherein P and P' represent the correct classification system path and the wrong classification system path respectively, and k is a hyper-parameter of the model, set according to the size of the given classification system;
the dynamic margin loss is:

\mathcal{L} = \sum_{P \in \mathcal{P}} \sum_{P' \in \mathcal{N}(P)} \max\left(0,\; \gamma(P, P') - s(P) + s(P')\right)

wherein \mathcal{P} represents the set of positively sampled classification system paths, \mathcal{N}(P) represents the corresponding set of negatively sampled classification system paths, s(P) and s(P') respectively represent the output scores of the model under training for the classification system paths P and P', and the max function selects the larger of its two arguments as output.
8. The classification system expansion method based on the pre-trained text coding model according to claim 1, wherein the specific method for judging the position of the entity word to be inserted in the classification system in the step 4 is as follows:
given an entity word to be inserted and its noun definition, the entity word is combined with all correct classification system paths on the existing classification system to obtain |V| possible classification system paths; all of these paths are combined with the noun definition and input into the fine-tuned model, the path ranked first according to the resulting scores is taken as the correct classification system path, and the position of the entity word to be inserted in the classification system is judged from that path.
CN202110711017.8A 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model Active CN113407720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110711017.8A CN113407720B (en) 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110711017.8A CN113407720B (en) 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model

Publications (2)

Publication Number Publication Date
CN113407720A true CN113407720A (en) 2021-09-17
CN113407720B CN113407720B (en) 2023-04-25

Family

ID=77679432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110711017.8A Active CN113407720B (en) 2021-06-25 2021-06-25 Classification system expansion method based on pre-training text coding model

Country Status (1)

Country Link
CN (1) CN113407720B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073677A (en) * 2017-11-02 2018-05-25 中国科学院信息工程研究所 A kind of multistage text multi-tag sorting technique and system based on artificial intelligence
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110502643A (en) * 2019-08-28 2019-11-26 南京璇玑信息技术有限公司 A kind of next model autocreating technology of the prediction based on BERT model
CN111444305A (en) * 2020-03-19 2020-07-24 浙江大学 Multi-triple combined extraction method based on knowledge graph embedding
CN111538848A (en) * 2020-04-29 2020-08-14 华中科技大学 Knowledge representation learning method fusing multi-source information
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN112214599A (en) * 2020-10-20 2021-01-12 电子科技大学 Multi-label text classification method based on statistics and pre-training language model
CN112329463A (en) * 2020-11-27 2021-02-05 上海汽车集团股份有限公司 Training method of remote monitoring relation extraction model and related device
CN112434513A (en) * 2020-11-24 2021-03-02 杭州电子科技大学 Word pair up-down relation training method based on dependency semantic attention mechanism
CN112560486A (en) * 2020-11-25 2021-03-26 国网江苏省电力有限公司电力科学研究院 Power entity identification method based on multilayer neural network, storage medium and equipment


Also Published As

Publication number Publication date
CN113407720B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN107291693B (en) Semantic calculation method for improved word vector model
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112699216A (en) End-to-end language model pre-training method, system, device and storage medium
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN112925918B (en) Question-answer matching system based on disease field knowledge graph
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN112214989A (en) Chinese sentence simplification method based on BERT
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN114428850A (en) Text retrieval matching method and system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN113032559B (en) Language model fine tuning method for low-resource adhesive language text classification
CN112989803B (en) Entity link prediction method based on topic vector learning
CN110516240A (en) A kind of Semantic Similarity Measurement model DSSM technology based on Transformer
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN109815497A (en) Based on the interdependent character attribute abstracting method of syntax
CN116757188A (en) Cross-language information retrieval training method based on alignment query entity pairs
CN113407720B (en) Classification system expansion method based on pre-training text coding model
CN116029300A (en) Language model training method and system for strengthening semantic features of Chinese entities
CN113111136B (en) Entity disambiguation method and device based on UCL knowledge space
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant