CN112232024A - Dependency syntax analysis model training method and device based on multi-labeled data - Google Patents

Dependency syntax analysis model training method and device based on multi-labeled data

Info

Publication number
CN112232024A
CN112232024A (application CN202011089840.1A)
Authority
CN
China
Prior art keywords
dependency
arc
score
label
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011089840.1A
Other languages
Chinese (zh)
Inventor
李正华
周明月
赵煜
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202011089840.1A priority Critical patent/CN112232024A/en
Publication of CN112232024A publication Critical patent/CN112232024A/en
Priority to PCT/CN2021/088601 priority patent/WO2022077891A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a dependency syntax analysis model training method based on multi-labeled data, which comprises the following steps: acquiring a word sequence and a plurality of labeling results; inputting the word sequence into a dependency syntax analysis model to obtain an arc score and a label score; calculating loss values of the arc score and the label score relative to the labeling results according to a target loss function; and adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value, so as to train the model. The method calculates the loss of the model's output relative to all labeling results according to the target loss function and completes the iterative training of the model accordingly, thereby making full use of the effective information in all labeled data and improving the dependency parsing capability of the model. The application also provides a dependency parsing model training apparatus, device and readable storage medium based on multi-labeled data, whose technical effects correspond to those of the method.

Description

Dependency syntax analysis model training method and device based on multi-labeled data
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for dependency parsing model training based on multi-labeled data.
Background
The objective of dependency syntax analysis is, given an input sentence, to capture the modification and collocation relationships between the words in the sentence, characterize its syntactic and semantic structure, and construct a dependency syntax tree.
In recent years, with the rapid development of deep learning in natural language processing, the accuracy of dependency parsing has improved significantly. However, when processing text that differs from the training data, the accuracy of dependency parsing may drop dramatically. A straightforward solution to this problem is to annotate domain-specific syntactic data. However, most dependency treebanks are constructed by a small number of linguistics experts over a long period of time; this is time-consuming, labor-intensive and costly, and cannot meet current needs.
Inspired by crowdsourcing, quickly constructing a multi-labeled dependency treebank from the annotations of a large number of non-expert annotators is a feasible approach. However, compared with expert annotation, the annotation quality is relatively low and the inconsistency is high. There are currently two solutions: one selects a single labeling result from the multiple labeling results by majority voting; the other simply discards inconsistent labeled data or reviews it manually.
With majority voting, the voted result may itself be a completely wrong answer, so possibly correct information is discarded entirely and the training effect suffers; moreover, the fewer the annotators, the less reliable the voting result. A weighted voting method can also be used, but it still cannot solve the problem that the vote is untrustworthy when the number of annotators is small.
Simply discarding inconsistent sentences improves the reliability of the data set, but if the inconsistency rate of the original data set is high, this greatly reduces the size of the data set and wastes data. Manual review can greatly improve the quality of the data set, but it is time-consuming, labor-intensive and costly.
In summary, although a data set that can be used directly to train a dependency parsing model can be obtained by majority voting or by simply discarding inconsistent data, both approaches waste data, discard part of the information in the data set, and fail to make full use of the effective information in the multi-labeled data, resulting in poor model performance.
Therefore, how to make full use of multi-labeled data to train a dependency syntax analysis model and improve its performance is a problem to be solved by those skilled in the art.
Disclosure of Invention
The purpose of the invention is to provide a dependency syntax analysis model training method, apparatus, device and readable storage medium based on multi-labeled data, so as to solve the problem that, when a dependency parsing model is trained with multi-labeled data, part of the labeled data is essentially still discarded and only one labeling result is used for training, so that the effective information in the multi-labeled data cannot be fully utilized and model performance is poor. The specific scheme is as follows:
in a first aspect, the present application provides a method for dependency parsing model training based on multi-labeled data, including:
acquiring a word sequence and a plurality of labeling results of the word sequence, wherein for each modified word in the word sequence, the labeling results comprise an arc and a dependency relationship label, and each labeling result comes from different users;
inputting the word sequence into a dependency syntactic analysis model to obtain an arc score and a label score;
calculating loss values of the arc score and the label score relative to the plurality of labeled results according to a target loss function;
and adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value so as to realize the training of the dependency syntax analysis model.
Preferably, the calculating the loss values of the arc score and the label score relative to the plurality of labeled results according to an objective loss function includes:
setting weight values for various marking results in the multiple marking results according to the marking capabilities of different users;
and calculating the loss values of the arc scores and the label scores relative to the various labeling results according to the target loss function and the weight values of various labeling results.
Preferably, the setting a weight value for each of the plurality of labeling results includes:
and respectively setting an arc weight value and/or a label weight value aiming at each labeling result in the plurality of labeling results.
Preferably, the calculating the loss values of the arc score and the label score relative to the plurality of labeled results according to an objective loss function includes:
calculating the loss value of the arc score relative to the arc in the various labeling results according to an arc loss function to obtain a first loss value;
calculating the loss value of the label score relative to the dependency relationship label in the multiple labeling results according to a label loss function to obtain a second loss value;
determining a loss value of the arc score and the label score relative to a plurality of annotated results based on the first loss value and the second loss value.
Preferably, the calculating, according to a tag loss function, a loss value of the tag score with respect to the dependency tag in the multiple labeling results to obtain a second loss value includes:
calculating a loss value of the tag score relative to a dependency relationship tag in a target labeling result according to a tag loss function to obtain a second loss value, wherein the target labeling result is a labeling result that an arc in the multiple labeling results is equal to a target arc, the target arc is an arc determined according to a target strategy, and the target strategy comprises: arc score prediction, majority voting, weighted voting, random selection.
Preferably, the dependency parsing model includes: an input layer, a coding layer, a first MLP layer, a first scoring layer, a second MLP layer and a second scoring layer;
wherein the first MLP layer is used for determining, according to the output of the coding layer, a representation vector of the current word as a core word and a representation vector of the current word as a modifier, and the first scoring layer is used for determining an arc score according to the output of the first MLP layer;
the second MLP layer is used for determining, according to the output of the coding layer, a representation vector containing dependency label information when the current word is used as a core word and a representation vector containing dependency label information when the current word is used as a modifier, and the second scoring layer is used for determining a label score according to the output of the second MLP layer.
Preferably, the coding layer of the dependency parsing model comprises multiple BiLSTM layers.
In a second aspect, the present application provides a dependency parsing model training apparatus based on multi-labeled data, including:
a training sample acquisition module: for acquiring a word sequence and a plurality of labeling results of the word sequence, wherein for each modifier in the word sequence the labeling result comprises an arc and a dependency relationship label, and each labeling result comes from a different user;
an input-output module: for inputting the word sequence into a dependency syntax analysis model to obtain an arc score and a label score;
a loss calculation module: for calculating loss values of the arc score and the label score relative to the plurality of labeling results according to a target loss function;
an iteration module: for adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value, so as to train the dependency syntax analysis model.
In a third aspect, the present application provides a dependency parsing model training device based on multi-labeled data, including:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the multi-labeled data-based dependency parsing model training method as described above.
In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-labeled data-based dependency parsing model training method described above.
The application provides a dependency syntax analysis model training method based on multi-labeled data, comprising: acquiring a word sequence and a plurality of labeling results of the word sequence; inputting the word sequence into a dependency syntax analysis model to obtain an arc score and a label score; calculating loss values of the arc score and the label score relative to the labeling results according to a target loss function; and adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value, so as to train the dependency syntax analysis model. The method thus calculates the loss of the model's output relative to all labeling results according to the target loss function and completes the iterative training of the model accordingly, making full use of the effective information in all labeled data and improving the dependency parsing capability of the model.
In addition, the application also provides an apparatus, a device and a readable storage medium for training a dependency parsing model based on multi-labeled data; their technical effects correspond to those of the method and are not repeated here.
Drawings
For a clearer explanation of the embodiments or technical solutions of the prior art of the present application, the drawings needed for the description of the embodiments or prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a first embodiment of a multi-labeled data-based dependency parsing model training method provided in the present application;
FIG. 2 is a flowchart illustrating a step S103 in an embodiment of a multi-labeled data-based dependency parsing model training method provided in the present application;
FIG. 3 is a schematic diagram of a model architecture of a second embodiment of a multi-labeled data-based dependency parsing model training method provided by the present application;
FIG. 4 is a diagram illustrating a single annotation result in a second embodiment of a multi-annotation data-based dependency parsing model training method according to the present application;
FIG. 5 is a data storage format of a single annotation result in a second embodiment of a multi-annotation-data-based dependency parsing model training method according to the present application;
FIG. 6 is a diagram illustrating a multi-labeled result of a second embodiment of a multi-labeled data-based dependency parsing model training method provided by the present application;
FIG. 7 is a data storage format of a multi-labeled result in a second embodiment of a multi-labeled data-based dependency parsing model training method provided by the present application;
FIG. 8 is a functional block diagram of an embodiment of a multi-labeled data-based dependency parsing model training apparatus provided in the present application.
Detailed Description
The core of the application is to provide a multi-labeled data-based dependency syntactic analysis model training method, device, equipment and readable storage medium, which can fully utilize effective information in all labeled data and improve the dependency syntactic analysis capability of the model.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a first embodiment of a dependency parsing model training method based on multi-labeled data provided by the present application is described below, where the first embodiment includes:
s101, obtaining a word sequence and a plurality of labeling results of the word sequence, wherein the labeling results comprise arcs and dependency relationship labels for each modified word in the word sequence;
the word sequence is a sequence obtained by segmenting a sentence. In the multiple annotation results (more than two annotation results) obtained in this embodiment, each annotation result comes from a different user. Assuming that a sentence is labeled by K users, K labeling results are generated, and each labeling result is a dependency syntax tree of the sentence.
The dependency syntax tree is used to describe the dependency relationship between words, and one dependency relationship contains three elements: modifiers, core words, and dependency types, meaning that a modifier modifies a core word with a certain dependency type.
In this embodiment, for each modifier in the word sequence, the labeling result includes the following two items of information: an arc (i.e., the core word it depends on) and a dependency relationship label.
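To make the annotation structure concrete, the following minimal Python sketch shows how one word's multi-annotation record could be represented; the class and field names are illustrative assumptions, not taken from the application. Each of the K users contributes an arc (the index of the core word) and a dependency label for the word.

from dataclasses import dataclass
from typing import List

@dataclass
class Annotation:
    """One user's annotation for a single word (the modifier)."""
    head: int   # arc: index of the core word (0 = pseudo root node)
    label: str  # dependency relation label, e.g. "nsubj"

@dataclass
class MultiAnnotatedWord:
    """All K users' annotations for one word in the sentence."""
    form: str                      # the word itself
    annotations: List[Annotation]  # one entry per annotator

# Example: a word annotated by two users who agree on the arc but not the label
word = MultiAnnotatedWord(
    form="apple",
    annotations=[Annotation(head=2, label="obj"),
                 Annotation(head=2, label="nsubj")],
)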
S102, inputting the word sequence into a dependency syntax analysis model to obtain an arc score and a label score;
in this embodiment, the dependency parsing model is used to predict, from the word sequence, the core word and the dependency label of each word. Specifically, the model outputs an arc score and a label score, and the actually predicted arc and dependency label can be determined from these two scores.
This embodiment does not limit which neural network is selected as the dependency parsing model, as long as it can predict dependency relations from the word sequence. As one feasible scheme, the Biaffine Parser model is selected as the dependency parsing model of this embodiment.
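As an illustration of how the predicted arc and label can be read off the two scores, the following sketch assumes a simple greedy per-word decision and illustrative tensor shapes; any tree-constrained decoder the application may use is not shown.

import torch

def greedy_decode(arc_scores: torch.Tensor, label_scores: torch.Tensor):
    """
    arc_scores:   (n+1, n+1)    arc_scores[i, j] = score of word j being the core word of word i
    label_scores: (n+1, n+1, T) label_scores[i, j, t] = score of label t for the arc i -> j
    Returns predicted heads and labels for words 1..n (index 0 is the pseudo root).
    """
    heads = arc_scores.argmax(dim=1)                 # best core word for every word
    n_plus_1 = arc_scores.size(0)
    # pick the best label for the chosen arc of each word
    labels = label_scores[torch.arange(n_plus_1), heads].argmax(dim=1)
    return heads[1:], labels[1:]                     # drop the pseudo-root position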
S103, calculating loss values of the arc scores and the label scores relative to the multiple labeling results according to a target loss function;
in general, when only one labeling result is used as the standard, the loss value between the actual prediction and that labeling result is calculated directly. Since the present application adopts multiple labeling results, the loss of the actual prediction relative to all labeling results needs to be calculated. Specifically, the loss between the actual prediction and each labeling result can be calculated separately and then accumulated.
And S104, adjusting model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value so as to realize the training of the dependency syntax analysis model.
The dependency syntax analysis model training method based on multi-labeled data provided by this embodiment calculates the loss value of the model's output relative to all labeling results according to the target loss function and completes the iterative training of the model accordingly, thereby making full use of the effective information in all labeled data and improving the dependency parsing capability of the model.
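A minimal training-step sketch in PyTorch is given below. The model interface, tensor shapes, and the per-annotator head/label tensors are assumptions for illustration; the loss simply accumulates, for every annotator, the cross entropy of the arc scores and of the label scores on that annotator's arcs, mirroring steps S102 to S104.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, words, chars, heads_per_annotator, labels_per_annotator):
    """One training step: compute the loss of the model's scores against every
    annotator's answer and take a gradient step to reduce it.

    heads_per_annotator  : list of K long tensors of shape (n+1,), entry 0 is a dummy for the root
    labels_per_annotator : list of K long tensors of shape (n+1,), entry 0 is a dummy for the root
    """
    arc_scores, label_scores = model(words, chars)       # (n+1, n+1), (n+1, n+1, T)
    n_plus_1 = arc_scores.size(0)
    loss = arc_scores.new_zeros(())
    for heads, labels in zip(heads_per_annotator, labels_per_annotator):
        # arc loss: cross entropy of each word's head distribution vs. this annotator's arcs
        loss = loss + F.cross_entropy(arc_scores[1:], heads[1:], reduction="sum")
        # label loss: cross entropy of the label distribution on this annotator's arcs
        chosen = label_scores[torch.arange(n_plus_1), heads]   # (n+1, T)
        loss = loss + F.cross_entropy(chosen[1:], labels[1:], reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()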
As a preferred implementation, on the basis of the first embodiment, weight values may be assigned to the labeling results of different users to distinguish their labeling abilities. For example, a relatively high weight may be given to a labeling result provided by an expert, and a lower weight to one provided by an ordinary user.
Specifically, to distinguish the labeling abilities of different users, a weight value is set for each of the multiple labeling results, and the process of S103 is modified to: calculating the loss values of the arc score and the label score relative to the labeling results according to the target loss function and the weight values of the labeling results.
On this basis, considering that a labeling result contains two items of information, the arc and the dependency relationship label, the labeling ability of a user can be distinguished along these two dimensions separately by setting an arc weight value and a label weight value respectively. The labeling ability may even be distinguished along only one of the dimensions.
In this case, the weight-setting process is specifically: setting an arc weight value and/or a label weight value for each of the multiple labeling results. When both are set, the arc weight value may differ from the label weight value.
To sum up, taking Table 1 as an example, when setting the weights for word i, this embodiment provides the following four weight-setting schemes to suit different scenarios (a sketch follows Table 1):
TABLE 1 (the table of weight-setting schemes for word i is provided as an image in the original publication and is not reproduced here)
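Since Table 1 is provided only as an image, the four weight-setting schemes cannot be reproduced verbatim here; the following sketch shows one plausible reading (an assumption): the per-annotation weight may be applied to the arc dimension only, the label dimension only, both, or neither.

from dataclasses import dataclass

@dataclass
class AnnotationWeights:
    """Per-annotation weights; either dimension may be left at 1.0 (unweighted)."""
    arc_weight: float = 1.0
    label_weight: float = 1.0

# Four example configurations for one annotation (an assumed reading of Table 1)
schemes = {
    "unweighted":    AnnotationWeights(),
    "arc only":      AnnotationWeights(arc_weight=0.8),
    "label only":    AnnotationWeights(label_weight=0.6),
    "arc and label": AnnotationWeights(arc_weight=0.8, label_weight=0.6),
}

def weighted_word_loss(arc_loss, label_loss, w: AnnotationWeights):
    """Combine the two loss terms for one annotation using its weights."""
    return w.arc_weight * arc_loss + w.label_weight * label_loss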
Specifically, when calculating the loss value of the actual prediction result (the labeling result output by the model) with respect to all the labeling results, the calculation may be performed from two dimensions of the arc and the dependency relationship label, respectively. In this case, as shown in fig. 2, S103 includes:
s201, calculating loss values of the arc scores relative to the arcs in the multiple labeling results according to an arc loss function to obtain a first loss value;
s202, calculating the loss value of the label score relative to the dependency relationship label in the multiple labeling results according to a label loss function to obtain a second loss value;
s203, determining the loss values of the arc score and the label score relative to various labeling results according to the first loss value and the second loss value.
On this basis, as a preferred embodiment, when calculating the label loss, the difference may be computed not against the dependency relationship labels in all the labeling results but only against those of some of the labeling results. The partial labeling result is a labeling result selected from all labeling results according to a certain strategy, which may specifically be majority voting, weighted voting, arc score prediction, random selection, and the like. In this case, as shown in fig. 3, step S202 is specifically:
calculating a loss value of the tag score relative to a dependency relationship tag in a target labeling result according to a tag loss function to obtain a second loss value, wherein the target labeling result is a labeling result that an arc in the multiple labeling results is equal to a target arc, the target arc is an arc determined according to a target strategy, and the target strategy comprises: arc score prediction, majority voting, weighted voting, random selection.
Wherein arc score prediction means: selecting the arc with the maximum score as the target arc according to the arc scores output by the dependency syntax analysis model;
majority voting means: selecting the arc that occurs most often in the multiple labeling results as the target arc by a majority voting method;
weighted voting means: selecting the target arc by a weighted majority voting method, combining the weight of each labeling result with the number of times each arc occurs in the multiple labeling results;
random selection means: randomly selecting an arc from the multiple labeling results as the target arc. A sketch of these strategies is given below.
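The four target-arc strategies can be sketched as follows; the function name and data layout are illustrative assumptions. The input is the K annotated core words for word i, the model's arc scores for that word, and optional per-annotation weights.

import random
from collections import Counter

def select_target_arc(annotated_heads, arc_scores_for_word, weights=None, strategy="majority"):
    """Pick the target arc for one word from its K annotated heads.

    annotated_heads     : list of K head indices given by the K annotators
    arc_scores_for_word : list of scores, one per candidate head position
    weights             : optional list of K per-annotation weights (for weighted voting)
    """
    if strategy == "arc_score":    # arc with the highest model score
        return max(range(len(arc_scores_for_word)), key=lambda j: arc_scores_for_word[j])
    if strategy == "majority":     # most frequent head among the annotations
        return Counter(annotated_heads).most_common(1)[0][0]
    if strategy == "weighted":     # head frequency weighted by each annotation's weight
        tally = Counter()
        for head, w in zip(annotated_heads, weights):
            tally[head] += w
        return tally.most_common(1)[0][0]
    if strategy == "random":       # any one of the annotated heads
        return random.choice(annotated_heads)
    raise ValueError(f"unknown strategy: {strategy}")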
The second embodiment of the multi-labeled data-based dependency parsing model training method provided by the present application is described in detail below; building on the foregoing description, it illustrates the training process with a practical example.
In this embodiment, a Biaffine Parser model is adopted, as shown in fig. 3. The dependency parsing model includes an input layer, a coding layer, a first MLP layer, a first scoring layer, a second MLP layer and a second scoring layer;
wherein the coding layer comprises multiple BiLSTM layers;
the first MLP layer is used for determining, according to the output of the coding layer, a representation vector of the current word as a core word and a representation vector of the current word as a modifier, and the first scoring layer is used for determining an arc score according to the output of the first MLP layer;
the second MLP layer is used for determining, according to the output of the coding layer, a representation vector containing dependency label information when the current word is used as a core word and a representation vector containing dependency label information when the current word is used as a modifier, and the second scoring layer is used for determining a label score according to the output of the second MLP layer.
For a sentence $S = w_0 w_1 w_2 w_3 \ldots w_n$, $w_0$ is an auxiliary root node inserted at the beginning of the sentence. The input layer maps each word $w_i$ to a vector $x_i$, which is the concatenation of its word embedding vector and its character embedding (Char-LSTM) vector, i.e.:

$$x_i = e_i^{word} \oplus e_i^{char}$$
the coding layer is a multi-layer BiLSTM; the concatenation of the two directional outputs of the previous BiLSTM layer is the input of the next layer.
Then the MLP indicates that the layer will encode the output h of the layeriAs input, four independent MLPs are used to obtain four low-dimensional representation vectors containing corresponding information
Figure BDA0002721636750000101
And
Figure BDA0002721636750000102
as follows:
Figure BDA0002721636750000103
Figure BDA0002721636750000104
Figure BDA0002721636750000105
Figure BDA0002721636750000106
wherein
Figure BDA0002721636750000107
Is wiAs a representative vector when it is a core word,
Figure BDA0002721636750000108
is wiAs a vector of representations when a modifier is used,
Figure BDA0002721636750000109
denotes wiAs a core word, a representation vector containing the predicted dependency label information,
Figure BDA00027216367500001010
is wiThe modifier is a vector containing the representation of the predicted dependency tag information.
The biaffine scoring layer then computes the scores for all dependencies through biaffine, the dependency score being divided into two parts, an arc score and a dependency label score, where the arc score is as follows:
Figure BDA00027216367500001011
wherein scorearc(i, j) represents the score of the dependent arc with j acting as the core word and i acting as the modifier. Matrix WbIs the biaffine parameter.
The dependency label score is as follows:
Figure BDA00027216367500001012
wherein
Figure BDA00027216367500001013
And
Figure BDA00027216367500001014
is the biaffine parameter and b is the offset.
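As a concrete illustration of the MLP representation layer and the biaffine scoring layer described above, the following PyTorch sketch computes arc scores and label scores from the encoder output; the dimensions, initialization, and exact parameterization are assumptions rather than the application's implementation.

import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Minimal sketch of the MLP representation layer plus biaffine scoring."""

    def __init__(self, hidden, arc_dim=500, label_dim=100, num_labels=40):
        super().__init__()
        self.mlp_arc_head = nn.Sequential(nn.Linear(hidden, arc_dim), nn.ReLU())
        self.mlp_arc_dep = nn.Sequential(nn.Linear(hidden, arc_dim), nn.ReLU())
        self.mlp_label_head = nn.Sequential(nn.Linear(hidden, label_dim), nn.ReLU())
        self.mlp_label_dep = nn.Sequential(nn.Linear(hidden, label_dim), nn.ReLU())
        # biaffine parameters, zero-initialized for simplicity of the sketch
        self.W_arc = nn.Parameter(torch.zeros(arc_dim + 1, arc_dim))
        self.U_label = nn.Parameter(torch.zeros(num_labels, label_dim + 1, label_dim + 1))

    def forward(self, h):                    # h: (n+1, hidden), encoder output
        r_ah = self.mlp_arc_head(h)          # w_i as core word
        r_ad = self.mlp_arc_dep(h)           # w_i as modifier
        r_lh = self.mlp_label_head(h)        # core-word view with label information
        r_ld = self.mlp_label_dep(h)         # modifier view with label information
        ones = h.new_ones(h.size(0), 1)
        # arc score: arc_scores[i, j] = [r_ad_i; 1] W_arc r_ah_j
        arc_scores = torch.cat([r_ad, ones], dim=-1) @ self.W_arc @ r_ah.t()
        # label score: label_scores[i, j, l] = [r_ld_i; 1] U_l [r_lh_j; 1]
        ld = torch.cat([r_ld, ones], dim=-1)
        lh = torch.cat([r_lh, ones], dim=-1)
        label_scores = torch.einsum("id,ldk,jk->ijl", ld, self.U_label, lh)
        return arc_scores, label_scores      # (n+1, n+1), (n+1, n+1, T)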
The overall loss of the model consists of two parts, the arc loss and the label loss. The arc loss is the part of the overall loss function that represents the difference between the distribution of predicted arcs and the true arcs; the label loss is the part that represents the difference between the distribution of predicted labels and the true labels.
The original Biaffine Parser uses cross entropy as the loss function, and each word computes its local loss separately. The original arc loss function is:

$$loss^{arc}(i) = -\log \frac{\exp\left(score^{arc}(i, h_i)\right)}{\sum_{0 \le j \le n} \exp\left(score^{arc}(i, j)\right)}$$

where $h_i$ is the core word of $w_i$ in the single gold-standard annotation.
in this embodiment, in order to adapt the model to multi-labeled data, the original loss function of the model is modified so that all answers in the multi-labeled data are fully utilized. Assume that a sentence is labeled by K annotators, generating multi-labeled data. For the i-th word, the K core words labeled by the K annotators are represented as the list $H = [h_1, h_2, \ldots, h_K]$; the arc loss for this word is then:

$$loss^{arc}(i) = -\sum_{k=1}^{K} \log \frac{\exp\left(score^{arc}(i, h_k)\right)}{\sum_{0 \le j \le n} \exp\left(score^{arc}(i, j)\right)}$$
assume that the label set is $L = \{l_1, l_2, \ldots, l_T\}$. For a dependency arc in which modifier $i$ modifies core word $j$ with dependency relationship type $l$, the original label loss is:

$$loss^{label}(i,j,l) = -\log \frac{\exp\left(score^{label}(i,j,l)\right)}{\sum_{1 \le t \le T} \exp\left(score^{label}(i,j,l_t)\right)}$$
suppose that the K dependency labels correspondingly given by the K annotators are represented as $Y = [y_1, y_2, \ldots, y_K]$. Combined with the label loss function, the loss is computed for each pair of answers $(h_k, y_k)$ and then summed, giving the final overall loss function:

$$loss(i) = -\sum_{k=1}^{K}\left[\log \frac{\exp\left(score^{arc}(i, h_k)\right)}{\sum_{0 \le j \le n} \exp\left(score^{arc}(i, j)\right)} + \log \frac{\exp\left(score^{label}(i, h_k, y_k)\right)}{\sum_{1 \le t \le T} \exp\left(score^{label}(i, h_k, l_t)\right)}\right]$$
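Transcribed directly into code, the overall loss above for a single word can be sketched as follows (PyTorch; tensor shapes are assumptions):

import torch
import torch.nn.functional as F

def word_loss(arc_scores_i, label_scores_i, heads_k, labels_k):
    """Overall loss for the i-th word against its K annotations.

    arc_scores_i   : (n+1,)     scores of every candidate core word for word i
    label_scores_i : (n+1, T)   label scores for every candidate arc of word i
    heads_k        : (K,) long  the K annotated core words h_1..h_K
    labels_k       : (K,) long  the K annotated dependency labels y_1..y_K
    """
    log_p_arc = F.log_softmax(arc_scores_i, dim=-1)      # log of the arc distribution
    log_p_label = F.log_softmax(label_scores_i, dim=-1)  # log label distribution per arc
    loss = arc_scores_i.new_zeros(())
    for h, y in zip(heads_k, labels_k):                  # sum over the K answer pairs
        loss = loss - log_p_arc[h] - log_p_label[h, y]
    return loss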
the overall loss is minimized during the training iterations, reducing the difference between the predictions and all annotations and thereby achieving the optimized result.
A final syntactic analysis model is obtained through iterative training; any input sentence can then be decoded and analyzed to obtain a syntax tree. After the syntactic information of the data is obtained, it can be used to extract long-distance information to meet the needs of other natural language tasks.
On the basis, weight values can be set for various marking results. For example, the consistency of one annotator with other annotators is used to measure his annotating ability, and the higher the consistency, the higher the weight.
Assume there are K annotators $\{a_1, a_2, \ldots, a_K\}$; $s(a_k)$ is the number of words labeled by annotator $a_k$, and $w(a_k)$ is the number of those words for which $a_k$'s answer is consistent with the answers given by the other annotators; then $w(a_k)/s(a_k)$ is the consistency rate of annotator $a_k$. The weight is the normalized consistency rate, i.e. the weight of annotator $a_k$ is:

$$\lambda_k = \frac{w(a_k)/s(a_k)}{\sum_{k'=1}^{K} w(a_{k'})/s(a_{k'})}$$
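The consistency-based weights can be computed as in the following sketch; how per-word agreement is counted (here, agreement with at least one other annotator on the same word, with all annotators assumed to have labeled the same words) is an assumption about the definition above.

def annotator_weights(answers):
    """answers[k][i] is annotator a_k's core-word answer for the i-th labeled word.
    Returns normalized weights: each annotator's consistency rate w(a_k)/s(a_k),
    normalized to sum to 1 over the K annotators."""
    K = len(answers)
    rates = []
    for k in range(K):
        s = len(answers[k])                           # words labeled by a_k
        w = sum(1 for i, h in enumerate(answers[k])   # words where a_k agrees with someone else
                if any(answers[j][i] == h for j in range(K) if j != k))
        rates.append(w / s if s else 0.0)
    total = sum(rates)
    return [r / total if total else 1.0 / K for r in rates]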
thus, the arc loss function for the i-th word is modified to:

$$loss^{arc}(i) = -\sum_{k=1}^{K} \lambda_k \log \frac{\exp\left(score^{arc}(i, h_k)\right)}{\sum_{0 \le j \le n} \exp\left(score^{arc}(i, j)\right)}$$
here the dependency-type (label) loss is left unweighted, and the final loss function is then:

$$loss(i) = -\sum_{k=1}^{K}\left[\lambda_k \log \frac{\exp\left(score^{arc}(i, h_k)\right)}{\sum_{0 \le j \le n} \exp\left(score^{arc}(i, j)\right)} + \log \frac{\exp\left(score^{label}(i, h_k, y_k)\right)}{\sum_{1 \le t \le T} \exp\left(score^{label}(i, h_k, l_t)\right)}\right]$$
the above describes the loss function calculation method in this embodiment, and other calculation methods may be adopted in practical applications, which should not be construed as limiting the present application.
The dependency syntax tree is illustrated in FIG. 4, where $s_0$ denotes a pseudo node that points to the word serving as the root of the sentence. A dependency arc consists of three elements $(w_i, w_j, r)$, where $w_i$ is called the core word, $w_j$ is called the modifier, and $r$ is the relationship type, indicating that $w_j$ modifies $w_i$ with the syntactic role $r$. Here only the dependency arcs are shown as examples and the relationship types are omitted.
The existing model uses a gold-standard treebank, in which each sentence has only one standard answer, as shown in fig. 4. Fig. 4 is a graphical representation of dependency syntax data; the corresponding data is stored in the CoNLL format shown in fig. 5, where the second column is the word itself and the seventh column is the corresponding standard answer for the core-word (head) sequence.
In the present application, multiple annotators annotate the same sentence according to the annotation guidelines, producing multiple labeling results. Each sentence therefore has multiple syntax-tree answers. FIG. 6 shows an example of two-annotator labeling, with one annotator's labeling above the sentence and the other's below it. Correspondingly, the CoNLL format is extended so that the data format also accommodates the multi-labeled form, as shown in FIG. 7: the first 10 columns are consistent with the CoNLL format, the 11th and 12th columns are the identifier of the first annotator and that annotator's core-word answer, and the 14th and 15th columns are the identifier of the second annotator and that annotator's core-word answer.
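The following reading sketch is a minimal illustration of loading this format; the column indices follow the description above (1-based in the text, 0-based in the code) and everything else about the layout, including tab separation, is an assumption.

def read_multi_annotated_conll(path):
    """Read sentences in the extended CoNLL format described above.

    Columns 1-10 follow CoNLL; per the description, columns 11-12 hold the first
    annotator's identifier and core-word answer, and columns 14-15 the second annotator's.
    Returns a list of sentences, each a list of token dictionaries.
    """
    sentences, tokens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                        # blank line ends a sentence
                if tokens:
                    sentences.append(tokens)
                    tokens = []
                continue
            cols = line.split("\t")             # assumed tab-separated columns
            tokens.append({
                "form": cols[1],                        # the word itself
                "head": cols[6],                        # standard CoNLL head column
                "annotator_1": (cols[10], cols[11]),    # (identifier, core-word answer)
                "annotator_2": (cols[13], cols[14]),    # (identifier, core-word answer)
            })
    if tokens:
        sentences.append(tokens)
    return sentences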
According to the scheme provided by the application, the data input format and the loss function of the Biaffine Parser basic model are modified, and then the multi-label data can be directly used for training. After iterative training, a syntactic analysis model can be obtained, and a syntactic tree result can be given for any input sentence.
In the following, a multi-labeled data-based dependency parsing model training apparatus according to an embodiment of the present application is described, and a multi-labeled data-based dependency parsing model training apparatus described below and a multi-labeled data-based dependency parsing model training method described above may be referred to in correspondence.
As shown in fig. 8, the dependency parsing model training apparatus based on multi-labeled data according to this embodiment includes:
training sample acquisition module 801: for acquiring a word sequence and a plurality of labeling results of the word sequence, wherein for each modifier in the word sequence the labeling result comprises an arc and a dependency relationship label, and each labeling result comes from a different user;
input-output module 802: for inputting the word sequence into the dependency syntax analysis model to obtain an arc score and a label score;
loss calculation module 803: for calculating loss values of the arc score and the label score relative to the plurality of labeling results according to a target loss function;
iteration module 804: for adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value, so as to train the dependency syntax analysis model.
The multi-labeled data-based dependency parsing model training device of the present embodiment is used for implementing the aforementioned multi-labeled data-based dependency parsing model training method, and therefore specific embodiments of the device can be seen in the foregoing embodiments of the multi-labeled data-based dependency parsing model training method, for example, the training sample obtaining module 801, the input/output module 802, the loss calculating module 803, and the iteration module 804 are respectively used for implementing the steps S101, S102, S103, and S104 of the aforementioned multi-labeled data-based dependency parsing model training method. Therefore, specific embodiments thereof may be referred to in the description of the corresponding respective partial embodiments, and will not be described herein.
In addition, since the dependency parsing model training device based on multi-labeled data of the present embodiment is used to implement the aforementioned dependency parsing model training method based on multi-labeled data, the function corresponds to that of the aforementioned method, and is not described herein again.
In addition, the present application also provides a dependency parsing model training device based on multi-labeled data, including:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the multi-labeled data-based dependency parsing model training method as described above.
Finally, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-labeled data-based dependency parsing model training method described above.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The solutions provided in the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A multi-annotation data-based dependency parsing model training method is characterized by comprising the following steps:
acquiring a word sequence and a plurality of labeling results of the word sequence, wherein for each modified word in the word sequence, the labeling results comprise an arc and a dependency relationship label, and each labeling result comes from different users;
inputting the word sequence into a dependency syntactic analysis model to obtain an arc score and a label score;
calculating loss values of the arc score and the label score relative to the plurality of labeled results according to a target loss function;
and adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value so as to realize the training of the dependency syntax analysis model.
2. The method of claim 1, wherein said calculating loss values for said arc score and said label score with respect to said plurality of labeled results according to an objective loss function comprises:
setting weight values for various marking results in the multiple marking results according to the marking capabilities of different users;
and calculating the loss values of the arc scores and the label scores relative to the various labeling results according to the target loss function and the weight values of various labeling results.
3. The method of claim 2, wherein setting weight values for each of the plurality of annotation results comprises:
and respectively setting an arc weight value and/or a label weight value aiming at each labeling result in the plurality of labeling results.
4. The method of claim 1, wherein said calculating loss values for said arc score and said label score with respect to said plurality of labeled results according to an objective loss function comprises:
calculating the loss value of the arc score relative to the arc in the various labeling results according to an arc loss function to obtain a first loss value;
calculating the loss value of the label score relative to the dependency relationship label in the multiple labeling results according to a label loss function to obtain a second loss value;
determining a loss value of the arc score and the label score relative to a plurality of annotated results based on the first loss value and the second loss value.
5. The method of claim 4, wherein the calculating the loss value of the tag score with respect to the dependency tag in the plurality of labeling results according to the tag loss function to obtain a second loss value comprises:
calculating a loss value of the tag score relative to a dependency relationship tag in a target labeling result according to a tag loss function to obtain a second loss value, wherein the target labeling result is a labeling result that an arc in the multiple labeling results is equal to a target arc, the target arc is an arc determined according to a target strategy, and the target strategy comprises: arc score prediction, majority voting, weighted voting, random selection.
6. The method of claim 1, wherein the dependency parsing model comprises: an input layer, a coding layer, a first MLP layer, a first scoring layer, a second MLP layer and a second scoring layer;
wherein the first MLP layer is used for determining, according to the output of the coding layer, a representation vector of the current word as a core word and a representation vector of the current word as a modifier, and the first scoring layer is used for determining an arc score according to the output of the first MLP layer;
the second MLP layer is used for determining, according to the output of the coding layer, a representation vector containing dependency label information when the current word is used as a core word and a representation vector containing dependency label information when the current word is used as a modifier, and the second scoring layer is used for determining a label score according to the output of the second MLP layer.
7. The method as recited in claim 6, wherein the coding layer of the dependency syntax analysis model comprises multiple BiLSTM layers.
8. A multi-labeled data-based dependency parsing model training apparatus, comprising:
a training sample acquisition module: for acquiring a word sequence and a plurality of labeling results of the word sequence, wherein for each modified word in the word sequence the labeling results comprise an arc and a dependency relationship label, and each labeling result comes from a different user;
an input-output module: for inputting the word sequence into a dependency syntax analysis model to obtain an arc score and a label score;
a loss calculation module: for calculating loss values of the arc score and the label score relative to the plurality of labeling results according to a target loss function;
an iteration module: for adjusting the model parameters of the dependency syntax analysis model through iterative training with the aim of minimizing the loss value, so as to train the dependency syntax analysis model.
9. A multi-labeled data-based dependency parsing model training device, comprising:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the multi-labeled data based dependency parsing model training method according to any one of claims 1-7.
10. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-labeled data based dependency parsing model training method according to any one of claims 1-7.
CN202011089840.1A 2020-10-13 2020-10-13 Dependency syntax analysis model training method and device based on multi-labeled data Pending CN112232024A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011089840.1A CN112232024A (en) 2020-10-13 2020-10-13 Dependency syntax analysis model training method and device based on multi-labeled data
PCT/CN2021/088601 WO2022077891A1 (en) 2020-10-13 2021-04-21 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011089840.1A CN112232024A (en) 2020-10-13 2020-10-13 Dependency syntax analysis model training method and device based on multi-labeled data

Publications (1)

Publication Number Publication Date
CN112232024A true CN112232024A (en) 2021-01-15

Family

ID=74112424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011089840.1A Pending CN112232024A (en) 2020-10-13 2020-10-13 Dependency syntax analysis model training method and device based on multi-labeled data

Country Status (2)

Country Link
CN (1) CN112232024A (en)
WO (1) WO2022077891A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113901791A (en) * 2021-09-15 2022-01-07 昆明理工大学 Enhanced dependency syntax analysis method for fusing multi-strategy data under low-resource condition
WO2022077891A1 (en) * 2020-10-13 2022-04-21 苏州大学 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus
CN114611463A (en) * 2022-05-10 2022-06-10 天津大学 Dependency analysis-oriented crowdsourcing labeling method and device
CN114611487A (en) * 2022-03-10 2022-06-10 昆明理工大学 Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment
CN116306663A (en) * 2022-12-27 2023-06-23 华润数字科技有限公司 Semantic role labeling method, device, equipment and medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115062611B (en) * 2022-05-23 2023-05-05 广东外语外贸大学 Training method, device, equipment and storage medium of grammar error correction model
CN115391608B (en) * 2022-08-23 2023-05-23 哈尔滨工业大学 Automatic labeling conversion method for graph-to-graph structure
CN117436446B (en) * 2023-12-21 2024-03-22 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462066A (en) * 2014-12-24 2015-03-25 北京百度网讯科技有限公司 Method and device for labeling semantic role
CN107168945A (en) * 2017-04-13 2017-09-15 广东工业大学 A kind of bidirectional circulating neutral net fine granularity opinion mining method for merging multiple features
CN108172246A (en) * 2017-12-29 2018-06-15 北京淳中科技股份有限公司 The Collaborative Tagging method and apparatus of more tagging equipments
US10002129B1 (en) * 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
CN108628829A (en) * 2018-04-23 2018-10-09 苏州大学 Automatic treebank method for transformation based on tree-like Recognition with Recurrent Neural Network and system
CN108647254A (en) * 2018-04-23 2018-10-12 苏州大学 Automatic treebank method for transformation and system based on pattern insertion
CN110458181A (en) * 2018-06-07 2019-11-15 中国矿业大学 A kind of syntax dependency model, training method and analysis method based on width random forest
CN110795934A (en) * 2019-10-31 2020-02-14 北京金山数字娱乐科技有限公司 Sentence analysis model training method and device and sentence analysis method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965821B (en) * 2015-07-17 2018-01-05 苏州大学 A kind of data mask method and device
CN110444261B (en) * 2019-07-11 2023-02-03 新华三大数据技术有限公司 Sequence labeling network training method, electronic medical record processing method and related device
CN110472229B (en) * 2019-07-11 2022-09-09 新华三大数据技术有限公司 Sequence labeling model training method, electronic medical record processing method and related device
CN112232024A (en) * 2020-10-13 2021-01-15 苏州大学 Dependency syntax analysis model training method and device based on multi-labeled data

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462066A (en) * 2014-12-24 2015-03-25 北京百度网讯科技有限公司 Method and device for labeling semantic role
US10002129B1 (en) * 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
CN107168945A (en) * 2017-04-13 2017-09-15 广东工业大学 A kind of bidirectional circulating neutral net fine granularity opinion mining method for merging multiple features
CN108172246A (en) * 2017-12-29 2018-06-15 北京淳中科技股份有限公司 The Collaborative Tagging method and apparatus of more tagging equipments
CN108628829A (en) * 2018-04-23 2018-10-09 苏州大学 Automatic treebank method for transformation based on tree-like Recognition with Recurrent Neural Network and system
CN108647254A (en) * 2018-04-23 2018-10-12 苏州大学 Automatic treebank method for transformation and system based on pattern insertion
CN110458181A (en) * 2018-06-07 2019-11-15 中国矿业大学 A kind of syntax dependency model, training method and analysis method based on width random forest
CN110795934A (en) * 2019-10-31 2020-02-14 北京金山数字娱乐科技有限公司 Sentence analysis model training method and device and sentence analysis method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHAO Yu et al.: "Dependency Parsing with Noisy Multi-annotation Data", CCF International Conference on Natural Language Processing and Chinese Computing
FAN Ziwei et al.: "Implicit discourse relation classification based on BiLSTM combined with self-attention mechanism and syntactic information", Computer Science (计算机科学)
JIANG Wei et al.: "Syntax-enhanced UCCA semantic parsing method", Acta Scientiarum Naturalium Universitatis Pekinensis (北京大学学报(自然科学版))

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022077891A1 (en) * 2020-10-13 2022-04-21 苏州大学 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus
CN113901791A (en) * 2021-09-15 2022-01-07 昆明理工大学 Enhanced dependency syntax analysis method for fusing multi-strategy data under low-resource condition
CN113901791B (en) * 2021-09-15 2022-09-23 昆明理工大学 Enhanced dependency syntax analysis method for fusing multi-strategy data under low-resource condition
CN114611487A (en) * 2022-03-10 2022-06-10 昆明理工大学 Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment
CN114611463A (en) * 2022-05-10 2022-06-10 天津大学 Dependency analysis-oriented crowdsourcing labeling method and device
CN116306663A (en) * 2022-12-27 2023-06-23 华润数字科技有限公司 Semantic role labeling method, device, equipment and medium
CN116306663B (en) * 2022-12-27 2024-01-02 华润数字科技有限公司 Semantic role labeling method, device, equipment and medium

Also Published As

Publication number Publication date
WO2022077891A1 (en) 2022-04-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210115)