CN116227434B - Aviation product text entity identification method based on weak supervision learning - Google Patents
- Publication number: CN116227434B
- Application number: CN202211690404.9A
- Authority: CN (China)
- Prior art keywords: attention, text, self, sentence, representing
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to an aviation product text entity recognition method based on weakly supervised learning, comprising the following steps: S1: processing the original text and performing text encoding; S2: performing text entity recognition based on self-attention; S3: performing uncertainty sampling of sentences based on a minimum confidence strategy; S4: automatically mining the entity vocabulary in a weakly supervised manner. In the invention, the model itself discovers the supervision signals, and the minimum confidence strategy actively selects the data that are most valuable to label, which reduces the time and labor spent on data labeling and improves working efficiency. Only a small amount of data is used to train the weakly supervised learning model in each round, so the pre-trained model is fine-tuned through weakly supervised learning without loss of effectiveness, and the error accumulation caused by training directly on massive data is avoided. Over multiple rounds of weakly supervised learning, the invention improves the generalization ability of the model and ultimately achieves automatic mining of the entity vocabulary.
Description
Technical Field
The invention relates to the field of aviation product text entity recognition, in particular to an aviation product text entity recognition method based on weak supervision learning.
Background
In the era of big data, natural language data such as news, social network posts and academic papers is growing explosively. This unstructured text contains a great deal of valuable information, and structuring it, mining it and processing it with natural language processing techniques has long been a major problem in the field.
Entity recognition identifies the entity names in a text. It is a basic and important task in natural language processing, and many downstream tasks rely on its results. Although the task has been studied for a long time and many solutions exist, training an entity recognition model requires a data set that is both high in quality and large in quantity, and the availability of such data sets is a concern; this can be regarded as a bottleneck for the field and for the task.
On the one hand, for data sets in specific domains the labeling cost is high: besides the large quantity of data, the labeling must be done by crowd-sourced domain experts to ensure the accuracy and validity of the labels, which consumes a large amount of time, manpower and material resources.
On the other hand, an existing single model is not sufficient to solve practical problems and must still be assisted by pre-processing, post-processing and other means, so model capability is poor and working efficiency is low.
Disclosure of Invention
In order to overcome the defects of the prior art described in the background, the invention lets the model itself discover the supervision signals and uses a minimum confidence strategy to actively select the data that are most valuable to label, which reduces the time and labor spent on data labeling and improves working efficiency. When the weakly supervised model is trained, only a small amount of data takes part in training in each round, so the pre-trained model is fine-tuned through weakly supervised learning without loss of effectiveness, and the error accumulation caused by training directly on massive data is avoided. In addition, over multiple rounds of weakly supervised learning the invention improves the generalization ability of the model and ultimately achieves automatic mining of the entity vocabulary.
In order to achieve the above object, the solution adopted by the present invention is:
The invention provides an aviation product text entity recognition method based on weakly supervised learning, characterized by comprising the following steps:
S1: processing the original text and performing text encoding, comprising the following substeps:
S11: converting the original text into vector form;
S12: performing the target word embedding, segment embedding and position embedding operations;
S13: adding the three embeddings of step S12 to obtain the text code of the Chinese sentence, according to the following formula:
$$X = [X_1, X_2, \dots, X_N], \qquad X_i = E_i^{\mathrm{token}} + E_A + E_i^{\mathrm{pos}}$$
wherein: $X$ represents the matrix of the entire sentence after text encoding; $X_i$ represents the text code embedding of the $i$-th word block in the sentence; $E_i^{\mathrm{token}}$ represents the target word embedding of the $i$-th word block; $E_A$ represents the segment embedding of the sentence; $E_i^{\mathrm{pos}}$ represents the position embedding of the $i$-th word block; $i$ represents the word block number in the sentence, $i \in \{1, 2, \dots, N\}$; $N$ represents the total number of word blocks in the sentence;
S2: performing text entity recognition based on self-attention, comprising the following substeps:
S21: constructing the self-attention heads (Q, K, V) and obtaining the Query, Key and Value matrices of each self-attention head, the number of self-attention heads per layer being h=8; merging the self-attention values of the self-attention heads and then outputting them through a fully connected layer with dimension reduction, calculated as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_l = \mathrm{Attention}(Q W_l^Q,\; K W_l^K,\; V W_l^V)$$
wherein: $\mathrm{MultiHead}(Q, K, V)$ represents the self-attention value after the $h$ self-attention heads are merged; $W^O$ represents the weight matrix applied to the merged self-attention heads; Concat represents the merge function; $\mathrm{head}_l$ represents the $l$-th self-attention head; $l$ represents the self-attention head number, $l \in \{1, 2, \dots, h\}$; $h$ represents the total number of self-attention heads; Attention represents the self-attention function; $Q W_l^Q$, $K W_l^K$ and $V W_l^V$ respectively represent the Q, K and V matrices of the $l$-th self-attention head, obtained by multiplying the weight matrices of that head by $X^T$; $Q$ represents the Query matrix; $K$ represents the Key matrix; $V$ represents the Value matrix;
S22: training through a feedforward neural network and normalizing with an activation function to obtain the probabilities of the N categories corresponding to each word block;
S3: performing uncertainty sampling of sentences based on a minimum confidence strategy, comprising the following substeps:
S31: obtaining the probability of each word block in each category: after each round of training is completed, data are sampled from the data pool and fed to the trained model for probability prediction, which gives, for every sentence, the probability of each word block in each category, i.e. the fully connected layer output of each sentence;
S32: after the fully connected layer output of each sentence is obtained, calculating the confidence with the minimum confidence strategy to obtain the uncertainty of the whole sentence, specifically: averaging the uncertainties of all word blocks to obtain the uncertainty of the sentence; under the minimum confidence strategy, the uncertainty of each sentence is calculated as follows:
$$\mathrm{Uncertainty}_S = \frac{1}{N} \sum_{i=1}^{N} \bigl(1 - \max(P_i)\bigr)$$
wherein: $\mathrm{Uncertainty}_S$ represents the uncertainty of the sentence; $P_i$ represents the probability values of the $i$-th word block calculated after the feedforward neural network prediction; max represents taking the maximum value;
S33: after the uncertainties of all sentences are obtained, sorting the sentences in descending order of uncertainty; the higher the uncertainty value, the less certain the model is about the piece of data, so the data has higher annotation value and is recommended first; the lower the uncertainty value, the more the semantic information of the piece of data is already partly or completely covered by the data of previous rounds, so the piece of data is excluded from the data to be labeled;
S34: repeating step S2 with the recommended data in small batches to train the model, so that the model is fine-tuned step by step;
S4: automatically mining the entity vocabulary in a weakly supervised manner and acquiring the required text entity fragment information, comprising the following substeps:
S41: representing the word blocks of the sentences in the original text as features with context encoding; converting the attention map into an image whose length and width are N pixels and which has L×H channels, thereby converting the entity recognition problem into an image recognition problem; adopting a lightweight convolutional neural network whose output serves as the input of a logistic regression layer, so that labels are assigned to the relevant word segments; the training process is formulated as follows:
$$\hat{\theta} = \arg\min_{\theta} \sum_{m=1}^{M} \mathrm{Loss}\bigl(f_{\theta}(X_m),\, p_m\bigr)$$
wherein: $\hat{\theta}$ represents the calculated model parameters; $\theta$ represents the initial value of the model parameters; $\arg\min_{\theta}$ represents selecting the $\theta$ for which the calculated result is smallest; $f_{\theta}$ represents the model parameterized by $\theta$; $p_m$ represents the $m$-th of the $M$ segment candidates; $X_m$ represents the self-attention feature of the $m$-th segment candidate; Loss represents the loss function; $m$ represents the segment candidate number, $m \in \{1, 2, \dots, M\}$; $M$ represents the total number of segment candidates;
S42: after model training is completed, fixing the parameters $\hat{\theta}$; in the prediction phase, the segment candidate $p_m$ containing the original segment information is calculated in reverse through the formula of step S41, and the required text entity fragment information is acquired.
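The uncertainty sampling of steps S31 to S33 can be illustrated with a minimal sketch. It assumes the least-confidence form, in which the uncertainty of a word block is one minus its largest class probability; the function names `sentence_uncertainty` and `recommend`, the sentence identifiers and the probability values are hypothetical and chosen only for illustration.

```python
def sentence_uncertainty(token_probs):
    """Average over word blocks of (1 - max class probability).

    Assumption: this is the usual least-confidence score; the exact
    formula of the method is taken to have this form.
    """
    return sum(1.0 - max(p) for p in token_probs) / len(token_probs)


def recommend(pool, k):
    """Return the ids of the k most uncertain sentences, most uncertain first."""
    scored = [(sentence_uncertainty(probs), sid) for sid, probs in pool.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # descending uncertainty
    return [sid for _, sid in scored[:k]]


# Hypothetical pool: sentence id -> per-word-block class probabilities.
pool = {
    "s1": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
    "s2": [[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]],
    "s3": [[0.99, 0.005, 0.005]],
}

selected = recommend(pool, k=2)
print(selected)
```

Sentences that are not selected stay in the pool, which matches the idea that low-uncertainty data are set aside rather than labeled in the current round.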
Preferably, in step S11, the Chinese sentence is converted into vector form as follows:
the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model uses the encoder of the Transformer architecture; multiple encoder layers are stacked to build the basic network structure and form a deep bidirectional model, which accumulates and captures the left and right context information of a target word from the first layer to the last layer; a Chinese sentence in the original text is divided into a plurality of word blocks and represented as a vector as follows:
$$S = [c_1, c_2, \dots, c_j, \dots, c_n]$$
wherein: $S$ represents the vector form of a Chinese sentence; $c_j$ represents the $j$-th character in the sentence; $j$ represents the character number in the sentence, $j \in \{1, 2, \dots, n\}$; $n$ represents the total number of characters in the sentence.
Preferably, the target word embedding, segment embedding and position embedding operations in step S12 are performed as follows:
the target word embedding maps each word block, through the tokenizer, to its number in the vocabulary built into the tokenizer, as follows:
$$E_i^{\mathrm{token}} = f_{\mathrm{tok}}(c_i)$$
wherein: $f_{\mathrm{tok}}$ represents the target word embedding function;
the segment embedding applies when long input text is split into segments, the embedding being used to distinguish the content of each text segment; the segment embedding value of each word block is $E_A = 0$;
the position embedding passes the position information of the word blocks to the pre-trained model, so that the subsequent self-attention calculation can derive the context information of the word blocks; the embedded value is typically the index of the word block in the sentence.
Preferably, the Query, Key and Value matrices in step S21 are obtained as follows:
the BERT model is trained on the text-encoded data before performing next sentence prediction; after the sentence code $X$ is input to self-attention, $X$ is first multiplied by the weight matrices $W_Q$, $W_K$ and $W_V$ to obtain the $Q$, $K$ and $V$ matrices, which together make up the self-attention heads of the pre-trained model, calculated as follows:
$$Q = W_Q X^T, \qquad K = W_K X^T, \qquad V = W_V X^T$$
wherein: $W_Q$ represents the weight matrix used to calculate the Query matrix; $W_K$ represents the weight matrix used to calculate the Key matrix; $W_V$ represents the weight matrix used to calculate the Value matrix; $X^T$ represents the transpose of the sentence code $X$;
the product of the Q and K matrices is calculated first; to prevent the result from becoming too large, it is divided by the square root of the dimension $d_k$; the result is then normalized by the activation function softmaxA to obtain probability values, which are finally multiplied by the V matrix to obtain the self-attention output, according to the following formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmaxA}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
wherein: $\mathrm{Attention}(Q, K, V)$ represents the self-attention score; $d_k$ represents the dimension of the K matrix; softmaxA represents the activation function.
Preferably, the training through the feedforward neural network in step S22 is specifically:
the self-attention value matrix MultiHead(Q, K, V) is combined with the initial input sentence code X, and after dimension reduction by the fully connected layer the input x of the feedforward neural network is obtained; the network is fitted to the training input text, and the calculated weights are retained in this layer for fitting the prediction results in the next round of training, according to the following formula:
$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$
wherein: $\mathrm{FFN}(x)$ represents the label value produced for the input $x$ in the training phase; $x$ represents the input obtained by dimension reduction after the multi-head self-attention is combined with the sentence code; $W_1$ and $b_1$ respectively represent the weight and offset obtained by the preliminary fit to $x$; $W_2$ and $b_2$ respectively represent the weight and offset obtained by fitting the whole again after the most suitable weight $W_1$ and offset $b_1$ have been selected in the preliminary fit.
Preferably, the normalization with the activation function in step S22 is specifically:
the label values produced for the inputs in the training phase are normalized by the activation function into probability values between 0 and 1, according to the following formula:
$$P_i = \mathrm{softmaxB}\bigl(\mathrm{FFN}(x_i)\bigr)$$
wherein: $P_i$ represents the probability of the $i$-th word block in each category; softmaxB represents the normalization function; $\mathrm{FFN}(x_i)$ represents the label value produced for the input $x_i$ in the training phase.
Preferably, in step S41 the word blocks of the sentences in the original text are represented as features with context encoding as follows:
for the original text given in the text representation of step S1, a sentence fragment has the following vector form:
$$\mathrm{span}_{j,j+s} = [c_j, \dots, c_{j+s}]$$
wherein: $\mathrm{span}_{j,j+s}$ represents the vector of the sentence fragment that starts at the $j$-th character of the sentence; $c_{j+s}$ represents the $(j+s)$-th character in the sentence; $s$ represents the length of the sentence fragment;
for a model with an L-layer encoder and H self-attention heads per encoder layer, the fragment has an attention score feature vector at the $h$-th attention head of the $l$-th layer;
for each segment candidate, its self-attention feature $X_p$ is obtained by applying the MultiHead calculation over the $L$ encoder layers and the $H$ self-attention heads of each layer, wherein: $X_p$ represents the self-attention score feature vector of the sentence fragment; MultiHead represents the calculation formula of the self-attention score feature vector; $L$ represents calculation by the L-layer encoder; $H$ represents the H self-attention heads of each layer;
simplifying the above and restating it mathematically: for a total of $M$ segment candidates, the self-attention feature of the $m$-th segment candidate is expressed as $X_m$.
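As an illustration of how segment candidates and their attention features could be organized, the following minimal sketch enumerates spans $\mathrm{span}_{j,j+s}$ and crops an attention tensor of shape L × H × N × N to the rows and columns of one span. Cropping the map to the span is an assumption made for this sketch, not a detail stated above; the helper names `span_candidates` and `span_feature`, the 0-based indices and the uniform toy attention values are hypothetical.

```python
def span_candidates(n, max_s):
    """Index pairs (j, j+s) of spans [c_j, ..., c_{j+s}], 0-based, for s = 0..max_s."""
    return [(j, j + s) for j in range(n) for s in range(max_s + 1) if j + s < n]


def span_feature(attn, start, end):
    """Crop an L x H x N x N attention tensor to the span's rows and columns.

    The result has L x H channels, each a (end-start+1) x (end-start+1) map,
    so it can be read as a small multi-channel image.
    """
    return [
        [[row[start:end + 1] for row in head[start:end + 1]] for head in layer]
        for layer in attn
    ]


# Toy attention tensor: L layers, H heads, N word blocks, uniform weights.
L, H, N = 2, 3, 4
attn = [[[[1.0 / N] * N for _ in range(N)] for _ in range(H)] for _ in range(L)]

spans = span_candidates(N, max_s=1)
feature = span_feature(attn, 1, 2)
print(len(spans), len(feature), len(feature[0]), len(feature[0][0]))
```

A classifier such as the lightweight convolutional network mentioned in step S41 would then take each cropped feature as its input.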
Preferably, in another aspect, the invention also provides a product text entity recognition system that implements the aviation product text entity recognition method based on weakly supervised learning, comprising a text encoding unit, a text entity recognition unit, an uncertainty sampling unit and an entity vocabulary automatic mining unit, wherein the text encoding unit processes the original text and performs text encoding, the text entity recognition unit performs text entity recognition based on self-attention, the uncertainty sampling unit performs uncertainty sampling of sentences based on a minimum confidence strategy, and the entity vocabulary automatic mining unit automatically mines the entity vocabulary in a weakly supervised manner to acquire the required text entity fragment information.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, the supervision signals are naturally found by the model, part of data is actively selected as valuable labeling data by a minimum confidence degree strategy, and other data are temporarily discarded, so that the time and labor for labeling the data are greatly reduced, the waste of a large amount of time and labor and materials is reduced, and the working efficiency is further improved.
(2) When the weak supervision model is trained, only a small amount of data is used for participating in the training of the weak supervision learning model in each round, so that the pretraining model is finely adjusted by using the weak supervision learning under the condition of not losing the using effect, and meanwhile, error accumulation caused by direct training of massive data is avoided.
(3) The invention can improve the generalization capability of the model under the condition of carrying out multi-round weak supervised learning, and finally realizes the function of automatically mining entity word list.
Drawings
FIG. 1 is a control block diagram of an aviation product text entity recognition method based on weakly supervised learning according to an embodiment of the invention;
FIG. 2 is an architecture diagram of the BERT (Bidirectional Encoder Representations from Transformers) model according to an embodiment of the invention;
FIG. 3 is a text encoding diagram of an embodiment of the present invention;
FIG. 4 is a diagram of input text passing through a single-layer Transformer Encoder according to an embodiment of the present invention;
FIG. 5 is a diagram of input text passing through a multi-layer Transformer Encoder according to an embodiment of the present invention;
FIG. 6 is a diagram of the prediction probabilities computed for the input text according to an embodiment of the present invention;
FIG. 7 is a visualization of word relationships in an attention map according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The embodiment of the invention works through the processing and analysis of an example: a minimum confidence strategy actively selects part of the data as valuable labeling data, and only a small amount of data takes part in training the weakly supervised learning model, so the pre-trained model is fine-tuned through weakly supervised learning without loss of effectiveness, and the error accumulation caused by training directly on massive data is avoided. Over multiple rounds of weakly supervised learning the generalization ability of the model is improved, and automatic mining of the entity vocabulary is ultimately achieved. FIG. 1 is a control block diagram of an aviation product text entity recognition method based on weakly supervised learning according to an embodiment of the invention.
The embodiment of the invention firstly provides a product text entity recognition system of an aviation product text entity recognition method based on weak supervision learning, which comprises a text coding unit, a text entity recognition unit, an uncertainty sampling unit and an entity vocabulary automatic mining unit, wherein the text coding unit is used for processing an original text to perform text coding, the text entity recognition unit is used for realizing text entity recognition based on self-attention, the uncertainty sampling unit is used for realizing uncertainty sampling of sentences based on a minimum confidence degree strategy, and the entity vocabulary automatic mining unit is used for completing automatic mining of entity vocabularies by using a weak supervision mode to acquire required text entity fragment information.
Likewise, the embodiment of the invention provides an aviation product text entity recognition method based on weakly supervised learning. To demonstrate the applicability of the invention, the method is applied to an example, with the following steps:
S1: processing the original text and performing text encoding;
First, the original text is converted into vector form. The pre-trained BERT (Bidirectional Encoder Representations from Transformers) model uses the encoder of the Transformer architecture; FIG. 2 shows the architecture of the BERT model according to an embodiment of the invention. Multiple encoder layers are stacked to build the basic network structure and form a deep bidirectional model, i.e. the network accumulates and captures the left and right context information of a target word from the first layer to the last layer, which improves the reliability of the subsequent entity recognition that reasons from context. A Chinese sentence in the original text is divided into a plurality of word blocks and represented as a vector as follows:
$$S = [c_1, c_2, \dots, c_j, \dots, c_n]$$
wherein: $S$ represents the vector form of a Chinese sentence; $c_j$ represents the $j$-th character in the sentence; $j$ represents the character number in the sentence, $j \in \{1, 2, \dots, n\}$; $n$ represents the total number of characters in the sentence.
Assuming that the original text is "今天天气真好" ("The weather is really nice today"), word segmentation gives the word blocks S = [[CLS], 今, 天, 天, 气, 真, 好, [SEP]].
Then, the target word embedding, segment embedding and position embedding operations are performed. The target word embedding maps each word block, through the tokenizer, to its number in the vocabulary built into the tokenizer, as follows:
$$E_i^{\mathrm{token}} = f_{\mathrm{tok}}(c_i)$$
wherein: $f_{\mathrm{tok}}$ represents the target word embedding function. The original text can thus be transcribed into the target word embedding of each word block: [101, 658, 384, 384, 509, 368, 489, 102].
The segment embedding applies when long input text is split into segments, the embedding being used to distinguish the content of each text segment. In practice the text is already divided into lengths suitable for the model during pre-processing, so the input text is rarely split, and the segment embedding value of each word block is $E_A = 0$. The original text can thus be transcribed into the segment embedding of each word block: [0, 0, 0, 0, 0, 0, 0, 0].
The position embedding provides the position information of the word blocks to the pre-trained model, so that the subsequent self-attention calculation can derive the context information of the word blocks; the embedded value is typically the index of the word block. The original text can thus be transcribed into the position embedding of each word block: [0, 1, 2, 3, 4, 5, 6, 7].
Finally, the three embeddings are added to obtain the sentence code of the Chinese sentence, according to the following formula:
$$X = [X_1, X_2, \dots, X_N], \qquad X_i = E_i^{\mathrm{token}} + E_A + E_i^{\mathrm{pos}}$$
wherein: $X$ represents the matrix of the entire sentence after text encoding; $X_i$ represents the text code embedding of the $i$-th word block in the sentence; $E_i^{\mathrm{token}}$ represents the target word embedding of the $i$-th word block; $E_A$ represents the segment embedding of the sentence; $E_i^{\mathrm{pos}}$ represents the position embedding of the $i$-th word block; $i$ represents the word block number in the sentence, $i \in \{1, 2, \dots, N\}$; $N$ represents the total number of word blocks in the sentence.
A text encoding diagram of an embodiment of the present invention is shown in FIG. 3. The figure shows how the representation of each word block is obtained by adding the target word embedding, the segment embedding and the position embedding.
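The addition of the three embeddings can be sketched in a few lines. The sketch assumes that each of the three ID sequences is looked up in its own embedding table before the element-wise sum, as is usual for BERT-style encoders; the table sizes, the toy dimension `DIM` and the randomly filled tables are hypothetical, while the ID sequences are the ones given in the example.

```python
import random


def make_table(rows, dim, seed):
    """Hypothetical embedding table: `rows` random vectors of length `dim`."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def encode(token_ids, segment_ids, position_ids, tok_tab, seg_tab, pos_tab):
    """Return X with one row X_i per word block: token + segment + position."""
    X = []
    for t, s, p in zip(token_ids, segment_ids, position_ids):
        X.append([a + b + c for a, b, c in zip(tok_tab[t], seg_tab[s], pos_tab[p])])
    return X


# ID sequences from the example sentence.
token_ids = [101, 658, 384, 384, 509, 368, 489, 102]
segment_ids = [0] * 8
position_ids = list(range(8))

DIM = 4  # assumed toy embedding dimension
tok_tab = make_table(1000, DIM, seed=0)
seg_tab = make_table(2, DIM, seed=1)
pos_tab = make_table(16, DIM, seed=2)

X = encode(token_ids, segment_ids, position_ids, tok_tab, seg_tab, pos_tab)
print(len(X), len(X[0]))
```

The two word blocks with the same token number 384 receive different rows in `X`, because their position embeddings differ.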
S2: performing text entity recognition based on self-attention;
To parse the text information, the Query, Key and Value matrices of the self-attention heads are needed. The BERT model is trained on the text-encoded data before performing next sentence prediction; after the sentence code $X$ is input to self-attention, $X$ is first multiplied by the weight matrices $W_Q$, $W_K$ and $W_V$ to obtain the $Q$, $K$ and $V$ matrices, which together make up the self-attention heads of the pre-trained model, calculated as follows:
$$Q = W_Q X^T, \qquad K = W_K X^T, \qquad V = W_V X^T$$
wherein: $W_Q$ represents the weight matrix used to calculate the Query matrix; $W_K$ represents the weight matrix used to calculate the Key matrix; $W_V$ represents the weight matrix used to calculate the Value matrix; $X^T$ represents the transpose of the sentence code $X$.
The product of the Q and K matrices is calculated first; to prevent the result from becoming too large, it is divided by the square root of the dimension $d_k$; the result is then normalized by the activation function softmaxA to obtain probability values, which are finally multiplied by the V matrix to obtain the self-attention output, according to the following formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmaxA}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
wherein: $\mathrm{Attention}(Q, K, V)$ represents the self-attention score; $d_k$ represents the dimension of the K matrix; softmaxA represents the activation function.
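The self-attention calculation can be sketched with plain Python lists. The sketch uses the row convention, in which each row of `X` is one word block and Q, K and V are computed as `X` times a weight matrix; this is the transpose of the column form used in the text and is an assumption of the sketch. The toy matrices and the helper names are hypothetical.

```python
import math


def matmul(A, B):
    """Plain-Python product of two row-major matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (softmax(Q K^T / sqrt(d_k)) V, attention weights)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Toy sentence code: three word blocks with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 2.0], [3.0, 4.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
output, weights = attention(Q, K, V)
print(len(output), len(output[0]))
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of `V`.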
The self-attention heads (Q, K, V) are constructed, the number of self-attention heads per layer being h=8; the self-attention results of the heads are merged and then output through a fully connected layer with dimension reduction, calculated as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_l = \mathrm{Attention}(Q W_l^Q,\; K W_l^K,\; V W_l^V)$$
wherein: $\mathrm{MultiHead}(Q, K, V)$ represents the self-attention value after the $h$ self-attention heads are merged; $W^O$ represents the weight matrix applied to the merged self-attention heads; Concat represents the merge function; $\mathrm{head}_l$ represents the $l$-th self-attention head; $l$ represents the self-attention head number, $l \in \{1, 2, \dots, h\}$; $h$ represents the total number of self-attention heads; Attention represents the self-attention function; $Q W_l^Q$, $K W_l^K$ and $V W_l^V$ respectively represent the Q, K and V matrices of the $l$-th self-attention head, obtained by multiplying the weight matrices of that head by $X^T$; $Q$ represents the Query matrix; $K$ represents the Key matrix; $V$ represents the Value matrix.
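The merging of the heads can be sketched as follows. Only h = 8 comes from the text; the sequence length, the model dimension, the random weights and the row convention (each row is one word block) are assumptions made for the sketch, and the helper names are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Plain-Python product of two row-major matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major Q, K and V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head(X, heads, W_O):
    """Run every head, concatenate the outputs per word block, project with W_O."""
    outs = [attention(matmul(X, WQ), matmul(X, WK), matmul(X, WV))
            for WQ, WK, WV in heads]
    concat = [[v for out in outs for v in out[i]] for i in range(len(X))]
    return matmul(concat, W_O)


rng = random.Random(0)
N, D_MODEL, H = 6, 16, 8          # toy sizes; only h = 8 is taken from the text
D_HEAD = D_MODEL // H
X = rand_matrix(N, D_MODEL, rng)
heads = [tuple(rand_matrix(D_MODEL, D_HEAD, rng) for _ in range(3)) for _ in range(H)]
W_O = rand_matrix(H * D_HEAD, D_MODEL, rng)

Y = multi_head(X, heads, W_O)
print(len(Y), len(Y[0]))
```

Giving each head a slice of the model dimension keeps the concatenated width equal to the input width, which is a common design choice rather than something the text prescribes.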
Training is carried out through a feedforward neural network. The self-attention value matrix MultiHead(Q, K, V) is combined with the initial input sentence encoding X, and the result is reduced in dimension by a fully connected layer to give the input x of the feedforward neural network. The network is fitted to the training input text, and the calculated weights are retained in this layer for fitting the prediction results in the next round of training. The specific formula is as follows:
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
wherein: FFN(x) represents the label value obtained for the input x in the training stage; x represents the input obtained by combining the multi-head self-attention with the sentence encoding and reducing the dimension; $W_1$ and $b_1$ respectively represent the weight and offset obtained by the preliminary fitting of x; $W_2$ and $b_2$ respectively represent the weight and offset obtained by fitting the whole again after the most suitable weight $W_1$ and offset $b_1$ have been selected in the preliminary fitting.
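The feedforward formula FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 can be checked on a small example. The function name `ffn` and the weights below are hypothetical values chosen so that one hidden unit is clipped to zero by the max(0, ·) step.

```python
def ffn(x, w1, b1, w2, b2):
    """Position-wise feedforward layer: max(0, x W1 + b1) W2 + b2."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*w1), b1)
    ]
    return [
        sum(hj * w for hj, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]

# Hypothetical input and parameters: 2 inputs, 2 hidden units, 1 output.
x = [1.0, 2.0]
w1 = [[1.0, -1.0], [1.0, -1.0]]
b1 = [0.0, 0.5]
w2 = [[2.0], [1.0]]
b2 = [0.1]
y = ffn(x, w1, b1, w2, b2)
```

Here the first hidden unit evaluates to 3.0 and the second to -2.5, which the max(0, ·) step clips to 0, so the single output is approximately 3.0 x 2.0 + 0.1 = 6.1.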
Normalization is then performed using an activation function: the label values obtained for the input in the training stage are normalized by the activation function into probability values between 0 and 1. The activation function formula is as follows:
$$P_i = \mathrm{softmaxB}(\mathrm{FFN}(x_i))$$
wherein: $P_i$ represents the probability of the i-th word block in each category; softmaxB represents the normalization function; $\mathrm{FFN}(x_i)$ represents the label value obtained for the input $x_i$ in the training stage.
Finally, the probabilities of the N categories for each word block are obtained. Suppose the model has learned the time entity category Time; the model then holds a label list [O, B-Time, E-Time] whose order corresponds to the order of the N category probabilities of each word block. For the example S = [[CLS], 今, 天, 天, 气, 真, 好, [SEP]] (the sentence 今天天气真好, "the weather is really nice today"), the probability prediction step gives the result [[0.9,0.05,0.05], [0.3,0.6,0.1], [0.1,0.3,0.6], [0.7,0.2,0.1], [0.9,0.01,0.09], [0.8,0.1,0.1], [0.99,0.005,0.005], [0.95,0.01,0.04]]. The second vector [0.3,0.6,0.1] corresponds to the word block 今 and gives the probabilities [O: 0.3, B-Time: 0.6, E-Time: 0.1] over the label list, so this word block is most likely the beginning of a Time entity. Similarly, the word block 天 that follows is most likely the end of a Time entity, which yields the Time entity 今天 ("today").
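The reading of the probability vectors in this example can be sketched in code: take the most probable label for each word block, then join a B-Time block and the following E-Time block into one entity. The function `decode_entities` and the assumption that the example sentence is 今天天气真好 are illustrative and not part of the embodiment.

```python
LABELS = ["O", "B-Time", "E-Time"]
# Assumed word blocks of the example sentence (one Chinese character per block).
TOKENS = ["[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"]
PROBS = [
    [0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.7, 0.2, 0.1],
    [0.9, 0.01, 0.09], [0.8, 0.1, 0.1], [0.99, 0.005, 0.005], [0.95, 0.01, 0.04],
]

def decode_entities(tokens, probs, labels):
    """Pick the most probable label per word block and join B-/E- pairs into entities."""
    tags = [labels[p.index(max(p))] for p in probs]
    entities, start = [], None
    for idx, tag in enumerate(tags):
        if tag.startswith("B-"):
            start = idx  # remember where the entity begins
        elif tag.startswith("E-") and start is not None and tag[2:] == tags[start][2:]:
            entities.append(("".join(tokens[start:idx + 1]), tag[2:]))
            start = None
        else:
            start = None  # an O label breaks any open entity
    return tags, entities

tags, entities = decode_entities(TOKENS, PROBS, LABELS)
```

For the probabilities listed in the example, `tags` marks the second word block as B-Time and the third as E-Time, and `entities` contains the single Time entity 今天.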
FIG. 4 is a diagram of the computation applied to the text input by a single-layer Transformer Encoder according to an embodiment of the invention. The figure shows the structure of a single layer of the multi-layer encoder of the pre-trained model: the text input passes through multi-head self-attention calculation, normalization and linear transformation steps to output the self-attention values.
FIG. 5 is a diagram of the computation applied to the text input by a multi-layer Transformer Encoder according to an embodiment of the invention. The figure shows the multi-layer encoder of the pre-trained model: the output of the previous layer is taken as the input of the current layer, the single-layer calculation is performed, and the output of the current layer is obtained; this continues until the output of the last layer is obtained.
S3: realizing uncertainty sampling of sentences based on a minimum confidence policy;
After each round of training is completed, data is sampled from the data pool and fed to the trained model for probability prediction. This gives, for each sentence, the probability of each word block under each category, that is, the fully connected layer output of each sentence. Confidence calculation and ranked selection are then required over the word blocks of all sentences.
After the fully connected layer output of each sentence is obtained, the minimum confidence policy is used to calculate the uncertainty of the whole sentence: the uncertainties of all word blocks are averaged directly to obtain the uncertainty of the sentence. Under the minimum confidence policy, the uncertainty of each sentence is calculated as follows:
$$\mathrm{Uncertainty}_S = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \max(P_i)\right)$$
wherein: $\mathrm{Uncertainty}_S$ represents the uncertainty of a sentence; $P_i$ represents the probability values of the i-th word block calculated after the feedforward neural network prediction; max represents taking the maximum value; N represents the total number of word blocks in the sentence.
For the above example S = [[CLS], 今, 天, 天, 气, 真, 好, [SEP]] (今天天气真好), the second vector [0.3,0.6,0.1], corresponding to the word block 今, gives the probabilities [O: 0.3, B-Time: 0.6, E-Time: 0.1] over the label list, so the uncertainty of this word block is 1 - 0.6 = 0.4. The uncertainties of all word blocks, obtained in the same way, are [0.1,0.4,0.4,0.3,0.1,0.2,0.01,0.05]. Finally, the uncertainty of the sentence is calculated according to the above formula: Uncertainty_S = (0.1+0.4+0.4+0.3+0.1+0.2+0.01+0.05)/8 = 0.195. The larger this value, the higher the uncertainty of the sentence and the more valuable it is to annotate.
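The arithmetic of this example can be reproduced with a short sketch of the minimum confidence calculation: one minus the largest class probability for each word block, averaged over the sentence. The function name `sentence_uncertainty` is an assumption; the probability vectors are those listed in the example.

```python
def sentence_uncertainty(probs):
    """Mean over word blocks of (1 - highest class probability)."""
    return sum(1.0 - max(p) for p in probs) / len(probs)

# Per-word-block class probabilities from the worked example.
probs = [
    [0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.7, 0.2, 0.1],
    [0.9, 0.01, 0.09], [0.8, 0.1, 0.1], [0.99, 0.005, 0.005], [0.95, 0.01, 0.04],
]
uncertainty = sentence_uncertainty(probs)
```

Up to floating-point rounding, `uncertainty` equals the value 0.195 obtained by hand above.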
After the uncertainties of all sentences are obtained, the sentences are sorted in descending order of uncertainty. The higher the uncertainty, the less certain the model is about the data; such data has higher annotation value and is recommended preferentially. The lower the uncertainty, the more the semantic information of the data is partially or even completely covered by the data of the previous round, and the data is excluded from the data to be annotated. This avoids the model overfitting that too much semantically repetitive data would cause. Step S2 is then repeated with the recommended data, in small steps, to train the model and fine-tune it gradually.
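The ranking and recommendation step can be sketched as a descending sort followed by taking the top entries. The sentence identifiers, the uncertainty values other than 0.195, and the batch size `k` are hypothetical; the embodiment does not state how many sentences are recommended per round.

```python
def recommend(pool, k):
    """Return the ids of the k sentences with the highest uncertainty."""
    ranked = sorted(pool, key=lambda item: item[1], reverse=True)
    return [sentence_id for sentence_id, _ in ranked[:k]]

# Hypothetical (sentence id, uncertainty) pairs from one prediction round.
pool = [("s1", 0.195), ("s2", 0.05), ("s3", 0.31), ("s4", 0.12)]
to_annotate = recommend(pool, k=2)
```

With these values the two most uncertain sentences, `s3` and `s1`, are recommended for annotation, while `s2` and `s4` stay in the pool.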
FIG. 6 shows the prediction probabilities obtained by computing on the input text according to an embodiment of the invention. The figure depicts the process in which the input text, after embedding encoding, undergoes multi-head self-attention calculation followed by normalization, then feedforward neural network calculation followed by normalization, and finally passes through a linear layer and softmax normalization to give the output.
S4: the automatic mining of entity word list is completed by applying a weak supervision mode;
The word blocks of the sentences in the original text are represented by context-encoded features. For the original text given in the text representation of step S1, sentence fragments have the following vector form:
$$\mathrm{span}_{j,j+s} = [c_j, \ldots, c_{j+s}]$$
wherein: $\mathrm{span}_{j,j+s}$ represents the vector of the j-th sentence fragment of a sentence; $c_{j+s}$ represents the (j+s)-th character in the sentence; s represents the length of the sentence fragment.
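Enumerating sentence fragments of the form [c_j, …, c_{j+s}] can be sketched as follows. The function `enumerate_spans`, the limit `max_s` on the offset s, and the four example characters are assumptions for illustration; the embodiment does not specify how candidate fragments are bounded.

```python
def enumerate_spans(chars, max_s):
    """List (j, j + s, characters) for every fragment c_j..c_{j+s} with s <= max_s."""
    spans = []
    for j in range(len(chars)):
        for s in range(max_s + 1):
            if j + s < len(chars):
                spans.append((j, j + s, chars[j:j + s + 1]))
    return spans

# Assumed characters of a short sentence prefix (0-based positions).
chars = ["今", "天", "天", "气"]
candidates = enumerate_spans(chars, max_s=1)
```

For four characters and `max_s=1` this gives seven candidates: four single characters and three two-character fragments such as 今天.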
For a model with an L-layer encoder and H self-attention heads in each encoder layer, the h-th attention head of the l-th layer yields an attention-score feature vector of the segment. Finally, for each segment candidate, its self-attention feature $X_p$ is obtained by applying MultiHead across all layers and heads, wherein: $X_p$ represents the self-attention score feature vector of the sentence fragment; MultiHead represents the calculation of the self-attention score feature vector; L indicates that the calculation passes through the L-layer encoder; H indicates the H self-attention heads of each layer. Simplifying and restating this mathematically: for a total of M segment candidates, the self-attention feature of the i-th segment candidate is denoted $X_i$.
The attention map is converted into an image with L × H channels whose length and width are both N pixels, which turns the entity recognition problem into an image recognition problem. A lightweight convolutional neural network is adopted, and the model output is used as the input of a logistic regression layer to label the relevant word segments. The training process formula is as follows:
$$\hat{\theta} = \arg\min_{\theta} \sum_{m=1}^{M} \mathrm{Loss}\big(f(X_m;\theta),\, p_m\big)$$
wherein: $\hat{\theta}$ represents the calculated model parameters; θ represents the initial value of the model parameters; $\arg\min_{\theta}$ represents selecting the θ that gives the smallest of the calculation results; $f(\cdot\,;\theta)$ represents the model parameterized by θ; $p_m$ represents the m-th segment candidate among the M segment candidates in total; $X_m$ represents the self-attention feature of the m-th segment candidate; Loss represents the loss function; m represents the segment candidate number, m ∈ {1, 2, …, M}; M represents the total number of segment candidates.
FIG. 7 is a diagram of the word relationships visualized in the attention map according to an embodiment of the invention. The figure shows that, after the problem is converted as described above, the context information is revealed through the calculated self-attention map, and the ranges of the words formed by strongly related word blocks are determined in turn.
After model training is completed, the model parameters are fixed. In the prediction stage, the segment candidates containing the original fragment information are calculated in reverse through the above formula, and the high-quality entity fragment information is obtained.
Experimental analysis gives the following baseline for the case where the method is not used: with 100 annotated samples, the F1 score of the model is 0.663. When the data recommended by the method is annotated, the F1 score of the model reaches 0.661 with 70 annotated samples, which is very close to the baseline. With 100 annotated samples, the F1 score rises further to 0.706, because weakly supervised learning reduces part of the error accumulation; this is a clear improvement over the baseline model. Taking 100 annotated samples as the baseline, the weakly supervised learning method can therefore reduce the annotation amount by 30%.
In conclusion, the prediction results show that the aviation product text entity recognition method based on weak supervision learning is effective.
(1) According to the embodiment of the invention, the model finds the supervision signals by itself: through the minimum confidence policy, part of the data of the given example text is actively selected as valuable data for annotation, and the remaining data is set aside for the time being. This greatly reduces the time and labor spent on data annotation and improves working efficiency.
(2) According to the embodiment of the invention, only a small amount of original text data takes part in the training of the weakly supervised learning model in each round, so that weakly supervised learning fine-tunes the pre-trained model without loss of effectiveness, while the error accumulation caused by training directly on massive data is avoided.
(3) According to the embodiment of the invention, multiple rounds of weakly supervised learning improve the generalization ability of the model and finally realize the automatic mining of the entity word list; the example analysis shows that the method performs well in application.
The above examples are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solution of the present invention should fall within the scope of protection defined by the claims of the present invention without departing from the spirit of the present invention.
Claims (8)
1. The aviation product text entity identification method based on weak supervision learning is characterized by comprising the following steps of:
S1, processing the original text of the aviation product and performing text encoding, comprising the following substeps:
S11, converting the original text into vector form;
S12, performing target word embedding, segment embedding and position embedding operations;
S13, adding the three embeddings of step S12 to obtain the text encoding of the Chinese sentence at each position, wherein: X represents the matrix of the entire sentence after text encoding; $X_i$ represents the text encoding information of the i-th word block in the sentence, equal to the sum of the target word embedding of the i-th word block, the segment embedding $E_A$ of the sentence, and the position embedding of the i-th word block; i represents the word block number in the sentence, i ∈ {1, 2, …, N}; N represents the total number of word blocks in the sentence;
S2, realizing text entity recognition based on self-attention, comprising the following substeps:
S21, constructing a self-attention head (Q, K, V) and obtaining the Query, Key and Value matrices of the self-attention head, wherein the number of self-attention heads per layer is h = 8; combining the self-attention results of the plurality of self-attention heads and then reducing the dimension of the output through a fully connected layer, calculated as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
$$\mathrm{head}_l = \mathrm{Attention}(QW_l^{Q},\, KW_l^{K},\, VW_l^{V})$$
wherein: MultiHead(Q, K, V) represents the self-attention value after the h self-attention heads are combined; $W^{O}$ represents the weight matrix applied to the combined self-attention heads; Concat represents the merge function; $\mathrm{head}_l$ represents the l-th self-attention head; l represents the self-attention head number, l ∈ {1, 2, …, h}; h represents the total number of self-attention heads; Attention represents the self-attention function; $QW_l^{Q}$, $KW_l^{K}$ and $VW_l^{V}$ represent the Q, K and V matrices obtained by multiplying the weight matrices of the l-th self-attention head with $X^{T}$; Q represents the Query matrix; K represents the Key matrix; V represents the Value matrix;
S22, training through a feedforward neural network and performing normalization with an activation function to obtain the probabilities of the N categories corresponding to each word block;
S3, realizing uncertainty sampling of sentences based on a minimum confidence policy, comprising the following substeps:
S31, obtaining the probability of each word block under each category: after each round of training is completed, sampling data from the data pool and feeding it to the trained model for probability prediction, to obtain, for each sentence, the probability of each word block under each category, namely the fully connected layer output of each sentence;
S32, after the fully connected layer output of each sentence is obtained, calculating the confidence with the minimum confidence policy to obtain the uncertainty of the whole sentence, specifically: averaging the uncertainties of all word blocks to obtain the uncertainty of the sentence; under the minimum confidence policy, the uncertainty of each sentence is calculated as follows:
$$\mathrm{Uncertainty}_S = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \max(P_i)\right)$$
wherein: $\mathrm{Uncertainty}_S$ represents the uncertainty of a sentence; $P_i$ represents the probability values of the i-th word block calculated after the feedforward neural network prediction; max represents taking the maximum value;
S33, after the uncertainties of all sentences are obtained, sorting them in descending order of uncertainty value; the higher the uncertainty value, the less certain the model is about the data, and the data has higher annotation value and is recommended preferentially; the lower the uncertainty value, the more the semantic information of the data is partially or completely covered by the data of the previous round, and the data is excluded from the data to be annotated;
S34, repeating step S2 with the recommended data in small steps to train the model, so as to fine-tune the model gradually;
S4, completing the automatic mining of the entity word list in a weakly supervised manner and acquiring the required text entity fragment information, comprising the following substeps:
S41, representing the word blocks of the sentences in the original text by context-encoded features; converting the attention map into an image with L × H channels whose length and width are both N' pixels, thereby turning the entity recognition problem into an image recognition problem; adopting a lightweight convolutional neural network and using the model output as the input of a logistic regression layer to label the relevant word segments, the training process formula being as follows:
$$\hat{\theta} = \arg\min_{\theta} \sum_{m=1}^{M} \mathrm{Loss}\big(f(X_m;\theta),\, p_m\big)$$
wherein: $\hat{\theta}$ represents the calculated model parameters; θ represents the initial value of the model parameters; $\arg\min_{\theta}$ represents selecting the θ that gives the smallest of the calculation results; $f(\cdot\,;\theta)$ represents the model parameterized by θ; $p_m$ represents the m-th segment candidate among the M segment candidates in total; $X_m$ represents the self-attention feature of the m-th segment candidate; Loss represents the loss function; m represents the segment candidate number, m ∈ {1, 2, …, M}; M represents the total number of segment candidates;
S42, after model training is completed, fixing the model parameters; in the prediction stage, calculating in reverse, through the formula of step S41, the segment candidates containing the original fragment information, and acquiring the required text entity fragment information.
2. The method for identifying the text entities of aviation products based on weak supervision learning according to claim 1, wherein converting the original text into vector form in step S11 is specifically:
the Transformer-based bidirectional encoder representation pre-trained model uses the encoder of the Transformer structure; the basic network structure is built from multiple encoder layers to construct a deep bidirectional model, which accumulates and captures the left, right and contextual information of the target word from the first layer to the last; a Chinese sentence in the original text is divided into a plurality of word blocks and represented as a vector as follows:
$$S = [c_1, c_2, \ldots, c_j, \ldots, c_n]$$
wherein: S represents the vector form of a Chinese sentence; $c_j$ represents the j-th character in the sentence; j represents the character number in the sentence, j ∈ {1, 2, …, n}; n represents the total number of characters in the sentence.
3. The method for identifying the text entities of aviation products based on weak supervision learning according to claim 2, wherein the target word embedding, segment embedding and position embedding operations in step S12 are specifically performed as follows:
the target word embedding maps each word block, by means of the tokenizer, to its number in the vocabulary built into the tokenizer, this mapping being performed by the target word embedding function;
the segment embedding segments the input long text and uses the embedding to distinguish the content of each text segment, the segment embedding value of each word block being $E_A = 0$;
the position embedding passes the position information of the word blocks to the pre-trained model, so that the subsequent self-attention calculation of the model can derive the context information of the word blocks; the embedding value is typically the index of the word block.
4. The method for identifying the text entities of aviation products based on weak supervision learning according to claim 1, wherein the Query, Key and Value matrices in step S21 are obtained as follows:
the Transformer-based bidirectional encoder representation pre-trained model is trained on the text-encoded data before next-sentence prediction is performed; after the sentence encoding $X^{T}$ is input to self-attention, $X^{T}$ is first multiplied by the weight matrices $W^{Q}$, $W^{K}$ and $W^{V}$ to obtain Q, K and V, for all self-attention heads of the pre-trained model, calculated as follows:
$$Q = X^{T}W^{Q}, \quad K = X^{T}W^{K}, \quad V = X^{T}W^{V}$$
wherein: $W^{Q}$ represents the weight matrix used to calculate the Query matrix; $W^{K}$ represents the weight matrix used to calculate the Key matrix; $W^{V}$ represents the weight matrix used to calculate the Value matrix; $X^{T}$ represents the transposed matrix of the sentence encoding X;
the Q and K matrices are first multiplied; to prevent the result from becoming too large, the product is divided by the square root of the dimension $d_k$; the result is normalized by the activation function softmaxA to obtain probability values, which are finally multiplied by the V matrix to obtain the self-attention output, the specific formula being as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmaxA}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
wherein: Attention(Q, K, V) represents the self-attention score; $d_k$ represents the dimension of the K matrix; softmaxA represents the activation function.
5. The method for identifying the text entities of aviation products based on weak supervision learning according to claim 1, wherein the training through the feedforward neural network in step S22 is specifically:
the self-attention value matrix MultiHead(Q, K, V) is combined with the initial input sentence encoding X, and the result is reduced in dimension by a fully connected layer to give the input x of the feedforward neural network; the network is fitted to the training input text, and the calculated weights are retained in this layer for fitting the prediction results in the next round of training, the specific formula being as follows:
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
wherein: FFN(x) represents the label value obtained for the input x in the training stage; x represents the input obtained by combining the multi-head self-attention with the sentence encoding and reducing the dimension; $W_1$ and $b_1$ respectively represent the weight and offset obtained by the preliminary fitting of x; $W_2$ and $b_2$ respectively represent the weight and offset obtained by fitting the whole again after the most suitable weight $W_1$ and offset $b_1$ have been selected in the preliminary fitting.
6. The method for identifying the text entities of aviation products based on weak supervision learning according to claim 1, wherein the normalization using the activation function in step S22 is specifically:
the label values obtained for the input in the training stage are normalized by the activation function into probability values between 0 and 1, the activation function formula being as follows:
$$P_i = \mathrm{softmaxB}(\mathrm{FFN}(x_i))$$
wherein: $P_i$ represents the probability of the i-th word block in each category; softmaxB represents the normalization function; $\mathrm{FFN}(x_i)$ represents the label value obtained for the input $x_i$ in the training stage.
7. The method for identifying the text entities of aviation products based on weak supervision learning according to claim 1, wherein in step S41 the word blocks of the sentences in the original text are represented by context-encoded features as follows:
for the original text given in the text representation of step S1, sentence fragments have the following vector form:
$$\mathrm{span}_{j,j+s} = [c_j, \ldots, c_{j+s}]$$
wherein: $\mathrm{span}_{j,j+s}$ represents the vector of the j-th sentence fragment of a sentence; $c_{j+s}$ represents the (j+s)-th character in the sentence; s represents the length of the sentence fragment;
for a model with an L-layer encoder and H self-attention heads in each encoder layer, the h-th attention head of the l-th layer yields an attention-score feature vector of the segment;
for each segment candidate, its self-attention feature $X_p$ is obtained by applying MultiHead across all layers and heads, wherein: $X_p$ represents the self-attention score feature vector of the sentence fragment; MultiHead represents the calculation of the self-attention score feature vector; L indicates that the calculation passes through the L-layer encoder; H indicates the H self-attention heads of each layer;
simplifying the above and restating it mathematically: for a total of M segment candidates, the self-attention feature of the i-th segment candidate is denoted $X_i$.
8. A product text entity recognition system for the aviation product text entity recognition method based on weak supervision learning of claim 1, which comprises a text coding unit, a text entity recognition unit, an uncertainty sampling unit and an entity vocabulary automatic mining unit, wherein the text coding unit is used for processing an original text to perform text coding, the text entity recognition unit is used for realizing text entity recognition based on self-attention, the uncertainty sampling unit is used for realizing uncertainty sampling of sentences based on a minimum confidence policy, and the entity vocabulary automatic mining unit is used for completing automatic mining of entity vocabularies by using a weak supervision mode to acquire needed text entity fragment information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211690404.9A CN116227434B (en) | 2022-12-27 | 2022-12-27 | Aviation product text entity identification method based on weak supervision learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211690404.9A CN116227434B (en) | 2022-12-27 | 2022-12-27 | Aviation product text entity identification method based on weak supervision learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116227434A CN116227434A (en) | 2023-06-06 |
CN116227434B true CN116227434B (en) | 2024-02-13 |
Family
ID=86583418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211690404.9A Active CN116227434B (en) | 2022-12-27 | 2022-12-27 | Aviation product text entity identification method based on weak supervision learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116227434B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826303A (en) * | 2019-11-12 | 2020-02-21 | 中国石油大学(华东) | Joint information extraction method based on weak supervised learning |
CN112270193A (en) * | 2020-11-02 | 2021-01-26 | 重庆邮电大学 | Chinese named entity identification method based on BERT-FLAT |
CN114091478A (en) * | 2021-11-30 | 2022-02-25 | 复旦大学 | Dialog emotion recognition method based on supervised contrast learning and reply generation assistance |
-
2022
- 2022-12-27 CN CN202211690404.9A patent/CN116227434B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN116227434A (en) | 2023-06-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||