CN116227434B - Aviation product text entity identification method based on weak supervision learning - Google Patents

Aviation product text entity identification method based on weak supervision learning

Info

Publication number
CN116227434B
CN116227434B
Authority
CN
China
Prior art keywords
attention
text
self
sentence
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211690404.9A
Other languages
Chinese (zh)
Other versions
CN116227434A (en)
Inventor
刘俊
贺薇
董洪飞
陶剑
安然
何柳
胡德雨
武铎
卓雨东
裴育
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Aero Polytechnology Establishment
Original Assignee
China Aero Polytechnology Establishment
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Aero Polytechnology Establishment
Priority to CN202211690404.9A
Publication of CN116227434A
Application granted
Publication of CN116227434B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an aviation product text entity identification method based on weak supervision learning, which comprises the following steps: S1, processing the original text and performing text encoding; S2, realizing text entity recognition based on self-attention; S3, realizing uncertainty sampling of sentences based on a minimum confidence strategy; and S4, completing automatic mining of the entity vocabulary in a weakly supervised manner. In the invention, supervision signals are discovered by the model itself, and valuable annotation data are actively selected through the minimum confidence strategy, which reduces the time and labor of data annotation and improves working efficiency. Only a small amount of data is used to train the weakly supervised learning model in each round, so that weakly supervised learning fine-tunes the pre-trained model without loss of effectiveness and avoids the error accumulation caused by training directly on massive data. With multiple rounds of weakly supervised learning, the invention improves the generalization ability of the model and finally realizes the function of automatically mining the entity vocabulary.

Description

Aviation product text entity identification method based on weak supervision learning
Technical Field
The invention relates to the field of aviation product text entity recognition, in particular to an aviation product text entity recognition method based on weak supervision learning.
Background
In the big-data era, natural language data such as news, social network posts and academic papers grows explosively. This unstructured text contains a great deal of valuable information, and structuring, mining and processing it with natural language processing remains a long-standing and major problem in the field.
Entity identification is the task of recognizing entity names in text. It is a fundamental and important task in natural language processing, and many downstream tasks depend on its results. Although the task has a long history in the field and many solutions exist, obtaining training data sets that are both high in quality and large in quantity remains a bottleneck for the field and the task.
On the one hand, for data sets in specific fields, besides the large annotation cost, the annotation must be completed by domain experts to guarantee the accuracy and validity of the labels, which requires a great investment of time, manpower and material resources.
On the other hand, a single existing model is not sufficient to solve the practical problem; preprocessing, post-processing and other means are still needed to assist it, so model capability is poor and working efficiency is low.
Disclosure of Invention
In order to overcome the defects of the prior art mentioned in the background art, the invention lets the model discover the supervision signals and actively selects valuable annotation data through a minimum confidence strategy, thereby reducing the time and labor of data annotation and improving working efficiency. When training the weakly supervised model, only a small amount of data participates in each round of training, so that weakly supervised learning fine-tunes the pre-trained model without loss of effectiveness while avoiding the error accumulation caused by training directly on massive data. Meanwhile, with multiple rounds of weakly supervised learning, the invention improves the generalization ability of the model and finally realizes the function of automatically mining the entity vocabulary.
In order to achieve the above object, the solution adopted by the present invention is:
The invention provides an aviation product text entity identification method based on weak supervision learning, which comprises the following steps:
S1, processing the original text and performing text encoding, comprising the following substeps:
S11, converting the original text into vector form;
S12, performing the target word embedding, segment embedding and position embedding operations;
S13, adding the three embeddings of step S12 to obtain the text encoding of the Chinese sentence, with the formula:
X_i = E_i^{tok} + E_A + E_i^{pos};
wherein: X represents the matrix of the entire sentence after text encoding, X = [X_1, X_2, …, X_N]; X_i represents the text-encoding embedded information of the i-th word block in the sentence; E_i^{tok} represents the target word embedding of the i-th word block; E_A represents the segment embedding of the sentence; E_i^{pos} represents the position embedding of the i-th word block; i represents the word block number in the sentence, i ∈ 1, 2, …, N; N represents the total number of word blocks in the sentence;
S2, realizing text entity recognition based on self-attention, comprising the following substeps:
S21, constructing the self-attention heads (Q, K, V) and obtaining their Query, Key and Value matrices, the number of self-attention heads per layer being h = 8; the self-attention value results of the h heads are merged and then output through a fully connected layer with reduced dimension, calculated as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, head_l = Attention(Q W_l^Q, K W_l^K, V W_l^V);
wherein: MultiHead(Q, K, V) represents the self-attention value after the h self-attention heads are merged; W^O represents the weight matrix over the multiple self-attention heads; Concat represents the merge function; head_l represents the l-th self-attention head; l represents the self-attention head number, l ∈ 1, 2, …, h; h represents the total number of self-attention heads; Attention represents the self-attention function; W_l^Q, W_l^K and W_l^V respectively represent the weight matrices of the l-th self-attention head that are multiplied with the Q, K and V matrices; Q represents the Query matrix; K represents the Key matrix; V represents the Value matrix;
S22, training through a feedforward neural network and normalizing with an activation function to obtain the probabilities of the N categories corresponding to each word block;
S3, realizing uncertainty sampling of sentences based on the minimum confidence strategy, comprising the following substeps:
S31, obtaining the probability of each word block for each category: after each round of training is completed, data are sampled from the data pool and fed to the trained model for probability prediction, giving the probability of each word block of each sentence for each category, i.e. the fully-connected-layer output result of each sentence;
S32, after obtaining the fully-connected-layer output result of each sentence, calculating confidence with the minimum confidence strategy to obtain the uncertainty of the whole sentence, specifically: averaging the uncertainties of all word blocks to obtain the uncertainty of the sentence; under the minimum confidence strategy, the uncertainty of each sentence is calculated as:
Uncertainty_S = (1/N) Σ_{i=1}^{N} (1 − max P̂_i);
wherein: Uncertainty_S represents the uncertainty of the sentence; P̂_i represents the probability values of the i-th word block calculated after the feedforward neural network prediction; max represents taking the maximum value;
S33, after the uncertainties of all sentences are obtained, sorting them in descending order of uncertainty value; the higher the uncertainty value, the more uncertain the model is about the data, so the data has higher annotation value and is recommended preferentially; the lower the uncertainty value, the more the semantic information of the data is partially or completely covered by the data of previous rounds, and such data is excluded from the data to be annotated;
S34, repeating step S2 with the data recommended in this step for model training, so as to fine-tune the model step by step;
S4, completing automatic mining of the entity vocabulary in a weakly supervised manner and acquiring the required text entity segment information, comprising the following substeps:
S41, representing the word blocks of the sentences in the original text by a feature with context encoding; converting the attention map into an image of L × H channels whose length and width are N pixels, thereby converting the entity recognition problem into an image parsing problem; adopting a lightweight convolutional neural network whose output is used as the input of a logistic regression layer, so that labels are attached to the related word segments, with the training process formula:
θ̂ = argmin_θ Σ_{m=1}^{M} Loss(f_θ(X^{s_m}));
wherein: θ̂ represents the calculated model parameters; θ represents the initial value of the model parameters; argmin_θ represents selecting the θ that minimizes the result; f_θ represents the model parameterized by θ; s_m represents the m-th segment candidate among the M segment candidates in total; X^{s_m} represents the self-attention feature of the m-th segment candidate; Loss represents the loss function; m represents the segment candidate number, m ∈ 1, 2, …, M; M represents the total number of segment candidates;
S42, after model training is completed, fixing the parameters θ̂; in the prediction phase, the s_m containing the original segment information is calculated backwards through the formula in step S41, acquiring the required text entity segment information.
Preferably, in step S11 the Chinese sentence is converted into vector form, specifically:
The transformer-based bidirectional encoder representation technique pre-training model uses the encoder in the transformer structure; multiple encoder layers together build the basic network structure, constructing a deep bidirectional model that accumulates and captures the left and right context information of the target word from the first layer to the last layer; a Chinese sentence in the original text is divided into several word blocks and represented as a vector as follows:
S = [c_1, c_2, …, c_j, …, c_n];
wherein: S represents the vector form of the Chinese sentence; c_j represents the j-th character in the sentence; j represents the character number in the sentence, j ∈ 1, 2, …, n; n represents the total number of characters in the sentence.
Preferably, the specific method for performing the target word embedding, segment embedding and position embedding operations in step S12 is as follows:
The target word embedding maps each word block, through the tokenizer, to a number in the vocabulary built into the tokenizer, as follows:
E_i^{tok} = Tokenizer(c_i);
wherein: Tokenizer represents the target word embedding function;
The segment embedding segments the input long text, and the embedding is used to distinguish the content of each text segment; the segment embedding value of every word block is E_A = 0;
The position embedding transmits the position information of the word blocks to the pre-trained model for the subsequent model self-attention calculation to derive the context information of the word blocks; the embedding value is typically the index of the word block, i.e. E_i^{pos} = i − 1.
Preferably, the method for obtaining the Query, Key and Value matrices in step S21 is as follows:
The transformer-based bidirectional encoder representation technique model trains on the text-encoded data before making the next sentence prediction; after self-attention receives the input sentence encoding X, X^T is first multiplied with the W^Q, W^K, W^V weight matrices to obtain Q, K, V, which come from all the self-attention heads of the pre-training model, with the calculation formula:
Q = W^Q X^T, K = W^K X^T, V = W^V X^T;
wherein: W^Q represents the weight matrix used to calculate the Query matrix; W^K represents the weight matrix used to calculate the Key matrix; W^V represents the weight matrix used to calculate the Value matrix; X^T represents the transpose of the sentence encoding X;
Q is first multiplied with K^T; to prevent the result from becoming too large, it is divided by the square root of its dimension and normalized into probability values through the activation function softmax_A, and the probability values are finally multiplied with the V matrix to obtain the self-attention output, with the formula:
Attention(Q, K, V) = softmax_A(Q K^T / √d_k) V;
wherein: Attention(Q, K, V) represents the self-attention calculation score; d_k represents the dimension of the K matrix; softmax_A represents the activation function.
Preferably, the training through the feedforward neural network in step S22 is specifically:
The self-attention value matrix MultiHead(Q, K, V) is bound with the initial input sentence encoding X and, after dimension reduction by the fully connected layer, yields the input x entering the feedforward neural network; the network fits the training input text, and the calculated weights are kept at this layer for fitting the prediction results in the next round of training, with the formula:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2;
wherein: FFN(x) represents the tag value for input x in the training phase; x represents the input obtained by dimension reduction after the multi-head self-attention is bound with the sentence encoding; W_1, b_1 respectively represent the weight and bias obtained by the preliminary fit of x; W_2, b_2 respectively represent the weight and bias obtained by fitting the whole again after selecting the most suitable weight W_1 and bias b_1 from the preliminary fit.
Preferably, the normalization processing using the activation function in step S22 is specifically:
The tag values produced for the inputs in the training phase are normalized by an activation function into probability values between 0 and 1, with the activation function formula:
P̂_i = softmax_B(FFN(x_i)) = exp(FFN(x_i)) / Σ_j exp(FFN(x_j));
wherein: P̂_i represents the probabilities of the i-th word block over the categories; softmax_B represents the normalization function; FFN(x_i) represents the tag value for input x_i in the training phase.
Preferably, in step S41 the word blocks of the sentences in the original text are represented by a feature with context encoding;
For the original text given in the text representation of step S1, the sentence fragments have the vector form:
span_{j,j+s} = [c_j, …, c_{j+s}];
wherein: span_{j,j+s} represents the vector of the sentence fragment starting at the j-th character of the sentence; c_{j+s} represents the (j+s)-th character in the sentence; s represents the length of the sentence fragment;
For a model with an L-layer encoder and H self-attention heads per layer, the fragment has an attention-score feature vector at the h-th attention head of the l-th layer;
For each segment candidate, its self-attention feature can then be expressed as:
X_p = MultiHead_{L×H}(span_{j,j+s});
wherein: X_p represents the self-attention score feature vector of the sentence fragment; MultiHead represents the calculation of the self-attention score feature vector; L indicates the calculation passes through the L-layer encoder; H represents the H self-attention heads per layer;
Simplifying the above formula and restating it mathematically: for the M segment candidates in total, the self-attention feature of the m-th segment candidate is expressed as X^{s_m}.
Preferably, on the other hand, the invention also provides a product text entity recognition system for the aviation product text entity recognition method based on weak supervision learning, comprising a text encoding unit, a text entity recognition unit, an uncertainty sampling unit and an automatic entity vocabulary mining unit, wherein the text encoding unit is used for processing the original text and performing text encoding, the text entity recognition unit is used for realizing text entity recognition based on self-attention, the uncertainty sampling unit is used for realizing uncertainty sampling of sentences based on the minimum confidence strategy, and the automatic entity vocabulary mining unit is used for completing automatic mining of the entity vocabulary in a weakly supervised manner to acquire the required text entity segment information.
Compared with the prior art, the invention has the beneficial effects that:
(1) In the invention, supervision signals are discovered naturally by the model; part of the data is actively selected as valuable annotation data through the minimum confidence strategy, and the remaining data is temporarily discarded, which greatly reduces the time and labor of data annotation, reduces the waste of time, manpower and material resources, and improves working efficiency.
(2) When training the weakly supervised model, only a small amount of data participates in each round of training, so that weakly supervised learning fine-tunes the pre-trained model without loss of effectiveness while avoiding the error accumulation caused by training directly on massive data.
(3) With multiple rounds of weakly supervised learning, the invention improves the generalization ability of the model and finally realizes the function of automatically mining the entity vocabulary.
Drawings
FIG. 1 is a control block diagram of an aviation product text entity recognition method based on weak supervised learning according to an embodiment of the invention;
FIG. 2 is a diagram of the transformer-based bidirectional encoder representation technique model architecture in accordance with an embodiment of the present invention;
FIG. 3 is a text encoding diagram of an embodiment of the present invention;
FIG. 4 is a diagram of input text passing through a single-layer Transformer Encoder according to an embodiment of the present invention;
FIG. 5 is a diagram of input text passing through multiple Transformer Encoder layers according to an embodiment of the present invention;
FIG. 6 is a prediction probability diagram obtained by calculation over the input text according to an embodiment of the present invention;
FIG. 7 is a diagram of the visual word relationships of the attention map according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
In the embodiment of the invention, through the processing and analysis of an example, part of the data is actively selected as valuable annotation data by the minimum confidence strategy, and only a small amount of data participates in training the weakly supervised learning model, so that weakly supervised learning fine-tunes the pre-trained model without loss of effectiveness and avoids the error accumulation caused by training directly on massive data; with multiple rounds of weakly supervised learning, the generalization ability of the model is improved, and the function of automatically mining the entity vocabulary is finally realized. Fig. 1 is a control block diagram of the aviation product text entity recognition method based on weak supervision learning according to an embodiment of the invention.
The embodiment of the invention first provides a product text entity recognition system for the aviation product text entity recognition method based on weak supervision learning, comprising a text encoding unit, a text entity recognition unit, an uncertainty sampling unit and an automatic entity vocabulary mining unit. The text encoding unit is used for processing the original text and performing text encoding; the text entity recognition unit is used for realizing text entity recognition based on self-attention; the uncertainty sampling unit is used for realizing uncertainty sampling of sentences based on the minimum confidence strategy; and the automatic entity vocabulary mining unit is used for completing automatic mining of the entity vocabulary in a weakly supervised manner to acquire the required text entity segment information.
Similarly, the embodiment of the invention provides an aviation product text entity identification method based on weak supervision learning, applied to an example to demonstrate the applicability of the invention, specifically comprising the following steps:
s1: processing an original text, and performing text coding;
firstly, converting an original text into a vector form; transducer-based bi-directional encoder representation technique pre-training model an encoder within a transducer structure is used, as shown in fig. 2, which illustrates a transducer-based bi-directional encoder representation technique model architecture diagram in accordance with an embodiment of the present invention; constructing a basic network structure by using multiple layers of encoders together, and constructing a depth bidirectional model, namely, the network can effectively accumulate left and right context information of a target word from a first layer to a last layer to capture the left and right context information of the target word, so that the reliability of the follow-up entity identification for reasoning through the context is improved; dividing a Chinese sentence in an original text into a plurality of word blocks, and representing the word blocks as vectors as follows;
S=[c 1 ,c 2 ,…,c j ,…,c n ];
wherein: s represents the vector form of a Chinese sentence; cj represents the j-th character in the sentence; j represents the character number in the sentence, j e 1,2, …, n; n represents the total number of characters in the sentence.
Assuming the original text is "今天天气真好" ("the weather is really nice today"), after tokenization there is the word block sequence S = [[CLS], 今, 天, 天, 气, 真, 好, [SEP]].
Then the target word embedding, segment embedding and position embedding operations are performed. The target word embedding maps each word block, through the tokenizer, to a number in the vocabulary built into the tokenizer, as follows:
E_i^{tok} = Tokenizer(c_i);
wherein: Tokenizer represents the target word embedding function. The original text is thereby transcribed into the target word embedding of each word block: [101, 658, 384, 384, 509, 368, 489, 102].
Segment embedding segments the input long text, and the embedding is used to distinguish the content of each text segment. In fact, the text is already divided into lengths suitable for the model during preprocessing, i.e. the input text is rarely split, so the segment embedding value of every word block is E_A = 0. The original text is thereby transcribed into the segment embedding of each word block: [0, 0, 0, 0, 0, 0, 0, 0].
Position embedding provides the position information of the word blocks to the pre-trained model so that the subsequent self-attention calculation can derive the context information of the word blocks; the embedding value is typically the index of the word block, i.e. E_i^{pos} = i − 1. The original text is thereby transcribed into the position embedding of each word block: [0, 1, 2, 3, 4, 5, 6, 7].
Finally, the three embeddings are added to obtain the sentence encoding of the Chinese sentence, with the formula:
X_i = E_i^{tok} + E_A + E_i^{pos};
wherein: X represents the matrix of the entire sentence after text encoding, X = [X_1, X_2, …, X_N]; X_i represents the text-encoding embedded information of the i-th word block in the sentence; E_i^{tok} represents the target word embedding of the i-th word block; E_A represents the segment embedding of the sentence; E_i^{pos} represents the position embedding of the i-th word block; i represents the word block number in the sentence, i ∈ 1, 2, …, N; N represents the total number of word blocks in the sentence.
A text encoding diagram of an embodiment of the present invention is shown in Fig. 3. The figure shows the process of obtaining the representation of each word block by adding the target word embedding, segment embedding and position embedding.
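As a concrete illustration of this three-embedding sum, the following is a minimal numeric sketch; the vocabulary, table sizes and random initialisation are illustrative assumptions, whereas a real implementation would use a pretrained BERT tokenizer and its learned embedding tables.

```python
import numpy as np

rng = np.random.default_rng(0)

blocks = ["[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"]
vocab = {w: i for i, w in enumerate(["[PAD]", "[CLS]", "[SEP]", "今", "天", "气", "真", "好"])}
d_model = 768  # hidden size of the pre-training model

# Embedding tables (randomly initialised here; pretrained in practice).
tok_table = rng.normal(size=(len(vocab), d_model))
seg_table = rng.normal(size=(2, d_model))    # segment A / segment B
pos_table = rng.normal(size=(512, d_model))  # one row per position index

token_ids = [vocab[w] for w in blocks]       # target word embedding input: E_i^{tok}
segment_ids = [0] * len(blocks)              # single segment, so E_A = 0
position_ids = list(range(len(blocks)))      # position embedding input: block index

# S13: sum the three embeddings to obtain the sentence encoding X (N x d_model).
X = tok_table[token_ids] + seg_table[segment_ids] + pos_table[position_ids]
print(X.shape)  # (8, 768)
```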
S2: implementing text entity recognition based on self-attention;
to dissociate text information, a Query, key and Value matrix from the attention header is needed, a transducer-based bi-directional encoder representation technique model is trained using text encoded data before the next sentence prediction, and after the attention input sentence is encoded with X, X is first multiplied by W Q 、W K 、W V The weight matrix is Q, K, V, and is all of the self-attention heads of the pre-training model, and the calculation formula is as follows:
wherein: w (W) Q Representing a weight matrix used to calculate a Query matrix; w (W) K Representing a weight matrix used to calculate a Key matrix; w (W) V Representing a weight matrix used to calculate the Value matrix; x is X T Representing the transposed matrix of sentence code X.
Q is first multiplied with K^T; to prevent the result from becoming too large, it is divided by the square root of its dimension and normalized into probability values through the activation function softmax_A, and the probability values are finally multiplied with the V matrix to obtain the self-attention output, with the formula:
Attention(Q, K, V) = softmax_A(Q K^T / √d_k) V;
wherein: Attention(Q, K, V) represents the self-attention calculation score; d_k represents the dimension of the K matrix; softmax_A represents the activation function.
The self-attention heads (Q, K, V) are constructed with h = 8 self-attention heads per layer; the self-attention results of the h heads are merged and then output through a fully connected layer with reduced dimension, calculated as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, head_l = Attention(Q W_l^Q, K W_l^K, V W_l^V);
wherein: MultiHead(Q, K, V) represents the self-attention value after the h self-attention heads are merged; W^O represents the weight matrix over the multiple self-attention heads; Concat represents the merge function; head_l represents the l-th self-attention head; l represents the self-attention head number, l ∈ 1, 2, …, h; h represents the total number of self-attention heads; Attention represents the self-attention function; W_l^Q, W_l^K and W_l^V respectively represent the weight matrices of the l-th self-attention head that are multiplied with the Q, K and V matrices; Q represents the Query matrix; K represents the Key matrix; V represents the Value matrix.
Training is carried out through a feedforward neural network: the self-attention value matrix MultiHead(Q, K, V) is bound with the initial input sentence encoding X and, after dimension reduction by the fully connected layer, yields the input x entering the feedforward neural network; the network fits the training input text, and the calculated weights are kept at this layer for fitting the prediction results in the next round of training, with the formula:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2;
wherein: FFN(x) represents the tag value for input x in the training phase; x represents the input obtained by dimension reduction after the multi-head self-attention is bound with the sentence encoding; W_1, b_1 respectively represent the weight and bias obtained by the preliminary fit of x; W_2, b_2 respectively represent the weight and bias obtained by fitting the whole again after selecting the most suitable weight W_1 and bias b_1 from the preliminary fit.
Then normalization is performed with an activation function: the tag values produced for the inputs in the training phase are normalized into probability values between 0 and 1, with the activation function formula:
P̂_i = softmax_B(FFN(x_i)) = exp(FFN(x_i)) / Σ_j exp(FFN(x_j));
wherein: P̂_i represents the probabilities of the i-th word block over the categories; softmax_B represents the normalization function; FFN(x_i) represents the tag value for input x_i in the training phase.
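A numeric sketch of this FFN-plus-softmax_B step follows; the hidden sizes and random weights are illustrative assumptions, whereas in practice the weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_labels = 768, 3072, 3   # 3 labels: [O, B-Time, E-Time]

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, n_labels)), np.zeros(n_labels)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def softmax_B(z):
    # Normalizes the tag values into probabilities between 0 and 1.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(8, d_model))        # 8 word-block representations
probs = softmax_B(ffn(x))                # per-block probabilities over 3 labels
print(probs.shape, probs.sum(axis=-1))   # (8, 3); each row sums to 1
```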
Finally, the probability of each word block for each of the N categories is obtained. Assuming the model has learned the time entity class Time, the model holds a tag list [O, B-Time, E-Time] corresponding to the order of the per-block category probabilities. For the example S = [[CLS], 今, 天, 天, 气, 真, 好, [SEP]], this probability prediction step yields the result [[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.7, 0.2, 0.1], [0.9, 0.01, 0.09], [0.8, 0.1, 0.1], [0.99, 0.005, 0.005], [0.95, 0.01, 0.04]]. The second entry [0.3, 0.6, 0.1] corresponds to the word block "今" with probabilities [O: 0.3, B-Time: 0.6, E-Time: 0.1] over the tag list, so this word block is most likely the beginning of a Time entity; similarly, the word block "天" is most likely the end of a Time entity, giving the Time entity "今天" ("today").
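Reading the entity off these per-block probabilities can be sketched as follows, reusing the probability rows of the example above with the assumed tag list [O, B-Time, E-Time]:

```python
blocks = ["[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"]
labels = ["O", "B-Time", "E-Time"]
probs = [[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.7, 0.2, 0.1],
         [0.9, 0.01, 0.09], [0.8, 0.1, 0.1], [0.99, 0.005, 0.005], [0.95, 0.01, 0.04]]

# Pick the highest-probability label for every word block.
pred = [labels[max(range(len(labels)), key=row.__getitem__)] for row in probs]

# Collect spans that open with B-Time and close with E-Time.
entities, start = [], None
for i, tag in enumerate(pred):
    if tag == "B-Time":
        start = i
    elif tag == "E-Time" and start is not None:
        entities.append("".join(blocks[start:i + 1]))
        start = None
print(entities)  # ['今天'] — the Time entity "today"
```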
FIG. 4 is an operational diagram of input text passing through a single-layer Transformer Encoder according to an embodiment of the invention; the figure shows the structure of a single layer of the pre-trained model's multi-layer encoder, in which the text input goes through multi-head self-attention calculation, normalization and linear transformation steps to output the self-attention values.
FIG. 5 is an operational diagram of input text passing through multiple Transformer Encoder layers according to an embodiment of the invention; in the pre-trained model's multi-layer encoder, the output of the previous layer serves as the input of the current layer for the single-layer calculation process, which continues until the output of the last layer is obtained.
S3: realizing uncertainty sampling of sentences based on the minimum confidence strategy;
After each round of training is completed, data are sampled from the data pool and fed to the trained model for probability prediction, giving the probability of each word block of each sentence for each category, i.e. the fully-connected-layer output of each sentence; confidence calculation and sorting-based selection are then required over the word blocks of all sentences.
After the fully-connected-layer output of each sentence is obtained, the minimum confidence strategy is used to calculate the uncertainty of the whole sentence: the uncertainties of all word blocks are directly averaged to obtain the uncertainty of the sentence. Under the minimum confidence strategy, the uncertainty of each sentence is calculated as:
Uncertainty_S = (1/N) Σ_{i=1}^{N} (1 − max P̂_i);
wherein: Uncertainty_S represents the uncertainty of the sentence; P̂_i represents the probability values of the i-th word block calculated after the feedforward neural network prediction; max represents taking the maximum value.
For the above example S = [[CLS], 今, 天, 天, 气, 真, 好, [SEP]], the probabilities of the second entry [0.3, 0.6, 0.1], corresponding to the word block "今", over the tag list are [O: 0.3, B-Time: 0.6, E-Time: 0.1], so the uncertainty value of this word block is 1 − 0.6 = 0.4. In the same way the uncertainties of all word blocks are obtained as [0.1, 0.4, 0.4, 0.3, 0.1, 0.2, 0.01, 0.05], and the uncertainty of the sentence is finally calculated by the above formula as Uncertainty_S = (0.1 + 0.4 + 0.4 + 0.3 + 0.1 + 0.2 + 0.01 + 0.05) / 8 = 0.195; the larger the value, the higher the uncertainty of the sentence and the more valuable it is to annotate.
After the uncertainties of all sentences are obtained, they are sorted in descending order of uncertainty. The higher the uncertainty, the more uncertain the model is about the data, so the data has higher annotation value and is recommended preferentially; if the uncertainty is lower, the semantic information of the data is partially or even completely covered by the data of previous rounds, and the piece of data is excluded from the data to be annotated. This avoids the model overfitting caused by too much semantically repetitive data. Step S2 is then repeated with the data recommended in this step for model training, so as to fine-tune the model step by step.
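A minimal sketch of this minimum confidence sampling follows; the probability values are those of the example above plus a second, hypothetical low-uncertainty sentence:

```python
def sentence_uncertainty(block_probs):
    # Per-block uncertainty is 1 - max class probability; the sentence
    # score is the average over all of its word blocks.
    return sum(1.0 - max(p) for p in block_probs) / len(block_probs)

pool = {
    "sentence_a": [[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.7, 0.2, 0.1],
                   [0.9, 0.01, 0.09], [0.8, 0.1, 0.1], [0.99, 0.005, 0.005], [0.95, 0.01, 0.04]],
    "sentence_b": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
}

# Sort in descending order of uncertainty: the top entries are recommended
# for annotation, the bottom ones are excluded as semantically covered.
ranked = sorted(pool, key=lambda s: sentence_uncertainty(pool[s]), reverse=True)
print([(s, round(sentence_uncertainty(pool[s]), 3)) for s in ranked])
# [('sentence_a', 0.195), ('sentence_b', 0.025)] -> sentence_a is recommended first
```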
FIG. 6 shows a prediction probability diagram obtained by calculation over the input text according to an embodiment of the present invention; the diagram depicts the process in which the embedded and encoded input text goes through multi-head self-attention calculation and normalization, then feedforward neural network calculation and normalization, and finally produces the output after the linear layer and softmax normalization.
S4: completing automatic mining of the entity vocabulary in a weakly supervised manner;
The word blocks of the sentences in the original text are represented by a feature with context encoding. For the original text given in the S1 text representation, the sentence fragments have the vector form:
span_{j,j+s} = [c_j, …, c_{j+s}];
wherein: span_{j,j+s} represents the vector of the sentence fragment starting at the j-th character of the sentence; c_{j+s} represents the (j+s)-th character in the sentence; s represents the length of the sentence fragment.
For a model with an L-layer encoder and H self-attention heads per layer, the fragment has an attention-score feature vector at the h-th attention head of the l-th layer. Finally, for each segment candidate, its self-attention feature can be expressed as:
X_p = MultiHead_{L×H}(span_{j,j+s});
wherein: X_p represents the self-attention score feature vector of the sentence fragment; MultiHead represents the calculation of the self-attention score feature vector; L indicates the calculation passes through the L-layer encoder; H represents the H self-attention heads per layer. Simplifying the formula and restating it mathematically: for the M segment candidates in total, the self-attention feature of the m-th segment candidate is expressed as X^{s_m}.
The attention map is converted into an image of L × H channels whose length and width are N pixels, converting the entity recognition problem into an image parsing problem. A lightweight convolutional neural network is adopted, and its output is used as the input of the logistic regression layer so as to attach labels to the related word segments, with the training process formula:
θ̂ = argmin_θ Σ_{m=1}^{M} Loss(f_θ(X^{s_m}));
wherein: θ̂ represents the calculated model parameters; θ represents the initial value of the model parameters; argmin_θ represents selecting the θ that minimizes the result; f_θ represents the model parameterized by θ; s_m represents the m-th segment candidate among the M segment candidates in total; X^{s_m} represents the self-attention feature of the m-th segment candidate; Loss represents the loss function; m represents the segment candidate number, m ∈ 1, 2, …, M; M represents the total number of segment candidates.
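A minimal sketch of such a classifier follows; the patent specifies only a lightweight convolutional neural network whose output feeds a logistic regression layer, so the exact architecture (channel count, kernel size, pooling) is an assumption.

```python
import torch
import torch.nn as nn

L, H, N = 12, 8, 8  # encoder layers, heads per layer, word blocks

class SpanClassifier(nn.Module):
    def __init__(self, channels=L * H):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),    # pools variable-size span images to 1x1
        )
        self.logreg = nn.Linear(32, 1)  # logistic regression: entity vs. not

    def forward(self, x):               # x: (batch, L*H, n, n) attention image
        z = self.cnn(x).flatten(1)
        return torch.sigmoid(self.logreg(z))

model = SpanClassifier()
images = torch.rand(4, L * H, N, N)     # four candidate attention images
print(model(images).shape)              # torch.Size([4, 1])
```

Training would then minimize a binary cross-entropy loss over the labeled segment candidates, matching the argmin_θ objective above.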
Fig. 7 is a diagram showing the visual word relationships of the attention map according to an embodiment of the invention; the figure depicts how, after the problem conversion above, the context information is displayed through the calculation of the self-attention map, and the ranges of words composed of strongly related word blocks are determined in turn.
After model training is completed, the parameters θ̂ are fixed; in the prediction phase, the s_m containing the original segment information is calculated backwards through the above formula, acquiring the high-quality entity segment information.
Experimental analysis gives the following baseline for the case where the method is not used: when the amount of annotated data reaches 100, the F1 score of the model is 0.663. After annotating the data recommended by the method, the F1 score of the model already reaches 0.661 when the amount of annotated data reaches 70, very close to the baseline; when the amount of annotated data further reaches 100, the F1 score of the model rises to 0.706 because weakly supervised learning reduces part of the error accumulation, a clear improvement over the baseline model. Therefore, taking 100 pieces of annotated data as the basis, the weakly supervised learning method can reduce the annotation amount by 30%.
In conclusion, the prediction results of the aviation product text entity recognition method based on weak supervision learning demonstrate that the method achieves a good effect.
(1) In the embodiment of the invention, supervision signals are discovered naturally by the model; part of the data of the given example text is actively selected as valuable annotation data through the minimum confidence strategy, and the remaining data is temporarily discarded, which greatly reduces the time and labor of data annotation and improves working efficiency.
(2) In the embodiment of the invention, only a small amount of the original text data participates in each round of training of the weakly supervised learning model, so that weakly supervised learning fine-tunes the pre-trained model without loss of effectiveness while avoiding the error accumulation caused by training directly on massive data.
(3) In the embodiment of the invention, with multiple rounds of weakly supervised learning, the generalization ability of the model is improved, and the function of automatically mining the entity vocabulary is finally realized; the example analysis shows that the method achieves a good application effect.
The above examples are only illustrative of the preferred embodiments of the present invention and are not intended to limit its scope; without departing from the spirit of the present invention, any modifications and improvements made by those skilled in the art to the technical solution of the present invention shall fall within the scope of protection defined by the claims.

Claims (8)

1. The aviation product text entity identification method based on weak supervision learning is characterized by comprising the following steps:
S1, processing the original text of the aviation product and performing text encoding, comprising the following substeps:
S11, converting the original text into vector form;
S12, performing the target word embedding, segment embedding and position embedding operations;
S13, adding the three embeddings of step S12 to obtain the text encoding of the Chinese sentence, with the formula:
X_i = E_i^{tok} + E_A + E_i^{pos};
wherein: X represents the matrix of the entire sentence after text encoding, X = [X_1, X_2, …, X_N]; X_i represents the text-encoding embedded information of the i-th word block in the sentence; E_i^{tok} represents the target word embedding of the i-th word block; E_A represents the segment embedding of the sentence; E_i^{pos} represents the position embedding of the i-th word block; i represents the word block number in the sentence, i ∈ 1, 2, …, N; N represents the total number of word blocks in the sentence;
S2, realizing text entity recognition based on self-attention, comprising the following substeps:
S21, constructing the self-attention heads (Q, K, V) and obtaining their Query, Key and Value matrices, the number of self-attention heads per layer being h = 8; the self-attention value results of the h heads are merged and then output through a fully connected layer with reduced dimension, calculated as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, head_l = Attention(Q W_l^Q, K W_l^K, V W_l^V);
wherein: MultiHead(Q, K, V) represents the self-attention value after the h self-attention heads are merged; W^O represents the weight matrix over the multiple self-attention heads; Concat represents the merge function; head_l represents the l-th self-attention head; l represents the self-attention head number, l ∈ 1, 2, …, h; h represents the total number of self-attention heads; Attention represents the self-attention function; W_l^Q, W_l^K and W_l^V respectively represent the weight matrices of the l-th self-attention head that are multiplied with the Q, K and V matrices; Q represents the Query matrix; K represents the Key matrix; V represents the Value matrix;
S22, training through a feedforward neural network and normalizing with an activation function to obtain the probabilities of the N categories corresponding to each word block;
S3, realizing uncertainty sampling of sentences based on the minimum confidence strategy, comprising the following substeps:
S31, obtaining the probability of each word block for each category: after each round of training is completed, data are sampled from the data pool and fed to the trained model for probability prediction, giving the probability of each word block of each sentence for each category, i.e. the fully-connected-layer output result of each sentence;
S32, after obtaining the fully-connected-layer output result of each sentence, calculating confidence with the minimum confidence strategy to obtain the uncertainty of the whole sentence, specifically: averaging the uncertainties of all word blocks to obtain the uncertainty of the sentence; under the minimum confidence strategy, the uncertainty of each sentence is calculated as:
Uncertainty_S = (1/N) Σ_{i=1}^{N} (1 − max P̂_i);
wherein: Uncertainty_S represents the uncertainty of the sentence; P̂_i represents the probability values of the i-th word block calculated after the feedforward neural network prediction; max represents taking the maximum value;
S33, after the uncertainties of all sentences are obtained, sorting them in descending order of uncertainty value; the higher the uncertainty value, the more uncertain the model is about the data, so the data has higher annotation value and is recommended preferentially; the lower the uncertainty value, the more the semantic information of the data is partially or completely covered by the data of previous rounds, and such data is excluded from the data to be annotated;
S34, repeating step S2 with the data recommended in this step for model training, so as to fine-tune the model step by step;
S4, completing automatic mining of the entity vocabulary in a weakly supervised manner and acquiring the required text entity segment information, comprising the following substeps:
S41, representing the word blocks of the sentences in the original text by a feature with context encoding; converting the attention map into an image of L × H channels whose length and width are N′ pixels, thereby converting the entity recognition problem into an image parsing problem; adopting a lightweight convolutional neural network whose output is used as the input of a logistic regression layer, so that labels are attached to the related word segments, with the training process formula:
θ̂ = argmin_θ Σ_{m=1}^{M} Loss(f_θ(X^{s_m}));
wherein: θ̂ represents the calculated model parameters; θ represents the initial value of the model parameters; argmin_θ represents selecting the θ that minimizes the result; f_θ represents the model parameterized by θ; s_m represents the m-th segment candidate among the M segment candidates in total; X^{s_m} represents the self-attention feature of the m-th segment candidate; Loss represents the loss function; m represents the segment candidate number, m ∈ 1, 2, …, M; M represents the total number of segment candidates;
S42, after model training is completed, fixing the parameters θ̂; in the prediction phase, the s_m containing the original segment information is calculated backwards through the formula in step S41, acquiring the required text entity segment information.
2. The method for identifying the text entities of the aviation products based on the weak supervised learning as set forth in claim 1, wherein the converting of the original text into vector form in step S11 is specifically:
The transformer-based bidirectional encoder representation technique pre-training model uses the encoder in the transformer structure; multiple encoder layers together build the basic network structure, constructing a deep bidirectional model that accumulates and captures the left and right context information of the target word from the first layer to the last layer; a Chinese sentence in the original text is divided into several word blocks and represented as a vector as follows:
S = [c_1, c_2, …, c_j, …, c_n];
wherein: S represents the vector form of the Chinese sentence; c_j represents the j-th character in the sentence; j represents the character number in the sentence, j ∈ 1, 2, …, n; n represents the total number of characters in the sentence.
3. The method for identifying the text entity of the aviation product based on the weak supervision learning according to claim 2, wherein the specific method for performing the target word embedding, segment embedding and position embedding operations in step S12 is as follows:
The target word embedding maps each word block, through the tokenizer, to a number in the vocabulary built into the tokenizer, as follows:
E_i^{tok} = Tokenizer(c_i);
wherein: Tokenizer represents the target word embedding function;
The segment embedding segments the input long text, and the embedding is used to distinguish the content of each text segment; the segment embedding value of every word block is E_A = 0;
The position embedding transmits the position information of the word blocks to the pre-trained model for the subsequent model self-attention calculation to derive the context information of the word blocks; the embedding value is typically the index of the word block, i.e. E_i^{pos} = i − 1.
4. The method for identifying the text entities of the aviation products based on the weak supervised learning as set forth in claim 1, wherein the method for acquiring the Query, Key and Value matrices in step S21 is as follows:
The transformer-based bidirectional encoder representation technique model trains on the text-encoded data before making the next sentence prediction; after self-attention receives the input sentence encoding X, X^T is first multiplied with the W^Q, W^K, W^V weight matrices to obtain Q, K, V, which come from all the self-attention heads of the pre-training model, with the calculation formula:
Q = W^Q X^T, K = W^K X^T, V = W^V X^T;
wherein: W^Q represents the weight matrix used to calculate the Query matrix; W^K represents the weight matrix used to calculate the Key matrix; W^V represents the weight matrix used to calculate the Value matrix; X^T represents the transpose of the sentence encoding X;
Q is first multiplied with K^T; to prevent the result from becoming too large, it is divided by the square root of its dimension and normalized into probability values through the activation function softmax_A, and the probability values are finally multiplied with the V matrix to obtain the self-attention output, with the formula:
Attention(Q, K, V) = softmax_A(Q K^T / √d_k) V;
wherein: Attention(Q, K, V) represents the self-attention calculation score; d_k represents the dimension of the K matrix; softmax_A represents the activation function.
5. The method for recognizing text entities of aviation products based on weak supervised learning as set forth in claim 1, wherein the training by the feedforward neural network in step S22 is specifically:
The self-attention value matrix MultiHead(Q, K, V) is bound with the initial input sentence encoding X and, after dimension reduction by the fully connected layer, yields the input x entering the feedforward neural network; the network fits the training input text, and the calculated weights are kept at this layer for fitting the prediction results in the next round of training, with the formula:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2;
wherein: FFN(x) represents the tag value for input x in the training phase; x represents the input obtained by dimension reduction after the multi-head self-attention is bound with the sentence encoding; W_1, b_1 respectively represent the weight and bias obtained by the preliminary fit of x; W_2, b_2 respectively represent the weight and bias obtained by fitting the whole again after selecting the most suitable weight W_1 and bias b_1 from the preliminary fit.
6. The method for recognizing text entities of aviation products based on weak supervised learning as set forth in claim 1, wherein the normalization processing using the activation function in step S22 is specifically:
The tag values produced for the inputs in the training phase are normalized by an activation function into probability values between 0 and 1, with the activation function formula:
P̂_i = softmax_B(FFN(x_i)) = exp(FFN(x_i)) / Σ_j exp(FFN(x_j));
wherein: P̂_i represents the probabilities of the i-th word block over the categories; softmax_B represents the normalization function; FFN(x_i) represents the tag value for input x_i in the training phase.
7. The method for recognizing text entities of aviation products based on weak supervised learning as set forth in claim 1, wherein in step S41 the word blocks of sentences in the original text are represented by a feature with context encoding;
For the original text given in the text representation of step S1, the sentence fragments have the vector form:
span_{j,j+s} = [c_j, …, c_{j+s}];
wherein: span_{j,j+s} represents the vector of the sentence fragment starting at the j-th character of the sentence; c_{j+s} represents the (j+s)-th character in the sentence; s represents the length of the sentence fragment;
For a model with an L-layer encoder and H self-attention heads per layer, the fragment has an attention-score feature vector at the h-th attention head of the l-th layer;
For each segment candidate, its self-attention feature can then be expressed as:
X_p = MultiHead_{L×H}(span_{j,j+s});
wherein: X_p represents the self-attention score feature vector of the sentence fragment; MultiHead represents the calculation of the self-attention score feature vector; L indicates the calculation passes through the L-layer encoder; H represents the H self-attention heads per layer;
Simplifying the above formula and restating it mathematically: for the M segment candidates in total, the self-attention feature of the m-th segment candidate is expressed as X^{s_m}.
8. A product text entity recognition system for the aviation product text entity recognition method based on weak supervision learning of claim 1, comprising a text encoding unit, a text entity recognition unit, an uncertainty sampling unit and an automatic entity vocabulary mining unit, wherein the text encoding unit is used for processing the original text and performing text encoding, the text entity recognition unit is used for realizing text entity recognition based on self-attention, the uncertainty sampling unit is used for realizing uncertainty sampling of sentences based on the minimum confidence strategy, and the automatic entity vocabulary mining unit is used for completing automatic mining of the entity vocabulary in a weakly supervised manner to acquire the required text entity segment information.
CN202211690404.9A 2022-12-27 2022-12-27 Aviation product text entity identification method based on weak supervision learning Active CN116227434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211690404.9A CN116227434B (en) 2022-12-27 2022-12-27 Aviation product text entity identification method based on weak supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211690404.9A CN116227434B (en) 2022-12-27 2022-12-27 Aviation product text entity identification method based on weak supervision learning

Publications (2)

Publication Number Publication Date
CN116227434A CN116227434A (en) 2023-06-06
CN116227434B (en) 2024-02-13

Family

ID=86583418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211690404.9A Active CN116227434B (en) 2022-12-27 2022-12-27 Aviation product text entity identification method based on weak supervision learning

Country Status (1)

Country Link
CN (1) CN116227434B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826303A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Joint information extraction method based on weak supervised learning
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN114091478A (en) * 2021-11-30 2022-02-25 复旦大学 Dialog emotion recognition method based on supervised contrast learning and reply generation assistance


Also Published As

Publication number Publication date
CN116227434A (en) 2023-06-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant