CN113157883A - Chinese opinion target boundary prediction method based on dual-model structure - Google Patents

Info

Publication number
CN113157883A
Authority
CN
China
Prior art keywords
opinion
model
data
word
sequence
Prior art date
Legal status
Withdrawn
Application number
CN202110374539.3A
Other languages
Chinese (zh)
Inventor
王丽亚
章增优
梅成才
余威明
Current Assignee
Zhejiang Industry and Trade Vocational College
Original Assignee
Zhejiang Industry and Trade Vocational College
Priority date
Filing date
Publication date
Application filed by Zhejiang Industry and Trade Vocational College filed Critical Zhejiang Industry and Trade Vocational College
Priority to CN202110374539.3A
Publication of CN113157883A


Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3346 Query execution using probabilistic model
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention provides a Chinese opinion target boundary prediction method based on a dual-model structure, which comprises the following steps: Step 1, acquiring a Chinese opinion text data set, dividing it into a training set, a verification set and a test set, and then preprocessing the training set to obtain training sample data; Step 2, constructing a BERT_BiGRU-based Chinese opinion target extraction model comprising model one and model two; Step 3, inputting the training sample data into model two to be trained, to predict the relationship between the opinion target label boundaries and the opinion text; Step 4, training model two and optimizing model one; and Step 5, inputting the word vectors corresponding to the test sample word sequence into the optimal model one to obtain predicted boundary score vectors, and converting them with a decoding algorithm into the final opinion target for output.

Description

Chinese opinion target boundary prediction method based on dual-model structure
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese opinion target boundary prediction method based on a dual-model structure.
Background
Opinion target extraction (OTE) is a basic task of opinion mining and sentiment analysis, and a research hotspot in the field of natural language processing (NLP). Opinion target extraction identifies the entity about which an opinion is expressed in a text. Traditional extraction methods fall into three types: rule-based methods, statistics-based methods, and methods combining rules and statistics. All three have strong limitations: they depend heavily on manually established rules and involve complex pipelines. Traditional methods also model opinion target extraction as a sequence labeling task, which requires tedious sequence annotation. Deep-learning-based methods no longer depend on hand-crafted features, reducing the labor cost of traditional methods and improving efficiency. However, most current deep-learning methods still model the OTE task as a sequence labeling task and require complex sequence annotation of the data set.
In summary, providing a Chinese opinion target boundary prediction method based on a dual-model structure, which avoids tedious sequence labeling operations and effectively improves the accuracy of Chinese opinion target extraction, is an urgent problem for those skilled in the art to solve.
Disclosure of Invention
In order to solve the above problems and requirements, this scheme provides a Chinese opinion target boundary prediction method based on a dual-model structure, with which the technical problems can be solved by adopting the following technical scheme.
To achieve this purpose, the invention provides the following technical scheme: a Chinese opinion target boundary prediction method based on a dual-model structure, comprising the following steps:
Step 1, acquiring a Chinese opinion text data set, dividing the data set into a training set, a verification set and a test set, and then performing data preprocessing on the training set to obtain training sample data, wherein the training sample data comprises word vectors corresponding to opinion text word sequences and sequences representing the probability distribution of the real target boundaries of the opinion texts, and the word vectors represent the words and word positions in the opinion text word sequences;
Step 2, constructing a BERT_BiGRU-based Chinese opinion target extraction model with the Keras framework, the model comprising model one and model two;
Step 3, inputting the training sample data into model two to be trained, and predicting the relationship between the opinion target label boundaries and the opinion text from the output of model two;
Step 4, obtaining verification sample information for the training sample data from the verification set, training model two based on the output and the verification sample information to obtain the trained model two, and optimizing model one with the trained model two to obtain the optimal model one for predicting the opinion target label boundaries;
Step 5, segmenting the test set data into test sample word sequences, obtaining the corresponding word vectors, inputting them into the optimal model one to obtain predicted boundary score vectors, and converting the predicted boundary score vectors into the final opinion target for output with a decoding algorithm, wherein the test data in the test set comprise only opinion texts.
Further, the data preprocessing comprises: converting each character in the training data set into a dictionary index through the dictionary vocab.txt shipped with the BERT-wwm-ext model and building a new dictionary token_dict, representing characters absent from token_dict with the [unused1] and [UNK] labels, wherein the [unused1] label marks untrained space-like characters; shuffling the training data set with random numbers, and slicing the opinion texts of the training data according to the maxlen parameter value to obtain an opinion target label sequence and an opinion text sequence; segmenting the opinion text sequence and the opinion target label sequence into an opinion text word sequence and an opinion target label word sequence, and adding [CLS] and [SEP] labels to the head and tail of the opinion text word sequence respectively; obtaining the position of the first word of the opinion target label word sequence within the opinion text word sequence and recording the attribute value of that position as the start attribute value; obtaining the end position of the opinion target label word sequence from that start position and the length of the opinion target label word sequence; setting the value at the start position to 1 and filling with 0 to obtain a sequence s1 of the same length as the opinion text word sequence, and setting the value at the end position to 1 and filling with 0 to obtain a sequence s2 of the same length as the opinion text word sequence; splitting the opinion text into characters according to token_dict to obtain the opinion text word-id sequence x1 and its accompanying segment sequence x2; performing the above operations on every 32 samples of the training data set as one batch, collecting the x1, x2, s1 and s2 of the 32 samples in order into vectors X1, X2, S1 and S2; and padding the batch according to the maximum single-sample length within the batch to obtain new X1n, X2n, S1n and S2n as the training sample data.
Further, model two comprises a plurality of cascaded data processing layers, namely 4 Input layers, 4 Lambda layers, 1 BERT-wwm-ext layer, 1 BiGRU layer and 3 Dense layers, and within the cascade the result of each data processing unit of a preceding layer is input into the data processing units of the following layer.
Further, the training process of model two is as follows: acquire the training sample data, input it into model two, and predict the relationship between the opinion target boundaries and the text, the relationship indicating the probability distribution of the opinion target boundaries over the opinion text word sequence; and optimize model two according to the predicted probability distribution and the probability distribution of the real target boundaries of the opinion text to obtain the trained model two.
Further, model one is optimized while model two is trained, yielding the optimal model one used for predicting the opinion target label boundaries; the input of the optimal model one is only the test data word vectors, its outputs are the probability ps1 that each word in the opinion text starts an opinion target item and the probability ps2 that each word ends an opinion target item, and the predicted boundary score vectors _ps1 and _ps2 are obtained from the probabilities ps1 and ps2.
Further, the predicted boundary score vectors _ps1 and _ps2 are input to a decoding layer to obtain the final opinion target using a decoding algorithm.
Still further, the decoding algorithm comprises: normalizing the boundary score vectors _ps1 and _ps2 with two softmax functions respectively, taking the attribute values of the words with the maximum probability, and obtaining the opinion target entity fragment by a slicing operation according to those attribute values.
Further, when training model two, the predicted probability distribution is compared with the probability distribution of the real target boundaries of the opinion text according to the target loss function:
Loss1 = -(1/N) Σ s1_in · log(softmax(ps1))
Loss2 = -(1/N) Σ s2_in · log(softmax(ps2))
Loss = Loss1 + Loss2
and the difference between the predicted result and the real result is evaluated to train model two and optimize model one, where s1_in and s2_in are the indicators of the real opinion target boundaries and N is the number of samples.
According to the technical scheme, the invention has the beneficial effects that:
1. The invention frames the opinion target extraction task as a target segment boundary prediction task, avoiding tedious part-of-speech tagging operations.
2. Because the dual-model-structure BERT_BiGRU network is a multi-input multi-output model, the two models can be trained synchronously: model one predicts the opinion target, model two learns the relationship between the opinion target and the text, and model one is optimized while model two is trained, improving training efficiency.
3. The Chinese opinion target extraction model based on the dual-model structure can effectively improve the accuracy of Chinese opinion target extraction.
In addition to the above objects, features and advantages, preferred embodiments of the present invention are described in more detail below with reference to the accompanying drawings, so that the features and advantages of the invention can be easily understood.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments of the present invention or the prior art will be briefly described below, wherein the drawings are only used for illustrating some embodiments of the present invention and do not limit all embodiments of the present invention thereto.
FIG. 1 is a schematic diagram illustrating the specific steps of the method for predicting the boundary of a Chinese opinion target based on a dual-model structure according to the present invention.
FIG. 2 is a schematic diagram of the structure of the Chinese opinion target extraction model according to the present invention.
FIG. 3 is a schematic flow chart of the Chinese opinion target boundary prediction method based on a dual model structure according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be described in detail and completely with reference to the accompanying drawings of the specific embodiments of the present invention. Like reference symbols in the various drawings indicate like elements. It should be noted that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
The invention interprets the opinion target extraction task as locating the target of an opinion expression within the text: the target consists of one contiguous segment of the text, so the task is modeled as a boundary prediction task that predicts two position indexes indicating the start and end of the answer. For Chinese short texts, the dual-model-structure Chinese opinion target boundary prediction method avoids tedious sequence labeling operations and effectively improves the accuracy of Chinese opinion target extraction. As shown in figs. 1 to 3, the method comprises: Step 1, acquiring a Chinese opinion text data set, dividing it into a training set, a verification set and a test set, and then preprocessing the training set to obtain training sample data, wherein the training sample data comprises word vectors corresponding to opinion text word sequences and sequences representing the probability distribution of the real target boundaries of the opinion texts, the word vectors representing the words and word positions in the opinion text word sequences. The data preprocessing comprises: converting each character in the training data set into a dictionary index through the dictionary vocab.txt shipped with the BERT-wwm-ext model and building a new dictionary token_dict, representing characters absent from token_dict with the [unused1] and [UNK] labels, wherein the [unused1] label marks untrained space-like characters; shuffling the training data set with random numbers and slicing the opinion texts according to the maxlen parameter value to obtain an opinion target label sequence and an opinion text sequence; segmenting these into an opinion text word sequence and an opinion target label word sequence, and adding [CLS] and [SEP] labels to the head and tail of the opinion text word sequence respectively; obtaining the position of the first word of the opinion target label word sequence within the opinion text word sequence and recording the attribute value of that position as the start attribute value; obtaining the end position of the opinion target label word sequence from that start position and the length of the opinion target label word sequence; setting the value at the start position to 1 and filling with 0 to obtain a sequence s1 of the same length as the opinion text word sequence, and setting the value at the end position to 1 and filling with 0 to obtain a sequence s2 of the same length as the opinion text word sequence; splitting the opinion text into characters according to token_dict to obtain the opinion text word-id sequence x1 and its accompanying segment sequence x2; performing the above operations on every 32 samples as one batch, collecting the x1, x2, s1 and s2 of the 32 samples in order into vectors X1, X2, S1 and S2; and padding the batch according to the maximum single-sample length within the batch to obtain new X1n, X2n, S1n and S2n as the training sample data.
In this embodiment, the dual-model-structure Chinese opinion target boundary prediction method provided by the invention may be executed by a server or an intelligent terminal device, and the specific data preprocessing may be implemented by the following steps: first, a data set is obtained from online reviews collected from three Chinese internet platforms, Baidu (baidu), Dianping (dianping) and Mafengwo (mafengwo).
Second, the data set is partitioned. Each data instance is a (review text, opinion target) pair, for example: ('The lake is the premier scenic spot of the park, the lake water is clear, a cool breeze meets you as you walk along the lake, and the lush trees make it a fine place', 'the lake'); this instance expresses an opinion about the target 'the lake'.
Third, call the whole-word-masking Chinese BERT pre-trained model (BERT-wwm-ext) jointly released by the HIT and iFLYTEK joint laboratory.
Fourth, preprocess the text. Every 32 samples are processed as one batch, specifically as follows:
A1, load the contents of the vocab.txt file and assign them to token_dict.
A2, perform a shuffle operation, disordering the original data order with random numbers to prevent overfitting to the order of the training data.
A3, the training data has two columns: the first column is the comment text d[0], the second column the opinion target label d[1]. The first column is sliced to the maxlen parameter value.
For example, for the sample ('Water crab porridge is a Macau snack.', 'Water crab porridge'):
d[0] = 'Water crab porridge is a Macau snack.'
d[1] = 'Water crab porridge'
A4, construct the tokenizer: among characters not in token_dict, untrained space-like characters are marked with [unused1] and the remaining characters are represented with [UNK].
A5, split d[0] and d[1] into characters to obtain the lists text_tokens and tag_tokens. After splitting d[0], the [CLS] and [SEP] marks are added at the first and last positions of the sentence respectively.
For the above sample:
text_tokens = ['[CLS]', '水', '蟹', '粥', '是', '澳', '门', '小', '吃', '。', '[SEP]']
tag_tokens = ['水', '蟹', '粥']
A6, create s1 and s2 as two zero-filled arrays of given length and type with the zeros function of Python's numpy module, where the length is the length of text_tokens and the dtype defaults to numpy.float64.
For the above sample:
s1=[0,0,0,0,0,0,0,0,0,0,0]
s2=[0,0,0,0,0,0,0,0,0,0,0]
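For illustration, step A6 in Python is simply (a minimal sketch assuming numpy is imported as np):
import numpy as np

# Two zero-filled arrays with the length of text_tokens; dtype defaults to numpy.float64.
s1 = np.zeros(len(text_tokens))
s2 = np.zeros(len(text_tokens))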
A7, write a list_find function that returns the position at which tag_tokens first occurs in text_tokens, and assign that value to start, i.e. the position at which the first word of the opinion target appears in the comment text.
A8, obtain the end position from the start position; the len function returns the length of tag_tokens:
end = start + len(tag_tokens) - 1
For the above sample:
len(tag_tokens)=3
start=1
end=3
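The list_find function itself is not reproduced in the text; a plausible minimal implementation (the body below is an assumption) is:
def list_find(sequence, pattern):
    # Return the index at which the sub-list pattern first occurs in sequence, or -1 if absent.
    n = len(pattern)
    for i in range(len(sequence) - n + 1):
        if sequence[i:i + n] == pattern:
            return i
    return -1

start = list_find(text_tokens, tag_tokens)  # 1 for the sample above
end = start + len(tag_tokens) - 1           # 3 for the sample above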
A9, set the element of array s1 whose id is start to 1 to obtain the new array s1, and set the element of array s2 whose id is end to 1 to obtain the new array s2.
For the above sample:
s1=[0,1,0,0,0,0,0,0,0,0,0]
s2=[0,0,0,1,0,0,0,0,0,0,0]
A10, with token_dict storing the word-to-id mapping, use the encode function of the Tokenizer module to split d[0] into characters and return the corresponding id array x1 together with the accompanying segment array x2.
For the above sample:
x1=[101,3717,6101,5114,3221,4078,7305,2207,1391,511,102]
x2=[0,0,0,0,0,0,0,0,0,0,0]
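Step A10 can be sketched with the keras_bert Tokenizer (the customized [unused1]/[UNK] handling of step A4 is omitted here for brevity):
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)
x1, x2 = tokenizer.encode(first=d[0])  # x1: token-id array, x2: accompanying segment array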
A11, every 32 samples form one batch. The x1, x2, s1 and s2 values of the 32 processed samples are stored, in order, into the lists X1, X2, S1 and S2.
For the batch of the sample:
X1 = [[x1 of data 1], [x1 of data 2], [101,3717,6101,5114,3221,4078,7305,2207,1391,511,102], …, [x1 of data 32]]
X2 = [[x2 of data 1], [x2 of data 2], [0,0,0,0,0,0,0,0,0,0,0], …, [x2 of data 32]]
S1 = [[s1 of data 1], [s1 of data 2], [0,1,0,0,0,0,0,0,0,0,0], …, [s1 of data 32]]
S2 = [[s2 of data 1], [s2 of data 2], [0,0,0,1,0,0,0,0,0,0,0], …, [s2 of data 32]]
A12, perform the padding operation on the batch according to the maximum single-sample length in the batch, i.e. samples shorter than the batch maximum are filled with 0, obtaining the new X1, X2, S1 and S2 (namely X1n, X2n, S1n and S2n).
For the batch of the sample, if the length of the longest sample is 15:
X1 = [[x1 of data 1 after padding], [x1 of data 2 after padding], [101,3717,6101,5114,3221,4078,7305,2207,1391,511,102,0,0,0,0], …, [x1 of data 32 after padding]]
X2 = [[x2 of data 1 after padding], [x2 of data 2 after padding], [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], …, [x2 of data 32 after padding]]
S1 = [[s1 of data 1 after padding], [s1 of data 2 after padding], [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0], …, [s1 of data 32 after padding]]
S2 = [[s2 of data 1 after padding], [s2 of data 2 after padding], [0,0,0,1,0,0,0,0,0,0,0,0,0,0,0], …, [s2 of data 32 after padding]]
The new X1, X2, S1 and S2 are used as the training sample data.
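Steps A1 to A12 can be combined into a batch generator; the following is a minimal sketch (the names seq_padding and data_generator, and yielding None as the target for use with add_loss, are assumptions, not the patent's verbatim code):
import numpy as np

def seq_padding(batch, padding=0):
    # Step A12: pad every sequence to the maximum single-sample length within the batch.
    max_len = max(len(seq) for seq in batch)
    return np.array([list(seq) + [padding] * (max_len - len(seq)) for seq in batch])

def data_generator(data, batch_size=32):
    # data: list of (comment text d[0], opinion target label d[1]) pairs
    while True:
        np.random.shuffle(data)                                    # step A2
        X1, X2, S1, S2 = [], [], [], []
        for d in data:
            text = d[0][:maxlen]                                   # step A3
            tokens = tokenizer.tokenize(text)                      # step A5, adds [CLS]/[SEP]
            tag = tokenizer.tokenize(d[1])[1:-1]                   # strip [CLS]/[SEP] from the target
            start = list_find(tokens, tag)                         # step A7
            if start == -1:
                continue
            end = start + len(tag) - 1                             # step A8
            s1, s2 = np.zeros(len(tokens)), np.zeros(len(tokens))  # step A6
            s1[start], s2[end] = 1, 1                              # step A9
            x1, x2 = tokenizer.encode(first=text)                  # step A10
            X1.append(x1); X2.append(x2); S1.append(s1); S2.append(s2)
            if len(X1) == batch_size:                              # step A11
                yield [seq_padding(X1), seq_padding(X2), seq_padding(S1), seq_padding(S2)], None
                X1, X2, S1, S2 = [], [], [], []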
Step 2, construct the BERT_BiGRU-based Chinese opinion target extraction model with the Keras framework; the model comprises model one and model two. As shown in fig. 2, the specific implementation of the model in this embodiment includes: B1, call the BERT-wwm-ext model, specifically:
from keras_bert import load_trained_model_from_checkpoint

bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)
B2, the Chinese opinion target extraction model based on BERT_BiGRU boundary prediction is described in detail in conjunction with FIG. 2.
B3, add 4 Input layers to receive the sample data X1, X2, S1 and S2. The concrete explanation is as follows:
from keras.layers import Input, Lambda, Dense, Bidirectional, GRU
from keras.models import Model

x1_in = Input(shape=(None,))  # receives data X1
x2_in = Input(shape=(None,))  # receives data X2
s1_in = Input(shape=(None,))  # receives data S1
s2_in = Input(shape=(None,))  # receives data S2
X1, X2, S1, S2 = x1_in, x2_in, s1_in, s2_in
B4, add a lambda1 layer, take X1 as input, assign the output to x_mask, and perform the mask operation. The specific explanation is as follows:
import keras.backend as K
The backend package under keras is imported and renamed K.
x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(X1)
Data X1 is processed using Python's built-in lambda expression combined with the cast, greater and expand_dims functions of the K backend.
B5, with [X1, X2] as input, process the data with the BERT-wwm-ext model and assign the output to x, specifically:
x = bert_model([X1, X2])
B6, with x as input, process the data with the BiGRU model and assign the output to x, specifically:
x = Bidirectional(GRU(char_size // 2, return_sequences=True))(x)
B7, add a lambda2 layer with [x, x_mask] as input and assign the output to x, specifically:
x = Lambda(lambda x: x[0] * x[1])([x, x_mask])
B8, add a Dense1 layer with x as input and assign the output to x, specifically:
x = Dense(char_size, use_bias=False, activation='tanh')(x)
B9, add a Dense2 layer with x as input and assign the output to ps1, specifically:
ps1 = Dense(1, use_bias=False)(x)
B10, add a lambda3 layer with [ps1, x_mask] as input and assign the output to ps1, specifically:
ps1 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps1, x_mask])
B11, add a Dense3 layer with x as input and assign the output to ps2, specifically:
ps2 = Dense(1, use_bias=False)(x)
B12, add a lambda4 layer with [ps2, x_mask] as input and assign the output to ps2, specifically:
ps2 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps2, x_mask])
Since the input of the final model is only text-related information, with no label information available, declare model one: model, used to predict the positions of the first and last words of the opinion target. Its inputs are [x1_in, x2_in] and its outputs are [ps1, ps2], specifically:
model = Model([x1_in, x2_in], [ps1, ps2])
Because training must learn from the label information to reduce the loss value, declare model two: train_model; model one is trained while model two is trained. Its inputs are [x1_in, x2_in, s1_in, s2_in] and its outputs are [ps1, ps2], specifically:
train_model = Model([x1_in, x2_in, s1_in, s2_in], [ps1, ps2])
Finally, the loss function of train_model is defined; the cross-entropy loss function is used to evaluate the difference between the boundary probability distribution obtained by the current training and the real target boundary distribution. The difference is calculated with the following formulas.
Loss1 = -(1/N) Σ s1_in · log(softmax(ps1))
Loss2 = -(1/N) Σ s2_in · log(softmax(ps2))
Loss = Loss1 + Loss2
Where s1_ in and s2_ in are indications of true opinion object boundaries. ps1 represents the probability of each word of text starting as an opinion target and ps2 represents the probability of each word of text ending as an opinion target.
Step 3, input the training sample data into model two to be trained, and predict the relationship between the opinion target label boundaries and the opinion text from the output of model two. As shown in fig. 2 and the model implementation above, model two comprises a plurality of cascaded data processing layers, namely 4 Input layers, 4 Lambda layers, 1 BERT-wwm-ext layer, 1 BiGRU layer and 3 Dense layers, and within the cascade the result of each data processing unit of a preceding layer is input into the data processing units of the following layer.
Step 4, obtain verification sample information for the training sample data from the verification set, train model two based on the output and the verification sample information to obtain the trained model two, and optimize model one with the trained model two to obtain the optimal model one for predicting the opinion target label boundaries. The training process of model two is as follows: acquire the training sample data, input it into model two, and predict the relationship between the opinion target boundaries and the text, the relationship indicating the probability distribution of the opinion target boundaries over the opinion text word sequence; then optimize model two according to the predicted probability distribution and the probability distribution of the real target boundaries of the opinion text to obtain the trained model two. As shown in fig. 2, model one is optimized while model two is trained to obtain the optimal model one used for predicting the opinion target label boundaries; the input of the optimal model one is only the test data word vectors, its outputs are the probability ps1 that each word in the opinion text starts an opinion target item and the probability ps2 that each word ends an opinion target item, and the predicted boundary score vectors _ps1 and _ps2 are obtained from ps1 and ps2. The predicted boundary score vectors _ps1 and _ps2 are input into the decoding layer to obtain the final opinion target with a decoding algorithm.
Step 5, segment the test set data into test sample word sequences, obtain the corresponding word vectors, input them into the optimal model one to obtain the predicted boundary score vectors, and convert the predicted boundary score vectors into the final opinion target for output with a decoding algorithm, wherein the test data in the test set comprise only opinion texts. The decoding algorithm comprises: normalizing the boundary score vectors _ps1 and _ps2 with two softmax functions respectively, taking the attribute values of the words with the maximum probability, and obtaining the opinion target entity fragment by a slicing operation according to those attribute values.
In this embodiment, since the OTE task must output concrete target entity fragments, and after processing by model one only two sets of head and tail score vectors _ps1 and _ps2 are obtained, a decoding algorithm is needed to convert the score vectors into the final target entity output. Specifically: process the head score vector _ps1 and the tail score vector _ps2 with two softmax functions respectively, return the id of the word with the maximum probability with the argmax function of the numpy module, and extract the target entity fragment with a slicing operation.
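A minimal decoding sketch consistent with that description (the function name extract_target and the restriction of the end index to positions at or after the start are assumptions):
import numpy as np

def softmax(x):
    x = x - np.max(x)  # numerical stabilization
    e = np.exp(x)
    return e / e.sum()

def extract_target(text_in):
    # Model one predicts the head/tail score vectors; argmax plus slicing decodes them.
    text = text_in[:maxlen]
    x1, x2 = tokenizer.encode(first=text)
    _ps1, _ps2 = model.predict([np.array([x1]), np.array([x2])])
    _ps1, _ps2 = softmax(_ps1[0]), softmax(_ps2[0])
    start = int(_ps1.argmax())                 # id of the most probable start word
    end = int(_ps2[start:].argmax()) + start   # most probable end not before the start
    return text[start - 1:end]                 # shift by 1 to account for [CLS]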
Model two is trained on the training set and the verification set; by comparing the predicted results with the real results, it determines the learning direction from the decreasing error, optimizing model two and model one simultaneously. The dual-model-structure Chinese network comment opinion target boundary prediction model constructed by the invention frames the opinion target extraction task as a target segment boundary prediction task, avoiding tedious part-of-speech tagging operations; being a multi-input multi-output model, it is efficient and can compile and train the two models simultaneously.
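Training then reduces to fitting train_model on the generator while monitoring the verification set; a hedged sketch follows (the Evaluator callback, the exact-match scoring, the epoch count and the weights filename are all assumptions):
from keras.callbacks import Callback

class Evaluator(Callback):
    # After each epoch, score model one on the verification set and keep the best weights.
    def __init__(self, valid_data):
        super(Evaluator, self).__init__()
        self.valid_data = valid_data
        self.best = 0.
    def on_epoch_end(self, epoch, logs=None):
        acc = sum(extract_target(t) == lbl for t, lbl in self.valid_data) / len(self.valid_data)
        if acc > self.best:
            self.best = acc
            model.save_weights('best_model.weights')

train_model.fit_generator(data_generator(train_data, batch_size=32),
                          steps_per_epoch=len(train_data) // 32,
                          epochs=10,
                          callbacks=[Evaluator(valid_data)])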
The specific experimental process is as follows:
The experiment uses the same data set as document [1] by Li et al., three groups totaling about 100,000 samples, with the data [2] coming from three Chinese internet platforms, Baidu (baidu), Dianping (dianping) and Mafengwo (mafengwo); the specific data set settings are shown in Table 1.
Document [1]: Yanzeng Li, Tingwen Liu, Diying Li, et al. Character-based BiLSTM-CRF Incorporating POS and Dictionaries for Chinese Opinion Target Extraction. Asian Conference on Machine Learning (ACML 2018), 518-533.
[2] https://github.com/kdsec/chinese-opinion-target-extraction
TABLE 1 Experimental data sets

Data set    Training set    Verification set    Test set    Total
baidu       7500            1033                3658        12191
dianping    24000           1258                10825       36083
mafengwo    40000           1253                17681       58934
The evaluation indexes used in the experiment are Accuracy, Precision, Recall and F1; the higher their values, the better the classification ability of the model. Define TP as the number of entities the model extracts completely correctly; FP as the number of extracted results that contain the correct entity but whose boundaries are wrongly determined; and FN as the number of the remaining wrongly identified samples. The evaluation indexes are given by the following formulas:
Accuracy=TP/(TP+FP+FN),
Precision=TP/(TP+FP),
Recall=TP/(TP+FN),
F1=2*(Precision*Recall)/(Precision+Recall),
Observation of the extraction results shows that no empty extraction occurred during the experiments. When computing FP it was noticed that some extraction results differ from the original sample by fewer than 10 tolerated characters, so to avoid differing interpretations of the index computation a specific scoring algorithm is given, as follows:
for i in range(len(test_data)):
    predict_label, true_label = predict_labels[i], true_labels[i]  # per-sample labels (names assumed)
    if predict_label == true_label:
        TP += 1
    if predict_label != true_label and (true_label in predict_label and (len(predict_label) - len(true_label) < 10)):
        FP += 1
FN = len(test_data) - (FP + TP)
The work in document [1] set up multiple groups of detailed comparison experiments on the same data set, including the most popular extraction framework, the BILSTM_CRF model, and demonstrated that its method is optimal, so the present application compares against it directly.
BILSTM_CRF: models OTE as a sequence tagging task. It first generates character position information features ([CP-POS]@C) and constructs dictionary features (DictFeature), and finally integrates [CP-POS]@C and DictFeature into a BILSTM_CRF model based on Word2vec character embeddings.
BERT (dual-model structure): models OTE as a boundary prediction task. It differs from the proposed method in that its neural network consists of the BERT-wwm-ext model and ordinary Dense layers.
The proposed method adopts BERT_BIGRU (dual-model structure) to model OTE as a boundary prediction task, with the BERT-wwm-ext model and a BiGRU layer as the neural network. To increase the reliability of the test results, the running environments of all models were kept as consistent as possible during the experiments. The test results on the test sets are shown in Table 2.
TABLE 2 model comparison results
[Table 2 appears as an image in the original; it lists the Accuracy, Precision, Recall and F1 of the three compared models (BILSTM_CRF, BERT dual-model structure, BERT_BIGRU dual-model structure) on the baidu, dianping and mafengwo test sets; the headline values are restated below.]
Table 2 shows the comparison results of the 3 groups of models on the test sets. Comparing group 1 with group 2, relative to the BILSTM_CRF model for the sequence tagging task, the dual-model-structure BERT network removes the preprocessing work of generating character position information features ([CP-POS]@C) and constructing dictionary features (DictFeature), as well as the sequence part-of-speech tagging operation required by the sequence tagging task, largely avoiding tedious semi-automatic feature engineering. Because the dual-model-structure BERT network is a multi-input multi-output model, two models are compiled and trained synchronously: model one predicts the opinion target, and model two learns the relationship between the opinion target and the text. Model one is optimized while model two is trained, and finally the boundary score vectors predicted by model one are converted into the final answer by the decoding algorithm for output. Comparing the comprehensive evaluation indexes Accuracy and F1, the dual-model-structure BERT network obtains 91.36% Accuracy and 95.49% F1 on the baidu data set, 92.06% Accuracy and 95.87% F1 on dianping, and 89.45% Accuracy and 94.43% F1 on mafengwo; these results are superior to the BILSTM_CRF model, showing that the accuracy of the OTE task can be effectively improved without depending on sequence labels.
Comparing group 3 with group 2, the proposed method obtains an Accuracy of 91.53% and an F1 of 95.58% on the baidu data set, 91.99% and 95.83% on dianping, and 89.76% and 94.61% on mafengwo; the results are superior to the control group on 2 of the 3 data sets, indicating that, to some extent, adding the BiGRU network to learn the contextual semantic features of the text helps improve the accuracy of the model's text boundary prediction.
To further quantify the model comparison results, the index values predicted on the test sets are given. The prediction statistics are shown in Table 3.
TABLE 3 predicted value statistics
[Table 3 appears as an image in the original; it gives the Right and Wrong sample counts of each model on each data set.]
Right is the total number of samples the model extracts completely correctly, and Wrong is the total number of samples the model extracts wrongly. The above results illustrate the feasibility and effectiveness of the proposed dual-model-structure target boundary prediction method.
It should be noted that the described embodiments are only preferred ways of implementing the invention; all obvious modifications belonging to the overall concept of the invention fall within its scope of protection.

Claims (8)

1. A Chinese opinion target boundary prediction method based on a dual-model structure, characterized by comprising the following steps:
Step 1, acquiring a Chinese opinion text data set, dividing the data set into a training set, a verification set and a test set, and then performing data preprocessing on the training set to obtain training sample data, wherein the training sample data comprises word vectors corresponding to opinion text word sequences and sequences representing the probability distribution of the real target boundaries of the opinion texts, and the word vectors represent the words and word positions in the opinion text word sequences;
Step 2, constructing a BERT_BiGRU-based Chinese opinion target extraction model with the Keras framework, the model comprising model one and model two;
Step 3, inputting the training sample data into model two to be trained, and predicting the relationship between the opinion target label boundaries and the opinion text from the output of model two;
Step 4, obtaining verification sample information for the training sample data from the verification set, training model two based on the output and the verification sample information to obtain the trained model two, and optimizing model one with the trained model two to obtain the optimal model one for predicting the opinion target label boundaries;
Step 5, segmenting the test set data into test sample word sequences, obtaining the corresponding word vectors, inputting them into the optimal model one to obtain predicted boundary score vectors, and converting the predicted boundary score vectors into the final opinion target for output with a decoding algorithm, wherein the test data in the test set comprise only opinion texts.
2. The dual-model-structure-based Chinese opinion target boundary prediction method of claim 1, wherein the data preprocessing comprises: converting each character in the training data set into a dictionary index through the dictionary vocab.txt shipped with the BERT-wwm-ext model and building a new dictionary token_dict, representing characters absent from token_dict with the [unused1] and [UNK] labels, wherein the [unused1] label marks untrained space-like characters; shuffling the training data set with random numbers, and slicing the opinion texts of the training data according to the maxlen parameter value to obtain an opinion target label sequence and an opinion text sequence; segmenting the opinion text sequence and the opinion target label sequence into an opinion text word sequence and an opinion target label word sequence, and adding [CLS] and [SEP] labels to the head and tail of the opinion text word sequence respectively; obtaining the position of the first word of the opinion target label word sequence within the opinion text word sequence and recording the attribute value of that position as the start attribute value; obtaining the end position of the opinion target label word sequence from that start position and the length of the opinion target label word sequence; setting the value at the start position to 1 and filling with 0 to obtain a sequence s1 of the same length as the opinion text word sequence, and setting the value at the end position to 1 and filling with 0 to obtain a sequence s2 of the same length as the opinion text word sequence; splitting the opinion text into characters according to token_dict to obtain the opinion text word-id sequence x1 and its accompanying segment sequence x2; performing the above operations on every 32 samples of the training data set as one batch, collecting the x1, x2, s1 and s2 of the 32 samples in order into vectors X1, X2, S1 and S2; and padding the batch according to the maximum single-sample length within the batch to obtain new X1n, X2n, S1n and S2n as the training sample data.
3. The dual model architecture based Chinese opinion target boundary prediction method of claim 2 wherein the model two includes a plurality of cascaded data processing layers, the results of each data processing unit in a previous data processing layer among the plurality of cascaded data processing layers being input to a respective data processing unit in a next data processing layer.
4. The method of claim 3, wherein the training process of the model two is as follows: acquiring the training sample data, inputting the training sample data into a second model, and predicting the relation between the opinion target boundary and the text, wherein the relation between the opinion target boundary and the text is used for indicating the probability distribution condition of the opinion target boundary in the opinion text word sequence; and optimizing the second model according to the predicted probability distribution condition and the probability distribution condition of the real target boundary of the opinion text to obtain the trained second model.
5. The dual-model-structure-based Chinese opinion target boundary prediction method of claim 4, wherein model one is optimized while model two is trained to obtain the optimal model one, the model one being used to predict the opinion target label boundaries; the input of the optimal model one is only the test data word vectors, its outputs are the probability ps1 that each word in the opinion text starts an opinion target item and the probability ps2 that each word ends an opinion target item, and the predicted boundary score vectors _ps1 and _ps2 are obtained from the probabilities ps1 and ps2.
6. The dual-model-structure-based Chinese opinion target boundary prediction method of claim 5, wherein the predicted boundary score vectors _ps1 and _ps2 are input into a decoding layer to obtain the final opinion target with a decoding algorithm.
7. The dual-model-structure-based Chinese opinion target boundary prediction method of claim 6, wherein the decoding algorithm comprises: normalizing the boundary score vectors _ps1 and _ps2 with two softmax functions respectively, taking the attribute values of the words with the maximum probability, and obtaining the opinion target entity fragment by a slicing operation according to those attribute values.
8. The dual-model-structure-based Chinese opinion target boundary prediction method of claim 7, wherein, when training model two, the predicted probability distribution is compared with the probability distribution of the real target boundaries of the opinion text according to the target loss function:
Loss1 = -(1/N) Σ s1_in · log(softmax(ps1))
Loss2 = -(1/N) Σ s2_in · log(softmax(ps2))
Loss = Loss1 + Loss2
and the difference between the predicted result and the real result is evaluated to train model two and optimize model one, where s1_in and s2_in are the indicators of the real opinion target boundaries and N is the number of samples.
CN202110374539.3A 2021-04-07 2021-04-07 Chinese opinion target boundary prediction method based on dual-model structure Withdrawn CN113157883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110374539.3A CN113157883A (en) 2021-04-07 2021-04-07 Chinese opinion target boundary prediction method based on dual-model structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110374539.3A CN113157883A (en) 2021-04-07 2021-04-07 Chinese opinion target boundary prediction method based on dual-model structure

Publications (1)

Publication Number Publication Date
CN113157883A true CN113157883A (en) 2021-07-23

Family

ID=76889191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110374539.3A Withdrawn CN113157883A (en) 2021-04-07 2021-04-07 Chinese opinion target boundary prediction method based on dual-model structure

Country Status (1)

Country Link
CN (1) CN113157883A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140386A1 (en) * 2019-01-02 2020-07-09 平安科技(深圳)有限公司 Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
EP3767516A1 (en) * 2019-07-18 2021-01-20 Ricoh Company, Ltd. Named entity recognition method, apparatus, and computer-readable recording medium
CN111222317A (en) * 2019-10-16 2020-06-02 平安科技(深圳)有限公司 Sequence labeling method, system and computer equipment
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment
CN112215004A (en) * 2020-09-04 2021-01-12 中国电子科技集团公司第二十八研究所 Application method in extraction of text entities of military equipment based on transfer learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE LONG: "In-depth Understanding of XGBoost: Efficient Machine Learning Algorithms and Advanced Topics", 31 January 2020, China Machine Press *
YANG PIAO ET AL.: "Chinese Named Entity Recognition Method Based on BERT Embedding", Computer Engineering *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210723)