CN116522165A - Public opinion text matching system and method based on twin structure

Info

Publication number: CN116522165A (application CN202310761055.3A; granted publication CN116522165B)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: sentence, layer, similarity, bert, vector
Inventors: 陈宏伟, 涂麟曦
Assignee (original and current): Wuhan Agco Software Technology Co., Ltd.
Legal status: Granted; active

Classifications

    • G06F18/22 Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F16/353 Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06F18/253 Fusion techniques of extracted features
    • G06F40/295 Natural language analysis; named entity recognition
    • G06F40/30 Handling natural language data; semantic analysis
    • G06N3/045 Neural network architectures; combinations of networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a public opinion text matching system based on a twin structure, comprising a twin neural network module, used for constructing the coding layer of a twin neural network and obtaining a first similarity characterization vector between named entities; a semantic interaction module, used for obtaining a second similarity characterization vector; a fusion module, used for splicing the first and second similarity characterization vectors into the final similarity characterization vector of the sentence pair; and a matching module, used for passing the final similarity characterization vector through a softmax classification function to obtain the text matching result. The method extracts both the named-entity similarity features and the semantic similarity features of public opinion texts, fuses the two kinds of features, and then computes semantic similarity to analyze whether two public opinion texts are similar. Because it matches not merely the topic and meaning of the texts but also considers whether the expressions concern the same person, thing, or phenomenon, it improves the accuracy and robustness of public opinion text matching.

Description

Public opinion text matching system and method based on twin structure
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a public opinion text matching system and method based on a twin (Siamese) structure.
Background
The core problem of current public opinion text matching methods is judging the similarity of text data: only when that judgment is accurate can the matching accuracy of a public opinion text system be improved. In the traditional approach, manually judging, labeling, and removing similar public opinion texts costs a great deal of manpower and time. An intelligent public opinion text matching system is therefore needed to distill the important information and improve the efficiency of text analysis. Public opinion text matching plays a crucial role in public opinion analysis and early warning, and its accuracy bears directly on the accuracy of subsequent public opinion assessment.
At present, public opinion text matching is mostly computed in one of two ways: with traditional text matching algorithms or with deep-learning-based text matching algorithms. Traditional text matching algorithms can generally be classified into string-based, statistics-based, and knowledge-base-based methods. Most of them can only compute the surface meaning of the text and struggle to mine its deeper meaning. As the demands of natural language processing tasks have broadened, traditional methods have never broken through the bottleneck of the semantic similarity computation task, so they have gradually been replaced by deep-learning-based semantic similarity algorithms. Deep-learning-based text matching algorithms can understand the deeper meaning of text, so the models perform better, but because this line of research is still young, model accuracy remains to be improved. word2vec, the distributed word vector method proposed in 2013, predicts the vector of each word in a text from the context within a fixed window, and the generated word vectors, once spliced, can represent some semantic information; however, the context each word depends on is limited, so the semantic information each word vector expresses is only local. In 2014, the doc2vec method was proposed to vectorize the text of a document, which differs from words in that a document has no word-to-word logical structure and is one integral piece of text data. The vectors generated by both methods are static, i.e., they cannot change dynamically with the text's context, which affects the accuracy and performance of these methods.
In recent years, BERT has had a great influence on the field of natural language processing. It combines the self-attention mechanism with two novel and effective pre-training objectives, the masked language model task and the next sentence prediction task, which bring a large performance improvement and have made it one of the most commonly used methods for generating dynamic word vectors. Public opinion text matching is harder than general text matching: it must judge not only whether two texts are semantically similar, but also whether the beliefs, attitudes, opinions, and emotions they express concern the same person, thing, or phenomenon. Existing text matching algorithms consider only character-level matching or meaning-level matching; that is, two texts are judged similar when they share many characters or express the same topic or meaning, without being specific to the person or event involved. The invention therefore provides a public opinion text matching method based on a twin structure to further improve the accuracy and robustness of text matching in public opinion scenarios.
Disclosure of Invention
Public opinion text matching is more difficult than general text matching: it requires judging not only whether two texts are semantically similar, but also whether the beliefs, attitudes, opinions, and emotions they express concern the same person, thing, or phenomenon.
In order to overcome the defects of the prior art, the invention aims to provide a public opinion text matching system and method based on a twin structure.
According to a first aspect of the present invention, there is provided a system of public opinion text matching based on a twin structure, comprising:
a twin neural network module: used for constructing the coding layer of a twin neural network, extracting the named entity information in sentence pairs, and computing similarity over the extracted named entities to obtain a first similarity characterization vector between the named entities;
a semantic interaction module: used for obtaining a second similarity characterization vector of the sentence pair at the semantic level;
a fusion module: used for splicing the first and second similarity characterization vectors to obtain the final similarity characterization vector of the sentence pair;
and a matching module: used for passing the final similarity characterization vector through a softmax classification function to obtain the text matching result.
In an exemplary embodiment of the present invention, the twin neural network module constructs the coding layer of the twin neural network with a BERT+CRF method. The coding layer consists of two coupled, identically (or similarly) structured branches, each a three-layer architecture comprising an input layer, a feature extraction layer, and a similarity measurement layer. The input layer receives the sentence pair to be matched; the feature extraction layer embeds each input sample into a high-dimensional space to obtain the characterization vectors of the two samples of the pair; and the similarity measurement layer computes the similarity of the two extracted characterization vectors with a mathematical formula to obtain the first similarity characterization vector of the sentence pair.
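As an illustration only (not part of the patented implementation), the following minimal PyTorch sketch shows the weight-sharing arrangement such a coupled encoder implies; the checkpoint name, class names, and the 7-tag count (taken from the BIO scheme described later) are assumptions.

```python
# Minimal sketch of a coupled (twin) BERT+CRF-style encoder, assuming
# PyTorch and Hugging Face transformers; names are illustrative only.
import torch
import torch.nn as nn
from transformers import BertModel

class SiameseEncoder(nn.Module):
    def __init__(self, model_name: str = "bert-base-chinese", num_tags: int = 7):
        super().__init__()
        # One BERT instance is shared by both branches, which is what
        # makes the two sub-networks "coupled": identical weights.
        self.bert = BertModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.bert.config.hidden_size, num_tags)

    def encode(self, input_ids, attention_mask):
        # Feature extraction layer: embed the sentence into the
        # high-dimensional representation space.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return hidden, self.emission(hidden)  # token vectors + per-tag scores

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        # Both sentences of the pair pass through the same encoder.
        out_a = self.encode(ids_a, mask_a)
        out_b = self.encode(ids_b, mask_b)
        return out_a, out_b
```

Sharing one BERT instance across both branches is precisely what couples the two sub-networks: the two samples of a pair are embedded by identical weights, so their characterization vectors live in the same space and can be compared directly by the similarity measurement layer.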
In an exemplary embodiment of the invention, the BERT model of the twin neural network module further comprises a masked language model task unit (the masked language model task of the BERT layer obtains the word-level text features of the sentences in the input sentence pair): part of the characters in the input layer are randomly masked during training, and the remaining unmasked characters are used to predict the masked ones; trained this way, the model fully learns the word-level text features of the input sentence, and the feature vectors output by the BERT layer are then fed to the CRF layer;
the module also comprises a next sentence prediction task unit, used to make the model judge whether the A sentence and the B sentence of an input pair are contextually related, so that the model learns the relationship between two texts and handles the sentence-level problem; the feature vectors output by the BERT layer are then fed to the CRF layer;
In an exemplary embodiment of the present invention, the CRF model of the twin neural network module further comprises a transition probability unit over the tags in the dataset: the CRF layer corrects the output of the BERT layer by learning the transition probabilities between tags in the dataset, ensuring that the predicted tags are well formed;
the module also comprises a labeling unit: because the named entities in sentence pairs must be extracted, the training set, i.e., the sentence pairs, labels entities with the BIO method, where B (begin) marks a character at the beginning of an entity, I (inside) marks a character inside an entity, and O (outside) marks non-entity characters of no concern (see the BIO sketch below); for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types serve as the entity labels of the training set: B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O;
the module also comprises a unit for obtaining the part-of-speech state characterization vectors: before a sentence pair is fed into the twin neural network, the [CLS] identifier must be added to the head of each sentence, yielding T_A for the A sentence and T_B for the B sentence of the pair; T_A and T_B are sent into BERT for fine-tuning, the BERT layer's encoding introduces context information for the character at every position in the sentence, producing the part-of-speech state characterization vectors, and the entire BERT output is fed to the CRF layer;
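The following self-contained sketch, with a hypothetical sentence and hand-written tags rather than the patent's training data, shows how the 7-tag BIO scheme recovers entity words (a B-tagged character starts an entity, and the following I-tagged characters extend it):

```python
# Illustrative sketch of the 7-tag BIO scheme; sentence and tags are
# hypothetical, not taken from the patent's training set.
TAGS = ["B-PER", "I-PER", "B-GEO", "I-GEO", "B-ORG", "I-ORG", "O"]

def extract_entities(chars, tags):
    """Group a B-* character and the following I-* characters into one entity."""
    entities, current, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:                      # a new B closes the previous entity
                entities.append(("".join(current), etype))
            current, etype = [ch], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(ch)               # extend the open entity
        else:                                # "O" closes any open entity
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

chars = list("张三在武汉")                        # "Zhang San is in Wuhan"
tags  = ["B-PER", "I-PER", "O", "B-GEO", "I-GEO"]
print(extract_entities(chars, tags))              # [('张三', 'PER'), ('武汉', 'GEO')]
```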
In an exemplary embodiment of the present invention, the semantic interaction module, based on BERT, learns the sentence-relationship features between texts through the next sentence prediction task and comprises an encoding layer, a pooling layer, and a normalization layer of the interaction module. In the encoding layer of the interaction module, before the sentence pair is fed into BERT, the [CLS] identifier is added at the head of the sentence and the [SEP] identifier is inserted between the two sentences to separate them; the spliced sentence T is sent into the BERT model for fine-tuning, and the output C_T is the vectorized representation of the sentence pair;
in the pooling layer of the interaction module, the sentence vector C_T obtained through BERT has its important features extracted and its dimension reduced by pooling;
in the normalization layer of the interaction module, the output after layer normalization is the second similarity characterization vector of the sentence pair obtained by the interaction module.
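For illustration, a sketch of how the interaction module's spliced input could be produced with the Hugging Face transformers library; the bert-base-chinese checkpoint is an assumption, not a model named in the patent.

```python
# Sketch of assembling the interaction module's input, assuming Hugging
# Face transformers; the checkpoint name is an assumption.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# Passing the two sentences together yields [CLS] A [SEP] B [SEP],
# with segment ids distinguishing sentence A from sentence B.
inputs = tokenizer("好好学习", "天天向上", return_tensors="pt")
with torch.no_grad():
    C_T = model(**inputs).last_hidden_state   # vectorized sentence pair
print(C_T.shape)                              # (1, sequence_length, hidden_size)
```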
In an exemplary embodiment of the present invention, in the matching module the softmax classification function is as follows; softmax(x)_j denotes the probability that the sample vector x belongs to the j-th class, W is a weight matrix, and k is the number of classes:

softmax(x)_j = exp(W_j x) / Σ_{i=1..k} exp(W_i x)

The final similarity characterization vector F = [F_SEN; F_INT] is fed into the softmax function, where F_SEN is the output of the twin neural network module, F_INT is the output of the interaction module, and F plays the role of x above. The result y lies in the interval [0, 1]; assuming the text similarity threshold is set to 0.5, the two texts are considered matched when y > 0.5 and unmatched otherwise.
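A minimal sketch of the fusion and matching steps under assumed vector dimensions; the linear map W and the two-class setup (match / no match) follow the description above, everything else is illustrative.

```python
# Sketch: concatenate the two similarity characterization vectors and
# classify with softmax. Dimensions and random vectors are stand-ins.
import torch
import torch.nn as nn

f_sen = torch.randn(1, 128)                    # twin-network module output (size assumed)
f_int = torch.randn(1, 256)                    # interaction module output (size assumed)
f_final = torch.cat([f_sen, f_int], dim=-1)    # final similarity characterization vector

W = nn.Linear(f_final.shape[-1], 2)            # k = 2 classes: match / no match
probs = torch.softmax(W(f_final), dim=-1)
y = probs[0, 1].item()                         # probability of "match", in [0, 1]
print("match" if y > 0.5 else "no match")      # 0.5 threshold from the text
```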
According to a second aspect of the present invention, there is provided a method of public opinion text matching based on a twin structure, applying the above system of public opinion text matching based on a twin structure, and comprising the following steps:
Constructing a coding layer of the twin neural network, thereby extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities;
obtaining a second similarity characterization vector of the sentence pairs in terms of semantics;
splicing the first similarity characterization vector and the second similarity characterization vector to obtain a final similarity characterization vector of the sentence pair;
and passing the final similarity characterization vector through a softmax classification function to obtain the text matching result.
In an exemplary embodiment of the present invention, the coding layer of the twin neural network is constructed with the BERT+CRF method and comprises a coupled three-layer architecture built from two identical or similar neural networks: an input layer, a feature extraction layer, and a similarity measurement layer. The input layer receives the sentence pair to be matched; the feature extraction layer embeds the input sentence-pair samples into a high-dimensional space to obtain the characterization vectors of the two samples; and the similarity measurement layer computes the similarity of the two extracted characterization vectors with a mathematical formula to obtain the first similarity characterization vector of the sentence pair.
In an exemplary embodiment of the present invention, constructing the coding layer of the twin neural network, extracting the named entity information in sentence pairs, and computing similarity over the extracted named entities to obtain the first similarity characterization vector between the named entities specifically further comprises:
the masked language model task of the BERT layer is used to obtain the word-level text features of the sentences in the input sentence pair, and the feature vectors output by the BERT layer are then fed to the CRF layer;
the CRF layer corrects the output of the BERT layer by learning the transition probabilities between the tags in the dataset;
the training set, i.e., the sentence pairs, labels entities with the BIO method: B (begin) marks a character at the beginning of an entity, I (inside) marks a character inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types serve as the entity labels of the training set: B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O;
before the sentence pair is fed into the twin neural network, the [CLS] identifier must be added to the head of each sentence, yielding the sentence vectors T_A and T_B of the A and B sentences; T_A and T_B are sent into BERT for fine-tuning, the BERT layer's encoding introduces context information for the character at every position in the sentence, producing the part-of-speech state characterization vectors, and the entire BERT output serves as the input of the CRF layer.
In an exemplary embodiment of the present invention, obtaining the second similarity characterization vector of the sentence pair at the semantic level specifically comprises: based on BERT, the next sentence prediction task is used to learn the sentence-relationship features between texts; before the sentence pair is fed into BERT, the [CLS] identifier is added at the head of the sentence and the [SEP] identifier is inserted between the two sentences to separate them; the spliced sentence T is sent into the BERT model for fine-tuning, and the output C_T is the vectorized representation of the sentence pair; the sentence vector C_T obtained through BERT passes through a pooling layer to extract the important features and reduce the dimension; and the output of layer normalization applied to the pooled sentence vector is the second similarity characterization vector of the sentence pair obtained by the interaction module.
According to a third aspect of the present invention, there is provided a computer readable storage medium comprising a stored program, wherein the program when run performs the above-described method of twin structure-based public opinion text matching.
According to a fourth aspect of the present invention there is provided an electronic device comprising a memory having a computer program stored therein and a processor arranged to perform the twin structure based method of public opinion text matching by the computer program.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the system is divided into two main modules, namely a twin neural network module based on BERT+CRF and a semantic interaction module based on BERT. The twin neural network module utilizes a BERT+CRF method to construct a coding layer of the twin neural network, so that named entity information in sentence pairs including names, places and the like is extracted, similarity calculation is carried out on the extracted named entities, and similarity characteristics (characterization vectors) among the named entities are obtained. The BERT-based semantic interaction module may obtain semantically similar features (token vectors) of sentence pairs. According to the invention, the named entity similarity feature and the text semantic similarity feature of the public opinion text are extracted through the two modules, semantic similarity calculation is carried out after the two types of features are fused, whether the two public opinion texts are similar or not is analyzed, and accuracy and robustness of matching of the public opinion text are improved, because the topic and meaning of the text are not simply matched, and meanwhile, matching of expressions aiming at the same person, thing or phenomenon is considered.
Drawings
Fig. 1 is a schematic diagram of a public opinion text matching system based on a twin structure.
Fig. 2 is a diagram of the input characterization vectors of the BERT model of the twin neural network module of the present invention.
Fig. 3 is a specific label form diagram of a training set of the twin neural network module of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
Example 1
Referring to fig. 1, the present embodiment provides a system for matching public opinion text based on a twin structure, including: the twin neural network module is used for constructing a coding layer of the twin neural network, extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities; the semantic interaction module is used for acquiring a second similarity characterization vector of the sentence pairs in terms of semantics; the fusion module is used for splicing the first similarity characterization vector and the second similarity characterization vector to obtain a final similarity characterization vector of the sentence pair; and the matching module is used for obtaining a text matching result from the final similarity characterization vector through a softMax classification function.
In an exemplary embodiment, the twin neural network module constructs the coding layer of the twin neural network with the BERT+CRF method (i.e., a BERT model plus a CRF model). The coding layer comprises a coupled three-layer architecture built from two identical or similar neural networks (here, BERT model + CRF model); the natural advantages of this coupled architecture make it well suited to similarity matching problems. The three layers are the input layer, the feature extraction layer, and the similarity measurement layer. The input layer receives the sentence-pair samples to be matched; the feature extraction layer embeds the input samples into a high-dimensional space to obtain the two characterization vectors of the sentence-pair samples; and the similarity measurement layer computes the similarity of the two extracted characterization vectors with a mathematical formula to obtain the first similarity characterization vector of the sentence pair. The similarity of the two samples can generally be computed with methods such as the Euclidean distance, cosine distance, or Jaccard distance.
Specifically, the BERT model uses a multi-layer Transformer encoder as its network layer, which can deeply mine the important features in text and capture context information over long distances. BERT is a multi-task model: a pre-trained BERT model can complete a variety of downstream tasks. The model's input may be a single sentence or a pair of texts. For text input, the special classification symbol [CLS] is added at the head of the text sequence, and the special symbol [SEP] is added at the end position of each sentence as its delimiter and end marker. Each character in the text is first vector-initialized by a word2vec model to form the original characterization vector. To distinguish where characters come from, a segment embedding is added that records whether a character belongs to sentence A or sentence B of the pair. Finally, so that the model can learn how each character's position in the sentence affects the sentence's meaning, a position embedding is also needed. The final input characterization vector of the BERT model is the sum of three parts, the word embedding, the segment embedding, and the position embedding, as shown in fig. 2.
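A sketch of this three-part input representation, assuming PyTorch; the vocabulary size, hidden size, and token ids are illustrative stand-ins, not values from the patent.

```python
# Sketch of the three-part input representation: word + segment + position
# embeddings summed element-wise. All sizes and ids are illustrative.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 21128, 512, 768
tok_emb = nn.Embedding(vocab_size, hidden)   # word embedding
seg_emb = nn.Embedding(2, hidden)            # segment: sentence A (0) or B (1)
pos_emb = nn.Embedding(max_len, hidden)      # position embedding

token_ids   = torch.tensor([[101, 2769, 102]])   # illustrative ids for [CLS] 我 [SEP]
segment_ids = torch.zeros_like(token_ids)        # all from "sentence A"
positions   = torch.arange(token_ids.shape[1]).unsqueeze(0)

# The BERT input vector is the element-wise sum of the three embeddings.
x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(x.shape)   # (1, 3, 768)
```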
The pre-training of the BERT model consists of two unsupervised learning subtasks: the masked language model and the next sentence prediction task. In the masked language model, part of the characters in the input layer are masked during training, and the remaining unmasked characters are used to predict the masked ones; this training lets the model fully learn the word-level text features of the input sentence. The next sentence prediction task makes the model judge whether two input sentences are contextually related, so that the model learns the relationship between two texts and handles the sentence-level problem. After every character has been fully trained on these two tasks over a large unsupervised corpus, the language features of the text are learned and character vector encodings with deeper expressiveness are output. In downstream tasks, the trained model parameters can be used directly to vectorize text.
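A toy sketch of the random masking used by the masked language model task (the 15% rate is BERT's published default, not a figure from the patent):

```python
# Randomly replace characters with [MASK]; the model must predict the
# originals from the remaining context. The sentence is hypothetical.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # the character the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = list("武汉市今天发布新的交通政策")
print(mask_tokens(tokens))
```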
In an exemplary embodiment, the BERT model of the twin neural network module further comprises a masked language model task unit (the masked language model task of the BERT layer obtains the word-level text features of the sentences in the input sentence pair): part of the characters in the input layer are randomly masked during training, and the remaining unmasked characters are used to predict the masked ones, so that the model fully learns the word-level text features of the input sentence; the feature vectors output by the BERT layer are then fed to the CRF layer. It also comprises a next sentence prediction task unit, used to make the model judge whether the A sentence and the B sentence of an input pair are contextually related, so that the model learns the relationship between two texts and handles the sentence-level problem; the feature vectors output by the BERT layer are then fed to the CRF layer.
In an exemplary embodiment of the present invention, the CRF model of the twin neural network module further comprises a transition probability unit over the tags in the dataset: the CRF layer corrects the output of the BERT layer by learning the transition probabilities between tags in the dataset, ensuring that the predicted tags are well formed (for example, if the BERT layer previously output the vector X, the corrected output is X'). It also comprises a labeling unit: because the named entities in sentence pairs must be extracted, the training set, i.e., the sentence pairs, labels entities with the BIO method, where B (begin) marks a character at the beginning of an entity, I (inside) marks a character inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types serve as the entity labels of the training set: B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O. It also comprises a unit for obtaining the part-of-speech state characterization vectors: before the A and B sentences of a pair are fed into the twin neural network, the [CLS] identifier must be added to the head of each sentence, yielding the sentence vectors T_A and T_B; T_A and T_B are sent into BERT for fine-tuning, the BERT layer's encoding introduces context information for the character at every position in the sentence, producing the part-of-speech state characterization vectors, and the entire BERT output serves as the input of the CRF layer. The module may also comprise a preprocessing unit for cleaning the text of the training set (the sentence pairs) that serves as model input: stop words are removed, the whole text is filtered with a stop-word list (reducing the text length and improving the model's computational efficiency), and the input text length is limited by direct cut-off.
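A sketch of the cleaning, stop-word filtering, and direct cut-off described above; the stop-word list and maximum length are illustrative assumptions (traditional-to-simplified conversion and half-width normalization are omitted for brevity):

```python
# Illustrative preprocessing: remove blanks and emoticons, filter stop
# words, and truncate. Stop words and MAX_LEN are assumptions.
import re

STOP_WORDS = {"的", "了", "啊", "吧"}      # hypothetical stop-word list
MAX_LEN = 128                              # direct cut-off length (assumed)

def preprocess(text: str) -> str:
    text = re.sub(r"\s+", "", text)                       # drop blank symbols
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)   # drop emoticons
    text = "".join(ch for ch in text if ch not in STOP_WORDS)
    return text[:MAX_LEN]                                 # direct cut-off

print(preprocess("武汉的 天气 真好了 😀"))   # -> 武汉天气真好
```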
In short, the twin neural network module constructs the coding layer of the twin neural network with the BERT+CRF method, extracts the named entity information in sentence pairs, and computes similarity over the extracted named entities. The module first uses the masked language model task of the BERT layer to obtain the word-level text features of the input sentence. The feature vectors output by the BERT layer are then fed to the CRF layer, which corrects the BERT output by learning the transition probabilities between tags in the dataset, ensuring that the predicted tags are well formed. The steps are as follows:
The sentence pair A and B, i.e., the pair whose similarity must be judged, is the training set of the twin neural network module and serves as the model input; the input text must first be cleaned and have stop words removed. Cleaning the text means handling its redundant and erroneous information: unimportant content such as blank symbols or emoticons is deleted, traditional Chinese characters are converted into simplified ones, and the character format is unified to half-width, to ease the subsequent text characterization. Mood particles and other unimportant words can be deleted directly, and the whole text is filtered with a stop-word list, reducing the text length and improving the model's computational efficiency. The input text length is limited by direct cut-off. The processed sentence A has length n and sentence B has length m, written A = {W_A1, W_A2, ..., W_An} and B = {W_B1, W_B2, ..., W_Bm}, where W_Ai and W_Bi denote the i-th words of sentences A and B respectively.

Because the named entities in sentence pairs must be extracted, the training set labels entities with the BIO method: B (begin) marks a character at the beginning of an entity, I (inside) marks a character inside an entity, and O (outside) marks non-entity characters of no concern. For public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so the entity labels of the training set comprise 7 tag types: B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O. The specific label format is shown in fig. 3.

Before the sentence pair is fed into the twin neural network, the [CLS] identifier must be added to the head of each sentence, yielding T_A and T_B. T_A and T_B are sent into BERT for fine-tuning; the BERT layer's encoding introduces context information for the character at every position in the sentence, producing the part-of-speech state characterization vectors T_A = {t_A1, ..., t_An} and T_B = {t_B1, ..., t_Bm}, where t_Ai denotes the encoding vector of the i-th word of sentence A and t_Bi that of sentence B. The entire BERT output is fed to the CRF layer.

The CRF has two types of feature functions: one captures the correspondence between the observation sequence and the states (e.g., "我" is generally a noun), the other the relationship between states (e.g., a verb is generally followed by a noun). In the BERT+CRF model, the output of the first type of feature function is replaced by the BERT output, and the second type is a label transfer matrix M, where M[y_i, y_j] denotes the transition score between tags y_i and y_j. Specifically, the characterization vectors output by the BERT layer form a matrix that gives each character W_Ai a tag score distribution E_i; this matrix is called the emission matrix. For sentence A, the corresponding tag sequence y = {y_1, y_2, ..., y_n} is one chain. Since sentence A has length n and there are 7 tag types in total, there are 7^n possible labeling results.
For public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so the entity labels of the training set are the 7 tag types B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O; the number of tags may be smaller or larger depending on the specific application scenario, this being only the general case. For a character W_Ai, its tag score distribution E_i is a 7-dimensional vector, and the score of tag y_i is E_i[y_i], where y_i is an integer tag index. Summing all the E_i[y_i] gives the score of the character nodes, and the label transfer matrix M gives the transition score M[y_{i-1}, y_i] from y_{i-1} to y_i. Finally, summing all scores gives the score of one possible labeling of sentence A:

score(A, y) = Σ_{i=1..n} E_i[y_i] + Σ_{i=2..n} M[y_{i-1}, y_i]

The probability of each labeling result is then obtained by softmax normalization:

P(y | A) = exp(score(A, y)) / Σ_{y'} exp(score(A, y'))

Similarly, the probability of each labeling result y of sentence B is P(y | B).
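The following brute-force sketch makes the scoring concrete for a toy case: it enumerates all 7^n labelings of a length-n sentence, scores each with the emission and transition matrices, and normalizes with softmax. Real CRF layers compute the normalizer with the forward algorithm and decode with Viterbi; the random matrices below are stand-ins for learned parameters.

```python
# Brute-force CRF path scoring and softmax normalization for a tiny
# sentence; E and M are random stand-ins for learned parameters.
import itertools
import numpy as np

n, k = 3, 7                    # sentence length, 7 BIO tags
E = np.random.randn(n, k)      # emission scores from the BERT layer
M = np.random.randn(k, k)      # transition scores M[y_prev, y_next]

def path_score(y):
    s = sum(E[i, y[i]] for i in range(n))          # character-node scores
    s += sum(M[y[i - 1], y[i]] for i in range(1, n))  # transition scores
    return s

# Enumerate all k**n labelings (feasible only for tiny n).
paths = list(itertools.product(range(k), repeat=n))
scores = np.array([path_score(y) for y in paths])
probs = np.exp(scores - scores.max())
probs /= probs.sum()                               # softmax over labelings
best = paths[int(probs.argmax())]
print("most likely labeling:", best)
```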
The labeling result with the highest probability gives each character its entity tag: a character tagged B starts an entity, and all following characters tagged I are spliced onto it to form the entity word. The character characterization vectors output by the BERT layer at the entity word's character positions are extracted to obtain the entity vectors E_A and E_B, and a similarity measurement layer is built with the cosine algorithm; the distance feature between the two vectors is computed as:

cos(E_A, E_B) = (E_A · E_B) / (||E_A|| · ||E_B||)
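A minimal sketch of this cosine similarity measurement layer, with random stand-ins for the entity characterization vectors:

```python
# Cosine similarity between the two entity vectors; torch is an
# implementation assumption, and the vectors are random stand-ins.
import torch
import torch.nn.functional as F

E_A = torch.randn(768)   # entity vector extracted from sentence A
E_B = torch.randn(768)   # entity vector extracted from sentence B

# cos(E_A, E_B) = (E_A . E_B) / (|E_A| * |E_B|)
sim = F.cosine_similarity(E_A, E_B, dim=0)
print(float(sim))        # in [-1, 1]; larger means more similar entities
```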
This yields the first similarity characterization vector F_SEN, i.e., the similarity feature matrix of the sentence pair obtained with the twin neural network (SEN); it is then further fused with the interaction features of the sentence pair obtained through BERT.
In an exemplary embodiment, the BERT-based semantic interaction module learns the sentence-relationship features between texts through the next sentence prediction task and comprises an encoding layer, a pooling layer, and a normalization layer of the interaction module.

In the encoding layer of the interaction module, before the sentence pair is fed into BERT, the [CLS] identifier is added at the head of the sentence and the [SEP] identifier is inserted between the two sentences to separate them. The spliced sentence T is sent into the BERT model for fine-tuning, and the output C_T is the vectorized representation of the sentence pair. Specifically, for the example sentence pair "好好学习" and "天天向上", T = {[CLS], 好, 好, 学, 习, [SEP], 天, 天, 向, 上}.

In the pooling layer of the interaction module, the sentence vector C_T obtained through BERT has its important features extracted and its dimension reduced by pooling.

In the normalization layer of the interaction module, the output after layer normalization is the second similarity characterization vector of the sentence pair obtained by the interaction module.
As a specific example, the model is trained with a community question-answering dataset: a large-scale, high-quality question-answer dataset about social questions in which each question has multiple feedback answers, and the feedback on the same question can be used as similar public opinion.
For the encoding layer of the interaction module, before the sentence pair is fed into BERT, the [CLS] identifier is added at the head of the sentence and the [SEP] identifier is inserted between the two sentences to separate them; the spliced sentence T is sent into the BERT model for fine-tuning, and the output C_T is the vectorized representation of the sentence pair.
For the pooling layer of the interaction module, the sentence vector C_T obtained through BERT has its important features extracted by pooling. Average pooling is mainly used when all information should contribute, for example to obtain global context or the semantic information deep in the network. Max pooling mainly reduces the influence of useless information, and at the same time it reduces the feature dimension and extracts better, stronger semantic information features. To make the model more robust, the feature (characterization) vectors are processed with average pooling and max pooling together. Average pooling of the sentence vector C_T gives C_avg, and max pooling gives C_max, where C_avg is the vector of sentence T obtained after global average pooling and C_max the vector obtained after global max pooling. The two results are spliced: C_pool = [C_avg; C_max].
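A sketch of the dual pooling, assuming PyTorch tensors with an illustrative shape:

```python
# Average pooling and max pooling over the token dimension, concatenated
# into one vector; the tensor shape is illustrative.
import torch

C_T = torch.randn(1, 32, 768)              # BERT output: (batch, tokens, hidden)
C_avg = C_T.mean(dim=1)                    # global average pooling
C_max = C_T.max(dim=1).values              # global max pooling
C_pool = torch.cat([C_avg, C_max], dim=-1) # spliced result
print(C_pool.shape)                        # (1, 1536)
```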
For the normalization layer of the interaction module, the output of layer normalization applied to C_pool is F_INT, which serves as the second similarity characterization vector of the interaction module.
The system further comprises the matching module, which splices the first similarity characterization vector F_SEN obtained by the twin neural network module with the second similarity characterization vector F_INT obtained by the BERT-based interaction module into the final similarity characterization vector F = [F_SEN; F_INT] of sentences A and B. This vector expresses both the differences between the entity words of the sentence pair and, through the BERT model, the deep semantic interaction features of the pair, yielding more accurate text similarity information. Finally, the result is obtained through the softmax classification function.
In an exemplary embodiment of the present invention, in the matching module the softmax classification function is as follows; softmax(x)_j denotes the probability that the sample vector x belongs to the j-th class, W is a weight matrix, and k is the number of classes:

softmax(x)_j = exp(W_j x) / Σ_{i=1..k} exp(W_i x)

The final similarity characterization vector F = [F_SEN; F_INT] is fed into the softmax function, where F_SEN is the output of the twin neural network module, F_INT is the output of the interaction module, and F plays the role of x above. The result y lies in the interval [0, 1]; assuming the similarity threshold for the texts of sentences A and B is set to 0.5, the two texts are considered matched when y > 0.5 and unmatched otherwise.
Example two
This embodiment provides a method of public opinion text matching based on a twin structure, using the system of public opinion text matching based on a twin structure of the first embodiment, comprising the following steps:
constructing a coding layer of a twin neural network, extracting named entity information in sentence pairs by utilizing a twin neural network module, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities;
the method comprises the steps of constructing a coding layer of a twin neural network, specifically constructing the coding layer of the twin neural network by using a BERT+CRF model (method), and constructing a coupling three-layer framework which is built by two identical or similar neural networks and is respectively an input layer, a feature extraction layer and a similarity measurement layer, wherein the input layer inputs sentence pairs to be matched, the feature extraction layer embeds input sentence pair samples into a high latitude space to obtain characterization vectors of the sentence pairs and the two samples, and the similarity measurement layer carries out similarity calculation on the extracted characterization vectors of the two samples through a mathematical formula to obtain a first similarity characterization vector of the sentence pairs.
Specifically, a BERT+CRF model (method) is utilized to construct a coding layer of the twin neural network, so that named entity information in sentence pairs is extracted, similarity calculation is carried out on the extracted named entities, a first similarity characterization vector among the named entities is obtained, and the method specifically further comprises:
the masked language model task of the BERT layer is used to obtain the word-level text features of the sentences in the input sentence pair, and the feature vectors output by the BERT layer are then fed to the CRF layer;

the next sentence prediction task of the BERT layer is used to make the model judge whether the A sentence and the B sentence of an input pair are contextually related, so that the model learns the relationship between two texts and handles the sentence-level problem; the feature vectors output by the BERT layer are then fed to the CRF layer;

the CRF layer corrects the output of the BERT layer by learning the transition probabilities between the tags in the dataset;

the training set, i.e., the sentence pairs, labels entities with the BIO method: B (begin) marks a character at the beginning of an entity, I (inside) marks a character inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types serve as the entity labels of the training set: B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O;

before the sentence pair is fed into the twin neural network, the [CLS] identifier must be added to the head of each sentence, yielding the sentence vectors T_A and T_B of the A and B sentences; T_A and T_B are sent into BERT for fine-tuning, the BERT layer's encoding introduces context information for the character at every position in the sentence, producing the part-of-speech state characterization vectors, and the entire BERT output serves as the input of the CRF layer.
The second similarity characterization vector of the sentence pair at the semantic level is obtained with the semantic interaction module, which specifically comprises the following steps:
specifically, based on BERT, the next sentence prediction task is used to learn the sentence-relationship features between texts;

before the sentence pair is fed into BERT, the [CLS] identifier is added at the head of the sentence and the [SEP] identifier is inserted between the two sentences to separate them; the spliced sentence T is sent into the BERT model for fine-tuning, and the output C_T is the vectorized representation of the sentence pair;

the sentence vector C_T obtained through BERT passes through a pooling layer to extract the important features and reduce the dimension;

the output of layer normalization applied to the pooled sentence vector is the second similarity characterization vector of the sentence pair obtained by the interaction module.
Splicing the first similarity representation vector and the second similarity representation vector, and splicing by using a fusion module to obtain a final similarity representation vector of the sentence pair;
The final similarity characterization vector is then passed through the softmax classification function to obtain the text matching result: the matching module splices the first similarity characterization vector obtained by the twin neural network module with the second similarity characterization vector obtained by the BERT-based interaction module into the final similarity characterization vector F = [F_SEN; F_INT] of sentences A and B. This vector expresses both the differences between the entity words of the sentence pair and, through the BERT model, the deep semantic interaction features of the pair, yielding more accurate text similarity information; the final result is obtained through the softmax classification function.
In an exemplary embodiment of the present invention, in the matching module the softmax classification function is as follows; softmax(x)_j denotes the probability that the sample vector x belongs to the j-th class, W is a weight matrix, and k is the number of classes:

softmax(x)_j = exp(W_j x) / Σ_{i=1..k} exp(W_i x)

The final similarity characterization vector F = [F_SEN; F_INT] is fed into the softmax function, where F_SEN is the output of the twin neural network module, F_INT is the output of the interaction module, and F plays the role of x above. The result y lies in the interval [0, 1]; assuming the similarity threshold for the texts of sentences A and B is set to 0.5, the two texts are considered matched when y > 0.5 and unmatched otherwise.
To further demonstrate the technical effect of the invention, the public opinion text matching method based on a twin structure was applied to the STS-B semantic similarity dataset. Each record in the dataset comprises a sentence pair and a similarity score from 0 to 5: the higher the score, the more similar the sentence pair, with a score of 0 denoting semantic dissimilarity. The dataset is divided into a training set with 5231 records, a validation set with 1458 records, and a test set with 1361 records.
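Note that STS-B's gold labels are graded scores while the matching decision here is binary; a thresholding step such as the sketch below is a plausible conversion, though the patent does not state which cut-off was used (3.0 is purely an assumption).

```python
# Hypothetical conversion of STS-B 0-5 similarity scores into binary
# match labels; the 3.0 threshold is an assumption, not from the patent.
def to_binary(score: float, threshold: float = 3.0) -> int:
    return 1 if score >= threshold else 0

pairs = [("A plane is taking off.", "An air plane is taking off.", 5.0),
         ("A man is playing a flute.", "A man is eating.", 0.4)]
for s1, s2, gold in pairs:
    print(to_binary(gold), "|", s1, "|", s2)
```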
In addition, for a more intuitive comparison, comparative experiments were run with several mainstream models for the text matching task: Siamese-CNN, Siamese-LSTM, ABCNN, and BERT. The experimental results of the different models on the STS-B dataset are shown below:
model name Model accuracy
Siamese-CNN 60.21
Siamese-LSTM 64.52
ABCNN 66.80
BERT 75.52
The method provided by the invention 83.96
From the experimental results it can be seen that applying the twin neural network structure to the semantic similarity field effectively improves model performance. The method judges similarity from the semantic features of the two texts (sentences A and B) and, from their entity features, judges whether the texts describe the same person, thing, or phenomenon. This makes the similarity judgment of text data more accurate, improves the matching accuracy of the public opinion text system, reduces the manpower and time spent on manual judgment, and improves the efficiency of public opinion text analysis.
Example III
In another aspect, the present invention further provides a computer readable storage medium, where the computer readable storage medium includes a stored program, where the program executes the above-mentioned method for matching public opinion text based on a twinning structure.
Example IV
The invention also provides an electronic device comprising a memory and a processor, the memory having stored therein a computer program, the processor being arranged to perform the method of twin structure based public opinion text matching by the computer program.

Claims (11)

1. A public opinion text matching system based on a twin structure, characterized by comprising:
a twin neural network module: used for constructing the coding layer of a twin neural network, extracting the named entity information in sentence pairs, and computing similarity over the extracted named entities to obtain a first similarity characterization vector between the named entities;
a semantic interaction module: used for obtaining a second similarity characterization vector of the sentence pair at the semantic level;
a fusion module: used for splicing the first and second similarity characterization vectors to obtain the final similarity characterization vector of the sentence pair;
and a matching module: used for passing the final similarity characterization vector through a softmax classification function to obtain the text matching result.
2. The system of claim 1, wherein the twin neural network module constructs the coding layer of the twin neural network with the BERT+CRF method, the coding layer comprising a coupled three-layer architecture built from two identical or similar neural networks: an input layer, a feature extraction layer, and a similarity measurement layer, wherein the input layer receives the sentence pair to be matched, the feature extraction layer embeds the input sentence-pair samples into a high-dimensional space to obtain the characterization vectors of the two samples, and the similarity measurement layer computes the similarity of the two extracted characterization vectors with a mathematical formula to obtain the first similarity characterization vector of the sentence pair.
3. The system of claim 2, wherein the BERT model of the twin neural network module further comprises a masked language model task unit (the masked language model task of the BERT layer obtains the word-level text features of the sentences in the input sentence pair): part of the characters in the input layer are randomly masked during training, and the remaining unmasked characters are used to predict the masked ones, so that the model fully learns the word-level text features of the input sentence; the feature vectors output by the BERT layer are then fed to the CRF layer;

it also comprises a next sentence prediction task unit, used to make the model judge whether the A sentence and the B sentence of an input pair are contextually related, so that the model learns the relationship between two texts and handles the sentence-level problem; the feature vectors output by the BERT layer are fed to the CRF layer;

the CRF layer corrects the output of the BERT layer by learning the transition probabilities between the tags in the dataset;

the training set, i.e., the sentence pairs, labels entities with the BIO method: B (begin) marks a character at the beginning of an entity, I (inside) marks a character inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types serve as the entity labels of the training set: B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O;

it also comprises a unit for obtaining the part-of-speech state characterization vectors: before the sentence pair is fed into the twin neural network, the [CLS] identifier must be added to the head of each sentence, yielding the A sentence vector T_A and the B sentence vector T_B of the A, B sentence pair; T_A and T_B are sent into BERT for fine-tuning, the BERT layer's encoding introduces context information for the character at every position in the sentence, producing the part-of-speech state characterization vectors, and the entire BERT output serves as the input of the CRF layer.
4. The system of claim 2 or 3, wherein the semantic interaction module, particularly based on BERT, employs a following predictive task to learn sentence relationship features between texts, comprising an encoding layer of the interaction module, a pooling layer of the interaction module, and a normalization layer of the interaction module, wherein the encoding layer of the interaction module adds [ CLS ] to the head of a sentence before sending the sentence pair into BERT]Identifier and insert [ SEP ] between two sentences]Splitting the identifier, and splitting the spliced sentencesSending into BERT model for fine tuning, outputting +.>I.e., vectorized representation of sentence pairs;
the pooling layer of the interaction module takes the sentence vector obtained through BERT and extracts its important features through pooling to reduce the dimension;
in the normalization layer of the interaction module, the output of applying layer normalization to the sentence vector is the second similarity characterization vector of the sentence pair acquired by the interaction module.
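A sketch of the interaction module's three layers under the same assumptions (`bert-base-chinese`, mean pooling); note that the two-argument tokenizer call produces exactly the [CLS] A [SEP] B [SEP] splicing described above.

```python
# Sketch of claim 4's interaction module: joint encoding of the spliced
# sentence pair, a pooling layer, then layer normalization. The checkpoint
# and the mean-pooling choice are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
layer_norm = nn.LayerNorm(bert.config.hidden_size)

def interact(sent_a: str, sent_b: str) -> torch.Tensor:
    # Encoding layer: the tokenizer adds [CLS] and the [SEP] split markers.
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # vectorized sentence pair
    pooled = hidden.mean(dim=1)                    # pooling layer: keep important features
    return layer_norm(pooled)                      # second similarity characterization vector
```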
5. The system of public opinion text matching based on a twin structure as defined in claim 1, characterized in that, in the matching module, the specific SoftMax classification function is as follows, its value being the probability that the sample vector $x$ belongs to the $j$-th class, where $W$ is a weight coefficient matrix and $k$ is the number of classes:

$$P(y = j \mid x) = \frac{e^{W_j x}}{\sum_{i=1}^{k} e^{W_i x}}$$
the final similarity characterization vector +.>Is input into the softmax function,wherein->For the output of the twin neural network module, < >>For the output of the interaction module, +.>X is the softmax function described above; the end result obtained->At [0,1]In the interval, assuming that the text similarity threshold is set to 0.5, then when +.>And if the two texts are not matched, the two texts are considered to be matched, otherwise, the two texts are not matched.
6. A method of public opinion text matching based on a twin structure, applied to the system of public opinion text matching based on a twin structure as claimed in any one of claims 1 to 5, characterized by comprising the following steps:
Constructing a coding layer of the twin neural network, thereby extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities;
obtaining a second similarity characterization vector of the sentence pairs in terms of semantics;
splicing the first similarity characterization vector and the second similarity characterization vector to obtain a final similarity characterization vector of the sentence pair;
and passing the final similarity characterization vector through a SoftMax classification function to obtain the text matching result.
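Read as a pipeline, the four steps compose as follows; this flow sketch reuses the hypothetical `first_similarity`, `interact`, and `match` helpers from the earlier sketches and is an illustration of the step order, not the patent's reference implementation.

```python
# Hypothetical composition of the four method steps of claim 6, reusing the
# helpers sketched earlier (first_similarity, interact, match).
def match_public_opinion_texts(sent_a: str, sent_b: str) -> bool:
    t1 = first_similarity(sent_a, sent_b)  # step 1: first similarity characterization vector
    t2 = interact(sent_a, sent_b)          # step 2: second similarity characterization vector
    return match(t1, t2)                   # steps 3-4: splice, then SoftMax classification
```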
7. The method of public opinion text matching based on a twin structure according to claim 6, wherein the coding layer of the twin neural network is constructed using the BERT+CRF method and comprises two coupled three-layer architectures built from the same or similar neural networks, namely an input layer, a feature extraction layer and a similarity measurement layer, wherein the input layer receives the sentence pair to be matched, the feature extraction layer embeds the input sentence pair into a high-dimensional space to obtain vector representations of the sentence pair, and the similarity measurement layer performs a similarity calculation on the two extracted sample vectors through a mathematical formula to obtain the first similarity characterization vector of the sentence pair.
8. The method of claim 7, wherein constructing the coding layer of the twin neural network to extract named entity information in sentence pairs, and performing similarity calculation on the extracted named entities to obtain the first similarity characterization vector between the named entities, further comprises:
the mask language model task of the BERT layer is adopted to acquire text features of word levels in input sentences and sentences, and then the feature vectors output by the BERT layer are input to the CRF layer;
the CRF layer corrects the output of the BERT layer by learning the transition probabilities between the tags in the data set;
the training set, namely the sentence pairs, marks entities using the BIO method: B indicates that the character is at the beginning of an entity, I indicates that the character is at an internal position of an entity, and O indicates a non-entity character outside any entity of interest; for public opinion texts, the person names (PER), place names (GEO) and organizations (ORG) in the text are the most important to attend to, so 7 label types, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG and O, are used as the entity labels of the training set (a worked example follows this claim);
before the sentence pair is sent into the twin neural network, a [CLS] identifier must be added to the head of each sentence, yielding the A-sentence vector and the B-sentence vector of the sentence pair (A, B); the two vectors are sent into BERT for fine-tuning, the encoding of the BERT layer introduces context information into the character at each position in the sentence, thereby obtaining the part-of-speech states used for vector characterization, and all outputs of BERT are taken as the input of the CRF layer.
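For concreteness, a worked example of the 7-tag BIO scheme on a hypothetical public-opinion sentence; the sentence, its segmentation, and its tags are made up for illustration and are not from the patent's training data.

```python
# Hypothetical BIO labelling of one training sentence under the 7-tag scheme
# of claim 8; sentence and tags are illustrative only.
chars = list("张三在武汉大学发表演讲")       # "Zhang San gave a speech at Wuhan University"
tags = ["B-PER", "I-PER", "O",               # 张三 -> person name (PER)
        "B-ORG", "I-ORG", "I-ORG", "I-ORG",  # 武汉大学 -> organization (ORG)
        "O", "O", "O", "O"]                  # remaining non-entity characters
assert len(chars) == len(tags)               # one tag per character
for ch, tag in zip(chars, tags):
    print(ch, tag)
```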
9. The method for matching public opinion text based on twin structure of claim 6, wherein the obtaining a second similarity characterization vector of sentence pairs in terms of semantics specifically comprises:
specifically, based on BERT, a next-sentence prediction task is adopted to learn sentence-relationship features between texts;
before the sentence pair is sent into BERT, a [CLS] identifier is added to the head of the sentence and a [SEP] identifier is inserted between the two sentences to split them; the spliced sentence is then sent into the BERT model for fine-tuning, and its output is the vectorized representation of the sentence pair;
the sentence vector obtained through BERT has its important features extracted through a pooling layer to reduce the dimension;
the output of applying layer normalization to the sentence vector is the second similarity characterization vector of the sentence pair acquired by the interaction module.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program, when run, performs the method of public opinion text matching based on a twin structure as defined in any one of claims 6 to 9.
11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to perform the method of twin structure based public opinion text matching of any of claims 6 to 9 by means of the computer program.
CN202310761055.3A 2023-06-27 2023-06-27 Public opinion text matching system and method based on twin structure Active CN116522165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310761055.3A CN116522165B (en) 2023-06-27 2023-06-27 Public opinion text matching system and method based on twin structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310761055.3A CN116522165B (en) 2023-06-27 2023-06-27 Public opinion text matching system and method based on twin structure

Publications (2)

Publication Number Publication Date
CN116522165A true CN116522165A (en) 2023-08-01
CN116522165B CN116522165B (en) 2024-04-02

Family

ID=87408580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310761055.3A Active CN116522165B (en) 2023-06-27 2023-06-27 Public opinion text matching system and method based on twin structure

Country Status (1)

Country Link
CN (1) CN116522165B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160042061A1 (en) * 2014-08-07 2016-02-11 Accenture Global Services Limited Providing contextual information associated with a source document using information from external reference documents
CN111259127A (en) * 2020-01-15 2020-06-09 浙江大学 Long text answer selection method based on transfer learning sentence vector
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text
US20220198146A1 (en) * 2020-12-17 2022-06-23 Jpmorgan Chase Bank, N.A. System and method for end-to-end neural entity linking
CN113673225A (en) * 2021-08-20 2021-11-19 中国人民解放军国防科技大学 Method and device for judging similarity of Chinese sentences, computer equipment and storage medium
CN114386421A (en) * 2022-01-13 2022-04-22 平安科技(深圳)有限公司 Similar news detection method and device, computer equipment and storage medium
CN114329225A (en) * 2022-01-24 2022-04-12 平安国际智慧城市科技股份有限公司 Search method, device, equipment and storage medium based on search statement
CN114579731A (en) * 2022-02-28 2022-06-03 江苏至信信用评估咨询有限公司 Network information topic detection method, system and device based on multi-feature fusion
CN114896397A (en) * 2022-04-29 2022-08-12 中航华东光电(上海)有限公司 Empty pipe instruction repeating inspection method based on BERT-CRF word vector model
CN115292447A (en) * 2022-07-14 2022-11-04 昆明理工大学 News matching method fusing theme and entity knowledge
CN115408494A (en) * 2022-07-25 2022-11-29 中国科学院深圳先进技术研究院 Text matching method integrating multi-head attention alignment
CN115374778A (en) * 2022-08-08 2022-11-22 北京工商大学 Cosmetic public opinion text entity relation extraction method based on deep learning
CN115687939A (en) * 2022-09-02 2023-02-03 重庆大学 Mask text matching method and medium based on multi-task learning
CN115630632A (en) * 2022-09-29 2023-01-20 北京蜜度信息技术有限公司 Method, system, medium and terminal for correcting personal name in specific field based on context semantics
CN115470871A (en) * 2022-11-02 2022-12-13 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model
CN115712713A (en) * 2022-11-23 2023-02-24 桂林电子科技大学 Text matching method, device and system and storage medium
CN115759104A (en) * 2023-01-09 2023-03-07 山东大学 Financial field public opinion analysis method and system based on entity recognition
CN116304745A (en) * 2023-03-27 2023-06-23 济南大学 Text topic matching method and system based on deep semantic information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LEONIDAS TSEKOURAS et al.: "A Graph-based Text Similarity Measure That Employs Named Entity Information", Proceedings of Recent Advances in Natural Language Processing, pages 765-771 *
向军毅 et al.: "A BERTCA-based Model for Computing the Semantic Relevance between News Entities and Body Text", Proceedings of the 19th China National Conference on Computational Linguistics, pages 288-300 *
谢腾; 杨俊安; 刘辉: "Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model", Computer Systems & Applications, no. 07, 15 July 2020 (2020-07-15), pages 52-59 *
陈剑; 何涛; 闻英友; 马林涛: "Entity Recognition Method for Judicial Documents Based on the BERT Model", Journal of Northeastern University (Natural Science), no. 10, 15 October 2020 (2020-10-15), pages 16-21 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194614A (en) * 2023-11-02 2023-12-08 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium
CN117194614B (en) * 2023-11-02 2024-01-30 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium

Also Published As

Publication number Publication date
CN116522165B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN110929030B (en) Text abstract and emotion classification combined training method
CN114064918B (en) Multi-modal event knowledge graph construction method
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN110737763A (en) Chinese intelligent question-answering system and method integrating knowledge map and deep learning
CN110765775A (en) Self-adaptive method for named entity recognition field fusing semantics and label differences
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111274804A (en) Case information extraction method based on named entity recognition
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN117151220A (en) Industry knowledge base system and method based on entity link and relation extraction
CN114648016A (en) Event argument extraction method based on event element interaction and tag semantic enhancement
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
CN114742069A (en) Code similarity detection method and device
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant