CN112966073B - Short text matching method based on semantics and shallow features


Info

Publication number: CN112966073B
Authority: CN (China)
Prior art keywords: vector, text, feature, representing, feature vector
Legal status: Active (granted)
Application number: CN202110373418.7A
Other languages: Chinese (zh)
Other versions: CN112966073A
Inventors: 杨洁, 余卫宇
Current assignee: South China University of Technology (SCUT)
Original assignee: South China University of Technology (SCUT)
Application filed by: South China University of Technology (SCUT)
Priority and filing date: 2021-04-07
Publication of CN112966073A: 2021-06-15
Publication (grant) of CN112966073B: 2023-01-06

Classifications

    • G06F16/3347 Query execution using vector based model
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3344 Query execution using natural language analysis
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions


Abstract

The invention discloses a short text matching method based on semantics and shallow features, and relates to the technical field of text matching. The method comprises the following steps: reading and preprocessing the first text and the second text to obtain word information; mapping the word information into word feature vectors by using a word2vec model; extracting sentence-coding features and normalizing them to obtain statistical feature vectors; inputting the character feature vectors and the statistical feature vectors into an interactive feature learner and a statistical feature learner respectively to obtain the decoding vectors u_s and r_s; and splicing the output of the interactive feature learner with the output of the statistical feature learner and inputting the spliced result into an MLP layer for prediction; if the output result is 1, the first text and the second text are matched successfully. The invention further refines the vector information with a multilayer perceptron and achieves excellent text matching performance.

Description

Short text matching method based on semantics and shallow features
Technical Field
The invention relates to the technical field of text matching, in particular to a short text matching method based on semantics and shallow features.
Background
For retrieval tasks, it is important to retrieve content with high semantic relevance. Short text matching judges similarity by matching the contents of two short texts and has important application value in every retrieval task. Traditional short text matching models are limited by the sparse semantics, scarce feature information and small training corpora of short texts, which restricts the industrial application of traditional short text matching methods. Meanwhile, the two short texts may differ greatly in length, and synonyms, aliases and the like may fail to align, further limiting short text matching accuracy. Obtaining richer semantic feature representations, reducing the negative influence of large length differences on matching, and solving the alignment of synonyms, aliases and the like in short texts are therefore the key technical points.
Disclosure of Invention
In view of this, the present invention designs a feature extractor, an interactive feature learner and a statistical feature learner. These modules respectively perform deep coding on the short texts and on their statistical features, learn the feature representations generated by the deep coding to obtain the corresponding short text deep representation vectors, splice the corresponding representation vectors, and finally refine the spliced representation with a multilayer perceptron, thereby obtaining excellent performance. On this basis, the invention provides a short text matching method based on semantics and shallow features.
In order to achieve the purpose, the invention adopts the following technical scheme:
a short text matching method based on semantics and shallow features comprises the following steps:
reading and preprocessing the first text and the second text to obtain word information;
mapping the word information into word feature vectors by using a word2vec model;
extracting sentence-coding features and normalizing them to obtain statistical feature vectors;
obtaining the decoding vector u_s corresponding to the character feature vectors by using BiLSTM and attention; updating the statistical feature vectors with a multi-head attention structure to obtain the decoding vector r_s;
splicing the decoding vector u_s and the decoding vector r_s and predicting on the spliced result; if the output result is 1, the first text and the second text are matched successfully. A minimal end-to-end sketch of these steps is given after this list.
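As an illustration only, the following is a minimal end-to-end sketch of how these steps compose, assuming PyTorch. The BiLSTM encoder and the linear statistical projection below are simple stand-ins for the interactive and statistical feature learners detailed later, and every class name, size and pooling choice here is ours rather than the patent's:

```python
import torch
import torch.nn as nn

class ShortTextMatcher(nn.Module):
    # Stand-in pipeline: word vectors -> BiLSTM (for u_s), statistical
    # vectors -> linear + ReLU (for r_s), then splice and predict with an MLP.
    def __init__(self, emb_dim=300, stat_dim=100, hid=128):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hid, bidirectional=True, batch_first=True)
        self.stat_proj = nn.Linear(2 * stat_dim, 2 * hid)
        self.mlp = nn.Sequential(nn.Linear(6 * hid, hid), nn.ReLU(), nn.Linear(hid, 2))

    def forward(self, w1, w2, s1, s2):
        u1, _ = self.encoder(w1)                        # encode text 1 word vectors
        u2, _ = self.encoder(w2)                        # encode text 2 word vectors
        u_s = torch.cat([u1.mean(1), u2.mean(1)], -1)   # stand-in for decoding vector u_s
        r_s = torch.relu(self.stat_proj(torch.cat([s1, s2], -1)))  # stand-in for r_s
        o = self.mlp(torch.cat([u_s, r_s], -1))         # splice and predict
        return o.softmax(-1).argmax(-1)                 # output 1 means a successful match

# toy batch: two pairs, texts of length 7 and 9, 100-dimensional statistical vectors
m = ShortTextMatcher()
print(m(torch.randn(2, 7, 300), torch.randn(2, 9, 300),
        torch.randn(2, 100), torch.randn(2, 100)))
```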
Preferably, the word information includes a character sequence and a word sequence.
Preferably, the sentence-coding features include distance features, text features and co-occurrence features.
Preferably, the decoding vector u_s is obtained as follows:

the character feature vectors are input into a BiLSTM layer for independent encoder coding, a special vector is added behind each encoded vector, and the special vectors are set automatically according to actual conditions, giving:

ĥ_b = [LSTM(f_b); v_b];
ĥ_d = [LSTM(f_d); v_d];

wherein LSTM(f_b) is the result of inputting the character feature vector of the first text into the BiLSTM layer for independent encoder coding; LSTM(f_d) is the result of inputting the character feature vector of the second text into the BiLSTM layer for independent encoder coding; v_b is the special vector corresponding to the first text; v_d is the special vector corresponding to the second text; f_b is the word feature vector of the first text and f_d is the word feature vector of the second text. ĥ_b is input into a nonlinear activation network to obtain the hidden vector matrix h_b, and ĥ_d is input into the nonlinear activation network to obtain the hidden vector matrix h_d:

h_b = σ(W ĥ_b + b);
h_d = σ(W ĥ_d + b);

where σ denotes the nonlinear activation. The correlation matrix s of the hidden vector matrix h_d and the hidden vector matrix h_b is computed as:

s = (h_b)^T h_d ∈ R^{(b+1)×(d+1)};

the mutual attention scores A_b and A_d are computed from the correlation matrix s with the softmax function, and the feature vectors c_b and c_d are computed as:

c_b = h_d A_b ∈ R^{l×(n+1)};
c_d = [h_b; c_b] A_d ∈ R^{2l×(m+1)};

the hidden vector matrix h_b is spliced with the feature vector c_b, the hidden vector matrix h_d is spliced with the feature vector matrix c_d, and the two resulting vectors are spliced to obtain w_t; w_t, together with u_{t-1} and u_{t+1}, is fed into a BiLSTM layer to obtain the current final hidden vector u_t, where u_{t-1} and u_{t+1} are the final hidden vectors at the previous and the next time step; the hidden vector u is input into a BiLSTM layer to obtain the corresponding decoding vector u_s.
Preferably, the decoding vector r_s is obtained as follows:

a multi-head attention mechanism is adopted, in which each attention head is computed as:

e_b, e_d = Linear(f_b, f_d);
â = e_b · e_d;
a = softmax(â);

wherein f_b denotes the statistical feature vector representation of the first text, f_d denotes the statistical feature vector representation of the second text, e_b and e_d denote the projection vectors obtained by projecting f_b and f_d onto different planes, â denotes the value obtained after dot-multiplying the projection vectors, and a denotes the computed weight scores corresponding to the first text and the second text;

the corresponding statistical feature representation vectors are updated with the computed feature weights, the calculation formulas being:

r_i^h = a^h e^h;
r_i = ReLU(W (r_i^1 ⊕ r_i^2 ⊕ … ⊕ r_i^H));

wherein r_i^h denotes the projection vector obtained by projecting onto the h-th plane; r_i denotes the new statistical feature representation vector obtained after feature i is updated; ⊕ denotes the join operation; ReLU denotes the nonlinear activation function;

finally, the difference and the product of the updated statistical feature representation vectors are computed and spliced, the calculation process being:

r^- = |r_i − r_j|;
r^∘ = r_i ⊙ r_j;
r_s = [r^-; r^∘].
preferably, the output vectors of the two modules are spliced, and then the MLP layer is used for prediction, and the calculation process is as follows:
o = softmax(MLP([u_s, r_s])).
compared with the prior art, the invention discloses a short text matching method based on semantics and shallow features, and has the following beneficial effects:
1. Feature extractor: statistical feature vector representations are constructed, and three types of features (sentence-coding distance features, text features and co-occurrence features) are extracted from the text and normalized, enriching the features available for short text matching.
2. Feature extractor: character sequence features are extracted from the first text and the second text while complete digit and English-word features are extracted at the same time, which avoids the semantic loss caused by splitting digits and English words and makes short text matching prediction more accurate.
3. Feature learning: the co-attention model of the text feature learner better learns the interactive features between short texts, yields better hidden-state representation vectors, and learns more abstract and more robust short text features.
4. Feature learning: the multi-head attention mechanism of the statistical feature learner captures rich shallow semantic information, so the learned features are more robust.
5. Feature learning: the statistical feature learner computes the difference and the product separately, which effectively learns both the differences and the commonalities between texts and enables more accurate prediction of the relationship between the short texts.
6. Prediction: the features learned by the text feature learner and the statistical feature learner are spliced and predicted with a multilayer perceptron and an activation function, outputting different labels and improving matching accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a short text matching method based on semantics and shallow features which, as shown in FIG. 1, comprises the following steps:
reading and preprocessing the first text and the second text to obtain word information;
mapping the word information into word feature vectors by using a word2vec model;
extracting sentence-coding features and normalizing them to obtain statistical feature vectors;
obtaining the decoding vector u_s corresponding to the character feature vectors by using BiLSTM and attention; updating the statistical feature vectors with a multi-head attention structure to obtain the decoding vector r_s;
splicing the decoding vector u_s and the decoding vector r_s and predicting on the spliced result; if the output result is 1, the first text and the second text are matched successfully.
The present embodiment includes the following steps:
feature extractor
Step 1: the short text 1 to be matched and the corresponding short text 2 to be matched are input and preprocessed; the preprocessing includes special-symbol handling, conversion between upper- and lower-case English letters, and unification of simplified and traditional Chinese characters. Each text is divided into a character sequence and a word sequence, and the word information is mapped into corresponding word vectors (300-dimensional) with a word2vec model. Three types of features are then extracted: sentence-coding distance features, text features and co-occurrence features. The sentence-coding distance features include cosine similarity, Euclidean similarity, Hamming distance, TF/IDF, word2vec and edit distance; the text features include the number of characters in the text, the number of words after segmentation, the number of characters after stop-word removal, the number of words after segmentation and stop-word removal, the number of distinct words, the ratio of distinct words, the number of punctuation marks, the ratio of punctuation marks, the number of POS parts of speech and the ratio of POS parts of speech; the co-occurrence features include 1-gram, 2-gram and 3-gram co-occurrence and the corresponding stop-word-removed 1-gram, 2-gram and 3-gram co-occurrence. The extracted features are normalized, yielding the corresponding 300-dimensional word-oriented feature vector representation and 100-dimensional statistical feature vector representation. A sketch of this feature extraction is given below.
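As an illustration, here is a minimal sketch of this statistical feature extraction in plain Python. It covers a representative subset of the distance, text and co-occurrence features listed above; the helper names and the exact feature subset are our own choices, not the patent's, and the final normalization is only indicated by a comment:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # set of n-grams of a token sequence (used for the co-occurrence features)
    return set(zip(*(tokens[i:] for i in range(n))))

def cosine(a, b):
    # cosine similarity between bag-of-words vectors (a distance feature)
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in ca)
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def edit_distance(s, t):
    # Levenshtein distance with a rolling one-row table (a distance feature)
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[len(t)]

def statistical_features(tok1, tok2):
    feats = [
        cosine(tok1, tok2),                       # distance feature
        float(edit_distance(tok1, tok2)),         # distance feature
        float(len(tok1)), float(len(tok2)),       # text features: lengths
        len(set(tok1)) / max(len(tok1), 1),       # text feature: distinct-token ratio
        len(set(tok2)) / max(len(tok2), 1),
    ]
    for n in (1, 2, 3):                           # co-occurrence: n-gram Jaccard overlap
        g1, g2 = ngrams(tok1, n), ngrams(tok2, n)
        feats.append(len(g1 & g2) / max(len(g1 | g2), 1))
    # a min-max or z-score normalization over the corpus would follow here
    return feats

print(statistical_features(list("如何申请退货"), list("我想退货怎么办")))
```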
Step 2: from step 1, the word feature vectors of the text 1 to be matched (denoted w1, w2, w3, ..., wn) and its statistical feature vector (denoted s1, s2, s3, ..., sn), and the word feature vectors of the text 2 to be matched (denoted q1, q2, q3, ..., qm) and its statistical feature vector (denoted x1, x2, x3, ..., xn) are obtained. The interactive feature learner takes as input the feature vector matrices of the text 1 to be matched and the text 2 to be matched, of dimensions n × 300 and m × 300 respectively, and the statistical feature learner takes as input the 100-dimensional statistical feature vectors of the text 1 to be matched and the text 2 to be matched.
Interactive feature learner and statistical feature learner
Step 1: the interactive feature learner takes attention and BiLSTM as its basis and improves on them. First, the word feature vector matrices corresponding to the text 1 to be matched and the text 2 to be matched are input into a BiLSTM layer for independent encoder coding, the calculation being:
h_b = LSTM(f_b);
h_d = LSTM(f_d);

and a special vector is added after each encoded vector as an identification symbol, the special vector being set according to the actual situation, giving:

ĥ_b = [LSTM(f_b); v_b];
ĥ_d = [LSTM(f_d); v_d];

wherein LSTM(f_b) is the result of inputting the word feature vector of the first text into the BiLSTM layer for independent encoder coding; LSTM(f_d) is the result of inputting the word feature vector of the second text into the BiLSTM layer for independent encoder coding; v_b is the special vector corresponding to the first text; v_d is the special vector corresponding to the second text; f_b is the word feature vector of the first text and f_d is the word feature vector of the second text. ĥ_b is input into a nonlinear activation network to obtain the hidden vector matrix h_b, and ĥ_d is input into the nonlinear activation network to obtain the hidden vector matrix h_d:

h_b = σ(W ĥ_b + b);
h_d = σ(W ĥ_d + b);

where σ denotes the nonlinear activation.
The correlation matrix s of the hidden vector matrix h_d and the hidden vector matrix h_b is computed as:

s = (h_b)^T h_d ∈ R^{(b+1)×(d+1)};

the mutual attention scores A_b and A_d are computed from the correlation matrix s with the softmax function, the calculation process being:

A_b = softmax(s);
A_d = softmax(s^T);

and the feature vectors c_b and c_d are computed as:

c_b = h_d A_b ∈ R^{l×(n+1)};
c_d = [h_b; c_b] A_d ∈ R^{2l×(m+1)}.

The hidden vector matrix h_b is spliced with the feature vector c_b, the hidden vector matrix h_d is spliced with the feature vector matrix c_d, and the two resulting vectors are spliced to obtain w_t; w_t, together with u_{t-1} and u_{t+1}, is fed into a BiLSTM layer to obtain the current final hidden vector u_t, where u_{t-1} and u_{t+1} are the final hidden vectors at the previous and the next time step. The hidden vector u is then input into a BiLSTM layer to obtain the corresponding decoding vector u_s. The specific formulas are as follows:

w_t = [[h_b; c_b]_t; [h_d; c_d]_t];
u_t = BiLSTM(u_{t-1}, w_t, u_{t+1});
u = [u_1, u_2, u_3, ..., u_n] ∈ R^{2l×n}.

A sketch of this learner is given below.
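The following is a minimal PyTorch sketch of this interactive feature learner. The sentinel (special) vector, the two width-matching projections used to reconcile the spliced widths of [h_b; c_b] and [h_d; c_d] before the fusion BiLSTM, and the mean-pooling of u into u_s are our assumptions where the text leaves the details open; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveFeatureLearner(nn.Module):
    def __init__(self, emb_dim=300, l=128):
        super().__init__()
        self.enc = nn.LSTM(emb_dim, l, bidirectional=True, batch_first=True)
        self.act = nn.Sequential(nn.Linear(2 * l, 2 * l), nn.Tanh())  # nonlinear activation network
        self.sentinel = nn.Parameter(torch.randn(1, 1, 2 * l))        # special vector (assumption)
        self.proj_b = nn.Linear(4 * l, 2 * l)   # reconcile the width of [h_b; c_b] (assumption)
        self.proj_d = nn.Linear(6 * l, 2 * l)   # reconcile the width of [h_d; c_d] (assumption)
        self.fuse = nn.LSTM(2 * l, l, bidirectional=True, batch_first=True)

    def encode(self, x):
        h, _ = self.enc(x)                                  # independent encoder coding
        sent = self.sentinel.expand(x.size(0), -1, -1)
        return self.act(torch.cat([h, sent], dim=1))        # append special vector, then activate

    def forward(self, w_b, w_d):
        h_b, h_d = self.encode(w_b), self.encode(w_d)       # (B, n+1, 2l), (B, m+1, 2l)
        s = torch.bmm(h_b, h_d.transpose(1, 2))             # correlation matrix s = (h_b)^T h_d
        A_b, A_d = F.softmax(s, 2), F.softmax(s.transpose(1, 2), 2)
        c_b = torch.bmm(A_b, h_d)                           # c_b = h_d A_b
        c_d = torch.bmm(A_d, torch.cat([h_b, c_b], 2))      # c_d = [h_b; c_b] A_d
        bc = self.proj_b(torch.cat([h_b, c_b], 2))          # splice h_b with c_b
        dc = self.proj_d(torch.cat([h_d, c_d], 2))          # splice h_d with c_d
        u, _ = self.fuse(torch.cat([bc, dc], 1))            # u_t = BiLSTM(u_{t-1}, w_t, u_{t+1})
        return u.mean(dim=1)                                # pooled decoding vector u_s

u_s = InteractiveFeatureLearner()(torch.randn(2, 7, 300), torch.randn(2, 9, 300))
print(u_s.shape)  # torch.Size([2, 256])
```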
the second step: for the statistical feature learner, a structure of a multi-head attention mechanism is adopted, and improvement is made on the basis of the multi-head attention mechanism. The calculation formula of our attention head is as follows:
e_b, e_d = Linear(f_b, f_d);
â = e_b · e_d;
a = softmax(â);

wherein f_b denotes the statistical feature vector representation of the short text 1 to be matched, f_d denotes the statistical feature vector representation of the short text 2 to be matched, e_b and e_d denote the projection vectors obtained by projecting f_b and f_d onto different planes, â denotes the value obtained after dot-multiplying the projection vectors, and a denotes the computed weight scores corresponding to the short text 1 to be matched and the short text 2 to be matched.

Next, the corresponding statistical feature representation vectors are updated with the computed feature weights, the calculation formulas being:

r_i^h = a^h e^h;
r_i = ReLU(W (r_i^1 ⊕ r_i^2 ⊕ … ⊕ r_i^H));

wherein r_i^h denotes the projection vector obtained by projecting onto the h-th plane; r_i denotes the new statistical feature representation vector obtained after feature i is updated; ⊕ denotes the join operation; ReLU denotes the nonlinear activation function.

Finally, the difference and the product of the updated statistical feature representation vectors are computed and spliced, the calculation process being:

r^- = |r_i − r_j|;
r^∘ = r_i ⊙ r_j;
r_s = [r^-; r^∘].

A sketch of this learner follows.
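A minimal PyTorch sketch of this statistical feature learner follows. The head count and projection size are ours, and because a softmax here would act over only the two texts, the sketch substitutes a complementary sigmoid weighting for the pair; treat it as one plausible reading rather than the patent's exact formula:

```python
import torch
import torch.nn as nn

class StatisticalFeatureLearner(nn.Module):
    def __init__(self, stat_dim=100, heads=4, d=64):
        super().__init__()
        self.heads, self.d = heads, d
        self.line = nn.Linear(stat_dim, heads * d)       # per-head linear projection ("Linear")
        self.out = nn.Linear(heads * d, heads * d)       # join heads, then ReLU

    def project(self, f):
        return self.line(f).view(-1, self.heads, self.d)   # e: (B, H, d)

    def forward(self, f_b, f_d):
        e_b, e_d = self.project(f_b), self.project(f_d)    # projection vectors e_b, e_d
        score = (e_b * e_d).sum(-1) / self.d ** 0.5        # per-head dot product
        a = torch.sigmoid(score).unsqueeze(-1)             # weight scores (our assumption)
        r_i = torch.relu(self.out((a * e_b).flatten(1)))         # updated vector, text 1
        r_j = torch.relu(self.out(((1 - a) * e_d).flatten(1)))   # updated vector, text 2
        diff = (r_i - r_j).abs()                           # r^- = |r_i - r_j|
        prod = r_i * r_j                                   # elementwise product r^o
        return torch.cat([diff, prod], dim=-1)             # spliced decoding vector r_s

r_s = StatisticalFeatureLearner()(torch.randn(2, 100), torch.randn(2, 100))
print(r_s.shape)  # torch.Size([2, 512])
```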
and (3) splicing prediction:
splicing the output vectors of the two modules, and then predicting by using an MLP layer, wherein the calculation process is as follows:
o = softmax(MLP([u_s, r_s])).

A sketch of this step follows.
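Finally, a minimal sketch of the splice prediction, with u_s and r_s stubbed by random tensors of the sizes the two sketches above produce; the MLP width is an assumption:

```python
import torch
import torch.nn as nn

# o = softmax(MLP([u_s, r_s])): splice the two decoding vectors and classify
mlp = nn.Sequential(nn.Linear(256 + 512, 128), nn.ReLU(), nn.Linear(128, 2))
u_s, r_s = torch.randn(2, 256), torch.randn(2, 512)    # stand-ins for the learner outputs
o = torch.softmax(mlp(torch.cat([u_s, r_s], dim=-1)), dim=-1)
print(o.argmax(dim=-1))  # label 1 means the two texts match successfully
```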
the embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A short text matching method based on semantics and shallow features is characterized by comprising the following steps:
reading and preprocessing the first text and the second text to obtain word information;

mapping the word information into word feature vectors by using a word2vec model;

extracting sentence-coding features and normalizing them to obtain statistical feature vectors;

obtaining a decoding vector u_s corresponding to the character feature vectors by using BiLSTM and attention; updating the statistical feature vectors with a multi-head attention structure to obtain a decoding vector r_s;

splicing the decoding vector u_s and the decoding vector r_s and predicting on the spliced result, the first text and the second text being matched successfully if the output result is 1;

wherein the decoding vector u_s is obtained as follows:

the character feature vectors are input into a BiLSTM layer for independent encoder coding, and a special vector is added behind each encoded vector, giving:

ĥ_b = [LSTM(f_b); v_b];
ĥ_d = [LSTM(f_d); v_d];

wherein LSTM(f_b) is the result of inputting the character feature vector of the first text into the BiLSTM layer for independent encoder coding; LSTM(f_d) is the result of inputting the character feature vector of the second text into the BiLSTM layer for independent encoder coding; v_b is the special vector corresponding to the first text; v_d is the special vector corresponding to the second text; f_b is the word feature vector of the first text and f_d is the word feature vector of the second text; ĥ_b is input into a nonlinear activation network to obtain a hidden vector matrix h_b, and ĥ_d is input into the nonlinear activation network to obtain a hidden vector matrix h_d:

h_b = σ(W ĥ_b + b);
h_d = σ(W ĥ_d + b);

where σ denotes the nonlinear activation; the correlation matrix s of the hidden vector matrix h_d and the hidden vector matrix h_b is computed as:

s = (h_b)^T h_d ∈ R^{(b+1)×(d+1)};

the mutual attention scores A_b and A_d are computed from the correlation matrix s with the softmax function; the feature vectors c_b and c_d are computed as:

c_b = h_d A_b ∈ R^{l×(n+1)};
c_d = [h_b; c_b] A_d ∈ R^{2l×(m+1)};

the hidden vector matrix h_b is spliced with the feature vector c_b, the hidden vector matrix h_d is spliced with the feature vector matrix c_d, and the two resulting vectors are spliced to obtain w_t; w_t, together with u_{t-1} and u_{t+1}, is fed into a BiLSTM layer to obtain the current final hidden vector u_t, where u_{t-1} and u_{t+1} are the final hidden vectors at the previous and the next time step; the hidden vector u is input into a BiLSTM layer to obtain the corresponding decoding vector u_s.
2. The method of claim 1, wherein the word information comprises a character sequence and a word sequence.
3. The method of claim 1, wherein the sentence-coding features comprise distance features, text features, and co-occurrence features.
4. The short text matching method based on semantics and shallow features of claim 1, wherein the decoding vector r_s is obtained as follows:

a multi-head attention mechanism is adopted, in which each attention head is computed as:

e_b, e_d = Linear(f_b, f_d);
â = e_b · e_d;
a = softmax(â);

wherein f_b denotes the statistical feature vector representation of the first text, f_d denotes the statistical feature vector representation of the second text, e_b and e_d denote the projection vectors obtained by projecting f_b and f_d onto different planes, â denotes the value obtained after dot-multiplying the projection vectors, and a denotes the computed weight scores corresponding to the first text and the second text;

the corresponding statistical feature representation vectors are updated with the computed feature weights, the calculation formulas being:

r_i^h = a^h e^h;
r_i = ReLU(W (r_i^1 ⊕ r_i^2 ⊕ … ⊕ r_i^H));

wherein r_i^h denotes the projection vector obtained by projecting onto the h-th plane; r_i denotes the new statistical feature representation vector obtained after feature i is updated; ⊕ denotes the join operation; ReLU denotes the nonlinear activation function;

the difference and the product of the updated statistical feature representation vectors are computed and then spliced, the calculation process being:

r^- = |r_i − r_j|;
r^∘ = r_i ⊙ r_j;
r_s = [r^-; r^∘].
5. The short text matching method based on semantics and shallow features as claimed in claim 1, wherein the calculation process of the concatenation prediction is as follows:

o = softmax(MLP([u_s, r_s])).
CN202110373418.7A 2021-04-07 2021-04-07 Short text matching method based on semantics and shallow features Active CN112966073B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110373418.7A | 2021-04-07 | 2021-04-07 | Short text matching method based on semantics and shallow features

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110373418.7A | 2021-04-07 | 2021-04-07 | Short text matching method based on semantics and shallow features

Publications (2)

Publication Number | Publication Date
CN112966073A (en) | 2021-06-15
CN112966073B (en) | 2023-01-06

Family

Family ID: 76281435

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110373418.7A (Active, granted as CN112966073B) | Short text matching method based on semantics and shallow features | 2021-04-07 | 2021-04-07

Country Status (1)

Country Link
CN (1) CN112966073B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656547B (en) * 2021-08-17 2023-06-30 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium
CN114282646B (en) * 2021-11-29 2023-08-25 淮阴工学院 Optical power prediction method and system based on two-stage feature extraction and BiLSTM improvement
CN115600580B (en) * 2022-11-29 2023-04-07 深圳智能思创科技有限公司 Text matching method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019220142A (en) * 2018-06-18 2019-12-26 日本電信電話株式会社 Answer learning device, answer learning method, answer generating device, answer generating method, and program
WO2021021330A1 (en) * 2019-07-30 2021-02-04 Intuit Inc. Neural network system for text classification

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7275661B2 (en) * 2019-03-01 2023-05-18 日本電信電話株式会社 Sentence generation device, sentence generation method, sentence generation learning device, sentence generation learning method and program
CN110348016B (en) * 2019-07-15 2022-06-14 昆明理工大学 Text abstract generation method based on sentence correlation attention mechanism
CN110781680B (en) * 2019-10-17 2023-04-18 江南大学 Semantic similarity matching method based on twin network and multi-head attention mechanism
CN112084336A (en) * 2020-09-09 2020-12-15 浙江综合交通大数据中心有限公司 Entity extraction and event classification method and device for expressway emergency


Also Published As

Publication number | Publication date
CN112966073A (en) | 2021-06-15

Similar Documents

Publication Publication Date Title
CN112966073B (en) Short text matching method based on semantics and shallow features
Zhang et al. Deep Neural Networks in Machine Translation: An Overview.
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN109408812A (en) A method of the sequence labelling joint based on attention mechanism extracts entity relationship
CN111611810B (en) Multi-tone word pronunciation disambiguation device and method
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN114943230A (en) Chinese specific field entity linking method fusing common knowledge
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN111767718A (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111274804A (en) Case information extraction method based on named entity recognition
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN112183083A (en) Abstract automatic generation method and device, electronic equipment and storage medium
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN113051922A (en) Triple extraction method and system based on deep learning
CN113326702A (en) Semantic recognition method and device, electronic equipment and storage medium
CN114443813A (en) Intelligent online teaching resource knowledge point concept entity linking method
CN110619119B (en) Intelligent text editing method and device and computer readable storage medium
CN111444720A (en) Named entity recognition method for English text
CN111428518B (en) Low-frequency word translation method and device
CN113326367A (en) Task type dialogue method and system based on end-to-end text generation
CN116484852A (en) Chinese patent entity relationship joint extraction method based on relationship diagram attention network
CN116069924A (en) Text abstract generation method and system integrating global and local semantic features
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN113488196B (en) Drug specification text named entity recognition modeling method
CN113204679B (en) Code query model generation method and computer equipment

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant