CN113743079A - Text similarity calculation method and device based on co-occurrence entity interaction graph - Google Patents

Info

Publication number
CN113743079A
CN113743079A (application number CN202110639430.8A)
Authority
CN
China
Prior art keywords
similarity
text
entity
node
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110639430.8A
Other languages
Chinese (zh)
Inventor
杨鹏
常欣辰
赵翰林
谢亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Huaxun Technology Co ltd
Original Assignee
Zhejiang Huaxun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Huaxun Technology Co ltd filed Critical Zhejiang Huaxun Technology Co ltd
Priority to CN202110639430.8A priority Critical patent/CN113743079A/en
Publication of CN113743079A publication Critical patent/CN113743079A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods


Abstract

The invention provides a text similarity calculation method based on a co-occurrence entity interaction graph. First, key entities in the texts are extracted, and key entities that appear in the same sentence are aggregated into nodes; then the similarity between each sentence in the text and the entity nodes is calculated, each sentence is attached to the node with the highest similarity, and the similarity measurement between nodes is completed to construct the co-occurrence entity interaction graph. On each node, the sentences from the same text are concatenated, and an interaction model generates the feature vector of the node together with a similarity vector based on text entity features. After the feature vector of each node is obtained, a graph convolutional neural network (GCN) performs feature transformation to obtain the final matching vector, so that the features of all nodes are aggregated, and a multilayer perceptron computes the similarity score. The invention converts the similarity calculation of long texts into short-text matching tasks on nodes formed by entities co-occurring in the two texts, thereby effectively improving the accuracy of text similarity calculation.

Description

Text similarity calculation method and device based on co-occurrence entity interaction graph
Technical Field
The invention relates to a text similarity calculation method based on a co-occurrence entity interaction graph, and belongs to the technical field of Internet and artificial intelligence.
Background
Text similarity calculation is an essential technology in common application scenarios such as search engines, automatic question answering, document classification and news recommendation. With the rapid development of the Internet, the amount of online news has grown dramatically, placing higher requirements on the accuracy of text similarity calculation. Common text semantic similarity calculation methods fall into four categories: string-based, statistics-based, deep-learning-based and knowledge-base-based methods. String-based methods ignore the semantic level and, at the present stage, can only serve as a supplement to other methods; statistics-based methods represent text similarity by vectorizing texts and computing the distance between the vectors, lack sufficient semantic information, and waste resources when vectorizing long texts; deep-learning-based methods achieve better performance, but the additional modules they introduce increase the computational cost of the model; knowledge-base-based methods depend on the accuracy and richness of the knowledge base. Existing methods therefore handle long-text similarity poorly and compute text similarity with low accuracy.
The Uniform Content Label (UCL) defined by the national standard Uniform Content Label Format Specification (GB/T 35304-2017) can provide rich textual semantic information, and the UCL Knowledge Space (UCLKS) is built on basic knowledge bases such as Wikipedia and Baidu Baike, supplemented by network news indexed by UCL. UCLKS uses the UCL semantic weights to indicate the emphasized entities of a text, and through these key entities the vector matching of long texts is divided into similarity calculations between individual sentences. On the one hand this supplements the semantic level; on the other hand it avoids the resource waste caused by vectorizing long texts. UCLKS can thus well satisfy the implementation requirements of a knowledge-base-based method.
Traditional news similarity calculation methods obtain the relatedness of texts by computing the distance between their vector representations, but UCL index entities cannot be directly turned into such an embedding: the number of index entities differs from one news item to another, the semantic weights of entities differ across reports, and the number of entities retained after thresholding also differs, so the resulting vectors would have different lengths. There is as yet no ideal scheme for applying UCL to text similarity.
Disclosure of Invention
Based on the background, the invention provides a text similarity calculation method based on a co-occurrence entity interaction graph, which utilizes various statistical characteristics in texts to strengthen semantic basis of similarity calculation between the texts. Aiming at the problems of low calculation accuracy and vector space waste between long texts, the method adopts a divide-and-conquer strategy, completes the splitting of the long texts through co-occurrence entities in the texts, constructs a co-occurrence entity interaction diagram, and generates vector representation between the split short texts through an interaction model. And finally, inputting a pair of feature vectors of the nodes in the interactive graph into a multilayer graph convolution neural network to obtain matching vectors of two texts with fixed lengths, and obtaining a final similarity calculation result through a multilayer perceptron.
In order to achieve the above object, the present invention provides a text similarity calculation method based on a co-occurrence entity interaction graph. Firstly, the entities shared by the text pair are screened out with the UCL semantic weight calculation method, the key entities appearing in the same sentence are aggregated into nodes, the sentences of the two texts are assigned to the semantically closest nodes, and the semantic similarity between nodes is used as the edge weight, thereby constructing the co-occurrence entity interaction graph. Then an interaction network encodes the sentence sets from the different texts on each node into a node vector, the UCL entity emphasis coefficient is used to judge the emphasis of the text indexed by the UCL, and several similarity features are combined to supplement the semantic features of the texts. Finally, fixed-length matching vectors of the two texts are obtained through a multilayer graph convolutional neural network, and the final similarity is obtained through a multilayer perceptron.
Specifically, the invention provides the following technical scheme:
a text similarity calculation method based on a co-occurrence entity interaction graph comprises the following steps:
step 1, co-occurrence entity interaction graph construction
Obtaining key entities which commonly appear in matched text pairs by a UCL semantic weight calculation method, and aggregating the key entities appearing in the same sentence to the same node; similarity calculation is carried out on each sentence in the text and the nodes, and the sentences are placed in the nodes with the highest similarity; after sentence grouping is completed, semantic association is formed between nodes by calculating the similarity between the nodes as edge weight, so that the construction of a co-occurrence entity interactive graph is completed;
step 2, node vector generation
Calculating the similarity of sentences from different texts in the node, and splicing the sentences from the same text; firstly, generating hidden feature vectors with fixed sizes at each node, and then generating similarity vectors by utilizing the similarity of entity features among texts;
step 3, node feature aggregation
After the feature vector of each node is obtained, feature conversion is carried out by using a graph convolution neural network GCN, so that a final matching vector is obtained; generating an embedded representation at a node level by encoding information about a neighborhood of nodes; and obtaining a fixed length vector representing the text level similarity by using the average value of the hidden vectors of all the nodes in the last layer obtained by GCN training, and finally calculating the similarity by using a multilayer perceptron.
Preferably, the step 1 specifically includes the following substeps:
substep 1-1, key entity acquisition
Acquiring entities in the text by using a named entity recognition technology, and screening out key entities through entity semantic weight filtering; the entity semantic weight is jointly determined by the appearance frequency of an entity, the position of the entity and the context where the entity is located;
substep 1-2, common entity aggregation
Aggregating key entities in two texts which both exist in the same sentence to a node; a node should contain one or more keywords, and a keyword may also appear in multiple nodes;
substeps 1-3, entity node statement assignment
Similarity calculation is carried out between each sentence in the text and the nodes, and each sentence is placed in the node with the highest similarity; the similarity is the cosine similarity; sentences that do not contain any key entity need not be calculated, and entities that are key in only one of the texts are not considered; let the node vector and the text sentence vector be X and Y respectively, then the cosine similarity is computed as in formula 5:

$$\cos(X,Y)=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^{2}}\,\sqrt{\sum_{i=1}^{n}y_i^{2}}} \qquad \text{(formula 5)}$$

where n is the dimension of the vectors and $x_i$, $y_i$ are the i-th components of the vectors;
substeps 1-4, calculating similarity weight between nodes
After the grouping of sentences is completed, the similarity between nodes is calculated as the edge weight so as to associate nodes with one another; the TF-IDF similarity between the sentence sets of any two nodes is selected as the weight value; the term frequency TF is computed as in formula 6:

$$TF(d,w)=\frac{n_{d,w}}{\sum_{k}n_{d,k}} \qquad \text{(formula 6)}$$

where $n_{d,w}$ is the number of times word w appears in document d and the denominator $\sum_{k}n_{d,k}$ is the total number of words in the document; the inverse document frequency is computed as in formula 7:

$$IDF(d,w)=\log\frac{|D|}{D_w} \qquad \text{(formula 7)}$$

where $D_w$ is the number of documents containing the word w and $|D|$ is the total number of documents; combining formulas 6 and 7 gives the TF-IDF formula 8:

$$TF\text{-}IDF(d,w)=TF(d,w)\times IDF(d,w) \qquad \text{(formula 8)}$$

After the feature-word weights are calculated, the texts are vectorized and the distance between the text vectors is computed; the smaller the distance, the higher the text similarity.
Preferably, the substep 1-1 comprises the following processes:
(1) The frequency of occurrence of each entity in the text is calculated as in formula 1:

$$freq(e_i)=\frac{count(e_i)}{\sum_{j}count(e_j)} \qquad \text{(formula 1)}$$

where $count(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the number of occurrences of all entities; after the word frequencies are calculated, entities with low frequency are filtered out;
(2) The position of an entity is distinguished, and different regions of the text score differently; the position weight is denoted $location(e_i)$ as in formula 2, where P is the number of paragraphs of the text and p is the index of the paragraph in which the current occurrence of entity $e_i$ lies; when the text has no more than two paragraphs, $location(e_i)$ is a fixed value; when the text has more than two paragraphs, entities in the first and last paragraph score the same, and the other paragraphs score one quarter of that value:

$$location(e_i)=\begin{cases}1, & P\le 2 \text{ or } p\in\{1,P\}\\ 1/4, & \text{otherwise}\end{cases} \qquad \text{(formula 2)}$$

(3) A set of central sentences is extracted with the TextRank algorithm and denoted $Sentence=\{s_1,s_2,\dots,s_n\}$, where each $s_t$ is a central sentence; as in formula 3, n is the number of central sentences and $I(e_i\in s_t)$ is an indicator function stating whether entity $e_i$ appears in the central sentence $s_t$:

$$center(e_i)=\frac{1}{n}\sum_{t=1}^{n}I(e_i\in s_t) \qquad \text{(formula 3)}$$

(4) After the three weight components are calculated, they are combined into the entity semantic weight formula for UCL, shown as formula 4:

$$EW(e_i)=Avg(location(e_i))\times\big(\eta\cdot freq(e_i)+(1-\eta)\cdot center(e_i)\big) \qquad \text{(formula 4)}$$

where $\eta$ is a tuning parameter and $Avg(location(e_i))$ is the average position weight of the entity; after $EW(e_i)$ has been computed for all entities, the UCL semantic weight of each entity is obtained through normalization, and only entities whose semantic weight reaches a certain threshold are retained to form the subsequent entity nodes.
Preferably, the step 2 specifically includes the following substeps:
substep 2-1, after the similarity measurement between nodes is completed, the similarity of sentences from different news texts within a node is calculated, so that each node can be represented by a feature vector; the sentences from the same text are concatenated into a text set, giving the sets $S_A(v)$ and $S_B(v)$ for the sentences of the two texts respectively;
Substep 2-2, taking node v as an example, $S_A(v)$ and $S_B(v)$ are used as input to the interaction model, whose weight-sharing embedding layer encodes each set into a context vector; the context layer typically contains one or more bidirectional LSTM layers or CNN layers with max pooling, intended to capture the context information in $S_A(v)$ and $S_B(v)$; let $Vec_A(v)$ and $Vec_B(v)$ denote the context vectors obtained from $S_A(v)$ and $S_B(v)$; the aggregation layer then gives the vector representation $m_{AB}(v)$ of the node, computed by concatenating the element-wise distance of the two context vectors with their Hadamard product, as shown in formula 9:

$$m_{AB}(v)=\big(\,|Vec_A(v)-Vec_B(v)|\;,\;Vec_A(v)\odot Vec_B(v)\,\big) \qquad \text{(formula 9)}$$

where $\odot$ denotes the Hadamard product;
substeps 2-3, the similarity of the entity features between the texts in $S_A(v)$ and $S_B(v)$ is calculated to generate another node vector $m'_{AB}(v)$ for each node; this step obtains three similarity measures; the TF-IDF cosine similarity is computed with formulas 6, 7 and 8;
and a substep 2-4 of calculating the emphasis coefficient, which measures the emphasized entities in the text and remedies the neglect of emphasized entities caused by considering semantic weight alone, as shown in formula 10:

$$EC_{doc}(e_i)=\frac{N_{DOC}}{\sum_{j=1}^{N_{DOC}} I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)+1} \qquad \text{(formula 10)}$$

where $EC_{doc}(e_i)$ is the emphasis coefficient of entity $e_i$ in text doc, $EW_{doc}(e_i)$ is the semantic weight of entity $e_i$ in text doc, computed as in formula 4, $N_{DOC}$ is the total number of news texts in the knowledge space, and $I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)$ is an indicator function counting the number of texts in which $EW_j(e_i)\ge EW_{doc}(e_i)$;
the average emphasis coefficient is derived from the emphasis coefficients of single entities, and for a sentence the average emphasis coefficient is calculated with formula 11:

$$AvgEC_A(sentence)=\sum_{i=1}^{num}EC_A(e_i)\cdot\frac{N_i}{N_{sentence}} \qquad \text{(formula 11)}$$

where $AvgEC_A(sentence)$ is the average emphasis coefficient of the sentence in text A, num is the number of distinct entities in the sentence that have an emphasis coefficient, $EC_A(e_i)$ is the emphasis coefficient of $e_i$ in text A, $N_i$ is the number of occurrences of $e_i$ in the sentence, and $N_{sentence}$ is the total number of occurrences of all entities in the sentence;
substep 2-5, the Jaccard similarity considering sentence length is calculated, with the difference in sentence length added to the denominator as a penalty term, as shown in formula 12:

$$Jaccard(A,B)=\frac{len(A\cap B)}{len(A\cup B)+\alpha\cdot\big|len(A)-len(B)\big|} \qquad \text{(formula 12)}$$

where $\alpha$ is a hyperparameter adjusting the influence of the sentence-length difference, $len(A\cap B)$ is the number of entities shared by the two texts, and $len(A\cup B)$ is the size of the union of the entities of the two texts;
after these results are calculated, they are concatenated to obtain the similarity vector $m'_{AB}(v)$ of node v.
Preferably, the step 3 specifically includes the following sub-steps:
substep 3-1, generating a matching vector
The input of the GCN is the co-occurrence entity interaction graph $G_{AB}$ obtained above, containing N nodes and weighted edges between nodes; each node $V_i$ carries a matching vector $M(v_i)$ formed by concatenating the two node vectors obtained above, as shown in formula 13:

$$M(v_i)=\big(m_{AB}(v_i),\,m'_{AB}(v_i)\big) \qquad \text{(formula 13)}$$
Substep 3-2, generating text level similarity fixed length vector
The weighted adjacency matrix corresponding to graph $G_{AB}$ is $Y\in\mathbb{R}^{N\times N}$, where each entry $Y_{ij}$ is the TF-IDF similarity between nodes $V_i$ and $V_j$, computed previously; let D be the diagonal matrix satisfying $D_{ii}=\sum_j Y_{ij}$; the input layer of the GCN is $H^{(0)}=M$, the initial node features, and $H^{(n)}\in\mathbb{R}^{N\times d_n}$ denotes the hidden representation matrix of the n-th layer; each layer passes through the graph convolution filter of formula 14 to obtain the hidden representation of the next layer:

$$H^{(n+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\,\tilde{Y}\,\tilde{D}^{-\frac{1}{2}}\,H^{(n)}\,W^{(n)}\Big) \qquad \text{(formula 14)}$$

where $\tilde{Y}=Y+I_N$ and $I_N$ is the identity matrix; similarly to D, $\tilde{D}$ is also a diagonal matrix, satisfying $\tilde{D}_{ii}=\sum_j\tilde{Y}_{ij}$; $W^{(n)}$ is the trainable weight matrix of the n-th layer; σ represents an activation function;
finally, taking the average value of the hidden vectors of all the nodes in the last layer to obtain a fixed-length vector representing the similarity of the text level;
substep 3-3, finally calculating a matching score through a multilayer perceptron, wherein the classification module comprises a linear layer, a ReLU layer, another linear layer and a final Sigmoid activation function layer; the resulting score is used to judge whether the texts are similar.
The invention also provides a text similarity calculation device based on the co-occurrence entity interaction graph, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the text similarity calculation method based on the co-occurrence entity interaction graph when being loaded to the processor.
Preferably, the computer program comprises a co-occurrence entity interaction graph construction module, a node vector generation module and a node feature aggregation module. The co-occurrence entity interaction graph construction module acquires the co-occurring entities in the text pair, splits the texts by means of these entities, and attaches each sentence of the long texts to the co-occurrence entity node with the highest similarity; the module then completes the similarity measurement between nodes, thereby constructing the interaction graph. The node vector generation module generates the feature vector of each entity node, comprising a hidden feature vector of fixed size and a similarity vector generated from the three similarity metrics, which serves as the initial input of the node feature aggregation module. The node feature aggregation module uses the GCN to aggregate the features of all nodes of the input co-occurrence entity interaction graph into the final fixed-length vector representing the text-level similarity, and finally completes the similarity calculation through a perceptron.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the invention uses graph structure to represent the entity relation between two texts, and completes the effective splitting of long texts through co-occurrence entities, thereby solving the problem of vector space resource waste in the matching process of the traditional method.
(2) The invention utilizes the characteristics of entity semantic weight, weighting coefficient and the like in the text to enhance the interpretability of the text similarity calculation semantic level on the basis of the traditional method.
(3) The invention introduces the concept of a co-occurrence entity interaction graph and uses a graph convolutional neural network to aggregate the features of the nodes in the graph effectively. This overcomes the poor long-text similarity calculation capability of existing methods and improves the accuracy of text similarity calculation.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a diagram illustrating a co-occurrence entity node according to an embodiment of the present invention.
Fig. 3 is an example of a co-occurrence entity interaction diagram according to an embodiment of the present invention.
Fig. 4 is a structural diagram of a node vector generation module according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a node feature aggregation module according to an embodiment of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
As shown in fig. 1, a text similarity calculation method based on a co-occurrence entity interaction diagram disclosed in the embodiment of the present invention includes the following specific implementation steps:
Step 1, constructing the co-occurrence entity interaction graph. Entities are extracted and key entities are obtained with an entity recognition method, and key entities that appear in the same sentence are aggregated into nodes; each sentence is then attached to its most relevant node through similarity calculation; finally, the similarity between nodes is used as the weight of the edges in the graph, completing the construction of the co-occurrence entity interaction graph. The specific steps are as follows:
substep 1-1, key entity acquisition. And acquiring entities in the text by using a named entity recognition technology, and screening out key entities through entity semantic weight filtering. The entity semantic weight is determined by the appearance frequency of the entity, the position of the entity and the context where the entity is located. The acquisition steps are as follows:
(1) The frequency of occurrence of each entity in the text is calculated as in formula 1, where $count(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the number of occurrences of all entities. After the word frequencies are calculated, entities with low frequency are filtered out to simplify the subsequent computation.

$$freq(e_i)=\frac{count(e_i)}{\sum_{j}count(e_j)} \qquad \text{(formula 1)}$$
(2) The position of an entity is taken into account: different regions of the text score differently, and the position weight is denoted $location(e_i)$. As in formula 2, P is the number of paragraphs of the text and p is the index of the paragraph in which the current occurrence of entity $e_i$ lies; when the text has no more than two paragraphs, $location(e_i)$ is a fixed value; when the text has more than two paragraphs, entities in the first and last paragraph score the same and entities in the other paragraphs score one quarter of that value.

$$location(e_i)=\begin{cases}1, & P\le 2 \text{ or } p\in\{1,P\}\\ 1/4, & \text{otherwise}\end{cases} \qquad \text{(formula 2)}$$
(3) A set of central sentences is extracted with the TextRank algorithm and denoted $Sentence=\{s_1,s_2,\dots,s_n\}$, where each $s_t$ is a central sentence made up of entities. The more often an entity appears in different central sentences, the higher its relative weight. As in formula 3, n is the number of central sentences and $I(e_i\in s_t)$ is an indicator function stating whether entity $e_i$ appears in the central sentence $s_t$.

$$center(e_i)=\frac{1}{n}\sum_{t=1}^{n}I(e_i\in s_t) \qquad \text{(formula 3)}$$
(4) After the weight parameter values of the three parts are calculated, an entity semantic weight calculation formula in UCL is provided after combination, and the formula is shown as formula 4:
$$EW(e_i)=Avg(location(e_i))\times\big(\eta\cdot freq(e_i)+(1-\eta)\cdot center(e_i)\big) \qquad \text{(formula 4)}$$

where $\eta$ is a tuning parameter in the range 0 to 1. $Avg(location(e_i))$ is the weighted average of the position weights of the entity; it is used because the same entity may appear many times at different positions in the text, so the frequency of the entity at each position serves as the weight. After $EW(e_i)$ has been computed for all entities, the UCL semantic weight of each entity is obtained through normalization, and only entities whose semantic weight reaches a certain threshold are retained to form the subsequent entity nodes.
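The key-entity weighting of sub-step 1-1 can be summarized in a short sketch. The following Python snippet is only an illustrative implementation of formulas 1-4 under the assumptions stated in the comments; the constants used for the location weight, the default value of eta, and all helper names are not prescribed by this embodiment.

```python
# Minimal sketch of formulas 1-4 (entity semantic weight). Named entity
# recognition, paragraph splitting and TextRank central-sentence extraction
# are assumed to be done elsewhere.
from collections import Counter

def entity_semantic_weights(entities, paragraphs, central_sentences, eta=0.5):
    """entities: list of (entity, paragraph_index) occurrences.
    paragraphs: list of paragraph texts (P = len(paragraphs)).
    central_sentences: sentences returned by TextRank."""
    P = len(paragraphs)
    occurrences = Counter(e for e, _ in entities)
    total = sum(occurrences.values())

    def location(p):                       # formula 2: first/last paragraph vs. others
        if P <= 2 or p == 0 or p == P - 1:
            return 1.0                     # fixed value (illustrative choice)
        return 0.25                        # one quarter of the first/last score

    loc_sum = Counter()                    # an entity may occur in several paragraphs
    for e, p in entities:
        loc_sum[e] += location(p)

    weights = {}
    for e, cnt in occurrences.items():
        freq = cnt / total                                              # formula 1
        center = sum(e in s for s in central_sentences) / max(len(central_sentences), 1)  # formula 3
        avg_loc = loc_sum[e] / cnt                                      # Avg(location(e))
        weights[e] = avg_loc * (eta * freq + (1 - eta) * center)        # formula 4

    norm = sum(weights.values()) or 1.0
    return {e: w / norm for e, w in weights.items()}                    # normalized UCL semantic weight
```

In practice only entities whose normalized weight exceeds the chosen threshold would be kept for node construction, as described above.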
Substep 1-2, common entity aggregation. Key entities that occur together in the same sentence in both texts are aggregated onto one node. A node should contain one or more key entities, and a key entity may also appear in multiple nodes. As shown in fig. 2, the entity pairs "Lakers" and "Schroeder", "Lakers" and "Nets", and "Nets", "Irving" and "Durant" all satisfy the above condition and constitute three nodes formed by aggregating the common entities.
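As an illustration of sub-step 1-2, the following sketch groups the shared key entities by the sentences in which they co-occur; the function and variable names are illustrative only, and simple substring matching stands in for whatever entity-mention matching an implementation would actually use.

```python
# Sketch of sub-step 1-2: every set of shared key entities that co-occur in one
# sentence of either text becomes a node (a single entity may also form a node,
# and one entity may belong to several nodes).
def build_entity_nodes(sentences_a, sentences_b, shared_entities):
    """sentences_*: lists of sentences; shared_entities: key entities common to both texts."""
    nodes = set()
    for sentence in sentences_a + sentences_b:
        co_occurring = frozenset(e for e in shared_entities if e in sentence)
        if co_occurring:                       # keep nodes with one or more key entities
            nodes.add(co_occurring)
    return [set(n) for n in nodes]
```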
Substep 1-3, entity node sentence assignment. Similarity calculation is carried out between each sentence in the text and the nodes, and the sentence is placed in the node with the highest similarity. The similarity is the cosine similarity; sentences that do not contain any key entity need not be calculated, and entities that are key in only one of the texts are not considered. Let the node vector and the sentence vector be X and Y respectively; the cosine similarity is computed as in formula 5, where n is the dimension of the vectors and $x_i$, $y_i$ are the i-th components.

$$\cos(X,Y)=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^{2}}\,\sqrt{\sum_{i=1}^{n}y_i^{2}}} \qquad \text{(formula 5)}$$
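A minimal sketch of sub-step 1-3 follows. How the node and sentence vectors are produced (bag-of-words, averaged word embeddings, etc.) is left open by this embodiment and is assumed to be available.

```python
# Sketch of formula 5 and the sentence-to-node assignment of sub-step 1-3.
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

def assign_sentence(sentence_vec, node_vecs):
    """Return the index of the node with the highest cosine similarity."""
    return max(range(len(node_vecs)), key=lambda i: cosine(sentence_vec, node_vecs[i]))
```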
And substep 1-4, calculating similarity weights among nodes. After the grouping of sentences is completed, the similarity between nodes must be calculated as the edge weight so as to associate the nodes with one another. The TF-IDF similarity between the sentence sets of any two nodes is selected as the weight value. Although the edge weight could be determined by other methods, experiments with many classical methods show that using the TF-IDF similarity as the edge weight makes the generated co-occurrence entity interaction graph more closely connected. The idea of TF-IDF is that a word which makes up a high proportion of one text but rarely occurs in other texts is more important at the semantic level of that text.
$$TF(d,w)=\frac{n_{d,w}}{\sum_{k}n_{d,k}} \qquad \text{(formula 6)}$$

$$IDF(d,w)=\log\frac{|D|}{D_w} \qquad \text{(formula 7)}$$

$$TF\text{-}IDF(d,w)=TF(d,w)\times IDF(d,w) \qquad \text{(formula 8)}$$

The term frequency TF is computed as in formula 6, where $n_{d,w}$ is the number of times word w appears in document d and the denominator $\sum_{k}n_{d,k}$ is the total number of words in the document. The inverse document frequency is computed as in formula 7, where $D_w$ is the number of documents containing the word w and $|D|$ is the total number of documents. Combining formulas 6 and 7 gives the TF-IDF formula 8. After the feature-word weights are calculated, the texts are vectorized and the distance between the text vectors is computed; the smaller the distance, the higher the text similarity. The co-occurrence entity interaction graph generated through the above sub-steps is shown in fig. 3.
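The edge-weight computation of sub-step 1-4 can be sketched as follows. scikit-learn is used here purely for illustration; the embodiment does not prescribe a particular TF-IDF implementation or tokenizer.

```python
# Sketch of formulas 6-8: TF-IDF cosine similarity between the sentence sets of
# two nodes, used as the edge weight of the co-occurrence entity interaction graph.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def edge_weight(node_sentences_1, node_sentences_2):
    """Each argument is the list of sentences grouped under one node."""
    docs = [" ".join(node_sentences_1), " ".join(node_sentences_2)]
    tfidf = TfidfVectorizer().fit_transform(docs)     # TF (formula 6) x IDF (formula 7)
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```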
Step 2, node vector generation. On each node, the sentences from the same text are concatenated into a whole, giving two sentence sets from the two texts; these are fed into an interaction model to obtain a fixed-length hidden feature vector $m_{AB}(v)$. Subsequently, three metrics, namely the TF-IDF cosine similarity, the average emphasis coefficient similarity and the Jaccard similarity considering sentence length, are concatenated to obtain the similarity vector $m'_{AB}(v)$ of the node. The overall flow is shown in fig. 4, and the specific steps are as follows:
in sub-step 2-1, after the similarity measure between the nodes is completed, the similarity of sentences from different news in the nodes needs to be calculated, so that each node is characterized by a vector. The sentences from the same text are spliced into a text set, and the sets S are respectively obtained for the sentences in the two textsA(v)、SB(v)。
Substep 2-2, taking node v as an example, $S_A(v)$ and $S_B(v)$ are used as input to the interaction model, whose weight-sharing embedding layer encodes each set into a context vector. The context layer typically comprises one or more bidirectional LSTM (BiLSTM) layers or CNN layers with max pooling, intended to capture the context information in $S_A(v)$ and $S_B(v)$. Let $Vec_A(v)$ and $Vec_B(v)$ denote the context vectors obtained from $S_A(v)$ and $S_B(v)$. The aggregation layer then gives the vector representation $m_{AB}(v)$ of the node, computed by concatenating the element-wise distance of the two context vectors with their Hadamard product, as shown in formula 9, where $\odot$ denotes the Hadamard product.

$$m_{AB}(v)=\big(\,|Vec_A(v)-Vec_B(v)|\;,\;Vec_A(v)\odot Vec_B(v)\,\big) \qquad \text{(formula 9)}$$
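The interaction model of sub-step 2-2 can be sketched in PyTorch as below. The layer sizes, the tokenization, and the choice of max pooling over a BiLSTM (rather than a CNN context layer) are illustrative assumptions within the options described above.

```python
# Sketch of the node-vector generation of formula 9: a weight-sharing embedding
# plus BiLSTM context layer encodes S_A(v) and S_B(v); the aggregation layer
# concatenates the element-wise distance with the Hadamard product.
import torch
import torch.nn as nn

class NodeInteraction(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # shared weights for both inputs
        self.context = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def encode(self, token_ids):
        h, _ = self.context(self.embed(token_ids))               # (batch, seq, 2*hidden)
        return h.max(dim=1).values                               # max pooling over time

    def forward(self, tokens_a, tokens_b):
        vec_a, vec_b = self.encode(tokens_a), self.encode(tokens_b)
        return torch.cat([(vec_a - vec_b).abs(), vec_a * vec_b], dim=-1)   # formula 9
```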
Substep 2-3, the similarity of the entity features between the texts in $S_A(v)$ and $S_B(v)$ is calculated to generate another node vector $m'_{AB}(v)$ for each node. This step obtains three similarity measures. The TF-IDF cosine similarity is computed with formulas 6, 7 and 8.
And a substep 2-4 of calculating the emphasis coefficient. The method measures the emphasized entities in a text through an Emphasis Coefficient (EC), which remedies the neglect of emphasized entities caused by considering semantic weight alone, as shown in formula 10:

$$EC_{doc}(e_i)=\frac{N_{DOC}}{\sum_{j=1}^{N_{DOC}} I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)+1} \qquad \text{(formula 10)}$$

where $EC_{doc}(e_i)$ is the emphasis coefficient of entity $e_i$ in text doc, $EW_{doc}(e_i)$ is the semantic weight of entity $e_i$ in text doc, computed as in formula 4, $N_{DOC}$ is the total number of news texts in the knowledge space, and $I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)$ is an indicator function: through it the number of texts in the whole knowledge space in which the weight of entity $e_i$ is greater than or equal to its weight in this report is counted. To keep the formula well defined, 1 is added to the denominator.
The average emphasis coefficient is derived from the emphasis coefficients of single entities; for a sentence, the average emphasis coefficient is calculated with formula 11:

$$AvgEC_A(sentence)=\sum_{i=1}^{num}EC_A(e_i)\cdot\frac{N_i}{N_{sentence}} \qquad \text{(formula 11)}$$

where $AvgEC_A(sentence)$ is the average emphasis coefficient of the sentence in text A, num is the number of distinct entities in the sentence that have an emphasis coefficient, $EC_A(e_i)$ is the emphasis coefficient of $e_i$ in text A, $N_i$ is the number of occurrences of $e_i$ in the sentence, and $N_{sentence}$ is the total number of occurrences of all entities in the sentence.
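Formulas 10 and 11 can be sketched as follows. The data layout (a dictionary of per-document entity semantic weights) and the function names are assumptions made only for illustration.

```python
# Sketch of the emphasis coefficient (formula 10) and its sentence-level
# average (formula 11); the "+1" keeps the denominator non-zero.
def emphasis_coefficient(entity, doc_id, ew_by_doc):
    """ew_by_doc: {doc_id: {entity: semantic weight}} over the knowledge space."""
    ew_here = ew_by_doc[doc_id].get(entity, 0.0)
    n_doc = len(ew_by_doc)
    n_ge = sum(1 for ew in ew_by_doc.values()
               if ew.get(entity, 0.0) >= ew_here)        # indicator I(EW_j >= EW_doc)
    return n_doc / (n_ge + 1)                            # formula 10

def average_emphasis(sentence_entities, doc_id, ew_by_doc):
    """sentence_entities: {entity: occurrence count in the sentence}."""
    n_total = sum(sentence_entities.values()) or 1
    return sum(emphasis_coefficient(e, doc_id, ew_by_doc) * n / n_total
               for e, n in sentence_entities.items())    # formula 11
```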
Substep 2-5, calculating the Jaccard similarity that takes sentence length into account. The classical Jaccard similarity computes the ratio of the intersection to the union of two sets. However, although the vertices of the co-occurrence entity interaction graph are composed of sentences from two texts, most of these sentences may contain the same entities, so directly computing the Jaccard similarity would lose meaning. The difference in sentence length is therefore added to the denominator as a penalty term, as shown in formula 12, where $\alpha$ is a hyperparameter adjusting the influence of the sentence-length difference, $len(A\cap B)$ is the number of entities shared by the two texts, and $len(A\cup B)$ is the size of the union of the entities of the two texts.

$$Jaccard(A,B)=\frac{len(A\cap B)}{len(A\cup B)+\alpha\cdot\big|len(A)-len(B)\big|} \qquad \text{(formula 12)}$$
After these results are calculated, they are concatenated to obtain the similarity vector $m'_{AB}(v)$ of node v.
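The length-aware Jaccard similarity of formula 12 admits a very small sketch; the default value of alpha is illustrative only.

```python
# Sketch of formula 12: Jaccard similarity with a sentence-length penalty.
def length_aware_jaccard(entities_a, entities_b, alpha=0.5):
    inter = len(entities_a & entities_b)                       # len(A ∩ B)
    union = len(entities_a | entities_b)                       # len(A ∪ B)
    penalty = alpha * abs(len(entities_a) - len(entities_b))   # length-difference penalty
    return inter / (union + penalty) if union else 0.0
```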
Step 3, node feature aggregation. After the feature vector of each node is obtained, a graph convolutional neural network (GCN) is used for feature transformation, thereby obtaining the final matching vector used to measure the similarity between the texts.
Substep 3-1, generating the matching vectors. A GCN accepts graph structures and feature vectors of arbitrary size and shape, and generates node-level embedded representations by encoding the neighborhood information of each node. The GCN input in the invention is the co-occurrence entity interaction graph $G_{AB}$ obtained above, containing N nodes and weighted edges between nodes. Each node $V_i$ carries a matching vector $M(v_i)$ formed by concatenating the two node vectors obtained above, as shown in formula 13:

$$M(v_i)=\big(m_{AB}(v_i),\,m'_{AB}(v_i)\big) \qquad \text{(formula 13)}$$
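Assembling the GCN input of sub-step 3-1 can be sketched as follows; the array layout is an assumption made for illustration.

```python
# Sketch of formula 13 and the weighted adjacency matrix: the initial node
# features H^(0) are the concatenated matching vectors, and Y holds the
# previously computed TF-IDF similarities between nodes.
import numpy as np

def build_gcn_inputs(m_vectors, m_prime_vectors, edge_weights, n_nodes):
    """m_vectors, m_prime_vectors: (n_nodes, d) arrays;
    edge_weights: {(i, j): TF-IDF similarity between node i and node j}."""
    features = np.concatenate([m_vectors, m_prime_vectors], axis=1)   # H^(0) = M
    Y = np.zeros((n_nodes, n_nodes))
    for (i, j), w in edge_weights.items():
        Y[i, j] = Y[j, i] = w
    return features, Y
```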
And a substep 3-2 of generating the text-level similarity fixed-length vector. The weighted adjacency matrix corresponding to graph $G_{AB}$ is $Y\in\mathbb{R}^{N\times N}$, where each entry $Y_{ij}$ is the TF-IDF similarity between nodes $V_i$ and $V_j$, computed previously. Let D be the diagonal matrix satisfying $D_{ii}=\sum_j Y_{ij}$. The input layer of the GCN is $H^{(0)}=M$, the initial node features, and $H^{(n)}\in\mathbb{R}^{N\times d_n}$ denotes the hidden representation matrix of the n-th layer. Each layer passes through the graph convolution filter of formula 14 to obtain the hidden representation of the next layer.

$$H^{(n+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\,\tilde{Y}\,\tilde{D}^{-\frac{1}{2}}\,H^{(n)}\,W^{(n)}\Big) \qquad \text{(formula 14)}$$

where $\tilde{Y}=Y+I_N$ and $I_N$ is the identity matrix; similarly to D, $\tilde{D}$ is also a diagonal matrix, satisfying $\tilde{D}_{ii}=\sum_j\tilde{Y}_{ij}$; $W^{(n)}$ is the trainable weight matrix of the n-th layer; σ represents an activation function, and the sigmoid function is adopted in the invention.
And finally, taking the average value of the hidden vectors of all the nodes in the last layer to obtain a fixed-length vector representing the similarity of the text level. In the invention, the number of GCN layers can be 2 or 3. The output size of the last layer of GCN is set to be constant 16 and the other layers are set to be 128.
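The propagation rule of formula 14 and the final mean pooling can be sketched as below. The layer sizes (hidden 128, final 16) follow the embodiment, while the random weight initialization stands in for the trained matrices $W^{(n)}$ and is for illustration only.

```python
# Sketch of formula 14 with the renormalization trick, followed by mean pooling
# over nodes to obtain the fixed-length text-level vector.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_forward(H, Y, layer_dims=(128, 16), rng=np.random.default_rng(0)):
    Y_tilde = Y + np.eye(Y.shape[0])                        # add self-loops: Y~ = Y + I_N
    d_inv_sqrt = np.diag(1.0 / np.sqrt(Y_tilde.sum(axis=1)))
    A_hat = d_inv_sqrt @ Y_tilde @ d_inv_sqrt               # D~^{-1/2} Y~ D~^{-1/2}
    for dim in layer_dims:
        W = rng.standard_normal((H.shape[1], dim)) * 0.1    # stands in for trainable W^(n)
        H = sigmoid(A_hat @ H @ W)                          # formula 14
    return H.mean(axis=0)                                   # fixed-length text-level vector
```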
And a substep 3-3 of finally calculating the matching score through a multilayer perceptron: the classification module comprises a linear layer with output size 16, a ReLU layer, another linear layer and a final Sigmoid activation layer, and the resulting score is used to judge whether the texts are similar. The general flow of step 3 is shown in fig. 5.
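The classification module of sub-step 3-3 corresponds to a small feed-forward head. The following PyTorch sketch assumes the 16-dimensional GCN output described above; the output size of the second linear layer is an illustrative assumption, since the embodiment does not state it.

```python
# Sketch of the classification module: linear(16) -> ReLU -> linear -> Sigmoid.
import torch.nn as nn

matching_head = nn.Sequential(
    nn.Linear(16, 16),   # first linear layer, output size 16 as in the embodiment
    nn.ReLU(),
    nn.Linear(16, 1),    # second linear layer (output size 1 assumed here)
    nn.Sigmoid(),        # similarity score in (0, 1)
)
```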
Based on the same inventive concept, the invention further provides a text similarity calculation device based on the co-occurrence entity interaction graph, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the above text similarity calculation method based on the co-occurrence entity interaction graph. The device comprises a co-occurrence entity interaction graph construction module, a node vector generation module and a node feature aggregation module. The co-occurrence entity interaction graph construction module acquires the co-occurring entities in the text pair, splits the texts by means of these entities, and attaches each sentence of the long texts to the co-occurrence entity node with the highest similarity; the module then completes the similarity measurement between nodes and constructs the interaction graph, implementing the content of step 1. The node vector generation module generates the feature vector of each entity node, comprising a hidden feature vector of fixed size and a similarity vector generated from the three similarity metrics, which serves as the initial input of the node feature aggregation module, implementing the content of step 2. The node feature aggregation module uses the GCN to aggregate the features of all nodes of the input co-occurrence entity interaction graph into the final fixed-length vector representing the text-level similarity, and finally completes the similarity calculation through a perceptron, implementing the content of step 3.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (7)

1. A text similarity calculation method based on a co-occurrence entity interaction graph is characterized by comprising the following steps:
step 1, co-occurrence entity interaction graph construction
Obtaining key entities which commonly appear in matched text pairs by a UCL semantic weight calculation method, and aggregating the key entities appearing in the same sentence to the same node; similarity calculation is carried out on each sentence in the text and the nodes, and the sentences are placed in the nodes with the highest similarity; after sentence grouping is completed, semantic association is formed between nodes by calculating the similarity between the nodes as edge weight, so that the construction of a co-occurrence entity interactive graph is completed;
step 2, node vector generation
Calculating the similarity of sentences from different texts in the node, and splicing the sentences from the same text; firstly, generating hidden feature vectors with fixed sizes at each node, and then generating similarity vectors by utilizing the similarity of entity features among texts;
step 3, node feature aggregation
After the feature vector of each node is obtained, feature conversion is carried out by using a graph convolution neural network GCN, so that a final matching vector is obtained; generating an embedded representation at a node level by encoding information about a neighborhood of nodes; and obtaining a fixed length vector representing the text level similarity by using the average value of the hidden vectors of all the nodes in the last layer obtained by GCN training, and finally calculating the similarity by using a multilayer perceptron.
2. The method for calculating text similarity based on the co-occurrence entity interaction graph according to claim 1, wherein the step 1 specifically comprises the following sub-steps:
substep 1-1, key entity acquisition
Acquiring entities in the text by using a named entity recognition technology, and screening out key entities through entity semantic weight filtering; the entity semantic weight is jointly determined by the appearance frequency of an entity, the position of the entity and the context where the entity is located;
substep 1-2, common entity aggregation
Aggregating key entities in two texts which both exist in the same sentence to a node; a node should contain one or more keywords, and a keyword may also appear in multiple nodes;
substeps 1-3, entity node statement assignment
Similarity calculation is carried out between each sentence in the text and the nodes, and each sentence is placed in the node with the highest similarity; the similarity is the cosine similarity; sentences that do not contain any key entity need not be calculated, and entities that are key in only one of the texts are not considered; let the node vector and the text sentence vector be X and Y respectively, then the cosine similarity is computed as in formula 5:

$$\cos(X,Y)=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^{2}}\,\sqrt{\sum_{i=1}^{n}y_i^{2}}} \qquad \text{(formula 5)}$$

where n is the dimension of the vectors and $x_i$, $y_i$ are the i-th components of the vectors;
substeps 1-4, calculating similarity weight between nodes
After the grouping of sentences is completed, the similarity between nodes is calculated as the edge weight so as to associate nodes with one another; the TF-IDF similarity between the sentence sets of any two nodes is selected as the weight value; the term frequency TF is computed as in formula 6:

$$TF(d,w)=\frac{n_{d,w}}{\sum_{k}n_{d,k}} \qquad \text{(formula 6)}$$

where $n_{d,w}$ is the number of times word w appears in document d and the denominator $\sum_{k}n_{d,k}$ is the total number of words in the document; the inverse document frequency is computed as in formula 7:

$$IDF(d,w)=\log\frac{|D|}{D_w} \qquad \text{(formula 7)}$$

where $D_w$ is the number of documents containing the word w and $|D|$ is the total number of documents; combining formulas 6 and 7 gives the TF-IDF formula 8:

$$TF\text{-}IDF(d,w)=TF(d,w)\times IDF(d,w) \qquad \text{(formula 8)}$$

After the feature-word weights are calculated, the texts are vectorized and the distance between the text vectors is computed; the smaller the distance, the higher the text similarity.
3. The co-occurrence entity interaction graph-based text similarity calculation method according to claim 2, wherein the sub-step 1-1 comprises the following procedures:
(1) The frequency of occurrence of each entity in the text is calculated as in formula 1:

$$freq(e_i)=\frac{count(e_i)}{\sum_{j}count(e_j)} \qquad \text{(formula 1)}$$

where $count(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the number of occurrences of all entities; after the word frequencies are calculated, entities with low frequency are filtered out;
(2) The position of an entity is distinguished, and different regions of the text score differently; the position weight is denoted $location(e_i)$ as in formula 2, where P is the number of paragraphs of the text and p is the index of the paragraph in which the current occurrence of entity $e_i$ lies; when the text has no more than two paragraphs, $location(e_i)$ is a fixed value; when the text has more than two paragraphs, entities in the first and last paragraph score the same, and the other paragraphs score one quarter of that value:

$$location(e_i)=\begin{cases}1, & P\le 2 \text{ or } p\in\{1,P\}\\ 1/4, & \text{otherwise}\end{cases} \qquad \text{(formula 2)}$$

(3) A set of central sentences is extracted with the TextRank algorithm and denoted $Sentence=\{s_1,s_2,\dots,s_n\}$, where each $s_t$ is a central sentence; as in formula 3, n is the number of central sentences and $I(e_i\in s_t)$ is an indicator function stating whether entity $e_i$ appears in the central sentence $s_t$:

$$center(e_i)=\frac{1}{n}\sum_{t=1}^{n}I(e_i\in s_t) \qquad \text{(formula 3)}$$

(4) After the three weight components are calculated, they are combined into the entity semantic weight formula for UCL, shown as formula 4:

$$EW(e_i)=Avg(location(e_i))\times\big(\eta\cdot freq(e_i)+(1-\eta)\cdot center(e_i)\big) \qquad \text{(formula 4)}$$

where $\eta$ is a tuning parameter and $Avg(location(e_i))$ is the average position weight of the entity; after $EW(e_i)$ has been computed for all entities, the UCL semantic weight of each entity is obtained through normalization, and only entities whose semantic weight reaches a certain threshold are retained to form the subsequent entity nodes.
4. The method for calculating text similarity based on a co-occurrence entity interaction graph according to claim 1, wherein the step 2 specifically comprises the following sub-steps:
substep 2-1, after the similarity measurement between nodes is completed, the similarity of sentences from different news texts within a node is calculated, so that each node can be represented by a feature vector; the sentences from the same text are concatenated into a text set, giving the sets $S_A(v)$ and $S_B(v)$ for the sentences of the two texts respectively;
Substep 2-2, taking node v as an example, $S_A(v)$ and $S_B(v)$ are used as input to the interaction model, whose weight-sharing embedding layer encodes each set into a context vector; the context layer typically contains one or more bidirectional LSTM layers or CNN layers with max pooling, intended to capture the context information in $S_A(v)$ and $S_B(v)$; let $Vec_A(v)$ and $Vec_B(v)$ denote the context vectors obtained from $S_A(v)$ and $S_B(v)$; the aggregation layer then gives the vector representation $m_{AB}(v)$ of the node, computed by concatenating the element-wise distance of the two context vectors with their Hadamard product, as shown in formula 9:

$$m_{AB}(v)=\big(\,|Vec_A(v)-Vec_B(v)|\;,\;Vec_A(v)\odot Vec_B(v)\,\big) \qquad \text{(formula 9)}$$

where $\odot$ denotes the Hadamard product;
substeps 2-3, the similarity of the entity features between the texts in $S_A(v)$ and $S_B(v)$ is calculated to generate another node vector $m'_{AB}(v)$ for each node; this step obtains three similarity measures; the TF-IDF cosine similarity is computed with formulas 6, 7 and 8;
and a substep 2-4 of calculating the emphasis coefficient, which measures the emphasized entities in the text and remedies the neglect of emphasized entities caused by considering semantic weight alone, as shown in formula 10:

$$EC_{doc}(e_i)=\frac{N_{DOC}}{\sum_{j=1}^{N_{DOC}} I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)+1} \qquad \text{(formula 10)}$$

where $EC_{doc}(e_i)$ is the emphasis coefficient of entity $e_i$ in text doc, $EW_{doc}(e_i)$ is the semantic weight of entity $e_i$ in text doc, computed as in formula 4, $N_{DOC}$ is the total number of news texts in the knowledge space, and $I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)$ is an indicator function counting the number of texts in which $EW_j(e_i)\ge EW_{doc}(e_i)$;
the average emphasis coefficient is derived from the emphasis coefficients of single entities, and for a sentence the average emphasis coefficient is calculated with formula 11:

$$AvgEC_A(sentence)=\sum_{i=1}^{num}EC_A(e_i)\cdot\frac{N_i}{N_{sentence}} \qquad \text{(formula 11)}$$

where $AvgEC_A(sentence)$ is the average emphasis coefficient of the sentence in text A, num is the number of distinct entities in the sentence that have an emphasis coefficient, $EC_A(e_i)$ is the emphasis coefficient of $e_i$ in text A, $N_i$ is the number of occurrences of $e_i$ in the sentence, and $N_{sentence}$ is the total number of occurrences of all entities in the sentence;
substep 2-5, the Jaccard similarity considering sentence length is calculated, with the difference in sentence length added to the denominator as a penalty term, as shown in formula 12:

$$Jaccard(A,B)=\frac{len(A\cap B)}{len(A\cup B)+\alpha\cdot\big|len(A)-len(B)\big|} \qquad \text{(formula 12)}$$

where $\alpha$ is a hyperparameter adjusting the influence of the sentence-length difference, $len(A\cap B)$ is the number of entities shared by the two texts, and $len(A\cup B)$ is the size of the union of the entities of the two texts;
after these results are calculated, they are concatenated to obtain the similarity vector $m'_{AB}(v)$ of node v.
5. The method for calculating text similarity based on a co-occurrence entity interaction graph according to claim 1, wherein the step 3 specifically comprises the following sub-steps:
substep 3-1, generating a matching vector
The input of the GCN is the co-occurrence entity interaction graph $G_{AB}$ obtained above, containing N nodes and weighted edges between nodes; each node $V_i$ carries a matching vector $M(v_i)$ formed by concatenating the two node vectors obtained above, as shown in formula 13:

$$M(v_i)=\big(m_{AB}(v_i),\,m'_{AB}(v_i)\big) \qquad \text{(formula 13)}$$
Substep 3-2, generating text level similarity fixed length vector
The weighted adjacency matrix corresponding to graph $G_{AB}$ is $Y\in\mathbb{R}^{N\times N}$, where each entry $Y_{ij}$ is the TF-IDF similarity between nodes $V_i$ and $V_j$, computed previously; let D be the diagonal matrix satisfying $D_{ii}=\sum_j Y_{ij}$; the input layer of the GCN is $H^{(0)}=M$, the initial node features, and $H^{(n)}\in\mathbb{R}^{N\times d_n}$ denotes the hidden representation matrix of the n-th layer; each layer passes through the graph convolution filter of formula 14 to obtain the hidden representation of the next layer:

$$H^{(n+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\,\tilde{Y}\,\tilde{D}^{-\frac{1}{2}}\,H^{(n)}\,W^{(n)}\Big) \qquad \text{(formula 14)}$$

where $\tilde{Y}=Y+I_N$ and $I_N$ is the identity matrix; similarly to D, $\tilde{D}$ is also a diagonal matrix, satisfying $\tilde{D}_{ii}=\sum_j\tilde{Y}_{ij}$; $W^{(n)}$ is the trainable weight matrix of the n-th layer; σ represents an activation function;
finally, taking the average value of the hidden vectors of all the nodes in the last layer to obtain a fixed-length vector representing the similarity of the text level;
substep 3-3, finally calculating a matching score through a multilayer perceptron, wherein the classification module comprises a linear layer, a ReLU layer, another linear layer and a final Sigmoid activation function layer; the resulting score is used to judge whether the texts are similar.
6. A co-occurrence entity interaction graph-based text similarity calculation apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the computer program when loaded into a processor implements the method of computing text similarity based on a co-occurrence entity interaction graph of any of claims 1-5 above.
7. The co-occurrence entity interaction graph-based text similarity calculation apparatus according to claim 6, wherein: the computer program comprises a co-occurrence entity interaction graph construction module, a node vector generation module and a node feature aggregation module; the co-occurrence entity interaction graph construction module is used for acquiring the co-occurring entities in the text pair, splitting the texts by means of these entities, and attaching each sentence of the long texts to the co-occurrence entity node with the highest similarity; the module then completes the similarity measurement between nodes and thereby constructs the interaction graph; the node vector generation module is used for generating the feature vector of each entity node, which comprises a hidden feature vector of fixed size and a similarity vector generated from the three similarity metrics and serves as the initial input of the node feature aggregation module; and the node feature aggregation module uses the GCN to aggregate the features of all nodes of the input co-occurrence entity interaction graph into the final fixed-length vector representing the text-level similarity, and finally completes the similarity calculation through a perceptron.
CN202110639430.8A 2021-06-08 2021-06-08 Text similarity calculation method and device based on co-occurrence entity interaction graph Pending CN113743079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110639430.8A CN113743079A (en) 2021-06-08 2021-06-08 Text similarity calculation method and device based on co-occurrence entity interaction graph


Publications (1)

Publication Number Publication Date
CN113743079A true CN113743079A (en) 2021-12-03

Family

ID=78728425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110639430.8A Pending CN113743079A (en) 2021-06-08 2021-06-08 Text similarity calculation method and device based on co-occurrence entity interaction graph

Country Status (1)

Country Link
CN (1) CN113743079A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114428850A (en) * 2022-04-07 2022-05-03 之江实验室 Text retrieval matching method and system
CN114428850B (en) * 2022-04-07 2022-08-05 之江实验室 Text retrieval matching method and system
CN116304749A (en) * 2023-05-19 2023-06-23 中南大学 Long text matching method based on graph convolution
CN116304749B (en) * 2023-05-19 2023-08-15 中南大学 Long text matching method based on graph convolution

Similar Documents

Publication Publication Date Title
WO2020108608A1 (en) Search result processing method, device, terminal, electronic device, and storage medium
CN112269868B (en) Use method of machine reading understanding model based on multi-task joint training
CN109815336B (en) Text aggregation method and system
CN106021364A (en) Method and device for establishing picture search correlation prediction model, and picture search method and device
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN107145485B (en) Method and apparatus for compressing topic models
CN110674252A (en) High-precision semantic search system for judicial domain
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
WO2015165372A1 (en) Method and apparatus for classifying object based on social networking service, and storage medium
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110569405A (en) method for extracting government affair official document ontology concept based on BERT
CN111694940A (en) User report generation method and terminal equipment
CN108228541A (en) The method and apparatus for generating documentation summary
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN109255012A (en) A kind of machine reads the implementation method and device of understanding
CN113743079A (en) Text similarity calculation method and device based on co-occurrence entity interaction graph
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN110008365A (en) A kind of image processing method, device, equipment and readable storage medium storing program for executing
CN114742071B (en) Cross-language ideas object recognition analysis method based on graph neural network
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113569118B (en) Self-media pushing method, device, computer equipment and storage medium
CN107908749A (en) A kind of personage's searching system and method based on search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination