CN113743079A - Text similarity calculation method and device based on co-occurrence entity interaction graph - Google Patents

Info

Publication number
CN113743079A
CN113743079A (application number CN202110639430.8A)
Authority
CN
China
Prior art keywords
similarity
text
entity
node
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110639430.8A
Other languages
Chinese (zh)
Inventor
杨鹏
常欣辰
赵翰林
谢亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Huaxun Technology Co ltd
Original Assignee
Zhejiang Huaxun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Huaxun Technology Co ltd filed Critical Zhejiang Huaxun Technology Co ltd
Priority to CN202110639430.8A priority Critical patent/CN113743079A/en
Publication of CN113743079A publication Critical patent/CN113743079A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods


Abstract

The invention provides a text similarity calculation method based on a co-occurrence entity interaction graph. First, key entities in the texts are extracted, and key entities that appear in the same sentence are aggregated into nodes; then the similarity between each sentence in the text and the entity nodes is calculated, each sentence is attached to the node with the highest similarity, and the similarity measurement between nodes is completed to construct the co-occurrence entity interaction graph. On each node, the sentences from the same text are concatenated, and an interaction model generates the feature vector of the node together with a similarity vector based on text entity features. After the feature vector of each node is obtained, a graph convolutional neural network (GCN) performs feature transformation to obtain the final matching vector, so that the features of all nodes are aggregated, and a multilayer perceptron computes the similarity score. The invention converts the similarity calculation of long texts into short-text matching tasks on nodes formed by entities co-occurring in the two texts, thereby effectively improving the accuracy of text similarity calculation.

Description

Text similarity calculation method and device based on co-occurrence entity interaction graph
Technical Field
The invention relates to a text similarity calculation method based on a co-occurrence entity interaction graph, and belongs to the technical field of Internet and artificial intelligence.
Background
Text similarity calculation is an essential technology in common application scenarios such as search engines, automatic question answering, document classification and news recommendation. With the rapid development of the Internet, the amount of online news has grown dramatically, placing higher requirements on the accuracy of text similarity calculation. Common text semantic similarity calculation methods fall into four categories: string-based, statistics-based, deep-learning-based and knowledge-base-based methods. String-based methods ignore the semantic level and, at the present stage, can only serve as a supplement to other methods; statistics-based methods represent text similarity by vectorizing texts and computing the distance between the vectors, lack sufficient semantic information, and waste resources when vectorizing long texts; deep-learning-based methods achieve better performance, but the additional modules they introduce increase the computational cost of the model; knowledge-base-based methods depend on the accuracy and richness of the knowledge base. Existing methods therefore handle long-text similarity poorly and compute text similarity with low accuracy.
The Uniform Content Label (UCL) defined by the national standard Uniform Content Label Format Specification (GB/T 35304-2017) can provide rich textual semantic information, and the UCL Knowledge Space (UCLKS) is built on basic knowledge bases such as Wikipedia and Baidu Baike, supplemented by network news indexed by UCL. UCLKS uses the UCL semantic weights to indicate the emphasized entities of a text, and through these key entities the vector matching of long texts is divided into similarity calculations between individual sentences. On the one hand this supplements the semantic level; on the other hand it avoids the resource waste caused by vectorizing long texts. UCLKS can thus well satisfy the implementation requirements of a knowledge-base-based method.
Traditional news similarity calculation methods obtain the relatedness of texts by computing the distance between their vector representations, but UCL index entities cannot be directly turned into such an embedding: the number of index entities differs from one news item to another, the semantic weights of entities differ across reports, and the number of entities retained after thresholding also differs, so the resulting vectors would have different lengths. There is as yet no ideal scheme for applying UCL to text similarity.
Disclosure of Invention
Based on the background, the invention provides a text similarity calculation method based on a co-occurrence entity interaction graph, which utilizes various statistical characteristics in texts to strengthen semantic basis of similarity calculation between the texts. Aiming at the problems of low calculation accuracy and vector space waste between long texts, the method adopts a divide-and-conquer strategy, completes the splitting of the long texts through co-occurrence entities in the texts, constructs a co-occurrence entity interaction diagram, and generates vector representation between the split short texts through an interaction model. And finally, inputting a pair of feature vectors of the nodes in the interactive graph into a multilayer graph convolution neural network to obtain matching vectors of two texts with fixed lengths, and obtaining a final similarity calculation result through a multilayer perceptron.
In order to achieve the above object, the present invention provides a text similarity calculation method based on a co-occurrence entity interaction graph. Firstly, the entities shared by the text pair are screened out with the UCL semantic weight calculation method, the key entities appearing in the same sentence are aggregated into nodes, the sentences of the two texts are assigned to the semantically closest nodes, and the semantic similarity between nodes is used as the edge weight, thereby constructing the co-occurrence entity interaction graph. Then an interaction network encodes the sentence sets from the different texts on each node into a node vector, the UCL entity emphasis coefficient is used to judge the emphasis of the text indexed by the UCL, and several similarity features are combined to supplement the semantic features of the texts. Finally, fixed-length matching vectors of the two texts are obtained through a multilayer graph convolutional neural network, and the final similarity is obtained through a multilayer perceptron.
Specifically, the invention provides the following technical scheme:
a text similarity calculation method based on a co-occurrence entity interaction graph comprises the following steps:
step 1, co-occurrence entity interaction graph construction
Obtaining key entities which commonly appear in matched text pairs by a UCL semantic weight calculation method, and aggregating the key entities appearing in the same sentence to the same node; similarity calculation is carried out on each sentence in the text and the nodes, and the sentences are placed in the nodes with the highest similarity; after sentence grouping is completed, semantic association is formed between nodes by calculating the similarity between the nodes as edge weight, so that the construction of a co-occurrence entity interactive graph is completed;
step 2, node vector generation
Calculating the similarity of sentences from different texts in the node, and splicing the sentences from the same text; firstly, generating hidden feature vectors with fixed sizes at each node, and then generating similarity vectors by utilizing the similarity of entity features among texts;
step 3, node feature aggregation
After the feature vector of each node is obtained, feature conversion is carried out by using a graph convolution neural network GCN, so that a final matching vector is obtained; generating an embedded representation at a node level by encoding information about a neighborhood of nodes; and obtaining a fixed length vector representing the text level similarity by using the average value of the hidden vectors of all the nodes in the last layer obtained by GCN training, and finally calculating the similarity by using a multilayer perceptron.
Preferably, the step 1 specifically includes the following substeps:
substep 1-1, key entity acquisition
Acquiring entities in the text by using a named entity recognition technology, and screening out key entities through entity semantic weight filtering; the entity semantic weight is jointly determined by the appearance frequency of an entity, the position of the entity and the context where the entity is located;
substep 1-2, common entity aggregation
Aggregating key entities in two texts which both exist in the same sentence to a node; a node should contain one or more keywords, and a keyword may also appear in multiple nodes;
substeps 1-3, entity node statement assignment
Similarity calculation is carried out between each sentence in the text and the nodes, and each sentence is placed in the node with the highest similarity; the similarity is the cosine similarity; sentences that do not contain any key entity need not be calculated, and entities that are key in only one of the texts are not considered; let the node vector and the text sentence vector be X and Y respectively, then the cosine similarity is computed as in formula 5:

$$\cos(X,Y)=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^{2}}\,\sqrt{\sum_{i=1}^{n}y_i^{2}}} \qquad \text{(formula 5)}$$

where n is the dimension of the vectors and $x_i$, $y_i$ are the i-th components of the vectors;
substeps 1-4, calculating similarity weight between nodes
After the grouping of sentences is completed, the similarity between nodes is calculated as the edge weight so as to associate nodes with one another; the TF-IDF similarity between the sentence sets of any two nodes is selected as the weight value; the term frequency TF is computed as in formula 6:

$$TF(d,w)=\frac{n_{d,w}}{\sum_{k}n_{d,k}} \qquad \text{(formula 6)}$$

where $n_{d,w}$ is the number of times word w appears in document d and the denominator $\sum_{k}n_{d,k}$ is the total number of words in the document; the inverse document frequency is computed as in formula 7:

$$IDF(d,w)=\log\frac{|D|}{D_w} \qquad \text{(formula 7)}$$

where $D_w$ is the number of documents containing the word w and $|D|$ is the total number of documents; combining formulas 6 and 7 gives the TF-IDF formula 8:

$$TF\text{-}IDF(d,w)=TF(d,w)\times IDF(d,w) \qquad \text{(formula 8)}$$

After the feature-word weights are calculated, the texts are vectorized and the distance between the text vectors is computed; the smaller the distance, the higher the text similarity.
Preferably, the substep 1-1 comprises the following processes:
(1) The frequency of occurrence of each entity in the text is calculated as in formula 1:

$$freq(e_i)=\frac{count(e_i)}{\sum_{j}count(e_j)} \qquad \text{(formula 1)}$$

where $count(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the number of occurrences of all entities; after the word frequencies are calculated, entities with low frequency are filtered out;
(2) The position of an entity is distinguished, and different regions of the text score differently; the position weight is denoted $location(e_i)$ as in formula 2, where P is the number of paragraphs of the text and p is the index of the paragraph in which the current occurrence of entity $e_i$ lies; when the text has no more than two paragraphs, $location(e_i)$ is a fixed value; when the text has more than two paragraphs, entities in the first and last paragraph score the same, and the other paragraphs score one quarter of that value:

$$location(e_i)=\begin{cases}1, & P\le 2 \text{ or } p\in\{1,P\}\\ 1/4, & \text{otherwise}\end{cases} \qquad \text{(formula 2)}$$

(3) A set of central sentences is extracted with the TextRank algorithm and denoted $Sentence=\{s_1,s_2,\dots,s_n\}$, where each $s_t$ is a central sentence; as in formula 3, n is the number of central sentences and $I(e_i\in s_t)$ is an indicator function stating whether entity $e_i$ appears in the central sentence $s_t$:

$$center(e_i)=\frac{1}{n}\sum_{t=1}^{n}I(e_i\in s_t) \qquad \text{(formula 3)}$$

(4) After the three weight components are calculated, they are combined into the entity semantic weight formula for UCL, shown as formula 4:

$$EW(e_i)=Avg(location(e_i))\times\big(\eta\cdot freq(e_i)+(1-\eta)\cdot center(e_i)\big) \qquad \text{(formula 4)}$$

where $\eta$ is a tuning parameter and $Avg(location(e_i))$ is the average position weight of the entity; after $EW(e_i)$ has been computed for all entities, the UCL semantic weight of each entity is obtained through normalization, and only entities whose semantic weight reaches a certain threshold are retained to form the subsequent entity nodes.
Preferably, the step 2 specifically includes the following substeps:
substep 2-1, after the similarity measurement between nodes is completed, the similarity of sentences from different news texts within a node is calculated, so that each node can be represented by a feature vector; the sentences from the same text are concatenated into a text set, giving the sets $S_A(v)$ and $S_B(v)$ for the sentences of the two texts respectively;
Substep 2-2, taking node v as an example, $S_A(v)$ and $S_B(v)$ are used as input to the interaction model, whose weight-sharing embedding layer encodes each set into a context vector; the context layer typically contains one or more bidirectional LSTM layers or CNN layers with max pooling, intended to capture the context information in $S_A(v)$ and $S_B(v)$; let $Vec_A(v)$ and $Vec_B(v)$ denote the context vectors obtained from $S_A(v)$ and $S_B(v)$; the aggregation layer then gives the vector representation $m_{AB}(v)$ of the node, computed by concatenating the element-wise distance of the two context vectors with their Hadamard product, as shown in formula 9:

$$m_{AB}(v)=\big(\,|Vec_A(v)-Vec_B(v)|\;,\;Vec_A(v)\odot Vec_B(v)\,\big) \qquad \text{(formula 9)}$$

where $\odot$ denotes the Hadamard product;
substeps 2-3, the similarity of the entity features between the texts in $S_A(v)$ and $S_B(v)$ is calculated to generate another node vector $m'_{AB}(v)$ for each node; this step obtains three similarity measures; the TF-IDF cosine similarity is computed with formulas 6, 7 and 8;
and a substep 2-4 of calculating the emphasis coefficient, which measures the emphasized entities in the text and remedies the neglect of emphasized entities caused by considering semantic weight alone, as shown in formula 10:

$$EC_{doc}(e_i)=\frac{N_{DOC}}{\sum_{j=1}^{N_{DOC}} I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)+1} \qquad \text{(formula 10)}$$

where $EC_{doc}(e_i)$ is the emphasis coefficient of entity $e_i$ in text doc, $EW_{doc}(e_i)$ is the semantic weight of entity $e_i$ in text doc, computed as in formula 4, $N_{DOC}$ is the total number of news texts in the knowledge space, and $I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)$ is an indicator function counting the number of texts in which $EW_j(e_i)\ge EW_{doc}(e_i)$;
the average emphasis coefficient is derived from the emphasis coefficients of single entities, and for a sentence the average emphasis coefficient is calculated with formula 11:

$$AvgEC_A(sentence)=\sum_{i=1}^{num}EC_A(e_i)\cdot\frac{N_i}{N_{sentence}} \qquad \text{(formula 11)}$$

where $AvgEC_A(sentence)$ is the average emphasis coefficient of the sentence in text A, num is the number of distinct entities in the sentence that have an emphasis coefficient, $EC_A(e_i)$ is the emphasis coefficient of $e_i$ in text A, $N_i$ is the number of occurrences of $e_i$ in the sentence, and $N_{sentence}$ is the total number of occurrences of all entities in the sentence;
substep 2-5, the Jaccard similarity considering sentence length is calculated, with the difference in sentence length added to the denominator as a penalty term, as shown in formula 12:

$$Jaccard(A,B)=\frac{len(A\cap B)}{len(A\cup B)+\alpha\cdot\big|len(A)-len(B)\big|} \qquad \text{(formula 12)}$$

where $\alpha$ is a hyperparameter adjusting the influence of the sentence-length difference, $len(A\cap B)$ is the number of entities shared by the two texts, and $len(A\cup B)$ is the size of the union of the entities of the two texts;
after these results are calculated, they are concatenated to obtain the similarity vector $m'_{AB}(v)$ of node v.
Preferably, the step 3 specifically includes the following sub-steps:
substep 3-1, generating a matching vector
The input of the GCN is the co-occurrence entity interaction graph $G_{AB}$ obtained above, containing N nodes and weighted edges between nodes; each node $V_i$ carries a matching vector $M(v_i)$ formed by concatenating the two node vectors obtained above, as shown in formula 13:

$$M(v_i)=\big(m_{AB}(v_i),\,m'_{AB}(v_i)\big) \qquad \text{(formula 13)}$$
Substep 3-2, generating text level similarity fixed length vector
The weighted adjacency matrix corresponding to graph $G_{AB}$ is $Y\in\mathbb{R}^{N\times N}$, where each entry $Y_{ij}$ is the TF-IDF similarity between nodes $V_i$ and $V_j$, computed previously; let D be the diagonal matrix satisfying $D_{ii}=\sum_j Y_{ij}$; the input layer of the GCN is $H^{(0)}=M$, the initial node features, and $H^{(n)}\in\mathbb{R}^{N\times d_n}$ denotes the hidden representation matrix of the n-th layer; each layer passes through the graph convolution filter of formula 14 to obtain the hidden representation of the next layer:

$$H^{(n+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\,\tilde{Y}\,\tilde{D}^{-\frac{1}{2}}\,H^{(n)}\,W^{(n)}\Big) \qquad \text{(formula 14)}$$

where $\tilde{Y}=Y+I_N$ and $I_N$ is the identity matrix; similarly to D, $\tilde{D}$ is also a diagonal matrix, satisfying $\tilde{D}_{ii}=\sum_j\tilde{Y}_{ij}$; $W^{(n)}$ is the trainable weight matrix of the n-th layer; σ represents an activation function;
finally, taking the average value of the hidden vectors of all the nodes in the last layer to obtain a fixed-length vector representing the similarity of the text level;
substep 3-3, finally calculating a matching score through a multilayer perceptron, wherein the classification module comprises a linear layer, a ReLU layer, another linear layer and a final Sigmoid activation function layer; the resulting score is used to judge whether the texts are similar.
The invention also provides a text similarity calculation device based on the co-occurrence entity interaction graph, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the text similarity calculation method based on the co-occurrence entity interaction graph when being loaded to the processor.
Preferably, the computer program comprises a co-occurrence entity interaction graph construction module, a node vector generation module and a node feature aggregation module. The co-occurrence entity interaction graph construction module acquires the co-occurring entities in the text pair, splits the texts by means of these entities, and attaches each sentence of the long texts to the co-occurrence entity node with the highest similarity; the module then completes the similarity measurement between nodes, thereby constructing the interaction graph. The node vector generation module generates the feature vector of each entity node, comprising a hidden feature vector of fixed size and a similarity vector generated from the three similarity metrics, which serves as the initial input of the node feature aggregation module. The node feature aggregation module uses the GCN to aggregate the features of all nodes of the input co-occurrence entity interaction graph into the final fixed-length vector representing the text-level similarity, and finally completes the similarity calculation through a perceptron.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the invention uses graph structure to represent the entity relation between two texts, and completes the effective splitting of long texts through co-occurrence entities, thereby solving the problem of vector space resource waste in the matching process of the traditional method.
(2) The invention utilizes the characteristics of entity semantic weight, weighting coefficient and the like in the text to enhance the interpretability of the text similarity calculation semantic level on the basis of the traditional method.
(3) The invention introduces the concept of a co-occurrence entity interaction graph and uses a graph convolutional neural network to aggregate the features of the nodes in the graph effectively. This overcomes the poor long-text similarity calculation capability of existing methods and improves the accuracy of text similarity calculation.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a diagram illustrating a co-occurrence entity node according to an embodiment of the present invention.
Fig. 3 is an example of a co-occurrence entity interaction diagram according to an embodiment of the present invention.
Fig. 4 is a structural diagram of a node vector generation module according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a node feature aggregation module according to an embodiment of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
As shown in fig. 1, a text similarity calculation method based on a co-occurrence entity interaction diagram disclosed in the embodiment of the present invention includes the following specific implementation steps:
Step 1, constructing the co-occurrence entity interaction graph. Entities are extracted and key entities are obtained with an entity recognition method, and key entities that appear in the same sentence are aggregated into nodes; each sentence is then attached to its most relevant node through similarity calculation; finally, the similarity between nodes is used as the weight of the edges in the graph, completing the construction of the co-occurrence entity interaction graph. The specific steps are as follows:
substep 1-1, key entity acquisition. And acquiring entities in the text by using a named entity recognition technology, and screening out key entities through entity semantic weight filtering. The entity semantic weight is determined by the appearance frequency of the entity, the position of the entity and the context where the entity is located. The acquisition steps are as follows:
(1) The frequency of occurrence of each entity in the text is calculated as in formula 1, where $count(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the number of occurrences of all entities. After the word frequencies are calculated, entities with low frequency are filtered out to simplify the subsequent computation.

$$freq(e_i)=\frac{count(e_i)}{\sum_{j}count(e_j)} \qquad \text{(formula 1)}$$
(2) The position of an entity is taken into account: different regions of the text score differently, and the position weight is denoted $location(e_i)$. As in formula 2, P is the number of paragraphs of the text and p is the index of the paragraph in which the current occurrence of entity $e_i$ lies; when the text has no more than two paragraphs, $location(e_i)$ is a fixed value; when the text has more than two paragraphs, entities in the first and last paragraph score the same and entities in the other paragraphs score one quarter of that value.

$$location(e_i)=\begin{cases}1, & P\le 2 \text{ or } p\in\{1,P\}\\ 1/4, & \text{otherwise}\end{cases} \qquad \text{(formula 2)}$$
(3) A set of central sentences is extracted with the TextRank algorithm and denoted $Sentence=\{s_1,s_2,\dots,s_n\}$, where each $s_t$ is a central sentence made up of entities. The more often an entity appears in different central sentences, the higher its relative weight. As in formula 3, n is the number of central sentences and $I(e_i\in s_t)$ is an indicator function stating whether entity $e_i$ appears in the central sentence $s_t$.

$$center(e_i)=\frac{1}{n}\sum_{t=1}^{n}I(e_i\in s_t) \qquad \text{(formula 3)}$$
(4) After the weight parameter values of the three parts are calculated, an entity semantic weight calculation formula in UCL is provided after combination, and the formula is shown as formula 4:
$$EW(e_i)=Avg(location(e_i))\times\big(\eta\cdot freq(e_i)+(1-\eta)\cdot center(e_i)\big) \qquad \text{(formula 4)}$$

where $\eta$ is a tuning parameter in the range 0 to 1. $Avg(location(e_i))$ is the weighted average of the position weights of the entity; it is used because the same entity may appear many times at different positions in the text, so the frequency of the entity at each position serves as the weight. After $EW(e_i)$ has been computed for all entities, the UCL semantic weight of each entity is obtained through normalization, and only entities whose semantic weight reaches a certain threshold are retained to form the subsequent entity nodes.
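The key-entity weighting of sub-step 1-1 can be summarized in a short sketch. The following Python snippet is only an illustrative implementation of formulas 1-4 under the assumptions stated in the comments; the constants used for the location weight, the default value of eta, and all helper names are not prescribed by this embodiment.

```python
# Minimal sketch of formulas 1-4 (entity semantic weight). Named entity
# recognition, paragraph splitting and TextRank central-sentence extraction
# are assumed to be done elsewhere.
from collections import Counter

def entity_semantic_weights(entities, paragraphs, central_sentences, eta=0.5):
    """entities: list of (entity, paragraph_index) occurrences.
    paragraphs: list of paragraph texts (P = len(paragraphs)).
    central_sentences: sentences returned by TextRank."""
    P = len(paragraphs)
    occurrences = Counter(e for e, _ in entities)
    total = sum(occurrences.values())

    def location(p):                       # formula 2: first/last paragraph vs. others
        if P <= 2 or p == 0 or p == P - 1:
            return 1.0                     # fixed value (illustrative choice)
        return 0.25                        # one quarter of the first/last score

    loc_sum = Counter()                    # an entity may occur in several paragraphs
    for e, p in entities:
        loc_sum[e] += location(p)

    weights = {}
    for e, cnt in occurrences.items():
        freq = cnt / total                                              # formula 1
        center = sum(e in s for s in central_sentences) / max(len(central_sentences), 1)  # formula 3
        avg_loc = loc_sum[e] / cnt                                      # Avg(location(e))
        weights[e] = avg_loc * (eta * freq + (1 - eta) * center)        # formula 4

    norm = sum(weights.values()) or 1.0
    return {e: w / norm for e, w in weights.items()}                    # normalized UCL semantic weight
```

In practice only entities whose normalized weight exceeds the chosen threshold would be kept for node construction, as described above.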
Substep 1-2, common entity aggregation. Key entities that occur together in the same sentence in both texts are aggregated onto one node. A node should contain one or more key entities, and a key entity may also appear in multiple nodes. As shown in fig. 2, the entity pairs "Lakers" and "Schroeder", "Lakers" and "Nets", and "Nets", "Irving" and "Durant" all satisfy the above condition and constitute three nodes formed by aggregating the common entities.
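As an illustration of sub-step 1-2, the following sketch groups the shared key entities by the sentences in which they co-occur; the function and variable names are illustrative only, and simple substring matching stands in for whatever entity-mention matching an implementation would actually use.

```python
# Sketch of sub-step 1-2: every set of shared key entities that co-occur in one
# sentence of either text becomes a node (a single entity may also form a node,
# and one entity may belong to several nodes).
def build_entity_nodes(sentences_a, sentences_b, shared_entities):
    """sentences_*: lists of sentences; shared_entities: key entities common to both texts."""
    nodes = set()
    for sentence in sentences_a + sentences_b:
        co_occurring = frozenset(e for e in shared_entities if e in sentence)
        if co_occurring:                       # keep nodes with one or more key entities
            nodes.add(co_occurring)
    return [set(n) for n in nodes]
```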
Substep 1-3, entity node sentence assignment. Similarity calculation is carried out between each sentence in the text and the nodes, and the sentence is placed in the node with the highest similarity. The similarity is the cosine similarity; sentences that do not contain any key entity need not be calculated, and entities that are key in only one of the texts are not considered. Let the node vector and the sentence vector be X and Y respectively; the cosine similarity is computed as in formula 5, where n is the dimension of the vectors and $x_i$, $y_i$ are the i-th components.

$$\cos(X,Y)=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^{2}}\,\sqrt{\sum_{i=1}^{n}y_i^{2}}} \qquad \text{(formula 5)}$$
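A minimal sketch of sub-step 1-3 follows. How the node and sentence vectors are produced (bag-of-words, averaged word embeddings, etc.) is left open by this embodiment and is assumed to be available.

```python
# Sketch of formula 5 and the sentence-to-node assignment of sub-step 1-3.
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

def assign_sentence(sentence_vec, node_vecs):
    """Return the index of the node with the highest cosine similarity."""
    return max(range(len(node_vecs)), key=lambda i: cosine(sentence_vec, node_vecs[i]))
```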
And substep 1-4, calculating similarity weights among nodes. After the grouping of sentences is completed, the similarity between nodes must be calculated as the edge weight so as to associate the nodes with one another. The TF-IDF similarity between the sentence sets of any two nodes is selected as the weight value. Although the edge weight could be determined by other methods, experiments with many classical methods show that using the TF-IDF similarity as the edge weight makes the generated co-occurrence entity interaction graph more closely connected. The idea of TF-IDF is that a word which makes up a high proportion of one text but rarely occurs in other texts is more important at the semantic level of that text.
$$TF(d,w)=\frac{n_{d,w}}{\sum_{k}n_{d,k}} \qquad \text{(formula 6)}$$

$$IDF(d,w)=\log\frac{|D|}{D_w} \qquad \text{(formula 7)}$$

$$TF\text{-}IDF(d,w)=TF(d,w)\times IDF(d,w) \qquad \text{(formula 8)}$$

The term frequency TF is computed as in formula 6, where $n_{d,w}$ is the number of times word w appears in document d and the denominator $\sum_{k}n_{d,k}$ is the total number of words in the document. The inverse document frequency is computed as in formula 7, where $D_w$ is the number of documents containing the word w and $|D|$ is the total number of documents. Combining formulas 6 and 7 gives the TF-IDF formula 8. After the feature-word weights are calculated, the texts are vectorized and the distance between the text vectors is computed; the smaller the distance, the higher the text similarity. The co-occurrence entity interaction graph generated through the above sub-steps is shown in fig. 3.
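The edge-weight computation of sub-step 1-4 can be sketched as follows. scikit-learn is used here purely for illustration; the embodiment does not prescribe a particular TF-IDF implementation or tokenizer.

```python
# Sketch of formulas 6-8: TF-IDF cosine similarity between the sentence sets of
# two nodes, used as the edge weight of the co-occurrence entity interaction graph.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def edge_weight(node_sentences_1, node_sentences_2):
    """Each argument is the list of sentences grouped under one node."""
    docs = [" ".join(node_sentences_1), " ".join(node_sentences_2)]
    tfidf = TfidfVectorizer().fit_transform(docs)     # TF (formula 6) x IDF (formula 7)
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```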
Step 2, node vector generation. On each node, the sentences from the same text are concatenated into a whole, giving two sentence sets from the two texts; these are fed into an interaction model to obtain a fixed-length hidden feature vector $m_{AB}(v)$. Subsequently, three metrics, namely the TF-IDF cosine similarity, the average emphasis coefficient similarity and the Jaccard similarity considering sentence length, are concatenated to obtain the similarity vector $m'_{AB}(v)$ of the node. The overall flow is shown in fig. 4, and the specific steps are as follows:
in sub-step 2-1, after the similarity measure between the nodes is completed, the similarity of sentences from different news in the nodes needs to be calculated, so that each node is characterized by a vector. The sentences from the same text are spliced into a text set, and the sets S are respectively obtained for the sentences in the two textsA(v)、SB(v)。
Substep 2-2, taking node v as an example, $S_A(v)$ and $S_B(v)$ are used as input to the interaction model, whose weight-sharing embedding layer encodes each set into a context vector. The context layer typically comprises one or more bidirectional LSTM (BiLSTM) layers or CNN layers with max pooling, intended to capture the context information in $S_A(v)$ and $S_B(v)$. Let $Vec_A(v)$ and $Vec_B(v)$ denote the context vectors obtained from $S_A(v)$ and $S_B(v)$. The aggregation layer then gives the vector representation $m_{AB}(v)$ of the node, computed by concatenating the element-wise distance of the two context vectors with their Hadamard product, as shown in formula 9, where $\odot$ denotes the Hadamard product.

$$m_{AB}(v)=\big(\,|Vec_A(v)-Vec_B(v)|\;,\;Vec_A(v)\odot Vec_B(v)\,\big) \qquad \text{(formula 9)}$$
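The interaction model of sub-step 2-2 can be sketched in PyTorch as below. The layer sizes, the tokenization, and the choice of max pooling over a BiLSTM (rather than a CNN context layer) are illustrative assumptions within the options described above.

```python
# Sketch of the node-vector generation of formula 9: a weight-sharing embedding
# plus BiLSTM context layer encodes S_A(v) and S_B(v); the aggregation layer
# concatenates the element-wise distance with the Hadamard product.
import torch
import torch.nn as nn

class NodeInteraction(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # shared weights for both inputs
        self.context = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def encode(self, token_ids):
        h, _ = self.context(self.embed(token_ids))               # (batch, seq, 2*hidden)
        return h.max(dim=1).values                               # max pooling over time

    def forward(self, tokens_a, tokens_b):
        vec_a, vec_b = self.encode(tokens_a), self.encode(tokens_b)
        return torch.cat([(vec_a - vec_b).abs(), vec_a * vec_b], dim=-1)   # formula 9
```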
Substep 2-3, the similarity of the entity features between the texts in $S_A(v)$ and $S_B(v)$ is calculated to generate another node vector $m'_{AB}(v)$ for each node. This step obtains three similarity measures. The TF-IDF cosine similarity is computed with formulas 6, 7 and 8.
And a substep 2-4 of calculating the emphasis coefficient. The method measures the emphasized entities in a text through an Emphasis Coefficient (EC), which remedies the neglect of emphasized entities caused by considering semantic weight alone, as shown in formula 10:

$$EC_{doc}(e_i)=\frac{N_{DOC}}{\sum_{j=1}^{N_{DOC}} I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)+1} \qquad \text{(formula 10)}$$

where $EC_{doc}(e_i)$ is the emphasis coefficient of entity $e_i$ in text doc, $EW_{doc}(e_i)$ is the semantic weight of entity $e_i$ in text doc, computed as in formula 4, $N_{DOC}$ is the total number of news texts in the knowledge space, and $I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)$ is an indicator function: through it the number of texts in the whole knowledge space in which the weight of entity $e_i$ is greater than or equal to its weight in this report is counted. To keep the formula well defined, 1 is added to the denominator.
The average emphasis coefficient is derived from the emphasis coefficients of single entities; for a sentence, the average emphasis coefficient is calculated with formula 11:

$$AvgEC_A(sentence)=\sum_{i=1}^{num}EC_A(e_i)\cdot\frac{N_i}{N_{sentence}} \qquad \text{(formula 11)}$$

where $AvgEC_A(sentence)$ is the average emphasis coefficient of the sentence in text A, num is the number of distinct entities in the sentence that have an emphasis coefficient, $EC_A(e_i)$ is the emphasis coefficient of $e_i$ in text A, $N_i$ is the number of occurrences of $e_i$ in the sentence, and $N_{sentence}$ is the total number of occurrences of all entities in the sentence.
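Formulas 10 and 11 can be sketched as follows. The data layout (a dictionary of per-document entity semantic weights) and the function names are assumptions made only for illustration.

```python
# Sketch of the emphasis coefficient (formula 10) and its sentence-level
# average (formula 11); the "+1" keeps the denominator non-zero.
def emphasis_coefficient(entity, doc_id, ew_by_doc):
    """ew_by_doc: {doc_id: {entity: semantic weight}} over the knowledge space."""
    ew_here = ew_by_doc[doc_id].get(entity, 0.0)
    n_doc = len(ew_by_doc)
    n_ge = sum(1 for ew in ew_by_doc.values()
               if ew.get(entity, 0.0) >= ew_here)        # indicator I(EW_j >= EW_doc)
    return n_doc / (n_ge + 1)                            # formula 10

def average_emphasis(sentence_entities, doc_id, ew_by_doc):
    """sentence_entities: {entity: occurrence count in the sentence}."""
    n_total = sum(sentence_entities.values()) or 1
    return sum(emphasis_coefficient(e, doc_id, ew_by_doc) * n / n_total
               for e, n in sentence_entities.items())    # formula 11
```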
Substep 2-5, calculating the Jaccard similarity that takes sentence length into account. The classical Jaccard similarity computes the ratio of the intersection to the union of two sets. However, although the vertices of the co-occurrence entity interaction graph are composed of sentences from two texts, most of these sentences may contain the same entities, so directly computing the Jaccard similarity would lose meaning. The difference in sentence length is therefore added to the denominator as a penalty term, as shown in formula 12, where $\alpha$ is a hyperparameter adjusting the influence of the sentence-length difference, $len(A\cap B)$ is the number of entities shared by the two texts, and $len(A\cup B)$ is the size of the union of the entities of the two texts.

$$Jaccard(A,B)=\frac{len(A\cap B)}{len(A\cup B)+\alpha\cdot\big|len(A)-len(B)\big|} \qquad \text{(formula 12)}$$
After these results are calculated, they are concatenated to obtain the similarity vector $m'_{AB}(v)$ of node v.
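The length-aware Jaccard similarity of formula 12 admits a very small sketch; the default value of alpha is illustrative only.

```python
# Sketch of formula 12: Jaccard similarity with a sentence-length penalty.
def length_aware_jaccard(entities_a, entities_b, alpha=0.5):
    inter = len(entities_a & entities_b)                       # len(A ∩ B)
    union = len(entities_a | entities_b)                       # len(A ∪ B)
    penalty = alpha * abs(len(entities_a) - len(entities_b))   # length-difference penalty
    return inter / (union + penalty) if union else 0.0
```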
Step 3, node feature aggregation. After the feature vector of each node is obtained, a graph convolutional neural network (GCN) is used for feature transformation, thereby obtaining the final matching vector used to measure the similarity between the texts.
Substep 3-1, generating the matching vectors. A GCN accepts graph structures and feature vectors of arbitrary size and shape, and generates node-level embedded representations by encoding the neighborhood information of each node. The GCN input in the invention is the co-occurrence entity interaction graph $G_{AB}$ obtained above, containing N nodes and weighted edges between nodes. Each node $V_i$ carries a matching vector $M(v_i)$ formed by concatenating the two node vectors obtained above, as shown in formula 13:

$$M(v_i)=\big(m_{AB}(v_i),\,m'_{AB}(v_i)\big) \qquad \text{(formula 13)}$$
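Assembling the GCN input of sub-step 3-1 can be sketched as follows; the array layout is an assumption made for illustration.

```python
# Sketch of formula 13 and the weighted adjacency matrix: the initial node
# features H^(0) are the concatenated matching vectors, and Y holds the
# previously computed TF-IDF similarities between nodes.
import numpy as np

def build_gcn_inputs(m_vectors, m_prime_vectors, edge_weights, n_nodes):
    """m_vectors, m_prime_vectors: (n_nodes, d) arrays;
    edge_weights: {(i, j): TF-IDF similarity between node i and node j}."""
    features = np.concatenate([m_vectors, m_prime_vectors], axis=1)   # H^(0) = M
    Y = np.zeros((n_nodes, n_nodes))
    for (i, j), w in edge_weights.items():
        Y[i, j] = Y[j, i] = w
    return features, Y
```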
And a substep 3-2 of generating the text-level similarity fixed-length vector. The weighted adjacency matrix corresponding to graph $G_{AB}$ is $Y\in\mathbb{R}^{N\times N}$, where each entry $Y_{ij}$ is the TF-IDF similarity between nodes $V_i$ and $V_j$, computed previously. Let D be the diagonal matrix satisfying $D_{ii}=\sum_j Y_{ij}$. The input layer of the GCN is $H^{(0)}=M$, the initial node features, and $H^{(n)}\in\mathbb{R}^{N\times d_n}$ denotes the hidden representation matrix of the n-th layer. Each layer passes through the graph convolution filter of formula 14 to obtain the hidden representation of the next layer.

$$H^{(n+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\,\tilde{Y}\,\tilde{D}^{-\frac{1}{2}}\,H^{(n)}\,W^{(n)}\Big) \qquad \text{(formula 14)}$$

where $\tilde{Y}=Y+I_N$ and $I_N$ is the identity matrix; similarly to D, $\tilde{D}$ is also a diagonal matrix, satisfying $\tilde{D}_{ii}=\sum_j\tilde{Y}_{ij}$; $W^{(n)}$ is the trainable weight matrix of the n-th layer; σ represents an activation function, and the sigmoid function is adopted in the invention.
And finally, taking the average value of the hidden vectors of all the nodes in the last layer to obtain a fixed-length vector representing the similarity of the text level. In the invention, the number of GCN layers can be 2 or 3. The output size of the last layer of GCN is set to be constant 16 and the other layers are set to be 128.
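The propagation rule of formula 14 and the final mean pooling can be sketched as below. The layer sizes (hidden 128, final 16) follow the embodiment, while the random weight initialization stands in for the trained matrices $W^{(n)}$ and is for illustration only.

```python
# Sketch of formula 14 with the renormalization trick, followed by mean pooling
# over nodes to obtain the fixed-length text-level vector.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_forward(H, Y, layer_dims=(128, 16), rng=np.random.default_rng(0)):
    Y_tilde = Y + np.eye(Y.shape[0])                        # add self-loops: Y~ = Y + I_N
    d_inv_sqrt = np.diag(1.0 / np.sqrt(Y_tilde.sum(axis=1)))
    A_hat = d_inv_sqrt @ Y_tilde @ d_inv_sqrt               # D~^{-1/2} Y~ D~^{-1/2}
    for dim in layer_dims:
        W = rng.standard_normal((H.shape[1], dim)) * 0.1    # stands in for trainable W^(n)
        H = sigmoid(A_hat @ H @ W)                          # formula 14
    return H.mean(axis=0)                                   # fixed-length text-level vector
```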
And a substep 3-3 of finally calculating the matching score through a multilayer perceptron: the classification module comprises a linear layer with output size 16, a ReLU layer, another linear layer and a final Sigmoid activation layer, and the resulting score is used to judge whether the texts are similar. The general flow of step 3 is shown in fig. 5.
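The classification module of sub-step 3-3 corresponds to a small feed-forward head. The following PyTorch sketch assumes the 16-dimensional GCN output described above; the output size of the second linear layer is an illustrative assumption, since the embodiment does not state it.

```python
# Sketch of the classification module: linear(16) -> ReLU -> linear -> Sigmoid.
import torch.nn as nn

matching_head = nn.Sequential(
    nn.Linear(16, 16),   # first linear layer, output size 16 as in the embodiment
    nn.ReLU(),
    nn.Linear(16, 1),    # second linear layer (output size 1 assumed here)
    nn.Sigmoid(),        # similarity score in (0, 1)
)
```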
Based on the same inventive concept, the invention further provides a text similarity calculation device based on the co-occurrence entity interaction graph, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the above text similarity calculation method based on the co-occurrence entity interaction graph. The device comprises a co-occurrence entity interaction graph construction module, a node vector generation module and a node feature aggregation module. The co-occurrence entity interaction graph construction module acquires the co-occurring entities in the text pair, splits the texts by means of these entities, and attaches each sentence of the long texts to the co-occurrence entity node with the highest similarity; the module then completes the similarity measurement between nodes and constructs the interaction graph, implementing the content of step 1. The node vector generation module generates the feature vector of each entity node, comprising a hidden feature vector of fixed size and a similarity vector generated from the three similarity metrics, which serves as the initial input of the node feature aggregation module, implementing the content of step 2. The node feature aggregation module uses the GCN to aggregate the features of all nodes of the input co-occurrence entity interaction graph into the final fixed-length vector representing the text-level similarity, and finally completes the similarity calculation through a perceptron, implementing the content of step 3.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (7)

1. A text similarity calculation method based on a co-occurrence entity interaction graph is characterized by comprising the following steps:
step 1, co-occurrence entity interaction graph construction
Obtaining key entities which commonly appear in matched text pairs by a UCL semantic weight calculation method, and aggregating the key entities appearing in the same sentence to the same node; similarity calculation is carried out on each sentence in the text and the nodes, and the sentences are placed in the nodes with the highest similarity; after sentence grouping is completed, semantic association is formed between nodes by calculating the similarity between the nodes as edge weight, so that the construction of a co-occurrence entity interactive graph is completed;
step 2, node vector generation
Calculating the similarity of sentences from different texts in the node, and splicing the sentences from the same text; firstly, generating hidden feature vectors with fixed sizes at each node, and then generating similarity vectors by utilizing the similarity of entity features among texts;
step 3, node feature aggregation
After the feature vector of each node is obtained, feature conversion is carried out by using a graph convolution neural network GCN, so that a final matching vector is obtained; generating an embedded representation at a node level by encoding information about a neighborhood of nodes; and obtaining a fixed length vector representing the text level similarity by using the average value of the hidden vectors of all the nodes in the last layer obtained by GCN training, and finally calculating the similarity by using a multilayer perceptron.
2. The method for calculating text similarity based on the co-occurrence entity interaction graph according to claim 1, wherein the step 1 specifically comprises the following sub-steps:
substep 1-1, key entity acquisition
Acquiring entities in the text by using a named entity recognition technology, and screening out key entities through entity semantic weight filtering; the entity semantic weight is jointly determined by the appearance frequency of an entity, the position of the entity and the context where the entity is located;
substep 1-2, common entity aggregation
Aggregating key entities in two texts which both exist in the same sentence to a node; a node should contain one or more keywords, and a keyword may also appear in multiple nodes;
substeps 1-3, entity node statement assignment
Similarity calculation is carried out between each sentence in the text and the nodes, and each sentence is placed in the node with the highest similarity; the similarity is the cosine similarity; sentences that do not contain any key entity need not be calculated, and entities that are key in only one of the texts are not considered; let the node vector and the text sentence vector be X and Y respectively, then the cosine similarity is computed as in formula 5:

$$\cos(X,Y)=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^{2}}\,\sqrt{\sum_{i=1}^{n}y_i^{2}}} \qquad \text{(formula 5)}$$

where n is the dimension of the vectors and $x_i$, $y_i$ are the i-th components of the vectors;
substeps 1-4, calculating similarity weight between nodes
After the grouping of sentences is completed, the similarity between nodes is calculated as the edge weight so as to associate nodes with one another; the TF-IDF similarity between the sentence sets of any two nodes is selected as the weight value; the term frequency TF is computed as in formula 6:

$$TF(d,w)=\frac{n_{d,w}}{\sum_{k}n_{d,k}} \qquad \text{(formula 6)}$$

where $n_{d,w}$ is the number of times word w appears in document d and the denominator $\sum_{k}n_{d,k}$ is the total number of words in the document; the inverse document frequency is computed as in formula 7:

$$IDF(d,w)=\log\frac{|D|}{D_w} \qquad \text{(formula 7)}$$

where $D_w$ is the number of documents containing the word w and $|D|$ is the total number of documents; combining formulas 6 and 7 gives the TF-IDF formula 8:

$$TF\text{-}IDF(d,w)=TF(d,w)\times IDF(d,w) \qquad \text{(formula 8)}$$

After the feature-word weights are calculated, the texts are vectorized and the distance between the text vectors is computed; the smaller the distance, the higher the text similarity.
3. The co-occurrence entity interaction graph-based text similarity calculation method according to claim 2, wherein the sub-step 1-1 comprises the following procedures:
(1) The frequency of occurrence of each entity in the text is calculated as in formula 1:

$$freq(e_i)=\frac{count(e_i)}{\sum_{j}count(e_j)} \qquad \text{(formula 1)}$$

where $count(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the number of occurrences of all entities; after the word frequencies are calculated, entities with low frequency are filtered out;
(2) The position of an entity is distinguished, and different regions of the text score differently; the position weight is denoted $location(e_i)$ as in formula 2, where P is the number of paragraphs of the text and p is the index of the paragraph in which the current occurrence of entity $e_i$ lies; when the text has no more than two paragraphs, $location(e_i)$ is a fixed value; when the text has more than two paragraphs, entities in the first and last paragraph score the same, and the other paragraphs score one quarter of that value:

$$location(e_i)=\begin{cases}1, & P\le 2 \text{ or } p\in\{1,P\}\\ 1/4, & \text{otherwise}\end{cases} \qquad \text{(formula 2)}$$

(3) A set of central sentences is extracted with the TextRank algorithm and denoted $Sentence=\{s_1,s_2,\dots,s_n\}$, where each $s_t$ is a central sentence; as in formula 3, n is the number of central sentences and $I(e_i\in s_t)$ is an indicator function stating whether entity $e_i$ appears in the central sentence $s_t$:

$$center(e_i)=\frac{1}{n}\sum_{t=1}^{n}I(e_i\in s_t) \qquad \text{(formula 3)}$$

(4) After the three weight components are calculated, they are combined into the entity semantic weight formula for UCL, shown as formula 4:

$$EW(e_i)=Avg(location(e_i))\times\big(\eta\cdot freq(e_i)+(1-\eta)\cdot center(e_i)\big) \qquad \text{(formula 4)}$$

where $\eta$ is a tuning parameter and $Avg(location(e_i))$ is the average position weight of the entity; after $EW(e_i)$ has been computed for all entities, the UCL semantic weight of each entity is obtained through normalization, and only entities whose semantic weight reaches a certain threshold are retained to form the subsequent entity nodes.
4. The method for calculating text similarity based on a co-occurrence entity interaction graph according to claim 1, wherein the step 2 specifically comprises the following sub-steps:
substep 2-1, after the similarity measurement between nodes is completed, the similarity of sentences from different news texts within a node is calculated, so that each node can be represented by a feature vector; the sentences from the same text are concatenated into a text set, giving the sets $S_A(v)$ and $S_B(v)$ for the sentences of the two texts respectively;
Substep 2-2, taking node v as an example, $S_A(v)$ and $S_B(v)$ are used as input to the interaction model, whose weight-sharing embedding layer encodes each set into a context vector; the context layer typically contains one or more bidirectional LSTM layers or CNN layers with max pooling, intended to capture the context information in $S_A(v)$ and $S_B(v)$; let $Vec_A(v)$ and $Vec_B(v)$ denote the context vectors obtained from $S_A(v)$ and $S_B(v)$; the aggregation layer then gives the vector representation $m_{AB}(v)$ of the node, computed by concatenating the element-wise distance of the two context vectors with their Hadamard product, as shown in formula 9:

$$m_{AB}(v)=\big(\,|Vec_A(v)-Vec_B(v)|\;,\;Vec_A(v)\odot Vec_B(v)\,\big) \qquad \text{(formula 9)}$$

where $\odot$ denotes the Hadamard product;
substeps 2-3, the similarity of the entity features between the texts in $S_A(v)$ and $S_B(v)$ is calculated to generate another node vector $m'_{AB}(v)$ for each node; this step obtains three similarity measures; the TF-IDF cosine similarity is computed with formulas 6, 7 and 8;
and a substep 2-4 of calculating the emphasis coefficient, which measures the emphasized entities in the text and remedies the neglect of emphasized entities caused by considering semantic weight alone, as shown in formula 10:

$$EC_{doc}(e_i)=\frac{N_{DOC}}{\sum_{j=1}^{N_{DOC}} I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)+1} \qquad \text{(formula 10)}$$

where $EC_{doc}(e_i)$ is the emphasis coefficient of entity $e_i$ in text doc, $EW_{doc}(e_i)$ is the semantic weight of entity $e_i$ in text doc, computed as in formula 4, $N_{DOC}$ is the total number of news texts in the knowledge space, and $I\big(EW_j(e_i)\ge EW_{doc}(e_i)\big)$ is an indicator function counting the number of texts in which $EW_j(e_i)\ge EW_{doc}(e_i)$;
the average emphasis coefficient is derived from the emphasis coefficients of single entities, and for a sentence the average emphasis coefficient is calculated with formula 11:

$$AvgEC_A(sentence)=\sum_{i=1}^{num}EC_A(e_i)\cdot\frac{N_i}{N_{sentence}} \qquad \text{(formula 11)}$$

where $AvgEC_A(sentence)$ is the average emphasis coefficient of the sentence in text A, num is the number of distinct entities in the sentence that have an emphasis coefficient, $EC_A(e_i)$ is the emphasis coefficient of $e_i$ in text A, $N_i$ is the number of occurrences of $e_i$ in the sentence, and $N_{sentence}$ is the total number of occurrences of all entities in the sentence;
substep 2-5, the Jaccard similarity considering sentence length is calculated, with the difference in sentence length added to the denominator as a penalty term, as shown in formula 12:

$$Jaccard(A,B)=\frac{len(A\cap B)}{len(A\cup B)+\alpha\cdot\big|len(A)-len(B)\big|} \qquad \text{(formula 12)}$$

where $\alpha$ is a hyperparameter adjusting the influence of the sentence-length difference, $len(A\cap B)$ is the number of entities shared by the two texts, and $len(A\cup B)$ is the size of the union of the entities of the two texts;
after these results are calculated, they are concatenated to obtain the similarity vector $m'_{AB}(v)$ of node v.
5. The method for calculating text similarity based on a co-occurrence entity interaction graph according to claim 1, wherein the step 3 specifically comprises the following sub-steps:
substep 3-1, generating a matching vector
The input of the GCN is the co-occurrence entity interaction graph $G_{AB}$ obtained above, containing N nodes and weighted edges between nodes; each node $V_i$ carries a matching vector $M(v_i)$ formed by concatenating the two node vectors obtained above, as shown in formula 13:

$$M(v_i)=\big(m_{AB}(v_i),\,m'_{AB}(v_i)\big) \qquad \text{(formula 13)}$$
Substep 3-2, generating text level similarity fixed length vector
The weighted adjacency matrix corresponding to graph $G_{AB}$ is $Y\in\mathbb{R}^{N\times N}$, where each entry $Y_{ij}$ is the TF-IDF similarity between nodes $V_i$ and $V_j$, computed previously; let D be the diagonal matrix satisfying $D_{ii}=\sum_j Y_{ij}$; the input layer of the GCN is $H^{(0)}=M$, the initial node features, and $H^{(n)}\in\mathbb{R}^{N\times d_n}$ denotes the hidden representation matrix of the n-th layer; each layer passes through the graph convolution filter of formula 14 to obtain the hidden representation of the next layer:

$$H^{(n+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\,\tilde{Y}\,\tilde{D}^{-\frac{1}{2}}\,H^{(n)}\,W^{(n)}\Big) \qquad \text{(formula 14)}$$

where $\tilde{Y}=Y+I_N$ and $I_N$ is the identity matrix; similarly to D, $\tilde{D}$ is also a diagonal matrix, satisfying $\tilde{D}_{ii}=\sum_j\tilde{Y}_{ij}$; $W^{(n)}$ is the trainable weight matrix of the n-th layer; σ represents an activation function;
finally, taking the average value of the hidden vectors of all the nodes in the last layer to obtain a fixed-length vector representing the similarity of the text level;
substep 3-3, finally calculating a matching score through a multilayer perceptron, wherein the classification module comprises a linear layer, a ReLU layer, another linear layer and a final Sigmoid activation function layer; the resulting score is used to judge whether the texts are similar.
6. A co-occurrence entity interaction graph-based text similarity calculation apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the computer program when loaded into a processor implements the method of computing text similarity based on a co-occurrence entity interaction graph of any of claims 1-5 above.
7. The co-occurrence entity interaction graph-based text similarity calculation apparatus according to claim 6, wherein: the computer program comprises a co-occurrence entity interaction graph construction module, a node vector generation module and a node feature aggregation module; the co-occurrence entity interaction graph construction module is used for acquiring the co-occurring entities in the text pair, splitting the texts by means of these entities, and attaching each sentence of the long texts to the co-occurrence entity node with the highest similarity; the module then completes the similarity measurement between nodes and thereby constructs the interaction graph; the node vector generation module is used for generating the feature vector of each entity node, which comprises a hidden feature vector of fixed size and a similarity vector generated from the three similarity metrics and serves as the initial input of the node feature aggregation module; and the node feature aggregation module uses the GCN to aggregate the features of all nodes of the input co-occurrence entity interaction graph into the final fixed-length vector representing the text-level similarity, and finally completes the similarity calculation through a perceptron.
CN202110639430.8A 2021-06-08 2021-06-08 Text similarity calculation method and device based on co-occurrence entity interaction graph Pending CN113743079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110639430.8A CN113743079A (en) 2021-06-08 2021-06-08 Text similarity calculation method and device based on co-occurrence entity interaction graph


Publications (1)

Publication Number Publication Date
CN113743079A true CN113743079A (en) 2021-12-03

Family

ID=78728425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110639430.8A Pending CN113743079A (en) 2021-06-08 2021-06-08 Text similarity calculation method and device based on co-occurrence entity interaction graph

Country Status (1)

Country Link
CN (1) CN113743079A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114428850A (en) * 2022-04-07 2022-05-03 之江实验室 Text retrieval matching method and system
CN114428850B (en) * 2022-04-07 2022-08-05 之江实验室 Text retrieval matching method and system
CN116304749A (en) * 2023-05-19 2023-06-23 中南大学 Long text matching method based on graph convolution
CN116304749B (en) * 2023-05-19 2023-08-15 中南大学 Long text matching method based on graph convolution

Similar Documents

Publication Publication Date Title
WO2020108608A1 (en) Search result processing method, device, terminal, electronic device, and storage medium
CN112269868B (en) Use method of machine reading understanding model based on multi-task joint training
CN109815336B (en) Text aggregation method and system
CN106021364A (en) Method and device for establishing picture search correlation prediction model, and picture search method and device
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN107145485B (en) Method and apparatus for compressing topic models
CN110674252A (en) High-precision semantic search system for judicial domain
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
WO2015165372A1 (en) Method and apparatus for classifying object based on social networking service, and storage medium
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN110569405A (en) method for extracting government affair official document ontology concept based on BERT
CN111694940A (en) User report generation method and terminal equipment
CN108228541A (en) The method and apparatus for generating documentation summary
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN109255012A (en) A kind of machine reads the implementation method and device of understanding
CN113743079A (en) Text similarity calculation method and device based on co-occurrence entity interaction graph
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN110008365A (en) A kind of image processing method, device, equipment and readable storage medium storing program for executing
CN114742071B (en) Cross-language ideas object recognition analysis method based on graph neural network
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113569118B (en) Self-media pushing method, device, computer equipment and storage medium
CN107908749A (en) A kind of personage's searching system and method based on search engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination