CN107357918B

CN107357918B - Text representation method based on graph

Info

Publication number: CN107357918B
Application number: CN201710599697.2A
Authority: CN
Inventors: 周法国
Original assignee: China University of Mining and Technology Beijing CUMTB
Current assignee: China University of Mining and Technology Beijing CUMTB
Priority date: 2017-07-21
Filing date: 2017-07-21
Publication date: 2022-01-25
Anticipated expiration: 2037-07-21
Also published as: CN107357918A

Abstract

The invention relates to the technical field of text representation, in particular to a text representation method based on a graph, which comprises the following steps: determining the maximum vertex number n of the graph model corresponding to each document, performing word segmentation, part of speech tagging and preprocessing on the documents, and performing word frequency statistics on the documents; selecting the feature entries which can represent the document most, wherein the number of the feature entries is not more than n, and recording the sequence of all feature words in the document; and regarding the document D, all the characteristic entries of the document D are used as vertexes of the graph model, and the appearance frequencies of the corresponding characteristic entries form the weight of the vertexes. The invention has the beneficial effects that: the word meaning space is a network diagram formed by constraint relations between words, and between words, the semantic distance is expressed by the strength of the constraint relation between words, the similarity of the diagram is measured by the basic elements of the diagram, a good clustering effect is obtained, and if the semantic information, the frequency of characteristic entries and the position relation of the characteristic entries are reflected by the external characteristics of the text, the characteristic entries, the frequency of the characteristic entries and the position relation of the characteristic entries are obtained.

Description

Text representation method based on graph

Technical Field

The invention relates to the technical field of text representation, in particular to a text representation method based on a graph.

Background

In natural language processing and related fields, the classical text representation model basically rarely considers the effect of the order relation of terms in text on semantic expression, and assumes that terms are independent of each other. In fact, the order relationship between terms affects the semantics of the text, and changes in the chinese word order often affect the relationship between terms and cause semantic changes. A simple example is "a likes B" and "B likes a", the terms used in the sentence are the same, and the difference in word order results in a difference in semantics. The currently most popular text representation model VSM model ignores the order relationships in its model assumptions.

The most common text representation method is the vector space model, which is a bag-of-words (bag-of-words) based method, but it cannot be changed, and this representation method loses much information in the original text, such as: the order of words in the text, the boundaries of sentences and paragraphs in the text, and the like.

Aiming at the defects of the vector space representation model, many scholars at home and abroad propose a document representation method based on a graph model. Document conceptual diagram representation models based on auxiliary dictionaries VerbNet and WordNet as proposed by sveltana in its paper; bhopesh and Pushpak propose in their papers that feature vectors representing documents are constructed according to UNL graphs, and SOM technology is adopted to cluster texts; also Inderjeet and Eric have proposed in their paper a document graph model representation method for multiple document summarization. Although the graph models well embody semantic information of the documents, the graph models are too complex to provide similarity measurement standards, and some graph models need additional auxiliary information. Recently, Adam Schenker et al proposed a simpler document representation method based on a graph model in their papers, but their models are mainly established on the basis of the position boolean association of the feature terms of the text, and do not consider the influence of the occurrence frequency of the feature terms on the main content of the text.

Therefore, it is necessary to propose a graph-based text representation method for the above-described problem.

Disclosure of Invention

In view of the above-mentioned shortcomings in the prior art, an object of the present invention is to provide a text representation method based on a graph, which can better represent a text and improve the effects of information retrieval and text classification applications.

The text representation method based on the graph comprises the following steps: the method comprises the following steps: inputting a text document D; step two: outputting a text class diagram G (V, E, W)₁,W₂) (ii) a Step three: determining the maximum vertex number n of the graph model corresponding to each document; step four: performing word segmentation, part-of-speech tagging and preprocessing on the document, and performing word frequency statistics on the document; step five: selecting the feature entries which can represent the document most, wherein the number of the feature entries is not more than n, and recording the sequence of all feature words in the document; step six: for the document D, all the characteristic entries of the document D are used as vertexes of the graph model, and the appearance frequency of the corresponding characteristic entries forms the weight of the vertexes; step seven: if two characteristic words appear in a certain paragraph of the document successively, a directed edge is arranged between the two characteristic words, the direction of the edge is pointed by the characteristic word appearing first and then the characteristic word appears later, and the number of times of the two characteristic terms co-occurring in the document is counted; step eight: determining an incidence matrix M and U of the characteristic entries according to a formula (1); step nine: according to the formula

And carrying out normalization processing on the matrix U, and determining the normalized incidence matrix W.

Preferably, the formula (1) is defined by defining 1 a weight of an edge between two feature entries, i.e. a semantic measure, the semantic measure is defined as: w is a_AB＝1/(num(B)-num(A))

(1)

Where num (B) represents the sequence number of the feature entry B in the document, and num (a) represents the sequence number of the feature entry a in the document.

Preferably, where definition 1 is a document D, it corresponds to a class diagram G in the semantic space of words, G being a quadruple G (V, E, W)₁,W₂) And the class weighted directed graph is composed of a weighted vertex set V (G) and a weighted edge set E (G). The vertex set V (G) is composed of all characteristic entries appearing in the document D; set of weights W for the vertices₁Is composed of the word frequency of the vertex in the vertex set V (G); if the feature entries corresponding to the two vertexes appear in sequence, a directed edge exists between the two vertexes, the direction of the directed edge is pointed to the vertex appearing in sequence by the vertex appearing in sequence, the weight W of the edge represents the degree of constraint between the two feature entries related to the edge, a set formed by all the edges is called an edge set E (G), a set formed by the weight W of the edge is called a weight set W of the edge₂。

Preferably, the document expression form of definition 1 is:

T＝[t₁,t₂,…,t_n] (2)

wherein, T: a feature entry set; t is t_iIs a characteristic entry, i ═ 1,2, …, n; m: an incidence matrix of the feature entries; a is_ij: characteristic entry t_iAnd t_jThe correlation strength (i is more than or equal to 1 and less than or equal to j and less than or equal to n),

if a word a constrains another word B in the same paragraph for many times, only the nearest constraint relationship between them is counted, and it can be known from definition 1 that the maximum constraint value is 1, and a matrix U is obtained:

in general, the matrix U needs to be normalized.

Order to

Where i, j, k, l is 1,2, …, n yields the normalized matrix W:

preferably, two documents D₁And D₂The closer the semantics are, the more similar their corresponding document maps are, and conversely, the more similar the two document maps are, the closer they are semantically, the two documents D₁And D₂The closer the semantics are, the more identical vertices and edges there are on the graph features, and the closer the weights on the edges are.

Preferably, assume two documents D₁And D₂The corresponding weighted directed graphs are respectively G₁And G₂，G₁And G₂C, then document D₁And D₂The similarity of (a) is defined as follows:

wherein, | V (C) | represents the weighted directed graph G₁And G₂N is Max { | V (G)₁)|,V(G₂) And the constant factor beta takes a decimal between 0 and 1.

And the document similarity reflects the similarity degree between the two documents. Generally, the value is 0-1, 0 represents dissimilar, 1 represents completely similar, and the larger the value is, the more similar the two documents are.

The closer the semantics of the two documents are, as reflected in the characteristics of the graph, the more the two graphs have the same vertices and edges, and the closer the weights on the edges are, in equation (7),

is a measure of the composition of the vertices of two graphs, the closer the semantics of the two documents are, the more similar the corresponding graphs are,

the larger the value is, the closer to 1 is; while

Is a measure of the composition of the edges of two graphs, the closer the semantics of the two documents are, the more similar the corresponding graphs are,

the larger the value, the closer to 1, the linear combination

Represents a measure of similarity of the corresponding graphs for two documents, and S (D)₁,D₂) The value is between 0 and 1.

Accordingly, two documents D₁And D₂Distance Dis (D)₁,D₂)＝1-S(D₁,D₂)。

Due to the adoption of the technical scheme, the invention has the beneficial effects that: the word meaning space is a network diagram formed by constraint relations among words, the semantic distance is expressed by the strength of the constraint relations among the words, the similarity of the diagram is measured by basic elements (vertexes and edges) of the diagram, a good clustering effect is obtained, and if the semantic information, the frequency of characteristic entries and the position relation of the characteristic entries are reflected by the external characteristics of a text, a new document expression model based on the word meaning space is established. It can successfully capture the following information: (1) part of speech, (2) word order, (3) word frequency, (4) co-occurrence of words, and (5) context information of words in text.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.

As shown in fig. 1, the text representation method of the figure comprises the following steps: the method comprises the following steps: inputting a text document D; step two: outputting a text class diagram G (V, E, W)₁,W₂) (ii) a Step three: determining the maximum vertex number n of the graph model corresponding to each document; step four: performing word segmentation, part-of-speech tagging and preprocessing on the document, and performing word frequency statistics on the document; step five: selecting the feature entries which can represent the document most, wherein the number of the feature entries is not more than n, and recording the sequence of all feature words in the document; step six: for the document D, all the characteristic entries of the document D are used as vertexes of the graph model, and the appearance frequency of the corresponding characteristic entries forms the weight of the vertexes; step seven: if two characteristic words appear in a certain paragraph of the document successively, a directed edge is arranged between the two characteristic words, the direction of the edge is pointed by the characteristic word appearing first and then the characteristic word appears later, and the number of times of the two characteristic terms co-occurring in the document is counted; step eight: determining an incidence matrix M and U of the characteristic entries according to a formula (1); step nine: according to the formula

Further, the formula (1) is defined by defining 1 a weight of an edge between two feature entries, i.e. semantic measure, which is defined by: w is a^AB＝1/(num(B)-num(A))(1)

Wherein definition 1 is a document D corresponding to a class diagram G in the semantic space of words, G is aQuadruplet G (V, E, W)₁,W₂) And the class weighted directed graph is composed of a weighted vertex set V (G) and a weighted edge set E (G). The vertex set V (G) is composed of all characteristic entries appearing in the document D; set of weights W for the vertices₁Is composed of the word frequency of the vertex in the vertex set V (G); if the feature entries corresponding to the two vertexes appear in sequence, a directed edge exists between the two vertexes, the direction of the directed edge is pointed to the vertex appearing in sequence by the vertex appearing in sequence, the weight W of the edge represents the degree of constraint between the two feature entries related to the edge, a set formed by all the edges is called an edge set E (G), a set formed by the weight W of the edge is called a weight set W of the edge₂。

Preferably, the document expression form of definition 1 is:

T＝[t₁,t₂,…,t_n] (2)

in general, the matrix U needs to be normalized.

Order to

Where i, j, k, l is 1,2, …, n yields the normalized matrix W:

further, two documents D₁And D₂The closer the semantics are, the more similar their corresponding document maps are, and conversely, the more similar the two document maps are, the closer they are semantically, the two documents D₁And D₂The closer the semantics are, the more identical vertices and edges there are on the graph features, and the closer the weights on the edges are.

Suppose two documents D₁And D₂The corresponding weighted directed graphs are respectively G₁And G₂，G₁And G₂C, then document D₁And D₂The similarity of (a) is defined as follows:

The closer the semantics of the two documents are, the more identical vertices and edges are present on the graph's features, and the closer the weights on the edges are. In the formula (7), the reaction mixture is,

the larger the value is, the closer to 1 is; while

the larger the value, the closer to 1, and thus, the linear combination

Represents a measure of similarity of the corresponding graphs for two documents, and S (D)₁,D₂) The value is between 0 and 1, and correspondingly, two documents D₁And D₂Distance Dis (D)₁,D₂)＝1-S(D₁,D₂)。

In addition, weak equivalence relation: let R be a binary relation on set A, if R satisfies the condition:

self-reflexibility: for any element x in the set A, x belongs to R

Symmetry: for any 2 elements x and y of the set A, < x, y > ∈ R, then < y, x > ∈ R

Weak transmissibility: for any 3 elements x, y, and z of set A, < x, z > ∈ LR if < x, y > ∈ R, and < y, z > ∈ R. Where LR represents a weak binary relationship for R.

Then R is a weak equivalence relation defined on the set A, and the similarity relation S of the documents is the document set D_setThe above binary relation, the similarity relation S of the documents is a weak equivalence relation.

Wherein, the term meaning space is term space + semantic space, and the formalization of term meaning space is described as follows:

S＝＜T,R,W₁,W₂> - { T ═ T₁,t₂,…,t_i,…,t_n1,2, …, n, T is a characteristic entry set, T_iIs a characteristic entry, i ═ 1,2, …, n; r is a semantic constraint relation on a set T, an element T in the set T_iAnd t_jSatisfies the relationship R, if and only if t_iConstraint t_jIs denoted by t_iRt_jOr < t_i,t_j＞∈R，i,j＝1,2,…,n；W₁Is a set of weights for the feature entries, here t_iI ═ 1,2, …, n; w₂Is a set of the element constraint strengths in the set T.

Obviously, the semantically constrained relationship on the set T is a binary relationship on the set T. As known from the knowledge of set theory, the binary relation can be represented by graph G, wherein the vertex of graph G is composed of all elements in T, if < T_i,t_jIs > ∈ R, then from vertex t_iTo the vertex t_jThere is a directed edge, i, j ═ 1,2, …, n, because the relationship is a set of ordered pairs, the order of the elements in the ordered pairs cannot be reversed, so in the graph representation of the relationship, directed edges are used.

The word meaning space of the invention is a network diagram formed by constraint relations between words, word and word, the semantic distance is expressed by the strength of the constraint relation between words, the similarity of the diagram is measured by basic elements (vertexes and edges) of the diagram, a good clustering effect is obtained, if the semantic information, the frequency of characteristic entries and the position relation of the characteristic entries of the text are reflected by the external characteristics of the text, a new document expression model based on the word meaning space is established, and the following information can be successfully captured: (1) part of speech, (2) word order, (3) word frequency, (4) co-occurrence of words, and (5) context information of words in text.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. The text representation method based on the graph is characterized in that: the method comprises the following steps:

the method comprises the following steps: inputting a text document D;

step two: output text graph G (V, E, W)₁,W₂)；

Step three: determining the maximum vertex number n of the graph model corresponding to each document;

step four: performing word segmentation, part-of-speech tagging and preprocessing on the document, and performing word frequency statistics on the document;

step five: selecting the feature entries which can represent the document most, wherein the number of the feature entries is not more than n, and recording the sequence of all feature words in the document;

step six: for the document D, all the characteristic entries of the document D are used as vertexes of the graph model, the appearance frequency of the corresponding characteristic entries forms the weight of the vertexes, and therefore the weight set W of the vertexes is formed₁；

Step seven: if two characteristic words appear in a certain paragraph of the document successively, a directed edge is arranged between the two characteristic words, and the direction of the edge is directed by the characteristic word appearing first to the characteristic word appearing later;

step eight: determining an incidence matrix M and U of the characteristic entries according to a formula (1);

step nine: according to the formula

Carrying out normalization processing on the matrix U, and determining a normalized incidence matrix W;

eighthly, the formula (1) is defined by semantic measure of an edge between two feature entries of the definition 1, and the semantic measure is defined as follows: w is a_AB＝1/(num(B)-num(A)) (1)

Wherein num (B) represents the sequence number of the feature entry B in the document, and num (a) represents the sequence number of the feature entry a in the document;

where definition 1 is a document D corresponding to a graph G in the semantic space of words, G is a quadruple G (V, E, W)₁,W₂) A weighted directed graph consisting of a weighted set of vertices V (G) and a weighted set of edges E (G), the set of vertices V (G) consisting of all the feature entries appearing in the document D; weight on edge W₂Representing the degree of constraint between two associated feature entries, the set of all edges is called edge set E (G), the weight W on the edge₂The set of constructs is called the weight set W of the edge₂。

2. The graph-based text representation method of claim 1, wherein: the document expression form of definition 1 is:

T＝[t₁,t₂,…,t_n] (2)

in general, the matrix U needs to be normalized,

order to

Where i, j, k, l is 1,2, …, n yields the normalized matrix W:

3. the graph-based text representation method of claim 1, wherein: two documents D₁And D₂The closer the semantics are, the more similar their corresponding document maps are, and conversely, the more similar the two document maps areThe closer they are semantically, two documents D₁And D₂The closer the semantics are, the more identical vertices and edges there are on the graph features, and the closer the weights on the edges are.

4. The graph-based text representation method of claim 1, wherein: suppose two documents D₁And D₂The corresponding weighted directed graphs are respectively G₁And G₂，G₁And G₂C, then document D₁And D₂The similarity of (a) is defined as follows:

wherein, | V (C) | represents the weighted directed graph G₁And G₂N is Max { | V (G)₁)|,V(G₂) The constant factor beta is a decimal between 0 and 1,

the document similarity, which reflects the similarity between two documents, is usually a numerical value between 0 and 1, where 0 indicates dissimilarity, 1 indicates complete similarity, and a larger numerical value indicates more similarity between two documents,

the larger the value is, the closer to 1 is; while

Is a measure of the composition of the edges of the two graphsThe closer the semantics of the two documents are, the more similar the corresponding graphs are,

the larger the value, the closer to 1, the linear combination