CN107357918A

CN107357918A - Document representation method based on figure

Info

Publication number: CN107357918A
Application number: CN201710599697.2A
Authority: CN
Inventors: 周法国
Original assignee: China University of Mining and Technology Beijing CUMTB
Current assignee: China University of Mining and Technology Beijing CUMTB
Priority date: 2017-07-21
Filing date: 2017-07-21
Publication date: 2017-11-17
Anticipated expiration: 2037-07-21
Also published as: CN107357918B

Abstract

The present invention relates to text representation technical field, the document representation method of figure is based especially on, its method and step is：It is determined that each document corresponds to most number of vertices n of graph model, document is segmented, part-of-speech tagging, pretreatment, and word frequency statisticses are carried out to it；The document feature sets of the document can most be represented by choosing, and its number is no more than n, and records the sequencing of all Feature Words in a document；To document D, by its summit of all document feature sets as graph model, the frequency of occurrences of corresponding document feature sets forms the weight on summit.Beneficial effect of the present invention：Word semantic space is exactly the network being made up of the restriction relation between word and word, word and word, the power of restriction relation between word and word represents semantic distance, the similitude of figure is measured with the basic element of figure, achieve good Clustering Effect, if reflect its semantic information, the position relationship of document feature sets, the frequency of document feature sets and its document feature sets from the surface of text.

Description

Document representation method based on figure

Technical field

The present invention relates to text representation technical field, the document representation method of figure is based especially on.

Background technology

In natural language processing and association area, classical text representation model seldom considers lexical item in text substantially Effect of the ordinal relation for semantic meaning representation, and it is separate to assume between lexical item.In fact, sequence between lexical item Relation can influence the semanteme of text, and the change of Chinese word order often influences the relation between word and then causes the change of semanteme. One simply example is that " A likes B " and " B likes A ", and the lexical item used in sentence is identical, and the difference of word order result in language The difference of justice.Presently the most popular text representation model VSM models have been ignored as the relation of sequence in its model hypothesis.

The most frequently used in document representation method is exactly vector space model, and this is a kind of to be based on bag of words (bag-of-words) Method, but it can not change, and this method for expressing lost many information in urtext, such as：In text The information such as border of sentence and paragraph in the order of word, text.

The defects of representing model for vector space, domestic and international many scholars propose the document representation side based on graph model Method.As the Svetlana document concepts figures based on auxiliary dictionary VerbNet and WordNet proposed in its paper represent mould Type；Bhoopesh and Pushpak is proposed in their paper represents the characteristic vector of document according to UNL figures to construct, and Text is clustered using SOM technologies；Also Inderjeet and Eric is it is also proposed in their paper for more documents The document map Model representation approach of abstract extraction.Although these graph models embody the semantic information of document, all mistake well In complexity, it is difficult to similarity measures are provided, and some also need to extra auxiliary information.Recently, Adam Schenker Et al. a kind of relatively simple document representing method based on graph model, but their model master are proposed in their paper Establish on the basis of the position boolean association of text feature entry, consider the frequency of document feature sets appearance to text The influence of main contents.

Therefore, it is necessary to propose the document representation method based on figure for above mentioned problem.

The content of the invention

For above-mentioned the deficiencies in the prior art, it is an object of the invention to provide the text representation side based on figure Method, it can preferably represent text, improve information retrieval, the effect of text classification application.

Document representation method based on figure, its method and step are：Step 1：Input text document D；Step 2：Output text This class figure G (V, E, W₁,W₂)；Step 3：It is determined that each document corresponds to most number of vertices n of graph model；Step 4：To document Segmented, part-of-speech tagging, pretreatment, and word frequency statisticses are carried out to it；Step 5：The Feature Words of the document can most be represented by choosing Bar, its number are no more than n, and record the sequencing of all Feature Words in a document；Step 6：To document D, by its all spy Summit of the entry as graph model is levied, the frequency of occurrences of corresponding document feature sets forms the weight on summit；Step 7：If two Feature Words successively occur in a certain paragraph of document, then have a directed edge between them, and the direction on side is by the spy that first occurs The Feature Words that sign word occurs after pointing to, and count the number of the two document feature sets co-occurrence in the document；Step 8：According to public affairs Formula (1) determines the incidence matrix M, U of document feature sets；Step 9：According to formulaMatrix U is normalized place Reason, it is determined that the incidence matrix W after normalization.

Preferably, the formula (1) is by the weights on side between 1 two document feature sets of definition that is, Semantic Measure defines, Semanteme measure definition：

WAB=1/ (num (B)-num (A)) (1)

Wherein, num (B) represents the serial numbers of document feature sets B in a document, and num (A) represents document feature sets A in a document Serial number.

Preferably, defined in it 1 be exactly that class figure G, G under word semantic space is a quaternary corresponding to a document D Group G (V, E, W₁,W₂) the class Weighted Directed Graph that forms of side collection E (G) had the right of the vertex set V (G) that is had the right by band and band.Vertex set V (G) formed by appearing in all document feature sets in document D；The weight set W on summit₁It is by the summit in vertex set V (G) Word frequency in the text is formed；If the document feature sets of two vertex correspondences occur in priority, there is a directed edge between them, Its direction is the summit occurred after being pointed to by the summit that first occurs, and the power w on side is represented between two document feature sets associated with it The size of degree of restraint, it is all while the set formed we term it while collection E (G), the set that the weight w on side is formed we The referred to as weight sets W on side₂。

Preferably, the document expression-form of the definition 1 is：

T=[t₁,t₂,…,t_n] (2)

Wherein, T：Document feature sets set；t_iIt is characterized entry, i=1,2 ..., n；M：The incidence matrix of document feature sets；a_ij： Document feature sets t_iAnd t_jStrength of association (1≤i≤j≤n),

If some word A repeatedly constrains another word B simultaneously in same paragraph, the nearest constraint between them is only counted Relation, it can be seen from defining 1, maximum constrained value is 1, obtains matrix U：

Usually, it is necessary to which matrix U is normalized.

Order

Wherein i, j, k, l=1,2 ..., n then obtains normalized matrix W：

Preferably, two document Ds₁And D₂Semantic closer, then their corresponding document map is also more similar, on the contrary, two Document map is more similar, then theirs is being semantically closer, two document Ds₁And D₂It is semantic closer, it is embodied in the spy of figure In sign, two figures just have more identical summits and side, and the weights on side are also closer.

Preferably, it is assumed that two document Ds₁And D₂Corresponding Weighted Directed Graph is respectively G₁And G₂, G₁And G₂It is maximum public Subgraph is C, then document D₁And D₂Similarity definition it is as follows：

Wherein, | V (C) | represent Weighted Directed Graph G₁And G₂Maximum public subgraph C number of vertices, n=Max | V (G₁) |,V(G₂), invariant β takes the decimal between 0~1.

Documents Similarity, the similarity degree reflected between two documents.The usually numerical value between one 0~1,0 represents Dissmilarity, 1 represents completely similar, and numerical value is bigger to represent that two documents are more similar.

Two document semantics are closer, are embodied in the feature of figure, and two figures just have more identical summits and side, and on side Weights it is also closer, in formula (7),It is the measurement formed to the summit of two figures, two document semantics are closer, corresponding Figure is more similar,Value is bigger, closer to 1；AndIt is the measurement to the composition on the side of two figures, two document semantics are got over Close, corresponding figure is more similar,Value is bigger, and closer 1, linear combination Represent the measurement to the similitude of figure corresponding to two documents, and S (D₁,D₂) value is between 0~1.

Correspondingly, two document Ds₁And D₂Distance Dis (D₁,D₂)=1-S (D₁,D₂)。

Due to using above-mentioned technical proposal, beneficial effect of the present invention：Word semantic space is exactly by between word and word, word and word The network that is formed of restriction relation, the power of the restriction relation between word and word represents semantic distance, with the base of figure This element (summit and side) measures the similitude of figure, achieves good Clustering Effect, if come from the surface of text Reflect its semantic information, the position relationship of document feature sets, the frequency of document feature sets and its document feature sets, establish a kind of new base In the document representation model of word semantic space.It can successfully capture following information：(1) part of speech, (2) word order, (3) word frequency, (4) in the co-occurrence of word, (5) text word contextual information.

Brief description of the drawings

Fig. 1 is flow chart of the method for the present invention.

Embodiment

Embodiments of the invention are described in detail below in conjunction with accompanying drawing, but the present invention can be defined by the claims Implement with the multitude of different ways of covering.

As shown in figure 1, in the document representation method of figure, its method and step is：Step 1：Input text document D；Step Two：Export text class figure G (V, E, W₁,W₂)；Step 3：It is determined that each document corresponds to most number of vertices n of graph model；Step Four：Document is segmented, part-of-speech tagging, pretreatment, and word frequency statisticses are carried out to it；Step 5：Selection can most represent this article The document feature sets of shelves, its number is no more than n, and records the sequencing of all Feature Words in a document；Step 6：To document D, By its summit of all document feature sets as graph model, the frequency of occurrences of corresponding document feature sets forms the weight on summit；Step Seven：Have a directed edge if two Feature Words successively occur in a certain paragraph of document, between them, the direction on side by The Feature Words that the Feature Words first occurred occur after pointing to, and count the number of the two document feature sets co-occurrence in the document；Step Rapid eight：The incidence matrix M, U of document feature sets are determined according to formula (1)；Step 9：According to formulaTo matrix U It is normalized, it is determined that the incidence matrix W after normalization.

Further, the formula (1) is by the weights on side between 1 two document feature sets of definition that is, Semantic Measure is determined Justice, semanteme measure definition：w^AB=1/ (num (B)-num (A)) (1)

Defined in it 1 be exactly corresponding to a document D class figure G, G under word semantic space be a four-tuple G (V, E, W₁,W₂) the class Weighted Directed Graph that forms of side collection E (G) had the right of the vertex set V (G) that is had the right by band and band.Vertex set V (G) is by going out All document feature sets are formed in present document D；The weight set W on summit₁Be by the summit in vertex set V (G) in the text Word frequency formed；If the document feature sets of two vertex correspondences occur in priority, there are a directed edge, its direction between them It is the summit occurred after being pointed to by the summit that first occurs, the power w on side represents to constrain journey between two document feature sets associated with it The size of degree, it is all while the set formed we term it while collection E (G), while the set that forms of weight w we be referred to as while Weight sets W₂。

Preferably, the document expression-form of the definition 1 is：

T=[t₁,t₂,…,t_n] (2)

Usually, it is necessary to which matrix U is normalized.

Order

Wherein i, j, k, l=1,2 ..., n then obtains normalized matrix W：

Further, two document Ds₁And D₂Semantic closer, then their corresponding document map is also more similar, on the contrary, two Individual document map is more similar, then theirs is being semantically closer, two document Ds₁And D₂It is semantic closer, it is embodied in figure In feature, two figures just have more identical summits and side, and the weights on side are also closer.

Assuming that two document Ds₁And D₂Corresponding Weighted Directed Graph is respectively G₁And G₂, G₁And G₂Maximum public subgraph be C, Then document D₁And D₂Similarity definition it is as follows：

Two document semantics are closer, are embodied in the feature of figure, and two figures just have more identical summits and side, and Weights on side are also closer.In formula (7),It is the measurement formed to the summit of two figures, two document semantics more connect Closely, corresponding figure is more similar,Value is bigger, closer to 1；AndIt is the degree to the composition on the side of two figures Amount, two document semantics are closer, and corresponding figure is more similar,Value is bigger, and closer 1, thus, linear combinationRepresent the measurement to the similitude of figure corresponding to two documents, and S (D₁,D₂) value is between 0~1, correspondingly, two document Ds₁And D₂Distance Dis (D₁,D₂)=1-S (D₁,D₂)。

In addition, weak barbed congruence：If R is the binary crelation on set A, if R meets condition：

Reflexivity：To set A arbitrary element x, have<x,x>∈R

Symmetry：To set A any 2 element x and y, if<x,y>∈ R, then<y,x>∈R

Weak transitivity：To set A any 3 element x, y and z, if<x,y>∈ R, and<y,z>∈ R, then<x,z>∈ LR.Wherein, LR represents R weak binary crelation.

Then R is claimed to be defined in a weak barbed congruence on set A, the similarity relationships S of document is collection of document D_setOn A binary crelation, the similarity relation S of document is a weak barbed congruence.

Wherein, word semantic space=word space+semantic space, the formal specification of word semantic space are as follows：

S=＜ T, R, W₁,W₂＞, wherein T={ t₁,t₂,…,t_i,…,t_n, i=1,2 ..., n, T are characterized entry collection Close, t_iIt is characterized entry, i=1,2 ..., n；R is the semantic constraint relation on set T, the element t in set T_iAnd t_jMeet to close It is R, and if only if t_iConstrain t_j, it is designated as t_iRt_jOr ＜ t_i,t_j＞ ∈ R, i, j=1,2 ..., n；W₁It is the weight of document feature sets Set, herein refers to t_iWord frequency set, i=1,2 ..., n；W₂It is the set that element constraint is strong and weak in set T.

Obviously, the semantic constraint relation on set T is a binary crelation on set T.From the knowledge of set theory, Binary crelation can be represented with figure G, wherein figure G summit is exactly to be made up of all elements in T, if ＜ t_i,t_j＞ ∈ R, Then from summit t_iTo summit t_jThere is a directed edge, i, j=1,2 ..., n. is because relation is the even set of sequence, element in sequence idol Order can not overturn, so in the figure of relation represents, use directed edge.

The word semantic space of the present invention is exactly the network being made up of the restriction relation between word and word, word and word, is used The power of restriction relation between word and word represents semantic distance, with the basic element (summit and side) of figure measures the phase of figure Like property, good Clustering Effect is achieved, if reflecting its semantic information, document feature sets, feature from the surface of text The frequency of entry and its position relationship of document feature sets, a kind of document representation model of new word-based semantic space is established, It can successfully capture following information：(1) in part of speech, (2) word order, (3) word frequency, the co-occurrence of (4) word, (5) text word it is upper Context information.

The preferred embodiments of the present invention are the foregoing is only, are not intended to limit the scope of the invention, every utilization The equivalent structure or equivalent flow conversion that description of the invention and accompanying drawing content are made, or directly or indirectly it is used in other correlations Technical field, be included within the scope of the present invention.

Claims

1. the document representation method based on figure, it is characterised in that：Its method and step is：

Step 1：Input text document D；

Step 2：Export text class figure G (V, E, W₁,W₂)；

Step 3：It is determined that each document corresponds to most number of vertices n of graph model；

Step 4：Document is segmented, part-of-speech tagging, pretreatment, and word frequency statisticses are carried out to it；

Step 5：The document feature sets of the document can most be represented by choosing, and its number is no more than n, and records all Feature Words in document In sequencing；

Step 6：To document D, by its summit of all document feature sets as graph model, the frequency of occurrences of corresponding document feature sets Form the weight on summit；

Step 7：There are a directed edge, side if two Feature Words successively occur in a certain paragraph of document, between them Direction pointed to by the Feature Words that first occur after the Feature Words that occur, and count the two document feature sets co-occurrence in the document Number；

Step 8：The incidence matrix M, U of document feature sets are determined according to formula (1)；

Step 9：According to formulaMatrix U is normalized, it is determined that the incidence matrix W after normalization.

2. the document representation method according to claim 1 based on figure, it is characterised in that：The formula (1) is by defining 1 The weights on side that is, Semantic Measure definition between two document feature sets, semanteme measure definition：

w^AB=1/ (num (B)-num (A)) (1)

Wherein, num (B) represents the serial number of document feature sets B in a document, and num (A) represents document feature sets A in a document suitable Sequence number.

3. the document representation method according to claim 1 based on figure, it is characterised in that：1 is a document D defined in it Corresponding is exactly that class the figure G, G under word semantic space are four-tuple G (V, E, a W₁,W₂) the vertex set V (G) that is had the right by band and The class Weighted Directed Graph that the side collection E (G) that band is had the right is formed, vertex set V (G) are formed by appearing in all document feature sets in document D； The weight set W on summit₁It is to be made up of the word frequency in the text on the summit in vertex set V (G)；If two vertex correspondences Document feature sets occur in priority, then have a directed edge between them, and its direction is that occur after being pointed to by the summit first occurred Summit, the power w on side represent the size of degree of restraint between two document feature sets associated with it, the set that all sides are formed We term it side collection E (G), while the set that forms of weight w we be referred to as while weight sets W₂。

4. the document representation method according to claim 1 based on figure, it is characterised in that：The document expression of the definition 1 Form is：

T=[t₁,t₂,…,t_n] (2)

Wherein, T：Document feature sets set；t_iIt is characterized entry, i=1,2 ..., n；M：The incidence matrix of document feature sets；a_ij：Feature Entry t_iAnd t_jStrength of association (1≤i≤j≤n),

If some word A repeatedly constrains another word B simultaneously in same paragraph, the nearest constraint only counted between them is closed System, it can be seen from defining 1, maximum constrained value is 1, obtains matrix U：

Usually, it is necessary to which matrix U is normalized,

Order

Wherein i, j, k, l=1,2 ..., n then obtains normalized matrix W：

5. the document representation method according to claim 1 based on figure, it is characterised in that：Two document Ds₁And D₂Semanteme is got over Close, then their corresponding document map is also more similar, on the contrary, two document maps are more similar, then they get over semantically Close, two document Ds₁And D₂It is semantic closer, be embodied in the feature of figure, two figures just have more identical summits and Side, and the weights on side are also closer.

6. the document representation method according to claim 1 based on figure, it is characterised in that：Assuming that two document Ds₁And D₂It is right The Weighted Directed Graph answered is respectively G₁And G₂, G₁And G₂Maximum public subgraph be C, then document D₁And D₂Similarity definition such as Under：

<mrow> <mi>S</mi> <mrow> <mo>(</mo> <msub> <mi>D</mi> <mn>1</mn> </msub> <mo>,</mo> <msub> <mi>D</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> <mo>=</mo> <mi>&beta;</mi> <mfrac> <mrow> <mo>|</mo> <mi>V</mi> <mrow> <mo>(</mo> <mi>C</mi> <mo>)</mo> </mrow> <mo>|</mo> </mrow> <mi>n</mi> </mfrac> <mo>+</mo> <mrow> <mo>(</mo> <mn>1</mn> <mo>-</mo> <mi>&beta;</mi> <mo>)</mo> </mrow> <munder> <mi>&Sigma;</mi> <mrow> <mi>e</mi> <mo>&Element;</mo> <mi>E</mi> <mrow> <mo>(</mo> <mi>C</mi> <mo>)</mo> </mrow> </mrow> </munder> <msubsup> <mi>w</mi> <mi>e</mi> <mi>C</mi> </msubsup> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>7</mn> <mo>)</mo> </mrow> </mrow>

Wherein, | V (C) | represent Weighted Directed Graph G₁And G₂Maximum public subgraph C number of vertices, n=Max | V (G₁)|,V (G₂), invariant β takes the decimal between 0~1,

Documents Similarity, the similarity degree reflected between two documents, the usually numerical value between one 0~1,0 represents not phase Seemingly, 1 representing completely similar, numerical value is bigger to represent that two documents are more similar,

Two document semantics are closer, are embodied in the feature of figure, and two figures just have more identical summits and side, and the power on side Value is also closer, in formula (7),It is the measurement formed to the summit of two figures, two document semantics are closer, and corresponding figure is got over It is similar,Value is bigger, closer to 1；AndIt is the measurement to the composition on the side of two figures, two document semantics more connect Closely, corresponding figure is more similar,Value is bigger, and closer 1, linear combination Represent the measurement to the similitude of figure corresponding to two documents, and S (D₁,D₂) value is between 0~1, correspondingly, two Individual document D₁And D₂Distance Dis (D₁,D₂)=1-S (D₁,D₂)。