CN107357918A - Graph-based document representation method - Google Patents
Graph-based document representation method Download PDF Info
- Publication number
- CN107357918A CN107357918A CN201710599697.2A CN201710599697A CN107357918A CN 107357918 A CN107357918 A CN 107357918A CN 201710599697 A CN201710599697 A CN 201710599697A CN 107357918 A CN107357918 A CN 107357918A
- Authority
- CN
- China
- Prior art keywords
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/358—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present invention relates to the technical field of text representation, and in particular to a graph-based document representation method. The steps of the method are: determine the maximum number of vertices n of the graph model for each document; perform word segmentation, part-of-speech tagging, preprocessing and word-frequency statistics on the document; select the feature terms that best represent the document, their number not exceeding n, and record the order in which all feature terms appear in the document; for a document D, take all of its feature terms as the vertices of the graph model, the occurrence frequency of each feature term forming the weight of its vertex. Beneficial effects of the present invention: the word semantic space is a network formed by words and the constraint relations between words; the strength of the constraint relation between two words represents their semantic distance; the similarity of graphs is measured with the basic elements of a graph, achieving a good clustering effect; and the semantic information of a text is reflected from its surface features, namely the feature terms, their frequencies and their positional relations.
Description
Technical field
The present invention relates to the technical field of text representation, and in particular to a graph-based document representation method.
Background art
In natural language processing and related fields, classical text representation models largely ignore the effect that the order of terms in a text has on semantic expression, and assume that terms are mutually independent. In fact, the order relation between terms affects the semantics of a text, and in Chinese a change of word order often changes the relations between words and thus the meaning. A simple example is "A likes B" versus "B likes A": the terms used in the two sentences are identical, and the difference in word order alone produces the difference in meaning. The currently most popular text representation model, the vector space model (VSM), ignores order relations in its model assumptions.
The most commonly used document representation is the vector space model, a method based on the bag-of-words assumption. This representation loses much of the information in the original text, such as the order of words and the boundaries of sentences and paragraphs.
To address the shortcomings of the vector space model, many scholars at home and abroad have proposed graph-based document representation methods. Svetlana proposed a document concept-graph representation model based on the auxiliary dictionaries VerbNet and WordNet; Bhoopesh and Pushpak proposed constructing feature vectors of documents from UNL graphs and clustering texts with SOM techniques; and Inderjeet and Eric proposed a document graph representation for multi-document summarization. Although these graph models capture the semantic information of documents well, they are all too complex, make it difficult to define similarity measures, and some require extra auxiliary information. More recently, Adam Schenker et al. proposed a comparatively simple graph-based document representation, but their model is built mainly on Boolean positional associations of text feature terms and does not consider the influence of feature-term frequency on the main content of the text.
Therefore, it is necessary to propose a graph-based document representation method that addresses the above problems.
The content of the invention
In view of the above deficiencies of the prior art, an object of the present invention is to provide a graph-based text representation method that represents text better and improves the effect of applications such as information retrieval and text classification.
The graph-based document representation method comprises the steps of:
Step 1: input a text document D;
Step 2: output a text class graph G(V, E, W1, W2);
Step 3: determine the maximum number of vertices n of the graph model for each document;
Step 4: perform word segmentation, part-of-speech tagging, preprocessing and word-frequency statistics on the document;
Step 5: select the feature terms that best represent the document, their number not exceeding n, and record the order in which all feature terms appear in the document;
Step 6: for document D, take all of its feature terms as the vertices of the graph model, the occurrence frequency of each feature term forming the weight of its vertex;
Step 7: if two feature terms occur in succession in a paragraph of the document, place a directed edge between them, directed from the feature term that occurs first to the one that occurs later, and count the number of times the two feature terms co-occur in the document;
Step 8: determine the incidence matrices M and U of the feature terms according to formula (1);
Step 9: normalize matrix U according to formula (5) to obtain the normalized incidence matrix W.
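As an informal sketch, steps three to seven can be illustrated in Python. This is an assumption-laden illustration rather than the patented procedure: whitespace tokenisation stands in for the segmentation, tagging and preprocessing of step four, "most frequent" stands in for "most representative" in step five, and the edge rule is simplified to connect every earlier feature term to every later one instead of paragraph-wise succession:

```python
from collections import Counter

def build_class_graph(doc, n):
    """Sketch of steps 3-7: choose up to n feature terms, weight vertices
    by term frequency, and add a directed edge from an earlier feature
    term to a later one (a simplification of step 7)."""
    terms = doc.split()                       # stand-in for segmentation/tagging
    freq = Counter(terms)
    # step 5: the n "most representative" terms (here: most frequent)
    features = [t for t, _ in freq.most_common(n)]
    # step 6: vertex weights from term frequency
    W1 = {t: freq[t] for t in features}
    # record the first-occurrence serial number of each feature term
    order = {}
    for i, t in enumerate(terms):
        if t in features and t not in order:
            order[t] = i + 1
    # simplified step 7: directed edge from the earlier term to the later one
    E = [(a, b) for a in features for b in features
         if a != b and order[a] < order[b]]
    return features, W1, E

V, W1, E = build_class_graph("cat chases mouse cat sleeps", 3)
```

The returned triple corresponds to the vertex set, the vertex weight set W1, and the (unweighted) edge set of the class graph; the edge weights of formula (1) would be attached in a later pass.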
Preferably, formula (1) defines the weight of the edge between two feature terms, i.e. the semantic measure, given in Definition 1. The semantic measure is defined as:
w_AB = 1/(num(B) - num(A))   (1)
where num(B) is the serial number of feature term B in the document and num(A) is the serial number of feature term A in the document.
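Formula (1) can be computed directly from the two serial numbers; as a minimal sketch (the guard against a non-positive gap is an added safety check, not part of the patent text):

```python
def semantic_measure(num_a, num_b):
    """Formula (1): w_AB = 1 / (num(B) - num(A)).
    The guard below is an addition, not part of the patent."""
    gap = num_b - num_a
    if gap <= 0:
        raise ValueError("feature term A must occur before feature term B")
    return 1.0 / gap

# adjacent feature terms constrain each other most strongly (w = 1)
print(semantic_measure(3, 4))   # 1.0
print(semantic_measure(3, 7))   # 0.25
```

Note that the measure is largest (equal to 1) for adjacent terms and decays with the distance between them, matching the statement below that the maximum constraint value is 1.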
Preferably, Definition 1 states that a document D corresponds to a class graph G in the word semantic space, where G is a quadruple G(V, E, W1, W2): a class weighted directed graph formed by a weighted vertex set V(G) and a weighted edge set E(G). The vertex set V(G) consists of all feature terms appearing in document D; the vertex weight set W1 consists of the word frequencies in the text of the vertices in V(G). If the feature terms of two vertices occur in succession, there is a directed edge between them, directed from the vertex occurring first to the vertex occurring later; the weight w of an edge represents the strength of the constraint between the two feature terms it joins. The set of all edges is called the edge set E(G), and the set of edge weights w is called the edge weight set W2.
Preferably, the document expression form of Definition 1 is:
T = [t1, t2, …, tn]   (2)
where T is the set of feature terms; ti is a feature term, i = 1, 2, …, n; M is the incidence matrix of the feature terms; and aij is the strength of association between feature terms ti and tj (1 ≤ i ≤ j ≤ n).
If a word A constrains another word B several times within the same paragraph, only the nearest constraint relation between them is counted. By Definition 1, the maximum constraint value is 1, giving the matrix U:
Usually, matrix U needs to be normalized.
Let w_ij = u_ij / max{u_kl}   (5)
where i, j, k, l = 1, 2, …, n, which yields the normalized matrix W:
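Assuming the normalisation divides every entry of U by the largest entry (the original formula (5) is not legible in this copy, so this reading is inferred from the surrounding indices i, j, k, l), a minimal sketch is:

```python
def normalize(U):
    """Scale the entries of U into [0, 1] by dividing by the largest
    entry (assumed reading of formula (5))."""
    m = max(max(row) for row in U)
    if m == 0:
        return [row[:] for row in U]   # an all-zero matrix is left unchanged
    return [[u / m for u in row] for row in U]

U = [[0.0, 0.5], [0.25, 0.0]]
W = normalize(U)   # largest entry 0.5 is scaled to 1.0
```

After this step every w_ij lies between 0 and 1, consistent with the edge-weight sums used later in the similarity formula (7).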
Preferably, the more semantically similar two documents D1 and D2 are, the more similar their corresponding document graphs; conversely, the more similar two document graphs are, the closer the documents are semantically. The semantic closeness of two documents D1 and D2 is embodied in the features of their graphs: the two graphs share more identical vertices and edges, and the edge weights are closer.
Preferably, suppose the weighted directed graphs corresponding to documents D1 and D2 are G1 and G2 respectively, and the maximum common subgraph of G1 and G2 is C. The similarity of D1 and D2 is then defined as in formula (7), where |V(C)| is the number of vertices of the maximum common subgraph C of the weighted directed graphs G1 and G2, n = Max{|V(G1)|, |V(G2)|}, and the constant β takes a value between 0 and 1.
Document similarity reflects the degree of similarity between two documents. It is usually a value between 0 and 1: 0 means dissimilar, 1 means completely similar, and larger values mean the two documents are more similar.
The closer two documents are semantically, the more this is embodied in the features of their graphs: the two graphs have more identical vertices and edges, and the edge weights are closer. In formula (7), the term |V(C)|/n measures the vertex composition of the two graphs: the closer two documents are semantically, the more similar the corresponding graphs, and the larger this term, approaching 1. The term Σ_{e∈E(C)} w_e^C measures the edge composition of the two graphs: the closer two documents are semantically, the more similar the corresponding graphs, and the larger this term, approaching 1. The linear combination S(D1, D2) = β|V(C)|/n + (1 - β)Σ_{e∈E(C)} w_e^C measures the similarity of the graphs corresponding to the two documents, and S(D1, D2) takes values between 0 and 1.
Correspondingly, the distance of two documents D1 and D2 is Dis(D1, D2) = 1 - S(D1, D2).
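Formula (7) and the corresponding distance can be sketched as below. The inputs (the vertex count of the maximum common subgraph and its normalised edge weights) are assumed to be computed elsewhere, since maximum-common-subgraph extraction itself is not specified by the patent:

```python
def similarity(v_common, n, common_edge_weights, beta=0.5):
    """Formula (7): S(D1, D2) = beta*|V(C)|/n + (1-beta)*sum of w_e^C.
    Assumes the edge-weight sum has been scaled so it does not exceed 1,
    keeping S within [0, 1]."""
    return beta * v_common / n + (1.0 - beta) * sum(common_edge_weights)

def distance(s):
    """Dis(D1, D2) = 1 - S(D1, D2)."""
    return 1.0 - s

# common subgraph with 3 of max(n) = 4 vertices and two shared edges
s = similarity(v_common=3, n=4, common_edge_weights=[0.2, 0.3], beta=0.5)
```

With β = 0.5 the two terms contribute equally; moving β toward 1 emphasises shared vertices over shared edge weights.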
Owing to the above technical scheme, the beneficial effects of the present invention are as follows. The word semantic space is a network formed by words and the constraint relations between words; the strength of the constraint relation between two words represents their semantic distance. Measuring the similarity of graphs with the basic elements of a graph (vertices and edges) achieves a good clustering effect. The semantic information of a text is reflected from its surface features, namely the feature terms, their frequencies and their positional relations, establishing a new document representation model based on the word semantic space. It successfully captures the following information: (1) part of speech, (2) word order, (3) word frequency, (4) word co-occurrence, (5) contextual information of words in the text.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Embodiment
Embodiments of the invention are described in detail below with reference to the accompanying drawing, but the invention can be implemented in many different ways as defined and covered by the claims.
As shown in Fig. 1, the graph-based document representation method comprises the steps of: Step 1: input a text document D; Step 2: output a text class graph G(V, E, W1, W2); Step 3: determine the maximum number of vertices n of the graph model for each document; Step 4: perform word segmentation, part-of-speech tagging, preprocessing and word-frequency statistics on the document; Step 5: select the feature terms that best represent the document, their number not exceeding n, and record the order in which all feature terms appear in the document; Step 6: for document D, take all of its feature terms as the vertices of the graph model, the occurrence frequency of each feature term forming the weight of its vertex; Step 7: if two feature terms occur in succession in a paragraph of the document, place a directed edge between them, directed from the feature term occurring first to the one occurring later, and count the number of times the two feature terms co-occur in the document; Step 8: determine the incidence matrices M and U of the feature terms according to formula (1); Step 9: normalize matrix U according to formula (5) to obtain the normalized incidence matrix W.
Further, formula (1) defines the weight of the edge between two feature terms, i.e. the semantic measure, given in Definition 1: w_AB = 1/(num(B) - num(A))   (1)
where num(B) is the serial number of feature term B in the document and num(A) is the serial number of feature term A in the document.
Definition 1 states that a document D corresponds to a class graph G in the word semantic space, where G is a quadruple G(V, E, W1, W2): a class weighted directed graph formed by a weighted vertex set V(G) and a weighted edge set E(G). The vertex set V(G) consists of all feature terms appearing in document D; the vertex weight set W1 consists of the word frequencies in the text of the vertices in V(G). If the feature terms of two vertices occur in succession, there is a directed edge between them, directed from the vertex occurring first to the vertex occurring later; the weight w of an edge represents the strength of the constraint between the two feature terms it joins. The set of all edges is called the edge set E(G), and the set of edge weights w is called the edge weight set W2.
Preferably, the document expression form of Definition 1 is:
T = [t1, t2, …, tn]   (2)
where T is the set of feature terms; ti is a feature term, i = 1, 2, …, n; M is the incidence matrix of the feature terms; and aij is the strength of association between feature terms ti and tj (1 ≤ i ≤ j ≤ n).
If a word A constrains another word B several times within the same paragraph, only the nearest constraint relation between them is counted; by Definition 1, the maximum constraint value is 1, giving the matrix U.
Usually, matrix U needs to be normalized.
Let w_ij = u_ij / max{u_kl}   (5)
where i, j, k, l = 1, 2, …, n, which yields the normalized matrix W.
Further, the more semantically similar two documents D1 and D2 are, the more similar their corresponding document graphs; conversely, the more similar two document graphs are, the closer the documents are semantically. The semantic closeness of two documents D1 and D2 is embodied in the features of their graphs: the two graphs share more identical vertices and edges, and the edge weights are closer.
Suppose the weighted directed graphs corresponding to documents D1 and D2 are G1 and G2 respectively, and the maximum common subgraph of G1 and G2 is C. The similarity of D1 and D2 is then defined as in formula (7), where |V(C)| is the number of vertices of the maximum common subgraph C of the weighted directed graphs G1 and G2, n = Max{|V(G1)|, |V(G2)|}, and the constant β takes a value between 0 and 1.
Document similarity reflects the degree of similarity between two documents. It is usually a value between 0 and 1: 0 means dissimilar, 1 means completely similar, and larger values mean the two documents are more similar.
The closer two documents are semantically, the more this is embodied in the features of their graphs: the two graphs have more identical vertices and edges, and the edge weights are closer. In formula (7), the term |V(C)|/n measures the vertex composition of the two graphs, and the term Σ_{e∈E(C)} w_e^C measures the edge composition; the closer two documents are semantically, the more similar the corresponding graphs and the larger each term, approaching 1. The linear combination S(D1, D2) = β|V(C)|/n + (1 - β)Σ_{e∈E(C)} w_e^C measures the similarity of the graphs corresponding to the two documents, and S(D1, D2) takes values between 0 and 1. Correspondingly, the distance of the two documents D1 and D2 is Dis(D1, D2) = 1 - S(D1, D2).
In addition, a weak similarity relation: let R be a binary relation on a set A. If R satisfies the conditions:
Reflexivity: for every element x of A, <x, x> ∈ R;
Symmetry: for any two elements x and y of A, if <x, y> ∈ R, then <y, x> ∈ R;
Weak transitivity: for any three elements x, y and z of A, if <x, y> ∈ R and <y, z> ∈ R, then <x, z> ∈ LR, where LR denotes the weak binary relation of R;
then R is said to be a weak similarity relation defined on A. The similarity relation S of documents is a binary relation on the document set Dset, and the similarity relation S of documents is a weak similarity relation.
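As a toy check of the three conditions, with an invented relation on a three-element set (the weak relation LR is simply taken here as a superset of R, which is one way to read the definition):

```python
def is_weak_similarity(A, R, LR):
    """Check reflexivity, symmetry, and weak transitivity: composing two
    pairs of R is only required to land in the weaker relation LR."""
    reflexive = all((x, x) in R for x in A)
    symmetric = all((y, x) in R for (x, y) in R)
    weakly_transitive = all(
        (x, z) in LR
        for (x, y) in R for (y2, z) in R if y == y2
    )
    return reflexive and symmetric and weakly_transitive

A = {1, 2, 3}
R = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1), (2, 3), (3, 2)}
LR = R | {(1, 3), (3, 1)}       # LR admits the chained pairs that R lacks
print(is_weak_similarity(A, R, LR))
```

The example relation fails ordinary transitivity ((1, 3) is not in R) but satisfies weak transitivity because (1, 3) lies in LR, illustrating why document similarity can be a weak similarity relation without being an equivalence.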
Here, word semantic space = word space + semantic space; its formal specification is as follows:
S = <T, R, W1, W2>, where T = {t1, t2, …, ti, …, tn}, i = 1, 2, …, n, is the set of feature terms and ti is a feature term; R is the semantic constraint relation on T: elements ti and tj of T satisfy the relation R if and only if ti constrains tj, written tiRtj or <ti, tj> ∈ R, i, j = 1, 2, …, n; W1 is the weight set of the feature terms, here the set of word frequencies of the ti, i = 1, 2, …, n; W2 is the set of constraint strengths between elements of T.
Clearly, the semantic constraint relation on T is a binary relation on T. From set theory, a binary relation can be represented by a graph G whose vertices are the elements of T: if <ti, tj> ∈ R, there is a directed edge from vertex ti to vertex tj, i, j = 1, 2, …, n. Because a relation is a set of ordered pairs and the order within an ordered pair cannot be reversed, directed edges are used in the graph representation of a relation.
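The point that each ordered pair yields one directed edge can be illustrated as follows (a sketch; the term names are invented for illustration):

```python
def relation_to_digraph(pairs):
    """Represent a binary relation R as adjacency lists of a directed
    graph: each ordered pair <ti, tj> in R becomes an edge ti -> tj."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, [])   # ensure every term appears as a vertex
    return adj

R = [("economy", "growth"), ("growth", "rate")]
G = relation_to_digraph(R)
# the pair is ordered: "economy" constrains "growth", never the reverse
```

Because the pairs are ordered, the adjacency lists are asymmetric, which is exactly why the class graph of a document must be directed.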
The word semantic space of the present invention is thus a network formed by words and the constraint relations between words; the strength of the constraint relation between two words represents their semantic distance. Measuring the similarity of graphs with the basic elements of a graph (vertices and edges) achieves a good clustering effect. The semantic information of a text is reflected from its surface features, namely the feature terms, their frequencies and their positional relations, establishing a new document representation model based on the word semantic space. It successfully captures the following information: (1) part of speech, (2) word order, (3) word frequency, (4) word co-occurrence, (5) contextual information of words in the text.
The foregoing describes only preferred embodiments of the present invention and does not limit its scope; any equivalent structure or equivalent process transformation made using the description and drawings of the invention, applied directly or indirectly in other related technical fields, is included within the scope of protection of the present invention.
Claims (6)
1. A graph-based document representation method, characterized in that the method comprises the steps of:
Step 1: input a text document D;
Step 2: output a text class graph G(V, E, W1, W2);
Step 3: determine the maximum number of vertices n of the graph model for each document;
Step 4: perform word segmentation, part-of-speech tagging, preprocessing and word-frequency statistics on the document;
Step 5: select the feature terms that best represent the document, their number not exceeding n, and record the order in which all feature terms appear in the document;
Step 6: for document D, take all of its feature terms as the vertices of the graph model, the occurrence frequency of each feature term forming the weight of its vertex;
Step 7: if two feature terms occur in succession in a paragraph of the document, place a directed edge between them, directed from the feature term occurring first to the one occurring later, and count the number of times the two feature terms co-occur in the document;
Step 8: determine the incidence matrices M and U of the feature terms according to formula (1);
Step 9: normalize matrix U according to formula (5) to obtain the normalized incidence matrix W.
2. the document representation method according to claim 1 based on figure, it is characterised in that:The formula (1) is by defining 1
The weights on side that is, Semantic Measure definition between two document feature sets, semanteme measure definition:
wAB=1/ (num (B)-num (A)) (1)
Wherein, num (B) represents the serial number of document feature sets B in a document, and num (A) represents document feature sets A in a document suitable
Sequence number.
3. the document representation method according to claim 1 based on figure, it is characterised in that:1 is a document D defined in it
Corresponding is exactly that class the figure G, G under word semantic space are four-tuple G (V, E, a W1,W2) the vertex set V (G) that is had the right by band and
The class Weighted Directed Graph that the side collection E (G) that band is had the right is formed, vertex set V (G) are formed by appearing in all document feature sets in document D;
The weight set W on summit1It is to be made up of the word frequency in the text on the summit in vertex set V (G);If two vertex correspondences
Document feature sets occur in priority, then have a directed edge between them, and its direction is that occur after being pointed to by the summit first occurred
Summit, the power w on side represent the size of degree of restraint between two document feature sets associated with it, the set that all sides are formed
We term it side collection E (G), while the set that forms of weight w we be referred to as while weight sets W2。
4. the document representation method according to claim 1 based on figure, it is characterised in that:The document expression of the definition 1
Form is:
T=[t1,t2,…,tn] (2)
$$M=\begin{bmatrix}
a_{11} & a_{12} & \cdots & \cdots & a_{1,n-1} & a_{1n}\\
a_{21} & a_{22} & \cdots & \cdots & a_{2,n-1} & a_{2n}\\
\vdots & \vdots & & & \vdots & \vdots\\
a_{n-1,1} & a_{n-1,2} & \cdots & \cdots & a_{n-1,n-1} & a_{n-1,n}\\
a_{n1} & a_{n2} & \cdots & \cdots & a_{n,n-1} & a_{nn}
\end{bmatrix}\qquad(3)$$
where T is the set of feature terms; ti is a feature term, i = 1, 2, …, n; M is the incidence matrix of the feature terms; and aij is the strength of association between feature terms ti and tj (1 ≤ i ≤ j ≤ n);
if a word A constrains another word B several times within the same paragraph, only the nearest constraint relation between them is counted; by Definition 1, the maximum constraint value is 1, giving the matrix U:
$$U=\begin{bmatrix}
u_{11} & u_{12} & \cdots & \cdots & u_{1,n-1} & u_{1n}\\
u_{21} & u_{22} & \cdots & \cdots & u_{2,n-1} & u_{2n}\\
\vdots & \vdots & & & \vdots & \vdots\\
u_{n-1,1} & u_{n-1,2} & \cdots & \cdots & u_{n-1,n-1} & u_{n-1,n}\\
u_{n1} & u_{n2} & \cdots & \cdots & u_{n,n-1} & u_{nn}
\end{bmatrix}\qquad(4)$$
usually, matrix U needs to be normalized;
let w_ij = u_ij / max{u_kl}   (5)
where i, j, k, l = 1, 2, …, n, which yields the normalized matrix W:
$$W=\begin{bmatrix}
w_{11} & w_{12} & \cdots & \cdots & w_{1,n-1} & w_{1n}\\
w_{21} & w_{22} & \cdots & \cdots & w_{2,n-1} & w_{2n}\\
\vdots & \vdots & & & \vdots & \vdots\\
w_{n-1,1} & w_{n-1,2} & \cdots & \cdots & w_{n-1,n-1} & w_{n-1,n}\\
w_{n1} & w_{n2} & \cdots & \cdots & w_{n,n-1} & w_{n,n}
\end{bmatrix}\qquad(6).$$
5. the document representation method according to claim 1 based on figure, it is characterised in that:Two document Ds1And D2Semanteme is got over
Close, then their corresponding document map is also more similar, on the contrary, two document maps are more similar, then they get over semantically
Close, two document Ds1And D2It is semantic closer, be embodied in the feature of figure, two figures just have more identical summits and
Side, and the weights on side are also closer.
6. the document representation method according to claim 1 based on figure, it is characterised in that:Assuming that two document Ds1And D2It is right
The Weighted Directed Graph answered is respectively G1And G2, G1And G2Maximum public subgraph be C, then document D1And D2Similarity definition such as
Under:
$$S(D_1,D_2)=\beta\frac{|V(C)|}{n}+(1-\beta)\sum_{e\in E(C)}w_e^{C}\qquad(7)$$
where |V(C)| denotes the number of vertices of C, the maximum common subgraph of the weighted directed graphs G1 and G2; n = max{|V(G1)|, |V(G2)|}; and the constant β is a decimal between 0 and 1.
Document similarity reflects the degree of similarity between two documents and is usually a value between 0 and 1: 0 means completely dissimilar, 1 means completely similar, and a larger value means the two documents are more similar.
The closer two documents are semantically, the more this shows in the features of the graphs: the two graphs share more identical vertices and edges, and the edge weights are also closer. In formula (7), |V(C)|/n measures the vertex overlap of the two graphs: the closer the two documents are semantically, the more similar the corresponding graphs, and the larger this value, approaching 1. Likewise, Σ_{e∈E(C)} w_e^C measures the edge overlap of the two graphs: the closer the two documents are semantically, the more similar the corresponding graphs, and the larger this value, approaching 1. Their linear combination β|V(C)|/n + (1 − β)Σ_{e∈E(C)} w_e^C measures the similarity of the graphs corresponding to the two documents, and S(D1, D2) lies between 0 and 1. Correspondingly, the distance between two documents D1 and D2 is Dis(D1, D2) = 1 − S(D1, D2).
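The similarity of equation (7) and the derived distance can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the maximum common subgraph C has already been computed (finding it is a hard problem in general) and that the edge weights of C are normalized so their sum lies in [0, 1], matching the claim that S(D1, D2) stays between 0 and 1. The `Graph`, `similarity`, and `distance` names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    """A weighted directed graph: feature words as vertices, weighted edges."""
    vertices: set                                       # vertex labels (feature words)
    edge_weights: dict = field(default_factory=dict)    # (u, v) -> edge weight

def similarity(g1: Graph, g2: Graph, common: Graph, beta: float = 0.5) -> float:
    """S(D1, D2) per Eq. (7); beta is a constant between 0 and 1."""
    n = max(len(g1.vertices), len(g2.vertices))      # n = max{|V(G1)|, |V(G2)|}
    vertex_term = len(common.vertices) / n           # |V(C)| / n
    edge_term = sum(common.edge_weights.values())    # sum of w_e^C over e in E(C)
    return beta * vertex_term + (1 - beta) * edge_term

def distance(s: float) -> float:
    """Dis(D1, D2) = 1 - S(D1, D2)."""
    return 1.0 - s
```

With β = 0.5, two graphs of three vertices each that share a two-vertex, one-edge subgraph of total edge weight 0.5 score S = 0.5·(2/3) + 0.5·0.5 ≈ 0.583, i.e. a distance of about 0.417.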
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710599697.2A CN107357918B (en) | 2017-07-21 | 2017-07-21 | Text representation method based on graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357918A true CN107357918A (en) | 2017-11-17 |
CN107357918B CN107357918B (en) | 2022-01-25 |
Family
ID=60284884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710599697.2A Active CN107357918B (en) | 2017-07-21 | 2017-07-21 | Text representation method based on graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357918B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992480A (en) * | 2017-12-25 | 2018-05-04 | Neusoft Corporation | Method, apparatus, storage medium and program product for entity disambiguation |
CN109326327A (en) * | 2018-08-28 | 2019-02-12 | Fujian Normal University | Sequence clustering method based on the SeqRank graph algorithm |
CN110188349A (en) * | 2019-05-21 | 2019-08-30 | Graduate School at Shenzhen, Tsinghua University | Automated writing method based on extractive multi-document summarization |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090024385A1 (en) * | 2007-07-16 | 2009-01-22 | Semgine, Gmbh | Semantic parser |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090024385A1 (en) * | 2007-07-16 | 2009-01-22 | Semgine, Gmbh | Semantic parser |
Non-Patent Citations (2)
Title |
---|
Wang Mingwen et al.: "Research on Chinese Opinion Sentence Recognition Based on a Graph Model of Term Co-occurrence Relations", Journal of Chinese Information Processing * |
Wang Yinglong et al.: "Research on Weighted Maximal Frequent Subgraph Mining Algorithms", Computer Engineering and Applications * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992480A (en) * | 2017-12-25 | 2018-05-04 | Neusoft Corporation | Method, apparatus, storage medium and program product for entity disambiguation |
CN107992480B (en) * | 2017-12-25 | 2021-09-14 | Neusoft Corporation | Method, device, storage medium and program product for realizing entity disambiguation |
CN109326327A (en) * | 2018-08-28 | 2019-02-12 | Fujian Normal University | Sequence clustering method based on the SeqRank graph algorithm |
CN109326327B (en) * | 2018-08-28 | 2021-11-12 | Fujian Normal University | Biological sequence clustering method based on SeqRank graph algorithm |
CN110188349A (en) * | 2019-05-21 | 2019-08-30 | Graduate School at Shenzhen, Tsinghua University | Automated writing method based on extractive multi-document summarization |
Also Published As
Publication number | Publication date |
---|---|
CN107357918B (en) | 2022-01-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105843897B (en) | Intelligent question answering system for vertical domains | |
CN109783618B (en) | Attention mechanism neural network-based drug entity relationship extraction method and system | |
Gupta et al. | Choosing linguistics over vision to describe images | |
Pedersen | A simple approach to building ensembles of naive bayesian classifiers for word sense disambiguation | |
CN106776562A (en) | Keyword extraction method and system | |
CN107562919B (en) | Multi-index integrated software component retrieval method and system based on information retrieval | |
CN101398814A (en) | Method and system for simultaneously extracting document summaries and keywords | |
JPWO2014033799A1 (en) | Word semantic relation extraction device | |
Gómez-Adorno et al. | Automatic authorship detection using textual patterns extracted from integrated syntactic graphs | |
CN103970730A (en) | Method for extracting multiple subject terms from single Chinese text | |
CN108920482B (en) | Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model | |
CN107357918A (en) | Graph-based document representation method | |
CN108427723A (en) | Author recommendation method and system based on a clustering algorithm and a locality-aware reconstruction model | |
CN112597300A (en) | Text clustering method and device, terminal equipment and storage medium | |
CN111782759B (en) | Question-answering processing method and device and computer readable storage medium | |
Simm et al. | Classification of short text comments by sentiment and actionability for voiceyourview | |
Hristea | The Naïve Bayes model for unsupervised word sense disambiguation: aspects concerning feature selection | |
Gärdenfors | A semantic theory of word classes | |
CN111695358A (en) | Method and device for generating word vector, computer storage medium and electronic equipment | |
Sebti et al. | A new word sense similarity measure in WordNet | |
Li et al. | Naive Bayesian automatic classification of railway service complaint text based on eigenvalue extraction | |
CN114997288A (en) | Design resource association method | |
Khan et al. | Offensive language detection for low resource language using deep sequence model | |
Cheng et al. | Domain-specific ontology mapping by corpus-based semantic similarity | |
CN112860781A (en) | Mining and displaying method combining vocabulary collocation extraction and semantic classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||