CN107357918A - Graph-based document representation method - Google Patents

Graph-based document representation method Download PDF

Info

Publication number
CN107357918A
CN107357918A (application CN201710599697.2A)
Authority
CN
China
Prior art keywords
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710599697.2A
Other languages
Chinese (zh)
Other versions
CN107357918B (en)
Inventor
周法国 (Zhou Faguo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology Beijing CUMTB
Original Assignee
China University of Mining and Technology Beijing CUMTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology Beijing CUMTB filed Critical China University of Mining and Technology Beijing CUMTB
Priority to CN201710599697.2A priority Critical patent/CN107357918B/en
Publication of CN107357918A publication Critical patent/CN107357918A/en
Application granted granted Critical
Publication of CN107357918B publication Critical patent/CN107357918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/358Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Abstract

The present invention relates to the technical field of text representation, and in particular to a graph-based document representation method. The steps of the method are: determine the maximum number of vertices n of the graph model for each document; perform word segmentation, part-of-speech tagging, preprocessing and word frequency statistics on the document; select the feature terms that best represent the document, their number not exceeding n, and record the order of all feature terms in the document; for a document D, take all of its feature terms as the vertices of the graph model, with the occurrence frequency of each feature term forming the weight of its vertex. Beneficial effects of the present invention: the word semantic space is a network formed by words and the constraint relations between words, where the strength of the constraint relation between words represents semantic distance; the similarity of graphs is measured with the basic elements of a graph, achieving a good clustering effect, so that the semantic information of a text is reflected from its surface features: the feature terms, their frequencies, and their positional relationships.

Description

Graph-based document representation method
Technical field
The present invention relates to the technical field of text representation, and in particular to a graph-based document representation method.
Background technology
In natural language processing and related fields, classical text representation models rarely consider the effect of the ordinal relations between terms on semantic expression, and assume that terms are mutually independent. In fact, the ordinal relations between terms affect the semantics of a text, and in Chinese a change of word order often changes the relations between words and hence the meaning. A simple example is "A likes B" versus "B likes A": the terms used in the two sentences are identical, yet the difference in word order produces a difference in meaning. The currently most popular text representation model, the VSM model, ignores ordinal relations in its model assumptions.
The most commonly used document representation method is the vector space model, a bag-of-words approach, but it loses much of the information in the original text, such as the order of words in the text and the boundaries of sentences and paragraphs.
To address the defects of the vector space model, many scholars at home and abroad have proposed document representation methods based on graph models. Svetlana proposed, in her paper, a document concept graph representation model based on the auxiliary dictionaries VerbNet and WordNet; Bhoopesh and Pushpak proposed in their paper constructing the feature vector of a document from UNL graphs and clustering texts with SOM techniques; Inderjeet and Eric also proposed in their paper a document graph model representation for multi-document summary extraction. Although these graph models capture the semantic information of documents well, they are all overly complex, it is difficult to define similarity measures over them, and some also require extra auxiliary information. More recently, Adam Schenker et al. proposed in their paper a relatively simple graph-based document representation, but their model is mainly built on boolean positional associations between text feature terms, without considering the influence of feature-term frequency on the main content of the text.
Therefore, it is necessary to propose a graph-based document representation method to address the above problems.
Summary of the invention
In view of the above deficiencies of the prior art, it is an object of the present invention to provide a graph-based text representation method that can better represent text and improve the effectiveness of applications such as information retrieval and text classification.
The graph-based document representation method comprises the following steps. Step 1: input a text document D. Step 2: output a text class graph G(V, E, W1, W2). Step 3: determine the maximum number of vertices n of the graph model for each document. Step 4: perform word segmentation, part-of-speech tagging, preprocessing and word frequency statistics on the document. Step 5: select the feature terms that best represent the document, their number not exceeding n, and record the order of all feature terms in the document. Step 6: for document D, take all of its feature terms as the vertices of the graph model, with the occurrence frequency of each feature term forming the weight of its vertex. Step 7: if two feature terms occur in succession in some paragraph of the document, there is a directed edge between them, pointing from the earlier feature term to the later one; count the number of co-occurrences of the two feature terms in the document. Step 8: determine the incidence matrices M and U of the feature terms according to formula (1). Step 9: normalize the matrix U according to formula (5) to determine the normalized incidence matrix W.
Preferably, formula (1) is given by Definition 1, i.e., the semantic measure that defines the weight of the edge between two feature terms:

w_AB = 1/(num(B) − num(A))   (1)

where num(B) denotes the ordinal position of feature term B in the document and num(A) the ordinal position of feature term A.
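By way of illustration only (this sketch is not part of the patent text, and the function name is our own), the semantic measure of formula (1) can be written as:

```python
def semantic_measure(num_a: int, num_b: int) -> float:
    """Edge weight w_AB of formula (1).

    num_a and num_b are the ordinal positions of feature terms A and B
    in the document; A occurs before B, so num_b > num_a and the weight
    lies in (0, 1], reaching 1 when the two terms are adjacent.
    """
    return 1.0 / (num_b - num_a)
```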
Preferably, Definition 1 is as follows: a document D corresponds, in the word semantic space, to a class graph G, where G is a four-tuple G(V, E, W1, W2), a class weighted directed graph formed by a weighted vertex set V(G) and a weighted edge set E(G). The vertex set V(G) consists of all feature terms appearing in document D; the vertex weight set W1 consists of the term frequencies in the text of the vertices in V(G); if the feature terms of two vertices occur in succession, there is a directed edge between them, pointing from the earlier vertex to the later one, and the edge weight w represents the strength of the constraint between the two feature terms it connects; the set formed by all edges is called the edge set E(G), and the set formed by the edge weights w is called the edge weight set W2.
Preferably, the document expression form of Definition 1 is:

T = [t1, t2, …, tn]   (2)

$$M = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1,n-1} & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2,n-1} & a_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} & a_{n-1,n} \\ a_{n1} & a_{n2} & \cdots & a_{n,n-1} & a_{nn} \end{bmatrix} \quad (3)$$

where T is the set of feature terms; ti is a feature term, i = 1, 2, …, n; M is the incidence matrix of the feature terms; and aij is the strength of association between feature terms ti and tj (1 ≤ i ≤ j ≤ n).
If some word A constrains another word B more than once within the same paragraph, only the nearest constraint between them is counted; by Definition 1, the maximum constraint value is 1. This yields the matrix U:

$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1,n-1} & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2,n-1} & u_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ u_{n-1,1} & u_{n-1,2} & \cdots & u_{n-1,n-1} & u_{n-1,n} \\ u_{n1} & u_{n2} & \cdots & u_{n,n-1} & u_{nn} \end{bmatrix} \quad (4)$$

In general, the matrix U needs to be normalized. Let

$$w_{ij} = \frac{u_{ij}}{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kl}}, \qquad i, j, k, l = 1, 2, \ldots, n \quad (5)$$

which gives the normalized matrix W:

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1,n-1} & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2,n-1} & w_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ w_{n-1,1} & w_{n-1,2} & \cdots & w_{n-1,n-1} & w_{n-1,n} \\ w_{n1} & w_{n2} & \cdots & w_{n,n-1} & w_{nn} \end{bmatrix} \quad (6)$$
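A minimal sketch of this normalization (not the patent's reference implementation; it assumes the sum normalization reconstructed in formula (5), which keeps the edge-weight sum in formula (7) within [0, 1]):

```python
import numpy as np

def normalize(U: np.ndarray) -> np.ndarray:
    """Normalize the constraint matrix U into W (formulas (5)-(6)).

    Each entry is divided by the sum of all entries of U, so that the
    entries of W are non-negative and sum to 1.
    """
    total = U.sum()
    return U / total if total > 0 else U
```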
Preferably, the semantically closer two documents D1 and D2 are, the more similar their corresponding document graphs; conversely, the more similar two document graphs are, the semantically closer the documents. When D1 and D2 are semantically close, this is reflected in the features of the graphs: the two graphs share more identical vertices and edges, and the weights of those edges are closer.
Preferably, suppose the weighted directed graphs corresponding to two documents D1 and D2 are G1 and G2 respectively, and that their maximum common subgraph is C. The similarity of D1 and D2 is then defined as:

$$S(D_1, D_2) = \beta \frac{|V(C)|}{n} + (1-\beta) \sum_{e \in E(C)} w_e^C \quad (7)$$

where |V(C)| denotes the number of vertices of the maximum common subgraph C of the weighted directed graphs G1 and G2, n = max{|V(G1)|, |V(G2)|}, and the constant β is a decimal between 0 and 1.
Document similarity reflects the degree of similarity between two documents. It is usually a value between 0 and 1: 0 means dissimilar, 1 means completely similar, and a larger value means the two documents are more similar.
The semantically closer two documents are, the more the features of their graphs coincide: the two graphs have more identical vertices and edges, and closer edge weights. In formula (7), |V(C)|/n measures the vertex composition of the two graphs: the semantically closer the documents, the more similar the corresponding graphs, and the larger this term, approaching 1. Likewise, Σ_{e∈E(C)} w_e^C measures the edge composition of the two graphs: the semantically closer the documents, the more similar the graphs, and the larger this term, approaching 1. The linear combination S(D1, D2) = β|V(C)|/n + (1−β)Σ_{e∈E(C)} w_e^C therefore measures the similarity of the graphs corresponding to the two documents, with S(D1, D2) taking values between 0 and 1.

Correspondingly, the distance between two documents D1 and D2 is Dis(D1, D2) = 1 − S(D1, D2).
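As a sketch only (the patent does not specify how the maximum common subgraph C is obtained, nor from which graph the weights w_e^C are taken), formula (7) might be approximated as follows, using the shared vertices and edges of the two graphs in place of an exact maximum common subgraph:

```python
import networkx as nx

def similarity(G1: nx.DiGraph, G2: nx.DiGraph, beta: float = 0.5) -> float:
    """Approximate S(D1, D2) of formula (7).

    Assumptions: C is approximated by the vertices and edges common to
    both graphs, and the weights w_e^C are read from G1's normalized
    edge weights.
    """
    shared_vertices = set(G1.nodes) & set(G2.nodes)
    shared_edges = set(G1.edges) & set(G2.edges)
    n = max(len(G1), len(G2))
    vertex_term = len(shared_vertices) / n if n else 0.0
    edge_term = sum(G1.edges[e]["weight"] for e in shared_edges)
    return beta * vertex_term + (1 - beta) * edge_term

def distance(G1: nx.DiGraph, G2: nx.DiGraph, beta: float = 0.5) -> float:
    """Dis(D1, D2) = 1 - S(D1, D2)."""
    return 1.0 - similarity(G1, G2, beta)
```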
Owing to the above technical solution, the beneficial effects of the present invention are as follows. The word semantic space is a network formed by words and the constraint relations between words, where the strength of the constraint relation between words represents semantic distance. The similarity of graphs is measured with the basic elements of a graph (vertices and edges), achieving a good clustering effect, so that the semantic information of a text is reflected from its surface features: the feature terms, their frequencies, and their positional relationships. A new document representation model based on the word semantic space is thereby established, which can successfully capture the following information: (1) part of speech; (2) word order; (3) word frequency; (4) word co-occurrence; (5) contextual information of words in the text.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Embodiment
Embodiments of the invention are described in detail below with reference to the accompanying drawing, but the invention can be implemented in many different ways, as defined and covered by the claims.
As shown in Fig. 1, the graph-based document representation method comprises the following steps. Step 1: input a text document D. Step 2: output a text class graph G(V, E, W1, W2). Step 3: determine the maximum number of vertices n of the graph model for each document. Step 4: perform word segmentation, part-of-speech tagging, preprocessing and word frequency statistics on the document. Step 5: select the feature terms that best represent the document, their number not exceeding n, and record the order of all feature terms in the document. Step 6: for document D, take all of its feature terms as the vertices of the graph model, with the occurrence frequency of each feature term forming the weight of its vertex. Step 7: if two feature terms occur in succession in some paragraph of the document, there is a directed edge between them, pointing from the earlier feature term to the later one; count the number of co-occurrences of the two feature terms in the document. Step 8: determine the incidence matrices M and U of the feature terms according to formula (1). Step 9: normalize the matrix U according to formula (5) to determine the normalized incidence matrix W. A sketch of how these steps might fit together in code is given below.
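The following Python sketch (our own illustration, not the patent's reference implementation) strings steps 3-9 together; it assumes the input paragraphs have already been segmented, POS-tagged and preprocessed, and that "occur in succession" means adjacent feature terms within a paragraph:

```python
from collections import Counter

import networkx as nx

def build_document_graph(paragraphs: list[list[str]], n: int = 30) -> nx.DiGraph:
    """Build the class graph G(V, E, W1, W2) for one document (steps 3-9)."""
    freq = Counter(t for p in paragraphs for t in p)      # step 4: word frequency
    feats = {t for t, _ in freq.most_common(n)}           # step 5: top-n feature terms

    G = nx.DiGraph()
    for t in feats:                                       # step 6: vertices, weights W1
        G.add_node(t, weight=freq[t])

    pos = 0
    U: dict[tuple[str, str], float] = {}
    for p in paragraphs:
        seq = []
        for t in p:
            pos += 1                                      # ordinal position num(.)
            if t in feats:
                seq.append((pos, t))
        for (na, a), (nb, b) in zip(seq, seq[1:]):        # step 7: successive terms
            # formula (1), keeping only the nearest constraint per pair (max weight)
            U[(a, b)] = max(U.get((a, b), 0.0), 1.0 / (nb - na))

    total = sum(U.values())                               # formula (5), as reconstructed
    for (a, b), u in U.items():                           # steps 8-9: edge weights W2
        G.add_edge(a, b, weight=u / total if total else 0.0)
    return G
```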
Further, formula (1) is given by Definition 1, i.e., the semantic measure that defines the weight of the edge between two feature terms: w_AB = 1/(num(B) − num(A)) (1), where num(B) denotes the ordinal position of feature term B in the document and num(A) the ordinal position of feature term A.
Definition 1 therein is as follows: a document D corresponds, in the word semantic space, to a class graph G, where G is a four-tuple G(V, E, W1, W2), a class weighted directed graph formed by a weighted vertex set V(G) and a weighted edge set E(G). V(G) consists of all feature terms appearing in document D; the vertex weight set W1 consists of the term frequencies in the text of the vertices in V(G); if the feature terms of two vertices occur in succession, there is a directed edge between them, pointing from the earlier vertex to the later one, and the edge weight w represents the strength of the constraint between the two feature terms it connects; the set formed by all edges is called the edge set E(G), and the set formed by the edge weights w is called the edge weight set W2.
Preferably, the document expression form of Definition 1 is:

T = [t1, t2, …, tn]   (2)

$$M = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1,n-1} & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2,n-1} & a_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} & a_{n-1,n} \\ a_{n1} & a_{n2} & \cdots & a_{n,n-1} & a_{nn} \end{bmatrix} \quad (3)$$

where T is the set of feature terms; ti is a feature term, i = 1, 2, …, n; M is the incidence matrix of the feature terms; and aij is the strength of association between feature terms ti and tj (1 ≤ i ≤ j ≤ n).
If some word A constrains another word B more than once within the same paragraph, only the nearest constraint between them is counted; by Definition 1, the maximum constraint value is 1. This yields the matrix U:

$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1,n-1} & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2,n-1} & u_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ u_{n-1,1} & u_{n-1,2} & \cdots & u_{n-1,n-1} & u_{n-1,n} \\ u_{n1} & u_{n2} & \cdots & u_{n,n-1} & u_{nn} \end{bmatrix} \quad (4)$$

In general, the matrix U needs to be normalized. Let

$$w_{ij} = \frac{u_{ij}}{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kl}}, \qquad i, j, k, l = 1, 2, \ldots, n \quad (5)$$

which gives the normalized matrix W:

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1,n-1} & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2,n-1} & w_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ w_{n-1,1} & w_{n-1,2} & \cdots & w_{n-1,n-1} & w_{n-1,n} \\ w_{n1} & w_{n2} & \cdots & w_{n,n-1} & w_{nn} \end{bmatrix} \quad (6)$$
Further, the semantically closer two documents D1 and D2 are, the more similar their corresponding document graphs; conversely, the more similar two document graphs are, the semantically closer the documents. When D1 and D2 are semantically close, this is reflected in the features of the graphs: the two graphs share more identical vertices and edges, and the weights of those edges are closer.
Suppose the weighted directed graphs corresponding to two documents D1 and D2 are G1 and G2 respectively, and that their maximum common subgraph is C. The similarity of D1 and D2 is then defined as:

$$S(D_1, D_2) = \beta \frac{|V(C)|}{n} + (1-\beta) \sum_{e \in E(C)} w_e^C \quad (7)$$

where |V(C)| denotes the number of vertices of the maximum common subgraph C of the weighted directed graphs G1 and G2, n = max{|V(G1)|, |V(G2)|}, and the constant β is a decimal between 0 and 1.
Document similarity reflects the degree of similarity between two documents. It is usually a value between 0 and 1: 0 means dissimilar, 1 means completely similar, and a larger value means the two documents are more similar.
The semantically closer two documents are, the more the features of their graphs coincide: the two graphs have more identical vertices and edges, and closer edge weights. In formula (7), |V(C)|/n measures the vertex composition of the two graphs: the semantically closer the documents, the more similar the corresponding graphs, and the larger this term, approaching 1. Likewise, Σ_{e∈E(C)} w_e^C measures the edge composition of the two graphs: the semantically closer the documents, the more similar the graphs, and the larger this term, approaching 1. Thus the linear combination S(D1, D2) = β|V(C)|/n + (1−β)Σ_{e∈E(C)} w_e^C measures the similarity of the graphs corresponding to the two documents, with S(D1, D2) between 0 and 1; correspondingly, the distance between the two documents D1 and D2 is Dis(D1, D2) = 1 − S(D1, D2).
In addition, a weak equivalence relation is defined as follows: let R be a binary relation on a set A satisfying the conditions:

Reflexivity: for every element x of A, <x, x> ∈ R;

Symmetry: for any two elements x and y of A, if <x, y> ∈ R, then <y, x> ∈ R;

Weak transitivity: for any three elements x, y and z of A, if <x, y> ∈ R and <y, z> ∈ R, then <x, z> ∈ LR, where LR denotes the weak binary relation of R.

Then R is said to be a weak equivalence relation defined on the set A. The similarity relation S of documents is a binary relation on a document set Dset, and the similarity relation S of documents is a weak equivalence relation.
Here, word semantic space = word space + semantic space, and the formal specification of the word semantic space is as follows:

S = <T, R, W1, W2>, where T = {t1, t2, …, ti, …, tn}, i = 1, 2, …, n, is the set of feature terms and ti is a feature term; R is the semantic constraint relation on the set T: elements ti and tj of T satisfy the relation R if and only if ti constrains tj, written tiRtj or <ti, tj> ∈ R, i, j = 1, 2, …, n; W1 is the set of feature-term weights, here the set of term frequencies of the ti; and W2 is the set of constraint strengths between elements of T.

Obviously, the semantic constraint relation on T is a binary relation on T. From set theory, a binary relation can be represented by a graph G whose vertices are the elements of T; if <ti, tj> ∈ R, there is a directed edge from vertex ti to vertex tj, i, j = 1, 2, …, n. Because a relation is a set of ordered pairs and the order of the elements in an ordered pair cannot be reversed, directed edges are used in the graph representation of the relation. A minimal sketch of this structure follows.
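As a minimal sketch only (the container and field names are our own, not the patent's), the four-tuple S = <T, R, W1, W2> can be written down as:

```python
from dataclasses import dataclass

@dataclass
class WordSemanticSpace:
    """The word semantic space S = <T, R, W1, W2>.

    T:  the feature terms t_1, ..., t_n
    R:  the semantic constraint relation, as ordered pairs (t_i, t_j)
        meaning t_i constrains t_j (a directed edge t_i -> t_j)
    W1: the term frequency of each feature term
    W2: the constraint strength of each pair in R
    """
    T: list[str]
    R: set[tuple[str, str]]
    W1: dict[str, int]
    W2: dict[tuple[str, str], float]
```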
The word semantic space of the present invention is thus a network formed by words and the constraint relations between words, where the strength of the constraint relation between words represents semantic distance. The similarity of graphs is measured with the basic elements of a graph (vertices and edges), achieving a good clustering effect, so that the semantic information of a text is reflected from its surface features: the feature terms, their frequencies, and their positional relationships. A new document representation model based on the word semantic space is thereby established, which can successfully capture the following information: (1) part of speech; (2) word order; (3) word frequency; (4) word co-occurrence; (5) contextual information of words in the text.
The foregoing are only preferred embodiments of the present invention and are not intended to limit its scope; any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the protection scope of the present invention.

Claims (6)

1. A graph-based document representation method, characterized in that the method comprises the steps of:

Step 1: inputting a text document D;

Step 2: outputting a text class graph G(V, E, W1, W2);

Step 3: determining the maximum number of vertices n of the graph model for each document;

Step 4: performing word segmentation, part-of-speech tagging, preprocessing and word frequency statistics on the document;

Step 5: selecting the feature terms that best represent the document, their number not exceeding n, and recording the order of all feature terms in the document;

Step 6: for document D, taking all of its feature terms as the vertices of the graph model, with the occurrence frequency of each feature term forming the weight of its vertex;

Step 7: if two feature terms occur in succession in some paragraph of the document, placing a directed edge between them, pointing from the earlier feature term to the later one, and counting the number of co-occurrences of the two feature terms in the document;

Step 8: determining the incidence matrices M and U of the feature terms according to formula (1);

Step 9: normalizing the matrix U according to formula (5) to determine the normalized incidence matrix W.
2. The graph-based document representation method according to claim 1, characterized in that formula (1) is given by Definition 1, i.e., the semantic measure that defines the weight of the edge between two feature terms:

w_AB = 1/(num(B) − num(A))   (1)

where num(B) denotes the ordinal position of feature term B in the document and num(A) the ordinal position of feature term A.
3. The graph-based document representation method according to claim 1, characterized in that Definition 1 is: a document D corresponds, in the word semantic space, to a class graph G, where G is a four-tuple G(V, E, W1, W2), a class weighted directed graph formed by a weighted vertex set V(G) and a weighted edge set E(G); V(G) consists of all feature terms appearing in document D; the vertex weight set W1 consists of the term frequencies in the text of the vertices in V(G); if the feature terms of two vertices occur in succession, there is a directed edge between them, pointing from the earlier vertex to the later one, and the edge weight w represents the strength of the constraint between the two feature terms it connects; the set formed by all edges is called the edge set E(G), and the set formed by the edge weights w is called the edge weight set W2.
4. The graph-based document representation method according to claim 1, characterized in that the document expression form of Definition 1 is:

T = [t1, t2, …, tn]   (2)
$$M = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1,n-1} & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2,n-1} & a_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} & a_{n-1,n} \\ a_{n1} & a_{n2} & \cdots & a_{n,n-1} & a_{nn} \end{bmatrix} \quad (3)$$
where T is the set of feature terms; ti is a feature term, i = 1, 2, …, n; M is the incidence matrix of the feature terms; and aij is the strength of association between feature terms ti and tj (1 ≤ i ≤ j ≤ n);

if some word A constrains another word B more than once within the same paragraph, only the nearest constraint between them is counted; by Definition 1, the maximum constraint value is 1, which yields the matrix U:
$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1,n-1} & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2,n-1} & u_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ u_{n-1,1} & u_{n-1,2} & \cdots & u_{n-1,n-1} & u_{n-1,n} \\ u_{n1} & u_{n2} & \cdots & u_{n,n-1} & u_{nn} \end{bmatrix} \quad (4)$$
in general, the matrix U needs to be normalized; let

$$w_{ij} = \frac{u_{ij}}{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kl}}, \qquad i, j, k, l = 1, 2, \ldots, n \quad (5)$$

which gives the normalized matrix W:
$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1,n-1} & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2,n-1} & w_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ w_{n-1,1} & w_{n-1,2} & \cdots & w_{n-1,n-1} & w_{n-1,n} \\ w_{n1} & w_{n2} & \cdots & w_{n,n-1} & w_{nn} \end{bmatrix} \quad (6)$$
5. The graph-based document representation method according to claim 1, characterized in that the semantically closer two documents D1 and D2 are, the more similar their corresponding document graphs; conversely, the more similar two document graphs are, the semantically closer the documents; when D1 and D2 are semantically close, this is reflected in the features of the graphs: the two graphs share more identical vertices and edges, and the weights of those edges are closer.
6. The graph-based document representation method according to claim 1, characterized in that: suppose the weighted directed graphs corresponding to two documents D1 and D2 are G1 and G2 respectively, and that their maximum common subgraph is C; the similarity of D1 and D2 is then defined as:
$$S(D_1, D_2) = \beta \frac{|V(C)|}{n} + (1-\beta) \sum_{e \in E(C)} w_e^C \quad (7)$$
where |V(C)| denotes the number of vertices of the maximum common subgraph C of the weighted directed graphs G1 and G2, n = max{|V(G1)|, |V(G2)|}, and the constant β is a decimal between 0 and 1;

document similarity reflects the degree of similarity between two documents; it is usually a value between 0 and 1, where 0 means dissimilar, 1 means completely similar, and a larger value means the two documents are more similar;

the semantically closer two documents are, the more the features of their graphs coincide: the two graphs have more identical vertices and edges, and closer edge weights; in formula (7), |V(C)|/n measures the vertex composition of the two graphs: the semantically closer the documents, the more similar the corresponding graphs, and the larger this term, approaching 1; likewise, Σ_{e∈E(C)} w_e^C measures the edge composition of the two graphs: the semantically closer the documents, the more similar the graphs, and the larger this term, approaching 1; the linear combination S(D1, D2) = β|V(C)|/n + (1−β)Σ_{e∈E(C)} w_e^C thus measures the similarity of the graphs corresponding to the two documents, with S(D1, D2) between 0 and 1; correspondingly, the distance between the two documents D1 and D2 is Dis(D1, D2) = 1 − S(D1, D2).
CN201710599697.2A 2017-07-21 2017-07-21 Text representation method based on graph Active CN107357918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710599697.2A CN107357918B (en) 2017-07-21 2017-07-21 Text representation method based on graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710599697.2A CN107357918B (en) 2017-07-21 2017-07-21 Text representation method based on graph

Publications (2)

Publication Number Publication Date
CN107357918A true CN107357918A (en) 2017-11-17
CN107357918B CN107357918B (en) 2022-01-25

Family

ID=60284884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710599697.2A Active CN107357918B (en) 2017-07-21 2017-07-21 Text representation method based on graph

Country Status (1)

Country Link
CN (1) CN107357918B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992480A (en) * 2017-12-25 2018-05-04 东软集团股份有限公司 A kind of method, apparatus for realizing entity disambiguation and storage medium, program product
CN109326327A (en) * 2018-08-28 2019-02-12 福建师范大学 A kind of Sequence clustering method based on SeqRank nomography
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024385A1 (en) * 2007-07-16 2009-01-22 Semgine, Gmbh Semantic parser

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024385A1 (en) * 2007-07-16 2009-01-22 Semgine, Gmbh Semantic parser

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王明文 (Wang Mingwen) et al.: "Research on Chinese Opinion Sentence Recognition Based on a Term Co-occurrence Relation Graph Model", Journal of Chinese Information Processing (《中文信息学报》) *
王映龙 (Wang Yinglong) et al.: "Research on Weighted Maximal Frequent Subgraph Mining Algorithms", Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992480A (en) * 2017-12-25 2018-05-04 东软集团股份有限公司 A kind of method, apparatus for realizing entity disambiguation and storage medium, program product
CN107992480B (en) * 2017-12-25 2021-09-14 东软集团股份有限公司 Method, device, storage medium and program product for realizing entity disambiguation
CN109326327A (en) * 2018-08-28 2019-02-12 福建师范大学 A kind of Sequence clustering method based on SeqRank nomography
CN109326327B (en) * 2018-08-28 2021-11-12 福建师范大学 Biological sequence clustering method based on SeqRank graph algorithm
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method

Also Published As

Publication number Publication date
CN107357918B (en) 2022-01-25

Similar Documents

Publication Publication Date Title
CN105843897B (en) A kind of intelligent Answer System towards vertical field
CN109783618B (en) Attention mechanism neural network-based drug entity relationship extraction method and system
Gupta et al. Choosing linguistics over vision to describe images
Pedersen A simple approach to building ensembles of naive bayesian classifiers for word sense disambiguation
CN106776562A (en) A kind of keyword extracting method and extraction system
CN107562919B (en) Multi-index integrated software component retrieval method and system based on information retrieval
CN101398814A (en) Method and system for simultaneously abstracting document summarization and key words
JPWO2014033799A1 (en) Word semantic relation extraction device
Gómez-Adorno et al. Automatic authorship detection using textual patterns extracted from integrated syntactic graphs
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN107357918A Graph-based document representation method
CN108427723A A kind of author's recommendation method and system based on clustering algorithm and local sensing reconstructing model
CN112597300A (en) Text clustering method and device, terminal equipment and storage medium
CN111782759B (en) Question-answering processing method and device and computer readable storage medium
Simm et al. Classification of short text comments by sentiment and actionability for voiceyourview
Hristea The Naïve Bayes model for unsupervised word sense disambiguation: aspects concerning feature selection
Gärdenfors A semantic theory of word classes
CN111695358A (en) Method and device for generating word vector, computer storage medium and electronic equipment
Sebti et al. A new word sense similarity measure in WordNet
Li et al. Naive Bayesian automatic classification of railway service complaint text based on eigenvalue extraction
CN114997288A (en) Design resource association method
Khan et al. Offensive language detection for low resource language using deep sequence model
Cheng et al. Domain-specific ontology mapping by corpus-based semantic similarity
CN112860781A (en) Mining and displaying method combining vocabulary collocation extraction and semantic classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant