CN103123685B - Text mode recognition method - Google Patents

Info

Publication number
CN103123685B
Authority
CN
China
Prior art keywords
text
weight
keyword
direct graph
manifold edges
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110367595.0A
Other languages
Chinese (zh)
Other versions
CN103123685A (en)
Inventor
吴秦
张存铨
艾迪·福勒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN201110367595.0A
Publication of CN103123685A
Application granted
Publication of CN103123685B
Legal status: Expired - Fee Related (current)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text mode recognition method comprising: scanning the original text file line by line and recording the number of occurrences and the positions of each keyword in the text; mapping the text, according to the recorded occurrence counts and positions, to a weighted directed graph with multiple edges, in which each node represents a keyword; simplifying the weighted directed graph with multiple edges into a simple weighted directed graph; representing the simple weighted directed graph as a matrix; and mapping the text to a text feature vector according to the resulting matrix and the recorded keyword occurrence counts. Compared with conventional methods, the method preserves more, and more useful, feature information from the original text file, so that better results are obtained in text classification and text similarity computation.

Description

Text mode recognition method
[Technical Field]
The present invention relates to the field of text recognition, and in particular to a text mode recognition method.
[Background Art]
With the development of networks and the emergence of digital libraries, quickly obtaining useful information from massive amounts of text has become an important research topic in information processing and pattern recognition. If texts can be automatically classified and labeled under a given taxonomy according to their content, and different texts can be analyzed for similarity, people can organize and mine text information more effectively.
Prior art: keywords have long been used as the feature items of a text. Based on keyword occurrence frequency, texts are usually classified automatically with methods such as decision trees, neural networks, Bayesian methods, or support vector machines. Similarity comparison between different texts is likewise usually based on keyword occurrence frequency.
Keyword frequency alone can, to some extent, divide texts into coarse categories, but when it is used to make fine-grained similarity distinctions between different documents, the results are not good. There are two main reasons: (1) using only keyword frequency ignores the dependencies that may exist between keywords; (2) conventional methods also make no use of the structural information of the text. Both directly affect the results of text classification and text similarity comparison.
It is therefore necessary to develop an improved text mode recognition method that overcomes these problems.
[Summary of the Invention]
One technical problem to be solved by the present invention is to provide a text mode recognition method that preserves more, and more useful, feature information from the original text file, so that better results are obtained in text classification and text similarity computation.
To solve this problem, according to one aspect of the present invention, a text mode recognition method is provided, comprising: scanning the original text file line by line and recording the number of occurrences and the positions of each keyword in the text; mapping the text, according to the recorded occurrence counts and positions, to a weighted directed graph with multiple edges, in which each node represents a keyword; simplifying the weighted directed graph with multiple edges into a simple weighted directed graph; representing the simple weighted directed graph as a matrix; and mapping the text to a text feature vector according to the resulting matrix and the recorded keyword occurrence counts.
Further, let the keyword set be $K=\{k_1, k_2, \ldots, k_n\}$ and let keyword $k_i$ occur $f_i$ times in the text; $F=[f_1, f_2, \ldots, f_n]$ represents the occurrence counts of all keywords, where $1 \le i \le n$ and $n$ is a natural number greater than or equal to 1.
Further, each node in the weighted directed graph with multiple edges represents a keyword $k_i$. If keyword $k_i$ occurs at position $p_i$ in the text, keyword $k_j$ occurs at position $p_j$, and $p_j$ is after $p_i$, then a directed edge $k_i k_j$ is added to the graph, with weight equal to the distance between $p_i$ and $p_j$. If $k_i$ and $k_j$ occur multiple times in the text, the same rule is applied to the occurrences at the different positions, which yields multiple edges between $k_i$ and $k_j$; here $1 \le j \le n$.
Further, simplifying the weighted directed graph with multiple edges into a simple weighted directed graph comprises:
taking the node set of the weighted directed graph with multiple edges as the node set of the simple weighted directed graph;
denoting the directed edge from node $k_i$ to node $k_j$ in the simple weighted directed graph as $k_i k_j$, with weight $w(k_i k_j)$ given by:
$$w(k_i k_j) = \sum_{e \in E_{ij}} \frac{1}{\tilde{w}(e)},$$
where $E_{ij}$ denotes the set of directed edges from node $k_i$ to node $k_j$ in the weighted directed graph with multiple edges, and $\tilde{w}(e)$ denotes the weight of directed edge $e$ in that graph.
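Further, the matrix $W$ representing the simple weighted directed graph is the $n \times n$ matrix of edge weights:
$$W = \begin{bmatrix} w(k_1 k_1) & \cdots & w(k_1 k_n) \\ \vdots & \ddots & \vdots \\ w(k_n k_1) & \cdots & w(k_n k_n) \end{bmatrix}.$$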
Further, the text feature vector $R(D)$ to which the text is mapped is:
$$R(D) = [f_1, f_2, \ldots, f_n, w(k_1,k_1), \ldots, w(k_1,k_n), \ldots, w(k_n,k_1), \ldots, w(k_n,k_n)].$$
Further, suppose the texts are $D_1, \ldots, D_m$ and the corresponding text feature vectors are $R(D_1), \ldots, R(D_m)$; the text mode recognition method further comprises:
computing the similarity between any two texts $D_x$ and $D_y$ using the following formula,
where $1 \le x \le m$ and $1 \le y \le m$.
Compared with the prior art, the present invention builds a weighted directed graph model to describe text information. The model uses not only the frequency of keywords in the text but also the positions of keywords in the text and the distances between them, so that each text corresponds to a characteristic weighted directed graph. On this basis, each text is mapped to a text feature vector, which contains the keyword frequency information and also implicitly encodes the structural information of the text. Computing the similarity between different texts thus reduces to computing the similarity between their text feature vectors. Compared with conventional methods, the invention preserves more, and more useful, feature information from the original text file, so that better results are obtained in text classification and text similarity computation.
Other objects, features, and advantages of the present invention are described in detail in the embodiments below with reference to the accompanying drawings.
[Brief Description of the Drawings]
The present invention will be more readily understood from the following detailed description taken together with the accompanying drawings, in which like reference numerals denote like structural elements, and in which:
Fig. 1 is a schematic flow diagram of the text mode recognition method in one embodiment of the present invention;
Fig. 2 shows an example of a text;
Fig. 3 shows the relative position information of each keyword in the text shown in Fig. 2;
Fig. 4 shows the weighted directed graph with multiple edges of the text shown in Fig. 2; and
Fig. 5 shows the simple weighted directed graph obtained by simplifying the weighted directed graph with multiple edges shown in Fig. 4.
[Detailed Description]
To make the above objects, features, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The detailed description is presented mainly in terms of procedures, steps, logic blocks, processes, or other symbolic representations that directly or indirectly model the operation of the technical solution of the invention. Those skilled in the art use such descriptions and representations to convey the substance of their work effectively to others skilled in the art.
Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one implementation of the invention. Appearances of "in one embodiment" in various places in this specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. In addition, the order of modules in the methods, flowcharts, or functional block diagrams representing one or more embodiments does not inherently indicate any particular order, nor does it limit the invention.
Fig. 1 is a schematic flow diagram of the text mode recognition method 100 in one embodiment of the present invention. The text mode recognition method 100 comprises the following steps.
Step 110: scan the original text file line by line, and record the number of occurrences and the positions of each keyword in the text.
If a keyword occurs multiple times in the text, the absolute or relative position of every occurrence is recorded. The number of occurrences of each keyword is recorded at the same time.
Let the keyword set be $K=\{k_1, k_2, \ldots, k_n\}$ and let keyword $k_i$ occur $f_i$ times; then $F=[f_1, f_2, \ldots, f_n]$ represents the occurrence counts of all keywords, where $1 \le i \le n$ and $n$ is a natural number greater than or equal to 1.
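As a rough illustration of step 110, the scan could be sketched in Python as follows. This is a minimal sketch under assumptions that the description does not fix: keywords are single words matched case-insensitively, and a position is the zero-based word index counted across the whole file. The function name `scan_keywords` and the tokenizing regular expression are hypothetical.

```python
import re


def scan_keywords(text, keywords):
    """Scan text line by line; return (counts, positions) per keyword.

    counts[i] is the number of occurrences of keywords[i] (the vector F);
    positions[i] lists the word indices at which keywords[i] occurs.
    Assumption: single-word keywords, case-insensitive matching.
    """
    index = {k.lower(): i for i, k in enumerate(keywords)}
    positions = [[] for _ in keywords]
    word_pos = 0  # running word index across all lines
    for line in text.splitlines():
        for word in re.findall(r"[A-Za-z0-9]+", line):
            i = index.get(word.lower())
            if i is not None:
                positions[i].append(word_pos)
            word_pos += 1
    counts = [len(p) for p in positions]
    return counts, positions


sample = "The bank opened an account.\nTransfer the fund to the account."
F, P = scan_keywords(sample, ["Bank", "Account", "Fund", "Transfer"])
print(F)  # occurrence counts, one per keyword
print(P)  # word positions of every occurrence
```

Recording every position, rather than only the counts, is what makes the distance-based edges of the next step possible.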
Step 120: map the text to a weighted directed graph $G_m$ with multiple edges.
Each node in $G_m$ represents a keyword $k_i$, so $G_m$ has $n$ nodes in total. If keyword $k_i$ occurs at position $p_i$ in the text, keyword $k_j$ occurs at position $p_j$, and $p_j$ is after $p_i$, then a directed edge $k_i k_j$ is added to $G_m$, with weight equal to the distance between $p_i$ and $p_j$. If $k_i$ and $k_j$ occur multiple times in the text, the same rule is applied in $G_m$ to the occurrences at the different positions, which yields multiple edges between $k_i$ and $k_j$; here $1 \le j \le n$. A keyword that occurs more than once corresponds to multiple positions.
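One way to read the edge rule of step 120 is sketched below in Python. The sketch assumes that an edge is added for every ordered pair of occurrences in which the second occurrence comes later, including two occurrences of the same keyword; the description does not state explicitly whether only adjacent occurrences are meant, although the nonzero diagonal entries of the example matrix W later in the description suggest that repeated occurrences of one keyword also produce edges. The name `build_multigraph` and the dictionary-of-lists representation are hypothetical.

```python
def build_multigraph(positions):
    """Build the multi-edge weighted directed graph from keyword positions.

    positions[i] lists the positions of keyword i in the text.
    Returns a dict mapping (i, j) to the list of weights of the parallel
    edges from node i to node j; each weight is a distance q - p with q > p.
    Assumption: every ordered pair of occurrences contributes one edge.
    """
    edges = {}
    for i, pos_i in enumerate(positions):
        for j, pos_j in enumerate(positions):
            for p in pos_i:
                for q in pos_j:
                    if q > p:  # keyword j occurs after keyword i
                        edges.setdefault((i, j), []).append(q - p)
    return edges


# Keyword 0 occurs once at position 0; keyword 1 occurs at positions 12 and 62.
print(build_multigraph([[0], [12, 62]]))
```

Keeping the parallel edges as a list per node pair leaves the graph unsimplified, which matches the role of $G_m$ before step 130.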
Step 130: simplify the weighted directed graph $G_m$ with multiple edges into a simple weighted directed graph $G_s$.
Let $E_{ij}$ be the set of edges from node $k_i$ to node $k_j$ (that is, between keywords $k_i$ and $k_j$ in the text) in the graph $G_m$ obtained in step 120.
$G_s$ is constructed as follows:
The node set of $G_m$ is taken as the node set of $G_s$;
The directed edge from node $k_i$ to node $k_j$ in $G_s$ is denoted $k_i k_j$, and its weight $w(k_i k_j)$ is defined as
$$w(k_i k_j) = \sum_{e \in E_{ij}} \frac{1}{\tilde{w}(e)},$$
where $E_{ij}$ denotes the set of directed edges from node $k_i$ to node $k_j$ in the weighted directed graph $G_m$ with multiple edges, and $\tilde{w}(e)$ denotes the weight of directed edge $e$ in $G_m$.
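The reduction of step 130 can be sketched in Python as below: each group of parallel edges collapses into a single edge whose weight is the sum of the reciprocal distances, so close occurrences contribute more than distant ones. The function name `simplify` and the input format (a dictionary from node-index pairs to lists of distances) are assumptions made for illustration.

```python
def simplify(edges, n):
    """Collapse parallel edges into one weight per ordered node pair.

    edges maps (i, j) to the list of edge weights (distances) from node i
    to node j in the multi-edge graph; n is the number of keywords.
    Returns the n x n matrix W with W[i][j] = sum of 1 / distance.
    """
    W = [[0.0] * n for _ in range(n)]
    for (i, j), weights in edges.items():
        W[i][j] = sum(1.0 / w for w in weights)
    return W


# Two parallel edges of length 12 and 62 from node 0 to node 1,
# and one self-edge of length 50 on node 1.
W = simplify({(0, 1): [12, 62], (1, 1): [50]}, 2)
print(W)
```

Node pairs with no edge keep the weight 0, so the result can be used directly as the matrix of step 140.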
Step 140: describe the simple weighted directed graph $G_s$ by a matrix $W$.
Step 150: for any text $D$, map $D$ to a text feature vector $R(D)$ according to the resulting matrix $W$ and the recorded keyword occurrence counts $F$:
$$R(D) = [f_1, f_2, \ldots, f_n, w(k_1,k_1), \ldots, w(k_1,k_n), \ldots, w(k_n,k_1), \ldots, w(k_n,k_n)]$$
Repeating steps 110 to 150 yields the text feature vectors of all texts. Suppose the texts are $D_1, \ldots, D_m$; the corresponding text feature vectors are then $R(D_1), \ldots, R(D_m)$.
Let $M$ denote the matrix formed by the feature vectors of all texts:
$$M = \begin{bmatrix} R(D_1) \\ \vdots \\ R(D_m) \end{bmatrix}.$$
Normalizing $M$ gives a new matrix
$$\tilde{M} = \begin{bmatrix} \tilde{R}(D_1) \\ \vdots \\ \tilde{R}(D_m) \end{bmatrix}.$$
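The description does not say how M is normalized. Purely as an assumption, the Python sketch below scales each column by its maximum value, which puts the occurrence counts and the edge weights on a comparable scale; other schemes (for example min-max or unit-length rows) would fit the text equally well. The function name `normalize_columns` is hypothetical.

```python
def normalize_columns(M):
    """Divide every column of M by its maximum (assumed normalization).

    M is a list of equal-length rows of nonnegative numbers, one row per
    text. Columns whose maximum is 0 are left as zeros.
    """
    col_max = [max(col) for col in zip(*M)]
    return [
        [v / m if m > 0 else 0.0 for v, m in zip(row, col_max)]
        for row in M
    ]


M = [[1, 2, 0.0], [2, 4, 0.0]]
print(normalize_columns(M))
```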
The text mode recognition method of the present invention may further comprise:
Step 160: compute the similarity between any two texts $D_x$ and $D_y$ using the following formula,
where $1 \le x \le m$ and $1 \le y \le m$.
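The similarity formula of step 160 is not reproduced above. As a stand-in only, and not as the formula of the patent, the Python sketch below uses cosine similarity, a common choice for comparing two feature vectors; the function name `cosine_similarity` is hypothetical, and the inputs are assumed to be two (normalized) text feature vectors of equal length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (assumed measure).

    Returns 0.0 if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
```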
One of the benefits or advantages of the present invention is that it builds a weighted directed graph model to describe text information. The model uses not only the frequency of keywords in the text but also the positions of keywords in the text and the distances between them, so that each text corresponds to a characteristic weighted directed graph. On this basis, each text is mapped to a text feature vector, which contains the keyword frequency information and also implicitly encodes the structural information of the text. Computing the similarity between different texts thus reduces to computing the similarity between their text feature vectors. Compared with conventional methods, the invention preserves more, and more useful, feature information from the original text file, so that better results are obtained in text classification and text similarity computation.
Fig. 2 shows an example of a text, for which the keyword set is {Bank, Account, Fund, Transfer}. The recorded number of occurrences of each keyword in the text shown in Fig. 2 is $F=[f_1=1, f_2=2, f_3=2, f_4=2]$, where $f_1$ is the number of occurrences of Bank, $f_2$ of Account, $f_3$ of Fund, and $f_4$ of Transfer. The recorded relative position information of the keywords is shown in Fig. 3; the relative position is the word distance between two adjacent keywords. For example, the first keyword to occur, bank, and the second, fund, are 12 words apart, which is represented by 12. Fig. 4 shows the weighted directed graph $G_m$ with multiple edges of the text shown in Fig. 2. Fig. 5 shows the simple weighted directed graph $G_s$ obtained by simplifying the graph $G_m$ shown in Fig. 4.
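The matrix $W$ describing the simple weighted directed graph $G_s$ is:
$$W = \begin{array}{c|cccc} & \text{bank} & \text{fund} & \text{account} & \text{transfer} \\ \hline \text{bank} & 0 & 0.0995 & 0.0705 & 0.0459 \\ \text{fund} & 0 & 0.0200 & 0.3848 & 0.5668 \\ \text{account} & 0 & 0.0227 & 0.0204 & 0.0884 \\ \text{transfer} & 0 & 0.0345 & 0.3627 & 0.0323 \end{array}$$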
The text feature vector of the text shown in Fig. 2 is:
$$V = [1, 2, 2, 2, 0, 0.0995, 0.0705, 0.0459, 0, 0.0200, 0.3848, 0.5668, 0, 0.0227, 0.0204, 0.0884, 0, 0.0345, 0.3627, 0.0323].$$
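The vector above is the count vector F followed by the rows of W in order, which gives n + n² = 20 entries for n = 4 keywords. A small Python check of that layout, using the numbers from this example (the helper name `feature_vector` is hypothetical):

```python
def feature_vector(F, W):
    """Concatenate the counts F with the rows of W, row by row."""
    return list(F) + [w for row in W for w in row]


F = [1, 2, 2, 2]
W = [
    [0, 0.0995, 0.0705, 0.0459],
    [0, 0.0200, 0.3848, 0.5668],
    [0, 0.0227, 0.0204, 0.0884],
    [0, 0.0345, 0.3627, 0.0323],
]
V = feature_vector(F, W)
print(len(V))
print(V)
```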
The invention has been described above in sufficient detail with a certain degree of particularity. Those of ordinary skill in the art should understand that the descriptions in the embodiments are merely exemplary, and that all changes made without departing from the true spirit and scope of the invention fall within its scope of protection. The scope of protection claimed is defined by the claims rather than by the foregoing description of the embodiments.

Claims (5)

1. A text mode recognition method, characterized in that it comprises:
scanning the original text file line by line, and recording the number of occurrences and the positions of each keyword in the text;
mapping the text, according to the recorded occurrence counts and positions of the keywords in the text, to a weighted directed graph with multiple edges, wherein each node in the weighted directed graph with multiple edges represents a keyword;
simplifying the weighted directed graph with multiple edges into a simple weighted directed graph;
representing the simple weighted directed graph as a matrix; and
mapping the text to a text feature vector according to the resulting matrix and the recorded keyword occurrence counts,
wherein the keyword set is $K=\{k_1, k_2, \ldots, k_n\}$, keyword $k_i$ occurs $f_i$ times in the text, and $F=[f_1, f_2, \ldots, f_n]$ represents the occurrence counts of all keywords, where $1 \le i \le n$ and $n$ is a natural number greater than or equal to 1, and
wherein each node in the weighted directed graph with multiple edges represents a keyword $k_i$; if keyword $k_i$ occurs at position $p_i$ in the text, keyword $k_j$ occurs at position $p_j$ in the text, and $p_j$ is after $p_i$, then a directed edge $k_i k_j$ is added to the weighted directed graph with multiple edges, the weight of $k_i k_j$ being the distance between $p_i$ and $p_j$; and if $k_i$ and $k_j$ occur multiple times in the text, the same rule is applied to the occurrences of $k_i$ and $k_j$ at the different positions, which are mapped to multiple edges, where $1 \le j \le n$.
2. The text mode recognition method according to claim 1, characterized in that simplifying the weighted directed graph with multiple edges into a simple weighted directed graph comprises:
taking the node set of the weighted directed graph with multiple edges as the node set of the simple weighted directed graph;
denoting the directed edge from node $k_i$ to node $k_j$ in the simple weighted directed graph as $k_i k_j$, with weight $w(k_i k_j)$ given by:
$$w(k_i k_j) = \sum_{e \in E_{ij}} \frac{1}{\tilde{w}(e)},$$
where $E_{ij}$ denotes the set of directed edges from node $k_i$ to node $k_j$ in the weighted directed graph with multiple edges, and $\tilde{w}(e)$ denotes the weight of directed edge $e$ in that graph.
3. The text mode recognition method according to claim 2, characterized in that the matrix $W$ representing the simple weighted directed graph is:
4. The text mode recognition method according to claim 3, characterized in that the text feature vector $R(D)$ to which the text is mapped is:
$$R(D) = [f_1, f_2, \ldots, f_n, w(k_1,k_1), \ldots, w(k_1,k_n), \ldots, w(k_n,k_1), \ldots, w(k_n,k_n)].$$
5. The text mode recognition method according to claim 4, characterized in that, supposing the texts are $D_1, \ldots, D_m$ and the corresponding text feature vectors are $R(D_1), \ldots, R(D_m)$,
the text mode recognition method further comprises:
computing the similarity between any two texts $D_x$ and $D_y$ using the following formula, where $1 \le x \le m$ and $1 \le y \le m$.
CN201110367595.0A 2011-11-18 2011-11-18 Text mode recognition method Expired - Fee Related CN103123685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110367595.0A CN103123685B (en) 2011-11-18 2011-11-18 Text mode recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110367595.0A CN103123685B (en) 2011-11-18 2011-11-18 Text mode recognition method

Publications (2)

Publication Number Publication Date
CN103123685A CN103123685A (en) 2013-05-29
CN103123685B true CN103123685B (en) 2016-03-02

Family

ID=48454659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110367595.0A Expired - Fee Related CN103123685B (en) 2011-11-18 2011-11-18 Text mode recognition method

Country Status (1)

Country Link
CN (1) CN103123685B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622048B (en) * 2017-09-06 2021-06-22 南京硅基智能科技有限公司 Text mode recognition method and system
CN108255797A (en) * 2018-01-26 2018-07-06 上海康斐信息技术有限公司 A kind of text mode recognition method and system
US11410446B2 (en) 2019-11-22 2022-08-09 Nielsen Consumer Llc Methods, systems, apparatus and articles of manufacture for receipt decoding
US11810380B2 (en) 2020-06-30 2023-11-07 Nielsen Consumer Llc Methods and apparatus to decode documents based on images using artificial intelligence
US11822216B2 (en) 2021-06-11 2023-11-21 Nielsen Consumer Llc Methods, systems, apparatus, and articles of manufacture for document scanning
US11625930B2 (en) 2021-06-30 2023-04-11 Nielsen Consumer Llc Methods, systems, articles of manufacture and apparatus to decode receipts based on neural graph architecture

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620616A (en) * 2009-05-07 2010-01-06 北京理工大学 Chinese similar web page de-emphasis method based on microcosmic characteristic
CN101694670A (en) * 2009-10-20 2010-04-14 北京航空航天大学 Chinese Web document online clustering method based on common substrings
CN101944099A (en) * 2010-06-24 2011-01-12 西北工业大学 Method for automatically classifying text documents by utilizing body
CN102033867A (en) * 2010-12-14 2011-04-27 西北工业大学 Semantic-similarity measuring method for XML (Extensible Markup Language) document classification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004361987A (en) * 2003-05-30 2004-12-24 Seiko Epson Corp Image retrieval system, image classification system, image retrieval program, image classification program, image retrieval method, and image classification method

Also Published As

Publication number Publication date
CN103123685A (en) 2013-05-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160302

Termination date: 20191118