CN103123685B - Text mode recognition method

Info

Publication number: CN103123685B
Application number: CN201110367595.0A
Authority: CN
Inventors: 吴秦; 张存铨; 艾迪·福勒
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2011-11-18
Filing date: 2011-11-18
Publication date: 2016-03-02
Anticipated expiration: 2031-11-18
Also published as: CN103123685A

Abstract

The invention discloses a text mode recognition method comprising: scanning an original text file line by line and recording the number of occurrences and the positions of each keyword in the text; mapping the text, according to the recorded occurrence counts and positions, to a weighted directed graph with multiple edges, in which each node represents a keyword; reducing the weighted directed graph with multiple edges to a simple weighted directed graph; representing the simple weighted directed graph as a matrix; and mapping the text to a text feature vector according to the obtained matrix and the recorded keyword occurrence counts. Compared with conventional methods, this method preserves more, and more useful, feature information from the original text file, so that better results can be obtained in text classification and text similarity calculation.

Description

Text mode recognition method

[technical field]

The present invention relates to the field of text recognition, and in particular to a text mode recognition method.

[background technology]

With the development of networks and the emergence of digital libraries, how to quickly obtain useful information from massive amounts of text has become an important research topic in information processing and pattern recognition. If texts can be automatically classified and labeled by content under a given taxonomy, and different texts can be analyzed for similarity, people can organize and mine text information more effectively.

In the prior art, keywords have long been used as feature items of a text. Based on keyword occurrence frequency, texts are usually classified automatically by methods such as decision trees, neural networks, Bayesian methods, or support vector machines. Similarity comparison between different texts is likewise usually based on keyword occurrence frequency.

Keyword occurrence frequency alone can, to some extent, divide texts into coarse categories, but when it is used to assess the similarity of finely differentiated texts, the results are not very good. The main reasons are: (1) using only keyword occurrence frequency ignores the dependencies that may exist between keywords; (2) conventional methods also do not use the structural information of the text. Both directly affect text classification results and text similarity comparison results.

Therefore, it is necessary to develop an improved text mode recognition method that overcomes the above problems.

[summary of the invention]

One technical problem to be solved by the invention is to provide a text mode recognition method that preserves more, and more useful, feature information from the original text file, so that better results can be obtained in text classification and text similarity calculation.

To solve the above problem, according to one aspect of the invention, a text mode recognition method is provided, comprising: scanning an original text file line by line and recording the number of occurrences and the positions of each keyword in the text; mapping the text, according to the recorded occurrence counts and positions, to a weighted directed graph with multiple edges, in which each node represents a keyword; reducing the weighted directed graph with multiple edges to a simple weighted directed graph; representing the simple weighted directed graph as a matrix; and mapping the text to a text feature vector according to the obtained matrix and the recorded keyword occurrence counts.

Further, let the keyword set be K = {k_1, k_2, ..., k_n}, let keyword k_i occur f_i times in the text, and let F = [f_1, f_2, ..., f_n] represent the occurrence counts of all keywords, where i is greater than or equal to 1 and less than or equal to n, and n is a natural number greater than or equal to 1.

Further, each node in the weighted directed graph with multiple edges represents a keyword k_i. If keyword k_i occurs at position p_i in the text, keyword k_j occurs at position p_j in the text, and position p_j is after position p_i, then a directed edge k_i k_j is added to the weighted directed graph with multiple edges, and the weight of the directed edge k_i k_j is the distance between p_i and p_j. If keyword k_i and keyword k_j occur multiple times in the text, the same rule is applied to the occurrences of k_i and k_j at different positions in the text, which are thereby mapped to multiple edges in the graph, where j is greater than or equal to 1 and less than or equal to n.

Further, reducing the weighted directed graph with multiple edges to a simple weighted directed graph comprises:

using the node set of the weighted directed graph with multiple edges as the node set of the simple weighted directed graph;

denoting the directed edge from node k_i to node k_j in the simple weighted directed graph as k_i k_j, whose weight w(k_i k_j) is:

w(k_i k_j) = \sum_{e \in E_{ij}} \frac{1}{\tilde{w}(e)},

where E_ij is the set of directed edges from node k_i to node k_j in the weighted directed graph with multiple edges, and \tilde{w}(e) is the weight of directed edge e in the weighted directed graph with multiple edges.

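Further, the matrix W representing the simple weighted directed graph is the n × n matrix whose entry in row i and column j is w(k_i k_j):

W = \begin{bmatrix} w(k_1 k_1) & \cdots & w(k_1 k_n) \\ \vdots & \ddots & \vdots \\ w(k_n k_1) & \cdots & w(k_n k_n) \end{bmatrix}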

Further, the text feature vector R(D) to which the text is mapped is:

R(D) = [f_1, f_2, ..., f_n, w(k_1, k_1), ..., w(k_1, k_n), ..., w(k_n, k_1), ..., w(k_n, k_n)].

Further, given texts D_1, ..., D_m with corresponding text feature vectors R(D_1), ..., R(D_m), the text mode recognition method also comprises:

computing the similarity between any two texts D_x and D_y using the following formula:

where x and y are greater than or equal to 1 and less than or equal to m.

Compared with the prior art, the invention establishes a weighted directed graph model to describe text information. The model uses not only the frequency of keyword occurrences in the text but also the positions of the keywords in the text and the distances between keywords, so that each text corresponds to a characteristic weighted directed graph. On this basis, each text is mapped to a text feature vector. This vector contains not only the keyword frequency information but also, implicitly, the structural information of the text. The calculation of the similarity between different texts is thereby reduced to calculating the similarity between their corresponding text feature vectors. Compared with conventional methods, the invention preserves more, and more useful, feature information from the original text file, so that better results can be obtained in text classification and text similarity calculation.

Other objects, features, and advantages of the invention are described in detail in the embodiments below with reference to the accompanying drawings.

[description of the drawings]

The invention will be more readily understood from the following detailed description taken together with the accompanying drawings, in which like reference numerals denote like structural parts, and in which:

Fig. 1 is a schematic diagram of the text mode recognition method in one embodiment of the invention;

Fig. 2 shows an example of a text;

Fig. 3 shows the relative position information of each keyword in the text shown in Fig. 2;

Fig. 4 shows the weighted directed graph with multiple edges of the text shown in Fig. 2; and

Fig. 5 shows the simple weighted directed graph obtained by simplifying the weighted directed graph with multiple edges shown in Fig. 3.

[embodiment]

To make the above objects, features, and advantages of the invention more apparent, the invention is described in further detail below with reference to the drawings and specific embodiments.

The detailed description of the invention is presented mainly in terms of procedures, steps, logic blocks, processes, or other symbolic representations that directly or indirectly resemble the operation of the technical solutions of the invention. Those skilled in the art use these descriptions and representations to convey the substance of their work effectively to others skilled in the art.

Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one implementation of the invention. The appearances of "in one embodiment" in various places in this specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. In addition, the order of modules in methods, flowcharts, or functional block diagrams representing one or more embodiments does not inherently indicate any particular order, nor does it imply any limitation on the invention.

Fig. 1 is a schematic flow chart of the text mode recognition method 100 in one embodiment of the invention. The text mode recognition method 100 comprises the following steps.

Step 110: scan the original text file line by line, and record the number of occurrences and the positions of each keyword in the text.

If a keyword occurs multiple times in the text, the specific or relative position of every occurrence is recorded. The number of times each keyword occurs is recorded at the same time.

Let the keyword set be K = {k_1, k_2, ..., k_n} and let keyword k_i occur f_i times; F = [f_1, f_2, ..., f_n] can then be used to represent the occurrence counts of all keywords, where i is greater than or equal to 1 and less than or equal to n, and n is a natural number greater than or equal to 1.
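Step 110 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `scan_keywords`, the regular-expression tokenizer, case-insensitive matching, and the use of a running word index as the "position" are all choices made here and are not details given in the description.

```python
import re
from collections import defaultdict


def scan_keywords(text, keywords):
    """Return (F, positions) for the given keyword list.

    F[i] is the occurrence count of keywords[i]; positions maps each
    keyword to the word indices at which it occurs.
    Assumption: a "position" is a 0-based word index over the whole text.
    """
    lookup = {k.lower(): k for k in keywords}
    positions = defaultdict(list)
    index = 0
    for line in text.splitlines():  # scan the text line by line
        for token in re.findall(r"[A-Za-z0-9']+", line):
            key = lookup.get(token.lower())
            if key is not None:
                positions[key].append(index)
            index += 1
    counts = [len(positions[k]) for k in keywords]
    return counts, {k: list(positions[k]) for k in keywords}


# Hypothetical two-line text, not the text of Fig. 2.
text = "The bank opened an account.\nTransfer the fund to the account."
F, pos = scan_keywords(text, ["bank", "account", "fund", "transfer"])
```

Keeping every position, rather than only the counts, is what lets the later steps derive distances between keyword occurrences.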

Step 120: map the text to a weighted directed graph G_m with multiple edges.

Each node in the weighted directed graph G_m represents a keyword k_i; that is, G_m has n nodes in total. If keyword k_i occurs at position p_i in the text, keyword k_j occurs at position p_j in the text, and position p_j is after position p_i, then a directed edge k_i k_j is added to G_m, and the weight of the directed edge k_i k_j is the distance between p_i and p_j. If keyword k_i and keyword k_j occur multiple times in the text, the same rule is applied in G_m to the occurrences of k_i and k_j at different positions in the text, which are thereby mapped to multiple edges, where j is greater than or equal to 1 and less than or equal to n. If a keyword occurs more than once, it corresponds to multiple positions.
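The edge rule of step 120 can be illustrated with a short Python sketch. It assumes the rule is applied to every ordered pair of keyword occurrences, including two occurrences of the same keyword, which is consistent with the nonzero diagonal entries in the example matrix W given later. The function name `build_multigraph`, the dictionary representation, and the positions used are hypothetical.

```python
from collections import defaultdict


def build_multigraph(positions):
    """Map keyword positions to parallel directed edges.

    positions: {keyword: [word index, ...]}
    Returns {(k_i, k_j): [distance, ...]} with one list entry per
    parallel edge from k_i to k_j, weighted by p_j - p_i.
    """
    # Flatten to (position, keyword) pairs sorted by position.
    occurrences = sorted(
        (p, k) for k, plist in positions.items() for p in plist
    )
    edges = defaultdict(list)
    for a, (p_i, k_i) in enumerate(occurrences):
        # Every later occurrence yields one edge k_i -> k_j.
        for p_j, k_j in occurrences[a + 1:]:
            edges[(k_i, k_j)].append(p_j - p_i)
    return dict(edges)


# Illustrative positions only.
positions = {"bank": [0], "fund": [12, 62]}
G_m = build_multigraph(positions)
```

Because "fund" occurs twice here, there are two parallel edges from "bank" to "fund" and one edge from "fund" to itself.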

Step 130: reduce the weighted directed graph G_m with multiple edges to a simple weighted directed graph G_s.

Let E_ij denote the set of edges from node k_i to node k_j (that is, from keyword k_i to keyword k_j in the text) in the weighted directed graph G_m obtained in step 120.

G_s is constructed as follows:

The node set of G_m is used as the node set of G_s;

The directed edge from node k_i to node k_j in G_s is denoted k_i k_j, and its weight w(k_i k_j) is defined as

w(k_i k_j) = \sum_{e \in E_{ij}} \frac{1}{\tilde{w}(e)},

where E_ij is the set of directed edges from node k_i to node k_j in the weighted directed graph G_m with multiple edges, and \tilde{w}(e) is the weight of directed edge e in G_m. Because each parallel edge contributes the reciprocal of its distance, keyword pairs that occur close together, or that co-occur often, receive larger weights.
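The reduction in step 130 is a sum of reciprocal distances over each set of parallel edges. A minimal Python sketch follows; the name `simplify`, the dictionary layout, and the distances are illustrative assumptions.

```python
def simplify(multigraph):
    """Collapse parallel edges into a single weight per node pair.

    multigraph: {(k_i, k_j): [distance, ...]}
    Returns {(k_i, k_j): w} where w is the sum of 1 / distance.
    Pairs with no edges are simply absent (weight 0).
    """
    return {
        pair: sum(1.0 / d for d in distances)
        for pair, distances in multigraph.items()
    }


# Illustrative distances only.
G_m = {("bank", "fund"): [12, 62], ("fund", "fund"): [50]}
G_s = simplify(G_m)
```

With these values, the two parallel edges from "bank" to "fund" combine to 1/12 + 1/62, and the single self-edge on "fund" gives 1/50.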

Step 140: describe the simple weighted directed graph G_s by a matrix W.

Step 150: for any text D, map the text file D to a text feature vector R(D) according to the obtained matrix W and the recorded keyword occurrence counts F.

R(D) = [f_1, f_2, ..., f_n, w(k_1, k_1), ..., w(k_1, k_n), ..., w(k_n, k_1), ..., w(k_n, k_n)]
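Steps 140 and 150 amount to laying the edge weights out as an n × n matrix and concatenating the counts with the matrix rows. The following Python sketch shows one way to do this; the function names and the small input values are hypothetical.

```python
def weight_matrix(simple_graph, keywords):
    """Build W with W[i][j] = w(k_i k_j); missing edges get weight 0."""
    return [
        [simple_graph.get((k_i, k_j), 0.0) for k_j in keywords]
        for k_i in keywords
    ]


def feature_vector(counts, W):
    """R(D): the counts F followed by the rows of W, row by row."""
    return list(counts) + [w for row in W for w in row]


# Illustrative values only.
keywords = ["bank", "fund"]
G_s = {("bank", "fund"): 0.25, ("fund", "fund"): 0.5}
W = weight_matrix(G_s, keywords)
R = feature_vector([1, 2], W)
```

For n keywords the resulting vector has n + n × n entries, so every text processed with the same keyword list yields a vector of the same length.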

Repeating steps 110 to 150 yields the text feature vectors of all texts. Given texts D_1, ..., D_m, the corresponding text feature vectors are R(D_1), ..., R(D_m).

Let M denote the matrix formed by the feature vectors of all texts:

M = \begin{bmatrix} R(D_1) \\ \vdots \\ R(D_m) \end{bmatrix}.

Normalizing M yields the new matrix

\tilde{M} = \begin{bmatrix} \tilde{R}(D_1) \\ \vdots \\ \tilde{R}(D_m) \end{bmatrix}.

The text mode recognition method of the invention can further comprise:

Step 160: compute the similarity between any two texts D_x and D_y using the following formula:

where x and y are greater than or equal to 1 and less than or equal to m.
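Neither the normalization applied to M nor the similarity formula of step 160 is reproduced in this text, so the Python sketch below is not the method of the description. It only illustrates the general shape of the computation, swapping in two common choices as explicit assumptions: column-wise min-max scaling for the normalization and cosine similarity for the comparison. All names are hypothetical.

```python
import math


def normalize_columns(M):
    """ASSUMED normalization: min-max scale each column of M to [0, 1].

    Constant columns are mapped to 0.
    """
    columns = list(zip(*M))
    low = [min(c) for c in columns]
    span = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / s if s else 0.0 for v, lo, s in zip(row, low, span)]
        for row in M
    ]


def cosine_similarity(u, v):
    """ASSUMED similarity: cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical feature matrix with one row per text.
M = [[1, 2], [3, 2]]
M_tilde = normalize_columns(M)
sim = cosine_similarity([1, 0], [1, 0])
```

Scaling the columns before comparison keeps the raw counts f_i from dominating the much smaller edge weights, which is one plausible reason for normalizing M first.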

One of the benefits and advantages of the invention is that a weighted directed graph model is established to describe text information. The model uses not only the frequency of keyword occurrences in the text but also the positions of the keywords in the text and the distances between keywords, so that each text corresponds to a characteristic weighted directed graph. On this basis, each text is mapped to a text feature vector. This vector contains not only the keyword frequency information but also, implicitly, the structural information of the text. The calculation of the similarity between different texts is thereby reduced to calculating the similarity between their corresponding text feature vectors. Compared with conventional methods, the invention preserves more, and more useful, feature information from the original text file, so that better results can be obtained in text classification and text similarity calculation.

Fig. 2 shows an example of a text, for which the keyword set used is {Bank, Account, Fund, Transfer}. The recorded number of times each keyword occurs in the text shown in Fig. 2 is F = [f_1 = 1, f_2 = 2, f_3 = 2, f_4 = 2], where f_1 is the number of occurrences of Bank, f_2 of Account, f_3 of Fund, and f_4 of Transfer. The recorded relative position information of each keyword is shown in Fig. 3; the relative position is the word distance between two adjacent keywords. For example, the first keyword to occur, bank, and the second, fund, are 12 words apart, which is denoted by 12. Fig. 3 shows the weighted directed graph G_m with multiple edges of the text shown in Fig. 2. Fig. 4 shows the simple weighted directed graph G_s obtained by simplifying the weighted directed graph G_m with multiple edges shown in Fig. 3.

The matrix W describing the simple weighted directed graph G_s is:

W = \begin{array}{c|cccc} & \text{bank} & \text{fund} & \text{account} & \text{transfer} \\ \hline \text{bank} & 0 & 0.0995 & 0.0705 & 0.0459 \\ \text{fund} & 0 & 0.0200 & 0.3848 & 0.5668 \\ \text{account} & 0 & 0.0227 & 0.0204 & 0.0884 \\ \text{transfer} & 0 & 0.0345 & 0.3627 & 0.0323 \end{array}

The text feature vector of the text shown in Fig. 2 is:

V = [1, 2, 2, 2, 0, 0.0995, 0.0705, 0.0459, 0, 0.0200, 0.3848, 0.5668, 0, 0.0227, 0.0204, 0.0884, 0, 0.0345, 0.3627, 0.0323].
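As a quick check of the example, the vector V can be rebuilt from the counts F and the rows of W. The Python lines below use the numbers exactly as they appear in the example; only the variable names are introduced here.

```python
# Counts and matrix entries copied from the example of Fig. 2.
F = [1, 2, 2, 2]
W = [
    [0, 0.0995, 0.0705, 0.0459],
    [0, 0.0200, 0.3848, 0.5668],
    [0, 0.0227, 0.0204, 0.0884],
    [0, 0.0345, 0.3627, 0.0323],
]

# R(D) is F followed by the rows of W, read row by row.
V = F + [w for row in W for w in row]
```

The result has 4 + 4 × 4 = 20 entries, matching the length and order of the vector stated in the example.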

The invention has been described above in sufficient detail and with a certain degree of particularity. Those of ordinary skill in the art should understand that the descriptions in the embodiments are merely exemplary, and that all changes made without departing from the true spirit and scope of the invention fall within its scope of protection. The scope of protection claimed by the invention is defined by the claims rather than by the foregoing description of the embodiments.

Claims

1. A text mode recognition method, characterized in that it comprises:

scanning an original text file line by line, and recording the number of occurrences and the positions of each keyword in the text;

mapping the text, according to the recorded occurrence counts and positions of the keywords in the text, to a weighted directed graph with multiple edges, wherein each node in the weighted directed graph with multiple edges represents a keyword;

reducing the weighted directed graph with multiple edges to a simple weighted directed graph;

representing the simple weighted directed graph as a matrix; and

mapping the text to a text feature vector according to the obtained matrix and the recorded keyword occurrence counts,

wherein the keyword set is K = {k_1, k_2, ..., k_n}, keyword k_i occurs f_i times in the text, and F = [f_1, f_2, ..., f_n] represents the occurrence counts of all keywords, where i is greater than or equal to 1 and less than or equal to n, and n is a natural number greater than or equal to 1,

and wherein each node in the weighted directed graph with multiple edges represents a keyword k_i; if keyword k_i occurs at position p_i in the text, keyword k_j occurs at position p_j in the text, and position p_j is after position p_i, then a directed edge k_i k_j is added to the weighted directed graph with multiple edges, the weight of the directed edge k_i k_j being the distance between p_i and p_j; and if keyword k_i and keyword k_j occur multiple times in the text, the same rule is applied to the occurrences of k_i and k_j at different positions in the text, which are thereby mapped to multiple edges in the graph, where j is greater than or equal to 1 and less than or equal to n.

2. The text mode recognition method according to claim 1, characterized in that reducing the weighted directed graph with multiple edges to a simple weighted directed graph comprises:

w(k_i k_j) = \sum_{e \in E_{ij}} \frac{1}{\tilde{w}(e)},

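3. The text mode recognition method according to claim 2, characterized in that the matrix W representing the simple weighted directed graph is the n × n matrix whose entry in row i and column j is w(k_i k_j):

W = \begin{bmatrix} w(k_1 k_1) & \cdots & w(k_1 k_n) \\ \vdots & \ddots & \vdots \\ w(k_n k_1) & \cdots & w(k_n k_n) \end{bmatrix}

4. The text mode recognition method according to claim 3, characterized in that the text feature vector R(D) to which the text is mapped is:

R(D) = [f_1, f_2, ..., f_n, w(k_1, k_1), ..., w(k_1, k_n), ..., w(k_n, k_1), ..., w(k_n, k_n)].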

5. The text mode recognition method according to claim 4, characterized in that, given texts D_1, ..., D_m with corresponding text feature vectors R(D_1), ..., R(D_m),

the text mode recognition method also comprises:

computing the similarity between any two texts D_x and D_y using the following formula, where x and y are greater than or equal to 1 and less than or equal to m.