CN101833582A

CN101833582A - Mining method and system for correlation of vocabulary entities based on template

Info

Publication number: CN101833582A
Application number: CN 201010174505
Authority: CN
Inventors: 吴毓杰; 卢阳正
Original assignee: Individual
Current assignee: Individual
Priority date: 2010-05-04
Filing date: 2010-05-04
Publication date: 2010-09-15

Abstract

The invention provides a template-based method and system for mining correlations among vocabulary entities. Working from part-of-speech patterns, stop words, and named entity definitions predefined by the user, the invention applies sequential pattern mining to the associations among these patterns, then mines and presents the association patterns that pass a test of statistical independence or correlation. Every detailed adjustment of the process is parameterized and can be defined according to user preference, and other file information such as time, date, and source can be added. The user can thus obtain, within a limited time, the highly correlated named entities or vocabulary relation patterns in a designated file set.

Description

Template-based method and system for mining correlations among vocabulary entities

Technical field

The present invention relates to a text mining method and system in the field of information processing and information retrieval, and in particular to a template-based method and system for mining correlations among vocabulary entities.

Background technology

The present invention is positioned as a system that, over a large collection of files, uses the annotation results of natural-language part-of-speech tagging together with predefined named entity rules, and then applies data mining rules to perform association mining. The invention draws on knowledge from several fields: (1) natural language processing: term extraction, automatic part-of-speech tagging, term post-processing, and named entity recognition; (2) data mining: sequential pattern mining and association mining; and (3) domain knowledge such as the study of correlation coefficient tests.

In terms of overall architecture, the present invention is an innovative, cross-disciplinary combination. Although earlier technologies at home and abroad each reach a certain level of capability, none has integrated the natural language processing and sequential mining techniques above into one systematic framework. Earlier literature and inventions focus on mining general patterns rather than on mining correlated vocabulary within files. They also concentrate on computing the similarity between one word and another rather than on template-based mining that starts from parts of speech. In natural language processing, similar term extraction techniques exist, but a method that is driven by associations, part-of-speech templates, and named entity definitions, and that can be applied to multiple languages, has not been seen.

In document mining, the main applications and methodological breakthroughs still concern classification and clustering; they have not been extended to association mining at the vocabulary-phrase-named entity level. The first problem is how to organize a large volume of file data. In their SIGIR '99 paper, Yang and Liu summarized methods that sort documents into categories and map them to category names predefined by the user (Yang, Y. and Liu, X., "A Re-Examination of Text Categorization Methods," Proceedings of SIGIR '99: 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42-49). For example, given newspaper news or electronic media files as input, a machine automatically assigns each file a suitable category, using labels such as politics, society, or film and drama, and organizes the files by category. A further application uses document clustering (for example, hierarchical clustering or the well-known K-means) on files that appear unrelated to one another: by combining lexical analysis with document similarity calculation, it estimates specific groups, which need not be specified in advance and are instead computed automatically by the machine. Although both approaches are mainstream topics in document mining, they remain at the article level and do not go further down to the sentence or phrase level. The vocabulary-phrase-named entity level differs considerably from the level of a whole article, so document mining techniques are fundamentally different from the method of this patent.

Building on these earlier developments, this patent rests on a strong foundation of language processing and document mining, yet the focus of the earlier methods clearly differs from that of this patent. Vocabulary similarity calculation may appear related, but it addresses a different problem. This patent is not restricted to a language family and is not limited to individual words: the user first specifies the part-of-speech framework of interest, and the patterns that are highly correlated statistically and that satisfy minimum thresholds such as confidence are then mined. Methods such as vocabulary similarity calculation, by contrast, mainly judge whether the similarity between two words is high or low, in order to decide whether the words are identical or related. Moreover, such methods are in essence suitable only for Chinese and rely only on Chinese word segmentation results. This patent applies to multiple language families and bases its analysis mainly on part of speech, using named entity definitions (which include vocabulary) to mine highly correlated patterns. It is not only a technical breakthrough but also covers different language families.

Summary of the invention

The invention provides a framework suitable for mining associations among named entities in multiple languages.

Based on the part-of-speech patterns, stop words, and named entity definitions predefined by the user, and on the associations among the patterns, the present invention uses sequential pattern mining to mine and present the association patterns that pass a test of statistical independence or correlation.

All detailed adjustments of the process are parameterized and can be defined according to user preference, and other file information such as time, date, and source can be added. Within a limited time, the user can obtain the highly correlated named entities or vocabulary relation patterns in a specified file set.

The present invention discloses a template-based system for mining correlations among vocabulary entities, comprising: a natural language processing module that forms a tagged file with part-of-speech tags from the input source file; a named entity rule module that defines the named entity rules and the mining templates; a stop-word module that stores the stop words; a pattern projection database that stores the vocabulary remaining after stop-word removal and matching a mining template; an association mining module that mines, in their order of appearance in the file, the named entities that meet the defined conditions and satisfy the minimum number of occurrences; and a correlation calculation module that calculates the degree of correlation between the vocabulary items meeting the defined conditions.

The natural language processing module comprises a word segmentation module and a part-of-speech tagging module: the word segmentation module separates the words in the source file with blanks or symbols, and the tagging module labels each word with its part of speech. A mining template merges the words that match a part-of-speech combination definition or a vocabulary definition into a single named entity. The association mining module applies sequential pattern mining on the basis of the pattern projection database and mines, in their order of appearance in the file, the named entities that satisfy the minimum number of occurrences. The correlation calculation module uses a test of independence to calculate the degree of correlation between words. The defined conditions comprise the word distance, the maximum mined pattern length, and the threshold. The tests of independence comprise the chi-square test, correlation coefficient, information gain, expected mutual information, confidence, and related occurrence probability.

The present invention also discloses a template-based method for mining correlations among vocabulary entities, comprising the steps of:

1) forming a tagged file with part-of-speech tags from the input source file;

2) applying the named entity rules to the tagged file and removing stop words;

3) building a pattern projection database from the vocabulary that matches a mining template;

4) performing sequential pattern mining, in which the named entities that meet the defined conditions and satisfy the minimum number of occurrences are mined in their order of appearance in the file;

5) calculating the degree of correlation between the mined vocabulary items.

The framework of the present invention emphasizes applicability to multiple language families. It must therefore be paired with a language-specific term extraction and part-of-speech tagging method, so that words and their parts of speech are labeled first and the patterns matching the named entity rules can then be mined from a large collection of files. In an embodiment, this component can obtain the words and parts of speech of a file using term tagging methods for different language families, after which named entity vocabulary mining can proceed.

In an embodiment, named entity rules are defined by the user as combinations of parts of speech. A part-of-speech combination that the user has observed and defined as legal (that is, a run of consecutive words) is merged into one named entity, which forms the basic unit of mining. Named entity rules can be flexibly added, modified, and deleted.
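The rule-based merging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `merge_named_entities`, the rule format (a tuple of part-of-speech tags mapped to the tag given to the merged unit), the `NE` tag, and the sample tokens are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

def merge_named_entities(tokens: List[Token],
                         rules: Dict[Tuple[str, ...], str]) -> List[Token]:
    """Merge runs of consecutive tokens whose tags match a rule into one entity.

    Longer rules are tried first, so the longest legal combination wins.
    """
    ordered = sorted(rules, key=len, reverse=True)
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        for pattern in ordered:
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                # Join the matched words into a single named-entity unit.
                merged.append((" ".join(w for w, _ in window), rules[pattern]))
                i += len(pattern)
                break
        else:
            merged.append(tokens[i])  # no rule applies; keep the token as is
            i += 1
    return merged

# Hypothetical rule: a proper noun followed by a common noun forms one entity.
rules = {("Nb", "Na"): "NE"}
tokens = [("TSMC", "Nb"), ("group", "Na"), ("retains", "VC"), ("lead", "Na")]
result = merge_named_entities(tokens, rules)
```

Because the rules live in a plain dictionary, adding, modifying, or deleting a rule requires no change to the merging routine.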

In an embodiment, pattern mining takes all named entity basic units and, according to their order in the article, applies sequential mining to extract the patterns that satisfy the user's definition.

In an embodiment, correlation calculation takes all the words in the mined patterns, applies the correlation measure chosen by the user, and lists the results ranked by degree of correlation.
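One way to make the correlation measure user-selectable is a small registry of scoring functions. The sketch below is an assumption-laden illustration: the names `MEASURES`, `confidence`, and `rank_patterns`, the count layout `(n_xy, n_x, n_total)`, and the sample counts are hypothetical, and only the standard association-rule confidence is shown.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Counts = Tuple[int, int, int]  # (n_xy, n_x, n_total): hypothetical count layout

def confidence(n_xy: int, n_x: int, n_total: int) -> float:
    """Association-rule confidence of x -> y: share of x occurrences that include y.

    n_total is unused here; it is kept so that other measures can share the signature.
    """
    return n_xy / n_x if n_x else 0.0

# Registry of user-selectable measures; only one is sketched here.
MEASURES: Dict[str, Callable[[int, int, int], float]] = {"confidence": confidence}

def rank_patterns(counts: Dict[Pair, Counts], measure: str,
                  threshold: float) -> List[Tuple[Pair, float]]:
    """Score every pattern, drop those at or below the threshold, sort descending."""
    score = MEASURES[measure]
    scored = [(pair, score(*c)) for pair, c in counts.items()]
    kept = [(pair, s) for pair, s in scored if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Made-up counts for two candidate patterns.
counts = {("TSMC", "lead"): (8, 10, 100), ("TSMC", "panel"): (3, 10, 100)}
ranked = rank_patterns(counts, "confidence", threshold=0.7)
```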

Description of drawings

Figure 1 is a schematic diagram of the system architecture of the present invention;

Figure 2 is an operational flowchart of the present invention.

Embodiment

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The framework of the present invention emphasizes applicability to multiple language families. It must therefore be paired with a language-specific term extraction and part-of-speech tagging method, so that words and their parts of speech are labeled first and the patterns matching the named entity rules can then be mined from a large collection of files. In an embodiment, this component can obtain the words and parts of speech of a file using term tagging methods for different language families, after which named entity vocabulary mining can proceed. The following illustrates the invention with the mining of Chinese named entity vocabulary; other language families apply the invention in a similar way.

Figure 1 is a schematic diagram of the system architecture of the present invention. Input link or article 100, template definition 130, and stop-word removal 150 are the modules that involve the user. Input link or article 100 is the series of files entered by the user; template definition 130 is the module in which the user specifies the named entity rules; stop-word removal 150 holds the predefined list of stop words to be removed.

Web article extraction 110 represents the use of a web crawler, which obtains the articles specified by the user either through a network link or by automatic crawling. The user may also bypass this step and enter article content directly into the system. The natural language processing module 120 contains word segmentation 121 (also called term extraction) and part-of-speech tagging 122. Word segmentation 121 separates the words in a file with blanks or other defined symbols, and part-of-speech tagging 122 labels each word with its part of speech.

Template definition 130 supplies the effective definitions for named entity definition rules 140 and stop-word removal 150. After the input article or file has passed through part-of-speech tagging 122, the named entity definition rules module 140 extracts the named entities that match the definitions, merging the several words that match a named entity definition into one unit entity. The stop-word removal module 150 then removes the named entities and words that match the stop-word list. The named entity projection table module 160 builds the named entity projection table. Association mining 170 applies sequential pattern mining to this projection table. It comprises pattern extraction 171, which extracts the patterns that match the named entities, and pattern mining 172, which mines the patterns that satisfy the association conditions. Named entities that satisfy the minimum number of occurrences are mined in their order of appearance in the file.

Finally, the correlation calculation module 180 applies the selected test of independence to the mined patterns and calculates the correlation between their words. Named entity association set 190 outputs the named entity vocabulary sorted by degree of association.
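As an illustration of one measure that module 180 might apply, mutual information between two entities is commonly defined over binary indicator variables. The formula below is the standard textbook definition, offered as an assumption about what the patent's "expected mutual information" refers to; the symbols X, Y, and P are introduced here for illustration only. X = 1 means that entity x occurs in a text unit and Y = 1 means that entity y occurs, with the convention 0 log 0 = 0.

```latex
\[
I(X;Y) \;=\; \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}}
  P(X=u,\,Y=v)\,\log \frac{P(X=u,\,Y=v)}{P(X=u)\,P(Y=v)}
\]
```

The value is zero exactly when the two indicators are statistically independent and grows as the occurrences of the two entities become more dependent.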

Figure 2 is the operational flowchart of the present invention. As shown, operation starts at S1 and the system enters its initial state S2. Step S3 checks whether the input file set is non-empty: if it is empty the flow returns to the initial state S2, and if it is non-empty the flow proceeds to S4, which checks whether the parameter settings are valid. Invalid settings return the flow to S1; valid settings lead to word segmentation S5. After word segmentation S5, automatic part-of-speech tagging S6 is performed, followed by the named entity definition rules S7. Step S8 then checks whether any important vocabulary remains after the filtering rules are applied. If not, the flow ends at S12. If so, the pattern projection database is built in S9, correlation calculation is performed in S10, the results are sorted by correlation and output in S11, and the flow ends at S12.
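The branching in the flowchart can be summarized in a short control-flow sketch. The function `run_flow` and its callable parameters are hypothetical stand-ins for the modules of Figure 1, and the mapping of step labels to lines is an interpretation of the flowchart description rather than a confirmed reading.

```python
from typing import Callable, List, Optional

Tokens = List[str]

def run_flow(files: List[str],
             params_valid: bool,
             segment_and_tag: Callable[[str], Tokens],
             apply_rules: Callable[[Tokens], Tokens],
             mine_and_rank: Callable[[List[Tokens]], list]) -> Optional[list]:
    """Follow the S1-S12 control flow.

    Returns None when the flow goes back to wait for new input or parameters,
    and a (possibly empty) result list when it reaches the end state S12.
    """
    if not files:          # S3: the input file set is empty
        return None
    if not params_valid:   # S4: the parameter settings are invalid
        return None
    tagged = [segment_and_tag(text) for text in files]   # S5, S6
    kept = [apply_rules(tokens) for tokens in tagged]    # S7
    if not any(kept):      # S8: no important vocabulary remains
        return []          # S12
    return mine_and_rank(kept)                           # S9, S10, S11, S12
```

Passing the stages in as callables keeps the sketch independent of any particular segmenter, tagger, or mining routine.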

The detailed computation proposed by the present invention, which covers the operating flow introduced above, is presented below. The input is the articles to be analyzed together with the rules and templates set in advance by the user; the output is the list of mined vocabulary associations.

Input (prepared in advance) includes but is not limited to:

○ Collected web pages, news, or general documents, denoted DText;

○ The definition of the part-of-speech and stop-word filtering rules, STag;

○ The named entity rule definitions, NRule;

○ The definition of the patterns to extract, Template;

* For example: noun-noun, noun-noun-verb, verb-noun-adjective, and similar patterns;

○ The definition of the correlation measure and the threshold θ.

Output (result)

○ A set of significant patterns

* For example:

● Ma Ying-jeou - leads - Kuomintang (noun-verb-noun) triple → correlation = 90%

● young - Zhao Youting (noun-noun) pair → correlation = 80%

● Youda - panel (noun-noun) pair → correlation = 78%

Computation procedure

STEP1: Perform word segmentation on all DText;

STEP2: Automatically tag the parts of speech of the words segmented from DText, giving TagText;

STEP3: Use NRule to extract the named entities in TagText;

STEP4: Remove words from TagText according to the STag definition, giving STagText;

STEP5: Use Template to retain all the patterns to be mined in STagText, giving PaText;

STEP6: For every pattern in PaText that matches Template, look up the corresponding information in DText and build the projection database (ProjDB);

STEP7: Apply sequential pattern mining to mine each frequent pattern, enumerating all patterns whose word distance is within K and accumulating their statistics;

STEP8: Calculate the degree of correlation between words with one of the following tests of independence:

Chi-square test (chi-square statistic)

Correlation coefficient

Information gain

Expected mutual information

Confidence

Related occurrence probability (Related Prob.)

STEP9: Calculate the above degree of correlation for each mined pattern; the patterns whose correlation exceeds the threshold θ form the set Pat;

STEP10: Sort all Pat.
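The distance-bounded enumeration of STEP7 can be sketched for the simplest case, ordered pairs (pattern length 2). This is a simplified stand-in and not a full sequential pattern mining algorithm; the function name `mine_ordered_pairs`, the choice to count each sequence at most once per pair, and the sample sequences are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

def mine_ordered_pairs(sequences: List[List[str]],
                       max_distance: int,
                       min_support: int) -> Dict[Tuple[str, str], int]:
    """Count ordered entity pairs (x before y) at most `max_distance` apart.

    Each sequence contributes at most once per pair, so a count is the
    number of sequences that support the pattern.
    """
    support: Counter = Counter()
    for seq in sequences:
        seen = set()
        for i, x in enumerate(seq):
            # Look ahead only as far as the allowed word distance.
            for y in seq[i + 1:i + 1 + max_distance]:
                if x != y:
                    seen.add((x, y))
        support.update(seen)
    # Keep only the pairs that reach the minimum number of occurrences.
    return {pair: n for pair, n in support.items() if n >= min_support}

# Made-up entity sequences, one per document, after filtering.
sequences = [["TSMC", "retains", "lead"], ["TSMC", "group", "lead"]]
frequent = mine_ordered_pairs(sequences, max_distance=2, min_support=2)
```

With a distance of 2 the pair ("TSMC", "lead") is supported by both sequences; with a distance of 1 no pair reaches a support of 2.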

Taking a Chinese input file as an example, the user defines the following parameters:

Chinese file (processed by Chinese word segmentation and Chinese part-of-speech tagging)

C = 5 (word distance of 5 words)

M = 10% (extract the top 10% most significant vocabulary)

θ = 70% (entity correlation must exceed 70%)

MinS = 1 (minimum number of occurrences of an entity)

MaxLen = 2 (maximum mined pattern length)

Test of independence: chi-square test
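The parameter set above can be held in a small configuration object and checked before mining starts. The class name `MiningParams`, the field names, the test identifiers, and the validity ranges are assumptions made for this sketch; the default values are those of the example.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the tests of independence named in the text.
SUPPORTED_TESTS = frozenset({
    "chi-square", "correlation-coefficient", "information-gain",
    "expected-mutual-information", "confidence", "related-probability",
})

@dataclass(frozen=True)
class MiningParams:
    max_distance: int = 5       # C: word distance
    top_fraction: float = 0.10  # M: share of most significant vocabulary kept
    threshold: float = 0.70     # theta: minimum correlation
    min_support: int = 1        # MinS: minimum number of occurrences
    max_len: int = 2            # MaxLen: maximum mined pattern length
    test: str = "chi-square"    # selected test of independence

    def is_valid(self) -> bool:
        """Return True when every parameter lies in its assumed legal range."""
        return (self.max_distance >= 1
                and 0 < self.top_fraction <= 1
                and 0 <= self.threshold <= 1
                and self.min_support >= 1
                and self.max_len >= 2
                and self.test in SUPPORTED_TESTS)

params = MiningParams()
```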

Table 1 shows the source file after Chinese word segmentation and part-of-speech tagging; it excerpts several passages of news text. The character "_" is the separator, and the English symbol after each word is that word's part of speech, for example Na for a common noun and Nb for a proper noun. The full part-of-speech reference is given in Table 7.

Table 1
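Reading such tagged text back into (word, tag) pairs might look like the sketch below. The exact layout of Table 1 is not reproduced in this text, so the format assumed here, whitespace-separated tokens written as `word_TAG`, is a guess, and the function name `parse_tagged` and the sample string are hypothetical. Splitting on the first underscore keeps tags that themselves contain an underscore, such as V_2, intact.

```python
from typing import List, Tuple

def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated `word_TAG` tokens into (word, tag) pairs."""
    tokens = []
    for item in text.split():
        # Split on the first underscore only, so a tag like V_2 stays whole.
        word, sep, tag = item.partition("_")
        if not sep or not word or not tag:
            raise ValueError(f"malformed token: {item!r}")
        tokens.append((word, tag))
    return tokens

parsed = parse_tagged("TSMC_Nb has_V_2 profit_Na")
```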

After the named entity rules are applied, the named entity vocabulary takes the form shown in Table 2. The rules defined for this embodiment are shown in Table 3.

Table 2

Table 3

Next, the stop-word list is used to remove the stop words contained in the file above. Table 4 shows the file after stop-word removal. In this embodiment the stop-word list is defined as follows: Be, Also, Have, After, big, Think, Investigation, Point out, According to, No matter, The aspect, All, The space. The content marked with strikethrough in Table 4 is the part removed by stop-word filtering.

Table 4

According to the parts of speech that the user has chosen to retain, words in the file whose part of speech is not retained are removed. Table 5 shows the file after the non-retained parts of speech are removed. In this embodiment the retained part-of-speech set is: A, Dfa, I, Na, Nb, Nc, Ncd, Nd, Neu, Nf, Ng, VA, VB, VC, VE, VG, VH, VJ.

Table 5
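The two filtering passes, stop-word removal followed by retention of selected parts of speech, can be combined in one short sketch. The retained tag set is the one listed for this embodiment; the function name `filter_tokens`, the sample tokens, and the sample stop words are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Retained part-of-speech set as listed for this embodiment.
KEEP_TAGS: FrozenSet[str] = frozenset(
    "A Dfa I Na Nb Nc Ncd Nd Neu Nf Ng VA VB VC VE VG VH VJ".split()
)

def filter_tokens(tokens: Iterable[Token],
                  stop_words: FrozenSet[str],
                  keep_tags: FrozenSet[str] = KEEP_TAGS) -> List[Token]:
    """Drop stop words, then drop words whose tag is not in the retained set."""
    return [(word, tag) for word, tag in tokens
            if word not in stop_words and tag in keep_tags]

# Made-up tokens and stop words for illustration.
sample = [("TSMC", "Nb"), ("is", "SHI"), ("profit", "Na"),
          ("all", "Na"), ("very", "Dfa")]
filtered = filter_tokens(sample, stop_words=frozenset({"all"}))
```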

Sequential pattern mining is then applied to the remaining vocabulary (that is, the named entities). According to the parameter settings, the following list of associated entities can be mined; each entry includes its chi-square value, and the list is output in ascending order. Details are shown in Table 6.

Associated entity pair	χ² value
Taiwan Semiconductor Manufacturing Co. / cicada connects	2.721
Taiwan Semiconductor Manufacturing Co. / tap	2.721
Taiwan Semiconductor Manufacturing Co. / group	2.7199
Taiwan Semiconductor Manufacturing Co. / enterprise	2.7199
Taiwan Semiconductor Manufacturing Co. / profit king	2.4551
profit king / group	6.8935
winning post / enterprise	8.5239
cicada connects / leading	8.5253

Mining parameters: MaxLen = 2, C = 5, M = 10%, θ = 70%, MinS = 1, χ² statistic

Table 6
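A chi-square value for an entity pair can be computed from co-occurrence counts as sketched below. The patent does not state how its counts are gathered, so the count names (`n_xy`, `n_x`, `n_y`, `n_total`), the notion of a counting unit, and the function names are assumptions; the statistic itself is the standard Pearson chi-square for a 2x2 table without continuity correction.

```python
from typing import Dict, List, Tuple

def contingency(n_xy: int, n_x: int, n_y: int, n_total: int) -> Tuple[int, int, int, int]:
    """Build the 2x2 table (a, b, c, d) from co-occurrence counts."""
    a = n_xy                          # units containing both x and y
    b = n_x - n_xy                    # units containing x without y
    c = n_y - n_xy                    # units containing y without x
    d = n_total - n_x - n_y + n_xy    # units containing neither
    return a, b, c, d

def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 table, without continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero margin leaves the statistic undefined; treated as 0 here
    return n * (a * d - b * c) ** 2 / denom

def rank_ascending(scores: Dict[Tuple[str, str], float]) -> List[Tuple[Tuple[str, str], float]]:
    """Sort entity pairs by score in ascending order, as in the example output."""
    return sorted(scores.items(), key=lambda item: item[1])

# Made-up counts: x and y always occur together in 10 of 20 units.
table = contingency(n_xy=10, n_x=10, n_y=10, n_total=20)
value = chi_square(*table)
```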

Mark	Part of speech	Mark	Part of speech
A	Non-predicative adjective	Nh	Pronoun
Caa	Coordinating conjunction	I	Interjection
Cab	Conjunction, e.g. "and so on"	P	Preposition
Cba	Conjunction, e.g. "if"	T	Modal particle
Cbb	Correlative conjunction	VA	Active intransitive verb
Da	Quantity adverb	VAC	Active causative verb
Dfa	Pre-verbal degree adverb	VB	Active pseudo-transitive verb
Dfb	Post-verbal degree adverb	VC	Active transitive verb
Di	Aspect marker	VCL	Active verb taking a locative object
Dk	Sentential adverb	VD	Ditransitive verb
D	Adverb	VE	Active verb taking a sentential object
Na	Common noun	VF	Active verb taking a verbal object
Nb	Proper noun	VG	Classificatory verb
Nc	Place noun	VH	Stative intransitive verb
Ncd	Localizer	VHC	Stative causative verb
Nd	Time noun	VI	Stative pseudo-transitive verb
Neu	Numeral determiner	VJ	Stative transitive verb
Nes	Specific determiner	VK	Stative verb taking a sentential object
Nep	Demonstrative determiner	VL	Stative verb taking a verbal object
Neqa	Quantitative determiner	V_2	有 (have)
Neqb	Post-quantitative determiner	DE	的, 之, 得, 地
Nf	Measure word	SHI	是 (be)
Ng	Postposition	FW	Foreign word

Mark	Part of speech	Mark	Part of speech
DASHCATEGORY	-	EXCLANATIONCATEGORY	!
ETCCATEGORY	…	PARENTHESISCATEGORY	" " () []
COMMACATEGORY	,	PAUSECATEGORY	、
PERIODCATEGORY	。	SPCHANGECATEGORY	//
QUESTIONCATEGORY	?	DM	Quantitative compound word
COLONCATEGORY	:	BM	Bound morpheme
SEMICOLONCATEGORY	;

Table 7

What this specification describes are preferred embodiments of the present invention. The above embodiments only illustrate the technical solution of the invention and do not limit it. Any technical solution that a person skilled in the art can obtain under the concept of this invention through logical analysis, reasoning, or limited experimentation shall fall within the scope of the present invention.

Claims

1. A template-based system for mining correlations among vocabulary entities, characterized by comprising:

a natural language processing module, which forms a tagged file with part-of-speech tags from the input source file;

a named entity rule module, which defines the named entity rules and the mining templates;

a stop-word module, which stores the stop words;

a pattern projection database, which stores the vocabulary that remains after stop-word removal and matches a mining template;

an association mining module, which mines, in their order of appearance in the file, the named entities that meet the defined conditions and satisfy the minimum number of occurrences; and

a correlation calculation module, which calculates the degree of correlation between the vocabulary items that meet the defined conditions.

2. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the natural language processing module comprises a word segmentation module and a part-of-speech tagging module, the word segmentation module separating the words in the source file with blanks or symbols, and the tagging module labeling each word with its part of speech.

3. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the mining template merges the words that match a part-of-speech combination definition or a vocabulary definition into a single named entity.

4. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the association mining module applies sequential pattern mining on the basis of the pattern projection database and mines, in their order of appearance in the file, the named entities that satisfy the minimum number of occurrences.

5. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the correlation calculation module uses a test of independence to calculate the degree of correlation between words.

6. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the defined conditions comprise: the word distance, the minimum number of occurrences of a word, the maximum mined pattern length, and the threshold.

7. The system for mining correlations among vocabulary entities as claimed in claim 5, characterized in that the tests of independence comprise: the chi-square test, correlation coefficient, information gain, expected mutual information, confidence, and related occurrence probability.

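8. A template-based method for mining correlations among vocabulary entities, comprising the steps of:

1) forming a tagged file with part-of-speech tags from the input source file;

2) applying the named entity rules to the tagged file and removing stop words;

3) building a pattern projection database from the vocabulary that matches a mining template;

4) performing sequential pattern mining, in which the named entities that meet the defined conditions and satisfy the minimum number of occurrences are mined in their order of appearance in the file;

5) calculating the degree of correlation between the mined vocabulary items.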

9. The method for mining correlations among vocabulary entities as claimed in claim 8, characterized in that the mining template merges the words that match a part-of-speech combination definition or a vocabulary definition into a single named entity.

10. The method for mining correlations among vocabulary entities as claimed in claim 8, characterized in that step 5) uses a test of independence, such as the chi-square test, correlation coefficient, information gain, expected mutual information, confidence, or related occurrence probability, to calculate the degree of correlation between words.