CN103207921A

CN103207921A - Method for automatically extracting terms from Chinese electronic document

Info

Publication number: CN103207921A
Application number: CN2013101564948A
Authority: CN
Inventors: 于娟
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2013-04-28
Filing date: 2013-04-28
Publication date: 2013-07-17

Abstract

The invention relates to a method for automatically extracting terms from a Chinese electronic document. The method comprises the following steps. Step S01: process the electronic document into a set of word strings composed of atomic words with specific parts of speech. Step S02: count the frequency of the atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate term, where N is a configurable parameter. Step S03: delete from the candidate term set those terms that occur only as substrings of other candidates, yielding the set of terms that appear in the document. The method addresses the practical problem that automatic term extraction has low performance and a limited degree of automation. Efficient automatic term extraction is a foundation of automatic text processing and supports applications such as information retrieval, text summarization and content management; a good term extraction method improves the automation and performance of such work.

Description

Method for automatically extracting terms from a Chinese electronic document

Technical field

The invention belongs to the field of natural language processing and relates to a method for automatically extracting the set of terms from a Chinese electronic document.

Background art

In recent years, with the rapid development of scientific research, the economy and the Internet, the number of electronic documents has grown at an accelerating rate. Processing this mass of electronic documents quickly and effectively has become one of the key tasks in fields such as information retrieval, information management and Web services. Automatic processing techniques for electronic documents, such as text retrieval, classification and automatic summarization, have therefore become research focuses in the related fields. In all of these techniques, automatically extracting all the terms in an electronic document ("term extraction" for short) is a basic task. The term extraction method of the present invention targets the automatic processing of Chinese electronic documents. Unless otherwise stated, "document" below refers to a Chinese electronic document and "word" refers to a Chinese word.

The terms in a document (term or word) are divided into two kinds according to whether they follow the principle of compositionality (the meaning of a complex expression is determined by the meanings of its constituents and the way they are combined): atomic words and compound words. An atomic word (aw) is a short word used in a language to form other new words; it does not follow the principle of compositionality. Examples are "system" and "knowledge". A compound word (cw) is a longer, content-oriented word composed of several atomic words; its formation generally follows the principle of compositionality. Examples are "systems engineering" and "information management".

Atomic words can easily be extracted automatically on the basis of an atomic word dictionary. Because atomic words are relatively stable and new ones appear rarely, they can be extracted using dictionaries such as the Chinese keyword list or the Chinese classification thesaurus, with satisfactory precision and recall.

Methods for extracting compound words fall into two main classes. The first is based on statistics, for example term extraction based on string frequency and string length. The second is based on part-of-speech analysis, for example extracting compound words according to part-of-speech word-formation rules. Each class has its own advantages and disadvantages.

The basic idea of statistical compound word extraction is that the more frequently adjacent Chinese characters co-occur, the more likely they are to form an independent word. The general procedure is: (1) cut the electronic document according to some algorithm to obtain its substrings; (2) compute, for each substring, indicators such as its frequency of occurrence or the probability that its left and right substrings occur on their own; (3) decide whether the substring is an independent word according to whether these indicators reach a threshold. The advantage of this approach is that it does not rely on a dictionary and is therefore not limited by one; recall is generally high and newly coined words can be extracted. Its disadvantages are: (1) statistical methods are generally suitable only for extracting terms from large corpora; (2) precision and recall cannot both be guaranteed, since a threshold set to achieve high precision inevitably lowers recall; (3) grammar and morphology are ignored when the document is cut into substrings, so some substrings that are not words are wrongly included in the extraction result, for example fragments such as "system worker" and "knowing management".
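The general statistical procedure described above can be illustrated with a minimal sketch. It is not any specific published algorithm: it simply cuts text into character n-grams, counts them, and keeps those that reach a threshold. The function name, the parameters `max_len` and `threshold`, and the sample sentence are assumptions made for illustration, as is the reading of the fragment "system worker" as 系统工.

```python
from collections import Counter

def char_ngram_candidates(text, max_len=4, threshold=2):
    """Count every character n-gram (length 2..max_len) and keep the frequent ones."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= threshold}

# Hypothetical sample: the compound 系统工程 (systems engineering) occurs twice.
sample = "系统工程是一门学科。系统工程应用广泛。"
candidates = char_ngram_candidates(sample)
```

Because the cut ignores word boundaries, the fragment 系统工 reaches the same count as the full compound 系统工程 and is kept as a candidate, which is the third disadvantage listed above.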

Methods based on part-of-speech analysis generally segment the corpus into atomic words using an atomic word dictionary and then take combinations of atomic words (for example, multi-word nouns) as terms according to rules. Zhang Xin et al., in the article "Research on a rule- and statistics-based method for automatic acquisition of ontology concepts", proposed a term extraction method that judges from parts of speech whether a Chinese character string is an independent word. The advantage of term extraction based on part-of-speech analysis is high precision. Its disadvantage is very low recall, which is limited by the accuracy and completeness of the rule set.

To overcome the defects of the above compound word extraction methods and improve the performance of automatic term extraction, Yu Juan et al., in the article "A term extraction method combining part-of-speech analysis and string frequency statistics", proposed a method that combines part-of-speech analysis of atomic words with frequency statistics of atomic word strings. Its basic idea is that atomic words of specific parts of speech are more likely to take part in word formation, and atomic word strings with a higher co-occurrence frequency are more likely to be words. Based on this idea, the method first processes the electronic document into a set of word strings composed of atomic words of specific parts of speech, then counts the frequency of these word strings and their substrings, and finally obtains the strings that are words, thereby automatically extracting the terms in the document. This method still has a defect, however. Although its recall is satisfactory, the result set contains a large number of "half words", such as "management information" (which occurs inside "management information system") and "global" (which occurs inside "global enterprises"), and these reduce the precision of the method.

Summary of the invention

In view of this, the purpose of the invention is to provide a method for automatically extracting terms from a Chinese electronic document that solves the problem of "half words" in the extraction result reducing precision, so that a computer can extract the terms in a Chinese electronic document automatically and efficiently.

The invention is realized by the following scheme: a method for automatically extracting terms from a Chinese electronic document, characterized by comprising the following steps:

Step S01: process the electronic document into a set of word strings composed of atomic words of specific parts of speech;

Step S02: count the frequency of these atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate term, where N is a configurable parameter;

Step S03: delete from the candidate term set the terms that occur only as substrings, obtaining the set of terms that occur in the document and thereby automatically extracting the terms in the Chinese electronic document.

In an embodiment of the invention, step S01 is implemented by the following steps:

S011: perform atomic word segmentation and part-of-speech tagging on the electronic document, obtaining a segmented and tagged document;

S012: delete useless atomic words to obtain the set of atomic word strings, in two sub-steps that each delete one kind of useless atomic word:

S0121: delete useless atomic words according to part of speech: each atomic word that does not take part in word formation is replaced with a first predetermined symbol. In the output, a second predetermined symbol separates atomic words and the first predetermined symbol separates atomic word strings;

S0122: further delete atomic words according to a list of stop atomic words: each stop atomic word is replaced with the first predetermined symbol, producing a new ordered set of atomic word strings.

In an embodiment of the invention, the first predetermined symbol is a newline and the second predetermined symbol is a space.

In an embodiment of the invention, the atomic word segmentation and part-of-speech tagging in step S011 are performed with the ICTCLAS word segmentation system of the Chinese Academy of Sciences or the IRLAS word segmentation system of Harbin Institute of Technology.
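As a minimal sketch of the output format in this embodiment (newline between atomic word strings, space between atomic words), the following hypothetical helpers convert between an in-memory list of atomic word strings and the serialized text. The function names and the sample words are assumptions for illustration only.

```python
def serialize(word_strings):
    """Join atomic words with a space and atomic word strings with a newline."""
    return "\n".join(" ".join(ws) for ws in word_strings if ws)

def deserialize(text):
    """Recover the ordered list of atomic word strings from the serialized text."""
    return [line.split(" ") for line in text.split("\n") if line]

# Hypothetical atomic word strings.
strings = [["信息", "系统"], ["全球性", "企业"]]
encoded = serialize(strings)
```

Here `encoded` holds two lines, one per atomic word string, and `deserialize(encoded)` returns the original list.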

In an embodiment of the invention, step S02 is implemented by the following algorithm:

1) for each atomic word string AWS in the set of atomic word strings, perform step 2);

2) for each atomic word in the atomic word string, perform steps 3) and 4) in order;

3) cut out all substrings of AWS that begin with this atomic word;

4) for each substring, perform step 5);

5) determine whether the substring occurs more than N times in the corpus; if so, perform step 6); otherwise, perform step 7);

6) remove the separators from the substring to form a Chinese character string and take it as a candidate term; also save its frequency of occurrence;

7) return to step 2).
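The algorithm above can be sketched as follows. This is one possible reading rather than a definitive implementation: it counts all substrings in a first pass and applies the threshold afterwards, and it removes separators before counting, which is assumed to give the same result as steps 5) and 6). The function names and sample data are hypothetical, and the comparison is strict (`> n`) because the text says "more than N times".

```python
from collections import Counter

def count_substrings(word_strings):
    """Count every contiguous substring, keyed by its separator-free character string."""
    counts = Counter()
    for aws in word_strings:                            # step 1: each atomic word string
        for start in range(len(aws)):                   # step 2: each atomic word as head
            for end in range(start + 1, len(aws) + 1):  # step 3: substrings headed by it
                counts["".join(aws[start:end])] += 1    # step 6: separators removed
    return counts

def candidate_terms(word_strings, n=1):
    """Keep the substrings that occur more than n times, with their frequencies."""
    counts = count_substrings(word_strings)
    return {term: c for term, c in counts.items() if c > n}

# Hypothetical atomic word strings.
strings = [["信息", "系统"], ["信息", "系统", "管理"], ["管理"]]
terms = candidate_terms(strings, n=1)
```

With this sample, 信息, 系统, 信息系统 and 管理 each occur twice and are kept, while 系统管理 and 信息系统管理 occur once and are dropped.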

The invention designs and implements a method for automatically extracting the terms that occur in a Chinese electronic document. Compared with existing term extraction methods, this method: (1) uses the atomic word as the step size when cutting Chinese character strings, which avoids extraction errors caused by cutting through an atomic word, such as "system worker" and "knowing management"; (2) performs better when extracting compound words, and can also extract compound words that are rarely used on their own, such as "decision support"; (3) solves the problem of "half words" in the result set, improving the precision of automatic term extraction. The effect and benefit of the invention is that it addresses the practical problem that automatic term extraction has low performance and a limited degree of automation. Efficient automatic term extraction is a foundation of automatic text processing and supports applications such as information retrieval, text summarization and content management. A good term extraction method improves the automation and performance of such work.

Description of drawings

Fig. 1 is a schematic flowchart of the method of an embodiment of the invention.

Fig. 2 is a detailed schematic flowchart of the method of another embodiment of the invention.

Embodiment

The invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, this embodiment provides a method for automatically extracting terms from a Chinese electronic document, characterized by comprising the following steps. Step S01: process the electronic document into a set of word strings composed of atomic words of specific parts of speech. Step S02: count the frequency of these atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate term, where N is a configurable parameter that may preferably be 2. Step S03: delete from the candidate term set the terms that occur only as substrings, obtaining the set of terms that occur in the document and thereby automatically extracting the terms in the Chinese electronic document.

Specifically, referring to Fig. 2, the automatic term extraction method of this embodiment extracts the set of terms in a Chinese electronic document in the following steps:

1. Perform atomic word segmentation and part-of-speech tagging on the electronic document, obtaining a segmented and tagged document.

This step performs atomic word segmentation and part-of-speech tagging on the input electronic document. The ICTCLAS word segmentation system of the Chinese Academy of Sciences, the IRLAS word segmentation system of Harbin Institute of Technology, or a similar system may be used.

2. Delete useless atomic words to obtain the set of atomic word strings.

Useless atomic words are atomic words that generally do not take part in forming compound words. This step processes the segmented and tagged electronic document and deletes two kinds of useless atomic words in two sub-steps. The output is an ordered set of word strings composed of the retained atomic words.
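The later steps only need a sequence of (atomic word, part of speech) pairs. As a minimal sketch, assuming the segmenter emits space-separated `word/tag` tokens (a common output convention for such systems, but an assumption here rather than something stated in this document), the tagged text can be parsed as follows. The function name, the sample line and its tags are hypothetical.

```python
def parse_tagged(text):
    """Parse 'word/pos word/pos ...' into a list of (word, pos) pairs."""
    pairs = []
    for token in text.split():
        word, sep, pos = token.rpartition("/")
        if sep and word:  # skip tokens that carry no tag
            pairs.append((word, pos))
    return pairs

# Hypothetical tagged line: n = noun, u = auxiliary word (assumed tag names).
tagged = parse_tagged("全球性/n 企业/n 的/u 信息/n 系统/n")
```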

For convenience in the following explanation, two definitions are given here.

Definition 1: A Chinese atomic word string (AWS) is a finite sequence of one or more Chinese atomic words, written AWS = "aw_1 aw_2 ... aw_n", where aw_1 aw_2 ... aw_n is the value of AWS and each aw_i (1 ≤ i ≤ n) is an atomic word. The length of an atomic word string (written AWSLen) is the number of atomic words that make it up.

For example, "information_system_" is an atomic word string of length 2, formed by segmenting "information system" into atomic words.

A space may be used as the separator between adjacent atomic words in an atomic word string. For clarity, the underscore "_" is used here to represent the space.

Definition 2: A substring of an atomic word string is a subsequence of that atomic word string.

For example, "information_", "system_" and "information_system_" are the substrings of the atomic word string "information_system_".
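The two definitions can be made concrete with a small sketch. It represents an atomic word string as a Python list and enumerates its contiguous substrings; treating "subsequence" as contiguous is an assumption, chosen because it matches the example above and the cutting algorithm that takes substrings headed by each atomic word. The function names are hypothetical.

```python
def aws_len(aws):
    """AWSLen: the number of atomic words in the atomic word string."""
    return len(aws)

def substrings(aws):
    """All contiguous substrings, grouped by the atomic word that heads them."""
    return [aws[i:j] for i in range(len(aws)) for j in range(i + 1, len(aws) + 1)]

aws = ["信息", "系统"]  # "information_system_"
subs = substrings(aws)
```

For this string `subs` contains the three substrings listed in the example. In general an atomic word string of length n has n(n+1)/2 contiguous substrings.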

1) Deletion by part of speech. This sub-step deletes useless atomic words according to their part of speech. Given the segmented and tagged electronic document as input, it keeps the atomic words tagged with specific parts of speech and replaces the atomic words that generally do not take part in word formation (for example, prepositions and auxiliary words) with a newline (or another predetermined symbol; the choice is not limited here). The output is an ordered set of atomic word strings, each made up of retained atomic words. In the output, a single space separates atomic words and a newline separates atomic word strings.

2) Deletion of stop atomic words. This sub-step further deletes atomic words according to a list of stop atomic words, replacing each stop atomic word with a newline and producing a new ordered set of atomic word strings. Stop atomic words are words that, judged by part of speech, could take part in forming compound words but in practice generally do not, such as "be" (verb), "want" (verb), "provide" (verb) and "many" (adjective).
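The two deletion sub-steps can be sketched in a single pass, since both replace an atomic word with the same separator. This is a minimal illustration, not the patented implementation: the set of retained parts of speech, the stop list (an assumed Chinese rendering of the four examples above) and the sample input are all assumptions.

```python
def build_word_strings(tagged, keep_pos, stop_words):
    """Split a tagged token stream into atomic word strings at every deleted word."""
    strings, current = [], []
    for word, pos in tagged:
        if pos in keep_pos and word not in stop_words:
            current.append(word)          # retained atomic word
        else:
            if current:                   # deleted word acts as a string boundary
                strings.append(current)
            current = []
    if current:
        strings.append(current)
    return strings

KEEP_POS = {"n", "v", "a"}                # assumed: nouns, verbs, adjectives
STOP_WORDS = {"是", "要", "提供", "多"}    # assumed rendering of be, want, provide, many

tagged = [("全球性", "n"), ("企业", "n"), ("的", "u"),
          ("信息", "n"), ("系统", "n"), ("是", "v"), ("基础", "n")]
word_strings = build_word_strings(tagged, KEEP_POS, STOP_WORDS)
```

In this sample the auxiliary word 的 is removed by part of speech and the verb 是 by the stop list, leaving three atomic word strings.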

3. Count substring frequencies to obtain the candidate term set.

The previous step turns the segmented and tagged electronic document into an ordered set of atomic word strings. This step cuts out the substrings of these atomic word strings and outputs the substrings that occur repeatedly in the document as candidate terms. The candidate terms include atomic words, compound words and some Chinese character strings that are not independent words. The algorithm is as follows:

1) For each atomic word string AWS in the set of atomic word strings, perform 2).

2) For each atomic word in the atomic word string, perform 3) and 4) in order.

3) Cut out all substrings of AWS that begin with this atomic word.

4) For each substring, perform 5).

5) Determine whether the substring occurs more than N times in the corpus (N is a configurable parameter); if so, perform 6); otherwise, perform 7).

6) Remove the separators from the substring to form a Chinese character string and take it as a candidate term; also save its frequency of occurrence.

7) Return to 2).

4. Delete "half words" to obtain the term set.

This step processes the candidate term set and deletes the candidate terms that occur only as substrings, giving the final result of automatic term extraction: the set of terms in the electronic document. A candidate term that occurs only as a substring is a substring whose frequency of occurrence in the document is the same as that of its parent string.

In practical term extraction, a manual correction step may be added after the term set has been extracted automatically, in order to improve the accuracy of the result. Manual correction is the process in which an expert manually edits the automatic extraction result.
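The deletion rule can be sketched as follows: a candidate is dropped when some longer candidate contains it and has the same frequency. Candidates are represented as tuples of atomic words so that containment is tested at the atomic word level. The function names are hypothetical, and the sample data uses English glosses for the nine candidates and frequencies that this document reports in Table four; the split of each candidate into atomic words is an assumption.

```python
def contains(parent, child):
    """True if tuple `child` occurs contiguously inside the longer tuple `parent`."""
    k = len(child)
    return len(parent) > k and any(
        parent[i:i + k] == child for i in range(len(parent) - k + 1)
    )

def remove_half_words(candidates):
    """Drop candidates whose frequency equals that of a candidate containing them."""
    return {
        term: freq
        for term, freq in candidates.items()
        if not any(freq == pf and contains(parent, term)
                   for parent, pf in candidates.items())
    }

table_four = {
    ("enterprise",): 3, ("whole world",): 3, ("global",): 2,
    ("global", "enterprise"): 2, ("communication",): 2, ("system",): 2,
    ("information",): 2, ("information", "system"): 2, ("scope",): 2,
}
table_five = remove_half_words(table_four)
```

Under these assumptions "global", "system" and "information" are removed because their parent strings have the same frequency, while "enterprise" is kept because it occurs 3 times against 2 for "global enterprise". The six remaining entries agree with Table five.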

To help those skilled in the art better understand the invention, the document shown in Table one is taken as an example.

Table one

In step 1, the document is segmented into atomic words and tagged with parts of speech using the ICTCLAS word segmentation system of the Chinese Academy of Sciences. The segmented document is shown in Table two.

Table two

Step 2 deletes the useless atomic words. The result is shown in Table three:

Table three

Step 3 counts substring frequencies to obtain the candidate terms. The result is shown in Table four.

No.	Term	Frequency
1	Enterprise	3
2	The whole world	3
3	Global	2
4	Global enterprises	2
5	Communication	2
6	System	2
7	Information	2
8	Information system	2
9	Scope	2

Table four

Step 4 deletes the "half words" to obtain the term set produced by the automatic term extraction method of the invention. The result is shown in Table five.

No.	Term	Frequency
1	Enterprise	3
2	The whole world	3
3	Global enterprises	2
4	Communication	2
5	Information system	2
6	Scope	2

Table five

The above is only a preferred embodiment of the invention. All equivalent changes and modifications made within the scope of the claims of this application fall within the scope of the invention.

Claims

1. A method for automatically extracting terms from a Chinese electronic document, characterized by comprising the following steps:

2. The method for automatically extracting terms from a Chinese electronic document according to claim 1, characterized in that step S01 is implemented by the following steps:

3. The method for automatically extracting terms from a Chinese electronic document according to claim 2, characterized in that the first predetermined symbol is a newline and the second predetermined symbol is a space.

4. The method for automatically extracting terms from a Chinese electronic document according to claim 2, characterized in that the atomic word segmentation and part-of-speech tagging of the electronic document in step S011 are performed with the ICTCLAS word segmentation system or the IRLAS word segmentation system.

5. The method for automatically extracting terms from a Chinese electronic document according to claim 1, characterized in that step S02 is implemented by the following algorithm:

2) for each atomic word in the atomic word string, perform steps 3) and 4) in order;

3) cut out all substrings of AWS that begin with this atomic word;

4) for each substring, perform step 5);

7) return to step 2).