CN103207921A - Method for automatically extracting terms from Chinese electronic document - Google Patents

Info

Publication number
CN103207921A
Authority
CN
China
Prior art keywords
word
atom
atom word
string
electronic document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101564948A
Other languages
Chinese (zh)
Inventor
Yu Juan (于娟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN2013101564948A
Publication of CN103207921A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for automatically extracting terms from a Chinese electronic document. The method comprises the following steps. Step S01: process the electronic document into a set of word strings composed of atomic words of specific parts of speech. Step S02: count the frequencies of the atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate term, where N is a configurable parameter. Step S03: delete from the candidate term set the terms that occur only as substrings, which yields the set of terms appearing in the document and thereby extracts the terms of the Chinese electronic document automatically. The method addresses the practical problem that automatic term extraction has had limited performance and a limited degree of automation. Efficient automatic term extraction is a foundation of automatic text processing and supports applications such as information retrieval, text summarization and content management; a good term extraction method can raise the degree of automation and the performance of such work.

Description

A method for automatically extracting words from a Chinese electronic document
Technical field
The invention belongs to the field of natural language processing and relates to a method for automatically extracting the set of words from a Chinese electronic document.
Background art
In recent years, with the rapid development of scientific research, the economy and the Internet, the number of electronic documents has grown at an accelerating rate. Processing these massive collections of electronic documents quickly and effectively has become one of the key tasks in fields such as information retrieval, information management and Web services. Automatic document-processing techniques such as text retrieval, classification and automatic summarization have therefore become research hotspots in the related fields. Automatically extracting all the words in an electronic document ("word extraction" for short) is a basic task underlying these techniques. The word extraction method of the invention targets the automatic processing of Chinese electronic documents; unless otherwise stated, "document" below means "Chinese electronic document" and "word" means "Chinese word".
Words (terms) in a document fall into two kinds according to whether they follow the principle of compositionality (the meaning of a complex expression is determined by the meanings of its constituents and the way they are combined): atomic words and compound words. An atomic word (aw) is a short word that the language uses to form other, new words; it does not itself follow the principle of compositionality. Examples are "system" and "knowledge". A compound word (cw) is a longer, content-oriented word made up of several atomic words, and its formation generally follows the principle of compositionality. Examples are "systems engineering" and "information management".
Atomic words can be extracted automatically and easily with an atomic word dictionary. Because atomic words are relatively stable and new ones rarely appear, extraction based on dictionaries such as a Chinese subject thesaurus or a Chinese classified thesaurus achieves satisfactory precision and recall.
Methods for extracting compound words fall into two main classes. The first is statistical, for example word extraction based on string frequency and string length. The second is based on part-of-speech analysis, for example extracting compound words according to part-of-speech word-formation rules. Each class has its own advantages and disadvantages.
The basic idea of statistical compound word extraction is that the more often adjacent Chinese characters co-occur, the more likely they are to form an independent word. The general procedure is: (1) split the electronic document according to some algorithm to obtain its substrings; (2) compute, for each substring, indicators such as its frequency of occurrence or the probability that its left and right substrings occur on their own; (3) decide whether the substring is an independent word by checking whether these indicators reach a threshold. The advantage of this approach is that it does not rely on a dictionary and is therefore not limited by one; recall is generally high, and newly emerging words can be extracted. The disadvantages are: (1) statistical methods are generally suitable only for extracting words from large corpora; (2) precision and recall cannot both be guaranteed, because a threshold set for high precision inevitably lowers recall; (3) grammar and morphology are ignored when the document is split into substrings, so some substrings that are not words are wrongly included in the result, for example fragments cut across word boundaries (rendered in this translation as "system worker" and "knowing management").
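The statistical procedure described above can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the function name `char_ngram_candidates`, the parameters `max_len` and `min_count`, and the toy corpus are all assumptions made for illustration, and 系统工程 is assumed to be the Chinese string behind "systems engineering". The sketch counts every character n-gram and keeps those that reach a frequency threshold, which is enough to show how non-word fragments enter the result.

```python
from collections import Counter

def char_ngram_candidates(text, max_len=4, min_count=2):
    """Count every character n-gram of length 2..max_len and keep the frequent ones."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Threshold on raw frequency only; no grammar or dictionary is consulted.
    return {s: c for s, c in counts.items() if c >= min_count}

# Toy corpus (an assumption): the compound 系统工程 occurs twice.
corpus = "系统工程很重要。系统工程有方法。"
cands = char_ngram_candidates(corpus)
print(sorted(cands))
```

Because the split ignores word boundaries, fragments such as 统工 and 系统工 pass the threshold together with the real words 系统, 工程 and 系统工程. This is the third disadvantage listed above.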
Methods based on part-of-speech analysis generally segment the corpus into atomic words using an atomic word dictionary and then take combinations of atomic words (for example, multi-word noun sequences) as words according to rules. Zhang Xin et al., in the article "Research on methods for automatic acquisition of ontology concepts based on rules and statistics", proposed a term extraction method that judges from parts of speech whether a Chinese character string is an independent word. The advantage of word extraction based on part-of-speech analysis is high precision; the disadvantage is very low recall, which is limited by the accuracy and completeness of the rule set.
To overcome the defects of the above compound word extraction methods and improve the performance of automatic word extraction, Yu Juan et al., in the article "A word extraction method combining part-of-speech analysis and string-frequency statistics", proposed a method that combines part-of-speech analysis of atomic words with frequency statistics of atomic word strings. Its basic idea is that atomic words of specific parts of speech are more likely to take part in word formation, and that atomic word strings with a higher co-occurrence frequency are more likely to be words. Accordingly, the method first processes the electronic document into a set of word strings composed of atomic words of specific parts of speech, then counts the frequencies of these word strings and their substrings, and finally obtains the strings that are words, thereby extracting the words in the document automatically. This method still has a defect, however. Although its recall is satisfactory, the result set contains a large number of "half-words", such as "management information" (which occurs within "management information system") and "global" (which occurs within "global enterprise"), and these lower the precision of the method.
Summary of the invention
In view of this, the purpose of the invention is to provide a method for automatically extracting words from a Chinese electronic document that solves the problem of "half-words" in the automatic extraction result lowering precision, so that a computer can extract the words in a Chinese electronic document automatically and efficiently.
The invention adopts the following scheme: a method for automatically extracting words from a Chinese electronic document, comprising the following steps:
Step S01: process the electronic document into a set of word strings composed of atomic words of specific parts of speech;
Step S02: count the frequencies of the atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate word, where N is a configurable parameter;
Step S03: delete from the candidate word set the words that occur only as substrings, obtaining the set of words that occur in the document and thereby extracting the words in the Chinese electronic document automatically.
In an embodiment of the invention, step S01 is implemented by the following steps:
S011: perform atomic word segmentation and part-of-speech tagging on the electronic document, obtaining a segmented and tagged document;
S012: delete useless atomic words to obtain the set of atomic word strings, which comprises the following two sub-steps for deleting two kinds of useless atomic words:
S0121: delete useless atomic words according to part of speech: replace the atomic words that do not take part in word formation with a first predetermined symbol; in the output, a second predetermined symbol is used as the separator between atomic words, and the first predetermined symbol is used as the separator between atomic word strings;
S0122: further delete atomic words according to a list of stop atomic words: replace each stop atomic word with the first predetermined symbol, thereby generating a new ordered set of atomic word strings.
In an embodiment of the invention, the first predetermined symbol is a newline and the second predetermined symbol is a space.
In an embodiment of the invention, the atomic word segmentation and part-of-speech tagging of step S011 are performed with the ICTCLAS word segmentation system of the Chinese Academy of Sciences or the IRLAS word segmentation system of Harbin Institute of Technology.
In an embodiment of the invention, step S02 is implemented by the following algorithm:
1) for each atomic word string AWS in the set of atomic word strings, perform step 2);
2) for each atomic word of the atomic word string, perform steps 3) and 4) in order;
3) split out all substrings of AWS that begin with this atomic word;
4) for each substring, perform step 5);
5) determine whether the substring occurs more than N times in the corpus; if so, perform step 6); otherwise, perform step 7);
6) remove the separators from the substring to form a Chinese character string and take it as a candidate word; also record its frequency of occurrence;
7) return to step 2).
The invention designs and implements a method for automatically extracting the words that occur in a Chinese electronic document. Compared with existing word extraction methods, this method: (1) uses the atomic word as the step size when splitting Chinese character strings, which avoids erroneous extractions caused by cutting through atomic words, such as "system worker" and "knowing management"; (2) performs better when extracting compound words, and can also extract compound words that are rarely used on their own, such as "decision support"; (3) solves the problem of "half-words" in the result set, improving the precision of automatic word extraction. The effect and benefit of the invention is that it addresses the practical problem that automatic word extraction has had limited performance and a limited degree of automation. Efficient automatic word extraction is a foundation of automatic text processing and supports applications such as information retrieval, text summarization and content management. A good word extraction method can raise the degree of automation and the performance of such work.
Brief description of the drawings
Fig. 1 is a flow diagram of the method according to an embodiment of the invention.
Fig. 2 is a detailed flow diagram of the method according to another embodiment of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1, this embodiment provides a method for automatically extracting words from a Chinese electronic document, comprising the following steps. Step S01: process the electronic document into a set of word strings composed of atomic words of specific parts of speech. Step S02: count the frequencies of the atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate word, where N is a configurable parameter, preferably 2. Step S03: delete from the candidate word set the words that occur only as substrings, obtaining the set of words that occur in the document and thereby extracting the words in the Chinese electronic document automatically.
Specifically, referring to Fig. 2, the automatic word extraction method of this embodiment extracts the set of words in a Chinese electronic document in the following steps:
1. Perform atomic word segmentation and part-of-speech tagging on the electronic document, obtaining a segmented and tagged document.
This step performs atomic word segmentation and part-of-speech tagging on the input electronic document. The ICTCLAS word segmentation system of the Chinese Academy of Sciences, the IRLAS word segmentation system of Harbin Institute of Technology, or a similar system can be used.
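The document does not specify the output format of ICTCLAS or IRLAS, so the following is only a sketch under an assumption: that the segmenter emits plain text in the common "word/tag" form, with tokens separated by spaces. The function name `parse_tagged` and the sample tags (`n`, `u`, `v`) are hypothetical. The sketch turns such text into (word, tag) pairs that the later steps can consume.

```python
def parse_tagged(text):
    """Parse 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last '/', so a word that itself contains '/' stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # Token carries no tag: keep the word and use an empty tag.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs

# Assumed sample output: 信息 (information), 系统 (system), 的 (auxiliary), 管理 (manage).
tagged = "信息/n 系统/n 的/u 管理/v"
print(parse_tagged(tagged))
```

Keeping the tag next to each atomic word is what allows the next step to decide, word by word, which atomic words to retain.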
2. Delete useless atomic words to obtain the set of atomic word strings.
Useless atomic words are atomic words that generally do not take part in forming compound words. This step processes the segmented and tagged electronic document and deletes two kinds of useless atomic words in two sub-steps; the output is an ordered set of word strings made up of the retained atomic words.
For convenience in what follows, the following definitions are made:
Definition 1: An atomic word string (Chinese atomic word string, AWS) is a finite sequence of one or more Chinese atomic words. It is written $AWS = \text{"}aw_1\_aw_2\_\ldots aw_n\_\text{"}$, where $aw_1\_aw_2\_\ldots aw_n\_$ is the value of AWS and $aw_i$ ($1 \le i \le n$) is an atomic word. The length of an atomic word string (written AWSLen) is the number of atomic words that make it up.
For example, "information_system_" is an atomic word string of length 2, formed by segmenting "information system" into atomic words.
A space can be used as the separator between adjacent atomic words in an atomic word string. For clarity, the underscore "_" is used here to represent the space.
Definition 2: A substring of an atomic word string is a subsequence of that atomic word string.
For example, "information_", "system_" and "information_system_" are the substrings of the atomic word string "information_system_".
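Definitions 1 and 2 can be made concrete with a short sketch. The helper names `parse_aws`, `substrings` and `to_value` are hypothetical, and the sketch assumes that "substring" means a contiguous run of atomic words, which is how the later algorithm (substrings beginning with a given atomic word) appears to use the term.

```python
def parse_aws(value):
    """Split an atomic word string such as 'information_system_' into its atomic words."""
    return tuple(w for w in value.split("_") if w)

def substrings(aws):
    """All contiguous runs of atomic words in `aws`, each returned as a tuple."""
    n = len(aws)
    return [aws[i:j] for i in range(n) for j in range(i + 1, n + 1)]

def to_value(words):
    """Write atomic words back in the 'aw1_aw2_' notation of Definition 1."""
    return "".join(w + "_" for w in words)

aws = parse_aws("information_system_")
print(len(aws))                                # AWSLen
print([to_value(s) for s in substrings(aws)])
```

Under this assumption an atomic word string of length n has n(n+1)/2 substrings, counting the string itself.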
1) Deletion by part of speech. This sub-step deletes useless atomic words according to their part of speech. Given the segmented and tagged electronic document as input, it keeps the atomic words tagged with specific parts of speech and replaces the atomic words that generally do not take part in word formation (for example, prepositions and auxiliary words) with a newline (or another predetermined symbol; the invention is not limited in this respect). The output is therefore an ordered set of atomic word strings, each made up of retained atomic words. In the output, a single space separates atomic words and a newline separates atomic word strings.
2) Stop-word deletion. This sub-step further deletes atomic words according to a list of stop atomic words, replacing each stop atomic word with a newline and thereby generating a new ordered set of atomic word strings. Stop atomic words are words that, judging by part of speech, could take part in forming compound words but in practice generally do not, for example "be" (verb), "want" (verb), "provide" (verb) and "many" (adjective).
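The two deletion sub-steps can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the set `KEEP_TAGS` and its tag names are hypothetical (the document does not list the retained parts of speech), English glosses stand in for the Chinese atomic words, and the two sub-steps are merged into a single pass for brevity.

```python
KEEP_TAGS = {"n", "v", "a"}                      # hypothetical tags assumed to take part in word formation
STOP_WORDS = {"is", "want", "provide", "many"}   # glosses of the stop atomic words named in the text
BREAK = "\n"   # first predetermined symbol: separates atomic word strings
SEP = " "      # second predetermined symbol: separates atomic words

def to_atomic_word_strings(tagged):
    """Replace useless atomic words with BREAK and return the atomic word strings in order."""
    pieces = []
    for word, tag in tagged:
        useless = tag not in KEEP_TAGS or word in STOP_WORDS
        pieces.append(BREAK if useless else word)
    text = SEP.join(pieces)
    # Split on BREAK, normalise spacing, and drop empty strings left by adjacent breaks.
    return [SEP.join(chunk.split()) for chunk in text.split(BREAK) if chunk.strip()]

tagged = [("global", "a"), ("enterprise", "n"), ("is", "v"), ("many", "a"),
          ("information", "n"), ("system", "n"), ("of", "u"), ("scope", "n")]
print(to_atomic_word_strings(tagged))
```

Replacing a deleted word with a break, instead of simply dropping it, keeps words that were not adjacent in the document from being joined into one atomic word string.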
3. Count substring frequencies to obtain the candidate word set.
The previous step turned the segmented and tagged electronic document into an ordered set of atomic word strings. This step splits these atomic word strings into substrings and outputs the substrings that occur repeatedly in the document as candidate words. The candidate words include atomic words, compound words, and some Chinese character strings that are not independent words. The algorithm is as follows:
1) For each atomic word string AWS in the set of atomic word strings, perform 2).
2) For each atomic word of the atomic word string, perform 3) and 4) in order.
3) Split out all substrings of AWS that begin with this atomic word.
4) For each substring, perform 5).
5) Determine whether the substring occurs more than N times in the corpus (N is a configurable parameter); if so, perform 6); otherwise, perform 7).
6) Remove the separators from the substring to form a Chinese character string and take it as a candidate word; also record its frequency of occurrence.
7) Return to 2).
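The algorithm above can be sketched in a few lines. The function name `candidate_words` is hypothetical, and the sketch restructures the loop slightly: it tallies every substring occurrence in one pass and applies the threshold afterwards, which gives the same candidates as testing each substring against the corpus in turn. Candidates are kept as tuples of atomic words; for Chinese text, `"".join(words)` would produce the character string of step 6). The example uses N = 1, since Table 4 lists candidates with a frequency of 2; whether the threshold is meant to be inclusive is not stated unambiguously.

```python
from collections import Counter

def candidate_words(aws_list, n=1):
    """Return {candidate: frequency} for substrings that occur more than n times."""
    counts = Counter()
    for aws in aws_list:                               # step 1: each atomic word string
        words = aws.split()
        for i in range(len(words)):                    # step 2: each starting atomic word
            for j in range(i + 1, len(words) + 1):     # step 3: substrings beginning with it
                counts[tuple(words[i:j])] += 1
    # steps 5 and 6: keep substrings above the threshold together with their frequency
    return {sub: c for sub, c in counts.items() if c > n}

# English glosses stand in for Chinese atomic words; the corpus is made up.
corpus = ["global enterprise", "information system", "global enterprise",
          "information system", "enterprise", "scope"]
for words, freq in candidate_words(corpus).items():
    print(" ".join(words), freq)
```

Since only whole atomic words are ever joined, fragments that cut through an atomic word cannot become candidates.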
4. Delete "half-words" to obtain the word set.
This step processes the candidate word set and deletes the candidate words that occur only as substrings, giving the final result of automatic word extraction: the set of words in the electronic document. A candidate word that occurs only as a substring is a substring whose frequency of occurrence in the document equals that of its parent string.
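The half-word rule can be sketched as below. The names `contains` and `drop_half_words` are hypothetical, candidates are represented as tuples of atomic words, and the frequencies are made up for illustration. A candidate is dropped when some longer candidate contains it and has the same frequency.

```python
def contains(parent, child):
    """True if tuple `child` is a contiguous, strictly shorter part of tuple `parent`."""
    k = len(child)
    return k < len(parent) and any(
        parent[i:i + k] == child for i in range(len(parent) - k + 1)
    )

def drop_half_words(candidates):
    """Remove candidates whose frequency equals that of a longer candidate containing them."""
    return {
        word: freq
        for word, freq in candidates.items()
        if not any(
            contains(other, word) and other_freq == freq
            for other, other_freq in candidates.items()
        )
    }

# Made-up frequencies for the "management information system" example from the background.
candidates = {
    ("management",): 5,
    ("information",): 2,
    ("system",): 4,
    ("management", "information"): 2,
    ("management", "information", "system"): 2,
}
for words, freq in drop_half_words(candidates).items():
    print(" ".join(words), freq)
```

Here "information" and "management information" are removed because they never occur outside "management information system", while "management" and "system" survive because they also occur elsewhere.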
In practice, to improve the accuracy of the result, a manual revision step can be added after the word set has been extracted automatically. Manual revision is the process in which an expert manually modifies the automatic extraction result.
To help those skilled in the art better understand the invention, the document shown in Table 1 is taken as an example.
[Table 1: the example document; provided as an image in the original and not reproduced here]
In step 1, the document is segmented into atomic words and tagged with parts of speech using the ICTCLAS word segmentation system of the Chinese Academy of Sciences; the segmented document is shown in Table 2.
[Table 2: the document after atomic word segmentation and part-of-speech tagging; not reproduced here]
Step 2 deletes the useless atomic words. The result is shown in Table 3.
[Table 3: the atomic word strings after deletion of useless atomic words; provided as an image in the original and not reproduced here]
Step 3 counts substring frequencies to obtain the candidate words. The result is shown in Table 4.
No.  Word                 Frequency
1    Enterprise           3
2    The whole world      3
3    Global               2
4    Global enterprise    2
5    Communication        2
6    System               2
7    Information          2
8    Information system   2
9    Scope                2
Table 4
Step 4 deletes the "half-words", giving the word set extracted automatically by the method of the invention. The result is shown in Table 5.
No.  Word                 Frequency
1    Enterprise           3
2    The whole world      3
3    Global enterprise    2
4    Communication        2
5    Information system   2
6    Scope                2
Table 5
The above are only preferred embodiments of the invention; all equivalent changes and modifications made within the scope of the claims of the present application shall fall within the scope of the invention.

Claims (5)

1. A method for automatically extracting words from a Chinese electronic document, characterized by comprising the following steps:
Step S01: process the electronic document into a set of word strings composed of atomic words of specific parts of speech;
Step S02: count the frequencies of the atomic word strings and their substrings, and take each atomic word string that occurs more than N times as a candidate word, where N is a configurable parameter;
Step S03: delete from the candidate word set the words that occur only as substrings, obtaining the set of words that occur in the document and thereby extracting the words in the Chinese electronic document automatically.
2. The method for automatically extracting words from a Chinese electronic document according to claim 1, characterized in that step S01 is implemented by the following steps:
S011: perform atomic word segmentation and part-of-speech tagging on the electronic document, obtaining a segmented and tagged document;
S012: delete useless atomic words to obtain the set of atomic word strings, which comprises the following two sub-steps for deleting two kinds of useless atomic words:
S0121: delete useless atomic words according to part of speech: replace the atomic words that do not take part in word formation with a first predetermined symbol; in the output, a second predetermined symbol is used as the separator between atomic words, and the first predetermined symbol is used as the separator between atomic word strings;
S0122: further delete atomic words according to a list of stop atomic words: replace each stop atomic word with the first predetermined symbol, thereby generating a new ordered set of atomic word strings.
3. The method for automatically extracting words from a Chinese electronic document according to claim 2, characterized in that the first predetermined symbol is a newline and the second predetermined symbol is a space.
4. The method for automatically extracting words from a Chinese electronic document according to claim 2, characterized in that the atomic word segmentation and part-of-speech tagging of step S011 are performed with the ICTCLAS word segmentation system or the IRLAS word segmentation system.
5. The method for automatically extracting words from a Chinese electronic document according to claim 1, characterized in that step S02 is implemented by the following algorithm:
1) for each atomic word string AWS in the set of atomic word strings, perform step 2);
2) for each atomic word of the atomic word string, perform steps 3) and 4) in order;
3) split out all substrings of AWS that begin with this atomic word;
4) for each substring, perform step 5);
5) determine whether the substring occurs more than N times in the corpus; if so, perform step 6); otherwise, perform step 7);
6) remove the separators from the substring to form a Chinese character string and take it as a candidate word; also record its frequency of occurrence;
7) return to step 2).
CN2013101564948A 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document Pending CN103207921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101564948A CN103207921A (en) 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013101564948A CN103207921A (en) 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document

Publications (1)

Publication Number Publication Date
CN103207921A true CN103207921A (en) 2013-07-17

Family

ID=48755142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013101564948A Pending CN103207921A (en) 2013-04-28 2013-04-28 Method for automatically extracting terms from Chinese electronic document

Country Status (1)

Country Link
CN (1) CN103207921A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0802492A1 (en) * 1996-04-17 1997-10-22 International Business Machines Corporation Document search system
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
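Yu Juan (于娟) et al.: "A word extraction method combining part-of-speech analysis and string-frequency statistics" (结合词性分析与串频统计的词语提取方法), Systems Engineering - Theory & Practice (《系统工程理论与实践》) *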

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015013899A1 (en) * 2013-07-31 2015-02-05 Empire Technology Development Llc Information extraction from semantic data
CN104766504A (en) * 2015-03-31 2015-07-08 黄庆梅 Atomic word point contact learning machine
CN106970904A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 The method and device of new word discovery
CN106970904B (en) * 2016-01-14 2020-06-05 北京国双科技有限公司 Method and device for discovering new words
CN109213988A (en) * 2017-06-29 2019-01-15 武汉斗鱼网络科技有限公司 Barrage subject distillation method, medium, equipment and system based on N-gram model
CN109213988B (en) * 2017-06-29 2022-06-21 武汉斗鱼网络科技有限公司 Barrage theme extraction method, medium, equipment and system based on N-gram model
CN110969009A (en) * 2019-12-03 2020-04-07 哈尔滨工程大学 Word segmentation method of Chinese natural language text
CN110969009B (en) * 2019-12-03 2023-10-13 哈尔滨工程大学 Word segmentation method for Chinese natural language text

Similar Documents

Publication Publication Date Title
CN110874531B (en) Topic analysis method and device and storage medium
CN109710947B (en) Electric power professional word bank generation method and device
CN105426539A (en) Dictionary-based lucene Chinese word segmentation method
CN105138514A (en) Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction
CN103729402A (en) Method for establishing mapping knowledge domain based on book catalogue
CN102915299A (en) Word segmentation method and device
CN102831194A (en) New word automatic searching system and new word automatic searching method based on query log
CN103207921A (en) Method for automatically extracting terms from Chinese electronic document
CN102789464A (en) Natural language processing method, device and system based on semanteme recognition
CN104536830A (en) KNN text classification method based on MapReduce
CN113221559A (en) Chinese key phrase extraction method and system in scientific and technological innovation field by utilizing semantic features
Pande et al. Application of natural language processing tools in stemming
Korobkin et al. Method of identification of patent trends based on descriptions of technical functions
CN102375863A (en) Method and device for keyword extraction in geographic information field
CN107577713B (en) Text handling method based on electric power dictionary
CN101872363B (en) Method for extracting keywords
Alhanini et al. The enhancement of arabic stemming by using light stemming and dictionary-based stemming
Daille Building bilingual terminologies from comparable corpora: The TTC TermSuite
CN103927176A (en) Method for generating program feature tree on basis of hierarchical topic model
CN106096014A (en) The Text Clustering Method of mixing length text set based on DMR
Elrajubi An improved Arabic light stemmer
CN106682107B (en) Method and device for determining incidence relation of database table
He et al. An approach to automatically constructing domain ontology
CN111209737B (en) Method for screening out noise document and computer readable storage medium
CN112395856B (en) Text matching method, text matching device, computer system and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130717