CN104572628A - System and method for automatically extracting academic definition based on syntax characteristics - Google Patents
- Publication number
- CN104572628A
- Authority
- CN
- China
- Prior art keywords
- definition
- sentence
- word
- feature
- verb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a system and a method for automatically extracting academic definitions based on syntactic features. The system comprises a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module. The preprocessing module extracts the abstract and the full text from an input academic document and splits them into single sentences. The definition sentence extraction module uses rules and a statistical method to judge whether each single sentence is a definition sentence. The definition term extraction module preprocesses the sentences judged to be definitions, extracts a word string as the term word according to term word extraction templates, and corrects it using the adjacent preceding and following word strings to obtain the definition term. The output module outputs the definition term. The system and method extract the definition sentences and their corresponding term words from a document and present them to users, so that users can understand the retrieved content conveniently, quickly and accurately.
Description
Technical field
The invention belongs to the field of information technology, and in particular relates to a system and method for automatically extracting academic definitions based on syntactic features.
Background technology
For academic documents, users want to retrieve the content they are looking for quickly and accurately, and to understand it rapidly. However, academic documents contain large numbers of technical terms and newly coined terms, so users have to study and analyse the retrieved documents, find the sentences in which a keyword occurs, and read them carefully to understand it. This process is very inefficient.
Research on the automatic extraction of definition sentences has become active only in recent years. Most existing work uses rule-based methods: templates are built from several commonly used definition patterns, and sentences are matched against them. Because the templates have limited coverage, recall is very low. Statistics-based methods have also been used: statistical models and algorithms are applied to find the definition sentences that fit statistical regularities. These methods do not analyse sentences at the syntactic level, so their accuracy is relatively low.
Summary of the invention
To solve the above technical problems, the object of the invention is to provide a system and method for automatically extracting academic definitions based on syntactic features.
The object of the invention is achieved by the following technical solution:
A system for automatically extracting academic definitions based on syntactic features, the system comprising:
a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module, wherein
the preprocessing module extracts the abstract and the full text from an input academic document and splits the extracted abstract and full text into single sentences;
the definition sentence extraction module uses a rule method and a statistical method to judge whether each single sentence is a definition sentence;
the definition term extraction module preprocesses the single sentences judged to be definitions, extracts a word string as the term word according to the term word extraction templates, and corrects it using the adjacent preceding and following word strings to obtain the definition term;
the output module outputs the definition term.
A method for automatically extracting academic definitions based on syntactic features, the method comprising:
extracting the abstract and the full text from an input academic document, and splitting the extracted abstract and full text into single sentences;
judging, by a rule method and a statistical method, whether each single sentence is a definition sentence;
preprocessing the single sentences judged to be definitions, extracting a word string as the term word according to the term word extraction templates, and correcting it using the adjacent preceding and following word strings to obtain the definition term;
outputting the definition term.
Compared with the prior art, one or more embodiments of the invention may have the following advantages:
The invention extracts the definition sentences and the corresponding term words from a document and presents them to the user, so that the user can understand the retrieved content quickly and accurately. On the basis of rule templates, this specification proposes a method for automatically extracting academic definitions based on syntactic features. The method combines the advantages of rule-based and statistical methods, and analyses the sentences of academic documents at the level of syntactic structure.
Brief description of the drawings
Fig. 1 is a structural diagram of the system for automatically extracting academic definitions based on syntactic features;
Fig. 2 is a flowchart of the rule-based definition sentence extraction method;
Fig. 3 is a flowchart of definition sentence extraction based on the statistical method;
Fig. 4 is a flowchart of definition term extraction.
Embodiment
To make the object, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Fig. 1 shows the structure of the system for automatically extracting academic definitions based on syntactic features. The system comprises a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module, wherein
the preprocessing module extracts the abstract and the full text from an input academic document and splits the extracted abstract and full text into single sentences; a word segmentation tool and a syntactic analysis tool are then used for word segmentation, part-of-speech tagging, syntactic analysis and similar work.
The definition sentence extraction module uses a rule method and a statistical method to judge whether each single sentence is a definition sentence.
The definition term extraction module first preprocesses the single sentences judged to be definitions by marking the character strings that act as separators, for example "so-called", "is called", "is defined as" and "is referred to as". Next, according to the term word extraction templates, it extracts the word string at a specific position as a candidate definition term. Finally, using a list of front-adjacent words and a list of rear-adjacent words obtained by counting high-frequency words, it removes the parts of the candidate that do not belong to the term word, and so obtains the definition term (as shown in Fig. 4). The preprocessing referred to above comprises extracting the abstract and full text from the input academic document and splitting the resulting corpus into sentences. The word string at the specific position is the string matched by the first (.*?) group in the Table 3 templates or by the second (.*?) group in the Table 4 templates; it is extracted as the term word.
The output module outputs the definition term.
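The sentence-splitting step of the preprocessing module can be illustrated with a minimal sketch. The patent does not say which punctuation marks end a sentence, so the set used here (full-width and ASCII exclamation and question marks plus the Chinese full stop) and the function name `split_sentences` are assumptions made only for illustration; word segmentation and parsing would be delegated to external tools and are not shown.

```python
import re
from typing import List

# Assumed sentence terminators; the ASCII full stop is left out on purpose
# so that decimals such as "3.5" are not split.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")

def split_sentences(text: str) -> List[str]:
    """Split a block of text into single sentences, dropping empty pieces."""
    return [piece.strip() for piece in _SENTENCE.findall(text) if piece.strip()]
```

Each returned string keeps its terminating punctuation mark, so later template matching still sees the complete sentence.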
The rule method applies affirmative template matching and/or negative template matching to each single sentence. Table 1 lists the affirmative definition sentence templates and Table 2 lists the negative definition sentence templates. The cue phrases inside the templates appear here in English translation.
Table 1
1 ^(.*?)so-called(.*?)$
2 ^(.*?)(be called|be called|be called|be called)(.*?)$
3 ^(.*?)(being defined as)(.*?)$
4 ^(.*?)(referring to)(.*?)$
5 ^(.*?)(referring to)(.*?)$
6 ^(.*?)(being referred to as)(.*?)$
7 ^(.*?)(quilt|)(.*?)(being defined as)(.*?)$
8 ^(.*?)(also known as|cry again|also claim|also cry|also cry|also known as|be commonly called as|therefore claim|be referred to as|common name)(.*?)$
9 ^(.*?)(being called)(.*?)$
11 ^(.*?)(general designation is|be referred to as|being commonly referred to as|common name is|called after)(.*?)$
13 ^(.*?)(concept|definition)(.*?)(:|:|be|for)(.*?)$
14 ^(.*?)(refer generally to|refer to)(.*?)$
15 ^(.*?)(be a kind of|be one|be a class)(.*?)$
12 ^(.*?)(YES)(.*?)(one|one|be a class)(.*?)$
Table 2
^(.*?)(name is called|claim for|be known as|be called as)(.*?)$
^(.*?)(deserving to be called)(.*?)$
^(.*?)(being called as respectively)(.*?)$
^(.*?)(what is called|so-called|it doesn't matter|just so-called)(.*?)$
^(.*?)(be specify|be refer to|but refer to|do not refer to|be index|be instruct|be make a comment or criticism)(.*?)$
^(\[[0-9]+\])(.*?)$
^(.*?)(this|should|also)be(a.*?)$
^(.*?)(this is)(.*?)(one)(.*?)$
^(.*?)(prove|indicate|think)(.*?)(being one)(.*?)$
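The two-stage rule check (affirmative templates first, then negative templates as a veto) can be sketched as follows. The English cue phrases in `AFFIRMATIVE` and `NEGATIVE` are hypothetical stand-ins chosen for illustration and are not the patent's template set; only the reference-marker pattern `^(\[[0-9]+\])(.*?)$` is taken directly from Table 2. The function name `is_definition_by_rules` is likewise an assumption.

```python
import re

# Hypothetical English stand-ins for a few affirmative templates (Table 1 style).
AFFIRMATIVE = [
    re.compile(r"^(.*?)\b(is defined as)\b(.*?)$"),
    re.compile(r"^(.*?)\b(refers? to)\b(.*?)$"),
    re.compile(r"^(.*?)\b(is a kind of|is a class of)\b(.*?)$"),
]

# Negative templates (Table 2 style). The first one is taken from Table 2:
# a sentence that starts with a reference marker such as "[3]".
NEGATIVE = [
    re.compile(r"^(\[[0-9]+\])(.*?)$"),
    re.compile(r"^(.*?)\b(does not refer to)\b(.*?)$"),
    re.compile(r"^(.*?)\b(this|it) (is a kind of)\b(.*?)$", re.IGNORECASE),
]

def is_definition_by_rules(sentence: str) -> bool:
    """Return True if an affirmative template matches and no negative one does."""
    if not any(p.match(sentence) for p in AFFIRMATIVE):
        return False  # no affirmative match: not a definition sentence
    # An affirmative match is kept only if every negative template fails.
    return not any(p.match(sentence) for p in NEGATIVE)
```

Running the negative templates only after an affirmative match keeps the common case (no cue phrase at all) cheap, and lets the negative list stay focused on false positives of the affirmative list.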
In each of the following templates, the content matched by the first (.*?) group is the term word (Table 3, term word template one).
Table 3
^(.*?)(concept|definition)(.*?)(be|for|:|:|$$)(.*?)$
^(.*?)(generally main refer to)(.*?)$
^(.*?)(typically referring to)(.*?)$
^(.*?)(concept|definition)(.*?)(be|for|:|:|$$)(.*?)$
^(.*?)(referring to)(.*?)$
^(.*?)(concept|definition)(.*?)(be|for|:|:|$$)(.*?)$
^(.*?)(be defined as|it is defined as)(.*?)$
In each of the following templates, the content matched by the second (.*?) group is the term word (Table 4, term word template two).
Table 4
^(.*?)(what is called)(.*?)(mainly referring to)(.*?)$
^(.*?)(what is called)(.*?)(be exactly|refer to|i.e.)(.*?)$
^(.*?)(what is called)(.*?)($$)(.*?)$
^(.*?)(namely so-called)(.*?)($$)(.*?)$
^(.*?)(being called)(.*?)($$)$
^(.*?)(be called|be called)(.*?)($$)$
^(.*?)(being defined as)(.*?)($$)$
^(.*?)(being referred to as)(.*?)($$)$
^(.*?)(also known as|cry again|also claim|also cry|also cry|also known as|therefore claim)(.*?)($$)$
^(.*?)(general designation is|be referred to as|be commonly referred to as|common name is)(.*?)($$)$
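Term extraction by template position, followed by trimming with front-adjacent and rear-adjacent word lists, might look like the sketch below. The two templates, the word lists `FRONT_NEIGHBOURS` and `BACK_NEIGHBOURS`, and the function names are all hypothetical English stand-ins; in the patent the lists are obtained by counting high-frequency words, which is not reproduced here.

```python
import re
from typing import Optional

# (pattern, index of the capture group that holds the candidate term).
# Hypothetical English stand-ins, not the patent's templates.
TERM_TEMPLATES = [
    # Table 4 style: the term is the second (.*?) group, i.e. group 3.
    (re.compile(r"^(.*?)(the so-called)\s*(.*?)\s*(refers to)\s*(.*?)$"), 3),
    # Table 3 style: the term is the first (.*?) group, i.e. group 1.
    (re.compile(r"^(.*?)\s*(is defined as)\s*(.*?)$"), 1),
]

# Assumed high-frequency neighbour words that are not part of a term.
FRONT_NEIGHBOURS = {"here", "the", "a", "an"}
BACK_NEIGHBOURS = {"usually", "generally", "often"}

def trim_candidate(candidate: str) -> str:
    """Drop leading and trailing words that appear in the neighbour lists."""
    words = candidate.strip(" ,").split()
    while words and words[0].lower() in FRONT_NEIGHBOURS:
        words.pop(0)
    while words and words[-1].lower() in BACK_NEIGHBOURS:
        words.pop()
    return " ".join(words)

def extract_term(sentence: str) -> Optional[str]:
    """Return the trimmed term from the first matching template, else None."""
    for pattern, group in TERM_TEMPLATES:
        match = pattern.match(sentence)
        if match:
            term = trim_candidate(match.group(group))
            if term:
                return term
    return None
```

Storing the group index next to each pattern mirrors the split between Table 3 (first group) and Table 4 (second group), so new templates can be added without changing the extraction logic.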
The statistical method extracts N-gram sentence features and syntactic features from each single sentence, computes the probabilities of the N-gram sentence features and of the syntactic features, and defines a discriminant function from those probabilities. It then compares the score weightYes_total for the sentence being a definition sentence with the score weightNo_total for it being a non-definition sentence. If weightYes_total>weightNo_total, the sentence is judged to be a definition sentence.
The N-gram sentence features comprise unary features and binary features.
The unary features comprise the word segmentation results for common words, the word segmentation results for specialised words, the part of speech before the copula, the copula itself, the part of speech after the copula, and the position of the copula relative to the start of the sentence.
A binary feature is the combination of a unary feature with the copula feature.
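The copula-related unary features and their pairing into binary features can be sketched as below. This is an assumption-laden illustration: the token format (a list of `(word, part_of_speech)` pairs), the copula list `COPULAS`, the feature names and the function `ngram_features` are all hypothetical, and the word segmentation features for common and specialised words are omitted because the text does not specify how they are encoded.

```python
from typing import List, Tuple

# Hypothetical copula list; the patent does not enumerate its copula words.
COPULAS = {"is", "are", "means"}

Token = Tuple[str, str]  # (word, part-of-speech tag)

def ngram_features(tokens: List[Token]):
    """Return (unary, binary) copula-related features for one sentence."""
    idx = next((i for i, (w, _) in enumerate(tokens) if w.lower() in COPULAS), None)
    if idx is None:
        return [], []
    copula = tokens[idx][0].lower()
    unary = [("copula", copula), ("copula_position", idx)]
    if idx > 0:
        unary.append(("pos_before_copula", tokens[idx - 1][1]))
    if idx + 1 < len(tokens):
        unary.append(("pos_after_copula", tokens[idx + 1][1]))
    # A binary feature pairs each other unary feature with the copula feature.
    binary = [(feat, ("copula", copula)) for feat in unary if feat[0] != "copula"]
    return unary, binary
```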
The syntactic features comprise unary syntactic features and binary syntactic features.
The unary syntactic features comprise: the type of the phrase before the first verb in the sentence, the first verb, the type of the phrase after the first verb, the phrase before the last verb, the last verb, and the phrase after the last verb.
The binary syntactic features comprise: the combination of the first verb with the type of the phrase before it, the combination of the first verb with the type of the phrase after it, the combination of the last verb with the phrase before it, and the combination of the last verb with the phrase after it.
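The six unary and four binary syntactic features listed above can be read off a chunked sentence as in the following sketch. The `Chunk` representation (phrase text plus a label, with `"V"` marking a verb) and the feature names are assumptions for illustration; the patent obtains this information from a syntactic analysis tool whose output format is not given.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

class Chunk(NamedTuple):
    text: str
    label: str  # phrase type such as "NP" or "PP"; "V" marks a verb (assumed)

def syntactic_features(chunks: List[Chunk]) -> Tuple[Dict, Dict]:
    """Return (unary, binary) syntactic features around the first and last verb."""
    verbs = [i for i, c in enumerate(chunks) if c.label == "V"]
    if not verbs:
        return {}, {}
    first, last = verbs[0], verbs[-1]

    def at(i: int) -> Optional[Chunk]:
        return chunks[i] if 0 <= i < len(chunks) else None

    before_first, after_first = at(first - 1), at(first + 1)
    before_last, after_last = at(last - 1), at(last + 1)
    unary = {
        "type_before_first_verb": before_first.label if before_first else None,
        "first_verb": chunks[first].text,
        "type_after_first_verb": after_first.label if after_first else None,
        "phrase_before_last_verb": before_last.text if before_last else None,
        "last_verb": chunks[last].text,
        "phrase_after_last_verb": after_last.text if after_last else None,
    }
    binary = {
        "first_verb+type_before": (unary["first_verb"], unary["type_before_first_verb"]),
        "first_verb+type_after": (unary["first_verb"], unary["type_after_first_verb"]),
        "last_verb+phrase_before": (unary["last_verb"], unary["phrase_before_last_verb"]),
        "last_verb+phrase_after": (unary["last_verb"], unary["phrase_after_last_verb"]),
    }
    return unary, binary
```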
The definition discriminant function gathers statistics on the N-gram features and the syntactic features as two separate classes. This yields the probability weightYes1 that the N-gram features indicate a definition and the probability weightNo1 that they do not, together with the probability weightYes_sen that the syntactic features indicate a definition and the probability weightNo_sen that they do not. Let F1 and F2 be the weights given to the probability values of the two feature classes, subject to F1+F2=1. The combined probabilities weightYes_total and weightNo_total are computed as:
weightYes_total=F1*weightYes1+F2*weightYes_sen
weightNo_total=F1*weightNo1+F2*weightNo_sen
Wherein, F1=0.5, F2=0.5.
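The weighted combination and the final comparison can be written directly from the two formulas. In this minimal sketch the function name `is_definition` and its parameter names are illustrative; the four input values are assumed to have been computed already from the training statistics.

```python
def is_definition(weight_yes1: float, weight_no1: float,
                  weight_yes_sen: float, weight_no_sen: float,
                  f1: float = 0.5, f2: float = 0.5) -> bool:
    """Combine N-gram and syntactic scores and compare the two totals."""
    if abs(f1 + f2 - 1.0) > 1e-9:
        raise ValueError("F1 + F2 must equal 1")
    weight_yes_total = f1 * weight_yes1 + f2 * weight_yes_sen
    weight_no_total = f1 * weight_no1 + f2 * weight_no_sen
    # The sentence is judged a definition only if the "yes" total is larger.
    return weight_yes_total > weight_no_total
```

For example, with weightYes1=0.2, weightNo1=0.8, weightYes_sen=0.6 and weightNo_sen=0.4, the totals are 0.4 and 0.6, so the sentence is not judged to be a definition.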
Determining the N-gram feature weights in the definition discriminant function: the ratio of definition sentences to non-definition sentences in the corpus is not 1:1 but about 1:10. Therefore, after the unary and binary features are added, the probability that each feature in the training result indicates a definition and the probability that it does not must be adjusted. In practice, each feature's non-definition probability is reduced by dividing it by a constant: C1 is the reduction factor for the non-definition probability of unary features, and C2 is the reduction factor for the non-definition probability of binary features. The parameters finally chosen are C1=10 and C2=2.
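The class-imbalance adjustment amounts to a single division per feature. The sketch below shows only that scaling step; the function name `adjust_no_probability` and the `arity` argument are hypothetical, and how the adjusted per-feature values are aggregated into weightNo1 is not specified in the text and is not shown.

```python
C1 = 10.0  # reduction factor for unary features
C2 = 2.0   # reduction factor for binary features

def adjust_no_probability(p_no: float, arity: int) -> float:
    """Scale down a feature's non-definition probability by C1 or C2."""
    if arity == 1:
        return p_no / C1
    if arity == 2:
        return p_no / C2
    raise ValueError("arity must be 1 (unary) or 2 (binary)")
```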
This embodiment also provides a method for automatically extracting academic definitions based on syntactic features. The method comprises:
judging, by a rule method and a statistical method, whether each single sentence is a definition sentence;
preprocessing the single sentences judged to be definitions, extracting the word string at a specific position as the term word according to the term word extraction templates, and correcting it using the adjacent preceding and following word strings to obtain the definition term;
outputting the definition term.
The rule method applies affirmative template matching and/or negative template matching to each single sentence.
If affirmative template matching succeeds, negative rule template matching is then carried out.
If negative rule template matching fails, the sentence is considered a definition sentence and is output (as shown in Fig. 2).
N-gram sentence feature extraction and syntactic feature extraction are performed on the preprocessed single sentences, the N-gram sentence feature probability and the syntactic feature probability are computed, and the discriminant function is defined from these probabilities. If the function judges the sentence to be a definition, the definition sentence is output (as shown in Fig. 3).
Although the embodiments of the invention are disclosed above, they are described only to make the invention easier to understand and are not intended to limit it. Any person skilled in the art to which the invention belongs may make modifications and changes to the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains that defined by the appended claims.
Claims (10)
1. A system for automatically extracting academic definitions based on syntactic features, characterised in that the system comprises a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module, wherein
the preprocessing module extracts the abstract and the full text from an input academic document and splits the extracted abstract and full text into single sentences;
the definition sentence extraction module uses a rule method and a statistical method to judge whether each single sentence is a definition sentence;
the definition term extraction module preprocesses the single sentences judged to be definitions, extracts a word string as the term word according to the term word extraction templates, and corrects it using the adjacent preceding and following word strings to obtain the definition term;
the output module outputs the definition term.
2. The system for automatically extracting academic definitions based on syntactic features as claimed in claim 1, characterised in that the rule method applies affirmative template matching and/or negative template matching to each single sentence.
3. The system for automatically extracting academic definitions based on syntactic features as claimed in claim 1, characterised in that the statistical method extracts N-gram sentence features and syntactic features from each single sentence, computes the probabilities of the N-gram sentence features and of the syntactic features, and defines a discriminant function from those probabilities.
4. The system for automatically extracting academic definitions based on syntactic features as claimed in claim 3, characterised in that the N-gram sentence features comprise unary features and binary features;
the unary features comprise the word segmentation results for common words, the word segmentation results for specialised words, the part of speech before the copula, the copula itself, the part of speech after the copula, and the position of the copula relative to the start of the sentence;
a binary feature is the combination of a unary feature with the copula feature.
5. The system for automatically extracting academic definitions based on syntactic features as claimed in claim 3, characterised in that the syntactic features comprise unary syntactic features and binary syntactic features;
the unary syntactic features comprise: the type of the phrase before the first verb in the sentence, the first verb, the type of the phrase after the first verb, the phrase before the last verb, the last verb, and the phrase after the last verb;
the binary syntactic features comprise: the combination of the first verb with the type of the phrase before it, the combination of the first verb with the type of the phrase after it, the combination of the last verb with the phrase before it, and the combination of the last verb with the phrase after it.
6. The system for automatically extracting academic definitions based on syntactic features as claimed in claim 3, characterised in that the definition discriminant function gathers statistics on the N-gram features and the syntactic features as two separate classes, yielding the probability that the sentence features indicate a definition and the probability that they do not, together with the probability that the syntactic features indicate a definition and the probability that they do not; and
the N-gram feature weights in the definition discriminant function are determined.
7. A method for automatically extracting academic definitions based on syntactic features, characterised in that the method comprises:
extracting the abstract and the full text from an input academic document, and splitting the extracted abstract and full text into single sentences;
judging, by a rule method and a statistical method, whether each single sentence is a definition sentence;
preprocessing the single sentences judged to be definitions, extracting a word string as the term word according to the term word extraction templates, and correcting it using the adjacent preceding and following word strings to obtain the definition term;
outputting the definition term.
8. The method for automatically extracting academic definitions based on syntactic features as claimed in claim 7, characterised in that the rule method applies affirmative template matching and/or negative template matching to each single sentence;
if affirmative template matching fails, the sentence is not considered a definition sentence;
if affirmative template matching succeeds, negative rule template matching is then carried out;
if negative rule template matching fails, the sentence is considered a definition sentence and is output.
9. The method for automatically extracting academic definitions based on syntactic features as claimed in claim 7, characterised in that N-gram sentence feature extraction and syntactic feature extraction are performed on the preprocessed single sentences, the N-gram sentence feature probability and the syntactic feature probability are computed, and the discriminant function is defined from these probabilities; if the function judges the sentence to be a definition, the definition sentence is output; otherwise, it is not output.
10. The method for automatically extracting academic definitions based on syntactic features as claimed in claim 9, characterised in that
the N-gram sentence features comprise unary features and binary features;
the unary features comprise the word segmentation results for common words, the word segmentation results for specialised words, the part of speech before the copula, the copula itself, the part of speech after the copula, and the position of the copula relative to the start of the sentence;
a binary feature is the combination of a unary feature with the copula feature;
the syntactic features comprise unary syntactic features and binary syntactic features;
the unary syntactic features comprise: the type of the phrase before the first verb in the sentence, the first verb, the type of the phrase after the first verb, the phrase before the last verb, the last verb, and the phrase after the last verb;
the binary syntactic features comprise: the combination of the first verb with the type of the phrase before it, the combination of the first verb with the type of the phrase after it, the combination of the last verb with the phrase before it, and the combination of the last verb with the phrase after it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510059166.5A CN104572628B (en) | 2015-02-05 | 2015-02-05 | A kind of science based on syntactic feature defines automatic extraction system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104572628A true CN104572628A (en) | 2015-04-29 |
CN104572628B CN104572628B (en) | 2017-08-08 |
Family
ID=53088732
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510059166.5A Active CN104572628B (en) | 2015-02-05 | 2015-02-05 | A kind of science based on syntactic feature defines automatic extraction system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104572628B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101233484A (en) * | 2005-08-01 | 2008-07-30 | 微软公司 | Definition extraction |
CN101710343A (en) * | 2009-12-11 | 2010-05-19 | 北京中机科海科技发展有限公司 | Body automatic build system and method based on text mining |
Non-Patent Citations (2)
Title |
---|
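Zhang Rong: "Research on Term Definition Extraction, Clustering and Term Recognition", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Philosophy and Humanities Series * |
Qian Fei et al.: "A Definition Extraction Algorithm Combining Soft and Hard Templates", Computer Technology and Development * |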
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106960041A (en) * | 2017-03-28 | 2017-07-18 | 山西同方知网数字出版技术有限公司 | A kind of structure of knowledge method based on non-equilibrium data |
CN108573025A (en) * | 2018-03-12 | 2018-09-25 | 北京云知声信息技术有限公司 | The method and device of sentence characteristic of division is extracted based on hybrid template |
CN108573025B (en) * | 2018-03-12 | 2021-07-02 | 云知声智能科技股份有限公司 | Method and device for extracting sentence classification characteristics based on mixed template |
CN108647194A (en) * | 2018-04-28 | 2018-10-12 | 北京神州泰岳软件股份有限公司 | information extraction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN104572628B (en) | 2017-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103077164B (en) | Text analyzing method and text analyzer | |
CN104636466B (en) | Entity attribute extraction method and system for open webpage | |
CN100595760C (en) | Method for gaining oral vocabulary entry, device and input method system thereof | |
CN110019658B (en) | Method and related device for generating search term | |
CN103678684B (en) | A kind of Chinese word cutting method based on navigation information retrieval | |
CN108363725B (en) | Method for extracting user comment opinions and generating opinion labels | |
CN106844331A (en) | Sentence similarity calculation method and system | |
CN105808526A (en) | Commodity short text core word extracting method and device | |
CN109284352A (en) | A kind of querying method of the assessment class document random length words and phrases based on inverted index | |
CN105893444A (en) | Sentiment classification method and apparatus | |
JP5403696B2 (en) | Language model generation apparatus, method and program thereof | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN110175585B (en) | Automatic correcting system and method for simple answer questions | |
CN104281565B (en) | Semantic dictionary construction method and device | |
CN108763348A (en) | A kind of classification improved method of extension short text word feature vector | |
CN106126502A (en) | A kind of emotional semantic classification system and method based on support vector machine | |
CN101556596A (en) | Input method system and intelligent word making method | |
CN104573030A (en) | Textual emotion prediction method and device | |
CN108959630A (en) | A kind of character attribute abstracting method towards English without structure text | |
CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
CN104572628A (en) | System and method for automatically extracting academic definition based on syntax characteristics | |
CN111444713A (en) | Method and device for extracting entity relationship in news event | |
CN112528640A (en) | Automatic domain term extraction method based on abnormal subgraph detection | |
CN107015966A (en) | Text audio automaticabstracting based on improved PageRank algorithms | |
CN111027308A (en) | Text generation method, system, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | Inventor after: Zhao Jiyuan, Luo Xiao, Du Yufeng, Zheng Ping. Inventor before: Zhao Jiyuan, Luo Xiao, Du Yufeng | |
COR | Change of bibliographic data | ||
GR01 | Patent grant | ||
GR01 | Patent grant |