CN104572628A - System and method for automatically extracting academic definition based on syntax characteristics - Google Patents

Info

Publication number
CN104572628A
Authority
CN
China
Prior art keywords
definition
sentence
word
feature
verb
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510059166.5A
Other languages
Chinese (zh)
Other versions
CN104572628B (en)
Inventor
赵纪元
罗霄
杜玉锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANXI TONGFANG ZHIWANG DIGITAL PUBLISHING TECHNOLOGY Co Ltd
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
" Academic Magazine (cd-Rom) " Co Ltd Of E-Magazine Society
Original Assignee
SHANXI TONGFANG ZHIWANG DIGITAL PUBLISHING TECHNOLOGY Co Ltd
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
" Academic Magazine (cd-Rom) " Co Ltd Of E-Magazine Society

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a system and a method for automatically extracting academic definitions based on syntactic features. The system comprises a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module. The preprocessing module extracts the abstract and the full text from an input academic document and splits them into single sentences; the definition sentence extraction module judges whether each single sentence is a definition sentence by using rules and a statistical method; the definition term extraction module preprocesses the single sentences judged to be definitions, extracts a word string as the term word according to term word extraction templates, and corrects it using the adjacent preceding and following word strings to obtain the definition term; and the output module outputs the definition term. The system and method extract the sentences that express definitions and the corresponding term words from a document and present them to users, so that users can understand the retrieved content quickly and accurately.

Description

System and method for automatically extracting academic definitions based on syntactic features
Technical field
The invention belongs to the field of information technology, and in particular relates to a system and method for automatically extracting academic definitions based on syntactic features.
Background
When searching academic documents, users want to find the content they are looking for quickly and accurately and to understand it right away. However, academic documents contain large numbers of technical terms and newly coined terms, so users have to study the retrieved documents, locate the sentences in which a keyword appears, and read them closely to understand it. This process is very inefficient.
Research on the automatic extraction of definition sentences has become active only in recent years. Most existing work is rule-based: templates are built from a handful of common definition patterns, and sentences are matched against them. Because the templates cover only a limited set of patterns, recall is very low. Other work is statistics-based: statistical models and algorithms are used to find sentences that fit the statistical regularities of definitions. Because these methods do not analyze sentences at the syntactic level, their accuracy is low.
Summary of the invention
To solve the above technical problems, the object of the invention is to provide a system and method for automatically extracting academic definitions based on syntactic features.
The object of the invention is achieved by the following technical solution:
A system for automatically extracting academic definitions based on syntactic features, the system comprising:
a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module, wherein
the preprocessing module extracts the abstract and the full text from the input academic document and splits the extracted abstract and full text into single sentences;
the definition sentence extraction module judges whether each single sentence is a definition sentence by using rules and a statistical method;
the definition term extraction module preprocesses the single sentences judged to be definitions, extracts a word string as the term word according to term word extraction templates, and corrects it using the adjacent preceding and following word strings to obtain the definition term;
the output module outputs the definition term.
A method for automatically extracting academic definitions based on syntactic features, the method comprising:
extracting the abstract and the full text from the input academic document, and splitting the extracted abstract and full text into single sentences;
judging whether each single sentence is a definition sentence by using rules and a statistical method;
preprocessing the single sentences judged to be definitions, extracting a word string as the term word according to term word extraction templates, and correcting it using the adjacent preceding and following word strings to obtain the definition term;
outputting the definition term.
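The first step of the method, splitting the extracted text into single sentences, can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_sentences` and the set of sentence-final punctuation marks are assumptions, since the source does not say how sentence boundaries are detected.

```python
import re

# Assumed sentence-final marks (Chinese full stop plus exclamation and
# question marks); the patent does not specify the delimiter set.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text: str) -> list[str]:
    """Split abstract or full text into single sentences, keeping the end mark."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    for sentence in split_sentences("堆是一种树形结构。它称为优先队列！"):
        print(sentence)
```

Each returned string is one candidate sentence for the later rule-based and statistical checks.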
Compared with the prior art, one or more embodiments of the invention may have the following advantages:
The invention extracts the sentences that express definitions, together with the corresponding term words, from a document and presents them to users, so that users can understand the retrieved content quickly and accurately. Building on rule templates, the present specification proposes a method for automatically extracting academic definitions based on syntactic features. The method combines the advantages of rule-based and statistical methods and analyzes the sentences of academic documents at the level of syntactic structure.
Brief description of the drawings
Fig. 1 is a structural diagram of the system for automatically extracting academic definitions based on syntactic features;
Fig. 2 is a flowchart of the rule-based definition sentence extraction method;
Fig. 3 is a flowchart of definition sentence extraction based on the statistical method;
Fig. 4 is a flowchart of definition term extraction.
Detailed description
To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawings.
Fig. 1 shows the structure of the system for automatically extracting academic definitions based on syntactic features. The system comprises a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module, wherein:
The preprocessing module extracts the abstract and the full text from the input academic document and splits the extracted abstract and full text into single sentences; it then uses a word segmentation tool and a syntactic analysis tool to perform word segmentation, part-of-speech tagging, syntactic parsing and similar work.
The definition sentence extraction module judges whether each single sentence is a definition sentence by using rules and a statistical method;
The definition term extraction module preprocesses the single sentences judged to be definitions by marking the character strings that act as separators, such as "so-called", "is called", "is defined as" and "is referred to as". Next, according to the term word extraction templates, it extracts the word string at a specific position as a candidate definition term. Finally, using a list of preceding adjacent words and a list of following adjacent words obtained by counting high-frequency words, it removes the parts of the candidate definition term that do not belong to the term word, yielding the definition term (as shown in Fig. 4). The preprocessing mentioned above includes extracting the abstract and the full text from the input academic document and splitting the resulting corpus into sentences. The word string at the specific position is the string matched by the first (.*?) in the Table 3 templates or by the second (.*?) in the Table 4 templates, which is extracted as the term word;
The output module outputs the definition term.
The rule method mentioned above performs affirmative template matching and/or negative template matching on each single sentence. Table 1 lists the affirmative definition sentence templates; Table 2 lists the negative definition sentence templates.
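The correction step of the definition term extraction module can be read as trimming a candidate term at both ends. The sketch below is one possible interpretation only: the function name `trim_candidate` and the example word lists are hypothetical, and the source does not state how the preceding and following adjacent word lists are applied beyond removing the parts that do not belong to the term.

```python
def trim_candidate(tokens, front_words, rear_words):
    """Drop leading tokens found in the preceding-adjacent word list and
    trailing tokens found in the following-adjacent word list."""
    start, end = 0, len(tokens)
    while start < end and tokens[start] in front_words:
        start += 1
    while end > start and tokens[end - 1] in rear_words:
        end -= 1
    return tokens[start:end]


if __name__ == "__main__":
    # Hypothetical high-frequency adjacent words, for illustration only.
    front = {"the", "so-called"}
    rear = {"here", "generally"}
    print(trim_candidate(["the", "so-called", "binary", "heap", "here"], front, rear))
```

Working on segmented tokens rather than raw strings keeps the trimming aligned with the word segmentation produced by the preprocessing module.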
Table 1
1  ^ (.*?) so-called (.*?) $
2  ^ (.*?) (be called | be called | be called | be called) (.*?) $
3  ^ (.*?) (being defined as) (.*?) $
4  ^ (.*?) (referring to) (.*?) $
5  ^ (.*?) (referring to) (.*?) $
6  ^ (.*?) (being referred to as) (.*?) $
7  ^ (.*?) (by |) (.*?) (being defined as) (.*?) $
8  ^ (.*?) (also known as | cry again | also claim | also cry | also cry | also known as | be commonly called as | therefore claim | be referred to as | common name) (.*?) $
9  ^ (.*?) (being called) (.*?) $
11  ^ (.*?) (general designation is | be referred to as | being commonly referred to as | common name is | called after) (.*?) $
13  ^ (.*?) (concept | definition) (.*?) (: |: | be | for) (.*?) $
14  ^ (.*?) (refer generally to | refer to) (.*?) $
15  ^ (.*?) (be a kind of | be one | be a class) (.*?) $
12  ^ (.*?) (is) (.*?) (one | one | be a class) (.*?) $
Table 2
^ (.*?) (name is called | claim for | be known as | be called as) (.*?) $
^ (.*?) (deserving to be called) (.*?) $
^ (.*?) (being called as respectively) (.*?) $
^ (.*?) (what is called | so-called | it doesn't matter | just so-called) (.*?) $
^ (.*?) (be specify | be refer to | but refer to | do not refer to | be index | be instruct | be make a comment or criticism) (.*?) $
^(\[[0-9]+\])(.*?)$
^ (.*?) (this | should | also) be (a .*?) $
^ (.*?) (this is) (.*?) (one) (.*?) $
^ (.*?) (prove | indicate | think) (.*?) (being one) (.*?) $
In each template of Table 3 (term word template one), the content matched at the first (.*?) position is the term word.
Table 3
^ (.*?) (concept | definition) (.*?) (be | for |: |: | $ $) (.*?) $
^ (.*?) (generally main refer to) (.*?) $
^ (.*?) (typically referring to) (.*?) $
^ (.*?) (concept | definition) (.*?) (be | for |: |: | $ $) (.*?) $
^ (.*?) (referring to) (.*?) $
^ (.*?) (concept | definition) (.*?) (be | for |: |: | $ $) (.*?) $
^ (.*?) (be defined as | it is defined as) (.*?) $
In each template of Table 4 (term word template two), the content matched at the second (.*?) position is the term word.
Table 4
^ (.*?) (what is called) (.*?) (mainly referring to) (.*?) $
^ (.*?) (what is called) (.*?) (be exactly | refer to | i.e.) (.*?) $
^ (.*?) (what is called) (.*?) ($ $) (.*?) $
^ (.*?) (namely so-called) (.*?) ($ $) (.*?) $
^ (.*?) (being called) (.*?) ($ $) $
^ (.*?) (be called | be called) (.*?) ($ $) $
^ (.*?) (being defined as) (.*?) ($ $) $
^ (.*?) (being referred to as) (.*?) ($ $) $
^ (.*?) (also known as | cry again | also claim | also cry | also cry | also known as | therefore claim) (.*?) ($ $) $
^ (.*?) (general designation is | be referred to as | be commonly referred to as | common name is) (.*?) ($ $) $
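The position-based extraction described for Tables 3 and 4 can be sketched with capture groups. The patterns below are English stand-ins written for illustration; the patent's templates target Chinese text, and the keyword strings, the function name `extract_candidate_term` and the order in which the two template families are tried are all assumptions.

```python
import re

# (pattern, index of the capture group holding the term word).
# In a "template two" pattern the term is the second (.*?), which is
# capture group 3 because the keyword itself is group 2.
TEMPLATES = [
    (re.compile(r"^(.*?)(so-called)(.*?)(refers to|means)(.*?)$"), 3),
    # "Template one" style: the term is the first (.*?), capture group 1.
    (re.compile(r"^(.*?)(is defined as)(.*?)$"), 1),
]


def extract_candidate_term(sentence):
    """Return the candidate term from the first matching template, else None."""
    for pattern, group in TEMPLATES:
        match = pattern.match(sentence)
        if match:
            return match.group(group).strip()
    return None


if __name__ == "__main__":
    print(extract_candidate_term("A heap is defined as a complete binary tree."))
    print(extract_candidate_term("The so-called heap refers to a tree-shaped structure."))
```

The result is only a candidate: the adjacent-word correction described for the definition term extraction module would still be applied to it.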
The statistical method mentioned above performs N-gram sentence feature extraction and syntactic feature extraction on each single sentence, calculates the probabilities of the N-gram sentence features and of the syntactic features, and defines a discriminant function from these probabilities. The score for being a definition sentence, weightYes_total, is then compared with the score for being a non-definition sentence, weightNo_total; if weightYes_total>weightNo_total, the sentence is considered a definition sentence.
The N-gram sentence features comprise unary features and binary features;
The unary features comprise the word segmentation results for common words, the word segmentation results for domain-specific words, the copula, the part of speech before the copula, the part of speech after the copula, and the distance of the copula from the start of the sentence;
Each binary feature is the combination of one of the unary features with the copula feature.
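A minimal sketch of these unary and binary features follows, assuming the sentence is already segmented and part-of-speech tagged. The function name `ngram_features`, the feature labels, the `BOS`/`EOS` markers and the use of a domain term set to separate common words from domain-specific words are hypothetical choices; the source only lists which features exist.

```python
def ngram_features(tokens, copulas, domain_terms):
    """tokens: list of (word, pos) pairs. Returns (unary, binary) feature lists
    built around the first copula found, or two empty lists if there is none."""
    idx = next((i for i, (word, _) in enumerate(tokens) if word in copulas), None)
    if idx is None:
        return [], []
    copula = ("copula", tokens[idx][0])
    unary = []
    for i, (word, _) in enumerate(tokens):
        if i != idx:
            # Assumed split: words in the domain lexicon vs. common words.
            unary.append(("term" if word in domain_terms else "word", word))
    unary.append(copula)
    unary.append(("pos_before", tokens[idx - 1][1] if idx > 0 else "BOS"))
    unary.append(("pos_after", tokens[idx + 1][1] if idx + 1 < len(tokens) else "EOS"))
    unary.append(("copula_pos", idx))  # distance from the start of the sentence
    # Binary feature: a unary feature paired with the copula feature.
    binary = [(feature, copula) for feature in unary if feature != copula]
    return unary, binary


if __name__ == "__main__":
    tagged = [("heap", "n"), ("is", "v"), ("a", "det"), ("tree", "n")]
    unary, binary = ngram_features(tagged, copulas={"is"}, domain_terms={"heap"})
    print(unary)
    print(binary)
```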
The syntactic features comprise unary syntactic features and binary syntactic features;
The unary syntactic features comprise: the first verb in the sentence, the phrase type before the first verb, the phrase type after the first verb, the last verb, the phrase before the last verb, and the phrase after the last verb;
The binary syntactic features comprise: the combination of the first verb in the sentence with the phrase type before it, the combination of the first verb with the phrase type after it, the combination of the last verb with the phrase before it, and the combination of the last verb with the phrase after it.
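The syntactic features can be illustrated on a chunked sentence. This sketch assumes the syntactic analysis tool yields a flat list of (phrase type, text) chunks in which verbs carry the tag "V"; that representation, the function name `syntactic_features` and the dictionary keys are assumptions made for the example.

```python
def syntactic_features(chunks):
    """chunks: list of (phrase_type, text). Returns (unary dict, binary list),
    or ({}, []) if the sentence contains no verb chunk."""
    verb_positions = [i for i, (tag, _) in enumerate(chunks) if tag == "V"]
    if not verb_positions:
        return {}, []
    first, last = verb_positions[0], verb_positions[-1]

    def tag_at(i):
        return chunks[i][0] if 0 <= i < len(chunks) else None

    def text_at(i):
        return chunks[i][1] if 0 <= i < len(chunks) else None

    unary = {
        "first_verb": chunks[first][1],
        "type_before_first": tag_at(first - 1),
        "type_after_first": tag_at(first + 1),
        "last_verb": chunks[last][1],
        "phrase_before_last": text_at(last - 1),
        "phrase_after_last": text_at(last + 1),
    }
    # Each binary feature pairs a verb with one of its neighbouring phrases.
    binary = [
        (unary["first_verb"], unary["type_before_first"]),
        (unary["first_verb"], unary["type_after_first"]),
        (unary["last_verb"], unary["phrase_before_last"]),
        (unary["last_verb"], unary["phrase_after_last"]),
    ]
    return unary, binary


if __name__ == "__main__":
    parsed = [("NP", "a heap"), ("V", "is"), ("NP", "a tree"),
              ("V", "stored"), ("PP", "in an array")]
    print(syntactic_features(parsed))
```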
The definition discriminant function gathers statistics for the N-gram features and the syntactic features as two separate classes. This gives the probability weightYes1 that the N-gram features indicate a definition and the probability weightNo1 that they do not, and the probability weightYes_sen that the syntactic features indicate a definition and the probability weightNo_sen that they do not. Let F1 and F2 be the weights given to the probability values of the two feature classes, with F1+F2=1. The combined probabilities weightYes_total and weightNo_total are calculated as:
weightYes_total=F1*weightYes1+F2*weightYes_sen
weightNo_total=F1*weightNo1+F2*weightNo_sen
Wherein, F1=0.5, F2=0.5.
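The combination and the final comparison can be written directly from the two formulas. The function name `combine_scores` is hypothetical; the variable names mirror the symbols in the text, and the four input probabilities are assumed to have been computed beforehand.

```python
def combine_scores(weight_yes1, weight_no1, weight_yes_sen, weight_no_sen,
                   f1=0.5, f2=0.5):
    """Return (is_definition, weightYes_total, weightNo_total)."""
    if abs(f1 + f2 - 1.0) > 1e-9:
        raise ValueError("F1 + F2 must equal 1")
    weight_yes_total = f1 * weight_yes1 + f2 * weight_yes_sen
    weight_no_total = f1 * weight_no1 + f2 * weight_no_sen
    return weight_yes_total > weight_no_total, weight_yes_total, weight_no_total


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    print(combine_scores(0.6, 0.4, 0.7, 0.3))
```

With F1=F2=0.5 the two feature classes contribute equally, so a sentence is accepted when the average of its two "definition" probabilities exceeds the average of its two "non-definition" probabilities.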
Determining the N-gram feature weights in the definition discriminant function: the ratio of definition sentences to non-definition sentences in the corpus is not 1:1 but about 1:10. Therefore, once the unary and binary features are added, the probability that each feature in the training result indicates a definition and the probability that it does not must be adjusted. Specifically, each feature's probability of not indicating a definition is reduced by dividing it by a constant: C1 is the reduction factor for the non-definition probability of unary features, and C2 is the reduction factor for that of binary features. The parameters finally chosen are C1=10 and C2=2.
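The adjustment for class imbalance amounts to dividing each feature's non-definition probability by C1 or C2. In the sketch below, the function name `adjust_probabilities`, the dictionary layout and the example feature names are hypothetical; how the adjusted per-feature values are then aggregated into weightYes1 and weightNo1 is not stated in the source and is left out.

```python
C1 = 10  # reduction factor for unary features (value given in the text)
C2 = 2   # reduction factor for binary features (value given in the text)


def adjust_probabilities(feature_probs, arity):
    """feature_probs maps feature -> (p_definition, p_non_definition).
    arity is 1 for unary features and 2 for binary features."""
    if arity not in (1, 2):
        raise ValueError("arity must be 1 (unary) or 2 (binary)")
    divisor = C1 if arity == 1 else C2
    return {
        feature: (p_yes, p_no / divisor)
        for feature, (p_yes, p_no) in feature_probs.items()
    }


if __name__ == "__main__":
    # Made-up training probabilities, for illustration only.
    print(adjust_probabilities({"copula=is": (0.2, 0.8)}, arity=1))
    print(adjust_probabilities({"word=heap+copula=is": (0.2, 0.8)}, arity=2))
```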
The present embodiment also provides a method for automatically extracting academic definitions based on syntactic features, the method comprising:
judging whether each single sentence is a definition sentence by using rules and a statistical method;
preprocessing the single sentences judged to be definitions, extracting the word string at a specific position as the term word according to the term word extraction templates, and correcting it using the adjacent preceding and following word strings to obtain the definition term;
outputting the definition term.
The rule method mentioned above performs affirmative template matching and/or negative template matching on each single sentence;
if the affirmative template match succeeds, negative rule template matching is performed;
if the negative rule template match fails, the sentence is considered a definition sentence and is output (as shown in Fig. 2).
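The two-stage rule check can be sketched with a small affirmative list and a small negative list. Only the citation-marker pattern `^(\[[0-9]+\])(.*?)$` is taken from Table 2; the other patterns are English placeholders invented for this example, and the function name `is_definition_by_rules` is hypothetical.

```python
import re

# Illustrative stand-ins for the affirmative templates (Table 1).
AFFIRMATIVE = [
    re.compile(r"^(.*?)(is defined as)(.*?)$"),
    re.compile(r"^(.*?)(is called)(.*?)$"),
    re.compile(r"^(.*?)(refers to|refer to)(.*?)$"),
]

# Illustrative stand-ins for the negative templates (Table 2).
NEGATIVE = [
    re.compile(r"^(\[[0-9]+\])(.*?)$"),            # line starts with a citation marker
    re.compile(r"^(.*?)(not refer to)(.*?)$"),     # negated definition wording
]


def is_definition_by_rules(sentence):
    """Accept a sentence only if an affirmative template matches
    and no negative template matches."""
    if not any(pattern.match(sentence) for pattern in AFFIRMATIVE):
        return False
    return not any(pattern.match(sentence) for pattern in NEGATIVE)


if __name__ == "__main__":
    print(is_definition_by_rules("A stack is defined as a last-in first-out list."))
    print(is_definition_by_rules("[3] A stack is defined as a last-in first-out list."))
```

Running the negative templates only after an affirmative hit follows the order described above and avoids testing negative patterns on sentences that were never candidates.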
N-gram sentence feature extraction and syntactic feature extraction are performed on the preprocessed single sentences, the N-gram sentence feature probability and the syntactic feature probability are calculated, and the discriminant function is defined from these probabilities. If the function judges the sentence to be a definition, the definition sentence is output (as shown in Fig. 3).
Although the embodiment is disclosed as above, the content described is only an embodiment adopted to make the invention easier to understand and is not intended to limit the invention. Any person skilled in the art to which the invention belongs may make modifications and changes to the form and details of the implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains that defined by the appended claims.

Claims (10)

1. A system for automatically extracting academic definitions based on syntactic features, characterized in that the system comprises a preprocessing module, a definition sentence extraction module, a definition term extraction module and an output module, wherein
the preprocessing module extracts the abstract and the full text from the input academic document and splits the extracted abstract and full text into single sentences;
the definition sentence extraction module judges whether each single sentence is a definition sentence by using rules and a statistical method;
the definition term extraction module preprocesses the single sentences judged to be definitions, extracts a word string as the term word according to term word extraction templates, and corrects it using the adjacent preceding and following word strings to obtain the definition term;
the output module outputs the definition term.
2. The system for automatically extracting academic definitions based on syntactic features according to claim 1, characterized in that the rule method performs affirmative template matching and/or negative template matching on each single sentence.
3. The system for automatically extracting academic definitions based on syntactic features according to claim 1, characterized in that the statistical method performs N-gram sentence feature extraction and syntactic feature extraction on each single sentence, calculates the probabilities of the N-gram sentence features and of the syntactic features, and defines a discriminant function from these probabilities.
4. The system for automatically extracting academic definitions based on syntactic features according to claim 3, characterized in that the N-gram sentence features comprise unary features and binary features;
the unary features comprise the word segmentation results for common words, the word segmentation results for domain-specific words, the copula, the part of speech before the copula, the part of speech after the copula, and the distance of the copula from the start of the sentence;
each binary feature is the combination of one of the unary features with the copula feature.
5. The system for automatically extracting academic definitions based on syntactic features according to claim 3, characterized in that the syntactic features comprise unary syntactic features and binary syntactic features;
the unary syntactic features comprise: the first verb in the sentence, the phrase type before the first verb, the phrase type after the first verb, the last verb, the phrase before the last verb, and the phrase after the last verb;
the binary syntactic features comprise: the combination of the first verb in the sentence with the phrase type before it, the combination of the first verb with the phrase type after it, the combination of the last verb with the phrase before it, and the combination of the last verb with the phrase after it.
6. The system for automatically extracting academic definitions based on syntactic features according to claim 3, characterized in that the definition discriminant function gathers statistics for the N-gram features and the syntactic features as two separate classes, obtaining the probability that the sentence features indicate a definition and the probability that they do not, and the probability that the syntactic features indicate a definition and the probability that they do not; and
the N-gram feature weights in the definition discriminant function are determined.
7. A method for automatically extracting academic definitions based on syntactic features, characterized in that the method comprises:
extracting the abstract and the full text from the input academic document, and splitting the extracted abstract and full text into single sentences;
judging whether each single sentence is a definition sentence by using rules and a statistical method;
preprocessing the single sentences judged to be definitions, extracting a word string as the term word according to term word extraction templates, and correcting it using the adjacent preceding and following word strings to obtain the definition term;
outputting the definition term.
8. The method for automatically extracting academic definitions based on syntactic features according to claim 7, characterized in that the rule method performs affirmative template matching and/or negative template matching on each single sentence;
if the affirmative template match fails, the sentence is not considered a definition sentence;
if the affirmative template match succeeds, negative rule template matching is performed;
if the negative rule template match fails, the sentence is considered a definition sentence and is output.
9. The method for automatically extracting academic definitions based on syntactic features according to claim 7, characterized in that N-gram sentence feature extraction and syntactic feature extraction are performed on the preprocessed single sentences, the N-gram sentence feature probability and the syntactic feature probability are calculated, and a discriminant function is defined from these probabilities; if the function judges the sentence to be a definition, the definition sentence is output; otherwise, no definition sentence is output.
10. The method for automatically extracting academic definitions based on syntactic features according to claim 9, characterized in that
the N-gram sentence features comprise unary features and binary features;
the unary features comprise the word segmentation results for common words, the word segmentation results for domain-specific words, the copula, the part of speech before the copula, the part of speech after the copula, and the distance of the copula from the start of the sentence;
each binary feature is the combination of one of the unary features with the copula feature;
the syntactic features comprise unary syntactic features and binary syntactic features;
the unary syntactic features comprise: the first verb in the sentence, the phrase type before the first verb, the phrase type after the first verb, the last verb, the phrase before the last verb, and the phrase after the last verb;
the binary syntactic features comprise: the combination of the first verb in the sentence with the phrase type before it, the combination of the first verb with the phrase type after it, the combination of the last verb with the phrase before it, and the combination of the last verb with the phrase after it.
CN201510059166.5A 2015-02-05 2015-02-05 System and method for automatically extracting academic definitions based on syntactic features Active CN104572628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510059166.5A CN104572628B (en) 2015-02-05 2015-02-05 System and method for automatically extracting academic definitions based on syntactic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510059166.5A CN104572628B (en) 2015-02-05 2015-02-05 System and method for automatically extracting academic definitions based on syntactic features

Publications (2)

Publication Number Publication Date
CN104572628A true CN104572628A (en) 2015-04-29
CN104572628B CN104572628B (en) 2017-08-08

Family

ID=53088732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510059166.5A Active CN104572628B (en) 2015-02-05 2015-02-05 System and method for automatically extracting academic definitions based on syntactic features

Country Status (1)

Country Link
CN (1) CN104572628B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101233484A (en) * 2005-08-01 2008-07-30 微软公司 Definition extraction
CN101710343A (en) * 2009-12-11 2010-05-19 北京中机科海科技发展有限公司 Body automatic build system and method based on text mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张榕: "术语定义抽取、聚类与术语识别研究", 《中国优秀博硕士学位论文全文数据库(博士) 哲学与人文科学辑》 *
钱菲 等: "一种软/硬模板相结合的定义抽取算法", 《计算机技术与发展》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960041A (en) * 2017-03-28 2017-07-18 山西同方知网数字出版技术有限公司 A kind of structure of knowledge method based on non-equilibrium data
CN108573025A (en) * 2018-03-12 2018-09-25 北京云知声信息技术有限公司 The method and device of sentence characteristic of division is extracted based on hybrid template
CN108573025B (en) * 2018-03-12 2021-07-02 云知声智能科技股份有限公司 Method and device for extracting sentence classification characteristics based on mixed template
CN108647194A (en) * 2018-04-28 2018-10-12 北京神州泰岳软件股份有限公司 information extraction method and device

Also Published As

Publication number Publication date
CN104572628B (en) 2017-08-08

Similar Documents

Publication Publication Date Title
CN103077164B (en) Text analyzing method and text analyzer
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN100595760C (en) Method for gaining oral vocabulary entry, device and input method system thereof
CN110019658B (en) Method and related device for generating search term
CN103678684B (en) A kind of Chinese word cutting method based on navigation information retrieval
CN108363725B (en) Method for extracting user comment opinions and generating opinion labels
CN106844331A (en) Sentence similarity calculation method and system
CN105808526A (en) Commodity short text core word extracting method and device
CN109284352A (en) A kind of querying method of the assessment class document random length words and phrases based on inverted index
CN105893444A (en) Sentiment classification method and apparatus
JP5403696B2 (en) Language model generation apparatus, method and program thereof
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN110175585B (en) Automatic correcting system and method for simple answer questions
CN104281565B (en) Semantic dictionary construction method and device
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN106126502A (en) A kind of emotional semantic classification system and method based on support vector machine
CN101556596A (en) Input method system and intelligent word making method
CN104573030A (en) Textual emotion prediction method and device
CN108959630A (en) A kind of character attribute abstracting method towards English without structure text
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN104572628A (en) System and method for automatically extracting academic definition based on syntax characteristics
CN111444713A (en) Method and device for extracting entity relationship in news event
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN107015966A (en) Text audio automaticabstracting based on improved PageRank algorithms
CN111027308A (en) Text generation method, system, mobile terminal and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zhao Jiyuan

Inventor after: Luo Xiao

Inventor after: Du Yufeng

Inventor after: Zheng Ping

Inventor before: Zhao Jiyuan

Inventor before: Luo Xiao

Inventor before: Du Yufeng

COR Change of bibliographic data
GR01 Patent grant