CN107577670B - Term extraction method based on definition and relation - Google Patents


Info

Publication number
CN107577670B
Authority
CN
China
Prior art keywords
term
word
text
term candidate
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710833633.4A
Other languages
Chinese (zh)
Other versions
CN107577670A (en)
Inventor
许斌
李思良
杨玉基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Machine Translation (AREA)

Abstract

The invention belongs to the field of text mining and relates in particular to a term extraction method based on definitions and relations. The method improves the recognition of low-frequency terms and the extraction of highly general terms and long, multi-word terms.

Description

Term extraction method based on definition and relation
Technical Field
The invention belongs to the field of text mining, and particularly relates to a term extraction method based on definition and relation.
Background
Terms are the conventional symbols used to express specialized concepts in a particular field, and they play an important role in natural language processing tasks such as Chinese word segmentation and syntactic analysis. When a domain knowledge base is built, terms are the main carriers of the domain's knowledge and are central to expanding its knowledge instances. Manually labeling terms in unstructured text costs considerable labor and time, and missed labels reduce recall. Automatic term extraction is therefore attracting increasing attention from researchers.
Prior-art term extraction methods mainly comprise two steps. The first step obtains candidate terms by computing the unithood of character strings; the second step extracts the real terms by using a termhood measure. Unithood describes the stability of a specific character-string combination, and termhood describes how strongly a language unit is related to the field. Most of these methods are statistical. However, when terms are extracted to build a basic-education knowledge base, their statistical characteristics differ greatly from those of terms in professional fields. Taking mathematics as an example, the term "triangle" appears 1779 times in junior high school textbooks, while the term "tangent circle" appears only 3 times. Only a few important terms in mathematics textbooks are used repeatedly, and this long-tail distribution causes low-frequency terms to be missed. In addition, some basic terms such as "face" and "line" are also widely used in other fields, so inverse document frequency may cause such highly general terms to be treated as domain-independent words.
Prior-art term extraction methods therefore recognize low-frequency terms poorly, are prone to omissions, and extract highly general terms and long, multi-word terms poorly.
Explanation of technical vocabulary
Definition: an exact and brief description of the intension and extension of a concept, for example a statement of what a "circumscribed circle" is.
Word formation rules: how words are constructed; for example, "right triangle" is constructed from the terms "right angle" and "triangle".
Template: an abstract representation that summarizes a class of cases. For example, the template "<lower part> is <upper part>" covers all cases of the form "A is B", where "A" is the lower part and "B" is the upper part.
Boundary word list: a vocabulary formed from common adverbs that appear at boundary positions (e.g., "time", "all", "on", "each") and common pronoun-quantifier combinations (e.g., "this", "a group", "a pair", "a bar").
Boundary detection: screening the term candidates with the boundary word list. For example, if "a straight line" is among the term candidates but the quantifier "a" is in the boundary word list, then "a straight line" is not a term and must be deleted from the term candidates.
Term candidate: a word or sentence fragment that is likely to be a term but needs further confirmation, for example "a straight line" above.
Right-type candidate (Rc): a term candidate in which the term is more likely to appear in the right part. For example, the term candidate "a straight line" is Rc: the term is "straight line", which appears on the right side of "a straight line".
Left-type candidate (Lc): a term candidate in which the term is more likely to appear in the left part, i.e., the term sits on the left side of the candidate string.
Disclosure of Invention
In view of the above problems, the present invention provides a term extraction method based on definitions and relations, comprising the following steps:
Step (1): preprocessing and initializing the html-format text;
Step (2): performing definition-based term candidate extraction on the text obtained in step (1) to generate a term candidate set;
Step (3): performing term candidate extraction based on the hypernym-hyponym (superior-inferior) relation by using the text obtained in step (1) and the term candidate set generated in step (2), to generate a new term candidate set;
Step (4): performing term candidate extraction based on the whole-part relation by using the text obtained in step (1) and the term candidate set generated in step (3), to generate a new term candidate set;
Step (5): performing term candidate extraction based on the parallel relation by using the text obtained in step (1) and the term candidate set generated in step (4), to generate a new term candidate set;
Step (6): checking each term candidate in the term candidate set generated in step (5) against word formation rules, to generate a term candidate set checked by the word formation rules;
Step (7): performing boundary detection on the term candidates by using the term candidate set generated in step (5) and the rule-checked term candidate set generated in step (6), to generate a new term set;
Step (8): counting the number n of elements in the term set generated in step (7) and comparing it with the number of elements in the term set before step (7) was executed; if the two numbers are equal, ending the procedure and returning the current term set; otherwise, executing step (9);
Step (9): correcting the word segmentation result and returning to step (3).
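The control flow of steps (3) to (9) amounts to iterating until the term set stops growing. The Python sketch below illustrates only that loop. The names `expand`, `screen` and `update_segmentation` are hypothetical stand-ins for steps (3)-(5), steps (6)-(7) and step (9); the `max_rounds` guard and the assumption that the term set accumulates across rounds are additions of this sketch, not statements of the patent.

```python
from typing import Callable, Set


def extract_terms(
    initial_candidates: Set[str],
    expand: Callable[[Set[str]], Set[str]],
    screen: Callable[[Set[str]], Set[str]],
    update_segmentation: Callable[[Set[str]], None],
    max_rounds: int = 50,
) -> Set[str]:
    """Repeat expansion and screening until the term set stops growing."""
    terms: Set[str] = set()
    candidates = set(initial_candidates)           # output of step (2)
    for _ in range(max_rounds):                    # safety guard (assumption)
        candidates = expand(candidates | terms)    # steps (3)-(5)
        new_terms = terms | screen(candidates)     # steps (6)-(7)
        if len(new_terms) == len(terms):           # step (8): size unchanged
            return new_terms
        terms = new_terms
        update_segmentation(terms)                 # step (9), back to step (3)
    return terms


# Toy stand-ins, purely illustrative.
RELATED = {"triangle": {"right triangle"}, "right triangle": {"hypotenuse"}}


def toy_expand(known: Set[str]) -> Set[str]:
    found = set(known)
    for item in known:
        found |= RELATED.get(item, set())
    return found


def toy_screen(candidates: Set[str]) -> Set[str]:
    # Stand-in for rule checking and boundary detection.
    return {c for c in candidates if not c.startswith("a ")}


segmentation_updates = []
result = extract_terms(
    {"triangle", "a line"}, toy_expand, toy_screen, segmentation_updates.append
)
```

In this toy run the loop needs three rounds: two rounds each add one term and trigger a segmentation update, and the third finds nothing new and stops.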
Further, step (1) comprises the following sub-steps:
Step (1.1): identifying the img and table tags in the text and filtering out pictures and tables;
Step (1.2): filtering out the html tags in the text and extracting the text content inside the tags;
Step (1.3): finding symbols and formulas in the cleaned text and filtering them out;
Step (1.4): re-segmenting the text at periods, commas, semicolons and question marks;
Step (1.5): performing Chinese word segmentation on the text with the ansj word segmentation tool and calculating the word frequency of each word.
Further, the step (2) comprises the following sub-steps:
Step (2.1): manually summarizing definition templates according to the text characteristics, wherein each definition template is a regular expression that matches two parts, a defined part and a definition part;
Step (2.2): extracting a definition part and a defined part from the text through the templates established in step (2.1), wherein the definition part is marked as a right-type candidate with low confidence and the defined part is marked as a left-type candidate with high confidence;
step (2.3): the newly found term candidate is added to the current term candidate set.
Further, the step (3) comprises the following sub-steps:
Step (3.1): manually summarizing the hypernym-hyponym template according to the text characteristics: "<lower part> is <upper part>";
Step (3.2): extracting a lower part and an upper part from the text through the template established in step (3.1); if the upper part is a found term, setting the lower part as a right-type candidate with low confidence; if the lower part is a found term, setting the upper part as a left-type candidate with high confidence;
step (3.3): adding the term candidates found in step (3.2) to the current term candidate set.
Further, the step (4) comprises the following sub-steps:
Step (4.1): manually summarizing the whole-part template according to the text characteristics: "<right part> of <left part>";
Step (4.2): extracting a left part and a right part from the text through the template established in step (4.1); let the left part be s_1 and the right part be s_2, where the word segmentation result of s_1 is w_{1,1}, ..., w_{1,m} and that of s_2 is w_{2,1}, ..., w_{2,n}; first checking whether exactly one of w_{1,m} and w_{2,n} is a found term; if w_{1,m} is a term, taking s_2 as a term candidate, and if w_{2,n} is a term, taking s_1 as a term candidate;
Step (4.3): for each term candidate tc found in step (4.2), checking the proportion of terms among all words that, after word segmentation, stand in a whole-part relation with tc; if the proportion is greater than or equal to a given threshold, setting tc as a low-confidence Rc and adding it to the current term candidate set;
Step (4.4): finding, in the current term candidate set, each term candidate tc that satisfies the template established in step (4.1), and extracting its left part s_1 and right part s_2; if tc is a high-confidence term candidate, setting s_1 as a high-confidence Rc and s_2 as a high-confidence Lc; if tc is a low-confidence term candidate, then when tc is Lc taking only s_1 and setting it as a low-confidence Rc, and otherwise taking only s_2 and setting it as a low-confidence Lc; then deleting tc from the current term candidate set;
Step (4.5): adding the new term candidates generated in step (4.4) to the current term candidate set.
Further, the step (5) comprises the following sub-steps:
Step (5.1): manually summarizing the parallel relation template according to the text characteristics: "<parallel part> (<parallel part>,)* [and|or|and] <parallel part> etc.?";
Step (5.2): extracting parallel relations from the text obtained in step (1) through the template established in step (5.1);
Step (5.3): let the parallel contents extracted in step (5.2) be s_1, ..., s_n, and let the word segmentation result of s_n be w_1, ..., w_m; let t_i = w_i ... w_m (1 <= i < m); counting the total number of occurrences T_i of the n-1 character strings s_1 t_i, ..., s_{n-1} t_i; letting T_a be the maximum of the T_i and t_a the corresponding t_i; if T_a is greater than a given threshold, setting the n words s_1 t_a, ..., s_{n-1} t_a, s_n as high-confidence Rc;
Step (5.4): adding the term candidates generated in step (5.3) to the current term candidate set.
Further, the step (6) comprises the following sub-steps:
step (6.1): manually analyzing the composition of the general terms, and summarizing term construction rules according to different terms with different confidence degrees;
step (6.2): detecting term candidates in the current term candidate set according to the word construction rule obtained in the step (6.1), and adding all term candidates meeting the word construction rule into the term candidate set checked by the word construction rule;
step (6.3): for each low confidence term candidate tc in the term candidate set checked by the word construction rule, checking whether a term t exists in the found terms so that t is a suffix of tc, and if t does not exist, removing tc from the term candidate set checked by the word construction rule;
step (6.4): the computer calculates the occurrence times of tc in the text for each low-confidence term candidate tc in the term candidate set checked by the word construction rule, and if the occurrence times of tc is less than a given threshold value, the tc is removed from the term candidate set checked by the word construction rule.
Further, step (7) comprises the sub-steps of:
Step (7.1): manually constructing a boundary word list from common auxiliary words and common pronoun-quantifier collocations;
Step (7.2): performing term boundary detection according to the boundary word list obtained in step (7.1), the current term candidate set and the term candidate set checked by the word formation rules.
Further, step (9) comprises the sub-steps of:
Step (9.1): let t be a term string in the term set that the current word segmentation system splits into n words w_1, ..., w_n with parts of speech p_1, ..., p_n; adding t to the word segmentation system as a new word so that it is segmented as a single word; if p_n is the noun class n, the adjective class a or the verb class v, setting the part of speech of t to p_n, and otherwise setting the part of speech of t to the noun class n;
Step (9.2): updating all current word segmentation results, recalculating the word frequency of all words, and returning to step (3).
In summary, the invention proposes a term extraction method that mainly exploits term definitions and term relations and integrates word formation rules and boundary detection. The technical solution improves the recognition of low-frequency terms and the extraction of highly general terms and long, multi-word terms.
Drawings
FIG. 1 is an overall flow chart of the term extraction method proposed by the present invention
Detailed Description
The embodiments are described in detail below with reference to the accompanying drawings.
Fig. 1 is an overall flowchart of the term extraction method proposed by the present invention, which specifically includes the following steps:
Step (1), text preprocessing, word segmentation and word frequency statistics
Resources in web-page html format come from the widest range of sources and are the easiest to obtain, so html text is chosen as the input of the method. Because html resources are not plain text, the data must first be cleaned.
Pictures in a web page are only links and carry no semantic information in the text, and tables are hard to process because their formats vary, so the content of <img> and <table> tags is filtered out of the text with regular expressions. Mathematical symbols and formulas are also often mixed into the text; they do not help term extraction and can degrade word segmentation, so they are likewise filtered out with regular expressions. The invention then removes all web-page tags and re-segments the text at periods, commas, semicolons and question marks. Finally, the ansj word segmentation tool is used to perform Chinese word segmentation on the text, and the word frequency of each word is calculated.
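The cleaning pipeline described above can be sketched with Python's standard `re` module. This is a minimal illustration under stated assumptions: the regular expressions are simplified guesses (the patent does not give its own), "symbols and formulas" are approximated as runs of digits and arithmetic operators, and whitespace splitting stands in for the ansj segmenter, which is a Java tool and is not called here.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

# Simplified patterns; the actual expressions are not given in the source.
IMG_OR_TABLE = re.compile(r"<img\b[^>]*>|<table\b.*?</table>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")
FORMULA = re.compile(r"[0-9+\-*/=^]+")          # assumption: digits and operators
CLAUSE_BREAK = re.compile(r"[。，；？.,;?]")      # periods, commas, semicolons, question marks


def preprocess(
    html: str, segment: Callable[[str], List[str]] = str.split
) -> Tuple[List[str], Counter]:
    """Return the cleaned clauses and the word-frequency table."""
    text = IMG_OR_TABLE.sub("", html)   # step (1.1): drop pictures and tables
    text = ANY_TAG.sub("", text)        # step (1.2): strip remaining tags
    text = FORMULA.sub("", text)        # step (1.3): drop symbols and formulas
    clauses = [c.strip() for c in CLAUSE_BREAK.split(text) if c.strip()]  # step (1.4)
    frequency = Counter(w for clause in clauses for w in segment(clause))  # step (1.5)
    return clauses, frequency


PAGE = (
    "<p>A right triangle is a triangle, and 1+2=3.</p>"
    '<img src="t.png"><table><tr><td>x</td></tr></table>'
)
clauses, frequency = preprocess(PAGE)
```

For Chinese input, a real segmenter would be passed as `segment` in place of `str.split`.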
Step (2), term candidate extraction based on definition
Texts that typically contain definitions are first analyzed manually, and the general definition extraction templates shown in Table 1 are derived from them.
A definition extracted by template matching is decomposed into two parts: a defined part and a definition part. The defined part states what is being defined, and the definition part states what is used to define it.
The invention obtains term candidates from definitions on the basis of the following two assumptions:
(1) the definitions in textbooks all serve to teach subject knowledge, so they must all describe terms of the subject;
(2) the terms of a basic-education subject should be strongly self-contained, i.e., the words used to define a term are likely to be terms themselves. Therefore, for each definition, the invention takes both the definition part and the defined part as term candidates.
TABLE 1. Definition extraction templates
<definition part> (called | call) (as | is) <defined part>
<definition part> means <defined part>
<defined part> is defined (as |) <definition part>
Call <definition part> (as |) <defined part>
Term candidate acquisition based on definitions is illustrated with the definition of a perpendicular line: "one of two straight lines perpendicular to each other is called a perpendicular line of the other straight line". By template matching, the defined part is "a perpendicular line of the other straight line" and the definition part is "one of two straight lines perpendicular to each other". By the assumptions above, both parts can serve as term candidates.
The example shows that the defined part and the definition part differ in sentence complexity; the definition part is usually the more complex. Therefore, although one definition yields two term candidates, the candidate generated from the definition part is set to low confidence and the candidate generated from the defined part is set to high confidence. In the term screening steps, different screening strategies are applied according to the confidence level.
In addition, the invention considers that in the term candidates generated from both the definition part and the defined part the term should lie in the right part, so both are labeled as right-type candidates (Rc). Left-type (Lc) and right-type (Rc) labels indicate whether the term is more likely to appear in the left or the right part of a term candidate. In the term screening stage, different analysis strategies can be applied according to the type of the candidate.
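The following Python sketch shows how definition templates of this kind could be applied. The two English regular expressions are illustrative stand-ins for the Chinese templates of Table 1, and the `Candidate` class and function name are hypothetical; only the labeling (defined part: high confidence Rc, definition part: low confidence Rc) follows the description above.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    text: str
    kind: str        # "Rc" or "Lc"
    confidence: str  # "high" or "low"


# English stand-ins for the definition templates (assumption).
DEFINITION_TEMPLATES = [
    re.compile(r"^(?P<definition>.+?) is called (?P<defined>.+)$"),
    re.compile(r"^(?P<defined>.+?) is defined as (?P<definition>.+)$"),
]


def candidates_from_definition(sentence: str) -> List[Candidate]:
    """Split a definition sentence into its two parts and label them."""
    for template in DEFINITION_TEMPLATES:
        match = template.match(sentence)
        if match:
            return [
                Candidate(match.group("defined"), "Rc", "high"),
                Candidate(match.group("definition"), "Rc", "low"),
            ]
    return []  # the sentence is not a definition under these templates


SENTENCE = (
    "one of two straight lines perpendicular to each other "
    "is called a perpendicular line of the other straight line"
)
found = candidates_from_definition(SENTENCE)
```

Both candidates still contain non-term material ("a", "one of two ..."), which is why the later screening steps are needed.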
Step (3), term candidate extraction based on the hypernym-hyponym relation
The hypernym-hyponym (superior-inferior) relation is the semantic inclusion relation between two words. For example, in "a square is a special rectangle", "square" is the hyponym and "rectangle" is the hypernym. The relation is extracted with the template "<lower part> is <upper part>". The invention extracts every sentence that matches this template and in which exactly one of the matched lower part and upper part is a found term, and takes the part that is not a term as a term candidate. In the example above, if "square" is already in the found term set, "a special rectangle" is taken as a term candidate.
The matched lower part is labeled Rc and the matched upper part is labeled Lc. Not every sentence that matches the template actually expresses a hypernym-hyponym relation; for example, in "solving a triangle is an important mathematical problem" the matched upper part is not a real term. The invention therefore sets the term candidates generated from this relation to low confidence.
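The rule "exactly one side is a known term, so the other side becomes a candidate" can be sketched as follows. The English pattern and the function name are hypothetical, and the labels follow this detailed description (lower part Rc, upper part Lc, both low confidence).

```python
import re
from typing import Optional, Set, Tuple

# English stand-in for the template "<lower part> is <upper part>" (assumption).
IS_A = re.compile(r"^(?P<lower>.+?) is (?P<upper>.+)$")


def candidate_from_is_a(sentence: str, terms: Set[str]) -> Optional[Tuple[str, str, str]]:
    """Return (text, type, confidence) for the non-term side, or None."""
    match = IS_A.match(sentence)
    if not match:
        return None
    lower, upper = match.group("lower"), match.group("upper")
    if (lower in terms) == (upper in terms):
        return None                      # need exactly one known term
    if lower in terms:
        return (upper, "Lc", "low")      # upper part becomes the candidate
    return (lower, "Rc", "low")          # lower part becomes the candidate


from_square = candidate_from_is_a("square is a special rectangle", {"square"})
from_noise = candidate_from_is_a(
    "solving a triangle is an important mathematical problem", {"square", "triangle"}
)
```

The second sentence yields nothing because neither matched part is itself a known term, which is the behavior the "exactly one" condition is meant to enforce.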
Step (4), term candidate extraction based on the whole-part relation
Whole-part relations are extracted from "of" phrases (template: "<right part> of <left part>"). Some whole-part relations involve only terms, such as "side of a triangle", while others involve non-terms, such as "difficulty of a function". To exploit whole-part relations while preserving extraction accuracy, the invention analyzes a relation differently depending on whether it was extracted from a high-confidence term candidate, a low-confidence term candidate or ordinary text.
A whole-part relation found inside a high-confidence term candidate very probably involves only terms, so the invention regards it as the most reliable and directly sets the contents on both sides of the "of" phrase as high-confidence term candidates. The left part is labeled Rc and the right part is labeled Lc.
Low-confidence term candidates are often complex, so the invention selects from the "of" phrase the simpler part, in which a term is more likely to be present, as the term candidate. The choice depends on the candidate's type: if the term candidate is Lc, the left part of the "of" phrase is selected and labeled Rc; otherwise the right part is selected and labeled Lc. The newly found term candidate is set to low confidence.
For content extracted from ordinary text, let the left part be s_1 and the right part be s_2. The word segmentation result of s_1 is w_{1,1}, ..., w_{1,m} and that of s_2 is w_{2,1}, ..., w_{2,n}. The method first checks whether exactly one of w_{1,m} and w_{2,n} is a found term. If w_{1,m} is a term, s_2 is taken as the term candidate tc; if w_{2,n} is a term, s_1 is taken as tc. All words w that stand in a whole-part relation with tc in the segmentation result are then collected, and the proportion of terms among them is checked. If the proportion is greater than or equal to a given threshold, tc is set to a low-confidence Rc and added to the term candidate set (TCS). In practice the threshold is set to 1, i.e., every such word w must be a found term.
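The ordinary-text case can be sketched as below. Each relation is given as an already segmented pair (left part words, right part words). Two points are assumptions of this sketch rather than statements of the patent: the "words that stand in a whole-part relation with tc" are taken to be the final words of the opposite side in every pair where tc occurs, and parts are joined with a space (for Chinese text the joiner would be empty). All function names are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Pair = Tuple[List[str], List[str]]  # (left part words, right part words)


def propose(pair: Pair, terms: Set[str]) -> Optional[str]:
    """Return the non-term side if exactly one final word is a known term."""
    left, right = pair
    left_is_term = left[-1] in terms
    right_is_term = right[-1] in terms
    if left_is_term == right_is_term:
        return None
    return " ".join(right if left_is_term else left)


def partner_term_ratio(tc: str, pairs: List[Pair], terms: Set[str]) -> float:
    """Share of tc's whole-part partners that are known terms."""
    partners = []
    for left, right in pairs:
        if " ".join(left) == tc:
            partners.append(right[-1])
        elif " ".join(right) == tc:
            partners.append(left[-1])
    if not partners:
        return 0.0
    return sum(p in terms for p in partners) / len(partners)


def whole_part_candidates(pairs: List[Pair], terms: Set[str], threshold: float = 1.0) -> Set[str]:
    """Collect low-confidence Rc candidates that pass the ratio test."""
    accepted = set()
    for pair in pairs:
        tc = propose(pair, terms)
        if tc is not None and partner_term_ratio(tc, pairs, terms) >= threshold:
            accepted.add(tc)
    return accepted


PAIRS: List[Pair] = [
    (["triangle"], ["side"]),          # "side of triangle"
    (["function"], ["difficulty"]),    # "difficulty of function"
    (["quadrilateral"], ["side"]),
    (["problem"], ["difficulty"]),
]
TERMS = {"triangle", "quadrilateral", "function"}
accepted = whole_part_candidates(PAIRS, TERMS)
```

Here "side" is accepted because all of its partners are terms, while "difficulty" is rejected because it also pairs with the non-term "problem", so its ratio is 0.5 and falls below the threshold of 1.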
Step (5), term candidate extraction based on the parallel relation
The invention uses the manually constructed template "<parallel part> (<parallel part>,)* [and|or|and] <parallel part> etc.?" to identify parallel relations. In this expression, () denotes a group; * matches the preceding sub-expression zero or more times; | denotes alternation; [] denotes a character; and ? matches the preceding sub-expression zero or one time. The expression covers all parallel relations. For example, in "triangle and side", the first "<parallel part>" is "triangle", "(<parallel part>,)" occurs 0 times, "[and|or|and]" is matched by "and", and the other "<parallel part>" is "side". If one of the listed components is a found term, the other listed components are likely to be terms as well, and on this assumption the invention takes the parallel parts that satisfy the condition as term candidates. For example, given "pyramids and prisms are common geometric solids", the invention adds "pyramids" to the term candidates if "prisms" is already in the found term set. Because the parallel parts of a parallel relation are generally simple in structure, and once one parallel part is a term the others are very likely terms too, the extracted term candidates are set to high confidence and labeled Rc.
To handle root omission in expressions such as "acute angle, right angle and obtuse angle triangle", the invention adopts a root checking mechanism. Let the extracted parallel contents be s_1, ..., s_n, and let the word segmentation result of s_n be w_1, ..., w_m. Let t_i = w_i ... w_m (1 <= i < m). The method counts the total number of occurrences T_i of the n-1 character strings s_1 t_i, ..., s_{n-1} t_i. Let T_a be the maximum of the T_i and t_a the corresponding t_i. If T_a is greater than a given threshold, the n words s_1 t_a, ..., s_{n-1} t_a, s_n are set to high-confidence Rc. In practice the threshold is set to (n-1) × 3.
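A minimal Python sketch of the root checking mechanism follows. The function name, the `count` callback and the space joiner are hypothetical. The index range is an assumption: the text gives it as 1 <= i < m, whereas this sketch tries every proper suffix of the last item (starting at the second word and including the bare final word), since otherwise the shared root "triangle" in the example could never be selected.

```python
from typing import Callable, List


def recover_roots(
    parts: List[str],
    last_words: List[str],
    count: Callable[[str], int],
    per_item_threshold: int = 3,
    joiner: str = " ",
) -> List[str]:
    """Attach the best-supported suffix of the last item to the earlier items."""
    n = len(parts)
    best_total, best_suffix = -1, ""
    # Assumption: candidate roots are the proper suffixes w_i ... w_m with i >= 2.
    for i in range(1, len(last_words)):
        suffix = joiner.join(last_words[i:])
        total = sum(count(prefix + joiner + suffix) for prefix in parts[:-1])
        if total > best_total:
            best_total, best_suffix = total, suffix
    if best_total > (n - 1) * per_item_threshold:     # threshold (n-1) x 3
        return [prefix + joiner + best_suffix for prefix in parts[:-1]] + [parts[-1]]
    return list(parts)                                # no root recovered


CORPUS_COUNTS = {"acute angle triangle": 5, "right angle triangle": 9}
recovered = recover_roots(
    ["acute angle", "right angle", "obtuse angle triangle"],
    ["obtuse angle", "triangle"],
    lambda s: CORPUS_COUNTS.get(s, 0),
)
unchanged = recover_roots(
    ["acute angle", "right angle", "obtuse angle triangle"],
    ["obtuse angle", "triangle"],
    lambda s: 1,
)
```

In the first call the completed strings occur 14 times in total, above the threshold of 6, so the root "triangle" is attached; in the second call the total of 2 is too low and the parts are returned unchanged.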
Step (6), screening term candidates based on word-building rules
The number of words that a term is split into by word segmentation is called the number of elements of the term. For example, "triangle" is a one-element term, while "right triangle" is a two-element term because it is segmented into "right angle" and "triangle". The number of elements of a term changes with the segmentation result. Based on manual analysis of term formation and with reference to the part-of-speech table provided by ansj_seg, the invention compiles a table of word formation rules for terms, shown in Table 2.
TABLE 2 term word construction rules
Figure BDA0001409274060000101
Terms of more than four elements are found over multiple iterations. For example, under the initial segmentation "unit orthogonal basis" is a five-element term, but after the first iteration the updated segmentation makes it a three-element term, so the candidate is recognized as a term in the second iteration.
For low-confidence term candidates the invention additionally checks the suffix, i.e., checks whether some found term is a suffix of the candidate, and rejects term candidates whose word frequency is less than 10.
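The two checks applied to low-confidence candidates can be sketched in a few lines of Python. The function name and the dictionary-based frequency table are hypothetical; the suffix test and the frequency threshold of 10 come from the text. Note that a plain string suffix test is natural for unsegmented Chinese strings, while for English it can also match inside a word.

```python
from typing import Dict, Iterable, List, Set


def screen_low_confidence(
    candidates: Iterable[str],
    terms: Set[str],
    frequency: Dict[str, int],
    min_frequency: int = 10,
) -> List[str]:
    """Keep candidates that end in a known term and are frequent enough."""
    kept = []
    for tc in candidates:
        has_term_suffix = any(tc.endswith(t) for t in terms)
        if has_term_suffix and frequency.get(tc, 0) >= min_frequency:
            kept.append(tc)
    return kept


kept = screen_low_confidence(
    ["right triangle", "important problem", "isosceles triangle"],
    {"triangle"},
    {"right triangle": 25, "important problem": 40, "isosceles triangle": 4},
)
```

"important problem" fails the suffix check and "isosceles triangle" fails the frequency check in this toy data, so only "right triangle" survives.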
Step (7), term candidate screening based on boundary word list detection
Subject-specific language often uses fixed expressions, so some domain-independent words are mistaken for part of a term because they frequently collocate with it. For example, "a straight line" is mistaken for a term because the quantifier "a" often accompanies "straight line". To overcome this problem, the invention introduces boundary detection.
The boundary word list is built manually from common adverbs (e.g., "time", "all", "on", "each") and common pronoun-quantifier combinations (e.g., "this", "a group", "a pair", "a bar"). Boundary detection proceeds as follows:
Each term candidate is examined. If the candidate is Rc, its word segmentation result is searched from the right for the first word that appears in the boundary word list; the part to the right of that word becomes a new term candidate, and the original term candidate is deleted. For example, in the Rc candidate "a straight line", the first boundary word found from the right is "a", so the part to its right, "straight line", becomes a new term candidate and the original "a straight line" is deleted from the term candidate set. If the candidate is Lc, the search starts from the left and the part to the left of the boundary word becomes the new term candidate.
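Boundary detection as described can be sketched on a segmented candidate as follows. The function name and the sample boundary words are illustrative; returning the candidate unchanged when no boundary word is found, and treating an empty result as "drop the candidate", are assumptions of this sketch.

```python
from typing import List, Set


def trim_boundary(words: List[str], kind: str, boundary: Set[str]) -> List[str]:
    """Cut a segmented candidate at the first boundary word met from its term side."""
    if kind == "Rc":
        # Search from the right; keep what lies to the right of the boundary word.
        for i in range(len(words) - 1, -1, -1):
            if words[i] in boundary:
                return words[i + 1:]
        return list(words)
    # Lc: search from the left; keep what lies to the left of the boundary word.
    for i, word in enumerate(words):
        if word in boundary:
            return words[:i]
    return list(words)


BOUNDARY = {"a", "each", "this", "all"}
right_trimmed = trim_boundary(["a", "straight", "line"], "Rc", BOUNDARY)
left_trimmed = trim_boundary(["triangle", "each", "side"], "Lc", BOUNDARY)
```

An empty list comes back when the boundary word sits at the very edge of the term side, in which case nothing remains to keep as a candidate.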
Step (8), term updating and word segmentation result updating
Check whether the number of terms in the term set generated in step (7) has changed. If it has not, return the term set and end the procedure; otherwise update the word segmentation result.
Let t be a term string in the term set that the current word segmentation system splits into n words w_1, ..., w_n with parts of speech p_1, ..., p_n. Add t to the word segmentation system as a new word so that it is segmented as a single word. If p_n is the noun class n, the adjective class a or the verb class v, the part of speech of t is set to p_n; otherwise it is set to the noun class n. After all word segmentation results have been updated, return to step (3).
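The part-of-speech rule for a newly merged term can be sketched as below. The function name and the plain dictionary standing in for the segmenter's user dictionary are hypothetical, and reading "noun class", "adjective class" and "verb class" as "tag begins with n, a or v" is an assumption of this sketch, not a statement about the ansj tag set.

```python
from typing import Dict, List, Tuple

CONTENT_CLASSES = ("n", "a", "v")  # noun, adjective and verb classes


def merge_term(words_with_pos: List[Tuple[str, str]], joiner: str = "") -> Tuple[str, str]:
    """Merge the segmented words of a term and choose its part of speech."""
    term = joiner.join(word for word, _ in words_with_pos)
    last_pos = words_with_pos[-1][1]
    # Inherit the last word's tag if it is a content class, else default to noun.
    pos = last_pos if last_pos[:1] in CONTENT_CLASSES else "n"
    return term, pos


user_dictionary: Dict[str, str] = {}
term, pos = merge_term([("right", "a"), ("triangle", "n")], joiner=" ")
user_dictionary[term] = pos  # the segmenter would now treat the term as one word
```

After every term has been registered this way, the text is segmented again and the word frequencies are recomputed before the next iteration.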
FIG. 1 shows the flow of the whole term extraction method; steps (1) to (8) above together constitute a method for extracting terms from html-format text. All experiments of the invention were implemented in Java and run on a notebook computer with an Intel PM 1400 MHz processor and 768 MB of memory.
Furthermore, for the term extraction method of the present invention, the following experimental verification process was performed:
(1) building data sets
Mathematics in basic education was chosen as the research subject. The data set comprises electronic versions of 20 books: 6 junior high school mathematics textbooks from People's Education Press, 12 senior high school compulsory and science-elective mathematics textbooks, and 2 junior and senior high school teaching-aid books. The digitized resources are organized in epub format (the text is in xhtml format, which is consistent with html). After text preprocessing, 70,000 short sentences containing 460,000 words in total were obtained.
(2) Criteria for experimental measurements
First, basic-education mathematics teachers were asked to annotate the terms in the whole data set. The system was then used to extract terms, and experts were asked to review the extracted terms and select the correctly extracted ones. The union of the manually annotated terms and the correct results extracted by the method is taken as the total set of terms in the data set.
Precision, recall and F1-measure are used to evaluate the experimental results and are defined as follows:
Precision: P = number of terms correctly extracted by the system / number of terms extracted by the system;
Recall: R = number of terms correctly extracted by the system / total number of terms;
F1-measure: F1 = 2PR / (P + R)
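The three measures can be computed directly from the set of extracted terms and the set of total (gold) terms. The sketch below uses made-up sets purely to illustrate the formulas; the function name is hypothetical and the numbers are not experimental data from the patent.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Compute P, R and F1 = 2PR / (P + R) from two term sets."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sets only.
EXTRACTED = {"triangle", "right triangle", "hypotenuse", "a line"}
GOLD = {"triangle", "right triangle", "hypotenuse", "circle", "radius", "chord"}
p, r, f1 = precision_recall_f1(EXTRACTED, GOLD)
```

With 3 of 4 extracted terms correct and 6 gold terms, this gives P = 0.75, R = 0.5 and F1 = 0.6.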
(3) results of the experiment
First, the distribution of the terms extracted by the system was counted. Table 3 presents the distribution of terms by number of elements. Although terms of two or more elements account for 44% of all terms, their total word frequency is less than 2%, which shows that the system can effectively extract low-frequency terms.
TABLE 3 term distribution
Figure BDA0001409274060000121
Figure BDA0001409274060000131
The experimental results were then compared with a baseline method, which extracts terms using the left and right mutual information entropy and the word frequency of words. Table 4 shows the evaluation results of the two methods.
TABLE 4 comparison of results with baseline
[Table 4 is provided as an image in the original publication]
Compared with the baseline, the method provided by the invention effectively improves both the precision and the recall of term extraction: the F1 value increases from 41.9% to 82.7%, a substantial improvement in term extraction capability.
In addition, the time overhead of the method was examined in separate experiments. These show that the method is about as efficient as the baseline method and that introducing the strategies does not cause excessive time overhead.
The above embodiments are only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (1)

1. A method for term extraction based on definitions and relationships, the method comprising the steps of:
step 1: preprocessing and initializing the text in the html format, comprising:
step 1.1: identifying img and table labels in the text, and filtering out pictures and tables;
step 1.2: filtering html tags in the text, and extracting text contents in the tags;
step 1.3: finding out symbols and formulas from the cleaned text, and filtering;
step 1.4: re-segmenting the text according to periods, commas, semicolons and question marks;
step 1.5: performing Chinese word segmentation on the text by using an ansj word segmentation tool, and calculating the word frequency of each word;
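Steps 1.1, 1.2 and 1.4 above can be sketched with regular expressions as follows. This is only a rough illustration under stated assumptions: the function name `preprocess` is hypothetical, regex-based tag stripping is a simplification of real html parsing, and step 1.3 (formula filtering) and step 1.5 (ansj word segmentation) are not covered.

```python
import re
from html import unescape


def preprocess(xhtml):
    """Rough sketch of steps 1.1, 1.2 and 1.4 (steps 1.3 and 1.5 are omitted)."""
    # Step 1.1: drop tables (including their content) and images.
    text = re.sub(r"<table\b.*?</table>", "", xhtml, flags=re.S | re.I)
    text = re.sub(r"<img\b[^>]*>", "", text, flags=re.I)
    # Step 1.2: remove the remaining tags, keeping the text inside them.
    text = unescape(re.sub(r"<[^>]+>", "", text))
    # Step 1.4: split on periods, commas, semicolons and question marks
    # (both Chinese and ASCII forms; this naive split also breaks decimals).
    parts = re.split(r"[。，；？.,;?]", text)
    return [p.strip() for p in parts if p.strip()]


sentences = preprocess(
    "<p>A square is a rectangle, <img src='a.png'/>and a rhombus.</p>"
    "<table><tr><td>x</td></tr></table>"
)
```

In this toy input the image and the table disappear, and the remaining sentence is split at the comma and the period into two short sentences.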
step 2: performing definition-based term candidate extraction according to the text obtained after the processing in the step 1 to generate a term candidate set, comprising:
step 2.1: summarizing a definition template according to the text characteristics, wherein the definition template is in a regular expression form and comprises two matching contents of a defined part and a definition part;
step 2.2: extracting the definition part and the defined part from the text through the template established in step 2.1, wherein the definition part is marked as a right-type candidate and set to low confidence, and the defined part is marked as a left-type candidate and set to high confidence;
step 2.3: adding the newly found term candidate to the current term candidate set;
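A definition template in regular-expression form, as described in step 2, might look like the sketch below. The English pattern `... is called ...`, the `Candidate` class and the function name are all hypothetical, since the actual templates are not listed in the claim. The sketch also assumes that the defined part (the term being defined) is the high-confidence left-type candidate and that the definition part is the low-confidence right-type candidate.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    kind: str        # "L" (left-type) or "R" (right-type)
    confidence: str  # "high" or "low"


# Hypothetical English template with two matching groups:
# the definition part and the defined part.
DEFINITION_TEMPLATE = re.compile(
    r"^(?P<definition>.+?) is called (?:an? |the )?(?P<defined>.+)$"
)


def extract_definition_candidates(sentences):
    """Apply the template to each short sentence and collect term candidates."""
    candidates = set()
    for s in sentences:
        m = DEFINITION_TEMPLATE.match(s)
        if m:
            # Assumption: defined part -> left-type, high confidence.
            candidates.add(Candidate(m.group("defined"), "L", "high"))
            # Assumption: definition part -> right-type, low confidence.
            candidates.add(Candidate(m.group("definition"), "R", "low"))
    return candidates


found = extract_definition_candidates([
    "a quadrilateral with four equal sides is called a rhombus",
    "the sum is 5",
])
```

Only the first sentence matches the template, so the result holds one high-confidence candidate ("rhombus") and one low-confidence candidate (the defining phrase).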
step 3: performing term candidate extraction based on the hypernym-hyponym relationship by using the text obtained after the processing in step 1 and the term candidate set generated in step 2 to generate a new term candidate set, comprising:
step 3.1: summarizing hypernym-hyponym templates according to the text characteristics;
step 3.2: extracting a hyponym part and a hypernym part from the text through the template established in step 3.1; if the hypernym part is a term that has already been found, setting the hyponym part as a right-type candidate with low confidence, and if the hyponym part is a term that has already been found, setting the hypernym part as a left-type candidate with high confidence;
step 3.3: adding the term candidates found in step 3.2 to the current term candidate set;
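The rule in step 3.2 for the lower (hyponym) part and the upper (hypernym) part can be sketched as follows. The English pattern `... is a kind of ...` and the function name are hypothetical stand-ins for the templates summarized in step 3.1, which the claim does not list.

```python
import re

# Hypothetical template: "<hyponym> is a kind of <hypernym>".
HYPONYMY_TEMPLATE = re.compile(r"^(?P<hypo>.+?) is a kind of (?P<hyper>.+)$")


def extract_hyponymy_candidates(sentences, known_terms):
    """Return {candidate: (type, confidence)} following the rule of step 3.2."""
    new_candidates = {}
    for s in sentences:
        m = HYPONYMY_TEMPLATE.match(s)
        if not m:
            continue
        hypo, hyper = m.group("hypo"), m.group("hyper")
        if hyper in known_terms:
            # Upper part already found: lower part is right-type, low confidence.
            new_candidates[hypo] = ("R", "low")
        if hypo in known_terms:
            # Lower part already found: upper part is left-type, high confidence.
            new_candidates[hyper] = ("L", "high")
    return new_candidates


result = extract_hyponymy_candidates(
    ["square is a kind of rectangle", "rhombus is a kind of parallelogram"],
    known_terms={"rectangle", "rhombus"},
)
```

In the toy input, "square" is proposed because its upper part is already a term, and "parallelogram" is proposed because its lower part is already a term.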
step 4: performing term candidate extraction based on the whole-part relationship by using the text obtained after the processing in step 1 and the term candidate set generated in step 3 to generate a new term candidate set, comprising:
step 4.1: summarizing a whole-part template according to the text characteristics;
step 4.2: extracting a left part and a right part from the text through the template established in step 4.1; letting the left part be s_1 and the right part be s_2, where the word segmentation result of s_1 is w_{1,1}, ..., w_{1,m} and the word segmentation result of s_2 is w_{2,1}, ..., w_{2,n}; first checking whether exactly one of w_{1,m} and w_{2,n} is a term that has already been found; if w_{1,m} is a term, taking s_2 as a term candidate, and if w_{2,n} is a term, taking s_1 as a term candidate;
step 4.3: for each term candidate tc found in step 4.2, checking, among all the words that stand in a whole-part relation with tc after word segmentation, the proportion that are terms; if the proportion is greater than or equal to a given threshold, setting tc as a low-confidence Rc and adding it to the current term candidate set;
step 4.4: finding the term candidates tc in the current term candidate set that satisfy the template established in step 4.1, and extracting from each the "left part" s_1 and the "right part" s_2; if tc is a high-confidence term candidate, setting s_1 as a high-confidence Rc and s_2 as a high-confidence Lc; when tc is a low-confidence term candidate, if tc is an Lc, taking only s_1 and setting it as a low-confidence Rc, otherwise taking only s_2 and setting it as a low-confidence Lc; after which tc is deleted from the current term candidate set;
step 4.5: adding the new term candidate generated in step 4.4 to the current term candidate set;
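The check in step 4.2 looks only at the last word of each side of the matched template. A minimal sketch is given below; the function name is hypothetical, the inputs are assumed to be already segmented into words, and the words are joined with spaces only because the toy example is in English.

```python
def whole_part_candidate(s1_words, s2_words, known_terms):
    """Return the side proposed as a term candidate in step 4.2, or None.

    s1_words / s2_words are the word segmentation results of the left and
    right parts; only their last words are compared with the known terms.
    """
    left_is_term = s1_words[-1] in known_terms
    right_is_term = s2_words[-1] in known_terms
    if left_is_term == right_is_term:
        # Neither or both are terms: "exactly one" is not satisfied.
        return None
    # If the left side ends in a term, propose the right part, and vice versa.
    return " ".join(s2_words) if left_is_term else " ".join(s1_words)


candidate = whole_part_candidate(["circle"], ["radius"], {"circle"})
```

With "circle" already known, the other side of the relation ("radius") becomes the candidate.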
step 5: performing term candidate extraction based on a parallel relationship by using the text obtained after the processing in step 1 and the term candidate set generated in step 4 to generate a new term candidate set, comprising:
step 5.1: manually summarizing a parallel relation template according to the text characteristics;
step 5.2: extracting a parallel relation from the text obtained after the processing in the step 1 through the parallel relation template established in the step 5.1;
step 5.3: letting the parallel items extracted in step 5.2 be s_1, ..., s_n, and the word segmentation result of s_n be w_1, ..., w_m; letting t_i = w_i ... w_m, where 1 <= i < m; the computer counting the total number of occurrences T_i of the n-1 character strings s_1 t_i, ..., s_{n-1} t_i; letting T_a be the maximum of the T_i and t_a the corresponding t_i; if T_a is greater than a given threshold, setting the n words s_1 t_a, ..., s_{n-1} t_a, s_n as high-confidence Rc;
step 5.4: the computer adds the term candidates generated in step 5.3 to the current term candidate set;
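Step 5.3 completes abbreviated parallel items by borrowing a suffix of the last item. The sketch below follows the index range as written in the claim (i from 1 up to but not including m, in 1-based counting). The function name, the space separator and the toy English text are illustrative assumptions.

```python
def parallel_candidates(items, last_words, text, threshold, sep=" "):
    """Sketch of step 5.3.

    items      -- parallel items [s_1, ..., s_n]
    last_words -- word segmentation [w_1, ..., w_m] of the last item s_n
    text       -- corpus text in which occurrences are counted
    """
    m = len(last_words)
    best_total, best_suffix = -1, None
    # 1 <= i < m in the claim's 1-based indexing, i.e. 0 <= i < m - 1 here.
    for i in range(m - 1):
        suffix = sep.join(last_words[i:])                 # t_i
        total = sum(text.count(s + sep + suffix) for s in items[:-1])  # T_i
        if total > best_total:
            best_total, best_suffix = total, suffix
    if best_suffix is None or best_total <= threshold:
        return []
    # s_1 t_a, ..., s_{n-1} t_a, s_n
    return [s + sep + best_suffix for s in items[:-1]] + [items[-1]]


corpus = (
    "an acute angled triangle has three acute angles. "
    "a right angled triangle has one right angle. "
    "an obtuse angled triangle has one obtuse angle."
)
completed = parallel_candidates(
    ["acute", "right", "obtuse angled triangle"],
    ["obtuse", "angled", "triangle"],
    corpus,
    threshold=1,
)
```

The suffix "angled triangle" yields the most occurrences when appended to "acute" and "right", so both items are completed with it and the last item is kept unchanged.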
step 6: checking each term candidate in the term candidate set generated in step 5 against word construction rules to generate a term candidate set checked by the word construction rule, comprising:
step 6.1: manually analyzing the composition of typical terms, and summarizing separate word construction rules for term candidates of different confidence levels;
step 6.2: detecting term candidates in the current term candidate set according to the word construction rule obtained in the step 6.1, and adding all term candidates meeting the word construction rule into the term candidate set checked by the word construction rule;
step 6.3: for each low confidence term candidate tc in the term candidate set checked by the word construction rule, checking whether a term t exists in the found terms so that t is a suffix of tc, and if t does not exist, removing tc from the term candidate set checked by the word construction rule;
step 6.4: the computer calculates the occurrence times of tc in the text for each low-confidence term candidate tc in the term candidate set checked by the word construction rule, and if the occurrence times of tc is less than a given threshold value, the tc is removed from the term candidate set checked by the word construction rule;
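The two filters applied to low-confidence candidates in steps 6.3 and 6.4 (a found term must be a suffix of the candidate, and the candidate must occur often enough) can be sketched as below. The function name, the dictionary layout and the parameter `min_count` are hypothetical; the word construction rules of steps 6.1 and 6.2 are not modelled.

```python
def filter_low_confidence(candidates, known_terms, text, min_count):
    """Sketch of steps 6.3 and 6.4.

    candidates maps each candidate string to "high" or "low".
    High-confidence candidates pass through unchanged.
    """
    kept = {}
    for tc, confidence in candidates.items():
        if confidence == "low":
            # Step 6.3: some found term t must be a suffix of tc.
            has_term_suffix = any(tc.endswith(t) for t in known_terms)
            # Step 6.4: tc must occur at least min_count times in the text.
            if not has_term_suffix or text.count(tc) < min_count:
                continue
        kept[tc] = confidence
    return kept


kept = filter_low_confidence(
    {
        "right triangle": "low",
        "blue thing": "low",
        "isosceles triangle": "low",
        "angle": "high",
    },
    known_terms={"triangle"},
    text="a right triangle and another right triangle; one isosceles triangle",
    min_count=2,
)
```

In the toy data, "blue thing" fails the suffix check and "isosceles triangle" occurs only once, so both are removed.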
step 7: performing boundary detection on term candidates using the term candidate set generated in step 5 and the term candidate set checked by the word construction rule generated in step 6 to generate a new term set, including:
step 7.1: manually constructing a boundary word list;
step 7.2: performing term boundary detection according to the boundary vocabulary obtained in the step 7.1 and the current term candidate set and the term candidate set checked by the word construction rule;
step 8: counting the number n of elements in the term set generated in step 7 and comparing it with the number of elements in the term set before step 7 was executed; if the two numbers are equal, ending the program and returning the current term set; otherwise, executing step 9;
step 9: the computer correcting the word segmentation result and returning to step 3, comprising:
step 9.1: letting t be a term string in the term set that the current word segmentation system divides into n words w_1, ..., w_n with corresponding parts of speech p_1, ..., p_n; adding t as a new word to the word segmentation system so that t is segmented as a single word; if p_n is noun class n, adjective class a or verb class v, setting the part of speech of t to be the same as p_n; otherwise, setting the part of speech of t to noun class n;
step 9.2: updating all current word segmentation results, recalculating the word frequency of all words, and jumping to step 3.
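The part-of-speech rule of step 9.1 depends only on the tag of the last word of the term. A minimal sketch follows; the function name is hypothetical, and the tag "d" used in the example is an illustrative tag outside the three classes named in the claim.

```python
def part_of_speech_for_term(parts_of_speech):
    """Sketch of step 9.1: choose the tag for a term added as a new word.

    parts_of_speech is the list p_1, ..., p_n of tags of the words that the
    current segmentation produces for the term.
    """
    last = parts_of_speech[-1]
    # Inherit the last word's tag if it is noun (n), adjective (a) or verb (v);
    # otherwise fall back to noun.
    return last if last in {"n", "a", "v"} else "n"


tag = part_of_speech_for_term(["a", "n"])
```

A term whose last word is a verb would keep the tag "v", while a term ending in any other class is registered as a noun.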
CN201710833633.4A 2017-09-15 2017-09-15 Term extraction method based on definition and relation Active CN107577670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710833633.4A CN107577670B (en) 2017-09-15 2017-09-15 Term extraction method based on definition and relation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710833633.4A CN107577670B (en) 2017-09-15 2017-09-15 Term extraction method based on definition and relation

Publications (2)

Publication Number Publication Date
CN107577670A CN107577670A (en) 2018-01-12
CN107577670B true CN107577670B (en) 2020-09-22

Family

ID=61033612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710833633.4A Active CN107577670B (en) 2017-09-15 2017-09-15 Term extraction method based on definition and relation

Country Status (1)

Country Link
CN (1) CN107577670B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN105550200A (en) * 2015-12-02 2016-05-04 北京信息科技大学 Chinese segmentation method oriented to patent abstract
CN106294320A (en) * 2016-08-04 2017-01-04 武汉数为科技有限公司 A kind of terminology extraction method and system towards scientific paper
CN106445921A (en) * 2016-09-29 2017-02-22 北京理工大学 Chinese text term extracting method utilizing quadratic mutual information
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9116865B2 (en) * 2012-12-05 2015-08-25 Chegg, Inc. Enhancing automated terms listings in HTML document publishing based on user searches

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
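Text definition extraction method based on syntactic analysis (基于句法分析的文本定义抽取方法); Zhang Shuzheng (张淑征) et al.; Journal of Hainan University, Natural Science Edition (《海南大学学报自然科学版》); 2016-02-29; Vol. 34, No. 2; full text *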

Also Published As

Publication number Publication date
CN107577670A (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN111104789B (en) Text scoring method, device and system
Liu et al. Review sentiment scoring via a parse-and-paraphrase paradigm
Brychcín et al. HPS: High precision stemmer
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
Tabassum et al. A survey on text pre-processing & feature extraction techniques in natural language processing
CN112668319B (en) Vietnamese news event detection method based on Chinese information and Vietnamese statement method guidance
KR20080021017A (en) Comparing text based documents
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN113377916B (en) Extraction method of main relations in multiple relations facing legal text
CN113312922B (en) Improved chapter-level triple information extraction method
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN112069312A (en) Text classification method based on entity recognition and electronic device
WO2014002774A1 (en) Synonym extraction system, method, and recording medium
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
Al-Azzawy et al. Arabic words clustering by using K-means algorithm
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN107577670B (en) Term extraction method based on definition and relation
Bloodgood et al. Data cleaning for xml electronic dictionaries via statistical anomaly detection
Hermansson et al. Tracking amendments to legislation and other political texts with a novel minimum-edit-distance algorithm: DocuToads
Lu et al. Attributed rhetorical structure grammar for domain text summarization
Mansouri et al. Named entity recognition using a new fuzzy support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant