CN110162791B - Text keyword extraction method and system for national defense science and technology field - Google Patents


Info

Publication number
CN110162791B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910438831.XA
Other languages
Chinese (zh)
Other versions
CN110162791A (en
Inventor
孙孟阳
晏裕生
姚晗
董文轩
程洁丹
江洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Institute Of Marine Technology & Economy
Original Assignee
China Institute Of Marine Technology & Economy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Institute Of Marine Technology & Economy
Priority to CN201910438831.XA
Publication of CN110162791A
Application granted
Publication of CN110162791B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text keyword extraction method and system for the field of national defense science and technology. Using a large number of samples, the method trains a machine processing mechanism that accurately extracts, from texts in a given field of national defense science and technology, keywords representing the main content of an article. The quality and quantity of the training samples ensure the correctness and authority of the extracted keywords, and the complete training method ensures that the extraction process can be improved continuously. The keywords are extracted according to their conceptual features, so even a keyword that does not appear in the article can accurately reflect the topic of the article through its semantic features. This solves the problems of inaccurate keyword extraction and low retrieval hit rate in methods based on word frequency.

Description

Text keyword extraction method and system for national defense science and technology field
Technical Field
The invention relates to the technical field of article retrieval, in particular to a text keyword extraction method and system for the national defense science and technology field.
Background
For a retrieval system, the core problem is extracting keywords that represent the main content of a text, so that the text can be retrieved quickly when a user searches for those keywords. The automatic keyword extraction methods currently used in the field of national defense science and technology are mainly based on word-frequency statistics. This approach is not entirely sound: the extracted keywords cannot fully express the topic of the article, which lowers the user's retrieval hit rate and hinders the effective use of resources.
Disclosure of Invention
The invention aims to provide a text keyword extraction method and system for the field of national defense science and technology, and aims to solve the problem that extraction of keywords by a keyword extraction method mainly based on word frequency statistics is inaccurate.
To achieve this purpose, the invention provides the following scheme:
A text keyword extraction method for the field of national defense science and technology comprises the following steps:
acquiring a large number of electronic texts in the national defense science and technology field as training samples;
extracting first-class keywords of the training sample according to the bibliographic information of the electronic text;
extracting second-class keywords of the training sample by applying a feature judgment rule, the feature judgment rule comprising a character rule and a collocation rule;
calculating the similarity between the first-class keywords and the second-class keywords with an edit distance algorithm;
judging whether the similarity is higher than a similarity threshold to obtain a first judgment result;
if the first judgment result is that the similarity is higher than the similarity threshold, adding the second-class keywords as keywords of the electronic text;
and if the first judgment result is that the similarity is not higher than the similarity threshold, modifying the feature judgment rule and returning to the step of extracting the second-class keywords of the training sample by applying the feature judgment rule.
Optionally, acquiring a large number of electronic texts in the national defense science and technology field as training samples specifically includes:
acquiring more than 50000 electronic texts in the national defense science and technology field as training samples; each electronic text comprises an original document and corresponding bibliographic information; the bibliographic information comprises the title, abstract, author affiliation, publishing organization, publication time, conference name, academic degree, journal name, original-text link and keyword item of the original document, wherein the keyword item comprises at least 3 keywords of the original document.
Optionally, extracting the first-class keywords of the training sample according to the bibliographic information of the electronic text specifically includes:
extracting the first 3 to 5 keywords in the keyword item as the first-class keywords of the training sample.
Optionally, extracting the second-class keywords of the training sample by applying the feature judgment rule specifically includes:
dividing the training sample into a series of words with a word segmentation algorithm based on a hidden Markov model;
and extracting, according to the feature judgment rule, the words in the series that conform to the character rule or the collocation rule as the second-class keywords of the training sample.
A text keyword extraction system for the field of national defense science and technology, the system comprising:
a training sample acquisition module, used for acquiring a large number of electronic texts in the national defense science and technology field as training samples;
a first-class keyword extraction module, used for extracting first-class keywords of the training sample according to the bibliographic information of the electronic text;
a second-class keyword extraction module, used for extracting second-class keywords of the training sample by applying a feature judgment rule, the feature judgment rule comprising a character rule and a collocation rule;
a similarity calculation module, used for calculating the similarity between the first-class keywords and the second-class keywords with an edit distance algorithm;
a similarity judgment module, used for judging whether the similarity is higher than a similarity threshold to obtain a first judgment result;
a keyword extraction module, used for adding the second-class keywords as keywords of the electronic text if the first judgment result is that the similarity is higher than the similarity threshold;
and a keyword re-extraction module, used for modifying the feature judgment rule and returning to the second-class keyword extraction module if the first judgment result is that the similarity is not higher than the similarity threshold.
Optionally, the training sample acquisition module specifically includes:
a training sample acquisition unit, used for acquiring more than 50000 electronic texts in the national defense science and technology field as training samples; each electronic text comprises an original document and corresponding bibliographic information; the bibliographic information comprises the title, abstract, author affiliation, publishing organization, publication time, conference name, academic degree, journal name, original-text link and keyword item of the original document, wherein the keyword item comprises at least 3 keywords of the original document.
Optionally, the first-class keyword extraction module specifically includes:
a first-class keyword extraction unit, used for extracting the first 3 to 5 keywords in the keyword item as the first-class keywords of the training sample.
Optionally, the second-class keyword extraction module specifically includes:
a word segmentation unit, used for dividing the training sample into a series of words with a word segmentation algorithm based on a hidden Markov model;
and a second-class keyword extraction unit, used for extracting, according to the feature judgment rule, the words in the series that conform to the character rule or the collocation rule as the second-class keywords of the training sample.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention provides a text keyword extraction method and system for the field of national defense science and technology. Using a large number of samples, it trains a machine processing mechanism that accurately extracts, from texts in a given field of national defense science and technology, keywords representing the main content of an article. The quality and quantity of the training samples ensure the correctness and authority of the extracted keywords, and the complete training method ensures that the extraction process can be improved continuously. The keywords are extracted according to their conceptual features, so even a keyword that does not appear in the article can accurately reflect the topic of the article through its semantic features. This solves the problems of inaccurate keyword extraction and low retrieval hit rate in methods based on word frequency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings provided by the present invention without any creative effort.
FIG. 1 is a flow chart of the text keyword extraction method for the field of national defense science and technology provided by the invention;
FIG. 2 is a basic schematic diagram of the text keyword extraction method for the field of national defense science and technology provided by the invention;
FIG. 3 is a system structure diagram of the text keyword extraction system for the field of national defense science and technology provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a text keyword extraction method and system for the field of national defense science and technology, and aims to solve the problem that extraction of keywords by a keyword extraction method mainly based on word frequency statistics is inaccurate.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a method for extracting text keywords in the field of defense science and technology. FIG. 2 is a basic schematic diagram of the text keyword extraction method for the field of defense science and technology provided by the invention. Referring to fig. 1 and 2, the method for extracting text keywords in the national defense technology field includes:
Step 101: acquiring a large number of electronic texts in the national defense science and technology field as training samples.
A large number of electronic texts in a given field of national defense science and technology are obtained as training samples. An electronic text in such a field is a written carrier of information related to that field, and includes science and technology reports, conference papers, journal articles, news items, books, academic papers, patent information and the like. Here, "a large number" means more than 50000 electronic texts. Each electronic text includes an original document and the corresponding bibliographic information, which comprises the title, abstract, author affiliation, publishing organization, publication time, conference name, academic degree, journal name, original-text link and keyword item of the original document. The keyword item in the bibliographic information contains at least 3 keywords. Because these keywords come from bibliographic information supplied by official sources, they have high authority and accuracy.
At the same time, a concept system and feature judgment rules are established for the given field of national defense science and technology; together they form the conceptual feature database of that field. The specific steps are as follows:
S1.1: A concept system of the given field of national defense science and technology is formed from knowledge accumulated through long-term work in that field. The concept system may include multiple levels, and each level is subordinate to the one above it. Taking the engineering science field of national defense science and technology as an example, engineering science is the first level; its subordinate concept nodes, such as mechanical engineering, engineering thermophysics and electrical engineering, form the second level; the concept nodes subordinate to mechanical engineering, such as mechanical robots, transmission mechanics and mechanical dynamics, form the third level; and so on, until the concept nodes at the lowest level are suitable for use as common keywords and cannot be subdivided further. The resulting concept system is used in step S1.2.
S1.2: For each node at each level of the concept system formed in S1.1, the definition, the characteristics and the Chinese natural-language descriptions of related concepts, all accumulated through long-term work, are converted into feature judgment rules that cannot be subdivided further. Each rule has a unique rule number. The feature judgment rules fall into two types. The first type is the character rule, which comprises the synonyms, near-synonyms, related words, related phrases and related abbreviations of a concept node. The second type is the collocation rule, a logical judgment condition obtained by summarizing the fixed collocations, sentence patterns and grammar rules related to a concept node, and expressed as a regular expression. The rules are used in step 103.
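The multi-level concept system of S1.1 can be pictured as a tree whose lowest-level nodes serve as candidate keywords. The following is a minimal sketch, assuming a nested-dictionary representation; the names `concept_system` and `leaf_nodes` are illustrative and do not come from the patent, and the second-level nodes other than mechanical engineering are shown without children only because the text lists none.

```python
# Hypothetical nested-dict representation of the example concept system.
# An empty dict marks a node that is not subdivided further in this sketch.
concept_system = {
    "engineering science": {
        "mechanical engineering": {
            "mechanical robots": {},
            "transmission mechanics": {},
            "mechanical dynamics": {},
        },
        "engineering thermophysics": {},
        "electrical engineering": {},
    }
}


def leaf_nodes(tree, path=()):
    """Yield the full path of every node that has no child nodes."""
    for name, children in tree.items():
        if children:
            yield from leaf_nodes(children, path + (name,))
        else:
            yield path + (name,)


if __name__ == "__main__":
    for node_path in leaf_nodes(concept_system):
        print(" > ".join(node_path))
```

Keeping the full path for each leaf makes it possible to attach rule numbers to a node without ambiguity when two branches use the same node name.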
Step 102: extracting first-class keywords of the training sample according to the bibliographic information of the electronic text.
A database is established from the electronic text information acquired in step 101. The data in the database is stored mainly in the formats of data table 1, data table 2 and data table 3.
The header of data table 1 includes the sample number (the unique identification number of the sample), title, abstract, author affiliation, publishing organization, publication time, conference name, academic degree, journal name, original-text link, first-class keywords, second-class keywords, rule numbers and keyword similarity.
The first-class keywords are the keywords that come with the training sample. They are provided by the author of the sample, the publishing organization or another authority, and have high accuracy and authority. The first 3 to 5 keywords in the keyword item are extracted as the first-class keywords of the training sample. The second-class keywords are the keywords extracted by machine with the method of the invention. Through multiple iterations, the method makes the second-class keywords gradually converge to the first-class keywords, so that generating the second-class keywords becomes an automatic keyword extraction process with high accuracy and strong authority.
The invention sets up a dedicated data file server for the database; the original-text link of a document refers to the location on that server where the text is stored. The first-class keyword field and the second-class keyword field hold, respectively, the existing keywords of the sample and the keywords formed by the feature description method. The rule number field records the number of the feature judgment rule matched by each second-class keyword, and the keyword similarity field records the similarity between the first-class keywords and the second-class keywords.
The header of data table 2 includes the keyword number, second-class keyword, rule number and sample number. The header of data table 3 includes the keyword number, first-class keyword and sample number.
The bibliographic data obtained in step 101 is stored in the database established in step 102, and the extracted first-class keywords are stored in the keyword fields (i.e. data table 3), forming the database used in step 103.
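The three data tables described above can be sketched as a relational schema. The example below is an illustration only, using Python's standard `sqlite3` module; the patent does not name a database engine, and every table name, column name and column type here is an assumption chosen to mirror the listed headers.

```python
import sqlite3

# Hypothetical schema mirroring the headers of data tables 1, 2 and 3.
SCHEMA = """
CREATE TABLE data_table_1 (
    sample_no              TEXT PRIMARY KEY,  -- unique sample identification number
    title                  TEXT,
    abstract               TEXT,
    author_affiliation     TEXT,
    publishing_org         TEXT,
    publication_time       TEXT,
    conference_name        TEXT,
    degree                 TEXT,
    journal_name           TEXT,
    original_text_link     TEXT,              -- location on the data file server
    first_class_keywords   TEXT,
    second_class_keywords  TEXT,
    rule_numbers           TEXT,              -- same order as second_class_keywords
    keyword_similarity     REAL
);
CREATE TABLE data_table_2 (
    keyword_no             INTEGER PRIMARY KEY,
    second_class_keyword   TEXT,
    rule_no                TEXT,
    sample_no              TEXT REFERENCES data_table_1(sample_no)
);
CREATE TABLE data_table_3 (
    keyword_no             INTEGER PRIMARY KEY,
    first_class_keyword    TEXT,
    sample_no              TEXT REFERENCES data_table_1(sample_no)
);
"""


def create_database(path=":memory:"):
    """Create the three tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_database()
    conn.execute(
        "INSERT INTO data_table_3 (first_class_keyword, sample_no) VALUES (?, ?)",
        ("example keyword", "S0001"),
    )
    print(conn.execute("SELECT * FROM data_table_3").fetchall())
```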
Step 103: extracting second-class keywords of the training sample by applying the feature judgment rule.
Based on the results of step 101 and step 102, keywords are extracted from each training sample with the conceptual feature judgment rules to form the second-class keywords. The feature judgment rules comprise character rules and collocation rules. The specific steps are as follows:
S2.1: For the text data files from step 102, each training sample file is divided into a series of words with a word segmentation algorithm based on an HMM (hidden Markov model).
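HMM-based segmentation is commonly formulated as tagging each character with B (begin), M (middle), E (end) or S (single-character word) and decoding the most likely tag sequence with the Viterbi algorithm. The sketch below follows that common formulation, which the patent does not spell out. All probabilities are hand-set placeholders rather than trained values, and the emission model is uniform by default, so the output shows the mechanics only, not realistic word boundaries.

```python
import math

STATES = "BMES"
NEG_INF = -math.inf

# Placeholder log-probabilities (assumptions, not trained parameters).
START = {"B": math.log(0.6), "S": math.log(0.4), "M": NEG_INF, "E": NEG_INF}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.6), "S": math.log(0.4)},
    "S": {"B": math.log(0.6), "S": math.log(0.4)},
}


def uniform_emit(state, char):
    """Stand-in emission log-probability; a trained model would replace this."""
    return 0.0


def viterbi(text, emit=uniform_emit):
    """Return the most likely B/M/E/S tag sequence for the text."""
    if not text:
        return []
    score = [{s: START[s] + emit(s, text[0]) for s in STATES}]
    back = []
    for char in text[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + TRANS[p].get(s, NEG_INF))
            cur[s] = prev[best] + TRANS[best].get(s, NEG_INF) + emit(s, char)
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # A valid segmentation must end with E or S.
    tags = [max("ES", key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def segment(text, emit=uniform_emit):
    """Cut the text after every character tagged E or S."""
    words, start = [], 0
    for i, tag in enumerate(viterbi(text, emit)):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words


if __name__ == "__main__":
    print(segment("航天器在轨维护"))
```

In practice the start, transition and emission probabilities would be estimated from a segmented corpus of the target field.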
S2.2: A computer program compares the segmented text data formed in S2.1 with the character rules formed in S1.2 and judges whether the words conform to the character rules. If they do, the matching word is marked as a second-class keyword and stored in the second-class keyword field of data table 1, together with the corresponding rule number; the rule numbers are stored in the same order as the keywords. For example, suppose node A has the following character rules: synonym B, near-synonym C, and related words, related phrases and related abbreviations D, E, F, …. When, in a piece of segmented text data, B occurs n times, or C occurs m times, or k of the words D, E, F, … occur j times (j, k, m and n are adjusted according to the node and the samples), the word A is stored in the second-class keyword field of the training record together with the corresponding rule number, and the data table in the database is updated for use in step 104.
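The character-rule example can be sketched as a threshold test over word counts. This is an illustration under stated assumptions: the `CharacterRule` class and its field names are hypothetical, and "k of the words occur j times" is read here as "at least k distinct related words each occur at least j times", which is one possible reading of the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CharacterRule:
    rule_no: str            # unique rule number
    node: str               # concept node A, stored as the keyword when the rule fires
    synonym: str            # B
    synonym_min: int        # n
    near_synonym: str       # C
    near_synonym_min: int   # m
    related: tuple          # D, E, F, ...
    related_words_min: int  # k
    related_count_min: int  # j


def rule_fires(rule, words):
    """Return True if the segmented words satisfy any branch of the rule."""
    counts = Counter(words)
    if counts[rule.synonym] >= rule.synonym_min:
        return True
    if counts[rule.near_synonym] >= rule.near_synonym_min:
        return True
    frequent = sum(1 for w in rule.related if counts[w] >= rule.related_count_min)
    return frequent >= rule.related_words_min


def extract_by_character_rules(rules, words):
    """Return (keyword, rule number) pairs in matching order."""
    return [(r.node, r.rule_no) for r in rules if rule_fires(r, words)]


if __name__ == "__main__":
    rule = CharacterRule("R001", "A", "B", 2, "C", 3, ("D", "E", "F"), 2, 1)
    print(extract_by_character_rules([rule], ["B", "x", "B"]))
```

Returning the keyword and rule number as a pair keeps the two in the same order, as the storage step requires.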
S2.3: A computer program compares the segmented words formed in S2.1 with the collocation rules formed in S1.2 and judges whether the words conform to the collocation rules. If they do, the corresponding keyword is marked as a second-class keyword of the text data and stored in the second-class keyword field of data table 1, together with the corresponding rule number; the rule numbers are stored in the same order as the keywords. For example, the node "aircraft on-orbit maintenance and service technology" includes collocation rules such as "spacecraft.{0,i}on-orbit.{0,j}maintenance" and "on-orbit.{0,i}(maintenance|service)". In these rules, ".{0,i}" means that 0 to i characters lie between the preceding and following content, and "(maintenance|service)" means that either "maintenance" or "service" occurs. When the segmented text data matches the collocation rules n times, or matches m of the rules (m and n are adjusted according to the node and the samples), the text data is marked with "aircraft on-orbit maintenance and service technology" as a keyword. The keyword is stored in the second-class keyword field of the training sample data record, the rule number corresponding to each second-class keyword is stored at the same time, and data table 1 in the database is updated for use in step 104.
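The collocation rules map directly onto regular expressions with bounded gaps. The sketch below uses Python's `re` module with the English renderings of the example rules. The gap bounds (10 characters), the rule numbers and the default thresholds are assumptions made for illustration, and the patterns are applied to the raw text rather than to the segmented words for simplicity.

```python
import re

NODE = "aircraft on-orbit maintenance and service technology"

# Hypothetical rule numbers; i = j = 10 chosen arbitrarily for the example.
COLLOCATION_RULES = [
    ("C001", re.compile(r"spacecraft.{0,10}on-orbit.{0,10}maintenance")),
    ("C002", re.compile(r"on-orbit.{0,10}(?:maintenance|service)")),
]


def match_collocations(text, rules=COLLOCATION_RULES, min_hits=1, min_rules=1):
    """Return (keyword, fired rule numbers) if the thresholds are met, else None.

    min_hits plays the role of n (total matches) and min_rules the role of m
    (number of distinct rules matched); either condition is sufficient.
    """
    fired, total_hits = [], 0
    for rule_no, pattern in rules:
        hits = sum(1 for _ in pattern.finditer(text))
        if hits:
            fired.append(rule_no)
            total_hits += hits
    if fired and (total_hits >= min_hits or len(fired) >= min_rules):
        return NODE, fired
    return None


if __name__ == "__main__":
    sample = "the spacecraft needs on-orbit maintenance and on-orbit service"
    print(match_collocations(sample))
```

Note that the keyword returned is the node name, not a string found in the text, which is how a keyword absent from the article can still be assigned to it.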
Step 104: calculating the similarity between the first-class keywords and the second-class keywords with an edit distance algorithm.
The similarity between the first-class keywords and the second-class keywords of each training sample record is calculated. If the result is within a tolerable range, keyword extraction ends; otherwise the method proceeds to step 107. The specific scheme of step 104 is as follows:
For the first-class keywords and second-class keywords generated in step 102 and step 103, the edit distance (Levenshtein distance) algorithm is used to calculate the similarity between them.
The edit distance is the minimum number of edit operations required to change one character string into another. The edit operations are replacing one character with another, inserting one character, and deleting one character. In general, the smaller the edit distance, the greater the similarity between the two strings.
The basic principle of the edit distance algorithm is as follows. Let d[i, j] denote the minimum number of operations required to convert the string s[1…i] into the string t[1…j]. When i equals 0, the string s is empty, and converting it into t[1…j] requires adding j characters, so d[0, j] = j. When j equals 0, the string t is empty, and converting s[1…i] into it requires deleting i characters, so d[i, 0] = i.
To convert s[1…i] into t[1…j] with the fewest edits, the conversion up to the last step must itself use the fewest edits, so that at most one further operation completes the conversion of s[1…i] into t[1…j]. The state before the last step falls into three cases:
1) s[1…i] has been converted into t[1…j-1] by k operations;
2) s[1…i-1] has been converted into t[1…j] by k operations;
3) s[1…i-1] has been converted into t[1…j-1] by k operations.
For case 1, appending t[j] to the result completes the conversion, for a total of k+1 operations. For case 2, s[i] is deleted and the k operations are then performed, for a total of k+1 operations. For case 3, s[i] is replaced with t[j] at the end so that s[1…i] equals t[1…j], which also takes k+1 operations in total; if s[i] already equals t[j], only k operations are needed.
To ensure that the resulting number of operations is always the smallest, the cheapest of the three cases is chosen as the minimum number of operations required to convert s[1…i] into t[1…j].
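The boundary conditions and the three cases above can be written compactly as the standard Levenshtein recurrence, using the same symbols d, s, t, i and j; the indicator name cost is introduced here only for notation:

```latex
\begin{aligned}
d[i,0] &= i, \qquad d[0,j] = j,\\
d[i,j] &= \min\bigl\{\, d[i,j-1] + 1,\;\; d[i-1,j] + 1,\;\; d[i-1,j-1] + \mathrm{cost}(i,j) \,\bigr\},\\
\mathrm{cost}(i,j) &=
\begin{cases}
0, & s[i] = t[j],\\
1, & s[i] \neq t[j].
\end{cases}
\end{aligned}
```

The three arguments of the minimum correspond to case 1 (insertion), case 2 (deletion) and case 3 (replacement or match), in that order.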
The basic steps of the edit distance algorithm are:
(1) Construct a matrix with n+1 rows and m+1 columns that stores the number of operations required for each partial conversion; the number of operations required to convert the string s[1…n] into the string t[1…m] is the value of Matrix[n][m].
(2) Initialize the first column of the matrix with 0 to n and the first row with 0 to m. Matrix[0][j] is the value in column j+1 of row 1 and represents the number of operations required to convert the empty string s[1…0] into t[1…j]. Converting an empty string into a string of length j requires exactly j insertions, so Matrix[0][j] = j; likewise, Matrix[i][0] = i.
(3) Examine each character s[i] for i from 1 to n.
(4) Examine each character t[j] for j from 1 to m.
(5) Compare s[i] with t[j]: if the two characters are equal, set the cost value cost to 0; otherwise set cost to 1.
(6) First, if s[1…i-1] can be converted into t[1…j] in k operations, i.e. d[i-1, j] = k, then s[i] can be deleted and the k operations performed, requiring k+1 operations in total. Second, if s[1…i] can be converted into t[1…j-1] in k operations, i.e. d[i, j-1] = k, then t[j] can be appended, requiring k+1 operations in total. Third, if s[1…i-1] can be converted into t[1…j-1] in k operations, i.e. d[i-1, j-1] = k, then s[i] can be replaced with t[j] so that s[1…i] equals t[1…j], requiring k + cost operations in total. The cost value is added because no replacement is needed if s[i] equals t[j]; otherwise one more replacement is needed, giving k+1 operations.
To obtain the minimum number of operations, the operation counts of the three cases are compared and the smallest is taken as the value of d[i, j]. Steps (3), (4), (5) and (6) are repeated until d[n, m], the final edit distance, is obtained.
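The steps above correspond to the standard dynamic-programming computation of the Levenshtein distance. A minimal sketch follows; the function name `edit_distance` is illustrative, and Python strings are 0-indexed, so s[i] in the text is `s[i - 1]` in the code.

```python
def edit_distance(s, t):
    """Return the Levenshtein distance between strings s and t."""
    n, m = len(s), len(t)
    # Step (1): matrix with n + 1 rows and m + 1 columns.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Step (2): first column holds 0..n, first row holds 0..m.
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # Steps (3) to (6): fill the matrix row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete s[i]
                d[i][j - 1] + 1,         # insert t[j]
                d[i - 1][j - 1] + cost,  # replace s[i] with t[j], or keep it
            )
    return d[n][m]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```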
The invention treats the first-class keywords as a character string str1 and the second-class keywords as a character string str2. Math.Max(str1.length, str2.length) denotes the length of the longer of the two strings, so the similarity of the two classes of keywords is S = 1 - d[n, m] / Math.Max(str1.length, str2.length), where str1.length is the length of str1 and str2.length is the length of str2. The calculated similarity S between the first-class keywords and the second-class keywords is stored in the keyword similarity field of the sample data record (data table 1) for use in step 105.
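The similarity formula can be sketched as follows. The distance helper is a compact two-row variant of the Levenshtein computation, and returning 1.0 when both strings are empty is an assumption, since the formula is undefined in that case and the text does not cover it.

```python
def edit_distance(s, t):
    """Levenshtein distance, keeping only two rows of the matrix."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def keyword_similarity(str1, str2):
    """S = 1 - d / max(len(str1), len(str2)); 1.0 for two empty strings (assumed)."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(str1, str2) / longest


if __name__ == "__main__":
    print(keyword_similarity("kitten", "sitting"))
```

Because the edit distance never exceeds the length of the longer string, S always lies between 0 and 1.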
Step 105: judging whether the similarity is higher than a similarity threshold to obtain a first judgment result.
Whether the extraction result of the second-class keywords is within a tolerable range can be determined in either of two ways: by collecting the keyword similarity S over the samples, calculating numerical indexes such as the average keyword similarity and the proportion of samples whose keyword similarity is 0, and weighing them against actual requirements; or by judging whether the similarity is higher than a similarity threshold. If the result is within the tolerable range, keyword extraction ends; otherwise the method proceeds to step 107.
In a specific embodiment of the invention, step 105 obtains the first judgment result by judging whether the similarity is higher than the similarity threshold.
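Both ways of deciding tolerability can be sketched in a few lines. The threshold values below (0.6 for the mean and for a single sample, 0.1 for the proportion of zero-similarity samples) are placeholders, since the text leaves them to actual requirements; the function names are also illustrative.

```python
from statistics import mean


def summarize(similarities):
    """Return the mean similarity and the proportion of samples with S == 0."""
    zero_count = sum(1 for s in similarities if s == 0)
    return {
        "mean": mean(similarities),
        "zero_ratio": zero_count / len(similarities),
    }


def within_tolerance(similarities, mean_min=0.6, zero_ratio_max=0.1):
    """Statistical variant: judge the whole sample set (placeholder thresholds)."""
    stats = summarize(similarities)
    return stats["mean"] >= mean_min and stats["zero_ratio"] <= zero_ratio_max


def first_judgment(similarity, threshold=0.6):
    """Threshold variant: True if the similarity is higher than the threshold."""
    return similarity > threshold


if __name__ == "__main__":
    scores = [1.0, 0.5, 0.0, 0.5]
    print(summarize(scores), within_tolerance(scores))
```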
Step 106: if the first judgment result is that the similarity is higher than the similarity threshold, adding the second-class keywords as keywords of the electronic text.
If the extraction result of the second-class keywords is within the tolerable range, the second-class keywords are determined to be the keywords of the training sample and are stored with the corresponding electronic text. If a second-class keyword does not exist in the concept system, a concept node is created for it and the concept system is updated.
With suitable samples and a suitable method, the invention trains a machine processing mechanism that accurately extracts, from texts in a given field of national defense science and technology, keywords representing the main content of an article. Because the second-class keywords are extracted according to specified conceptual features, they can reflect the topic of an article more accurately even when they do not appear in the text. Users can therefore retrieve the corresponding texts quickly through the keywords, which improves the retrieval hit rate and retrieval efficiency for articles in the field of national defense science and technology.
Step 107: if the first judgment result is that the similarity is not higher than the similarity threshold, modify the feature judgment rule and return to the step of extracting the second-class keywords of the training sample using the feature judgment rule.
If the extraction result for the second-class keywords is not within the tolerable range, the rules that produced second-class keywords with low relevance to the first-class keywords are identified and modified: trigger conditions are added and the trigger threshold of each such rule is raised. The specific steps are as follows:
S3.1: Extract the sample records whose keyword similarity S is greater than 0. For each word in the second-class keywords, compute its similarity against the entire first-class keyword field using the edit distance algorithm. Store the second-class keywords whose similarity is 0, together with their rule numbers and sample numbers, in data table 2. For the second-class keywords whose similarity is not 0, keep the 5 words with the highest similarity (or all of them if there are fewer than 5), store them in descending order of similarity, reorder the corresponding rule numbers to match, and update the database for use in S3.3.
S3.2: Extract the sample records whose keyword similarity equals 0, extract each second-class keyword and its rule number from each record, and store them together in data table 2 for use in S3.3.
S3.3: Modify and optimize the rules corresponding to the rule numbers in data table 2, for example by deleting word rules, refining collocation rules, appropriately raising the word-frequency requirement, or adding necessary conditions, so as to raise the trigger threshold of these rules. Update the feature judgment rule with the modified word rules and collocation rules for use in step 103. After the rules corresponding to the rule numbers in data table 2 have been modified and optimized, clear data table 2.
S3.4: For each sample record, compute the edit-distance similarity between each first-class keyword and the entire second-class keyword field; if the similarity is 0, store the first-class keyword and the corresponding sample number in data table 3 for use in S3.5.
S3.5: Determine whether each keyword in data table 3 exists in the concept system. If it does, go to S3.6; if not, create a concept node for it and update the concept system for use in S3.6.
S3.6: Modify and optimize the rules corresponding to the concept nodes, for example by supplementing word rules, simplifying collocation rules, appropriately lowering the word-frequency requirement, or relaxing necessary conditions, so as to lower the trigger threshold of these rules. Update the feature judgment rule for use in step 103. After all keywords in data table 3 have been processed through S3.4 and S3.5, clear data table 3.
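The triage in S3.1 can be sketched as below for a single sample record. This is an illustration under stated assumptions: the similarity is taken here to be `1 - distance / max(length)`, which is only one common way to normalize an edit distance and may differ from the similarity S defined elsewhere in this document; the names `edit_distance`, `similarity`, and `triage_second_class` are hypothetical, and plain lists stand in for data table 2 and the database.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def triage_second_class(sample_id, first_class_field, second_class, top_n=5):
    """Split (keyword, rule_number) pairs as in S3.1.

    Returns (table2_rows, kept): zero-similarity keywords go to table 2 with
    their rule and sample numbers; the rest are ranked and cut to top_n.
    """
    table2_rows, scored = [], []
    for keyword, rule_no in second_class:
        s = similarity(keyword, first_class_field)
        if s == 0:
            table2_rows.append((keyword, rule_no, sample_id))
        else:
            scored.append((s, keyword, rule_no))
    # Sort by similarity, highest first; rule numbers travel with their keyword.
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = [(keyword, rule_no) for _, keyword, rule_no in scored[:top_n]]
    return table2_rows, kept
```

The rule numbers collected in `table2_rows` are the ones S3.3 would tighten; S3.4 to S3.6 mirror the same comparison in the opposite direction to find first-class keywords that no rule currently produces.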
The method provided by the invention automatically extracts keywords from texts in a given field of national defense science and technology based on conceptual feature judgment. Through sample training, it continuously improves the keyword extraction effect for such texts and continuously supplements and refines the concept system and feature judgment rules of the field, so that keyword extraction accuracy keeps improving in practical use and the retrieval hit rate improves accordingly.
Based on the method provided by the invention, the invention also provides a text keyword extraction system for the national defense science and technology field. As shown in fig. 3, the system comprises:
a training sample acquisition module 301, configured to acquire a large number of electronic texts in the national defense science and technology field as training samples;
a first class keyword extraction module 302, configured to extract a first class keyword of the training sample according to bibliographic information of the electronic text;
a second class keyword extraction module 303, configured to extract a second class keyword of the training sample by using a feature judgment rule; the feature judgment rule comprises a word rule and a collocation rule;
a similarity calculation module 304, configured to calculate a similarity between the first category keyword and the second category keyword by using an edit distance algorithm;
a similarity judging module 305, configured to judge whether the similarity is higher than a similarity threshold and obtain a first judgment result;
a keyword extraction module 306, configured to add the second-class keywords as keywords of the electronic text if the first judgment result is that the similarity is higher than the similarity threshold;
and a keyword re-extraction module 307, configured to modify the feature judgment rule and return to the second-class keyword extraction module if the first judgment result is that the similarity is not higher than the similarity threshold.
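How modules 303 to 307 interact can be sketched as a bounded loop. This is an illustrative sketch only: the function `train_extraction`, its callable parameters `extract`, `score`, and `refine`, the default `threshold`, and the `max_rounds` safety cap are all assumptions introduced here; the source does not specify an iteration limit or a threshold value.

```python
def train_extraction(sample, rules, extract, score, refine,
                     threshold=0.6, max_rounds=10):
    """Extract, compare against the threshold, and refine rules until accepted.

    extract(sample, rules) -> second-class keywords       (module 303)
    score(first_class, second_class) -> similarity value   (module 304)
    refine(rules, sample, second_class) -> updated rules   (module 307)
    Returns (accepted_keywords or None, final_rules, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        second_class = extract(sample, rules)
        # Modules 305/306: accept only when the similarity exceeds the threshold.
        if score(sample["first_class"], second_class) > threshold:
            return second_class, rules, round_no
        # Module 307: modify the rules, then return to extraction.
        rules = refine(rules, sample, second_class)
    return None, rules, max_rounds
```

Passing the three stages in as callables keeps the loop independent of how rules are represented, which matters because the rules themselves are what training changes.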
The training sample acquisition module 301 specifically includes:
a training sample acquisition unit, configured to acquire more than 50000 electronic texts in the national defense science and technology field as training samples; each electronic text comprises an original text document and corresponding bibliographic information; the bibliographic information comprises the title, abstract, author affiliation, publisher, publication time, conference name, academic degree, journal name, link to the original text, and keyword field of the original document, wherein the keyword field contains at least 3 keywords of the original document.
The first-class keyword extraction module 302 specifically includes:
a first-class keyword extraction unit, configured to extract the first 3 to 5 keywords in the keyword field as the first-class keywords of the training sample.
The second-class keyword extraction module 303 specifically includes:
a word segmentation unit, configured to segment the training sample into a series of words by using a word segmentation algorithm based on a hidden Markov model;
and a second-class keyword extraction unit, configured to extract, according to the feature judgment rule, the words in the series of words that conform to the word rule or the collocation rule as the second-class keywords of the training sample.
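The second-class keyword extraction unit can be illustrated with a small sketch that takes already segmented words as input (the hidden-Markov-model segmentation step is assumed to have run beforehand and is not reproduced). The rule layout is hypothetical: each word rule is a set of surface terms such as synonyms or abbreviations, a concept keyword, and a minimum total frequency; each collocation rule is a sequence of adjacent words and a concept keyword.

```python
from collections import Counter


def extract_second_class(words, word_rules, collocation_rules):
    """Return (concept_keyword, rule_number) pairs for every rule that fires.

    word_rules:        {rule_no: (set_of_terms, concept, min_count)}
    collocation_rules: {rule_no: (tuple_of_adjacent_words, concept)}
    Both layouts are illustrative assumptions, not the patent's data model.
    """
    counts = Counter(words)
    hits = []
    for rule_no, (terms, concept, min_count) in word_rules.items():
        # A word rule fires when its terms occur often enough in total.
        if sum(counts[term] for term in terms) >= min_count:
            hits.append((concept, rule_no))
    for rule_no, (pattern, concept) in collocation_rules.items():
        n = len(pattern)
        # A collocation rule fires when the pattern appears as adjacent words.
        if any(tuple(words[i:i + n]) == pattern
               for i in range(len(words) - n + 1)):
            hits.append((concept, rule_no))
    return hits
```

Note that the emitted concept keyword need not occur in the text itself, and keeping the rule number alongside each keyword is what lets the later rule-refinement steps trace a poor keyword back to the rule that produced it.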
Text data resources in the national defense science and technology field have diverse forms and structures and a huge volume of information, while the knowledge system architecture of the field is generally highly universal, authoritative, and stable. A retrieval system for multi-source heterogeneous information resources in this field must address an important problem: how to automatically extract, quickly and accurately, keywords from a text that represent its main content, so that users can quickly retrieve the text through those keywords.
At present, automatic keyword extraction for information resources in the national defense science and technology field mainly relies on word-frequency-based methods, which have two problems. On the one hand, they place high demands on building and maintaining the stop-word list and struggle to handle complications such as synonyms, near-synonyms, and multiple forms of expression among keywords. On the other hand, words that merely satisfy a word-frequency requirement cannot fully express the topic of an article, which lowers the retrieval hit rate for users with practical needs.
In this method, a machine processing mechanism that automatically and accurately extracts keywords representing the main content of an article from texts in a given field of national defense science and technology is trained through suitable samples and a suitable method. The quality and quantity of the training samples ensure the correctness and authority of the keywords, and the complete training method ensures that the extraction process keeps improving. The final keywords are extracted according to the conceptual features they denote; even when a keyword does not appear in the article, it can accurately reflect the article's topic through its semantic features. This addresses the inaccurate keyword extraction and low retrieval hit rate of word-frequency-based methods.
The embodiments in this description are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help in understanding the method and the core concept of the present invention. A person skilled in the art may, following the idea of the present invention, change the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.

Claims (8)

1. A text keyword extraction method for the national defense science and technology field is characterized by comprising the following steps:
acquiring a large number of electronic texts in the national defense science and technology field as training samples;
extracting first-class keywords of the training sample according to the bibliographic information of the electronic text; the first-class keywords are the keywords carried by the training sample itself;
extracting second-class keywords of the training sample by using a feature judgment rule; the feature judgment rule comprises a word rule and a collocation rule; the second-class keywords are keywords extracted by a machine based on the feature judgment rule; the word rule comprises synonyms, near-synonyms, related phrases, and related abbreviations of the keywords; the collocation rule comprises fixed collocations, sentence patterns, and grammatical rules related to the keywords;
calculating the similarity between the first-class keywords and the second-class keywords by using an edit distance algorithm;
judging whether the similarity is higher than a similarity threshold to obtain a first judgment result;
if the first judgment result is that the similarity is higher than the similarity threshold, adding the second-class keywords as keywords of the electronic text;
and if the first judgment result is that the similarity is not higher than the similarity threshold, modifying the feature judgment rule and returning to the step of extracting the second-class keywords of the training sample by using the feature judgment rule.
2. The text keyword extraction method according to claim 1, wherein the acquiring of a large number of electronic texts in the national defense science and technology field as training samples specifically comprises:
acquiring more than 50000 electronic texts in the national defense science and technology field as training samples; each electronic text comprises an original text document and corresponding bibliographic information; the bibliographic information comprises the title, abstract, author affiliation, publisher, publication time, conference name, academic degree, journal name, link to the original text, and keyword field of the original document, wherein the keyword field contains at least 3 keywords of the original document.
3. The text keyword extraction method according to claim 2, wherein the extracting of the first-class keywords of the training sample according to the bibliographic information of the electronic text specifically comprises:
extracting the first 3 to 5 keywords in the keyword field as the first-class keywords of the training sample.
4. The text keyword extraction method according to claim 3, wherein the extracting of the second-class keywords of the training sample by using the feature judgment rule specifically comprises:
segmenting the training sample into a series of words by using a word segmentation algorithm based on a hidden Markov model;
and extracting, according to the feature judgment rule, the words in the series of words that conform to the word rule or the collocation rule as the second-class keywords of the training sample.
5. A text keyword extraction system for the national defense science and technology field, characterized in that the system comprises:
a training sample acquisition module, configured to acquire a large number of electronic texts in the national defense science and technology field as training samples;
a first-class keyword extraction module, configured to extract first-class keywords of the training sample according to the bibliographic information of the electronic text; the first-class keywords are the keywords carried by the training sample itself;
a second-class keyword extraction module, configured to extract second-class keywords of the training sample by using a feature judgment rule; the feature judgment rule comprises a word rule and a collocation rule; the second-class keywords are keywords extracted by a machine based on the feature judgment rule; the word rule comprises synonyms, near-synonyms, related phrases, and related abbreviations of the keywords; the collocation rule comprises fixed collocations, sentence patterns, and grammatical rules related to the keywords;
a similarity calculation module, configured to calculate the similarity between the first-class keywords and the second-class keywords by using an edit distance algorithm;
a similarity judging module, configured to judge whether the similarity is higher than a similarity threshold to obtain a first judgment result;
a keyword extraction module, configured to add the second-class keywords as keywords of the electronic text if the first judgment result is that the similarity is higher than the similarity threshold;
and a keyword re-extraction module, configured to modify the feature judgment rule and return to the second-class keyword extraction module if the first judgment result is that the similarity is not higher than the similarity threshold.
6. The text keyword extraction system according to claim 5, wherein the training sample acquisition module specifically comprises:
a training sample acquisition unit, configured to acquire more than 50000 electronic texts in the national defense science and technology field as training samples; each electronic text comprises an original text document and corresponding bibliographic information; the bibliographic information comprises the title, abstract, author affiliation, publisher, publication time, conference name, academic degree, journal name, link to the original text, and keyword field of the original document, wherein the keyword field contains at least 3 keywords of the original document.
7. The text keyword extraction system according to claim 6, wherein the first-class keyword extraction module specifically comprises:
a first-class keyword extraction unit, configured to extract the first 3 to 5 keywords in the keyword field as the first-class keywords of the training sample.
8. The text keyword extraction system according to claim 7, wherein the second-class keyword extraction module specifically comprises:
a word segmentation unit, configured to segment the training sample into a series of words by using a word segmentation algorithm based on a hidden Markov model;
and a second-class keyword extraction unit, configured to extract, according to the feature judgment rule, the words in the series of words that conform to the word rule or the collocation rule as the second-class keywords of the training sample.
CN201910438831.XA 2019-05-24 2019-05-24 Text keyword extraction method and system for national defense science and technology field Active CN110162791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910438831.XA CN110162791B (en) 2019-05-24 2019-05-24 Text keyword extraction method and system for national defense science and technology field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910438831.XA CN110162791B (en) 2019-05-24 2019-05-24 Text keyword extraction method and system for national defense science and technology field

Publications (2)

Publication Number Publication Date
CN110162791A CN110162791A (en) 2019-08-23
CN110162791B true CN110162791B (en) 2023-04-07

Family

ID=67632291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910438831.XA Active CN110162791B (en) 2019-05-24 2019-05-24 Text keyword extraction method and system for national defense science and technology field

Country Status (1)

Country Link
CN (1) CN110162791B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597790B (en) * 2019-09-27 2023-05-02 东方航空技术有限公司 Method for establishing Chinese-English translation database for maintenance of civil aircraft and data card

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN109190111A (en) * 2018-08-07 2019-01-11 北京奇艺世纪科技有限公司 A kind of document text keyword extracting method and device

Also Published As

Publication number Publication date
CN110162791A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN102298635B (en) Method and system for fusing event information
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN109543178B (en) Method and system for constructing judicial text label system
CN101464898B (en) Method for extracting feature word of text
CN101079025B (en) File correlation computing system and method
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN111309925A (en) Knowledge graph construction method of military equipment
CN103049435A (en) Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN104679738A (en) Method and device for mining Internet hot words
CN103678412A (en) Document retrieval method and device
CN111831786A (en) Full-text database accurate and efficient retrieval method for perfecting subject term
CN113157860B (en) Electric power equipment maintenance knowledge graph construction method based on small-scale data
CN107688630A (en) A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme
CN114064851A (en) Multi-machine retrieval method and system for government office documents
CN108509490B (en) Network hot topic discovery method and system
CN107341188A (en) Efficient data screening technique based on semantic analysis
CN115186654A (en) Method for generating document abstract
CN112597768B (en) Text auditing method, device, electronic equipment, storage medium and program product
CN110162791B (en) Text keyword extraction method and system for national defense science and technology field
CN111259661B (en) New emotion word extraction method based on commodity comments
Tahmasebi et al. On the applicability of word sense discrimination on 201 years of modern english

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant