CN115221871A - Multi-feature fusion English scientific and technical literature keyword extraction method - Google Patents

Multi-feature fusion English scientific and technical literature keyword extraction method

Publication number
CN115221871A
Authority
CN
China
Prior art keywords
word
result
words
feature
marking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210725706.9A
Other languages
Chinese (zh)
Other versions
CN115221871B (en
Inventor
毕开龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210725706.9A priority Critical patent/CN115221871B/en
Publication of CN115221871A publication Critical patent/CN115221871A/en
Application granted granted Critical
Publication of CN115221871B publication Critical patent/CN115221871B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In this multi-feature fusion method for extracting keywords from English scientific and technical literature, candidate words are no longer selected; instead, keyword extraction is treated as a keyword topic sequence marking process, and a deep network learning model performs supervised sequence marking. According to the characteristics of scientific and technical literature, the model performs multi-feature word segmentation on the texts in a corpus to split them into word sets, applies feature topic processing to all words, and marks the words in sequence using an annotated keyword position marking file. The words and their features are represented as vectors and spliced together as input, and the marking results of the words are likewise passed to the deep network learning model in vector form for training. The trained model is then used to extract keywords, which greatly improves the efficiency and precision of keyword extraction for English scientific and technical literature. The extraction results of the model are evaluated in real time, and continuous correction further improves the extraction effect.

Description

Multi-feature fusion English scientific and technical literature keyword extraction method
Technical Field
The application relates to a scientific and technical literature database retrieval method, in particular to a multi-feature fusion English scientific and technical literature keyword extraction method, and belongs to the technical field of scientific and technical big data keyword extraction.
Background
In learning and scientific research, retrieving relevant scientific and technical documents and resources from the internet is an indispensable step, and finding the desired resources quickly and accurately is a great help. However, the amount of network information now grows exponentially, scientific and technical document resources are as vast as the sea, and it is almost impossible to review documents one by one. A scientific and technical document usually carries only a few keywords provided by the author, and some carry none; because the texts are long, the indexing capability of search engines for such documents is very limited.
Automatically extracting keywords to describe documents addresses this pain point in research and study. Keywords condense the gist of a document and describe its core content simply and effectively. Correct keywords let readers grasp the core content accurately and quickly when consulting a document, which greatly reduces the time spent consulting literature, so developing automatic keyword extraction technology for scientific and technical literature has great application value.
Traditional full-text indexing is inefficient both in building and in searching indexes for resources; building an index over long texts or whole documents is especially time-consuming, and search efficiency suffers as well. Automatically extracting the keyword information of a document and using the keywords as its index can greatly improve retrieval efficiency. Moreover, the keywords shown with each document in the result of a retrieval request briefly present its gist and core content, which greatly improves the efficiency of finding the right document. Keyword extraction can also be used to classify and cluster all the documents in a document library, so that a user searching for documents can easily obtain the set of documents belonging to the same category. Documents on similar subjects can be recommended to the user through similar keywords, which greatly reduces the time needed for retrieval and improves retrieval efficiency.
Beyond indexing and classifying document resources, keyword extraction is widely applied elsewhere. In the big data era, resources on the Web and their indexes grow explosively, and how to return the resources a user needs quickly and accurately is a problem search engines must currently solve. By extracting keywords from webpage content and using them as the index of the webpage, related webpages can be matched accurately and quickly when the user queries and returned to the user in time. A webpage snapshot formed from the extracted keywords roughly presents the subject of the page, so the user can judge whether the page is relevant. Moreover, with keywords used as features, related keywords can be suggested according to the user's query, which normalizes the query, filters out many irrelevant webpages, and improves query efficiency.
However, the prior art has obvious shortcomings when applied to keyword extraction from English scientific and technical literature and lacks an adequate method for it. The defects and design difficulties of current methods include:
First, retrieving relevant scientific and technical documents and resources from the internet is very important for learning and scientific research, but network information now grows exponentially, scientific and technical document resources are vast, and it is almost impossible to examine documents one by one. Scientific and technical documents carry only a few author-provided keywords, some carry none, and the provided keywords are strongly subjective. Because the texts are long, the indexing capability of search engines for such documents is very limited, and searches usually supply only broad requirements, so document search is inefficient. Without automatically extracted keywords to describe documents, the time spent consulting documents increases greatly and retrieval accuracy is low; the inability to find the desired resources quickly and accurately hinders learning and scientific research.
Second, authors of scientific and technical documents often provide several keywords, but many documents have no author keywords, and the number provided is small relative to the whole document. Keyword extraction greatly helps document classification and retrieval, but prior-art extraction techniques take candidate words as the input unit and then screen keywords from them. The major drawback is that candidate selection directly determines the final result: because candidate-identification rules can hardly reach 100% or even 90% coverage, many keywords are already lost in this first step, which seriously degrades extraction, and selecting keywords from the candidates raises further problems. The prior art lacks a method that treats keyword extraction as a keyword topic sequence marking process.
Third, the dictionary method of the prior art is easy to implement, but its effectiveness depends entirely on the coverage and accuracy of the dictionary. If both are low, extraction precision and recall are both low; new words and words missing from the dictionary cannot be identified or extracted at all, and the dictionary must be maintained and upgraded continuously to remain usable, which makes the method unsuitable for scientific and technical documents. The efficiency of the dictionary method is inversely proportional to the size of the dictionary it uses, so extraction is efficient only when the dictionary is small. Its scope of application is also narrow: a given dictionary can only be applied to its own field, so it is not suitable for retrieval of scientific and technical documents. Considering the cost of building the dictionary, the value of the dictionary method is low.
Fourth, statistics-based keyword extraction in the prior art is complex and computationally expensive. Statistical analysis performs various analyses and calculations on every word in the whole corpus; the models used, such as naive Bayes, conditional random fields and neural networks, require a large number of operations during training, and implementing them requires a deep understanding of the algorithms, so they are relatively complex and difficult. A statistics-based extraction system also needs a large original corpus: training an extraction model in a supervised manner requires a large number of corpus texts with marked results as training data, and marking a corpus is extremely laborious. For scientific and technical documents in particular, marking accuracy depends heavily on the subjective judgment of the annotator, so this approach is poorly suited to keyword extraction from English scientific and technical documents.
Fifth, prior-art methods for extracting keywords from English scientific and technical literature have further problems. They do not use a deep network learning model for supervised sequence marking, and their algorithms are highly complex. They cannot perform multi-feature word segmentation on the texts in a corpus or feature topic processing on all words, and they give little consideration to the characteristics of keywords in scientific and technical literature. They cannot mark words in sequence using an annotated keyword position marking file; candidate words generally have to be prepared first, which amplifies errors. They cannot represent words and features as vectors spliced together as input, cannot pass the marking results of the words to a deep network learning model in vector form for training, and do not extract keywords with a trained model. As a result, they cannot extract keywords accurately and quickly, they lack real-time evaluation of the model's extraction results and a correction process, and their extraction effect is poor.
Disclosure of Invention
To address these defects of the prior art, the application proposes that candidate words are no longer selected; instead, the keyword extraction process is treated as a keyword topic sequence marking process, and a deep network learning model performs supervised sequence marking. According to the characteristics of scientific and technical literature, the model performs multi-feature word segmentation on the texts in a corpus to split them into word sets, applies feature topic processing to all words, and marks the words in sequence using an annotated keyword position marking file. The words and their features are represented as vectors and spliced together as input, and the marking results of the words are likewise passed to the deep network learning model in vector form for training. The trained model is then used to extract keywords, which greatly improves the efficiency and precision of keyword extraction for English scientific and technical literature. The extraction results of the model are evaluated in real time, and continuous correction further improves the extraction effect.
In order to achieve the technical effects, the technical scheme adopted by the application is as follows:
A multi-feature fusion English scientific and technical literature keyword extraction method converts the keyword extraction process into a keyword topic sequence marking process; model training takes words as the input unit and uses a deep network learning model for supervised sequence marking;
first, keyword extraction is converted into a sequence recognition task: a P/N sequence marking method based on binary classification is adopted, treating the keyword extraction task as binary sequence marking of words, which solves the problem of fragmented keywords in the prediction result;
second, four key features are set for training the model by analyzing the keyword set in the marking results and fusing the features: (1) because professional keywords and proper nouns are significant for the texts at hand, the keywords marked in the marking result files of the corpus and document keywords crawled from the Web together form a prior science and technology dictionary feature (STD); (2) because keywords of scientific and technical literature are very likely to be nouns or verbs, a feature part-of-speech feature (FPOS) is adopted; (3) because the TF-IDF value of a word is important for classifying the word within the corpus, the TF-IDF value of the word is used as a feature; (4) because analysis of the keywords shows that fully capitalized terms are keywords and 30% of terms with a capitalized first letter are keywords, the writing format (C) is adopted as the fourth feature;
third, the words, feature marks and result marks are converted into mathematical representations: words in the text are first converted into 300-dimensional vector representations using the open-source pre-trained word vector model GoogleNews300; the features are then converted into vector representations using a user-defined scheme suited to each feature's format; finally, the result marks are converted into vector representations in a user-defined manner;
fourth, for feature processing of the text, multi-feature word segmentation is performed on the text with the nltk toolkit; the segmentation result is then marked using the result marking file corresponding to the text; the four features of each segmented word are then marked in turn; finally, the words, their feature marks and their result marks are converted into vector representations in turn;
fifth, the concatenation of word vectors and feature vectors is used as the training input of the deep network learning model, the result vectors are passed to the model as target results for training, and keyword extraction is performed with the trained deep network learning model;
sixth, the extraction results of the model are evaluated and corrected in real time: the prediction effect is assessed with a comprehensive evaluation system of three standards, Precision, Recall and F1-score, computed by the conventional method.
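The three evaluation standards named in the sixth step can be illustrated with a minimal sketch. It assumes exact-match comparison between the set of predicted keywords and the set of annotated keywords for one document; the text does not fix the matching rule, and the function name and sample keywords are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted keywords against annotated keywords by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical model output and annotation for one document.
predicted = {"neural network", "keyword extraction", "corpus"}
gold = {"keyword extraction", "neural network", "sequence labeling", "TF-IDF"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Two of the three predictions are correct and two of the four annotated keywords are found, so precision is 2/3, recall is 1/2 and F1 is their harmonic mean, 4/7.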
The multi-feature fusion English scientific and technical literature keyword extraction method is further characterized in that the keyword extraction method based on deep network learning first changes the marking mode to P/N: P indicates that a word belongs to a keyword and N indicates that it does not, which converts the keyword extraction model from a multi-class model into a binary model; adjacent consecutive words marked P are aggregated and extracted as one keyword;
the keyword extraction method based on deep network learning comprises the following steps:
step one: performing multi-feature word segmentation on the texts in the corpus;
step two: performing feature extraction on the result of the multi-feature word segmentation;
step three: performing P/N result marking on the text according to the marking result file of the corpus;
step four: representing the words and the features as vectors and splicing them together as input, representing the P/N marking result as a vector serving as the expected result, and passing both to the deep network learning model for training;
step five: extracting keywords from the corpus with the trained model;
step six: evaluating the performance of the model in real time according to the model's predictions.
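The aggregation of adjacent words marked P into keywords, described above, can be sketched as follows. This is an illustrative reading only: the function name and the sample sentence are hypothetical, and joining words with a single space is an assumption.

```python
def aggregate_keywords(words, labels):
    """Join each maximal run of consecutive P-labelled words into one keyword."""
    keywords, current = [], []
    for word, label in zip(words, labels):
        if label == "P":
            current.append(word)
        elif current:
            # An N label closes the run collected so far.
            keywords.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the text
        keywords.append(" ".join(current))
    return keywords


words = ["We", "train", "a", "conditional", "random", "field", "on", "the", "corpus"]
labels = ["N", "N", "N", "P", "P", "P", "N", "N", "P"]
print(aggregate_keywords(words, labels))
```

For this input the three consecutive P words form the single keyword "conditional random field", and the final P word "corpus" forms a second keyword.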
The multi-feature fusion English scientific and technical literature keyword extraction method is further characterized by its fused feature settings: four features are selected and fused, namely the prior science and technology dictionary (STD), the feature part of speech (FPOS), the TF-IDF value and the writing format (C);
(I) Prior science and technology dictionary
Words with strong domain specificity and specialization in a document are marked, and the marking result is used as a feature for model training, bringing the dictionary method into the training of the keyword extraction model. The acquired keywords are collected and de-duplicated to form a prior keyword dictionary. During text processing, the entries in the dictionary are matched against the text one by one; every word in the text that forms part of a dictionary entry is marked as class P, and words not found in the dictionary are marked as class N. The dictionary marking result is used as an input feature when training the model;
the sources of the prior science and technology dictionary are divided into two parts:
first part: the keyword marking results in the marking result files of the corpus provided by the ScienceIE task are collected, and after repeated keywords and keywords that are meaningless on their own are removed, 5620 keywords are obtained;
second part: external resources are used. Since the corpus is sourced from the ScienceDirect document library, a crawler is used to capture 15000 scientific and technical documents from ScienceDirect, and after repeated keywords are removed, 46569 keywords are obtained in total; the two parts are combined, repeated keywords are removed and some meaningless words are filtered out, finally yielding a keyword dictionary of 50000 keywords;
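The dictionary feature described above can be sketched in two small functions: one merges and de-duplicates the two keyword sources, the other marks each word of a segmented text as P or N. Case-insensitive, token-level matching is an assumption made for the sketch, and all names and sample data are hypothetical.

```python
def build_dictionary(*sources):
    """Merge keyword lists, normalising case and dropping duplicates."""
    merged = set()
    for source in sources:
        for keyword in source:
            keyword = keyword.strip().lower()
            if keyword:
                merged.add(keyword)
    return merged


def mark_dictionary_feature(words, dictionary):
    """Mark every word covered by a dictionary entry as P, all others as N."""
    lowered = [w.lower() for w in words]
    marks = ["N"] * len(words)
    for keyword in dictionary:
        parts = keyword.split()
        n = len(parts)
        # Slide the (possibly multi-word) entry over the token sequence.
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == parts:
                marks[start:start + n] = ["P"] * n
    return marks


corpus_keywords = ["neural network", "Corpus"]        # stand-in for annotated keywords
crawled_keywords = ["neural network", "word vector"]  # stand-in for crawled keywords
dictionary = build_dictionary(corpus_keywords, crawled_keywords)
words = ["A", "neural", "network", "reads", "the", "corpus"]
wd = mark_dictionary_feature(words, dictionary)
print(wd)
```

With a dictionary of tens of thousands of entries, this nested loop would be slow; an index keyed on the first token of each entry is one obvious refinement, left out here for brevity.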
(II) Feature part of speech
Feature part-of-speech marking is carried out with the Python library nltk, which integrates the Stanford NLP toolkit; the keywords of a document include phrases composed of nouns, verbs and adjectives;
the feature parts of speech are divided into four major categories: N, V, J and O. N comprises singular and plural nouns and singular and plural proper nouns; V comprises the base form of verbs, present participles, past participles, gerunds, past tense forms, and third-person and non-third-person singular present forms; J comprises adjectives and their comparative and superlative forms; O represents all parts of speech not covered by the other three categories. Finally, each of the four categories is assigned a value and used, together with the word vectors, as input for training the model;
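The four-way grouping above lines up with the Penn Treebank tag set that nltk's default English tagger emits. The sketch below maps such tags to N, V, J and O; the tagged pairs are written by hand so the block runs without nltk or its model data, and the exact tag-to-category assignment is an assumption based on the description.

```python
# Penn Treebank tags grouped into the four coarse categories.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}               # common and proper nouns
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # all verb forms
ADJ_TAGS = {"JJ", "JJR", "JJS"}                        # adjective, comparative, superlative


def coarse_pos(tag):
    """Collapse a Penn Treebank tag into N, V, J or O."""
    if tag in NOUN_TAGS:
        return "N"
    if tag in VERB_TAGS:
        return "V"
    if tag in ADJ_TAGS:
        return "J"
    return "O"


# Hand-written (word, tag) pairs standing in for a tagger's output.
tagged = [("deep", "JJ"), ("networks", "NNS"), ("extract", "VBP"),
          ("keywords", "NNS"), ("quickly", "RB")]
wp = [coarse_pos(tag) for _, tag in tagged]
print(wp)
```

In a full pipeline the `tagged` list would come from a tagger call such as `nltk.pos_tag(words)`, which returns pairs of the same shape.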
(III) TF-IDF value
The method uses the TF-IDF value of a word as only one of its features and compensates for the shortcomings of TF-IDF by adding the other features. The TF-IDF value of each word is calculated and used as a feature of the word to train the deep network learning model, which then extracts the keywords of the English scientific and technical literature;
(IV) Writing format
Words in the corpus are classified into three categories, UA, UP and L, according to their writing format; the category of each word is used as a feature, each of the three categories is assigned a value, and the feature is fed into the deep network learning model for training.
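A minimal sketch of the writing-format feature follows, assuming UA means every cased letter is uppercase, UP means only the first character is uppercase, and L covers everything else (including numbers and symbols). The function name is hypothetical.

```python
def writing_format(word):
    """Classify a token as UA (all caps), UP (initial cap) or L (other)."""
    if word.isupper():
        # str.isupper() is True only if the token has cased letters, all uppercase.
        return "UA"
    if word[:1].isupper():
        return "UP"
    return "L"


words = ["TF-IDF", "Python", "model", "2022", "iPhone"]
wc = [writing_format(w) for w in words]
print(wc)
```

Note that under this rule a single capital letter such as "A" is classified as UA; whether that is desirable depends on the corpus.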
The multi-feature fusion English scientific and technical literature keyword extraction method is further characterized by the vector representation of words: words in the corpus are converted into vector representations using the pre-trained GoogleNews300 model, and tokens absent from the model are represented by the following specified vectors:
$$W = (\underbrace{1.0, 1.0, \ldots, 1.0}_{300}) \tag{1}$$
$$S = (\underbrace{0.0, 0.0, \ldots, 0.0}_{300}) \tag{2}$$
The 300-dimensional vector W, composed entirely of 1.0, represents words for which the GoogleNews300 model has no corresponding vector; the remaining tokens without a corresponding vector that are not words, such as numbers or symbols of various kinds, are represented by the 300-dimensional vector S, composed entirely of 0.0.
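The fallback rule for tokens missing from the word vector model can be sketched as below. A small dict stands in for the pre-trained 300-dimensional model, and treating any token that contains a letter as a "word" is an assumption; the text only gives numbers and symbols as examples of non-words.

```python
DIM = 300
W = [1.0] * DIM  # fallback for words the model does not contain
S = [0.0] * DIM  # fallback for non-word tokens such as numbers and symbols


def embed(token, model):
    """Return the model vector for a token, or the W/S fallback vector."""
    if token in model:
        return list(model[token])
    if any(ch.isalpha() for ch in token):
        return list(W)
    return list(S)


# Toy stand-in for the pre-trained model: one known word with a dummy vector.
toy_model = {"keyword": [0.1] * DIM}
for token in ["keyword", "zzzunknown", "42", "%"]:
    vector = embed(token, toy_model)
    print(token, len(vector), vector[0])
```

Copies of W and S are returned so that later in-place edits to one token's vector cannot alter the shared fallback vectors.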
The multi-feature fusion English scientific and technical literature keyword extraction method is further characterized by the vector representation of features: in addition to converting each word into a vector representation, the four features of the word are each represented by a vector, as follows:
1) Prior science and technology dictionary: the dictionary marking result has two classes, P and N, represented as P = (0, 1) and N = (1, 0);
2) Feature part of speech: the feature part-of-speech feature has four classes, N, V, J and O, represented as four 4-dimensional vectors: N = (1, 0, 0, 0), V = (0, 1, 0, 0), J = (0, 0, 1, 0), O = (0, 0, 0, 1);
3) TF-IDF value: the TF-IDF feature is a numerical value and is represented as a one-dimensional vector consisting of that value;
4) Writing format: the writing format feature has three classes, UA, UP and L, represented as 3-dimensional vectors UA = (1, 0, 0), UP = (0, 1, 0) and L = (0, 0, 1).
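Splicing the word vector with the four feature vectors can be sketched as follows. The sketch assumes one-hot encodings in which every class occupies a distinct position, and the concatenation order (word vector, dictionary, part of speech, TF-IDF, writing format) is an assumption; the result has 300 + 2 + 4 + 1 + 3 = 310 dimensions.

```python
# One-hot encodings for the three categorical features.
STD_VEC = {"P": [0, 1], "N": [1, 0]}
POS_VEC = {"N": [1, 0, 0, 0], "V": [0, 1, 0, 0], "J": [0, 0, 1, 0], "O": [0, 0, 0, 1]}
FMT_VEC = {"UA": [1, 0, 0], "UP": [0, 1, 0], "L": [0, 0, 1]}


def fuse(word_vector, std, pos, tfidf, fmt):
    """Concatenate a word vector with its four encoded features."""
    return (list(word_vector)
            + STD_VEC[std]
            + POS_VEC[pos]
            + [float(tfidf)]   # the TF-IDF value is a one-dimensional vector
            + FMT_VEC[fmt])


# Dummy 300-dimensional word vector and hypothetical feature marks.
x = fuse([0.5] * 300, std="P", pos="N", tfidf=0.25, fmt="UP")
print(len(x), x[300:])
```

The P/N result mark of each word can be encoded with the same two-dimensional scheme as the dictionary feature to form the training target.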
The multi-feature fusion English scientific and technical literature keyword extraction method further comprises the following text preprocessing:
step 1: reading the content of the document and performing multi-feature word segmentation on the text; the result is stored as Words;
step 2: according to the marking result file corresponding to the document, marking the segmentation result Words in the P/N mode; the marked result is stored as Labels;
step 3: for the four selected features, namely the prior science and technology dictionary (STD), the feature part of speech (FPOS), the TF-IDF value and the writing format (C), marking the features of the segmentation result Words one by one; the corresponding results are stored as wd, wp, wt and wc respectively;
step 4: converting the segmentation result Words, the corresponding marking result Labels and the feature marks wd, wp, wt and wc into their vector representations, and splicing the word vectors with the feature vectors;
Result marking:
step 1: reading the Text of the document, performing multi-feature word segmentation to split it into Words, and at the same time storing the position information of each word in the document as indexes;
step 2: reading the result marking file corresponding to the document and extracting the marking results, that is, the position information of all keywords in the document, including start and end positions; the marking results are sorted in ascending order of start position and stored as kptinds;
step 3: traversing the indexes set against the content of kptinds: a word whose position in the document is not within any position range stored in kptinds does not belong to a keyword topic sequence and is marked N in the P/N marking mode, while a word whose start and end positions fall within a range contained in kptinds is marked P; after the traversal, the set of marking results is stored as Labels.
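The result-marking steps above can be sketched with character offsets. A simple regular expression stands in for the document's word segmentation, keyword spans are assumed to be (start, end) pairs with an exclusive end, and the variable name `kptinds` follows the text; everything else is hypothetical.

```python
import re


def tokenize_with_offsets(text):
    """Split text into tokens and record each token's (start, end) offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def mark_labels(tokens, kptinds):
    """Mark a token P if it lies entirely inside any keyword span, else N."""
    spans = sorted(kptinds)  # ascending by start position, as in the text
    labels = []
    for _, start, end in tokens:
        inside = any(s <= start and end <= e for s, e in spans)
        labels.append("P" if inside else "N")
    return labels


text = "We use a neural network for keyword extraction."
# Build the annotated spans from the text itself to avoid hand-counted offsets.
kptinds = []
for phrase in ["keyword extraction", "neural network"]:
    begin = text.index(phrase)
    kptinds.append((begin, begin + len(phrase)))

tokens = tokenize_with_offsets(text)
labels = mark_labels(tokens, kptinds)
print(list(zip([t for t, _, _ in tokens], labels)))
```

Because the spans are sorted, a single forward pass with a moving span pointer would also work and would scale better on long documents; the `any` scan is kept here for clarity.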
The multi-feature fusion English science and technology literature keyword extraction method further marks each feature of a word before model training:
the first step: extract the prior science and technology dictionary feature (STD) of the words. Traverse the keywords in the constructed keyword dictionary and match each of them against the original text of the document. If a keyword occurs in the text, record the matched positions and mark all words of the multi-feature word segmentation result Words that correspond to the keyword as P, indicating that they belong to a keyword subject sequence. After the traversal, mark all words in Words not marked P as N, indicating that they do not belong to a keyword, and store the result as wd;
the second step: extract the characteristic part-of-speech feature (FPOS) of the words using the natural language processing toolkit nltk. Traverse the word set Words of the multi-feature word segmentation result, tag each word with the nltk part-of-speech tagger, group the resulting tags into the four major classes N, V, J and O following the classification used for the vector feature representation, and after the traversal store the marking result as wp;
the third step: extract the TF-IDF enabling characteristic value feature of the words. The TF-IDF enabling characteristic value of a word is the product of its term frequency TF in the current document and its inverse document frequency IDF over the whole document set. To compute it, first read all texts in the corpus, perform multi-feature word segmentation on each text, remove punctuation marks, merge repeated words, and store the result as a word list recording each word and the number of times it appears in the document. Then, for a given document, traverse its word list and compute the TF of each word from its number of occurrences in the document; traverse the word lists of the whole corpus to count the occurrences of the word across the corpus and compute its IDF; and multiply TF by IDF to obtain the TF-IDF enabling characteristic value of the word. After the word set of the multi-feature word segmentation result has been traversed, store the values as wt;
the fourth step: extract the writing format feature (C) of the words. Words in the corpus fall into three categories by writing format: UA, UP and L. Traverse the word set of the multi-feature word segmentation result of the document and determine the writing format of each word: a word written entirely in capitals is marked UA, a word with only its first letter capitalized is marked UP, and all remaining words are marked L. After the traversal, store the marking result as wc.
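The TF-IDF calculation of the third step can be sketched as below. The application does not state the exact TF and IDF formulas, so this sketch assumes the common definitions TF = count / document length and IDF = log(number of documents / number of documents containing the word); the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_per_document(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    results = []
    for tokens in docs:
        counts = Counter(tokens)   # word list with occurrence counts
        total = len(tokens)
        results.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return results


docs = [["graph", "neural", "network", "graph"],
        ["neural", "network", "training"]]
wt = tfidf_per_document(docs)
```

In this toy corpus "neural" occurs in every document and therefore receives a value of zero, which illustrates why TF-IDF alone can underrate important words in a collection of similar documents.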
The method for extracting the key words of the English scientific and technical literature with multi-feature fusion further comprises the following steps of vector feature transformation: after completing multi-feature word segmentation, result marking and feature extraction of a corpus and before deep network learning model training, converting word itself, marking results wd, wp, wt and wc of features and actual results Labels of the Words into a vector form:
step 1): for the words, traverse the word set Words and use the trained word vector model GoogleNews300 to represent each word as a 300-dimensional vector. A token with no corresponding vector in the model is represented by a 300-dimensional vector whose values are all 1.0 if it is a word, or by a 300-dimensional vector whose values are all 0.0 if it is not a word (for example a number or a symbol). The converted result is stored as WX;
step 2): convert the marking results wd, wp, wt and wc of the features into vector form;
step 3): the marking result wd of the dictionary feature (STD) has two classes, P and N. A mark P is expressed as the vector (0, 1) and a mark N as the vector (1, 0); the converted result is stored as DX;
the marking result wp of the characteristic part-of-speech feature (FPOS) has four classes, N, V, J and O, which are expressed as four 4-dimensional vectors: N as (1, 0, 0, 0), V as (0, 1, 0, 0), J as (0, 0, 1, 0) and O as (0, 0, 0, 1); the converted result is stored as PX;
the marking result wt of the TF-IDF enabling characteristic value feature of the word is expressed as a one-dimensional vector consisting of its numerical value, and the result is stored as TX;
the marking result wc of the writing format feature (C) of the word has three classes, UA, UP and L, which are expressed as 3-dimensional vectors: UA as (1, 0, 0), UP as (0, 1, 0) and L as (0, 0, 1); the converted result is stored as CX;
step 4): for the set Labels of actual result marks of the multi-feature word segmentation result Words, a mark P is expressed as the vector (0, 1) and a mark N as the vector (1, 0); the converted result is stored as Y;
after the representation of the word, the feature and the vector feature of the result is completed, the word vector and the feature vector are connected together according to formula 3 to be used as an input X of model training:
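$$X = WX \oplus DX \oplus PX \oplus TX \oplus CX \qquad \text{(formula 3)}$$
where $\oplus$ denotes concatenation of the vectors of each word.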
And transmitting the vector characteristic representation Y corresponding to the actual result as an expected result to the keyword extraction model for training.
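The concatenation in formula 3 can be illustrated with plain Python lists. This is a sketch only: the helper name build_input is hypothetical, and a 3-dimensional toy word vector stands in for the 300-dimensional GoogleNews300 vector.

```python
def build_input(wx, dx, px, tx, cx):
    """Concatenate the per-word vectors into one input row per word."""
    return [w + d + p + t + c
            for w, d, p, t, c in zip(wx, dx, px, tx, cx)]


# One word, with a toy 3-dimensional word vector in place of 300 dimensions.
WX = [[0.1, 0.2, 0.3]]   # word vector
DX = [[0, 1]]            # dictionary feature: P
PX = [[1, 0, 0, 0]]      # characteristic part of speech: N
TX = [[0.35]]            # TF-IDF value
CX = [[0, 0, 1]]         # writing format: L
X = build_input(WX, DX, PX, TX, CX)
```

With the real 300-dimensional word vectors, each row of X would have 300 + 2 + 4 + 1 + 3 = 310 components.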
The method for extracting the key words of the English scientific and technical literature with multi-feature fusion further comprises the following model evaluation standards: three evaluation criteria of P precision, R recall and F1 are adopted for evaluation.
F1 is the weighted harmonic mean of P and R, see formula 4:
$$F1 = \frac{2 \times P \times R}{P + R} \qquad \text{(formula 4)}$$
F1 combines precision and recall: it is high only when both P and R are high, which is the ideal case, so F1 is taken as the overall performance measure of the system.
For the results of model prediction, they are classified into four categories:
first type, TP: keywords correctly predicted as keywords;
second type, FP: non-keywords mispredicted as keywords;
third type, TN: non-keywords correctly predicted as non-keywords;
fourth type, FN: keywords mispredicted as non-keywords;
according to the four result classification modes, the calculation modes of P and R are obtained:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
The results for P, R and F1 are then calculated.
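As a worked illustration, the sketch below computes the three criteria from the counts TP, FP and FN, assuming the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The function name and the sample counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1); a ratio with a zero denominator is reported as 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts: 6 correct keywords, 2 spurious, 4 missed.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
```

TN does not enter any of the three criteria, which suits keyword extraction because non-keywords vastly outnumber keywords.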
The multi-feature fusion English scientific and technical literature keyword extraction method further provides a model evaluation method: the keyword subject sequences in the prediction result are identified and compared with the keywords in the actual result to calculate P, R and F1, in two steps:
1) Identifying keywords in the prediction result: a sequence of consecutive words predicted as P is regarded as one keyword. The prediction result is traversed, each consecutive P sequence is aggregated into an extracted keyword, and the total number Np of keywords in the prediction result is obtained;
2) Identifying correctly predicted keywords: the keywords extracted from the prediction result are compared with the keywords in the actual result. A predicted keyword counts as correct only if it is completely consistent with the corresponding keyword subject sequence in the actual result; the number of such keywords is Nt. The total number Na of keywords in the actual result is obtained from the result marking files in the corpus;
from these counts, P, R and F1 of the present application are calculated as:
$$P = \frac{N_t}{N_p}$$
$$R = \frac{N_t}{N_a}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
and the model is evaluated based on P, R and F1.
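The two-step evaluation can be sketched as follows. Two assumptions are made for illustration: a keyword is identified by the token index range of its run of P labels, and Na is derived from the actual label sequence rather than read from the result marking file. The function names are hypothetical.

```python
def extract_spans(labels):
    """Return (start, end) token index pairs, one per maximal run of P."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "P" and start is None:
            start = i
        elif label != "P" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def evaluate(pred_labels, gold_labels):
    """Return (Np, Nt, Na, P, R, F1) under exact-match scoring."""
    pred = set(extract_spans(pred_labels))
    gold = set(extract_spans(gold_labels))
    n_p, n_a = len(pred), len(gold)
    n_t = len(pred & gold)  # keywords whose whole sequence matches exactly
    p = n_t / n_p if n_p else 0.0
    r = n_t / n_a if n_a else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return n_p, n_t, n_a, p, r, f1


gold = list("NPPNPNPPP")
pred = list("NPPNNNPPN")
n_p, n_t, n_a, p, r, f1 = evaluate(pred, gold)
```

In this example the last predicted keyword covers only two of the three words of the actual keyword, so it does not count toward Nt.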
Compared with the prior art, the innovation points and advantages of the application are as follows:
Firstly, prior-art keyword extraction takes candidate words as the input unit and then screens the keywords from them, so many keywords are already missed in the first step and the extraction effect is seriously reduced. The present application does not select candidate words; it treats keyword extraction as the marking of keyword subject sequences.
Secondly, the application converts keyword extraction into a sequence recognition task. It adopts a P/N sequence marking method based on binary classification, treating keyword extraction as binary sequence marking of words, which solves the problem of fragmented keywords in the prediction result. By analyzing the keyword set in the marking results, four key features are set for model training: the prior science and technology dictionary feature (STD), built from the keywords marked in the result marking files of the corpus together with document keywords captured from the Web; the characteristic part of speech (FPOS); the TF-IDF enabling characteristic value of the word; and the writing format (C). Describing documents by automatically extracted keywords addresses the poor quality of keyword extraction for scientific and technical literature encountered in research and study. The extracted keywords condense the gist of a document and describe its core content simply and effectively, so that the core content can be understood accurately and quickly and the time spent consulting literature is greatly reduced, which gives the method great application value.
Thirdly, the application converts words, feature marks and result marks into mathematical expressions: words in the text are converted into 300-dimensional vector feature representations, the features are converted into vector feature representations by a user-defined scheme matched to the format of each feature, and the result marks are likewise converted by a user-defined scheme. This design, tailored to the characteristics of scientific and technical documents, greatly improves the quality of keyword extraction. The extracted keywords reflect the topic of a document, and topics and characteristic words are captured accurately even for long texts. This strengthens the ability of a search engine to index scientific and technical documents, improves retrieval accuracy, reduces the time spent consulting literature, and helps users find the resources they need quickly and accurately for study and research.
Fourthly, for feature processing of the text, the application performs multi-feature word segmentation with the nltk toolkit, marks the segmentation result using the result marking file corresponding to the text, marks the four features of the segmentation result in sequence, and converts the words, their feature marks and their result marks into vector feature representations, which greatly improves the efficiency of text processing based on scientific and technical features. The aggregated word vectors and feature vectors serve as the training input of the deep network learning model, and the result vectors are passed to the model as the target result, which improves the quality of the model's keyword extraction. The extraction result of the model is also evaluated and corrected in real time, so that the model's extraction of scientific and technical literature keywords improves continuously.
Drawings
FIG. 1 is a diagram illustrating the results of a text segment labeled in P/N.
FIG. 2 is a schematic diagram of feature part-of-speech classification of the present application.
FIG. 3 is a diagram illustrating the relationship between the writing format of a word and whether the word is a keyword.
Detailed Description
The following further describes the technical solution of the multi-feature fusion English scientific and technical literature keyword extraction method with reference to the accompanying drawings, so that those skilled in the art can better understand and implement the present application.
Firstly, keyword extraction is converted into sequence recognition task processing, a P/N sequence marking method based on two classifications is adopted, the keyword extraction task is used as sequence marking of a word two classification, and the problem of fragmented keywords in a prediction result is solved;
secondly, four key features are set for training the model by analyzing the keyword set in the marking results: first, because professional keywords and proper nouns are significant to the text in which they appear, the keywords marked in the result marking files of the corpus and document keywords captured from the Web together form the prior science and technology dictionary feature (STD); second, because the probability that a keyword of scientific and technical literature is a noun or verb is extremely high, the characteristic part-of-speech feature (FPOS) is adopted; third, because the TF-IDF value of a word is important for classifying the word in the corpus, the TF-IDF enabling characteristic value of the word is used as a feature; fourth, because analysis of keyword writing formats shows that nearly 70% of fully capitalized words and about 30% of words with an initial capital are keywords or parts of keywords, the writing format (C) is adopted as the fourth feature;
thirdly, the words, feature marks and result marks are converted into mathematical expressions: the words in the text are first converted into 300-dimensional vector feature representations using the open-source pretrained word vector model GoogleNews300, the features are then converted into vector feature representations by a user-defined scheme matched to the format of each feature, and finally the result marks are converted into vector feature representations by a user-defined scheme;
fourthly, for feature processing of the text, performing multi-feature word segmentation on the text by adopting an nltk toolkit, then marking results of the multi-feature word segmentation of the text by adopting a result marking file corresponding to the text, then marking four features of the multi-feature word segmentation of the text in sequence, and finally converting the word, the feature marks of the word and the result marks of the word into vector feature representation in sequence;
fifthly, aggregation of word vectors and feature vectors is used as training input of the deep network learning model, the result vectors are used as target results and are also transmitted to the deep network learning model for model training, and the deep network learning model is used for extracting keywords;
and sixthly, the extraction result of the model is evaluated and corrected in real time: the prediction effect of the system is evaluated comprehensively using the three standards Precision, Recall and F1-score, computed with the conventional calculation methods.
1. Multi-feature fusion keyword extraction framework
Prior-art keyword extraction algorithms generally proceed in two steps. Rules are first formulated to extract candidate words from the text; the candidate words are then weighted and scored in various ways, and the keywords are finally screened out from them. In other words, the prior art takes candidate words as the input unit and screens the keywords from the candidates. The major disadvantage is that the final effect depends directly on the result of candidate selection. Because candidate recognition rules can hardly reach 100% or even 90% coverage, many keywords are already missed in the first step, and the extraction effect is reduced.
To address this problem, the present application neither selects candidate words nor screens keywords from them; instead it treats keyword extraction as the marking of keyword subject sequences.
The text in the corpus is first subjected to multi-feature word segmentation and split into a word set. Features are then extracted for all words, and the words are given sequence marks using the keyword position marking file. The words and features are spliced together in vector form as the input, and the marking results of the words are passed in vector form to a deep network learning model for training. Finally, keywords are extracted with the trained model, and the extraction result of the model is evaluated in real time.
2. Keyword extraction method based on deep network learning
The key point of the application is to treat the recognition and extraction of keywords of scientific and technical literature as the marking of keyword subject sequences. A BIESO marking method, improved from the BIO marking method, is adopted first: S marks a keyword consisting of a single word, B marks the beginning of a keyword, I marks a middle position of a keyword, E marks the end of a keyword, and O marks a word that is not a keyword. This marking mode in effect converts keyword recognition into multi-class classification: words in the text are divided into the five classes B, I, E, S and O, then words marked S and continuous sequences beginning with B and ending with E are recognized, and the keywords are finally extracted.
Training a model with the BIESO multi-class marking mode raises a problem when keywords are extracted from a test text. For a long keyword subject sequence, fragmentation often occurs because one word in the keyword is misclassified: where the correct marking is "BIIIIE", the trained model may produce "BIEBE", splitting one complete keyword into two fragments; a keyword fragment sequence such as "OOBIIE" may appear, or even an invalid marking sequence such as "EIOBOE". The present application therefore further modifies the marking mode.
First, the marking mode is changed to a P/N mode: P (positive) indicates that a word is a keyword and N (negative) indicates that it is not, which converts the keyword extraction model from a multi-class model into a binary model. Calculation and analysis of the given corpus of marked keywords show that fewer than 60 keywords in the corpus are adjacent to or intersect another keyword; that is, more than 99 percent of keywords are separated from other keywords by words or symbols. Adjacent continuous word sequences marked P can therefore be aggregated and extracted as keywords. FIG. 1 shows the result of marking a piece of text in the P/N mode.
Aggregating the continuous P sequences yields the keywords. With all other conditions identical, the binary keyword extraction model trained with the P/N marking mode improves F1-score by 2% to 4% over the multi-class model trained with the BIESO marking mode, so the P/N marking method is finally adopted in the application.
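The relationship between the two marking modes, and the aggregation of continuous P sequences into keywords, can be sketched as follows. The function names and the sample sentence are hypothetical, and joining words with a single space is an assumption made for illustration.

```python
from itertools import groupby


def bieso_to_pn(tags):
    """Collapse BIESO tags to the binary P/N scheme: only O becomes N."""
    return ["N" if tag == "O" else "P" for tag in tags]


def aggregate_keywords(words, labels):
    """Join each maximal run of consecutive P-labelled words into one keyword."""
    keywords = []
    pairs = zip(words, labels)
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label == "P":
            keywords.append(" ".join(word for word, _ in group))
    return keywords


words = ["We", "train", "a", "convolutional", "neural", "network",
         "on", "ImageNet"]
tags = ["O", "O", "O", "B", "I", "E", "O", "S"]
labels = bieso_to_pn(tags)
keywords = aggregate_keywords(words, labels)
# ['convolutional neural network', 'ImageNet']
```

Under this mapping a fragmented BIESO prediction such as B I E B E still collapses to one unbroken run of P, which is one way to see why the binary scheme is less prone to splitting a long keyword.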
In summary, the steps of the keyword extraction method based on deep network learning are as follows:
step one: performing multi-feature word segmentation operation on the text in the corpus;
step two: extracting the features of the multi-feature word segmentation result;
step three: performing P/N result marking on the text according to the marking result file of the corpus;
step four: the words and the features are represented in a vector form and spliced together to serve as input, the result marked by the P/N is represented as a vector to serve as an expected result, and the expected result is transmitted to a deep network learning model for training;
step five: extracting key words aiming at the corpus by adopting a trained model;
step six: and evaluating the performance of the model in real time according to the result predicted by the model.
3. Fusion feature settings
In training the keyword extraction model, the application analyzes the composition rules of the keywords in the marking results, identifies the characteristics the keywords have in common, assigns values to these features and inputs them into the model for training. These features play an important role in keyword extraction and greatly improve the extraction effect of the model. By analyzing the characteristics of the keywords in the corpus, four fused features are selected: the prior science and technology dictionary (STD), the characteristic part of speech (FPOS), the TF-IDF enabling characteristic value and the writing format (C).
(I) Prior science and technology dictionary
Calculation and analysis of the marking results of the scientific and technical documents in the corpus show that their keywords are essentially words with strong domain specificity and professional character. Whenever such a word appears in a document, whether or not it is a keyword there, it is meaningful for classifying the content of the document. Words with strong domain specificity and professional character that occur in the document are therefore marked, and the marking results are used as a feature for model training.
For keyword extraction model training, a dictionary method is used: the acquired keywords are collected and, after duplicates are removed, form a prior keyword dictionary. When a text is processed, the entries in the dictionary are matched against the text content one by one; all words in the text that form part of a dictionary entry are marked as class P, the remaining words are marked as class N, and the dictionary marking result is used as an input feature for training the model to improve its effect.
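The dictionary matching can be sketched at token level as follows. The application matches keywords against the original text and records positions; this sketch instead compares token sequences case-insensitively, which is an assumption made to keep the example short. The function name and the sample data are hypothetical.

```python
def mark_dictionary(words, dictionary):
    """Mark each word P if it is part of a dictionary entry found in the text."""
    lowered = [word.lower() for word in words]
    marks = ["N"] * len(words)
    for entry in dictionary:
        parts = entry.lower().split()
        n = len(parts)
        if n == 0:
            continue
        # Slide a window of n tokens over the text and compare.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == parts:
                marks[i:i + n] = ["P"] * n
    return marks


words = ["The", "support", "vector", "machine", "outperforms", "LDA"]
dictionary = {"support vector machine", "lda"}
wd = mark_dictionary(words, dictionary)  # ['N', 'P', 'P', 'P', 'N', 'P']
```

The nested loop costs time proportional to the dictionary size times the text length; for a dictionary of tens of thousands of entries, a trie or an index keyed by first word would likely be preferable.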
The sources of the prior science and technology dictionary are divided into two parts:
Part one: keywords collected from the keyword marking results in the result marking files of the corpus provided by the ScienceIE task; after repeated keywords and keywords that are meaningless on their own are removed, 5620 keywords remain. Because 5620 keywords are comparatively few, a second source is added.
Part two: external resources. Since the corpus is drawn from the ScienceDirect document library, a crawler is used to capture 15000 scientific documents from ScienceDirect, yielding 46569 keywords in total after repeated keywords are removed. The two parts are merged, repeated keywords are removed and some meaningless words are filtered out, finally giving a keyword dictionary of 50000 keywords.
(II) characteristic part of speech
More than 90% of the keywords are noun phrases, or phrases composed of verbs, adjectives and a few prepositions. The application therefore marks the characteristic part of speech of each word and assigns a value to each characteristic part of speech as a feature for training the system, which has a notable effect on keyword extraction. Stanford NLP English part-of-speech tagging is adopted.
The characteristic part of speech is marked using the Python library nltk, which integrates the Stanford NLP toolkit. Because the keywords of a document consist of phrases made up of nouns, verbs and adjectives, the part-of-speech tags of the words in the text are regrouped into a small number of classes when they are marked.
The classification results are shown in FIG. 2. The application divides the characteristic part of speech into four major classes: N, V, J and O. N comprises singular and plural nouns and singular and plural proper nouns; V comprises the base form of the verb, present participles, past participles, gerunds, the past tense, and third-person and non-third-person forms; J comprises adjectives and their comparative and superlative forms; O represents all parts of speech not contained in the other three classes. Finally, each of the four classes is assigned a value and, together with the word vector, used as input for training the model.
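The regrouping into four classes can be sketched as a mapping over Penn Treebank tags, which is the tag set produced by the default English tagger in nltk. The tags in the example are written by hand rather than produced by a tagger, so the block runs without nltk data; the function name coarse_pos is hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}


def coarse_pos(tag):
    """Map a Penn Treebank tag to one of the four classes N, V, J, O."""
    if tag in NOUN_TAGS:
        return "N"
    if tag in VERB_TAGS:
        return "V"
    if tag in ADJ_TAGS:
        return "J"
    return "O"


# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [("Deep", "JJ"), ("networks", "NNS"), ("extract", "VBP"),
          ("features", "NNS"), ("quickly", "RB")]
wp = [coarse_pos(tag) for _, tag in tagged]  # ['J', 'N', 'V', 'N', 'O']
```

In practice the (word, tag) pairs would come from a tagger call such as nltk.pos_tag applied to the word list.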
(III) TF-IDF enabling characteristic value
The TF-IDF algorithm is simple and fast, but its underlying assumption clearly does not hold for document sets with similar content: important words then receive relatively low TF-IDF weights and may be eliminated during keyword screening. Moreover, considering only the term frequency (TF) and inverse document frequency (IDF) of words is insufficient in many cases; in structured text, for example, the position of a word is also valuable information.
Therefore the TF-IDF enabling characteristic value of a word is used as only one feature, and other features are added to make up for its deficiencies. The TF-IDF enabling characteristic value of each word is calculated and used as a feature of the word to train the deep network learning model that extracts keywords of English scientific and technical literature.
(IV) writing Format
Keywords in the dictionary are divided into three writing formats: UA stands for the full-capitalization writing format (for example, abbreviations), UP for the initial-capital writing format, and L for the lowercase writing format. Simple calculation and analysis of all words in the corpus show that there are 1279 fully capitalized words, of which 862 (67.40%) are keywords on their own or components of a keyword, and 3831 words with an initial capital, of which 1132 (29.55%) are keywords on their own or components of a keyword. The specific results are shown in FIG. 3.
As FIG. 3 makes clear, the writing format of a word is strongly related to whether the word belongs to a keyword. A word in full-capitalization format (UA), such as an abbreviation, has a probability of nearly 70% of being a keyword or part of a keyword, and a word in initial-capital format (UP) has a probability of about 30%. The writing format therefore carries considerable weight in deciding whether a word is a keyword.
Because of this importance, the writing format is used in the keyword extraction model of the application: the words in the corpus are divided into the three classes UA, UP and L according to writing format, the class of each word is taken as a feature, the three classes are each assigned a value, and the feature is fed into the deep network learning model for training.
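The three-way writing format classification can be sketched with the built-in string methods. How single capital letters and tokens that mix capitals with digits should be classified is not stated in the application; here anything for which str.isupper() is true counts as UA, which is an assumption. The function name is hypothetical.

```python
def writing_format(word):
    """Classify a word as UA (all capitals), UP (initial capital) or L."""
    if word.isupper():
        return "UA"
    if word[:1].isupper():
        return "UP"
    return "L"


words = ["SVM", "Network", "the", "3"]
wc = [writing_format(word) for word in words]  # ['UA', 'UP', 'L', 'L']
```

Note that a capitalized sentence-initial word is marked UP by this rule even when the capital is purely positional.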
4. Vector feature representation
(I) Vector feature representation of words
Words in the corpus are given vector feature representations using the trained GoogleNews300 model. Because the model was trained on a news data set, it may not contain some words found in scientific and technical documents; for such cases the following fixed vector feature representations are specified:
$$W = (1.0, 1.0, \ldots, 1.0) \in \mathbb{R}^{300} \qquad \text{(formula 1)}$$
$$S = (0.0, 0.0, \ldots, 0.0) \in \mathbb{R}^{300} \qquad \text{(formula 2)}$$
The 300-dimensional vector W composed of 1.0 is used to represent words in the model GoogleNews300 for which no corresponding vector exists, and the remaining non-words in the model for which no corresponding vector exists, such as numbers or various types of symbols, are represented by a 300-dimensional vector S composed of 0.0.
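The fallback rule can be sketched as follows. A plain dict stands in for the GoogleNews300 model, and a token is treated as a word if it contains at least one letter; both are assumptions made for illustration, as is the function name.

```python
DIM = 300
W = [1.0] * DIM  # words missing from the model
S = [0.0] * DIM  # non-words (numbers, symbols) missing from the model


def word_to_vector(token, model):
    """Look up a token; fall back to W for words and S for non-words."""
    if token in model:
        return list(model[token])
    if any(ch.isalpha() for ch in token):
        return list(W)
    return list(S)


# Toy stand-in for the pretrained model: one known word.
toy_model = {"network": [0.5] * DIM}
WX = [word_to_vector(t, toy_model)
      for t in ["network", "keyphrase", "42", ","]]
```

Here "keyphrase" receives the all-ones vector W, while "42" and "," receive the all-zeros vector S.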
(II) Vector feature representation of features
The words and their features are used together as input data for training the deep network learning model, so once the words have been converted into vector feature representations, the four features of the words are also represented by vectors, specifically as follows:
1) Prior science and technology dictionary: the marking result of the dictionary has two classes, P and N, expressed as P = (0, 1) and N = (1, 0);
2) Characteristic part of speech: the characteristic part-of-speech feature has four classes, N, V, J and O, expressed as four 4-dimensional vectors: N = (1, 0, 0, 0), V = (0, 1, 0, 0), J = (0, 0, 1, 0), O = (0, 0, 0, 1);
3) TF-IDF enabling characteristic value: the feature is a numerical value and is expressed as a one-dimensional vector consisting of that value;
4) Writing format: the writing format feature has three classes, UA, UP and L, expressed as the 3-dimensional vectors UA = (1, 0, 0), UP = (0, 1, 0) and L = (0, 0, 1).
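The four feature encodings can be sketched as lookup tables. J is taken as (0, 0, 1, 0) here so that the four part-of-speech classes remain distinct one-hot vectors; the table and function names are hypothetical.

```python
STD_VEC = {"P": [0, 1], "N": [1, 0]}
FPOS_VEC = {"N": [1, 0, 0, 0], "V": [0, 1, 0, 0],
            "J": [0, 0, 1, 0], "O": [0, 0, 0, 1]}
C_VEC = {"UA": [1, 0, 0], "UP": [0, 1, 0], "L": [0, 0, 1]}


def encode_features(wd, wp, wt, wc):
    """Convert the four mark lists into the vector lists DX, PX, TX, CX."""
    dx = [STD_VEC[mark] for mark in wd]
    px = [FPOS_VEC[mark] for mark in wp]
    tx = [[value] for value in wt]  # one-dimensional vector per word
    cx = [C_VEC[mark] for mark in wc]
    return dx, px, tx, cx


DX, PX, TX, CX = encode_features(
    wd=["P", "N"], wp=["N", "O"], wt=[0.35, 0.0], wc=["UA", "L"])
```

The same two-class table STD_VEC could also encode the actual result marks Labels, since they use the identical P/N vectors.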
(III) vector feature representation of actual and predicted results
During training, the marking result of the words is transmitted to the deep network learning model for comparing the prediction result and the actual result after each iteration, calculating errors and adjusting the weight matrix according to the errors, wherein the marking of the actual result of the words also needs to be expressed in a vector form.
If a word is part of a keyword subject sequence or is a keyword on its own, it is marked P and expressed as (0, 1); otherwise it is marked N and expressed as (1, 0).
5. Text pre-processing
Text preprocessing converts the natural language in the text into the mathematical symbols required for computation and is the preparation stage for training the keyword extraction model.
For a given document set with marked results, the text preprocessing is divided into four steps:
step 1: read the content of the document, perform multi-feature word segmentation on the text, and store the result as Words;
step 2: using the result marking file corresponding to the document, mark the multi-feature word segmentation result Words in the P/N mode and store the marked result as Labels;
step 3: for the four selected features, the prior science and technology dictionary (STD), the characteristic part of speech (FPOS), the TF-IDF enabling characteristic value and the writing format (C), mark the multi-feature word segmentation result Words feature by feature and store the corresponding results as wd, wp, wt and wc respectively;
step 4: convert the multi-feature word segmentation result Words, the corresponding marking result Labels and the feature marks wd, wp, wt and wc into their vector feature representations, and splice the word vectors and the feature vectors together.
(I) Multi-feature word segmentation
Model training takes words as its basic data unit. Multi-feature word segmentation reads the Text of a document and segments it into a set Words consisting of single words. Whether to remove stop words, whether to remove punctuation marks, and whether to add separators between texts or between paragraphs of the same text are decided according to the actual situation.
The method segments English text on spaces and punctuation marks, so segmentation errors can occur in some complex situations.
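As a minimal sketch of segmentation on spaces and punctuation, the following uses a regular expression and also records each token's character offsets, which the later result-marking step relies on. The pattern and the function name are assumptions made for illustration; the method itself states only that an nltk-based segmentation is used.

```python
import re

# One token per run of word characters (internal hyphens and apostrophes kept)
# or per single punctuation mark. The pattern is an illustrative assumption.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize_with_spans(text):
    """Split text into tokens and record each token's (start, end) offsets."""
    words, indexes = [], []
    for match in TOKEN_RE.finditer(text):
        words.append(match.group())
        indexes.append(match.span())  # end offset is exclusive
    return words, indexes


text = "Deep-learning models, e.g. LSTMs, work."
words, indexes = tokenize_with_spans(text)
print(words)
```

In this example the abbreviation "e.g." is split into four tokens, which is one of the segmentation errors the text warns about for complex cases.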
(II) marking result feedback
The method trains the deep network learning model in a supervised mode. During training, the actual result of the training corpus must therefore be input into the model so that the error between the model's prediction and the target result can be calculated; the weight matrix is then adjusted to reduce this error and improve the performance of the model.
Step 1: read the Text of the document and, while segmenting it into Words by multi-feature word segmentation, save the position information (starting position and ending position) of each word in the document as indexes;
Step 2: read the result marking file corresponding to the document and read out the marking results, namely the position information (starting and ending positions) of all keywords in the document; arrange them in ascending order of starting position and store them as kptinds;
Step 3: traverse the indexes set against the content of kptinds. If the position of a word in the document is not within any position range saved in kptinds, the word does not belong to a keyword topic sequence and is marked N in the P/N marking mode; a word whose starting and ending positions fall within a range contained in kptinds is marked P. After the traversal is finished, store the marking result set as Labels.
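The three steps above can be sketched as follows. The function name is hypothetical, and two details are assumptions: spans are (start, end) pairs with an exclusive end, and a word counts as inside a keyword only when its whole span lies within one keyword span.

```python
def label_words(indexes, kptinds):
    """Mark each word span P if it lies inside some keyword span, otherwise N."""
    kptinds = sorted(kptinds)  # ascending by starting position, as in Step 2
    labels = []
    for start, end in indexes:
        inside = any(ks <= start and end <= ke for ks, ke in kptinds)
        labels.append("P" if inside else "N")
    return labels


# Word spans for "Deep learning improves keyword extraction"
indexes = [(0, 4), (5, 13), (14, 22), (23, 30), (31, 41)]
# Keyword spans for "keyword extraction" and "Deep learning"
kptinds = [(23, 41), (0, 13)]
print(label_words(indexes, kptinds))  # ['P', 'P', 'N', 'P', 'P']
```

The linear scan over all keyword spans keeps the sketch short; with the spans already sorted, a two-pointer walk would avoid the repeated scan on long documents.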
(III) extracting features
Before model training, each feature of a word needs to be labeled:
The first step: extract the prior science and technology dictionary feature (STD) of the words. Traverse the keywords in the constructed keyword dictionary and match each keyword against the original text of the document. If the keyword occurs in the text, record the matched positions and mark all the corresponding words in the multi-feature word segmentation result Words as P, indicating that they belong to a keyword topic sequence. After the traversal is finished, mark every word in Words not marked P as N, indicating that it does not belong to a keyword, and store the result as wd;
The second step: extract the feature part-of-speech feature (FPOS) of the words using the natural language processing toolkit nltk. Traverse the words in the multi-feature word segmentation result Words, tag each word with the part-of-speech tagger of nltk, and map each tag into one of the four major classes N, V, J and O according to the classification used in the vector feature representation of the features. After the traversal is finished, store the marking result as wp;
The third step: extract the TF-IDF enabling characteristic value feature of the words. The TF-IDF enabling characteristic value of a word is obtained by multiplying the term frequency TF of the word in the current document by its inverse document frequency IDF over the whole document set. To compute it, first read all texts in the corpus, perform multi-feature word segmentation on each text, remove repeated words and punctuation marks, and store the result as a word list containing each word and the number of times it appears in the document. To compute the values for the words of a given document, traverse the word list of the current document and calculate the term frequency TF of each word from the number of times it appears in the document; then traverse the word list of the whole corpus, count the occurrences of the current word in the whole corpus and calculate its inverse document frequency IDF; finally multiply TF by IDF to obtain the TF-IDF enabling characteristic value of the word. After the word list has been traversed, traverse the multi-feature word segmentation result Words, assign each word its TF-IDF enabling characteristic value, and store the result as wt;
The fourth step: extract the writing format feature (C) of the words. Words in the corpus are divided into three categories according to writing format: UA, UP and L. Traverse the word set produced by multi-feature word segmentation of the document and judge the writing format of each word: mark words written entirely in capitals as UA, words with only the first letter capitalized as UP, and all remaining words as L. After the traversal is finished, store the marking result as wc.
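The four labelling steps can be sketched in plain Python as below. All function names are hypothetical. Several points are assumptions rather than statements of the method: dictionary matching is done on lower-cased tokens instead of raw text positions, part-of-speech tags are taken to be Penn Treebank tags such as those returned by `nltk.pos_tag`, and IDF is computed as log(N / document frequency) because the text does not give an exact formula.

```python
import math
from collections import Counter


def dictionary_feature(words, keyword_dict):
    """STD: mark every word covered by a dictionary keyword as P, the rest as N."""
    lowered = [w.lower() for w in words]
    wd = ["N"] * len(words)
    for keyword in keyword_dict:
        parts = keyword.lower().split()
        n = len(parts)
        if n == 0:
            continue
        for i in range(len(words) - n + 1):
            if lowered[i:i + n] == parts:
                wd[i:i + n] = ["P"] * n
    return wd


def pos_class(tag):
    """FPOS: collapse a Penn Treebank tag into one of N, V, J or O."""
    for prefix, cls in (("NN", "N"), ("VB", "V"), ("JJ", "J")):
        if tag.startswith(prefix):
            return cls
    return "O"


def tfidf_feature(words, corpus):
    """TF-IDF: term frequency in this document times log(N / document frequency)."""
    counts = Counter(words)
    doc_sets = [set(doc) for doc in corpus]
    wt = []
    for w in words:
        tf = counts[w] / len(words)
        df = sum(1 for doc in doc_sets if w in doc)
        idf = math.log(len(doc_sets) / df) if df else 0.0
        wt.append(tf * idf)
    return wt


def case_feature(words):
    """C: UA for all-capital words, UP for an initial capital only, L otherwise."""
    wc = []
    for w in words:
        if w.isupper():
            wc.append("UA")
        elif w[:1].isupper() and not any(ch.isupper() for ch in w[1:]):
            wc.append("UP")
        else:
            wc.append("L")
    return wc


words = ["LSTM", "networks", "extract", "Keyword", "phrases"]
corpus = [words, ["networks", "learn"]]  # toy corpus of two tokenized documents
wd = dictionary_feature(words, {"keyword phrases", "lstm"})
wp = [pos_class(tag) for tag in ["NNP", "NNS", "VBP", "JJR", "IN"]]  # sample tags
wt = tfidf_feature(words, corpus)
wc = case_feature(words)
print(wd, wp, wc)
```

In the toy corpus "networks" appears in both documents, so its IDF and therefore its TF-IDF value is zero, while words unique to one document receive a positive value.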
(IV) vector feature transformation
After multi-feature word segmentation, result marking and feature extraction of the corpus are complete, and before the deep network learning model is trained, the words themselves (Words), the feature marking results wd, wp, wt and wc, and the actual results Labels need to be converted into vector form.
Step 1): for Words, traverse the word set and use the trained word vector model GoogleNews300 to represent each word as a 300-dimensional vector. A token with no corresponding vector in the word vector model is represented, depending on whether or not it is a word, as a 300-dimensional vector with every dimension equal to 1.0 or as a 300-dimensional vector with every dimension equal to 0.0. Store the converted result as WX;
Step 2): convert the feature marking results wd, wp, wt and wc into vector form;
Step 3): the marking result wd of the dictionary feature (STD) falls into two classes, P and N. The mark P is expressed as the vector (0, 1) and the mark N as the vector (1, 0); the converted result is stored as DX;
for the marking result wp of the feature part-of-speech feature (FPOS), the four classes N, V, J and O are expressed as four 4-dimensional vectors: the mark N as (1, 0, 0, 0), the mark V as (0, 1, 0, 0), the mark J as (0, 0, 1, 0) and the mark O as (0, 0, 0, 1); the converted result is stored as PX;
for the marking result wt of the TF-IDF enabling characteristic value feature of the word, the value is expressed as a one-dimensional vector consisting of that value, and the result is stored as TX;
for the marking result wc of the writing format feature (C) of the word, the three classes UA, UP and L are converted into 3-dimensional vectors: the mark UA into (1, 0, 0), the mark UP into (0, 1, 0) and the mark L into (0, 0, 1); the converted result is stored as CX;
Step 4): for the actual result set Labels of the multi-feature word segmentation result Words, a word marked P is expressed as the vector (0, 1) and a word marked N as the vector (1, 0); the converted result is stored as Y;
After the vector feature representations of the words, the features and the results are complete, the word vector and the feature vectors are connected together according to Formula 3 to form the input X for model training, where + denotes concatenation of the vectors:
$$X = WX + DX + PX + TX + CX$$ (Formula 3)
The vector feature representation Y corresponding to the actual result is passed to the keyword extraction model as the expected result for training.
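A minimal sketch of the word-vector fallback and the concatenation in Formula 3 is given below. The `embeddings` dictionary is a hypothetical stand-in for the GoogleNews300 model, and the test used to decide whether a missing token is a word (it contains at least one letter) is an assumption.

```python
DIM = 300
W = [1.0] * DIM  # fallback for words that have no vector in the model
S = [0.0] * DIM  # fallback for non-word tokens such as numbers or symbols


def word_vector(token, embeddings):
    """Look up a token, falling back to W or S when it is missing."""
    if token in embeddings:
        return list(embeddings[token])
    is_word = any(ch.isalpha() for ch in token)  # assumed word test
    return list(W) if is_word else list(S)


def build_input(wx, dx, px, tx, cx):
    """Formula 3: concatenate the word vector and the four feature vectors."""
    return list(wx) + list(dx) + list(px) + list(tx) + list(cx)


embeddings = {"model": [0.1] * DIM}  # hypothetical toy embedding table
x = build_input(
    word_vector("keyphrase", embeddings),  # not in the table, so W is used
    (0, 1),        # DX: dictionary mark P
    (1, 0, 0, 0),  # PX: part-of-speech class N
    (0.42,),       # TX: TF-IDF value
    (0, 0, 1),     # CX: writing format L
)
print(len(x))  # 310
```

Each input row therefore has 300 + 2 + 4 + 1 + 3 = 310 dimensions, and a document becomes a sequence of such rows paired with the label vectors Y.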
6. Model evaluation
(I) evaluation criteria
To assess the actual extraction performance of the trained keyword extraction model, three evaluation criteria are used: precision P, recall R and F1.
F1 is a weighted harmonic mean of P and R, see Formula 4:
$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$ (Formula 4)
F1 combines precision and recall: F1 is high only when both P and R are high, and performance is ideal only when both precision and recall are high, so F1 is taken as the overall performance measure of the system.
The results of model prediction are classified into four categories:
first class, TP: correctly predicted keywords;
second class, FP: non-keywords mispredicted as keywords;
third class, TN: correctly predicted non-keywords;
fourth class, FN: keywords mispredicted as non-keywords;
according to these four result classes, P and R are calculated as:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
from which the values of P, R and F1 are calculated.
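As a small worked example of these definitions, the following computes P, R and F1 from the counts TP, FP and FN. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the text does not cover that case.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R and F1 from true positives, false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# 8 keywords predicted correctly, 2 spurious predictions, 8 keywords missed
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 4))
```

Here precision is 0.8 but recall is only 0.5, and F1 (about 0.615) sits closer to the lower of the two, which is why it is used as the single summary figure.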
(II) evaluation method
The keyword topic sequences in the prediction result are identified and compared with the keywords in the actual result to calculate P, R and F1, in the following two steps:
1) Identifying keywords in the prediction result: a sequence of consecutive words whose predicted mark is P is regarded as one keyword. The prediction result is traversed, each consecutive P sequence is aggregated into a keyword, and the keywords are extracted to obtain the total number Np of keywords in the prediction result;
2) Identifying correctly predicted keywords: the keywords extracted from the prediction result are compared with the keywords in the actual result to find those predicted completely correctly, that is, those whose word sequence is identical to the topic sequence of the corresponding keyword in the actual result; their number is counted as Nt. The total number Na of keywords in the actual result is obtained from the result marking files in the corpus;
from these counts, P, R and F1 are calculated in this application as:
$$P = \frac{Nt}{Np}$$
$$R = \frac{Nt}{Na}$$
$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
The model is evaluated on the basis of P, R and F1.
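The two-step evaluation can be sketched as follows. The function names are hypothetical. As a simplification, the actual keywords are recovered here from the actual P/N label sequence, whereas the method reads Na from the result marking files; the two differ when two annotated keywords are directly adjacent.

```python
def keyword_spans(labels):
    """Aggregate each maximal run of consecutive P labels into one (start, end) span."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "P" and start is None:
            start = i
        elif label != "P" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def evaluate(predicted, actual):
    """Return (P, R, F1) using exact-match keyword spans."""
    pred = set(keyword_spans(predicted))
    gold = set(keyword_spans(actual))
    nt, n_p, n_a = len(pred & gold), len(pred), len(gold)
    p = nt / n_p if n_p else 0.0
    r = nt / n_a if n_a else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


actual = ["P", "P", "N", "P", "N", "N", "P", "P", "P"]
predicted = ["P", "P", "N", "N", "N", "N", "P", "P", "N"]
print(evaluate(predicted, actual))
```

In this example the predicted keyword at positions 6 to 7 only partly covers the actual keyword at positions 6 to 8, so it does not count towards Nt: of Np = 2 predicted keywords only Nt = 1 is fully correct, against Na = 3 actual keywords.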
The method uses a deep network learning model to extract keywords from English scientific and technical literature. In training the model, based on the fused feature setting, four features are used to improve the model: the prior science and technology dictionary, feature part of speech, TF-IDF enabling characteristic value and writing format. The method adopts a recurrent deep network learning model and trains it by processing the corpus into serialized input, finally reaching an F1-score of 92.3%, with markedly fewer fragmented keywords in the model's predictions. This result also ranks highly among all submissions to the ScienceIE task. Compared with prior-art keyword extraction algorithms such as KEA, decision trees or naive Bayes, the result is also greatly improved.

Claims (10)

1. A multi-feature fusion English scientific and technical literature keyword extraction method, characterized in that the keyword extraction process is converted into a process of marking keyword topic sequences, model training takes words as the input unit, and a deep network learning model is used for supervised sequence marking;
firstly, keyword extraction is converted into a sequence recognition task: using a P/N sequence marking method based on binary classification, the keyword extraction task is treated as binary sequence marking of words, which solves the problem of fragmented keywords in the prediction result;
secondly, four key features are set for training the model by analyzing the keyword set in the marking results and fusing the features; first, based on the significance of professional keywords and proper nouns for the text, the keywords marked in the marking result files of the corpus and document keywords captured from the Web together form a prior science and technology dictionary feature (STD); second, based on the extremely high probability that keywords of technical literature are nouns or verbs, the feature part-of-speech feature (FPOS) is adopted; third, based on the importance of the TF-IDF value of a word for classifying the word in the corpus, the TF-IDF enabling characteristic value of the word is adopted as a feature; fourth, based on the analysis of the writing format of keywords, which shows that words written entirely in capitals are keywords and that 30% of capitalized words are keywords, the writing format (C) is adopted as the fourth feature;
thirdly, the words, the feature marks and the result marks are converted into mathematical expressions: first, words in the text are converted into 300-dimensional vector feature representations using the open-source trained word vector model GoogleNews300; then the features are converted into vector feature representations in a user-defined manner according to the format of each feature; finally, the result marks are converted into vector feature representations in a user-defined manner;
fourthly, for the feature processing of the text, multi-feature word segmentation is first performed on the text using the nltk toolkit; the multi-feature word segmentation result is then marked with results according to the result marking file corresponding to the text; the four features of the multi-feature word segmentation result are then marked in turn; finally, the words, their feature marks and their result marks are converted into vector feature representations;
fifthly, the aggregation of word vectors and feature vectors is used as the training input of the deep network learning model, the result vectors are passed to the deep network learning model as target results for model training, and keyword extraction is realized with the deep network learning model;
sixthly, the extraction results of the model are evaluated and corrected in real time: the prediction performance of the system is evaluated comprehensively with the three criteria Precision, Recall and F1-score, calculated by the conventional method.
2. The multi-feature fusion English scientific and technical literature keyword extraction method according to claim 1, wherein, in the keyword extraction method based on deep network learning: the marking mode is first changed into a P/N mode, that is, P indicates that a word is a keyword and N indicates that it is not, so that the keyword extraction model is converted from a multi-class model into a two-class model, and adjacent consecutive word sequences marked P are aggregated and extracted as keywords;
the keyword extraction method based on deep network learning comprises the following steps:
step one: performing multi-feature word segmentation on the text in the corpus;
step two: extracting the features of the multi-feature word segmentation result;
step three: performing P/N result marking on the text according to the marking result file of the corpus;
step four: words and characteristics are expressed in a vector form and spliced together to serve as input, a result marked by P/N is expressed as a vector to serve as an expected result, and the expected result is transmitted to a deep network learning model for training;
step five: extracting key words aiming at the corpus by adopting a trained model;
step six: and evaluating the performance of the model in real time according to the result predicted by the model.
3. The multi-feature fusion English scientific and technical literature keyword extraction method according to claim 1, wherein the fusion features are set as follows: four features are selected and fused, namely the prior science and technology dictionary (STD), feature part of speech (FPOS), TF-IDF enabling characteristic value and writing format (C);
(I) a priori science dictionary
Marking words with strong domain and specialty in a document, performing model training by taking a marking result as a feature, adopting a dictionary method for keyword extraction model training, collecting the acquired keywords, removing duplication to form a prior keyword dictionary, adopting words in the dictionary to match text contents one by one during text processing, marking all words formed by the words existing in the dictionary in the text as a class P, marking the words not existing in the dictionary as another class N, and taking the marking result of the dictionary as an input feature during training a model;
the sources of the prior science and technology dictionary are divided into two parts:
a first part: keywords are collected from the keyword marking results in the marking result files of the corpus provided by the ScienceIE task, and 5620 keywords are obtained after removing repeated or individually meaningless keywords;
a second part: external resources are used; since the corpus is sourced from the ScienceDirect document library, a crawler is used to capture 15000 scientific and technical documents from ScienceDirect, and 46569 keywords in total are obtained after removing repeated keywords; the two parts are combined, repeated keywords are removed and some meaningless words are filtered out, finally yielding a keyword dictionary consisting of 50000 keywords;
(II) characteristic part of speech
Feature part-of-speech marking is carried out with the Python library nltk, which integrates the Stanford NLP toolkit, wherein the keywords of a document comprise phrases consisting of nouns, verbs and adjectives;
the feature part of speech is divided into four major categories: N, V, J and O; N comprises singular and plural nouns and singular and plural proper nouns; V comprises the base form of a verb, the present participle, the past participle, the gerund, the past tense, and the third-person and non-third-person singular forms; J comprises adjectives and their comparative and superlative forms; O represents all parts of speech other than those contained in the three preceding categories; finally, the four categories are each assigned weighted values and used together with the word vector as input for training the model;
(III) TF-IDF enabling characteristic values
The method takes the TF-IDF enabling characteristic value of a word as only one feature, and other features are added to make up for its deficiencies; the TF-IDF enabling characteristic value of each word is calculated and used as a feature of the word to train the deep network learning model and extract keywords of the English scientific and technical literature;
(IV) writing format
Words in the corpus are divided into three categories, UA, UP and L, according to writing format; the category to which a word belongs is taken as a feature, the three categories are each assigned weighted values, and they are brought into the deep network learning model for training.
4. The multi-feature fusion English scientific and technical literature keyword extraction method according to claim 1, wherein the vector features of the words are represented as follows: the words in the corpus are represented as vectors using the trained GoogleNews300 model, and tokens with no corresponding vector in the model are given the following specified vector feature representations:
$$W = (1.0, 1.0, 1.0, \ldots, 1.0)_{300}$$ (Formula 1)
$$S = (0.0, 0.0, 0.0, \ldots, 0.0)_{300}$$ (Formula 2)
The 300-dimensional vector W composed of 1.0 represents words for which no corresponding vector exists in the model GoogleNews300; the remaining non-word tokens with no corresponding vector in the model, such as numbers or symbols of various kinds, are represented by the 300-dimensional vector S composed of 0.0.
5. The method for extracting keywords from multi-feature-fused English scientific and technical literature according to claim 1, wherein the vector feature representation of the features is as follows: when a word is converted into a vector feature representation, its four features are also represented as vectors, specifically:
1) A prior science dictionary: the labeling result of the dictionary is divided into two types of P and N, which are respectively expressed as P = (0, 1) and N = (1, 0);
2) Feature part of speech: the feature part-of-speech feature comprises four classes, N, V, J and O, which are expressed as four 4-dimensional vectors: N = (1, 0, 0, 0), V = (0, 1, 0, 0), J = (0, 0, 1, 0), O = (0, 0, 0, 1);
3) TF-IDF enabling characteristics: the TF-IDF enabling characteristic value feature is a numerical value, and the TF-IDF enabling characteristic value feature is expressed as a one-dimensional vector formed by the numerical value of the TF-IDF enabling characteristic value feature;
4) Writing format: the writing format feature is classified into UA, UP and L, which are expressed as the 3-dimensional vectors UA = (1, 0, 0), UP = (0, 1, 0) and L = (0, 0, 1).
6. The method for extracting keywords from multi-feature-fused English scientific and technical literature according to claim 1, wherein the text preprocessing comprises:
step 1: reading the content of the document and performing multi-feature word segmentation on the text, the result being stored as Words;
step 2: according to the marking result file corresponding to the document, marking the multi-feature word segmentation result Words in the P/N mode, the marked result being stored as Labels;
step 3: for the four selected features, namely the prior science and technology dictionary (STD), feature part of speech (FPOS), TF-IDF enabling characteristic value and writing format (C), marking the features of each word in Words one by one, the results being stored as wd, wp, wt and wc respectively;
step 4: converting Words, the corresponding marked result Labels and the feature marks wd, wp, wt and wc into their vector feature representations, and splicing each word vector with its feature vectors;
marking result feedback:
step 1: reading the Text of the document and, while segmenting it into Words by multi-feature word segmentation, saving the position information of each word in the document as indexes;
step 2: reading the result marking file corresponding to the document and reading out the marking results, namely the position information of all keywords in the document, including starting positions and ending positions, arranging them in ascending order of starting position and storing them as kptinds;
step 3: traversing the indexes set against the content of kptinds; if the position of a word in the document is not within any position range stored in kptinds, the word does not belong to a keyword topic sequence and is marked N in the P/N marking mode; a word whose starting and ending positions fall within a range contained in kptinds is marked P; after the traversal is finished, the marking result set is stored as Labels.
7. The multi-feature fusion method for extracting keywords from english scientific and technical literature according to claim 1, wherein before performing model training, each feature of a word is labeled as follows:
the first step: extracting the prior science and technology dictionary feature (STD) of the words: traversing the keywords in the constructed keyword dictionary and matching each keyword against the original text of the document; if the keyword occurs in the text, recording the matched positions and marking all the corresponding words in the multi-feature word segmentation result Words as P, indicating that they belong to a keyword topic sequence; after the traversal is finished, marking every word in Words not marked P as N, indicating that it does not belong to a keyword, and storing the result as wd;
the second step: extracting the feature part-of-speech feature (FPOS) of the words using the natural language processing toolkit nltk: traversing the words in the multi-feature word segmentation result Words, tagging each word with the part-of-speech tagger of nltk, and mapping each tag into one of the four major classes N, V, J and O according to the classification used in the vector feature representation of the features; after the traversal is finished, storing the marking result as wp;
the third step: extracting the TF-IDF enabling characteristic value feature of the words, wherein the TF-IDF enabling characteristic value of a word is obtained by multiplying the term frequency TF of the word in the current document by its inverse document frequency IDF over the whole document set; to compute it, all texts in the corpus are first read, multi-feature word segmentation is performed on each text, repeated words and punctuation marks are removed, and the result is stored as a word list containing each word and the number of times it appears in the document; to compute the values for the words of a given document, the word list of the current document is traversed and the term frequency TF of each word is calculated from the number of times it appears in the document; the word list of the whole corpus is then traversed, the occurrences of the current word in the whole corpus are counted and its inverse document frequency IDF is calculated; finally TF is multiplied by IDF to obtain the TF-IDF enabling characteristic value of the word; after the word list has been traversed, the multi-feature word segmentation result Words is traversed, each word is assigned its TF-IDF enabling characteristic value, and the result is stored as wt;
the fourth step: extracting the writing format feature (C) of the words, the words in the corpus being divided into three categories according to writing format: UA, UP and L; traversing the word set produced by multi-feature word segmentation of the document and judging the writing format of each word, marking words written entirely in capitals as UA, words with only the first letter capitalized as UP, and all remaining words as L; after the traversal is finished, storing the marking result as wc.
8. The method for extracting keywords from multi-feature-fused English scientific and technical literature according to claim 1, wherein vector feature transformation comprises the following steps: after multi-feature word segmentation, result marking and feature extraction of the corpus are complete, and before the deep network learning model is trained, the words themselves (Words), the feature marking results wd, wp, wt and wc, and the actual results Labels are converted into vector form:
step 1): for Words, traversing the word set and using the trained word vector model GoogleNews300 to represent each word as a 300-dimensional vector; a token with no corresponding vector in the word vector model is represented, depending on whether or not it is a word, as a 300-dimensional vector with every dimension equal to 1.0 or as a 300-dimensional vector with every dimension equal to 0.0; the converted result is stored as WX;
step 2): converting the feature marking results wd, wp, wt and wc into vector form;
step 3): the marking result wd of the dictionary feature (STD) falls into two classes, P and N; the mark P is expressed as the vector (0, 1) and the mark N as the vector (1, 0), and the converted result is stored as DX;
for the marking result wp of the feature part-of-speech feature (FPOS), the four classes N, V, J and O are expressed as four 4-dimensional vectors: the mark N as (1, 0, 0, 0), the mark V as (0, 1, 0, 0), the mark J as (0, 0, 1, 0) and the mark O as (0, 0, 0, 1); the converted result is stored as PX;
for the marking result wt of the TF-IDF enabling characteristic value feature of the word, the value is expressed as a one-dimensional vector consisting of that value, and the result is stored as TX;
for the marking result wc of the writing format feature (C) of the word, the three classes UA, UP and L are converted into 3-dimensional vectors: the mark UA into (1, 0, 0), the mark UP into (0, 1, 0) and the mark L into (0, 0, 1); the converted result is stored as CX;
step 4): for the actual result set Labels of the multi-feature word segmentation result Words, a word marked P is expressed as the vector (0, 1) and a word marked N as the vector (1, 0); the converted result is stored as Y;
after the vector representation of the words, the features and the results is completed, the word vector and the feature vectors are concatenated according to formula 3 to form the input X for model training:
X = WX + DX + PX + TX + CX (formula 3)
where "+" denotes vector concatenation; the vector representation Y of the actual results is passed, as the expected result, to the keyword extraction model of the application for training.
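The vector construction in claim 8 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the marks are encoded as one-hot vectors in the order in which they are listed, that `embeddings` is a plain dictionary standing in for the GoogleNews300 model, and that a hypothetical predicate `use_ones` decides whether an out-of-vocabulary word receives the all-1.0 or the all-0.0 vector (the claim text does not state that rule clearly). All function and variable names are illustrative.

```python
from typing import Callable, Dict, List

EMBED_DIM = 300  # dimensionality of the GoogleNews300 word vectors

# Encodings of the marks, assumed to be one-hot in the order listed in the claim.
STD_CODES = {"N": [1.0, 0.0], "P": [0.0, 1.0]}                 # dictionary feature -> DX
FPOS_CODES = {                                                  # part-of-speech feature -> PX
    "N": [1.0, 0.0, 0.0, 0.0],
    "V": [0.0, 1.0, 0.0, 0.0],
    "J": [0.0, 0.0, 1.0, 0.0],
    "O": [0.0, 0.0, 0.0, 1.0],
}
CASE_CODES = {                                                  # writing format feature -> CX
    "UA": [1.0, 0.0, 0.0],
    "UP": [0.0, 1.0, 0.0],
    "L": [0.0, 0.0, 1.0],
}
LABEL_CODES = {"N": [1.0, 0.0], "P": [0.0, 1.0]}                # actual result -> Y


def word_vector(word: str,
                embeddings: Dict[str, List[float]],
                use_ones: Callable[[str], bool]) -> List[float]:
    """Return WX for one word, with a constant fallback for unknown words."""
    vec = embeddings.get(word)
    if vec is not None:
        return list(vec)
    # Hypothetical rule: use_ones(word) selects the all-1.0 vector, otherwise all-0.0.
    fill = 1.0 if use_ones(word) else 0.0
    return [fill] * EMBED_DIM


def build_input(word: str, wd: str, wp: str, wt: float, wc: str,
                embeddings: Dict[str, List[float]],
                use_ones: Callable[[str], bool]) -> List[float]:
    """Concatenate WX, DX, PX, TX and CX as in formula 3."""
    wx = word_vector(word, embeddings, use_ones)
    dx = STD_CODES[wd]
    px = FPOS_CODES[wp]
    tx = [float(wt)]  # TF-IDF value as a one-dimensional vector
    cx = CASE_CODES[wc]
    # List "+" is concatenation, which mirrors the "+" of formula 3.
    return wx + dx + px + tx + cx


toy_embeddings = {"neural": [0.1] * EMBED_DIM}  # stand-in for the real model
x = build_input("neural", "P", "N", 0.42, "L", toy_embeddings, lambda w: False)
y = LABEL_CODES["P"]
print(len(x))  # 310 = 300 + 2 + 4 + 1 + 3
```

Under these assumptions each word yields a 310-dimensional input X and a 2-dimensional target Y.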
9. The method for extracting keywords from multi-feature-fused English scientific and technical literature according to claim 1, wherein the model evaluation criteria are as follows: three evaluation measures are adopted, namely precision P, recall R and F1;
F1 is the weighted harmonic mean of P and R, see formula 4:
$$F1 = \frac{2 \times P \times R}{P + R} \quad \text{(formula 4)}$$
F1 combines precision and recall: F1 is high only when both P and R are high, and performance is ideal only when both precision and recall are high, so F1 is used as the performance measure of the system;
the results of model prediction are classified into four classes:
first class, TP: keywords correctly predicted as keywords;
second class, FP: non-keywords wrongly predicted as keywords;
third class, TN: non-keywords correctly predicted as non-keywords;
fourth class, FN: keywords wrongly predicted as non-keywords;
from these four classes of results, P and R are calculated as:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
P, R and F1 are then calculated from these results.
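As a small illustration of claim 9, the sketch below computes the three measures from the counts TP, FP and FN, assuming the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The function name and the choice to return 0.0 when a denominator is zero are assumptions made for this example, not something the claim specifies.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision P, recall R and F1 from the counts TP, FP and FN."""
    # Returning 0.0 for an empty denominator is an assumption of this sketch.
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(p, r, round(f1, 3))
```

With 6 true positives, 2 false positives and 4 false negatives this gives P = 0.75, R = 0.6 and F1 of about 0.667. TN does not enter any of the three measures.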
10. The method for extracting keywords from multi-feature-fused English scientific and technical literature according to claim 1, wherein the model evaluation method comprises the following steps: identifying the keyword topic sequences in the prediction result and comparing them with the keywords in the actual result to calculate P, R and F1, in the following two steps:
1) Identifying keywords in the prediction result: a sequence of consecutive words whose prediction result is marked P is regarded as one keyword; the prediction result is traversed, each run of consecutive P marks is aggregated into a keyword, and the keywords are extracted to obtain the total number Np of keywords in the prediction result;
2) Identifying correctly predicted keywords: the keywords extracted from the prediction result are compared with the keywords in the actual result to obtain the keywords that are predicted completely correctly, namely those keywords in the prediction result whose topic sequence is completely consistent with that of the corresponding keyword in the actual result; their number is counted as Nt, and the total number Na of keywords in the actual result is obtained from the result marking file in the corpus;
from the above calculation method and counts, P, R and F1 of the application are calculated as:
$$P = \frac{Nt}{Np}$$
$$R = \frac{Nt}{Na}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
the model is evaluated based on P, R and F1.
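The two evaluation steps of claim 10 can be sketched as follows, assuming P = Nt / Np, R = Nt / Na and F1 = 2PR / (P + R). The function names are illustrative, and two choices are assumptions of this sketch rather than statements from the claim: keywords are matched as exact word sequences regardless of position, and a repeated keyword is counted as correct at most as many times as it occurs in the actual result.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Keyword = Tuple[str, ...]


def extract_keywords(words: Sequence[str], labels: Sequence[str]) -> List[Keyword]:
    """Aggregate each maximal run of consecutive 'P' labels into one keyword."""
    keywords: List[Keyword] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        if label == "P":
            current.append(word)
        elif current:
            keywords.append(tuple(current))
            current = []
    if current:  # a run that reaches the end of the sequence
        keywords.append(tuple(current))
    return keywords


def evaluate(words: Sequence[str],
             predicted_labels: Sequence[str],
             actual_keywords: Sequence[Keyword]) -> Tuple[float, float, float]:
    """Return (P, R, F1) computed from the counts Np, Nt and Na."""
    predicted = extract_keywords(words, predicted_labels)
    n_p = len(predicted)          # Np: keywords found in the prediction result
    n_a = len(actual_keywords)    # Na: keywords in the actual result
    # Nt: predicted keywords that match an actual keyword exactly (multiset overlap).
    overlap = Counter(predicted) & Counter(tuple(k) for k in actual_keywords)
    n_t = sum(overlap.values())
    p = n_t / n_p if n_p else 0.0
    r = n_t / n_a if n_a else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


words = ["we", "use", "neural", "network", "for", "keyword", "extraction", "tasks"]
predicted = ["N", "N", "P", "P", "N", "P", "P", "P"]
actual = [("neural", "network"), ("keyword", "extraction")]
print(evaluate(words, predicted, actual))  # (0.5, 0.5, 0.5)
```

In this toy example the prediction yields two keywords (Np = 2), only "neural network" matches the actual result exactly (Nt = 1), and the actual result has two keywords (Na = 2); the over-long prediction "keyword extraction tasks" does not count as correct.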
CN202210725706.9A 2022-06-24 2022-06-24 Multi-feature fusion English scientific literature keyword extraction method Active CN115221871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210725706.9A CN115221871B (en) 2022-06-24 2022-06-24 Multi-feature fusion English scientific literature keyword extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210725706.9A CN115221871B (en) 2022-06-24 2022-06-24 Multi-feature fusion English scientific literature keyword extraction method

Publications (2)

Publication Number Publication Date
CN115221871A (en) 2022-10-21
CN115221871B (en) 2024-02-20

Family

ID=83610649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210725706.9A Active CN115221871B (en) 2022-06-24 2022-06-24 Multi-feature fusion English scientific literature keyword extraction method

Country Status (1)

Country Link
CN (1) CN115221871B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539907A (en) * 2008-03-19 2009-09-23 日电(中国)有限公司 Part-of-speech tagging model training device and part-of-speech tagging system and method thereof
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108763402A (en) * 2018-05-22 2018-11-06 广西师范大学 Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector
CN110188344A (en) * 2019-04-23 2019-08-30 浙江工业大学 A kind of keyword extracting method of multiple features fusion
CN110298032A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Text classification corpus labeling training system
CN110309303A (en) * 2019-05-22 2019-10-08 浙江工业大学 A kind of judicial dispute data visualization analysis method based on Weighted T F-IDF
CA3065784A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant classifier based on deep neural networks
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences
US20200125928A1 (en) * 2018-10-22 2020-04-23 Ca, Inc. Real-time supervised machine learning by models configured to classify offensiveness of computer-generated natural-language text
CN113723084A (en) * 2021-07-26 2021-11-30 内蒙古工业大学 Mongolian text emotion analysis method fusing priori knowledge
CN114065758A (en) * 2021-11-22 2022-02-18 杭州师范大学 Document keyword extraction method based on hypergraph random walk
CN114254653A (en) * 2021-12-23 2022-03-29 深圳供电局有限公司 Scientific and technological project text semantic extraction and representation analysis method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
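SARKAR KAMAL et al.: "A new approach to keyphrase extraction using neural networks" *
SIAUTAMA RYAN et al.: "Extractive hotel review summarization based on TF/IDF and adjective-noun pairing by considering annual sentiment trends" *
潘湑: "Research on key technologies for term definition extraction in the aviation domain and their applications" *
韩普 et al.: "Research on Chinese disease name normalization based on multi-feature fusion" *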

Also Published As

Publication number Publication date
CN115221871B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN110059311B (en) Judicial text data-oriented keyword extraction method and system
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN110298033B (en) Keyword corpus labeling training extraction system
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN112256939B (en) Text entity relation extraction method for chemical field
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN108363691B (en) Domain term recognition system and method for power 95598 work order
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN112818661B (en) Patent technology keyword unsupervised extraction method
CN112051986B (en) Code search recommendation device and method based on open source knowledge
CN111008530A (en) Complex semantic recognition method based on document word segmentation
CN110889275A (en) Information extraction method based on deep semantic understanding
CN112307182A (en) Question-answering system-based pseudo-correlation feedback extended query method
CN111460147B (en) Title short text classification method based on semantic enhancement
CN113159969A (en) Financial long text rechecking system
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN114266256A (en) Method and system for extracting new words in field
CN113360582A (en) Relation classification method and system based on BERT model fusion multi-element entity information
CN110569270B (en) Bayesian-based LDA topic label calibration method, system and medium
CN113032550B (en) Viewpoint abstract evaluation system based on pre-training language model
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant