CN106951414A - Method for identifying lexical functions in academic text based on machine-learning ranking

Info

Publication number: CN106951414A
Application number: CN201710204292.4A
Authority: CN
Inventors: 万迅; 程齐凯; 陆伟
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-03-30
Filing date: 2017-03-30
Publication date: 2017-07-14

Abstract

The invention discloses a method for identifying lexical functions in academic text based on machine-learning ranking. The method comprises five steps: constructing training data; a ranking-based recognition method; feature construction; model training; and using the trained model to rank the word sequences contained in a document abstract, taking the top-1 result of the ranking as the extraction result. The invention learns a model on the constructed training set (18,690 document abstracts collected from the CNKI database whose titles match specific patterns) and ranks the word sequences contained in the test data (156 documents obtained by sampling from documents indexed by ACM and ACL and then screening). The test results show that the method performs well in identifying the core problem and core method of a paper.

Description

A method for identifying lexical functions in academic text based on machine-learning ranking

Technical field

The invention belongs to the field of intelligent recognition technology, and more particularly relates to a method for automatically identifying document-level lexical functions based on machine-learning ranking.

Background

Existing information retrieval and information management are mainly concerned with document-level information, and mostly use the bag-of-words model for document representation. Such processing is computationally convenient, but it loses deep semantic understanding of academic text and cannot answer more specific questions about the content and topics of academic documents. Moreover, now that both the stock and the growth rate of academic documents have reached an unmanageable level, traditional information retrieval and information management can no longer give scholars a grasp of all the literature in a discipline, which also puts great pressure on scholars searching for and reading the literature.

Among existing directly related research, Ding has addressed this topic, but Ding's work only mentions the concept of lexical function; it does not report in-depth results and makes no breakthrough in technical methods. Other related research, such as information extraction and ontology knowledge base construction, has produced a large number of results: researchers have proposed a series of theoretical and technical results around information extraction and ontology knowledge base construction, and many mature technical products and applications have appeared. In general, however, existing results on lexical function are few and have certain shortcomings: (1) the functional semantic framework that existing research sets for the vocabulary of academic text is too simple, giving only a two-class or three-class taxonomy, which cannot cover all the functional attributes of words in academic text; (2) the practical effectiveness of existing recognition methods is hard to guarantee: judging from the results reported in the relevant papers, their performance is insufficient for real semantic analysis applications; (3) existing research only identifies the function of words and does not analyze in depth the semantic relations between words, so the analysis result is just a few isolated words and does not amount to a genuine semantic understanding of the text. For example, one needs to obtain not only the words that state evaluation metrics (such as "recall" and "precision" in information retrieval) but also the specific metric values associated with them.

Summary of the invention

To solve the above problems, the invention proposes a method for identifying document-level lexical functions based on machine-learning ranking.

The technical solution adopted by the invention is a method for identifying lexical functions in academic text based on machine-learning ranking, characterized by comprising the following steps:

Step 1: Construct the training data;

Step 1.1: Collect a number of documents whose titles have the form "Y based on X"; for each document, convert its English title into a representation consisting of frequent words and parts of speech;

Step 1.2: Count the converted text representation patterns to obtain the title patterns of the "Y based on X" category;

Step 1.3: Annotate the patterns obtained in step 1.2 to obtain text matching patterns for extracting the problem and the method from a title;

Step 2: Ranking-based recognition method;

Step 2.1: Given a word combination P = {w₁, w₂, ..., w_m} and an annotated word sequence P′ = {w′₁, w′₂, ..., w′_n}, first extract terms from the text using the longest string matching method; segment at different granularities to construct string segmentation trees, and merge synonymous words; after the segmentation trees are merged, the matched strings in the texts are removed from their respective bags of words, yielding new representations P_processed and P′_processed of P and P′;

Step 2.2: Using a stop-word list, filter stop words from P_processed and P′_processed;

Step 2.3: Compute the similarity score of P and P′;

Step 3: Feature construction;

The features constructed for the word sequences to be ranked include lexical features, syntactic features and TextRank features;

Step 4: Model training;

Step 5: Use the trained model to rank the word sequences contained in a document abstract, and take the top-1 result of the ranking as the extraction result.

Compared with the prior art, the beneficial effects of the invention are as follows: in the method for automatically identifying document-level lexical functions based on machine-learning ranking, a model is learned on the constructed training set (18,690 document abstracts collected from the CNKI database whose titles match specific patterns) and used to rank the word sequences contained in the test data (156 documents obtained by sampling from documents indexed by ACM and ACL and then screening). The test results show that the method performs well in identifying the core problem and core method of a paper.

Brief description of the drawings

Fig. 1 is an example of a string segmentation tree according to an embodiment of the invention.

Embodiment

To help those of ordinary skill in the art understand and implement the invention, the invention is described in further detail below with reference to the accompanying drawing and an embodiment. It should be understood that the embodiment described here is intended only to illustrate and explain the invention, not to limit it.

The method for identifying document-level lexical functions based on machine-learning ranking provided by the invention comprises the following steps:

Step 1: construction of the training data. This embodiment collected 88,865 documents with titles of the form "Y based on X" from CNKI journal data in the computer science and library and information science fields; for each document, its English title is converted into a representation consisting of frequent words and parts of speech.

The construction method is as follows:

Step 1.1: a sentence s is represented as a word sequence {w₁, w₂, …, w_n}, where w_i denotes the i-th word in the sentence and n is the length of s. The frequent word list F records a set of predefined frequent words. By replacing every non-frequent word in s, i.e. every word that does not appear in F, with the chunk tag of that word, the text representation of s based on frequent words and parts of speech is obtained.

For example, for the sentence "In this paper, we present a method for information retrieval." with F = {in, we, present, for}, the corresponding pattern of the sentence is "In NN, we present NN for NN.".
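The conversion in step 1.1 can be sketched as follows. This is a minimal illustration, assuming the sentence is already tokenized and each token carries a chunk tag. The function name `to_pattern`, the input format, and the choices to keep punctuation and to collapse a run of identical tags into one are assumptions, not details given in the source.

```python
def to_pattern(tagged_tokens, frequent_words):
    """Replace non-frequent words with their chunk tag (hypothetical helper).

    tagged_tokens: list of (token, chunk_tag) pairs, assumed to come from a chunker.
    frequent_words: the frequent word list F, lower-cased.
    """
    pattern = []
    for token, tag in tagged_tokens:
        if token.lower() in frequent_words or not token.isalnum():
            # Frequent words and punctuation are kept as they are (assumption).
            pattern.append(token)
        elif not pattern or pattern[-1] != tag:
            # Consecutive non-frequent words with the same tag collapse into one tag.
            pattern.append(tag)
    return pattern


F = {"in", "we", "present", "for"}
SENTENCE = [
    ("In", "IN"), ("this", "NN"), ("paper", "NN"), (",", ","),
    ("we", "PRP"), ("present", "VB"), ("a", "NN"), ("method", "NN"),
    ("for", "IN"), ("information", "NN"), ("retrieval", "NN"), (".", "."),
]
PATTERN = to_pattern(SENTENCE, F)
```

With the example sentence from the text, `PATTERN` holds the tokens of "In NN, we present NN for NN.".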

Step 1.2: by counting the converted text representation patterns, the most common English title patterns of the "Y based on X" category are obtained; see Table 1.

Table 1: Examples of extraction patterns in this embodiment

By annotating the above patterns, text matching patterns for extracting the problem and the method from a title are obtained; examples of annotated extraction patterns are shown in Table 2.

Table 2: Examples of annotated extraction patterns

Using these patterns, the corresponding word combinations are extracted from the English titles of CNKI papers, and each word combination is assigned a category. This extraction yields 18,690 labeled data items for core problems and core methods. The extracted problem and method data constitute the annotation of the core problem and core method of the text they come from.
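To make the idea of an annotated extraction pattern concrete, here is a small sketch. The patterns of Tables 1 and 2 are not reproduced in this text, so the single regular expression below is a hypothetical stand-in for one "Y based on X" pattern. It also works on the raw title, not on the frequent-word and part-of-speech representation, which is a simplification. The names `TITLE_PATTERN` and `extract_problem_and_method` are assumptions.

```python
import re

# Hypothetical annotated pattern: "<problem> based on <method>".
TITLE_PATTERN = re.compile(
    r"^(?P<problem>.+?)\s+based\s+on\s+(?P<method>.+)$", re.IGNORECASE
)


def extract_problem_and_method(title):
    """Return the labeled word combinations, or None if the title does not match."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None
    return {"problem": match.group("problem"), "method": match.group("method")}


RESULT = extract_problem_and_method(
    "Text classification based on support vector machine"
)
```

A title that matches yields one problem label and one method label; a title that does not match any pattern contributes no training data.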

To demonstrate the reliability and cross-source applicability of these rules, the extraction rules shown in Table 2 were applied to the titles of papers indexed in the ACM database: if a title matches a template, the corresponding word sequence is output as the recognition result. For evaluation, the extraction results of 1,555 randomly selected titles were judged manually for correctness. The evaluation shows that the accuracy of core problem recognition is 99.55%. The accuracy of core method extraction varies with the evaluation criterion: if the main tool used in an experiment is also counted as a method for solving the problem, the accuracy is 98.65%; if tools are excluded, the accuracy is 90.23%.

Step 2: ranking-based recognition method. This embodiment uses the pairwise approach among machine-learning ranking models.

Step 2.1: given a word combination P = {w₁, w₂, …, w_m} and an annotated word sequence P′ = {w₁′, w₂′, …, w_n′}, terms are first extracted from the text. This embodiment extracts terms using the longest string matching method, segmenting at different granularities to construct a string segmentation tree.

For example, for the text "support vector machine based method", assuming the terms "support vector" and "support vector machine" exist, the string segmentation tree shown in Fig. 1 can be constructed.
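One possible way to build such a tree is sketched below. Fig. 1 is not reproduced in this text, so the exact tree shape is an assumption: here each node is a `(text, children)` tuple, the longest known term starting at each position becomes a child, and that child is segmented again at a finer granularity. The function name `build_segmentation_tree` and the term set are illustrative.

```python
def build_segmentation_tree(tokens, terms):
    """Return (text, children) for a token span, using longest-match segmentation."""
    children = []
    n = len(tokens)
    i = 0
    while i < n:
        match_len = 1
        # Try the longest multi-word term starting at i, skipping the span itself.
        for j in range(n, i + 1, -1):
            if i == 0 and j == n:
                continue
            if " ".join(tokens[i:j]) in terms:
                match_len = j - i
                break
        span = tokens[i:i + match_len]
        if match_len == 1:
            children.append((span[0], []))  # single word: leaf node
        else:
            children.append(build_segmentation_tree(span, terms))  # finer granularity
        i += match_len
    return (" ".join(tokens), children)


TERMS = {"support vector", "support vector machine"}
TREE = build_segmentation_tree(
    "support vector machine based method".split(), TERMS
)
```

For the example text, the root has three children: the term "support vector machine" (which itself contains the term "support vector" and the word "machine"), and the words "based" and "method".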

Step 2.2: once the segmentation tree representations of the two strings have been constructed, subsequent computation is carried out on the two trees. Using the word list provided by a synonym dictionary, the two nodes (one from each tree) whose merging yields the greatest gain are selected and merged each time; once a node has been merged, its parent node and descendant nodes no longer take part in subsequent merging. This is repeated until no nodes can be merged. Through segmentation tree merging, the synonyms in the text pair can be matched; the matched strings are regarded as synonymous and are removed from their respective bags of words. This yields new representations P_processed and P′_processed of P = {w₁, w₂, …, w_m} and P′ = {w₁′, w₂′, …, w_n′}.

Step 2.3: to avoid the influence of noise words, the resulting strings need further processing. Some words, such as "to", "novel", "one" and "a", need to be removed when computing similarity; this embodiment therefore filters stop words from P_processed and P′_processed, using a stop-word list of 561 stop words. Throughout the matching process, to eliminate the influence of morphological variation on the similarity score, matching is performed on stemmed text.

Step 2.4: given P and P′ and the corresponding P_processed and P′_processed, the similarity score is computed with a simple formula:

sim(P, P') = \frac{|P| - |P_{processed}|}{|P|}

where |*| denotes length. This similarity measure is asymmetric, i.e. sim(P, P′) is in general not equal to sim(P′, P). If all the words in P are semantically covered by P′, the similarity is 1; if the two share no overlapping word or word sequence, the similarity is 0.
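The score can be illustrated with a short sketch of sim(P, P′) = (|P| − |P_processed|) / |P|, the formula given in claim 3. Several simplifications are assumed here and are not from the source: matching is exact word matching only, with no synonym merging or stemming; the stop-word set is a four-word subset taken from the examples in the text; and stop words are dropped from P before its length is taken, so that they do not count as matches.

```python
STOP_WORDS = {"to", "novel", "one", "a"}  # tiny illustrative subset, not the 561-word list


def similarity(p, p_prime, stop_words=STOP_WORDS):
    """Asymmetric similarity: share of the words of p that are covered by p_prime."""
    p_base = [w.lower() for w in p if w.lower() not in stop_words]
    covered = {w.lower() for w in p_prime if w.lower() not in stop_words}
    if not p_base:
        return 0.0  # nothing left to match (assumed convention)
    # p_processed keeps the words of p that were NOT matched in p_prime.
    p_processed = [w for w in p_base if w not in covered]
    return (len(p_base) - len(p_processed)) / len(p_base)


FULL = similarity(["a", "novel", "ranking", "method"], ["ranking", "method", "for", "retrieval"])
PARTIAL = similarity(["ranking", "method"], ["ranking"])
REVERSED = similarity(["ranking"], ["ranking", "method"])
NONE = similarity(["svm"], ["ranking"])
```

`PARTIAL` and `REVERSED` swap the two arguments and give different scores, which shows the asymmetry described above.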

Step 3: feature construction. The features constructed for the word sequences to be ranked include lexical features, syntactic features and TextRank features.

Step 3.1: construct lexical features, including each word in the combination, the word preceding the current word sequence, the word following the current word sequence, the two words preceding the current word combination, the two words following the current word combination, and the nearest verb preceding the current words. Whether the sentence containing the object to be ranked includes particular text, such as "this paper", "we" or "our work", also has a considerable influence on ranking performance. A binary (0/1) feature is therefore constructed to mark whether the sentence containing the object to be ranked includes such text.
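The lexical features of step 3.1 could be collected along the following lines. This is a sketch under stated assumptions: the candidate is the token span `tokens[start:end]`, part-of-speech tags follow the Penn Treebank convention in which verb tags begin with "VB", and the feature names, padding symbols and `CUE_PHRASES` tuple are hypothetical.

```python
CUE_PHRASES = ("this paper", "we", "our work")  # examples named in the text


def lexical_features(tokens, pos_tags, start, end):
    """Lexical context features for the candidate tokens[start:end]."""
    lower = [t.lower() for t in tokens]
    padded_sentence = " " + " ".join(lower) + " "
    # Nearest verb to the left of the candidate, if any.
    prev_verb = next(
        (lower[i] for i in range(start - 1, -1, -1) if pos_tags[i].startswith("VB")),
        "<NONE>",
    )
    return {
        "words": lower[start:end],
        "prev_1": lower[start - 1] if start > 0 else "<PAD>",
        "next_1": lower[end] if end < len(lower) else "<PAD>",
        "prev_2": lower[max(0, start - 2):start],
        "next_2": lower[end:end + 2],
        "prev_verb": prev_verb,
        # Binary flag: does the sentence contain one of the cue phrases?
        "has_cue": int(any(f" {c} " in padded_sentence for c in CUE_PHRASES)),
    }


TOKENS = "In this paper we present a method for information retrieval".split()
POS = ["IN", "DT", "NN", "PRP", "VBP", "DT", "NN", "IN", "NN", "NN"]
FEATURES = lexical_features(TOKENS, POS, 5, 7)  # candidate: "a method"
```

In a real feature file these symbolic values would still have to be mapped to numeric feature indices, which the source does not describe.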

Step 3.2: construct syntactic features, including:

1. Head word identification;

The words in the word combination are added to a directed network, and the corresponding directed edges are built according to the dependency relations between the words. For example, for "an approach", an edge pointing from "approach" to "an" is constructed. The nodes of the network are then traversed; if they turn out to be all isolated nodes, "<MULI_HEAD>" is returned.

2. Dependency path from the word to ROOT;

The path from the Head word to ROOT is used as a feature. The path is output as (word1, Category1:Relation:Category2, word2)+, where word1 and word2 are word texts, Category1 and Category2 are parts of speech, Relation is the dependency relation from word1 to word2, and *+ denotes one or more repetitions of *. If the combination contains multiple Head words, the dependency path is not computed and "NOPATH" is returned directly;

3. Word-to-ROOT dependency path recording only verb nodes;

The method and output are the same as for the previous path, but only verbs are recorded.

4. Dependency features directly associated with the word. Given the Head word of a word or word combination, denoted word, the feature generation strategy for word is: for each dependency relation tr associated with word, denote the other word associated through tr as target; if word is the governor in relation tr (the term "governor" follows the usage in the Stanford Parser), return "tr:target"; if target is the governor, return "tr-r:target". Thus, if word has n associated relations, n features are formed.
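Items 1 and 4 of the list above can be sketched as follows. The source describes Head identification only briefly, so the rule used here is an interpretation: the Head is the one word of the combination that is not a dependent of another word of the combination, and "<MULI_HEAD>" is returned when there is not exactly one such word. The edge format `(governor_index, relation, dependent_index)` and both function names are assumptions; the relation labels in the example are illustrative.

```python
def find_head(phrase_indices, edges):
    """Return the index of the Head word, or "<MULI_HEAD>" if it is not unique."""
    inside = set(phrase_indices)
    # Words that depend on another word inside the same combination.
    dependents = {d for g, _, d in edges if g in inside and d in inside}
    heads = [i for i in phrase_indices if i not in dependents]
    return heads[0] if len(heads) == 1 else "<MULI_HEAD>"


def direct_dependency_features(head, edges, tokens):
    """One feature per dependency relation that touches the Head word."""
    features = []
    for governor, relation, dependent in edges:
        if governor == head:
            features.append(f"{relation}:{tokens[dependent]}")      # word is the governor
        elif dependent == head:
            features.append(f"{relation}-r:{tokens[governor]}")     # target is the governor
    return features


TOKENS = ["we", "present", "an", "approach"]
EDGES = [(1, "nsubj", 0), (1, "dobj", 3), (3, "det", 2)]
HEAD = find_head([2, 3], EDGES)            # combination: "an approach"
DEP_FEATURES = direct_dependency_features(HEAD, EDGES, TOKENS)
NO_SINGLE_HEAD = find_head([0, 3], EDGES)  # "we" and "approach" are not linked
```

Here the Head of "an approach" is "approach", which has two associated relations and therefore produces two features.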

Step 3.3: construct TextRank features. An unweighted, undirected co-word network is constructed using a sliding-window strategy, and the TextRank value of each word sequence to be ranked is computed on this basis.
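A compact sketch of this step is given below: a sliding window builds the unweighted, undirected co-occurrence graph, and a fixed number of PageRank-style iterations produces word scores. The window size, damping factor and iteration count are assumed values, and the function names are illustrative. The source does not say how word scores are combined into a value for a whole word sequence, so the sketch stops at word-level scores.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each word to the next `window` words (unweighted, undirected)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Iterate the TextRank update on an unweighted graph."""
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[word])
            for word in graph
        }
    return scores


TOKENS = "ranking model for ranking academic text".split()
SCORES = textrank(cooccurrence_graph(TOKENS))
TOP_WORD = max(SCORES, key=SCORES.get)
```

In this toy text "ranking" co-occurs with every other word, so it receives the highest score.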

Step 4: model training.

The training data are the 18,690 document abstracts collected from the CNKI database whose titles match specific patterns; the problems and methods extracted from these documents serve as natural annotations of the core problem and core method. The ranking model is trained with the SVM-Rank tool, which trains a pairwise ranking model based on a support vector machine. The text granularity used for learning to rank is the chunk. To obtain chunk data, this embodiment parses the text with the Stanford Parser and identifies the chunks contained in the text from the resulting syntactic structure. This embodiment uses OpenNLP for sentence segmentation and the Stanford POS Tagger for part-of-speech tagging. Model training produces separate ranking models for the core problem and the core method. The two ranking models use the same samples and features; they differ in the rank assigned to each sample under the respective category. When computing the relevance between a word sequence in the text and the target word sequence, this embodiment uses a stop-word list of 561 words. Stemming uses the PorterStemmer tool. The synonym list was extracted from document metadata indexed by CNKI using Chinese-English bilingual alignment, and contains 438,968 synonym pairs in total.
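As an illustration of how ranking samples might be handed to SVM-Rank, the sketch below writes one candidate chunk per line in the SVM-light style layout that SVM-Rank reads: `<target> qid:<id> <index>:<value> ... # comment`, with feature indices in increasing order. Using the similarity score as the target value, one `qid` per abstract, and the numeric feature indices are assumptions; the source does not describe the file layout. The helper name `to_svmrank_line` is hypothetical.

```python
def to_svmrank_line(target, qid, features, comment=""):
    """Format one ranking sample; `features` maps feature index -> numeric value."""
    pairs = " ".join(
        f"{index}:{value:g}" for index, value in sorted(features.items()) if value != 0
    )
    line = f"{target:g} qid:{qid} {pairs}"
    return f"{line} # {comment}" if comment else line


# Two candidate chunks from the same abstract (same qid); a higher target means a better rank.
LINES = [
    to_svmrank_line(1.0, 7, {3: 0.5, 1: 1}, "a method"),
    to_svmrank_line(0.0, 7, {2: 1, 3: 0.1}, "this paper"),
]
```

Because the core problem and the core method get separate models, the same feature vectors would be written twice, once with problem-based targets and once with method-based targets.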

Step 5: use the trained model to rank the word sequences contained in a document abstract, and take the top-1 result of the ranking as the extraction result.

In the test phase of this embodiment, 200 documents were randomly selected from documents indexed by ACM and ACL; 44 documents that the annotators could not read because of the limits of their research fields (for example, hardware research papers) were removed, leaving 156 test documents. Table 3 shows the evaluation results for extraction from titles using the rule-based method.

Table 3: Evaluation results for extraction from titles using the rule-based method

Evaluation was performed manually and focuses on precision and recall. Some documents do not clearly state a method or problem; such documents were annotated as having no method/problem. Table 4 shows the recognition performance for core problems and core methods.

Table 4: Recognition performance for core problems and core methods

The experimental results show that the method is effective to a certain degree in identifying the core problem and core method of a paper.

It should be understood that the parts not elaborated in this specification belong to the prior art.

It should be understood that the above description of the preferred embodiment is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the invention. Under the teaching of the invention, and without departing from the scope protected by the claims, those of ordinary skill in the art may make substitutions or variations, all of which fall within the protection scope of the invention. The claimed scope of the invention is determined by the appended claims.

Claims

1. A method for identifying lexical functions in academic text based on machine-learning ranking, characterized by comprising the following steps:

Step 1: Construct the training data;

Step 1.1: Collect a number of documents whose titles have the form "Y based on X"; for each document, convert its English title into a representation consisting of frequent words and parts of speech;

Step 1.2: Count the converted text representation patterns to obtain the title patterns of the "Y based on X" category;

Step 1.3: Annotate the patterns obtained in step 1.2 to obtain text matching patterns for extracting the problem and the method from a title;

Step 2: Ranking-based recognition method;

Step 2.1: Given a word combination P = {w₁, w₂, ..., w_m} and an annotated word sequence P′ = {w′₁, w′₂, ..., w′_n}, first extract terms from the text using the longest string matching method; segment at different granularities to construct string segmentation trees, and merge synonymous words; after the segmentation trees are merged, the matched strings in the texts are removed from their respective bags of words, yielding new representations P_processed and P′_processed of P and P′;

Step 2.3: Compute the similarity score of P and P′;

Step 3: Feature construction;

Step 4: Model training;

Step 5: Use the trained model to rank the word sequences contained in a document abstract, and take the top-1 result of the ranking as the extraction result.

2. The method for identifying lexical functions in academic text based on machine-learning ranking according to claim 1, characterized in that: in converting the English title of each document into a representation consisting of frequent words and parts of speech as described in step 1.1, a sentence s is first represented as a word sequence {w₁, w₂, …, w_n}, where w_i denotes the i-th word in the sentence and n is the length of s; the frequent word list F records a set of predefined frequent words; by replacing every non-frequent word in s, i.e. every word that does not appear in F, with the chunk tag of that word, the text representation of s based on frequent words and parts of speech is obtained.

3. The method for identifying lexical functions in academic text based on machine-learning ranking according to claim 1, characterized in that: the similarity score of P and P′ described in step 2.3 is computed by the formula:

sim(P, P') = \frac{|P| - |P_{processed}|}{|P|}

where |*| denotes length.

4. The method for identifying lexical functions in academic text based on machine-learning ranking according to claim 1, characterized in that step 3 is implemented through the following sub-steps:

Step 3.1: Construct lexical features, including each word in the combination, the word preceding the current word sequence, the word following the current word sequence, the two words preceding the current word combination, the two words following the current word combination, and the nearest verb preceding the current words;

Step 3.2: Construct syntactic features, including Head word identification, the dependency path from the word to ROOT, the word-to-ROOT dependency path recording only verb nodes, and the dependency features directly associated with the word;

In the Head word identification, the words in the word combination are added to a directed network, and the corresponding directed edges are built according to the dependency relations between the words; the nodes of the network are traversed, and if they turn out to be all isolated nodes, "<MULI_HEAD>" is returned;

For the dependency path from the word to ROOT, the path from the Head word to ROOT is used as a feature, and the path is output as (word1, Category1:Relation:Category2, word2)+, where word1 and word2 are word texts, Category1 and Category2 are parts of speech, Relation is the dependency relation from word1 to word2, and *+ denotes one or more repetitions of *; if the combination contains multiple Head words, the dependency path is not computed and "NOPATH" is returned directly;

For the word-to-ROOT dependency path recording only verb nodes, the method and output are the same as for the previous path, but only verbs are recorded;

For the dependency features directly associated with the word, given the Head word of a word or word combination, denoted word, the feature generation strategy for word is: for each dependency relation tr associated with word, denote the other word associated through tr as target; if word is the governor in relation tr, return "tr:target"; if target is the governor, return "tr-r:target"; thus, if word has n associated relations, n features are formed;

Step 3.3: Construct TextRank features;

An unweighted, undirected co-word network is constructed using a sliding-window strategy, and the TextRank value of each word sequence to be ranked is computed on this basis.

5. The method for identifying lexical functions in academic text based on machine-learning ranking according to any one of claims 1-4, characterized in that step 4 is implemented as follows:

Using a number of document abstracts whose titles match specific patterns, the problems and methods extracted from these documents serve as natural annotations of the core problem and core method; the ranking model is trained with the SVM-Rank tool, which trains a pairwise ranking model based on a support vector machine; the text granularity used by the ranking model is the chunk: the text is parsed with the Stanford Parser, and the chunks contained in the text are identified from the resulting syntactic structure; sentence segmentation is performed with OpenNLP, and part-of-speech tagging is performed with the Stanford POS Tagger; when computing the relevance between a word sequence in the text and the target word sequence, a stop-word list is used; stemming uses the PorterStemmer tool; the synonym list is extracted from existing document metadata using Chinese-English bilingual alignment.