CN105653706B

CN105653706B - Multi-layer citation recommendation method based on a literature content knowledge graph

Info

Publication number: CN105653706B
Application number: CN201511026567.7A
Authority: CN
Inventors: 张春霞; 陈俊鹏; 王森; 王树良; 赵小林
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2015-12-31
Filing date: 2015-12-31
Publication date: 2018-04-06
Anticipated expiration: 2035-12-31
Also published as: CN105653706A

Abstract

The invention discloses a multi-layer citation recommendation method based on a literature content knowledge graph, belonging to the fields of information recommendation and intelligent information processing. The method first obtains the user's query requirement, which consists of keywords from the title and abstract of the paper for which citations are to be recommended. Then, the search terms of the query are expanded using a knowledge graph of literature content; the knowledge graph consists of nodes for the research object words and research behavior words of documents, and of edges representing semantic relations such as synonymy, near-synonymy, superordinate-subordinate, part-whole, and parallel relations. Finally, an inverted index of the documents in the data set is built, candidate citations are selected, the similarity between each candidate citation and the query is calculated, and citation recommendation is performed using a gradient boosted regression tree. By performing multi-level citation recommendation based on the literature content knowledge graph, the method expands the range of candidate citations, expresses the research object and content of a paper accurately, improves the efficiency with which users obtain relevant literature, and has broad application prospects.

Description

Multi-layer citation recommendation method based on literature content knowledge graph

Technical Field

The invention relates to the technical field of information recommendation, and in particular to a multi-layer citation recommendation method based on a literature content knowledge graph. The method has broad application prospects in fields such as information recommendation, information retrieval, and online public opinion monitoring.

Background

Currently, information recommendation methods fall into three major categories: content-based recommendation, collaborative filtering-based recommendation, and hybrid methods.

In content-based recommendation, a content feature model of the recommendation objects and a user interest model are first built; the similarity between each recommendation object and the user's interests is then calculated; finally, the objects with the highest similarity are recommended to the user. The recommendation objects and the user model are typically represented by keywords. The advantage of this method is that the user interest model can be constructed from the user's history and thus reflects the user's needs and preferences. Its disadvantages are as follows. First, recommendation performance depends on the feature extraction method and the content feature model of the recommendation objects, that is, on the accuracy and completeness of their content features. Second, the representation of the recommendation objects and the user interest model, as well as the similarity calculation, are keyword-based and remain at the character string level, which limits the user's understanding of high-level concepts and makes it difficult to meet the user's real needs.

Collaborative filtering-based recommendation makes recommendations based on the correlation between recommendation objects or between users. It can be classified into user-based, item-based, and model-based collaborative recommendation. Its advantage is that it can handle complex objects, both structured and unstructured. Its disadvantages are the sparsity problem and the cold start problem. The sparsity problem means that, for a user associated with few recommendation objects, it is difficult to find users with similar interests in a huge user set. The cold start problem means that when a new user or a new recommendation object first appears in the recommendation system, the system has difficulty learning the new user's interests and preferences or recommending the new object.

Citation recommendation is an important research topic in information recommendation; it aims to find, in a massive body of literature, the papers that the current paper should cite. Existing citation recommendation methods mainly use the citation relations between documents and represent paper content and user interests with keywords.

Disclosure of Invention

The invention aims to solve the following problems of prior-art recommendation methods: they are limited by the number of similar users; they have difficulty retrieving documents that differ in wording but are semantically similar; they have difficulty retrieving documents that hold other semantic relations to the research object and research behavior of a paper; and their citation recommendation results therefore do not meet users' needs well. To this end, the invention provides a multi-layer citation recommendation method based on a literature content knowledge graph.

The purpose of the invention is realized by the following technical scheme.

A multi-layer citation recommendation method based on a literature content knowledge graph comprises the following steps:

Step 1, obtaining the query requirement

The title and abstract of the paper for which citations are to be recommended are extracted, stemming and lemmatization are performed, and punctuation marks and stop words are removed. Stop words are words without substantive meaning, mainly auxiliary words, prepositions, conjunctions, and the like. Keywords are then extracted as the search terms for querying the search engine Lucene.
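The preprocessing in Step 1 can be sketched as follows. This is a minimal illustration only: the stop-word list `STOP_WORDS` and the lemma table `LEMMAS` are small hypothetical stand-ins for a full stop-word list and a real stemmer or lemmatizer, and the function name `preprocess` is not taken from the patent.

```python
import re

# Hypothetical, deliberately tiny resources; a real system would use a full
# stop-word list and a proper stemmer/lemmatizer instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "with", "and", "on", "for", "in", "to"}
LEMMAS = {"entities": "entity", "identified": "identify", "models": "model"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and normalize word forms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # punctuation is discarded here
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("Named entities are identified with a hidden Markov model.")
print(terms)
```

The resulting term list would then be the input to keyword extraction and to the Lucene query.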

Step 2, performing query expansion using the knowledge graph of literature content

Firstly, the search terms of the query requirement are expanded: synonyms and near-synonyms of the search terms are obtained from a synonym dictionary and a near-synonym dictionary and added to the search term set;

secondly, the research object word u and the research behavior word v of the paper are identified from its title and abstract;

thirdly, synonyms and near-synonyms of the research object word u and the research behavior word v of the paper are extracted using the synonym dictionary and the near-synonym dictionary, and search expansion words are constructed and added to the search term set.

If the synonyms and near-synonyms of the research object word u are a₁,a₂,…,a_m (m is a natural number), and the synonyms and near-synonyms of the research behavior word v are b₁,b₂,…,b_n (n is a natural number), the following search expansion words are constructed, where "+" denotes the concatenation of two words. For example, "u+b₁" denotes the concatenation of the word u and the word b₁.

u+b₁,u+b₂,…,u+b_n,

a₁+v,a₁+b₁,a₁+b₂,…,a₁+b_n,

a₂+v,a₂+b₁,a₂+b₂,…,a₂+b_n,

…,

a_m+v,a_m+b₁,a_m+b₂,…,a_m+b_n.
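The synonym-based combinations listed above can be sketched as follows, assuming that "+" joins two words with a single space. The function name `expand_with_synonyms` and the synonym "entity mention" in the usage example are hypothetical and serve only as illustration.

```python
def expand_with_synonyms(u, v, a, b):
    """Build u+b_j, a_i+v and a_i+b_j, where '+' joins two words with a space.

    u, v: research object word and research behavior word
    a, b: lists of synonyms/near-synonyms of u and of v
    """
    expansions = [f"{u} {bj}" for bj in b]              # u+b_1, ..., u+b_n
    for ai in a:
        expansions.append(f"{ai} {v}")                  # a_i+v
        expansions.extend(f"{ai} {bj}" for bj in b)     # a_i+b_1, ..., a_i+b_n
    return expansions


result = expand_with_synonyms(
    "named entity", "recognition",
    a=["entity mention"],              # hypothetical synonym of u
    b=["detection", "extraction"],
)
print(result)
```

For m synonyms of u and n synonyms of v, this yields n + m(n + 1) expansion words; the original pair u+v is already part of the query and is not repeated.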

Fourthly, the superordinate concepts and subordinate concepts of the research object word u and the research behavior word v of the paper are extracted using the superordinate-subordinate relation sub-network of the knowledge graph;

if the superordinate concepts of u are c₁,c₂,…,c_p (p is a natural number), the subordinate concepts of u are d₁,d₂,…,d_q (q is a natural number), the superordinate concepts of v are e₁,e₂,…,e_s (s is a natural number), and the subordinate concepts of v are f₁,f₂,…,f_t (t is a natural number), the following search expansion words are constructed:

u+e_j(j＝1,2,…,s),u+f_j(j＝1,2,…,t),

a_i+e_j(i＝1,2,…,m,j＝1,2,…,s),a_i+f_j(i＝1,2,…,m,j＝1,2,…,t),

c_i+v(i＝1,2,…,p),d_i+v(i＝1,2,…,q),

c_i+b_j(i＝1,2,…,p,j＝1,2,…,n),d_i+b_j(i＝1,2,…,q,j＝1,2,…,n),

c_i+e_j(i＝1,2,…,p,j＝1,2,…,s),c_i+f_j(i＝1,2,…,p,j＝1,2,…,t),

d_i+e_j(i＝1,2,…,q,j＝1,2,…,s),d_i+f_j(i＝1,2,…,q,j＝1,2,…,t).
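The twelve families of combinations above reduce to two Cartesian products: {u, a_i} with {e_j, f_j}, and {c_i, d_i} with {v, b_j, e_j, f_j}. A minimal sketch under that reading follows; the function name `expand_with_hierarchy` is hypothetical, and "+" is again assumed to be a space-separated join.

```python
from itertools import product


def expand_with_hierarchy(u, v, a, b, c, d, e, f):
    """Build the superordinate/subordinate expansion words.

    a, b: synonyms of u and of v
    c, d: superordinate and subordinate concepts of u
    e, f: superordinate and subordinate concepts of v
    """
    related_objects = c + d
    related_behaviors = e + f
    # u+e_j, u+f_j, a_i+e_j, a_i+f_j
    pairs = list(product([u] + a, related_behaviors))
    # c_i+v, d_i+v, c_i+b_j, d_i+b_j, c_i+e_j, c_i+f_j, d_i+e_j, d_i+f_j
    pairs += list(product(related_objects, [v] + b + related_behaviors))
    return [f"{obj} {beh}" for obj, beh in pairs]


words = expand_with_hierarchy(
    "named entity", "recognition",
    a=[], b=["detection", "extraction"],
    c=["entity"], d=[], e=[], f=[],
)
print(words)
```

The same function shape would apply to the part-whole expansion by passing the overall and partial concepts in place of the superordinate and subordinate ones.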

Fifthly, the partial concepts and overall concepts of the research object word u and the research behavior word v of the paper are extracted using the part-whole relation sub-network of the knowledge graph. If the overall concepts of u are g₁,g₂,…,g_o (o is a natural number), the partial concepts of u are h₁,h₂,…,h_r (r is a natural number), the overall concepts of v are k₁,k₂,…,k_w (w is a natural number), and the partial concepts of v are l₁,l₂,…,l_z (z is a natural number), the following search expansion words are constructed:

u+k_j(j＝1,2,…,w),u+l_j(j＝1,2,…,z),

a_i+k_j(i＝1,2,…,m,j＝1,2,…,w),a_i+l_j(i＝1,2,…,m,j＝1,2,…,z),

g_i+v(i＝1,2,…,o),h_i+v(i＝1,2,…,r),

g_i+b_j(i＝1,2,…,o,j＝1,2,…,n),h_i+b_j(i＝1,2,…,r,j＝1,2,…,n),

g_i+k_j(i＝1,2,…,o,j＝1,2,…,w),g_i+l_j(i＝1,2,…,o,j＝1,2,…,z),

h_i+k_j(i＝1,2,…,r,j＝1,2,…,w),h_i+l_j(i＝1,2,…,r,j＝1,2,…,z).

Sixthly, the parallel concepts of the research object word u and the research behavior word v of the paper are extracted using the parallel relation sub-network of the knowledge graph. If the parallel concepts of u are x₁,x₂,…,x_k1 (k1 is a natural number), and the parallel concepts of v are y₁,y₂,…,y_k2 (k2 is a natural number), the following search expansion words are constructed:

u+y_j(j＝1,2,…,k2),x_i+v(i＝1,2,…,k1).
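One possible in-memory form of the knowledge graph, with words as nodes and typed edges for the semantic relations, is sketched below together with the parallel-concept expansion. The class `KnowledgeGraph`, its methods, and the relation label "parallel" are assumptions made for illustration; the patent does not prescribe a storage format.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Words as nodes; each edge carries one semantic relation type."""

    def __init__(self):
        self.edges = defaultdict(set)  # (relation, word) -> set of related words

    def add(self, relation, word, related_word, symmetric=False):
        self.edges[(relation, word)].add(related_word)
        if symmetric:  # e.g. synonymy and parallel relations hold in both directions
            self.edges[(relation, related_word)].add(word)

    def related(self, relation, word):
        return sorted(self.edges.get((relation, word), set()))


def expand_with_parallel(graph, u, v):
    """Build u+y_j and x_i+v from the parallel concepts of v and of u."""
    xs = graph.related("parallel", u)
    ys = graph.related("parallel", v)
    return [f"{u} {y}" for y in ys] + [f"{x} {v}" for x in xs]


graph = KnowledgeGraph()
graph.add("parallel", "recognition", "linking", symmetric=True)
graph.add("parallel", "recognition", "disambiguation", symmetric=True)
parallel_words = expand_with_parallel(graph, "entity", "recognition")
print(parallel_words)
```

Keeping one relation label per edge lets each sub-network (synonym, superordinate-subordinate, part-whole, parallel) be queried independently, which mirrors the per-relation expansion steps described above.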

Step 3, constructing an inverted index of the literature

The inverted index is constructed from the titles and abstracts of the documents in the data set; the process comprises preprocessing, index construction, and index storage. Preprocessing comprises stemming and lemmatization, and the removal of punctuation marks and stop words. Index construction comprises building a mapping dictionary from words to documents, sorting the words in lexicographic order, merging the document mapping information of identical words, and building the document inverted list, that is, the document inverted index.
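The index construction described above can be sketched as follows, assuming the documents have already been preprocessed into term lists. The function name `build_inverted_index` and the document IDs are hypothetical, and index storage is omitted.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps a document ID to the list of preprocessed terms of its title and abstract."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)  # merges the mapping information of identical words
    # Sort the words lexicographically and the document IDs within each posting list.
    return {term: sorted(postings[term]) for term in sorted(postings)}


index = build_inverted_index({
    1: ["named", "entity", "recognition"],
    2: ["entity", "linking"],
})
print(index)
```

Looking up a search term then returns the list of documents that contain it, which is the basis for selecting candidate citations in the next step.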

Step 4, selecting candidate citation sets

First, according to the expanded search term set, the papers whose title and abstract contain any of the search terms are retrieved from the data set. Then, the similarity between the query and each of these papers is calculated, and the N papers with the highest similarity (N is a natural number) are taken as the candidate citation set. The similarity between the query and a paper is calculated using the vector space model of the search engine Lucene: the query and the paper are represented as a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors.
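The candidate selection can be illustrated with a simplified vector space model. The sketch below uses a plain TF-IDF weighting chosen for illustration; it does not reproduce Lucene's actual scoring formula, and all function names and document IDs are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(terms, doc_freq, num_docs):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    counts = Counter(terms)
    return {t: tf * math.log(1 + num_docs / doc_freq[t])
            for t, tf in counts.items() if t in doc_freq}


def cosine(p, q):
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def top_n_candidates(query_terms, docs, n):
    """Return the IDs of the n papers most similar to the query."""
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    query_vec = tfidf_vector(query_terms, doc_freq, len(docs))
    scored = [
        (cosine(query_vec, tfidf_vector(terms, doc_freq, len(docs))), doc_id)
        for doc_id, terms in docs.items()
        if set(terms) & set(query_terms)  # keep only papers containing some search term
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:n]]


docs = {
    "p1": ["named", "entity", "recognition"],
    "p2": ["entity", "linking"],
    "p3": ["image", "segmentation"],
}
candidates = top_n_candidates(["named", "entity", "recognition"], docs, n=2)
print(candidates)
```

In practice the expanded search term set from Step 2 would be passed as `query_terms`, so that papers matching only an expansion word still enter the candidate set.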

Step 5, extracting similarity features of candidate citations and the query

Two similarity features between a candidate citation and the query are used. The first is the similarity between the candidate citation and the query computed by the search engine Lucene. The second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query: first, the topic distributions of the query and the candidate citation are obtained with a latent Dirichlet allocation (LDA) model; then, the KL distance between the two topic distributions is calculated.
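The second feature can be sketched as follows. The topic distributions in the example are made-up values rather than LDA output, the small constant `eps` is an assumed smoothing term to avoid division by zero, and the direction of the divergence (query relative to candidate) is an assumption, since the text does not state it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete topic distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


query_topics = [0.5, 0.5]        # hypothetical topic distribution of the query
candidate_topics = [0.9, 0.1]    # hypothetical topic distribution of a candidate citation
feature = kl_divergence(query_topics, candidate_topics)
print(feature)
```

Because the KL divergence is not symmetric, the same direction should be used consistently for training and for recommendation.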

Step 6, constructing training data for citation recommendation

First, for each training paper in the training data set, candidate citations are retrieved with the search engine Lucene based on the paper's title and abstract.

Second, a training sample is constructed for each candidate citation p. The features of a training sample comprise the citation count feature of the candidate citation p and the similarity features between p and the query constructed from the training paper. If the training paper cites the candidate citation p, the class label of the sample is 1; otherwise, it is 0. If the training paper contains m references, m positive samples and n-m negative samples can be constructed, where n is the size of the candidate citation set.
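The sample construction can be sketched as follows. The dictionary keys `citation_count`, `lucene_sim`, and `kl` and the function name `build_training_samples` are hypothetical names for the features described above, and the feature values are invented for illustration.

```python
def build_training_samples(candidates, cited_ids):
    """Turn the candidate citations of one training paper into (features, label) pairs.

    candidates: list of dicts with keys 'id', 'citation_count', 'lucene_sim', 'kl'
    cited_ids:  set of paper IDs that the training paper actually cites
    """
    samples = []
    for cand in candidates:
        features = [cand["citation_count"], cand["lucene_sim"], cand["kl"]]
        label = 1 if cand["id"] in cited_ids else 0
        samples.append((features, label))
    return samples


samples = build_training_samples(
    [
        {"id": "p1", "citation_count": 120, "lucene_sim": 0.82, "kl": 0.10},
        {"id": "p2", "citation_count": 15, "lucene_sim": 0.40, "kl": 0.75},
        {"id": "p3", "citation_count": 3, "lucene_sim": 0.35, "kl": 0.90},
    ],
    cited_ids={"p1"},
)
print(samples)
```

A reference of the training paper yields a positive sample only if it appears among the retrieved candidates, so the number of positives can be smaller than the number of references.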

Step 7, performing citation recommendation based on the gradient boosted regression tree

Firstly, a classification model is trained with a gradient boosted regression tree (GBRT) to perform citation recommendation. The classification features comprise the similarity features between the candidate citation and the query and the citation count feature of the paper. The output value of the GBRT is generally a real number between 0 and 1 and is used as the recommendation degree of the candidate citation. The greater the recommendation degree, the more likely the candidate citation is to be classified as "recommended". The M candidate citations with the highest recommendation degree (M is a natural number) are then taken as the citation recommendation result for the current paper;
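To make the ranking step concrete, the following is a deliberately small gradient boosting sketch in pure Python. It assumes squared-error loss and depth-1 regression trees (stumps), which is a simplification chosen for brevity and not necessarily the configuration used by the invention; all function names, hyperparameters, and feature values are hypothetical, and the output is not clipped to the interval from 0 to 1.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, lm, rm)
    return best[1:] if best else None


def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gbrt(X, y, n_trees=20, learning_rate=0.5):
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared loss the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row) for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict_gbrt(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


def recommend(model, candidates, m):
    """candidates: list of (paper_id, features); return the m highest-scoring IDs."""
    ranked = sorted(candidates, key=lambda c: -predict_gbrt(model, c[1]))
    return [paper_id for paper_id, _ in ranked[:m]]


# Hypothetical features: [Lucene similarity, KL distance]; label 1 means "cited".
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y = [1, 1, 0, 0]
model = fit_gbrt(X, y)
top = recommend(model, [("p1", [0.15, 0.85]), ("p2", [0.85, 0.15])], m=1)
print(top)
```

A production system would more likely rely on an established gradient boosting library with deeper trees; the sketch only shows how the model output serves as the recommendation degree used for ranking.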

secondly, for each recommended citation p, its research object word x and research behavior word y are identified from its title and abstract, and the multi-level semantic association between p and the current paper is constructed. Let u and v be the research object word and the research behavior word of the current paper, respectively;

case 1: if x is an overall concept of u, or y is an overall concept of v, the research content of the citation p includes the research content of the current paper. If x is a partial concept of u, or y is a partial concept of v, the research content of the current paper includes the research content of the citation p;

case 2: if x is a superordinate concept of u, or y is a superordinate concept of v, the research method of the citation p can be applied to solve the research problem of the current paper. If x is a subordinate concept of u, or y is a subordinate concept of v, the research method of the current paper can be applied to solve the research problem of the citation p;

case 3: if x is a parallel concept of u, or y is a parallel concept of v, the research method of the current paper can draw on the research method of the citation p.
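The three cases can be expressed as a lookup from relation type to association, as in the sketch below. The relation labels, the `relations` dictionary format, and the function name `associate` are assumptions made for illustration.

```python
# Association implied by the relation that a citation's word holds to the current paper's word.
ASSOCIATIONS = {
    "overall": "the research content of the citation includes that of the current paper",
    "partial": "the research content of the current paper includes that of the citation",
    "superordinate": "the citation's research method can be applied to the current paper's problem",
    "subordinate": "the current paper's research method can be applied to the citation's problem",
    "parallel": "the current paper's research method can draw on that of the citation",
}


def associate(relations, u, v, x, y):
    """relations maps (word, other) to the relation that `word` holds to `other`.

    u, v: research object/behavior words of the current paper
    x, y: research object/behavior words of the recommended citation
    """
    found = []
    for pair in ((x, u), (y, v)):
        relation = relations.get(pair)
        if relation in ASSOCIATIONS and ASSOCIATIONS[relation] not in found:
            found.append(ASSOCIATIONS[relation])
    return found


relations = {("entity", "named entity"): "superordinate"}  # hypothetical graph lookup
labels = associate(relations, u="named entity", v="recognition", x="entity", y="recognition")
print(labels)
```

The returned descriptions could be shown next to each recommended citation to explain why it is relevant to the current paper.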

Thus, the whole process of the method is completed.

Advantageous effects

Existing citation recommendation methods have difficulty retrieving documents that differ in wording but are semantically similar, have difficulty retrieving documents that hold other semantic relations to the research object and research behavior of a paper, and are limited by the number of similar users. To address these problems, the method introduces knowledge of the semantic associations between the contents of different documents and adopts a multi-layer citation recommendation method based on a literature content knowledge graph. It uses the various semantic relations of the research object words and research behavior words in document content to obtain search expansion words, and performs multi-level citation recommendation based on the gradient boosted regression tree, thereby improving the efficiency with which users obtain citations. Specifically:

(1) The method represents the research content of a paper in two ways: by keywords extracted from its title and abstract, and by its research object words and research behavior words. The research problem and research content of the paper are thus represented semantically and expressed more accurately, which improves the effect of citation recommendation.

(2) Search expansion words are obtained from the knowledge graph of literature content, that is, from the synonymy, near-synonymy, superordinate-subordinate, part-whole, and parallel relations of the research object words and research behavior words of the paper. This expands the range of candidate citations, thereby addressing the problem of missed citations and the initial cold start problem of the recommendation system.

(3) The method uses the gradient boosted regression tree (GBRT) for citation recommendation and treats citation recommendation as a classification problem: the class label of each training sample is 1 or 0, meaning "recommended" or "not recommended". This ensures both the quality of the recommendation results and the running efficiency of the method.

(4) Words that hold various semantic relations to the research object words and research behavior words of papers can be added dynamically to the knowledge graph of literature content, so the knowledge graph network keeps expanding, which improves the timeliness and flexibility of the citation recommendation method.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The process of the present invention will be described in detail with reference to examples.

Examples

Step 1, obtaining the query requirement.

The title and abstract of the paper for which citations are to be recommended are extracted, stemming and lemmatization are performed, and punctuation marks and stop words are removed. For example, stemming converts the word "entities" to "entity", and lemmatization converts the word "identified" to "identify". Stop words are words without substantive meaning, mainly auxiliary words, prepositions, conjunctions, and the like. For example, "is" and "with" are stop words. Keywords are then extracted as the search terms for querying the search engine Lucene.

Step 2, performing query expansion using the knowledge graph of literature content.

Firstly, the search terms of the query requirement are expanded: synonyms and near-synonyms of the search terms are obtained from a synonym dictionary and a near-synonym dictionary and added to the search term set.

For example, for a paper entitled "Named entity recognition based on a hidden Markov model", the keywords "hidden Markov model" and "named entity recognition" are extracted as search terms. The search expansion words "HMM" (hidden Markov model) and "NER" (named entity recognition) are obtained from the synonym dictionary and the near-synonym dictionary.

Secondly, the research object word u and the research behavior word v of the paper are identified from its title and abstract. For example, for the paper entitled "Named entity recognition based on a hidden Markov model", the research object word is identified as "named entity" and the research behavior word as "recognition".

If the synonyms and near-synonyms of the research object word u are a₁,a₂,…,a_m (m is a natural number), and the synonyms and near-synonyms of the research behavior word v are b₁,b₂,…,b_n (n is a natural number), the following search expansion words are constructed, where "+" denotes the concatenation of two words. For example, "u+b₁" denotes the concatenation of the word u and the word b₁, and "entity+detection" denotes the concatenation of the word "entity" and the word "detection", that is, "entity detection".

u+b₁,u+b₂,…,u+b_n,

a₁+v,a₁+b₁,a₁+b₂,…,a₁+b_n,

a₂+v,a₂+b₁,a₂+b₂,…,a₂+b_n,

…,

a_m+v,a_m+b₁,a_m+b₂,…,a_m+b_n.

For example, for the paper entitled "Named entity recognition based on a hidden Markov model", the synonyms "detection" and "extraction" of the research behavior word "recognition" are extracted, so the search expansion words "named entity detection" and "named entity extraction" are constructed and added to the search term set.

Fourthly, the superordinate concepts and subordinate concepts of the research object word u and the research behavior word v of the paper are extracted using the superordinate-subordinate relation sub-network of the knowledge graph.

If the superordinate concepts of u are c₁,c₂,…,c_p (p is a natural number), the subordinate concepts of u are d₁,d₂,…,d_q (q is a natural number), the superordinate concepts of v are e₁,e₂,…,e_s (s is a natural number), and the subordinate concepts of v are f₁,f₂,…,f_t (t is a natural number), the following search expansion words are constructed.

u+e_j(j＝1,2,…,s),u+f_j(j＝1,2,…,t),

a_i+e_j(i＝1,2,…,m,j＝1,2,…,s),a_i+f_j(i＝1,2,…,m,j＝1,2,…,t),

c_i+v(i＝1,2,…,p),d_i+v(i＝1,2,…,q),

c_i+b_j(i＝1,2,…,p,j＝1,2,…,n),d_i+b_j(i＝1,2,…,q,j＝1,2,…,n),

c_i+e_j(i＝1,2,…,p,j＝1,2,…,s),c_i+f_j(i＝1,2,…,p,j＝1,2,…,t),

d_i+e_j(i＝1,2,…,q,j＝1,2,…,s),d_i+f_j(i＝1,2,…,q,j＝1,2,…,t).

For example, for the paper entitled "Named entity recognition based on a hidden Markov model", the superordinate concept "entity" of the research object word "named entity" is extracted, so the search expansion words "entity recognition", "entity detection", and "entity extraction" can be constructed and added to the search term set.

Fifthly, the partial concepts and overall concepts of the research object word u and the research behavior word v of the paper are extracted using the part-whole relation sub-network of the knowledge graph. If the overall concepts of u are g₁,g₂,…,g_o (o is a natural number), the partial concepts of u are h₁,h₂,…,h_r (r is a natural number), the overall concepts of v are k₁,k₂,…,k_w (w is a natural number), and the partial concepts of v are l₁,l₂,…,l_z (z is a natural number), the following search expansion words are constructed.

u+k_j(j＝1,2,…,w),u+l_j(j＝1,2,…,z),

a_i+k_j(i＝1,2,…,m,j＝1,2,…,w),a_i+l_j(i＝1,2,…,m,j＝1,2,…,z),

g_i+v(i＝1,2,…,o),h_i+v(i＝1,2,…,r),

g_i+b_j(i＝1,2,…,o,j＝1,2,…,n),h_i+b_j(i＝1,2,…,r,j＝1,2,…,n),

g_i+k_j(i＝1,2,…,o,j＝1,2,…,w),g_i+l_j(i＝1,2,…,o,j＝1,2,…,z),

h_i+k_j(i＝1,2,…,r,j＝1,2,…,w),h_i+l_j(i＝1,2,…,r,j＝1,2,…,z).

For example, for the paper entitled "Named entity recognition based on a hidden Markov model", the overall concept "entity information" of "named entity" is extracted, so the search expansion words "entity information extraction", "entity information recognition", and "entity information detection" can be constructed and added to the search term set.

u+y_j(j＝1,2,…,k2),x_i+v(i＝1,2,…,k1).

For example, for the paper entitled "Named entity recognition based on a hidden Markov model", the parallel concepts "linking" and "disambiguation" of the research behavior word "recognition" are extracted, so the search expansion words "entity disambiguation" and "entity linking" are constructed and added to the search term set.

Step 3, constructing an inverted index of the documents.

Step 4, selecting the candidate citation set.

First, according to the expanded search term set, the papers whose title and abstract contain any of the search terms are retrieved from the data set. Then, the similarity between the query and each of these papers is calculated, and the N papers with the highest similarity (N is a natural number) are taken as the candidate citation set. The similarity between the query and a paper is calculated using the vector space model of Lucene: the query and the paper are represented as a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors.

Step 5, extracting similarity features of candidate citations and the query.

Two similarity features between a candidate citation and the query are used. The first is the similarity between the candidate citation and the query computed by Lucene. The second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query: first, the topic distributions of the query and the candidate citation are obtained with a latent Dirichlet allocation (LDA) model; then, the KL distance between the two topic distributions is calculated.

Step 6, constructing training data for citation recommendation.

Step 7, performing citation recommendation based on the gradient boosted regression tree.

Firstly, a classification model is trained with a gradient boosted regression tree (GBRT) to perform citation recommendation. The classification features comprise the similarity features between the candidate citation and the query and the citation count feature of the paper. The output value of the GBRT is generally a real number between 0 and 1 and is used as the recommendation degree of the candidate citation. The greater the recommendation degree, the more likely the candidate citation is to be classified as "recommended". The M candidate citations with the highest recommendation degree (M is a natural number) are then taken as the citation recommendation result for the current paper.

Case 1: if x is an overall concept of u, or y is an overall concept of v, the research content of the citation p includes the research content of the current paper. If x is a partial concept of u, or y is a partial concept of v, the research content of the current paper includes the research content of the citation p.

Case 2: if x is a superordinate concept of u, or y is a superordinate concept of v, the research method of the citation p can be applied to solve the research problem of the current paper. If x is a subordinate concept of u, or y is a subordinate concept of v, the research method of the current paper can be applied to solve the research problem of the citation p.

The method of the invention was tested experimentally on scientific papers in the field of physics. Average precision (AP) was used to evaluate the citation recommendation results.

For a paper q, let x_q be the reference set of q, and let y_q be an ordered list of pairs representing the citation recommendation result for q. y_q(i) is the i-th pair (A, B) of y_q, where A is a paper ID and B indicates whether the paper is cited by q: 1 means cited and 0 means not cited. The citations in y_q are sorted in descending order of the GBRT output value. The precision P_k(y_q) of y_q at the k-th position, where k is a natural number, is calculated as follows:

P_k(y_q) = (1/k) · Σ_{i=1}^{k} I(y_q(i))

where I(y_q(i)) denotes whether the paper in y_q(i) belongs to the reference set x_q of paper q: if the paper in y_q(i) belongs to x_q, then I(y_q(i)) = 1; if it does not, then I(y_q(i)) = 0.

Further, the average precision AP(y_q) of y_q is calculated as follows, where n is the number of pairs in y_q:

AP(y_q) = (1/n) · Σ_{k=1}^{n} P_k(y_q) · I(y_q(k))

Take the paper entitled "More Confining N=1 SUSY Gauge Theories from Non-Abelian Duality" as an example. The first 10 citations obtained by querying the data set with Lucene are, in order, (9811119,1), (9610139,1), (9804038,0), (9807222,0), (9603206,0), (9411149,1), (9607200,0), (9408155,0), (9810014,1), (9605113,0). The first 10 citations obtained by the method of the invention are, in order, (9411149,1), (9407087,0), (9408099,0), (9610139,1), (9811119,1), (9510101,0), (9503179,1), (9510148,1), (9408155,0), (9602031,0). The average precision of the Lucene-based citation recommendation result is about 0.29, and that of the method of the invention is about 0.33. The experimental results show that the citation recommendation method improves the efficiency with which users obtain citations. In addition, the method does not involve similar users and is therefore not limited by the number of similar users; by using the knowledge graph of literature content, it can recommend documents that have multi-level semantic associations with the paper.
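The two reported values can be reproduced with the short sketch below. It assumes that the average precision is the sum of the precision at each position holding a cited paper, divided by the list length n; under that assumption the two label sequences given above yield approximately 0.29 and 0.33. The function names are hypothetical.

```python
def precision_at(labels, k):
    """Fraction of cited papers among the first k recommendations."""
    return sum(labels[:k]) / k


def average_precision(labels):
    """Sum of precision at each cited position, divided by the list length n."""
    n = len(labels)
    return sum(precision_at(labels, k)
               for k in range(1, n + 1) if labels[k - 1] == 1) / n


# Second components of the pairs listed above (1 = cited, 0 = not cited).
lucene_labels = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
method_labels = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

ap_lucene = average_precision(lucene_labels)
ap_method = average_precision(method_labels)
print(round(ap_lucene, 2), round(ap_method, 2))
```

Note that this normalization by n differs from the common variant that divides by the number of relevant documents, so the values are comparable only between runs with the same list length.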

Claims

1. A multi-layer citation recommendation method based on a literature content knowledge graph is characterized by comprising the following steps:

step 1, acquiring a query requirement;

step 2, utilizing a knowledge graph of the document content to perform query expansion;

step 3, constructing an inverted index of the literature;

step 4, selecting a candidate citation set;

step 5, extracting similarity characteristics of the candidate quotation and the query;

step 6, constructing training data recommended by citations;

step 7, performing citation recommendation based on the gradient progressive regression tree;

the step 1 comprises the following steps: acquiring the title and abstract of the paper for which citations are to be recommended, performing stemming and lemmatization, and removing punctuation marks and stop words; extracting keywords as the search terms for querying the search engine Lucene;

the step 2 comprises the following steps:

thirdly, extracting synonyms and near-synonyms of the research object word u and the research behavior word v of the paper by using the synonym dictionary and the near-synonym dictionary, constructing search expansion words, and adding them to the search term set;

if the synonyms and near-synonyms of u are a₁,a₂,…,a_m, m being a natural number, and the synonyms and near-synonyms of v are b₁,b₂,…,b_n, n being a natural number, the following search expansion words are constructed, wherein "+" denotes the concatenation of two words; "u+b₁" denotes the concatenation of the word u and the word b₁; "entity+detection" denotes the concatenation of the word "entity" and the word "detection", i.e., "entity detection";

u+b₁,u+b₂,…,u+b_n,

a₁+v,a₁+b₁,a₁+b₂,…,a₁+b_n,

a₂+v,a₂+b₁,a₂+b₂,…,a₂+b_n,

…,

a_m+v,a_m+b₁,a_m+b₂,…,a_m+b_n；

if the superordinate concepts of u are c₁,c₂,…,c_p, p being a natural number, the subordinate concepts of u are d₁,d₂,…,d_q, q being a natural number, the superordinate concepts of v are e₁,e₂,…,e_s, s being a natural number, and the subordinate concepts of v are f₁,f₂,…,f_t, t being a natural number, the following search expansion words are constructed:

u+e_j,j＝1,2,…,s,u+f_j,j＝1,2,…,t,

a_i+e_j,i＝1,2,…,m,j＝1,2,…,s,a_i+f_j,i＝1,2,…,m,j＝1,2,…,t,

c_i+v,i＝1,2,…,p,d_i+v,i＝1,2,…,q,

c_i+b_j,i＝1,2,…,p,j＝1,2,…,n,d_i+b_j,i＝1,2,…,q,j＝1,2,…,n,

c_i+e_j,i＝1,2,…,p,j＝1,2,…,s,c_i+f_j,i＝1,2,…,p,j＝1,2,…,t,

d_i+e_j,i＝1,2,…,q,j＝1,2,…,s,d_i+f_j,i＝1,2,…,q,j＝1,2,…,t，

fifthly, extracting the partial concepts and overall concepts of the research object word u and the research behavior word v of the paper by using the part-whole relation sub-network of the knowledge graph; if the overall concepts of u are g₁,g₂,…,g_o, o being a natural number, the partial concepts of u are h₁,h₂,…,h_r, r being a natural number, the overall concepts of v are k₁,k₂,…,k_w, w being a natural number, and the partial concepts of v are l₁,l₂,…,l_z, z being a natural number, the following search expansion words are constructed:

u+k_j,j＝1,2,…,w,u+l_j,j＝1,2,…,z,

a_i+k_j,i＝1,2,…,m,j＝1,2,…,w,a_i+l_j,i＝1,2,…,m,j＝1,2,…,z,

g_i+v,i＝1,2,…,o,h_i+v,i＝1,2,…,r,

g_i+b_j,i＝1,2,…,o,j＝1,2,…,n,h_i+b_j,i＝1,2,…,r,j＝1,2,…,n,

g_i+k_j,i＝1,2,…,o,j＝1,2,…,w,g_i+l_j,i＝1,2,…,o,j＝1,2,…,z,

h_i+k_j,i＝1,2,…,r,j＝1,2,…,w,h_i+l_j,i＝1,2,…,r,j＝1,2,…,z，

sixthly, extracting the parallel concepts of the research object word u and the research behavior word v of the paper by using the parallel relation sub-network of the knowledge graph; if the parallel concepts of u are x₁,x₂,…,x_k1, k1 being a natural number, and the parallel concepts of v are y₁,y₂,…,y_k2, k2 being a natural number, the following search expansion words are constructed:

u+y_j,j＝1,2,…,k2,x_i+v,i＝1,2,…,k1；

step 3 comprises the following steps:

constructing an inverted index from the titles and abstracts of the documents in the data set, which comprises preprocessing, index construction and index storage; preprocessing comprises stemming and lemmatization, and removing punctuation marks and stop words; index construction comprises building a mapping dictionary from words to documents, sorting the words in lexicographic order, merging the document mapping information of identical words, and building the document inverted list, namely the document inverted index;
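
The index construction described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the indexing performed by a real search engine: the stop-word list is a tiny hypothetical sample, a crude plural-stripping rule stands in for proper stemming and lemmatization, and index storage is omitted.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not from the method).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on"}

def preprocess(text):
    """Lowercase, drop punctuation and stop words, apply a crude stemmer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]          # stand-in for real stemming/lemmatization
        result.append(tok)
    return result

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in preprocess(text):
            postings[tok].add(doc_id)   # merges entries for identical words
    # Words are emitted in lexicographic order.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

# Hypothetical documents: title and abstract text keyed by document id.
docs = {
    1: "Entity detection for documents",
    2: "Detection of events in documents",
}
index = build_inverted_index(docs)
```

Looking up a search term in `index` then returns the documents whose title or abstract text contains it, which is what the candidate retrieval in step 4 relies on.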

step 4 comprises the following steps:

firstly, according to the expanded search term set, retrieving from the data set the papers whose title and abstract contain any of the search terms; then, calculating the similarity between the query and each retrieved paper; taking the N papers with the highest similarity as the candidate citation set, wherein N is a natural number; the similarity between the query and a paper is calculated with the vector space model of the search engine Lucene: the query and the paper are represented by a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors;
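
The candidate selection can be illustrated with a plain TF-IDF vector space model. The sketch below is an assumption-laden stand-in: it does not reproduce Lucene's actual scoring formula, the IDF smoothing is one common choice among several, and the function names and example papers are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict from term to weight."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_n_candidates(query_tokens, papers, n):
    """Return the n papers most similar to the query as (id, score) pairs."""
    num_docs = len(papers)
    df = Counter(t for toks in papers.values() for t in set(toks))
    # Smoothed IDF (an assumed variant).
    idf = {t: math.log((1 + num_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    q_vec = tfidf_vector(query_tokens, idf)
    scored = []
    for pid, toks in papers.items():
        if not set(toks) & set(query_tokens):
            continue                 # keep only papers containing a search term
        scored.append((cosine(q_vec, tfidf_vector(toks, idf)), pid))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(pid, score) for score, pid in scored[:n]]

# Hypothetical preprocessed papers (title and abstract tokens).
papers = {
    "p1": ["entity", "detection", "text"],
    "p2": ["image", "segmentation"],
    "p3": ["entity", "linking", "knowledge", "graph"],
}
ranked = top_n_candidates(["entity", "detection"], papers, n=2)
```

Here `p2` is never scored because it shares no term with the query, and `p1` ranks above `p3` because it matches both query terms.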

step 5 comprises the following steps:

the similarity features between a candidate citation and the query are of the following two kinds; the first is the similarity between the candidate citation and the query as computed by the search engine Lucene; the second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query; firstly, the topic distributions of the query and of the candidate citation are obtained with the latent Dirichlet allocation (LDA) model; then, the KL distance between the two topic distributions is calculated;
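
Once the two topic distributions are available, the second feature is a direct computation. The sketch below assumes the distributions have already been inferred by an LDA model and are given as equal-length probability lists; the direction of the divergence (query relative to candidate) and the smoothing constant `eps` are assumptions, since the text does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions.

    eps guards against division by zero when q assigns zero probability
    to a topic; terms with p_i == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical 3-topic distributions, for illustration only.
query_topics = [0.7, 0.2, 0.1]
candidate_topics = [0.6, 0.3, 0.1]
kl_feature = kl_divergence(query_topics, candidate_topics)
```

The divergence is zero for identical distributions and is not symmetric, so whichever direction is chosen should be used consistently for training and recommendation.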

step 6 comprises the following steps:

6.1, for each training paper in the training data set, retrieving candidate citations with the search engine Lucene according to the title and the abstract of the training paper;

6.2, constructing a training sample for each candidate citation p; the features of the training sample comprise the citation frequency feature of the candidate citation p, and the similarity features between the query constructed from the training paper and the candidate citation p; if the training paper cites the candidate citation p, the classification label of the training sample is 1, otherwise it is 0; if the training paper contains m₁ references, m₁ positive samples and n₁ − m₁ negative samples can be constructed, wherein n₁ is the number of candidate citations;
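
The labeling rule in step 6.2 can be sketched as follows. The dictionary keys (`id`, `citation_count`, `lucene_sim`, `kl_distance`) and the example values are hypothetical names chosen for illustration; the feature values themselves are assumed to have been computed beforehand.

```python
def build_training_samples(candidates, cited_ids):
    """Turn the candidate citations of one training paper into samples.

    Each sample is (features, label); the label is 1 when the training
    paper actually cites the candidate, otherwise 0.
    """
    samples = []
    for cand in candidates:
        features = [
            cand["citation_count"],   # citation frequency feature
            cand["lucene_sim"],       # search-engine similarity feature
            cand["kl_distance"],      # topic-distribution KL feature
        ]
        label = 1 if cand["id"] in cited_ids else 0
        samples.append((features, label))
    return samples

# Hypothetical candidates retrieved for one training paper.
candidates = [
    {"id": "p1", "citation_count": 120, "lucene_sim": 0.78, "kl_distance": 0.03},
    {"id": "p3", "citation_count": 15, "lucene_sim": 0.24, "kl_distance": 0.41},
    {"id": "p7", "citation_count": 4, "lucene_sim": 0.19, "kl_distance": 0.66},
]
samples = build_training_samples(candidates, cited_ids={"p1"})
```

In this example n₁ = 3 and one candidate is cited, giving one positive and two negative samples. The count of m₁ positive samples stated above holds when every reference of the training paper appears among the retrieved candidates.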

step 7 comprises the following steps:

7.1, training a classification model with a gradient boosted regression tree (GBRT) to realize citation recommendation; the classification features comprise the similarity features between the candidate citation and the query, and the citation frequency feature of the paper; the output value of the GBRT is a real number between 0 and 1 and is used as the recommendation degree of the candidate citation; the greater the recommendation degree, the greater the likelihood that the candidate citation is classified as "recommended"; further, the M candidate citations with the highest recommendation degree are taken as the citation recommendation results for the current paper, wherein M is a natural number;

7.2, for each recommended citation p, identifying a research object word x and a research behavior word y from the title and abstract of the citation p; constructing a multi-layer semantic association relation between the current paper and each citation p; if u and v are the research object word and the research behavior word of the current paper respectively,

case 1: if x is the overall concept of u, or y is the overall concept of v, the research content of the citation p comprises the research content of the current paper; if x is a partial concept of u, or y is a partial concept of v, the research content of the current paper comprises the research content of the citation p;

case 2: if x is a higher-level concept of u, or y is a higher-level concept of v, the research method of the citation p can be applied to solve the research problem of the current paper; if x is a lower-level concept of u, or y is a lower-level concept of v, the research method of the current paper can be applied to solve the research problem of the citation p;