CN105653706B - Multi-layer citation recommendation method based on a literature content knowledge graph - Google Patents

Info

Publication number
CN105653706B
Authority
CN
China
Prior art keywords
words
research
citation
candidate
paper
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511026567.7A
Other languages
Chinese (zh)
Other versions
CN105653706A (en
Inventor
张春霞
陈俊鹏
王森
王树良
赵小林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201511026567.7A priority Critical patent/CN105653706B/en
Publication of CN105653706A publication Critical patent/CN105653706A/en
Application granted granted Critical
Publication of CN105653706B publication Critical patent/CN105653706B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-layer citation recommendation method based on a literature content knowledge graph, and belongs to the fields of information recommendation and intelligent information processing. The method first obtains the user's query requirement, which consists of keywords from the title and abstract of the paper for which citations are to be recommended. Then, the search terms of the query are expanded based on a knowledge graph of literature content; the nodes of the knowledge graph are the research object words and research behavior words of documents, and its edges represent semantic relations such as synonymy, near-synonymy, superordinate-subordinate, part-whole, and parallel relations. Finally, an inverted index of the documents in the data set is built, candidate citations are selected, the similarity between each candidate citation and the query is computed, and citations are recommended using a gradient boosted regression tree. By performing multi-level citation recommendation based on the literature content knowledge graph, the method widens the range of candidate citations, expresses the research object and content of a paper accurately, improves the efficiency with which users obtain relevant literature, and has broad application prospects.

Description

Multi-layer citation recommendation method based on literature content knowledge graph
Technical Field
The invention relates to the technical field of information recommendation, and in particular to a multi-layer citation recommendation method based on a literature content knowledge graph. The method has broad application prospects in fields such as information recommendation, information retrieval, and online public opinion monitoring.
Background
Currently, information recommendation methods can be divided into three major categories: content-based recommendation, collaborative filtering-based recommendation, and hybrid methods.
In content-based recommendation, a content feature model of the recommendation objects and a user interest model are first built, then the similarity between each recommendation object and the user's interests is calculated, and finally the recommendation objects with the highest similarity are recommended to the user. Recommendation objects and user models are typically characterized by keywords. The advantage of this approach is that the user interest model can be built from the user's history, reflecting the user's needs and preferences. It has two limitations. First, recommendation performance depends on the feature extraction method and the content feature model of the recommendation objects, that is, on the accuracy and completeness of their content features. Second, recommendation objects and user interest models are represented, and their similarity is computed, on the basis of keywords, which remains at the character-string level; this limits the recognition of the high-level concepts the user has in mind and makes it difficult to meet the user's real needs.
Collaborative filtering-based recommendation makes recommendations based on the correlation between recommendation objects or between users. It can be divided into user-based, item-based, and model-based collaborative recommendation. Its advantage is that it can handle complex objects, both structured and unstructured. Its limitations are the sparsity problem and the cold start problem. The sparsity problem is that, for a user with few recommendation objects, it is difficult to find users with similar interests in a huge user set. The cold start problem is that, when a new user or a new recommendation object first appears in the recommendation system, the system has difficulty learning the new user's interests and preferences and recommending the new object.
Citation recommendation is an important research topic in information recommendation; it aims to find, in a massive body of literature, the papers that the current paper should cite. Existing citation recommendation methods mainly use the citation relations between documents to make recommendations, and represent paper content and user interests with keywords.
Disclosure of Invention
The invention aims to solve the following problems of the prior art: recommendation methods are limited by the number of similar users; documents that are semantically similar but worded differently are difficult to retrieve; documents that have different semantic association relations with the research object and research behavior of a paper are difficult to retrieve; and citation recommendation results cannot adequately meet users' needs. To this end, the invention provides a multi-layer citation recommendation method based on a literature content knowledge graph.
The purpose of the invention is realized by the following technical scheme.
A multi-layer citation recommendation method based on a literature content knowledge graph comprises the following steps:
Step 1, obtaining the query requirement
The title and abstract of the paper that needs citation recommendations are extracted, stemming and lemmatization are performed, and punctuation marks and stop words are removed. Stop words are words without substantive meaning, mainly auxiliary words, prepositions, conjunctions, and the like. Keywords are then extracted as the search terms required for querying the search engine Lucene.
Step 2, performing query expansion using the knowledge graph of literature content
First, the search terms of the query requirement are expanded: synonyms and near-synonyms of the search terms are obtained from a synonym dictionary and a near-synonym dictionary and added to the search term set;
Second, the research object word u and the research behavior word v of the paper are identified from its title and abstract;
Third, synonyms and near-synonyms of the research object word and the research behavior word of the paper are extracted using the synonym dictionary and the near-synonym dictionary, search expansion words are constructed, and these are added to the search term set.
If the synonyms and near-synonyms of the research object word $u$ of the paper are $a_1, a_2, \ldots, a_m$ ($m$ is a natural number), and the synonyms and near-synonyms of the research behavior word $v$ are $b_1, b_2, \ldots, b_n$ ($n$ is a natural number), the following search expansion words are constructed, where "+" denotes the concatenation of two words. For example, "$u+b_1$" denotes the concatenation of the word $u$ and the word $b_1$.
$u+b_1,\ u+b_2,\ \ldots,\ u+b_n,$
$a_1+v,\ a_1+b_1,\ a_1+b_2,\ \ldots,\ a_1+b_n,$
$a_2+v,\ a_2+b_1,\ a_2+b_2,\ \ldots,\ a_2+b_n,$
$\ldots,$
$a_m+v,\ a_m+b_1,\ a_m+b_2,\ \ldots,\ a_m+b_n.$
Fourth, the superordinate and subordinate concepts of the research object word $u$ and the research behavior word $v$ of the paper are extracted using the superordinate-subordinate relation sub-network of the knowledge graph;
if the superordinate concepts of $u$ are $c_1, c_2, \ldots, c_p$ ($p$ is a natural number), the subordinate concepts of $u$ are $d_1, d_2, \ldots, d_q$ ($q$ is a natural number), the superordinate concepts of $v$ are $e_1, e_2, \ldots, e_s$ ($s$ is a natural number), and the subordinate concepts of $v$ are $f_1, f_2, \ldots, f_t$ ($t$ is a natural number), the following search expansion words are constructed:
$u+e_j\ (j=1,2,\ldots,s),\quad u+f_j\ (j=1,2,\ldots,t),$
$a_i+e_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,s),\quad a_i+f_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,t),$
$c_i+v\ (i=1,2,\ldots,p),\quad d_i+v\ (i=1,2,\ldots,q),$
$c_i+b_j\ (i=1,2,\ldots,p,\ j=1,2,\ldots,n),\quad d_i+b_j\ (i=1,2,\ldots,q,\ j=1,2,\ldots,n),$
$c_i+e_j\ (i=1,2,\ldots,p,\ j=1,2,\ldots,s),\quad c_i+f_j\ (i=1,2,\ldots,p,\ j=1,2,\ldots,t),$
$d_i+e_j\ (i=1,2,\ldots,q,\ j=1,2,\ldots,s),\quad d_i+f_j\ (i=1,2,\ldots,q,\ j=1,2,\ldots,t).$
Fifth, the part concepts and whole concepts of the research object word $u$ and the research behavior word $v$ of the paper are extracted using the part-whole relation sub-network of the knowledge graph. If the whole concepts of $u$ are $g_1, g_2, \ldots, g_o$ ($o$ is a natural number), the part concepts of $u$ are $h_1, h_2, \ldots, h_r$ ($r$ is a natural number), the whole concepts of $v$ are $k_1, k_2, \ldots, k_w$ ($w$ is a natural number), and the part concepts of $v$ are $l_1, l_2, \ldots, l_z$ ($z$ is a natural number), the following search expansion words are constructed:
$u+k_j\ (j=1,2,\ldots,w),\quad u+l_j\ (j=1,2,\ldots,z),$
$a_i+k_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,w),\quad a_i+l_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,z),$
$g_i+v\ (i=1,2,\ldots,o),\quad h_i+v\ (i=1,2,\ldots,r),$
$g_i+b_j\ (i=1,2,\ldots,o,\ j=1,2,\ldots,n),\quad h_i+b_j\ (i=1,2,\ldots,r,\ j=1,2,\ldots,n),$
$g_i+k_j\ (i=1,2,\ldots,o,\ j=1,2,\ldots,w),\quad g_i+l_j\ (i=1,2,\ldots,o,\ j=1,2,\ldots,z),$
$h_i+k_j\ (i=1,2,\ldots,r,\ j=1,2,\ldots,w),\quad h_i+l_j\ (i=1,2,\ldots,r,\ j=1,2,\ldots,z).$
Sixth, the parallel concepts of the research object word $u$ and the research behavior word $v$ of the paper are extracted using the parallel relation sub-network of the knowledge graph. If the parallel concepts of $u$ are $x_1, x_2, \ldots, x_{k_1}$ ($k_1$ is a natural number) and the parallel concepts of $v$ are $y_1, y_2, \ldots, y_{k_2}$ ($k_2$ is a natural number), the following search expansion words are constructed:
$u+y_j\ (j=1,2,\ldots,k_2),\quad x_i+v\ (i=1,2,\ldots,k_1).$
Step 3, constructing an inverted index of the documents
The inverted index is built from the titles and abstracts of the documents in the data set; the process comprises preprocessing, index construction, and index storage. Preprocessing comprises stemming, lemmatization, and removal of punctuation and stop words. Index construction comprises building a mapping dictionary from words to documents, sorting the words in lexicographic order, merging the document mapping information of identical words, and building the document inverted list, i.e., the document inverted index.
Step 4, selecting the candidate citation set
First, using the expanded search term set, the papers whose title and abstract contain any of the search terms are retrieved from the data set. Then, the similarity between the query and each of these papers is calculated. The N (N is a natural number) papers with the highest similarity are taken as the candidate citation set. The similarity between the query and a paper is calculated using the vector space model of the search engine Lucene: the query and the paper are represented by a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors.
Step 5, extracting similarity features of candidate citations and the query
The similarity features of a candidate citation and the query are of two kinds. The first is the Lucene-based similarity between the candidate citation and the query. The second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query. First, a latent Dirichlet allocation (LDA) model is used to obtain the topic distributions of the query and the candidate citation. Then, the KL distance between the two topic distributions is calculated.
Step 6, constructing training data for citation recommendation
First, for each training paper in the training data set, the search engine Lucene is used to retrieve candidate citations based on its title and abstract.
Second, a training sample is constructed for each candidate citation p. The features of the training sample comprise the citation count feature of the candidate citation p and the similarity features between the candidate citation p and the query constructed from the training paper. If the training paper cites the candidate citation p, the class label of the sample is 1; otherwise it is 0. If the training paper contains m references, m positive samples and n-m negative samples can be constructed, where n is the size of the candidate citation set.
Step 7, performing citation recommendation based on the gradient boosted regression tree
First, a classification model is trained using a gradient boosted regression tree (GBRT) to perform citation recommendation. The classification features comprise the similarity features of the candidate citation and the query, and the citation count feature of the paper. The output value of the GBRT is generally a real number between 0 and 1 and is used as the recommendation degree of the candidate citation. The greater the recommendation degree, the more likely the candidate citation is to be classified as "recommended". The M (M is a natural number) candidate citations with the highest recommendation degree are then taken as the citation recommendation result for the current paper;
Second, for each recommended citation p, its research object word x and research behavior word y are identified from its title and abstract. A multi-level semantic association between each citation p and the current paper is then constructed. Let u and v be the research object word and the research behavior word of the current paper, respectively;
Case 1: if x is a whole concept of u, or y is a whole concept of v, the research content of citation p includes the research content of the current paper. If x is a part concept of u, or y is a part concept of v, the research content of the current paper includes the research content of citation p;
Case 2: if x is a superordinate concept of u, or y is a superordinate concept of v, the research method of citation p can be applied to solve the research problem of the current paper. If x is a subordinate concept of u, or y is a subordinate concept of v, the research method of the current paper can be applied to solve the research problem of citation p;
Case 3: if x is a parallel concept of u, or y is a parallel concept of v, the research method of the current paper can draw on the research method of citation p.
This completes the whole process of the method.
Advantageous effects
Existing citation recommendation methods have difficulty retrieving documents that are semantically similar but worded differently, have difficulty retrieving documents that have different semantic association relations with the research object and research behavior of a paper, and are limited by the number of similar users. To address these problems, the method introduces knowledge of the semantic associations between the contents of different documents and adopts a multi-layer citation recommendation method based on a literature content knowledge graph. The method uses the various semantic relations of the research object words and research behavior words in document content to obtain search expansion words, and performs multi-level citation recommendation based on the gradient boosted regression tree, thereby improving the efficiency with which users obtain citations. Specifically:
(1) The method represents the research content of a paper in two ways: by extracting keywords from the paper's title and abstract, and by extracting the paper's research object words and research behavior words. The research problem and research content of the paper are thus represented semantically, the research topic and content of the paper are expressed more accurately, and the effect of citation recommendation is improved.
(2) The knowledge graph of literature content is used to obtain search expansion words; that is, the synonymy, near-synonymy, superordinate-subordinate, part-whole, and parallel relations of the research object words and research behavior words of the paper are used to obtain search expansion words. This widens the range of candidate citations and thereby alleviates the problem of missed cited documents and the early cold start problem of a recommendation system.
(3) The method uses the gradient boosted regression tree (GBRT) for citation recommendation and treats citation recommendation as a classification problem: the class label of each training sample is 1 or 0, representing "recommended" or "not recommended". This safeguards both the quality of the recommendation results and the running efficiency of the citation recommendation method.
(4) Words that have different semantic relations with the research object words and research behavior words of a paper can be added dynamically to the knowledge graph of literature content, so that the knowledge graph network is continuously expanded, which improves the timeliness and flexibility of the citation recommendation method.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The method of the present invention is described in detail below with reference to an example.
Examples
A multi-layer citation recommendation method based on a literature content knowledge graph comprises the following steps:
Step 1, acquiring the query requirement.
The title and abstract of the paper that needs citation recommendations are extracted, stemming and lemmatization are performed, and punctuation marks and stop words are removed. For example, stemming converts the word "entities" to "entity", and lemmatization converts the word "identified" to "identify". Stop words are words without substantive meaning, mainly auxiliary words, prepositions, conjunctions, and the like. For example, "is" and "with" are stop words. Keywords are then extracted as the search terms required for querying the search engine Lucene.
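The preprocessing in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list and the suffix-stripping function `toy_stem` are hypothetical stand-ins for a full stop-word list and a real stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a full one.
STOP_WORDS = {"a", "an", "the", "is", "are", "with", "of", "on", "and", "for", "in"}

def toy_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer/lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"            # entities -> entity
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                  # models -> model
    return word

def preprocess(text):
    """Lowercase, drop punctuation, remove stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

terms = preprocess("Named entities are recognized with hidden Markov models.")
print(terms)
```

The resulting term list would then be the input to keyword extraction and to the Lucene query.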
Step 2, performing query expansion using the knowledge graph of literature content.
First, the search terms of the query requirement are expanded: synonyms and near-synonyms of the search terms are obtained from a synonym dictionary and a near-synonym dictionary and added to the search term set.
For example, for a paper titled "Named entity recognition based on a hidden Markov model", the keywords "hidden Markov model" and "named entity recognition" are extracted as search terms. The search expansion words "HMM (hidden Markov model)" and "NER (named entity recognition)" are obtained from the synonym dictionary and the near-synonym dictionary.
Second, the research object word u and the research behavior word v of the paper are identified from its title and abstract. For example, for the paper titled "Named entity recognition based on a hidden Markov model", the research object word is identified as "named entity" and the research behavior word as "recognition".
Third, synonyms and near-synonyms of the research object word and the research behavior word of the paper are extracted using the synonym dictionary and the near-synonym dictionary, search expansion words are constructed, and these are added to the search term set.
If the synonyms and near-synonyms of the research object word $u$ of the paper are $a_1, a_2, \ldots, a_m$ ($m$ is a natural number), and the synonyms and near-synonyms of the research behavior word $v$ are $b_1, b_2, \ldots, b_n$ ($n$ is a natural number), the following search expansion words are constructed, where "+" denotes the concatenation of two words. For example, "$u+b_1$" denotes the concatenation of the word $u$ and the word $b_1$, and "entity + detection" denotes the concatenation of the word "entity" and the word "detection", i.e., "entity detection".
$u+b_1,\ u+b_2,\ \ldots,\ u+b_n,$
$a_1+v,\ a_1+b_1,\ a_1+b_2,\ \ldots,\ a_1+b_n,$
$a_2+v,\ a_2+b_1,\ a_2+b_2,\ \ldots,\ a_2+b_n,$
$\ldots,$
$a_m+v,\ a_m+b_1,\ a_m+b_2,\ \ldots,\ a_m+b_n.$
For example, for the paper titled "Named entity recognition based on a hidden Markov model", the synonyms "detection" and "extraction" of the research behavior word "recognition" are extracted, so the search expansion words "named entity detection" and "named entity extraction" are constructed and added to the search term set.
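The synonym-based construction amounts to pairing every object variant with every behavior variant and dropping the original pair. A minimal sketch, assuming "+" is implemented as concatenation with a space; the function name `expand_with_synonyms` is illustrative and does not appear in the patent.

```python
def expand_with_synonyms(u, v, obj_syns, act_syns):
    """Build u+b_j, a_i+v and a_i+b_j; '+' is concatenation with a space."""
    objects = [u] + list(obj_syns)    # u, a_1, ..., a_m
    actions = [v] + list(act_syns)    # v, b_1, ..., b_n
    # Every object/behavior combination except the original pair (u, v).
    return [f"{o} {a}" for o in objects for a in actions if (o, a) != (u, v)]

expansions = expand_with_synonyms(
    "named entity", "recognition",
    obj_syns=[],                              # the example lists no object synonyms
    act_syns=["detection", "extraction"],
)
print(expansions)  # ['named entity detection', 'named entity extraction']
```

With m object synonyms and n behavior synonyms this yields (m+1)(n+1)-1 expansion words, matching the enumeration in the text.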
Fourth, the superordinate and subordinate concepts of the research object word $u$ and the research behavior word $v$ of the paper are extracted using the superordinate-subordinate relation sub-network of the knowledge graph.
If the superordinate concepts of $u$ are $c_1, c_2, \ldots, c_p$ ($p$ is a natural number), the subordinate concepts of $u$ are $d_1, d_2, \ldots, d_q$ ($q$ is a natural number), the superordinate concepts of $v$ are $e_1, e_2, \ldots, e_s$ ($s$ is a natural number), and the subordinate concepts of $v$ are $f_1, f_2, \ldots, f_t$ ($t$ is a natural number), the following search expansion words are constructed.
$u+e_j\ (j=1,2,\ldots,s),\quad u+f_j\ (j=1,2,\ldots,t),$
$a_i+e_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,s),\quad a_i+f_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,t),$
$c_i+v\ (i=1,2,\ldots,p),\quad d_i+v\ (i=1,2,\ldots,q),$
$c_i+b_j\ (i=1,2,\ldots,p,\ j=1,2,\ldots,n),\quad d_i+b_j\ (i=1,2,\ldots,q,\ j=1,2,\ldots,n),$
$c_i+e_j\ (i=1,2,\ldots,p,\ j=1,2,\ldots,s),\quad c_i+f_j\ (i=1,2,\ldots,p,\ j=1,2,\ldots,t),$
$d_i+e_j\ (i=1,2,\ldots,q,\ j=1,2,\ldots,s),\quad d_i+f_j\ (i=1,2,\ldots,q,\ j=1,2,\ldots,t).$
For example, for the paper titled "Named entity recognition based on a hidden Markov model", the superordinate concept "entity" of its research object word "named entity" is extracted, so the search expansion words "entity recognition", "entity detection", and "entity extraction" can be constructed and added to the search term set.
Fifth, the part concepts and whole concepts of the research object word $u$ and the research behavior word $v$ of the paper are extracted using the part-whole relation sub-network of the knowledge graph. If the whole concepts of $u$ are $g_1, g_2, \ldots, g_o$ ($o$ is a natural number), the part concepts of $u$ are $h_1, h_2, \ldots, h_r$ ($r$ is a natural number), the whole concepts of $v$ are $k_1, k_2, \ldots, k_w$ ($w$ is a natural number), and the part concepts of $v$ are $l_1, l_2, \ldots, l_z$ ($z$ is a natural number), the following search expansion words are constructed.
$u+k_j\ (j=1,2,\ldots,w),\quad u+l_j\ (j=1,2,\ldots,z),$
$a_i+k_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,w),\quad a_i+l_j\ (i=1,2,\ldots,m,\ j=1,2,\ldots,z),$
$g_i+v\ (i=1,2,\ldots,o),\quad h_i+v\ (i=1,2,\ldots,r),$
$g_i+b_j\ (i=1,2,\ldots,o,\ j=1,2,\ldots,n),\quad h_i+b_j\ (i=1,2,\ldots,r,\ j=1,2,\ldots,n),$
$g_i+k_j\ (i=1,2,\ldots,o,\ j=1,2,\ldots,w),\quad g_i+l_j\ (i=1,2,\ldots,o,\ j=1,2,\ldots,z),$
$h_i+k_j\ (i=1,2,\ldots,r,\ j=1,2,\ldots,w),\quad h_i+l_j\ (i=1,2,\ldots,r,\ j=1,2,\ldots,z).$
For example, for the paper titled "Named entity recognition based on a hidden Markov model", the whole concept "entity information" of "named entity" is extracted, so the search expansion words "entity information extraction", "entity information recognition", and "entity information detection" can be constructed and added to the search term set.
Sixth, the parallel concepts of the research object word $u$ and the research behavior word $v$ of the paper are extracted using the parallel relation sub-network of the knowledge graph. If the parallel concepts of $u$ are $x_1, x_2, \ldots, x_{k_1}$ ($k_1$ is a natural number) and the parallel concepts of $v$ are $y_1, y_2, \ldots, y_{k_2}$ ($k_2$ is a natural number), the following search expansion words are constructed.
$u+y_j\ (j=1,2,\ldots,k_2),\quad x_i+v\ (i=1,2,\ldots,k_1).$
For example, for the paper titled "Named entity recognition based on a hidden Markov model", the parallel concepts "linking" and "disambiguation" of the research behavior word "recognition" are extracted, so the search expansion words "entity disambiguation" and "entity linking" are constructed and added to the search term set.
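The fourth to sixth sub-steps share one pattern: look up related concepts in a relation sub-network, then combine them with the original words and their synonyms. The sketch below assumes a toy knowledge graph stored as a dictionary keyed by (word, relation); the graph contents, the relation names, and both function names are hypothetical and only mirror the examples in the text.

```python
# Hypothetical toy knowledge graph: (word, relation) -> related words.
KG = {
    ("named entity", "hypernym"): ["entity"],
    ("named entity", "whole"): ["entity information"],
    ("recognition", "parallel"): ["linking", "disambiguation"],
}

def related(word, *relations):
    """All words linked to `word` by any of the given relations."""
    return [w for r in relations for w in KG.get((word, r), [])]

def expand_hierarchical(u, v, obj_syns, act_syns, up, down):
    """Sub-steps 4 and 5: `up`/`down` name the two directions of one relation."""
    u_rel, v_rel = related(u, up, down), related(v, up, down)
    objects, actions = [u] + obj_syns, [v] + act_syns
    pairs = [(o, a) for o in objects for a in v_rel]     # u+e_j, a_i+e_j, ...
    pairs += [(o, a) for o in u_rel for a in actions]    # c_i+v, c_i+b_j, ...
    pairs += [(o, a) for o in u_rel for a in v_rel]      # c_i+e_j, d_i+f_j, ...
    return [f"{o} {a}" for o, a in pairs]

def expand_parallel(u, v):
    """Sub-step 6: only u+y_j and x_i+v are built, with no cross products."""
    return ([f"{u} {y}" for y in related(v, "parallel")]
            + [f"{x} {v}" for x in related(u, "parallel")])

syns = ["detection", "extraction"]
hyper = expand_hierarchical("named entity", "recognition", [], syns, "hypernym", "hyponym")
whole = expand_hierarchical("named entity", "recognition", [], syns, "whole", "part")
para = expand_parallel("named entity", "recognition")
print(hyper, whole, para, sep="\n")
```

Passing the relation names as parameters lets one function serve both the superordinate-subordinate and the part-whole sub-networks, since the text builds the same combinations for each.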
Step 3, constructing an inverted index of the documents.
The inverted index is built from the titles and abstracts of the documents in the data set; the process comprises preprocessing, index construction, and index storage. Preprocessing comprises stemming, lemmatization, and removal of punctuation and stop words. Index construction comprises building a mapping dictionary from words to documents, sorting the words in lexicographic order, merging the document mapping information of identical words, and building the document inverted list, i.e., the document inverted index.
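The index-construction step can be illustrated with a small in-memory version. This is a sketch only: it assumes the documents are already preprocessed into term lists, uses made-up document IDs, and omits index storage, which Lucene handles in practice.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: list of preprocessed terms}. Returns [(term, sorted doc ids)]."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)    # merge postings of identical terms
    # Sort terms lexicographically, as the index-construction step describes.
    return [(term, sorted(ids)) for term, ids in sorted(postings.items())]

docs = {
    1: ["named", "entity", "recognition", "hidden", "markov", "model"],
    2: ["entity", "linking", "knowledge", "graph"],
}
index = build_inverted_index(docs)
print(dict(index)["entity"])  # [1, 2]
```

Looking up any search term in this structure returns the documents that contain it, which is what candidate selection needs.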
Step 4, selecting the candidate citation set.
First, using the expanded search term set, the papers whose title and abstract contain any of the search terms are retrieved from the data set. Then, the similarity between the query and each of these papers is calculated. The N (N is a natural number) papers with the highest similarity are taken as the candidate citation set. The similarity between the query and a paper is calculated using the vector space model of Lucene: the query and the paper are represented by a query vector and a paper vector, and their similarity is the cosine similarity of the two vectors.
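A simplified version of this selection step is sketched below. It uses a plain TF-IDF vector space model with a smoothed IDF, which is an assumption made for illustration; Lucene's actual scoring includes further factors and is not reproduced here. The documents and function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, query):
    """docs: {id: [terms]}, query: [terms]. Returns (doc vectors, query vector)."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1.0 for t in df}  # smoothed idf (assumed)
    def vec(terms):
        tf = Counter(t for t in terms if t in idf)   # ignore terms unseen in the corpus
        return {t: c * idf[t] for t, c in tf.items()}
    return {d: vec(ts) for d, ts in docs.items()}, vec(query)

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_candidates(docs, query, n_top):
    """Return the ids of the n_top most similar papers sharing a term with the query."""
    doc_vecs, q_vec = tf_idf_vectors(docs, query)
    scored = [(cosine(q_vec, v), d) for d, v in doc_vecs.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    return [d for s, d in sorted(scored, key=lambda x: (-x[0], x[1]))[:n_top]]

docs = {
    1: ["named", "entity", "recognition", "hidden", "markov", "model"],
    2: ["entity", "linking", "knowledge", "graph"],
    3: ["image", "segmentation", "neural", "network"],
}
candidates = top_candidates(docs, ["named", "entity", "detection", "recognition"], n_top=2)
print(candidates)  # [1, 2]
```

Paper 3 shares no term with the query, so it never enters the candidate set regardless of N.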
Step 5, extracting similarity features of candidate citations and the query.
The similarity features of a candidate citation and the query are of two kinds. The first is the Lucene-based similarity between the candidate citation and the query. The second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query. First, a latent Dirichlet allocation (LDA) model is used to obtain the topic distributions of the query and the candidate citation. Then, the KL distance between the two topic distributions is calculated.
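The second feature can be computed as below once topic distributions are available. The sketch takes the two distributions as given (the values are made up rather than produced by an LDA model) and assumes the divergence is taken from the query to the candidate, since the text does not state the direction.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # eps guards against division by zero when q assigns no mass to a topic.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

query_topics = [0.7, 0.2, 0.1]        # hypothetical topic distribution of the query
candidate_topics = [0.6, 0.3, 0.1]    # hypothetical topic distribution of a candidate
feature = kl_divergence(query_topics, candidate_topics)
print(feature)
```

Because KL divergence is asymmetric, swapping the arguments generally gives a different value; whichever direction is chosen should be used consistently for training and recommendation.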
Step 6, constructing training data for citation recommendation.
First, for each training paper in the training data set, the search engine Lucene is used to retrieve candidate citations based on its title and abstract.
Second, a training sample is constructed for each candidate citation p. The features of the training sample comprise the citation count feature of the candidate citation p and the similarity features between the candidate citation p and the query constructed from the training paper. If the training paper cites the candidate citation p, the class label of the sample is 1; otherwise it is 0. If the training paper contains m references, m positive samples and n-m negative samples can be constructed, where n is the size of the candidate citation set.
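Sample construction for one training paper might look like the following. The feature order, paper IDs, and numeric values are illustrative assumptions; only the three feature types and the labeling rule come from the text.

```python
def build_training_samples(candidates, references, citation_counts, similarity):
    """Return (features, labels) for one training paper.

    candidates: list of candidate paper ids (size n)
    references: set of paper ids the training paper actually cites
    citation_counts: {paper id: number of times cited}
    similarity: {paper id: (lucene_score, kl_distance)}
    """
    features, labels = [], []
    for p in candidates:
        lucene_score, kl = similarity[p]
        features.append([citation_counts.get(p, 0), lucene_score, kl])
        labels.append(1 if p in references else 0)   # 1 = cited by the training paper
    return features, labels

X, y = build_training_samples(
    candidates=["p1", "p2", "p3"],
    references={"p2"},
    citation_counts={"p1": 12, "p2": 40},
    similarity={"p1": (0.31, 0.9), "p2": (0.77, 0.1), "p3": (0.05, 1.4)},
)
print(y)  # [0, 1, 0]
```

Note that a reference yields a positive sample only if retrieval placed it in the candidate set, so the count of m positives stated in the text holds when all m references are retrieved.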
Step 7, performing citation recommendation based on the gradient boosted regression tree.
First, a classification model is trained using a gradient boosted regression tree (GBRT) to perform citation recommendation. The classification features comprise the similarity features of the candidate citation and the query, and the citation count feature of the paper. The output value of the GBRT is generally a real number between 0 and 1 and is used as the recommendation degree of the candidate citation. The greater the recommendation degree, the more likely the candidate citation is to be classified as "recommended". The M (M is a natural number) candidate citations with the highest recommendation degree are then taken as the citation recommendation result for the current paper.
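To make the ranking step concrete, here is a deliberately small gradient boosting sketch in pure Python. It is an assumption-laden toy, not the patent's model: it boosts one-split regression stumps under squared loss on the 0/1 labels, and the hyperparameters, feature values, and function names are all hypothetical. A production system would use an established GBRT library.

```python
def fit_stump(X, residuals):
    """Best single-feature threshold split minimizing squared error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    return best[1:] if best else None

def fit_gbrt(X, y, n_rounds=50, lr=0.1):
    base = sum(y) / len(y)
    preds, stumps = [base] * len(y), []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient of squared loss
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        f, thr, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if row[f] <= thr else rm) for row, p in zip(X, preds)]
    return base, lr, stumps

def predict(model, row):
    """Recommendation degree: base value plus the shrunken sum of stump outputs."""
    base, lr, stumps = model
    return base + lr * sum(lm if row[f] <= thr else rm for f, thr, lm, rm in stumps)

def recommend(model, candidates, top_m):
    """candidates: list of (paper_id, feature_row); returns the top_m paper ids."""
    ranked = sorted(candidates, key=lambda c: (-predict(model, c[1]), c[0]))
    return [paper_id for paper_id, _ in ranked[:top_m]]

X = [[12, 0.31, 0.9], [40, 0.77, 0.1], [3, 0.05, 1.4], [25, 0.60, 0.2]]
y = [0, 1, 0, 1]
model = fit_gbrt(X, y)
top = recommend(model, list(zip(["p1", "p2", "p3", "p4"], X)), top_m=2)
print(top)  # ['p2', 'p4']
```

With 0/1 labels and squared loss the predictions stay close to the [0, 1] range described in the text, though nothing in this toy enforces that bound.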
Second, for each recommended citation p, its research object word x and research behavior word y are identified from its title and abstract. A multi-level semantic association between each citation p and the current paper is then constructed. Let u and v be the research object word and the research behavior word of the current paper, respectively.
Case 1: if x is a whole concept of u, or y is a whole concept of v, the research content of citation p includes the research content of the current paper. If x is a part concept of u, or y is a part concept of v, the research content of the current paper includes the research content of citation p.
Case 2: if x is a superordinate concept of u, or y is a superordinate concept of v, the research method of citation p can be applied to solve the research problem of the current paper. If x is a subordinate concept of u, or y is a subordinate concept of v, the research method of the current paper can be applied to solve the research problem of citation p.
Case 3: if x is a parallel concept of u, or y is a parallel concept of v, the research method of the current paper can draw on the research method of citation p.
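The three cases reduce to a lookup from the relation between the citation's words and the current paper's words to an explanation. A minimal sketch, in which the relation labels and the function name are hypothetical and the relation of x to u (or y to v) is assumed to have been read from the knowledge graph already:

```python
# Hypothetical relation labels: relation of the citation's word to the current paper's word.
EXPLANATIONS = {
    "whole": "the citation's research content includes that of the current paper",
    "part": "the current paper's research content includes that of the citation",
    "hypernym": "the citation's method can be applied to the current paper's problem",
    "hyponym": "the current paper's method can be applied to the citation's problem",
    "parallel": "the current paper's method can draw on the citation's method",
}

def explain_association(relation_x_to_u, relation_y_to_v):
    """Return the explanations triggered by the object or the behavior relation."""
    found = []
    for rel in (relation_x_to_u, relation_y_to_v):
        if rel in EXPLANATIONS and EXPLANATIONS[rel] not in found:
            found.append(EXPLANATIONS[rel])
    return found

print(explain_association("hypernym", None))
```

Because each case is joined by "or", a citation can trigger more than one explanation when its object word and behavior word stand in different relations to u and v.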
The method was tested experimentally on scientific papers in the field of physics. Average precision (AP) was used to evaluate the citation recommendation results.
For a paper $q$, let $x_q$ be the reference set of paper $q$, and let $y_q$ be an ordered set of pairs representing the citation recommendation result for paper $q$. $y_q(i)$ is the $i$-th pair $\langle A, B\rangle$ in $y_q$, where $A$ is a paper ID and $B$ indicates whether the paper is cited: 1 indicates cited and 0 indicates not cited. The citations in $y_q$ are sorted in descending order of the GBRT output value. The precision $P_k(y_q)$ of $y_q$ at the $k$-th position, where $k$ is a natural number, is calculated as
$$P_k(y_q) = \frac{1}{k}\sum_{i=1}^{k} \delta\big(y_q(i)\big)$$
where $\delta(y_q(i))$ denotes whether the paper in $y_q(i)$ belongs to the reference set of paper $q$: if the paper in $y_q(i)$ belongs to the reference set of paper $q$, then $\delta(y_q(i)) = 1$; if it does not, then $\delta(y_q(i)) = 0$.
Further, the average precision $AP(y_q)$ of $y_q$ is calculated as
$$AP(y_q) = \frac{1}{n}\sum_{k=1}^{n} P_k(y_q)\,\delta\big(y_q(k)\big)$$
where $n$ is the number of pairs in $y_q$.
Taking the paper entitled "More Confining N = 1 SUSY Gauge Theories from Non-Abelian Duality" as an example, the top 10 citations retrieved from the dataset with Lucene are, in order, (9811119,1), (9610139,1), (9804038,0), (9807222,0), (9603206,0), (9411149,1), (9607200,0), (9408155,0), (9810014,1), (9605113,0). The top 10 citations obtained by the method of the invention are, in order, (9411149,1), (9407087,0), (9408099,0), (9610139,1), (9811119,1), (9510101,0), (9503179,1), (9510148,1), (9408155,0), (9602031,0). The average precision of the Lucene-based citation recommendation result is about 0.29, and the average precision of the result obtained with the method of the invention is about 0.33. The experimental results show that the citation recommendation method improves the efficiency with which users obtain citations. In addition, the method does not rely on similar users and is therefore not limited by the number of similar users; by using the knowledge graph of document content, it can recommend documents that have multi-level semantic associations with the paper.
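The two reported values can be checked with a short calculation. The sketch below assumes that the precision at position k is accumulated only at positions holding a referenced paper and that the sum is divided by the list length n (the text defines n as the number of pairs); under that assumption the two example lists give the reported 0.29 and 0.33. The function and variable names are illustrative.

```python
from typing import List, Tuple

Ranked = List[Tuple[int, int]]  # (paper ID, 1 if referenced else 0)


def precision_at_k(ranked: Ranked, k: int) -> float:
    """Fraction of the first k recommended papers that are referenced."""
    return sum(label for _, label in ranked[:k]) / k


def average_precision(ranked: Ranked) -> float:
    """Sum precision at each referenced position, divided by the list length n."""
    n = len(ranked)
    total = sum(precision_at_k(ranked, k) * ranked[k - 1][1] for k in range(1, n + 1))
    return total / n


lucene_top10: Ranked = [
    (9811119, 1), (9610139, 1), (9804038, 0), (9807222, 0), (9603206, 0),
    (9411149, 1), (9607200, 0), (9408155, 0), (9810014, 1), (9605113, 0),
]
method_top10: Ranked = [
    (9411149, 1), (9407087, 0), (9408099, 0), (9610139, 1), (9811119, 1),
    (9510101, 0), (9503179, 1), (9510148, 1), (9408155, 0), (9602031, 0),
]

print(round(average_precision(lucene_top10), 2))  # 0.29
print(round(average_precision(method_top10), 2))  # 0.33
```

Note that this normalization differs from the common variant that divides by the number of referenced papers; with that variant the two lists would score roughly 0.74 and 0.66, which does not match the reported figures.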

Claims (1)

1. A multi-layer citation recommendation method based on a literature content knowledge graph is characterized by comprising the following steps:
step 1, acquiring a query requirement;
step 2, utilizing a knowledge graph of the document content to perform query expansion;
step 3, constructing an inverted index of the literature;
step 4, selecting a candidate citation set;
step 5, extracting similarity features between the candidate citations and the query;
step 6, constructing training data for citation recommendation;
step 7, performing citation recommendation based on the gradient boosted regression tree (GBRT);
the step 1 comprises the following steps: acquiring the title and abstract of the paper for which citations are to be recommended, applying stemming and lemmatization to them, and removing punctuation marks and stop words; extracting keywords as the search terms required for querying the search engine Lucene;
the step 2 comprises the following steps:
firstly, expanding the search words of the query requirement, acquiring synonyms and near-synonyms of the search words by utilizing a synonym dictionary and a near-synonym dictionary, and expanding a search word set;
secondly, according to the title and abstract of the paper, identifying a research object word u and a research behavior word v of the paper;
thirdly, extracting synonyms and near-synonyms of the research object word u and the research behavior word v of the paper by utilizing the synonym dictionary and the near-synonym dictionary, constructing search expansion words and adding them to the search term set;
if the synonyms and near-synonyms of u are $a_1, a_2, \ldots, a_m$, where m is a natural number, and the synonyms and near-synonyms of v are $b_1, b_2, \ldots, b_n$, where n is a natural number, the search expansion words are constructed as follows, where "+" denotes the concatenation of two words; "$u+b_1$" denotes the concatenation of the word u and the word $b_1$; for example, "entity + detect" denotes the concatenation of the word "entity" and the word "detect", i.e., "entity detect";
$u+b_1,\ u+b_2,\ \ldots,\ u+b_n,$
$a_1+v,\ a_1+b_1,\ a_1+b_2,\ \ldots,\ a_1+b_n,$
$a_2+v,\ a_2+b_1,\ a_2+b_2,\ \ldots,\ a_2+b_n,$
$\ldots,$
$a_m+v,\ a_m+b_1,\ a_m+b_2,\ \ldots,\ a_m+b_n;$
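The construction in the third sub-step can be sketched as follows: every combination of u or one of its synonyms with v or one of its synonyms is produced, except the original pair u + v, which does not appear in the list. The function name and the example synonyms are hypothetical.

```python
from typing import List


def synonym_expansions(u: str, v: str, a: List[str], b: List[str]) -> List[str]:
    """Pair u and its synonyms a with v and its synonyms b, skipping u + v itself."""
    objects = [u] + a      # u, a_1, ..., a_m
    behaviors = [v] + b    # v, b_1, ..., b_n
    return [
        f"{x} {y}"         # "+" is word concatenation, e.g. "entity detect"
        for x in objects
        for y in behaviors
        if (x, y) != (u, v)
    ]


# Made-up synonyms, for illustration only.
expansions = synonym_expansions("entity", "detect", ["object"], ["recognize", "identify"])
```

For m synonyms of u and n synonyms of v this yields (m + 1)(n + 1) - 1 expansion words, in the same order as the rows listed above.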
fourthly, extracting the superordinate concepts and subordinate concepts of the research object word u and the research behavior word v of the paper by utilizing the superordinate-subordinate relation sub-network in the knowledge graph;
if the superordinate concepts of u are $c_1, c_2, \ldots, c_p$, where p is a natural number, the subordinate concepts of u are $d_1, d_2, \ldots, d_q$, where q is a natural number, the superordinate concepts of v are $e_1, e_2, \ldots, e_s$, where s is a natural number, and the subordinate concepts of v are $f_1, f_2, \ldots, f_t$, where t is a natural number, the following search expansion words are constructed:
$u+e_j,\ j=1,2,\ldots,s$; $u+f_j,\ j=1,2,\ldots,t$,
$a_i+e_j,\ i=1,2,\ldots,m,\ j=1,2,\ldots,s$; $a_i+f_j,\ i=1,2,\ldots,m,\ j=1,2,\ldots,t$,
$c_i+v,\ i=1,2,\ldots,p$; $d_i+v,\ i=1,2,\ldots,q$,
$c_i+b_j,\ i=1,2,\ldots,p,\ j=1,2,\ldots,n$; $d_i+b_j,\ i=1,2,\ldots,q,\ j=1,2,\ldots,n$,
$c_i+e_j,\ i=1,2,\ldots,p,\ j=1,2,\ldots,s$; $c_i+f_j,\ i=1,2,\ldots,p,\ j=1,2,\ldots,t$,
$d_i+e_j,\ i=1,2,\ldots,q,\ j=1,2,\ldots,s$; $d_i+f_j,\ i=1,2,\ldots,q,\ j=1,2,\ldots,t$;
fifthly, extracting the partial concepts and overall concepts of the research object word u and the research behavior word v of the paper by utilizing the part-whole relation sub-network in the knowledge graph; if the overall concepts of u are $g_1, g_2, \ldots, g_o$, where o is a natural number, the partial concepts of u are $h_1, h_2, \ldots, h_r$, where r is a natural number, the overall concepts of v are $k_1, k_2, \ldots, k_w$, where w is a natural number, and the partial concepts of v are $l_1, l_2, \ldots, l_z$, where z is a natural number, the following search expansion words are constructed:
$u+k_j,\ j=1,2,\ldots,w$; $u+l_j,\ j=1,2,\ldots,z$,
$a_i+k_j,\ i=1,2,\ldots,m,\ j=1,2,\ldots,w$; $a_i+l_j,\ i=1,2,\ldots,m,\ j=1,2,\ldots,z$,
$g_i+v,\ i=1,2,\ldots,o$; $h_i+v,\ i=1,2,\ldots,r$,
$g_i+b_j,\ i=1,2,\ldots,o,\ j=1,2,\ldots,n$; $h_i+b_j,\ i=1,2,\ldots,r,\ j=1,2,\ldots,n$,
$g_i+k_j,\ i=1,2,\ldots,o,\ j=1,2,\ldots,w$; $g_i+l_j,\ i=1,2,\ldots,o,\ j=1,2,\ldots,z$,
$h_i+k_j,\ i=1,2,\ldots,r,\ j=1,2,\ldots,w$; $h_i+l_j,\ i=1,2,\ldots,r,\ j=1,2,\ldots,z$;
sixthly, extracting the parallel concepts of the research object word u and the research behavior word v of the paper by utilizing the parallel relation sub-network in the knowledge graph; if the parallel concepts of u are $x_1, x_2, \ldots, x_{k_1}$, where $k_1$ is a natural number, and the parallel concepts of v are $y_1, y_2, \ldots, y_{k_2}$, where $k_2$ is a natural number, the following search expansion words are constructed:
$u+y_j,\ j=1,2,\ldots,k_2$; $x_i+v,\ i=1,2,\ldots,k_1$;
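The fourth and fifth sub-steps follow one pattern, and the sixth a simpler one, which can be sketched as below. Here `rel_u` stands for the related concepts of u taken from one sub-network (for example the c_i and d_i together, or the g_i and h_i together) and `rel_v` for those of v; these parameter names, the helper functions, and the sample words are hypothetical. The output contains the same combinations as the lists above, although not in the same order.

```python
from itertools import product
from typing import List


def join(left: List[str], right: List[str]) -> List[str]:
    """Concatenate every word in left with every word in right."""
    return [f"{x} {y}" for x, y in product(left, right)]


def hierarchical_expansions(u: str, v: str, a: List[str], b: List[str],
                            rel_u: List[str], rel_v: List[str]) -> List[str]:
    """Pattern shared by the fourth and fifth sub-steps."""
    return (
        join([u] + a, rel_v)      # u and its synonyms with concepts related to v
        + join(rel_u, [v] + b)    # concepts related to u with v and its synonyms
        + join(rel_u, rel_v)      # related concepts of u with related concepts of v
    )


def parallel_expansions(u: str, v: str, xs: List[str], ys: List[str]) -> List[str]:
    """Sixth sub-step: only u with parallels of v, and parallels of u with v."""
    return join([u], ys) + join(xs, [v])


# Made-up concepts, for illustration only.
hier = hierarchical_expansions(
    "entity", "detect", ["object"], ["recognize"],
    rel_u=["thing", "named entity"], rel_v=["analyze"],
)
para = parallel_expansions("entity", "detect", ["relation"], ["classify"])
```

The parallel case is deliberately narrower than the other two: synonyms are not combined with parallel concepts, and two parallel concepts are never combined with each other.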
the step 3 comprises the following steps:
constructing an inverted index from the titles and abstracts of the documents in a data set, which comprises preprocessing, index construction and index storage; the preprocessing comprises stemming and lemmatization, and removing punctuation marks and stop words; the index construction comprises building a dictionary that maps words to documents, sorting the words in lexicographic order, merging the document mapping information of identical words, and constructing a document posting list, namely the document inverted index;
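The index construction in step 3 can be sketched in a few lines. This is a simplified illustration rather than the Lucene index format: stemming and lemmatization are omitted, the stop word list is a tiny made-up sample, and index storage is not shown.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Tiny illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "for", "on"}


def preprocess(text: str) -> List[str]:
    """Lowercase, drop punctuation, and remove stop words (no stemming here)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each word to the sorted list of document IDs that contain it."""
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            postings[term].add(doc_id)   # merges mappings of identical words
    # Words are emitted in lexicographic order, as the step describes.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


sample_docs = {
    1: "Entity detection in text",
    2: "Detection of the gauge theories",
}
index = build_inverted_index(sample_docs)
```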
the step 4 comprises the following steps:
firstly, according to the expanded search term set, retrieving from the data set the papers whose title or abstract contains any of the search terms; then, calculating the similarity between the query and each paper; taking the N papers with the highest similarity as the candidate citation set, where N is a natural number; wherein the similarity between the query and a paper is calculated with the vector space model in the search engine Lucene; the query and the paper are represented by a query vector and a paper vector, and their similarity is the cosine similarity of the query vector and the paper vector;
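Candidate selection in step 4 can be illustrated with a plain cosine similarity over term counts. This is a simplified stand-in: Lucene's own scoring applies additional weighting and normalization that are not reproduced here, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(query_terms: List[str], doc_terms: List[str]) -> float:
    """Cosine similarity between raw term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(sum(c * c for c in d.values()))
    return dot / norm if norm else 0.0


def top_n_candidates(query_terms: List[str], docs: Dict[int, List[str]], n: int) -> List[int]:
    """Keep papers sharing at least one term with the query, ranked by similarity."""
    query_set = set(query_terms)
    matching = {i: terms for i, terms in docs.items() if query_set & set(terms)}
    ranked = sorted(matching, key=lambda i: cosine(query_terms, matching[i]), reverse=True)
    return ranked[:n]


# Made-up preprocessed documents, for illustration only.
sample = {
    1: ["gauge", "theory", "duality"],
    2: ["gauge", "boson"],
    3: ["string"],
}
candidates = top_n_candidates(["gauge", "theory"], sample, n=2)
```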
the step 5 comprises the following steps:
the similarity features between a candidate citation and the query are of the following two kinds; the first is the similarity between the candidate citation and the query as computed by the search engine Lucene; the second is the KL distance (Kullback-Leibler divergence) between the topic distributions of the candidate citation and the query; firstly, the topic distributions of the query and the candidate citation are obtained with the latent Dirichlet allocation (LDA) model; then, the KL distance between the two topic distributions is calculated;
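The second feature can be sketched as below, assuming the two topic distributions have already been produced by an LDA model and are given as equal-length probability lists. The argument order (query first, candidate second) and the small floor `eps` used to avoid division by zero are assumptions; the text does not state either.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping topics where p_i is 0."""
    return sum(
        pi * math.log(pi / max(qi, eps))   # floor q_i so the ratio stays finite
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Made-up topic distributions over two topics, for illustration only.
query_topics = [0.5, 0.5]
citation_topics = [0.9, 0.1]
distance = kl_divergence(query_topics, citation_topics)
```

The KL divergence is not symmetric, so swapping the two arguments generally gives a different feature value.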
the step 6 comprises the following steps:
6.1, for each training paper in the training data set, retrieving candidate citations with the search engine Lucene according to the title and abstract of the training paper;
6.2, constructing a training sample for each candidate citation p; the features of the training sample comprise the citation frequency feature of the candidate citation p and the similarity features between the query constructed from the training paper and the candidate citation p; if the training paper cites the candidate citation p, the classification label of the training sample is 1, otherwise the label is 0; if the training paper contains $m_1$ references, $m_1$ positive samples and $n_1 - m_1$ negative samples can be constructed, where $n_1$ is the total number of candidate citations;
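Step 6.2 amounts to labeling each candidate's feature vector by membership in the training paper's reference list. A minimal sketch, in which the paper IDs, the feature layout, and the feature values are all made up for illustration:

```python
from typing import Dict, List, Set, Tuple

Features = List[float]  # e.g. [citation frequency, Lucene similarity, KL distance]


def build_samples(references: Set[int],
                  candidates: Dict[int, Features]) -> List[Tuple[Features, int]]:
    """Label a candidate 1 if the training paper cites it, otherwise 0."""
    return [
        (features, 1 if paper_id in references else 0)
        for paper_id, features in candidates.items()
    ]


# Hypothetical IDs and feature values.
references = {101, 103}
candidates = {
    101: [12.0, 0.81, 0.20],
    102: [3.0, 0.40, 0.95],
    103: [7.0, 0.66, 0.35],
    104: [1.0, 0.10, 1.40],
}
samples = build_samples(references, candidates)
positives = sum(label for _, label in samples)
```

In this sketch a reference yields a positive sample only if it was actually retrieved as a candidate, so the number of positives can be smaller than the number of references.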
the step 7 comprises the following steps:
7.1, training a classification model with the gradient boosted regression tree (GBRT) to realize citation recommendation; the classification features comprise the similarity features between the candidate citation and the query and the citation frequency feature of the paper; the output value of the GBRT is a real number between 0 and 1 and is used as the recommendation degree of the candidate citation; the greater the recommendation degree, the greater the likelihood that the candidate citation is classified as "recommended"; further, the M candidate citations with the highest recommendation degree are taken as the citation recommendation result for the current paper, where M is a natural number;
7.2, for each recommended citation p, identifying a research object word x and a research behavior word y from the title and abstract of the citation p; constructing a multi-layer semantic association relation between each citation p and the current paper; if u and v are the research object word and the research behavior word of the current paper respectively,
case 1: if x is an overall concept of u, or y is an overall concept of v, the research content of the citation p comprises the research content of the current paper; if x is a partial concept of u, or y is a partial concept of v, the research content of the current paper comprises the research content of the citation p;
case 2: if x is a superordinate concept of u, or y is a superordinate concept of v, the research method of the citation p can be applied to solve the research problem of the current paper; if x is a subordinate concept of u, or y is a subordinate concept of v, the research method of the current paper can be applied to solve the research problem of the citation p;
case 3: if x is a parallel concept of u, or y is a parallel concept of v, the research method of the current paper can draw on the research method of the citation p.
CN201511026567.7A 2015-12-31 2015-12-31 A kind of multilayer quotation based on literature content knowledge mapping recommends method Active CN105653706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511026567.7A CN105653706B (en) 2015-12-31 2015-12-31 A kind of multilayer quotation based on literature content knowledge mapping recommends method

Publications (2)

Publication Number Publication Date
CN105653706A (en) 2016-06-08
CN105653706B (en) 2018-04-06

Family

ID=56490788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511026567.7A Active CN105653706B (en) 2015-12-31 2015-12-31 A kind of multilayer quotation based on literature content knowledge mapping recommends method

Country Status (1)

Country Link
CN (1) CN105653706B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10503832B2 (en) * 2016-07-29 2019-12-10 Rovi Guides, Inc. Systems and methods for disambiguating a term based on static and temporal knowledge graphs
CN106407316B (en) * 2016-08-30 2020-05-15 北京航空航天大学 Software question and answer recommendation method and device based on topic model
CN107169010A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 A kind of determination method and device of recommendation search keyword
CN107103100B (en) * 2017-06-10 2019-07-30 海南大学 A kind of fault-tolerant intelligent semantic searching method based on map framework
CN107895056A (en) * 2017-12-29 2018-04-10 百度在线网络技术(北京)有限公司 A kind of information recommendation method, device, electronic equipment and storage medium
CN108304531B (en) * 2018-01-26 2020-11-03 中国信息通信研究院 Visualization method and device for reference relationship of digital object identifiers
CN108763354B (en) * 2018-05-16 2021-04-06 浙江工业大学 Personalized academic literature recommendation method
CN108664661B (en) * 2018-05-22 2021-08-17 武汉理工大学 Academic paper recommendation method based on frequent theme set preference
CN110737774B (en) * 2018-07-03 2024-05-24 百度在线网络技术(北京)有限公司 Book knowledge graph construction method, book recommendation method, device, equipment and medium
CN108897887B (en) * 2018-07-10 2020-10-16 华南师范大学 Teaching resource recommendation method based on knowledge graph and user similarity
CN109033314B (en) * 2018-07-18 2020-10-23 哈尔滨工业大学 Real-time query method and system for large-scale knowledge graph under condition of limited memory
CN109241273B (en) * 2018-08-23 2022-02-18 云南大学 Method for extracting minority subject data in new media environment
CN109063188A (en) * 2018-08-28 2018-12-21 国信优易数据有限公司 A kind of entity recommended method and device
CN109597878B (en) * 2018-11-13 2020-06-05 北京合享智慧科技有限公司 Method for determining text similarity and related device
CN109582933B (en) * 2018-11-13 2021-09-03 北京合享智慧科技有限公司 Method and related device for determining text novelty
CN109542247B (en) * 2018-11-14 2023-03-24 腾讯科技(深圳)有限公司 Sentence recommendation method and device, electronic equipment and storage medium
CN109582803A (en) * 2018-11-30 2019-04-05 广东电网有限责任公司 The construction method and system of competitive intelligence database
CN109597879B (en) * 2018-11-30 2022-03-29 京华信息科技股份有限公司 Service behavior relation extraction method and device based on 'citation relation' data
CN109376309B (en) 2018-12-28 2022-05-17 北京百度网讯科技有限公司 Document recommendation method and device based on semantic tags
CN109815335B (en) * 2019-01-26 2022-03-04 福州大学 Paper field classification method suitable for literature network
CN110033851B (en) * 2019-04-02 2022-07-26 腾讯科技(深圳)有限公司 Information recommendation method and device, storage medium and server
CN110427465B (en) * 2019-08-14 2022-03-04 北京奇艺世纪科技有限公司 Content recommendation method and device based on word knowledge graph
CN110674308A (en) * 2019-08-23 2020-01-10 上海科技发展有限公司 Scientific and technological word list expansion method, device, terminal and medium based on grammar mode
CN110532393B (en) * 2019-09-03 2023-09-26 腾讯科技(深圳)有限公司 Text processing method and device and intelligent electronic equipment thereof
CN110688838B (en) * 2019-10-08 2023-07-18 北京金山数字娱乐科技有限公司 Idiom synonym list generation method and device
CN111091454A (en) * 2019-11-05 2020-05-01 新华智云科技有限公司 Financial public opinion recommendation method based on knowledge graph
CN111090743B (en) * 2019-11-26 2023-05-09 华南师范大学 Thesis recommendation method and device based on word embedding and multi-value form concept analysis
CN111078884B (en) * 2019-12-13 2023-08-15 北京小米智能科技有限公司 Keyword extraction method, device and medium
CN111930891B (en) * 2020-07-31 2024-02-02 中国平安人寿保险股份有限公司 Knowledge graph-based search text expansion method and related device
CN112100405B (en) * 2020-09-23 2024-01-30 中国农业大学 Veterinary drug residue knowledge graph construction method based on weighted LDA
CN112364151B (en) * 2020-10-26 2023-06-27 西北大学 Thesis mixed recommendation method based on graph, quotation and content
CN112287218B (en) * 2020-10-26 2022-11-01 安徽工业大学 Knowledge graph-based non-coal mine literature association recommendation method
CN112667773A (en) * 2020-12-23 2021-04-16 医渡云(北京)技术有限公司 Data acquisition method based on knowledge graph and related equipment
CN112925895A (en) * 2021-03-29 2021-06-08 中国工商银行股份有限公司 Natural language software operation and maintenance method and device
CN115618014B (en) * 2022-10-21 2023-07-18 上海研途标准化技术服务有限公司 Standard document analysis management system and method applying big data technology

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102955849A (en) * 2012-10-29 2013-03-06 新浪技术(中国)有限公司 Method for recommending documents based on tags and document recommending device
CN103927358A (en) * 2014-04-15 2014-07-16 清华大学 Text search method and system
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US6877001B2 (en) * 2002-04-25 2005-04-05 Mitsubishi Electric Research Laboratories, Inc. Method and system for retrieving documents with spoken queries
US20140297644A1 (en) * 2013-04-01 2014-10-02 Tencent Technology (Shenzhen) Company Limited Knowledge graph mining method and system
CN103729402B (en) * 2013-11-22 2017-01-18 浙江大学 Method for establishing mapping knowledge domain based on book catalogue

Non-Patent Citations (1)

Title
Predicting Citation Counts of Papers; Junpeng Chen et al.; 2013 IEEE International Conference on Computer Vision; 20150630; pp. 434-440 *

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN109241278A (en) * 2018-07-18 2019-01-18 绍兴诺雷智信息科技有限公司 Scientific research knowledge management method and system
CN109241278B (en) * 2018-07-18 2022-04-26 绍兴诺雷智信息科技有限公司 Scientific research knowledge management method and system
CN111460324A (en) * 2020-06-18 2020-07-28 杭州灿八科技有限公司 Citation recommendation method and system based on link analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant