CN106815297B - Academic resource recommendation service system and method - Google Patents

Academic resource recommendation service system and method

Info

Publication number: CN106815297B
Application number: CN201611130297.9A
Authority: CN (China)
Prior art keywords: topic, academic, resource, model, user
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN106815297A
Inventors: 刘柏嵩, 王洋洋, 尹丽玲, 费晨杰, 高元
Current assignee: Ningbo University
Original assignee: Ningbo University
Application filed by Ningbo University
Priority to CN201611130297.9A
Publication of CN106815297A
Application granted
Publication of CN106815297B

Classifications

    • G06F16/3344: Query execution using natural language analysis (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F16/00 Information retrieval, database and file system structures; G06F16/30 unstructured textual data; G06F16/33 querying; G06F16/3331 query processing; G06F16/334 query execution)
    • G06F16/9535: Search customisation based on user profiles and personalisation (G06F16/90 details of database functions; G06F16/95 retrieval from the web; G06F16/953 querying, e.g. by the use of web search engines)

Abstract

An academic resource recommendation service system and method are provided. Academic resources on the Internet are crawled with an LDA-based topic crawler, classified into A preset categories with an LDA-based text classification model, and stored in a local academic resource database. The system further comprises an academic resource model, a resource quality value calculation model and a user interest model, and a tracking software module is embedded in the user's terminal. The academic resource model and the user interest model are built from four dimensions, combined with the user's interest disciplines and historical browsing behavior data; the similarity between the academic resource model and the user interest preference model is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree. The method and system make personalized and accurate recommendations of academic resources according to the user's identity, interests and browsing behavior, and improve the work efficiency of scientific research personnel.

Description

Academic resource recommendation service system and method
Technical Field
The invention relates to the technical field of computer application, in particular to an academic resource recommendation service system and a method for providing academic resource recommendation service for related users by using the resource recommendation service system.
Background
We have now entered the big-data era, especially in the field of academic resources, where hundreds of millions of academic resources are generated each year. Besides academic papers and patents, a large number of academic resources such as academic conferences, academic news and academic community information emerge in real time, and these types of academic resources are of great significance for a user to grasp, accurately and efficiently, the current research situation in the fields of interest. However, scientific research users have limited time and energy for research, academic resources are characterized by large volume, heterogeneity and rapid growth, and the traditional search-engine mode makes it difficult to retrieve academic resources comprehensively; the search process is cumbersome, so users often spend a great deal of time and energy querying for academic resources of interest, which affects their work efficiency.
Current research on personalized recommendation of academic resources mainly focuses on academic papers, so the recommended academic resource type is single. Different user groups, i.e. users with different identities, pay different degrees of attention to different types of academic resources; current personalized recommendation research on academic resources does not consider this factor and cannot formulate a multi-strategy recommendation scheme based on user identity. In addition, current academic resource recommendation research is limited to the recommendation module itself and does not provide systematic services around academic resource recommendation: an integrated service system with resource integration and recommendation at its core, spanning the dynamic acquisition, integration and classification of academic resources through to personalized recommendation based on user identity, behavior and interest disciplines, has not yet been formed.
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics and documents. "Generative model" means that every word of an article is considered to be obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". A topic refers to a defined professional or interest field, such as aerospace, biomedicine or information technology, and concretely refers to a set formed by a series of related words. Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in documents. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words. The LDA topic model is a typical model for topic mining in natural language processing; it can extract latent topics from a text corpus, provides a method for quantifying research topics, and is widely applied to topic discovery in academic resources, such as research hotspot mining, research topic evolution and research trend prediction.
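For illustration only, the following minimal sketch shows how such an LDA model turns documents into topic distributions and topics into word distributions; it assumes the Python gensim library, and the toy documents and topic count are placeholders rather than part of the original disclosure.

# Minimal LDA sketch; assumes the gensim library; toy corpus and K=2 are illustrative only.
from gensim import corpora, models

docs = [["aerospace", "satellite", "orbit", "launch"],
        ["gene", "protein", "biomedicine", "cell"],
        ["satellite", "orbit", "gene", "cell"]]

dictionary = corpora.Dictionary(docs)              # word <-> id mapping (bag of words)
bow = [dictionary.doc2bow(d) for d in docs]        # each document as a word-frequency vector

lda = models.LdaModel(corpus=bow, id2word=dictionary, num_topics=2, passes=20)

print(lda.show_topic(0, topn=4))                   # topic 0 as a probability distribution over words
print(lda.get_document_topics(bow[0]))             # document 0 as a probability distribution over topics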
In addition, with the development of the Internet, the Internet is filled with a large amount of information text in various forms, such as news, blogs and meeting memos. These texts more or less contain academic-related information and often contain the latest academic research information; they concern various related disciplines, are disordered, often overlap in topic, and generally carry no classification information.
The present invention is directed to solving the above-mentioned problems.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an academic resource recommendation service system and a method for providing academic resource recommendation service for related users by using the resource recommendation service system aiming at the technical current situation.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an academic resource recommendation service system, characterized in that academic resources crawled from the Internet are classified into predetermined A categories with a text classification model and stored in a local academic resource database; an open API of the academic resource database is provided for display and for invocation by a resource recommendation module; the system further comprises an academic resource model, a resource quality value calculation model and a user interest model, and a tracking software module is embedded in the user's terminal to track and record the user's online browsing behavior. The attention degree of users with different identities to each type of academic resource is calculated from the historical browsing behavior data of different user groups; academic resources are modeled from four dimensions (resource type, discipline distribution, keyword distribution and LDA latent topic distribution); the user's interest preference is modeled by combining the user's interest disciplines and historical browsing behavior data; the similarity between the academic resource model and the user interest preference model is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree.
The web crawler is a topic crawler and further comprises an LDA topic model; the LDA topic model is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance and includes training corpora; the LDA topic model is trained with the training corpora according to a set topic number K, and, using the word-clustering effect of LDA training, the training corpora are aggregated into K topic-associated word sets according to the set topic number K, which form the K topic documents used by the topic crawler for the current crawl. On the basis of a common web crawler, the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ranking module. The topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponding to one academic topic, so that academic resources of the plurality of academic topics are obtained simultaneously. In each crawl, the topic determining module of the topic crawler determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page in combination with the content of the page, hyperlinks whose combined anchor-text and page topic similarity is smaller than a set threshold are discarded, and URLs whose combined topic similarity is larger than the set threshold are selected. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of the visited web pages, ordered by descending similarity; it visits the web pages of these URLs in that order, crawls the corresponding academic resources, and continuously classifies, tags and stores the crawled academic resources into the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model. This crawling process is repeated continuously, so that the topic-associated words gathered in each topic document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to an acceptable level.
The corpus also comprises verification corpora with definite categories, used in advance to verify the classification of the text classification model according to the preset category number A, so as to obtain the classification accuracy of the text classification model for each of the A categories; this accuracy serves as a classification-credibility index of the text classification model for each of the A categories. The accuracy is the proportion of correctly classified corpora among all verification corpora classified into a given category by the text classification model, and a classification accuracy threshold is preset.
All disciplines are divided into 75 discipline categories, i.e. the category number A is 75; the topic number K is set to 100 when training the LDA topic model; and the preset classification accuracy threshold for verification of the text classification model is 80%.
A method for providing academic resource recommendation service for related users with the resource recommendation service system, characterized in that academic resources are classified according to predetermined A categories with a text classification model and stored to form an academic resource database; an open API of the academic resource database is provided for display and for invocation by a resource recommendation module; and a tracking software module is installed at the user terminal to track and record the user's online browsing behavior. The process of recommending corresponding academic resources to the user comprises a cold-start recommendation stage and a secondary recommendation stage. The cold-start recommendation stage recommends, based on the user's interest disciplines, high-quality resources that match those disciplines; the high-quality resources are the academic resources whose resource quality values, calculated by the resource quality value calculation model, are highest by comparison, and the resource quality value is the arithmetic or weighted mean of the resource authority, the resource community popularity and the resource recency. In the secondary recommendation stage, the user interest model and the resource model are built separately, the similarity between them is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree.
The resource Quality value is calculated as follows. The resource Authority is calculated with the formula:

Authority = (Level + Cite) / 2   (1)

where Level is the quantified publication-level score of the resource; the publication level is divided into 5 grades, scored 1, 0.8, 0.6, 0.4 and 0.2 in turn. Top journals or conferences such as Nature and Science score 1, the second grade such as ACM Transactions scores 0.8, and the lowest grade scores 0.2. Cite is calculated as:

Cite = Cites / maxCite   (2)

where Cite is the quantified result of the resource's citation count, Cites is the number of citations of the resource, and maxCite is the largest citation count in the resource database;
the calculation formula of the resource community heat degree Popularity is as follows:
Popularity=readTimes/maxReadTimes (3)
where readTimes is the number of times the resource has been read, and maxReadTimes is the maximum read count in the source database of that resource type;
the time-new Recentness calculation method of the resources is the same, and the formula is as follows:
Figure BDA0001176027590000042
year and month are the year and month of publication of the resource, respectively; minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months of all resources in the source database for that type of resource;
the resource Quality value Quality calculation method is as follows:
Figure BDA0001176027590000043
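For illustration only, a short Python sketch of this quality-value computation follows; the averaging in formulas (1) and (5) follows the reconstruction above, the function and argument names are assumptions, and the sample numbers are invented.

# Illustrative sketch of the resource quality value; equal-weight averaging is an assumption.
def authority(level_score, cites, max_cite):
    cite = cites / max_cite if max_cite else 0.0                      # formula (2)
    return (level_score + cite) / 2                                   # formula (1), as reconstructed

def popularity(read_times, max_read_times):
    return read_times / max_read_times if max_read_times else 0.0    # formula (3)

def recentness(year, month, min_y, min_m, max_y, max_m):
    span = (max_y - min_y) * 12 + (max_m - min_m)
    return ((year - min_y) * 12 + (month - min_m)) / span if span else 1.0   # formula (4), as reconstructed

def quality(auth, pop, rec):
    return (auth + pop + rec) / 3                                     # formula (5): arithmetic mean

q = quality(authority(0.8, 120, 900), popularity(350, 5000), recentness(2016, 6, 2000, 1, 2016, 12))
print(round(q, 3))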
the academic resource model is represented as follows:
Mr = {Tr, Kr, Ct, Lr}   (6)

where Tr is the discipline distribution vector of the academic resource, i.e. the probability values of the resource over the A discipline categories, obtained by a Bayesian multinomial model;

Kr = {(kr1, ωr1), (kr2, ωr2), …, (krm, ωrm)}, where m is the number of keywords, kri (1 ≤ i ≤ m) denotes the ith keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm with the following formula:

w(i, r) = tf(i, r) * log(Z / L)   (7)

where w(i, r) is the weight of the ith keyword in document r, tf(i, r) is the frequency of the ith keyword in document r, Z is the total length of the document set, and L is the number of documents containing keyword i. Lr is the latent topic distribution vector, Lr = {lr1, lr2, lr3, …, lrN1}, where N1 is the number of latent topics. Ct is the resource type, and t can take the values 1, 2, 3, 4, 5, corresponding to the five major types of academic resources: papers, patents, news, conferences, and books;
according to the behavior characteristics of a user using mobile software, the operation behavior of the user on an academic resource is divided into opening, reading, star-level evaluation, sharing and collection, a user interest model is built on the basis of the user background and the browsed academic resource and in combination with the academic resource model according to different browsing behaviors of the user, and the user interest model is expressed as follows:
Mu = {Tu, Ku, Ct, Lu}   (8)

where Tu is the user's discipline preference distribution vector, formed from the discipline distribution vectors Tr of the academic resources of a given type that the user has browsed over a period of time, weighted by the user's actions, i.e.

Tu = (1 / sum) * Σ_{j=1..sum} sj * Tjr   (9)

where sum is the total number of academic resources on which the user has generated behavior, sj is the "behavior coefficient" of the user for academic resource j after the behavior occurs (the larger the value, the more the user likes the resource), and Tjr is the discipline distribution vector of the jth resource. The calculation of sj comprehensively considers behaviors such as opening, reading, rating, collecting and sharing, and can accurately reflect the user's degree of preference for the resource.

Ku = {(ku1, ωu1), (ku2, ωu2), …, (kuN2, ωuN2)} is the user's keyword preference distribution vector, N2 is the number of keywords, kui (1 ≤ i ≤ N2) denotes the ith user-preferred keyword, and ωui is the weight of keyword kui; it is calculated from the keyword distribution vectors Kr of the academic resources of a given type on which user u has generated behavior over a period of time:

K′jr = sj * Kjr   (10)

The new keyword distribution vector of each resource is calculated according to formula (10), and the TOP-N2 entries among the new keyword distribution vectors of all resources are selected as the user keyword preference distribution vector Ku.

Lu is the user's LDA latent topic preference distribution vector; it is obtained from the LDA latent topic distribution vectors Lr = {lr1, lr2, lr3, …, lrN1} of the academic resources in the same way as Tu:

Lu = (1 / sum) * Σ_{j=1..sum} sj * Ljr   (11)
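For illustration only, the following Python sketch aggregates resource vectors into a user profile as in formulas (9) to (11); the behavior-coefficient values and the toy inputs are illustrative assumptions, not values from the patent.

# Sketch of formulas (9)-(11): behavior-weighted aggregation of resource vectors into a user profile.
# The behavior coefficients below are illustrative assumptions.
import numpy as np

BEHAVIOR_COEF = {"open": 0.2, "read": 0.5, "rate": 0.7, "share": 0.9, "collect": 1.0}

def user_preference_vector(resources):
    """resources: list of (behavior, vector) pairs for one resource type over a period."""
    total = len(resources)
    acc = sum(BEHAVIOR_COEF[b] * np.asarray(v, dtype=float) for b, v in resources)
    return acc / total                                   # Tu or Lu, per formulas (9)/(11)

def user_keyword_preferences(resources, top_n2=5):
    """Weight each resource's (keyword, weight) list by s_j, then keep the global TOP-N2 (formula (10))."""
    weighted = []
    for behavior, keywords in resources:
        s_j = BEHAVIOR_COEF[behavior]
        weighted += [(k, s_j * w) for k, w in keywords]
    return sorted(weighted, key=lambda kw: kw[1], reverse=True)[:top_n2]

T_u = user_preference_vector([("read", [0.6, 0.3, 0.1]), ("collect", [0.1, 0.8, 0.1])])
K_u = user_keyword_preferences([("read", [("LDA", 0.4), ("crawler", 0.3)]),
                                ("collect", [("recommendation", 0.5)])], top_n2=2)
print(T_u, K_u)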
The similarity between the user interest and the resource model is calculated as follows:
Academic resource model representation:
Mr = {Tr, Kr, Ct, Lr}   (12)
User interest model representation:
Mu = {Tu, Ku, Ct, Lu}   (13)
The similarity between the user discipline preference distribution vector Tu and the academic resource discipline distribution vector Tr is calculated by cosine similarity, namely:

SimT(Tu, Tr) = (Tu · Tr) / (‖Tu‖ · ‖Tr‖)   (14)

The similarity between the user LDA latent topic preference distribution vector Lu and the academic resource LDA latent topic distribution vector Lr is calculated by cosine similarity, namely:

SimL(Lu, Lr) = (Lu · Lr) / (‖Lu‖ · ‖Lr‖)   (15)

The similarity between the user keyword preference distribution vector Ku and the academic resource keyword distribution vector Kr is calculated with the Jaccard similarity:

SimK(Ku, Kr) = |Ku ∩ Kr| / |Ku ∪ Kr|   (16)

Then the similarity between the user interest model and the academic resource model is:

Sim(Mu, Mr) = σ · SimT(Tu, Tr) + ρ · SimL(Lu, Lr) + τ · SimK(Ku, Kr)   (17)

where σ + ρ + τ = 1, and the specific weight assignment is obtained by experimental training.
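For illustration only, a short Python sketch of formulas (14) to (17) follows; the weights sigma, rho and tau below are placeholders (the patent obtains them by experimental training), and the sample vectors are invented.

# Sketch of formulas (14)-(17); sigma/rho/tau are placeholder weights.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(keys_u, keys_r):
    u, r = set(keys_u), set(keys_r)
    return len(u & r) / len(u | r) if (u | r) else 0.0

def model_similarity(Tu, Tr, Lu, Lr, Ku, Kr, sigma=0.4, rho=0.3, tau=0.3):
    return sigma * cosine(Tu, Tr) + rho * cosine(Lu, Lr) + tau * jaccard(Ku, Kr)

sim = model_similarity(Tu=[0.5, 0.4, 0.1], Tr=[0.6, 0.3, 0.1],
                       Lu=[0.2, 0.8], Lr=[0.3, 0.7],
                       Ku=["LDA", "crawler"], Kr=["LDA", "recommendation"])
print(round(sim, 3))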
A Recommendation_degree concept is introduced: the higher the recommendation degree of an academic resource, the better the resource matches the user's interest preference and the higher its quality. The recommendation degree is calculated as:

Recommendation_degree = λ1 · Sim(Mu, Mr) + λ2 · Quality,  where λ1 + λ2 = 1   (18)
the secondary recommendation stage is to perform Top-N recommendation according to the recommendation degree of academic resources.
The web crawler comprises an addressing crawler, a topic crawler and an LDA topic model; the LDA topic model is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance and includes training corpora; the LDA topic model is trained with the training corpora according to a set topic number K, and, using the word-clustering effect of LDA training, the training corpora are aggregated into K topic-associated word sets according to the set topic number K, which form the K topic documents used by the topic crawler for the current crawl. On the basis of a common web crawler, the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ranking module. The topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponding to one academic topic, so that academic resources of the plurality of academic topics are obtained simultaneously. In each crawl, the topic determining module of the topic crawler determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page in combination with the content of the page, hyperlinks whose combined anchor-text and page topic similarity is smaller than a set threshold are discarded, and URLs whose combined topic similarity is larger than the set threshold are selected. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of the visited web pages, ordered by descending similarity; it visits the web pages of these URLs in that order, crawls the corresponding academic resources, and continuously classifies, tags and stores the crawled academic resources into the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model. This crawling process is repeated continuously, so that the topic-associated words gathered in each topic document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to an acceptable level.
The corpus also comprises verification corpora with definite categories, which are used for classifying and verifying the text classification model according to a preset category number A by using the verification corpora in advance so as to obtain the classification accuracy of the text classification model to each category in the A categories, and the classification accuracy serves as a classification credibility index of the text classification model to each category in the A categories; the accuracy rate is the ratio of correctly classified corpora in all verified corpora classified by the text classification model, and a classification accuracy rate threshold is preset; the text classification method for each text to be classified by using the text classification model specifically comprises the following steps:
Step one, each text to be classified is preprocessed, the preprocessing comprising word segmentation and stop-word removal while retaining proper nouns; the feature weights of all preprocessed words of the text are calculated respectively, where the feature weight value of a word is proportional to its frequency of occurrence in the text and inversely proportional to its frequency of occurrence in the training corpus; the calculated word set is sorted in descending order of feature weight, and the front portion of the original word set of each text to be classified is extracted as the feature word set;
Step two, using the text classification model and the original feature word set of each text to be classified, the probability that the text belongs to each of the predetermined A categories is calculated, and the category with the maximum probability value is selected as the classification category of the text;
Step three, the text classification result of step two is judged: if the classification accuracy value of the text classification model for that category reaches the set threshold, the result is output directly; if the classification accuracy value of the text classification model for that category does not reach the set threshold, proceed to step four;
Step four, each preprocessed text is input into the LDA topic model, which calculates a weight value for the text over each of the K set topics; the topic with the largest weight is selected, and the first Y words among the topic-associated words obtained for that topic from LDA training are added to the original feature word set of the text to form the expanded feature word set; the text classification model is then used again to calculate the probability that the text belongs to each of the A preset categories, and the category with the maximum probability value is selected as the final classification category of the text.
The main calculation formula of the text classification model is as follows:
P(cj | x1, x2, …, xn) = P(cj) · P(x1, x2, …, xn | cj) / P(x1, x2, …, xn)   (19)

where P(cj | x1, x2, …, xn) is the probability that the text belongs to category cj when the feature words (x1, x2, …, xn) occur together; P(cj) is the proportion of texts in the training text set that belong to class cj; P(x1, x2, …, xn | cj) is the probability that, if the text to be classified belongs to class cj, its feature word set is (x1, x2, …, xn); and P(x1, x2, …, xn) is the probability of the feature word set over all given classes.
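For illustration only, the following Python sketch outlines the classification flow of steps one to four around formula (19); the log-probability form, the accuracy table, and the expand_features helper are assumptions standing in for the LDA-based feature expansion, not the original implementation.

# Sketch of the classification flow (steps one to four); log-probabilities and the accuracy table are
# illustrative assumptions, and expand_features stands in for the LDA-based expansion of step four.
import math

def naive_bayes_classify(features, priors, likelihoods):
    """priors: {class: P(c)}; likelihoods: {class: {word: P(word|c)}}; formula (19) in log space."""
    scores = {}
    for c, p_c in priors.items():
        score = math.log(p_c)
        for w in features:
            score += math.log(likelihoods[c].get(w, 1e-6))   # small floor for unseen words
        scores[c] = score
    return max(scores, key=scores.get)

def classify_with_expansion(features, priors, likelihoods, accuracy, threshold, expand_features):
    c = naive_bayes_classify(features, priors, likelihoods)
    if accuracy.get(c, 0.0) >= threshold:                    # step three: trust the first result
        return c
    expanded = features + expand_features(features)          # step four: add the top-Y LDA topic words
    return naive_bayes_classify(expanded, priors, likelihoods)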
The resource recommendation service system for the multi-type academic resources has the following characteristics:
(1) the invention realizes the dynamic acquisition of various types of academic resources, such as academic papers, patents, academic conferences, academic news and the like, and efficiently acquires the target academic resources based on the topic crawler module.
(2) The invention realizes the theme classification work of various academic resources based on subject attributes.
(3) The attention degrees of different user groups to different types of academic resources are different, the multi-strategy academic resource recommendation method based on different user groups is realized, and various types of academic resources are recommended for users with different identities according to different proportions.
(4) Based on the browsing habits of the users, the invention realizes the personalized recommendation work of various academic resources based on different behaviors of the users.
According to the invention, the individual recommendation of academic resources is carried out according to the identity, interest and browsing behavior of the user, the academic resources can be recommended to the user more accurately, the working efficiency of scientific research personnel is greatly improved, a convenient and rapid information acquisition environment is created for scientific research workers to carry out scientific research better, and the contradiction between the information overload of the academic resources and the acquisition of the user resources is effectively solved.
In addition, the invention adopts an LDA-based academic resource acquisition method and classification method. It deeply mines discipline semantic information through the LDA topic model, builds a good guidance basis for the topic crawler of academic resources, integrates machine learning into the academic resource acquisition method, and improves the quality and efficiency of academic resource acquisition. The academic resources obtained by the topic crawler are used to update the LDA topics, so that the topic model can be updated at any time, follows the development trend of academia, and provides researchers with frontier resources in related fields. The text classification method based on selective feature expansion is suitable for complex application scenarios: it selectively adds topic information to data with little information while avoiding adding noise to data with sufficient information, provides an approach for optimizing text classification models, and has the characteristics of strong scenario adaptability, high result usability, and easy updating and maintenance of the classification model.
Drawings
FIG. 1 is a block diagram of an overall academic resource recommendation service system according to the present invention;
FIG. 2 is a schematic view of an LDA model;
FIG. 3 is a schematic diagram of a certain text before preprocessing;
FIG. 4 is a schematic diagram of a pre-processed text;
FIG. 5 is a schematic diagram of a topic and a topic document after a corpus is trained by an LDA topic model;
FIG. 6 is a flow chart illustrating an LDA-based academic resource acquisition method according to the present invention;
FIG. 7 is a flow chart illustrating a text classification method according to the present invention using LDA;
FIG. 8 is a graph showing recall ratios of three experiments in a part of disciplines;
FIG. 9 is a graph showing precision ratios of three experiments in part of the disciplines;
FIG. 10 is a schematic diagram of a preferred process of the present invention.
Detailed Description
The following describes the embodiments of the present invention in detail.
The academic resource recommendation service system comprises a web crawler, a text classification model and an academic resource database. The web crawler crawls academic resources on the Internet; the text classification model classifies the academic resources into predetermined A categories, after which they are stored in the local academic resource database; an open API of the academic resource database is provided for display and for invocation by a resource recommendation module. The academic resource recommendation service system further comprises an academic resource model, a resource quality value calculation model and a user interest model, and a tracking software module is embedded in the user's terminal to track and record the user's online browsing behavior. The attention degree of users with different identities to each type of academic resource is calculated from the historical browsing behavior data of different user groups; academic resources are modeled from four dimensions (resource type, discipline distribution, keyword distribution and LDA latent topic distribution); the user's interest preference is modeled by combining the user's interest disciplines and historical browsing behavior data; the similarity between the academic resource model and the user interest model is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree. All first-level disciplines are organized into 75 discipline categories according to the discipline categories in the graduate discipline and specialty catalog of the Ministry of Education, i.e. the category number A is 75.
Acquisition of academic resources
The web crawler of the invention is mainly a topic crawler and also comprises a corresponding LDA topic model; the LDA topic model is a three-layer "document-topic-word" Bayesian generative model, as shown in FIG. 2. The LDA topic model is trained in advance with training corpora according to a set topic number K; each training corpus must be preprocessed before training, and the preprocessing comprises word segmentation and stop-word removal. Using the word-clustering effect of LDA training, the training corpora are aggregated into K topic-associated word sets according to the set topic number K; these sets are also called topic documents. When training the LDA topic model, the topic number K can be set between 50 and 200, preferably 100. Documents of various disciplines in various forms can be randomly crawled from the Internet as training corpora; for long but well-structured documents such as papers, only the abstract may be taken, and an existing database can also be used as the training corpus. The corpus should reach a considerable scale, from at least tens of thousands of documents up to millions. If the selected topic number K is 100, all words of the training corpus are aggregated into 100 topic-associated word sets, i.e. 100 topic documents, during the LDA training process. Each topic can be given a name manually according to the meaning of its clustered words, or left unnamed and referred to only by a number or code; 3 of the topic documents are shown in FIG. 5.
On the basis of the common web crawler, the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ranking module. The topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponding to one academic topic, so that academic resources of the plurality of academic topics are obtained simultaneously. In each crawl, the topic determining module of the topic crawler determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page in combination with the content of the page, hyperlinks whose combined anchor-text and page topic similarity is smaller than a set threshold are discarded, and URLs whose combined topic similarity is larger than the set threshold are selected. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of the visited web pages, ordered by descending similarity; it visits the web pages of these URLs in that order, crawls the corresponding academic resources, and continuously classifies, tags and stores the crawled academic resources into the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model. This crawling process is repeated continuously, so that the topic-associated words gathered in each topic document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to an acceptable level.
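For illustration only, the following Python sketch outlines such a similarity-ordered crawl loop; fetch(), extract_links() and topic_similarity() are hypothetical helpers standing in for page download, link extraction and the topic-similarity module, and are not part of the original disclosure.

# Sketch of the crawl loop; fetch(), extract_links() and topic_similarity() are hypothetical helpers.
import heapq

def crawl(seed_urls, topic_document, threshold, fetch, extract_links, topic_similarity):
    queue = [(-1.0, url) for url in seed_urls]          # max-heap via negated similarity
    heapq.heapify(queue)
    visited, results = set(), []
    while queue:
        _, url = heapq.heappop(queue)                   # most topic-relevant unvisited URL first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        results.append((url, page))                     # store / tag the crawled academic resource
        for link_url, anchor_text in extract_links(page):
            sim = topic_similarity(anchor_text, page, topic_document)
            if sim >= threshold and link_url not in visited:
                heapq.heappush(queue, (-sim, link_url)) # keep the unvisited queue ordered by similarity
    return results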
For ease of operation, the abstracts of academic resources can be used as the training corpus; topics and topic documents are obtained through the LDA topic model, and the topic documents guide the calculation of topic similarity during the crawling process of the topic crawler. The crawled contents are then stored in the database and serve as new corpora for training the LDA model, and an open API of the academic resource database is provided for display and invocation. The specific steps are as follows:
step one, downloading and preprocessing abstracts of academic resources in a plurality of existing fields, manually classifying the abstracts into different categories according to the academic fields, and respectively using the categories as training corpora of a plurality of subjects of LDA;
step two, inputting the LDA topic model parameters, which comprise K, α and β: the value of K is the topic number, the value of α represents the weight distribution of each topic before sampling, and the value of β represents the prior distribution of each topic over words. Training produces a number of topics and topic documents with more finely subdivided topics, and each topic document is used to guide a crawler;
step three, each crawler maintains a crawling URL queue from the selected high-quality seed URL, and updates the crawling URL queue according to similarity sequencing by continuously calculating the similarity between texts in the webpage and texts and topics pointed by anchor text links in the webpage, and captures webpage contents most relevant to the topics;
step four, after the academic resources acquired by the topic crawler are marked with corresponding topic labels, the academic resources are stored in a database and used as new language materials for training LDA (latent dirichlet allocation) for updating topic documents;
and step five, providing an open API of the academic resource database for display and calling.
The first step comprises the following specific sub-steps:
(a) corpus collection: downloading abstracts of academic resources in a plurality of existing fields as training corpora;
(b) text preprocessing: extracting abstract, Chinese word segmentation and removing stop words;
(c) classification into corpus: artificially classifying into different categories according to the academic field, and respectively using the categories as training corpora of a plurality of subjects of LDA.
The third step comprises the following specific sub-steps:
(a) the initial seed URL selects a better seed site facing a specific theme;
(b) extracting webpage content: downloading a page pointed by the URL with high priority, and extracting required content and URL information according to the HTML tag;
(c) analyzing and judging the relevance of the theme, and determining the acceptance or rejection of the page; the invention mainly adopts the combination of the existing VSM technology and the SSRM technology to calculate the subject correlation;
(d) sequencing the importance degree of the URL of the unvisited webpage;
(e) and (d) repeating the processes from (b) to (d) until the URL of the unvisited queue is empty.
In the substep (c), when the topic crawler crawls through each electronic document to analyze and judge the topic relevance, the generalized vector space model GVSSM combining two topic similarity calculation algorithms of VSM and SSRM is adopted to calculate the topic relevance of the crawled page and determine the choice of the page.
A topic is represented by a set of semantically related words together with weights indicating how strongly each word is related to the topic, i.e. topic Z = {(w1, p1), (w2, p2), …, (wn, pn)}, where the ith word wi is a word related to topic Z and pi is a measure of the degree of correlation of that word with Z. In LDA this is expressed as Z = {(w1, p(w1|zj)), (w2, p(w2|zj)), …, (wn, p(wn|zj))}, where wi ∈ W, p(wi|zj) is the probability that word wi is selected when the topic is zj, and zj is the jth topic.
The topic document generation process is a probability sampling process of a model and comprises the following specific sub-steps:
(a) for any document d in the corpus, generate the document length N, N ~ Poisson(ε), obeying a Poisson distribution;
(b) for any document d in the corpus, generate θ ~ Dirichlet(α), obeying a Dirichlet distribution;
(c) generation of the ith word wi in document d: first, a topic zj ~ Multinomial(θ) is generated, obeying a multinomial distribution; then, for topic zj, a word distribution φ_zj ~ Dirichlet(β) is generated, obeying a Dirichlet distribution; finally the word wi with the highest probability under p(wi | zj, β) is generated. The LDA model is shown in FIG. 2.
Where the value of α represents the weight distribution of the respective topic prior to sampling and the value of β represents the prior distribution of the respective topic to the word.
The distributions obeyed by all variables in the LDA model are as follows: θ ~ Dirichlet(α), zn ~ Multinomial(θ), φ ~ Dirichlet(β), wn ~ Multinomial(φ_zn), so that the joint distribution of a document is

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1..N} p(zn | θ) · p(wn | zn, β)

By integrating out the latent variables, the entire model actually becomes a distribution over the observed words w given the parameters: w refers to the words and is observable, z is the topic variable and is the target output of the model, and α and β can be seen as the initial parameters of the model. Integrating over the variables that exist therein gives

p(w | α, β) = ∫ p(θ | α) · ( Π_{n=1..N} Σ_{zn} p(zn | θ) · p(wn | zn, β) ) dθ

where N is the number of words and w denotes a word. For θ ~ Dirichlet(α), integrating out θ (and likewise φ) yields the collapsed Gibbs sampling formula

P(zi = j | z_-i, w) ∝ [ (n(wi, j) + β) / (n(·, j) + Wβ) ] · [ (n(di, j) + α) / (n(di, ·) + Kα) ]

where n(w, j) represents the number of times the feature word w is assigned to topic j, n(·, j) indicates the number of feature words assigned to topic j, n(d, j) represents the number of feature words in text d assigned to topic j, n(d, ·) represents the number of all feature words in text d that have been assigned a topic, W is the vocabulary size and K is the number of topics.
From the above, it can be seen that the variables that mainly influence LDA modeling are α, β and the topic number K. To select a better topic number, the values of α and β are first fixed, and the change in the value of the equation after integrating out the other variables is then observed.
When the LDA model is adopted to carry out theme modeling on the text set, the theme number K has great influence on the performance of the LDA model for fitting the text set, so the theme number needs to be preset. According to the method, the optimal theme number is determined by measuring the classification effect under different theme numbers and is compared with the classification effect when the Perplexity value is used for determining the model to be optimally fit, on one hand, the method can obtain the more visual and accurate optimal theme number, and on the other hand, the difference between the corresponding classification effect and the actual result can be found out through the optimal theme number determined by the Perplexity value. The Perplexity value formula is:
perplexity(D) = exp{ - Σ_{m=1..M} log P(dm) / Σ_{m=1..M} Nm }

where M is the number of texts in the text set, Nm is the length of the mth text, and P(dm) is the probability that the LDA model generates the mth text, given by:

P(dm) = Π_{n=1..Nm} Σ_z P(wn | z) · P(z | dm)
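For illustration only, the following Python sketch evaluates the perplexity formula directly; p_w_z and p_z_d are assumed to come from an already-trained LDA model and are not part of the original disclosure.

# Sketch of the perplexity formula; p_w_z[z][w] = P(w|z) and p_z_d[m][z] = P(z|d_m) are assumed
# to be taken from a trained LDA model with a given topic number K.
import math

def perplexity(docs, p_w_z, p_z_d):
    """docs: list of token lists; returns exp(-sum_m log P(d_m) / sum_m N_m)."""
    log_prob, n_words = 0.0, 0
    for m, doc in enumerate(docs):
        for w in doc:
            p_w = sum(p_z_d[m][z] * p_w_z[z].get(w, 1e-12) for z in range(len(p_w_z)))
            log_prob += math.log(p_w)
        n_words += len(doc)
    return math.exp(-log_prob / n_words)

# Choosing K: train one model per candidate K, keep the K with the lowest perplexity,
# then cross-check it against the classification effect as described above.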
the subject crawler of the invention is additionally provided with three modules on the basis of the general crawler: the system comprises a theme determining module, a similarity calculating module and a URL priority ordering module, so that filtering and theme matching of a crawled page are completed, and finally, contents highly related to the theme are obtained.
1. Topic determination module: before the topic crawler starts working, its set of related topic words needs to be determined, i.e. the topic document is established. The topic word set is usually determined in two ways: manually, or by extraction from the initial page set. Determining the topic word set manually makes the training and selection of keywords subjective, while keywords extracted from the initial pages have high noise and low coverage. The number of topic words is used as the dimension of the topic vector, and the corresponding weights are the component values of the topic vector. The topic word set vector is K = {k1, k2, …, kn}, where n is the number of topic words.
2. Similarity calculation module: to ensure that the web pages acquired by the crawler stay as close to the topic as possible, the web pages must be filtered, and web pages with low topic relevance (below a set threshold) are removed so that the links in those pages are not processed in the next crawl. When the topic relevance of a page is very low, the page is likely to contain some keywords only incidentally and its topic may have little to do with the specified topic, so processing its links is of little value; this is the fundamental difference between a topic crawler and a common crawler. A common crawler processes all links up to the set search depth and returns a large number of useless web pages, which further increases the workload. Using the whole text for similarity comparison is obviously not feasible; the text generally needs to be refined, extracted and converted into a data structure suitable for comparison and calculation, while embodying the topic of the text as much as possible. The feature selection adopted by a typical topic crawler is VSM, together with the TF-IDF algorithm. The method of this invention is based on HowNet semantic similarity calculation and obtains the similarity value between the whole article and the topic by calculating the similarity between the words of the document and the topic word document.
3. URL prioritization module: the URL prioritization module mainly screens, from the unvisited URLs, potential pages with high similarity to the topic and ranks them by similarity; the higher the similarity, the higher the priority, so that pages with high similarity are visited as early as possible, which guarantees the high topic relevance of the visited pages. When ranking the unvisited URLs, the similarity of the page where the URL is located and of the URL's anchor text (the text describing the URL) can be combined as influencing factors of the priority ranking.
The invention uses the semantic information of each word as defined in HowNet to calculate the similarity between words. In HowNet, for two words W1 and W2, suppose W1 has n concepts, denoted c11, c12, …, c1n, and W2 has m concepts, denoted c21, c22, …, c2m. The similarity between W1 and W2 is the maximum of the similarities between each concept c1i of W1 and each concept c2j of W2, expressed by the formula

Sim(W1, W2) = max_{i=1..n, j=1..m} Sim(c1i, c2j)

Therefore, the similarity between two words can be converted into a similarity calculation between concepts; and because all concepts in HowNet are ultimately expressed by sememes, the similarity calculation between concepts can in turn be attributed to the similarity calculation between the corresponding sememes. Suppose concept c1 and concept c2 have p and q sememes respectively, denoted s11, s12, …, s1p and s21, s22, …, s2q. The similarity between concept c1 and concept c2 is the maximum of the similarities between each sememe s1i of c1 and each sememe s2j of c2:

Sim(c1, c2) = max_{i=1..p, j=1..q} Sim(s1i, s2j)
all concepts in the book "Zhi Wang" are finally ascribed to the representation of the sememes, so the calculation of the similarity between concepts can also be ascribed to the calculation of the similarity between the corresponding sememes. Because all the sememes form a tree-like sememe hierarchy based on the upper and lower relations, sememe similarity can be calculated by using the semantic distance of the sememes in the sememe hierarchy to obtain the concept similarity [27 ]]. Assume that two sememes and the path distance in the sememe hierarchy is Dis(s)1,s2) Then, the similarity calculation formula of the sememe is:
Figure BDA00011760275900001311
wherein Dis(s)1,s2) Is s1And s2Path length in the semantic hierarchy, here using the semantic context, is a positive integer.
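For illustration only, the following Python sketch shows the word-to-concept-to-sememe maximum-similarity cascade described above; the tiny distance table and delta = 1.6 are illustrative assumptions, not values from the patent.

# Sketch of the HowNet-style similarity cascade; the distance table and delta are assumptions.
def sememe_similarity(s1, s2, dis, delta=1.6):
    return delta / (dis(s1, s2) + delta)                 # Sim(s1, s2) = delta / (Dis + delta)

def concept_similarity(c1_sememes, c2_sememes, dis):
    return max(sememe_similarity(a, b, dis) for a in c1_sememes for b in c2_sememes)

def word_similarity(w1_concepts, w2_concepts, dis):
    return max(concept_similarity(c1, c2, dis) for c1 in w1_concepts for c2 in w2_concepts)

DIST = {frozenset({"computer", "machine"}): 2, frozenset({"computer", "tool"}): 4}
dis = lambda a, b: 0 if a == b else DIST.get(frozenset({a, b}), 8)

print(word_similarity([["computer"]], [["machine"], ["tool"]], dis))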
The design of the topic crawler of the invention is based on the common crawler with further expanded functions. The whole process of handling a web page comprises the steps of determining the initial seed URLs, extracting web page content, analyzing topic relevance, and ranking URLs.
(a) And selecting a better seed site facing a specific theme from the initial seed URL, so that the theme crawler can smoothly perform crawling work.
(b) Extracting webpage content: and downloading the page pointed by the URL with high priority, and extracting the required content and the URL information according to the HTML label.
(c) Topic relevance analysis is the core module of a topic crawler, which determines the choice of pages. The invention mainly adopts a generalized vector space model GVSSM combining the existing VSM technology and the SSRM technology to calculate the topic relevance.
For topic relevance analysis, text keywords are extracted with TF-IDF, word weights are calculated, and relevance analysis is performed on the web page.
TF-IDF correlation calculation:
w_di = tf_i × idf_i = (f_i / f_max) × log(N / n_i)

where w_di is the weight of word i in document d, tf_i is the term frequency of word i, idf_i is the inverse document frequency of word i, f_i is the number of times word i appears in document d, f_max is the highest occurrence count among all words in document d, N is the total number of documents, and n_i is the number of documents containing word i. TF-IDF is still currently the most effective method for extracting keywords and calculating word weights.
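For illustration only, a short Python sketch of this TF-IDF weighting follows; the toy documents are invented.

# Sketch of the TF-IDF weighting defined above: w_di = (f_i / f_max) * log(N / n_i).
import math
from collections import Counter

def tfidf_weights(doc_tokens, all_docs):
    counts = Counter(doc_tokens)
    f_max = max(counts.values())
    N = len(all_docs)
    weights = {}
    for word, f_i in counts.items():
        n_i = sum(1 for d in all_docs if word in d)       # documents containing the word
        weights[word] = (f_i / f_max) * math.log(N / n_i)
    return weights

docs = [["topic", "crawler", "topic"], ["crawler", "url"], ["lda", "topic"]]
print(tfidf_weights(docs[0], docs))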
VSM topic relevance calculation:
Sim_VSM(d, t) = ( Σ_{i=1..n} w_di · w_ti ) / ( sqrt(Σ_{i=1..n} w_di²) · sqrt(Σ_{i=1..n} w_ti²) )

where (w_d1, …, w_dn) is the word vector of document d, (w_t1, …, w_tn) is the word vector of topic t, w_di and w_ti are the TF-IDF values of word i in document d and in topic t respectively, and n is the number of words common to document d and topic t. This algorithm considers only the frequency vectors of shared words when judging document similarity and does not consider semantic relations between words, such as near-synonyms and synonyms, which affects the accuracy of the similarity.
SSRM topic relevance calculation:
Sim_SSRM(d, t) = ( Σ_{i=1..n} Σ_{j=1..m} Sem_ij · w_di · w_tj ) / ( Σ_{i=1..n} Σ_{j=1..m} w_di · w_tj )

where w_di and w_tj are the TF-IDF values of word i in document d and word j in topic t, n and m are the numbers of words in document d and topic t respectively, and Sem_ij is the semantic similarity of word i and word j, calculated as

Sem(C1, C2) = 2 · Depth(C3) / ( Path(C1, C3) + Path(C2, C3) + 2 · Depth(C3) )

where C1 and C2 are two concepts, corresponding to word w1 and word w2; Sem(C1, C2) is the semantic similarity of concept C1 and concept C2; C3 is the lowest common concept shared by C1 and C2; Path(C1, C3) is the number of nodes on the path from C1 to C3; Path(C2, C3) is the number of nodes on the path from C2 to C3; and Depth(C3) is the number of nodes on the path from C3 to the root node. The SSRM algorithm considers only the semantic relation: if the words of two articles are all near-synonyms or synonyms, the document similarity would be calculated as 1, i.e. exactly the same, which is clearly not accurate enough.
The invention adopts a method of calculating the similarity by combining VSM and SSRM, also called the generalized vector space model, GVSSM for short. Its calculation formula combines the VSM similarity Sim_VSM(dk, t) and the SSRM similarity Sim_SSRM(dk, t) into a single topic similarity Sim(dk, t) between document dk and topic t. This topic similarity calculation method takes into account both the word-frequency factor of the document and the semantic relations between words, and the combination of VSM and SSRM effectively improves the accuracy of topic similarity calculation.
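For illustration only, the following Python sketch shows one plausible way to combine the two scores; the weighted average with mu = 0.5 and the toy semantic-similarity stand-in are assumptions, since the exact GVSSM combination formula is not reproduced here.

# Sketch of combining VSM and SSRM; the weighted average with mu = 0.5 is an assumed combination.
import math

def sim_vsm(wd, wt):
    shared = set(wd) & set(wt)
    num = sum(wd[w] * wt[w] for w in shared)
    den = math.sqrt(sum(v * v for v in wd.values())) * math.sqrt(sum(v * v for v in wt.values()))
    return num / den if den else 0.0

def sim_ssrm(wd, wt, sem):
    num = sum(sem(i, j) * wd[i] * wt[j] for i in wd for j in wt)
    den = sum(wd[i] * wt[j] for i in wd for j in wt)
    return num / den if den else 0.0

def sim_gvssm(wd, wt, sem, mu=0.5):
    return mu * sim_vsm(wd, wt) + (1 - mu) * sim_ssrm(wd, wt, sem)

sem = lambda a, b: 1.0 if a == b else 0.3                 # toy semantic similarity stand-in
print(round(sim_gvssm({"topic": 0.4, "crawler": 0.2}, {"topic": 0.5, "spider": 0.3}, sem), 3))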
(d) The importance of the URLs of the unvisited web pages is ranked. The following formula is used to rank the URLs:
priority(h) = (1/N) · Σ_{p=1..N} [ λ · Sim(f_p, t) + (1 - λ) · Sim(a_h, t) ]

where priority(h) is the priority value of the unvisited hyperlink h, N is the number of crawled pages containing h, Sim(f_p, t) is the topic similarity of the full text of web page p (which contains hyperlink h), Sim(a_h, t) is the topic similarity of the anchor text of hyperlink h, and λ is the weight adjusting the contributions of the full text and the anchor text. The similarity calculations in the formula also use the combined VSM and SSRM method; this optimizes the priority order of the un-crawled URL link queue and also effectively improves the accuracy of acquiring topic-related academic resources.
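For illustration only, a short Python sketch of the priority value as reconstructed above follows; lam = 0.6 and the sample similarities are illustrative assumptions.

# Sketch of the URL priority formula as reconstructed above; lam = 0.6 is an illustrative weight.
def url_priority(pages_containing_h, anchor_sim, lam=0.6):
    """pages_containing_h: list of full-text topic similarities Sim(f_p, t) of pages containing hyperlink h."""
    N = len(pages_containing_h)
    return sum(lam * full_sim + (1 - lam) * anchor_sim for full_sim in pages_containing_h) / N

print(round(url_priority([0.7, 0.5, 0.9], anchor_sim=0.8), 3))

The unvisited URL queue is then sorted by this priority value in descending order, as described above.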
Compared with the common web crawler, the topic crawler aims at grabbing the web page information related to the specific topic content, whether the web page is grabbed or not is judged by calculating the correlation degree of the web page and the topic, a URL queue to be crawled is maintained, and the web page is accessed according to the priority of the URL so as to ensure that the web page with high correlation degree is preferentially accessed.
The current topic crawlers have several drawbacks: (1) a topic crawler must determine its set of related topic words before working. The topic word set is usually determined in two ways, either manually or by analysis of the initial pages. The manual method has a certain subjectivity, and the method of extracting keywords from the initial pages is generally deficient in topic coverage; both traditional methods cause deviations when the topic crawler performs web page topic similarity calculation. (2) The core of current text-heuristic topic crawlers is page similarity calculation, which judges whether the currently crawled page is close to the topic. Apart from the accuracy of the topic determination module, the most important element is the similarity calculation algorithm. A VSM (vector space model) is usually adopted, which represents the text by word vectors on the assumption that different words are unrelated and calculates the similarity between documents through shared word frequencies; this algorithm often ignores the semantic relations between words, which lowers the similarity value of semantically related articles.
The design of the topic crawler of the invention is based on a general crawler with three added core modules: a topic determining module, a topic similarity calculating module and a to-be-crawled URL ranking module. To overcome the above drawbacks, the invention provides a topic crawler based on the LDA topic model, improves the topic similarity algorithm and the URL priority ranking algorithm, and improves the content quality and accuracy of the topic crawler both at the starting point of crawling and during the crawling process. The main contributions are: (1) using the LDA topic model to deeply mine the topic semantic information of the corpus, constructing a good guidance basis for the topic crawler, integrating machine learning into the resource acquisition method, and improving the accuracy and quality of resource acquisition. (2) In the topic similarity calculation module of the topic crawler, a HowNet-based semantic similarity calculation method is adopted to balance cosine similarity and semantic similarity, achieving a better topic matching effect.
Classification of academic resources
The invention adopts an LDA-based text classification method, as shown in figure 7. A Bayesian probability calculation model serves as the text classification model. A group of feature words that best embody the characteristics of the text to be classified is extracted as the feature word set input to the text classification model; the original feature word set is the front part of the original word set after sorting by feature weight. The text classification model calculates, for the feature word combination, the probability of belonging to each of the predetermined A categories, and the category with the maximum probability value is taken as the category of the text. All disciplines are divided into 75 discipline categories according to the discipline categories in the postgraduate discipline catalog of the Ministry of Education, that is, the category number A is 75. The LDA topic model described above and the 100 topic documents trained by it are used to assist the text classification model in text classification. The text classification model is verified in advance with a verification corpus of definite categories according to the predetermined category number A, to obtain the classification accuracy of the text classification model for each of the A categories, which serves as the classification credibility index of the text classification model for each of the A categories. The accuracy rate is the ratio of correctly classified corpora among all verification corpora classified into a given category by the text classification model, and a classification accuracy threshold is preset; during classification verification, a preset classification accuracy threshold of 80% is suitable. The text classification method applied to each text to be classified using the text classification model specifically comprises the following steps:
Step one, respectively calculating the feature weights of all preprocessed words of each text to be classified, wherein the feature weight value of a word is directly proportional to its frequency of occurrence in the text and inversely proportional to its frequency of occurrence in the training corpus; arranging the calculated word set in descending order of feature weight value, and extracting the front part of the original word set of each text to be classified as the feature word set of that text;
Step two, using the text classification model with the original feature word set of each text to be classified to respectively calculate the probability values of the text belonging to each of the predetermined A categories, and selecting the category with the maximum probability value as the classification category of the text;
Step three, judging the text classification result of step two: if the classification accuracy value of the text classification model for that category reaches the set threshold, directly outputting the result; if it does not reach the set threshold, proceeding to step four;
Step four, inputting each preprocessed text into the LDA topic model, calculating with the LDA topic model the weight value of each of the K set topics for the text, selecting the topic with the largest weight value, and adding the first Y words among the topic associated words under that topic (obtained after LDA topic model training) into the original feature word set of the text to form an expanded feature word set; then using the text classification model again to respectively calculate the probability values of the text belonging to each of the A predetermined categories, and selecting the category with the maximum probability value as the final classification category of the text. Specifically, Y may be 10 to 20 words; for example, the first 15 words among the topic associated words are added into the original feature word set to form the expanded feature word set, even if some newly added words duplicate the original feature words. A sketch of this selective expansion flow is given below.
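The selective expansion flow of steps one to four can be condensed as follows, assuming the trained Bayesian classifier, the per-class accuracy table from the verification corpus and the LDA model are available as callables; all names here are illustrative rather than part of the original implementation.

def classify_with_selective_expansion(feature_words, nb_classify, class_accuracy,
                                      lda_top_topic_words, threshold=0.8, Y=15):
    # nb_classify(words) -> (label, probability)
    # class_accuracy: classification accuracy per category from the verification corpus
    # lda_top_topic_words(words, Y) -> first Y associated words of the dominant LDA topic
    label, _ = nb_classify(feature_words)                             # step two
    if class_accuracy.get(label, 0.0) >= threshold:                   # step three
        return label
    expanded = feature_words + lda_top_topic_words(feature_words, Y)  # step four
    label, _ = nb_classify(expanded)
    return label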
The main calculation formula of the text classification model is as follows:
P(cj | x1, x2, ..., xn) = P(x1, x2, ..., xn | cj) · P(cj) / P(x1, x2, ..., xn)
wherein P(cj | x1, x2, ..., xn) represents the probability that the text belongs to category cj when the feature words (x1, x2, ..., xn) appear simultaneously; P(cj) represents the proportion of texts in the training text set that belong to category cj; P(x1, x2, ..., xn | cj) represents the probability that the feature word set of the text is (x1, x2, ..., xn) given that the text to be classified belongs to category cj; and the denominator P(x1, x2, ..., xn) represents the joint probability of the feature words.
Obviously, for a given set of classes the denominator P(x1, x2, ..., xn) is a constant, and the classification result of the model is the class with the highest probability in the above formula, so solving its maximum can be converted into solving the maximum of

P(x1, x2, ..., xn | cj) · P(cj)
According to the naive Bayes independence assumption, the attributes x1, x2, ..., xn of the text feature vector are independent and identically distributed, and their joint probability distribution equals the product of the probability distributions of the individual attribute features, namely:

P(x1, x2, ..., xn | cj) = Πi P(xi | cj)
Therefore, the classification formula becomes:

c = arg max over cj of P(cj) · Πi P(xi | cj)

which is the classification function used for classification.
The probability values P(cj) and P(xi | cj) in the classification function are unknown; therefore, in order to compute the maximum of the classification function, they are estimated as follows:

P(cj) = N(C = cj) / N
wherein N(C = cj) represents the number of samples in the training text that belong to category cj, and N represents the total number of training samples.
P(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + M)
wherein N(Xi = xi, C = cj) represents the number of training samples of category cj that contain attribute xi; N(C = cj) represents the number of training samples in category cj; and M represents the number of keywords in the training sample set after useless words are removed.
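A self-contained sketch of these estimates is given below. It counts, per category, the documents containing each feature word and applies the "+1 / +M" smoothing suggested by the definition of M; the exact smoothing in the original image-only formula is an assumption.

import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    # samples: list of (feature_words, label) pairs
    N = len(samples)
    class_docs = Counter(label for _, label in samples)          # N(C=cj)
    word_docs = defaultdict(Counter)   # documents of class cj containing word xi
    vocab = set()
    for words, label in samples:
        for w in set(words):
            word_docs[label][w] += 1
        vocab.update(words)
    M = len(vocab)                     # keywords after removing useless words

    def log_score(words, c):
        score = math.log(class_docs[c] / N)                      # log P(cj)
        for w in words:
            score += math.log((word_docs[c][w] + 1) / (class_docs[c] + M))
        return score

    def predict(words):
        return max(class_docs, key=lambda c: log_score(words, c))
    return predict

predict = train_naive_bayes([(["lda", "topic"], "computer science"),
                             (["gene", "cell"], "biology")])
print(predict(["topic", "model"]))    # -> "computer science"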
LDA is a statistical topic model for modeling discrete data sets proposed by Blei et al. in 2003, and is a three-layer 'document-topic-word' Bayesian generative model. The initial model introduced a hyperparameter only for the 'document-topic' probability distribution, making it obey the Dirichlet distribution; Griffiths et al. subsequently introduced a hyperparameter for the 'topic-word' probability distribution, making it obey the Dirichlet distribution as well. The LDA model is shown in fig. 2, wherein: N is the number of words of a document, M is the number of documents in the document set, K is the number of topics, φ is the topic-word probability distribution, θ is the document-topic probability distribution, Z is the hidden variable representing the topic, W is a word, α is the hyperparameter of θ, and β is the hyperparameter of φ.
The LDA topic model regards a document as a set of words with no order between them; a document can contain multiple topics, each word in the document is generated by a certain topic, and the same word can belong to different topics, so the LDA topic model is a typical bag-of-words model.
The key to training the LDA model is the inference of the hidden variable distributions, namely obtaining the hidden text-topic distribution θ and topic-word distribution φ of the target text. Given the model parameters α and β, the joint distribution of the random variables θ, z and w for a text d is:

p(θ, z, w | α, β) = p(θ | α) · Πn p(zn | θ) · p(wn | zn, β)
Because several latent variables appear in the above formula at the same time, directly computing θ and φ is intractable, so parameter estimation and inference are needed. The commonly used parameter estimation algorithms are Expectation Maximization (EM), variational Bayesian inference and Gibbs sampling. Gibbs sampling is used herein for model parameter inference; Griffiths pointed out that Gibbs sampling is superior to variational Bayesian inference and the EM algorithm in terms of perplexity value, training speed and the like. The local maximization of the likelihood function in the EM algorithm often leads the model to a locally optimal solution, and the model obtained by variational Bayesian inference deviates from the real situation, whereas Gibbs sampling can quickly and effectively extract topic information from large-scale data sets, which has made it the most popular LDA parameter estimation algorithm at present.
MCMC is a set of approximate iterative methods for drawing samples from complex probability distributions, and Gibbs sampling is a simple implementation of MCMC; its goal is to construct a Markov chain that converges to a specific distribution and to draw samples close to the target probability distribution from that chain. In the training process the algorithm only samples the topic variable zi, with the conditional probability calculation formula:

p(zi = k | z¬i, w) ∝ (n(k, wi) + β) / (n(k, ·) + Vβ) · (n(d, k) + α) / (n(d, ·) + Kα)

wherein the left side is the probability that the current word wi belongs to topic k given the topics of all other words; on the right side all counts exclude the current word (i.e. the corresponding count minus 1 when the current word is assigned to that topic): n(k, wi) is the number of times word wi is assigned to topic k, n(k, ·) is the total number of words assigned to topic k, n(d, k) is the number of words of document d assigned to topic k, n(d, ·) is the total number of words in document d, and V is the vocabulary size; the first factor is the probability of the word wi under topic k, and the second factor is the probability of topic k in the document.
The Gibbs sampling comprises the following specific steps:
1) Initialization: for each word wi, randomly assign a topic; zi is the topic of word i and is initialized to a random integer between 1 and K, where i runs from 1 to N and N is the total number of feature words in the text set. This is the initial state of the Markov chain;
2) i loops from 1 to N; the probability that the current word wi belongs to each topic is calculated according to the above conditional probability formula, and the topic of word wi is resampled according to these probabilities, yielding the next state of the Markov chain;
3) After iterating step 2) a sufficient number of times, the Markov chain is considered to have reached a steady state, at which point every word of every document has a specific topic. For each document, the text-topic distribution θ and the topic-word distribution φ can then be estimated according to the following formulas:

φ(k, w) = (n(k, w) + β) / (n(k, ·) + Vβ)

θ(d, k) = (n(d, k) + α) / (n(d, ·) + Kα)

wherein n(k, w) represents the number of times the feature word w is assigned to topic k, n(k, ·) represents the number of feature words assigned to topic k, n(d, k) represents the number of feature words in text d assigned to topic k, and n(d, ·) represents the number of all feature words in text d that have been assigned a topic.
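The sampling and estimation steps above can be condensed into the following self-contained collapsed Gibbs sampler, with symmetric priors α and β and documents given as lists of integer word ids over a vocabulary of size V; this is an illustrative sketch, not the patented implementation.

import random

def lda_gibbs(docs, K, V, alpha=0.5, beta=0.1, iters=200, seed=0):
    rng = random.Random(seed)
    n_kw = [[0] * V for _ in range(K)]    # times word w is assigned to topic k
    n_k = [0] * K                         # words assigned to topic k
    n_dk = [[0] * K for _ in docs]        # words of document d assigned to topic k
    z = []                                # topic assignment of every word token
    for d, doc in enumerate(docs):        # step 1: random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            n_kw[k][w] += 1; n_k[k] += 1; n_dk[d][k] += 1
        z.append(zd)
    for _ in range(iters):                # step 2: resample each word's topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_kw[k][w] -= 1; n_k[k] -= 1; n_dk[d][k] -= 1
                weights = [(n_kw[t][w] + beta) / (n_k[t] + V * beta) *
                           (n_dk[d][t] + alpha) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                n_kw[k][w] += 1; n_k[k] += 1; n_dk[d][k] += 1
    # step 3: estimate phi (topic-word) and theta (document-topic)
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
           for k in range(K)]
    theta = [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
             for d in range(len(docs))]
    return theta, phi

theta, phi = lda_gibbs([[0, 1, 1, 2], [2, 3, 3, 4]], K=2, V=5, iters=50)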
The classification accuracy used as the reliability index of the text classification model is calculated as a proportion, with the specific formula:
Accuracy_i = Ni / Mi
wherein i denotes a category, Ni denotes the number of times the classifier correctly predicts category i, and Mi denotes the total number of times the classifier predicts category i.
The precision P, the recall R and their comprehensive evaluation index F1 can be adopted as the final evaluation indexes. The precision P measures the proportion of samples judged to be a given category that truly belong to that category, and the recall R measures the proportion of samples of that category that are correctly judged to belong to it. Taking a certain category Ci as an example, n++ denotes the number of samples correctly judged to belong to category Ci, n+- denotes the number of samples that do not belong to but are judged as category Ci, and n-+ denotes the number of samples that belong to but are judged not to belong to category Ci. For category Ci, the recall R, precision P and comprehensive index F1 are:
R = n++ / (n++ + n-+),  P = n++ / (n++ + n+-),  F1 = 2PR / (P + R)
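In code form, the per-class metrics reduce to a few lines; the argument names mirror the counts n++, n+- and n-+ defined above, and the function name is illustrative.

def per_class_metrics(n_pp, n_pm, n_mp):
    # n_pp: samples correctly judged as class Ci
    # n_pm: samples judged as Ci that do not belong to Ci
    # n_mp: samples of Ci judged as some other class
    recall = n_pp / (n_pp + n_mp)
    precision = n_pp / (n_pp + n_pm)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1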
the inventors have performed three sets of experiments: performing a first experiment, namely performing a classifier performance test based on an original feature set; performing a classifier performance test based on the expanded feature set; and thirdly, performing a classifier performance test based on the feature set after the selective feature expansion, wherein the reliability threshold is set to be 0.8. Table 2 shows recall and precision for three experiments in a part of disciplines:
TABLE 2 recall and precision of partial disciplines
Figure BDA0001176027590000199
Figure BDA0001176027590000201
As can be seen from Table 2, when the experiment is based on the original feature set, the recall of the history discipline is high while its precision is low, indicating that the classifier assigns to history a good deal of data that does not belong to it; meanwhile the history of science and technology discipline shows low recall, indicating that much of its data is classified into other disciplines. Because the topics of these two disciplines are very similar, the classifier very likely classifies much data belonging to the history of science and technology as history. Similar situations also occur between the geological resources and geological engineering discipline and the geology discipline. The expanded feature set improves this problem, but it also affects disciplines that previously had high recognition. Selective feature expansion, on the one hand, avoids affecting disciplines with high recognition and, on the other hand, improves to a certain extent the disciplines whose recognition was low due to insufficient information.
From the above experimental results, the average recall, average precision and average F1 value of each of the three experiments can be calculated. The results are as follows:
TABLE 3 comparison of the experiments
Figure BDA0001176027590000202
As can be seen from Table 3, in complex classification scenarios the selective feature expansion method of the invention adapts better than the methods based on the original feature set or the fully expanded feature set; its average recall, average precision and average F1 value are obviously higher than those of the other schemes, so a better practical effect can be achieved.
FIG. 6 is a graph showing recall ratios of three experiments in a part of disciplines; FIG. 7 is a graph showing the precision of three experiments in a part of the disciplines.
With the advent of the big data era, resource classification faces more and more challenges; different application scenarios require different classification techniques, and no single technique suits all classification tasks. The selective feature expansion method is suitable for complex application scenarios: it selectively adds topic information to data with little information while avoiding adding noise to data with sufficient information, and thus has broad adaptability.
Figure BDA0001176027590000211
Recommendation of academic resources
The process of recommending corresponding academic resources to the user comprises a cold start recommendation stage and a secondary recommendation stage. The cold start recommendation stage recommends, based on the user's interest disciplines, high-quality resources that match those disciplines; high-quality resources are the academic resources whose resource quality values, calculated by the resource quality value calculation model, compare highly, the resource quality value being the arithmetic mean or weighted mean of resource authority, resource community popularity and resource recentness. In the secondary recommendation stage, the user interest model and the resource model are modeled respectively, the similarity between them is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is performed for the user according to the recommendation degree.
1. Cold start phase recommendation algorithm:
TABLE 4 Properties and metrics for five broad classes of resources
High-quality academic resources can attract and retain new users. During the cold start phase, the aim is to recommend to the user high-quality resources that match his or her disciplines of interest. The quality value is measured mainly through attributes such as authority, community popularity and recentness. The attributes and metrics of the five major classes of resources are shown in Table 4.
The formula for calculating the Authority of the paper is as follows:
Figure BDA0001176027590000212
Level is the quantified score of the paper's publication level. Journal grades are divided into 5 levels with scores of 1, 0.8, 0.6, 0.4 and 0.2 in order: top journals or conferences such as Nature and Science are scored 1, second-level venues such as ACM Transactions are scored 0.8, and the lowest level is scored 0.2. The calculation formula for Cite is as follows:
Cite=Cites/maxCite. (2)
Cite is the quantized result of the paper's citation count, Cites is the number of citations of the paper, and maxCite is the largest citation count in the paper's source database.
The authority calculation of the other four types of resources is similar to the thesis, except that the quantification method is different.
The formula for calculating the community Popularity of the paper is as follows:
Popularity=readTimes/maxReadTimes. (3)
readTimes is the number of reads of the paper, maxReadTimes is the maximum number of reads in the source database of the paper.
The recentness Recentness of all resources is calculated in the same way, with the following formula:
Figure BDA0001176027590000221
year and month are the year and month of publication of the resource, respectively. minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months for all resources in the source database for that type of resource.
The paper Quality value Quality calculation method is as follows:
Figure BDA0001176027590000222
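As an illustration, the sketch below computes the four quantities for a paper, taking Authority and Quality as simple arithmetic means of their components, which is one of the combinations the description allows ("arithmetic mean or weighted mean"); the exact weights in the image-only formulas may differ, and the sample inputs are invented.

def authority(level, cites, max_cite):
    cite = cites / max_cite if max_cite else 0.0     # Cite = Cites / maxCite
    return (level + cite) / 2                        # assumed equal weighting

def popularity(read_times, max_read_times):
    return read_times / max_read_times if max_read_times else 0.0

def recentness(year, month, min_year, min_month, max_year, max_month):
    months = (year - min_year) * 12 + (month - min_month)
    span = (max_year - min_year) * 12 + (max_month - min_month)
    return months / span if span else 1.0

def quality(auth, pop, recent):
    return (auth + pop + recent) / 3                 # assumed arithmetic mean

q = quality(authority(0.8, 120, 600), popularity(300, 1500),
            recentness(2016, 6, 2000, 1, 2016, 12))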
2. Algorithm of the secondary recommendation stage:
In this stage, a recommendation method integrating user behaviors and resource content is adopted: the user interest model and the resource model are modeled respectively, their similarity is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally recommendation is performed according to the recommendation degree.
The academic resource model is represented as follows:
Mr={Tr,Kr,Ct,Lr} (6)
wherein Tr is the discipline distribution vector of the academic resource, i.e. the probability values of the academic resource over the 75 disciplines, obtained by a Bayesian multinomial model.
Kr = {(kr1, ωr1), (kr2, ωr2), ..., (krm, ωrm)}, where m is the number of keywords, kri (1 ≤ i ≤ m) denotes the ith keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm whose calculation formula is as follows:
Figure BDA0001176027590000223
w(i, r) represents the weight of the ith keyword in document r, tf(i, r) represents the frequency of the ith keyword in document r, Z represents the total length of the document set, and L represents the number of documents containing keyword i.
Lr is the LDA latent topic distribution vector, Lr = {lr1, lr2, lr3, ..., lrN1}, where N1 is the number of latent topics.
Ct is the resource type; t can take the values 1, 2, 3, 4, 5, corresponding to the five major classes of academic resources: academic papers, academic patents, academic news, academic conferences and academic books.
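A compact data-structure view of Mr follows, with a keyword weight of the form tf(i, r) · log(Z / L) built from the symbols defined above; the precise "improved tf-idf" formula is shown only as an image, so this weighting is an assumption, and the class and function names are illustrative.

import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ResourceModel:
    Tr: List[float]        # discipline distribution over the 75 categories
    Kr: Dict[str, float]   # keyword -> weight
    Ct: int                # resource type, 1..5
    Lr: List[float]        # LDA latent topic distribution over N1 topics

def keyword_weight(tf_ir, Z, L):
    # tf_ir: frequency of keyword i in document r
    # Z: total length of the document set; L: documents containing keyword i
    return tf_ir * math.log(Z / L)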
According to the behavior characteristics of users of mobile software, the operation behaviors of a user on an academic resource are divided into opening, reading, star-level evaluation, sharing and collection, wherein star-level evaluation is an explicit behavior and the other behaviors are implicit behaviors. Explicit behavior can clearly reflect the degree of the user's interest preference: for star-level evaluation, the higher the score, the more the user likes the resource. Implicit behavior, although it cannot clearly reflect user interest preference, often implies a larger amount of information and value than explicit feedback.
The user interest model is based primarily on the user's background and the academic resources that have been browsed. According to different browsing behaviors of the user, a user interest model can be constructed by combining an academic resource model, and the model is dynamically adjusted along with the change of the user interest. The user interest model is represented as follows:
Mu={Tu,Ku,Ct,Lu} (8)
wherein Tu is the user discipline preference distribution vector formed, after user behaviors, from the discipline distribution vectors Tr of the academic resources of a certain class browsed by the user within a period of time, i.e.

Tu = (1 / sum) · Σj sj · Tjr    (9)

wherein sum is the total number of academic resources on which the user has produced behaviors, sj is the 'behavior coefficient' after the user acts on academic resource j, a larger value indicating that the user prefers the resource, and Tjr is the discipline distribution vector of the jth resource. The calculation of sj comprehensively considers opening, reading, evaluating, collecting, sharing and other behaviors, and can accurately reflect the user's degree of preference for the resource.
Ku = {(ku1, ωu1), (ku2, ωu2), ..., (kuN2, ωuN2)} is the user keyword preference distribution vector, where N2 is the number of keywords, kui (1 ≤ i ≤ N2) denotes the ith user-preferred keyword, and ωui is the weight of keyword kui, calculated from the keyword distribution vectors Kr of all academic resources on which user u has produced behaviors within a period of time:

K'jr = sj · Kjr    (10)

A new keyword distribution vector is calculated for each academic resource according to formula (10), and the TOP-N2 keywords among the new keyword distribution vectors of all the resources are selected as the user keyword preference distribution vector Ku.
Lu is the user's LDA latent topic preference distribution vector, calculated from the LDA latent topic distribution vectors Lr = {lr1, lr2, lr3, ..., lrN1} of the academic resources in the same way as Tu.
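Reusing the ResourceModel sketch above, the user-side vectors can be assembled as follows. The behavior-weighted averaging of Tr and Lr and the TOP-N2 keyword selection follow the description; keeping the largest re-weighted value per keyword across resources is an assumption about how the TOP-N2 selection is aggregated.

def build_user_preferences(resources, s, N2=20):
    # resources: ResourceModel instances the user acted on; s: behavior coefficients s_j
    total = len(resources)
    Tu = [sum(s[j] * resources[j].Tr[d] for j in range(total)) / total
          for d in range(len(resources[0].Tr))]
    Lu = [sum(s[j] * resources[j].Lr[d] for j in range(total)) / total
          for d in range(len(resources[0].Lr))]
    reweighted = {}                                   # K'_jr = s_j * K_jr
    for j, r in enumerate(resources):
        for word, weight in r.Kr.items():
            reweighted[word] = max(reweighted.get(word, 0.0), s[j] * weight)
    Ku = dict(sorted(reweighted.items(), key=lambda kv: kv[1], reverse=True)[:N2])
    return Tu, Ku, Lu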
Figure BDA0001176027590000232
Calculation of the behavior coefficient: s denotes the behavior coefficient, T is the reading time threshold, and δ is an adjustment parameter. The reading time threshold is introduced to filter out accidental clicks, so its value is small. If the time the user spends reading resource j is less than the threshold T, it is regarded as a false click and s = 0. If the user is willing to spend a longer time reading, i.e. the reading time is greater than or equal to T, then: if the user gives an evaluation whose value is greater than the mean of all previous evaluations, the user is considered to like j and s is increased by δ; if the user collects or shares j, indicating that the user likes j very much, s is increased by δ. The invention considers that reading, evaluating, collecting and sharing reflect user interest preference from shallow to deep. The value of s depends mainly on the initial value and the adjustment parameter δ; to map all user behaviors to a value between 0 and 2, the initial value is set to 1 and the adjustment parameter δ to 0.333333.
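Read literally, the prose above maps to the small function below, where the base value 1 and δ = 0.333333 bound s to [0, 2]; the default reading-time threshold T and the treatment of collecting and sharing as separate δ increments are assumptions, since the formula itself appears only as an image in the filing.

def behavior_coefficient(read_seconds, T=10, rating=None, mean_rating=None,
                         collected=False, shared=False, delta=0.333333):
    if read_seconds < T:                     # treated as a false click
        return 0.0
    s = 1.0                                  # base value for a genuine read
    if rating is not None and mean_rating is not None and rating > mean_rating:
        s += delta                           # explicit positive evaluation
    if collected:
        s += delta
    if shared:
        s += delta
    return s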
Similarity calculation between the academic resource model and the user interest model:
academic resource model representation:
Mr={Tr,Kr,Ct,Lr} (12)
user interest model representation:
Mu={Tu,Ku,Ct,Lu} (13)
The similarity between the user discipline preference distribution vector Tu and the discipline distribution vector Tr of an academic resource is calculated by cosine similarity, namely:

Sim(Tu, Tr) = (Tu · Tr) / (||Tu|| · ||Tr||)    (14)

The similarity between the user's LDA latent topic preference distribution vector Lu and the LDA latent topic distribution vector Lr of an academic resource is calculated by cosine similarity, namely:

Sim(Lu, Lr) = (Lu · Lr) / (||Lu|| · ||Lr||)    (15)

The similarity between the user keyword preference distribution vector Ku and the academic resource keyword distribution vector Kr is calculated by Jaccard similarity:

Sim(Ku, Kr) = |Ku ∩ Kr| / |Ku ∪ Kr|    (16)
then the similarity between the user interest model and the academic resource model is as follows:
Sim(Mu, Mr) = σ·Sim(Tu, Tr) + ρ·Sim(Lu, Lr) + τ·Sim(Ku, Kr)    (17)

where σ + ρ + τ = 1, and the specific weight assignment is obtained by experimental training.
In order to recommend high-quality resources that interest the user, the concept of Recommendation_degree is introduced: the higher the recommendation degree of an academic resource, the better it matches the user's interest preference and the higher its quality. The recommendation degree calculation formula is as follows:
Recommendation_degree = λ1·Sim(Mu, Mr) + λ2·Quality, where λ1 + λ2 = 1    (18)
the secondary recommendation stage is to perform Top-N recommendation according to the recommendation degree of academic resources.
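Putting the secondary-recommendation formulas together, the sketch below fuses the three similarities with weights σ, ρ and τ and combines the result with the quality value to rank resources; the weight values shown are placeholders, since the patent states they are obtained by experimental training, and the dictionary keys are illustrative.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def model_similarity(user, resource, sigma=0.4, rho=0.3, tau=0.3):
    # user: dict with Tu, Lu, Ku; resource: dict with Tr, Lr, Kr
    return (sigma * cosine(user["Tu"], resource["Tr"]) +
            rho * cosine(user["Lu"], resource["Lr"]) +
            tau * jaccard(user["Ku"], resource["Kr"]))

def recommend(user, resources, qualities, lam1=0.7, lam2=0.3, top_n=10):
    scored = [(lam1 * model_similarity(user, r) + lam2 * q, r)
              for r, q in zip(resources, qualities)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]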
The whole recommendation process is shown in fig. 10. As can be seen from the figure, the recommendation flow of the whole system includes three parts: resource model construction, cold start stage recommendation and secondary recommendation. The steps are as follows:
the construction process of the resource model comprises the following steps:
1) acquiring five types of academic resource data through a web crawler and a data interface technology;
2) analyzing and extracting the relevant information of each academic resource, and inserting the information into a resource library;
3) preprocessing each piece of data in the resource library, including word segmentation and stop-word removal;
4) calculating subject distribution, keyword distribution and LDA potential subject distribution of each resource through the trained three models, wherein the three models are a Bayesian polynomial model, a VSM model and an LDA model respectively;
5) obtaining the discipline categories of each resource according to its discipline distribution vector, the discipline categories being the 3 disciplines with the highest probability in the discipline distribution vector;
6) calculating a quality value for each resource;
7) the discipline distribution vector, the keyword distribution vector, the LDA latent topic distribution vector, the discipline category, and the quality value are inserted into the repository.
Recommendation process of cold start phase:
1) selecting academic resources that match the user's interest disciplines;
2) recommending high-quality resources according to the quality values of the academic resources.
The recommendation process of the secondary recommendation stage is as follows:
1) obtaining a browsing record of a user, and calculating a 'behavior coefficient';
2) constructing a user interest model;
3) calculating the similarity between the resource model and the user interest model;
4) calculating recommendation degree according to the similarity and the quality value;
5) performing Top-N recommendation according to the recommendation degree of the resources.
In order to facilitate subsequent calculation, a resource model is built in advance, and when a user uses the system for the first time, academic resources are recommended to the user by adopting a recommendation strategy in a cold start stage; and after the behavior data of the user reaches a certain amount, recommending academic resources for the user by adopting a secondary recommendation strategy.
The invention provides a corresponding recommendation strategy mainly according to the continuous accumulation and change of academic resources and user data. Recommending high-quality resources which accord with the interest disciplines for the user in the cold starting stage; in the secondary recommendation stage, modeling is carried out on various academic resources from four dimensions including resource types, subject distribution, keyword distribution and LDA potential subject distribution, modeling is carried out on user interest preference according to user behaviors, and finally Top-N recommendation is carried out according to resource recommendation degree.
Experimental results show that the academic resource recommendation strategy adopted by the invention can fully satisfy the user's interest disciplines and has an obvious effect on improving the CTR of resources; for the secondary recommendation stage, the experimental results show that the recommendation strategy under the modeling method adopted by the invention is obviously higher in precision than the recommendation strategies under the two currently common resource modeling modes.

Claims (8)

1. An academic resource recommendation service system is characterized in that the academic resource recommendation service system comprises a web crawler, a text classification model and a local academic resource database to be recommended, and the academic resource recommendation service system is used for crawling academic resources on the internet by the web crawler; calculating the attention degree of users with different identities to various types of academic resources based on historical browsing behavior data of users of different groups, modeling the academic resources from four dimensions including resource types, subject distribution, keyword distribution and LDA potential subject distribution, modeling a user interest model by combining interest subjects of the users and the historical browsing behavior data, calculating the similarity between the academic resource model and the user interest model, calculating the recommendation degree by combining a resource quality value, and finally recommending the academic resources Top-N for the users according to the recommendation degree; the network crawler is a topic crawler and is provided with an LDA topic model, the LDA topic model is a three-layer Bayes generation model of 'document-topic-word', a corpus is configured for the LDA topic model in advance, the corpus comprises training corpuses, the LDA topic model is trained by the training corpuses according to a set topic number K, a word clustering function during training of the LDA topic model is utilized, after the training corpuses are trained by the LDA topic model, K topic associated word sets respectively aggregated according to the set topic number K are obtained, and then K topic documents of the topic crawler crawling at the time are obtained; the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ordering module on the basis of the common web crawler; the topic crawlers are a plurality of distributed crawlers distributed according to the number of academic topics, each distributed crawler corresponds to one academic topic, and each distributed crawler simultaneously obtains academic resources of the plurality of academic topics; in each crawling process of the topic crawler, a topic determining module of the topic crawler determines a target topic and a topic document thereof, the topic document is used for guiding the calculation of topic similarity, a similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page and combines the content of the page, hyperlinks of which the topic similarity of the anchor text combined with the page is smaller than a set threshold are removed, URLs of which the topic similarity of the anchor text combined with the page is larger than the set threshold are selected, the topic crawler maintains a URL queue of unvisited webpages pointed by the hyperlinks of the visited webpages, the URL queue is arranged according to the descending order of the similarity, the topic crawler visits the webpages of all URLs successively according to the arrangement order of the URL queue, crawls corresponding academic resources, continuously classifies tags of the crawled academic resources and stores the tags into a database, and aims at the crawled topic document, until the URL of the non-access queue is empty; the academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model; and continuously 
repeating the crawling process of the theme crawler, so that the theme associated words collected by each theme document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to a degree of human approval.
2. The academic resource recommendation service system according to claim 1, wherein the corpus further comprises a verification corpus with definite categories, which is used for classifying and verifying the text classification model according to a predetermined category number a by using the verification corpus in advance, so as to obtain classification accuracy of the text classification model to each of the a categories, which is used as an index of classification credibility of the text classification model to each of the a categories; the accuracy rate is the ratio of correctly classified corpora in all the verification corpora classified into a certain category by the text classification model, and a classification accuracy rate threshold is preset.
3. The academic resource recommendation service system of claim 2, wherein all the disciplines are divided into 75 discipline categories, that is, the category number a is 75 categories, the number K of topics is set to 100 during training using the LDA topic model, and the preset classification accuracy threshold is 80% during the classification verification of the text classification model.
4. A method for providing academic resource recommendation service for related users by a resource recommendation service system is characterized in that the academic resources are classified according to predetermined A categories and then stored by using a text classification model to form an academic resource database, an open API of the academic resource database is provided for display and resource recommendation module calling, and a tracking software module is cloned at a user terminal by using the academic resource model, a resource quality value calculation model and a user interest model and used for tracking and recording the online browsing behavior of the user; the process of recommending the corresponding academic resources to the user comprises a cold start recommending stage and a secondary recommending stage, wherein the cold start recommending stage recommends high-quality resources which accord with the interest disciplines for the user based on the interest disciplines, the high-quality resources are the academic resources with high resource quality values obtained by comparison after calculation by a resource quality value calculation model, and the resource quality values are the arithmetic mean or weighted mean of the resource authority, the resource community heat and the resource time-freshness; in the secondary recommendation stage, modeling is respectively carried out on a user interest model and a resource model, the similarity between the user interest model and the resource model is calculated, the recommendation degree is calculated by combining the resource quality value, and finally, academic resource Top-N recommendation is carried out on the user according to the recommendation degree; the network crawler is a topic crawler and is provided with an LDA topic model, the LDA topic model is a three-layer Bayes generation model of 'document-topic-word', a corpus is configured for the LDA topic model in advance, the corpus comprises training corpuses, the LDA topic model is trained by the training corpuses according to a set topic number K, a word clustering function during training of the LDA topic model is utilized, after the training corpuses are trained by the LDA topic model, K topic associated word sets respectively aggregated according to the set topic number K are obtained, and then K topic documents of the topic crawler crawling at the time are obtained; the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ordering module on the basis of the common web crawler; the topic crawlers are a plurality of distributed crawlers distributed according to the number of academic topics, each distributed crawler corresponds to one academic topic, and each distributed crawler simultaneously obtains academic resources of the plurality of academic topics; in each crawling process of the topic crawler, a topic determining module of the topic crawler determines a target topic and a topic document thereof, the topic document is used for guiding the calculation of topic similarity, a similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page and combines the content of the page, hyperlinks of which the topic similarity of the anchor text combined with the page is smaller than a set threshold are removed, URLs of which the topic similarity of the anchor text combined with the page is larger than the set threshold are selected, the topic crawler maintains a URL 
queue of unvisited webpages pointed by the hyperlinks of the visited webpages, the URL queue is arranged according to the descending order of the similarity, the topic crawler visits the webpages of all URLs successively according to the arrangement order of the URL queue, crawls corresponding academic resources, continuously classifies tags of the crawled academic resources and stores the tags into a database, and aims at the crawled topic document, until the URL of the non-access queue is empty; the academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model; and continuously repeating the crawling process of the theme crawler, so that the theme associated words collected by each theme document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to a degree of human approval.
5. The method of claim 4, wherein the resource Quality value Quality calculation includes a formula for Authority of the resource as follows:
Figure FDA0002274819940000031
wherein Level is the quantified score of the publication level of the resource; the publication level is divided into 5 grades with scores of 1, 0.8, 0.6, 0.4 and 0.2 in sequence: top journals or conferences such as Nature and Science are scored 1, second-level venues such as ACM Transactions are scored 0.8, and the lowest level is scored 0.2; the calculation formula for Cite is as follows:
Cite=Cites/maxCite (2)
Cite is the quantized result of the resource citation count, Cites is the citation count of the resource, and maxCite is the largest citation count in the resource database;
the calculation formula of the resource community heat degree Popularity is as follows:
Popularity =readTimes/maxReadTimes (3)
readTimes is the number of times the resource has been read, and maxReadTimes is the maximum number of reads in the source database of the resource;
the recentness Recentness of all resources is calculated in the same way, with the following formula:
Figure FDA0002274819940000032
year and month are the year and month of publication of the resource, respectively; minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months of all resources in the source database for that type of resource;
the resource Quality value Quality calculation method is as follows:
Figure FDA0002274819940000033
6. the method of claim 4, wherein the academic resource model is represented as follows:
Mr={Tr,Kr,Ct,Lr} (6)
wherein Tr is the discipline distribution vector of the academic resource, i.e. the probability values of the academic resource over the A discipline categories, obtained by a Bayesian multinomial model;
Kr = {(kr1, ωr1), (kr2, ωr2), ..., (krm, ωrm)}, where m is the number of keywords, kri (1 ≤ i ≤ m) denotes the ith keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm whose calculation formula is as follows:
Figure FDA0002274819940000041
w(i, r) represents the weight of the ith keyword in document r, tf(i, r) represents the frequency of the ith keyword in document r, Z represents the total length of the document set, and L represents the number of documents containing keyword i; Lr is the latent topic distribution vector, Lr = {lr1, lr2, lr3, ..., lrN1}, where N1 is the number of latent topics; Ct is the resource type, and t can take the values 1, 2, 3, 4, 5, corresponding to the five major classes of academic resources: papers, patents, news, conferences and books;
according to the behavior characteristics of a user using mobile software, the operation behavior of the user on an academic resource is divided into opening, reading, star-level evaluation, sharing and collection, a user interest model is built on the basis of the user background and the browsed academic resource and in combination with the academic resource model according to different browsing behaviors of the user, and the user interest model is expressed as follows:
Mu={Tu,Ku,Ct,Lu} (8)
wherein Tu is the user discipline preference distribution vector formed, after user behaviors, from the discipline distribution vectors Tr of the academic resources of a certain class viewed by the user over a period of time, namely
Figure FDA0002274819940000042
wherein sum is the total number of academic resources on which the user has produced behaviors, sj is the 'behavior coefficient' after the user acts on academic resource j, a larger value indicating that the user prefers the resource; Tjr is the discipline distribution vector of the jth resource; the calculation of sj comprehensively considers opening, reading, evaluating, collecting, sharing and other behaviors, and can accurately reflect the user's degree of preference for the resource.
Ku = {(ku1, ωu1), (ku2, ωu2), ..., (kuN2, ωuN2)} is the user preference keyword distribution, where N2 is the number of keywords, kui (1 ≤ i ≤ N2) denotes the ith user-preferred keyword, and ωui is the weight of keyword kui, calculated from the keyword distribution vectors Kr of all academic resources on which user u has produced behaviors within a period of time;
K'jr = sj · Kjr    (10)
calculating a new keyword distribution vector for each academic resource according to formula (10), and selecting the TOP-N2 keywords among the new keyword distribution vectors of all resources as the user keyword preference distribution vector Ku;
Lu is the user's LDA latent topic preference distribution vector, calculated from the LDA latent topic distribution vectors Lr = {lr1, lr2, lr3, ..., lrN1} of the academic resources in the same way as Tu.
Figure FDA0002274819940000043
The similarity between the user interests and the resource model is calculated as follows:
academic resource model representation:
Mr={Tr,Kr,Ct,Lr} (12)
user interest model representation:
Mu={Tu,Ku,Ct,Lu} (13)
the similarity between the user discipline preference distribution vector Tu and the discipline distribution vector Tr of an academic resource is calculated by cosine similarity, namely:
Figure FDA0002274819940000051
the similarity between the user LDA latent topic preference distribution vector Lu and the academic resource LDA latent topic distribution vector Lr is calculated by cosine similarity, namely:
Figure FDA0002274819940000052
the similarity between the user keyword preference distribution vector Ku and the academic resource keyword distribution vector Kr is calculated by Jaccard similarity:
Figure FDA0002274819940000053
then the similarity between the user interest model and the academic resource model is as follows:
Figure FDA0002274819940000054
wherein σ + ρ + τ = 1, and the specific weight distribution is obtained by experimental training;
introducing a Recommendation_degree concept, wherein the higher the recommendation degree of an academic resource, the better the resource matches the user's interest preference and the higher its quality; the recommendation degree calculation formula is as follows:
Recommendation_degree = λ1·Sim(Mu, Mr) + λ2·Quality, where λ1 + λ2 = 1    (18)
the secondary recommendation stage is to perform Top-N recommendation according to the recommendation degree of academic resources.
7. The method according to claim 4, wherein the corpus further includes a class-specific verification corpus, which is used to make the text classification model perform classification verification in advance according to a predetermined class number A by using the verification corpus, so as to obtain the classification accuracy of the text classification model for each class in the A classes, which is used as the classification credibility index of the text classification model for each class in the A classes; the accuracy rate is the ratio of correctly classified corpora in all verified corpora classified by the text classification model, and a classification accuracy rate threshold is preset; the text classification method for each text to be classified by using the text classification model specifically comprises the following steps:
step one, preprocessing each text to be classified, wherein the preprocessing comprises word segmentation and stop-word removal with proper nouns retained; respectively calculating the feature weights of all preprocessed words of the text, wherein the feature weight value of a word is directly proportional to its frequency of occurrence in the text and inversely proportional to its frequency of occurrence in the training corpus; arranging the calculated word set in descending order of feature weight value, and extracting the front part of the original word set of each text to be classified as its feature word set;
step two, using the text classification model with the original feature word set of each text to be classified to respectively calculate the probability values of the text belonging to each of the predetermined A categories, and selecting the category with the maximum probability value as the classification category of the text;
step three, judging the text classification result of the step two, and directly outputting the result if the classification accuracy value of the text classification model to the classification reaches a set threshold value; if the classification accuracy rate value of the text classification model to the classification does not reach the set threshold value, entering the step four;
inputting each preprocessed text into the LDA topic model, calculating a weight value of each topic in K set topics corresponding to the text by using the LDA topic model, selecting the topic with the largest weight value, adding the first Y words in topic associated words under the topic obtained after being trained by the LDA topic model into an original feature word set of the text to be used as an expanded feature word set together, respectively calculating probability values of each category in A preset categories possibly attributed to the text by using the text classification model again, and selecting the category with the largest probability value as a final classification category of the text.
8. The method of claim 7, wherein the main calculation formula of the text classification model is:
Figure FDA0002274819940000061
wherein P(cj | x1, x2, ..., xn) represents the probability that the text belongs to category cj when the feature words (x1, x2, ..., xn) appear simultaneously;
wherein P(cj) represents the proportion of texts in the training text set that belong to category cj, P(x1, x2, ..., xn | cj) represents the probability that the feature word set of the text is (x1, x2, ..., xn) given that the text to be classified belongs to category cj, and P(x1, x2, ..., xn) represents the joint probability of the feature words.
CN201611130297.9A 2016-12-09 2016-12-09 Academic resource recommendation service system and method Active CN106815297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611130297.9A CN106815297B (en) 2016-12-09 2016-12-09 Academic resource recommendation service system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611130297.9A CN106815297B (en) 2016-12-09 2016-12-09 Academic resource recommendation service system and method

Publications (2)

Publication Number Publication Date
CN106815297A CN106815297A (en) 2017-06-09
CN106815297B true CN106815297B (en) 2020-04-10

Family

ID=59107077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611130297.9A Active CN106815297B (en) 2016-12-09 2016-12-09 Academic resource recommendation service system and method

Country Status (1)

Country Link
CN (1) CN106815297B (en)

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2632131C2 (en) 2015-08-28 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Method and device for creating recommended list of content
RU2632100C2 (en) 2015-09-28 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Method and server of recommended set of elements creation
RU2629638C2 (en) 2015-09-28 2017-08-30 Общество С Ограниченной Ответственностью "Яндекс" Method and server of creating recommended set of elements for user
RU2632144C1 (en) 2016-05-12 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Computer method for creating content recommendation interface
RU2636702C1 (en) 2016-07-07 2017-11-27 Общество С Ограниченной Ответственностью "Яндекс" Method and device for selecting network resource as source of content in recommendations system
RU2632132C1 (en) 2016-07-07 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Method and device for creating contents recommendations in recommendations system
USD882600S1 (en) 2017-01-13 2020-04-28 Yandex Europe Ag Display screen with graphical user interface
CN107247751B (en) * 2017-05-26 2020-01-14 武汉大学 LDA topic model-based content recommendation method
CN108280114B (en) * 2017-07-28 2022-01-28 淮阴工学院 Deep learning-based user literature reading interest analysis method
CN110008334B (en) * 2017-08-04 2023-03-14 腾讯科技(北京)有限公司 Information processing method, device and storage medium
CN107590232B (en) * 2017-09-07 2019-12-06 北京师范大学 Resource recommendation system and method based on network learning environment
CN110020110B (en) * 2017-09-15 2023-04-07 腾讯科技(北京)有限公司 Media content recommendation method, device and storage medium
CN109672706B (en) * 2017-10-16 2022-06-14 百度在线网络技术(北京)有限公司 Information recommendation method and device, server and storage medium
CN107908669A (en) * 2017-10-17 2018-04-13 广东广业开元科技有限公司 A kind of big data news based on parallel LDA recommends method, system and device
CN107818145A (en) * 2017-10-18 2018-03-20 南京邮数通信息科技有限公司 A kind of user behavior tag along sort extracting method based on dynamic reptile
CN107833061A (en) * 2017-11-17 2018-03-23 中农网购(江苏)电子商务有限公司 One kind is for retail Intelligent agricultural product allocator
CN108090131A (en) * 2017-11-23 2018-05-29 北京洪泰同创信息技术有限公司 It teaches the method for pushing of auxiliary resource data and teaches the pusher of auxiliary resource data
CN108038765B (en) * 2017-12-23 2022-01-25 身轻如燕信息(上海)有限公司 Catering management ordering system based on video capture
CN108255992A (en) * 2017-12-29 2018-07-06 广州贝睿信息科技有限公司 It is a kind of paint originally can be readability assessment recommend method
CN110309411A (en) * 2018-03-15 2019-10-08 中国移动通信集团有限公司 A kind of resource recommendation method and device
CN108446273B (en) * 2018-03-15 2021-07-20 哈工大机器人(合肥)国际创新研究院 Kalman filtering word vector learning method based on Dield process
CN108600306A (en) * 2018-03-20 2018-09-28 成都星环科技有限公司 A kind of intelligent content supplying system
CN108337569A (en) * 2018-04-03 2018-07-27 优视科技有限公司 A kind of interactive discussion method, apparatus and terminal device based on video
CN108595593B (en) * 2018-04-19 2021-11-23 南京大学 Topic model-based conference research hotspot and development trend information analysis method
CN108717445A (en) * 2018-05-17 2018-10-30 南京大学 A kind of online social platform user interest recommendation method based on historical data
CN108897860B (en) * 2018-06-29 2022-05-27 中国科学技术信息研究所 Information pushing method and device, electronic equipment and computer readable storage medium
CN110020189A (en) * 2018-06-29 2019-07-16 武汉掌游科技有限公司 A kind of article recommended method based on Chinese Similarity measures
CN109213908A (en) * 2018-08-01 2019-01-15 浙江工业大学 A kind of academic meeting paper supplying system based on data mining
RU2714594C1 (en) 2018-09-14 2020-02-18 Общество С Ограниченной Ответственностью "Яндекс" Method and system for determining parameter relevance for content items
RU2720952C2 (en) 2018-09-14 2020-05-15 Общество С Ограниченной Ответственностью "Яндекс" Method and system for generating digital content recommendation
RU2720899C2 (en) 2018-09-14 2020-05-14 Общество С Ограниченной Ответственностью "Яндекс" Method and system for determining user-specific content proportions for recommendation
CN109189892B (en) * 2018-09-17 2021-04-27 北京一点网聚科技有限公司 Recommendation method and device based on article comments
CN109325179B (en) * 2018-09-17 2020-12-04 青岛海信网络科技股份有限公司 Content promotion method and device
RU2725659C2 (en) 2018-10-08 2020-07-03 Общество С Ограниченной Ответственностью "Яндекс" Method and system for evaluating data on user-element interactions
RU2731335C2 (en) 2018-10-09 2020-09-01 Общество С Ограниченной Ответственностью "Яндекс" Method and system for generating recommendations of digital content
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109344319B (en) * 2018-11-01 2021-08-24 中国搜索信息科技股份有限公司 Online content popularity prediction method based on ensemble learning
CN109801146B (en) * 2019-01-18 2020-12-29 北京工业大学 Resource service recommendation method and system based on demand preference
CN110297882A (en) * 2019-03-01 2019-10-01 阿里巴巴集团控股有限公司 Training corpus determines method and device
CN110245080B (en) * 2019-05-28 2022-08-16 厦门美柚股份有限公司 Method and device for generating scene test case
CN112052330B (en) * 2019-06-05 2021-11-26 上海游昆信息技术有限公司 Application keyword distribution method and device
CN110209822B (en) * 2019-06-11 2021-12-21 中译语通科技股份有限公司 Academic field data correlation prediction method based on deep learning and computer
CN110490547A (en) * 2019-08-13 2019-11-22 北京航空航天大学 Office system intellectualized technology
CN110598151B (en) * 2019-09-09 2023-07-14 河南牧业经济学院 Method and system for judging news spreading effect
RU2757406C1 (en) 2019-09-09 2021-10-15 Общество С Ограниченной Ответственностью «Яндекс» Method and system for providing a level of service when advertising content element
CN110688476A (en) * 2019-09-23 2020-01-14 腾讯科技(北京)有限公司 Text recommendation method and device based on artificial intelligence
CN110866106A (en) * 2019-10-10 2020-03-06 重庆金融资产交易所有限责任公司 Text recommendation method and related equipment
CN110866181B (en) * 2019-10-12 2022-04-22 平安国际智慧城市科技股份有限公司 Resource recommendation method, device and storage medium
CN111177372A (en) * 2019-12-06 2020-05-19 绍兴市上虞区理工高等研究院 Scientific and technological achievement classification method, device, equipment and medium
CN111241318B (en) * 2020-01-03 2021-04-13 北京字节跳动网络技术有限公司 Method, device, equipment and storage medium for selecting object to push cover picture
CN111241403B (en) * 2020-01-15 2023-04-18 华南师范大学 Deep learning-based team recommendation method, system and storage medium
CN111325006B (en) * 2020-03-17 2023-05-05 北京百度网讯科技有限公司 Information interaction method and device, electronic equipment and storage medium
CN111563177B (en) * 2020-05-15 2023-05-23 深圳掌酷软件有限公司 Theme wallpaper recommendation method and system based on cosine algorithm
CN111625439B (en) * 2020-06-01 2023-07-04 杭州弧途科技有限公司 Method for analyzing app user stickiness based on user behavior log data
CN111651675B (en) * 2020-06-09 2023-07-04 杨鹏 UCL-based user interest topic mining method and device
CN112287199A (en) * 2020-10-29 2021-01-29 黑龙江稻榛通网络技术服务有限公司 Big data center processing system based on cloud server
CN112559901B (en) * 2020-12-11 2022-02-08 百度在线网络技术(北京)有限公司 Resource recommendation method and device, electronic equipment, storage medium and computer program product
CN113268683B (en) * 2021-04-15 2023-05-16 南京邮电大学 Academic literature recommendation method based on multiple dimensions
CN113536085B (en) * 2021-06-23 2023-05-19 西华大学 Method and system for scheduling subject term search crawlers based on combined prediction method
CN113420058B (en) * 2021-07-01 2022-07-01 宁波大学 Conversational academic conference recommendation method based on combination of user historical behaviors
CN113360776B (en) * 2021-07-19 2023-07-21 西南大学 Cross-table data mining-based technological resource recommendation method
CN113568882A (en) * 2021-08-03 2021-10-29 重庆仓舟网络科技有限公司 OSS-based resource sharing method and system
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN114519097B (en) * 2022-04-21 2022-07-19 宁波大学 Heterogeneous information network-enhanced academic paper recommendation method
CN117575745B (en) * 2024-01-17 2024-04-30 山东正禾大教育科技有限公司 Personalized course teaching resource recommendation method based on AI big data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336793B (en) * 2013-06-09 2015-08-12 中国科学院计算技术研究所 Personalized article recommendation method and system
CN103324761A (en) * 2013-07-11 2013-09-25 广州市尊网商通资讯科技有限公司 Product database construction method and system based on Internet data
CN104680453A (en) * 2015-02-28 2015-06-03 北京大学 Course recommendation method and system based on students' attributes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"高质量学术资源推荐方法的研究与实现";高洁;《中国优秀硕士学位论文全文数据库信息科技辑》;20150430;第1-56页 *

Also Published As

Publication number Publication date
CN106815297A (en) 2017-06-09

Similar Documents

Publication Publication Date Title
CN106815297B (en) Academic resource recommendation service system and method
US7844592B2 (en) Ontology-content-based filtering method for personalized newspapers
US8027977B2 (en) Recommending content using discriminatively trained document similarity
KR101712988B1 (en) Method and apparatus for providing internet service in a mobile communication terminal
CN106682152B (en) Personalized message recommendation method
US20150213361A1 (en) Predicting interesting things and concepts in content
CN111061962A (en) Recommendation method based on user score analysis
CN112966091B (en) Knowledge graph recommendation system fusing entity information and popularity
CN111177538A (en) Unsupervised weight calculation-based user interest tag construction method
Godoy et al. Interface agents personalizing Web-based tasks
CN102156747B (en) Method and device for forecasting collaborative filtering mark by introduction of social tag
Chang et al. LDA-based personalized document recommendation
Kacem et al. Time-sensitive user profile for optimizing search personalization
CN111753167B (en) Search processing method, device, computer equipment and medium
Velásquez Web site keywords: A methodology for improving gradually the web site text content
JP2022035314A (en) Information processing unit and program
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
KR20100023630A (en) Method and system of classifying web pages using category tag information and recording medium used by the same
Zhang et al. An interpretable and scalable recommendation method based on network embedding
Hoang et al. Academic event recommendation based on research similarity and exploring interaction between authors
Ma et al. Book recommendation model based on wide and deep model
KR101827338B1 (en) Method and apparatus for providing internet service in a mobile communication terminal
Ahamed et al. Deduce user search progression with feedback session
Vázquez et al. Validation of scientific topic models using graph analysis and corpus metadata
CN115510326A (en) Internet forum user interest recommendation algorithm based on text features and emotional tendency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant