CN106815297B - Academic resource recommendation service system and method - Google Patents

Academic resource recommendation service system and method

Info

Publication number: CN106815297B
Application number: CN201611130297.9A
Authority: CN (China)
Prior art keywords: topic, academic, resource, model, user
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN106815297A
Inventors: 刘柏嵩, 王洋洋, 尹丽玲, 费晨杰, 高元
Current assignee: Ningbo University
Original assignee: Ningbo University
Application filed by Ningbo University
Priority to CN201611130297.9A
Publication of CN106815297A
Application granted
Publication of CN106815297B

Classifications

    • G06F16/3344: Query execution using natural language analysis (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F16/00 Information retrieval, database and file system structures; G06F16/30 unstructured textual data; G06F16/33 querying; G06F16/3331 query processing; G06F16/334 query execution)
    • G06F16/9535: Search customisation based on user profiles and personalisation (G06F16/90 details of database functions; G06F16/95 retrieval from the web; G06F16/953 querying, e.g. by the use of web search engines)

Abstract

An academic resource recommendation service system and method are provided. Academic resources on the Internet are crawled with an LDA-based topic crawler, classified into A preset categories with an LDA-based text classification model, and stored in a local academic resource database. The system further comprises an academic resource model, a resource quality value calculation model and a user interest model, and a tracking software module is embedded in the user's terminal. The academic resource model and the user interest model are built from four dimensions, combined with the user's interest disciplines and historical browsing behavior data; the similarity between the academic resource model and the user interest preference model is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree. The method and system make personalized and accurate recommendations of academic resources according to the user's identity, interests and browsing behavior, and improve the work efficiency of scientific research personnel.

Description

Academic resource recommendation service system and method
Technical Field
The invention relates to the technical field of computer application, in particular to an academic resource recommendation service system and a method for providing academic resource recommendation service for related users by using the resource recommendation service system.
Background
We have now entered the big-data era, especially in the field of academic resources, where hundreds of millions of academic resources are generated each year. Besides academic papers and patents, a large number of academic resources such as academic conferences, academic news and academic community information emerge in real time, and these types of academic resources are of great significance for a user to grasp, accurately and efficiently, the current research situation in the fields of interest. However, scientific research users have limited time and energy for research, academic resources are characterized by large volume, heterogeneity and rapid growth, and the traditional search-engine mode makes it difficult to retrieve academic resources comprehensively; the search process is cumbersome, so users often spend a great deal of time and energy querying for academic resources of interest, which affects their work efficiency.
Current research on personalized recommendation of academic resources mainly focuses on academic papers, so the recommended academic resource type is single. Different user groups, i.e. users with different identities, pay different degrees of attention to different types of academic resources; current personalized recommendation research on academic resources does not consider this factor and cannot formulate a multi-strategy recommendation scheme based on user identity. In addition, current academic resource recommendation research is limited to the recommendation module itself and does not provide systematic services around academic resource recommendation: an integrated service system with resource integration and recommendation at its core, spanning the dynamic acquisition, integration and classification of academic resources through to personalized recommendation based on user identity, behavior and interest disciplines, has not yet been formed.
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics and documents. "Generative model" means that every word of an article is considered to be obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". A topic refers to a defined professional or interest field, such as aerospace, biomedicine or information technology, and concretely refers to a set formed by a series of related words. Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution. LDA is an unsupervised machine learning technique that can be used to identify latent topic information in documents. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words. The LDA topic model is a typical model for topic mining in natural language processing; it can extract latent topics from a text corpus, provides a method for quantifying research topics, and is widely applied to topic discovery in academic resources, such as research hotspot mining, research topic evolution and research trend prediction.
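For illustration only, the following minimal sketch shows how such an LDA model turns documents into topic distributions and topics into word distributions; it assumes the Python gensim library, and the toy documents and topic count are placeholders rather than part of the original disclosure.

# Minimal LDA sketch; assumes the gensim library; toy corpus and K=2 are illustrative only.
from gensim import corpora, models

docs = [["aerospace", "satellite", "orbit", "launch"],
        ["gene", "protein", "biomedicine", "cell"],
        ["satellite", "orbit", "gene", "cell"]]

dictionary = corpora.Dictionary(docs)              # word <-> id mapping (bag of words)
bow = [dictionary.doc2bow(d) for d in docs]        # each document as a word-frequency vector

lda = models.LdaModel(corpus=bow, id2word=dictionary, num_topics=2, passes=20)

print(lda.show_topic(0, topn=4))                   # topic 0 as a probability distribution over words
print(lda.get_document_topics(bow[0]))             # document 0 as a probability distribution over topics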
In addition, with the development of the Internet, the Internet is filled with a large amount of information text in various forms, such as news, blogs and meeting memos. These texts more or less contain academic-related information and often contain the latest academic research information; they concern various related disciplines, are disordered, often overlap in topic, and generally carry no classification information.
The present invention is directed to solving the above-mentioned problems.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an academic resource recommendation service system and a method for providing academic resource recommendation service for related users by using the resource recommendation service system aiming at the technical current situation.
The technical scheme adopted by the invention for solving the technical problems is as follows:
an academic resource recommendation service system, characterized in that academic resources crawled from the Internet are classified into predetermined A categories with a text classification model and stored in a local academic resource database; an open API of the academic resource database is provided for display and for invocation by a resource recommendation module; the system further comprises an academic resource model, a resource quality value calculation model and a user interest model, and a tracking software module is embedded in the user's terminal to track and record the user's online browsing behavior. The attention degree of users with different identities to each type of academic resource is calculated from the historical browsing behavior data of different user groups; academic resources are modeled from four dimensions (resource type, discipline distribution, keyword distribution and LDA latent topic distribution); the user's interest preference is modeled by combining the user's interest disciplines and historical browsing behavior data; the similarity between the academic resource model and the user interest preference model is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree.
The web crawler is a topic crawler and further comprises an LDA topic model; the LDA topic model is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance and includes training corpora; the LDA topic model is trained with the training corpora according to a set topic number K, and, using the word-clustering effect of LDA training, the training corpora are aggregated into K topic-associated word sets according to the set topic number K, which form the K topic documents used by the topic crawler for the current crawl. On the basis of a common web crawler, the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ranking module. The topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponding to one academic topic, so that academic resources of the plurality of academic topics are obtained simultaneously. In each crawl, the topic determining module of the topic crawler determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page in combination with the content of the page, hyperlinks whose combined anchor-text and page topic similarity is smaller than a set threshold are discarded, and URLs whose combined topic similarity is larger than the set threshold are selected. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of the visited web pages, ordered by descending similarity; it visits the web pages of these URLs in that order, crawls the corresponding academic resources, and continuously classifies, tags and stores the crawled academic resources into the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model. This crawling process is repeated continuously, so that the topic-associated words gathered in each topic document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to an acceptable level.
The corpus also comprises verification corpora with definite categories, used in advance to verify the classification of the text classification model according to the preset category number A, so as to obtain the classification accuracy of the text classification model for each of the A categories; this accuracy serves as a classification-credibility index of the text classification model for each of the A categories. The accuracy is the proportion of correctly classified corpora among all verification corpora classified into a given category by the text classification model, and a classification accuracy threshold is preset.
All disciplines are divided into 75 discipline categories, i.e. the category number A is 75; the topic number K is set to 100 when training the LDA topic model; and the preset classification accuracy threshold for verification of the text classification model is 80%.
A method for providing academic resource recommendation service for related users with the resource recommendation service system, characterized in that academic resources are classified according to predetermined A categories with a text classification model and stored to form an academic resource database; an open API of the academic resource database is provided for display and for invocation by a resource recommendation module; and a tracking software module is installed at the user terminal to track and record the user's online browsing behavior. The process of recommending corresponding academic resources to the user comprises a cold-start recommendation stage and a secondary recommendation stage. The cold-start recommendation stage recommends, based on the user's interest disciplines, high-quality resources that match those disciplines; the high-quality resources are the academic resources whose resource quality values, calculated by the resource quality value calculation model, are highest by comparison, and the resource quality value is the arithmetic or weighted mean of the resource authority, the resource community popularity and the resource recency. In the secondary recommendation stage, the user interest model and the resource model are built separately, the similarity between them is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree.
The resource Quality value is calculated as follows. The resource Authority is calculated with the formula:

Authority = (Level + Cite) / 2   (1)

where Level is the quantified publication-level score of the resource; the publication level is divided into 5 grades, scored 1, 0.8, 0.6, 0.4 and 0.2 in turn. Top journals or conferences such as Nature and Science score 1, the second grade such as ACM Transactions scores 0.8, and the lowest grade scores 0.2. Cite is calculated as:

Cite = Cites / maxCite   (2)

where Cite is the quantified result of the resource's citation count, Cites is the number of citations of the resource, and maxCite is the largest citation count in the resource database;
the calculation formula of the resource community heat degree Popularity is as follows:
Popularity=readTimes/maxReadTimes (3)
where readTimes is the number of times the resource has been read, and maxReadTimes is the maximum read count in the source database of that resource type;
the time-new Recentness calculation method of the resources is the same, and the formula is as follows:
Figure BDA0001176027590000042
year and month are the year and month of publication of the resource, respectively; minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months of all resources in the source database for that type of resource;
the resource Quality value Quality calculation method is as follows:
Figure BDA0001176027590000043
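For illustration only, a short Python sketch of this quality-value computation follows; the averaging in formulas (1) and (5) follows the reconstruction above, the function and argument names are assumptions, and the sample numbers are invented.

# Illustrative sketch of the resource quality value; equal-weight averaging is an assumption.
def authority(level_score, cites, max_cite):
    cite = cites / max_cite if max_cite else 0.0                      # formula (2)
    return (level_score + cite) / 2                                   # formula (1), as reconstructed

def popularity(read_times, max_read_times):
    return read_times / max_read_times if max_read_times else 0.0    # formula (3)

def recentness(year, month, min_y, min_m, max_y, max_m):
    span = (max_y - min_y) * 12 + (max_m - min_m)
    return ((year - min_y) * 12 + (month - min_m)) / span if span else 1.0   # formula (4), as reconstructed

def quality(auth, pop, rec):
    return (auth + pop + rec) / 3                                     # formula (5): arithmetic mean

q = quality(authority(0.8, 120, 900), popularity(350, 5000), recentness(2016, 6, 2000, 1, 2016, 12))
print(round(q, 3))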
the academic resource model is represented as follows:
Mr = {Tr, Kr, Ct, Lr}   (6)

where Tr is the discipline distribution vector of the academic resource, i.e. the probability values of the resource over the A discipline categories, obtained by a Bayesian multinomial model;

Kr = {(kr1, ωr1), (kr2, ωr2), …, (krm, ωrm)}, where m is the number of keywords, kri (1 ≤ i ≤ m) denotes the ith keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm with the following formula:

w(i, r) = tf(i, r) * log(Z / L)   (7)

where w(i, r) is the weight of the ith keyword in document r, tf(i, r) is the frequency of the ith keyword in document r, Z is the total length of the document set, and L is the number of documents containing keyword i. Lr is the latent topic distribution vector, Lr = {lr1, lr2, lr3, …, lrN1}, where N1 is the number of latent topics. Ct is the resource type, and t can take the values 1, 2, 3, 4, 5, corresponding to the five major types of academic resources: papers, patents, news, conferences, and books;
according to the behavior characteristics of a user using mobile software, the operation behavior of the user on an academic resource is divided into opening, reading, star-level evaluation, sharing and collection, a user interest model is built on the basis of the user background and the browsed academic resource and in combination with the academic resource model according to different browsing behaviors of the user, and the user interest model is expressed as follows:
Mu = {Tu, Ku, Ct, Lu}   (8)

where Tu is the user's discipline preference distribution vector, formed from the discipline distribution vectors Tr of the academic resources of a given type that the user has browsed over a period of time, weighted by the user's actions, i.e.

Tu = (1 / sum) * Σ_{j=1..sum} sj * Tjr   (9)

where sum is the total number of academic resources on which the user has generated behavior, sj is the "behavior coefficient" of the user for academic resource j after the behavior occurs (the larger the value, the more the user likes the resource), and Tjr is the discipline distribution vector of the jth resource. The calculation of sj comprehensively considers behaviors such as opening, reading, rating, collecting and sharing, and can accurately reflect the user's degree of preference for the resource.

Ku = {(ku1, ωu1), (ku2, ωu2), …, (kuN2, ωuN2)} is the user's keyword preference distribution vector, N2 is the number of keywords, kui (1 ≤ i ≤ N2) denotes the ith user-preferred keyword, and ωui is the weight of keyword kui; it is calculated from the keyword distribution vectors Kr of the academic resources of a given type on which user u has generated behavior over a period of time:

K′jr = sj * Kjr   (10)

The new keyword distribution vector of each resource is calculated according to formula (10), and the TOP-N2 entries among the new keyword distribution vectors of all resources are selected as the user keyword preference distribution vector Ku.

Lu is the user's LDA latent topic preference distribution vector; it is obtained from the LDA latent topic distribution vectors Lr = {lr1, lr2, lr3, …, lrN1} of the academic resources in the same way as Tu:

Lu = (1 / sum) * Σ_{j=1..sum} sj * Ljr   (11)
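For illustration only, the following Python sketch aggregates resource vectors into a user profile as in formulas (9) to (11); the behavior-coefficient values and the toy inputs are illustrative assumptions, not values from the patent.

# Sketch of formulas (9)-(11): behavior-weighted aggregation of resource vectors into a user profile.
# The behavior coefficients below are illustrative assumptions.
import numpy as np

BEHAVIOR_COEF = {"open": 0.2, "read": 0.5, "rate": 0.7, "share": 0.9, "collect": 1.0}

def user_preference_vector(resources):
    """resources: list of (behavior, vector) pairs for one resource type over a period."""
    total = len(resources)
    acc = sum(BEHAVIOR_COEF[b] * np.asarray(v, dtype=float) for b, v in resources)
    return acc / total                                   # Tu or Lu, per formulas (9)/(11)

def user_keyword_preferences(resources, top_n2=5):
    """Weight each resource's (keyword, weight) list by s_j, then keep the global TOP-N2 (formula (10))."""
    weighted = []
    for behavior, keywords in resources:
        s_j = BEHAVIOR_COEF[behavior]
        weighted += [(k, s_j * w) for k, w in keywords]
    return sorted(weighted, key=lambda kw: kw[1], reverse=True)[:top_n2]

T_u = user_preference_vector([("read", [0.6, 0.3, 0.1]), ("collect", [0.1, 0.8, 0.1])])
K_u = user_keyword_preferences([("read", [("LDA", 0.4), ("crawler", 0.3)]),
                                ("collect", [("recommendation", 0.5)])], top_n2=2)
print(T_u, K_u)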
The similarity between the user interest and the resource model is calculated as follows:
Academic resource model representation:
Mr = {Tr, Kr, Ct, Lr}   (12)
User interest model representation:
Mu = {Tu, Ku, Ct, Lu}   (13)
The similarity between the user discipline preference distribution vector Tu and the academic resource discipline distribution vector Tr is calculated by cosine similarity, namely:

SimT(Tu, Tr) = (Tu · Tr) / (‖Tu‖ · ‖Tr‖)   (14)

The similarity between the user LDA latent topic preference distribution vector Lu and the academic resource LDA latent topic distribution vector Lr is calculated by cosine similarity, namely:

SimL(Lu, Lr) = (Lu · Lr) / (‖Lu‖ · ‖Lr‖)   (15)

The similarity between the user keyword preference distribution vector Ku and the academic resource keyword distribution vector Kr is calculated with the Jaccard similarity:

SimK(Ku, Kr) = |Ku ∩ Kr| / |Ku ∪ Kr|   (16)

Then the similarity between the user interest model and the academic resource model is:

Sim(Mu, Mr) = σ · SimT(Tu, Tr) + ρ · SimL(Lu, Lr) + τ · SimK(Ku, Kr)   (17)

where σ + ρ + τ = 1, and the specific weight assignment is obtained by experimental training.
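For illustration only, a short Python sketch of formulas (14) to (17) follows; the weights sigma, rho and tau below are placeholders (the patent obtains them by experimental training), and the sample vectors are invented.

# Sketch of formulas (14)-(17); sigma/rho/tau are placeholder weights.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(keys_u, keys_r):
    u, r = set(keys_u), set(keys_r)
    return len(u & r) / len(u | r) if (u | r) else 0.0

def model_similarity(Tu, Tr, Lu, Lr, Ku, Kr, sigma=0.4, rho=0.3, tau=0.3):
    return sigma * cosine(Tu, Tr) + rho * cosine(Lu, Lr) + tau * jaccard(Ku, Kr)

sim = model_similarity(Tu=[0.5, 0.4, 0.1], Tr=[0.6, 0.3, 0.1],
                       Lu=[0.2, 0.8], Lr=[0.3, 0.7],
                       Ku=["LDA", "crawler"], Kr=["LDA", "recommendation"])
print(round(sim, 3))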
A Recommendation_degree concept is introduced: the higher the recommendation degree of an academic resource, the better the resource matches the user's interest preference and the higher its quality. The recommendation degree is calculated as:

Recommendation_degree = λ1 · Sim(Mu, Mr) + λ2 · Quality,  where λ1 + λ2 = 1   (18)
the secondary recommendation stage is to perform Top-N recommendation according to the recommendation degree of academic resources.
The web crawler comprises an addressing crawler, a topic crawler and an LDA topic model; the LDA topic model is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance and includes training corpora; the LDA topic model is trained with the training corpora according to a set topic number K, and, using the word-clustering effect of LDA training, the training corpora are aggregated into K topic-associated word sets according to the set topic number K, which form the K topic documents used by the topic crawler for the current crawl. On the basis of a common web crawler, the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ranking module. The topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponding to one academic topic, so that academic resources of the plurality of academic topics are obtained simultaneously. In each crawl, the topic determining module of the topic crawler determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page in combination with the content of the page, hyperlinks whose combined anchor-text and page topic similarity is smaller than a set threshold are discarded, and URLs whose combined topic similarity is larger than the set threshold are selected. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of the visited web pages, ordered by descending similarity; it visits the web pages of these URLs in that order, crawls the corresponding academic resources, and continuously classifies, tags and stores the crawled academic resources into the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model. This crawling process is repeated continuously, so that the topic-associated words gathered in each topic document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to an acceptable level.
The corpus also comprises verification corpora with definite categories, which are used for classifying and verifying the text classification model according to a preset category number A by using the verification corpora in advance so as to obtain the classification accuracy of the text classification model to each category in the A categories, and the classification accuracy serves as a classification credibility index of the text classification model to each category in the A categories; the accuracy rate is the ratio of correctly classified corpora in all verified corpora classified by the text classification model, and a classification accuracy rate threshold is preset; the text classification method for each text to be classified by using the text classification model specifically comprises the following steps:
Step one, each text to be classified is preprocessed, the preprocessing comprising word segmentation and stop-word removal while retaining proper nouns; the feature weights of all preprocessed words of the text are calculated respectively, where the feature weight value of a word is proportional to its frequency of occurrence in the text and inversely proportional to its frequency of occurrence in the training corpus; the calculated word set is sorted in descending order of feature weight, and the front portion of the original word set of each text to be classified is extracted as the feature word set;
Step two, using the text classification model and the original feature word set of each text to be classified, the probability that the text belongs to each of the predetermined A categories is calculated, and the category with the maximum probability value is selected as the classification category of the text;
Step three, the text classification result of step two is judged: if the classification accuracy value of the text classification model for that category reaches the set threshold, the result is output directly; if the classification accuracy value of the text classification model for that category does not reach the set threshold, proceed to step four;
Step four, each preprocessed text is input into the LDA topic model, which calculates a weight value for the text over each of the K set topics; the topic with the largest weight is selected, and the first Y words among the topic-associated words obtained for that topic from LDA training are added to the original feature word set of the text to form the expanded feature word set; the text classification model is then used again to calculate the probability that the text belongs to each of the A preset categories, and the category with the maximum probability value is selected as the final classification category of the text.
The main calculation formula of the text classification model is as follows:
P(cj | x1, x2, …, xn) = P(cj) · P(x1, x2, …, xn | cj) / P(x1, x2, …, xn)   (19)

where P(cj | x1, x2, …, xn) is the probability that the text belongs to category cj when the feature words (x1, x2, …, xn) occur together; P(cj) is the proportion of texts in the training text set that belong to class cj; P(x1, x2, …, xn | cj) is the probability that, if the text to be classified belongs to class cj, its feature word set is (x1, x2, …, xn); and P(x1, x2, …, xn) is the probability of the feature word set over all given classes.
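For illustration only, the following Python sketch outlines the classification flow of steps one to four around formula (19); the log-probability form, the accuracy table, and the expand_features helper are assumptions standing in for the LDA-based feature expansion, not the original implementation.

# Sketch of the classification flow (steps one to four); log-probabilities and the accuracy table are
# illustrative assumptions, and expand_features stands in for the LDA-based expansion of step four.
import math

def naive_bayes_classify(features, priors, likelihoods):
    """priors: {class: P(c)}; likelihoods: {class: {word: P(word|c)}}; formula (19) in log space."""
    scores = {}
    for c, p_c in priors.items():
        score = math.log(p_c)
        for w in features:
            score += math.log(likelihoods[c].get(w, 1e-6))   # small floor for unseen words
        scores[c] = score
    return max(scores, key=scores.get)

def classify_with_expansion(features, priors, likelihoods, accuracy, threshold, expand_features):
    c = naive_bayes_classify(features, priors, likelihoods)
    if accuracy.get(c, 0.0) >= threshold:                    # step three: trust the first result
        return c
    expanded = features + expand_features(features)          # step four: add the top-Y LDA topic words
    return naive_bayes_classify(expanded, priors, likelihoods)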
The resource recommendation service system for the multi-type academic resources has the following characteristics:
(1) the invention realizes the dynamic acquisition of various types of academic resources, such as academic papers, patents, academic conferences, academic news and the like, and efficiently acquires the target academic resources based on the topic crawler module.
(2) The invention realizes the theme classification work of various academic resources based on subject attributes.
(3) The attention degrees of different user groups to different types of academic resources are different, the multi-strategy academic resource recommendation method based on different user groups is realized, and various types of academic resources are recommended for users with different identities according to different proportions.
(4) Based on the browsing habits of the users, the invention realizes the personalized recommendation work of various academic resources based on different behaviors of the users.
According to the invention, the individual recommendation of academic resources is carried out according to the identity, interest and browsing behavior of the user, the academic resources can be recommended to the user more accurately, the working efficiency of scientific research personnel is greatly improved, a convenient and rapid information acquisition environment is created for scientific research workers to carry out scientific research better, and the contradiction between the information overload of the academic resources and the acquisition of the user resources is effectively solved.
In addition, the invention adopts an LDA-based academic resource acquisition method and classification method. It deeply mines discipline semantic information through the LDA topic model, builds a good guidance basis for the topic crawler of academic resources, integrates machine learning into the academic resource acquisition method, and improves the quality and efficiency of academic resource acquisition. The academic resources obtained by the topic crawler are used to update the LDA topics, so that the topic model can be updated at any time, follows the development trend of academia, and provides researchers with frontier resources in related fields. The text classification method based on selective feature expansion is suitable for complex application scenarios: it selectively adds topic information to data with little information while avoiding adding noise to data with sufficient information, provides an approach for optimizing text classification models, and has the characteristics of strong scenario adaptability, high result usability, and easy updating and maintenance of the classification model.
Drawings
FIG. 1 is a block diagram of an overall academic resource recommendation service system according to the present invention;
FIG. 2 is a schematic view of an LDA model;
FIG. 3 is a schematic diagram of a certain text before preprocessing;
FIG. 4 is a schematic diagram of a pre-processed text;
FIG. 5 is a schematic diagram of a topic and a topic document after a corpus is trained by an LDA topic model;
FIG. 6 is a flow chart illustrating an LDA-based academic resource acquisition method according to the present invention;
FIG. 7 is a flow chart illustrating a text classification method according to the present invention using LDA;
FIG. 8 is a graph showing recall ratios of three experiments in a part of disciplines;
FIG. 9 is a graph showing precision ratios of three experiments in part of the disciplines;
FIG. 10 is a schematic diagram of a preferred process of the present invention.
Detailed Description
The following describes the embodiments of the present invention in detail.
The academic resource recommendation service system comprises a web crawler, a text classification model and an academic resource database. The web crawler crawls academic resources on the Internet; the text classification model classifies the academic resources into predetermined A categories, after which they are stored in the local academic resource database; an open API of the academic resource database is provided for display and for invocation by a resource recommendation module. The academic resource recommendation service system further comprises an academic resource model, a resource quality value calculation model and a user interest model, and a tracking software module is embedded in the user's terminal to track and record the user's online browsing behavior. The attention degree of users with different identities to each type of academic resource is calculated from the historical browsing behavior data of different user groups; academic resources are modeled from four dimensions (resource type, discipline distribution, keyword distribution and LDA latent topic distribution); the user's interest preference is modeled by combining the user's interest disciplines and historical browsing behavior data; the similarity between the academic resource model and the user interest model is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is made for the user according to the recommendation degree. All first-level disciplines are organized into 75 discipline categories according to the discipline categories in the graduate discipline and specialty catalog of the Ministry of Education, i.e. the category number A is 75.
Acquisition of academic resources
The web crawler of the invention is mainly a topic crawler and also comprises a corresponding LDA topic model; the LDA topic model is a three-layer "document-topic-word" Bayesian generative model, as shown in FIG. 2. The LDA topic model is trained in advance with training corpora according to a set topic number K; each training corpus must be preprocessed before training, and the preprocessing comprises word segmentation and stop-word removal. Using the word-clustering effect of LDA training, the training corpora are aggregated into K topic-associated word sets according to the set topic number K; these sets are also called topic documents. When training the LDA topic model, the topic number K can be set between 50 and 200, preferably 100. Documents of various disciplines in various forms can be randomly crawled from the Internet as training corpora; for long but well-structured documents such as papers, only the abstract may be taken, and an existing database can also be used as the training corpus. The corpus should reach a considerable scale, from at least tens of thousands of documents up to millions. If the selected topic number K is 100, all words of the training corpus are aggregated into 100 topic-associated word sets, i.e. 100 topic documents, during the LDA training process. Each topic can be given a name manually according to the meaning of its clustered words, or left unnamed and referred to only by a number or code; 3 of the topic documents are shown in FIG. 5.
On the basis of the common web crawler, the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ranking module. The topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponding to one academic topic, so that academic resources of the plurality of academic topics are obtained simultaneously. In each crawl, the topic determining module of the topic crawler determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page in combination with the content of the page, hyperlinks whose combined anchor-text and page topic similarity is smaller than a set threshold are discarded, and URLs whose combined topic similarity is larger than the set threshold are selected. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of the visited web pages, ordered by descending similarity; it visits the web pages of these URLs in that order, crawls the corresponding academic resources, and continuously classifies, tags and stores the crawled academic resources into the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model. This crawling process is repeated continuously, so that the topic-associated words gathered in each topic document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to an acceptable level.
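For illustration only, the following Python sketch outlines such a similarity-ordered crawl loop; fetch(), extract_links() and topic_similarity() are hypothetical helpers standing in for page download, link extraction and the topic-similarity module, and are not part of the original disclosure.

# Sketch of the crawl loop; fetch(), extract_links() and topic_similarity() are hypothetical helpers.
import heapq

def crawl(seed_urls, topic_document, threshold, fetch, extract_links, topic_similarity):
    queue = [(-1.0, url) for url in seed_urls]          # max-heap via negated similarity
    heapq.heapify(queue)
    visited, results = set(), []
    while queue:
        _, url = heapq.heappop(queue)                   # most topic-relevant unvisited URL first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        results.append((url, page))                     # store / tag the crawled academic resource
        for link_url, anchor_text in extract_links(page):
            sim = topic_similarity(anchor_text, page, topic_document)
            if sim >= threshold and link_url not in visited:
                heapq.heappush(queue, (-sim, link_url)) # keep the unvisited queue ordered by similarity
    return results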
For ease of operation, the abstracts of academic resources can be used as the training corpus; topics and topic documents are obtained through the LDA topic model, and the topic documents guide the calculation of topic similarity during the crawling process of the topic crawler. The crawled contents are then stored in the database and serve as new corpora for training the LDA model, and an open API of the academic resource database is provided for display and invocation. The specific steps are as follows:
step one, downloading and preprocessing abstracts of academic resources in a plurality of existing fields, manually classifying the abstracts into different categories according to the academic fields, and respectively using the categories as training corpora of a plurality of subjects of LDA;
step two, inputting the LDA topic model parameters, which comprise K, α and β: the value of K is the topic number, the value of α represents the weight distribution of each topic before sampling, and the value of β represents the prior distribution of each topic over words. Training produces a number of topics and topic documents with more finely subdivided topics, and each topic document is used to guide a crawler;
step three, each crawler maintains a crawling URL queue from the selected high-quality seed URL, and updates the crawling URL queue according to similarity sequencing by continuously calculating the similarity between texts in the webpage and texts and topics pointed by anchor text links in the webpage, and captures webpage contents most relevant to the topics;
step four, after the academic resources acquired by the topic crawler are marked with corresponding topic labels, the academic resources are stored in a database and used as new language materials for training LDA (latent dirichlet allocation) for updating topic documents;
and step five, providing an open API of the academic resource database for display and calling.
The first step comprises the following specific sub-steps:
(a) corpus collection: downloading abstracts of academic resources in a plurality of existing fields as training corpora;
(b) text preprocessing: extracting abstract, Chinese word segmentation and removing stop words;
(c) classification into corpus: artificially classifying into different categories according to the academic field, and respectively using the categories as training corpora of a plurality of subjects of LDA.
The third step comprises the following specific sub-steps:
(a) the initial seed URL selects a better seed site facing a specific theme;
(b) extracting webpage content: downloading a page pointed by the URL with high priority, and extracting required content and URL information according to the HTML tag;
(c) analyzing and judging the relevance of the theme, and determining the acceptance or rejection of the page; the invention mainly adopts the combination of the existing VSM technology and the SSRM technology to calculate the subject correlation;
(d) sequencing the importance degree of the URL of the unvisited webpage;
(e) and (d) repeating the processes from (b) to (d) until the URL of the unvisited queue is empty.
In the substep (c), when the topic crawler crawls through each electronic document to analyze and judge the topic relevance, the generalized vector space model GVSSM combining two topic similarity calculation algorithms of VSM and SSRM is adopted to calculate the topic relevance of the crawled page and determine the choice of the page.
A topic is represented by a set of semantically related words together with weights indicating how strongly each word is related to the topic, i.e. topic Z = {(w1, p1), (w2, p2), …, (wn, pn)}, where the ith word wi is a word related to topic Z and pi is a measure of the degree of correlation of that word with Z. In LDA this is expressed as Z = {(w1, p(w1|zj)), (w2, p(w2|zj)), …, (wn, p(wn|zj))}, where wi ∈ W, p(wi|zj) is the probability that word wi is selected when the topic is zj, and zj is the jth topic.
The topic document generation process is a probability sampling process of a model and comprises the following specific sub-steps:
(a) for any document d in the corpus, generate the document length N, N ~ Poisson(ε), obeying a Poisson distribution;
(b) for any document d in the corpus, generate θ ~ Dirichlet(α), obeying a Dirichlet distribution;
(c) generation of the ith word wi in document d: first, a topic zj ~ Multinomial(θ) is generated, obeying a multinomial distribution; then, for topic zj, a word distribution φ_zj ~ Dirichlet(β) is generated, obeying a Dirichlet distribution; finally the word wi with the highest probability under p(wi | zj, β) is generated. The LDA model is shown in FIG. 2.
Where the value of α represents the weight distribution of the respective topic prior to sampling and the value of β represents the prior distribution of the respective topic to the word.
The distributions obeyed by all variables in the LDA model are as follows: θ ~ Dirichlet(α), zn ~ Multinomial(θ), φ ~ Dirichlet(β), wn ~ Multinomial(φ_zn), so that the joint distribution of a document is

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1..N} p(zn | θ) · p(wn | zn, β)

By integrating out the latent variables, the entire model actually becomes a distribution over the observed words w given the parameters: w refers to the words and is observable, z is the topic variable and is the target output of the model, and α and β can be seen as the initial parameters of the model. Integrating over the variables that exist therein gives

p(w | α, β) = ∫ p(θ | α) · ( Π_{n=1..N} Σ_{zn} p(zn | θ) · p(wn | zn, β) ) dθ

where N is the number of words and w denotes a word. For θ ~ Dirichlet(α), integrating out θ (and likewise φ) yields the collapsed Gibbs sampling formula

P(zi = j | z_-i, w) ∝ [ (n(wi, j) + β) / (n(·, j) + Wβ) ] · [ (n(di, j) + α) / (n(di, ·) + Kα) ]

where n(w, j) represents the number of times the feature word w is assigned to topic j, n(·, j) indicates the number of feature words assigned to topic j, n(d, j) represents the number of feature words in text d assigned to topic j, n(d, ·) represents the number of all feature words in text d that have been assigned a topic, W is the vocabulary size and K is the number of topics.
From the above, it can be seen that the variables that mainly influence LDA modeling are α, β and the topic number K. To select a better topic number, the values of α and β are first fixed, and the change in the value of the equation after integrating out the other variables is then observed.
When the LDA model is adopted to carry out theme modeling on the text set, the theme number K has great influence on the performance of the LDA model for fitting the text set, so the theme number needs to be preset. According to the method, the optimal theme number is determined by measuring the classification effect under different theme numbers and is compared with the classification effect when the Perplexity value is used for determining the model to be optimally fit, on one hand, the method can obtain the more visual and accurate optimal theme number, and on the other hand, the difference between the corresponding classification effect and the actual result can be found out through the optimal theme number determined by the Perplexity value. The Perplexity value formula is:
perplexity(D) = exp{ - Σ_{m=1..M} log P(dm) / Σ_{m=1..M} Nm }

where M is the number of texts in the text set, Nm is the length of the mth text, and P(dm) is the probability that the LDA model generates the mth text, given by:

P(dm) = Π_{n=1..Nm} Σ_z P(wn | z) · P(z | dm)
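For illustration only, the following Python sketch evaluates the perplexity formula directly; p_w_z and p_z_d are assumed to come from an already-trained LDA model and are not part of the original disclosure.

# Sketch of the perplexity formula; p_w_z[z][w] = P(w|z) and p_z_d[m][z] = P(z|d_m) are assumed
# to be taken from a trained LDA model with a given topic number K.
import math

def perplexity(docs, p_w_z, p_z_d):
    """docs: list of token lists; returns exp(-sum_m log P(d_m) / sum_m N_m)."""
    log_prob, n_words = 0.0, 0
    for m, doc in enumerate(docs):
        for w in doc:
            p_w = sum(p_z_d[m][z] * p_w_z[z].get(w, 1e-12) for z in range(len(p_w_z)))
            log_prob += math.log(p_w)
        n_words += len(doc)
    return math.exp(-log_prob / n_words)

# Choosing K: train one model per candidate K, keep the K with the lowest perplexity,
# then cross-check it against the classification effect as described above.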
the subject crawler of the invention is additionally provided with three modules on the basis of the general crawler: the system comprises a theme determining module, a similarity calculating module and a URL priority ordering module, so that filtering and theme matching of a crawled page are completed, and finally, contents highly related to the theme are obtained.
1. Topic determination module: before the topic crawler starts working, its set of related topic words needs to be determined, i.e. the topic document is established. The topic word set is usually determined in two ways: manually, or by extraction from the initial page set. Determining the topic word set manually makes the training and selection of keywords subjective, while keywords extracted from the initial pages have high noise and low coverage. The number of topic words is used as the dimension of the topic vector, and the corresponding weights are the component values of the topic vector. The topic word set vector is K = {k1, k2, …, kn}, where n is the number of topic words.
2. Similarity calculation module: to ensure that the web pages acquired by the crawler stay as close to the topic as possible, the web pages must be filtered, and web pages with low topic relevance (below a set threshold) are removed so that the links in those pages are not processed in the next crawl. When the topic relevance of a page is very low, the page is likely to contain some keywords only incidentally and its topic may have little to do with the specified topic, so processing its links is of little value; this is the fundamental difference between a topic crawler and a common crawler. A common crawler processes all links up to the set search depth and returns a large number of useless web pages, which further increases the workload. Using the whole text for similarity comparison is obviously not feasible; the text generally needs to be refined, extracted and converted into a data structure suitable for comparison and calculation, while embodying the topic of the text as much as possible. The feature selection adopted by a typical topic crawler is VSM, together with the TF-IDF algorithm. The method of this invention is based on HowNet semantic similarity calculation and obtains the similarity value between the whole article and the topic by calculating the similarity between the words of the document and the topic word document.
3. URL prioritization module: the URL prioritization module mainly screens, from the unvisited URLs, potential pages with high similarity to the topic and ranks them by similarity; the higher the similarity, the higher the priority, so that pages with high similarity are visited as early as possible, which guarantees the high topic relevance of the visited pages. When ranking the unvisited URLs, the similarity of the page where the URL is located and of the URL's anchor text (the text describing the URL) can be combined as influencing factors of the priority ranking.
The invention uses the semantic information of each word as defined in HowNet to calculate the similarity between words. In HowNet, for two words W1 and W2, suppose W1 has n concepts, denoted c11, c12, …, c1n, and W2 has m concepts, denoted c21, c22, …, c2m. The similarity between W1 and W2 is the maximum of the similarities between each concept c1i of W1 and each concept c2j of W2, expressed by the formula

Sim(W1, W2) = max_{i=1..n, j=1..m} Sim(c1i, c2j)

Therefore, the similarity between two words can be converted into a similarity calculation between concepts; and because all concepts in HowNet are ultimately expressed by sememes, the similarity calculation between concepts can in turn be attributed to the similarity calculation between the corresponding sememes. Suppose concept c1 and concept c2 have p and q sememes respectively, denoted s11, s12, …, s1p and s21, s22, …, s2q. The similarity between concept c1 and concept c2 is the maximum of the similarities between each sememe s1i of c1 and each sememe s2j of c2:

Sim(c1, c2) = max_{i=1..p, j=1..q} Sim(s1i, s2j)
all concepts in the book "Zhi Wang" are finally ascribed to the representation of the sememes, so the calculation of the similarity between concepts can also be ascribed to the calculation of the similarity between the corresponding sememes. Because all the sememes form a tree-like sememe hierarchy based on the upper and lower relations, sememe similarity can be calculated by using the semantic distance of the sememes in the sememe hierarchy to obtain the concept similarity [27 ]]. Assume that two sememes and the path distance in the sememe hierarchy is Dis(s)1,s2) Then, the similarity calculation formula of the sememe is:
Figure BDA00011760275900001311
wherein Dis(s)1,s2) Is s1And s2Path length in the semantic hierarchy, here using the semantic context, is a positive integer.
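For illustration only, the following Python sketch shows the word-to-concept-to-sememe maximum-similarity cascade described above; the tiny distance table and delta = 1.6 are illustrative assumptions, not values from the patent.

# Sketch of the HowNet-style similarity cascade; the distance table and delta are assumptions.
def sememe_similarity(s1, s2, dis, delta=1.6):
    return delta / (dis(s1, s2) + delta)                 # Sim(s1, s2) = delta / (Dis + delta)

def concept_similarity(c1_sememes, c2_sememes, dis):
    return max(sememe_similarity(a, b, dis) for a in c1_sememes for b in c2_sememes)

def word_similarity(w1_concepts, w2_concepts, dis):
    return max(concept_similarity(c1, c2, dis) for c1 in w1_concepts for c2 in w2_concepts)

DIST = {frozenset({"computer", "machine"}): 2, frozenset({"computer", "tool"}): 4}
dis = lambda a, b: 0 if a == b else DIST.get(frozenset({a, b}), 8)

print(word_similarity([["computer"]], [["machine"], ["tool"]], dis))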
The design of the topic crawler of the invention is based on the common crawler with further expanded functions. The whole process of handling a web page comprises the steps of determining the initial seed URLs, extracting web page content, analyzing topic relevance, and ranking URLs.
(a) And selecting a better seed site facing a specific theme from the initial seed URL, so that the theme crawler can smoothly perform crawling work.
(b) Extracting webpage content: and downloading the page pointed by the URL with high priority, and extracting the required content and the URL information according to the HTML label.
(c) Topic relevance analysis is the core module of a topic crawler, which determines the choice of pages. The invention mainly adopts a generalized vector space model GVSSM combining the existing VSM technology and the SSRM technology to calculate the topic relevance.
For topic relevance analysis, text keywords are extracted with TF-IDF, word weights are calculated, and relevance analysis is performed on the web page.
TF-IDF correlation calculation:
w_di = tf_i × idf_i = (f_i / f_max) × log(N / n_i)

where w_di is the weight of word i in document d, tf_i is the term frequency of word i, idf_i is the inverse document frequency of word i, f_i is the number of times word i appears in document d, f_max is the highest occurrence count among all words in document d, N is the total number of documents, and n_i is the number of documents containing word i. TF-IDF is still currently the most effective method for extracting keywords and calculating word weights.
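For illustration only, a short Python sketch of this TF-IDF weighting follows; the toy documents are invented.

# Sketch of the TF-IDF weighting defined above: w_di = (f_i / f_max) * log(N / n_i).
import math
from collections import Counter

def tfidf_weights(doc_tokens, all_docs):
    counts = Counter(doc_tokens)
    f_max = max(counts.values())
    N = len(all_docs)
    weights = {}
    for word, f_i in counts.items():
        n_i = sum(1 for d in all_docs if word in d)       # documents containing the word
        weights[word] = (f_i / f_max) * math.log(N / n_i)
    return weights

docs = [["topic", "crawler", "topic"], ["crawler", "url"], ["lda", "topic"]]
print(tfidf_weights(docs[0], docs))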
VSM topic relevance calculation:
Sim_VSM(d, t) = ( Σ_{i=1..n} w_di · w_ti ) / ( sqrt(Σ_{i=1..n} w_di²) · sqrt(Σ_{i=1..n} w_ti²) )

where (w_d1, …, w_dn) is the word vector of document d, (w_t1, …, w_tn) is the word vector of topic t, w_di and w_ti are the TF-IDF values of word i in document d and in topic t respectively, and n is the number of words common to document d and topic t. This algorithm considers only the frequency vectors of shared words when judging document similarity and does not consider semantic relations between words, such as near-synonyms and synonyms, which affects the accuracy of the similarity.
SSRM topic relevance calculation:
Sim_SSRM(d, t) = ( Σ_{i=1..n} Σ_{j=1..m} Sem_ij · w_di · w_tj ) / ( Σ_{i=1..n} Σ_{j=1..m} w_di · w_tj )

where w_di and w_tj are the TF-IDF values of word i in document d and word j in topic t, n and m are the numbers of words in document d and topic t respectively, and Sem_ij is the semantic similarity of word i and word j, calculated as

Sem(C1, C2) = 2 · Depth(C3) / ( Path(C1, C3) + Path(C2, C3) + 2 · Depth(C3) )

where C1 and C2 are two concepts, corresponding to word w1 and word w2; Sem(C1, C2) is the semantic similarity of concept C1 and concept C2; C3 is the lowest common concept shared by C1 and C2; Path(C1, C3) is the number of nodes on the path from C1 to C3; Path(C2, C3) is the number of nodes on the path from C2 to C3; and Depth(C3) is the number of nodes on the path from C3 to the root node. The SSRM algorithm considers only the semantic relation: if the words of two articles are all near-synonyms or synonyms, the document similarity would be calculated as 1, i.e. exactly the same, which is clearly not accurate enough.
The invention adopts a method of calculating the similarity by combining VSM and SSRM, also called the generalized vector space model, GVSSM for short. Its calculation formula combines the VSM similarity Sim_VSM(dk, t) and the SSRM similarity Sim_SSRM(dk, t) into a single topic similarity Sim(dk, t) between document dk and topic t. This topic similarity calculation method takes into account both the word-frequency factor of the document and the semantic relations between words, and the combination of VSM and SSRM effectively improves the accuracy of topic similarity calculation.
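For illustration only, the following Python sketch shows one plausible way to combine the two scores; the weighted average with mu = 0.5 and the toy semantic-similarity stand-in are assumptions, since the exact GVSSM combination formula is not reproduced here.

# Sketch of combining VSM and SSRM; the weighted average with mu = 0.5 is an assumed combination.
import math

def sim_vsm(wd, wt):
    shared = set(wd) & set(wt)
    num = sum(wd[w] * wt[w] for w in shared)
    den = math.sqrt(sum(v * v for v in wd.values())) * math.sqrt(sum(v * v for v in wt.values()))
    return num / den if den else 0.0

def sim_ssrm(wd, wt, sem):
    num = sum(sem(i, j) * wd[i] * wt[j] for i in wd for j in wt)
    den = sum(wd[i] * wt[j] for i in wd for j in wt)
    return num / den if den else 0.0

def sim_gvssm(wd, wt, sem, mu=0.5):
    return mu * sim_vsm(wd, wt) + (1 - mu) * sim_ssrm(wd, wt, sem)

sem = lambda a, b: 1.0 if a == b else 0.3                 # toy semantic similarity stand-in
print(round(sim_gvssm({"topic": 0.4, "crawler": 0.2}, {"topic": 0.5, "spider": 0.3}, sem), 3))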
(d) The importance of the URLs of the unvisited web pages is ranked. The following formula is used to rank the URLs:
priority(h) = (1/N) · Σ_{p=1..N} [ λ · Sim(f_p, t) + (1 - λ) · Sim(a_h, t) ]

where priority(h) is the priority value of the unvisited hyperlink h, N is the number of crawled pages containing h, Sim(f_p, t) is the topic similarity of the full text of web page p (which contains hyperlink h), Sim(a_h, t) is the topic similarity of the anchor text of hyperlink h, and λ is the weight adjusting the contributions of the full text and the anchor text. The similarity calculations in the formula also use the combined VSM and SSRM method; this optimizes the priority order of the un-crawled URL link queue and also effectively improves the accuracy of acquiring topic-related academic resources.
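For illustration only, a short Python sketch of the priority value as reconstructed above follows; lam = 0.6 and the sample similarities are illustrative assumptions.

# Sketch of the URL priority formula as reconstructed above; lam = 0.6 is an illustrative weight.
def url_priority(pages_containing_h, anchor_sim, lam=0.6):
    """pages_containing_h: list of full-text topic similarities Sim(f_p, t) of pages containing hyperlink h."""
    N = len(pages_containing_h)
    return sum(lam * full_sim + (1 - lam) * anchor_sim for full_sim in pages_containing_h) / N

print(round(url_priority([0.7, 0.5, 0.9], anchor_sim=0.8), 3))

The unvisited URL queue is then sorted by this priority value in descending order, as described above.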
Compared with the common web crawler, the topic crawler aims at grabbing the web page information related to the specific topic content, whether the web page is grabbed or not is judged by calculating the correlation degree of the web page and the topic, a URL queue to be crawled is maintained, and the web page is accessed according to the priority of the URL so as to ensure that the web page with high correlation degree is preferentially accessed.
The current topic crawlers have several drawbacks: (1) a topic crawler must determine its set of related topic words before working. The topic word set is usually determined in two ways, either manually or by analysis of the initial pages. The manual method has a certain subjectivity, and the method of extracting keywords from the initial pages is generally deficient in topic coverage; both traditional methods cause deviations when the topic crawler performs web page topic similarity calculation. (2) The core of current text-heuristic topic crawlers is page similarity calculation, which judges whether the currently crawled page is close to the topic. Apart from the accuracy of the topic determination module, the most important element is the similarity calculation algorithm. A VSM (vector space model) is usually adopted, which represents the text by word vectors on the assumption that different words are unrelated and calculates the similarity between documents through shared word frequencies; this algorithm often ignores the semantic relations between words, which lowers the similarity value of semantically related articles.
The design of the topic crawler of the invention is based on a general crawler with three added core modules: a topic determining module, a topic similarity calculating module and a to-be-crawled URL ranking module. To overcome the above drawbacks, the invention provides a topic crawler based on the LDA topic model, improves the topic similarity algorithm and the URL priority ranking algorithm, and improves the content quality and accuracy of the topic crawler both at the starting point of crawling and during the crawling process. The main contributions are: (1) using the LDA topic model to deeply mine the topic semantic information of the corpus, constructing a good guidance basis for the topic crawler, integrating machine learning into the resource acquisition method, and improving the accuracy and quality of resource acquisition. (2) In the topic similarity calculation module of the topic crawler, a HowNet-based semantic similarity calculation method is adopted to balance cosine similarity and semantic similarity, achieving a better topic matching effect.
Classification of academic resources
The invention adopts an LDA-based text classification method, as shown in figure 7. A Bayesian probability calculation model serves as the text classification model. A group of feature words that best embody the characteristics of the text to be classified is extracted as the feature word set input to the text classification model; the original feature word set is the front part of the original word set after sorting by feature weight. The text classification model calculates, for the feature word combination, the probability of belonging to each of the predetermined A categories, and the category with the maximum probability value is taken as the category of the text. All disciplines are divided into 75 discipline categories according to the discipline categories in the postgraduate discipline catalog of the Ministry of Education, that is, the category number A is 75. The LDA topic model described above and the 100 topic documents trained by it are used to assist the text classification model in text classification. The text classification model is verified in advance with a verification corpus of definite categories according to the predetermined category number A, to obtain the classification accuracy of the text classification model for each of the A categories, which serves as the classification credibility index of the text classification model for each of the A categories. The accuracy rate is the ratio of correctly classified corpora among all verification corpora classified into a given category by the text classification model, and a classification accuracy threshold is preset; during classification verification, a preset classification accuracy threshold of 80% is suitable. The text classification method applied to each text to be classified using the text classification model specifically comprises the following steps:
Step one, respectively calculating the feature weights of all preprocessed words of each text to be classified, wherein the feature weight value of a word is directly proportional to its frequency of occurrence in the text and inversely proportional to its frequency of occurrence in the training corpus; arranging the calculated word set in descending order of feature weight value, and extracting the front part of the original word set of each text to be classified as the feature word set of that text;
Step two, using the text classification model with the original feature word set of each text to be classified to respectively calculate the probability values of the text belonging to each of the predetermined A categories, and selecting the category with the maximum probability value as the classification category of the text;
Step three, judging the text classification result of step two: if the classification accuracy value of the text classification model for that category reaches the set threshold, directly outputting the result; if it does not reach the set threshold, proceeding to step four;
Step four, inputting each preprocessed text into the LDA topic model, calculating with the LDA topic model the weight value of each of the K set topics for the text, selecting the topic with the largest weight value, and adding the first Y words among the topic associated words under that topic (obtained after LDA topic model training) into the original feature word set of the text to form an expanded feature word set; then using the text classification model again to respectively calculate the probability values of the text belonging to each of the A predetermined categories, and selecting the category with the maximum probability value as the final classification category of the text. Specifically, Y may be 10 to 20 words; for example, the first 15 words among the topic associated words are added into the original feature word set to form the expanded feature word set, even if some newly added words duplicate the original feature words. A sketch of this selective expansion flow is given below.
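The selective expansion flow of steps one to four can be condensed as follows, assuming the trained Bayesian classifier, the per-class accuracy table from the verification corpus and the LDA model are available as callables; all names here are illustrative rather than part of the original implementation.

def classify_with_selective_expansion(feature_words, nb_classify, class_accuracy,
                                      lda_top_topic_words, threshold=0.8, Y=15):
    # nb_classify(words) -> (label, probability)
    # class_accuracy: classification accuracy per category from the verification corpus
    # lda_top_topic_words(words, Y) -> first Y associated words of the dominant LDA topic
    label, _ = nb_classify(feature_words)                             # step two
    if class_accuracy.get(label, 0.0) >= threshold:                   # step three
        return label
    expanded = feature_words + lda_top_topic_words(feature_words, Y)  # step four
    label, _ = nb_classify(expanded)
    return label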
The main calculation formula of the text classification model is as follows:
P(cj | x1, x2, ..., xn) = P(x1, x2, ..., xn | cj) · P(cj) / P(x1, x2, ..., xn)
wherein P(cj | x1, x2, ..., xn) represents the probability that the text belongs to category cj when the feature words (x1, x2, ..., xn) appear simultaneously; P(cj) represents the proportion of texts in the training text set that belong to category cj; P(x1, x2, ..., xn | cj) represents the probability that the feature word set of the text is (x1, x2, ..., xn) given that the text to be classified belongs to category cj; and the denominator P(x1, x2, ..., xn) represents the joint probability of the feature words.
Obviously, for a given set of classes the denominator P(x1, x2, ..., xn) is a constant, and the classification result of the model is the class with the highest probability in the above formula, so solving its maximum can be converted into solving the maximum of

P(x1, x2, ..., xn | cj) · P(cj)
According to the naive Bayes independence assumption, the attributes x1, x2, ..., xn of the text feature vector are independent and identically distributed, and their joint probability distribution equals the product of the probability distributions of the individual attribute features, namely:

P(x1, x2, ..., xn | cj) = Πi P(xi | cj)
Therefore, the classification formula becomes:

c = arg max over cj of P(cj) · Πi P(xi | cj)

which is the classification function used for classification.
The probability values P(cj) and P(xi | cj) in the classification function are unknown; therefore, in order to compute the maximum of the classification function, they are estimated as follows:

P(cj) = N(C = cj) / N
wherein N(C = cj) represents the number of samples in the training text that belong to category cj, and N represents the total number of training samples.
P(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + M)
wherein N(Xi = xi, C = cj) represents the number of training samples of category cj that contain attribute xi; N(C = cj) represents the number of training samples in category cj; and M represents the number of keywords in the training sample set after useless words are removed.
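A self-contained sketch of these estimates is given below. It counts, per category, the documents containing each feature word and applies the "+1 / +M" smoothing suggested by the definition of M; the exact smoothing in the original image-only formula is an assumption.

import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    # samples: list of (feature_words, label) pairs
    N = len(samples)
    class_docs = Counter(label for _, label in samples)          # N(C=cj)
    word_docs = defaultdict(Counter)   # documents of class cj containing word xi
    vocab = set()
    for words, label in samples:
        for w in set(words):
            word_docs[label][w] += 1
        vocab.update(words)
    M = len(vocab)                     # keywords after removing useless words

    def log_score(words, c):
        score = math.log(class_docs[c] / N)                      # log P(cj)
        for w in words:
            score += math.log((word_docs[c][w] + 1) / (class_docs[c] + M))
        return score

    def predict(words):
        return max(class_docs, key=lambda c: log_score(words, c))
    return predict

predict = train_naive_bayes([(["lda", "topic"], "computer science"),
                             (["gene", "cell"], "biology")])
print(predict(["topic", "model"]))    # -> "computer science"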
LDA is a statistical topic model for modeling discrete data sets proposed by Blei et al. in 2003, and is a three-layer 'document-topic-word' Bayesian generative model. The initial model introduced a hyperparameter only for the 'document-topic' probability distribution, making it obey the Dirichlet distribution; Griffiths et al. subsequently introduced a hyperparameter for the 'topic-word' probability distribution, making it obey the Dirichlet distribution as well. The LDA model is shown in fig. 2, wherein: N is the number of words of a document, M is the number of documents in the document set, K is the number of topics, φ is the topic-word probability distribution, θ is the document-topic probability distribution, Z is the hidden variable representing the topic, W is a word, α is the hyperparameter of θ, and β is the hyperparameter of φ.
The LDA topic model regards a document as a set of words with no order between them; a document can contain multiple topics, each word in the document is generated by a certain topic, and the same word can belong to different topics, so the LDA topic model is a typical bag-of-words model.
The key to training the LDA model is the inference of the hidden variable distributions, namely obtaining the hidden text-topic distribution θ and topic-word distribution φ of the target text. Given the model parameters α and β, the joint distribution of the random variables θ, z and w for a text d is:

p(θ, z, w | α, β) = p(θ | α) · Πn p(zn | θ) · p(wn | zn, β)
Because several latent variables appear in the above formula at the same time, directly computing θ and φ is intractable, so parameter estimation and inference are needed. The commonly used parameter estimation algorithms are Expectation Maximization (EM), variational Bayesian inference and Gibbs sampling. Gibbs sampling is used herein for model parameter inference; Griffiths pointed out that Gibbs sampling is superior to variational Bayesian inference and the EM algorithm in terms of perplexity value, training speed and the like. The local maximization of the likelihood function in the EM algorithm often leads the model to a locally optimal solution, and the model obtained by variational Bayesian inference deviates from the real situation, whereas Gibbs sampling can quickly and effectively extract topic information from large-scale data sets, which has made it the most popular LDA parameter estimation algorithm at present.
MCMC is a set of approximate iterative methods for drawing samples from complex probability distributions, and Gibbs sampling is a simple implementation of MCMC; its goal is to construct a Markov chain that converges to a specific distribution and to draw samples close to the target probability distribution from that chain. In the training process the algorithm only samples the topic variable zi, with the conditional probability calculation formula:

p(zi = k | z¬i, w) ∝ (n(k, wi) + β) / (n(k, ·) + Vβ) · (n(d, k) + α) / (n(d, ·) + Kα)

wherein the left side is the probability that the current word wi belongs to topic k given the topics of all other words; on the right side all counts exclude the current word (i.e. the corresponding count minus 1 when the current word is assigned to that topic): n(k, wi) is the number of times word wi is assigned to topic k, n(k, ·) is the total number of words assigned to topic k, n(d, k) is the number of words of document d assigned to topic k, n(d, ·) is the total number of words in document d, and V is the vocabulary size; the first factor is the probability of the word wi under topic k, and the second factor is the probability of topic k in the document.
The Gibbs sampling comprises the following specific steps:
1) Initialization: for each word wi, randomly assign a topic; zi is the topic of word i and is initialized to a random integer between 1 and K, where i runs from 1 to N and N is the total number of feature words in the text set. This is the initial state of the Markov chain;
2) i loops from 1 to N; the probability that the current word wi belongs to each topic is calculated according to the above conditional probability formula, and the topic of word wi is resampled according to these probabilities, yielding the next state of the Markov chain;
3) After iterating step 2) a sufficient number of times, the Markov chain is considered to have reached a steady state, at which point every word of every document has a specific topic. For each document, the text-topic distribution θ and the topic-word distribution φ can then be estimated according to the following formulas:

φ(k, w) = (n(k, w) + β) / (n(k, ·) + Vβ)

θ(d, k) = (n(d, k) + α) / (n(d, ·) + Kα)

wherein n(k, w) represents the number of times the feature word w is assigned to topic k, n(k, ·) represents the number of feature words assigned to topic k, n(d, k) represents the number of feature words in text d assigned to topic k, and n(d, ·) represents the number of all feature words in text d that have been assigned a topic.
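The sampling and estimation steps above can be condensed into the following self-contained collapsed Gibbs sampler, with symmetric priors α and β and documents given as lists of integer word ids over a vocabulary of size V; this is an illustrative sketch, not the patented implementation.

import random

def lda_gibbs(docs, K, V, alpha=0.5, beta=0.1, iters=200, seed=0):
    rng = random.Random(seed)
    n_kw = [[0] * V for _ in range(K)]    # times word w is assigned to topic k
    n_k = [0] * K                         # words assigned to topic k
    n_dk = [[0] * K for _ in docs]        # words of document d assigned to topic k
    z = []                                # topic assignment of every word token
    for d, doc in enumerate(docs):        # step 1: random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            n_kw[k][w] += 1; n_k[k] += 1; n_dk[d][k] += 1
        z.append(zd)
    for _ in range(iters):                # step 2: resample each word's topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_kw[k][w] -= 1; n_k[k] -= 1; n_dk[d][k] -= 1
                weights = [(n_kw[t][w] + beta) / (n_k[t] + V * beta) *
                           (n_dk[d][t] + alpha) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                n_kw[k][w] += 1; n_k[k] += 1; n_dk[d][k] += 1
    # step 3: estimate phi (topic-word) and theta (document-topic)
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
           for k in range(K)]
    theta = [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
             for d in range(len(docs))]
    return theta, phi

theta, phi = lda_gibbs([[0, 1, 1, 2], [2, 3, 3, 4]], K=2, V=5, iters=50)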
The classification accuracy used as the reliability index of the text classification model is calculated as a proportion, with the specific formula:
Accuracy_i = Ni / Mi
wherein i denotes a category, Ni denotes the number of times the classifier correctly predicts category i, and Mi denotes the total number of times the classifier predicts category i.
The precision P, the recall R and their comprehensive evaluation index F1 can be adopted as the final evaluation indexes. The precision P measures the proportion of samples judged to be a given category that truly belong to that category, and the recall R measures the proportion of samples of that category that are correctly judged to belong to it. Taking a certain category Ci as an example, n++ denotes the number of samples correctly judged to belong to category Ci, n+- denotes the number of samples that do not belong to but are judged as category Ci, and n-+ denotes the number of samples that belong to but are judged not to belong to category Ci. For category Ci, the recall R, precision P and comprehensive index F1 are:
R = n++ / (n++ + n-+),  P = n++ / (n++ + n+-),  F1 = 2PR / (P + R)
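In code form, the per-class metrics reduce to a few lines; the argument names mirror the counts n++, n+- and n-+ defined above, and the function name is illustrative.

def per_class_metrics(n_pp, n_pm, n_mp):
    # n_pp: samples correctly judged as class Ci
    # n_pm: samples judged as Ci that do not belong to Ci
    # n_mp: samples of Ci judged as some other class
    recall = n_pp / (n_pp + n_mp)
    precision = n_pp / (n_pp + n_pm)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1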
the inventors have performed three sets of experiments: performing a first experiment, namely performing a classifier performance test based on an original feature set; performing a classifier performance test based on the expanded feature set; and thirdly, performing a classifier performance test based on the feature set after the selective feature expansion, wherein the reliability threshold is set to be 0.8. Table 2 shows recall and precision for three experiments in a part of disciplines:
TABLE 2 recall and precision of partial disciplines
Figure BDA0001176027590000199
Figure BDA0001176027590000201
As can be seen from Table 2, when the experiment is based on the original feature set, the recall of the history discipline is high while its precision is low, indicating that the classifier assigns to history a good deal of data that does not belong to it; meanwhile the history of science and technology discipline shows low recall, indicating that much of its data is classified into other disciplines. Because the topics of these two disciplines are very similar, the classifier very likely classifies much data belonging to the history of science and technology as history. Similar situations also occur between the geological resources and geological engineering discipline and the geology discipline. The expanded feature set improves this problem, but it also affects disciplines that previously had high recognition. Selective feature expansion, on the one hand, avoids affecting disciplines with high recognition and, on the other hand, improves to a certain extent the disciplines whose recognition was low due to insufficient information.
From the above experimental results, the average recall, average precision and average F1 value of each of the three experiments can be calculated. The results are as follows:
TABLE 3 comparison of the experiments
Figure BDA0001176027590000202
As can be seen from Table 3, in complex classification scenarios the selective feature expansion method of the invention adapts better than the methods based on the original feature set or the fully expanded feature set; its average recall, average precision and average F1 value are obviously higher than those of the other schemes, so a better practical effect can be achieved.
FIG. 6 is a graph showing recall ratios of three experiments in a part of disciplines; FIG. 7 is a graph showing the precision of three experiments in a part of the disciplines.
With the advent of the big data era, resource classification faces more and more challenges; different application scenarios require different classification techniques, and no single technique suits all classification tasks. The selective feature expansion method is suitable for complex application scenarios: it selectively adds topic information to data with little information while avoiding adding noise to data with sufficient information, and thus has broad adaptability.
Figure BDA0001176027590000211
Recommendation of academic resources
The process of recommending corresponding academic resources to the user comprises a cold start recommendation stage and a secondary recommendation stage. The cold start recommendation stage recommends, based on the user's interest disciplines, high-quality resources that match those disciplines; high-quality resources are the academic resources whose resource quality values, calculated by the resource quality value calculation model, compare highly, the resource quality value being the arithmetic mean or weighted mean of resource authority, resource community popularity and resource recentness. In the secondary recommendation stage, the user interest model and the resource model are modeled respectively, the similarity between them is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally Top-N academic resource recommendation is performed for the user according to the recommendation degree.
1. Cold start phase recommendation algorithm:
TABLE 4 Properties and metrics for five broad classes of resources
High-quality academic resources can attract and retain new users. During the cold start phase, the aim is to recommend to the user high-quality resources that match his or her disciplines of interest. The quality value is measured mainly through attributes such as authority, community popularity and recentness. The attributes and metrics of the five major classes of resources are shown in Table 4.
The formula for calculating the Authority of the paper is as follows:
Figure BDA0001176027590000212
Level is the quantified score of the paper's publication level. Journal grades are divided into 5 levels with scores of 1, 0.8, 0.6, 0.4 and 0.2 in order: top journals or conferences such as Nature and Science are scored 1, second-level venues such as ACM Transactions are scored 0.8, and the lowest level is scored 0.2. The calculation formula for Cite is as follows:
Cite=Cites/maxCite. (2)
Cite is the quantized result of the paper's citation count, Cites is the number of citations of the paper, and maxCite is the largest citation count in the paper's source database.
The authority calculation of the other four types of resources is similar to the thesis, except that the quantification method is different.
The formula for calculating the community Popularity of the paper is as follows:
Popularity=readTimes/maxReadTimes. (3)
readTimes is the number of reads of the paper, maxReadTimes is the maximum number of reads in the source database of the paper.
The recentness Recentness of all resources is calculated in the same way, with the following formula:
Figure BDA0001176027590000221
year and month are the year and month of publication of the resource, respectively. minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months for all resources in the source database for that type of resource.
The paper Quality value Quality calculation method is as follows:
Figure BDA0001176027590000222
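As an illustration, the sketch below computes the four quantities for a paper, taking Authority and Quality as simple arithmetic means of their components, which is one of the combinations the description allows ("arithmetic mean or weighted mean"); the exact weights in the image-only formulas may differ, and the sample inputs are invented.

def authority(level, cites, max_cite):
    cite = cites / max_cite if max_cite else 0.0     # Cite = Cites / maxCite
    return (level + cite) / 2                        # assumed equal weighting

def popularity(read_times, max_read_times):
    return read_times / max_read_times if max_read_times else 0.0

def recentness(year, month, min_year, min_month, max_year, max_month):
    months = (year - min_year) * 12 + (month - min_month)
    span = (max_year - min_year) * 12 + (max_month - min_month)
    return months / span if span else 1.0

def quality(auth, pop, recent):
    return (auth + pop + recent) / 3                 # assumed arithmetic mean

q = quality(authority(0.8, 120, 600), popularity(300, 1500),
            recentness(2016, 6, 2000, 1, 2016, 12))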
2. Algorithm of the secondary recommendation stage:
In this stage, a recommendation method integrating user behaviors and resource content is adopted: the user interest model and the resource model are modeled respectively, their similarity is calculated, the recommendation degree is calculated in combination with the resource quality value, and finally recommendation is performed according to the recommendation degree.
The academic resource model is represented as follows:
Mr={Tr,Kr,Ct,Lr} (6)
wherein Tr is the discipline distribution vector of the academic resource, i.e. the probability values of the academic resource over the 75 disciplines, obtained by a Bayesian multinomial model.
Kr = {(kr1, ωr1), (kr2, ωr2), ..., (krm, ωrm)}, where m is the number of keywords, kri (1 ≤ i ≤ m) denotes the ith keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm whose calculation formula is as follows:
Figure BDA0001176027590000223
w(i, r) represents the weight of the ith keyword in document r, tf(i, r) represents the frequency of the ith keyword in document r, Z represents the total length of the document set, and L represents the number of documents containing keyword i.
Lr is the LDA latent topic distribution vector, Lr = {lr1, lr2, lr3, ..., lrN1}, where N1 is the number of latent topics.
Ct is the resource type; t can take the values 1, 2, 3, 4, 5, corresponding to the five major classes of academic resources: academic papers, academic patents, academic news, academic conferences and academic books.
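A compact data-structure view of Mr follows, with a keyword weight of the form tf(i, r) · log(Z / L) built from the symbols defined above; the precise "improved tf-idf" formula is shown only as an image, so this weighting is an assumption, and the class and function names are illustrative.

import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ResourceModel:
    Tr: List[float]        # discipline distribution over the 75 categories
    Kr: Dict[str, float]   # keyword -> weight
    Ct: int                # resource type, 1..5
    Lr: List[float]        # LDA latent topic distribution over N1 topics

def keyword_weight(tf_ir, Z, L):
    # tf_ir: frequency of keyword i in document r
    # Z: total length of the document set; L: documents containing keyword i
    return tf_ir * math.log(Z / L)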
According to the behavior characteristics of users of mobile software, the operation behaviors of a user on an academic resource are divided into opening, reading, star-level evaluation, sharing and collection, wherein star-level evaluation is an explicit behavior and the other behaviors are implicit behaviors. Explicit behavior can clearly reflect the degree of the user's interest preference: for star-level evaluation, the higher the score, the more the user likes the resource. Implicit behavior, although it cannot clearly reflect user interest preference, often implies a larger amount of information and value than explicit feedback.
The user interest model is based primarily on the user's background and the academic resources that have been browsed. According to different browsing behaviors of the user, a user interest model can be constructed by combining an academic resource model, and the model is dynamically adjusted along with the change of the user interest. The user interest model is represented as follows:
Mu={Tu,Ku,Ct,Lu} (8)
wherein Tu is the user discipline preference distribution vector formed, after user behaviors, from the discipline distribution vectors Tr of the academic resources of a certain class browsed by the user within a period of time, i.e.

Tu = (1 / sum) · Σj sj · Tjr    (9)

wherein sum is the total number of academic resources on which the user has produced behaviors, sj is the 'behavior coefficient' after the user acts on academic resource j, a larger value indicating that the user prefers the resource, and Tjr is the discipline distribution vector of the jth resource. The calculation of sj comprehensively considers opening, reading, evaluating, collecting, sharing and other behaviors, and can accurately reflect the user's degree of preference for the resource.
Ku = {(ku1, ωu1), (ku2, ωu2), ..., (kuN2, ωuN2)} is the user keyword preference distribution vector, where N2 is the number of keywords, kui (1 ≤ i ≤ N2) denotes the ith user-preferred keyword, and ωui is the weight of keyword kui, calculated from the keyword distribution vectors Kr of all academic resources on which user u has produced behaviors within a period of time:

K'jr = sj · Kjr    (10)

A new keyword distribution vector is calculated for each academic resource according to formula (10), and the TOP-N2 keywords among the new keyword distribution vectors of all the resources are selected as the user keyword preference distribution vector Ku.
Lu is the user's LDA latent topic preference distribution vector, calculated from the LDA latent topic distribution vectors Lr = {lr1, lr2, lr3, ..., lrN1} of the academic resources in the same way as Tu.
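Reusing the ResourceModel sketch above, the user-side vectors can be assembled as follows. The behavior-weighted averaging of Tr and Lr and the TOP-N2 keyword selection follow the description; keeping the largest re-weighted value per keyword across resources is an assumption about how the TOP-N2 selection is aggregated.

def build_user_preferences(resources, s, N2=20):
    # resources: ResourceModel instances the user acted on; s: behavior coefficients s_j
    total = len(resources)
    Tu = [sum(s[j] * resources[j].Tr[d] for j in range(total)) / total
          for d in range(len(resources[0].Tr))]
    Lu = [sum(s[j] * resources[j].Lr[d] for j in range(total)) / total
          for d in range(len(resources[0].Lr))]
    reweighted = {}                                   # K'_jr = s_j * K_jr
    for j, r in enumerate(resources):
        for word, weight in r.Kr.items():
            reweighted[word] = max(reweighted.get(word, 0.0), s[j] * weight)
    Ku = dict(sorted(reweighted.items(), key=lambda kv: kv[1], reverse=True)[:N2])
    return Tu, Ku, Lu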
Figure BDA0001176027590000232
Calculation of the behavior coefficient: s denotes the behavior coefficient, T is the reading time threshold, and δ is an adjustment parameter. The reading time threshold is introduced to filter out accidental clicks, so its value is small. If the time the user spends reading resource j is less than the threshold T, it is regarded as a false click and s = 0. If the user is willing to spend a longer time reading, i.e. the reading time is greater than or equal to T, then: if the user gives an evaluation whose value is greater than the mean of all previous evaluations, the user is considered to like j and s is increased by δ; if the user collects or shares j, indicating that the user likes j very much, s is increased by δ. The invention considers that reading, evaluating, collecting and sharing reflect user interest preference from shallow to deep. The value of s depends mainly on the initial value and the adjustment parameter δ; to map all user behaviors to a value between 0 and 2, the initial value is set to 1 and the adjustment parameter δ to 0.333333.
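Read literally, the prose above maps to the small function below, where the base value 1 and δ = 0.333333 bound s to [0, 2]; the default reading-time threshold T and the treatment of collecting and sharing as separate δ increments are assumptions, since the formula itself appears only as an image in the filing.

def behavior_coefficient(read_seconds, T=10, rating=None, mean_rating=None,
                         collected=False, shared=False, delta=0.333333):
    if read_seconds < T:                     # treated as a false click
        return 0.0
    s = 1.0                                  # base value for a genuine read
    if rating is not None and mean_rating is not None and rating > mean_rating:
        s += delta                           # explicit positive evaluation
    if collected:
        s += delta
    if shared:
        s += delta
    return s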
Similarity calculation between the academic resource model and the user interest model:
academic resource model representation:
Mr={Tr,Kr,Ct,Lr} (12)
user interest model representation:
Mu={Tu,Ku,Ct,Lu} (13)
The similarity between the user discipline preference distribution vector Tu and the discipline distribution vector Tr of an academic resource is calculated by cosine similarity, namely:

Sim(Tu, Tr) = (Tu · Tr) / (||Tu|| · ||Tr||)    (14)

The similarity between the user's LDA latent topic preference distribution vector Lu and the LDA latent topic distribution vector Lr of an academic resource is calculated by cosine similarity, namely:

Sim(Lu, Lr) = (Lu · Lr) / (||Lu|| · ||Lr||)    (15)

The similarity between the user keyword preference distribution vector Ku and the academic resource keyword distribution vector Kr is calculated by Jaccard similarity:

Sim(Ku, Kr) = |Ku ∩ Kr| / |Ku ∪ Kr|    (16)
then the similarity between the user interest model and the academic resource model is as follows:
Sim(Mu, Mr) = σ·Sim(Tu, Tr) + ρ·Sim(Lu, Lr) + τ·Sim(Ku, Kr)    (17)

where σ + ρ + τ = 1, and the specific weight assignment is obtained by experimental training.
In order to recommend high-quality resources that interest the user, the concept of Recommendation_degree is introduced: the higher the recommendation degree of an academic resource, the better it matches the user's interest preference and the higher its quality. The recommendation degree calculation formula is as follows:
Recommendation_degree = λ1·Sim(Mu, Mr) + λ2·Quality, where λ1 + λ2 = 1    (18)
the secondary recommendation stage is to perform Top-N recommendation according to the recommendation degree of academic resources.
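Putting the secondary-recommendation formulas together, the sketch below fuses the three similarities with weights σ, ρ and τ and combines the result with the quality value to rank resources; the weight values shown are placeholders, since the patent states they are obtained by experimental training, and the dictionary keys are illustrative.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def model_similarity(user, resource, sigma=0.4, rho=0.3, tau=0.3):
    # user: dict with Tu, Lu, Ku; resource: dict with Tr, Lr, Kr
    return (sigma * cosine(user["Tu"], resource["Tr"]) +
            rho * cosine(user["Lu"], resource["Lr"]) +
            tau * jaccard(user["Ku"], resource["Kr"]))

def recommend(user, resources, qualities, lam1=0.7, lam2=0.3, top_n=10):
    scored = [(lam1 * model_similarity(user, r) + lam2 * q, r)
              for r, q in zip(resources, qualities)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]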
The whole recommendation process is shown in fig. 10. As can be seen from the figure, the recommendation flow of the whole system includes three parts: resource model construction, cold start stage recommendation and secondary recommendation. The steps are as follows:
the construction process of the resource model comprises the following steps:
1) acquiring five types of academic resource data through a web crawler and a data interface technology;
2) analyzing and extracting the relevant information of each academic resource, and inserting the information into a resource library;
3) preprocessing each piece of data in the resource library, including word segmentation and stop-word removal;
4) calculating subject distribution, keyword distribution and LDA potential subject distribution of each resource through the trained three models, wherein the three models are a Bayesian polynomial model, a VSM model and an LDA model respectively;
5) obtaining the discipline categories of each resource according to its discipline distribution vector, the discipline categories being the 3 disciplines with the highest probability in the discipline distribution vector;
6) calculating a quality value for each resource;
7) the discipline distribution vector, the keyword distribution vector, the LDA latent topic distribution vector, the discipline category, and the quality value are inserted into the repository.
Recommendation process of cold start phase:
1) selecting academic resources that match the user's interest disciplines;
2) recommending high-quality resources according to the quality values of the academic resources.
The recommendation process of the secondary recommendation stage is as follows:
1) obtaining a browsing record of a user, and calculating a 'behavior coefficient';
2) constructing a user interest model;
3) calculating the similarity between the resource model and the user interest model;
4) calculating recommendation degree according to the similarity and the quality value;
5) performing Top-N recommendation according to the recommendation degree of the resources.
In order to facilitate subsequent calculation, a resource model is built in advance, and when a user uses the system for the first time, academic resources are recommended to the user by adopting a recommendation strategy in a cold start stage; and after the behavior data of the user reaches a certain amount, recommending academic resources for the user by adopting a secondary recommendation strategy.
The invention provides a corresponding recommendation strategy mainly according to the continuous accumulation and change of academic resources and user data. Recommending high-quality resources which accord with the interest disciplines for the user in the cold starting stage; in the secondary recommendation stage, modeling is carried out on various academic resources from four dimensions including resource types, subject distribution, keyword distribution and LDA potential subject distribution, modeling is carried out on user interest preference according to user behaviors, and finally Top-N recommendation is carried out according to resource recommendation degree.
Experimental results show that the academic resource recommendation strategy adopted by the invention can fully satisfy the user's interest disciplines and has an obvious effect on improving the CTR of resources; for the secondary recommendation stage, the experimental results show that the recommendation strategy under the modeling method adopted by the invention is obviously higher in precision than the recommendation strategies under the two currently common resource modeling modes.

Claims (8)

1. An academic resource recommendation service system is characterized in that the academic resource recommendation service system comprises a web crawler, a text classification model and a local academic resource database to be recommended, and the academic resource recommendation service system is used for crawling academic resources on the internet by the web crawler; calculating the attention degree of users with different identities to various types of academic resources based on historical browsing behavior data of users of different groups, modeling the academic resources from four dimensions including resource types, subject distribution, keyword distribution and LDA potential subject distribution, modeling a user interest model by combining interest subjects of the users and the historical browsing behavior data, calculating the similarity between the academic resource model and the user interest model, calculating the recommendation degree by combining a resource quality value, and finally recommending the academic resources Top-N for the users according to the recommendation degree; the network crawler is a topic crawler and is provided with an LDA topic model, the LDA topic model is a three-layer Bayes generation model of 'document-topic-word', a corpus is configured for the LDA topic model in advance, the corpus comprises training corpuses, the LDA topic model is trained by the training corpuses according to a set topic number K, a word clustering function during training of the LDA topic model is utilized, after the training corpuses are trained by the LDA topic model, K topic associated word sets respectively aggregated according to the set topic number K are obtained, and then K topic documents of the topic crawler crawling at the time are obtained; the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ordering module on the basis of the common web crawler; the topic crawlers are a plurality of distributed crawlers distributed according to the number of academic topics, each distributed crawler corresponds to one academic topic, and each distributed crawler simultaneously obtains academic resources of the plurality of academic topics; in each crawling process of the topic crawler, a topic determining module of the topic crawler determines a target topic and a topic document thereof, the topic document is used for guiding the calculation of topic similarity, a similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page and combines the content of the page, hyperlinks of which the topic similarity of the anchor text combined with the page is smaller than a set threshold are removed, URLs of which the topic similarity of the anchor text combined with the page is larger than the set threshold are selected, the topic crawler maintains a URL queue of unvisited webpages pointed by the hyperlinks of the visited webpages, the URL queue is arranged according to the descending order of the similarity, the topic crawler visits the webpages of all URLs successively according to the arrangement order of the URL queue, crawls corresponding academic resources, continuously classifies tags of the crawled academic resources and stores the tags into a database, and aims at the crawled topic document, until the URL of the non-access queue is empty; the academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model; and continuously 
repeating the crawling process of the theme crawler, so that the theme associated words collected by each theme document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to a degree of human approval.
2. The academic resource recommendation service system according to claim 1, wherein the corpus further comprises a verification corpus with definite categories, which is used for classifying and verifying the text classification model according to a predetermined category number a by using the verification corpus in advance, so as to obtain classification accuracy of the text classification model to each of the a categories, which is used as an index of classification credibility of the text classification model to each of the a categories; the accuracy rate is the ratio of correctly classified corpora in all the verification corpora classified into a certain category by the text classification model, and a classification accuracy rate threshold is preset.
3. The academic resource recommendation service system of claim 2, wherein all the disciplines are divided into 75 discipline categories, that is, the category number a is 75 categories, the number K of topics is set to 100 during training using the LDA topic model, and the preset classification accuracy threshold is 80% during the classification verification of the text classification model.
4. A method for providing academic resource recommendation service for related users by a resource recommendation service system is characterized in that the academic resources are classified according to predetermined A categories and then stored by using a text classification model to form an academic resource database, an open API of the academic resource database is provided for display and resource recommendation module calling, and a tracking software module is cloned at a user terminal by using the academic resource model, a resource quality value calculation model and a user interest model and used for tracking and recording the online browsing behavior of the user; the process of recommending the corresponding academic resources to the user comprises a cold start recommending stage and a secondary recommending stage, wherein the cold start recommending stage recommends high-quality resources which accord with the interest disciplines for the user based on the interest disciplines, the high-quality resources are the academic resources with high resource quality values obtained by comparison after calculation by a resource quality value calculation model, and the resource quality values are the arithmetic mean or weighted mean of the resource authority, the resource community heat and the resource time-freshness; in the secondary recommendation stage, modeling is respectively carried out on a user interest model and a resource model, the similarity between the user interest model and the resource model is calculated, the recommendation degree is calculated by combining the resource quality value, and finally, academic resource Top-N recommendation is carried out on the user according to the recommendation degree; the network crawler is a topic crawler and is provided with an LDA topic model, the LDA topic model is a three-layer Bayes generation model of 'document-topic-word', a corpus is configured for the LDA topic model in advance, the corpus comprises training corpuses, the LDA topic model is trained by the training corpuses according to a set topic number K, a word clustering function during training of the LDA topic model is utilized, after the training corpuses are trained by the LDA topic model, K topic associated word sets respectively aggregated according to the set topic number K are obtained, and then K topic documents of the topic crawler crawling at the time are obtained; the topic crawler further comprises a topic determining module, a similarity calculating module and a URL priority ordering module on the basis of the common web crawler; the topic crawlers are a plurality of distributed crawlers distributed according to the number of academic topics, each distributed crawler corresponds to one academic topic, and each distributed crawler simultaneously obtains academic resources of the plurality of academic topics; in each crawling process of the topic crawler, a topic determining module of the topic crawler determines a target topic and a topic document thereof, the topic document is used for guiding the calculation of topic similarity, a similarity calculating module calculates and judges the topic similarity of each anchor text on a crawled page and combines the content of the page, hyperlinks of which the topic similarity of the anchor text combined with the page is smaller than a set threshold are removed, URLs of which the topic similarity of the anchor text combined with the page is larger than the set threshold are selected, the topic crawler maintains a URL 
queue of unvisited webpages pointed by the hyperlinks of the visited webpages, the URL queue is arranged according to the descending order of the similarity, the topic crawler visits the webpages of all URLs successively according to the arrangement order of the URL queue, crawls corresponding academic resources, continuously classifies tags of the crawled academic resources and stores the tags into a database, and aims at the crawled topic document, until the URL of the non-access queue is empty; the academic resources crawled by the topic crawler each time are used as new corpora for training the LDA topic model; and continuously repeating the crawling process of the theme crawler, so that the theme associated words collected by each theme document are continuously supplemented and updated, and the crawled academic resources are continuously supplemented and updated to a degree of human approval.
5. The method of claim 4, wherein the resource Quality value Quality calculation includes a formula for Authority of the resource as follows:
Figure FDA0002274819940000031
wherein Level is the quantified score of the publication level of the resource; the publication level is divided into 5 grades with scores of 1, 0.8, 0.6, 0.4 and 0.2 in sequence: top journals or conferences such as Nature and Science are scored 1, second-level venues such as ACM Transactions are scored 0.8, and the lowest level is scored 0.2; the calculation formula for Cite is as follows:
Cite=Cites/maxCite (2)
Cite is the quantized result of the resource citation count, Cites is the citation count of the resource, and maxCite is the largest citation count in the resource database;
the calculation formula of the resource community heat degree Popularity is as follows:
Popularity =readTimes/maxReadTimes (3)
readTimes is the number of times the resource has been read, and maxReadTimes is the maximum number of reads in the source database of the resource;
the recentness Recentness of all resources is calculated in the same way, with the following formula:
Figure FDA0002274819940000032
year and month are the year and month of publication of the resource, respectively; minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months of all resources in the source database for that type of resource;
the resource Quality value Quality calculation method is as follows:
Figure FDA0002274819940000033
6. the method of claim 4, wherein the academic resource model is represented as follows:
Mr={Tr,Kr,Ct,Lr} (6)
wherein Tr is the discipline distribution vector of the academic resource, i.e. the probability values of the academic resource over the A discipline categories, obtained by a Bayesian multinomial model;
Kr = {(kr1, ωr1), (kr2, ωr2), ..., (krm, ωrm)}, where m is the number of keywords, kri (1 ≤ i ≤ m) denotes the ith keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm whose calculation formula is as follows:
Figure FDA0002274819940000041
w(i, r) represents the weight of the ith keyword in document r, tf(i, r) represents the frequency of the ith keyword in document r, Z represents the total length of the document set, and L represents the number of documents containing keyword i; Lr is the latent topic distribution vector, Lr = {lr1, lr2, lr3, ..., lrN1}, where N1 is the number of latent topics; Ct is the resource type, and t can take the values 1, 2, 3, 4, 5, corresponding to the five major classes of academic resources: papers, patents, news, conferences and books;
according to the behavior characteristics of a user using mobile software, the operation behavior of the user on an academic resource is divided into opening, reading, star-level evaluation, sharing and collection, a user interest model is built on the basis of the user background and the browsed academic resource and in combination with the academic resource model according to different browsing behaviors of the user, and the user interest model is expressed as follows:
Mu={Tu,Ku,Ct,Lu} (8)
wherein Tu is the user discipline preference distribution vector formed, after user behaviors, from the discipline distribution vectors Tr of the academic resources of a certain class viewed by the user over a period of time, namely
Figure FDA0002274819940000042
wherein sum is the total number of academic resources on which the user has produced behaviors, sj is the 'behavior coefficient' after the user acts on academic resource j, a larger value indicating that the user prefers the resource; Tjr is the discipline distribution vector of the jth resource; the calculation of sj comprehensively considers opening, reading, evaluating, collecting, sharing and other behaviors, and can accurately reflect the user's degree of preference for the resource.
Ku = {(ku1, ωu1), (ku2, ωu2), ..., (kuN2, ωuN2)} is the user preference keyword distribution, where N2 is the number of keywords, kui (1 ≤ i ≤ N2) denotes the ith user-preferred keyword, and ωui is the weight of keyword kui, calculated from the keyword distribution vectors Kr of all academic resources on which user u has produced behaviors within a period of time;
K'jr = sj · Kjr    (10)
calculating a new keyword distribution vector for each academic resource according to formula (10), and selecting the TOP-N2 keywords among the new keyword distribution vectors of all resources as the user keyword preference distribution vector Ku;
Lu is the user's LDA latent topic preference distribution vector, calculated from the LDA latent topic distribution vectors Lr = {lr1, lr2, lr3, ..., lrN1} of the academic resources in the same way as Tu.
Figure FDA0002274819940000043
The similarity between the user interests and the resource model is calculated as follows:
academic resource model representation:
Mr={Tr,Kr,Ct,Lr} (12)
user interest model representation:
Mu={Tu,Ku,Ct,Lu} (13)
the similarity between the user discipline preference distribution vector Tu and the discipline distribution vector Tr of an academic resource is calculated by cosine similarity, namely:
Figure FDA0002274819940000051
the similarity between the user LDA latent topic preference distribution vector Lu and the academic resource LDA latent topic distribution vector Lr is calculated by cosine similarity, namely:
Figure FDA0002274819940000052
the similarity between the user keyword preference distribution vector Ku and the academic resource keyword distribution vector Kr is calculated by Jaccard similarity:
Figure FDA0002274819940000053
then the similarity between the user interest model and the academic resource model is as follows:
Figure FDA0002274819940000054
wherein σ + ρ + τ = 1, and the specific weight distribution is obtained by experimental training;
introducing a Recommendation_degree concept, wherein the higher the recommendation degree of an academic resource, the better the resource matches the user's interest preference and the higher its quality; the recommendation degree calculation formula is as follows:
Recommendation_degree = λ1·Sim(Mu, Mr) + λ2·Quality, where λ1 + λ2 = 1    (18)
the secondary recommendation stage is to perform Top-N recommendation according to the recommendation degree of academic resources.
7. The method according to claim 4, wherein the corpus further includes a class-specific verification corpus, which is used to make the text classification model perform classification verification in advance according to a predetermined class number A by using the verification corpus, so as to obtain the classification accuracy of the text classification model for each class in the A classes, which is used as the classification credibility index of the text classification model for each class in the A classes; the accuracy rate is the ratio of correctly classified corpora in all verified corpora classified by the text classification model, and a classification accuracy rate threshold is preset; the text classification method for each text to be classified by using the text classification model specifically comprises the following steps:
step one, preprocessing each text to be classified, wherein the preprocessing comprises word segmentation and stop-word removal with proper nouns retained; respectively calculating the feature weights of all preprocessed words of the text, wherein the feature weight value of a word is directly proportional to its frequency of occurrence in the text and inversely proportional to its frequency of occurrence in the training corpus; arranging the calculated word set in descending order of feature weight value, and extracting the front part of the original word set of each text to be classified as its feature word set;
step two, using the text classification model with the original feature word set of each text to be classified to respectively calculate the probability values of the text belonging to each of the predetermined A categories, and selecting the category with the maximum probability value as the classification category of the text;
step three, judging the text classification result of the step two, and directly outputting the result if the classification accuracy value of the text classification model to the classification reaches a set threshold value; if the classification accuracy rate value of the text classification model to the classification does not reach the set threshold value, entering the step four;
inputting each preprocessed text into the LDA topic model, calculating a weight value of each topic in K set topics corresponding to the text by using the LDA topic model, selecting the topic with the largest weight value, adding the first Y words in topic associated words under the topic obtained after being trained by the LDA topic model into an original feature word set of the text to be used as an expanded feature word set together, respectively calculating probability values of each category in A preset categories possibly attributed to the text by using the text classification model again, and selecting the category with the largest probability value as a final classification category of the text.
8. The method of claim 7, wherein the main calculation formula of the text classification model is:
Figure FDA0002274819940000061
wherein P(cj | x1, x2, ..., xn) represents the probability that the text belongs to category cj when the feature words (x1, x2, ..., xn) appear simultaneously;
wherein P(cj) represents the proportion of texts in the training text set that belong to category cj, P(x1, x2, ..., xn | cj) represents the probability that the feature word set of the text is (x1, x2, ..., xn) given that the text to be classified belongs to category cj, and P(x1, x2, ..., xn) represents the joint probability of the feature words.
CN201611130297.9A 2016-12-09 2016-12-09 Academic resource recommendation service system and method Active CN106815297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611130297.9A CN106815297B (en) 2016-12-09 2016-12-09 Academic resource recommendation service system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611130297.9A CN106815297B (en) 2016-12-09 2016-12-09 Academic resource recommendation service system and method

Publications (2)

Publication Number Publication Date
CN106815297A CN106815297A (en) 2017-06-09
CN106815297B true CN106815297B (en) 2020-04-10

Family

ID=59107077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611130297.9A Active CN106815297B (en) 2016-12-09 2016-12-09 Academic resource recommendation service system and method

Country Status (1)

Country Link
CN (1) CN106815297B (en)

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2632131C2 (en) 2015-08-28 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Method and device for creating recommended list of content
RU2632100C2 (en) 2015-09-28 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Method and server of recommended set of elements creation
RU2629638C2 (en) 2015-09-28 2017-08-30 Общество С Ограниченной Ответственностью "Яндекс" Method and server of creating recommended set of elements for user
RU2632144C1 (en) 2016-05-12 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Computer method for creating content recommendation interface
RU2636702C1 (en) 2016-07-07 2017-11-27 Общество С Ограниченной Ответственностью "Яндекс" Method and device for selecting network resource as source of content in recommendations system
RU2632132C1 (en) 2016-07-07 2017-10-02 Общество С Ограниченной Ответственностью "Яндекс" Method and device for creating contents recommendations in recommendations system
USD882600S1 (en) 2017-01-13 2020-04-28 Yandex Europe Ag Display screen with graphical user interface
CN107247751B (en) * 2017-05-26 2020-01-14 武汉大学 LDA topic model-based content recommendation method
CN108280114B (en) * 2017-07-28 2022-01-28 淮阴工学院 Deep learning-based user literature reading interest analysis method
CN110008334B (en) * 2017-08-04 2023-03-14 腾讯科技(北京)有限公司 Information processing method, device and storage medium
CN107590232B (en) * 2017-09-07 2019-12-06 北京师范大学 Resource recommendation system and method based on network learning environment
CN110020110B (en) * 2017-09-15 2023-04-07 腾讯科技(北京)有限公司 Media content recommendation method, device and storage medium
CN109672706B (en) * 2017-10-16 2022-06-14 百度在线网络技术(北京)有限公司 Information recommendation method and device, server and storage medium
CN107908669A (en) * 2017-10-17 2018-04-13 广东广业开元科技有限公司 A kind of big data news based on parallel LDA recommends method, system and device
CN107818145A (en) * 2017-10-18 2018-03-20 南京邮数通信息科技有限公司 A kind of user behavior tag along sort extracting method based on dynamic reptile
CN107833061A (en) * 2017-11-17 2018-03-23 中农网购(江苏)电子商务有限公司 One kind is for retail Intelligent agricultural product allocator
CN108090131A (en) * 2017-11-23 2018-05-29 北京洪泰同创信息技术有限公司 It teaches the method for pushing of auxiliary resource data and teaches the pusher of auxiliary resource data
CN108038765B (en) * 2017-12-23 2022-01-25 身轻如燕信息(上海)有限公司 Catering management ordering system based on video capture
CN108255992A (en) * 2017-12-29 2018-07-06 广州贝睿信息科技有限公司 It is a kind of paint originally can be readability assessment recommend method
CN110309411A (en) * 2018-03-15 2019-10-08 中国移动通信集团有限公司 A kind of resource recommendation method and device
CN108446273B (en) * 2018-03-15 2021-07-20 哈工大机器人(合肥)国际创新研究院 Kalman filtering word vector learning method based on Dield process
CN108600306A (en) * 2018-03-20 2018-09-28 成都星环科技有限公司 A kind of intelligent content supplying system
CN108337569A (en) * 2018-04-03 2018-07-27 优视科技有限公司 A kind of interactive discussion method, apparatus and terminal device based on video
CN108595593B (en) * 2018-04-19 2021-11-23 南京大学 Topic model-based conference research hotspot and development trend information analysis method
CN108717445A (en) * 2018-05-17 2018-10-30 南京大学 A kind of online social platform user interest recommendation method based on historical data
CN108897860B (en) * 2018-06-29 2022-05-27 中国科学技术信息研究所 Information pushing method and device, electronic equipment and computer readable storage medium
CN110020189A (en) * 2018-06-29 2019-07-16 武汉掌游科技有限公司 A kind of article recommended method based on Chinese Similarity measures
CN109213908A (en) * 2018-08-01 2019-01-15 浙江工业大学 A kind of academic meeting paper supplying system based on data mining
RU2714594C1 (en) 2018-09-14 2020-02-18 Общество С Ограниченной Ответственностью "Яндекс" Method and system for determining parameter relevance for content items
RU2720952C2 (en) 2018-09-14 2020-05-15 Общество С Ограниченной Ответственностью "Яндекс" Method and system for generating digital content recommendation
RU2720899C2 (en) 2018-09-14 2020-05-14 Общество С Ограниченной Ответственностью "Яндекс" Method and system for determining user-specific content proportions for recommendation
CN109189892B (en) * 2018-09-17 2021-04-27 北京一点网聚科技有限公司 Recommendation method and device based on article comments
CN109325179B (en) * 2018-09-17 2020-12-04 青岛海信网络科技股份有限公司 Content promotion method and device
RU2725659C2 (en) 2018-10-08 2020-07-03 Общество С Ограниченной Ответственностью "Яндекс" Method and system for evaluating data on user-element interactions
RU2731335C2 (en) 2018-10-09 2020-09-01 Общество С Ограниченной Ответственностью "Яндекс" Method and system for generating recommendations of digital content
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109344319B (en) * 2018-11-01 2021-08-24 中国搜索信息科技股份有限公司 Online content popularity prediction method based on ensemble learning
CN109801146B (en) * 2019-01-18 2020-12-29 北京工业大学 Resource service recommendation method and system based on demand preference
CN110297882A (en) * 2019-03-01 2019-10-01 阿里巴巴集团控股有限公司 Training corpus determines method and device
CN110245080B (en) * 2019-05-28 2022-08-16 厦门美柚股份有限公司 Method and device for generating scene test case
CN112052330B (en) * 2019-06-05 2021-11-26 上海游昆信息技术有限公司 Application keyword distribution method and device
CN110209822B (en) * 2019-06-11 2021-12-21 中译语通科技股份有限公司 Academic field data correlation prediction method based on deep learning and computer
CN110490547A (en) * 2019-08-13 2019-11-22 北京航空航天大学 Office system intellectualized technology
CN110598151B (en) * 2019-09-09 2023-07-14 河南牧业经济学院 Method and system for judging news spreading effect
RU2757406C1 (en) 2019-09-09 2021-10-15 Общество С Ограниченной Ответственностью «Яндекс» Method and system for providing a level of service when advertising content element
CN110688476A (en) * 2019-09-23 2020-01-14 腾讯科技(北京)有限公司 Text recommendation method and device based on artificial intelligence
CN110866106A (en) * 2019-10-10 2020-03-06 重庆金融资产交易所有限责任公司 Text recommendation method and related equipment
CN110866181B (en) * 2019-10-12 2022-04-22 平安国际智慧城市科技股份有限公司 Resource recommendation method, device and storage medium
CN111177372A (en) * 2019-12-06 2020-05-19 绍兴市上虞区理工高等研究院 Scientific and technological achievement classification method, device, equipment and medium
CN111241318B (en) * 2020-01-03 2021-04-13 北京字节跳动网络技术有限公司 Method, device, equipment and storage medium for selecting object to push cover picture
CN111241403B (en) * 2020-01-15 2023-04-18 华南师范大学 Deep learning-based team recommendation method, system and storage medium
CN111325006B (en) * 2020-03-17 2023-05-05 北京百度网讯科技有限公司 Information interaction method and device, electronic equipment and storage medium
CN111563177B (en) * 2020-05-15 2023-05-23 深圳掌酷软件有限公司 Theme wallpaper recommendation method and system based on cosine algorithm
CN111625439B (en) * 2020-06-01 2023-07-04 杭州弧途科技有限公司 Method for analyzing app user stickiness based on user behavior log data
CN111651675B (en) * 2020-06-09 2023-07-04 杨鹏 UCL-based user interest topic mining method and device
CN112287199A (en) * 2020-10-29 2021-01-29 黑龙江稻榛通网络技术服务有限公司 Big data center processing system based on cloud server
CN112559901B (en) * 2020-12-11 2022-02-08 百度在线网络技术(北京)有限公司 Resource recommendation method and device, electronic equipment, storage medium and computer program product
CN113268683B (en) * 2021-04-15 2023-05-16 南京邮电大学 Academic literature recommendation method based on multiple dimensions
CN113536085B (en) * 2021-06-23 2023-05-19 西华大学 Method and system for scheduling subject term search crawlers based on combined prediction method
CN113420058B (en) * 2021-07-01 2022-07-01 宁波大学 Conversational academic conference recommendation method based on combination of user historical behaviors
CN113360776B (en) * 2021-07-19 2023-07-21 西南大学 Cross-table data mining-based technological resource recommendation method
CN113568882A (en) * 2021-08-03 2021-10-29 重庆仓舟网络科技有限公司 OSS-based resource sharing method and system
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN114519097B (en) * 2022-04-21 2022-07-19 宁波大学 Heterogeneous information network-enhanced academic paper recommendation method
CN117575745B (en) * 2024-01-17 2024-04-30 山东正禾大教育科技有限公司 Personalized course teaching resource recommendation method based on AI big data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336793B (en) * 2013-06-09 2015-08-12 中国科学院计算技术研究所 Personalized article recommendation method and system
CN103324761A (en) * 2013-07-11 2013-09-25 广州市尊网商通资讯科技有限公司 Product database construction method and system based on Internet data
CN104680453A (en) * 2015-02-28 2015-06-03 北京大学 Course recommendation method and system based on students' attributes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"高质量学术资源推荐方法的研究与实现";高洁;《中国优秀硕士学位论文全文数据库信息科技辑》;20150430;第1-56页 *

Also Published As

Publication number Publication date
CN106815297A (en) 2017-06-09

Similar Documents

Publication Publication Date Title
CN106815297B (en) Academic resource recommendation service system and method
US7844592B2 (en) Ontology-content-based filtering method for personalized newspapers
US8027977B2 (en) Recommending content using discriminatively trained document similarity
KR101712988B1 (en) Method and apparatus for providing internet service in a mobile communication terminal
CN106682152B (en) Personalized message recommendation method
US20150213361A1 (en) Predicting interesting things and concepts in content
CN111061962A (en) Recommendation method based on user score analysis
CN112966091B (en) Knowledge graph recommendation system fusing entity information and popularity
CN111177538A (en) Unsupervised weight calculation-based user interest tag construction method
Godoy et al. Interface agents personalizing Web-based tasks
CN102156747B (en) Method and device for forecasting collaborative filtering mark by introduction of social tag
Chang et al. LDA-based personalized document recommendation
Kacem et al. Time-sensitive user profile for optimizing search personalization
CN111753167B (en) Search processing method, device, computer equipment and medium
Velásquez Web site keywords: A methodology for improving gradually the web site text content
JP2022035314A (en) Information processing unit and program
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
KR20100023630A (en) Method and system of classifying web pages using category tag information and recording medium used by the same
Zhang et al. An interpretable and scalable recommendation method based on network embedding
Hoang et al. Academic event recommendation based on research similarity and exploring interaction between authors
Ma et al. Book recommendation model based on wide and deep model
KR101827338B1 (en) Method and apparatus for providing internet service in a mobile communication terminal
Ahamed et al. Deduce user search progression with feedback session
Vázquez et al. Validation of scientific topic models using graph analysis and corpus metadata
CN115510326A (en) Internet forum user interest recommendation algorithm based on text features and emotional tendency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant