CN106815297A - Academic resource recommendation service system and method - Google Patents

Academic resource recommendation service system and method

Info

Publication number
CN106815297A
CN106815297A CN201611130297.9A
Authority
CN
China
Prior art keywords
classification
theme
user
model
resource
Prior art date
Application number
CN201611130297.9A
Other languages
Chinese (zh)
Inventor
刘柏嵩
王洋洋
尹丽玲
费晨杰
高元
Original Assignee
Ningbo University (宁波大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University (宁波大学)
Priority to CN201611130297.9A priority Critical patent/CN106815297A/en
Publication of CN106815297A publication Critical patent/CN106815297A/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Abstract

An academic resource recommendation service system and method are provided. Academic resources are crawled from the internet by an LDA-based topic crawler and, after being classified into a predetermined number A of categories by an LDA-based text classification model, are stored in a local academic resource database. The system further includes an academic resource model, a resource quality value computation model, a user interest model, and a tracking software module built into the user's terminal. Combining the user's subjects of interest and historical browsing behavior data, the academic resource model and the user interest model are each built over four dimensions: academic resource type, subject distribution, keyword distribution, and LDA latent topic distribution. The similarity between the academic resource model and the user interest preference model is computed, the recommendation degree is computed in combination with the resource quality value, and finally a Top-N recommendation of academic resources is made for the user according to the recommendation degree. The present invention makes personalized, accurate recommendations of academic resources according to user identity, interest, and browsing behavior, improving the working efficiency of researchers.

Description

Academic resource recommendation service system and method

Technical field

The present invention relates to the field of computer application technology, and in particular to an academic resource recommendation service system and a method by which the resource recommendation service system provides an academic resource recommendation service for its users.

Background technology

The era of big data has arrived, and this is especially true in the field of academic resources: hundreds of millions of academic resources of all kinds are produced every year. Besides scientific papers and patents, large volumes of academic conferences, academic news, academic community information, and other kinds of academic resources emerge in real time; these types of academic resources are significant for users to grasp, accurately and efficiently, the current state of research in their fields of interest. However, researchers usually carry a heavy research workload, and this kind of academic resource has the big-data characteristics of heterogeneity and rapid growth. Traditional search-engine approaches to academic resources struggle to achieve complete and precise retrieval, the search process is cumbersome, and users often have to spend considerable time and effort querying for academic resources of interest, which affects their working efficiency.

Current research on personalized recommendation of academic resources is devoted mainly to scientific papers, so the recommended resource types are monotonous. Different user groups, that is, users of different identities, pay different degrees of attention to different types of academic resources; current research on personalized academic resource recommendation does not consider these factors and cannot formulate multi-strategy recommendation schemes based on user identity. Moreover, current academic resource recommendation research is limited to the recommendation module alone, whereas the present invention provides a systematized service for academic resource recommendation: from dynamic acquisition, integration, and classification of academic resources to personalized recommendation based on user identity, behavior, and subjects of interest, forming an integrated service system with resource integration and recommendation at its core.

LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-level structure of words, topics, and documents. A generative model holds that each word of an article is obtained through the process of "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". A topic refers to a defined professional field or field of interest, such as aerospace, biomedicine, or information technology, and concretely to the set formed by a series of related words. Documents follow a multinomial distribution over topics, and topics follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can be used to identify the topic information hidden in documents. It uses a bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text into numerical information that is easy to model. Each document is represented as a probability distribution over some topics, and each topic is represented as a probability distribution over many words. The LDA topic model is a classic model for topic mining in natural language processing; it can extract latent topics from a text corpus and provides a quantitative method for studying topics. It has been widely applied to topic discovery in academic resources, such as mining research hotspots, tracing the evolution of research themes, and predicting research trends.
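The generative story described above can be sketched in a few lines; the topics, words, and distributions below are toy stand-ins, not drawn from the invention's corpus:

```python
import random

# Toy illustration of LDA's generative process: each word of a document is
# produced by first sampling a topic from the document's topic distribution,
# then sampling a word from that topic's word distribution.
topics = {
    "aerospace": ["rocket", "orbit", "satellite"],
    "biomedicine": ["gene", "protein", "cell"],
}
doc_topic_dist = {"aerospace": 0.7, "biomedicine": 0.3}  # document-topic multinomial

def generate_document(n_words, seed=0):
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # "choose a topic with a certain probability..."
        topic = rng.choices(list(doc_topic_dist), weights=list(doc_topic_dist.values()))[0]
        # "...then choose a word from that topic with a certain probability"
        words.append(rng.choice(topics[topic]))
    return words

doc = generate_document(10)
```

In practice LDA inverts this process, inferring the hidden document-topic and topic-word distributions from observed word frequencies.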

In addition, with the development of the internet, the internet is filled with large amounts of information text in various forms, such as news, blogs, and conference summaries. Such texts more or less contain academically relevant content, and frequently include the latest academic research information, which interests people in all related disciplines. However, such texts are disordered, their subjects often overlap, and they usually carry no classification information of their own; the prior art often finds it difficult to classify them correctly and automatically. For people in related disciplines, retrieving them by traditional search engines makes complete and precise retrieval difficult, the search process is cumbersome, and users often have to spend considerable time and effort querying for academic resources of interest, which affects their working efficiency.

The present invention is aimed precisely at solving the above technical problems.

The content of the invention

The technical problem to be solved by the present invention is, in view of the above state of the art, to provide an academic resource recommendation service system and a method by which the resource recommendation service system provides an academic resource recommendation service for its users.

The technical solution adopted by the present invention to solve the above technical problem is:

An academic resource recommendation service system, wherein the academic resources are electronic texts of various kinds published on the internet. The academic resource recommendation service system includes a web crawler, a text classification model, and an academic resource database; academic resources are crawled from the internet by the web crawler and, after being classified into a predetermined number A of categories by the text classification model, are stored in the local academic resource database, which exposes an open API for the presentation and resource recommendation modules to call. The academic resource recommendation service system further includes an academic resource model, a resource quality value computation model, a user interest model, and a tracking software module built into the user's terminal for tracking the user's online browsing behavior. Based on the historical browsing behavior data of different user groups, the degree of attention users of each identity pay to each type of academic resource is computed; academic resources are modeled over four dimensions: resource type, subject distribution, keyword distribution, and LDA latent topic distribution; the user's interest preference is modeled by combining the user's subjects of interest and historical browsing behavior data; the similarity between the academic resource model and the user interest preference model is computed; the recommendation degree is computed in combination with the resource quality value; and finally a Top-N recommendation of academic resources is made for the user according to the recommendation degree.

The web crawler is a topic crawler and further includes an LDA topic model, which is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance; the corpus includes a training corpus on which the LDA topic model is trained with a set number of topics K. After the training corpus is trained by the LDA topic model, its word-clustering function yields K topic-associated word sets, one per topic, which become the K topic documents against which the topic crawler crawls. On the basis of a general web crawler, the topic crawler further includes a topic determination module, a similarity computation module, and a URL priority ranking module. The topic crawler consists of multiple distributed crawlers, distributed by the number of academic subjects, each distributed crawler corresponding to one academic subject, and the distributed crawlers acquire the academic resources of multiple academic subjects in parallel. In each crawl, the topic determination module of the topic crawler determines the target topic and its topic document, which guides the computation of topic similarity; the similarity computation module computes and judges topic similarity for each anchor text on the crawled page in combination with the page content, rejects hyperlinks whose anchor-text-plus-page topic similarity is below a set threshold, and selects the URLs whose anchor-text-plus-page topic similarity exceeds the threshold. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of visited pages, sorted in descending order of similarity; the topic crawler continually visits the pages of the URLs in queue order, crawls the corresponding academic resources, attaches classification labels to the crawled academic resources, and stores them into the database under the topic document of this crawl, until the unvisited URL queue is empty. The academic resources crawled each time by the topic crawler serve as new corpus material for LDA topic model training, and the crawl process is repeated continually, so that the topic-associated words gathered for each topic document are continually supplemented and updated, and the crawled academic resources are continually supplemented and updated to a degree acceptable by human review.
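The URL-queue discipline described above (threshold filtering, then descending-similarity ordering) can be sketched as a small frontier structure; the class and names below are hypothetical, not the patent's implementation:

```python
import heapq

# Sketch of the topic crawler's URL frontier: links whose topic similarity
# (anchor text combined with page content) falls below the set threshold are
# discarded; the rest sit in a max-priority queue so the most topic-relevant
# URL is visited first, until the unvisited queue is empty.
class TopicFrontier:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._heap = []          # (-similarity, url); heapq is a min-heap
        self._seen = set()

    def offer(self, url, similarity):
        if similarity > self.threshold and url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-similarity, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = TopicFrontier(threshold=0.5)
frontier.offer("http://example.org/a", 0.9)
frontier.offer("http://example.org/b", 0.3)   # rejected: below threshold
frontier.offer("http://example.org/c", 0.7)
```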

The corpus also includes classification validation corpora, which are used in advance to let the text classification model perform classification validation over the predetermined number A of categories, in order to obtain the text classification model's classification accuracy for each of the A categories as a classification confidence indicator for each category. The accuracy is the proportion of correctly classified texts among all validation texts the text classification model assigns to a given category, and a classification accuracy threshold is preset.
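The per-category validation step can be sketched as follows; the labels and example data are hypothetical:

```python
# Per-category accuracy as defined above: the fraction of correctly classified
# texts among all validation texts the model assigns to the given category;
# a category's predictions are trusted directly only if it meets the threshold.
def category_accuracy(true_labels, predicted_labels, category):
    assigned = [(t, p) for t, p in zip(true_labels, predicted_labels) if p == category]
    if not assigned:
        return 0.0
    return sum(t == p for t, p in assigned) / len(assigned)

true_y = ["math", "math", "physics", "physics", "bio"]
pred_y = ["math", "math", "math", "physics", "physics"]
acc_math = category_accuracy(true_y, pred_y, "math")   # 2 of 3 assigned to "math" are correct
trusted = acc_math >= 0.8                              # the patent's preset threshold is 80%
```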

All subjects are divided into 75 subject categories, i.e. the category number A is 75. The number of topics K is set to 100 when training the LDA topic model, and the classification accuracy threshold preset for the text classification model's classification validation is 80%.

A method for providing an academic resource recommendation service for users with the resource recommendation service system, wherein the academic resources are electronic texts of various kinds published on the internet, the method including crawling academic resources from the internet with a web crawler; classifying the crawled academic resources into a predetermined number A of categories with a text classification model and storing them to form an academic resource database, which exposes an open API for the presentation and resource recommendation modules to call; and using a resource quality value computation model, a user interest model, and a tracking software module built into the user's terminal for tracking the user's online browsing behavior. The process of recommending academic resources to a user comprises a cold-start recommendation stage and a secondary recommendation stage. The cold-start recommendation stage recommends, based on subjects of interest, high-quality resources matching the user's subjects of interest; a high-quality resource is an academic resource whose resource quality value, computed by the resource quality value computation model, compares high. The resource quality value is the arithmetic mean or weighted mean of resource authority, resource community popularity, and resource recency. In the secondary recommendation stage, the user interest model and the resource model are each built, the similarity between the user interest model and the resource model is computed, the recommendation degree is computed in combination with the resource quality value, and finally a Top-N recommendation of academic resources is made for the user according to the recommendation degree.

The computation of the resource quality value Quality includes the following. The computing formula for resource authority Authority is:

Authority=(Level+Cite)/2 (1)

where Level is the score after the rank of the publication in which the resource appeared is quantized; publication ranks are divided into 5 grades, scored 1, 0.8, 0.6, 0.4, and 0.2 in turn. Top journals or conferences such as Nature and Science score 1; the second grade, such as ACM Transactions, scores 0.8; the lowest grade scores 0.2. The computing formula for Cite is:

Cite=Cites/maxCite (2)

where Cite is the quantized result of the resource's citation count, Cites is the resource's citation count, and maxCite is the maximum citation count in the source resource database;

The computing formula for resource community popularity Popularity is as follows:

Popularity=readTimes/maxReadTimes (3)

where readTimes is the number of times the resource has been read, and maxReadTimes is the maximum read count in the source resource database;

The resource recency Recentness is computed similarly, with the following formula:

Recentness=((year-minYear)*12+(month-minMonth))/((maxYear-minYear)*12+(maxMonth-minMonth)) (4)

where year and month are the year and month in which the resource was published, and minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months among all resources of this type in the source database;

The resource quality value Quality is computed as follows:

Quality=(Authority+Popularity+Recentness)/3 (5)
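The quality computation can be sketched as below; the plain arithmetic-mean combinations for Authority and Quality and the month-granularity form of Recentness are assumptions consistent with the surrounding definitions, since the source gives only the Cite and Popularity ratios explicitly:

```python
# Sketch of the resource quality value. Only the normalized-ratio forms of
# Cite and Popularity are stated outright in the source; the mean-based
# combinations below are assumed reconstructions.
def authority(level, cites, max_cite):
    cite = cites / max_cite                      # formula (2)
    return (level + cite) / 2                    # assumed form of formula (1)

def popularity(read_times, max_read_times):
    return read_times / max_read_times           # formula (3)

def recentness(year, month, min_ym, max_ym):
    # assumed: linear position of (year, month) between the earliest
    # and latest publication dates among resources of this type
    months = (year - min_ym[0]) * 12 + (month - min_ym[1])
    span = (max_ym[0] - min_ym[0]) * 12 + (max_ym[1] - min_ym[1])
    return months / span

def quality(auth, pop, rec):
    return (auth + pop + rec) / 3                # arithmetic mean, per the text

q = quality(authority(1.0, 50, 100),             # top-grade venue, half the max citations
            popularity(800, 1000),
            recentness(2016, 6, (2010, 1), (2016, 12)))
```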

The academic resource model is expressed as follows:

Mr={ Tr,Kr,Ct,Lr} (6)

where Tr is the subject distribution vector of the academic resource, i.e. the A probability values of the resource's distribution over the subject categories, obtained from a Bayesian multinomial model;

Kr={(kr1,ωr1),(kr2,ωr2),…,(krm,ωrm)}, where m is the number of keywords, kri (1≤i≤m) denotes the i-th keyword of a single academic resource, and ωri is the weight of keyword kri, obtained by an improved tf-idf algorithm with the following formula:

w(i,r)=tf(i,r)*log(Z/L) (7)

where w(i,r) denotes the weight of the i-th keyword in document r, tf(i,r) denotes the frequency with which the i-th keyword appears in document r, Z denotes the total number of records in the document set, and L denotes the number of documents containing keyword i. Lr is the latent topic distribution vector, Lr={lr1,lr2,lr3,…,lrN1}, where N1 is the number of latent topics. Ct is the resource type, and t can take the values 1, 2, 3, 4, 5, i.e. the five major classes of academic resources: papers, patents, news, conferences, and books;
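The keyword weighting can be sketched as follows; the source calls its tf-idf "improved" without preserving the exact formula, so the standard form (term frequency times the log of total records over documents containing the term) is assumed here:

```python
import math

# Sketch of keyword weighting for Kr. The standard tf-idf form is assumed;
# the patent's actual "improved" variant may differ.
def keyword_weight(tf_ir, total_docs, docs_with_term):
    return tf_ir * math.log(total_docs / docs_with_term)

def top_keywords(term_freqs, doc_freqs, total_docs, m):
    # term_freqs: {term: frequency in this document}; doc_freqs: {term: L}
    weighted = {t: keyword_weight(f, total_docs, doc_freqs[t])
                for t, f in term_freqs.items()}
    # keep the m highest-weighted (keyword, weight) pairs
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:m]

# Toy data: a common word like "the" appears everywhere and is down-weighted.
kr = top_keywords({"lda": 5, "topic": 3, "the": 20},
                  {"lda": 10, "topic": 50, "the": 1000},
                  total_docs=1000, m=2)
```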

According to the behavioral characteristics of users' use of the mobile software, a user's operations on an academic resource are divided into opening, reading, star rating, sharing, and collecting. The user interest model is based on the academic resources the user has browsed; according to the user's different browsing behaviors and in combination with the academic resource model, the user interest model is built and expressed as follows:

Mu={ Tu, Ku, Ct, Lu} (8)

where Tu is the user's subject preference distribution vector, formed by weighting the subject distribution vectors Tr of a certain class of academic resources the user has browsed over a period of time by the user's behavior, i.e.

Tu=(Σj sj*Tjr)/sum (9)

where sum is the total number of academic resources on which the user has produced behavior, and sj is the "behavior coefficient" after the user produces behavior on academic resource j; the larger its value, the more the user likes the resource. Tjr denotes the subject distribution vector of the j-th resource. The computation of sj takes behaviors such as opening, reading, rating, collecting, and sharing into account, and can accurately reflect the user's degree of preference for the resource.
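The behavior-weighted construction of Tu can be sketched as below; normalizing by the total number of browsed resources is an assumed reading of formula (9), consistent with the definition of sum above:

```python
# Sketch of the user subject-preference vector Tu: each browsed resource's
# subject distribution vector T_jr is scaled by its behavior coefficient s_j,
# then averaged over the number of resources browsed (assumed normalization).
def user_subject_preference(resources):
    # resources: list of (s_j, T_jr) pairs, all T_jr of equal length
    total = len(resources)
    dims = len(resources[0][1])
    tu = [0.0] * dims
    for s_j, t_jr in resources:
        for i in range(dims):
            tu[i] += s_j * t_jr[i]
    return [v / total for v in tu]

# Toy data: two resources over two subject categories; the first was engaged
# with more strongly (s_j = 2.0) than the second (s_j = 1.0).
tu = user_subject_preference([(2.0, [0.5, 0.5]), (1.0, [1.0, 0.0])])
```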

Ku={(ku1,ωu1),(ku2,ωu2),…,(kuN2,ωuN2)} is the user's keyword preference distribution vector, where N2 is the number of keywords, kui (1≤i≤N2) denotes the i-th user preference keyword, and ωui is the weight of keyword kui, computed from the keyword distribution vectors Kr of the academic resources of a certain class on which user u has produced behavior over a period of time:

Kjr'=sj*Kjr (10)

The new keyword distribution vector of each resource can be computed according to formula (10); the TOP-N2 entries across the new keyword distribution vectors of all resources are then chosen as the user's keyword preference distribution vector Ku.

Lu is the user's LDA latent topic preference distribution vector, computed from the LDA latent topic distribution vectors Lr={lr1,lr2,lr3,…,lrN1} of the academic resources, in the same way as Tu.

The similarity between the user interest model and the resource model is computed as follows:

The academic resource model is represented as:

Mr={Tr, Kr, Ct, Lr} (12)

The user interest model is represented as:

Mu={Tu, Ku, Ct, Lu} (13)

The similarity between the user's subject preference distribution vector Tu and the academic resource's subject distribution vector Tr is computed by cosine similarity, i.e.:

Sim(Tu,Tr)=(Tu·Tr)/(|Tu|*|Tr|) (14)

The similarity between the user's LDA latent topic preference distribution vector Lu and the academic resource's LDA latent topic distribution vector Lr is computed by cosine similarity, i.e.:

Sim(Lu,Lr)=(Lu·Lr)/(|Lu|*|Lr|) (15)

The similarity between the user's keyword preference distribution vector Ku and the academic resource's keyword distribution vector Kr is computed by Jaccard similarity:

Sim(Ku,Kr)=|Ku∩Kr|/|Ku∪Kr| (16)

The similarity between the user interest model and the academic resource model is then:

Sim(Mu,Mr)=σ*Sim(Tu,Tr)+ρ*Sim(Lu,Lr)+τ*Sim(Ku,Kr) (17)

where σ+ρ+τ=1, and the specific weight distribution is obtained by experimental training.
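The combined model similarity (cosine similarity for the subject and LDA-topic vectors, Jaccard similarity for the keyword sets) can be sketched as follows; the weights sigma, rho, and tau below are placeholders, since the source obtains them by experimental training:

```python
import math

# Sketch of the model similarity: cosine over the distribution vectors,
# Jaccard over the keyword sets, mixed with weights summing to 1.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def jaccard(ka, kb):
    ka, kb = set(ka), set(kb)
    return len(ka & kb) / len(ka | kb)

def model_similarity(tu, tr, lu, lr, ku, kr, sigma=0.4, rho=0.3, tau=0.3):
    # sigma + rho + tau = 1; these particular values are placeholders
    return sigma * cosine(tu, tr) + rho * cosine(lu, lr) + tau * jaccard(ku, kr)

sim = model_similarity([1, 0], [1, 0],
                       [0.5, 0.5], [0.5, 0.5],
                       {"lda"}, {"lda", "topic"})
```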

The concept of recommendation degree Recommendation_degree is introduced; the larger the recommendation degree of an academic resource, the better the resource matches the user's interest preference and the higher its quality. The recommendation degree is computed as follows:

Recommendation_degree=λ1*Sim(Mu,Mr)+λ2*Quality, where λ1+λ2=1 (18)

The secondary recommendation stage makes Top-N recommendations according to the recommendation degree of the academic resources.
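The secondary recommendation stage per formula (18) can be sketched as below; the lambda weights are placeholders, not trained values from the source:

```python
# Sketch of the Top-N step: recommendation degree mixes model similarity
# and resource quality with lambda1 + lambda2 = 1, and the N resources with
# the highest degree are recommended.
def top_n(candidates, n, lam1=0.6, lam2=0.4):
    # candidates: list of (resource_id, similarity, quality)
    scored = [(rid, lam1 * sim + lam2 * q) for rid, sim, q in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [rid for rid, _ in scored[:n]]

# Toy candidates: "paper3" is both fairly similar and fairly high-quality,
# so it outranks the more one-sided "paper1" and "paper2".
recs = top_n([("paper1", 0.9, 0.5),
              ("paper2", 0.4, 0.9),
              ("paper3", 0.8, 0.8)], n=2)
```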

The web crawler includes an addressing crawler and a topic crawler, and further includes an LDA topic model, which is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance; the corpus includes a training corpus on which the LDA topic model is trained with a set number of topics K. After the training corpus is trained by the LDA topic model, its word-clustering function yields K topic-associated word sets, one per topic, which become the K topic documents against which the topic crawler crawls. On the basis of a general web crawler, the topic crawler further includes a topic determination module, a similarity computation module, and a URL priority ranking module. The topic crawler consists of multiple distributed crawlers, distributed by the number of academic subjects, each distributed crawler corresponding to one academic subject, and the distributed crawlers acquire the academic resources of multiple academic subjects in parallel. In each crawl, the topic determination module of the topic crawler determines the target topic and its topic document, which guides the computation of topic similarity; the similarity computation module computes and judges topic similarity for each anchor text on the crawled page in combination with the page content, rejects hyperlinks whose anchor-text-plus-page topic similarity is below a set threshold, and selects the URLs whose anchor-text-plus-page topic similarity exceeds the threshold. The topic crawler maintains a URL queue of the unvisited web pages pointed to by the hyperlinks of visited pages, sorted in descending order of similarity; the topic crawler continually visits the pages of the URLs in queue order, crawls the corresponding academic resources, attaches classification labels to the crawled academic resources, and stores them into the database under the topic document of this crawl, until the unvisited URL queue is empty. The academic resources crawled each time by the topic crawler serve as new corpus material for LDA topic model training, and the crawl process is repeated continually, so that the topic-associated words gathered for each topic document are continually supplemented and updated, and the crawled academic resources are continually supplemented and updated to a degree acceptable by human review.

The corpus also includes classification validation corpora, used in advance to let the text classification model perform classification validation over the predetermined number A of categories, in order to obtain the text classification model's classification accuracy for each of the A categories as a classification confidence indicator for each category. The accuracy is the proportion of correctly classified texts among all validation texts the text classification model assigns to a given category, and a classification accuracy threshold is preset. Classifying each text to be classified with the text classification model specifically includes the following steps:

Step 1: preprocess each text to be classified; preprocessing includes word segmentation, stop-word removal, and retention of proper nouns. Compute the feature weight of every word of each preprocessed text; a word's feature weight is proportional to the number of times it appears in the text and inversely proportional to the number of times it appears in the training corpus. Sort the resulting word set in descending order of feature weight, and take the front portion of each text's original word set as its feature word set;

Step 2: using the text classification model, take each text's original feature word set and compute the probability that the text belongs to each of the predetermined A categories; choose the category with the largest probability as the text's classification category;

Step 3: judge the text classification result of Step 2; if the text classification model's classification accuracy value for that category reaches the set threshold, output the result directly; if the text classification model's classification accuracy value for that category does not reach the set threshold, proceed to Step 4;

Step 4: input each preprocessed text into the LDA topic model; compute with the LDA topic model the weight of each of the K set topics for the text, and select the topic with the largest weight. Add the first Y words of that topic's associated word set, obtained in advance by training the LDA topic model, to the text's original feature word set to form the expanded feature word set. Reuse the text classification model to compute the probability that the text belongs to each of the predetermined A categories, and choose the category with the largest probability as the text's final classification category.
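Steps 1 through 4 can be sketched as a selective feature-expansion flow; the classifier, cue words, and accuracy figures below are toy stand-ins, not the patent's trained models:

```python
# Sketch of selective feature expansion: classify on the original feature
# words; if the predicted category's validated accuracy is below the
# threshold, add the top-Y associated words of the text's strongest LDA
# topic and classify again.
def classify_with_expansion(features, classify, category_accuracy,
                            best_topic_words, threshold=0.8, y=3):
    category = classify(features)                       # Steps 1-2
    if category_accuracy[category] >= threshold:        # Step 3: trusted category
        return category
    expanded = list(features) + best_topic_words[:y]    # Step 4: expand and retry
    return classify(expanded)

# Toy stand-in classifier: picks the category whose cue word appears most often.
cues = {"physics": "quark", "biology": "cell"}
def toy_classify(words):
    return max(cues, key=lambda c: words.count(cues[c]))

acc = {"physics": 0.9, "biology": 0.6}                  # hypothetical validated accuracies
label = classify_with_expansion(["quark", "field"], toy_classify, acc, ["quark"])
```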

The main computing formula of the text classification model is:

P(cj|x1,x2,…,xn)=P(cj)*P(x1,x2,…,xn|cj)/P(x1,x2,…,xn) (19)

where P(cj|x1,x2,…,xn) denotes the probability that the text belongs to category cj when the feature words (x1,x2,…,xn) occur together; P(cj) denotes the proportion of texts in the training set that belong to category cj; P(x1,x2,…,xn|cj) denotes the probability that, if the text to be classified belongs to category cj, its feature word set is (x1,x2,…,xn); and P(x1,x2,…,xn) denotes the joint probability of the given feature words.
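Formula (19) is Bayes' theorem as used by a naive Bayes classifier; a minimal multinomial version over toy data is sketched below, with Laplace smoothing added as an assumption not stated in the source:

```python
import math
from collections import Counter

# Minimal multinomial naive Bayes: P(cj | x1..xn) is proportional to
# P(cj) times the product of P(xi | cj); the shared denominator
# P(x1..xn) can be ignored when comparing categories.
def train(docs):
    # docs: list of (words, label)
    labels = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in labels}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return labels, word_counts, vocab, len(docs)

def classify(words, labels, word_counts, vocab, n_docs):
    def log_post(c):
        lp = math.log(labels[c] / n_docs)               # log P(cj)
        total = sum(word_counts[c].values())
        for w in words:                                 # sum of log P(xi | cj)
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return lp
    return max(labels, key=log_post)

model = train([(["rocket", "orbit"], "aero"), (["cell", "gene"], "bio")])
label = classify(["orbit"], *model)
```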

The resource recommendation service system for multi-type academic resources of the present invention has the following features:

(1) The present invention realizes dynamic acquisition of multiple types of academic resources, such as scientific papers, patents, academic conferences, and academic news, with high-performance acquisition of target academic resources based on the topic crawler module.

(2) The present invention realizes subject classification of multiple types of academic resources based on subject features.

(3) Different user groups pay different degrees of attention to different types of academic resources; the present invention realizes a multi-strategy academic resource recommendation scheme based on user groups, recommending each type of academic resource in different proportions to users of different identities.

(4) Based on users' browsing habits, the present invention realizes personalized recommendation of multiple types of academic resources based on users' different behaviors.

The present invention makes personalized recommendations of academic resources according to user identity, interest, and browsing behavior, can recommend academic resources to users more accurately, and greatly improves the working efficiency of researchers, creating a convenient and efficient information acquisition environment for researchers to better carry out scientific research and effectively resolving the contradiction between academic information overload and users' resource acquisition.

In addition, the present invention uses LDA-based academic resource acquisition and classification methods. Topic semantic information is deeply mined by the LDA topic model, building a good guidance basis for the academic resource topic crawler and integrating machine learning into the academic resource acquisition method, improving the quality and efficiency of academic resource acquisition. The academic resources obtained by the topic crawler are in turn used to update the LDA topics, so the topic model can be updated at any time to follow the trend of academic development and provide researchers with frontier resources in related fields. The text classification method based on selective feature expansion proposed by the present invention suits complex application scenarios: it selectively adds topic information to information-poor data while avoiding adding noise to information-rich data, offering one way to optimize text classification models, with strong scene adaptability, highly usable results, and a classification model that is easy to update and maintain.

Brief description of the drawings

Fig. 1 is a schematic frame diagram of the whole academic resource recommendation service system of the invention;

Fig. 2 is a schematic diagram of the LDA model;

Fig. 3 is a schematic diagram of a text before preprocessing;

Fig. 4 is a schematic diagram of the same text after preprocessing;

Fig. 5 is a schematic diagram of the topics and topic documents obtained after the training corpus is trained by the LDA topic model;

Fig. 6 is a schematic flow chart of the LDA-based academic resource acquisition method used by the present invention;

Fig. 7 is a schematic flow chart of the LDA-based text classification method used by the present invention;

Fig. 8 is a schematic diagram of the recall of the three experiments on some subjects;

Fig. 9 is a schematic diagram of the precision of the three experiments on some subjects;

Fig. 10 is a schematic diagram of the recommendation flow of the present invention.

Specific embodiment

Specific embodiments of the invention are described further below.

As shown in Fig. 1, the academic resources recommendation service system of the present invention includes a web crawler, a text classification model and an academic resource database. Academic resources are crawled from the Internet by the web crawler, classified by the text classification model into a predetermined number A of categories, and stored in the local academic resource database; an open API of the academic resource database is provided to be called by the display and resource recommendation modules. The system further includes an academic resource model, a resource quality value computation model and a user interest model, as well as a tracking software module embedded in the user's terminal for keeping track of the user's online browsing behaviour. Based on the historical browsing behaviour data of different user groups, the degree of attention that users of different identities pay to each type of academic resource is calculated. Academic resources are modelled along four dimensions: resource type, discipline distribution, keyword distribution and LDA latent topic distribution. Combining the user's interest profile with historical browsing behaviour data, the user's interest preferences are modelled; the similarity between the academic resource model and the user interest model is calculated, the recommendation degree is computed in conjunction with the resource quality value, and finally a Top-N recommendation of academic resources is made for the user according to the recommendation degree. According to the disciplines in the Ministry of Education's Catalogue of Disciplines for Postgraduate Education, all first-level disciplines are arranged into 75 discipline categories, i.e. the category number A is 75.

1. Acquisition of academic resources

The web crawler of the invention is mainly a topic crawler and further includes a corresponding LDA topic model. The LDA topic model is a three-layer "document-topic-word" Bayesian generative model, as shown in Fig. 2. The LDA topic model is trained in advance with a training corpus under a set number of topics K; before training, each training text is preprocessed, the preprocessing including word segmentation and stop-word removal. Using the word-clustering effect of LDA training, after the training corpus has been trained, the words are clustered into K topic-association word sets according to the set number of topics K; a topic-association word set is also called a topic document. When training with the LDA topic model, the number of topics K may be set between 50 and 200, preferably K = 100. Documents of every discipline and of various forms may be crawled from the network at random; for very long documents that carry a standard abstract, such as papers, only the abstract is taken. A ready-made database may also be used as the training corpus. The number of document records should reach a considerable scale, at least tens of thousands and up to millions. If the chosen number of topics K is 100, all the words of the training corpus will be clustered into 100 topic-association word sets, i.e. 100 topic documents, during LDA training. Each topic may be named manually according to the meaning of its word cluster, or left unnamed and distinguished only by a number or code; three of the topic documents are shown in Fig. 5.

On the basis of a general web crawler, the topic crawler further includes a topic determination module, a similarity computation module and a URL priority sorting module. The topic crawler consists of multiple distributed crawlers assigned by academic discipline number, each distributed crawler corresponding to one academic discipline, so that the distributed crawlers acquire the academic resources of multiple disciplines simultaneously. Each time the topic crawler crawls, its topic determination module determines the target topic and the corresponding topic document, which guides the computation of topic similarity; the similarity computation module computes and judges the topic similarity of each anchor text on the crawled page in combination with the page content, rejects hyperlinks whose combined anchor-text and page topic similarity is below a set threshold, and selects the URLs whose combined anchor-text and page topic similarity is above the set threshold. The topic crawler maintains a queue of unvisited-page URLs pointed to by the hyperlinks of visited pages; the queue is sorted in descending order of similarity, and the topic crawler visits the pages of the URLs one by one in queue order, crawls the corresponding academic resources, labels each crawled academic resource with the topic document of the current crawl, and stores it in the database, until the unvisited URL queue is empty. The academic resources crawled each time serve as new corpus material for training the LDA topic model; by repeating the crawling process continuously, the topic-association words gathered by each topic document are continually supplemented and updated, and the crawled academic resources are continually supplemented and updated until they reach a quality level that is manually recognized.

For ease of operation, the abstracts of academic resources are used as the training corpus, from which topics and topic documents are obtained through LDA topic model computation. The topic documents guide the computation of topic similarity during the topic crawler's crawling process; the crawled content is then stored in the database and serves as new corpus material for LDA training, and an open API of the academic resource database is provided for display calls. The specific steps are as follows:

Step one: download and preprocess the abstracts of existing academic resources in multiple fields, manually divide them into different categories according to academic field, and use each category as the training corpus of one LDA topic;

Step two: input the LDA topic model parameters K, α and β, where the value of K represents the number of topics, the value of α represents the weight distribution of each topic before sampling, and the value of β represents the prior distribution of each topic over words; training yields multiple finely divided topics and topic documents, each topic document being used to guide one crawler;

Step three: each crawler maintains a URL queue to be crawled; starting from the chosen high-quality seed URLs, it continually computes the similarity between the topic and both the page text and the text referred to by each anchor text in the page, updates the URL queue according to the similarity ranking, and captures the web content most relevant to the topic;

Step four: the academic resources obtained by the topic crawler are labelled with the corresponding topic, stored in the database, and used as new corpus material for training LDA, so as to update the topic documents;

Step five: provide an open API of the academic resource database for display calls.
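The five steps above centre on a priority-queue crawl loop. The following is a minimal, stdlib-only sketch of that loop; the callables `fetch` and `score` are hypothetical stand-ins for the page downloader and the topic similarity computation (the patent does not fix an API), so no network access is involved here.

```python
import heapq

def crawl(seed_urls, fetch, score, threshold=0.3, max_pages=100):
    """Sketch of the topic crawler's URL-queue maintenance.

    fetch(url) -> (text, [(anchor_text, url), ...])  # page content and out-links
    score(text) -> float                             # topic similarity in [0, 1]
    Both signatures are assumptions for illustration only.
    """
    # Max-heap via negated similarity; seed URLs start with top priority.
    queue = [(-1.0, u) for u in seed_urls]
    heapq.heapify(queue)
    visited, results = set(), []
    while queue and len(results) < max_pages:
        neg_sim, url = heapq.heappop(queue)
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch(url)
        sim = score(text)
        if sim >= threshold:
            results.append((url, sim))       # keep the resource with its score
        for anchor, out_url in links:
            # Rank unvisited links by anchor text combined with page content.
            link_sim = score(anchor + " " + text)
            if link_sim >= threshold and out_url not in visited:
                heapq.heappush(queue, (-link_sim, out_url))
    return results
```

The descending-similarity ordering of the URL queue falls out of the heap ordering; rejected links (below the threshold) are simply never enqueued, matching the filtering behaviour described above.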

Step one includes the following specific sub-steps:

(a) Corpus collection: download the abstracts of existing academic resources in multiple fields as the training corpus;

(b) Text preprocessing: extract the abstract, perform Chinese word segmentation, and remove stop words;

(c) Division into corpora: manually divide the texts into different categories according to academic field, each category serving as the training corpus of one LDA topic.

Step three includes the following specific sub-steps:

(a) Initial seed URLs: choose good seed websites oriented to the particular topic;

(b) Extract web content: download the page pointed to by the URL of highest priority, and extract the required content and URL information according to the HTML tags;

(c) Analysis and judgement of topic relevance, which determines whether the page is kept; the present invention mainly combines existing VSM and SSRM techniques to compute topic relevance;

(d) Sort the unvisited page URLs by importance;

(e) Repeat processes (b)-(d) until the unvisited URL queue is empty.

In sub-step (c), when the topic crawler analyses and judges the topic relevance of each crawled electronic document, the topic relevance of the crawled page is computed with a Generalized Vector Space Model (GVSM) that combines the two topic-similarity algorithms VSM and SSRM, and the page is kept or discarded accordingly.

A topic is represented by a group of semantically related words together with the weight of each word's relatedness to the topic, i.e. topic Z = {(w1, p1), (w2, p2), …, (wn, pn)}, where the i-th word wi is a word related to topic Z and pi is a measure of the relatedness of that word to Z. In LDA this is expressed as Z = {(w1, p(w1|zj)), (w2, p(w2|zj)), …, (wn, p(wn|zj))}, where wi ∈ W, p(wi|zj) is the probability of selecting word wi given topic zj, and zj is the j-th topic.

The document generation process is a probabilistic sampling process of the model, and includes the following specific sub-steps:

(a) for any document d in the corpus, generate the document length N, N ~ Poisson(ξ), obeying a Poisson distribution;

(b) for any document d in the corpus, generate θ ~ Dirichlet(α), obeying a Dirichlet distribution;

(c) generation of the i-th word wi of document d: first, generate a topic zj ~ Multinomial(θ), obeying a multinomial distribution; then, for topic zj, generate a discrete variable φ(zj) ~ Dirichlet(β), obeying a Dirichlet distribution; finally, generate the word of maximum probability under p(wi | zj, φ). The LDA model is shown in Fig. 2.

Here, the value of α represents the weight distribution of each topic before sampling, and the value of β represents the prior distribution of each topic over words.
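The generative sub-steps (a)-(c) can be illustrated with a small stdlib-only sampler. Two assumptions are made for the sketch: the topic-word distributions `phi` are taken as already drawn, and the final word is sampled from the topic's word distribution rather than chosen by maximum probability; Dirichlet draws are built from Gamma draws and the Poisson draw uses Knuth's method, since neither is provided directly by the stdlib `random` module.

```python
import random
import math

def dirichlet(alpha):
    # Sample from Dirichlet(alpha) via normalized Gamma draws.
    xs = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(xs)
    return [x / s for x in xs]

def poisson(lam):
    # Knuth's algorithm for a single Poisson draw.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def generate_document(alpha, phi, vocab, lam=50):
    """One pass of the LDA generative process of sub-steps (a)-(c).
    phi[j] is the (assumed pre-drawn) topic-word distribution of topic j."""
    K = len(alpha)
    n = poisson(lam)                       # (a) document length N ~ Poisson
    theta = dirichlet(alpha)               # (b) theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n):
        j = random.choices(range(K), weights=theta)[0]  # (c) topic ~ Mult(theta)
        w = random.choices(vocab, weights=phi[j])[0]    #     word ~ Mult(phi_j)
        words.append(w)
    return words
```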

The variables in the LDA model and the distributions they obey are as follows. Given the model parameters α and β, the joint distribution of the random variables θ, z and w of a document is

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

By integrating out the latent variables, the whole model can be reduced to the distribution P(w | α, β), where w denotes the words and is observable, z is the topic variable and the target product of the model, and α, β are the initial parameters of the model. Integrating over the latent variables gives

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

where N is the vocabulary length and w is a word; the integration over θ uses θ ~ Dirichlet(α). In the count notation used below, n_j^{(w)} denotes the number of times feature word w is assigned to topic j, n_j^{(·)} the number of feature words assigned to topic j, n_j^{(d)} the number of feature words in text d assigned to topic j, and n_·^{(d)} the number of all feature words in text d that have been assigned a topic.

From the above it can be seen that the three quantities influencing LDA modelling are mainly α, β and the topic number K. In order to select a relatively good topic number, α and β are fixed first, and the change of the value of the integrated formula is then computed.

When topic modelling is performed on a text set with the LDA model, the topic number K has a great influence on how well the LDA model fits the text set, so the number of topics must be preset. Here, the optimal topic number is determined by weighing the classification effect under different topic numbers, and is compared with the classification effect obtained when the model's best fit is determined by the Perplexity value. On the one hand, this method can obtain the optimal topic number more directly and accurately; on the other hand, it reveals the gap between the classification effect corresponding to the Perplexity-determined optimal topic number and the actual result. The Perplexity formula is

Perplexity(D) = exp( - Σ_{m=1}^{M} log P(d_m) / Σ_{m=1}^{M} N_m )

where M is the number of texts in the text set, N_m is the length of the m-th text, and P(d_m) is the probability that the LDA model generates the m-th text, computed as

P(d_m) = ∏_{w ∈ d_m} Σ_{j=1}^{K} p(w | z_j) p(z_j | d_m)
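Assuming the document probability is evaluated from known document-topic and topic-word distributions, the Perplexity computation just described can be sketched as:

```python
import math

def perplexity(docs, theta, phi, vocab_index):
    """Perplexity of a set of documents under an LDA model (sketch).
    theta[m][j]: doc-topic probabilities; phi[j][w]: topic-word probabilities;
    vocab_index maps a word to its column in phi. Names are illustrative."""
    total_log_p, total_words = 0.0, 0
    for m, doc in enumerate(docs):
        for word in doc:
            w = vocab_index[word]
            # P(word in d_m) = sum_j P(word | z_j) * P(z_j | d_m)
            p_w = sum(phi[j][w] * theta[m][j] for j in range(len(phi)))
            total_log_p += math.log(p_w)
        total_words += len(doc)
    # exp(- sum of log-probabilities / total token count)
    return math.exp(-total_log_p / total_words)
```

A lower value indicates a better fit; scanning K and picking the minimum is the usual Perplexity-based model selection this section compares against.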

The topic crawler of the present invention adds three modules on the basis of a general crawler: a topic determination module, a similarity computation module and a URL priority sorting module, thereby filtering and topic-matching the crawled pages and finally obtaining the content highly relevant to the topic.

1. Topic determination module: before starting work, the topic crawler must determine its related topic word set, i.e. establish the topic document. The topic word set is generally determined in one of two ways: manual determination, or extraction from an initial page collection. A manually determined topic word set is subjective in its choice of keywords, while keywords extracted from initial pages suffer from high noise and low coverage. The number of topic words is the dimension of the topic vector, and the corresponding weights are the component values of the topic vector. Denote the topic word set vector as K = {k1, k2, …, kn}, where n is the number of topic words.

2. Similarity computation module: to ensure that the pages obtained by the crawler stay as close to the topic as possible, pages must be filtered, and pages whose topic relevance is below a set threshold are rejected, so that the links in such pages are not processed in the next crawl. If the topic relevance of a page is very low, the page probably contains a few keywords only incidentally and its theme may have almost no relation to the designated topic, so there is little point in processing the links within it; this is the fundamental difference between a topic crawler and an ordinary crawler. An ordinary crawler processes all links up to the set search depth, returns a large number of useless pages as a result, and further increases the workload. Using the entire text for similarity comparison is clearly infeasible; the text usually needs to be refined and distilled into a data structure suitable for comparison and computation, while still reflecting the topic of the text as far as possible. The feature extraction used by common topic crawlers is VSM with the TF-IDF algorithm. What is used here is HowNet-based semantic similarity computation: by computing the similarity between the words of the document and the words of the topic document, the similarity between the whole article and the topic is obtained.

3. URL priority sorting module: this module mainly filters out, from the URLs never visited, the pages with potentially high similarity to the topic, and sorts them by similarity, higher similarity meaning higher priority, so that URLs of high similarity are visited first as far as possible, ensuring that the visited pages are highly relevant to the topic. When sorting unvisited URLs, the similarity of the page containing the URL and of the URL's anchor text (the text describing the URL) can be combined as factors influencing the priority ranking.

The present invention uses HowNet's definition of the semantic information of each word to compute the similarity between words. In HowNet, for two words W1 and W2, if W1 has n concepts c1^1, c2^1, …, cn^1 and W2 has m concepts c1^2, c2^2, …, cm^2, the similarity of W1 and W2 is the maximum of the similarities between each concept ci^1 of W1 and each concept cj^2 of W2:

Sim(W1, W2) = max_{i=1..n, j=1..m} Sim(ci^1, cj^2)

Thus the similarity between two words is converted into similarity computation between concepts. All concepts in HowNet are ultimately expressed in terms of sememes, so the computation of concept similarity can in turn be reduced to the computation of similarity between the corresponding sememes. Suppose concept c1 and concept c2 have p and q sememes respectively, denoted s1^1, …, sp^1 and s1^2, …, sq^2; the similarity of c1 and c2 is the maximum of the similarities between each sememe si^1 of c1 and each sememe sj^2 of c2:

Sim(c1, c2) = max_{i=1..p, j=1..q} Sim(si^1, sj^2)

Since all sememes form a tree-shaped sememe hierarchy according to hypernym-hyponym relations, sememe similarity can be computed from the semantic distance of the sememes in the hierarchy, from which concept similarity is then obtained [27]. Suppose the path distance of two sememes in the sememe hierarchy is Dis(s1, s2); then the sememe similarity is computed as

Sim(s1, s2) = α / (Dis(s1, s2) + α)

where Dis(s1, s2) is the path length of s1 and s2 in the sememe hierarchy, based on the hypernym-hyponym relations of sememes, and is a positive integer, and α is an adjustable parameter.
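A minimal sketch of the sememe path distance and the similarity it yields, on a toy hypernym tree; the decay form α/(Dis + α) and the value α = 1.6 are assumptions in the usual HowNet style, not figures fixed by this description:

```python
def ancestors(node, parent):
    """Chain from a sememe up to the root of the hierarchy (parent dict)."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def sememe_distance(s1, s2, parent):
    """Path length between two sememes in the hypernym tree."""
    a1, a2 = ancestors(s1, parent), ancestors(s2, parent)
    depth2 = {n: i for i, n in enumerate(a2)}
    for i, n in enumerate(a1):
        if n in depth2:
            return i + depth2[n]   # up from s1 to common ancestor, down to s2
    return len(a1) + len(a2)       # no common ancestor: large fallback distance

def sememe_similarity(s1, s2, parent, alpha=1.6):
    # Assumed HowNet-style form: similarity decays with path distance.
    return alpha / (sememe_distance(s1, s2, parent) + alpha)

def word_similarity(concepts1, concepts2, parent):
    # Word similarity = maximum over all concept pairs, as in the text above.
    return max(sememe_similarity(a, b, parent)
               for a in concepts1 for b in concepts2)
```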

The design of the topic crawler of the present invention is a functional expansion on the basis of an ordinary crawler. The steps of the whole page-processing procedure are: determination of initial seed URLs, extraction of web content, topic relevance analysis, and URL sorting.

(a) Initial seed URLs: good seed websites oriented to the particular topic are chosen, so that the topic crawler can launch its crawling work smoothly.

(b) Extraction of web content: the page pointed to by the URL of highest priority is downloaded, and the required content and URL information are extracted according to the HTML tags.

(c) Topic relevance analysis is the core module of the topic crawler; it determines whether a page is kept. The present invention mainly uses a Generalized Vector Space Model (GVSM) combining existing VSM and SSRM techniques to compute topic relevance.

For the topic relevance analysis, text keywords are extracted with TF-IDF, the weights of the words are computed, and the relevance of the page is analysed.

TF-IDF relevance computation:

w_di = tf_i × idf_i,  tf_i = f_i / f_max,  idf_i = log(N / N_i)

where w_di is the weight of word i in document d, tf_i is the term frequency of word i, idf_i is the inverse document frequency of word i, f_i is the number of times word i occurs in document d, f_max is the highest occurrence count among all words of document d, N is the total number of documents, and N_i is the number of documents containing word i. TF-IDF remains the most effective current method for extracting keywords and computing word weights.
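The TF-IDF weighting just described (term frequency normalized by the most frequent term, inverse document frequency as the log of the document ratio) can be sketched as:

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, doc_freq, n_docs):
    """TF-IDF as described: tf_i = f_i / f_max, idf_i = log(N / N_i).
    doc_freq[w] gives N_i, the number of documents containing word w."""
    counts = Counter(doc_tokens)
    f_max = max(counts.values())
    weights = {}
    for word, f in counts.items():
        tf = f / f_max
        idf = math.log(n_docs / doc_freq[word])
        weights[word] = tf * idf
    return weights
```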

VSM topic relevance computation:

Sim(d, t) = Σ_{i=1}^{n} w_di · w_ti / ( sqrt(Σ_i w_di²) · sqrt(Σ_i w_ti²) )

where the numerator sums over the n words common to document d and topic t, and w_di, w_ti are the TF-IDF values of word i in document d and in topic t respectively. This algorithm only considers the frequency vectors of identical words appearing in the texts and judges document similarity on that basis, without considering semantic relations between words, such as near-synonyms and synonyms, which affects the accuracy of the similarity.

SSRM topic relevance computation:

Sim(d, t) = Σ_{i=1}^{n} Σ_{j=1}^{m} w_di · w_tj · Sem_ij / Σ_{i=1}^{n} Σ_{j=1}^{m} w_di · w_tj

where w_di and w_tj are the TF-IDF values of word i in document d and word j in topic t, n and m are respectively the numbers of words of document d and topic t, and Sem_ij is the semantic similarity of word i and word j, computed from concept similarity:

Sem(C1, C2) = 2 · Depth(C3) / ( Path(C1, C3) + Path(C2, C3) + 2 · Depth(C3) )

where C1 and C2 are two concepts, corresponding to words w1 and w2, Sem(C1, C2) is the semantic similarity of concepts C1 and C2, C3 is the lowest common concept shared by C1 and C2, Path(C1, C3) is the number of nodes on the path from C1 to C3, Path(C2, C3) is the number of nodes on the path from C2 to C3, and Depth(C3) is the number of nodes on the path from C3 to the root node in the given ontology. Using the SSRM algorithm alone, only semantic relations are considered: if the words of two articles are all near-synonyms or synonyms of one another, the document similarity will be computed as 1, i.e. identical, which is clearly a shortcoming in accuracy.

The present invention uses a method combining VSM and SSRM to compute the similarity, also called the Generalized Vector Space Model (GVSM), which combines the VSM similarity and the SSRM similarity into a single topic similarity Sim(d_k, t) of document d_k and topic t. The present invention thereby takes into account both the word-frequency factor of documents and the semantic relations between words; combining the VSM method with SSRM effectively improves the precision of topic similarity computation.
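The two component scores can be sketched and combined as below. The VSM and SSRM parts follow the formulas of this section; the blend itself (a λ-weighted average) is an assumption for illustration, since the exact combination is not spelled out here.

```python
import math

def vsm_similarity(wd, wt):
    """Cosine similarity over shared terms (the VSM formula).
    wd, wt: dicts mapping word -> TF-IDF weight."""
    shared = set(wd) & set(wt)
    num = sum(wd[i] * wt[i] for i in shared)
    den = math.sqrt(sum(v * v for v in wd.values())) * \
          math.sqrt(sum(v * v for v in wt.values()))
    return num / den if den else 0.0

def ssrm_similarity(wd, wt, sem):
    """SSRM: every document/topic term pair contributes, weighted by sem(i, j)."""
    num = sum(wd[i] * wt[j] * sem(i, j) for i in wd for j in wt)
    den = sum(wd[i] * wt[j] for i in wd for j in wt)
    return num / den if den else 0.0

def gvsm_similarity(wd, wt, sem, lam=0.5):
    # Assumed combination: a lam-weighted blend of the two scores.
    return lam * vsm_similarity(wd, wt) + (1 - lam) * ssrm_similarity(wd, wt, sem)
```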

(d) The unvisited page URLs are sorted by importance. The URLs are ranked with a priority formula in which priority(h) is the priority value of an unvisited hyperlink h, N is the number of crawled pages containing h, Sim(f_p, t) is the full-text topic similarity of a page p containing hyperlink h, Sim(a_h, t) is the topic similarity of the anchor text of hyperlink h, and λ is a weight regulating full text versus anchor text. The similarity computations in the formula likewise use the combined VSM-SSRM method, optimizing the priority ranking of the URL queue to be crawled and likewise effectively improving the accuracy of topic-specific academic resource acquisition.
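A sketch of the priority computation under an assumed form of the formula: the mean full-text similarity of the N pages containing hyperlink h, blended with the anchor-text similarity by the weight λ. The exact form is not recoverable here, so this is one plausible reading.

```python
def url_priority(page_fulltext_sims, anchor_sim, lam=0.7):
    """Assumed priority of an unvisited hyperlink h.
    page_fulltext_sims: Sim(f_p, t) for each of the N pages containing h;
    anchor_sim: Sim(a_h, t); lam: weight between full text and anchor text."""
    n = len(page_fulltext_sims)
    avg_page_sim = sum(page_fulltext_sims) / n if n else 0.0
    return lam * avg_page_sim + (1 - lam) * anchor_sim
```

Sorting the unvisited queue by this value in descending order reproduces the ordering behaviour described in step (d).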

The topic crawler of the present invention is a network information capture tool that exists specifically for capturing the resources of a certain discipline. Compared with an ordinary web crawler, the purpose of a topic crawler is to capture web information related to a particular topic; it must judge whether to capture a page by computing the relevance between the page and the topic, and it maintains a URL queue to be crawled, visiting pages according to URL priority so as to ensure that pages of high relevance are visited first.

Current topic crawlers have some defects. (1) A topic crawler must determine its related topic word set before work. The topic word set is generally determined in one of two ways: manual determination, or analysis of an initial page collection. The manual method carries a certain subjectivity, while the method of extracting keywords from initial pages is generally insufficient in topic coverage; both traditional methods cause considerable deviation when the topic crawler computes page topic similarity. (2) The core of current text-based heuristic topic crawlers is page similarity computation, which judges whether the currently crawled page is close to the topic. Besides the accuracy of the topic determination module, the most important factor is the similarity computation algorithm; what is generally used is VSM (the vector space model), which assumes that different words are unrelated, represents texts as term vectors, and computes document similarity through shared word frequencies. This algorithm often ignores the semantic relations between words and lowers the similarity value of articles that are semantically highly related.

The design of the topic crawler of the present invention is based on a general crawler, with three core modules added: a topic determination module, a topic similarity computation module and a to-be-crawled URL ordering module. Addressing the above deficiencies, the present invention proposes a topic crawler based on the LDA topic model, and improves the topic similarity algorithm and the URL priority sorting algorithm, raising the content quality and accuracy of the topic crawler from the initial crawl onward throughout the crawling process. The main contributions are: (1) through the LDA topic model, the semantic information of corpus topics is deeply mined, building a sound guidance basis for the topic crawler; machine learning is integrated into the resource acquisition method, improving the accuracy and quality of resource acquisition. (2) In the topic similarity computation module of the topic crawler, a HowNet-based semantic similarity computation method is used, balancing cosine similarity and semantic similarity to achieve a better topic matching effect.

2. Classification of academic resources

The present invention adopts an LDA-based text classification method, as shown in Fig. 7, with a Bayesian probability computation model as the text classification model. A group of feature words that best embody the characteristics of the text to be classified is extracted as the feature word set input to the text classification model; the original feature word set is the front portion of the original word set after sorting by feature weight. The text classification model computes the probability that the feature word combination belongs to each of the predetermined A categories, and the category with the largest probability value is taken as the category of the text. According to the disciplines in the Ministry of Education's Catalogue of Disciplines for Postgraduate Education, all disciplines are divided into 75 discipline categories, i.e. the category number A is 75. The LDA topic model described above and the 100 topic documents obtained by its training are used to assist the text classification model in text classification. In addition, verification corpora with predetermined categories are classified in advance by the text classification model for category verification, so as to obtain the classification accuracy of the text classification model for each of the A categories, which serves as the classification confidence indicator of the text classification model for each category. The accuracy is the proportion of correctly classified corpora among all verification corpora assigned to a certain category by the text classification model, and a classification accuracy threshold is preset; a preset classification accuracy threshold of 80% is suitable for the category verification of the text classification model. When the text classification model is used to classify each text to be classified, the following steps are specifically included:

Step one: after preprocessing each text to be classified, compute the feature weight of every word of the text; the feature weight of a word is directly proportional to the number of times it occurs in the text and inversely proportional to the number of times it occurs in the training corpus. The computed word set is sorted in descending order of feature weight, and the front portion of the original word set of each text to be classified is extracted as its feature word set.

Step two: using the text classification model, take the original feature word set of each text to be classified and compute the probability values that the text may belong to each of the predetermined A categories; choose the category with the largest probability value as the classification category of the text;

Step three: judge the text classification result of step two; if the classification accuracy value of the text classification model for that category reaches the set threshold, the result is output directly; if the classification accuracy value of the text classification model for that category does not reach the set threshold, proceed to step four;

Step four: input each preprocessed text into the LDA topic model, compute with the LDA topic model the weight value of each of the K set topics for the text, and select the topic with the largest weight value. The first Y words among the topic-association words under that topic, obtained through prior LDA topic model training, are added to the original feature word set of the text to form the expanded feature word set. The text classification model is then used again to compute the probability values that the text may belong to each of the predetermined A categories, and the category with the largest probability value is chosen as the final classification category of the text. Specifically, 10 to 20 words may be used; for example, the first 15 topic-association words are added to the original feature word set of the text to form the expanded feature word set. It does not matter if a newly added word duplicates an original feature word.
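The flow of steps two to four (classify, check per-category confidence, selectively expand with LDA topic words, re-classify) can be sketched with hypothetical callables standing in for the Bayes classifier and the LDA topic model:

```python
def classify_with_expansion(features, classifier, confidence, topic_model,
                            topic_words, threshold=0.8, top_y=15):
    """Selective feature expansion, sketched with assumed callables:
      classifier(features) -> (category, probability)
      confidence[category] -> verified per-category accuracy of the classifier
      topic_model(features) -> index of the highest-weight LDA topic
      topic_words[k]       -> topic-association words of topic k, ranked
    """
    category, _ = classifier(features)
    # Step three: trust the result when the per-category accuracy is high enough.
    if confidence[category] >= threshold:
        return category
    # Step four: expand a low-confidence feature set with the top-Y words of
    # the dominant LDA topic, then classify again (duplicates are harmless).
    k = topic_model(features)
    expanded = features + topic_words[k][:top_y]
    category, _ = classifier(expanded)
    return category
```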

The main computation formula of the text classification model is

P(c_j | x_1, x_2, …, x_n) = P(c_j) P(x_1, x_2, …, x_n | c_j) / P(x_1, x_2, …, x_n)

where P(c_j | x_1, x_2, …, x_n) represents the probability that the text belongs to category c_j when the feature words (x_1, x_2, …, x_n) occur simultaneously; P(c_j) represents the proportion of texts belonging to category c_j in the training text set; P(x_1, x_2, …, x_n | c_j) represents the probability that, if the text to be classified belongs to category c_j, the feature word set of the text is (x_1, x_2, …, x_n); and the denominator P(x_1, x_2, …, x_n) is the joint probability of the given feature words over all categories.

Clearly, for all given categories the denominator P(x_1, x_2, …, x_n) is a constant, and the model's classification result is the category of maximum probability; solving for the maximum of the above therefore reduces to solving

c = argmax_{c_j} P(c_j) P(x_1, x_2, …, x_n | c_j)

Again, according to the Bayesian assumption, the text feature vector attributes x_1, x_2, …, x_n are independent and identically distributed, and their joint probability distribution equals the product of the probability distributions of the individual attribute features, i.e.:

P(x_1, x_2, …, x_n | c_j) = Π_i P(x_i | c_j)

So the above becomes

c(x) = argmax_{c_j} P(c_j) Π_i P(x_i | c_j)

which is the classification function required for classification.

The probability values P(c_j) and P(x_i | c_j) in the classification function are unknown; therefore, in order to compute the maximum of the classification function, the prior probabilities are estimated respectively as follows:

P(c_j) = N(C = c_j) / N

where N(C = c_j) represents the number of training samples belonging to category c_j, and N represents the total number of training samples;

P(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) / ( N(C = c_j) + M )

where N(X_i = x_i, C = c_j) represents the number of training samples of category c_j that contain attribute x_i; N(C = c_j) represents the number of training samples in category c_j; and M represents the number of keywords in the training sample set after stop words are removed.
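Putting the classification function and the two estimators together gives a small classifier sketch. The reading adopted here (an assumption): N(X_i = x_i, C = c_j) counts samples of the category that contain the word, and the +1 / +M terms act as add-one smoothing over the M keywords.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Bayesian text classifier with the estimators above:
    P(c) = N(C=c)/N and P(x|c) = (N(X=x, C=c) + 1) / (N(C=c) + M)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # N(C=c)
        self.n = len(labels)                         # N, total training samples
        self.word_counts = defaultdict(Counter)      # N(X=x, C=c), per category
        vocab = set()
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(set(doc))     # sample contains attribute x
            vocab.update(doc)
        self.m = len(vocab)                          # M, distinct keywords
        return self

    def predict(self, features):
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # Work in log space to avoid underflow on long feature sets.
            score = math.log(nc / self.n)
            for x in features:
                score += math.log((self.word_counts[c][x] + 1) / (nc + self.m))
            if score > best_score:
                best, best_score = c, score
        return best
```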

LDA is a kind of statistics topic model to discrete data set modeling that Blei et al. was proposed in 2003, is one Three layers of Bayes's generation model of " document-theme-word ".Initial model only introduces one to " document-theme " probability distribution Hyper parameter is distributed its obedience Dirichlet, and subsequent Griffiths et al. have also been introduced one to " theme-word " probability distribution Hyper parameter makes it obey Dirichlet distributions.LDA models are as shown in Figure 2.Wherein:N is the word quantity of this document, and M is text The number of documents that shelves are concentrated, K is the theme number,Be the theme-the probability distribution of word, and θ is the probability distribution of document-theme, and Z is Implicit variable represents theme, and W is word, and α is the super ginseng of θ, and β isSuper ginseng.

An LDA topic model regards a document as a set of words with no ordering between them. A document may contain multiple topics, each word in the document is generated by some topic, and the same word may belong to different topics; LDA topic models are therefore typical bag-of-words models.

The key to training an LDA model is inferring the distributions of the latent variables, i.e. obtaining the document-topic distribution \(\theta\) and the topic-word distribution \(\varphi\) of the target text. Given the model parameters \(\alpha\) and \(\beta\), the joint distribution of the random variables \(\theta\), z and w of a text d is

\(p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\)

Since the above formula contains multiple latent variables, directly computing \(\theta\) and \(\varphi\) is intractable, so parameter estimation must be used for inference. Common parameter estimation algorithms include expectation maximization (Expectation Maximization, EM), variational Bayesian inference and Gibbs sampling. Gibbs sampling is used here to infer the model parameters: Griffiths pointed out that Gibbs sampling outperforms variational Bayesian inference and EM in perplexity and training speed. Because of local maxima of its likelihood function, EM often finds only a locally optimal model; the model obtained by variational Bayesian inference deviates from the true distribution; whereas Gibbs sampling can quickly and effectively extract topic information from large-scale data sets, and has become the most popular inference algorithm for LDA models.

MCMC is a family of iterative methods for drawing sample values from a complicated probability distribution, and Gibbs sampling is a simple realization of MCMC; its aim is to construct a Markov chain that converges to a specific distribution and to draw from that chain samples close to the target probability distribution. During training, the algorithm samples only the topic variable \(z_i\); its conditional probability is computed as

\(P(z_i = k \mid z_{-i}, w) \propto \dfrac{n_i - 1 + \beta}{n^{(k)} - 1 + V\beta} \cdot \dfrac{n_k - 1 + \alpha}{N_d - 1 + K\alpha}\)

wherein \(n_i\) is the count of word \(w_i\) under topic k, \(n_k\) is the count of topic k in the document, \(n^{(k)}\) is the total count of words assigned to topic k, \(N_d\) is the number of words in the document, and V is the vocabulary size.

Wherein the meaning of the left side of the equation is: the probability that the current word \(w_i\) belongs to topic k, under the condition that the topics of all other words are known. On the right side, \(n_i - 1\) is the count of the i-th word under the k-th topic minus one, and \(n_k - 1\) is the count of the k-th topic in this document minus one; the first factor is the probability of the word \(w_i\) under topic k, and the second factor is the probability of the k-th topic in this document.

The concrete steps of Gibbs sampling are as follows:

1) Initialization: each word \(w_i\) is randomly assigned a topic, \(z_i\) being the topic of word i; \(z_i\) is initialized to a random integer between 1 and K, for i from 1 to N, where N is the number of feature words of the text set. This is the initial state of the Markov chain;

2) i is cycled from 1 to N; the probability that the current word \(w_i\) belongs to each topic is computed according to formula (2), and the topic of word \(w_i\) is resampled according to these probabilities, giving the next state of the Markov chain;

After iterating step 2) a sufficient number of times, the Markov chain is considered to have reached a stable state, and at this point every word of the document has a specific topic. For each document, the values of the document-topic distribution \(\theta\) and the topic-word distribution \(\varphi\) can be estimated by the following equations:

\(\theta_{d,k} = \dfrac{n_d^{(k)} + \alpha}{n_d^{(\cdot)} + K\alpha}, \qquad \varphi_{k,w} = \dfrac{n_k^{(w)} + \beta}{n_k^{(\cdot)} + V\beta}\)

Wherein \(n_k^{(w)}\) denotes the number of times feature word w is assigned to topic k; \(n_k^{(\cdot)}\) denotes the number of feature words assigned to topic k; \(n_d^{(k)}\) denotes the number of feature words in text d assigned to topic k; and \(n_d^{(\cdot)}\) denotes the number of feature words in text d that have been assigned a topic.
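The sampling steps and the estimates of θ and φ can be sketched as a compact collapsed Gibbs sampler; the hyperparameter values, the word-id corpus format and the two-topic toy corpus in the test are illustrative assumptions, not taken from the patent:

```python
import random

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA over docs given as lists of word ids."""
    rng = random.Random(seed)
    V = 1 + max(w for doc in docs for w in doc)      # vocabulary size
    n_wk = [[0] * K for _ in range(V)]               # word-topic counts
    n_k = [0] * K                                    # tokens per topic
    n_dk = [[0] * K for _ in docs]                   # topic counts per document
    z = []                                           # topic of every token
    for d, doc in enumerate(docs):                   # step 1: random init
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t); n_wk[w][t] += 1; n_k[t] += 1; n_dk[d][t] += 1
        z.append(zd)
    for _ in range(iters):                           # step 2: resample topics
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                          # drop current assignment
                n_wk[w][t] -= 1; n_k[t] -= 1; n_dk[d][t] -= 1
                # word-under-topic term times topic-in-document term
                p = [(n_wk[w][t] + beta) / (n_k[t] + V * beta) *
                     (n_dk[d][t] + alpha) for t in range(K)]
                r, acc, new_t = rng.uniform(0, sum(p)), 0.0, K - 1
                for t, pt in enumerate(p):
                    acc += pt
                    if r <= acc:
                        new_t = t
                        break
                z[d][i] = new_t
                n_wk[w][new_t] += 1; n_k[new_t] += 1; n_dk[d][new_t] += 1
    # estimate theta (document-topic) and phi (topic-word)
    theta = [[(n_dk[d][t] + alpha) / (len(docs[d]) + K * alpha)
              for t in range(K)] for d in range(len(docs))]
    phi = [[(n_wk[w][t] + beta) / (n_k[t] + V * beta)
            for w in range(V)] for t in range(K)]
    return theta, phi
```

Decrementing the counts before computing p implements the "minus 1" in the conditional probability: the current token is excluded from its own conditioning counts.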

The classification accuracy of the text classification model is used as its confidence index and is computed as a proportion, with the following formula:

\(\text{accuracy}_i = \dfrac{N_i}{M_i}\)

Wherein i denotes a category, \(N_i\) denotes the number of times the classifier correctly predicts category i, and \(M_i\) denotes the total number of times the classifier predicts category i.

Precision P, recall R and their combined evaluation index \(F_1\) can be used as the final evaluation indices. Precision P measures the proportion of the test samples judged to belong to a category that are correctly judged; recall R measures the proportion of all test samples of a category that are correctly judged. Taking a certain category \(C_i\) as an example, \(n_{++}\) denotes the number of samples correctly judged to belong to category \(C_i\), \(n_{+-}\) denotes the number of samples that do not belong to but are judged as category \(C_i\), and \(n_{-+}\) denotes the number of samples that belong to but are judged as not belonging to category \(C_i\). For category \(C_i\), the recall R, precision P and combined index \(F_1\) are:

\(R = \dfrac{n_{++}}{n_{++} + n_{-+}}, \qquad P = \dfrac{n_{++}}{n_{++} + n_{+-}}, \qquad F_1 = \dfrac{2PR}{P + R}\)
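These per-category indices can be computed directly from the three counts; the labels in the test are illustrative:

```python
def prf1(y_true, y_pred, cls):
    """Per-category precision P, recall R and F1 from the n++, n+-, n-+ counts."""
    n_pp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))  # n++
    n_pm = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))  # n+-
    n_mp = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))  # n-+
    P = n_pp / (n_pp + n_pm) if n_pp + n_pm else 0.0
    R = n_pp / (n_pp + n_mp) if n_pp + n_mp else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1
```

Averaging these values over all 75 categories gives the average recall, average precision and average F1 used in the experimental comparison below.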

The inventors carried out three groups of experiments: Experiment 1, a classifier performance test based on the original feature set; Experiment 2, a classifier performance test based on the expanded feature set; Experiment 3, a classifier performance test based on the selectively expanded feature set, with the confidence threshold set to 0.8. Table 2 shows the recall and precision of the three experiments on some of the disciplines:

Table 2: recall and precision for some of the disciplines

As Table 2 shows, in the test based on the original feature set the recall of the history discipline is high while its precision is low, indicating that much data not belonging to the history discipline was classified as history by the classifier. At the same time, the recall of the history-of-science-and-technology discipline is low, indicating that much data originally belonging to this discipline was classified into other disciplines; since the two disciplines are very similar, it is likely that much data belonging to history of science and technology was classified as history by the classifier. A similar situation appears between the geological-resources-and-geological-engineering discipline and the geology discipline. The expanded feature set improves on the above problems but adversely affects the disciplines that previously had high discrimination. Selective feature expansion, on the one hand, avoids affecting the disciplines with high discrimination and, on the other hand, improves to some extent those disciplines whose discrimination is low because of insufficient information.

From the above experimental results, the average recall, average precision and average \(F_1\) value of the three experiments can be computed. The results are as follows:

Table 3: experimental comparison

As can be seen from Table 3, facing a complicated classification scene, the method of the present invention based on selective feature expansion has better adaptability than the method based on the original feature set or on the expanded feature set; its average recall, average precision and average \(F_1\) value are clearly higher than those of the other schemes, and it achieves a good practical effect.

Fig. 6 is a schematic diagram of the recall of the three experiments on some of the disciplines; Fig. 7 is a schematic diagram of their precision on some of the disciplines.

With the arrival of the big-data era, resource classification faces growing challenges, and different application scenarios require different classification techniques; no single technique fits all classification tasks. The method based on selective feature expansion proposed by the present invention suits complicated application scenarios: it selectively adds discipline information to data with little information while avoiding adding noise to data whose information is already sufficient, so the method of the invention has universal adaptability.

3. Recommendation of academic resources

The process by which the present invention recommends corresponding academic resources to a user includes a cold-start recommendation stage and a secondary recommendation stage. The cold-start recommendation stage recommends, based on the user's interest disciplines, high-quality resources matching those disciplines; the high-quality resources are the academic resources whose resource quality values, computed by the resource quality value computation model, are comparatively high, the resource quality value being the arithmetic mean or weighted mean of the resource's authority, community popularity and recency. In the secondary recommendation stage, the user interest model and the resource model are each built, the similarity between the two models is computed, the recommendation degree is computed in combination with the resource quality value, and finally Top-N academic resource recommendation is performed for the user according to recommendation degree.

1. Cold-start stage recommendation algorithm:

Table 4: attributes and evaluation criteria of the five major resource classes

High-quality academic resources can attract and retain new users. In the cold-start stage, high-quality resources matching the user's interest disciplines are to be recommended. High-quality resources are academic resources with high quality values; the evaluation criteria of the quality value mainly include attributes such as authority, community popularity and recency. The attributes and evaluation criteria of the five major resource classes are shown in Table 4.

The authority Authority of a paper is computed as follows:

\(\text{Authority} = \frac{1}{2}\text{Level} + \frac{1}{2}\text{Cite}\)  (1)

Level is the score after the rank of the publication in which the paper appeared is quantized. The publication rank is divided into 5 grades, scored 1, 0.8, 0.6, 0.4 and 0.2 respectively: top journals or conferences such as Nature and Science score 1, the second level such as ACM Transactions scores 0.8, and the lowest level scores 0.2. Cite is computed as follows:

Cite = Cites / maxCite  (2)

Cite is the quantized citation result of the paper, Cites is the citation count of the paper, and maxCite is the maximum citation count in the paper source database.

The authority of the other four resource classes is computed similarly to that of papers; only the quantization method differs.

The community popularity Popularity of a paper is computed as follows:

Popularity = readTimes / maxReadTimes  (3)

readTimes is the number of times the paper has been read, and maxReadTimes is the maximum read count in the paper source database.

The recency Recentness of all resources is computed in the same way, with the following formula:

\(\text{Recentness} = \dfrac{12(\text{year} - \text{minYear}) + (\text{month} - \text{minMonth})}{12(\text{maxYear} - \text{minYear}) + (\text{maxMonth} - \text{minMonth})}\)  (4)

year and month are the year and month in which the resource was published. minYear, minMonth, maxYear and maxMonth are the earliest and latest publication years and months of all resources in the source database of that resource class.

The quality value Quality of a paper is computed as follows:

\(\text{Quality} = \frac{1}{3}\text{Authority} + \frac{1}{3}\text{Popularity} + \frac{1}{3}\text{Recentness}\)  (5)
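Formulas (1) to (5) combine into one short routine; all argument values in the test are illustrative:

```python
def quality_value(level, cites, max_cite, read_times, max_read_times,
                  year, month, min_year, min_month, max_year, max_month):
    """Resource quality value per formulas (1)-(5): the equal-weight mean of
    authority, community popularity and recency."""
    cite = cites / max_cite                                    # (2)
    authority = 0.5 * level + 0.5 * cite                       # (1)
    popularity = read_times / max_read_times                   # (3)
    recentness = (12 * (year - min_year) + (month - min_month)) / (
        12 * (max_year - min_year) + (max_month - min_month))  # (4)
    return (authority + popularity + recentness) / 3           # (5)
```

Because every component is normalized by the corresponding maximum in the source database, the quality value lies in [0, 1], which makes resources of different classes comparable at cold start.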

2. Secondary recommendation stage algorithm:

This stage uses a recommendation method that fuses user behavior and resource content: the user interest model and the resource model are each built, the similarity between the two is computed, the recommendation degree is computed in combination with the resource quality value, and recommendation is finally made according to recommendation degree.

Academic resources model is expressed as follows:

Mr={ Tr, Kr, Ct, Lr} (6)

Wherein \(T_r\) is the discipline distribution vector of the academic resource, i.e. the 75 probability values of the resource over the disciplines, obtained by the Bayesian multinomial model.

\(K_r = \{(k_{r1}, \omega_{r1}), (k_{r2}, \omega_{r2}), \ldots, (k_{rm}, \omega_{rm})\}\), where m is the number of keywords, \(k_{ri}\) (1 ≤ i ≤ m) denotes the i-th keyword of the single academic resource, and \(\omega_{ri}\) is the weight of keyword \(k_{ri}\), obtained by an improved tf-idf algorithm with the following formula:

w(i, r) denotes the weight of the i-th keyword in document r, tf(i, r) denotes the frequency with which the i-th keyword appears in document r, Z denotes the total number of records in the document set, and L denotes the number of documents containing keyword i.
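A sketch of the keyword weighting in terms of these symbols; since the patent's improved formula itself is not reproduced here, the classic tf · log(Z / L) form is assumed:

```python
import math

def tfidf_weights(docs):
    """Keyword weight w(i, r) per document, using the classic tf-idf form
    tf(i, r) * log(Z / L) as an assumed stand-in for the improved formula."""
    Z = len(docs)                       # total records in the document set
    df = {}                             # L: number of documents containing keyword i
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        n = len(doc)
        weights.append({w: (doc.count(w) / n) * math.log(Z / df[w])
                        for w in set(doc)})
    return weights
```

A keyword that appears in every document gets weight 0, so only discipline-discriminating keywords survive into \(K_r\).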

\(L_r\) is the LDA latent topic distribution vector, \(L_r = \{l_{r1}, l_{r2}, l_{r3}, \ldots, l_{rN1}\}\), where N1 is the number of latent topics.

\(C_t\) is the resource type; t may take the values 1 to 5, corresponding to the five major classes of academic resources: scientific papers, academic patents, academic news, academic conferences and academic books.

According to the behavioral characteristics of users of mobile software, a user's operation behaviors on an academic resource are divided into opening, reading, star rating, sharing and favoriting. Star rating is an explicit behavior; the others are implicit behaviors. Explicit behavior can clearly reflect the degree of the user's interest preference; for example, a higher star rating indicates that the user likes the resource more. Implicit behavior cannot clearly reflect the user's interest preference, but the amount and value of the information it contains are often greater than those of explicit feedback.

The user interest model is based primarily on the academic resources the user has browsed. According to the user's different viewing behaviors, combined with the academic resources model, the user interest model can be built; this model is adjusted dynamically as the user's interest changes. The user interest model is expressed as follows:

Mu={ Tu, Ku, Ct, Lu} (8)

Wherein \(T_u\) is the user's discipline preference distribution vector, formed after user behavior from the discipline distribution vectors \(T_r\) of the academic resources the user has browsed over a period of time, i.e.

\(T_u = \dfrac{1}{sum} \sum_{j=1}^{sum} s_j \times T_{jr}\)  (9)

Wherein sum is the total number of academic resources on which the user has produced behavior, \(s_j\) is the "behavior coefficient" after the user produces behavior on academic resource j, a larger value indicating that the user likes the resource more, and \(T_{jr}\) denotes the discipline distribution vector of the j-th resource. The computation of \(s_j\) comprehensively considers behaviors such as opening, reading, rating, favoriting and sharing, and can accurately reflect the user's degree of preference for the resource.
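Formula (9) can be sketched as the behavior-weighted mean of the resources' discipline vectors; the two-discipline vectors in the test are illustrative:

```python
def user_subject_preference(t_jr_list, s_list):
    """T_u per formula (9): behavior-weighted mean of the discipline
    distribution vectors T_jr of the resources the user interacted with."""
    total = len(t_jr_list)                      # sum: resources with behavior
    dims = len(t_jr_list[0])
    tu = [0.0] * dims
    for t_jr, s_j in zip(t_jr_list, s_list):
        for k in range(dims):
            tu[k] += s_j * t_jr[k] / total
    return tu
```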

\(K_u = \{(k_{u1}, \omega_{u1}), (k_{u2}, \omega_{u2}), \ldots, (k_{uN2}, \omega_{uN2})\}\) is the user's keyword preference distribution vector, where N2 is the number of keywords, \(k_{ui}\) (1 ≤ i ≤ N2) denotes the i-th user preference keyword, and \(\omega_{ui}\) is the weight of keyword \(k_{ui}\), computed from the keyword distribution vectors \(K_r\) of all academic resources on which user u produced behavior over a period of time:

\(K'_{jr} = s_j \times K_{jr}\)  (10)

The new keyword distribution vector of each academic resource is computed according to formula (10); the TOP-N2 keywords of the new keyword distribution vectors of all resources are then chosen as the user's keyword preference distribution vector \(K_u\).
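A sketch of this TOP-N2 selection; how weights for a keyword that appears in several resources are merged is not specified, so taking the maximum scaled weight is an assumption here:

```python
def keyword_preference(resource_keywords, s_list, n2):
    """K_u sketch: scale each resource's keyword weights by its behavior
    coefficient (formula 10), then keep the N2 highest-weighted keywords.
    Merging duplicate keywords by maximum is an assumed choice."""
    scored = {}
    for kw, s_j in zip(resource_keywords, s_list):
        for k, w in kw.items():
            scored[k] = max(scored.get(k, 0.0), s_j * w)
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:n2]
    return dict(top)
```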

\(L_u\) is the user's LDA latent topic preference distribution vector, computed from the LDA latent topic distribution vectors \(L_r = \{l_{r1}, l_{r2}, l_{r3}, \ldots, l_{rN1}\}\) of the academic resources, by the same method as \(T_u\).

Computation of the behavior coefficient: s denotes the behavior coefficient, T is the reading-time threshold, and δ is an adjustment parameter. The reading-time threshold is added to guard against accidental clicks, so its value is very small. If the time the user spends reading resource j is below the threshold T, the access is regarded as an accidental click and s = 0. Under the condition that the user is willing to spend a longer time reading, i.e. the reading time is greater than or equal to T: if the user gives a rating and the rating is above the mean of all his previous ratings, he is considered to like j, and s is increased by δ; if the user favorites or shares j, he is considered to like j very much, and s is increased by δ. Here reading, rating, favoriting and sharing are taken to reflect the user's interest preference from shallow to deep. The value of s depends mainly on the initial value and the adjustment parameter δ; since all user behaviors are to be mapped into values from 0 to 2, the initial value is 1 and the adjustment parameter δ = 0.333333.
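The rules above can be sketched as follows; the threshold T = 30 seconds in the default is an illustrative assumption, since the patent does not give a concrete value:

```python
def behavior_coefficient(read_time, T=30.0, rating=None, mean_rating=None,
                         favorited=False, shared=False, delta=0.333333):
    """Behavior coefficient s in [0, 2]; T=30s is an assumed reading-time
    threshold, the step order follows the description above."""
    if read_time < T:          # accidental click
        return 0.0
    s = 1.0                    # initial value for a genuine read
    if rating is not None and mean_rating is not None and rating > mean_rating:
        s += delta             # rating above the user's historical mean
    if favorited:
        s += delta
    if shared:
        s += delta
    return s
```

With the initial value 1 and three possible increments of δ, the maximum is 1 + 3δ ≈ 2, matching the intended 0-to-2 range.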

Similarity computation between the academic resources model and the user interest model:

Academic resources model is represented:

Mr={ Tr, Kr, Ct, Lr} (12)

User interest model is represented:

Mu={ Tu, Ku, Ct, Lu} (13)

The similarity between the user's discipline preference distribution vector \(T_u\) and the academic resource's discipline distribution vector \(T_r\) is computed by cosine similarity, i.e.:

The similarity between the user's LDA latent topic distribution vector \(L_u\) and the academic resource's LDA latent topic distribution vector \(L_r\) is computed by cosine similarity, i.e.:

The similarity between the user's keyword preference distribution vector \(K_u\) and the academic resource's keyword distribution vector \(K_r\) is computed by Jaccard similarity:

The similarity between the user interest model and the academic resource model is then:

Wherein σ + ρ + τ = 1; the specific weight distribution is obtained by experimental training.
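The three component similarities and their weighted combination can be sketched as follows; the weights σ = 0.4, ρ = 0.3, τ = 0.3 are illustrative placeholders, since the trained values are not given:

```python
import math

def cosine(a, b):
    """Cosine similarity between two distribution vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def jaccard(a, b):
    """Jaccard similarity between two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def model_similarity(Tu, Tr, Lu, Lr, Ku, Kr, sigma=0.4, rho=0.3, tau=0.3):
    """Sim(Mu, Mr) as the weighted sum of the discipline, LDA-topic and
    keyword similarities; sigma + rho + tau = 1, values here illustrative."""
    return sigma * cosine(Tu, Tr) + rho * cosine(Lu, Lr) + tau * jaccard(Ku, Kr)
```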

In order to recommend high-quality resources of interest to the user, the concept of recommendation degree Recommendation_degree is introduced: the larger the recommendation degree of an academic resource, the better the resource matches the user's interest preference and the higher its quality. The recommendation degree is computed as follows:

\(\text{Recommendation\_degree} = \lambda_1\,\text{Sim}(M_u, M_r) + \lambda_2\,\text{Quality}, \quad \lambda_1 + \lambda_2 = 1\)  (18)

In the secondary recommendation stage, Top-N recommendation is carried out according to the recommendation degree of the academic resources.
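Formula (18) and the Top-N step together can be sketched as follows; λ1 = 0.6 is an assumed weight (λ2 = 1 − λ1), as the trained value is not given:

```python
def top_n(resources, sim, quality, lam1=0.6, n=10):
    """Top-N by recommendation degree per formula (18); `sim` and `quality`
    map resource id -> value, lam1 is an illustrative weight."""
    lam2 = 1.0 - lam1
    degree = {r: lam1 * sim[r] + lam2 * quality[r] for r in resources}
    return sorted(resources, key=degree.get, reverse=True)[:n]
```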

The whole recommendation process is shown in Fig. 10, from which it can be seen that the overall recommendation flow of the system comprises three parts: the building of the resource model, the recommendation of the cold-start stage, and the secondary recommendation process. Their concrete steps are as follows:

The building process of resource model:

1) Five classes of academic resource data are obtained through web crawlers and data interface techniques;

2) The relevant information of each academic resource is parsed and extracted, and inserted into the resource library;

3) Each record in the resource library is preprocessed, including word segmentation and stop-word removal;

4) The discipline distribution, keyword distribution and LDA latent topic distribution of each resource are computed by three trained models, namely the Bayesian multinomial model, the VSM model and the LDA model;

5) The discipline categories of a resource are obtained from its discipline distribution vector, namely the 3 disciplines with the largest probabilities in the vector;

6) The quality value of each resource is computed;

7) The discipline distribution vector, keyword distribution vector, LDA latent topic distribution vector, discipline categories and quality value are inserted into the resource library.

The recommendation process of cold-start phase:

1) Academic resources matching the user's interest disciplines are selected;

2) High-quality resources are recommended according to the quality values of the academic resources.

The recommendation process in secondary recommendation stage:

1) The user's browsing records are obtained and the "behavior coefficients" are computed;

2) The user interest model is built;

3) The similarity between the resource model and the user interest model is computed;

4) The recommendation degree is computed according to the similarity and the quality value;

5) Top-N recommendation is carried out according to the recommendation degree of the resources.

For convenience of subsequent computation, the resource model is constructed in advance. When a user uses the system for the first time, the recommendation strategy of the cold-start stage is used to recommend academic resources; after the user's behavior data reaches a certain amount, the secondary recommendation strategy is used instead.

The present invention mainly proposes corresponding recommendation strategies according to the continual accumulation and change of academic resources and user data. The cold-start stage recommends to the user high-quality resources matching his interest disciplines; the secondary recommendation stage models all classes of academic resources in the four dimensions of resource type, discipline distribution, keyword distribution and LDA latent topic distribution, models the user's interest preference according to user behavior, and finally performs Top-N recommendation according to resource recommendation degree.

Experimental results show that the academic resource recommendation strategy of the present invention can fully cater to the user's interest disciplines and achieves an obvious effect in raising the CTR of resources. For the secondary recommendation stage, the experimental results show that the recommendation strategy under the modeling method of the present invention is clearly higher in precision than the two currently common recommendation strategies under conventional resource modeling.

Claims (10)

1. An academic resources recommendation service system, the academic resources being various electronic texts published on the Internet, the academic resources recommendation service system comprising a web crawler, a text classification model and a local academic resources database of resources to be recommended, wherein academic resources are crawled on the Internet by the web crawler, characterized in that the crawled resources are classified by the text classification model into a predetermined number A of categories and then stored in the local academic resources database to be recommended, and an open API of the academic resources database is provided for display and for the resource recommendation module to call; the academic resources recommendation service system further comprises an academic resources model, a resource quality value computation model, a user interest model, and a tracking software module installed in the user's terminal for keeping track of the user's online browsing behavior; based on the historical browsing behavior data of different user groups, the degree of attention of users of different identities to each type of academic resource is computed; the academic resources are modeled in four dimensions, namely resource type, discipline distribution, keyword distribution and LDA latent topic distribution; the user interest model is built in combination with the user's interest disciplines and historical browsing behavior data; the similarity between the academic resources model and the user interest model is computed, the recommendation degree is computed in combination with the resource quality value, and finally Top-N academic resource recommendation is performed for the user according to recommendation degree.
2. The academic resources recommendation service system as claimed in claim 1, characterized in that the web crawler is a topic crawler configured with an LDA topic model, the LDA topic model being a three-layer "document-topic-word" Bayesian generative model; a corpus is configured for the LDA topic model in advance, the corpus including training corpora; the LDA topic model is trained on the training corpora with a set number of topics K, and through the word-clustering function of LDA topic model training, K topic-associated word sets are obtained from the training corpora after training, i.e. the K subject documents used by the topic crawler for crawling; the topic crawler further comprises, on the basis of a general web crawler, a topic determination module, a similarity computation module and a URL priority-ranking module; the topic crawler consists of multiple distributed crawlers, one per academic discipline, each distributed crawler corresponding to one academic discipline, and the distributed crawlers obtain the academic resources of multiple academic disciplines simultaneously; in each crawl of the topic crawler, the topic determination module determines the target topic and its subject document, and that subject document is used to guide the computation of topic similarity; the similarity computation module computes and judges the topic similarity of each anchor text on the crawled page in combination with the page content, rejects the hyperlinks whose topic similarity of anchor text combined with the page is below a set threshold, and selects the URLs whose topic similarity of anchor text combined with the page is above the set threshold; the topic crawler maintains a queue of the URLs of unvisited web pages pointed to by the hyperlinks of visited web pages, the URL queue being sorted in descending order of similarity; the topic crawler successively visits the web page of each URL in queue order, crawls the corresponding academic resources, and continually stores the crawled academic resources in the database after classification labeling, for the subject document of this crawl, until the unvisited URL queue is empty; the academic resources crawled each time by the topic crawler serve as new corpora for LDA topic model training; and the crawling process of the topic crawler is repeated continually, so that the topic-associated word set of each subject document is continually supplemented and updated, and the crawled academic resources are continually supplemented and updated to a degree of manual approval.
3. The academic resources recommendation service system as claimed in claim 2, characterized in that the corpus further includes verification corpora with definite categories, for letting the text classification model carry out classification verification over the predetermined A categories in advance, so as to obtain the classification accuracy of the text classification model for each of the A categories, as the classification confidence index of the text classification model for each of the A categories; the accuracy is the proportion, among all verification corpora assigned to a certain category by the text classification model, of the corpora that are correctly classified, together with a preset classification accuracy threshold.
4. The academic resources recommendation service system as claimed in claim 3, characterized in that all disciplines are divided into 75 discipline categories, i.e. the number of categories A is 75; the number of topics K is set to 100 when training with the LDA topic model; and the classification accuracy threshold preset when the text classification model carries out classification verification is 80%.
5. A method for providing an academic resources recommendation service for an associated user with a resource recommendation service system, the academic resources being various electronic texts published on the Internet, including crawling academic resources on the Internet using a web crawler, characterized in that the crawled academic resources are classified by a text classification model into a predetermined number A of categories and then stored, forming an academic resources database; an open API of the academic resources database is provided for display and for the resource recommendation module to call; an academic resources model, a resource quality value computation model and a user interest model are used, and a tracking software module is installed in the user's terminal for keeping track of the user's online browsing behavior; the process of recommending corresponding academic resources to the user includes a cold-start recommendation stage and a secondary recommendation stage; the cold-start recommendation stage recommends, based on the user's interest disciplines, high-quality resources matching those disciplines, the high-quality resources being the academic resources whose resource quality values, computed by the resource quality value computation model, are comparatively high, the resource quality value being the arithmetic mean or weighted mean of the resource's authority, community popularity and recency; in the secondary recommendation stage, the user interest model and the resource model are each built, the similarity between the user interest model and the resource model is computed, the recommendation degree is computed in combination with the resource quality value, and finally Top-N academic resource recommendation is performed for the user according to recommendation degree.
6. The method as claimed in claim 5, characterized in that the computation of the resource quality value Quality includes the following. The authority Authority of a resource is computed as

\(\text{Authority} = \frac{1}{2}\text{Level} + \frac{1}{2}\text{Cite}\)  (1)

wherein Level is the score after the rank of the publication in which the resource appeared is quantized; the publication rank is divided into 5 grades, scored 1, 0.8, 0.6, 0.4 and 0.2 respectively: top journals or conferences such as Nature and Science score 1, the second level such as ACM Transactions scores 0.8, and the lowest level scores 0.2; Cite is computed as

Cite = Cites / maxCite  (2)

wherein Cite is the quantized citation result of the resource, Cites is the citation count of the resource, and maxCite is the maximum citation count in the source resource database;

the community popularity Popularity of a resource is computed as

Popularity = readTimes / maxReadTimes  (3)

wherein readTimes is the number of times the resource has been read, and maxReadTimes is the maximum read count in the source resource database;

the recency Recentness of all resources is computed in the same way:

\(\text{Recentness} = \dfrac{12(\text{year} - \text{minYear}) + (\text{month} - \text{minMonth})}{12(\text{maxYear} - \text{minYear}) + (\text{maxMonth} - \text{minMonth})}\)  (4)

wherein year and month are the year and month in which the resource was published, and minYear, minMonth, maxYear and maxMonth are the earliest and latest publication years and months of all resources in the source database of that resource class;

the resource quality value Quality is computed as

\(\text{Quality} = \frac{1}{3}\text{Authority} + \frac{1}{3}\text{Popularity} + \frac{1}{3}\text{Recentness}\)  (5).
7. method as claimed in claim 5, it is characterised in that the academic resources model is expressed as follows:
Mr={ Tr,Kr,Ct,Lr} (6)
Wherein, TrIt is the subjects distribution vector of academic resources, is that the academic resources are distributed in the A probable value of subject category, by shellfish Leaf this multinomial model is obtained;
Kr={ (kr1r1),(kr2r2),…,(krmrm), m is keyword number, kri(1≤i≤m) represents wall scroll I-th keyword of art resource, ωriIt is keyword kriWeight, obtained by the tf-idf algorithms after improvement, computing formula is such as Under:
W (i, r) represents i-th weight of keyword in document r, and tf (i, r) represents what i-th keyword occurred in document r Frequency, Z represents total record of document sets, and L represents the number of files comprising keyword i;Lr is potential theme distribution vector, Lr= {lr1,lr2,lr3…,lrN1, N1 is potential theme quantity;Ct is resource type, and the value of t can be big for 1,2,3,4,5 i.e. five Class academic resources:Paper, patent, news, meeting and books;
According to the behavioral characteristics of users of the mobile software, a user's operations on an academic resource are divided into opening, reading, star rating, sharing and collecting. The user interest model is based on the academic resources the user has browsed; according to the user's different browsing behaviors, combined with the academic resource model, the user interest model is built and expressed as follows:
M_u = {T_u, K_u, C_t, L_u} (8)
where T_u is the subject distribution vector over the academic resources browsed by the user within a period of time, i.e. the user's subject preference distribution vector formed after user behavior:
T_u = (1 / sum) Σ_{j=1}^{sum} s_j × T_jr (9)
where sum is the total number of academic resources on which the user produced behaviors, and s_j is the "behavior coefficient" after the user acts on academic resource j; the larger the value, the more the user likes the resource. T_jr is the subject distribution vector of the j-th resource. The calculation of s_j comprehensively considers the opening, reading, rating, collecting and sharing behaviors, and can accurately reflect the user's preference for the resource.
K_u = {(k_u1, ω_u1), (k_u2, ω_u2), …, (k_uN2, ω_uN2)} is the user preference keyword distribution, N2 is the number of keywords, k_ui (1 ≤ i ≤ N2) is the i-th user preference keyword, and ω_ui is the weight of keyword k_ui, computed from the keyword distribution vectors K_r of all academic resources on which user u produced behaviors within a period of time:
K′_jr = s_j × K_jr (10)
The new keyword distribution vector of each academic resource can be computed by formula 10; the TOP-N2 entries of the new keyword distribution vectors of all resources are then chosen as the user's keyword preference distribution vector K_u.
L_u is the user's LDA latent topic preference distribution vector, computed from the LDA latent topic distribution vectors L_r = {l_r1, l_r2, l_r3, …, l_rN1} of the academic resources, in the same way as T_u:
L_u = (1 / sum) Σ_{j=1}^{sum} s_j × L_jr (11)
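Formulas (9) and (11) are the same behavior-weighted average applied to different vectors, so one helper covers both. A minimal sketch, assuming the behavior coefficients s_j are already computed and each resource vector has the same dimension:

```python
def preference_vector(resource_vectors, s):
    """Formulas (9)/(11): (1/sum) * Σ_j s_j × vector_j over the browsed
    resources; resource_vectors is a list of equal-length lists and s the
    matching list of behavior coefficients s_j."""
    n = len(resource_vectors)
    out = [0.0] * len(resource_vectors[0])
    for sj, vec in zip(s, resource_vectors):
        for i, v in enumerate(vec):
            out[i] += sj * v
    return [x / n for x in out]
```

Applied to the subject distribution vectors T_jr it yields T_u, and applied to the LDA topic vectors L_jr it yields L_u.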
The similarity between the user interest model and the resource model is calculated as follows.
The academic resource model is represented as:
M_r = {T_r, K_r, C_t, L_r} (12)
The user interest model is represented as:
M_u = {T_u, K_u, C_t, L_u} (13)
The similarity between the user subject preference distribution vector T_u and the academic resource subject distribution vector T_r is computed by cosine similarity, i.e.:
Sim(T_u, T_r) = (T_u · T_r) / (||T_u|| ||T_r||) (14)
The similarity between the user's LDA latent topic preference distribution vector L_u and the academic resource's LDA latent topic distribution vector L_r is likewise computed by cosine similarity, i.e.:
Sim(L_u, L_r) = (L_u · L_r) / (||L_u|| ||L_r||) (15)
The similarity between the user keyword preference distribution vector K_u and the academic resource keyword distribution vector K_r is computed by Jaccard similarity:
Sim(K_u, K_r) = |K_u ∩ K_r| / |K_u ∪ K_r| (16)
The similarity between the user interest model and the academic resource model is then:
Sim(M_u, M_r) = (σ × Sim(T_u, T_r) + ρ × Sim(K_u, K_r) + τ × Sim(L_u, L_r)) / (σ² + ρ² + τ²) (17)
where σ + ρ + τ = 1, and the specific weight distribution is obtained by experimental training.
The concept of recommendation degree (Recommendation_degree) is introduced: the larger the recommendation degree of an academic resource, the better the resource matches the user's interest preferences and the higher its quality. The recommendation degree is computed as follows:
Recommendation_degree = λ1 × Sim(M_u, M_r) + λ2 × Quality, where λ1 + λ2 = 1 (18)
The secondary recommendation stage performs Top-N recommendation according to the recommendation degree of the academic resources.
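The similarity and recommendation-degree formulas (14)–(18) can be sketched directly. A hedged illustration: the weight values σ, ρ, τ, λ1, λ2 below are placeholders, not the experimentally trained values the patent refers to.

```python
import math

def cosine(u, v):
    """Formulas (14)/(15): (u · v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def jaccard(ku, kr):
    """Formula (16): |Ku ∩ Kr| / |Ku ∪ Kr| over keyword sets."""
    ku, kr = set(ku), set(kr)
    return len(ku & kr) / len(ku | kr) if ku | kr else 0.0

def model_similarity(tu, tr, ku, kr, lu, lr, sigma=0.4, rho=0.3, tau=0.3):
    """Formula (17), with σ + ρ + τ = 1 (placeholder weights)."""
    num = sigma * cosine(tu, tr) + rho * jaccard(ku, kr) + tau * cosine(lu, lr)
    return num / (sigma ** 2 + rho ** 2 + tau ** 2)

def recommendation_degree(sim, quality, lam1=0.7, lam2=0.3):
    """Formula (18): λ1·Sim + λ2·Quality, with λ1 + λ2 = 1."""
    return lam1 * sim + lam2 * quality
```

Top-N recommendation then amounts to sorting candidate resources by `recommendation_degree` and returning the first N.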
8. The method as claimed in claim 5, characterized in that the web crawler is a topic crawler configured with an LDA topic model. The LDA topic model is a three-layer "document-topic-word" Bayesian generative model. A corpus is configured for the LDA topic model in advance; the corpus includes training corpus. The LDA topic model is trained on the training corpus with a set number of topics K; using the word-clustering function of LDA training, the training corpus is clustered by the LDA topic model into the K set topics, each associated with its own word set, i.e. the K topic documents that the topic crawler crawls for. On the basis of a general web crawler, the topic crawler further includes a topic determination module, a similarity calculation module and a URL priority ranking module. The topic crawler consists of multiple distributed crawlers allocated by academic topic number, each distributed crawler corresponding to one academic topic, so that the distributed crawlers obtain academic resources for multiple academic topics simultaneously. In each crawl of the topic crawler, the topic determination module determines the target topic and its topic document, which guides the topic similarity calculation; the similarity calculation module computes and judges the topic similarity of each anchor text on the crawled page in combination with the page content, rejects the hyperlinks whose topic similarity (anchor text combined with the page) is below a set threshold, and selects the URLs whose topic similarity is above the set threshold. The topic crawler maintains a URL queue of the unvisited web pages pointed to by hyperlinks of visited web pages, arranged in descending order of similarity. The topic crawler visits the web page of each URL in queue order, crawls the corresponding academic resources, continually attaches classification labels to the crawled academic resources and stores them in the database for the current topic document, until the unvisited URL queue is empty. The academic resources crawled each time by the topic crawler serve as new training corpus for the LDA topic model, and the crawl process of the topic crawler is repeated continually, so that the topic-associated word set of each topic document is continually supplemented and updated, and the crawled academic resources are continually updated and supplemented to a humanly recognized level of quality.
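The URL scheduling described above (threshold on anchor-text-plus-page similarity, visit in descending-similarity order until the queue is empty) can be sketched as a priority-queue loop. `fetch` and `similarity` below are hypothetical stand-ins for the crawler's download step and its similarity calculation module:

```python
import heapq

def crawl(seed_urls, similarity, fetch, threshold=0.5):
    """Visit URLs in descending topic-similarity order until the
    unvisited-URL queue is empty; return the crawled page contents."""
    queue = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(queue)
    visited, resources = set(), []
    while queue:
        _, url = heapq.heappop(queue)           # most topic-relevant URL first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                       # {"content": str, "links": [(anchor, url), ...]}
        resources.append(page["content"])       # crawled resource, to be labeled and stored
        for anchor, link in page["links"]:
            score = similarity(anchor, page["content"])
            if score >= threshold and link not in visited:
                heapq.heappush(queue, (-score, link))
    return resources
```

In the patent's design one such loop would run per distributed crawler, with `similarity` guided by that crawler's topic document.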
9. The method as claimed in claim 5, characterized in that the corpus further includes classification verification corpus, used to let the text classification model carry out classification verification over the predetermined A categories in advance, so as to obtain the text classification model's classification accuracy for each of the A categories, which serves as the classification confidence index of the text classification model for each category. The accuracy is the proportion of correctly classified corpus among all verification corpus assigned to a given category by the text classification model, and a classification accuracy threshold is preset. Performing text classification on each piece of text to be classified with the text classification model specifically includes the following steps:
Step 1: preprocess each piece of text to be classified. Preprocessing includes word segmentation and stop-word removal, while proper nouns are retained. Compute the feature weight of every word of the preprocessed text; the feature weight value of a word is proportional to the number of times it occurs in the text and inversely proportional to the number of times it occurs in the training corpus. Arrange the resulting word set in descending order of feature weight value, and extract the front portion of each text's original word set as its feature word set.
Step 2: using the text classification model, compute from each text's original feature word set the probability values that the text belongs to each of the predetermined A categories, and choose the category with the largest probability value as the text's classification category.
Step 3: judge the classification result of step 2. If the classification accuracy value of the text classification model for that category reaches the set threshold, output the result directly; if it does not reach the set threshold, go to step 4.
Step 4: input each preprocessed text into the LDA topic model and compute the weight value of the text for each of the K set topics; select the topic with the largest weight value, and add the first Y words of the topic-associated word set under that topic (obtained in advance by training the LDA topic model) to the text's original feature word set, together forming the expanded feature word set. Reuse the text classification model to compute the probability values that the text may belong to each of the predetermined A categories, and choose the category with the largest probability value as the text's final classification category.
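The two-stage decision of steps 2–4 can be sketched as follows. This is a hedged skeleton: `classify`, `per_class_accuracy`, `dominant_topic` and `topic_words` are hypothetical stand-ins for the trained text classification model, its per-category validation accuracies, the LDA topic inference, and the topic-word lists.

```python
def two_stage_classify(features, classify, per_class_accuracy,
                       dominant_topic, topic_words, threshold=0.8, y=10):
    """Accept the base classifier's verdict only for trusted categories;
    otherwise expand the feature set with the top-Y LDA topic words and
    classify again."""
    label = classify(features)                   # step 2: base prediction
    if per_class_accuracy[label] >= threshold:   # step 3: category is trusted
        return label
    topic = dominant_topic(features)             # step 4: LDA fallback
    expanded = list(features) + topic_words[topic][:y]
    return classify(expanded)
```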
10. The method as claimed in claim 9, characterized in that the main calculation formula of the text classification model is:
P(c_j | x_1, x_2, …, x_n) = P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n) (19)
where P(c_j | x_1, x_2, …, x_n) is the probability that the text belongs to category c_j when the feature words (x_1, x_2, …, x_n) occur together; P(c_j) is the proportion of texts in the training set that belong to category c_j; P(x_1, x_2, …, x_n | c_j) is the probability that the feature word set of a text is (x_1, x_2, …, x_n) given that the text belongs to category c_j; and P(x_1, x_2, …, x_n) is the joint probability of the given feature words.
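A minimal naive Bayes sketch consistent with formula (19): the posterior of category c_j is proportional to the class prior P(c_j) times the per-word likelihoods P(x_i | c_j), and the denominator is the same for every category, so it can be dropped when only the argmax is needed. The add-one smoothing below is an assumption of this sketch, not stated in the claim.

```python
import math
from collections import Counter

def naive_bayes_argmax(words, class_docs):
    """class_docs maps category -> list of training documents (word lists);
    returns the category c_j maximizing log P(c_j) + Σ_i log P(x_i | c_j)."""
    total_docs = sum(len(docs) for docs in class_docs.values())
    best, best_score = None, float("-inf")
    for cj, docs in class_docs.items():
        counts = Counter(w for doc in docs for w in doc)
        n = sum(counts.values())
        vocab = len(counts) + 1                  # +1 slot for unseen words
        score = math.log(len(docs) / total_docs)  # log P(c_j)
        for w in words:
            score += math.log((counts[w] + 1) / (n + vocab))  # smoothed likelihood
        if score > best_score:
            best, best_score = cj, score
    return best
```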
CN201611130297.9A 2016-12-09 2016-12-09 A kind of academic resources recommendation service system and method CN106815297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611130297.9A CN106815297A (en) 2016-12-09 2016-12-09 A kind of academic resources recommendation service system and method


Publications (1)

Publication Number Publication Date
CN106815297A true CN106815297A (en) 2017-06-09

Family

ID=59107077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611130297.9A CN106815297A (en) 2016-12-09 2016-12-09 A kind of academic resources recommendation service system and method

Country Status (1)

Country Link
CN (1) CN106815297A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247751A (en) * 2017-05-26 2017-10-13 武汉大学 Content recommendation method based on LDA topic models
CN107590232A (en) * 2017-09-07 2018-01-16 北京师范大学 A kind of resource recommendation system and method based on Network Study Environment
US10387115B2 (en) 2015-09-28 2019-08-20 Yandex Europe Ag Method and apparatus for generating a recommended set of items
US10387513B2 (en) 2015-08-28 2019-08-20 Yandex Europe Ag Method and apparatus for generating a recommended content list
US10394420B2 (en) 2016-05-12 2019-08-27 Yandex Europe Ag Computer-implemented method of generating a content recommendation interface
US10430481B2 (en) 2016-07-07 2019-10-01 Yandex Europe Ag Method and apparatus for generating a content recommendation in a recommendation system
WO2019192352A1 (en) * 2018-04-03 2019-10-10 阿里巴巴集团控股有限公司 Video-based interactive discussion method and apparatus, and terminal device
US10452731B2 (en) 2015-09-28 2019-10-22 Yandex Europe Ag Method and apparatus for generating a recommended set of items for a user

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324761A (en) * 2013-07-11 2013-09-25 广州市尊网商通资讯科技有限公司 Product database forming method based on Internet data and system
CN104680453A (en) * 2015-02-28 2015-06-03 北京大学 Course recommendation method and system based on students' attributes
CN103336793B (en) * 2013-06-09 2015-08-12 中国科学院计算技术研究所 A personalized paper recommendation method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高洁 (Gao Jie): "Research and Implementation of a High-Quality Academic Resource Recommendation Method", China Masters' Theses Full-text Database, Information Science & Technology *


Similar Documents

Publication Publication Date Title
Nie et al. Web object retrieval.
RU2419858C2 (en) System, method and interface for providing personalised search and information access
US9009134B2 (en) Named entity recognition in query
US10157233B2 (en) Search engine that applies feedback from users to improve search results
Wu et al. Harvesting social knowledge from folksonomies
Sieg et al. Web search personalization with ontological user profiles
US20090094233A1 (en) Modeling Topics Using Statistical Distributions
US20090307213A1 (en) Suffix Tree Similarity Measure for Document Clustering
Zhao et al. Topical keyphrase extraction from twitter
JP4365074B2 (en) Document expansion system with user-definable personality
JP5391634B2 (en) Selecting tags for a document through paragraph analysis
Sieg et al. Learning ontology-based user profiles: A semantic approach to personalized web search.
Wu et al. Flame: A probabilistic model combining aspect based opinion mining and collaborative filtering
Porteous et al. Fast collapsed gibbs sampling for latent dirichlet allocation
US9081852B2 (en) Recommending terms to specify ontology space
US8463786B2 (en) Extracting topically related keywords from related documents
US8027977B2 (en) Recommending content using discriminatively trained document similarity
US7844592B2 (en) Ontology-content-based filtering method for personalized newspapers
IJntema et al. Ontology-based news recommendation
Zanasi Text mining and its applications to intelligence, CRM and knowledge management
US20080270384A1 (en) System and method for intelligent ontology based knowledge search engine
US7519588B2 (en) Keyword characterization and application
CN102831234B (en) Personalized news recommendation device and method based on news content and theme feature
US7516397B2 (en) Methods, apparatus and computer programs for characterizing web resources
TW200900973A (en) Personalized shopping recommendation based on search units

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination