CN111061939B

CN111061939B - Scientific research academic news keyword matching recommendation method based on deep learning

Info

Publication number: CN111061939B
Application number: CN201911408925.9A
Authority: CN
Inventors: 孟海宁; 冯锴; 朱磊; 白涛; 王�锋; 石月开; 童新宇; 姚燕妮; 董林靖; 陈毅
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-03-24
Anticipated expiration: 2039-12-31
Also published as: CN111061939A

Abstract

The invention discloses a deep-learning-based keyword matching recommendation method for scientific research news and academic papers. The method relies on two continuously growing libraries, a news library and a paper library. A keyword library is built by combining picture recognition, word segmentation combined with probability statistics, and the sentence2vec method; a word2vec model is then constructed, the similarity of content is judged through language processing, and recommendations are made according to that similarity. When a user browses scientific research news, highly relevant academic papers can be recommended, so that the viewpoints in the news are backed by papers; conversely, when a user browses an academic paper, research news similar to the paper can be recommended so that the reader can learn the latest developments concerning the viewpoint or technology described in the paper.

Description

Scientific research academic news keyword matching recommendation method based on deep learning

Technical Field

The invention belongs to an important direction in the field of machine learning, and particularly relates to a keyword matching recommendation method for scientific research news and academic papers.

Background

With the development of computer science and technology, machine learning has made practically meaningful progress in the direction of deep learning, with clear application prospects. This offers a possible way to obtain the desired data or information from a vast sea of information. The rapid development of the internet has also promoted continuous progress in scientific research across all fields, and the volume of academic literature and of news about scientific research grows day by day. When a researcher reads a piece of scientific research news, the viewpoint it conveys is often unclear, so it is necessary to recommend papers whose academic viewpoints are the same as or similar to those of the news. Doing so strengthens the connection between scientific research news and academic papers and, because the news is backed by papers, also enhances its credibility. The process is reversible: when a user browses an academic paper, news with the same or a similar academic viewpoint is recommended, so that the reader can understand the latest developments of the viewpoint or technology in the paper.

At present, most recommendation algorithms perform content-based personalized recommendation, and the recommended objects are mostly news, commodities, music and the like that interest ordinary users. Examples of mutual recommendation based purely on a scientific research news library and an academic paper library are lacking. Taking the recommendation method of the Taobao shopping platform as an example, it uses the user's behavior together with mathematical algorithms to infer the goods the user may like or may purchase. There are also content-based, collaboration-based, utility-based, knowledge-based and hybrid recommendation methods. The recommendation method adopted by the invention is based on two continuously growing libraries, a news library and a paper library; it judges the similarity of their contents through language processing and recommends according to that similarity. In summary, the problem of mutual recommendation between scientific research news and academic papers is valuable and meaningful, and the invention aims to provide a solution to it.

Disclosure of Invention

The invention aims to provide a method for recommending academic papers according to scientific research news, the process being reversible. The method helps researchers and other users save paper-search time, supplies scientific research news with the supporting viewpoints of academic papers, and recommends academic papers by keyword matching between the papers already in the paper library and the scientific research news being browsed.

The key to the method is how to define the similarity between scientific research news and academic papers; the method adopted is similarity detection between two keyword libraries, which achieves keyword matching. In the technical solution, when an administrator logs in with his or her own ID, the ID and the operations performed, such as uploading or modifying news or papers, are recorded, and the changed data are then integrated. The aim is to generate the nodes of the scientific research news database and the academic paper database, together with the statements that create the relationships between them, and to push related news or papers according to the relationship keywords. Taking the recommendation of academic papers according to browsed scientific research news as an example, the method comprises the following steps:

step 1, a website administrator inputs news data and academic research paper data;

step 2, the news data are integrated into a news database, and the academic research paper data are integrated into a paper database;

step 3, the news is divided into picture news and text news. For picture news, the pictures are recognized with a constructed BOW model, and the text and the information about persons in the pictures are extracted. Text news is processed by combining news word segmentation with word probability statistics. Finally, the data from both are collected to form a news keyword library;

step 4, for the paper library, the keyword data carried by each paper are first extracted to form a keyword set. Next, semantic recognition is performed on the titles, body text and the like of the papers using the C-bow method of sentence2vec. Finally, the data from both are collected to form a paper keyword library;

step 5, the news keyword library and the paper keyword library are integrated, a word2vec model is constructed, and the model is trained to mine the relations between keywords. The final effect is that, given an input keyword, the keyword library can be traversed to obtain a data set ordered by descending relevance to that keyword. This data set is the basis of the recommendation work;

step 6, when a user browses a piece of news, the keywords of that news are used as input data. The paper keyword library is traversed for these keywords to obtain a data set in descending order of relevance to them;

step 7, the paper library is queried according to the obtained keyword data set, and several papers relevant to the browsed news are retrieved to form a recommendation list for paper recommendation;

step 8, steps 3, 4 and 5 are repeated regularly: because the two databases keep growing, the model must be retrained repeatedly to improve the accuracy of the recommended data set.

The invention is also characterized in that:

step 3.1, picture news is recognized mainly in two ways. The first is to extract the text in the picture, perform word segmentation to form a keyword lexicon, and compare it with the paper keyword library. The second is to perform face recognition on the persons in the picture and compare them with the library of researchers already entered. For a person appearing in a picture of scientific research news, if the paper library contains papers written by an author who matches that person, that author's papers are recommended for the news, with the most recently published papers given priority in the recommendation order. The whole process is carried out by constructing a BOW model, mainly in three stages. First, SIFT features are extracted from each of a number of images. Then k-means clustering is performed on all the extracted SIFT features to obtain k cluster centers, which serve as the visual vocabulary. Finally, for each image, the distance between each of its SIFT feature points and each word in the vocabulary is calculated, and the count of the nearest word is increased by one. This yields the codebook of the image, and the image is recognized according to the codebook;

step 3.2, text news is processed by combining a word segmentation method with statistical means. Traditional word segmentation alone cannot identify news content accurately, so statistics are added: the number of times each word appears in the news title and body is counted, words common to all news are removed, and the words with the highest frequency are taken as the keywords of the news. In addition, because the words in a news headline summarize the viewpoint of the news, keywords in the headline are given a higher weight than words appearing in the body;

in step 4, semantic recognition is performed on the titles and abstracts of the papers using the C-bow method of sentence2vec, and keywords are induced. Together with the keywords extracted from the papers themselves, these form the paper keyword library. The C-bow method is a model that predicts the probability of occurrence of the current word from its context words, and words with a high probability of occurrence are regarded as keywords of the paper;

step 5.1, the news keyword library and the paper keyword library are integrated to form a word2vec model;

step 5.2, the integrated word2vec model is trained. The purpose is to generate vectors that capture word meanings and support arithmetic operations on words, so that the association relations between keywords and the ranking of their degree of association are obtained;

step 5.3, the trained word2vec model achieves the following effect:

when any keyword is input, the lexicon is traversed and all keywords related to the input keyword are output. The number of output words can be controlled manually through a parameter, and the output words are sorted by their relevance to the input; this ranking is the basis and core of the subsequent recommendation;

The invention has the following beneficial effects:

The invention provides a deep-learning-based method for mutual recommendation of scientific research news and academic papers, addressing the weak association between the two and the lack of mutual recommendation. The method is based on two continuously growing libraries, a news library and a paper library; a keyword library is formed by combining picture recognition, word segmentation combined with probability statistics, and the sentence2vec method. A word2vec model is then constructed, the similarity of content is judged through language processing, and recommendations are made according to that similarity. Through the papers recommended for a news item, readers can see how papers support the viewpoints of the news; through the news recommended for a paper, readers can learn about recent technical developments related to the paper.

Drawings

FIG. 1 is a general flowchart of a scientific research academic news keyword matching recommendation method based on deep learning according to the present invention;

FIG. 2 shows the model construction and picture recognition process for picture news according to the present invention;

FIG. 3 is a flowchart of the steps for processing text scientific research news by combining word segmentation with statistical means according to the present invention;

FIG. 4 is a flowchart of the steps of building and training a word2vec model according to the deep learning method of the present invention;

FIG. 5 is a diagram of the processing of paper library data using the C-bow method of sentence2vec;

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

Referring to fig. 1, the scientific research academic news keyword matching recommendation method based on deep learning specifically includes the following steps:

step 1.1, scientific research news is entered by an administrator or by users with management authority, and all news is then integrated into a news library;

step 1.2, in the same way as the news, all papers are integrated into a paper library;

step 2, the scientific research news and academic papers are processed to form a news keyword lexicon and a paper keyword lexicon. These two lexicons are the basis of the subsequent recommendation work. The specific steps are as follows:

step 2.1, a BOW model for picture recognition is constructed as shown in FIG. 2, and a BOW codebook is built as follows:

1. preprocess the training image set, including image enhancement, segmentation, and unifying the image format and specification;

2. extract SIFT features, each represented by a 128-dimensional descriptor vector;

3. perform k-means clustering on the N SIFT features extracted in step 2 to obtain k visual words;

4. calculate the distance from each SIFT feature of each image to each of the k visual words;

5. add one to the word frequency of the corresponding (nearest) visual word.

After these steps, each image is represented as a word frequency vector over the sequence of visual words, which achieves the aim of identifying the image;
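Sub-steps 4 and 5 can be illustrated with a minimal sketch. It assumes that the k visual words have already been produced by k-means, and it uses made-up 2-dimensional points in place of real 128-dimensional SIFT descriptors; the function names `nearest_word` and `bow_histogram` are hypothetical and are not taken from the patent.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, word) for word in vocabulary]
    return distances.index(min(distances))

def bow_histogram(descriptors, vocabulary):
    """Build the word-frequency vector of one image over the k visual words."""
    histogram = [0] * len(vocabulary)
    for descriptor in descriptors:
        # Sub-step 5: add one to the frequency of the nearest visual word.
        histogram[nearest_word(descriptor, vocabulary)] += 1
    return histogram

# Hypothetical toy data: 2-D stand-ins for 128-D SIFT descriptors.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]  # k = 3 cluster centers
image_descriptors = [(0.5, 0.2), (9.0, 9.5), (0.1, 9.0), (10.5, 10.0)]

codebook_vector = bow_histogram(image_descriptors, vocabulary)
print(codebook_vector)
```

Two images can then be compared through their word-frequency vectors instead of their raw pixels, which is what makes the codebook usable for recognition.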

step 2.2, the text news in the scientific research news is processed by combining a word segmentation method with statistical means. Traditional word segmentation alone is not accurate enough for identifying news content, so statistics are added to count the number of times each word appears in the news title and body. After words common to all news are excluded, the words with the highest frequency are used as the keywords of the news. In addition, because the words in a news headline summarize the viewpoint of the news, keywords in the headline are given a higher weight than words appearing in the body. The specific flow is shown in FIG. 3;
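The frequency counting with headline weighting described in step 2.2 might look like the following sketch. It assumes the news has already been segmented into tokens (for Chinese text a segmenter would be needed first), treats "words common to all news" as words that occur in every news item, and uses a headline weight of 3; the weight, the sample news and the function names are all assumptions made for illustration.

```python
from collections import Counter

TITLE_WEIGHT = 3  # hypothetical: the text only says headline words weigh more than body words

def common_words(tokenized_news):
    """Words that occur in every news item; these are treated as uninformative."""
    shared = set(tokenized_news[0])
    for tokens in tokenized_news[1:]:
        shared &= set(tokens)
    return shared

def extract_keywords(title_tokens, body_tokens, stop_words, top_n=3):
    """Score words by frequency, counting each headline occurrence TITLE_WEIGHT times."""
    scores = Counter()
    for token in title_tokens:
        if token not in stop_words:
            scores[token] += TITLE_WEIGHT
    for token in body_tokens:
        if token not in stop_words:
            scores[token] += 1
    return [word for word, _ in scores.most_common(top_n)]

# Hypothetical, already-segmented news items: (headline tokens, body tokens).
news = [
    (["hydropower", "station", "units", "commissioned"],
     ["the", "hydropower", "station", "reports", "installed", "capacity", "the", "units"]),
    (["new", "battery", "material"],
     ["the", "battery", "lab", "reports", "the", "material"]),
]

stop_words = common_words([title + body for title, body in news])
keywords = extract_keywords(news[0][0], news[0][1], stop_words)
print(stop_words, keywords)
```

With a larger news library, a document-frequency cutoff would be a more forgiving way to find common words than requiring a word to appear in every item.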

step 2.3, the keywords supplied with an academic paper already summarize the main content of the paper well, so they are extracted and stored directly in the paper keyword lexicon;

step 2.4, the title and abstract of each paper are processed by the C-bow method of sentence2vec. C-bow (Continuous Bag-of-Words Model) is a model that predicts the probability of occurrence of the current word from its context. Here, w is a keyword of the paper itself extracted in the previous step, and the set of these keywords is used as the corpus C.

C-bow is a language model that estimates the current word given a known context. Given some context words, it evaluates probabilities to find the word that best fits that context. Its learning goal is to maximize the log-likelihood function:

where w represents any word in corpus C.
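The log-likelihood referred to above can be written in the standard CBOW form, given here as a reconstruction under assumptions rather than as the patent's own formula: Context(w) is assumed notation for the context words surrounding w, and p is the softmax probability produced by the output layer.

```latex
\mathcal{L} = \sum_{w \in C} \log p\bigl(w \mid \mathrm{Context}(w)\bigr)
```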

INPUT layer (INPUT): the word vectors of the context words.

PROJECTION layer (PROJECTION): the input vectors are summed; the summation is a simple vector addition.

OUTPUT layer (OUTPUT): outputs the most likely w. Since the vocabulary size of the corpus is fixed at |C|, the process can be viewed as a multi-class classification problem: given the features, one class is picked from the |C| classes. For multi-class classification with a neural network, the simplest approach is softmax regression.

For example, a set of words is given and converted to one-hot encoding:

display [0, 1, 0, …, 0, 0]

In the corpus, each of "display", "host", "mouse" and "keyboard" corresponds to a vector in which exactly one value is 1 and the rest are 0.

The training input of the C-bow model is the word vectors of the context words of a given target word. For instance, if the word vectors of "display", "keyboard" and "host" are input, the trained language model can find the word "mouse", which best fits that context, by evaluating the probabilities.
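The three layers can be traced in a small forward-pass sketch. It assumes a toy four-word vocabulary whose ordering is arbitrary, a hypothetical embedding size of 5, and untrained random weights, so the printed probabilities only show the mechanics and do not predict "mouse"; training would adjust `W_in` and `W_out` (both names are assumptions).

```python
import math
import random

vocab = ["display", "host", "mouse", "keyboard"]  # toy vocabulary, order is arbitrary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector: a single 1 at the word's position, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
DIM = 5  # hypothetical embedding size
# Untrained, randomly initialised weights (assumption: training is out of scope here).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]

def cbow_forward(context_words):
    # Input layer: multiplying a one-hot vector by W_in selects one row of W_in.
    vectors = [W_in[index[w]] for w in context_words]
    # Projection layer: plain element-wise vector sum.
    projected = [sum(column) for column in zip(*vectors)]
    # Output layer: score every vocabulary word, then softmax over the |C| classes.
    scores = [sum(p * w for p, w in zip(projected, row)) for row in W_out]
    return softmax(scores)

probs = cbow_forward(["display", "keyboard", "host"])
print(dict(zip(vocab, probs)))
```

After training, the word with the largest probability would be taken as the best fit for the given context.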

An embodiment is shown in FIG. 4;

step 3, the news keyword lexicon and the paper keyword lexicon are integrated to form a word2vec model, which is trained to mine the relations between words. When a keyword is input, semantically similar words are matched and sorted by relevance, which yields a recommendation sequence. The construction of the model is shown in FIG. 5.

For example, suppose the corpus contains two sentences: "Hydropower unit fault diagnosis research based on wavelet packet transformation and associated data mining" and "All 4 units of XX hydropower station put into production, with an installed capacity of 990,000 kilowatts". After a word2vec model is constructed and trained, the extracted keywords are as follows:

the first sentence is the title of an academic paper: "wavelet packet transformation", "associated data mining", "hydroelectric generating set", "fault diagnosis";

the second sentence is the headline of a scientific research news item: "hydropower station", "unit", "installed capacity", "kilowatt".

The words extracted after training are then sorted and stored in the corresponding database, and matching gives the following result: the words "hydroelectric generating set" and "unit" are highly correlated and have an association relation. Therefore, when a visitor to the website browses the scientific research news "All 4 units of XX hydropower station put into production, with an installed capacity of 990,000 kilowatts", the keywords are used as input to query the paper library, and the recommended academic paper obtained by matching is "Hydropower unit fault diagnosis research based on wavelet packet transformation and associated data mining". When there are enough papers, a list ordered by degree of association is formed for recommendation.
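The matching in this example can be sketched as a cosine-similarity lookup. The 3-dimensional vectors below are invented stand-ins for trained word2vec embeddings, the second paper is hypothetical, and the 0.8 threshold is borrowed from the output criterion stated in claim 3; scoring a paper by its best-matching keyword pair is also an assumption made for illustration.

```python
import math

# Hypothetical keyword vectors standing in for trained word2vec embeddings.
vectors = {
    "hydroelectric generating set": (0.9, 0.1, 0.0),
    "unit": (0.8, 0.2, 0.1),
    "fault diagnosis": (0.1, 0.9, 0.2),
    "installed capacity": (0.5, 0.1, 0.8),
    "wavelet packet transformation": (0.0, 0.6, 0.8),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recommend(news_keywords, papers, threshold=0.8):
    """Rank papers by their best keyword match with the news, keeping scores above the threshold."""
    ranked = []
    for title, paper_keywords in papers.items():
        best = max(cosine(vectors[n], vectors[p])
                   for n in news_keywords for p in paper_keywords)
        if best > threshold:
            ranked.append((title, best))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

papers = {
    "Hydropower unit fault diagnosis research based on wavelet packet transformation "
    "and associated data mining": ["wavelet packet transformation",
                                   "hydroelectric generating set",
                                   "fault diagnosis"],
    "A hypothetical paper on signal denoising": ["wavelet packet transformation"],
}
news_keywords = ["unit", "installed capacity"]

recommendations = recommend(news_keywords, papers)
for title, score in recommendations:
    print(round(score, 3), title)
```

Only the hydropower paper passes the threshold here, because "unit" and "hydroelectric generating set" were given nearly parallel vectors; with real embeddings the ranking would depend on the trained model.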

Claims

1. The scientific research academic news keyword matching recommendation method based on deep learning is characterized by specifically comprising the following steps of:

step 1, a website administrator inputs scientific research news data and academic and scientific research paper data;

step 2, integrating news data into a news database, and integrating academic and scientific research thesis data into a thesis database;

step 3, dividing news into picture news and text news; for picture news, recognizing the pictures by constructing a BOW model so as to extract the text and the person information in the pictures; for text news, processing it by combining news word segmentation with word probability statistics; and finally collecting the data from both to form a news keyword library;

step 4, extracting the keyword data carried by the papers in the paper library to form a keyword set, then performing semantic recognition on the titles and body text of the papers using the C-bow method of sentence2vec, and finally collecting the data from both to form a paper keyword library, which specifically comprises:

processing the titles and body text of the papers using the C-bow method of sentence2vec, wherein in the C-bow method the training objective is to predict the probability of a word given its context;

in the training process, a paragraph id is added, that is, each sentence in the training corpus has a unique id; like an ordinary word, the paragraph id is mapped to a vector, namely the paragraph vector; the paragraph vector has the same dimension as the word vectors but comes from a different vector space; in the subsequent calculation, the paragraph vector and the word vectors are summed or concatenated as the input of the softmax output layer; throughout the training of a sentence or document, the paragraph id remains unchanged and the same paragraph vector is shared;

in the prediction stage, a new paragraph id is assigned to the sentence to be predicted, the word vectors and the parameters of the softmax output layer are kept unchanged, and gradient descent is applied again to train on the sentence to be predicted; after convergence, the paragraph vector of the sentence to be predicted is obtained; sentence or word vectors are first trained on a Chinese sentence corpus, and the most similar sentence or word is then obtained by calculating the cosine between sentence vectors;

step 5, integrating the news keyword library and the paper keyword library, constructing a word2vec model and training it in order to mine the relations between keywords, with the final effect that, when any keyword is input, the keyword library is traversed according to the input keyword to obtain a data set ordered by descending relevance to the keyword, this data set being the basis of the recommendation work;

step 6, when a user browses a piece of news, using the keywords of that news as input data and traversing the paper keyword library for those keywords to obtain a data set in descending order of relevance to the keywords;

step 7, querying the paper library according to the obtained keywords and data set, and retrieving several papers relevant to the browsed news to form a recommendation list for paper recommendation;

step 8, repeating steps 3, 4 and 5 regularly, since the two databases keep growing and the model must be retrained repeatedly to improve the accuracy of the recommended data set.

2. The deep learning-based scientific research academic news keyword matching recommendation method according to claim 1, characterized in that in step 3 the processing of scientific research news and the extraction of keywords fall into two main categories, with the following specific steps:

step 3.1, for picture news in the scientific research news, decomposing and recognizing the images by constructing a BOW model and a codebook, in two ways: the first is to extract the text in the picture, perform word segmentation to form a keyword lexicon, and compare it with the paper keyword library; the second is to perform face recognition on the persons in the picture and compare them with the library of researchers already entered, and, if the paper library contains papers written by an author who matches a person in the picture, to recommend that author's papers for the news, with the most recently published papers given priority in the recommendation order;

step 3.2, processing the text news in the scientific research news by combining a word segmentation method with statistical means: the number of times each word appears in the news headline and body is counted and words common to all news are removed, so that the words with the highest frequency are taken as the keywords of the news; and, considering that the words in a news headline summarize the news, keywords in the headline are given a higher weight than words appearing in the body.

3. The scientific research academic news keyword matching recommendation method based on deep learning as claimed in claim 1, wherein the training method for the news and thesis keyword lexicon in step 5 specifically comprises the following steps:

step 5.1, importing a word2vec deep learning model;

step 5.2, initializing the parameters of the input gate, the output gate and the forget gate so as to maintain the accuracy of the model;

step 5.3, setting an output criterion for the output layer, whereby the output gate outputs only words whose degree of association is greater than 0.8;

step 5.4, inputting from outside the preprocessed high-quality keyword corpus, the preprocessed corpus being the corpus processed in step 2;

step 5.5, generating a recommendation priority, sorting according to the recommendation priority, and inputting the training result.

4. The deep learning-based scientific research academic news keyword matching recommendation method according to claim 1, wherein the recommendation strategy in the step 7 specifically adopts a strategy of recommending scientific research news and academic papers to each other, and a correlation matching recommendation method aiming at scientific research news and academic papers is found by establishing two keyword word banks and finding the correlation between keywords through a deep learning method.