CN114461778A

CN114461778A - Comprehensive scientific research result recommendation method and device for mass scientific research data

Info

Publication number: CN114461778A
Application number: CN202111614370.0A
Authority: CN
Inventors: 刘玮; 李超; 纪玉春; 王益静; 张永铮; 李书豪; 庹宇鹏; 常鹏; 王晗; 祁睿
Original assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2022-05-10

Abstract

The invention discloses a comprehensive recommendation method and device for scientific research achievements oriented to mass scientific research data. The method comprises the following steps: respectively extracting personal information characteristics and attribute characteristics based on the personal information of the user and the attributes of the scientific research data; inputting the personal information characteristics and the attribute characteristics into a BP neural network to obtain a first recommendation result; obtaining a second recommendation result by calculating the similarity between the personal information characteristics and the attribute characteristics; and obtaining a comprehensive recommendation result according to the first recommendation result and/or the second recommendation result. The invention quickly and accurately extracts dozens of topic-related attribute characteristics from a large number of scientific research topic documents, compares and analyzes key scientific research elements, builds a refined scientific research achievement recommendation model based on a BP neural network from these attribute characteristics, and recommends scientific research topics to clients more accurately.

Description

Comprehensive scientific research result recommendation method and device for mass scientific research data

Technical Field

The invention relates to the field of machine learning, in particular to a comprehensive recommendation method and device for scientific research achievements oriented to mass scientific research data.

Background

A recommendation system is a response to information overload: faced with massive scientific research data, it quickly recommends scientific research topic data that matches the characteristics of a user, and it mainly serves people without a clearly defined need. It addresses two problems: how to find the information a user is interested in within a large volume of information, and how to make maximal use of existing scientific research result data to assist and serve other scientific research work.

The application requirement addressed here is a comprehensive recommendation method for scientific research results and scientific research elements based on mass scientific research data, so that users obtain the scientific research information they need more quickly and effectively, and existing scientific research results reach users more quickly and effectively. The main approach is a comprehensive recommendation method that combines a recommendation algorithm based on user statistics, a recommendation algorithm based on scientific research result content, and a recommendation algorithm based on collaborative filtering (including collaborative filtering based on scientific research topic results and collaborative filtering based on users), which further supports and assists decision-making on scientific research projects.

Disclosure of Invention

To meet the above requirements and realize comprehensive recommendation of scientific research achievements and scientific research elements, the invention provides a comprehensive recommendation method and device for scientific research achievements oriented to mass scientific research data. Basic recommendation methods are formed first, and the results and characteristics of the various methods are then fused. This greatly improves the processing performance of the algorithm on mass data and yields a comprehensive recommendation method for scientific research achievements and scientific research elements with good real-time performance and high accuracy.

The technical scheme of the invention comprises the following steps:

a comprehensive recommendation method for scientific research achievements oriented to mass scientific research data comprises the following steps:

respectively extracting personal information characteristics and attribute characteristics based on the personal information of the user and the attributes of scientific research data;

inputting the personal information characteristics and the attribute characteristics into a BP neural network to obtain a first recommendation result;

obtaining a second recommendation result by calculating the similarity between the personal information characteristics and the attribute characteristics;

and obtaining a comprehensive recommendation result according to the first recommendation result and/or the second recommendation result.

Further, the personal information includes: research direction, field of engagement, browsing records, and collecting information.

Further, the attributes include: performance information, demand information, guideline information, project library information, and achievement benefit information.

Further, the performance information includes: business problems, annual goals and long-term tasks; the demand information includes: research direction and core objectives; the guideline information includes: guideline number and demand-proposing department; the project library information includes: topic number, period, security level, expense and review information; the achievement benefit information includes: key technologies, papers, patents, software copyrights and application benefits.

Further, the personal information features include: interest of users in each scientific research result.

Further, the similarity is calculated by:

1) based on the personal information characteristics, obtaining a plurality of <user, scientific research achievement, tag> triplets;

2) based on the <user, scientific research achievement, tag> triplets, counting the number of times $n_{u,b}$ that user u has used any tag b and the number of times $n_{b,i}$ that scientific research achievement i has been marked with tag b, where u denotes a user;

3) calculating the interestingness according to the counts $n_{u,b}$ and $n_{b,i}$.
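The counting steps above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the interestingness is the sum over tags of the product of the two counts, and the function name `tag_interest` and the toy triplets are hypothetical, not taken from the invention.

```python
from collections import defaultdict

def tag_interest(triples):
    """Score each (user, result) pair as the sum over tags b of n_ub * n_bi."""
    n_ub = defaultdict(int)  # (user, tag) -> times user u used tag b
    n_bi = defaultdict(int)  # (tag, result) -> times result i was marked with tag b
    for user, result, tag in triples:
        n_ub[(user, tag)] += 1
        n_bi[(tag, result)] += 1

    # Group results by tag so each user's tags can be expanded quickly.
    results_by_tag = defaultdict(list)
    for (tag, result), count in n_bi.items():
        results_by_tag[tag].append((result, count))

    interest = defaultdict(float)  # (user, result) -> interest score
    for (user, tag), count_ub in n_ub.items():
        for result, count_bi in results_by_tag[tag]:
            interest[(user, result)] += count_ub * count_bi
    return dict(interest)

# Hypothetical <user, scientific research achievement, tag> triplets.
triples = [
    ("u1", "r1", "nlp"),
    ("u1", "r2", "nlp"),
    ("u2", "r2", "nlp"),
    ("u2", "r3", "security"),
]
interest_scores = tag_interest(triples)
```

Pairs that share no tag never receive a score, so they are simply absent from the returned dictionary.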

Further, the interestingness is calculated by:

1) acquiring the user's historical browsing records of scientific research data to obtain a plurality of <user, scientific research achievement, tag> triplets;

2) taking all the tags of scientific research achievement i as a document, and calculating the term frequency and the inverse document frequency of each tag;

3) calculating the interest degree of the user in each scientific research achievement based on the term frequency and the inverse document frequency.
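As a rough illustration of step 2), the sketch below treats the tag list of each scientific research achievement as one document and computes a TF-IDF weight per tag. The exact TF and IDF definitions are not given in the text, so the plain `count / total` term frequency and the unsmoothed `log(N / df)` inverse document frequency used here are assumptions, as are the function name and sample data.

```python
import math
from collections import Counter

def tag_tfidf(result_tags):
    """Return {result: {tag: tf * idf}}, treating each result's tag list as a document."""
    n_docs = len(result_tags)
    # Document frequency: number of results whose tag list contains the tag.
    df = Counter(tag for tags in result_tags.values() for tag in set(tags))
    weights = {}
    for result, tags in result_tags.items():
        counts = Counter(tags)
        total = len(tags)
        weights[result] = {
            tag: (count / total) * math.log(n_docs / df[tag])
            for tag, count in counts.items()
        }
    return weights

# Hypothetical tag lists for three scientific research achievements.
result_tags = {
    "r1": ["nlp", "nlp", "security"],
    "r2": ["nlp"],
    "r3": ["security", "network"],
}
tfidf_weights = tag_tfidf(result_tags)
```

In this toy data the tag `network` appears on only one achievement, so it is weighted above `security` within `r3` even though both occur once there.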

Further, penalty terms for popular tags and for popular scientific research achievements are added when calculating the interest degree of the user in each scientific research achievement.

Further, the attribute feature is obtained by:

1) classifying the attributes of the scientific research data, wherein the classified categories comprise: a numerical attribute, a temporal attribute, and a statistical attribute;

2) carrying out normalization processing on the numerical attribute;

for the time attribute, carrying out normalization processing on time attributes that take continuous values, and carrying out discretization processing on time attributes that take discrete values;

for the statistical attributes:

for the statistical attribute of project benefit, discretizing the scientific research expenses by an equal-step method and an equal-frequency method;

for categorical statistical attributes, expanding the attribute in parallel using One-Hot coding/dummy variables;

for ordinal statistical attributes, calculating the rank based on the ordinal data;

for proportional statistical attributes, calculating the proportion based on the proportional data;

3) obtaining the attribute characteristics from the numerical attribute characteristics, the time attribute characteristics and the statistical attribute characteristics.

Further, the method for calculating the similarity includes: cosine similarity method.

A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.

An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method as described above.

Compared with the prior art, the invention has the following advantages:

the invention quickly and accurately extracts dozens of related attribute characteristics of the topics from a large amount of scientific research topic documents, performs comparison analysis on key scientific research factors, establishes a refined scientific research achievement recommendation model based on a BP neural network through dozens of attribute characteristics, and recommends scientific research topics to clients more accurately.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a block diagram of a BP neural network-based enterprise research result recommendation method.

FIG. 3 is a diagram of the scientific research result content recommendation method.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The process of the invention is shown in fig. 1, and mainly comprises an enterprise scientific research result recommendation method based on a BP neural network and a scientific research result content recommendation method based on enterprise scientific research result data.

The invention accurately extracts dozens of topic-related attribute characteristics from massive scientific research topic documents. These cover five categories of refined attribute information: performance information (business problems, annual targets, long-term tasks, etc.), demand information (research directions, core targets, etc.), guideline information (guideline numbers, demand-proposing departments, etc.), project library information (topic numbers, periods, confidentiality levels, expenses, review information, etc.), and achievement benefit information (key technologies, papers, patents, software copyrights, application benefits, etc.). Key scientific research elements are compared and analyzed and combined with the personal information of users (research direction, field of engagement, browsing records and collection information) to provide an enterprise scientific research result recommendation method based on a BP neural network and a scientific research result content recommendation method based on enterprise scientific research result data.

BP neural network-based enterprise scientific research achievement recommendation method

Each layer of the neural network consists of a number of neurons; adjacent layers are fully connected, and neurons within the same layer are not connected to each other. During training, a sample is first propagated from the input layer to the hidden layer through the initial weights and the activation function, and then to the output layer in the same way. If the error between the predicted result at the output layer and the actual result exceeds a threshold, the error is propagated back along the original path and the connection weights between layers are adjusted step by step. This is repeated until the error falls within the threshold or the maximum number of iterations is reached.

Let the scientific research result training data of the enterprise be D. The data mainly comprises five types of data characteristics: performance information (business problems, annual objectives, etc.), demand information (research direction, core objectives, etc.), guideline information (guideline number, demand-proposing department, etc.), project library information (period, confidentiality level, expense, etc.) and achievement benefit information (key technologies, etc.). Then D is:

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \quad x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}^l,$$

i.e., each input sample of the scientific research results has d attributes and l labels. The corresponding neural network has an input layer of d nodes, an output layer of l nodes and a hidden layer of q nodes; the network structure is shown in FIG. 2.

Denote the connection weight between the i-th node of the input layer and the h-th node of the hidden layer as $v_{ih}$ and the threshold of the h-th hidden layer node as $\gamma_h$. The input $\alpha_h$ received by that node is then

$$\alpha_h = \sum_{i=1}^{d} v_{ih}\, x_i.$$

Denote the output of the above hidden layer node as $b_h$, the connection weight between that node and the j-th node of the output layer as $\omega_{hj}$, and the threshold of the j-th output layer node as $\theta_j$. The input $\beta_j$ received by the output node is

$$\beta_j = \sum_{h=1}^{q} \omega_{hj}\, b_h.$$

For one sample $(x_k, y_k)$ in the training data, suppose the output of the neural network is

$$\hat{y}_k = (\hat{y}_1^k, \hat{y}_2^k, \ldots, \hat{y}_l^k),$$

namely

$$\hat{y}_j^k = f(\beta_j - \theta_j),$$

where f is the activation function. Then the mean square error of the neural network on $(x_k, y_k)$ is

$$E_k = \frac{1}{2} \sum_{j=1}^{l} \left(\hat{y}_j^k - y_j^k\right)^2.$$

Any parameter v is then iteratively updated by $v \leftarrow v + \Delta v$.

At test time, the personal information characteristics are input into the trained network to obtain the corresponding enterprise scientific research achievements.
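The training loop described in this section can be sketched as a single-hidden-layer network with d inputs, q hidden nodes and l outputs. The sketch below makes several assumptions that the text does not state: a sigmoid activation, per-sample gradient descent with a learning rate `eta`, uniform random initialisation, and a toy data set. The class and function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class BPNetwork:
    def __init__(self, d, q, l, seed=0):
        rng = random.Random(seed)
        self.v = [[rng.uniform(-0.5, 0.5) for _ in range(q)] for _ in range(d)]  # input -> hidden weights
        self.gamma = [rng.uniform(-0.5, 0.5) for _ in range(q)]                  # hidden thresholds
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(l)] for _ in range(q)]  # hidden -> output weights
        self.theta = [rng.uniform(-0.5, 0.5) for _ in range(l)]                  # output thresholds

    def forward(self, x):
        # b_h = f(alpha_h - gamma_h), y_hat_j = f(beta_j - theta_j)
        b = [sigmoid(sum(self.v[i][h] * x[i] for i in range(len(x))) - self.gamma[h])
             for h in range(len(self.gamma))]
        y_hat = [sigmoid(sum(self.w[h][j] * b[h] for h in range(len(b))) - self.theta[j])
                 for j in range(len(self.theta))]
        return b, y_hat

    def train_step(self, x, y, eta):
        b, y_hat = self.forward(x)
        # Output-layer and hidden-layer gradient terms for the sigmoid activation.
        g = [y_hat[j] * (1 - y_hat[j]) * (y[j] - y_hat[j]) for j in range(len(y))]
        e = [b[h] * (1 - b[h]) * sum(self.w[h][j] * g[j] for j in range(len(g)))
             for h in range(len(b))]
        for h in range(len(b)):
            for j in range(len(g)):
                self.w[h][j] += eta * g[j] * b[h]
            self.gamma[h] -= eta * e[h]
        for j in range(len(g)):
            self.theta[j] -= eta * g[j]
        for i in range(len(x)):
            for h in range(len(e)):
                self.v[i][h] += eta * e[h] * x[i]
        # Error on this sample, measured before the update.
        return 0.5 * sum((y_hat[j] - y[j]) ** 2 for j in range(len(y)))

def mean_error(net, data):
    return sum(0.5 * sum((p - t) ** 2 for p, t in zip(net.forward(x)[1], y))
               for x, y in data) / len(data)

def train(net, data, max_iter=2000, tol=1e-3, eta=0.5):
    err = float("inf")
    for _ in range(max_iter):
        err = sum(net.train_step(x, y, eta) for x, y in data) / len(data)
        if err < tol:  # stop once the error is within the threshold
            break
    return err

# Hypothetical toy data: two input attributes, one output label.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
net = BPNetwork(d=2, q=3, l=1)
error_before = mean_error(net, data)
error_after = train(net, data)
```

In practice the inputs would be the extracted personal information and attribute feature vectors rather than this toy data.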

Scientific research result content recommendation method based on enterprise scientific research result data

The scientific research result content recommendation method finds similarity between data according to the metadata of the scientific research results or content to be recommended, and recommends similar information to the user based on the user's past browsing and collection records, as shown in FIG. 3. The similarity calculation relies on extracting intrinsic or extrinsic feature values of the scientific research results. Typical features of a topic include the topic time, key technology tags, user scale tags and project expenses.

The personal information characteristics of the user (based on the user's browsing and collection records or on preset interest tags) are matched against the characteristics of the topic to obtain the interest degree of the user in the scientific research achievements.

Similarity is judged mainly by distance. The method uses cosine similarity, calculated as follows:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\, \sqrt{\sum_{i} B_i^2}}$$

where A and B denote the two feature vectors being compared.
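A minimal sketch of the cosine similarity calculation, and of ranking scientific research results against a user vector, might look as follows. The vector layout, the `rank_results` helper and the `top_k` parameter are hypothetical; the text does not specify how the vectors are built or how many results are returned.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_results(user_vec, result_vecs, top_k=3):
    """Return the top_k (result_id, similarity) pairs, most similar first."""
    scored = [(rid, cosine_similarity(user_vec, vec)) for rid, vec in result_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical feature vectors over the same three tag dimensions.
user_vec = [1.0, 0.0, 1.0]
result_vecs = {"r1": [1.0, 0.0, 1.0], "r2": [0.0, 1.0, 0.0], "r3": [1.0, 1.0, 0.0]}
ranking = rank_results(user_vec, result_vecs)
```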

The key step in modeling scientific research topics is the feature extraction (tagging) of scientific research achievements. In the practical project of the invention, historical scientific research achievement data is tagged with expert tags (PGC) and user-defined tags (UGC). In the newly built system, a tagging module is provided to users, and the corresponding feature data is acquired through this module.

The flow structure of the content-based recommendation system mainly comprises: classifying and sorting various types of data to generate effective information sources, analyzing content, learning features, generating a recommendation list after a filtering component, and further perfecting the feature learning by using feedback of a user through the recommendation list.

In feature engineering for scientific research results, features are the set of input variables on which a judgment is based, and the output variable of the model, which is the target to be judged or predicted, is the result produced from those features. The features of a scientific research topic are the information extracted from scientific research result data that is useful for predicting the result, and the number of features reflects the dimensionality along which the data is observed. Feature engineering comprises feature cleaning (sampling and removing abnormal samples), feature processing and feature selection. The features of scientific research results are classified by data type, and a different processing method is used for each type: numerical (such as project expenses, scores, ratings, user scale and application range), time (such as project establishment time), statistical (such as system application benefit statistics, achievement quantity statistics, rating, field and direction), and so on.

The scientific research result feature data of the invention is derived from the basic data of thousands of topics undertaken by dozens of branch companies over ten years.

Amplitude adjustment and normalization are applied to numerical features. For example, project expenses usually lie between one hundred thousand and one million, while project scores usually lie between 50 and 100. The features should be on an equal footing with one another, and differences should be reflected within each feature rather than in its scale. The normalization operation uses the following formula:

$$\mathit{Feature}_{new} = \frac{\mathit{Feature}_{old} - \mathit{Feature}_{min}}{\mathit{Feature}_{max} - \mathit{Feature}_{min}}$$

where $\mathit{Feature}_{max}$ is the maximum value of the numerical feature, $\mathit{Feature}_{min}$ is its minimum value, $\mathit{Feature}_{old}$ is the original value, and $\mathit{Feature}_{new}$ is the normalized value.
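The normalization can be illustrated with a short sketch. The sample expense and score values are hypothetical, and mapping a constant feature to all zeros is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (old - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant feature maps to 0.0
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical project expenses and project scores on very different scales.
project_expenses = [100_000, 550_000, 1_000_000]
project_scores = [50, 75, 100]

normalized_expenses = min_max_normalize(project_expenses)
normalized_scores = min_max_normalize(project_scores)
```

After rescaling, both features span the same range, so neither dominates a distance or similarity calculation purely because of its units.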

Discretizing some specific scientific research result characteristics, cutting off an original continuous value, converting the original continuous value into a discrete value, and adopting a formula:

For example, time feature processing is divided into continuous and discrete cases. Features such as how long a user browses a web page, or the time interval between a user's visits to the same topic result, are continuous values and are normalized. In the user behavior data, values such as the time period of the day, the day of the week, and whether the day is a weekday or a weekend are discrete values and are discretized.
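The discrete time features mentioned above can be derived from a timestamp as in the following sketch. The four period boundaries (6, 12 and 18 o'clock) and the output field names are assumptions chosen for illustration; the text does not define them.

```python
from datetime import datetime

def time_features(ts):
    """Derive discrete time features from a datetime of a user action."""
    hour = ts.hour
    if hour < 6:
        period = "night"
    elif hour < 12:
        period = "morning"
    elif hour < 18:
        period = "afternoon"
    else:
        period = "evening"
    day_of_week = ts.weekday()  # Monday is 0, Sunday is 6
    return {
        "period": period,
        "day_of_week": day_of_week,
        "is_weekend": day_of_week >= 5,
    }

# Hypothetical browsing timestamps.
weekday_visit = time_features(datetime(2021, 12, 27, 14, 30))
weekend_visit = time_features(datetime(2022, 1, 1, 2, 0))
```

Continuous time features such as browsing duration would instead go through the normalization step described earlier.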

Statistical processing includes processing of project benefits, categorical data, sequential data, and proportional data:

In general, project expenses are in direct proportion to project benefits. The specific expense figures have little practical value in themselves, so the expenses can be discretized to simplify calculation. There are two main discretization methods: equal step and equal frequency. Equal-frequency discretization is more accurate, but the data distribution must be recomputed every time. The distribution of topic results that users browsed in the system yesterday is not necessarily the same today, so the equal-frequency cut points derived yesterday may no longer be suitable. Cut points that are not fixed and must be computed in real time should be avoided in an online system, because a model trained yesterday may not be applicable today. Equal frequency is accurate but not fixed, whereas equal step is fixed and simple to use; the invention applies both methods to discretize the data.
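The difference between the two discretization methods can be seen in a small sketch. The bin count and the expense values are hypothetical, and the rank-based equal-frequency scheme shown here is only one simple way to realise the idea.

```python
def equal_step_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to a bin so that bins hold roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

# Hypothetical, heavily skewed project expenses (arbitrary units).
skewed_expenses = [10, 20, 30, 40, 1000, 2000]
step_bins = equal_step_bins(skewed_expenses, 3)
frequency_bins = equal_frequency_bins(skewed_expenses, 3)
```

With skewed data the equal-step cut points leave most values in the first bin, while the equal-frequency assignment spreads them evenly. The equal-frequency result depends on the current data, which is why its cut points have to be recomputed when the distribution changes.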

Categorical data among the scientific research result features has no inherent order. It must be encoded as numbers, but the encoding must not introduce a preset order. To treat all categories fairly while still distinguishing them, a separate dimension is allocated to each category. The invention expands categorical data in parallel using One-Hot coding/dummy variables; after encoding, the feature space grows, and the statistical features are obtained.
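A minimal sketch of One-Hot coding for one categorical attribute is shown below. The confidentiality level values are hypothetical, and sorting the categories alphabetically is an arbitrary choice made only so that the column order is reproducible.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    index = {category: k for k, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical confidentiality levels of four topics.
levels = ["public", "secret", "internal", "public"]
level_categories, level_rows = one_hot_encode(levels)
```

Each category gets its own column, so no category is numerically larger than another, at the cost of one extra dimension per distinct value.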

Ordinal data: for example, the topic result ranks second among the most-retrieved popular results. Proportional data: for example, the ratio of good to poor ratings of the topic result, or the ratio of whether it has been collected or cited. Statistical features are obtained from these.

In addition, statistical feature processing also includes methods such as adding or subtracting the mean and using quantile lines.

The feedback data of the recommendation system in the invention comprises: the number of times topic results are retrieved, collected, liked and cited, together with user behavior data such as expert tags, user-defined tags and the time users stay on a page.

The recommendation method based on user tags relies on tags, which describe users' opinions of topic results, link users with scientific research results, and are an important data source reflecting user interests. A user tagging behavior data set is generally represented as a set of triples (user, topic result, tag), where a record (u, i, b) means that user u applied tag b to topic result i. The direct tag-based recommendation method is as follows: for each user, count the tags that the user uses most often; for each tag, count the topic results that have been given the tag most often; for a given user, first find the user's commonly used tags, then find the popular topic results carrying those tags and recommend them to the user. The interest of user u in topic result i is:

$$p(u, i) = \sum_{b} n_{u,b}\, n_{b,i}$$

where $n_{u,b}$ is the number of times user u has used tag b, and $n_{b,i}$ is the number of times topic result i has been marked with tag b. On this basis, the invention adopts the TF-IDF (term frequency-inverse document frequency) statistical method. TF-IDF is a common weighting technique in information retrieval and text mining, used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document but decreases in inverse proportion to its frequency in the corpus. If a word has a high TF in one article and rarely appears in other articles, the word or phrase is considered to have good discriminating power between classes and to be suitable for classification:

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

The interest formula above tends to give a large weight to popular tags and to the popular topic results that carry them, so popular topic results are recommended to the user and the novelty of the recommendation result is reduced. Giving popular tags too much weight also fails to reflect the personalized interests of the user. Borrowing the idea of TF-IDF, the interest formula is improved to:

$$p(u, i) = \sum_{b} \frac{n_{u,b}}{\log\left(1 + n_b^{(u)}\right)}\, n_{b,i}$$

where $n_b^{(u)}$ records how many different users have used tag b. Similarly, popular topic results are penalized using the idea of TF-IDF, giving the following formula:

$$p(u, i) = \sum_{b} \frac{n_{u,b}}{\log\left(1 + n_b^{(u)}\right)} \cdot \frac{n_{b,i}}{\log\left(1 + n_i^{(u)}\right)}$$

where $n_i^{(u)}$ records how many different users have tagged topic result i.
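The effect of penalizing popular tags and popular topic results can be sketched as follows. The sketch assumes each per-tag term is the user's tag count divided by `log(1 + number of distinct users of the tag)`, multiplied by the result's tag count divided by `log(1 + number of distinct users who tagged the result)`; the function name and the toy triples are hypothetical.

```python
import math
from collections import defaultdict

def penalized_interest(triples):
    """Tag-based interest with log penalties for popular tags and popular results."""
    n_ub = defaultdict(int)           # (user, tag) -> times user used the tag
    n_bi = defaultdict(int)           # (tag, result) -> times result received the tag
    tag_users = defaultdict(set)      # tag -> distinct users who used it
    result_users = defaultdict(set)   # result -> distinct users who tagged it
    for user, result, tag in triples:
        n_ub[(user, tag)] += 1
        n_bi[(tag, result)] += 1
        tag_users[tag].add(user)
        result_users[result].add(user)

    results_by_tag = defaultdict(list)
    for (tag, result), count in n_bi.items():
        results_by_tag[tag].append((result, count))

    interest = defaultdict(float)
    for (user, tag), count_ub in n_ub.items():
        tag_weight = count_ub / math.log(1 + len(tag_users[tag]))
        for result, count_bi in results_by_tag[tag]:
            result_weight = count_bi / math.log(1 + len(result_users[result]))
            interest[(user, result)] += tag_weight * result_weight
    return dict(interest)

# Hypothetical data: one niche result and one result that four users tagged "hot".
toy_triples = [
    ("u1", "r_niche", "niche"),
    ("u1", "r_hot", "hot"),
    ("u2", "r_hot", "hot"),
    ("u3", "r_hot", "hot"),
    ("u4", "r_hot", "hot"),
]
penalized = penalized_interest(toy_triples)
```

Without the penalties, `r_hot` would score 4 against 1 for `r_niche` for user `u1`; with them, the niche result ranks higher, which is the novelty effect the text describes.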

The following is a specific embodiment of the present invention:

Data acquisition: the data sources used by the method are complex, and include data acquired through system integration, manually compiled Excel files, various data related to scientific research topics, user information, and log data generated while the system is in use. The various data are first summarized and classified manually.

Data cleaning: because scientific research data is varied, the data is cleaned so that only data useful to the method is kept. For user behavior data, user description information and operation information related to scientific research topics are kept; for scientific research data, the description information of the topic, the topic result data labeled by the system, and the topic benefit information are kept.

Data storage: the description information and behavior information related to users are stored in a mysql_user database, and the description information and result benefit information related to topics are stored in a separate mysql_subject database. The cleaned data of all types is also stored in a Hive data warehouse, and statistical analysis is performed with Spark SQL.

Data analysis: in the implementation, the enterprise scientific research result recommendation method based on a BP neural network and the scientific research result content recommendation method based on enterprise scientific research result data are run in the Spark MLlib machine learning system to obtain a recommendation list of scientific research topics.

Data display: a web page is built with the SSM (Spring, Spring MVC, MyBatis) framework, and the scientific research topic data and the recommendation list of elements are displayed for comparison.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

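1. A comprehensive recommendation method for scientific research achievements oriented to mass scientific research data comprises the following steps:

respectively extracting personal information characteristics and attribute characteristics based on the personal information of the user and the attributes of scientific research data;

inputting the personal information characteristics and the attribute characteristics into a BP neural network to obtain a first recommendation result;

obtaining a second recommendation result by calculating the similarity between the personal information characteristics and the attribute characteristics;

and obtaining a comprehensive recommendation result according to the first recommendation result and/or the second recommendation result.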

2. The method of claim 1, wherein the attributes comprise: performance information, demand information, guideline information, project library information, and achievement benefit information.

3. The method of claim 2, wherein the performance information comprises: business problems, annual goals and long-term tasks; the demand information comprises: research direction and core objectives; the guideline information comprises: guideline number and demand-proposing department; the project library information comprises: topic number, period, security level, expense and review information; the achievement benefit information comprises: key technologies, papers, patents, software copyrights and application benefits.

4. The method of claim 1, wherein the personal information comprises: research direction, field of engagement, browsing records and collecting information; the personal information features include: interest of users in each scientific research result.

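5. The method of claim 4, wherein the similarity is calculated by:

1) based on the personal information characteristics, obtaining a plurality of <user, scientific research achievement, tag> triplets;

2) based on the <user, scientific research achievement, tag> triplets, counting the number of times $n_{u,b}$ that user u has used any tag b and the number of times $n_{b,i}$ that scientific research achievement i has been marked with tag b, where u denotes a user;

3) calculating the interestingness according to the counts $n_{u,b}$ and $n_{b,i}$.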

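6. The method of claim 4, wherein the interestingness is calculated by:

1) acquiring the user's historical browsing records of scientific research data to obtain a plurality of <user, scientific research achievement, tag> triplets;

2) taking all the tags of scientific research achievement i as a document, and calculating the term frequency and the inverse document frequency of each tag;

3) calculating the interest degree of the user in each scientific research achievement based on the term frequency and the inverse document frequency.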

7. The method of claim 6, wherein the trending labels and the penalties for trending research results are added when calculating the interest level of the user in each research result.

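8. The method of claim 1, wherein the attribute characteristics are obtained by:

1) classifying the attributes of the scientific research data, wherein the classified categories comprise: a numerical attribute, a time attribute and a statistical attribute;

2) carrying out normalization processing on the numerical attribute;

for the time attribute, carrying out normalization processing on time attributes that take continuous values, and carrying out discretization processing on time attributes that take discrete values;

for the statistical attributes:

for the statistical attribute of project benefit, discretizing the scientific research expenses by an equal-step method and an equal-frequency method;

for categorical statistical attributes, expanding the attribute in parallel using One-Hot coding/dummy variables;

for ordinal statistical attributes, calculating the rank based on the ordinal data;

for proportional statistical attributes, calculating the proportion based on the proportional data;

3) obtaining the attribute characteristics from the numerical attribute characteristics, the time attribute characteristics and the statistical attribute characteristics.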

9. The method of claim 1, wherein the method of calculating the similarity comprises: cosine similarity method.

10. An electronic apparatus comprising a memory having a computer program stored therein and a processor arranged to execute the computer program to perform the method according to any of claims 1-9.