CN114154498A

CN114154498A - Innovation evaluation method based on scientific and technological big data text content

Info

Publication number: CN114154498A
Application number: CN202111489894.1A
Authority: CN
Inventors: 刘业政; 陈航; 姜元春; 钱洋; 孙见山; 柴一栋; 王继成; 袁昆
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2022-03-08
Anticipated expiration: 2041-12-08
Also published as: CN114154498B

Abstract

The invention discloses an innovation evaluation method based on the text content of scientific and technological big data, comprising the following steps: 1. acquiring, preprocessing and segmenting the text content of scientific and technological big data; 2. processing the segmented text data with a TF-IDF model and constructing document word vectors for the scientific and technological big data to be evaluated; 3. reducing the dimensionality of the document word vectors by principal component analysis (PCA); 4. calculating the similarity between all documents in the time window M, expressed as the cosine between the word vectors of each pair of documents; 5. sorting the similarity values in each set in descending order and selecting the L values with the highest similarity to the text, the smallest of which represents the degree of innovation of the text, and obtaining a standardized innovation score. The method can effectively evaluate the innovation of scientific and technological big data and improve evaluation accuracy, thereby laying a foundation for the evaluation and screening of valuable scientific and technological big data.

Description

Innovation evaluation method based on scientific and technological big data text content

Technical Field

The invention relates to the field of value evaluation of scientific and technological big data, and in particular to an innovation evaluation method for scientific and technological big data based on text content.

Background

In recent years, with the rapid development of network and communication technology, the data associated with people's daily life and production has grown explosively, and modern society has entered the big data era. Scientific and technological big data is a kind of information resource that reflects the state and process of human scientific and technological activities. It can help people understand new ideas, discover new laws, invent new technologies and develop new products. In other words, on the one hand, scientific and technological big data has value like other ordinary data; on the other hand, owing to its own characteristics, its value is mainly oriented toward scientific and technological innovation. Innovation is therefore an indispensable characteristic of scientific and technological big data and the fundamental characteristic that distinguishes it from other data.

Scientific and technological big data comprises scientific papers, patents, software copyrights, standards and specifications, policy proposals and the like, and contains a large amount of unstructured data, typified by text content. The value and innovation of scientific papers and invention patents have been studied the most. On the one hand, researchers describe data value by establishing a value evaluation index system and applying traditional measurement models, but such approaches have difficulty measuring the quality of unstructured data such as text content. On the other hand, scholars use traditional text analysis methods, such as word frequency analysis and co-word analysis, and topic model methods represented by LDA to measure the quality of text content, but seldom evaluate the innovation of text content.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides an innovation evaluation method based on the text content of scientific and technological big data, so that the degree of innovation of scientific and technological big data can be effectively evaluated and the evaluation accuracy improved, thereby laying a foundation for the evaluation and screening of valuable scientific and technological big data.

To achieve this purpose, the invention adopts the following technical scheme:

The invention relates to an innovation evaluation method based on the text content of scientific and technological big data, characterized by comprising the following steps:

Step 1, acquiring a text content set of the scientific and technological big data to be evaluated, preprocessing it by removing duplicates and records with missing values, and dividing the preprocessed text content by generation time to obtain the timestamped text contents $\{d_1, d_2, \ldots, d_m, \ldots, d_M\}$, where $d_m$ denotes the m-th text content and $M$ denotes the number of texts in the text content set;
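The preprocessing in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (dicts with the hypothetical keys `text` and `date`), the use of identical text to detect duplicates, and the grouping of time windows by calendar year are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def preprocess(records):
    """Drop records with missing fields, keep the latest duplicate, group by year."""
    latest = {}
    for rec in records:
        # Remove records with a missing text or timestamp.
        if not rec.get("text") or rec.get("date") is None:
            continue
        key = rec["text"].strip()
        # Among records with identical text, keep the most recent one.
        if key not in latest or rec["date"] > latest[key]["date"]:
            latest[key] = rec
    windows = defaultdict(list)
    for rec in latest.values():
        windows[rec["date"].year].append(rec)  # one time window per year (assumption)
    return dict(windows)


# Hypothetical sample records, for illustration only.
records = [
    {"text": "graph neural network", "date": date(2020, 3, 1)},
    {"text": "graph neural network", "date": date(2020, 9, 1)},
    {"text": "", "date": date(2020, 5, 1)},
    {"text": "quantum error correction", "date": date(2021, 1, 15)},
]
windows = preprocess(records)
```

Each value of `windows` then plays the role of the text set $\{d_1, \ldots, d_M\}$ for one time window.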

Step 2, performing word segmentation, stop-word removal and merging of repeated words on the m-th text content $d_m$ to obtain the segmented m-th text content $d'_m = \{w_1^m, w_2^m, \ldots, w_n^m, \ldots, w_{N_m}^m\}$, where $w_n^m$ denotes the n-th word in the segmented m-th text content $d'_m$ and $N_m$ denotes the total number of distinct words in $d'_m$; a corpus $D$ is thus formed from all the segmented words of the M text contents;

Step 3, processing the segmented text content with a TF-IDF model, extracting the keywords of the scientific and technological big data to be evaluated, and constructing document word vectors;

Step 3.1, calculating, by formula (1), the TF-IDF value $T_{nm}$ of the n-th word $w_n^m$ of the segmented m-th text content $d'_m$ in the corpus $D$:

$$T_{nm} = TF_{w_n^m} \times IDF_{w_n^m} \tag{1}$$

In formula (1), $TF_{w_n^m}$ denotes the term frequency of the n-th word $w_n^m$ in the segmented m-th text content $d'_m$, and $IDF_{w_n^m}$ denotes the inverse document frequency of the n-th word $w_n^m$ in the corpus $D$;

Step 3.2, constructing the document word vectors of the scientific and technological big data to be evaluated;

Merging the words repeated across the text contents in the corpus $D$ to obtain a merged corpus $D' = \{t_1, t_2, \ldots, t_p, \ldots, t_P\}$, where $t_p$ denotes the p-th word and $P$ denotes the total number of words in the merged corpus $D'$; calculating, by formula (2), the importance $X_{pm}$ of the p-th word $t_p$ in the segmented m-th text content $d'_m$:

$$X_{pm} = \begin{cases} T_{nm}, & t_p = w_n^m \in d'_m \\ 0, & t_p \notin d'_m \end{cases} \tag{2}$$

thereby obtaining the word vector $X_m = (X_{1m}, X_{2m}, \ldots, X_{pm}, \ldots, X_{Pm})^T$ of $d'_m$ and, further, the word vectors $X = (X_1, X_2, \ldots, X_m, \ldots, X_M)$ of all documents in the time window M, which are treated as a sparse matrix;

Step 4, reducing the dimension of the sparse matrix $X$ by principal component analysis;

Step 4.1, zero-centering each row of the sparse matrix $X$ to obtain a matrix $H$;

Step 4.2, calculating the covariance matrix $C = \frac{1}{M} H H^T$;

Step 4.3, calculating the eigenvalues of the covariance matrix $C$ and the corresponding orthonormal eigenvectors, and arranging the eigenvectors as rows, in descending order of their eigenvalues, to form a matrix $P$;

Step 4.4, taking the matrix formed by the first k rows of the matrix $P$ and multiplying it by the matrix $H$ to obtain the dimension-reduced matrix $Y = (Y_1, Y_2, \ldots, Y_m, \ldots, Y_M)$, where $Y_m = (Y_{1m}, Y_{2m}, \ldots, Y_{km})^T$ denotes the m-th document word vector after dimension reduction and $Y_{km}$ denotes the value of its k-th dimension; k < P;

Step 5, calculating the cosine similarity $\cos\langle Y_m, Y_z\rangle$ between the m-th dimension-reduced document word vector $Y_m$ and the z-th dimension-reduced document word vector $Y_z$ in the time window M, which represents the similarity $sim\langle d_m, d_z\rangle$ between the m-th text content $d_m$ and the z-th text content $d_z$, thereby obtaining the set of text similarities between $d_m$ and the other text contents, $\{sim\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$;

Step 6, sorting the text similarity set $\{sim\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$ in descending order, selecting the top L values with the greatest text similarity, taking the L-th similarity value as the innovation value of the m-th text content $d_m$, and standardizing this innovation value to obtain the innovation score of $d_m$;

Step 7, calculating the innovation scores of all M text contents in the time window M according to steps 5 and 6 and arranging them in descending order, thereby completing the innovation evaluation of the text content set of the scientific and technological big data to be evaluated.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention acquires and preprocesses the text content data of scientific and technological big data: records with missing values are removed, only the most recently generated record is kept among duplicates, and the evaluated data is assigned to time windows by year. Using an existing, mature stop-word list, the text content is segmented with jieba, stop words are deleted, and function words with no real meaning are removed. This greatly improves the quality of the text data and helps improve the efficiency of text analysis on real data;

2. The invention processes the segmented text data with a TF-IDF model, extracts the keywords of the scientific and technological big data, and calculates the TF-IDF values of the segmented words of each evaluated item, so that the key topics and information of each item in the time window are captured; document word vectors are constructed from the segmented words and their TF-IDF values, in preparation for the subsequent measurement of similarity between texts;

3. The invention constructs a sparse matrix from all the document word vectors in each time window and compresses it by principal component analysis (PCA), thereby reducing the dimension of the document word vectors. This alleviates the problem of computing distances between sample points in a high-dimensional space, prepares for the subsequent calculation of document word vector similarity, and improves the accuracy of the subsequent similarity measurement, so that the innovation of all scientific and technological big data of the field in the time window can be accurately described;

4. The invention expresses the similarity between documents as the cosine similarity between the dimension-reduced document word vectors. Finally, drawing on the k-nearest-neighbor (KNN) idea, the smallest of the several highest similarity values between each item of scientific and technological big data and all other items represents its degree of innovation, which helps evaluate the innovation and value of the data so that items with higher innovation and value can be screened more effectively.

Drawings

FIG. 1 is a flow chart of the innovation evaluation method based on the text content of scientific and technological big data according to the invention.

Detailed Description

In this embodiment, an innovation evaluation method based on the text content of scientific and technological big data extracts the keywords of the evaluated text content with a TF-IDF model; constructs document word vectors from the keywords and their TF-IDF values to form a sparse matrix; reduces the dimension of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); expresses the similarity between documents as the cosine similarity between the dimension-reduced document word vectors; and finally, drawing on the k-nearest-neighbor (KNN) idea, represents the degree of innovation of each text by the smallest of the L highest similarity values between that text and all other texts. Specifically, as shown in FIG. 1, the method proceeds according to the following steps:

Step 2, performing word segmentation, stop-word removal and merging of repeated words on the m-th text content $d_m$ to obtain the segmented m-th text content $d'_m = \{w_1^m, w_2^m, \ldots, w_n^m, \ldots, w_{N_m}^m\}$;
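The segmentation step can be illustrated with a small sketch. It assumes a whitespace tokenizer as a stand-in for a real Chinese segmenter such as jieba, and a hypothetical three-word stop list; both are placeholders chosen so that the example runs without extra dependencies.

```python
def segment(text, stop_words):
    """Tokenize, drop stop words, and merge repeated words (first occurrence wins)."""
    # Whitespace splitting is an assumption; a real segmenter would replace it.
    tokens = [tok for tok in text.lower().split() if tok not in stop_words]
    # dict.fromkeys keeps one entry per word, in order of first occurrence.
    distinct = list(dict.fromkeys(tokens))
    return tokens, distinct


STOP_WORDS = {"the", "of", "a"}  # hypothetical stop-word list
tokens, distinct = segment("The method of the invention evaluates the method", STOP_WORDS)
```

Here `tokens` keeps every occurrence (needed later for term frequencies), while `distinct` corresponds to the merged word list of one segmented text.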

TF-IDF (Term Frequency-Inverse Document Frequency) evaluates how important a word is to a text in a document set or corpus, and consists of two parts: TF and IDF.

Step 3.1: the term frequency (TF) is the frequency with which a word occurs in a text sample. Suppose $d_m$ is a particular text sample and $w_n^m$ is its n-th word, ordered by position after word segmentation (if a word is repeated, the position of its first occurrence is used and the positions of later repetitions are not counted). The term frequency $TF_{w_n^m}$ of the word in the text sample is the ratio of the number of occurrences of the word to the total number of occurrences of all words in the text, and can be expressed by formula (1):

$$TF_{w_n^m} = \frac{c(w_n^m)}{\sum_{i=1}^{N_m} c(w_i^m)} \tag{1}$$

where $c(w_n^m)$ denotes the number of occurrences of $w_n^m$ in the text sample and $N_m$ denotes the number of distinct words in the text sample.
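As a small worked illustration of the term-frequency ratio described above, the following sketch counts occurrences in one already-segmented text. The function name `term_frequency` and the sample tokens are illustrative choices, not part of the original disclosure.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: occurrences of word / total occurrences} for one segmented text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


tf = term_frequency(["method", "invention", "evaluates", "method"])
```

With four tokens in total, "method" occurs twice and therefore receives a term frequency of 0.5, while the other two words receive 0.25 each.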

Step 3.2: the inverse document frequency (IDF) evaluates how common a word is across a corpus. In this embodiment, the corpus $D$ is all the segmented text content data of scientific and technological big data in each time window. The IDF value $IDF_{w_n^m}$ of the word $w_n^m$ can be expressed by formula (2) in terms of the number of texts in $D$ that contain $w_n^m$, written $|\{d' \in D : w_n^m \in d'\}|$, and the total number of samples $N$:

$$IDF_{w_n^m} = \log \frac{N}{|\{d' \in D : w_n^m \in d'\}|} \tag{2}$$
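The inverse document frequency can be sketched in the same spirit. The version below assumes the plain logarithmic ratio of the total number of texts to the number of texts containing the word, with the natural logarithm and no smoothing; the log base and the absence of smoothing are assumptions, since the text only names the two quantities involved.

```python
import math


def inverse_document_frequency(corpus):
    """Return {word: log(N / number of texts containing the word)}."""
    n_texts = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for word in set(tokens):  # count each word once per text
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_texts / df) for word, df in doc_freq.items()}


# Hypothetical segmented texts, for illustration only.
corpus = [["method", "data"], ["method", "patent"], ["method", "data", "text"], ["cosine"]]
idf = inverse_document_frequency(corpus)
```

Words that appear in fewer texts receive a larger value: here "cosine" (one text) scores higher than "data" (two texts), which scores higher than "method" (three texts).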

Step 3.3: calculating, by formula (3), the TF-IDF value $T_{nm}$ of the n-th word $w_n^m$ of the segmented m-th text content $d'_m$ in the corpus $D$:

$$T_{nm} = TF_{w_n^m} \times IDF_{w_n^m} \tag{3}$$

In formula (3), $TF_{w_n^m}$ denotes the term frequency of the n-th word $w_n^m$ in the segmented m-th text content $d'_m$, and $IDF_{w_n^m}$ denotes the inverse document frequency of the n-th word $w_n^m$ in the corpus $D$;

Step 3.4, constructing the document word vectors of the scientific and technological big data to be evaluated;

Merging the words repeated across all text contents in the corpus $D$ to obtain a merged corpus $D' = \{t_1, t_2, \ldots, t_p, \ldots, t_P\}$, where $t_p$ denotes the p-th word and $P$ denotes the total number of words in the merged corpus $D'$; calculating, by formula (4), the importance $X_{pm}$ of the p-th word $t_p$ in the segmented m-th text content $d'_m$:

$$X_{pm} = \begin{cases} T_{nm}, & t_p = w_n^m \in d'_m \\ 0, & t_p \notin d'_m \end{cases} \tag{4}$$

thereby obtaining the word vector $X_m = (X_{1m}, X_{2m}, \ldots, X_{pm}, \ldots, X_{Pm})^T$ of $d'_m$ and, further, the word vectors $X = (X_1, X_2, \ldots, X_m, \ldots, X_M)$ of all documents in the time window M, which are treated as a sparse matrix;
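The construction of the word-by-document matrix can be sketched as below. This is an illustrative reading, assuming that a word absent from a text gets importance 0, that the merged vocabulary is ordered alphabetically (the text does not fix an order), and that TF-IDF uses the unsmoothed natural-log form; a dense list of lists stands in for a true sparse matrix.

```python
import math
from collections import Counter


def tfidf_matrix(corpus):
    """Return (vocabulary, X) where X[p][m] is the TF-IDF of word p in text m."""
    vocab = sorted({word for tokens in corpus for word in tokens})  # merged corpus D'
    n_texts = len(corpus)
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    row = {word: p for p, word in enumerate(vocab)}
    X = [[0.0] * n_texts for _ in vocab]  # P rows (words) by M columns (texts)
    for m, tokens in enumerate(corpus):
        counts = Counter(tokens)
        total = sum(counts.values())
        for word, count in counts.items():
            X[row[word]][m] = (count / total) * math.log(n_texts / doc_freq[word])
    return vocab, X


# Hypothetical segmented texts, for illustration only.
corpus = [["pca", "cosine", "pca"], ["cosine", "tfidf"], ["tfidf", "patent"]]
vocab, X = tfidf_matrix(corpus)
```

Column m of `X` is the word vector of the m-th text; most entries are zero because each text uses only a small part of the merged vocabulary.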

Step 4.1, zero-centering each row of the sparse matrix $X$ by formula (5) to obtain the matrix $H$:

$$H_{pm} = X_{pm} - \bar{X}_p \tag{5}$$

In formula (5), $\bar{X}_p = \frac{1}{M}\sum_{m=1}^{M} X_{pm}$ is the mean of the p-th row of $X$, so that each row of $H$ satisfies $\sum_{m=1}^{M} H_{pm} = 0$.
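Zero-centering of the rows can be illustrated with a few lines of code. The sketch assumes the matrix is held as a list of rows (one row per word, one column per text); the sample values are made up for the example.

```python
def center_rows(X):
    """Subtract each row's mean so that every row of the result sums to zero."""
    H = []
    for row in X:
        mean = sum(row) / len(row)
        H.append([value - mean for value in row])
    return H


H = center_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]])
```

Both rows have mean 2, so the centered rows are (-1, 0, 1) and (-2, -2, 4), each summing to zero.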

Step 4.2, calculating the covariance matrix $C$ by formula (6):

$$C = \frac{1}{M} H H^T \tag{6}$$

In formula (6), the covariance matrix $C$ is a real symmetric matrix with p rows and p columns; its diagonal elements are the variances of the corresponding rows of the matrix $H$, and the element in row i, column j equals the element in row j, column i and represents the covariance between the i-th and j-th rows of the matrix $H$.
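The properties just described (symmetry, variances on the diagonal) can be checked on a tiny example. The sketch assumes the covariance is normalized by the number of columns M, i.e. $C = \frac{1}{M} H H^T$; some references divide by M-1 instead, so the normalization here is an assumption.

```python
def covariance(H):
    """Return C = (1/M) * H * H^T for a row-centered matrix H with M columns."""
    n_cols = len(H[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_cols for row_j in H]
        for row_i in H
    ]


H = [[-1.0, 0.0, 1.0], [-2.0, -2.0, 4.0]]  # already row-centered
C = covariance(H)
```

The diagonal entries 2/3 and 8 are the variances of the two rows, and the off-diagonal entries are both 2, the covariance between the rows.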

Step 4.3, calculating the eigenvalues of the covariance matrix $C$ and the corresponding orthonormal eigenvectors by formula (7), and arranging the eigenvectors as rows, in descending order of their eigenvalues, to form the matrix $P$ by formula (8);

The eigenvalues of $C$ are computed and arranged in descending order as $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_p$. Since $C$ is a real symmetric matrix with p rows and p columns, p orthonormal eigenvectors $e_1, e_2, e_3, \ldots, e_p$ corresponding in turn to the eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_p$ can be found. Arranging them as columns gives the matrix $E = (e_1, e_2, e_3, \ldots, e_p)$, and the following holds for the covariance matrix $C$:

$$E^T C E = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \tag{7}$$

$$P = E^T \tag{8}$$
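The text does not say how the eigen-decomposition is computed. As one dependency-free possibility, the sketch below uses power iteration with deflation, a swapped-in technique chosen only for illustration; in practice a library routine such as `numpy.linalg.eigh` would normally be used. The function name, the start vector and the iteration count are assumptions.

```python
import math


def top_eigenpairs(C, k, iterations=500):
    """Approximate the k largest eigenpairs of a symmetric positive semidefinite C."""
    size = len(C)
    A = [row[:] for row in C]  # work on a copy; deflation modifies it
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(size)]  # fixed start vector (assumption)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(A[i][j] * v[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # nothing left to extract
                break
            v = [x / norm for x in w]
        Av = [sum(A[i][j] * v[j] for j in range(size)) for i in range(size)]
        eigenvalue = sum(v[i] * Av[i] for i in range(size))
        pairs.append((eigenvalue, v))
        # Deflate: subtract the component just found so the next pass finds the next one.
        for i in range(size):
            for j in range(size):
                A[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


pairs = top_eigenpairs([[2.0, 0.0], [0.0, 1.0]], k=2)
P = [vector for _, vector in pairs]  # eigenvectors as rows, largest eigenvalue first
```

Power iteration converges slowly when neighbouring eigenvalues are close, and the sign of each eigenvector is arbitrary, which is one reason a library solver is the more robust choice for real vocabularies.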

Step 4.4, taking the matrix $P_k$ formed by the first k rows of the matrix $P$ and multiplying it by the matrix $H$, by formula (9), to obtain the dimension-reduced matrix $Y = (Y_1, Y_2, \ldots, Y_m, \ldots, Y_M)$, where $Y_m = (Y_{1m}, Y_{2m}, \ldots, Y_{km})^T$ denotes the m-th document word vector after dimension reduction:

$$Y = P_k H \tag{9}$$
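The projection onto the first k principal directions is a plain matrix product. The sketch below assumes matrices stored as lists of rows; the matrix `P` shown is a hand-picked orthonormal example rather than the output of an actual eigen-decomposition.

```python
def project(P, H, k):
    """Return Y = P_k * H, where P_k holds the first k rows of P."""
    n_cols = len(H[0])
    return [
        [sum(P[r][i] * H[i][m] for i in range(len(H))) for m in range(n_cols)]
        for r in range(k)
    ]


P = [[0.6, 0.8], [-0.8, 0.6]]              # orthonormal rows, for illustration only
H = [[3.0, -3.0, 0.0], [4.0, -4.0, 0.0]]   # row-centered 2-by-3 matrix
Y = project(P, H, k=1)
```

The result has k rows and M columns; column m is the reduced vector of the m-th document. In this example all the variation lies along the first direction, so keeping k = 1 loses nothing.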

Step 5.1, calculating the similarity sim between different document word vectors in the k-dimensional vector space by formula (10):

$$sim\langle d_m, d_z\rangle = \cos\langle Y_m, Y_z\rangle = \frac{\sum_{j=1}^{k} Y_{jm} Y_{jz}}{\sqrt{\sum_{j=1}^{k} Y_{jm}^2}\,\sqrt{\sum_{j=1}^{k} Y_{jz}^2}} \tag{10}$$

In formula (10), $Y_{jm}$ denotes the value of the j-th dimension of the m-th dimension-reduced document word vector and $Y_{jz}$ denotes the value of the j-th dimension of the z-th dimension-reduced document word vector, where $m, z = 1, 2, 3, \ldots, M$.
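Cosine similarity between two reduced document vectors can be sketched as follows. Returning 0 for a zero vector is an assumption made to avoid division by zero; the text does not address that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Vectors pointing in the same direction score 1 regardless of length, and orthogonal vectors score 0. Note that after zero-centering, reduced vectors can have negative components, so similarities may also be negative.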

Step 5.2, calculating the cosine similarity between each document word vector $Y_m$ and all the other document word vectors, forming M sets of M-1 elements each, which are in turn:

$\{sim\langle d_1, d_m\rangle \mid 1 < m \le M\}$, $\{sim\langle d_2, d_m\rangle \mid 1 \le m \le M,\ m \neq 2\}$, ……, $\{sim\langle d_M, d_m\rangle \mid 1 \le m < M\}$
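The M similarity sets can be built with a double loop over the reduced document vectors. This sketch assumes the documents are given as a list of vectors (the columns of the reduced matrix) and, as before, treats a zero vector as having similarity 0 to everything.

```python
import math


def similarity_sets(vectors):
    """For each document m, list its cosine similarity to every other document."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sets = []
    for m, u in enumerate(vectors):
        sims = []
        for z, v in enumerate(vectors):
            if z == m:
                continue  # a document is not compared with itself
            dot = sum(a * b for a, b in zip(u, v))
            denom = norms[m] * norms[z]
            sims.append(dot / denom if denom else 0.0)
        sets.append(sims)
    return sets


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three hypothetical reduced vectors
sets = similarity_sets(docs)
```

For M = 3 documents this yields three sets of two values each. Because the similarity is symmetric, an implementation for large M could compute each pair once and reuse it.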

Step 6, sorting the text similarity set $\{sim\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$ in descending order, selecting the top L values with the greatest text similarity, taking the L-th similarity value as the innovation value of the m-th text content $d_m$, and standardizing this innovation value to obtain the innovation score of $d_m$;

Step 6.1, sorting the cosine similarity values in each set in descending order and selecting the L values with the highest similarity to the text, the smallest of which represents the degree of innovation of the text; the degree of innovation of document $d_j$ ($j = 1, 2, 3, \ldots, M$) can be expressed by formula (11):

$$sim^l(d_j) \quad (j = 1, 2, 3, \ldots, m, \ldots, M;\ l = 1, 2, 3, \ldots, k) \tag{11}$$
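Selecting the L-th largest similarity, as described in step 6.1, can be sketched as below. The function name and the sample values are illustrative, and the range check on L is an added safeguard rather than something stated in the text.

```python
def innovation_value(similarities, L):
    """Return the L-th largest similarity, i.e. the smallest of the top L values."""
    if not 1 <= L <= len(similarities):
        raise ValueError("L must be between 1 and the number of similarities")
    return sorted(similarities, reverse=True)[L - 1]


value = innovation_value([0.10, 0.85, 0.40, 0.70], L=3)
```

Sorted in descending order the values are 0.85, 0.70, 0.40, 0.10, so the third-largest is 0.40. A lower value means that even the L closest documents are not very similar, which the method reads as a higher degree of innovation.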

Step 6.2, standardizing the innovation calculation results of the scientific and technological big data and giving scores on a percentage scale; as shown in formula (12), the innovation of the big data text document $d_j$ ($j = 1, 2, 3, \ldots, M$) in the time window M can be expressed as:

In formula (12), $sim_{max}$ is the maximum value of $sim^l(d_j)$ ($j = 1, 2, 3, \ldots, M$; $l = 1, 2, 3, \ldots, k$);

Step 7, calculating the innovation scores of all M document word vectors in the time window M according to steps 5 and 6 and arranging them in descending order, thereby completing the innovation evaluation of the text content set of the scientific and technological big data to be evaluated.

Claims

1. An innovation evaluation method based on scientific and technological big data text content is characterized by comprising the following steps:

the TF-IDF value $T_{nm}$:

$$T_{nm} = TF_{w_n^m} \times IDF_{w_n^m} \tag{1}$$

In formula (1), $TF_{w_n^m}$ denotes the term frequency of the n-th word $w_n^m$ in the segmented m-th text content $d'_m$, and $IDF_{w_n^m}$ denotes the inverse document frequency of the n-th word $w_n^m$ in the corpus $D$;

merging the words repeated across the text contents in the corpus $D$ to obtain a merged corpus $D' = \{t_1, t_2, \ldots, t_p, \ldots, t_P\}$, where $t_p$ denotes the p-th word and $P$ denotes the total number of words in the merged corpus $D'$; calculating, by formula (2), the importance $X_{pm}$ of the p-th word $t_p$ in the segmented m-th text content $d'_m$, thereby obtaining the word vector $X_m = (X_{1m}, X_{2m}, \ldots, X_{pm}, \ldots, X_{Pm})^T$ of $d'_m$;

and obtaining the word vectors $X = (X_1, X_2, \ldots, X_m, \ldots, X_M)$ of all documents in the time window M, which are treated as a sparse matrix;

step 4.2, calculating the covariance matrix $C = \frac{1}{M} H H^T$;

step 4.4, taking the matrix formed by the first k rows of the matrix $P$ and multiplying it by the matrix $H$ to obtain the dimension-reduced matrix $Y = (Y_1, Y_2, \ldots, Y_m, \ldots, Y_M)$, where $Y_m = (Y_{1m}, Y_{2m}, \ldots, Y_{km})^T$ denotes the m-th document word vector after dimension reduction;

step 5, calculating the cosine similarity $\cos\langle Y_m, Y_z\rangle$ between the m-th dimension-reduced document word vector $Y_m$ and the z-th dimension-reduced document word vector $Y_z$ in the time window M, which represents the similarity $sim\langle d_m, d_z\rangle$ between the m-th text content $d_m$ and the z-th text content $d_z$, thereby obtaining the set of text similarities between $d_m$ and the other text contents, $\{sim\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$;