CN114154498B - Innovative evaluation method based on science and technology big data text content - Google Patents
Innovative evaluation method based on science and technology big data text content
- Publication number
- CN114154498B (publication) CN202111489894.1A (application)
- Authority
- CN
- China
- Prior art keywords
- text content
- word
- text
- mth
- big data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000011156 evaluation Methods 0.000 title claims abstract description 18
- 238000005516 engineering process Methods 0.000 title abstract description 8
- 239000013598 vector Substances 0.000 claims abstract description 42
- 230000011218 segmentation Effects 0.000 claims abstract description 36
- 238000000034 method Methods 0.000 claims abstract description 15
- 238000007781 pre-processing Methods 0.000 claims abstract description 5
- 239000011159 matrix material Substances 0.000 claims description 48
- 238000012217 deletion Methods 0.000 claims description 3
- 230000037430 deletion Effects 0.000 claims description 3
- 238000012847 principal component analysis method Methods 0.000 claims description 3
- 238000000513 principal component analysis Methods 0.000 abstract description 6
- 238000012216 screening Methods 0.000 abstract description 2
- 238000004458 analytical method Methods 0.000 description 4
- 238000004519 manufacturing process Methods 0.000 description 2
- 238000002360 preparation method Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Optimization (AREA)
- General Health & Medical Sciences (AREA)
- Pure & Applied Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an innovation evaluation method based on the text content of science and technology big data, comprising the following steps: 1. acquire, preprocess and segment the text content of the science and technology big data; 2. process the segmented text data with a TF-IDF model and construct the document word vectors of the science and technology big data to be evaluated; 3. reduce the dimensionality of the document word vectors with principal component analysis (PCA); 4. calculate the similarity between all documents under time window M, represented by the cosine between the document word vectors; 5. sort the similarity values in each set in descending order and select the L values most similar to the text; the smallest of these L values represents the innovativeness of the text, from which a normalized innovation score is obtained. The invention can effectively evaluate the innovativeness of science and technology big data and improve evaluation accuracy, laying a foundation for evaluating and screening valuable science and technology big data.
Description
Technical Field
The invention relates to the field of science and technology big data value evaluation, and in particular to an innovation evaluation method for science and technology big data based on text content.
Background
In recent years, with the vigorous development of network and communication technologies, the data generated by people's life and production have grown explosively, and modern society has entered the big data era. Science and technology big data are an information resource that reflects the state and process of human scientific and technological activity; they can help human beings gain insight into new ideas, discover new laws, invent new technologies and develop new products. In other words, science and technology big data are as valuable as other common data; at the same time, owing to their own characteristics, their value is mainly driven by scientific and technological innovation. Innovativeness is therefore an indispensable characteristic of science and technology big data, and the fundamental characteristic distinguishing them from other data.
Science and technology big data comprise scientific papers, patents, software copyright registrations, standards and specifications, policy recommendations and the like, and contain a large amount of unstructured data represented by text content, among which the value and innovativeness of scientific papers and invention patents have been studied most. On the one hand, researchers describe data value by establishing value evaluation index systems and applying traditional bibliometric models, but the quality of unstructured data such as text content is hard to measure in this way; on the other hand, scholars measure the quality of text content with traditional text analysis methods such as word-frequency analysis and co-word analysis, and with topic models represented by LDA, but the innovation evaluation of text content remains little explored.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an innovation evaluation method based on the text content of science and technology big data, so that the innovativeness of science and technology big data can be effectively evaluated and the evaluation accuracy improved, laying a foundation for evaluating and screening valuable science and technology big data.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The innovation evaluation method based on the text content of science and technology big data is characterized by comprising the following steps:
step 1, after acquiring the text content set of the science and technology big data to be evaluated and preprocessing it by removing duplicates and records with missing values, dividing the preprocessed text content according to its generation time to obtain the time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}; d_m represents the m-th text content, and M represents the total number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_1^m, w_2^m, ..., w_n^m, ..., w_{N_m}^m}, where w_n^m represents the n-th word of the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all segmented words of the M text contents thus form a corpus D;
step 3, using a TF-IDF model to process the text content after word segmentation, extracting keywords of technological big data to be evaluated and constructing a document word vector;
step 3.1, calculating, by formula (1), the TF-IDF value T_nm of the n-th word w_n^m of the segmented m-th text content d'_m in the corpus D:

T_nm = TF(w_n^m) × IDF(w_n^m) (1)

in formula (1), TF(w_n^m) represents the word frequency of the n-th word w_n^m in the segmented m-th text content d'_m, and IDF(w_n^m) represents its inverse document frequency in the corpus D;
step 3.2, constructing the document word vectors of the science and technology big data to be evaluated;

merging repeated words across the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating, by formula (2), the importance degree X_pm of the p-th word t_p in the segmented m-th text content d'_m (i.e. its TF-IDF value there, taken as 0 when t_p does not occur in d'_m), thereby obtaining the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of the segmented m-th text content d'_m, and further the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents under the time window M, which form a sparse matrix;
step 4, performing dimension reduction on the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean treatment on each row of elements in the sparse matrix X to obtain a matrix H;
step 4.2, calculating the covariance matrix C = (1/M)·H·H^T;

step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding unit orthogonal eigenvectors, and forming the eigenvectors, by rows, into a matrix P in descending order of the corresponding eigenvalues;

step 4.4, taking the matrix formed by the first k rows of the matrix P and multiplying it by the matrix H, thereby obtaining the reduced matrix Y = (Y_1, Y_2, ..., Y_m, ..., Y_M), where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th reduced document word vector and Y_km represents its k-th component; k < P;
step 5, calculating the cosine similarity value cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z under the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, thereby obtaining the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} between the m-th text content d_m and the other text contents;
step 6, sorting the similarities in the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the first L values with the largest text similarity, and taking the L-th similarity value (the smallest of the selected L) to represent the innovativeness of the m-th text content d_m; this value is then normalized to obtain the innovation score of the m-th text content d_m;
and 7, calculating innovation scores of all M text contents under the time window M according to the processes of the steps 5 and 6, and arranging the text contents in a descending order, so that the innovation evaluation of the text content set of the technological big data to be evaluated is completed.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention acquires and preprocesses the text content data of science and technology big data: records with missing values are removed, only the most recent item is kept for duplicated data, and the evaluated data are grouped by year into time windows; combining an existing mature stop-word vocabulary, the text content is segmented with the jieba segmenter and the stop words in it are deleted, removing meaningless words; this greatly improves the quality of the text data and the efficiency of the subsequent text analysis;
2. the invention processes the segmented text data with a TF-IDF model, extracts the keywords of the science and technology big data, and calculates a TF-IDF value for each segmented word of the evaluated data, thereby revealing the key topics and information of each item of science and technology big data under the time window; the document word vectors are constructed from the segmented words and their TF-IDF values, preparing for the subsequent measurement of similarity between texts;
3. the invention constructs a sparse matrix from all document word vectors under each time window and reduces the dimensionality of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); this avoids the difficulty of computing distances between sample points in a high-dimensional space, prepares for the subsequent similarity calculation, improves the accuracy of the similarity measurement, and helps to accurately characterize the innovativeness of all science and technology big data in the field under the time window;
4. the invention represents the similarity between documents by the cosine similarity between the reduced document word vectors; finally, following the k-nearest-neighbour (KNN) idea, the smallest among the L values with the highest similarity between each item of science and technology big data and all the others represents its innovativeness, so that the innovativeness and value of science and technology big data can be evaluated and innovative, valuable items can be screened more effectively.
Drawings
FIG. 1 is a flow chart of the innovation evaluation method of the present invention based on the text content of science and technology big data.
Detailed Description
In this embodiment, an innovation evaluation method based on the text content of science and technology big data extracts the keywords of the evaluated text content with a TF-IDF model; constructs document word vectors from the keywords and their TF-IDF values to form a sparse matrix; reduces the dimensionality of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); then represents the similarity between documents by the cosine similarity between the reduced document word vectors; finally, following the k-nearest-neighbour (KNN) idea, the smallest of the L values with the highest similarity between each text and all other texts represents the innovativeness of that text. Specifically, as shown in FIG. 1, the method comprises the following steps:
step 1, after acquiring the text content set of the science and technology big data to be evaluated and preprocessing it by removing duplicates and records with missing values, dividing the preprocessed text content according to its generation time to obtain the time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}; d_m represents the m-th text content, and M represents the total number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_1^m, w_2^m, ..., w_n^m, ..., w_{N_m}^m}, where w_n^m represents the n-th word of the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all segmented words of the M text contents thus form a corpus D;
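The preprocessing and segmentation of steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain whitespace tokenizer and a toy English stop-word list stand in for the jieba segmenter and the mature Chinese stop-word vocabulary the embodiment assumes.

```python
# Sketch of steps 1-2: deduplicate, drop empty records, segment, remove stop
# words, merge repeated words. A real implementation would call jieba.lcut()
# for Chinese segmentation; a whitespace split keeps the example self-contained.

STOP_WORDS = {"the", "a", "of", "and", "for"}  # toy stop-word list (assumption)

def preprocess(texts):
    """Remove duplicates (keeping the first occurrence) and records with no content."""
    seen, cleaned = set(), []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    return cleaned

def segment(text):
    """Tokenize, drop stop words, and keep one copy of each word in first-seen order."""
    out = []
    for w in text.lower().split():
        if w not in STOP_WORDS and w not in out:
            out.append(w)
    return out

docs = preprocess(["A study of graphene sensors", "", "A study of graphene sensors",
                   "Deep learning for patent analysis"])
corpus = [segment(d) for d in docs]
print(corpus)  # [['study', 'graphene', 'sensors'], ['deep', 'learning', 'patent', 'analysis']]
```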
step 3, using a TF-IDF model to process the text content after word segmentation, extracting keywords of technological big data to be evaluated and constructing a document word vector;
TF-IDF (Term Frequency-Inverse Document Frequency) is used to evaluate the importance of words to text in a document set or corpus, and consists of two parts: TF and IDF.
Step 3.1: word frequency (TF) is the frequency of occurrence of the word in a text sample, assuming d m For a particular text sample to be used,for the n-th word (if there is a repeated word, the first appearance position of the word is selected as the reference, and the subsequent repeated word is not counted), the word frequency of the word in the text sample is +.>The ratio of the frequency of occurrence of the word to the total frequency of occurrence of all words in the text can be expressed by the formula (1):
step 3.2: the inverse document frequency (IDF) evaluates how common a word is across the corpus. In this embodiment, the corpus D consists of the segmented text content data of all the science and technology big data under each time window; the IDF value IDF(w_n^m) of the word w_n^m can be expressed, using the number N_w of texts in D that contain w_n^m and the total number of samples N, as formula (2):

IDF(w_n^m) = log( N / (N_w + 1) ) (2)
step 3.3: calculating, by formula (3), the TF-IDF value T_nm of the n-th word w_n^m of the segmented m-th text content d'_m in the corpus D:

T_nm = TF(w_n^m) × IDF(w_n^m) (3)

in formula (3), TF(w_n^m) represents the word frequency of the n-th word w_n^m in the segmented m-th text content d'_m, and IDF(w_n^m) represents its inverse document frequency in the corpus D;
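Assuming the TF-IDF value is the standard product of a raw-count TF and a smoothed IDF of the form log(N / (N_w + 1)), steps 3.1 to 3.3 can be sketched as:

```python
import math

def tf(word, doc):
    """Word frequency: occurrences of `word` over total word occurrences in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Inverse document frequency: log(N / (N_w + 1)), where N is the number of
    documents in the corpus and N_w the number containing `word` (smoothed by +1)."""
    n_w = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (n_w + 1))

def tf_idf(word, doc, corpus):
    """TF-IDF value of a word for one document against the whole corpus."""
    return tf(word, doc) * idf(word, corpus)

corpus = [["graphene", "sensor", "fabrication"],
          ["graphene", "battery", "anode"],
          ["patent", "text", "mining"]]
# "graphene" appears in 2 of 3 documents, so its smoothed IDF is log(3/3) = 0:
print(tf_idf("graphene", corpus[0], corpus))          # 0.0
# "sensor" appears in 1 document: TF = 1/3, IDF = log(3/2)
print(round(tf_idf("sensor", corpus[0], corpus), 4))  # ≈ 0.1352
```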
step 3.4, constructing the document word vectors of the science and technology big data to be evaluated;

merging repeated words across the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating, by formula (4), the importance degree X_pm of the p-th word t_p in the segmented m-th text content d'_m (i.e. its TF-IDF value there, taken as 0 when t_p does not occur in d'_m), thereby obtaining the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of the segmented m-th text content d'_m, and further the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents under the time window M, which form a sparse matrix;
step 4, performing dimension reduction on the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean treatment on each row of elements of the sparse matrix X by formula (5) to obtain the matrix H:

H_pm = X_pm - mean_p, where mean_p = (1/M)·Σ_{m=1..M} X_pm (5)

in formula (5), each row of H satisfies Σ_{m=1..M} H_pm = 0;
Step 4.2, calculating a covariance matrix by using the method (6)
In the formula (6), the covariance matrix C is a real symmetric matrix of p rows and p columns, diagonal elements of the covariance matrix C respectively correspond to variances of data of each row of the matrix H, and j-th row are identical in elements, which represents covariance between j-th row and j-th row of the matrix H:
step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding unit orthogonal eigenvectors by formula (7), and forming the eigenvectors, by rows, into the matrix P in descending order of the corresponding eigenvalues by formula (8);

the eigenvalues of C arranged in descending order are λ_1 ≥ λ_2 ≥ ... ≥ λ_P; since C is a real symmetric matrix with P rows and P columns, there are P unit orthogonal eigenvectors e_1, e_2, ..., e_P corresponding in turn to λ_1, λ_2, ..., λ_P; forming them by columns into the matrix E = (e_1, e_2, ..., e_P), the covariance matrix C satisfies:

E^T·C·E = diag(λ_1, λ_2, ..., λ_P) (7)

and the projection matrix is formed by rows as

P = (e_1, e_2, ..., e_P)^T (8)
step 4.4, taking, by formula (9), the matrix formed by the first k rows of the matrix P and multiplying it by the matrix H, thereby obtaining the reduced matrix

Y = P_(1..k)·H = (Y_1, Y_2, ..., Y_m, ..., Y_M) (9)

where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th reduced document word vector and Y_km represents its k-th component; k < P;
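Steps 4.1 to 4.4 amount to projecting the zero-meaned document-word matrix onto the leading eigenvectors of its covariance matrix. A minimal sketch with k = 1, using power iteration to approximate the leading eigenvector (a full implementation would take the first k eigenvectors, e.g. via a symmetric eigendecomposition routine):

```python
import math

def zero_mean_rows(X):
    """Step 4.1: subtract each row's mean (rows = words/features, columns = documents)."""
    return [[x - sum(row) / len(row) for x in row] for row in X]

def covariance(H):
    """Step 4.2: C = (1/M) * H * H^T, a P x P real symmetric matrix."""
    P, M = len(H), len(H[0])
    return [[sum(H[i][m] * H[j][m] for m in range(M)) / M for j in range(P)]
            for i in range(P)]

def leading_eigenvector(C, iters=200):
    """Step 4.3 (first component only): power iteration on the symmetric matrix C."""
    v = [1.0] * len(C)
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(len(C))) for i in range(len(C))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(H, v):
    """Step 4.4 with k = 1: one reduced coordinate per document (column of H)."""
    return [sum(v[i] * H[i][m] for i in range(len(H))) for m in range(len(H[0]))]

# Toy sparse matrix: 3 words (rows) x 4 documents (columns).
X = [[2.5, 0.5, 2.2, 1.9], [2.4, 0.7, 2.9, 2.2], [0.1, 0.0, 0.2, 0.1]]
H = zero_mean_rows(X)
Y = project(H, leading_eigenvector(covariance(H)))
print([round(y, 3) for y in Y])  # one reduced coordinate per document (k = 1)
```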
step 5, calculating the cosine similarity value cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z under the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, thereby obtaining the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} between the m-th text content d_m and the other text contents;
step 5.1, calculating the similarity between different document word vectors in the k-dimensional vector space by formula (10):

sim<d_m, d_z> = cos<Y_m, Y_z> = ( Σ_{j=1..k} Y_jm·Y_jz ) / ( sqrt(Σ_{j=1..k} Y_jm^2) · sqrt(Σ_{j=1..k} Y_jz^2) ) (10)

in formula (10), Y_jm represents the j-th component of the m-th reduced document word vector and Y_jz represents the j-th component of the z-th reduced document word vector, with m, z = 1, 2, 3, ..., M, m ≠ z, and j = 1, 2, ..., k;
step 5.2, calculating the cosine similarity between each document word vector and all the other document word vectors forms M sets, each with M - 1 elements, namely:

{sim<d_1, d_m> | 1 < m ≤ M}, {sim<d_2, d_m> | 1 ≤ m ≤ M, m ≠ 2}, ..., {sim<d_M, d_m> | 1 ≤ m < M}
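The cosine similarity of formula (10) and the M similarity sets of step 5.2 can be sketched as:

```python
import math

def cosine(a, b):
    """Formula (10): cos<a, b> = a·b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_sets(Y):
    """Step 5.2: for each document m, its similarities to all other documents
    (M sets of M - 1 elements each)."""
    M = len(Y)
    return [[cosine(Y[m], Y[z]) for z in range(M) if z != m] for m in range(M)]

Y = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # three reduced document vectors, k = 2
sets = similarity_sets(Y)
print([len(s) for s in sets])              # [2, 2, 2]
print(round(sets[0][0], 4))                # cos<Y_1, Y_2> = 1/sqrt(2) ≈ 0.7071
```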
step 6, sorting the similarities in the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the first L values with the largest text similarity, and taking the L-th similarity value (the smallest of the selected L) to represent the innovativeness of the m-th text content d_m; this value is then normalized to obtain the innovation score of the m-th text content d_m;
step 6.1, sorting the cosine similarity values in each set in descending order and selecting the L values most similar to the text, the smallest of which represents the innovativeness of the text; the innovativeness of document d_j (j = 1, 2, 3, ..., M) can thus be expressed as formula (11):

sim_L(d_j), j = 1, 2, 3, ..., M (11)

where sim_L(d_j) denotes the L-th largest similarity value in the set of document d_j;
step 6.2, normalizing the innovativeness results of the science and technology big data and assigning percentage scores as shown in formula (12), so that the innovation score of document d_j (j = 1, 2, 3, ..., M) under the time window M is obtained from sim_L(d_j);

in formula (12), sim_max is the maximum of sim_L(d_j) over j = 1, 2, 3, ..., M;
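Step 6 then reduces to taking the L-th largest similarity in each set and normalizing it. The normalization below is an assumption, since formula (12) is not reproduced in the text: it divides by sim_max and maps sim_L to a 0-100 score in which a document far from its nearest peers (low sim_L) scores high.

```python
def innovativeness(sim_set, L):
    """Step 6.1: the L-th largest similarity value, i.e. the smallest of the L
    nearest neighbours, characterises how close a document is to its peers."""
    return sorted(sim_set, reverse=True)[L - 1]

def innovation_scores(sets, L):
    """Step 6.2 (assumed normalization): map each sim_L to a 0-100 score where
    a document with low similarity to all others scores high."""
    sims = [innovativeness(s, L) for s in sets]
    sim_max = max(sims)
    return [100.0 * (1.0 - s / sim_max) for s in sims]

# Toy similarity sets for M = 3 documents (each set holds M - 1 values).
sets = [[0.9, 0.8, 0.2], [0.9, 0.7, 0.3], [0.3, 0.2, 0.1]]
scores = innovation_scores(sets, L=2)
print([round(s, 1) for s in scores])  # [0.0, 12.5, 75.0]
```

The final ranking of step 7 is then a descending sort of these scores.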
and 7, calculating innovation scores of all M document word vectors under the time window M according to the processes of the steps 5 and 6, and arranging the innovation scores in a descending order, so that the innovation evaluation of the text content set of the technical big data to be evaluated is completed.
Claims (1)
1. An innovative evaluation method based on technological big data text content is characterized by comprising the following steps:
step 1, after acquiring the text content set of the science and technology big data to be evaluated and preprocessing it by removing duplicates and records with missing values, dividing the preprocessed text content according to its generation time to obtain the time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}; d_m represents the m-th text content, and M represents the total number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_1^m, w_2^m, ..., w_n^m, ..., w_{N_m}^m}, where w_n^m represents the n-th word of the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all segmented words of the M text contents thus form a corpus D;
step 3, using a TF-IDF model to process the text content after word segmentation, extracting keywords of technological big data to be evaluated and constructing a document word vector;
step 3.1, calculating, by formula (1), the TF-IDF value T_nm of the n-th word w_n^m of the segmented m-th text content d'_m in the corpus D:

T_nm = TF(w_n^m) × IDF(w_n^m) (1)

in formula (1), TF(w_n^m) represents the word frequency of the n-th word w_n^m in the segmented m-th text content d'_m, and IDF(w_n^m) represents its inverse document frequency in the corpus D;
step 3.2, constructing the document word vectors of the science and technology big data to be evaluated;

merging repeated words across the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating, by formula (2), the importance degree X_pm of the p-th word t_p in the segmented m-th text content d'_m (i.e. its TF-IDF value there, taken as 0 when t_p does not occur in d'_m), thereby obtaining the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of the segmented m-th text content d'_m, and further the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents under the time window M, which form a sparse matrix;
step 4, performing dimension reduction on the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean treatment on each row of elements in the sparse matrix X to obtain a matrix H;
step 4.2, calculating the covariance matrix C = (1/M)·H·H^T;

step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding unit orthogonal eigenvectors, and forming the eigenvectors, by rows, into a matrix P in descending order of the corresponding eigenvalues;

step 4.4, taking the matrix formed by the first k rows of the matrix P and multiplying it by the matrix H, thereby obtaining the reduced matrix Y = (Y_1, Y_2, ..., Y_m, ..., Y_M), where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th reduced document word vector and Y_km represents its k-th component; k < P;
step 5, calculating the cosine similarity value cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z under the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, thereby obtaining the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} between the m-th text content d_m and the other text contents;
step 6, sorting the similarities in the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the first L values with the largest text similarity, and taking the L-th similarity value (the smallest of the selected L) to represent the innovativeness of the m-th text content d_m; this value is then normalized to obtain the innovation score of the m-th text content d_m;
and 7, calculating innovation scores of all M text contents under the time window M according to the processes of the steps 5 and 6, and arranging the text contents in a descending order, so that the innovation evaluation of the text content set of the technological big data to be evaluated is completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111489894.1A CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111489894.1A CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114154498A CN114154498A (en) | 2022-03-08 |
CN114154498B true CN114154498B (en) | 2024-02-20 |
Family
ID=80453329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111489894.1A Active CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114154498B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109885675A (en) * | 2019-02-25 | 2019-06-14 | 合肥工业大学 | Method is found based on the text sub-topic for improving LDA |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720675B2 (en) * | 2003-10-27 | 2010-05-18 | Educational Testing Service | Method and system for determining text coherence |
-
2021
- 2021-12-08 CN CN202111489894.1A patent/CN114154498B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109885675A (en) * | 2019-02-25 | 2019-06-14 | 合肥工业大学 | Method is found based on the text sub-topic for improving LDA |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
Non-Patent Citations (1)
Title |
---|
A science and technology paper evaluation method based on text mining and bibliometrics; Wang Lijun; Yao Changqing; Liu Zhihui; Information Science; 2019-05-01 (No. 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN114154498A (en) | 2022-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Goëau et al. | Lifeclef bird identification task 2016: The arrival of deep learning | |
CN111401040B (en) | Keyword extraction method suitable for word text | |
CN104966097A (en) | Complex character recognition method based on deep learning | |
CN105843850B (en) | Search optimization method and device | |
CN108804595B (en) | Short text representation method based on word2vec | |
CN110765254A (en) | Multi-document question-answering system model integrating multi-view answer reordering | |
CN110046264A (en) | A kind of automatic classification method towards mobile phone document | |
CN112051986B (en) | Code search recommendation device and method based on open source knowledge | |
Wolf et al. | Computerized paleography: tools for historical manuscripts | |
CN108647729A (en) | A kind of user's portrait acquisition methods | |
CN113688635B (en) | Class case recommendation method based on semantic similarity | |
CN111813933A (en) | Automatic identification method for technical field in technical atlas | |
CN103745242A (en) | Cross-equipment biometric feature recognition method | |
CN109344248B (en) | Academic topic life cycle analysis method based on scientific and technological literature abstract clustering | |
CN113342950B (en) | Answer selection method and system based on semantic association | |
CN111984790B (en) | Entity relation extraction method | |
CN114154498B (en) | Innovative evaluation method based on science and technology big data text content | |
CN111242131B (en) | Method, storage medium and device for identifying images in intelligent paper reading | |
CN116682015A (en) | Feature decoupling-based cross-domain small sample radar one-dimensional image target recognition method | |
CN111221915B (en) | Online learning resource quality analysis method based on CWK-means | |
CN114722183A (en) | Knowledge pushing method and system for scientific research tasks | |
CN113657106B (en) | Feature selection method based on normalized word frequency weight | |
Bria et al. | Deep Transfer Learning for writer identification in medieval books | |
CN105404899A (en) | Image classification method based on multi-directional context information and sparse coding model | |
CN115544361A (en) | Frame for predicting change of attention point of window similarity analysis and analysis method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||