CN114154498A - Innovative evaluation method based on scientific and technological big data text content - Google Patents


Publication number
CN114154498A
Authority
CN
China
Prior art keywords
text content
word
text
scientific
mth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111489894.1A
Other languages
Chinese (zh)
Other versions
CN114154498B (en)
Inventor
刘业政
陈航
姜元春
钱洋
孙见山
柴一栋
王继成
袁昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202111489894.1A
Publication of CN114154498A
Application granted
Publication of CN114154498B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for evaluating innovation based on the text content of scientific and technological big data, comprising the following steps: 1. acquire, preprocess and segment the text content of the scientific and technological big data; 2. process the segmented text data with a TF-IDF model and construct the document word vectors of the scientific and technological big data to be evaluated; 3. reduce the dimensionality of the document word vectors by principal component analysis (PCA); 4. calculate the similarity between all documents in the time window M, expressed as the cosine between the word vectors of each pair of documents; 5. sort the similarity values in each set in descending order and select the L values with the highest similarity to the text, the smallest of which represents the degree of innovation of the text, and obtain a standardized innovation score. The method can effectively evaluate the innovation of scientific and technological big data and improve evaluation accuracy, thereby laying a foundation for evaluating and screening valuable scientific and technological big data.

Description

Innovative evaluation method based on scientific and technological big data text content
Technical Field
The invention relates to the field of scientific and technological big data value evaluation, in particular to a text content-based scientific and technological big data innovation evaluation method.
Background
In recent years, with the rapid development of network and communication technology, the data generated by people's daily life and production has grown explosively, and modern society has entered the big data era. Scientific and technological big data is a kind of information resource that reflects the state and process of human scientific and technological activities. It supports people in understanding new ideas, discovering new laws, inventing new technologies and developing new products. In other words, on the one hand, scientific and technological big data has value just as other ordinary data does; on the other hand, owing to its own characteristics, its value is oriented mainly toward scientific and technological innovation. Innovation is therefore an indispensable characteristic of scientific and technological big data and the fundamental characteristic that distinguishes it from other data.
Scientific and technological big data includes scientific papers, patents, software copyrights, standards and specifications, policy recommendations and the like, and contains a large amount of unstructured data, typified by text content. The value and innovation of scientific papers and invention patents have been studied most. On the one hand, researchers describe data value by building a value evaluation index system and applying traditional metric models, but the quality of unstructured data such as text content is difficult to measure in this way. On the other hand, scholars measure the quality of text content with traditional text analysis methods such as word frequency analysis and co-word analysis, and with topic models represented by LDA, but the innovation of text content is rarely evaluated.
Disclosure of Invention
To overcome the above shortcomings of the prior art, the invention provides a method for evaluating innovation based on the text content of scientific and technological big data, so that the degree of innovation of scientific and technological big data can be effectively evaluated and the evaluation accuracy improved, thereby laying a foundation for evaluating and screening valuable scientific and technological big data.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an innovative assessment method based on scientific and technological big data text content, which is characterized by comprising the following steps:
Step 1: acquire the set of scientific and technological big data text contents to be evaluated, preprocess it by removing duplicates and records with missing values, and divide the preprocessed text contents by generation time to obtain the timestamped text contents $\{d_1, d_2, \ldots, d_m, \ldots, d_M\}$, where $d_m$ denotes the m-th text content and M denotes the number of texts in the set;
Step 2: perform word segmentation, stop-word removal and duplicate-word merging on the m-th text content $d_m$ to obtain the m-th segmented text content $d'_m = \{w_{1,m}, w_{2,m}, \ldots, w_{n,m}, \ldots, w_{N_m,m}\}$, where $w_{n,m}$ denotes the n-th word in $d'_m$ and $N_m$ denotes the total number of distinct words in $d'_m$; all the segmented words of the M text contents together form the corpus D;
Step 3: process the segmented text contents with a TF-IDF model, extract the keywords of the scientific and technological big data to be evaluated, and construct the document word vectors;
Step 3.1: use formula (1) to calculate the TF-IDF value $T_{nm}$ of the n-th word $w_{n,m}$ of the m-th segmented text content $d'_m$ in the corpus D:
$$T_{nm} = \mathrm{tf}_{n,m} \times \mathrm{idf}_{n,m} \tag{1}$$
In formula (1), $\mathrm{tf}_{n,m}$ denotes the term frequency of the n-th word $w_{n,m}$ in the m-th segmented text content $d'_m$, and $\mathrm{idf}_{n,m}$ denotes the inverse document frequency of $w_{n,m}$ in the corpus D;
Step 3.2: construct the document word vectors of the scientific and technological big data to be evaluated;
Merge the words repeated across the text contents in the corpus D to obtain the merged corpus $D' = \{t_1, t_2, \ldots, t_p, \ldots, t_P\}$, where $t_p$ denotes the p-th word and P denotes the total number of words in $D'$. Use formula (2) to calculate the importance $X_{pm}$ of the p-th word $t_p$ in the m-th segmented text content $d'_m$, which gives the word vector $X_m = (X_{1m}, X_{2m}, \ldots, X_{pm}, \ldots, X_{Pm})^T$ of $d'_m$ and, in turn, the word vectors $X = (X_1, X_2, \ldots, X_m, \ldots, X_M)$ of all documents in the time window M, which are treated as a sparse matrix:
$$X_{pm} = \begin{cases} T_{nm}, & t_p = w_{n,m} \in d'_m \\ 0, & t_p \notin d'_m \end{cases} \tag{2}$$
Step 4: reduce the dimension of the sparse matrix X by principal component analysis;
Step 4.1: zero-center each row of the sparse matrix X (subtract the row mean from every element of the row) to obtain the matrix H;
Step 4.2: calculate the covariance matrix $C = \frac{1}{M} H H^T$;
Step 4.3: calculate the eigenvalues of the covariance matrix C and the corresponding orthonormal eigenvectors, and arrange the eigenvectors as rows, in descending order of their eigenvalues, to form the matrix P;
Step 4.4: take the matrix formed by the first k rows of P and multiply it by the matrix H to obtain the reduced matrix $Y = (Y_1, Y_2, \ldots, Y_m, \ldots, Y_M)$, where $Y_m = (Y_{1m}, Y_{2m}, \ldots, Y_{km})^T$ denotes the m-th document word vector after dimensionality reduction, $Y_{km}$ denotes its k-th component, and k < P;
Step 5: calculate the cosine similarity $\cos\langle Y_m, Y_z\rangle$ between the m-th reduced document word vector $Y_m$ and the z-th reduced document word vector $Y_z$ in the time window M, and use it as the similarity $\mathrm{sim}\langle d_m, d_z\rangle$ between the m-th text content $d_m$ and the z-th text content $d_z$, thereby obtaining the set of similarities between $d_m$ and the other text contents, $\{\mathrm{sim}\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$;
Step 6: sort the similarity set $\{\mathrm{sim}\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$ in descending order, select the top L values, and take the L-th similarity value as the innovation value of the m-th text content $d_m$; standardize this innovation value to obtain the innovation score of $d_m$;
Step 7: calculate the innovation scores of all M text contents in the time window M according to steps 5 and 6 and arrange them in descending order, thereby completing the innovation evaluation of the set of scientific and technological big data text contents to be evaluated.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention acquires and preprocesses the text content data of scientific and technological big data: records with missing values are removed, only the most recently generated record is kept for duplicated data, and the evaluated data is grouped into time windows by year. Using an existing, mature stop-word list, the text content is segmented with jieba, the stop words are deleted, and function words without real meaning are removed. This greatly improves the quality of the text data and helps improve the efficiency of text analysis on real data;
2. The invention processes the segmented text data with a TF-IDF model, extracts the keywords of the scientific and technological big data, and calculates the TF-IDF values of the segmented data of each item being evaluated, so that the key topics and information of each item in the time window are captured. The document word vectors are constructed from the segmented words and their TF-IDF values, in preparation for the subsequent measurement of similarity between texts;
3. The invention constructs a sparse matrix from all the document word vectors in each time window and compresses it by principal component analysis (PCA), thereby reducing the dimension of the document word vectors. This addresses the difficulty of computing distances between sample points in a high-dimensional space, prepares for the subsequent calculation of document word vector similarity, and improves the accuracy of the subsequent similarity measurement, so that the innovation of all scientific and technological big data of the field in the time window can be described accurately;
4. The similarity between documents is expressed by the cosine similarity between the reduced document word vectors. Finally, following the nearest neighbor (KNN) idea, the smallest of the several highest similarity values between each item of scientific and technological big data and all other items represents its degree of innovation. This helps evaluate the innovation and value of scientific and technological big data, so that items with higher innovation and value can be screened more effectively.
Drawings
FIG. 1 is a flow chart of the innovation evaluation based on scientific big data text content of the present invention.
Detailed Description
In this embodiment, a method for evaluating innovation based on the text content of scientific and technological big data extracts the keywords of the evaluated text content with a TF-IDF model; constructs document word vectors from the keywords and their TF-IDF values to form a sparse matrix; reduces the dimension of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); expresses the similarity between documents by the cosine similarity between the reduced document word vectors; and finally, following the nearest neighbor (KNN) idea, represents the degree of innovation of each text by the smallest of the L highest similarity values between that text and all other texts. Specifically, as shown in FIG. 1, the method proceeds according to the following steps:
Step 1: acquire the set of scientific and technological big data text contents to be evaluated, preprocess it by removing duplicates and records with missing values, and divide the preprocessed text contents by generation time to obtain the timestamped text contents $\{d_1, d_2, \ldots, d_m, \ldots, d_M\}$, where $d_m$ denotes the m-th text content and M denotes the number of texts in the set;
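The preprocessing in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout (`text` and `created` fields), the exact-match duplicate test, and the function name `preprocess` are all assumptions made for the example. Keeping the most recent copy of a duplicate and grouping by year follow the description of the beneficial effects.

```python
from collections import defaultdict
from datetime import date


def preprocess(records):
    """Drop records with missing fields, keep the latest copy of each
    duplicate text, and group the survivors into yearly time windows."""
    latest = {}
    for rec in records:
        text, created = rec.get("text"), rec.get("created")
        if not text or created is None:
            continue  # remove records with missing values
        key = text.strip()  # assumption: duplicates are exact text matches
        if key not in latest or created > latest[key]["created"]:
            latest[key] = {"text": key, "created": created}

    windows = defaultdict(list)
    for rec in latest.values():
        windows[rec["created"].year].append(rec)  # one window per year
    for docs in windows.values():
        docs.sort(key=lambda r: r["created"])  # timestamp order d_1 .. d_M
    return dict(windows)


records = [
    {"text": "graph neural network", "created": date(2020, 3, 1)},
    {"text": "graph neural network", "created": date(2021, 5, 2)},
    {"text": None, "created": date(2021, 1, 1)},
    {"text": "topic model", "created": date(2021, 2, 9)},
]
windows = preprocess(records)
```

In this toy input the 2020 copy of the duplicated text is discarded, so only a 2021 window remains.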
Step 2: perform word segmentation, stop-word removal and duplicate-word merging on the m-th text content $d_m$ to obtain the m-th segmented text content $d'_m = \{w_{1,m}, w_{2,m}, \ldots, w_{n,m}, \ldots, w_{N_m,m}\}$, where $w_{n,m}$ denotes the n-th word in $d'_m$ and $N_m$ denotes the total number of distinct words in $d'_m$; all the segmented words of the M text contents together form the corpus D;
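Step 2 can be illustrated with a short sketch. The description names jieba for segmenting Chinese text; to keep the example free of third-party dependencies, a lowercase whitespace split stands in for the segmenter here, and the stop-word list is a made-up placeholder. Both are assumptions for illustration only.

```python
# Hypothetical stop-word list; a real run would load a curated list.
STOP_WORDS = {"the", "a", "of", "and", "is", "for"}


def segment(text):
    """Stand-in tokenizer (assumption): lowercase whitespace split.
    For Chinese text a word segmenter such as jieba would be used instead."""
    return text.lower().split()


def clean_tokens(text, stop_words=STOP_WORDS):
    """Return (tokens, distinct): tokens with stop words removed, and the
    distinct words ordered by first occurrence (the w_{n,m} of the text)."""
    tokens = [t for t in segment(text) if t not in stop_words]
    distinct = list(dict.fromkeys(tokens))  # merge repeats, keep first position
    return tokens, distinct


tokens, distinct = clean_tokens("The model of the topic model is a model")
```

The full token list is kept alongside the distinct words because the term frequency in step 3 still needs the occurrence counts.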
step 3, processing the text content after word segmentation by using a TF-IDF model, extracting key words of the scientific and technological big data to be evaluated and constructing a document word vector;
TF-IDF (Term Frequency-Inverse Document Frequency) is used for evaluating the importance degree of words to texts in a Document set or a corpus, and consists of two parts: TF and IDF.
Step 3.1: the term frequency (TF) is the frequency with which a word occurs in a text sample. Let $d'_m$ be a given segmented text sample and $w_{n,m}$ its n-th word by position after segmentation (for a repeated word, the position of its first occurrence is used and later occurrences are not assigned positions). The term frequency $\mathrm{tf}_{n,m}$ of this word in the sample is the ratio of its number of occurrences to the total number of occurrences of all words in the text, as expressed by formula (1):
$$\mathrm{tf}_{n,m} = \frac{c(w_{n,m})}{\sum_{i=1}^{N_m} c(w_{i,m})} \tag{1}$$
where $c(w_{n,m})$ denotes the number of occurrences of $w_{n,m}$ in the text sample.
Step 3.2: the inverse document frequency (IDF) evaluates how common a word is across a corpus. In this embodiment, the corpus D consists of all the segmented scientific and technological big data text contents in each time window. The IDF value $\mathrm{idf}_{n,m}$ of $w_{n,m}$ is expressed by formula (2) in terms of the number of texts in D that contain $w_{n,m}$, denoted $N(w_{n,m})$, and the total number of samples N:
$$\mathrm{idf}_{n,m} = \log\frac{N}{N(w_{n,m})} \tag{2}$$
Step 3.3: use formula (3) to calculate the TF-IDF value $T_{nm}$ of the n-th word $w_{n,m}$ of the m-th segmented text content $d'_m$ in the corpus D:
$$T_{nm} = \mathrm{tf}_{n,m} \times \mathrm{idf}_{n,m} \tag{3}$$
In formula (3), $\mathrm{tf}_{n,m}$ denotes the term frequency of the n-th word $w_{n,m}$ in the m-th segmented text content $d'_m$, and $\mathrm{idf}_{n,m}$ denotes the inverse document frequency of $w_{n,m}$ in the corpus D;
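Steps 3.1 to 3.3 can be sketched in a few lines. The term frequency follows the ratio described in the text; the inverse document frequency uses the plain, unsmoothed form log(N / document frequency), which is an assumption, since variants with smoothing terms are also common. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists (stop words already removed).
    Returns one {word: tf-idf value} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # total occurrences of all words
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [["graph", "neural", "graph"], ["topic", "model"], ["graph", "model"]]
scores = tf_idf(docs)
```

With the unsmoothed form, a word that occurs in every document gets a value of exactly zero, so it contributes nothing to the document word vectors built in the next step.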
Step 3.4: construct the document word vectors of the scientific and technological big data to be evaluated;
Merge the words repeated across all text contents in the corpus D to obtain the merged corpus $D' = \{t_1, t_2, \ldots, t_p, \ldots, t_P\}$, where $t_p$ denotes the p-th word and P denotes the total number of words in $D'$. Use formula (4) to calculate the importance $X_{pm}$ of the p-th word $t_p$ in the m-th segmented text content $d'_m$, which gives the word vector $X_m = (X_{1m}, X_{2m}, \ldots, X_{pm}, \ldots, X_{Pm})^T$ of $d'_m$ and, in turn, the word vectors $X = (X_1, X_2, \ldots, X_m, \ldots, X_M)$ of all documents in the time window M, which are treated as a sparse matrix:
$$X_{pm} = \begin{cases} T_{nm}, & t_p = w_{n,m} \in d'_m \\ 0, & t_p \notin d'_m \end{cases} \tag{4}$$
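The construction of the matrix X can be sketched as below, assuming the per-document TF-IDF values are already available as dictionaries. Rows correspond to the words of the merged corpus and columns to documents, matching the P-by-M layout in the text. Sorting the vocabulary alphabetically is an arbitrary choice for the example, and a dense list of lists is used for readability where a real system would likely use a sparse structure.

```python
def build_matrix(scores):
    """scores: one {word: tf-idf value} dict per document.
    Returns (vocab, X) where X has one row per word and one column
    per document; a word absent from a document contributes 0."""
    vocab = sorted({word for doc in scores for word in doc})  # merged corpus D'
    X = [[doc.get(word, 0.0) for doc in scores] for word in vocab]
    return vocab, X


scores = [{"graph": 0.27, "neural": 0.37}, {"topic": 0.55, "model": 0.2}]
vocab, X = build_matrix(scores)
```

Column m of `X` is the word vector of the m-th document, so most entries are zero whenever documents share few words.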
Step 4: reduce the dimension of the sparse matrix X by principal component analysis;
Step 4.1: use formula (5) to zero-center each row of the sparse matrix X, obtaining the matrix H:
$$H_{pm} = X_{pm} - \bar{X}_p \tag{5}$$
In formula (5), $\bar{X}_p = \frac{1}{M}\sum_{m=1}^{M} X_{pm}$ is the mean of row p, so that every row of H satisfies $\sum_{m=1}^{M} H_{pm} = 0$, $p = 1, 2, \ldots, P$.
Step 4.2: use formula (6) to calculate the covariance matrix C:
$$C = \frac{1}{M} H H^T \tag{6}$$
In formula (6), the covariance matrix C is a real symmetric matrix with P rows and P columns. Its diagonal elements are the variances of the corresponding rows of H, and the element in row i, column j equals the element in row j, column i and represents the covariance between rows i and j of H.
Step 4.3: use formula (7) to calculate the eigenvalues of the covariance matrix C and the corresponding orthonormal eigenvectors, and use formula (8) to arrange the eigenvectors as rows, in descending order of their eigenvalues, to form the matrix P;
The eigenvalues of C are computed and arranged in descending order as $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_P$. Since C is a real symmetric matrix with P rows and P columns, P orthonormal eigenvectors $e_1, e_2, e_3, \ldots, e_P$ corresponding in order to these eigenvalues can be found. Arranged as columns they form the matrix $E = (e_1, e_2, e_3, \ldots, e_P)$, and the following holds for the covariance matrix C:
$$E^T C E = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_P) \tag{7}$$
$$P = E^T = (e_1, e_2, \ldots, e_P)^T \tag{8}$$
Step 4.4: use formula (9) to take the matrix $P_k$ formed by the first k rows of P and multiply it by the matrix H to obtain the reduced matrix $Y = (Y_1, Y_2, \ldots, Y_m, \ldots, Y_M)$, where $Y_m = (Y_{1m}, Y_{2m}, \ldots, Y_{km})^T$ denotes the m-th document word vector after dimensionality reduction, $Y_{km}$ denotes its k-th component, and k < P:
$$Y = P_k H \tag{9}$$
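The four PCA sub-steps can be sketched in plain Python as follows. Two points are assumptions rather than statements from the text: the covariance is normalised by 1/M, and the eigenvectors are found by power iteration with deflation, a simple stand-in chosen to avoid third-party libraries (a numerical library's symmetric eigensolver would be the usual choice). All function names are illustrative.

```python
import math


def center_rows(X):
    """Step 4.1: subtract each row's mean (rows = words, columns = documents)."""
    centered = []
    for row in X:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def covariance(H):
    """Step 4.2: C = (1/M) H H^T, a symmetric matrix (1/M is an assumption)."""
    M = len(H[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / M for rj in H] for ri in H]


def top_eigenvectors(C, k, iters=1000):
    """Step 4.3 (stand-in method): power iteration with deflation.
    Returns up to k unit eigenvectors, largest eigenvalue first."""
    n = len(C)
    C = [row[:] for row in C]
    vectors = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-zero start
        for _ in range(iters):
            w = [sum(c * x for c, x in zip(row, v)) for row in C]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                return vectors  # no variance left to explain
            v = [x / norm for x in w]
        lam = sum(vi * sum(c * x for c, x in zip(row, v)) for vi, row in zip(v, C))
        vectors.append(v)
        # Deflate: remove the component that was just found.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return vectors


def pca_reduce(X, k):
    """Step 4.4: project the centered matrix onto the top-k eigenvectors."""
    H = center_rows(X)
    P_k = top_eigenvectors(covariance(H), k)  # each row is an eigenvector
    return [[sum(p * h for p, h in zip(row, col)) for col in zip(*H)] for row in P_k]


X = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 0.0, 0.0]]
Y = pca_reduce(X, 1)  # 1 x 4 matrix: one component per document
```

Power iteration converges slowly when the leading eigenvalues are close together, so this sketch is suited to small illustrative inputs rather than a full vocabulary-sized matrix.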
Step 5: calculate the cosine similarity $\cos\langle Y_m, Y_z\rangle$ between the m-th reduced document word vector $Y_m$ and the z-th reduced document word vector $Y_z$ in the time window M, and use it as the similarity $\mathrm{sim}\langle d_m, d_z\rangle$ between the m-th text content $d_m$ and the z-th text content $d_z$, thereby obtaining the set of similarities between $d_m$ and the other text contents, $\{\mathrm{sim}\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$;
Step 5.1: use formula (10) to calculate the similarity sim between different document word vectors in the k-dimensional vector space:
$$\mathrm{sim}\langle d_m, d_z\rangle = \cos\langle Y_m, Y_z\rangle = \frac{\sum_{j=1}^{k} Y_{jm} Y_{jz}}{\sqrt{\sum_{j=1}^{k} Y_{jm}^2}\,\sqrt{\sum_{j=1}^{k} Y_{jz}^2}} \tag{10}$$
In formula (10), $Y_{jm}$ denotes the j-th component of the m-th reduced document word vector and $Y_{jz}$ denotes the j-th component of the z-th reduced document word vector, where $m, z = 1, 2, 3, \ldots, M$.
Step 5.2: calculate the cosine similarity between each document word vector $Y_m$ and all other document word vectors, which forms M sets of M-1 elements each, in order:
$\{\mathrm{sim}\langle d_1, d_m\rangle \mid 1 < m \le M\},\ \{\mathrm{sim}\langle d_2, d_m\rangle \mid 1 \le m \le M,\ m \neq 2\},\ \ldots,\ \{\mathrm{sim}\langle d_M, d_m\rangle \mid 1 \le m < M\}$
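Step 5 can be sketched as below: the cosine similarity of every pair of reduced document vectors, collected into one list of M-1 values per document. Returning 0.0 when one of the vectors has zero length is an assumption added to avoid division by zero; the text does not cover that case.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vector -> 0.0


def similarity_sets(Y):
    """Y: k x M matrix whose columns are the reduced document vectors.
    Returns M lists, each holding the M-1 similarities to the other documents."""
    docs = list(zip(*Y))  # one k-dimensional vector per document
    return [
        [cosine(doc, other) for z, other in enumerate(docs) if z != m]
        for m, doc in enumerate(docs)
    ]


Y = [[1.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]
sets = similarity_sets(Y)
```

In the toy matrix the first two documents point in the same direction and the third is orthogonal to both, which is why the third document's similarities are all zero.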
Step 6: sort the similarity set $\{\mathrm{sim}\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$ in descending order, select the top L values, and take the L-th similarity value as the innovation value of the m-th text content $d_m$; standardize this innovation value to obtain the innovation score of $d_m$;
Step 6.1: sort the cosine similarity values in each set in descending order and select the L values with the highest similarity to the text; the smallest of these values represents the degree of innovation of the text. The degree of innovation of a document $d_j$ $(j = 1, 2, 3, \ldots, M)$ can be expressed by formula (11):
$$\mathrm{sim}_l(d_j) \quad (j = 1, 2, 3, \ldots, m, \ldots, M;\ l = 1, 2, 3, \ldots, k) \tag{11}$$
Step 6.2: standardize the calculated innovation of the scientific and technological big data and give the score as a percentage. As shown in formula (12), the innovation of a big data text document $d_j$ $(j = 1, 2, 3, \ldots, M)$ in the time window M can be expressed as:
Figure BDA0003398816240000081
In formula (12), $\mathrm{sim}_{\max}$ is the maximum of $\mathrm{sim}_l(d_j)$ over $j = 1, 2, 3, \ldots, M$ and $l = 1, 2, 3, \ldots, k$;
and 7, calculating the innovation scores of all the M document word vectors in the time window M according to the processes of the step 5 and the step 6, and performing descending order arrangement, thereby completing the innovation evaluation of the text content set of the scientific and technological big data to be evaluated.
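Steps 6 and 7 can be sketched as follows. Taking the L-th largest similarity as the innovation value follows the text. The standardization, however, is an assumption: the exact normalisation of formula (12) is not legible here, so the example uses min-max scaling to a 0 to 100 range in which a lower L-th similarity gives a higher score. Function names are illustrative.

```python
def innovation_values(similarity_sets, L):
    """Step 6.1: the L-th largest similarity of each document
    (a lower value means the document is less like its neighbours)."""
    return [sorted(sims, reverse=True)[L - 1] for sims in similarity_sets]


def innovation_scores(values):
    """Assumed standardization (not the patent's formula (12)):
    min-max scaling to 0..100, lowest similarity -> highest score."""
    sim_max, sim_min = max(values), min(values)
    if sim_max == sim_min:
        return [100.0 for _ in values]  # all documents equally novel
    return [100.0 * (sim_max - v) / (sim_max - sim_min) for v in values]


def rank_by_innovation(scores):
    """Step 7: document indices in descending order of innovation score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


sets = [[1.0, 0.5, 0.0], [1.0, 0.75, 0.5], [0.5, 0.25, 0.0]]
values = innovation_values(sets, L=2)
scores = innovation_scores(values)
ranking = rank_by_innovation(scores)
```

The choice of L trades robustness against sensitivity: a small L reacts to a single near-duplicate, while a larger L requires a document to differ from a whole neighbourhood before it is scored as innovative.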

Claims (1)

1. An innovation evaluation method based on scientific and technological big data text content is characterized by comprising the following steps:
Step 1: acquire the set of scientific and technological big data text contents to be evaluated, preprocess it by removing duplicates and records with missing values, and divide the preprocessed text contents by generation time to obtain the timestamped text contents $\{d_1, d_2, \ldots, d_m, \ldots, d_M\}$, where $d_m$ denotes the m-th text content and M denotes the number of texts in the set;
Step 2: perform word segmentation, stop-word removal and duplicate-word merging on the m-th text content $d_m$ to obtain the m-th segmented text content $d'_m = \{w_{1,m}, w_{2,m}, \ldots, w_{n,m}, \ldots, w_{N_m,m}\}$, where $w_{n,m}$ denotes the n-th word in $d'_m$ and $N_m$ denotes the total number of distinct words in $d'_m$; all the segmented words of the M text contents together form the corpus D;
Step 3: process the segmented text contents with a TF-IDF model, extract the keywords of the scientific and technological big data to be evaluated, and construct the document word vectors;
Step 3.1: use formula (1) to calculate the TF-IDF value $T_{nm}$ of the n-th word $w_{n,m}$ of the m-th segmented text content $d'_m$ in the corpus D:
$$T_{nm} = \mathrm{tf}_{n,m} \times \mathrm{idf}_{n,m} \tag{1}$$
In formula (1), $\mathrm{tf}_{n,m}$ denotes the term frequency of the n-th word $w_{n,m}$ in the m-th segmented text content $d'_m$, and $\mathrm{idf}_{n,m}$ denotes the inverse document frequency of $w_{n,m}$ in the corpus D;
Step 3.2: construct the document word vectors of the scientific and technological big data to be evaluated;
Merge the words repeated across the text contents in the corpus D to obtain the merged corpus $D' = \{t_1, t_2, \ldots, t_p, \ldots, t_P\}$, where $t_p$ denotes the p-th word and P denotes the total number of words in $D'$. Use formula (2) to calculate the importance $X_{pm}$ of the p-th word $t_p$ in the m-th segmented text content $d'_m$, which gives the word vector $X_m = (X_{1m}, X_{2m}, \ldots, X_{pm}, \ldots, X_{Pm})^T$ of $d'_m$ and, in turn, the word vectors $X = (X_1, X_2, \ldots, X_m, \ldots, X_M)$ of all documents in the time window M, which are treated as a sparse matrix:
$$X_{pm} = \begin{cases} T_{nm}, & t_p = w_{n,m} \in d'_m \\ 0, & t_p \notin d'_m \end{cases} \tag{2}$$
Step 4: reduce the dimension of the sparse matrix X by principal component analysis;
Step 4.1: zero-center each row of the sparse matrix X (subtract the row mean from every element of the row) to obtain the matrix H;
Step 4.2: calculate the covariance matrix $C = \frac{1}{M} H H^T$;
Step 4.3: calculate the eigenvalues of the covariance matrix C and the corresponding orthonormal eigenvectors, and arrange the eigenvectors as rows, in descending order of their eigenvalues, to form the matrix P;
Step 4.4: take the matrix formed by the first k rows of P and multiply it by the matrix H to obtain the reduced matrix $Y = (Y_1, Y_2, \ldots, Y_m, \ldots, Y_M)$, where $Y_m = (Y_{1m}, Y_{2m}, \ldots, Y_{km})^T$ denotes the m-th document word vector after dimensionality reduction, $Y_{km}$ denotes its k-th component, and k < P;
Step 5: calculate the cosine similarity $\cos\langle Y_m, Y_z\rangle$ between the m-th reduced document word vector $Y_m$ and the z-th reduced document word vector $Y_z$ in the time window M, and use it as the similarity $\mathrm{sim}\langle d_m, d_z\rangle$ between the m-th text content $d_m$ and the z-th text content $d_z$, thereby obtaining the set of similarities between $d_m$ and the other text contents, $\{\mathrm{sim}\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$;
Step 6: sort the similarity set $\{\mathrm{sim}\langle d_m, d_z\rangle \mid z = 1, 2, \ldots, M,\ z \neq m\}$ in descending order, select the top L values, and take the L-th similarity value as the innovation value of the m-th text content $d_m$; standardize this innovation value to obtain the innovation score of $d_m$;
Step 7: calculate the innovation scores of all M text contents in the time window M according to steps 5 and 6 and arrange them in descending order, thereby completing the innovation evaluation of the set of scientific and technological big data text contents to be evaluated.
CN202111489894.1A 2021-12-08 2021-12-08 Innovative evaluation method based on science and technology big data text content Active CN114154498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111489894.1A CN114154498B (en) 2021-12-08 2021-12-08 Innovative evaluation method based on science and technology big data text content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111489894.1A CN114154498B (en) 2021-12-08 2021-12-08 Innovative evaluation method based on science and technology big data text content

Publications (2)

Publication Number Publication Date
CN114154498A (en) 2022-03-08
CN114154498B (en) 2024-02-20

Family

ID=80453329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111489894.1A Active CN114154498B (en) 2021-12-08 2021-12-08 Innovative evaluation method based on science and technology big data text content

Country Status (1)

Country Link
CN (1) CN114154498B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050143971A1 (en) * 2003-10-27 2005-06-30 Jill Burstein Method and system for determining text coherence
CN109885675A (en) * 2019-02-25 2019-06-14 Hefei University of Technology (合肥工业大学) Text sub-topic discovery method based on improved LDA
CN110597949A (en) * 2019-08-01 2019-12-20 Hubei University of Technology (湖北工业大学) Court similar case recommendation model based on word vectors and word frequency
CN111104794A (en) * 2019-12-25 2020-05-05 Tongfang Knowledge Network (Beijing) Technology Co., Ltd. (同方知网(北京)技术有限公司) Text similarity matching method based on subject words

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
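王莉军; 姚长青; 刘志辉: "A scientific and technical paper evaluation method based on text mining and bibliometrics" (一种文本挖掘和文献计量的科技论文评估方法), 情报科学 (Information Science), no. 05, 1 May 2019 (2019-05-01) *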

Also Published As

Publication number Publication date
CN114154498B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN110297988B (en) Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN108763214B (en) Automatic construction method of emotion dictionary for commodity comments
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN105183833B (en) Microblog text recommendation method and device based on user model
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN112434720A (en) Chinese short text classification method based on graph attention network
CN108804595B (en) Short text representation method based on word2vec
CN110765254A (en) Multi-document question-answering system model integrating multi-view answer reordering
CN110046264A (en) Automatic classification method for mobile phone documents
CN112989802A (en) Barrage keyword extraction method, device, equipment and medium
CN106570076A (en) Computer text classification system
CN102142082A (en) Virtual sample based kernel discrimination method for face recognition
CN109918648A (en) Rumor depth detection method based on dynamic sliding window feature scoring
CN115578137A (en) Agricultural product future price prediction method and system based on text mining and deep learning model
CN109086794A (en) A kind of driving behavior mode knowledge method based on T-LDA topic model
Chen et al. Max-margin discriminant projection via data augmentation
CN114792246A (en) Method and system for mining typical product characteristics based on topic integration clustering
CN116682015A (en) Feature decoupling-based cross-domain small sample radar one-dimensional image target recognition method
CN114154498A (en) Innovative evaluation method based on scientific and technological big data text content
Wu et al. SOUA: Towards Intelligent Recommendation for Applying for Overseas Universities
CN113342950B (en) Answer selection method and system based on semantic association
CN105404899A (en) Image classification method based on multi-directional context information and sparse coding model
CN114722183A (en) Knowledge pushing method and system for scientific research tasks
CN114138942A (en) Violation detection method based on text emotional tendency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant