CN114154498B - Innovative evaluation method based on science and technology big data text content - Google Patents
Innovative evaluation method based on science and technology big data text content
- Publication number
- CN114154498B (publication) CN202111489894.1A (application)
- Authority
- CN
- China
- Prior art keywords
- text content
- word
- text
- mth
- big data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000011156 evaluation Methods 0.000 title claims abstract description 18
- 238000005516 engineering process Methods 0.000 title abstract description 8
- 239000013598 vector Substances 0.000 claims abstract description 42
- 230000011218 segmentation Effects 0.000 claims abstract description 36
- 238000000034 method Methods 0.000 claims abstract description 15
- 238000007781 pre-processing Methods 0.000 claims abstract description 5
- 239000011159 matrix material Substances 0.000 claims description 48
- 238000012217 deletion Methods 0.000 claims description 3
- 230000037430 deletion Effects 0.000 claims description 3
- 238000012847 principal component analysis method Methods 0.000 claims description 3
- 238000000513 principal component analysis Methods 0.000 abstract description 6
- 238000012216 screening Methods 0.000 abstract description 2
- 238000004458 analytical method Methods 0.000 description 4
- 238000004519 manufacturing process Methods 0.000 description 2
- 238000002360 preparation method Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Optimization (AREA)
- General Health & Medical Sciences (AREA)
- Pure & Applied Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an innovation evaluation method based on the text content of science and technology big data, comprising the following steps: 1. acquire, preprocess and segment the text content of the science and technology big data; 2. process the segmented text data with a TF-IDF model and construct the document word vectors of the science and technology big data to be evaluated; 3. reduce the dimensionality of the document word vectors with principal component analysis (PCA); 4. calculate the similarity between all documents under time window M, represented by the cosine between the document word vectors; 5. sort the similarity values in each set in descending order and select the L values most similar to the text; the smallest of these L values represents the innovativeness of the text, from which a normalized innovation score is obtained. The invention can effectively evaluate the innovativeness of science and technology big data and improve evaluation accuracy, laying a foundation for evaluating and screening valuable science and technology big data.
Description
Technical Field
The invention relates to the field of science and technology big data value evaluation, and in particular to an innovation evaluation method for science and technology big data based on text content.
Background
In recent years, with the vigorous development of network and communication technologies, the data generated by people's life and production have grown explosively, and modern society has entered the big data era. Science and technology big data are an information resource that reflects the state and process of human scientific and technological activity; they can help human beings gain insight into new ideas, discover new laws, invent new technologies and develop new products. In other words, science and technology big data are as valuable as other common data; at the same time, owing to their own characteristics, their value is mainly driven by scientific and technological innovation. Innovativeness is therefore an indispensable characteristic of science and technology big data, and the fundamental characteristic distinguishing them from other data.
Science and technology big data comprise scientific papers, patents, software copyright registrations, standards and specifications, policy recommendations and the like, and contain a large amount of unstructured data represented by text content, among which the value and innovativeness of scientific papers and invention patents have been studied most. On the one hand, researchers describe data value by establishing value evaluation index systems and applying traditional bibliometric models, but the quality of unstructured data such as text content is hard to measure in this way; on the other hand, scholars measure the quality of text content with traditional text analysis methods such as word-frequency analysis and co-word analysis, and with topic models represented by LDA, but the innovation evaluation of text content remains little explored.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an innovation evaluation method based on the text content of science and technology big data, so that the innovativeness of science and technology big data can be effectively evaluated and the evaluation accuracy improved, laying a foundation for evaluating and screening valuable science and technology big data.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The innovation evaluation method based on the text content of science and technology big data is characterized by comprising the following steps:
step 1, after acquiring the text content set of the science and technology big data to be evaluated and preprocessing it by removing duplicates and records with missing values, dividing the preprocessed text content according to its generation time to obtain the time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}; d_m represents the m-th text content, and M represents the total number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_1^m, w_2^m, ..., w_n^m, ..., w_{N_m}^m}, where w_n^m represents the n-th word of the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all segmented words of the M text contents thus form a corpus D;
step 3, using a TF-IDF model to process the text content after word segmentation, extracting keywords of technological big data to be evaluated and constructing a document word vector;
step 3.1, calculating, by formula (1), the TF-IDF value T_nm of the n-th word w_n^m of the segmented m-th text content d'_m in the corpus D:

T_nm = TF(w_n^m) × IDF(w_n^m) (1)

in formula (1), TF(w_n^m) represents the word frequency of the n-th word w_n^m in the segmented m-th text content d'_m, and IDF(w_n^m) represents its inverse document frequency in the corpus D;
step 3.2, constructing the document word vectors of the science and technology big data to be evaluated;

merging repeated words across the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating, by formula (2), the importance degree X_pm of the p-th word t_p in the segmented m-th text content d'_m (i.e. its TF-IDF value there, taken as 0 when t_p does not occur in d'_m), thereby obtaining the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of the segmented m-th text content d'_m, and further the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents under the time window M, which form a sparse matrix;
step 4, performing dimension reduction on the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean treatment on each row of elements in the sparse matrix X to obtain a matrix H;
step 4.2, calculating the covariance matrix C = (1/M)·H·H^T;

step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding unit orthogonal eigenvectors, and forming the eigenvectors, by rows, into a matrix P in descending order of the corresponding eigenvalues;

step 4.4, taking the matrix formed by the first k rows of the matrix P and multiplying it by the matrix H, thereby obtaining the reduced matrix Y = (Y_1, Y_2, ..., Y_m, ..., Y_M), where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th reduced document word vector and Y_km represents its k-th component; k < P;
step 5, calculating the cosine similarity value cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z under the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, thereby obtaining the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} between the m-th text content d_m and the other text contents;
step 6, sorting the similarities in the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the first L values with the largest text similarity, and taking the L-th similarity value (the smallest of the selected L) to represent the innovativeness of the m-th text content d_m; this value is then normalized to obtain the innovation score of the m-th text content d_m;
and 7, calculating innovation scores of all M text contents under the time window M according to the processes of the steps 5 and 6, and arranging the text contents in a descending order, so that the innovation evaluation of the text content set of the technological big data to be evaluated is completed.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention acquires and preprocesses the text content data of science and technology big data: records with missing values are removed, only the most recent item is kept for duplicated data, and the evaluated data are grouped by year into time windows; combining an existing mature stop-word vocabulary, the text content is segmented with the jieba segmenter and the stop words in it are deleted, removing meaningless words; this greatly improves the quality of the text data and the efficiency of the subsequent text analysis;
2. the invention processes the segmented text data with a TF-IDF model, extracts the keywords of the science and technology big data, and calculates a TF-IDF value for each segmented word of the evaluated data, thereby revealing the key topics and information of each item of science and technology big data under the time window; the document word vectors are constructed from the segmented words and their TF-IDF values, preparing for the subsequent measurement of similarity between texts;
3. the invention constructs a sparse matrix from all document word vectors under each time window and reduces the dimensionality of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); this avoids the difficulty of computing distances between sample points in a high-dimensional space, prepares for the subsequent similarity calculation, improves the accuracy of the similarity measurement, and helps to accurately characterize the innovativeness of all science and technology big data in the field under the time window;
4. the invention represents the similarity between documents by the cosine similarity between the reduced document word vectors; finally, following the k-nearest-neighbour (KNN) idea, the smallest among the L values with the highest similarity between each item of science and technology big data and all the others represents its innovativeness, so that the innovativeness and value of science and technology big data can be evaluated and innovative, valuable items can be screened more effectively.
Drawings
FIG. 1 is a flow chart of the innovation evaluation method of the present invention based on the text content of science and technology big data.
Detailed Description
In this embodiment, an innovation evaluation method based on the text content of science and technology big data extracts the keywords of the evaluated text content with a TF-IDF model; constructs document word vectors from the keywords and their TF-IDF values to form a sparse matrix; reduces the dimensionality of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); then represents the similarity between documents by the cosine similarity between the reduced document word vectors; finally, following the k-nearest-neighbour (KNN) idea, the smallest of the L values with the highest similarity between each text and all other texts represents the innovativeness of that text. Specifically, as shown in FIG. 1, the method comprises the following steps:
step 1, after acquiring the text content set of the science and technology big data to be evaluated and preprocessing it by removing duplicates and records with missing values, dividing the preprocessed text content according to its generation time to obtain the time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}; d_m represents the m-th text content, and M represents the total number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_1^m, w_2^m, ..., w_n^m, ..., w_{N_m}^m}, where w_n^m represents the n-th word of the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all segmented words of the M text contents thus form a corpus D;
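The preprocessing and segmentation of steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain whitespace tokenizer and a toy English stop-word list stand in for the jieba segmenter and the mature Chinese stop-word vocabulary the embodiment assumes.

```python
# Sketch of steps 1-2: deduplicate, drop empty records, segment, remove stop
# words, merge repeated words. A real implementation would call jieba.lcut()
# for Chinese segmentation; a whitespace split keeps the example self-contained.

STOP_WORDS = {"the", "a", "of", "and", "for"}  # toy stop-word list (assumption)

def preprocess(texts):
    """Remove duplicates (keeping the first occurrence) and records with no content."""
    seen, cleaned = set(), []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    return cleaned

def segment(text):
    """Tokenize, drop stop words, and keep one copy of each word in first-seen order."""
    out = []
    for w in text.lower().split():
        if w not in STOP_WORDS and w not in out:
            out.append(w)
    return out

docs = preprocess(["A study of graphene sensors", "", "A study of graphene sensors",
                   "Deep learning for patent analysis"])
corpus = [segment(d) for d in docs]
print(corpus)  # [['study', 'graphene', 'sensors'], ['deep', 'learning', 'patent', 'analysis']]
```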
step 3, using a TF-IDF model to process the text content after word segmentation, extracting keywords of technological big data to be evaluated and constructing a document word vector;
TF-IDF (Term Frequency-Inverse Document Frequency) is used to evaluate the importance of words to text in a document set or corpus, and consists of two parts: TF and IDF.
Step 3.1: word frequency (TF) is the frequency of occurrence of the word in a text sample, assuming d m For a particular text sample to be used,for the n-th word (if there is a repeated word, the first appearance position of the word is selected as the reference, and the subsequent repeated word is not counted), the word frequency of the word in the text sample is +.>The ratio of the frequency of occurrence of the word to the total frequency of occurrence of all words in the text can be expressed by the formula (1):
step 3.2: the inverse document frequency (IDF) evaluates how common a word is across the corpus. In this embodiment, the corpus D consists of the segmented text content data of all the science and technology big data under each time window; the IDF value IDF(w_n^m) of the word w_n^m can be expressed, using the number N_w of texts in D that contain w_n^m and the total number of samples N, as formula (2):

IDF(w_n^m) = log( N / (N_w + 1) ) (2)
step 3.3: calculating, by formula (3), the TF-IDF value T_nm of the n-th word w_n^m of the segmented m-th text content d'_m in the corpus D:

T_nm = TF(w_n^m) × IDF(w_n^m) (3)

in formula (3), TF(w_n^m) represents the word frequency of the n-th word w_n^m in the segmented m-th text content d'_m, and IDF(w_n^m) represents its inverse document frequency in the corpus D;
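Assuming the TF-IDF value is the standard product of a raw-count TF and a smoothed IDF of the form log(N / (N_w + 1)), steps 3.1 to 3.3 can be sketched as:

```python
import math

def tf(word, doc):
    """Word frequency: occurrences of `word` over total word occurrences in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Inverse document frequency: log(N / (N_w + 1)), where N is the number of
    documents in the corpus and N_w the number containing `word` (smoothed by +1)."""
    n_w = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (n_w + 1))

def tf_idf(word, doc, corpus):
    """TF-IDF value of a word for one document against the whole corpus."""
    return tf(word, doc) * idf(word, corpus)

corpus = [["graphene", "sensor", "fabrication"],
          ["graphene", "battery", "anode"],
          ["patent", "text", "mining"]]
# "graphene" appears in 2 of 3 documents, so its smoothed IDF is log(3/3) = 0:
print(tf_idf("graphene", corpus[0], corpus))          # 0.0
# "sensor" appears in 1 document: TF = 1/3, IDF = log(3/2)
print(round(tf_idf("sensor", corpus[0], corpus), 4))  # ≈ 0.1352
```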
step 3.4, constructing the document word vectors of the science and technology big data to be evaluated;

merging repeated words across the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating, by formula (4), the importance degree X_pm of the p-th word t_p in the segmented m-th text content d'_m (i.e. its TF-IDF value there, taken as 0 when t_p does not occur in d'_m), thereby obtaining the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of the segmented m-th text content d'_m, and further the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents under the time window M, which form a sparse matrix;
step 4, performing dimension reduction on the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean treatment on each row of elements of the sparse matrix X by formula (5) to obtain the matrix H:

H_pm = X_pm - mean_p, where mean_p = (1/M)·Σ_{m=1..M} X_pm (5)

in formula (5), each row of H satisfies Σ_{m=1..M} H_pm = 0;
Step 4.2, calculating a covariance matrix by using the method (6)
In the formula (6), the covariance matrix C is a real symmetric matrix of p rows and p columns, diagonal elements of the covariance matrix C respectively correspond to variances of data of each row of the matrix H, and j-th row are identical in elements, which represents covariance between j-th row and j-th row of the matrix H:
step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding unit orthogonal eigenvectors by formula (7), and forming the eigenvectors, by rows, into the matrix P in descending order of the corresponding eigenvalues by formula (8);

the eigenvalues of C arranged in descending order are λ_1 ≥ λ_2 ≥ ... ≥ λ_P; since C is a real symmetric matrix with P rows and P columns, there are P unit orthogonal eigenvectors e_1, e_2, ..., e_P corresponding in turn to λ_1, λ_2, ..., λ_P; forming them by columns into the matrix E = (e_1, e_2, ..., e_P), the covariance matrix C satisfies:

E^T·C·E = diag(λ_1, λ_2, ..., λ_P) (7)

and the projection matrix is formed by rows as

P = (e_1, e_2, ..., e_P)^T (8)
step 4.4, taking, by formula (9), the matrix formed by the first k rows of the matrix P and multiplying it by the matrix H, thereby obtaining the reduced matrix

Y = P_(1..k)·H = (Y_1, Y_2, ..., Y_m, ..., Y_M) (9)

where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th reduced document word vector and Y_km represents its k-th component; k < P;
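Steps 4.1 to 4.4 amount to projecting the zero-meaned document-word matrix onto the leading eigenvectors of its covariance matrix. A minimal sketch with k = 1, using power iteration to approximate the leading eigenvector (a full implementation would take the first k eigenvectors, e.g. via a symmetric eigendecomposition routine):

```python
import math

def zero_mean_rows(X):
    """Step 4.1: subtract each row's mean (rows = words/features, columns = documents)."""
    return [[x - sum(row) / len(row) for x in row] for row in X]

def covariance(H):
    """Step 4.2: C = (1/M) * H * H^T, a P x P real symmetric matrix."""
    P, M = len(H), len(H[0])
    return [[sum(H[i][m] * H[j][m] for m in range(M)) / M for j in range(P)]
            for i in range(P)]

def leading_eigenvector(C, iters=200):
    """Step 4.3 (first component only): power iteration on the symmetric matrix C."""
    v = [1.0] * len(C)
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(len(C))) for i in range(len(C))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(H, v):
    """Step 4.4 with k = 1: one reduced coordinate per document (column of H)."""
    return [sum(v[i] * H[i][m] for i in range(len(H))) for m in range(len(H[0]))]

# Toy sparse matrix: 3 words (rows) x 4 documents (columns).
X = [[2.5, 0.5, 2.2, 1.9], [2.4, 0.7, 2.9, 2.2], [0.1, 0.0, 0.2, 0.1]]
H = zero_mean_rows(X)
Y = project(H, leading_eigenvector(covariance(H)))
print([round(y, 3) for y in Y])  # one reduced coordinate per document (k = 1)
```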
step 5, calculating the cosine similarity value cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z under the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, thereby obtaining the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} between the m-th text content d_m and the other text contents;
step 5.1, calculating the similarity between different document word vectors in the k-dimensional vector space by formula (10):

sim<d_m, d_z> = cos<Y_m, Y_z> = ( Σ_{j=1..k} Y_jm·Y_jz ) / ( sqrt(Σ_{j=1..k} Y_jm^2) · sqrt(Σ_{j=1..k} Y_jz^2) ) (10)

in formula (10), Y_jm represents the j-th component of the m-th reduced document word vector and Y_jz represents the j-th component of the z-th reduced document word vector, with m, z = 1, 2, 3, ..., M, m ≠ z, and j = 1, 2, ..., k;
step 5.2, calculating the cosine similarity between each document word vector and all the other document word vectors forms M sets, each with M - 1 elements, namely:

{sim<d_1, d_m> | 1 < m ≤ M}, {sim<d_2, d_m> | 1 ≤ m ≤ M, m ≠ 2}, ..., {sim<d_M, d_m> | 1 ≤ m < M}
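The cosine similarity of formula (10) and the M similarity sets of step 5.2 can be sketched as:

```python
import math

def cosine(a, b):
    """Formula (10): cos<a, b> = a·b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_sets(Y):
    """Step 5.2: for each document m, its similarities to all other documents
    (M sets of M - 1 elements each)."""
    M = len(Y)
    return [[cosine(Y[m], Y[z]) for z in range(M) if z != m] for m in range(M)]

Y = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # three reduced document vectors, k = 2
sets = similarity_sets(Y)
print([len(s) for s in sets])              # [2, 2, 2]
print(round(sets[0][0], 4))                # cos<Y_1, Y_2> = 1/sqrt(2) ≈ 0.7071
```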
step 6, sorting the similarities in the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the first L values with the largest text similarity, and taking the L-th similarity value (the smallest of the selected L) to represent the innovativeness of the m-th text content d_m; this value is then normalized to obtain the innovation score of the m-th text content d_m;
step 6.1, sorting the cosine similarity values in each set in descending order and selecting the L values most similar to the text, the smallest of which represents the innovativeness of the text; the innovativeness of document d_j (j = 1, 2, 3, ..., M) can thus be expressed as formula (11):

sim_L(d_j), j = 1, 2, 3, ..., M (11)

where sim_L(d_j) denotes the L-th largest similarity value in the set of document d_j;
step 6.2, normalizing the innovativeness results of the science and technology big data and assigning percentage scores as shown in formula (12), so that the innovation score of document d_j (j = 1, 2, 3, ..., M) under the time window M is obtained from sim_L(d_j);

in formula (12), sim_max is the maximum of sim_L(d_j) over j = 1, 2, 3, ..., M;
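Step 6 then reduces to taking the L-th largest similarity in each set and normalizing it. The normalization below is an assumption, since formula (12) is not reproduced in the text: it divides by sim_max and maps sim_L to a 0-100 score in which a document far from its nearest peers (low sim_L) scores high.

```python
def innovativeness(sim_set, L):
    """Step 6.1: the L-th largest similarity value, i.e. the smallest of the L
    nearest neighbours, characterises how close a document is to its peers."""
    return sorted(sim_set, reverse=True)[L - 1]

def innovation_scores(sets, L):
    """Step 6.2 (assumed normalization): map each sim_L to a 0-100 score where
    a document with low similarity to all others scores high."""
    sims = [innovativeness(s, L) for s in sets]
    sim_max = max(sims)
    return [100.0 * (1.0 - s / sim_max) for s in sims]

# Toy similarity sets for M = 3 documents (each set holds M - 1 values).
sets = [[0.9, 0.8, 0.2], [0.9, 0.7, 0.3], [0.3, 0.2, 0.1]]
scores = innovation_scores(sets, L=2)
print([round(s, 1) for s in scores])  # [0.0, 12.5, 75.0]
```

The final ranking of step 7 is then a descending sort of these scores.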
and 7, calculating innovation scores of all M document word vectors under the time window M according to the processes of the steps 5 and 6, and arranging the innovation scores in a descending order, so that the innovation evaluation of the text content set of the technical big data to be evaluated is completed.
Claims (1)
1. An innovative evaluation method based on technological big data text content is characterized by comprising the following steps:
step 1, after acquiring the text content set of the science and technology big data to be evaluated and preprocessing it by removing duplicates and records with missing values, dividing the preprocessed text content according to its generation time to obtain the time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}; d_m represents the m-th text content, and M represents the total number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_1^m, w_2^m, ..., w_n^m, ..., w_{N_m}^m}, where w_n^m represents the n-th word of the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all segmented words of the M text contents thus form a corpus D;
step 3, using a TF-IDF model to process the text content after word segmentation, extracting keywords of technological big data to be evaluated and constructing a document word vector;
step 3.1, calculating, by formula (1), the TF-IDF value T_nm of the n-th word w_n^m of the segmented m-th text content d'_m in the corpus D:

T_nm = TF(w_n^m) × IDF(w_n^m) (1)

in formula (1), TF(w_n^m) represents the word frequency of the n-th word w_n^m in the segmented m-th text content d'_m, and IDF(w_n^m) represents its inverse document frequency in the corpus D;
step 3.2, constructing the document word vectors of the science and technology big data to be evaluated;

merging repeated words across the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating, by formula (2), the importance degree X_pm of the p-th word t_p in the segmented m-th text content d'_m (i.e. its TF-IDF value there, taken as 0 when t_p does not occur in d'_m), thereby obtaining the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of the segmented m-th text content d'_m, and further the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents under the time window M, which form a sparse matrix;
step 4, performing dimension reduction on the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean treatment on each row of elements in the sparse matrix X to obtain a matrix H;
step 4.2, calculating the covariance matrix C = (1/M)·H·H^T;

step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding unit orthogonal eigenvectors, and forming the eigenvectors, by rows, into a matrix P in descending order of the corresponding eigenvalues;

step 4.4, taking the matrix formed by the first k rows of the matrix P and multiplying it by the matrix H, thereby obtaining the reduced matrix Y = (Y_1, Y_2, ..., Y_m, ..., Y_M), where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th reduced document word vector and Y_km represents its k-th component; k < P;
step 5, calculating the cosine similarity value cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z under the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, thereby obtaining the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} between the m-th text content d_m and the other text contents;
step 6, sorting the similarities in the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the first L values with the largest text similarity, and taking the L-th similarity value (the smallest of the selected L) to represent the innovativeness of the m-th text content d_m; this value is then normalized to obtain the innovation score of the m-th text content d_m;
and 7, calculating innovation scores of all M text contents under the time window M according to the processes of the steps 5 and 6, and arranging the text contents in a descending order, so that the innovation evaluation of the text content set of the technological big data to be evaluated is completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111489894.1A CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111489894.1A CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114154498A CN114154498A (en) | 2022-03-08 |
CN114154498B true CN114154498B (en) | 2024-02-20 |
Family
ID=80453329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111489894.1A Active CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114154498B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109885675A (en) * | 2019-02-25 | 2019-06-14 | 合肥工业大学 | Method is found based on the text sub-topic for improving LDA |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720675B2 (en) * | 2003-10-27 | 2010-05-18 | Educational Testing Service | Method and system for determining text coherence |
-
2021
- 2021-12-08 CN CN202111489894.1A patent/CN114154498B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109885675A (en) * | 2019-02-25 | 2019-06-14 | 合肥工业大学 | Method is found based on the text sub-topic for improving LDA |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
Non-Patent Citations (1)
Title |
---|
A science and technology paper evaluation method based on text mining and bibliometrics; Wang Lijun; Yao Changqing; Liu Zhihui; Information Science; 2019-05-01 (No. 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN114154498A (en) | 2022-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Goëau et al. | Lifeclef bird identification task 2016: The arrival of deep learning | |
CN111401040B (en) | Keyword extraction method suitable for word text | |
CN104966097A (en) | Complex character recognition method based on deep learning | |
CN105843850B (en) | Search optimization method and device | |
CN108804595B (en) | Short text representation method based on word2vec | |
CN110765254A (en) | Multi-document question-answering system model integrating multi-view answer reordering | |
CN110046264A (en) | A kind of automatic classification method towards mobile phone document | |
CN112051986B (en) | Code search recommendation device and method based on open source knowledge | |
Wolf et al. | Computerized paleography: tools for historical manuscripts | |
CN108647729A (en) | A kind of user's portrait acquisition methods | |
CN113688635B (en) | Class case recommendation method based on semantic similarity | |
CN111813933A (en) | Automatic identification method for technical field in technical atlas | |
CN103745242A (en) | Cross-equipment biometric feature recognition method | |
CN109344248B (en) | Academic topic life cycle analysis method based on scientific and technological literature abstract clustering | |
CN113342950B (en) | Answer selection method and system based on semantic association | |
CN111984790B (en) | Entity relation extraction method | |
CN114154498B (en) | Innovative evaluation method based on science and technology big data text content | |
CN111242131B (en) | Method, storage medium and device for identifying images in intelligent paper reading | |
CN116682015A (en) | Feature decoupling-based cross-domain small sample radar one-dimensional image target recognition method | |
CN111221915B (en) | Online learning resource quality analysis method based on CWK-means | |
CN114722183A (en) | Knowledge pushing method and system for scientific research tasks | |
CN113657106B (en) | Feature selection method based on normalized word frequency weight | |
Bria et al. | Deep Transfer Learning for writer identification in medieval books | |
CN105404899A (en) | Image classification method based on multi-directional context information and sparse coding model | |
CN115544361A (en) | Frame for predicting change of attention point of window similarity analysis and analysis method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||