CN114154498A - Innovative evaluation method based on scientific and technological big data text content - Google Patents
- Publication number: CN114154498A (application number CN202111489894.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F18/2135 — Feature extraction by transforming the feature space, based on approximation criteria, e.g. principal component analysis
- G06F40/216 — Parsing using statistical methods
- Y02P90/30 — Computing systems specially adapted for manufacturing
Abstract
The invention discloses an innovation evaluation method based on scientific and technological big data text content, comprising the following steps: 1. acquire, preprocess and segment the scientific and technological big data text content; 2. process the segmented text data with a TF-IDF model and construct document word vectors for the scientific and technological big data to be evaluated; 3. reduce the dimensionality of the document word vectors by principal component analysis (PCA); 4. compute the similarity between all documents within the time window M, expressed as the cosine between the document word vectors; 5. sort the similarity values in each set in descending order and select the L values most similar to the text; the smallest of these similarity values represents the degree of innovation of the text, which is then normalized into an innovation score. The method can effectively evaluate the innovativeness of scientific and technological big data and improve evaluation accuracy, laying a foundation for the assessment and screening of valuable scientific and technological big data.
Description
Technical Field
The invention relates to the field of scientific and technological big data value evaluation, in particular to a text content-based scientific and technological big data innovation evaluation method.
Background
In recent years, with the rapid development of network and communication technology, the data generated in daily life and production has grown explosively, and modern society has entered the big-data era. Scientific and technological big data is an information resource that reflects the state and process of human scientific and technological activity: it can help people understand new ideas, discover new laws, invent new technologies and develop new products. In other words, scientific and technological big data has value just as ordinary data does; at the same time, owing to its particular characteristics, its value is driven mainly by scientific and technological innovation. Innovativeness is therefore an indispensable property of scientific and technological big data and the fundamental characteristic distinguishing it from other data.
Scientific and technological big data comprises scientific articles, patents, software copyrights, standards and specifications, policy recommendations and the like, and contains a large amount of unstructured data typified by text content. Existing research on value and innovation concentrates on scientific articles and invention patents. On the one hand, researchers characterize data value by building value-evaluation index systems and applying traditional bibliometric models, but such models struggle to measure the quality of unstructured data such as text content. On the other hand, scholars measure text quality with traditional text-analysis methods such as word-frequency and co-occurrence analysis, or with topic models typified by LDA; evaluation of the innovativeness of the text content itself remains scarce.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an innovation evaluation method based on the text content of scientific and technological big data, so that the degree of innovation of scientific and technological big data can be evaluated effectively and the evaluation accuracy improved, thereby laying a foundation for the assessment and screening of valuable scientific and technological big data.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to an innovative assessment method based on scientific and technological big data text content, which is characterized by comprising the following steps:
step 1, acquiring the text content set of the scientific and technological big data to be evaluated, performing de-duplication and missing-value-removal preprocessing, and dividing the preprocessed text content by its generation time to obtain time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}, where d_m represents the m-th text content and M represents the number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_m^1, w_m^2, ..., w_m^n, ..., w_m^(N_m)}, where w_m^n represents the n-th word in the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all the segmented words of the M text contents thus form a corpus D;
step 3, processing the text content after word segmentation by using a TF-IDF model, extracting key words of the scientific and technological big data to be evaluated and constructing a document word vector;
step 3.1, calculating by formula (1) the TF-IDF value T_nm of the n-th word w_m^n of the segmented m-th text content d'_m in the corpus D:

T_nm = tf_m^n × idf_m^n   (1)

In formula (1), tf_m^n represents the word frequency of the n-th word w_m^n in the segmented m-th text content d'_m, and idf_m^n represents the inverse document frequency of w_m^n in the corpus D;
3.2, constructing a document word vector of the scientific and technological big data to be evaluated;
combining repeated words among the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating by formula (2) the importance X_pm of the p-th word t_p in the segmented m-th text content d'_m, so as to obtain the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of d'_m and, further, the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents within the time window M, treated as a sparse matrix;
step 4, reducing the dimension of the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean processing on each row of elements in the sparse matrix X to obtain a matrix H;
step 4.2, calculating the covariance matrix C of the matrix H;
step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding orthonormal eigenvectors, and forming a matrix P from the eigenvectors stacked as rows in descending order of their corresponding eigenvalues;
step 4.4, taking the matrix formed by the first k rows of elements of the matrix P and multiplying it by the matrix H to obtain the reduced-dimension matrix Y = (Y_1, Y_2, ..., Y_m, ..., Y_M), where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th document word vector after dimensionality reduction and Y_km represents the k-th dimension value in the m-th reduced document word vector; k < P;
step 5, calculating the cosine similarity cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z within the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, so as to obtain the set of similarities between d_m and the other text contents, {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m};
step 6, sorting the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the top L values with the highest text similarity, taking the L-th similarity value as the innovation value of the m-th text content d_m, and normalizing this innovation value to obtain the innovation score of d_m;
step 7, calculating, by the procedures of step 5 and step 6, the innovation scores of all M text contents within the time window M and arranging them in descending order, thereby completing the innovation evaluation of the text content set of the scientific and technological big data to be evaluated.
Compared with the prior art, the invention has the beneficial effects that:
1. The method acquires and preprocesses the scientific and technological big data text content: records with missing values are eliminated, only the most recent item is retained for duplicated data, and the evaluated data are grouped by year into time windows. Using an existing mature stop-word list, the text content is segmented with the jieba tokenizer, and stop words and function words without substantive meaning are removed. This greatly improves the quality of the text data and the efficiency of subsequent text analysis;
2. The method processes the segmented text data with a TF-IDF model, extracting the keywords of each item of scientific and technological big data under evaluation and computing the TF-IDF values of its segmented words, so that the key topics and information of each item within the time window are captured; document word vectors are then constructed from the segmented words and their TF-IDF values, preparing for the subsequent measurement of inter-text similarity;
3. The method builds, for each time window, a sparse matrix of all the document word vectors and compresses it with principal component analysis (PCA) to reduce the dimensionality of the document word vectors. This alleviates the problem of computing distances between sample points in a high-dimensional space, prepares for the subsequent similarity computation and improves the accuracy of the similarity measurement, so that the innovativeness of all scientific and technological big data in the field within the time window can be described accurately;
4. The similarity between documents is expressed as the cosine similarity between the reduced document word vectors. Finally, following the k-nearest-neighbour (KNN) idea, the smallest of the L highest similarity values between each item and all other items represents its degree of innovation, which helps to evaluate the innovation and value of scientific and technological big data and to screen highly innovative, valuable items more effectively.
Drawings
FIG. 1 is a flow chart of the innovation evaluation based on scientific big data text content of the present invention.
Detailed Description
In this embodiment, the innovation evaluation method based on scientific and technological big data text content extracts the keywords of the evaluated text content with a TF-IDF model; constructs document word vectors from the keywords and their TF-IDF values to form a sparse matrix; reduces the dimensionality of the document word vectors by compressing the sparse matrix with principal component analysis (PCA); expresses the similarity between documents as the cosine similarity between the reduced document word vectors; and finally, following the k-nearest-neighbour (KNN) idea, represents the degree of innovation of each text by the smallest of the L highest similarity values between that text and all other texts. Specifically, as shown in FIG. 1, the method proceeds as follows:
step 1, acquiring the text content set of the scientific and technological big data to be evaluated, performing de-duplication and missing-value-removal preprocessing, and dividing the preprocessed text content by its generation time to obtain time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}, where d_m represents the m-th text content and M represents the number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_m^1, w_m^2, ..., w_m^n, ..., w_m^(N_m)}, where w_m^n represents the n-th word in the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all the segmented words of the M text contents thus form a corpus D;
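A minimal sketch of this step-2 preprocessing (tokenize, drop stop words, merge repeated words) may help fix ideas. The whitespace tokenizer and the stop-word list below are illustrative stand-ins; for Chinese text the embodiment uses the jieba tokenizer (jieba.lcut), as noted in the beneficial-effects section above:

```python
def segment(text, stopwords):
    # Stand-in tokenizer; for Chinese the embodiment would use jieba.lcut(text).
    tokens = text.lower().split()
    seen, kept = set(), []
    for tok in tokens:
        if tok in stopwords or tok in seen:
            continue  # drop stop words and merge (skip) repeated words
        seen.add(tok)
        kept.append(tok)
    return kept

stopwords = {"the", "a", "of", "and"}
print(segment("the method of the big data method", stopwords))
# → ['method', 'big', 'data']
```

Keeping only the first occurrence of each word matches the "repeated-word merging" of step 2, where the position of the first occurrence is taken as authoritative.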
step 3, processing the text content after word segmentation by using a TF-IDF model, extracting key words of the scientific and technological big data to be evaluated and constructing a document word vector;
TF-IDF (Term Frequency-Inverse Document Frequency) is used for evaluating the importance degree of words to texts in a Document set or a corpus, and consists of two parts: TF and IDF.
Step 3.1: the term frequency (TF) is the frequency with which a word occurs in a text sample. Let d_m be a text sample and w_m^n be its n-th word after segmentation, ordered by position (if a word is repeated, the position of its first occurrence is used and the positions of later repetitions are not counted). The word frequency tf_m^n of the word in the text sample is the ratio of the number of occurrences of that word to the total number of occurrences of all words in the text, as in formula (1):

tf_m^n = count(w_m^n in d_m) / (total number of word occurrences in d_m)   (1)
step 3.2: inverse Document Frequency (IDF) is used to evaluate the prevalence of words for a corpus. In this embodiment, the corpus D is all the segmented scientific big data text content data in each time window,IDF value ofCan be represented by the formula (2) and is contained in DNumber of textsAnd the total number of samples N is expressed as:
step 3.3: calculating m text content D 'after word segmentation in corpus D by using formula (3)'mN' th wordTF-IDF value of Tnm:
In the formula (3), the reaction mixture is,representing the mth text content d 'after word segmentation'mMiddle nth wordThe frequency of the words of (a) is,representing the mth text content d 'after word segmentation'mMiddle nth wordInverse document frequency in corpus D;
3.4, constructing a document word vector of the scientific and technological big data to be evaluated;
combining repeated words among all text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating by formula (4) the importance X_pm of the p-th word t_p in the segmented m-th text content d'_m, i.e. its TF-IDF value in d'_m (taken as 0 when t_p does not occur in d'_m), so as to obtain the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of d'_m and, further, the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents within the time window M, treated as a sparse matrix;
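Steps 3.1–3.4 amount to standard TF-IDF weighting over the merged vocabulary D'. The sketch below assumes tf = count/length and idf = log(N/df); the original formula images (1)–(4) are not reproduced in this text, so the exact smoothing is an assumption:

```python
import math

def tfidf_matrix(docs):
    """docs: list of token lists (the segmented texts d'_m).
    Returns (vocab, X) with X[p][m] = X_pm, the weight of word t_p in d'_m."""
    vocab = sorted({t for d in docs for t in d})          # merged corpus D'
    N = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}    # document frequency
    X = [[0.0] * N for _ in vocab]
    for m, d in enumerate(docs):
        for p, t in enumerate(vocab):
            tf = d.count(t) / len(d)                      # word frequency tf_m^n
            idf = math.log(N / df[t])                     # inverse document frequency
            X[p][m] = tf * idf                            # 0.0 when t is absent from d
    return vocab, X

vocab, X = tfidf_matrix([["pca", "tfidf"], ["tfidf", "cosine"]])
```

As the tiny example shows, a word occurring in every document ("tfidf") gets idf = log(1) = 0 and contributes nothing, which is exactly why most entries of X are zero and the result is treated as a sparse matrix.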
step 4, reducing the dimension of the sparse matrix X by using a principal component analysis method;
step 4.1, performing zero-mean processing on each row of elements of the sparse matrix X by formula (5) to obtain the matrix H:

H_pm = X_pm − (1/M) Σ_{m'=1..M} X_pm'   (5)

step 4.2, calculating the covariance matrix C of the matrix H by formula (6):

C = (1/M) · H · H^T   (6)

In formula (6), the covariance matrix C is a real symmetric matrix with P rows and P columns; its diagonal elements are the variances of the corresponding rows of the matrix H, and the element in row i, column j (equal to the element in row j, column i) is the covariance between the i-th and j-th rows of H;
step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding orthonormal eigenvectors, and forming the matrix P from the eigenvectors stacked as rows in descending order of their corresponding eigenvalues, as in formulas (7) and (8).
Let the eigenvalues of C, arranged in descending order, be λ_1 ≥ λ_2 ≥ ... ≥ λ_P. Since C is a real symmetric matrix with P rows and P columns, there exist P orthonormal eigenvectors e_1, e_2, ..., e_P corresponding in order to λ_1, λ_2, ..., λ_P; arranging them by columns gives E = (e_1, e_2, ..., e_P), and for the covariance matrix C the following conclusion holds:

E^T · C · E = diag(λ_1, λ_2, ..., λ_P)   (7)

P = E^T   (8)
step 4.4, taking by formula (9) the matrix formed by the first k rows of elements of the matrix P and multiplying it by the matrix H to obtain the reduced-dimension matrix:

Y = P_(1..k) · H = (Y_1, Y_2, ..., Y_m, ..., Y_M)   (9)

where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th document word vector after dimensionality reduction and Y_km represents the k-th dimension value in the m-th reduced document word vector; k < P;
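Steps 4.1–4.4 are classical PCA via eigendecomposition of the covariance matrix. A numpy sketch, assuming the 1/M covariance convention implied by the zero-mean rows of step 4.1:

```python
import numpy as np

def pca_reduce(X, k):
    """X: P x M matrix, one document word vector per column. Returns the k x M matrix Y."""
    H = X - X.mean(axis=1, keepdims=True)   # step 4.1: zero-mean each row
    C = (H @ H.T) / X.shape[1]              # step 4.2: covariance matrix C (P x P)
    vals, vecs = np.linalg.eigh(C)          # step 4.3: eigenpairs (eigh returns ascending order)
    order = np.argsort(vals)[::-1]          # reorder eigenvectors by descending eigenvalue
    P = vecs[:, order].T                    # rows of P are the sorted eigenvectors
    return P[:k] @ H                        # step 4.4: Y = P_(1..k) · H

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                 # P = 5 vocabulary terms, M = 8 documents
Y = pca_reduce(X, 2)
print(Y.shape)  # → (2, 8)
```

np.linalg.eigh is used rather than np.linalg.eig because C is real symmetric, which guarantees real eigenvalues and orthonormal eigenvectors as the text states.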
step 5, calculating the cosine similarity cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z within the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, so as to obtain the set of similarities between d_m and the other text contents, {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m};
step 5.1, calculating by formula (10) the similarity between different document word vectors in the k-dimensional vector space:

sim<d_m, d_z> = cos<Y_m, Y_z> = ( Σ_{j=1..k} Y_jm · Y_jz ) / ( sqrt(Σ_{j=1..k} Y_jm^2) · sqrt(Σ_{j=1..k} Y_jz^2) )   (10)

In formula (10), Y_jm represents the j-th dimension value in the m-th reduced document word vector and Y_jz represents the j-th dimension value in the z-th reduced document word vector, with m, z = 1, 2, 3, ..., M;
Step 5.2, calculating the cosine similarity between each document word vector and all other document word vectors, forming M sets of M − 1 elements each, namely, in order:

{sim<d_1, d_m> | 1 < m ≤ M}, {sim<d_2, d_m> | 1 ≤ m ≤ M, m ≠ 2}, ..., {sim<d_M, d_m> | 1 ≤ m < M}
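Formula (10) is ordinary cosine similarity between the reduced vectors; a small sketch of step 5, building the M similarity sets:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)                  # formula (10)

def similarity_sets(Y):
    """Y: list of M reduced document vectors.
    Returns the M sets {sim<d_m, d_z> | z != m}, each of size M - 1."""
    M = len(Y)
    return [[cosine(Y[m], Y[z]) for z in range(M) if z != m] for m in range(M)]

Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = similarity_sets(Y)
```

With these three toy vectors, d_1 and d_2 are orthogonal (similarity 0) while d_1 and d_3 share one direction (similarity 1/√2), illustrating how the sets separate near-duplicates from unrelated texts.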
step 6, sorting the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the top L values with the highest text similarity, taking the L-th similarity value as the innovation value of the m-th text content d_m, and normalizing this innovation value to obtain the innovation score of d_m;
step 6.1, sorting the cosine similarity values in each set in descending order and selecting the L values with the highest similarity to the text; the smallest of these similarity values can represent the degree of innovation of the text, and the innovation value of a document d_j (j = 1, 2, 3, ..., M) can be expressed by formula (11):

sim_L(d_j), j = 1, 2, 3, ..., M   (11)

step 6.2, normalizing the computed innovation values of the scientific and technological big data and giving scores on a percentage scale; as in formula (12), the innovation score of the big-data text document d_j (j = 1, 2, 3, ..., M) within the time window M can be expressed as:

Score(d_j) = (1 − sim_L(d_j) / sim_max) × 100   (12)

In formula (12), sim_max is the maximum of sim_L(d_j) over j = 1, 2, 3, ..., M;
and 7, calculating the innovation scores of all the M document word vectors in the time window M according to the processes of the step 5 and the step 6, and performing descending order arrangement, thereby completing the innovation evaluation of the text content set of the scientific and technological big data to be evaluated.
Claims (1)
1. An innovation evaluation method based on scientific and technological big data text content is characterized by comprising the following steps:
step 1, acquiring the text content set of the scientific and technological big data to be evaluated, performing de-duplication and missing-value-removal preprocessing, and dividing the preprocessed text content by its generation time to obtain time-stamped text contents {d_1, d_2, ..., d_m, ..., d_M}, where d_m represents the m-th text content and M represents the number of texts in the text content set;
step 2, performing word segmentation, stop-word removal and repeated-word merging on the m-th text content d_m to obtain the segmented m-th text content d'_m = {w_m^1, w_m^2, ..., w_m^n, ..., w_m^(N_m)}, where w_m^n represents the n-th word in the segmented m-th text content d'_m and N_m represents the total number of distinct words in d'_m; all the segmented words of the M text contents thus form a corpus D;
step 3, processing the segmented text content with a TF-IDF model, extracting the keywords of the scientific and technological big data to be evaluated and constructing document word vectors;
step 3.1, calculating by formula (1) the TF-IDF value T_nm of the n-th word w_m^n of the segmented m-th text content d'_m in the corpus D:

T_nm = tf_m^n × idf_m^n   (1)

In formula (1), tf_m^n represents the word frequency of the n-th word w_m^n in the segmented m-th text content d'_m, and idf_m^n represents the inverse document frequency of w_m^n in the corpus D;
step 3.2, constructing the document word vectors of the scientific and technological big data to be evaluated:
combining repeated words among the text contents in the corpus D to obtain a merged corpus D' = {t_1, t_2, ..., t_p, ..., t_P}, where t_p represents the p-th word and P represents the total number of words in the merged corpus D'; calculating by formula (2) the importance X_pm of the p-th word t_p in the segmented m-th text content d'_m, so as to obtain the word vector X_m = (X_1m, X_2m, ..., X_pm, ..., X_Pm)^T of d'_m and, further, the word vectors X = (X_1, X_2, ..., X_m, ..., X_M) of all documents within the time window M, treated as a sparse matrix;
step 4, reducing the dimension of the sparse matrix X by principal component analysis;
step 4.1, performing zero-mean processing on each row of elements in the sparse matrix X to obtain a matrix H;
step 4.2, calculating the covariance matrix C of the matrix H;
step 4.3, calculating the eigenvalues of the covariance matrix C and the corresponding orthonormal eigenvectors, and forming a matrix P from the eigenvectors stacked as rows in descending order of their corresponding eigenvalues;
step 4.4, taking the matrix formed by the first k rows of elements of the matrix P and multiplying it by the matrix H to obtain the reduced-dimension matrix Y = (Y_1, Y_2, ..., Y_m, ..., Y_M), where Y_m = (Y_1m, Y_2m, ..., Y_km)^T represents the m-th document word vector after dimensionality reduction and Y_km represents the k-th dimension value in the m-th reduced document word vector; k < P;
step 5, calculating the cosine similarity cos<Y_m, Y_z> between the m-th reduced document word vector Y_m and the z-th reduced document word vector Y_z within the time window M, used to represent the similarity sim<d_m, d_z> between the m-th text content d_m and the z-th text content d_z, so as to obtain the set of similarities between d_m and the other text contents, {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m};
step 6, sorting the text similarity set {sim<d_m, d_z> | z = 1, 2, ..., M, and z ≠ m} in descending order, selecting the top L values with the highest text similarity, taking the L-th similarity value as the innovation value of the m-th text content d_m, and normalizing this innovation value to obtain the innovation score of d_m;
step 7, calculating, by the procedures of step 5 and step 6, the innovation scores of all M text contents within the time window M and arranging them in descending order, thereby completing the innovation evaluation of the text content set of the scientific and technological big data to be evaluated.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111489894.1A (granted as CN114154498B) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content
Publications (2)

Publication Number | Publication Date
---|---
CN114154498A (application publication) | 2022-03-08
CN114154498B (grant) | 2024-02-20
Family
ID=80453329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111489894.1A Active CN114154498B (en) | 2021-12-08 | 2021-12-08 | Innovative evaluation method based on science and technology big data text content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114154498B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050143971A1 (en) * | 2003-10-27 | 2005-06-30 | Jill Burstein | Method and system for determining text coherence |
CN109885675A (en) * | 2019-02-25 | 2019-06-14 | 合肥工业大学 | Method is found based on the text sub-topic for improving LDA |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
Non-Patent Citations (1)
Title |
---|
WANG Lijun; YAO Changqing; LIU Zhihui: "A scientific paper evaluation method combining text mining and bibliometrics", Information Science (情报科学), no. 05, 1 May 2019 (2019-05-01) *
Also Published As
Publication number | Publication date |
---|---|
CN114154498B (en) | 2024-02-20 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |