CN114780617A - Technical list generation method and system based on multi-source data and topic model

Info

Publication number: CN114780617A
Application number: CN202210483086.2A
Authority: CN
Inventors: 刘宇飞; 邓凡康; 周源; 杨建中
Original assignee: Strategic Consulting Center Of Chinese Academy Of Engineering; Tsinghua University; Huazhong University of Science and Technology
Current assignee: Strategic Consulting Center Of Chinese Academy Of Engineering; Tsinghua University; Huazhong University of Science and Technology
Priority date: 2022-05-05
Filing date: 2022-05-05
Publication date: 2022-07-22

Abstract

The invention discloses a technical list generation method based on multi-source data and a topic model, comprising the following steps: obtaining a plurality of scientific and technical documents corresponding to a given technical field, processing the documents, and storing the processed documents in a scientific and technical document database; inputting all the documents in the database into a trained support vector machine (SVM) model for technical category division, and storing the division results in the database; merging the documents belonging to the same technical category into a single document, all such documents forming a document library; successively removing stop words and low-frequency words from the document library to obtain an updated document library; and performing topic clustering on the updated document library using the Latent Dirichlet Allocation (LDA) topic modeling algorithm to obtain a word distribution matrix corresponding to all topics. The invention can solve the technical problems of strong subjectivity and high cost in methods based on expert judgment.

Description

Technical list generation method and system based on multi-source data and topic model

Technical Field

The invention belongs to the field of data mining, and particularly relates to a technical list generation method and a system based on multi-source data and a topic model.

Background

Technology foresight is a process of systematically exploring the future of science, technology, the economy and society. It aims to select generic new technologies and strategic research fields that can yield the greatest economic, environmental and social benefits, and to form a list of key technologies. Technology foresight is of great significance for promoting the popularization and application of technology and for accelerating technology industrialization and technology transfer, and it has been widely studied and applied in developed countries such as the United States, Japan and Germany.

Existing technical list generation methods include expert judgment methods and bibliometric methods. Expert judgment methods center on domain experts: an authoritative body usually organizes experts in the relevant fields and gathers their knowledge and insight through multiple rounds of deliberation to form a final technical list; such methods include peer review, the Delphi method, scenario analysis and the like. Bibliometric methods generally take scientific and technical documents such as papers and patents as research objects and explore the hidden patterns of a technology by analyzing the objective information in those documents, so as to infer the future development trend of the technology and form a technical list; such methods mainly include citation analysis and subject term analysis.

However, both kinds of methods have non-negligible technical drawbacks. First, the effectiveness of expert judgment methods depends entirely on the breadth and depth of the experts' knowledge in the relevant field, so the results are strongly subjective; although such methods are highly accurate for a specific technical field, their identification cost is high. Second, bibliometric methods are good at identifying the overall development of a technical field and, being driven by objective document data, alleviate subjective bias to some extent; however, they often rely on simple statistical analysis, their mining depth is insufficient, it is difficult for them to form complete and specific technical descriptions, and the accuracy of the generated technical list is hard to guarantee. Third, bibliometric methods are based on single-dimensional data, so the coverage of the resulting technical list is low.

Disclosure of Invention

In view of the above defects or improvement needs of the prior art, the invention provides a technical list generation method and system based on multi-source data and a topic model. Its purpose is to complete the generation of a technical list for a field quickly and effectively through primary technology identification and secondary technology identification, thereby solving the technical problems of strong subjectivity and high cost in methods based on expert judgment, of accuracy that is hard to guarantee in bibliometric methods, and of low technical list coverage when only single-dimensional document data are used.

To achieve the above object, according to one aspect of the present invention, there is provided a technical list generation method based on multi-source data and a topic model, comprising the following steps:

(1) Obtaining a plurality of scientific and technical documents corresponding to a given technical field, processing the documents, and storing the processed documents in a scientific and technical document database;

(2) Inputting all the scientific and technical documents in the database obtained in step (1) into a trained support vector machine (SVM) model for technical category division, and storing the technical category division results in the database;

(3) Merging the scientific and technical documents belonging to the same technical category in the database processed in step (2) into a single document, all such documents forming a document library;

(4) Successively removing stop words and low-frequency words from the document library obtained in step (3) to obtain an updated document library;

(5) Performing topic clustering on the document library updated in step (4) using the Latent Dirichlet Allocation (LDA) topic modeling algorithm to obtain a word distribution matrix corresponding to all topics in the document library, and generating a primary technical list according to the word distribution matrix;

(6) Constructing a word co-occurrence network for each topic in the primary technical list obtained in step (5), clustering the word co-occurrence network corresponding to the topic using the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyzing the topic word content of all the clusters corresponding to the topic to obtain a secondary technical list corresponding to the topic;

(7) Associating the primary technical list obtained in step (5) with the secondary technical list obtained in step (6) to generate a complete technical list.

Preferably, step (1) comprises the following sub-steps:

(1-1) Formulating a paper search query and a patent search query related to a given technical field, and searching the Web of Science (WOS) citation database and the Derwent Innovations Index (DII) patent database respectively to obtain the scientific and technical documents corresponding to the technical field;

(1-2) For each scientific and technical document obtained in step (1-1), extracting its title and abstract, sequentially performing lowercase conversion and lemmatization on the extracted title and abstract using the Natural Language Toolkit (NLTK), and storing the processed document in the scientific and technical document database.

Preferably, the SVM model is trained as follows:

(2-1) Acquiring the title and abstract of each scientific and technical document processed in step (1), and merging the title and abstract to obtain a description text of the document;

(2-2) Inputting the description text of each scientific and technical document obtained in step (2-1) into a Doc2vec model to obtain a text vector corresponding to the document;

(2-3) Establishing a technical category library, which comprises a plurality of technical categories and a plurality of scientific and technical documents corresponding to each technical category;

(2-4) Constructing a data set from the text vectors obtained in step (2-2) and the technical category library obtained in step (2-3), and dividing the data set into a training set and a test set at a ratio of 8:2;

(2-5) Inputting the training set obtained in step (2-4) into the SVM model to obtain an updated SVM model.
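The 8:2 division in step (2-4) can be illustrated with a minimal sketch. The function name `split_dataset`, the fixed random seed and the toy vectors and category labels are assumptions made for illustration; the source does not state how the split is randomized.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle (text_vector, category) pairs and split them into train and test sets."""
    shuffled = list(samples)
    # A fixed seed is an assumption here; it only makes the sketch reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy data set: ten (text_vector, technical_category) pairs with made-up values.
dataset = [([float(i), float(i % 3)], "category_a" if i % 2 else "category_b")
           for i in range(10)]
train_set, test_set = split_dataset(dataset)
print(len(train_set), len(test_set))
```

With ten samples this yields eight training samples and two test samples; the training part would then be passed to the SVM as in step (2-5).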

Preferably, the weight parameters of the Doc2vec model comprise text vectors and word vectors; during training, the text vector is concatenated with the context word vectors to jointly predict the next word, and the model weight parameters are updated by minimizing the word prediction loss.

During training of the Doc2vec model, the word vector dimension is set to 200, the number of training epochs to 100, and the sampling window width to 8.

The final result obtained from the Doc2vec model is the mapping between each scientific and technical document and its text vector.

Preferably, step (2-5) specifically comprises iteratively setting the hyper-parameters of the SVM model and verifying each iteratively trained SVM model with the test set of the data set obtained in step (2-4), until the classification accuracy is optimal, thereby obtaining the trained SVM model.

Preferably, the stop word removal in step (4) removes the stop words present in the document library; for the specific stop words, reference may be made to the stop word list provided in NLTK.

Low-frequency words are words that appear fewer than 10 times in a document.

Preferably, each row of the word distribution matrix represents a different topic and each column represents a different word; the element in the i-th row and j-th column represents the probability that the j-th word belongs to the i-th topic, where i ∈ [1, total number of topics in the document library] and j ∈ [1, total number of words in the document library].

Generating the primary technical list according to the word distribution matrix in step (5) specifically comprises: first, obtaining all elements in the first row of the word distribution matrix, selecting the elements greater than a preset threshold, obtaining the words corresponding to the selected elements, determining from these words the technical field to which they belong, and establishing a mapping between that technical field and the topic corresponding to the row; then, repeating the above process for the remaining rows of the word distribution matrix to obtain mappings between a plurality of technical fields and their corresponding topics, all of which together constitute the primary technical list.

Preferably, analyzing the topic word content of all the clusters corresponding to a topic in step (6) to obtain the secondary technical list corresponding to the topic specifically comprises: first, taking the elements selected from the first row of the word distribution matrix in step (5), obtaining the words corresponding to those elements, and constructing a word co-occurrence network from the obtained words, in which the weight of the edge between two words is increased by 1 each time the two words appear together in the same scientific and technical document; then, inputting the word co-occurrence network into the Louvain algorithm to obtain a plurality of clusters, selecting the clusters whose number of words is greater than a preset threshold, determining the technical field to which each selected cluster belongs, and establishing a mapping between that technical field and the corresponding cluster; and finally, repeating the above process for the remaining rows of the word distribution matrix obtained in step (5) to obtain mappings between a plurality of technical fields and the topics to which they belong, all of which together constitute the secondary technical list.

According to another aspect of the present invention, there is provided a technical list generation system based on multi-source data and a topic model, comprising:

a first module, configured to obtain a plurality of scientific and technical documents corresponding to a given technical field, process the documents, and store the processed documents in a scientific and technical document database;

a second module, configured to input all the scientific and technical documents in the database obtained by the first module into a trained support vector machine (SVM) model for technical category division, and store the technical category division results in the database;

a third module, configured to merge the scientific and technical documents belonging to the same technical category in the database processed by the second module into a single document, all such documents forming a document library;

a fourth module, configured to successively remove stop words and low-frequency words from the document library obtained by the third module to obtain an updated document library;

a fifth module, configured to perform topic clustering on the document library updated by the fourth module using the Latent Dirichlet Allocation (LDA) topic modeling algorithm to obtain a word distribution matrix corresponding to all topics in the document library, and generate a primary technical list according to the word distribution matrix;

a sixth module, configured to construct a word co-occurrence network for each topic in the primary technical list obtained by the fifth module, cluster the word co-occurrence network corresponding to the topic using the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyze the topic word content of all the clusters corresponding to the topic to obtain a secondary technical list corresponding to the topic.

In general, compared with the prior art, the above technical solution conceived by the present invention can achieve the following beneficial effects:

(1) Because step (1) is adopted, objective scientific and technical documents drive the process, which ensures the objectivity of technical list generation and thereby solves the technical problems of strong subjectivity and high cost in existing qualitative methods;

(2) Because step (1) combines paper and patent data, the data dimensions are expanded and the technology identification results are more comprehensive, which solves the technical problem of low technical list coverage;

(3) Because step (5) is adopted, the topic modeling algorithm takes document semantics into account and can automatically mine the latent topics in the documents; compared with simple statistical methods it captures deeper semantic information, and it can to some extent solve the technical problem of insufficient mining depth in bibliometric methods;

(4) Because steps (2) and (5) are adopted, expert prior knowledge is embedded into the topic modeling algorithm, which can improve the accuracy of the topic modeling algorithm and solves the technical problem of insufficient accuracy in bibliometrics-based technical list generation methods.

Drawings

FIG. 1 is a flow diagram of the technical list generation method based on multi-source data and a topic model according to the present invention;

FIG. 2 is an architecture diagram of the Doc2vec model of the present invention;

FIG. 3 is a schematic representation of the word distribution matrix generated in step (5) of the method of the present invention;

FIG. 4 is a schematic representation of the plurality of clusters generated in step (6) of the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The basic idea of the invention is to combine multi-source data with a topic modeling algorithm in which expert knowledge is embedded, so that the accuracy of technical list generation is improved while its cost is reduced.

As shown in FIG. 1, the invention provides a technical list generation method based on multi-source data and a topic model, which comprises the following steps:

The advantage of step (1) is that combining paper and patent data expands the data dimensions, so that the generated technical list is more comprehensive.

Step (1) comprises the following sub-steps:

(1-1) Formulating a paper search query and a patent search query related to a given technical field, and searching the Web of Science (WOS) database and the Derwent Innovations Index (DII) database respectively to obtain the scientific and technical documents corresponding to the technical field;

Specifically, the search queries are formulated by experts in the technical field.

(1-2) For each scientific and technical document obtained in step (1-1), extracting its title and abstract, sequentially performing lowercase conversion and lemmatization on the extracted title and abstract using the Natural Language Toolkit (NLTK), and storing the processed document in the scientific and technical document database;

Specifically, the lemmatization in this step converts each word into its base form.
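The preprocessing of sub-step (1-2) can be sketched as follows. The source performs lemmatization with NLTK; to keep the sketch self-contained, a small hand-written lookup table (`TOY_LEMMAS`, hypothetical) stands in for a real lemmatizer, and the record fields and tokenization rule are likewise assumptions.

```python
import re

# Hypothetical stand-in for a real lemmatizer such as the one in NLTK;
# it only knows the handful of word forms listed here.
TOY_LEMMAS = {"machines": "machine", "tools": "tool", "cutting": "cut", "studies": "study"}

def preprocess(text, lemmas=TOY_LEMMAS):
    """Lowercase the text, split it into word tokens, and map each token to its base form."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lemmas.get(tok, tok) for tok in tokens]

# One made-up record holding the extracted title and abstract.
record = {"title": "Cutting Tools for CNC Machines", "abstract": "Studies of tool wear."}
processed = {field: preprocess(value) for field, value in record.items()}
print(processed["title"])
```

In practice the lookup table would be replaced by an actual lemmatizer, with the processed title and abstract written back to the document database.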

(2) Inputting all the scientific and technical documents in the database obtained in step (1) into a trained Support Vector Machine (SVM) model for technical category division, and storing the technical category division results in the database.

The advantage of this step is that, by drawing on expert knowledge and dividing the scientific and technical documents into technical categories with a machine learning algorithm, expert knowledge is embedded into the subsequent LDA topic model, which can improve the accuracy of the LDA topic model.

The SVM model is trained as follows:

The Doc2vec model used in this training is an unsupervised text vector representation model whose architecture is shown in FIG. 2. Its weight parameters comprise text vectors and word vectors; during training, the text vector is concatenated with the context word vectors to jointly predict the next word, and the weight parameters are updated by minimizing the word prediction loss. Specifically, when a text is trained, its text vector and the sampled context word vectors form one training sample; after repeated sampling with a sliding window, the text vector captures the topic expressed by the text more and more accurately, and once training is finished each text has a fixed-length vector representation that takes both semantics and word order into account.

During training of the Doc2vec model, the word vector dimension is set to 200, the number of training epochs to 100, and the sampling window width to 8.
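The training objective described above can be written compactly. The following is only a sketch, assuming the usual softmax formulation of next-word prediction; the symbols $v_d$ (text vector), $v_w$ (word vector), $k$ (number of context words), $U$ and $b$ (output layer parameters) are introduced here for illustration and do not appear in the source.

```latex
h_t = \left[\, v_d \,;\, v_{w_{t-k}} \,;\, \dots \,;\, v_{w_{t-1}} \,\right],
\qquad
p\left(w_t \mid d, w_{t-k}, \dots, w_{t-1}\right) = \operatorname{softmax}\left(U h_t + b\right)_{w_t},
\qquad
\mathcal{L} = -\sum_{d} \sum_{t} \log p\left(w_t \mid d, w_{t-k}, \dots, w_{t-1}\right)
```

Minimizing $\mathcal{L}$ updates both the word vectors and the text vectors; how $k$ relates to the stated window width of 8 depends on the implementation used.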

The advantage of using the Doc2vec model is that text semantics are fully taken into account and the text vectors of the scientific and technical documents are obtained automatically, which avoids complex text processing and index definition procedures.

Specifically, the technical category library is established by experts in the technical field mentioned in step (1) on the basis of their accumulated expert knowledge, and when establishing it the number of papers and patents corresponding to each technical category should be kept roughly the same.

The advantage of this step is that, by using a machine learning algorithm, a large amount of scientific and technical document data can be classified automatically at only a small expert labeling cost.

Specifically, the hyper-parameters of the SVM model are set iteratively, and each iteratively trained SVM model is verified with the test set of the data set obtained in step (2-4) until the classification accuracy is optimal, yielding the trained SVM model.
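The iterative hyper-parameter setting described above amounts to a search loop that keeps the best-scoring configuration. The sketch below is generic: `select_best`, `fit` and `score` are hypothetical names, and a one-parameter threshold rule stands in for the SVM so that the example runs without external libraries. The source does not say which hyper-parameters are searched or in what order.

```python
from itertools import product

def select_best(grid, fit, score):
    """Try every hyper-parameter combination and keep the one with the highest score."""
    best_params, best_model, best_score = None, None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit(params)
        current = score(model)
        if current > best_score:
            best_params, best_model, best_score = params, model, current
    return best_params, best_model, best_score

# Toy stand-ins (assumption): a threshold rule replaces the SVM, and the
# test set holds four made-up (feature, label) pairs.
test_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def fit(params):
    return lambda x: int(x > params["threshold"])

def score(model):
    return sum(model(x) == label for x, label in test_set) / len(test_set)

best_params, _, best_accuracy = select_best({"threshold": [0.2, 0.5, 0.8]}, fit, score)
print(best_params, best_accuracy)
```

With a real SVM, `fit` would train on the training set with the given hyper-parameters and `score` would return the classification accuracy on the test set.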

(3) Merging the scientific and technical documents belonging to the same technical category in the database processed in step (2) into a single document, all such documents forming a document library;

Specifically, the number of documents in the document library in this step equals the number of technical categories.
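Step (3) is a group-and-concatenate operation, sketched below. The record layout (`text` and `category` keys) and the sample values are assumptions for illustration.

```python
from collections import defaultdict

def merge_by_category(records):
    """Concatenate the description texts of all records that share a technical category."""
    merged = defaultdict(list)
    for record in records:
        merged[record["category"]].append(record["text"])
    return {category: " ".join(texts) for category, texts in merged.items()}

# Made-up records as they might look after the category division of step (2).
records = [
    {"text": "spindle thermal error compensation", "category": "machine tools"},
    {"text": "servo drive control", "category": "motion control"},
    {"text": "tool wear monitoring", "category": "machine tools"},
]
document_library = merge_by_category(records)
print(len(document_library))
```

The resulting library has one merged document per technical category, which matches the statement that the number of documents equals the number of categories.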

(4) Successively removing stop words and low-frequency words from the document library obtained in step (3) to obtain an updated document library;

Specifically, stop word removal removes the stop words present in the document library (reference may be made to the stop word list provided in NLTK), and low-frequency words are words that appear fewer than 10 times in a document.
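The two filtering passes of step (4) can be sketched as follows. Two assumptions are made and should be checked against the actual implementation: the tiny `STOP_WORDS` set stands in for the NLTK stop word list, and word frequency is counted across the whole document library rather than within each merged document.

```python
from collections import Counter

# Tiny illustrative stop word set; the source refers to the list shipped with NLTK.
STOP_WORDS = {"the", "of", "and", "a", "for", "in"}

def clean_library(library, stop_words=STOP_WORDS, min_count=10):
    """Drop stop words first, then drop words whose total count is below min_count."""
    without_stops = {name: [w for w in words if w not in stop_words]
                     for name, words in library.items()}
    # Assumption: frequencies are counted over the whole library.
    counts = Counter(w for words in without_stops.values() for w in words)
    return {name: [w for w in words if counts[w] >= min_count]
            for name, words in without_stops.items()}

# Made-up token lists for two merged documents.
library = {
    "machine tools": ["the", "spindle"] * 6 + ["chatter"],
    "motion control": ["servo"] * 12 + ["of", "spindle"] * 4,
}
cleaned = clean_library(library)
print({name: sorted(set(words)) for name, words in cleaned.items()})
```

Here "spindle" occurs exactly 10 times in total and is kept, while "chatter" occurs once and is removed.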

(5) Performing topic clustering on the document library updated in step (4) using the Latent Dirichlet Allocation (LDA) algorithm to obtain a word distribution matrix corresponding to all topics in the document library, and generating a primary technical list according to the word distribution matrix.

The advantage of this step is that the topic modeling algorithm automatically mines the latent topics in the document library and obtains deeper semantic information.

The word distribution matrix obtained after LDA processing in this step is shown in FIG. 3. Each row of the word distribution matrix represents a different topic and each column represents a different word; the element in the i-th row and j-th column represents the probability that the j-th word belongs to the i-th topic, where i ∈ [1, total number of topics in the document library] and j ∈ [1, total number of words in the document library].

Generating the primary technical list according to the word distribution matrix specifically comprises: first, obtaining all elements in the first row of the word distribution matrix, selecting the elements greater than a preset threshold, obtaining the words corresponding to the selected elements, determining from these words the technical field to which they belong, and establishing a mapping between that technical field and the topic corresponding to the row; then, repeating the above process for the remaining rows of the word distribution matrix to obtain mappings between a plurality of technical fields and their corresponding topics, all of which together constitute the primary technical list.

Specifically, the preset threshold in this step is the mean of all elements in the row.
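The row-wise selection with a mean threshold can be sketched in a few lines. The matrix values, vocabulary and topic labels below are made up; in the source, deciding which technical field a set of words belongs to is a judgement step, so the labels are supplied by hand here.

```python
def select_topic_words(matrix, vocabulary):
    """For each topic row, keep the words whose probability exceeds that row's mean."""
    selected = []
    for row in matrix:
        threshold = sum(row) / len(row)
        selected.append([word for word, p in zip(vocabulary, row) if p > threshold])
    return selected

vocabulary = ["spindle", "servo", "bearing", "encoder"]
word_distribution = [
    [0.50, 0.05, 0.40, 0.05],  # topic 1
    [0.05, 0.45, 0.05, 0.45],  # topic 2
]
topic_words = select_topic_words(word_distribution, vocabulary)
# Hypothetical field names assigned after inspecting the selected words.
labels = ["spindle systems", "servo feedback"]
primary_list = dict(zip(labels, topic_words))
print(primary_list)
```

Each row mean is 0.25 in this toy matrix, so two words survive per topic and the primary list maps each named field to its topic words.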

(6) Constructing a word co-occurrence network for each topic in the primary technical list obtained in step (5), clustering the word co-occurrence network corresponding to the topic using the Louvain algorithm to obtain a plurality of clusters corresponding to the topic (as shown in FIG. 4), and analyzing the topic word content of all the clusters corresponding to the topic to obtain a secondary technical list corresponding to the topic.

In this step, the word co-occurrence network reveals the relations among the topic words, and the Louvain algorithm can quickly and effectively mine the association relations in the network; clustering the word co-occurrence network with the Louvain algorithm therefore makes it possible to summarize the secondary technologies under each topic.

Analyzing the topic word content of all the clusters corresponding to a topic to obtain the secondary technical list corresponding to the topic specifically comprises: first, taking the elements selected from the first row of the word distribution matrix in step (5), obtaining the words corresponding to those elements, and constructing a word co-occurrence network from the obtained words, in which the weight of the edge between two words is increased by 1 each time the two words appear together in the same scientific and technical document; then, inputting the word co-occurrence network into the Louvain algorithm to obtain a plurality of clusters, selecting the clusters whose number of words is greater than a preset threshold, determining the technical field to which each selected cluster belongs, and establishing a mapping between that technical field and the corresponding cluster; and finally, repeating the above process for the remaining rows of the word distribution matrix obtained in step (5) to obtain mappings between a plurality of technical fields and the topics to which they belong, all of which together constitute the secondary technical list.

Specifically, the preset threshold in this step is 50.
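The co-occurrence weighting and the cluster size filter can be sketched as below. The Louvain clustering itself is not implemented; the `clusters` value is a hypothetical output of whichever Louvain implementation is used, and the function names and sample documents are assumptions.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(documents, topic_words):
    """Count, for each pair of topic words, the number of documents containing both."""
    keep = set(topic_words)
    weights = Counter()
    for tokens in documents:
        present = sorted(keep.intersection(tokens))
        for pair in combinations(present, 2):
            weights[pair] += 1  # one more document in which the pair co-occurs
    return weights

def filter_clusters(clusters, min_size):
    """Keep only clusters that contain more than min_size words."""
    return [cluster for cluster in clusters if len(cluster) > min_size]

# Made-up tokenized documents and topic words.
documents = [
    ["spindle", "bearing", "thermal"],
    ["spindle", "bearing"],
    ["thermal", "servo"],
]
weights = build_cooccurrence(documents, ["spindle", "bearing", "thermal"])
print(weights[("bearing", "spindle")])

clusters = [{"spindle", "bearing", "thermal"}, {"servo"}]  # hypothetical Louvain output
print(filter_clusters(clusters, min_size=2))
```

The weighted pairs form the edge list that would be handed to the Louvain algorithm; the source uses a size threshold of 50, whereas the toy call uses 2 so that the small example produces a result.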

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A technical list generation method based on multi-source data and a topic model is characterized by comprising the following steps:

(2) Inputting all the scientific and technical documents in the database obtained in step (1) into a trained support vector machine (SVM) model for technical category division, and storing the technical category division results in the database;

(5) Performing topic clustering on the document library updated in step (4) using the Latent Dirichlet Allocation (LDA) topic modeling algorithm to obtain a word distribution matrix corresponding to all topics in the document library, and generating a primary technical list according to the word distribution matrix;

(6) Constructing a word co-occurrence network for each topic in the primary technical list obtained in step (5), clustering the word co-occurrence network corresponding to the topic using the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyzing the topic word content of all the clusters corresponding to the topic to obtain a secondary technical list corresponding to the topic.

2. The method for generating a technical list based on multi-source data and a topic model according to claim 1, wherein step (1) comprises the following sub-steps:

(1-1) formulating a paper search query and a patent search query related to a given technical field, and searching the Web of Science (WOS) citation database and the Derwent Innovations Index (DII) patent database respectively to obtain the scientific and technical documents corresponding to the technical field;

(1-2) for each scientific and technical document obtained in step (1-1), extracting its title and abstract, sequentially performing lowercase conversion and lemmatization on the extracted title and abstract using the Natural Language Toolkit (NLTK), and storing the processed document in the scientific and technical document database.

3. The method for generating a technical list based on multi-source data and a topic model according to claim 1 or 2, wherein the SVM model is trained as follows:

(2-1) acquiring the title and abstract of each scientific and technical document processed in step (1), and merging the title and abstract to obtain a description text of the document;

(2-2) inputting the description text of each scientific and technical document obtained in step (2-1) into a Doc2vec model to obtain a text vector corresponding to the document;

(2-4) constructing a data set from the text vectors obtained in step (2-2) and the technical category library obtained in step (2-3), and dividing the data set into a training set and a test set at a ratio of 8:2;

(2-5) inputting the training set obtained in step (2-4) into the SVM model to obtain an updated SVM model.

4. The method for generating a technical list based on multi-source data and a topic model according to any one of claims 1 to 3, characterized in that

the weight parameters of the Doc2vec model comprise text vectors and word vectors; during training, the text vector is concatenated with the context word vectors to jointly predict the next word, and the model weight parameters are updated by minimizing the word prediction loss;

5. The method for generating a technical list based on multi-source data and a topic model according to any one of claims 1 to 3, characterized in that step (2-5) specifically comprises iteratively setting the hyper-parameters of the SVM model and verifying each resulting trained SVM model on the test set of the data set obtained in step (2-4), until the classification accuracy reaches its optimum, thereby obtaining the trained SVM model.

6. The method for generating a technical list based on multi-source data and a topic model according to claim 1, characterized in that

the stop word removal in step (4) removes the stop words present in the document library, for which the stop word list provided in NLTK may be consulted;
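
A minimal sketch of the stop word removal is given below. The miniature `STOP_WORDS` set, the whitespace tokenization, and the sample documents are assumptions for illustration; the claim refers to the NLTK stop word list, which is considerably larger.

```python
# Hypothetical miniature stop-word list standing in for the NLTK list.
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "is", "to"}


def remove_stop_words(document, stop_words=STOP_WORDS):
    """Return the whitespace-separated tokens of `document` that are not stop words."""
    return [token for token in document.lower().split() if token not in stop_words]


document_library = [
    "The control of a numerical machine tool",
    "An algorithm for tool path planning in machining",
]
cleaned_library = [remove_stop_words(doc) for doc in document_library]
print(cleaned_library)
# [['control', 'numerical', 'machine', 'tool'],
#  ['algorithm', 'tool', 'path', 'planning', 'machining']]
```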

7. The method for generating a technical list based on multi-source data and a topic model according to claim 1, characterized in that

each row of the word distribution matrix represents a different topic and each column represents a different word; the element in the i-th row and j-th column of the matrix represents the probability that the j-th word belongs to the i-th topic, where i ∈ [1, total number of topics in the document library] and j ∈ [1, total number of words in the document library].

Generating the primary technical list from the word distribution matrix in step (5) specifically comprises: first obtaining all elements in the first row of the word distribution matrix, selecting the elements larger than a preset threshold, obtaining the words corresponding to those elements, determining from those words the technical field to which they belong, and establishing a mapping between that technical field and the topic corresponding to the row; then repeating the above process for the remaining rows of the word distribution matrix to obtain mappings between a plurality of technical fields and their corresponding topics, all of which constitute the primary technical list.
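
The row-wise threshold selection can be sketched as follows. The function name `top_words_per_topic`, the toy vocabulary, the matrix values, and the threshold of 0.18 are all assumptions for illustration. Naming the technical field from the selected words is not automated here, since the claim does not say how that determination is made.

```python
def top_words_per_topic(word_distribution, vocabulary, threshold):
    """For each topic row, collect the words whose probability exceeds `threshold`."""
    selected = {}
    for topic_index, row in enumerate(word_distribution, start=1):
        selected[topic_index] = [
            vocabulary[j]
            for j, probability in enumerate(row)
            if probability > threshold
        ]
    return selected


# Toy word distribution matrix: 2 topics (rows) x 5 words (columns).
vocabulary = ["spindle", "servo", "sensor", "cloud", "network"]
word_distribution = [
    [0.40, 0.35, 0.15, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.40, 0.30],
]
topic_words = top_words_per_topic(word_distribution, vocabulary, threshold=0.18)
print(topic_words)
# {1: ['spindle', 'servo'], 2: ['sensor', 'cloud', 'network']}
```

Each entry of the result would then be labelled with a technical field to form one mapping of the primary technical list.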

8. The method for generating a technical list based on multi-source data and a topic model according to claim 1, wherein in step (6), analyzing the topic word content in all clusters corresponding to a topic to obtain the secondary technical list corresponding to that topic specifically comprises: first obtaining the elements selected in the first row of the word distribution matrix in step (5), obtaining the words corresponding to those elements, and constructing a word co-occurrence network from the obtained words, in which the weight of the edge between two words is increased by 1 each time the two words appear together in the same scientific and technical document; then inputting the word co-occurrence network into the Louvain algorithm to obtain a plurality of clusters, selecting the clusters whose number of words is larger than a preset threshold, determining the technical field to which each selected cluster belongs, and establishing a mapping between that technical field and the corresponding cluster; and finally repeating the above process for the remaining rows of the word distribution matrix obtained in step (5) to obtain mappings between a plurality of technical fields and the topics to which they belong, all of which constitute the secondary technical list.
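
The construction of the word co-occurrence network can be sketched as follows. The function name `build_cooccurrence`, the representation of edges as sorted word pairs, and the toy documents are assumptions for illustration; the sketch covers only the weighting rule and not the Louvain clustering itself.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(selected_words, documents):
    """Count, for each pair of selected words, the documents that contain both."""
    vocabulary = set(selected_words)
    weights = Counter()
    for tokens in documents:
        present = sorted(vocabulary.intersection(tokens))
        for pair in combinations(present, 2):
            weights[pair] += 1  # both words appear in the same document
    return weights


selected_words = ["spindle", "servo", "bearing"]
documents = [
    ["spindle", "servo", "design"],
    ["spindle", "bearing", "servo"],
    ["bearing", "wear"],
]
edge_weights = build_cooccurrence(selected_words, documents)
print(edge_weights[("servo", "spindle")])  # 2
```

The resulting weighted edges could then be handed to a Louvain implementation, for example `louvain_communities` in NetworkX, to obtain the clusters described in the claim.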

9. A technical list generation system based on multi-source data and a topic model, characterized by comprising:

a first module, configured to acquire a plurality of scientific and technical documents corresponding to a certain technical field, perform data processing on the scientific and technical documents, and store the processed scientific and technical documents in a scientific and technical literature database.