CN114780617A - Technical list generation method and system based on multi-source data and topic model

Publication number
CN114780617A
Authority
CN
China
Prior art keywords
technical
scientific
word
document
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210483086.2A
Other languages
Chinese (zh)
Inventor
刘宇飞
邓凡康
周源
杨建中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Strategic Consulting Center Of Chinese Academy Of Engineering
Tsinghua University
Huazhong University of Science and Technology
Original Assignee
Strategic Consulting Center Of Chinese Academy Of Engineering
Tsinghua University
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Strategic Consulting Center Of Chinese Academy Of Engineering, Tsinghua University, Huazhong University of Science and Technology

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26: Visual data mining; Browsing structured data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The invention discloses a technical list generation method based on multi-source data and a topic model. The method comprises the following steps: obtaining a plurality of scientific and technical documents corresponding to a given technical field, performing data processing on them, and storing the processed documents in a scientific and technical document database; inputting all the documents in the database into a trained support vector machine (SVM) model for technical category division, and storing the division results in the database; merging the documents that belong to the same technical category into one document, all such merged documents forming a document library; successively removing stop words and low-frequency words from the document library to obtain an updated document library; and performing topic clustering on the updated document library with the topic modeling algorithm LDA to obtain a word distribution matrix corresponding to all topics. The invention addresses the strong subjectivity and high cost of methods based on expert judgment.

Description

Technical list generation method and system based on multi-source data and topic model
Technical Field
The invention belongs to the field of data mining, and particularly relates to a technical list generation method and a system based on multi-source data and a topic model.
Background
Technology forecasting is a process of systematically exploring the future of science, technology, economy and society. It aims to select generic new technologies and strategic research fields that can generate the greatest economic, environmental and social benefits, and to form a list of key technologies. Technology forecasting is of great significance for promoting the popularization and application of technology and for accelerating technology industrialization and technology transfer, and it has been widely researched and applied in developed countries such as the United States, Japan and Germany.
Existing technical list generation methods include expert judgment methods and bibliometric methods. Expert judgment methods are centered on domain experts: an authority usually convenes experts in the relevant fields and gathers their knowledge and insight through multiple rounds of deliberation to form a final technical list. Such methods include peer review, the Delphi method, scenario analysis and the like. Bibliometric methods generally take scientific and technical documents such as papers and patents as research objects, and explore the hidden patterns of a technology by analyzing objective information in those documents, in order to infer the future development trend of the technology and form a technical list. They mainly include citation analysis and subject-term analysis.
However, both kinds of methods have non-negligible technical drawbacks. First, the effectiveness of the expert judgment method depends entirely on the breadth and depth of the experts' knowledge of the relevant field, so the method is strongly subjective; although it is highly accurate for a specific technical field, its identification cost is high. Second, the bibliometric method is good at identifying the overall development of a technical field and, being driven by objective document data, alleviates subjective bias to some extent; however, it often relies on simple statistical analysis, its mining depth is insufficient, it struggles to form complete and specific technical descriptions, and the accuracy of the generated technical list is difficult to guarantee. Third, the bibliometric method relies on single-dimensional data, so the resulting technical list has low coverage.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the invention provides a technical list generation method and system based on multi-source data and a topic model. The aim is to generate the technical list of a field quickly and effectively through primary technology identification and secondary technology identification, thereby solving the strong subjectivity and high cost of methods based on expert judgment, the difficulty of guaranteeing accuracy in bibliometric methods, and the low coverage of technical lists based on single-dimensional document data.
To achieve the above object, according to one aspect of the present invention, there is provided a technical list generation method based on multi-source data and a topic model, comprising the following steps:
(1) Obtaining a plurality of scientific and technical documents corresponding to a given technical field, performing data processing on them, and storing the processed documents in a scientific and technical document database;
(2) Inputting all the scientific and technical documents in the database obtained in step (1) into a trained support vector machine (SVM) model for technical category division, and storing the division results in the database;
(3) Merging the scientific and technical documents that belong to the same technical category in the database processed in step (2) into one document, all such documents forming a document library;
(4) Successively removing stop words and low-frequency words from the document library obtained in step (3) to obtain an updated document library;
(5) Performing topic clustering on the document library updated in step (4) using the topic modeling algorithm LDA to obtain a word distribution matrix corresponding to all topics in the document library, and generating a primary technical list from the word distribution matrix;
(6) Constructing a word co-occurrence network for each topic in the primary technical list obtained in step (5), clustering that network with the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyzing the subject-word content of all the clusters corresponding to the topic to obtain the secondary technical list corresponding to the topic;
(7) Associating the primary technical list obtained in step (5) with the secondary technical list obtained in step (6) to generate a complete technical list.
Preferably, step (1) comprises the following sub-steps:
(1-1) formulating a paper search formula and a patent search formula related to a given technical field, and searching the Web of Science (WOS) citation database and the Derwent Innovations Index (DII) patent database respectively to obtain the scientific and technical documents corresponding to the technical field;
(1-2) for each scientific and technical document obtained in step (1-1), extracting its title and abstract, sequentially applying lowercase conversion and lemmatization to the extracted title and abstract using the natural language processing toolkit NLTK, and storing the processed document in the scientific and technical document database.
Preferably, the SVM model is trained as follows:
(2-1) obtaining the title and abstract of each scientific and technical document of the technical field as processed in step (1), and merging the title and abstract to obtain a description text of the document;
(2-2) inputting the description text of each scientific and technical document obtained in step (2-1) into a Doc2vec model to obtain the text vector corresponding to the document;
(2-3) establishing a technical category library comprising a plurality of technical categories and a plurality of scientific and technical documents corresponding to each category;
(2-4) constructing a data set from the text vectors obtained in step (2-2) and the technical category library obtained in step (2-3), and dividing the data set into a training set and a test set at a ratio of 8:2;
(2-5) inputting the training set obtained in step (2-4) into the SVM model to obtain an updated SVM model.
Preferably, the weight parameters of the Doc2vec model comprise text vectors and word vectors; during training, the model concatenates the text vector with the context word vectors to jointly predict the next word, and updates its weight parameters by minimizing the word prediction loss;
in training the Doc2vec model, the word vector dimension is set to 200, the number of training epochs to 100, and the sampling window width to 8;
the final result of the Doc2vec model is a mapping between each scientific and technical document and its text vector.
Preferably, step (2-5) specifically comprises iteratively setting the hyper-parameters of the SVM model and iteratively validating the trained SVM model on the test set obtained in step (2-4) until the classification accuracy is optimal, thereby obtaining the trained SVM model.
Preferably, stop word removal in step (4) removes the stop words present in the document library, where the stop words may be taken from the stop word list provided in NLTK;
low-frequency words are words that appear fewer than 10 times in a document.
Preferably, each row of the word distribution matrix represents a different topic and each column a different word, and the element in the ith row and jth column represents the probability that the jth word belongs to the ith topic, where i ∈ [1, total number of topics in the document library] and j ∈ [1, total number of words in the document library].
Generating the primary technical list from the word distribution matrix in step (5) specifically comprises: first obtaining all the elements in the first row of the word distribution matrix, selecting those greater than a preset threshold, obtaining the words corresponding to the selected elements, determining from these words the technical field to which they belong, and establishing a mapping between that technical field and the topic corresponding to the row; then repeating this process for the remaining rows of the word distribution matrix to obtain mappings between a plurality of technical fields and their corresponding topics, all of which together constitute the primary technical list.
Preferably, analyzing the subject-word content of all the clusters corresponding to a topic in step (6) to obtain the secondary technical list specifically comprises: first obtaining the elements selected from the first row of the word distribution matrix in step (5), obtaining the corresponding words, and constructing a word co-occurrence network from these words, in which the weight of the edge between two words is increased by 1 each time the two words appear together in the same scientific and technical document; then inputting the word co-occurrence network into the Louvain algorithm to obtain a plurality of clusters, selecting the clusters whose number of words exceeds a preset threshold, determining the technical field to which each selected cluster belongs, and establishing a mapping between the technical field and the corresponding cluster; and finally repeating this process for the remaining rows of the word distribution matrix obtained in step (5) to obtain mappings between a plurality of technical fields and the topics to which they belong, all of which together constitute the secondary technical list.
According to another aspect of the present invention, there is provided a technical list generation system based on multi-source data and a topic model, comprising:
a first module, configured to obtain a plurality of scientific and technical documents corresponding to a given technical field, perform data processing on them, and store the processed documents in a scientific and technical document database;
a second module, configured to input all the scientific and technical documents in the database obtained by the first module into a trained support vector machine (SVM) model for technical category division, and to store the division results in the database;
a third module, configured to merge the scientific and technical documents that belong to the same technical category in the database processed by the second module into one document, all such documents forming a document library;
a fourth module, configured to successively remove stop words and low-frequency words from the document library obtained by the third module to obtain an updated document library;
a fifth module, configured to perform topic clustering on the document library updated by the fourth module using the topic modeling algorithm LDA to obtain a word distribution matrix corresponding to all topics in the document library, and to generate a primary technical list from the word distribution matrix;
a sixth module, configured to construct a word co-occurrence network for each topic in the primary technical list obtained by the fifth module, cluster that network with the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyze the subject-word content of all the clusters corresponding to the topic to obtain the secondary technical list corresponding to the topic.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) Because step (1) is data-driven and relies on objective scientific and technical documents, the objectivity of technical list generation is ensured, which solves the strong subjectivity and high cost of existing qualitative methods;
(2) Because step (1) combines paper data and patent data, the data dimension is expanded and the technology identification result is more comprehensive, which solves the low coverage of the technical list;
(3) Because step (5) uses a topic modeling algorithm that takes document semantics into account, latent topics in the documents can be mined automatically; compared with simple statistical measurement, this captures deeper semantic information and, to some extent, solves the insufficient mining depth of bibliometric methods;
(4) Because steps (2) and (5) embed expert prior knowledge into the topic modeling algorithm, the accuracy of the topic modeling algorithm can be improved, which solves the insufficient accuracy of technical list generation methods based on bibliometrics.
Drawings
FIG. 1 is a flow diagram of a method for generating a technical manifest based on multi-source data and a topic model in accordance with the present invention;
FIG. 2 is an architecture diagram of the Doc2vec model of the present invention;
FIG. 3 is a schematic representation of the word distribution matrix generated in step (5) of the method of the present invention;
FIG. 4 is a schematic representation of the plurality of clusters generated in step (6) of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the invention is to combine multi-source data with a topic modeling algorithm in which expert knowledge is embedded, so that the accuracy of technical list generation is improved while its cost is reduced.
As shown in FIG. 1, the invention provides a technical list generation method based on multi-source data and a topic model, comprising the following steps:
(1) Obtaining a plurality of scientific and technical documents corresponding to a given technical field, performing data processing on them, and storing the processed documents in a scientific and technical document database.
The advantage of this step is that combining paper data and patent data expands the data dimension and makes the generated technical list more comprehensive.
This step comprises the following sub-steps:
(1-1) formulating a paper search formula and a patent search formula related to a given technical field, and searching the Web of Science (WOS) database and the Derwent Innovations Index (DII) database respectively to obtain the scientific and technical documents corresponding to the technical field;
Specifically, the search formulas are formulated by experts in the technical field.
(1-2) for each scientific and technical document obtained in step (1-1), extracting its title and abstract, sequentially applying lowercase conversion and lemmatization to the extracted title and abstract using the Natural Language Toolkit (NLTK), and storing the processed document in the scientific and technical document database;
Specifically, the lemmatization in this step converts each word into its base form.
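The lowercase conversion and lemmatization of step (1-2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the source performs lemmatization with NLTK, whereas here a tiny hand-written lookup table (`TOY_LEMMAS`, a hypothetical name) stands in for it so that the sketch runs without any downloaded corpora. The function name `preprocess` and the sample title and abstract are likewise assumptions.

```python
import re

# Hypothetical, tiny lemma table standing in for a real lemmatizer.
# The source uses NLTK; this table only keeps the sketch dependency-free.
TOY_LEMMAS = {
    "models": "model",
    "networks": "network",
    "batteries": "battery",
    "used": "use",
}


def preprocess(title, abstract):
    """Lowercase the title and abstract, tokenize, and map tokens to base forms."""
    text = f"{title} {abstract}".lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]


record = preprocess("Lithium Batteries", "Neural networks used for battery models.")
print(record)
```

In practice the lookup table would be replaced by a full lemmatizer; the surrounding structure (extract, lowercase, tokenize, reduce to base form, store) stays the same.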
(2) Inputting all the scientific and technical documents in the database obtained in step (1) into a trained Support Vector Machine (SVM) model for technical category division, and storing the division results in the database.
The advantage of this step is that expert knowledge is incorporated: the technical categories of the scientific and technical documents are divided by a machine learning algorithm trained on expert labels, so that expert knowledge is embedded into the subsequent LDA topic model and the accuracy of that model can be improved.
The SVM model is trained as follows:
(2-1) obtaining the title and abstract of each scientific and technical document of the technical field as processed in step (1), and merging the title and abstract to obtain a description text of the document;
(2-2) inputting the description text of each scientific and technical document obtained in step (2-1) into a Doc2vec model to obtain the text vector corresponding to the document;
The Doc2vec model in this step is an unsupervised text vector representation model whose architecture is shown in FIG. 2. Its weight parameters comprise text vectors and word vectors. During training, the text vector is concatenated with the context word vectors to jointly predict the next word, and the weight parameters are updated by minimizing the word prediction loss. Specifically, for each text, the text vector and the sampled context word vectors form one training sample; after repeated sampling with a sliding window, the text vector represents the topic expressed by the text more and more accurately. After training, each text obtains a fixed-length vector representation that captures both semantics and word order.
In training the Doc2vec model in this step, the word vector dimension is set to 200, the number of training epochs to 100, and the sampling window width to 8;
the final result of the Doc2vec model is a mapping between each scientific and technical document and its text vector.
The advantage of this sub-step is that the Doc2vec model takes text semantics fully into account and produces the text vectors of the scientific and technical documents automatically, avoiding complex text processing and index definition procedures.
(2-3) establishing a technical category library comprising a plurality of technical categories and a plurality of scientific and technical documents corresponding to each category;
Specifically, the technical category library is established by experts in the technical field mentioned in step (1) on the basis of their accumulated knowledge, and the numbers of papers and patents corresponding to each technical category should be kept roughly equal.
(2-4) constructing a data set from the text vectors obtained in step (2-2) and the technical category library obtained in step (2-3), and dividing the data set into a training set and a test set at a ratio of 8:2;
(2-5) inputting the training set obtained in step (2-4) into the SVM model to obtain an updated SVM model;
The advantage of this sub-step is that, by using a machine learning algorithm, a large amount of scientific and technical document data can be classified automatically at the cost of only a small amount of expert labeling.
Specifically, the hyper-parameters of the SVM model are set iteratively, and the trained SVM model is validated iteratively on the test set obtained in step (2-4) until the classification accuracy is optimal, yielding the trained SVM model.
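The 8:2 division of step (2-4) and the accuracy check used during hyper-parameter selection in step (2-5) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `split_dataset` and `accuracy`, the fixed random seed, and the toy two-dimensional vectors are all hypothetical, and the SVM itself is omitted because the source does not specify its kernel or hyper-parameter grid.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle (vector, label) pairs and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of test documents whose predicted category matches the expert label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical data: 10 toy "text vectors" with expert-assigned categories.
dataset = [([float(i), float(i % 3)], "cat_a" if i < 5 else "cat_b") for i in range(10)]
train_set, test_set = split_dataset(dataset)
print(len(train_set), len(test_set))
```

A classifier trained on `train_set` would be scored with `accuracy` on `test_set` for each candidate hyper-parameter setting, keeping the setting with the highest score.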
(3) Merging the scientific and technical documents that belong to the same technical category in the database processed in step (2) into one document, all such documents forming a document library;
Specifically, the number of documents in the document library in this step equals the number of technical categories.
(4) Successively removing stop words and low-frequency words from the document library obtained in step (3) to obtain an updated document library;
Specifically, stop word removal removes the stop words present in the document library (the stop word list provided in NLTK may be used), and low-frequency words are words that appear fewer than 10 times in a document.
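Steps (3) and (4) can be sketched together as follows. This is an illustrative sketch only: `STOP_WORDS` is a hypothetical six-word subset standing in for the NLTK stop word list, the function names and sample records are assumptions, and the demo call lowers the frequency threshold to 2 because the toy documents are far too short for the threshold of 10 given in the source (which remains the default).

```python
from collections import Counter, defaultdict

# Hypothetical stop-word subset; the source refers to the NLTK stop word list.
STOP_WORDS = {"the", "of", "a", "for", "and", "in"}


def merge_by_category(records):
    """Concatenate the token lists of all records sharing a technical category."""
    merged = defaultdict(list)
    for category, tokens in records:
        merged[category].extend(tokens)
    return dict(merged)


def clean_document(tokens, min_count=10):
    """Drop stop words, then drop words occurring fewer than min_count times."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(kept)
    return [t for t in kept if counts[t] >= min_count]


records = [
    ("battery", ["the", "anode", "of", "cell"]),
    ("battery", ["anode", "coating"]),
    ("robotics", ["gripper", "for", "arm", "gripper"]),
]
library = merge_by_category(records)
# min_count=2 only because the toy data is tiny; the source uses 10.
cleaned = {cat: clean_document(toks, min_count=2) for cat, toks in library.items()}
print(cleaned)
```

The resulting library has one merged document per technical category, which matches the statement that the number of documents equals the number of categories.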
(5) Performing topic clustering on the document library updated in step (4) using the Latent Dirichlet Allocation (LDA) algorithm to obtain a word distribution matrix corresponding to all topics in the document library, and generating a primary technical list from the word distribution matrix.
The advantage of this step is that the topic modeling algorithm automatically mines the latent topics in the document library and thus captures deeper semantic information.
The word distribution matrix obtained after LDA processing in this step is shown in FIG. 3. Each row of the matrix represents a different topic and each column a different word, and the element in the ith row and jth column represents the probability that the jth word belongs to the ith topic, where i ∈ [1, total number of topics in the document library] and j ∈ [1, total number of words in the document library].
Generating the primary technical list from the word distribution matrix specifically comprises: first obtaining all the elements in the first row of the word distribution matrix, selecting those greater than a preset threshold, obtaining the words corresponding to the selected elements, determining from these words the technical field to which they belong, and establishing a mapping between that technical field and the topic corresponding to the row; then repeating this process for the remaining rows of the word distribution matrix to obtain mappings between a plurality of technical fields and their corresponding topics, all of which together constitute the primary technical list.
Specifically, the preset threshold in this step is the mean of all the elements in the row.
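The row-wise selection described above, with the threshold equal to the row mean, can be sketched as follows. The matrix values, the vocabulary, and the function name `select_topic_words` are hypothetical; the sketch covers only the word selection, since the source does not specify how the technical field is determined from the selected words.

```python
def select_topic_words(word_distribution, vocabulary):
    """For each topic row, keep the words whose probability exceeds the row mean."""
    selected = []
    for row in word_distribution:
        threshold = sum(row) / len(row)
        selected.append([w for w, p in zip(vocabulary, row) if p > threshold])
    return selected


# Hypothetical 2-topic, 4-word distribution matrix (rows: topics, columns: words).
vocabulary = ["anode", "cathode", "gripper", "actuator"]
word_distribution = [
    [0.45, 0.40, 0.10, 0.05],
    [0.05, 0.05, 0.50, 0.40],
]
topic_words = select_topic_words(word_distribution, vocabulary)
print(topic_words)
```

Each inner list holds the words from which a first-level technical field would then be named for the corresponding topic.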
(6) And (3) constructing a word co-occurrence network for each topic in the primary technology list obtained in the step (5), clustering the word co-occurrence network corresponding to the topic by using a Louvain algorithm to obtain a plurality of clustering clusters corresponding to the topic (as shown in FIG. 4), and analyzing the subject word content in all the clustering clusters corresponding to the topic to obtain a secondary technology list corresponding to the topic.
In the step, the word co-occurrence network reveals the relation between the subject words, the Louvain algorithm can quickly and effectively mine the incidence relation in the network, and the word co-occurrence network is clustered by using the Louvain algorithm, so that the two-level technology under the subject can be summarized.
The process of analyzing the subject word content in all cluster clusters corresponding to the subject to obtain a secondary technical list corresponding to the subject specifically comprises the steps of firstly obtaining elements selected from the first row in the word distribution matrix in the step (5), obtaining words corresponding to the elements, constructing a word co-occurrence network according to the obtained words, and adding 1 to the weights corresponding to the two words if the two words appear in the same scientific and technical literature at the same time; then inputting the word co-occurrence network into a Louvain algorithm to obtain a plurality of divided clustering clusters, selecting clusters with the number of words larger than a preset threshold value in the clustering clusters, determining the technical field to which the clusters belong, and establishing a mapping relation between the technical field and the corresponding clustering clusters; and finally, repeating the process for the rest rows of the word distribution matrix obtained in the step (5) to obtain mapping relations between a plurality of technical fields and subjects to which the technical fields belong, wherein all the mapping relations form a secondary technical list.
Specifically, the preset threshold in this step is 50.
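The edge-weighting rule and the cluster-size filter of step (6) can be illustrated with a minimal sketch. It assumes each document is already tokenized and counts a word pair once per document; the function names `build_cooccurrence` and `keep_large_clusters` and the toy data are hypothetical. The Louvain clustering itself is not reproduced here; the weights produced below would be handed to an existing Louvain implementation.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents, topic_words):
    """Return {(word_a, word_b): weight} for the given topic words.

    Each document is a list of tokens. A pair's weight grows by 1 for every
    document in which both words appear, however often they repeat there.
    """
    vocabulary = set(topic_words)
    weights = Counter()
    for tokens in documents:
        # Sorting gives every pair one canonical (alphabetical) orientation.
        present = sorted(vocabulary.intersection(tokens))
        for pair in combinations(present, 2):
            weights[pair] += 1
    return weights


def keep_large_clusters(clusters, min_size=50):
    """Keep only clusters whose word count is larger than the threshold."""
    return [cluster for cluster in clusters if len(cluster) > min_size]


# Toy data (hypothetical): three tokenized documents and three topic words.
docs = [
    ["laser", "welding", "robot"],
    ["laser", "welding", "laser"],
    ["robot", "gripper"],
]
edges = build_cooccurrence(docs, ["laser", "welding", "robot"])
```

With the threshold of 50 stated above, `keep_large_clusters` would discard any cluster of 50 words or fewer.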
(7) Associate the primary technical list obtained in step (5) with the secondary technical list obtained in step (6) to generate the complete technical list.
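One possible way to carry out the association in step (7) is sketched below. It assumes, purely for illustration, that both lists are keyed by a topic identifier; the function name `build_technical_list` and the field names in the sample data are hypothetical and do not come from the patent.

```python
def build_technical_list(primary, secondary):
    """Nest each topic's secondary fields under its primary field.

    primary:   {topic_id: primary_field_name}
    secondary: {topic_id: [secondary_field_name, ...]}
    """
    complete = {}
    for topic_id, field in primary.items():
        # Topics that share a primary field contribute to the same entry.
        complete.setdefault(field, []).extend(secondary.get(topic_id, []))
    return complete


# Hypothetical sample lists keyed by topic index.
primary_list = {0: "laser processing", 1: "industrial robots"}
secondary_list = {0: ["laser welding", "laser cutting"], 1: ["grippers"]}
complete_list = build_technical_list(primary_list, secondary_list)
```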
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A technical list generation method based on multi-source data and a topic model, characterized by comprising the following steps:
(1) acquiring a plurality of scientific and technical documents corresponding to a given technical field, performing data processing on the documents, and storing the processed documents in a scientific and technical document database;
(2) inputting all scientific and technical documents in the scientific and technical document database obtained in step (1) into a trained support vector machine (SVM) model for technical category classification, and storing the classification results in the scientific and technical document database;
(3) merging the scientific and technical documents that belong to the same technical category in the scientific and technical document database processed in step (2) into a single document, all such documents forming a document library;
(4) successively removing stop words and low-frequency words from the document library obtained in step (3) to obtain an updated document library;
(5) performing topic clustering on the document library updated in step (4) using the topic modeling algorithm LDA to obtain a word distribution matrix corresponding to all topics in the document library, and generating a primary technical list from the word distribution matrix;
(6) constructing a word co-occurrence network for each topic in the primary technical list obtained in step (5), clustering the word co-occurrence network corresponding to the topic using the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyzing the topic-word content of all clusters corresponding to the topic to obtain the secondary technical list corresponding to the topic;
(7) associating the primary technical list obtained in step (5) with the secondary technical list obtained in step (6) to generate a complete technical list.
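Step (3) of the claim above, which merges all documents of one technical category into a single document, can be sketched as follows. The sketch assumes each record is a `(category, text)` pair and that merging means joining the texts with spaces; the function name `merge_by_category` and the sample records are hypothetical.

```python
from collections import defaultdict


def merge_by_category(records):
    """Merge the texts of each technical category into one document.

    records: iterable of (category, text) pairs.
    Returns {category: merged_text}; the values form the document library.
    """
    grouped = defaultdict(list)
    for category, text in records:
        grouped[category].append(text)
    return {category: " ".join(texts) for category, texts in grouped.items()}


# Hypothetical classified records, as they might leave step (2).
records = [
    ("robotics", "gripper design for assembly"),
    ("laser", "laser welding of steel"),
    ("robotics", "path planning for robot arms"),
]
library = merge_by_category(records)
```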
2. The technical list generation method based on multi-source data and a topic model according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) formulating a paper search query and a patent search query related to a given technical field, and running them against the Web of Science (WOS) citation database and the Derwent Innovations Index (DII) patent database, respectively, to obtain the scientific and technical documents corresponding to the technical field;
(1-2) for each scientific and technical document corresponding to the technical field acquired in step (1-1), extracting its title and abstract, applying lowercase conversion and then lemmatization to the extracted title and abstract using the Natural Language Toolkit (NLTK), and storing the processed scientific and technical document in the scientific and technical document database.
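The lowercase conversion and lemmatization of sub-step (1-2) can be sketched as below. To stay self-contained, the sketch takes the lemmatizer as a parameter and demonstrates it with a tiny hand-written lookup table, which is only a stand-in for NLTK; in practice a callable such as NLTK's `WordNetLemmatizer().lemmatize` could be passed instead, although which NLTK component the authors used is not stated. The function name `normalize`, the tokenization pattern, and the sample record are assumptions.

```python
import re


def normalize(text, lemmatize=lambda token: token):
    """Lowercase the text, split it into alphanumeric tokens, lemmatize each."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lemmatize(token) for token in tokens]


# Stand-in lemma table (hypothetical); a real run would use an NLTK lemmatizer.
toy_lemmas = {"robots": "robot", "welds": "weld", "plates": "plate"}


def toy_lemmatize(token):
    return toy_lemmas.get(token, token)


record = {"title": "Welding Robots", "abstract": "A robot welds steel plates."}
processed = {
    field: normalize(value, toy_lemmatize) for field, value in record.items()
}
```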
3. The technical list generation method based on multi-source data and a topic model according to claim 1 or 2, characterized in that the SVM model is obtained by training as follows:
(2-1) acquiring the title and abstract of each scientific and technical document corresponding to a given technical field after the processing in step (1), and merging the acquired title and abstract to obtain a description text of the scientific and technical document;
(2-2) inputting the description text obtained in step (2-1) for each scientific and technical document in the scientific and technical document database into a Doc2vec model to obtain a text vector corresponding to the scientific and technical document;
(2-3) establishing a technical category library, wherein the technical category library comprises a plurality of technical categories and a plurality of scientific and technical documents corresponding to each technical category;
(2-4) constructing a data set from the text vectors obtained in step (2-2) for the scientific and technical documents in the scientific and technical document database and the technical category library obtained in step (2-3), and dividing the data set into a training set and a test set at a ratio of 8:2;
(2-5) inputting the training set obtained in step (2-4) into the SVM model to obtain an updated SVM model.
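The 8:2 division of step (2-4) can be sketched as follows. The claim does not say whether the data set is shuffled before splitting; the seeded shuffle here is an assumption, and the function name `split_dataset` and the placeholder samples are hypothetical.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples reproducibly and split them into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (vector_id, category) pairs standing in for labelled text vectors.
dataset = [(i, "category_a" if i % 2 else "category_b") for i in range(10)]
train_set, test_set = split_dataset(dataset)
```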
4. The technical list generation method based on multi-source data and a topic model according to any one of claims 1 to 3, characterized in that:
the weight parameters of the Doc2vec model comprise text vectors and word vectors; during training, the text vector and the context word vectors are concatenated to jointly predict the next word, and the model weight parameters are updated by minimizing the word-prediction loss;
during training of the Doc2vec model, the word vector dimension is set to 200, the number of training epochs to 100, and the width of the sampling window to 8;
the final result produced by the Doc2vec model is a mapping between each scientific and technical document and its text vector.
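The training objective described in this claim resembles the distributed-memory form of paragraph vectors, and under that assumption it can be written out as follows. The notation is introduced here only for illustration: d_n is the text vector of document n, w_t the vector of the word at position t, k the context width, V the vocabulary, and u_v, b_v the output weights and bias for word v. Whether the window of width 8 covers only preceding words or both sides is not stated in the source; the preceding-word form is shown.

```latex
h_{n,t} = \left[\, d_n ;\; w_{t-k} ;\; \dots ;\; w_{t-1} \,\right],
\qquad
p(w_t \mid h_{n,t}) =
  \frac{\exp\!\left(u_{w_t}^{\top} h_{n,t} + b_{w_t}\right)}
       {\sum_{v \in V} \exp\!\left(u_v^{\top} h_{n,t} + b_v\right)}

\mathcal{L} = -\sum_{n=1}^{N} \sum_{t} \log p(w_t \mid h_{n,t})
```

Minimizing the loss updates both the word vectors and the text vectors d_n; after training, the d_n are the document representations passed on to the SVM.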
5. The technical list generation method based on multi-source data and a topic model according to any one of claims 1 to 3, characterized in that step (2-5) specifically comprises iteratively setting the hyper-parameters of the SVM model, and after each iteration verifying the trained SVM model with the test set of the data set obtained in step (2-4), until the resulting classification accuracy is optimal, thereby obtaining the trained SVM model.
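The iterative hyper-parameter setting of this claim could be realized as an exhaustive search over a small grid, which is one possible reading rather than the stated procedure. In the sketch below, `evaluate` stands for "train the SVM with these settings and return its accuracy on the test set" and is replaced by a toy scoring function so the example runs on its own; the names `grid_search`, `C`, and `kernel` and all values are hypothetical.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every hyper-parameter combination and keep the best-scoring one.

    grid:     {parameter_name: [candidate values]}
    evaluate: callable taking a parameter dict and returning an accuracy.
    """
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_accuracy(params):
    """Stand-in for training an SVM and measuring test-set accuracy."""
    penalty = 0.1 if params["kernel"] == "linear" else 0.0
    return 1.0 - abs(params["C"] - 10) / 100 - penalty


search_space = {"C": [1, 10, 100], "kernel": ["linear", "rbf"]}
best_params, best_score = grid_search(search_space, toy_accuracy)
```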
6. The technical list generation method based on multi-source data and a topic model according to claim 1, characterized in that:
the stop word removal in step (4) removes the stop words present in the document library, specifically with reference to the stop word list provided in NLTK;
low-frequency words are words that appear fewer than 10 times in a document.
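The two filters of this claim, applied in the stated order, can be sketched as follows. The small stop word set is a stand-in for the NLTK stop word list, and the function name `clean_document` and the toy tokens are hypothetical; frequencies are counted within a single merged document, as the claim describes.

```python
from collections import Counter


def clean_document(tokens, stop_words, min_count=10):
    """Remove stop words first, then words occurring fewer than min_count times."""
    kept = [token for token in tokens if token not in stop_words]
    counts = Counter(kept)
    return [token for token in kept if counts[token] >= min_count]


# Stand-in stop word set (hypothetical); the claim refers to NLTK's list.
stop_words = {"the", "of", "a"}
tokens = ["the"] * 20 + ["laser"] * 10 + ["weld"] * 9
cleaned = clean_document(tokens, stop_words)
```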
7. The technical list generation method based on multi-source data and a topic model according to claim 1, characterized in that:
each row of the word distribution matrix represents a different topic and each column represents a different word; the element in the i-th row and j-th column of the matrix represents the probability that the j-th word belongs to the i-th topic, where i ∈ [1, total number of topics in the document library] and j ∈ [1, total number of words in the document library];
generating the primary technical list from the word distribution matrix in step (5) specifically comprises: first obtaining all elements in the first row of the word distribution matrix, selecting those elements that are larger than a preset threshold, obtaining the words corresponding to the selected elements, determining from these words the technical field to which they belong, and establishing a mapping between that technical field and the topic corresponding to the row; then repeating the above process for the remaining rows of the word distribution matrix to obtain mappings between a plurality of technical fields and their corresponding topics, all of which together constitute the primary technical list.
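The row-by-row selection described in this claim can be sketched as follows. The matrix is represented as a plain list of rows, and the threshold value, vocabulary, and function name `words_above_threshold` are hypothetical; naming the technical field from the selected words is an analyst's judgment and is not automated here.

```python
def words_above_threshold(matrix, vocabulary, threshold):
    """For each topic row, return the words whose probability exceeds threshold."""
    return [
        [vocabulary[j] for j, probability in enumerate(row) if probability > threshold]
        for row in matrix
    ]


# Toy word distribution matrix: 2 topics (rows) by 3 words (columns).
vocabulary = ["laser", "welding", "robot"]
matrix = [
    [0.50, 0.40, 0.10],
    [0.05, 0.15, 0.80],
]
topic_words = words_above_threshold(matrix, vocabulary, threshold=0.2)
```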
8. The technical list generation method based on multi-source data and a topic model according to claim 1, characterized in that analyzing, in step (6), the topic-word content of all clusters corresponding to the topic to obtain the secondary technical list corresponding to the topic specifically comprises: first obtaining the elements selected from the first row of the word distribution matrix in step (5) and the words corresponding to those elements, and constructing a word co-occurrence network from the obtained words, wherein the weight of the edge between two words is increased by 1 whenever the two words appear together in the same scientific and technical document; then inputting the word co-occurrence network into the Louvain algorithm to obtain a plurality of clusters, selecting the clusters containing more words than a preset threshold, determining the technical field to which each selected cluster belongs, and establishing a mapping between that technical field and the corresponding cluster; and finally repeating the above process for the remaining rows of the word distribution matrix obtained in step (5) to obtain mappings between a plurality of technical fields and the topics to which they belong, all of which together constitute the secondary technical list.
9. A technical list generation system based on multi-source data and a topic model, characterized by comprising:
a first module, configured to acquire a plurality of scientific and technical documents corresponding to a given technical field, perform data processing on the documents, and store the processed documents in a scientific and technical document database;
a second module, configured to input all scientific and technical documents in the scientific and technical document database obtained by the first module into a trained support vector machine (SVM) model for technical category classification, and store the classification results in the scientific and technical document database;
a third module, configured to merge the scientific and technical documents that belong to the same technical category in the scientific and technical document database processed by the second module into a single document, all such documents forming a document library;
a fourth module, configured to successively remove stop words and low-frequency words from the document library obtained by the third module to obtain an updated document library;
a fifth module, configured to perform topic clustering on the document library updated by the fourth module using the topic modeling algorithm LDA to obtain a word distribution matrix corresponding to all topics in the document library, and generate a primary technical list from the word distribution matrix;
a sixth module, configured to construct a word co-occurrence network for each topic in the primary technical list obtained by the fifth module, cluster the word co-occurrence network corresponding to the topic using the Louvain algorithm to obtain a plurality of clusters corresponding to the topic, and analyze the topic-word content of all clusters corresponding to the topic to obtain the secondary technical list corresponding to the topic.
CN202210483086.2A 2022-05-05 2022-05-05 Technical list generation method and system based on multi-source data and topic model Pending CN114780617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210483086.2A CN114780617A (en) 2022-05-05 2022-05-05 Technical list generation method and system based on multi-source data and topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210483086.2A CN114780617A (en) 2022-05-05 2022-05-05 Technical list generation method and system based on multi-source data and topic model

Publications (1)

Publication Number Publication Date
CN114780617A true CN114780617A (en) 2022-07-22

Family

ID=82435583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210483086.2A Pending CN114780617A (en) 2022-05-05 2022-05-05 Technical list generation method and system based on multi-source data and topic model

Country Status (1)

Country Link
CN (1) CN114780617A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116741311A (en) * 2023-08-14 2023-09-12 中科三清科技有限公司 Method and device for outputting natural source volatile organic compounds BVOCs emission list
CN116741311B (en) * 2023-08-14 2023-10-20 中科三清科技有限公司 Method and device for outputting natural source volatile organic compounds BVOCs emission list

Similar Documents

Publication Publication Date Title
CN110597735A (en) Software defect prediction method for open-source software defect feature deep learning
CN110298032A (en) Text classification corpus labeling training system
CN110543564B (en) Domain label acquisition method based on topic model
CN110647626B (en) REST data service clustering method based on Internet service domain
CN111767397A (en) Electric power system secondary equipment fault short text data classification method
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111079419A (en) Big data-based national defense science and technology hot word discovery method and system
CN111813933A (en) Automatic identification method for technical field in technical atlas
CN110046943A (en) A kind of optimization method and optimization system of consumer online's subdivision
CN116610818A (en) Construction method and system of power transmission and transformation project knowledge base
CN114780617A (en) Technical list generation method and system based on multi-source data and topic model
Nurhachita et al. A comparison between deep learning, naïve bayes and random forest for the application of data mining on the admission of new students
Kovalchuk et al. Text mining for the analysis of legal texts
Mustafa et al. Optimizing document classification: Unleashing the power of genetic algorithms
Almunirawi et al. A comparative study on serial decision tree classification algorithms in text mining
CN116401338A (en) Design feature extraction and attention mechanism based on data asset intelligent retrieval input and output requirements and method thereof
CN116049376A (en) Method, device and system for retrieving and replying information and creating knowledge
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN114817567A (en) Construction method of classification number co-occurrence network, technical opportunity identification method and system
CN113849639A (en) Method and system for constructing theme model categories of urban data warehouse
Balbi et al. A two-step strategy for improving categorisation of short texts
Wang Retracted: Multi‐data multiple gray clustering analysis based on layered mining for ubiquitous clouds and social internet of things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination