CN114817567A - Construction method of classification number co-occurrence network, technical opportunity identification method and system - Google Patents

Info

Publication number
CN114817567A
Authority
CN
China
Prior art keywords
classification number
node
technical
classification
occurrence
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210471066.3A
Other languages
Chinese (zh)
Inventor
周源
褚恒
孔德婧
陈吉红
杨建中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huazhong University of Science and Technology
Beijing University of Posts and Telecommunications
Original Assignee
Tsinghua University
Huazhong University of Science and Technology
Beijing University of Posts and Telecommunications
Application filed by Tsinghua University, Huazhong University of Science and Technology, Beijing University of Posts and Telecommunications
Priority to CN202210471066.3A
Publication of CN114817567A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/11 Patent retrieval

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing a classification number co-occurrence network, together with a technical opportunity identification method and system, and belongs to the field of technical opportunity identification. Semantic information and co-occurrence information of Cooperative Patent Classification (CPC) numbers are combined to form a CPC co-occurrence network; a graph neural network model mines the hidden connection patterns between CPC nodes; the CPC nodes likely to connect to an artificial node representing a target field are then predicted; and finally the technical development points likely to appear in the target field in the future are identified, providing decision support for formulating a technical development strategy. The method fully combines CPC semantic information and co-occurrence information to mine potential associations between technologies, and the rich node features provide a good foundation for model learning and help identify technical opportunities better.

Description

Construction method of classification number co-occurrence network, technical opportunity identification method and system
Technical Field
The invention belongs to the technical opportunity identification field, and more particularly relates to a construction method of a classification number co-occurrence network, a technical opportunity identification method and a system.
Background
Technical opportunity identification discovers the latest technical trends by mining the development trends and interrelations of the existing technologies in a technical field, and infers the technical forms or technical development points that may appear in that field. Technical opportunities serve as a basis for innovation decisions and are an important factor that any technical innovation must consider.
The prior art identifies technical opportunities mainly by five categories of methods: outlier detection, patent maps, link prediction, science-technology association, and composite methods.
The core steps of the outlier detection method are the identification and evaluation of outlier patents. The identification step relies mainly on quantitative data mining, while the evaluation step relies on expert experience or on qualitative methods such as the TRIZ laws of technical system evolution, so the analysis result is highly subjective because experts differ in accumulated knowledge.
The patent map method performs dimension-reduction visualization on keyword vector representations of patents and then searches the patent map for blank areas as potential technical opportunities. However, an identified blank area is represented only by a group of surrounding keywords, so obtaining a clear technical meaning remains difficult.
The link prediction method uses network structure information to predict the likelihood of a connection between two unconnected nodes in a network; searching for potential technical opportunities becomes searching for missing links in the network. It follows two main directions: link prediction combined with a co-word network based on text content, and link prediction combined with a citation network based on patent classification numbers. The former takes the technical association represented by a "keyword pair" as the final prediction result, and the fuzzy semantic relationship easily biases developers' understanding of the technical development trend. The latter represents technical opportunities as "patent classification number pairs", but the macroscopic meaning of patent classification numbers makes it hard to refine specific technical details.
The science-technology association method mines the technical evolution paths of paper and patent data and performs association analysis, mainly to find topics that are active in scientific research but have not yet appeared in the technical field. The clustering method it adopts retains only the main part of the data and may lose valuable information in fields where citations are sparse.
The composite method combines several analysis methods for technical opportunity identification and can improve identification accuracy through the complementary advantages of the methods, but it is not yet mature and still needs exploration.
Disclosure of Invention
In view of the defects of the prior art and the need for improvement, the invention provides a method for constructing a classification number co-occurrence network, together with a technical opportunity identification method and system. The aim is to combine semantic information and co-occurrence information of Cooperative Patent Classification (CPC) numbers to form a CPC co-occurrence network, mine the hidden connection patterns between CPC nodes through a graph neural network model, predict the CPC nodes likely to connect to an artificial node representing a target field, and finally identify the technical development points likely to appear in the target field in the future, providing decision support for formulating a technical development strategy. The method fully combines CPC semantic information and co-occurrence information to mine potential associations between technologies, and the rich node features provide a good foundation for model learning and help identify technical opportunities better.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for constructing a classification number co-occurrence network, the method comprising the steps of:
S1. determining the co-occurrence information of the classification numbers of each patent in a target field and a field to be analyzed; counting all classification numbers appearing in all patents of the target field and the field to be analyzed to form a classification number set;
S2. for each classification number in the classification number set, constructing the text information of the classification number from the texts of all patents carrying that classification number;
S3. training a first Doc2vec model with each classification number as a document and the text information of that classification number as its words, to obtain a semantic vector of each classification number in the classification number set; training a second Doc2vec model with each patent as a document and its classification numbers as words, to obtain a co-occurrence vector of each classification number in the classification number set;
S4. combining the semantic vector and the co-occurrence vector of each classification number in the classification number set into the classification number vector of that classification number;
S5. abstracting each classification number in the classification number set as a network node, taking the corresponding classification number vector as the attribute of the node, and adding an edge between the nodes of two classification numbers if their co-occurrence count exceeds a set threshold, thereby constructing the classification number co-occurrence network.
To achieve the above object, according to a second aspect of the present invention, there is provided a technical opportunity identification method including the steps of:
S1. constructing, by the method of the first aspect, a classification number co-occurrence network of a past time period, a classification number co-occurrence network of a current time period, and a classification number co-occurrence network covering both the past and the current time periods;
S2. taking node pairs that have no edge in the past time period but have an edge in the current time period as positive samples, and node pairs that have no edge in either the past or the current time period as negative samples, to obtain a training sample set;
S3. counting all classification numbers appearing in all patents of the target field to form a target field classification number set;
S4. computing a weighted average of the classification number vectors of the classification numbers in the target field classification number set to obtain a target field node vector; adding the target field node to the classification number co-occurrence network covering the past and current time periods; adding edges between the target field node and all nodes in the target field classification number set; and pairing the target field node one by one with every node outside the target field classification number set to generate a set of samples to be tested;
S5. inputting each sample to be tested into the graph neural network model trained on the training sample set, to obtain the probability that an edge forms for each sample to be tested.
Preferably, in step S4, the target field node is added to the classification number co-occurrence network covering the past and current time periods as follows:
(1) counting the number of occurrences of each classification number in the target field classification number set;
(2) dividing the number of occurrences of each classification number by the maximum number of occurrences among the classification numbers, and using the result as the weight of that classification number;
(3) computing a weighted average of the classification number vectors of the classification numbers in the target field classification number set to obtain the target field node vector;
(4) adding edges between the target field node and all nodes in the target field classification number set, the weight of each edge being the weight of the corresponding classification number.
Beneficial effects: the invention constructs a virtual node representing the target field by aggregating the classification number vectors contained in the target field, which offers a simple and effective way to combine technical elements. Compared with methods that first cluster and partition the classification numbers of the target field and then measure association, the method measures the association between the target field and each classification number directly and avoids information loss. In addition, the weights highlight the more important technical components of the target field, so that the final identification result focuses on the core technical directions of the target field.
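Steps (1) to (4) can be illustrated with a minimal Python sketch. The function name `target_field_node`, the input shapes, and the normalization of the weighted average by the sum of the weights are assumptions made for illustration; the source does not specify them.

```python
from collections import Counter


def target_field_node(patent_cpcs, cpc_vectors):
    """Build the virtual target-field node (illustrative sketch).

    patent_cpcs: list of classification-number lists, one per target-field patent
                 (hypothetical input shape).
    cpc_vectors: dict mapping classification number -> list of floats of equal length.
    Returns (node_vector, weights); the weights also serve as the edge weights.
    """
    # Step (1): count how often each classification number occurs.
    counts = Counter(c for cpcs in patent_cpcs for c in cpcs)
    # Step (2): weight = occurrences / maximum occurrences.
    max_count = max(counts.values())
    weights = {c: n / max_count for c, n in counts.items()}
    # Step (3): weighted average of the classification number vectors.
    # Dividing by the sum of the weights is an assumption.
    dim = len(next(iter(cpc_vectors.values())))
    total = sum(weights.values())
    vector = [
        sum(weights[c] * cpc_vectors[c][k] for c in weights) / total
        for k in range(dim)
    ]
    # Step (4): each edge (target node, c) would carry weights[c].
    return vector, weights
```

With two patents carrying `["A", "B"]` and `["A"]`, classification number A receives weight 1.0 and B receives weight 0.5, so the node vector leans toward A.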
Preferably, the graph neural network model is trained by:
1) constructing the enclosing subgraph of each node pair:
the enclosing subgraph is obtained by searching outward for the k-order neighbor nodes of the two central nodes whose link is to be predicted; each positive and negative sample node pair is traversed, and the indices of all nodes in the enclosing subgraph of each node pair are stored;
2) distinguishing the relative positions of nodes:
the importance of nodes is distinguished by a node labeling method, with the formula:
$$d = d_x + d_y$$
$$\mathrm{Label}(i) = 1 + \min(d_x, d_y) + \lfloor d/2 \rfloor \left[ \lfloor d/2 \rfloor + (d \bmod 2) - 1 \right]$$
where $d_x$ and $d_y$ denote the distances from node $i$ in the enclosing subgraph to the two central nodes $x$ and $y$, $\lfloor d/2 \rfloor$ is the integer quotient of $d$ divided by 2, $d \bmod 2$ is the remainder, and $\mathrm{Label}(i)$ denotes the label of node $i$. The labels of the central nodes $x$ and $y$ are defined as 1; if $(d_x, d_y) = (1, 1)$ for node $i$, then $\mathrm{Label}(i) = 2$; if $(d_x, d_y) = (1, 2)$ or $(d_x, d_y) = (2, 1)$, then $\mathrm{Label}(i) = 3$; and so on;
3) obtaining the adjacency matrix and the node information matrix of the enclosing subgraph:
the adjacency matrix contains the information on the edges between the nodes of the enclosing subgraph; the node information matrix contains the features of all nodes of the enclosing subgraph, each row representing one node and consisting of the semantic vector, the co-occurrence vector and the label of that node. One training sample of the graph neural network model consists of the adjacency matrix and the node information matrix of one enclosing subgraph.
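The subgraph extraction and node labeling in steps 1) and 2) can be sketched as follows. This is a minimal illustration in plain Python: the adjacency-dictionary representation, the function names, measuring distances on the full graph, and the label 0 for nodes unreachable from one central node are all assumptions rather than details from the source.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source in an undirected graph {node: set(neighbors)}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def node_label(dx, dy):
    """Label formula from the text, using integer division and remainder."""
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * ((d // 2) + (d % 2) - 1)


def label_enclosing_subgraph(adj, x, y, k):
    """Label every node within k hops of either central node x or y."""
    dx = bfs_distances(adj, x)
    dy = bfs_distances(adj, y)
    nodes = {n for n, d in dx.items() if d <= k} | {n for n, d in dy.items() if d <= k}
    labels = {}
    for n in nodes:
        if n in (x, y):
            labels[n] = 1  # central nodes are defined to have label 1
        elif n not in dx or n not in dy:
            labels[n] = 0  # assumption: unreachable from one central node
        else:
            labels[n] = node_label(dx[n], dy[n])
    return labels
```

The label of each node would then be appended to its semantic and co-occurrence vectors to form one row of the node information matrix.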
Preferably, the technical opportunity identification method further includes:
screening the classification numbers that may become technical opportunities to form a primary technology list, where a primary technology is the technical meaning of a classification number;
retrieving the patents in the target field that contain the technical opportunity classification numbers, extracting specific technical items from them, and compiling a secondary technology list, where a secondary technology is a technical item extracted from the patents contained in the corresponding classification number.
Beneficial effects: the technical opportunity identification method further screens the results and produces a technology list with a two-level structure, providing richer and more specific technical information.
Preferably, the secondary technology list is obtained by:
for each classification number in the primary technology list, acquiring all patents within 5 years that correspond to that classification number in the technical opportunity field, and clustering the acquired patents into a plurality of clusters with a topological clustering algorithm, each cluster corresponding to one sub-field of the technical opportunity;
counting, for each cluster, the frequencies of the words extracted from the patent texts in the cluster to obtain a high-frequency word set for the cluster;
screening out, from each cluster, the patents in which the number of occurrences of high-frequency words exceeds a threshold, according to the high-frequency word set;
calculating, for each patent screened from each cluster, the importance of the patent through singular value centrality;
screening out, from each cluster after the high-frequency word screening, the patents whose importance exceeds a threshold;
extracting specific technical items from the texts of the patents that remain in each cluster after the two layers of screening, to generate a plurality of secondary technologies subordinate to the classification number.
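The first screening layer (high-frequency words) can be sketched in Python as follows. The function names, the choice of the `top_n` most frequent words as the high-frequency word set, and the `min_hits` threshold are illustrative assumptions; the importance-based second layer is not covered here because the source does not describe how the centrality is computed.

```python
from collections import Counter


def high_frequency_words(cluster_docs, top_n):
    """Return the top_n most frequent words over all patents of one cluster.

    cluster_docs: dict mapping patent id -> list of words (hypothetical shape).
    """
    counts = Counter(w for words in cluster_docs.values() for w in words)
    return {w for w, _ in counts.most_common(top_n)}


def screen_by_high_frequency_words(cluster_docs, top_n, min_hits):
    """Keep patents whose count of high-frequency word occurrences exceeds min_hits."""
    keywords = high_frequency_words(cluster_docs, top_n)
    return [
        pid
        for pid, words in cluster_docs.items()
        if sum(1 for w in words if w in keywords) > min_hits
    ]
```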
Preferably, the technical opportunity identification method further includes:
constructing a patent citation network from all patents under the technical opportunity classification numbers and the target field patents;
dividing all data into patents containing only target field classification numbers, patents containing only technical opportunity classification numbers, and patents containing both, and clustering the patents into a plurality of clusters with a topological clustering algorithm, each cluster being defined as a technical topic;
analyzing how the aggregation trend of the three types of nodes changes over time, and adding the result to the technology list as a description of the development situation of the primary technical opportunity.
Beneficial effects: the invention mines the relationship between the technical opportunities and the target field through time-series-based clustering of the patent citation network and presents the trend of their combination; compared with treating only the broadly defined CPC as the technical opportunity, this provides more effective decision support for formulating a technical development strategy.
To achieve the above object, according to a third aspect of the present invention, there is provided a technical opportunity identification system including: a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is configured to read the executable instructions stored in the computer-readable storage medium and execute the technical opportunity identification method of the second aspect.
To achieve the above object, according to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for constructing a classification number co-occurrence network according to the first aspect or the technical opportunity identification method according to the second aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention provides a method for constructing a classification number co-occurrence network, which uses machine learning to generate classification number node features containing both network structure information and text information. This provides richer information sources and a good foundation for model learning, and can improve the comprehensiveness of the technical opportunity identification results to a certain extent.
(2) The invention provides a technical opportunity identification method that uses a relatively advanced graph neural network model. The model adapts well to different types of networks and automatically learns diverse potential relation patterns between nodes, which avoids the single prediction mode and the constraints of the assumption-specific heuristics used in traditional link prediction, and can therefore provide more accurate and reliable identification results.
Drawings
FIG. 1 is a flow diagram of technical opportunity identification based on a Cooperative Patent Classification co-occurrence network;
FIG. 2 is an architecture diagram of the Doc2vec models extracting CPC vectors;
FIG. 3 is a flow chart of patent citation network clustering analysis.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a construction method of a classification number co-occurrence network and a technical opportunity identification method, which are suitable for any type of patent classification number, such as IPC, EPC, CPC and the like.
The main idea of the preferred technical opportunity identification method, which fuses the semantic and co-occurrence information of the Cooperative Patent Classification (CPC), is to extract CPC vectors from the semantic and co-occurrence information of the CPC and construct a CPC co-occurrence network, then mine the hidden connection patterns between CPC nodes through a graph neural network, predict the technical opportunity CPCs likely to connect to an artificial node representing a target field, finally identify the technical opportunities likely to appear in the target field in the future, and extract a secondary technology list. As shown in FIG. 1, the method comprises the following steps:
S1. dividing all patent data into a t1 time period (before 2015) and a t2 time period (2015-2020), and preprocessing each: on the one hand, the CPCs contained in each patent are extracted to obtain CPC co-occurrence information; on the other hand, the titles and abstracts of the patents are tokenized and stop words are removed, and the patent texts are then assigned to the CPCs the patents contain, giving the semantic information of each CPC;
S2. learning the semantic vector and the co-occurrence vector of each CPC with Doc2vec machine learning models;
S3. combining the semantic vector and the co-occurrence vector of each CPC to construct a co-occurrence network, in which the network nodes represent CPCs and an edge between two nodes indicates that the two CPCs co-occur in patents. The edges between nodes are divided into the t1 and t2 time periods according to the earliest time at which each edge appears, and the CPC co-occurrence networks of the t1 and t2 time periods are constructed accordingly;
S4. acquiring training data from the constructed CPC co-occurrence networks. Node pairs with no edge in the t1 time period but with an edge in the t2 time period are taken as positive samples, and node pairs with no edge in either the t1 or the t2 time period are taken as negative samples. The numbers of positive and negative samples are equal, and the samples are divided into a training set and a test set at a ratio of 9:1;
S5. selecting the target field to be studied, collecting the vectors of the CPCs contained in the patents of the target field to obtain a target field vector set, averaging them to obtain the artificial node vector of the target field, adding the artificial node to the co-occurrence network of the t2 time period to generate a new prediction network, and extracting from the prediction network the node pairs formed by the newly added artificial node and the CPC nodes outside the target field CPC set as prediction data;
S6. preprocessing the divided training data, inputting it into a graph neural network model for iterative training, and using the trained model to predict the likelihood of a connection between the artificial node and each CPC node outside the target field;
S7. screening the CPCs that may become technical opportunities with the help of expert knowledge, and extracting specific technical items from the patents corresponding to those CPCs to form a technology list with a two-level structure, in which a primary technology is the technical meaning of a CPC and a secondary technology is a technical item extracted from the patents contained in the corresponding CPC;
S8. constructing a patent citation network from all patents under the technical opportunity CPCs and the target field patents, dividing all data into three time slices, clustering the patents of each slice into a plurality of clusters with a topological clustering algorithm, and comparing how the distribution of the patents of the two fields over the clusters changes with time, to obtain the trend of combination between the technical opportunity and the target field.
In step S1, the patent data preprocessing has two main aspects:
1) Extracting CPC co-occurrence information: a patent CPC can be decomposed into five parts, "section - class - subclass - main group - subgroup". Because using the complete CPC makes the technical granularity too fine, the CPCs of all patents are truncated to 8 characters, which correspond to the four parts "section - class - subclass - main group". The data format of the extracted CPC co-occurrence information is: { patent number: a list containing a plurality of 8-character CPCs }. Since the window size of the subsequent Doc2vec model is set to k, patents containing fewer than k CPCs are filtered out (for example, k = 3).
2) Extracting CPC semantic information: the semantic information of a CPC is built from the titles and abstracts of patents. The patent text is preprocessed as follows: a. the title and abstract of the patent are merged; b. punctuation marks and stop words are removed, and words are lemmatized; c. the continuous text is tokenized, and each patent is stored as a list of words. After the text of each patent is processed, the patent texts are assigned to the CPCs the patents contain, and the text information of a specific CPC is obtained by concatenating the text information of its patents. The data format of the extracted CPC semantic information is: { CPC: CPC text information }.
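The two preprocessing aspects can be sketched with the Python standard library. Several points are assumptions for illustration: CPC symbols are taken to be stored in a zero-padded form such as `G06F0016367000` so that the first 8 characters cover section, class, subclass and main group; duplicates created by truncation are dropped; the stop-word list is a tiny placeholder; and lemmatization is omitted because it would need an external NLP library.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to"}


def cooccurrence_records(patent_cpcs, k=3, length=8):
    """Build {patent number: [truncated CPCs]} and drop patents with fewer than k CPCs."""
    records = {}
    for patent_no, cpcs in patent_cpcs.items():
        truncated = []
        for cpc in cpcs:
            short = cpc[:length]
            if short not in truncated:  # assumption: keep each truncated CPC once
                truncated.append(short)
        if len(truncated) >= k:
            records[patent_no] = truncated
    return records


def tokenize(title, abstract):
    """Merge title and abstract, strip punctuation and stop words, split into words."""
    text = f"{title} {abstract}".lower()
    return [w for w in re.findall(r"[a-z0-9]+", text) if w not in STOP_WORDS]


def cpc_text(patent_tokens, patent_cpcs):
    """Build {CPC: words} by concatenating the words of every patent carrying the CPC."""
    texts = defaultdict(list)
    for patent_no, tokens in patent_tokens.items():
        for cpc in patent_cpcs.get(patent_no, []):
            texts[cpc].extend(tokens)
    return dict(texts)
```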
In step S2, two Doc2vec models are trained, one for the semantic information and one for the co-occurrence information of the CPC. A Doc2vec model produces vector representations of all documents in a corpus and of the words they contain, and the invention adapts this usage. As shown in FIG. 2, when semantic information is extracted, each CPC is treated as a document and the CPC text information as its words; when co-occurrence information is extracted, each patent is treated as a document and its CPCs as words. That is, the CPC is trained as a document vector in one model and as a word vector in the other. The semantic vector and the co-occurrence vector of the CPC are extracted as the features of the nodes in the network. The main parameters of the models are the window size and the output vector dimension. Because CPC text information is relatively long, the window size k1 of the Doc2vec model that extracts semantic information may be relatively large (e.g. k1 = 20), while the window size k2 of the Doc2vec model that extracts co-occurrence information may be small (e.g. k2 = 3); the dimension n of the document vectors and word vectors output by the two models may be identical (e.g. n = 50). The trained semantic vector and co-occurrence vector are concatenated to obtain the CPC node vector.
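The role swap between the two models can be made concrete with a small Python sketch that prepares the two corpora and concatenates the resulting vectors. The function names and the (tag, words) tuple layout are assumptions; the actual training would be handed to a Doc2vec implementation, which the source does not name and which is not shown here.

```python
def semantic_corpus(cpc_texts):
    """Corpus for the first model: each CPC is a document, its text tokens are the words."""
    return [(cpc, list(words)) for cpc, words in cpc_texts.items()]


def cooccurrence_corpus(patent_cpcs):
    """Corpus for the second model: each patent is a document, its CPCs are the words."""
    return [(patent_no, list(cpcs)) for patent_no, cpcs in patent_cpcs.items()]


def node_vector(semantic_vec, cooccurrence_vec):
    """Concatenate the semantic (document) vector and co-occurrence (word) vector of a CPC."""
    return list(semantic_vec) + list(cooccurrence_vec)
```

With n = 50 for both models, each CPC node vector produced this way has 100 dimensions.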
In step S3, a co-occurrence network is constructed from the CPC co-occurrence relationships in the patent data. The number of co-occurrences of each pair of CPCs in the patents is counted. If the number of co-occurrences of two CPCs is greater than a threshold, the two CPCs are considered to co-occur closely, an edge is added between the two nodes in the network, and the weight of the edge is calculated from the number of co-occurrences; if the number of co-occurrences is smaller than the threshold, or the two CPCs never co-occur, no edge is added between the two nodes. The network construction rule is as follows:
Figure BDA0003622405340000101
In the formula, $CPC_i$ and $CPC_j$ represent two different CPC nodes in the network, $w_{ij}$ represents the weight of the edge between $CPC_i$ and $CPC_j$, and $v_{ij}$ represents the number of co-occurrences of $CPC_i$ and $CPC_j$ in all patent data; the threshold can be set according to experimental results.
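The thresholding rule can be sketched as follows. This is an illustration only: the function name and toy data are hypothetical, and the edge weight is assumed to equal the raw co-occurrence count, whereas the formula in the figure may scale it differently.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def build_cooccurrence_network(
    patent_cpcs: Iterable[List[str]], threshold: int
) -> Dict[Tuple[str, str], int]:
    """Return {(cpc_i, cpc_j): w_ij} for pairs whose count v_ij exceeds threshold."""
    counts: Counter = Counter()
    for cpcs in patent_cpcs:
        # Each unordered CPC pair is counted once per patent.
        for pair in combinations(sorted(set(cpcs)), 2):
            counts[pair] += 1
    # Assumption: w_ij is the raw count v_ij; pairs at or below the threshold get no edge.
    return {pair: v for pair, v in counts.items() if v > threshold}


patents = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "B"]]
edges = build_cooccurrence_network(patents, threshold=1)
```

Here the pair (A, C) co-occurs only once and is therefore left unconnected.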
Because the invention performs future-oriented dynamic link prediction, co-occurrence networks for two time periods need to be constructed, so as to capture the potential association between the state of a node pair in the network of time period t1 and whether that node pair becomes connected in the network of time period t2. Specifically, the edges between nodes are assigned to time period t1 (before 2015) or time period t2 (2015-2020) according to the earliest time at which each edge appears, and the CPC co-occurrence networks of time periods t1 and t2 are constructed accordingly.
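A minimal sketch of this time split, assuming each patent record carries a year and a CPC list (a hypothetical input format), and ignoring the co-occurrence threshold for brevity:

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def split_edges_by_period(
    patents: Iterable[Tuple[int, List[str]]],
    cutoff_year: int = 2015,
    end_year: int = 2020,
) -> Tuple[Set[Edge], Set[Edge]]:
    """Assign each CPC pair to t1 or t2 by the earliest year in which it co-occurs."""
    first_seen: Dict[Edge, int] = {}
    for year, cpcs in patents:
        for pair in combinations(sorted(set(cpcs)), 2):
            first_seen[pair] = min(year, first_seen.get(pair, year))
    t1 = {p for p, y in first_seen.items() if y < cutoff_year}
    t2 = {p for p, y in first_seen.items() if cutoff_year <= y <= end_year}
    return t1, t2


records = [(2012, ["A", "B"]), (2017, ["A", "B", "C"]), (2019, ["B", "C"])]
t1_edges, t2_edges = split_edges_by_period(records)
```

The pair (A, B) first appears in 2012, so it stays in t1 even though it co-occurs again in 2017.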
In step S4, training data is obtained from the CPC co-occurrence networks constructed in step S3 and divided into a training set and a test set. Node pairs that have no edge in time period t1 but have an edge in time period t2 are taken as positive samples, and node pairs that have no edge in either time period are taken as negative samples. The numbers of positive and negative samples are the same, and the samples are divided into a training set and a test set at a ratio of 9:1. Finally, the CPC co-occurrence networks of time periods t1 and t2 and the divided training set and test set are stored in a database.
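The sampling rule of step S4 might look like the sketch below. The function name, the random sampling of negatives and the fixed seed are assumptions made for illustration; edges are assumed to be stored as sorted tuples.

```python
import random
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]
Sample = Tuple[Edge, int]


def make_samples(
    nodes: Iterable[str],
    t1_edges: Set[Edge],
    t2_edges: Set[Edge],
    test_ratio: float = 0.1,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Build balanced positive/negative samples and split them 9:1."""
    rng = random.Random(seed)
    # Positive: no edge in t1, edge in t2.
    positives = sorted(t2_edges - t1_edges)
    # Negative: no edge in either period, sampled to match the positives.
    linked = t1_edges | t2_edges
    candidates = [p for p in combinations(sorted(nodes), 2) if p not in linked]
    negatives = rng.sample(candidates, min(len(positives), len(candidates)))
    samples = [(p, 1) for p in positives] + [(p, 0) for p in negatives]
    rng.shuffle(samples)
    n_test = max(1, round(len(samples) * test_ratio))
    return samples[n_test:], samples[:n_test]


cpc_nodes = [f"C{i}" for i in range(10)]
past = {("C0", "C1")}
current = {("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C4", "C5"), ("C5", "C6")}
train_set, test_set = make_samples(cpc_nodes, past, current)
```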
In step S5, the CPC co-occurrence network used for prediction is constructed. First, a target field to be studied is selected; the data of this field is a subset of all the data used in the previous steps. All CPCs contained in the patents of the target field are counted to obtain the target-field CPC set, and the vectors corresponding to these CPCs are averaged to obtain the artificial node vector. The artificial node can be regarded as a special CPC node that fuses the technical connotations of the target field. The artificial node is then added to the CPC network of time period t2 constructed in step S3, and edges are added between the artificial node and all CPC nodes in the target-field CPC set, generating the CPC co-occurrence network used for prediction. Next, the artificial node is paired one by one with every CPC node outside the target-field CPC set to generate the node pairs to be predicted. The indexes of the nodes are stored in a txt file in which each line represents one node pair and the two nodes are separated by a space. Finally, the CPC co-occurrence network used for prediction and the prediction node pair file are stored in a database.
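As a rough sketch of step S5, the artificial node can be built from an unweighted mean of the target-field CPC vectors. The function name, the node name "TARGET" and the in-memory data structures are hypothetical choices for this example.

```python
from typing import Dict, List, Set, Tuple


def add_artificial_node(
    vectors: Dict[str, List[float]],
    edges: Set[Tuple[str, str]],
    target_cpcs: Set[str],
    name: str = "TARGET",
) -> List[Tuple[str, str]]:
    """Add the artificial node in place and return the node pairs to be predicted."""
    target = sorted(target_cpcs)
    dim = len(vectors[target[0]])
    # Unweighted mean of the target-field CPC vectors.
    vectors[name] = [sum(vectors[c][k] for c in target) / len(target) for k in range(dim)]
    # Connect the artificial node to every CPC in the target-field set.
    edges.update((name, c) for c in target)
    # Candidate links: the artificial node paired with every CPC outside the set.
    return [(name, c) for c in sorted(vectors) if c != name and c not in target_cpcs]


cpc_vectors = {"A": [1.0, 3.0], "B": [3.0, 5.0], "C": [0.0, 0.0]}
network_edges: Set[Tuple[str, str]] = set()
prediction_pairs = add_artificial_node(cpc_vectors, network_edges, {"A", "B"})
# One node pair per line, separated by a space, as in the txt file described above.
pair_lines = [f"{u} {v}" for u, v in prediction_pairs]
```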
In step S6, the training data needs to be processed into an input form required by the graph neural network model, which specifically includes the following three steps:
1) Constructing a closed subgraph of each node pair: the closed subgraph is obtained by searching outwards from the two central nodes whose link is to be predicted for their k-order neighbor nodes (generally k is 1 or 2; too large a value of k makes the subgraph too large and greatly increases memory consumption during computation). Each positive and negative sample node pair is traversed, and all node indexes in the closed subgraph of each node pair are stored.
2) Distinguishing the relative positions of nodes: the nodes in the closed subgraph lie at different distances from the central nodes whose link is to be predicted, and nodes at different relative positions differ in importance for link prediction. This importance is distinguished by a node labeling method, with the following formulas:
$$d = d_x + d_y$$
$$\mathrm{Label}(i) = 1 + \min(d_x, d_y) + (d/2)\,[(d/2) + (d\,\%\,2) - 1]$$
In the formulas, $d_x$ and $d_y$ are the distances from a node in the closed subgraph to the two central nodes x and y (the distance is the number of edges between the nodes), d/2 and d%2 are the integer quotient and the remainder of d divided by 2, and Label(i) is the label of node i. The labels of the central nodes x and y are defined as 1; if $(d_x, d_y) = (1, 1)$ for node i, then Label(i) = 2; if $(d_x, d_y) = (1, 2)$ or $(d_x, d_y) = (2, 1)$, then Label(i) = 3; and so on.
3) Obtaining the adjacency matrix and the node information matrix of the closed subgraph: the adjacency matrix contains the information on the edges between nodes in the closed subgraph; the node information matrix contains the features of all nodes in the closed subgraph, where each row represents one node and consists of the semantic vector, the co-occurrence vector and the label of that node. One training sample of the graph neural network model consists of the adjacency matrix and the node information matrix of one closed subgraph.
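The subgraph extraction and node labeling of steps 1) and 2) can be sketched with plain breadth-first search. The helper names are hypothetical, distances are measured inside the extracted subgraph, and giving label 0 to nodes that cannot reach one of the centers is an assumption that the text does not state.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def bfs_distances(adj: Dict[str, List[str]], source: str,
                  allowed: Optional[Set[str]] = None) -> Dict[str, int]:
    """Hop distances from source, optionally restricted to the node set `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist and (allowed is None or v in allowed):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closed_subgraph(adj: Dict[str, List[str]], x: str, y: str, k: int = 1) -> Set[str]:
    """All nodes within k hops of either central node."""
    nodes = {x, y}
    for center in (x, y):
        nodes |= {n for n, d in bfs_distances(adj, center).items() if d <= k}
    return nodes


def node_label(dx: int, dy: int) -> int:
    """Label(i) = 1 + min(dx, dy) + (d/2)[(d/2) + (d%2) - 1] with integer division."""
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * ((d // 2) + (d % 2) - 1)


def label_subgraph(adj: Dict[str, List[str]], x: str, y: str, k: int = 1) -> Dict[str, int]:
    nodes = closed_subgraph(adj, x, y, k)
    dist_x = bfs_distances(adj, x, nodes)
    dist_y = bfs_distances(adj, y, nodes)
    labels = {}
    for n in nodes:
        if n in (x, y):
            labels[n] = 1  # central nodes are labelled 1 by definition
        elif n in dist_x and n in dist_y:
            labels[n] = node_label(dist_x[n], dist_y[n])
        else:
            labels[n] = 0  # assumption: node cannot reach one of the centers
    return labels


graph = {"x": ["a", "b"], "y": ["a", "c"], "a": ["x", "y"],
         "b": ["x", "c"], "c": ["y", "b"]}
labels = label_subgraph(graph, "x", "y", k=1)
```

In this toy graph node a is adjacent to both centers and gets label 2, while b and c are one hop from one center and two hops from the other and get label 3. The label would then be appended to each node's semantic and co-occurrence vectors to form its row in the node information matrix.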
Furthermore, the prediction data is preprocessed according to the same three steps. The training data is input into the graph neural network model for iterative training, and the prediction data is then input into the trained model to predict the possibility of a connection between the newly added artificial node and each CPC node outside the target-field CPC set. The prediction result takes values in [0, 1] and can be regarded as the probability that a link forms between the artificial node and that CPC node. The prediction results are output to an Excel file in descending order, in which the first column is the CPC and the second column is the probability that the CPC becomes connected with the target field.
In step S7, a technology list is prepared according to the link prediction results of step S6. First, a preliminary screening is carried out: CPCs whose probability of connecting with the target field is less than a threshold m are deleted (m can be set to 0.8). The remaining CPCs, which are more likely to become technical opportunities in the target field, form the primary technology list.
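The preliminary screening amounts to a filter followed by a descending sort. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
from typing import Dict, List, Tuple


def primary_technology_list(predictions: Dict[str, float], m: float = 0.8) -> List[Tuple[str, float]]:
    """Drop CPCs whose predicted link probability is below m; sort the rest descending."""
    kept = [(cpc, p) for cpc, p in predictions.items() if p >= m]
    return sorted(kept, key=lambda item: item[1], reverse=True)


scores = {"A": 0.95, "B": 0.5, "C": 0.8, "D": 0.79}
primary = primary_technology_list(scores)
```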
Further, patents are screened to refine the secondary technology list, based on the following two aspects. First, patents in the technical opportunity field from the last 5 years are obtained and clustered into several clusters with a topological clustering algorithm; the frequencies of the words extracted from the patent texts of each cluster are counted to obtain a high-frequency word set for each cluster, and patents containing more high-frequency words are screened out according to this set. Second, the importance of each patent is evaluated by singular value centrality (SVC); that is, the Schatten matrix norm is obtained by summing the p-th powers of the k largest singular values of the similarity matrix S:
Figure BDA0003622405340000121
The importance of the j-th patent is then measured by the following formula, i.e., by the change in the Schatten matrix norm before and after the j-th patent is removed:
Figure BDA0003622405340000131
In the formula, $S^{(-j)}$ denotes the similarity matrix with the j-th patent removed, and $\sigma_{ii}$ is the i-th singular value of S.
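As a reading aid, the two quantities described in words above can be written out in symbols. This is a sketch under assumptions: it uses the standard Schatten p-norm truncated to the k largest singular values, and it takes the importance to be the plain difference of the two norms. The formulas in the figures may use a different normalization, and the name SVC(j) is introduced here only for illustration.

```latex
\|S\|_{p} = \Big( \sum_{i=1}^{k} \sigma_{ii}^{\,p} \Big)^{1/p},
\qquad
\mathrm{SVC}(j) = \|S\|_{p} - \|S^{(-j)}\|_{p}
```

Under this reading, a patent whose removal reduces the norm more strongly contributes more to the overall similarity structure and is ranked as more important.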
Patents with more high-frequency words and higher importance are screened out through these two aspects, and specific technical items are extracted from these patents as secondary technologies.
In step S8, the trend of combination between the technical opportunity and the target field is mined. The specific flow is shown in fig. 3. First, all patents under the technical opportunity CPC and the patents of the target field are divided into three parts by time slice, and a patent citation network is constructed for each part. Nodes in the patent citation networks are drawn in three colors, representing patents that contain only target-field CPCs, patents that contain only technical opportunity CPCs, and patents that contain both. The patents are then clustered into several clusters with a topological clustering algorithm, and the resulting clusters are defined as technical topics. The change over time in the aggregation trend of the three types of nodes is then analyzed and added to the technology list as a description of the development status of the primary technical opportunity. Over time, the technical opportunity patents may come to show significant connection and aggregation with the target-field patents, or the two may not yet intersect much.
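The three node types used in the trend analysis can be derived directly from each patent's CPC list. The sketch below only counts node types per time slice; the citation network and the topological clustering are not shown, and the function names, slice boundaries and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Slice = Tuple[int, int]


def node_type(cpcs: Iterable[str], target_cpcs: Set[str], opportunity_cpcs: Set[str]) -> str:
    """Classify a patent by which CPC sets it touches."""
    in_target = bool(set(cpcs) & target_cpcs)
    in_opportunity = bool(set(cpcs) & opportunity_cpcs)
    if in_target and in_opportunity:
        return "both"
    if in_target:
        return "target_only"
    return "opportunity_only" if in_opportunity else "other"


def type_counts_by_slice(
    patents: Iterable[Tuple[int, List[str]]],
    slices: List[Slice],
    target_cpcs: Set[str],
    opportunity_cpcs: Set[str],
) -> Dict[Slice, Counter]:
    """Count node types per inclusive (start_year, end_year) time slice."""
    counts = {s: Counter() for s in slices}
    for year, cpcs in patents:
        for start, end in slices:
            if start <= year <= end:
                counts[(start, end)][node_type(cpcs, target_cpcs, opportunity_cpcs)] += 1
    return counts


time_slices = [(2006, 2010), (2011, 2015), (2016, 2020)]
patent_records = [(2008, ["T1"]), (2012, ["O1"]), (2013, ["T1"]),
                  (2018, ["T1", "O1"]), (2019, ["T1", "O1"])]
trend = type_counts_by_slice(patent_records, time_slices, {"T1"}, {"O1"})
```

A growing count of patents containing both kinds of CPC in later slices would correspond to the two fields converging.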
Through the above steps, a technical opportunity list for a given field can be obtained. The list includes the trend of combination between the field and each technical opportunity and provides more specific secondary technical points under each primary technical opportunity, offering more effective decision support for formulating a technology development strategy.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A construction method of a classification number co-occurrence network is characterized by comprising the following steps:
S1, determining co-occurrence information of classification numbers of each patent in a target field and a field to be analyzed; for all patents in the target field and the field to be identified, counting all appeared classification numbers to form a classification number set;
S2, for each classification number in the classification number set, constructing text information of the classification number based on all patent texts with the classification number;
S3, taking the classification number as a document, taking the text information of the classification number as a word, and training a first Doc2vec model to obtain a semantic vector of each classification number in the classification number set; taking a patent as a document, taking the classification number as a word, training a second Doc2vec model, and obtaining a co-occurrence vector of each classification number in the classification number set;
S4, combining the semantic vector and the co-occurrence vector of each classification number in the classification number set into a classification number vector corresponding to the classification number;
and S5, abstracting each classification number in the classification number set into a network node, taking a corresponding classification number vector as the attribute of the network node, and if the co-occurrence frequency between the two network nodes exceeds a set threshold value, adding edges between the network nodes corresponding to the two classification numbers, thereby realizing the construction of the classification number co-occurrence network.
2. A technical opportunity recognition method is characterized by comprising the following steps:
S1, adopting the method of claim 1 to respectively construct a classification number co-occurrence network of a past time period, a classification number co-occurrence network of a current time period and a classification number co-occurrence network comprising the past time period and the current time period;
S2, taking node pairs that have no corresponding edge in the past time period but have a corresponding edge in the current time period as positive samples, and taking node pairs that have no corresponding edge in either the past time period or the current time period as negative samples, to obtain a training sample set;
S3, counting all classification numbers appearing in all patents in the target field to form a target field classification number set;
S4, carrying out weighted average on the classification number vectors corresponding to the classification numbers in the target field classification number set to obtain a target field node vector, adding the target field node into the classification number co-occurrence network containing the past time period and the current time period, adding edges between the target field node and all nodes in the target field classification number set, and pairing the target field node one by one with all nodes outside the target field classification number set to generate a sample set to be tested;
and S5, respectively inputting each sample to be tested in the sample set to be tested into the graph neural network model trained by the training sample set to obtain the probability of generating edges of each sample to be tested.
3. The technical opportunity identification method of claim 2, wherein in step S4, the target domain node is added to the classification number co-occurrence network including the past time period and the current time period by:
(1) counting the occurrence times of each classification number in the target field classification number set;
(2) dividing the number of occurrences of each classification number by the maximum number of occurrences among the classification numbers, and using the result as the weight of that classification number;
(3) carrying out weighted average on the classification number vectors corresponding to the classification numbers in the target field classification number set to obtain the target field node vector;
(4) adding edges between the target field node and all nodes in the target field classification number set, wherein the weight of each edge is the weight of the corresponding classification number.
4. The technology opportunity recognition method of claim 2, wherein the graph neural network model is trained by:
1) constructing a closed subgraph of the node pairs:
the closed subgraph is obtained by searching k-order neighbor nodes outwards from a central node needing to be predicted and linked, traversing each positive sample node pair and each negative sample node pair, and storing all node indexes in the closed subgraph of each node pair;
2) relative positions of the distinguishing nodes:
the importance is distinguished by a node marking method, and the formula is as follows:
$$d = d_x + d_y$$
$$\mathrm{Label}(i) = 1 + \min(d_x, d_y) + (d/2)\,[(d/2) + (d\,\%\,2) - 1]$$
in the formulas, $d_x$ and $d_y$ respectively represent the distances from a node in the closed subgraph to the two central nodes x and y, and Label(i) represents the label of node i; the labels of the central nodes x and y are defined as 1; if $(d_x, d_y) = (1, 1)$ for node i, then Label(i) = 2; if $(d_x, d_y) = (1, 2)$ or $(d_x, d_y) = (2, 1)$ for node i, then Label(i) = 3; and so on;
3) obtaining an adjacency matrix and a node information matrix of the closed subgraph:
the adjacency matrix comprises the information of edges between nodes in the closed subgraph; the node information matrix contains the characteristics of all nodes in the closed subgraph, one row in the matrix represents one node, each row consists of a semantic vector, a co-occurrence vector and a label of the node, and one training sample of the graph neural network model consists of the adjacency matrix and the node information matrix of one closed subgraph.
5. The technical opportunity identification method of claim 2, further comprising:
screening classification numbers which can become technical opportunities to form a primary technology list, wherein the primary technology is the technical connotation of the classification numbers;
retrieving the patents in the target field that contain the technical opportunity classification numbers, and refining specific technical items from these patents to form a secondary technology list, wherein the secondary technology is the technical items extracted from the patents contained in the corresponding classification numbers.
6. The technology opportunity identification method of claim 5, wherein the secondary technology list is obtained by:
for each classification number in the primary technology list, acquiring all patents from the last 5 years in the technical opportunity field corresponding to the classification number, and clustering the acquired patents into a plurality of clusters by adopting a topological clustering algorithm, wherein each cluster corresponds to one sub-field of the technical opportunity;
respectively counting word frequencies of words extracted from the patent texts in each cluster to obtain a high-frequency word set of each cluster;
screening out patents with the occurrence frequency of high-frequency words exceeding a threshold value from each cluster according to the high-frequency word set;
for each patent screened by each cluster, calculating the importance of the patent through the centrality of singular values;
screening patents with the importance exceeding a threshold value from each cluster after the high-frequency words are screened;
for the patents in each cluster after two-layer screening, specific technical items are abstracted from the patent text, and a plurality of secondary technologies subordinate to the classification number are generated.
7. The technical opportunity identification method of claim 5 or 6, further comprising:
constructing a patent citation network by using all patents and target field patents under the technical opportunity classification number;
dividing all data into patents only containing target field classification numbers, patents only containing technical opportunity classification numbers and patents simultaneously containing the target field classification numbers and the technical opportunity classification numbers, and clustering the patents into a plurality of clusters by using a topological clustering algorithm respectively, wherein the clusters are defined as technical subjects;
and analyzing the aggregation trend change of the three types of nodes along with the time, and adding the aggregation trend change into the technical list as the development situation description of the first-level technical opportunity.
8. A technical opportunity identification system, comprising: a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is configured to read executable instructions stored in the computer-readable storage medium and execute the technical opportunity identification method of any one of claims 2 to 7.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements the method of constructing a classification number co-occurrence network according to claim 1 or the method of identifying technical opportunities according to any one of claims 2 to 7.
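The weighting scheme recited in claim 3 can be sketched as follows. It is assumed here that the weighted average is normalized by the sum of the weights, which the claim does not spell out; the function name and toy values are hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_target_vector(
    vectors: Dict[str, List[float]], occurrences: Dict[str, int]
) -> Tuple[List[float], Dict[str, float]]:
    """Return the target field node vector and the per-classification-number weights."""
    max_count = max(occurrences.values())
    # Weight of a classification number = its occurrence count / the maximum count.
    weights = {c: n / max_count for c, n in occurrences.items()}
    total = sum(weights.values())  # assumption: normalize by the sum of weights
    dim = len(vectors[next(iter(occurrences))])
    vec = [sum(weights[c] * vectors[c][k] for c in weights) / total for k in range(dim)]
    return vec, weights


class_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
class_counts = {"A": 4, "B": 2}
target_vec, edge_weights = weighted_target_vector(class_vectors, class_counts)
```

The returned weights double as the weights of the edges between the target field node and the corresponding classification number nodes.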
CN202210471066.3A 2022-04-28 2022-04-28 Construction method of classification number co-occurrence network, technical opportunity identification method and system Pending CN114817567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210471066.3A CN114817567A (en) 2022-04-28 2022-04-28 Construction method of classification number co-occurrence network, technical opportunity identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210471066.3A CN114817567A (en) 2022-04-28 2022-04-28 Construction method of classification number co-occurrence network, technical opportunity identification method and system

Publications (1)

Publication Number Publication Date
CN114817567A true CN114817567A (en) 2022-07-29

Family

ID=82509399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210471066.3A Pending CN114817567A (en) 2022-04-28 2022-04-28 Construction method of classification number co-occurrence network, technical opportunity identification method and system

Country Status (1)

Country Link
CN (1) CN114817567A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591676A (en) * 2024-01-19 2024-02-23 数据空间研究院 Method for identifying enterprise on industrial chain of Coarse-to-fine

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591676A (en) * 2024-01-19 2024-02-23 数据空间研究院 Method for identifying enterprise on industrial chain of Coarse-to-fine
CN117591676B (en) * 2024-01-19 2024-04-05 数据空间研究院 Method for identifying enterprise on industrial chain of Coarse-to-fine

Similar Documents

Publication Publication Date Title
CA2423033C (en) A document categorisation system
Tang et al. Multi-label patent categorization with non-local attention-based graph convolutional network
Tosi et al. SciKGraph: A knowledge graph approach to structure a scientific field
Li et al. Product functional information based automatic patent classification: method and experimental studies
Liu et al. Behavior2vector: Embedding users’ personalized travel behavior to Vector
Zhang et al. A latent-dirichlet-allocation based extension for domain ontology of enterprise’s technological innovation
Zhang et al. Multi-dimension topic mining based on hierarchical semantic graph model
Zandbiglari et al. Capability Language Processing (CLP): Classification and Ranking of Manufacturing Suppliers Based on Unstructured Capability Data
Bakirli et al. DTreeSim: A new approach to compute decision tree similarity using re-mining
CN114817567A (en) Construction method of classification number co-occurrence network, technical opportunity identification method and system
Pouriyeh et al. A comprehensive survey of ontology summarization: measures and methods
Kumar et al. Community-enhanced Link Prediction in Dynamic Networks
De Martino et al. Multi-view overlapping clustering for the identification of the subject matter of legal judgments
Menon et al. Gmm-based document clustering of knowledge graph embeddings
CN116450938A (en) Work order recommendation realization method and system based on map
de Oliveira et al. A syntactic-relationship approach to construct well-informative knowledge graphs representation
Jaiswal et al. Genetic approach based bug triage for sequencing the instance and features
CN115905705A (en) Industrial algorithm model recommendation method based on industrial big data
Das et al. Graph-based text summarization and its application on COVID-19 twitter data
Sikandar et al. Combining sequence entropy and subgraph topology for complex prediction in protein protein interaction (PPI) network
Gou et al. Effective and Efficient Community Search with Graph Embeddings
Wijesinghe et al. Mining frequent patterns in bioinformatics workflows
Tsapatsoulis et al. Quo Vadis Computer Science? The topics of the influential papers during the period 2014-2021
Le et al. Choosing seeds for semi-supervised graph based clustering
Xu et al. News Text Sentiment Classification Method Based on Knowledge Graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination