CN112529302A

CN112529302A - Method and system for predicting success rate of patent application authorization and electronic equipment

Info

Publication number: CN112529302A
Application number: CN202011475523.3A
Authority: CN
Inventors: 张琳; 蒋洪迅
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2021-03-19

Abstract

The invention discloses a method, a system, and an electronic device for predicting the success rate of patent application authorization. The method comprises the following steps: historical invention data of published patent applicants and application companies are obtained from the national intellectual property office, and a heterogeneous information network is constructed. The patent text information of the historical invention data is filtered so that only the text of the abstract of the specification and the claims is kept; this text is preprocessed by word segmentation and stop-word removal and arranged into a corpus. A deep learning model based on natural language processing technology and a node classification model based on graph convolutional neural network technology are trained separately on the corpus to obtain a document vector and a feature vector. The document vector, the feature vector, and the heterogeneous information network are fused to predict the success rate of authorization of the patent application. The method can therefore predict whether a patent application will ultimately be granted or rejected.

Description

Method and system for predicting success rate of patent application authorization and electronic equipment

Technical Field

The invention relates to a method and a system for predicting success rate of patent application authorization and an electronic device.

Background

As an internal resource of an enterprise, a patent is scarce, inimitable, and irreplaceable, and it is a resource basis on which the enterprise gains sustained competitive advantage. For strategic reasons, many enterprises file a large number of patent applications each year, but applying for a patent is not easy, especially in terms of application cost and waiting time; and even after the cost and waiting time have been spent, the pass rate of patent applications is not high. Patent grants and rejections affect patent portfolio management and corporate investment decisions. If an application company knows the likelihood of grant before the application is examined, it can deploy its patent strategy and carry out technology investment ahead of time; even when the patent is ultimately not granted, the applicant still wants to learn of the rejection as early as possible so that other measures for protecting the technology can be prioritized. Meanwhile, investors and competitors of the company, whether reasoning from the market-stealing effect or from the market spillover effect, can carry out competitive analysis of the patented innovation and take measures in advance to avoid commercial loss. How to learn effective features from a large-scale patent network with high accuracy, so that a patent applicant can learn as early as possible how likely a submitted application is to succeed, has therefore become a technical problem that needs to be solved.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The invention aims to provide a method and a system for predicting the success rate of patent application authorization and an electronic device, which can predict the final success or failure result of the patent application authorization.

In order to achieve the above object, the present invention provides a method, a system, and an electronic device for predicting a success rate of patent application authorization. The method comprises: obtaining historical invention data of published patent applicants and application companies from the national intellectual property office, and constructing a heterogeneous information network; filtering the patent text information of the historical invention data so that only the text of the abstract of the specification and the claims is kept, preprocessing this text by word segmentation and stop-word removal, and arranging it into a corpus; separately training, on the corpus, a deep learning model based on natural language processing technology and a node classification model based on graph convolutional neural network technology to obtain a document vector and a feature vector; and fusing the document vector, the feature vector, and the heterogeneous information network to predict the success rate of authorization of the patent application.

In an embodiment of the present invention, the nodes of the heterogeneous information network include patent applications, applicants, application companies, and other patents in a patent cluster, and the relationships of the heterogeneous information network include invention relationships, cooperation relationships, and reference relationships.

In an embodiment of the present invention, each piece of historical invention data includes field information such as patent application number, application date, applicant, application company, abstract of the specification, claims, detailed description, and citation.

In one embodiment of the present invention, constructing the heterogeneous information network includes: extracting the patent applicant and the application company as separate entities, constructing a heterogeneous information network comprising three kinds of entities (patents, applicants, and application companies), querying other patents in the same patent cluster according to the patent application, and adding those patents to the heterogeneous information network as nodes.

In one embodiment of the present invention, training the deep learning model based on natural language processing technology on the corpus comprises: loading word vectors pre-trained on a large-scale corpus to give the text semantic information, learning latent word vector representations through a bidirectional recurrent neural network, and performing dimension transformation to obtain the document vector of the patent text.

In one embodiment of the present invention, training the node classification model based on graph convolutional neural network technology on the corpus comprises: taking the adjacency matrix of the heterogeneous information network as the input of the node classification model, performing active-learning-based convolution twice, and then passing the output of the convolution layer through a fully connected layer to obtain a dense vector representation, thereby obtaining the feature vector of each patent node in the heterogeneous information network.

In an embodiment of the present invention, fusing the document vector, the feature vector, and the heterogeneous information network to predict the authorization success rate of the patent application includes: concatenating the document vector, the feature vector, and the heterogeneous information network along the X axis and applying vector normalization to obtain the input features of the fusion model; and feeding the input features into a three-layer fully connected network in which, after learning in the intermediate layer, the last layer outputs a numeric variable mapped to two dimensions, thereby obtaining the final prediction of authorization success or failure.

Another aspect of the present invention provides a system for predicting a success rate of patent application authorization, including a data processing module, a text classification module, a node classification module, and a feature fusion module. The data processing module is used for processing the historical invention data of published patent applicants and application companies acquired from the national intellectual property office and for constructing a heterogeneous information network. The text classification module is used for filtering the patent text information of the historical invention data, keeping only the text of the abstract of the specification and the claims, preprocessing that text by word segmentation and stop-word removal, and arranging it into a corpus. The node classification module is used for separately training, on the corpus, a deep learning model based on natural language processing technology and a node classification model based on graph convolutional neural network technology to obtain a document vector and a feature vector. The feature fusion module is used for fusing the document vector, the feature vector, and the heterogeneous information network to predict the success rate of authorization of the patent application.

Compared with the prior art, the method, the system, and the electronic device for predicting the success rate of patent application authorization can predict whether a patent application will ultimately be granted or rejected.

Drawings

FIG. 1 is a flow chart illustrating a method for predicting success rate of patent application authorization according to an embodiment of the present invention;

FIG. 2 is a detailed flowchart illustrating a method for predicting success rate of patent application authorization according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of constructing heterogeneous information network node types and relationship types according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of patent document information obtained from the national intellectual property office according to one embodiment of the present invention;

FIG. 5 is a schematic diagram of an algorithm architecture according to an embodiment of the present invention;

FIG. 6 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.

Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.

Fig. 1 is a flowchart illustrating a method for predicting a success rate of patent application authorization according to an embodiment of the present invention. Fig. 2 is a detailed flowchart illustrating a method for predicting a success rate of patent application authorization according to an embodiment of the present invention. Fig. 3 is a schematic diagram of constructing heterogeneous information network node types and relationship types according to an embodiment of the present invention. Fig. 4 is a schematic diagram of patent document information obtained from the national intellectual property office according to an embodiment of the present invention. FIG. 5 is a schematic diagram of an algorithm architecture according to an embodiment of the present invention.

As shown in FIG. 1 to FIG. 5, in a first aspect, an embodiment of the present invention provides a method for predicting a success rate of patent application authorization, including: S1, acquiring historical invention data of published patent applicants and application companies from the national intellectual property office, and constructing a heterogeneous information network; S2, filtering the patent text information of the historical invention data, keeping only the text of the abstract of the specification and the claims, preprocessing this text by word segmentation and stop-word removal, and arranging it into a corpus; S3, separately training, on the corpus, a deep learning model based on natural language processing technology and a node classification model based on graph convolutional neural network technology to obtain a document vector and a feature vector; and S4, fusing the document vector, the feature vector, and the heterogeneous information network to predict the success rate of patent application authorization.

In an embodiment of the present invention, the nodes of the heterogeneous information network include patent applications, applicants, application companies, and other patents in a patent cluster, and the relationships of the heterogeneous information network include invention relationships, cooperation relationships, and reference relationships. Each piece of historical invention data includes field information such as patent application number, application date, applicant, application company, abstract of the specification, claims, detailed description, and citation.

In a second aspect, an embodiment of the present invention further provides a system for predicting a success rate of patent application authorization, including a data processing module, a text classification module, a node classification module, and a feature fusion module. The data processing module is used for processing the historical invention data of published patent applicants and application companies acquired from the national intellectual property office and for constructing a heterogeneous information network. The text classification module is used for filtering the patent text information of the historical invention data, keeping only the text of the abstract of the specification and the claims, preprocessing that text by word segmentation and stop-word removal, and arranging it into a corpus. The node classification module is used for separately training, on the corpus, a deep learning model based on natural language processing technology and a node classification model based on graph convolutional neural network technology to obtain a document vector and a feature vector. The feature fusion module is used for fusing the document vector, the feature vector, and the heterogeneous information network to predict the success rate of authorization of the patent application.

In a third aspect, FIG. 6 shows a block diagram of an electronic device according to another embodiment of the invention. The electronic device 1100 may be a host server with computing capability, a personal computer (PC), a portable computer or terminal, or the like. The specific embodiments of the present invention do not limit the specific implementation of the electronic device.

The electronic device 1100 includes at least one processor 1110, a communication interface 1120, a memory 1130, and a bus 1140. The processor 1110, the communication interface 1120, and the memory 1130 communicate with each other via the bus 1140.

The communication interface 1120 is used for communicating with network elements including, for example, virtual machine management centers, shared storage, etc.

Processor 1110 is configured to execute programs. Processor 1110 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.

The memory 1130 is used for storing executable instructions. The memory 1130 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory. The memory 1130 may also be a memory array. The memory 1130 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules. The instructions stored in the memory 1130 are executable by the processor 1110 to enable the processor 1110 to perform the method for predicting a success rate of patent application authorization in any of the above-described method embodiments.

In practical application, the method, the system, and the electronic device for predicting the success rate of patent application authorization mine information from two dimensions, the text content and the innovation network of the application team, and provide a classification model based on deep neural networks and shallow feature fusion. The classification model comprises three parts: text mining, heterogeneous information network analysis, and feature fusion. Distributed representations of the features are learned using natural language processing and graph convolutional neural network methods, and a classification prediction model based on a fully connected neural network is finally trained and verified on a large number of real samples. Unlike the multivariate statistics and regression analysis methods used in existing research on patent application prediction, the methods and features used by the invention are no longer constrained by heavy manual design but are instead data-driven, and a significant prediction effect is achieved in identifying whether a patent application succeeds. To address uncertainties such as the long waiting time of patent applications, the invention provides a deep learning prediction algorithm based on document vectors and a heterogeneous network. The algorithm captures innovative knowledge content through text mining and captures cooperation relationships and citation records by constructing the heterogeneous information network, thereby mining the latent features of a patent and describing the technical evolution of the patented invention; the deep features are then fused in a shallow manner through a neural network to make a reasonable prediction.

To achieve the above purpose, the invention adopts a method for predicting the success rate of patent application authorization based on the fusion of document vectors and a heterogeneous network, comprising the following steps: S1, obtaining historical invention data of published patent applicants and application companies from the national intellectual property office, including historically filed patents, the authorization results of historical patents, patent cluster information, and patent citation records, and constructing a heterogeneous information network that integrates team cooperation and patent citation; S2, filtering the patent text information and keeping the text of the abstract of the specification and the claims; S3, separately training a deep learning model based on natural language processing technology and a node classification model based on graph convolutional neural network technology; and S4, fusing the document vector and the heterogeneous information network to predict the success rate of patent application authorization.

Step S1 includes the following steps: S11, acquiring the data of all patent application documents from the national intellectual property office, where each document includes field information such as patent application number, application date, applicant, application company, abstract of the specification, claims, detailed description, and citations; S12, extracting the patent applicants and application companies as separate entities and constructing a heterogeneous information network containing three kinds of entities (patents, applicants, and application companies), that is, a heterogeneous information network that incorporates team cooperation relationships; S13, querying other patents in the same patent cluster according to the patent application and adding them to the heterogeneous information network as nodes; and S14, retrieving the reference relationships between every pair of patents and adding them to the heterogeneous information network as relationships between nodes. In the end, the nodes of the heterogeneous information network are of four types (patent applications, applicants, application companies, and other patents in a patent cluster), and its relationships are of three types (invention relationships, cooperation relationships, and reference relationships), so the network constructed by the invention is a heterogeneous information network integrating team cooperation relationships and patent reference relationships.
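As an illustration only, steps S12 to S14 could be sketched as follows. The record fields (`app_no`, `applicants`, `company`, `cluster`, `cites`), the node type label `cluster_patent`, and the reading that co-applicants of one patent are linked by a cooperation relationship are assumptions made for this sketch; the source does not specify a data layout or how each relationship type is derived.

```python
from itertools import combinations

# Hypothetical input records; all field names and values are illustrative.
records = [
    {"app_no": "P1", "applicants": ["Alice", "Bob"], "company": "AcmeCo",
     "cluster": ["P9"], "cites": ["P2", "P7"]},
    {"app_no": "P2", "applicants": ["Bob"], "company": "AcmeCo",
     "cluster": [], "cites": []},
]

def build_network(records):
    nodes = {}     # node id -> node type
    edges = set()  # (source, relation, target) triples
    # S12: patents, applicants and application companies become entities.
    for rec in records:
        patent = rec["app_no"]
        nodes[patent] = "patent"
        for person in rec["applicants"]:
            nodes[person] = "applicant"
            edges.add((person, "invention", patent))
        nodes[rec["company"]] = "company"
        edges.add((rec["company"], "invention", patent))
        # Assumed reading: co-applicants of the same patent cooperate.
        for a, b in combinations(sorted(rec["applicants"]), 2):
            edges.add((a, "cooperation", b))
    # S13: other patents of the same patent cluster are added as nodes.
    for rec in records:
        for other in rec["cluster"]:
            nodes.setdefault(other, "cluster_patent")
    # S14: reference edges between patents that are already in the network.
    for rec in records:
        for cited in rec["cites"]:
            if cited in nodes:
                edges.add((rec["app_no"], "reference", cited))
    return nodes, edges

nodes, edges = build_network(records)
```

In this toy input, `P7` is cited but is neither an application nor a cluster member, so the sketch adds no edge for it; a real implementation might treat such citations differently.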

Step S2 includes the following steps: S21, extracting the two fields "abstract of the specification" and "claims" from the patent application document; and S22, preprocessing the text by word segmentation and stop-word removal and arranging it into a corpus.
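A minimal sketch of steps S21 and S22 is given below. It assumes English text and a regular-expression tokenizer purely for illustration; the patent texts in the source are Chinese and would need a dedicated word segmenter, which the source does not name. The stop-word list and the field names are likewise hypothetical.

```python
import re

# Illustrative stop-word list; a real list would be far longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "is", "to"}

def preprocess(text):
    # Stand-in for word segmentation: lowercase alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop-word removal.
    return [t for t in tokens if t not in STOP_WORDS]

def build_corpus(documents):
    corpus = []
    for doc in documents:
        # S21: keep only the abstract and claims fields.
        text = doc["abstract"] + " " + doc["claims"]
        # S22: segment, drop stop words, and collect into the corpus.
        corpus.append(preprocess(text))
    return corpus

documents = [{
    "abstract": "A method for predicting the grant of a patent.",
    "claims": "1. A method comprising a network.",
    "description": "Long detailed description that is filtered out.",
}]
corpus = build_corpus(documents)
```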

Step S3 includes the following steps: S31, loading word vectors pre-trained on a large-scale corpus to give the text semantic information, learning latent word vector representations through a bidirectional recurrent neural network, and finally obtaining the document vector of the patent text after dimension transformation; and S32, taking the adjacency matrix of the heterogeneous information network as the input of the node classification model, performing active-learning-based convolution twice, and, in order to obtain more discriminative node representations, passing the output of the convolution layer through a fully connected layer to obtain a dense vector representation, thereby finally obtaining the feature vector of each patent node in the heterogeneous information network.
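The forward pass described in S32 (two graph convolutions over the adjacency matrix followed by a fully connected layer) could look roughly like the sketch below. The symmetric normalization with self-loops, the ReLU activations, the identity feature matrix, and all weight values are assumptions borrowed from common graph convolutional network practice; the source does not state which propagation rule it uses, and the active learning part is omitted here.

```python
import math

def matmul(a, b):
    # Plain matrix product of two lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def normalize_adjacency(adj):
    # Assumed rule: A_hat = D^(-1/2) (A + I) D^(-1/2).
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_forward(adj, features, w1, w2, w_fc):
    a_hat = normalize_adjacency(adj)
    h1 = relu(matmul(matmul(a_hat, features), w1))  # first convolution
    h2 = relu(matmul(matmul(a_hat, h1), w2))        # second convolution
    return matmul(h2, w_fc)                         # fully connected layer

# Toy 3-node path graph with untrained, hand-picked weights.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w1 = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
w2 = [[0.4, -0.2], [0.1, 0.3]]
w_fc = [[0.5, 0.1], [-0.3, 0.2]]
node_vectors = gcn_forward(adj, features, w1, w2, w_fc)
```

Each row of `node_vectors` is the dense vector of one node; in training, the weights would be learned from the labeled patent nodes rather than fixed by hand.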

The document vector representation in step S31 is calculated as follows. First, the two preprocessed fields of the patent text, the abstract of the specification and the claims, are taken as input, and large-scale pre-trained word vectors are loaded to obtain the corresponding word embedding representation of the text. Second, because the word sequences differ in length, the model truncates long texts and zero-pads short sequences. Then, during training, the word sequence and the word vectors are input together into an embedding layer for parameter learning, and the output of the embedding layer serves as the input of the following bidirectional LSTM layer; this step mainly learns the context and word-order relationships of the text through the bidirectional recurrent neural network. Finally, further training and extraction yield the document-level text feature vector.
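The truncation and zero-padding step can be sketched as below. The vocabulary, the reserved index 0 for padding, the index 1 for unknown words, and the choice to pad and truncate at the end of the sequence are assumptions for this example; the source only states that long texts are cut and short sequences are zero-padded.

```python
def encode(tokens, vocab, unknown_id=1):
    # Map each token to its integer id; unseen tokens share one id.
    return [vocab.get(token, unknown_id) for token in tokens]

def pad_or_truncate(ids, max_len, pad_id=0):
    # Cut sequences longer than max_len, zero-pad shorter ones.
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

vocab = {"method": 2, "patent": 3}  # hypothetical vocabulary
ids = encode(["method", "novel", "patent"], vocab)
padded = pad_or_truncate(ids, 5)
truncated = pad_or_truncate(ids, 2)
```

After this step every document has the same length, so a batch of documents can be passed to the embedding layer as a rectangular array.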

The vector representation of the network nodes in step S32 is calculated as follows. First, the heterogeneous information network is decomposed into a plurality of homogeneous bipartite graph networks, that is, graph networks consisting of nodes of only two types. Second, deep semantic information of the nodes in each bipartite graph network is learned at each convolution layer. Finally, the vector representations obtained for a node in the individual bipartite graph networks are concatenated and used as the final output feature of the node. In each iteration, the model selects a batch of the most valuable nodes based on the active learning strategy and updates their labels, and the label information is fed back to the network model as supervision information, thereby improving the classification effect. It should be noted that if a node is not an element of a given bipartite graph network, a zero vector is used to represent the output feature of the node in that bipartite graph network.
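The decomposition and the zero-vector rule can be illustrated with the sketch below. Grouping edges by the pair of node types they connect is an assumed way of forming the bipartite graphs, and the per-graph embeddings are hand-written placeholders standing in for the output of the convolution layers. How edges between two nodes of the same type (such as patent-to-patent references) are handled is not specified in the source; this sketch would simply place them in their own group.

```python
def split_by_type_pair(nodes, edges):
    # Group edges by the unordered pair of node types they connect.
    graphs = {}
    for src, dst in edges:
        key = tuple(sorted((nodes[src], nodes[dst])))
        graphs.setdefault(key, []).append((src, dst))
    return graphs

def concat_node_features(node, per_graph_embeddings, dim):
    # Concatenate the node's vector from every bipartite graph; a zero
    # vector stands in where the node is not part of that graph.
    out = []
    for embeddings in per_graph_embeddings:
        out.extend(embeddings.get(node, [0.0] * dim))
    return out

nodes = {"P1": "patent", "P2": "patent",
         "Alice": "applicant", "AcmeCo": "company"}
edges = [("Alice", "P1"), ("AcmeCo", "P1"), ("AcmeCo", "P2")]
graphs = split_by_type_pair(nodes, edges)

# Placeholder embeddings, one dict per bipartite graph (dimension 2).
emb_applicant_patent = {"Alice": [0.1, 0.2], "P1": [0.3, 0.4]}
emb_company_patent = {"AcmeCo": [0.5, 0.6], "P1": [0.7, 0.8],
                      "P2": [0.9, 1.0]}
per_graph = [emb_applicant_patent, emb_company_patent]

p1_vector = concat_node_features("P1", per_graph, 2)
p2_vector = concat_node_features("P2", per_graph, 2)
```

Here `P2` has no applicant edge, so the first half of its concatenated vector is the zero vector.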

Step S4 includes the following steps: S41, concatenating the document vector and the network vector, which come from two different vector spaces, along the x axis, and applying vector normalization to obtain the input features of the fusion model; and S42, feeding the input features into a three-layer fully connected network in which, after learning in the intermediate layer, the last layer outputs a numeric variable mapped to two dimensions, thereby obtaining the final prediction of authorization success or failure.
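A rough sketch of S41 and S42 follows. The L2 form of the vector normalization, the ReLU activations, the softmax on the two-dimensional output, the layer widths, and the random untrained weights are all assumptions; the source fixes only the concatenation, a normalization step, three fully connected layers, and a two-dimensional output.

```python
import math
import random

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dense(v, weights, bias):
    # One fully connected layer: weights has one row per output unit.
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse_and_predict(doc_vec, node_vec, layers):
    # S41: concatenate the two vectors and normalize the result.
    x = l2_normalize(doc_vec + node_vec)
    # S42: three fully connected layers, the last with two outputs.
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = relu(dense(x, w1, b1))
    h = relu(dense(h, w2, b2))
    return softmax(dense(h, w3, b3))

rng = random.Random(0)
layers = [init_layer(5, 4, rng), init_layer(4, 3, rng), init_layer(3, 2, rng)]
probs = fuse_and_predict([0.2, 0.4, 0.1], [0.7, 0.3], layers)
```

The two entries of `probs` can be read as scores for rejection and grant; with untrained weights the values are meaningless and only show the data flow.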

As shown in the algorithm architecture diagram of FIG. 5, the system of the present invention comprises four modules:

The data processing module is used for processing the text information, application team background information, and inter-patent citation information acquired from each site on a patent website, and for constructing a heterogeneous information network that integrates team cooperation relationships and patent citation relationships.

The text classification module is used for performing latent semantic learning on the patent text, starting from the pre-trained word vectors, with a classification model based on a bidirectional recurrent neural network, and for performing dimension transformation to finally obtain the document vector of the patent text. The classification model based on the bidirectional recurrent neural network comprises an embedding layer module, a bidirectional recurrent neural network module, a batch normalization module, a Dropout module, and a fully connected neural network module. The embedding layer module is used for converting the text of the patent abstract and the patent claims into dense vectors to obtain the target patent vectors. The bidirectional recurrent neural network module is used for computing sequence vectors over the preceding and following words in a passage of text to obtain the document vector corresponding to each patent. The batch normalization module is used for normalizing the weight parameters in the neural network so as to pull a skewed weight distribution back toward a normal distribution, and is a common and effective threshold control method in deep learning. The Dropout module is used for resetting weight parameters in the neural network with a certain probability while the other parameters remain unchanged, which preserves the generalization capability of the neural network and reduces the possibility of overfitting. The fully connected neural network module takes the vector produced by the Dropout module as input and applies linear transformations to it through a multilayer neural network to obtain the final patent document vector.

The node classification module is used for learning distributed representations of the nodes in the network with an active-learning-based heterogeneous network embedding model, obtaining dense vector representations through a fully connected layer, and finally obtaining the feature vector of each patent node in the heterogeneous information network. The active-learning-based heterogeneous network embedding model comprises a heterogeneous network discrimination module and an active learning module. The heterogeneous network discrimination module aggregates the information of neighbor nodes in the network through convolution operations, updates it to the current patent node, and concatenates the results from all bipartite graph networks as the feature vector of the node; if a node is not an element of a given bipartite graph network, a zero vector is used to represent the output vector of the node in that bipartite graph network. The active learning module combines three active learning strategies, namely network centrality, convolution information entropy, and convolution information density; in each learning iteration it computes the most uncertain and most representative nodes in the network and uses the labeling results of these nodes to improve the classification effect of the node classification model.
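The node selection performed by the active learning module might be sketched as follows. This example combines only two of the three named strategies, using degree centrality as the network centrality measure and the entropy of the predicted class probabilities as the uncertainty measure; the equal weighting, the omission of information density, and all data values are assumptions, since the source does not give the scoring formula.

```python
import math

def entropy(probs):
    # Shannon entropy of a predicted class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def degree_centrality(adjacency):
    n = len(adjacency)
    return {node: len(neighbors) / max(n - 1, 1)
            for node, neighbors in adjacency.items()}

def select_nodes(adjacency, predictions, labeled, k, alpha=0.5):
    centrality = degree_centrality(adjacency)
    max_entropy = math.log(2)  # two classes: granted or rejected
    scores = {}
    for node, probs in predictions.items():
        if node in labeled:
            continue  # only unlabeled nodes are candidates
        uncertainty = entropy(probs) / max_entropy
        scores[node] = alpha * uncertainty + (1 - alpha) * centrality[node]
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)[:k]

adjacency = {"P1": {"P2", "P3"}, "P2": {"P1"}, "P3": {"P1"}, "P4": set()}
predictions = {"P1": [0.5, 0.5], "P2": [0.9, 0.1],
               "P3": [0.6, 0.4], "P4": [0.5, 0.5]}
labeled = {"P2"}
chosen = select_nodes(adjacency, predictions, labeled, k=2)
```

In this toy graph `P1` is both maximally uncertain and the best connected node, so it is selected first; `P3` follows because its connectivity outweighs the slightly higher uncertainty of the isolated node `P4`.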

The feature fusion module is used for fusing, at the feature level, the two deep learning models for the document vector and the heterogeneous information network, predicting the authorization success rate of the patent application, and obtaining the final prediction of whether a patent will be granted or rejected.

In summary, the method, the system, and the electronic device for predicting the success rate of patent application authorization can predict whether a patent application will ultimately be granted or rejected, are no longer limited by manual design, and are data-driven.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A method for predicting the success rate of patent application authorization, characterized by comprising the following steps:

acquiring historical invention data of published patent applicants and application companies from the national intellectual property office, and constructing a heterogeneous information network;

filtering the patent text information of the historical invention data, keeping only the text of the abstract and the claims, preprocessing that text by word segmentation and stop-word removal, and arranging the preprocessed text into a corpus set;

respectively training a deep learning model based on a natural language processing technology and a node classification model based on a graph convolution neural network technology on the corpus set to obtain a document vector and a feature vector; and

fusing the document vector, the feature vector and the heterogeneous information network to predict the success rate of the authorization of the patent application.
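By way of illustration only, the filtering and preprocessing step of claim 1 might be sketched as follows. The field names `abstract`, `claims` and `description`, the stop-word list, and the regular-expression tokenizer are all assumptions made for this sketch; Chinese patent text would need a dedicated word segmenter in place of the simple tokenizer used here.

```python
import re

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def build_corpus(records, stop_words=STOP_WORDS):
    """Keep only abstract and claims text, tokenize it, and drop stop words.

    `records` is a list of dicts; any field other than 'abstract' and
    'claims' (for example a detailed description) is ignored.
    """
    corpus = []
    for rec in records:
        text = " ".join(rec.get(field, "") for field in ("abstract", "claims"))
        # Stand-in for word segmentation: lowercase alphanumeric runs.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        corpus.append([tok for tok in tokens if tok not in stop_words])
    return corpus
```

Each entry of the returned list is the token sequence for one patent, which is the form the two models of the following step would consume.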

2. The method for predicting success rate of patent application authorization according to claim 1, wherein the nodes of the heterogeneous information network include patent applications, applicants, application companies, and other patents in a patent cluster, and the relationships of the heterogeneous information network include an invention relationship, a cooperation relationship, and a reference relationship.

3. The method for predicting success rate of patent application authorization according to claim 1, wherein each item of the historical invention data includes the following fields: patent application number, application date, applicant, application company, abstract, claims, detailed description, and citations.

4. The method of claim 1, wherein constructing the heterogeneous information network comprises: extracting the patent applicant and the application company respectively as entities, constructing the heterogeneous information network comprising three kinds of entities, namely patents, applicants and application companies, querying other patents in the same patent cluster according to the patent application, and adding those patents to the heterogeneous information network as nodes.
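The construction described in claims 2 and 4 might be sketched, under stated assumptions, as a typed node table plus a set of labeled edges. The record keys `app_no`, `applicants`, `company` and `cluster` are hypothetical, as is the choice to link an application company to its patent with the same "invention" relation used for applicants.

```python
def build_hin(records):
    """Build a small heterogeneous information network from patent records.

    Returns (nodes, edges): `nodes` maps a node id to its type, and `edges`
    is a set of (source id, relation, target id) triples.
    """
    nodes = {}
    edges = set()
    for rec in records:
        patent = ("patent", rec["app_no"])
        nodes[patent] = "patent"

        company = ("company", rec["company"])
        nodes[company] = "company"
        edges.add((company, "invention", patent))

        # Sort so each cooperating pair is stored once, in a fixed order.
        people = sorted(("applicant", name) for name in rec["applicants"])
        for person in people:
            nodes[person] = "applicant"
            edges.add((person, "invention", patent))
        for i, first in enumerate(people):
            for second in people[i + 1:]:
                edges.add((first, "cooperation", second))

        # Other patents in the same cluster are added as extra patent nodes.
        for other in rec.get("cluster", []):
            member = ("patent", other)
            nodes.setdefault(member, "patent")
            edges.add((patent, "reference", member))
    return nodes, edges
```

An adjacency matrix for a graph model can then be derived by numbering the keys of `nodes` and marking one entry per edge.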

5. The method of claim 1, wherein training the deep learning model based on natural language processing technology on the corpus set comprises: loading word vectors pre-trained on a large-scale corpus to give the text semantic information, learning latent word vector representations through a bidirectional recurrent neural network, and performing dimension exchange to obtain the document vector of the patent text.
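As a rough sketch of the forward computation in claim 5, the code below runs a plain tanh recurrent cell over the word vectors in both directions and averages the joined hidden states into one document vector. The cell type, the mean pooling, the parameter names and the random untrained weights are assumptions of this sketch; the claim does not specify them, and a real model would learn the weights during training.

```python
import math
import random


def init_params(embed_size, hidden_size, seed=0):
    """Create random (untrained) weights for the two directions."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "fw_in": matrix(hidden_size, embed_size),
        "fw_rec": matrix(hidden_size, hidden_size),
        "bw_in": matrix(hidden_size, embed_size),
        "bw_rec": matrix(hidden_size, hidden_size),
    }


def rnn_pass(vectors, w_in, w_rec):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    states = []
    for x in vectors:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hk for w, hk in zip(w_rec[j], h))
            )
            for j in range(hidden_size)
        ]
        states.append(h)
    return states


def document_vector(tokens, embeddings, params):
    """Encode tokens with a bidirectional RNN and mean-pool over time."""
    hidden_size = len(params["fw_rec"])
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * (2 * hidden_size)
    forward = rnn_pass(vectors, params["fw_in"], params["fw_rec"])
    backward = rnn_pass(vectors[::-1], params["bw_in"], params["bw_rec"])[::-1]
    joined = [f + b for f, b in zip(forward, backward)]
    return [sum(column) / len(joined) for column in zip(*joined)]
```

Tokens without a pre-trained vector are skipped here; mapping them to a shared unknown-word vector would be an equally reasonable choice.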

6. The method of claim 1, wherein training the node classification model based on graph convolutional neural network technology on the corpus set comprises: taking the adjacency matrix of the heterogeneous information network as the input of the node classification model, performing two convolutions based on active learning, and applying a fully connected layer to the output of the convolution layers to produce a dense vector representation, thereby obtaining the feature vector of each patent node in the heterogeneous information network.
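The two convolutions followed by a fully connected layer in claim 6 might look like the forward pass below. It assumes the widely used symmetric normalization with self-loops, D^(-1/2) (A + I) D^(-1/2), and ReLU activations; the claim names neither, and the "active learning" aspect is not modeled. All function and parameter names are hypothetical, and the weights are supplied by the caller rather than trained.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for an adjacency matrix A."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    degree = [sum(row) for row in with_loops]  # at least 1 thanks to self-loops
    return [
        [with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


def gcn_features(adj, x, w0, w1, w_fc):
    """Two graph convolutions, then a dense layer, giving one row per node."""
    a_norm = normalize_adjacency(adj)
    h1 = relu(matmul(matmul(a_norm, x), w0))
    h2 = relu(matmul(matmul(a_norm, h1), w1))
    return matmul(h2, w_fc)
```

Row i of the result is the feature vector of node i; the rows belonging to patent nodes are the ones passed on to the fusion step.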

7. The method for predicting the success rate of patent application authorization according to claim 1, wherein fusing the document vector, the feature vector and the heterogeneous information network, and predicting the success rate of patent application authorization comprises:

concatenating the document vector, the feature vector and the heterogeneous information network along the X axis, and applying vector normalization to obtain the input features of a fusion model;

and feeding the input features into a three-layer fully connected network in which, after learning in the intermediate layer, the last layer outputs a numerical variable mapped to two dimensions, thereby obtaining the final prediction of authorization success or failure.
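A minimal sketch of the fusion step in claim 7, under stated assumptions: the three inputs are concatenated, L2-normalized, and passed through three fully connected layers, the last of which has two outputs. Representing the heterogeneous information network by a hypothetical vector `graph_vec` (for example, the patent's row of the adjacency matrix), using ReLU in the first two layers, and turning the two outputs into probabilities with softmax are choices of this sketch, not statements of the claim.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def dense(vector, weights, bias):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_grant(doc_vec, node_vec, graph_vec, layers):
    """Fuse the three inputs and return two class probabilities.

    `layers` is a list of three (weights, bias) pairs; the last pair must
    produce exactly two outputs.
    """
    x = l2_normalize(list(doc_vec) + list(node_vec) + list(graph_vec))
    for weights, bias in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, weights, bias)]
    weights, bias = layers[-1]
    return softmax(dense(x, weights, bias))
```

Which of the two outputs stands for "granted" is fixed by how the training labels are encoded, so a caller would read the prediction from that index.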

8. A system for predicting success rate of patent application authorization, comprising:

the data processing module is used for processing historical invention data of published patent applicants and application companies acquired from the national intellectual property office and constructing a heterogeneous information network;

the text classification module is used for filtering the patent text information of the historical invention data, keeping only the text of the abstract and the claims, preprocessing that text by word segmentation and stop-word removal, and arranging the preprocessed text into a corpus set;

the node classification module is used for respectively training a deep learning model based on a natural language processing technology and a node classification model based on a graph convolution neural network technology on the corpus set to obtain a document vector and a feature vector; and

the feature fusion module is used for fusing the document vector, the feature vector and the heterogeneous information network and predicting the success rate of the authorization of the patent application.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for predicting the success rate of patent application authorization according to any one of claims 1 to 7.