Disclosure of Invention
In order to solve the technical problems, the invention provides a pre-application patent quality assessment method and system based on clustering center characterization.
The first aspect of the invention provides a pre-application patent quality assessment method based on clustering center characterization, which comprises the following steps:
extracting keywords from the patent text input by a user for retrieval, generating in the patent big data a sub-data set whose feature similarity meets a preset standard, and generating central representations of the sub-data set through a clustering model;
intercepting the patent information to be predicted from the patent text input by the user, and generating a text representation of the patent information to be predicted;
calculating the similarity between the text representation of the patent information to be predicted and the central representations, and generating constraint information based on the similarity and patent quality indexes;
and training a patent quality evaluation model with the constraint information, and obtaining a multidimensional quality evaluation result for the patent input by the user through the patent quality evaluation model.
In this scheme, extracting keywords from the patent text input by the user for retrieval and generating, in the patent big data, a sub-data set whose feature similarity meets the preset standard specifically comprises:
obtaining the patent text input by the user, performing word segmentation preprocessing, generating a serialized representation of the patent text, determining the part-of-speech tag of each word in the serialized representation, and performing sequence labeling with the part-of-speech tags;
splitting the serialized representation of the patent text into chunks and embedding them with RoBERTa to obtain the embedded vectors of the patent text, screening preset phrases by the part-of-speech tags, selecting the corresponding embedded vectors by matching the position features of the preset phrases, and splicing the selected embedded vectors to obtain spliced embedded vectors;
introducing a self-attention mechanism for the embedded vectors of the patent text and strengthening their features through weighting by the self-attention weights, and introducing cross attention between the spliced embedded vectors and the embedded vectors to obtain neighborhood embedded vectors of the spliced embedded vectors and strengthen the context semantics;
obtaining the attention-encoded embedded vector sequence and the neighborhood embedded vector sequence, calculating the similarity between the embedded vectors and the neighborhood embedded vectors in the sequences, and decoding the spliced embedded vectors whose similarity meets a preset similarity threshold, the decoded result being used as the keyword extraction result;
and establishing a search index according to the keywords, and using the search index to calculate the feature similarity of the keywords in the massive patent big data, so that the patent data meeting the preset similarity standard is obtained to construct the sub-data set containing the keywords.
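The retrieval step can be illustrated with a minimal sketch. It assumes the keywords have already been extracted and that every patent in the corpus is represented by a keyword set. The inverted index, the Jaccard measure, the threshold value and the names `build_index` and `build_sub_dataset` are illustrative assumptions; the scheme itself does not fix the index structure or the similarity measure.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each keyword to the ids of the patents that contain it."""
    index = defaultdict(set)
    for patent_id, keywords in corpus.items():
        for kw in keywords:
            index[kw].add(patent_id)
    return index


def jaccard(a, b):
    """Jaccard similarity between two keyword sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_sub_dataset(query_keywords, corpus, index, threshold=0.3):
    """Return the ids of patents whose keyword similarity meets the threshold."""
    query = set(query_keywords)
    # Only patents sharing at least one keyword with the query are candidates.
    candidates = set()
    for kw in query:
        candidates |= index.get(kw, set())
    return sorted(
        pid for pid in candidates if jaccard(query, set(corpus[pid])) >= threshold
    )


# Hypothetical toy corpus: patent id -> keyword set.
corpus = {
    "P1": {"clustering", "patent", "quality"},
    "P2": {"battery", "electrode"},
    "P3": {"patent", "retrieval", "clustering", "embedding"},
}
index = build_index(corpus)
sub_dataset = build_sub_dataset({"clustering", "patent"}, corpus, index)
```

With this toy corpus, P2 shares no keyword with the query and is never scored, which is the purpose of the index when the corpus is large.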
In this scheme, the central representation of the sub-dataset is generated by a clustering model, specifically:
optimizing the initial cluster centers of the sub-data set with a sparrow search algorithm: initializing the parameters of the sparrow search algorithm, calculating the fitness values of the sparrow population, and obtaining the best fitness value, the worst fitness value and the corresponding positions;
selecting discoverers, joiners and scouts and updating their positions, introducing adaptive t-distribution mutation while the sparrow positions are updated, iteratively calculating the fitness and updating the sparrow positions, and, once the maximum number of iterations is reached, outputting the optimal sparrow to obtain a cluster center matrix;
obtaining the initial cluster centers from the cluster center matrix, using the Euclidean distance as the metric, assigning each item of patent data in the sub-data set to its nearest initial cluster center, and updating the cluster center of each cluster after all patent data have been assigned;
and obtaining the final clustering result of the sub-data set through iterative clustering, and generating the central representations of the sub-data set from the resulting clusters.
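The assignment and update steps can be sketched as follows, assuming the initial centers have already been produced by the sparrow search stage (which is not reproduced here) and that each item of patent data is a numeric feature vector. The function name `kmeans_from_centers` and the stopping tolerance are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_from_centers(data, init_centers, max_iter=100, tol=1e-9):
    """Refine externally supplied initial centers by iterative clustering."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in data
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(data, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep the old center for an empty cluster
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, labels


# Hypothetical two-dimensional feature vectors and initial centers.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans_from_centers(data, [[1.0, 1.0], [9.0, 9.0]])
```

The returned centers play the role of the central representations, one per cluster.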
In this scheme, intercepting the patent information to be predicted from the patent text input by the user and generating the text representation of the patent information to be predicted specifically comprises:
intercepting the patent text input by the user according to preset paragraph positions and indication keywords to generate the patent information to be predicted, extracting the embedded vectors of the patent text corresponding to the patent information to be predicted, and dividing them into word embedded vectors, sentence embedded vectors and paragraph embedded vectors;
importing the embedded vectors of the patent information to be predicted into a bidirectional long short-term memory (BiLSTM) network, introducing an attention mechanism so that the forward LSTM and the reverse LSTM process the embedded vectors of the different levels, combining the forward and reverse results through the hidden layer, and outputting the semantic features corresponding to the embedded vectors of the patent information to be predicted;
and matching the semantic features with the embedded vectors of the different levels corresponding to the patent information to be predicted to generate the text representation of the patent information to be predicted.
In this scheme, constraint information is generated based on the similarity and the patent quality index, specifically:
calculating the similarity between the text representation of the patent information to be predicted and all central representations of the sub-data set as the cosine similarity between the embedded vectors after dimension alignment, and, when the cosine similarity is larger than a preset threshold, extracting the semantic features of the patent information to be predicted at the corresponding position and correcting the similarity with the semantic features;
traversing the patent information to be predicted to obtain all similarities, calculating the mean of their absolute values to generate the average similarity, and taking the reciprocal of the average similarity as one item of the constraint information;
obtaining patent quality evaluation examples with a big data engine, extracting patent quality evaluation indexes from the examples, and performing principal component analysis on the indexes to identify key influencing factors;
obtaining, from the patent quality evaluation examples, the interaction relations between the key influencing factors and the patent text and between different key influencing factors, constructing triples based on the different interaction relations and the attributes corresponding to the key influencing factors, and learning the graph structure with a knowledge graph convolutional neural network to construct a knowledge graph;
and obtaining, in the knowledge graph, the number of relation edges directly connected to each key influencing factor to calculate the node centrality, using the centrality to characterize the importance of the key influencing factor, selecting a preset number of key influencing factors according to importance, and obtaining the constraint information composed of the corresponding index variables.
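The similarity-based item of the constraint information can be sketched as below, assuming the text representation and the central representations are already dimension-aligned vectors. The similarity correction by semantic features is omitted, and the function name `inverse_mean_similarity` and the small guard value `eps` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def inverse_mean_similarity(text_vec, center_vecs, eps=1e-12):
    """Reciprocal of the mean absolute cosine similarity to all centers."""
    sims = [abs(cosine(text_vec, c)) for c in center_vecs]
    mean_sim = sum(sims) / len(sims)
    return 1.0 / max(mean_sim, eps)  # eps guards against division by zero


# Hypothetical vectors: the text matches one center and is orthogonal to the other.
constraint = inverse_mean_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A text that is close to every center yields a value near 1, while a text far from all centers yields a large value.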
In the scheme, the patent quality evaluation model is trained by using constraint information, and specifically comprises the following steps:
constructing the patent quality evaluation model, training a corresponding encoder for each category of patent quality index in the constraint information with the training data of that category, and using the encoders of the different patent quality indexes to extract index features from the text representation of the patent information to be predicted;
inputting the index features of the patent information to be predicted, combined with the reciprocal of the average similarity between the text representation and the central representations, into different multi-layer perceptrons to obtain a feature importance matrix, applying cooperative attention to obtain the attention distribution over the feature importance, and obtaining by weighted calculation the representations of the patent information to be predicted under the different constraints;
and fully connecting the index features with the weighted representations, outputting a vector through the multi-layer perceptron interaction, converting the output vector into a probability distribution to obtain the predicted evaluation, and scoring with the MSE evaluation index to obtain the quality evaluation result of the patent information to be predicted.
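The last step, converting an output vector into a probability distribution and scoring it with the MSE, can be sketched as follows. The softmax conversion is an assumption, since the scheme does not name the conversion; the output vector and the reference label below are hypothetical values.

```python
import math


def softmax(logits):
    """Convert an output vector into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(predicted, target):
    """Mean squared error between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical output vector of the interaction network over three quality grades.
probs = softmax([2.0, 1.0, 0.1])
# Hypothetical one-hot reference label: the first grade is the true one.
score = mse(probs, [1.0, 0.0, 0.0])
```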
The second aspect of the invention also provides a pre-application patent quality assessment system based on clustering center characterization, which comprises a memory, a processor, a user interaction module, an assessment data set generation module, a quality assessment module and a data storage management module, wherein the memory stores, and the processor executes, a program of the pre-application patent quality assessment method based on clustering center characterization;
the user interaction module is used by the user to input a keyword group that determines the patent data subset for assessment, serves as the input window for the patent information to be predicted, and returns and displays the assessment result to the user after the system assessment;
the evaluation data set generation module is used for generating a sub-data set from the patent big data set according to the keyword group provided by the user;
the quality evaluation module is responsible for carrying out quality evaluation based on the patent information to be evaluated and the sub-data set;
and the data storage management module is responsible for storing the patent big data set and the patent subsets generated from the user's keyword groups, which facilitates running non-real-time assessment tasks.
The invention discloses a pre-application patent quality assessment method and system based on clustering center characterization. The method comprises: extracting keywords from the patent text input by a user for retrieval, generating in the patent big data a sub-data set with similar features, and generating center characterizations of the sub-data set through a clustering model; intercepting the patent information to be predicted from the patent text input by the user and generating its text characterization; calculating the similarity between the text characterization of the patent information to be predicted and the center characterizations, and generating constraint information based on the similarity in combination with patent quality indexes; and training a patent quality assessment model with the constraint information to obtain a multi-dimensional quality assessment result for the patent input by the user. While solving the problem of comparison against massive data, the multi-dimensional quality analysis of the invention can rapidly analyze the patent that the user plans to file, which helps improve the success rate of the user's application, cultivate high-value patents, and reduce the cost of patent applications for enterprises.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
FIG. 1 shows a flow chart of a pre-application patent quality assessment method based on cluster center characterization of the present invention.
As shown in fig. 1, a first aspect of the present invention provides a method for evaluating the quality of a pre-application patent based on cluster center characterization, including:
S102, extracting keywords based on patent text input by a user for searching, generating a sub-data set with feature similarity meeting a preset standard in patent big data, and generating a central representation of the sub-data set through a clustering model;
S104, intercepting patent information to be predicted from a patent text input by a user, and generating a text representation of the patent information to be predicted;
S106, calculating the similarity between the to-be-predicted patent information and the central representation, and generating constraint information based on the similarity and the patent quality index;
S108, training a patent quality evaluation model by using constraint information, and obtaining a multidimensional quality evaluation result for the patent input by the user through the patent quality evaluation model.
It should be noted that the patent text input by the user is obtained and preprocessed by word segmentation, normalization, stop-word filtering and the like to generate a serialized representation of the patent text. The part-of-speech tags of the words in the serialized representation, including prepositions, adjectives, nouns, proper nouns and the like, are determined, and sequence labeling is performed with the part-of-speech tags. The serialized representation is split into chunks and embedded with RoBERTa; after RoBERTa encoding, the embedded vectors are associated with one another, which enhances semantic learning and yields the semantic features expressed in different contexts. After the embedded vectors of the patent text are obtained, preset phrases are screened by the part-of-speech tags, the corresponding embedded vectors are selected by matching the position features of the preset phrases, and the selected embedded vectors are spliced, unified in length and transformed in dimension by a linear layer to obtain spliced embedded vectors. A self-attention mechanism is introduced for the embedded vectors of the patent text, and their features are strengthened through weighting by the self-attention weights; cross attention is introduced between the spliced embedded vectors and the embedded vectors to obtain neighborhood embedded vectors of the spliced embedded vectors and strengthen the context semantics. The attention-encoded embedded vector sequence and the neighborhood embedded vector sequence are then obtained, where the embedded vector sequence contains global semantics and the neighborhood embedded vector sequence is rich in local context semantics. After dimension reduction, the similarity between the embedded vectors and the neighborhood embedded vectors in the sequences is calculated and used as the basis for judging whether a spliced embedded vector is important; the spliced embedded vectors whose similarity meets the preset similarity threshold are decoded to obtain the keywords. A search index is established according to the keywords, the feature similarity of the keywords is calculated in the massive patent big data, and the patent data meeting the preset similarity standard forms the sub-data set.
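The phrase screening by part-of-speech tags and the position-based splicing of embedded vectors can be sketched as follows. The sketch assumes the tokens have already been tagged and embedded; the tag names (Universal POS style), the rule that a phrase is a maximal run of allowed tags, and both function names are assumptions, not the embodiment's exact procedure.

```python
def candidate_phrases(tagged_tokens, allowed=("ADJ", "NOUN", "PROPN")):
    """Collect maximal runs of allowed tags as (start, end, phrase) spans."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged_tokens):
        if tag in allowed:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tagged_tokens)))
    return [
        (s, e, " ".join(tok for tok, _ in tagged_tokens[s:e])) for s, e in spans
    ]


def spliced_embedding(token_embeddings, start, end):
    """Concatenate the token embeddings covered by a phrase span."""
    return [x for vec in token_embeddings[start:end] for x in vec]


# Hypothetical tagged sentence.
tagged = [
    ("a", "DET"), ("clustering", "NOUN"), ("center", "NOUN"),
    ("is", "VERB"), ("computed", "VERB"), ("for", "ADP"),
    ("patent", "NOUN"), ("data", "NOUN"),
]
phrases = candidate_phrases(tagged)
```

Because the spans carry token positions, the same positions select the matching rows of the embedding sequence, which is what the position feature matching relies on.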
FIG. 2 illustrates a flow chart of the present invention for generating a central representation of a sub-dataset.
According to the embodiment of the invention, the central representation of the sub-data set is generated through a clustering model, specifically:
S202, optimizing an initial clustering center of a sub-data set by utilizing a sparrow search algorithm, initializing parameters of the sparrow search algorithm, calculating fitness values in a sparrow population, and obtaining an optimal fitness value, a worst fitness value and corresponding positions;
S204, selecting discoverers, joiners and scouts and updating their positions, introducing adaptive t-distribution mutation while the sparrow positions are updated, iteratively calculating the fitness and updating the sparrow positions, and, once the maximum number of iterations is reached, outputting the optimal sparrow to obtain a cluster center matrix;
S206, acquiring an initial cluster center according to the cluster center matrix, using Euclidean distance as a measurement function, distributing the patent data in the sub-data set to the initial cluster center closest to the initial cluster center, and updating the cluster center in different clusters after the distribution of all the patent data is finished;
S208, obtaining a final clustering result of the sub-data set through iterative clustering, and generating a center representation of the sub-data set according to the partitioned different clusters.
It should be noted that the K-means clustering algorithm runs fast, but when the initial cluster centers are selected randomly, center points that are too concentrated or too dispersed lead to a poor clustering result and affect the accuracy of the center characterizations after clustering. The sparrow search algorithm has advantages such as fast convergence and reduces the influence of the initial cluster centers on the clustering result. The parameters of the sparrow search algorithm are initialized by setting the maximum number of iterations, the population size, the number of discoverers, the number of scouts and the alarm value. The discoverers are the sparrows with the best positions, the remaining sparrows are joiners, and the scouts are selected at random from the population; the higher the fitness of a sparrow, the higher its priority in obtaining food. The cluster center matrix with the best fitness is found from the sparrow search algorithm and the input number of clusters. To prevent the algorithm from falling into a local optimum, adaptive t-distribution mutation is applied to the sparrow positions; the t distribution combines the characteristics of the Cauchy distribution and the Gaussian distribution and balances global exploration and local exploitation. The latest position is obtained after mutation, and the position matrix is updated if the latest position is better than the previous optimal position, until the optimal position matrix is output as the cluster center matrix. Clustering usually yields more than one center characterization; for example, if N categories are obtained after cluster analysis, there are N corresponding center characterizations.
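The mutation step alone can be sketched as below; the full sparrow search loop is not reproduced. Three points are assumptions rather than details stated in the embodiment: the degrees of freedom are set equal to the iteration count, the perturbation has the form x + x·t, and the fitness is treated as a cost to be minimized (for example, the summed distance of samples to their nearest center).

```python
import math
import random


def student_t(df, rng):
    """Draw one Student-t sample as Z / sqrt(chi2(df) / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def t_mutate(position, iteration, cost, rng):
    """Perturb a position and keep the mutant only if its cost is lower."""
    df = max(1, iteration)  # assumed schedule: degrees of freedom = iteration count
    mutant = [x + x * student_t(df, rng) for x in position]
    return mutant if cost(mutant) < cost(position) else position


# Hypothetical cost function and starting position.
rng = random.Random(0)
cost = lambda p: sum(x * x for x in p)
position = [1.0, -2.0]
for iteration in range(1, 21):
    position = t_mutate(position, iteration, cost, rng)
```

With one degree of freedom the t distribution is the Cauchy distribution, whose heavy tails favor large exploratory jumps; as the degrees of freedom grow it approaches the Gaussian distribution, so later iterations make smaller local moves.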
It should be noted that the patent text input by the user is intercepted according to preset paragraph positions, such as the claims or the abstract of the specification, and indication keywords to generate the patent information to be predicted. The embedded vectors of the patent text corresponding to the patent information to be predicted are extracted and divided into word embedded vectors, sentence embedded vectors and paragraph embedded vectors; embedded vectors at different positions and levels improve the efficiency of text semantic recognition. The embedded vectors of the patent information to be predicted are imported into a bidirectional long short-term memory (BiLSTM) network, and an attention mechanism is introduced so that the forward LSTM and the reverse LSTM process the embedded vectors of the different levels. The forward LSTM operates on the embedded vector input at time t and the output at time t-1 to obtain the forward output at time t; the reverse LSTM operates on the embedded vector input at time t and the output at time t+1 to obtain the reverse output at time t. The forward and reverse outputs are combined through the hidden layer, and the semantic features corresponding to the embedded vectors of the patent information to be predicted are output. The semantic features are matched, according to the position codes, with the embedded vectors of the different levels corresponding to the patent information to be predicted to generate the text characterization of the patent information to be predicted.
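The interception step and the division into levels can be sketched as follows, before any embedding or BiLSTM processing. The sketch assumes each indication keyword appears as a heading on its own line and that a section runs until the next indicated heading; the heading names, the regular expressions and both function names are illustrative assumptions.

```python
import re


def intercept_sections(text, headings=("Abstract", "Claims")):
    """Return {heading: body} for each indicated heading found on its own line."""
    pattern = re.compile(
        r"^(%s)[ \t]*$" % "|".join(re.escape(h) for h in headings),
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section ends where the next indicated heading starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections


def split_levels(body):
    """Divide a section into paragraph, sentence and word units."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?;])\s+", body) if s]
    words = re.findall(r"\w+", body)
    return {"paragraphs": paragraphs, "sentences": sentences, "words": words}


# Hypothetical patent text.
sample = (
    "Abstract\nA method clusters patents. It scores drafts.\n\n"
    "Claims\n1. A method comprising clustering.\n"
)
sections = intercept_sections(sample)
levels = split_levels(sections["Abstract"])
```

Each unit produced by `split_levels` would then be embedded at its own level.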
FIG. 3 shows a flow chart of the present invention for constructing a patent quality assessment model.
According to the embodiment of the invention, the patent quality assessment model is trained by using constraint information, and specifically comprises the following steps:
S302, constructing a patent quality evaluation model, training corresponding encoders through training data of patent quality indexes of different categories in constraint information, and extracting index features from text characterization of the to-be-predicted patent information by utilizing the encoders of the different patent quality indexes;
S304, inputting the index features of the patent information to be predicted, combined with the reciprocal of the average similarity between the text characterization and the center characterizations, into different multi-layer perceptrons to obtain a feature importance matrix, applying cooperative attention to obtain the attention distribution over the feature importance, and obtaining by weighted calculation the characterizations of the patent information to be predicted under the different constraints;
S306, fully connecting the index features with the weighted characterizations, outputting a vector through the multi-layer perceptron interaction, converting the output vector into a probability distribution to obtain the predicted evaluation, and scoring with the MSE evaluation index to obtain the quality evaluation result of the patent information to be predicted.
It should be noted that the similarity between the text characterization of the patent information to be predicted and all center characterizations of the sub-data set is calculated as the cosine similarity between the embedded vectors after dimension alignment. When the cosine similarity is larger than a preset threshold, the semantic features of the patent information to be predicted at the corresponding position are extracted and used to correct the similarity. All similarities of the patent information to be predicted are traversed, the mean of their absolute values is calculated to generate the average similarity, and the reciprocal of the average similarity is taken; this reciprocal reflects the probability of grant to a certain extent and serves as one of the quality evaluation indexes in the constraint information. Patent quality evaluation examples are obtained with a big data engine, and patent quality evaluation indexes are extracted from them, such as the number and length of claims, the number of patent citations, the number of non-patent literature citations, the technology life cycle, the patent classification, the number of patent family members and the numbers of inventors and applicants. Principal component analysis is performed on the patent quality evaluation indexes to identify the key influencing factors. According to the patent quality evaluation examples, the interaction relations between the key influencing factors and the patent text and between different key influencing factors are obtained, triples are constructed based on the different interaction relations and the attributes corresponding to the key influencing factors, and the graph structure is learned with a knowledge graph convolutional neural network to construct a knowledge graph. In the knowledge graph, the number of relation edges directly connected to each key influencing factor is obtained to calculate the node centrality, and the centrality characterizes the importance of the key influencing factor: a more central node is more important than other nodes. A preset number of key influencing factors are selected according to importance, and the constraint information composed of the corresponding index variables is obtained.
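The centrality-based selection can be sketched with degree centrality computed directly from the triples; the graph learning by the knowledge graph convolutional neural network is not reproduced. The normalization by (number of nodes - 1), the alphabetical tie-break, the factor names and the relation names are assumptions for illustration.

```python
from collections import Counter


def degree_centrality(triples):
    """Number of directly connected relation edges per node, normalised by (n - 1)."""
    degree = Counter()
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    n = len(degree)
    return {node: (d / (n - 1) if n > 1 else 0.0) for node, d in degree.items()}


def top_factors(triples, factors, k):
    """Select the k factors with the highest centrality (ties broken by name)."""
    centrality = degree_centrality(triples)
    ranked = sorted(factors, key=lambda f: (-centrality.get(f, 0.0), f))
    return ranked[:k]


# Hypothetical triples between key influencing factors and the patent text.
triples = [
    ("claim_count", "affects", "patent_text"),
    ("citation_count", "affects", "patent_text"),
    ("claim_count", "correlates_with", "citation_count"),
    ("family_size", "affects", "patent_text"),
    ("claim_count", "correlates_with", "family_size"),
]
selected = top_factors(triples, ["claim_count", "citation_count", "family_size"], 2)
```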
A patent quality evaluation model is constructed. A multi-scale encoder module extracts index features from the text characterization of the patent information to be predicted, giving the representation of the text characterization under each patent evaluation index variable in the constraint information, and a cooperative attention mechanism estimates the different importance of the different representations. The attention distribution is calculated from the representations of the text characterization corresponding to the j-th and n-th patent evaluation index variables, where m denotes the total number of representations. The index features before the attention mechanism and the index representations after the attention mechanism are connected correspondingly, a result is output through the multi-layer perceptron interaction network, and scoring is performed with the MSE evaluation index to obtain the quality evaluation result of the patent information to be predicted. The smaller the MSE, the closer the predicted value output by the quality evaluation model is to the true value, and the more reliable the resulting quality evaluation of the patent information to be predicted.
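The weighting of the m representations can be sketched as below. This is one common form of attention, a softmax normalization over per-representation importance scores, and is an assumption: it is not presented as the calculation formula of the embodiment. The scores are assumed to be given, for example read from the feature importance matrix.

```python
import math


def attention_weights(scores):
    """Softmax-normalise m importance scores into an attention distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_representation(representations, scores):
    """Combine m equal-length representations with their attention weights."""
    weights = attention_weights(scores)
    dim = len(representations[0])
    return [
        sum(w * rep[d] for w, rep in zip(weights, representations))
        for d in range(dim)
    ]


# Hypothetical representations under two index variables with equal scores.
combined = weighted_representation([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Equal scores give equal weights, so the combined representation here is the plain average of the two inputs; a higher score shifts the result toward the corresponding representation.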
It should be noted that the historical quality evaluation results of an enterprise's patent texts are obtained, a drafting profile of the enterprise's patent texts is constructed from these results, and a personalized database is built from the drafting profiles of different time periods. In the personalized database, the quality evaluation indexes that differ markedly from the other quality evaluation indexes in the drafting profile are marked, and improvement directions for patent drafting are generated in the current drafting workflow according to the marked indexes. Tracing is performed along the improvement directions based on an ant colony algorithm, the influencing factors of the abnormal quality evaluation indexes are obtained from the tracing paths, optimization measures are retrieved according to the influencing factors to improve the drafting process of the patent text and the technical disclosure document, and the corresponding drafting workflow is updated.
FIG. 4 shows a block diagram of a pre-application patent quality assessment system of the present invention based on cluster center characterization.
The invention also provides a pre-application patent quality assessment system 4 based on cluster center characterization, which comprises a memory 41, a processor 42, a user interaction module 43, an assessment data set generation module 44, a quality assessment module 45 and a data storage management module 46, wherein the memory 41 and the processor 42 store and execute a pre-application patent quality assessment method program based on cluster center characterization;
The user interaction module 43 is used for inputting key word groups by a user to determine an estimated patent data subset, inputting information of the patent information to be predicted as an estimated input window, returning a result after system estimation and displaying an estimated result for the user;
The evaluation data set generation module 44 generates sub-data sets from the patent big data set according to the keyword groups supplied by the user;
the quality evaluation module 45 is responsible for performing quality evaluation based on the patent information to be evaluated and the sub-data set;
The data storage management module 46 is responsible for the storage of patent big data sets and the storage of patent subsets generated based on user key word groups, facilitating the operation of non-real-time assessment tasks.
The third aspect of the present invention also provides a computer readable storage medium, where the computer readable storage medium includes a pre-application patent quality assessment method program based on cluster center characterization, where the pre-application patent quality assessment method program based on cluster center characterization is executed by a processor, to implement the steps of the pre-application patent quality assessment method based on cluster center characterization as described in any one of the above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division by logical function, and other division manners are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place or distributed on a plurality of network units, and may select some or all of the units according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in each embodiment of the present invention may all be integrated in one processing unit, each unit may serve separately as one unit, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Alternatively, the above-described integrated units of the invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.