CN118229465B

CN118229465B - Pre-application patent quality assessment method and system based on cluster center representation

Info

Publication number: CN118229465B
Application number: CN202410610670.9A
Authority: CN
Inventors: 赖培源; 李岱素; 江昊钒; 廖晓东; 蔡焕涛; 刘士雨; 李奎; 梁育玮; 孙晓麒; 黄俊铮
Original assignee: Guangdong South China Technology Transfer Center Co ltd
Current assignee: Guangdong South China Technology Transfer Center Co ltd
Priority date: 2024-05-16
Filing date: 2024-05-16
Publication date: 2025-02-11
Anticipated expiration: 2044-05-16
Also published as: CN118229465A

Abstract

The present invention discloses a pre-application patent quality assessment method and system based on cluster center representation, including: extracting keywords from the patent text input by a user and using them for retrieval, generating a sub-dataset with similar features from patent big data, and generating center representations of the sub-dataset through a clustering model; intercepting the patent information to be predicted from the patent text input by the user to generate a text representation; calculating the similarity between the text representation of the patent information to be predicted and the center representations, and generating constraint information based on the similarity combined with patent quality indicators; and using the constraint information to train a patent quality assessment model and obtain a multi-dimensional quality evaluation result for the patent input by the user. While solving the problem of comparison against massive data, the present invention quickly performs multi-dimensional quality analysis on patents that users plan to apply for, which helps improve the success rate of user applications, cultivate high-value patents, and reduce the cost of patent applications for enterprises.

Description

Pre-application patent quality assessment method and system based on clustering center characterization

Technical Field

The invention relates to the technical field of patent quality evaluation, in particular to a method and a system for evaluating the quality of a pre-application patent based on clustering center characterization.

Background

Patents are an important component of intellectual property and a principal output of technological innovation: the number of patents reflects the overall scale of patenting activity, while the quality of patents reflects its substance. At present, the patent level of a region is usually measured by the number of patents alone, and patent quality is ignored, which gives a one-sided picture of the real situation. In recent years the number of patents has grown explosively, posing many challenges for patent examination and for patent commercialization and operation. Patent quality has therefore attracted close attention, and selecting a scientific and reasonable patent quality evaluation method has become an active topic of academic research. In particular, when analyzing massive data, assisting quality evaluation by constructing small, subdivided data sets is an important direction for applying evaluation models at scale.

At present, the number of patent application documents is growing quickly, but there are too few patent practitioners and their expertise is uneven, which increases practitioners' workload and indirectly lowers the quality of patent application documents. The quality of a patent application depends on its application documents. Improving them, on the one hand, fully presents the scope of protection of an enterprise's research and development and better supports the enterprise's intellectual property services; on the other hand, it improves the quality of the patent application itself. Multidimensional quality assessment of patent application text is therefore one of the problems that urgently needs to be solved.

Disclosure of Invention

In order to solve the technical problems, the invention provides a pre-application patent quality assessment method and system based on clustering center characterization.

The first aspect of the invention provides a pre-application patent quality assessment method based on clustering center characterization, which comprises the following steps:

Extracting keywords based on patent text input by a user for searching, generating a sub-data set with feature similarity meeting a preset standard in the patent big data, and generating a central representation of the sub-data set through a clustering model;

intercepting patent information to be predicted from a patent text input by a user, and generating text representation of the patent information to be predicted;

Calculating the similarity between the patent information to be predicted and the central representation, and generating constraint information based on the similarity and the patent quality index;

and training a patent quality evaluation model by using constraint information, and obtaining a multidimensional quality evaluation result for the patent input by the user through the patent quality evaluation model.

In this scheme, keywords are extracted from the patent text input by the user and used for retrieval, and a sub-data set whose feature similarity meets a preset standard is generated from the patent big data, specifically:

obtaining the patent text input by the user, performing word segmentation preprocessing, generating a serialized representation of the patent text, determining part-of-speech tags for the word vectors in the serialized representation, and performing sequence labeling with the part-of-speech tags;

segmenting the serialized representation of the patent text into blocks and embedding it with RoBERTa to obtain embedded vectors of the patent text, screening preset phrases by the part-of-speech tags, selecting the corresponding embedded vectors by matching the position features of the preset phrases, and splicing the selected embedded vectors to obtain spliced embedded vectors;

introducing a self-attention mechanism into the embedded vectors of the patent text, strengthening the features of the embedded vectors by weighting with the self-attention weights, introducing cross attention between the spliced embedded vectors and the embedded vectors, and obtaining neighborhood embedded vectors of the spliced embedded vectors to strengthen the contextual semantics;

acquiring an attention-encoded embedded vector sequence and a neighborhood embedded vector sequence, calculating the similarity between the embedded vectors and the neighborhood embedded vectors in the sequences, and decoding the spliced embedded vectors whose similarity meets a preset similarity threshold as the keyword extraction result;

and establishing a search index according to the keywords, calculating the feature similarity of the keywords in the massive patent big data using the search index, and constructing a sub-data set containing the keywords from the patent data that meet the preset similarity standard.
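
The last step above, building a search index from the extracted keywords and retrieving candidate patents, can be illustrated with a minimal sketch. It assumes a simple whitespace-tokenized inverted index; the function names `build_inverted_index` and `retrieve_candidates` and the `min_hits` parameter are hypothetical and are not taken from the patent, which does not specify the index structure.

```python
from collections import defaultdict


def build_inverted_index(patents):
    """Map each lower-cased term to the set of patent ids that contain it."""
    index = defaultdict(set)
    for patent_id, text in patents.items():
        for term in text.lower().split():
            index[term].add(patent_id)
    return index


def retrieve_candidates(index, keywords, min_hits=1):
    """Return ids of patents matching at least `min_hits` keywords, most hits first."""
    hits = defaultdict(int)
    for keyword in keywords:
        # .get() avoids creating empty entries for unknown keywords.
        for patent_id in index.get(keyword.lower(), ()):
            hits[patent_id] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [patent_id for patent_id, count in ranked if count >= min_hits]
```

In this sketch the keyword hit count stands in for the feature similarity, and `min_hits` plays the role of the preset similarity standard; a production system would more likely score candidates on embedding similarity.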

In this scheme, the center representation of the sub-data set is generated by a clustering model, specifically:

optimizing the initial cluster centers of the sub-data set with a sparrow search algorithm, initializing the parameters of the sparrow search algorithm, calculating the fitness values in the sparrow population, and obtaining the best fitness value, the worst fitness value and the corresponding positions;

selecting discoverers, joiners and scouts and updating their positions, introducing adaptive t-distribution mutation while updating the sparrow positions, iteratively calculating the fitness and updating the sparrow positions, and, once the maximum number of iterations is reached, outputting the best sparrow to obtain a cluster center matrix;

obtaining the initial cluster centers from the cluster center matrix, using Euclidean distance as the distance measure, assigning each item of patent data in the sub-data set to its nearest initial cluster center, and updating the cluster centers of the different clusters after all patent data have been assigned;

and obtaining the final clustering result of the sub-data set through iterative clustering, and generating the center representations of the sub-data set from the resulting clusters.
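
The fitness evaluation used when optimizing the initial cluster centers can be sketched as follows. The patent does not state the fitness function; this sketch assumes the within-cluster sum of squared Euclidean distances (lower is better), and treats each sparrow as one candidate cluster center matrix. The names `fitness` and `best_and_worst` are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fitness(centers, points):
    """Assumed fitness: sum of squared distances from each point to its nearest center."""
    return sum(min(euclidean(p, c) for c in centers) ** 2 for p in points)


def best_and_worst(population, points):
    """Return ((best_fitness, index), (worst_fitness, index)) over the population.

    Each member of `population` is one candidate cluster center matrix.
    """
    scored = [(fitness(centers, points), i) for i, centers in enumerate(population)]
    return min(scored), max(scored)
```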

In this scheme, the patent information to be predicted is intercepted from the patent text input by the user, and a text representation of the patent information to be predicted is generated, specifically:

intercepting the patent text input by the user according to preset paragraph positions and indicator keywords to generate the patent information to be predicted, extracting the embedded vectors of the patent text corresponding to the patent information to be predicted, and dividing them into word embedded vectors, sentence embedded vectors and paragraph embedded vectors;

feeding the embedded vectors of the patent information to be predicted into a bidirectional long short-term memory network, introducing an attention mechanism and processing the embedded vectors of the different levels with a forward LSTM and a reverse LSTM, combining the forward and reverse results through a hidden layer, and outputting the semantic features corresponding to the embedded vectors of the patent information to be predicted;

and matching the semantic features to the embedded vectors of the different levels corresponding to the patent information to be predicted, and generating the text representation of the patent information to be predicted.

In this scheme, constraint information is generated based on the similarity and the patent quality index, specifically:

calculating the similarity between the text representation of the patent information to be predicted and every center representation of the sub-data set, calculating the cosine similarity between the embedded vectors after dimension alignment, and, when the cosine similarity is larger than a preset threshold, extracting the semantic features of the patent information to be predicted at the corresponding positions and correcting the similarity with the semantic features;

traversing the patent information to be predicted to obtain all the similarities, calculating the mean of their absolute values to obtain the average similarity, and taking the reciprocal of the average similarity as one item of the constraint information;

obtaining patent quality evaluation examples with a big data engine, extracting patent quality evaluation indexes from the patent quality evaluation examples, and performing principal component analysis on the patent quality evaluation indexes to identify the key influencing factors;

obtaining, from the patent quality evaluation examples, the interaction relations between the key influencing factors and the patent text and between different key influencing factors, constructing triples from the different interaction relations and the attributes corresponding to the key influencing factors, and constructing a knowledge graph by learning the graph structure with a knowledge graph convolutional neural network;

and calculating the centrality of each node from the number of relation edges directly connected to the key influencing factor in the knowledge graph, using the centrality to represent the importance of the key influencing factor, selecting a preset number of key influencing factors according to their importance, and obtaining constraint information composed of the corresponding index variables.
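
The first item of constraint information described above, the reciprocal of the average absolute similarity between the text representation and the center representations, can be sketched as follows. This is a minimal illustration that omits the dimension alignment and the semantic-feature correction; the function names and the `eps` guard against division by zero are assumptions, not part of the patent.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def inverse_mean_similarity(text_vector, center_vectors, eps=1e-12):
    """Reciprocal of the mean absolute cosine similarity to all cluster centers."""
    sims = [cosine_similarity(text_vector, c) for c in center_vectors]
    mean_abs = sum(abs(s) for s in sims) / len(sims)
    # eps (an assumption) keeps the reciprocal finite when all similarities are zero.
    return 1.0 / max(mean_abs, eps)
```

A large value means the text is, on average, dissimilar to the existing clusters in the sub-data set.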

In this scheme, the patent quality evaluation model is trained using the constraint information, specifically:

constructing a patent quality evaluation model, training a corresponding encoder on the training data of each category of patent quality index in the constraint information, and extracting index features from the text representation of the patent information to be predicted with the encoders of the different patent quality indexes;

combining the index features of the patent information to be predicted with the reciprocal of the average similarity between the text representation and the center representations, inputting them into different multi-layer perceptrons to obtain a feature importance matrix, obtaining the attention distribution over the feature importance with cooperative attention, and obtaining the representations of the patent information to be predicted under the different constraints by weighted calculation;

and fully connecting the index features with the weighted representations, outputting vectors through the interaction of the multi-layer perceptrons, converting the output vectors into a probability distribution to obtain the predicted evaluation, scoring with the MSE evaluation index, and obtaining the quality evaluation result of the patent information to be predicted.

The invention also provides a pre-application patent quality assessment system based on clustering center characterization, which comprises a memory, a processor, a user interaction module, an assessment data set generation module, a quality assessment module and a data storage management module, wherein the memory stores, and the processor executes, a program of the pre-application patent quality assessment method based on clustering center characterization;

the user interaction module is used by the user to input a keyword group that determines the patent data subset used for evaluation, serves as the input window for the patent information to be predicted, and returns and displays the evaluation result to the user after the system evaluation;

the evaluation data set generation module is used for generating a sub-data set from the patent big data set according to the keyword group provided by the user;

the quality evaluation module is responsible for carrying out quality evaluation based on the patent information to be evaluated and the sub-data set;

and the data storage management module is responsible for storing the patent big data set and the patent subsets generated from user keyword groups, which facilitates the running of non-real-time assessment tasks.

The invention discloses a pre-application patent quality assessment method and system based on clustering center characterization. The method comprises: extracting keywords from the patent text input by a user and using them for retrieval, generating a sub-data set with similar characteristics from the patent big data, and generating center characterizations of the sub-data set through a clustering model; intercepting the patent information to be predicted from the patent text input by the user and generating a text characterization; calculating the similarity between the text characterization of the patent information to be predicted and the center characterizations, and generating constraint information based on the similarity combined with patent quality indexes; and training the patent quality assessment model with the constraint information to obtain a multi-dimensional quality assessment result for the patent input by the user. While solving the problem of comparison against massive data, the invention rapidly performs multi-dimensional quality analysis of the patent a user plans to apply for, which helps improve the success rate of user applications, cultivate high-value patents, and reduce enterprises' patent application costs.

Drawings

FIG. 1 shows a flow chart of a pre-application patent quality assessment method based on cluster center characterization of the present invention;

FIG. 2 illustrates a flow chart of the present invention for generating a central representation of a sub-dataset;

FIG. 3 shows a flow chart of the present invention for constructing a patent quality assessment model;

FIG. 4 shows a block diagram of a pre-application patent quality assessment system of the present invention based on cluster center characterization.

Detailed Description

In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

FIG. 1 shows a flow chart of a pre-application patent quality assessment method based on cluster center characterization of the present invention.

As shown in fig. 1, a first aspect of the present invention provides a method for evaluating the quality of a pre-application patent based on cluster center characterization, including:

S102, extracting keywords based on patent text input by a user for searching, generating a sub-data set with feature similarity meeting a preset standard in patent big data, and generating a central representation of the sub-data set through a clustering model;

S104, intercepting patent information to be predicted from a patent text input by a user, and generating a text representation of the patent information to be predicted;

S106, calculating the similarity between the to-be-predicted patent information and the central representation, and generating constraint information based on the similarity and the patent quality index;

S108, training a patent quality evaluation model by using constraint information, and obtaining a multidimensional quality evaluation result for the patent input by the user through the patent quality evaluation model.

It should be noted that the patent text input by the user is obtained and preprocessed by word segmentation, normalization and stop-word filtering to generate a serialized representation of the patent text. Part-of-speech tags (prepositions, adjectives, nouns, proper nouns and the like) are determined for the word vectors in the serialized representation and used for sequence labeling. The serialized representation is segmented into blocks and embedded with RoBERTa; after RoBERTa encoding, the embedded vectors are associated with one another, which strengthens semantic learning and captures the semantic features expressed in different contexts. Once the embedded vectors of the patent text are obtained, preset phrases are screened by the part-of-speech tags, the corresponding embedded vectors are selected by matching the position features of the preset phrases, and the selected embedded vectors are spliced, unified in length and passed through a linear layer for dimension transformation to obtain the spliced embedded vectors. A self-attention mechanism is applied to the embedded vectors of the patent text, and the features of the embedded vectors are strengthened by weighting with the self-attention weights. Cross attention is introduced between the spliced embedded vectors and the embedded vectors to obtain the neighborhood embedded vectors of the spliced embedded vectors and strengthen the contextual semantics. This yields an attention-encoded embedded vector sequence and a neighborhood embedded vector sequence: the embedded vector sequence contains global semantics, while the neighborhood embedded vector sequence is rich in local contextual semantics. After dimension reduction, the similarity between the embedded vectors and the neighborhood embedded vectors in the sequences is calculated and used as the criterion of importance, and the spliced embedded vectors whose similarity meets a preset similarity threshold are decoded as the keyword extraction result. A search index is established according to the keywords, the feature similarity of the keywords is calculated in the massive patent big data using the search index, and the patent data meeting the preset similarity standard are used to construct a sub-data set containing the keywords.
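
The thresholding step of the keyword extraction can be illustrated with a short sketch. It assumes each candidate phrase already has a spliced embedded vector and a neighborhood embedded vector of the same dimension (in practice these would come from the RoBERTa and attention layers, which are not reproduced here); `select_keywords` and the default threshold of 0.8 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_keywords(candidates, threshold=0.8):
    """Keep phrases whose embedding agrees with its neighborhood embedding.

    `candidates` is an iterable of (phrase, phrase_vector, neighborhood_vector).
    """
    return [
        phrase
        for phrase, phrase_vec, neighborhood_vec in candidates
        if cosine_similarity(phrase_vec, neighborhood_vec) >= threshold
    ]
```

For example, a phrase whose vector closely matches its context vector is kept as a keyword, while a phrase that is orthogonal to its context is dropped.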

FIG. 2 illustrates a flow chart of the present invention for generating a central representation of a sub-dataset.

According to the embodiment of the invention, the central representation of the sub-data set is generated through a clustering model, specifically:

S202, optimizing an initial clustering center of a sub-data set by utilizing a sparrow search algorithm, initializing parameters of the sparrow search algorithm, calculating fitness values in a sparrow population, and obtaining an optimal fitness value, a worst fitness value and corresponding positions;

S204, selecting discoverers, joiners and scouts and updating their positions, introducing adaptive t-distribution mutation while updating the sparrow positions, iteratively calculating the fitness and updating the sparrow positions, and, once the maximum number of iterations is reached, outputting the best sparrow to obtain a cluster center matrix;

S206, obtaining the initial cluster centers from the cluster center matrix, using Euclidean distance as the distance measure, assigning each item of patent data in the sub-data set to its nearest initial cluster center, and updating the cluster centers of the different clusters after all patent data have been assigned;

S208, obtaining a final clustering result of the sub-data set through iterative clustering, and generating a center representation of the sub-data set according to the partitioned different clusters.
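
Steps S206 and S208 correspond to standard K-means iterations started from given initial centers. The following is a minimal sketch under that reading; it takes the initial centers as an argument (in the patent they come from the sparrow search), and the `max_iter` cap and the handling of empty clusters are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centers, max_iter=100):
    """Iterate assignment and center update until the centers stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                dim = len(old)
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

The returned centers play the role of the center representations of the sub-data set, one per cluster.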

It should be noted that the K-means clustering algorithm runs quickly, but when the initial cluster centers are selected randomly, center points that are too concentrated or too dispersed lead to a poor clustering result and reduce the accuracy of the center representations obtained after clustering. The sparrow search algorithm has advantages such as fast convergence and is used to reduce the influence of the initial cluster centers on the clustering result. The parameters of the sparrow search algorithm are initialized by setting the maximum number of iterations, the population size, the number of discoverers, the number of scouts and the alarm value. The discoverers are the sparrows with the best positions, the remaining sparrows are joiners, and the scouts are generated randomly; the higher a sparrow's fitness, the higher its priority in obtaining food. Given the input number of clusters, the sparrow search algorithm finds the cluster center matrix with the best fitness. To keep the algorithm from falling into a local optimum, adaptive t-distribution mutation is applied to the sparrow positions; the t distribution combines the characteristics of the Cauchy and Gaussian distributions and balances global exploration against local exploitation. The current latest position is obtained, and the position matrix is updated if that position is better than the previous optimal position, until the optimal position matrix is output as the cluster center matrix. Clustering usually yields more than one center representation; for example, if N categories are obtained after cluster analysis, there are N corresponding center representations.
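
The adaptive t-distribution mutation can be sketched as follows. The patent does not give the update rule; this sketch assumes a commonly used form in which each coordinate is perturbed as x + x * t(iter), with the iteration number used as the degrees of freedom, so that early iterations behave like a heavy-tailed Cauchy distribution and later ones approach a Gaussian. The greedy acceptance mirrors the "update if better" rule in the text. All function names are illustrative.

```python
import math
import random


def student_t(df, rng):
    """Sample Student's t with `df` degrees of freedom as Z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) is Gamma(df/2, scale=2)
    return z / math.sqrt(chi2 / df)


def t_mutate(position, iteration, rng):
    """Assumed mutation rule: perturb each coordinate by x * t(iteration)."""
    return [x + x * student_t(iteration, rng) for x in position]


def mutate_if_better(position, iteration, fitness, rng):
    """Keep the mutated position only if its fitness is lower (better)."""
    candidate = t_mutate(position, iteration, rng)
    return candidate if fitness(candidate) < fitness(position) else position
```

Here `iteration` must be at least 1, and `fitness` is any callable that returns a value to be minimized.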

It should be noted that the patent text input by the user is intercepted according to preset paragraph positions, such as the claims or the abstract of the specification, and indicator keywords, to generate the patent information to be predicted. The embedded vectors corresponding to the patent information to be predicted are extracted and divided into word embedded vectors, sentence embedded vectors and paragraph embedded vectors; embedded vectors at different positions and levels improve the efficiency of text semantic recognition. The embedded vectors of the patent information to be predicted are fed into a bidirectional long short-term memory network, and an attention mechanism is introduced so that the embedded vectors of the different levels are processed by a forward LSTM and a reverse LSTM. The forward LSTM operates on the embedded vector input at time t and the output at time t-1 to obtain the forward output at time t; the reverse LSTM operates on the embedded vector input at time t and the output at time t+1 to obtain the reverse output at time t. The semantic features corresponding to the embedded vectors of the patent information to be predicted are output through the hidden layer, and the semantic features are matched, according to the position codes, to the embedded vectors of the different levels corresponding to the patent information to be predicted to generate the text representation of the patent information to be predicted.
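
The forward and reverse recurrences and the attention step can be sketched in plain Python. This is a toy illustration only: weights are random and untrained, bias terms are omitted, and the attention is a simple dot product against a query vector, all of which are assumptions rather than details from the patent.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def init_lstm(input_dim, hidden_dim, rng):
    """One weight matrix per gate (input, forget, output, candidate) acting on [x; h]."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: matrix() for gate in "ifoc"}


def lstm_step(params, x, h, c):
    z = x + h  # list concatenation of the input and the previous hidden state
    i = [sigmoid(v) for v in matvec(params["i"], z)]
    f = [sigmoid(v) for v in matvec(params["f"], z)]
    o = [sigmoid(v) for v in matvec(params["o"], z)]
    g = [math.tanh(v) for v in matvec(params["c"], z)]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def run_lstm(params, xs, hidden_dim):
    h, c, outputs = [0.0] * hidden_dim, [0.0] * hidden_dim, []
    for x in xs:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def bilstm(fwd_params, bwd_params, xs, hidden_dim):
    """Concatenate forward states (using t-1) with reverse states (using t+1)."""
    forward = run_lstm(fwd_params, xs, hidden_dim)
    backward = run_lstm(bwd_params, xs[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def attention_pool(states, query):
    """Softmax dot-product attention over time steps; returns (pooled, weights)."""
    scores = [sum(s * q for s, q in zip(state, query)) for state in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    pooled = [sum(w * state[d] for w, state in zip(weights, states))
              for d in range(len(states[0]))]
    return pooled, weights
```

Each time step ends up with a state of size 2 * hidden_dim, and the pooled vector is one possible stand-in for the semantic feature of a text segment.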

FIG. 3 shows a flow chart of the present invention for constructing a patent quality assessment model.

According to the embodiment of the invention, the patent quality assessment model is trained by using constraint information, and specifically comprises the following steps:

S302, constructing a patent quality evaluation model, training corresponding encoders through training data of patent quality indexes of different categories in constraint information, and extracting index features from text characterization of the to-be-predicted patent information by utilizing the encoders of the different patent quality indexes;

S304, combining the index features of the to-be-predicted patent information with the reciprocal of the average similarity between the text characterization and the center characterizations, inputting them into different multi-layer perceptrons to obtain a feature importance matrix, obtaining the attention distribution over the feature importance with cooperative attention, and obtaining the characterizations of the to-be-predicted patent information under the different constraints by weighted calculation;

S306, fully connecting the index features with the weighted characterizations, outputting vectors through the interaction of the multi-layer perceptrons, converting the output vectors into a probability distribution to obtain the predicted evaluation, and scoring with the MSE evaluation index to obtain the quality evaluation result of the to-be-predicted patent information.
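
The last part of S306, turning an output vector into a probability distribution and scoring predictions with MSE, can be sketched as follows. The softmax conversion and the expected-grade readout are assumptions; the patent states only that the output vector is converted into a probability distribution and that MSE is used for scoring.

```python
import math


def softmax(logits):
    """Convert an output vector into a probability distribution."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probabilities, grades):
    """Assumed readout: probability-weighted average of the grade values."""
    return sum(p * g for p, g in zip(probabilities, grades))


def mse(predicted, actual):
    """Mean squared error between predicted and reference scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
```

For instance, an output vector with two equal entries gives probabilities of 0.5 each, and with grade values 1 and 3 the expected score is 2.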

It should be noted that the similarity between the text characterization of the patent information to be predicted and every center characterization of the sub-data set is calculated. After dimension alignment, the cosine similarity between the embedded vectors is calculated; when the cosine similarity is larger than a preset threshold, the semantic features of the patent information to be predicted at the corresponding positions are extracted and used to correct the similarity. All similarities of the patent information to be predicted are traversed, the mean of their absolute values is calculated to obtain the average similarity, and the reciprocal of the average similarity is taken. This reciprocal reflects the probability of grant to a certain extent and serves as one of the quality evaluation indexes in the constraint information. Patent quality evaluation examples are obtained with a big data engine, and patent quality evaluation indexes are extracted from them, such as the number and length of claims, the number of patent citations, the number of non-patent literature citations, the technology life cycle, the patent category, the patent family size, and the numbers of inventors and applicants. Principal component analysis is performed on the patent quality evaluation indexes to identify the key influencing factors. From the patent quality evaluation examples, the interaction relations between the key influencing factors and the patent text, and between different key influencing factors, are obtained; triples are constructed from the different interaction relations and the attributes corresponding to the key influencing factors, and a knowledge graph is constructed by learning the graph structure with a knowledge graph convolutional neural network. In the knowledge graph, the centrality of each node is calculated from the number of relation edges directly connected to the key influencing factor, and the centrality is used to represent the importance of the key influencing factor: a more central node is more important than other nodes. A preset number of key influencing factors are selected according to their importance, and constraint information composed of the corresponding index variables is obtained.
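
The centrality-based selection of key influencing factors can be sketched as follows. It assumes plain degree centrality (edge count normalized by n - 1) over (head, relation, tail) triples; the example factor names and the functions `degree_centrality` and `top_factors` are hypothetical, and the graph convolutional learning step is not reproduced.

```python
from collections import Counter


def degree_centrality(triples):
    """Degree centrality of each node: directly connected edges divided by (n - 1)."""
    degree = Counter()
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    scale = len(degree) - 1 if len(degree) > 1 else 1
    return {node: count / scale for node, count in degree.items()}


def top_factors(triples, factors, k):
    """Return the k factors with the highest centrality (ties broken by name)."""
    centrality = degree_centrality(triples)
    ranked = sorted(factors, key=lambda f: (-centrality.get(f, 0.0), f))
    return ranked[:k]
```

With triples linking a claim-count factor to both the patent text and a family-size factor, the claim-count factor has the most edges and is ranked first.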

A patent quality evaluation model is constructed. A multi-scale encoder module extracts index features from the text representation of the patent information to be predicted, giving the representations of the text representation under the different patent evaluation index variables in the constraint information, and a cooperative attention mechanism estimates the differing importance of these representations. The calculation formula is as follows: [formula missing from the source text], where the symbols of the formula denote, respectively, the attention distribution and the representations of the text representation under the j-th and n-th patent evaluation index variables, and m denotes the total number of representations. The index features before the attention mechanism are connected with the corresponding index representations after the attention mechanism, the result is output through a multi-layer perceptron interaction network, and the MSE evaluation index is used for scoring to obtain the quality evaluation result of the patent information to be predicted. The smaller the MSE evaluation index, the closer the predicted value output by the quality evaluation model is to the true value, and the better the quality of the patent information to be predicted is taken to be.
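
For orientation, one common form that an attention distribution over m representations can take is a softmax over per-representation scores. The expression below is an assumption offered only as an illustration consistent with the variables named above (index j, index n, total m); it is not the patent's own formula. Here h_j is a hypothetical symbol for the representation under the j-th patent evaluation index variable, s(·) is a hypothetical scoring function, and α_j is the attention weight.

```latex
\alpha_j = \frac{\exp\big(s(h_j)\big)}{\sum_{n=1}^{m} \exp\big(s(h_n)\big)},
\qquad
\tilde{h} = \sum_{j=1}^{m} \alpha_j \, h_j
```

Under this assumed form the weights α_j are non-negative and sum to one over the m representations, and the weighted sum is the attention-weighted representation passed on to the later layers.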

It should be noted that the historical quality evaluation results of an enterprise's patent texts are obtained, and a writing profile of the enterprise's patent texts is constructed from them. A personalized database is built from the writing profiles of different time periods, and quality evaluation indexes that differ markedly from the other quality evaluation indexes in the writing profile are retrieved from the personalized database and marked. An improvement direction for patent text drafting is generated in the current drafting workflow according to the marked quality evaluation indexes, and the improvement direction is traced back using an ant colony algorithm. The influencing factors of the abnormal quality evaluation indexes are obtained from the tracing paths, optimization measures are retrieved according to the influencing factors to improve the drafting process for patent texts and technical disclosure documents, and the corresponding drafting workflow is updated.
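
The marking of quality evaluation indexes that differ markedly from the others can be sketched with a simple deviation test. The patent does not specify how the difference is measured; this sketch assumes a z-score against the mean of all indexes in the profile, and the function name, the example index names and the threshold of 1.5 are hypothetical. The ant colony tracing step is not reproduced.

```python
import statistics


def flag_outlier_indexes(profile, z_threshold=1.5):
    """Return names of indexes whose score deviates strongly from the profile mean.

    `profile` maps a quality evaluation index name to its historical score.
    """
    values = list(profile.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all indexes are identical, nothing stands out
    return sorted(
        name for name, value in profile.items()
        if abs(value - mean) / spread > z_threshold
    )
```

In a profile where four indexes score around 0.8 and one scores 0.2, only the low-scoring index is flagged for improvement.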

The invention also provides a pre-application patent quality assessment system 4 based on cluster center characterization, which comprises a memory 41, a processor 42, a user interaction module 43, an assessment data set generation module 44, a quality assessment module 45 and a data storage management module 46, wherein the memory 41 and the processor 42 store and execute a pre-application patent quality assessment method program based on cluster center characterization;

The user interaction module 43 is used for a user to input keyword groups that determine the patent data subset to be evaluated, to input the patent information to be predicted through an evaluation input window, and to return and display the evaluation results once the system has completed the evaluation;

The evaluation data set generation module 44 generates sub-datasets from the patent big data set according to the keyword groups supplied by the user;

The quality evaluation module 45 is responsible for performing quality evaluation based on the patent information to be evaluated and the sub-datasets;

The data storage management module 46 is responsible for storing the patent big data set and the patent subsets generated from user keyword groups, which facilitates running non-real-time evaluation tasks.

A third aspect of the present invention provides a computer-readable storage medium that stores a program of the pre-application patent quality assessment method based on cluster center characterization. When the program is executed by a processor, it implements the steps of the pre-application patent quality assessment method based on cluster center characterization as described in any one of the above.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communicative connection between the components shown or discussed may be made through certain interfaces, and the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place or distributed on a plurality of network units, and may select some or all of the units according to actual needs to achieve the purpose of the embodiment.

In addition, the functional units in each embodiment of the present invention may all be integrated in one processing unit, each unit may serve separately as a unit, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of hardware plus software functional units.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.

Alternatively, the above-described integrated units of the invention may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disk.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A pre-application patent quality assessment method based on cluster center representation, characterized in that

Extract keywords from patent texts input by users for retrieval, generate sub-datasets whose feature similarity meets preset standards in patent big data, and generate central representations of the sub-datasets through clustering models;

Extracting the patent information to be predicted from the patent text input by the user, and generating a text representation of the patent information to be predicted;

Calculate the similarity between the patent information to be predicted and the central representation, and generate constraint information based on the similarity combined with the patent quality index;

Using constraint information to train a patent quality assessment model, and obtaining a multi-dimensional quality evaluation result for a patent input by a user through the patent quality assessment model;

Generating the constraint information based on the similarity combined with the patent quality indicators specifically comprises:

Calculate the similarity between the text representation of the patent information to be predicted and the central representations of all sub-datasets by computing the cosine similarity between the embedding vectors after dimensional alignment; when the cosine similarity is greater than a preset threshold, extract the semantic features of the patent information to be predicted at the corresponding position, and use the semantic features to correct the similarity;

Traverse the patent information to be predicted to obtain all similarities, calculate the average of their absolute values to generate an average similarity, and take the inverse of the average similarity as one item of the constraint information;

Using a big data engine to obtain patent quality evaluation examples, extracting patent quality evaluation indicators from the patent quality evaluation examples, and performing principal component analysis on the patent quality evaluation indicators to identify key influencing factors;

According to the patent quality evaluation examples, obtain the interaction relationships between the key influencing factors and the patent texts and between different key influencing factors, form triplets from the different interaction relationships and the attributes corresponding to the key influencing factors, and use a knowledge graph convolutional neural network to learn the graph structure and construct the knowledge graph;

Obtain the number of relationship edges directly connected to each key influencing factor in the knowledge graph to calculate the centrality of the node, use the centrality to characterize the importance of the key influencing factor, select a preset number of key influencing factors according to the importance, and obtain the corresponding indicator variables to compose the constraint information;

The patent quality assessment model is trained using constraint information, specifically:

Construct a patent quality assessment model, train the corresponding encoders through the training data of different categories of patent quality indicators in the constraint information, and use the encoders of different patent quality indicators to extract indicator features from the text representation of the patent information to be predicted;

The indicator features of the patent information to be predicted are combined with the inverse of the average similarity between the text representation and the center representation and input into different multi-layer perceptrons to obtain a feature importance matrix; collaborative attention is applied to the feature importance matrix to obtain an attention distribution, and the representations of the patent information to be predicted under the different constraints are obtained by weighted calculation;

The indicator features are fully connected with the weighted representation, and the output vector is converted into a probability distribution through the interaction of the multi-layer perceptron to obtain the prediction evaluation. The MSE evaluation indicator is used for scoring to obtain the quality assessment result of the patent information to be predicted.
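One item of the constraint information in claim 1, the inverse of the average absolute cosine similarity between the text representation and the cluster-center representations, can be sketched as follows. This is a minimal illustration under stated assumptions: the vectors are taken to be already dimensionally aligned, the semantic-feature correction of the similarity is omitted, and the names `cosine_similarity`, `inverse_mean_abs_similarity`, and the `eps` guard are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inverse_mean_abs_similarity(text_vec, centers, eps=1e-12):
    """Return 1 / mean(|cos(text_vec, c)|) over all cluster centers.

    ASSUMPTION: `eps` guards against division by zero when the text is
    orthogonal to every center; the patent does not specify this case.
    """
    sims = [cosine_similarity(text_vec, c) for c in centers]
    mean_abs = sum(abs(s) for s in sims) / len(sims)
    return 1.0 / max(mean_abs, eps)


# Similarities are 1.0 and 0.0, so the mean is 0.5 and the inverse is 2.0.
constraint = inverse_mean_abs_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A text that resembles the cluster centers closely yields a value near 1, while a text far from every center yields a large value, so this item grows as the draft departs from existing patents in the sub-dataset.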

2. A method for pre-application patent quality assessment based on cluster center representation according to claim 1, characterized in that keywords are extracted from patent text input by the user for retrieval, and a sub-dataset whose feature similarity meets preset standards is generated in the patent big data, specifically:

Obtain the patent text input by the user for word segmentation preprocessing, generate a serialized representation of the patent text, determine the part-of-speech tags of the word vectors in the serialized representation of the patent text, and use the part-of-speech tags for sequence annotation;

Use RoBERTa to trim, segment and embed the serialized representation of the patent text to obtain an embedding vector of the patent text, filter preset phrases through the part-of-speech tags, filter the corresponding embedding vectors based on position feature matching of the preset phrases, and splice the matched and filtered embedding vectors to obtain a concatenated embedding vector;

Introduce a self-attention mechanism into the embedding vector of the patent text and strengthen the features of the embedding vector by weighting with the self-attention weights; introduce cross-attention between the concatenated embedding vector and the embedding vector to obtain the neighborhood embedding vector of the concatenated embedding vector and strengthen the contextual semantics;

Obtain the embedded vector sequence and the neighborhood embedded vector sequence after attention encoding, calculate the similarity of the embedded vector and the neighborhood embedded vector in the sequence, obtain the concatenated embedded vector whose similarity meets the preset similarity threshold for decoding, and use it as the keyword extraction result;

A search index is established based on the keywords, and the keyword feature similarity is calculated in the massive patent big data using the search index to obtain patent data that meets the preset similarity standard to construct a sub-dataset containing the keywords.
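The retrieval step at the end of claim 2 can be sketched with an inverted index over keywords. The claim does not name a similarity measure, so this sketch assumes Jaccard similarity between keyword sets; `build_index`, `select_subset`, and the toy corpus are hypothetical illustrations.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each keyword to the set of patent ids that contain it."""
    index = defaultdict(set)
    for patent_id, keywords in corpus.items():
        for keyword in keywords:
            index[keyword].add(patent_id)
    return index


def select_subset(query_keywords, corpus, index, threshold):
    """Return ids whose keyword sets reach the similarity threshold.

    ASSUMPTION: Jaccard similarity stands in for the unspecified
    "keyword feature similarity" of the claim.
    """
    query = set(query_keywords)
    # Only patents sharing at least one keyword can have non-zero similarity.
    candidates = set()
    for keyword in query:
        candidates |= index.get(keyword, set())
    selected = []
    for patent_id in sorted(candidates):
        keywords = set(corpus[patent_id])
        jaccard = len(query & keywords) / len(query | keywords)
        if jaccard >= threshold:
            selected.append(patent_id)
    return selected


corpus = {
    "P1": {"cluster", "patent", "quality"},
    "P2": {"battery", "anode"},
    "P3": {"patent", "quality", "score"},
}
index = build_index(corpus)
subset = select_subset({"patent", "quality", "cluster"}, corpus, index, 0.6)
```

The index restricts the comparison to patents that share at least one keyword with the query, which is what makes the approach practical on a large patent corpus.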

3. The method for pre-application patent quality assessment based on cluster center representation according to claim 1 is characterized in that the center representation of the sub-dataset is generated by a clustering model, specifically:

Use the sparrow search algorithm to optimize the initial cluster center of the sub-dataset, initialize the parameters of the sparrow search algorithm, calculate the fitness value in the sparrow population, and obtain the optimal fitness value and the worst fitness value and the corresponding position;

Select the discoverers, joiners and scouts and update their positions; in the process of updating the sparrows' positions, introduce adaptive t-distribution mutation, iteratively calculate the fitness and update the sparrows' positions, and after the maximum number of iterations is reached, output the best sparrow position to obtain the cluster center matrix;

Obtaining initial cluster centers according to the cluster center matrix, using Euclidean distance as a metric function, assigning patent data in the sub-dataset to the initial cluster center closest to the data, and updating cluster centers in different clusters after all patent data are assigned;

The final clustering result of the sub-dataset is obtained through iterative clustering, and the central representation of the sub-dataset is generated according to the different clusters divided.
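The assignment and update loop of claim 3 can be sketched as below. The sparrow search optimisation is not reproduced: the sketch simply takes the initial cluster centers as an input, standing in for the cluster center matrix that the sparrow search would output. The function name `kmeans` and the toy data are hypothetical.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Euclidean k-means starting from externally supplied centers.

    ASSUMPTION: `initial_centers` plays the role of the cluster center
    matrix produced by the sparrow search algorithm in the claim.
    """
    centers = [list(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assign every point to its nearest center by Euclidean distance.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: math.dist(point, centers[i]),
            )
            clusters[nearest].append(point)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster received no points.
        new_centers = [
            [sum(col) / len(cluster) for col in zip(*cluster)]
            if cluster
            else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, clusters


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers, clusters = kmeans(points, [[0.0, 0.0], [10.0, 0.0]])
# centers == [[0.0, 1.0], [10.0, 1.0]]
```

The returned centers are the per-cluster means, which correspond to the central representations of the sub-dataset used in the later similarity step.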

4. A method for pre-application patent quality assessment based on cluster center representation according to claim 1, characterized in that the patent information to be predicted is extracted from the patent text input by the user to generate a text representation of the patent information to be predicted, specifically:

According to the preset paragraph positions and indicated keywords, the patent text input by the user is truncated to generate the patent information to be predicted, and the embedding vector of the patent text corresponding to the patent information to be predicted is extracted and divided into word embedding vectors, sentence embedding vectors and paragraph embedding vectors;

The embedding vectors of the patent information to be predicted are imported into a bidirectional long short-term memory network, an attention mechanism is introduced, and the embedding vectors of the different levels are processed by a forward LSTM and a reverse LSTM; the forward and reverse results are combined through the hidden layer to output the semantic features corresponding to the embedding vectors of the patent information to be predicted;

The semantic features are represented and matched with embedding vectors of different levels corresponding to the patent information to be predicted according to the position encoding to generate a text representation of the patent information to be predicted.

5. A system for evaluating patent quality before application based on cluster center representation, characterized in that it implements the method for evaluating patent quality before application based on cluster center representation as described in any one of claims 1 to 4, and the system comprises: a memory, a processor, a user interaction module, an evaluation data set generation module, a quality evaluation module, and a data storage management module, wherein the memory stores, and the processor executes, a program of the method for evaluating patent quality before application based on cluster center representation;

The user interaction module is used for the user to input a keyword group to determine the patent data subset to be evaluated; and to input information of the patent information to be predicted as an evaluation input window; and to return the results of the system evaluation and display the evaluation results to the user;

The evaluation data set generation module generates sub-data sets based on the patent big data set according to the keyword groups provided by the user;

The quality assessment module is responsible for quality assessment based on the patent information and sub-datasets to be assessed;

The data storage management module is responsible for the storage of large patent data sets and the storage of patent subsets generated based on user keyword groups, which facilitates the operation of non-real-time evaluation tasks.

6. A pre-application patent quality assessment system based on cluster center representation according to claim 5, characterized in that the center representation of the sub-dataset is generated in the assessment data set generation module, specifically:

7. The pre-application patent quality assessment system based on cluster center representation according to claim 5 is characterized in that the constraint information of the patent quality assessment model is obtained in the quality assessment module, specifically:

Calculate the similarity between the text representation of the patent information to be predicted and the central representations of all sub-datasets by computing the cosine similarity between the embedding vectors after dimensional alignment; when the cosine similarity is greater than a preset threshold, extract the semantic features of the patent information to be predicted at the corresponding position, and use the semantic features to correct the similarity;

In the knowledge graph, the number of relationship edges directly connected to each key influencing factor is obtained to calculate the centrality of the node, and the centrality is used to characterize the importance of the key influencing factor. A preset number of key influencing factors are selected according to the importance, and the corresponding indicator variables are obtained to compose the constraint information.
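The centrality-based selection of key influencing factors can be sketched as a degree count over the knowledge-graph triplets. This is a minimal illustration: normalising the edge count by the number of other nodes is an assumption, and the factor names, relation labels, and the functions `degree_centrality` and `top_factors` are hypothetical.

```python
from collections import Counter


def degree_centrality(triples):
    """Count edges touching each node, normalised by (node count - 1).

    ASSUMPTION: the normalisation is the conventional one for degree
    centrality; the claim only requires the count of directly connected
    relationship edges.
    """
    degree = Counter()
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    denominator = max(len(degree) - 1, 1)
    return {node: count / denominator for node, count in degree.items()}


def top_factors(triples, factors, k):
    """Pick the k candidate factors with the highest centrality."""
    centrality = degree_centrality(triples)
    ranked = sorted(factors, key=lambda f: centrality.get(f, 0.0), reverse=True)
    return ranked[:k]


triples = [
    ("novelty", "affects", "patent"),
    ("novelty", "correlates_with", "claims_count"),
    ("claims_count", "affects", "patent"),
    ("novelty", "correlates_with", "citations"),
]
selected = top_factors(triples, ["claims_count", "novelty", "citations"], k=2)
# selected == ["novelty", "claims_count"]
```

Here `novelty` touches three edges, `claims_count` two and `citations` one, so the first two are kept as the indicator variables that compose the constraint information.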

8. A pre-application patent quality assessment system based on cluster center representation according to claim 5, characterized in that the patent quality assessment model in the quality assessment module is specifically:

Construct a patent quality assessment model, train the corresponding encoders with the training data of different categories of patent quality indicators in the constraint information, and use the encoders of different patent quality indicators to extract indicator features from the text representation of the patent information to be predicted;

The indicator features are fully connected with the weighted representation, and the output vector is converted into a probability distribution through the interaction of the multi-layer perceptron to obtain the prediction evaluation. The MSE evaluation index is used for scoring to obtain the quality assessment result of the patent information to be predicted.
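The MSE evaluation index used for scoring is the mean of the squared differences between predicted and reference scores. A minimal sketch follows; the function name `mean_squared_error` and the sample scores are hypothetical.

```python
def mean_squared_error(predicted, actual):
    """Mean of squared differences between two equal-length score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Scores on three quality dimensions: model output versus reference values.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
# Only the last dimension differs (by 2), so mse == 4 / 3.
```

A lower value means the predicted multi-dimensional scores lie closer to the reference scores.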