CN111708865B - Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Info

Publication number: CN111708865B
Application number: CN202010557407.XA
Authority: CN
Inventors: 黄梦醒; 李茂�; 冯思玲; 冯文龙; 张雨
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2020-06-18
Filing date: 2020-06-18
Publication date: 2021-07-09
Anticipated expiration: 2040-06-18
Also published as: CN111708865A

Abstract

The invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm. A user inputs a patent search formula, and a patent subject database is constructed according to it. A convolutional neural network extracts features from the patent texts of the database to obtain feature vectors, from which a test set is constructed. An XGBoost model is trained on the training set and then improved by a grey wolf optimization algorithm, which raises classification precision and efficiency; after the model is tested on the test set, an XGBoost classifier is obtained. When a patent to be early-warned is input into the classifier, its class and the other patent texts in the same class are obtained, so that patent early warning analysis and prediction of technology maturity and technology evolution direction can be carried out. The user receives an accurate, highly visual prediction result and can grasp the state of the prior art and its likely evolution direction at a glance.

Description

Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Technical Field

The invention relates to the technical field of patent information processing, in particular to a technical forecasting and patent early warning analysis method based on an improved XGboost algorithm.

Background

With the rapid development of science and technology, new technologies emerge continuously, intellectual property rights receive increasing attention, and the competitive environment of the market grows more complex. Staying ahead in intense technical competition is essential to an enterprise's competitiveness, and patents have increasingly become a core element of that competitiveness. Enterprises can therefore analyze existing patents to achieve technology forecasting and patent early warning, avoiding patent traps and grasping the future development of a technology.

CN106897392A discloses a technology competition and patent early warning analysis method based on knowledge discovery. After a user inputs a retrieval formula, a special database is established; a cluster data set of the database is obtained using data mining and knowledge discovery tools such as vector space models and mathematical statistics; patent early warning and patent theme life cycle analysis are then carried out on the cluster data set, and a visual result is provided to the user, thereby realizing technology competition analysis and patent early warning.

CN106845798A discloses a method for analyzing cross-domain patent early warning information based on a multi-branch tree. Its core idea is to screen, classify and extract features from collected patent data, build a multi-branch tree for the patent domain in which each leaf node stores a patent technology and associated user information, search the tree for leaf nodes matching the patent to be early-warned, and issue cross-domain patent early warnings according to the matching results. Its drawbacks are that a large amount of patent technology information and related user information must be collected, which consumes considerable time and labor; the collected information is not necessarily reliable, so the content stored in leaf nodes may be inaccurate, errors are likely during searching and matching, and early warning accuracy suffers greatly.

CN107369007A discloses a big data patent early warning service system based on a genetic algorithm. Its data mining module classifies a patent data set using a genetic-algorithm-based data mining algorithm, and the classification result is then analyzed to realize patent early warning.

Disclosure of Invention

In view of this, the invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm, in which the XGBoost algorithm with optimized parameters classifies the patent subject database so as to improve classification precision and classification efficiency.

The technical scheme of the invention is realized as follows:

a technology forecasting and patent early warning analysis method based on an improved XGboost algorithm comprises the following steps:

Step S1, constructing a patent subject database according to the patent retrieval formula;

Step S2, extracting features from the patent texts in the patent subject database using a convolutional neural network to obtain a feature vector V_k;

Step S3, constructing a test set S according to the feature vector V_k;

Step S4, inputting the training set into an XGBoost model for training, optimizing the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model using a grey wolf optimization algorithm, and inputting the test set into the improved XGBoost model for testing to obtain the XGBoost classifier;

Step S5, extracting the features of the patent to be early-warned and inputting them into the XGBoost classifier to obtain the class of the patent to be early-warned and the other patent texts in the same class;

Step S6, carrying out patent early warning analysis and prediction of technology maturity and technology evolution direction according to the class of the patent to be early-warned and the other patent texts in the same class, to obtain an analysis prediction result;

Step S7, visually displaying the analysis prediction result and sending it to the user.

Preferably, step S1 is preceded by:

Step S0, setting a patent early warning threshold and an early warning result receiving mode.

Preferably, the specific step of step S1 is: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent retrieval formula, constructing a patent subject database, and carrying out text denoising preprocessing at the same time.

Preferably, the specific step of step S2 is: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text F_k in the patent subject database forms a patent text representation matrix M_k after passing through the input layer, and the matrix M_k is expressed as a feature vector V_k through the operations of the convolutional layer and the pooling layer.

Preferably, the test set of step S3 is S＝{s_k＝(V_k,L_k)}, where S is the test set consisting of the feature vectors and labels of all documents in the patent subject database, and L_k is the patent label classification number corresponding to sample s_k.

Preferably, the step S4 includes:

Step S41, setting the population size N and the maximum number of iterations T in the parameters of the grey wolf optimization algorithm, setting the value ranges of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight in the parameters of the XGBoost model, and initializing the other parameters of the XGBoost model;

Step S42, randomly generating a grey wolf pack, wherein the position of each grey wolf consists of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight;

Step S43, the XGBoost model learns the training set according to the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each grey wolf is calculated according to the fitness function F_new;

Step S44, dividing the grey wolf pack into 4 grades, α, β, δ and ω, according to the fitness function values;

Step S45, updating the position of each individual in the grey wolf pack, recalculating the fitness function value F_new of each grey wolf at its new position, and comparing it with the optimal fitness function value F_g of the previous iteration; if F_new>F_g, the fitness function value of the grey wolf is set to F_new and its position is retained; otherwise the fitness function value remains F_g;

Step S46, repeating steps S42 to S45, stopping the iteration when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model;

Step S47, inputting the test set S into the XGBoost model for testing to obtain a trained XGBoost classifier.
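Steps S41 to S46 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the update rule uses the commonly published grey wolf optimizer equations (coefficients A and C, with a decreasing linearly from 2 to 0), which the text does not spell out; the search ranges in BOUNDS are invented; and toy_fitness is a hypothetical stand-in for F_new, which in practice would train an XGBoost model with the candidate parameters and score it.

```python
import random

# Search ranges for the four tuned parameters (illustrative values, not from the source).
BOUNDS = [
    (50, 500),    # number of base classifiers
    (0.01, 0.3),  # learning rate
    (3, 10),      # maximum tree depth
    (1, 10),      # minimum leaf node sample weight
]

def clip(position):
    """Keep every coordinate inside its value range."""
    return [min(max(x, lo), hi) for x, (lo, hi) in zip(position, BOUNDS)]

def toy_fitness(position):
    """Hypothetical stand-in for F_new; peaks at an arbitrary target point."""
    target = [200, 0.1, 6, 3]
    return -sum(((x - t) / (hi - lo)) ** 2
                for x, t, (lo, hi) in zip(position, target, BOUNDS))

def grey_wolf_optimize(fitness, n_wolves=10, max_iter=30, seed=0):
    rng = random.Random(seed)
    # S42: random initial pack, one position per wolf.
    wolves = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(n_wolves)]
    best_pos, best_fit = None, float("-inf")
    for t in range(max_iter):
        # S43/S44: score the pack; the three fittest wolves act as alpha, beta, delta.
        ranked = sorted(wolves, key=fitness, reverse=True)
        leaders = [list(w) for w in ranked[:3]]
        f_new = fitness(leaders[0])
        if f_new > best_fit:  # S45: keep the better of F_new and F_g
            best_pos, best_fit = leaders[0], f_new
        a = 2.0 * (1 - t / max_iter)  # exploration factor shrinks from 2 to 0
        moved = []
        for wolf in wolves:
            new = []
            for d, x in enumerate(wolf):
                total = 0.0
                for leader in leaders:
                    A = 2 * a * rng.random() - a
                    C = 2 * rng.random()
                    total += leader[d] - A * abs(C * leader[d] - x)
                new.append(total / 3)  # average of the three leader-guided moves
            moved.append(clip(new))
        wolves = moved
    # Score the final pack once more before returning (S46).
    last = max(wolves, key=fitness)
    if fitness(last) > best_fit:
        best_pos, best_fit = last, fitness(last)
    return best_pos, best_fit

best_params, best_score = grey_wolf_optimize(toy_fitness)
```

In a real run the number of base classifiers and the tree depth would be rounded to integers before being passed to the model, and fitness values would be cached, since each evaluation means training a model.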

Preferably, the fitness function F_new of step S43 is expressed as

F_new＝(F_Precision+F_Recall+F₁)/3；

where F_Precision is the precision, expressed as:

F_Precision＝TP/(TP+FP)；

F_Recall is the recall, expressed as:

F_Recall＝TP/(TP+FN)；

F₁ is an index measuring classification accuracy, expressed as:

F₁＝2·F_Precision·F_Recall/(F_Precision+F_Recall)；

where TP, FP and FN are the numbers of true positive, false positive and false negative examples obtained by comparing the true category of each patent text with its predicted category.
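As a worked illustration, the fitness value can be computed from true and predicted labels as below. The sketch assumes macro averaging over classes, as the detailed description states, and treats a class with no predictions or no true members as scoring 0 for the affected index; the function name macro_fitness is hypothetical.

```python
def macro_fitness(y_true, y_pred):
    """Return (macro precision + macro recall + macro F1) / 3."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return (sum(precisions) / n + sum(recalls) / n + sum(f1s) / n) / 3

# Two classes, four patents, one misclassified.
score = macro_fitness(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

For this toy input the per-class precisions are 1 and 2/3, the recalls 1/2 and 1, and the F₁ values 2/3 and 4/5, so the fitness is 139/180, about 0.772.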

Preferably, the specific steps of the patent early warning analysis in step S6 are: calculating the similarity between the patent to be early-warned and the other patents in the same category using the SimHash algorithm, and outputting the patents whose similarity exceeds the patent early warning threshold.

Preferably, the specific steps of predicting the technology maturity and the technology evolution direction in step S6 are: drawing a patent characteristic fitting curve according to the TRIZ theory, comparing the fitting curve with a standard S curve, and predicting the patent technology maturity in combination with a patent data prediction algorithm; and analyzing the technology evolution process in detail using a technology evolution radar map and displaying the visualization result to predict the technology evolution direction.

Preferably, the analysis prediction result in step S7 is sent to the user in the warning result receiving manner set in step S0.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm. Features of the patent texts in the patent subject database are extracted by a convolutional neural network and a test set is constructed. The XGBoost model is trained on the training set and improved by a grey wolf optimization algorithm to obtain its optimal parameters, then tested on the test set to obtain an XGBoost classifier whose classification accuracy is thereby ensured. The patent to be early-warned is classified by the XGBoost classifier, and its class, together with the other patent texts in the same class, is used for patent early warning analysis and for predicting technology maturity and technology evolution direction, yielding the final analysis prediction result. Improving the XGBoost model with the grey wolf optimization algorithm raises classification accuracy, reduces the number of operations and improves classification efficiency.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only preferred embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.

FIG. 1 is a flow chart of a technical anticipation and patent early warning analysis method based on an improved XGboost algorithm according to the present invention;

FIG. 2 is a diagram of a technical evolution radar chart based on the improved XGboost algorithm technical forecast and patent early warning analysis method of the invention;

fig. 3 is a technical evolution analysis process diagram of a technical forecast and patent early warning analysis method based on an improved XGBoost algorithm.

Detailed Description

For a better understanding of the technical content of the present invention, a specific embodiment is provided below, and the present invention is further described with reference to the accompanying drawings.

Referring to fig. 1, the technical anticipation and patent early warning analysis method based on the improved XGBoost algorithm provided by the present invention includes the following steps:

Step S3, constructing a test set S according to the feature vector V_k;

In the technology forecasting and patent early warning analysis method based on the improved XGBoost algorithm, the user first inputs a patent retrieval formula on the interface as prompted, and a patent subject database is constructed according to it. A convolutional neural network then extracts features from the patent texts in the database, and the resulting feature vectors are used to construct a test set S. The XGBoost model is trained on the training set and improved by the grey wolf optimization algorithm, and is finally tested on the test set prepared in advance, so that its parameters reach an optimal state and a trained XGBoost classifier is obtained. The patent to be early-warned is then input into the classifier, which yields its class and the other patent texts in the same class, so that patent early warning analysis and prediction of technology maturity and technology evolution direction can be carried out and the final analysis prediction result displayed visually. Feature extraction by a convolutional neural network avoids complex data preprocessing and does not depend on external knowledge, which makes the method user-friendly. Optimizing the XGBoost model with the grey wolf optimization algorithm improves classification precision and efficiency and reduces time complexity. The user only needs to input a patent retrieval formula to obtain patent early warning analysis and technology maturity and evolution direction predictions for the patent to be early-warned.

Preferably, step S1 is preceded by:

After inputting the patent retrieval formula on the interface, the user also sets a patent early warning threshold and an early warning result receiving mode. The patent early warning process of the invention uses the SimHash algorithm; empirically, for 64-bit SimHash values, two texts whose Hamming distance is within 3 can be considered highly similar, so the patent early warning threshold defaults to 3, although the user can choose a different threshold to suit the specific use. Various early warning result receiving modes are possible, such as receiving by mail.

The invention combines word vectors with a convolutional neural network to extract features from the patent texts of the constructed patent subject database and form feature vectors. The convolutional neural network comprises a convolutional layer, a pooling layer, a fully connected layer and an output layer; neurons of adjacent layers are connected to one another, and neurons within the same layer are not.

① Input layer: each word w_i in the patent text is converted into a vector v_i through a word vector dictionary. Word vectors make up for the shortcomings of the BOW and TF-IDF models in expressing the grammar, order and semantic relations of words. For a patent document F_k, a patent text representation matrix is formed by concatenating the vectors, denoted M_k＝{v₁,v₂,...,v_n}.

② Convolutional layer: the convolution kernel slides continuously over the rows of the matrix M_k to extract local features. The width b of the convolution kernel equals the dimension of the word vector v_i, and the height h of the convolution kernel represents the range of local text features to be extracted; a good feature extraction effect is obtained when h is between 2 and 5. n convolution kernels are slid over the matrix M_k and the convolution operation is performed.

Let M_k[i:j] denote rows i to j of the patent text matrix and D_i denote the i-th convolution kernel; the output of the convolution kernel can then be expressed as

r_i＝M_k[i:i+h-1]·D_i

C_i＝f(r_i+b)

where · is the dot product operation, C_i is the feature learned by the i-th convolution kernel, b is a bias variable and f is the activation function. ReLU is selected as the nonlinear activation function because, compared with activation functions such as Sigmoid, it reduces computational complexity, converges faster and has no gradient saturation.

③ Pooling layer: all local features C_i obtained by the convolutional layer are aggregated by the maximum pooling function, which acts on each captured feature C_i to reduce dimensionality and keep the feature with the highest value. The expression of the maximum pooling function is:

W_i＝pooling_max(C_i)；

where W_i is the maximum value obtained by applying the maximum pooling function to the local feature C_i; the n features generated by the n convolution kernels form the feature vector V_k＝{W₁,W₂,...,W_n}.

④ Output layer: the feature vector V_k obtained from the pooling layer is output.
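The input, convolutional and pooling layers can be illustrated with a small pure-Python sketch. The word vector dictionary, the kernel values and the function names are all invented for illustration; a real system would use trained embeddings and learned kernels, and the kernel height h here is simply the number of rows in each kernel.

```python
# Toy word-vector dictionary; the words and 2-dimensional vectors are invented.
WORD_VECTORS = {"grey": [1.0, 0.0], "wolf": [0.0, 1.0], "optimizer": [1.0, 1.0]}

def text_matrix(words):
    """Input layer: stack the word vectors v_i into the matrix M_k."""
    return [WORD_VECTORS[w] for w in words]

def feature_map(M, D, bias):
    """Convolutional layer: C = ReLU(window . D + bias) for each window of h rows."""
    h, width = len(D), len(D[0])
    out = []
    for i in range(len(M) - h + 1):
        r = sum(M[i + u][v] * D[u][v] for u in range(h) for v in range(width))
        out.append(max(0.0, r + bias))  # ReLU activation
    return out

def feature_vector(M, kernels, biases):
    """Pooling layer: keep the maximum of each kernel's feature map, giving V_k."""
    return [max(feature_map(M, D, b)) for D, b in zip(kernels, biases)]

M_k = text_matrix(["grey", "wolf", "optimizer"])
kernels = [
    [[1.0, 1.0], [1.0, 1.0]],      # h = 2, width = word-vector dimension
    [[-1.0, -1.0], [-1.0, -1.0]],
]
V_k = feature_vector(M_k, kernels, biases=[0.0, 0.0])
```

With these toy values the first kernel produces the feature map [2.0, 3.0] and the second is zeroed by ReLU, so V_k is [3.0, 0.0]: one pooled value per kernel.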

Preferably, the step S4 includes:

Step S43, the XGBoost model learns the training set according to the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each grey wolf is calculated according to the fitness function F_new;

Preferably, the fitness function F_new of step S43 is expressed as

F_new＝(F_Precision+F_Recall+F₁)/3；

where F_Precision is the precision, expressed as:

F_Precision＝TP/(TP+FP)；

F_Recall is the recall, expressed as:

F_Recall＝TP/(TP+FN)；

where TP, FP and FN are the numbers of true positive, false positive and false negative examples.

The XGBoost model is optimized by the grey wolf optimization algorithm. Because the XGBoost model contains many parameters, the four parameters with the greatest influence on the model are selected for iterative optimization and the other parameters are left at their default values, in order to improve the classification precision and efficiency of the model. The classification accuracy of the model is evaluated with three indexes, the precision F_Precision, the recall F_Recall and F₁, and their macro average is used as the fitness function; macro averaging means averaging the F_Precision, F_Recall and F₁ values of all classes to evaluate the overall performance of patent text classification.

The improved XGboost model comprises the following specific implementation steps:

a. Initializing the weak learner:

In the case of squared loss,

b. Iteratively generating M base learners; for m = 1, 2, ..., M:

1) For each sample i = 1, 2, ..., n, calculating the negative gradient, i.e. the residual:

2) Taking the residual obtained in the previous step as the new true value of the sample, and using the data (x_i, r_im), i = 1, 2, ..., n, where r_im denotes that residual, as training data for the next tree to obtain a new regression tree f_m(x), whose leaf node regions are R_jm, j = 1, 2, ..., J, where J is the number of leaf nodes of the regression tree.

3) For each leaf region R_jm, j = 1, 2, ..., J, calculating the best-fit value γ by taking the derivative with respect to γ and setting it to 0:

4) Updating the strong learner:

c. Obtaining the final learner:

d. Using the final learner to obtain the classification result of each patent document by scoring.
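For reference, in the standard gradient boosting tree formulation, steps a to c are usually written as follows. This is a sketch under the assumption that the method follows that textbook formulation; the loss function L, the true value y_i, the constant c and the indicator I are notation introduced here rather than taken from the text.

```latex
% a. initial weak learner, and its closed form under squared loss
f_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c),
\qquad
f_0(x) = \frac{1}{n} \sum_{i=1}^{n} y_i \quad \text{(squared loss)}

% b.1 negative gradient (residual) of sample i at iteration m
r_{im} = -\left[ \frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)} \right]_{f = f_{m-1}}

% b.3 best-fit value of leaf region R_{jm}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, f_{m-1}(x_i) + \gamma\bigr)

% b.4 update of the strong learner
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})

% c. final learner
f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})
```

Under squared loss the residual reduces to r_im = y_i - f_{m-1}(x_i), and γ_jm is the mean residual of the samples falling in R_jm.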

The invention uses the SimHash algorithm for patent early warning analysis: the similarity between the patent to be early-warned and the other patents in the same category is calculated, and patents whose similarity exceeds the early warning threshold are output, thereby realizing patent early warning. The main idea of the SimHash algorithm is dimension reduction: a high-dimensional feature vector is mapped to a low-dimensional one, and whether two articles are duplicates or highly similar is determined by the Hamming distance between the two vectors. The SimHash algorithm of the invention has 4 steps: word segmentation and weight calculation, hash calculation, weighting and merging, and dimension reduction output.

The first step: word segmentation and weight calculation. The text is segmented into words and the weight of each word in the text is calculated; for an overly long text, only the top k words are selected, giving k keyword-weight pairs.

The second step: hash calculation. The hash value of each keyword is calculated by a hash function; the hash value is an n-bit signature consisting of the binary digits 0 and 1, so each keyword-weight pair becomes a hash value-weight pair.

The third step: weighting and merging. The weight of each keyword from the first step is applied bitwise to its hash value, i.e. W = hash × weight: where a bit is 1 the weight is taken as positive, and where a bit is 0 it is taken as negative. To characterize the text as a whole, the weighted values of all its keywords are merged and accumulated bit by bit.

The fourth step: dimension reduction output. The weighting result of the third step already constitutes a feature code of the text; the purpose of dimension reduction is to reduce the complexity of that code. Each position of the feature code is examined: a value greater than or equal to 0 is set to 1, and a value less than 0 is set to -1, which gives the SimHash value of the text. Finally, whether the similarity of two texts exceeds the early warning threshold is judged from the Hamming distance between their SimHash values.
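The four steps can be sketched as follows, assuming the keyword weights have already been computed. Several choices here are assumptions rather than details from the text: MD5 truncated to 64 bits serves as the hash function, negative positions are stored as bit 0 instead of -1 so that the fingerprint packs into one integer (the Hamming distance is unaffected), and "similarity exceeding the threshold" is read as a Hamming distance no greater than the threshold, following the default of 3 described earlier.

```python
import hashlib

BITS = 64

def keyword_hash(word):
    """Second step: a 64-bit hash of the keyword (MD5 truncated; an assumed choice)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def simhash(weighted_keywords):
    """Third and fourth steps: weight, merge, then reduce to a 64-bit fingerprint."""
    totals = [0.0] * BITS
    for word, weight in weighted_keywords.items():
        h = keyword_hash(word)
        for i in range(BITS):
            # Bit 1 adds the weight, bit 0 subtracts it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total >= 0:  # non-negative positions become 1, negative ones stay 0
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def triggers_warning(a, b, threshold=3):
    return hamming_distance(a, b) <= threshold

# Hypothetical keyword-weight pairs for two patents in the same class.
doc_a = {"xgboost": 3.0, "patent": 2.0, "warning": 1.0}
doc_b = {"xgboost": 3.0, "patent": 2.0, "warning": 1.0}
same = triggers_warning(simhash(doc_a), simhash(doc_b))
```

Identical keyword-weight pairs give identical fingerprints and a Hamming distance of 0, so the pair is flagged; unrelated texts typically differ in roughly half of the 64 bits.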

The technology maturity prediction of the invention is based on the TRIZ theory: a patent characteristic fitting curve is drawn according to the TRIZ theory and compared with a standard S curve, and the patent technology maturity is predicted in combination with a patent data measurement algorithm. The TRIZ theory holds that the evolution of a technology passes through 4 stages: infancy, growth, maturity and decline. The maturity prediction mainly examines 4 indexes: performance parameters, patent grade, patent quantity and economic benefit. To analyze the patent documents of a given label, the number and grade of the patents are first counted and their curves over time are drawn. The main performance parameters and economic benefit indexes of the labeled patent technology are then investigated, and the corresponding performance parameter curve and economic benefit curve are drawn. Next, a suitable fitting model is selected to draw the patent characteristic fitting curve. Finally, the fitting curve is compared with the standard S curve, and the slopes of the 4 curves obtained from the 4 indexes are analyzed together to determine the position of the labeled patent technology on the S curve, i.e. its current life cycle stage, thereby predicting the maturity of the patent technology.

The technology evolution direction is predicted by analysis with a technology evolution radar map, which visually shows the gap between the patent technology and the evolution limit so that the places where the technology needs improvement and innovation can be seen clearly. As shown in FIG. 2, the center of the polygon is the lowest level of technology evolution, the periphery of the polygon is the limit of technology evolution, each spoke represents an evolution route, and the scales on a spoke represent the levels of that route. Connecting the positioning points of the current patent technology system on each route yields the shaded part, which represents the current state of the system; the blank part of the polygon not covered by the shading represents its development potential. A technology system can be divided into several subsystems and a radar map drawn for each, so as to judge which subsystems perform well and which are weak links.

The specific steps of predicting the technology evolution direction with the radar map are as follows. First, the technology system formed by the patent documents of a given label is analyzed, and several technology routes related to it, i.e. its possible evolution directions, are designed. The technology system is then positioned on each route, i.e. its current evolution state in each direction is determined, and the technology evolution radar map is drawn. The radar map is then analyzed: if a technical innovation point is found, technical innovation is carried out on the technology system; otherwise the radar map is subdivided into a radar tree map and the analysis steps are repeated. The technology evolution analysis process is shown in FIG. 3.
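One way to put numbers on the shaded and blank parts of the radar map is to compare polygon areas. This is an illustrative assumption rather than a step described in the text: the spokes are taken to be equally spaced, the area-ratio measure and the function names are hypothetical, and at least three routes are required.

```python
import math

def radar_area(levels):
    """Area of the polygon joining points at the given distances on equally spaced spokes."""
    n = len(levels)
    angle = 2 * math.pi / n
    return 0.5 * math.sin(angle) * sum(
        levels[i] * levels[(i + 1) % n] for i in range(n)
    )

def development_potential(current, limit):
    """Fraction of the limit polygon not covered by the current-state polygon."""
    return 1.0 - radar_area(current) / radar_area(limit)

def weakest_route(current, limit):
    """Index of the evolution route furthest from its limit."""
    return min(range(len(current)), key=lambda i: current[i] / limit[i])

# Four evolution routes, each with a limit of 5 levels (invented values).
current_state = [4, 2, 3, 5]
evolution_limit = [5, 5, 5, 5]
potential = development_potential(current_state, evolution_limit)
weak = weakest_route(current_state, evolution_limit)
```

Here route 1 is the weak link, and the blank area accounts for just under half of the polygon; a subsystem-level radar map could be scored the same way.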

After the analysis method has been executed, the invention outputs the analysis prediction results to each user in turn, comprising a patent early warning analysis result, a technology maturity prediction result and a technology evolution direction prediction result. For patent early warning analysis, all patents whose similarity to the patent to be early-warned exceeds the early warning threshold are output and sent to the user in real time in the receiving mode the user selected, so that the user avoids falling into a patent trap.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A technology forecasting and patent early warning analysis method based on an improved XGboost algorithm is characterized by comprising the following steps:

Step S2, extracting features from the patent texts of the patent subject database using a convolutional neural network to obtain a feature vector V_k；

Step S3, constructing a test set S according to the feature vector V_k；

2. The advanced XGboost algorithm-based technical anticipation and patent early warning analysis method as claimed in claim 1, wherein the step S1 is preceded by the steps of:

3. The improved XGboost algorithm-based technical anticipation and patent early warning analysis method according to claim 1, wherein the specific steps of the step S1 are as follows: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent retrieval formula, constructing a patent subject database, and carrying out text denoising preprocessing at the same time.

4. The improved XGboost algorithm-based technical anticipation and patent early warning analysis method according to claim 1, wherein the specific steps of the step S2 are as follows: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text F_k in the patent subject database forms a patent text representation matrix M_k after passing through the input layer, and the matrix M_k is expressed as a feature vector V_k through the operations of the convolutional layer and the pooling layer.

5. The improved XGboost algorithm-based technical anticipation and patent early warning analysis method as claimed in claim 1, wherein the test set of the step S3 is S＝{s_k＝(V_k,L_k)}, where S is the test set consisting of the feature vectors and labels of all documents in the patent subject database, and L_k is the patent label classification number corresponding to sample s_k.

6. The improved XGboost algorithm-based technical anticipation and patent early warning analysis method according to claim 1, wherein the step S4 comprises:

Step S45, updating the position of each individual in the grey wolf pack, recalculating the fitness function value F_new of each grey wolf at its new position, and comparing it with the optimal fitness function value F_g of the previous iteration; if F_new>F_g, the fitness function value of the grey wolf is set to F_new and its position is retained; otherwise the fitness function value remains F_g；

7. The improved XGboost algorithm-based technical anticipation and patent early warning analysis method according to claim 6, wherein the fitness function F_new of the step S43 is expressed as

F_new＝(F_Precision+F_Recall+F₁)/3；

where F_Precision is the precision, expressed as:

F_Precision＝TP/(TP+FP)；

F_Recall is the recall, expressed as:

F_Recall＝TP/(TP+FN)；

8. The advanced XGboost algorithm-based technical anticipation and patent early warning analysis method as claimed in claim 2, wherein the patent early warning analysis in step S6 comprises the following specific steps: and calculating the similarity of the patent to be early-warned and other patents belonging to the same category by using a SimHash algorithm, and outputting the patent with the similarity exceeding a patent early-warning threshold value.

9. The improved XGboost algorithm-based technical forecasting and patent early warning analysis method as claimed in claim 1, wherein the specific steps of predicting the technical maturity and the technical evolution direction in the step S6 are as follows: drawing a patent characteristic fitting curve according to a TRIZ theory, comparing the fitting curve with a standard S curve, and meanwhile, predicting the patent technology maturity by combining a patent data prediction algorithm; and analyzing the technical evolution process in detail by using a technical evolution radar map, and displaying a visualization result to predict the technical evolution direction.

10. The advanced XGboost algorithm-based technical anticipation and patent early warning analysis method as claimed in claim 2, wherein the analysis and prediction result in step S7 is sent to the user in an early warning result receiving manner set in step S0.