CN111708865B - Technology forecasting and patent early warning analysis method based on improved XGboost algorithm - Google Patents
- Publication number
- CN111708865B (application CN202010557407.XA)
- Authority
- CN
- China
- Prior art keywords
- xgboost
- early warning
- algorithm
- early
- technical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
- G06Q50/184—Intellectual property management
Abstract
The invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm. A user inputs a patent retrieval formula, and a patent subject database is constructed from it. A convolutional neural network extracts features from the patent texts in the database to obtain feature vectors, and a test set is constructed from the feature vectors. An XGBoost model is trained on the training set and then improved with the grey wolf optimization algorithm, which raises classification precision and classification efficiency; testing the model on the test set yields the XGBoost classifier. When a patent to be early-warned is input into the classifier, its class and the other patent texts in the same class are obtained, so that patent early warning analysis and prediction of technology maturity and technology evolution direction can be carried out. The user receives an accurate, highly visual prediction result and can grasp the current state of the technology and its likely evolution at a glance.
Description
Technical Field
The invention relates to the technical field of patent information processing, and in particular to a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm.
Background
With the rapid development of science and technology, new technologies emerge constantly, intellectual property receives increasing attention, and the competitive environment of the market grows more complex. Staying ahead in intense technical competition is essential to an enterprise's competitiveness, and patents are increasingly a core element of that competitiveness. By analyzing existing patents, an enterprise can carry out technology forecasting and patent early warning, thereby avoiding patent traps and anticipating how a technology will develop.
CN106897392A discloses a technology competition and patent early warning analysis method based on knowledge discovery. After a user inputs a retrieval formula, a special database is established; a cluster data set of the database is obtained with data mining and knowledge discovery tools such as vector space models and mathematical statistics; patent early warning and patent theme life cycle analysis are then carried out on the cluster data set, and a visual result is provided to the user, thereby realizing technology competition analysis and patent early warning.
CN106845798A discloses a cross-domain patent early warning information analysis method based on a multi-branch tree. Its core idea is to screen and classify the collected patent data and extract features from it, build a multi-branch tree of the patent domain in which each leaf node stores a patent technology and its associated user information, search the tree for leaf nodes matching the patent to be early-warned, and carry out cross-domain patent early warning according to the matching results. The drawback is that data collection must gather a large amount of patent technology information together with the related user information, which costs considerable time and labor. The collected information is not necessarily reliable, so the content stored in the leaf nodes may be inaccurate, search and matching are error-prone, and early warning accuracy suffers greatly.
CN107369007A discloses a big data patent early warning service system based on a genetic algorithm. Its data mining module classifies a patent data set with a data mining algorithm based on the genetic algorithm, and the classification result is then analyzed to realize patent early warning.
Disclosure of Invention
Therefore, the invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm, which optimizes the parameters of the XGBoost model and uses the improved model to classify the patent subject database, improving classification precision and classification efficiency.
The technical scheme of the invention is realized as follows:
A technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm comprises the following steps:
Step S1, constructing a patent subject database according to the patent retrieval formula;
Step S2, extracting features from the patent texts in the patent subject database with a convolutional neural network to obtain feature vectors $V_k$;
Step S3, constructing a test set S from the feature vectors $V_k$;
Step S4, inputting the training set into an XGBoost model for training, optimizing the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model with the grey wolf optimization algorithm, and inputting the test set into the improved XGBoost model for testing to obtain the XGBoost classifier;
Step S5, extracting the features of the patent to be early-warned and inputting them into the XGBoost classifier to obtain the class of the patent to be early-warned and the other patent texts in the same class;
Step S6, carrying out patent early warning analysis and prediction of technology maturity and technology evolution direction according to the class of the patent to be early-warned and the other patent texts in the same class, to obtain an analysis prediction result;
Step S7, visually displaying the analysis prediction result and sending it to the user.
Preferably, step S1 is preceded by:
Step S0, setting a patent early warning threshold and an early warning result receiving mode.
Preferably, step S1 specifically comprises: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent retrieval formula, constructing the patent subject database, and at the same time performing text denoising preprocessing.
Preferably, step S2 specifically comprises: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text $F_k$ in the patent subject database forms a patent text representation matrix $M_k$ after passing through the input layer, and $M_k$ is expressed as a feature vector $V_k$ by the operations of the convolutional and pooling layers.
Preferably, the test set of step S3 is $S = \{s_k = (V_k, L_k)\}$, where S is the test set consisting of the feature vectors and labels of all documents in the patent subject database, and $L_k$ is the patent label classification number corresponding to sample $s_k$.
Preferably, the step S4 includes:
Step S41, setting the population size N and the maximum number of iterations T of the grey wolf optimization algorithm, setting the value ranges of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model, and initializing the other parameters of the XGBoost model;
Step S42, randomly generating the grey wolf pack, wherein the position of each grey wolf consists of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight;
Step S43, the XGBoost model learns the training set using the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each grey wolf is calculated according to the fitness function $F_{new}$;
Step S44, dividing the grey wolf pack into four ranks, α, β, δ and ω, according to the fitness function values;
Step S45, updating the position of each individual in the grey wolf pack, recalculating the fitness function value $F_{new}$ of each grey wolf at its new position, and comparing it with the optimal fitness function value $F_g$ of the previous iteration; if $F_{new} > F_g$, the fitness function value of the grey wolf is set to $F_{new}$ and its new position is retained; otherwise the fitness function value remains $F_g$;
Step S46, repeating steps S42 to S45, stopping when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model;
Step S47, inputting the test set S into the XGBoost model for testing to obtain the trained XGBoost classifier.
Preferably, the fitness function $F_{new}$ of step S43 is expressed as
$F_{new} = (F_{Precision} + F_{Recall} + F_1)/3$;
where $F_{Precision}$ is the precision, $F_{Precision} = TP/(TP + FP)$; $F_{Recall}$ is the recall, $F_{Recall} = TP/(TP + FN)$; and $F_1$ is an index measuring classification accuracy, $F_1 = 2 F_{Precision} F_{Recall}/(F_{Precision} + F_{Recall})$. TP, FP and FN are the numbers of true positives, false positives and false negatives, obtained by comparing the true categories and the predicted categories of the patent texts.
Preferably, the patent early warning analysis in step S6 specifically comprises: calculating the similarity between the patent to be early-warned and the other patents in the same category with the SimHash algorithm, and outputting the patents whose similarity exceeds the patent early warning threshold.
Preferably, the prediction of technology maturity and technology evolution direction in step S6 specifically comprises: drawing a patent characteristic fitting curve according to TRIZ theory, comparing the fitting curve with the standard S curve, and predicting the maturity of the patented technology in combination with a patent data prediction algorithm; and analyzing the technology evolution process in detail with a technology evolution radar chart and displaying the visualization result to predict the technology evolution direction.
Preferably, the analysis prediction result in step S7 is sent to the user in the warning result receiving manner set in step S0.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm. Features are extracted from the patent texts of a patent subject database by a convolutional neural network and a test set is constructed; an XGBoost model is trained on the training set and improved with the grey wolf optimization algorithm to obtain its optimal parameters; the model is then tested on the test set to obtain the XGBoost classifier, which ensures the classification accuracy of the classifier. The patent to be early-warned is classified by the XGBoost classifier, and its class together with the other patent texts in the same class is used for patent early warning analysis and for predicting technology maturity and technology evolution direction, finally yielding an analysis prediction result. Improving the XGBoost model with the grey wolf optimization algorithm raises classification precision, reduces the number of operations and improves classification efficiency.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only preferred embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm according to the present invention;
FIG. 2 is a technology evolution radar chart of the technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm according to the present invention;
FIG. 3 is a technology evolution analysis process diagram of the technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm according to the present invention.
Detailed Description
For a better understanding of the technical content of the present invention, a specific embodiment is provided below, and the present invention is further described with reference to the accompanying drawings.
Referring to FIG. 1, the technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm provided by the present invention includes the following steps:
Step S1, constructing a patent subject database according to the patent retrieval formula;
Step S2, extracting features from the patent texts in the patent subject database with a convolutional neural network to obtain feature vectors $V_k$;
Step S3, constructing a test set S from the feature vectors $V_k$;
Step S4, inputting the training set into an XGBoost model for training, optimizing the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model with the grey wolf optimization algorithm, and inputting the test set into the improved XGBoost model for testing to obtain the XGBoost classifier;
Step S5, extracting the features of the patent to be early-warned and inputting them into the XGBoost classifier to obtain the class of the patent to be early-warned and the other patent texts in the same class;
Step S6, carrying out patent early warning analysis and prediction of technology maturity and technology evolution direction according to the class of the patent to be early-warned and the other patent texts in the same class, to obtain an analysis prediction result;
Step S7, visually displaying the analysis prediction result and sending it to the user.
In this method, the user first inputs a patent retrieval formula on an interface as prompted, and a patent subject database is constructed from it. A convolutional neural network then extracts feature vectors from the patent texts in the database, and the feature vectors are used to construct a test set S. An XGBoost model is trained on the training set and improved with the grey wolf optimization algorithm, and the model is finally tested on the test set prepared in advance so that its parameters reach an optimal state, which gives the trained XGBoost classifier. The patent to be early-warned is then input into the XGBoost classifier, which returns its class and the other patent texts in the same class, so that patent early warning analysis and prediction of technology maturity and technology evolution direction can be carried out and the final analysis prediction result displayed visually. Extracting features with a convolutional neural network avoids complex data preprocessing and does not depend on external knowledge, which makes the method user-friendly. Optimizing the XGBoost model with the grey wolf optimization algorithm improves classification precision and classification efficiency and reduces time complexity. The user only needs to input a patent retrieval formula to obtain patent early warning analysis and predictions of technology maturity and technology evolution direction for the patent to be early-warned.
Preferably, step S1 is preceded by:
Step S0, setting a patent early warning threshold and an early warning result receiving mode.
After inputting the patent retrieval formula on the interface, the user also sets the patent early warning threshold and the early warning result receiving mode. The patent early warning process of the invention uses the SimHash algorithm; empirically, for 64-bit SimHash values, two texts whose Hamming distance is within 3 can be considered highly similar, so the patent early warning threshold defaults to 3, although the user can choose a different threshold for a specific use. Various early warning result receiving modes are possible, such as receiving by mail.
Preferably, step S1 specifically comprises: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent retrieval formula, constructing the patent subject database, and at the same time performing text denoising preprocessing.
Preferably, step S2 specifically comprises: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text $F_k$ in the patent subject database forms a patent text representation matrix $M_k$ after passing through the input layer, and $M_k$ is expressed as a feature vector $V_k$ by the operations of the convolutional and pooling layers.
The invention combines word vectors with a convolutional neural network to extract features from the patent texts of the constructed patent subject database and form feature vectors. The convolutional neural network comprises a convolutional layer, a pooling layer, a fully connected layer and an output layer; neurons in adjacent layers are connected to one another, and neurons in the same layer are not.
① Input layer: each word $w_i$ in the patent text is converted into a vector $v_i$ through a word vector dictionary. Word vectors make up for the shortcomings of the BOW and TF-IDF models in expressing the grammar, order and semantic relations of words. For a patent document $F_k$, the patent text representation matrix is formed by joining the vectors and is denoted $M_k = \{v_1, v_2, ..., v_n\}$.
② Convolutional layer: the convolution kernel slides continuously over the rows of the matrix $M_k$ to extract local features. The width $b$ of the convolution kernel equals the dimension of the word vector $v_i$, and the height $h$ of the kernel sets the range of local text features to be extracted; a good feature extraction effect is obtained when $h$ is between 2 and 5. $n$ convolution kernels slide over the matrix $M_k$ and perform the convolution operation.
Let $M_k[i:j]$ denote rows $i$ to $j$ of the patent text matrix and $D_i$ the $i$-th convolution kernel; the output of the convolution kernel can be expressed as
$r_i = M_k[i:i+h-1] \cdot D_i$
$C_i = f(r_i + b)$
where $\cdot$ is the dot product operation, $C_i$ is the feature learned by the $i$-th convolution kernel, $b$ is a bias variable and $f$ is an activation function. ReLU is selected as the nonlinear activation function because, compared with activation functions such as Sigmoid, it has lower computational complexity, converges faster and avoids gradient saturation.
③ Pooling layer: the max pooling function aggregates all local features $C_i$ obtained by the convolutional layer. It acts on each captured feature $C_i$ to reduce dimensionality and keep the feature with the highest value. The max pooling function is expressed as:
$W_i = \mathrm{pooling}_{max}(C_i)$;
where $W_i$ is the maximum value obtained by applying the max pooling function to the local feature $C_i$; the $n$ features generated by the $n$ convolution kernels can be denoted $V_k = \{W_1, W_2, ..., W_n\}$.
④ Output layer: the feature vector $V_k$ obtained from the pooling layer is output.
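The four layers above can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the function names `text_to_matrix` and `conv_max_features`, the tiny two-dimensional word vectors and the hand-written kernels are all hypothetical, and skipping words missing from the dictionary is an assumption. Each kernel slides over $h$ consecutive rows of $M_k$, the response passes through ReLU, and max pooling keeps one value per kernel.

```python
def relu(x):
    """ReLU activation: pass positive values through, clamp the rest to 0."""
    return x if x > 0.0 else 0.0


def text_to_matrix(words, dictionary):
    """Input layer: look up each word's vector; unknown words are skipped (assumption)."""
    return [dictionary[w] for w in words if w in dictionary]


def conv_max_features(M, kernels, biases):
    """Convolution + ReLU + max pooling: one pooled feature per kernel.

    M       -- n x d text matrix (list of rows)
    kernels -- list of h x d kernels; kernel width equals the word-vector dimension d
    biases  -- one bias per kernel
    """
    features = []
    for D, bias in zip(kernels, biases):
        h, d = len(D), len(D[0])
        if len(M) < h:
            raise ValueError("text has fewer rows than the kernel height")
        responses = []
        for i in range(len(M) - h + 1):
            window = M[i:i + h]                      # rows i .. i+h-1
            r = sum(window[a][c] * D[a][c] for a in range(h) for c in range(d))
            responses.append(relu(r + bias))         # C_i = f(r_i + b)
        features.append(max(responses))              # W_i = max pooling over C_i
    return features


# Hypothetical toy data: three words with 2-dimensional vectors, two kernels of height 2.
dictionary = {"patent": [1.0, 0.0], "early": [0.0, 1.0], "warning": [1.0, 1.0]}
M_k = text_to_matrix(["patent", "early", "warning"], dictionary)
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
V_k = conv_max_features(M_k, kernels, biases=[0.0, 0.0])
print(V_k)
```

In practice the kernels and biases would be learned during training and the word vectors would come from a trained embedding; the sketch only shows how the shapes and operations fit together.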
Preferably, the test set of step S3 is $S = \{s_k = (V_k, L_k)\}$, where S is the test set consisting of the feature vectors and labels of all documents in the patent subject database, and $L_k$ is the patent label classification number corresponding to sample $s_k$.
Preferably, the step S4 includes:
Step S41, setting the population size N and the maximum number of iterations T of the grey wolf optimization algorithm, setting the value ranges of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model, and initializing the other parameters of the XGBoost model;
Step S42, randomly generating the grey wolf pack, wherein the position of each grey wolf consists of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight;
Step S43, the XGBoost model learns the training set using the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each grey wolf is calculated according to the fitness function $F_{new}$;
Step S44, dividing the grey wolf pack into four ranks, α, β, δ and ω, according to the fitness function values;
Step S45, updating the position of each individual in the grey wolf pack, recalculating the fitness function value $F_{new}$ of each grey wolf at its new position, and comparing it with the optimal fitness function value $F_g$ of the previous iteration; if $F_{new} > F_g$, the fitness function value of the grey wolf is set to $F_{new}$ and its new position is retained; otherwise the fitness function value remains $F_g$;
Step S46, repeating steps S42 to S45, stopping when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model;
Step S47, inputting the test set S into the XGBoost model for testing to obtain the trained XGBoost classifier.
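Steps S41 to S46 can be sketched as a small grey wolf optimizer in pure Python. This is an illustration under stated assumptions, not the patented procedure: the position update is the commonly published grey wolf rule (each wolf moves toward the average of the points suggested by the α, β and δ leaders), the greedy acceptance is one reading of step S45, and `toy_fitness`, `TARGET` and `BOUNDS` are hypothetical stand-ins for training an XGBoost model and scoring it with $F_{new}$.

```python
import random


def grey_wolf_optimize(fitness, bounds, n_wolves=10, max_iter=30, seed=0):
    """Maximize `fitness` over box-bounded positions with a basic grey wolf optimizer."""
    rng = random.Random(seed)
    # S42: random initial pack, one coordinate per tuned parameter.
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    scores = [fitness(w) for w in wolves]                     # S43
    history = []
    for t in range(max_iter):
        # S44: rank the pack; the three best wolves act as alpha, beta, delta.
        order = sorted(range(n_wolves), key=lambda i: scores[i], reverse=True)
        leaders = [list(wolves[i]) for i in order[:3]]
        history.append(scores[order[0]])
        a = 2.0 * (1.0 - t / max_iter)                        # decreases from 2 toward 0
        for i in range(n_wolves):                             # S45: move every wolf
            candidate = []
            for d, (lo, hi) in enumerate(bounds):
                pulls = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    dist = abs(C * leader[d] - wolves[i][d])
                    pulls.append(leader[d] - A * dist)
                step = sum(pulls) / 3.0
                candidate.append(max(lo, min(hi, step)))      # keep inside the value range
            new_score = fitness(candidate)
            if new_score > scores[i]:                         # keep the move only if it improves
                wolves[i], scores[i] = candidate, new_score
    best = max(range(n_wolves), key=lambda i: scores[i])      # S46: report the best position
    history.append(scores[best])
    return wolves[best], scores[best], history


# Hypothetical search space: (number of base classifiers, learning rate,
# maximum tree depth, minimum leaf node sample weight).
BOUNDS = [(50.0, 500.0), (0.01, 0.3), (3.0, 10.0), (1.0, 10.0)]
TARGET = (200.0, 0.1, 6.0, 1.0)  # made-up optimum, used only by the toy fitness


def toy_fitness(position):
    """Stand-in for 'train XGBoost with these parameters and return F_new'."""
    return -sum(((p - t) / (hi - lo)) ** 2
                for p, t, (lo, hi) in zip(position, TARGET, BOUNDS))


best_position, best_fitness, history = grey_wolf_optimize(toy_fitness, BOUNDS)
print(best_position, best_fitness)
```

With a real model, the fitness callback would round the integer-valued parameters (number of base classifiers, tree depth), train the classifier on the training set and return the macro-averaged score, which makes each fitness evaluation far more expensive than in this toy.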
Preferably, the fitness function $F_{new}$ of step S43 is expressed as
$F_{new} = (F_{Precision} + F_{Recall} + F_1)/3$;
where $F_{Precision}$ is the precision, $F_{Precision} = TP/(TP + FP)$; $F_{Recall}$ is the recall, $F_{Recall} = TP/(TP + FN)$; and $F_1$ is an index measuring classification accuracy, $F_1 = 2 F_{Precision} F_{Recall}/(F_{Precision} + F_{Recall})$. TP, FP and FN are the numbers of true positives, false positives and false negatives, obtained by comparing the true categories and the predicted categories of the patent texts.
The XGBoost model is optimized with the grey wolf optimization algorithm. Because the XGBoost model contains many parameters, the four parameters with the largest influence on the model are selected for iterative optimization, and the other parameters are set to their default values, in order to improve the classification precision and classification efficiency of the model. The classification accuracy of the model is evaluated with three indexes, precision $F_{Precision}$, recall $F_{Recall}$ and $F_1$, and their macro average is taken as the fitness function; the macro average averages the $F_{Precision}$, $F_{Recall}$ and $F_1$ values of all classes to evaluate the overall performance of patent text classification.
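As a worked illustration of the macro-averaged fitness described above, the following sketch computes per-class precision, recall and F1 from true and predicted labels, averages each over the classes, and returns the mean of the three averages. The function name `macro_fitness` and the convention of scoring an undefined ratio as 0 are assumptions made for this example.

```python
def macro_fitness(y_true, y_pred):
    """Return (macro precision + macro recall + macro F1) / 3 for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (empty denominators) are scored as 0 (assumption).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return (sum(precisions) / n + sum(recalls) / n + sum(f1s) / n) / 3.0


# Two classes, one of four documents misclassified.
score = macro_fitness(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(round(score, 4))  # 139/180, about 0.7722
```

For class A the precision is 1 and the recall 1/2; for class B the precision is 2/3 and the recall 1. The macro averages are 5/6, 3/4 and 11/15, whose mean is 139/180.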
The improved XGboost model is implemented through the following specific steps:
a. Initialize the weak learner.
b. Iteratively generate M base learners; for m = 1, 2, ..., M:
1) For each sample i = 1, 2, ..., N, calculate the negative gradient, i.e. the residual $r_{im}$.
2) Take the residual obtained in the previous step as the new true value of the sample, and use the data $(x_i, r_{im})$, i = 1, 2, ..., N, as the training data of the next tree to obtain a new regression tree $f_m(x)$ whose leaf node regions are $R_{jm}$, j = 1, 2, ..., J, where J is the number of leaf nodes of the regression tree.
3) For each leaf region $R_{jm}$, j = 1, 2, ..., J, calculate the best fit value by differentiating with respect to γ and setting the derivative to 0.
4) Update the strong learner.
c. Obtain the final learner.
d. Use the final learner to obtain the classification result of each patent document by scoring.
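The formulas for steps a to c are not reproduced in this text. For orientation, the standard gradient tree boosting expressions that match these steps are given below. They are the textbook forms, offered as an assumption about what the steps refer to rather than as the patent's own equations; $L$ denotes a differentiable loss function, $y_i$ the label of sample $x_i$, $r_{im}$ the residual of sample $i$ at iteration $m$, and $I(\cdot)$ the indicator function.

```latex
% a. initialize the weak learner with the best constant prediction
f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)

% b.1) negative gradient (residual) of sample i at iteration m
r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}

% b.3) best fit value of leaf region R_{jm}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, f_{m-1}(x_i) + \gamma\bigr)

% b.4) update the strong learner
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \, I(x \in R_{jm})

% c. final learner after M iterations
f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm} \, I(x \in R_{jm})
```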
Preferably, the patent early warning analysis in step S6 comprises the following specific steps: calculating the similarity between the patent to be early-warned and the other patents belonging to the same category with a SimHash algorithm, and outputting the patents whose similarity exceeds the patent early warning threshold.
The invention uses the SimHash algorithm for patent early warning analysis: the similarity between the patent to be early-warned and the other patents belonging to the same category is calculated with the SimHash algorithm, and the patents whose similarity exceeds the early warning threshold are output, thereby realizing patent early warning. The main idea of the SimHash algorithm is dimension reduction: a high-dimensional feature vector is mapped to a low-dimensional feature vector, and whether two articles are duplicates or highly similar is determined by the Hamming distance between the two vectors. The SimHash algorithm of the invention is divided into 4 steps: word segmentation and weight calculation, hash calculation, weighting and merging, and dimension reduction output.
The first step: word segmentation and weight calculation. The text is segmented into words and the weight of each word in the text is calculated; for an overly long text, the top k words may be selected for the calculation, giving k keyword-weight pairs.
The second step: hash calculation. The hash value of each keyword is calculated with a hash function; the hash value is an n-bit signature consisting of the binary digits 0 and 1, so that each keyword-weight pair is converted into a hash value-weight pair.
The third step: weighting and merging. The hash value of each keyword is multiplied bitwise by the weight of the keyword obtained in the first step, i.e. W = hash × weight: where a bit is 1 the weight is taken as positive, and where a bit is 0 the weight is taken as negative. If the global features of the text need to be analyzed, the weighted values of the keywords of the text are merged and accumulated.
The fourth step: dimension reduction output. The weighting result of the third step already constitutes a feature code of the text, and the purpose of dimension reduction is to reduce the complexity of this feature code. Each position of the feature code is examined: a value greater than or equal to 0 is set to 1, and a value less than 0 is set to -1, giving the SimHash value of the text. Finally, whether the similarity of the texts exceeds the early warning threshold is judged from the Hamming distance between the SimHash values of different texts.
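The four steps can be sketched in a few lines. This is a minimal illustration under several assumptions: word segmentation and weighting are taken as already done, so the input is a list of keyword-weight pairs; MD5 truncated to 64 bits stands in for the unspecified hash function; positions that the text sets to -1 are stored as unset bits of an integer; and similarity is defined as one minus the normalized Hamming distance, which the text does not specify.

```python
import hashlib


def simhash(weighted_keywords, bits=64):
    """Return a `bits`-bit SimHash signature (bits must be a multiple of 8, <= 128)."""
    totals = [0.0] * bits
    for word, weight in weighted_keywords:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Bit 1 contributes +weight, bit 0 contributes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    signature = 0
    for i, total in enumerate(totals):
        if total >= 0:  # >= 0 becomes 1; negative values stay as unset bits
            signature |= 1 << i
    return signature


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Assumed similarity measure: 1 minus the normalized Hamming distance."""
    return 1.0 - hamming_distance(a, b) / bits
```

A patent would then be reported when `similarity(sig_target, sig_other)` exceeds the early warning threshold set by the user.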
Preferably, the specific steps of predicting the technology maturity and the technology evolution direction in step S6 are as follows: a patent characteristic fitting curve is drawn according to TRIZ theory and compared with a standard S curve, while the patent technology maturity is predicted in combination with a patent data prediction algorithm; the technical evolution process is analyzed in detail with a technical evolution radar map, and the visualization result is displayed to predict the technical evolution direction.
The technology maturity prediction of the invention is based on TRIZ theory: a patent characteristic fitting curve is drawn according to TRIZ theory and compared with a standard S curve, and the patent technology maturity is predicted in combination with a patent data measurement algorithm. TRIZ theory holds that the evolution of a technology passes through 4 stages: infancy, growth, maturity and decline. The technology maturity prediction mainly examines 4 indexes: performance parameters, patent grade, patent quantity and economic benefit. To analyze the patent documents of a certain label, the number and grade of the patents are first counted and curves of their change over time are drawn; the main performance parameter and economic benefit indexes of the patent technology of that label are then investigated, and the corresponding performance parameter curve and economic benefit curve are drawn; next, a suitable fitting model is selected to draw the patent characteristic fitting curve; finally, the patent characteristic fitting curve is compared with the standard S curve. By comprehensively analyzing the slopes of the 4 curves obtained from the 4 indexes, the position of the patent technology of that label on the S curve, i.e. its current life cycle stage, can be determined, thereby realizing the maturity prediction of the patent technology.
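One possible choice of "suitable fitting model" is a logistic S curve. The sketch below fits such a curve to a single index (for example, cumulative patent counts) and reads off a stage from the fitted position. It is an illustration only: the text does not name the fitting model, the saturation level `saturation` is assumed to be known, and the stage cut-offs 0.1, 0.5 and 0.9 are hypothetical values chosen for the example.

```python
import math


def fit_logistic(times, values, saturation):
    """Fit y = K / (1 + exp(a - b*t)) for a given K.

    Uses least squares on the linearized form ln(K / y - 1) = a - b*t,
    which requires 0 < y < K for every observation. Returns (a, b).
    """
    zs = [math.log(saturation / y - 1.0) for y in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_z = sum(zs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxz = sum((t - mean_t) * (z - mean_z) for t, z in zip(times, zs))
    slope = sxz / sxx            # equals -b
    intercept = mean_z - slope * mean_t  # equals a
    return intercept, -slope


def life_cycle_stage(t, a, b):
    """Map the fitted fraction of saturation at time t to a stage label."""
    fraction = 1.0 / (1.0 + math.exp(a - b * t))
    # Hypothetical cut-offs; the source judges the stage from curve slopes.
    if fraction < 0.1:
        return "infancy"
    if fraction < 0.5:
        return "growth"
    if fraction < 0.9:
        return "maturity"
    return "decline"
```

In practice the same fit would be repeated for each of the 4 indexes and the results compared, since the text combines the slopes of all 4 curves before assigning a life cycle stage.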
The technical evolution direction is predicted by analysis with a technical evolution radar map, which visually shows the gap between the patent technology and the evolution limit so that the places where the technology needs improvement and innovation can be clearly seen. The technical evolution radar map is shown in fig. 2: the center of the polygon is the lowest level of technical evolution, the periphery of the polygon is the limit of technical evolution, each spoke represents an evolution route, and the scales on a spoke represent the levels of that evolution route. Connecting the positioning points of the existing patent technology system on each route yields a shaded part, which represents the current state of the patent technology system; the blank part of the polygon not covered by the shading represents its development potential. A technology system can be divided into several subsystems and a technology radar map drawn for each subsystem, so as to judge which subsystems perform well and which are weak links. The specific steps of predicting the technical evolution direction with the technical evolution radar map are as follows: first, the technical system formed by the patent documents of a certain label is analyzed, and several technical routes related to the technical system, i.e. its possible evolution directions, are designed; then the technical system is positioned on each technical route, i.e. its current evolution state in each evolution direction is determined, and the technical evolution radar map is drawn; the radar map is then analyzed: if a technical innovation point is found, technical innovation is carried out on the technical system; otherwise, the radar map is subdivided to obtain a radar tree map and the analysis steps are repeated. The technical evolution analysis process is shown in fig. 3.
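The shaded and blank areas of the radar map can be quantified with a small calculation. The sketch below is illustrative and rests on assumptions: each route is given as a pair of current level and limit level, the spokes are equally spaced, and the function name `radar_summary` is hypothetical. With equally spaced spokes the common sine factor of the polygon areas cancels, so the covered share reduces to the mean of the products of neighbouring level ratios.

```python
def radar_summary(routes):
    """Summarize a radar map given {route name: (current level, limit level)}.

    Returns the share of the limit polygon covered by the shaded part,
    the remaining share (development potential), and the weakest route.
    """
    names = list(routes)
    if len(names) < 3:
        raise ValueError("a radar polygon needs at least three routes")
    ratios = [routes[name][0] / routes[name][1] for name in names]
    k = len(ratios)
    # Area of shaded polygon / area of limit polygon for equally spaced spokes.
    coverage = sum(ratios[i] * ratios[(i + 1) % k] for i in range(k)) / k
    weakest = min(names, key=lambda name: routes[name][0] / routes[name][1])
    return {
        "coverage": coverage,
        "potential": 1.0 - coverage,
        "weakest_route": weakest,
    }
```

Applying the same function to each subsystem gives a simple way to compare which subsystems are closer to their evolution limit and which are weak links.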
Preferably, the analysis prediction result in step S7 is sent to the user in the warning result receiving manner set in step S0.
After the analysis method has been executed for each user, the invention outputs the analysis and prediction results in sequence, including the patent early warning analysis result, the technology maturity prediction result and the technical evolution direction prediction result. For the patent early warning analysis, all patents whose similarity to the patent to be early-warned exceeds the early warning threshold are output and sent to the user in real time in the receiving mode selected by the user, so that the user avoids falling into a patent trap.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A technology forecasting and patent early warning analysis method based on an improved XGboost algorithm is characterized by comprising the following steps:
step S1, constructing a patent subject database according to the patent retrieval formula;
step S2, performing feature extraction on the patent texts of the patent subject database with a convolutional neural network to obtain feature vectors $V_k$;
step S3, constructing a test set S according to the feature vectors $V_k$;
step S4, inputting the training set into an XGboost model for training, optimizing and improving the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGboost model with a gray wolf optimization algorithm, and inputting the test set into the improved XGboost model for testing to obtain the XGboost classifier;
step S5, after extracting the characteristics of the patent to be early-warned, inputting the patent to be early-warned into an XGboost classifier to obtain the classification of the patent to be early-warned and other patent texts in the same class as the patent to be early-warned;
step S6, carrying out patent early warning analysis and technology maturity and technical evolution direction prediction according to the classification of the patent to be early-warned and the other patent texts in the same class as the patent to be early-warned, to obtain an analysis and prediction result;
and step S7, visually displaying the analysis and prediction result and sending the analysis and prediction result to a user.
2. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 1, wherein the step S1 is preceded by the following step:
step S0, setting a patent early warning threshold value and an early warning result receiving mode.
3. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 1, wherein the specific steps of the step S1 are as follows: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent retrieval formula, constructing a patent subject database, and simultaneously performing text denoising preprocessing.
4. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 1, wherein the specific steps of the step S2 are as follows: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text $F_k$ in the patent subject database forms a patent text representation matrix $M_k$ after passing through the input layer, and the patent text representation matrix $M_k$ is expressed as a feature vector $V_k$ through the operations of the convolutional layer and the pooling layer.
5. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 1, wherein the test set of the step S3 is $S = \{s_k = (V_k, L_k)\}$, S being a test set consisting of the feature vectors and labels of all documents in the patent subject database, and $L_k$ being the patent label classification number corresponding to the sample $s_k$.
6. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 1, wherein the step S4 comprises:
step S41, setting population scale N and maximum iteration times T in parameters of a gray wolf optimization algorithm, setting the number of base classifiers, the learning rate, the maximum depth of a tree and the value range of the minimum leaf node sample weight in the parameters of the XGboost model, and initializing other parameters of the XGboost model;
step S42, randomly generating a gray wolf group, wherein the position of each gray wolf individual consists of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight;
step S43, the XGboost model learns the training set according to the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each gray wolf is calculated according to the fitness function $F_{new}$;
step S44, dividing the gray wolf group into 4 grades of gray wolves, α, β, δ and ω, according to the fitness function values;
step S45, updating the position of each individual in the gray wolf group, recalculating the fitness function value $F_{new}$ of the new position of each individual, and comparing it with the optimal fitness function value $F_g$ of the previous iteration; if $F_{new} > F_g$, the fitness function value of the gray wolf individual is $F_{new}$ and the position of the gray wolf individual is retained; otherwise, the fitness function value of the gray wolf individual is $F_g$;
step S46, repeating steps S42 to S45, stopping the iteration when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGboost model;
and step S47, inputting the test set S into the XGboost model for testing to obtain the trained XGboost classifier.
7. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 6, wherein the fitness function $F_{new}$ of the step S43 is expressed as
$F_{new} = (F_{Precision} + F_{Recall} + F_1)/3$;
wherein $F_{Precision}$ is the precision, expressed as $F_{Precision} = TP/(TP + FP)$; $F_{Recall}$ is the recall, expressed as $F_{Recall} = TP/(TP + FN)$; $F_1$ is an index measuring classification accuracy, expressed as $F_1 = 2 F_{Precision} F_{Recall}/(F_{Precision} + F_{Recall})$; and TP, FP and FN are the numbers of true positives, false positives and false negatives obtained by comparing the real categories of the patent texts with their predicted categories.
8. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 2, wherein the patent early warning analysis in the step S6 comprises the following specific steps: calculating the similarity between the patent to be early-warned and the other patents belonging to the same category with a SimHash algorithm, and outputting the patents whose similarity exceeds the patent early warning threshold.
9. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 1, wherein the specific steps of predicting the technology maturity and the technical evolution direction in the step S6 are as follows: drawing a patent characteristic fitting curve according to TRIZ theory, comparing the fitting curve with a standard S curve, and meanwhile predicting the patent technology maturity in combination with a patent data prediction algorithm; and analyzing the technical evolution process in detail with a technical evolution radar map, and displaying the visualization result to predict the technical evolution direction.
10. The improved XGboost algorithm-based technology forecasting and patent early warning analysis method according to claim 2, wherein the analysis and prediction result in the step S7 is sent to the user in the early warning result receiving mode set in the step S0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010557407.XA CN111708865B (en) | 2020-06-18 | 2020-06-18 | Technology forecasting and patent early warning analysis method based on improved XGboost algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010557407.XA CN111708865B (en) | 2020-06-18 | 2020-06-18 | Technology forecasting and patent early warning analysis method based on improved XGboost algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111708865A (en) | 2020-09-25 |
CN111708865B (en) | 2021-07-09 |
Family
ID=72540975
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010557407.XA Active CN111708865B (en) | 2020-06-18 | 2020-06-18 | Technology forecasting and patent early warning analysis method based on improved XGboost algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111708865B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801140A (en) * | 2021-01-07 | 2021-05-14 | 长沙理工大学 | XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm |
CN114615010B (en) * | 2022-01-19 | 2023-12-15 | 上海电力大学 | Edge server-side intrusion prevention system design method based on deep learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107563394A (en) * | 2017-09-19 | 2018-01-09 | 广东工业大学 | A kind of method and system of predicted pictures popularity |
CN107908688A (en) * | 2017-10-31 | 2018-04-13 | 温州大学 | A kind of data classification Forecasting Methodology and system based on improvement grey wolf optimization algorithm |
CN109190828A (en) * | 2018-09-07 | 2019-01-11 | 苏州大学 | Gas leakage concentration distribution determines method, apparatus, equipment and readable storage medium storing program for executing |
CN110110848A (en) * | 2019-05-05 | 2019-08-09 | 武汉烽火众智数字技术有限责任公司 | A kind of combination forecasting construction method and device |
CN110289097A (en) * | 2019-07-02 | 2019-09-27 | 重庆大学 | A kind of Pattern Recognition Diagnosis system stacking model based on Xgboost neural network |
Non-Patent Citations (2)
Title |
---|
Research on Load Prediction Based on Improve GWO and ELM in Cloud Computing;Shengcai Zhang.ET-AL;《2019 IEEE 5th International Conference on Computer and Communications (ICCC)》;20200413;第102-105页 * |
Taxi Trip Travel Time Prediction with Isolated XGBoost Regression;Kusal D. Kankanamge.ET-AL;《2019 Moratuwa Engineering Research Conference (MERCon)》;20190705;第54-59页 * |
Also Published As
Publication number | Publication date |
---|---|
CN111708865A (en) | 2020-09-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108399158B (en) | Attribute emotion classification method based on dependency tree and attention mechanism | |
CN110245229B (en) | Deep learning theme emotion classification method based on data enhancement | |
CN109241377A (en) | A kind of text document representation method and device based on the enhancing of deep learning topic information | |
CN111597340A (en) | Text classification method and device and readable storage medium | |
CN112687374B (en) | Psychological crisis early warning method based on text and image information joint calculation | |
CN112732921B (en) | False user comment detection method and system | |
CN115688024B (en) | Network abnormal user prediction method based on user content characteristics and behavior characteristics | |
CN111708865B (en) | Technology forecasting and patent early warning analysis method based on improved XGboost algorithm | |
Pandey et al. | Fake news detection from online media using machine learning classifiers | |
CN112528668A (en) | Deep emotion semantic recognition method, system, medium, computer equipment and terminal | |
CN112416358B (en) | Intelligent contract code defect detection method based on structured word embedded network | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN112148868A (en) | Law recommendation method based on law co-occurrence | |
CN113240201A (en) | Method for predicting ship host power based on GMM-DNN hybrid model | |
CN114942974A (en) | E-commerce platform commodity user evaluation emotional tendency classification method | |
CN109376235A (en) | The feature selection approach to be reordered based on document level word frequency | |
CN112489689B (en) | Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure | |
CN114519508A (en) | Credit risk assessment method based on time sequence deep learning and legal document information | |
Saha et al. | The Corporeality of Infotainment on Fans Feedback Towards Sports Comment Employing Convolutional Long-Short Term Neural Network | |
CN116629258A (en) | Structured analysis method and system for judicial document based on complex information item data | |
CN113837266B (en) | Software defect prediction method based on feature extraction and Stacking ensemble learning | |
CN114358813B (en) | Improved advertisement putting method and system based on field matrix factorization machine | |
CN117235253A (en) | Truck user implicit demand mining method based on natural language processing technology | |
Selvi et al. | Topic categorization of Tamil news articles | |
CN113821571A (en) | Food safety relation extraction method based on BERT and improved PCNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||