CN107368526A

CN107368526A - Data processing method and device

Info

Publication number: CN107368526A
Application number: CN201710433424.0A
Authority: CN
Inventors: 金海旭; 滕放; 马超; 赵继广
Original assignee: Beijing Causality Network Technology Co Ltd
Current assignee: Beijing Causality Network Technology Co Ltd
Priority date: 2017-06-09
Filing date: 2017-06-09
Publication date: 2017-11-21

Abstract

The present invention relates to the field of data processing, and in particular to a data processing method and device. Target patent data to be classified is obtained. Based on a pre-trained classification model, and according to the probability subject words of the obtained target patent data and a preset matching condition, it is judged whether the target patent data successfully matches a link of a predefined industrial chain. The pre-trained classification model is obtained by training, with a preset classification algorithm, on collected patent sample data for the links of each industrial chain. If the match succeeds, the correspondence between the target patent data and the link of the industrial chain is obtained, and the target patent data is classified into the corresponding link according to that correspondence. By matching with a classification model in this way, patent data is classified by industrial chain link, which meets users' need to search for patents within the links of an industrial chain and also improves search efficiency.

Description

A data processing method and device

Technical field

The present invention relates to the field of data processing, and in particular to a data processing method and device.

Background art

With the development and progress of science, communication technology has also developed rapidly. To make it easier for users to view and search, the classification and processing of massive data is becoming increasingly important.

In the prior art, the usual way of classifying patent data is to group it by applicant, filing date, keyword, and the like, so that users can view the patent data belonging to each of these categories.

For example, when a user searches for patents, the system receives the search query entered by the user, matches the information in the query against the database, and outputs and displays the matching search results. The user can then view the patent data that is related to the query and belongs to the above categories.

In the prior art, however, patent data can only be classified by simple criteria such as applicant or filing date, and cannot be classified by the links of an industrial chain. Moreover, if the keywords in the user's search query are not set correctly, the patent data returned by the keyword search may be very large in number and also inaccurate. The user may need to spend considerable time finding the patent data for the required industrial chain link, which is very inconvenient and fails to meet the user's search needs well.

Summary of the invention

The embodiments of the present invention provide a data processing method and device, to solve the prior-art problem that users' patent search needs within the links of an industrial chain cannot be met, and to improve users' search efficiency.

The specific technical solutions provided by the embodiments of the present invention are as follows:

A data processing method, comprising:

obtaining target patent data to be classified;

judging, based on a pre-trained classification model and according to the probability subject words of the obtained target patent data to be classified and a preset matching condition, whether the target patent data to be classified successfully matches a link of a predefined industrial chain; wherein the pre-trained classification model is obtained by training, based on a preset classification algorithm, on collected patent sample data for the links of each industrial chain;

if the match is determined to be successful, obtaining the correspondence between the target patent data to be classified and the link of the industrial chain, and classifying the target patent data to be classified into the corresponding link of the industrial chain according to the correspondence.

Preferably, judging whether the target patent data to be classified successfully matches a link of a predefined industrial chain specifically includes:

calculating, according to the probability subject words of the target patent data to be classified, the posterior probability value that the target patent data belongs to each link of the industrial chain, and judging whether the posterior probability value that the target patent data belongs to a link of the industrial chain is greater than a preset threshold; or,

judging whether the number of probability subject words extracted from the target patent data to be classified is greater than a preset number.

Preferably, the pre-trained classification model includes a first classification model and a second classification model; wherein the first classification model is a model trained per predefined industrial chain, and the second classification model is a model trained per link of a predefined industrial chain;

judging, based on the pre-trained classification model and according to the probability subject words of the obtained target patent data to be classified and the preset matching condition, whether the target patent data to be classified successfully matches a link of a predefined industrial chain specifically includes:

obtaining the probability subject words of the target patent data to be classified;

judging, based on the first classification model and according to the probability subject words and the preset matching condition, whether the target patent data to be classified successfully matches a predefined industrial chain;

after a successful match with an industrial chain is determined, judging, based on the second classification model and according to the probability subject words and the preset matching condition, whether the target patent data to be classified successfully matches a link in the matched industrial chain.

Preferably, the method further comprises:

if the match is determined to have failed, obtaining the patent feature words of the target patent data to be classified according to a preset feature extraction method, matching the patent feature words against the keywords of each preset industrial chain link, and obtaining the correspondence between the target patent data to be classified and a link of the industrial chain.

Preferably, obtaining the patent feature words of the target patent data to be classified according to the preset feature extraction method, matching the patent feature words against the keywords of each preset industrial chain link, and obtaining the correspondence between the target patent data to be classified and a link of the industrial chain specifically includes:

obtaining the patent feature words of the target patent data to be classified according to the preset feature extraction method;

comparing the patent feature words with the keywords of each preset industrial chain link, counting for each link the number of patent feature words identical to that link's keywords, and determining the industrial chain link with the largest number of identical words;

obtaining, according to the industrial chain link with the largest number of identical words, the correspondence between the target patent data to be classified and the link of the industrial chain.

Preferably, the classification model is trained as follows:

collecting the patent sample data for the links of each industrial chain, extracting the characteristic parts of each item of patent sample data, and using the characteristic parts as feature indices;

dividing, according to the feature indices of each item of patent sample data, the value range of each feature index, and training the classification model according to the division of the value range of each feature index and the preset classification algorithm; calculating the probability that each feature index belongs to each industrial chain link, and classifying the patent sample data into the industrial chain link with the highest probability.

Preferably, the feature indices include one or any combination of the following: International Patent Classification (IPC) classification, patent title, abstract, and patent feature words.

Preferably, before the value range of each feature index is divided, the method further comprises:

normalizing the values of the feature indices into the same preset value range.

A data processing device, comprising:

an acquiring unit, configured to obtain target patent data to be classified;

a matching unit, configured to judge, based on a pre-trained classification model and according to the probability subject words of the obtained target patent data to be classified and a preset matching condition, whether the target patent data to be classified successfully matches a link of a predefined industrial chain; wherein the pre-trained classification model is obtained by training, based on a preset classification algorithm, on collected patent sample data for the links of each industrial chain;

a classification unit, configured to, if the match is determined to be successful, obtain the correspondence between the target patent data to be classified and the link of the industrial chain, and classify the target patent data to be classified into the corresponding link of the industrial chain according to the correspondence.

Preferably, when judging whether the target patent data to be classified successfully matches a link of a predefined industrial chain, the matching unit is specifically configured to:

When judging, based on the pre-trained classification model and according to the probability subject words of the obtained target patent data to be classified and the preset matching condition, whether the target patent data to be classified successfully matches a link of a predefined industrial chain, the matching unit is specifically configured to:

obtain the probability subject words of the target patent data to be classified;

Preferably, the classification unit is further configured to:

Preferably, when obtaining the patent feature words of the target patent data to be classified according to the preset feature extraction method, matching the patent feature words against the keywords of each preset industrial chain link, and obtaining the correspondence between the target patent data to be classified and a link of the industrial chain, the classification unit is specifically configured to:

Preferably, the classification model is trained by:

a collecting unit, configured to collect the patent sample data for the links of each industrial chain, extract the characteristic parts of each item of patent sample data, and use the characteristic parts as feature indices;

a training unit, configured to divide, according to the feature indices of each item of patent sample data, the value range of each feature index, train the classification model according to the division of the value range of each feature index and the preset classification algorithm, calculate the probability that each feature index belongs to each industrial chain link, and classify the patent sample data into the industrial chain link with the highest probability.

a normalization unit, configured to normalize the values of the feature indices into the same preset value range.

In the embodiments of the present invention, target patent data to be classified is obtained. Based on a pre-trained classification model, and according to the probability subject words of the obtained target patent data and a preset matching condition, it is judged whether the target patent data successfully matches a link of a predefined industrial chain, where the pre-trained classification model is obtained by training, based on a preset classification algorithm, on collected patent sample data for the links of each industrial chain. If the match is determined to be successful, the correspondence between the target patent data and the link of the industrial chain is obtained, and the target patent data is classified into the corresponding link according to that correspondence. In this way, the probability subject words of the target patent data are extracted and, based on the classification model, the target patent data is classified according to those words. Patent data is thus classified by industrial chain link, which meets users' need to search for patents within the links of an industrial chain. When a user searches for patents, the patent data under each industrial chain link can be obtained, which makes it easier for the user to view and distinguish the results, improves search accuracy, and meets the user's needs.

Brief description of the drawings

Fig. 1 is an overview flowchart of the data processing method in an embodiment of the present invention;

Fig. 2 is a detailed flowchart of the data processing method in an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the data processing device in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

To meet users' patent search needs within industrial chain links and to improve search efficiency and accuracy, in the embodiments of the present invention a classification model is trained in advance, target patent data to be classified is matched against the links of each predefined industrial chain, and the target patent data is then classified into the corresponding industrial chain link, which meets users' need to search for patents within the links of an industrial chain.

The solution of the present invention is described in detail below through specific embodiments; the present invention is of course not limited to the following embodiments.

Referring to Fig. 1, in an embodiment of the present invention, the specific flow of the data processing method is as follows:

Step 100: Obtain target patent data to be classified.

In practice, when a user searches for patents, the user may, for example, enter keywords; the system performs query matching on the entered keywords and displays the search results. These search results can be classified by, for example, applicant or filing date, and the user can view the patent data belonging to different applicants, filing dates, and so on. However, these are only simple auxiliary classifications; the results cannot be classified by industrial chain link, so the user's needs are not met. In the embodiments of the present invention, patent data is classified by the links of predefined industrial chains, which meets users' need to search for patents within the links of each industrial chain and is more effective.

The target patent data to be classified is, for example, newly added patent data, or all the patent data returned by a user's search.

Step 110: Based on a pre-trained classification model, and according to the probability subject words of the obtained target patent data to be classified and a preset matching condition, judge whether the target patent data to be classified successfully matches a link of a predefined industrial chain.

The pre-trained classification model is obtained by training, based on a preset classification algorithm, on collected patent sample data for the links of each industrial chain.

The pre-trained classification model includes a first classification model and a second classification model, where the first classification model is a model trained per predefined industrial chain and the second classification model is a model trained per link of a predefined industrial chain.

The classification model is, for example, a naive Bayes classification model; it may of course be another classification model, which is not limited in the embodiments of the present invention.

That is, in the embodiments of the present invention, industrial chains and their links are defined in advance, and classification models are trained according to the categories of industrial chains and of industrial chain links respectively. Industrial chains and their links can be understood as the technical fields, technical stages, technical methods, and so on that are involved.

The definitions of industrial chains and of their links are not limited in the embodiments of the present invention. An existing classification of industrial chains and their links may be used, or industrial chains and their links may be redefined according to the actual situation.

For example, under an industry-based classification, industrial chains can be divided into the agricultural industrial chain, the forestry industrial chain, the IT industrial chain, the education industrial chain, and so on.

As another example, for the Internet of Things industrial chain, its links can be defined as: chip suppliers, sensor suppliers, wireless module (including antenna) manufacturers, network operators (including SIM card providers), platform service providers, system and software developers, intelligent hardware manufacturers, and system integration and application service providers.

Performing step 110 specifically includes:

First, obtain the probability subject words of the target patent data to be classified.

A probability subject word is a word whose posterior probability of belonging to a given category is greater than a set value.

The set value is, for example, a value greater than 0.5; therefore, the number of probability subject words extracted from one item of target patent data to be classified is usually not large.
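The selection of probability subject words described above can be sketched as follows. This is only an illustrative sketch: the function name `probability_subject_words`, the input layout (a per-word table of posterior probabilities per category), and the sample words and values are all assumptions, since the text does not specify how the per-word posteriors are produced.

```python
from typing import Dict, List


def probability_subject_words(
    word_posteriors: Dict[str, Dict[str, float]],
    setting_value: float = 0.5,
) -> Dict[str, List[str]]:
    """Group words by the category for which their posterior exceeds setting_value."""
    selected: Dict[str, List[str]] = {}
    for word, posteriors in word_posteriors.items():
        for category, probability in posteriors.items():
            # Keep the word only when its posterior is strictly greater than the set value.
            if probability > setting_value:
                selected.setdefault(category, []).append(word)
    return selected


# Hypothetical posteriors P(category | word) for three words of one patent.
posteriors = {
    "antenna": {"wireless module": 0.82, "chip": 0.10},
    "wafer": {"wireless module": 0.05, "chip": 0.91},
    "device": {"wireless module": 0.30, "chip": 0.35},
}
words = probability_subject_words(posteriors)
print(words)
```

With a set value above 0.5, a word can qualify for at most one category when its posteriors over the categories sum to 1, which is consistent with the remark that few such words are extracted per patent.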

Next, based on the first classification model, and according to the probability subject words and the preset matching condition, judge whether the target patent data to be classified successfully matches a predefined industrial chain.

For example, the target patent data to be classified is input into the first classification model, which analyzes it and matches it against each industrial chain in turn.

Then, after a successful match with an industrial chain is determined, judge, based on the second classification model and according to the probability subject words and the preset matching condition, whether the target patent data to be classified successfully matches a link in the matched industrial chain.

That is, in the embodiments of the present invention, the target patent data to be classified is first matched against each industrial chain; after an industrial chain is matched, the data is then matched against each link of that industrial chain.
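The chain-first, link-second order of matching can be sketched as below. The sketch assumes each model is wrapped as a hypothetical "matcher" callable that returns the matched label or `None` on failure; the names `classify_two_stage`, `toy_chain_matcher`, and `toy_iot_link_matcher`, and the toy rules inside them, are illustrative stand-ins and not the trained models of the embodiment.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A matcher takes probability subject words and returns a matched label, or None on failure.
Matcher = Callable[[List[str]], Optional[str]]


def classify_two_stage(
    subject_words: List[str],
    chain_matcher: Matcher,
    link_matchers: Dict[str, Matcher],
) -> Optional[Tuple[str, str]]:
    """Match an industrial chain first, then a link inside the matched chain."""
    chain = chain_matcher(subject_words)
    if chain is None:
        return None  # no industrial chain matched
    link_matcher = link_matchers.get(chain)
    if link_matcher is None:
        return None  # no second-stage model available for this chain
    link = link_matcher(subject_words)
    if link is None:
        return None  # chain matched, but no link inside it matched
    return chain, link


# Toy stand-ins for the first and second classification models (assumptions).
def toy_chain_matcher(words: List[str]) -> Optional[str]:
    return "Internet of Things" if "sensor" in words else None


def toy_iot_link_matcher(words: List[str]) -> Optional[str]:
    return "sensor supplier" if "sensor" in words else None


matched = classify_two_stage(
    ["sensor", "probe"],
    toy_chain_matcher,
    {"Internet of Things": toy_iot_link_matcher},
)
print(matched)
```

Restricting the second stage to the links of the matched chain keeps the number of candidate links small, which is one plausible reading of why the combined approach is described as more efficient.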

Specifically, judging whether the target patent data to be classified successfully matches a link of a predefined industrial chain can be done in either of the following two ways:

First way: according to the probability subject words of the target patent data to be classified, calculate the posterior probability value that the target patent data belongs to each link of the industrial chain, and judge whether the posterior probability value that the target patent data belongs to a link of the industrial chain is greater than a preset threshold.

Second way: judge whether the number of probability subject words extracted from the target patent data to be classified is greater than a preset number.

That is, the preset matching condition may be whether the posterior probability value that the target patent data to be classified belongs to a link of the industrial chain is greater than a preset threshold, or whether the number of probability subject words extracted from the target patent data to be classified is greater than a preset number.

In the embodiments of the present invention, not only is a first classification model built for the industrial chains, but a second classification model is also built for the links inside each industrial chain. When target patent data is classified, matching with the first and second classification models in combination is more efficient, and the target patent data can ultimately be classified into a link of an industrial chain.

Step 120: If the match is determined to be successful, obtain the correspondence between the target patent data to be classified and the link of the industrial chain, and classify the target patent data to be classified into the corresponding link of the industrial chain according to the correspondence.

Performing step 120 specifically includes:

First, if the match is determined to be successful, obtain the correspondence between the target patent data to be classified and the link of the industrial chain.

Specifically: if the posterior probability value that the target patent data to be classified belongs to a link of the industrial chain is judged to be greater than the preset threshold, or the number of probability subject words extracted from the target patent data to be classified is judged to be greater than the preset number, the match is determined to be successful, and the correspondence between the target patent data to be classified and the link of the industrial chain is obtained.

For example, suppose the target patent data to be classified is a, the industrial chain has two links, link 1 and link 2, and a is matched against link 1 and link 2 respectively with a preset threshold of 0.8. If the posterior probability that a belongs to link 1 is 0.5 and the posterior probability that a belongs to link 2 is 0.9, the resulting correspondence is that a corresponds to link 2.
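The two matching conditions and the worked example with a threshold of 0.8 can be sketched as follows. The function names `match_by_posterior` and `match_by_count` are hypothetical, and choosing the link with the highest posterior when several exceed the threshold is an assumption; the text does not say how such a case is resolved.

```python
from typing import Dict, List, Optional


def match_by_posterior(posteriors: Dict[str, float], threshold: float) -> Optional[str]:
    """First way: return the highest-posterior link if it exceeds the threshold, else None."""
    if not posteriors:
        return None
    best = max(posteriors, key=lambda link: posteriors[link])
    # Assumption: only the single best link is kept when several exceed the threshold.
    return best if posteriors[best] > threshold else None


def match_by_count(subject_words: List[str], preset_number: int) -> bool:
    """Second way: succeed when more than preset_number subject words were extracted."""
    return len(subject_words) > preset_number


# The example from the text: patent a, two links, preset threshold 0.8.
result = match_by_posterior({"link 1": 0.5, "link 2": 0.9}, threshold=0.8)
print(result)
```

When `match_by_posterior` returns `None` (or `match_by_count` returns `False`), the match has failed and the keyword-based fallback described later in the text would apply.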

Then, according to the correspondence, classify the target patent data to be classified into the corresponding link of the industrial chain.

Further, when the match fails, the method may also include: if the match is determined to have failed, obtaining the patent feature words of the target patent data to be classified according to a preset feature extraction method, matching the patent feature words against the keywords of each preset industrial chain link, and obtaining the correspondence between the target patent data to be classified and a link of the industrial chain.

Specifically:

First, determine that the match has failed.

Specifically: if the posterior probability value that the target patent data to be classified belongs to a link of the industrial chain is judged to be not greater than the preset threshold, or the number of probability subject words extracted from the target patent data to be classified is judged to be not greater than the preset number, the match is determined to have failed.

Then, obtain the patent feature words of the target patent data to be classified according to the preset feature extraction method.

The preset feature extraction method is not limited in the embodiments of the present invention; it may be, for example, an information gain feature extraction method or a word frequency method. With these feature extraction methods, more patent feature words are extracted than probability subject words.

Then, compare the patent feature words with the keywords of each preset industrial chain link, count for each link the number of patent feature words identical to that link's keywords, and determine the industrial chain link with the largest number of identical words.

The keywords of an industrial chain link may be obtained with a prior-art keyword extraction algorithm or may be user-defined; this is not limited in the embodiments of the present invention.

Finally, according to the industrial chain link with the largest number of identical words, obtain the correspondence between the target patent data to be classified and the link of the industrial chain.

That is, the industrial chain link whose keywords share the largest number of identical words with the patent feature words of the target patent data to be classified is taken as the industrial chain link corresponding to that target patent data.

For example, suppose the target patent data to be classified has 3 patent feature words and is matched against 2 industrial chain links, each of which also has 3 keywords. The 3 patent feature words are compared with the keywords of the two links. If 2 are identical to keywords of the first link and 3 are identical to keywords of the second link, the correspondence between the target patent data to be classified and the second industrial chain link is established.
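The keyword-overlap fallback can be sketched as below, mirroring the example with 3 patent feature words and 2 links. The function name `match_by_keywords` and the concrete words are hypothetical; returning `None` when no word overlaps and keeping the first link on a tie are assumptions not stated in the text.

```python
from typing import Dict, Iterable, Optional, Set


def match_by_keywords(
    feature_words: Iterable[str],
    link_keywords: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the link whose keyword set shares the most words with the patent feature words."""
    features = set(feature_words)
    best_link: Optional[str] = None
    best_count = 0
    for link, keywords in link_keywords.items():
        count = len(features & keywords)  # number of identical words for this link
        if count > best_count:  # assumption: on a tie, the earlier link is kept
            best_link, best_count = link, count
    return best_link


# Hypothetical data: 3 feature words, 2 links with 3 keywords each.
feature_words = ["sensor", "temperature", "probe"]
link_keywords = {
    "first link": {"sensor", "temperature", "chip"},    # 2 identical words
    "second link": {"sensor", "temperature", "probe"},  # 3 identical words
}
chosen = match_by_keywords(feature_words, link_keywords)
print(chosen)
```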

In this way, in the embodiments of the present invention, target patent data is classified according to the classification model and the probability subject words. Where that classification fails, the patent feature words of the target patent data are extracted and matched against the keywords of each industrial chain link, so that the target patent data is ultimately classified into a link of an industrial chain. This not only improves the efficiency and correctness of patent data classification but also meets users' need to search for patents within the links of an industrial chain. When a user searches for patents, the patent data under each industrial chain link can be obtained, which makes it easier for the user to view and distinguish the results, improves search accuracy, and meets the user's needs.

The training method of the classification model is briefly described below. The classification model is trained as follows:

First, collect the patent sample data for the links of each industrial chain, extract the characteristic parts of each item of patent sample data, and use the characteristic parts as feature indices.

The feature indices include one or any combination of the following: International Patent Classification (IPC) classification, patent title, abstract, and patent feature words. The feature indices may of course also be other characteristic parts of a patent, which is not limited in the embodiments of the present invention.

Specifically, taking the industrial chain link as the unit, the patent sample data corresponding to each industrial chain link is collected, and the patent sample data of each link is evaluated on, for example, the IPC classification, patent title, abstract, and patent feature words.

Further, for the patent sample data of some industrial chain links, the features of the patent content are distinctive, and the degree of association with the industrial chain link can be seen from the abstract and title alone; in that case the weights of the patent title and abstract as feature indices are increased. For the patent sample data of other industrial chain links, the features of the patent content are not distinctive; in that case the weights of the IPC classification and patent feature words as feature indices can be increased.

In this way, after different weights are assigned to the feature indices, training on these feature indices yields a more accurate classifier.

Then, according to the feature indices of each item of patent sample data, divide the value range of each feature index, and train the classification model according to the division of the value range of each feature index and the preset classification algorithm; calculate the probability that each feature index belongs to each industrial chain link, and classify the patent sample data into the industrial chain link with the highest probability.

Further, before the value range of each feature index is divided, the method also includes: normalizing the values of the feature indices into the same preset value range.

For example, if the preset value range is (0, 1], the values can be processed according to a preset normalization formula, for example: y = [x - MinValue(x)] / [MaxValue(x) - MinValue(x)], where x is the value of any feature index before normalization, MinValue(x) and MaxValue(x) are the minimum and maximum values of x respectively, and y is the value of that feature index after normalization.
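The normalization formula above is ordinary min-max scaling and can be sketched as follows. The function name `min_max_normalize` is hypothetical, and mapping a constant feature (where the maximum equals the minimum) to 0.0 is an assumption added only to avoid division by zero. Note that, as written, the formula maps the minimum value to exactly 0.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Apply y = (x - min) / (max - min) to every value of one feature index."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(x - lowest) / (highest - lowest) for x in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
print(normalized)  # [0.0, 0.5, 1.0]
```

Normalizing every feature index this way puts them on a common scale, so the same division points a_jk can be compared across indices.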

Taking a naive Bayes classification model as the classification model, the procedure is specifically as follows:

1) Call the Bayes toolkit of Spark MLlib.

2) Divide the value range of each feature index and, according to that division, calculate the values of P(yj > ajk | Ci), P(Ci), and P(yj > ajk) respectively.

Here, yj is the value of the j-th feature index, j = 1, 2, ..., N, where N is the total number of feature indices; ajk is the value of the k-th division point of the j-th feature index yj, 0 < ajk ≤ 1, and k is a positive integer. Ci indicates whether the data belongs to a given industrial chain link, i = 1, 2, where C1 means belonging to the given industrial chain link and C2 means not belonging to it. P(Ci) is the probability that patent data belongs to class Ci, P(yj > ajk | Ci) is the conditional probability that the feature index value satisfies yj > ajk within class Ci, and P(yj > ajk) is the probability that the feature index value satisfies yj > ajk.

For example, any item of patent sample data may have 4 feature indices, i.e., Y = {y1, y2, y3, y4}, and the 4 feature indices of all the patent sample data make up the training data set. According to Bayes' theorem: P(Ci | yj) = P(yj | Ci) * P(Ci) / P(yj), where i = 1, 2 and j = 1, 2, 3, 4. For any feature index yj, P(yj), P(Ci), and the conditional probability P(yj | Ci) can all be calculated directly from the training data set.
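As an illustration of computing these quantities directly from a training set, the sketch below estimates P(C1 | yj > ajk) for a single feature index by counting. The function name `posterior_above_threshold`, the sample values, and the choice to return 0.0 when a denominator is empty are assumptions; this is plain counting rather than the Spark MLlib toolkit the embodiment calls.

```python
from typing import List, Tuple


def posterior_above_threshold(samples: List[Tuple[float, bool]], a_jk: float) -> float:
    """Estimate P(C1 | y_j > a_jk) from (y_j, belongs_to_link) pairs via Bayes' theorem."""
    total = len(samples)
    n_c1 = sum(1 for _, in_link in samples if in_link)
    n_above = sum(1 for y, _ in samples if y > a_jk)
    n_above_c1 = sum(1 for y, in_link in samples if in_link and y > a_jk)
    if total == 0 or n_c1 == 0 or n_above == 0:
        return 0.0  # assumption: undefined posteriors are reported as 0.0
    p_c1 = n_c1 / total                   # P(C1)
    p_above = n_above / total             # P(y_j > a_jk)
    p_above_given_c1 = n_above_c1 / n_c1  # P(y_j > a_jk | C1)
    return p_above_given_c1 * p_c1 / p_above


# Hypothetical normalized values of one feature index, with link membership labels.
training = [
    (0.9, True), (0.8, True), (0.3, True),
    (0.7, False), (0.2, False), (0.1, False),
]
posterior = posterior_above_threshold(training, a_jk=0.5)
print(round(posterior, 4))
```

Here three samples exceed the division point 0.5 and two of them belong to the link, so the estimate equals 2/3, matching the direct count of link members among the samples above the division point.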

3) Train the classification model, calculate the probability that each feature index belongs to each industrial chain link, and classify the patent sample data into the industrial chain link with the highest probability.

Further, an iteration count may be set and the accuracy of the naive Bayes classifier calculated or evaluated; when the accuracy of the naive Bayes classifier exceeds the set threshold, the final naive Bayes classifier is obtained, i.e., training of the classification model is complete.

In practice, field of distributed file processing HDFS and distribution based on Hadoop distributed system architectures Computational frame MapReduce is widely used in big data analysis field.Spark is that UC Berkeley AMP lab are increased income Class Hadoop MapReduce universal parallel framework, Spark possesses advantage possessed by Hadoop MapReduce；But no Be same as MapReduce is that output result can be stored in internal memory among Job, so as to no longer need to read and write HDFS, therefore Spark can preferably be applied to the algorithm that data mining and machine learning etc. need the MapReduce of iteration.

Therefore, in the embodiment of the present invention, when the naive Bayes classifier is trained with a naive Bayes toolkit, the advantage of Spark's in-memory computation is fully exploited through the parallel interface provided by Spark MLlib: the characteristic indices of the selected samples are input into the naive Bayes algorithm interface of Spark MLlib and the number of iterations is set; Spark MLlib then iterates automatically, and the naive Bayes classifier is obtained once iteration completes. This makes the mining process that matches patent data to links of industrial chains more intelligent, and the mined combinations of characteristic indices more comprehensive. Because Spark computes in memory, computation is faster and the time needed to build the naive Bayes classifier is greatly reduced.

It is worth noting that the above way of training a classification model applies to both the first classification model and the second classification model. Thus, in the embodiment of the present invention, a first classification model is built for each industrial chain. In addition, to further improve the accuracy with which patent data is matched to the individual links within an industrial chain, patent sample data is likewise used to train on the links within the industrial chain and build a second classification model, which matches patent data to the links in the industrial chain and improves classification accuracy and efficiency.

The above embodiment is described in further detail below using a specific application scenario. Referring to Fig. 2, in the embodiment of the present invention the data processing method is executed as follows:

Step 200: Obtain the target patent data to be classified.

Step 201: Based on the pre-trained classification model, match the target patent data to be classified against the links of the predefined industrial chains.

The pre-trained classification model is preferably a naive Bayes classification model, and it comprises a first classification model and a second classification model.

Specifically, based on the first classification model, it is judged, according to the probability subject and the preset matching condition, whether the target patent data to be classified is successfully matched with a predefined industrial chain. After a successful match with an industrial chain is determined, it is judged, based on the second classification model and according to the probability subject and the preset matching condition, whether the target patent data to be classified is successfully matched with a link in the matched industrial chain.

Step 202: Judge whether the match is successful; if so, perform step 203; otherwise, perform step 204.

Step 203: Obtain the correspondence between the target patent data to be classified and the link of the industrial chain.

Step 204: Obtain the patent characteristic words of the target patent data to be classified.

Step 205: Match the patent characteristic words against the keywords of the links of the preset industrial chains, and obtain the correspondence between the target patent data to be classified and the link of the industrial chain.

Step 206: According to the correspondence, classify the target patent data to be classified into the corresponding link of the industrial chain.

In the embodiment of the present invention, the first classification model corresponding to the industrial chain is combined with the second classification model corresponding to the links of the industrial chain, and classification-model matching is further combined with keyword matching. This further improves the efficiency and accuracy of classifying the target patent data, so that the target patent data can be classified accurately into the corresponding link of the industrial chain, achieving the purpose of classifying target patent data by link of the industrial chain.
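The control flow of steps 200 to 206 can be sketched as below. This is only an outline under stated assumptions: `chain_model`, `link_models` and `keyword_fallback` are hypothetical callables standing in for the first classification model, the second classification models and the keyword matching, and "posterior greater than a threshold" is used as the matching condition with an assumed threshold of 0.5.

```python
def classify_patent(patent, chain_model, link_models, keyword_fallback, threshold=0.5):
    """Return (industrial chain, link) for a patent, or the keyword-fallback result.

    chain_model(patent)        -> {chain: posterior}   (first classification model)
    link_models[chain](patent) -> {link: posterior}    (second classification model)
    keyword_fallback(patent)   -> (chain, link) or None
    All three callables and the threshold are hypothetical stand-ins.
    """
    chain_scores = chain_model(patent)
    chain, chain_p = max(chain_scores.items(), key=lambda kv: kv[1])
    if chain_p > threshold:                      # Step 202, chain level
        link_scores = link_models[chain](patent)
        link, link_p = max(link_scores.items(), key=lambda kv: kv[1])
        if link_p > threshold:                   # Step 202, link level
            return chain, link                   # Step 203
    return keyword_fallback(patent)              # Steps 204-205


# Toy stand-ins with made-up chains, links and scores.
chain_model = lambda p: {"new_energy_vehicles": 0.9, "semiconductors": 0.1}
link_models = {
    "new_energy_vehicles": lambda p: {"battery": 0.7, "motor": 0.3},
    "semiconductors": lambda p: {"design": 0.5, "packaging": 0.5},
}
fallback = lambda p: None

result = classify_patent("a lithium battery pack", chain_model, link_models, fallback)
```

Keeping one second-stage model per industrial chain means the link-level decision is only ever made among the links of the chain that was already matched, which mirrors the two-level matching described above.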

Based on the above embodiment, and referring to Fig. 3, the data processing device in the embodiment of the present invention specifically includes:

Acquiring unit 30, configured to obtain target patent data to be classified;

Matching unit 31, configured to judge, based on a pre-trained classification model and according to the acquired probability subject of the target patent data to be classified and a preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain; wherein the pre-trained classification model is obtained by training on the basis of a preset classification algorithm and collected patent sample data in the links of each industrial chain;

Classification unit 32, configured to, if the match is determined to be successful, obtain the correspondence between the target patent data to be classified and the link of the industrial chain, and classify, according to the correspondence, the target patent data to be classified into the corresponding link of the industrial chain.

Preferably, when judging whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain, matching unit 31 is specifically configured to:

When judging, based on the pre-trained classification model and according to the acquired probability subject of the target patent data to be classified and the preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain, matching unit 31 is specifically configured to:

Obtain the probability subject of the target patent data to be classified;

Preferably, classification unit 32 is further configured to:

Preferably, when obtaining the patent characteristic words of the target patent data to be classified according to a preset feature extraction method, matching the patent characteristic words against the keywords of the links of the preset industrial chains respectively, and obtaining the correspondence between the target patent data to be classified and the link of the industrial chain, classification unit 32 is specifically configured to:

Preferably, the training method of the classification model is:

Collecting unit 33, configured to collect the patent sample data in the links of each industrial chain respectively, extract the feature portion of each patent sample, and use the feature portion as the characteristic index;

Training unit 34, configured to divide the value range of each characteristic index according to the characteristic indices of each patent sample, train the classification model according to the division of the value range of each characteristic index and a preset classification algorithm, calculate the probability that each characteristic index belongs to each link of each industrial chain, and classify the patent sample data into the link of the industrial chain with the highest probability.

Normalization unit 35, configured to normalize the value of each characteristic index into the same preset value range.
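The embodiment does not spell out which normalization is applied. As one plausible reading, the sketch below uses min-max scaling so that every characteristic index falls in the same range before its value range is divided; the function name and the default range [0, 1] are assumptions.

```python
def min_max_normalize(column, low=0.0, high=1.0):
    """Rescale the values of one characteristic index into [low, high]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map everything to `low`
        return [low for _ in column]
    scale = (high - low) / (hi - lo)
    return [low + (v - lo) * scale for v in column]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```

For the input `[2.0, 4.0, 6.0]` this yields `[0.0, 0.5, 1.0]`, after which division points ajk chosen within the common range are comparable across all characteristic indices.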

In summary, in the embodiment of the present invention, target patent data to be classified is obtained; based on a pre-trained classification model, it is judged, according to the acquired probability subject of the target patent data to be classified and a preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain, the pre-trained classification model being obtained by training on the basis of a preset classification algorithm and collected patent sample data in the links of each industrial chain; if the match is determined to be successful, the correspondence between the target patent data to be classified and the link of the industrial chain is obtained, and the target patent data to be classified is classified into the corresponding link of the industrial chain according to the correspondence. In this way, the probability subject of the target patent data to be classified is extracted and, based on the classification model, the target patent data is classified according to the probability subject, so that patent data is classified by link of the industrial chain. This meets the user's need to search patents by link of the industrial chain: when searching for patents, the user can obtain the patent data under each link of each industrial chain, which makes the results easy to view and distinguish and improves the accuracy of the search.

In addition, when classification fails, the patent characteristic words of the target patent data to be classified are further extracted according to a preset feature extraction method and matched against the keywords of the links of each industrial chain. Combining classification-model matching with keyword matching improves the accuracy and efficiency of classifying target patent data.
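A minimal sketch of this keyword fallback is given below. It follows the counting rule stated in the claims (choose the link that shares the most keywords with the patent characteristic words); the function name, the example words and the choice to return `None` when nothing overlaps are assumptions.

```python
def match_by_keywords(feature_words, link_keywords):
    """Pick the link whose keyword set shares the most words with the patent.

    feature_words: iterable of patent characteristic words
    link_keywords: {link name: iterable of preset keywords}
    Returns None when there are no links or no link shares any keyword
    (an assumed convention). Ties go to the first link in dictionary order.
    """
    words = set(feature_words)
    overlaps = {link: len(words & set(kws)) for link, kws in link_keywords.items()}
    if not overlaps:
        return None
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


# Hypothetical links and keywords.
link_keywords = {
    "battery": ["lithium", "cathode", "electrolyte"],
    "motor": ["stator", "rotor", "winding"],
}
link = match_by_keywords(["lithium", "electrolyte", "rotor"], link_keywords)
```

Here two characteristic words match the "battery" keywords and one matches "motor", so `link` is `"battery"`.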

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A data processing method, characterized by comprising:

obtaining target patent data to be classified;

based on a pre-trained classification model, judging, according to an acquired probability subject of the target patent data to be classified and a preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain; wherein the pre-trained classification model is obtained by training on the basis of a preset classification algorithm and collected patent sample data in the links of each industrial chain;

if the match is determined to be successful, obtaining the correspondence between the target patent data to be classified and the link of the industrial chain, and classifying, according to the correspondence, the target patent data to be classified into the corresponding link of the industrial chain.
2. The method according to claim 1, characterized in that judging whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain specifically comprises:

calculating, according to the probability subject of the target patent data to be classified, the posterior probability value that the target patent data to be classified belongs to each link of the industrial chain, and judging whether the posterior probability value that the target patent data to be classified belongs to the link of the industrial chain is greater than a preset threshold; or,

judging whether the number of probability subjects extracted from the target patent data to be classified is greater than a preset number.
3. The method according to claim 2, characterized in that the pre-trained classification model comprises a first classification model and a second classification model; wherein the first classification model is a model trained per predefined industrial chain, and the second classification model is a model trained per link of a predefined industrial chain;

judging, based on the pre-trained classification model and according to the acquired probability subject of the target patent data to be classified and the preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain specifically comprises:

obtaining the probability subject of the target patent data to be classified;

based on the first classification model, judging, according to the probability subject and the preset matching condition, whether the target patent data to be classified is successfully matched with a predefined industrial chain;

after a successful match with an industrial chain is determined, judging, based on the second classification model and according to the probability subject and the preset matching condition, whether the target patent data to be classified is successfully matched with a link in the matched industrial chain.
4. The method according to claim 2, characterized by further comprising:

if the match is determined to have failed, obtaining the patent characteristic words of the target patent data to be classified according to a preset feature extraction method, matching the patent characteristic words against the keywords of the links of the preset industrial chains respectively, and obtaining the correspondence between the target patent data to be classified and the link of the industrial chain.
5. The method according to claim 4, characterized in that obtaining the patent characteristic words of the target patent data to be classified according to a preset feature extraction method, matching the patent characteristic words against the keywords of the links of the preset industrial chains respectively, and obtaining the correspondence between the target patent data to be classified and the link of the industrial chain specifically comprises:

obtaining the patent characteristic words of the target patent data to be classified according to the preset feature extraction method;

comparing the patent characteristic words with the keywords of the links of the preset industrial chains respectively, counting, for each link, the number of patent characteristic words identical to the keywords of that link of the industrial chain, and determining the link of the industrial chain with the largest number of identical words;

obtaining, according to the link of the industrial chain with the largest number of identical words, the correspondence between the target patent data to be classified and the link of the industrial chain.
6. The method according to any one of claims 1-5, characterized in that the classification model is trained by:

collecting the patent sample data in the links of each industrial chain respectively, extracting the feature portion of each patent sample, and using the feature portion as the characteristic index;

dividing the value range of each characteristic index according to the characteristic indices of each patent sample, training the classification model according to the division of the value range of each characteristic index and a preset classification algorithm, calculating the probability that each characteristic index belongs to each link of each industrial chain, and classifying the patent sample data into the link of the industrial chain with the highest probability.
7. The method according to claim 6, characterized in that the characteristic index comprises one or any combination of the following: International Patent Classification (IPC) class, patent title, abstract, and patent characteristic words.
8. The method according to claim 6, characterized in that, before the value range of each characteristic index is divided, the method further comprises:

normalizing the value of each characteristic index into the same preset value range.
9. A data processing device, characterized by comprising:

an acquiring unit, configured to obtain target patent data to be classified;

a matching unit, configured to judge, based on a pre-trained classification model and according to an acquired probability subject of the target patent data to be classified and a preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain; wherein the pre-trained classification model is obtained by training on the basis of a preset classification algorithm and collected patent sample data in the links of each industrial chain;

a classification unit, configured to, if the match is determined to be successful, obtain the correspondence between the target patent data to be classified and the link of the industrial chain, and classify, according to the correspondence, the target patent data to be classified into the corresponding link of the industrial chain.
10. The device according to claim 9, characterized in that, when judging whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain, the matching unit is specifically configured to:

calculate, according to the probability subject of the target patent data to be classified, the posterior probability value that the target patent data to be classified belongs to each link of the industrial chain, and judge whether the posterior probability value that the target patent data to be classified belongs to the link of the industrial chain is greater than a preset threshold; or,

judge whether the number of probability subjects extracted from the target patent data to be classified is greater than a preset number.
11. The device according to claim 10, characterized in that the pre-trained classification model comprises a first classification model and a second classification model; wherein the first classification model is a model trained per predefined industrial chain, and the second classification model is a model trained per link of a predefined industrial chain;

when judging, based on the pre-trained classification model and according to the acquired probability subject of the target patent data to be classified and the preset matching condition, whether the target patent data to be classified is successfully matched with a link of a predefined industrial chain, the matching unit is specifically configured to:

obtain the probability subject of the target patent data to be classified;

based on the first classification model, judge, according to the probability subject and the preset matching condition, whether the target patent data to be classified is successfully matched with a predefined industrial chain;

after a successful match with an industrial chain is determined, judge, based on the second classification model and according to the probability subject and the preset matching condition, whether the target patent data to be classified is successfully matched with a link in the matched industrial chain.
12. The device according to claim 10, characterized in that the classification unit is further configured to:

if the match is determined to have failed, obtain the patent characteristic words of the target patent data to be classified according to a preset feature extraction method, match the patent characteristic words against the keywords of the links of the preset industrial chains respectively, and obtain the correspondence between the target patent data to be classified and the link of the industrial chain.
13. The device according to claim 12, characterized in that, when obtaining the patent characteristic words of the target patent data to be classified according to a preset feature extraction method, matching the patent characteristic words against the keywords of the links of the preset industrial chains respectively, and obtaining the correspondence between the target patent data to be classified and the link of the industrial chain, the classification unit is specifically configured to:

obtain the patent characteristic words of the target patent data to be classified according to the preset feature extraction method;

compare the patent characteristic words with the keywords of the links of the preset industrial chains respectively, count, for each link, the number of patent characteristic words identical to the keywords of that link of the industrial chain, and determine the link of the industrial chain with the largest number of identical words;

obtain, according to the link of the industrial chain with the largest number of identical words, the correspondence between the target patent data to be classified and the link of the industrial chain.
14. The device according to any one of claims 9-13, characterized in that, for training the classification model, the device further comprises:

a collecting unit, configured to collect the patent sample data in the links of each industrial chain respectively, extract the feature portion of each patent sample, and use the feature portion as the characteristic index;

a training unit, configured to divide the value range of each characteristic index according to the characteristic indices of each patent sample, train the classification model according to the division of the value range of each characteristic index and a preset classification algorithm, calculate the probability that each characteristic index belongs to each link of each industrial chain, and classify the patent sample data into the link of the industrial chain with the highest probability.
15. The device according to claim 14, characterized in that the characteristic index comprises one or any combination of the following: International Patent Classification (IPC) class, patent title, abstract, and patent characteristic words.
16. The device according to claim 14, characterized in that, for use before the value range of each characteristic index is divided, the device further comprises:

a normalization unit, configured to normalize the value of each characteristic index into the same preset value range.