CN107193915A

CN107193915A - Company information classification method and device

Info

Publication number: CN107193915A
Application number: CN201710339393.2A
Authority: CN
Inventors: 赵全颖; 张道泉; 曹培坤; 马超; 赵继广
Original assignee: Beijing Causality Network Technology Co Ltd
Current assignee: Beijing Causality Network Technology Co Ltd
Priority date: 2017-05-15
Filing date: 2017-05-15
Publication date: 2017-09-22

Abstract

The present invention relates to the field of data analysis, and in particular to a company information classification method and device that allow massive amounts of company information to be entered promptly and classified quickly and correctly. In the method, a number of words that meet a set rule are extracted from the acquired company information to be classified, and every two words are defined as a word pair. Based on a preset coupling network model, the complete correlation of each word pair under each preset type of business is determined; the coupling probability that the word pairs belong to each type of business is then determined, and the type of business with the maximum coupling probability is taken as the type of business of the company information to be classified. The type of business of directly acquired company information can thus be determined from the degree of semantic association between its words, which improves classification accuracy. Because no manual operation is needed, processing efficiency is also improved, which in turn improves the user experience.

Description

A company information classification method and device

Technical field

The present invention relates to the field of data analysis, and in particular to a company information classification method and device.

Background

The rapid development of Internet technology has driven explosive growth in technology, media and telecom (TMT) enterprises. To let users quickly look up information about an enterprise of interest among massive amounts of company information, the prior art first enters the company information manually, item by item, and then manually classifies all of the entered company information to obtain classification results. Users can then use the classification results to locate the enterprise of interest quickly and obtain further information about it.

Clearly, data entry and classification for large volumes of company information still rely on manual work. As a result, company information cannot be updated promptly and processing takes longer; classification is also prone to error, which further harms the user experience.

In view of this, a new company information classification method is needed to overcome the above drawbacks.

Summary of the invention

Embodiments of the present invention provide a company information classification method and device, so that massive amounts of company information can be entered promptly and classified quickly and correctly.

The specific technical solutions provided by the embodiments of the present invention are as follows:

A company information classification method, comprising:

obtaining company information to be classified, extracting from it a number of words that meet a set rule, and defining every two words as a word pair;

determining, based on a preset coupling network model, the complete correlation of each word pair under each preset type of business, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the types of business are at the same enterprise level;

determining, based on the complete correlation of each word pair under each type of business, the coupling probability that the word pairs belong to each type of business, and taking the type of business with the maximum coupling probability as the type of business of the company information to be classified at the current enterprise level.
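The first and last steps of the method can be illustrated with a minimal Python sketch. It assumes the words have already been extracted and that the coupling probabilities have already been produced by a trained model; the function names `make_word_pairs` and `pick_type` and the numbers used are hypothetical and do not come from the patent.

```python
from itertools import combinations

def make_word_pairs(words):
    """Return every unordered pair of distinct extracted words."""
    unique = sorted(set(words))  # deduplicate and fix an order so the pairs are reproducible
    return list(combinations(unique, 2))

def pick_type(coupling_probability):
    """Return the type of business with the largest coupling probability."""
    return max(coupling_probability, key=coupling_probability.get)

pairs = make_word_pairs(["smart", "hardware", "mobile"])

# Hypothetical coupling probabilities, standing in for the output of the model.
best = pick_type({"cultural_media": 0.62, "house": 0.08, "consume_life": 0.30})
```

With n extracted words this produces n(n-1)/2 word pairs, so the set rule used for extraction directly bounds the amount of work done in the later steps.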

Optionally, before the company information to be classified is obtained, the method further comprises:

obtaining a number of pieces of company information, and screening out from them the pieces that meet a set screening rule to form a training sample set, wherein the type of business of every piece of company information in the training sample set has been determined;

grouping, according to the type of business of each piece of company information in the training sample set, the pieces that belong to the same type of business into one training sample subset, wherein each training sample subset corresponds to one type of business, and the types of business of all training sample subsets are at the same enterprise level;

performing the following operations for each piece of company information in each training sample subset:

extracting keywords, in a set number or within a set number range, to form a keyword set;

defining every two keywords in the keyword set as a keyword pair, and calculating the complete correlation between the two keywords of each keyword pair.

Optionally, obtaining a number of pieces of company information and screening out from them the pieces that meet the set screening rule to form the training sample set, wherein the type of business of every piece of company information in the training sample set has been determined, comprises:

crawling a number of pieces of company information with a preset web crawler device, extracting the enterprise name and company profile information contained in each crawled piece to form an information pair, and performing the following operations for each information pair:

splitting the company profile information of the information pair into a number of simple sentences by clause segmentation;

performing semantic mining on each simple sentence to extract its subject-verb-object (SVO) components, and constructing from the SVO components of each simple sentence a canonical clause, that is, a regular-expression sentence pattern, that conforms to the industry classification rule;

screening out every information pair that has at least one canonical clause to form the training sample set, and performing the following operation for each information pair in the training sample set: selecting, based on a preset rule, a target canonical clause from its at least one canonical clause, and determining the corresponding type of business based on the target canonical clause.

Optionally, selecting, based on a preset rule, a target canonical clause from the at least one corresponding canonical clause and determining the corresponding type of business based on the target canonical clause comprises:

taking, according to the order in which the at least one canonical clause appears in the company profile information, the earliest canonical clause as the target canonical clause, and recalling the corresponding information pair to the corresponding type of business based on the target canonical clause; or

randomly selecting one canonical clause from the at least one canonical clause as the target canonical clause, and recalling the corresponding information pair to the corresponding type of business based on the target canonical clause.

Optionally, defining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair comprises:

calculating, based on a variance distribution, the weight of each keyword of the keyword set in the corresponding company profile information, defining every two keywords in the keyword set as a keyword pair, and determining, from the weights of the two keywords of each keyword pair, the co-occurrence correlation between them, wherein the co-occurrence correlation characterizes how strongly two keywords are associated through appearing together;

determining, from the co-occurrence correlation between the two keywords of each keyword pair, the co-occurrence dependent probability between them, wherein the co-occurrence dependent probability is the ratio of the co-occurrence correlation between the two keywords to the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;

performing the following operation for each keyword pair: when it is determined that at least one interim keyword exists such that the co-occurrence dependent probabilities between each of the two keywords of the keyword pair and the at least one interim keyword are all greater than zero, determining the coupling correlation between the two keywords based on those co-occurrence dependent probabilities;

determining, from the co-occurrence dependent probability and the coupling correlation between the two keywords of each keyword pair, the complete correlation between them.

Optionally, determining the coupling correlation between the two keywords based on the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword comprises:

determining, based on those co-occurrence dependent probabilities, the conditional dependency between the two keywords and the at least one interim keyword, wherein a conditional dependency between two keywords and an interim keyword indicates that, conditioned on that interim keyword, the two keywords are associated;

determining the coupling correlation between the two keywords based on the conditional dependency between the two keywords and the at least one interim keyword.

Optionally, determining the conditional dependency between the two keywords and the at least one interim keyword based on the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword comprises:

performing the following operation for each interim keyword:

taking the smaller of the two co-occurrence dependent probabilities between each of the two keywords and the interim keyword as the conditional dependency between the two keywords and that interim keyword.

Optionally, determining the coupling correlation between the two keywords based on the conditional dependency between the two keywords and the at least one interim keyword comprises:

computing a weighted average of the conditional dependencies between the two keywords over each interim keyword of the at least one interim keyword, and taking the averaged result as the coupling correlation between the two keywords.
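The chain from co-occurrence correlation to coupling correlation can be sketched in Python as follows. This is only an illustration under stated assumptions: the co-occurrence correlations are taken as given (the text does not spell out how they are derived from the keyword weights), the weights of the weighted average default to equal weights because the text does not specify them, and all function names and numbers are hypothetical.

```python
def cooccurrence_probability(cooccurrence):
    """Divide each pair's co-occurrence correlation by the total over all pairs in the set."""
    total = sum(cooccurrence.values())
    return {pair: value / total for pair, value in cooccurrence.items()}

def pair_probability(probs, a, b):
    """Look up the co-occurrence dependent probability of an unordered pair (0 if absent)."""
    return probs.get(frozenset((a, b)), 0.0)

def coupling_correlation(probs, a, b, keywords, weights=None):
    """Weighted average, over usable interim keywords k, of min(P(a, k), P(b, k))."""
    terms = []
    for k in keywords:
        if k in (a, b):
            continue
        p_ak = pair_probability(probs, a, k)
        p_bk = pair_probability(probs, b, k)
        if p_ak > 0 and p_bk > 0:  # k links both keywords, so it counts as an interim keyword
            w = 1.0 if weights is None else weights[k]  # equal weights are an assumption
            terms.append((w, min(p_ak, p_bk)))  # conditional dependency: the smaller value
    if not terms:
        return 0.0
    return sum(w * c for w, c in terms) / sum(w for w, _ in terms)

# Hypothetical co-occurrence correlations for a three-keyword set.
raw = {
    frozenset(("phone", "hardware")): 2.0,
    frozenset(("phone", "internet")): 1.0,
    frozenset(("hardware", "internet")): 1.0,
}
probs = cooccurrence_probability(raw)
coupling = coupling_correlation(probs, "phone", "hardware", ["phone", "hardware", "internet"])
```

Here "internet" is the only interim keyword for the pair ("phone", "hardware"), so the coupling correlation equals the smaller of the two probabilities that link it to each keyword.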

Optionally, determining, based on the complete correlation of each word pair under each type of business, the coupling probability that the word pairs belong to each type of business comprises:

determining, based on the complete correlation of each word pair under each type of business, the class-conditional probability of each word pair under each type of business;

determining, based on the class-conditional probability of each word pair under each type of business and the prior probability of each type of business, the coupling probability that the word pairs belong to each type of business.
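One plausible reading of this step is a Bayes-style combination of the prior probability with the class-conditional probabilities of the word pairs. The Python sketch below assumes that reading, treats the word pairs as conditionally independent given the type of business, and takes the class-conditional probabilities as given; none of these assumptions is confirmed by the text, and the function name and numbers are hypothetical.

```python
import math

def coupling_probabilities(priors, class_conditional):
    """Combine priors with per-pair class-conditional probabilities.

    priors: {type: P(type)}
    class_conditional: {type: [P(pair | type) for each word pair]}
    All probabilities must be strictly positive, since logarithms are used.
    """
    log_scores = {
        t: math.log(priors[t]) + sum(math.log(p) for p in class_conditional[t])
        for t in priors
    }
    peak = max(log_scores.values())  # subtract the maximum for numerical stability
    unnormalized = {t: math.exp(s - peak) for t, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}

# Hypothetical inputs: two types of business and two word pairs.
result = coupling_probabilities(
    {"cultural_media": 0.5, "house": 0.5},
    {"cultural_media": [0.4, 0.5], "house": [0.1, 0.2]},
)
predicted = max(result, key=result.get)
```

Working in log space avoids underflow when many word pairs are multiplied together; the final normalization only rescales the scores and does not change which type of business has the maximum.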

Optionally, after the type of business with the maximum coupling probability is taken as the type of business of the company information to be classified at the current enterprise level, the method further comprises:

determining the type of business of the company information to be classified at each preset enterprise level;

selecting, based on a preset multi-level screening rule, one type of business from the types of business at the different enterprise levels as the target type of business of the company information to be classified.

A company information classification device, comprising:

a data acquisition unit, configured to obtain company information to be classified, extract from it a number of words that meet a set rule, and define every two words as a word pair;

a processing unit, configured to determine, based on a preset coupling network model, the complete correlation of each word pair under each preset type of business, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the types of business are at the same enterprise level;

a classification unit, configured to determine, based on the complete correlation of each word pair under each type of business, the coupling probability that the word pairs belong to each type of business, and to take the type of business with the maximum coupling probability as the type of business of the company information to be classified at the current enterprise level.

Optionally, the device further comprises a training unit, the training unit being configured to:

before the company information to be classified is obtained, perform the following operations:

Optionally, when obtaining a number of pieces of company information and screening out from them the pieces that meet the set screening rule to form the training sample set, wherein the type of business of every piece of company information in the training sample set has been determined, the training unit is configured to:

Optionally, when selecting, based on a preset rule, a target canonical clause from the at least one corresponding canonical clause and determining the corresponding type of business based on the target canonical clause, the training unit is configured to:

Optionally, when defining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair, the training unit is configured to:

Optionally, when determining the coupling correlation between the two keywords based on the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword, the training unit is configured to:

Optionally, when determining the conditional dependency between the two keywords and the at least one interim keyword based on the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword, the training unit is configured to:

for each interim keyword, perform the following operation:

Optionally, when determining the coupling correlation between the two keywords based on the conditional dependency between the two keywords and the at least one interim keyword, the training unit is configured to:

Optionally, when determining, based on the complete correlation of each word pair under each type of business, the coupling probability that the word pairs belong to each type of business, the classification unit is configured to:

Optionally, the device further comprises a multi-level classification unit, the multi-level classification unit being configured to:

after the type of business with the maximum coupling probability is taken as the type of business of the company information to be classified at the current enterprise level, perform the following operations:

In the embodiments of the present invention, company information to be classified is first obtained; a number of words that meet a set rule are then extracted from it, and every two words are defined as a word pair. Next, based on a preset coupling network model, the complete correlation of each word pair under each preset type of business is determined, where the complete correlation characterizes the degree of semantic association between two words. Finally, based on the complete correlation of each word pair under each type of business, the coupling probability that the word pairs belong to each type of business is determined, and the type of business with the maximum coupling probability is taken as the type of business of the company information to be classified. In this way, the type of business of directly acquired company information can be determined from the degree of semantic association between the words extracted from it, which improves classification accuracy. Moreover, because no manual operation is required, processing efficiency is improved, which in turn improves the user experience.

Brief description of the drawings

Fig. 1 is a three-level enterprise classification chart for real estate and home decoration in an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the web crawler device in an embodiment of the present invention;

Fig. 3 is a flowchart of the method for screening the training sample set in an embodiment of the present invention;

Fig. 4 is a flowchart of the method for determining the coupling network model in an embodiment of the present invention;

Fig. 5 is a flowchart of the method for classifying company information to be classified based on the determined coupling network model in an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the company information classification device in an embodiment of the present invention.

Detailed description

So that massive amounts of company information can be entered promptly and classified quickly and correctly, the embodiments of the present invention provide a newly designed company information classification method. In this method, company information to be classified is obtained; a number of words that meet a set rule are extracted from it, and every two words are defined as a word pair. Then, based on a preset coupling network model, the complete correlation of each word pair under each preset type of business is determined, where the complete correlation characterizes the degree of semantic association between two words. Finally, based on the complete correlation of each word pair under each type of business, the coupling probability that the word pairs belong to each type of business is determined, and the type of business with the maximum coupling probability is taken as the type of business of the company information to be classified.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The solution of the present invention is described in detail below through specific embodiments; the present invention is, of course, not limited to the following embodiments.

In the embodiments of the present invention, multiple first-level types of business are preset based on the industry classification rule, for example the media industry, the real estate and home decoration industry, and the game industry. Each first-level type of business can be subdivided into multiple second-level types of business, each second-level type of business into several third-level types of business, and so on, down to a number of Nth-level types of business.

The embodiments of the present invention use a three-level enterprise architecture, that is, types of business are subdivided down to the third level. Taking the real estate and home decoration industry as an example, as shown in Fig. 1, the first-level type of business is "real estate and home decoration". The second-level types of business are "letting agency", "furniture and appliances", "decoration design", "real estate information and community", "infrastructure services" and "real estate and home decoration, other". Taking "letting agency" as an example, its third-level types of business are "real estate consulting intermediary", "real estate price evaluation intermediary", "real estate agency intermediary", "rental platforms and software" and "housing transaction platforms and software".
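The three-level hierarchy described above maps naturally onto a nested dictionary. The following Python sketch encodes only the branch named in the text; the English labels are translations, the empty lists stand for third-level types that the text does not enumerate, and the names `TAXONOMY` and `types_at_level` are hypothetical.

```python
# First level -> second level -> list of third-level types of business.
TAXONOMY = {
    "real estate and home decoration": {
        "letting agency": [
            "real estate consulting intermediary",
            "real estate price evaluation intermediary",
            "real estate agency intermediary",
            "rental platforms and software",
            "housing transaction platforms and software",
        ],
        "furniture and appliances": [],
        "decoration design": [],
        "real estate information and community": [],
        "infrastructure services": [],
        "real estate and home decoration, other": [],
    },
}

def types_at_level(taxonomy, level):
    """List the types of business found at the given enterprise level (1, 2 or 3)."""
    if level == 1:
        return list(taxonomy)
    if level == 2:
        return [second for seconds in taxonomy.values() for second in seconds]
    return [
        third
        for seconds in taxonomy.values()
        for thirds in seconds.values()
        for third in thirds
    ]

second_level = types_at_level(TAXONOMY, 2)
third_level = types_at_level(TAXONOMY, 3)
```

Listing the types at one level at a time matches the requirement that all types of business compared in a single classification pass are at the same enterprise level.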

Further, in the embodiments of the present invention, before the acquired company information is classified, a number of pieces of company information may first be obtained as a training sample set, and the coupling network model used to classify company information is then built from the training sample set.

Preferably, in the embodiments of the present invention, company information may come from a web crawler. For example, a web crawler device may be added; its architecture is shown in Fig. 2. The web crawler device includes a download module, a parsing module and a storage module, and works as follows:

First, configure the web crawler rule, which is used to save the collected web pages to local storage in batches.

Second, configure the web page collection rule. For example, one web page is used as a template and the data blocks to be collected are marked on it; other web pages that match this template are then parsed according to the same rule.

Then, configure the collection task. Specifically, the crawler rule and the web page collection rules are combined, and the combined result is one collection task; one crawler rule can correspond to multiple web page collection rules.

Finally, publish the collection task. Specifically, the configured collection task can be published to a collection queue on a designated server.

These steps complete the web crawling of company information.
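The four configuration steps can be pictured with a small Python sketch. Every class name, field name, URL and selector below is hypothetical, and an in-process `queue.Queue` merely stands in for the collection queue on the designated server; the patent does not describe the actual data model.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class CrawlerRule:
    """Step 1: where to crawl and where to save the downloaded pages locally."""
    seed_urls: list
    save_dir: str

@dataclass
class CollectionRule:
    """Step 2: a template page plus the data blocks to extract from matching pages."""
    template_url: str
    data_blocks: dict  # field name -> selector for that block (illustrative)

@dataclass
class CollectionTask:
    """Step 3: one crawler rule combined with one or more collection rules."""
    crawler: CrawlerRule
    collections: list = field(default_factory=list)

def publish(task, collection_queue):
    """Step 4: place the configured task on a collection queue."""
    collection_queue.put(task)

task = CollectionTask(
    crawler=CrawlerRule(seed_urls=["https://example.com/companies"], save_dir="pages/"),
    collections=[
        CollectionRule("https://example.com/company/1", {"name": "h1", "profile": "div.intro"}),
        CollectionRule("https://example.com/firm/1", {"name": "h2", "profile": "p.summary"}),
    ],
)
collection_queue = Queue()
publish(task, collection_queue)
```

Keeping the crawler rule separate from the collection rules reflects the one-to-many relationship stated in the text: the same set of downloaded pages can be parsed with several templates.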

Further, the type of business of the crawled company information is unknown, that is, it is not known which type of business each piece belongs to, so the crawled company information cannot be used directly as a training sample set. The acquired company information must be screened to select the pieces that meet the set screening rule, which then form the training sample set. As shown in Fig. 3, the screening process is as follows:

Step 300: For each piece of company information, extract the enterprise name and company profile information to form an information pair.

Specifically, every piece of company information contains at least an enterprise name and company profile information, so the enterprise name and company profile information are extracted from each piece to form its information pair.

Further, to make the information pairs convenient to use later, in the embodiments of the present invention the extracted information pairs are stored in a database as key-value pairs. For example, the database is an in-memory Redis database, and each key-value pair consists of a "Key" and a "Value", as shown in Table 1.

Table 1

In the embodiments of the present invention, the information pairs are stored in an in-memory Redis database so that the required information pairs can be retrieved quickly when they are used later, without extraction speed being affected.

Step 310: For each information pair, perform clause segmentation on the company profile information it contains to obtain a number of simple sentences.

Specifically, the company profile information of an information pair usually consists of one long passage of text. To extract keywords that better reflect the type of business, the company profile information is first split by clause into a number of simple sentences; for example, it can be split at punctuation marks.

Taking the full stop ("。") as the delimiter, the company profile information of "AA" in Table 1 can be split into "AA was founded in April 2010 and is a mobile Internet company focused on the research and development of smart hardware and electronic products", "'Born for fever' is the product concept of AA", and "AA pioneered the model of developing a mobile phone operating system in the Internet way, with fans taking part in development and improvement".
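A minimal Python sketch of this clause segmentation step is shown below. It splits only at the Chinese full stop by default, as in the example above; the function name, the `terminators` parameter and the sample text (English wording with Chinese full stops) are illustrative assumptions.

```python
import re

def split_simple_sentences(profile, terminators="。"):
    """Split profile text at the given punctuation marks and drop empty pieces."""
    pattern = "[" + re.escape(terminators) + "]"
    return [part.strip() for part in re.split(pattern, profile) if part.strip()]

profile = "AA was founded in April 2010。'Born for fever' is the product concept of AA。"
sentences = split_simple_sentences(profile)
```

Passing a longer string such as `"。！？；"` as `terminators` would split at several punctuation marks at once, if a finer segmentation is wanted.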

Step 320: Perform semantic mining on each simple sentence to extract its subject-verb-object (SVO) components, and use those components to construct, for each simple sentence, a canonical clause that conforms to the industry classification rule.

Specifically, the SVO components are mined because, in Chinese, the subject, verb and object of a complete sentence usually form its backbone and are highly cohesive. Moreover, most sentences contain all three; a missing subject or object is rare, and it is rarer still for all three to be missing.

Accordingly, semantic mining is performed on each simple sentence to extract its SVO components. Then, for each simple sentence from which SVO components can be mined, those components are used to construct a canonical clause that conforms to the industry classification rule.

Further, whether the industry classification rule is met can be decided by keywords. For example, keywords related to the industry classification are preset; if the extracted SVO components contain a preset keyword, the mined SVO components are considered to conform to the industry classification rule, and a canonical clause is built from them.

For example, take the simple sentence from the example above, "AA was founded in April 2010 and is a mobile Internet company focused on the research and development of smart hardware and electronic products". Semantic mining yields "AA is a mobile Internet company". Based on these SVO components, a canonical clause that conforms to the industry classification rule can be constructed as follows: "is (.*) mobile Internet, cultural_media".

Of course, SVO components cannot be mined from every simple sentence, and a simple sentence without SVO components cannot yield a canonical clause.

For example, if a segmented simple sentence consists only of the conjunction "then", it has no SVO components, so no canonical clause that conforms to the industry classification rule can be constructed from it.

Moreover, the SVO components mined from a simple sentence do not always yield a canonical clause that conforms to the industry classification rule.

For example, for "'Born for fever' is the product concept of AA" in the example above, the mined SVO components are "'Born for fever' is a product concept", which clearly do not conform to the industry classification rule.

Step 330: Determine each information pair that has at least one canonical clause, and for each such information pair, select a target canonical clause from its canonical clauses based on a preset rule and determine the corresponding type of business based on the target canonical clause.

Specifically, not every information pair has a canonical clause, and an information pair that has canonical clauses does not necessarily have just one. Therefore, after the canonical clauses that conform to the industry classification rule have been constructed, the information pairs that have at least one canonical clause must be identified.

Further, after those information pairs are identified, the following operation is performed on each of them: based on a preset rule, a single target canonical clause is selected from its canonical clauses, and the target canonical clause is used to recall the information pair to the corresponding type of business, where "recall" means determining the type of business to which the information pair corresponds.

Taking one information pair as an example, if the information pair has multiple canonical clauses, the canonical clause that appears earliest in the corresponding profile information can be taken as the target canonical clause and used to recall the information pair to the corresponding type of business.

For example, suppose information pair A has the following three canonical clauses: "is (.*) video application, cultural_media", "is (.*) skin care, consume_life" and "is (.*) renovation (.*)$, house". Because the earliest canonical clause is taken as the target, "is (.*) video application, cultural_media" becomes the target canonical clause and is used to perform the recall operation on information pair A, which determines that information pair A corresponds to the "video media" class.
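The "earliest canonical clause wins" rule can be sketched in Python as below. The patterns are English stand-ins for the (originally Chinese) canonical clauses in the example, each paired with its type label; the function name and the sample sentences are hypothetical, and matching is done per simple sentence so that sentence order decides the result.

```python
import re

# (pattern, type label) pairs; English stand-ins for the canonical clauses in the example.
CANONICAL_CLAUSES = [
    (re.compile(r"is (.*)video application"), "cultural_media"),
    (re.compile(r"is (.*)skin care"), "consume_life"),
    (re.compile(r"is (.*)renovation(.*)$"), "house"),
]

def recall_by_earliest(simple_sentences, clauses):
    """Return the label of the first canonical clause matched, scanning sentences in order."""
    for sentence in simple_sentences:
        for pattern, label in clauses:
            if pattern.search(sentence):
                return label
    return None  # no canonical clause: the pair is left out of the training sample set

label_a = recall_by_earliest(
    ["BB is a short video application", "BB is also a home renovation consultant"],
    CANONICAL_CLAUSES,
)
```

An information pair for which the function returns `None` has no canonical clause and would therefore be screened out rather than assigned a type of business.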

Alternatively, one canonical clause can be selected at random from the largest group of canonical clauses that share the same keyword and used as the target canonical clause to recall the information pair to the corresponding type of business.

For example, suppose information pair M has 5 canonical clauses, of which 4 relate to "real estate renovation" and only 1 relates to "cultural media". One of the 4 canonical clauses related to "real estate renovation" can then be selected at random as the target canonical clause and used to determine the type of business of the information pair, which is necessarily related to "real estate renovation".
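This alternative rule amounts to a majority vote followed by a random pick, as in the Python sketch below. Grouping the canonical clauses by their type label (rather than by keyword) is a simplifying assumption, and the function name and clause texts are hypothetical.

```python
import random
from collections import Counter

def pick_target_clause(clauses, rng=random):
    """Pick at random among the clauses whose label is the most frequent one.

    clauses: list of (pattern_text, label) tuples.
    """
    counts = Counter(label for _, label in clauses)
    top_label = counts.most_common(1)[0][0]
    candidates = [clause for clause in clauses if clause[1] == top_label]
    return rng.choice(candidates)

clauses_m = [
    ("is (.*)renovation", "house"),
    ("is (.*)decoration design", "house"),
    ("is (.*)real estate", "house"),
    ("is (.*)home improvement", "house"),
    ("is (.*)video application", "cultural_media"),
]
target = pick_target_clause(clauses_m, rng=random.Random(0))
```

Whichever of the four majority clauses is drawn, the resulting type of business is the same, which is why the random choice is harmless in this example.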

In the embodiment of the present invention, any canonical clause that can determine the type of business may be taken as the target canonical clause; the specific screening process is not limited.

At this point, the information pairs that satisfy the above setting screening rule can serve as the training sample set. Since the type of business of every information pair in the training sample set has been determined, the information pairs are grouped according to a same-level classification rule: based on the type of business corresponding to each information pair in the training sample set, all information pairs belonging to the same type of business form one training sample subset. Each training sample subset corresponds to one type of business, and the types of business corresponding to one batch of training sample subsets all have the same enterprise level.

Specifically, the same-level classification rule means that if the type of business determined for each information pair is a first-level type of business, the training sample set is divided according to the first-level types of business; if the type of business determined for each information pair is an N-th-level type of business, the training sample set is divided according to the N-th-level types of business.

For example, assume that there are 3 preset first-level types of business and that each first-level type of business has 2 second-level types of business, as shown in Table 2.

Table 2

First-level type of business | Second-level types of business
cultural media | new media; traditional media
house property and home decoration | real estate; decoration design
local life | cuisine; beauty

Continuing the example, assume there is a training sample set M = {information pair 1, information pair 2, information pair 3, information pair 4, information pair 5, information pair 6} containing 6 information pairs. Information pairs 1 and 2 correspond to the first-level type of business "cultural media", with information pair 1 corresponding to the second-level type of business "new media" and information pair 2 to "traditional media". Information pairs 3 and 4 correspond to the first-level type of business "house property and home decoration", with information pair 3 corresponding to the second-level type of business "real estate" and information pair 4 to "decoration design". Information pairs 5 and 6 correspond to the first-level type of business "local life", with information pair 5 corresponding to the second-level type of business "cuisine" and information pair 6 to "beauty".

If the division follows the first-level types of business, training sample set M can be divided into 3 training sample subsets: M₁ = {information pair 1, information pair 2}, M₂ = {information pair 3, information pair 4} and M₃ = {information pair 5, information pair 6}.

If the division follows the second-level types of business, training sample set M can be divided into 6 training sample subsets: M₁ = {information pair 1}, M₂ = {information pair 2}, M₃ = {information pair 3}, M₄ = {information pair 4}, M₅ = {information pair 5} and M₆ = {information pair 6}.
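
The same-level division can be sketched as a simple grouping step. The dictionary layout, the label strings and the function name below are assumptions made for illustration; only the grouping logic follows the text.

```python
from collections import defaultdict

# Hypothetical labels: each information pair carries one type of business
# per enterprise level, ordered (first level, second level).
TRAINING_SET = {
    "pair 1": ("cultural media", "new media"),
    "pair 2": ("cultural media", "traditional media"),
    "pair 3": ("house property and home decoration", "real estate"),
    "pair 4": ("house property and home decoration", "decoration design"),
    "pair 5": ("local life", "cuisine"),
    "pair 6": ("local life", "beauty"),
}


def split_by_level(training_set, level):
    """Group information pairs by their type of business at a 1-based level."""
    subsets = defaultdict(list)
    for pair, types in training_set.items():
        subsets[types[level - 1]].append(pair)
    return dict(subsets)
```

`split_by_level(TRAINING_SET, 1)` yields 3 subsets of 2 information pairs each, while `split_by_level(TRAINING_SET, 2)` yields 6 subsets of 1 information pair each, matching the two divisions in the example.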

After the training sample subsets contained in the training sample set, and the type of business corresponding to each training sample subset, have been determined, the coupling network model is determined based on those training sample subsets. In the embodiment of the present invention, the coupling network model may be a Bayesian coupling network model. Referring to Fig. 4, the method flow for determining the coupling network model is as follows:

Step 400: For the company profile information of each information pair in each training sample subset, perform the following operation: extract keywords whose number meets a set number or falls within a set number range, to form a keyword set.

Specifically, the company profile information contained in each information pair is composed of a number of keywords, but not every keyword has reference value. To make it easier to calculate the degree of association between keywords later, the corresponding keywords can be extracted from the keywords of the company profile information of each information pair according to a set number or a set number range, to form the respective keyword sets.

For example, assume the set number is 200. If the training sample set contains two information pairs, 200 keywords that meet a set condition are extracted from the company profile information of each of the two information pairs to form the respective keyword sets, where the set condition may be related to the type of business.

As another example, assume the set number range is 100-150. If the training sample set contains two information pairs, 100 to 150 keywords that meet the set condition are extracted from the company profile information of each of the two information pairs to form the respective keyword sets.

Step 410: Based on variance distribution, calculate, for each keyword in each keyword set, the weight value that the keyword has in the company profile information in which it currently appears.

Specifically, after the keyword set corresponding to each information pair has been obtained, the weight value of each keyword in each keyword set within the company profile information in which it currently appears is determined.

Preferably, in the embodiment of the present invention, the weight value of keyword h in the company profile information d in which it currently appears is calculated using the following formula:

Wherein t_hd is the word frequency, calculated as t_hd = T_hd / S_d, where T_hd represents the number of times keyword h occurs in company profile information d and S_d represents the total number of words contained in company profile information d; N represents the total number of company profile information items contained in the training sample set; N(w_h) represents the number of company profile information items in the training sample set in which keyword h occurs. The formula also uses the average number of times keyword h occurs in each company profile information item of the training sample set, and a tuning factor that is mainly used to reduce excessive dependence on word frequency when the weight value of a keyword is calculated.

Step 420: For each keyword set, perform the following operation: define every two keywords as one keyword pair and, based on the weight values corresponding to the two keywords of each keyword pair, determine the co-occurrence correlation between the two keywords, where the co-occurrence correlation characterizes the relevance with which the two keywords occur together.

Specifically, association relationships may exist between different words: within a passage of text, the occurrence of word A may lead to the occurrence of word B, in which case word A and word B are said to have a co-occurrence correlation.

Further, every two keywords in each keyword set are defined as one keyword pair. Taking one keyword pair as an example, the co-occurrence correlation between its two keywords is determined based on the weight values corresponding to the two keywords, where the co-occurrence correlation characterizes the relevance with which the two keywords occur together.

Preferably, in the embodiment of the present invention, the co-occurrence correlation between keyword key_i and keyword key_k is determined using the following formula:

Wherein w_xi and w_xk represent the weight values of keyword key_i and keyword key_k, respectively, in company profile information d_x; S = {x | (w_xi ≠ 0) ∧ (w_xk ≠ 0)} represents the company profile information items in the training sample set in which the weight values of both keyword key_i and keyword key_k are non-zero.

Step 430: For each keyword pair, perform the following operation: based on the co-occurrence correlation between the two keywords of the keyword pair, determine the co-occurrence dependent probability between the two keywords, where the co-occurrence dependent probability characterizes the proportion that the co-occurrence correlation between the two keywords represents of the co-occurrence correlations of all keyword pairs in the keyword set to which they belong.

Specifically, after the co-occurrence correlation between the two keywords of each keyword pair in each keyword set has been determined, the co-occurrence dependent probability between the two keywords of each keyword pair needs to be determined.

Further, taking one keyword pair as an example, the co-occurrence dependent probability between the two keywords of the keyword pair is determined based on the co-occurrence correlation between those two keywords and on the co-occurrence correlations corresponding to the other keyword pairs in the keyword set to which they belong.

Preferably, in the embodiment of the present invention, the co-occurrence dependent probability between keyword key_k and keyword key_i can be calculated using the following formula, where this probability characterizes the probability that keyword key_i also occurs when keyword key_k occurs in a company profile information item d_x contained in the training sample set:

Wherein R_co-occur(key_i, key_k) represents the co-occurrence correlation between keyword key_i and keyword key_k.

Step 440: For each keyword pair, perform the following operation: when it is determined that at least one interim keyword exists such that the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword are all greater than zero, determine, based on those co-occurrence dependent probabilities, the conditional dependency between the two keywords and the at least one interim keyword.

Specifically, besides a direct association relationship, that is, a co-occurrence correlation, two keywords may also have an indirect association relationship. For this case, when it is determined that at least one interim keyword exists such that the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword are all greater than zero, the conditional dependency between the two keywords and the at least one interim keyword can be determined based on those co-occurrence dependent probabilities.

As an illustration of conditional dependency: if the co-occurrence dependent probability between keyword A and keyword C is greater than zero, and the co-occurrence dependent probability between keyword B and keyword C is greater than zero, then keyword A and keyword B have a conditional dependency.

Further, taking one keyword pair as an example, if the co-occurrence dependent probabilities between each of the two keywords of the pair and at least one interim keyword are all greater than zero, the following operation is performed for each interim keyword: the smaller of the two co-occurrence dependent probabilities between the two keywords and the interim keyword is taken as the conditional dependency between the two keywords and that interim keyword.

Preferably, in the embodiment of the present invention, if the training sample set contains at least one keyword key_k such that R_condit(key_m, key_k) > 0 and R_condit(key_n, key_k) > 0, then a conditional dependency exists between keyword key_m and keyword key_n, and it is calculated using the following formula:

R(key_m, key_n | key_k) = min(R_condit(key_m, key_k), R_condit(key_n, key_k))

Wherein R_condit(key_m, key_k) represents the co-occurrence dependent probability between keyword key_k and keyword key_m, and R_condit(key_n, key_k) represents the co-occurrence dependent probability between keyword key_k and keyword key_n.

For example, assume the co-occurrence dependent probability between keyword A and keyword C is 0.6 and the co-occurrence dependent probability between keyword B and keyword C is 0.4; the conditional dependency between keyword A and keyword B given keyword C is then 0.4.
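
The minimum rule of Step 440 can be written directly as a small function. The function name and the choice to return 0.0 when either probability is not greater than zero are illustrative assumptions; the minimum itself follows the formula above.

```python
def conditional_dependency(p_m_k, p_n_k):
    """Conditional dependency of two keywords through one interim keyword.

    p_m_k, p_n_k: co-occurrence dependent probabilities between each of the
    two keywords and the interim keyword. If either is not greater than zero,
    the interim keyword does not link the two keywords and 0.0 is returned.
    """
    if p_m_k <= 0 or p_n_k <= 0:
        return 0.0
    return min(p_m_k, p_n_k)
```

For the example values, `conditional_dependency(0.6, 0.4)` evaluates to 0.4, the smaller of the two probabilities.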

Further, the more interim keywords through which two keywords can be associated, the higher the conditional dependency between the two keywords.

For example, keyword A and keyword C can be associated through keyword B, and they can also be associated through keyword D. Clearly, in this situation the conditional dependency between keyword A and keyword C is higher than it would be if the two keywords were associated through keyword B alone.

Step 450: For each keyword pair, perform the following operation: based on the conditional dependency between the two keywords of the keyword pair and at least one interim keyword, determine the coupling correlation between the two keywords.

Specifically, taking one keyword pair as an example, the coupling correlation between the two keywords is determined based on the conditional dependency between the two keywords of the keyword pair and at least one interim keyword.

Further, still taking one keyword pair as an example, the conditional dependencies between the two keywords for each interim keyword of the at least one interim keyword are averaged with weights, and the averaged result is defined as the coupling correlation between the two keywords.

Preferably, in the embodiment of the present invention, the coupling correlation of a keyword pair (keyword key_n and keyword key_m) in the training sample set is calculated using the following formula:

Wherein L = {key_k | (R_condit(key_m, key_k) > 0) ∧ (R_condit(key_n, key_k) > 0)} is the set of interim keywords.

For example, assume the conditional dependency between keyword A and keyword B given keyword C is 0.4, and the conditional dependency between keyword A and keyword B given keyword D is 0.6; the coupling correlation between keyword A and keyword B is then 0.5.

Of course, if no interim keyword associates the two keywords, the coupling correlation between them is zero.
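
Steps 440 and 450 together can be sketched as below. Two simplifications are assumed and are not stated in the text: the co-occurrence dependent probability is treated as symmetric (stored under an unordered pair), and all interim keywords receive equal weight, which reproduces the 0.5 result of the example. The data layout and names are hypothetical.

```python
def coupling_correlation(condit, key_m, key_n):
    """Average the conditional dependencies over all interim keywords.

    condit: dict mapping frozenset({a, b}) to the co-occurrence dependent
    probability of keywords a and b (ASSUMED symmetric for this sketch).
    """
    def prob(a, b):
        return condit.get(frozenset((a, b)), 0.0)

    keywords = {k for pair in condit for k in pair}
    dependencies = [
        min(prob(key_m, k), prob(key_n, k))   # conditional dependency via k
        for k in keywords - {key_m, key_n}
        if prob(key_m, k) > 0 and prob(key_n, k) > 0
    ]
    # No interim keyword links the two keywords: coupling correlation is zero.
    return sum(dependencies) / len(dependencies) if dependencies else 0.0
```

With interim keywords C and D giving conditional dependencies of 0.4 and 0.6, the function returns their mean; with no shared interim keyword it returns 0.0.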

Step 460: Based on the co-occurrence dependent probability and the coupling correlation between the two keywords of each keyword pair, determine the complete correlation between the two keywords of each keyword pair, where the complete correlation between the two keywords of a keyword pair characterizes the degree of semantic association between the two keywords.

Specifically, to capture the degree of association between two keywords more accurately, the co-occurrence dependent probability and the coupling correlation between the two keywords need to be combined to determine the complete correlation between them; the higher the complete correlation between two keywords, the higher the degree of semantic association between them.

Preferably, in the embodiment of the present invention, the complete correlation between the two keywords of a keyword pair (keyword key_n and keyword key_m) can be calculated using the following formula:

Wherein α is a parameter between 0 and 1 used to adjust the respective proportions of the conditional correlation and the coupling correlation.

For example, assume α is 0.7. If the co-occurrence dependent probability between keyword A and keyword B in keyword pair 1 is 0.3 and the coupling correlation is 0.6, the complete correlation between keyword A and keyword B in keyword pair 1 is 0.7 × 0.3 + (1 − 0.7) × 0.6 = 0.39; that is, in the corresponding training sample subset, the complete correlation between keyword A and keyword B in keyword pair 1 is 0.39.
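
A minimal sketch of Step 460, assuming the complete correlation is the α-weighted combination of the co-occurrence dependent probability and the coupling correlation that the description of α suggests. The formula itself is not reproduced in the text, so the exact form, the function name and the default value of `alpha` are assumptions.

```python
def complete_correlation(co_occurrence_prob, coupling, alpha=0.7):
    """Blend the two correlations with weight alpha (ASSUMED convex combination)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * co_occurrence_prob + (1.0 - alpha) * coupling
```

Setting `alpha` to 1.0 makes the result depend only on the co-occurrence dependent probability, and setting it to 0.0 makes it depend only on the coupling correlation, which is one way to check the effect of the parameter.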

In this way, the complete correlation between the two keywords of every keyword pair contained in every information pair of every training sample subset of the training sample set is determined.

In the embodiment of the present invention, to make it easier to subsequently extract the complete correlation of each keyword pair in different training sample subsets (different types of business), the complete correlation between the two keywords of each keyword pair can be taken as one element when determining the general semantic matrix of the coupling network model.

Preferably, in the embodiment of the present invention, the element of the general semantic matrix M' corresponding to the training sample set that is determined by a keyword pair (keyword key_n and keyword key_m) can be expressed by the following formula:

M'(m, n) = R(key_m, key_n)

In the embodiment of the present invention, the general semantic matrix of the coupling network model is determined from the complete correlations of keyword pairs because this takes the association relationships between keywords into account more thoroughly and reduces the sparsity of the elements in the general semantic matrix.

Further, in the embodiment of the present invention, the training sample set is divided in advance into several training sample subsets according to the different types of business, and the complete correlation between the two keywords of a keyword pair is subsequently calculated within the training sample subset to which the keyword pair belongs. Therefore, each element of the general semantic matrix of the training sample set also has a corresponding type of business.
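
One possible in-memory layout for the general semantic matrix is sketched below: one symmetric matrix per type of business, with M'(m, n) holding the complete correlation of the keyword pair. The nested-list layout, the symmetry and the function name are assumptions made for illustration.

```python
def build_semantic_matrix(keywords, complete_corr):
    """Fill M'(m, n) = R(key_m, key_n) for one type of business.

    keywords: ordered list of keywords (defines the row and column indices).
    complete_corr: dict mapping frozenset({a, b}) to the complete correlation.
    """
    index = {keyword: i for i, keyword in enumerate(keywords)}
    size = len(keywords)
    matrix = [[0.0] * size for _ in range(size)]
    for pair, value in complete_corr.items():
        a, b = tuple(pair)
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value  # stored symmetrically (assumption)
    return matrix
```

Keeping a dictionary such as `{type_of_business: build_semantic_matrix(...)}` then attaches every element to its type of business, as the paragraph above describes.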

In the embodiment of the present invention, to verify the accuracy of the coupling network model, some of the training samples in the training sample set can be used to test the coupling network model, or unknown company information can be used to test it manually. If the test accuracy is greater than a set threshold (e.g., 99%), the coupling network model can be put into use; if the test accuracy does not meet the set threshold, more training sample sets are selected and the coupling network model is trained further until the test accuracy meets the set threshold.

At this point, a coupling network model that can be used for company information classification has been determined.

Referring to Fig. 5, in the embodiment of the present invention, for acquired company information whose type of business is unknown (referred to as company information to be classified), the type of business corresponding to the company information to be classified can be determined based on the following flow:

Step 500: Extract, from the acquired company information to be classified, a number of words that satisfy a set rule, and define every two words as one word pair.

Specifically, a number of words can be extracted from the company profile information of the company information to be classified based on clause segmentation and semantic mining, and the weight value corresponding to each word is calculated based on variance distribution. The words that satisfy the set rule (e.g., a weight value greater than a set threshold) are then selected from all the extracted words, and every two of them are defined as one word pair.
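
Assuming the word weights have already been calculated, the filtering and pairing of Step 500 can be sketched as follows; the function name, the example weights and the threshold are hypothetical.

```python
from itertools import combinations


def build_word_pairs(word_weights, threshold):
    """Keep words whose weight exceeds the threshold and pair every two of them.

    word_weights: dict mapping each extracted word to its weight value.
    """
    kept = sorted(word for word, weight in word_weights.items() if weight > threshold)
    return list(combinations(kept, 2))
```

Three words above the threshold produce three unordered word pairs; words at or below the threshold take no part in any pair.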

Step 510: Based on the preset coupling network model, determine the complete correlation corresponding to each word pair in each preset type of business, where the complete correlation characterizes the degree of semantic association between two words.

Specifically, based on the preset coupling network model, the complete correlation of the keyword pair corresponding to each word pair in each type of business is looked up in the general semantic matrix, according to the type of business corresponding to each element.

Step 520: Based on the complete correlation corresponding to each word pair in each type of business, determine the coupling probability that the word pairs belong to each type of business, and determine the type of business with the largest coupling probability as the type of business of the company information to be classified at the current enterprise level.

Preferably, in the embodiment of the present invention, the probability that the company information to be classified belongs to type of business C is calculated using the following formula:

Wherein word key_i and word key_h form each word pair contained in the company information to be classified, i and h being variables. The formula uses the class-conditional probability of each word pair of the company information to be classified under type of business C. P(c) is the class prior probability of type of business C, that is, the ratio of the number of company information items under type of business C in the training sample set to the number of all company information items in the training sample set. The formula also uses the sum of the probabilities with which each word pair of the company information to be classified occurs in the company information items of the training sample set, which is in general a constant.

Specifically, when the class-conditional probability is calculated, taking two groups of word pairs as an example, the complete correlation corresponding to the first group of word pairs under type of business C is extracted from the general semantic matrix, and the complete correlation corresponding to the second group of word pairs under type of business C is likewise extracted from the general semantic matrix.

However, in general, if the first group of word pairs or the second group of word pairs does not exist under type of business C, the corresponding complete correlation is zero. In the embodiment of the present invention, to prevent the product from becoming zero because one factor is zero, a constant, e.g., 1, is added in the actual calculation to the complete correlation extracted for each group of keyword pairs.

For example, assume type of business C is "investment and wealth management", and that the determined general semantic matrix contains, under the "investment and wealth management" type of business, the keyword pairs "finance" and "investment", "fund" and "securities", "stock" and "insurance", and "treasury bonds" and "futures", whose complete correlations are 0.6, 0.8, 0.3 and 0.4 respectively.

Suppose three groups of word pairs are extracted from the company information to be classified: the first group is "finance" and "investment", the second group is "fund" and "securities", and the third group is "animation" and "comics". Then, in type of business C, the complete correlation corresponding to the first group of word pairs is 0.6 and that corresponding to the second group is 0.8. Since the third group of word pairs does not exist in type of business C, the complete correlation of the third group of word pairs is 0.

Further assuming the constant is set to 1, the complete correlations corresponding to the three groups of word pairs of the company information to be classified are 0.6 + 1, 0.8 + 1 and 0 + 1, respectively.

Clearly, a corresponding coupling probability is obtained for the company information to be classified for each type of business; the type of business with the largest coupling probability is selected from these coupling probabilities as the type of business corresponding to the company information to be classified.

For example, if there are 3 types of business and the coupling probabilities that company information A to be classified belongs to type of business 1, type of business 2 and type of business 3 are 0.35, 0.73 and 0.96 respectively, then type of business 3 is determined as the type of business of company information A to be classified.
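
The scoring and selection of Step 520 can be sketched as a naive-Bayes-style product. The sketch assumes that the smoothed complete correlation of each word pair serves as its class-conditional factor and that the constant denominator is dropped, so the scores are unnormalized; the model layout, the priors and the function names are hypothetical.

```python
from math import prod


def coupling_score(word_pairs, semantic_matrix, prior, smoothing=1.0):
    """Unnormalized score of one type of business for the given word pairs.

    semantic_matrix: dict mapping frozenset({a, b}) to the complete correlation.
    A pair that is absent from this type of business contributes 0 + smoothing.
    """
    factors = (semantic_matrix.get(frozenset(pair), 0.0) + smoothing for pair in word_pairs)
    return prior * prod(factors)


def classify(word_pairs, model):
    """model: dict mapping type of business to (prior, semantic_matrix)."""
    scores = {
        business: coupling_score(word_pairs, matrix, prior)
        for business, (prior, matrix) in model.items()
    }
    best = max(scores, key=scores.get)  # type with the largest score
    return best, scores
```

For the three word pairs of the example, the factors under "investment and wealth management" are 0.6 + 1, 0.8 + 1 and 0 + 1; the type of business with the largest resulting score is returned as the classification at the current enterprise level.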

However, since the embodiment of the present invention is based on a multi-level enterprise framework, the type of business corresponding to the company information to be classified at each enterprise level can be determined separately for the types of business of the different enterprise levels. Then, based on a preset multi-level screening rule, one type of business is selected from the types of business of the different enterprise levels as the target type of business of the company information to be classified.

The multi-level screening rule may be the classification backstepping method or the classification forward method.

Regarding the classification backstepping method, taking Fig. 1 as an example: for company information 1 to be classified, the type of business "house property and home decoration" corresponding to the first enterprise level is determined first, followed by the type of business "letting agency" corresponding to the second enterprise level; clearly, "letting agency" is a child node of "house property and home decoration". Continuing the derivation, the type of business "furniture" corresponding to the third enterprise level is determined; clearly, "furniture" is not a child node of "letting agency", that is, the type of business corresponding to the third enterprise level does not belong to the type of business corresponding to the second enterprise level. The type of business "letting agency" corresponding to the second enterprise level can therefore be determined as the target type of business of company information 1 to be classified.

Regarding the classification forward method, still taking Fig. 1 as an example: for company information 1 to be classified, the type of business "furniture" corresponding to the third enterprise level is determined first, followed by the type of business "letting agency" corresponding to the second enterprise level; clearly, "furniture" is not a child node of "letting agency", so the target type of business of company information 1 to be classified is certainly not "furniture". Continuing the derivation, the type of business corresponding to the enterprise level above the second enterprise level (that is, the first enterprise level) is determined to be "house property and home decoration"; clearly, "letting agency" is a child node of "house property and home decoration". The type of business "letting agency" corresponding to the second enterprise level is therefore determined as the target type of business of company information 1 to be classified.

Clearly, both the classification forward method and the classification backstepping method can reduce the classification error rate.
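
Both screening rules can be sketched against a child-to-parent map of the type-of-business tree. Fig. 1 is not reproduced in the text, so the tree fragment below (including the "home furnishing" node) is hypothetical, as are the function names; only the walking order follows the two descriptions above.

```python
# Hypothetical fragment of the type-of-business tree: child -> parent.
PARENT = {
    "letting agency": "house property and home decoration",
    "home furnishing": "house property and home decoration",
    "furniture": "home furnishing",
}


def target_by_backstepping(level_predictions, parent=PARENT):
    """Start at level 1 and descend while each prediction is a child of the one above."""
    target = level_predictions[0]
    for predicted in level_predictions[1:]:
        if parent.get(predicted) != target:
            break  # this level contradicts the level above; keep the last consistent type
        target = predicted
    return target


def target_by_forward(level_predictions, parent=PARENT):
    """Start at the deepest level and ascend until a prediction fits the level above."""
    for depth in range(len(level_predictions) - 1, 0, -1):
        if parent.get(level_predictions[depth]) == level_predictions[depth - 1]:
            return level_predictions[depth]
    return level_predictions[0]
```

For the predictions "house property and home decoration", "letting agency" and "furniture" at levels 1 to 3, both functions settle on "letting agency", in line with the two worked examples.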

Based on the above embodiments, and referring to Fig. 6, in the embodiment of the present invention the company information classification device includes at least a data acquisition unit 61, a processing unit 62 and a classification unit 63, wherein:

the data acquisition unit 61 is configured to acquire company information to be classified, extract from the company information to be classified a number of words that satisfy a set rule, and define every two words as one word pair;

the processing unit 62 is configured to determine, based on a preset coupling network model, the complete correlation corresponding to each word pair in each preset type of business, where the complete correlation characterizes the degree of semantic association between two words and the enterprise level of each type of business is the same;

the classification unit 63 is configured to determine, based on the complete correlation corresponding to each word pair in each type of business, the coupling probability that the word pairs belong to each type of business, and to determine the type of business with the largest coupling probability as the type of business of the company information to be classified at the current enterprise level.

Optionally, the device further includes a training unit 64, the training unit 64 being configured to:

before the company information to be classified is acquired, perform the following operations:

Optionally, when acquiring a number of company information items and selecting from them the company information items that satisfy a setting screening rule to form a training sample set, where the type of business corresponding to each company information item in the training sample set has been determined, the training unit 64 is configured to:

Optionally, when selecting a target canonical clause from the corresponding at least one canonical clause based on a preset rule and determining the corresponding type of business based on the target canonical clause, the training unit 64 is configured to:

Optionally, when defining every two keywords in the keyword set as one keyword pair and calculating the complete correlation between the two keywords of each keyword pair, the training unit 64 is configured to:

Optionally, when determining the coupling correlation between the two keywords based on the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword, the training unit 64 is configured to:

Optionally, when determining the conditional dependency between the two keywords and the at least one interim keyword based on the co-occurrence dependent probabilities between each of the two keywords and the at least one interim keyword, the training unit 64 is configured to:

for each interim keyword, perform the following operation:

Optionally, when determining the coupling correlation between the two keywords based on the conditional dependency between the two keywords and the at least one interim keyword, the training unit 64 is configured to:

Optionally, when determining the coupling probability that the word pairs belong to each type of business based on the complete correlation corresponding to each word pair in each type of business, the classification unit 63 is configured to:

Optionally, the device further includes a multi-level classification unit 65, the multi-level classification unit 65 being configured to:

In summary, in the embodiment of the present invention, acquisition company information to be sorted is first passed through, then, from the to be sorted of acquisition Some words for meeting setting rule are extracted in company information, and each two word is defined as a word pair, then, are based on Default coupling network model, determines each word to the complete correlation in each default type of business, wherein, Complete correlation is used to characterize the semantic association degree between two words, and the enterprise level of each type of business is identical, finally, Based on each word to the corresponding complete correlation in each above-mentioned type of business, determine each word to belonging to each The coupling probability of the type of business is planted, and the corresponding type of business of maximum coupling probability is defined as company information to be sorted current The type of business under enterprise level, so, for the company information to be sorted directly obtained, just can be based on company information to be sorted Semantic association degree between each word of middle extraction, determines the corresponding type of business of company information to be sorted, improves classification Accuracy, is additionally, since without any artificial operation, also improves treatment effeciency, and then improve customer experience.

Further, based on a preset multi-level screening rule, one type of business is selected from the types of business corresponding to the respective enterprise levels as the target type of business of the company information to be classified. In this way, the type of business that best matches the actual needs of the company information to be classified can be selected from its types of business at different enterprise levels, which further improves classification accuracy.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A company information classification method, characterized by comprising:

obtaining company information to be classified, extracting from the company information to be classified several words that meet a set rule, and defining every two words as a word pair;

determining, based on a preset coupling network model, the complete correlation corresponding to each word pair in each preset type of business, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the types of business are at the same enterprise level;

determining, based on the complete correlation corresponding to each word pair in each type of business, the coupling probability that each word pair belongs to each type of business, and determining the type of business corresponding to the maximum coupling probability as the type of business of the company information to be classified at the current enterprise level.

2. The method of claim 1, characterized in that, before the company information to be classified is obtained, the method further comprises:

obtaining several pieces of company information, and selecting from them the pieces of company information that meet a set screening rule to form a training sample set, wherein the corresponding type of business has been determined for each piece of company information in the training sample set;

according to the type of business corresponding to each piece of company information in the training sample set, defining the pieces of company information that belong to the same type of business as one training sample subset, wherein each training sample subset corresponds to one type of business, and the types of business corresponding to the training sample subsets are all at the same enterprise level;

defining every two keywords in the keyword set as a keyword pair, and calculating the complete correlation between the two keywords of each keyword pair.

3. The method of claim 2, characterized in that obtaining several pieces of company information, and selecting from them the pieces of company information that meet a set screening rule to form a training sample set, wherein the corresponding type of business has been determined for each piece of company information in the training sample set, comprises:

crawling several pieces of company information using a preset web crawler, extracting from each crawled piece of company information the enterprise name and company profile information it contains to form a corresponding information pair, and performing the following operations for each information pair:

performing semantic mining on each simple sentence, extracting the subject-verb-object (SVO) components contained in each simple sentence, and constructing, based on those SVO components, the canonical clauses of each simple sentence that conform to the trade classification rule;

selecting the information pairs for which at least one canonical clause has been determined to exist to form the training sample set, and performing the following operations for each information pair in the training sample set: selecting, based on a preset rule, a target canonical clause from the corresponding at least one canonical clause, and determining the corresponding type of business based on the target canonical clause.

4. The method of claim 3, characterized in that selecting, based on a preset rule, a target canonical clause from the corresponding at least one canonical clause, and determining the corresponding type of business based on the target canonical clause, comprises:

randomly selecting one canonical clause from the at least one canonical clause as the target canonical clause, and recalling the corresponding information pair into the corresponding type of business based on the target canonical clause.

5. The method of claim 2, 3 or 4, characterized in that defining every two keywords in the keyword set as a keyword pair, and calculating the complete correlation between the two keywords of each keyword pair, comprises:

calculating, based on a variance distribution, the weight value of each keyword in the keyword set within the corresponding company profile information, defining every two keywords in the keyword set as a keyword pair, and determining, based on the weight values corresponding to the two keywords of each keyword pair, the co-occurrence correlation between the two keywords of that keyword pair, wherein the co-occurrence correlation characterizes the degree of association with which two keywords occur together;

determining, based on the co-occurrence correlation between the two keywords of each keyword pair, the co-occurrence dependent probability between the two keywords of that keyword pair, wherein the co-occurrence dependent probability characterizes the ratio of the co-occurrence correlation between the two keywords to the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;

performing the following operation for each keyword pair: when it is determined that at least one intermediate keyword exists such that the co-occurrence dependent probabilities between each of the two keywords of the keyword pair and the at least one intermediate keyword are all greater than zero, determining the coupling correlation between the two keywords based on the co-occurrence dependent probabilities between each of the two keywords and the at least one intermediate keyword;

determining, based on the co-occurrence dependent probability and the coupling correlation between the two keywords of each keyword pair, the complete correlation between the two keywords of that keyword pair.
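The normalization recited above, in which the co-occurrence dependent probability is the share of one keyword pair's co-occurrence correlation in the total over all keyword pairs of the same keyword set, can be illustrated as below. This is a minimal sketch: the function name, the dictionary layout, and the example values are hypothetical, the handling of an all-zero keyword set is an assumption, and the derivation of the co-occurrence correlation from the keyword weight values is not shown.

```python
def cooccurrence_probability(cooccurrence):
    """Normalize pairwise co-occurrence correlations into probabilities.

    `cooccurrence` maps a keyword pair (a 2-tuple) to its co-occurrence
    correlation within one keyword set.
    """
    total = sum(cooccurrence.values())
    if total == 0:
        # No co-occurrence at all: every probability is zero (assumption).
        return {pair: 0.0 for pair in cooccurrence}
    return {pair: value / total for pair, value in cooccurrence.items()}


# Hypothetical co-occurrence correlations for one keyword set.
probabilities = cooccurrence_probability(
    {("loan", "bank"): 3.0, ("loan", "app"): 1.0}
)
```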

6. The method of claim 5, characterized in that determining the coupling correlation between the two keywords based on the co-occurrence dependent probabilities between each of the two keywords and the at least one intermediate keyword comprises:

determining, based on the co-occurrence dependent probabilities between each of the two keywords and the at least one intermediate keyword, the conditional dependencies between the two keywords and the at least one intermediate keyword, wherein the existence of a conditional dependency between two keywords and an intermediate keyword indicates that, conditioned on that intermediate keyword, the two keywords are correlated;

determining the coupling correlation between the two keywords based on the conditional dependencies between the two keywords and the at least one intermediate keyword.

7. The method of claim 6, characterized in that determining the conditional dependencies between the two keywords and the at least one intermediate keyword based on the co-occurrence dependent probabilities between each of the two keywords and the at least one intermediate keyword comprises:

performing the following operation for each intermediate keyword:

taking the smaller of the co-occurrence dependent probabilities between each of the two keywords and the intermediate keyword as the conditional dependency between the two keywords and the intermediate keyword.
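The minimum rule recited above can be sketched as follows. This is an illustration only: the function name `conditional_dependencies`, the use of a `frozenset` as an order-independent pair key, treating a missing pair as probability zero, and the example values are all assumptions made for the sketch.

```python
def conditional_dependencies(keyword_a, keyword_b, probability):
    """Conditional dependency of a keyword pair on each intermediate keyword.

    `probability` maps a frozenset of two keywords to their co-occurrence
    dependent probability; missing pairs count as zero (assumption).
    """
    def p(x, y):
        return probability.get(frozenset((x, y)), 0.0)

    keywords = {k for pair in probability for k in pair}
    result = {}
    for middle in sorted(keywords - {keyword_a, keyword_b}):
        p_a, p_b = p(keyword_a, middle), p(keyword_b, middle)
        if p_a > 0 and p_b > 0:  # the intermediate keyword links both
            result[middle] = min(p_a, p_b)  # take the smaller value
    return result


# Hypothetical co-occurrence dependent probabilities.
probability = {
    frozenset(("loan", "bank")): 0.5,
    frozenset(("fund", "bank")): 0.25,
    frozenset(("loan", "app")): 0.25,
}
dependencies = conditional_dependencies("loan", "fund", probability)
```

In this example only "bank" qualifies as an intermediate keyword, because "app" has no co-occurrence with "fund".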

8. The method of claim 6, characterized in that determining the coupling correlation between the two keywords based on the conditional dependencies between the two keywords and the at least one intermediate keyword comprises:

computing a weighted average of the conditional dependencies between the two keywords and each intermediate keyword of the at least one intermediate keyword, and determining the averaged result as the coupling correlation between the two keywords.
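The weighted average recited above can be illustrated with the short sketch below. The source does not state how the weights are chosen, so equal weights are assumed by default; the function name `coupling_correlation`, the `weights` parameter, and the example values are hypothetical.

```python
def coupling_correlation(dependencies, weights=None):
    """Weighted average of conditional dependencies over intermediate keywords.

    `dependencies` maps an intermediate keyword to the conditional dependency
    of the keyword pair on it. `weights` maps the same keywords to weights;
    equal weights are used when it is omitted (assumption).
    """
    if not dependencies:
        return 0.0  # no intermediate keyword, no coupling (assumption)
    if weights is None:
        weights = {middle: 1.0 for middle in dependencies}
    total_weight = sum(weights[middle] for middle in dependencies)
    weighted_sum = sum(
        weights[middle] * value for middle, value in dependencies.items()
    )
    return weighted_sum / total_weight


# Hypothetical conditional dependencies on two intermediate keywords.
equal = coupling_correlation({"bank": 0.25, "app": 0.75})
skewed = coupling_correlation(
    {"bank": 0.25, "app": 0.75}, weights={"bank": 3.0, "app": 1.0}
)
```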

9. The method of claim 1, characterized in that determining, based on the complete correlation corresponding to each word pair in each type of business, the coupling probability that each word pair belongs to each type of business comprises:

determining, based on the complete correlation corresponding to each word pair in each type of business, the class-conditional probability of each word pair in each type of business;

determining, based on the determined class-conditional probability of each word pair in each type of business and the prior probability of each type of business, the coupling probability that each word pair belongs to each type of business.
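The text above does not state how the class-conditional probability and the prior probability are combined. One common reading is Bayes' rule, which the sketch below assumes; the function name `coupling_probabilities`, the normalization step, and the example values are hypothetical and are not taken from the source.

```python
def coupling_probabilities(class_conditional, prior):
    """Posterior-style coupling probability of one word pair per type.

    Assumes Bayes' rule: P(type | pair) is proportional to
    P(pair | type) * P(type). The source does not give the exact formula.
    """
    joint = {t: class_conditional[t] * prior[t] for t in prior}
    evidence = sum(joint.values())
    if evidence == 0:
        # The pair is unsupported by every type (assumption: all zeros).
        return {t: 0.0 for t in prior}
    return {t: value / evidence for t, value in joint.items()}


# Hypothetical class-conditional probabilities and priors for one word pair.
result = coupling_probabilities(
    class_conditional={"finance": 0.5, "education": 0.25},
    prior={"finance": 0.5, "education": 0.5},
)
```

Under this assumption the coupling probabilities of one word pair sum to one across the types of business, so the largest value directly identifies the most likely type for that pair.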

10. The method of claim 1, characterized in that, after the type of business corresponding to the maximum coupling probability is determined as the type of business of the company information to be classified at the current enterprise level, the method further comprises:

selecting, based on a preset multi-level screening rule, one type of business from the types of business at the different enterprise levels as the target type of business of the company information to be classified.

11. A company information classification device, characterized by comprising:

a data acquisition unit, configured to obtain company information to be classified, extract from the company information to be classified several words that meet a set rule, and define every two words as a word pair;

a processing unit, configured to determine, based on a preset coupling network model, the complete correlation corresponding to each word pair in each preset type of business, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the types of business are at the same enterprise level;

a classification unit, configured to determine, based on the complete correlation corresponding to each word pair in each type of business, the coupling probability that each word pair belongs to each type of business, and to determine the type of business corresponding to the maximum coupling probability as the type of business of the company information to be classified at the current enterprise level.

12. The device of claim 11, characterized by further comprising a training unit, the training unit being configured to:

before the company information to be classified is obtained, perform the following operations:

13. The device of claim 12, characterized in that, when obtaining several pieces of company information and selecting from them the pieces of company information that meet a set screening rule to form a training sample set, wherein the corresponding type of business has been determined for each piece of company information in the training sample set, the training unit is configured to:

14. The device of claim 13, characterized in that, when selecting, based on a preset rule, a target canonical clause from the corresponding at least one canonical clause and determining the corresponding type of business based on the target canonical clause, the training unit is configured to:

15. The device of claim 12, 13 or 14, characterized in that, when defining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair, the training unit is configured to:

16. The device of claim 15, characterized in that, when determining the coupling correlation between the two keywords based on the co-occurrence dependent probabilities between each of the two keywords and the at least one intermediate keyword, the training unit is configured to:

17. The device of claim 16, characterized in that, when determining the conditional dependencies between the two keywords and the at least one intermediate keyword based on the co-occurrence dependent probabilities between each of the two keywords and the at least one intermediate keyword, the training unit is configured to:

for each intermediate keyword, perform the following operation:

18. The device of claim 16, characterized in that, when determining the coupling correlation between the two keywords based on the conditional dependencies between the two keywords and the at least one intermediate keyword, the training unit is configured to:

19. The device of claim 11, characterized in that, when determining, based on the complete correlation corresponding to each word pair in each type of business, the coupling probability that each word pair belongs to each type of business, the classification unit is configured to:

20. The device of claim 11, characterized by further comprising a multi-level classification unit, the multi-level classification unit being configured to:

after the type of business corresponding to the maximum coupling probability is determined as the type of business of the company information to be classified at the current enterprise level, perform the following operation: