CN107193915A - Enterprise information classification method and device - Google Patents
- Publication number
- CN107193915A (application number CN201710339393.2)
- Authority
- CN
- China
- Prior art keywords
- keyword
- business
- type
- keywords
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of data analysis, and in particular to an enterprise information classification method and device, intended to allow massive amounts of enterprise information to be entered promptly and classified quickly and correctly. In the method, several words that satisfy a set rule are extracted from the acquired enterprise information to be classified, and every two words are defined as a word pair. Then, based on a preset coupling network model, the complete correlation of each word pair under each preset enterprise type is determined. From this, the coupling probability that each word pair belongs to each enterprise type is determined, and the enterprise type with the maximum coupling probability is taken as the enterprise type of the enterprise information to be classified. In this way, the enterprise type of directly acquired enterprise information can be determined from the degree of semantic association between its words, which improves classification accuracy; and because no manual operation is required, processing efficiency and, in turn, user experience are improved.
Description
Technical field
The present invention relates to the field of data analysis, and in particular to an enterprise information classification method and device.
Background
The rapid development of Internet technology has driven explosive growth in the number of technology, media and telecom (TMT) enterprises. To let users quickly look up information about an enterprise of interest among massive amounts of enterprise information, the prior art first enters the enterprise information piece by piece by hand and then manually classifies all of the entered information to obtain classification results. Based on these results, a user can quickly locate the enterprise of interest and then obtain its related information.
Clearly, data entry and classification of large volumes of enterprise information are at present still done manually. This not only prevents enterprise information from being updated in time and lengthens processing, but also easily leads to inaccurate classification, which further harms the user experience.
In view of this, a new enterprise information classification method needs to be designed to overcome the above drawbacks.
Summary of the invention
Embodiments of the present invention provide an enterprise information classification method and device, so that massive amounts of enterprise information can be entered promptly and classified quickly and correctly.
The specific technical solutions provided by the embodiments of the present invention are as follows:
An enterprise information classification method, comprising:
Acquiring enterprise information to be classified, extracting from it several words that satisfy a set rule, and defining every two words as a word pair;
Based on a preset coupling network model, determining the complete correlation of each word pair under each preset enterprise type, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the enterprise types are at the same enterprise level;
Based on the complete correlation of each word pair under each enterprise type, determining the coupling probability that each word pair belongs to each enterprise type, and taking the enterprise type with the maximum coupling probability as the enterprise type of the enterprise information to be classified at the current enterprise level.
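The first and last steps of the method above can be sketched as follows. This is only an illustration: the function names `word_pairs` and `classify`, the sample words, and the probability values are hypothetical and do not come from the patent, and the coupling network model that would produce the probabilities is not modelled here.

```python
from itertools import combinations


def word_pairs(words):
    """Define every two distinct extracted words as one (unordered) word pair."""
    unique = sorted(set(words))
    return list(combinations(unique, 2))


def classify(coupling_probability):
    """Return the enterprise type with the maximum coupling probability."""
    return max(coupling_probability, key=coupling_probability.get)


# Hypothetical words extracted from one piece of enterprise information.
pairs = word_pairs(["rental", "housing", "platform"])

# Hypothetical coupling probabilities for three enterprise types at one level.
scores = {
    "housing agency": 0.62,
    "decoration design": 0.27,
    "property services": 0.11,
}
predicted = classify(scores)
```

With n distinct words the method yields n(n-1)/2 word pairs, so the cost of the later correlation lookups grows quadratically with the number of extracted words.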
Optionally, before the enterprise information to be classified is acquired, the method further comprises:
Acquiring several pieces of enterprise information, and screening out from them the pieces that satisfy a set screening rule to form a training sample set, wherein the enterprise type of every piece of enterprise information in the training sample set has been determined;
According to the enterprise type of each piece of enterprise information in the training sample set, defining the pieces that belong to the same enterprise type as one training sample subset, wherein each training sample subset corresponds to one enterprise type, and the enterprise types of all training sample subsets are at the same enterprise level;
For the enterprise information of each training sample subset, performing the following operations:
Extracting keywords whose number equals a set number or falls within a set number range, to form a keyword set;
Defining every two keywords in the keyword set as a keyword pair, and calculating the complete correlation between the two keywords of each keyword pair.
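The partitioning of the training sample set into per-type subsets, followed by keyword pairing, might look like the sketch below. The `(profile, enterprise_type)` tuple layout and both function names are assumptions made for illustration; keyword extraction itself is not shown.

```python
from collections import defaultdict
from itertools import combinations


def split_by_type(samples):
    """Group labelled samples so that each subset corresponds to one enterprise type."""
    subsets = defaultdict(list)
    for profile, enterprise_type in samples:
        subsets[enterprise_type].append(profile)
    return dict(subsets)


def keyword_pairs(keywords):
    """Define every two distinct keywords of a keyword set as one keyword pair."""
    return list(combinations(sorted(set(keywords)), 2))


# Hypothetical labelled samples; all labels are assumed to be at the same level.
samples = [("profile 1", "media"), ("profile 2", "game"), ("profile 3", "media")]
subsets = split_by_type(samples)
pairs = keyword_pairs(["news", "video", "news", "advertising"])
```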
Optionally, acquiring several pieces of enterprise information and screening out from them the pieces that satisfy a set screening rule to form a training sample set, wherein the enterprise type of every piece of enterprise information in the training sample set has been determined, comprises:
Crawling several pieces of enterprise information with a preset web crawler device, extracting the enterprise name and enterprise profile information contained in each crawled piece to form a corresponding information pair, and performing the following operations for each information pair:
Using clause splitting, extracting the simple sentences contained in the enterprise profile information of the information pair;
Performing semantic mining on each simple sentence to extract its subject-verb-object components, and, based on those components, constructing for each simple sentence a regular sentence pattern that conforms to the industry classification rules;
Screening out every information pair for which at least one regular sentence pattern exists, to form the training sample set, and performing the following operation for each information pair in the training sample set: based on a preset rule, selecting a target regular sentence pattern from the corresponding at least one regular sentence pattern, and determining the corresponding enterprise type based on the target regular sentence pattern.
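A rough sketch of the clause splitting and pattern matching described above is given below. It is a deliberate simplification: the patent extracts subject-verb-object components through semantic mining, whereas this sketch substitutes a single hand-written regular expression of the form "<subject> is a <object>". The pattern, the function names and the sample profile are all hypothetical.

```python
import re

# Hypothetical stand-in for a regular sentence pattern: "<subject> is a/an <object>".
PATTERN = re.compile(r"^(?P<subject>.+?) is an? (?P<object>.+)$")


def split_sentences(profile):
    """Split a profile into simple sentences on common sentence punctuation."""
    parts = re.split(r"[.;!?。；！？]", profile)
    return [part.strip() for part in parts if part.strip()]


def regular_patterns(profile):
    """Return (subject, object) for every simple sentence that matches the pattern."""
    matches = []
    for sentence in split_sentences(profile):
        found = PATTERN.match(sentence)
        if found:
            matches.append((found.group("subject"), found.group("object")))
    return matches


profile = "Founded in 2015. Acme Homes is a house rental platform; it serves ten cities."
sentences = split_sentences(profile)
patterns = regular_patterns(profile)
# An information pair is kept for training only if at least one pattern matched.
keep_for_training = len(patterns) > 0
```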
Optionally, selecting, based on a preset rule, a target regular sentence pattern from the corresponding at least one regular sentence pattern and determining the corresponding enterprise type based on the target regular sentence pattern comprises:
According to the order in which the at least one regular sentence pattern appears in the enterprise profile information, taking the earliest regular sentence pattern as the target regular sentence pattern, and, based on the target regular sentence pattern, assigning (recalling) the corresponding information pair to the corresponding enterprise type; or,
Randomly selecting one regular sentence pattern from the at least one regular sentence pattern as the target regular sentence pattern, and, based on the target regular sentence pattern, assigning (recalling) the corresponding information pair to the corresponding enterprise type.
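The two alternative selection rules can be expressed in a few lines. In this sketch the function name `select_target`, the `strategy` parameter and the optional `rng` argument are illustrative assumptions, and the patterns are assumed to be listed in the order in which they appear in the profile.

```python
import random


def select_target(patterns, strategy="first", rng=None):
    """Pick the target pattern: the earliest one, or one chosen at random."""
    if not patterns:
        raise ValueError("at least one regular sentence pattern is required")
    if strategy == "first":
        return patterns[0]
    if strategy == "random":
        return (rng or random).choice(patterns)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical patterns, ordered by their position in the profile.
patterns = ["house rental platform", "property consulting service"]
first_choice = select_target(patterns)
random_choice = select_target(patterns, strategy="random", rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the random variant reproducible, which can help when regenerating a training set.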
Optionally, defining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair comprises:
Based on a variance distribution, calculating the weight of each keyword of the keyword set in the corresponding enterprise profile information, defining every two keywords in the keyword set as a keyword pair, and, based on the weights of the two keywords of each keyword pair, determining the co-occurrence correlation between the two keywords, wherein the co-occurrence correlation characterizes how strongly the two keywords are associated through appearing together;
Based on the co-occurrence correlation between the two keywords of each keyword pair, determining the co-occurrence correlation probability between the two keywords, wherein the co-occurrence correlation probability is the proportion that the co-occurrence correlation between the two keywords accounts for in the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;
For each keyword pair, performing the following operation: when it is determined that there exists at least one intermediate keyword such that the co-occurrence correlation probabilities between each of the two keywords of the pair and the at least one intermediate keyword are all greater than zero, determining the coupling correlation between the two keywords based on those co-occurrence correlation probabilities;
Based on the co-occurrence correlation probability and the coupling correlation between the two keywords of each keyword pair, determining the complete correlation between the two keywords.
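The normalization step, in which a co-occurrence correlation is turned into a co-occurrence correlation probability, can be sketched as below. The sketch assumes that "the co-occurrence correlation of all keyword pairs" means the sum over all pairs in the keyword set; the input values are hypothetical, and the weight computation and the final combination into a complete correlation are not shown because this passage does not give their formulas.

```python
def cooccurrence_probabilities(correlations):
    """Divide each pair's co-occurrence correlation by the total over all pairs."""
    total = sum(correlations.values())
    if total == 0:
        # No pair co-occurs at all, so every probability is zero.
        return {pair: 0.0 for pair in correlations}
    return {pair: value / total for pair, value in correlations.items()}


# Hypothetical co-occurrence correlations for the pairs of a three-keyword set.
correlations = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): 1.0}
probabilities = cooccurrence_probabilities(correlations)
```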
Optionally, determining the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword comprises:
Based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, determining the conditional correlation between the two keywords and the at least one intermediate keyword, wherein a conditional correlation between two keywords and one intermediate keyword indicates that, with that intermediate keyword as the condition, the two keywords are associated;
Based on the conditional correlation between the two keywords and the at least one intermediate keyword, determining the coupling correlation between the two keywords.
Optionally, determining the conditional correlation between the two keywords and the at least one intermediate keyword based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword comprises:
For each intermediate keyword, performing the following operation:
Taking the smaller of the two co-occurrence correlation probabilities between each of the two keywords and the intermediate keyword as the conditional correlation between the two keywords and the intermediate keyword.
Optionally, determining the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword comprises:
Computing a weighted average of the conditional correlations between the two keywords and each of the at least one intermediate keyword, and taking the result as the coupling correlation between the two keywords.
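Taken together, the conditional correlation (the smaller of two probabilities) and the coupling correlation (a weighted average over intermediate keywords) might be computed as in the following sketch. The `frozenset` keying, the function name and the default of equal weights are assumptions; the passage does not say how the weights are chosen, and any weights passed in are assumed to be positive.

```python
def coupling_correlation(prob, first, second, intermediates, weights=None):
    """Weighted average, over qualifying intermediate keywords, of the smaller
    of the two co-occurrence correlation probabilities.

    prob maps frozenset({a, b}) to the co-occurrence correlation probability.
    """
    def p(a, b):
        return prob.get(frozenset((a, b)), 0.0)

    conditionals, used_weights = [], []
    for index, mid in enumerate(intermediates):
        p_first, p_second = p(first, mid), p(second, mid)
        # An intermediate keyword counts only if both probabilities exceed zero.
        if p_first > 0 and p_second > 0:
            conditionals.append(min(p_first, p_second))
            used_weights.append(1.0 if weights is None else weights[index])
    if not conditionals:
        return 0.0
    weighted_sum = sum(w * c for w, c in zip(used_weights, conditionals))
    return weighted_sum / sum(used_weights)


# Hypothetical probabilities: "m3" is linked to "a" only, so it is skipped.
prob = {
    frozenset(("a", "m1")): 0.25, frozenset(("b", "m1")): 0.5,
    frozenset(("a", "m2")): 0.5, frozenset(("b", "m2")): 0.125,
    frozenset(("a", "m3")): 0.5,
}
coupling = coupling_correlation(prob, "a", "b", ["m1", "m2", "m3"])
```

Taking the minimum means an intermediate keyword contributes only as much as its weaker link to the pair, so a single strong link cannot by itself produce a high coupling correlation.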
Optionally, determining, based on the complete correlation of each word pair under each enterprise type, the coupling probability that each word pair belongs to each enterprise type comprises:
Based on the complete correlation of each word pair under each enterprise type, determining the class-conditional probability of each word pair under each enterprise type;
Based on the determined class-conditional probability of each word pair under each enterprise type and the prior probability of each enterprise type, determining the coupling probability that each word pair belongs to each enterprise type.
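One plausible reading of this step is Bayes' rule: the coupling probability of a word pair for a type is proportional to the class-conditional probability times the prior, normalized over all types. The sketch below follows that reading, which is an assumption rather than a formula stated in this passage; the function name and the numbers are hypothetical.

```python
def coupling_probabilities(class_conditional, prior):
    """Combine class-conditional probabilities and priors, then normalize."""
    joint = {t: class_conditional[t] * prior[t] for t in prior}
    total = sum(joint.values())
    if total == 0:
        return {t: 0.0 for t in joint}
    return {t: value / total for t, value in joint.items()}


# Hypothetical values for one word pair and two enterprise types.
class_conditional = {"housing agency": 0.5, "decoration design": 0.25}
prior = {"housing agency": 0.5, "decoration design": 0.5}
posterior = coupling_probabilities(class_conditional, prior)
best_type = max(posterior, key=posterior.get)
```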
Optionally, after the enterprise type with the maximum coupling probability is taken as the enterprise type of the enterprise information to be classified at the current enterprise level, the method further comprises:
Determining the enterprise type of the enterprise information to be classified at each of the preset different enterprise levels;
Based on a preset multi-level screening rule, selecting one enterprise type from the enterprise types at the different enterprise levels as the target enterprise type of the enterprise information to be classified.
An enterprise information classification device, comprising:
A data acquisition unit, configured to acquire enterprise information to be classified, extract from it several words that satisfy a set rule, and define every two words as a word pair;
A processing unit, configured to determine, based on a preset coupling network model, the complete correlation of each word pair under each preset enterprise type, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the enterprise types are at the same enterprise level;
A classification unit, configured to determine, based on the complete correlation of each word pair under each enterprise type, the coupling probability that each word pair belongs to each enterprise type, and to take the enterprise type with the maximum coupling probability as the enterprise type of the enterprise information to be classified at the current enterprise level.
Optionally, the device further comprises a training unit, the training unit being configured to:
Before the enterprise information to be classified is acquired, perform the following operations:
Acquire several pieces of enterprise information, and screen out from them the pieces that satisfy a set screening rule to form a training sample set, wherein the enterprise type of every piece of enterprise information in the training sample set has been determined;
According to the enterprise type of each piece of enterprise information in the training sample set, define the pieces that belong to the same enterprise type as one training sample subset, wherein each training sample subset corresponds to one enterprise type, and the enterprise types of all training sample subsets are at the same enterprise level;
For the enterprise information of each training sample subset, perform the following operations:
Extract keywords whose number equals a set number or falls within a set number range, to form a keyword set;
Define every two keywords in the keyword set as a keyword pair, and calculate the complete correlation between the two keywords of each keyword pair.
Optionally, when acquiring several pieces of enterprise information and screening out from them the pieces that satisfy a set screening rule to form a training sample set, wherein the enterprise type of every piece of enterprise information in the training sample set has been determined, the training unit is configured to:
Crawl several pieces of enterprise information with a preset web crawler device, extract the enterprise name and enterprise profile information contained in each crawled piece to form a corresponding information pair, and perform the following operations for each information pair:
Using clause splitting, extract the simple sentences contained in the enterprise profile information of the information pair;
Perform semantic mining on each simple sentence to extract its subject-verb-object components, and, based on those components, construct for each simple sentence a regular sentence pattern that conforms to the industry classification rules;
Screen out every information pair for which at least one regular sentence pattern exists, to form the training sample set, and perform the following operation for each information pair in the training sample set: based on a preset rule, select a target regular sentence pattern from the corresponding at least one regular sentence pattern, and determine the corresponding enterprise type based on the target regular sentence pattern.
Optionally, when selecting, based on a preset rule, a target regular sentence pattern from the corresponding at least one regular sentence pattern and determining the corresponding enterprise type based on the target regular sentence pattern, the training unit is configured to:
According to the order in which the at least one regular sentence pattern appears in the enterprise profile information, take the earliest regular sentence pattern as the target regular sentence pattern, and, based on the target regular sentence pattern, assign (recall) the corresponding information pair to the corresponding enterprise type; or,
Randomly select one regular sentence pattern from the at least one regular sentence pattern as the target regular sentence pattern, and, based on the target regular sentence pattern, assign (recall) the corresponding information pair to the corresponding enterprise type.
Optionally, when defining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair, the training unit is configured to:
Based on a variance distribution, calculate the weight of each keyword of the keyword set in the corresponding enterprise profile information, define every two keywords in the keyword set as a keyword pair, and, based on the weights of the two keywords of each keyword pair, determine the co-occurrence correlation between the two keywords, wherein the co-occurrence correlation characterizes how strongly the two keywords are associated through appearing together;
Based on the co-occurrence correlation between the two keywords of each keyword pair, determine the co-occurrence correlation probability between the two keywords, wherein the co-occurrence correlation probability is the proportion that the co-occurrence correlation between the two keywords accounts for in the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;
For each keyword pair, perform the following operation: when it is determined that there exists at least one intermediate keyword such that the co-occurrence correlation probabilities between each of the two keywords of the pair and the at least one intermediate keyword are all greater than zero, determine the coupling correlation between the two keywords based on those co-occurrence correlation probabilities;
Based on the co-occurrence correlation probability and the coupling correlation between the two keywords of each keyword pair, determine the complete correlation between the two keywords.
Optionally, when determining the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the training unit is configured to:
Based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, determine the conditional correlation between the two keywords and the at least one intermediate keyword, wherein a conditional correlation between two keywords and one intermediate keyword indicates that, with that intermediate keyword as the condition, the two keywords are associated;
Based on the conditional correlation between the two keywords and the at least one intermediate keyword, determine the coupling correlation between the two keywords.
Optionally, when determining the conditional correlation between the two keywords and the at least one intermediate keyword based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the training unit is configured to:
For each intermediate keyword, perform the following operation:
Take the smaller of the two co-occurrence correlation probabilities between each of the two keywords and the intermediate keyword as the conditional correlation between the two keywords and the intermediate keyword.
Optionally, when determining the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword, the training unit is configured to:
Compute a weighted average of the conditional correlations between the two keywords and each of the at least one intermediate keyword, and take the result as the coupling correlation between the two keywords.
Optionally, when determining, based on the complete correlation of each word pair under each enterprise type, the coupling probability that each word pair belongs to each enterprise type, the classification unit is configured to:
Based on the complete correlation of each word pair under each enterprise type, determine the class-conditional probability of each word pair under each enterprise type;
Based on the determined class-conditional probability of each word pair under each enterprise type and the prior probability of each enterprise type, determine the coupling probability that each word pair belongs to each enterprise type.
Optionally, the device further comprises a multi-level classification unit, the multi-level classification unit being configured to:
After the enterprise type with the maximum coupling probability is taken as the enterprise type of the enterprise information to be classified at the current enterprise level, perform the following operations:
Determine the enterprise type of the enterprise information to be classified at each of the preset different enterprise levels;
Based on a preset multi-level screening rule, select one enterprise type from the enterprise types at the different enterprise levels as the target enterprise type of the enterprise information to be classified.
In the embodiments of the present invention, enterprise information to be classified is first acquired; several words that satisfy a set rule are then extracted from it, and every two words are defined as a word pair. Next, based on a preset coupling network model, the complete correlation of each word pair under each preset enterprise type is determined, wherein the complete correlation characterizes the degree of semantic association between two words. Finally, based on the complete correlation of each word pair under each enterprise type, the coupling probability that each word pair belongs to each enterprise type is determined, and the enterprise type with the maximum coupling probability is taken as the enterprise type of the enterprise information to be classified. In this way, for directly acquired enterprise information, the corresponding enterprise type can be determined from the degree of semantic association between the words extracted from it, which improves classification accuracy; moreover, because no manual operation is required, processing efficiency and, in turn, user experience are also improved.
Brief description of the drawings
Fig. 1 is a three-level enterprise type classification chart for real estate and home decoration in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the web crawler device in an embodiment of the present invention;
Fig. 3 is a flowchart of the method for screening the training sample set in an embodiment of the present invention;
Fig. 4 is a flowchart of the method for determining the coupling network model in an embodiment of the present invention;
Fig. 5 is a flowchart of the method for classifying enterprise information to be classified based on the determined coupling network model in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the enterprise information classification device in an embodiment of the present invention.
Detailed description
In order to enter massive amounts of enterprise information promptly and classify it quickly and correctly, the embodiments of the present invention provide a newly designed enterprise information classification method. In this method, enterprise information to be classified is acquired; several words that satisfy a set rule are extracted from it, and every two words are defined as a word pair. Then, based on a preset coupling network model, the complete correlation of each word pair under each preset enterprise type is determined, wherein the complete correlation characterizes the degree of semantic association between two words. Finally, based on the complete correlation of each word pair under each enterprise type, the coupling probability that each word pair belongs to each enterprise type is determined, and the enterprise type with the maximum coupling probability is taken as the enterprise type of the enterprise information to be classified.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The solution of the present invention is described in detail below through specific embodiments; the present invention is, of course, not limited to the following embodiments.
In the embodiments of the present invention, multiple first-level enterprise types are preset based on industry classification rules, for example the media industry, the real estate and home decoration industry, and the game industry. Each first-level enterprise type can be subdivided into multiple second-level enterprise types, each second-level enterprise type can be subdivided into several third-level enterprise types, and so on, down to N-th-level enterprise types.
The embodiments of the present invention use a three-level enterprise hierarchy, that is, enterprise types are subdivided down to the third level. Taking the real estate and home decoration industry as an example, and referring to Fig. 1, the first-level enterprise type is "real estate and home decoration"; the second-level enterprise types are "housing agency", "furniture and appliances", "decoration design", "real estate information and communities", "property services" and "other real estate and home decoration". Taking "housing agency" as an example, its third-level enterprise types are "real estate consulting agency", "real estate price appraisal agency", "real estate brokerage agency", "rental platforms and software" and "housing transaction platforms and software".
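The three-level hierarchy described for Fig. 1 can be held in a nested mapping, as sketched below. The category names are English renderings of the names in the text; the third-level types of the other second-level types are not listed in this passage, so they are left empty. The variable and function names are illustrative only.

```python
TAXONOMY = {
    "real estate and home decoration": {
        "housing agency": [
            "real estate consulting agency",
            "real estate price appraisal agency",
            "real estate brokerage agency",
            "rental platforms and software",
            "housing transaction platforms and software",
        ],
        # Third-level types for the entries below are not given in this passage.
        "furniture and appliances": [],
        "decoration design": [],
        "real estate information and communities": [],
        "property services": [],
        "other real estate and home decoration": [],
    }
}


def types_at_level(taxonomy, level):
    """List every enterprise type at one level, so that candidates share a level."""
    if level == 1:
        return list(taxonomy)
    if level == 2:
        return [second for children in taxonomy.values() for second in children]
    if level == 3:
        return [third for children in taxonomy.values()
                for thirds in children.values() for third in thirds]
    raise ValueError("only levels 1 to 3 are modelled in this sketch")


second_level = types_at_level(TAXONOMY, 2)
third_level = types_at_level(TAXONOMY, 3)
```

Listing candidates per level mirrors the requirement that all enterprise types compared in one classification pass are at the same enterprise level.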
Further, in the embodiments of the present invention, before acquired enterprise information is classified, several pieces of enterprise information may first be acquired as a training sample set, and the coupling network model used for classifying enterprise information is then built from the training sample set.
Preferably, in this embodiment, company information may be collected by a web crawler. For example, a web
crawler device may be added, whose architecture is shown in Fig. 2. The web crawler device includes a download
module, a parsing module and a storage module, and operates as follows:
First, configure the crawler rule, which is used to save the collected web pages locally in batches.
Second, configure the page collection rule: for example, take one web page as a template and mark the data
blocks to be collected; other web pages that match this template are then parsed according to the same rule.
Then, configure the acquisition task: a crawler rule is combined with page collection rules, and the
combination forms one acquisition task; one crawler rule may correspond to multiple page collection rules.
Finally, publish the acquisition task: the configured acquisition task is dispatched to a collection queue on
a designated server.
These steps complete the crawling of company information.
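The four configuration steps above can be sketched with plain data classes. This is a minimal illustration rather than the device of Fig. 2: the class names, field names, URLs and the `publish` helper are all hypothetical, and a `deque` stands in for a collection queue on a designated server.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CrawlRule:
    """Which pages to download and where to save them locally."""
    start_urls: list
    save_dir: str


@dataclass
class CollectionRule:
    """Data blocks to extract from pages that match one template page."""
    template_url: str
    data_blocks: list  # names of the fields to extract


@dataclass
class AcquisitionTask:
    """One crawl rule combined with one or more collection rules."""
    crawl_rule: CrawlRule
    collection_rules: list = field(default_factory=list)


def publish(task, queue):
    """Dispatch a configured acquisition task to a collection queue."""
    queue.append(task)


queue = deque()  # stands in for a collection queue on a designated server
task = AcquisitionTask(
    CrawlRule(["https://example.com/companies"], "/tmp/pages"),
    [CollectionRule("https://example.com/companies/1", ["name", "profile"])],
)
publish(task, queue)
```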
Further, the enterprise type of a crawled company information entry is initially unknown, i.e. it is not known
which enterprise type the entry belongs to. The crawled entries therefore cannot be used directly as a training
sample set. They must first be screened so that the entries meeting a set screening rule are selected to form
the training sample set. As shown in Fig. 3, the screening process is as follows:
Step 300: For each company information entry, perform the following operation: extract the enterprise name and
the company profile information to form an information pair.
Specifically, each company information entry contains at least an enterprise name and company profile
information, so the enterprise name and the company profile information are extracted from each entry to form
its corresponding information pair.
Further, so that the information pairs are convenient to use later, in this embodiment the extracted
information pairs are stored in a database as key-value pairs. For example, the database is an in-memory redis
database and each key-value pair consists of a "Key" and a "Value", as shown in Table 1.
Table 1
In this embodiment, the information pairs are stored in an in-memory redis database so that the required
information pairs can be retrieved quickly when they are used later, without slowing down extraction.
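Step 300 and the key-value storage can be sketched as follows. The sketch assumes that the key is the enterprise name and the value is the company profile information (the contents of Table 1 are not shown here), uses a plain dictionary as a stand-in for the in-memory database, and the field names `name` and `profile` are hypothetical.

```python
def to_info_pair(company_info):
    """Extract (enterprise name, company profile) from one raw entry."""
    return company_info["name"], company_info["profile"]


store = {}  # in-memory stand-in for the key-value database

entry = {
    "name": "AA",
    "profile": "AA company was founded in April 2010.",
    "url": "https://example.com/aa",  # extra crawled fields are ignored
}
key, value = to_info_pair(entry)
store[key] = value
```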
Step 310: For each information pair, perform the following operation: split the company profile information in
the information pair into clauses to obtain a number of simple sentences.
Specifically, the company profile information in an information pair usually consists of one long passage of
text. To extract keywords that better reflect the enterprise type, the company profile information is first
split into simple sentences, e.g. at punctuation marks.
Taking the full stop "." as an example, the company profile information of "AA" in Table 1 can be split into
"AA company was founded in April 2010 and is a mobile Internet company focused on the research and development
of intelligent hardware and electronic products", "'Born for fever' is the product concept of AA company", and
"AA company pioneered the model of developing a mobile phone operating system in the Internet way, with fans
taking part in development and improvement".
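Clause splitting at punctuation marks can be sketched with a regular expression. The set of split characters below is an assumption (the text only names the full stop as an example), and `split_clauses` is a hypothetical helper name.

```python
import re

# Sentence-ending punctuation used as split points (an assumed set that
# covers both full-width and ASCII marks).
_SENTENCE_END = re.compile(r"[。.!?！？]")


def split_clauses(profile):
    """Split a company profile into non-empty simple sentences."""
    return [s.strip() for s in _SENTENCE_END.split(profile) if s.strip()]


profile = (
    "AA company was founded in April 2010. "
    "'Born for fever' is the product concept of AA company."
)
sentences = split_clauses(profile)
```

A split on the ASCII full stop would also cut abbreviations and decimal numbers, so a production rule set would need to be more careful than this sketch.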
Step 320: Perform semantic mining on each simple sentence to extract the subject-verb-object (SVO) components
it contains, and use those SVO components to construct, for each simple sentence, a canonical clause (a
regular-expression sentence pattern) that conforms to the industry classification rules.
Specifically, the SVO components of a simple sentence are mined because, in Chinese, the subject, predicate and
object of a complete sentence usually form the trunk of the sentence and are highly cohesive. Moreover, most
sentences contain all three; sentences lacking a subject or an object are rare, and sentences lacking all
three are rarer still.
On this basis, semantic mining is performed on each simple sentence to extract its SVO components. Then, for
each simple sentence from which SVO components can be mined, those components are used to construct a canonical
clause that conforms to the industry classification rules.
Further, whether the industry classification rules are met can be decided by keyword matching. For example,
keywords related to industry classification are preset; if the extracted SVO components contain a preset
keyword, the mined SVO components are considered to conform to the industry classification rules, and a
canonical clause is built from them.
For example, take the sentence "AA company was founded in April 2010 and is a mobile Internet company focused
on the research and development of intelligent hardware and electronic products" from the example above.
Semantic mining yields "AA company is a mobile Internet company". Based on these mined SVO components, the
following canonical clause conforming to the industry classification rules can be constructed: "is (.*) mobile
Internet, cultural_media".
Of course, SVO components cannot be mined from every simple sentence, and a simple sentence from which no SVO
components can be mined cannot yield a canonical clause.
For example, if a simple sentence obtained by splitting is just the conjunction "then", it contains no SVO
components, so no canonical clause conforming to the industry classification rules can be constructed from it.
Moreover, not every set of mined SVO components can form a canonical clause that conforms to the industry
classification rules.
For example, for "'Born for fever' is the product concept of AA company" in the example above, the mined SVO
components are "'Born for fever' is the product concept", which clearly do not conform to the industry
classification rules.
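The keyword check and clause construction of step 320 can be sketched as below. The SVO triple is assumed to come from an upstream semantic-mining step that is not shown; the keyword table `INDUSTRY_KEYWORDS`, the function name and the exact way the pattern is assembled are hypothetical choices made for illustration.

```python
import re

# Preset industry keywords mapped to an enterprise-type label (hypothetical).
INDUSTRY_KEYWORDS = {
    "mobile Internet": "cultural_media",
    "decoration": "house",
}


def build_canonical_clause(svo):
    """Return (compiled pattern, label) if the object mentions a preset
    keyword, else None.

    `svo` is a (subject, verb, object) tuple assumed to be produced by a
    semantic-mining step that is outside this sketch.
    """
    _subject, verb, obj = svo
    for keyword, label in INDUSTRY_KEYWORDS.items():
        if keyword in obj:
            pattern = re.compile(re.escape(verb) + r" (.*)" + re.escape(keyword))
            return pattern, label
    return None


clause = build_canonical_clause(("AA company", "is", "a mobile Internet company"))
no_clause = build_canonical_clause(("'Born for fever'", "is", "the product concept"))
```

The first call yields a pattern paired with the label `cultural_media`; the second returns `None` because no preset keyword occurs in the object.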
Step 330: Determine each information pair that has at least one canonical clause, and for each such information
pair perform the following operation: based on a preset rule, select a target canonical clause from its
canonical clauses, and determine the corresponding enterprise type based on that target canonical clause.
Specifically, not every information pair has a canonical clause, and an information pair that has canonical
clauses does not necessarily have only one. Therefore, after the canonical clauses conforming to the industry
classification rules have been constructed, the information pairs that have at least one canonical clause must
be determined.
Further, for each information pair that has at least one canonical clause, the following operation is
performed: based on the preset rule, a single target canonical clause is selected from its canonical clauses,
and the target canonical clause is used to recall the information pair to the corresponding enterprise type,
where "recall" means determining the enterprise type corresponding to the information pair.
Taking one information pair as an example, if it has multiple canonical clauses, the clause that appears first
in the company profile information can be taken as the target canonical clause, according to the order in which
the clauses appear, and the target canonical clause is used to recall the information pair to the corresponding
enterprise type.
For example, suppose information pair A has the following three canonical clauses: "is (.*) video application,
cultural_media", "is (.*) skin care, consume_life" and "is (.*) decoration (.*)$, house". Taking the first
clause in order as the target canonical clause, "is (.*) video application, cultural_media" is selected and
used to perform the recall operation on information pair A, which determines that information pair A
corresponds to the "video media" category.
Alternatively, one clause can be selected at random from the largest group of canonical clauses that share the
same keyword and used as the target canonical clause to recall the information pair to the corresponding
enterprise type.
For example, suppose information pair M has 5 canonical clauses, of which 4 relate to "real estate decoration"
and only 1 relates to "culture and media". One of the 4 clauses relating to "real estate decoration" can then
be selected at random as the target canonical clause and used to determine the enterprise type of the
information pair, which is necessarily related to "real estate decoration".
In this embodiment, any canonical clause that can determine the enterprise type may be used as the target
canonical clause; the specific selection process is not limited.
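The two selection rules described for step 330 can be sketched as follows. Clauses are represented as hypothetical (pattern, label) pairs listed in order of appearance, and grouping by label is used here as a stand-in for "sharing the same keyword", which is an assumption of this sketch.

```python
import random
from collections import Counter


def first_in_order(clauses):
    """Pick the clause that appears first in the company profile."""
    return clauses[0]


def random_from_majority(clauses, rng=random):
    """Pick a random clause from the most common enterprise-type label."""
    top_label, _count = Counter(label for _pat, label in clauses).most_common(1)[0]
    return rng.choice([c for c in clauses if c[1] == top_label])


# Clauses are (pattern, label) pairs listed in order of appearance.
pair_a = [
    ("is (.*) video application", "cultural_media"),
    ("is (.*) skin care", "consume_life"),
    ("is (.*) decoration (.*)$", "house"),
]
pair_m = [
    ("p1", "house"), ("p2", "house"), ("p3", "cultural_media"),
    ("p4", "house"), ("p5", "house"),
]
```

For `pair_a` the first rule selects the `cultural_media` clause; for `pair_m` the second rule always returns one of the four `house` clauses.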
At this point, the information pairs that meet the set screening rule can serve as the training sample set.
Because the enterprise type of every information pair in the training sample set has been determined, the
same-level classification rule is followed: based on the enterprise type of each information pair, the
information pairs belonging to the same enterprise type are grouped into one training sample subset. Each
training sample subset corresponds to one enterprise type, and the enterprise types of the subsets in one batch
are all at the same enterprise level.
Specifically, the same-level classification rule means the following: if the enterprise types determined for
the information pairs are first-level enterprise types, the training sample set is divided according to the
first-level enterprise types; if they are N-th-level enterprise types, the training sample set is divided
according to the N-th-level enterprise types.
For example, suppose there are 3 preset first-level enterprise types, each with 2 second-level enterprise
types, as shown in Table 2.
Table 2
Suppose further that there is a training sample set M containing 6 information pairs, M = {information pair 1,
information pair 2, information pair 3, information pair 4, information pair 5, information pair 6}.
Information pairs 1 and 2 correspond to the first-level enterprise type "culture and media", where information
pair 1 corresponds to the second-level enterprise type "new media" and information pair 2 to "traditional
media". Information pairs 3 and 4 correspond to the first-level enterprise type "real estate and home
decoration", where information pair 3 corresponds to the second-level enterprise type "real estate" and
information pair 4 to "decoration design". Information pairs 5 and 6 correspond to the first-level enterprise
type "local life", where information pair 5 corresponds to the second-level enterprise type "food" and
information pair 6 to "beauty".
If the division follows the first-level enterprise types, training sample set M is divided into 3 training
sample subsets: M1 = {information pair 1, information pair 2}, M2 = {information pair 3, information pair 4}
and M3 = {information pair 5, information pair 6}.
If the division follows the second-level enterprise types, training sample set M is divided into 6 training
sample subsets: M1 = {information pair 1}, M2 = {information pair 2}, M3 = {information pair 3},
M4 = {information pair 4}, M5 = {information pair 5} and M6 = {information pair 6}.
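The same-level division of the worked example can be sketched as a simple grouping. The dictionary layout and the function name `split_by_level` are hypothetical; the enterprise-type labels are English renderings of those in the example.

```python
from collections import defaultdict

# Each information pair carries its first- and second-level enterprise type.
TRAINING_SET = {
    "information pair 1": ("culture and media", "new media"),
    "information pair 2": ("culture and media", "traditional media"),
    "information pair 3": ("real estate and home decoration", "real estate"),
    "information pair 4": ("real estate and home decoration", "decoration design"),
    "information pair 5": ("local life", "food"),
    "information pair 6": ("local life", "beauty"),
}


def split_by_level(training_set, level):
    """Group information pairs by their enterprise type at `level` (1-based)."""
    subsets = defaultdict(list)
    for pair_id, types in training_set.items():
        subsets[types[level - 1]].append(pair_id)
    return dict(subsets)


level1 = split_by_level(TRAINING_SET, 1)  # 3 training sample subsets
level2 = split_by_level(TRAINING_SET, 2)  # 6 training sample subsets
```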
After the training sample subsets contained in the training sample set, and the enterprise type corresponding
to each subset, have been determined, the coupling network model is determined from those training sample
subsets. In this embodiment, the coupling network model may be a Bayesian coupling network model. As shown in
Fig. 4, the method for determining the coupling network model is as follows:
Step 400: For the company profile information of each information pair in each training sample subset, perform
the following operation: extract a set number of keywords, or a number of keywords within a set range, to form
a keyword set.
Specifically, the company profile information in each information pair is made up of a number of keywords, but
not every keyword has reference value. To make it easier to compute the degree of association between keywords
later, keywords are extracted from the company profile information of each information pair according to a set
number or a set number range, forming a keyword set for each information pair.
For example, suppose the set number is 200 and the training sample set contains two information pairs. Then 200
keywords that meet a set condition are extracted from the company profile information of each of the two
information pairs to form their respective keyword sets, where the set condition may relate to the enterprise
type.
As another example, suppose the set number range is 100-150 and the training sample set contains two
information pairs. Then 100 to 150 keywords that meet the set condition are extracted from the company profile
information of each of the two information pairs to form their respective keyword sets.
Step 410: Based on the variance distribution, calculate the weighted value of each keyword in each keyword set
within the company profile information in which it occurs.
Specifically, after the keyword set of each information pair has been obtained, the weighted value of each
keyword in each keyword set, within the company profile information in which it occurs, is determined.
Preferably, in this embodiment, the weighted value of keyword h in company profile information d is calculated
by the following equation:
[equation not reproduced in the source text]
where $t_{hd}$ is the word frequency, calculated as
$$t_{hd} = \frac{T_{hd}}{S_d}$$
in which $T_{hd}$ is the number of times keyword h occurs in company profile information d and $S_d$ is the
total number of words in company profile information d; $N$ is the total number of company profile information
entries in the training sample set; $N(w_h)$ is the number of company profile information entries in the
training sample set in which keyword h occurs; a further term is the average number of times keyword h occurs
across the company profile information entries of the training sample set; and a tuning factor is used mainly
to temper over-reliance on word frequency when the weighted value of a keyword is calculated.
Step 420: For each keyword set, perform the following operation: define every two keywords as a keyword pair
and, based on the weighted values of the two keywords in each keyword pair, determine the co-occurrence
correlation between them, where the co-occurrence correlation characterizes how strongly the two keywords tend
to occur together.
Specifically, different words may be associated with one another: within a passage of text, the occurrence of
word A may prompt the occurrence of word B, in which case word A and word B are said to have a co-occurrence
correlation.
Further, every two keywords in each keyword set are defined as a keyword pair. Taking one keyword pair as an
example, the co-occurrence correlation between its two keywords is determined from their respective weighted
values.
Preferably, in this embodiment, the co-occurrence correlation between keyword $key_i$ and keyword $key_k$ is
determined by the following equation:
[equation not reproduced in the source text]
where $w_{xi}$ and $w_{xk}$ are the weighted values of keyword $key_i$ and keyword $key_k$ in company profile
information $d_x$, and $S = \{x \mid (w_{xi} \neq 0) \wedge (w_{xk} \neq 0)\}$ is the set of company profile
information entries in the training sample set in which the weighted values of both $key_i$ and $key_k$ are
non-zero.
Step 430: For each keyword pair, perform the following operation: based on the co-occurrence correlation
between the two keywords of the pair, determine the co-occurrence dependent probability between them, where
the co-occurrence dependent probability is the ratio of the co-occurrence correlation between the two keywords
to the co-occurrence correlations of all keyword pairs in the keyword set to which they belong.
Specifically, after the co-occurrence correlation between the two keywords of each keyword pair in each keyword
set has been determined, the co-occurrence dependent probability between the two keywords of each pair must be
determined.
Further, taking one keyword pair as an example, the co-occurrence dependent probability between its two
keywords is determined from the co-occurrence correlation between them and the co-occurrence correlations of
the other keyword pairs in the same keyword set.
Preferably, in this embodiment, the co-occurrence dependent probability between keyword $key_k$ and keyword
$key_i$ is calculated by the following equation; it characterizes the probability that keyword $key_i$ also
occurs when keyword $key_k$ occurs in a company profile information entry $d_x$ of the training sample set:
[equation not reproduced in the source text]
where $R_{\text{co-occur}}(key_i, key_k)$ is the co-occurrence correlation between keyword $key_i$ and keyword
$key_k$.
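One reading of step 430 is that each pair's co-occurrence dependent probability is its share of the total co-occurrence correlation over all pairs in the keyword set. The sketch below implements that reading only; the exact normalisation (for instance, over just the pairs that involve one given keyword) may differ, and the function name and sample values are hypothetical.

```python
def cooccurrence_probabilities(correlations):
    """Normalise pairwise co-occurrence correlations into probabilities.

    `correlations` maps frozenset({kw_a, kw_b}) to the co-occurrence
    correlation of that pair within one keyword set. Each pair's
    probability is its share of the total over all pairs (an assumed
    normalisation).
    """
    total = sum(correlations.values())
    if total == 0:
        return {pair: 0.0 for pair in correlations}
    return {pair: value / total for pair, value in correlations.items()}


correlations = {
    frozenset({"finance", "investment"}): 0.6,
    frozenset({"fund", "securities"}): 0.3,
    frozenset({"stock", "insurance"}): 0.1,
}
probs = cooccurrence_probabilities(correlations)
```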
Step 440: For each keyword pair, perform the following operation: if it is determined that at least one interim
keyword exists such that the co-occurrence dependent probabilities between each of the two keywords and that
interim keyword are both greater than zero, determine the conditional dependency between the two keywords with
respect to each such interim keyword, based on those co-occurrence dependent probabilities.
Specifically, besides a direct association, i.e. a co-occurrence correlation between the two keywords, two
keywords may also have an indirect association. For that case, if it is determined that at least one interim
keyword exists such that the co-occurrence dependent probabilities between each of the two keywords and that
interim keyword are both greater than zero, the conditional dependency between the two keywords with respect to
the interim keyword can be determined from those co-occurrence dependent probabilities.
A conditional dependency arises, for example, when the co-occurrence dependent probability between keyword A
and keyword C is greater than zero and the co-occurrence dependent probability between keyword B and keyword C
is also greater than zero; keyword A and keyword B then have a conditional dependency.
Further, taking one keyword pair as an example, if the co-occurrence dependent probabilities between each of
its two keywords and at least one interim keyword are both greater than zero, then for each such interim
keyword the following operation is performed: the smaller of the two co-occurrence dependent probabilities
(between each keyword and the interim keyword) is taken as the conditional dependency between the two keywords
with respect to that interim keyword.
Preferably, in this embodiment, if the training sample set contains at least one keyword $key_k$ such that
$R_{condit}(key_m, key_k) > 0$ and $R_{condit}(key_n, key_k) > 0$, then a conditional dependency exists between
keyword $key_m$ and keyword $key_n$, and it is calculated as:
$$R(key_m, key_n \mid key_k) = \min\bigl(R_{condit}(key_m, key_k),\ R_{condit}(key_n, key_k)\bigr)$$
where $R_{condit}(key_m, key_k)$ is the co-occurrence dependent probability between keyword $key_k$ and keyword
$key_m$, and $R_{condit}(key_n, key_k)$ is the co-occurrence dependent probability between keyword $key_k$ and
keyword $key_n$.
For example, if the co-occurrence dependent probability between keyword A and keyword C is "0.6" and that
between keyword B and keyword C is "0.4", then the conditional dependency between keyword A and keyword B with
respect to keyword C is "0.4".
Further, the more interim keywords through which two keywords can be associated, the higher the conditional
dependency between the two keywords.
For example, if keyword A and keyword C can be associated through keyword B and also through keyword D, then
the conditional dependency between keyword A and keyword C is clearly higher than if they were associated
through keyword B alone.
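The minimum rule of step 440 can be sketched as follows. The nested-dictionary layout of `r_condit` and the function name are hypothetical; missing entries are treated as a co-occurrence dependent probability of zero.

```python
def conditional_dependencies(r_condit, kw_m, kw_n):
    """Return {interim keyword: min(...)} for every keyword linked to both.

    `r_condit[a][b]` is the co-occurrence dependent probability between
    keywords a and b; missing entries count as zero.
    """
    result = {}
    for kw_k in r_condit:
        if kw_k in (kw_m, kw_n):
            continue  # a keyword cannot act as its own interim keyword
        p_m = r_condit.get(kw_m, {}).get(kw_k, 0.0)
        p_n = r_condit.get(kw_n, {}).get(kw_k, 0.0)
        if p_m > 0 and p_n > 0:
            result[kw_k] = min(p_m, p_n)
    return result


r_condit = {
    "A": {"C": 0.6},
    "B": {"C": 0.4},
    "C": {"A": 0.6, "B": 0.4},
}
deps = conditional_dependencies(r_condit, "A", "B")
```

With the probabilities 0.6 and 0.4 from the example, the only interim keyword is C and the resulting conditional dependency is 0.4.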
Step 450: For each keyword pair, perform the following operation: based on the conditional dependencies between
the two keywords of the pair with respect to the at least one interim keyword, determine the coupling
correlation between the two keywords.
Specifically, taking one keyword pair as an example, the conditional dependencies between its two keywords with
respect to each of the interim keywords are combined by weighted averaging, and the averaged result is taken as
the coupling correlation between the two keywords.
Preferably, in this embodiment, the coupling correlation of a keyword pair (keyword $key_n$ and keyword
$key_m$) in the training sample set is calculated by the following equation:
[equation not reproduced in the source text]
where $L = \{key_k \mid (R_{condit}(key_m, key_k) > 0) \wedge (R_{condit}(key_n, key_k) > 0)\}$.
For example, if the conditional dependency between keyword A and keyword B with respect to keyword C is "0.4"
and that with respect to keyword D is "0.6", then the coupling correlation between keyword A and keyword B is
"0.5".
Of course, if no interim keyword associates the two keywords, the coupling correlation between them is zero.
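The averaging in step 450 can be sketched as below. Following the worked example (0.4 and 0.6 giving 0.5), the sketch assumes an unweighted mean over the interim keywords; any non-uniform weights would be a departure from this sketch. The function name is hypothetical.

```python
def coupling_correlation(conditional_deps):
    """Average the conditional dependencies over all interim keywords.

    `conditional_deps` maps each interim keyword key_k to
    R(key_m, key_n | key_k). Returns 0.0 when no interim keyword links
    the two keywords.
    """
    if not conditional_deps:
        return 0.0
    return sum(conditional_deps.values()) / len(conditional_deps)


coupled = coupling_correlation({"C": 0.4, "D": 0.6})
uncoupled = coupling_correlation({})
```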
Step 460: Based on the co-occurrence dependent probability and the coupling correlation between the two
keywords of each keyword pair, determine the complete correlation between the two keywords of each pair, where
the complete correlation between the two keywords of a keyword pair characterizes the degree of semantic
association between them.
Specifically, to capture the degree of association between two keywords more accurately, the co-occurrence
dependent probability and the coupling correlation between them are combined to determine their complete
correlation; the higher the complete correlation between two keywords, the higher the degree of semantic
association between them.
Preferably, in this embodiment, the complete correlation of a keyword pair (keyword $key_n$ and keyword
$key_m$) is calculated by the following equation:
[equation not reproduced in the source text]
where α is a parameter between 0 and 1 that adjusts the respective proportions of the co-occurrence dependent
probability and the coupling correlation.
For example, suppose α is "0.7". If, for keyword A and keyword B of keyword pair 1, the co-occurrence dependent
probability is "0.3" and the coupling correlation is "0.6", then the complete correlation between keyword A and
keyword B is 0.7 × 0.3 + (1 - 0.7) × 0.6 = 0.39; that is, in the corresponding training sample subset, the
complete correlation between keyword A and keyword B of keyword pair 1 is "0.39".
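The combination in step 460 can be sketched as a weighted blend. The form used below, α times the co-occurrence dependent probability plus (1 - α) times the coupling correlation, is an assumption inferred from the description of α as a proportion between 0 and 1; the function and parameter names are hypothetical.

```python
def complete_correlation(r_condit, r_couple, alpha):
    """Blend direct and indirect association with weight alpha in [0, 1].

    Assumed form: alpha * r_condit + (1 - alpha) * r_couple.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * r_condit + (1.0 - alpha) * r_couple


value = complete_correlation(r_condit=0.2, r_couple=0.8, alpha=0.5)
```

Under this assumed form, α = 1 keeps only the direct co-occurrence term and α = 0 keeps only the coupling term.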
In this way, the complete correlation between the two keywords of every keyword pair of every information pair,
in every training sample subset of the training sample set, is determined.
In this embodiment, to make it easy to look up later the complete correlation of each keyword pair in the
different training sample subsets (different enterprise types), the complete correlation between the two
keywords of each keyword pair is taken as one element in determining the general semantic matrix of the
coupling network model.
Preferably, in this embodiment, the element of the general semantic matrix M' of the training sample set that
is determined by a keyword pair (keyword $key_n$ and keyword $key_m$) is given by:
$$M'(m, n) = R(key_m, key_n)$$
In this embodiment, the general semantic matrix of the coupling network model is determined from the complete
correlation of keyword pairs because this takes the associations between keywords into account more thoroughly
and reduces the sparsity of the elements in the general semantic matrix.
Further, in this embodiment, the training sample set is divided in advance into training sample subsets
according to the different enterprise types, and the complete correlation between the two keywords of a keyword
pair is calculated within the training sample subset to which the pair belongs. Each element of the general
semantic matrix of the training sample set therefore also has a corresponding enterprise type.
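A general semantic matrix whose elements are tied to an enterprise type can be sketched as a small lookup class. The class name, method names and the choice to treat M'(m, n) and M'(n, m) as the same cell are assumptions of this sketch, as is returning zero for pairs not seen under a given enterprise type.

```python
from collections import defaultdict


class GeneralSemanticMatrix:
    """Complete correlations of keyword pairs, kept per enterprise type."""

    def __init__(self):
        self._cells = defaultdict(dict)  # enterprise type -> {pair: value}

    def set(self, enterprise_type, kw_m, kw_n, value):
        # A frozenset key makes M'(m, n) and M'(n, m) the same cell
        # (symmetry is assumed here).
        self._cells[enterprise_type][frozenset((kw_m, kw_n))] = value

    def get(self, enterprise_type, kw_m, kw_n):
        # Pairs not seen under this enterprise type have correlation zero.
        cells = self._cells.get(enterprise_type, {})
        return cells.get(frozenset((kw_m, kw_n)), 0.0)


matrix = GeneralSemanticMatrix()
matrix.set("investment and wealth management", "finance", "investment", 0.6)
```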
In this embodiment, to verify the accuracy of the coupling network model, the model may be tested with some of
the training samples in the training sample set, or tested manually with company information of unknown type.
If the test accuracy exceeds a set threshold (e.g. 99%), the coupling network model can be put into use;
otherwise, more training samples are selected and the coupling network model is trained further until the test
accuracy meets the set threshold.
At this point, the coupling network model that can be used for company information classification is
determined.
As shown in Fig. 5, in this embodiment, for acquired company information whose enterprise type is unknown
(referred to below as company information to be classified), the corresponding enterprise type can be
determined by the following procedure:
Step 500: Extract from the acquired company information to be classified a number of words that meet a set
rule, and define every two words as a word pair.
Specifically, words can be extracted from the company profile information of the company information to be
classified by clause splitting and semantic mining; the weighted value of each word is calculated based on the
variance distribution; the words that meet the set rule (e.g. weighted value greater than a set threshold) are
selected from all the extracted words; and every two of the selected words are defined as a word pair.
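The filtering and pairing in step 500 can be sketched with `itertools.combinations`. The weights below are made-up illustrative values, and the function name and threshold are hypothetical.

```python
from itertools import combinations


def word_pairs(weights, threshold):
    """Keep words whose weight exceeds `threshold` and pair them up."""
    kept = sorted(w for w, weight in weights.items() if weight > threshold)
    return list(combinations(kept, 2))


# Illustrative weighted values; real values come from the weighting step.
weights = {"finance": 0.9, "investment": 0.7, "fund": 0.5, "the": 0.01}
pairs = word_pairs(weights, threshold=0.1)
```

Three words pass the threshold here, so three unordered word pairs are produced.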
Step 510: Based on the preset coupling network model, determine the complete correlation of each word pair in
each preset enterprise type, where the complete correlation characterizes the degree of semantic association
between two words.
Specifically, based on the preset coupling network model, the complete correlation of the keyword pair
corresponding to each word pair is looked up, for each enterprise type, in the general semantic matrix.
Step 520: Based on the complete correlation of each word pair in each enterprise type, determine the coupling
probability of the word pairs belonging to each enterprise type, and take the enterprise type with the largest
coupling probability as the enterprise type of the company information to be classified at the current
enterprise level.
Preferably, in this embodiment, the probability that the company information to be classified belongs to
enterprise type C is calculated by the following equation:
[equation not reproduced in the source text]
where word $key_i$ and word $key_h$ form a word pair of the company information to be classified, and i and h
are variables. The equation involves three terms: the class-conditional probability of the word pairs of the
company information to be classified under enterprise type C; the class prior probability P(c) of enterprise
type C, i.e. the ratio of the number of company information entries under enterprise type C in the training
sample set to the total number of company information entries in the training sample set; and the sum of the
probabilities that the word pairs of the company information to be classified occur in the company information
entries of the training sample set, which is in general a constant.
Specifically, when the class-conditional probability is calculated, taking two word pairs as an example, the
complete correlation of the first word pair under enterprise type C and the complete correlation of the second
word pair under enterprise type C are both looked up in the general semantic matrix.
In general, however, if the first word pair or the second word pair does not exist under enterprise type C, its
complete correlation is zero. In this embodiment, to prevent a zero factor from making the whole product zero,
a constant (e.g. "1") is added to the complete correlation of each keyword pair looked up in the calculation.
For example, assume that enterprise type C is "Investment & Financing", and that the established general semantic matrix determines that the keyword pairs under the "Investment & Financing" enterprise type are "finance" and "investment", "fund" and "securities", "stock" and "insurance", and "treasury bonds" and "futures", with complete correlations of "0.6", "0.8", "0.3" and "0.4" respectively;
If three word pairs are extracted from the company information to be classified, where the first word pair is "finance" and "investment", the second word pair is "fund" and "securities", and the third word pair is "animation" and "animation", then, under enterprise type C, the complete correlation of the first word pair is "0.6" and that of the second word pair is "0.8"; because the third word pair does not exist under enterprise type C, the complete correlation of the third word pair of the company information to be classified is "0";
If it is further assumed that the constant is set to "1", then the complete correlations of the three word pairs of the company information to be classified become "0.6+1", "0.8+1" and "0+1" respectively.
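The smoothing step in this example can be sketched as follows. This is a minimal illustration, assuming that the class-conditional term is the plain product of the smoothed complete correlations; the dictionary layout and the names `CORRELATIONS`, `smoothed_factors` and `class_conditional_term` are hypothetical and do not come from the source.

```python
from math import prod

# Complete correlations under "Investment & Financing", taken from the example.
# Keys are unordered word pairs; this layout is an assumption of the sketch.
CORRELATIONS = {
    frozenset(("finance", "investment")): 0.6,
    frozenset(("fund", "securities")): 0.8,
    frozenset(("stock", "insurance")): 0.3,
    frozenset(("treasury bonds", "futures")): 0.4,
}


def smoothed_factors(word_pairs, correlations, constant=1.0):
    """Look up each pair (0 if absent under this type) and add the constant."""
    return [correlations.get(frozenset(pair), 0.0) + constant for pair in word_pairs]


def class_conditional_term(word_pairs, correlations, constant=1.0):
    """Product of the smoothed factors; never zero when constant > 0."""
    return prod(smoothed_factors(word_pairs, correlations, constant))


pairs = [("finance", "investment"), ("fund", "securities"), ("animation", "animation")]
factors = smoothed_factors(pairs, CORRELATIONS)   # 0.6+1, 0.8+1, 0+1
term = class_conditional_term(pairs, CORRELATIONS)
```

Without the constant, the third pair would contribute a factor of zero and erase the evidence from the first two pairs.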
Clearly, the company information to be classified obtains a corresponding coupling probability for each enterprise type; from these coupling probabilities, the enterprise type with the maximum coupling probability is selected as the enterprise type of the company information to be classified.
For example, if there are 3 enterprise types, and the coupling probability that company information A to be classified belongs to enterprise type 1 is "0.35", to enterprise type 2 is "0.73", and to enterprise type 3 is "0.96", then enterprise type 3 is determined as the enterprise type of company information A to be classified.
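The selection in this example amounts to taking the arg-max over the coupling probabilities. A short sketch, in which the function name `pick_enterprise_type` and the dictionary keys are hypothetical:

```python
def pick_enterprise_type(coupling_probabilities):
    """Return the enterprise type whose coupling probability is largest."""
    return max(coupling_probabilities, key=coupling_probabilities.get)


# Coupling probabilities of company information A, as given in the example.
probabilities = {
    "enterprise type 1": 0.35,
    "enterprise type 2": 0.73,
    "enterprise type 3": 0.96,
}
best = pick_enterprise_type(probabilities)
```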
However, because this embodiment of the present invention is based on a multi-level enterprise framework, the enterprise type of the company information to be classified can be determined separately at each enterprise level; then, based on a preset multi-level screening rule, one enterprise type is selected from the enterprise types at the different enterprise levels as the target enterprise type of the company information to be classified.
The multi-level screening rule may be the hierarchical backstepping method or the hierarchical forward method.
In the hierarchical backstepping method, taking Fig. 1 as an example, for company information 1 to be classified, the enterprise type at the first enterprise level is determined first as "real estate and home decoration", and the enterprise type at the second enterprise level is determined as "letting agency"; clearly, "letting agency" is a child node of "real estate and home decoration". The derivation continues, and the enterprise type at the third enterprise level is determined as "furniture"; clearly, "furniture" is not a child node of "letting agency", that is, the enterprise type at the third enterprise level does not belong to the enterprise type at the second enterprise level. The enterprise type at the second enterprise level, "letting agency", can therefore be determined as the target enterprise type of company information 1 to be classified.
In the hierarchical forward method, still taking Fig. 1 as an example, for company information 1 to be classified, the enterprise type at the third enterprise level is determined first as "furniture", and the enterprise type at the second enterprise level is determined as "letting agency"; clearly, "furniture" is not a child node of "letting agency", so the target enterprise type of company information 1 to be classified is certainly not "furniture". The derivation continues, and the enterprise type at the enterprise level above the second enterprise level (that is, at the first enterprise level) is determined as "real estate and home decoration"; clearly, "letting agency" is a child node of "real estate and home decoration". The enterprise type at the second enterprise level, "letting agency", is therefore determined as the target enterprise type of company information 1 to be classified.
Clearly, both the hierarchical forward method and the hierarchical backstepping method can reduce the classification error rate.
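The two screening rules can be sketched as follows. This is one reading of the worked example rather than a definitive implementation: the text does not say what to return when no parent-child relation holds, so the fallback to the first-level type is an assumption, and the `PARENT_OF` map (including the parent given for "furniture") is a hypothetical stand-in for the hierarchy of Fig. 1.

```python
# Hypothetical child -> parent map standing in for the hierarchy of Fig. 1.
PARENT_OF = {
    "letting agency": "real estate and home decoration",
    "furniture": "home furnishing",  # assumed parent, for illustration only
}


def backstepping(predictions, parent_of):
    """Walk down from the first level; stop at the first broken parent-child link.

    predictions[0] is the first-level type, predictions[1] the second, and so on.
    """
    target = predictions[0]
    for child in predictions[1:]:
        if parent_of.get(child) != target:
            break
        target = child
    return target


def forward(predictions, parent_of):
    """Walk up from the deepest level; return the first type that is a child
    of the type predicted one level above it."""
    for level in range(len(predictions) - 1, 0, -1):
        if parent_of.get(predictions[level]) == predictions[level - 1]:
            return predictions[level]
    return predictions[0]  # assumed fallback when no link holds


levels = ["real estate and home decoration", "letting agency", "furniture"]
target_back = backstepping(levels, PARENT_OF)
target_fwd = forward(levels, PARENT_OF)
```

In this example both rules settle on the second-level type, matching the outcome described for company information 1.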
Based on the above embodiments, referring to Fig. 6, in this embodiment of the present invention the company information classification apparatus includes at least a data acquisition unit 61, a processing unit 62 and a classification unit 63, wherein:
the data acquisition unit 61 is configured to obtain company information to be classified, extract from the company information to be classified several words that meet a set rule, and determine every two words as a word pair;
the processing unit 62 is configured to determine, based on a preset coupling network model, the complete correlation of each word pair in each preset enterprise type, where the complete correlation characterizes the degree of semantic association between two words, and all the enterprise types are at the same enterprise level;
the classification unit 63 is configured to determine, based on the complete correlation of each word pair in each enterprise type, the coupling probability that each word pair belongs to each enterprise type, and to determine the enterprise type with the maximum coupling probability as the enterprise type of the company information to be classified at the current enterprise level.
Optionally, the apparatus further includes a training unit 64, which is configured to:
perform the following operations before the company information to be classified is obtained:
obtain several pieces of company information, and select from them the pieces that meet a set screening rule to form a training sample set, where the enterprise type of every piece of company information in the training sample set has been determined;
according to the enterprise type of each piece of company information in the training sample set, determine the pieces of company information that belong to the same enterprise type as one training sample subset, where each training sample subset corresponds to one enterprise type, and the enterprise types corresponding to the training sample subsets are at the same enterprise level;
perform the following operations on each piece of company information in each training sample subset:
extract keywords whose number meets a set number or a set number range to form a keyword set;
determine every two keywords in the keyword set as a keyword pair, and calculate the complete correlation between the two keywords of each keyword pair.
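Forming the keyword pairs can be sketched with the standard library. This assumes that "every two keywords" means every unordered pair of distinct keywords; the deduplication and sorting are choices made for this sketch, and `keyword_pairs` is a hypothetical name.

```python
from itertools import combinations


def keyword_pairs(keywords):
    """Return every unordered pair of distinct keywords in the keyword set."""
    unique = sorted(set(keywords))      # drop repeats, fix a stable order
    return list(combinations(unique, 2))


pairs = keyword_pairs(["finance", "investment", "fund", "securities"])
```

A keyword set of n distinct keywords yields n(n-1)/2 pairs, so the set number or number range of extracted keywords bounds the cost of the correlation step.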
Optionally, when obtaining several pieces of company information and selecting from them the pieces that meet the set screening rule to form the training sample set, in which the enterprise type of every piece of company information has been determined, the training unit 64 is configured to:
crawl several pieces of company information using a preset web crawler, extract the enterprise name and the company profile information contained in each crawled piece of company information to form a corresponding information pair, and perform the following operations for each information pair:
use sentence splitting to extract the simple sentences contained in the company profile information of the information pair;
perform semantic mining on each simple sentence to extract the subject-verb-object components it contains, and, based on those components, construct for each simple sentence a canonical sentence pattern that conforms to the industry classification rules;
select the information pairs for which at least one canonical sentence pattern exists to form the training sample set, and perform the following operation for each information pair in the training sample set: based on a preset rule, select a target canonical sentence pattern from the corresponding at least one canonical sentence pattern, and determine the corresponding enterprise type based on the target canonical sentence pattern.
Optionally, when selecting, based on the preset rule, the target canonical sentence pattern from the corresponding at least one canonical sentence pattern and determining the corresponding enterprise type based on the target canonical sentence pattern, the training unit 64 is configured to:
determine, according to the order of the at least one canonical sentence pattern in the company profile information, the canonical sentence pattern that appears first as the target canonical sentence pattern, and, based on the target canonical sentence pattern, recall the corresponding information pair to the corresponding enterprise type; or,
randomly select one canonical sentence pattern from the at least one canonical sentence pattern as the target canonical sentence pattern, and, based on the target canonical sentence pattern, recall the corresponding information pair to the corresponding enterprise type.
Optionally, when determining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair, the training unit 64 is configured to:
calculate, based on the variance distribution, the weight of each keyword of the keyword set in the corresponding company profile information, determine every two keywords in the keyword set as a keyword pair, and determine, based on the weights of the two keywords of each keyword pair, the co-occurrence correlation between the two keywords of each keyword pair, where the co-occurrence correlation characterizes the association of the two keywords occurring together;
determine, based on the co-occurrence correlation between the two keywords of each keyword pair, the co-occurrence correlation probability between the two keywords of each keyword pair, where the co-occurrence correlation probability characterizes the ratio of the co-occurrence correlation between the two keywords to the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;
perform the following operation for each keyword pair: when it is judged that at least one intermediate keyword exists such that the co-occurrence correlation probabilities between each of the two keywords of the keyword pair and the at least one intermediate keyword are all greater than zero, determine the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword;
determine the complete correlation between the two keywords of each keyword pair based on the co-occurrence correlation probability and the coupling correlation between the two keywords of each keyword pair.
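The normalisation from co-occurrence correlation to co-occurrence correlation probability can be sketched as follows. The sketch reads "the co-occurrence correlation of all keyword pairs" as the sum over all pairs in the keyword set, which is an interpretation; the input values and the name `cooccurrence_probabilities` are hypothetical, and the weight-based computation of the co-occurrence correlation itself is not shown because this passage does not give its formula.

```python
def cooccurrence_probabilities(cooccurrence):
    """Divide each pair's co-occurrence correlation by the total over all pairs.

    cooccurrence maps an unordered keyword pair to its co-occurrence correlation.
    """
    total = sum(cooccurrence.values())
    if total == 0:
        # No pair co-occurs at all; every probability is zero.
        return {pair: 0.0 for pair in cooccurrence}
    return {pair: value / total for pair, value in cooccurrence.items()}


# Hypothetical co-occurrence correlations for one keyword set.
raw = {
    frozenset(("fund", "securities")): 3.0,
    frozenset(("fund", "stock")): 1.0,
}
probabilities = cooccurrence_probabilities(raw)
```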
Optionally, when determining the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the training unit 64 is configured to:
determine, based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the conditional correlation between the two keywords and the at least one intermediate keyword, where a conditional correlation between two keywords and an intermediate keyword indicates that, conditioned on that intermediate keyword, the two keywords are associated;
determine the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword.
Optionally, when determining the conditional correlation between the two keywords and the at least one intermediate keyword based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the training unit 64 is configured to:
perform the following operation for each intermediate keyword:
take the smaller of the co-occurrence correlation probabilities between each of the two keywords and the intermediate keyword as the conditional correlation between the two keywords and the intermediate keyword.
Optionally, when determining the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword, the training unit 64 is configured to:
compute the weighted average of the conditional correlations between the two keywords and each intermediate keyword of the at least one intermediate keyword, and determine the averaged result as the coupling correlation between the two keywords.
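The two steps above, taking the smaller probability per intermediate keyword and then averaging, can be sketched together. The weights of the weighted average are not specified in this passage, so the sketch defaults to equal weights as an assumption; the function name, the dictionary layout and the sample values are hypothetical.

```python
def coupling_correlation(a, b, prob, weights=None):
    """Weighted average, over intermediate keywords, of min(P(a, mid), P(b, mid)).

    prob maps frozenset({x, y}) to the co-occurrence correlation probability.
    Only intermediates whose probabilities with both a and b exceed zero count.
    """
    keywords = {k for pair in prob for k in pair}
    terms = []
    for mid in keywords - {a, b}:
        p_a = prob.get(frozenset((a, mid)), 0.0)
        p_b = prob.get(frozenset((b, mid)), 0.0)
        if p_a > 0 and p_b > 0:
            weight = 1.0 if weights is None else weights.get(mid, 1.0)
            terms.append((weight, min(p_a, p_b)))  # conditional correlation
    total_weight = sum(w for w, _ in terms)
    if total_weight == 0:
        return 0.0  # no usable intermediate keyword
    return sum(w * c for w, c in terms) / total_weight


# Hypothetical co-occurrence correlation probabilities.
prob = {
    frozenset(("fund", "stock")): 0.2,
    frozenset(("securities", "stock")): 0.3,
    frozenset(("fund", "insurance")): 0.4,
    frozenset(("securities", "insurance")): 0.1,
}
coupling = coupling_correlation("fund", "securities", prob)
```

Here "stock" contributes min(0.2, 0.3) and "insurance" contributes min(0.4, 0.1), so two keywords that never co-occur directly still receive a non-zero coupling correlation through shared intermediates.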
Optionally, when determining the coupling probability that each word pair belongs to each enterprise type based on the complete correlation of each word pair in each enterprise type, the classification unit 63 is configured to:
determine, based on the complete correlation of each word pair in each enterprise type, the class-conditional probability of each word pair in each enterprise type;
determine, based on the determined class-conditional probability of each word pair in each enterprise type and the prior probability of each enterprise type, the coupling probability that each word pair belongs to each enterprise type.
Optionally, the apparatus further includes a multi-level classification unit 65, which is configured to:
perform the following operations after the enterprise type with the maximum coupling probability is determined as the enterprise type of the company information to be classified at the current enterprise level:
determine the enterprise type of the company information to be classified at each preset different enterprise level;
select, based on a preset multi-level screening rule, one enterprise type from the enterprise types at the different enterprise levels as the target enterprise type of the company information to be classified.
In summary, in this embodiment of the present invention, the company information to be classified is first obtained; several words that meet a set rule are then extracted from it, and every two words are determined as a word pair; next, based on a preset coupling network model, the complete correlation of each word pair in each preset enterprise type is determined, where the complete correlation characterizes the degree of semantic association between two words and all the enterprise types are at the same enterprise level; finally, based on the complete correlation of each word pair in each enterprise type, the coupling probability that each word pair belongs to each enterprise type is determined, and the enterprise type with the maximum coupling probability is determined as the enterprise type of the company information to be classified at the current enterprise level. In this way, for directly obtained company information to be classified, the corresponding enterprise type can be determined from the degree of semantic association between the words extracted from it, which improves classification accuracy; moreover, because no manual operation is required, processing efficiency is also improved, which in turn improves the customer experience.
Further, based on the preset multi-level screening rule, one enterprise type is selected from the enterprise types corresponding to the enterprise levels as the target enterprise type of the company information to be classified. In this way, an enterprise type that better matches the actual needs of the company information to be classified can be selected from its enterprise types at the different enterprise levels, which further improves classification accuracy.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Clearly, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.
Claims (20)
1. A company information classification method, characterized by comprising:
obtaining company information to be classified, extracting from the company information to be classified several words that meet a set rule, and determining every two words as a word pair;
determining, based on a preset coupling network model, the complete correlation of each word pair in each preset enterprise type, wherein the complete correlation characterizes the degree of semantic association between two words, and the enterprise types are at the same enterprise level;
determining, based on the complete correlation of each word pair in each enterprise type, the coupling probability that each word pair belongs to each enterprise type, and determining the enterprise type with the maximum coupling probability as the enterprise type of the company information to be classified at the current enterprise level.
2. The method according to claim 1, characterized in that, before the company information to be classified is obtained, the method further comprises:
obtaining several pieces of company information, and selecting from them the pieces that meet a set screening rule to form a training sample set, wherein the enterprise type of every piece of company information in the training sample set has been determined;
according to the enterprise type of each piece of company information in the training sample set, determining the pieces of company information that belong to the same enterprise type as one training sample subset, wherein each training sample subset corresponds to one enterprise type, and the enterprise types corresponding to the training sample subsets are at the same enterprise level;
performing the following operations on each piece of company information in each training sample subset:
extracting keywords whose number meets a set number or a set number range to form a keyword set;
determining every two keywords in the keyword set as a keyword pair, and calculating the complete correlation between the two keywords of each keyword pair.
3. The method according to claim 2, characterized in that obtaining several pieces of company information and selecting from them the pieces that meet the set screening rule to form the training sample set, wherein the enterprise type of every piece of company information in the training sample set has been determined, comprises:
crawling several pieces of company information using a preset web crawler, extracting the enterprise name and the company profile information contained in each crawled piece of company information to form a corresponding information pair, and performing the following operations for each information pair:
using sentence splitting to extract the simple sentences contained in the company profile information of the information pair;
performing semantic mining on each simple sentence to extract the subject-verb-object components it contains, and, based on those components, constructing for each simple sentence a canonical sentence pattern that conforms to the industry classification rules;
selecting the information pairs for which at least one canonical sentence pattern exists to form the training sample set, and performing the following operation for each information pair in the training sample set: based on a preset rule, selecting a target canonical sentence pattern from the corresponding at least one canonical sentence pattern, and determining the corresponding enterprise type based on the target canonical sentence pattern.
4. The method according to claim 3, characterized in that selecting, based on the preset rule, the target canonical sentence pattern from the corresponding at least one canonical sentence pattern and determining the corresponding enterprise type based on the target canonical sentence pattern comprises:
determining, according to the order of the at least one canonical sentence pattern in the company profile information, the canonical sentence pattern that appears first as the target canonical sentence pattern, and, based on the target canonical sentence pattern, recalling the corresponding information pair to the corresponding enterprise type; or,
randomly selecting one canonical sentence pattern from the at least one canonical sentence pattern as the target canonical sentence pattern, and, based on the target canonical sentence pattern, recalling the corresponding information pair to the corresponding enterprise type.
5. The method according to claim 2, 3 or 4, characterized in that determining every two keywords in the keyword set as a keyword pair and calculating the complete correlation between the two keywords of each keyword pair comprises:
calculating, based on the variance distribution, the weight of each keyword of the keyword set in the corresponding company profile information, determining every two keywords in the keyword set as a keyword pair, and determining, based on the weights of the two keywords of each keyword pair, the co-occurrence correlation between the two keywords of each keyword pair, wherein the co-occurrence correlation characterizes the association of the two keywords occurring together;
determining, based on the co-occurrence correlation between the two keywords of each keyword pair, the co-occurrence correlation probability between the two keywords of each keyword pair, wherein the co-occurrence correlation probability characterizes the ratio of the co-occurrence correlation between the two keywords to the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;
performing the following operation for each keyword pair: when it is judged that at least one intermediate keyword exists such that the co-occurrence correlation probabilities between each of the two keywords of the keyword pair and the at least one intermediate keyword are all greater than zero, determining the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword;
determining the complete correlation between the two keywords of each keyword pair based on the co-occurrence correlation probability and the coupling correlation between the two keywords of each keyword pair.
6. The method according to claim 5, characterized in that determining the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword comprises:
determining, based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the conditional correlation between the two keywords and the at least one intermediate keyword, wherein a conditional correlation between two keywords and an intermediate keyword indicates that, conditioned on that intermediate keyword, the two keywords are associated;
determining the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword.
7. The method according to claim 6, characterized in that determining the conditional correlation between the two keywords and the at least one intermediate keyword based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword comprises:
performing the following operation for each intermediate keyword:
taking the smaller of the co-occurrence correlation probabilities between each of the two keywords and the intermediate keyword as the conditional correlation between the two keywords and the intermediate keyword.
8. The method according to claim 6, characterized in that determining the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword comprises:
computing the weighted average of the conditional correlations between the two keywords and each intermediate keyword of the at least one intermediate keyword, and determining the averaged result as the coupling correlation between the two keywords.
9. The method according to claim 1, characterized in that determining the coupling probability that each word pair belongs to each enterprise type based on the complete correlation of each word pair in each enterprise type comprises:
determining, based on the complete correlation of each word pair in each enterprise type, the class-conditional probability of each word pair in each enterprise type;
determining, based on the determined class-conditional probability of each word pair in each enterprise type and the prior probability of each enterprise type, the coupling probability that each word pair belongs to each enterprise type.
10. The method according to claim 1, characterized in that, after the enterprise type with the maximum coupling probability is determined as the enterprise type of the company information to be classified at the current enterprise level, the method further comprises:
determining the enterprise type of the company information to be classified at each preset different enterprise level;
selecting, based on a preset multi-level screening rule, one enterprise type from the enterprise types at the different enterprise levels as the target enterprise type of the company information to be classified.
11. A company information classification device, characterized by comprising:
A data acquisition unit, configured to obtain company information to be classified, extract from the company information to be classified a number of words that satisfy a set rule, and define every two words as one word pair;
A processing unit, configured to determine, based on a preset coupling network model, the complete correlation of each word pair in each preset enterprise type, wherein the complete correlation characterizes the degree of semantic association between two words, and all of the enterprise types are at the same enterprise level;
A classification unit, configured to determine, based on the complete correlation of each word pair in each enterprise type, the coupling probability that each word pair belongs to each enterprise type, and to determine the enterprise type corresponding to the maximum coupling probability as the enterprise type of the company information to be classified at the current enterprise level.
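The pairing step and the final selection step of the classification device can be illustrated with a minimal Python sketch. It assumes that a word pair is an unordered pair of distinct words and that the coupling probabilities have already been computed elsewhere; the function names and the sample values are hypothetical and do not come from the patent.

```python
from itertools import combinations


def make_word_pairs(words):
    """Define every two distinct words as one unordered word pair."""
    unique_words = sorted(set(words))  # drop repeats so no pair contains the same word twice
    return list(combinations(unique_words, 2))


def pick_enterprise_type(coupling_probability):
    """Return the enterprise type whose coupling probability is largest."""
    return max(coupling_probability, key=coupling_probability.get)


# Hypothetical words extracted from one piece of company information.
pairs = make_word_pairs(["software", "cloud", "retail", "cloud"])

# Hypothetical coupling probabilities at one enterprise level.
best_type = pick_enterprise_type({"IT services": 0.62, "Retail": 0.31, "Finance": 0.07})
```

With n distinct words this produces n(n-1)/2 word pairs, so the cost of the later correlation lookups grows quadratically with the number of extracted words.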
12. The device as claimed in claim 11, characterized by further comprising a training unit, the training unit being configured to perform the following operations before the company information to be classified is obtained:
Obtaining a number of pieces of company information, and filtering out from them the pieces of company information that satisfy a set screening rule to form a training sample set, wherein each piece of company information in the training sample set has a determined corresponding enterprise type;
According to the enterprise type corresponding to each piece of company information in the training sample set, defining the pieces of company information that belong to the same enterprise type as one training sample subset, wherein each training sample subset corresponds to one enterprise type, and the enterprise types corresponding to the training sample subsets are all at the same enterprise level;
For the pieces of company information in each training sample subset, respectively performing the following operations:
Extracting keywords that satisfy a set number or a set number range to form a keyword set;
Defining every two keywords in the keyword set as one keyword pair, and calculating the complete correlation between the two keywords of each keyword pair.
13. The device as claimed in claim 12, characterized in that, when obtaining a number of pieces of company information and filtering out from them the pieces of company information that satisfy a set screening rule to form a training sample set, wherein each piece of company information in the training sample set has a determined corresponding enterprise type, the training unit is configured to:
Crawl a number of pieces of company information using a preset web crawler, extract from each crawled piece of company information the enterprise name and company profile information it contains to form an information pair, and perform the following operations for each information pair:
Using clause segmentation, extract the simple sentences contained in the company profile information of the information pair;
Perform semantic mining on each simple sentence to extract the subject-verb-object components it contains, and, based on the subject-verb-object components of each simple sentence, construct for each simple sentence a canonical clause that conforms to the industry classification rule;
Filter out each information pair for which at least one canonical clause is determined to exist to form the training sample set, and, for each information pair in the training sample set, perform the following operations: based on a preset rule, select a target canonical clause from the corresponding at least one canonical clause, and, based on the target canonical clause, determine the corresponding enterprise type.
14. The device as claimed in claim 13, characterized in that, when selecting a target canonical clause from the corresponding at least one canonical clause based on a preset rule and determining the corresponding enterprise type based on the target canonical clause, the training unit is configured to:
According to the order of the at least one canonical clause in the company profile information, determine the canonical clause that appears earliest as the target canonical clause, and, based on the target canonical clause, recall the corresponding information pair to the corresponding enterprise type; or,
Randomly select one canonical clause from the at least one canonical clause as the target canonical clause, and, based on the target canonical clause, recall the corresponding information pair to the corresponding enterprise type.
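The two alternatives for choosing the target canonical clause (the earliest one in the company profile, or a random one) can be sketched as follows. This is an illustrative Python sketch only: it assumes the canonical clauses are already stored in the order in which they appear in the profile, and the function name, the `strategy` parameter and the sample clauses are hypothetical.

```python
import random


def select_target_clause(clauses, strategy="first", rng=None):
    """Pick the target canonical clause from a non-empty, profile-ordered list."""
    if not clauses:
        raise ValueError("at least one canonical clause is required")
    if strategy == "first":
        # Alternative 1: the clause that appears earliest in the company profile.
        return clauses[0]
    if strategy == "random":
        # Alternative 2: a uniformly random clause; rng allows a seeded generator.
        return (rng or random).choice(clauses)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical canonical clauses, listed in profile order.
clauses = ["company develops software", "company sells hardware"]
first_choice = select_target_clause(clauses)
random_choice = select_target_clause(clauses, "random", random.Random(0))
```

Passing a seeded `random.Random` instance keeps the random alternative reproducible when the training sample set is rebuilt.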
15. The device as described in claim 12, 13 or 14, characterized in that, when defining every two keywords in the keyword set as one keyword pair and calculating the complete correlation between the two keywords of each keyword pair, the training unit is configured to:
Based on a variance distribution, calculate the weight value of each keyword in the keyword set within the corresponding company profile information, define every two keywords in the keyword set as one keyword pair, and, based on the weight values of the two keywords of each keyword pair, determine the co-occurrence correlation between the two keywords of each keyword pair, wherein the co-occurrence correlation characterizes the association of two keywords occurring together;
Based on the co-occurrence correlation between the two keywords of each keyword pair, determine the co-occurrence correlation probability between the two keywords of each keyword pair, wherein the co-occurrence correlation probability characterizes the ratio of the co-occurrence correlation between the two keywords to the co-occurrence correlation of all keyword pairs in the keyword set to which they belong;
For each keyword pair, perform the following operation: when it is judged that there exists at least one intermediate keyword such that the co-occurrence correlation probabilities between each of the two keywords of the keyword pair and the at least one intermediate keyword are all greater than zero, determine the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword;
Based on the co-occurrence correlation probability and the coupling correlation between the two keywords of each keyword pair, determine the complete correlation between the two keywords of each keyword pair.
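Claim 15 describes the co-occurrence correlation probability of a keyword pair as that pair's share of the total co-occurrence correlation over all keyword pairs in the same keyword set. A minimal Python sketch of that normalization follows. It assumes the co-occurrence correlation values are already available (the claims do not give their formula), and the function name and sample values are hypothetical.

```python
def cooccurrence_probabilities(cooccurrence):
    """Normalize co-occurrence correlations so that they sum to 1 over the keyword set.

    cooccurrence maps an unordered keyword pair (a frozenset) to a non-negative value.
    """
    total = sum(cooccurrence.values())
    if total == 0:
        # No pair co-occurs at all, so every probability is zero.
        return {pair: 0.0 for pair in cooccurrence}
    return {pair: value / total for pair, value in cooccurrence.items()}


# Hypothetical co-occurrence correlations for one keyword set.
cooccurrence = {
    frozenset(("cloud", "software")): 2.0,
    frozenset(("cloud", "server")): 1.0,
    frozenset(("software", "server")): 1.0,
}
probabilities = cooccurrence_probabilities(cooccurrence)
```

Keying the dictionary by `frozenset` makes the pair unordered, so the pair (cloud, software) and the pair (software, cloud) share one entry.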
16. The device as claimed in claim 15, characterized in that, when determining the coupling correlation between the two keywords based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the training unit is configured to:
Based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, determine the conditional correlation between the two keywords and the at least one intermediate keyword, wherein the existence of a conditional correlation between two keywords and one intermediate keyword indicates that, with that intermediate keyword as the condition, the two keywords are correlated;
Based on the conditional correlation between the two keywords and the at least one intermediate keyword, determine the coupling correlation between the two keywords.
17. The device as claimed in claim 16, characterized in that, when determining the conditional correlation between the two keywords and the at least one intermediate keyword based on the co-occurrence correlation probabilities between each of the two keywords and the at least one intermediate keyword, the training unit is configured to:
For each intermediate keyword, perform the following operation:
Take the smaller of the co-occurrence correlation probabilities between each of the two keywords and the intermediate keyword as the conditional correlation between the two keywords and the intermediate keyword.
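The rule in claim 17 (take the smaller of the two co-occurrence probabilities as the conditional correlation with an intermediate, or interim, keyword) can be sketched together with the search for qualifying intermediate keywords. This Python sketch assumes the co-occurrence probabilities are stored per unordered keyword pair; all names and sample values are hypothetical.

```python
def conditional_correlations(first, second, probability, vocabulary):
    """Return {intermediate keyword: conditional correlation} for one keyword pair.

    A keyword qualifies as intermediate only if its co-occurrence probability
    with both keywords of the pair is greater than zero.
    """
    result = {}
    for candidate in vocabulary:
        if candidate in (first, second):
            continue  # a keyword cannot be its own intermediate keyword
        p_first = probability.get(frozenset((first, candidate)), 0.0)
        p_second = probability.get(frozenset((second, candidate)), 0.0)
        if p_first > 0 and p_second > 0:
            # The conditional correlation is the smaller of the two probabilities.
            result[candidate] = min(p_first, p_second)
    return result


# Hypothetical co-occurrence probabilities for one keyword set.
probability = {
    frozenset(("cloud", "server")): 0.5,
    frozenset(("software", "server")): 0.25,
    frozenset(("cloud", "retail")): 0.25,
}
vocabulary = ["cloud", "software", "server", "retail"]
conditional = conditional_correlations("cloud", "software", probability, vocabulary)
```

In this example "server" qualifies because it co-occurs with both "cloud" and "software", while "retail" is skipped because it never co-occurs with "software".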
18. The device as claimed in claim 16, characterized in that, when determining the coupling correlation between the two keywords based on the conditional correlation between the two keywords and the at least one intermediate keyword, the training unit is configured to:
Compute a weighted average of the conditional correlations between the two keywords and each intermediate keyword of the at least one intermediate keyword, and determine the averaged result as the coupling correlation between the two keywords.
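The weighted average in claim 18 can be written as a short Python sketch. The claims do not say how the weights are chosen, so this sketch assumes they are supplied by the caller and falls back to equal weights; the function name, the `weights` parameter and the sample values are hypothetical.

```python
def coupling_correlation(conditional, weights=None):
    """Weighted average of conditional correlations over the intermediate keywords.

    conditional maps each intermediate keyword to its conditional correlation.
    weights maps each intermediate keyword to a positive weight (assumed input).
    """
    if not conditional:
        return 0.0  # no intermediate keyword, so no coupling correlation
    if weights is None:
        weights = {keyword: 1.0 for keyword in conditional}  # equal weights by default
    total_weight = sum(weights[keyword] for keyword in conditional)
    weighted_sum = sum(weights[keyword] * value for keyword, value in conditional.items())
    return weighted_sum / total_weight


# Hypothetical conditional correlations for one keyword pair.
conditional_values = {"server": 0.25, "platform": 0.5}
equal_weighted = coupling_correlation(conditional_values)
custom_weighted = coupling_correlation(conditional_values, {"server": 3.0, "platform": 1.0})
```

Dividing by the sum of the weights keeps the result inside the range of the input conditional correlations regardless of the weight scale.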
19. The device as claimed in claim 11, characterized in that, when determining, based on the complete correlation of each word pair in each enterprise type, the coupling probability that each word pair belongs to each enterprise type, the classification unit is configured to:
Determine, based on the complete correlation of each word pair in each enterprise type, the class conditional probability of each word pair in each enterprise type;
Determine, based on the determined class conditional probability of each word pair in each enterprise type and on the prior probability of each enterprise type, the coupling probability that each word pair belongs to each enterprise type.
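One plausible reading of this step is Bayes' rule: the coupling probability of an enterprise type for a word pair is proportional to the class conditional probability of the pair in that type multiplied by the prior probability of the type. The Python sketch below follows that reading. The normalization over enterprise types is an assumption, not something the claims state, and the function name and sample values are hypothetical.

```python
def coupling_probabilities(class_conditional, prior):
    """Combine P(pair | type) with P(type) for one word pair.

    class_conditional[type] is the class conditional probability of the pair.
    prior[type] is the prior probability of the enterprise type.
    The result is normalized over enterprise types (assumed Bayes-style posterior).
    """
    joint = {t: class_conditional.get(t, 0.0) * prior[t] for t in prior}
    evidence = sum(joint.values())
    if evidence == 0:
        return {t: 0.0 for t in prior}  # the pair is unseen in every enterprise type
    return {t: value / evidence for t, value in joint.items()}


# Hypothetical probabilities for one word pair and two enterprise types.
posterior = coupling_probabilities(
    {"IT services": 0.5, "Retail": 0.25},
    {"IT services": 0.5, "Retail": 0.5},
)
most_likely = max(posterior, key=posterior.get)
```

How the per-pair coupling probabilities are aggregated across all word pairs of one piece of company information is not covered by this sketch.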
20. The device as claimed in claim 11, characterized by further comprising a multi-level classification unit, the multi-level classification unit being configured to:
After the enterprise type corresponding to the maximum coupling probability is determined as the enterprise type of the company information to be classified at the current enterprise level, perform the following operations:
Determine the enterprise type of the company information to be classified at each preset different enterprise level;
Based on a preset multi-level screening rule, select one enterprise type from the enterprise types at the different enterprise levels as the target enterprise type of the company information to be classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710339393.2A CN107193915A (en) | 2017-05-15 | 2017-05-15 | A kind of company information sorting technique and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710339393.2A CN107193915A (en) | 2017-05-15 | 2017-05-15 | A kind of company information sorting technique and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107193915A (en) | 2017-09-22 |
Family
ID=59872414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710339393.2A (Pending, CN107193915A (en)) | A kind of company information sorting technique and device | 2017-05-15 | 2017-05-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107193915A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160357845A1 (en) * | 2014-04-29 | 2016-12-08 | Tencent Technology (Shenzhen) Company Limited | Method and Apparatus for Classifying Object Based on Social Networking Service, and Storage Medium |
CN105224577A (en) * | 2014-07-01 | 2016-01-06 | 清华大学 | Multi-label text classification method and system |
CN106372117A (en) * | 2016-08-23 | 2017-02-01 | 电子科技大学 | Word co-occurrence-based text classification method and apparatus |
Non-Patent Citations (2)
Title |
---|
王洪佳: ""基于相对密度的多耦合文本聚类算法"", 《计算机应用研究》 * |
王雪飞: ""词间相关性对文本分类的影响"", 《万方》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107944480A (en) * | 2017-11-16 | 2018-04-20 | 广州探迹科技有限公司 | A kind of enterprises ' industry sorting technique |
CN107944480B (en) * | 2017-11-16 | 2020-11-24 | 广州探迹科技有限公司 | Enterprise industry classification method |
CN108009248A (en) * | 2017-11-30 | 2018-05-08 | 国信优易数据有限公司 | A kind of data classification method and system |
CN108960772A (en) * | 2018-06-27 | 2018-12-07 | 北京窝头网络科技有限公司 | Enterprise's evaluation householder method and system based on deep learning |
CN109299362A (en) * | 2018-09-21 | 2019-02-01 | 平安科技(深圳)有限公司 | Similar enterprise's recommended method, device, computer equipment and storage medium |
CN109299362B (en) * | 2018-09-21 | 2023-04-14 | 平安科技(深圳)有限公司 | Similar enterprise recommendation method and device, computer equipment and storage medium |
CN109558481B (en) * | 2018-12-03 | 2022-05-24 | 中国科学技术信息研究所 | Method, device and equipment for measuring correlation between patent and enterprise and readable storage medium |
CN109558481A (en) * | 2018-12-03 | 2019-04-02 | 中国科学技术信息研究所 | Patent and Business Relevancy Measurement Method, device, equipment and readable storage medium storing program for executing |
CN110619067A (en) * | 2019-08-27 | 2019-12-27 | 深圳证券交易所 | Industry classification-based retrieval method and retrieval device and readable storage medium |
CN111191091A (en) * | 2019-12-30 | 2020-05-22 | 成都数联铭品科技有限公司 | Data classification method and system |
CN112131378A (en) * | 2020-08-20 | 2020-12-25 | 彭涛 | Method and device for identifying categories of civil problems and electronic equipment |
CN112487263A (en) * | 2020-11-26 | 2021-03-12 | 杭州安恒信息技术股份有限公司 | Information processing method, system, equipment and computer readable storage medium |
CN114722819A (en) * | 2022-02-16 | 2022-07-08 | 平安科技(深圳)有限公司 | Entity type classification and identification method, device, equipment and medium |
CN114722819B (en) * | 2022-02-16 | 2024-01-19 | 平安科技(深圳)有限公司 | Entity type classification and identification method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107193915A (en) | A kind of company information sorting technique and device | |
CN111177569B (en) | Recommendation processing method, device and equipment based on artificial intelligence | |
CN108446540B (en) | Program code plagiarism type detection method and system based on source code multi-label graph neural network | |
Pan et al. | Propensity score analysis: Fundamentals and developments | |
CN106202518B (en) | Short text classification method based on CHI and sub-category association rule algorithm | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
US7028250B2 (en) | System and method for automatically classifying text | |
CN113761218B (en) | Method, device, equipment and storage medium for entity linking | |
CN105045875B (en) | Personalized search and device | |
CN107301171A (en) | A kind of text emotion analysis method and system learnt based on sentiment dictionary | |
CN108563703A (en) | A kind of determination method of charge, device and computer equipment, storage medium | |
WO2022048363A1 (en) | Website classification method and apparatus, computer device, and storage medium | |
CN108717433A (en) | A kind of construction of knowledge base method and device of programming-oriented field question answering system | |
CN107844533A (en) | A kind of intelligent Answer System and analysis method | |
CN108875809A (en) | The biomedical entity relationship classification method of joint attention mechanism and neural network | |
CN110008309A (en) | A kind of short phrase picking method and device | |
CN110532352A (en) | Text duplicate checking method and device, computer readable storage medium, electronic equipment | |
CN110334343B (en) | Method and system for extracting personal privacy information in contract | |
CN109947934A (en) | For the data digging method and system of short text | |
Movshovitz-Attias et al. | Kb-lda: Jointly learning a knowledge base of hierarchy, relations, and facts | |
CN107368526A (en) | A kind of data processing method and device | |
CN106951420A (en) | Literature search method and apparatus, author's searching method and equipment | |
CN109857952A (en) | A kind of search engine and method for quickly retrieving with classification display | |
CN116362243A (en) | Text key phrase extraction method, storage medium and device integrating incidence relation among sentences | |
CN106649262B (en) | Method for protecting sensitive information of enterprise hardware facilities in social media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170922 |