Detailed Description of the Embodiments
Referring to Fig. 1, in a first aspect, an embodiment of the present application provides a method for mining enterprise association relationship information. The method includes the following steps:
Step 101: Obtain the text to be detected.
The text to be detected may be obtained from the network, for example from news websites; it may be sent to the server by a technician operating a terminal; or it may be data collected by staff visiting the industrial and commercial bureau. The embodiments of the present invention do not limit this.
Step 102: Split the text to be detected to obtain at least one clause.
The text to be detected may be split as follows: starting from the beginning of the text, search for preset punctuation marks and determine the characters between two preset punctuation marks as one clause, thereby obtaining at least one clause. The preset punctuation marks may be separators between sentences, such as full stops, commas, exclamation marks and semicolons.
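The splitting described in step 102 could be sketched as follows. This is only an illustration under stated assumptions: the function name split_into_clauses and the exact contents of PRESET_PUNCTUATION are hypothetical, and the text itself only lists full stops, commas, exclamation marks and semicolons as examples.

```python
import re

# Hypothetical set of preset punctuation marks (full-width and ASCII forms).
# The description only names these marks as examples.
PRESET_PUNCTUATION = "。，！；.,!;"


def split_into_clauses(text: str) -> list[str]:
    """Split text at every preset punctuation mark and drop empty pieces."""
    pattern = "[" + re.escape(PRESET_PUNCTUATION) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


clauses = split_into_clauses(
    "A invested in B, and the capital is large; C is the president!"
)
```

Note that splitting on the ASCII comma and full stop would also cut names such as "Co., Ltd" or numbers such as "1,000,000"; a practical punctuation set would have to account for that.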
Step 103: Perform word segmentation and part-of-speech tagging on each clause.
In this embodiment, an NLP (Natural Language Processing) system may be used to segment each clause into words and, at the same time, tag the part of speech of each segmented word. The words obtained by segmenting each clause may then be arranged from front to back in the original word order of the clause.
Step 104: Identify the association relationship words in each clause.
Staff may build a data model according to actual mining requirements. The data model includes the types of association relationship words, the association relationship words belonging to each type, and multiple extension expressions corresponding to each association relationship word, where an extension expression may be a regular expression. Each segmented word is matched in turn against the association relationship words in the data model and their corresponding extension expressions, so as to identify the association relationship words in the clause.
Step 105: Determine whether the association relationship word is an organization association relationship word; if it is, perform step 106.
Step 106: Determine first enterprise association relationship information using a Cartesian product algorithm, according to the parts of speech of the segmented words in the clause where the association relationship word is located.
There are various types of association relationship words, for example a membership relationship type and an investment relationship type. The membership relationship type includes words such as president and general manager, and the investment relationship type includes words such as investment and financing.
As can be seen from the above technical solution, the present application provides a method for mining enterprise association relationship information: obtaining the text to be detected; splitting the text to be detected to obtain at least one clause; performing word segmentation and part-of-speech tagging on each clause; identifying the association relationship words in each clause; determining whether the association relationship word is an organization association relationship word; and, if it is, determining first enterprise association relationship information using a Cartesian product algorithm, according to the parts of speech of the segmented words in the clause where the association relationship word is located. The present application therefore does not require staff to search the text to be detected for enterprise association relationship information, which improves the efficiency of mining enterprise association relationship information; and because no subjective judgment by staff is required, the accuracy of mining is also improved.
Referring to Fig. 2, another embodiment of the present application provides a method for mining enterprise association relationship information. The method includes the following steps:
Step 201: Obtain the text to be detected.
The text to be detected may be obtained from the network, for example from news websites; it may be sent to the server by a technician operating a terminal; or it may be data collected by staff visiting the industrial and commercial bureau. The embodiments of the present invention do not limit this.
After the text to be detected is obtained, it is preprocessed using ETL: garbled characters, advertisements and illegal characters are removed from the text, and letters, brackets and the like are normalized to a uniform form. This facilitates subsequent information processing and improves the accuracy of mining.
ETL stands for the extraction, transformation and loading of data. It is the process of extracting data from business systems, cleaning and transforming it, and loading it into a data warehouse. Its purpose is to integrate an enterprise's scattered, messy data with inconsistent standards into one place.
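As a rough illustration of the cleaning part of this preprocessing, the sketch below normalizes letters and brackets and removes unprintable characters. It is an assumption, not the method of the embodiment: Unicode NFKC normalization is swapped in here as one possible way to unify full-width letters and brackets, the function name preprocess is hypothetical, and advertisement removal is not covered.

```python
import unicodedata


def preprocess(text: str) -> str:
    """Normalize letter and bracket forms, then drop unprintable characters."""
    # NFKC maps full-width letters and brackets to their ASCII counterparts.
    normalized = unicodedata.normalize("NFKC", text)
    # Keep printable characters and line breaks; drop control characters.
    return "".join(ch for ch in normalized if ch.isprintable() or ch == "\n")


cleaned = preprocess("ＡＢＣ（北京）\x00")
```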
Step 202: Split the text to be detected to obtain at least one clause.
The text to be detected may be split as follows: starting from the beginning of the text, search for preset punctuation marks and determine the characters between two preset punctuation marks as one clause, thereby obtaining at least one clause. The preset punctuation marks may be separators between sentences, such as full stops, commas, exclamation marks and semicolons.
Step 203: Perform word segmentation and part-of-speech tagging on each clause.
In this embodiment, NLP (Natural Language Processing) technology may be used to segment each clause into words and, at the same time, tag the part of speech of each segmented word. The words obtained by segmenting each clause may then be arranged from front to back in the original word order of the clause.
Step 204: Identify the association relationship words in each clause.
Staff may build a data model according to actual mining requirements. The data model includes the types of association relationship words, the association relationship words belonging to each type, and multiple extension expressions corresponding to each association relationship word. There are various types of association relationship words, for example a membership relationship type and an investment relationship type; the membership relationship type includes words such as president and general manager, and the investment relationship type includes words such as investment and financing. In addition, an extension expression may be a regular expression. A regular expression consists of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits, while metacharacters have special meanings. The metacharacters include the following characters: [ ] \ ^ $ . | ? * + ( ). Metacharacters serve specific purposes. For example, "." matches any character other than the line-break characters "\n" and "\r". "?" matches the character before it zero times or once; when it immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching mode is non-greedy. Non-greedy mode matches as little of the searched string as possible, whereas the default greedy mode matches as much of the searched string as possible. "|" performs a logical OR between two match conditions.
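The metacharacter behaviour described above can be checked with a short sketch using Python's standard re module. The sample strings are made up for illustration. One caveat: by default Python's "." excludes only "\n", so the treatment of "\r" depends on the regular expression engine in use.

```python
import re

text = "<a><b>"

# Greedy: ".*" takes as much as possible, so the match spans both tags.
greedy = re.search(r"<.*>", text).group()

# Non-greedy: "?" after the quantifier "*" makes it take as little as possible.
non_greedy = re.search(r"<.*?>", text).group()

# "|" is a logical OR between two alternatives.
alternatives = re.findall(r"investment|financing", "investment and financing")

# "?" on its own means "zero or one of the preceding character".
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# "." does not match a line break.
dot_on_newline = re.search(r"a.b", "a\nb")
```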
Each segmented word is matched in turn against the association relationship words in the data model and their corresponding extension expressions, so as to identify the association relationship words in the clause. This embodiment does not limit the specific matching method. For example, for the clause "The president of Beijing Divine Land Tai Yue software limited liability company is Wang Ning", segmenting it with the NLP system yields three segmented words, "Beijing Divine Land Tai Yue limited liability company", "president" and "Wang Ning", which are tagged with parts of speech at the same time: "Beijing Divine Land Tai Yue limited liability company" is an entity organization name, "Wang Ning" is a person name, and "president" is a noun. The data model built in advance by staff is then used. In this data model the type of association relationship word is the membership relationship type, the association relationship words are president, general manager and so on, and the regular expression for the association relationship word "president" is "(is|serves as).{0,20}president". The association relationship words in the data model and their corresponding regular expressions are each matched against the segmented words, so that the segmented word "president" can be determined to be the association relationship word in the clause.
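The matching of segmented words against the data model could look roughly like the sketch below. The dictionary layout, the name find_relation_words and the English regular expressions are all assumptions made for illustration; the embodiment explicitly leaves the matching method open.

```python
import re

# Hypothetical data model: relation type -> relation word -> extension expressions.
DATA_MODEL = {
    "membership": {
        "president": [r"(is|serves as).{0,20}president"],
        "general manager": [r"(is|serves as).{0,20}general manager"],
    },
    "investment": {
        "investment": [r"invest(s|ed|ment)?"],
        "financing": [r"financ(e|ed|ing)"],
    },
}


def find_relation_words(tokens):
    """Return (token, relation_type) for every token that matches the model."""
    hits = []
    for token in tokens:
        for relation_type, words in DATA_MODEL.items():
            for word, expressions in words.items():
                # A token matches if it equals the relation word itself
                # or fully matches one of its extension expressions.
                if token == word or any(re.fullmatch(e, token) for e in expressions):
                    hits.append((token, relation_type))
                    break  # at most one hit per token within a relation type
    return hits


tokens = ["Beijing Divine Land Tai Yue limited liability company", "president", "Wang Ning"]
relation_words = find_relation_words(tokens)
```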
Step 205: Determine whether the association relationship word is an organization association relationship word; if it is, perform step 206.
There are various types of association relationship words, for example a membership relationship type and an investment relationship type. The membership relationship type includes words such as president and general manager, and the investment relationship type includes words such as investment and financing.
Step 206: From the clause where the association relationship word is located, extract the segmented words whose noun type is entity organization name and the segmented words whose noun type is person name.
Specifically, nouns can be subdivided into entity organization names, person names, geographic names and so on. In the embodiments of the present application, only the segmented words that are entity organization names or person names need to be extracted from the clause.
Continuing the above example, if the type of the association relationship word is the membership relationship type, for example senior executive information, then the segmented word "Beijing Divine Land Tai Yue limited liability company", which is an entity organization name, and the segmented word "Wang Ning", which is a person name, are extracted from the clause where the association relationship word is located.
Step 207: If the number of segmented words that are entity organization names and the number of segmented words that are person names are both one, generate the enterprise association relationship information between the entity organization name and the person name.
In the above example, there is exactly one entity organization name and exactly one person name, so the enterprise association relationship information "Beijing Divine Land Tai Yue limited liability company-president-Wang Ning" is generated directly.
Step 208: If the number of segmented words that are entity organization names and/or the number of segmented words that are person names is at least two, generate a first set and a second set. The first set and the second set are each the set composed of all segmented words that are entity organization names and all segmented words that are person names.
Consider a clause in which the number of entity organization names and/or the number of person names is at least two, for example "The president of Beijing ### Co., Ltd is king xx and Lee xx". After word segmentation and part-of-speech tagging, the segmented words that are person names are "king xx" and "Lee xx", and the segmented word that is an entity organization name is "Beijing ### Co., Ltd". These segmented words are then formed into two sets: the first set {Beijing ### Co., Ltd, king xx, Lee xx} and the second set {Beijing ### Co., Ltd, king xx, Lee xx}.
Step 209: Take the Cartesian product of the first set and the second set to obtain multiple subsets.
In mathematics, the Cartesian product of two sets is the set of all ordered pairs whose first element comes from the first set and whose second element comes from the second set. In the above example, taking the Cartesian product of the first set {Beijing ### Co., Ltd, king xx, Lee xx} and the second set {Beijing ### Co., Ltd, king xx, Lee xx} yields the following subsets: <Beijing ### Co., Ltd, Beijing ### Co., Ltd>, <Beijing ### Co., Ltd, king xx>, <Beijing ### Co., Ltd, Lee xx>, <king xx, Beijing ### Co., Ltd>, <king xx, king xx>, <king xx, Lee xx>, <Lee xx, Beijing ### Co., Ltd>, <Lee xx, king xx>, <Lee xx, Lee xx>.
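The nine subsets listed above can be reproduced with the standard library function itertools.product. This is only a sketch of the set operation itself; the variable names are illustrative.

```python
from itertools import product

first_set = ["Beijing ### Co., Ltd", "king xx", "Lee xx"]
second_set = ["Beijing ### Co., Ltd", "king xx", "Lee xx"]

# Every ordered pair (element of first_set, element of second_set).
pairs = list(product(first_set, second_set))
```

Lists are used instead of Python sets so that the pairs come out in clause order, which the later discarding steps rely on.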
Step 210: Determine whether the segmented words in each subset are identical; if the segmented words in a subset are identical, perform step 211.
Step 211: Discard that subset.
For example, in the above example the segmented words in the subsets <Beijing ### Co., Ltd, Beijing ### Co., Ltd>, <king xx, king xx> and <Lee xx, Lee xx> are identical, so these three subsets are discarded.
Step 212: Among all subsets composed of one segmented word that is an entity organization name and one segmented word that is a person name, determine whether identical subsets exist; if identical subsets exist, perform step 213.
Step 213: Discard the subsets in which the entity organization name comes after the person name.
Among all subsets composed of an entity organization name and a person name, identical subsets are two or more subsets that contain the same segmented words. Continuing the above example, <Beijing ### Co., Ltd, king xx> and <king xx, Beijing ### Co., Ltd> are identical subsets; likewise, <Beijing ### Co., Ltd, Lee xx> and <Lee xx, Beijing ### Co., Ltd> are identical subsets. Of these, the subsets in which the entity organization name comes after the person name are discarded, that is, <king xx, Beijing ### Co., Ltd> and <Lee xx, Beijing ### Co., Ltd> are discarded.
Step 214: Among the remaining subsets composed only of entity organization names or only of person names, discard the reverse-ordered subsets according to the positions of the segmented words in the clause, to obtain the target set.
Reverse ordering is an ordering opposite to the reading order. For example, in the clause "The president of Beijing ### Co., Ltd is king xx and Lee xx", the segmented word "king xx" comes before "Lee xx" in reading order, so the subset <Lee xx, king xx> is reverse-ordered and is therefore discarded. The remaining subsets are <Beijing ### Co., Ltd, king xx>, <Beijing ### Co., Ltd, Lee xx> and <king xx, Lee xx>; that is, the target set is <Beijing ### Co., Ltd, king xx>, <Beijing ### Co., Ltd, Lee xx>, <king xx, Lee xx>.
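Steps 209 to 214 can be put together in one small sketch. The function name target_pairs, the ORG and PERSON labels and the (text, kind) token format are assumptions made for illustration, and the sketch assumes each segmented word appears only once in the clause.

```python
from itertools import product

ORG, PERSON = "org", "person"  # hypothetical labels for the two noun types


def target_pairs(tokens):
    """tokens: list of (text, kind) in clause order. Returns the target set."""
    position = {text: index for index, (text, _) in enumerate(tokens)}
    kind = dict(tokens)
    names = [text for text, _ in tokens]
    result = []
    for left, right in product(names, names):  # step 209
        if left == right:
            continue  # steps 210-211: both segmented words are identical
        if kind[left] != kind[right]:
            if kind[left] == PERSON:
                continue  # steps 212-213: organization name after person name
        elif position[left] > position[right]:
            continue  # step 214: same noun type but reverse-ordered
        result.append((left, right))
    return result


clause_tokens = [("Beijing ### Co., Ltd", ORG), ("king xx", PERSON), ("Lee xx", PERSON)]
targets = target_pairs(clause_tokens)
```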
Step 215: Determine the first enterprise association relationship information according to the target set and the association relationship word.
The first enterprise association relationship information is generated from the target set and the association relationship word. For example, if the target set is <Beijing ### Co., Ltd, king xx>, <Beijing ### Co., Ltd, Lee xx>, <king xx, Lee xx>, the first enterprise association relationship information obtained is "Beijing ### Co., Ltd-president-king xx", "Beijing ### Co., Ltd-president-Lee xx" and "president-king xx, Lee xx".
As can be seen from the above technical solution, the present application provides a method for mining enterprise association relationship information: obtaining the text to be detected; splitting the text to be detected to obtain at least one clause; performing word segmentation and part-of-speech tagging on each clause; identifying the association relationship words in each clause; determining whether the association relationship word is an organization association relationship word; and, if it is, determining first enterprise association relationship information using a Cartesian product algorithm, according to the parts of speech of the segmented words in the clause where the association relationship word is located. The present application therefore does not require staff to search the text to be detected for enterprise association relationship information, which improves the efficiency of mining enterprise association relationship information; and because no subjective judgment by staff is required, the accuracy of mining is also improved.
Referring to Fig. 3, in yet another embodiment of the present application, the following steps are further included after step 215 of the above embodiment:
Step 301: Determine whether the text to be detected contains ambiguous association relationship words, that is, words with the same content but different parts of speech; if it does, perform step 302.
Step 302: Add a part-of-speech label before or after the position of each ambiguous association relationship word.
Part of speech is the classification of words based on their characteristics. Words in Modern Chinese can be divided into two classes, content words and function words. Content words generally include nouns, measure words, adjectives, verbs and so on; function words include adverbs, prepositions, conjunctions and so on. Because Chinese is rich in meaning, one word may have different parts of speech in different contexts, which easily leads to errors when determining association relationship words. To eliminate errors caused by ambiguity, this embodiment disambiguates the text to be detected so as to obtain more accurate enterprise association relationship information.
For example, suppose the text to be detected is "Beijing ## company invested in Shenyang ## company, and the investment capital is 1,000,000 yuan". With simple matching against the data model, both occurrences of the word "invest" (rendered here as "invested" and "investment") would be identified as association relationship words. Semantically, however, the "investment" in "investment capital" is not a required association relationship word. To avoid this, this embodiment distinguishes the two by part of speech: the first occurrence is a transitive verb (vt), while the second combines with the following "capital" to form a noun phrase and is therefore defined as a verbal noun (vn). A part-of-speech label is then added before or after each occurrence in the text, giving "Beijing ## company [vt]invested in Shenyang ## company, and the [vn]investment capital is 1,000,000 yuan".
Step 303: Identify the target association relationship word according to the part-of-speech label.
After part-of-speech labels have been added to the text to be detected, the corresponding part-of-speech label is also added to the matching association relationship word in the data model. For example, adding the label to the association relationship word "invest" under the investment type gives "[vt]invest". The labelled word "[vt]invest" in the data model is then matched against the text to be detected, which yields the accurate target association relationship word "[vt]invest" in the text.
Step 304: Extract the clause where the target association relationship word is located, and remove the part-of-speech label.
Continuing the above example, the extracted clause is "Beijing ## company [vt]invested in Shenyang ## company". The part-of-speech label [vt] is then removed to obtain "Beijing ## company invested in Shenyang ## company", on which the subsequent association relationship mining is carried out.
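Steps 302 to 304 can be illustrated with the sketch below, which works on already segmented and tagged words. The function names and the (word, part_of_speech) token format are assumptions for illustration; the labels vt and vn come from the example above, while nt and n are placeholder tags.

```python
import re


def add_pos_labels(tagged_tokens, ambiguous_words):
    """Step 302: prefix each ambiguous word with its label, e.g. '[vt]invest'."""
    return [f"[{pos}]{word}" if word in ambiguous_words else word
            for word, pos in tagged_tokens]


def find_targets(labelled_tokens, labelled_relation_word):
    """Step 303: positions where the labelled relation word occurs."""
    return [i for i, token in enumerate(labelled_tokens)
            if token == labelled_relation_word]


def strip_pos_labels(labelled_tokens):
    """Step 304: remove a leading '[xx]' label from every token."""
    return [re.sub(r"^\[[a-z]+\]", "", token) for token in labelled_tokens]


clause_1 = [("Beijing ## company", "nt"), ("invest", "vt"), ("Shenyang ## company", "nt")]
clause_2 = [("invest", "vn"), ("capital", "n")]

labelled_1 = add_pos_labels(clause_1, {"invest"})
labelled_2 = add_pos_labels(clause_2, {"invest"})
```

Only the first clause contains "[vt]invest", so only that clause would be passed on to the subsequent mining.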
Step 305: For each clause containing the target association relationship word, determine second enterprise association relationship information according to the part of speech of the target association relationship word and its position in the clause.
The part of speech of a target association relationship word may be verb, noun and so on. Verbs include, for example, invest, increase capital and acquire; nouns include, for example, controlling party, subsidiary, parent company and controlling shareholder. Enterprise association relationship information is the association relationship between multiple enterprises, constructed on the basis of the target association relationship word. If the target association relationship word is "acquire", the enterprise association relationship information is: implementing object-acquire-acted-upon object.
In each clause containing the target association relationship word, identification proceeds forward (toward the beginning of the clause) from the position of the target association relationship word. If a first enterprise name is recognized, it is determined as the name of the implementing object of the association relationship word. Identification then proceeds backward (toward the end of the clause) from the position of the target association relationship word, and the second enterprise name recognized is determined as the name of the acted-upon object of the target association relationship word. Based on the target association relationship word, the enterprise association relationship information between the name of the implementing object and the name of the acted-upon object is generated.
Here the association relationship word is a unidirectional association relationship word whose part of speech is verb, such as "invest", "increase capital" or "acquire". The first enterprise name and the second enterprise name may be any enterprise names.
In implementation, after the server has determined the clauses containing a preset association relationship word, for a given clause the server may determine the position of the target association relationship word in that clause. Using the context of the words tagged as nouns before the target association relationship word, it identifies forward from that position the words tagged as nouns; if a first enterprise name can be recognized, it is determined as the implementing object of the association relationship word. Using the context of the words tagged as nouns after the target association relationship word, it then identifies backward from that position the words tagged as nouns, and the second enterprise name recognized is determined as the name of the acted-upon object of the target association relationship word. Using the target association relationship word, the resulting enterprise association relationship information is: first enterprise name-association relationship word-second enterprise name. In this way the enterprise association relationship information contained in this clause is determined, and by analogy the enterprise association relationship information contained in every clause containing the target association relationship word can be determined.
For example, suppose the clause containing the target association relationship word is "* * Co., Ltd has invested in ## Co., Ltd". After word segmentation, the words obtained from front to back are "* * Co., Ltd", "invest", an auxiliary particle, and "## Co., Ltd", where "* * Co., Ltd" is a noun, "invest" is a verb, the particle is an auxiliary word, and "## Co., Ltd" is a noun. The server may identify forward from "invest" and recognize "* * Co., Ltd", which is determined as the name of the implementing object of the target association relationship word. It may then identify backward from "invest" and recognize "## Co., Ltd". The enterprise association relationship information determined in this way is "* * Co., Ltd-invest-## Co., Ltd".
It should be noted that if no enterprise name is recognized when identifying backward, identification moves on to the next clause.
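The forward and backward identification for a unidirectional verb could be sketched as follows. The function name extract_verb_relation and the is_enterprise predicate are hypothetical; how enterprise names are actually recognized among the tagged nouns is not specified here, so a simple suffix check stands in for it.

```python
def extract_verb_relation(tokens, relation_word, is_enterprise):
    """Return (implementing object, relation word, acted-upon object) or None."""
    index = tokens.index(relation_word)
    # Identify forward: nearest enterprise name before the relation word.
    before = next((t for t in reversed(tokens[:index]) if is_enterprise(t)), None)
    # Identify backward: nearest enterprise name after the relation word.
    after = next((t for t in tokens[index + 1:] if is_enterprise(t)), None)
    if before is None or after is None:
        return None  # nothing found: move on to the next clause
    return (before, relation_word, after)


def looks_like_enterprise(token):
    """Stand-in recognizer, for illustration only."""
    return token.endswith("Co., Ltd")


triple = extract_verb_relation(
    ["* * Co., Ltd", "invest", "## Co., Ltd"], "invest", looks_like_enterprise
)
```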
When the target association relationship word is a unidirectional target association relationship word whose part of speech is noun, the corresponding processing may be as follows:
In each clause containing the target association relationship word, identification proceeds backward from the position of the target association relationship word. If a third enterprise name is recognized, it is determined as the name of the implementing object of the target association relationship word. Identification then proceeds forward from the position of the target association relationship word, and the fourth enterprise name recognized is determined as the name of the acted-upon object of the target association relationship word. Based on the target association relationship word, the enterprise association relationship information between the name of the implementing object and the name of the acted-upon object is generated.
Here the target association relationship word is a unidirectional target association relationship word whose part of speech is noun, such as "controlling shareholder", "controlling party", "parent company" or "subsidiary". The third enterprise name and the fourth enterprise name may be any enterprise names.
In implementation, after the server has determined the clauses containing a preset target association relationship word, for a given clause the server may determine the position of the target association relationship word in that clause. Using the context of the words tagged as nouns after the target association relationship word, it identifies backward from that position the words tagged as nouns; if a third enterprise name can be recognized, it is determined as the implementing object of the target association relationship word. Using the context of the words tagged as nouns before the target association relationship word, it then identifies forward from that position the words tagged as nouns, and the fourth enterprise name recognized is determined as the name of the acted-upon object of the target association relationship word. Using the target association relationship word, the resulting enterprise association relationship information is: third enterprise name-target association relationship word-fourth enterprise name. In this way the enterprise association relationship information contained in this clause is determined, and by analogy the enterprise association relationship information contained in every clause containing the association relationship word can be determined.
For example, suppose the clause containing the target association relationship word is "The controlling shareholder of * * Co., Ltd is ## Co., Ltd". After word segmentation, the target association relationship word recognized by the server is "controlling shareholder". The server may identify backward from "controlling shareholder" and recognize "## Co., Ltd", which is determined as the name of the implementing object of the target association relationship word. It may then identify forward from "controlling shareholder" and recognize "* * Co., Ltd". The enterprise association relationship information determined in this way is "## Co., Ltd-controlling shareholder-* * Co., Ltd".
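For a unidirectional noun the search directions are swapped: the implementing object is looked for after the relationship word and the acted-upon object before it. The sketch below illustrates this; extract_noun_relation and the suffix-based recognizer are hypothetical names used only for illustration.

```python
def extract_noun_relation(tokens, relation_word, is_enterprise):
    """Return (implementing object, relation word, acted-upon object) or None."""
    index = tokens.index(relation_word)
    # Identify backward first: the implementing object follows the noun.
    after = next((t for t in tokens[index + 1:] if is_enterprise(t)), None)
    # Then identify forward: the acted-upon object precedes the noun.
    before = next((t for t in reversed(tokens[:index]) if is_enterprise(t)), None)
    if after is None or before is None:
        return None
    return (after, relation_word, before)


def looks_like_enterprise(token):
    """Stand-in recognizer, for illustration only."""
    return token.endswith("Co., Ltd")


triple = extract_noun_relation(
    ["* * Co., Ltd", "controlling shareholder", "is", "## Co., Ltd"],
    "controlling shareholder",
    looks_like_enterprise,
)
```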
When the target association relationship word is a bidirectional target association relationship word, the corresponding processing may be as follows:
In each clause containing the target association relationship word, identification proceeds forward from the position of the target association relationship word, and the multiple enterprise names recognized are determined as parallel implementing objects of the target association relationship word. Based on the target association relationship word, the enterprise association relationship information between the multiple enterprise names is generated.
Here the target association relationship word is a bidirectional target association relationship word whose part of speech may be noun or verb. For example, when the part of speech is noun, bidirectional target association relationship words include "strategic partnership relationship", "partner" and "competitive relationship"; when the part of speech is verb, they include "jointly initiate", "jointly undertake" and "jointly invest".
It should be noted that if a recognized enterprise name mentioned above is an abbreviation, the corresponding full name can be looked up based on a preset correspondence between enterprise full names and abbreviations, and the full name is stored in the enterprise association relationship information.
In this embodiment, after the server has determined the clauses containing a preset target association relationship word, for a given clause the server may determine the position of the target association relationship word in that clause. Using the context of the words tagged as nouns before the target association relationship word, it identifies forward from that position the words tagged as nouns. After the first enterprise name is recognized, it continues identifying forward until no further enterprise name can be recognized in this clause. Using the association relationship word contained in this clause, the resulting enterprise association relationship information is: names of the multiple enterprises-target association relationship word. In this way the enterprise association relationship information contained in this clause is determined, and by analogy the enterprise association relationship information contained in every clause containing the target association relationship word can be determined.
For example, suppose the clause containing the target association relationship word is "* * Co., Ltd and ## Co., Ltd are in a strategic partnership relationship". After word segmentation, the words obtained from front to back are "* * Co., Ltd", "## Co., Ltd", "are" and "strategic partnership relationship". The target association relationship word recognized by the server is "strategic partnership relationship". The server may identify forward from "strategic partnership relationship" and recognize "* * Co., Ltd" and "## Co., Ltd". The enterprise association relationship information determined in this way is "* * Co., Ltd&## Co., Ltd-strategic partnership relationship".
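For a bidirectional relationship word, every enterprise name found before the word is collected as a parallel implementing object. The sketch below illustrates this under the same assumptions as before: extract_bidirectional_relation and the suffix-based recognizer are hypothetical, and requiring at least two parties is a choice made for this sketch.

```python
def extract_bidirectional_relation(tokens, relation_word, is_enterprise):
    """Return (tuple of parallel enterprise names, relation word) or None."""
    index = tokens.index(relation_word)
    # Collect every enterprise name before the relation word, in clause order.
    parties = [t for t in tokens[:index] if is_enterprise(t)]
    if len(parties) < 2:
        return None  # a two-way relationship needs at least two parties
    return (tuple(parties), relation_word)


def looks_like_enterprise(token):
    """Stand-in recognizer, for illustration only."""
    return token.endswith("Co., Ltd")


relation = extract_bidirectional_relation(
    ["* * Co., Ltd", "## Co., Ltd", "are", "strategic partnership relationship"],
    "strategic partnership relationship",
    looks_like_enterprise,
)
```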
Optionally, when the target association relationship word is a bidirectional target association relationship word whose part of speech is verb, some clauses also contain both the name of an implementing object and the name of an acted-upon object. The corresponding processing may be as follows:
In each clause containing the target association relationship word, identification proceeds forward from the position of the target association relationship word, and the enterprise names recognized are determined as the names of the implementing objects of the target association relationship word. Identification then proceeds backward from the position of the target association relationship word, and the enterprise name recognized is determined as the name of the acted-upon object of the target association relationship word. Based on the target association relationship word, the enterprise association relationship information between the names of the implementing objects and the name of the acted-upon object is generated.
In an implementation, after the server has determined the subordinate sentences containing the preset target incidence relation word, for any such subordinate sentence the server can determine the position of the target incidence relation word in it. Using the context information of the words tagged as nouns before the target incidence relation word, the server identifies forward from that position among the words tagged as nouns. After the first enterprise name is recognized, it continues to identify forward until no further enterprise name can be recognized in the subordinate sentence, and determines the recognized enterprise names as the names of the acting objects. Then, using the context information of the words tagged as nouns after the target incidence relation word, the server identifies backward from the position of the target incidence relation word and determines the recognized enterprise names as the names of the acted-upon objects. Finally, the enterprise incidence relation information of the subordinate sentence is determined as: enterprise names recognized forward-target incidence relation word-enterprise names recognized backward.
For example, suppose the subordinate sentence containing the target incidence relation word is "* * Co., Ltd and ## Co., Ltd jointly invested in * # Co., Ltd". After word segmentation, the words obtained from front to back are "* * Co., Ltd", "## Co., Ltd", "joint investment", " " and "* # Co., Ltd". The target incidence relation word recognized by the server is "joint investment". Identifying forward from "joint investment" recognizes "* * Co., Ltd" and "## Co., Ltd", both of which are determined as names of acting objects of the target incidence relation word. Identifying backward then recognizes "* # Co., Ltd", which is determined as the name of the acted-upon object of the target incidence relation word. The enterprise incidence relation information determined in this way is "* * Co., Ltd & ## Co., Ltd"-"joint investment"-"* # Co., Ltd".
It should be noted that if an enterprise name identified as described above is an abbreviation, the corresponding full name can be looked up from a preset correspondence between enterprise full names and abbreviations, and the full name is stored in the enterprise incidence relation information.
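The bidirectional identification and the abbreviation lookup described above can be sketched as follows. This is only an illustrative sketch: the `"ORG"` tag for enterprise names, the `FULL_NAMES` table and the function names are assumptions and are not part of the embodiment. Names found before the relation word are returned in their original sentence order.

```python
from typing import Dict, List, Tuple

# (word, tag); using "ORG" to mark enterprise names is an assumed tag set.
Token = Tuple[str, str]

# Hypothetical preset correspondence from abbreviation to full enterprise name.
FULL_NAMES: Dict[str, str] = {"* # Co": "* # Co., Ltd"}


def full_name(name: str) -> str:
    """Return the stored full name for an abbreviation, else the name itself."""
    return FULL_NAMES.get(name, name)


def extract_bidirectional(
    tokens: List[Token], relation: str
) -> Tuple[List[str], str, List[str]]:
    """Split enterprise names into those before and after the relation word.

    Raises StopIteration if the relation word is not in the tokens.
    """
    pos = next(i for i, (word, _) in enumerate(tokens) if word == relation)
    acting = [full_name(w) for w, tag in tokens[:pos] if tag == "ORG"]
    acted_upon = [full_name(w) for w, tag in tokens[pos + 1:] if tag == "ORG"]
    return acting, relation, acted_upon


tokens = [
    ("* * Co., Ltd", "ORG"),
    ("## Co., Ltd", "ORG"),
    ("joint investment", "v"),
    ("* # Co", "ORG"),  # abbreviation, expanded through FULL_NAMES
]
result = extract_bidirectional(tokens, "joint investment")
```

Here `result` holds the acting objects, the relation word and the acted-upon objects, with the abbreviation replaced by its full name.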
Step 306: judge whether the second enterprise incidence relation information is identical to the first enterprise incidence relation information; if so, execute step 307.
Step 307: discard the second enterprise incidence relation information that is identical to the first enterprise incidence relation information.
The obtained second enterprise incidence relation information is matched against the first enterprise incidence relation information; if the two are identical, the second enterprise incidence relation information is discarded to prevent duplicate storage.
Referring to Fig. 4, in another embodiment provided by the present application, after step 307 of the above embodiment, the method further includes:
Step 401: judge whether, apart from incidence relation words, the text to be detected contains at least two segmented words with identical content and identical part of speech; if so, execute steps 402-406.
Step 402: record the location indexes of the at least two segmented words with identical content and identical part of speech.
For example, the text to be detected contains "On May 28, 2018, * * Co., Ltd made a decision, * * Co., Ltd elected king xx as the new president of * * Co., Ltd, and king xx said he would be responsible for the enterprise". Here "* * Co., Ltd" occurs three times and "king xx" occurs twice. The server records the position of each of these segmented words.
Step 403: according to the location indexes of the at least two segmented words with identical content and identical part of speech, and following the principle of the shortest path to the incidence relation word, determine the target segmented word and the location index corresponding to the target segmented word.
The shortest-path-first principle is as follows: identify forward from the position of the incidence relation word until a first preset punctuation mark is reached, then identify backward from the position of the incidence relation word until a second preset punctuation mark is reached; within the text between the first preset punctuation mark and the second preset punctuation mark, select the segmented word nearest to the incidence relation word as the target segmented word. The first preset punctuation mark and the second preset punctuation mark include the full stop, the semicolon or the comma. Continuing the above example, which contains three occurrences of "* * Co., Ltd" and two occurrences of "king xx": following the principle of the shortest path to the incidence relation word, identifying forward from the position of the incidence relation word until the first comma and backward until the second comma yields "* * Co., Ltd elected king xx as the new president of * * Co., Ltd". It can thus be seen that, in this text, the "king xx" nearest to "president" and the second "* * Co., Ltd" can serve as the target segmented words.
Step 404: according to the location index of the target segmented word and the position of the incidence relation word, determine the effective short sentence range.
For example, in the above example, after the third "* * Co., Ltd" and the first "king xx" are determined as target segmented words, the part between the two target segmented words is determined, according to their location indexes and the position of the incidence relation word, as the effective short sentence range, i.e. "king xx as the new president of * * Co., Ltd".
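Steps 402 to 404 can be illustrated with the following minimal sketch. It assumes the text has already been segmented into a flat token list kept in the original word order (relation word last in its clause), and that the range runs from the earliest target word to the relation word; the function names and the `PUNCT` set are illustrative assumptions, not part of the embodiment.

```python
from typing import Dict, List, Set, Tuple

# Preset punctuation marks: full stop, semicolon, comma (both widths).
PUNCT: Set[str] = {".", ";", ",", "。", "；", "，"}


def clause_window(tokens: List[str], rel: int) -> Tuple[int, int]:
    """Return [start, end) of the text between the nearest preset punctuation marks."""
    start = rel
    while start > 0 and tokens[start - 1] not in PUNCT:
        start -= 1
    end = rel + 1
    while end < len(tokens) and tokens[end] not in PUNCT:
        end += 1
    return start, end


def nearest_targets(tokens: List[str], rel: int, repeated: Set[str]) -> Dict[str, int]:
    """For each repeated word, keep the occurrence in the window nearest to the relation word."""
    start, end = clause_window(tokens, rel)
    targets: Dict[str, int] = {}
    for i in range(start, end):
        word = tokens[i]
        if i == rel or word not in repeated:
            continue
        if word not in targets or abs(i - rel) < abs(targets[word] - rel):
            targets[word] = i
    return targets


def effective_range(targets: Dict[str, int], rel: int) -> Tuple[int, int]:
    """Effective short sentence range [start, end) covering the targets and the relation word."""
    indexes = list(targets.values()) + [rel]
    return min(indexes), max(indexes) + 1


tokens = [
    "On May 28, 2018", ",", "* * Co., Ltd", "made a decision", ",",
    "* * Co., Ltd", "elected", "king xx", "as", "* * Co., Ltd", "new",
    "president", ",", "king xx", "said", ".",
]
rel = tokens.index("president")
targets = nearest_targets(tokens, rel, {"* * Co., Ltd", "king xx"})
span = effective_range(targets, rel)
```

With these tokens, the third "* * Co., Ltd" (index 9) and the first "king xx" (index 7) are selected, and the range covers "king xx" through "president".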
Step 405: within the effective short sentence range, if the relationship clause of the effective short sentence is a positive relation-centered clause, identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the first entity; then identify backward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the second entity.
The relationship clause type can be determined as follows. After determining the incidence relation words in the data model, staff design sentence pattern templates for the different incidence relation words. If the incidence relation word is "investment", the sentence pattern templates may be "... invest ...", "... invested ..." and "to ... investment". The sentence pattern templates are matched against the effective short sentence, and the relationship clause type corresponding to the matched template is determined as the relationship clause type of the effective short sentence. The relationship clause types include the positive relation-centered clause, the positive relation-postposed clause, the inverse relation-centered clause and the inverse relation-postposed clause.
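Template matching of this kind can be sketched with regular expressions. The two English templates and the type labels below are hypothetical stand-ins for the templates that staff would design; only the standard `re` module is used.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical sentence pattern templates for the relation word "investment",
# each paired with the relationship clause type it indicates.
TEMPLATES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r".+ is invested in by .+"), "inverse relation-centered"),
    (re.compile(r".+ invests in .+"), "positive relation-centered"),
]


def classify(short_sentence: str) -> Optional[str]:
    """Return the clause type of the first template matching the whole short sentence."""
    for pattern, clause_type in TEMPLATES:
        if pattern.fullmatch(short_sentence):
            return clause_type
    return None
```

A short sentence that matches no template yields `None`, which a caller could treat as "no relation extracted".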
The positive relation-centered clause is the normal subject-predicate-object sentence pattern, for example "Beijing ### company acquires Beijing * * * company". Identifying forward from the incidence relation word "acquires" recognizes "Beijing ### company", which is determined as the first entity; identifying backward from the incidence relation word "acquires" recognizes "Beijing * * * company", which is determined as the second entity.
Step 406: within the effective short sentence range, if the relationship clause of the effective short sentence is a positive relation-postposed clause, identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the second entity; then continue to identify forward and determine the second enterprise name or person name recognized as the first entity.
The positive relation-postposed clause is also a normal subject-predicate-object sentence pattern, but the incidence relation word is usually a noun and is located after all enterprise names or person names, for example "king xx is the president of Beijing ### company" (in the original word order, "president" comes last). Identifying forward from the incidence relation word "president", "Beijing ### company" is determined as the second entity and "king xx" is determined as the first entity.
Step 407: within the effective short sentence range, if the relationship clause of the effective short sentence is an inverse relation-centered clause, identify backward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the first entity; then identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the second entity.
The inverse relation-centered clause is an explicit passive sentence pattern, for example "Beijing ### company is invested in by Beijing * * * company". Identifying backward from the incidence relation word "investment", "Beijing * * * company" is determined as the first entity; identifying forward from "investment", "Beijing ### company" is determined as the second entity.
Step 408: within the effective short sentence range, if the relationship clause of the effective short sentence is an inverse relation-postposed clause, identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the first entity; then continue to identify forward and determine the second enterprise name or person name recognized as the second entity.
The inverse relation-postposed clause is an implicit passive sentence pattern in which the incidence relation word is usually a verb and is located after all enterprise names or person names, for example "Beijing ### company to Beijing * * * company investment 1,000,000" (words kept in the original word order). Identifying forward from the incidence relation word "investment", "Beijing * * * company" is determined as the first entity; continuing to identify forward, "Beijing ### company" is determined as the second entity.
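The four cases of steps 405 to 408 differ only in the direction and order of identification, which the following sketch tries to make explicit. It assumes each token carries a flag saying whether it is an enterprise name or person name, that tokens are in the original word order, and that the clause type has already been determined; the function name and the type labels are illustrative assumptions.

```python
from typing import List, Tuple

# (word, is_name): is_name is True for enterprise names and person names.
Token = Tuple[str, bool]


def determine_entities(tokens: List[Token], rel: int, clause_type: str) -> Tuple[str, str]:
    """Return (first entity, second entity) for the relation word at index rel."""
    # Names before the relation word, nearest first (identifying forward).
    before = [w for w, is_name in reversed(tokens[:rel]) if is_name]
    # Names after the relation word, nearest first (identifying backward).
    after = [w for w, is_name in tokens[rel + 1:] if is_name]
    if clause_type == "positive relation-centered":
        return before[0], after[0]
    if clause_type == "positive relation-postposed":
        return before[1], before[0]
    if clause_type == "inverse relation-centered":
        return after[0], before[0]
    if clause_type == "inverse relation-postposed":
        return before[0], before[1]
    raise ValueError(f"unknown clause type: {clause_type}")


A, B, K = "Beijing ### company", "Beijing * * * company", "king xx"
centered = determine_entities([(A, True), ("acquires", False), (B, True)], 1,
                              "positive relation-centered")
postposed = determine_entities([(K, True), ("is", False), (A, True), ("president", False)], 3,
                               "positive relation-postposed")
inv_centered = determine_entities([(A, True), ("by", False), ("investment", False), (B, True)], 2,
                                  "inverse relation-centered")
inv_postposed = determine_entities(
    [(A, True), ("to", False), (B, True), ("investment", False), ("1,000,000", False)], 3,
    "inverse relation-postposed")
```

The four calls reproduce the four examples given for the relationship clause types; an `IndexError` would signal that the expected names were not found in the short sentence.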
Step 409: according to the first entity, the second entity and the incidence relation word, determine the third enterprise incidence relation information.
Based on the incidence relation word, the enterprise incidence relation information between the first entity and the second entity is generated, for example "* * Co., Ltd-president-king xx" and "Beijing * * * company-investment-Beijing ### company".
Step 410: judge whether the third enterprise incidence relation information is identical to the first enterprise incidence relation information, or whether the third enterprise incidence relation information is identical to the second enterprise incidence relation information; if the third enterprise incidence relation information is identical to either of them, execute step 411.
Step 411: discard the third enterprise incidence relation information that is identical to the first enterprise incidence relation information, and the third enterprise incidence relation information that is identical to the second enterprise incidence relation information.
The obtained third enterprise incidence relation information is matched against the first enterprise incidence relation information and the second enterprise incidence relation information respectively; if the third enterprise incidence relation information is identical to the first enterprise incidence relation information or to the second enterprise incidence relation information, the third enterprise incidence relation information is discarded to prevent duplicate storage.
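The duplicate check of steps 410 and 411 (and likewise steps 306 and 307) amounts to set membership. A minimal sketch, assuming each piece of enterprise incidence relation information is represented as a tuple of three strings and that "identical" means exact equality of such tuples; both the representation and the function name are assumptions:

```python
from typing import Iterable, List, Set, Tuple

Info = Tuple[str, str, str]  # assumed representation: (entity, relation word, entity)


def drop_known(candidates: Iterable[Info], *known_groups: Iterable[Info]) -> List[Info]:
    """Keep only the candidates that do not already appear in any known group."""
    known: Set[Info] = set()
    for group in known_groups:
        known.update(group)
    return [info for info in candidates if info not in known]


first = [("* * Co., Ltd", "president", "king xx")]
second = [("Beijing * * * company", "investment", "Beijing ### company")]
third = [
    ("* * Co., Ltd", "president", "king xx"),      # duplicate of first, discarded
    ("* * * company", "shareholder", "king xx"),   # new, kept
]
kept = drop_known(third, first, second)
```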
Referring to Fig. 5, in another embodiment provided by the present application, after step 411 of the above embodiment, the method further includes:
Step 501: according to the incidence relation words in the first enterprise incidence relation, the second enterprise incidence relation and the third enterprise incidence relation, establish association paths between the enterprise names or person names in the first enterprise incidence relation, the second enterprise incidence relation and the third enterprise incidence relation, and store them in the corresponding databases.
Since incidence relation words are of many types, such as the organization relation type and the investment relation type, staff can establish multiple corresponding databases in advance according to the types of incidence relation words. For example, for incidence relation words of the organization relation type, a shareholder information database and a company name database need to be established. After association paths are established between the enterprise names or person names in the first enterprise incidence relation, the second enterprise incidence relation and the third enterprise incidence relation, the enterprise names or person names are stored in the corresponding databases. For example, for "* * * company-shareholder-king xx", an association path is established between "* * * company" and "king xx", "* * * company" is stored in the company name database, and "king xx" is stored in the shareholder information database.
Step 502: obtain the request information input by a user, the request information including the enterprise name or person name that the user wants to query.
Step 503: judge whether the request information matches the stored information in the database, the stored information in the database being the enterprise names or person names in the first enterprise incidence relation, the second enterprise incidence relation and the third enterprise incidence relation; if it matches, execute step 504.
Step 504: according to the association path of the stored information, extract the incidence relation information corresponding to the stored information to form an organization relation map.
For example, if the user inputs Beijing Divine Land Tai You software limited liability company, this information is matched against the stored information in the databases preset by staff. If matching stored information is found, the association information that has an association path with the stored information is extracted to obtain the organization relation map shown in Fig. 11, so that the user can directly and conveniently understand the organizational composition of the company.
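Steps 501 to 504 can be pictured as building and querying a small graph. The sketch below uses an in-memory dictionary in place of the databases; the class `RelationStore` and its methods are hypothetical names chosen for illustration, and a real deployment would presumably use the separate databases described above.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class RelationStore:
    """In-memory stand-in for the databases holding names and association paths."""

    def __init__(self) -> None:
        # name -> list of (relation word, associated name)
        self.paths: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, left: str, relation: str, right: str) -> None:
        """Store an association path in both directions so either name can be queried."""
        self.paths[left].append((relation, right))
        self.paths[right].append((relation, left))

    def query(self, name: str) -> List[Tuple[str, str]]:
        """Return the relations linked to the requested name, or an empty list if unmatched."""
        return list(self.paths.get(name, []))


store = RelationStore()
store.add("* * * company", "shareholder", "king xx")
store.add("* * * company", "investment", "Beijing ### company")
company_map = store.query("* * * company")
```

The list returned by `query` is the raw material from which a relation map for the requested name could be drawn.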
As can be seen from the above technical solution, the present application provides an enterprise incidence relation mining method: obtain the text to be detected; split the text to be detected to obtain at least one subordinate sentence; perform word segmentation and part-of-speech tagging on each subordinate sentence; identify the incidence relation word in each subordinate sentence; judge whether the incidence relation word is an organization incidence relation word and, if so, determine the first enterprise incidence relation information from the parts of speech of the segmented words in the subordinate sentence where the incidence relation word is located, using a Cartesian product algorithm. The present application therefore does not require staff to search the text to be detected for enterprise incidence relation information, which improves the efficiency of enterprise incidence relation information mining; and because no subjective judgment by staff is needed, the accuracy of mining is improved as well.
In a second aspect, referring to Fig. 6, an embodiment of the present application provides an enterprise incidence relation information mining device, the device including:
An obtaining module 601, configured to obtain the text to be detected;
A splitting module 602, configured to split the text to be detected to obtain at least one subordinate sentence;
A part-of-speech tagging module 603, configured to perform word segmentation and part-of-speech tagging on each subordinate sentence;
A first identification module 604, configured to identify the incidence relation word in each subordinate sentence;
A first determining module 605, configured to judge whether the incidence relation word is an organization incidence relation word and, if so, to determine the first enterprise incidence relation information from the parts of speech of the segmented words in the subordinate sentence where the incidence relation word is located, using a Cartesian product algorithm.
As can be seen from the above technical solution, the present application provides an enterprise incidence relation mining device, which obtains the text to be detected; splits the text to be detected to obtain at least one subordinate sentence; performs word segmentation and part-of-speech tagging on each subordinate sentence; identifies the incidence relation word in each subordinate sentence; and judges whether the incidence relation word is an organization incidence relation word and, if so, determines the first enterprise incidence relation information from the parts of speech of the segmented words in the subordinate sentence where the incidence relation word is located, using a Cartesian product algorithm. The present application therefore does not require staff to search the text to be detected for enterprise incidence relation information, which improves the efficiency of enterprise incidence relation information mining; and because no subjective judgment by staff is needed, the accuracy of mining is improved as well.
Further, referring to Fig. 7, when the type of the incidence relation word is an organization relation word, the first determining module 605 includes:
An extraction unit 701, configured to extract, from the subordinate sentence where the incidence relation word is located, the segmented words that are nouns denoting entity organization names and the segmented words denoting person names;
A first judging unit 702, configured to, if the number of segmented words denoting entity organization names and the number of segmented words denoting person names are both one, generate the enterprise incidence relation information between the segmented word denoting the entity organization name and the segmented word denoting the person name;
A second judging unit 703, configured to, if the number of segmented words denoting entity organization names and/or the number of segmented words denoting person names is at least two, generate a first set and a second set, the first set and the second set each being the set composed of the segmented words denoting entity organization names and the segmented words denoting person names;
A Cartesian product unit 704, configured to take the Cartesian product of the first set and the second set to obtain multiple subsets;
A screening unit 705, configured to screen the multiple subsets according to preset screening rules to obtain a target set;
A first determination unit 706, configured to determine the first enterprise incidence relation information according to the target set.
Further, referring to Fig. 8, the screening unit 705 includes:
A first judging subunit 801, configured to judge whether the segmented words in each subset are identical and, if the segmented words in a subset are identical, to discard that subset;
A second judging subunit 802, configured to judge, among all subsets composed of a segmented word denoting an entity organization name and a segmented word denoting a person name, whether identical subsets exist and, if so, to discard the subset in which the segmented word denoting the entity organization name comes after the segmented word denoting the person name;
A target set determining subunit 803, configured to, among the remaining subsets composed only of segmented words denoting entity organization names or only of segmented words denoting person names, discard the subsets arranged in reverse order according to the positions of those segmented words in the subordinate sentence, thereby obtaining the target set.
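The Cartesian product and the three screening rules handled by units 704 and 705 can be sketched with `itertools.product`. The tuple layout `(word, kind, position)`, the kind labels `"ORG"` and `"PER"`, the person name "li xx" and the function name are illustrative assumptions; the sketch reads "identical subsets" in rule two as the same pair appearing in both orders.

```python
from itertools import product
from typing import List, Tuple

# (word, kind, position in the subordinate sentence); kind is "ORG" or "PER" (assumed labels).
Name = Tuple[str, str, int]


def screen_pairs(names: List[Name]) -> List[Tuple[str, str]]:
    """Cartesian product of the name set with itself, filtered by the three screening rules."""
    kept: List[Tuple[str, str]] = []
    for a, b in product(names, names):
        if a[0] == b[0]:
            continue  # rule 1: discard subsets whose two words are identical
        if a[1] != b[1]:
            # rule 2: of the two mirrored organization/person pairs,
            # keep only the one with the organization name first
            if a[1] == "ORG":
                kept.append((a[0], b[0]))
        elif a[2] < b[2]:
            # rule 3: for same-kind pairs, discard the reverse-ordered one
            kept.append((a[0], b[0]))
    return kept


names = [
    ("* * * company", "ORG", 0),
    ("king xx", "PER", 2),
    ("li xx", "PER", 4),  # hypothetical second person name
]
target_set = screen_pairs(names)
```

Of the nine pairs produced by the product, three survive the screening and form the target set.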
Further, referring to Fig. 9, the device further includes:
A first judging module 901, configured to judge whether the text to be detected contains ambiguous incidence relation words that have identical content but different parts of speech and, if so, to add a part-of-speech label before or after the position of each ambiguous incidence relation word;
A second identification module 902, configured to identify the target incidence relation word according to the part-of-speech label;
A subordinate sentence extraction module 903, configured to extract the subordinate sentence where the target incidence relation word is located and to remove the part-of-speech label;
A second determining module 904, configured to determine, for each subordinate sentence containing the target incidence relation word, the second enterprise incidence relation information according to the part of speech of the target incidence relation word and the position of the target incidence relation word in the subordinate sentence;
A second judging module 905, configured to judge whether the second enterprise incidence relation information is identical to the first enterprise incidence relation information and, if so, to discard the second enterprise incidence relation information that is identical to the first enterprise incidence relation information.
Further, referring to Fig. 10, the device further includes:
A third judging module 1001, configured to judge whether, apart from incidence relation words, the text to be detected contains at least two segmented words with identical content and identical part of speech and, if so, to record the location indexes of the at least two segmented words with identical content and identical part of speech;
A location index determining module 1002, configured to determine, according to the location indexes of the at least two segmented words with identical content and identical part of speech and following the principle of the shortest path to the incidence relation word, the target segmented word and the location index corresponding to the target segmented word;
An effective short sentence range determining module 1003, configured to determine the effective short sentence range according to the location index of the target segmented word and the position of the incidence relation word;
An entity determining module 1004, configured to, within the effective short sentence range, if the relationship clause of the effective short sentence is a positive relation-centered clause, identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the first entity, and identify backward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the second entity;
If the relationship clause of the effective short sentence is a positive relation-postposed clause, identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the second entity, and continue to identify forward and determine the second enterprise name or person name recognized as the first entity;
If the relationship clause of the effective short sentence is an inverse relation-centered clause, identify backward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the first entity, and identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the second entity;
If the relationship clause of the effective short sentence is an inverse relation-postposed clause, identify forward from the position of the incidence relation word within the effective short sentence range and determine the first enterprise name or person name recognized as the first entity, and continue to identify forward and determine the second enterprise name or person name recognized as the second entity;
A third determining module 1005, configured to determine the third enterprise incidence relation information according to the first entity, the second entity and the incidence relation word;
A third judging module 1006, configured to judge whether the third enterprise incidence relation information is identical to the first enterprise incidence relation information, or whether the third enterprise incidence relation information is identical to the second enterprise incidence relation information, and, if the third enterprise incidence relation information is identical to either of them, to discard the third enterprise incidence relation information that is identical to the first enterprise incidence relation information and the third enterprise incidence relation information that is identical to the second enterprise incidence relation information.
As can be seen from the above technical solution, the present application provides an enterprise incidence relation information mining method and device: obtain the text to be detected; split the text to be detected to obtain at least one subordinate sentence; perform word segmentation and part-of-speech tagging on each subordinate sentence; identify the incidence relation word in each subordinate sentence; judge whether the incidence relation word is an organization incidence relation word and, if so, determine the first enterprise incidence relation information from the parts of speech of the segmented words in the subordinate sentence where the incidence relation word is located, using a Cartesian product algorithm. The present application therefore does not require staff to search the text to be detected for enterprise incidence relation information, which improves the efficiency of enterprise incidence relation information mining; and because no subjective judgment by staff is needed, the accuracy of mining is improved as well.
Those skilled in the art can clearly understand that the techniques in the embodiments of the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. Identical or similar parts of the embodiments can be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.