CN109033132A - Method and device for calculating the relevance between a text and an entity using a knowledge graph
- Publication number: CN109033132A
- Application number: CN201810567101.5A
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention discloses a method and device for calculating the relevance between a text and an enterprise entity using a knowledge graph. The method comprises: obtaining a text; performing word segmentation on the text and extracting the set of keywords that occur in it; retrieving, through a pre-established knowledge graph, the enterprise entities associated with the keywords, and taking those entities as the candidate enterprise set, wherein the knowledge graph includes target node information, associated node information, and the relationship and association weight between the target node information and the associated node information, the target node information includes first enterprise entity information, and the associated node information includes second enterprise entity information, product information, or natural person information associated with the first enterprise entity information; and calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate in the candidate enterprise set.
Description
Technical field
The present invention relates to a method and device for calculating the relevance between a text and an enterprise entity using a knowledge graph.
Background art
In the information age, acquiring, processing, and analyzing massive amounts of data is a major difficulty. In some industries (such as finance), people track information about enterprises along many dimensions to support decisions such as investment management. On the one hand, market participants need broader and more complete data; on the other hand, that data must be processed promptly. Enterprise public-opinion information is one dimension that market participants follow closely. As unstructured text, it is scattered across sources, large in volume, complex in format, and highly time-sensitive. Processing such data efficiently with technical means such as natural language processing, and extracting valuable information from it, is therefore a need shared by many financial practitioners. Faced with a mass of public-opinion information, associating it with the enterprises of interest and filtering out information that has little value or is unrelated to the entity is an essential step in data analysis and mining.
A common way to associate text information with an enterprise entity is to build a keyword library for the entity, containing its registered business name, short name, stock code, and so on, and then to run keyword matching against the text library, treating each matched text as information related to that entity. This approach has several drawbacks. First, it requires a fairly complete enterprise keyword library to be built in advance as the basis for retrieval. Second, ranking the matched results by degree of association works poorly: a keyword often appears in a text that is not actually about the enterprise, so much redundant information remains. Third, direct keyword matching misses important information about an enterprise's key affiliated enterprises, causing information loss.
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a method and device for calculating the relevance between a text and an enterprise entity using a knowledge graph, which improves on the traditional approach of relying on keyword matching alone when analyzing large volumes of text. By incorporating a knowledge graph, the method associates texts with target entities and quantifies the degree of association, enriches the dimensions along which text information is related to a target entity, and provides a basis for subsequent analysis.
To solve the above technical problem, one technical solution adopted by the present invention is a method for calculating the relevance between a text and an enterprise entity using a knowledge graph, comprising the following steps:
Obtaining a text;
Performing word segmentation on the text, extracting the set of keywords that occur in the text, retrieving the enterprise entities associated with the keywords through a pre-established knowledge graph, and taking those enterprise entities as the candidate enterprise set, wherein the knowledge graph includes target node information, associated node information, and the relationship and association weight between the target node information and the associated node information, the target node information includes first enterprise entity information, and the associated node information includes second enterprise entity information, product information, or natural person information associated with the first enterprise entity information;
Calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate in the candidate enterprise set.
Further, the step of performing word segmentation on the text, extracting the keyword set, and retrieving the associated enterprise entities through the pre-established knowledge graph as the candidate enterprise set comprises:
Performing word segmentation on the text and collecting all keywords into a keyword set, denoted K; searching the knowledge graph for each keyword in K to obtain the enterprise entities associated with K; and taking those enterprise entities as the candidate enterprise set, denoted C.
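The retrieval of K and C described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary `KEYWORD_TO_ENTITIES`, the entity names, and both function names are hypothetical, and a real system would query a graph database rather than an in-memory dictionary.

```python
# Hypothetical toy graph: keyword -> enterprise entities it is linked to.
# All names below are placeholders, not data from the patent.
KEYWORD_TO_ENTITIES = {
    "AcmePhone": ["Acme Co., Ltd."],
    "Jane Doe": ["Acme Co., Ltd.", "Beta Co., Ltd."],
    "BetaCloud": ["Beta Co., Ltd."],
}


def extract_keywords(tokens, graph):
    """Return K: distinct tokens that exist in the graph, in first-seen order."""
    keywords = []
    for tok in tokens:
        if tok in graph and tok not in keywords:
            keywords.append(tok)
    return keywords


def candidate_entities(keywords, graph):
    """Return C: distinct entities linked to any keyword, in first-seen order."""
    candidates = []
    for kw in keywords:
        for entity in graph[kw]:
            if entity not in candidates:
                candidates.append(entity)
    return candidates


# Tokens as they might come out of a word segmenter (assumed input).
tokens = ["Jane Doe", "launched", "AcmePhone", "today", "AcmePhone"]
K = extract_keywords(tokens, KEYWORD_TO_ENTITIES)
C = candidate_entities(K, KEYWORD_TO_ENTITIES)
print(K)
print(C)
```

Keeping first-seen order makes the row and column order of any later matrix over C and K reproducible.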
Further, the step of calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the associated keywords comprises:
Let $F$ be the word-frequency vector of the keyword set $K$:
$$F = (f_1, f_2, \dots, f_n)$$
where $f_i$ is the word frequency of the $i$-th keyword and $n$ is the number of keywords in $K$.
Let $R$ be the correlation matrix between the candidate set $C$ and the keyword set $K$, whose entry is 1 when the two knowledge-graph nodes are connected and 0 when they are not:
$$R = (r_{ij})_{m \times n}, \quad r_{ij} \in \{0, 1\}$$
where $m$ is the number of candidates in $C$.
The aggregated word-frequency vector of $C$ over its related keywords is:
$$\tilde{F} = R F^{\mathsf{T}} = (\tilde{f}_1, \dots, \tilde{f}_m)^{\mathsf{T}}, \quad \tilde{f}_i = \sum_{j=1}^{n} r_{ij} f_j$$
where $\tilde{f}_i$ is the sum of the word frequencies of all keywords in the text that are related to the $i$-th candidate enterprise entity.
Define the relevance factor $RX$, which measures the relative order of association among the candidate enterprise entities within this text (the defining equations for $RX$ are not preserved in this text).
Define the relevance factor $RY$, which measures the order of association of a candidate enterprise entity across different texts. Here $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of words remaining after the segmentation result of the text has been cleaned, used to measure text length (the defining equation for $RY$ is not preserved in this text), with $0 \le ry_i \le 1$.
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise set $C$ (equation not preserved in this text), where $\odot$ denotes element-wise multiplication and the $i$-th entry of $R_{KC}$ is the degree of association of this text with the $i$-th candidate enterprise entity.
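The aggregation step can be illustrated with a short sketch. It assumes, following the description of the aggregated word frequency, that each candidate's value is the sum of the frequencies of the keywords connected to it; the function name and the toy numbers are hypothetical. The factors RX and RY are not computed, because their formulas are not given in this text.

```python
def aggregate_word_freq(R, F):
    """Return the vector whose i-th entry is sum_j R[i][j] * F[j]."""
    return [sum(r * f for r, f in zip(row, F)) for row in R]


# Toy data (assumed): three keywords, two candidate enterprise entities.
F = [3, 1, 2]                 # f_j: frequency of keyword j in the text
R = [[1, 1, 0],               # candidate 0 is connected to keywords 0 and 1
     [0, 1, 1]]               # candidate 1 is connected to keywords 1 and 2
agg = aggregate_word_freq(R, F)
print(agg)  # [4, 3]
```

The same function also covers the weighted case: replacing the 0/1 entries of `R` with correlation coefficients yields coefficient-weighted sums.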
Further, the step of calculating the degree of association between the text and each candidate enterprise entity further comprises:
Calculating the degree of association between the text and each candidate enterprise entity according to both the word frequency of the keywords associated with that candidate in the candidate enterprise set and the relationship weight.
Further, the step of calculating the degree of association according to the word frequency and the relationship weight comprises:
First, compute the word-frequency vector $F$ of the keyword set $K$:
$$F = (f_1, f_2, \dots, f_n)$$
where $f_i$ is the word frequency of the $i$-th keyword.
Let $R$ be the correlation matrix between the candidate enterprise set $C$ and the keyword set $K$:
$$R = (r_{ij})_{m \times n}$$
where $r_{ij}$ is the correlation coefficient between the $i$-th candidate enterprise entity and the $j$-th keyword.
The correlation-coefficient-weighted frequency vector is:
$$\tilde{F} = R F^{\mathsf{T}}, \quad \tilde{f}_i = \sum_{j=1}^{n} r_{ij} f_j$$
where $\tilde{f}_i$ is the sum of the weighted keyword frequencies of the $i$-th candidate enterprise entity.
Define the relevance factor $RX$, which measures the relative order of association among the candidate enterprise entities within this text (the defining equations for $RX$ are not preserved in this text).
Define the relevance factor $RY$, which measures the order of association of a candidate enterprise entity across different texts. Here $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of words remaining after the segmentation result of the text has been cleaned, used to measure text length (the defining equation for $RY$ is not preserved in this text), with $0 \le ry_i \le 1$.
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise set $C$ (equation not preserved in this text), where $\odot$ denotes element-wise multiplication and the $i$-th entry of $R_{KC}$ is the degree of association of this text with the $i$-th candidate enterprise entity.
Further, before the step of performing word segmentation on the text, the method further comprises:
Performing paragraph-division preprocessing on the text and assigning a weight to each paragraph position;
and the step of calculating the degree of association between the text and each candidate enterprise entity further comprises:
Calculating the degree of association according to the word frequency of the keywords associated with each candidate in the candidate enterprise set, the paragraph position, the relationship weight, and the text length.
Further, the paragraph-division preprocessing is performed on the text by the following formula (equation not preserved in this text):
where $\lceil x \rceil$ denotes the smallest integer not less than $x$; $P \ge 1$ is the number of paragraphs in the body of the text; $H \ge 1$ is the number of parts into which the body is split, denoted $part_1, \dots, part_H$, with the title denoted $part_0$; the number of paragraphs in each part is denoted $L = (l_0, l_1, \dots, l_H)$; and two ratio parameters give the maximum share of the total paragraph count $P$ that the first part and the $H$-th part may take, respectively.
Further, the step of calculating the degree of association according to the word frequency, paragraph position, relationship weight, and text length comprises the following sub-steps:
Let $W$ be the weight vector of keywords by paragraph position:
$$W = (w_0, w_1, \dots, w_H)$$
where $w_i$ is the weight a keyword receives for appearing in $part_i$, and $w_0$ is the weight for appearing in the title.
Let $R$ be the correlation matrix between the enterprise entity set $C$ and the keyword set $K$:
$$R = (r_{ij})_{m \times n}$$
where $r_{ij}$ is the correlation coefficient between the $i$-th candidate enterprise entity and the $j$-th keyword.
Let $F$ be the frequency matrix of the keywords in $K$ by paragraph position:
$$F = (f_{ij})_{n \times (H+1)}$$
where $f_{ij}$ is the word frequency of the $i$-th keyword in $part_j$.
The correlation-coefficient-weighted frequency matrix is:
$$\tilde{F} = R F = (\tilde{f}_{ij})_{m \times (H+1)}, \quad \tilde{f}_{ij} = \sum_{k=1}^{n} r_{ik} f_{kj}$$
where $\tilde{f}_{ij}$ is the sum of the weighted word frequencies of the $i$-th candidate enterprise entity in $part_j$.
Define the relevance factor $RX$, which measures the relative order of association among the candidate enterprise entities within this text (the defining equations for $RX$ are not preserved in this text).
Define the relevance factor $RY$, which measures the order of association of a candidate enterprise entity across different texts. Here $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of words remaining after the segmentation result of the text has been cleaned, used to measure text length (the defining equation for $RY$ is not preserved in this text), with $0 \le ry_i \le 1$.
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise set $C$ (equation not preserved in this text), where $\odot$ denotes element-wise multiplication and the $i$-th entry of $R_{KC}$ is the degree of association of this text with the $i$-th candidate enterprise entity.
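A small sketch may help with the per-part weighted frequency matrix. It assumes that entry (i, j) is the sum over keywords k of r_ik times f_kj, that is, an ordinary matrix product of R and F; this reading follows the descriptions of the entries but is not confirmed by a surviving formula. The helper name `matmul` and all numbers are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Toy data (assumed): two candidates, two keywords, title plus two body parts.
R = [[1.0, 0.5],      # r_ik: candidate 0 vs. keywords 0 and 1
     [0.0, 1.0]]      # candidate 1 vs. keywords 0 and 1
F = [[1, 2, 0],       # f_kj: keyword 0 counts in part_0 (title), part_1, part_2
     [0, 1, 3]]       # keyword 1 counts in the same parts
F_tilde = matmul(R, F)
print(F_tilde)  # [[1.0, 2.5, 1.5], [0.0, 1.0, 3.0]]
```

Each row of `F_tilde` describes one candidate and each column one part of the text, so the position weights W can later be applied column by column.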
To solve the above technical problem, another technical solution adopted by the present invention is a device for calculating the relevance between a text and an enterprise entity using a knowledge graph, comprising:
A text acquisition module, for obtaining a text;
A word segmentation module, for performing word segmentation on the text, extracting the set of keywords that occur in the text, retrieving the enterprise entities associated with the keywords through a pre-established knowledge graph, and taking those enterprise entities as the candidate enterprise set, wherein the knowledge graph includes a number of pieces of node information and the relationship and association weight between each piece of node information and its corresponding node information; among these, one piece of node information is enterprise entity information, and the remaining pieces are product information or natural person information corresponding to that enterprise entity;
A degree-of-association calculation module, for calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate in the candidate enterprise set.
Further, the degree-of-association calculation module is also used to calculate the degree of association between the text and each candidate enterprise entity according to both the word frequency of the associated keywords and the relationship weight.
The present invention constructs a knowledge graph of the financial field and uses it as the relationship network for matching candidate keywords. The graph covers relationships in which an enterprise is the target entity, such as its registered full name, short name, products, senior executives, shareholders, and investments. The invention assigns different weights to the paragraph positions where keywords appear, so that the importance of different paragraphs of the text is taken into account. Using the complex relationship network built with knowledge graph technology, it calculates a degree of association for every possible keyword and then weights and quantifies the results, improving both the success rate and the accuracy of associating a text with a target entity.
Brief description of the drawings
Fig. 1 is a flowchart of the first embodiment of the method of the present invention for calculating the relevance between a text and an enterprise entity using a knowledge graph.
Fig. 2 is a schematic structural diagram of the knowledge graph of the present invention.
Fig. 3 is a flowchart of the second embodiment of the method of the present invention for calculating the relevance between a text and an enterprise entity using a knowledge graph.
Fig. 4 is a schematic diagram of the sample article in the specific example.
Fig. 5 is a schematic diagram of the knowledge graph related to the sample article in the specific example.
Fig. 6 is a block diagram of an embodiment of the device of the present invention for calculating the relevance between a text and an enterprise entity using a knowledge graph.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the method of the present invention for calculating the relevance between a text and an enterprise entity using a knowledge graph includes the following steps:
S101, obtain a text;
The text can be a public-opinion text (that is, public-opinion information).
S102, perform word segmentation on the text, extract the set of keywords that occur in the text, retrieve the enterprise entities associated with the keywords through the pre-established knowledge graph, and take those enterprise entities as the candidate enterprise set, wherein the knowledge graph includes target node information, associated node information, and the relationship and association weight between the target node information and the associated node information, the target node information includes first enterprise entity information, and the associated node information includes second enterprise entity information, product information, or natural person information associated with the first enterprise entity information;
The knowledge graph is built as follows: target node information and associated node information are extracted from a database (such as a corpus), and a corresponding association weight is assigned according to the relationship between the target node information and the associated node information, forming the knowledge graph (see Fig. 2). The target node information is first enterprise entity information (for example, an enterprise name such as XX Co., Ltd.). The node information associated with the target node can be second enterprise entity information associated with the first enterprise entity, natural person information associated with it (such as its senior executives or shareholders), or a product associated with it (such as a product it develops or brings to market). In the knowledge graph, either the first or the second enterprise entity information can serve as the target node information: when second enterprise entity A in Fig. 2 becomes the target node, the original first enterprise entity becomes an associated node of second enterprise entity A, and only their relationship changes accordingly. The knowledge graph also records the relationship and association weight between each target node and its associated nodes. Relationships between the first and second enterprise entities include, but are not limited to, investment, supply and demand, and guarantee relationships; relationships between a natural person and the first enterprise entity include employment relationships (such as shareholder, senior executive, or employee). For example, second enterprise entity A is a supplier of the first enterprise entity, with an association weight of 0.65; product A is a product of the first enterprise entity, with an association weight of 0.5; and natural person B is a shareholder of the first enterprise entity, with an association weight of 1. The weight assigned depends on the attributes of the relationship: for example, the larger the investment ratio, the larger the weight, and the more important the position held, the larger the weight. The specific construction method is not detailed here. The resulting knowledge graph can be stored in a graph database for retrieval and querying.
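The example relationships above can be written down as a small weighted edge list. This is only an illustrative sketch: the `Edge` class, its field names, and the `neighbours` helper are hypothetical, and an in-memory list stands in for the graph database mentioned in the text. The three weights are the ones quoted in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str      # associated node
    target: str      # target node (an enterprise entity)
    relation: str    # kind of relationship
    weight: float    # association weight


EDGES = [
    Edge("Second enterprise entity A", "First enterprise entity", "supplier", 0.65),
    Edge("Product A", "First enterprise entity", "product", 0.5),
    Edge("Natural person B", "First enterprise entity", "shareholder", 1.0),
]


def neighbours(entity, edges):
    """Map each node associated with `entity` to its (relation, weight)."""
    return {e.source: (e.relation, e.weight) for e in edges if e.target == entity}


assoc = neighbours("First enterprise entity", EDGES)
for node, (relation, weight) in assoc.items():
    print(node, relation, weight)
```

Storing the relation label next to the weight keeps open the option of deriving weights from relationship attributes, such as investment ratio or seniority of a position.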
In step S102, word segmentation yields all keywords, which form the keyword set, denoted K. Each keyword in K is searched for in the knowledge graph to obtain the enterprise entities associated with K, and those enterprise entities form the candidate enterprise set, denoted C.
S103, calculate the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate in the candidate enterprise set. The degree of association is calculated from word frequency as follows:
Let $F$ be the word-frequency vector of the keyword set $K$:
$$F = (f_1, f_2, \dots, f_n)$$
where $f_i$ is the word frequency of the $i$-th keyword and $n$ is the number of keywords in $K$.
Let $R$ be the correlation matrix between the candidate set $C$ and the keyword set $K$, whose entry is 1 when the two knowledge-graph nodes are connected and 0 when they are not:
$$R = (r_{ij})_{m \times n}, \quad r_{ij} \in \{0, 1\}$$
where $m$ is the number of candidates in $C$.
The aggregated word-frequency vector of $C$ over its related keywords is:
$$\tilde{F} = R F^{\mathsf{T}} = (\tilde{f}_1, \dots, \tilde{f}_m)^{\mathsf{T}}, \quad \tilde{f}_i = \sum_{j=1}^{n} r_{ij} f_j$$
where $\tilde{f}_i$ is the sum of the word frequencies of all keywords in the text that are related to the $i$-th candidate enterprise entity.
Define the relevance factor $RX$, which measures the relative order of association among the candidate enterprise entities within this text (the defining equations for $RX$ are not preserved in this text).
Define the relevance factor $RY$, which measures the order of association of a candidate enterprise entity across different texts. Here $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of words remaining after the segmentation result of the text has been cleaned, used to measure text length (the defining equation for $RY$ is not preserved in this text), with $0 \le ry_i \le 1$.
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise set $C$ (equation not preserved in this text), where $\odot$ denotes element-wise multiplication and the $i$-th entry of $R_{KC}$ is the degree of association of this text with the $i$-th candidate enterprise entity. Based on this degree of association, a threshold can be set to select the enterprise entities most relevant to this text; the different texts related to the $i$-th entity can also be filtered and ranked.
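The threshold screening and ranking mentioned above can be sketched as follows. The scores, the threshold value, and the function name `screen` are all hypothetical; the sketch only assumes that a degree of association has already been computed for each candidate.

```python
def screen(scores, threshold):
    """Keep entities scoring at least `threshold`, ordered from highest to lowest."""
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Assumed association scores for one text; not results from the patent.
scores = {"Entity A": 0.82, "Entity B": 0.31, "Entity C": 0.55}
top = screen(scores, threshold=0.5)
print(top)  # [('Entity A', 0.82), ('Entity C', 0.55)]
```

The same function can rank texts for a fixed entity if the dictionary keys are text identifiers instead of entity names.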
As a preferred or optional alternative, the degree of association between the text and each candidate enterprise entity can also be calculated from the word frequency together with the correlation coefficient between keywords and candidate enterprise entities, as follows:
First, compute the word-frequency vector $F$ of the keyword set $K$:
$$F = (f_1, f_2, \dots, f_n)$$
where $f_i$ is the word frequency of the $i$-th keyword.
Let $R$ be the correlation matrix between the candidate enterprise set $C$ and the keyword set $K$:
$$R = (r_{ij})_{m \times n}$$
where $r_{ij}$ is the correlation coefficient between the $i$-th candidate enterprise entity and the $j$-th keyword.
The correlation-coefficient-weighted frequency vector is:
$$\tilde{F} = R F^{\mathsf{T}}, \quad \tilde{f}_i = \sum_{j=1}^{n} r_{ij} f_j$$
where $\tilde{f}_i$ is the sum of the weighted keyword frequencies of the $i$-th candidate enterprise entity.
Define the relevance factor $RX$, which measures the relative order of association among the candidate enterprise entities within this text (the defining equations for $RX$ are not preserved in this text).
Define the relevance factor $RY$, which measures the order of association of a candidate enterprise entity across different texts. Here $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of words remaining after the segmentation result of the text has been cleaned, used to measure text length (the defining equation for $RY$ is not preserved in this text), with $0 \le ry_i \le 1$.
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise set $C$ (equation not preserved in this text), where $\odot$ denotes element-wise multiplication and the $i$-th entry of $R_{KC}$ is the degree of association of this text with the $i$-th candidate enterprise entity. Based on this degree of association, a threshold can be set to select the enterprise entities most relevant to this text; the different texts related to the $i$-th entity can also be filtered and ranked.
It will be understood that the relationship weight serves to calculate the degree of association between keywords and candidate enterprise entities more accurately; in some embodiments, the relationship weight is not an essential technical feature.
In this embodiment, the pre-established knowledge graph serves as the basis. After the keywords in the text are extracted, each keyword is looked up in the knowledge graph to obtain the enterprise entities corresponding to it, and those entities are taken as candidates to form the candidate enterprise entity set. The degree of association between the text and each candidate is then obtained from the word frequency of the keywords in the text and the relationship weight between keywords and candidates. This improves the success rate and accuracy of associating a text with an enterprise entity (the target enterprise entity), enriches the dimensions along which text information is related to the target enterprise entity, and provides a more accurate basis for subsequent analysis.
Referring to Fig. 3, which is a flowchart of the second embodiment of the method of the present invention for calculating the relevance between a text and an enterprise entity using a knowledge graph, the method of this embodiment includes the following steps:
S201, obtain a text;
S202, perform paragraph-division preprocessing on the text;
In this step, the text is divided as follows:
A public-opinion text is assumed to consist of two main parts, a title and a body, and the body has $P \ge 1$ paragraphs. The body is split into $H \ge 1$ parts, denoted $part_1, \dots, part_H$, and the title is denoted $part_0$; the number of paragraphs in each part is denoted $L = (l_0, l_1, \dots, l_H)$. Because different paragraphs differ in importance within the text, the lengths of the head and tail parts are capped when the body is split: two ratio parameters give the maximum share of the total paragraph count $P$ that part 1 and part $H$ may take, respectively (the parameter symbols and the example values chosen in this embodiment are not preserved in this text). The number of paragraphs in each part is calculated by the following formula (equation not preserved in this text):
where $\lceil x \rceil$ denotes the smallest integer not less than $x$.
In this step, after the paragraph-division preprocessing, a corresponding weight is also assigned to each paragraph position. In general, the title, the front part, and the tail part of the text are given higher weights, and the middle of the body is given a relatively lower weight. For example, the weight $w_0$ of the title is 0.35, the weight $w_1$ of the front part is 0.25, the weight $w_H$ of the tail part is 0.25, and the weights $w_2$ to $w_{H-1}$ of the middle parts are 0.15.
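The example weighting scheme can be captured in a few lines. The default values are the ones quoted above; the function name `position_weights` and the handling of the H = 1 case (a single body part is treated as the front part) are assumptions made for this sketch.

```python
def position_weights(H, w_title=0.35, w_first=0.25, w_last=0.25, w_middle=0.15):
    """Return [w_0, w_1, ..., w_H] for a title plus H >= 1 body parts."""
    if H < 1:
        raise ValueError("H must be at least 1")
    if H == 1:
        # Assumption: with a single body part, use the front-part weight.
        return [w_title, w_first]
    return [w_title, w_first] + [w_middle] * (H - 2) + [w_last]


W = position_weights(3)
print(W)  # [0.35, 0.25, 0.15, 0.25]
```

For H = 3 this reproduces the weight vector W = (0.35, 0.25, 0.15, 0.25) used in the worked example of this document.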
S203, perform word segmentation on the text, extract the set of keywords that occur in the text, retrieve the enterprise entities associated with the keywords through the pre-established knowledge graph, and take those enterprise entities as the candidate enterprise set, wherein the knowledge graph includes target node information, associated node information, and the relationship and association weight between the target node information and the associated node information, the target node information includes first enterprise entity information, and the associated node information includes second enterprise entity information, product information, or natural person information associated with the first enterprise entity information;
In this step, word segmentation is performed on the divided text obtained in step S202. All candidate words in the text that can be found in the knowledge graph are marked as keywords and form the keyword set, denoted K. Each keyword in K is searched for in the knowledge graph to obtain the enterprise entities associated with K, and those enterprise entities form the candidate enterprise set, denoted C.
S204, calculate the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate in the candidate enterprise set, the paragraph position, the relationship weight, and the text length, where the text length is determined by the number of words obtained in the word segmentation step.
In this step, the degree of association between the text and each candidate enterprise entity is calculated as follows:
Let $W$ be the weight vector of keywords by paragraph position:
$$W = (w_0, w_1, \dots, w_H)$$
where $w_i$ is the weight a keyword receives for appearing in $part_i$, and $w_0$ is the weight for appearing in the title.
Let $R$ be the correlation matrix between the candidate set $C$ and the keyword set $K$:
$$R = (r_{ij})_{m \times n}$$
where $r_{ij}$ is the correlation coefficient between the $i$-th candidate enterprise entity and the $j$-th keyword.
Let $F$ be the frequency matrix of the keywords in $K$ by paragraph position:
$$F = (f_{ij})_{n \times (H+1)}$$
where $f_{ij}$ is the word frequency of the $i$-th keyword in $part_j$.
The correlation-coefficient-weighted frequency matrix is:
$$\tilde{F} = R F = (\tilde{f}_{ij})_{m \times (H+1)}, \quad \tilde{f}_{ij} = \sum_{k=1}^{n} r_{ik} f_{kj}$$
where $\tilde{f}_{ij}$ is the sum of the weighted word frequencies of the $i$-th candidate enterprise entity in $part_j$.
Define the relevance factor $RX$, which measures the relative order of association among the candidate enterprise entities within this text (the defining equations for $RX$ are not preserved in this text).
Define the relevance factor $RY$, which measures the order of association of a candidate enterprise entity across different texts. Here $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of words remaining after the segmentation result of the text has been cleaned, used to measure text length (the defining equation for $RY$ is not preserved in this text), with $0 \le ry_i \le 1$.
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise set $C$ (equation not preserved in this text), where $\odot$ denotes element-wise multiplication and the $i$-th entry of $R_{KC}$ is the degree of association of this text with the $i$-th candidate enterprise entity. Based on this degree of association, a threshold can be set to select the enterprise entities most relevant to this text; the different texts related to the $i$-th entity can also be filtered and ranked.
In this embodiment, the text is pre-processed by dividing it into paragraph parts, and each part is assigned a corresponding weight. After word segmentation, the keyword weight vector is determined from the paragraph position in which each keyword occurs; combining it with the correlation-coefficient-weighted term-frequency matrix gives the relevance factors and hence the association matrix between the text and the candidate enterprise entity set C, so that the degree of association between the whole text and each enterprise entity in C is obtained more accurately.
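The quantities defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented formula: the matrix values are hypothetical toy data, the function names are invented for the example, and collapsing the per-part values with a dot product against W is an assumed way of applying the position weights. The sketch stops before the relevance factors RX and RY.

```python
from typing import List

Matrix = List[List[float]]


def weighted_frequency(R: Matrix, F: Matrix) -> Matrix:
    """Entry (i, j) is sum_k R[i][k] * F[k][j]: the weighted term
    frequency of candidate i in part j (part 0 is the title)."""
    n_parts = len(F[0])
    return [
        [sum(r_ik * F[k][j] for k, r_ik in enumerate(row)) for j in range(n_parts)]
        for row in R
    ]


def position_weighted_score(RF: Matrix, W: List[float]) -> List[float]:
    """Assumption: collapse the parts with a dot product against W."""
    return [sum(w * x for w, x in zip(W, row)) for row in RF]


# W is taken from Table 1 of the example; R and F are hypothetical.
W = [0.35, 0.25, 0.15, 0.25]          # title, part_1, part_2, part_3
R = [[1.0, 0.5, 0.0],                 # candidate 0 vs. keywords 0..2
     [0.0, 0.0, 0.8]]                 # candidate 1 vs. keywords 0..2
F = [[1, 2, 0, 1],                    # keyword 0 counts per part
     [0, 1, 1, 0],                    # keyword 1 counts per part
     [0, 0, 2, 0]]                    # keyword 2 counts per part

RF = weighted_frequency(R, F)
scores = position_weighted_score(RF, W)
```

Keeping R as a candidates-by-keywords matrix and F as a keywords-by-parts matrix makes the weighted frequency matrix a plain matrix product, so each row describes one candidate enterprise across the title and the body parts.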
A specific example below explains in detail how the knowledge graph is used to calculate the relevance between a text and enterprise entities:
Referring to Fig. 4 and Fig. 5, Fig. 4 is the sample article of the example, and Fig. 5 is the knowledge graph corresponding to the sample article; because space is limited, only the part of the knowledge graph centered on "LeTV information technology (Beijing) limited liability company" is shown.
The first step pre-processes the sample article. The sample article has four paragraphs in total, so P = 4; take H = 3 for the body text.
The parts and weights obtained according to the formula are as follows:
Table 1: W = (0.35, 0.25, 0.15, 0.25)
The second step extracts the keywords from the text and the candidate entity set.
(1) The keyword set in the title and body text:
K = {LeEco, Sun Hongbin, circle of friends, LeTV, New LeEco Zhijia, Tencent, Tencent Video, LeEco TV, Lechuang Entertainment}
(2) Retrieval in the knowledge graph gives the set of enterprises directly associated with K:
C = {LeTV information technology (Beijing) limited liability company, Shenzhen Tencent Computer System Co., Ltd}
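The retrieval in (2) can be sketched as a lookup from keywords to the enterprise nodes directly connected to them. The adjacency table below is a hypothetical stand-in for the knowledge graph of Fig. 5: which keyword links to which enterprise, the coefficient values, and the function name `candidate_enterprises` are all assumptions made for illustration.

```python
from typing import Dict, List

LETV = "LeTV information technology (Beijing) limited liability company"
TENCENT = "Shenzhen Tencent Computer System Co., Ltd"

# Hypothetical edges: keyword -> {enterprise: correlation coefficient}.
# Neither the links nor the numbers are read from Fig. 5.
GRAPH: Dict[str, Dict[str, float]] = {
    "Sun Hongbin": {LETV: 0.6},
    "LeEco TV": {LETV: 0.9},
    "Tencent": {TENCENT: 0.8},
}


def candidate_enterprises(keywords: List[str],
                          graph: Dict[str, Dict[str, float]]) -> List[str]:
    """Return enterprises directly linked to any keyword, in first-seen order."""
    seen: List[str] = []
    for kw in keywords:
        for enterprise in graph.get(kw, {}):
            if enterprise not in seen:
                seen.append(enterprise)
    return seen


# A subset of the example keyword set K.
K = ["LeEco", "Sun Hongbin", "LeTV", "Tencent", "LeEco TV"]
C = candidate_enterprises(K, GRAPH)
```

Keywords with no edge in the graph simply contribute no candidate, so the candidate set only contains enterprises that at least one keyword points to.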
The third step calculates the degree of association between the public-opinion text and the candidate target entities.
Using the correlation coefficients in the knowledge graph (the numbers on the edges), the correlation coefficient matrix R between the entity set C and its keyword set K is obtained:
Table 2
The frequency matrix F is as follows:
The correlation-coefficient-weighted frequency matrix is then obtained as follows:
After the segmentation output of the text is cleaned, the number of segmented words is 148, so scale = 148; take β = 100.
The association matrix R_KC between the text and the entity set C is obtained as follows:
The degree of association between the sample article and "LeTV information technology (Beijing) limited liability company" is therefore 0.526, and that with "Shenzhen Tencent Computer System Co., Ltd" is 0.122. (The coefficients in the above example are assumed for illustration.)
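The last two operations, the element-wise product ⊙ and the threshold screening, can be sketched as follows. The two association values are the ones reported for the sample article; the rx and ry vectors, the threshold of 0.3, and the function names are hypothetical and chosen only to show the mechanics.

```python
from typing import Dict, List


def hadamard(rx: List[float], ry: List[float]) -> List[float]:
    """Element-wise product of two equally long vectors."""
    return [a * b for a, b in zip(rx, ry)]


def screen(assoc: Dict[str, float], threshold: float) -> List[str]:
    """Keep entities whose degree of association meets the threshold,
    ordered from the highest degree to the lowest."""
    kept = [(name, degree) for name, degree in assoc.items() if degree >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept]


# Hypothetical factor values, only to show the element-wise product.
demo = hadamard([0.8, 0.2], [0.5, 0.5])

# Degrees of association reported for the sample article.
assoc = {
    "LeTV information technology (Beijing) limited liability company": 0.526,
    "Shenzhen Tencent Computer System Co., Ltd": 0.122,
}
relevant = screen(assoc, threshold=0.3)  # the threshold value is an assumption
```

With this assumed threshold only the first enterprise is kept; lowering the threshold below 0.122 would keep both, ranked by their degree of association.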
Referring to Fig. 6, the invention also discloses a device for calculating the relevance between a text and enterprise entities using a knowledge graph, comprising:
A text acquisition module, for obtaining a text;
A word segmentation module, for performing word segmentation on the text, extracting the set of keywords occurring in the text, retrieving the enterprise entities associated with the keywords through a pre-established knowledge graph, and taking the enterprise entities associated with the keywords as the candidate enterprise set, wherein the knowledge graph comprises a number of node information items and the relationships and relevance weights between each node information item and its corresponding node information items; among the node information items, some are enterprise entity information, and the rest are product information or natural-person information corresponding to the respective enterprise entities;
An association calculation module, for calculating the degree of association between the text and each candidate enterprise entity according to the relationship weights and the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
Optionally, the device further comprises a paragraph division pre-processing module, for performing paragraph division pre-processing on the text and for assigning corresponding weights to the text parts;
The association calculation module is further used to calculate the degree of association between the text and each candidate enterprise entity according to the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set, the paragraph position, the relationship weights, and the text length.
Optionally, the paragraph division pre-processing module performs paragraph division pre-processing by the following formula:
where ⌈x⌉ denotes the smallest integer not less than x; P is the number of paragraphs in the text, P ≥ 1; H is the number of parts into which the text is divided, H ≥ 1, the parts being denoted part_1, …, part_H and the title being denoted part_0; the number of paragraphs in each part is denoted L = (l_0, l_1, …, l_H); one parameter gives the maximum proportion of the total paragraph count P that the first part may occupy, and another gives the maximum proportion of the total paragraph count P that the H-th part may occupy.
Optionally, the word segmentation module is further used to perform word segmentation on the text parts produced by paragraph division, collect all the keywords into a keyword set denoted K, search the knowledge graph for the keywords in K to obtain the enterprise entities associated with K, and take those enterprise entities as the candidate enterprise set, denoted C.
In this embodiment, the functions of the modules of the device for calculating the relevance between a text and enterprise entities using a knowledge graph are as described for the method above and are not repeated here.
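The division of work among the modules can be sketched as a small pipeline. Everything here is a hypothetical illustration: the class name `RelevanceDevice`, the enterprise names, whitespace splitting as a stand-in for real word segmentation, and scoring by raw keyword counts as a stand-in for the association calculation are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RelevanceDevice:
    """Hypothetical wiring of the modules described above."""
    acquire: Callable[[], str]                              # text acquisition module
    segment: Callable[[str], List[str]]                     # word segmentation
    retrieve: Callable[[List[str]], Dict[str, List[str]]]   # candidate lookup
    score: Callable[[List[str], Dict[str, List[str]]], Dict[str, float]]

    def run(self) -> Dict[str, float]:
        text = self.acquire()
        tokens = self.segment(text)
        candidates = self.retrieve(tokens)
        return self.score(tokens, candidates)


# Toy knowledge graph: keyword -> enterprise (illustrative names only).
LINKS = {"LeTV": "Enterprise A", "Tencent": "Enterprise B"}


def retrieve(tokens: List[str]) -> Dict[str, List[str]]:
    """Map each candidate enterprise to the keywords that point to it."""
    candidates: Dict[str, List[str]] = {}
    for tok in dict.fromkeys(tokens):  # unique tokens, original order
        if tok in LINKS:
            candidates.setdefault(LINKS[tok], []).append(tok)
    return candidates


def score(tokens: List[str], candidates: Dict[str, List[str]]) -> Dict[str, float]:
    """Placeholder score: total count of each candidate's keywords."""
    tf = Counter(tokens)
    return {ent: float(sum(tf[k] for k in kws)) for ent, kws in candidates.items()}


device = RelevanceDevice(
    acquire=lambda: "LeTV Tencent LeTV news",
    segment=str.split,
    retrieve=retrieve,
    score=score,
)
result = device.run()
```

Passing each module in as a callable keeps the stages replaceable, so a position-weighted or coefficient-weighted scorer could be substituted for the placeholder without touching the other stages.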
The above are only embodiments of the present invention and do not limit its scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, likewise falls within the scope of patent protection of the present invention.
Claims (10)
1. A method for calculating the relevance between a text and enterprise entities using a knowledge graph, comprising the following steps:
Obtaining a text;
Performing word segmentation on the text, extracting the set of keywords occurring in the text, retrieving the enterprise entities associated with the keywords through a pre-established knowledge graph, and taking the enterprise entities associated with the keywords as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relationships and relevance weights between the target node information and the associated node information, the target node information comprises first enterprise entity information, and the associated node information comprises second entity information, product information, or natural-person information associated with the first enterprise entity information;
Calculating the degree of association between the text and each candidate enterprise entity according to the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
2. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 1, characterized in that the step of performing word segmentation on the text, extracting the set of keywords occurring in the text, retrieving the enterprise entities associated with the keywords through the pre-established knowledge graph, and taking the enterprise entities associated with the keywords as the candidate enterprise set comprises:
Performing word segmentation on the text and collecting all the keywords into a keyword set denoted K, searching the knowledge graph for the keywords in K to obtain the enterprise entities associated with K, and taking the enterprise entities associated with the keywords as the candidate enterprise set, denoted C.
3. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 2, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity according to the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set comprises:
Letting F be the frequency vector of the keyword set K:
f_i denotes the term frequency of the i-th keyword;
Letting R be the correlation coefficient matrix between the set C and its keyword set K, with an entry of 1 where the knowledge graph nodes are connected and 0 where they are not:
Forming the aggregated term-frequency vector of the set C over its related keywords,
whose i-th entry is the sum of the term frequencies of all keywords in the text that are related to the i-th candidate enterprise entity;
Defining the relevance factor RX, which measures the relative ranking of association among the candidate enterprise entities within this text,
where u = (1, …, 1)
and 0 ≤ rx_i ≤ 1;
Defining the relevance factor RY, which measures the relative ranking of association of a candidate enterprise entity across different texts, where β > 0 is a scaling adjustment parameter and scale > 0 is the number of segmented words that remain after the text's segmentation output has been cleaned, used to measure text length,
with 0 ≤ ry_i ≤ 1;
Obtaining the association matrix R_KC between the text and the candidate enterprise entity set C,
where ⊙ denotes the element-wise matrix product and the i-th entry of R_KC is the degree of association of this text with the i-th candidate enterprise entity.
4. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 2, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity further comprises:
Calculating the degree of association between the text and each candidate enterprise entity according to the relationship weights and the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
5. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 4, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity according to the relationship weights and the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set comprises:
First computing the term-frequency vector F of the keyword set K:
f_i denotes the term frequency of the i-th keyword;
Letting R be the correlation coefficient matrix between the candidate enterprise set C and its keyword set K:
r_ij denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
Forming the correlation-coefficient-weighted frequency vector,
whose i-th entry is the sum of the weighted keyword term frequencies of the i-th candidate enterprise entity;
Defining the relevance factor RX, which measures the relative ranking of association among the candidate enterprise entities within this text,
where u = (1, …, 1)
and 0 ≤ rx_i ≤ 1;
Defining the relevance factor RY, which measures the relative ranking of association of a candidate enterprise entity across different texts, where β > 0 is a scaling adjustment parameter and scale > 0 is the number of segmented words that remain after the text's segmentation output has been cleaned, used to measure text length,
with 0 ≤ ry_i ≤ 1;
Obtaining the association matrix R_KC between the text and the candidate enterprise entity set C,
where ⊙ denotes the element-wise matrix product and the i-th entry of R_KC is the degree of association of this text with the i-th candidate enterprise entity.
6. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 4, characterized in that, before the step of performing word segmentation on the text, the method further comprises:
Performing paragraph division pre-processing on the text and assigning corresponding weights to the paragraph positions;
And the step of calculating the degree of association between the text and each candidate enterprise entity further comprises:
Calculating the degree of association between the text and each candidate enterprise entity according to the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set, the paragraph position, the relationship weights, and the text length.
7. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 6, characterized in that paragraph division pre-processing is performed on the text by the following formula:
where ⌈x⌉ denotes the smallest integer not less than x; P is the number of paragraphs in the text, P ≥ 1; H is the number of parts into which the text is divided, H ≥ 1, the parts being denoted part_1, …, part_H and the title being denoted part_0; the number of paragraphs in each part is denoted L = (l_0, l_1, …, l_H); one parameter gives the maximum proportion of the total paragraph count P that the first part may occupy, and another gives the maximum proportion of the total paragraph count P that the H-th part may occupy.
8. The method for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 7, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity according to the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set, the paragraph position, the relationship weights, and the text length comprises the following sub-steps:
Letting W be the weight vector of the keywords by paragraph position:
W = (w_0, w_1, …, w_H),
where w_i is the weight a keyword receives for appearing in part_i, and w_0 is the weight it receives for appearing in the title;
Letting R be the correlation coefficient matrix between the enterprise entity set C and its keyword set K:
r_ij denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
Letting F be the frequency matrix of the keywords in K over the different paragraph positions:
f_ij denotes the term frequency of the i-th keyword in part_j;
Forming the correlation-coefficient-weighted frequency matrix,
whose (i, j) entry is the sum of the weighted term frequencies of the i-th candidate enterprise entity in part_j;
Defining the relevance factor RX, which measures the relative ranking of association among the candidate enterprise entities within this text,
where u = (1, …, 1)
and 0 ≤ rx_i ≤ 1;
Defining the relevance factor RY, which measures the relative ranking of association of a candidate enterprise entity across different texts, where β > 0 is a scaling adjustment parameter and scale > 0 is the number of segmented words that remain after the text's segmentation output has been cleaned, used to measure text length,
with 0 ≤ ry_i ≤ 1;
Obtaining the association matrix R_KC between the text and the candidate enterprise entity set C,
where ⊙ denotes the element-wise matrix product and the i-th entry of R_KC is the degree of association of this text with the i-th candidate enterprise entity.
9. A device for calculating the relevance between a text and enterprise entities using a knowledge graph, comprising:
A text acquisition module, for obtaining a text;
A word segmentation module, for performing word segmentation on the text, extracting the set of keywords occurring in the text, retrieving the enterprise entities associated with the keywords through a pre-established knowledge graph, and taking the enterprise entities associated with the keywords as the candidate enterprise set, wherein the knowledge graph comprises a number of node information items and the relationships and relevance weights between each node information item and its corresponding node information items; among the node information items, some are enterprise entity information, and the rest are product information or natural-person information corresponding to the respective enterprise entities;
An association calculation module, for calculating the degree of association between the text and each candidate enterprise entity according to the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
10. The device for calculating the relevance between a text and enterprise entities using a knowledge graph according to claim 9, characterized in that the association calculation module is further used to calculate the degree of association between the text and each candidate enterprise entity according to the relationship weights and the term frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810567101.5A CN109033132B (en) | 2018-06-05 | 2018-06-05 | Method and device for calculating text and subject correlation by using knowledge graph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810567101.5A CN109033132B (en) | 2018-06-05 | 2018-06-05 | Method and device for calculating text and subject correlation by using knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033132A (en) | 2018-12-18 |
CN109033132B (en) | 2020-12-11 |
Family
ID=64611958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810567101.5A Active CN109033132B (en) | 2018-06-05 | 2018-06-05 | Method and device for calculating text and subject correlation by using knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033132B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815499A (en) * | 2019-01-25 | 2019-05-28 | 杭州凡闻科技有限公司 | Information correlation method and system |
CN111881183A (en) * | 2020-07-28 | 2020-11-03 | 北京金堤科技有限公司 | Enterprise name matching method and device, storage medium and electronic equipment |
CN112732883A (en) * | 2020-12-31 | 2021-04-30 | 平安科技(深圳)有限公司 | Fuzzy matching method and device based on knowledge graph and computer equipment |
WO2021098648A1 (en) * | 2019-11-22 | 2021-05-27 | 深圳前海微众银行股份有限公司 | Text recommendation method, apparatus and device, and medium |
WO2021103594A1 (en) * | 2019-11-25 | 2021-06-03 | 深圳壹账通智能科技有限公司 | Tacitness degree detection method and device, server and readable storage medium |
CN113688628A (en) * | 2021-07-28 | 2021-11-23 | 上海携宁计算机科技股份有限公司 | Text recognition method, electronic device, and computer-readable storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103886063A (en) * | 2014-03-18 | 2014-06-25 | 国家电网公司 | Text retrieval method and device |
CN104346446A (en) * | 2014-10-27 | 2015-02-11 | 百度在线网络技术(北京)有限公司 | Paper associated information recommendation method and device based on mapping knowledge domain |
US20150310073A1 (en) * | 2014-04-29 | 2015-10-29 | Microsoft Corporation | Finding patterns in a knowledge base to compose table answers |
CN105117487A (en) * | 2015-09-19 | 2015-12-02 | 杭州电子科技大学 | Book semantic retrieval method based on content structures |
CN105354321A (en) * | 2015-11-16 | 2016-02-24 | 中国建设银行股份有限公司 | Query data processing method and device |
CN106095858A (en) * | 2016-06-02 | 2016-11-09 | 海信集团有限公司 | A kind of audio video searching method, device and terminal |
CN107679186A (en) * | 2017-09-30 | 2018-02-09 | 北京奇虎科技有限公司 | The method and device of entity search is carried out based on entity storehouse |
CN108038204A (en) * | 2017-12-15 | 2018-05-15 | 福州大学 | For the viewpoint searching system and method for social media |
CN108090167A (en) * | 2017-12-14 | 2018-05-29 | 畅捷通信息技术股份有限公司 | Method, system, computing device and the storage medium of data retrieval |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103886063A (en) * | 2014-03-18 | 2014-06-25 | 国家电网公司 | Text retrieval method and device |
US20150310073A1 (en) * | 2014-04-29 | 2015-10-29 | Microsoft Corporation | Finding patterns in a knowledge base to compose table answers |
CN104346446A (en) * | 2014-10-27 | 2015-02-11 | 百度在线网络技术(北京)有限公司 | Paper associated information recommendation method and device based on mapping knowledge domain |
CN105117487A (en) * | 2015-09-19 | 2015-12-02 | 杭州电子科技大学 | Book semantic retrieval method based on content structures |
CN105354321A (en) * | 2015-11-16 | 2016-02-24 | 中国建设银行股份有限公司 | Query data processing method and device |
CN106095858A (en) * | 2016-06-02 | 2016-11-09 | 海信集团有限公司 | A kind of audio video searching method, device and terminal |
CN107679186A (en) * | 2017-09-30 | 2018-02-09 | 北京奇虎科技有限公司 | The method and device of entity search is carried out based on entity storehouse |
CN108090167A (en) * | 2017-12-14 | 2018-05-29 | 畅捷通信息技术股份有限公司 | Method, system, computing device and the storage medium of data retrieval |
CN108038204A (en) * | 2017-12-15 | 2018-05-15 | 福州大学 | For the viewpoint searching system and method for social media |
Non-Patent Citations (2)
Title |
---|
YOOKYUNG JO ET AL.: ""Detecting research topics via the correlation between graphs and texts"", 《 PROCEEDINGS OF THE 13TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING》 * |
张云秋 等: ""非相关文献知识发现的关键技术研究"", 《情报学报》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815499A (en) * | 2019-01-25 | 2019-05-28 | 杭州凡闻科技有限公司 | Information correlation method and system |
WO2021098648A1 (en) * | 2019-11-22 | 2021-05-27 | 深圳前海微众银行股份有限公司 | Text recommendation method, apparatus and device, and medium |
WO2021103594A1 (en) * | 2019-11-25 | 2021-06-03 | 深圳壹账通智能科技有限公司 | Tacitness degree detection method and device, server and readable storage medium |
CN111881183A (en) * | 2020-07-28 | 2020-11-03 | 北京金堤科技有限公司 | Enterprise name matching method and device, storage medium and electronic equipment |
CN112732883A (en) * | 2020-12-31 | 2021-04-30 | 平安科技(深圳)有限公司 | Fuzzy matching method and device based on knowledge graph and computer equipment |
CN113688628A (en) * | 2021-07-28 | 2021-11-23 | 上海携宁计算机科技股份有限公司 | Text recognition method, electronic device, and computer-readable storage medium |
CN113688628B (en) * | 2021-07-28 | 2023-09-22 | 上海携宁计算机科技股份有限公司 | Text recognition method, electronic device, and computer-readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109033132B (en) | 2020-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109033132A (en) | The method and device of text and the main body degree of correlation are calculated using knowledge mapping | |
CN106339502A (en) | Modeling recommendation method based on user behavior data fragmentation cluster | |
CN105468605B (en) | Entity information map generation method and device | |
CN107315738B (en) | A kind of innovation degree appraisal procedure of text information | |
CN103678576B (en) | The text retrieval system analyzed based on dynamic semantics | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN110909164A (en) | Text enhancement semantic classification method and system based on convolutional neural network | |
CN105045875B (en) | Personalized search and device | |
CN110489560A (en) | The little Wei enterprise portrait generation method and device of knowledge based graphical spectrum technology | |
CN105824959A (en) | Public opinion monitoring method and system | |
CN103309886A (en) | Trading-platform-based structural information searching method and device | |
CN110968782A (en) | Student-oriented user portrait construction and application method | |
CN106598950A (en) | Method for recognizing named entity based on mixing stacking model | |
CN110222172B (en) | Multi-source network public opinion theme mining method based on improved hierarchical clustering | |
CN105740448B (en) | More microblogging timing abstract methods towards topic | |
CN105378730A (en) | Social media content analysis and output | |
CN112966091B (en) | Knowledge map recommendation system fusing entity information and heat | |
CN110287329A (en) | A kind of electric business classification attribute excavation method based on commodity text classification | |
CN107341199A (en) | A kind of recommendation method based on documentation & info general model | |
CN110750995A (en) | File management method based on user-defined map | |
CN105869058B (en) | A kind of method that multilayer latent variable model user portrait extracts | |
CN110110218B (en) | Identity association method and terminal | |
CN107679977A (en) | A kind of tax administration platform and implementation method based on semantic analysis | |
CN114971730A (en) | Method for extracting file material, device, equipment, medium and product thereof | |
CN112148886A (en) | Method and system for constructing content knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||