CN109033132A - Method and device for calculating the degree of correlation between a text and an entity using a knowledge graph - Google Patents

Method and device for calculating the degree of correlation between a text and an entity using a knowledge graph

Publication number
CN109033132A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810567101.5A
Other languages
Chinese (zh)
Other versions
CN109033132B (en)
Inventor
孙雨轩
吴成龙
周劼人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Securities Credit Reporting (shenzhen) Co Ltd
Original Assignee
China Securities Credit Reporting (shenzhen) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Securities Credit Reporting (shenzhen) Co Ltd
Priority to CN201810567101.5A
Publication of CN109033132A
Application granted
Publication of CN109033132B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method and a device for calculating the degree of correlation between a text and an entity using a knowledge graph. The method comprises: obtaining a text; performing word segmentation on the text, extracting the set of keywords that occur in the text, retrieving, through a pre-established knowledge graph, the enterprise entities associated with the keywords, and taking those enterprise entities as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relationship and relevance weight between the target node information and the associated node information, the target node information comprises first enterprise entity information, and the associated node information comprises second entity information, product information, or natural person information associated with the first enterprise entity information; and calculating the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity in the candidate enterprise set occur.

Description

Method and device for calculating the degree of correlation between a text and an entity using a knowledge graph
Technical field
The present invention relates to a method and a device for calculating the degree of correlation between a text and an entity using a knowledge graph.
Background
In the information age, acquiring, processing, and analyzing massive amounts of data is a major difficulty. In some industries (such as finance), people follow information on every dimension of an enterprise to support decisions such as investment management. On the one hand, market participants need broader and more complete data; on the other hand, these data must be processed in a timely manner. Enterprise public opinion information is one dimension that market participants follow closely. As unstructured text, it is dispersed, large in volume, complex in format, and highly time-sensitive. Using technical means such as natural language processing to process this kind of data efficiently and extract valuable information is therefore a need of many financial practitioners. Faced with voluminous and complicated public opinion information, associating it with the enterprises of interest and screening out information that is of little value or unrelated to the entity is an essential step in data analysis and mining.
A common way to associate text information with an enterprise entity is to build a keyword library for the entity, including its registered business name, its abbreviated name, its stock code, and so on, and to use that library to run keyword-matching retrieval over a text database, treating every matched text as information relevant to the entity. This approach has several drawbacks. First, a fairly complete enterprise keyword library must be built in advance as the basis for retrieval. Second, ranking the matched results by degree of association works poorly: a keyword often appears in a text that is not actually about the enterprise, so considerable redundant information remains. Third, direct keyword matching can miss important information about enterprises closely affiliated with the enterprise of interest, causing information loss.
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a method and a device for calculating the degree of correlation between a text and an entity using a knowledge graph, which improve on the traditional approach of relying on keyword matching alone when analyzing large volumes of text. By incorporating a knowledge graph, the degree of association between a target entity and text information can be quantified, which enriches the dimensions along which text information is related to the target entity and provides a basis for subsequent analysis.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph, comprising the following steps:
Obtaining a text;
Performing word segmentation on the text, extracting the set of keywords that occur in the text, retrieving, through a pre-established knowledge graph, the enterprise entities associated with the keywords, and taking those enterprise entities as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relationship and relevance weight between the target node information and the associated node information, the target node information comprises first enterprise entity information, and the associated node information comprises second entity information, product information, or natural person information associated with the first enterprise entity information;
Calculating the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity in the candidate enterprise set occur.
Further, the step of performing word segmentation on the text, extracting the keyword set, retrieving the associated enterprise entities through the pre-established knowledge graph, and taking them as the candidate enterprise set comprises:
Performing word segmentation on the text to obtain all keywords and form a keyword set, denoted K; searching the knowledge graph for the keywords in K to obtain the enterprise entities associated with K; and taking those enterprise entities as the candidate enterprise set, denoted C.
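The retrieval of K and C described above can be illustrated with a minimal sketch. It assumes the text has already been segmented into tokens and that the knowledge graph is available as a plain keyword-to-entity lookup; the entity names, the `KEYWORD_TO_ENTITIES` table, and both function names are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Set

# Hypothetical toy lookup: keyword -> enterprise entities linked to it in the graph.
KEYWORD_TO_ENTITIES: Dict[str, Set[str]] = {
    "Acme Ltd": {"Acme Ltd"},
    "Acme": {"Acme Ltd"},
    "RoadRunner": {"Acme Ltd", "Beta Corp"},
}


def extract_keywords(tokens: List[str]) -> List[str]:
    """Return K: distinct tokens found in the graph, in order of first appearance."""
    seen: Set[str] = set()
    keywords: List[str] = []
    for tok in tokens:
        if tok in KEYWORD_TO_ENTITIES and tok not in seen:
            seen.add(tok)
            keywords.append(tok)
    return keywords


def candidate_entities(keywords: List[str]) -> List[str]:
    """Return C: the sorted union of entities linked to any keyword in K."""
    found: Set[str] = set()
    for kw in keywords:
        found |= KEYWORD_TO_ENTITIES[kw]
    return sorted(found)
```

In a real deployment the lookup would presumably be a query against a graph database rather than an in-memory dictionary.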
Further, the step of calculating the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity in the candidate enterprise set occur comprises:
Let $F = (f_1, \dots, f_n)$ be the word-frequency vector of the keyword set K, where $f_i$ denotes the word frequency of the i-th keyword;
Let $R = (r_{ij})$ be the association matrix between the candidate set C and its keyword set K, where an entry is 1 if the corresponding knowledge graph nodes are connected and 0 if they are not;
Form the aggregated word-frequency vector of set C over its related keywords, whose i-th entry, $\sum_j r_{ij} f_j$, is the sum of the word frequencies of all keywords in the text that are related to the i-th candidate enterprise entity;
Define the relevance factor RX, which measures the relative ranking of the candidate enterprise entities within this text;
Define the relevance factor RY, which measures the relative ranking of a candidate enterprise entity across different texts, where β > 0 is a scale adjustment parameter and scale > 0 is the number of words remaining after the text's segmentation output has been cleaned, used to measure the text length, with $0 \le ry_i \le 1$;
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise entity set C from the relevance factors, where ⊙ denotes the element-wise matrix product and the i-th entry of $R_{KC}$ denotes the degree of association between this text and the i-th candidate enterprise entity.
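The aggregation step above (a 0/1 association matrix applied to the keyword frequencies) can be sketched as follows. The expressions for RX and RY are not reproduced in this text, so the sketch stops at the aggregated word-frequency vector; the function name and the `links` structure are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def aggregated_frequency(
    tokens: List[str],
    keywords: List[str],
    candidates: List[str],
    links: Dict[str, Set[str]],
) -> List[int]:
    """For each candidate, sum the frequencies of the keywords connected to it.

    links[c] is the (hypothetical) set of keywords whose graph nodes are
    connected to candidate c, i.e. the positions where the matrix entry is 1.
    """
    counts = Counter(tokens)
    freq = [counts[k] for k in keywords]  # f_i for each keyword
    result = []
    for c in candidates:
        row = [1 if k in links.get(c, set()) else 0 for k in keywords]
        result.append(sum(r * f for r, f in zip(row, freq)))
    return result
```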
Further, the step of calculating the degree of association between the text and the candidate enterprise entities further comprises:
Calculating the degree of association between the text and each candidate enterprise entity according to both the word frequency with which the keywords associated with that candidate enterprise entity occur and the relationship weights.
Further, the step of calculating the degree of association according to the word frequency and the relationship weights comprises:
First, compute the word-frequency vector $F$ of the keyword set K, where $f_i$ denotes the word frequency of the i-th keyword;
Let $R = (r_{ij})$ be the correlation matrix between the candidate enterprise set C and its keyword set K, where $r_{ij}$ denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
Form the correlation-coefficient-weighted frequency matrix, whose i-th entry, $\sum_j r_{ij} f_j$, is the sum of the weighted keyword frequencies of the i-th candidate enterprise entity;
Define the relevance factor RX, which measures the relative ranking of the candidate enterprise entities within this text;
Define the relevance factor RY, which measures the relative ranking of a candidate enterprise entity across different texts, where β > 0 is a scale adjustment parameter and scale > 0 is the number of words remaining after the text's segmentation output has been cleaned, used to measure the text length, with $0 \le ry_i \le 1$;
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise entity set C from the relevance factors, where ⊙ denotes the element-wise matrix product and the i-th entry of $R_{KC}$ denotes the degree of association between this text and the i-th candidate enterprise entity.
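When the matrix entries are correlation coefficients rather than 0/1 flags, the weighted sum per candidate is a plain matrix-vector product. The following sketch shows only that step, under the assumption that frequencies and coefficients are supplied as lists; the function name and the sample numbers are illustrative and not taken from the patent.

```python
from typing import List


def weighted_frequency(freq: List[float], corr: List[List[float]]) -> List[float]:
    """Return, for each candidate i, the sum over keywords j of corr[i][j] * freq[j].

    corr has one row per candidate enterprise entity and one column per keyword.
    """
    for row in corr:
        if len(row) != len(freq):
            raise ValueError("each row of corr needs one coefficient per keyword")
    return [sum(r * f for r, f in zip(row, freq)) for row in corr]
```

For instance, with keyword frequencies `[2, 1, 4]` and coefficient rows `[1.0, 0.5, 0.0]` and `[0.0, 0.65, 1.0]`, the two candidates receive weighted sums of 2.5 and approximately 4.65.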
Further, before the step of performing word segmentation on the text, the method further comprises:
Performing paragraph division preprocessing on the text and assigning a weight to each paragraph position;
The step of calculating the degree of association between the text and the candidate enterprise entities further comprises:
Calculating the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity occur, the paragraph position, the relationship weights, and the text length.
Further, paragraph division preprocessing is performed on the text by a formula in which:
$\lceil x \rceil$ denotes the smallest integer not less than x; P is the number of paragraphs in the body of the text, P ≥ 1; H is the number of parts into which the body is split, H ≥ 1, the parts being denoted $part_1, \dots, part_H$ and the title being denoted $part_0$; the number of paragraphs in each part is denoted $L = (l_0, l_1, \dots, l_H)$; and two parameters give the maximum proportions of the total paragraph count P that the first part and the H-th part, respectively, may occupy.
Further, the step of calculating the degree of association according to the word frequency, paragraph position, relationship weights, and text length comprises the following sub-steps:
Let $W = (w_0, w_1, \dots, w_H)$ be the weight matrix of keywords by paragraph position, where $w_i$ denotes the weight a keyword receives in the i-th part and $w_0$ denotes the weight a keyword receives in the title;
Let $R = (r_{ij})$ be the correlation matrix between the enterprise entity set C and its keyword set K, where $r_{ij}$ denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
Let $F = (f_{ij})$ be the frequency matrix of the keywords in K by paragraph position, where $f_{ij}$ denotes the word frequency of the i-th keyword in $part_j$;
Form the correlation-coefficient-weighted frequency matrix, whose (i, j) entry denotes the sum of the weighted word frequencies of the i-th candidate enterprise entity in $part_j$;
Define the relevance factor RX, which measures the relative ranking of the candidate enterprise entities within this text;
Define the relevance factor RY, which measures the relative ranking of a candidate enterprise entity across different texts, where β > 0 is a scale adjustment parameter and scale > 0 is the number of words remaining after the text's segmentation output has been cleaned, used to measure the text length, with $0 \le ry_i \le 1$;
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise entity set C from the relevance factors, where ⊙ denotes the element-wise matrix product and the i-th entry of $R_{KC}$ denotes the degree of association between this text and the i-th candidate enterprise entity.
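The per-part weighted frequency matrix described above can be sketched as a matrix product of the correlation matrix with the keyword-by-part frequency matrix. How the position weights W enter the final score is not reproduced in this text, so `position_weighted_total` below is only one plausible reading and is labeled as an assumption; all names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def per_part_weighted_frequency(corr: Matrix, freq_by_part: Matrix) -> Matrix:
    """Multiply corr (candidates x keywords) by freq_by_part (keywords x parts).

    Entry [c][p] is the coefficient-weighted keyword frequency of candidate c
    in part p, with part 0 standing for the title.
    """
    n_parts = len(freq_by_part[0])
    return [
        [sum(row[k] * freq_by_part[k][p] for k in range(len(row)))
         for p in range(n_parts)]
        for row in corr
    ]


def position_weighted_total(per_part: Matrix, weights: List[float]) -> List[float]:
    # ASSUMPTION: position weights are applied as a weighted sum over parts.
    # The source does not show how W is combined with the frequency matrix.
    return [sum(w * v for w, v in zip(weights, row)) for row in per_part]
```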
To solve the above technical problem, another technical solution adopted by the present invention is to provide a device for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph, comprising:
A text acquisition module, configured to obtain a text;
A word segmentation module, configured to perform word segmentation on the text, extract the set of keywords that occur in the text, retrieve, through a pre-established knowledge graph, the enterprise entities associated with the keywords, and take those enterprise entities as a candidate enterprise set, wherein the knowledge graph comprises several pieces of node information and the relationship and relevance weight between each piece of node information and its corresponding node information; among the several pieces of node information, some are enterprise entity information and the rest are product information or natural person information corresponding to the respective enterprise entities;
A degree-of-association calculation module, configured to calculate the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity in the candidate enterprise set occur.
Further, the degree-of-association calculation module is also configured to calculate the degree of association between the text and each candidate enterprise entity according to both the word frequency with which the associated keywords occur and the relationship weights.
The present invention builds a knowledge graph for the financial field and uses it as the relationship network for matching candidate keywords, covering relationships of the target enterprise entity such as its registered full name, abbreviated name, products, senior executives, shareholders, and investments. The invention assigns different weights to the paragraph positions at which keywords appear, so that the importance of different paragraphs of the text is taken into account. Using the complex relationship network built with knowledge graph technology, degrees of association are computed for all possible keywords and finally weighted and quantified, which improves the success rate and accuracy of associating texts with target entities.
Brief description of the drawings
Fig. 1 is a flow chart of a first embodiment of the method of the present invention for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph.
Fig. 2 is a schematic structural diagram of the knowledge graph of the present invention.
Fig. 3 is a flow chart of a second embodiment of the method of the present invention for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph.
Fig. 4 is a schematic diagram of the sample article in the specific example.
Fig. 5 is a schematic diagram of the knowledge graph related to the sample article in the specific example.
Fig. 6 is a block diagram of an embodiment of the device of the present invention for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the method of the present invention for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph comprises the following steps:
S101, obtaining a text;
The text may be a public opinion text (i.e., public opinion information).
S102, performing word segmentation on the text, extracting the set of keywords that occur in the text, retrieving, through a pre-established knowledge graph, the enterprise entities associated with the keywords, and taking those enterprise entities as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relationship and relevance weight between the target node information and the associated node information, the target node information comprises first enterprise entity information, and the associated node information comprises second entity information, product information, or natural person information associated with the first enterprise entity information;
The knowledge graph is established as follows: target node information and associated node information are extracted from a database (such as a corpus), and a relevance weight is assigned according to the relationship between the target node information and the associated node information, thereby forming the knowledge graph (see Fig. 2). The target node information is first enterprise entity information (for example, the enterprise name "XX limited liability company"). The node information associated with the target node information may be second entity information associated with the first enterprise entity information, natural person information associated with it (such as senior executives or shareholders of the first enterprise entity), or products associated with it (such as products developed or launched by the first enterprise entity). In the knowledge graph, either the first or the second enterprise entity information can serve as target node information: when the second enterprise entity A in Fig. 2 becomes the target node information, the original first enterprise entity becomes an associated node of the second enterprise entity A, and only their relationship changes accordingly. The knowledge graph also records the relationship and relevance weight between each target node and its associated nodes. Relationships between the first and second enterprise entities include, but are not limited to, investment, supply-and-demand, and guarantee relationships; relationships between a natural person and the first enterprise entity include employment relationships (such as shareholder, senior executive, or employee).
For example, the second enterprise entity A is a supplier of the first enterprise entity, with a relevance weight of 0.65; product A is a product of the first enterprise entity, with a relevance weight of 0.5; and natural person B is a shareholder of the first enterprise entity, with a relevance weight of 1. In this knowledge graph, relevance is assigned according to the attributes of each relationship: for example, the larger the investment ratio, the greater the relevance, and the more important the position held, the greater the relevance. The specific construction method is not described in detail here. The constructed knowledge graph can be stored in a graph database and used for retrieval and querying.
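The example relationships above can be held in a simple edge list for illustration. This is only a sketch: the `Edge` class, the field names, and `build_index` are hypothetical, and an in-memory dictionary stands in for the graph database mentioned in the text. The three weights are the ones given in the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Edge:
    source: str      # associated node: entity, product, or natural person
    target: str      # target enterprise entity
    relation: str    # kind of relationship
    weight: float    # relevance weight


EDGES = [
    Edge("Second enterprise entity A", "First enterprise entity", "supplier", 0.65),
    Edge("Product A", "First enterprise entity", "product", 0.5),
    Edge("Natural person B", "First enterprise entity", "shareholder", 1.0),
]


def build_index(edges: Iterable[Edge]) -> Dict[str, Dict[str, float]]:
    """Map each associated node name to {target entity: relevance weight}."""
    index: Dict[str, Dict[str, float]] = {}
    for e in edges:
        index.setdefault(e.source, {})[e.target] = e.weight
    return index
```

Looking up a keyword such as "Product A" in the index then yields both the candidate enterprise entity and the coefficient to use for it.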
In step S102, word segmentation yields all keywords, which form a keyword set denoted K. The keywords in K are searched for in the knowledge graph to obtain the enterprise entities associated with K, and those enterprise entities are taken as the candidate enterprise set, denoted C.
S103, calculating the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity in the candidate enterprise set occur. The degree of association is calculated from word frequency as follows:
Let $F = (f_1, \dots, f_n)$ be the word-frequency vector of the keyword set K, where $f_i$ denotes the word frequency of the i-th keyword;
Let $R = (r_{ij})$ be the association matrix between the candidate set C and its keyword set K, where an entry is 1 if the corresponding knowledge graph nodes are connected and 0 if they are not;
Form the aggregated word-frequency vector of set C over its related keywords, whose i-th entry, $\sum_j r_{ij} f_j$, is the sum of the word frequencies of all keywords in the text that are related to the i-th candidate enterprise entity;
Define the relevance factor RX, which measures the relative ranking of the candidate enterprise entities within this text;
Define the relevance factor RY, which measures the relative ranking of a candidate enterprise entity across different texts, where β > 0 is a scale adjustment parameter and scale > 0 is the number of words remaining after the text's segmentation output has been cleaned, used to measure the text length, with $0 \le ry_i \le 1$;
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise entity set C from the relevance factors, where ⊙ denotes the element-wise matrix product and the i-th entry of $R_{KC}$ denotes the degree of association between this text and the i-th candidate enterprise entity. Based on this degree of association, a threshold can be set to screen for the enterprise entities most closely related to this text; the different texts related to the i-th entity can likewise be screened and ranked.
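The threshold screening and ranking mentioned above can be sketched as follows, assuming the degrees of association have already been computed and are supplied as a mapping from entity name to score. The function name and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def screen_candidates(scores: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Keep candidates whose degree of association is at least `threshold`,
    ordered from most to least closely related."""
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

The same function can rank texts for a fixed entity if the mapping is keyed by text identifier instead of entity name.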
As a preferred or optional alternative, the degree of association between the text and the candidate enterprise entities can also be calculated from the word frequency together with the correlation coefficients between keywords and candidate enterprise entities, as follows:
First, compute the word-frequency vector $F$ of the keyword set K, where $f_i$ denotes the word frequency of the i-th keyword;
Let $R = (r_{ij})$ be the correlation matrix between the candidate enterprise set C and its keyword set K, where $r_{ij}$ denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
Form the correlation-coefficient-weighted frequency matrix, whose i-th entry, $\sum_j r_{ij} f_j$, is the sum of the weighted keyword frequencies of the i-th candidate enterprise entity;
Define the relevance factor RX, which measures the relative ranking of the candidate enterprise entities within this text;
Define the relevance factor RY, which measures the relative ranking of a candidate enterprise entity across different texts, where β > 0 is a scale adjustment parameter and scale > 0 is the number of words remaining after the text's segmentation output has been cleaned, used to measure the text length, with $0 \le ry_i \le 1$;
Obtain the correlation matrix $R_{KC}$ between the text and the candidate enterprise entity set C from the relevance factors, where ⊙ denotes the element-wise matrix product and the i-th entry of $R_{KC}$ denotes the degree of association between this text and the i-th candidate enterprise entity. Based on this degree of association, a threshold can be set to screen for the enterprise entities most closely related to this text; the different texts related to the i-th entity can likewise be screened and ranked.
It will be understood that the relationship weight is used so that the degree of association between keywords and candidate enterprise entities can be calculated more accurately; in some embodiments, the relationship weight is not an essential technical feature.
In this embodiment of the present invention, keywords are extracted from the text on the basis of the pre-established knowledge graph, and each keyword is looked up in the knowledge graph to obtain the enterprise entities corresponding to it. These entities are taken as candidate enterprise entities and form the candidate enterprise entity set. The degree of association between the text and each candidate enterprise entity is then obtained from the frequency with which the keywords appear in the text and the relationship weights between the keywords and the candidate enterprise entities. This improves the success rate and accuracy of associating texts with enterprise entities (referred to as target enterprise entities), enriches the dimensions along which text information is related to the target enterprise entity, and provides a more accurate basis for subsequent analysis.
Referring to Fig. 3, Fig. 3 is a flow chart of a second embodiment of the method of the present invention for calculating the degree of correlation between a text and an enterprise entity using a knowledge graph. The method of this embodiment comprises the following steps:
S201, obtaining a text;
S202, performing paragraph division preprocessing on the text;
In this step, paragraph division preprocessing is performed on the text as follows:
A public opinion text is assumed to comprise two main parts, a title and a body, and the body has P ≥ 1 paragraphs. The body is split into H ≥ 1 parts, denoted $part_1, \dots, part_H$, and the title is denoted $part_0$; the number of paragraphs in each part is denoted $L = (l_0, l_1, \dots, l_H)$. Because different paragraphs of the body carry different importance, the lengths of the head and tail parts are limited when the body is split: two parameters give the maximum proportions of the total paragraph count P that part 1 and part H, respectively, may occupy.
The number of paragraphs in each part is then computed by a formula in which $\lceil x \rceil$ denotes the smallest integer not less than x.
In this step, after paragraph division preprocessing, a weight is also assigned to each paragraph position. In general, the title and the leading and trailing parts of the body receive higher weights, and the middle of the body receives relatively lower weights. For example, the weight $w_0$ of the title is 0.35, the weight $w_1$ of the first part is 0.25, the weight $w_H$ of the last part is 0.25, and the weights $w_2$ to $w_{H-1}$ of the middle parts are 0.15.
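The example weights above can be assembled into a position-weight vector with a small helper. This is a sketch using the example values from the text as defaults; the function name is hypothetical, and the handling of H = 1 (where the single body part is both first and last) is an assumption.

```python
from typing import List


def position_weights(
    h: int,
    title: float = 0.35,
    first: float = 0.25,
    last: float = 0.25,
    middle: float = 0.15,
) -> List[float]:
    """Return [w_0, w_1, ..., w_H] for a body split into h parts.

    Index 0 is the title. ASSUMPTION: when h == 1 the single body part
    takes the `last` weight, since the source does not cover that case.
    """
    if h < 1:
        raise ValueError("h must be at least 1")
    weights = [title] + [middle] * h
    weights[1] = first
    weights[h] = last
    return weights
```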
S203, performing word segmentation on the text, extracting the set of keywords that occur in the text, retrieving, through a pre-established knowledge graph, the enterprise entities associated with the keywords, and taking those enterprise entities as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relationship and relevance weight between the target node information and the associated node information, the target node information comprises first enterprise entity information, and the associated node information comprises second entity information, product information, or natural person information associated with the first enterprise entity information;
In this step, word segmentation is performed on the divided text obtained in step S202, and all candidate words in the text that can be found in the knowledge graph are marked as keywords. These keywords form the keyword set, denoted K. The keywords in K are searched for in the knowledge graph to obtain the enterprise entities associated with K, and those enterprise entities are taken as the candidate enterprise set, denoted C.
S204, calculating the degree of association between the text and each candidate enterprise entity according to the word frequency with which the keywords associated with that candidate enterprise entity occur, the paragraph position, the relationship weights, and the text length, where the text length is determined by the number of words obtained in the word segmentation step.
In this step, the degree of association between the text and the candidate enterprise entities is calculated as follows:
Enabling W is weight matrix of the keyword in paragraph position:
Wherein wiIndicate keyword in the resulting weight of i-th section, w0Refer to keyword in the resulting weight of title;
Enable the correlation matrix of set C and its keyword set K based on R:
rijIndicate the related coefficient of i-th candidate enterprise dominant and j-th of keyword;
F is keyword K in the different resulting frequency matrixes in paragraph position:
fijIndicate i-th of keyword in partjPartial word frequency;
For correlation coefficient weighted frequency matrix:
WhereinIndicate i-th of candidate enterprise dominant in partjThe sum of partial Weighted Term Frequency;
Degree of correlation factor R X is defined, RX is used to measure the associated order between enterprise dominant candidate in this document;
Wherein,
Wherein,
Degree of correlation factor R Y is defined, for measuring the associated order of enterprise dominant candidate between different texts, β > 0, β To scale adjustment parameter, scale > 0 is that text information always segments the once purged obtained participle word quantity of number, for measuring Text length.
Wherein, 0≤ryi≤1
Obtain the correlation matrix R of text and candidate enterprise dominant set CKC
Wherein, ⊙ is matrix point multiplication operation,Indicate Ben Wenben to the degree of association of i-th of candidate enterprise dominant.It is based on This degree of association can screen the close enterprise dominant compared with the Ben Wenben degree of correlation with given threshold;It is also possible to i-th The relevant different texts of a main body are screened, are sorted.
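The threshold screening and sorting mentioned above can be sketched as follows, assuming the degrees of association have already been computed. The function name `screen_and_rank` and the threshold value are hypothetical; the two entity scores reuse the values reported in this document's worked example, and the per-text scores are invented purely for illustration.

```python
def screen_and_rank(scores, threshold):
    """Keep items whose score reaches the threshold, sorted from highest to lowest."""
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Degrees of association of one text with its candidate entities
# (values as reported in the worked example of this document).
relevance = {
    "LeTV information technology (Beijing) limited liability company": 0.526,
    "Shenzhen Tencent Computer System Co., Ltd": 0.122,
}
# The threshold 0.3 is an assumption chosen only for this illustration.
closely_related = screen_and_rank(relevance, threshold=0.3)

# Hypothetical scores of several texts for the same entity, ranked without screening.
text_scores = {"text_1": 0.2, "text_2": 0.7, "text_3": 0.45}
ranked_texts = screen_and_rank(text_scores, threshold=0.0)
print(closely_related)
print(ranked_texts)
```

The same helper covers both uses named in the text: screening entities for one text, and ordering texts for one entity.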
In this embodiment, the text is first divided into paragraph parts and each part is assigned a corresponding weight. After word segmentation, the weight matrix of the keywords is determined from the paragraph position in which each keyword occurs; combining it with the correlation-coefficient-weighted frequency matrix yields the relevance factors and hence the correlation matrix of the text and the candidate enterprise entity set C, so that the degree of association between the whole text and each enterprise entity in C is obtained more accurately.
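As a rough illustration of how the paragraph weights, the correlation coefficients, and the keyword frequencies might be combined, the sketch below forms the weighted frequency matrix as the product of R and F and then applies the position weights W. Both the product form and the final weighted sum are assumptions inferred from the definitions in the text; the relevance factors RX and RY are not implemented because their formulas are not available here. The names `matmul`, `F_tilde`, and `position_scores`, and the entries of `R` and `F`, are hypothetical; only `W` reuses the weights from the worked example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Position weights: title first, then three body parts (from the worked example).
W = [0.35, 0.25, 0.15, 0.25]

# Hypothetical correlation coefficients: 2 candidate entities x 3 keywords.
R = [
    [1.0, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical keyword frequencies: 3 keywords x 4 positions (title + 3 parts).
F = [
    [1, 2, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]

# Assumed form: entry (i, j) sums r_ik * f_kj over all keywords k, i.e. the
# coefficient-weighted frequency of entity i in part j.
F_tilde = matmul(R, F)

# Assumed next step: weight each part's contribution by its position weight.
position_scores = [sum(w * f for w, f in zip(W, row)) for row in F_tilde]
print(F_tilde)
print(position_scores)
```

Under these assumptions, an entity scores higher when its strongly linked keywords appear often and in heavily weighted positions such as the title.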
A specific example below illustrates in detail how the degree of association between a text and an enterprise entity is calculated using the knowledge graph:
Referring to Fig. 4 and Fig. 5, Fig. 4 is the sample article of this example and Fig. 5 is the knowledge graph corresponding to the sample article; because space is limited, only the part of the knowledge graph centered on "LeTV information technology (Beijing) limited liability company" is shown.
Step 1: preprocess the sample article. The sample article has four paragraphs in total, so P = 4, and H = 3 is taken for the body text.
The parts and weights obtained according to the formula are as follows:
Table 1
W = (0.35, 0.25, 0.15, 0.25)
Step 2: extract the keywords in the text and the candidate entity set.
(1) Keyword set from the title and body text:
K = {LeEco, Sun Hongbin, circle of friends, LeTV, new LeEco intelligence man, Tencent, Tencent Video, LeEco TV, happy wound Entertainment}
(2) Retrieval in the knowledge graph gives the set of enterprises directly associated with K:
C = {LeTV information technology (Beijing) limited liability company, Shenzhen Tencent Computer System Co., Ltd}
Step 3: calculate the degree of association between the public-opinion text and the candidate target entities.
Combining the correlation coefficients in the knowledge graph (the numbers on the edges), the correlation coefficient matrix R of the entity set C and its keyword set K is obtained:
Table 2
The frequency matrix F is as follows:
The resulting correlation-coefficient-weighted frequency matrix is as follows:
After the text's segmentation output is cleaned, the number of segmented words is 148, so scale = 148; take β = 100.
The resulting correlation matrix R_KC of the text and the entity set C is as follows:
Therefore, the degree of association between the sample article and "LeTV information technology (Beijing) limited liability company" is 0.526, and the degree of association with "Shenzhen Tencent Computer System Co., Ltd" is 0.122. (The coefficients in the above example are assumed for illustration.)
Referring to Fig. 6, the invention also discloses a device for calculating the degree of association between a text and an enterprise entity using a knowledge graph, comprising:
a text acquisition module, configured to obtain a text;
a word segmentation module, configured to perform word segmentation on the text, extract the set of keywords occurring in the text, and retrieve the enterprise entities associated with the keywords from a pre-established knowledge graph, the enterprise entities associated with the keywords forming the candidate enterprise set, wherein the knowledge graph includes a plurality of pieces of node information and the relationships and relevance weights between each piece of node information and its corresponding node information; among the pieces of node information, some are enterprise entity information and the rest are product information or natural person information corresponding to the respective enterprise entities;
a degree-of-association calculation module, configured to calculate the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that entity in the candidate enterprise set and the relationship weight.
Optionally, the device further includes a paragraph division preprocessing module, configured to divide the text into paragraph parts and to assign a corresponding weight to each part;
the degree-of-association calculation module is further configured to calculate the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that entity in the candidate enterprise set, the paragraph position, the relationship weight, and the text length.
Optionally, the paragraph division preprocessing module performs the paragraph division preprocessing by the following formula:
where ⌈x⌉ denotes the smallest integer not less than x; P is the number of paragraphs of the text, P ≥ 1; H is the number of parts into which the text is divided, H ≥ 1, the parts being denoted part_1, …, part_H, with the title denoted part_0; the number of paragraphs in each part is denoted L = (l_0, l_1, …, l_H); and the remaining two parameters denote the maximum proportion of the total paragraph count P that the first part may occupy and the maximum proportion that the H-th part may occupy, respectively.
Optionally, the word segmentation module is further configured to perform word segmentation on the paragraph-divided text to obtain all keywords, which form the keyword set K; to look up the keywords of K in the knowledge graph to obtain the enterprise entities associated with K; and to take these enterprise entities as the candidate enterprise set, denoted C.
In this embodiment, the functions of the modules of the device for calculating the degree of association between a text and an enterprise entity using a knowledge graph are described in the method description above and are not repeated here.
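The division of the device into modules can be pictured with the following sketch. It is a loose illustration only: the class names mirror the module names in the text, but their methods, the whitespace tokenizer standing in for a real word segmenter, the dictionary-based graph, and the unnormalized score are all assumptions and not the patented calculation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextAcquisitionModule:
    source: str

    def get_text(self):
        """Return the text to analyze (here simply a stored string)."""
        return self.source


@dataclass
class WordSegmentationModule:
    graph: dict  # hypothetical layout: keyword -> {enterprise: relation weight}

    def keywords(self, text):
        """Split on whitespace (a stand-in for real segmentation) and keep graph keywords."""
        return Counter(t for t in text.split() if t in self.graph)

    def candidates(self, keywords):
        """Collect the enterprises linked to any extracted keyword."""
        return {e for kw in keywords for e in self.graph[kw]}


@dataclass
class RelevanceCalculationModule:
    graph: dict

    def raw_scores(self, keywords, candidates):
        """Sum frequency x relation weight per candidate (unnormalized illustration)."""
        return {
            c: sum(freq * self.graph[kw].get(c, 0.0) for kw, freq in keywords.items())
            for c in candidates
        }


graph = {"alpha": {"Co1": 1.0}, "beta": {"Co1": 0.5, "Co2": 1.0}}
text = TextAcquisitionModule("alpha alpha beta gamma").get_text()
segmenter = WordSegmentationModule(graph)
kws = segmenter.keywords(text)
scores = RelevanceCalculationModule(graph).raw_scores(kws, segmenter.candidates(kws))
print(scores)
```

Keeping acquisition, segmentation, and scoring in separate objects matches the optional extensions in the text: a paragraph division module could be added between the first two without changing the others.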
The above are only embodiments of the present invention and do not limit its scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims (10)

1. A method for calculating the degree of association between a text and an enterprise entity using a knowledge graph, comprising the following steps:
obtaining a text;
performing word segmentation on the text, extracting the set of keywords occurring in the text, and retrieving, from a pre-established knowledge graph, the enterprise entities associated with the keywords, the enterprise entities associated with the keywords forming a candidate enterprise set, wherein the knowledge graph includes target node information, associated node information, and the relationships and relevance weights between the target node information and the associated node information, the target node information includes first enterprise entity information, and the associated node information includes second entity information, product information, or natural person information associated with the first enterprise entity information;
calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
2. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 1, characterized in that the step of performing word segmentation on the text, extracting the set of keywords occurring in the text, retrieving the enterprise entities associated with the keywords from the pre-established knowledge graph, and taking the enterprise entities associated with the keywords as the candidate enterprise set comprises:
performing word segmentation on the text to obtain all keywords, which form a keyword set denoted K; looking up the keywords of the keyword set K in the knowledge graph to obtain the enterprise entities associated with the keyword set K; and taking the enterprise entities associated with the keywords as the candidate enterprise set, denoted C.
3. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 2, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set comprises:
letting F be the frequency matrix of the keyword set K, where f_i denotes the word frequency of the i-th keyword;
letting R be the correlation coefficient matrix of the entity set C and its keyword set K, an entry being 1 where the knowledge graph nodes are connected and 0 where they are not;
forming the aggregate word-frequency vector of the entity set C and its related keywords, whose i-th element denotes the sum of the word frequencies of all keywords in the text that are related to the i-th candidate enterprise entity;
defining a relevance factor RX, which measures the relative order of relevance among the candidate enterprise entities within this text, where u = (1, …, 1) and 0 ≤ rx_i ≤ 1;
defining a relevance factor RY, which measures the relative order of relevance of a candidate enterprise entity across different texts, where β > 0 is a scaling adjustment parameter and scale > 0 is the number of segmented words remaining after the text's segmentation output has been cleaned, used to measure the text length, with 0 ≤ ry_i ≤ 1;
obtaining the correlation matrix R_KC of the text and the candidate enterprise entity set C, where ⊙ denotes element-wise multiplication and the i-th element of R_KC denotes the degree of association of this text with the i-th candidate enterprise entity.
4. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 2, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity further comprises:
calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set and the relationship weight.
5. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 4, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set and the relationship weight comprises:
first counting the word-frequency vector F of the keyword set K, where f_i denotes the word frequency of the i-th keyword;
letting R be the correlation coefficient matrix of the candidate enterprise set C and its keyword set K, where r_ij denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
forming the correlation-coefficient-weighted frequency matrix, whose i-th element denotes the sum of the weighted keyword frequencies of the i-th candidate enterprise entity;
defining a relevance factor RX, which measures the relative order of relevance among the candidate enterprise entities within this text, where u = (1, …, 1) and 0 ≤ rx_i ≤ 1;
defining a relevance factor RY, which measures the relative order of relevance of a candidate enterprise entity across different texts, where β > 0 is a scaling adjustment parameter and scale > 0 is the number of segmented words remaining after the text's segmentation output has been cleaned, used to measure the text length, with 0 ≤ ry_i ≤ 1;
obtaining the correlation matrix R_KC of the text and the candidate enterprise entity set C, where ⊙ denotes element-wise multiplication and the i-th element of R_KC denotes the degree of association of this text with the i-th candidate enterprise entity.
6. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 4, characterized in that, before the step of performing word segmentation on the text, the method further comprises:
performing paragraph division preprocessing on the text and assigning a corresponding weight to each paragraph position;
and the step of calculating the degree of association between the text and each candidate enterprise entity further comprises:
calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set, the paragraph position, the relationship weight, and the text length.
7. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 6, characterized in that the paragraph division preprocessing is performed on the text by the following formula:
where ⌈x⌉ denotes the smallest integer not less than x; P is the number of paragraphs of the text, P ≥ 1; H is the number of parts into which the text is divided, H ≥ 1, the parts being denoted part_1, …, part_H, with the title denoted part_0; the number of paragraphs in each part is denoted L = (l_0, l_1, …, l_H); and the remaining two parameters denote the maximum proportion of the total paragraph count P that the first part may occupy and the maximum proportion that the H-th part may occupy, respectively.
8. The method for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 7, characterized in that the step of calculating the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set, the paragraph position, the relationship weight, and the text length comprises the following sub-steps:
letting W be the weight matrix of the keywords by paragraph position:
W = (w_0, w_1, …, w_H),
where w_i denotes the weight a keyword receives in the i-th section, and w_0 denotes the weight a keyword receives in the title;
letting R be the correlation coefficient matrix of the enterprise entity set C and its keyword set K, where r_ij denotes the correlation coefficient between the i-th candidate enterprise entity and the j-th keyword;
letting F be the frequency matrix of the keywords in K over the different paragraph positions, where f_ij denotes the word frequency of the i-th keyword in part_j;
forming the correlation-coefficient-weighted frequency matrix, whose (i, j) entry denotes the sum of the weighted word frequencies of the i-th candidate enterprise entity in part_j;
defining a relevance factor RX, which measures the relative order of relevance among the candidate enterprise entities within this text, where u = (1, …, 1) and 0 ≤ rx_i ≤ 1;
defining a relevance factor RY, which measures the relative order of relevance of a candidate enterprise entity across different texts, where β > 0 is a scaling adjustment parameter and scale > 0 is the number of segmented words remaining after the text's segmentation output has been cleaned, used to measure the text length, with 0 ≤ ry_i ≤ 1;
obtaining the correlation matrix R_KC of the text and the candidate enterprise entity set C, where ⊙ denotes element-wise multiplication and the i-th element of R_KC denotes the degree of association of this text with the i-th candidate enterprise entity.
9. A device for calculating the degree of association between a text and an enterprise entity using a knowledge graph, comprising:
a text acquisition module, configured to obtain a text;
a word segmentation module, configured to perform word segmentation on the text, extract the set of keywords occurring in the text, and retrieve the enterprise entities associated with the keywords from a pre-established knowledge graph, the enterprise entities associated with the keywords forming a candidate enterprise set, wherein the knowledge graph includes a plurality of pieces of node information and the relationships and relevance weights between each piece of node information and its corresponding node information; among the pieces of node information, some are enterprise entity information and the rest are product information or natural person information corresponding to the respective enterprise entities;
a degree-of-association calculation module, configured to calculate the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set.
10. The device for calculating the degree of association between a text and an enterprise entity using a knowledge graph according to claim 9, characterized in that the degree-of-association calculation module is further configured to calculate the degree of association between the text and each candidate enterprise entity according to the word frequency of the keywords associated with that candidate enterprise entity in the candidate enterprise set and the relationship weight.
CN201810567101.5A 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph Active CN109033132B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810567101.5A CN109033132B (en) 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph

Publications (2)

Publication Number Publication Date
CN109033132A true CN109033132A (en) 2018-12-18
CN109033132B CN109033132B (en) 2020-12-11

Family

ID=64611958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810567101.5A Active CN109033132B (en) 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph

Country Status (1)

Country Link
CN (1) CN109033132B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886063A (en) * 2014-03-18 2014-06-25 国家电网公司 Text retrieval method and device
US20150310073A1 (en) * 2014-04-29 2015-10-29 Microsoft Corporation Finding patterns in a knowledge base to compose table answers
CN104346446A (en) * 2014-10-27 2015-02-11 百度在线网络技术(北京)有限公司 Paper associated information recommendation method and device based on mapping knowledge domain
CN105117487A (en) * 2015-09-19 2015-12-02 杭州电子科技大学 Book semantic retrieval method based on content structures
CN105354321A (en) * 2015-11-16 2016-02-24 中国建设银行股份有限公司 Query data processing method and device
CN106095858A (en) * 2016-06-02 2016-11-09 海信集团有限公司 A kind of audio video searching method, device and terminal
CN107679186A (en) * 2017-09-30 2018-02-09 北京奇虎科技有限公司 The method and device of entity search is carried out based on entity storehouse
CN108090167A (en) * 2017-12-14 2018-05-29 畅捷通信息技术股份有限公司 Method, system, computing device and the storage medium of data retrieval
CN108038204A (en) * 2017-12-15 2018-05-15 福州大学 For the viewpoint searching system and method for social media

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YOOKYUNG JO ET AL.: ""Detecting research topics via the correlation between graphs and texts"", 《 PROCEEDINGS OF THE 13TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING》 *
张云秋 等: ""非相关文献知识发现的关键技术研究"", 《情报学报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815499A (en) * 2019-01-25 2019-05-28 杭州凡闻科技有限公司 Information correlation method and system
WO2021098648A1 (en) * 2019-11-22 2021-05-27 深圳前海微众银行股份有限公司 Text recommendation method, apparatus and device, and medium
WO2021103594A1 (en) * 2019-11-25 2021-06-03 深圳壹账通智能科技有限公司 Tacitness degree detection method and device, server and readable storage medium
CN111881183A (en) * 2020-07-28 2020-11-03 北京金堤科技有限公司 Enterprise name matching method and device, storage medium and electronic equipment
CN112732883A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Fuzzy matching method and device based on knowledge graph and computer equipment
CN113688628A (en) * 2021-07-28 2021-11-23 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium
CN113688628B (en) * 2021-07-28 2023-09-22 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN109033132B (en) 2020-12-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant