CN109033132B - Method and device for calculating text and subject correlation by using knowledge graph - Google Patents

Method and device for calculating text and subject correlation by using knowledge graph

Info

Publication number
CN109033132B
Authority
CN
China
Prior art keywords
text
enterprise
subject
candidate
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810567101.5A
Other languages
Chinese (zh)
Other versions
CN109033132A (en)
Inventor
孙雨轩
吴成龙
周劼人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongzheng Zhengxin Shenzhen Co ltd
Original Assignee
Zhongzheng Zhengxin Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongzheng Zhengxin Shenzhen Co ltd
Priority to CN201810567101.5A
Publication of CN109033132A
Application granted
Publication of CN109033132B
Current legal status: Active
Anticipated expiration

Abstract

The invention discloses a method and a device for calculating the degree of association between a text and an enterprise subject by using a knowledge graph. The method comprises the following steps: acquiring a text; performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, the retrieved enterprise subjects forming a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relations between the target node information and the associated node information, and association weights, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information; and calculating the degree of association between the text and each candidate enterprise subject according to the word frequency of the keywords associated with that candidate enterprise subject in the candidate enterprise set.

Description

Method and device for calculating text and subject correlation by using knowledge graph
Technical Field
The invention relates to a method and a device for calculating the degree of association between a text and an enterprise subject by using a knowledge graph.
Background
In the information age, acquiring, processing, and analyzing massive amounts of data is a major difficulty. In some industries (e.g., the financial industry), people follow information about many dimensions of an enterprise to support decisions such as investments. On the one hand, market participants require broader and more complete data; on the other hand, they require that the data be processed in a timely manner. Enterprise public opinion information is a dimension that market participants watch closely. As unstructured text, it is scattered, large in volume, complex in format, and highly time-sensitive. Many financial practitioners therefore need to process such data efficiently and extract valuable information from it by technical means such as natural language processing. Faced with complicated public opinion information, associating that information with the enterprises of interest, so that low-value information or information irrelevant to the subject can be screened out, is an important step in data analysis and mining.
The common method is to construct a keyword library for the enterprise subject, including the registered business name, the enterprise abbreviation, the enterprise listing code, and so on, then to run a keyword matching search over the text information library and treat the matched texts as information related to the enterprise subject. This method has three drawbacks. First, a relatively complete enterprise keyword library must be constructed in advance as the basis for retrieval. Second, ranking the matched results by degree of association works poorly: a keyword often appears in a text that is not actually about the enterprise, so a large amount of redundant information remains. Third, because association relies on direct keyword matching, important information about enterprises closely associated with the target enterprise may be missed, which causes information loss.
Disclosure of Invention
Aiming at the defects of the prior art, the technical problem to be solved by the invention is to provide a method and a device for calculating the degree of association between a text and an enterprise subject by using a knowledge graph, which improve on the traditional approach of relying on keyword matching alone when analyzing massive amounts of text. By incorporating a knowledge graph, the degree of association between a target subject and text information can be quantified, the dimensions along which text information is associated with the target subject are enriched, and a basis is provided for subsequent further analysis.
In order to solve the above technical problem, the invention adopts the following technical scheme: a method for calculating the degree of association between a text and an enterprise subject by using a knowledge graph, comprising the following steps:
acquiring a text;
performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph so as to take the enterprise subjects associated with the keywords as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relations between the target node information and the associated node information, and association weights, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information;
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the step of performing word segmentation processing on the text, extracting a keyword set appearing in the text, and retrieving an enterprise subject associated with the keyword through a pre-established knowledge graph so as to take the enterprise subject associated with the keyword as a candidate enterprise set includes:
performing word segmentation processing on a text to obtain all keywords to form a keyword set, wherein the keyword set is marked as K, searching the keywords in the keyword set K in the knowledge graph, and acquiring an enterprise subject associated with the keyword set K to use the enterprise subject associated with the keywords as a candidate enterprise set, and the candidate enterprise set is marked as C.
Further, in the step of calculating the association degree between the text and the candidate enterprise main body according to the word frequency of the keyword occurrence associated with the candidate enterprise main body in the candidate enterprise set, the method includes:
let F be the word frequency matrix of the keyword set K:
Figure BDA0001684808990000021
wherein f_i denotes the word frequency of the i-th keyword;
let R be the association matrix of the candidate enterprise set C and its keyword set K, in which an entry is 1 when the two nodes are connected in the knowledge graph and 0 when they are not:
Figure BDA0001684808990000022
Figure BDA0001684808990000023
is the summed word-frequency vector of the candidate enterprise set C and its associated keywords:
Figure BDA0001684808990000024
wherein
Figure BDA0001684808990000025
denotes the sum of the word frequencies of all keywords in the text that are associated with the i-th candidate enterprise subject;
define a correlation factor RX, used to rank the relevance of the candidate enterprise subjects within the text,
wherein
Figure BDA0001684808990000026
Figure BDA0001684808990000031
wherein
Figure BDA0001684808990000032
define a relevance factor RY, used to rank the relevance of a candidate enterprise subject across different texts, wherein β > 0 is a scaling adjustment parameter, and scale > 0 is the number of words remaining after the segmented text is cleaned and serves as a measure of the text length:
Figure BDA0001684808990000033
wherein 0 ≤ ry_i ≤ 1;
obtain the association degree matrix R_KC of the text and the candidate enterprise set C:
Figure BDA0001684808990000034
wherein the operator denotes the matrix dot product operation, and
Figure BDA0001684808990000035
denotes the degree of association between the text and the i-th candidate enterprise subject.
Further, in the step of calculating the association degree between the text and the candidate business entity, the method further includes:
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency and the relation weight of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the step of calculating the association degree between the text and the candidate enterprise subject according to the word frequency and the relationship weight of the occurrence of the keyword associated with the candidate enterprise subject in the candidate enterprise set includes:
first, count the word frequency vector F of the keyword set K:
Figure BDA0001684808990000036
wherein f_i denotes the word frequency of the i-th keyword;
let R be the correlation coefficient matrix of the candidate enterprise set C and its keyword set K:
Figure BDA0001684808990000037
wherein r_ij denotes the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword;
Figure BDA0001684808990000038
is the word frequency matrix weighted by the correlation coefficients:
Figure BDA0001684808990000039
wherein
Figure BDA0001684808990000041
denotes the sum of the weighted word frequencies of the keywords of the i-th candidate enterprise subject;
define a correlation factor RX, used to rank the relevance of the candidate enterprise subjects within the text,
wherein
Figure BDA0001684808990000042
Figure BDA0001684808990000043
wherein
Figure BDA0001684808990000044
define a relevance factor RY, used to rank the relevance of a candidate enterprise subject across different texts, wherein β > 0 is a scaling adjustment parameter, and scale > 0 is the number of words remaining after the segmented text is cleaned and serves as a measure of the text length:
Figure BDA0001684808990000045
wherein 0 ≤ ry_i ≤ 1;
obtain the association degree matrix R_KC of the text and the candidate enterprise set C:
Figure BDA0001684808990000046
wherein the operator denotes the matrix dot product operation, and
Figure BDA0001684808990000047
denotes the degree of association between the text and the i-th candidate enterprise subject.
Further, before the step of performing word segmentation processing on the text, the method further includes:
performing paragraph division preprocessing on the text, and giving corresponding weight to the position of a paragraph;
in the step of calculating the association degree of the text with the candidate business entity, the method further comprises:
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency, the paragraph position, the relation weight and the text length of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the text is subjected to paragraph segmentation preprocessing by the following formula:
Figure BDA0001684808990000048
wherein
Figure BDA0001684808990000049
denotes the smallest integer not less than x; P is the number of natural paragraphs of the text, P ≥ 1; H is the number of parts into which the text is split, H ≥ 1, the parts being denoted part_1, …, part_H and the title being denoted part_0; the number of paragraphs in each part is recorded as L = (l_0, l_1, …, l_H);
Figure BDA0001684808990000051
denotes the maximum proportion of the first part to the total number of paragraphs P,
Figure BDA0001684808990000052
Figure BDA0001684808990000053
denotes the maximum proportion of the H-th part to the total number of paragraphs P,
Figure BDA0001684808990000054
Further, the step of calculating the association degree between the text and the candidate enterprise subject according to the word frequency, paragraph position, relation weight and text length of the keywords associated with the candidate enterprise subject in the candidate enterprise set comprises the following substeps:
let W be the weight matrix of the keywords by paragraph position:
Figure BDA0001684808990000055
wherein w_i denotes the weight of a keyword appearing in part_i, and w_0 denotes the weight of a keyword appearing in the title;
let R be the correlation coefficient matrix of the candidate enterprise set C and the keyword set K:
Figure BDA0001684808990000056
wherein r_ij denotes the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword;
let F be the word frequency matrix of the keyword set K at the different paragraph positions:
Figure BDA0001684808990000057
wherein f_ij denotes the word frequency of the i-th keyword in part_j;
Figure BDA0001684808990000058
is the word frequency matrix weighted by the correlation coefficients:
Figure BDA0001684808990000059
wherein
Figure BDA00016848089900000510
denotes the sum of the weighted word frequencies of the i-th candidate enterprise subject in part_j;
define a correlation factor RX, used to rank the relevance of the candidate enterprise subjects within the text,
wherein
Figure BDA00016848089900000511
Figure BDA00016848089900000512
wherein
Figure BDA00016848089900000513
define a relevance factor RY, used to rank the relevance of a candidate enterprise subject across different texts, wherein β > 0 is a scaling adjustment parameter, and scale > 0 is the number of words remaining after the segmented text is cleaned and serves as a measure of the text length:
Figure BDA0001684808990000061
wherein 0 ≤ ry_i ≤ 1;
obtain the association degree matrix R_KC of the text and the candidate enterprise set C:
Figure BDA0001684808990000062
wherein the operator denotes the matrix dot product operation, and
Figure BDA0001684808990000063
denotes the degree of association between the text and the i-th candidate enterprise subject.
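The per-part bookkeeping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented formula: the function names `part_weighted_sums` and `position_weighted_total` are hypothetical, the sample matrices are invented, and the final step (a sum of per-part totals weighted by W) is only one plausible way to combine the position weights, because the RX and RY formulas appear only in the figures and are not reproduced here.

```python
def part_weighted_sums(R, F):
    """fc_ij = sum_k r_ik * f_kj for candidate i and part j.

    R: candidates x keywords correlation coefficients.
    F: keywords x parts word frequencies (part 0 is the title).
    """
    n_parts = len(F[0])
    return [
        [sum(r_ik * F[k][j] for k, r_ik in enumerate(row)) for j in range(n_parts)]
        for row in R
    ]


def position_weighted_total(FC, W):
    """Assumed combination step: weight each part's sum by its position weight."""
    return [sum(w * fc for w, fc in zip(W, row)) for row in FC]


# Invented sample data: 2 candidates, 2 keywords, 3 parts (title, front, tail).
R = [[1.0, 0.5],
     [0.0, 1.0]]
F = [[1, 2, 0],
     [0, 1, 1]]
W = [0.35, 0.25, 0.25]  # title, front, and tail weights

FC = part_weighted_sums(R, F)
totals = position_weighted_total(FC, W)
```

Keeping the per-part sums separate from the position weighting makes it easy to swap in a different combination rule once the exact formula is known.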
In order to solve the technical problem, the invention adopts another technical scheme that: an apparatus for calculating the relevancy of a text and an enterprise main body by using a knowledge graph is provided, which comprises:
the text acquisition module is used for acquiring a text;
the word segmentation module is used for performing word segmentation processing on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph so as to take the enterprise subjects associated with the keywords as a candidate enterprise set, wherein the knowledge graph comprises a plurality of pieces of node information together with the relation and the association weight between each piece of node information and its corresponding node information, and, among the plurality of pieces of node information, some are enterprise subject information and the rest are product information or natural person information corresponding to the respective enterprise subject;
and the association degree calculation module is used for calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the occurrence of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the relevancy calculation module is further configured to calculate relevancy of the text and the candidate enterprise subject according to word frequency and relationship weight of occurrence of keywords associated with the candidate enterprise subject in the candidate enterprise set.
The invention constructs a knowledge graph for the financial field and uses it as the relation network of candidate matching keywords. With the enterprise as the target subject, the graph covers relations such as registered full names, abbreviations, products, senior executives, shareholders, and investments. In the invention, different weights are assigned according to the paragraph position in which a keyword appears, so that the importance of different paragraphs of the text is taken into account. The degree of association is calculated over all possible keywords by using the complex relation network constructed with knowledge graph technology and is finally weighted and quantified, which improves the success rate and accuracy of associating a text with the target subject.
Drawings
FIG. 1 is a flowchart of a first embodiment of a method for calculating relevance of text to an enterprise principal using a knowledge-graph of the present invention.
FIG. 2 is a schematic diagram of the structure of a knowledge-graph of the present invention.
FIG. 3 is a flowchart of a second embodiment of a method for calculating relevance of text to an enterprise principal using a knowledge-graph of the present invention.
FIG. 4 is a schematic illustration of a sample article in a specific example.
Fig. 5 is a schematic illustration of a knowledge-graph associated with the sample article in a specific example.
FIG. 6 is a block diagram of an embodiment of an apparatus for calculating relevance of text to an enterprise principal using a knowledge-graph of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the method for calculating the relevance between a text and an enterprise subject by using a knowledge graph of the present invention includes the following steps:
S101, acquiring a text;
the text may be public opinion text (i.e., public opinion information).
S102, performing word segmentation processing on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph so as to take the enterprise subjects associated with the keywords as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relations between the target node information and the associated node information, and association weights, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information;
the knowledge graph is established as follows: target node information and associated node information are extracted from a database (such as a corpus), and a relevance weight is assigned to each relation between the target node information and the associated node information, thereby forming the knowledge graph (see fig. 2). The target node information is first enterprise subject information (e.g., the name of an enterprise: XX corporation). The node information associated with the target node may be second enterprise subject information associated with the first enterprise subject, natural person information associated with the first enterprise subject (e.g., a senior executive or shareholder of the first enterprise subject), or product information associated with the first enterprise subject (e.g., a product developed and marketed by the first enterprise subject). In the knowledge graph, either the first or the second enterprise subject can serve as the target node: when the second enterprise subject A in fig. 2 becomes the target node, the original first enterprise subject in fig. 2 becomes a node associated with the second enterprise subject A, and the relation between the two changes correspondingly. The relation between each target node and its associated nodes, together with the relevance weight, is also recorded in the knowledge graph. Relations between the first enterprise subject and a second enterprise subject include, but are not limited to, investment relations, supply-demand relations, and guarantee relations; relations between a natural person and the first enterprise subject include employment-type relations (e.g., shareholder, senior executive, employee).
For example, the second enterprise subject A is a supplier of the first enterprise subject, with a relevance weight of 0.65; product A is a product of the first enterprise subject, with a relevance weight of 0.5; and natural person B is a shareholder of the first enterprise subject, with a relevance weight of 1. In the knowledge graph, the relevance is assigned according to the attribute information of each relation: for example, the larger the investment proportion, the greater the relevance, and the more senior the position, the greater the relevance. The specific construction of the knowledge graph is not described in detail here. The constructed knowledge graph can be stored in a graph database and used for retrieval and query.
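The example relations above can be pictured as a small weighted graph. The following is a minimal sketch rather than the storage format used by the invention: the dictionary layout, the names `KNOWLEDGE_GRAPH`, `add_edge`, and `neighbors`, and the node labels are assumptions made for illustration. Only the relation types and the weights 0.65, 0.5, and 1 come from the example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a graph database:
# target enterprise -> {associated node: (relation, relevance weight)}
KNOWLEDGE_GRAPH = defaultdict(dict)


def add_edge(target, node, relation, weight):
    """Record one weighted relation between a target enterprise and an associated node."""
    KNOWLEDGE_GRAPH[target][node] = (relation, weight)


add_edge("first enterprise", "second enterprise A", "supplier", 0.65)
add_edge("first enterprise", "product A", "product", 0.5)
add_edge("first enterprise", "natural person B", "shareholder", 1.0)


def neighbors(target):
    """Return the associated nodes of a target enterprise with their relevance weights."""
    return {node: weight for node, (_, weight) in KNOWLEDGE_GRAPH[target].items()}
```

A production system would more likely keep these edges in a graph database, as the text notes; the dictionary is only meant to make the node, relation, and weight triple concrete.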
In the step S102, all keywords are obtained through word segmentation processing to form a keyword set, where the keyword set is denoted as K, keywords in the keyword set K are searched in the knowledge graph, and an enterprise subject associated with the keyword set K is obtained to serve as a candidate enterprise set, and the candidate enterprise set is denoted as C.
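The retrieval of K and C can be sketched as a lookup against a reverse index of the knowledge graph. This is an illustrative sketch only: `KEYWORD_INDEX`, `extract_keywords`, and `candidate_enterprises` are hypothetical names, the index contents are invented, and the token list is assumed to come from a word segmenter that is not shown.

```python
# Hypothetical reverse index: keyword node -> {enterprise subject: relevance weight}.
KEYWORD_INDEX = {
    "product A": {"first enterprise": 0.5},
    "natural person B": {"first enterprise": 1.0},
    "second enterprise A": {"first enterprise": 0.65},
}


def extract_keywords(tokens):
    """Keep only the tokens that exist as nodes in the knowledge graph (keyword set K)."""
    return [t for t in tokens if t in KEYWORD_INDEX]


def candidate_enterprises(keywords):
    """Collect every enterprise subject linked to at least one keyword (candidate set C)."""
    candidates = set()
    for kw in keywords:
        candidates.update(KEYWORD_INDEX[kw])
    return candidates


# Tokens as they might come out of a word segmenter (assumed input).
tokens = ["product A", "launch", "natural person B", "product A"]
K = extract_keywords(tokens)
C = candidate_enterprises(K)
```

Repeated keywords are deliberately kept in K so that the later steps can count word frequencies from the same list.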
S103, calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set. The method for calculating the association degree according to the word frequency comprises the following steps:
let F be the word frequency matrix of the keyword set K:
Figure BDA0001684808990000081
wherein f_i denotes the word frequency of the i-th keyword;
let R be the association matrix of the candidate enterprise set C and its keyword set K, in which an entry is 1 when the two nodes are connected in the knowledge graph and 0 when they are not:
Figure BDA0001684808990000082
Figure BDA0001684808990000083
is the summed word-frequency vector of the candidate enterprise set C and its associated keywords:
Figure BDA0001684808990000084
wherein
Figure BDA0001684808990000085
denotes the sum of the word frequencies of all keywords in the text that are associated with the i-th candidate enterprise subject;
define a correlation factor RX, used to rank the relevance of the candidate enterprise subjects within the text,
wherein
Figure BDA0001684808990000086
Figure BDA0001684808990000087
wherein
Figure BDA0001684808990000088
define a relevance factor RY, used to rank the relevance of a candidate enterprise subject across different texts, wherein β > 0 is a scaling adjustment parameter, and scale > 0 is the number of words remaining after the segmented text is cleaned and serves as a measure of the text length:
Figure BDA0001684808990000091
wherein 0 ≤ ry_i ≤ 1;
obtain the association degree matrix R_KC of the text and the candidate enterprise set C:
Figure BDA0001684808990000092
wherein the operator denotes the matrix dot product operation, and
Figure BDA0001684808990000093
denotes the degree of association between the text and the i-th candidate enterprise subject. Based on the degree of association, a threshold can be set to screen out the enterprise subjects most closely associated with the text; likewise, different texts related to the i-th subject can be screened and sorted.
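The first part of this calculation, summing the word frequencies of the keywords connected to each candidate, can be sketched as follows. The sample values are invented, and `summed_frequencies`, `THRESHOLD`, and the normalized `share` are hypothetical: the share is a simple stand-in used here to demonstrate threshold screening, not the RX and RY factors, whose formulas appear only in the figures.

```python
def summed_frequencies(R, F):
    """fc_i = sum_j r_ij * f_j: total frequency of the keywords linked to candidate i."""
    return [sum(r * f for r, f in zip(row, F)) for row in R]


# Word frequencies of keywords k1, k2, k3 in the text (invented values).
F = [3, 1, 2]

# Rows are candidates c1, c2; 1 = connected in the knowledge graph, 0 = not connected.
R = [
    [1, 0, 1],
    [0, 1, 0],
]

FC = summed_frequencies(R, F)

# Assumed stand-in for the final score: each candidate's share of the summed frequencies.
total = sum(FC)
share = [fc / total for fc in FC]

THRESHOLD = 0.5  # hypothetical cut-off for screening candidates
selected = [i for i, s in enumerate(share) if s >= THRESHOLD]
```

With these inputs the first candidate accumulates the frequencies of k1 and k3 and is the only one that passes the cut-off.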
Preferably, the association degree of the text and the candidate enterprise subject can further be calculated from the word frequency together with the correlation coefficient between each keyword and the candidate enterprise subject, as follows:
first, count the word frequency vector F of the keyword set K:
Figure BDA0001684808990000094
wherein f_i denotes the word frequency of the i-th keyword;
let R be the correlation coefficient matrix of the candidate enterprise set C and its keyword set K:
Figure BDA0001684808990000095
wherein r_ij denotes the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword;
Figure BDA0001684808990000096
is the word frequency matrix weighted by the correlation coefficients:
Figure BDA0001684808990000097
wherein
Figure BDA0001684808990000098
denotes the sum of the weighted word frequencies of the keywords of the i-th candidate enterprise subject;
define a correlation factor RX, used to rank the relevance of the candidate enterprise subjects within the text,
wherein
Figure BDA0001684808990000099
Figure BDA0001684808990000101
wherein
Figure BDA0001684808990000102
define a relevance factor RY, used to rank the relevance of a candidate enterprise subject across different texts, wherein β > 0 is a scaling adjustment parameter, and scale > 0 is the number of words remaining after the segmented text is cleaned and serves as a measure of the text length:
Figure BDA0001684808990000103
wherein 0 ≤ ry_i ≤ 1;
obtain the association degree matrix R_KC of the text and the candidate enterprise set C:
Figure BDA0001684808990000104
wherein the operator denotes the matrix dot product operation, and
Figure BDA0001684808990000105
denotes the degree of association between the text and the i-th candidate enterprise subject. Based on the degree of association, a threshold can be set to screen out the enterprise subjects most closely associated with the text; likewise, different texts related to the i-th subject can be screened and sorted.
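The weighted variant differs from the unweighted one only in that each keyword's frequency is multiplied by its correlation coefficient before summing. A minimal sketch using a sparse dictionary representation follows; `weighted_frequency_sums` and `RELATION_WEIGHTS` are hypothetical names, the token list is invented, and the weights reuse the 0.65, 0.5, and 1 values from the earlier example. The RX and RY steps are again omitted because their formulas are not recoverable from the text.

```python
from collections import Counter

# Hypothetical mapping: candidate enterprise -> {keyword: correlation coefficient r_ij}.
RELATION_WEIGHTS = {
    "first enterprise": {
        "second enterprise A": 0.65,
        "product A": 0.5,
        "natural person B": 1.0,
    },
}


def weighted_frequency_sums(tokens, relation_weights):
    """For each candidate, sum r_ij * f_j over the keywords linked to it."""
    freq = Counter(tokens)  # f_j: word frequency of each token in the text
    return {
        enterprise: sum(weight * freq[kw] for kw, weight in keywords.items())
        for enterprise, keywords in relation_weights.items()
    }


tokens = ["product A", "product A", "natural person B", "second enterprise A"]
FC = weighted_frequency_sums(tokens, RELATION_WEIGHTS)
```

Here the sum for the first enterprise works out to roughly 0.65 × 1 + 0.5 × 2 + 1 × 1 = 2.65, so a closely related shareholder mention counts for more than a product mention of the same frequency.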
It is understood that the relation weight is introduced in order to calculate the association between a keyword and a candidate enterprise subject more accurately; in some embodiments, the relation weight is not a necessary technical feature.
In this embodiment of the invention, keywords are extracted from the text and each keyword is looked up in a pre-established knowledge graph to obtain the enterprise subjects corresponding to it. These enterprise subjects serve as candidate enterprise subjects and form a candidate enterprise subject set. The degree of association between the text and each candidate enterprise subject is then obtained from the word frequency of the keywords appearing in the text and the relation weight between each keyword and the candidate enterprise subject. This improves the success rate and accuracy of associating a text with an enterprise subject (referred to as the target enterprise subject), enriches the dimensions along which text information is associated with the target enterprise subject, and provides a more accurate basis for subsequent further analysis.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a method for calculating the relevance of a text to an enterprise principal by using a knowledge-graph according to the present invention. The method for calculating the relevance between the text and the enterprise main body by using the knowledge graph comprises the following steps:
S201, acquiring a text;
S202, performing paragraph division preprocessing on the text;
in this step, the text is subjected to paragraph segmentation preprocessing in the following manner:
the public opinion text information is set to comprise two main parts of a title and a text, and the text comprises more than or equal to 1 natural segment P. The text is divided into parts H being more than or equal to 1 and respectively marked as part1,…,partHWill part0The number of paragraphs per section is denoted as L ═ L0,l1,…,lH). Considering that different paragraphs of a text have different importance in the text, when the text is split, the length of the beginning and the end of the text is limited to make
Figure BDA0001684808990000111
The maximum ratio of the 1 st part and the H th part to the total number of the segments P is respectively adopted in the embodiment
Figure BDA0001684808990000112
The number of paragraphs contained for each partition is calculated as:
Figure BDA0001684808990000113
wherein the content of the first and second substances,
Figure BDA0001684808990000114
denotes an integer not less than x. P is a natural segment of the text, P is more than or equal to 1, and H is a text quiltThe split fractions are designated part1,…,partHTitle is denoted part0H is more than or equal to 1, and the number of paragraphs in each part is recorded as L ═ L0,l1,…,lH),
Figure BDA0001684808990000115
represents the maximum proportion of the first part relative to the total number of paragraphs P,
Figure BDA0001684808990000116
Figure BDA0001684808990000117
represents the maximum proportion of the H-th part relative to the total number of paragraphs P,
Figure BDA0001684808990000118
In this step, after the paragraph division preprocessing, a corresponding weight is also assigned to each paragraph position. Generally, the title, the front part and the tail part of the text are given higher weights, and the middle parts are given relatively lower weights. For example, the weight $w_0$ of the title is 0.35, the weight $w_1$ of the front part is 0.25, the weight $w_H$ of the tail part is 0.25, and the weight of the middle parts $w_2, \ldots, w_{H-1}$ is 0.15.
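The position weighting described in this step can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `position_weights` and the handling of the case H = 1 (where the front and tail parts coincide) are assumptions, while the default values are the example weights given in the text.

```python
def position_weights(num_parts, title=0.35, front=0.25, middle=0.15, tail=0.25):
    """Return [w_0, w_1, ..., w_H] for a body split into num_parts (H) parts.

    w_0 is the title weight, w_1 the front part, w_H the tail part and
    w_2..w_{H-1} the middle parts. Defaults are the example values from the text.
    """
    if num_parts < 1:
        raise ValueError("the body must be split into at least one part (H >= 1)")
    if num_parts == 1:
        # Assumption: with a single part, front and tail coincide; use the front weight.
        return [title, front]
    return [title, front] + [middle] * (num_parts - 2) + [tail]


weights = position_weights(3)  # H = 3: title, front part, one middle part, tail part
```

Keeping the weights as function parameters makes it easy to try other weightings without changing the splitting logic.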
S203, performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords serve as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relations and association weights between the target node information and the associated node information; the target node information comprises first enterprise subject information, and the associated node information comprises second subject information, product information or natural person information associated with the first enterprise subject information;
In this step, the segmented text obtained in step S202 is subjected to word segmentation. With reference to the knowledge graph, all candidate words in the text that can be found in the knowledge graph are obtained and labeled as keywords. The set formed by all keywords is denoted K. The keywords in K are searched in the knowledge graph to obtain the enterprise subjects associated with K; these enterprise subjects form the candidate enterprise set, denoted C.
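The retrieval of K and C can be illustrated with a small Python sketch. It assumes that word segmentation has already produced a list of tokens and that the knowledge graph is available as a nested dictionary mapping each keyword to its associated enterprises and relation weights; this data layout, the function name `find_candidates` and the toy entries are hypothetical and are not taken from the patent.

```python
def find_candidates(tokens, knowledge_graph):
    """Return (K, C): keywords found in the graph and their associated enterprises.

    tokens          -- words produced by a word segmenter (assumed to be done already)
    knowledge_graph -- hypothetical structure: {keyword: {enterprise: relation_weight}}
    """
    keywords = []    # K, in order of first appearance
    candidates = []  # C, in order of first appearance
    for token in tokens:
        if token in knowledge_graph and token not in keywords:
            keywords.append(token)
            for enterprise in knowledge_graph[token]:
                if enterprise not in candidates:
                    candidates.append(enterprise)
    return keywords, candidates


# Toy graph with made-up names and weights, for illustration only.
toy_graph = {
    "phone_x": {"Company A": 0.8},
    "founder_y": {"Company A": 0.6, "Company B": 0.3},
}
K, C = find_candidates(["today", "phone_x", "founder_y", "phone_x"], toy_graph)
```

Tokens that do not appear in the graph (here "today") are simply ignored, which matches the idea that only words found in the knowledge graph are labeled as keywords.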
S204, calculating the degree of association between the text and each candidate enterprise subject according to the word frequency, paragraph position and relation weight of the keywords associated with the candidate enterprise subjects in the candidate enterprise set, and according to the text space (that is, the length of the text), wherein the text space is determined by the number of words obtained in the word segmentation step.
In this step, the degree of association between the text and the candidate enterprise subjects is calculated as follows:
Let W be the weight matrix of the keywords by paragraph position:
Figure BDA0001684808990000121
where $w_i$ represents the weight of a keyword in part i, and $w_0$ is the weight of a keyword in the title;
Let R be the correlation coefficient matrix between the candidate subject set C and the keyword set K:
Figure BDA0001684808990000122
where $r_{ij}$ represents the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword;
Let F be the word frequency matrix of the keywords in K at the different paragraph positions:
Figure BDA0001684808990000123
where $f_{ij}$ represents the word frequency of the i-th keyword in $\mathrm{part}_j$;
Figure BDA0001684808990000124
is the word frequency matrix weighted by the correlation coefficients:
Figure BDA0001684808990000125
wherein
Figure BDA0001684808990000126
represents the sum of the weighted word frequencies of the i-th candidate enterprise subject in $\mathrm{part}_j$;
A relevance factor RX is defined, which measures the relevance ranking among the candidate enterprise subjects within a text;
where
Figure BDA0001684808990000127
Figure BDA0001684808990000128
where
Figure BDA0001684808990000129
A relevance factor RY is defined, which measures the relevance ranking of a candidate enterprise subject across different texts, where $\beta > 0$ is a scaling adjustment parameter and $scale > 0$ is the number of segmented words remaining after the segmented words of the text are cleaned, which measures the text space.
Figure BDA0001684808990000131
where $0 \le ry_i \le 1$.
The relevance matrix $R_{KC}$ between the text and the candidate enterprise subject set C is then obtained:
Figure BDA0001684808990000132
where the operator in the formula denotes the matrix dot product operation, and
Figure BDA0001684808990000133
represents the degree of association between the text and the i-th candidate enterprise subject. Based on this degree of association, a threshold can be set to screen out the enterprise subjects most closely related to the text; likewise, the different texts related to the i-th subject can be screened and ranked.
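The threshold screening mentioned here can be sketched as below. The function name `screen_by_threshold`, the candidate names, the scores and the threshold value are all illustrative assumptions; the patent does not prescribe a particular threshold.

```python
def screen_by_threshold(relevance, threshold):
    """Keep candidates whose relevance is at least `threshold`, highest first.

    relevance -- {candidate_name: degree of association with the text}
    """
    kept = [(name, score) for name, score in relevance.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores and threshold, for illustration only.
scores = {"Company A": 0.526, "Company B": 0.122}
closely_related = screen_by_threshold(scores, threshold=0.3)
```

The same function can rank texts for one subject if the dictionary keys are text identifiers instead of candidate names.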
According to this embodiment of the invention, the text is subjected to paragraph division preprocessing and corresponding weights are assigned to the paragraphs. After word segmentation, the weight matrix of the keywords is determined from the positions of the paragraphs in which they occur, and the word frequency matrix is weighted by the correlation coefficients. The relevance factors are obtained from these quantities, and from them the relevance matrix between the text and the candidate enterprise subject set C, so that the relevance of the whole text to each enterprise subject in C is obtained more accurately.
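The weighting of the word frequency matrix by the correlation coefficients can be sketched in Python. The formula images are not available in this text, so the sketch relies on an inference: since each entry of the weighted matrix is described as the sum of the weighted word frequencies of the i-th candidate in part j, it is computed here as the matrix product of R and F. The second function, which combines the per-part sums with the position weights W, is an assumed aggregation for illustration; the exact forms of RX and RY are not reproduced. All names and numbers are hypothetical.

```python
def weighted_frequency(R, F):
    """Matrix product R x F.

    R -- |C| x |K| correlation coefficients, R[i][k] for candidate i and keyword k
    F -- |K| x (H+1) word frequencies, F[k][j] for keyword k in part_j
    Entry [i][j] of the result is sum_k R[i][k] * F[k][j].
    """
    if any(len(row) != len(F) for row in R):
        raise ValueError("each row of R needs one coefficient per keyword in F")
    num_parts = len(F[0]) if F else 0
    return [
        [sum(R[i][k] * F[k][j] for k in range(len(F))) for j in range(num_parts)]
        for i in range(len(R))
    ]


def position_weighted_total(RF, W):
    """Assumed aggregation: weight each part's sum by its position weight W[j]."""
    return [sum(w * value for w, value in zip(W, row)) for row in RF]


R = [[1.0, 0.5], [0.0, 0.2]]   # 2 candidates x 2 keywords (made-up values)
F = [[1, 2, 0], [0, 1, 3]]     # 2 keywords x 3 parts: title, part_1, part_2
RF = weighted_frequency(R, F)
totals = position_weighted_total(RF, [0.5, 0.25, 0.25])
```

Plain lists are used so the sketch has no dependencies; with larger graphs a numerical library would be the more natural choice for the matrix product.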
The method for calculating the relevance of a text to an enterprise subject by using a knowledge graph is described in detail below through a specific example:
Referring to Figs. 4 and 5, Fig. 4 is the sample article of the example and Fig. 5 is the knowledge graph corresponding to the sample article; owing to limited space, only the part of the knowledge graph centered on "Leye information technology (Beijing) stock Limited" is shown.
In the first step, the sample article is preprocessed. The body of the sample article has four natural paragraphs in total, so P = 4; taking
Figure BDA0001684808990000134
H=3,
Figure BDA0001684808990000135
Figure BDA0001684808990000136
Figure BDA0001684808990000141
The paragraphs and weights obtained according to this formula are given in the following table:
Figure BDA0001684808990000142
Table 1. W = (0.35, 0.25, 0.15, 0.25)
In the second step, the keywords in the text are extracted and the candidate subject set is obtained.
(1) The keyword set in the title and body text:
K = {look, grandbin, circle of friends, video net, new look intellectuality family, Tengchun video, video TV, creation and entertainment}
(2) Searching in the knowledge graph, the set of enterprises directly related to K is:
C = {Leye information technology (Beijing) stock Limited, Shenzhen City Tengchen computer systems Limited}
In the third step, the relevance between the public opinion text and the candidate target subjects is calculated.
Combining the correlation coefficients in the knowledge graph (the numbers on the connecting lines), the correlation coefficient matrix R between the subject set C and the keyword set K can be obtained:
Figure BDA0001684808990000143
Figure BDA0001684808990000151
TABLE 2
The word frequency matrix F is as follows:
Figure BDA0001684808990000152
Figure BDA0001684808990000153
From these, the
Figure BDA0001684808990000154
matrix is obtained as follows:
Figure BDA0001684808990000155
Figure BDA0001684808990000156
After the segmented words of the text are cleaned, 148 segmented words remain, so scale = 148, and β is taken as 100:
Figure BDA0001684808990000157
The relevance matrix $R_{KC}$ between the text and the subject set C is obtained as follows:
Figure BDA0001684808990000158
Therefore, the relevance of the sample article to "Leye information technology (Beijing) stock Limited" is 0.526, and its relevance to "Shenzhen Tengchen computer systems Limited" is 0.122. (The coefficients in the above example are all assumed for illustration.)
Referring to Fig. 6, the present invention also discloses an apparatus for calculating the relevance of a text to an enterprise subject by using a knowledge graph, which comprises:
a text acquisition module, configured to acquire a text;
a word segmentation module, configured to perform word segmentation on the text, extract the set of keywords appearing in the text, and retrieve the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords serve as a candidate enterprise set, wherein the knowledge graph comprises a plurality of pieces of node information and the relation and association weight between each piece of node information and its corresponding node information, and wherein, among the plurality of pieces of node information, part of the node information is enterprise subject information and the remaining node information is product information or natural person information corresponding to the respective enterprise subject;
and an association degree calculation module, configured to calculate the degree of association between the text and each candidate enterprise subject according to the word frequency and relation weight of the keywords associated with the candidate enterprise subjects in the candidate enterprise set.
Optionally, the apparatus further comprises a paragraph division preprocessing module, configured to perform paragraph division preprocessing on the text and to assign corresponding weights to the text paragraphs;
the association degree calculation module is further configured to calculate the degree of association between the text and each candidate enterprise subject according to the word frequency, paragraph position and relation weight of the keywords associated with the candidate enterprise subjects in the candidate enterprise set, and according to the text space.
Optionally, the paragraph segmentation preprocessing module performs paragraph segmentation preprocessing according to the following formula:
Figure BDA0001684808990000161
where
Figure BDA0001684808990000162
denotes an integer not less than x; P is the number of natural paragraphs in the body of the text, $P \ge 1$; H is the number of parts into which the body is split, $H \ge 1$, the parts being denoted $\mathrm{part}_1, \ldots, \mathrm{part}_H$ and the title being denoted $\mathrm{part}_0$; the number of paragraphs in each part is denoted $L = (l_0, l_1, \ldots, l_H)$;
Figure BDA0001684808990000165
represents the maximum proportion of the first part relative to the total number of paragraphs P,
Figure BDA0001684808990000163
Figure BDA0001684808990000164
represents the maximum proportion of the H-th part relative to the total number of paragraphs P,
Figure BDA0001684808990000166
Optionally, the word segmentation module is further configured to perform word segmentation on the segmented text obtained by paragraph division, obtain all keywords to form a keyword set denoted K, search the keywords in K in the knowledge graph, and obtain the enterprise subjects associated with K, so that these enterprise subjects serve as the candidate enterprise set, denoted C.
In the embodiment of the present invention, the functional description of each module of the apparatus for calculating the relevancy between a text and an enterprise body by using a knowledge graph may refer to the description of the method above, and is not repeated here.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (5)

1. A method for calculating the relevancy of a text and an enterprise main body by using a knowledge graph comprises the following steps:
acquiring a text;
performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords serve as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relations and association weights between the target node information and the associated node information; the target node information comprises first enterprise subject information, and the associated node information comprises second subject information, product information or natural person information associated with the first enterprise subject information;
calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set;
characterized in that the step of performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph so that they serve as a candidate enterprise set comprises:
performing word segmentation on the text to obtain all keywords, which form a keyword set denoted K; searching the keywords in K in the knowledge graph; and obtaining the enterprise subjects associated with K, so that these enterprise subjects serve as the candidate enterprise set, denoted C.
2. The method of calculating the relevance of text to an enterprise subject using a knowledge-graph of claim 1, wherein in the step of calculating the relevance of text to the candidate enterprise subject, further comprising:
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency and the relation weight of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
3. The method for calculating relevance of text to an enterprise principal using a knowledge-graph of claim 2, wherein prior to the step of tokenizing the text, further comprising:
performing paragraph division preprocessing on the text, and giving corresponding weight to the position of a paragraph;
in the step of calculating the association degree of the text with the candidate business entity, the method further comprises:
and calculating the degree of association between the text and each candidate enterprise subject according to the word frequency, paragraph position and relation weight of the keywords associated with the candidate enterprise subjects in the candidate enterprise set, and according to the text space.
4. An apparatus for calculating relevance of text to an enterprise subject using a knowledge graph, comprising:
the text acquisition module is used for acquiring a text;
a word segmentation module, configured to perform word segmentation on the text, extract the set of keywords appearing in the text, and retrieve the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords serve as a candidate enterprise set, wherein the knowledge graph comprises a plurality of pieces of node information and the relation and association weight between each piece of node information and its corresponding node information, and wherein, among the plurality of pieces of node information, part of the node information is enterprise subject information and the remaining node information is product information or natural person information corresponding to the respective enterprise subject;
and the association degree calculation module is used for calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the occurrence of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
5. The apparatus for calculating relevancy of text to business subjects using knowledge-graph as claimed in claim 4, wherein said relevancy calculation module is further configured to calculate relevancy of text to said candidate business subject according to word frequency and relationship weight of occurrence of keywords associated with candidate business subjects in said candidate business set.
CN201810567101.5A 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph Active CN109033132B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810567101.5A CN109033132B (en) 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810567101.5A CN109033132B (en) 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph

Publications (2)

Publication Number Publication Date
CN109033132A CN109033132A (en) 2018-12-18
CN109033132B true CN109033132B (en) 2020-12-11

Family

ID=64611958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810567101.5A Active CN109033132B (en) 2018-06-05 2018-06-05 Method and device for calculating text and subject correlation by using knowledge graph

Country Status (1)

Country Link
CN (1) CN109033132B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815499B (en) * 2019-01-25 2023-05-23 杭州凡闻科技有限公司 Information association method and system
CN110888990B (en) * 2019-11-22 2024-04-12 深圳前海微众银行股份有限公司 Text recommendation method, device, equipment and medium
CN111125369A (en) * 2019-11-25 2020-05-08 深圳壹账通智能科技有限公司 Tacit degree detection method, equipment, server and readable storage medium
CN111881183A (en) * 2020-07-28 2020-11-03 北京金堤科技有限公司 Enterprise name matching method and device, storage medium and electronic equipment
CN112732883A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Fuzzy matching method and device based on knowledge graph and computer equipment
CN113688628B (en) * 2021-07-28 2023-09-22 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886063A (en) * 2014-03-18 2014-06-25 国家电网公司 Text retrieval method and device
CN104346446A (en) * 2014-10-27 2015-02-11 百度在线网络技术(北京)有限公司 Paper associated information recommendation method and device based on mapping knowledge domain
CN105117487A (en) * 2015-09-19 2015-12-02 杭州电子科技大学 Book semantic retrieval method based on content structures
CN105354321A (en) * 2015-11-16 2016-02-24 中国建设银行股份有限公司 Query data processing method and device
CN106095858A (en) * 2016-06-02 2016-11-09 海信集团有限公司 A kind of audio video searching method, device and terminal
CN107679186A (en) * 2017-09-30 2018-02-09 北京奇虎科技有限公司 The method and device of entity search is carried out based on entity storehouse
CN108038204A (en) * 2017-12-15 2018-05-15 福州大学 For the viewpoint searching system and method for social media
CN108090167A (en) * 2017-12-14 2018-05-29 畅捷通信息技术股份有限公司 Method, system, computing device and the storage medium of data retrieval

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150310073A1 (en) * 2014-04-29 2015-10-29 Microsoft Corporation Finding patterns in a knowledge base to compose table answers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Detecting research topics via the correlation between graphs and texts";Yookyung Jo et al.;《 Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining》;20070831;全文 *
"非相关文献知识发现的关键技术研究";张云秋 等;《情报学报》;20080831;第27卷(第4期);全文 *

Also Published As

Publication number Publication date
CN109033132A (en) 2018-12-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant