CN109033132B - Method and device for calculating text and subject correlation by using knowledge graph - Google Patents
Method and device for calculating text and subject correlation by using knowledge graph
- Publication number: CN109033132B (application CN201810567101.5A)
- Authority: CN (China)
- Prior art keywords: text, enterprise, subject, candidate, information
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method and a device for calculating the relevance between a text and an enterprise subject by using a knowledge graph. The method comprises the following steps: acquiring a text; performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords form a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relation between the target node information and the associated node information, and an association weight, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information; and calculating the degree of association between the text and each candidate enterprise subject according to the word frequency of the keywords associated with that candidate enterprise subject in the candidate enterprise set.
Description
Technical Field
The invention relates to a method and a device for calculating the relevancy between a text and a main body by using a knowledge graph.
Background
In the information age, acquiring, processing and analyzing massive amounts of data is a major challenge. In some industries (e.g., the financial industry), people follow information about an enterprise along many dimensions to support decisions such as investments. Market participants require broader and more complete data on the one hand, and timely processing of that data on the other. Enterprise public opinion information is a dimension that market participants watch closely; as unstructured text, it is scattered, large in volume, complex in format, and highly time-sensitive. Many financial practitioners therefore need to process such data efficiently and extract valuable information from it with technical means such as natural language processing. Faced with a mass of public opinion information, associating it with the enterprises of interest, so as to screen out information that has low value or is unrelated to the subject, is an important step in data analysis and mining.
The common method is to construct a keyword library for the enterprise subject, including the business name, abbreviation, listing code, etc. of the enterprise, then perform keyword matching searches over the text information library and treat the matched texts as information related to the enterprise subject. This method has several drawbacks. First, a relatively complete enterprise keyword library must be constructed in advance as the basis for retrieval. Second, ranking the matched results by degree of association works poorly: a keyword is often found in a text that is not actually about the enterprise, so considerable redundant information remains. Third, because association relies on direct keyword matching, important information about enterprises closely associated with the target enterprise may be missed, causing information loss.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the invention is to provide a method and a device for calculating the relevance between a text and a subject by using a knowledge graph, which improve on the traditional approach of relying on keyword matching alone when analyzing massive amounts of text. By incorporating a knowledge graph, the degree of association between the target subject and the text information can be quantified, the dimensions along which text information is associated with the target subject are enriched, and a basis is provided for subsequent further analysis.
In order to solve the technical problems, the invention adopts a technical scheme that: the method for calculating the relevancy between the text and the enterprise main body by using the knowledge graph comprises the following steps:
acquiring a text;
performing word segmentation on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords form a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relation between the target node information and the associated node information, and an association weight, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information;
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the step of performing word segmentation processing on the text, extracting a keyword set appearing in the text, and retrieving an enterprise subject associated with the keyword through a pre-established knowledge graph so as to take the enterprise subject associated with the keyword as a candidate enterprise set includes:
performing word segmentation processing on a text to obtain all keywords to form a keyword set, wherein the keyword set is marked as K, searching the keywords in the keyword set K in the knowledge graph, and acquiring an enterprise subject associated with the keyword set K to use the enterprise subject associated with the keywords as a candidate enterprise set, and the candidate enterprise set is marked as C.
Further, in the step of calculating the association degree between the text and the candidate enterprise main body according to the word frequency of the keyword occurrence associated with the candidate enterprise main body in the candidate enterprise set, the method includes:
let F be the word frequency matrix of the keyword set K, wherein $f_i$ represents the word frequency of the i-th keyword;
let R be the correlation matrix of the subject set C and its keyword set K, whose entries are 1 for nodes connected in the knowledge graph and 0 for nodes not connected, wherein the i-th resulting element represents the sum of the word frequencies of all keywords in the text related to the i-th candidate enterprise subject;
defining a correlation factor RX, which measures the relevance ranking among the candidate enterprise subjects within the text;
defining a relevance factor RY, which measures the relevance ranking of a candidate enterprise subject across different texts, wherein $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of segmented words remaining after the word segmentation result of the text information is cleaned, used to measure the text length;
wherein $0 \le ry_i \le 1$;
obtaining the correlation matrix $R_{KC}$ of the text and the candidate enterprise subject set C by a matrix dot product operation, wherein the i-th element of $R_{KC}$ indicates the degree of association of the text with the i-th candidate enterprise subject.
Further, in the step of calculating the association degree between the text and the candidate business entity, the method further includes:
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency and the relation weight of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the step of calculating the association degree between the text and the candidate enterprise subject according to the word frequency and the relationship weight of the occurrence of the keyword associated with the candidate enterprise subject in the candidate enterprise set includes:
first, counting the word frequency vector F of the keyword set K, wherein $f_i$ represents the word frequency of the i-th keyword;
let R be the correlation coefficient matrix of the candidate enterprise set C and its keyword set K, wherein $r_{ij}$ represents the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword, and the i-th resulting element represents the sum of the weighted word frequencies of the keywords of the i-th candidate enterprise subject;
defining a correlation factor RX, which measures the relevance ranking among the candidate enterprise subjects within the text;
defining a relevance factor RY, which measures the relevance ranking of a candidate enterprise subject across different texts, wherein $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of segmented words remaining after the word segmentation result of the text information is cleaned, used to measure the text length;
wherein $0 \le ry_i \le 1$;
obtaining the correlation matrix $R_{KC}$ of the text and the candidate enterprise subject set C by a matrix dot product operation, wherein the i-th element of $R_{KC}$ indicates the degree of association of the text with the i-th candidate enterprise subject.
Further, before the step of performing word segmentation processing on the text, the method further includes:
performing paragraph division preprocessing on the text, and giving corresponding weight to the position of a paragraph;
in the step of calculating the association degree of the text with the candidate business entity, the method further comprises:
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency and paragraph position of the keywords associated with the candidate enterprise subject in the candidate enterprise set, the relation weight, and the text length.
Further, the text is subjected to paragraph segmentation preprocessing by the following formula:
wherein $\lceil x \rceil$ represents the smallest integer not less than x; P is the number of natural paragraphs of the text, $P \ge 1$; H is the number of parts into which the text is split, $H \ge 1$, the parts being denoted $part_1, \dots, part_H$ and the title being denoted $part_0$; the number of paragraphs in each part is recorded as $L = (l_0, l_1, \dots, l_H)$; and two parameters represent the maximum proportions of the first part and of the H-th part, respectively, to the total number of paragraphs P.
Further, the step of calculating the association degree of the text and the candidate enterprise subject according to the word frequency and paragraph position of the keywords associated with the candidate enterprise subject in the candidate enterprise set, the relation weight, and the text length comprises the following substeps:
let W be the weight matrix of keywords by paragraph position, wherein $w_i$ represents the weight of a keyword appearing in part i, and $w_0$ is the weight of a keyword in the title;
let R be the correlation coefficient matrix of the enterprise subject set C and the keyword set K, wherein $r_{ij}$ represents the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword;
let F be the word frequency matrix of the keyword set K at the different paragraph positions, wherein $f_{ij}$ represents the word frequency of the i-th keyword in $part_j$, and the resulting (i, j) element represents the sum of the weighted word frequencies of the i-th candidate enterprise subject in $part_j$;
defining a correlation factor RX, which measures the relevance ranking among the candidate enterprise subjects within the text;
defining a relevance factor RY, which measures the relevance ranking of a candidate enterprise subject across different texts, wherein $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of segmented words remaining after the word segmentation result of the text information is cleaned, used to measure the text length;
wherein $0 \le ry_i \le 1$;
obtaining the correlation matrix $R_{KC}$ of the text and the candidate enterprise subject set C by a matrix dot product operation, wherein the i-th element of $R_{KC}$ indicates the degree of association of the text with the i-th candidate enterprise subject.
In order to solve the technical problem, the invention adopts another technical scheme that: an apparatus for calculating the relevancy of a text and an enterprise main body by using a knowledge graph is provided, which comprises:
the text acquisition module is used for acquiring a text;
the word segmentation module is used for performing word segmentation processing on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords form a candidate enterprise set, wherein the knowledge graph comprises a plurality of pieces of node information together with the relation and the association weight between each piece of node information and its corresponding node information, and among the plurality of pieces of node information, one piece is enterprise subject information and the remaining pieces are product information or natural person information corresponding to that enterprise subject;
and the association degree calculation module is used for calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the occurrence of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
Further, the relevancy calculation module is further configured to calculate relevancy of the text and the candidate enterprise subject according to word frequency and relationship weight of occurrence of keywords associated with the candidate enterprise subject in the candidate enterprise set.
The invention constructs a knowledge graph for the financial field and uses it as the relation network of candidate matching keywords, covering relations such as registered business names, abbreviations, products, senior executives, shareholders and investments, with the enterprise as the target subject. Keywords are given different weights according to the paragraph positions in which they appear, so that the importance of different paragraphs of the text is taken into account. The degree of association is calculated over all possible keywords using the complex relation network constructed with knowledge graph technology, and is finally weighted and quantified, which improves the success rate and accuracy of associating a text with the target subject.
Drawings
FIG. 1 is a flowchart of a first embodiment of a method for calculating relevance of text to an enterprise principal using a knowledge-graph of the present invention.
FIG. 2 is a schematic diagram of the structure of a knowledge-graph of the present invention.
FIG. 3 is a flowchart of a second embodiment of a method for calculating relevance of text to an enterprise principal using a knowledge-graph of the present invention.
FIG. 4 is a schematic illustration of a sample article in a specific example.
Fig. 5 is a schematic illustration of a knowledge-graph associated with the sample article in a specific example.
FIG. 6 is a block diagram of an embodiment of an apparatus for calculating relevance of text to an enterprise principal using a knowledge-graph of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the method for calculating the relevance between a text and an enterprise subject by using a knowledge graph of the present invention includes the following steps:
S101, acquiring a text;
The text may be public opinion text (i.e., public opinion information).
S102, performing word segmentation processing on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords form a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relation between the target node information and the associated node information, and an association weight, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information;
The knowledge graph is established as follows: target node information and associated node information are extracted from a database (such as a corpus), and corresponding relevance weights are assigned according to the relation between the target node information and the associated node information, thereby forming the knowledge graph (see fig. 2). The target node information is first enterprise subject information (e.g., the name of an enterprise: XX corporation). The node information associated with the target node information may be second enterprise subject information associated with the first enterprise subject information, natural person information associated with the first enterprise subject (e.g., a senior executive or shareholder of the first enterprise subject), or a product associated with the first enterprise subject (e.g., a product developed and marketed by the first enterprise subject). In the knowledge graph, both the first and the second enterprise subject information can serve as target node information: when the second enterprise subject A in fig. 2 becomes the target node, the original first enterprise subject in fig. 2 becomes a node associated with the second enterprise subject A, and the relation between the two changes accordingly. The relation and the relevance weight between each target node and its associated nodes are also embodied in the knowledge graph. The relation between the first enterprise subject and the second enterprise subject includes, but is not limited to, investment relations, supply-demand relations, and guarantee relations; the relation between a natural person and the first enterprise subject includes employment-type relations (e.g., shareholder, senior executive, employee).
For example, the second enterprise subject A is a supplier of the first enterprise subject, with a relevance weight of 0.65; product A is a product of the first enterprise subject, with a relevance weight of 0.5; and natural person B is a shareholder of the first enterprise subject, with a relevance weight of 1. In the knowledge graph, the relevance is assigned according to the attribute information of each relation: for example, the larger the investment proportion, the greater the relevance, and the more important the position held, the greater the relevance. The specific construction method is not described in detail here. The constructed knowledge graph can be stored in a graph database and used for retrieval and query.
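The weighted relations in this example can be pictured as a small in-memory graph. The following is a minimal sketch only: the class name `KnowledgeGraph`, its methods, and the node labels are hypothetical, and the patent stores the graph in a graph database rather than in a Python dictionary.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy weighted graph: target enterprise -> associated nodes (illustrative only)."""

    def __init__(self):
        # edges[target][associated_node] = (relation, relevance_weight)
        self.edges = defaultdict(dict)

    def add_relation(self, target, node, relation, weight):
        """Record that `node` is related to the target enterprise with a weight."""
        self.edges[target][node] = (relation, weight)

    def targets_of(self, keyword):
        """Return {target: weight} for every target enterprise linked to `keyword`."""
        found = {}
        for target, nodes in self.edges.items():
            if keyword in nodes:
                found[target] = nodes[keyword][1]
        return found


# Relations and weights taken from the example in the text.
kg = KnowledgeGraph()
kg.add_relation("First Enterprise", "Second Enterprise A", "supplier", 0.65)
kg.add_relation("First Enterprise", "Product A", "product", 0.5)
kg.add_relation("First Enterprise", "Natural Person B", "shareholder", 1.0)

print(kg.targets_of("Product A"))
```

Looking up an associated node returns the target enterprises it points to, together with the relevance weight, which is the information the later association calculation relies on.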
In the step S102, all keywords are obtained through word segmentation processing to form a keyword set, where the keyword set is denoted as K, keywords in the keyword set K are searched in the knowledge graph, and an enterprise subject associated with the keyword set K is obtained to serve as a candidate enterprise set, and the candidate enterprise set is denoted as C.
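The retrieval of the keyword set K and the candidate enterprise set C can be sketched as follows. This is an illustration under stated assumptions: `GRAPH_INDEX` is a hypothetical keyword-to-enterprise lookup derived from the knowledge graph, `tokens` stands for the output of whatever word segmenter is used, and the function name `extract_candidates` is not from the patent.

```python
from collections import Counter

# Hypothetical index: keyword -> {enterprise: association weight}.
GRAPH_INDEX = {
    "Product A": {"First Enterprise": 0.5},
    "Natural Person B": {"First Enterprise": 1.0},
    "Second Enterprise A": {"First Enterprise": 0.65},
}


def extract_candidates(tokens, graph_index):
    """Return (K, F, C): keywords found in the graph, their word
    frequencies, and the enterprises associated with those keywords."""
    counts = Counter(t for t in tokens if t in graph_index)
    keywords = sorted(counts)                      # keyword set K
    freqs = [counts[k] for k in keywords]          # word frequencies, aligned with K
    candidates = sorted({e for k in keywords for e in graph_index[k]})  # set C
    return keywords, freqs, candidates


# Assumed segmenter output for a short text.
tokens = ["Product A", "launch", "Natural Person B", "Product A", "said"]
K, F, C = extract_candidates(tokens, GRAPH_INDEX)
print(K, F, C)
```

Only tokens that appear in the knowledge graph become keywords; all other segmented words are ignored at this stage.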
S103, calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set. The method for calculating the association degree according to the word frequency comprises the following steps:
let F be the word frequency matrix of the keyword set K, wherein $f_i$ represents the word frequency of the i-th keyword;
let R be the correlation matrix of the subject set C and its keyword set K, whose entries are 1 for nodes connected in the knowledge graph and 0 for nodes not connected, wherein the i-th resulting element represents the sum of the word frequencies of all keywords in the text related to the i-th candidate enterprise subject;
defining a correlation factor RX, which measures the relevance ranking among the candidate enterprise subjects within the text;
defining a relevance factor RY, which measures the relevance ranking of a candidate enterprise subject across different texts, wherein $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of segmented words remaining after the word segmentation result of the text information is cleaned, used to measure the text length;
wherein $0 \le ry_i \le 1$;
obtaining the correlation matrix $R_{KC}$ of the text and the candidate enterprise subject set C by a matrix dot product operation, wherein the i-th element of $R_{KC}$ indicates the degree of association of the text with the i-th candidate enterprise subject. Based on this degree of association, a threshold can be set to screen out the enterprise subjects most closely related to the text; likewise, different texts related to the i-th subject can be screened and ranked.
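The first part of this calculation, summing the word frequencies of the keywords connected to each candidate, can be sketched as below. The sketch stops at the per-candidate sums: the exact expressions for RX and RY are not reproduced here, so they are not implemented. The function names and sample values are hypothetical, and the `screen` helper simply applies a threshold to whatever association degrees it is given.

```python
def keyword_frequency_sums(R, F):
    """For each candidate i, sum F[j] over the keywords j with R[i][j] == 1."""
    return [sum(r * f for r, f in zip(row, F)) for row in R]


def screen(degrees, candidates, threshold):
    """Keep the candidates whose association degree reaches the threshold."""
    return [c for c, d in zip(candidates, degrees) if d >= threshold]


# Three keywords with word frequencies 3, 1 and 2 (made-up values).
F = [3, 1, 2]
# Binary correlation matrix: rows are candidates, columns are keywords.
R = [
    [1, 1, 0],  # candidate "A" is connected to keywords 0 and 1
    [0, 1, 1],  # candidate "B" is connected to keywords 1 and 2
]

sums = keyword_frequency_sums(R, F)
print(sums)                            # per-candidate word frequency sums
print(screen(sums, ["A", "B"], 4))     # threshold applied to the raw sums, for illustration
```

In the method itself the threshold is applied to the final degree of association rather than to the raw sums; the raw sums are used here only because they are the part of the calculation that the surrounding text fully specifies.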
Preferably or optionally, the association degree of the text and the candidate enterprise subject can further be calculated from the word frequency together with the correlation coefficient between each keyword and the candidate enterprise subject, as follows:
first, counting the word frequency vector F of the keyword set K, wherein $f_i$ represents the word frequency of the i-th keyword;
let R be the correlation coefficient matrix of the candidate enterprise set C and its keyword set K, wherein $r_{ij}$ represents the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword, and the i-th resulting element represents the sum of the weighted word frequencies of the keywords of the i-th candidate enterprise subject;
defining a correlation factor RX, which measures the relevance ranking among the candidate enterprise subjects within the text;
and defining a relevance factor RY, which measures the relevance ranking of a candidate enterprise subject across different texts, wherein $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of segmented words remaining after the word segmentation result of the text information is cleaned, used to measure the text length,
wherein $0 \le ry_i \le 1$.
The correlation matrix $R_{KC}$ of the text and the candidate enterprise subject set C is then obtained by a matrix dot product operation, wherein the i-th element of $R_{KC}$ indicates the degree of association of the text with the i-th candidate enterprise subject. Based on this degree of association, a threshold can be set to screen out the enterprise subjects most closely related to the text; likewise, different texts related to the i-th subject can be screened and ranked.
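With relation weights, the entries of R become the correlation coefficients $r_{ij}$ read from the knowledge graph instead of 0 or 1. The sketch below builds such a matrix and computes the weighted word frequency sums. The index layout, the function names, and the weight of 1.0 for an enterprise matching its own name are assumptions made for illustration; RX and RY are again left out because their expressions are not reproduced here.

```python
def build_coefficient_matrix(candidates, keywords, graph_index):
    """R[i][j] = weight between candidate i and keyword j, or 0.0 if unconnected."""
    return [[graph_index.get(k, {}).get(c, 0.0) for k in keywords]
            for c in candidates]


def weighted_frequency_sums(R, F):
    """For each candidate i, compute sum_j R[i][j] * F[j]."""
    return [sum(r * f for r, f in zip(row, F)) for row in R]


# Hypothetical index: keyword -> {enterprise: relation weight}.
graph_index = {
    "Product A": {"First Enterprise": 0.5},
    "Natural Person B": {"First Enterprise": 1.0},
    # Assumption: an enterprise name matches its own subject with weight 1.0.
    "Second Enterprise A": {"First Enterprise": 0.65, "Second Enterprise A": 1.0},
}

C = ["First Enterprise", "Second Enterprise A"]
K = ["Product A", "Natural Person B", "Second Enterprise A"]
F = [2, 1, 3]  # made-up word frequencies aligned with K

R = build_coefficient_matrix(C, K, graph_index)
x = weighted_frequency_sums(R, F)
print(R)
print(x)
```

Compared with the binary matrix, a keyword that is only loosely related to an enterprise now contributes less to that enterprise's sum than a keyword with a strong relation.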
It is understood that the relation weight is used in order to calculate the association between a keyword and a candidate enterprise subject more accurately; in some embodiments, the relation weight is not a necessary technical feature.
According to this embodiment of the invention, after the keywords in a text are extracted, each keyword is looked up in a pre-established knowledge graph to obtain the enterprise subjects corresponding to it, and these enterprise subjects form the candidate enterprise subject set. The degree of association between the text and each candidate enterprise subject is then obtained from the word frequency of the keywords appearing in the text and the relation weight between each keyword and the candidate enterprise subject. This improves the success rate and accuracy of associating the text with the enterprise subject (called the target enterprise subject), enriches the dimensions along which text information is associated with the target enterprise subject, and provides a more accurate basis for subsequent further analysis.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a method for calculating the relevance of a text to an enterprise principal by using a knowledge-graph according to the present invention. The method for calculating the relevance between the text and the enterprise main body by using the knowledge graph comprises the following steps:
S201, acquiring a text;
S202, performing paragraph division preprocessing on the text;
In this step, the text is subjected to paragraph segmentation preprocessing in the following manner:
The public opinion text is assumed to comprise two main parts, a title and a body, where the body contains $P \ge 1$ natural paragraphs. The body is divided into $H \ge 1$ parts, denoted $part_1, \dots, part_H$; the title is denoted $part_0$; and the number of paragraphs in each part is recorded as $L = (l_0, l_1, \dots, l_H)$. Considering that different paragraphs have different importance in the text, the lengths of the beginning and the end are limited when the body is split: the maximum proportions of the 1st part and of the H-th part to the total number of paragraphs P are bounded by two parameters, to which this embodiment assigns fixed values. The number of paragraphs contained in each part is calculated as:
wherein $\lceil x \rceil$ denotes the smallest integer not less than x; P is the number of natural paragraphs of the text, $P \ge 1$; H is the number of parts into which the text is split, $H \ge 1$, the parts being denoted $part_1, \dots, part_H$ and the title being denoted $part_0$; the number of paragraphs in each part is recorded as $L = (l_0, l_1, \dots, l_H)$; and two parameters represent the maximum proportions of the first part and of the H-th part, respectively, to the total number of paragraphs P.
In this step, after the paragraph segmentation preprocessing, corresponding weights are also assigned to the paragraph positions. Generally, the title, front part and tail part of the text are given higher weights, and the middle of the text is given a relatively lower weight. For example, the weight $w_0$ of the title part is 0.35, the weight $w_1$ of the front part is 0.25, the weight $w_H$ of the tail part is 0.25, and the weights $w_2$ to $w_{H-1}$ of the middle parts are 0.15.
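The example weighting scheme can be written as a small helper that returns the weight vector for a given number of parts H. This is a sketch under assumptions: the function name and parameter names are hypothetical, and the behaviour for H = 1, where the front and tail parts coincide, is a choice made here rather than something the text specifies.

```python
def position_weights(num_parts, title=0.35, front=0.25, middle=0.15, tail=0.25):
    """Return [w_0, w_1, ..., w_H] for a body split into `num_parts` parts.

    w_0 is the title weight, w_1 the front part, w_H the tail part, and
    every part in between receives the middle weight.
    """
    if num_parts < 1:
        raise ValueError("the body must be split into at least one part")
    if num_parts == 1:
        # Assumption: with a single part, treat it as the front part.
        return [title, front]
    return [title, front] + [middle] * (num_parts - 2) + [tail]


print(position_weights(3))
print(position_weights(5))
```

With the default values, three parts give a weight vector of title, front, one middle part, and tail, which is the shape of the weight vector W used in the later calculation.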
S203, performing word segmentation processing on the text, extracting the set of keywords appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph, so that the enterprise subjects associated with the keywords form a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, the relation between the target node information and the associated node information, and an association weight, the target node information comprises first enterprise subject information, and the associated node information comprises second enterprise subject information, product information, or natural person information associated with the first enterprise subject information;
in this step, the segmented text obtained in step S202 is subjected to word segmentation processing, all candidate words that can be found in the knowledge graph in the text are obtained by combining the knowledge graph, and are used as keywords to be labeled, a keyword set formed by all keywords is recorded as K, keywords in the keyword set K are searched in the knowledge graph, an enterprise subject associated with the keyword set K is obtained, the enterprise subject associated with the keywords is used as a candidate enterprise set, and the candidate enterprise set is recorded as C.
S204, calculating the association degree of the text and the candidate enterprise subject according to the word frequency and paragraph position of the keywords associated with the candidate enterprise subject in the candidate enterprise set, the relation weight, and the text length, wherein the text length is determined by the number of words obtained in the word segmentation step.
In this step, the association degree of the text and the candidate enterprise subject is calculated in the following manner:
let W be the weight matrix of keywords by paragraph position, wherein $w_i$ represents the weight of a keyword appearing in part i, and $w_0$ is the weight of a keyword in the title;
let R be the correlation coefficient matrix of the subject set C and the keyword set K, wherein $r_{ij}$ represents the correlation coefficient between the i-th candidate enterprise subject and the j-th keyword;
let F be the word frequency matrix of the keyword set K at the different paragraph positions, wherein $f_{ij}$ represents the word frequency of the i-th keyword in $part_j$, and the resulting (i, j) element represents the sum of the weighted word frequencies of the i-th candidate enterprise subject in $part_j$;
defining a correlation factor RX, which measures the relevance ranking among the candidate enterprise subjects within the text;
and defining a relevance factor RY, which measures the relevance ranking of a candidate enterprise subject across different texts, wherein $\beta > 0$ is a scaling adjustment parameter, and $scale > 0$ is the number of segmented words remaining after the word segmentation result of the text information is cleaned, used to measure the text length,
wherein $0 \le ry_i \le 1$.
The correlation matrix $R_{KC}$ of the text and the candidate enterprise subject set C is then obtained by a matrix dot product operation, wherein the i-th element of $R_{KC}$ indicates the degree of association of the text with the i-th candidate enterprise subject. Based on this degree of association, a threshold can be set to screen out the enterprise subjects most closely related to the text; likewise, different texts related to the i-th subject can be screened and ranked.
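The position-aware step can be sketched as two nested sums: first the relation-weighted word frequency of each candidate in each part, then a combination of those per-part values using the position weights W. The second step is an assumption about how W enters the calculation, since the exact formula is not reproduced here, and RX and RY are not implemented. All names and sample values below are hypothetical.

```python
def per_part_sums(R, F):
    """x[i][j] = sum_k R[i][k] * F[k][j]: weighted frequency of candidate i in part j."""
    num_parts = len(F[0])
    return [[sum(row[k] * F[k][j] for k in range(len(F))) for j in range(num_parts)]
            for row in R]


def position_weighted_totals(x, W):
    """Assumed combination: total_i = sum_j W[j] * x[i][j]."""
    return [sum(w * v for w, v in zip(W, row)) for row in x]


W = [0.35, 0.25, 0.15, 0.25]   # title, front, middle, tail (H = 3)
# F[k][j]: frequency of keyword k in part j (made-up values, 2 keywords x 4 parts).
F = [
    [1, 2, 0, 1],
    [0, 1, 3, 0],
]
# R[i][k]: correlation coefficient of candidate i and keyword k (one candidate).
R = [[1.0, 0.5]]

x = per_part_sums(R, F)
totals = position_weighted_totals(x, W)
print(x)
print(totals)
```

A keyword occurring in the title or the opening part therefore moves a candidate's total more than the same keyword occurring in the middle of the body.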
According to this embodiment of the invention, the text is preprocessed by dividing it into paragraphs and assigning a weight to each paragraph position. After word segmentation, the weight matrix of the keywords is determined from the positions of the paragraphs in which they occur, and the word frequency matrix is weighted by the correlation coefficients to obtain the relevance factors and, from them, the correlation matrix of the text and the candidate enterprise subject set C. The relevance of the whole text to each enterprise subject in C is thereby obtained more accurately.
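As a rough illustration of the data flow described above, the following Python sketch combines a correlation coefficient matrix R, a word frequency matrix F, and paragraph weights W. The exact expressions for RX and RY are not reproduced in this text, so the two normalizations used here (a share of the total for RX, and a linear scaling by beta/scale clipped to [0, 1] for RY) are assumptions chosen only to respect the stated constraints ($\beta > 0$, $scale > 0$, $0 \le ry_i \le 1$). The function name and all numeric values are invented for illustration.

```python
import numpy as np

def relevance(R, F, W, beta, scale):
    """Sketch of text-to-subject relevance (not the patented formula).

    R: (n_subjects, n_keywords) correlation coefficients r_ij
    F: (n_keywords, n_parts)    word frequencies f_ij per part
    W: (n_parts,)               paragraph-position weights, W[0] = title
    """
    R, F, W = (np.asarray(a, dtype=float) for a in (R, F, W))
    # Weighted word-frequency sum of each subject in each part.
    part_sums = R @ F
    # Fold in the paragraph-position weights: one score per subject.
    score = part_sums @ W
    total = score.sum()
    # ASSUMPTION: RX is each subject's share of the total score.
    rx = score / total if total > 0 else np.zeros_like(score)
    # ASSUMPTION: RY is a length-normalised score clipped to [0, 1].
    ry = np.clip(beta * score / scale, 0.0, 1.0)
    # Element-wise product of the two factors.
    return rx * ry

# Invented inputs: 2 candidate subjects, 3 keywords, title plus 2 parts.
R = [[1.0, 0.5, 0.0],
     [0.0, 0.0, 0.8]]
F = [[1, 2, 0],
     [0, 1, 1],
     [0, 1, 0]]
W = [0.5, 0.3, 0.2]
scores = relevance(R, F, W, beta=10, scale=100)
```

With these inputs the first subject receives the higher score, because more of its keywords occur and they occur in more heavily weighted positions.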
The method for calculating the relevance of a text to an enterprise subject by using a knowledge graph is described in detail below through a specific example.
Referring to FIG. 4 and FIG. 5, FIG. 4 is the sample article of the example, and FIG. 5 is the knowledge graph corresponding to the sample article; due to limited space, only the part of the knowledge graph centered on "Leye Information Technology (Beijing) Co., Ltd." is shown.
Step 1: preprocess the sample article. The body of the sample article has four natural paragraphs in total, so P = 4, and H = 3 is taken.
The parts and weights obtained according to the formula are given in the following table:
Table 1: W = (0.35, 0.25, 0.15, 0.25)
Step 2: extract the keywords in the text and the candidate subject set.
(1) The keyword set in the title and body text:
K = {look, grandbin, circle of friends, video net, new look intellectuality family, Tengchun video, video TV, creation and entertainment}
(2) Searching the knowledge graph, the set of enterprises directly related to K is:
C = {"Leye Information Technology (Beijing) Co., Ltd."; "Shenzhen Tengchen Computer Systems Limited"}
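The retrieval in this step can be sketched as a lookup from keywords to the enterprise subjects they are linked to in the graph. The edge list, node names, and weights below are hypothetical placeholders, and the flat tuple representation is an assumption; the source does not specify how the knowledge graph is stored.

```python
from collections import defaultdict

# Hypothetical graph edges: (associated node, enterprise subject, association weight).
EDGES = [
    ("product A", "Company X", 0.9),
    ("founder B", "Company X", 0.7),
    ("product C", "Company Y", 0.8),
]

def candidate_set(keywords, edges):
    """Return {enterprise: {keyword: weight}} for keywords found in the graph."""
    # Index the edges by associated node for direct keyword lookup.
    index = defaultdict(dict)
    for node, enterprise, weight in edges:
        index[node][enterprise] = weight
    # Collect every enterprise reachable from at least one keyword.
    candidates = defaultdict(dict)
    for keyword in keywords:
        for enterprise, weight in index.get(keyword, {}).items():
            candidates[enterprise][keyword] = weight
    return dict(candidates)

K = {"product A", "founder B", "product C", "unrelated word"}
C = candidate_set(K, EDGES)
```

Keywords with no node in the graph contribute nothing, so the candidate set contains only enterprises with at least one matching keyword; the retained weights are the entries that would populate the correlation coefficient matrix R.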
Step 3: calculate the relevance between the public-opinion text and the candidate target subjects.
Combining the correlation coefficients (the numbers on the connecting lines) in the knowledge graph, the correlation coefficient matrix R of the subject set C and the keyword set K is obtained:
Table 2
The word frequency matrix F is as follows:
After cleaning, the segmented words of the text number 148, so scale = 148, and β = 100 is taken.
The correlation matrix $R_{KC}$ of the text and the subject set C is obtained as follows:
Therefore, the relevance of the sample article to "Leye Information Technology (Beijing) Co., Ltd." is 0.526, and its relevance to "Shenzhen Tengchen Computer Systems Limited" is 0.122. (The coefficients in the above example are all assumed values for illustration.)
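The threshold screening mentioned earlier can be illustrated with the two relevance values from this example. The threshold of 0.3, the function name, and the shortened subject labels are hypothetical; only the two scores come from the example.

```python
def screen(relevance, threshold):
    """Keep subjects whose relevance meets the threshold, highest first."""
    kept = [(name, score) for name, score in relevance.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Relevance values from the worked example; labels are abbreviated.
example = {"Leye": 0.526, "Tengchen": 0.122}

selected = screen(example, threshold=0.3)    # hypothetical threshold
ranked_all = screen(example, threshold=0.0)  # no filtering, ranking only
```

The same function applies to sorting different texts for one subject if the dictionary keys are text identifiers instead of subject names.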
Referring to FIG. 6, the invention also discloses an apparatus for calculating the relevance of a text to an enterprise subject by using a knowledge graph, which includes:
a text acquisition module, configured to acquire a text;
a word segmentation module, configured to perform word segmentation on the text, extract the keyword set appearing in the text, and retrieve the enterprise subjects associated with the keywords through a pre-established knowledge graph, so as to take those enterprise subjects as a candidate enterprise set, wherein the knowledge graph comprises a plurality of pieces of node information and, for each piece of node information, its relations and association weights with the corresponding node information, and wherein some of the node information is enterprise subject information and the remaining node information is product information or natural-person information corresponding to the respective enterprise subject;
and a relevance calculation module, configured to calculate the degree of association between the text and each candidate enterprise subject according to the word frequency and relation weight of the keywords associated with the candidate enterprise subjects in the candidate enterprise set.
Optionally, the apparatus further comprises a paragraph segmentation preprocessing module, configured to perform paragraph segmentation preprocessing on the text and to assign corresponding weights to the text paragraphs;
the relevance calculation module is further configured to calculate the degree of association between the text and each candidate enterprise subject according to the word frequency, paragraph position, relation weight, and text length of the keywords associated with the candidate enterprise subjects in the candidate enterprise set.
Optionally, the paragraph segmentation preprocessing module performs paragraph segmentation preprocessing according to the following formula:
where $\lceil x \rceil$ denotes the smallest integer not less than x; P is the number of natural paragraphs of the text, $P \ge 1$; H is the number of parts into which the text is split, the parts being denoted $part_1, \dots, part_H$ and the title being denoted $part_0$, $H \ge 1$; the number of paragraphs in each part is recorded as $L = (l_0, l_1, \dots, l_H)$; and two further parameters represent the maximum proportion of the first part and of the H-th part, respectively, relative to the total number of paragraphs P.
Optionally, the word segmentation module is further configured to perform word segmentation on the segmented text obtained by paragraph segmentation to obtain all keywords, which form a keyword set denoted K; to search the knowledge graph for the keywords in K; and to obtain the enterprise subjects associated with K, which are taken as the candidate enterprise set, denoted C.
In the embodiment of the present invention, the functional description of each module of the apparatus for calculating the relevancy between a text and an enterprise body by using a knowledge graph may refer to the description of the method above, and is not repeated here.
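The division of work among the modules can be pictured with a small Python class. This is a toy sketch under stated assumptions, not the apparatus itself: the class and method names are invented, the graph is stored as a hypothetical keyword-to-enterprise dictionary, and whitespace tokenization stands in for a real word segmenter. The scoring uses only word frequency and relation weight, as in the basic relevance calculation module.

```python
from collections import Counter, defaultdict

class RelevanceDevice:
    """Toy counterpart of the three modules; names and layout are assumptions."""

    def __init__(self, graph):
        # graph: keyword -> {enterprise subject: association weight}
        self.graph = graph

    def acquire_text(self, source):
        # Text acquisition module: here the source is already a string.
        return str(source)

    def segment(self, text):
        # Word segmentation module. ASSUMPTION: whitespace splitting replaces
        # a real segmenter; only words that are nodes in the graph are kept.
        words = (w.strip(".,;:!?").lower() for w in text.split())
        return Counter(w for w in words if w in self.graph)

    def relevance(self, keyword_freqs):
        # Relevance calculation module: word frequency times relation weight,
        # summed per candidate enterprise subject.
        scores = defaultdict(float)
        for keyword, freq in keyword_freqs.items():
            for enterprise, weight in self.graph[keyword].items():
                scores[enterprise] += freq * weight
        return dict(scores)

    def run(self, source):
        return self.relevance(self.segment(self.acquire_text(source)))

# Hypothetical graph and text.
graph = {"phone": {"Company X": 0.9}, "video": {"Company X": 0.5, "Company Y": 0.8}}
device = RelevanceDevice(graph)
result = device.run("Phone sales rose while video and video streaming grew.")
```

Adding the optional paragraph segmentation preprocessing module would amount to segmenting each part separately and weighting its counts by paragraph position before the summation.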
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (5)
1. A method for calculating the relevancy of a text and an enterprise main body by using a knowledge graph comprises the following steps:
acquiring a text;
performing word segmentation on a text, extracting a keyword set appearing in the text, and retrieving the enterprise subjects associated with the keywords through a pre-established knowledge graph so as to take the enterprise subjects associated with the keywords as a candidate enterprise set, wherein the knowledge graph comprises target node information, associated node information, and the relation and association weight between the target node information and the associated node information, the target node information comprises first enterprise subject information, and the associated node information comprises second subject information, product information, or natural-person information associated with the first enterprise subject information;
calculating the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set;
the method is characterized in that in the steps of performing word segmentation processing on a text, extracting a keyword set appearing in the text, and searching an enterprise subject related to the keywords through a pre-established knowledge graph to take the enterprise subject related to the keywords as a candidate enterprise set, the method comprises the following steps:
performing word segmentation processing on a text to obtain all keywords to form a keyword set, wherein the keyword set is marked as K, searching the keywords in the keyword set K in the knowledge graph, and acquiring an enterprise subject associated with the keyword set K to use the enterprise subject associated with the keywords as a candidate enterprise set, and the candidate enterprise set is marked as C.
2. The method of calculating the relevance of text to an enterprise subject using a knowledge-graph of claim 1, wherein in the step of calculating the relevance of text to the candidate enterprise subject, further comprising:
and calculating the association degree of the text and the candidate enterprise subject according to the word frequency and the relation weight of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
3. The method for calculating the relevance of text to an enterprise subject using a knowledge graph of claim 2, wherein prior to the step of performing word segmentation on the text, the method further comprises:
performing paragraph division preprocessing on the text, and assigning a corresponding weight to each paragraph position;
and in the step of calculating the association degree of the text with the candidate enterprise subject, the method further comprises:
calculating the association degree of the text and the candidate enterprise subject according to the word frequency, the paragraph position, the relation weight, and the text length of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
4. An apparatus for calculating relevance of text to an enterprise subject using a knowledge graph, comprising:
a text acquisition module, configured to acquire a text;
a word segmentation module, configured to perform word segmentation on the text, extract the keyword set appearing in the text, and retrieve the enterprise subjects associated with the keywords through a pre-established knowledge graph, so as to take those enterprise subjects as a candidate enterprise set, wherein the knowledge graph comprises a plurality of pieces of node information and, for each piece of node information, its relations and association weights with the corresponding node information, and wherein some of the node information is enterprise subject information and the remaining node information is product information or natural-person information corresponding to the respective enterprise subject;
and a relevance calculation module, configured to calculate the association degree of the text and the candidate enterprise subject according to the word frequency of the keywords associated with the candidate enterprise subject in the candidate enterprise set.
5. The apparatus for calculating relevancy of text to business subjects using knowledge-graph as claimed in claim 4, wherein said relevancy calculation module is further configured to calculate relevancy of text to said candidate business subject according to word frequency and relationship weight of occurrence of keywords associated with candidate business subjects in said candidate business set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810567101.5A CN109033132B (en) | 2018-06-05 | 2018-06-05 | Method and device for calculating text and subject correlation by using knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033132A (en) | 2018-12-18 |
CN109033132B (en) | 2020-12-11 |
Family
ID=64611958
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810567101.5A Active CN109033132B (en) | 2018-06-05 | 2018-06-05 | Method and device for calculating text and subject correlation by using knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033132B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||