CN110232185A - Knowledge-graph-based semantic similarity calculation method for financial-industry software testing - Google Patents
- Publication number
- CN110232185A CN110232185A CN201910010902.6A CN201910010902A CN110232185A CN 110232185 A CN110232185 A CN 110232185A CN 201910010902 A CN201910010902 A CN 201910010902A CN 110232185 A CN110232185 A CN 110232185A
- Authority
- CN
- China
- Prior art keywords
- concept
- concepts
- text
- word
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The present invention provides a knowledge-graph-based semantic similarity calculation method for financial-industry software testing, comprising the steps of: S1, performing word segmentation on the financial text; S2, selecting the segmentation combination most relevant to the text topic; S3, computing the semantic similarity of the segmentation combination using a knowledge graph with the concept-IC-weighted minimum path length. This knowledge-graph-based natural-semantics detection algorithm first segments the financial text with several segmentation algorithms to obtain candidate segmentation combinations, then computes the concept distance between each word and the text keywords to measure the relevance between a segmentation combination and the text topic, and finally selects the combination with the smallest sum of concept distances for semantic similarity detection. The shortest path length between concepts is weighted by the information content (IC) of concepts in the knowledge graph, which yields better accuracy than other methods.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a knowledge graph semantic similarity calculation method for financial industry software testing.
Background
Semantic similarity detection between natural-language texts is widely applied in fields such as information retrieval, machine translation, and automatic question answering. Sentences are composed of individual words, including subjects, predicates, various stop words, and the like, and the same words may carry completely different meanings in different combinations and contexts. Many statistics-based calculation methods have been proposed in recent years, but such methods ignore the semantic and structural information of text, so their results sometimes do not match human understanding of natural language. Moreover, corpus-based IC may carry ambiguous concept meanings, because it computes IC from word occurrences in a corpus, where a single word may map to several concepts at once. Today a large number of public knowledge graphs (KGs) are available, covering a vast set of concepts, entities, and the relationships between them. Using a knowledge graph for natural-language similarity detection makes it possible to accurately capture the semantic information behind the text under test and to return more precise, structured information. However, most current knowledge-graph work targets semantic similarity detection of English texts, with comparatively little application to Chinese. Chinese financial text differs from English text: words are not separated by spaces, and Chinese grammatical structure is comparatively more complex. Given these characteristics, Chinese sentences must be segmented into words before semantic similarity detection with a knowledge graph.
Disclosure of Invention
Previous semantic similarity research has measured the semantic similarity of text mainly by computing the depth of and path length between concepts, or the information content (IC) shared between concepts. The present invention provides a knowledge-graph-based semantic similarity calculation method for financial-industry software testing in which the minimum path length is weighted by the concepts' information content (IC), and the graph-based IC is computed from the distribution of instances over concepts. The method comprises the following specific steps:
S1, performing word segmentation on the financial text;
S2, selecting the segmentation combination most relevant to the text topic;
S3, computing the semantic similarity of the financial text using a knowledge graph (KG) and the segmentation combination, with the concept-IC-weighted minimum path length.
Further, step S1 includes:
segmenting Chinese sentences through various word segmentation algorithms to obtain different word segmentation combinations;
Algorithm 1: Jieba
1) Using the dict.txt dictionary shipped with Jieba, generate all possible word formations in the sentence from a trie built over dict.txt, forming a directed acyclic graph (DAG).
2) Find the maximum-probability path by dynamic programming. Because the semantic weight of a Chinese sentence often lies in its latter half, the sentence is computed in reverse from right to left, using each word's frequency in the dictionary as its probability; the maximum-probability path then yields the maximum-probability segmentation combination.
For words that do not appear in the dictionary, an HMM (Hidden Markov Model) based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm segments the out-of-vocabulary words.
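A minimal sketch of this DAG-plus-dynamic-programming step. The dictionary and its frequencies are made up for illustration (Jieba's real dict.txt ships several hundred thousand entries), and out-of-vocabulary characters simply fall back to single-character words instead of the HMM/Viterbi pass:

```python
import math

# Toy dictionary with made-up frequencies, standing in for jieba's dict.txt.
FREQ = {"接口": 100, "新增": 60, "成功": 80, "新": 5, "增": 3, "接": 4, "口": 4}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, collect the end indices of dictionary words (the DAG)."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: fall back to one char
    return dag

def max_prob_segment(sentence):
    """Right-to-left dynamic programming over the DAG, using dictionary
    frequencies as word probabilities (log domain to avoid underflow)."""
    n = len(sentence)
    dag = build_dag(sentence)
    route = {n: (0.0, 0)}  # index -> (best log-probability from here, next index)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        words.append(sentence[i:route[i][1]])
        i = route[i][1]
    return words
```

On the embodiment's sentence 1, the two-character dictionary words win over the single characters, giving the segmentation "接口 | 新增 | 成功".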
Algorithm 2: Ansj Chinese word segmenter
Ansj is a Java implementation of Chinese word segmentation based on an n-Gram language model, CRF (Conditional Random Field), and HMM (Hidden Markov Model).
Algorithm 3: SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the Lucene text retrieval system. It computes word statistics with a Hidden Markov Model (HMM) over a large corpus, and the statistics are then used to compute the optimal word segmentation of the text.
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved algorithm based on DBpedia (a knowledge graph) is adopted to obtain the segmentation combination most relevant to the text topic, in order to improve segmentation accuracy. The relevance between a word and the text topic is measured by the concept distance between the word and the text keywords. A concept is a set of entities with the same characteristics; entities refer to distinct instances, and instances are concrete descriptions under a concept, such as 'Guangfa Bank counter' under the concept 'counter'.
Further, the step of selecting the word combination most relevant to the text theme comprises the following steps:
1) Keywords are first extracted from each segmentation combination using the TextRank algorithm, adapted from Google's PageRank algorithm. Within a segmentation combination, with the window size set to m, we obtain windows [w1, w2, …, wm], [w2, w3, …, wm+1], [w3, w4, …, wm+2], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes within the same window. The edges then act as mutual votes between words, and after repeated iteration each node's vote count stabilizes. The importance of a word is judged by comparing node vote counts, and the m words with the highest counts are taken as the keywords of the text. The importance formula of a node is as follows:
WS(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} WS(Vj) / |Out(Vj)|
where d is a damping factor representing the probability that a given node jumps to any other node, usually set to 0.85; In(Vi) is the set of nodes pointing to node Vi, and Out(Vj) is the set of nodes that node Vj points to.
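The voting loop above can be sketched as follows. The window handling and fixed iteration count are simplifications, and tie-breaking between equally scored words is left unspecified:

```python
from collections import defaultdict
from itertools import combinations

def textrank(words, window=3, d=0.85, iters=100):
    """Rank words by TextRank: build co-occurrence edges within a sliding
    window, then iterate the PageRank-style vote update until scores settle."""
    neighbors = defaultdict(set)
    for i in range(max(len(words) - window + 1, 1)):
        for a, b in combinations(words[i:i + window], 2):
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)
```

The word connected to the most distinct neighbors accumulates the most votes and ranks first.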
2) Selecting the word segmentation result that minimizes the concept distance:
By computing the concept distance between each word ω and the m keywords k1, k2, …, km of text N, the relevance of a segmentation combination to the text topic N is measured by the average concept distance. The segmentation combination with the smallest concept-distance sum Dis(ω, N) can be regarded as the combination most relevant to the text topic. The concept distance DBpediaDis(ω, ki) is computed against the DBpedia database, and semantic similarity detection is performed once the segmentation combination of the financial text is obtained.
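The selection step can be sketched with a stand-in distance function. `toy_dist` below is purely hypothetical (exact match scores 0, anything else a flat penalty), whereas the patent's DBpediaDis(ω, ki) is resolved against the DBpedia database:

```python
def pick_best_segmentation(combos, keywords, concept_dist):
    """Choose the segmentation combination whose words are closest, on
    average, to the text keywords under the given concept-distance function."""
    def avg_dist(words):
        return sum(concept_dist(w, k) for w in words for k in keywords) / (
            len(words) * len(keywords)
        )
    return min(combos, key=avg_dist)

def toy_dist(word, keyword):
    # Hypothetical stand-in for the DBpedia concept-distance lookup.
    return 0 if word == keyword else 3
```

A combination that keeps a keyword intact as one word beats one that splits it, since the intact word sits at distance 0 from the keyword.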
Further, in the Knowledge Graph (KG), the language information includes semantic distances between different concepts, that is, distances between nodes, and the closer the semantic distance of a concept is, the smaller the shortest path between concepts is, the higher the similarity between the two is.
Further, the concept IC measures the amount of information a concept carries: general concepts carry less information and thus have lower IC values, while more specific concepts have higher IC values; here the IC is computed from the knowledge graph. The more information two concepts share, the more similar they are. A limitation of similarity calculation based solely on concept IC is that if the LCS (nearest shared node) of two concept pairs is the same, their similarity comes out the same.
Further, step S3 includes:
(1) A KG is defined as a directed labeled graph G = (V, E, τ), where V represents all nodes in the knowledge graph, E represents the edges connecting the nodes, and τ is the function defining all triples in G. In the D-path algorithm, the semantic similarity between concepts is defined as:
sim_Dpath(ci, cj) = 1 / (1 + length(ci, cj) · k^IC(c_lcs)) (3)
where sim_Dpath(ci, cj) measures the semantic similarity between two different concepts ci, cj ∈ V; the parameter k ∈ (0, 1] represents the contribution of the two concepts' shared information to the similarity; IC(c_lcs) is the IC value of the nearest shared node c_lcs, the LCS being the most specific concept among the common ancestors of ci and cj, i.e. their nearest shared node; the shortest path length is thus weighted by the IC value of the two concepts' nearest shared node. Let Paths(ci, cj) = {P1, P2, … Pn} be the set of paths connecting ci and cj (there may be several paths, P1 through Pn, between two concepts); then for Pi ∈ Paths(ci, cj),
length(ci, cj) = min(|Pi|) (4)
gives the shortest path length between the two concepts.
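Once the shortest path length and the IC of the LCS are known, the D-path score of formulas (3)-(4) is a one-liner; plugging in the values from the worked example in the embodiment (k = 0.9) reproduces its reported scores:

```python
import math

def d_path_sim(length, ic_lcs, k=0.9):
    """D-path similarity, formula (3): the shortest path length between two
    concepts, discounted by k raised to the IC of their nearest shared node.
    Identical concepts (length 0) score exactly 1."""
    return 1.0 / (1.0 + length * k ** ic_lcs)
```

A higher IC of the LCS shrinks k^IC, so concept pairs sharing a very specific ancestor are penalized less for the same path length, which is exactly how 'success'/'failure' (path 7, IC 20.4752) outscores 'add'/'return' (path 4, IC 8.6219).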
(2) The semantic similarity between concepts is computed with a graph-based IC, i.e. one based on a specific KG, as shown in formula (5); the magnitude of the semantic similarity is known through the IC value. The IC of a concept in the KG is computed from the distribution of instances in the concept taxonomy, so semantic similarity computed with the graph-based IC does not depend on an external corpus.
The graph-based IC is defined in the KG as:
IC_graph(ci) = -log Prob(ci) (5)
where
Prob(ci) = frequency(ci) / N (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept ci in the knowledge graph is defined as:
frequency(ci) = count(ci) (7)
where count is a simple function that counts the entities belonging to a concept.
The IC in the equation can be either a conventional corpus-based IC or a graph-based IC. The graph-based IC is as effective as the corpus-based IC, and is a good complement to it when the corpus is insufficient or the IC must be computed online. Because a knowledge graph typically expresses instance types with disambiguated concepts, the graph-based IC captures a specific meaning of a concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, since IC is computed from word occurrences in a corpus in which a word may map to several concepts at once.
In the Knowledge Graph (KG), more intuitive language information comprises semantic distances among different concepts, and the closer the semantic distances of the concepts, the smaller the shortest path among the concepts, the higher the similarity between the two concepts. But this method does not solve the similarity problem between concepts having the same depth and length. The similarity algorithm based on path length and depth has a disadvantage in that if any two concepts have the same path length and depth, their semantic similarity is the same. As shown in fig. 1, sim (concept 7, concept 8) and sim (concept 10, concept 11) are the same because the shortest path length and depth of the above concept pair are the same. IC is a statistical method used to measure the amount of conceptual information. The general concepts have a lower information content and therefore a lower IC value, while the more specific concepts have a higher IC value. The more information two concepts share, the more similar they are. But using only information of concepts ignores valuable distance information between concepts in the concept taxonomy. And the semantic distance between the concepts is an effective measurement method for describing the similarity between the concepts. A disadvantage of similarity algorithms based on concept IC is that if the LCS (nearest shared node) of two concepts are the same, their semantic similarity is the same. Like sim (concept 4, concept 5) and sim (concept 7, concept 9) are the same, since the LCS of the above concept pair is concept 2. Therefore, by combining the two methods, the IC of the two concepts LCS is used for weighting the shortest path length between the concepts, and the method can better describe the semantic similarity between the concepts.
The proposed D-path algorithm assigns different weights to the shortest path length according to the shared information of the two concepts: when the two concepts are identical, the path length is 0 and the similarity score reaches its maximum of 1. Although the algorithm contains no explicit depth information, the IC of the LCS behaves much like concept depth: a concept deeper in the concept taxonomy carries more specific information, making the two concepts beneath it more similar.
In summary, the D-path algorithm solves the problem of identical similarity scores for concept pairs with the same depth and path length: as shown in step three, it weights the minimum path length by introducing the concepts' shared information, and since the shared information can differ even when the minimum paths are the same, the scores differ as well. By weighting the minimum path length between concept pairs with their shared information, valuable distance information in the concept taxonomy is retained, while the statistical information captures how consistently the concepts sit within the taxonomic structure.
Compared with the prior art, the invention has the following advantages and effects:
The knowledge-graph-based natural-semantics detection algorithm first segments the financial text with several word segmentation algorithms to obtain candidate segmentation combinations, then computes the concept distances between words and text keywords to measure the similarity between each segmentation combination and the text topic, and finally selects the segmentation combination with the smallest sum of concept distances for semantic similarity detection. The knowledge graph weights the shortest path length between concepts with the concepts' information content (IC), achieving better accuracy than other methods. Performing natural-language similarity detection with a knowledge graph makes it possible to accurately capture the semantic information behind the text under test and to return more precise, structured information.
The IC value is computed from the graph, i.e. the knowledge graph, without using a corpus, which largely avoids the situation in which one word maps to several concepts at the same time.
Drawings
FIG. 1 is a conceptual classification diagram of an embodiment.
FIG. 2 is a diagram of an exemplary financial text classification.
Figure 3 is an algorithmic flow chart of an embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example (b):
as shown in FIG. 3, the method for calculating the semantic similarity based on knowledge graph for the financial industry software test comprises the following steps:
S1, performing word segmentation on the financial text;
S2, selecting the segmentation combination most relevant to the text topic;
S3, computing the semantic similarity of the segmentation combination using the knowledge graph and the concept-IC-weighted minimum path length.
Further, step S1 includes:
segmenting Chinese sentences through various word segmentation algorithms to obtain different word segmentation combinations;
Algorithm 1: Jieba
1) Using the dict.txt dictionary shipped with Jieba, generate all possible word formations in the sentence from a trie built over dict.txt, forming a directed acyclic graph (DAG);
2) find the maximum-probability path by dynamic programming: because the semantic weight of a Chinese sentence often lies in its latter half, the sentence is computed in reverse from right to left, using each word's frequency in the dictionary as its probability, finding the maximum-probability path and finally obtaining the maximum-probability segmentation combination;
for words that do not appear in the dictionary, an HMM model based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm segments the out-of-vocabulary words.
Algorithm 2: Ansj
This is a Java implementation of Chinese word segmentation based on CRF, n-Gram, and HMM.
Algorithm 3: SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the Lucene text retrieval system. It computes word statistics with a Hidden Markov Model (HMM) over a large corpus, and the statistics are then used to compute the optimal word segmentation of the text.
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved algorithm based on DBpedia is adopted to obtain the segmentation combination most relevant to the text topic, in order to improve segmentation accuracy. The relevance between a word and the text topic is measured by the concept distance between the word and the text keywords.
1) Keywords are first extracted from each segmentation combination using the TextRank algorithm, which is adapted from Google's PageRank algorithm. Within a segmentation combination, with the window size set to m, we obtain windows [w1, w2, …, wm], [w2, w3, …, wm+1], [w3, w4, …, wm+2], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes within the same window. The edges then act as mutual votes between words, and after repeated iteration the vote counts stabilize. The importance of a word is judged by comparing vote counts, yielding the keywords of the text:
WS(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} WS(Vj) / |Out(Vj)|
where d is a damping factor representing the probability that a given node jumps to any other node, usually set to 0.85; In(Vi) is the set of nodes pointing to node Vi, and Out(Vj) is the set of nodes that node Vj points to.
2) Selecting the word segmentation result that minimizes the concept distance:
By computing the concept distance between each word ω and the m keywords k1, k2, …, km of text N, the relevance of a segmentation combination to the text topic N is measured; the segmentation combination with the smallest concept-distance sum Dis(ω, N) can be regarded as the combination most relevant to the text topic. The concept distance DBpediaDis(ω, ki) is computed against the DBpedia database to obtain the segmentation combination of the financial text, after which semantic similarity detection proceeds as follows.
1) A KG is defined as a directed labeled graph G = (V, E, τ), where V represents all nodes in the knowledge graph, E represents the edges connecting the nodes, and τ is the function defining all triples in G. In the present algorithm, the semantic similarity between concepts is defined as:
sim_Dpath(ci, cj) = 1 / (1 + length(ci, cj) · k^IC(c_lcs)) (3)
where sim_Dpath(ci, cj) measures the semantic similarity between two concepts, ci, cj ∈ V, k ∈ (0, 1]; the parameter k represents the contribution of the two concepts' shared information to the similarity, and in this example k is 0.9. The LCS is the most specific concept among the common ancestors of the two concepts ci and cj, i.e. their nearest shared node; for instance, the LCS of the concept 'upload field' and the concept 'counter' in Fig. 2 is the concept 'interface'. Let Paths(ci, cj) = {P1, P2, … Pn} be the set of paths connecting ci and cj (there may be several paths, P1 through Pn, between two concepts); then for Pi ∈ Paths(ci, cj),
length(ci, cj) = min(|Pi|) (4)
gives the shortest path length between the two concepts.
2) The semantic similarity between concepts is computed with a graph-based IC, i.e. one based on a specific KG. The IC of a concept in the KG is computed from the distribution of instances in the concept taxonomy, such as the entity 'Guangfa Bank counter' below the concept 'counter' in Fig. 2; this approach does not depend on an external corpus.
The graph-based IC is defined in the KG as:
IC_graph(ci) = -log Prob(ci) (5)
where
Prob(ci) = frequency(ci) / N (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept ci in the knowledge graph is defined as:
frequency(ci) = count(ci) (7)
where count is a simple function that counts the entities belonging to a concept. The count function can be implemented with the SPARQL query language:
SELECT (COUNT(?ie) AS ?total) WHERE
{
?ie rdf:type owl:Thing .
}
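With the entity count in hand, the graph-based IC of formulas (5)-(7) reduces to a negative log-probability. A sketch follows; the natural logarithm is assumed, since the text does not fix the base:

```python
import math

def graph_ic(concept_count, total_entities):
    """Graph-based IC per formulas (5)-(7): IC(c) = -log(count(c) / N),
    where count(c) is the number of instances of concept c and N is the
    total number of entities in the knowledge graph."""
    return -math.log(concept_count / total_entities)
```

A concept covering every entity carries no information (IC 0), while rarer, more specific concepts receive higher IC values, matching the discussion above.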
The IC in the equation can be either a conventional corpus-based IC or a graph-based IC. The graph-based IC is as effective as the corpus-based IC, and is a good complement to it when the corpus is insufficient or the IC must be computed online. Because a knowledge graph typically expresses instance types with disambiguated concepts, the graph-based IC captures a specific meaning of a concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, since IC is computed from word occurrences in a corpus in which a word may map to several concepts at once.
In the knowledge graph (KG), the more intuitive linguistic information includes the semantic distances between different concepts: the closer two concepts are semantically, the shorter the shortest path between them, and the higher their similarity. But this approach cannot distinguish concepts with the same depth and path length. The weakness of similarity algorithms based on path length and depth is that any two concept pairs with the same path length and depth receive the same semantic similarity. sim(ci, cj) measures the semantic similarity between two concepts; in the concept taxonomy of Fig. 2, for example, sim(upload field, return field) and sim(counter, VTM) are the same because the shortest path length and depth of these concept pairs are the same, where the depth of a concept ci ∈ V is defined as the shortest path length from ci to the root (topmost) concept. IC is a statistical measure of the amount of information a concept carries: general concepts have lower information content and thus lower IC values, while more specific concepts have higher IC values, and the more information two concepts share, the more similar they are. Using only concept information, however, ignores the valuable distance information between concepts in the taxonomy, and semantic distance is an effective measure of similarity between concepts. The weakness of similarity algorithms based on concept IC is that two concept pairs with the same LCS (nearest shared node) receive the same semantic similarity: as shown in Fig. 2, sim(upload field, counter) and sim(return field, VTM) are the same, since the LCS of both concept pairs is the same.
Therefore, the two methods are combined: the IC of the two concepts' LCS is used to weight the shortest path length between the concepts, which better characterizes the semantic similarity between them.
A financial-industry text similarity test is now given as an example.
Given sentence 1: 'interface add success', and sentence 2: 'interface return failure'.
Firstly, three word segmentation algorithms are used for word segmentation to obtain three word segmentation combinations, and the three word segmentation combinations in the example are the same, so that the result is used as the word segmentation combination of the text. The word segmentation results for sentence 1 are as follows: interface | add | is successful; the word segmentation result of sentence 2 is as follows: interface | Return | fails.
Secondly, the shortest path between the concept 'interface' and the concept 'interface' is 0; the shortest path between the concept 'add' and the concept 'return' is 4, and the IC of the LCS of these two concepts is 8.6219; the shortest path between the concept 'success' and the concept 'failure' is 7, and the IC of the LCS of these two concepts is 20.4752.
Finally, using formula (3), the similarity score between the concept 'interface' and the concept 'interface' is 1; the similarity score between the concept 'add' and the concept 'return' is 0.3828; and the similarity score between the concept 'success' and the concept 'failure' is 0.5526.
the closer the score is to 1, the higher the similarity of the two concepts.
The above embodiments are preferred embodiments of the present invention, but the invention is not limited to them; any changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principles of the invention are regarded as equivalents and fall within the scope of the invention.
Claims (7)
1. A knowledge-graph-based semantic similarity calculation method for financial-industry software testing, characterized by comprising the following steps:
S1, performing word segmentation on the financial text;
S2, selecting the segmentation combination most relevant to the text topic;
S3, computing the semantic similarity of the segmentation combination using the knowledge graph and the concept-IC-weighted minimum path length.
2. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S1 includes:
segmenting the Chinese sentence with three word segmentation algorithms to obtain different word segmentation combinations:
Algorithm I: the Jieba word segmenter
1) using the dict.txt dictionary shipped with Jieba, all possible word formations in the sentence are generated from a trie built over dict.txt, forming a directed acyclic graph (DAG);
2) the sentence is processed from right to left by dynamic programming, taking each word's frequency in the dictionary as its weight, so as to find the maximum-probability path and obtain the word segmentation combination with the maximum probability;
for words that do not appear in the dictionary, a hidden Markov model based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm is used to segment these out-of-vocabulary words;
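The DAG construction and right-to-left dynamic programming described above can be sketched in pure Python. This is a minimal illustration with a made-up toy dictionary, not Jieba's actual dict.txt or its implementation; out-of-vocabulary characters here simply fall back to single characters with a floor frequency, whereas Jieba would hand them to its HMM/Viterbi step.

```python
import math

# Toy dictionary: word -> frequency (a stand-in for Jieba's dict.txt;
# these words and counts are illustrative, not from the patent).
DICT = {"知识": 50, "图谱": 40, "知识图谱": 30, "语义": 45, "相似": 35,
        "度": 20, "相似度": 25, "计算": 60}
TOTAL = sum(DICT.values())

def build_dag(sentence):
    """For each start index, list every end index that forms a dictionary
    word (a single character is always allowed as a fallback)."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [i + 1]  # single-character fallback
        for j in range(i + 2, n + 1):
            if sentence[i:j] in DICT:
                ends.append(j)
        dag[i] = ends
    return dag

def max_prob_cut(sentence):
    """Right-to-left dynamic programming over the DAG: at every position,
    choose the word that maximizes the total log probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    # route[i] = (best log-prob of sentence[i:], end index of chosen word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(DICT.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i])
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(max_prob_cut("知识图谱语义相似度计算"))
# → ['知识图谱', '语义', '相似度', '计算']
```

Note how the DP prefers the long dictionary words "知识图谱" and "相似度" over their character-by-character decompositions, because one high-frequency word outweighs the product of several shorter-word probabilities.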
Algorithm II: the Ansj Chinese word segmenter
a Java implementation of Chinese word segmentation based on the n-gram language model, conditional random fields, and hidden Markov models;
Algorithm III: the SmartChineseAnalyzer Chinese word segmenter
word statistics are computed over large corpora with a hidden Markov model, and the statistical results are then used to determine the best segmentation combination for the text.
3. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S2 comprises: for the three word segmentation combinations obtained, if the three segmentation results are identical, no further processing is performed; if the segmentation results differ, an improved knowledge-graph-based algorithm is adopted to obtain the word segmentation combination most relevant to the text topic, wherein the relevance between a word and the text topic is measured by the concept distance between the word and the text keywords; a concept is a set of entities sharing the same characteristics, the entities are the different instances, and an instance is a concrete description of an entity under a concept.
4. The knowledge-graph-based semantic similarity calculation method according to claim 3, wherein selecting the word segmentation combination most relevant to the text topic comprises the following steps:
1) first, keywords are extracted from each word segmentation combination by the TextRank algorithm; the window size is set to m within the word segmentation combination, and a sentence consists, in order, of the words w1, w2, ..., yielding the windows [w1, w2, ..., wm], [w2, w3, ..., wm+1], [w3, w4, ..., wm+2], and so on; each word in the text is regarded as a node, and within the same window there is an edge between any two nodes; by the voting principle, the edges are regarded as mutual votes between the words, and after repeated iteration the vote count of each node tends to a stable value; the importance of a word is judged by comparing the vote counts of the nodes, and the m words with the most votes are taken as the keywords of the text; the importance formula of a node is:

S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

wherein d is a damping coefficient representing the probability that a given node points to any other node; In(Vi) is the set of nodes pointing to node Vi, and Out(Vj) is the set of nodes pointed to from node Vj; S(Vi) and S(Vj) represent the importance of each node; i and j denote different nodes, and the total number of nodes is m;
2) selecting the word segmentation result that minimizes the concept distance:
the relevance of a word segmentation combination to the text topic N is measured by the average concept distance between each word ω and the m keywords k1, k2, ..., km of the text N; the word segmentation combination with the minimum concept distance sum Dis(ω, N) is regarded as the combination most relevant to the text topic, and the concept distance DBpediaDis(ω, ki) is calculated over the DBpedia database.
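The voting iteration of step 1) can be sketched as follows. The token list, window size, and damping value below are illustrative assumptions; in practice the iteration would run over each candidate segmentation of the financial text.

```python
# A minimal TextRank sketch over a toy token list (the tokens are
# assumed example words, not data from the patent).
def textrank(tokens, window=3, d=0.85, iters=50):
    """Rank distinct words by iterating the voting formula
    S(Vi) = (1 - d) + d * sum_{Vj in In(Vi)} S(Vj) / |Out(Vj)|."""
    # Build an undirected co-occurrence graph: an edge links any two
    # distinct words that appear in the same sliding window.
    neighbors = {w: set() for w in tokens}
    for i in range(len(tokens) - window + 1):
        win = tokens[i:i + window]
        for a in win:
            for b in win:
                if a != b:
                    neighbors[a].add(b)
    # Iterate the voting formula until the scores stabilize.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {w: (1 - d) + d * sum(scores[v] / len(neighbors[v])
                                       for v in neighbors[w])
                  for w in neighbors}
    # Words sorted by final vote count, most important first.
    return sorted(scores, key=scores.get, reverse=True)
```

On a toy list such as `["语义", "相似", "计算", "语义", "图谱"]`, the words with the most co-occurrence edges ("语义" and "计算") end up with the highest scores, mirroring the claim's "most votes" criterion.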
5. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the linguistic information in the knowledge graph includes the semantic distances between different concepts, i.e., the distances between nodes; the closer two concepts are semantically, the shorter the shortest path between them, indicating a higher similarity between the two.
6. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the concept IC (Information Content) is used to measure the amount of information carried by a concept; the lower the IC value calculated from the knowledge graph, the lower the concept's information content.
7. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S3 includes:
1) the KG is defined as a directed labeled graph G = (V, E, τ), wherein G denotes the defined KG, i.e., the directed labeled graph; V denotes all nodes in the knowledge graph; E denotes the set of edges connecting the nodes; and τ is a function V × V → E that defines all the triples in G; in the wpath algorithm, the semantic similarity of concepts is defined as:

sim_wpath(ci, cj) = 1 / (1 + length(ci, cj) · k^IC(c_lcs))   (3)

wherein sim_wpath(ci, cj) measures the semantic similarity between two concepts; ci and cj denote two different concepts, ci, cj ∈ V; k ∈ (0, 1], and the parameter k represents the contribution of the common information of the two concepts to the similarity; IC(c_lcs) is the IC value of the nearest shared node, c_lcs being the nearest shared node of the two concepts, and the LCS is the most specific concept among the common ancestors of ci and cj; the shortest path length length(ci, cj) is thus weighted by the IC value of the two concepts' nearest shared node; let Paths(ci, cj) = {P1; P2; ...; Pn} be the set of paths connecting the concepts ci and cj, so that the two concepts have paths P1 to Pn; length(ci, cj) denotes the shortest among all these paths, and with Pi ∈ Paths(ci, cj),

length(ci, cj) = min(|Pi|)   (4)

the resulting length(ci, cj) represents the shortest path length between the two concepts;
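Formula (4) can be illustrated with a breadth-first search over a small concept graph. The undirected toy graph below, with hypothetical nodes c1..c5, is invented for illustration and is not the patent's knowledge graph.

```python
from collections import deque

# Toy undirected concept graph (adjacency lists; edges are made up).
# c1-c2, c2-c3, c2-c4, c3-c5, c4-c5: two paths connect c1 and c5,
# both of length 3, so min(|Pi|) = 3.
EDGES = {
    "c1": ["c2"],
    "c2": ["c1", "c3", "c4"],
    "c3": ["c2", "c5"],
    "c4": ["c2", "c5"],
    "c5": ["c3", "c4"],
}

def length(ci, cj):
    """Formula (4): number of edges on the shortest path connecting the
    two concept nodes, found by breadth-first search."""
    seen, queue = {ci}, deque([(ci, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == cj:
            return dist
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the concepts are not connected
```

BFS explores nodes in order of increasing distance, so the first time cj is reached its distance is guaranteed to be the minimum over all paths in Paths(ci, cj).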
2) calculating the semantic similarity between concepts by using the graph-based IC, i.e., based on the KG: the concept IC is calculated in the KG according to the distribution of instances under the concept taxonomy, and the graph-based IC is defined in the KG as:

IC_graph(ci) = -log Prob(ci)   (5)

wherein

Prob(ci) = frequency(ci) / N   (6)

N represents the total number of all entities in the knowledge graph, and the frequency of concept ci in the knowledge graph is defined as:

frequency(ci) = count(ci)   (7)

where count is a simple function that computes the cardinality of the set of entities belonging to the concept.
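The graph-based IC of formulas (5)-(7) and the wpath similarity of formula (3) can be sketched together on a toy is-a taxonomy. The concepts, parent links, and per-concept entity counts below are invented stand-ins for a real knowledge graph and do not reproduce the patent's figures; the path length is measured through the LCS, which is valid for a tree-shaped taxonomy.

```python
import math

# Toy is-a taxonomy (child -> parent) and per-concept entity counts,
# both made up for illustration.
PARENT = {"success": "result", "failure": "result",
          "result": "event", "event": None}
COUNT = {"success": 40, "failure": 30, "result": 20, "event": 10}

def ancestors(c):
    """c followed by all of its ancestors, up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def lcs(c1, c2):
    """Least common subsumer: the most specific shared ancestor."""
    shared = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in shared)

def ic(c):
    """Graph-based IC, formulas (5)-(7): -log of the share of all N
    entities that fall under concept c or any of its descendants."""
    n = sum(COUNT.values())
    freq = sum(COUNT[x] for x in COUNT if c in ancestors(x))
    return -math.log(freq / n)

def wpath(c1, c2, k=0.5):
    """Formula (3): sim = 1 / (1 + length(c1, c2) * k ** IC(lcs)),
    with the shortest path routed through the LCS (tree taxonomy)."""
    a = lcs(c1, c2)
    length = ancestors(c1).index(a) + ancestors(c2).index(a)
    return 1 / (1 + length * k ** ic(a))
```

As in the embodiment, identical concepts score exactly 1 (their path length is 0), and sibling concepts such as "success" and "failure" score between 0 and 1, higher when their LCS is more informative (higher IC shrinks the k^IC weight on the path length).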
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910010902.6A CN110232185B (en) | 2019-01-07 | 2019-01-07 | Knowledge graph semantic similarity-based computing method for financial industry software testing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110232185A true CN110232185A (en) | 2019-09-13 |
CN110232185B CN110232185B (en) | 2023-09-19 |
Family
ID=67860089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910010902.6A Active CN110232185B (en) | 2019-01-07 | 2019-01-07 | Knowledge graph semantic similarity-based computing method for financial industry software testing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110232185B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100223051A1 (en) * | 2003-10-27 | 2010-09-02 | Educational Testing Service | Method and System for Determining Text Coherence |
US20150227626A1 (en) * | 2014-02-12 | 2015-08-13 | Regents Of The University Of Minnesota | Measuring semantic incongruity within text data |
KR101623860B1 (en) * | 2015-04-08 | 2016-05-24 | 서울시립대학교 산학협력단 | Method for calculating similarity between document elements |
CN106610951A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Improved text similarity solving algorithm based on semantic analysis |
CN106844350A (en) * | 2017-02-15 | 2017-06-13 | 广州索答信息科技有限公司 | A kind of computational methods of short text semantic similarity |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | A kind of domain entities disambiguation method for merging term vector and topic model |
CN108090077A (en) * | 2016-11-23 | 2018-05-29 | 中国科学院沈阳计算技术研究所有限公司 | A kind of comprehensive similarity computational methods based on natural language searching |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110618987A (en) * | 2019-09-18 | 2019-12-27 | 宁夏大学 | Treatment pathway key node information processing method based on lung cancer medical big data |
CN110941612A (en) * | 2019-11-19 | 2020-03-31 | 上海交通大学 | Autonomous data lake construction system and method based on associated data |
CN110941612B (en) * | 2019-11-19 | 2020-08-11 | 上海交通大学 | Autonomous data lake construction system and method based on associated data |
CN111125339A (en) * | 2019-11-26 | 2020-05-08 | 华南师范大学 | Test question recommendation method based on formal concept analysis and knowledge graph |
CN111125339B (en) * | 2019-11-26 | 2023-05-09 | 华南师范大学 | Test question recommendation method based on formal concept analysis and knowledge graph |
CN112328810A (en) * | 2020-11-11 | 2021-02-05 | 河海大学 | Knowledge graph fusion method based on self-adaptive mixed ontology mapping |
CN112328810B (en) * | 2020-11-11 | 2022-10-14 | 河海大学 | Knowledge graph fusion method based on self-adaptive mixed ontology mapping |
CN114168751A (en) * | 2021-12-06 | 2022-03-11 | 厦门大学 | Medical knowledge concept graph-based medical text label identification method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110232185B (en) | 2023-09-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pontes et al. | Predicting the semantic textual similarity with siamese CNN and LSTM | |
CN110232185B (en) | Knowledge graph semantic similarity-based computing method for financial industry software testing | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
JP5391633B2 (en) | Term recommendation to define the ontology space | |
CN106372061B (en) | Short text similarity calculation method based on semantics | |
JP3882048B2 (en) | Question answering system and question answering processing method | |
JP5391634B2 (en) | Selecting tags for a document through paragraph analysis | |
Varma et al. | IIIT Hyderabad at TAC 2009. | |
CN109783806B (en) | Text matching method utilizing semantic parsing structure | |
JP5710581B2 (en) | Question answering apparatus, method, and program | |
Zhou et al. | New model of semantic similarity measuring in wordnet | |
JP2009093651A (en) | Modeling topics using statistical distribution | |
JP2009093654A (en) | Determinion of document specificity | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN109408802A (en) | A kind of method, system and storage medium promoting sentence vector semanteme | |
CN104765779A (en) | Patent document inquiry extension method based on YAGO2s | |
Hussein | Visualizing document similarity using n-grams and latent semantic analysis | |
Sebti et al. | A new word sense similarity measure in WordNet | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
Danilova | Cross-language plagiarism detection methods | |
Vij et al. | Fuzzy logic for inculcating significance of semantic relations in word sense disambiguation using a WordNet graph | |
Bella et al. | Domain-based sense disambiguation in multilingual structured data | |
Godoy et al. | Leveraging semantic similarity for folksonomy-based recommendation | |
Zhang et al. | An approach for named entity disambiguation with knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||