CN110232185B - Knowledge graph semantic similarity-based computing method for financial industry software testing - Google Patents

Knowledge graph semantic similarity-based computing method for financial industry software testing

Info

Publication number
CN110232185B
CN110232185B
Authority
CN
China
Prior art keywords
word segmentation
concept
text
concepts
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910010902.6A
Other languages
Chinese (zh)
Other versions
CN110232185A (en)
Inventor
杜广龙
陈震星
李方
周文沛
孙慧
姚庚成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Chinasoft Huateng Software System Co ltd
South China University of Technology SCUT
Original Assignee
Shanghai Chinasoft Huateng Software System Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Chinasoft Huateng Software System Co ltd, South China University of Technology SCUT
Priority to CN201910010902.6A
Publication of CN110232185A
Application granted
Publication of CN110232185B
Active legal status: Current
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph semantic similarity-based computing method for financial industry software testing, which comprises the following steps: S1, performing word segmentation on a financial text; S2, selecting the word segmentation combination most relevant to the topic of the text; S3, calculating the semantic similarity of the segmented word groups by utilizing the knowledge graph, with the minimum path length weighted by the concept IC. The knowledge-graph-based natural semantic detection algorithm first segments the financial text with several word segmentation algorithms to obtain candidate word segmentation combinations, then measures how well each combination matches the text topic by computing the concept distances between its words and the text keywords, and finally selects the combination with the smallest sum of concept distances for semantic similarity detection. The information content (IC) of concepts in the knowledge graph is used to weight the shortest path length between concepts, and the method achieves better accuracy than other approaches.

Description

Knowledge graph semantic similarity-based computing method for financial industry software testing
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a knowledge graph semantic similarity-based computing method for financial industry software testing.
Background
Semantic similarity detection between natural-language texts is widely used in many fields such as information retrieval, machine translation, and automatic question answering. A sentence is composed of individual words, including subjects, predicates, various stop words, and so on; even the same words, in different combinations and different contexts, can have entirely different meanings. In recent years many statistics-based calculation methods have been proposed, but they ignore the semantic and structural information of the text, so the results sometimes do not match human understanding of natural language; moreover, the current corpus-based IC may carry ambiguous concept meanings, because the IC is computed from word occurrences in the corpus and a word may map to several concepts at the same time. Nowadays many public Knowledge Graphs (KG) are available, containing large numbers of concepts, entities and the relationships between them. Using a knowledge graph to detect natural-language similarity makes it possible to capture the semantic information behind the text accurately and to return more precise, structured information. However, most knowledge-graph applications for semantic similarity detection target English text, and applications to Chinese are rare. Chinese financial text differs from English text: words are not separated by spaces, and the grammatical structure of Chinese is relatively more complex. Given these features, the Chinese sentences must be segmented into words before semantic similarity is detected with the knowledge graph.
Disclosure of Invention
Existing semantic similarity research mainly measures the similarity of texts either by the depth of and path length between concepts or by the Information Content (IC) of concepts. Addressing this, the invention provides a knowledge-graph-based semantic similarity calculation method for financial industry software testing, which weights the minimum path length with the Information Content (IC) of concepts and proposes a graph-based IC computed from the distribution of instances over concepts. The specific steps are as follows:
S1, word segmentation operation is carried out on a financial text;
S2, selecting the word segmentation combination most relevant to the topic of the text;
S3, calculating the semantic similarity of the financial text by utilizing a Knowledge Graph (KG), with the minimum path length weighted by the concept IC.
Further, step S1 includes:
dividing Chinese sentences through a plurality of word segmentation algorithms to obtain different word segmentation combinations;
algorithm I, jieba (resultant word)
1) All possible word formations in the sentence are generated from the built-in dictionary of the Jieba segmenter, using the trie tree generated from that dictionary, and assembled into a directed acyclic graph (Directed Acyclic Graph, DAG).
2) The maximum-probability path is then found by dynamic programming. Because the centre of gravity of a Chinese sentence usually lies in its second half, the sentence is traversed in reverse, from right to left, using the frequency of each word in the dictionary as its word frequency; the maximum-probability path is found, and the word segmentation combination with the highest probability is finally obtained.
For words that do not appear in the dictionary, an HMM (Hidden Markov Model) based on the word-forming ability of Chinese characters is adopted, and the Viterbi algorithm is used to segment these out-of-vocabulary words.
Algorithm 2: Ansj Chinese word segmenter
Algorithm 2 is a Java implementation of Chinese word segmentation based on an n-gram language model, CRF (Conditional Random Field) and HMM (Hidden Markov Model).
Algorithm 3: SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the text retrieval system Lucene; it computes word statistics over a large corpus with a Hidden Markov Model (HMM) and then uses those statistics to determine the best word segmentation combination of the text.
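As an illustration of step S1, the following minimal Python sketch collects candidate word segmentation combinations. Only the jieba library is shown concretely; Ansj and SmartChineseAnalyzer are Java tools, so they appear here only as optional caller-supplied segmenter functions, and the example sentence is illustrative rather than taken from the patent.

# Minimal sketch of step S1: collect candidate word segmentation combinations.
# Only jieba is called directly; extra_segmenters is a placeholder for wrappers
# around the Java-based Ansj and SmartChineseAnalyzer segmenters.
import jieba

def segment_with_jieba(sentence):
    # Precise mode; HMM=True enables the Viterbi-based handling of OOV words.
    return list(jieba.cut(sentence, HMM=True))

def candidate_segmentations(sentence, extra_segmenters=()):
    combos = [segment_with_jieba(sentence)]
    for segmenter in extra_segmenters:
        combos.append(list(segmenter(sentence)))
    return combos

if __name__ == "__main__":
    for combo in candidate_segmentations("接口新增成功"):   # illustrative financial test sentence
        print(" | ".join(combo))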
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved DBpedia-based (knowledge graph) algorithm is adopted to obtain the word segmentation combination most relevant to the topic of the text, so as to improve the accuracy of the segmentation result. The relevance between a word and the text topic is measured by finding the concept distance between the word and the text keywords. A concept is a set of entities of the same nature, an entity refers to a particular instance, and an instance carries a specific description within the concept, such as "broad counter" under the concept "counter".
Further, selecting the word segmentation combination most relevant to the text topic comprises the following steps:
1) Keywords are first extracted from each word segmentation combination using the TextRank algorithm, which is adapted from Google's web page ranking algorithm PageRank. Within a word segmentation combination, the window size is set to m, giving windows [w_1, w_2, …, w_m], [w_2, w_3, …, w_{m+1}], [w_3, w_4, …, w_{m+2}], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes that occur in the same window. The edges then act as mutual votes between words, and after repeated iteration the number of votes received by each node stabilizes. The importance of a word is judged by comparing the vote counts of the nodes, and the m words with the most votes are taken as the keywords of the text. The importance formula of a node is as follows:
S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|   (1)
where d is a damping coefficient representing the probability that a node points to any other node, usually taken as 0.85; In(V_i) is the set of nodes pointing to node V_i, and Out(V_j) is the set of nodes pointed to from node V_j.
2) Selecting the word segmentation result that minimizes the concept distance:
The topic relevance of a word segmentation combination to text N is measured by the average concept distance between each word ω of the combination and the m keywords {k_1, k_2, …, k_m} of text N:
Dis(ω, N) = (1/m) × Σ_{i=1}^{m} DBpediaDis(ω, k_i)   (2)
The word segmentation combination with the smallest sum of concept distances Dis(ω, N) is regarded as the combination most relevant to the text topic; the concept distance DBpediaDis(ω, k_i) is computed in the DBpedia database. This yields the word segmentation combination of the financial text, on which semantic similarity detection is then performed.
Further, in the Knowledge Graph (KG), the linguistic information includes the semantic distances between different concepts, that is, the distances between nodes: the closer the semantic distance between two concepts, the shorter the shortest path between them, indicating a higher similarity between the two.
Further, the concept IC measures the amount of information carried by a concept: a low IC value means a low information content, and here the IC value is calculated from the knowledge graph. The more specific a concept, the higher its IC value, and the more information two concepts share, the more similar they are. In a similarity algorithm based only on the concept IC, however, two concept pairs receive the same similarity whenever their LCS (nearest shared node) is the same.
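Before turning to step S3, the selection procedure of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: concept_distance stands in for the DBpedia-based distance DBpediaDis, and the function names and parameter defaults are assumptions.

from collections import defaultdict
from itertools import combinations

def textrank_keywords(words, window=5, d=0.85, iterations=50, top_m=5):
    # Build an undirected co-occurrence graph: words co-occurring in a sliding
    # window of size `window` vote for each other through shared edges.
    window = min(window, len(words))
    neighbors = defaultdict(set)
    for start in range(len(words) - window + 1):
        for a, b in combinations(words[start:start + window], 2):
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):   # iterate formula (1) until the scores stabilize
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
                 for w in neighbors}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_m]]

def best_segmentation(combos, concept_distance):
    # Pick the combination whose words have the smallest summed average
    # concept distance to that combination's own keywords.
    def total_distance(words):
        keywords = textrank_keywords(words)
        if not keywords:
            return float("inf")
        return sum(sum(concept_distance(w, k) for k in keywords) / len(keywords)
                   for w in words)
    return min(combos, key=total_distance)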
Further, step S3 includes:
(1) KG is defined as a directed labeled graph G = (V, E, τ), where V is the set of all nodes in the knowledge graph, E is the set of edges connecting the nodes, and τ: V × V → E is a function defining all triples in G. In the D-path algorithm, the semantic similarity between concepts is defined as:
sim_wpath(c_i, c_j) = 1 / (1 + length(c_i, c_j) × k^IC(c_lcs))   (3)
where sim_wpath(c_i, c_j) measures the semantic similarity between the two concepts; c_i and c_j are two different concepts, c_i, c_j ∈ V, and k ∈ (0, 1], the parameter k representing the contribution of the concepts' common information to the similarity. IC(c_lcs) is the IC value of the nearest shared node; c_lcs is the nearest shared node of the two concepts, the LCS being the most specific concept among the common ancestors of c_i and c_j. The shortest path length length(c_i, c_j) is weighted with the IC value of the nearest shared node of the two concepts. Let Paths(c_i, c_j) = {P_1; P_2; … P_n} be the set of paths connecting concepts c_i and c_j; there may be several paths between two concepts, P_1 to P_n in total, and with P_i ∈ Paths(c_i, c_j),
length(c_i, c_j) = min(|P_i|)   (4)
so that length(c_i, c_j) is the shortest path length between the two concepts.
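A one-function Python sketch of formula (3) as reconstructed above; the function name is illustrative, and k defaults to 0.9 only because that value is used in the embodiment below.

def wpath_similarity(path_length, ic_lcs, k=0.9):
    # Formula (3): the shortest path length is weighted by k ** IC(c_lcs).
    # Identical concepts have path length 0 and therefore similarity 1.
    return 1.0 / (1.0 + path_length * k ** ic_lcs)

print(wpath_similarity(0, 0.0))   # 1.0 for identical concepts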
(2) A graph-based IC is used, i.e., the semantic similarity between concepts is computed over a specific KG. The IC of a concept in the KG is calculated from the distribution of instances in the concept taxonomy, so the graph-based IC does not depend on an external corpus.
The graph-based IC is defined in the KG as:
IC_graph(c_i) = -log Prob(c_i)   (5)
where
Prob(c_i) = frequency(c_i) / N   (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept c_i in the knowledge graph is defined as:
frequency(c_i) = count(c_i)   (7)
where count is a simple function that computes the cardinality of the concept's entity set.
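A minimal sketch of equations (5)-(7), assuming count(c_i) and N have already been retrieved from the knowledge graph (for example with the SPARQL query given later in the embodiment); the numbers in the usage line are made up for illustration.

import math

def graph_ic(concept_entity_count, total_entities):
    # Equations (5)-(7): IC_graph(c_i) = -log(frequency(c_i) / N),
    # with frequency(c_i) = count(c_i).
    return -math.log(concept_entity_count / total_entities)

print(graph_ic(concept_entity_count=120, total_entities=1_000_000))   # ≈ 9.03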
The IC in the equation may be either the conventional corpus-based IC or the graph-based IC. The graph-based IC is as effective as the corpus-based IC, and when the corpus is insufficient or the IC has to be computed online, the graph-based IC is a good complement to the conventional corpus-based approach. Since the ICs in a knowledge graph express the type of an instance through disambiguated concepts, the graph-based IC captures the specific meaning of each concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, because the IC is computed from word occurrences in the corpus and a word may map to several concepts at the same time.
In the Knowledge Graph (KG), the most intuitive linguistic information is the semantic distance between different concepts: the closer the semantic distance, the shorter the shortest path between the concepts, indicating a higher similarity between the two. However, this approach cannot distinguish concept pairs that have the same depth and path length; the weakness of a similarity algorithm based on path length and depth is that any two concept pairs with the same path length and depth receive the same semantic similarity. As shown in Fig. 1, sim(concept 7, concept 8) and sim(concept 10, concept 11) are the same, because the shortest path lengths and depths of these concept pairs are the same. IC is a statistical measure of the amount of information carried by a concept: a general concept carries little information and thus has a low IC value, while a more specific concept has a higher IC value, and the more information two concepts share, the more similar they are. But using only the information content of concepts ignores the valuable distance information between concepts in the concept taxonomy, and the semantic distance between concepts is an effective measure of their similarity. The weakness of a similarity algorithm based only on the concept IC is that two concept pairs with the same LCS (nearest shared node) receive the same semantic similarity; for example, sim(concept 4, concept 5) and sim(concept 7, concept 9) are the same, because the LCS of both pairs is concept 2. Therefore, the two methods are combined: the IC of the two concepts' LCS is used to weight the shortest path length between the concepts, which describes the semantic similarity between concepts better.
The proposed D-path algorithm assigns different weights to the shortest path length of two concepts according to their shared information. When the two concepts are identical, the path length is 0 and the similarity score reaches its maximum of 1; as the path length grows the similarity score decreases, so the score range of the algorithm is (0, 1]. When two concept pairs have the same path length, the pair that shares more information is the more similar one. Although the algorithm does not use depth information explicitly, the IC of the LCS behaves much like concept depth: concepts deeper in the concept taxonomy carry more specific information and are therefore more similar to each other.
In summary, the D-path algorithm solves the problem that concept pairs with the same depth and path length receive the same similarity score: as described in step three, it weights the minimum path length between a concept pair with the concepts' shared information, so that pairs with the same minimum path but different shared information receive different scores. In this way the valuable distance information in the concept taxonomy is preserved, and statistical information is obtained to represent the consistency of the taxonomic structure between concepts.
Compared with the prior art, the invention has the following advantages and effects:
the natural semantic detection algorithm based on the knowledge graph firstly performs word segmentation on the financial text by utilizing a plurality of word segmentation algorithms to obtain word segmentation combinations, then calculates the concept distances of words and text keywords to measure the similarity between the word segmentation combinations and text topics, and finally selects the word segmentation combination with the minimum sum of the concept distances to perform semantic similarity detection. The information IC of the concepts is used in the knowledge graph to weight the shortest path length between the concepts, the accuracy is better than that of other methods, the knowledge graph is used for detecting the similarity of natural language, the semantic information behind the detected text can be accurately known, and more accurate and structured information is returned.
The IC value is computed from the graph, i.e., from the knowledge graph, without any corpus, so the situation in which one word is mapped to several concepts at the same time essentially does not arise.
Drawings
Fig. 1 is a conceptual classification diagram of an embodiment.
Fig. 2 is a schematic diagram of a financial text classification according to an embodiment.
Fig. 3 is the algorithm flow chart of the embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples, but embodiments of the present invention are not limited thereto.
Examples:
As shown in Fig. 3, the knowledge-graph-semantic-similarity-based computing method for financial industry software testing comprises the following steps:
S1, word segmentation operation is carried out on a financial text;
S2, selecting the word segmentation combination most relevant to the topic of the text;
S3, calculating the semantic similarity of the segmented word groups by utilizing the knowledge graph, with the minimum path length weighted by the concept IC.
Further, step S1 includes:
dividing Chinese sentences through a plurality of word segmentation algorithms to obtain different word segmentation combinations;
1. jieba
1) All possible word formations in the sentence are generated from Jieba's built-in dict.txt dictionary, using the trie tree generated from that dictionary, and assembled into a directed acyclic graph (DAG);
2) The maximum-probability path is found by dynamic programming; because the centre of gravity of a Chinese sentence usually lies in its second half, the sentence is traversed in reverse, from right to left, using the frequency of each word in the dictionary as its word frequency, the maximum-probability path is found, and the word segmentation combination with the highest probability is finally obtained;
For words that do not appear in the dictionary, an HMM model based on the word-forming ability of Chinese characters is adopted, and the Viterbi algorithm is used to segment these out-of-vocabulary words.
2. Ansj
This is a Java implementation of Chinese word segmentation based on CRF, n-gram and HMM models.
3. SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the text retrieval system Lucene; it computes word statistics over a large corpus with a Hidden Markov Model (HMM) and then uses those statistics to determine the best word segmentation combination of the text.
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved DBpedia-based algorithm is adopted to obtain the word segmentation combination most relevant to the topic of the text, so as to improve the accuracy of the segmentation result. The relevance between a word and the text topic is measured by finding the concept distance between the word and the text keywords.
1) Firstly, keywords are extracted from each word segmentation combination with the TextRank algorithm, which is adapted from Google's web page ranking algorithm PageRank. Within a word segmentation combination, the window size is set to m, giving windows [w_1, w_2, …, w_m], [w_2, w_3, …, w_{m+1}], [w_3, w_4, …, w_{m+2}], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes that occur in the same window. The edges then act as mutual votes between words, and after repeated iteration the vote counts stabilize. The importance of a word is judged by comparing its vote count, and the keywords of the text are thus obtained, the importance of a node being given by formula (1):
S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|   (1)
where d is a damping coefficient representing the probability that a node points to any other node, usually 0.85; In(V_i) is the set of nodes pointing to node V_i, and Out(V_j) is the set of nodes pointed to from node V_j.
2) Selecting the word segmentation result that minimizes the concept distance:
The topic relevance of a word segmentation combination to text N is measured by the average concept distance between each word ω of the combination and the m keywords {k_1, k_2, …, k_m} of text N, and the word segmentation combination with the smallest sum of concept distances Dis(ω, N) is regarded as the combination most relevant to the text topic. The concept distance DBpediaDis(ω, k_i) is computed in the DBpedia database. This yields the word segmentation combination of the financial text, on which the following semantic similarity detection is performed.
1) KG is defined as a directed labeled graph G = (V, E, τ), where V is the set of all nodes in the knowledge graph, E is the set of edges connecting the nodes, and τ: V × V → E is a function defining all triples in G. In the present algorithm, the semantic similarity between concepts is defined as:
sim_Dpath(c_i, c_j) = 1 / (1 + length(c_i, c_j) × k^IC(c_lcs))   (3)
where sim_Dpath(c_i, c_j) measures the semantic similarity between the two concepts, c_i, c_j ∈ V, and k ∈ (0, 1]; the parameter k represents the contribution of the concepts' common information to the similarity, and in this example k = 0.9. The LCS is the most specific concept among the common ancestors of c_i and c_j, i.e., their nearest shared node; for example, in Fig. 2 the LCS of the concept "upload field" and the concept "counter" is the concept "interfaces". Let Paths(c_i, c_j) = {P_1; P_2; … P_n} be the set of paths connecting concepts c_i and c_j; there may be several paths between two concepts, P_1 to P_n in total, and with P_i ∈ Paths(c_i, c_j),
length(c_i, c_j) = min(|P_i|)   (4)
so that length(c_i, c_j) is the shortest path length between the two concepts.
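A small Python sketch of formula (4) using the networkx package; the toy taxonomy below is made up for illustration (it borrows concept names from Fig. 2 but is not the figure's actual taxonomy).

import networkx as nx

# Toy concept taxonomy; edges link a concept to its sub-concepts.
taxonomy = nx.Graph()
taxonomy.add_edges_from([
    ("interfaces", "upload field"),
    ("interfaces", "return field"),
    ("interfaces", "counter"),
    ("interfaces", "VTM"),
])

# Formula (4): length(c_i, c_j) is the minimum number of edges over all paths.
print(nx.shortest_path_length(taxonomy, "upload field", "counter"))   # 2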
2) The graph-based IC is used, i.e., the semantic similarity between concepts is computed over a specific KG, with the IC of a concept calculated from the distribution of instances in the concept taxonomy. For example, in Fig. 2 the IC of the concept "counter" is obtained from the entities below "counter", such as the "counter" entity of the issuing bank; this calculation is independent of any external corpus.
The graph-based IC is defined in the KG as:
IC_graph(c_i) = -log Prob(c_i)   (5)
where
Prob(c_i) = frequency(c_i) / N   (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept c_i in the knowledge graph is defined as:
frequency(c_i) = count(c_i)   (7)
where count is a simple function that computes the cardinality of the concept's entity set. The count function can be implemented with the SPARQL query language:
# counts all entities of type owl:Thing, i.e. N in equation (6)
SELECT (COUNT(?ie) AS ?n) WHERE
{
    ?ie rdf:type owl:Thing .
}
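The query can be run from Python with the SPARQLWrapper package, for example against the public DBpedia endpoint; the endpoint URL, the result variable ?n and the client library are illustrative choices, not prescribed by the patent.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")   # illustrative endpoint
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT (COUNT(?ie) AS ?n) WHERE {
        ?ie rdf:type owl:Thing .
    }
""")
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
total_entities = int(result["results"]["bindings"][0]["n"]["value"])   # N in equation (6)
print(total_entities)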
The IC in the equation may be either the conventional corpus-based IC or the graph-based IC. The graph-based IC is as effective as the corpus-based IC, and when the corpus is insufficient or the IC has to be computed online, the graph-based IC is a good complement to the conventional corpus-based approach. Since the ICs in a knowledge graph express the type of an instance through disambiguated concepts, the graph-based IC captures the specific meaning of each concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, because the IC is computed from word occurrences in the corpus and a word may map to several concepts at the same time.
In the Knowledge Graph (KG), the most intuitive linguistic information is the semantic distance between different concepts: the closer the semantic distance, the shorter the shortest path between the concepts, indicating a higher similarity between the two. However, this approach cannot distinguish concept pairs that have the same depth and path length; the weakness of a similarity algorithm based on path length and depth is that any two concept pairs with the same path length and depth receive the same semantic similarity. Here sim(c_i, c_j) measures the semantic similarity between two concepts, and the depth of a concept c_i ∈ V is defined as the shortest path length from c_i to the root concept (the topmost concept). In the concept taxonomy shown in Fig. 2, sim(upload field, return field) and sim(counter, VTM) are the same, because the shortest path lengths and depths of these concept pairs are the same. IC is a statistical measure of the amount of information carried by a concept: a general concept carries little information and thus has a low IC value, while a more specific concept has a higher IC value, and the more information two concepts share, the more similar they are. But using only the information content of concepts ignores the valuable distance information between concepts in the concept taxonomy, and the semantic distance between concepts is an effective measure of their similarity. The weakness of a similarity algorithm based only on the concept IC is that two concept pairs with the same LCS (nearest shared node) receive the same semantic similarity; as shown in Fig. 2, sim(upload field, counter) and sim(return field, VTM) are the same, because the LCS of both pairs is the same.
Therefore, the two methods are combined: the IC of the two concepts' LCS is used to weight the shortest path length between the concepts, which describes the semantic similarity between concepts better.
The text similarity detection in the financial industry will now be illustrated with an example:
Given sentence 1, "interface add success", and sentence 2, "interface return failure":
First, the three word segmentation algorithms are applied, yielding three word segmentation combinations; in this example the three combinations are identical, so this result is taken as the word segmentation of the text. The segmentation of sentence 1 is: interface | add | success; the segmentation of sentence 2 is: interface | return | failure.
Second, the shortest path between the concept "interface" and the concept "interface" is 0; the shortest path between the concept "add" and the concept "return" is 4, and the IC of their LCS is 8.6219; the shortest path between the concept "success" and the concept "failure" is 7, and the IC of their LCS is 20.4752.
Finally, using formula (3), the similarity score of the concept "interface" and the concept "interface" is 1; the similarity score of the concept "add" and the concept "return" is 0.3828; and the similarity score of the concept "success" and the concept "failure" is 0.5526.
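These scores follow directly from formula (3) with k = 0.9 and can be checked with a few lines of Python (the helper name is illustrative):

def wpath(path_length, ic_lcs, k=0.9):
    return 1.0 / (1.0 + path_length * k ** ic_lcs)

print(round(wpath(0, 0.0), 4))        # "interface" vs "interface": 1.0
print(round(wpath(4, 8.6219), 4))     # "add" vs "return": 0.3828
print(round(wpath(7, 20.4752), 4))    # "success" vs "failure": 0.5526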
the closer the score is to 1, the higher the similarity of the two concepts.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (4)

1. A knowledge-graph-based semantic similarity calculation method for financial industry software testing, characterized by comprising the following steps:
S1, word segmentation operation is carried out on a financial text;
S2, selecting the word segmentation combination most relevant to the topic of the text, specifically comprising: if the three word segmentation results are the same, the three word segmentation combinations are not processed further; if the word segmentation results differ, an improved knowledge-graph-based algorithm is adopted to obtain the word segmentation combination most relevant to the text topic, wherein the relevance between a word and the text topic is measured by finding the concept distance between the word and the text keywords, a concept is a set of entities with the same characteristics, an entity refers to a particular instance, and an instance comprises a specific description of an entity within the concept; the word segmentation combination most relevant to the text topic is selected as follows:
1) Firstly, keywords are extracted from each word segmentation combination using the TextRank algorithm; within a word segmentation combination the window size is set to m, and a sentence consists in order of the words w_1, w_2, …, giving windows [w_1, w_2, …, w_m], [w_2, w_3, …, w_{m+1}], [w_3, w_4, …, w_{m+2}], and so on; each word in the text is regarded as a node, an edge exists between any two nodes in the same window, the edges are then treated as mutual votes between words according to the voting principle, and after repeated iteration the number of votes obtained by each node stabilizes; the importance of a word is judged by comparing the vote counts of the nodes, and the m words with the most votes are taken as the keywords of the text; the importance formula of a node is as follows:
S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|   (1)
wherein d is a damping coefficient representing the probability that a node points to an arbitrary node, In(V_i) is the set of nodes pointing to node V_i, Out(V_j) is the set of nodes pointed to from node V_j, S(V_i) and S(V_j) represent the importance of the respective nodes, i and j denote different nodes, and the total number of nodes is m;
2) Selecting the word segmentation result that minimizes the concept distance:
the topic relevance of a word segmentation combination to text N is measured by the average concept distance between each word ω of the combination and the m keywords {k_1, k_2, …, k_m} of text N, the word segmentation combination with the smallest sum of concept distances Dis(ω, N) is regarded as the combination most relevant to the text topic, and the concept distance DBpediaDis(ω, k_i) is calculated in the DBpedia database;
S3, calculating the semantic similarity of the segmented word groups of the financial text by utilizing the knowledge graph, with the minimum path length weighted by the concept IC; the concept IC measures the amount of information of a concept, a low IC value indicating a low information content, and the IC value is calculated based on the knowledge graph.
2. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S1 includes:
dividing Chinese sentences through three word segmentation algorithms to obtain different word segmentation combinations;
algorithm one, jieba
1) Generating all possible word formations in the sentence from the built-in dict.txt dictionary of the Jieba segmenter, using the trie tree generated from that dictionary, and forming a directed acyclic graph (DAG);
2) Performing reverse calculation on the sentence from right to left by a dynamic programming method, taking the frequency of each word in the dictionary as its word frequency, finding the maximum-probability path, and finally obtaining the word segmentation combination with the highest probability;
for words that do not appear in the dictionary, a hidden Markov model based on the word-forming ability of Chinese characters is adopted, and the Viterbi algorithm is used to segment the words not recorded in the dictionary;
algorithm two, the Ansj Chinese word segmenter:
the second algorithm is a Java implementation of Chinese word segmentation based on an n-gram language model, a conditional random field and a hidden Markov model;
algorithm three, the SmartChineseAnalyzer Chinese word segmenter:
Word statistics are computed over a large corpus with the hidden Markov model, and the statistics are then used to determine the optimal word segmentation combination of the text.
3. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the language information in the knowledge graph includes semantic distances of different concepts, namely, distances between nodes; the closer the semantic distance of the concepts, the smaller the shortest path between the concepts, indicating a higher similarity between the two.
4. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S3 includes: 1) KG is defined as a directed labeled graph, G = (V, E, τ), where G denotes the defined KG, i.e., the directed labeled graph, V represents all nodes in the knowledge graph, E represents the edges connecting the nodes, and τ: V × V → E is a function defining all triples in G; in the D-path algorithm, the semantic similarity between concepts is defined as:
sim_wpath(c_i, c_j) = 1 / (1 + length(c_i, c_j) × k^IC(c_Lcs))   (3)
wherein sim_wpath(c_i, c_j) measures the semantic similarity between two concepts, c_i and c_j are two different concepts, c_i, c_j ∈ V, k ∈ (0, 1], and the parameter k represents the contribution of the common information of the two concepts to the similarity; IC(c_Lcs) is the IC value of the nearest shared node, c_Lcs is the nearest shared node of the two concepts, and the LCS is the most specific concept among the common ancestors of c_i and c_j; the shortest path length length(c_i, c_j) is weighted with the IC value of the nearest shared node of the two concepts; let Paths(c_i, c_j) = {P_1; P_2; … P_n} be the set of paths connecting concepts c_i and c_j, where there may be several paths between the two concepts, P_1 to P_n in total; length(c_i, c_j) denotes the shortest of all these paths, and with P_i ∈ Paths(c_i, c_j),
length(c_i, c_j) = min(|P_i|)   (4)
so that length(c_i, c_j) is the shortest path length between the two concepts;
2) A graph-based IC, i.e., a KG-based IC, is used to calculate the semantic similarity between concepts; the concept IC in the KG is calculated from the distribution of instances in the concept taxonomy, and the graph-based IC is defined in the KG as:
IC_graph(c_i) = -log Prob(c_i)   (5)
wherein
Prob(c_i) = frequency(c_i) / N   (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept c_i in the knowledge graph is defined as:
frequency(c_i) = count(c_i)   (7)
where count is a simple function that computes the cardinality of the concept's entity set.
CN201910010902.6A 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing Active CN110232185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910010902.6A CN110232185B (en) 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910010902.6A CN110232185B (en) 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing

Publications (2)

Publication Number Publication Date
CN110232185A CN110232185A (en) 2019-09-13
CN110232185B true CN110232185B (en) 2023-09-19

Family

ID=67860089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910010902.6A Active CN110232185B (en) 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing

Country Status (1)

Country Link
CN (1) CN110232185B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110618987A (en) * 2019-09-18 2019-12-27 宁夏大学 Treatment pathway key node information processing method based on lung cancer medical big data
CN110941612B (en) * 2019-11-19 2020-08-11 上海交通大学 Autonomous data lake construction system and method based on associated data
CN111125339B (en) * 2019-11-26 2023-05-09 华南师范大学 Test question recommendation method based on formal concept analysis and knowledge graph
CN112328810B (en) * 2020-11-11 2022-10-14 河海大学 Knowledge graph fusion method based on self-adaptive mixed ontology mapping
CN114168751A (en) * 2021-12-06 2022-03-11 厦门大学 Medical knowledge concept graph-based medical text label identification method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101623860B1 (en) * 2015-04-08 2016-05-24 서울시립대학교 산학협력단 Method for calculating similarity between document elements
CN106610951A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Improved text similarity solving algorithm based on semantic analysis
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108090077A (en) * 2016-11-23 2018-05-29 中国科学院沈阳计算技术研究所有限公司 A kind of comprehensive similarity computational methods based on natural language searching

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720675B2 (en) * 2003-10-27 2010-05-18 Educational Testing Service Method and system for determining text coherence
US10176260B2 (en) * 2014-02-12 2019-01-08 Regents Of The University Of Minnesota Measuring semantic incongruity within text data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101623860B1 (en) * 2015-04-08 2016-05-24 서울시립대학교 산학협력단 Method for calculating similarity between document elements
CN106610951A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Improved text similarity solving algorithm based on semantic analysis
CN108090077A (en) * 2016-11-23 2018-05-29 中国科学院沈阳计算技术研究所有限公司 A kind of comprehensive similarity computational methods based on natural language searching
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model

Also Published As

Publication number Publication date
CN110232185A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN110232185B (en) Knowledge graph semantic similarity-based computing method for financial industry software testing
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN108197117B (en) Chinese text keyword extraction method based on document theme structure and semantics
JP3882048B2 (en) Question answering system and question answering processing method
Jabbar et al. Empirical evaluation and study of text stemming algorithms
JP4778474B2 (en) Question answering apparatus, question answering method, question answering program, and recording medium recording the program
Lossio-Ventura et al. Yet another ranking function for automatic multiword term extraction
CN109783806B (en) Text matching method utilizing semantic parsing structure
CN106372061A (en) Short text similarity calculation method based on semantics
CN103399901A (en) Keyword extraction method
US8812504B2 (en) Keyword presentation apparatus and method
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN111221968B (en) Author disambiguation method and device based on subject tree clustering
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN109408802A (en) A kind of method, system and storage medium promoting sentence vector semanteme
Hong Deep web data extraction
CN104765779A (en) Patent document inquiry extension method based on YAGO2s
Hussein Visualizing document similarity using n-grams and latent semantic analysis
Alian et al. Arabic semantic similarity approaches-review
Hussein Arabic document similarity analysis using n-grams and singular value decomposition
Sebti et al. A new word sense similarity measure in WordNet
CN111428031B (en) Graph model filtering method integrating shallow semantic information
Shajalal et al. Semantic textual similarity in bengali text
Saghayan et al. Exploring the impact of machine translation on fake news detection: A case study on persian tweets about covid-19

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant