CN110232185A - Knowledge-graph-based semantic similarity calculation method for financial-industry software testing - Google Patents
- Publication number
- CN110232185A CN110232185A CN201910010902.6A CN201910010902A CN110232185A CN 110232185 A CN110232185 A CN 110232185A CN 201910010902 A CN201910010902 A CN 201910010902A CN 110232185 A CN110232185 A CN 110232185A
- Authority
- CN
- China
- Prior art keywords
- concept
- concepts
- text
- word
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The present invention provides a knowledge-graph-based semantic similarity calculation method for financial-industry software testing, comprising the steps of: S1, performing word segmentation on the financial text; S2, selecting the segmentation combination most relevant to the text topic; S3, computing the semantic similarity of the segmentation combination using a knowledge graph with the concept-IC-weighted minimum path length. This knowledge-graph-based natural-semantics detection algorithm first segments the financial text with several segmentation algorithms to obtain candidate segmentation combinations, then computes the concept distance between each word and the text keywords to measure the relevance between a segmentation combination and the text topic, and finally selects the combination with the smallest sum of concept distances for semantic similarity detection. The shortest path length between concepts is weighted by the information content (IC) of concepts in the knowledge graph, which yields better accuracy than other methods.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a knowledge graph semantic similarity calculation method for financial industry software testing.
Background
Semantic similarity detection between natural-language texts is widely applied in fields such as information retrieval, machine translation, and automatic question answering. Sentences are composed of individual words, including subjects, predicates, various stop words, and the like, and the same words may carry completely different meanings in different combinations and contexts. Many statistics-based calculation methods have been proposed in recent years, but such methods ignore the semantic and structural information of text, so their results sometimes do not match human understanding of natural language. Moreover, corpus-based IC may carry ambiguous concept meanings, because it computes IC from word occurrences in a corpus, where a single word may map to several concepts at once. Today a large number of public knowledge graphs (KGs) are available, covering a vast set of concepts, entities, and the relationships between them. Using a knowledge graph for natural-language similarity detection makes it possible to accurately capture the semantic information behind the text under test and to return more precise, structured information. However, most current knowledge-graph work targets semantic similarity detection of English texts, with comparatively little application to Chinese. Chinese financial text differs from English text: words are not separated by spaces, and Chinese grammatical structure is comparatively more complex. Given these characteristics, Chinese sentences must be segmented into words before semantic similarity detection with a knowledge graph.
Disclosure of Invention
Previous semantic similarity research has measured the semantic similarity of text mainly by computing the depth of and path length between concepts, or the information content (IC) shared between concepts. The present invention provides a knowledge-graph-based semantic similarity calculation method for financial-industry software testing in which the minimum path length is weighted by the concepts' information content (IC), and the graph-based IC is computed from the distribution of instances over concepts. The method comprises the following specific steps:
S1, performing word segmentation on the financial text;
S2, selecting the segmentation combination most relevant to the text topic;
S3, computing the semantic similarity of the financial text using a knowledge graph (KG) and the segmentation combination, with the concept-IC-weighted minimum path length.
Further, step S1 includes:
segmenting Chinese sentences through various word segmentation algorithms to obtain different word segmentation combinations;
Algorithm 1: Jieba
1) Using the dict.txt dictionary shipped with Jieba, generate all possible word formations in the sentence from a trie built over dict.txt, forming a directed acyclic graph (DAG).
2) Find the maximum-probability path by dynamic programming. Because the semantic weight of a Chinese sentence often lies in its latter half, the sentence is computed in reverse from right to left, using each word's frequency in the dictionary as its probability; the maximum-probability path then yields the maximum-probability segmentation combination.
For words that do not appear in the dictionary, an HMM (Hidden Markov Model) based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm segments the out-of-vocabulary words.
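A minimal sketch of this DAG-plus-dynamic-programming step. The dictionary and its frequencies are made up for illustration (Jieba's real dict.txt ships several hundred thousand entries), and out-of-vocabulary characters simply fall back to single-character words instead of the HMM/Viterbi pass:

```python
import math

# Toy dictionary with made-up frequencies, standing in for jieba's dict.txt.
FREQ = {"接口": 100, "新增": 60, "成功": 80, "新": 5, "增": 3, "接": 4, "口": 4}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, collect the end indices of dictionary words (the DAG)."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: fall back to one char
    return dag

def max_prob_segment(sentence):
    """Right-to-left dynamic programming over the DAG, using dictionary
    frequencies as word probabilities (log domain to avoid underflow)."""
    n = len(sentence)
    dag = build_dag(sentence)
    route = {n: (0.0, 0)}  # index -> (best log-probability from here, next index)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        words.append(sentence[i:route[i][1]])
        i = route[i][1]
    return words
```

On the embodiment's sentence 1, the two-character dictionary words win over the single characters, giving the segmentation "接口 | 新增 | 成功".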
Algorithm 2: Ansj Chinese word segmenter
Ansj is a Java implementation of Chinese word segmentation based on an n-Gram language model, CRF (Conditional Random Field), and HMM (Hidden Markov Model).
Algorithm 3: SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the Lucene text retrieval system. It computes word statistics with a Hidden Markov Model (HMM) over a large corpus, and the statistics are then used to compute the optimal word segmentation of the text.
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved algorithm based on DBpedia (a knowledge graph) is adopted to obtain the segmentation combination most relevant to the text topic, in order to improve segmentation accuracy. The relevance between a word and the text topic is measured by the concept distance between the word and the text keywords. A concept is a set of entities with the same characteristics; entities refer to distinct instances, and instances are concrete descriptions under a concept, such as 'Guangfa Bank counter' under the concept 'counter'.
Further, the step of selecting the word combination most relevant to the text theme comprises the following steps:
1) Keywords are first extracted from each segmentation combination using the TextRank algorithm, adapted from Google's PageRank algorithm. Within a segmentation combination, with the window size set to m, we obtain windows [w1, w2, …, wm], [w2, w3, …, wm+1], [w3, w4, …, wm+2], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes within the same window. The edges then act as mutual votes between words, and after repeated iteration each node's vote count stabilizes. The importance of a word is judged by comparing node vote counts, and the m words with the highest counts are taken as the keywords of the text. The importance formula of a node is as follows:
WS(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} WS(Vj) / |Out(Vj)|
where d is a damping factor representing the probability that a given node jumps to any other node, usually set to 0.85; In(Vi) is the set of nodes pointing to node Vi, and Out(Vj) is the set of nodes that node Vj points to.
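The voting loop above can be sketched as follows. The window handling and fixed iteration count are simplifications, and tie-breaking between equally scored words is left unspecified:

```python
from collections import defaultdict
from itertools import combinations

def textrank(words, window=3, d=0.85, iters=100):
    """Rank words by TextRank: build co-occurrence edges within a sliding
    window, then iterate the PageRank-style vote update until scores settle."""
    neighbors = defaultdict(set)
    for i in range(max(len(words) - window + 1, 1)):
        for a, b in combinations(words[i:i + window], 2):
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)
```

The word connected to the most distinct neighbors accumulates the most votes and ranks first.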
2) Selecting the word segmentation result that minimizes the concept distance:
By computing the concept distance between each word ω and the m keywords k1, k2, …, km of text N, the relevance of a segmentation combination to the text topic N is measured by the average concept distance. The segmentation combination with the smallest concept-distance sum Dis(ω, N) can be regarded as the combination most relevant to the text topic. The concept distance DBpediaDis(ω, ki) is computed against the DBpedia database, and semantic similarity detection is performed once the segmentation combination of the financial text is obtained.
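The selection step can be sketched with a stand-in distance function. `toy_dist` below is purely hypothetical (exact match scores 0, anything else a flat penalty), whereas the patent's DBpediaDis(ω, ki) is resolved against the DBpedia database:

```python
def pick_best_segmentation(combos, keywords, concept_dist):
    """Choose the segmentation combination whose words are closest, on
    average, to the text keywords under the given concept-distance function."""
    def avg_dist(words):
        return sum(concept_dist(w, k) for w in words for k in keywords) / (
            len(words) * len(keywords)
        )
    return min(combos, key=avg_dist)

def toy_dist(word, keyword):
    # Hypothetical stand-in for the DBpedia concept-distance lookup.
    return 0 if word == keyword else 3
```

A combination that keeps a keyword intact as one word beats one that splits it, since the intact word sits at distance 0 from the keyword.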
Further, in the Knowledge Graph (KG), the language information includes semantic distances between different concepts, that is, distances between nodes, and the closer the semantic distance of a concept is, the smaller the shortest path between concepts is, the higher the similarity between the two is.
Further, the concept IC measures the amount of information a concept carries: general concepts carry less information and thus have lower IC values, while more specific concepts have higher IC values; here the IC is computed from the knowledge graph. The more information two concepts share, the more similar they are. A limitation of similarity calculation based solely on concept IC is that if the LCS (nearest shared node) of two concept pairs is the same, their similarity comes out the same.
Further, step S3 includes:
(1) A KG is defined as a directed labeled graph G = (V, E, τ), where V represents all nodes in the knowledge graph, E represents the edges connecting the nodes, and τ is the function defining all triples in G. In the D-path algorithm, the semantic similarity between concepts is defined as:
sim_Dpath(ci, cj) = 1 / (1 + length(ci, cj) · k^IC(c_lcs)) (3)
where sim_Dpath(ci, cj) measures the semantic similarity between two different concepts ci, cj ∈ V; the parameter k ∈ (0, 1] represents the contribution of the two concepts' shared information to the similarity; IC(c_lcs) is the IC value of the nearest shared node c_lcs, the LCS being the most specific concept among the common ancestors of ci and cj, i.e. their nearest shared node; the shortest path length is thus weighted by the IC value of the two concepts' nearest shared node. Let Paths(ci, cj) = {P1, P2, … Pn} be the set of paths connecting ci and cj (there may be several paths, P1 through Pn, between two concepts); then for Pi ∈ Paths(ci, cj),
length(ci, cj) = min(|Pi|) (4)
gives the shortest path length between the two concepts.
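Once the shortest path length and the IC of the LCS are known, the D-path score of formulas (3)-(4) is a one-liner; plugging in the values from the worked example in the embodiment (k = 0.9) reproduces its reported scores:

```python
import math

def d_path_sim(length, ic_lcs, k=0.9):
    """D-path similarity, formula (3): the shortest path length between two
    concepts, discounted by k raised to the IC of their nearest shared node.
    Identical concepts (length 0) score exactly 1."""
    return 1.0 / (1.0 + length * k ** ic_lcs)
```

A higher IC of the LCS shrinks k^IC, so concept pairs sharing a very specific ancestor are penalized less for the same path length, which is exactly how 'success'/'failure' (path 7, IC 20.4752) outscores 'add'/'return' (path 4, IC 8.6219).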
(2) The semantic similarity between concepts is computed with a graph-based IC, i.e. one based on a specific KG, as shown in formula (5); the magnitude of the semantic similarity is known through the IC value. The IC of a concept in the KG is computed from the distribution of instances in the concept taxonomy, so semantic similarity computed with the graph-based IC does not depend on an external corpus.
The graph-based IC is defined in the KG as:
IC_graph(ci) = -log Prob(ci) (5)
where
Prob(ci) = frequency(ci) / N (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept ci in the knowledge graph is defined as:
frequency(ci) = count(ci) (7)
where count is a simple function that counts the entities belonging to a concept.
The IC in the equation can be either a conventional corpus-based IC or a graph-based IC. The graph-based IC is as effective as the corpus-based IC, and is a good complement to it when the corpus is insufficient or the IC must be computed online. Because a knowledge graph typically expresses instance types with disambiguated concepts, the graph-based IC captures a specific meaning of a concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, since IC is computed from word occurrences in a corpus in which a word may map to several concepts at once.
In the Knowledge Graph (KG), more intuitive language information comprises semantic distances among different concepts, and the closer the semantic distances of the concepts, the smaller the shortest path among the concepts, the higher the similarity between the two concepts. But this method does not solve the similarity problem between concepts having the same depth and length. The similarity algorithm based on path length and depth has a disadvantage in that if any two concepts have the same path length and depth, their semantic similarity is the same. As shown in fig. 1, sim (concept 7, concept 8) and sim (concept 10, concept 11) are the same because the shortest path length and depth of the above concept pair are the same. IC is a statistical method used to measure the amount of conceptual information. The general concepts have a lower information content and therefore a lower IC value, while the more specific concepts have a higher IC value. The more information two concepts share, the more similar they are. But using only information of concepts ignores valuable distance information between concepts in the concept taxonomy. And the semantic distance between the concepts is an effective measurement method for describing the similarity between the concepts. A disadvantage of similarity algorithms based on concept IC is that if the LCS (nearest shared node) of two concepts are the same, their semantic similarity is the same. Like sim (concept 4, concept 5) and sim (concept 7, concept 9) are the same, since the LCS of the above concept pair is concept 2. Therefore, by combining the two methods, the IC of the two concepts LCS is used for weighting the shortest path length between the concepts, and the method can better describe the semantic similarity between the concepts.
The proposed D-path algorithm assigns different weights to the shortest path length according to the shared information of the two concepts: when the two concepts are identical, the path length is 0 and the similarity score reaches its maximum of 1. Although the algorithm contains no explicit depth information, the IC of the LCS behaves much like concept depth: a concept deeper in the concept taxonomy carries more specific information, making the two concepts beneath it more similar.
In summary, the D-path algorithm solves the problem of identical similarity scores for concept pairs with the same depth and path length: as shown in step three, it weights the minimum path length by introducing the concepts' shared information, and since the shared information can differ even when the minimum paths are the same, the scores differ as well. By weighting the minimum path length between concept pairs with their shared information, valuable distance information in the concept taxonomy is retained, while the statistical information captures how consistently the concepts sit within the taxonomic structure.
Compared with the prior art, the invention has the following advantages and effects:
The knowledge-graph-based natural-semantics detection algorithm first segments the financial text with several word segmentation algorithms to obtain candidate segmentation combinations, then computes the concept distances between words and text keywords to measure the similarity between each segmentation combination and the text topic, and finally selects the segmentation combination with the smallest sum of concept distances for semantic similarity detection. The knowledge graph weights the shortest path length between concepts with the concepts' information content (IC), achieving better accuracy than other methods. Performing natural-language similarity detection with a knowledge graph makes it possible to accurately capture the semantic information behind the text under test and to return more precise, structured information.
The IC value is computed from the graph, i.e. the knowledge graph, without using a corpus, which largely avoids the situation in which one word maps to several concepts at the same time.
Drawings
FIG. 1 is a conceptual classification diagram of an embodiment.
FIG. 2 is a diagram of an exemplary financial text classification.
Figure 3 is an algorithmic flow chart of an embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example (b):
as shown in FIG. 3, the method for calculating the semantic similarity based on knowledge graph for the financial industry software test comprises the following steps:
S1, performing word segmentation on the financial text;
S2, selecting the segmentation combination most relevant to the text topic;
S3, computing the semantic similarity of the segmentation combination using the knowledge graph and the concept-IC-weighted minimum path length.
Further, step S1 includes:
segmenting Chinese sentences through various word segmentation algorithms to obtain different word segmentation combinations;
Algorithm 1: Jieba
1) Using the dict.txt dictionary shipped with Jieba, generate all possible word formations in the sentence from a trie built over dict.txt, forming a directed acyclic graph (DAG);
2) find the maximum-probability path by dynamic programming: because the semantic weight of a Chinese sentence often lies in its latter half, the sentence is computed in reverse from right to left, using each word's frequency in the dictionary as its probability, finding the maximum-probability path and finally obtaining the maximum-probability segmentation combination;
for words that do not appear in the dictionary, an HMM model based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm segments the out-of-vocabulary words.
Algorithm 2: Ansj
This is a Java implementation of Chinese word segmentation based on CRF, n-Gram, and HMM.
Algorithm 3: SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the Lucene text retrieval system. It computes word statistics with a Hidden Markov Model (HMM) over a large corpus, and the statistics are then used to compute the optimal word segmentation of the text.
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved algorithm based on DBpedia is adopted to obtain the segmentation combination most relevant to the text topic, in order to improve segmentation accuracy. The relevance between a word and the text topic is measured by the concept distance between the word and the text keywords.
1) Keywords are first extracted from each segmentation combination using the TextRank algorithm, which is adapted from Google's PageRank algorithm. Within a segmentation combination, with the window size set to m, we obtain windows [w1, w2, …, wm], [w2, w3, …, wm+1], [w3, w4, …, wm+2], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes within the same window. The edges then act as mutual votes between words, and after repeated iteration the vote counts stabilize. The importance of a word is judged by comparing vote counts, yielding the keywords of the text:
WS(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} WS(Vj) / |Out(Vj)|
where d is a damping factor representing the probability that a given node jumps to any other node, usually set to 0.85; In(Vi) is the set of nodes pointing to node Vi, and Out(Vj) is the set of nodes that node Vj points to.
2) Selecting the word segmentation result that minimizes the concept distance:
By computing the concept distance between each word ω and the m keywords k1, k2, …, km of text N, the relevance of a segmentation combination to the text topic N is measured; the segmentation combination with the smallest concept-distance sum Dis(ω, N) can be regarded as the combination most relevant to the text topic. The concept distance DBpediaDis(ω, ki) is computed against the DBpedia database to obtain the segmentation combination of the financial text, after which semantic similarity detection proceeds as follows.
1) A KG is defined as a directed labeled graph G = (V, E, τ), where V represents all nodes in the knowledge graph, E represents the edges connecting the nodes, and τ is the function defining all triples in G. In the present algorithm, the semantic similarity between concepts is defined as:
sim_Dpath(ci, cj) = 1 / (1 + length(ci, cj) · k^IC(c_lcs)) (3)
where sim_Dpath(ci, cj) measures the semantic similarity between two concepts, ci, cj ∈ V, k ∈ (0, 1]; the parameter k represents the contribution of the two concepts' shared information to the similarity, and in this example k is 0.9. The LCS is the most specific concept among the common ancestors of the two concepts ci and cj, i.e. their nearest shared node; for instance, the LCS of the concept 'upload field' and the concept 'counter' in Fig. 2 is the concept 'interface'. Let Paths(ci, cj) = {P1, P2, … Pn} be the set of paths connecting ci and cj (there may be several paths, P1 through Pn, between two concepts); then for Pi ∈ Paths(ci, cj),
length(ci, cj) = min(|Pi|) (4)
gives the shortest path length between the two concepts.
2) The semantic similarity between concepts is computed with a graph-based IC, i.e. one based on a specific KG. The IC of a concept in the KG is computed from the distribution of instances in the concept taxonomy, such as the entity 'Guangfa Bank counter' below the concept 'counter' in Fig. 2; this approach does not depend on an external corpus.
The graph-based IC is defined in the KG as:
IC_graph(ci) = -log Prob(ci) (5)
where
Prob(ci) = frequency(ci) / N (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept ci in the knowledge graph is defined as:
frequency(ci) = count(ci) (7)
where count is a simple function that counts the entities belonging to a concept. The count function can be implemented with the SPARQL query language:
SELECT (COUNT(?ie) AS ?total) WHERE
{
?ie rdf:type owl:Thing .
}
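With the entity count in hand, the graph-based IC of formulas (5)-(7) reduces to a negative log-probability. A sketch follows; the natural logarithm is assumed, since the text does not fix the base:

```python
import math

def graph_ic(concept_count, total_entities):
    """Graph-based IC per formulas (5)-(7): IC(c) = -log(count(c) / N),
    where count(c) is the number of instances of concept c and N is the
    total number of entities in the knowledge graph."""
    return -math.log(concept_count / total_entities)
```

A concept covering every entity carries no information (IC 0), while rarer, more specific concepts receive higher IC values, matching the discussion above.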
The IC in the equation can be either a conventional corpus-based IC or a graph-based IC. The graph-based IC is as effective as the corpus-based IC, and is a good complement to it when the corpus is insufficient or the IC must be computed online. Because a knowledge graph typically expresses instance types with disambiguated concepts, the graph-based IC captures a specific meaning of a concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, since IC is computed from word occurrences in a corpus in which a word may map to several concepts at once.
In the knowledge graph (KG), the more intuitive linguistic information includes the semantic distances between different concepts: the closer two concepts are semantically, the shorter the shortest path between them, and the higher their similarity. But this approach cannot distinguish concepts with the same depth and path length. The weakness of similarity algorithms based on path length and depth is that any two concept pairs with the same path length and depth receive the same semantic similarity. sim(ci, cj) measures the semantic similarity between two concepts; in the concept taxonomy of Fig. 2, for example, sim(upload field, return field) and sim(counter, VTM) are the same because the shortest path length and depth of these concept pairs are the same, where the depth of a concept ci ∈ V is defined as the shortest path length from ci to the root (topmost) concept. IC is a statistical measure of the amount of information a concept carries: general concepts have lower information content and thus lower IC values, while more specific concepts have higher IC values, and the more information two concepts share, the more similar they are. Using only concept information, however, ignores the valuable distance information between concepts in the taxonomy, and semantic distance is an effective measure of similarity between concepts. The weakness of similarity algorithms based on concept IC is that two concept pairs with the same LCS (nearest shared node) receive the same semantic similarity: as shown in Fig. 2, sim(upload field, counter) and sim(return field, VTM) are the same, since the LCS of both concept pairs is the same.
Therefore, the two methods are combined: the IC of the two concepts' LCS is used to weight the shortest path length between the concepts, which better characterizes the semantic similarity between them.
A financial-industry text similarity test is now given as an example.
Given sentence 1: 'interface add success', and sentence 2: 'interface return failure'.
Firstly, three word segmentation algorithms are used for word segmentation to obtain three word segmentation combinations, and the three word segmentation combinations in the example are the same, so that the result is used as the word segmentation combination of the text. The word segmentation results for sentence 1 are as follows: interface | add | is successful; the word segmentation result of sentence 2 is as follows: interface | Return | fails.
Secondly, the shortest path between the concept 'interface' and the concept 'interface' is 0; the shortest path between the concept 'add' and the concept 'return' is 4, and the IC of the LCS of these two concepts is 8.6219; the shortest path between the concept 'success' and the concept 'failure' is 7, and the IC of the LCS of these two concepts is 20.4752.
Finally, using formula (3), the similarity score between the concept 'interface' and the concept 'interface' is 1; the similarity score between the concept 'add' and the concept 'return' is 0.3828; and the similarity score between the concept 'success' and the concept 'failure' is 0.5526.
the closer the score is to 1, the higher the similarity of the two concepts.
The above embodiments are preferred embodiments of the present invention, but the invention is not limited to them; any changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principles of the invention are regarded as equivalents and fall within the scope of the invention.
Claims (7)
1. A knowledge-graph-based semantic similarity calculation method for financial-industry software testing, characterized by comprising the following steps:
S1, performing word segmentation on the financial text;
S2, selecting the segmentation combination most relevant to the text topic;
S3, computing the semantic similarity of the segmentation combination using the knowledge graph and the concept-IC-weighted minimum path length.
2. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S1 includes:
segmenting the Chinese sentence with three word segmentation algorithms to obtain different word segmentation combinations:
Algorithm I: the Jieba word segmenter
1) using the dict.txt dictionary shipped with Jieba, all possible word formations in the sentence are generated from a trie built over dict.txt, forming a directed acyclic graph (DAG);
2) the sentence is processed from right to left by dynamic programming, taking each word's frequency in the dictionary as its weight, so as to find the maximum-probability path and obtain the word segmentation combination with the maximum probability;
for words that do not appear in the dictionary, a hidden Markov model based on the word-forming capability of Chinese characters is adopted, and the Viterbi algorithm is used to segment these out-of-vocabulary words;
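The DAG construction and right-to-left dynamic programming described above can be sketched in pure Python. This is a minimal illustration with a made-up toy dictionary, not Jieba's actual dict.txt or its implementation; out-of-vocabulary characters here simply fall back to single characters with a floor frequency, whereas Jieba would hand them to its HMM/Viterbi step.

```python
import math

# Toy dictionary: word -> frequency (a stand-in for Jieba's dict.txt;
# these words and counts are illustrative, not from the patent).
DICT = {"知识": 50, "图谱": 40, "知识图谱": 30, "语义": 45, "相似": 35,
        "度": 20, "相似度": 25, "计算": 60}
TOTAL = sum(DICT.values())

def build_dag(sentence):
    """For each start index, list every end index that forms a dictionary
    word (a single character is always allowed as a fallback)."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [i + 1]  # single-character fallback
        for j in range(i + 2, n + 1):
            if sentence[i:j] in DICT:
                ends.append(j)
        dag[i] = ends
    return dag

def max_prob_cut(sentence):
    """Right-to-left dynamic programming over the DAG: at every position,
    choose the word that maximizes the total log probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    # route[i] = (best log-prob of sentence[i:], end index of chosen word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(DICT.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i])
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(max_prob_cut("知识图谱语义相似度计算"))
# → ['知识图谱', '语义', '相似度', '计算']
```

Note how the DP prefers the long dictionary words "知识图谱" and "相似度" over their character-by-character decompositions, because one high-frequency word outweighs the product of several shorter-word probabilities.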
Algorithm II: the Ansj Chinese word segmenter
a Java implementation of Chinese word segmentation based on the n-gram language model, conditional random fields, and hidden Markov models;
Algorithm III: the SmartChineseAnalyzer Chinese word segmenter
word statistics are computed over large corpora with a hidden Markov model, and the statistical results are then used to determine the best segmentation combination for the text.
3. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S2 comprises: for the three word segmentation combinations obtained, if the three segmentation results are identical, no further processing is performed; if the segmentation results differ, an improved knowledge-graph-based algorithm is adopted to obtain the word segmentation combination most relevant to the text topic, wherein the relevance between a word and the text topic is measured by the concept distance between the word and the text keywords; a concept is a set of entities sharing the same characteristics, the entities are the different instances, and an instance is a concrete description of an entity under a concept.
4. The knowledge-graph-based semantic similarity calculation method according to claim 3, wherein selecting the word segmentation combination most relevant to the text topic comprises the following steps:
1) first, keywords are extracted from each word segmentation combination by the TextRank algorithm; the window size is set to m within the word segmentation combination, and a sentence consists, in order, of the words w1, w2, ..., yielding the windows [w1, w2, ..., wm], [w2, w3, ..., wm+1], [w3, w4, ..., wm+2], and so on; each word in the text is regarded as a node, and within the same window there is an edge between any two nodes; by the voting principle, the edges are regarded as mutual votes between the words, and after repeated iteration the vote count of each node tends to a stable value; the importance of a word is judged by comparing the vote counts of the nodes, and the m words with the most votes are taken as the keywords of the text; the importance formula of a node is:

S(Vi) = (1 - d) + d · Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

wherein d is a damping coefficient representing the probability that a given node points to any other node; In(Vi) is the set of nodes pointing to node Vi, and Out(Vj) is the set of nodes pointed to from node Vj; S(Vi) and S(Vj) represent the importance of each node; i and j denote different nodes, and the total number of nodes is m;
2) selecting the word segmentation result that minimizes the concept distance:
the relevance of a word segmentation combination to the text topic N is measured by the average concept distance between each word ω and the m keywords k1, k2, ..., km of the text N; the word segmentation combination with the minimum concept distance sum Dis(ω, N) is regarded as the combination most relevant to the text topic, and the concept distance DBpediaDis(ω, ki) is calculated over the DBpedia database.
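The voting iteration of step 1) can be sketched as follows. The token list, window size, and damping value below are illustrative assumptions; in practice the iteration would run over each candidate segmentation of the financial text.

```python
# A minimal TextRank sketch over a toy token list (the tokens are
# assumed example words, not data from the patent).
def textrank(tokens, window=3, d=0.85, iters=50):
    """Rank distinct words by iterating the voting formula
    S(Vi) = (1 - d) + d * sum_{Vj in In(Vi)} S(Vj) / |Out(Vj)|."""
    # Build an undirected co-occurrence graph: an edge links any two
    # distinct words that appear in the same sliding window.
    neighbors = {w: set() for w in tokens}
    for i in range(len(tokens) - window + 1):
        win = tokens[i:i + window]
        for a in win:
            for b in win:
                if a != b:
                    neighbors[a].add(b)
    # Iterate the voting formula until the scores stabilize.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {w: (1 - d) + d * sum(scores[v] / len(neighbors[v])
                                       for v in neighbors[w])
                  for w in neighbors}
    # Words sorted by final vote count, most important first.
    return sorted(scores, key=scores.get, reverse=True)
```

On a toy list such as `["语义", "相似", "计算", "语义", "图谱"]`, the words with the most co-occurrence edges ("语义" and "计算") end up with the highest scores, mirroring the claim's "most votes" criterion.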
5. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the linguistic information in the knowledge graph includes the semantic distances between different concepts, i.e., the distances between nodes; the closer two concepts are semantically, the shorter the shortest path between them, indicating a higher similarity between the two.
6. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the concept IC (Information Content) is used to measure the amount of information carried by a concept; the lower the IC value calculated from the knowledge graph, the lower the concept's information content.
7. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S3 includes:
1) the KG is defined as a directed labeled graph G = (V, E, τ), wherein G denotes the defined KG, i.e., the directed labeled graph; V denotes all nodes in the knowledge graph; E denotes the set of edges connecting the nodes; and τ is a function V × V → E that defines all the triples in G; in the wpath algorithm, the semantic similarity of concepts is defined as:

sim_wpath(ci, cj) = 1 / (1 + length(ci, cj) · k^IC(c_lcs))   (3)

wherein sim_wpath(ci, cj) measures the semantic similarity between two concepts; ci and cj denote two different concepts, ci, cj ∈ V; k ∈ (0, 1], and the parameter k represents the contribution of the common information of the two concepts to the similarity; IC(c_lcs) is the IC value of the nearest shared node, c_lcs being the nearest shared node of the two concepts, and the LCS is the most specific concept among the common ancestors of ci and cj; the shortest path length length(ci, cj) is thus weighted by the IC value of the two concepts' nearest shared node; let Paths(ci, cj) = {P1; P2; ...; Pn} be the set of paths connecting the concepts ci and cj, so that the two concepts have paths P1 to Pn; length(ci, cj) denotes the shortest among all these paths, and with Pi ∈ Paths(ci, cj),

length(ci, cj) = min(|Pi|)   (4)

the resulting length(ci, cj) represents the shortest path length between the two concepts;
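Formula (4) can be illustrated with a breadth-first search over a small concept graph. The undirected toy graph below, with hypothetical nodes c1..c5, is invented for illustration and is not the patent's knowledge graph.

```python
from collections import deque

# Toy undirected concept graph (adjacency lists; edges are made up).
# c1-c2, c2-c3, c2-c4, c3-c5, c4-c5: two paths connect c1 and c5,
# both of length 3, so min(|Pi|) = 3.
EDGES = {
    "c1": ["c2"],
    "c2": ["c1", "c3", "c4"],
    "c3": ["c2", "c5"],
    "c4": ["c2", "c5"],
    "c5": ["c3", "c4"],
}

def length(ci, cj):
    """Formula (4): number of edges on the shortest path connecting the
    two concept nodes, found by breadth-first search."""
    seen, queue = {ci}, deque([(ci, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == cj:
            return dist
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the concepts are not connected
```

BFS explores nodes in order of increasing distance, so the first time cj is reached its distance is guaranteed to be the minimum over all paths in Paths(ci, cj).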
2) calculating the semantic similarity between concepts by using the graph-based IC, i.e., based on the KG: the concept IC is calculated in the KG according to the distribution of instances under the concept taxonomy, and the graph-based IC is defined in the KG as:

IC_graph(ci) = -log Prob(ci)   (5)

wherein

Prob(ci) = frequency(ci) / N   (6)

N represents the total number of all entities in the knowledge graph, and the frequency of concept ci in the knowledge graph is defined as:

frequency(ci) = count(ci)   (7)

where count is a simple function that computes the cardinality of the set of entities belonging to the concept.
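The graph-based IC of formulas (5)-(7) and the wpath similarity of formula (3) can be sketched together on a toy is-a taxonomy. The concepts, parent links, and per-concept entity counts below are invented stand-ins for a real knowledge graph and do not reproduce the patent's figures; the path length is measured through the LCS, which is valid for a tree-shaped taxonomy.

```python
import math

# Toy is-a taxonomy (child -> parent) and per-concept entity counts,
# both made up for illustration.
PARENT = {"success": "result", "failure": "result",
          "result": "event", "event": None}
COUNT = {"success": 40, "failure": 30, "result": 20, "event": 10}

def ancestors(c):
    """c followed by all of its ancestors, up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def lcs(c1, c2):
    """Least common subsumer: the most specific shared ancestor."""
    shared = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in shared)

def ic(c):
    """Graph-based IC, formulas (5)-(7): -log of the share of all N
    entities that fall under concept c or any of its descendants."""
    n = sum(COUNT.values())
    freq = sum(COUNT[x] for x in COUNT if c in ancestors(x))
    return -math.log(freq / n)

def wpath(c1, c2, k=0.5):
    """Formula (3): sim = 1 / (1 + length(c1, c2) * k ** IC(lcs)),
    with the shortest path routed through the LCS (tree taxonomy)."""
    a = lcs(c1, c2)
    length = ancestors(c1).index(a) + ancestors(c2).index(a)
    return 1 / (1 + length * k ** ic(a))
```

As in the embodiment, identical concepts score exactly 1 (their path length is 0), and sibling concepts such as "success" and "failure" score between 0 and 1, higher when their LCS is more informative (higher IC shrinks the k^IC weight on the path length).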
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910010902.6A CN110232185B (en) | 2019-01-07 | 2019-01-07 | Knowledge graph semantic similarity-based computing method for financial industry software testing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110232185A true CN110232185A (en) | 2019-09-13 |
CN110232185B CN110232185B (en) | 2023-09-19 |
Family
ID=67860089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910010902.6A Active CN110232185B (en) | 2019-01-07 | 2019-01-07 | Knowledge graph semantic similarity-based computing method for financial industry software testing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110232185B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100223051A1 (en) * | 2003-10-27 | 2010-09-02 | Educational Testing Service | Method and System for Determining Text Coherence |
US20150227626A1 (en) * | 2014-02-12 | 2015-08-13 | Regents Of The University Of Minnesota | Measuring semantic incongruity within text data |
KR101623860B1 (en) * | 2015-04-08 | 2016-05-24 | 서울시립대학교 산학협력단 | Method for calculating similarity between document elements |
CN106610951A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Improved text similarity solving algorithm based on semantic analysis |
CN106844350A (en) * | 2017-02-15 | 2017-06-13 | 广州索答信息科技有限公司 | A kind of computational methods of short text semantic similarity |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | A kind of domain entities disambiguation method for merging term vector and topic model |
CN108090077A (en) * | 2016-11-23 | 2018-05-29 | 中国科学院沈阳计算技术研究所有限公司 | A kind of comprehensive similarity computational methods based on natural language searching |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110618987A (en) * | 2019-09-18 | 2019-12-27 | 宁夏大学 | Treatment pathway key node information processing method based on lung cancer medical big data |
CN110941612A (en) * | 2019-11-19 | 2020-03-31 | 上海交通大学 | Autonomous data lake construction system and method based on associated data |
CN110941612B (en) * | 2019-11-19 | 2020-08-11 | 上海交通大学 | Autonomous data lake construction system and method based on associated data |
CN111125339A (en) * | 2019-11-26 | 2020-05-08 | 华南师范大学 | Test question recommendation method based on formal concept analysis and knowledge graph |
CN111125339B (en) * | 2019-11-26 | 2023-05-09 | 华南师范大学 | Test question recommendation method based on formal concept analysis and knowledge graph |
CN112328810A (en) * | 2020-11-11 | 2021-02-05 | 河海大学 | Knowledge graph fusion method based on self-adaptive mixed ontology mapping |
CN112328810B (en) * | 2020-11-11 | 2022-10-14 | 河海大学 | Knowledge graph fusion method based on self-adaptive mixed ontology mapping |
CN114168751A (en) * | 2021-12-06 | 2022-03-11 | 厦门大学 | Medical knowledge concept graph-based medical text label identification method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110232185B (en) | 2023-09-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pontes et al. | Predicting the semantic textual similarity with siamese CNN and LSTM | |
CN110232185B (en) | Knowledge graph semantic similarity-based computing method for financial industry software testing | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
JP5391633B2 (en) | Term recommendation to define the ontology space | |
CN106372061B (en) | Short text similarity calculation method based on semantics | |
JP3882048B2 (en) | Question answering system and question answering processing method | |
JP5391634B2 (en) | Selecting tags for a document through paragraph analysis | |
Varma et al. | IIIT Hyderabad at TAC 2009. | |
CN109783806B (en) | Text matching method utilizing semantic parsing structure | |
JP5710581B2 (en) | Question answering apparatus, method, and program | |
Zhou et al. | New model of semantic similarity measuring in wordnet | |
JP2009093651A (en) | Modeling topics using statistical distribution | |
JP2009093654A (en) | Determinion of document specificity | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN109408802A (en) | A kind of method, system and storage medium promoting sentence vector semanteme | |
CN104765779A (en) | Patent document inquiry extension method based on YAGO2s | |
Hussein | Visualizing document similarity using n-grams and latent semantic analysis | |
Sebti et al. | A new word sense similarity measure in WordNet | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
Danilova | Cross-language plagiarism detection methods | |
Vij et al. | Fuzzy logic for inculcating significance of semantic relations in word sense disambiguation using a WordNet graph | |
Bella et al. | Domain-based sense disambiguation in multilingual structured data | |
Godoy et al. | Leveraging semantic similarity for folksonomy-based recommendation | |
Zhang et al. | An approach for named entity disambiguation with knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||