CN110232185B - Knowledge graph semantic similarity-based computing method for financial industry software testing - Google Patents

Knowledge graph semantic similarity-based computing method for financial industry software testing

Info

Publication number
CN110232185B
CN110232185B
Authority
CN
China
Prior art keywords
word segmentation
concept
text
concepts
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910010902.6A
Other languages
Chinese (zh)
Other versions
CN110232185A (en)
Inventor
杜广龙
陈震星
李方
周文沛
孙慧
姚庚成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Chinasoft Huateng Software System Co ltd
South China University of Technology SCUT
Original Assignee
Shanghai Chinasoft Huateng Software System Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Chinasoft Huateng Software System Co ltd, South China University of Technology SCUT
Priority to CN201910010902.6A
Publication of CN110232185A
Application granted
Publication of CN110232185B
Active legal status: Current
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph semantic similarity-based computing method for financial industry software testing, which comprises the following steps: S1, performing word segmentation on a financial text; S2, selecting the word segmentation combination most relevant to the topic of the text; S3, calculating the semantic similarity of the segmented word groups by utilizing the knowledge graph, with the minimum path length weighted by the concept IC. The knowledge-graph-based natural semantic detection algorithm first segments the financial text with several word segmentation algorithms to obtain candidate word segmentation combinations, then measures how well each combination matches the text topic by computing the concept distances between its words and the text keywords, and finally selects the combination with the smallest sum of concept distances for semantic similarity detection. The information content (IC) of concepts in the knowledge graph is used to weight the shortest path length between concepts, and the method achieves better accuracy than other approaches.

Description

Knowledge graph semantic similarity-based computing method for financial industry software testing
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a knowledge graph semantic similarity-based computing method for financial industry software testing.
Background
Semantic similarity detection between natural-language texts is widely used in many fields such as information retrieval, machine translation, and automatic question answering. A sentence is composed of individual words, including subjects, predicates, various stop words, and so on; even the same words, in different combinations and different contexts, can have entirely different meanings. In recent years many statistics-based calculation methods have been proposed, but they ignore the semantic and structural information of the text, so the results sometimes do not match human understanding of natural language; moreover, the current corpus-based IC may carry ambiguous concept meanings, because the IC is computed from word occurrences in the corpus and a word may map to several concepts at the same time. Nowadays many public Knowledge Graphs (KG) are available, containing large numbers of concepts, entities and the relationships between them. Using a knowledge graph to detect natural-language similarity makes it possible to capture the semantic information behind the text accurately and to return more precise, structured information. However, most knowledge-graph applications for semantic similarity detection target English text, and applications to Chinese are rare. Chinese financial text differs from English text: words are not separated by spaces, and the grammatical structure of Chinese is relatively more complex. Given these features, the Chinese sentences must be segmented into words before semantic similarity is detected with the knowledge graph.
Disclosure of Invention
Existing semantic similarity research mainly measures the similarity of texts either by the depth of and path length between concepts or by the Information Content (IC) of concepts. Addressing this, the invention provides a knowledge-graph-based semantic similarity calculation method for financial industry software testing, which weights the minimum path length with the Information Content (IC) of concepts and proposes a graph-based IC computed from the distribution of instances over concepts. The specific steps are as follows:
S1, word segmentation operation is carried out on a financial text;
S2, selecting the word segmentation combination most relevant to the topic of the text;
S3, calculating the semantic similarity of the financial text by utilizing a Knowledge Graph (KG), with the minimum path length weighted by the concept IC.
Further, step S1 includes:
dividing Chinese sentences through a plurality of word segmentation algorithms to obtain different word segmentation combinations;
algorithm I, jieba (resultant word)
1) All possible word formations in the sentence are generated from the built-in dictionary of the Jieba segmenter, using the trie tree generated from that dictionary, and assembled into a directed acyclic graph (Directed Acyclic Graph, DAG).
2) The maximum-probability path is then found by dynamic programming. Because the centre of gravity of a Chinese sentence usually lies in its second half, the sentence is traversed in reverse, from right to left, using the frequency of each word in the dictionary as its word frequency; the maximum-probability path is found, and the word segmentation combination with the highest probability is finally obtained.
For words that do not appear in the dictionary, an HMM (Hidden Markov Model) based on the word-forming ability of Chinese characters is adopted, and the Viterbi algorithm is used to segment these out-of-vocabulary words.
Algorithm 2: Ansj Chinese word segmenter
Algorithm 2 is a Java implementation of Chinese word segmentation based on an n-gram language model, CRF (Conditional Random Field) and HMM (Hidden Markov Model).
Algorithm 3: SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the text retrieval system Lucene; it computes word statistics over a large corpus with a Hidden Markov Model (HMM) and then uses those statistics to determine the best word segmentation combination of the text.
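As an illustration of step S1, the following minimal Python sketch collects candidate word segmentation combinations. Only the jieba library is shown concretely; Ansj and SmartChineseAnalyzer are Java tools, so they appear here only as optional caller-supplied segmenter functions, and the example sentence is illustrative rather than taken from the patent.

# Minimal sketch of step S1: collect candidate word segmentation combinations.
# Only jieba is called directly; extra_segmenters is a placeholder for wrappers
# around the Java-based Ansj and SmartChineseAnalyzer segmenters.
import jieba

def segment_with_jieba(sentence):
    # Precise mode; HMM=True enables the Viterbi-based handling of OOV words.
    return list(jieba.cut(sentence, HMM=True))

def candidate_segmentations(sentence, extra_segmenters=()):
    combos = [segment_with_jieba(sentence)]
    for segmenter in extra_segmenters:
        combos.append(list(segmenter(sentence)))
    return combos

if __name__ == "__main__":
    for combo in candidate_segmentations("接口新增成功"):   # illustrative financial test sentence
        print(" | ".join(combo))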
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved DBpedia-based (knowledge graph) algorithm is adopted to obtain the word segmentation combination most relevant to the topic of the text, so as to improve the accuracy of the segmentation result. The relevance between a word and the text topic is measured by finding the concept distance between the word and the text keywords. A concept is a set of entities of the same nature, an entity refers to a particular instance, and an instance carries a specific description within the concept, such as "broad counter" under the concept "counter".
Further, selecting the word segmentation combination most relevant to the text topic comprises the following steps:
1) Keywords are first extracted from each word segmentation combination using the TextRank algorithm, which is adapted from Google's web page ranking algorithm PageRank. Within a word segmentation combination, the window size is set to m, giving windows [w_1, w_2, …, w_m], [w_2, w_3, …, w_{m+1}], [w_3, w_4, …, w_{m+2}], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes that occur in the same window. The edges then act as mutual votes between words, and after repeated iteration the number of votes received by each node stabilizes. The importance of a word is judged by comparing the vote counts of the nodes, and the m words with the most votes are taken as the keywords of the text. The importance formula of a node is as follows:
S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|   (1)
where d is a damping coefficient representing the probability that a node points to any other node, usually taken as 0.85; In(V_i) is the set of nodes pointing to node V_i, and Out(V_j) is the set of nodes pointed to from node V_j.
2) Selecting the word segmentation result that minimizes the concept distance:
The topic relevance of a word segmentation combination to text N is measured by the average concept distance between each word ω of the combination and the m keywords {k_1, k_2, …, k_m} of text N:
Dis(ω, N) = (1/m) × Σ_{i=1}^{m} DBpediaDis(ω, k_i)   (2)
The word segmentation combination with the smallest sum of concept distances Dis(ω, N) is regarded as the combination most relevant to the text topic; the concept distance DBpediaDis(ω, k_i) is computed in the DBpedia database. This yields the word segmentation combination of the financial text, on which semantic similarity detection is then performed.
Further, in the Knowledge Graph (KG), the linguistic information includes the semantic distances between different concepts, that is, the distances between nodes: the closer the semantic distance between two concepts, the shorter the shortest path between them, indicating a higher similarity between the two.
Further, the concept IC measures the amount of information carried by a concept: a low IC value means a low information content, and here the IC value is calculated from the knowledge graph. The more specific a concept, the higher its IC value, and the more information two concepts share, the more similar they are. In a similarity algorithm based only on the concept IC, however, two concept pairs receive the same similarity whenever their LCS (nearest shared node) is the same.
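Before turning to step S3, the selection procedure of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: concept_distance stands in for the DBpedia-based distance DBpediaDis, and the function names and parameter defaults are assumptions.

from collections import defaultdict
from itertools import combinations

def textrank_keywords(words, window=5, d=0.85, iterations=50, top_m=5):
    # Build an undirected co-occurrence graph: words co-occurring in a sliding
    # window of size `window` vote for each other through shared edges.
    window = min(window, len(words))
    neighbors = defaultdict(set)
    for start in range(len(words) - window + 1):
        for a, b in combinations(words[start:start + window], 2):
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):   # iterate formula (1) until the scores stabilize
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
                 for w in neighbors}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_m]]

def best_segmentation(combos, concept_distance):
    # Pick the combination whose words have the smallest summed average
    # concept distance to that combination's own keywords.
    def total_distance(words):
        keywords = textrank_keywords(words)
        if not keywords:
            return float("inf")
        return sum(sum(concept_distance(w, k) for k in keywords) / len(keywords)
                   for w in words)
    return min(combos, key=total_distance)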
Further, step S3 includes:
(1) KG is defined as a directed labeled graph G = (V, E, τ), where V is the set of all nodes in the knowledge graph, E is the set of edges connecting the nodes, and τ: V × V → E is a function defining all triples in G. In the D-path algorithm, the semantic similarity between concepts is defined as:
sim_wpath(c_i, c_j) = 1 / (1 + length(c_i, c_j) × k^IC(c_lcs))   (3)
where sim_wpath(c_i, c_j) measures the semantic similarity between the two concepts; c_i and c_j are two different concepts, c_i, c_j ∈ V, and k ∈ (0, 1], the parameter k representing the contribution of the concepts' common information to the similarity. IC(c_lcs) is the IC value of the nearest shared node; c_lcs is the nearest shared node of the two concepts, the LCS being the most specific concept among the common ancestors of c_i and c_j. The shortest path length length(c_i, c_j) is weighted with the IC value of the nearest shared node of the two concepts. Let Paths(c_i, c_j) = {P_1; P_2; … P_n} be the set of paths connecting concepts c_i and c_j; there may be several paths between two concepts, P_1 to P_n in total, and with P_i ∈ Paths(c_i, c_j),
length(c_i, c_j) = min(|P_i|)   (4)
so that length(c_i, c_j) is the shortest path length between the two concepts.
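A one-function Python sketch of formula (3) as reconstructed above; the function name is illustrative, and k defaults to 0.9 only because that value is used in the embodiment below.

def wpath_similarity(path_length, ic_lcs, k=0.9):
    # Formula (3): the shortest path length is weighted by k ** IC(c_lcs).
    # Identical concepts have path length 0 and therefore similarity 1.
    return 1.0 / (1.0 + path_length * k ** ic_lcs)

print(wpath_similarity(0, 0.0))   # 1.0 for identical concepts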
(2) A graph-based IC is used, i.e., the semantic similarity between concepts is computed over a specific KG. The IC of a concept in the KG is calculated from the distribution of instances in the concept taxonomy, so the graph-based IC does not depend on an external corpus.
The graph-based IC is defined in the KG as:
IC_graph(c_i) = -log Prob(c_i)   (5)
where
Prob(c_i) = frequency(c_i) / N   (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept c_i in the knowledge graph is defined as:
frequency(c_i) = count(c_i)   (7)
where count is a simple function that computes the cardinality of the concept's entity set.
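A minimal sketch of equations (5)-(7), assuming count(c_i) and N have already been retrieved from the knowledge graph (for example with the SPARQL query given later in the embodiment); the numbers in the usage line are made up for illustration.

import math

def graph_ic(concept_entity_count, total_entities):
    # Equations (5)-(7): IC_graph(c_i) = -log(frequency(c_i) / N),
    # with frequency(c_i) = count(c_i).
    return -math.log(concept_entity_count / total_entities)

print(graph_ic(concept_entity_count=120, total_entities=1_000_000))   # ≈ 9.03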
The IC in the equation may be either the conventional corpus-based IC or the graph-based IC. The graph-based IC is as effective as the corpus-based IC, and when the corpus is insufficient or the IC has to be computed online, the graph-based IC is a good complement to the conventional corpus-based approach. Since the ICs in a knowledge graph express the type of an instance through disambiguated concepts, the graph-based IC captures the specific meaning of each concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, because the IC is computed from word occurrences in the corpus and a word may map to several concepts at the same time.
In the Knowledge Graph (KG), the most intuitive linguistic information is the semantic distance between different concepts: the closer the semantic distance, the shorter the shortest path between the concepts, indicating a higher similarity between the two. However, this approach cannot distinguish concept pairs that have the same depth and path length; the weakness of a similarity algorithm based on path length and depth is that any two concept pairs with the same path length and depth receive the same semantic similarity. As shown in Fig. 1, sim(concept 7, concept 8) and sim(concept 10, concept 11) are the same, because the shortest path lengths and depths of these concept pairs are the same. IC is a statistical measure of the amount of information carried by a concept: a general concept carries little information and thus has a low IC value, while a more specific concept has a higher IC value, and the more information two concepts share, the more similar they are. But using only the information content of concepts ignores the valuable distance information between concepts in the concept taxonomy, and the semantic distance between concepts is an effective measure of their similarity. The weakness of a similarity algorithm based only on the concept IC is that two concept pairs with the same LCS (nearest shared node) receive the same semantic similarity; for example, sim(concept 4, concept 5) and sim(concept 7, concept 9) are the same, because the LCS of both pairs is concept 2. Therefore, the two methods are combined: the IC of the two concepts' LCS is used to weight the shortest path length between the concepts, which describes the semantic similarity between concepts better.
The proposed D-path algorithm assigns different weights to the shortest path length of two concepts according to their shared information. When the two concepts are identical, the path length is 0 and the similarity score reaches its maximum of 1; as the path length grows the similarity score decreases, so the score range of the algorithm is (0, 1]. When two concept pairs have the same path length, the pair that shares more information is the more similar one. Although the algorithm does not use depth information explicitly, the IC of the LCS behaves much like concept depth: concepts deeper in the concept taxonomy carry more specific information and are therefore more similar to each other.
In summary, the D-path algorithm solves the problem that concept pairs with the same depth and path length receive the same similarity score: as described in step three, it weights the minimum path length between a concept pair with the concepts' shared information, so that pairs with the same minimum path but different shared information receive different scores. In this way the valuable distance information in the concept taxonomy is preserved, and statistical information is obtained to represent the consistency of the taxonomic structure between concepts.
Compared with the prior art, the invention has the following advantages and effects:
the natural semantic detection algorithm based on the knowledge graph firstly performs word segmentation on the financial text by utilizing a plurality of word segmentation algorithms to obtain word segmentation combinations, then calculates the concept distances of words and text keywords to measure the similarity between the word segmentation combinations and text topics, and finally selects the word segmentation combination with the minimum sum of the concept distances to perform semantic similarity detection. The information IC of the concepts is used in the knowledge graph to weight the shortest path length between the concepts, the accuracy is better than that of other methods, the knowledge graph is used for detecting the similarity of natural language, the semantic information behind the detected text can be accurately known, and more accurate and structured information is returned.
The IC value is computed from the graph, i.e., from the knowledge graph, without any corpus, so the situation in which one word is mapped to several concepts at the same time essentially does not arise.
Drawings
Fig. 1 is a conceptual classification diagram of an embodiment.
Fig. 2 is a schematic diagram of a financial text classification according to an embodiment.
Fig. 3 is the algorithm flow chart of the embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples, but embodiments of the present invention are not limited thereto.
Examples:
As shown in Fig. 3, the knowledge-graph-semantic-similarity-based computing method for financial industry software testing comprises the following steps:
S1, word segmentation operation is carried out on a financial text;
S2, selecting the word segmentation combination most relevant to the topic of the text;
S3, calculating the semantic similarity of the segmented word groups by utilizing the knowledge graph, with the minimum path length weighted by the concept IC.
Further, step S1 includes:
dividing Chinese sentences through a plurality of word segmentation algorithms to obtain different word segmentation combinations;
1. jieba
1) All possible word formations in the sentence are generated from Jieba's built-in dict.txt dictionary, using the trie tree generated from that dictionary, and assembled into a directed acyclic graph (DAG);
2) The maximum-probability path is found by dynamic programming; because the centre of gravity of a Chinese sentence usually lies in its second half, the sentence is traversed in reverse, from right to left, using the frequency of each word in the dictionary as its word frequency, the maximum-probability path is found, and the word segmentation combination with the highest probability is finally obtained;
For words that do not appear in the dictionary, an HMM model based on the word-forming ability of Chinese characters is adopted, and the Viterbi algorithm is used to segment these out-of-vocabulary words.
2. Ansj
This is a Java implementation of Chinese word segmentation based on CRF, n-gram and HMM models.
3. SmartChineseAnalyzer
SmartChineseAnalyzer is a tool provided by the text retrieval system Lucene; it computes word statistics over a large corpus with a Hidden Markov Model (HMM) and then uses those statistics to determine the best word segmentation combination of the text.
Further, the step S2 includes:
For the given Chinese word segmentation combinations, if the three segmentation results are identical, no further processing is performed; if they differ, an improved DBpedia-based algorithm is adopted to obtain the word segmentation combination most relevant to the topic of the text, so as to improve the accuracy of the segmentation result. The relevance between a word and the text topic is measured by finding the concept distance between the word and the text keywords.
1) Firstly, keywords are extracted from each word segmentation combination with the TextRank algorithm, which is adapted from Google's web page ranking algorithm PageRank. Within a word segmentation combination, the window size is set to m, giving windows [w_1, w_2, …, w_m], [w_2, w_3, …, w_{m+1}], [w_3, w_4, …, w_{m+2}], and so on. Each word in the text is treated as a node, and an edge exists between any two word nodes that occur in the same window. The edges then act as mutual votes between words, and after repeated iteration the vote counts stabilize. The importance of a word is judged by comparing its vote count, and the keywords of the text are thus obtained, the importance of a node being given by formula (1):
S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|   (1)
where d is a damping coefficient representing the probability that a node points to any other node, usually 0.85; In(V_i) is the set of nodes pointing to node V_i, and Out(V_j) is the set of nodes pointed to from node V_j.
2) Selecting the word segmentation result that minimizes the concept distance:
The topic relevance of a word segmentation combination to text N is measured by the average concept distance between each word ω of the combination and the m keywords {k_1, k_2, …, k_m} of text N, and the word segmentation combination with the smallest sum of concept distances Dis(ω, N) is regarded as the combination most relevant to the text topic. The concept distance DBpediaDis(ω, k_i) is computed in the DBpedia database. This yields the word segmentation combination of the financial text, on which the following semantic similarity detection is performed.
1) KG is defined as a directed labeled graph G = (V, E, τ), where V is the set of all nodes in the knowledge graph, E is the set of edges connecting the nodes, and τ: V × V → E is a function defining all triples in G. In the present algorithm, the semantic similarity between concepts is defined as:
sim_Dpath(c_i, c_j) = 1 / (1 + length(c_i, c_j) × k^IC(c_lcs))   (3)
where sim_Dpath(c_i, c_j) measures the semantic similarity between the two concepts, c_i, c_j ∈ V, and k ∈ (0, 1]; the parameter k represents the contribution of the concepts' common information to the similarity, and in this example k = 0.9. The LCS is the most specific concept among the common ancestors of c_i and c_j, i.e., their nearest shared node; for example, in Fig. 2 the LCS of the concept "upload field" and the concept "counter" is the concept "interfaces". Let Paths(c_i, c_j) = {P_1; P_2; … P_n} be the set of paths connecting concepts c_i and c_j; there may be several paths between two concepts, P_1 to P_n in total, and with P_i ∈ Paths(c_i, c_j),
length(c_i, c_j) = min(|P_i|)   (4)
so that length(c_i, c_j) is the shortest path length between the two concepts.
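A small Python sketch of formula (4) using the networkx package; the toy taxonomy below is made up for illustration (it borrows concept names from Fig. 2 but is not the figure's actual taxonomy).

import networkx as nx

# Toy concept taxonomy; edges link a concept to its sub-concepts.
taxonomy = nx.Graph()
taxonomy.add_edges_from([
    ("interfaces", "upload field"),
    ("interfaces", "return field"),
    ("interfaces", "counter"),
    ("interfaces", "VTM"),
])

# Formula (4): length(c_i, c_j) is the minimum number of edges over all paths.
print(nx.shortest_path_length(taxonomy, "upload field", "counter"))   # 2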
2) The graph-based IC is used, i.e., the semantic similarity between concepts is computed over a specific KG, with the IC of a concept calculated from the distribution of instances in the concept taxonomy. For example, in Fig. 2 the IC of the concept "counter" is obtained from the entities below "counter", such as the "counter" entity of the issuing bank; this calculation is independent of any external corpus.
The graph-based IC is defined in the KG as:
IC_graph(c_i) = -log Prob(c_i)   (5)
where
Prob(c_i) = frequency(c_i) / N   (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept c_i in the knowledge graph is defined as:
frequency(c_i) = count(c_i)   (7)
where count is a simple function that computes the cardinality of the concept's entity set. The count function can be implemented with the SPARQL query language:
# counts all entities of type owl:Thing, i.e. N in equation (6)
SELECT (COUNT(?ie) AS ?n) WHERE
{
    ?ie rdf:type owl:Thing .
}
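The query can be run from Python with the SPARQLWrapper package, for example against the public DBpedia endpoint; the endpoint URL, the result variable ?n and the client library are illustrative choices, not prescribed by the patent.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")   # illustrative endpoint
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT (COUNT(?ie) AS ?n) WHERE {
        ?ie rdf:type owl:Thing .
    }
""")
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
total_entities = int(result["results"]["bindings"][0]["n"]["value"])   # N in equation (6)
print(total_entities)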
The IC in the equation may be either the conventional corpus-based IC or the graph-based IC. The graph-based IC is as effective as the corpus-based IC, and when the corpus is insufficient or the IC has to be computed online, the graph-based IC is a good complement to the conventional corpus-based approach. Since the ICs in a knowledge graph express the type of an instance through disambiguated concepts, the graph-based IC captures the specific meaning of each concept, whereas corpus-based IC algorithms may carry ambiguous concept meanings, because the IC is computed from word occurrences in the corpus and a word may map to several concepts at the same time.
In the Knowledge Graph (KG), the most intuitive linguistic information is the semantic distance between different concepts: the closer the semantic distance, the shorter the shortest path between the concepts, indicating a higher similarity between the two. However, this approach cannot distinguish concept pairs that have the same depth and path length; the weakness of a similarity algorithm based on path length and depth is that any two concept pairs with the same path length and depth receive the same semantic similarity. Here sim(c_i, c_j) measures the semantic similarity between two concepts, and the depth of a concept c_i ∈ V is defined as the shortest path length from c_i to the root concept (the topmost concept). In the concept taxonomy shown in Fig. 2, sim(upload field, return field) and sim(counter, VTM) are the same, because the shortest path lengths and depths of these concept pairs are the same. IC is a statistical measure of the amount of information carried by a concept: a general concept carries little information and thus has a low IC value, while a more specific concept has a higher IC value, and the more information two concepts share, the more similar they are. But using only the information content of concepts ignores the valuable distance information between concepts in the concept taxonomy, and the semantic distance between concepts is an effective measure of their similarity. The weakness of a similarity algorithm based only on the concept IC is that two concept pairs with the same LCS (nearest shared node) receive the same semantic similarity; as shown in Fig. 2, sim(upload field, counter) and sim(return field, VTM) are the same, because the LCS of both pairs is the same.
Therefore, the two methods are combined: the IC of the two concepts' LCS is used to weight the shortest path length between the concepts, which describes the semantic similarity between concepts better.
The text similarity detection in the financial industry will now be illustrated with an example:
Given sentence 1, "interface add success", and sentence 2, "interface return failure":
First, the three word segmentation algorithms are applied, yielding three word segmentation combinations; in this example the three combinations are identical, so this result is taken as the word segmentation of the text. The segmentation of sentence 1 is: interface | add | success; the segmentation of sentence 2 is: interface | return | failure.
Second, the shortest path between the concept "interface" and the concept "interface" is 0; the shortest path between the concept "add" and the concept "return" is 4, and the IC of their LCS is 8.6219; the shortest path between the concept "success" and the concept "failure" is 7, and the IC of their LCS is 20.4752.
Finally, using formula (3), the similarity score of the concept "interface" and the concept "interface" is 1; the similarity score of the concept "add" and the concept "return" is 0.3828; and the similarity score of the concept "success" and the concept "failure" is 0.5526.
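These scores follow directly from formula (3) with k = 0.9 and can be checked with a few lines of Python (the helper name is illustrative):

def wpath(path_length, ic_lcs, k=0.9):
    return 1.0 / (1.0 + path_length * k ** ic_lcs)

print(round(wpath(0, 0.0), 4))        # "interface" vs "interface": 1.0
print(round(wpath(4, 8.6219), 4))     # "add" vs "return": 0.3828
print(round(wpath(7, 20.4752), 4))    # "success" vs "failure": 0.5526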
the closer the score is to 1, the higher the similarity of the two concepts.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (4)

1. A knowledge-graph-based semantic similarity calculation method for financial industry software testing, characterized by comprising the following steps:
S1, word segmentation operation is carried out on a financial text;
S2, selecting the word segmentation combination most relevant to the topic of the text, specifically comprising: if the three word segmentation results are the same, the three word segmentation combinations are not processed further; if the word segmentation results differ, an improved knowledge-graph-based algorithm is adopted to obtain the word segmentation combination most relevant to the text topic, wherein the relevance between a word and the text topic is measured by finding the concept distance between the word and the text keywords, a concept is a set of entities with the same characteristics, an entity refers to a particular instance, and an instance comprises a specific description of an entity within the concept; the word segmentation combination most relevant to the text topic is selected as follows:
1) Firstly, keywords are extracted from each word segmentation combination using the TextRank algorithm; within a word segmentation combination the window size is set to m, and a sentence consists in order of the words w_1, w_2, …, giving windows [w_1, w_2, …, w_m], [w_2, w_3, …, w_{m+1}], [w_3, w_4, …, w_{m+2}], and so on; each word in the text is regarded as a node, an edge exists between any two nodes in the same window, the edges are then treated as mutual votes between words according to the voting principle, and after repeated iteration the number of votes obtained by each node stabilizes; the importance of a word is judged by comparing the vote counts of the nodes, and the m words with the most votes are taken as the keywords of the text; the importance formula of a node is as follows:
S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|   (1)
wherein d is a damping coefficient representing the probability that a node points to an arbitrary node, In(V_i) is the set of nodes pointing to node V_i, Out(V_j) is the set of nodes pointed to from node V_j, S(V_i) and S(V_j) represent the importance of the respective nodes, i and j denote different nodes, and the total number of nodes is m;
2) Selecting the word segmentation result that minimizes the concept distance:
the topic relevance of a word segmentation combination to text N is measured by the average concept distance between each word ω of the combination and the m keywords {k_1, k_2, …, k_m} of text N, the word segmentation combination with the smallest sum of concept distances Dis(ω, N) is regarded as the combination most relevant to the text topic, and the concept distance DBpediaDis(ω, k_i) is calculated in the DBpedia database;
S3, calculating the semantic similarity of the segmented word groups of the financial text by utilizing the knowledge graph, with the minimum path length weighted by the concept IC; the concept IC measures the amount of information of a concept, a low IC value indicating a low information content, and the IC value is calculated based on the knowledge graph.
2. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S1 includes:
dividing Chinese sentences through three word segmentation algorithms to obtain different word segmentation combinations;
algorithm one, jieba
1) Generating all possible word formations in the sentence from the built-in dict.txt dictionary of the Jieba segmenter, using the trie tree generated from that dictionary, and forming a directed acyclic graph (DAG);
2) Performing reverse calculation on the sentence from right to left by a dynamic programming method, taking the frequency of each word in the dictionary as its word frequency, finding the maximum-probability path, and finally obtaining the word segmentation combination with the highest probability;
for words that do not appear in the dictionary, a hidden Markov model based on the word-forming ability of Chinese characters is adopted, and the Viterbi algorithm is used to segment the words not recorded in the dictionary;
algorithm two, the Ansj Chinese word segmenter:
the second algorithm is a Java implementation of Chinese word segmentation based on an n-gram language model, a conditional random field and a hidden Markov model;
algorithm three, the SmartChineseAnalyzer Chinese word segmenter:
Word statistics are computed over a large corpus with the hidden Markov model, and the statistics are then used to determine the optimal word segmentation combination of the text.
3. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the language information in the knowledge graph includes semantic distances of different concepts, namely, distances between nodes; the closer the semantic distance of the concepts, the smaller the shortest path between the concepts, indicating a higher similarity between the two.
4. The knowledge-graph-based semantic similarity calculation method according to claim 1, wherein the step S3 includes: 1) KG is defined as a directed labeled graph, G = (V, E, τ), where G denotes the defined KG, i.e., the directed labeled graph, V represents all nodes in the knowledge graph, E represents the edges connecting the nodes, and τ: V × V → E is a function defining all triples in G; in the D-path algorithm, the semantic similarity between concepts is defined as:
sim_wpath(c_i, c_j) = 1 / (1 + length(c_i, c_j) × k^IC(c_Lcs))   (3)
wherein sim_wpath(c_i, c_j) measures the semantic similarity between two concepts, c_i and c_j are two different concepts, c_i, c_j ∈ V, k ∈ (0, 1], and the parameter k represents the contribution of the common information of the two concepts to the similarity; IC(c_Lcs) is the IC value of the nearest shared node, c_Lcs is the nearest shared node of the two concepts, and the LCS is the most specific concept among the common ancestors of c_i and c_j; the shortest path length length(c_i, c_j) is weighted with the IC value of the nearest shared node of the two concepts; let Paths(c_i, c_j) = {P_1; P_2; … P_n} be the set of paths connecting concepts c_i and c_j, where there may be several paths between the two concepts, P_1 to P_n in total; length(c_i, c_j) denotes the shortest of all these paths, and with P_i ∈ Paths(c_i, c_j),
length(c_i, c_j) = min(|P_i|)   (4)
so that length(c_i, c_j) is the shortest path length between the two concepts;
2) A graph-based IC, i.e., a KG-based IC, is used to calculate the semantic similarity between concepts; the concept IC in the KG is calculated from the distribution of instances in the concept taxonomy, and the graph-based IC is defined in the KG as:
IC_graph(c_i) = -log Prob(c_i)   (5)
wherein
Prob(c_i) = frequency(c_i) / N   (6)
N represents the total number of entities in the knowledge graph, and the frequency of concept c_i in the knowledge graph is defined as:
frequency(c_i) = count(c_i)   (7)
where count is a simple function that computes the cardinality of the concept's entity set.
CN201910010902.6A 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing Active CN110232185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910010902.6A CN110232185B (en) 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910010902.6A CN110232185B (en) 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing

Publications (2)

Publication Number Publication Date
CN110232185A CN110232185A (en) 2019-09-13
CN110232185B true CN110232185B (en) 2023-09-19

Family

ID=67860089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910010902.6A Active CN110232185B (en) 2019-01-07 2019-01-07 Knowledge graph semantic similarity-based computing method for financial industry software testing

Country Status (1)

Country Link
CN (1) CN110232185B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110618987A (en) * 2019-09-18 2019-12-27 宁夏大学 Treatment pathway key node information processing method based on lung cancer medical big data
CN110941612B (en) * 2019-11-19 2020-08-11 上海交通大学 Autonomous data lake construction system and method based on associated data
CN111125339B (en) * 2019-11-26 2023-05-09 华南师范大学 Test question recommendation method based on formal concept analysis and knowledge graph
CN112328810B (en) * 2020-11-11 2022-10-14 河海大学 Knowledge graph fusion method based on self-adaptive mixed ontology mapping
CN114168751A (en) * 2021-12-06 2022-03-11 厦门大学 Medical knowledge concept graph-based medical text label identification method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101623860B1 (en) * 2015-04-08 2016-05-24 서울시립대학교 산학협력단 Method for calculating similarity between document elements
CN106610951A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Improved text similarity solving algorithm based on semantic analysis
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108090077A (en) * 2016-11-23 2018-05-29 中国科学院沈阳计算技术研究所有限公司 A kind of comprehensive similarity computational methods based on natural language searching

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720675B2 (en) * 2003-10-27 2010-05-18 Educational Testing Service Method and system for determining text coherence
US10176260B2 (en) * 2014-02-12 2019-01-08 Regents Of The University Of Minnesota Measuring semantic incongruity within text data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101623860B1 (en) * 2015-04-08 2016-05-24 서울시립대학교 산학협력단 Method for calculating similarity between document elements
CN106610951A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Improved text similarity solving algorithm based on semantic analysis
CN108090077A (en) * 2016-11-23 2018-05-29 中国科学院沈阳计算技术研究所有限公司 A kind of comprehensive similarity computational methods based on natural language searching
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model

Also Published As

Publication number Publication date
CN110232185A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN110232185B (en) Knowledge graph semantic similarity-based computing method for financial industry software testing
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN108197117B (en) Chinese text keyword extraction method based on document theme structure and semantics
JP3882048B2 (en) Question answering system and question answering processing method
Jabbar et al. Empirical evaluation and study of text stemming algorithms
JP4778474B2 (en) Question answering apparatus, question answering method, question answering program, and recording medium recording the program
Lossio-Ventura et al. Yet another ranking function for automatic multiword term extraction
CN109783806B (en) Text matching method utilizing semantic parsing structure
CN106372061A (en) Short text similarity calculation method based on semantics
CN103399901A (en) Keyword extraction method
US8812504B2 (en) Keyword presentation apparatus and method
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN111221968B (en) Author disambiguation method and device based on subject tree clustering
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN109408802A (en) A kind of method, system and storage medium promoting sentence vector semanteme
Hong Deep web data extraction
CN104765779A (en) Patent document inquiry extension method based on YAGO2s
Hussein Visualizing document similarity using n-grams and latent semantic analysis
Alian et al. Arabic semantic similarity approaches-review
Hussein Arabic document similarity analysis using n-grams and singular value decomposition
Sebti et al. A new word sense similarity measure in WordNet
CN111428031B (en) Graph model filtering method integrating shallow semantic information
Shajalal et al. Semantic textual similarity in bengali text
Saghayan et al. Exploring the impact of machine translation on fake news detection: A case study on persian tweets about covid-19

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant