WO2022142027A1 - Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium - Google Patents

Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium

Info

Publication number
WO2022142027A1
WO2022142027A1 PCT/CN2021/091060 CN2021091060W
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
node
knowledge graph
text data
Prior art date
Application number
PCT/CN2021/091060
Other languages
English (en)
French (fr)
Inventor
王昊
张乐情
罗水权
刘剑
李燕婷
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022142027A1 publication Critical patent/WO2022142027A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367: Ontology
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates

Definitions

  • the present application relates to the field of computer technology, and in particular, to a knowledge graph-based fuzzy matching method, apparatus, computer equipment and storage medium.
  • Fuzzy matching technology refers to returning a description related to the query keyword according to the query keyword.
  • a common fuzzy matching method is that a search engine returns relevant web links according to query keywords.
  • The methods often used are statistics-based inverted indexing and neural-network-based similarity computation.
  • In the statistics-based inverted-index approach, the text is first segmented into words, an inverted index is built over the resulting keywords, and the text database is bucketed or hashed according to that index.
  • In the neural-network approach, a training corpus is prepared through manual or semi-manual labeling and used to train a similarity model under supervision; the model either takes a single text as input and outputs a hidden vector, or takes two texts as input and directly outputs a similarity score.
  • According to various embodiments disclosed in the present application, a knowledge graph-based fuzzy matching method, apparatus, computer device and storage medium are provided.
  • a fuzzy matching method based on knowledge graph includes:
  • the constructed knowledge graph is queried, and the knowledge graph node text containing the query keywords is obtained.
  • the constructed knowledge graph takes text data as nodes and the text similarity corresponding to the text data as the node connection relationship;
  • the keyword score corresponding to the query keyword is obtained, and according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set are obtained;
  • a fuzzy matching device based on knowledge graph includes:
  • a receiving module configured to receive a retrieval request carrying a retrieval sentence, perform word segmentation on the retrieval sentence, and obtain a query word bag including query keywords;
  • the first query module is used to query the constructed knowledge graph according to the query word bag, and obtain the knowledge graph node text containing the query keywords.
  • the constructed knowledge graph uses text data as nodes and the text similarity corresponding to the text data as the node connection relationship;
  • the second query module is used to query the constructed knowledge graph according to the node text of the knowledge graph, and obtain a set of similar texts corresponding to the node text of the knowledge graph according to the node connection relationship;
  • the processing module is used to obtain the keyword score corresponding to the query keyword according to the preset feature word score table, and to obtain, according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set;
  • the sorting module is used for sorting the knowledge graph node texts and similar node texts according to the first retrieval score and the second retrieval score to obtain retrieval results corresponding to the retrieval sentences.
  • a computer device comprising a memory and one or more processors, the memory having computer-readable instructions stored therein which, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • the constructed knowledge graph is queried, and the knowledge graph node text containing the query keywords is obtained.
  • the constructed knowledge graph takes text data as nodes and the text similarity corresponding to the text data as the node connection relationship;
  • the keyword score corresponding to the query keyword is obtained, and according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set are obtained;
  • One or more computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the constructed knowledge graph is queried, and the knowledge graph node text containing the query keywords is obtained.
  • the constructed knowledge graph takes text data as nodes and the text similarity corresponding to the text data as the node connection relationship;
  • the keyword score corresponding to the query keyword is obtained, and according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set are obtained;
  • In the above solutions, a query word bag including query keywords is obtained by segmenting the retrieval sentence, and the constructed knowledge graph is queried according to the query word bag, so that the knowledge graph node text containing the query keywords can be obtained.
  • The constructed knowledge graph can then be further queried according to the knowledge graph node text, and the similar text set corresponding to the knowledge graph node text can be obtained according to the node connection relationship.
  • Finally, the keyword scores and the node connection relationships between nodes are used to calculate the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set, and the knowledge graph node texts and similar node texts are sorted according to the first retrieval score and the second retrieval score to obtain the retrieval result corresponding to the retrieval sentence.
  • FIG. 1 is an application scenario diagram of a fuzzy matching method based on a knowledge graph according to one or more embodiments
  • FIG. 2 is a schematic flowchart of a fuzzy matching method based on a knowledge graph according to one or more embodiments
  • FIG. 3 is a schematic flowchart of a fuzzy matching method based on knowledge graph in another embodiment
  • FIG. 4 is a block diagram of an apparatus for fuzzy matching based on knowledge graph in accordance with one or more embodiments
  • FIG. 5 is a block diagram of a computer device in accordance with one or more embodiments.
  • the fuzzy matching method based on the knowledge graph provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through a network.
  • The terminal 102 sends a retrieval request carrying a retrieval sentence to the server 104; the server 104 receives the retrieval request, performs word segmentation on the retrieval sentence, and obtains a query word bag including query keywords.
  • According to the query word bag, the server queries the constructed knowledge graph and obtains the knowledge graph node text containing the query keywords.
  • The constructed knowledge graph takes text data as nodes and the text similarity corresponding to the text data as the node connection relationship.
  • The server then queries the constructed knowledge graph according to the knowledge graph node text, obtains the similar text set corresponding to the knowledge graph node text according to the node connection relationship, and obtains the keyword score corresponding to the query keyword according to the preset feature word score table.
  • According to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set are obtained.
  • The knowledge graph node texts and similar node texts are sorted according to the two retrieval scores to obtain the retrieval results corresponding to the retrieval sentence.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • A fuzzy matching method based on a knowledge graph is provided; taking its application to the server in FIG. 1 as an example, the method includes the following steps:
  • Step 202 Receive a retrieval request carrying a retrieval sentence, perform word segmentation on the retrieval sentence, and obtain a query word bag including query keywords.
  • When the user needs to perform a fuzzy matching query, the terminal sends a retrieval request carrying the retrieval sentence to the server.
  • After receiving the retrieval request carrying the retrieval sentence, the server segments the retrieval sentence using the preset word segmentation algorithm.
  • Stop words are then removed from the segmentation result, and a query word bag containing query keywords is obtained.
  • the preset word segmentation algorithm may specifically be jieba word segmentation, etc., which is not specifically limited in this embodiment.
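As a rough illustration only (not the patent's implementation), Step 202 can be sketched as follows; a plain whitespace tokenizer and a toy stop-word list stand in for a real Chinese segmenter such as jieba:

```python
# Sketch of Step 202: segment a retrieval sentence into a query word bag.
# A whitespace tokenizer and a toy stop-word list stand in for a real
# Chinese segmenter such as jieba.
STOP_WORDS = {"the", "a", "of", "is"}  # illustrative stop words

def build_query_bag(sentence):
    """Split the sentence into words and drop stop words."""
    tokens = sentence.lower().split()
    return {t for t in tokens if t not in STOP_WORDS}

bag = build_query_bag("the price of a knowledge graph query")
# bag == {"price", "knowledge", "graph", "query"}
```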
  • Step 204 Query the constructed knowledge graph according to the query word bag, and obtain the knowledge graph node text containing the query keywords.
  • The constructed knowledge graph takes text data as nodes and the text similarity corresponding to the text data as the node connection relationship.
  • knowledge graph is a concept in the field of library and information science, which is used to draw, analyze and display the interconnection between disciplines or academic research subjects, and is a visualization tool to reveal the development process and structural relationship of scientific knowledge.
  • the knowledge graph adopts a graph structure for visual representation, using nodes to represent authors, academic institutions, scientific literature or keywords, and using lines to represent relationships between nodes.
  • the constructed knowledge graph is a text knowledge graph, that is, nodes represent text data, and text similarity is used as a connection line to represent a node connection relationship, so as to realize the connection between similar text data.
  • the server will query the constructed knowledge graph through the inverted index of text feature words according to the query keywords in the query word bag, and obtain the knowledge graph node text containing the query keywords.
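The inverted-index lookup described here can be sketched as follows (the index contents and node IDs are illustrative, not taken from the patent):

```python
# Sketch of Step 204: find knowledge graph node texts containing query keywords
# via an inverted index mapping each feature word to the node IDs that contain it.
inverted_index = {          # illustrative index over node IDs
    "graph": {1, 2},
    "query": {2, 3},
    "price": {4},
}

def lookup_nodes(query_bag, index):
    """Union of node IDs indexed under any query keyword."""
    hits = set()
    for word in query_bag:
        hits |= index.get(word, set())
    return hits

nodes = lookup_nodes({"graph", "query"}, inverted_index)
# nodes == {1, 2, 3}
```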
  • Step 206 query the constructed knowledge graph according to the node text of the knowledge graph, and obtain a set of similar texts corresponding to the node text of the knowledge graph according to the node connection relationship.
  • Specifically, the server further queries the constructed knowledge graph according to the knowledge graph node text, determines the neighbor nodes of the knowledge graph node text in the knowledge graph according to the node connection relationship, and thereby obtains the similar text set corresponding to the knowledge graph node text.
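Step 206's neighbor lookup can be sketched as follows, assuming (as an illustrative representation, not the patent's) that the graph is stored as an adjacency map from each node to its weighted neighbors:

```python
# Sketch of Step 206: collect the similar text set of each hit node from the
# graph's adjacency map {node: {neighbor: similarity_weight}}.
adjacency = {1: {5: 0.8, 6: 0.6}, 2: {6: 0.9}}  # illustrative graph

def similar_text_set(hit_nodes, adj):
    """Neighbors of all hit nodes, excluding the hit nodes themselves."""
    neighbors = set()
    for n in hit_nodes:
        neighbors |= set(adj.get(n, {}))
    return neighbors - set(hit_nodes)

sims = similar_text_set({1, 2}, adjacency)
# sims == {5, 6}
```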
  • Step 208 Obtain the keyword score corresponding to the query keyword according to the preset feature word score table, and obtain, according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set.
  • the keyword scores corresponding to each query keyword are stored in the preset feature word score table.
  • Specifically, the server obtains the keyword score corresponding to the query keyword according to the preset feature word score table, calculates the first retrieval score of the knowledge graph node text according to the keyword score, and calculates the second retrieval score of the similar node texts in the similar text set according to the keyword score and the node connection relationship.
  • Step 210 Sort the knowledge graph node texts and similar node texts according to the first retrieval score and the second retrieval score to obtain retrieval results corresponding to the retrieval sentences.
  • The server sorts the knowledge graph node texts and similar node texts according to the first retrieval score and the second retrieval score, so as to obtain the fuzzy matching text data in the knowledge graph most relevant to the retrieval sentence; the sorted results are truncated according to a preset retrieval text threshold to obtain the retrieval results corresponding to the retrieval sentence.
  • If a text is both a knowledge graph node text and a similar node text, it will have both a first retrieval score and a second retrieval score; in this case, the second retrieval score is used as the final score of the text.
  • In the above knowledge graph-based fuzzy matching method, a query word bag including query keywords is obtained by segmenting the retrieval sentence; the constructed knowledge graph is queried according to the query word bag to obtain the knowledge graph node text containing the query keywords; the constructed knowledge graph is then further queried according to the knowledge graph node text, and the similar text set corresponding to the knowledge graph node text is obtained according to the node connection relationship; finally, the keyword scores and the node connection relationships between nodes in the knowledge graph are used to calculate the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node text in the similar text set, and the knowledge graph node texts and similar node texts are sorted according to the two scores.
  • The retrieval result corresponding to the retrieval sentence obtained in this way realizes accurate fuzzy matching and improves the accuracy of fuzzy matching.
  • obtaining the first retrieval score of the node text of the knowledge graph and the second retrieval score of the similar node text in the similar text set according to the keyword score and the node connection relationship includes:
  • the second retrieval score of the similar node text is calculated.
  • Specifically, the server calculates the first retrieval score of the knowledge graph node text according to the keyword score and the occurrence of each query keyword in the knowledge graph node text, and determines, according to the knowledge graph node text, the target node text corresponding to each similar node text in the similar text set,
  • that is, the knowledge graph node text that is a neighbor node of the similar node text.
  • The server may then weight the first retrieval score of the target node text by the node connection relationship (i.e., text similarity) between the target node text and the similar node text in the constructed knowledge graph to calculate the second retrieval score of the similar node text. Further, if a certain text is both a knowledge graph node text and a similar node text, the node connection relationship can be taken as 1 when calculating its second retrieval score.
  • the keyword score may specifically refer to an IDF value
  • The first retrieval score of a knowledge graph node text is the sum of the IDF values of the query keywords it contains. For example, if a total of 10 documents containing "A" or "B" are found, the three documents containing only "A" score 5.1 each, the three documents containing only "B" score 1.2 each, and the four documents containing both "A" and "B" score 6.3 each.
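The IDF-sum rule and the example's numbers can be checked with a short sketch (5.1 and 1.2 are the IDF values implied by the example):

```python
# First retrieval score: sum of the IDF values of the query keywords a
# document contains, using the IDF values implied by the example above.
idf = {"A": 5.1, "B": 1.2}

def first_score(doc_words, idf_table):
    return sum(idf_table[w] for w in doc_words if w in idf_table)

score_a = first_score({"A"}, idf)        # documents with only "A": 5.1
score_b = first_score({"B"}, idf)        # documents with only "B": 1.2
score_ab = first_score({"A", "B"}, idf)  # documents with both: 5.1 + 1.2 = 6.3
```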
  • the formula for calculating the second retrieval score of the similar node text is score_j = score_i · w_ij, where:
  • score_i represents the retrieval score of the target node text corresponding to the similar node text, that is, the first retrieval score of the knowledge graph node text that has the similar node text as a neighbor node;
  • w_ij represents the node connection relationship between the similar node and its neighbor node, that is, the text similarity used when constructing the knowledge graph; when the similar node text is itself a knowledge graph node text, w_ij is taken as 1.
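Since the formula's figure is not reproduced in this text, the sketch below shows one consistent reading of the definitions of score_i and w_ij: each similar node inherits a hit node's first retrieval score weighted by the edge similarity. Taking the maximum over several hit neighbors is an assumption of this sketch, not stated by the patent:

```python
# Sketch of Step 208's second retrieval score: propagate each hit node's first
# retrieval score to its neighbors, weighted by the edge similarity w_ij.
# Aggregating with max() over several hit neighbors is an assumption here.
first_scores = {1: 6.3, 2: 5.1}                  # illustrative first scores
adjacency = {1: {5: 0.8, 6: 0.6}, 2: {6: 0.9}}   # {hit node: {neighbor: w_ij}}

def second_scores(first, adj):
    out = {}
    for i, score_i in first.items():
        for j, w_ij in adj.get(i, {}).items():
            out[j] = max(out.get(j, 0.0), score_i * w_ij)
    return out

scores = second_scores(first_scores, adjacency)
# node 5: 6.3 * 0.8 = 5.04; node 6: max(6.3 * 0.6, 5.1 * 0.9) = 4.59
```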
  • In this embodiment, the first retrieval score of the knowledge graph node text is calculated according to the keyword score, the target node text corresponding to each similar node text in the similar text set is determined according to the knowledge graph node text, and the node connection relationship is then used to calculate the second retrieval score of the similar node texts, thereby realizing the calculation of the first retrieval score and the second retrieval score.
  • Before querying the constructed knowledge graph according to the query word bag to obtain the knowledge graph node text containing the query keyword, the method further includes:
  • the knowledge graph is constructed with the text data as the node and the text similarity corresponding to the target similar text as the node connection relationship.
  • the text data set refers to a set composed of all text data that can be used for fuzzy matching, and the text data that can be used for fuzzy matching can specifically be articles, sentences, entities, and the like.
  • Word segmentation refers to the word splitting of text data, and the text data is split into multiple words.
  • The trained word vector model is used to obtain the word vector corresponding to a word, that is, it is a model that takes a word as input and outputs its word vector.
  • The word vector model may specifically be a word2vec model or another neural network such as BERT.
  • the text vector refers to a vector with the same dimension as the vector dimension of each word vector in the word vector set, and is used to characterize the features of the text data.
  • the preset word frequency statistical algorithm refers to an algorithm used to count the occurrence frequency of words.
  • the word frequency statistical algorithm may specifically be the TF-IDF algorithm, the BM25 algorithm, or the like.
  • the text similarity is used to represent the degree of similarity between two text data. The greater the text similarity is, the more similar the two text data are. The similarity here may specifically mean that the content described by the text data is close or related.
  • the target similar text refers to the filtered text that is similar to the text data.
  • the server obtains a text data set from a preset text database, performs word segmentation on the text data in the text data set, splits the text data into multiple words, and obtains a word set corresponding to the text data.
  • the manner of performing word segmentation may specifically be jieba word segmentation, etc., which is not specifically limited in this embodiment.
  • The server inputs each word in the word set into the trained word vector model to obtain the word vector corresponding to each word, and obtains the word vector set corresponding to the word set; the average of each dimension over the word vectors in the word vector set then gives the text vector corresponding to the text data.
  • In addition, the server obtains sample text data from the preset text database and uses the sample text data to train the initial word vector model, obtaining the trained word vector model.
  • The word vector model here may specifically be a word2vec model or another neural network such as BERT, which is not specifically limited in this embodiment.
  • When the server determines the target similar text corresponding to the text data by using the text vector, the word set and the preset word frequency statistical algorithm, it adopts a double similarity comparison: it first selects, through the word set and the preset word frequency statistical algorithm, a portion of similar texts related to the text data from the text data set, and then uses the text vector to further filter out the target similar texts from these similar texts.
  • The server takes the text data as a node, the target similar text corresponding to the text data as its neighbor node, and the text similarity corresponding to the target similar text as the node connection relationship between the node and the neighbor node, to construct the knowledge graph. Further, the server sets a text number for each text data, generates a text number table, and builds the knowledge graph with the text numbers as nodes; when querying the knowledge graph, the corresponding text number is determined first, and the generated text number table is then queried according to the text number to feed back the corresponding text data.
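As an illustrative sketch (the text numbers, texts and similarities below are made up), the construction with a text number table can look like:

```python
# Sketch of knowledge graph construction: each text number becomes a node and
# each (target similar text, similarity) pair becomes a weighted edge.
texts = {0: "text a", 1: "text b", 2: "text c"}          # text number table
similar = {0: [(1, 0.9)], 1: [(0, 0.9), (2, 0.7)], 2: [(1, 0.7)]}

graph = {tid: dict(neigh) for tid, neigh in similar.items()}

def neighbors(text_id):
    """Similar text set of a node, looked up via the text number table."""
    return {texts[j]: w for j, w in graph.get(text_id, {}).items()}

# neighbors(1) == {"text a": 0.9, "text c": 0.7}
```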
  • In this embodiment, the text vector corresponding to the text data is obtained, and the text similarity between the text data in the text data set is calculated according to the text vector, the word set and the preset word frequency statistical algorithm, so that the target similar text corresponding to the text data can be determined.
  • Then, according to the target similar text, the knowledge graph is constructed with the text data as nodes and the text similarity corresponding to the target similar text as the node connection relationship, so that accurate fuzzy matching can be realized using the constructed knowledge graph and the accuracy of fuzzy matching can be improved.
  • obtaining the text vector corresponding to the text data includes:
  • collecting the average value of each dimension of the word vector set to obtain the text vector corresponding to the text data.
  • the word vector is a multi-dimensional vector, and the average value of the same dimension refers to the average value of each same dimension in the word vector.
  • the word vector may specifically be an M-dimensional vector, and the average value of the same dimension refers to the average value of the dimension value of the first dimension, the average value of the dimension value of the second dimension, the average value of the dimension value of the Mth dimension, etc. in the word vector.
  • the text vector refers to a vector with the same dimension as the vector dimension of each word vector in the word vector set, and is used to characterize the features of the text data.
  • the word vector is an M-dimensional vector
  • the text vector is also an M-dimensional vector
  • the dimension value of each dimension in the text vector is obtained from the average value of that same dimension over the word vector set corresponding to the text data, that is, each dimension value is the average of the corresponding dimension across the word vectors.
  • the dimension value of the first dimension in the text vector is a dimension average value of the first dimension of the word vector corresponding to the text data.
  • Specifically, the server calculates the average value of each dimension across the word vectors in the word vector set, and uses these per-dimension averages as the dimension values of the corresponding dimensions of the text vector for the text data.
  • The text vector obtained by averaging all word vectors in the text data by dimension is V_m = (1/T) · Σ_{k=1}^{T} x_{k,m}, where V_m is the mth dimension value of the text vector, T is the number of all words in the text, and x_{k,m} is the mth dimension value of the kth word vector in the text.
  • In this embodiment, the text vector corresponding to the text data can be obtained in this way, realizing the construction of the text vector of the text data.
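The per-dimension averaging V_m = (1/T) · Σ_k x_{k,m} can be sketched in plain Python (the word vectors below are illustrative):

```python
# Sketch of the dimension-wise mean: the text vector is the average of the
# word vectors, taken separately in each dimension (T = 3, M = 4 here).
word_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
T = len(word_vectors)
text_vector = [sum(vec[m] for vec in word_vectors) / T
               for m in range(len(word_vectors[0]))]
# text_vector == [2.0, 2.0, 2.0, 2.0]
```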
  • the text similarity between the text data in the text data set is calculated according to the text vector, the word set and the preset word frequency statistical algorithm, and the target similar text corresponding to the text data is determined to include:
  • a preset number of similar texts related to the text data in the text data set are obtained;
  • according to the text vector, calculating the text similarity between the text data and the similar texts in the preset number of similar texts;
  • the target similar text corresponding to the text data is selected.
  • Specifically, the server first determines the text feature words in the word set according to the word set and the preset word frequency statistical algorithm, then constructs a word frequency matrix for each text data according to the text feature words, and uses the word frequency matrices to determine the preset number of similar texts related to each text data in the text data set.
  • Then, according to the text vector, the text similarity between the text data and each of the preset number of similar texts is calculated, and the preset target number of target similar texts corresponding to the text data is selected according to the text similarity.
  • The preset number and the preset target number can be set as required, but the preset number must be greater than the preset target number; preferably, the preset number is set much larger than the preset target number.
  • a similarity calculation method such as cosine similarity can be used for calculation, which is not specifically limited in this embodiment.
  • the corresponding calculation formula can be: r_{i,j} = (Σ_m V_{i,m} · V_{j,m}) / (√(Σ_m V_{i,m}²) · √(Σ_m V_{j,m}²)), where r_{i,j} represents the similarity between text data i and text data j, and V_{i,m} represents the value of the mth dimension of the text vector of text data i.
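The cosine similarity between two text vectors can be sketched as follows:

```python
# Cosine similarity of two text vectors, matching r_{i,j} above:
# dot product divided by the product of the vector norms.
import math

def cosine(v_i, v_j):
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    return dot / (norm_i * norm_j)

r = cosine([1.0, 0.0], [1.0, 1.0])
# r == 1 / sqrt(2) ≈ 0.7071
```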
  • obtaining a preset number of similar texts related to the text data in the text data set includes:
  • a preset number of similar texts related to the text data in the text data set are obtained.
  • the word frequency matrix is used to represent the word frequency of each text feature word appearing in the text data.
  • The word frequency here may specifically refer to the TF-IDF value of each text feature word.
  • For example, for text data 1, the TF-IDF values of the text feature words A, B and C can be calculated from their numbers of occurrences and their IDF values, and the word frequency matrix is obtained from these TF-IDF values.
  • The word frequency similarity is used to represent the similarity of word frequencies between text data; here it may specifically refer to the degree of overlap of the words contained in the text data.
  • Specifically, the server performs word frequency statistics according to the word set and the preset word frequency statistical algorithm, selects the preset number of text feature words with the highest word frequency from the word set, traverses the word set according to the text feature words, counts the occurrences of the text feature words in each text data, and obtains the word frequency matrix corresponding to the text data. After obtaining the word frequency matrix, the server calculates the word frequency similarity between each pair of text data according to the word frequency matrix, sorts the text data in the text data set by relevance according to the word frequency similarity, and obtains the preset number of similar texts related to each text data. The number of preset feature words can be set as needed.
  • In this embodiment, the word frequency similarity can be used to perform text similarity comparison and determine the preset number of similar texts related to each text data.
  • the preset word frequency statistical algorithm is the TF-IDF algorithm, and performing word frequency statistics according to the word set and the preset word frequency statistical algorithm to obtain the text feature words includes:
  • sorting the words according to their TF-IDF values, and screening out the preset number of text feature words with the highest TF-IDF values.
  • traversing the word set according to the text feature words, and obtaining the word frequency matrix corresponding to the text data includes:
  • the word frequency matrix corresponding to the text data is obtained, and the element value in the same position in the word frequency matrix represents the word frequency of the same text feature word in each text data.
  • Specifically, the server first uses the TF-IDF algorithm to calculate the TF-IDF value of each word in the word set, sorts the words by their TF-IDF values, and selects the preset number of text feature words with the highest TF-IDF values.
  • It then traverses the word set of each text data according to the text feature words and counts the occurrences of the text feature words in the text data, that is, the TF value of each text feature word for each text data.
  • Finally, the TF-IDF value of each text feature word for each text data is obtained, and the word frequency matrix of the text data is obtained from these TF-IDF values.
  • The element value at the same position in each word frequency matrix represents the word frequency of the same text feature word in the corresponding text data.
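The word frequency matrix described above can be sketched as follows (texts and feature words are illustrative; normalizing TF by text length is one common convention and an assumption of this sketch):

```python
# Sketch of the word frequency matrix: TF-IDF of each feature word per text.
import math

texts = [["A", "B", "A"], ["B", "C"], ["A", "C", "C"]]  # illustrative texts
feature_words = ["A", "B", "C"]
n_docs = len(texts)

def idf(word):
    """Inverse document frequency: log of (corpus size / document frequency)."""
    df = sum(1 for t in texts if word in t)
    return math.log(n_docs / df)

def tf(word, text):
    """Term frequency, normalized by text length."""
    return text.count(word) / len(text)

matrix = [[tf(w, t) * idf(w) for w in feature_words] for t in texts]
# matrix[d][k] is the TF-IDF value of feature word k in text d
```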
  • the method further includes:
  • the feature word score of the text feature word is obtained.
  • the method further includes:
  • the inverted index of text feature words is constructed according to the nodes of the knowledge graph, and the feature word scores of the text feature words are recorded to obtain the feature word score table.
  • the feature word score of the text feature word is used to represent the word frequency of the text feature word in the text dataset.
  • the feature word score may specifically be the IDF value of the text feature word for the text data set.
  • Specifically, the server calculates the feature word score of each text feature word according to the word frequency matrix; after constructing the knowledge graph, it constructs an inverted index based on the text feature words according to the nodes of the knowledge graph, records the feature word scores of the text feature words, and obtains and stores the feature word score table.
  • In this embodiment, the feature word score table can be obtained, and constructing an inverted index of text feature words over the knowledge graph nodes can improve retrieval efficiency.
  • the fuzzy matching method based on the knowledge graph of the present application is illustrated by a schematic flowchart, and the fuzzy matching method based on the knowledge graph includes the following steps:
  • Step S302 obtaining a text data set, performing word segmentation on the text data in the text data set, and obtaining a word set corresponding to the text data;
  • Step S304 input the word set into the trained word vector model, obtain the word vector set corresponding to the word set, and obtain the text vector corresponding to the text data according to the word vector set;
  • Step S306 according to the text vector, the word set and the preset word frequency statistical algorithm, calculate the text similarity between the text data in the text data set, and determine the target similar text corresponding to the text data;
  • Step S308 according to the target similar text, take the text data as a node and the text similarity corresponding to the target similar text as a node connection relationship to construct a knowledge graph;
  • Step S310 receiving a retrieval request carrying a retrieval sentence, performing word segmentation on the retrieval sentence, and obtaining a query word bag including query keywords;
  • Step S312 query the constructed knowledge graph according to the query word bag, and obtain the knowledge graph node text containing the query keyword;
  • Step S314 query the constructed knowledge graph according to the knowledge graph node text, and obtain a similar text set corresponding to the knowledge graph node text according to the node connection relationship;
  • Step S316 obtain the keyword score corresponding to the query keyword according to the preset feature word score table, and obtain, according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node texts in the similar text set; and
  • Step S318 Sort the knowledge graph node texts and similar node texts according to the first retrieval score and the second retrieval score to obtain retrieval results corresponding to the retrieval sentences.
  • a fuzzy matching apparatus based on a knowledge graph, including: a receiving module 402, a first query module 404, a second query module 406, a processing module 408, and a sorting module 410, wherein:
  • the receiving module 402 is configured to receive a retrieval request carrying a retrieval sentence, perform word segmentation on the retrieval sentence, and obtain a query word bag including query keywords;
  • the first query module 404 is configured to query the constructed knowledge graph according to the query word bag to obtain the knowledge graph node text containing the query keywords, where the constructed knowledge graph takes text data as nodes and takes the text similarity corresponding to the text data as the node connection relationship;
  • the second query module 406 is configured to query the constructed knowledge graph according to the knowledge graph node text, and obtain a similar text set corresponding to the knowledge graph node text according to the node connection relationship;
  • the processing module 408 is configured to obtain the keyword score corresponding to the query keyword according to the preset feature word score table, and to obtain, according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node texts in the similar text set;
  • the sorting module 410 is configured to sort the knowledge graph node texts and similar node texts according to the first retrieval score and the second retrieval score to obtain retrieval results corresponding to the retrieval sentences.
  • the above fuzzy matching apparatus based on a knowledge graph obtains a query word bag including query keywords by segmenting the retrieval sentence; by querying the constructed knowledge graph according to the query word bag, the knowledge graph node text containing the query keywords can be obtained; the constructed knowledge graph can then be further queried according to the knowledge graph node text to obtain the similar text set corresponding to the knowledge graph node text; finally, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node texts in the similar text set are computed using the keyword score and the node connection relationships between the nodes in the knowledge graph, and the knowledge graph node texts and similar node texts are sorted according to the first and second retrieval scores to obtain the retrieval result corresponding to the retrieval sentence, achieving accurate fuzzy matching and improving fuzzy matching accuracy.
  • the processing module is further configured to calculate the first retrieval score of the knowledge graph node text according to the keyword score, and determine the target node text corresponding to the similar node text in the similar text set according to the knowledge graph node text, According to the target node text and the node connection relationship, the second retrieval score of the similar node text is calculated.
  • the fuzzy matching apparatus based on the knowledge graph further includes a knowledge graph building module, which is configured to obtain a text data set, perform word segmentation on the text data in the text data set to obtain the word set corresponding to the text data, input the word set into the trained word vector model to obtain the word vector set corresponding to the word set, obtain the text vector corresponding to the text data according to the word vector set, calculate the text similarity between text data in the text data set according to the text vector, the word set, and the preset word frequency statistics algorithm, determine the target similar text corresponding to the text data, and construct a knowledge graph according to the target similar text, taking the text data as nodes and the text similarity corresponding to the target similar text as the node connection relationship.
  • the knowledge graph building module is further configured to calculate the average value of the same dimension of each word vector in the word vector set according to the word vector set, and collect the average value of the same dimension to obtain a text vector corresponding to the text data.
  • the knowledge graph building module is further configured to obtain a preset number of similar texts related to the text data in the text data set according to the word set and the preset word frequency statistical algorithm, and calculate the text data and the preset number according to the text vector.
  • the text similarity of similar texts in similar texts, and according to the text similarity, the target similar texts corresponding to the text data are selected.
  • the knowledge graph building module is further configured to perform word frequency statistics according to the word set and a preset word frequency statistical algorithm to obtain text feature words, traverse the word set according to the text feature words, and obtain a word frequency matrix corresponding to the text data, according to The word frequency matrix is used to calculate the word frequency similarity between pairs of text data, and according to the word frequency similarity, a preset number of similar texts related to the text data in the text data set are obtained.
  • the knowledge graph building module is further configured to use the IF-IDF algorithm to calculate the IF-IDF value corresponding to each word in the word set, sort the words by their IF-IDF values, and screen out the preset-feature-word number of text feature words with the highest IF-IDF values.
  • the knowledge graph building module is further configured to traverse the word set according to the text feature words to obtain the IF value of each text feature word for each text data, obtain the IDF values of the text feature words, obtain, from the IF values and the IDF values, the IF-IDF value of each text feature word for each text data, and obtain, from these IF-IDF values, the word frequency matrix corresponding to the text data, where element values at the same position in the word frequency matrix represent the word frequency of the same text feature word in each text data.
  • the knowledge graph construction module is further configured to obtain the feature word score of the text feature word according to the word frequency matrix, construct an inverted index of the text feature word according to the nodes of the knowledge graph, and record the feature word score of the text feature word , get the feature word score table.
  • Each module in the above-mentioned fuzzy matching apparatus based on knowledge graph can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 5 .
  • the computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile or volatile storage medium and an internal memory.
  • the non-volatile or volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the execution of the operating system and the computer-readable instructions in the non-volatile or volatile storage medium.
  • the database of the computer device is used to store text data sets and the like.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions when executed by the processor, implement a knowledge graph-based fuzzy matching method.
  • FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
  • a computer device comprising a memory and one or more processors, the memory having computer-readable instructions stored therein which, when executed by the processor, cause the one or more processors to perform the following steps:
  • the constructed knowledge graph is queried, and the knowledge graph node text containing the query keywords is obtained.
  • the constructed knowledge graph takes the text data as the node and the text similarity corresponding to the text data as the node connection relationship;
  • the keyword score corresponding to the query keyword is obtained, and according to the keyword score and the node connection relationship, the first retrieval score of the knowledge graph node text and the second retrieval score of the similar node texts in the similar text set are obtained;
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the second retrieval score of the similar node text is calculated.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the knowledge graph is constructed with the text data as the node and the text similarity corresponding to the target similar text as the node connection relationship.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the average value of the same dimension is collected to obtain the text vector corresponding to the text data.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • a preset number of similar texts related to the text data in the text data set are obtained;
  • the text vector calculate the text similarity between the text data and the similar text in the preset number of similar texts.
  • the target similar text corresponding to the text data is selected.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • a preset number of similar texts related to the text data in the text data set are obtained.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the words are sorted according to the IF-IDF value, and the preset number of text feature words with the highest IF-IDF value is screened out.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the word frequency matrix corresponding to the text data is obtained, and the element value in the same position in the word frequency matrix represents the word frequency of the same text feature word in each text data.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the feature word score of the text feature word is obtained.
  • the processor further implements the following steps when executing the computer-readable instructions:
  • the inverted index of text feature words is constructed according to the nodes of the knowledge graph, and the feature word scores of the text feature words are recorded to obtain the feature word score table.
  • one or more computer-readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the steps in the above method embodiments.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.


Abstract

This application provides a knowledge graph-based fuzzy matching method, relating to the technical field of knowledge graphs in artificial intelligence, including: receiving a retrieval request carrying a retrieval sentence, and performing word segmentation on the retrieval sentence to obtain a query word bag including query keywords; querying a constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords; querying the constructed knowledge graph according to the knowledge graph node texts to obtain a similar text set corresponding to the knowledge graph node texts; obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and obtaining a retrieval result according to the first retrieval scores and the second retrieval scores.

Description

Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 2020116336520, entitled "Knowledge graph-based fuzzy matching method and apparatus, and computer device" and filed with the China National Intellectual Property Administration on December 31, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a knowledge graph-based fuzzy matching method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, fuzzy matching techniques have emerged. Fuzzy matching refers to returning descriptions related to a query keyword according to the query keyword. For example, a common fuzzy matching method is a search engine returning related web page links according to a query keyword.
In conventional techniques, fuzzy matching is usually performed using statistics-based inverted indexes or neural-network-based computation. In the statistics-based inverted index approach, the text is segmented into words, an inverted index is built from keywords, and the text database is bucketed or hashed according to the inverted index. In the neural-network-based approach, training corpora are prepared by manual or semi-manual annotation and a similarity model is trained under supervision; the model either takes one text as input and outputs a hidden vector, or takes two texts as input and directly outputs a similarity score.
However, the inventors have realized that both conventional methods suffer from inaccurate fuzzy matching.
Summary
According to various embodiments disclosed in this application, a knowledge graph-based fuzzy matching method and apparatus, a computer device, and a storage medium are provided.
A knowledge graph-based fuzzy matching method includes:
receiving a retrieval request carrying a retrieval sentence, and performing word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
querying a constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships;
querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
A knowledge graph-based fuzzy matching apparatus includes:
a receiving module, configured to receive a retrieval request carrying a retrieval sentence and perform word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
a first query module, configured to query a constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships;
a second query module, configured to query the constructed knowledge graph according to the knowledge graph node texts and obtain, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
a processing module, configured to obtain keyword scores corresponding to the query keywords according to a preset feature word score table and obtain, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
a sorting module, configured to sort the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processor, cause the one or more processors to perform the following steps:
receiving a retrieval request carrying a retrieval sentence, and performing word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
querying a constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships;
querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
One or more computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a retrieval request carrying a retrieval sentence, and performing word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
querying a constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships;
querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
In the above knowledge graph-based fuzzy matching method, apparatus, computer device, and storage medium, word segmentation is performed on the retrieval sentence to obtain a query word bag including query keywords; by querying the constructed knowledge graph according to the query word bag, the knowledge graph node texts containing the query keywords can be obtained; the constructed knowledge graph can then be further queried according to the knowledge graph node texts to obtain, via the node connection relationships, the similar text set corresponding to the knowledge graph node texts; finally, the first retrieval scores of the knowledge graph node texts and the second retrieval scores of the similar node texts in the similar text set are computed using the keyword scores and the node connection relationships between the nodes of the knowledge graph, and the knowledge graph node texts and similar node texts are sorted according to the first and second retrieval scores to obtain the retrieval result corresponding to the retrieval sentence, achieving accurate fuzzy matching and improving fuzzy matching accuracy.
The details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a diagram of an application scenario of a knowledge graph-based fuzzy matching method according to one or more embodiments;
FIG. 2 is a schematic flowchart of a knowledge graph-based fuzzy matching method according to one or more embodiments;
FIG. 3 is a schematic flowchart of a knowledge graph-based fuzzy matching method in another embodiment;
FIG. 4 is a block diagram of a knowledge graph-based fuzzy matching apparatus according to one or more embodiments;
FIG. 5 is a block diagram of a computer device according to one or more embodiments.
Detailed Description
To make the technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
The knowledge graph-based fuzzy matching method provided in this application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 through a network. When a user of the terminal 102 needs to perform a fuzzy matching query, the terminal 102 sends a retrieval request carrying a retrieval sentence to the server 104. The server 104 receives the retrieval request, performs word segmentation on the retrieval sentence to obtain a query word bag including query keywords, queries the constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords (the constructed knowledge graph takes text data as nodes and takes text similarities corresponding to the text data as node connection relationships), queries the constructed knowledge graph according to the knowledge graph node texts and obtains, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts, obtains keyword scores corresponding to the query keywords according to a preset feature word score table, obtains, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set, and sorts the knowledge graph node texts and the similar node texts according to the first and second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device; the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a knowledge graph-based fuzzy matching method is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
Step 202: receive a retrieval request carrying a retrieval sentence, and perform word segmentation on the retrieval sentence to obtain a query word bag including query keywords.
Specifically, when a user needs to perform a fuzzy matching query, the terminal sends a retrieval request carrying a retrieval sentence to the server. After receiving the retrieval request, the server segments the retrieval sentence using a preset word segmentation algorithm and removes stopwords from the segmented words to obtain a query word bag containing query keywords. The preset word segmentation algorithm may specifically be jieba segmentation or the like, which is not specifically limited in this embodiment.
Step 204: query the constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships.
A knowledge graph is a concept from library and information science, used to draw, analyze, and display the interconnections among disciplines or academic research subjects; it is a visualization tool that reveals the development and structural relationships of scientific knowledge. In most cases, a knowledge graph is represented as a graph structure, with nodes representing authors, academic institutions, scientific literature, or keywords, and links representing relationships between nodes. In this embodiment, the constructed knowledge graph is a text knowledge graph: nodes represent text data, and links labeled with text similarities represent node connection relationships, connecting similar text data to each other.
Specifically, the server queries the constructed knowledge graph through an inverted index of text feature words according to the query keywords in the query word bag, and obtains the knowledge graph node texts containing the query keywords.
Step 206: query the constructed knowledge graph according to the knowledge graph node texts, and obtain, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts.
Specifically, after obtaining the knowledge graph node texts, the server further queries the constructed knowledge graph according to the knowledge graph node texts, determines the neighbor nodes of the knowledge graph node texts in the knowledge graph according to the node connection relationships, and obtains the similar text set corresponding to the knowledge graph node texts.
Step 208: obtain keyword scores corresponding to the query keywords according to a preset feature word score table, and obtain, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set.
The preset feature word score table stores the keyword score corresponding to each query keyword.
Specifically, the server obtains the keyword scores corresponding to the query keywords according to the preset feature word score table, calculates the first retrieval scores of the knowledge graph node texts according to the keyword scores, and calculates the second retrieval scores of the similar node texts in the similar text set according to the keyword scores and the node connection relationships.
Step 210: sort the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
Specifically, by sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores, the server obtains the fuzzy-matched text data in the knowledge graph most relevant to the retrieval sentence, and truncates the sorted result according to a preset retrieval text threshold to obtain the retrieval result corresponding to the retrieval sentence. Further, when a text is both a knowledge graph node text and a similar node text, it has both a first retrieval score and a second retrieval score; in this case, the second retrieval score is taken as the final score of the text.
In the above knowledge graph-based fuzzy matching method, word segmentation is performed on the retrieval sentence to obtain a query word bag including query keywords; by querying the constructed knowledge graph according to the query word bag, the knowledge graph node texts containing the query keywords can be obtained; the constructed knowledge graph can then be further queried according to the knowledge graph node texts to obtain, via the node connection relationships, the similar text set corresponding to the knowledge graph node texts; finally, the first retrieval scores of the knowledge graph node texts and the second retrieval scores of the similar node texts in the similar text set are computed using the keyword scores and the node connection relationships between the nodes of the knowledge graph, and the knowledge graph node texts and similar node texts are sorted according to the first and second retrieval scores to obtain the retrieval result corresponding to the retrieval sentence, achieving accurate fuzzy matching and improving fuzzy matching accuracy.
In one embodiment, obtaining the first retrieval scores of the knowledge graph node texts and the second retrieval scores of the similar node texts in the similar text set according to the keyword scores and the node connection relationships includes:
calculating the first retrieval scores of the knowledge graph node texts according to the keyword scores, and determining, according to the knowledge graph node texts, target node texts corresponding to the similar node texts in the similar text set; and
calculating the second retrieval scores of the similar node texts according to the target node texts and the node connection relationships.
Specifically, the server computes the first retrieval score of a knowledge graph node text as a weighted combination of the keyword scores, according to which query keywords appear in that node text, and determines, according to the knowledge graph node texts, the target node text corresponding to each similar node text in the similar text set, i.e. the knowledge graph node text that is a neighbor node of the similar node text. After determining the target node texts, the server weights the first retrieval scores of the target node texts by the node connection relationships (i.e. the text similarities) between those target node texts and the similar node texts in the constructed knowledge graph, to compute the second retrieval scores of the similar node texts. Further, if a text is both a knowledge graph node text and a similar node text, the node connection relationship may be taken as 1 when computing its second retrieval score.
For example, the keyword score may specifically be an IDF value, in which case the first retrieval score of a knowledge graph node text is the sum of the IDF values of the query keywords it contains. Suppose ten documents containing "A" or "B" are found: the three documents containing only "A" each score 5.1, the three containing only "B" each score 1.2, and the four containing both "A" and "B" each score 6.3. The second retrieval score of a similar node text j is computed as:
score_j = Σ_i score_i · w_ij
where score_i denotes the retrieval score of the target node text corresponding to the similar node text, i.e. the first retrieval score of the knowledge graph node text that is a neighbor node of the similar node text, and w_ij denotes the node connection relationship between the similar node and its neighbor node, i.e. the text similarity used when constructing the knowledge graph; when the similar node text is itself a knowledge graph text node, w_ij is 1.
In this embodiment, by calculating the first retrieval scores of the knowledge graph node texts according to the keyword scores, determining, according to the knowledge graph node texts, the target node texts corresponding to the similar node texts in the similar text set, and calculating the second retrieval scores of the similar node texts according to the target node texts and the node connection relationships, the computation of the first and second retrieval scores can be realized.
In one embodiment, before querying the constructed knowledge graph according to the query word bag to obtain the knowledge graph node texts containing the query keywords, the method further includes:
obtaining a text data set, and performing word segmentation on the text data in the text data set to obtain word sets corresponding to the text data;
inputting the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, and obtaining text vectors corresponding to the text data according to the word vector sets;
calculating text similarities between text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, and determining target similar texts corresponding to the text data; and
constructing a knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as node connection relationships.
Here, the text data set is a set of all text data usable for fuzzy matching; such text data may specifically be articles, sentences, entities, and the like. Word segmentation refers to splitting text data into multiple words. The trained word vector model is used to obtain the word vector corresponding to a word, i.e. a model that takes a word as input and outputs a word vector; for example, the word vector model may specifically be a word2vec model or another neural network such as BERT. The text vector is a vector with the same dimensionality as the word vectors in the word vector set and is used to characterize the features of the text data.
The preset word frequency statistics algorithm is an algorithm for counting word occurrence frequencies; for example, it may specifically be the IF-IDF algorithm, the BM25 algorithm, or the like. Text similarity characterizes how similar two pieces of text data are: the larger the text similarity, the more similar the two texts, where similarity may specifically mean that the contents described by the text data are close or related. The target similar texts are the texts selected as similar to a given text data.
Specifically, the server obtains the text data set from a preset text database and segments the text data in the set, splitting each text into multiple words to obtain the word set corresponding to the text data; the segmentation method may specifically be jieba segmentation or the like, which is not specifically limited in this embodiment. After obtaining the word sets, the server inputs the words in each word set into the trained word vector model to obtain the word vector of each word, obtains the word vector set corresponding to the word set, and obtains the text vector corresponding to the text data by computing the per-dimension averages of the word vectors in the word vector set.
Further, before inputting the words of the word sets into the trained word vector model, the server obtains sample text data from the preset text database and trains an initial word vector model with the sample text data to obtain the trained word vector model. As exemplified above, the word vector model may specifically be a word2vec model or another neural network such as BERT; this embodiment does not specifically limit the word vector model here.
Specifically, when determining the target similar texts corresponding to the text data according to the text vectors, the word sets, and the preset word frequency statistics algorithm, the server uses a two-stage similarity comparison: it first uses the word sets and the preset word frequency statistics algorithm to screen out, from the text data set, a subset of similar texts related to each text data, and then uses the text vectors to further select the target similar texts from these screened similar texts.
Specifically, the server constructs the knowledge graph by taking each text data as a node, taking the target similar texts corresponding to the text data as its neighbor nodes, and taking the text similarities corresponding to the target similar texts as the node connection relationships between the node and its neighbor nodes. Further, the server assigns a text number to each text data to generate a text number table and builds the knowledge graph with the text numbers as nodes; when the knowledge graph is queried, the corresponding text number is determined first, and the generated text number table is then queried by text number to return the corresponding text data.
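The graph construction just described — each text as a node, each target similar text as a neighbor, each text similarity as the edge weight — can be sketched with plain dictionaries; the text ids and similarities are illustrative.

```python
# Sketch of knowledge graph construction: nodes are text numbers, edges carry
# the text similarity to each target similar text as the connection weight.
def build_knowledge_graph(target_similar):
    """target_similar: {text_id: [(similar_text_id, similarity), ...]}"""
    graph = {}
    for text_id, neighbors in target_similar.items():
        graph.setdefault(text_id, {})
        for sim_id, sim in neighbors:
            graph[text_id][sim_id] = sim
            graph.setdefault(sim_id, {})[text_id] = sim  # similarity is symmetric
    return graph

kg = build_knowledge_graph({"t1": [("t2", 0.9), ("t3", 0.7)]})
print(kg["t2"]["t1"])  # 0.9
```

Looking up a node's neighbor dictionary then directly yields both the similar text set (its keys) and the node connection relationships (its values) used in the retrieval steps.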
In this embodiment, by segmenting the text data and using the word vectors corresponding to the segmented word sets to obtain the text vectors corresponding to the text data, and by calculating the text similarities between text data in the text data set according to the text vectors, the word sets, and the preset word frequency statistics algorithm, the text similarities can be used to determine the target similar texts corresponding to the text data; the knowledge graph can then be constructed according to the target similar texts, taking the text data as nodes and the text similarities corresponding to the target similar texts as node connection relationships, so that the constructed knowledge graph can be used to achieve accurate fuzzy matching and improve fuzzy matching accuracy.
In one embodiment, obtaining the text vector corresponding to the text data according to the word vector set includes:
calculating, according to the word vector set, the per-dimension averages of the word vectors in the word vector set; and
collecting the per-dimension averages to obtain the text vector corresponding to the text data.
A word vector is a multidimensional vector, and a per-dimension average is the average of the values of the word vectors in one and the same dimension. For example, if the word vectors are M-dimensional, the per-dimension averages are the average of the first-dimension values, the average of the second-dimension values, ..., and the average of the M-th-dimension values. The text vector has the same dimensionality as the word vectors in the word vector set and characterizes the features of the text data: when the word vectors are M-dimensional, the text vector is also M-dimensional, and the value of each dimension of the text vector is the per-dimension average of the corresponding dimension over the word vector set of the text data; for example, the first-dimension value of the text vector is the average of the first-dimension values of the word vectors corresponding to the text data.
Specifically, the server calculates the per-dimension averages of the word vectors in the word vector set according to the word vector set, and collects the per-dimension averages as the dimension values of the text vector corresponding to the text data.
For example, averaging all word vectors of a text by dimension yields the text vector
V_m = (1/T) · Σ_{k=1}^{T} x_{k,m},  m = 1, ..., M
where V_m is the m-th component of the text vector, T is the number of words in the text, and x_{k,m} is the m-th component of the k-th word vector in the text; since the text vector is obtained from the word vectors, its dimensionality is also M.
In this embodiment, by calculating the per-dimension averages of the word vectors in the word vector set according to the word vector set and collecting the per-dimension averages to obtain the text vector corresponding to the text data, the construction of the text vector of the text data can be realized.
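The per-dimension averaging V_m = (1/T) Σ_k x_{k,m} can be sketched directly; the two 3-dimensional word vectors below are illustrative.

```python
# Sketch of the text vector computation: each component of the text vector is
# the mean of that component over all word vectors of the text.
def text_vector(word_vectors):
    T = len(word_vectors)        # number of words in the text
    M = len(word_vectors[0])     # word vector dimensionality
    return [sum(v[m] for v in word_vectors) / T for m in range(M)]

vectors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # illustrative word vectors
print(text_vector(vectors))  # [2.0, 2.0, 2.0]
```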
In one embodiment, calculating the text similarities between text data in the text data set according to the text vectors, the word sets, and the preset word frequency statistics algorithm, and determining the target similar texts corresponding to the text data includes:
obtaining, according to the word sets and the preset word frequency statistics algorithm, a preset number of similar texts related to the text data in the text data set;
calculating, according to the text vectors, the text similarities between the text data and the similar texts among the preset number of similar texts; and
selecting the target similar texts corresponding to the text data according to the text similarities.
Specifically, the server first determines the text feature words in the word sets according to the word sets and the preset word frequency statistics algorithm, builds the word frequency matrix of each text data from the text feature words, and uses the word frequency matrices to determine the preset number of similar texts related to each text data in the text data set; it then calculates, according to the text vectors, the text similarities between the text data and those similar texts, and selects, according to the text similarities, a preset target number of target similar texts corresponding to the text data. The preset number and the preset target number can be set as needed, provided that the preset number is greater than the preset target number; preferably, the preset number may be set much larger than the preset target number.
Further, when calculating the text similarities between the text data and the preset number of similar texts from the text vectors, a similarity measure such as cosine similarity may be used for the calculation, which is not specifically limited in this embodiment. For example, with cosine similarity the corresponding formula may be:
r_{i,j} = Σ_{m=1}^{M} V_{i,m} · V_{j,m} / ( sqrt(Σ_{m=1}^{M} V_{i,m}²) · sqrt(Σ_{m=1}^{M} V_{j,m}²) )
where r_{i,j} denotes the similarity between text data i and text data j, and V_{i,m} denotes the m-th component of the text vector of text data i.
In this embodiment, by performing two rounds of text similarity screening using the word sets, the preset word frequency statistics algorithm, and the text vectors, accurate target similar texts corresponding to the text data can be selected.
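The cosine similarity formula above translates directly into code; the example vectors are illustrative.

```python
# Sketch of the cosine similarity between two text vectors, matching
# r_ij = sum_m V_im * V_jm / (||V_i|| * ||V_j||).
import math

def cosine_similarity(v_i, v_j):
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    return dot / (norm_i * norm_j)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical directions)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```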
In one embodiment, obtaining, according to the word sets and the preset word frequency statistics algorithm, the preset number of similar texts related to the text data in the text data set includes:
performing word frequency statistics according to the word sets and the preset word frequency statistics algorithm to obtain text feature words;
traversing the word sets according to the text feature words to obtain word frequency matrices corresponding to the text data;
calculating word frequency similarities between pairs of text data according to the word frequency matrices; and
obtaining, according to the word frequency similarities, the preset number of similar texts related to the text data in the text data set.
The word frequency matrix represents the word frequency of each text feature word in the text data; for example, the word frequency here may specifically be the IF-IDF value of each text feature word. For instance, when text feature words A, B, and C appear 0, 3, and 4 times respectively in text data 1, the IF-IDF values of A, B, and C for text data 1 can be computed from these occurrence counts and the IDF values of A, B, and C, and the word frequency matrix is obtained from these IF-IDF values. Word frequency similarity characterizes how similar the word frequencies of two pieces of text data are; the similarity of word frequencies here may specifically refer to the degree of overlap between the words contained in the texts.
Specifically, the server performs word frequency statistics according to the word sets and the preset word frequency statistics algorithm, screens out from the word sets the preset-feature-word number of text feature words with the highest word frequencies, traverses the word sets according to the text feature words, counts the occurrences of the text feature words in the text data, and obtains the word frequency matrices corresponding to the text data. After obtaining the word frequency matrices, the server calculates the word frequency similarities between pairs of text data according to the matrices, sorts the text data in the text data set by relevance according to the word frequency similarities, and obtains the preset number of similar texts related to each text data. The preset feature word number can be set as needed.
In this embodiment, by first obtaining the text feature words, then obtaining the word frequency matrices corresponding to the text data from the text feature words, and finally calculating the word frequency similarities between pairs of text data from the word frequency matrices, the word frequency similarities can be used for text similarity comparison to determine the preset number of similar texts related to the text data.
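The feature word selection and word frequency matrix steps above can be sketched as follows. This follows the patent's IF-IDF (term frequency / inverse document frequency) scheme with a standard log-ratio IDF; the tiny corpus, the particular IDF formula, and the feature count are illustrative assumptions, not taken from the patent.

```python
# Sketch of "select top-N feature words, then fill a matrix whose entry
# [d][f] is the IF-IDF value of feature word f in text d"; the same column
# position refers to the same feature word across all text data.
import math

def idf(word, docs):
    df = sum(1 for d in docs if word in d)  # document frequency
    return math.log(len(docs) / df) if df else 0.0

def select_features(docs, n):
    vocab = {w for d in docs for w in d}
    # rank words by total-count * IDF (one plausible "highest word frequency" score)
    score = {w: sum(d.count(w) for d in docs) * idf(w, docs) for w in vocab}
    return sorted(vocab, key=lambda w: score[w], reverse=True)[:n]

def word_freq_matrix(docs, features):
    return [[d.count(f) * idf(f, docs) for f in features] for d in docs]

docs = [["apple", "pear"], ["apple", "plum", "plum"]]  # illustrative word sets
features = select_features(docs, 2)
matrix = word_freq_matrix(docs, features)
```

Note that "apple", which appears in every text, gets IDF 0 and is never selected as a feature word — the same behavior the IDF weighting is meant to produce.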
In one embodiment, the preset word frequency statistics algorithm is the IF-IDF algorithm, and performing word frequency statistics according to the word sets and the preset word frequency statistics algorithm to obtain the text feature words includes:
calculating the IF-IDF value corresponding to each word in the word sets using the IF-IDF algorithm; and
sorting the words by their IF-IDF values and screening out the preset-feature-word number of text feature words with the highest IF-IDF values.
In one embodiment, traversing the word sets according to the text feature words to obtain the word frequency matrices corresponding to the text data includes:
traversing the word sets according to the text feature words to obtain the IF value of each text feature word for each text data;
obtaining the IDF values of the text feature words, and obtaining, from the IF values and the IDF values of the text feature words, the IF-IDF value of each text feature word for each text data; and
obtaining, from the IF-IDF value of each text feature word for each text data, the word frequency matrices corresponding to the text data, where element values at the same position in the word frequency matrices represent the word frequency of the same text feature word in each text data.
Specifically, the server first calculates the IF-IDF value of each word in the word sets using the IF-IDF algorithm, sorts the words by their IF-IDF values, and screens out the preset-feature-word number of text feature words with the highest IF-IDF values; it then traverses the word set of each text data according to the text feature words and counts the occurrences of the text feature words in the text data, i.e. the IF value of each text feature word for each text data; finally, from these IF values and the IDF values of the text feature words, it obtains the IF-IDF value of each text feature word for each text data, and obtains the word frequency matrix of the text data from these IF-IDF values, where element values at the same position in the word frequency matrices represent the word frequency of the same text feature word in each text data.
In one embodiment, after traversing the word sets according to the text feature words to obtain the word frequency matrices corresponding to the text data, the method further includes:
obtaining feature word scores of the text feature words according to the word frequency matrices; and
after constructing the knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as node connection relationships, the method further includes:
constructing an inverted index of the text feature words according to the nodes of the knowledge graph, and recording the feature word scores of the text feature words to obtain the feature word score table.
The feature word score of a text feature word characterizes the word frequency of that feature word in the text data set; for example, the feature word score may specifically be the IDF value of the text feature word over the text data set.
Specifically, the server calculates the feature word scores of the text feature words according to the word frequency matrices and, after constructing the knowledge graph, constructs an inverted index keyed on the text feature words according to the nodes of the knowledge graph, records the feature word scores of the text feature words, and obtains and stores the feature word score table.
In this embodiment, by obtaining the feature word scores of the text feature words from the word frequency matrices and recording the feature word scores, the feature word score table can be obtained; meanwhile, constructing an inverted index of the text feature words according to the nodes of the knowledge graph can improve retrieval efficiency.
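The inverted index and feature word score table described above can be sketched as follows; the node texts and the score values are illustrative stand-ins.

```python
# Sketch of the inverted index over knowledge graph nodes: for each text
# feature word, record which nodes contain it, alongside a feature word score
# table (e.g. the word's IDF over the text data set).
def build_inverted_index(node_texts):
    """node_texts: {node_id: set_of_feature_words}"""
    index = {}
    for node_id, words in node_texts.items():
        for w in words:
            index.setdefault(w, set()).add(node_id)
    return index

node_texts = {"t1": {"graph", "fuzzy"}, "t2": {"graph"}}  # illustrative
feature_word_scores = {"graph": 0.4, "fuzzy": 1.1}        # hypothetical IDF values
index = build_inverted_index(node_texts)
print(sorted(index["graph"]))  # ['t1', 't2']
```

A query keyword lookup is then a single dictionary access returning candidate nodes, which is why the inverted index improves retrieval efficiency over scanning every node.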
In one embodiment, as shown in FIG. 3, the knowledge graph-based fuzzy matching method of this application is illustrated by a schematic flowchart; the method includes the following steps:
Step S302: obtain a text data set, and perform word segmentation on the text data in the text data set to obtain word sets corresponding to the text data;
Step S304: input the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, and obtain text vectors corresponding to the text data according to the word vector sets;
Step S306: calculate text similarities between text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, and determine target similar texts corresponding to the text data;
Step S308: construct a knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as node connection relationships;
Step S310: receive a retrieval request carrying a retrieval sentence, and perform word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
Step S312: query the constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords;
Step S314: query the constructed knowledge graph according to the knowledge graph node texts, and obtain, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
Step S316: obtain keyword scores corresponding to the query keywords according to a preset feature word score table, and obtain, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
Step S318: sort the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
It should be understood that although the steps in the flowcharts of FIGS. 2-3 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times; their execution order is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 4, a knowledge graph-based fuzzy matching apparatus is provided, including a receiving module 402, a first query module 404, a second query module 406, a processing module 408, and a sorting module 410, wherein:
the receiving module 402 is configured to receive a retrieval request carrying a retrieval sentence and perform word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
the first query module 404 is configured to query the constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships;
the second query module 406 is configured to query the constructed knowledge graph according to the knowledge graph node texts and obtain, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
the processing module 408 is configured to obtain keyword scores corresponding to the query keywords according to a preset feature word score table and obtain, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
the sorting module 410 is configured to sort the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
The above knowledge graph-based fuzzy matching apparatus obtains a query word bag including query keywords by segmenting the retrieval sentence; by querying the constructed knowledge graph according to the query word bag, the knowledge graph node texts containing the query keywords can be obtained; the constructed knowledge graph can then be further queried according to the knowledge graph node texts to obtain the similar text set corresponding to the knowledge graph node texts; finally, the first retrieval scores of the knowledge graph node texts and the second retrieval scores of the similar node texts in the similar text set are computed using the keyword scores and the node connection relationships between the nodes of the knowledge graph, and the knowledge graph node texts and similar node texts are sorted according to the first and second retrieval scores to obtain the retrieval result corresponding to the retrieval sentence, achieving accurate fuzzy matching and improving fuzzy matching accuracy.
In one embodiment, the processing module is further configured to calculate the first retrieval scores of the knowledge graph node texts according to the keyword scores, determine, according to the knowledge graph node texts, the target node texts corresponding to the similar node texts in the similar text set, and calculate the second retrieval scores of the similar node texts according to the target node texts and the node connection relationships.
In one embodiment, the knowledge graph-based fuzzy matching apparatus further includes a knowledge graph construction module configured to obtain a text data set, perform word segmentation on the text data in the text data set to obtain word sets corresponding to the text data, input the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, obtain text vectors corresponding to the text data according to the word vector sets, calculate text similarities between text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, determine target similar texts corresponding to the text data, and construct a knowledge graph according to the target similar texts, taking the text data as nodes and the text similarities corresponding to the target similar texts as node connection relationships.
In one embodiment, the knowledge graph construction module is further configured to calculate, according to the word vector sets, the per-dimension averages of the word vectors in the word vector sets and collect the per-dimension averages to obtain the text vectors corresponding to the text data.
In one embodiment, the knowledge graph construction module is further configured to obtain, according to the word sets and the preset word frequency statistics algorithm, a preset number of similar texts related to the text data in the text data set, calculate, according to the text vectors, the text similarities between the text data and the similar texts among the preset number of similar texts, and select the target similar texts corresponding to the text data according to the text similarities.
In one embodiment, the knowledge graph construction module is further configured to perform word frequency statistics according to the word sets and the preset word frequency statistics algorithm to obtain text feature words, traverse the word sets according to the text feature words to obtain word frequency matrices corresponding to the text data, calculate word frequency similarities between pairs of text data according to the word frequency matrices, and obtain, according to the word frequency similarities, the preset number of similar texts related to the text data in the text data set.
In one embodiment, the knowledge graph construction module is further configured to calculate the IF-IDF value corresponding to each word in the word sets using the IF-IDF algorithm, sort the words by their IF-IDF values, and screen out the preset-feature-word number of text feature words with the highest IF-IDF values.
In one embodiment, the knowledge graph construction module is further configured to traverse the word sets according to the text feature words to obtain the IF value of each text feature word for each text data, obtain the IDF values of the text feature words, obtain, from the IF values and the IDF values, the IF-IDF value of each text feature word for each text data, and obtain, from these IF-IDF values, the word frequency matrices corresponding to the text data, where element values at the same position in the word frequency matrices represent the word frequency of the same text feature word in each text data.
In one embodiment, the knowledge graph construction module is further configured to obtain the feature word scores of the text feature words according to the word frequency matrices, construct an inverted index of the text feature words according to the nodes of the knowledge graph, and record the feature word scores of the text feature words to obtain the feature word score table.
For specific limitations on the knowledge graph-based fuzzy matching apparatus, reference may be made to the above limitations on the knowledge graph-based fuzzy matching method, which are not repeated here. Each module in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile or volatile storage medium and an internal memory. The non-volatile or volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile or volatile storage medium. The database of the computer device is used to store text data sets and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by a processor, implement a knowledge graph-based fuzzy matching method.
A person skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processor, cause the one or more processors to perform the following steps:
receiving a retrieval request carrying a retrieval sentence, and performing word segmentation on the retrieval sentence to obtain a query word bag including query keywords;
querying a constructed knowledge graph according to the query word bag to obtain knowledge graph node texts containing the query keywords, the constructed knowledge graph taking text data as nodes and taking text similarities corresponding to the text data as node connection relationships;
querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relationships, a similar text set corresponding to the knowledge graph node texts;
obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relationships, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text set; and
sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
calculating the first retrieval scores of the knowledge graph node texts according to the keyword scores, and determining, according to the knowledge graph node texts, target node texts corresponding to the similar node texts in the similar text set; and
calculating the second retrieval scores of the similar node texts according to the target node texts and the node connection relationships.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
obtaining a text data set, and performing word segmentation on the text data in the text data set to obtain word sets corresponding to the text data;
inputting the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, and obtaining text vectors corresponding to the text data according to the word vector sets;
calculating text similarities between text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, and determining target similar texts corresponding to the text data; and
constructing a knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as node connection relationships.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
calculating, according to the word vector sets, the per-dimension averages of the word vectors in the word vector sets; and
collecting the per-dimension averages to obtain the text vectors corresponding to the text data.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
obtaining, according to the word sets and the preset word frequency statistics algorithm, a preset number of similar texts related to the text data in the text data set;
calculating, according to the text vectors, the text similarities between the text data and the similar texts among the preset number of similar texts; and
selecting the target similar texts corresponding to the text data according to the text similarities.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
performing word frequency statistics according to the word sets and the preset word frequency statistics algorithm to obtain text feature words;
traversing the word sets according to the text feature words to obtain word frequency matrices corresponding to the text data;
calculating word frequency similarities between pairs of text data according to the word frequency matrices; and
obtaining, according to the word frequency similarities, the preset number of similar texts related to the text data in the text data set.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
calculating the IF-IDF value corresponding to each word in the word sets using the IF-IDF algorithm; and
sorting the words by their IF-IDF values and screening out the preset-feature-word number of text feature words with the highest IF-IDF values.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
traversing the word sets according to the text feature words to obtain the IF value of each text feature word for each text data;
obtaining the IDF values of the text feature words, and obtaining, from the IF values and the IDF values of the text feature words, the IF-IDF value of each text feature word for each text data; and
obtaining, from the IF-IDF value of each text feature word for each text data, the word frequency matrices corresponding to the text data, where element values at the same position in the word frequency matrices represent the word frequency of the same text feature word in each text data.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
obtaining the feature word scores of the text feature words according to the word frequency matrices.
In one embodiment, the processor further implements the following steps when executing the computer-readable instructions:
constructing an inverted index of the text feature words according to the nodes of the knowledge graph, and recording the feature word scores of the text feature words to obtain the feature word score table.
In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the steps in the above method embodiments.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing related hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should all be considered within the scope of this specification.
The above embodiments only express several implementations of this application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims (20)

  1. A knowledge-graph-based fuzzy matching method, comprising:
    receiving a retrieval request carrying a retrieval sentence, and segmenting the retrieval sentence into words to obtain a query bag of words including query keywords;
    querying a constructed knowledge graph according to the query bag of words to obtain knowledge graph node texts containing the query keywords, wherein the constructed knowledge graph takes text data as nodes and takes text similarities corresponding to the text data as node connection relations;
    querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relations, similar text sets corresponding to the knowledge graph node texts;
    obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relations, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text sets; and
    sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
  2. The method according to claim 1, wherein the obtaining, according to the keyword scores and the node connection relations, the first retrieval scores of the knowledge graph node texts and the second retrieval scores of the similar node texts in the similar text sets comprises:
    calculating the first retrieval score of the knowledge graph node text according to the keyword scores, and determining, according to the knowledge graph node text, a target node text corresponding to the similar node text in the similar text set; and
    calculating the second retrieval score of the similar node text according to the target node text and the node connection relation.
  3. The method according to claim 1, wherein before the querying the constructed knowledge graph according to the query bag of words to obtain the knowledge graph node texts containing the query keywords, the method further comprises:
    obtaining a text data set, and segmenting text data in the text data set into words to obtain word sets corresponding to the text data;
    inputting the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, and obtaining text vectors corresponding to the text data according to the word vector sets;
    calculating text similarities between the text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, and determining target similar texts corresponding to the text data; and
    constructing the knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as the node connection relations.
  4. The method according to claim 3, wherein the obtaining the text vectors corresponding to the text data according to the word vector sets comprises:
    calculating, according to the word vector set, an average value of each same dimension across the word vectors in the word vector set; and
    collecting the same-dimension average values to obtain the text vector corresponding to the text data.
  5. The method according to claim 3, wherein the calculating the text similarities between the text data in the text data set according to the text vectors, the word sets, and the preset word frequency statistics algorithm, and determining the target similar texts corresponding to the text data comprises:
    obtaining, according to the word sets and the preset word frequency statistics algorithm, a preset number of similar texts related to the text data in the text data set;
    calculating, according to the text vectors, the text similarity between the text data and each similar text among the preset number of similar texts; and
    selecting the target similar texts corresponding to the text data according to the text similarities.
  6. The method according to claim 5, wherein the obtaining, according to the word sets and the preset word frequency statistics algorithm, the preset number of similar texts related to the text data in the text data set comprises:
    performing word frequency statistics according to the word sets and the preset word frequency statistics algorithm to obtain text feature words;
    traversing the word sets according to the text feature words to obtain a word frequency matrix corresponding to the text data;
    calculating pairwise word frequency similarities between the text data according to the word frequency matrix; and
    obtaining, according to the word frequency similarities, the preset number of similar texts related to the text data in the text data set.
  7. The method according to claim 6, wherein the preset word frequency statistics algorithm is the TF-IDF algorithm, and the performing word frequency statistics according to the word sets and the preset word frequency statistics algorithm to obtain the text feature words comprises:
    calculating, using the TF-IDF algorithm, a TF-IDF value for each word in the word set; and
    sorting the words by their TF-IDF values, and selecting therefrom the preset number of text feature words with the highest TF-IDF values.
  8. The method according to claim 6, wherein the traversing the word sets according to the text feature words to obtain the word frequency matrix corresponding to the text data comprises:
    traversing the word sets according to the text feature words to obtain a TF value of each text feature word for each text data;
    obtaining IDF values of the text feature words, and obtaining a TF-IDF value of each text feature word for each text data according to the TF values and the IDF values of the text feature words; and
    obtaining, according to the TF-IDF value of each text feature word for each text data, the word frequency matrix corresponding to the text data, wherein element values at the same position in the word frequency matrix represent the frequency with which the same text feature word appears in each text data.
  9. The method according to claim 6, wherein after the traversing the word sets according to the text feature words to obtain the word frequency matrix corresponding to the text data, the method further comprises:
    obtaining feature word scores of the text feature words according to the word frequency matrix; and
    after the constructing the knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as the node connection relations, the method further comprises:
    constructing an inverted index of the text feature words according to the nodes of the knowledge graph, and recording the feature word scores of the text feature words to obtain the feature word score table.
  10. A knowledge-graph-based fuzzy matching apparatus, comprising:
    a receiving module, configured to receive a retrieval request carrying a retrieval sentence, and segment the retrieval sentence into words to obtain a query bag of words including query keywords;
    a first query module, configured to query a constructed knowledge graph according to the query bag of words to obtain knowledge graph node texts containing the query keywords, wherein the constructed knowledge graph takes text data as nodes and takes text similarities corresponding to the text data as node connection relations;
    a second query module, configured to query the constructed knowledge graph according to the knowledge graph node texts, and obtain, according to the node connection relations, similar text sets corresponding to the knowledge graph node texts;
    a processing module, configured to obtain keyword scores corresponding to the query keywords according to a preset feature word score table, and obtain, according to the keyword scores and the node connection relations, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text sets; and
    a sorting module, configured to sort the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
  11. A computer device, comprising a memory and one or more processors, wherein the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    receiving a retrieval request carrying a retrieval sentence, and segmenting the retrieval sentence into words to obtain a query bag of words including query keywords;
    querying a constructed knowledge graph according to the query bag of words to obtain knowledge graph node texts containing the query keywords, wherein the constructed knowledge graph takes text data as nodes and takes text similarities corresponding to the text data as node connection relations;
    querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relations, similar text sets corresponding to the knowledge graph node texts;
    obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relations, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text sets; and
    sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
  12. The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    calculating the first retrieval score of the knowledge graph node text according to the keyword scores, and determining, according to the knowledge graph node text, a target node text corresponding to the similar node text in the similar text set; and
    calculating the second retrieval score of the similar node text according to the target node text and the node connection relation.
  13. The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    obtaining a text data set, and segmenting text data in the text data set into words to obtain word sets corresponding to the text data;
    inputting the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, and obtaining text vectors corresponding to the text data according to the word vector sets;
    calculating text similarities between the text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, and determining target similar texts corresponding to the text data; and
    constructing a knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as node connection relations.
  14. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    calculating, according to the word vector set, an average value of each same dimension across the word vectors in the word vector set; and
    collecting the same-dimension average values to obtain the text vector corresponding to the text data.
  15. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    obtaining, according to the word sets and the preset word frequency statistics algorithm, a preset number of similar texts related to the text data in the text data set;
    calculating, according to the text vectors, the text similarity between the text data and each similar text among the preset number of similar texts; and
    selecting the target similar texts corresponding to the text data according to the text similarities.
  16. One or more computer-readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a retrieval request carrying a retrieval sentence, and segmenting the retrieval sentence into words to obtain a query bag of words including query keywords;
    querying a constructed knowledge graph according to the query bag of words to obtain knowledge graph node texts containing the query keywords, wherein the constructed knowledge graph takes text data as nodes and takes text similarities corresponding to the text data as node connection relations;
    querying the constructed knowledge graph according to the knowledge graph node texts, and obtaining, according to the node connection relations, similar text sets corresponding to the knowledge graph node texts;
    obtaining keyword scores corresponding to the query keywords according to a preset feature word score table, and obtaining, according to the keyword scores and the node connection relations, first retrieval scores of the knowledge graph node texts and second retrieval scores of similar node texts in the similar text sets; and
    sorting the knowledge graph node texts and the similar node texts according to the first retrieval scores and the second retrieval scores to obtain a retrieval result corresponding to the retrieval sentence.
  17. The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    calculating the first retrieval score of the knowledge graph node text according to the keyword scores, and determining, according to the knowledge graph node text, a target node text corresponding to the similar node text in the similar text set; and
    calculating the second retrieval score of the similar node text according to the target node text and the node connection relation.
  18. The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    obtaining a text data set, and segmenting text data in the text data set into words to obtain word sets corresponding to the text data;
    inputting the word sets into a trained word vector model to obtain word vector sets corresponding to the word sets, and obtaining text vectors corresponding to the text data according to the word vector sets;
    calculating text similarities between the text data in the text data set according to the text vectors, the word sets, and a preset word frequency statistics algorithm, and determining target similar texts corresponding to the text data; and
    constructing a knowledge graph according to the target similar texts, taking the text data as nodes and taking the text similarities corresponding to the target similar texts as node connection relations.
  19. The storage media according to claim 18, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    calculating, according to the word vector set, an average value of each same dimension across the word vectors in the word vector set; and
    collecting the same-dimension average values to obtain the text vector corresponding to the text data.
  20. The storage media according to claim 18, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    obtaining, according to the word sets and the preset word frequency statistics algorithm, a preset number of similar texts related to the text data in the text data set;
    calculating, according to the text vectors, the text similarity between the text data and each similar text among the preset number of similar texts; and
    selecting the target similar texts corresponding to the text data according to the text similarities.
PCT/CN2021/091060 2020-12-31 2021-04-29 Knowledge-graph-based fuzzy matching method and apparatus, computer device, and storage medium WO2022142027A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011633652.0A CN112732883A (zh) 2020-12-31 2020-12-31 Knowledge-graph-based fuzzy matching method and apparatus, and computer device
CN202011633652.0 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022142027A1 true WO2022142027A1 (zh) 2022-07-07

Family

ID=75608543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091060 WO2022142027A1 (zh) 2020-12-31 2021-04-29 Knowledge-graph-based fuzzy matching method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112732883A (zh)
WO (1) WO2022142027A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116226348A (zh) * 2023-03-01 2023-06-06 读书郎教育科技有限公司 A knowledge-graph-based learning method
CN116450776A (zh) * 2023-04-23 2023-07-18 北京石油化工学院 Knowledge-graph-based retrieval system for oil and gas pipeline network laws, regulations, and technical standards
CN116595197A (zh) * 2023-07-10 2023-08-15 清华大学深圳国际研究生院 Link prediction method and system for a knowledge graph associated with patent classification numbers
CN116701573A (zh) * 2023-06-06 2023-09-05 哈尔滨理工大学 A query method and system based on a temporal knowledge graph
CN116932767A (zh) * 2023-09-18 2023-10-24 江西农业大学 Knowledge-graph-based text classification method, system, storage medium, and computer
CN117172322A (zh) * 2023-11-03 2023-12-05 中国标准化研究院 A method for building a digital rural knowledge graph
CN117271712A (zh) * 2023-11-21 2023-12-22 上海爱可生信息技术股份有限公司 Vector-database-based retrieval method, system, and electronic device
CN117688251A (zh) * 2024-02-04 2024-03-12 北京奥维云网大数据科技股份有限公司 A knowledge-graph-based commodity retrieval method and system
CN117807191A (zh) * 2024-02-29 2024-04-02 船舶信息研究中心(中国船舶集团有限公司第七一四研究所) A knowledge-graph-based unstructured data retrieval method and system
CN117807191B (zh) * 2024-02-29 2024-05-24 船舶信息研究中心(中国船舶集团有限公司第七一四研究所) A knowledge-graph-based unstructured data retrieval method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641833B (zh) * 2021-08-17 2024-04-09 同济大学 Service demand matching method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516047A (zh) * 2019-09-02 2019-11-29 湖南工业大学 Knowledge-graph-based retrieval method and retrieval system for the packaging field
CN110928984A (zh) * 2019-09-30 2020-03-27 珠海格力电器股份有限公司 Knowledge graph construction method, apparatus, terminal, and storage medium
CN111859147A (zh) * 2020-07-31 2020-10-30 中国工商银行股份有限公司 Object recommendation method, object recommendation apparatus, and electronic device
US20200364233A1 (en) * 2019-05-15 2020-11-19 WeR.AI, Inc. Systems and methods for a context sensitive search engine using search criteria and implicit user feedback

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890711B (zh) * 2012-09-13 2015-08-12 中国人民解放军国防科学技术大学 A retrieval ranking method and system
JP6843588B2 (ja) * 2016-11-04 2021-03-17 株式会社日立製作所 Document retrieval method and apparatus
CN109033132B (zh) * 2018-06-05 2020-12-11 中证征信(深圳)有限公司 Method and apparatus for calculating the relevance between a text and a subject using a knowledge graph
US10891321B2 (en) * 2018-08-28 2021-01-12 American Chemical Society Systems and methods for performing a computer-implemented prior art search
CN109582849A (zh) * 2018-12-03 2019-04-05 浪潮天元通信信息系统有限公司 A knowledge-graph-based intelligent network resource retrieval method
CN110188166B (zh) * 2019-05-15 2021-10-15 北京字节跳动网络技术有限公司 Document search method, apparatus, and electronic device
CN111723179B (zh) * 2020-05-26 2023-07-07 湖北师范大学 Feedback model information retrieval method, system, and medium based on a concept graph
CN111400607B (zh) * 2020-06-04 2020-11-10 浙江口碑网络技术有限公司 Search content output method, apparatus, computer device, and readable storage medium



Also Published As

Publication number Publication date
CN112732883A (zh) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022142027A1 (zh) Knowledge-graph-based fuzzy matching method and apparatus, computer device, and storage medium
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
WO2019136993A1 (zh) 文本相似度计算方法、装置、计算机设备和存储介质
WO2019091026A1 (zh) 知识库文档快速检索方法、应用服务器及计算机可读存储介质
CN109885773B (zh) 一种文章个性化推荐方法、系统、介质及设备
US10019442B2 (en) Method and system for peer detection
WO2017097231A1 (zh) 话题处理方法及装置
CN111797214A (zh) 基于faq数据库的问题筛选方法、装置、计算机设备及介质
US20160034514A1 (en) Providing search results based on an identified user interest and relevance matching
CN104573130B (zh) 基于群体计算的实体解析方法及装置
US20060224584A1 (en) Automatic linear text segmentation
US20080040342A1 (en) Data processing apparatus and methods
JP2020500371A (ja) 意味的検索のための装置および方法
Gao et al. Multimedia social event detection in microblog
CN108647322B (zh) 基于词网识别大量Web文本信息相似度的方法
US20180341686A1 (en) System and method for data search based on top-to-bottom similarity analysis
Xie et al. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb
US20180373754A1 (en) System and method for conducting a textual data search
WO2018090468A1 (zh) 视频节目的搜索方法和装置
CN110569289B (zh) 基于大数据的列数据处理方法、设备及介质
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
CN117435685A (zh) 文档检索方法、装置、计算机设备、存储介质和产品
Tao et al. Doc2cube: Automated document allocation to text cube via dimension-aware joint embedding
Tandjung et al. Topic modeling with latent-dirichlet allocation for the discovery of state-of-the-art in research: A literature review
US20240037375A1 (en) Systems and Methods for Knowledge Distillation Using Artificial Intelligence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912789

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912789

Country of ref document: EP

Kind code of ref document: A1