
CN111400493A - Text matching method, device and equipment based on slot position similarity and storage medium

Info

Publication number: CN111400493A
Application number: CN202010149508.3A
Authority: CN
Inventors: 何斐斐
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2020-07-10

Abstract

The invention relates to the technical field of artificial intelligence, and discloses a text matching method, device, equipment and storage medium based on slot similarity, which are used for improving the accuracy of question matching. The text matching method based on slot similarity comprises the following steps: performing a series of processing operations, such as coarse-grained division, slot establishment and fine-grained division, on the questions in the target field so as to establish dictionaries in advance and determine subdivision rules; performing slot filling on a target question sentence currently input by a user according to the established dictionaries and subdivision rules, and selecting from an existing question-answer library the questions of the same question type as the input target question sentence to obtain a candidate question set; and calculating the similarity between the slot words of the target question sentence input by the user and the slot words of each question in the candidate question set, and determining the similar question that best matches the input target question sentence according to the slot similarity calculation result and the subdivision rules.

Description

Text matching method, device and equipment based on slot position similarity and storage medium

Technical Field

The invention relates to the technical field of similarity matching of artificial intelligence, in particular to a text matching method, a text matching device, text matching equipment and a text matching storage medium based on slot similarity.

Background

Text semantic matching has wide application scenarios in the field of natural language processing (NLP), for example judging the semantic similarity of sentences A and B in a machine translation scenario, and the quality of text semantic matching directly influences the quality of the results.

Existing general-purpose matching algorithms based on deep learning or machine learning offer no text semantic matching solution targeted at a specific field, such as the insurance field, and apply no special statistical treatment to high-frequency questions. Because the text semantic matching accuracy of a general-purpose matching algorithm rarely meets the requirement, a user cannot obtain an accurate answer and a human agent must provide the service, which raises the operating cost of the enterprise and lowers customer satisfaction.

Disclosure of Invention

The invention mainly aims to solve the technical problem that existing text matching algorithms provide insufficient text semantic matching accuracy.

In order to achieve the above object, a first aspect of the present invention provides a text matching method based on slot similarity, including:

identifying and extracting entity words in a target question sentence input by a user;

finding out corresponding keywords in a pre-established trigger word dictionary according to the entity words, and dividing the coarse-grained problem types of the target problem sentences according to the keywords;

finding and combining problems with the same problem type as the target problem statement in a pre-established problem classification set to obtain a candidate problem set;

performing word segmentation processing on the target problem statement, and performing slot filling on the target problem statement according to a word segmentation processing result and slot information established in a preset subdivision rule;

calculating slot position similarity between the target question statement and each question in the candidate question set by using a similarity calculation algorithm;

and determining the similar problem which is most matched with the target problem statement according to the slot position similarity calculation result and the subdivision rule.

Optionally, in another implementation manner of the first aspect of the present invention, before the identifying and extracting the entity words in the target question sentence input by the user, the method further includes:

identifying and extracting entity words related to the target field corpus by using a named entity identification algorithm, and establishing an entity dictionary corresponding to the target field language according to the entity words;

performing statistical word segmentation processing on the questions in the question-answer library of the target field, and extracting keywords according to word segmentation results;

expanding the keywords to form a trigger word set, and establishing a trigger word dictionary according to the trigger word set;

dividing the questions in the question-answering database of the target field into coarse-grained question types according to the keywords in the trigger word set, and establishing corresponding slot position information for each coarse-grained question type according to the entity dictionary;

performing statistical word segmentation processing on the corpus of the coarse-grained problem types, extracting high-frequency words in the slot position information corresponding to the coarse-grained problem types according to word segmentation results, and establishing a high-frequency word dictionary according to the high-frequency words;

further dividing the questions in the question-answer library in the target field into fine-grained question types according to the high-frequency words in the high-frequency word dictionary;

setting a corresponding subdivision rule for each fine-grained problem type, wherein the subdivision rule comprises establishing corresponding slot position information for each fine-grained problem type according to the entity dictionary;

and classifying the questions in the question-answering database in the target field according to the coarse-grained question types divided by the keywords in the trigger word dictionary, and establishing a question classification set classified according to the coarse-grained question types.

Optionally, in another implementation manner of the first aspect of the present invention, the performing statistical word segmentation on the questions in the question-and-answer library in the target field, and extracting keywords according to a word segmentation result includes:

performing word segmentation processing on the questions in the question-answer library of the target field by adopting a statistical word segmentation algorithm, and extracting high-frequency words in word segmentation results according to word frequency information;

and extracting the high-frequency words to obtain corresponding keywords, wherein the keywords are nouns and verbs in the high-frequency words.
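The keyword step above can be sketched as follows. This is a minimal illustration, assuming the questions have already been segmented and part-of-speech tagged by some external tool; the tag names `"n"` and `"v"`, the threshold `min_count`, and the function name `extract_keywords` are hypothetical and do not come from the source.

```python
from collections import Counter


def extract_keywords(tagged_questions, min_count=2, keep_tags=("n", "v")):
    """Return high-frequency nouns and verbs from segmented, tagged questions.

    tagged_questions: list of questions, each a list of (word, pos_tag) pairs.
    min_count and keep_tags are illustrative assumptions.
    """
    counts = Counter()
    tags = {}
    for question in tagged_questions:
        for word, tag in question:
            counts[word] += 1
            tags.setdefault(word, tag)  # remember the first tag seen per word
    # Keep words that are frequent enough and tagged as noun or verb.
    return [word for word, count in counts.most_common()
            if count >= min_count and tags[word] in keep_tags]


questions = [
    [("recommend", "v"), ("a", "x"), ("policy", "n")],
    [("can", "x"), ("I", "x"), ("insure", "v"), ("a", "x"), ("policy", "n")],
    [("recommend", "v"), ("insurance", "n")],
]
keywords = extract_keywords(questions)
```

Here "a" is frequent but is dropped because it is neither a noun nor a verb, while "insure" is a verb but is dropped because it occurs only once.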

Optionally, in another implementation manner of the first aspect of the present invention, the expanding the keyword to form a trigger word set, and establishing a trigger word dictionary according to the trigger word set includes:

calling a synonym thesaurus (Tongyici Cilin) and the HowNet Chinese lexicon to expand the vocabulary of the keywords;

combining the expanded words to form a trigger word set;

and establishing a trigger word dictionary according to the trigger word set.
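A minimal sketch of the expansion step, assuming the synonym resources are available as a plain lookup table. The table `SYNONYMS`, its entries, and the function name `build_trigger_dictionary` are hypothetical stand-ins for whatever lexicon is actually queried.

```python
def build_trigger_dictionary(keywords, synonym_table):
    """Map every trigger word (keyword or synonym) back to its source keyword."""
    trigger_dict = {}
    for keyword in keywords:
        trigger_dict[keyword] = keyword
        for synonym in synonym_table.get(keyword, ()):
            # setdefault keeps the first keyword that claimed a synonym.
            trigger_dict.setdefault(synonym, keyword)
    return trigger_dict


# Hypothetical synonym table standing in for an external lexicon.
SYNONYMS = {
    "recommend": ["suggest", "advise"],
    "insure": ["apply for insurance", "purchase"],
}
trigger_dict = build_trigger_dictionary(["recommend", "insure"], SYNONYMS)
```

Storing the source keyword as the value lets a later lookup go from any expanded trigger word back to the keyword that defines the coarse-grained type.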

Optionally, in another implementation manner of the first aspect of the present invention, the performing a word segmentation process on the target problem statement, and performing slot filling on the target problem statement according to a word segmentation process result and slot information established in a preset subdivision rule includes:

creating a hash map from a pre-established high-frequency word dictionary, wherein the keys of the hash map are the high-frequency words stored in the pre-established high-frequency word dictionary;

performing word segmentation processing on the target question sentence, and judging whether each word obtained by word segmentation exists in the hash map;

and when a word obtained by word segmentation is judged to exist in the hash map, performing slot filling on the target question sentence according to the slot information established in a preset subdivision rule.
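The lookup described above can be sketched with an ordinary dictionary. The source only states that the keys are high-frequency words; storing the slot name as the value is an assumption made here for illustration, as are the names `fill_slots` and `HIGH_FREQ` and the sample tokens.

```python
def fill_slots(tokens, high_freq_dictionary):
    """Fill slots for tokens that appear as keys in the high-frequency hash map.

    high_freq_dictionary maps word -> slot name (the value is an assumption).
    """
    hash_map = dict(high_freq_dictionary)  # average O(1) membership test
    slots = {}
    for token in tokens:
        if token in hash_map:
            # Keep the first word found for each slot.
            slots.setdefault(hash_map[token], token)
    return slots


HIGH_FREQ = {"leukemia": "disease", "XX insurance": "insurance"}
tokens = ["can", "leukemia", "be insured under", "XX insurance"]
slots = fill_slots(tokens, HIGH_FREQ)
```

Tokens that are absent from the hash map are simply skipped, so a sentence with no known high-frequency word yields an empty slot set.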

Optionally, in another implementation manner of the first aspect of the present invention, the calculating, by using a similarity calculation algorithm, a slot similarity between the target question statement and each question in the candidate question set includes:

processing the target question sentence and each question in the candidate question set respectively with a BERT model to obtain the word vectors corresponding to each question;

respectively calculating the average value of the word vectors corresponding to the target problem statement and each problem in the candidate problem set, and taking the obtained average value of the word vectors corresponding to each problem as the word vector of each corresponding slot;

respectively calculating cosine distances between word vectors of the slot positions corresponding to the target question sentences and word vectors of the slot positions corresponding to the questions in the candidate question set; the cosine distance is used for judging the slot position similarity between the target question statement and each question in the candidate question set.
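The averaging and cosine steps can be sketched as below. The toy two-dimensional vectors stand in for model outputs and are not real BERT embeddings; `mean_vector` and `cosine_similarity` are hypothetical helper names. Cosine distance, where needed, is one minus the similarity computed here.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word vectors for the words filling one slot in each question.
target_slot = mean_vector([[1.0, 0.0], [0.0, 1.0]])
candidate_slot = mean_vector([[2.0, 2.0]])
similarity = cosine_similarity(target_slot, candidate_slot)
cosine_distance = 1.0 - similarity
```

Because cosine similarity ignores vector length, the two slot vectors above count as identical in direction even though their magnitudes differ.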

The second aspect of the present invention provides a text matching apparatus based on slot similarity, including:

the entity recognition and extraction module is used for recognizing and extracting entity words in target question sentences input by users;

the coarse-grained problem division module is used for finding out corresponding keywords in a pre-established trigger word dictionary according to the entity words and dividing the coarse-grained problem types of the target problem sentences according to the keywords;

a candidate problem set acquisition module, configured to find and combine problems in a pre-established problem classification set that have the same problem type as the target problem statement to obtain a candidate problem set;

the word segmentation and slot position filling module is used for performing word segmentation on the target problem statement and filling slot positions on the target problem statement according to a word segmentation processing result and slot position information established in a preset subdivision rule;

the similarity calculation module is used for calculating the slot position similarity between the target question statement and each question in the candidate question set by using a similarity calculation algorithm;

and the best matching similarity question determining module is used for determining a similarity question which is most matched with the target question statement according to the slot position similarity calculation result and the subdivision rule.

Optionally, in another implementation manner of the second aspect of the present invention, the text matching apparatus based on slot similarity further includes:

the entity dictionary establishing module is used for identifying and extracting entity words related to the target field linguistic data by using a named entity recognition algorithm and establishing an entity dictionary corresponding to the target field language according to the entity words;

the word segmentation and keyword extraction module is used for carrying out statistical word segmentation on the questions in the question-answer library of the target field and extracting keywords according to word segmentation results;

the trigger word dictionary establishing module is used for expanding the keywords to form a trigger word set and establishing a trigger word dictionary according to the trigger word set;

the coarse-grained division and corresponding slot position establishment module is used for dividing the questions in the question-answering database in the target field into coarse-grained question types according to the keywords in the trigger word set and establishing corresponding slot position information for each coarse-grained question type according to the entity dictionary;

the high-frequency word dictionary establishing module is used for carrying out statistical word segmentation on the corpus of the coarse-grained problem types, extracting high-frequency words in the slot position information corresponding to the coarse-grained problem types according to word segmentation results, and establishing a high-frequency word dictionary according to the high-frequency words;

the fine-grained division module is used for further dividing the questions in the question-answer library of the target field into fine-grained question types according to the high-frequency words in the high-frequency word dictionary;

the subdivision rule setting module is used for setting a corresponding subdivision rule for each fine-grained problem type, wherein the subdivision rule comprises establishing corresponding slot position information for each fine-grained problem type according to the entity dictionary;

and the problem classification set establishing module is used for classifying the problems in the target field question-answering library in a coarse-grained manner according to the coarse-grained problem types divided by the keywords in the trigger word dictionary and establishing a problem classification set classified according to the coarse-grained problem types.

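Optionally, in another implementation manner of the second aspect of the present invention, the word segmentation and keyword extraction module is specifically configured to:

perform word segmentation on the questions in the question-answer library of the target field by adopting a statistical word segmentation algorithm, and extract high-frequency words from the word segmentation results according to word frequency information;

and extract corresponding keywords from the high-frequency words, wherein the keywords are the nouns and verbs among the high-frequency words.

Optionally, in another implementation manner of the second aspect of the present invention, the trigger word dictionary establishing module is specifically configured to:

call a synonym thesaurus (Tongyici Cilin) and the HowNet Chinese lexicon to expand the vocabulary of the keywords;

combine the expanded words to form a trigger word set;

and establish a trigger word dictionary according to the trigger word set.

Optionally, in another implementation manner of the second aspect of the present invention, the word segmentation and slot filling module is specifically configured to:

create a hash map from a pre-established high-frequency word dictionary, wherein the keys of the hash map are the high-frequency words stored in the pre-established high-frequency word dictionary;

perform word segmentation on the target question sentence, and judge whether each word obtained by word segmentation exists in the hash map;

and when a word obtained by word segmentation is judged to exist in the hash map, perform slot filling on the target question sentence according to the slot information established in a preset subdivision rule.

Optionally, in another implementation manner of the second aspect of the present invention, the similarity calculation module is specifically configured to:

process the target question sentence and each question in the candidate question set respectively with a BERT model to obtain the word vectors corresponding to each question;

calculate the average value of the word vectors corresponding to the target question sentence and to each question in the candidate question set, and take each average value as the word vector of the corresponding slot;

and calculate the cosine distances between the slot word vectors of the target question sentence and the slot word vectors of each question in the candidate question set, the cosine distance being used for judging the slot similarity between the target question sentence and each question in the candidate question set.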

The third aspect of the present invention provides a text matching apparatus based on slot similarity, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the instructions in the memory to cause the slot similarity-based text matching device to perform the method of the first aspect.

A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of the first aspect described above.

In the technical scheme provided by the invention, entity words in a target question sentence input by a user are identified and extracted; corresponding keywords are found in a pre-established trigger word dictionary according to the entity words, and the target question sentence is divided into a coarse-grained question type according to the keywords; questions of the same question type as the target question sentence are found in a pre-established question classification set and merged to obtain a candidate question set; word segmentation is performed on the target question sentence, and slot filling is performed on it according to the word segmentation result and the slot information established in a preset subdivision rule; the slot similarity between the target question sentence and each question in the candidate question set is calculated by using a similarity calculation algorithm; and the similar question that best matches the target question sentence is determined according to the slot similarity calculation result and the subdivision rule.

In the embodiment of the invention, dictionaries are established in advance and subdivision rules are determined by performing a series of processing operations, such as coarse-grained division, slot establishment and fine-grained division, on the questions in the target field; slot filling is performed on the target question sentence currently input by the user according to the established dictionaries and subdivision rules, and the questions in the existing question-answer library that have the same question type as the input target question sentence are selected to obtain a candidate question set; the similarity between the slot words of the target question sentence and the slot words of each question in the candidate question set is calculated, and the similar question that best matches the input target question sentence is determined according to the slot similarity calculation result and the subdivision rules, so that the semantic matching accuracy of question text is improved.

Drawings

FIG. 1 is a diagram of an embodiment of a text matching method based on slot similarity according to the embodiment of the present invention;

FIG. 2 is a diagram of another embodiment of a text matching method based on slot similarity according to the embodiment of the present invention;

FIG. 3 is a diagram of an embodiment of a text matching apparatus based on slot similarity according to the embodiment of the present invention;

FIG. 4 is a diagram of another embodiment of a text matching apparatus based on slot similarity according to the embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a text matching device based on slot similarity in the embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a text matching method, a text matching device, text matching equipment and a storage medium based on slot position similarity, which are used for improving the semantic matching accuracy of question texts, so that a user can obtain more accurate answers.

In order to enable those skilled in the art to better understand the scheme of the invention, the embodiments of the invention are described below in conjunction with the accompanying drawings.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In the embodiment of the present invention, the text matching method based on slot similarity is executed by a server and can be applied, without limitation, in scenarios that require text semantic matching, for example on terminal devices and computers.

For ease of understanding, a specific flow of the embodiment of the present invention is described below. Referring to FIG. 1, an embodiment of the text matching method based on slot similarity in the embodiment of the present invention includes:

101. Identify and extract entity words in the target question sentence input by the user.

Specifically, when the server receives a target question sentence input by a user, it identifies and extracts the entity words in the target question sentence by means of an entity recognition algorithm. The present invention does not limit which entity recognition algorithm is used.

In a specific implementation, optionally, the entity recognition algorithm for the target question sentence adopts a rule model constructed by linguistic experts. Relying on an existing knowledge base and dictionaries, the model selects features such as statistical information, punctuation, keywords, indicator words, direction words, position words and head words, and uses pattern matching and string matching as its main means. In addition, when the extracted rules reflect the linguistic phenomena accurately, as with simple entities such as "age", a rule-based algorithm identifies the entities with better performance than a statistics-based algorithm.
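A rule-based recognizer of the kind described can be sketched with one pattern rule and one dictionary lookup. The regular expression, the tiny disease dictionary, and the function name `extract_entities` are illustrative assumptions and are not the rules used by the invention.

```python
import re

# Hypothetical pattern rule for the simple "age" entity.
AGE_PATTERN = re.compile(r"\b(\d{1,3})\s*(?:years?\s*old|-year-old)\b")
# Hypothetical dictionary standing in for an existing knowledge base.
DISEASE_DICT = {"leukemia", "diabetes"}


def extract_entities(sentence):
    """Return (entity_type, surface_text) pairs found by pattern and string matching."""
    entities = []
    for match in AGE_PATTERN.finditer(sentence):
        entities.append(("age", match.group(1)))
    lowered = sentence.lower()
    for term in sorted(DISEASE_DICT):
        if term in lowered:
            entities.append(("disease", term))
    return entities


entities = extract_entities("Can a 45 years old patient with leukemia be insured?")
```

For a closed, regular entity such as age, a single pattern covers most phrasings without any training data, which is the situation in which rules tend to outperform statistical models.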

102. Find corresponding keywords in a pre-established trigger word dictionary according to the entity words, and divide the target question sentence into a coarse-grained question type according to the keywords.

Specifically, the server finds the related keywords in the pre-established trigger word dictionary according to the identified and extracted entity words, and divides the target question sentence into a coarse-grained question type according to the obtained keywords. For example, suppose the pre-established trigger word dictionary contains keywords related to "recommend" and "apply for insurance". If the target question sentence input by the user contains the word "recommend", its coarse-grained question type can be determined as the recommendation class; if it contains words such as "application rule", "cannot purchase" or "can be insured", its coarse-grained question type can be determined as the insurance application class. Dividing the target question sentence into a coarse-grained question type determines the rough intention of the user's question, namely the type of question the user wants to consult about.
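The coarse-grained division can be sketched as a lookup from trigger words to question types. Plain substring matching is used here as a simplifying assumption; the dictionary contents and the function name `classify_coarse` are hypothetical.

```python
# Hypothetical trigger word dictionary: trigger word -> coarse-grained type.
TRIGGER_DICT = {
    "recommend": "recommendation",
    "application rule": "application",
    "can be insured": "application",
    "cannot purchase": "application",
}


def classify_coarse(question, trigger_dict):
    """Return the coarse-grained types whose trigger words occur in the question."""
    lowered = question.lower()
    types = []
    for trigger, question_type in trigger_dict.items():
        if trigger in lowered and question_type not in types:
            types.append(question_type)
    return types


recommendation = classify_coarse("Please recommend a policy for my child", TRIGGER_DICT)
application = classify_coarse("What is the application rule for XX insurance?", TRIGGER_DICT)
```

Returning a list rather than a single label leaves room for a sentence that triggers more than one coarse-grained type.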

103. Find and merge the questions in the pre-established question classification set that have the same question type as the target question sentence to obtain a candidate question set.

Further, the server finds and merges the questions in the pre-established question classification set that have the same question type as the target question sentence to obtain a candidate question set. The questions in the question classification set have already been classified by coarse-grained question type, such as "recommendation class", "insurance application class" and "disease class"; that is, every question in the existing question-answer library has been assigned to its class. Therefore, all questions with the same question type as the target question sentence are found in the pre-established question classification set and merged into one set, which serves as the candidate question set for the subsequent similarity calculation against the target question sentence.
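Candidate retrieval then reduces to collecting and merging the questions filed under the matching types. The classification set below and the function name `build_candidate_set` are hypothetical sample data and naming.

```python
def build_candidate_set(question_types, classification_set):
    """Merge the questions filed under each matching type, without duplicates."""
    candidates = []
    seen = set()
    for question_type in question_types:
        for question in classification_set.get(question_type, []):
            if question not in seen:
                seen.add(question)
                candidates.append(question)
    return candidates


# Hypothetical question classification set keyed by coarse-grained type.
QUESTION_CLASSIFICATION_SET = {
    "recommendation": ["Which policy do you recommend for children?"],
    "application": [
        "Can a diabetes patient apply for XX insurance?",
        "What are the application rules for XX insurance?",
    ],
    "disease": ["Can a diabetes patient apply for XX insurance?"],
}
candidates = build_candidate_set(["application", "disease"], QUESTION_CLASSIFICATION_SET)
```

Restricting the later similarity calculation to this set means the slot vectors only need to be compared against questions of the same coarse-grained type.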

104. Perform word segmentation on the target question sentence, and perform slot filling on the target question sentence according to the word segmentation result and the slot information established in a preset subdivision rule.

Specifically, the server further performs word segmentation on the target question sentence, and performs slot filling on it according to the word segmentation result and the slot information established in the preset subdivision rule. The slot information is the set of entity slots defined for a question type. Establishing slot information gives the entities in each question type of the target field a clear attribute definition, which prepares the ground for slot identification, slot entity extraction and slot filling in the target question sentence input by the user. For example, the preset subdivision rule establishes a disease entity slot and an insurance entity slot for the insurance application question type; if the target question sentence input by the user is "Can leukemia be insured under XX insurance?", the "disease" entity slot and the "insurance" entity slot of the target question sentence can be filled. Slot filling is performed on the target question sentence so that its slot information can be compared with the slot information of the other questions to obtain the similar question.

105. Calculate the slot similarity between the target question sentence and each question in the candidate question set by using a similarity calculation algorithm.

Further, the server calculates the slot similarity between the target question sentence and each question in the candidate question set by using a similarity calculation algorithm. That is, for the slots in the target question sentence input by the user and the slots of each question in the candidate question set, the similarity between the slot words of corresponding entities is calculated. The slot similarity calculation algorithm may use a similarity measure such as cosine distance, Manhattan distance, correlation coefficient or Mahalanobis distance. Specifically, the weighted score of each slot in each question may be determined from the similarity calculation, and the slot scores are weighted and averaged to obtain the final similarity matching score. Calculating the similarity of the entity slots determines the degree of semantic matching between the target question input by the user and each question in the candidate question set, so that the best-matching similar question can then be found.
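The weighted averaging of per-slot similarities can be sketched as follows. The source does not say how slot weights are chosen, so the weights below, like the function name `weighted_slot_score`, are purely illustrative assumptions.

```python
def weighted_slot_score(slot_similarities, slot_weights):
    """Weighted average of per-slot similarity scores.

    slot_similarities: slot name -> similarity for that slot.
    slot_weights: slot name -> weight (values here are assumptions).
    """
    total_weight = sum(slot_weights[slot] for slot in slot_similarities)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(slot_similarities[slot] * slot_weights[slot]
                       for slot in slot_similarities)
    return weighted_sum / total_weight


score = weighted_slot_score(
    {"disease": 0.9, "insurance": 0.6},
    {"disease": 2.0, "insurance": 1.0},
)
```

With these numbers the disease slot counts twice as much as the insurance slot, giving (0.9 × 2 + 0.6 × 1) / 3 = 0.8.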

106. Determine the similar question that best matches the target question sentence according to the slot similarity calculation result and the subdivision rule.

Further, the server determines the similar question that best matches the target question sentence according to the slot similarity calculation result and the subdivision rule. Specifically, after the similarity is calculated, the final slot similarity matching score between each question in the candidate question set and the target question sentence is obtained. The best-matching similar question is then determined by combining this score with the slot information that the fine-grained question type is required to have under the subdivision rule; that is, the finally matched similar question is returned according to the final slot similarity matching score and whether the slots corresponding to the fine-grained question type are allowed to be empty.

Specifically, the server screens out the questions in the candidate question set that meet the slot information requirements of the subdivision rule, sorts their final similarity matching scores in descending order, and takes the question with the highest final similarity matching score as the best-matching similar question.
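The screening and ranking can be sketched as a filter followed by a descending sort. Representing the subdivision rule as a tuple of slots that may not be empty is an assumption made for illustration; the record layout and the name `best_match` are likewise hypothetical.

```python
def best_match(scored_candidates, required_slots):
    """Return the highest-scoring question whose required slots are all filled."""
    eligible = [
        candidate for candidate in scored_candidates
        if all(candidate["slots"].get(slot) for slot in required_slots)
    ]
    ranked = sorted(eligible, key=lambda c: c["score"], reverse=True)
    return ranked[0]["question"] if ranked else None


scored = [
    {"question": "Q1", "slots": {"disease": "leukemia"}, "score": 0.95},
    {"question": "Q2",
     "slots": {"disease": "leukemia", "insurance": "XX insurance"},
     "score": 0.80},
    {"question": "Q3",
     "slots": {"disease": "diabetes", "insurance": "XX insurance"},
     "score": 0.60},
]
best = best_match(scored, required_slots=("disease", "insurance"))
```

Q1 has the highest score but is screened out because its insurance slot is empty, so Q2 is returned.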

In this way, slot filling is performed on the target question sentence currently input by the user according to the established dictionaries and subdivision rules, and the questions in the existing question-answer library that have the same question type as the input target question sentence are selected to obtain the candidate question set. The similarity between the slot words of the target question sentence and the slot words of each question in the candidate question set is then calculated, and the similar question that best matches the input target question sentence is determined according to the slot similarity calculation result and the subdivision rule. This improves the semantic matching accuracy of question text, and the user can obtain a more accurate answer from the existing question-answer library through the best-matching similar question.

Further, referring to FIG. 2, in another embodiment of the text matching method based on slot similarity according to the embodiment of the present invention, the method further includes:

201. Identify and extract entity words from the target field corpus by using a named entity recognition algorithm, and establish an entity dictionary corresponding to the target field language according to the entity words.

Specifically, in the embodiment of the present invention, the server uses a named entity recognition algorithm to identify and extract the entity words in the target field corpus in advance, and establishes the entity dictionary corresponding to the target field language according to the entity words. The target field may be the insurance field or any other specific field, and is not limited herein. For example, if the target field is the insurance field, a corpus of the insurance field may be obtained, the entity words in the corpus extracted by an entity recognition algorithm, and the extracted entity words combined into an entity dictionary. The entity dictionary is used when slot information is established for questions; for example, entities such as "disease", "insurance" and "age" in the entity dictionary serve as the slots to be established for questions in the insurance field.
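One plausible shape for the entity dictionary is a mapping from entity type to the words recognized for it. The sketch below assumes the named entity recognizer has already produced `(word, label)` pairs; the sample pairs and the function name `build_entity_dictionary` are hypothetical.

```python
from collections import defaultdict


def build_entity_dictionary(recognized_entities):
    """Group recognized entity words by entity type."""
    grouped = defaultdict(set)
    for word, label in recognized_entities:
        grouped[label].add(word)
    return {label: sorted(words) for label, words in grouped.items()}


# Hypothetical output of a named entity recognizer over an insurance corpus.
ner_output = [
    ("leukemia", "disease"),
    ("XX insurance", "insurance"),
    ("diabetes", "disease"),
    ("45", "age"),
    ("leukemia", "disease"),
]
entity_dictionary = build_entity_dictionary(ner_output)
slot_names = sorted(entity_dictionary)
```

The keys of such a dictionary are natural candidates for the slot names that are later attached to each question type.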

202. Perform statistical word segmentation on the questions in the question-answer library of the target field, and extract keywords according to the word segmentation results.

Specifically, the server performs statistical word segmentation on the questions in the question-answer library of the target field and extracts keywords according to the word segmentation results. For a given target field, statistical word segmentation is first performed on the questions in the obtained question-answer library to obtain high-frequency words, and keywords are then extracted from the high-frequency words.

Keywords obtained in this way serve as the basis for dividing coarse-grained question types. Dividing questions by keyword identifies their coarse-grained intention; that is, the questions are roughly classified so that the user's intention can be roughly identified.

203. Expand the keywords to form a trigger word set, and establish a trigger word dictionary according to the trigger word set.

Specifically, the vocabulary of the keywords can be expanded to form a trigger word set, from which a trigger word dictionary is established. The keywords in the trigger word dictionary serve as the basis for dividing coarse-grained question types.

204. Divide the questions in the question-answer library of the target field into coarse-grained question types according to the keywords in the trigger word set, and establish corresponding slot information for each coarse-grained question type according to the entity dictionary.

Specifically, the server divides the questions in the question-answer library of the target field into coarse-grained question types according to the keywords in the trigger word set, and establishes corresponding slot information for each coarse-grained question type according to the entity dictionary. In a specific implementation, after the coarse-grained question types have been divided by means of the established trigger word dictionary, the corresponding slot information is established by observing the data; for example, disease and insurance entity slots are set for insurance application questions.

205. Perform statistical word segmentation on the corpus of each coarse-grained question type, extract the high-frequency words in the slot information corresponding to the coarse-grained question type according to the word segmentation results, and establish a high-frequency word dictionary from the high-frequency words.

Specifically, the server performs statistical word segmentation on the corpus of each coarse-grained question type, extracts the high-frequency words in the slot information corresponding to the coarse-grained question type according to the word segmentation results, and establishes a high-frequency word dictionary from the high-frequency words. In other words, the corpus that has already been divided into coarse-grained question types undergoes further slot analysis, and its high-frequency words are extracted to establish the high-frequency word dictionary.

In a specific implementation, word segmentation may be performed on the corpus of the coarse-grained question type according to a pre-established statistical language model, and the high-frequency words in the word segmentation results are then extracted according to word frequency information to establish the high-frequency word dictionary.
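A minimal sketch of step 205 follows. It assumes the corpus is already segmented and grouped by coarse-grained question type, and that a word counts toward the high-frequency word dictionary only if it also appears in the entity dictionary; that restriction, the sample data and the name `high_frequency_dictionary` are illustrative assumptions rather than details given in the embodiment.

```python
from collections import Counter

# Hypothetical entity dictionary: only these words can fill a slot.
ENTITY_WORDS = {"diabetes", "cancer", "elderly", "child"}

# Hypothetical pre-segmented corpus, grouped by coarse-grained question type.
CORPUS_BY_TYPE = {
    "application": [
        ["can", "elderly", "with", "diabetes", "apply"],
        ["diabetes", "apply", "online"],
        ["child", "apply"],
    ],
}


def high_frequency_dictionary(corpus_by_type, entity_words, min_count=2):
    """For each type, keep the entity words that occur at least min_count times."""
    result = {}
    for question_type, questions in corpus_by_type.items():
        counts = Counter(
            token for question in questions for token in question
            if token in entity_words
        )
        result[question_type] = {w for w, n in counts.items() if n >= min_count}
    return result


HIGH_FREQ = high_frequency_dictionary(CORPUS_BY_TYPE, ENTITY_WORDS)
```

Here "diabetes" is the only entity word that appears twice in the "application" corpus, so it is the only entry kept for that type.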

206. Further divide the questions in the question-answer library in the target field into fine-grained question types according to the high-frequency words in the high-frequency word dictionary.

Specifically, the questions in the question-answer library in the target field are further divided into fine-grained question types according to the high-frequency words in the high-frequency word dictionary, so that the intention of the question input by the user is identified more accurately.

207. Set a corresponding subdivision rule for each fine-grained question type, wherein the subdivision rule comprises establishing corresponding slot information for each fine-grained question type according to the entity dictionary.

Specifically, the server also sets a corresponding subdivision rule for each fine-grained question type; the subdivision rule includes establishing corresponding slot information for each fine-grained question type according to the entity dictionary. For example, once the fine-grained question type "disease insurance question" has been divided, the slots corresponding to it may be set to include a disease entity, an insurance name entity, an age entity, and a place name entity. Establishing slot information for the fine-grained question types gives the entities in each question type of the target field a clear attribute definition, and lays the groundwork for slot identification, slot entity extraction and slot filling in the target question sentences input by users.
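One way to picture a subdivision rule is as a record that lists the slots of a fine-grained question type. The sketch below is only an assumed representation: the class `SubdivisionRule`, the helper `empty_slots` and the slot identifiers are hypothetical names, while the four slots themselves mirror the disease insurance example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubdivisionRule:
    question_type: str
    slots: tuple  # slot names established from the entity dictionary


# Hypothetical rule table keyed by fine-grained question type.
RULES = {
    "disease insurance": SubdivisionRule(
        question_type="disease insurance",
        slots=("disease", "insurance_name", "age", "place_name"),
    ),
}


def empty_slots(question_type, rules=RULES):
    """Return an unfilled slot table for the given fine-grained question type."""
    return {name: None for name in rules[question_type].slots}


slot_table = empty_slots("disease insurance")
```

The unfilled table is what a later slot filling step would populate from the words of a target question sentence.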

208. Identify and extract the entity words in the target question sentence input by the user.

209. Find the corresponding keywords in the pre-established trigger word dictionary according to the entity words, and divide the target question sentence into a coarse-grained question type according to the keywords.

210. Find and combine the questions in the pre-established question classification set that have the same question type as the target question sentence, to obtain a candidate question set.

211. Perform word segmentation on the target question sentence, and perform slot filling on the target question sentence according to the word segmentation result and the slot information established in the preset subdivision rule.

212. Calculate the slot similarity between the target question sentence and each question in the candidate question set by using a similarity calculation algorithm.

213. Determine the similar question that best matches the target question sentence according to the slot similarity calculation result and the subdivision rule.

Specifically, the detailed implementation description of steps 208-213 refers to steps 101-106, which are not described herein again.

Optionally, in another implementation manner of the first aspect of the present invention, before the step 101, the method further includes:

In a specific implementation, the questions in the question-answer library in the target field (for example, an insurance question-answer library) are divided into coarse-grained question types according to the keywords in the established trigger word dictionary, and a question classification set organized by coarse-grained question type is established. For example, all questions about application are assigned to the "application" question type, all questions about recommendation are assigned to the "recommendation" question type, and so on; all the classified questions are then combined to establish the question classification set organized by coarse-grained question type.
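The grouping described above, and the later lookup of a candidate question set, can be sketched as follows. This is an illustration under stated assumptions: whitespace splitting stands in for real word segmentation, and the trigger dictionary, the sample library and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical trigger word dictionary: trigger word -> coarse-grained type.
TRIGGERS = {"apply": "application", "recommend": "recommendation"}


def coarse_type(question, triggers=TRIGGERS):
    """Return the type of the first trigger word found in the question, else None."""
    for token in question.lower().split():  # stand-in for word segmentation
        if token in triggers:
            return triggers[token]
    return None


def build_classification_set(questions, triggers=TRIGGERS):
    """Group library questions by their coarse-grained question type."""
    groups = defaultdict(list)
    for question in questions:
        groups[coarse_type(question, triggers)].append(question)
    return dict(groups)


LIBRARY = [
    "How do I apply for critical illness cover",
    "Please recommend a plan for my parents",
    "Can I apply online",
]
CLASSIFIED = build_classification_set(LIBRARY)


def candidate_set(target_question, classified=CLASSIFIED):
    """Return the library questions sharing the target question's coarse type."""
    return classified.get(coarse_type(target_question), [])
```

For instance, `candidate_set("where to apply")` returns the two library questions of the "application" type, so similarity only has to be computed against that subset.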

Optionally, in another implementation manner of the first aspect of the present invention, step 202 includes:

performing word segmentation on the questions in the question-answer library in the target field by adopting a statistical word segmentation algorithm, and extracting high-frequency words from the word segmentation results according to word frequency information;

extracting the corresponding keywords from the high-frequency words, wherein the keywords are the nouns and verbs among the high-frequency words.

In a specific implementation, words are segmented with a statistical word segmentation method: a statistical language model may be established in advance according to the requirements of the target field, the text is segmented, the probability of each candidate segmentation is calculated, and the segmentation with the highest probability is taken as the final word segmentation result. The statistical algorithms used for the probability calculation include, but are not limited to, hidden Markov models and conditional random field models.
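The idea of choosing the segmentation with the highest probability can be sketched with dynamic programming. Note that the sketch deliberately swaps in a simple unigram language model in place of the hidden Markov or conditional random field models named above; the probabilities in `WORD_PROB`, the unknown-character fallback and the function name `segment` are hypothetical.

```python
import math

# Hypothetical unigram probabilities from a domain language model.
WORD_PROB = {
    "heart": 0.05,
    "disease": 0.05,
    "heartdisease": 0.01,
    "insurance": 0.1,
}


def segment(text, word_prob, max_len=12, unknown_prob=1e-8):
    """Return the segmentation of text with the highest unigram probability."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_prob:
                prob = word_prob[piece]
            elif len(piece) == 1:
                prob = unknown_prob  # fallback so every position stays reachable
            else:
                continue
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


result = segment("heartdiseaseinsurance", WORD_PROB)
```

Because the domain term "heartdisease" has probability 0.01, which exceeds the product 0.05 × 0.05 for "heart" followed by "disease", the highest-probability segmentation keeps the domain term whole.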

Optionally, in another implementation manner of the first aspect of the present invention, step 203 includes:

combining the expanded words to form a trigger word set;

and establishing a trigger word dictionary according to the trigger word set.

In a specific implementation, the synonym forest (Tongyici Cilin) and the Chinese lexicon in HowNet are called to expand the vocabulary of trigger words; a large number of questions in the target field may be input in advance, and a trigger word dictionary that uses different keywords as classification bases is established. When an input question sentence contains a keyword in the trigger word dictionary, the question is classified into the coarse-grained question type corresponding to that keyword.
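The expansion step can be illustrated with a small sketch in which a hand-written synonym table stands in for the external thesaurus lookup. The table contents, the keyword-to-type mapping and the function name `build_trigger_dictionary` are assumptions made for the example only.

```python
# Hypothetical synonym table standing in for an external thesaurus lookup.
SYNONYMS = {
    "apply": ["insure", "enroll"],
    "recommend": ["suggest"],
}

# Hypothetical mapping from seed keyword to coarse-grained question type.
KEYWORD_TYPES = {"apply": "application", "recommend": "recommendation"}


def build_trigger_dictionary(keyword_types, synonyms):
    """Map every keyword and each of its synonyms to the keyword's question type."""
    triggers = {}
    for keyword, question_type in keyword_types.items():
        for word in [keyword, *synonyms.get(keyword, [])]:
            triggers[word] = question_type
    return triggers


TRIGGERS = build_trigger_dictionary(KEYWORD_TYPES, SYNONYMS)
```

After expansion, a question containing "enroll" maps to the same "application" type as one containing the seed keyword "apply", which is the purpose of enlarging the trigger word set.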

Optionally, in another implementation manner of the first aspect of the present invention, step 104 includes:

creating a hash map from the pre-established high-frequency word dictionary, wherein the keys of the hash map are the high-frequency words stored in the pre-established high-frequency word dictionary;

performing word segmentation on the target question sentence, and judging whether each word obtained by the word segmentation exists in the hash map;

and when a word obtained by the word segmentation is judged to exist in the hash map, performing slot filling on the target question sentence according to the slot information established in the preset subdivision rule.

Specifically, the server creates a hash map from the pre-established high-frequency word dictionary, with the high-frequency words stored in that dictionary as the keys. The server then performs word segmentation on the target question sentence and judges whether each resulting word exists in the hash map. When a word is judged to exist in the hash map, slot filling is performed on the target question sentence according to the slot information established in the preset subdivision rule. In this embodiment, the value in the hash map may be a similar word of the high-frequency word or a supplement to it. Storing the high-frequency word dictionary as a hash map makes the mapping relations among the high-frequency words clearer, and allows the server to check efficiently whether a word segmentation result exists in the map during slot analysis and identification of the target question sentence, which improves the efficiency of slot analysis and filling.
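A minimal sketch of hash-map-based slot filling is given below. Python's built-in `dict` serves as the hash map. For simplicity the sketch stores the slot name as the map value, which is an assumption of this example (the embodiment says the value may be a similar word or a supplement); the slot names, sample words and the function `fill_slots` are likewise hypothetical.

```python
# Hypothetical hash map built from the high-frequency word dictionary:
# key = high-frequency word, value = name of the slot the word can fill.
HIGH_FREQ_MAP = {
    "diabetes": "disease",
    "cancer": "disease",
    "elderly": "age",
    "shenzhen": "place_name",
}

# Hypothetical slot list taken from the subdivision rule of the question type.
RULE_SLOTS = ("disease", "insurance_name", "age", "place_name")


def fill_slots(tokens, high_freq_map=HIGH_FREQ_MAP, rule_slots=RULE_SLOTS):
    """Fill each rule slot with the first segmented word that maps to it."""
    slots = {name: None for name in rule_slots}
    for token in tokens:
        slot = high_freq_map.get(token)  # average O(1) membership test
        if slot in slots and slots[slot] is None:
            slots[slot] = token
    return slots


filled = fill_slots(["can", "elderly", "with", "diabetes", "apply"])
```

Words that are absent from the map, such as "can" or "apply", are skipped, and slots with no matching word stay empty.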

Optionally, in another implementation manner of the first aspect of the present invention, step 105 includes:

training the target question sentence and each question in the candidate question set by utilizing a BERT model to obtain the corresponding word vectors;

respectively calculating the average value of the word vectors corresponding to the target question sentence and to each question in the candidate question set, and taking the obtained average value of the word vectors corresponding to each question as the word vector of the corresponding slot.

Specifically, in this embodiment, a BERT model is used to train the slots in the question sentence input by the user and in each question in the candidate question set, to obtain the word vectors corresponding to each question. The average of the word vectors obtained for each question is taken as the word vector of the words in the slot, and the cosine distance between the word vectors of corresponding slots is calculated and used as the similarity measure. Combined with the slot information set in the subdivision rule, the question with the best cosine distance score is determined to be the similar question that best matches the question sentence input by the user.
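The averaging and cosine comparison can be sketched without a model by using small hand-written vectors in place of BERT outputs. Everything below is illustrative: the two-dimensional vectors, the choice to average the per-slot similarities over shared slots, and the function names are assumptions, not details fixed by the embodiment.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def slot_similarity(target_slots, candidate_slots):
    """Average cosine similarity over the slots present in both questions."""
    shared = [name for name in target_slots if name in candidate_slots]
    if not shared:
        return 0.0
    total = sum(
        cosine_similarity(mean_vector(target_slots[s]), mean_vector(candidate_slots[s]))
        for s in shared
    )
    return total / len(shared)


def best_match(target_slots, candidates):
    """Return the id of the candidate question with the highest slot similarity."""
    return max(candidates, key=lambda q: slot_similarity(target_slots, candidates[q]))


# Hypothetical word vectors per slot (stand-ins for model outputs).
TARGET = {"disease": [[1.0, 0.0], [0.8, 0.2]]}
CANDIDATES = {
    "q1": {"disease": [[0.9, 0.1]]},
    "q2": {"disease": [[0.0, 1.0]]},
}
match = best_match(TARGET, CANDIDATES)
```

In this toy data the averaged target vector points in the same direction as the slot vector of `q1`, so `q1` is selected; in practice the vectors would come from the model and would have far more dimensions.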

The text matching method based on slot similarity in the embodiment of the present invention has been described above; the text matching device based on slot similarity in the embodiment of the present invention is described below. Referring to FIG. 3, an embodiment of the text matching device based on slot similarity includes:

An entity identifying and extracting module 301, configured to identify and extract the entity words in the target question sentence input by the user.

A coarse-grained question division module 302, configured to find the corresponding keywords in the pre-established trigger word dictionary according to the entity words, and divide the target question sentence into a coarse-grained question type according to the keywords.

A candidate question set acquisition module 303, configured to find and combine the questions in the pre-established question classification set that have the same question type as the target question sentence, to obtain a candidate question set.

A word segmentation and slot filling module 304, configured to perform word segmentation on the target question sentence, and perform slot filling on the target question sentence according to the word segmentation result and the slot information established in the preset subdivision rule.

A similarity calculation module 305, configured to calculate the slot similarity between the target question sentence and each question in the candidate question set by using a similarity calculation algorithm.

A best matching similar question determination module 306, configured to determine the similar question that best matches the target question sentence according to the slot similarity calculation result and the subdivision rule.

Referring to FIG. 4, in another embodiment of the text matching device based on slot similarity according to the embodiment of the present invention, the text matching device based on slot similarity further includes:

An entity dictionary establishing module 307, configured to identify and extract the entity words in the corpus of the target field by using a named entity recognition algorithm, and establish an entity dictionary for the target field from the entity words.

A word segmentation and keyword extraction module 308, configured to perform statistical word segmentation on the questions in the question-answer library in the target field, and extract keywords according to the word segmentation results.

A trigger word dictionary establishing module 309, configured to expand the keywords to form a trigger word set, and establish a trigger word dictionary according to the trigger word set.

A coarse-grained division and corresponding slot establishment module 310, configured to divide the questions in the question-answer library in the target field into coarse-grained question types according to the keywords in the trigger word set, and establish corresponding slot information for each coarse-grained question type according to the entity dictionary.

A high-frequency word dictionary establishing module 311, configured to perform statistical word segmentation on the corpus of each coarse-grained question type, extract the high-frequency words in the slot information corresponding to the coarse-grained question type according to the word segmentation results, and establish a high-frequency word dictionary from the high-frequency words.

A fine-grained division module 312, configured to further divide the questions in the question-answer library in the target field into fine-grained question types according to the high-frequency words in the high-frequency word dictionary.

A subdivision rule setting module 313, configured to set a corresponding subdivision rule for each fine-grained question type; the subdivision rule includes establishing corresponding slot information for each fine-grained question type according to the entity dictionary.

and the problem classification set establishing module is used for classifying the problems in the question-answering base in the target field by the coarse-grained problem types divided according to the keywords in the trigger word dictionary and establishing a problem classification set classified according to the coarse-grained problem types.

performing word segmentation processing on the questions in the question-answer library in the target field by adopting a statistical word segmentation algorithm, and extracting high-frequency words in word segmentation results according to word frequency information;

combining the expanded words to form a trigger word set;

and establishing a trigger word dictionary according to the trigger word set.

creating a Hash mapping from the pre-established high-frequency word dictionary, wherein the key value of the Hash mapping is the high-frequency word stored in the pre-established high-frequency word dictionary;

performing word segmentation processing on the target question sentence, and judging whether the word obtained by the word segmentation processing exists in the Hash mapping;

training the target question sentence and each question in the candidate question set by utilizing a Bert model to obtain a word vector corresponding to each question;

respectively calculating the average value of the word vectors corresponding to the target question sentence and to each question in the candidate question set, and taking the obtained average value of the word vectors corresponding to each question as the word vector of the corresponding slot;

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device or system type embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

The above fig. 3 and fig. 4 describe the text matching device based on slot similarity in the embodiment of the present invention in detail from the perspective of the modular functional entity, and the following describes the text matching device based on slot similarity in the embodiment of the present invention in detail from the perspective of hardware processing.

FIG. 5 is a schematic structural diagram of a slot similarity-based text matching device according to an embodiment of the present invention. The slot similarity-based text matching device 500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 501 (e.g., one or more processors), a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing an application 507 or data 506. The memory 509 and the storage medium 508 may be transient storage or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations for the slot similarity-based text matching device. Further, the processor 501 may be configured to communicate with the storage medium 508 and to execute, on the slot similarity-based text matching device 500, the series of instruction operations in the storage medium 508.

The slot similarity-based text matching device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. It will be understood by those skilled in the art that the device architecture shown in FIG. 5 does not constitute a limitation of slot similarity-based text matching devices, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A text matching method based on slot similarity is characterized by comprising the following steps:

2. The slot similarity-based text matching method of claim 1, wherein prior to the identifying and extracting entity words in the target question sentence input by the user, the method further comprises:

and setting a corresponding subdivision rule for each fine-grained problem type, wherein the subdivision rule comprises the step of establishing corresponding slot position information for each fine-grained problem type according to the entity dictionary.

3. The slot similarity-based text matching method of claim 1, wherein prior to the identifying and extracting entity words in the target question sentence input by the user, the method further comprises:

4. The slot similarity-based text matching method according to claim 2, wherein the performing statistical word segmentation on the questions in the target field question-answer library and extracting keywords according to the word segmentation result comprises:

5. The slot similarity-based text matching method according to claim 2, wherein the expanding the keywords to form a trigger word set and establishing a trigger word dictionary according to the trigger word set comprises:

combining the expanded words to form a trigger word set;

and establishing a trigger word dictionary according to the trigger word set.

6. The slot similarity-based text matching method according to claim 1, wherein the performing word segmentation on the target question sentence and performing slot filling on the target question sentence according to a word segmentation result and slot information established in a preset subdivision rule comprises:

7. The slot similarity-based text matching method according to any one of claims 1 to 6, wherein the calculating of the slot similarity between the target question sentence and each question in the candidate question set by using a similarity calculation algorithm comprises:

8. A text matching device based on slot similarity, comprising:

9. A slot similarity-based text matching device, comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;

the at least one processor invokes the instructions in the memory to cause the slot similarity-based text matching device to perform the method of any of claims 1-7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.