CN111414746A - Matching statement determination method, device, equipment and storage medium

Publication number: CN111414746A
Authority: CN (China)
Prior art keywords: sentence, statement, candidate, similarity, preset
Legal status: Granted
Application number: CN202010281056.4A
Other languages: Chinese (zh)
Other versions: CN111414746B
Inventors: 李宸, 付博, 顾远, 袁晟君, 王雪, 张晨, 谢隆飞, 李亚雄
Current Assignee: CCB Finetech Co Ltd
Original Assignees: China Construction Bank Corp; CCB Finetech Co Ltd
Events: application filed by China Construction Bank Corp and CCB Finetech Co Ltd; priority to CN202010281056.4A; publication of CN111414746A; application granted; publication of CN111414746B
Legal status: Active

Classifications

    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (Y: general tagging of new technological developments; Y02: technologies for mitigation or adaptation against climate change; Y02D: climate change mitigation technologies in information and communication technologies)


Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for determining a matching statement. The matching statement determination method comprises the following steps: determining a candidate sentence corresponding to a target sentence from a preset sentence set according to a preset candidate sentence determination rule; determining at least two similarity features between the target sentence and the candidate sentence; and determining and displaying a matching sentence matched with the target sentence based on the at least two similarity features and the candidate sentence. The technical scheme of the embodiment of the invention combines a plurality of similarity features between the target sentence and the candidate question sentences to determine the matching sentence of the target sentence, thereby improving the accuracy of the matching sentence.

Description

Matching statement determination method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a matching statement determining method, a device, a terminal and a storage medium.
Background
Sentence matching technology, especially question matching technology, is widely applied in the technical fields of customer service, sales consultation and the like.
Existing sentence matching methods generally fall into two categories. The first is matching based on traditional statistical models, which can only determine the similarity of words in a sentence through term frequency (TF) and inverse document frequency (IDF), so the matching accuracy is low. The second is semantic matching models based on deep learning, which consider only the sentence-meaning similarity features of sentences; this approach cannot resolve the ambiguity caused by the loss of sentence information, so the matching result is inaccurate.
Disclosure of Invention
The invention provides a matching statement determination method, a matching statement determination device, a terminal and a storage medium, which can more accurately determine a matching statement matched with a target statement.
In a first aspect, an embodiment of the present invention provides a matching statement determining method, where the method includes:
determining candidate sentences corresponding to the target sentences from a preset sentence set according to preset candidate sentence determination rules;
determining at least two similarity features between the target sentence and the candidate sentence;
and determining and displaying a matching sentence matched with the target sentence based on at least two similarity features and the candidate sentence.
In a second aspect, an embodiment of the present invention further provides a matching statement determining apparatus, where the apparatus includes:
the candidate sentence determining module is used for determining candidate sentences corresponding to the target sentences from the preset sentence set according to preset candidate sentence determining rules;
at least two similarity feature determination modules for determining at least two similarity features between the target sentence and the candidate sentence;
and the matching statement determining and displaying module is used for determining and displaying the matching statement matched with the target statement based on at least two similarity features and the candidate statement.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the matching statement determination method according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the matching statement determination method according to any embodiment of the present invention.
According to the embodiments of the invention, a candidate sentence corresponding to the target sentence is determined from a preset sentence set according to a preset candidate sentence determination rule; at least two similarity features between the target sentence and the candidate sentence are determined; and a matching sentence matched with the target sentence is determined and displayed based on the at least two similarity features and the candidate sentence. By combining a plurality of similarity features between the target sentence and the candidate question sentences to determine the matching sentence of the target sentence, the accuracy of the matching sentence is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the technical solutions in the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a matching statement determination method according to a first embodiment of the present invention;
FIG. 2a is a flowchart of a matching statement determination method according to a second embodiment of the present invention;
FIG. 2b is a diagram illustrating a second embodiment of the present invention for determining the similarity characteristics;
FIG. 2c is a schematic structural diagram of a BERT model according to a second embodiment of the present invention;
FIG. 2d is a schematic diagram of a sample input of a BERT model according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a matching statement determination apparatus in a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a matching statement determining method according to an embodiment of the present invention, where this embodiment is applicable to a case where a matching statement matching a target statement needs to be determined, and the method may be executed by a matching statement determining apparatus, where the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be configured in a computer device. As shown in fig. 1, the method of this embodiment specifically includes:
s110, determining candidate sentences corresponding to the target sentences from the preset sentence set according to preset candidate sentence determination rules.
The preset candidate sentence determination rule may be a rule determined based on various semantic similarities between sentences, where the similarity feature may be a semantic similarity determined based on words in the sentences, a semantic similarity determined based on a sentence context, a semantic similarity determined based on a sentence meaning in the sentences, or the like. The preset candidate sentence determination rule may preferably be configured to use a preset sentence having a semantic similarity value within a preset range as the candidate sentence, or use a preset number of preset sentences having a maximum semantic similarity value as the candidate sentences.
The application scenario of the embodiment is mainly human-computer interaction, and for example, the application scenario may be applied to a self-service business handling robot system in a financial institution (e.g., a bank), a bank financial management intelligent customer service system, a robot system for entertainment (e.g., a robot simulating a user speaking, a problem solving robot, etc.), and the like. The self-service business handling robot system, the bank financial intelligent customer service system and the robot system for entertainment applied to the financial institution can be an intelligent question-answering system or a non-intelligent question-answering system.
In this regard, the target sentence may be a question sentence, a statement sentence, an exclamation sentence, or the like (this embodiment is not particularly limited). The target sentence may preferably be a sentence input by the user through the voice acquiring means of each system. The preset sentence set can be composed of a plurality of specific topics in the related field, and a plurality of similar sentences can be included under the same specific topic. If the preset sentence set is applied to the intelligent question-answering system, the preset sentence set can comprise the same answer corresponding to each question sentence besides a plurality of similar question sentences under the same specific topic, and preferably, the plurality of similar question sentences can be mapped to the same answer to be stored under the same topic. The candidate sentences are determined from the preset sentence set according to preset candidate sentence determination rules, and one or more candidate sentences may be determined, and have a certain degree of similarity with the target sentence semantically.
And S120, determining at least two similarity characteristics between the target sentence and the candidate sentence.
The similarity feature in this embodiment refers to a semantic similarity feature between two sentences, that is, semantically, there is a certain similarity between two sentences. This similarity may be described in terms of multiple feature dimensions, i.e., there may be multiple semantic similarity features between two sentences. If only one similarity feature is used to describe the semantic similarity of two sentences, the precision is lower than when a plurality of similarity features are combined; therefore, in this embodiment, at least two similarity features between the target sentence and the candidate sentence are used to describe the similarity between them.
Preferably, the similarity features in this embodiment may include at least two of a character similarity feature, a word similarity feature, an above similarity feature, and a sentence meaning similarity feature. The character similarity feature may preferably represent the similarity between the target sentence and the candidate sentence on single characters; the word similarity feature may preferably represent the similarity in words between the target sentence and the candidate sentence; the above similarity feature may preferably represent the similarity between the above information of the target sentence and the candidate sentence, where this similarity may be a similarity in characters, in words, or in sentence meaning; and the sentence meaning similarity feature may preferably represent the semantic similarity between the target sentence and the candidate sentence.
And S130, determining and displaying a matched sentence matched with the target sentence based on the at least two similarity features and the candidate sentences.
The matching statement may be a statement meeting a preset matching condition, which may be a statement among the candidate statements or a preset alternative statement; correspondingly, the content displayed to the user may be a candidate statement or the alternative statement. If the target sentence is a question sentence, the matching sentence may further include the unique answer sentence corresponding to the candidate statement or the alternative statement; in this case, both the matched statement (or alternative statement) and its corresponding unique answer sentence may be displayed.
The preset matching condition may be as follows: if there is one candidate sentence whose similarity to the target sentence is within a preset similarity range, that candidate sentence is taken as the matching sentence of the target sentence; if the similarities of a plurality of candidate sentences to the target sentence are within the preset similarity range, the candidate sentence with the highest similarity is taken as the matching sentence of the target sentence; if the similarity between the candidate statement and the target statement is within the preset similarity range but the candidate statement does not meet a matching statement check condition, the alternative statement is taken as the matching statement of the target statement, where the matching statement check condition may be that the matching statement must contain a certain preset keyword, or that the matching statement must have the same domain keyword as the target statement; and if no candidate sentence has a similarity to the target sentence within the preset similarity range, the alternative statement is taken as the matching sentence of the target sentence.
The similarity between the target sentence and the candidate sentence may be determined based on the at least two similarity features. The specific determination method preferably comprises inputting the at least two similarity features into a machine learning model trained in advance and outputting the corresponding similarity, where the machine learning model may be any one of an SVM (Support Vector Machine) model, an LR (Logistic Regression) model, and an XGBoost model, where XGBoost is a scalable tree boosting system and the XGBoost model is a tree ensemble model.
In the matching statement determination method provided by this embodiment, a candidate statement corresponding to a target statement is determined from a preset statement set according to a preset candidate statement determination rule; determining at least two similarity features between the target sentence and the candidate sentence; and determining and displaying a matched sentence matched with the target sentence based on the at least two similarity features and the candidate sentence, and determining the matched sentence of the target sentence by combining the plurality of similarity features between the target sentence and the candidate question sentence, so that the accuracy of the matched sentence is improved.
Based on the foregoing embodiments, further, determining a matching sentence matching the target sentence based on at least two similarity features and the candidate sentence includes:
inputting at least two similarity characteristics into a pre-trained XGboost tree model to obtain the similarity between a target statement and a candidate statement;
and determining the highest similarity in the similarities, and taking the candidate sentence corresponding to the highest similarity as the matching sentence.
The XGBoost tree model may represent a non-linear relationship between a plurality of features and real labels. In this embodiment, the at least two features are input into the XGBoost tree model for binary classification training; the probability value that the model predicts class 1 is the final similarity between the target sentence and the candidate sentence. The similarity corresponding to each candidate question among the candidate questions is calculated one by one using the above method, the similarities corresponding to the candidate questions are sorted, and the candidate sentence corresponding to the highest similarity is taken as the matching sentence of the target sentence.
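For illustration, the scoring and ranking step just described can be sketched as follows; this is a minimal sketch assuming a trained binary XGBoost classifier and one feature row per candidate (function and variable names are illustrative, not from the patent).

```python
# Minimal ranking sketch, assuming a trained binary XGBoost classifier;
# names here are illustrative, not from the patent.
import numpy as np
import xgboost as xgb

def pick_matching_sentence(model: xgb.XGBClassifier,
                           feature_rows: np.ndarray,
                           candidates: list) -> str:
    """feature_rows: one row per candidate, columns = the similarity
    features (character / above / sentence-meaning similarity)."""
    # predict_proba returns [P(class 0), P(class 1)] per row;
    # the class-1 probability is used as the similarity.
    similarities = model.predict_proba(feature_rows)[:, 1]
    best = int(np.argmax(similarities))  # highest similarity wins
    return candidates[best]
```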
The XGboost tree model may be trained prior to determining a matching statement for the target statement using the XGboost tree model. Before training, a positive and negative example sample set is determined, wherein the positive and negative example sample set comprises a positive example sample pair and a negative example sample pair, the positive example sample pair is two sentences with similar semantemes, and the negative example sample pair is two sentences with completely different semantemes. At least two similarity characteristics of the positive sample pair and the negative sample pair are respectively extracted to obtain multiple groups of similarity characteristics, and each group of similarity characteristics is used as a training sample to train the XGboost tree model. The specific training process comprises the steps of firstly training a first tree by using a training sample, predicting a training set by using the first tree to obtain a predicted value of the training sample, defining a difference value between the predicted value and a true value as a residual error, and taking the residual error as the true value of the training sample when a second tree is trained. And training the second tree according to the training method, using the residual error corresponding to the second tree for training the third tree, and so on, and stopping training when the total number of the preset trees is reached.
In training the XGBoost tree model, the loss function used may be a squared loss function:

L(y, ŷ) = (y - ŷ)²
the square loss function enables each training fit residual to gradually approximate the true value of the sample. The hyper-parameters can be adjusted during model training, thereby improving the training effect. The meta-parameters comprise eta, gamma, maximum tree depth and the minimum sample weight sum in the sub-nodes, the eta parameters are contraction step length in the parameter updating process and are similar to the concept of learning rate, the gamma is regularization parameters, the algorithm is more conservative when the value is larger, the maximum tree depth can control the scale and the complexity of a single tree, the minimum sample weight sum in the sub-nodes refers to the minimum sample number required by building each model, and the algorithm is more conservative when the value is larger.
On the basis of the foregoing embodiments, after determining the highest similarity among the similarities and taking the candidate sentence corresponding to the highest similarity as the matching sentence, the method further includes:
determining keyword differences between the target statement and the matching statement based on a preset keyword difference determination rule;
and if the keyword difference accords with the preset rejection rule, discarding the matching statement, and taking the alternative statement as the matching statement.
The specific preset keyword difference determination rule is as follows. Illustratively, the target sentence may be represented as q_user = {w_1, w_2, ..., w_i, ..., w_u} and, correspondingly, a matching statement may be denoted as faq_i = {w_1, w_2, ..., w_j, ..., w_q}. The preset keyword difference determination rule may specifically be to align q_user and faq_i to obtain the keywords w_k that are not common to both, and to determine the keyword difference between the target sentence and the matching statement as

diff = (q_user ∪ faq_i) - (q_user ∩ faq_i)

The preset rejection rule in this embodiment is: if there exists a w_k in diff satisfying that w_k is a verb or noun and a sentence keyword, and frequency_{w_k} < frequency_threshold, where frequency_{w_k} denotes the word frequency of w_k and frequency_threshold is a preset frequency threshold, then the matching statement is rejected. If the keyword difference diff is determined to conform to the preset rejection rule, the matching statement is discarded and the alternative statement is taken as the matching statement.
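A minimal sketch of this check follows, assuming hypothetical helpers `pos_tag` and `word_freq` for part-of-speech tagging and word-frequency lookup (neither is specified in the patent):

```python
# Sketch of the keyword-difference rejection rule; pos_tag and word_freq
# are hypothetical stand-ins for a POS tagger and a frequency lookup.
FREQUENCY_THRESHOLD = 5  # placeholder; the patent only says "preset"

def keyword_diff(q_user: set, faq_i: set) -> set:
    # Keywords that appear in exactly one of the two sentences.
    return (q_user | faq_i) - (q_user & faq_i)

def should_reject(diff: set, pos_tag, word_freq, sentence_keywords: set) -> bool:
    for w_k in diff:
        if (pos_tag(w_k) in ("verb", "noun")
                and w_k in sentence_keywords
                and word_freq(w_k) < FREQUENCY_THRESHOLD):
            return True  # a rare content keyword differs: reject the match
    return False
```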
Example two
Fig. 2a is a flowchart of a matching statement determining method according to a second embodiment of the present invention. On the basis of the foregoing embodiments, the present embodiment may determine, according to a preset candidate statement determination rule, a candidate statement corresponding to the target statement from a preset statement set, including:
determining a first BM25 value between each preset statement in a preset statement set and the target statement by using a BM25 algorithm;
performing descending order arrangement on each preset statement according to the first BM25 value, and taking the preset statements with the preset number as candidate statements;
correspondingly, if the at least two similarity features include a word similarity feature, determining the at least two similarity features between the target sentence and the candidate sentence includes:
and taking the first BM25 value corresponding to the candidate sentence as the character similarity characteristic between the target sentence and the candidate sentence.
And if the at least two similarity features further include the above similarity feature, determining the at least two similarity features between the target sentence and the candidate sentence, including:
determining an above dialogue text which meets a preset condition, wherein the above dialogue text is the above dialogue text of the target sentence;
extracting the domain keywords of the above dialog text and the target sentence to obtain a first domain keyword set;
extracting a domain keyword of the candidate sentences to obtain a second domain keyword set aiming at each candidate sentence in the candidate sentences;
and determining a second BM25 value between the first domain keyword set and the second domain keyword set by utilizing a keyword-level BM25 algorithm, and taking the second BM25 value as the above similarity characteristic between the target statement and the candidate statement.
And if the at least two similarity features further include a sentence meaning similarity feature, determining the at least two similarity features between the target sentence and the candidate sentence, including:
and determining sentence meaning similarity features between the target sentence and the candidate sentence by utilizing a pre-trained BERT model, wherein the BERT model is obtained by training based on the Listwise (list) method and the Pairwise (document pair) method in sequence.
As shown in fig. 2a, the method of this embodiment specifically includes:
s210, determining a first BM25 value between each preset statement in a preset statement set and a target statement by utilizing a BM25 algorithm; and performing descending arrangement on each preset statement according to the first BM25 value, and taking the preset statements with the preset number as candidate statements.
Preferably, the BM25 algorithm in this embodiment may be a word-level BM25 algorithm (i.e., a word is the minimum selection unit), where a word may be any word, or may be a keyword (i.e., a word other than a filler word, where filler words are spoken-language words and empty words indicating pause or thinking, such as "this", "then", "uh", etc.); the BM25 algorithm may also be a character-level BM25 algorithm (i.e., the minimum selection unit is a single character).

Since the application scenario of this embodiment is voice interaction, the target sentences input by users are mostly spoken sentences and mostly short texts, with few content words (i.e., words having practical meanings) and many filler words; therefore, this embodiment preferably adopts the character-level BM25 algorithm.
Preferably, the specific expression of the BM25 algorithm is as follows:
score(D, Q) = Σ_{i=1}^{n} ω · IDF(q_i) · f(q_i, D) · (k_1 + 1) / ( f(q_i, D) + k_1 · (1 - b + b · |D| / avgdl) )

IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) )

wherein score(D, Q) is the first BM25 value, D is the preset statement, Q is the target statement, q_i is the i-th character in the target statement, n is the total number of characters in the target statement, ω is the character weight (which may take different values for filler characters and content characters), f(q_i, D) is the frequency of q_i appearing in the preset statement, k_1 and b are adjustable parameters, |D| is the length of the preset statement D in characters, avgdl is the average length of all candidate statements, N is the total number of preset statements in the preset statement set, and n(q_i) is the number of preset statements containing q_i.
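A minimal character-level BM25 sketch following the formula above; the filler/content weighting values are assumptions, since the patent only says the two classes are weighted differently:

```python
# Character-level BM25 sketch; the filler/content weights are assumed.
import math

def bm25_char(query: str, doc: str, corpus: list,
              k1: float = 1.5, b: float = 0.75,
              filler_chars: frozenset = frozenset()) -> float:
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q_i in query:                       # sum over the n query characters
        omega = 0.5 if q_i in filler_chars else 1.0   # assumed weights
        n_qi = sum(1 for d in corpus if q_i in d)     # docs containing q_i
        idf = math.log((N - n_qi + 0.5) / (n_qi + 0.5))
        f = doc.count(q_i)                  # frequency of q_i in the doc
        score += omega * idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```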
S220, if the at least two similarity features comprise character similarity features, the first BM25 value corresponding to the candidate sentence is used as the character similarity features between the target sentence and the candidate sentence.
Preferably, the normalization process may be performed on the first BM25 value corresponding to each candidate statement.
S230, if the at least two similarity characteristics further comprise the above similarity characteristics, determining an above dialogue text meeting preset conditions, wherein the above dialogue text is an above dialogue text of the target sentence; extracting the domain keywords of the above dialog text and the target sentence to obtain a first domain keyword set; and extracting the domain keywords of the candidate sentences aiming at each candidate sentence in the candidate sentences to obtain a second domain keyword set.
The above information features in this embodiment can help resolve ambiguity problems. For example, suppose the current conversation scene between the user and the robot customer service concerns the annual profitability of products, and the current target sentence of the user is "will it fluctuate"; the target sentence obviously lacks the subject "profitability". In such a case, taking the above dialogue text that meets the preset condition into consideration and introducing the above information features can solve problems such as reference resolution and information loss.
The above dialogue text meeting the preset condition may be the pair of dialogue texts before the target sentence (i.e., the previous target sentence and its matching sentence), the matching sentence in the pair of dialogue texts before the target sentence, or a plurality of dialogue texts before the target sentence.
S240, determining a second BM25 value between the first domain keyword set and the second domain keyword set by utilizing a keyword-level BM25 algorithm, and taking the second BM25 value as the above similarity characteristic between the target statement and the candidate statement.
The principle of determining the similarity features in this step is the same as that of determining the similarity features of the characters, and the difference is only that the minimum selection unit is changed from a single character to a keyword, and the specific process is not repeated here.
Fig. 2b is a schematic diagram of determining the above similarity feature according to the second embodiment of the present invention. As shown in fig. 2b, the above dialogue text is the matching statement in the pair of dialogue texts before the target sentence. The domain keywords of the previous round's matching statement and the domain keywords of the target sentence are extracted and together taken as the first domain keyword set. For each candidate statement among the candidate statements, the domain keywords of the candidate statement are extracted and taken as a second domain keyword set, yielding a second domain keyword set corresponding to each candidate question. A second BM25 value between the first domain keyword set and each second domain keyword set is then calculated using the keyword-level BM25 algorithm, finally obtaining the second BM25 value corresponding to each candidate question.
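A sketch of this feature, with `extract_domain_keywords` as a hypothetical stand-in for the patent's domain-keyword extractor, reusing a token-level BM25 of the same shape as the character-level formula above:

```python
# Context-similarity sketch; extract_domain_keywords is hypothetical.
import math

def bm25_tokens(query_tokens, doc_tokens, corpus_token_lists,
                k1=1.5, b=0.75):
    N = len(corpus_token_lists)
    avgdl = sum(len(d) for d in corpus_token_lists) / N
    score = 0.0
    for q in query_tokens:
        n_q = sum(1 for d in corpus_token_lists if q in d)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        f = doc_tokens.count(q)
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

def context_similarity(prev_matching, target, candidate, all_candidates,
                       extract_domain_keywords):
    # First set: keywords of the previous matching statement plus the target.
    first_set = (extract_domain_keywords(prev_matching)
                 + extract_domain_keywords(target))
    # One second set per candidate; the candidate's own set is the "document".
    second_sets = [extract_domain_keywords(c) for c in all_candidates]
    return bm25_tokens(first_set, extract_domain_keywords(candidate), second_sets)
```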
And S250, if the at least two similarity features further comprise a sentence meaning similarity feature, determining the sentence meaning similarity feature between the target sentence and the candidate sentence by utilizing a pre-trained BERT model, wherein the BERT model is obtained by training based on the Listwise (list) method and the Pairwise (document pair) method in sequence.
Preferably, a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model may be used to extract the sentence meaning similarity feature between the target sentence and the candidate sentence. Specifically, the target sentence and the candidate sentence may be simultaneously input into the BERT model, and the probability value that the model predicts output category 1 is the sentence meaning similarity feature value between the target sentence and the candidate sentence.
FIG. 2c is a schematic structural diagram of a BERT model according to the second embodiment of the present invention. As shown in FIG. 2c, the target sentence (Sentence A) is converted into Tok1 ... TokN, where Tok1 to TokN sequentially represent each character in Sentence A, and the candidate sentence (Sentence B) is converted into Tok1' ... TokM', where Tok1' to TokM' sequentially represent each character in Sentence B. [CLS] marks the vector position used for classification and aggregates all classification information, and [SEP] separates the two sentences. After Sentence A and Sentence B are input into the BERT model, Tok1 ... TokN, Tok1' ... TokM', [CLS] and [SEP] are respectively converted into word-embedding representation vectors E (including E1 ... EN, E1' ... EM', E[CLS] and E[SEP]). Each E is the superposition of three parts: token embedding, segment embedding and position embedding. Token embedding converts each token into a vector representation of fixed dimension (for example, 768 dimensions may be adopted in the BERT model) as the semantic representation of the corresponding token. Segment embedding distinguishes which sentence a token belongs to and has only two vector representations, 0 and 1: a first value of 0 may be assigned to all tokens of Sentence A and a second value of 1 to all tokens of Sentence B. Position embedding encodes the sequential feature of the input. Thereafter, each vector E is input in order into a Trm (i.e., multi-layer bidirectional Transformer) structure. The Trm structure is composed of an attention mechanism and a feed-forward neural network, and is essentially an Encoder-Decoder structure; since the Transformer in the BERT model is only used for feature extraction, only the Encoder portion is needed. The representation vector E enters the self-attention module in the Encoder to obtain a weighted feature vector z, and z is then input into a fully connected feed-forward neural network (FFN). Illustratively, the first layer of the fully connected feed-forward neural network may be the activation function ReLU and the second layer may be a linear activation function, which may be expressed as FFN(z) = max(0, z·W1 + b1)·W2 + b2. The FFN layer is then layer-normalized: the output vector of the layer is added to its input and normalized. The output vector after 6 identical Encoder feature extractions is the output of one Trm; as can be seen from FIG. 2c, this process requires two Trm feature extractions (i.e., two Trm structures are required), so the size of the overall model is 12 layers. After Transformer feature extraction, the corresponding feature vectors (including C, T1 ... TN, T[SEP] and T1' ... TM') are output, and the fully connected layer (Classifier) outputs the 0/1 probability distribution, where the probability value of class 1 is the sentence meaning similarity feature value.
The BERT model training may include two stages, namely pre-training and fine-tuning. The parameters of the pre-training stage may directly adopt the model parameters provided by Google; in the fine-tuning stage, a preset corpus sample set may be determined and the BERT model may be trained based on the Listwise (list) method and the Pairwise (document pair) method in sequence, where the preset corpus sample set may include positive examples and negative examples. FIG. 2d is a schematic diagram of a positive-example input of the BERT model according to the second embodiment of the present invention. As shown in FIG. 2d, the target sentence is "每天限额多少" ("what is the daily limit") and the candidate sentence is "限额高吗" ("is the limit high"). The target sentence is converted into Tok1 = 每, Tok2 = 天, Tok3 = 限, Tok4 = 额, Tok5 = 多 and Tok6 = 少, and the candidate sentence is converted into Tok1' = 限, Tok2' = 额, Tok3' = 高 and Tok4' = 吗. After [CLS] and two [SEP] tokens are added, the sequence is input into the BERT model. Token embedding first yields E[CLS], E每, E天, E限, E额, E多, E少, E[SEP], E限, E额, E高, E吗 and E[SEP]; segment embedding then yields EA for the eight tokens belonging to Sentence A and EB for the five tokens belonging to Sentence B; finally, position embedding yields E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, E10, E11 and E12. The vectors obtained by token embedding, segment embedding and position embedding are added to obtain the final input vector E.
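The pair-scoring step can be sketched with the Hugging Face transformers library (a library and checkpoint choice assumed here; the patent does not name one):

```python
# Sentence-pair similarity sketch using Hugging Face transformers
# (library choice and checkpoint name are assumptions).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)
model.eval()

def sentence_meaning_similarity(target: str, candidate: str) -> float:
    # The tokenizer builds the [CLS] ... [SEP] ... [SEP] sequence and the
    # token/segment/position inputs described above.
    inputs = tokenizer(target, candidate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of class 1 is the sentence-meaning similarity feature.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```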
In the fine-tuning stage of the BERT model, training samples may first be constructed based on the idea of the Listwise (list) method: a group of training samples corresponding to the Listwise method may comprise one positive example <q_user, faq_i> and a plurality of negative examples <q_user, faq_j_1>, ..., <q_user, faq_j_k>. The training samples are input into the model and the similarity of each positive example and each negative example is calculated. After the similarity calculation of all positive and negative examples in a group of training samples is completed, the negative example with the highest similarity among all negative examples is taken and combined with the positive example; the resulting sample conforms to the idea of the Pairwise (document pair) method. Before training based on the Listwise method, a loss function needs to be designed, which may be expressed as loss = max(0, 1 - (score_i - score_j)), where score_i is the similarity obtained after inputting the positive example into the BERT model and score_j is the similarity obtained after inputting the selected negative example into the BERT model. Preferably, an optimization algorithm with an adaptive learning rate may be used to update the model parameters during fine-tuning.
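The hardest-negative selection and margin loss described above amount to the following sketch (tensor shapes are assumptions):

```python
# Sketch of the Listwise-then-Pairwise loss: keep the hardest negative
# from the group and apply loss = max(0, 1 - (score_i - score_j)).
import torch

def listwise_pairwise_loss(score_pos: torch.Tensor,
                           scores_neg: torch.Tensor) -> torch.Tensor:
    """score_pos: scalar score of the positive pair <q_user, faq_i>;
    scores_neg: 1-D tensor of scores of the k negatives <q_user, faq_j_*>."""
    hardest_neg = scores_neg.max()       # highest-scoring negative in the group
    margin = 1.0 - (score_pos - hardest_neg)
    return torch.clamp(margin, min=0.0)  # hinge: zero loss once margin >= 1
```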
S260, inputting at least two similarity characteristics including the character similarity characteristic, the above similarity characteristic and the sentence meaning similarity characteristic into a pre-trained XGboost tree model to obtain the similarity between the target sentence and the candidate sentence; and determining the highest similarity in the similarities, and taking the candidate sentence corresponding to the highest similarity as the matching sentence.
The technical scheme is verified and evaluated by using dialogue data based on bank financing marketing.
The preset sentence set may be the dialogue text data corresponding to 100 topics of a preselected bank financing marketing scene; it comprises 1065 preset sentences covering the 100 topics, with each topic containing about 10 similar sentences. These sentences are derived from real dialogue data or organized and expanded by business personnel, and are frequently asked questions in the marketing scene of financial products.
The target sentences of users correspond to 876 corpora. A pair of dialogues refers to one interaction between a customer manager and a customer. Preferably, the 876 target sentence corpora may be divided into a training set, a verification set and a test set at a ratio of 8:1:1, where the verification set and the test set each contain 88 samples. Statistics on the length of the target corpus are shown in Table 1 (it is understood that the data in Table 1 are only an example and not limiting).
TABLE 1. Statistical information of the target sentence corpus

Sentence length (characters)    Count    Proportion    Mean sentence length
[3,8]                           229      26.14%        6.4
(8,14]                          289      32.99%        11.0
(14,20]                         166      18.95%        17.0
(20,26]                         92       10.50%        23.0
(26,32]                         41       4.68%         29.2
(32,40]                         29       3.31%         35.7
(40,50]                         15       1.71%         44.4
(50,94]                         15       1.71%         61.3
Total                           876      100%          15.3 (mean) / 12 (median)
The target sentences of the user are taken as queries, the preset sentence set is retrieved one by one using the BM25 method, and the first K sentences with the highest BM25 values are taken as candidate sentences. The construction method of positive examples in the BERT training samples is as follows: a candidate sentence similar to the target sentence is combined with the target sentence to form a positive example. The construction method of negative examples is as follows: based on the constructed positive examples and the corresponding K candidate sentences (for example, K is 20), the sentences among the candidate sentences that differ from the target sentence are sampled at a positive-to-negative ratio of 1:r (r is an adjustable hyper-parameter), and the extracted candidate sentences and the target sentence form negative examples.
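A sketch of this construction, assuming a `similar` set marking the candidates that paraphrase the target (names illustrative):

```python
# Positive/negative sample construction sketch; names are illustrative.
import random

def build_samples(target, candidates, similar, r=5, seed=0):
    """candidates: top-K BM25 retrievals for the target;
    similar: the candidates known to paraphrase the target."""
    rng = random.Random(seed)
    positives = [(target, c, 1) for c in candidates if c in similar]
    pool = [c for c in candidates if c not in similar]
    k = min(r * len(positives), len(pool))   # positive:negative ratio 1:r
    negatives = [(target, c, 0) for c in rng.sample(pool, k)]
    return positives + negatives
```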
In order to verify the training effect of training the BERT model based on the Listwise (list) method and the Pairwise (document pair) method in sequence in the embodiment of the invention, two other methods, which use only the Pointwise (single document) method or only the Pairwise (document pair) method, can be applied to BERT model training and the results compared. On the basis of the positive and negative examples obtained by the above construction, the samples corresponding to the three training methods are respectively: the training samples of the Pointwise method are the initially constructed positive and negative example data set; the Pairwise method selects one positive example and one negative example of the same target sentence to form a sample; and the training method based on the Listwise method followed by the Pairwise method selects one positive example and a plurality of negative examples of the same target sentence to form a sample. Table 2 shows the number of training samples for the Pointwise and Pairwise methods and for the method based on the Listwise and Pairwise methods in sequence, where the positive-to-negative ratio r is 5 and K is 60 (it is understood that these data are only an example and not limiting).
TABLE 2 number of training samples for three methods
(The sample counts of Table 2 are provided as an image in the original publication.)
The BERT models obtained by the three training methods can be evaluated using the evaluation indexes BM25 Recall, Recall@Top-1 and MRR, in the following three respects:
1. Verifying the reliability of the candidate question acquisition method: the recall rate of the candidate question set can be calculated for evaluation. By calculating this recall rate, it can be determined that the BM25 retrieval method adopted in the embodiment of the invention, which takes a single character as the minimum retrieval unit, can significantly improve the recall rate.
2. Verifying the superiority of the sentence meaning similarity feature extraction method: this can be evaluated by transversely comparing different models and longitudinally comparing different training methods. The transverse comparison compares the BERT model with the DSSM and Match Pyramid models, showing that the BERT model performs better than the other two, mainly because the pre-trained model has stronger representation capability. The training method based on the Listwise (list) method followed by the Pairwise (document pair) method improves the model's ability to distinguish negative examples, which also indicates that BERT has stronger sentence meaning similarity feature extraction capability.
3. Verifying the effect of the XGBoost model fusing multiple features: this can be evaluated by comparison with single-feature models. The BM25 model that uses only the character similarity feature performs poorly and cannot be applied in practice. The commonly used deep semantic matching model Match Pyramid can extract sentence meaning similarity features between sentences, but it does not consider the above features and its feature extraction capability is limited, so the matching effect is mediocre. The method provided by the embodiment of the invention enhances the expression capability of sentence meaning features and combines the above information features and character similarity features, making up for the deficiency of single-feature question semantic matching methods; after fusing multiple features, the matching indexes are greatly improved, with an accuracy rate of 85% or more.
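For reference, the two ranking metrics mentioned above (Recall@Top-1 and MRR) can be sketched as follows, a minimal version over ranked candidate lists:

```python
# Recall@Top-1 and MRR over ranked candidate lists; one gold set per query.
def recall_at_top1(ranked_lists, gold_sets):
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_sets)
               if ranked and ranked[0] in gold)
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, gold_sets):
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```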
According to the matching sentence determination method provided by this embodiment, a first BM25 value between each preset sentence in a preset sentence set and the target sentence is determined using the BM25 algorithm; the preset sentences are arranged in descending order of the first BM25 value, and a preset number of them are taken as candidate sentences. If the at least two similarity features include a character similarity feature, the first BM25 value corresponding to a candidate sentence is taken as the character similarity feature between the target sentence and that candidate sentence. If the at least two similarity features further include the above similarity feature, an above dialogue text meeting a preset condition (the above dialogue text of the target sentence) is determined; the domain keywords of the above dialogue text and the target sentence are extracted to obtain a first domain keyword set; for each candidate sentence among the candidate sentences, the domain keywords of the candidate sentence are extracted to obtain a second domain keyword set; and a second BM25 value between the first domain keyword set and the second domain keyword set is determined using the keyword-level BM25 algorithm and taken as the above similarity feature between the target sentence and the candidate sentence. If the at least two similarity features further include a sentence meaning similarity feature, the sentence meaning similarity feature between the target sentence and the candidate sentence is determined using a pre-trained BERT model, where the BERT model is obtained by training based on the Listwise (list) method and the Pairwise (document pair) method in sequence. Finally, the at least two similarity features, including the character similarity feature, the above similarity feature and the sentence meaning similarity feature, are input into the pre-trained XGBoost tree model to obtain the similarity between the target sentence and each candidate sentence; the highest similarity is determined, and the candidate sentence corresponding to the highest similarity is taken as the matching sentence. By combining multiple similarity features, the accuracy of the matching sentence is improved.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a matching statement determination apparatus in a third embodiment of the present invention. As shown in fig. 3, the apparatus of the present embodiment includes:
a candidate sentence determining module 310, configured to determine, according to a preset candidate sentence determining rule, a candidate sentence corresponding to the target sentence from a preset sentence set;
at least two similarity feature determination modules 320 for determining at least two similarity features between the target sentence and the candidate sentence;
and a matching sentence determining and presenting module 330, configured to determine and present a matching sentence matching the target sentence based on the at least two similarity features and the candidate sentence.
In the matching sentence determination apparatus provided in this embodiment, a candidate sentence corresponding to a target sentence is determined from a preset sentence set by using a candidate sentence determination module according to a preset candidate sentence determination rule; determining at least two similarity features between the target sentence and the candidate sentence by utilizing at least two similarity feature determination modules; and determining and displaying the matched sentences matched with the target sentences by using a matched sentence determining and displaying module based on at least two similarity characteristics and the candidate sentences, and determining the matched sentences of the target sentences by combining a plurality of similarity characteristics between the target sentences and the candidate question sentences, so that the accuracy of the matched sentences is improved.
On the basis of the above technical solutions, optionally, the at least two similarity features include at least two of a character similarity feature, an above similarity feature, and a sentence meaning similarity feature.
On the basis of the above technical solutions, optionally, the candidate sentence determining module 310 may include:
a first BM25 value determining unit, configured to determine, using a BM25 algorithm, a first BM25 value between each preset statement in the preset statement set and the target statement;
the candidate statement determining unit is used for performing descending order arrangement on each preset statement according to the first BM25 value, and taking the preset statements with the preset number as candidate statements;
correspondingly, the at least two similarity feature determining modules 320 may preferably include a text similarity feature determining unit, configured to, if the at least two similarity features include a text similarity feature, use the first BM25 value corresponding to the candidate sentence as the text similarity feature between the target sentence and the candidate sentence.
On the basis of the above technical solutions, optionally, a specific expression of the BM25 algorithm is as follows:
score(D, Q) = Σ_{i=1}^{n} ω · IDF(q_i) · f(q_i, D) · (k_1 + 1) / ( f(q_i, D) + k_1 · (1 - b + b · |D| / avgdl) )

IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) )

wherein score(D, Q) is the first BM25 value, D is the preset statement, Q is the target statement, q_i is the i-th character in the target statement, n is the total number of characters in the target statement, ω is the character weight (which may take different values for filler characters and content characters), f(q_i, D) is the frequency of q_i appearing in the preset statement, k_1 and b are adjustable parameters, |D| is the length of the preset statement D in characters, avgdl is the average length of all candidate statements, N is the total number of preset statements in the preset statement set, and n(q_i) is the number of preset statements containing q_i.
On the basis of the foregoing technical solutions, optionally, if the at least two similarity features further include the above similarity feature, the at least two similarity feature determining module 320 may preferably include:
the device comprises an upper dialogue text determining unit, a processing unit and a processing unit, wherein the upper dialogue text determining unit is used for determining an upper dialogue text which meets a preset condition, and the upper dialogue text is an upper dialogue text of a target sentence;
the first domain keyword set determining unit is used for extracting the domain keywords of the above dialogue text and the target sentence to obtain a first domain keyword set;
a second domain keyword set determining unit, configured to extract, for each candidate sentence in the candidate sentences, a domain keyword of the candidate sentence, to obtain a second domain keyword set;
and the above similarity characteristic determining unit is used for determining a second BM25 value between the first domain keyword set and the second domain keyword set by utilizing a keyword-level BM25 algorithm, and taking the second BM25 value as the above similarity characteristic between the target statement and the candidate statement.
On the basis of the foregoing technical solutions, optionally, if the at least two similarity features further include a sentence meaning similarity feature, the at least two similarity feature determining module 320 may preferably include a sentence meaning similarity feature determining unit, configured to determine the sentence meaning similarity feature between the target sentence and the candidate sentence by using a pre-trained BERT model, where the BERT model is obtained by training based on the Listwise (list) method and the Pairwise (document pair) method in sequence.
On the basis of the above technical solutions, optionally, the matching statement determining and presenting module 330 may include:
the similarity determining unit is used for inputting at least two similarity characteristics into a pre-trained XGboost tree model to obtain the similarity between the target statement and the candidate statement;
and the matching statement determining unit is used for determining the highest similarity in the similarities and taking the candidate statement corresponding to the highest similarity as the matching statement.
On the basis of the foregoing technical solutions, optionally, the matching statement determination and presentation module 330 may further include a keyword difference determination unit, configured to determine a keyword difference between the target statement and the matching statement based on a preset keyword difference determination rule after determining a highest similarity among the similarities and taking the candidate statement corresponding to the highest similarity as the matching statement;
and if the keyword difference accords with the preset rejection rule, discarding the matching statement, and taking the alternative statement as the matching statement.
The matching statement determination device provided by the embodiment of the invention can execute the matching statement determination method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 4 illustrates a block diagram of an exemplary computer device 412 suitable for use in implementing embodiments of the present invention. The computer device 412 shown in FIG. 4 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 4, computer device 412 is in the form of a general purpose computing device. Components of computer device 412 may include, but are not limited to: one or more processors 416, a memory 428, and a bus 418 that couples the various system components (including the memory 428 and the processors 416).
Bus 418 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 412 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 428 can include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)430 and/or cache memory 432. The computer device 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage 434 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 418 by one or more data media interfaces. Memory 428 can include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for instance, in memory 428, such program modules 442 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. The program modules 442 generally perform the functions and/or methodologies of the described embodiments of the invention.
The computer device 412 may also communicate with one or more external devices 414 (e.g., a keyboard, a pointing device, a display 424, etc., where the display 424 may be provided or omitted as needed), with one or more devices that enable a user to interact with the computer device 412, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 412 to communicate with one or more other computing devices.
By running the programs stored in the memory 428, the processor 416 executes various functional applications and performs data processing, for example implementing the matching statement determination method provided by the embodiments of the present invention.
Example five
Embodiment five of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the matching statement determination method provided in the embodiments of the present invention, the method including:
determining a candidate sentence corresponding to a target sentence from a preset sentence set according to a preset candidate sentence determination rule;
determining at least two similarity features between the target sentence and the candidate sentence;
and determining and displaying a matching sentence matched with the target sentence based on the at least two similarity features and the candidate sentence.
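For illustration, a minimal Python sketch of this three-step flow is given below; it is not part of the claimed method, and all names are hypothetical placeholders for the concrete retrieval, feature, and fusion techniques described in the embodiments:

    # Hypothetical sketch of the three-step matching flow; names are illustrative.
    def find_matching_sentence(target, sentence_set,
                               retrieve_candidates, similarity_features, fuse):
        # Step 1: determine candidate sentences from the preset sentence set.
        candidates = retrieve_candidates(target, sentence_set)
        # Step 2: compute at least two similarity features per candidate.
        feats = [similarity_features(target, c) for c in candidates]
        # Step 3: fuse the features into one similarity and keep the best match.
        scores = [fuse(f) for f in feats]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]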
Of course, the computer-readable storage medium provided by the embodiments of the present invention, on which the computer program is stored, is not limited to the method operations described above; the program may also perform related operations of the computer-device-based matching statement determination method provided by any embodiment of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A matching statement determination method, comprising:
determining a candidate sentence corresponding to a target sentence from a preset sentence set according to a preset candidate sentence determination rule;
determining at least two similarity features between the target sentence and the candidate sentence;
and determining and displaying a matching sentence matched with the target sentence based on the at least two similarity features and the candidate sentence.
2. The method of claim 1, wherein the at least two similarity features comprise at least two of a character similarity feature, an above similarity feature, and a sentence meaning similarity feature.
3. The method of claim 2, wherein determining the candidate sentence corresponding to the target sentence from the preset sentence set according to a preset candidate sentence determination rule comprises:
determining a first BM25 value between each preset sentence in the preset sentence set and the target sentence by using a BM25 algorithm;
sorting the preset sentences in descending order of their first BM25 values, and taking a preset number of top-ranked preset sentences as the candidate sentences;
correspondingly, if the at least two similarity features include a character similarity feature, determining the at least two similarity features between the target sentence and the candidate sentence comprises:
taking the first BM25 value corresponding to the candidate sentence as the character similarity feature between the target sentence and the candidate sentence.
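For illustration only (not part of the claims), the retrieval step can be sketched as follows; score_fn stands in for a BM25 scorer such as the one given with claim 4 below, sentences are assumed pre-tokenized, and all names are hypothetical:

    def retrieve_candidates(target_words, preset_corpus, score_fn, top_k=10):
        # First BM25 value between the target and every preset sentence.
        scored = [(score_fn(target_words, doc, preset_corpus), doc)
                  for doc in preset_corpus]
        # Descending order by score; keep a preset number of sentences.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Each kept pair carries its score, reusable as the character
        # similarity feature between the target and that candidate.
        return scored[:top_k]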
4. The method of claim 3, wherein the specific expression of the BM25 algorithm is as follows:
score(D, Q) = \sum_{i=1}^{n} \omega \cdot \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
wherein score(D, Q) is the first BM25 value, D is the preset sentence, Q is the target sentence, q_i is the i-th word in the target sentence, n is the total number of words in the target sentence, ω is the word weight, which takes different values for filler words and for content words, f(q_i, D) is the frequency of occurrence of q_i in the preset sentence, k_1 and b are adjustable parameters, |D| is the length of the preset sentence D in words, avgdl is the average length of all candidate sentences, N is the total number of preset sentences in the preset sentence set, and n(q_i) is the number of preset sentences in the preset sentence set that contain q_i.
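For illustration only (not part of the claims), a direct Python rendering of the above formula might look as follows; it assumes pre-tokenized word lists, the weight argument is a hypothetical stand-in for the filler-word/content-word weighting ω, and the defaults k1 = 1.5 and b = 0.75 are common choices rather than values specified here:

    import math
    from collections import Counter

    def bm25_score(query_words, doc_words, corpus, k1=1.5, b=0.75, weight=None):
        # corpus: list of tokenized preset sentences; doc_words is one of them.
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc_words)
        score = 0.0
        for q in query_words:
            n_q = sum(1 for d in corpus if q in d)         # n(q): sentences containing q
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5))  # IDF(q) as in the formula
            f = tf[q]                                      # f(q, D)
            omega = weight(q) if weight else 1.0           # word weight
            score += omega * idf * f * (k1 + 1) / (
                f + k1 * (1 - b + b * len(doc_words) / avgdl))
        return score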
5. The method of claim 3, wherein, if the at least two similarity features further include the above similarity feature, determining the at least two similarity features between the target sentence and the candidate sentence comprises:
determining an above dialogue text that meets a preset condition, wherein the above dialogue text is the dialogue text preceding the target sentence;
extracting the domain keywords of the above dialogue text and the target sentence to obtain a first domain keyword set;
for each of the candidate sentences, extracting the domain keywords of the candidate sentence to obtain a second domain keyword set;
and determining a second BM25 value between the first domain keyword set and the second domain keyword set by using a keyword-level BM25 algorithm, and taking the second BM25 value as the above similarity feature between the target sentence and the candidate sentence.
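For illustration only (not part of the claims), the keyword-level scoring can be sketched with the third-party rank_bm25 package, which is an assumption of this sketch rather than a library named by this document; domain keyword extraction itself is not shown:

    from rank_bm25 import BM25Okapi

    def above_similarity(first_keyword_set, second_keyword_sets):
        # Corpus: one domain keyword list per candidate sentence.
        bm25 = BM25Okapi(second_keyword_sets)
        # One second BM25 value per candidate, used as the above
        # similarity feature between the target and that candidate.
        return bm25.get_scores(first_keyword_set)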
6. The method of claim 5, wherein, if the at least two similarity features further include a sentence meaning similarity feature, determining the at least two similarity features between the target sentence and the candidate sentence comprises:
and determining the sentence meaning similarity feature between the target sentence and the candidate sentence by using a pre-trained BERT model, wherein the BERT model is obtained by training sequentially with a Listwise (list-wise) method and a Pairwise (pair-wise) method.
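For illustration only (not part of the claims), scoring a sentence pair with a BERT model can be sketched with the Hugging Face transformers library; the checkpoint name is a placeholder, and the Listwise and Pairwise fine-tuning stages themselves are not shown:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Placeholder checkpoint; in practice, a model fine-tuned as described above.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-chinese", num_labels=1)

    def sentence_meaning_similarity(target, candidate):
        # Encode the pair jointly so the model attends across both sentences.
        inputs = tokenizer(target, candidate, return_tensors="pt",
                           truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        return logits.squeeze().item()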
7. The method of any one of claims 1-6, wherein determining a matching sentence that matches the target sentence based on the at least two similarity features and the candidate sentence comprises:
inputting the at least two similarity features into a pre-trained XGBoost tree model to obtain the similarity between the target sentence and the candidate sentence;
and determining the highest similarity in the similarities, and taking the candidate sentence corresponding to the highest similarity as the matching sentence.
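For illustration only (not part of the claims), fusing the similarity features with an XGBoost model can be sketched as follows; the training data is synthetic stand-in data and the hyperparameters are arbitrary rather than taken from this document:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    train_feats = rng.random((200, 3))    # 3 similarity features per sentence pair
    train_labels = (train_feats.mean(axis=1) > 0.5).astype(int)  # toy match labels

    model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(train_feats, train_labels)

    cand_feats = rng.random((10, 3))               # features for 10 candidates
    sims = model.predict_proba(cand_feats)[:, 1]   # fused similarity per candidate
    best_idx = int(np.argmax(sims))                # candidate with highest similarity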
8. The method according to claim 7, further comprising, after determining the highest similarity among the similarities and taking the candidate sentence corresponding to the highest similarity as the matching sentence:
determining a keyword difference between the target sentence and the matching sentence based on a preset keyword difference determination rule;
and if the keyword difference meets a preset rejection rule, discarding the matching sentence and taking an alternative sentence as the matching sentence.
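For illustration only (not part of the claims), one possible form of such a rejection rule is sketched below; the rule and its threshold are hypothetical, since the concrete rule is left preset here:

    def apply_rejection_rule(target_keywords, match_keywords,
                             matching_sentence, alternative_sentence,
                             max_missing=1):
        # Keyword difference: target keywords absent from the matching sentence.
        missing = set(target_keywords) - set(match_keywords)
        if len(missing) > max_missing:    # difference meets the rejection rule
            return alternative_sentence   # discard the match, use the alternative
        return matching_sentence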
9. A matching sentence determination apparatus, comprising:
a candidate sentence determination module, used for determining a candidate sentence corresponding to a target sentence from a preset sentence set according to a preset candidate sentence determination rule;
a similarity feature determination module, used for determining at least two similarity features between the target sentence and the candidate sentence;
and a matching sentence determination and display module, used for determining and displaying a matching sentence matched with the target sentence based on the at least two similarity features and the candidate sentence.
10. A computer device, comprising:
one or more processing devices;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the matching statement determination method of any one of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the matching statement determination method according to any one of claims 1 to 8.
CN202010281056.4A 2020-04-10 2020-04-10 Method, device, equipment and storage medium for determining matching statement Active CN111414746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010281056.4A CN111414746B (en) 2020-04-10 2020-04-10 Method, device, equipment and storage medium for determining matching statement

Publications (2)

Publication Number Publication Date
CN111414746A true CN111414746A (en) 2020-07-14
CN111414746B CN111414746B (en) 2023-11-07

Family

ID=71491837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010281056.4A Active CN111414746B (en) 2020-04-10 2020-04-10 Method, device, equipment and storage medium for determining matching statement

Country Status (1)

Country Link
CN (1) CN111414746B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225365A1 (en) * 2017-02-08 2018-08-09 International Business Machines Corporation Dialog mechanism responsive to query context
CN108170749A (en) * 2017-12-21 2018-06-15 北京百度网讯科技有限公司 Dialogue method, device and computer-readable medium based on artificial intelligence
US20190392066A1 (en) * 2018-06-26 2019-12-26 Adobe Inc. Semantic Analysis-Based Query Result Retrieval for Natural Language Procedural Queries
CN110134760A (en) * 2019-05-17 2019-08-16 北京思维造物信息科技股份有限公司 A kind of searching method, device, equipment and medium
CN110134799A (en) * 2019-05-29 2019-08-16 四川长虹电器股份有限公司 A kind of text corpus based on BM25 algorithm build and optimization method
CN110442777A (en) * 2019-06-24 2019-11-12 华中师范大学 Pseudo-linear filter model information search method and system based on BERT
CN110442718A (en) * 2019-08-08 2019-11-12 腾讯科技(深圳)有限公司 Sentence processing method, device and server and storage medium
CN110489538A (en) * 2019-08-27 2019-11-22 腾讯科技(深圳)有限公司 Sentence answer method, device and electronic equipment based on artificial intelligence
CN110765247A (en) * 2019-09-30 2020-02-07 支付宝(杭州)信息技术有限公司 Input prompting method and device for question-answering robot
CN110750616A (en) * 2019-10-16 2020-02-04 网易(杭州)网络有限公司 Retrieval type chatting method and device and computer equipment
CN110727783A (en) * 2019-10-23 2020-01-24 支付宝(杭州)信息技术有限公司 Method and device for asking question of user based on dialog system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yu Jia: "Research on Interest Mining Methods Based on Microblog User Attributes and Post Content" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020096A (en) * 2017-07-24 2019-07-16 北京国双科技有限公司 Classifier training method and apparatus based on inquiry
CN110020096B (en) * 2017-07-24 2021-09-07 北京国双科技有限公司 Query-based classifier training method and device
CN111950259A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Text display method, device, equipment and storage medium
CN112507192A (en) * 2020-09-24 2021-03-16 厦门立马耀网络科技有限公司 Application contrast matching method, medium, system and equipment
CN112464662A (en) * 2020-12-02 2021-03-09 平安医疗健康管理股份有限公司 Medical phrase matching method, device, equipment and storage medium
CN112464662B (en) * 2020-12-02 2022-09-30 深圳平安医疗健康科技服务有限公司 Medical phrase matching method, device, equipment and storage medium
CN112749256A (en) * 2020-12-30 2021-05-04 北京知因智慧科技有限公司 Text processing method, device, equipment and storage medium
CN113360537A (en) * 2021-06-04 2021-09-07 北京百度网讯科技有限公司 Information query method, device, electronic equipment and medium
CN113360537B (en) * 2021-06-04 2024-01-12 北京百度网讯科技有限公司 Information query method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN111414746B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN110196901B (en) Method and device for constructing dialog system, computer equipment and storage medium
CN110489538B (en) Statement response method and device based on artificial intelligence and electronic equipment
CN111414746B (en) Method, device, equipment and storage medium for determining matching statement
US11409964B2 (en) Method, apparatus, device and storage medium for evaluating quality of answer
CN112100354B (en) Man-machine conversation method, device, equipment and storage medium
CN110705247B (en) Based on x2-C text similarity calculation method
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
US11238050B2 (en) Method and apparatus for determining response for user input data, and medium
EP3620994A1 (en) Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN110414004A (en) A kind of method and system that core information extracts
CN113761868B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN116662518A (en) Question answering method, question answering device, electronic equipment and readable storage medium
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN113095072B (en) Text processing method and device
CN116127001A (en) Sensitive word detection method, device, computer equipment and storage medium
CN115730590A (en) Intention recognition method and related equipment
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN113204624A (en) Multi-feature fusion text emotion analysis model and device
CN113705207A (en) Grammar error recognition method and device
KR20210038260A (en) Korean Customer Service Associate Assist System based on Machine Learning
CN110287396A (en) Text matching technique and device
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN115757680A (en) Keyword extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220916

Address after: 12 / F, 15 / F, 99 Yincheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Applicant after: Jianxin Financial Science and Technology Co.,Ltd.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant