CN114036283A - Text matching method, device, equipment and readable storage medium - Google Patents

Text matching method, device, equipment and readable storage medium

Info

Publication number
CN114036283A
CN114036283A (application CN202111367504.3A)
Authority
CN
China
Prior art keywords
text
matched
model
candidate
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111367504.3A
Other languages
Chinese (zh)
Inventor
张晗
杜新凯
吕超
谷姗姗
韩佳
孙垚锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Insurance Group Co Ltd
Original Assignee
Sunshine Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunshine Insurance Group Co Ltd filed Critical Sunshine Insurance Group Co Ltd
Priority to CN202111367504.3A priority Critical patent/CN114036283A/en
Publication of CN114036283A publication Critical patent/CN114036283A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text matching method, device, equipment and readable storage medium. The method includes: acquiring a text to be matched and a candidate text set; inputting the text to be matched into a matching model to obtain a processing result; and determining, from the candidate text set according to the processing result, a target text that matches the text to be matched. The matching model is obtained by training a basic model on the output results produced when each training sample is input into the basic model twice, where the output result of one pass of a sample through the basic model includes the dropout vector corresponding to that sample and the similarity of the two texts in that sample. The method can improve the accuracy of text matching.

Description

Text matching method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of text relationship matching, and in particular, to a text matching method, apparatus, device, and readable storage medium.
Background
At present, text matching is mainly applied to tasks that involve judging degree of semantic relatedness, reasoning over semantic relations, and question-answer equivalence, such as semantic relevance in search recommendation, question-question and question-answer matching in intelligent question answering, and entity linking in a knowledge graph. The algorithms commonly used for text matching mainly address the matching problem at the vocabulary level.
Algorithms that match texts based on vocabulary overlap have serious limitations, and their results in text similarity matching are often inaccurate.
Therefore, how to improve the accuracy of text matching becomes a technical problem which needs to be solved urgently.
Disclosure of Invention
The embodiment of the application aims to provide a text matching method, and the accuracy of text matching can be improved through the technical scheme of the embodiment of the application.
In a first aspect, an embodiment of the present application provides a text matching method, including: acquiring a text to be matched and a candidate text set; inputting the text to be matched into a matching model to obtain a processing result; and determining, from the candidate text set according to the processing result, a target text that matches the text to be matched. The matching model is obtained by training a basic model on the output results produced when each training sample is input into the basic model twice, where the output result of one pass of a sample through the basic model includes the dropout vector corresponding to that sample and the similarity of the two texts in that sample.
In this process, the text to be matched can be input into the matching model, and the target text can be found from the candidate text set based on the model's output.
Optionally, when the processing result is the similarity between the text to be matched and each candidate text in the candidate set, inputting the text to be matched into the matching model, including:
inputting a text to be matched and each candidate text in the candidate set into a matching model;
determining a target text matched with the text to be matched from the candidate text set according to the processing result, wherein the step of determining the target text matched with the text to be matched comprises the following steps:
sorting the similarity values of the text to be matched and each candidate text in the candidate set;
and determining the candidate text corresponding to the value with the maximum similarity as the target text.
In this process, the text to be matched and the candidate text set can be input into the model simultaneously; the candidate text with the maximum similarity to the text to be matched can then be identified directly, and that candidate text is the target text.
Optionally, the processing result is a vector of the text to be matched, and the determining, according to the processing result, a target text matched with the text to be matched from the candidate text set includes:
calculating cosine similarity of the vector of the text to be matched and the vector of each candidate text in the candidate set to obtain M cosine similarities, wherein M is a positive integer greater than or equal to 2;
and determining the candidate text corresponding to the maximum cosine similarity value in the M cosine similarities as the target text.
In this process, the text to be matched can be input into the model on its own, and the cosine similarity between its vector and the vector of each candidate text is calculated; the larger the similarity value, the more similar the corresponding candidate text is to the text to be matched, making the target text easier to find.
Optionally, before obtaining the text to be matched and the candidate text set, the method further includes:
acquiring a text in a system log;
manually labeling the texts to mark which of them are similar;
splicing the manually labeled similar texts pairwise, using a template prepared in advance, to form training samples;
and inputting the training samples into the basic model twice to obtain output results, and training the basic model on these output results to obtain the matching model, wherein the output result of one pass of a sample through the basic model includes the dropout vector corresponding to that sample and the similarity of the two texts in that sample.
In this process, training the model by inputting each sample into the basic model twice mitigates overfitting during text matching.
Optionally, after each sample has been input into the basic model twice and the basic model has been trained using the training optimization algorithm, the method further includes:
inputting the verification samples in the verification set into the matching model twice to obtain two vectors of the verification samples;
calculating cosine similarity according to the two vectors of the verification sample;
verifying the matching model according to the cosine similarity value;
or
Inputting the verification samples in the verification set into a matching model to obtain the similarity of two texts in the verification samples;
and verifying the matching model according to the similarity of the two texts in the sample.
In this process, the quality of the model can be assessed by validating it against the verification set: if the model meets the standard after multiple rounds of verification, it can be put into use; if the verification result does not meet the standard, training continues on further samples until it does.
Optionally, in the process of verifying the model by the verification set, verifying the matching model according to the cosine similarity value includes:
comparing the cosine similarity value with a threshold value to obtain a comparison result;
and verifying the matching model according to the comparison result.
In the above process, comparing the cosine similarity value with a threshold value determines whether the model training meets the standard.
Optionally, before obtaining the text to be matched and the candidate text set, the method further includes:
and screening similar texts in the knowledge base and the text to be matched by using a text similarity algorithm in the server to obtain a candidate text set.
In this process, texts similar to the text to be matched are found in the knowledge base using a similarity algorithm; collecting the candidate texts in advance makes it convenient to feed the candidate text set directly into the model later.
In a second aspect, an embodiment of the present application provides an apparatus for text matching, including:
the acquisition module is used for acquiring a text to be matched and a candidate text set;
the input module is used for inputting the text to be matched into the matching model to obtain a processing result;
and the output module is used for determining, from the candidate text set according to the processing result, a target text matched with the text to be matched, wherein the matching model is obtained by training a basic model on the output results produced when each training sample is input into the basic model twice, and the output result of one pass of a sample through the basic model includes the dropout vector corresponding to the sample and the similarity of the two texts in the sample.
Optionally, the input module is specifically configured to:
inputting the text to be matched and each candidate text in the candidate set into the matching model;
the output module is specifically configured to:
sorting the similarity values of the text to be matched and each candidate text in the candidate set;
and determining the candidate text corresponding to the value with the maximum similarity as the target text.
Optionally, the output module is specifically configured to:
the processing result is the vector of the text to be matched, and the vector of the text to be matched and the vector of each candidate text in the candidate set are subjected to cosine similarity calculation to obtain M cosine similarities, wherein M is a positive integer greater than or equal to 2;
and determining the candidate text corresponding to the maximum cosine similarity value in the M cosine similarities as the target text.
Optionally, the apparatus further comprises:
a training module, configured to, before the obtaining of the to-be-matched text and the candidate text set:
acquiring a text in a system log;
manually labeling the texts to mark which of them are similar;
splicing the manually labeled similar texts pairwise, using a template prepared in advance, to form training samples;
and inputting the training samples into the basic model twice to obtain output results, and training the basic model on these output results to obtain the matching model, wherein the output result of one pass of a sample through the basic model includes the dropout vector corresponding to that sample and the similarity of the two texts in that sample.
Optionally, the apparatus further comprises:
the verification module is used, after each sample has been input into the basic model twice and the basic model has been trained using the training optimization algorithm, for:
inputting the verification samples in the verification set into a matching model twice to obtain two vectors of the verification samples;
calculating cosine similarity according to the two vectors of the verification sample;
verifying the matching model according to the cosine similarity value;
or
Inputting verification samples in a verification set into a matching model to obtain the similarity of two texts in the verification samples;
and verifying the matching model according to the similarity of the two texts in the sample.
Optionally, the verification module is specifically configured to:
comparing the cosine similarity value with a threshold value to obtain a comparison result;
and verifying the matching model according to the comparison result.
Optionally, the apparatus further comprises:
and the screening module is used for screening similar texts in a knowledge base and the text to be matched by using a text similarity algorithm in a server to obtain the candidate text set before the text to be matched and the candidate text set are obtained.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the method as provided in the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a text matching method according to an embodiment of the present disclosure;
fig. 2 is a schematic block diagram of a text matching apparatus provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text matching apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: logical labels and letters refer to similar items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
The method and device are applied to text matching scenarios; specifically, an input text is fed into a text matching model, and texts similar to the input text are found.
Current text matching techniques include algorithms such as BoW (bag of words), TF-IDF (term frequency-inverse document frequency), BM25, Jaccard similarity, SimHash, and the like. For example, BM25 calculates a matching score between a web page and a query according to how well the page covers the query terms: the higher the score, the better the page matches the query. These algorithms mainly solve the matching problem at the vocabulary level. Matching algorithms based on vocabulary overlap have great limitations, including: word-sense limitations — for example, "taxi" and "cab" are literally dissimilar but refer to the same kind of vehicle, while "apple" means different things in different contexts, either the fruit or the company; structural limitations — although "machine learning" and "learning machine" use exactly the same words, the two expressions have different meanings; and knowledge limitations — for example, the sentence "Qinheyuang playing a mobile phone" has no lexical or syntactic problem, yet is incorrect when real-world knowledge is taken into account. This shows that the text matching task cannot stop at the literal level; it requires matching at the semantic level.
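The vocabulary-overlap limitation is easy to demonstrate with a minimal Jaccard similarity over word sets (a sketch of my own, not code from the patent; the English pair "taxi"/"cab" stands in for the original synonym example):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Identical word sets, different meanings -> lexical score is maximal.
print(jaccard("machine learning", "learning machine"))  # 1.0

# Synonyms that share no words -> lexical score is zero.
print(jaccard("taxi", "cab"))  # 0.0
```

Both results are the opposite of what a semantic matcher should return, which motivates moving to model-based matching.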
The present application applies contrastive learning to the training of a deep text matching model in the supervised learning setting. Specifically, each sample (a text pair) in a batch is encoded by a BERT model; the output encoding vector is passed through Dropout twice, and the two resulting vectors form a positive pair for contrastive learning; one of the two vectors is randomly sampled and paired with the output vectors of the other samples in the batch to construct negative pairs; a cosine-similarity loss is computed, a cross-entropy loss is computed against the human-annotated label of whether the text pair is similar, and the two are combined into a multi-objective training task; the model is then trained by back-propagation and gradient descent. By exploiting the self-supervision mechanism of contrastive learning, the method can train a deep text matching model with a small sample size, and constructing positive pairs with two Dropout passes acts as data augmentation. This alleviates overfitting of the deep model and makes the model robust to Dropout, so that the outputs under different Dropout masks are essentially consistent, which resolves the train/predict inconsistency introduced by Dropout and improves model performance.
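The training objective described above can be sketched in miniature with NumPy — random vectors stand in for BERT encodings, and the function names and the temperature value are my own assumptions, not details from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x: np.ndarray, p: float = 0.1) -> np.ndarray:
    """One Dropout pass: zero a random subset of dimensions, rescale the rest."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(batch: np.ndarray, tau: float = 0.05) -> float:
    """In-batch contrastive loss: two Dropout passes over the same encoding
    form the positive pair, the other samples in the batch act as negatives,
    and cross-entropy pulls each row toward its own second view."""
    z1 = np.stack([dropout(v) for v in batch])
    z2 = np.stack([dropout(v) for v in batch])
    sims = np.array([[cosine(a, b) for b in z2] for a in z1]) / tau
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

# Toy "encodings" for a batch of 4 text pairs, 16 dimensions each.
batch = rng.normal(size=(4, 16))
loss = contrastive_loss(batch)
print(loss)  # a positive scalar; smaller means the two Dropout views agree more
```

In the patent's full method this contrastive term is combined with a cross-entropy term on the human similarity label to form the multi-objective task.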
The text matching method according to the embodiment of the present application is described in detail below with reference to fig. 1.
Referring to fig. 1, fig. 1 is a flowchart of a text matching method according to an embodiment of the present application, where the text matching method shown in fig. 1 includes:
110: and acquiring a text to be matched and a candidate text set.
Obtaining the text to be matched and the candidate text set in advance makes it convenient to input them into the model later, either directly or after combination.
The text to be matched is the input text, and the candidate text is the text which is compared with the input text in similarity.
Optionally, before obtaining the text to be matched and the candidate text set, the method shown in fig. 1 may further include:
acquiring a text in a system log;
manually labeling the texts to mark which of them are similar;
splicing the manually labeled similar texts pairwise, using a template prepared in advance, to form training samples;
and inputting the training samples into the basic model twice to obtain output results, and training the basic model on these output results to obtain the matching model, wherein the output result of one pass of a sample through the basic model includes the dropout vector corresponding to that sample and the similarity of the two texts in that sample.
Training the model in this way, in particular by inputting each sample into the basic model twice, mitigates overfitting during text matching.
The basic model may be a BERT model or another available model, and the dropout vector may be obtained by applying Dropout to a sample's encoding during training. For example, each encoding vector may have 300 dimensions; after the Dropout computation, some of the values, say 30 of them, are removed, and the remaining 270 values form a new vector. Because the 30 removed dimensions are chosen at random, the vector produced by each Dropout pass is different; by adjusting the model parameters so that two or more such vectors become essentially the same, the model is made more robust.
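The 300-dimension example can be sketched as follows (my own toy code, with random data standing in for real BERT encodings):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=300)         # a 300-dimensional encoding vector
keep = rng.random(300) >= 0.1    # each dimension survives with probability 0.9
dropout_vec = v * keep           # the "dropout vector" for this pass

# A second pass draws a different random mask, so the two vectors differ;
# training then pushes the model to make such views agree.
keep2 = rng.random(300) >= 0.1
dropout_vec2 = v * keep2
```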
Optionally, after each sample has been input into the basic model twice and the basic model has been trained using the training optimization algorithm, the method shown in fig. 1 may further include:
inputting the verification samples in the verification set into the matching model twice to obtain two vectors of the verification samples;
calculating cosine similarity according to the two vectors of the verification sample;
verifying the matching model according to the cosine similarity value;
or
Inputting the verification samples in the verification set into a matching model to obtain the similarity of two texts in the verification samples;
and verifying the matching model according to the similarity of the two texts in the sample.
Verification against the verification set assesses the quality of model training: if the model meets the standard after multiple rounds of verification, it can be put into use; if the verification result does not meet the standard, training continues on further samples until it does.
The cosine similarity can be used for calculating the similarity of two vectors, and further judging the similarity of two samples.
Optionally, in the process of verifying the model by the verification set, verifying the matching model according to the cosine similarity value includes:
comparing the cosine similarity value with a threshold value to obtain a comparison result;
and verifying the matching model according to the comparison result.
Comparing the cosine similarity value with a threshold value determines whether the model training meets the standard.
A range of acceptable values can be set. For example, when 100 samples are used for model verification, the model may be considered up to standard if more than 95 of the samples produce results that meet the standard. The threshold itself may also be set, for instance, to 0.95: when the final similarity result exceeds 0.95, the model is considered to meet the standard.
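The two acceptance criteria described above can be sketched as follows (the function name and exact rule are my own illustration of the text, not the patent's specification):

```python
def model_meets_standard(similarities, threshold=0.95, pass_rate=0.95):
    """A verification sample passes if its similarity exceeds the threshold;
    the model meets the standard if a large enough share of samples pass."""
    passed = sum(s > threshold for s in similarities)
    return passed / len(similarities) >= pass_rate

# 96 of 100 verification samples pass -> the model meets the standard.
print(model_meets_standard([0.97] * 96 + [0.5] * 4))   # True

# Only 90 of 100 pass -> keep training.
print(model_meets_standard([0.97] * 90 + [0.5] * 10))  # False
```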
Optionally, before obtaining the text to be matched and the candidate text set, the method shown in fig. 1 may further include:
and screening similar texts in the knowledge base and the text to be matched by using a text similarity algorithm in the server to obtain a candidate text set.
And finding out a text similar to the text to be matched from the knowledge base by using a similarity algorithm, and after the candidate text is collected in advance, conveniently and directly placing the candidate text set into a model in the follow-up process.
A preliminary screening can be performed when searching for similar texts, which reduces the number of texts considered during matching and thus the time and resources the matching consumes.
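A minimal sketch of such a screening step (my own illustration; the patent does not specify which text similarity algorithm the server uses, so simple word overlap stands in here):

```python
def prefilter(query: str, knowledge_base: list, top_k: int = 3) -> list:
    """Keep only the top_k lexically closest texts as the candidate set,
    shrinking the search space before the more expensive matching model runs."""
    def overlap(a: str, b: str) -> float:
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return sorted(knowledge_base, key=lambda t: overlap(query, t), reverse=True)[:top_k]

kb = ["red apple fruit snack", "quantum physics notes", "green apple"]
print(prefilter("red apple fruit", kb, top_k=2))
# ['red apple fruit snack', 'green apple']
```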
120: and inputting the text to be matched into the matching model to obtain a processing result.
It should be understood that, in the embodiment of the present application, the condition of the processing result may include various situations, for example:
in case 1, the processing result is the similarity between the text to be matched and each candidate text in the candidate set.
In case 2, the processing result is a vector of the text to be matched.
130: and determining the target text matched with the text to be matched from the candidate text set according to the processing result.
The matching model in the embodiment of the present application is trained on the vectors and similarities obtained by inputting each sample into the model twice, which effectively alleviates model overfitting and thus improves the accuracy of text matching.
Specifically, for the case where the processing result is different, the method for determining the target text is different, for example, for case 1, optionally, when the processing result is the similarity between the text to be matched and each candidate text in the candidate set, the inputting the text to be matched into the matching model includes:
inputting a text to be matched and each candidate text in the candidate set into a matching model;
determining a target text matched with the text to be matched from the candidate text set according to the processing result, wherein the step of determining the target text matched with the text to be matched comprises the following steps:
sorting the similarity values of the text to be matched and each candidate text in the candidate set;
and determining the candidate text corresponding to the value with the maximum similarity as the target text.
In this way, the text to be matched and the candidate text set are input into the model simultaneously; the candidate text with the maximum similarity can be identified directly from the similarity between the text to be matched and each candidate text, and that candidate text is the target text.
For the above situation 2, optionally, the processing result is a vector of the text to be matched, and determining the target text matched with the text to be matched from the candidate text set according to the processing result includes:
calculating cosine similarity of the vector of the text to be matched and the vector of each candidate text in the candidate set to obtain M cosine similarities, wherein M is a positive integer greater than or equal to 2;
and determining the candidate text corresponding to the maximum cosine similarity value in the M cosine similarities as the target text.
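The ranking in this second case can be sketched as follows (a NumPy illustration of my own, not the patent's implementation):

```python
import numpy as np

def best_match(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of the candidate whose cosine similarity to the
    query vector is largest."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0],    # orthogonal to the query
                       [0.9, 0.1],    # nearly parallel -> best match
                       [-1.0, 0.0]])  # opposite direction
print(best_match(query, candidates))  # 1
```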
The text to be matched can be input into the model on its own, and the cosine similarity between its vector and the vector of each candidate text is calculated; the larger the similarity value, the more similar the corresponding candidate text is to the text to be matched, making the target text easier to find.
The vectors of the candidate texts may be obtained in advance, either by inputting them into the model beforehand or by obtaining them from a third party (for example, a third-party device computes the vectors in advance through the sample input module, and the entity executing the present method then retrieves them). Alternatively, the vectors may be computed while the text matching method is executed; that is, before the target text is determined from the candidate text set according to the processing result, the method further includes inputting each candidate text into the matching model to obtain its vector.
The method of text matching was described above with fig. 1, and the apparatus of text matching is described below with reference to fig. 2-3.
Referring to fig. 2, a schematic block diagram of a text matching apparatus 200 provided in the embodiment of the present application is shown, where the apparatus 200 may be a module, a program segment, or code on an electronic device. The apparatus 200 corresponds to the above-mentioned embodiment of the method of fig. 1, and can perform various steps related to the embodiment of the method of fig. 1, and specific functions of the apparatus 200 can be referred to the following description, and detailed descriptions are appropriately omitted herein to avoid redundancy.
Optionally, the apparatus 200 includes:
an obtaining module 210, configured to obtain a text to be matched and a candidate text set;
the input module 220 is configured to input the text to be matched into a matching model to obtain a processing result;
an output module 230, configured to determine, according to the processing result, a target text matched with the text to be matched from the candidate text set, where the matching model is obtained by training a basic model on the output results produced when each training sample is input into the basic model twice, and the output result of one pass of a sample through the basic model includes the dropout vector corresponding to the sample and the similarity between the two texts in the sample.
Optionally, the input module is specifically configured to:
inputting the text to be matched and each candidate text in the candidate set into the matching model;
the output module is specifically configured to:
sorting the similarity values of the text to be matched and each candidate text in the candidate set;
and determining the candidate text corresponding to the value with the maximum similarity as the target text.
Optionally, the output module is specifically configured to:
the processing result is the vector of the text to be matched, and the vector of the text to be matched and the vector of each candidate text in the candidate set are subjected to cosine similarity calculation to obtain M cosine similarities, wherein M is a positive integer greater than or equal to 2;
and determining the candidate text corresponding to the maximum cosine similarity value in the M cosine similarities as the target text.
Optionally, the apparatus further comprises:
a training module, configured to, before the obtaining of the to-be-matched text and the candidate text set:
acquiring a text in a system log;
manually annotating similar texts among the acquired texts;
concatenating each pair of the manually annotated similar texts using a pre-prepared template to form training samples;
and inputting the training samples into the basic model twice to obtain an output result, and training the basic model according to the output result to obtain a matching model, wherein the output result of inputting one sample once into the basic model includes the sentence vector corresponding to the sample and the similarity between the two texts in the sample.
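The twice-input step can be illustrated with a toy stand-in encoder whose only source of randomness is dropout, so that two passes over the same sample yield two different sentence vectors; this sketch assumes a dropout-based scheme and is emphatically not the patent's basic model:

```python
import random

def toy_encoder(tokens, dropout_p=0.1, seed=None):
    """Stand-in encoder (NOT the patent's basic model): buckets tokens into a
    fixed-size vector, then applies dropout as a model in training mode would.
    With dropout active, two passes over the same input yield different vectors."""
    rng = random.Random(seed)
    vec = [0.0] * 8
    for tok in tokens:
        vec[sum(ord(c) for c in tok) % 8] += 1.0
    # dropout: randomly zero components of the representation
    return [0.0 if rng.random() < dropout_p else x for x in vec]

sample = ["insurance", "claim", "process"]          # one training sample (illustrative)
v1 = toy_encoder(sample, dropout_p=0.5, seed=1)     # first pass through the model
v2 = toy_encoder(sample, dropout_p=0.5, seed=2)     # second pass through the model
print(v1 != v2)  # the one sample yields two distinct sentence vectors
```

A training objective can then pull the two vectors of the same sample together while pushing vectors of different samples apart; the patent leaves the exact loss unspecified.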
Optionally, the apparatus further comprises:
a verification module, configured to, after each sample has been input into the basic model twice and the basic model has been trained using the training optimization algorithm:
inputting the verification samples in the verification set into a matching model twice to obtain two vectors of the verification samples;
calculating cosine similarity according to the two vectors of the verification sample;
verifying the matching model according to the cosine similarity value;
or
Inputting verification samples in a verification set into a matching model to obtain the similarity of two texts in the verification samples;
and verifying the matching model according to the similarity of the two texts in the verification sample.
Optionally, the verification module is specifically configured to:
comparing the cosine similarity value with a threshold value to obtain a comparison result;
and verifying the matching model according to the comparison result.
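The threshold comparison used for verification can be sketched as follows; the threshold value of 0.8 and the verification pairs are illustrative assumptions, not values from the patent:

```python
def verify_pair(cosine_sim, threshold=0.8):
    """Label a verification pair as similar when its cosine similarity
    reaches the threshold; the threshold value is illustrative."""
    return cosine_sim >= threshold

# verification set: (cosine similarity from the matching model, gold label) -- illustrative
pairs = [(0.91, True), (0.42, False), (0.85, True), (0.77, True)]
correct = sum(verify_pair(sim) == gold for sim, gold in pairs)
accuracy = correct / len(pairs)
print(accuracy)  # 0.75: the 0.77 pair falls below the 0.8 threshold
```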
Optionally, the apparatus further comprises:
and the screening module is used for, before the text to be matched and the candidate text set are obtained, screening texts in a knowledge base that are similar to the text to be matched by using a text similarity algorithm in a server to obtain the candidate text set.
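The patent does not name the text similarity algorithm used for screening, so the sketch below uses word-level Jaccard similarity purely as a placeholder; the knowledge-base texts and function names are likewise illustrative:

```python
def jaccard_words(a, b):
    """Jaccard similarity over word sets -- one simple screening metric (an
    assumption; the patent only says 'a text similarity algorithm')."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def screen_candidates(text_to_match, knowledge_base, top_k=2):
    """Keep the top_k knowledge-base texts most similar to the text to be matched."""
    return sorted(knowledge_base, key=lambda t: jaccard_words(text_to_match, t),
                  reverse=True)[:top_k]

# Illustrative knowledge base:
knowledge_base = [
    "how do I renew my policy",
    "claim settlement process",
    "contact customer service",
]
candidate_set = screen_candidates("renew my insurance policy", knowledge_base)
print(candidate_set[0])  # how do I renew my policy
```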
Fig. 3 is a schematic structural diagram of a text matching apparatus provided in an embodiment of the present application, where the apparatus may include a processor 310 and a memory 320. Optionally, the apparatus may further include: a communication interface 330 and a communication bus 340. The apparatus corresponds to the method embodiment of fig. 1 and can perform the steps of that embodiment; its specific functions are described below.
In particular, memory 320 is used to store computer readable instructions.
The processor 310 is configured to execute the instructions stored in the memory 320 to perform steps 110 to 130 of the method embodiment of fig. 1.
The communication interface 330 is used for communicating signaling or data with other node devices, for example, with a server or a terminal; the embodiments of the present application are not limited thereto.
And a communication bus 340 for realizing direct connection communication of the above components.
The communication interface 330 of the device in the embodiment of the present application is used for signaling or data communication with other node devices. The memory 320 may be a high-speed RAM or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the aforementioned processor. The memory 320 stores computer readable instructions which, when executed by the processor 310, cause the electronic device to perform the method process described above with reference to fig. 1. The processor 310 may be deployed in the apparatus 200 to perform the functions described herein. The processor 310 may be, for example, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; the embodiments of the present application are not limited thereto.
Embodiments of the present application further provide a readable storage medium storing a computer program which, when executed by a processor, performs the method process performed by the electronic device in the method embodiment shown in fig. 1.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method and will not be described in detail herein.
In summary, the present application provides a text matching method, apparatus, device, and readable storage medium. The method acquires a text to be matched and a candidate text set; inputs the text to be matched into a matching model to obtain a processing result; and determines, according to the processing result, a target text matching the text to be matched from the candidate text set, where the matching model is obtained by training a basic model according to output results obtained after training samples are input into the basic model twice, and the output result of inputting one sample once includes the sentence vector corresponding to that sample and the similarity between the two texts in that sample. The method can improve the accuracy of text matching.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit its scope; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within its protection scope. It should be noted that like reference numerals and letters refer to similar items in the figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method of text matching, comprising:
acquiring a text to be matched and a candidate text set;
inputting the text to be matched into a matching model to obtain a processing result;
and determining, according to the processing result, a target text matching the text to be matched from the candidate text set, wherein the matching model is obtained by training a basic model according to an output result obtained after a training sample is input into the basic model twice, and the output result of inputting one sample once into the basic model comprises a sentence vector corresponding to the sample and the similarity between the two texts in the sample.
2. The method according to claim 1, wherein the processing result is similarity between the text to be matched and each candidate text in the candidate set, and the inputting the text to be matched into a matching model comprises:
inputting the text to be matched and each candidate text in the candidate set into the matching model;
the determining the target text matched with the text to be matched from the candidate text set according to the processing result comprises:
sorting the similarity values of the text to be matched and each candidate text in the candidate set;
and determining the candidate text corresponding to the value with the maximum similarity as the target text.
3. The method of claim 1,
the processing result is a vector of the text to be matched, and the determining of the target text matched with the text to be matched from the candidate text set according to the processing result comprises the following steps:
calculating cosine similarity of the vector of the text to be matched and the vector of each candidate text in the candidate set to obtain M cosine similarities, wherein M is a positive integer greater than or equal to 2;
and determining the candidate text corresponding to the maximum cosine similarity value in the M cosine similarities as the target text.
4. The method according to any one of claims 1 to 3, wherein before the obtaining the text to be matched and the candidate text set, the method further comprises:
acquiring a text in a system log;
manually annotating similar texts among the acquired texts;
concatenating each pair of the manually annotated similar texts using a pre-prepared template to form training samples;
and inputting the training samples into a basic model twice to obtain an output result, and training the basic model according to the output result to obtain the matching model, wherein the output result of inputting one sample once into the basic model comprises the sentence vector corresponding to the sample and the similarity between the two texts in the sample.
5. The method of claim 4, wherein after each sample is input into the basic model twice and the basic model is trained using the training optimization algorithm, the method further comprises:
inputting the verification samples in the verification set into a matching model twice to obtain two vectors of the verification samples;
calculating cosine similarity according to the two vectors of the verification sample;
verifying the matching model according to the cosine similarity value;
or
Inputting verification samples in a verification set into a matching model to obtain the similarity of two texts in the verification samples;
and verifying the matching model according to the similarity of the two texts in the verification sample.
6. The method of claim 5, wherein the validating the matching model according to the value of the cosine similarity comprises:
comparing the cosine similarity value with a threshold value to obtain a comparison result;
and verifying the matching model according to the comparison result.
7. The method according to any one of claims 1 to 3, wherein before the obtaining the text to be matched and the candidate text set, the method further comprises:
and screening, by using a text similarity algorithm in a server, texts in a knowledge base that are similar to the text to be matched to obtain the candidate text set.
8. An apparatus for text matching, comprising:
the acquisition module is used for acquiring a text to be matched and a candidate text set;
the input module is used for inputting the text to be matched into the matching model to obtain a processing result;
and the output module is used for determining, according to the processing result, a target text matching the text to be matched from the candidate text set, wherein the matching model is obtained by training a basic model according to an output result obtained after a training sample is input into the basic model twice, and the output result of inputting one sample once into the basic model comprises a sentence vector corresponding to the sample and the similarity between the two texts in the sample.
9. An apparatus for text matching, comprising:
a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, perform the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, comprising:
computer program, which, when run on a computer, causes the computer to carry out the method according to any one of claims 1 to 7.
CN202111367504.3A 2021-11-18 2021-11-18 Text matching method, device, equipment and readable storage medium Pending CN114036283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111367504.3A CN114036283A (en) 2021-11-18 2021-11-18 Text matching method, device, equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN114036283A true CN114036283A (en) 2022-02-11

Family

ID=80138074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111367504.3A Pending CN114036283A (en) 2021-11-18 2021-11-18 Text matching method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114036283A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676774A (en) * 2022-03-25 2022-06-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110717325B (en) Text emotion analysis method and device, electronic equipment and storage medium
CN110532353B (en) Text entity matching method, system and device based on deep learning
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN112100343B (en) Method for expanding knowledge graph, electronic equipment and storage medium
CN110175851B (en) Cheating behavior detection method and device
CN113377936B (en) Intelligent question and answer method, device and equipment
CN112860841A (en) Text emotion analysis method, device and equipment and storage medium
CN112632224B (en) Case recommendation method and device based on case knowledge graph and electronic equipment
CN113469298B (en) Model training method and resource recommendation method
CN110968689A (en) Training method of criminal name and law bar prediction model and criminal name and law bar prediction method
CN112052451A (en) Webshell detection method and device
CN112149386A (en) Event extraction method, storage medium and server
CN112364664B (en) Training of intention recognition model, intention recognition method, device and storage medium
CN110275953B (en) Personality classification method and apparatus
CN117409419A (en) Image detection method, device and storage medium
CN116127060A (en) Text classification method and system based on prompt words
CN114036283A (en) Text matching method, device, equipment and readable storage medium
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
CN113887679B (en) Model training method, device, equipment and medium integrating posterior probability calibration
CN116976341A (en) Entity identification method, entity identification device, electronic equipment, storage medium and program product
CN114254622A (en) Intention identification method and device
CN113139382A (en) Named entity identification method and device
CN113408263A (en) Criminal period prediction method and device, storage medium and electronic device
CN114372458B (en) Emergency detection method based on government work order
CN117076596B (en) Data storage method, device and server applying artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination