CN111611371A - Method, device, equipment and storage medium for matching FAQ based on wide and deep network - Google Patents

Info

Publication number
CN111611371A
Authority
CN
China
Prior art keywords
similarity
candidate
target
text
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010555479.0A
Other languages
Chinese (zh)
Other versions
CN111611371B (en)
Inventor
胡哲杨
肖龙源
李稀敏
刘晓葳
廖斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010555479.0A priority Critical patent/CN111611371B/en
Publication of CN111611371A publication Critical patent/CN111611371A/en
Application granted granted Critical
Publication of CN111611371B publication Critical patent/CN111611371B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The invention discloses a method, an apparatus, a device and a storage medium for FAQ matching based on a wide and deep network, wherein the method comprises the following steps: respectively obtaining text representations of a target question and a candidate question, and respectively mapping them into a first mapping vector of the target question and a second mapping vector of the candidate question; calculating a first similarity of the first mapping vector and the second mapping vector, a second similarity based on the sentence length difference, a third similarity based on the shared-word proportion, and a fourth similarity based on the inversion ratio of the shared words of the target question and the candidate question; and obtaining a text similarity from the first, second, third and fourth similarities, and performing FAQ matching according to the text similarity. By extracting features such as the context information, the sentence length difference, the number of shared words and the order of shared words of the target question and the candidate question, the method obtains a more comprehensive similarity for FAQ matching.

Description

Method, device, equipment and storage medium for matching FAQ based on wide and deep network
Technical Field
The invention relates to the field of natural language processing, and in particular to a method, an apparatus, a device and a computer storage medium for FAQ matching based on a wide and deep network.
Background
The FAQ matching task is: given a set of frequently asked question-answer pairs and a new target question, determine whether any question in the pairs matches the target question; if such a candidate question exists, its answer is returned as the answer to the new question. At present, the most common approach to FAQ matching is to build a twin (Siamese) network that produces context-aware vector representations of the candidate question and the target question, compute the similarity of the two vectors, and select the candidate question with the highest similarity as the match. However, this approach ignores factors such as the sentence length difference, the number of shared words and the order of shared words, so its prediction performance is limited.
Disclosure of Invention
In view of the foregoing problems, an object of the present invention is to provide a method, an apparatus, a device and a storage medium for FAQ matching based on a wide and deep network, which extract features such as the context information, the sentence length difference, the number of shared words and the order of shared words of the target question and the candidate question, so as to obtain a more comprehensive similarity for FAQ matching.
An embodiment of the invention provides an FAQ matching method based on a wide and deep network, comprising the following steps:
respectively obtaining text representations of a target question and a candidate question, and respectively mapping them into a first mapping vector of the target question and a second mapping vector of the candidate question;
calculating a first similarity of the first mapping vector and the second mapping vector;
calculating a second similarity based on the sentence length difference between the target question and the candidate question;
calculating a third similarity based on the shared-word proportion of the target question and the candidate question;
calculating a fourth similarity based on the inversion ratio of the shared words of the target question and the candidate question;
obtaining a text similarity from the first similarity, the second similarity, the third similarity and the fourth similarity, and performing FAQ matching according to the text similarity.
Preferably, the obtaining of the text representations of the target question and the candidate question and the mapping thereof into a first mapping vector of the target question and a second mapping vector of the candidate question specifically comprise:
respectively obtaining the text representations of the target question and the candidate question;
performing word segmentation on the text representations of the target question and the candidate question, and mapping them into the first mapping vector of the target question and the second mapping vector of the candidate question based on a Word2Vec mapping and a TextCNN network.
Preferably, the calculating of the second similarity based on the sentence length difference between the target question and the candidate question specifically comprises:
calculating the absolute value of the word-count difference between each candidate question and the corresponding target question;
dividing the absolute value by the sum of the word counts of the candidate question and the target question to obtain the sentence length difference rate of the candidate question and the target question;
calculating the second similarity based on the sentence length difference rate.
Preferably, the calculating of the third similarity based on the shared-word proportion of the target question and the candidate question specifically comprises:
calculating the number of words that each candidate question shares with the corresponding target question;
dividing this number by the total word count of the target question to obtain the third similarity of the candidate question based on the shared-word proportion.
Preferably, the calculating of the fourth similarity based on the inversion ratio of the shared words of the target question and the candidate question specifically comprises:
calculating the set of words that each candidate question shares with the corresponding target question;
taking the order of the shared words in the target question as the positive order, and calculating the inversion count of the shared words in the candidate question;
dividing the inversion count by the maximum possible inversion count to obtain the inversion ratio of the candidate question;
calculating the fourth similarity based on the inversion ratio.
Preferably, the obtaining of the text similarity from the first similarity, the second similarity, the third similarity and the fourth similarity and the performing of FAQ matching according to the text similarity specifically comprise:
concatenating the first similarity, the second similarity, the third similarity and the fourth similarity into a vector;
feeding the vector into a fully connected layer with one node and a sigmoid activation function to obtain the final text similarity;
performing FAQ matching according to the text similarity.
In a second aspect, the present invention provides an FAQ matching apparatus based on a wide and deep network, comprising:
a vector mapping unit, configured to respectively obtain text representations of a target question and a candidate question, and map them into a first mapping vector of the target question and a second mapping vector of the candidate question;
a first similarity obtaining unit, configured to calculate a first similarity of the first mapping vector and the second mapping vector;
a second similarity obtaining unit, configured to calculate a second similarity based on the sentence length difference between the target question and the candidate question;
a third similarity obtaining unit, configured to calculate a third similarity based on the shared-word proportion of the target question and the candidate question;
a fourth similarity obtaining unit, configured to calculate a fourth similarity based on the inversion ratio of the shared words of the target question and the candidate question;
a text similarity obtaining unit, configured to obtain a text similarity from the first similarity, the second similarity, the third similarity and the fourth similarity, and perform FAQ matching according to the text similarity.
Preferably, the vector mapping unit comprises:
a text representation obtaining module, configured to respectively obtain the text representations of the target question and the candidate question;
a vector mapping module, configured to perform word segmentation on the text representations of the target question and the candidate question, and map them into the first mapping vector of the target question and the second mapping vector of the candidate question based on a Word2Vec mapping and a TextCNN network.
Preferably, the second similarity obtaining unit comprises:
an absolute value calculation module, configured to calculate the absolute value of the word-count difference between each candidate question and the corresponding target question;
a sentence length difference rate obtaining module, configured to divide the absolute value by the sum of the word counts of the candidate question and the target question to obtain the sentence length difference rate of the candidate question and the target question;
a second similarity obtaining module, configured to calculate the second similarity based on the sentence length difference rate.
Preferably, the third similarity obtaining unit comprises:
a shared-word count calculation module, configured to calculate the number of words that each candidate question shares with the corresponding target question;
a third similarity obtaining module, configured to divide this number by the total word count of the target question to obtain the third similarity of the candidate question based on the shared-word proportion.
Preferably, the fourth similarity obtaining unit comprises:
a shared-word set calculation module, configured to calculate the set of words that each candidate question shares with the corresponding target question;
an inversion count calculation module, configured to take the order of the shared words in the target question as the positive order and calculate the inversion count of the shared words in the candidate question;
an inversion ratio obtaining module, configured to divide the inversion count by the maximum possible inversion count to obtain the inversion ratio of the candidate question;
a fourth similarity calculation module, configured to calculate the fourth similarity based on the inversion ratio.
Preferably, the text similarity obtaining unit comprises:
a vector obtaining module, configured to concatenate the first similarity, the second similarity, the third similarity and the fourth similarity into a vector;
a text similarity obtaining module, configured to feed the vector into a fully connected layer with one node and a sigmoid activation function to obtain the final text similarity;
an FAQ matching module, configured to perform FAQ matching according to the text similarity.
An embodiment of the invention also provides an FAQ matching device based on a wide and deep network, comprising a processor, a memory and a computer program stored in the memory, wherein the computer program can be executed by the processor to implement the FAQ matching method based on the wide and deep network described above.
An embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the FAQ matching method based on the wide and deep network according to the above embodiment.
In the above embodiments, on the basis of a twin network that extracts the context semantic information of the target question and the candidate question, three further types of features are extracted, namely the sentence length difference, the number of shared words and the order of shared words of the target question and the candidate question, so that a more effective text similarity comparison method is obtained and the prediction performance of the model is improved.
Drawings
In order to illustrate the technical solution of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating an FAQ matching method based on a wide and deep network according to a first embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an FAQ matching apparatus based on the wide and deep network according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, a first embodiment of the present invention provides an FAQ matching method based on a wide and deep network, which can be performed by an FAQ matching device based on a wide and deep network, in particular by one or more processors in the device, and which comprises at least the following steps:
S101, respectively obtaining text representations of the target question and the candidate question, and respectively mapping them into a first mapping vector of the target question and a second mapping vector of the candidate question.
In this embodiment, all question sentences in the FAQ are extracted and segmented into words, Word2Vec is then used to map each question into a matrix, a network such as TextCNN is selected to extract features so that each question is finally represented as a vector, and the cosine similarity of the candidate question vector and the target question vector is calculated. Specifically, the text representations of the target question and the candidate question are respectively obtained; the text representations are segmented into words and mapped into the first mapping vector of the target question and the second mapping vector of the candidate question based on a Word2Vec mapping and a TextCNN network.
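As a toy illustration of this step and the first similarity, the sketch below maps segmented questions to vectors and computes their cosine similarity. The random embedding table and the mean-pooling `encode` function are stand-ins (assumptions of this sketch) for the trained Word2Vec embeddings and the TextCNN feature extractor described above.

```python
import math
import random

# Stand-in embedding table; a real system would load trained Word2Vec vectors.
random.seed(0)
VOCAB = ["how", "to", "issue", "an", "invoice", "apply", "for"]
DIM = 8
EMBEDDINGS = {w: [random.gauss(0.0, 1.0) for _ in range(DIM)] for w in VOCAB}

def encode(tokens):
    """Map a segmented question to one vector (mean pooling stands in for TextCNN)."""
    return [sum(EMBEDDINGS[t][i] for t in tokens) / len(tokens) for i in range(DIM)]

def cosine_similarity(u, v):
    """First similarity: cosine of the two mapping vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

target_vec = encode(["how", "to", "issue", "an", "invoice"])
candidate_vec = encode(["how", "to", "apply", "for", "an", "invoice"])
first_similarity = cosine_similarity(target_vec, candidate_vec)
```

The cosine value always lies in [-1, 1], and an identical pair of questions scores 1.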
S102, calculating a first similarity of the first mapping vector and the second mapping vector.
S103, calculating a second similarity based on the sentence length difference between the target question and the candidate question.
Step S103 comprises:
S1031, calculating the absolute value of the word-count difference between each candidate question and the corresponding target question;
S1032, dividing the absolute value by the sum of the word counts of the candidate question and the target question to obtain the sentence length difference rate of the candidate question and the target question;
S1033, calculating the second similarity based on the sentence length difference rate.
Specifically, the second similarity based on the sentence length difference is 1 minus the sentence length difference rate.
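Under the formula stated above (second similarity = 1 minus the sentence length difference rate), step S103 reduces to a short function; the function name is illustrative:

```python
def length_difference_similarity(target_tokens, candidate_tokens):
    """Second similarity: 1 minus the sentence length difference rate."""
    # Absolute word-count difference, divided by the sum of the two word counts.
    diff = abs(len(target_tokens) - len(candidate_tokens))
    rate = diff / (len(target_tokens) + len(candidate_tokens))
    return 1.0 - rate

# A 4-word target against a 6-word candidate: rate = 2 / 10, similarity = 0.8.
sim2 = length_difference_similarity(["a", "b", "c", "d"],
                                    ["a", "b", "c", "d", "e", "f"])
```

Questions of equal length score exactly 1, and the score decreases as the length gap grows.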
S104, calculating a third similarity based on the shared-word proportion of the target question and the candidate question.
Step S104 comprises:
S1041, calculating the number of words that each candidate question shares with the corresponding target question;
S1042, dividing this number by the total word count of the target question to obtain the third similarity of the candidate question based on the shared-word proportion.
S105, calculating a fourth similarity based on the inversion ratio of the shared words of the target question and the candidate question.
Step S105 comprises:
S1051, calculating the set of words that each candidate question shares with the corresponding target question;
S1052, taking the order of the shared words in the target question as the positive order, and calculating the inversion count of the shared words in the candidate question;
S1053, dividing the inversion count by the maximum possible inversion count to obtain the inversion ratio of the candidate question;
S1054, calculating the fourth similarity based on the inversion ratio.
Specifically, the fourth similarity based on the inversion ratio of the shared words is 1 minus the inversion ratio.
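Step S105 can be sketched as below. The order of the shared words in the target question defines the positive order, and the similarity is 1 minus the inversion ratio; the handling of duplicate words and of fewer than two shared words is an assumption of this sketch:

```python
def inversion_similarity(target_tokens, candidate_tokens):
    """Fourth similarity: 1 minus the inversion ratio of the shared words."""
    shared = set(target_tokens) & set(candidate_tokens)
    # Positive order: index of each shared word as it appears in the target.
    rank = {w: i for i, w in enumerate(t for t in target_tokens if t in shared)}
    seq = [rank[w] for w in candidate_tokens if w in shared]
    n = len(seq)
    if n < 2:
        return 1.0  # fewer than two shared words: no inversion is possible
    # Count pairs of shared words appearing in the opposite order vs. the target.
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j]
    )
    max_inversions = n * (n - 1) // 2  # maximum possible inversion count
    return 1.0 - inversions / max_inversions

# Fully reversed shared words give similarity 0; identical order gives 1.
sim4 = inversion_similarity(["a", "b", "c"], ["c", "b", "a"])
```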
S106, obtaining a text similarity from the first similarity, the second similarity, the third similarity and the fourth similarity, and performing FAQ matching according to the text similarity.
In this embodiment, step S106 comprises:
S1061, concatenating the first similarity, the second similarity, the third similarity and the fourth similarity into a vector;
S1062, feeding the vector into a fully connected layer with one node and a sigmoid activation function to obtain the final text similarity;
S1063, performing FAQ matching according to the text similarity.
Specifically, the wide and deep network is built and then trained with labeled data, and the positive and negative examples of each question-answer pair are balanced during training. When the weights are initialized, the last fully connected layer may assign a larger weight to the semantic similarity and smaller values to the other three similarities; for example, the four similarity weights may be set to 0.7, 0.1, 0.1 and 0.1.
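The final combination layer can be sketched with the initial weights suggested above (0.7 for the semantic similarity, 0.1 for each handcrafted similarity); the zero bias is an assumption, and in the real system the layer's weights are trained on labeled data:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical initial weights per the suggestion above; trained afterwards.
WEIGHTS = [0.7, 0.1, 0.1, 0.1]
BIAS = 0.0  # assumed initial bias

def text_similarity(similarities):
    """One-node fully connected layer with sigmoid over the four similarities."""
    z = sum(w * s for w, s in zip(WEIGHTS, similarities)) + BIAS
    return sigmoid(z)

score = text_similarity([0.9, 0.8, 0.5, 1.0])  # first..fourth similarity
```

The candidate question with the highest final score is selected as the FAQ match.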
In conclusion, on the basis of a twin network that extracts the context semantic information of the target question and the candidate question, three further types of features are extracted, namely the sentence length difference, the number of shared words and the order of shared words, so that a more effective text similarity comparison method is obtained and the prediction performance of the model is improved.
The second embodiment of the present invention:
referring to fig. 2, a second embodiment of the present invention provides an FAQ matching apparatus based on a wide and deep network, including:
the vector mapping unit 100 is configured to obtain text representations of a target problem and a candidate problem, and map the text representations of the target problem and the candidate problem into a first mapping vector of the target problem and a second mapping vector of the candidate problem;
a first similarity obtaining unit 200, configured to calculate a first similarity between the first mapping vector and the second mapping vector;
a second similarity obtaining unit 300 for calculating a second similarity of sentence length differences of the target question and the candidate question;
a third similarity obtaining unit 400, configured to calculate a third similarity of the target problem and the candidate problem with the same word proportion;
a fourth similarity obtaining unit 500, configured to calculate a fourth similarity of the same word inverse number ratios of the target problem and the candidate problem;
a text similarity obtaining unit 600, configured to obtain text similarity according to the first similarity, the second similarity, the third similarity, and the fourth similarity, and perform FAQ matching according to the text similarity.
On the basis of the above embodiments, in a preferred embodiment of the present invention, the vector mapping unit 100 comprises:
a text representation obtaining module, configured to respectively obtain the text representations of the target question and the candidate question;
a vector mapping module, configured to perform word segmentation on the text representations of the target question and the candidate question, and map them into the first mapping vector of the target question and the second mapping vector of the candidate question based on a Word2Vec mapping and a TextCNN network.
On the basis of the above embodiments, in a preferred embodiment of the present invention, the second similarity obtaining unit 300 comprises:
an absolute value calculation module, configured to calculate the absolute value of the word-count difference between each candidate question and the corresponding target question;
a sentence length difference rate obtaining module, configured to divide the absolute value by the sum of the word counts of the candidate question and the target question to obtain the sentence length difference rate of the candidate question and the target question;
a second similarity obtaining module, configured to calculate the second similarity based on the sentence length difference rate.
On the basis of the above embodiment, in a preferred embodiment of the present invention, the third similarity obtaining unit 400 comprises:
a shared-word count calculation module, configured to calculate the number of words that each candidate question shares with the corresponding target question;
a third similarity obtaining module, configured to divide this number by the total word count of the target question to obtain the third similarity of the candidate question based on the shared-word proportion.
On the basis of the foregoing embodiment, in a preferred embodiment of the present invention, the fourth similarity obtaining unit 500 comprises:
a shared-word set calculation module, configured to calculate the set of words that each candidate question shares with the corresponding target question;
an inversion count calculation module, configured to take the order of the shared words in the target question as the positive order and calculate the inversion count of the shared words in the candidate question;
an inversion ratio obtaining module, configured to divide the inversion count by the maximum possible inversion count to obtain the inversion ratio of the candidate question;
a fourth similarity calculation module, configured to calculate the fourth similarity based on the inversion ratio.
On the basis of the foregoing embodiment, in a preferred embodiment of the present invention, the text similarity obtaining unit 600 comprises:
a vector obtaining module, configured to concatenate the first similarity, the second similarity, the third similarity and the fourth similarity into a vector;
a text similarity obtaining module, configured to feed the vector into a fully connected layer with one node and a sigmoid activation function to obtain the final text similarity;
an FAQ matching module, configured to perform FAQ matching according to the text similarity.
Third embodiment of the invention:
The third embodiment of the invention also provides an FAQ matching device based on a wide and deep network, comprising a processor, a memory and a computer program stored in the memory, wherein the computer program can be executed by the processor to implement the FAQ matching method based on the wide and deep network described in the above embodiment.
The fourth embodiment of the present invention:
The fourth embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the FAQ matching method based on the wide and deep network according to the foregoing embodiment.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to accomplish the present invention. The one or more units may be a series of instruction segments of a computer program capable of performing specific functions, and the instruction segments are used for describing the execution process of the computer program in the FAQ matching device based on the wide and deep network.
The wide and deep network based FAQ matching device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the schematic diagram is merely an example of the wide and deep network based FAQ matching device and does not constitute a limitation thereof; the device may include more or fewer components than those shown, combine some components, or use different components. For example, the wide and deep network based FAQ matching device may further include an input-output device, a network access device, a bus, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the wide and deep network based FAQ matching device and connects the various parts of the entire device using various interfaces and lines.
The memory may be used for storing the computer programs and/or modules, and the processor implements the various functions of the wide and deep network based FAQ matching device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated unit of the wide and deep network based FAQ matching device is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement this without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for FAQ matching based on a wide and deep network, characterized by comprising the following steps:
respectively obtaining text representations of a target question and a candidate question, and respectively mapping the text representations of the target question and the candidate question into a first mapping vector of the target question and a second mapping vector of the candidate question;
calculating a first similarity between the first mapping vector and the second mapping vector;
calculating a second similarity based on the sentence length difference between the target question and the candidate question;
calculating a third similarity based on the proportion of identical words between the target question and the candidate question;
calculating a fourth similarity based on the inversion-count ratio of the identical words of the target question and the candidate question;
and obtaining a text similarity according to the first similarity, the second similarity, the third similarity and the fourth similarity, and performing FAQ matching according to the text similarity.
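Claim 1 does not fix the metric used for the first similarity between the two mapping vectors; cosine similarity is the usual choice for comparing dense text vectors. A minimal sketch under that assumption (the function name and the choice of cosine are illustrative, not taken from the patent):

```python
import math

def first_similarity(u, v):
    """Cosine similarity between the two mapping vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```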
2. The method for FAQ matching based on the wide and deep network of claim 1, wherein text representations of a target question and a candidate question are obtained respectively, and the text representations of the target question and the candidate question are respectively mapped into a first mapping vector of the target question and a second mapping vector of the candidate question, specifically:
respectively acquiring text representations of a target question and a candidate question;
performing word segmentation on the text representation of the target question and the text representation of the candidate question, and mapping them into a first mapping vector of the target question and a second mapping vector of the candidate question based on Word2Vec mapping and a TextCNN network.
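A rough sketch of the mapping step in claim 2, with a random lookup table standing in for trained Word2Vec embeddings and a single untrained filter bank standing in for a TextCNN (all names, dimensions and weights here are illustrative assumptions, not the patent's trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical Word2Vec table: one 8-dim vector per segmented word.
EMB = {w: rng.standard_normal(8) for w in "how do i reset my password".split()}

def text_to_vector(words, kernel, window=2):
    """Map a segmented question to a fixed-length vector:
    embedding lookup (Word2Vec stand-in), 1-D convolution over word
    windows, ReLU, then max-over-time pooling (TextCNN stand-in)."""
    emb = np.stack([EMB[w] for w in words])                        # (n_words, 8)
    spans = [emb[i:i + window].ravel() for i in range(len(words) - window + 1)]
    feats = np.stack(spans) @ kernel                               # convolution as a matrix product
    return np.maximum(feats, 0).max(axis=0)                        # ReLU + max pooling

kernel = rng.standard_normal((16, 4))   # (window * emb_dim, n_filters)
vec = text_to_vector("how do i reset my password".split(), kernel)
```

Both questions would be mapped with the same embeddings and kernel, giving the two fixed-length mapping vectors compared in claim 1.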
3. The method as claimed in claim 2, wherein the second similarity based on the sentence length difference between the target question and the candidate question is calculated as follows:
calculating the absolute value of the word-count difference between each candidate question and the corresponding target question;
dividing the absolute value by the sum of the word counts of the candidate question and the target question to obtain the sentence length difference rate of the candidate question and the target question;
and calculating the second similarity of the target question and the candidate question based on the sentence length difference rate.
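The steps of claim 3 can be sketched as follows. The claim does not state how the difference rate is turned into a similarity, so mapping it as 1 − rate is an assumption, and whitespace tokenization stands in for the word segmentation of claim 2:

```python
def length_difference_similarity(target: str, candidate: str) -> float:
    """Second similarity, derived from the sentence length difference rate."""
    t_words = target.split()
    c_words = candidate.split()
    diff = abs(len(c_words) - len(t_words))      # absolute word-count difference
    rate = diff / (len(c_words) + len(t_words))  # sentence length difference rate
    return 1.0 - rate                            # assumed mapping from rate to similarity
```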
4. The method as claimed in claim 3, wherein the third similarity based on the proportion of identical words between the target question and the candidate question is calculated as follows:
counting the number of words in each candidate question that are the same as words in the corresponding target question;
and dividing that number by the total word count of the target question to obtain the third similarity of the candidate question based on the proportion of identical words.
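Claim 4 maps directly to a few lines; again, whitespace tokenization is an illustrative stand-in for the word segmentation step:

```python
def same_word_proportion(target: str, candidate: str) -> float:
    """Third similarity: target words also found in the candidate,
    divided by the target's total word count."""
    t_words = target.split()
    c_words = set(candidate.split())
    same = sum(1 for w in t_words if w in c_words)  # shared-word count
    return same / len(t_words)
```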
5. The method as claimed in claim 4, wherein the step of calculating the fourth similarity based on the inversion-count ratio of the identical words of the target question and the candidate question comprises:
determining the set of words that each candidate question shares with the corresponding target question;
calculating the inversion count of the shared words in the candidate question, taking their order in the target question as the positive order;
dividing the inversion count by the maximum value it can take, to obtain the inversion-count ratio of the candidate question;
and calculating the fourth similarity of the target question and the candidate question based on the inversion-count ratio.
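A sketch of claim 5, assuming each shared word occurs only once per question (the claim does not say how repeats are handled) and assuming the similarity is 1 − the inversion ratio:

```python
def inversion_ratio_similarity(target: str, candidate: str) -> float:
    """Fourth similarity, from the inversion count of shared words."""
    t_words = target.split()
    c_words = candidate.split()
    shared = [w for w in c_words if w in set(t_words)]  # shared words, in candidate order
    order = {w: i for i, w in enumerate(t_words)}       # target order = positive order
    ranks = [order[w] for w in shared]
    # An inversion is a pair of shared words whose relative order in the
    # candidate is the reverse of their order in the target.
    inversions = sum(1 for i in range(len(ranks))
                       for j in range(i + 1, len(ranks)) if ranks[i] > ranks[j])
    n = len(ranks)
    max_inv = n * (n - 1) // 2                          # maximum possible inversion count
    if max_inv == 0:
        return 1.0                                       # 0 or 1 shared word: no inversions possible
    return 1.0 - inversions / max_inv                    # assumed mapping from ratio to similarity
```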
6. The FAQ matching method based on the wide and deep network of claim 5, wherein the text similarity is obtained according to the first similarity, the second similarity, the third similarity and the fourth similarity, and FAQ matching is performed according to the text similarity, specifically:
concatenating the first similarity, the second similarity, the third similarity and the fourth similarity into a vector;
feeding the vector into a fully connected layer with one node and a sigmoid activation function to obtain the final text similarity;
and performing FAQ matching according to the text similarity.
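At inference time, the single-node fully connected layer of claim 6 reduces to a weighted sum of the four similarities followed by a sigmoid; the weights and bias below are placeholders for values that would come from training:

```python
import math

def text_similarity(sims, weights, bias):
    """Final text similarity: one-node dense layer with sigmoid activation
    over the concatenated (first, second, third, fourth) similarities."""
    z = sum(w * s for w, s in zip(weights, sims)) + bias  # weighted sum + bias
    return 1.0 / (1.0 + math.exp(-z))                     # sigmoid, output in (0, 1)
```

FAQ matching then amounts to ranking candidate questions by this score and returning the answer of the best match.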
7. An FAQ matching device based on a wide and deep network, characterized by comprising:
a vector mapping unit, configured to respectively acquire text representations of a target question and a candidate question and map the text representations of the target question and the candidate question into a first mapping vector of the target question and a second mapping vector of the candidate question;
a first similarity obtaining unit, configured to calculate a first similarity between the first mapping vector and the second mapping vector;
a second similarity obtaining unit, configured to calculate a second similarity based on the sentence length difference between the target question and the candidate question;
a third similarity obtaining unit, configured to calculate a third similarity based on the proportion of identical words between the target question and the candidate question;
a fourth similarity obtaining unit, configured to calculate a fourth similarity based on the inversion-count ratio of the identical words of the target question and the candidate question;
and a text similarity obtaining unit, configured to obtain a text similarity according to the first similarity, the second similarity, the third similarity and the fourth similarity, and perform FAQ matching according to the text similarity.
8. The wide and deep network-based FAQ matching device as claimed in claim 7, wherein the vector mapping unit comprises:
a text representation acquisition module, configured to respectively acquire text representations of the target question and the candidate question;
and a vector mapping module, configured to perform word segmentation on the text representation of the target question and the text representation of the candidate question, and map them into a first mapping vector of the target question and a second mapping vector of the candidate question based on Word2Vec mapping and a TextCNN network.
9. FAQ matching equipment based on a wide and deep network, comprising a processor, a memory and a computer program stored in the memory, wherein the computer program is executable by the processor to implement the method for FAQ matching based on the wide and deep network as claimed in any one of claims 1 to 6.
10. A computer-readable storage medium, comprising a stored computer program, wherein, when the computer program is executed, a device in which the computer-readable storage medium is located is controlled to perform the method for FAQ matching based on the wide and deep network according to any one of claims 1 to 6.
CN202010555479.0A 2020-06-17 2020-06-17 Method, device, equipment and storage medium for matching FAQ based on wide and deep network Active CN111611371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010555479.0A CN111611371B (en) 2020-06-17 2020-06-17 Method, device, equipment and storage medium for matching FAQ based on wide and deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010555479.0A CN111611371B (en) 2020-06-17 2020-06-17 Method, device, equipment and storage medium for matching FAQ based on wide and deep network

Publications (2)

Publication Number Publication Date
CN111611371A true CN111611371A (en) 2020-09-01
CN111611371B CN111611371B (en) 2022-08-23

Family

ID=72205479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010555479.0A Active CN111611371B (en) 2020-06-17 2020-06-17 Method, device, equipment and storage medium for matching FAQ based on wide and deep network

Country Status (1)

Country Link
CN (1) CN111611371B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392184A (en) * 2021-06-09 2021-09-14 平安科技(深圳)有限公司 Method and device for determining similar texts, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN110516040A (en) * 2019-08-14 2019-11-29 出门问问(武汉)信息科技有限公司 Semantic Similarity comparative approach, equipment and computer storage medium between text
KR20190133931A (en) * 2018-05-24 2019-12-04 한국과학기술원 Method to response based on sentence paraphrase recognition for a dialog system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
KR20190133931A (en) * 2018-05-24 2019-12-04 한국과학기술원 Method to response based on sentence paraphrase recognition for a dialog system
CN110516040A (en) * 2019-08-14 2019-11-29 出门问问(武汉)信息科技有限公司 Semantic Similarity comparative approach, equipment and computer storage medium between text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG, Lin et al.: "Sentence similarity calculation for FAQ question answering systems", Journal of Zhengzhou University (Natural Science Edition) *
LI, Jiao: "Research and implementation of a retrieval-based question answering system based on a question-answer database", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392184A (en) * 2021-06-09 2021-09-14 平安科技(深圳)有限公司 Method and device for determining similar texts, terminal equipment and storage medium
WO2022257455A1 (en) * 2021-06-09 2022-12-15 平安科技(深圳)有限公司 Method and apparatus for determining similar texts, terminal device and storage medium

Also Published As

Publication number Publication date
CN111611371B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
WO2018086401A1 (en) Cluster processing method and device for questions in automatic question and answering system
CN107273861A A subjective question scoring method, device and terminal device
CN109558533B (en) Personalized content recommendation method and device based on multiple clustering
CN109117474B (en) Statement similarity calculation method and device and storage medium
CN109271641A A text similarity calculation method, device and electronic device
CN112232346A (en) Semantic segmentation model training method and device and image semantic segmentation method and device
US11550996B2 (en) Method and system for detecting duplicate document using vector quantization
CN113706502B (en) Face image quality assessment method and device
CN109800292A Method, device and equipment for determining question-answer matching degree
CN110969172A (en) Text classification method and related equipment
CN109885831B (en) Keyword extraction method, device, equipment and computer readable storage medium
CN111611371B (en) Method, device, equipment and storage medium for matching FAQ based on wide and deep network
CN110377708B (en) Multi-scene conversation switching method and device
CN114861635A (en) Chinese spelling error correction method, device, equipment and storage medium
CN111062440A (en) Sample selection method, device, equipment and storage medium
CN112040313B (en) Video content structuring method, device, terminal equipment and medium
CN111339778B (en) Text processing method, device, storage medium and processor
WO2022095370A1 (en) Text matching method and apparatus, terminal device, and storage medium
CN114723652A (en) Cell density determination method, cell density determination device, electronic apparatus, and storage medium
CN110264311B (en) Business promotion information accurate recommendation method and system based on deep learning
CN108681490B (en) Vector processing method, device and equipment for RPC information
CN114580354B (en) Information coding method, device, equipment and storage medium based on synonym
CN114757299A (en) Text similarity judgment method and device and storage medium
CN115048531A (en) Knowledge management method, device and system for urban physical examination knowledge
CN113934842A (en) Text clustering method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant