CN111738019A

CN111738019A - Method and device for recognizing repeated sentences

Info

Publication number: CN111738019A
Application number: CN202010591982.1A
Authority: CN
Inventors: 周楠楠; 汤耀华; 杨海军; 徐倩
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-10-02

Abstract

The invention provides a method and a device for recognizing repeated sentences. The method comprises: obtaining two sentences to be recognized; when the edit distance between the two sentences is determined not to be 0, determining the semantic role of each word in the two sentences; and, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences. Because two sentences are treated as repeated sentences only when their semantic roles are the same and the words corresponding to those roles are identical or synonymous, repeated sentences are recognized through the consistency of semantic roles and their corresponding words, which can improve the accuracy of repeated-sentence recognition.

Description

Method and device for recognizing repeated sentences

Technical Field

The invention relates to the field of financial technology (Fintech), and in particular to a method and a device for recognizing repeated sentences.

Background

With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Because of the financial industry's requirements for security and real-time performance, higher demands are also placed on these technologies. In customer service in the financial field, repeated-sentence recognition is an important problem for intelligent voice customer service systems, and correctly recognizing and understanding repeated sentences can noticeably improve the user experience.

In an intelligent voice customer service system, repeated-sentence recognition is generally defined as determining whether the user's current input is a semantically correct restatement of the last sentence provided by the system. Existing technical solutions generally obtain initial vector representations of the two sentences through a word vector model, then obtain final vector representations through a CNN or RNN model, and finally determine whether the two sentences are similar by, for example, computing the similarity of the two vectors. However, the accuracy of the recognition results is not high, which affects the user experience.

In summary, there is a need for a method for recognizing a repeated sentence to solve the problem of low recognition accuracy of the repeated sentence in the prior art.

Disclosure of Invention

The invention provides a method and a device for recognizing a repeated sentence, which can solve the problem of low recognition precision of the repeated sentence in the prior art.

In a first aspect, the present invention provides a method for repeated-sentence recognition, including:

acquiring two sentences to be recognized;

when the edit distance between the two sentences is determined not to be 0, determining the semantic role of each word in the two sentences;

and if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences.

In the above technical solution, when the edit distance between the two sentences is not 0, the semantic roles of the two sentences are identified, and the two sentences are determined to be repeated sentences if their semantic roles are the same and the words corresponding to the same semantic role are identical or synonymous. Because repeated sentences are required to have the same semantic roles, with identical or synonymous words in each role, repeated sentences are recognized through the consistency of semantic roles and their corresponding words, which can improve the accuracy of repeated-sentence recognition.

Optionally, the method further includes:

if the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role are different, determining whether the agent or the patient among the semantic roles of the two sentences is the same; if not, performing first processing on the words corresponding to the agent or the patient, and determining whether the two sentences are repeated sentences according to the semantic roles of the two sentences after the first processing; otherwise, determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences.

In this technical solution, when the semantic roles differ or the words corresponding to the same semantic role differ, repeated-sentence recognition is performed by analyzing the agent or the patient among the semantic roles, which can further improve the accuracy of repeated-sentence recognition.

Optionally, performing the first processing on the words corresponding to the agent or the patient, and determining whether the two sentences are repeated sentences according to the semantic roles of the two sentences after the first processing, includes:

inverting and/or inheriting the words corresponding to the agent or the patient in the two sentences;

and if the semantic roles of the two sentences after the inversion and/or inheritance are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences; otherwise, determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences.

By inverting and/or inheriting the words corresponding to the agent or the patient, the semantic roles of the two sentences can be made consistent after the inversion and/or inheritance, so that whether the two sentences are repeated sentences can be determined.

Optionally, the determining whether the two sentences are repeated sentences according to the semantic additional words in the semantic roles of the two sentences includes:

determining whether the semantic additional word is an adverbial and a negative word; if so, determining that the two sentences are repeated sentences when the number of negative words is even, and determining that the two sentences are not repeated sentences when the number of negative words is odd;

otherwise, determining whether the semantic additional word is preset important information; if it is, sending a confirmation question back to the user; if it is not, vectorizing the two sentences and then determining whether the two sentences are repeated sentences according to the similarity of the vectors of the two sentences.

In this technical solution, when the agent or the patient cannot settle whether the two sentences are repeated sentences, the semantic additional words can be analyzed to make the determination, which can further improve the accuracy of repeated-sentence recognition.

Optionally, the determining whether the two sentences are repeated sentences according to the similarity of the vectors of the two sentences includes:

if the similarity of the vectors of the two sentences is greater than a threshold, determining that the two sentences are repeated sentences; otherwise, determining that the two sentences are not repeated sentences.

Optionally, the determining the semantic role of each word in the two sentences includes:

inputting the two sentences into a semantic role recognition model, and determining the semantic role of each word in the two sentences, wherein the semantic role recognition model is obtained by training a sequence labeling model by using a training sample labeled according to semantic role labeling.

Optionally, the training a sequence labeling model by using a training sample labeled according to semantic role labeling to obtain the semantic role recognition model includes:

acquiring training samples labeled according to semantic role labeling;

preprocessing sentences in the training samples;

inputting the preprocessed sentences into a pre-training model to obtain vector representation of each word in each sentence;

and inputting the vector representation into a sequence labeling model for training to obtain the semantic role recognition model.

In a second aspect, an embodiment of the present invention provides an apparatus for repeated-sentence recognition, including:

an acquisition unit configured to acquire two sentences to be recognized;

a processing unit configured to determine the semantic role of each word in the two sentences when the edit distance between the two sentences is determined not to be 0, and, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, to determine that the two sentences are repeated sentences.

Optionally, the processing unit is further configured to:

Optionally, the processing unit is specifically configured to:

acquiring training samples labeled according to semantic role labeling;

preprocessing sentences in the training samples;

In a third aspect, the invention provides a computing device comprising:

a memory for storing a computer program;

a processor for calling the computer program stored in the memory and executing the method according to the first aspect according to the obtained program.

In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer-executable program for causing a computer to perform the method of the first aspect.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for repeated-sentence recognition according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for repeated-sentence recognition according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a system architecture provided in an embodiment of the present invention. As shown in fig. 1, the system architecture may be a server 100 including a processor 110, a communication interface 120, and a memory 130.

The communication interface 120 is used for communicating with the customer service terminal device, and receiving and transmitting information transmitted by the customer service terminal device to implement communication.

The processor 110 is a control center of the server 100, connects various parts of the entire server 100 using various interfaces and lines, performs various functions of the server 100 and processes data by running or executing software programs and/or modules stored in the memory 130 and calling data stored in the memory 130. Alternatively, processor 110 may include one or more processing units.

The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by operating the software programs and modules stored in the memory 130. The memory 130 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to a business process, and the like. Further, the memory 130 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

It should be noted that the structure shown in fig. 1 is only an example, and the embodiment of the present invention is not limited thereto.

Based on the above description, Fig. 2 exemplarily shows the flow of a method for repeated-sentence recognition, which can be performed by an apparatus for repeated-sentence recognition.

As shown in fig. 2, the specific steps of the process include:

Step 201, two sentences to be recognized are obtained.

In the embodiment of the present invention, the two sentences to be recognized may be two sentences in the dialogue data between customer service and a user, for example one sentence input by customer service and one sentence input by the user; typically, customer service restates the user's sentence, or the user restates customer service's sentence. For example, sentence A is "make a manual repayment" and sentence B is "make a manual repayment, right?".

Step 202, when the edit distance between the two sentences is determined not to be 0, determining the semantic role of each word in the two sentences.

After the two sentences are obtained in step 201, meaningless words in them need to be removed, such as modal particles and filler phrases like "you are", "you mean", and "right?". The edit distance between the two sentences is then determined. Edit distance generally refers to the minimum number of editing operations required to transform one string into another. The allowed editing operations are replacing one character with another, inserting a character, and deleting a character. The smaller the edit distance, the more similar the two strings.

It should be noted that an edit distance of 0 indicates that the two sentences are identical, so they can be directly determined to be repeated sentences. If the edit distance is not 0, the semantic role of each word in the two sentences needs to be determined by inputting the two sentences into the semantic role recognition model.
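The filler removal and edit-distance check described above can be sketched as follows. This is an illustrative sketch only, not the implementation of the embodiment: the filler list `FILLERS`, the comma handling, and the helper names `strip_fillers`, `edit_distance`, and `needs_role_analysis` are all assumptions introduced here.

```python
from typing import Sequence

# Hypothetical filler phrases; the text names only examples of the category.
FILLERS: Sequence[str] = ("you mean", "right?")


def strip_fillers(sentence: str, fillers: Sequence[str] = FILLERS) -> str:
    """Remove filler phrases and commas, then normalize whitespace."""
    for filler in fillers:
        sentence = sentence.replace(filler, "")
    return " ".join(sentence.replace(",", " ").split())


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions and
    deletions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


def needs_role_analysis(sentence_a: str, sentence_b: str) -> bool:
    """False when the cleaned sentences are identical (edit distance 0),
    in which case they are directly treated as repeated sentences."""
    return edit_distance(strip_fillers(sentence_a), strip_fillers(sentence_b)) != 0
```

With these assumed fillers, `needs_role_analysis("make a manual repayment", "make a manual repayment, right?")` is false, so the pair would be accepted as repeated sentences without invoking the semantic role recognition model.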

Training the sequence labeling model with training samples labeled according to semantic role labeling to obtain the semantic role recognition model specifically includes the following.

First, training samples labeled according to semantic role labeling are obtained; the sentences in the training samples are then preprocessed, and the preprocessed sentences are input into a pre-trained model to obtain a vector representation of each word in each sentence. Finally, the vector representations are input into a sequence labeling model for training to obtain the semantic role recognition model.

In the embodiment of the present invention, the task of SRL (Semantic Role Labeling) is to take the predicate of a sentence as the center, study the relationship between each component of the sentence and the predicate, and describe that relationship with semantic roles; that is, to determine the arguments of the (core) predicate in the sentence and the roles those arguments play. SRL generally divides the components of a sentence into three categories: the predicate (REL), which is generally a verb or an adjective; core arguments (ArgN, N ∈ {0,1,2,3,4,5}); and semantic additional words (ArgM-x). Core arguments are arguments directly related to the predicate, such as the agent of the predicate (Arg0) and the patient of the predicate (Arg1). Semantic additional words are arguments that are not directly related to the predicate and can exist independently, such as time (ArgM-TMP), place (ArgM-LOC), purpose (ArgM-PRP), degree (ArgM-DGR), scope (ArgM-EXT), and the like. For example, for the sentence "you can now search for the public number", SRL can determine that the predicate is "search", the agent is "you", the patient is "the public number", and the time is "now".
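One way to hold the output of semantic role labeling is a mapping from role label to the word filling that role. The sketch below is an assumption about representation, not something the embodiment prescribes; the names `SemanticRoles`, `EXAMPLE`, and `split_roles` are hypothetical. It separates the three categories named above for the "search" example.

```python
from typing import Dict, Optional, Tuple

# Assumed representation: role label -> word (or phrase) filling that role.
SemanticRoles = Dict[str, str]

# The "search" example from the text, written in the assumed representation.
EXAMPLE: SemanticRoles = {
    "REL": "search",          # predicate
    "Arg0": "you",            # agent
    "Arg1": "public number",  # patient
    "ArgM-TMP": "now",        # semantic additional word: time
}


def split_roles(
    roles: SemanticRoles,
) -> Tuple[Optional[str], SemanticRoles, SemanticRoles]:
    """Split a role mapping into predicate, core arguments (ArgN) and
    semantic additional words (ArgM-x)."""
    predicate = roles.get("REL")
    additional = {k: v for k, v in roles.items() if k.startswith("ArgM-")}
    core = {k: v for k, v in roles.items()
            if k.startswith("Arg") and not k.startswith("ArgM")}
    return predicate, core, additional
```

Keeping the core arguments and the semantic additional words apart is convenient because the later steps treat them differently: the agent and patient are compared first, and the additional words are consulted only when that comparison is inconclusive.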

In the specific training process, data first needs to be collected and labeled according to the SRL labeling standard to obtain a training sample D1. A semantic role recognition model is then trained on the training sample D1. The model may adopt a sequence labeling model based on BERT (Bidirectional Encoder Representations from Transformers, a pre-trained model) + LSTM (Long Short-Term Memory network) + CRF (Conditional Random Field), and the training process is as follows:

First, the data in the training sample D1 is preprocessed: the sentences are segmented at the character level and converted into ID form, a [CLS] label is placed at the beginning of each sentence and a [SEP] label at the end, and the corpus is brought to a fixed length by padding sentences that are too short with '0' and truncating sentences that exceed the fixed length.
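The preprocessing step can be sketched as below. The vocabulary `TOY_VOCAB`, the `[UNK]` fallback for unknown characters, and the choice to keep `[SEP]` when truncating are assumptions made for illustration; the text specifies only character-level segmentation, ID conversion, the [CLS] and [SEP] labels, padding with 0, and truncation.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real one would come from the pre-trained model.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "b": 5,
}


def preprocess(sentence: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Character-level segmentation to IDs, wrapped in [CLS] ... [SEP],
    then padded with 0 or truncated to exactly max_len IDs."""
    unk = vocab["[UNK]"]
    # Truncate the characters first so [CLS] and [SEP] always fit (assumption).
    chars = list(sentence)[: max_len - 2]
    ids = [vocab["[CLS]"]] + [vocab.get(c, unk) for c in chars] + [vocab["[SEP]"]]
    # Pad short sentences with 0 up to the fixed length.
    ids += [0] * (max_len - len(ids))
    return ids
```

For instance, with `max_len=6` the two-character sentence `"ab"` becomes `[2, 4, 5, 3, 0, 0]` under this toy vocabulary.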

Second, the preprocessed sentences are input into the pre-trained model BERT to obtain a vector representation of each word in each sentence, and the resulting vector representations are input into the upper-layer LSTM + CRF model for training to obtain the semantic role recognition model.

After the semantic role recognition model is trained, the two sentences can be input into it to obtain the semantic role of each word in the two sentences.

Step 203, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences.

After the semantic role of each word in the two sentences is obtained, it can be judged whether the semantic roles in the two sentences are the same and whether the words corresponding to the same semantic role are identical or synonymous. If both conditions hold, the two sentences can be determined to be repeated sentences. For example, if sentence A is "default" and sentence B is "default" followed by a meaningless word, then sentence A has a core predicate REL and an Arg1 of "default", and sentence B, after the meaningless word is deleted, likewise has the core predicate REL and an Arg1 of "default"; the semantic roles of the two sentences and their corresponding words match completely, so the two sentences are repeated sentences.
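The role comparison of step 203 can be sketched as follows, assuming each sentence's roles are held as a mapping from role label to word and that synonyms are supplied as sets of interchangeable words. Both representations, and the names `same_or_synonym` and `roles_match`, are assumptions introduced for illustration; the text does not say how synonyms are looked up.

```python
from typing import Dict, Sequence, Set


def same_or_synonym(w1: str, w2: str, synonym_sets: Sequence[Set[str]]) -> bool:
    """True if the words are identical or appear together in a synonym set."""
    if w1 == w2:
        return True
    return any(w1 in group and w2 in group for group in synonym_sets)


def roles_match(
    roles_a: Dict[str, str],
    roles_b: Dict[str, str],
    synonym_sets: Sequence[Set[str]] = (),
) -> bool:
    """True if both sentences have exactly the same role labels and the
    words filling each shared role are identical or synonymous."""
    if roles_a.keys() != roles_b.keys():
        return False
    return all(same_or_synonym(roles_a[k], roles_b[k], synonym_sets)
               for k in roles_a)
```

As a usage example with illustrative values, `roles_match({"REL": "carry out", "Arg1": "manual repayment"}, {"REL": "perform", "Arg1": "manual repayment"}, [{"carry out", "perform"}])` is true, while the same call without the synonym set is false.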

If the semantic roles of the two sentences and their corresponding words are not completely consistent, that is, the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role differ, it can be determined whether the agent or the patient among the semantic roles of the two sentences is the same. If not, first processing is performed on the words corresponding to the agent or the patient, and whether the two sentences are repeated sentences is determined according to the semantic roles of the two sentences after the first processing. Otherwise, whether the two sentences are repeated sentences can be determined according to the semantic additional words among the semantic roles of the two sentences.

Specifically, the first processing and the subsequent determination may proceed as follows: the words corresponding to the agent or the patient in the two sentences are inverted and/or inherited; it is then judged whether, after the inversion and/or inheritance, the two sentences have the same semantic roles and the words corresponding to the same semantic role are identical or synonymous. If so, the two sentences are determined to be repeated sentences; otherwise, whether they are repeated sentences is determined according to the semantic additional words among their semantic roles.

For example, if sentence A is "you go to unbind the bank card" and sentence B is "I go to unbind, right?", then in sentence A the core predicate REL is "unbind", the agent Arg0 is "you", and the patient Arg1 is "bank card", while in sentence B the core predicate REL is "unbind", the agent Arg0 is "I", and there is no patient. The agent of sentence B is inverted to "you" and the patient is inherited from sentence A, after which the roles of the two sentences and the words corresponding to those roles are identical or synonymous, so the two sentences can be judged to be repeated sentences.
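The inversion and inheritance in the bank-card example can be sketched as below. The pronoun table `PRONOUN_SWAP` and the helper names `invert_agent` and `inherit_missing` are hypothetical; the text gives only the single example of turning "I" into "you" and copying the missing patient from the other sentence.

```python
from typing import Dict, Sequence

# Hypothetical table for switching speaker perspective.
PRONOUN_SWAP: Dict[str, str] = {"I": "you", "you": "I"}


def invert_agent(roles: Dict[str, str]) -> Dict[str, str]:
    """Return a copy in which a first/second-person agent (Arg0) is swapped."""
    out = dict(roles)
    if out.get("Arg0") in PRONOUN_SWAP:
        out["Arg0"] = PRONOUN_SWAP[out["Arg0"]]
    return out


def inherit_missing(
    roles: Dict[str, str],
    reference: Dict[str, str],
    keys: Sequence[str] = ("Arg0", "Arg1"),
) -> Dict[str, str]:
    """Return a copy in which an absent agent or patient is copied from
    the reference sentence."""
    out = dict(roles)
    for key in keys:
        if key not in out and key in reference:
            out[key] = reference[key]
    return out


# The example from the text, in the assumed mapping representation.
SENTENCE_A = {"REL": "unbind", "Arg0": "you", "Arg1": "bank card"}
SENTENCE_B = {"REL": "unbind", "Arg0": "I"}
PROCESSED_B = inherit_missing(invert_agent(SENTENCE_B), SENTENCE_A)
```

After this first processing, `PROCESSED_B` carries the same roles and words as `SENTENCE_A`, so the role comparison of step 203 would succeed on the pair.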

Determining whether the two sentences are repeated sentences according to the semantic additional words among their semantic roles may proceed as follows. It is determined whether the semantic additional word is an adverbial and a negative word; if so, the two sentences are determined to be repeated sentences when the number of negative words is even, and not to be repeated sentences when the number is odd. Otherwise, it is determined whether the semantic additional word is preset important information: if it is, a confirmation question is sent back to the user; if it is not, the two sentences are vectorized, and whether they are repeated sentences is determined according to the similarity of their vectors. The preset important information may be set according to experience, and may be, for example, time or place.

That is, if the missing component is an adverbial among the semantic additional words ArgM and the adverbial is a negative word, the number of missing negative words is determined; if the number is even, the sentences are repeated sentences, and otherwise they are not. If the missing component is important information such as time (ArgM-TMP) or place (ArgM-LOC), a confirmation question is asked: the question is sent to the user, and the user's reply determines whether the sentences are repeated sentences.
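The branching on semantic additional words can be sketched as follows. Several details are assumptions: the adverbial label `ArgM-ADV`, the negation lexicon `NEGATION_WORDS`, the mapping representation of the additional words, and the string outcomes returned by the hypothetical `decide_by_adjuncts`. Only the even/odd rule, the important-information check for time and place, and the fall-through to vector similarity come from the text.

```python
from typing import Dict, List, Tuple

NEGATION_WORDS = {"not", "never", "no"}      # hypothetical lexicon
ADVERBIAL = "ArgM-ADV"                       # assumed label for adverbials
IMPORTANT_ROLES = {"ArgM-TMP", "ArgM-LOC"}   # time and place, per the text


def differing_adjuncts(
    adj_a: Dict[str, str], adj_b: Dict[str, str]
) -> List[Tuple[str, str]]:
    """List (label, word) pairs for additional words that are missing from
    one sentence or differ between the two."""
    diff: List[Tuple[str, str]] = []
    for label in sorted(set(adj_a) | set(adj_b)):
        if adj_a.get(label) != adj_b.get(label):
            diff += [(label, w) for w in (adj_a.get(label), adj_b.get(label))
                     if w is not None]
    return diff


def decide_by_adjuncts(adj_a: Dict[str, str], adj_b: Dict[str, str]) -> str:
    """Return 'repeated', 'not_repeated', 'ask_user' or 'vector_similarity'."""
    diff = differing_adjuncts(adj_a, adj_b)
    negations = [tok for label, words in diff if label == ADVERBIAL
                 for tok in words.split() if tok in NEGATION_WORDS]
    if negations:
        # An even number of negative words cancels out; an odd number does not.
        return "repeated" if len(negations) % 2 == 0 else "not_repeated"
    if any(label in IMPORTANT_ROLES for label, _ in diff):
        return "ask_user"           # send a confirmation question to the user
    return "vector_similarity"      # fall back to comparing sentence vectors
```

In this sketch a single unmatched "not" yields `"not_repeated"`, whereas an unmatched time expression such as "now" yields `"ask_user"`.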

In the above embodiment, determining whether the two sentences are repeated sentences according to the similarity of their vectors mainly consists of vectorizing the two sentences through word vectors or a pre-trained model (BERT, XLNet, etc.) and calculating the similarity of the two vectors: if the similarity is greater than a threshold, the sentences are repeated sentences; otherwise, they are not.
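The threshold comparison can be sketched as below. The text does not name a similarity measure or a threshold value, so cosine similarity and the default of 0.9 are assumptions, as are the function names; the sentence vectors themselves would come from word vectors or a pre-trained model and are passed in here as plain lists of floats.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either
    vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_repeated_by_vectors(
    u: Sequence[float], v: Sequence[float], threshold: float = 0.9
) -> bool:
    """Repeated sentences if the similarity is strictly greater than the
    threshold (0.9 is an illustrative value, not from the text)."""
    return cosine_similarity(u, v) > threshold
```

In practice the threshold would be tuned on labeled dialogue pairs, since a value that is too low reintroduces the false positives that the role-based checks are meant to avoid.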

In the embodiment of the present invention, two sentences to be recognized are obtained; when the edit distance between the two sentences is determined not to be 0, the semantic role of each word in the two sentences is determined; and if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, the two sentences are determined to be repeated sentences. Because repeated sentences are required to have the same semantic roles, with identical or synonymous words in each role, repeated sentences are recognized through the consistency of semantic roles and their corresponding words, which can improve the accuracy of repeated-sentence recognition.

Based on the same technical concept, Fig. 3 exemplarily shows a schematic structural diagram of an apparatus for repeated-sentence recognition provided by an embodiment of the present invention, and the apparatus can perform the flow of repeated-sentence recognition.

As shown in fig. 3, the apparatus specifically includes:

an acquisition unit 301 configured to acquire two sentences to be recognized;

a processing unit 302 configured to determine the semantic role of each word in the two sentences when the edit distance between the two sentences is determined not to be 0, and, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, to determine that the two sentences are repeated sentences.

Optionally, the processing unit 302 is further configured to:

Optionally, the processing unit 302 is specifically configured to:

acquiring training samples labeled according to semantic role labeling;

preprocessing sentences in the training samples;

Based on the same technical concept, the present invention provides a computing device, comprising:

a memory for storing a computer program;

and the processor is used for calling the computer program stored in the memory and executing the method for recognizing the repeated sentences according to the obtained program.

In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer-executable program for causing a computer to execute the method for repeated-sentence recognition described above.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present application and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for repeated-sentence recognition, comprising:

acquiring two sentences to be recognized;

2. The method of claim 1, wherein the method further comprises:

3. The method of claim 2, wherein performing the first processing on the words corresponding to the agent or the patient, and determining whether the two sentences are repeated sentences according to the semantic roles of the two sentences after the first processing, comprises:

4. The method of claim 3, wherein determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences comprises:

5. The method of claim 4, wherein determining whether the two sentences are repeated sentences according to the similarity of the vectors of the two sentences comprises:

6. The method of any of claims 1 to 5, wherein the determining semantic role for each word in two sentences comprises:

7. The method of claim 6, wherein training a sequence labeling model using training samples labeled according to semantic role labeling to obtain the semantic role recognition model comprises:

acquiring training samples labeled according to semantic role labeling;

preprocessing sentences in the training samples;

8. An apparatus for repeated-sentence recognition, comprising:

an acquisition unit configured to acquire two sentences to be recognized;

9. The apparatus of claim 8, wherein the processing unit is further configured to:

10. The apparatus as claimed in claim 9, wherein said processing unit is specifically configured to:

11. The apparatus as claimed in claim 10, wherein said processing unit is specifically configured to:

12. The apparatus of claim 11, wherein the processing unit is specifically configured to:

13. The apparatus according to any one of claims 8 to 12, wherein the processing unit is specifically configured to:

14. The apparatus as claimed in claim 13, wherein said processing unit is specifically configured to:

acquiring training samples labeled according to semantic role labeling;

preprocessing sentences in the training samples;

15. A computing device, comprising:

a memory for storing a computer program;

a processor for calling a computer program stored in said memory, for executing the method of any one of claims 1 to 7 in accordance with the obtained program.

16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program for causing a computer to execute the method of any one of claims 1 to 7.