CN111738019A - Method and device for recognizing repeated sentences - Google Patents

Publication number
CN111738019A
Authority
CN
China
Prior art keywords
sentences
semantic
determining
words
same
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010591982.1A
Other languages
Chinese (zh)
Inventor
周楠楠
汤耀华
杨海军
徐倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010591982.1A priority Critical patent/CN111738019A/en
Publication of CN111738019A publication Critical patent/CN111738019A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/01 Customer relationship services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Evolutionary Computation (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for recognizing repeated sentences. The method comprises: obtaining two sentences to be recognized; when the edit distance between the two sentences is determined not to be 0, determining the semantic role of each word in the two sentences; and if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences. Because two sentences judged to be repeated sentences must have the same semantic roles, and the words corresponding to those semantic roles must be identical or synonymous, repeated sentences are recognized through the consistency of the sentences' semantic roles and the corresponding words, which can improve the accuracy of repeated sentence recognition.

Description

Method and device for recognizing repeated sentences
Technical Field
The invention relates to the field of financial technology (Fintech), and in particular to a method and a device for recognizing repeated sentences.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Because the financial industry demands security and real-time performance, it also places higher requirements on these technologies. In customer service in the financial field, recognizing repeated sentences is an important problem for intelligent voice customer service systems, and correctly recognizing and understanding repeated sentences can considerably improve the user experience.
In an intelligent voice customer service system, repeated sentence recognition is generally defined as determining whether the user's current input is a semantically correct restatement of the previous sentence output by the system. Existing technical solutions generally obtain initial vector representations of the two sentences through a word vector model, then obtain final vector representations of the two sentences through a CNN or RNN model, and finally determine whether the two sentences are similar by, for example, computing the similarity of the two vectors. However, the accuracy of the recognition result is not high, which affects the user experience.
In summary, a method for recognizing repeated sentences is needed to solve the problem of low recognition accuracy for repeated sentences in the prior art.
Disclosure of Invention
The invention provides a method and a device for recognizing repeated sentences, which can solve the problem of low recognition accuracy for repeated sentences in the prior art.
In a first aspect, the present invention provides a method for recognizing repeated sentences, including:
acquiring two sentences to be recognized;
when the edit distance between the two sentences is determined not to be 0, determining the semantic role of each word in the two sentences;
and if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences.
In the above technical solution, when the edit distance between the two sentences is determined not to be 0, the semantic roles of the two sentences are identified, and the two sentences can be determined to be repeated sentences by judging that their semantic roles are the same and that the words corresponding to the same semantic role are identical or synonymous. Because two sentences judged to be repeated sentences must have the same semantic roles, and the words corresponding to those semantic roles must be identical or synonymous, repeated sentences are recognized through the consistency of the sentences' semantic roles and the corresponding words, which can improve the accuracy of repeated sentence recognition.
Optionally, the method further includes:
if the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role differ, determining whether the agents (the doers of the action) or patients (the receivers of the action) among the semantic roles of the two sentences are the same; if not, performing first processing on the words corresponding to the agent or patient, and determining whether the two sentences are repeated sentences according to the semantic roles of the two sentences after the first processing; otherwise, determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences.
In the above technical solution, when the semantic roles differ or the words corresponding to the same semantic role differ, repeated sentence recognition is performed by analyzing the agent or patient among the semantic roles, which can further improve the accuracy of repeated sentence recognition.
Optionally, the performing first processing on the words corresponding to the agent or patient, and determining whether the two sentences are repeated sentences according to the semantic roles of the two sentences after the first processing, includes:
inverting and/or inheriting the words corresponding to the agent or patient in the two sentences;
and if the semantic roles of the two sentences after inversion and/or inheritance are the same and the words corresponding to the same semantic role are identical or synonymous, determining that the two sentences are repeated sentences; otherwise, determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences.
By inverting and/or inheriting the words corresponding to the agent or patient, the semantic roles of the two sentences can be made consistent after the inversion and/or inheritance, so that whether the two sentences are repeated sentences can be determined.
Optionally, the determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences includes:
determining whether the semantic additional words are adverbials and negation words; if so, determining that the two sentences are repeated sentences when the number of negation words is even, and determining that the two sentences are not repeated sentences when the number of negation words is odd;
otherwise, determining whether the semantic additional word is preset important information; if the semantic additional word is preset important information, asking the user a confirmation question; if the semantic additional word is not preset important information, vectorizing the two sentences and then determining whether the two sentences are repeated sentences according to the similarity of the vectors of the two sentences.
In the above technical solution, when whether the two sentences are repeated sentences cannot be determined from the agent or patient, the semantic additional words can be analyzed to make the determination, which can further improve the accuracy of repeated sentence recognition.
Optionally, the determining whether the two sentences are repeated sentences according to the similarity of the vectors of the two sentences includes:
if the similarity of the vectors of the two sentences is greater than a threshold, determining that the two sentences are repeated sentences; otherwise, determining that the two sentences are not repeated sentences.
Optionally, the determining the semantic role of each word in the two sentences includes:
inputting the two sentences into a semantic role recognition model, and determining the semantic role of each word in the two sentences, wherein the semantic role recognition model is obtained by training a sequence labeling model by using a training sample labeled according to semantic role labeling.
Optionally, the training a sequence labeling model by using a training sample labeled according to semantic role labeling to obtain the semantic role recognition model includes:
acquiring a training sample labeled according to semantic role labeling;
preprocessing sentences in the training samples;
inputting the preprocessed sentences into a pre-training model to obtain vector representation of each word in each sentence;
and inputting the vector representation into a sequence labeling model for training to obtain the semantic role recognition model.
In a second aspect, an embodiment of the present invention provides an apparatus for recognizing repeated sentences, including:
an acquisition unit configured to acquire two sentences to be recognized;
a processing unit configured to determine the semantic role of each word in the two sentences when the edit distance between the two sentences is determined not to be 0; and, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, determine that the two sentences are repeated sentences.
Optionally, the processing unit is further configured to:
if the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role differ, determine whether the agents or patients among the semantic roles of the two sentences are the same; if not, perform first processing on the words corresponding to the agent or patient, and determine whether the two sentences are repeated sentences according to the semantic roles of the two sentences after the first processing; otherwise, determine whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences.
Optionally, the processing unit is specifically configured to:
invert and/or inherit the words corresponding to the agent or patient in the two sentences;
and if the semantic roles of the two sentences after inversion and/or inheritance are the same and the words corresponding to the same semantic role are identical or synonymous, determine that the two sentences are repeated sentences; otherwise, determine whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences.
Optionally, the processing unit is specifically configured to:
determine whether the semantic additional words are adverbials and negation words; if so, determine that the two sentences are repeated sentences when the number of negation words is even, and determine that the two sentences are not repeated sentences when the number of negation words is odd;
otherwise, determine whether the semantic additional word is preset important information; if the semantic additional word is preset important information, ask the user a confirmation question; if the semantic additional word is not preset important information, vectorize the two sentences and then determine whether the two sentences are repeated sentences according to the similarity of the vectors of the two sentences.
Optionally, the processing unit is specifically configured to:
if the similarity of the vectors of the two sentences is greater than a threshold, determine that the two sentences are repeated sentences; otherwise, determine that the two sentences are not repeated sentences.
Optionally, the processing unit is specifically configured to:
inputting the two sentences into a semantic role recognition model, and determining the semantic role of each word in the two sentences, wherein the semantic role recognition model is obtained by training a sequence labeling model by using a training sample labeled according to semantic role labeling.
Optionally, the processing unit is specifically configured to:
acquiring a training sample labeled according to semantic role labeling;
preprocessing sentences in the training samples;
inputting the preprocessed sentences into a pre-training model to obtain vector representation of each word in each sentence;
and inputting the vector representation into a sequence labeling model for training to obtain the semantic role recognition model.
In a third aspect, the invention provides a computing device comprising:
a memory for storing a computer program;
a processor for calling the computer program stored in the memory and executing the method according to the first aspect according to the obtained program.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer-executable program for causing a computer to perform the method of the first aspect.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for recognizing repeated sentences according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an apparatus for recognizing repeated sentences according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a system architecture provided in an embodiment of the present invention. As shown in Fig. 1, the system architecture may be a server 100 including a processor 110, a communication interface 120, and a memory 130.
The communication interface 120 is used for communicating with the customer service terminal device, receiving and sending the information exchanged with the customer service terminal device.
The processor 110 is the control center of the server 100. It connects the various parts of the entire server 100 using various interfaces and lines, and performs the various functions of the server 100 and processes data by running or executing software programs and/or modules stored in the memory 130 and calling data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.
The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by operating the software programs and modules stored in the memory 130. The memory 130 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to a business process, and the like. Further, the memory 130 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
It should be noted that the structure shown in fig. 1 is only an example, and the embodiment of the present invention is not limited thereto.
Based on the above description, Fig. 2 exemplarily shows a flow of a method for recognizing repeated sentences, which can be performed by an apparatus for recognizing repeated sentences.
As shown in Fig. 2, the specific steps of the flow include:
Step 201, two sentences to be recognized are obtained.
In the embodiment of the present invention, the two sentences to be recognized may be two sentences in the dialogue data between customer service and a user, for example, one sentence input by customer service and one sentence input by the user. Generally, either customer service repeats the user's sentence or the user repeats customer service's sentence. For example, sentence A is "make a manual repayment" and sentence B is "make a manual repayment, right?".
Step 202, when the edit distance between the two sentences is determined not to be 0, the semantic role of each word in the two sentences is determined.
After the two sentences are obtained in step 201, meaningless words in the two sentences need to be removed, such as modal particles, "you are", "you mean", and "right?". The edit distance between the two sentences is then determined. Edit distance generally refers to the minimum number of editing operations required to transform one string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. The fewer editing operations are required, the closer the two strings are.
It should be noted that when the edit distance between the two sentences is 0, the two sentences are identical and can be directly determined to be repeated sentences. If the edit distance is not 0, the semantic role of each word in the two sentences needs to be determined by inputting the two sentences into the semantic role recognition model.
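The filler removal and edit-distance check in step 202 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the filler list `FILLERS`, the helper names, and the English example strings are assumptions made for the sketch.

```python
# Hypothetical filler phrases; a real system would use its own list.
FILLERS = ("you mean", "right?")


def strip_fillers(sentence, fillers=FILLERS):
    """Remove filler phrases, then trim leftover spaces and commas."""
    for filler in fillers:
        sentence = sentence.replace(filler, "")
    return sentence.strip(" ,")


def edit_distance(a, b):
    """Levenshtein distance: unit-cost substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def identical_after_cleaning(sentence_a, sentence_b):
    """True when the cleaned sentences have edit distance 0."""
    return edit_distance(strip_fillers(sentence_a), strip_fillers(sentence_b)) == 0
```

When `identical_after_cleaning` returns True the pair can be accepted immediately; otherwise the flow continues to semantic role recognition.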
Training a sequence labeling model with training samples labeled according to semantic role labeling to obtain the semantic role recognition model specifically includes:
First, training samples labeled according to semantic role labeling are obtained. The sentences in the training samples are then preprocessed, and the preprocessed sentences are input into a pre-training model to obtain a vector representation of each word in each sentence. Finally, the vector representations are input into a sequence labeling model for training to obtain the semantic role recognition model.
In the embodiment of the present invention, the task of SRL (Semantic Role Labeling) is to take the predicate of a sentence as the center, study the relationship between each component of the sentence and the predicate, and describe that relationship with semantic roles, that is, to determine the role each argument plays with respect to the (core) predicate in the sentence. SRL generally divides the components of a sentence into three categories: the predicate (REL), which is generally a verb or an adjective; core arguments (ArgN, N ∈ {0,1,2,3,4,5}); and semantic additional words (ArgM-x). Core arguments are arguments directly related to the predicate, such as the agent of the predicate (Arg0) and the patient of the predicate (Arg1). Semantic additional words are arguments that are not directly related to the predicate and can exist independently, such as time (ArgM-TMP), place (ArgM-LOC), purpose (ArgM-PRP), degree (ArgM-DGR), and scope (ArgM-EXT). For example, for the sentence "you can search for the public number now", SRL can determine that the predicate is "search", the agent is "you", the patient is "public number", and the time is "now".
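One way to hold the output of semantic role labeling for a single sentence is a small record with the predicate, the core arguments, and the semantic additional words. The `RoleFrame` class and its field names below are hypothetical and only illustrate the three categories described above, using the search example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RoleFrame:
    """Semantic roles of one sentence, centered on its core predicate."""
    rel: str                                                # core predicate (REL)
    args: Dict[str, str] = field(default_factory=dict)      # core arguments (ArgN)
    adjuncts: Dict[str, str] = field(default_factory=dict)  # semantic additional words (ArgM-x)

    def as_dict(self) -> Dict[str, str]:
        """Flatten to a single {role label: words} mapping."""
        return {"REL": self.rel, **self.args, **self.adjuncts}


# The search example: predicate, agent (Arg0), patient (Arg1), and time (ArgM-TMP).
frame = RoleFrame(
    rel="search",
    args={"Arg0": "you", "Arg1": "public number"},
    adjuncts={"ArgM-TMP": "now"},
)
```

Flattening with `as_dict()` gives a plain mapping from role label to words, which is convenient for comparing two sentences role by role.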
In the specific training process, data first needs to be collected and labeled according to the SRL labeling standard to obtain a training sample D1. A semantic role recognition model is then trained on the training sample D1. The model may be a sequence labeling model based on BERT (Bidirectional Encoder Representations from Transformers, a pre-training model) + LSTM (Long Short-Term Memory network) + CRF (Conditional Random Field), and the training process is as follows:
First, the data in the training sample D1 is preprocessed: the sentences are segmented at the character level and converted into ID form, a [CLS] label is placed at the beginning of each sentence and a [SEP] label at the end, and the corpus is brought to a fixed length by padding sentences that are too short with "0" and truncating sentences that exceed the fixed length.
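The preprocessing step can be sketched in plain Python. The toy vocabulary, the special-token IDs other than the padding value 0, and the choice to keep [SEP] as the last token when truncating are assumptions for illustration; an actual system would use the tokenizer and vocabulary that ship with its BERT model.

```python
PAD_ID = 0  # padding value "0", as described in the text


def build_vocab(sentences):
    """Hypothetical character vocabulary; ID 0 is reserved for padding."""
    vocab = {"[PAD]": PAD_ID, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3}
    for sentence in sentences:
        for ch in sentence:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Split into characters, add [CLS]/[SEP], truncate, and pad to max_len."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    chars = list(sentence)[: max_len - 2]   # truncate, keeping room for both labels
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(ch, vocab["[UNK]"]) for ch in chars]
    ids.append(vocab["[SEP]"])
    return ids + [PAD_ID] * (max_len - len(ids))   # pad short sentences with 0
```

Every encoded sentence has exactly `max_len` IDs, so a batch can be stacked into one fixed-size array before it is passed to the pre-training model.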
Secondly, inputting the preprocessed sentences into a pre-training model BERT to obtain vector representation of each word in the sentences, and then inputting the obtained vector representation into an upper LSTM + CRF model for training to obtain a semantic role recognition model.
After the semantic role recognition model is trained, the two sentences can be input into it to obtain the semantic role of each word in the two sentences.
Step 203, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonymous, the two sentences are determined to be repeated sentences.
After the semantic role of each word in the two sentences is obtained, it can be judged whether the semantic roles in the two sentences are the same and whether the words corresponding to the same semantic role are identical or synonymous. When both conditions hold, the two sentences can be determined to be repeated sentences. That is, if the semantic roles of the two sentences match completely and the words corresponding to those semantic roles match completely or are synonymous, the two sentences are repeated sentences. For example, if sentence A is "default" and sentence B is "default" followed by a meaningless word, then sentence A has a core predicate REL and a patient Arg1 "default", and sentence B, after the meaningless word is deleted, has the same core predicate REL and the same patient Arg1 "default". The semantic roles of the two sentences and the corresponding words match completely, so the two sentences are repeated sentences.
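The comparison in step 203 can be sketched as a check over two role-to-word mappings. The mapping layout, the synonym table represented as a set of word pairs, and the example words are assumptions for this sketch; the text does not specify how synonyms are looked up.

```python
def same_or_synonym(word_a, word_b, synonyms):
    """True if the words are identical or listed as a synonym pair."""
    return word_a == word_b or frozenset((word_a, word_b)) in synonyms


def roles_match(roles_a, roles_b, synonyms=frozenset()):
    """True if both sentences have the same roles with matching words.

    roles_a / roles_b map a role label (e.g. "REL", "Arg1") to its words.
    """
    if roles_a.keys() != roles_b.keys():
        return False                      # the role sets themselves differ
    return all(same_or_synonym(roles_a[r], roles_b[r], synonyms) for r in roles_a)


# Hypothetical synonym table: each entry is an unordered pair of words.
SYNONYMS = {frozenset(("repay", "pay back"))}

a = {"REL": "repay", "Arg1": "loan"}
b = {"REL": "pay back", "Arg1": "loan"}
```

With these inputs, `roles_match(a, b, SYNONYMS)` is true, while the same call without the synonym table is false because the predicates differ.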
If the semantic roles of the two sentences and the corresponding words are not completely consistent, that is, the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role differ, it can be determined whether the agents or patients among the semantic roles of the two sentences are the same. If not, first processing is performed on the words corresponding to the agent or patient, and whether the two sentences are repeated sentences is determined according to the semantic roles of the two sentences after the first processing. Otherwise, whether the two sentences are repeated sentences can be determined according to the semantic additional words among the semantic roles of the two sentences.
Performing the first processing on the words corresponding to the agent or patient, and determining whether the two sentences are repeated sentences according to the semantic roles after the first processing, may specifically be as follows: first, the words corresponding to the agent or patient in the two sentences are inverted and/or inherited; it is then judged whether, after the inversion and/or inheritance, the two sentences have the same semantic roles and the words corresponding to the same semantic role are identical or synonymous. If so, the two sentences are determined to be repeated sentences; otherwise, whether the two sentences are repeated sentences is determined according to the semantic additional words among the semantic roles of the two sentences.
That is, the agent or patient is inverted or inherited. If, after the inversion and inheritance, the semantic roles are completely consistent and the words corresponding to the semantic roles are completely consistent or synonymous, the two sentences are repeated sentences; otherwise, whether the two sentences are repeated sentences can be determined according to the semantic additional words among the semantic roles of the two sentences. For example, suppose sentence A is "you go to unbind the bank card" and sentence B is "I go to unbind it, right?". The core predicate REL of sentence A is "unbind", its agent Arg0 is "you", and its patient Arg1 is "bank card". The core predicate REL of sentence B is "unbind", its agent Arg0 is "I", and it has no patient. The agent of sentence B is inverted to "you" and the patient is inherited from sentence A, after which the roles of the two sentences and the words corresponding to those roles are consistent or synonymous, so the two sentences can be judged to be repeated sentences.
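The inversion and inheritance steps can be sketched on role-to-word mappings. The pronoun table and the function names are hypothetical; the text only states that the agent is inverted (for example "I" becomes "you") and that a missing patient is inherited from the other sentence.

```python
# Hypothetical first/second-person swap used for agent inversion.
PRONOUN_SWAP = {"I": "you", "you": "I"}


def invert_agent(roles):
    """Return a copy with the agent (Arg0) switched between speaker views."""
    out = dict(roles)
    if out.get("Arg0") in PRONOUN_SWAP:
        out["Arg0"] = PRONOUN_SWAP[out["Arg0"]]
    return out


def inherit_missing(roles, source, keys=("Arg0", "Arg1")):
    """Return a copy where a missing agent/patient is copied from `source`."""
    out = dict(roles)
    for key in keys:
        if key not in out and key in source:
            out[key] = source[key]
    return out


# The unbind example: sentence B has agent "I" and no patient.
sentence_a = {"REL": "unbind", "Arg0": "you", "Arg1": "bank card"}
sentence_b = {"REL": "unbind", "Arg0": "I"}
processed_b = inherit_missing(invert_agent(sentence_b), sentence_a)
```

After this processing, `processed_b` has the same roles and words as `sentence_a`, which is the condition for judging the pair to be repeated sentences.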
Determining whether the two sentences are repeated sentences according to the semantic additional words among the semantic roles of the two sentences may include: determining whether the semantic additional words are adverbials and negation words; if so, the two sentences are determined to be repeated sentences when the number of negation words is even, and are determined not to be repeated sentences when the number of negation words is odd. Otherwise, it is determined whether the semantic additional word is preset important information. If it is, a confirmation question is put to the user; if it is not, the two sentences are vectorized, and whether they are repeated sentences is determined according to the similarity of their vectors. The preset important information may be set according to experience, and may be important information such as time and place.
That is, if the missing component is an adverbial among the semantic additional words ArgM and the adverbial is a negation word, the number of missing negation words is determined; if the number is even, the two sentences are repeated sentences, otherwise they are not. If the missing component is important information such as time (ArgM-TMP) or place (ArgM-LOC), a confirmation question is asked: the question is sent to the user, and whether the two sentences are repeated sentences is determined from the user's reply.
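The branching on semantic additional words can be sketched as a small decision function. The negation list, the `ArgM-ADV` label for adverbials, and the returned outcome strings are assumptions for illustration; only the time and place labels (ArgM-TMP, ArgM-LOC) come from the text.

```python
NEGATIONS = {"not", "no", "never"}            # hypothetical negation words
ADVERBIAL_ROLE = "ArgM-ADV"                   # assumed label for adverbials
IMPORTANT_ROLES = {"ArgM-TMP", "ArgM-LOC"}    # time and place, per the text


def judge_by_additional_word(role, tokens):
    """Decide the next action from a differing semantic additional word.

    role:   label of the additional word that is missing or different
    tokens: the words in that component
    Returns "repeated", "not_repeated", "ask_user", or "fallback_similarity".
    """
    negation_count = sum(1 for token in tokens if token in NEGATIONS)
    if role == ADVERBIAL_ROLE and negation_count > 0:
        # An even number of negations cancels out; an odd number flips the meaning.
        return "repeated" if negation_count % 2 == 0 else "not_repeated"
    if role in IMPORTANT_ROLES:
        return "ask_user"                 # send a confirmation question
    return "fallback_similarity"          # fall back to vector similarity
```

The "ask_user" and "fallback_similarity" outcomes hand control to the confirmation dialogue and the vector comparison respectively, so this function only selects the branch.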
In the above embodiment, whether the two sentences are restatement sentences is determined from the similarity of their vectors mainly by vectorizing the two sentences with word vectors or a pre-trained model (BERT, XLNet, etc.) and calculating the similarity of the two vectors: if the similarity is greater than a threshold, the sentences are restatement sentences; otherwise they are not.
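As a rough illustration of the threshold test, the sketch below compares two sentence vectors that are assumed to have been produced already by some encoder. The text does not name the similarity measure; cosine similarity is used here as an assumption, and the default threshold of 0.9 is a placeholder rather than a value from the patent.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_restatement_by_vectors(u: Sequence[float], v: Sequence[float],
                              threshold: float = 0.9) -> bool:
    """Restatement if the similarity is strictly greater than the threshold."""
    return cosine_similarity(u, v) > threshold


print(is_restatement_by_vectors([0.2, 0.8, 0.1], [0.25, 0.75, 0.1]))
```

In practice the threshold would be tuned on labeled sentence pairs for whichever encoder is used.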
In this embodiment of the invention, two sentences to be recognized are acquired; when the edit distance of the two sentences is determined not to be 0, the semantic role of each word in the two sentences is determined; and if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonyms, the two sentences are determined to be restatement sentences. Because two sentences that are restatements of each other must have the same semantic roles, and the words corresponding to those roles must be identical or synonyms, recognizing restatement sentences by the consistency of semantic roles and their corresponding words can improve recognition accuracy.
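The overall flow summarized above (edit-distance check first, then role comparison) might be sketched as follows. The edit distance is the standard Levenshtein distance computed by dynamic programming. The role labeler and the synonym test are passed in as callables because the patent's trained model is not reproduced here; treating an edit distance of 0 as an immediate restatement is an assumption of this sketch.

```python
from typing import Callable, Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_restatement(a: str, b: str,
                   label_roles: Callable[[str], Dict[str, str]],
                   same_or_synonym: Callable[[str, str], bool]) -> bool:
    """First-stage check: identical sentences, or matching semantic roles."""
    if edit_distance(a, b) == 0:
        return True  # assumption: identical sentences need no role labeling
    roles_a, roles_b = label_roles(a), label_roles(b)
    return (roles_a.keys() == roles_b.keys()
            and all(same_or_synonym(roles_a[r], roles_b[r]) for r in roles_a))


print(edit_distance("kitten", "sitting"))
```

A `False` result from this first stage does not end the procedure: the description continues with inversion and inheritance and the checks on semantic additional words.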
Based on the same technical concept, Fig. 3 exemplarily shows a schematic structural diagram of an apparatus for restatement sentence recognition provided by an embodiment of the present invention; the apparatus can perform the flow of restatement sentence recognition.
As shown in Fig. 3, the apparatus specifically includes:
an acquisition unit 301 configured to acquire two sentences to be recognized;
a processing unit 302, configured to determine the semantic role of each word in the two sentences when it is determined that the edit distance of the two sentences is not 0; and, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonyms, determine that the two sentences are restatement sentences.
Optionally, the processing unit 302 is further configured to:
if the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role are different, determining whether the agents or the patients in the semantic roles of the two sentences are the same; if not, performing first processing on the words corresponding to the agent or the patient and determining whether the two sentences are restatement sentences according to the semantic roles of the two sentences after the first processing; otherwise, determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences.
Optionally, the processing unit 302 is specifically configured to:
inverting and/or inheriting the words corresponding to the agent or the patient in the two sentences;
and if the semantic roles of the two sentences after inversion and/or inheritance are the same and the words corresponding to the same semantic role are identical or synonyms, determining that the two sentences are restatement sentences; otherwise, determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences.
Optionally, the processing unit 302 is specifically configured to:
determining whether the semantic additional words are adverbials and negation words; if so, determining that the two sentences are restatement sentences when the number of negation words is even, and determining that the two sentences are not restatement sentences when the number of negation words is odd;
otherwise, determining whether the semantic additional word is preset important information; if the semantic additional word is the preset important information, sending a confirmation question to the user; if the semantic additional word is not the preset important information, vectorizing the two sentences and then determining whether the two sentences are restatement sentences according to the similarity of the vectors of the two sentences.
Optionally, the processing unit 302 is specifically configured to:
if the similarity of the vectors of the two sentences is greater than a threshold, determining that the two sentences are restatement sentences; otherwise, determining that the two sentences are not restatement sentences.
Optionally, the processing unit 302 is specifically configured to:
inputting the two sentences into a semantic role recognition model, and determining the semantic role of each word in the two sentences, wherein the semantic role recognition model is obtained by training a sequence labeling model by using a training sample labeled according to semantic role labeling.
Optionally, the processing unit 302 is specifically configured to:
acquiring training samples labeled according to semantic role labeling;
preprocessing sentences in the training samples;
inputting the preprocessed sentences into a pre-training model to obtain vector representation of each word in each sentence;
and inputting the vector representation into a sequence labeling model for training to obtain the semantic role recognition model.
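To make the notion of a training sample for a sequence labeling model concrete, the sketch below converts span-level role annotations into one tag per token. The BIO tagging scheme, the `Span` tuple layout, and the function name are assumptions for illustration; this passage does not state which tagging scheme the samples use.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, role label); hypothetical layout.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Turn role spans into per-token BIO tags suitable for sequence labeling."""
    tags = ["O"] * len(tokens)
    for start, end, role in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span {(start, end)} is out of range")
        tags[start] = f"B-{role}"       # first token of the role span
        for i in range(start + 1, end):
            tags[i] = f"I-{role}"       # continuation tokens of the same span
    return tags


tokens = ["you", "unbind", "bank", "card"]
spans = [(0, 1, "Arg0"), (1, 2, "REL"), (2, 4, "Arg1")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```

Each (token, tag) sequence would then be paired with the per-word vectors from the pre-trained model and fed to the sequence labeling model during training.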
Based on the same technical concept, the present invention provides a computing device, comprising:
a memory for storing a computer program;
and a processor for calling the computer program stored in the memory and executing the above method of restatement sentence recognition according to the obtained program.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer-executable program for causing a computer to execute the method of restatement sentence recognition described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present application and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (16)

1. A method of restatement sentence recognition, comprising:
acquiring two sentences to be recognized;
when the edit distance of the two sentences is determined not to be 0, determining the semantic role of each word in the two sentences;
and if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonyms, determining that the two sentences are restatement sentences.
2. The method of claim 1, wherein the method further comprises:
if the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role are different, determining whether the agents or the patients in the semantic roles of the two sentences are the same; if not, performing first processing on the words corresponding to the agent or the patient and determining whether the two sentences are restatement sentences according to the semantic roles of the two sentences after the first processing; otherwise, determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences.
3. The method of claim 2, wherein the performing first processing on the words corresponding to the agent or the patient and determining whether the two sentences are restatement sentences according to the semantic roles of the two sentences after the first processing comprises:
inverting and/or inheriting the words corresponding to the agent or the patient in the two sentences;
and if the semantic roles of the two sentences after inversion and/or inheritance are the same and the words corresponding to the same semantic role are identical or synonyms, determining that the two sentences are restatement sentences; otherwise, determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences.
4. The method of claim 3, wherein the determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences comprises:
determining whether the semantic additional words are adverbials and negation words; if so, determining that the two sentences are restatement sentences when the number of negation words is even, and determining that the two sentences are not restatement sentences when the number of negation words is odd;
otherwise, determining whether the semantic additional word is preset important information; if the semantic additional word is the preset important information, sending a confirmation question to the user; if the semantic additional word is not the preset important information, vectorizing the two sentences and then determining whether the two sentences are restatement sentences according to the similarity of the vectors of the two sentences.
5. The method of claim 4, wherein the determining whether the two sentences are restatement sentences according to the similarity of the vectors of the two sentences comprises:
if the similarity of the vectors of the two sentences is greater than a threshold, determining that the two sentences are restatement sentences; otherwise, determining that the two sentences are not restatement sentences.
6. The method of any of claims 1 to 5, wherein the determining semantic role for each word in two sentences comprises:
inputting the two sentences into a semantic role recognition model, and determining the semantic role of each word in the two sentences, wherein the semantic role recognition model is obtained by training a sequence labeling model by using a training sample labeled according to semantic role labeling.
7. The method of claim 6, wherein the training a sequence labeling model using training samples labeled according to semantic role labeling to obtain the semantic role recognition model comprises:
acquiring training samples labeled according to semantic role labeling;
preprocessing sentences in the training samples;
inputting the preprocessed sentences into a pre-training model to obtain vector representation of each word in each sentence;
and inputting the vector representation into a sequence labeling model for training to obtain the semantic role recognition model.
8. An apparatus for restatement sentence recognition, comprising:
an acquisition unit configured to acquire two sentences to be recognized;
a processing unit configured to determine the semantic role of each word in the two sentences when the edit distance of the two sentences is determined not to be 0; and, if the semantic roles in the two sentences are the same and the words corresponding to the same semantic role are identical or synonyms, determine that the two sentences are restatement sentences.
9. The apparatus of claim 8, wherein the processing unit is further configured to:
if the semantic roles in the two sentences are not completely the same and/or the words corresponding to the same semantic role are different, determining whether the agents or the patients in the semantic roles of the two sentences are the same; if not, performing first processing on the words corresponding to the agent or the patient and determining whether the two sentences are restatement sentences according to the semantic roles of the two sentences after the first processing; otherwise, determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences.
10. The apparatus as claimed in claim 9, wherein said processing unit is specifically configured to:
inverting and/or inheriting the words corresponding to the agent or the patient in the two sentences;
and if the semantic roles of the two sentences after inversion and/or inheritance are the same and the words corresponding to the same semantic role are identical or synonyms, determining that the two sentences are restatement sentences; otherwise, determining whether the two sentences are restatement sentences according to the semantic additional words in the semantic roles of the two sentences.
11. The apparatus as claimed in claim 10, wherein said processing unit is specifically configured to:
determining whether the semantic additional words are adverbials and negation words; if so, determining that the two sentences are restatement sentences when the number of negation words is even, and determining that the two sentences are not restatement sentences when the number of negation words is odd;
otherwise, determining whether the semantic additional word is preset important information; if the semantic additional word is the preset important information, sending a confirmation question to the user; if the semantic additional word is not the preset important information, vectorizing the two sentences and then determining whether the two sentences are restatement sentences according to the similarity of the vectors of the two sentences.
12. The apparatus of claim 11, wherein the processing unit is specifically configured to:
if the similarity of the vectors of the two sentences is greater than a threshold, determining that the two sentences are restatement sentences; otherwise, determining that the two sentences are not restatement sentences.
13. The apparatus according to any one of claims 8 to 12, wherein the processing unit is specifically configured to:
inputting the two sentences into a semantic role recognition model, and determining the semantic role of each word in the two sentences, wherein the semantic role recognition model is obtained by training a sequence labeling model by using a training sample labeled according to semantic role labeling.
14. The apparatus as claimed in claim 13, wherein said processing unit is specifically configured to:
acquiring training samples labeled according to semantic role labeling;
preprocessing sentences in the training samples;
inputting the preprocessed sentences into a pre-training model to obtain vector representation of each word in each sentence;
and inputting the vector representation into a sequence labeling model for training to obtain the semantic role recognition model.
15. A computing device, comprising:
a memory for storing a computer program;
a processor for calling a computer program stored in said memory, for executing the method of any one of claims 1 to 7 in accordance with the obtained program.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program for causing a computer to execute the method of any one of claims 1 to 7.
CN202010591982.1A 2020-06-24 2020-06-24 Method and device for recognizing repeated sentences Pending CN111738019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010591982.1A CN111738019A (en) 2020-06-24 2020-06-24 Method and device for recognizing repeated sentences

Publications (1)

Publication Number Publication Date
CN111738019A true CN111738019A (en) 2020-10-02

Family

ID=72651135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010591982.1A Pending CN111738019A (en) 2020-06-24 2020-06-24 Method and device for recognizing repeated sentences

Country Status (1)

Country Link
CN (1) CN111738019A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080319735A1 (en) * 2007-06-22 2008-12-25 International Business Machines Corporation Systems and methods for automatic semantic role labeling of high morphological text for natural language processing applications
CN111046656A (en) * 2019-11-15 2020-04-21 北京三快在线科技有限公司 Text processing method and device, electronic equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
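吴晓锋; 宗成庆: "A paraphrase sentence recognition method for the news domain based on semantic role labeling", Journal of Chinese Information Processing (中文信息学报), no. 05 *
赵世奇; 刘挺; 李生: "Research on paraphrasing technology", Journal of Software (软件学报), no. 08 *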

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination