CN114417891A - Reply sentence determination method and device based on rough semantics and electronic equipment - Google Patents


Info

Publication number
CN114417891A
CN114417891A (application number CN202210083351.8A)
Authority
CN
China
Prior art keywords
word
voice information
reply
feature vector
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210083351.8A
Other languages
Chinese (zh)
Other versions
CN114417891B (en)
Inventor
舒畅 (Shu Chang)
陈又新 (Chen Youxin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210083351.8A priority Critical patent/CN114417891B/en
Priority to PCT/CN2022/090129 priority patent/WO2023137903A1/en
Publication of CN114417891A publication Critical patent/CN114417891A/en
Application granted granted Critical
Publication of CN114417891B publication Critical patent/CN114417891B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06F 40/35: Discourse or dialogue representation
    • G06F 40/20: Natural language analysis
    • G06F 40/268: Morphological analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The application discloses a reply sentence determination method and apparatus based on rough semantics, and an electronic device. The method comprises the following steps: acquiring the previous round of voice information adjacent to the user's voice information at the current moment according to the occurrence time of that voice information; performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic features corresponding to the voice information; performing word segmentation on the voice information to obtain a keyword group; performing multiple hidden-feature extractions on the keyword group to obtain an initial hidden-layer state feature vector; performing multiple reply-word generation passes according to the rough semantic features and the initial hidden-layer state feature vector to obtain at least one reply word; and splicing the at least one reply word according to the generation order of each reply word to obtain the reply sentence for the voice information.

Description

Reply sentence determination method and device based on rough semantics and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a reply sentence determination method and device based on rough semantics and electronic equipment.
Background
At present, a conventional dialog model usually encodes the previous dialog text, uses the hidden-layer state features of the encoded information as one of the inputs to a decoder, and the decoder then generates the dialog reply automatically in time-step order. Because the hidden-layer state features encoded from the previous dialog text serve as one of the bases for generating the reply sentence of the current dialog, the reply sentence generation process incorporates the information features of the previous dialog.
However, in the conventional scheme, so that the model can construct a reply sentence around the key information of the dialog, feature extraction focuses on the key information of the previous dialog; in the actual extraction process only the key information is extracted as features, and the rough information in the dialog is often discarded. Yet in some texts this rough information reflects the real focus of the conversation, so the generated reply sentence is not accurate enough.
Disclosure of Invention
In order to solve the above problems in the prior art, embodiments of the present application provide a reply sentence determination method and apparatus based on rough semantics, and an electronic device, which can extract the key information and the rough information of the previous round of conversation simultaneously, so that the generated reply sentence is more accurate.
In a first aspect, an embodiment of the present application provides a reply sentence determination method based on rough semantics, including:
acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the user's voice information at the current moment, wherein the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;
performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information;
performing word segmentation processing on the voice information to obtain a key phrase;
carrying out multiple hidden feature extraction processing on the key phrase to obtain an initial hidden layer state feature vector;
performing multiple reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word;
and splicing the obtained at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In a second aspect, an embodiment of the present application provides a reply sentence determination apparatus based on rough semantics, including:
the acquisition module is used for acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the user's voice information at the current moment, wherein the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;
the processing module is used for performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, performing word segmentation processing on the voice information to obtain a key word group, and performing multiple hidden feature extraction processing on the key word group to obtain an initial hidden layer state feature vector;
and the generating module is used for performing multiple reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word, and splicing the obtained at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, wherein the memory is used for storing a computer program and the processor is coupled to the memory and used for executing the computer program stored in the memory, so as to cause the electronic device to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, the computer program causing a computer to perform the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The implementation of the embodiment of the application has the following beneficial effects:
in the embodiments of the application, the previous round of voice information of the user at the current moment is acquired, and rough semantic extraction is performed on it to obtain semantic features that contain the high-level abstract information of the previous round of voice information; these features serve as the rough semantic features of the user's voice information at the current moment, so that the key information and the rough information of the previous round of voice information are extracted synchronously. Then, word segmentation is performed on the user's voice information at the current moment, and multiple hidden-feature extractions are performed on the resulting keywords to obtain the initial hidden-layer state feature vector of that voice information. Finally, multiple reply-word generation passes are performed according to the rough semantic features and the initial hidden-layer state feature vector, and the obtained at least one reply word is spliced according to the generation order of each reply word to obtain the reply sentence of the voice information. Because the rough semantic features, which contain both the key information and the rough information of the previous round of conversation, serve as one of the bases for generating the reply sentence of the current round, the generation process incorporates more comprehensive information features of the previous round of conversation. The generated reply sentence is therefore more accurate, matches the subject of the conversation better, and improves user experience.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a schematic hardware configuration diagram of a reply statement determination apparatus based on rough semantics according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a reply statement determination method based on rough semantics according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a method for performing coarse semantic extraction on voice information according to previous round of voice information to obtain coarse semantic features corresponding to the voice information according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a gated cyclic unit encoder according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a multilayer perceptron according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a method for inputting at least one coarse context information and at least one first hidden layer state feature vector into a coarse decoder for performing a plurality of decoding processes to obtain coarse semantic features of speech information according to an embodiment of the present application;
fig. 7 is a block flow diagram of a reply word generation process according to an embodiment of the present application;
fig. 8 is a block diagram illustrating functional modules of a reply sentence determination apparatus based on rough semantics according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
First, referring to fig. 1, fig. 1 is a schematic diagram of a hardware structure of a reply sentence determination device based on a rough semantic meaning according to an embodiment of the present disclosure. The reply sentence determination apparatus 100 based on the rough semantics includes at least one processor 101, a communication line 102, a memory 103, and at least one communication interface 104.
In this embodiment, the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present disclosure.
The communication link 102, which may include a path, carries information between the aforementioned components.
The communication interface 104 may be any transceiver or other device (e.g., an antenna, etc.) for communicating with other devices or communication networks, such as an ethernet, RAN, Wireless Local Area Network (WLAN), etc.
The memory 103 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In this embodiment, the memory 103 may be independent and connected to the processor 101 through the communication line 102. The memory 103 may also be integrated with the processor 101. The memory 103 provided in the embodiments of the present application may generally have a nonvolatile property. The memory 103 is used for storing computer-executable instructions for executing the scheme of the application, and is controlled by the processor 101 to execute. The processor 101 is configured to execute computer-executable instructions stored in the memory 103, thereby implementing the methods provided in the embodiments of the present application described below.
In alternative embodiments, computer-executable instructions may also be referred to as application code, which is not specifically limited in this application.
In alternative embodiments, processor 101 may include one or more CPUs, such as CPU0 and CPU1 of FIG. 1.
In an alternative embodiment, the reply sentence determination apparatus 100 based on the coarse semantics may include a plurality of processors, such as the processor 101 and the processor 107 in fig. 1. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In an alternative embodiment, if the reply sentence determination apparatus 100 based on the rough semantics is a server, for example, the apparatus may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform. The reply sentence determination apparatus 100 based on the rough semantics may further include an output device 105 and an input device 106. The output device 105 is in communication with the processor 101 and may display information in a variety of ways. For example, the output device 105 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 106 is in communication with the processor 101 and may receive user input in a variety of ways. For example, the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
The above-described reply sentence determination apparatus 100 based on the rough semantics may be a general-purpose device or a special-purpose device. The present embodiment does not limit the type of the reply sentence determination apparatus 100 based on the rough semantics.
Next, it should be noted that the embodiments disclosed in the present application may acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Finally, the reply sentence determination method based on the rough semantics can be applied to telephone consultation, e-commerce sales, off-line entity sales, service promotion, seat telephone outbound, social platform promotion and other scenes. In the present application, the phone consultation scenario is mainly taken as an example to illustrate the reply sentence determination method based on the rough semantics, and the reply sentence determination method based on the rough semantics in other scenarios is similar to the implementation manner in the phone consultation scenario and will not be described here.
In the following, the reply sentence determination method based on rough semantics disclosed in the present application will be explained:
referring to fig. 2, fig. 2 is a schematic flowchart of a reply statement determination method based on rough semantics according to an embodiment of the present disclosure. The reply sentence determination method based on the rough semantics comprises the following steps:
201: and acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information of the user at the current moment.
In the present embodiment, the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest. In brief, the previous round of voice information is the last sentence spoken by the user before the voice information at the current moment.
For example, the previous round of voice information may be determined from the occurrence time of the user's voice information at the current moment by querying historical dialogue data, which records the dialogue data generated before the current moment in the dialogue event to which that voice information belongs. Specifically, two interrelated sentence queues may be maintained in the historical dialogue data: one queue stores the user sentences issued by the user, and the other stores the reply sentences made by the AI to those user sentences. Each user sentence in the user sentence queue and each reply sentence in the reply sentence queue carries a dialog mark and a dialog occurrence time; a user sentence and a reply sentence with the same mark form a question-answer pair, i.e., the reply sentence with the same dialog mark is the reply to that user sentence. This guarantees the question-answer logic of the historical dialogue data while storing the user's and the AI's sentences separately, which facilitates searching.
Therefore, in the present embodiment, by querying the user sentence queue, the voice information whose dialog occurrence time is earlier than the occurrence time of the user's voice information at the current moment, and whose absolute time difference from it is the smallest, can be determined as the previous round of voice information.
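The previous-round lookup over such a user-sentence queue can be sketched as follows; this is a minimal illustration, and the field names (`dialog_mark`, `time`) are hypothetical stand-ins for the dialog mark and dialog occurrence time described above:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    dialog_mark: str   # links a user sentence to its AI reply (hypothetical name)
    time: float        # dialog occurrence time of the sentence
    text: str

def previous_round(user_queue, current_time):
    """Return the user sentence whose time is earlier than current_time
    and whose absolute time difference from it is the smallest."""
    earlier = [u for u in user_queue if u.time < current_time]
    if not earlier:
        return None
    return min(earlier, key=lambda u: abs(current_time - u.time))

user_queue = [
    Utterance("q1", 10.0, "What plans do you offer?"),
    Utterance("q2", 25.0, "How much is the premium?"),
]
prev = previous_round(user_queue, current_time=30.0)
```

Here `prev` is the sentence marked `q2`, the last one the user spoke before the current moment.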
202: and performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information.
In this embodiment, the rough semantic features may be understood as semantic features containing the high-level abstract information of the previous round of voice information. Illustratively, a high-order coarse sequence representation can be actively constructed and then analyzed to obtain a plurality of high-order parallel sequences. A low-order coarse sequence is then generated through a hierarchical structure, and the information in the high-order parallel sequences flows into the low-order coarse sequence, so that the key information and the coarse information in the voice information are extracted synchronously and information from multiple levels is embodied at the same time. Meanwhile, after conversion to the low-order coarse sequence, the model that generates the reply sentence can better memorize and understand long-range content, and in turn generate meaningful replies closely related to the topic, improving user experience.
For example, the present embodiment provides a method for performing rough semantic extraction on voice information according to a previous round of voice information to obtain rough semantic features corresponding to the voice information, as shown in fig. 3, where the method includes:
301: and detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information.
In this embodiment, the detection process may be to perform text conversion on the previous round of voice information and then perform word segmentation, and then obtain all words obtained through word segmentation processing as the at least one first word. Meanwhile, each of the at least one first word may include a word tag, and the word tag may be part-of-speech information of the corresponding first word, for example: nouns, verbs, named entities, etc.
Thus, in the present embodiment, named entity information in the text obtained by text conversion may be extracted by a conditional random field (CRF), and the type of each named entity, such as a person name or an organization name, may be marked by the CRF. A part-of-speech (POS) tagging tool is then used to perform word segmentation and part-of-speech tagging on the text, extracting its nouns and verbs. The results of CRF and POS are combined in this process because POS identifies only individual words, while CRF can identify a complete phrase. For example, for "I work at Shanghai Fudan University", CRF can fully identify the organization-name entity "Shanghai Fudan University", while POS can only identify the nouns "Shanghai", "Fudan", and "University". Therefore, when processing entity words, if a POS result is contained in a CRF result, the CRF result is used preferentially, while verbs are taken only from the POS result. In this way, the first words carrying part-of-speech tags are obtained.
In an optional embodiment, if the language used by the user is english, a set of verbs and named entities in a corresponding domain may be constructed in advance, and then verbs and named entities in the original sentence are extracted in a matching manner, and then the extraction of english nouns may be performed by using POS for noun recognition and extraction, and then the first term including the part-of-speech information tag is obtained.
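The CRF/POS merging rule above (prefer the CRF phrase when a POS noun is contained in it; take verbs only from POS) can be sketched as follows; the tag strings and the containment test are illustrative assumptions, not the patent's exact representation:

```python
def merge_entities(crf_spans, pos_tokens):
    """Merge CRF phrase entities with POS-tagged tokens: keep every CRF span,
    add POS nouns only when not covered by a CRF span, and add POS verbs."""
    merged = list(crf_spans)
    for word, tag in pos_tokens:
        covered = any(word in span for span in crf_spans)
        if tag == "n" and not covered:
            merged.append(word)          # uncovered noun survives on its own
        elif tag == "v":
            merged.append(word)          # verbs come only from POS
    return merged

crf = ["Shanghai Fudan University"]
pos = [("I", "pron"), ("work", "v"), ("Shanghai", "n"),
       ("Fudan", "n"), ("University", "n")]
result = merge_entities(crf, pos)
```

The three POS nouns are each contained in the CRF span, so only the full entity and the verb "work" survive, matching the preference rule described above.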
302: and determining the temporal information of the previous round of voice information according to the at least one first word.
In this embodiment, the at least one first word obtained by word segmentation may be input to a gated recurrent unit (GRU) encoder for encoding to obtain a second hidden-layer state feature vector. The second hidden-layer state feature vector is then input to a multilayer perceptron (MLP) to obtain a linear output result. Finally, the linear output result is input to a temporal classifier to obtain the temporal information of the previous round of voice information.
Specifically, the structure of the GRU is shown in FIG. 4; it comprises a reset gate r_t, an update gate z_t, a candidate memory cell h̃_t, and a current-time memory cell h_t.

Specifically, the operation logic of the reset gate r_t can be expressed by formula ①:

r_t = σ(W_r·X_t + U_r·h_{t-1} + b_r) .........①

where σ is the activation function; W_r and U_r are the parameter matrices corresponding to the reset gate r_t, whose initialized values are random and which obtain new values through training of the model; and b_r is the bias corresponding to the reset gate r_t, which is also trainable.

Further, the operation logic of the update gate z_t can be expressed by formula ②:

z_t = σ(W_z·X_t + U_z·h_{t-1} + b_z) .........②

where W_z and U_z are the parameter matrices corresponding to the update gate z_t, whose initialized values are random and which obtain new values through training of the model; and b_z is the bias corresponding to the update gate z_t, which is also trainable.

Further, the operation logic of the candidate memory cell h̃_t can be expressed by formula ③:

h̃_t = tanh(W·X_t + U·(r_t ⊙ h_{t-1}) + b) .........③

where tanh is the activation function; W and U are the parameter matrices corresponding to the candidate memory cell h̃_t, whose initialized values are random and which obtain new values through training of the model; and b is the bias corresponding to the candidate memory cell h̃_t, which is also trainable.

Further, the operation logic of the current-time memory cell h_t can be expressed by formula ④:

h_t = z_t ⊙ h_{t-1} + (1 − z_t) ⊙ h̃_t .........④

where z_t serves as the trainable weighting and ⊙ denotes element-wise multiplication.
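A minimal NumPy sketch of one GRU step (reset gate, update gate, candidate memory cell, current-time memory cell) as described above; the parameters are randomly initialized here rather than trained, and all dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, p):
    """One GRU step over input x_t and previous hidden state h_prev."""
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])           # reset gate
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])           # update gate
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev) + p["b"])   # candidate cell
    return z_t * h_prev + (1.0 - z_t) * h_cand                          # new memory

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
p = {"Wr": rng.normal(size=(d_h, d_in)), "Ur": rng.normal(size=(d_h, d_h)), "br": np.zeros(d_h),
     "Wz": rng.normal(size=(d_h, d_in)), "Uz": rng.normal(size=(d_h, d_h)), "bz": np.zeros(d_h),
     "W":  rng.normal(size=(d_h, d_in)), "U":  rng.normal(size=(d_h, d_h)), "b":  np.zeros(d_h)}
h = gru_cell(rng.normal(size=d_in), np.zeros(d_h), p)
```

Because the candidate cell passes through tanh, each component of the new hidden state stays inside (-1, 1) when the previous state is zero.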
In this embodiment, as shown in fig. 5, the MLP consists of two Linear layers and a ReLU activation function. The result output by the last Linear layer is fed into a softmax function for multi-label classification, and the temporal classifier finally determines the tense of the current sentence. This avoids the false recognition and missed recognition caused by traditional tense recognition that relies only on standalone tense-marker words. For example, the voice information "I am running" is in the present progressive tense, but because it contains no standalone tense-marker word, it would be missed by the traditional recognition approach.
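The Linear → ReLU → Linear → softmax pipeline described above can be sketched as follows; the hidden width and the four-way tense label set are illustrative assumptions, and the weights are random rather than trained:

```python
import numpy as np

def mlp_temporal_classifier(h, W1, b1, W2, b2):
    """Two Linear layers with a ReLU between them, then softmax over tenses."""
    hidden = np.maximum(0.0, W1 @ h + b1)     # first Linear + ReLU
    logits = W2 @ hidden + b2                 # second (last) Linear layer
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
d_h, d_mid, n_tense = 3, 8, 4    # e.g. past / present / progressive / future
probs = mlp_temporal_classifier(
    rng.normal(size=d_h),
    rng.normal(size=(d_mid, d_h)), np.zeros(d_mid),
    rng.normal(size=(n_tense, d_mid)), np.zeros(n_tense))
```

The temporal classifier would then take the label with the highest probability (or all labels above a threshold, for multi-label classification).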
303: adding the temporal information into the word label of each first word to obtain at least one second word corresponding to at least one first word one by one.
In brief, in this embodiment, a second word is a first word whose word tag has been supplemented with the temporal information of the corresponding voice information. The second word thus carries the part-of-speech information and the temporal information of the utterance in addition to the content of the utterance itself, which makes the subsequently generated reply sentence more accurate.
304: and inputting at least one second word into a coarse encoder to be encoded, and obtaining at least one coarse context information in one-to-one correspondence with the at least one second word and at least one first hidden layer state feature vector in one-to-one correspondence with the at least one second word.
In this embodiment, the coarse encoder may be a GRU encoder. Specifically, during encoding, the at least one second word is input into the GRU encoder one by one in order, and the encoder outputs the corresponding coarse context information and first hidden layer state feature vector. In addition to the currently encoded second word, the first hidden layer state feature vector output by the previous encoding step also serves as an input to the current step. That is, when the x-th second word is encoded, the x-th second word and the (x−1)-th first hidden layer state feature vector are input into the GRU encoder to obtain the x-th coarse context information and the x-th first hidden layer state feature vector. When x = 1, since there is no 0-th second word, only the 1st second word is input into the GRU encoder for encoding.
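A minimal sketch of this sequential encoding, using a simplified GRU cell built from the candidate memory cell h̃_t and the weighted update h_t described earlier (the reset gate of a full GRU is omitted, the sigmoid parameterisation of z_t is one common choice rather than anything the text specifies, and all dimensions and parameters are illustrative):

```python
import math

def sigmoid(v): return 1.0 / (1.0 + math.exp(-v))
def matvec(W, x): return [sum(a * b for a, b in zip(row, x)) for row in W]

def gru_cell(x, h_prev, P):
    """One step: weight z_t, candidate h~_t (formula 3), memory unit h_t (formula 4)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(P["Wz"], x), matvec(P["Uz"], h_prev))]
    cand = [math.tanh(a + b + c) for a, b, c in
            zip(matvec(P["W"], x), matvec(P["U"], h_prev), P["b"])]
    # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def coarse_encode(word_vecs, P, dim):
    """Feed each second word plus the previous hidden state into the cell;
    collect per-step (coarse context info, hidden state) pairs -- in this
    simplified sketch both equal the cell output h_t."""
    h = [0.0] * dim
    contexts, hiddens = [], []
    for x in word_vecs:
        h = gru_cell(x, h, P)
        contexts.append(h)
        hiddens.append(h)
    return contexts, hiddens

# toy parameters: input dim 2, hidden dim 2; two "second words" as vectors
P = {
    "Wz": [[0.5, 0.0], [0.0, 0.5]], "Uz": [[0.1, 0.0], [0.0, 0.1]],
    "W":  [[0.3, 0.1], [0.1, 0.3]], "U":  [[0.2, 0.0], [0.0, 0.2]],
    "b":  [0.0, 0.0],
}
contexts, hiddens = coarse_encode([[1.0, 0.0], [0.0, 1.0]], P, dim=2)
```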
305: and inputting at least one rough context information and at least one first hidden layer state feature vector into a rough decoder to perform decoding processing for multiple times to obtain rough semantic features of the voice information.
In this embodiment, when extracting the rough semantic features, the second words obtained by splitting differ in importance to the voice information. Therefore, before the coarse context information is input into the coarse decoder, attention processing may be applied to obtain the importance of each piece of coarse context information.
For example, each second word corresponds to one hidden layer state feature vector in the encoder (the first hidden layer state feature vector), so there are as many first hidden layer state feature vectors as there are pieces of coarse context information. The coarse context information is then input into the decoder. During decoding, the decoder computes the similarity between the feature vector of the current decoding step (the decoder's current output) and the hidden layer state feature vector corresponding to the input coarse context information. A similarity value is thus obtained for each piece of coarse context information, and the similarities are normalized to obtain a weight for each piece of coarse context information. Each weight is multiplied by the hidden layer state feature vector that the corresponding coarse context information produced in the encoder, yielding the attention feature; this attention feature is added to the output feature vector obtained when the coarse context information is input into the decoder, giving the final feature for that piece of coarse context information.
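The attention computation described in this paragraph can be sketched as follows. This is a hedged illustration: the decoder outputs B_i are assumed given, and the chaining of each G_i into the next decoding step (step 606 below) is elided for clarity:

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(decoder_outputs, encoder_hiddens):
    """Weight each coarse context by the similarity between the decoder
    output B_i and the encoder hidden state C_i, then add the attention
    feature back onto the decoder output."""
    sims = [cosine(b, c) for b, c in zip(decoder_outputs, encoder_hiddens)]  # D_i
    weights = softmax(sims)                                                  # E_i
    targets = []
    for b, c, e in zip(decoder_outputs, encoder_hiddens, weights):
        f = [e * ci for ci in c]                           # F_i = E_i * C_i
        targets.append([fi + bi for fi, bi in zip(f, b)])  # G_i = F_i + B_i
    return targets
```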
Based on this, the present embodiment provides a method for inputting at least one coarse context information and at least one first hidden layer state feature vector into a coarse decoder for performing a plurality of decoding processes to obtain coarse semantic features of speech information, as shown in fig. 6, the method includes:
601: in the i-th decoding process, input the input feature vector A_i into the coarse decoder to obtain an output feature vector B_i.

In this embodiment, i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of coarse context information, j is an integer greater than or equal to 1, and when i = 1 the input feature vector A_i is the 1st piece of coarse context information in the at least one piece of coarse context information.

602: compute the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i in the at least one first hidden layer state feature vector.

In this embodiment, the cosine similarity between the output feature vector B_i and the i-th first hidden layer state feature vector C_i may be computed to obtain the similarity D_i.

603: normalize the similarity D_i to obtain the weight E_i of the input feature vector A_i.

In this embodiment, the similarity D_i is input into a softmax function for normalization to obtain the weight E_i of the input feature vector A_i.

604: multiply the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i.

605: add the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i.

606: use the target output feature vector G_i as the input feature vector A_{i+1} of the (i+1)-th decoding process and perform the (i+1)-th decoding, until the rough semantic features of the voice information are obtained after multiple decoding processes.
Specifically, in the process of multiple decoding processes, the output at the previous time is used as the input at the next time, and the final output obtained after the multiple decoding processes is the rough semantic feature of the voice information.
203: and performing word segmentation processing on the voice information to obtain a key phrase.
In this embodiment, the voice information may be converted into a text, and then the text may be segmented to obtain at least one first keyword. Then, any two different first adjacent words and second adjacent words in the at least one first keyword are combined to obtain at least one second keyword, and the field interval between the first adjacent words and the second adjacent words is smaller than a first threshold value.
Specifically, the first adjacent word and the second adjacent word are any two different adjacent first keywords whose field interval is smaller than the first threshold, where the field interval can be understood as the number of characters between the positions of the first adjacent word and the second adjacent word in the corresponding text. For example, for the text "the Disney paradise located in the Pudong New District of Shanghai City opened in 2016", the first keywords obtained after word segmentation and screening are: "Shanghai City", "2016", "Disney", "paradise", "Pudong", and "New District". Here, the number of characters between the positions of the first keywords "2016" and "Disney" in the text is 3, so their field interval is 3; the number of characters between the positions of the first keywords "Disney" and "paradise" is 0, so their field interval is 0.
In this embodiment, the first threshold may be set to 1. Taking the same text as an example, the pairs of first keywords satisfying this condition are "Disney" and "paradise", and "Pudong" and "New District". Combining them yields the second keywords "Disney paradise" and "Pudong New District".
And then, matching each second keyword in the at least one second keyword with a preset entity library, and screening out the second keywords failing to be matched to obtain at least one third keyword. And deleting the first keywords of each third keyword in the at least one third keyword to obtain at least one fourth keyword.
Specifically, the fourth keywords are the first keywords that remain after removing the first keywords that form each third keyword in the at least one third keyword. Continuing the example above, assume the determined third keyword is "Disney paradise". Since this third keyword is composed of the first keywords "Disney" and "paradise", these two are removed from the original first keywords ("Shanghai City", "2016", "Disney", "paradise", "Pudong", and "New District"), and the remaining first keywords ("Shanghai City", "2016", "Pudong", and "New District") are the fourth keywords.
And finally, combining the at least one third keyword and the at least one fourth keyword to obtain a keyword group.
Specifically, continuing the example above, the third keyword "Disney paradise" and the fourth keywords "Shanghai City", "2016", "Pudong", and "New District" are combined to obtain the keyword group: "Shanghai City", "2016", "Disney paradise", "Pudong", and "New District".
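The whole keyword-group construction can be sketched as below. This assumes the first keywords arrive in text order and uses a made-up text plus a one-entry stand-in entity library; the substring test used to remove the constituents of a third keyword is a simplification of the removal step:

```python
def build_keyword_group(text, first_keywords, entity_lib, threshold=1):
    # locate each first keyword in the text (first occurrence, left to right)
    spans, cursor = [], 0
    for w in first_keywords:
        i = text.index(w, cursor)
        spans.append((w, i, i + len(w)))
        cursor = i + len(w)
    # combine adjacent first keywords whose character gap is below the threshold
    second = []
    for (w1, _, end1), (w2, start2, _) in zip(spans, spans[1:]):
        if start2 - end1 < threshold:
            second.append(w1 + w2)
    # keep only combinations found in the entity library -> third keywords
    third = [w for w in second if w in entity_lib]
    # drop the first keywords absorbed into a third keyword -> fourth keywords
    absorbed = {w for t in third for (w, _, _) in spans if w in t}
    fourth = [w for w in first_keywords if w not in absorbed]
    return third + fourth

text = "DisneyParadise opened in PudongNewDistrict in 2016"
first = ["Disney", "Paradise", "Pudong", "NewDistrict", "2016"]
group = build_keyword_group(text, first, entity_lib={"DisneyParadise"})
```

In this toy run, "PudongNewDistrict" is formed as a second keyword but is screened out because it is absent from the stand-in entity library, so "Pudong" and "NewDistrict" survive as fourth keywords, mirroring the worked example above.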
204: and carrying out multiple hidden feature extraction processing on the key phrase to obtain an initial hidden layer state feature vector.
In this embodiment, the keyword group may include at least one keyword, and the at least one keyword is arranged according to a sequence of a position of each keyword in the at least one keyword in the voice message. Based on this, the present embodiment provides a method for performing multiple hidden feature extraction processing on a keyword group to obtain an initial hidden layer state feature vector, which includes:
in the n-th hidden feature extraction process, input the first input hidden feature H_n into the GRU encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, m is an integer greater than or equal to 1, and when n = 1 the input hidden feature H_n is the 1st keyword in the at least one keyword; then use the first output hidden feature I_n as the first input hidden feature H_{n+1} of the (n+1)-th hidden feature extraction process and perform the (n+1)-th extraction, until the initial hidden layer state feature vector is obtained after multiple hidden feature extraction processes.
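Read literally, the extraction chain above can be sketched as follows. The stand-in cell and vectors are illustrative only; a practical GRU encoder would also consume the n-th keyword vector at each step, which the text elides:

```python
import math

def cell(h):
    # stand-in recurrent cell: an elementwise squashing of the hidden feature
    return [math.tanh(0.5 * v + 0.1) for v in h]

def initial_hidden_state(keyword_vecs):
    # literal reading: H_1 is the 1st keyword (as a vector), I_n = cell(H_n),
    # and I_n is reused as H_{n+1}; the final I_m is the initial hidden
    # layer state feature vector
    h = keyword_vecs[0]
    for _ in keyword_vecs:   # m extraction passes
        h = cell(h)
    return h

vec = initial_hidden_state([[0.2, 0.4], [0.1, 0.3], [0.0, 0.5]])
```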
205: and performing repeated reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word.
In this embodiment, in the p-th reply word generation process, the input word vector K_p, the second input hidden feature L_p, and the rough semantic features are input into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1 the input word vector K_p is the initial hidden layer state feature vector. Then, word embedding is performed on the reply word O_p to obtain a reply word vector S_p. Finally, the reply word vector S_p is used as the input word vector K_{p+1} of the (p+1)-th reply word generation process and the second output hidden feature R_p as the second input hidden feature L_{p+1}, and the (p+1)-th generation process is performed, until at least one reply word is obtained after multiple reply word generation processes.
Specifically, as shown in FIG. 7, the generation process produces one reply word at a time: the reply word O_p is generated at the p-th step, and the reply word O_{p+1} at the (p+1)-th step. At the (p+1)-th step, the reply word O_p generated at the previous (p-th) step also serves as one of the inputs, together with the rough semantic features. That is, the reply word O_{p+1} is generated from the word vector of the reply word O_p, the second output hidden feature R_p generated at the p-th step, and the rough semantic features.
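The generation loop O_p → S_p → K_{p+1} can be sketched as below, with a deterministic stand-in for the GRU decoder and an assumed zero initialisation for L_1 (neither the decoder internals nor the L_1 initialisation is specified in the text):

```python
def generate_reply(initial_hidden, coarse_sem, decoder_step, embed, q):
    # K_1 is the initial hidden layer state feature vector;
    # L_1 is an assumed zero initialisation
    words = []
    k = initial_hidden
    l = [0.0] * len(initial_hidden)
    for _ in range(q):
        word, r = decoder_step(k, l, coarse_sem)  # O_p, R_p
        words.append(word)
        k = embed(word)                           # S_p becomes K_{p+1}
        l = r                                     # R_p becomes L_{p+1}
    return " ".join(words)  # splice the reply words in generation order

def make_toy_decoder(script):
    # deterministic stand-in for the GRU decoder, emitting a fixed script
    it = iter(script)
    def step(k, l, coarse):
        return next(it), [v + 1.0 for v in l]     # dummy hidden update
    return step

reply = generate_reply(
    initial_hidden=[0.1, 0.2],
    coarse_sem=[0.3, 0.4],
    decoder_step=make_toy_decoder(["I", "am", "fine"]),
    embed=lambda w: [float(len(w)), 0.0],
    q=3,
)
```

The space-delimited join fits the English toy script; for Chinese replies the words would be concatenated directly.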
206: and splicing the at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain the reply sentence of the voice message.
In summary, in the reply sentence determination method based on rough semantics provided by the present invention, the previous round of voice information of the user at the current moment is obtained, and rough semantic extraction is performed on it to obtain semantic features containing the high-level abstract information of the previous round of voice information. These serve as the rough semantic features of the user's voice information at the current moment, realizing synchronous extraction of the key information and rough information in the previous round of voice information. Then, word segmentation is performed on the user's voice information at the current moment, and multiple hidden feature extraction processes are performed on the resulting keywords to obtain the initial hidden layer state feature vector of that voice information. Finally, multiple reply word generation processes are performed according to the rough semantic features and the initial hidden layer state feature vector, and the resulting reply words are spliced in their generation order to obtain the reply sentence of the voice information. Because the rough semantic features, which contain both the key information and the rough information of the previous round of conversation, serve as one of the bases for generating the reply sentence in the current round, the generation process incorporates more comprehensive information features of the previous round. The generated reply sentence is therefore more precise, matches the subject of the conversation better, and improves the user experience.
Referring to fig. 8, fig. 8 is a block diagram illustrating functional modules of a reply sentence determination apparatus based on rough semantics according to an embodiment of the present disclosure. As shown in fig. 8, the reply sentence determination apparatus 800 based on rough semantics includes:
an obtaining module 801, configured to obtain previous round of voice information adjacent to the voice information according to occurrence time of the voice information at a current moment of a user, where the occurrence time of the previous round of voice information is less than the occurrence time of the voice information, and an absolute value of a difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is minimum;
the processing module 802 is configured to perform coarse semantic extraction on the voice information according to the previous round of voice information to obtain coarse semantic features corresponding to the voice information, perform word segmentation processing on the voice information to obtain a keyword group, and perform multiple hidden feature extraction processing on the keyword group to obtain an initial hidden layer state feature vector;
the generating module 803 is configured to perform multiple reply word generation processing according to the coarse semantic features and the initial hidden layer state feature vector to obtain at least one reply word, and splice the at least one reply word according to a generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In an embodiment of the present invention, in terms of performing rough semantic extraction on voice information according to previous round of voice information to obtain rough semantic features corresponding to the voice information, the processing module 802 is specifically configured to:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining the temporal information of the previous round of voice information according to at least one first word;
adding the temporal information into the word label of each first word to obtain at least one second word, wherein the at least one second word is in one-to-one correspondence with the at least one first word;
inputting at least one second word into a coarse encoder to be encoded to obtain at least one coarse context information and at least one first hidden layer state feature vector, wherein the at least one coarse context information corresponds to the at least one second word one by one, and the at least one first hidden layer state feature vector corresponds to the at least one second word one by one;
and inputting at least one rough context information and at least one first hidden layer state feature vector into a rough decoder to perform decoding processing for multiple times to obtain rough semantic features of the voice information.
In an embodiment of the present invention, in terms of determining temporal information of a previous round of speech information according to at least one first word, the processing module 802 is specifically configured to:
inputting at least one first word into a gated cyclic unit encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multilayer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain temporal information of the previous round of voice information.
In an embodiment of the present invention, in terms of inputting at least one coarse context information and at least one first hidden layer state feature vector into a coarse decoder for performing multiple decoding processes to obtain coarse semantic features of speech information, the processing module 802 is specifically configured to:
in the i-th decoding process, input the input feature vector A_i into the coarse decoder to obtain an output feature vector B_i, where i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of coarse context information, j is an integer greater than or equal to 1, and when i = 1 the input feature vector A_i is the 1st piece of coarse context information in the at least one piece of coarse context information;

compute the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i in the at least one first hidden layer state feature vector;

normalize the similarity D_i to obtain the weight E_i of the input feature vector A_i;

multiply the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;

add the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;

use the target output feature vector G_i as the input feature vector A_{i+1} of the (i+1)-th decoding process and perform the (i+1)-th decoding, until the rough semantic features of the voice information are obtained after multiple decoding processes.
In an embodiment of the present invention, the keyword group includes at least one keyword, and the at least one keyword is arranged according to a sequence of a position of each keyword in the at least one keyword in the voice message. Based on this, in the aspect of extracting and processing the hidden features of the keyword group for multiple times to obtain the initial hidden layer state feature vector, the processing module 802 is specifically configured to:
in the n-th hidden feature extraction process, input the first input hidden feature H_n into the gated recurrent unit encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, m is an integer greater than or equal to 1, and when n = 1 the input hidden feature H_n is the 1st keyword in the at least one keyword;

use the first output hidden feature I_n as the first input hidden feature H_{n+1} of the (n+1)-th hidden feature extraction process and perform the (n+1)-th extraction, until the initial hidden layer state feature vector is obtained after multiple hidden feature extraction processes.
In an embodiment of the present invention, in terms of performing multiple reply word generation processing according to the coarse semantic features and the initial hidden layer state feature vector to obtain at least one reply word, the generating module 803 is specifically configured to:
in the p-th reply word generation process, input the input word vector K_p, the second input hidden feature L_p, and the rough semantic features into the gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1 the input word vector K_p is the initial hidden layer state feature vector;

perform word embedding on the reply word O_p to obtain a reply word vector S_p;

use the reply word vector S_p as the input word vector K_{p+1} of the (p+1)-th reply word generation process and the second output hidden feature R_p as the second input hidden feature L_{p+1}, and perform the (p+1)-th generation process, until at least one reply word is obtained after multiple reply word generation processes.
In the embodiment of the present invention, in terms of performing word segmentation processing on the voice information to obtain a keyword group, the processing module 802 is specifically configured to:
converting the voice information into a text, and segmenting the text to obtain at least one first keyword;
combining the first adjacent words and the second adjacent words to obtain at least one second keyword, wherein the first adjacent words and the second adjacent words are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent words and the second adjacent words is smaller than a first threshold value;
matching each second keyword in the at least one second keyword with a preset entity library, and screening out the second keywords which fail to be matched to obtain at least one third keyword;
deleting the first keywords of each third keyword in the at least one third keyword to obtain at least one fourth keyword;
and combining the at least one third keyword and the at least one fourth keyword to obtain a keyword group.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 9, the electronic device 900 includes a transceiver 901, a processor 902, and a memory 903, which are connected to each other by a bus 904. The memory 903 is used to store computer programs and data, and can transfer the stored data to the processor 902.
The processor 902 is configured to read the computer program in the memory 903 to perform the following operations:
acquiring previous round voice information adjacent to the voice information according to the occurrence time of the voice information of the current moment of the user, wherein the occurrence time of the previous round voice information is less than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round voice information and the occurrence time of the voice information is minimum;
performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information;
performing word segmentation processing on the voice information to obtain a key phrase;
carrying out multiple hidden feature extraction processing on the key phrase to obtain an initial hidden layer state feature vector;
performing multiple reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word;
and splicing the at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain the reply sentence of the voice message.
In an embodiment of the present invention, in terms of performing a rough semantic extraction on voice information according to a previous round of voice information to obtain a rough semantic feature corresponding to the voice information, the processor 902 is specifically configured to perform the following operations:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining the temporal information of the previous round of voice information according to at least one first word;
adding the temporal information into the word label of each first word to obtain at least one second word, wherein the at least one second word is in one-to-one correspondence with the at least one first word;
inputting at least one second word into a coarse encoder to be encoded to obtain at least one coarse context information and at least one first hidden layer state feature vector, wherein the at least one coarse context information corresponds to the at least one second word one by one, and the at least one first hidden layer state feature vector corresponds to the at least one second word one by one;
and inputting at least one rough context information and at least one first hidden layer state feature vector into a rough decoder to perform decoding processing for multiple times to obtain rough semantic features of the voice information.
In an embodiment of the present invention, in determining temporal information of a previous round of speech information according to at least one first word, the processor 902 is specifically configured to:
inputting at least one first word into a gated cyclic unit encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multilayer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain temporal information of the previous round of voice information.
In an embodiment of the present invention, in inputting at least one coarse context information and at least one first hidden layer state feature vector into a coarse decoder for performing a plurality of decoding processes to obtain coarse semantic features of speech information, the processor 902 is specifically configured to perform the following operations:
in the i-th decoding process, input the input feature vector A_i into the coarse decoder to obtain an output feature vector B_i, where i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of coarse context information, j is an integer greater than or equal to 1, and when i = 1 the input feature vector A_i is the 1st piece of coarse context information in the at least one piece of coarse context information;

compute the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i in the at least one first hidden layer state feature vector;

normalize the similarity D_i to obtain the weight E_i of the input feature vector A_i;

multiply the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;

add the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;

use the target output feature vector G_i as the input feature vector A_{i+1} of the (i+1)-th decoding process and perform the (i+1)-th decoding, until the rough semantic features of the voice information are obtained after multiple decoding processes.
In an embodiment of the present invention, the keyword group includes at least one keyword, and the at least one keyword is arranged according to a sequence of a position of each keyword in the at least one keyword in the voice message. Based on this, in terms of performing multiple hidden feature extraction processing on the keyword group to obtain an initial hidden layer state feature vector, the processor 902 is specifically configured to perform the following operations:
in the n-th hidden feature extraction process, input the first input hidden feature H_n into the gated recurrent unit encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, m is an integer greater than or equal to 1, and when n = 1 the input hidden feature H_n is the 1st keyword in the at least one keyword;

use the first output hidden feature I_n as the first input hidden feature H_{n+1} of the (n+1)-th hidden feature extraction process and perform the (n+1)-th extraction, until the initial hidden layer state feature vector is obtained after multiple hidden feature extraction processes.
In an embodiment of the present invention, in terms of performing multiple reply word generation processing according to the coarse semantic features and the initial hidden layer state feature vector to obtain at least one reply word, the processor 902 is specifically configured to perform the following operations:
in the p-th reply word generation process, input the input word vector K_p, the second input hidden feature L_p, and the rough semantic features into the gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1 the input word vector K_p is the initial hidden layer state feature vector;

perform word embedding on the reply word O_p to obtain a reply word vector S_p;

use the reply word vector S_p as the input word vector K_{p+1} of the (p+1)-th reply word generation process and the second output hidden feature R_p as the second input hidden feature L_{p+1}, and perform the (p+1)-th generation process, until at least one reply word is obtained after multiple reply word generation processes.
In an embodiment of the present invention, in terms of performing a word segmentation process on the voice information to obtain a keyword group, the processor 902 is specifically configured to perform the following operations:
converting the voice information into a text, and segmenting the text to obtain at least one first keyword;
combining the first adjacent words and the second adjacent words to obtain at least one second keyword, wherein the first adjacent words and the second adjacent words are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent words and the second adjacent words is smaller than a first threshold value;
matching each second keyword in the at least one second keyword with a preset entity library, and screening out the second keywords which fail to be matched to obtain at least one third keyword;
deleting the first keywords of each third keyword in the at least one third keyword to obtain at least one fourth keyword;
and combining the at least one third keyword and the at least one fourth keyword to obtain a keyword group.
It should be understood that the reply sentence determination device based on rough semantics in the present application may include a smart phone (e.g., an Android phone, an iOS phone, or a Windows phone), a tablet computer, a palmtop computer, a notebook computer, a mobile Internet device (MID), a robot, or a wearable device. The above devices are merely examples and are not exhaustive; the reply sentence determination device based on rough semantics includes, but is not limited to, them. In practical applications, the device may further include an intelligent vehicle-mounted terminal, computer equipment, and the like.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software in combination with a hardware platform. Based on this understanding, all or the part of the technical solutions of the present invention that contributes over the background art can be embodied in the form of a software product. The software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
Accordingly, the present application also provides a computer readable storage medium, which stores a computer program, wherein the computer program is executed by a processor to implement part or all of the steps of any one of the reply sentence determination methods based on rough semantics as described in the above method embodiments. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the coarse semantics based reply sentence determination methods as set forth in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments and that the acts and modules referred to are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis, and for parts not described in detail in a certain embodiment, reference may be made to the description of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, and the memory may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the methods and their core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for determining a reply sentence based on rough semantics, the method comprising:
acquiring previous round voice information adjacent to the voice information according to the occurrence time of the voice information of the user at the current moment, wherein the occurrence time of the previous round voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is minimum;
performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information;
performing word segmentation processing on the voice information to obtain a key word group;
carrying out multiple hidden feature extraction processing on the key phrase to obtain an initial hidden layer state feature vector;
performing multiple reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word;
and concatenating the at least one reply word according to the generation order of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
2. The method of claim 1, wherein the performing rough semantic extraction on the speech information according to the previous round of speech information to obtain rough semantic features corresponding to the speech information comprises:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining the temporal information of the previous round of voice information according to the at least one first word;
adding the temporal information into a word label of each first word to obtain at least one second word, wherein the at least one second word is in one-to-one correspondence with the at least one first word;
inputting the at least one second word into a coarse encoder for encoding to obtain at least one coarse context information and at least one first hidden layer state feature vector, wherein the at least one coarse context information corresponds to the at least one second word one by one, and the at least one first hidden layer state feature vector corresponds to the at least one second word one by one;
and inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform decoding processing for multiple times to obtain rough semantic features of the voice information.
3. The method of claim 2, wherein the determining temporal information of the previous round of speech information according to the at least one first word comprises:
inputting the at least one first word into a gated cyclic unit encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multilayer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain temporal information of the previous round of voice information.
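The three-stage pipeline of claim 3 — gated recurrent unit encoder, multilayer perceptron, temporal classifier — can be sketched as below. The tanh fold, the single-layer perceptron, and the three tense labels are illustrative stand-ins; the real model's dimensions, trained weights, and label set are not given in the claim:

```python
import math

def gru_encode(word_vectors):
    """Fold the first words into the second hidden layer state feature vector
    (toy recurrent update standing in for the trained GRU encoder)."""
    h = [0.0] * len(word_vectors[0])
    for v in word_vectors:
        h = [math.tanh(x + y) for x, y in zip(h, v)]
    return h

def mlp(hidden, weights):
    """A single linear layer standing in for the multilayer perceptron."""
    return [sum(w * x for w, x in zip(row, hidden)) for row in weights]

def classify_tense(logits, labels=("past", "present", "future")):
    """Temporal classifier: softmax over the linear output, then argmax."""
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return labels[probs.index(max(probs))]
```

The returned label is the temporal information that is then written into the word label of every first word.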
4. The method according to claim 2, wherein said inputting the at least one coarse context information and the at least one first hidden layer state feature vector into a coarse decoder for a plurality of decoding processes to obtain coarse semantic features of the speech information comprises:
in the i-th decoding process, inputting the input feature vector A_i into the coarse decoder to obtain an output feature vector B_i, wherein i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of coarse context information, j is an integer greater than or equal to 1, and when i is equal to 1, the input feature vector A_i is the 1st piece of coarse context information in the at least one piece of coarse context information;
calculating a similarity D_i between the output feature vector B_i and an i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
performing normalization processing on the similarity D_i to obtain a weight E_i of the input feature vector A_i;
multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weight feature vector F_i;
adding the weight feature vector F_i and the output feature vector B_i to obtain a target output feature vector G_i;
taking the target output feature vector G_i as the input feature vector A_{i+1} of the (i+1)-th decoding process, and performing the (i+1)-th decoding process until the coarse semantic features of the voice information are obtained after performing the decoding process multiple times.
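The steps of claim 4 can be sketched as one loop. The dot-product similarity and the sigmoid used for "normalization processing" are assumptions — the claim names neither the similarity measure nor the normalizer (a softmax over all similarities would be the conventional attention choice) — and `decoder_step` stands in for the trained coarse decoder:

```python
import math

def coarse_decode(contexts, hiddens, decoder_step):
    """Run the j decoding passes of claim 4 and return the coarse semantic features."""
    a_i = contexts[0]                       # A_1 is the 1st coarse context
    g_i = None
    for i in range(len(contexts)):          # passes i = 1 .. j
        b_i = decoder_step(a_i)             # B_i from the coarse decoder
        c_i = hiddens[i]                    # i-th first hidden layer state feature vector
        d_i = sum(b * c for b, c in zip(b_i, c_i))    # similarity D_i (dot product)
        e_i = 1.0 / (1.0 + math.exp(-d_i))  # weight E_i (sigmoid as the normalizer)
        f_i = [e_i * c for c in c_i]        # weight feature vector F_i
        g_i = [f + b for f, b in zip(f_i, b_i)]       # target output feature vector G_i
        a_i = g_i                           # G_i becomes A_{i+1}
    return g_i
```

Feeding G_i back as A_{i+1} is what lets each coarse context refine the running summary rather than being decoded in isolation.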
5. The method of claim 1,
the keyword group comprises at least one keyword, and the at least one keyword is arranged according to the sequence of the position of each keyword in the at least one keyword in the voice information;
the multiple hidden feature extraction processing is performed on the keyword group to obtain an initial hidden layer state feature vector, and the method comprises the following steps:
in the n-th hidden feature extraction process, inputting the first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, m is an integer greater than or equal to 1, and when n is 1, the first input hidden feature H_n is the 1st keyword in the at least one keyword;
taking the first output hidden feature I_n as the first input hidden feature H_{n+1} of the (n+1)-th hidden feature extraction process, and performing the (n+1)-th hidden feature extraction process until the initial hidden layer state feature vector is obtained after performing the hidden feature extraction process multiple times.
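A sketch of this chained extraction, assuming each keyword has already been mapped to a vector and that the n-th pass consumes the n-th keyword alongside the carried hidden state (the claim only states that I_n becomes H_{n+1}); the tanh update is a toy stand-in for the trained gated recurrent unit encoder:

```python
import math

def extract_initial_hidden(keyword_vectors):
    """Chain m hidden feature extraction passes over the keyword group; the
    final output I_m is the initial hidden layer state feature vector."""
    h_n = list(keyword_vectors[0])          # H_1 is the 1st keyword's vector
    for vec in keyword_vectors[1:]:
        # toy recurrent update standing in for the GRU cell: I_n becomes H_{n+1}
        h_n = [math.tanh(h + v) for h, v in zip(h_n, vec)]
    return h_n
```

The returned vector is exactly what claim 6 feeds in as K_1 at the first reply word generation pass.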
6. The method according to claim 1, wherein performing a plurality of reply word generation processes according to the coarse semantic features and the initial hidden layer state feature vector to obtain at least one reply word comprises:
when performing the p-th reply word generation processing, inputting the input word vector K_p, a second input hidden feature L_p and the coarse semantic features into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, wherein p is an integer greater than or equal to 1 and less than or equal to q, q is an integer greater than or equal to 1 determined by the voice information, and when p is 1, the input word vector K_p is the initial hidden layer state feature vector;
performing word embedding processing on the reply word O_p to obtain a reply word vector S_p;
taking the reply word vector S_p as the input word vector K_{p+1} of the (p+1)-th reply word generation processing and the second output hidden feature R_p as the second input hidden feature L_{p+1} of the (p+1)-th reply word generation processing, and performing the (p+1)-th reply word generation processing until the at least one reply word is obtained after performing the reply word generation processing multiple times.
7. The method of claim 1, wherein the performing word segmentation processing on the voice message to obtain a keyword group comprises:
converting the voice information into a text, and segmenting the text to obtain at least one first keyword;
combining a first adjacent word and a second adjacent word to obtain at least one second keyword, wherein the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is smaller than a first threshold value;
matching each second keyword in the at least one second keyword with a preset entity library, and filtering out the second keywords that fail to match, so as to obtain at least one third keyword;
deleting the first keywords forming each third keyword in the at least one third keyword from the at least one first keyword to obtain at least one fourth keyword;
and combining the at least one third keyword and the at least one fourth keyword to obtain the keyword group.
8. An apparatus for determining a reply sentence based on a rough semantic, the apparatus comprising:
the acquisition module is used for acquiring previous round voice information adjacent to the voice information according to the occurrence time of the voice information of the user at the current moment, wherein the occurrence time of the previous round voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is minimum;
the processing module is used for performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, performing word segmentation processing on the voice information to obtain a key word group, and performing multiple hidden feature extraction processing on the key word group to obtain an initial hidden layer state feature vector;
and the generating module is used for performing multiple reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word, and concatenating the at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the one or more programs including instructions for performing the steps in the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.
CN202210083351.8A 2022-01-22 2022-01-22 Reply statement determination method and device based on rough semantics and electronic equipment Active CN114417891B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210083351.8A CN114417891B (en) 2022-01-22 2022-01-22 Reply statement determination method and device based on rough semantics and electronic equipment
PCT/CN2022/090129 WO2023137903A1 (en) 2022-01-22 2022-04-29 Reply statement determination method and apparatus based on rough semantics, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210083351.8A CN114417891B (en) 2022-01-22 2022-01-22 Reply statement determination method and device based on rough semantics and electronic equipment

Publications (2)

Publication Number Publication Date
CN114417891A true CN114417891A (en) 2022-04-29
CN114417891B CN114417891B (en) 2023-05-09

Family

ID=81278095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210083351.8A Active CN114417891B (en) 2022-01-22 2022-01-22 Reply statement determination method and device based on rough semantics and electronic equipment

Country Status (2)

Country Link
CN (1) CN114417891B (en)
WO (1) WO2023137903A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842169B (en) * 2023-09-01 2024-01-12 国网山东省电力公司聊城供电公司 Power grid session management method, system, terminal and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN110413729A (en) * 2019-06-25 2019-11-05 江南大学 Talk with generation method based on the more wheels of tail sentence-dual attention model of context
CN111368538A (en) * 2020-02-29 2020-07-03 平安科技(深圳)有限公司 Voice interaction method, system, terminal and computer readable storage medium
CN112528989A (en) * 2020-12-01 2021-03-19 重庆邮电大学 Description generation method for semantic fine granularity of image
WO2021072914A1 (en) * 2019-10-14 2021-04-22 苏州思必驰信息科技有限公司 Human-machine conversation processing method

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
JP6299563B2 (en) * 2014-11-07 2018-03-28 トヨタ自動車株式会社 Response generation method, response generation apparatus, and response generation program
CN110851574A (en) * 2018-07-27 2020-02-28 北京京东尚科信息技术有限公司 Statement processing method, device and system
CN109241262B (en) * 2018-08-31 2021-01-05 出门问问信息科技有限公司 Method and device for generating reply sentence based on keyword
CN111460115B (en) * 2020-03-17 2023-05-26 深圳市优必选科技股份有限公司 Intelligent man-machine conversation model training method, model training device and electronic equipment
CN113035179B (en) * 2021-03-03 2023-09-26 中国科学技术大学 Voice recognition method, device, equipment and computer readable storage medium
CN113378557B (en) * 2021-05-08 2022-08-23 重庆邮电大学 Automatic keyword extraction method, medium and system based on fault-tolerant rough set


Also Published As

Publication number Publication date
WO2023137903A1 (en) 2023-07-27
CN114417891B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN111738016A (en) Multi-intention recognition method and related equipment
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN111666766A (en) Data processing method, device and equipment
CN114298035A (en) Text recognition desensitization method and system thereof
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN115983271A (en) Named entity recognition method and named entity recognition model training method
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN110263304B (en) Statement encoding method, statement decoding method, device, storage medium and equipment
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN111767720B (en) Title generation method, computer and readable storage medium
CN110019952B (en) Video description method, system and device
WO2023137903A1 (en) Reply statement determination method and apparatus based on rough semantics, and electronic device
CN113705315A (en) Video processing method, device, equipment and storage medium
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN116562291A (en) Chinese nested named entity recognition method based on boundary detection
CN113779202B (en) Named entity recognition method and device, computer equipment and storage medium
CN116186244A (en) Method for generating text abstract, method and device for training abstract generation model
CN116976341A (en) Entity identification method, entity identification device, electronic equipment, storage medium and program product
CN114722832A (en) Abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant