WO2023137903A1 - Reply statement determination method and apparatus based on rough semantics, and electronic device - Google Patents

Info

Publication number
WO2023137903A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
rough
feature vector
voice information
reply
Prior art date
Application number
PCT/CN2022/090129
Other languages
French (fr)
Chinese (zh)
Inventor
舒畅 (Shu Chang)
陈又新 (Chen Youxin)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2023137903A1 publication Critical patent/WO2023137903A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a method, device and electronic equipment for determining reply sentences based on rough semantics.
  • the previous round of dialogue text is usually encoded, the hidden layer state feature of the resulting encoded information is used as one of the inputs of the decoder, and the dialogue reply is then automatically generated by the decoder in time order.
  • the hidden layer state features encoded by the previous round of dialogue text are used as one of the basis for the generation of reply sentences in the current round of dialogue, so that the reply sentence generation process includes the information characteristics of the previous round of dialogue.
  • the embodiments of the present application provide a method, device and electronic device for determining reply sentences based on rough semantics, which can simultaneously extract key information and rough information in the previous round of dialogue, and then make the generated reply sentences more accurate.
  • the implementation of the present application provides a method for determining a reply sentence based on rough semantics, including:
  • the previous round of voice information adjacent to the voice information is obtained, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
  • the embodiment of the present application provides a device for determining reply sentences based on rough semantics, including:
  • the acquisition module is used to obtain the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the current moment of the user, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
  • the processing module is used to perform rough semantic extraction on the voice information according to the previous round of voice information, obtain rough semantic features corresponding to the voice information, perform word segmentation processing on the voice information, obtain keyword groups, and perform multiple hidden feature extraction processing on the keyword groups to obtain initial hidden layer state feature vectors;
  • the generation module is used to perform multiple reply word generation processing according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word, and to splice the obtained at least one reply word according to the generation order of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
  • an embodiment of the present application provides an electronic device, which includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the one or more programs include instructions for performing the following steps:
  • the previous round of voice information adjacent to the voice information is acquired, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
  • the at least one reply word is spliced according to the generation sequence of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following steps:
  • the previous round of voice information adjacent to the voice information is acquired, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
  • the at least one reply word is spliced according to the generation sequence of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
  • an embodiment of the present application provides a computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer is operable to cause the computer to execute the method in the first aspect.
  • the semantic features that can contain high-level abstract information in the previous round of voice information are obtained, which are used as the rough semantic features of the user's current voice information, thereby realizing synchronous extraction of key information and rough information in the previous round of voice information.
  • word segmentation is performed on the voice information of the user at the current moment, and multiple hidden feature extraction processes are performed on the obtained multiple keywords to obtain the initial hidden layer state feature vector of the voice information of the user at the current moment.
  • multiple reply words are generated according to the rough semantic features and the initial hidden layer state feature vector, and the obtained at least one reply word is spliced according to the generation order of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
  • the rough semantic features that contain both key information and rough information in the previous round of dialogue are used as one of the basis for generating reply sentences in this round of dialogue, so that the reply sentence generation process includes more comprehensive information features of the previous round of dialogue.
  • the generated reply sentences are more accurate, can better fit with the main body of the dialogue, and improve user experience.
  • FIG. 1 is a schematic diagram of the hardware structure of an apparatus for determining a reply sentence based on rough semantics provided in an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a method for determining a reply sentence based on rough semantics provided in an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a method for extracting rough semantics from voice information based on the previous round of voice information to obtain rough semantic features corresponding to the voice information provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of a gated recurrent unit encoder provided in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a multi-layer perceptron provided in an embodiment of the present application.
  • FIG. 6 is a schematic flow diagram of a method for inputting at least one rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of speech information provided by an embodiment of the present application;
  • FIG. 7 is a block flow diagram of a reply word generation process provided by an embodiment of the present application.
  • FIG. 8 is a block diagram of functional modules of a device for determining a reply sentence based on rough semantics provided in an embodiment of the present application
  • FIG. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a hardware structure of an apparatus for determining a reply sentence based on rough semantics provided by an embodiment of the present application.
  • the apparatus 100 for determining reply sentences based on rough semantics includes at least one processor 101 , a communication line 102 , a memory 103 and at least one communication interface 104 .
  • the processor 101 may be a general-purpose central processing unit (central processing unit, CPU), microprocessor, application-specific integrated circuit (application-specific integrated circuit, ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
  • the communication line 102, which may include a path, transmits information between the aforementioned components.
  • the communication interface 104 may be any device such as a transceiver (such as an antenna) for communicating with other devices or communication networks, such as Ethernet, RAN, wireless local area networks (wireless local area networks, WLAN) and the like.
  • the memory 103 may be a read-only memory (read-only memory, ROM) or other type of static storage device capable of storing static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc read-only memory (compact disc read-only memory, CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, without limitation.
  • the memory 103 may exist independently and be connected to the processor 101 through the communication line 102 .
  • the memory 103 can also be integrated with the processor 101 .
  • the memory 103 provided in this embodiment of the present application may generally be non-volatile.
  • the memory 103 is used to store computer-executed instructions for implementing the solutions of the present application, and the execution is controlled by the processor 101 .
  • the processor 101 is configured to execute computer-executed instructions stored in the memory 103, so as to implement the methods provided in the following embodiments of the present application.
  • computer-executed instructions may also be referred to as application code, which is not specifically limited in the present application.
  • the processor 101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 1 .
  • the apparatus 100 for determining a reply sentence based on rough semantics may include multiple processors, such as the processor 101 and the processor 107 in FIG. 1 .
  • Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • the device 100 for determining the reply statement based on rough semantics is a server, for example, it can be an independent server, or it can be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
  • the apparatus 100 for determining a reply sentence based on rough semantics may further include an output device 105 and an input device 106 .
  • Output device 105 is in communication with processor 101 and may display information in a variety of ways.
  • the output device 105 may be a liquid crystal display (liquid crystal display, LCD), a light emitting diode (light emitting diode, LED) display device, a cathode ray tube (cathode ray tube, CRT) display device, or a projector (projector), etc.
  • the input device 106 communicates with the processor 101 and can receive user input in various ways.
  • the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
  • the apparatus 100 for determining a reply sentence based on rough semantics may be a general-purpose device or a special-purpose device.
  • the embodiment of the present application does not limit the type of the device 100 for determining a reply sentence based on rough semantics.
  • artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the method for determining reply sentences based on rough semantics in this application can be applied to scenarios such as telephone consultation, e-commerce sales, offline physical sales, business promotion, outbound calls by agents, and promotion on social platforms.
  • the telephone consultation scenario is used as an example to illustrate the method for determining the reply sentence based on rough semantics.
  • the method for determining the reply sentence based on rough semantics in other scenarios is similar to that in the telephone consultation scenario, and will not be described here.
  • FIG. 2 is a schematic flowchart of a method for determining a reply sentence based on rough semantics provided in an embodiment of the present application.
  • the method for determining reply sentences based on rough semantics includes the following steps:
  • the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest.
  • the previous round of voice information is the last sentence spoken by the user before the voice information at the current moment.
  • the previous round of voice information can be determined by querying historical dialogue data that records dialogue data generated before the current time by the dialogue event to which the user's current voice information belongs, based on the occurrence time of the user's current voice information.
  • two interrelated sentence queues can be saved in the historical dialogue data, one of which is used to store the user statements issued by the user, and the other is used to store the reply statements made by the AI to the user's statements.
  • each user statement in the user statement queue and each reply statement in the reply statement queue includes a dialogue identifier and a dialogue occurrence time; the dialogue identifier forms the user statement and the reply statement with the same identifier into a question-answer pair, that is, the reply statement with the same dialogue identifier is the reply to that user statement.
  • the question-and-answer logic in the historical dialogue data can be guaranteed, and the sentences of the user and AI can be saved separately for easy search.
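As a rough sketch, the two interrelated queues and their dialogue identifiers might be modelled as follows; the field names and the sample utterances are illustrative assumptions, not taken from the application:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    dialogue_id: int    # the same id links a user statement to its reply
    occurred_at: float  # dialogue occurrence time (e.g. a Unix timestamp)
    text: str

# One queue for user statements, one for the AI's replies.
user_queue = [
    Utterance(1, 100.0, "What are your opening hours?"),
    Utterance(2, 160.0, "Is parking available?"),
]
reply_queue = [
    Utterance(1, 105.0, "We open at 9 a.m. every day."),
    Utterance(2, 166.0, "Yes, there is an on-site car park."),
]

def previous_round(queue, current_time):
    """Return the user utterance closest in time before `current_time`."""
    earlier = [u for u in queue if u.occurred_at < current_time]
    return min(earlier, key=lambda u: abs(current_time - u.occurred_at)) if earlier else None

def reply_for(utterance, replies):
    """Pair a user statement with the reply sharing its dialogue identifier."""
    return next((r for r in replies if r.dialogue_id == utterance.dialogue_id), None)

prev = previous_round(user_queue, 200.0)
print(prev.text)                        # "Is parking available?"
print(reply_for(prev, reply_queue).text)
```

Storing the statements in two queues keyed by a shared identifier keeps the question-answer logic intact while letting either side be searched on its own.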
  • rough semantic features can be understood as semantic features that include high-level abstract information in the previous round of speech information.
  • multiple high-order parallel sequences can be obtained by actively constructing a high-level coarse sequence representation and then analyzing it. Then, through the layered structure, the low-order coarse sequence is generated first, so that the information in the multiple high-order parallel sequences flows into the low-order coarse sequence, realizing synchronous extraction of the key information and the rough information in the speech information and allowing information at multiple levels to be presented simultaneously.
  • the model that generates reply sentences can also better remember and understand long-term content, and then generate meaningful replies that are closely related to the topic, improving user experience.
  • this embodiment provides a method for performing rough semantic extraction on voice information according to the previous round of voice information, and obtaining rough semantic features corresponding to the voice information, as shown in FIG. 3 , the method includes:
  • 301 Detect the previous round of voice information, and obtain at least one first word contained in the previous round of voice information.
  • the detection process may be to perform word segmentation after converting the previous round of speech information into text, and then take all the words obtained through the word segmentation processing as the at least one first word.
  • each of the at least one first word may include a word tag, and the word tag may be part-of-speech information of the corresponding first word, for example: noun, verb, named entity, and the like.
  • the named entity information in the text converted from the speech can be extracted through a conditional random field model (Conditional Random Fields, CRF), and the type of the named entity, such as the name of a person or the name of an institution, can be marked through the CRF.
  • a part-of-speech (POS) tagging tool can also be used for noun recognition and extraction, and then the first word containing part-of-speech information can be obtained.
  • at least one first word obtained by word segmentation can be input into a gate recurrent unit (Gate Recurrent Unit, GRU) encoder for encoding to obtain a second hidden layer state feature vector. The second hidden layer state feature vector is then input into a multilayer perceptron (MultiLayer Perceptron, MLP) to obtain a linear output result. Finally, the linear output result is input into the tense classifier to obtain the tense information of the previous round of speech information.
  • the structure of the GRU is shown in Figure 4; it includes a reset gate r_t, an update gate z_t, a candidate memory unit h̃_t, and the current memory unit h_t.
  • the operating logic of the reset gate r_t can be expressed by formula 1: r_t = σ(W_r x_t + U_r h_{t-1} + b_r), where σ is the activation function; W_r and U_r are the parameter matrices corresponding to the reset gate r_t, whose initialized values are random and whose new values are obtained by training the model; b_r is the bias corresponding to the reset gate r_t, which is also trainable.
  • similarly, the update gate satisfies z_t = σ(W_z x_t + U_z h_{t-1} + b_z), where W_z and U_z are the parameter matrices corresponding to the update gate z_t, initialized randomly and learned by training the model, and b_z is the bias corresponding to the update gate z_t, which is also trainable.
  • the candidate memory unit satisfies h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}) + b), where tanh is the activation function; W and U are the parameter matrices corresponding to the candidate memory unit, initialized randomly and learned by training the model; b is the bias corresponding to the candidate memory unit, which is also trainable.
  • the current memory unit is h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t, where the weight z_t is trainable.
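A minimal sketch of a single GRU step, assuming the standard sigmoid/tanh gate formulation with randomly initialized (untrained) parameters standing in for learned weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: reset gate, update gate, candidate memory, current memory."""
    r_t = sigmoid(params["Wr"] @ x_t + params["Ur"] @ h_prev + params["br"])  # reset gate
    z_t = sigmoid(params["Wz"] @ x_t + params["Uz"] @ h_prev + params["bz"])  # update gate
    h_cand = np.tanh(params["W"] @ x_t + params["U"] @ (r_t * h_prev) + params["b"])
    return z_t * h_prev + (1.0 - z_t) * h_cand  # current memory unit h_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # illustrative input and hidden sizes
params = {k: rng.standard_normal((d_h, d_in if k in ("Wr", "Wz", "W") else d_h))
          for k in ("Wr", "Ur", "Wz", "Uz", "W", "U")}
params.update({k: rng.standard_normal(d_h) for k in ("br", "bz", "b")})

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # encode a 5-step input sequence
    h = gru_step(x, h, params)
print(h.shape)  # (3,) — a hidden layer state feature vector
```

Because h_t is a convex combination of h_{t-1} and a tanh output, every component of the hidden state stays in [-1, 1].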
  • the structure of the MLP is shown in Figure 5. It consists of two linear layers (Linear) and a ReLU activation function. After the last linear layer outputs the linear output result, that result is input into the softmax function for multi-label classification, and the tense classifier finally judges the tense of the current sentence.
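A minimal sketch of the two-linear-layer MLP with a ReLU between the layers and a softmax over hypothetical tense labels; the dimensions, the number of tense classes, and the random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def mlp_tense_scores(h, W1, b1, W2, b2):
    """Linear -> ReLU -> Linear, then softmax over tense labels (Figure 5)."""
    hidden = np.maximum(0.0, W1 @ h + b1)  # first linear layer + ReLU
    logits = W2 @ hidden + b2              # second linear layer (linear output result)
    return softmax(logits)

rng = np.random.default_rng(1)
d_h, d_mid, n_tense = 3, 8, 4  # 4 illustrative tense classes
probs = mlp_tense_scores(
    rng.standard_normal(d_h),
    rng.standard_normal((d_mid, d_h)), rng.standard_normal(d_mid),
    rng.standard_normal((n_tense, d_mid)), rng.standard_normal(n_tense))
tense = int(np.argmax(probs))  # index of the predicted tense label
print(probs.sum())             # sums to 1.0
```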
  • the voice message "I'm running" is in the present continuous tense, but because it does not contain standalone tense-marking words such as "guo" (过), "zhe" (着) and "le" (了), it would be missed by the traditional recognition method.
  • the second word is the first word in the word tag with the temporal information of the corresponding voice information added.
  • the second word carries the corresponding part-of-speech information and tense information of the speech on the basis of carrying the information of the speech itself, so that the reply sentences generated subsequently are more accurate.
  • the coarse encoder may be a GRU encoder.
  • the encoder outputs corresponding rough context information and a first hidden layer state feature vector.
  • each second word in the encoder corresponds to a hidden layer state feature vector, that is, the first hidden layer state feature vector.
  • the rough context information can be input into the decoder, and the decoder will calculate the similarity between the feature vector of the current decoding process (the output of the current decoding process of the decoder) and the hidden layer state features decoded from the input rough context information.
  • a similarity value is calculated for each piece of rough context information, and then these similarities are normalized to obtain a weight corresponding to each piece of rough context information.
  • this embodiment provides a method of inputting at least one rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of speech information. As shown in FIG. 6 , the method includes:
  • i is an integer greater than or equal to 1 and less than or equal to j
  • j is the number of at least one rough context information
  • j is an integer greater than or equal to 1
  • the input feature vector A i is the first rough context information in at least one rough context information.
  • the similarity D i can be obtained by calculating the cosine similarity between the output feature vector B i and the ith first hidden layer state feature vector C i .
  • the similarity degree D i can be input into the softmax function for normalization processing to obtain the weight E i of the input feature vector A i .
  • the output at the previous moment is used as the input at the next moment; the final output obtained after multiple decoding processes is the rough semantic feature of the speech information.
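The similarity calculation and normalisation described above (D_i and E_i) can be sketched as follows, assuming cosine similarity and toy random vectors in place of real decoder outputs:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity D_i between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_weights(output_feats, hidden_feats):
    """Compute D_i = cos(B_i, C_i) for each rough context, then
    softmax-normalise the similarities into weights E_i."""
    sims = np.array([cosine(b, c) for b, c in zip(output_feats, hidden_feats)])
    e = np.exp(sims - sims.max())
    return e / e.sum()

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 4))  # output feature vectors B_i, j = 3 rough contexts
C = rng.standard_normal((3, 4))  # first hidden layer state feature vectors C_i
E = context_weights(B, C)
print(E)  # three positive weights summing to 1
```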
  • the voice information may be converted into text, and then the text may be segmented to obtain at least one first keyword. Then, at least one second keyword is obtained by combining any two different first adjacent words and second adjacent words in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is smaller than the first threshold.
  • the first adjacent word and the second adjacent word are any two different adjacent first keywords whose field interval is smaller than the first threshold, and the field interval can be understood as the number of characters between the corresponding positions of the first adjacent word and the second adjacent word in the text.
  • the first keywords can be obtained: “Shanghai”, “2016”, “Disney”, “Paradise”, “Pudong” and “New District”.
  • the number of characters between the corresponding positions of the first keyword "2016” and “Disney” in the text is 3, so the character distance between the first keyword “2016” and “Disney” is 3.
  • the number of characters between the corresponding positions of the first keyword “Disney” and “Paradise” in the text is 0, so the character distance between the first keyword “Disney” and “Paradise” is 0.
  • the first threshold can be set to 1.
  • the first keywords that meet the requirement are “Disney” and “Paradise”, and “Pudong” and “New District”.
  • the second keywords “Disneyland” and “Pudong New Area” can thus be obtained.
  • each second keyword in the at least one second keyword is matched with a preset entity library, and second keywords that fail to be matched are screened out to obtain at least one third keyword.
  • the first keyword constituting each third keyword in the at least one third keyword is deleted to obtain at least one fourth keyword.
  • the fourth keyword is the remaining first keyword after removing the first keyword constituting each third keyword in the at least one third keyword.
  • the determined third keyword is "Disneyland”
  • the third keyword “Disneyland” is composed of the first keywords "Disney” and “Paradise”
  • the first keywords "Disney” and “Paradise” are removed from the original several first keywords: “Shanghai”, “2016”, “Disney”, “Paradise”, “Pudong” and “New District”
  • the remaining first keywords “Shanghai”, “2016”, “Pudong” and “New District” are the fourth keywords.
  • At least one third keyword and at least one fourth keyword are combined to obtain a keyword group.
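A toy sketch of the keyword-group construction described above; the character positions, the entity library and the English concatenations stand in for the original Chinese text and are purely illustrative:

```python
def build_keyword_group(first_keywords, positions, entity_library, threshold=1):
    """Combine adjacent first keywords whose character distance is below
    `threshold`, keep combinations found in the entity library (the third
    keywords), and keep the unused first keywords (the fourth keywords)."""
    third, used = [], set()
    pairs = zip(zip(first_keywords, positions), zip(first_keywords[1:], positions[1:]))
    for (w1, p1), (w2, p2) in pairs:
        gap = p2 - (p1 + len(w1))      # characters between the two words
        candidate = w1 + w2
        if gap < threshold and candidate in entity_library:
            third.append(candidate)    # matched against the preset entity library
            used.update({w1, w2})
    fourth = [w for w in first_keywords if w not in used]
    return third + fourth              # the keyword group

words = ["Shanghai", "2016", "Disney", "Paradise", "Pudong", "New District"]
# Illustrative character positions in the source text (chosen so that
# "2016"-"Disney" are 3 apart and the two entity pairs are 0 apart).
pos = [0, 9, 16, 22, 31, 37]
library = {"DisneyParadise", "PudongNew District"}  # toy stand-in entity library
group = build_keyword_group(words, pos, library)
print(group)
```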
  • the keyword group may include at least one keyword, and the at least one keyword is arranged according to the order of each keyword in the at least one keyword in the voice information. Based on this, this embodiment provides a method for performing multiple hidden feature extraction processing on keyword groups to obtain the initial hidden layer state feature vector, specifically as follows:
  • the first input hidden feature H n is input to the GRU encoder to obtain the first output hidden feature I n , wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of at least one keyword, and m is an integer greater than or equal to 1.
  • the input word vector K p , the second input hidden feature L p and the rough semantic feature can be input into the gated recurrent unit decoder to obtain the reply word O p and the second output hidden feature R p , wherein p is an integer greater than or equal to 1 and less than or equal to q, and q is an integer greater than or equal to 1 determined by the voice information.
  • word embedding is performed on the reply word O p to obtain the reply word vector S p .
  • the reply word vector S p is used as the input word vector K p +1 of the p+1-th reply word generation process
  • the second output hidden feature R p is used as the second input hidden feature L p +1 of the p+1-th reply word generation process for the p+1-th reply word generation process until at least one reply word is obtained after multiple reply word generation processes.
  • the generation process generates one reply word at a time: the reply word O p is generated at the p-th time, and the reply word O p+1 is then generated at the p+1-th time.
  • the word vector of the reply word O p generated last time (that is, the pth time) is also used as one of the inputs of the p+1th time.
  • the other input is the rough semantic feature, that is, the reply word O p+1 is determined by the word vector of the reply word O p , the second output hidden feature R p generated at the p-th time, and the rough semantic feature.
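The reply-word generation loop described above can be sketched as follows; `toy_decoder` and its canned reply are stand-ins for the trained gated recurrent unit decoder, and the stop token is an assumption:

```python
def generate_reply(rough_semantic, h0, decoder_step, start_vec, max_words=10):
    """Autoregressive loop: each step feeds the previous reply word's vector
    K_p, the previous hidden feature L_p and the fixed rough semantic feature
    into the decoder, which returns (reply word O_p, word vector S_p, hidden R_p).
    S_p becomes K_{p+1} and R_p becomes L_{p+1}."""
    words, k, h = [], start_vec, h0
    for _ in range(max_words):
        word, word_vec, h = decoder_step(k, h, rough_semantic)
        if word == "<eos>":   # stop token ends generation
            break
        words.append(word)
        k = word_vec          # S_p is the next step's input word vector
    return " ".join(words)    # splice the reply words in generation order

# Toy decoder that "decodes" a canned reply one word at a time.
canned = ["the", "park", "opens", "at", "nine", "<eos>"]
state = {"i": 0}
def toy_decoder(k, h, rough):
    w = canned[state["i"]]
    state["i"] += 1
    return w, [float(len(w))], h

reply = generate_reply(rough_semantic=None, h0=[0.0],
                       decoder_step=toy_decoder, start_vec=[0.0])
print(reply)  # → "the park opens at nine"
```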
  • the method for determining reply sentences based on rough semantics provided by this application, by obtaining the previous round of voice information of the user's current voice information, and then performing rough semantic extraction on the previous round of voice information, the semantic features that can contain the high-level abstract information in the previous round of voice information are obtained, which are used as the rough semantic features of the user's current voice information, thereby realizing synchronous extraction of key information and rough information in the previous round of voice information. Then, word segmentation is performed on the voice information of the user at the current moment, and multiple hidden feature extraction processes are performed on the obtained multiple keywords to obtain the initial hidden layer state feature vector of the voice information of the user at the current moment.
  • multiple reply words are generated according to the rough semantic features and the initial hidden layer state feature vector, and the obtained at least one reply word is spliced according to the generation order of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
  • the rough semantic features containing both key information and rough information in the previous round of dialogue are used as one of the basis for generating reply sentences in the current round of dialogue, so that the reply sentence generation process includes more comprehensive information features of the previous round of dialogue.
  • the generated reply sentences are more accurate, can better fit with the main body of the dialogue, and improve user experience.
  • FIG. 8 is a block diagram of functional modules of an apparatus for determining reply sentences based on rough semantics provided in an embodiment of the present application.
  • the device 800 for determining a reply sentence based on rough semantics includes:
  • the acquisition module 801 is used to acquire, according to the occurrence time of the voice information at the user's current moment, the previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;
  • the processing module 802 is used to perform rough semantic extraction on the voice information according to the previous round of voice information, obtain rough semantic features corresponding to the voice information, perform word segmentation processing on the voice information, obtain keyword groups, and perform multiple hidden feature extraction processing on the keyword groups to obtain initial hidden layer state feature vectors;
  • the generation module 803 is used to perform multiple reply word generation processing according to the rough semantic feature and the initial hidden layer state feature vector to obtain at least one reply word, and splice at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
  • in performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, the processing module 802 is specifically used for:
  • inputting at least one second word into a rough encoder for encoding to obtain at least one piece of rough context information and at least one first hidden layer state feature vector, wherein the at least one piece of rough context information corresponds one-to-one with the at least one second word, and the at least one first hidden layer state feature vector corresponds one-to-one with the at least one second word;
  • the processing module 802 is specifically used to:
  • the processing module 802 is specifically used to:
  • the target output feature vector G i is used as the input feature vector A i+1 of the (i+1)th decoding process, which is then performed, until multiple decoding processes have been performed and the rough semantic features of the voice information are obtained.
  • the keyword group includes at least one keyword, and the at least one keyword is arranged according to the sequence of each keyword in the at least one keyword in the voice information.
  • the processing module 802 is specifically used for:
  • the first output hidden feature I n is used as the first input hidden feature H n+1 of the (n+1)th hidden feature extraction process, which is then performed, until the hidden feature extraction process has been performed multiple times and the initial hidden layer state feature vector is obtained.
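The recurrence described above amounts to folding a hidden feature extractor over the keyword group in order. The sketch below is a hypothetical illustration only: `extract_hidden` is a placeholder for the real extractor, and the initial state and vector contents are invented.

```python
# Hypothetical sketch: the first output hidden feature I_n of the nth
# extraction becomes the first input hidden feature H_{n+1} of the (n+1)th
# extraction; the final state is the initial hidden layer state feature
# vector passed to the reply decoder.

def extract_hidden(word_vec, h):
    """Toy hidden feature extraction step combining a keyword vector with h."""
    return [(a + b) / 2.0 for a, b in zip(word_vec, h)]

def initial_hidden_state(keyword_vectors, h1):
    """Fold over the keyword group in the order the keywords appear."""
    h = h1
    for vec in keyword_vectors:     # keywords in their order in the utterance
        h = extract_hidden(vec, h)  # I_n becomes H_{n+1}
    return h
```

Because each step consumes the previous step's output, the final vector reflects the keywords in sequence, matching the ordering requirement stated for the keyword group.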
  • the generating module 803 is specifically used for:
  • the input word vector K p is input into the gated recurrent unit decoder to obtain the reply word O p and the second output hidden feature R p , wherein p is an integer greater than or equal to 1 and less than or equal to q, and q is an integer greater than or equal to 1 that is determined by the voice information.
  • the reply word vector S p is used as the input word vector K p+1 of the (p+1)th reply word generation process, and the second output hidden feature R p is used as the second input hidden feature L p+1 of the (p+1)th reply word generation process, which is then performed, until at least one reply word is obtained after multiple reply word generation processes.
  • the processing module 802 is specifically used for:
  • the first adjacent word and the second adjacent word are merged to obtain at least one second keyword, wherein the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is less than the first threshold;
  • the first keywords forming each third keyword in the at least one third keyword are deleted to obtain at least one fourth keyword;
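The adjacent-word merging step above can be sketched as follows. This is a hypothetical illustration: the representation of keyword positions, the meaning of "field interval" as the gap between one keyword's end and the next keyword's start, and the threshold value are all assumptions, not the patent's definitions.

```python
# Hypothetical sketch: any two first keywords whose field interval (assumed
# here to be the gap between positions in the text) is below a threshold are
# merged into a single second keyword.

def merge_adjacent(keywords, threshold):
    """keywords: list of (word, start_index) pairs sorted by position."""
    merged = []
    i = 0
    while i < len(keywords):
        word, start = keywords[i]
        # Merge forward while the next keyword starts close enough to the
        # end of the keyword accumulated so far.
        while (i + 1 < len(keywords)
               and keywords[i + 1][1] - (start + len(word)) < threshold):
            word += keywords[i + 1][0]
            i += 1
        merged.append(word)
        i += 1
    return merged
```

For example, two fragments that sit side by side in the utterance collapse into one keyword, while a distant word stays separate.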
  • FIG. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • an electronic device 900 includes a transceiver 901, a processor 902 and a memory 903, which are connected through a bus 904.
  • the memory 903 is used to store computer programs and data, and can transmit the data stored in the memory 903 to the processor 902 .
  • the processor 902 is used to read the computer program in the memory 903 to perform the following operations:
  • the previous round of voice information adjacent to the voice information is obtained, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;
  • the at least one reply word is spliced according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
  • the processor 902 is specifically configured to perform the following operations in terms of performing rough semantic extraction on the voice information based on the previous round of voice information to obtain rough semantic features corresponding to the voice information:
  • at least one second word is input into the rough encoder for encoding to obtain at least one piece of rough context information and at least one first hidden layer state feature vector, wherein the at least one piece of rough context information corresponds one-to-one with the at least one second word, and the at least one first hidden layer state feature vector corresponds one-to-one with the at least one second word;
  • in determining the temporal information of the previous round of voice information according to the at least one first word, the processor 902 is specifically configured to perform the following operations:
  • the processor 902 is specifically configured to perform the following operations in terms of inputting at least one rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of speech information:
  • the target output feature vector G i is used as the input feature vector A i+1 of the (i+1)th decoding process, which is then performed, until multiple decoding processes have been performed and the rough semantic features of the voice information are obtained.
  • the keyword group includes at least one keyword, and the at least one keyword is arranged according to the sequence of each keyword in the at least one keyword in the voice information.
  • the processor 902 is specifically configured to perform the following operations in performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector:
  • the first output hidden feature I n is used as the first input hidden feature H n+1 of the (n+1)th hidden feature extraction process, which is then performed, until the hidden feature extraction process has been performed multiple times and the initial hidden layer state feature vector is obtained.
  • the processor 902 is specifically configured to perform the following operations in terms of generating at least one reply word based on the rough semantic feature and the initial hidden layer state feature vector for multiple times of reply word generation:
  • the input word vector K p is input into the gated recurrent unit decoder to obtain the reply word O p and the second output hidden feature R p , wherein p is an integer greater than or equal to 1 and less than or equal to q, and q is an integer greater than or equal to 1 that is determined by the voice information.
  • the reply word vector S p is used as the input word vector K p+1 of the (p+1)th reply word generation process, and the second output hidden feature R p is used as the second input hidden feature L p+1 of the (p+1)th reply word generation process, which is then performed, until at least one reply word is obtained after multiple reply word generation processes.
  • the processor 902 is specifically configured to perform the following operations in performing word segmentation processing on the speech information to obtain keyword groups:
  • the first adjacent word and the second adjacent word are merged to obtain at least one second keyword, wherein the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is less than the first threshold;
  • the first keywords forming each third keyword in the at least one third keyword are deleted to obtain at least one fourth keyword;
  • the apparatus for determining reply sentences based on rough semantics in the present application may include smart phones (such as Android phones, iOS phones, Windows Phone phones, etc.), tablet computers, palmtop computers, notebook computers, mobile Internet devices (MID), robots, wearable devices, and the like.
  • the above devices for determining a reply sentence based on rough semantics are merely examples, not an exhaustive list; the apparatus includes, but is not limited to, the above devices.
  • the apparatus for determining reply sentences based on rough semantics may also include: intelligent vehicle-mounted terminals, computer equipment, and the like.
  • the embodiments of the present application also provide a computer-readable storage medium, which stores a computer program; the computer program is executed by a processor to implement some or all of the steps of any method for determining a reply sentence based on rough semantics described in the above method embodiments.
  • the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the embodiment of the present application also provides a computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause the computer to execute some or all of the steps of any method for determining a reply sentence based on rough semantics as described in the above-mentioned method embodiments.

Abstract

Disclosed in the present application are a reply statement determination method and apparatus based on rough semantics, and an electronic device. The method comprises: according to an occurrence time of speech information of a user at the current moment, acquiring a previous round of speech information adjacent to the speech information; according to the previous round of speech information, performing rough semantic extraction on the speech information, so as to obtain a rough semantic feature corresponding to the speech information; performing word segmentation processing on the speech information, so as to obtain a keyword group; performing a plurality of instances of hidden feature extraction processing on the keyword group, so as to obtain an initial hidden layer state feature vector; according to the rough semantic feature and the initial hidden layer state feature vector, performing a plurality of instances of reply word generation processing, so as to obtain at least one reply word; and splicing the at least one reply word according to a generation sequence of each reply word from among the at least one reply word, so as to obtain a reply statement of the speech information.

Description

Method, apparatus and electronic device for determining a reply sentence based on rough semantics

Priority statement

This application claims priority to Chinese patent application No. 202210083351.8, entitled "Method, apparatus and electronic device for determining a reply sentence based on rough semantics", filed with the China Patent Office on January 22, 2022, the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the technical field of artificial intelligence, and in particular to a method, apparatus and electronic device for determining a reply sentence based on rough semantics.

Background

At present, a traditional dialogue model usually encodes the previous round of dialogue text, uses the hidden layer state features of the resulting encoded information as one of the inputs of a decoder, and then automatically generates a dialogue reply through the decoder in time sequence. In this method, the hidden layer state features encoded from the previous round of dialogue text serve as one of the bases for generating the reply sentence in the current round of dialogue, so that the reply sentence generation process incorporates the information features of the previous round of dialogue.

However, the inventor realized that in the traditional solution, in order to enable the model to construct reply sentences around the key information in the dialogue, the extracted features tend to focus on the key information of the previous round of dialogue; during actual extraction, this key information is extracted as features, while some rough information in the dialogue is often discarded. This overlooks the fact that, in some texts, rough information better reflects the real focus of the dialogue, resulting in less accurate reply sentences.

Summary of the invention

To solve the above problems in the prior art, the embodiments of this application provide a method, apparatus and electronic device for determining a reply sentence based on rough semantics, which can simultaneously extract the key information and the rough information of the previous round of dialogue, thereby making the generated reply sentences more accurate.

In a first aspect, an embodiment of this application provides a method for determining a reply sentence based on rough semantics, including:
according to the occurrence time of the user's voice information at the current moment, acquiring the previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;

performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information;

performing word segmentation on the voice information to obtain a keyword group;

performing multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector;

performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word;

splicing the at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
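The six steps of the method above can be summarized as a single pipeline. The sketch below is purely illustrative: every helper function is a named placeholder with a trivial toy body, not the patent's implementation, and the utterance representation (dicts with `time` and `text` keys) is an assumption made for the example.

```python
# Hypothetical end-to-end sketch of the six claimed steps; all helpers are
# toy placeholders standing in for the real models described in the patent.

def extract_rough_semantics(cur_text, prev_text):
    return [len(cur_text) % 7, len(prev_text) % 7]   # placeholder feature

def segment(text):
    return text.split()                              # placeholder segmentation

def initial_hidden_state(keywords):
    return [float(len(keywords))]                    # placeholder fold

def generate_reply_words(rough, h0):
    return ["ok"] * max(1, int(h0[0]) % 3 + 1)       # placeholder generator

def determine_reply(current_utterance, history):
    # 1. Acquire the adjacent previous round: the most recent utterance
    #    whose occurrence time precedes the current one.
    previous = max((u for u in history if u["time"] < current_utterance["time"]),
                   key=lambda u: u["time"])
    # 2. Rough semantic extraction conditioned on the previous round.
    rough = extract_rough_semantics(current_utterance["text"], previous["text"])
    # 3. Word segmentation into a keyword group.
    keywords = segment(current_utterance["text"])
    # 4. Multiple hidden feature extractions -> initial hidden state vector.
    h0 = initial_hidden_state(keywords)
    # 5. Multiple reply word generation passes.
    reply_words = generate_reply_words(rough, h0)
    # 6. Splice in generation order (no separator, as for Chinese text).
    return "".join(reply_words)
```

Step 1 encodes the "adjacent previous round" condition directly: among all earlier utterances, the one with the maximum occurrence time minimizes the absolute time difference.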
In a second aspect, an embodiment of this application provides an apparatus for determining a reply sentence based on rough semantics, including:

an acquisition module, configured to acquire, according to the occurrence time of the user's voice information at the current moment, the previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;

a processing module, configured to perform rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, perform word segmentation on the voice information to obtain a keyword group, and perform multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector;

a generation module, configured to perform multiple reply word generation processes according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word, and splice the at least one reply word according to the generation order of each reply word to obtain a reply sentence of the voice information.
In a third aspect, an embodiment of this application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the one or more programs include instructions for performing the following steps:

according to the occurrence time of the user's voice information at the current moment, acquiring the previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;

performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information;

performing word segmentation on the voice information to obtain a keyword group;

performing multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector;

performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word;

splicing the at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, and the computer program is executed by a processor to implement the following steps:

according to the occurrence time of the user's voice information at the current moment, acquiring the previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than that of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;

performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information;

performing word segmentation on the voice information to obtain a keyword group;

performing multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector;

performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer state feature vector to obtain at least one reply word;

splicing the at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.

In a fifth aspect, an embodiment of this application provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program; the computer program is operable to cause a computer to execute the method of the first aspect.
Implementing the embodiments of this application has the following beneficial effects:

In the embodiments of this application, the previous round of voice information preceding the user's current voice information is acquired, and rough semantic extraction is performed on it to obtain semantic features containing the high-level abstract information of that previous round of voice information; these serve as the rough semantic features of the user's current voice information, thereby achieving synchronous extraction of the key information and the rough information of the previous round of voice information. Then, word segmentation is performed on the user's current voice information, and multiple hidden feature extraction processes are applied to the resulting keywords to obtain the initial hidden layer state feature vector of the user's current voice information. Finally, multiple reply word generation processes are performed according to the rough semantic features and the initial hidden layer state feature vector, and the resulting at least one reply word is spliced according to the generation order of each reply word to obtain a reply sentence of the voice information. On this basis, the rough semantic features, which contain both the key information and the rough information of the previous round of dialogue, are used as one of the bases for generating the reply sentence in the current round of dialogue, so that the reply sentence generation process incorporates more comprehensive information features of the previous round of dialogue. As a result, the generated reply sentences are more accurate, fit the subject of the dialogue better, and improve the user experience.
Brief description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of the hardware structure of an apparatus for determining a reply sentence based on rough semantics provided by an embodiment of this application;

FIG. 2 is a schematic flowchart of a method for determining a reply sentence based on rough semantics provided by an embodiment of this application;

FIG. 3 is a schematic flowchart of a method for performing rough semantic extraction on voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, provided by an embodiment of this application;

FIG. 4 is a schematic structural diagram of a gated recurrent unit encoder provided by an embodiment of this application;

FIG. 5 is a schematic structural diagram of a multi-layer perceptron provided by an embodiment of this application;

FIG. 6 is a schematic flowchart of a method for inputting at least one piece of rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of voice information, provided by an embodiment of this application;

FIG. 7 is a block flow diagram of a reply word generation process provided by an embodiment of this application;

FIG. 8 is a block diagram of the functional modules of an apparatus for determining a reply sentence based on rough semantics provided by an embodiment of this application;

FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
具体实施方式Detailed ways
下面将结合本申请实施方式中的附图,对本申请实施方式中的技术方案进行清楚、完整地描述,显然,所描述的实施方式是本申请一部分实施方式,而不是全部的实施方式。基于本申请中的实施方式,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施方式,都属于本申请保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the application with reference to the drawings in the embodiments of the application. Apparently, the described embodiments are part of the embodiments of the application, not all of them. Based on the implementation manners in this application, all other implementation manners obtained by persons of ordinary skill in the art without creative efforts shall fall within the scope of protection of this application.
首先,参阅图1,图1为本申请实施方式提供的一种基于粗糙语义的回复语句确定装置的硬件结构示意图。该基于粗糙语义的回复语句确定装置100包括至少一个处理器101,通信线路102,存储器103以及至少一个通信接口104。First, please refer to FIG. 1 , which is a schematic diagram of a hardware structure of an apparatus for determining a reply sentence based on rough semantics provided by an embodiment of the present application. The apparatus 100 for determining reply sentences based on rough semantics includes at least one processor 101 , a communication line 102 , a memory 103 and at least one communication interface 104 .
在本实施方式中,处理器101,可以是一个通用中央处理器(central processing unit,CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制本申请方案程序执行的集成电路。In this embodiment, the processor 101 may be a general-purpose central processing unit (central processing unit, CPU), microprocessor, application-specific integrated circuit (application-specific integrated circuit, ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
通信线路102,可以包括一通路,在上述组件之间传送信息。 Communication line 102, which may include a path, transmits information between the aforementioned components.
通信接口104,可以是任何收发器一类的装置(如天线等),用于与其他设备或通信网络通信,例如以太网,RAN,无线局域网(wireless local area networks,WLAN)等。The communication interface 104 may be any device such as a transceiver (such as an antenna) for communicating with other devices or communication networks, such as Ethernet, RAN, wireless local area networks (wireless local area networks, WLAN) and the like.
存储器103,可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。The memory 103 may be a read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, a random access memory (random access memory, RAM) or other types of dynamic storage devices capable of storing information and instructions, or may be an electrically erasable programmable read-only memory (EEPROM), a compact disc (compact disc) read-only memory, CD-ROM) or other optical disc storage, optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, without limitation.
在本实施方式中,存储器103可以独立存在,通过通信线路102与处理器101相连接。存储器103也可以和处理器101集成在一起。本申请实施方式提供的存储器103通常可以具有非易失性。其中,存储器103用于存储执行本申请方案的计算机执行指令,并由处理器101来控制执行。处理器101用于执行存储器103中存储的计算机执行指令,从而实现本申请下述实施方式中提供的方法。In this embodiment, the memory 103 may exist independently and be connected to the processor 101 through the communication line 102 . The memory 103 can also be integrated with the processor 101 . The memory 103 provided in this embodiment of the present application may generally be non-volatile. Wherein, the memory 103 is used to store computer-executed instructions for implementing the solutions of the present application, and the execution is controlled by the processor 101 . The processor 101 is configured to execute computer-executed instructions stored in the memory 103, so as to implement the methods provided in the following embodiments of the present application.
在可选的实施方式中,计算机执行指令也可以称之为应用程序代码,本申请对此不作具体限定。In an optional implementation manner, computer-executed instructions may also be referred to as application code, which is not specifically limited in the present application.
在可选的实施方式中,处理器101可以包括一个或多个CPU,例如图1中的CPU0和CPU1。In an optional implementation manner, the processor 101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 1 .
在可选的实施方式中,该基于粗糙语义的回复语句确定装置100可以包括多个处理器,例如图1中的处理器101和处理器107。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。In an optional implementation manner, the apparatus 100 for determining a reply sentence based on rough semantics may include multiple processors, such as the processor 101 and the processor 107 in FIG. 1 . Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
在可选的实施方式中，若基于粗糙语义的回复语句确定装置100为服务器(例如，可以是独立的服务器，也可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network，CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器)，则基于粗糙语义的回复语句确定装置100还可以包括输出设备105和输入设备106。输出设备105和处理器101通信，可以以多种方式来显示信息。例如，输出设备105可以是液晶显示器(liquid crystal display，LCD)，发光二极管(light emitting diode，LED)显示设备，阴极射线管(cathode ray tube，CRT)显示设备，或投影仪(projector)等。输入设备106和处理器101通信，可以以多种方式接收用户的输入。例如，输入设备106可以是鼠标、键盘、触摸屏设备或传感设备等。In an optional embodiment, if the apparatus 100 for determining a reply sentence based on rough semantics is a server (for example, an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms), the apparatus 100 may further include an output device 105 and an input device 106. The output device 105 communicates with the processor 101 and can display information in a variety of ways. For example, the output device 105 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 106 communicates with the processor 101 and can receive user input in various ways. For example, the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device.
上述的基于粗糙语义的回复语句确定装置100可以是一个通用设备或者是一个专用设备。本申请实施方式不限定基于粗糙语义的回复语句确定装置100的类型。The apparatus 100 for determining a reply sentence based on rough semantics may be a general-purpose device or a special-purpose device. The embodiment of the present application does not limit the type of the device 100 for determining a reply sentence based on rough semantics.
其次,需要说明的是,本申请所公开的实施方式可以基于人工智能技术对相关的数据进行获取和处理。其中,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。Secondly, it should be noted that the embodiments disclosed in this application can acquire and process relevant data based on artificial intelligence technology. Among them, artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
最后，本申请中的基于粗糙语义的回复语句确定方法可以应用到电话咨询、电商销售、线下实体销售、业务推广、坐席电话外呼、社交平台推广等场景。本申请中主要以电话咨询场景为例说明该基于粗糙语义的回复语句确定方法，其他场景中的基于粗糙语义的回复语句确定方法与电话咨询场景下的实现方式类似，在此不再赘述。Finally, the method for determining a reply sentence based on rough semantics in this application can be applied to scenarios such as telephone consultation, e-commerce sales, offline physical sales, business promotion, outbound agent calls, and promotion on social platforms. In this application, the telephone consultation scenario is mainly used as an example to illustrate the method; the implementation in other scenarios is similar to that in the telephone consultation scenario and will not be repeated here.
以下,将对本申请所公开的基于粗糙语义的回复语句确定方法进行说明:In the following, the method for determining reply sentences based on rough semantics disclosed in this application will be described:
参阅图2,图2为本申请实施方式提供的一种基于粗糙语义的回复语句确定方法的流程示意图。该基于粗糙语义的回复语句确定方法包括以下步骤:Referring to FIG. 2 , FIG. 2 is a schematic flowchart of a method for determining a reply sentence based on rough semantics provided in an embodiment of the present application. The method for determining reply sentences based on rough semantics includes the following steps:
201:根据用户当前时刻的语音信息的发生时间,获取与语音信息相邻的前一轮语音信息。201: Acquire the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the current moment of the user.
在本实施方式中，该前一轮语音信息的发生时间早于语音信息的发生时间，且前一轮语音信息的发生时间与语音信息的发生时间之间的差值的绝对值最小。简单而言，该前一轮语音信息即为用户在说出当前时刻的语音信息之前说的上一句话。In this embodiment, the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest. In simple terms, the previous round of voice information is the last sentence spoken by the user before the voice information at the current moment.
示例性的，可以通过用户当前时刻的语音信息的发生时间，查询记录该语音信息所属的对话事件在当前时刻前所产生的对话数据的历史对话数据，来确定该前一轮语音信息。具体而言，历史对话数据中可以保存两条相互关联的语句队列，其中，一条用于存储用户所发出的用户语句，另一条用于存储AI对用户的语句做出的回复语句。同时，用户语句队列中的每个用户语句，以及回复语句队列中的每个回复语句均包含有对话标识和对话发生时间，通过对话标识将标识相同的用户语句和回复语句组成一个问答对，即对话标识相同的回复语句为对用户语句的答复。由此，既可以保证历史对话数据中的问答逻辑性，同时将用户和AI的语句分开保存，便于查找。Exemplarily, the previous round of voice information can be determined by using the occurrence time of the user's voice information at the current moment to query historical dialogue data, which records the dialogue data generated before the current moment by the dialogue event to which the voice information belongs. Specifically, two interrelated sentence queues can be stored in the historical dialogue data: one queue stores the user sentences uttered by the user, and the other stores the reply sentences made by the AI to the user's sentences. Each user sentence in the user sentence queue and each reply sentence in the reply sentence queue includes a dialogue identifier and a dialogue occurrence time; a user sentence and a reply sentence with the same identifier form a question-answer pair, that is, the reply sentence with the same dialogue identifier is the reply to that user sentence. In this way, the question-and-answer logic of the historical dialogue data is guaranteed, while the sentences of the user and the AI are stored separately for easy retrieval.
因此，在本实施方式中，可以通过查询用户语句队列，确定对话发生时间早于用户当前时刻的语音信息的发生时间，且发生时间与该语音信息的发生时间之间的差的绝对值最小的语音信息，作为前一轮语音信息。Therefore, in this embodiment, by querying the user sentence queue, the voice information whose dialogue occurrence time is earlier than the occurrence time of the user's voice information at the current moment, and whose absolute value of the difference from that occurrence time is the smallest, can be determined as the previous round of voice information.
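As a concrete illustration of the queue lookup described above, the following Python sketch shows one way the previous round could be retrieved from the user sentence queue. The `Utterance` structure and its field names are hypothetical illustrations, not part of the application itself.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    dialogue_id: str   # dialogue identifier linking a user sentence to its reply
    time: float        # occurrence time of the sentence
    text: str

def previous_round(user_queue: List[Utterance], current_time: float) -> Optional[Utterance]:
    """Return the user utterance whose occurrence time is earlier than
    `current_time` and closest to it, i.e. the previous round of voice
    information; None if no earlier utterance exists."""
    earlier = [u for u in user_queue if u.time < current_time]
    if not earlier:
        return None
    return min(earlier, key=lambda u: current_time - u.time)
```

Because replies are stored in a separate, identifier-linked queue, the same lookup could retrieve the matching reply by `dialogue_id` without scanning the user queue again.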
202:根据前一轮语音信息对语音信息进行粗糙语义提取,得到对应于语音信息的粗糙语义特征。202: Perform rough semantic extraction on the voice information according to the previous round of voice information, and obtain rough semantic features corresponding to the voice information.
在本实施方式中，粗糙语义特征可以理解为包含前一轮语音信息中高层次的抽象信息的语义特征。示例性的，可以通过主动构造一个高阶粗糙序列表示(high level coarse sequence representation)，继而对该高阶粗糙序列表示进行分析，得到多个高阶平行序列。再通过分层结构，先生成低阶粗糙(coarse)序列，让多个高阶平行序列中的信息流向低阶粗糙序列，实现对语音信息中的关键信息和粗糙信息的同步提取，使多个层次的信息可以同步体现。同时，转化为低阶粗糙序列后，生成回复语句的模型也能够更好地对长期内容进行记忆和理解，继而生成与主题密切相关的有意义的回复，提升用户体验。In this embodiment, rough semantic features can be understood as semantic features that contain the high-level abstract information of the previous round of speech information. Exemplarily, a high-level coarse sequence representation can be actively constructed and then analyzed to obtain multiple high-level parallel sequences. Then, through a layered structure, a low-level coarse sequence is first generated so that the information in the multiple high-level parallel sequences flows into the low-level coarse sequence, realizing synchronous extraction of the key information and the rough information in the speech information, so that information at multiple levels can be reflected simultaneously. Meanwhile, after the conversion into the low-level coarse sequence, the model that generates reply sentences can better memorize and understand long-term content, and thus generate meaningful replies closely related to the topic, improving user experience.
示例性的，本实施方式提供了一种根据前一轮语音信息对语音信息进行粗糙语义提取，得到对应于语音信息的粗糙语义特征的方法，如图3所示，该方法包括：Exemplarily, this embodiment provides a method for performing rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information. As shown in FIG. 3 , the method includes:
301:对前一轮语音信息进行检测,得到前一轮语音信息包含的至少一个第一词语。301: Detect the previous round of voice information, and obtain at least one first word contained in the previous round of voice information.
在本实施方式中，检测过程可以是将前一轮语音信息进行文本转换后进行分词，继而获取通过分词处理可以得到的所有词语作为该至少一个第一词语。同时，该至少一个第一词语中的每个第一词语可以包括词语标签，该词语标签可以是对应的第一词语的词性信息，例如：名词、动词、命名实体等。In this embodiment, the detection process may be to perform text conversion on the previous round of speech information and then perform word segmentation, and obtain all the words resulting from the word segmentation as the at least one first word. Meanwhile, each of the at least one first word may include a word tag, and the word tag may be the part-of-speech information of the corresponding first word, for example, a noun, a verb, or a named entity.
由此,在本实施方式中,可以通过条件随机场模型(Conditional Random Fields,CRF)对文本转换得到的文字文本中的命名实体信息进行提取,并通过CRF标注出命名实体的类型,比如人名或者机构名等。再使用词性标注工具(Part-Of-Speech,POS)对文字文本进行分词和词性标注,提取文字文本中的名词和动词。在这个过程中需要结合CRF和POS的综合结果,因为POS的识别只是在词上,而CRF可以是一个完整的短语,例如:我在上海复旦大学工作,CRF可以完整识别出“上海复旦大学”这个机构名实体,而POS只能识别出名词:“上海”、“复旦”和“大学”。所以在处理实体词上,如果POS的结果被CRF所包含,会优先使用CRF的结果,动词方面则只会使用POS的结果。由此,即可得到包含词性信息标注的第一词语。Therefore, in this embodiment, the named entity information in the text converted from the text can be extracted through the conditional random field model (Conditional Random Fields, CRF), and the type of the named entity, such as the name of the person or the name of the institution, can be marked through the CRF. Then use the part-of-speech tagging tool (Part-Of-Speech, POS) to perform word segmentation and part-of-speech tagging on the text, and extract the nouns and verbs in the text. In this process, it is necessary to combine the comprehensive results of CRF and POS, because the recognition of POS is only on words, and CRF can be a complete phrase. For example: I work at Fudan University in Shanghai, and CRF can fully recognize the institution name entity of "Shanghai Fudan University", while POS can only recognize nouns: "Shanghai", "Fudan" and "University". Therefore, when dealing with entity words, if the result of POS is included in CRF, the result of CRF will be used first, and the result of POS will only be used for verbs. Thus, the first word tagged with part-of-speech information can be obtained.
在可选的实施方式中，如果用户使用的语言为英语，则可以预先构造一个对应领域的动词和命名实体的集合，继而通过匹配的方式提取原句中的动词和命名实体；对英文名词的提取则可以同样使用POS进行名词识别和提取，继而得到包含词性信息标注的第一词语。In an optional implementation, if the language used by the user is English, a set of verbs and named entities of the corresponding domain can be constructed in advance, and the verbs and named entities in the original sentence can then be extracted by matching; for English nouns, POS can likewise be used for noun recognition and extraction, thereby obtaining the first words tagged with part-of-speech information.
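The CRF/POS merging rule described above (prefer the CRF entity when it covers a POS noun; take verbs only from POS) can be sketched as follows. The tag names and data shapes here are illustrative assumptions, not the application's actual interfaces.

```python
def merge_crf_pos(crf_entities, pos_tokens):
    """Merge CRF named-entity spans with POS tokens: a POS noun contained in
    a CRF entity phrase is dropped in favour of the full entity; verbs are
    taken from the POS result only.

    crf_entities: list of (phrase, entity_type), e.g. [("上海复旦大学", "ORG")]
    pos_tokens:   list of (word, tag) with tag "n" for nouns, "v" for verbs
    """
    words = list(crf_entities)
    for word, tag in pos_tokens:
        if tag == "n" and any(word in phrase for phrase, _ in crf_entities):
            continue  # noun already covered by a CRF entity: prefer the CRF result
        if tag in ("n", "v"):
            words.append((word, tag))
    return words
```

On the document's own example "我在上海复旦大学工作", the POS nouns "上海", "复旦", and "大学" are all contained in the CRF entity "上海复旦大学" and are therefore replaced by it, while the verb "工作" comes from POS.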
302:根据至少一个第一词语确定前一轮语音信息的时态信息。302: Determine temporal information of the previous round of voice information according to at least one first word.
在本实施方式中,可以将分词所得到的至少一个第一词语输入门控循环单元(Gate Recurrent Unit,GRU)编码器进行编码,得到第二隐藏层状态特征向量。继而将第二隐藏层状态特征向量输入多层感知器(MultiLayer Perceptron,MLP),得到线性输出结果。最后,将线性输出结果输入时态分类器,得到前一轮语音信息的时态信息。In this embodiment, at least one first word obtained by word segmentation can be input into a gate recurrent unit (Gate Recurrent Unit, GRU) encoder for encoding to obtain the second hidden layer state feature vector. Then input the state feature vector of the second hidden layer into a multilayer perceptron (MultiLayer Perceptron, MLP), and obtain a linear output result. Finally, input the linear output result into the temporal classifier to obtain the temporal information of the previous round of speech information.
具体而言，GRU的结构如图4所示，其中包括重置门 $r_t$、更新门 $z_t$、候选记忆单元 $\tilde{h}_t$ 和当前时刻记忆单元 $h_t$。Specifically, the structure of the GRU is shown in FIG. 4 , which includes a reset gate $r_t$, an update gate $z_t$, a candidate memory cell $\tilde{h}_t$, and a current-moment memory cell $h_t$.

具体而言，重置门 $r_t$ 的运行逻辑可以通过公式①进行表示：Specifically, the operating logic of the reset gate $r_t$ can be expressed by formula ①:

$$r_t=\sigma(W_r X_t+U_r h_{t-1}+b_r)\tag{①}$$

其中，$\sigma$ 是激活函数，$W_r$ 和 $U_r$ 是重置门 $r_t$ 对应的参数矩阵，初始化的值都是随机的，可以通过对模型的训练得到新的值；$b_r$ 是重置门 $r_t$ 对应的偏置，也是可训练的。Here, $\sigma$ is an activation function; $W_r$ and $U_r$ are the parameter matrices of the reset gate $r_t$, whose initial values are random and are updated by training the model; $b_r$ is the trainable bias of the reset gate $r_t$.

进一步的，更新门 $z_t$ 的运行逻辑可以通过公式②进行表示：Further, the operating logic of the update gate $z_t$ can be expressed by formula ②:

$$z_t=\sigma(W_z X_t+U_z h_{t-1}+b_z)\tag{②}$$

其中，$W_z$ 和 $U_z$ 是更新门 $z_t$ 对应的参数矩阵，初始化的值都是随机的，可以通过对模型的训练得到新的值；$b_z$ 是更新门 $z_t$ 对应的偏置，也是可训练的。Here, $W_z$ and $U_z$ are the parameter matrices of the update gate $z_t$, whose initial values are random and are updated by training; $b_z$ is the trainable bias of the update gate $z_t$.

进一步的，候选记忆单元 $\tilde{h}_t$ 的运行逻辑可以通过公式③进行表示：Further, the operating logic of the candidate memory cell $\tilde{h}_t$ can be expressed by formula ③:

$$\tilde{h}_t=\tanh\left(W X_t+U\left(r_t\odot h_{t-1}\right)+b\right)\tag{③}$$

其中，$\tanh$ 是激活函数，$W$ 和 $U$ 是候选记忆单元 $\tilde{h}_t$ 对应的参数矩阵，初始化的值都是随机的，可以通过对模型的训练得到新的值；$b$ 是候选记忆单元 $\tilde{h}_t$ 对应的偏置，也是可训练的。Here, $\tanh$ is the activation function; $W$ and $U$ are the parameter matrices of the candidate memory cell $\tilde{h}_t$, whose initial values are random and are updated by training; $b$ is the trainable bias of the candidate memory cell $\tilde{h}_t$.

进一步的，当前时刻记忆单元 $h_t$ 的运行逻辑可以通过公式④进行表示：Further, the operating logic of the current-moment memory cell $h_t$ can be expressed by formula ④:

$$h_t=z_t\odot h_{t-1}+\left(1-z_t\right)\odot\tilde{h}_t\tag{④}$$

其中，$z_t$ 为权重，是可训练的，$\odot$ 表示逐元素相乘。Here, $z_t$ acts as the weight and is trainable, and $\odot$ denotes element-wise multiplication.
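Formulas ①-④ can be sketched as a single NumPy update step. The parameter names follow the formulas; the shapes, initialization, and the dictionary layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step: reset gate r_t (①), update gate z_t (②),
    candidate memory (③), and current-moment memory h_t (④)."""
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]
    Wz, Uz, bz = params["Wz"], params["Uz"], params["bz"]
    W,  U,  b  = params["W"],  params["U"],  params["b"]
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)           # formula ①
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)           # formula ②
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev) + b)   # formula ③
    h_t = z_t * h_prev + (1.0 - z_t) * h_cand            # formula ④
    return h_t
```

Because both gates pass through a sigmoid and the candidate through tanh, every component of the new hidden state stays in (-1, 1) when the previous state does.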
在本实施方式中,MLP的结构如图5所示,由两个线性层Linear和一个ReLu激活函数组成,在通过最后一个线性层输出线性输出结果后,会将该线性输出结果再次输入进softmax函数进行多标签分类,最后由时态分类器判断当前句子的时态。由此,避免了传统时态识别中单纯使用“过”、“着”、“了”等独立词引发的误识别和漏识别。比如:语音信息“我在跑步”是现在进行时,但因为没有包含“过”、“着”、“了”等独立词,所以在传统识别方式中会被漏识别。In this embodiment, the structure of MLP is shown in Figure 5. It consists of two linear layers Linear and a ReLu activation function. After outputting the linear output result through the last linear layer, the linear output result will be input into the softmax function again for multi-label classification, and finally the tense classifier is used to judge the tense of the current sentence. As a result, misrecognition and missed recognition caused by simply using independent words such as "Guo", "Zhe", and "Le" in traditional tense recognition are avoided. For example: the voice message "I'm running" is in the present continuous tense, but because it does not contain independent words such as "pass", "zhe" and "le", it will be missed in the traditional recognition method.
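The MLP of Figure 5 (two linear layers with a ReLU in between, followed by softmax for classification) can be sketched as below. The layer dimensions are arbitrary illustrative choices, and the integer label returned is simply the argmax over the softmax probabilities.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def tense_classifier(h, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, then softmax; the argmax
    index over the probabilities is the predicted tense label."""
    z = relu(W1 @ h + b1)      # first Linear layer + ReLU
    logits = W2 @ z + b2       # second Linear layer
    probs = softmax(logits)    # probability over tense classes
    return int(np.argmax(probs)), probs
```

In the described pipeline, `h` would be the second hidden layer state feature vector produced by the GRU encoder, and each output index would map to a tense class.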
303:将时态信息添加进每个第一词语的词语标签中,得到与至少一个第一词语一一对应的至少一个第二词语。303: Add temporal information to the word label of each first word to obtain at least one second word corresponding to at least one first word one-to-one.
简单而言,在本实施方式中,第二词语即为词语标签中添加了对应的语音信息的时态信息的第一词语。由此,使第二词语在携带语音本身的信息的基础上,还携带有相应的词性信息和语音的时态信息,使后续生成的回复语句更加精准。In short, in this embodiment, the second word is the first word in the word tag with the temporal information of the corresponding voice information added. Thus, the second word carries the corresponding part-of-speech information and tense information of the speech on the basis of carrying the information of the speech itself, so that the reply sentences generated subsequently are more accurate.
304:将至少一个第二词语输入粗糙编码器进行编码,得到与至少一个第二词语一一对应的至少一个粗糙上下文信息,和与至少一个第二词语一一对应的至少一个第一隐藏层状态特征向量。304: Input at least one second word into a rough encoder for encoding, and obtain at least one rough context information one-to-one corresponding to at least one second word, and at least one first hidden layer state feature vector one-to-one corresponding to at least one second word.
在本实施方式中,粗糙编码器可以是GRU编码器。具体而言,在编码时,将至少一个第二词语按照顺序依次输入GRU编码器,由编码器输出对应的粗糙上下文信息和第一隐藏层状态特征向量。在编码的过程中,输入GRU编码器的除了当前编码的第二词语外,还可以将上一次编码过程输出的第一隐藏层状态特征向量也作为当前编码的输入。即,对第x个第二词语编码时,可以将第x个第二词语,和第x-1个第一隐藏层状态特征向量输入GRU 编码器,得到第x个粗糙上下文信息和第x个第一隐藏层状态特征向量。且当x=1时,由于不存在第0个第二词语,此时,只将第1个第二词语输入GRU编码器进行编码即可。In this embodiment, the coarse encoder may be a GRU encoder. Specifically, during encoding, at least one second word is sequentially input into a GRU encoder, and the encoder outputs corresponding rough context information and a first hidden layer state feature vector. In the encoding process, in addition to the second word currently encoded, the input to the GRU encoder can also use the first hidden layer state feature vector output from the previous encoding process as the input of the current encoding. That is, when encoding the xth second word, the xth second word and the x-1th first hidden layer state feature vector can be input into the GRU encoder to obtain the xth rough context information and the xth first hidden layer state feature vector. And when x=1, since there is no 0th second word, at this time, only the 1st second word is input into the GRU encoder for encoding.
305：将至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到语音信息的粗糙语义特征。305: Input the at least one piece of rough context information and the at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of the speech information.
在本实施方式中,在提取粗糙语义特征时,对于语音信息而言,拆分所得的每一个第二词语的重要程度是不一样的。因此,在将粗糙上下文信息输入粗糙解码器前,可以对这些信息进行注意力处理,以获取各个粗糙上下文信息的重要程度。In this embodiment, when extracting rough semantic features, for speech information, the importance of each second word obtained by splitting is different. Therefore, before the coarse context information is fed into the coarse decoder, attention processing can be performed on these information to obtain the importance of each coarse context information.
示例性的，每一个第二词语在编码器中都分别对应了一个隐藏层状态特征向量，即第一隐藏层状态特征向量。简单而言，有多少个粗糙上下文信息就有多少个第一隐藏层状态特征向量。因此，可以将粗糙上下文信息输入进解码器，解码器在解码时会计算当前解码过程的特征向量(解码器当前解码过程的输出)与该粗糙上下文信息编码所得的隐藏层状态特征向量之间的相似度。由此，对每一个粗糙上下文信息都会计算出一个相似度的值，然后对这些相似度进行归一化，获得每一个粗糙上下文信息对应的权重。再将每一个粗糙上下文信息对应的权重与该粗糙上下文信息输入编码器获得的隐藏层状态特征向量进行相乘，得到注意力特征，再与该粗糙上下文信息输入解码器时获得的输出特征向量相加，得到该粗糙上下文信息输入解码器得到的最终特征。Exemplarily, each second word corresponds to a hidden layer state feature vector in the encoder, namely a first hidden layer state feature vector; in short, there are as many first hidden layer state feature vectors as there are pieces of rough context information. Therefore, the rough context information can be input into the decoder, and during decoding the decoder computes the similarity between the feature vector of the current decoding step (the output of the decoder's current decoding step) and the hidden layer state feature vector obtained by encoding that rough context information. Thus, a similarity value is computed for each piece of rough context information, and these similarities are then normalized to obtain the weight corresponding to each piece. The weight of each piece of rough context information is then multiplied by the hidden layer state feature vector obtained when it was input into the encoder, yielding the attention feature, which is then added to the output feature vector obtained when it was input into the decoder, giving the final feature for that piece of rough context information.
基于此，本实施方式提供了一种将至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到语音信息的粗糙语义特征的方法，如图6所示，该方法包括：Based on this, this embodiment provides a method of inputting at least one piece of rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of the speech information. As shown in FIG. 6 , the method includes:
601：在第i次解码处理中，将输入特征向量 $A_i$ 输入粗糙解码器，得到输出特征向量 $B_i$。601: In the i-th decoding process, input the input feature vector $A_i$ into the rough decoder to obtain the output feature vector $B_i$.

在本实施方式中，i为大于或等于1且小于或等于j的整数，j为至少一个粗糙上下文信息的数量，j为大于或等于1的整数；当i=1时，输入特征向量 $A_i$ 为至少一个粗糙上下文信息中的第1个粗糙上下文信息。In this embodiment, i is an integer greater than or equal to 1 and less than or equal to j, where j is the number of the at least one piece of rough context information and is an integer greater than or equal to 1; when i=1, the input feature vector $A_i$ is the first piece of rough context information among the at least one piece.

602：计算输出特征向量 $B_i$ 和至少一个第一隐藏层状态特征向量中第i个第一隐藏层状态特征向量 $C_i$ 之间的相似度 $D_i$。602: Calculate the similarity $D_i$ between the output feature vector $B_i$ and the i-th first hidden layer state feature vector $C_i$ among the at least one first hidden layer state feature vector.

在本实施方式中，可以通过计算输出特征向量 $B_i$ 和第i个第一隐藏层状态特征向量 $C_i$ 之间的余弦相似度来得到相似度 $D_i$。In this embodiment, the similarity $D_i$ can be obtained by calculating the cosine similarity between the output feature vector $B_i$ and the i-th first hidden layer state feature vector $C_i$.

603：对相似度 $D_i$ 进行归一化处理，得到输入特征向量 $A_i$ 的权重 $E_i$。603: Normalize the similarity $D_i$ to obtain the weight $E_i$ of the input feature vector $A_i$.

在本实施方式中，可以将相似度 $D_i$ 输入softmax函数进行归一化处理，得到输入特征向量 $A_i$ 的权重 $E_i$。In this embodiment, the similarity $D_i$ can be input into the softmax function for normalization to obtain the weight $E_i$ of the input feature vector $A_i$.

604：将权重 $E_i$ 与第i个第一隐藏层状态特征向量 $C_i$ 相乘，得到权重特征向量 $F_i$。604: Multiply the weight $E_i$ by the i-th first hidden layer state feature vector $C_i$ to obtain the weight feature vector $F_i$.

605：将权重特征向量 $F_i$ 与输出特征向量 $B_i$ 相加，得到目标输出特征向量 $G_i$。605: Add the weight feature vector $F_i$ to the output feature vector $B_i$ to obtain the target output feature vector $G_i$.

606：将目标输出特征向量 $G_i$ 作为第i+1次解码处理的输入特征向量 $A_{i+1}$ 进行第i+1次解码处理，直至完成多次解码处理，得到语音信息的粗糙语义特征。606: Use the target output feature vector $G_i$ as the input feature vector $A_{i+1}$ of the (i+1)-th decoding process and perform the (i+1)-th decoding process, until the multiple decoding processes are completed and the rough semantic features of the speech information are obtained.
具体而言,在多次解码处理的过程中,上一时刻的输出会作为下一时刻的输入,直至进行完多次解码处理后,得到的最终输出,即为语音信息的粗糙语义特征。Specifically, in the process of multiple decoding processes, the output at the previous moment will be used as the input at the next moment, until the final output obtained after multiple decoding processes is the rough semantic feature of the speech information.
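Steps 601-606 can be sketched as the following loop. Here `decoder_step` is a placeholder for the rough decoder, and normalizing each similarity by a softmax over the scores against all encoder hidden states is one reading of the normalization in step 603; both are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_decode(contexts, enc_hiddens, decoder_step):
    """Run the decoding passes of steps 601-606: only the first coarse
    context seeds the loop, since each target output G_i becomes the next
    input A_{i+1}."""
    a = contexts[0]                                           # A_1 (601)
    for i, c in enumerate(enc_hiddens):
        b = decoder_step(a)                                   # 601: B_i
        sims = np.array([cosine(b, h) for h in enc_hiddens])  # 602: similarities
        e = softmax(sims)[i]                                  # 603: weight E_i
        f = e * c                                             # 604: F_i = E_i * C_i
        g = f + b                                             # 605: G_i = F_i + B_i
        a = g                                                 # 606: A_{i+1} = G_i
    return a  # final output: the rough semantic feature
```

A usage sketch with an identity decoder shows the loop runs once per encoder hidden state and returns a vector of the same dimensionality.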
203:对语音信息进行分词处理,得到关键词组。203: Perform word segmentation processing on the speech information to obtain keyword groups.
在本实施方式中,可以将语音信息转化为文字文本,继而对文字文本进行切分处理,得到至少一个第一关键词。然后,将至少一个第一关键词中任意两个不同的第一相邻词和第二相邻词进行组合,得到至少一个第二关键词,该第一相邻词和第二相邻词之间的字段间隔是小于第一阈值的。In this embodiment, the voice information may be converted into text, and then the text may be segmented to obtain at least one first keyword. Then, at least one second keyword is obtained by combining any two different first adjacent words and second adjacent words in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is smaller than the first threshold.
具体而言，第一相邻词和第二相邻词为至少一个第一关键词中字段间隔小于第一阈值的任意两个不同的相邻词语，该字段间隔可以理解为第一相邻词和第二相邻词在其对应的文字文本中相应位置之间的字符数量。示例性的，对于文字文本“上海市与2016年开园的迪士尼乐园坐落于浦东新区”，经过分词并筛选后可以得到第一关键词：“上海市”、“2016年”、“迪士尼”、“乐园”、“浦东”和“新区”。此时，第一关键词“2016年”和“迪士尼”在文字文本中相应位置之间的字符数量为3，所以第一关键词“2016年”和“迪士尼”之间的字符距离为3。而第一关键词“迪士尼”和“乐园”在文字文本中相应位置之间的字符数量为0，所以第一关键词“迪士尼”和“乐园”之间的字符距离为0。Specifically, the first adjacent word and the second adjacent word are any two different adjacent words among the at least one first keyword whose field interval is smaller than the first threshold, where the field interval can be understood as the number of characters between the corresponding positions of the first adjacent word and the second adjacent word in the text. Exemplarily, for the text "上海市与2016年开园的迪士尼乐园坐落于浦东新区", the first keywords obtained after word segmentation and screening are "上海市", "2016年", "迪士尼", "乐园", "浦东", and "新区". Here, the number of characters between the corresponding positions of the first keywords "2016年" and "迪士尼" in the text is 3, so the character distance between them is 3; the number of characters between the corresponding positions of "迪士尼" and "乐园" is 0, so the character distance between them is 0.
在本实施方式中，可以将第一阈值设置为1。由此，以上述文字文本“上海市与2016年开园的迪士尼乐园坐落于浦东新区”为例，满足要求的第一关键词为：“迪士尼”和“乐园”，以及“浦东”和“新区”。由此，可以得到第二关键词“迪士尼乐园”和“浦东新区”。In this embodiment, the first threshold can be set to 1. Thus, taking the above text "上海市与2016年开园的迪士尼乐园坐落于浦东新区" as an example, the first keywords that meet the requirement are "迪士尼" and "乐园", as well as "浦东" and "新区". Thus, the second keywords "迪士尼乐园" and "浦东新区" can be obtained.
然后,将至少一个第二关键词中的每个第二关键词与预设的实体库进行匹配,并筛除匹配失败的第二关键词,得到至少一个第三关键词。再在至少一个第一关键词中,将组成至少一个第三关键词中的每个第三关键词的第一关键词删除,得到至少一个第四关键词。Then, each second keyword in the at least one second keyword is matched with a preset entity library, and second keywords that fail to be matched are screened out to obtain at least one third keyword. In the at least one first keyword, the first keyword constituting each third keyword in the at least one third keyword is deleted to obtain at least one fourth keyword.
具体而言，第四关键词即为剔除了组成至少一个第三关键词中的每个第三关键词的第一关键词后剩下的第一关键词。示例性的，沿用上述文字文本“上海市与2016年开园的迪士尼乐园坐落于浦东新区”的示例，假定确定的第三关键词为“迪士尼乐园”，则由于第三关键词“迪士尼乐园”是由第一关键词“迪士尼”和“乐园”组成的，因此，将第一关键词“迪士尼”和“乐园”从原来得到的若干个第一关键词：“上海市”、“2016年”、“迪士尼”、“乐园”、“浦东”和“新区”中剔除，则剩下的第一关键词：“上海市”、“2016年”、“浦东”和“新区”即为第四关键词。Specifically, the fourth keywords are the first keywords remaining after removing the first keywords that make up each of the at least one third keyword. Exemplarily, following the example of the text "上海市与2016年开园的迪士尼乐园坐落于浦东新区", assuming the determined third keyword is "迪士尼乐园", since it is composed of the first keywords "迪士尼" and "乐园", these two are removed from the originally obtained first keywords "上海市", "2016年", "迪士尼", "乐园", "浦东", and "新区"; the remaining first keywords "上海市", "2016年", "浦东", and "新区" are the fourth keywords.
最后,将至少一个第三关键词和至少一个第四关键词进行组合,得到关键词组。Finally, at least one third keyword and at least one fourth keyword are combined to obtain a keyword group.
具体而言,沿用上述文字文本“上海市与2016年开园的迪士尼乐园坐落于浦东新区”的示例,将第三关键词“迪士尼乐园”和第四关键词:“上海市”、“2016年”、“浦东”和“新区”进行组合,即可得到关键词组:“上海市”、“2016年”、“迪士尼乐园”、“浦东”和“新区”。Specifically, following the example of the above-mentioned text "Shanghai and Disneyland, which opened in 2016, are located in Pudong New District", combine the third keyword "Disneyland" with the fourth keyword: "Shanghai City", "2016", "Pudong" and "New District" to obtain the keyword group: "Shanghai City", "2016", "Disneyland", "Pudong" and "New District".
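The keyword-group construction of step 203 (merging adjacent first keywords under a character-gap threshold, matching merged words against the entity library, removing their components, and combining the rest) can be sketched as follows. Merging only list-adjacent keywords and the simple `find`-based position lookup are simplifying assumptions.

```python
def build_keyword_group(text, first_keywords, entity_lib, threshold=1):
    """Merge adjacent first keywords whose character gap in `text` is below
    `threshold` into second keywords, keep those found in `entity_lib` as
    third keywords, drop their component words, and combine the result with
    the remaining (fourth) keywords."""
    merged = []
    for w1, w2 in zip(first_keywords, first_keywords[1:]):
        i1 = text.find(w1)
        i2 = text.find(w2, i1 + len(w1))
        # character gap = characters between the end of w1 and the start of w2
        if i2 != -1 and i2 - (i1 + len(w1)) < threshold:
            merged.append(w1 + w2)
    third = [w for w in merged if w in entity_lib]           # entity-library match
    used = {p for t in third for p in first_keywords if p in t}
    fourth = [w for w in first_keywords if w not in used]    # leftover components removed
    return third + fourth
```

On the document's own example sentence, only "迪士尼"+"乐园" and "浦东"+"新区" have a gap below 1, reproducing the keyword group from the text.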
204:对关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量。204: Perform multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector.
在本实施方式中,关键词组可以包括至少一个关键词,且至少一个关键词按照至少一个关键词中的每个关键词在语音信息中的先后位置顺序进行排列。基于此,本实施方式提供了一种对关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量的方法,具体如下:In this embodiment, the keyword group may include at least one keyword, and the at least one keyword is arranged according to the order of each keyword in the at least one keyword in the voice information. Based on this, this embodiment provides a method for performing multiple hidden feature extraction processing on keyword groups to obtain the initial hidden layer state feature vector, specifically as follows:
在第n次隐藏特征提取处理中，将第一输入隐藏特征 $H_n$ 输入GRU编码器，得到第一输出隐藏特征 $I_n$，其中，n为大于或等于1且小于或等于m的整数，m为至少一个关键词的数量，m为大于或等于1的整数；当n=1时，第一输入隐藏特征 $H_n$ 为至少一个关键词中的第1个关键词；将第一输出隐藏特征 $I_n$ 作为第n+1次隐藏特征提取处理的第一输入隐藏特征 $H_{n+1}$ 进行第n+1次隐藏特征提取处理，直至进行多次隐藏特征提取处理后，得到初始隐藏层状态特征向量。In the n-th hidden feature extraction process, the first input hidden feature $H_n$ is input into the GRU encoder to obtain the first output hidden feature $I_n$, where n is an integer greater than or equal to 1 and less than or equal to m, and m is the number of the at least one keyword and is an integer greater than or equal to 1; when n=1, the first input hidden feature $H_n$ is the first keyword among the at least one keyword; the first output hidden feature $I_n$ is used as the first input hidden feature $H_{n+1}$ of the (n+1)-th hidden feature extraction process to perform that process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
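The recurrence of step 204 — the n-th output hidden feature becomes the (n+1)-th input hidden feature — reduces to a simple fold over the keyword list. Here `embed` and `gru_step` are placeholders for the word embedding and the GRU encoder step.

```python
def extract_initial_hidden(keywords, embed, gru_step, h0):
    """Feed the keywords through the encoder in order; the output of step n
    is the input of step n+1, and the final output is the initial
    hidden-layer state feature vector."""
    h = h0
    for w in keywords:
        h = gru_step(embed(w), h)
    return h
```

With toy placeholders the fold behaves as expected: each keyword updates the running state exactly once, in order.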
205:根据粗糙语义特征和初始隐藏层状态特征向量进行多次回复词生成处理,得到至少一个回复词。205: Perform multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain at least one reply word.
在本实施方式中,在第p次回复词生成处理时,可以将输入词向量K p、第二输入隐藏特征L p和粗糙语义特征输入门控循环单元解码器,得到回复词O p和第二输出隐藏特征R p,其中,p为大于或等于1,且小于或等于q的整数,q由语音信息决定为大于或等于1的整数,当p=1时,输入词向量K p为初始隐藏层状态特征向量。然后,对回复词O p进行词嵌入处理,得到回复词向量S p。最后,将回复词向量S p作为第p+1次回复词生成处理的输入词向量K p+1,将第二输出隐藏特征R p作为第p+1次回复词生成处理的第二输入隐藏特征L p+1进行第p+1次回复词生成处理,直至进行多次回复词生成处理后,得到至少一个回复词。 In this embodiment, during the p-time reply word generation process, the input word vector K p , the second input hidden feature L p and the rough semantic feature can be input into the gated recurrent unit decoder to obtain the reply word Op and the second output hidden feature R p , wherein p is an integer greater than or equal to 1 and less than or equal to q, and q is determined as an integer greater than or equal to 1 by the voice information. When p=1, the input word vector K p is the initial hidden layer state feature vector. Then, word embedding is performed on the reply word O p to obtain the reply word vector S p . Finally, the reply word vector S p is used as the input word vector K p +1 of the p+1-th reply word generation process, and the second output hidden feature R p is used as the second input hidden feature L p +1 of the p+1-th reply word generation process for the p+1-th reply word generation process until at least one reply word is obtained after multiple reply word generation processes.
具体而言，如图7所示，生成过程是每次生成一个回复词，在第p次生成回复词O p，然后在第p+1次生成回复词O p+1。但是，在第p+1次的时候会把上一次(即第p次)生成的回复词O p的词向量也作为第p+1次的输入之一。而另一个输入就是粗糙语义特征，即回复词O p+1是由回复词O p的词向量、第p次生成的第二输出隐藏特征R p以及粗糙语义特征三者共同决定的。 Specifically, as shown in FIG. 7, the generation process produces one reply word at a time: the reply word O p is generated at the p-th time, and the reply word O p+1 at the (p+1)-th time. At the (p+1)-th time, however, the word vector of the reply word O p generated at the previous (i.e., the p-th) time is also used as one of the inputs. The other input is the rough semantic feature; that is, the reply word O p+1 is jointly determined by the word vector of the reply word O p, the second output hidden feature R p generated at the p-th time, and the rough semantic feature.
206:将至少一个回复词按照至少一个回复词中每个回复词的生成顺序进行拼接,得到语音信息的回复语句。206: Concatenate at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
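Steps 205 and 206 can be sketched together as a toy greedy decoding loop. This is an illustrative stand-in rather than the application's gated recurrent unit decoder: `decoder_step` mixes the three inputs named above (previous word vector, previous hidden feature, rough semantic feature) with a simple tanh, and the vocabulary, the one-hot embedding, and the `<eos>` stop token are all assumptions; q emerges from when `<eos>` is produced or the length cap is reached.

```python
import numpy as np

VOCAB = ["<eos>", "你好", "请", "稍等", "谢谢"]   # assumed toy vocabulary

def decoder_step(word_vec, hidden, rough_feat, W):
    """Toy stand-in for the GRU decoder: the next hidden feature mixes the
    input word vector K_p, the second input hidden feature L_p and the rough
    semantic feature; a linear layer W scores the vocabulary."""
    hidden = np.tanh(word_vec + hidden + rough_feat)   # second output hidden feature R_p
    return int(np.argmax(W @ hidden)), hidden          # reply word O_p and R_p

def embed(word_id, dim):
    """Toy word embedding producing the reply word vector S_p."""
    v = np.zeros(dim)
    v[word_id % dim] = 1.0
    return v

def generate_reply(init_state, rough_feat, W, max_len=10):
    """Generate reply words one at a time, then splice them in generation order."""
    words, hidden = [], np.zeros_like(init_state)
    word_vec = init_state                    # K_1 is the initial hidden layer state vector
    for _ in range(max_len):                 # q is bounded by max_len here
        word_id, hidden = decoder_step(word_vec, hidden, rough_feat, W)
        if VOCAB[word_id] == "<eos>":        # assumed stop token ends generation
            break
        words.append(VOCAB[word_id])
        word_vec = embed(word_id, init_state.size)   # S_p becomes K_{p+1}
    return "".join(words)                    # reply sentence, words in generation order
```

The final `"".join(words)` is the splicing of step 206: the reply words are concatenated strictly in the order they were generated.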
综上所述，本申请所提供的基于粗糙语义的回复语句确定方法中，通过获取用户当前时刻的语音信息的前一轮语音信息，继而对该前一轮语音信息进行粗糙语义提取，得到可以包含该前一轮语音信息中高层次的抽象信息的语义特征，作为用户当前时刻的语音信息的粗糙语义特征，由此，实现了对前一轮语音信息中关键信息和粗糙信息的同步提取。然后，对用户当前时刻的语音信息进行分词处理，并对得到的多个关键词进行多次隐藏特征提取处理，得到用户当前时刻的语音信息的初始隐藏层状态特征向量。最后，根据粗糙语义特征和初始隐藏层状态特征向量进行多次回复词生成处理，将得到的至少一个回复词按照至少一个回复词中每个回复词的生成顺序进行拼接，得到语音信息的回复语句。基于此，将同时包含前一轮对话中的关键信息和粗糙信息的粗糙语义特征作为本轮对话中回复语句的生成依据之一，使回复语句生成过程包含了前一轮对话的更加全面的信息特征。由此，生成的回复语句的精准度更高，可以与对话的主体更好地契合，提升用户体验。To sum up, in the method for determining reply sentences based on rough semantics provided by this application, by obtaining the previous round of voice information of the user's current voice information, and then performing rough semantic extraction on the previous round of voice information, the semantic features that can contain the high-level abstract information in the previous round of voice information are obtained, which are used as the rough semantic features of the user's current voice information, thereby realizing synchronous extraction of key information and rough information in the previous round of voice information. Then, word segmentation is performed on the voice information of the user at the current moment, and multiple hidden feature extraction processes are performed on the obtained multiple keywords to obtain the initial hidden layer state feature vector of the voice information of the user at the current moment. Finally, according to the rough semantic features and the initial hidden layer state feature vector, multiple reply words are generated, and the obtained at least one reply word is spliced according to the generation order of each reply word in the at least one reply word to obtain the reply sentence of the voice information. 
Based on this, the rough semantic features containing both key information and rough information in the previous round of dialogue are used as one of the basis for generating reply sentences in the current round of dialogue, so that the reply sentence generation process includes more comprehensive information features of the previous round of dialogue. As a result, the generated reply sentences are more accurate, can better fit with the main body of the dialogue, and improve user experience.
参阅图8,图8为本申请实施方式提供的一种基于粗糙语义的回复语句确定装置的功能模块组成框图。如图8所示,该基于粗糙语义的回复语句确定装置800包括:Referring to FIG. 8 , FIG. 8 is a block diagram of functional modules of an apparatus for determining reply sentences based on rough semantics provided in an embodiment of the present application. As shown in FIG. 8, the device 800 for determining a reply sentence based on rough semantics includes:
获取模块801,用于根据用户当前时刻的语音信息的发生时间,获取与语音信息相邻的前一轮语音信息,其中,前一轮语音信息的发生时间小于语音信息的发生时间,且前一轮语音信息的发生时间与语音信息的发生时间之间的差值的绝对值最小;The acquisition module 801 is used to acquire the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the user's current moment, wherein the occurrence time of the previous round of voice information is less than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
处理模块802,用于根据前一轮语音信息对语音信息进行粗糙语义提取,得到对应于语音信息的粗糙语义特征,对语音信息进行分词处理,得到关键词组,并对关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量;The processing module 802 is used to perform rough semantic extraction on the voice information according to the previous round of voice information, obtain rough semantic features corresponding to the voice information, perform word segmentation processing on the voice information, obtain keyword groups, and perform multiple hidden feature extraction processing on the keyword groups to obtain initial hidden layer state feature vectors;
生成模块803,用于根据粗糙语义特征和初始隐藏层状态特征向量进行多次回复词生成处理,得到至少一个回复词,并将至少一个回复词按照至少一个回复词中每个回复词的生成顺序进行拼接,得到语音信息的回复语句。The generation module 803 is used to perform multiple reply word generation processing according to the rough semantic feature and the initial hidden layer state feature vector to obtain at least one reply word, and splice at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
在本申请的实施方式中,在根据前一轮语音信息对语音信息进行粗糙语义提取,得到 对应于语音信息的粗糙语义特征方面,处理模块802,具体用于:In the embodiment of the present application, in performing rough semantic extraction on the voice information according to the previous round of voice information, and obtaining rough semantic features corresponding to the voice information, the processing module 802 is specifically used for:
对前一轮语音信息进行检测,得到前一轮语音信息包含的至少一个第一词语,其中,至少一个第一词语中的每个第一词语包括词语标签;Detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word includes a word label;
根据至少一个第一词语确定前一轮语音信息的时态信息;determining the temporal information of the previous round of speech information according to at least one first word;
将时态信息添加进每个第一词语的词语标签中,得到至少一个第二词语,其中,至少一个第二词语与至少一个第一词语一一对应;adding temporal information into the word tags of each first word to obtain at least one second word, wherein at least one second word is in one-to-one correspondence with at least one first word;
将至少一个第二词语输入粗糙编码器进行编码,得到至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量,其中,至少一个粗糙上下文信息与至少一个第二词语一一对应,至少一个第一隐藏层状态特征向量与至少一个第二词语一一对应;Inputting at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, wherein at least one rough context information is in one-to-one correspondence with at least one second word, and at least one first hidden layer state feature vector is in one-to-one correspondence with at least one second word;
将至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到语音信息的粗糙语义特征。Inputting at least one piece of rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of speech information.
在本申请的实施方式中,在根据至少一个第一词语确定前一轮语音信息的时态信息方面,处理模块802,具体用于:In the embodiment of the present application, in terms of determining the temporal information of the previous round of speech information according to at least one first word, the processing module 802 is specifically used to:
将至少一个第一词语输入门控循环单元编码器进行编码,得到第二隐藏层状态特征向量;Inputting at least one first word into a gated recurrent unit encoder for encoding to obtain a second hidden layer state feature vector;
将第二隐藏层状态特征向量输入多层感知器,得到线性输出结果;Input the state feature vector of the second hidden layer into the multi-layer perceptron to obtain a linear output result;
将线性输出结果输入时态分类器,得到前一轮语音信息的时态信息。Input the linear output result into the temporal classifier to obtain the temporal information of the previous round of speech information.
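The three-stage temporal pipeline above (gated recurrent unit encoding, then a multilayer perceptron, then a temporal classifier) can be sketched as follows. The GRU stage is elided: the input is taken to be the second hidden layer state feature vector already produced by the encoder. The tense label set, the two-layer MLP shapes, and the softmax classifier are assumptions made for illustration, not details fixed by the application.

```python
import numpy as np

TENSES = ["past", "present", "future"]   # assumed tense label set

def mlp(h, W1, b1, W2, b2):
    """Multilayer perceptron producing the linear output result."""
    return W2 @ np.maximum(W1 @ h + b1, 0.0) + b2

def classify_tense(hidden_state, W1, b1, W2, b2):
    """Map the second hidden layer state feature vector to a tense label
    via the MLP's linear output and a softmax temporal classifier."""
    logits = mlp(hidden_state, W1, b1, W2, b2)
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return TENSES[int(np.argmax(probs))]
```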
在本申请的实施方式中，在将至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到语音信息的粗糙语义特征方面，处理模块802，具体用于：In the embodiment of the present application, in terms of inputting the at least one rough context information and the at least one first hidden layer state feature vector into the rough decoder for multiple decoding processes to obtain the rough semantic features of the voice information, the processing module 802 is specifically used to:
在第i次解码处理中,将输入特征向量A i输入粗糙解码器,得到输出特征向量B i,其中,i为大于或等于1,且小于或等于j的整数,j为至少一个粗糙上下文信息的数量,j为大于或等于1的整数,当i=1时,输入特征向量A i为至少一个粗糙上下文信息中的第1个粗糙上下文信息; In the i-th decoding process, the input feature vector A i is input to the rough decoder to obtain the output feature vector B i , wherein i is an integer greater than or equal to 1 and less than or equal to j, j is the quantity of at least one rough context information, j is an integer greater than or equal to 1, when i=1, the input feature vector A i is the first rough context information in at least one rough context information;
计算输出特征向量B i和至少一个第一隐藏层状态特征向量中第i个第一隐藏层状态特征向量C i之间的相似度D iCalculating the similarity D i between the output feature vector B i and the i-th first hidden layer state feature vector C i in at least one first hidden layer state feature vector;
对相似度D i进行归一化处理,得到输入特征向量A i的权重E iNormalize the similarity D i to obtain the weight E i of the input feature vector A i ;
将权重E i与第i个第一隐藏层状态特征向量C i相乘,得到权重特征向量F iMultiply the weight E i with the i-th first hidden layer state feature vector C i to obtain the weight feature vector F i ;
将权重特征向量F i与输出特征向量B i相加,得到目标输出特征向量G iAdd the weight feature vector F i to the output feature vector B i to get the target output feature vector G i ;
将目标输出特征向量G i作为第i+1次解码处理的输入特征向量A i+1进行第i+1次解码处理，直至进行多次解码处理，得到语音信息的粗糙语义特征。 The target output feature vector G i is used as the input feature vector A i+1 of the (i+1)-th decoding process to perform the (i+1)-th decoding process, until the multiple decoding processes are performed and the rough semantic features of the speech information are obtained.
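The i-th decoding step above maps directly onto a short loop. The sketch below is made under stated assumptions: `decode_step` stands in for the rough decoder, the similarity D i is taken as cosine similarity, and, since a single scalar is being normalized at each step, the normalization yielding E i is sketched as a logistic squashing; the application does not pin down either choice. Note that, as in the text, only the first rough context information is fed in directly; later steps receive the previous target output G i.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, guarded against zero-norm vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rough_decode(first_context, hidden_states, decode_step):
    """B_i = decode(A_i); D_i = sim(B_i, C_i); E_i = norm(D_i);
    F_i = E_i * C_i; G_i = B_i + F_i; and G_i is the next input A_{i+1}."""
    a = first_context                  # A_1: the first rough context information
    for c in hidden_states:            # C_1 .. C_j, j = number of contexts
        b = decode_step(a)             # output feature vector B_i
        d = cosine(b, c)               # similarity D_i
        e = 1.0 / (1.0 + np.exp(-d))   # weight E_i (normalization choice assumed)
        f = e * c                      # weighted feature vector F_i
        a = b + f                      # target output G_i feeds step i+1
    return a                           # rough semantic feature after j steps
```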
在本申请的实施方式中,关键词组包括至少一个关键词,且至少一个关键词按照至少一个关键词中的每个关键词在语音信息中的先后位置顺序进行排列。基于此,在对关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量方面,处理模块802,具体用于:In the implementation manner of the present application, the keyword group includes at least one keyword, and the at least one keyword is arranged according to the sequence of each keyword in the at least one keyword in the voice information. Based on this, the processing module 802 is specifically used for:
在第n次隐藏特征提取处理中，将第一输入隐藏特征H n输入门控循环单元编码器，得到第一输出隐藏特征I n，其中，n为大于或等于1，且小于或等于m的整数，m为至少一个关键词的数量，m为大于或等于1的整数，当n=1时，第一输入隐藏特征H n为至少一个关键词中的第1个关键词；In the n-th hidden feature extraction process, the first input hidden feature H n is input into the gated recurrent unit encoder to obtain the first output hidden feature I n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, and m is an integer greater than or equal to 1; when n=1, the first input hidden feature H n is the first keyword in the at least one keyword;
将第一输出隐藏特征I n作为第n+1次隐藏特征提取处理的第一输入隐藏特征H n+1进行第n+1次隐藏特征提取处理，直至进行多次隐藏特征提取处理后，得到初始隐藏层状态特征向量。 The first output hidden feature I n is used as the first input hidden feature H n+1 of the (n+1)-th hidden feature extraction process to perform the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are performed and the initial hidden layer state feature vector is obtained.
在本申请的实施方式中,在根据粗糙语义特征和初始隐藏层状态特征向量进行多次回复词生成处理,得到至少一个回复词方面,生成模块803,具体用于:In the embodiment of the present application, in terms of generating at least one reply word by performing multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector, the generating module 803 is specifically used for:
在第p次回复词生成处理时，将输入词向量K p、第二输入隐藏特征L p和粗糙语义特征输入门控循环单元解码器，得到回复词O p和第二输出隐藏特征R p，其中，p为大于或等于1，且小于或等于q的整数，q为由语音信息决定的大于或等于1的整数，当p=1时，输入词向量K p为初始隐藏层状态特征向量；In the p-th reply word generation process, the input word vector K p, the second input hidden feature L p and the rough semantic feature are input into the gated recurrent unit decoder to obtain the reply word O p and the second output hidden feature R p, where p is an integer greater than or equal to 1 and less than or equal to q, and q is an integer greater than or equal to 1 that is determined by the voice information; when p=1, the input word vector K p is the initial hidden layer state feature vector;
对回复词O p进行词嵌入处理,得到回复词向量S pPerform word embedding processing on the reply word O p to obtain the reply word vector S p ;
将回复词向量S p作为第p+1次回复词生成处理的输入词向量K p+1，将第二输出隐藏特征R p作为第p+1次回复词生成处理的第二输入隐藏特征L p+1进行第p+1次回复词生成处理，直至进行多次回复词生成处理后，得到至少一个回复词。 The reply word vector S p is used as the input word vector K p+1 of the (p+1)-th reply word generation process, and the second output hidden feature R p is used as the second input hidden feature L p+1 of the (p+1)-th reply word generation process to perform the (p+1)-th reply word generation process, until at least one reply word is obtained after the multiple reply word generation processes.
在本申请的实施方式中,在对语音信息进行分词处理,得到关键词组方面,处理模块802,具体用于:In the embodiment of the present application, the processing module 802 is specifically used for:
将语音信息转化为文字文本,并对文字文本进行切分处理,得到至少一个第一关键词;Converting the speech information into text, and performing segmentation processing on the text to obtain at least one first keyword;
将第一相邻词和第二相邻词进行组合，得到至少一个第二关键词，其中，第一相邻词和第二相邻词为至少一个第一关键词中任意两个不同的第一关键词，且第一相邻词和第二相邻词之间的字段间隔小于第一阈值；Combining the first adjacent word and the second adjacent word to obtain at least one second keyword, wherein the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is less than the first threshold;
将至少一个第二关键词中的每个第二关键词与预设的实体库进行匹配,并筛除匹配失败的第二关键词,得到至少一个第三关键词;Match each second keyword in the at least one second keyword with a preset entity library, and filter out the second keywords that fail to match, to obtain at least one third keyword;
在至少一个第一关键词中,将组成至少一个第三关键词中的每个第三关键词的第一关键词删除,得到至少一个第四关键词;In the at least one first keyword, the first keyword forming each third keyword in the at least one third keyword is deleted to obtain at least one fourth keyword;
将至少一个第三关键词和至少一个第四关键词进行组合,得到关键词组。Combining at least one third keyword and at least one fourth keyword to obtain a keyword group.
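The five-step keyword-group construction above (segment, combine adjacent words, match against the entity library, drop absorbed constituents, merge) can be sketched in a few lines. The adjacency threshold `max_gap`, the plain string concatenation used to combine adjacent words, and the in-memory set standing in for the preset entity library are all illustrative assumptions.

```python
def build_keyword_group(tokens, entity_lib, max_gap=1):
    """tokens: first keywords from segmenting the transcribed text;
    entity_lib: set standing in for the preset entity library."""
    firsts = list(tokens)                                   # first keywords
    pairs = [(a, b) for i, a in enumerate(firsts)
             for b in firsts[i + 1:i + 1 + max_gap]]        # adjacent combinations (second keywords)
    thirds = [a + b for a, b in pairs if a + b in entity_lib]   # entity matches (third keywords)
    used = {w for a, b in pairs if a + b in thirds for w in (a, b)}
    fourths = [w for w in firsts if w not in used]          # leftover first keywords (fourth keywords)
    return thirds + fourths                                 # keyword group
```

For example, with tokens `["平安", "科技", "你好"]` and an entity library containing `"平安科技"`, the two constituents are absorbed into the matched entity and only `"你好"` survives as a fourth keyword.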
参阅图9,图9为本申请实施方式提供的一种电子设备的结构示意图。如图9所示,电子设备900包括收发器901、处理器902和存储器903。它们之间通过总线904连接。存储器903用于存储计算机程序和数据,并可以将存储器903存储的数据传输给处理器902。Referring to FIG. 9 , FIG. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in FIG. 9 , an electronic device 900 includes a transceiver 901 , a processor 902 and a memory 903 . They are connected through a bus 904 . The memory 903 is used to store computer programs and data, and can transmit the data stored in the memory 903 to the processor 902 .
处理器902用于读取存储器903中的计算机程序执行以下操作:The processor 902 is used to read the computer program in the memory 903 to perform the following operations:
根据用户当前时刻的语音信息的发生时间,获取与语音信息相邻的前一轮语音信息,其中,前一轮语音信息的发生时间小于语音信息的发生时间,且前一轮语音信息的发生时间与语音信息的发生时间之间的差值的绝对值最小;According to the time of occurrence of the voice information of the user at the current moment, the previous round of voice information adjacent to the voice information is obtained, wherein the time of occurrence of the previous round of voice information is less than the time of occurrence of the voice information, and the absolute value of the difference between the time of occurrence of the previous round of voice information and the time of occurrence of the voice information is the smallest;
根据前一轮语音信息对语音信息进行粗糙语义提取,得到对应于语音信息的粗糙语义特征;Perform rough semantic extraction on the voice information according to the previous round of voice information, and obtain rough semantic features corresponding to the voice information;
对语音信息进行分词处理,得到关键词组;Perform word segmentation processing on the voice information to obtain keyword groups;
对关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量;Perform multiple hidden feature extraction processing on the keyword group to obtain the initial hidden layer state feature vector;
根据粗糙语义特征和初始隐藏层状态特征向量进行多次回复词生成处理,得到至少一个回复词;Perform multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain at least one reply word;
将至少一个回复词按照至少一个回复词中每个回复词的生成顺序进行拼接,得到语音信息的回复语句。The at least one reply word is spliced according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
在本申请的实施方式中,在根据前一轮语音信息对语音信息进行粗糙语义提取,得到对应于语音信息的粗糙语义特征方面,处理器902,具体用于执行以下操作:In the embodiment of the present application, the processor 902 is specifically configured to perform the following operations in terms of performing rough semantic extraction on the voice information based on the previous round of voice information to obtain rough semantic features corresponding to the voice information:
对前一轮语音信息进行检测,得到前一轮语音信息包含的至少一个第一词语,其中,至少一个第一词语中的每个第一词语包括词语标签;Detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word includes a word label;
根据至少一个第一词语确定前一轮语音信息的时态信息;determining the temporal information of the previous round of speech information according to at least one first word;
将时态信息添加进每个第一词语的词语标签中,得到至少一个第二词语,其中,至少一个第二词语与至少一个第一词语一一对应;adding temporal information into the word tags of each first word to obtain at least one second word, wherein at least one second word is in one-to-one correspondence with at least one first word;
将至少一个第二词语输入粗糙编码器进行编码，得到至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量，其中，至少一个粗糙上下文信息与至少一个第二词语一一对应，至少一个第一隐藏层状态特征向量与至少一个第二词语一一对应；Inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, wherein the at least one rough context information is in one-to-one correspondence with the at least one second word, and the at least one first hidden layer state feature vector is in one-to-one correspondence with the at least one second word;
将至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到语音信息的粗糙语义特征。Inputting at least one piece of rough context information and at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain rough semantic features of speech information.
在本申请的实施方式中,在根据至少一个第一词语确定前一轮语音信息的时态信息方面,处理器902,具体用于执行以下操作:In the embodiment of the present application, in terms of determining the temporal information of the previous round of speech information according to at least one first word, the processor 902 is specifically configured to perform the following operations:
将至少一个第一词语输入门控循环单元编码器进行编码,得到第二隐藏层状态特征向量;Inputting at least one first word into a gated recurrent unit encoder for encoding to obtain a second hidden layer state feature vector;
将第二隐藏层状态特征向量输入多层感知器,得到线性输出结果;Input the state feature vector of the second hidden layer into the multi-layer perceptron to obtain a linear output result;
将线性输出结果输入时态分类器,得到前一轮语音信息的时态信息。Input the linear output result into the temporal classifier to obtain the temporal information of the previous round of speech information.
在本申请的实施方式中，在将至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到语音信息的粗糙语义特征方面，处理器902，具体用于执行以下操作：In the embodiment of the present application, in terms of inputting the at least one rough context information and the at least one first hidden layer state feature vector into the rough decoder for multiple decoding processes to obtain the rough semantic features of the speech information, the processor 902 is specifically configured to perform the following operations:
在第i次解码处理中,将输入特征向量A i输入粗糙解码器,得到输出特征向量B i,其中,i为大于或等于1,且小于或等于j的整数,j为至少一个粗糙上下文信息的数量,j为大于或等于1的整数,当i=1时,输入特征向量A i为至少一个粗糙上下文信息中的第1个粗糙上下文信息; In the i-th decoding process, the input feature vector A i is input to the rough decoder to obtain the output feature vector B i , wherein i is an integer greater than or equal to 1 and less than or equal to j, j is the quantity of at least one rough context information, j is an integer greater than or equal to 1, when i=1, the input feature vector A i is the first rough context information in at least one rough context information;
计算输出特征向量B i和至少一个第一隐藏层状态特征向量中第i个第一隐藏层状态特征向量C i之间的相似度D iCalculating the similarity D i between the output feature vector B i and the i-th first hidden layer state feature vector C i in at least one first hidden layer state feature vector;
对相似度D i进行归一化处理,得到输入特征向量A i的权重E iNormalize the similarity D i to obtain the weight E i of the input feature vector A i ;
将权重E i与第i个第一隐藏层状态特征向量C i相乘,得到权重特征向量F iMultiply the weight E i with the i-th first hidden layer state feature vector C i to obtain the weight feature vector F i ;
将权重特征向量F i与输出特征向量B i相加,得到目标输出特征向量G iAdd the weight feature vector F i to the output feature vector B i to get the target output feature vector G i ;
将目标输出特征向量G i作为第i+1次解码处理的输入特征向量A i+1进行第i+1次解码处理，直至进行多次解码处理，得到语音信息的粗糙语义特征。 The target output feature vector G i is used as the input feature vector A i+1 of the (i+1)-th decoding process to perform the (i+1)-th decoding process, until the multiple decoding processes are performed and the rough semantic features of the speech information are obtained.
在本申请的实施方式中,关键词组包括至少一个关键词,且至少一个关键词按照至少一个关键词中的每个关键词在语音信息中的先后位置顺序进行排列。基于此,在对关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量方面,处理器902,具体用于执行以下操作:In the implementation manner of the present application, the keyword group includes at least one keyword, and the at least one keyword is arranged according to the sequence of each keyword in the at least one keyword in the voice information. Based on this, the processor 902 is specifically configured to perform the following operations in performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector:
在第n次隐藏特征提取处理中，将第一输入隐藏特征H n输入门控循环单元编码器，得到第一输出隐藏特征I n，其中，n为大于或等于1，且小于或等于m的整数，m为至少一个关键词的数量，m为大于或等于1的整数，当n=1时，第一输入隐藏特征H n为至少一个关键词中的第1个关键词；In the n-th hidden feature extraction process, the first input hidden feature H n is input into the gated recurrent unit encoder to obtain the first output hidden feature I n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, and m is an integer greater than or equal to 1; when n=1, the first input hidden feature H n is the first keyword in the at least one keyword;
将第一输出隐藏特征I n作为第n+1次隐藏特征提取处理的第一输入隐藏特征H n+1进行第n+1次隐藏特征提取处理，直至进行多次隐藏特征提取处理后，得到初始隐藏层状态特征向量。 The first output hidden feature I n is used as the first input hidden feature H n+1 of the (n+1)-th hidden feature extraction process to perform the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are performed and the initial hidden layer state feature vector is obtained.
在本申请的实施方式中,在根据粗糙语义特征和初始隐藏层状态特征向量进行多次回复词生成处理,得到至少一个回复词方面,处理器902,具体用于执行以下操作:In the embodiment of the present application, the processor 902 is specifically configured to perform the following operations in terms of generating at least one reply word based on the rough semantic feature and the initial hidden layer state feature vector for multiple times of reply word generation:
在第p次回复词生成处理时，将输入词向量K p、第二输入隐藏特征L p和粗糙语义特征输入门控循环单元解码器，得到回复词O p和第二输出隐藏特征R p，其中，p为大于或等于1，且小于或等于q的整数，q为由语音信息决定的大于或等于1的整数，当p=1时，输入词向量K p为初始隐藏层状态特征向量；In the p-th reply word generation process, the input word vector K p, the second input hidden feature L p and the rough semantic feature are input into the gated recurrent unit decoder to obtain the reply word O p and the second output hidden feature R p, where p is an integer greater than or equal to 1 and less than or equal to q, and q is an integer greater than or equal to 1 that is determined by the voice information; when p=1, the input word vector K p is the initial hidden layer state feature vector;
对回复词O p进行词嵌入处理,得到回复词向量S pPerform word embedding processing on the reply word O p to obtain the reply word vector S p ;
将回复词向量S p作为第p+1次回复词生成处理的输入词向量K p+1，将第二输出隐藏特征R p作为第p+1次回复词生成处理的第二输入隐藏特征L p+1进行第p+1次回复词生成处理，直至进行多次回复词生成处理后，得到至少一个回复词。 The reply word vector S p is used as the input word vector K p+1 of the (p+1)-th reply word generation process, and the second output hidden feature R p is used as the second input hidden feature L p+1 of the (p+1)-th reply word generation process to perform the (p+1)-th reply word generation process, until at least one reply word is obtained after the multiple reply word generation processes.
在本申请的实施方式中,在对语音信息进行分词处理,得到关键词组方面,处理器902,具体用于执行以下操作:In the embodiment of the present application, the processor 902 is specifically configured to perform the following operations in performing word segmentation processing on the speech information to obtain keyword groups:
将语音信息转化为文字文本,并对文字文本进行切分处理,得到至少一个第一关键词;Converting the speech information into text, and performing segmentation processing on the text to obtain at least one first keyword;
将第一相邻词和第二相邻词进行组合，得到至少一个第二关键词，其中，第一相邻词和第二相邻词为至少一个第一关键词中任意两个不同的第一关键词，且第一相邻词和第二相邻词之间的字段间隔小于第一阈值；Combining the first adjacent word and the second adjacent word to obtain at least one second keyword, wherein the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is less than the first threshold;
将至少一个第二关键词中的每个第二关键词与预设的实体库进行匹配,并筛除匹配失败的第二关键词,得到至少一个第三关键词;Match each second keyword in the at least one second keyword with a preset entity library, and filter out the second keywords that fail to match, to obtain at least one third keyword;
在至少一个第一关键词中,将组成至少一个第三关键词中的每个第三关键词的第一关键词删除,得到至少一个第四关键词;In the at least one first keyword, the first keyword forming each third keyword in the at least one third keyword is deleted to obtain at least one fourth keyword;
将至少一个第三关键词和至少一个第四关键词进行组合,得到关键词组。Combining at least one third keyword and at least one fourth keyword to obtain a keyword group.
应理解,本申请中的基于粗糙语义的回复语句确定装置可以包括智能手机(如Android 手机、iOS手机、Windows Phone手机等)、平板电脑、掌上电脑、笔记本电脑、移动互联网设备MID(Mobile Internet Devices,简称:MID)、机器人或穿戴式设备等。上述基于粗糙语义的回复语句确定装置仅是举例,而非穷举,包含但不限于上述基于粗糙语义的回复语句确定装置。在实际应用中,上述基于粗糙语义的回复语句确定装置还可以包括:智能车载终端、计算机设备等等。It should be understood that the apparatus for determining reply sentences based on rough semantics in the present application may include smart phones (such as Android phones, iOS phones, Windows Phone phones, etc.), tablet computers, palmtop computers, notebook computers, mobile Internet devices MID (Mobile Internet Devices, referred to as: MID), robots or wearable devices, etc. The above device for determining a reply sentence based on rough semantics is only an example, not exhaustive, including but not limited to the above device for determining a reply sentence based on rough semantics. In practical applications, the apparatus for determining reply sentences based on rough semantics may also include: intelligent vehicle-mounted terminals, computer equipment, and the like.
因此,本申请实施方式还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行以实现如上述方法实施方式中记载的任何一种基于粗糙语义的回复语句确定方法的部分或全部步骤。例如,所述存储介质可以包括硬盘、软盘、光盘、磁带、磁盘、优盘、闪存等。所述计算机可读存储介质可以是非易失性,也可以是易失性。Therefore, the embodiments of the present application also provide a computer-readable storage medium, the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all steps of any method for determining a reply sentence based on rough semantics as described in the above-mentioned method embodiments. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like. The computer-readable storage medium may be non-volatile or volatile.
本申请实施方式还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如上述方法实施方式中记载的任何一种基于粗糙语义的回复语句确定方法的部分或全部步骤。The embodiment of the present application also provides a computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause the computer to execute some or all of the steps of any method for determining a reply sentence based on rough semantics as described in the above-mentioned method embodiments.
以上对本申请实施方式进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施方式的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。The above is a detailed introduction to the implementation of the present application. In this paper, specific examples are used to illustrate the principle and implementation of the application. The description of the above implementation is only used to help understand the method and core idea of the application; at the same time, for those of ordinary skill in the art, according to the thinking of the application, there will be changes in the specific implementation and application scope. In summary, the content of this specification should not be understood as limiting the application.

Claims (20)

  1. 一种基于粗糙语义的回复语句确定方法,其中,所述方法包括:A method for determining a reply sentence based on rough semantics, wherein the method includes:
    根据用户当前时刻的语音信息的发生时间，获取与所述语音信息相邻的前一轮语音信息，其中，所述前一轮语音信息的发生时间小于所述语音信息的发生时间，且所述前一轮语音信息的发生时间与所述语音信息的发生时间之间的差值的绝对值最小；According to the occurrence time of the voice information at the user's current moment, the previous round of voice information adjacent to the voice information is acquired, wherein the occurrence time of the previous round of voice information is less than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
    根据所述前一轮语音信息对所述语音信息进行粗糙语义提取,得到对应于所述语音信息的粗糙语义特征;performing rough semantic extraction on the voice information according to the previous round of voice information, to obtain rough semantic features corresponding to the voice information;
    对所述语音信息进行分词处理,得到关键词组;performing word segmentation processing on the voice information to obtain keyword groups;
    对所述关键词组进行多次隐藏特征提取处理,得到初始隐藏层状态特征向量;Performing multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector;
    根据所述粗糙语义特征和所述初始隐藏层状态特征向量进行多次回复词生成处理,得到至少一个回复词;Perform multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain at least one reply word;
    将所述至少一个回复词按照所述至少一个回复词中每个回复词的生成顺序进行拼接,得到所述语音信息的回复语句。The at least one reply word is spliced according to the generation sequence of each reply word in the at least one reply word to obtain the reply sentence of the voice information.
  2. 根据权利要求1所述的方法,其中,所述根据所述前一轮语音信息对所述语音信息进行粗糙语义提取,得到对应于所述语音信息的粗糙语义特征,包括:The method according to claim 1, wherein said performing rough semantic extraction on said voice information according to said previous round of voice information to obtain rough semantic features corresponding to said voice information, comprising:
    对所述前一轮语音信息进行检测,得到所述前一轮语音信息包含的至少一个第一词语,其中,所述至少一个第一词语中的每个第一词语包括词语标签;Detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word includes a word label;
    根据所述至少一个第一词语确定所述前一轮语音信息的时态信息;determining temporal information of the previous round of voice information according to the at least one first word;
    将所述时态信息添加进所述每个第一词语的词语标签中,得到至少一个第二词语,其中,所述至少一个第二词语与所述至少一个第一词语一一对应;adding the temporal information to the word label of each first word to obtain at least one second word, wherein the at least one second word corresponds to the at least one first word;
    将所述至少一个第二词语输入粗糙编码器进行编码,得到至少一个粗糙上下文信息和至少一个第一隐藏层状态特征向量,其中,所述至少一个粗糙上下文信息与所述至少一个第二词语一一对应,所述至少一个第一隐藏层状态特征向量与所述至少一个第二词语一一对应;Inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, wherein the at least one rough context information is in one-to-one correspondence with the at least one second word, and the at least one first hidden layer state feature vector is in one-to-one correspondence with the at least one second word;
    将所述至少一个粗糙上下文信息和所述至少一个第一隐藏层状态特征向量输入粗糙解码器进行多次解码处理，得到所述语音信息的粗糙语义特征。Inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes to obtain the rough semantic features of the speech information.
  3. The method according to claim 2, wherein determining the temporal information of the previous round of voice information according to the at least one first word comprises:
    inputting the at least one first word into a gated recurrent unit encoder for encoding, to obtain a second hidden layer state feature vector;
    inputting the second hidden layer state feature vector into a multi-layer perceptron, to obtain a linear output result;
    inputting the linear output result into a temporal classifier, to obtain the temporal information of the previous round of voice information.
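The tense-detection pipeline of claim 3 (gated recurrent unit encoder, then a multi-layer perceptron, then a temporal classifier) can be sketched as below. This is a minimal illustrative sketch, not the patented implementation: the dimensions, the single-layer stand-in for the MLP, and the softmax choice for the temporal classifier are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wn, Un):
    """One gated recurrent unit step (PyTorch gate convention)."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    n = np.tanh(Wn @ x + Un @ (r * h))         # candidate state
    return (1 - z) * n + z * h

rng = np.random.default_rng(0)
emb, hid, n_tenses = 8, 16, 3                  # illustrative sizes (assumed)
W = [rng.normal(0, 0.1, (hid, emb)) for _ in range(3)]
U = [rng.normal(0, 0.1, (hid, hid)) for _ in range(3)]

first_words = rng.normal(size=(5, emb))        # embeddings of the first words
h = np.zeros(hid)
for x in first_words:                          # GRU encoder over the first words
    h = gru_step(x, h, W[0], U[0], W[1], U[1], W[2], U[2])
second_hidden_state = h                        # second hidden layer state feature vector

W_mlp = rng.normal(0, 0.1, (n_tenses, hid))    # single linear layer standing in for the MLP
logits = W_mlp @ second_hidden_state           # linear output result
probs = np.exp(logits) / np.exp(logits).sum()  # temporal classifier (softmax assumed)
tense = int(np.argmax(probs))                  # predicted temporal (tense) class
```

The three-class tense set and the softmax head are placeholders; the claim only fixes the encoder → MLP → classifier ordering.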
  4. The method according to claim 2, wherein inputting the at least one piece of rough context information and the at least one first hidden layer state feature vector into the rough decoder for multiple decoding processes to obtain the rough semantic feature of the voice information comprises:
    in an i-th decoding process, inputting an input feature vector A_i into the rough decoder to obtain an output feature vector B_i, wherein i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of rough context information and is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the first piece of rough context information in the at least one piece of rough context information;
    calculating a similarity D_i between the output feature vector B_i and an i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
    normalizing the similarity D_i to obtain a weight E_i of the input feature vector A_i;
    multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
    adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
    taking the target output feature vector G_i as the input feature vector A_{i+1} of an (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes have been performed, to obtain the rough semantic feature of the voice information.
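Claim 4 describes an attention-style loop: each decoder output B_i is scored against the matching encoder hidden state C_i, the score is normalized into a weight, and the weighted hidden state is added back before the next step. A minimal sketch follows; the tanh stand-in for the rough decoder, the dot-product similarity, and the sigmoid normalization of the scalar D_i are assumptions (the claim does not fix any of them).

```python
import numpy as np

rng = np.random.default_rng(1)
dim, j = 16, 4                              # j pieces of rough context information (assumed)

decoder_W = rng.normal(0, 0.1, (dim, dim))  # placeholder weights for the rough decoder
contexts = rng.normal(size=(j, dim))        # rough context information vectors
hidden = rng.normal(size=(j, dim))          # first hidden layer state feature vectors C_i

def rough_decode(a):
    """Stand-in for one pass of the rough decoder."""
    return np.tanh(decoder_W @ a)

a = contexts[0]                             # A_1 = first piece of rough context information
for i in range(j):
    b = rough_decode(a)                     # output feature vector B_i
    d = float(b @ hidden[i])                # similarity D_i (dot product assumed)
    e = 1.0 / (1.0 + np.exp(-d))            # normalized weight E_i (sigmoid assumed)
    f = e * hidden[i]                       # weighted feature vector F_i
    g = f + b                               # target output feature vector G_i
    a = g                                   # becomes A_{i+1} for the next decoding process

rough_semantic_feature = g                  # result after the j-th decoding process
```

Because each G_i feeds the next pass, the final G_j mixes every hidden state into the rough semantic feature, which matches the recurrence the claim sets out.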
  5. The method according to claim 1, wherein
    the keyword group comprises at least one keyword, and the at least one keyword is arranged according to the positional order, in the voice information, of each keyword of the at least one keyword; and
    performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector comprises:
    in an n-th hidden feature extraction process, inputting a first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the first keyword of the at least one keyword;
    taking the first output hidden feature I_n as the first input hidden feature H_{n+1} of an (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes have been performed, to obtain the initial hidden layer state feature vector.
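The chained extraction of claim 5 (each output I_n becomes the next input H_{n+1}) can be sketched as a short recurrence. All sizes are illustrative, and feeding keyword n alongside the carried hidden feature at each step is an assumption the claim leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, m = 16, 5                               # m keywords in the keyword group (assumed)

W = rng.normal(0, 0.1, (dim, dim))           # stand-in GRU-encoder weights
U = rng.normal(0, 0.1, (dim, dim))

keywords = rng.normal(size=(m, dim))         # keyword embeddings, in utterance order
h = keywords[0]                              # H_1 = the first keyword
for n in range(1, m):
    # I_n is reused as the next first input hidden feature H_{n+1}.
    h = np.tanh(W @ keywords[n] + U @ h)

initial_hidden_state = h                     # initial hidden layer state feature vector
```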
  6. The method according to claim 1, wherein performing multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain the at least one reply word comprises:
    in a p-th reply word generation process, inputting an input word vector K_p, a second input hidden feature L_p and the rough semantic feature into a gated recurrent unit decoder, to obtain a reply word O_p and a second output hidden feature R_p, wherein p is an integer greater than or equal to 1 and less than or equal to q, q is an integer greater than or equal to 1 determined by the voice information, and when p = 1, the input word vector K_1 is the initial hidden layer state feature vector;
    performing word embedding on the reply word O_p, to obtain a reply word vector S_p;
    taking the reply word vector S_p as the input word vector K_{p+1} of a (p+1)-th reply word generation process and the second output hidden feature R_p as the second input hidden feature L_{p+1} of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes have been performed, to obtain the at least one reply word.
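Claim 6's generation loop feeds each produced word's embedding back as the next input, conditioned throughout on the rough semantic feature. A minimal sketch, assuming a tanh decoder cell, a greedy argmax over a toy vocabulary, and a fixed q of 4 — none of which the claim specifies:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, vocab, q = 16, 50, 4                        # q reply words to generate (assumed)

W_in = rng.normal(0, 0.1, (dim, dim))            # stand-in GRU-decoder weights
W_h = rng.normal(0, 0.1, (dim, dim))
W_c = rng.normal(0, 0.1, (dim, dim))
W_out = rng.normal(0, 0.1, (vocab, dim))         # projection to reply-word logits
embed = rng.normal(size=(vocab, dim))            # word-embedding table

rough = rng.normal(size=dim)                     # rough semantic feature
k = rng.normal(size=dim)                         # K_1 = initial hidden layer state vector
h = np.zeros(dim)                                # second input hidden feature L_1

reply_ids = []
for p in range(q):
    h = np.tanh(W_in @ k + W_h @ h + W_c @ rough)  # decoder step -> R_p
    word = int(np.argmax(W_out @ h))               # reply word O_p (greedy pick)
    reply_ids.append(word)
    k = embed[word]                                # S_p becomes the next input K_{p+1}

# Splice the reply words in generation order to form the reply sentence.
reply_sentence = " ".join(f"w{i}" for i in reply_ids)
```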
  7. The method according to claim 1, wherein performing word segmentation on the voice information to obtain the keyword group comprises:
    converting the voice information into text, and segmenting the text to obtain at least one first keyword;
    combining a first adjacent word and a second adjacent word to obtain at least one second keyword, wherein the first adjacent word and the second adjacent word are any two different first keywords of the at least one first keyword, and the field interval between the first adjacent word and the second adjacent word is less than a first threshold;
    matching each second keyword of the at least one second keyword against a preset entity library, and filtering out the second keywords that fail to match, to obtain at least one third keyword;
    deleting, from the at least one first keyword, the first keywords that constitute each third keyword of the at least one third keyword, to obtain at least one fourth keyword;
    combining the at least one third keyword and the at least one fourth keyword to obtain the keyword group.
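The keyword-group construction of claim 7 can be sketched as follows. The function name, the set-based entity library, and the simplification of the "field interval below a threshold" condition to immediate adjacency are all illustrative assumptions.

```python
def build_keyword_group(tokens, entity_lib):
    """Sketch of claim 7: combine adjacent first keywords, keep entity-library
    matches (third keywords), drop their constituents, merge the remainder."""
    # Combine adjacent first keywords into second-keyword candidates
    # (adjacency stands in for the field-interval threshold here).
    pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
    second = ["".join(p) for p in pairs]
    # Keep only candidates that match the preset entity library.
    third = [w for w in second if w in entity_lib]
    # Remove first keywords absorbed into a matched entity; the rest are fourth keywords.
    used = {t for w, p in zip(second, pairs) if w in third for t in p}
    fourth = [t for t in tokens if t not in used]
    return third + fourth                     # the keyword group

tokens = ["平安", "科技", "语音", "助手"]     # segmented first keywords (example input)
lib = {"平安科技"}                            # hypothetical entity library
group = build_keyword_group(tokens, lib)      # ['平安科技', '语音', '助手']
```

In the example, "平安" and "科技" merge into the entity "平安科技" and are deleted individually, while the unmatched tokens survive as fourth keywords.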
  8. An apparatus for determining a reply sentence based on rough semantics, wherein the apparatus comprises:
    an acquisition module, configured to acquire, according to the occurrence time of a user's voice information at the current moment, a previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
    a processing module, configured to perform rough semantic extraction on the voice information according to the previous round of voice information to obtain a rough semantic feature corresponding to the voice information, perform word segmentation on the voice information to obtain a keyword group, and perform multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer state feature vector; and
    a generation module, configured to perform multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain at least one reply word, and splice the at least one reply word according to the generation order of each reply word in the at least one reply word to obtain a reply sentence for the voice information.
  9. An electronic device, comprising a processor, a memory, a communication interface and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the one or more programs comprise instructions for performing the following steps:
    acquiring, according to the occurrence time of a user's voice information at the current moment, a previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
    performing rough semantic extraction on the voice information according to the previous round of voice information, to obtain a rough semantic feature corresponding to the voice information;
    performing word segmentation on the voice information, to obtain a keyword group;
    performing multiple hidden feature extraction processes on the keyword group, to obtain an initial hidden layer state feature vector;
    performing multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector, to obtain at least one reply word;
    splicing the at least one reply word according to the generation order of each reply word in the at least one reply word, to obtain a reply sentence for the voice information.
  10. The electronic device according to claim 9, wherein performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic feature corresponding to the voice information comprises:
    detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word carries a word label;
    determining temporal information of the previous round of voice information according to the at least one first word;
    adding the temporal information to the word label of each first word to obtain at least one second word, wherein the at least one second word corresponds one-to-one to the at least one first word;
    inputting the at least one second word into a rough encoder for encoding, to obtain at least one piece of rough context information and at least one first hidden layer state feature vector, wherein the at least one piece of rough context information corresponds one-to-one to the at least one second word, and the at least one first hidden layer state feature vector corresponds one-to-one to the at least one second word;
    inputting the at least one piece of rough context information and the at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes, to obtain the rough semantic feature of the voice information.
  11. The electronic device according to claim 10, wherein determining the temporal information of the previous round of voice information according to the at least one first word comprises:
    inputting the at least one first word into a gated recurrent unit encoder for encoding, to obtain a second hidden layer state feature vector;
    inputting the second hidden layer state feature vector into a multi-layer perceptron, to obtain a linear output result;
    inputting the linear output result into a temporal classifier, to obtain the temporal information of the previous round of voice information.
  12. The electronic device according to claim 10, wherein inputting the at least one piece of rough context information and the at least one first hidden layer state feature vector into the rough decoder for multiple decoding processes to obtain the rough semantic feature of the voice information comprises:
    in an i-th decoding process, inputting an input feature vector A_i into the rough decoder to obtain an output feature vector B_i, wherein i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of rough context information and is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the first piece of rough context information in the at least one piece of rough context information;
    calculating a similarity D_i between the output feature vector B_i and an i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
    normalizing the similarity D_i to obtain a weight E_i of the input feature vector A_i;
    multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
    adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
    taking the target output feature vector G_i as the input feature vector A_{i+1} of an (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes have been performed, to obtain the rough semantic feature of the voice information.
  13. The electronic device according to claim 9, wherein
    the keyword group comprises at least one keyword, and the at least one keyword is arranged according to the positional order, in the voice information, of each keyword of the at least one keyword; and
    performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector comprises:
    in an n-th hidden feature extraction process, inputting a first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the first keyword of the at least one keyword;
    taking the first output hidden feature I_n as the first input hidden feature H_{n+1} of an (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes have been performed, to obtain the initial hidden layer state feature vector.
  14. The electronic device according to claim 9, wherein performing multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain the at least one reply word comprises:
    in a p-th reply word generation process, inputting an input word vector K_p, a second input hidden feature L_p and the rough semantic feature into a gated recurrent unit decoder, to obtain a reply word O_p and a second output hidden feature R_p, wherein p is an integer greater than or equal to 1 and less than or equal to q, q is an integer greater than or equal to 1 determined by the voice information, and when p = 1, the input word vector K_1 is the initial hidden layer state feature vector;
    performing word embedding on the reply word O_p, to obtain a reply word vector S_p;
    taking the reply word vector S_p as the input word vector K_{p+1} of a (p+1)-th reply word generation process and the second output hidden feature R_p as the second input hidden feature L_{p+1} of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes have been performed, to obtain the at least one reply word.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following steps:
    acquiring, according to the occurrence time of a user's voice information at the current moment, a previous round of voice information adjacent to the voice information, wherein the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the occurrence time of the previous round of voice information and the occurrence time of the voice information is the smallest;
    performing rough semantic extraction on the voice information according to the previous round of voice information, to obtain a rough semantic feature corresponding to the voice information;
    performing word segmentation on the voice information, to obtain a keyword group;
    performing multiple hidden feature extraction processes on the keyword group, to obtain an initial hidden layer state feature vector;
    performing multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector, to obtain at least one reply word;
    splicing the at least one reply word according to the generation order of each reply word in the at least one reply word, to obtain a reply sentence for the voice information.
  16. The computer-readable storage medium according to claim 15, wherein performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic feature corresponding to the voice information comprises:
    detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word carries a word label;
    determining temporal information of the previous round of voice information according to the at least one first word;
    adding the temporal information to the word label of each first word to obtain at least one second word, wherein the at least one second word corresponds one-to-one to the at least one first word;
    inputting the at least one second word into a rough encoder for encoding, to obtain at least one piece of rough context information and at least one first hidden layer state feature vector, wherein the at least one piece of rough context information corresponds one-to-one to the at least one second word, and the at least one first hidden layer state feature vector corresponds one-to-one to the at least one second word;
    inputting the at least one piece of rough context information and the at least one first hidden layer state feature vector into a rough decoder for multiple decoding processes, to obtain the rough semantic feature of the voice information.
  17. The computer-readable storage medium according to claim 16, wherein determining the temporal information of the previous round of voice information according to the at least one first word comprises:
    inputting the at least one first word into a gated recurrent unit encoder for encoding, to obtain a second hidden layer state feature vector;
    inputting the second hidden layer state feature vector into a multi-layer perceptron, to obtain a linear output result;
    inputting the linear output result into a temporal classifier, to obtain the temporal information of the previous round of voice information.
  18. The computer-readable storage medium according to claim 16, wherein inputting the at least one piece of rough context information and the at least one first hidden layer state feature vector into the rough decoder for multiple decoding processes to obtain the rough semantic feature of the voice information comprises:
    in an i-th decoding process, inputting an input feature vector A_i into the rough decoder to obtain an output feature vector B_i, wherein i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one piece of rough context information and is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the first piece of rough context information in the at least one piece of rough context information;
    calculating a similarity D_i between the output feature vector B_i and an i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
    normalizing the similarity D_i to obtain a weight E_i of the input feature vector A_i;
    multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
    adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
    taking the target output feature vector G_i as the input feature vector A_{i+1} of an (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes have been performed, to obtain the rough semantic feature of the voice information.
  19. The computer-readable storage medium according to claim 15, wherein
    the keyword group comprises at least one keyword, and the at least one keyword is arranged according to the positional order, in the voice information, of each keyword of the at least one keyword; and
    performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector comprises:
    in an n-th hidden feature extraction process, inputting a first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the first keyword of the at least one keyword;
    taking the first output hidden feature I_n as the first input hidden feature H_{n+1} of an (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes have been performed, to obtain the initial hidden layer state feature vector.
  20. The computer-readable storage medium according to claim 15, wherein performing multiple reply word generation processes according to the rough semantic feature and the initial hidden layer state feature vector to obtain the at least one reply word comprises:
    in a p-th reply word generation process, inputting an input word vector K_p, a second input hidden feature L_p and the rough semantic feature into a gated recurrent unit decoder, to obtain a reply word O_p and a second output hidden feature R_p, wherein p is an integer greater than or equal to 1 and less than or equal to q, q is an integer greater than or equal to 1 determined by the voice information, and when p = 1, the input word vector K_1 is the initial hidden layer state feature vector;
    performing word embedding on the reply word O_p, to obtain a reply word vector S_p;
    taking the reply word vector S_p as the input word vector K_{p+1} of a (p+1)-th reply word generation process and the second output hidden feature R_p as the second input hidden feature L_{p+1} of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes have been performed, to obtain the at least one reply word.
PCT/CN2022/090129 2022-01-22 2022-04-29 Reply statement determination method and apparatus based on rough semantics, and electronic device WO2023137903A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210083351.8A CN114417891B (en) 2022-01-22 2022-01-22 Reply statement determination method and device based on rough semantics and electronic equipment
CN202210083351.8 2022-01-22

Publications (1)

Publication Number Publication Date
WO2023137903A1 true WO2023137903A1 (en) 2023-07-27

Family

ID=81278095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090129 WO2023137903A1 (en) 2022-01-22 2022-04-29 Reply statement determination method and apparatus based on rough semantics, and electronic device

Country Status (2)

Country Link
CN (1) CN114417891B (en)
WO (1) WO2023137903A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842169A * 2023-09-01 2023-10-03 国网山东省电力公司聊城供电公司 Power grid session management method, system, terminal and storage medium
CN116842169B * 2023-09-01 2024-01-12 国网山东省电力公司聊城供电公司 Power grid session management method, system, terminal and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016090891A (en) * 2014-11-07 2016-05-23 トヨタ自動車株式会社 Response generation apparatus, response generation method, and response generation program
CN109241262A (en) * 2018-08-31 2019-01-18 出门问问信息科技有限公司 The method and device of revert statement is generated based on keyword
CN110851574A (en) * 2018-07-27 2020-02-28 北京京东尚科信息技术有限公司 Statement processing method, device and system
CN111460115A (en) * 2020-03-17 2020-07-28 深圳市优必选科技股份有限公司 Intelligent man-machine conversation model training method, model training device and electronic equipment
CN113035179A (en) * 2021-03-03 2021-06-25 科大讯飞股份有限公司 Voice recognition method, device, equipment and computer readable storage medium
CN113378557A (en) * 2021-05-08 2021-09-10 重庆邮电大学 Automatic keyword extraction method, medium and system based on fault-tolerant rough set

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413729B (en) * 2019-06-25 2023-04-07 江南大学 Multi-turn dialogue generation method based on clause-context dual attention model
CN112732340B (en) * 2019-10-14 2022-03-15 思必驰科技股份有限公司 Man-machine conversation processing method and device
CN111368538B (en) * 2020-02-29 2023-10-24 平安科技(深圳)有限公司 Voice interaction method, system, terminal and computer readable storage medium
CN112528989B (en) * 2020-12-01 2022-10-18 重庆邮电大学 Description generation method for semantic fine granularity of image

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842169A (en) * 2023-09-01 2023-10-03 State Grid Shandong Electric Power Company Liaocheng Power Supply Company Power grid session management method, system, terminal and storage medium
CN116842169B (en) * 2023-09-01 2024-01-12 State Grid Shandong Electric Power Company Liaocheng Power Supply Company Power grid session management method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN114417891B (en) 2023-05-09
CN114417891A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN112528672B Aspect-level sentiment analysis method and device based on graph convolutional neural network
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
CN112949415B (en) Image processing method, apparatus, device and medium
CN110737758A (en) Method and apparatus for generating a model
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN111026840B (en) Text processing method, device, server and storage medium
CN116304748B (en) Text similarity calculation method, system, equipment and medium
WO2023134083A1 (en) Text-based sentiment classification method and apparatus, and computer device and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN114912450B (en) Information generation method and device, training method, electronic device and storage medium
US20220358955A1 (en) Method for detecting voice, method for training, and electronic devices
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
WO2023137903A1 (en) Reply statement determination method and apparatus based on rough semantics, and electronic device
CN112906368B (en) Industry text increment method, related device and computer program product
CN111767720B (en) Title generation method, computer and readable storage medium
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
WO2021129411A1 (en) Text processing method and device
CN113761923A (en) Named entity recognition method and device, electronic equipment and storage medium
CN116881446A (en) Semantic classification method, device, equipment and storage medium thereof
CN116186244A (en) Method for generating text abstract, method and device for training abstract generation model
CN115827865A (en) Method and system for classifying objectionable texts by fusing multi-feature map attention mechanism
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22921334

Country of ref document: EP

Kind code of ref document: A1