CN116467407A - Voice processing method, device and equipment

Info

Publication number: CN116467407A
Application number: CN202310429483.6A
Authority: CN
Inventors: 张若璇; 陈永录; 高宏超; 李宁
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-04-20
Filing date: 2023-04-20
Publication date: 2023-07-21

Abstract

The embodiments of the application provide a voice processing method, device and equipment, and relate to the technical field of artificial intelligence. The method comprises: acquiring a first voice; acquiring a voice reply relation network, wherein the voice reply relation network comprises a plurality of question groups and a reply corresponding to each question group, and the similarity between the questions within a question group is greater than or equal to a first threshold; processing the first voice through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice; determining a plurality of candidate questions in the voice reply relation network according to the at least one semantic feature, wherein the similarity between each candidate question and the at least one semantic feature is greater than or equal to a second threshold; and determining, according to the candidate questions, a target reply corresponding to the first voice in the voice reply relation network, and outputting the target reply. The method improves the accuracy of voice processing.

Description

Voice processing method, device and equipment

Technical Field

The embodiment of the application relates to the technical field of artificial intelligence, in particular to a voice processing method, a voice processing device and voice processing equipment.

Background

In business transactions, enterprises can use artificial intelligence devices to answer user questions automatically, which saves human resources and improves working efficiency.

In the related art, a user's spoken question may be processed as follows: after the artificial intelligence device obtains the user's question voice, it performs text extraction on the question voice to obtain at least one keyword corresponding to the question voice. According to the at least one keyword, the question with the highest similarity to the question voice is determined in a database. The reply corresponding to that question is then determined and fed back to the customer.

In this process, the question text with the highest similarity to the question voice is determined only from the at least one keyword corresponding to the question voice. The question text with the highest similarity may nevertheless differ significantly in semantics from the question voice. As a result, the reply text determined from that question text may not be the reply that corresponds to the question voice, so the accuracy of voice processing is low.

Disclosure of Invention

The embodiment of the application provides a voice processing method, device and equipment, which are used for solving the problem of low accuracy of voice processing.

In a first aspect, an embodiment of the present application provides a method for processing speech, including:

acquiring a first voice;

acquiring a voice reply relation network, wherein the voice reply relation network comprises a plurality of question groups and a reply corresponding to each question group, and the similarity between the questions within a question group is greater than or equal to a first threshold;

processing the first voice through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice;

determining a plurality of candidate questions in the voice reply relation network according to the at least one semantic feature, wherein the similarity between each candidate question and the at least one semantic feature is greater than or equal to a second threshold;

and determining, according to the candidate questions, a target reply corresponding to the first voice in the voice reply relation network, and outputting the target reply.

In a second aspect, embodiments of the present application provide a speech processing apparatus, the apparatus including:

a first acquisition module, configured to acquire a first voice;

a second acquisition module, configured to acquire a voice reply relation network, wherein the voice reply relation network comprises a plurality of question groups and a reply corresponding to each question group, and the similarity between the questions within a question group is greater than or equal to a first threshold;

a processing module, configured to process the first voice through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice;

a first determining module, configured to determine a plurality of candidate questions in the voice reply relation network according to the at least one semantic feature, wherein the similarity between each candidate question and the at least one semantic feature is greater than or equal to a second threshold;

and a second determining module, configured to determine, according to the candidate questions, a target reply corresponding to the first voice in the voice reply relation network, and to output the target reply.

In a third aspect, an embodiment of the present application provides a speech processing apparatus, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.

In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the first aspects.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the first aspects.

With the voice processing method, device and equipment provided by the embodiments of the application, the first voice and the voice reply relation network are acquired. The first voice is processed through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice. The similarity between the at least one semantic feature and each question in the voice reply relation network is determined through a first similarity algorithm. The questions in the voice reply relation network are sorted in descending order of similarity to obtain a plurality of sorted questions. The first K sorted questions are determined as a first question set, and the remaining questions are determined as a second question set. A plurality of candidate questions are determined according to the at least one semantic feature, the first question set and the second question set. According to the candidate questions, a target reply corresponding to the first voice is determined in the voice reply relation network and output. In this process, processing the first voice through the first model yields at least one semantic feature corresponding to at least one keyword in the first voice. One semantic feature indicates all words with the same or similar semantics, and the candidate questions are determined through the semantic features corresponding to the keywords. Determining the target reply from the plurality of candidate questions avoids the situation in which the first voice and a question in the database have the same or similar keywords but differ greatly in semantics, which improves the accuracy of voice processing.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;

Fig. 2 is a schematic flow chart of a voice processing method according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a process of acquiring a first voice according to an embodiment of the present application;

Fig. 4 is a schematic flow chart of another voice processing method according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a voice reply relation network according to an embodiment of the present application;

Fig. 6 is a schematic diagram of a voice processing procedure according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a voice processing device according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of another speech processing device according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a speech processing device according to an embodiment of the present application.

Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.

It should be noted that the method and apparatus for speech processing of the present application may be used in the field of artificial intelligence, and may also be used in any field other than artificial intelligence, and the application field of the method and apparatus for speech processing of the present application is not limited.

In order to facilitate understanding, an application scenario to which the embodiments of the present application are applicable is described below with reference to fig. 1.

Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application. Referring to Fig. 1, the scenario includes a terminal device 101 and a voice processing device 102. The terminal device 101 may be a mobile phone, a computer, etc., and the voice processing device 102 may be a server. The user can ask a question through an application program provided by the terminal device 101; the terminal device 101 acquires the user's question voice and sends it to the voice processing device 102. The voice processing device 102 determines, in the database, a reply corresponding to the question voice sent by the terminal device 101, and transmits the reply to the terminal device 101. The terminal device 101 may display or play the reply so that the user obtains the reply corresponding to the question.

In the related art, a user's spoken question may be processed as follows: after the artificial intelligence device obtains the user's question voice, it performs text extraction on the question voice to obtain at least one keyword corresponding to the question voice. According to the at least one keyword, the question with the highest similarity to the question voice is determined in a database. The reply corresponding to that question is then determined and fed back to the customer. In this process, the question with the highest similarity to the question voice is determined only from the at least one keyword corresponding to the question voice. The question with the highest similarity may nevertheless differ greatly in semantics from the question voice. As a result, the reply determined from that question may not be the reply that corresponds to the question voice, which lowers the accuracy of voice processing.

In the embodiments of the application, a first voice corresponding to a user question and a voice reply relation network are acquired. The first voice is processed through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice. According to the at least one semantic feature, a plurality of candidate questions with the highest similarity to the first voice are determined in the voice reply relation network. According to the candidate questions, a target reply corresponding to the first voice is determined in the voice reply relation network and output. In this process, processing the first voice through the first model yields at least one semantic feature corresponding to at least one keyword in the first voice. One semantic feature indicates all words with the same or similar semantics, and the candidate questions are determined through the semantic features corresponding to the keywords. Determining the target reply from the plurality of candidate questions avoids the situation in which the first voice and a question in the database have the same or similar keywords but differ greatly in semantics, which improves the accuracy of voice processing.

The method shown in the present application will be described below by way of specific examples. It should be noted that the following embodiments may exist alone or in combination with each other, and for the same or similar content, the description will not be repeated in different embodiments.

Fig. 2 is a flow chart of a voice processing method according to an embodiment of the present application. Referring to fig. 2, the method may include:

S201, acquiring a first voice.

The execution body of the embodiment of the application may be a voice processing device, or may be a voice processing apparatus provided in the voice processing device. The speech processing means may be implemented by software or by a combination of software and hardware. The speech processing device can be a server.

The user can ask questions through prompt information displayed by an application program in the terminal equipment. The audio acquisition device of the terminal device acquires the questions of the user and sends the questions of the user to the voice processing device. The terminal equipment can be a mobile phone, a tablet personal computer and the like.

Next, a process of acquiring the first voice will be described with reference to Fig. 3. Fig. 3 is a schematic diagram of a process of acquiring a first voice according to an embodiment of the present application. Referring to Fig. 3, interfaces 301 and 302 are included. Interfaces 301 and 302 are query pages provided by an application program in the terminal device. Referring to interface 301, the user opens the query page in the application program of the terminal device, and a dialog box is displayed in the query page to prompt the user to perform a corresponding operation. To make an information query, the user can click and hold the talk button in interface 301, and in response to this operation the terminal device starts recording the user's question through its recording device. Referring to interface 302, after recording the user's question, the terminal device sends the first voice corresponding to the question to the voice processing device and displays the text corresponding to the first voice on the query page to indicate to the user that the question has been received.

S202, acquiring a voice reply relation network.

The voice reply relation network comprises a plurality of question groups and a reply corresponding to each question group, and the similarity between the questions within a question group is greater than or equal to a first threshold.

The voice reply relation network is established by means of a knowledge graph from the voices acquired in a historical period and their corresponding replies, and is stored in a preset storage space of the voice processing device.

The questions in a question group may be questions with the same or similar semantics. For example, the questions in a question group may be as shown in Table 1:

TABLE 1

Question	Question content
Question 1	How much money is there?
Question 2	Querying the current account balance
Question 3	How much money remains in the account?
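
The structure described in S202 can be pictured with a minimal sketch. The class names `QuestionGroup` and `ReplyNetwork`, the method `reply_for`, and the reply string are all hypothetical and chosen only for illustration; the source states only that the network holds question groups and one reply per group.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuestionGroup:
    """A set of questions with the same or similar semantics sharing one reply."""
    questions: List[str]
    reply: str


@dataclass
class ReplyNetwork:
    """Hypothetical container for the question groups of the network."""
    groups: List[QuestionGroup] = field(default_factory=list)

    def add_group(self, group: QuestionGroup) -> None:
        self.groups.append(group)

    def reply_for(self, question: str) -> Optional[str]:
        # Linear scan is enough for a sketch; a real store would index this.
        for group in self.groups:
            if question in group.questions:
                return group.reply
        return None


network = ReplyNetwork()
network.add_group(QuestionGroup(
    questions=[
        "How much money is there?",
        "Querying the current account balance",
        "How much money remains in the account?",
    ],
    reply="Your current balance is shown on the account page.",  # made-up reply
))
```

All three questions of Table 1 resolve to the same reply, which is the property the question group is meant to provide.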

S203, processing the first voice through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice.

The first model may be a word-to-vector (word2vec) model.

The at least one semantic feature corresponding to the at least one keyword in the first voice may be obtained by: performing voice recognition processing and word segmentation processing on the first voice to obtain a first text corresponding to the first voice, wherein the first text comprises at least one sentence text corresponding to the first voice and a keyword label corresponding to the sentence text, and the keyword label is used for indicating part-of-speech classification of the keyword; and processing the first text through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice.

The part-of-speech classification includes at least: nouns, verbs, adjectives, prepositions, pronouns, numerals, conjunctions, auxiliary words, time words, status words, locative words, punctuation marks, and the like.

Semantic features may be represented by word vectors. The word vector may be an n-dimensional vector, and each element in the word vector is used to indicate a semantic feature corresponding to a keyword in the first text.

Semantic features are used to indicate multiple keywords of the same or similar semantics. For example, balance, remaining money, etc. may be represented by the same semantic features.

For example, after the voice processing device acquires the first voice, it performs voice recognition processing on the first voice and obtains the text: how much money is still in my account. Word segmentation processing is performed on the text to obtain the first text: I (pronoun), in the account, still have (adverb), how much (adjective), money (noun). Each keyword in the first text is input into the first model, and the first model processes the first text to obtain a word vector A representing the at least one semantic feature corresponding to the at least one keyword in the first voice. The word vector may specifically be A = (a, b, c, d)^T, where each element indicates the semantic feature corresponding to a keyword in the first text.
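
A minimal sketch of this step follows. It replaces the trained word2vec model with a small hand-written lookup table, so the vectors, the `EMBEDDINGS` table, the `KEYWORD_TAGS` set and the function `semantic_features` are assumptions made for illustration, not the model of the application. It shows how two keywords with similar meaning ("balance" and "money") can share one semantic feature.

```python
# Toy embedding table standing in for a trained word2vec model.
# All vectors are invented for illustration.
EMBEDDINGS = {
    "balance": (0.9, 0.1, 0.0, 0.0),
    "money": (0.9, 0.1, 0.0, 0.0),    # same vector as "balance": same semantic feature
    "account": (0.1, 0.8, 0.1, 0.0),
    "repayment": (0.0, 0.1, 0.9, 0.0),
}

# Part-of-speech tags treated as keywords (an assumption of this sketch).
KEYWORD_TAGS = {"noun", "verb"}


def semantic_features(tagged_tokens):
    """Return word vectors for tokens whose tag marks them as keywords."""
    return [
        EMBEDDINGS[word]
        for word, tag in tagged_tokens
        if tag in KEYWORD_TAGS and word in EMBEDDINGS
    ]


# Segmented and tagged first text, loosely following the example above.
first_text = [
    ("I", "pronoun"),
    ("account", "noun"),
    ("still", "adverb"),
    ("how much", "adjective"),
    ("money", "noun"),
]
features = semantic_features(first_text)
```

Only the two nouns survive the keyword filter here, so `features` holds one vector for "account" and one for "money".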

S204, determining a plurality of candidate questions in the voice response relation network according to at least one semantic feature.

The similarity of the candidate question and the at least one semantic feature is greater than or equal to a second threshold.

The plurality of candidate questions may be determined in the voice reply relation network by: determining the similarity between the at least one semantic feature and each question in the voice reply relation network through a first similarity algorithm; sorting the questions in the voice reply relation network in descending order of similarity to obtain a plurality of sorted questions; determining the first K sorted questions as a first question set, and determining the remaining questions as a second question set, wherein K is an integer greater than or equal to 1; and determining the plurality of candidate questions according to the at least one semantic feature, the first question set and the second question set.

The first similarity algorithm may be a word centroid distance (Word Centroid Distance, WCD) algorithm.
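
As a rough illustration, word centroid distance is commonly computed as the Euclidean distance between the mean word vectors of two texts. The sketch below uses uniform word weights. The function `wcd_similarity`, which maps the distance into the range (0, 1], is an assumption added so that the result can be compared against a percentage threshold; the source does not say how a distance is turned into a similarity.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def word_centroid_distance(doc_a, doc_b):
    """Euclidean distance between the centroids of two sets of word vectors."""
    return math.dist(centroid(doc_a), centroid(doc_b))


def wcd_similarity(doc_a, doc_b):
    """Map the distance to (0, 1]; this mapping is an assumption of the sketch."""
    return 1.0 / (1.0 + word_centroid_distance(doc_a, doc_b))


query = [(1.0, 0.0), (0.0, 1.0)]   # centroid (0.5, 0.5)
question = [(0.5, 0.5)]            # centroid (0.5, 0.5)
other = [(3.5, 4.5)]               # centroid (3.5, 4.5)
```

Because only one centroid per text is compared, the cost per question is linear in the vector dimension, which is why a measure of this kind suits a pass over every question in the network.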

From the first question set and the second question set, the K questions whose similarity is greater than or equal to the second threshold and greater than that of all other questions are determined as the plurality of candidate questions.

For example, suppose the voice reply relation network contains 100 questions and K is 10. The similarity between the at least one semantic feature and each of the 100 questions is determined through the first similarity algorithm. The 100 questions are sorted in descending order of similarity to obtain the sorted questions. The first 10 sorted questions (questions 1 to 10) are determined as the first question set, and the remaining 90 sorted questions (questions 11 to 100) are determined as the second question set. According to the at least one semantic feature, the first question set and the second question set, the 10 questions whose similarity is greater than or equal to the second threshold of 90% are determined as the plurality of candidate questions.

If the similarity of each of questions 1 to 10 is greater than or equal to the similarity of every one of questions 11 to 100, and is greater than or equal to the second threshold, questions 1 to 10 in the first question set are determined as the plurality of candidate questions. If any of questions 11 to 100 has a similarity greater than that of one of questions 1 to 10, the ranking of all questions is updated until there are 10 questions whose similarity is greater than or equal to the second threshold and greater than that of the remaining 90 questions. Those 10 questions are determined as the plurality of candidate questions.
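
The check described above can be sketched as follows. The function names and the sample scores are hypothetical. The sketch also simplifies the update step: instead of repeating the update until exactly K questions qualify, it re-ranks once and drops any question below the threshold.

```python
def is_valid_first_set(similarities, first_set, second_set, threshold):
    """True if every question in first_set clears the threshold and is at
    least as similar as every question in second_set."""
    lowest_first = min(similarities[q] for q in first_set)
    highest_second = max(
        (similarities[q] for q in second_set), default=float("-inf")
    )
    return lowest_first >= threshold and lowest_first >= highest_second


def candidate_questions(similarities, first_set, second_set, threshold):
    """Return first_set if it is valid; otherwise re-rank both sets and keep
    the best len(first_set) questions that clear the threshold."""
    if is_valid_first_set(similarities, first_set, second_set, threshold):
        return list(first_set)
    k = len(first_set)
    ranked = sorted(
        list(first_set) + list(second_set),
        key=similarities.get,
        reverse=True,
    )
    return [q for q in ranked[:k] if similarities[q] >= threshold]


scores = {"q1": 0.97, "q2": 0.91, "q3": 0.95, "q4": 0.40}
# q3 in the second set outranks q2 in the first set, so the ranking is updated.
candidates = candidate_questions(scores, ["q1", "q2"], ["q3", "q4"], 0.90)
```

With these scores the updated candidate list is q1 followed by q3.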

S205, determining, according to the candidate questions, a target reply corresponding to the first voice in the voice reply relation network, and outputting the target reply.

For example, the voice processing device may specifically determine, according to at least one semantic feature, a plurality of candidate questions in the voice response relationship network as shown in table 2:

TABLE 2

Question	Question content
Question 1	When is my next repayment due?
Question 2	What is the repayment cycle?
Question 3	By what date must I repay to avoid being overdue?

According to the candidate questions, the voice processing device may determine in the voice reply relation network that the target reply corresponding to the first voice is: the repayment date is the 5th of each month. The voice processing device may send the target reply to the terminal device, which may directly play or display the target reply.
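
This passage does not spell out how the target reply is chosen from several candidate questions. One possible rule, shown here purely as an assumption, is to return the reply of the question group that contains the most candidate questions. The identifiers `group_of`, `reply_of` and `target_reply`, and the second reply string, are hypothetical.

```python
from collections import Counter


def target_reply(candidates, group_of, reply_of):
    """Return the reply of the question group hit most often by the candidates.

    Majority voting is an assumption of this sketch, not a rule from the source.
    """
    votes = Counter(group_of[q] for q in candidates)
    best_group, _ = votes.most_common(1)[0]
    return reply_of[best_group]


# Hypothetical mapping of candidate questions to question groups.
group_of = {
    "q1": "g_repay_date",
    "q2": "g_repay_cycle",
    "q3": "g_repay_date",
}
reply_of = {
    "g_repay_date": "The repayment date is the 5th of each month.",
    "g_repay_cycle": "The repayment cycle is one month.",  # made-up reply
}
reply = target_reply(["q1", "q2", "q3"], group_of, reply_of)
```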

According to the voice processing method provided by the embodiments of the application, the first voice and the voice reply relation network are acquired. The first voice is processed through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice. A plurality of candidate questions are determined in the voice reply relation network according to the at least one semantic feature. According to the candidate questions, a target reply corresponding to the first voice is determined in the voice reply relation network and output. In this process, processing the first voice through the first model yields at least one semantic feature corresponding to at least one keyword in the first voice. One semantic feature indicates all words with the same or similar semantics, and the candidate questions are determined through the semantic features corresponding to the keywords. Determining the target reply from the plurality of candidate questions avoids the situation in which the first voice and a question in the database have the same or similar keywords but differ greatly in semantics, which improves the accuracy of voice processing.

On the basis of any of the above embodiments, a detailed procedure of the voice processing will be described below with reference to fig. 4.

Fig. 4 is a flow chart of another voice processing method according to an embodiment of the present application. Referring to fig. 4, the method includes:

S401, acquiring a first voice.

It should be noted that, for the execution of S401, reference may be made to S201, and details are not repeated here.

S402, acquiring a voice reply relation network.

Before the voice reply relation network is acquired, it can be established from a plurality of questions acquired in a historical period and the reply corresponding to each question, and stored in a preset storage space of the voice processing device.

The voice reply relation network may be determined by: acquiring a plurality of voice questions and the reply corresponding to each voice question; classifying the plurality of voice questions to obtain a plurality of question groups, wherein the similarity between the questions within a question group is greater than or equal to a first threshold; determining the reply corresponding to each question group; and generating the voice reply relation network according to the plurality of voice questions, the plurality of question groups and the reply corresponding to each question group.
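
The classification step can be sketched with a simple greedy grouping. This is only one way to satisfy the first-threshold condition and is not claimed to be the method of the application. The function `build_groups` is hypothetical, and `jaccard` is a toy word-overlap similarity standing in for a semantic similarity.

```python
def build_groups(questions, similarity, first_threshold):
    """Greedy grouping: a question joins the first group whose members are
    all at least first_threshold similar to it; otherwise it starts a new group."""
    groups = []
    for q in questions:
        for group in groups:
            if all(similarity(q, member) >= first_threshold for member in group):
                group.append(q)
                break
        else:
            groups.append([q])
    return groups


def jaccard(a, b):
    """Toy similarity on word sets; a stand-in for a semantic similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


questions = [
    "check account balance",
    "check my account balance",
    "next repayment date",
    "my next repayment date",
]
groups = build_groups(questions, jaccard, first_threshold=0.75)
```

Requiring every member of a group to clear the threshold keeps the pairwise similarity inside each group at or above the first threshold, at the cost of results that depend on input order.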

Next, the structure of the voice reply relation network will be described with reference to Fig. 5. Fig. 5 is a schematic structural diagram of a voice reply relation network according to an embodiment of the present application. Referring to Fig. 5, a voice reply relation network 501 is included, and the voice reply relation network 501 is stored in a preset storage space of the voice processing device. The voice reply relation network 501 includes 5 question groups: question group 1, question group 2, question group 3, question group 4 and question group 5. The replies corresponding to the question groups are reply 1, reply 2, reply 3, reply 4 and reply 5, respectively. Each question group includes a plurality of questions, and the similarity between the questions in each group is greater than or equal to the first threshold of 95%.

An association relationship can be established between a plurality of question groups of the same type; the association relationship indicates that the questions in those question groups belong to the same service type. For example, question group 1 and question group 2 shown in Fig. 5 have an association relationship (indicated by a dotted box), and the service type corresponding to the questions in question group 1 and question group 2 is querying account information. Question group 3 and question group 4 have an association relationship, and the service type corresponding to the questions in question group 3 and question group 4 is querying the repayment deadline.

When determining the association relationships between question groups, service types with multiple dimensions and multiple levels can be set in advance, and the association relationships between question groups are established according to these dimensions. For example, for questions related to querying business information, the dimensions of the service type may be set to include: query account information, query account type, and query account deposit and withdrawal limits. In this way, multi-dimensional, multi-level association relationships between question groups can be established from a plurality of questions.
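
One way to picture a multi-level association is to give each question group a path of service types and to treat two groups as associated when their paths share a leading level. The `SERVICE_PATH` table, its level names and the two functions are assumptions made for this sketch; the source does not prescribe a representation.

```python
# Hypothetical service-type paths, ordered from broad to narrow.
SERVICE_PATH = {
    "group1": ("query account information", "query account type"),
    "group2": ("query account information",
               "query account deposit and withdrawal limits"),
    "group3": ("query repayment deadline",),
    "group4": ("query repayment deadline",),
}


def shared_levels(a, b):
    """Number of leading service-type levels two groups have in common."""
    count = 0
    for level_a, level_b in zip(SERVICE_PATH[a], SERVICE_PATH[b]):
        if level_a != level_b:
            break
        count += 1
    return count


def associated(a, b):
    """Two groups are associated if they share at least the top-level type."""
    return shared_levels(a, b) > 0
```

Under this representation, group1 and group2 are associated at the top level only, while group3 and group4 match on their whole path.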

S403, processing the first voice through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice.

Before the first voice is processed through the first model, the first model can be trained according to the voice acquired in the historical period and the reply corresponding to the voice, so that the accuracy of the output result of the first model is improved.

The first model may be trained by: acquiring a training set, wherein the training set comprises a plurality of second voices and at least one semantic feature corresponding to each second voice; performing voice recognition processing and word segmentation processing on the second voice to obtain a second text corresponding to the second voice and a keyword label corresponding to the second text, wherein the keyword label is used for indicating part-of-speech classification of the keyword; and (3) performing ith iterative training on the ith intermediate model through the second voice to obtain an (i+1) th intermediate model, wherein i sequentially takes 1, 2, 3 and … … until the ith intermediate model converges, and when i is greater than or equal to N, the ith intermediate model is determined to be a first model, N is a preset iteration number, N is an integer greater than 1, and the 1 st intermediate model is an initial model.

The i+1th intermediate model can be obtained by: performing feature extraction processing on each second text through the ith intermediate model to obtain at least one predicted semantic feature corresponding to each second text; determining a loss value according to at least one predicted semantic feature corresponding to each first text and at least one semantic feature corresponding to the training set; and updating model parameters of the ith intermediate model according to the loss value to obtain an (i+1) th intermediate model, wherein the model parameters comprise the dimension of the word vector and the window size.

The model convergence condition is that the loss value is smaller than or equal to a preset threshold. That is, the loss value determined from the at least one predicted semantic feature and the at least one semantic feature in the training set is less than or equal to the preset threshold.

The method for determining the loss value is the same as the method for determining the similarity of the plurality of questions to be selected, and is not described again here.
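The iteration schedule described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not specify the model or the loss, so `train_until_converged`, `step`, and `toy_step` are hypothetical names, and the "model" here is a single number fitted by gradient descent purely to make the stopping rule visible (stop when the loss is at or below the preset threshold, or when i reaches N).

```python
from typing import Callable, Tuple


def train_until_converged(
    initial_model: float,
    step: Callable[[float], Tuple[float, float]],
    loss_threshold: float,
    max_iterations: int,
) -> Tuple[float, int]:
    """Iterate i = 1, 2, 3, ... and return (first_model, i).

    `step` is a hypothetical callback: given the i-th intermediate model it
    returns (loss of that model, the (i+1)-th intermediate model).
    """
    model = initial_model  # the 1st intermediate model is the initial model
    i = 1
    while True:
        loss, next_model = step(model)
        # Stop when the i-th model has converged or i has reached N.
        if loss <= loss_threshold or i >= max_iterations:
            return model, i
        model = next_model
        i += 1


def toy_step(w: float) -> Tuple[float, float]:
    # Toy stand-in for "extract features, compute loss, update parameters":
    # squared error against the target 3.0, one gradient step (rate 0.25).
    target = 3.0
    loss = (w - target) ** 2
    grad = 2.0 * (w - target)
    return loss, w - 0.25 * grad


first_model, iterations = train_until_converged(0.0, toy_step, 1e-6, 100)
capped_model, capped_iterations = train_until_converged(0.0, toy_step, 0.0, 5)
```

In the second call the threshold can never be met, so training ends at i = N = 5, which shows the role of the preset iteration number.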

S404, determining the similarity of at least one semantic feature and each problem in the voice response relation network through a first similarity algorithm.

Because the first similarity algorithm must determine the similarity between the at least one semantic feature and every question in the voice reply relation network, a similarity algorithm with low time complexity can be used for it, which improves computational efficiency.

S405, sorting the questions in the voice reply relation network in descending order of similarity to obtain a plurality of sorted questions.

For example, the voice processing device determines, through the first similarity algorithm, the similarity of the at least one semantic feature to each of 10 questions in the voice reply relation network. The 10 questions are sorted in descending order of similarity, and the sorted questions are: question 2, question 1, question 4, question 6, question 3, question 9, question 8, question 10, question 5, question 7.
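Steps S404 and S405 can be sketched as below. The text does not name the first similarity algorithm, so cosine similarity over feature vectors is used here only as an assumed example of a low-cost measure; `cosine_similarity`, `rank_questions`, and the vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    # Cheap similarity: one pass over the two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_questions(
    query: Sequence[float], questions: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    # S404: score every question; S405: sort in descending order of similarity.
    scored = [(name, cosine_similarity(query, vec)) for name, vec in questions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical feature vectors for three questions.
questions = {
    "question 1": (0.6, 0.8),
    "question 2": (1.0, 0.0),
    "question 3": (0.0, 1.0),
}
ranked = rank_questions((1.0, 0.0), questions)
ranked_names = [name for name, _ in ranked]
```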

S406, determining the first K questions in the ordered plurality of questions as a first question set, and determining other questions except the first K questions in the ordered plurality of questions as a second question set.

K is an integer greater than or equal to 1. The value corresponding to K may be determined based on the number of questions having a similarity to at least one semantic feature greater than or equal to a second threshold.

For example, if the second threshold is 95% and the number of questions whose similarity to the at least one semantic feature is greater than or equal to 95% is 3, the value of K may be determined to be 3.

For example, suppose K is 3. Then, according to the 10 sorted questions in the example above, the first question set includes the first 3 questions, namely question 2, question 1, and question 4. The second question set includes the remaining 7 questions, namely question 6, question 3, question 9, question 8, question 10, question 5, and question 7.
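The split in S406 can be sketched as follows, assuming K is taken as the number of sorted questions whose first-pass similarity reaches the second threshold (with a floor of 1, since K is at least 1). The function name `split_top_k` and the similarity values are hypothetical; only the question order comes from the example above.

```python
from typing import List, Tuple

Ranked = List[Tuple[str, float]]


def split_top_k(ranked: Ranked, second_threshold: float) -> Tuple[Ranked, Ranked]:
    # K = number of questions at or above the second threshold, at least 1.
    k = sum(1 for _, similarity in ranked if similarity >= second_threshold)
    k = max(k, 1)
    # `ranked` is already sorted in descending order of similarity.
    return ranked[:k], ranked[k:]


order = [2, 1, 4, 6, 3, 9, 8, 10, 5, 7]  # sorted order from the example
scores = [98.0, 97.0, 95.0, 94.0, 92.0, 90.0, 88.0, 85.0, 83.0, 75.0]  # hypothetical
ranked = [(f"question {n}", s) for n, s in zip(order, scores)]

first_set, second_set = split_top_k(ranked, second_threshold=95.0)
first_names = [name for name, _ in first_set]
```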

S407, determining the first similarity of at least one semantic feature and each question in the first question set through a second similarity algorithm.

The second similarity algorithm may be the Word Mover's Distance (WMD) algorithm.

For example, according to the first problem set shown in the above example, the determining, by the second similarity algorithm, the first similarity of the at least one semantic feature to each problem in the first problem set may specifically be as shown in table 3:

Table 3

S408, determining the second similarity of at least one semantic feature and each question in the second question set through a third similarity algorithm.

The third similarity algorithm may be the Relaxed Word Mover's Distance (RWMD) algorithm.

The second and third similarity algorithms each determine a word mover's distance between the at least one semantic feature and each question. The similarity of the at least one semantic feature to each question may then be determined based on that distance.
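The relationship between the two distances can be illustrated with a small sketch. It makes several assumptions not stated in the text: every word carries equal weight, word features are plain coordinate tuples, the exact distance is computed only for the special case of two equally sized word lists (where an optimal transport plan is a one-to-one assignment, so brute force over permutations is exact for a handful of words), and the distance is mapped to a similarity with 1 / (1 + d). All function names are hypothetical.

```python
import itertools
import math
from typing import List, Sequence

Vector = Sequence[float]


def rwmd(doc_a: List[Vector], doc_b: List[Vector]) -> float:
    """Relaxed distance: each word moves entirely to its nearest word in the
    other document; the larger of the two directions is the lower bound."""
    a_to_b = sum(min(math.dist(x, y) for y in doc_b) for x in doc_a) / len(doc_a)
    b_to_a = sum(min(math.dist(x, y) for x in doc_a) for y in doc_b) / len(doc_b)
    return max(a_to_b, b_to_a)


def wmd_equal_size(doc_a: List[Vector], doc_b: List[Vector]) -> float:
    """Exact distance for equal-size documents with uniform word weights."""
    n = len(doc_a)
    if n != len(doc_b):
        raise ValueError("this sketch only handles equal-size documents")
    best = min(
        sum(math.dist(doc_a[i], doc_b[perm[i]]) for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n


def to_similarity(distance: float) -> float:
    # Assumed mapping: smaller distance gives higher similarity in (0, 1].
    return 1.0 / (1.0 + distance)


query = [(0.0, 0.0), (3.0, 0.0)]      # hypothetical keyword features
question = [(0.0, 4.0), (0.0, 0.0)]   # hypothetical question features

relaxed = rwmd(query, question)
exact = wmd_equal_size(query, question)
```

Because the relaxed distance never exceeds the exact one, the similarity derived from it never falls below the exact similarity. This is why a cheap relaxed score can be used to rule questions out without running the more expensive algorithm on them.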

S409, judging whether target similarity exists in the plurality of second similarities.

The target similarity is greater than each of the first similarities.

If yes, S410 is performed.

If not, S411 is performed.

S410, determining target questions corresponding to the target similarity, updating a first question set according to the target questions, and determining a plurality of questions to be selected according to the first question set.

The first question set may be updated according to the target questions by: determining the similarity of at least one semantic feature and the target problem according to a second similarity algorithm; sequencing all the problems and target problems in the first problem set according to the sequence of the similarity from large to small; the first question set is updated to the top K questions after ordering.

For example, according to the second problem set illustrated by the above example, the determining, by the third similarity algorithm, the second similarity of the at least one semantic feature to each problem in the second problem set may specifically be as shown in table 4:

Table 4

Second question set	Second similarity
Question 6	95.2%
Question 3	91.5%
Question 9	90.0%
Question 8	88.0%
Question 10	85.0%
Question 5	83.0%
Question 7	75.0%

From the first similarities shown in Table 3 and the second similarities shown in Table 4, it can be determined that the second similarity of question 6 is greater than the first similarity of question 4. It can therefore be determined that the target similarity, 95.2%, exists among the plurality of second similarities, and that the target question is question 6. At this time, the similarity of the at least one semantic feature to question 6 is determined to be 98.0% according to the second similarity algorithm. All the questions in the first question set and the target question are sorted in descending order of similarity, giving the order question 6, question 2, question 1, question 4. The first question set is updated to the first 3 sorted questions, that is, the first question set now comprises question 6, question 2, and question 1.
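The check in S409 and the update in S410 can be sketched as follows. The second similarities and the re-scored value of 98.0% come from the example; the first similarities are hypothetical placeholders, since the contents of Table 3 are not reproduced here. Following the worked example, a second similarity is compared with the lowest first similarity; if the intended rule is that it must exceed every first similarity, `min` would be replaced by `max`. The name `update_first_set` is hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def update_first_set(
    first_sims: Dict[str, float],
    second_sims: Dict[str, float],
    rescore: Callable[[str], float],
) -> List[Tuple[str, float]]:
    k = len(first_sims)
    lowest_first = min(first_sims.values())
    # S409: look for target similarities among the second similarities.
    targets = [q for q, s in second_sims.items() if s > lowest_first]
    merged = dict(first_sims)
    for question in targets:
        # S410: re-score each target question with the second algorithm.
        merged[question] = rescore(question)
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]  # the first question set keeps its size K


# Hypothetical first similarities (Table 3 values are not available).
first_sims = {"question 2": 95.1, "question 1": 94.8, "question 4": 94.0}
table_4 = {"question 6": 95.2, "question 3": 91.5, "question 9": 90.0,
           "question 8": 88.0, "question 10": 85.0, "question 5": 83.0,
           "question 7": 75.0}
second_algorithm = {"question 6": 98.0}  # re-scored value from the example

updated = update_first_set(first_sims, table_4, second_algorithm.__getitem__)
updated_names = [name for name, _ in updated]

# With 93.2% for question 6 there is no target similarity and nothing changes.
table_5 = dict(table_4, **{"question 6": 93.2})
unchanged = update_first_set(first_sims, table_5, second_algorithm.__getitem__)
unchanged_names = [name for name, _ in unchanged]
```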

After S410, S412 is performed.

S411, determining a plurality of questions to be selected according to the first question set.

The plurality of candidate questions may be determined from the first question set as follows: for each question in the first question set, determining the question group in which the question is located in the voice reply relation network; and if the question groups of all the questions in the first question set are the same, determining the questions in the first question set as the plurality of candidate questions.

For example, according to the second problem set illustrated by the above example, the determining, by the third similarity algorithm, the second similarity of the at least one semantic feature to each problem in the second problem set may specifically be as shown in table 5:

TABLE 5

Second question set	Second similarity
Question 6	93.2%
Question 3	91.5%
Question 9	90.0%
Question 8	88.0%
Question 10	85.0%
Question 5	83.0%
Question 7	75.0%

From the first similarities shown in Table 3 and the second similarities shown in Table 5, it can be determined that the target similarity does not exist among the plurality of second similarities. At this time, the question groups in which question 2, question 1, and question 4 of the first question set shown in Table 3 are located are determined in the voice reply relation network. If question 2, question 1, and question 4 all belong to question group 2, they are determined as the plurality of candidate questions.
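The group check in S411 can be sketched as follows. The mapping `group_of` and the function name are hypothetical, and the text does not say what happens when the questions fall into different groups, so this sketch simply returns `None` in that case.

```python
from typing import Dict, List, Optional


def candidates_from_first_set(
    first_set: List[str], group_of: Dict[str, str]
) -> Optional[List[str]]:
    # Look up the question group of every question in the first question set.
    groups = {group_of[question] for question in first_set}
    if len(groups) == 1:
        # All questions share one group: they become the candidate questions.
        return list(first_set)
    return None  # behaviour for mixed groups is not specified in the text


group_of = {
    "question 2": "question group 2",
    "question 1": "question group 2",
    "question 4": "question group 2",
    "question 6": "question group 1",  # hypothetical membership
}
candidates = candidates_from_first_set(
    ["question 2", "question 1", "question 4"], group_of
)
mixed = candidates_from_first_set(["question 6", "question 2"], group_of)
```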

The plurality of candidate questions are determined by combining three similarity algorithms. When many questions must be compared, an algorithm with low time complexity screens out the questions with low similarity, which reduces computation time and improves efficiency. When only a few questions remain, a more complex and more accurate algorithm is used, which improves the accuracy of the determined similarity.

S412, determining a target reply corresponding to the first voice in the voice reply relation network according to the problem to be selected, and outputting the target reply.

For example, according to the example above, the plurality of candidate questions are question 2, question 1, and question 4. In the voice reply relation network, the voice processing device determines that the question group in which question 2, question 1, and question 4 are located is question group 2, and determines that the target reply corresponding to question group 2 is "the repayment date is April 12". The voice processing device sends the target reply to the terminal device, and the terminal device displays or plays the target reply.
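The lookup in S412 can be sketched as follows, assuming the voice reply relation network is held as a list of question groups, each with its member questions and one reply. `QuestionGroup`, `find_target_reply`, and the reply strings are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class QuestionGroup:
    name: str
    questions: FrozenSet[str]
    reply: str


def find_target_reply(
    candidates: Iterable[str], network: List[QuestionGroup]
) -> Optional[str]:
    wanted = set(candidates)
    if not wanted:
        return None
    for group in network:
        # The target reply is the reply of the group holding all candidates.
        if wanted <= group.questions:
            return group.reply
    return None


network = [
    QuestionGroup("question group 1", frozenset({"question 6"}),
                  "reply of question group 1"),
    QuestionGroup("question group 2",
                  frozenset({"question 1", "question 2", "question 4"}),
                  "reply of question group 2"),
]
reply = find_target_reply(["question 2", "question 1", "question 4"], network)
```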

According to the voice processing method provided by the embodiment of the application, the first voice and the voice reply relation network are acquired. The first voice is processed through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice. The similarity of the at least one semantic feature to each question in the voice reply relation network is determined through the first similarity algorithm, and the questions are sorted in descending order of similarity to obtain a plurality of sorted questions. The first K sorted questions are determined as the first question set, and the remaining sorted questions are determined as the second question set. A plurality of candidate questions are determined according to the at least one semantic feature, the first question set, and the second question set. A target reply corresponding to the first voice is then determined in the voice reply relation network according to the candidate questions, and the target reply is output. In this process, the same semantic feature indicates all words with the same or similar semantics, and the candidate questions are determined through the semantic features corresponding to the keywords rather than through the keywords themselves. This avoids the case in which the keywords of the first voice and of a candidate question in the database are the same or similar but their semantics differ considerably, and thereby improves the accuracy of voice processing.

On the basis of any of the above embodiments, a detailed procedure of the voice processing will be exemplified below with reference to fig. 6.

Fig. 6 is a schematic diagram of a voice processing procedure according to an embodiment of the present application. Referring to fig. 6, a terminal device 601 and a voice processing device 602 are included. The terminal device 601 may be a mobile phone, a computer, etc., and the voice processing device 602 may be a server. The voice processing device 602 is provided with a first algorithm, and a preset storage space of the voice processing device 602 stores a voice reply relation network.

The user opens the query page in the application program of the terminal device 601 and performs the corresponding input selection operation according to the prompt information displayed on the query page. In response to the click operation of the user, the terminal device 601 starts recording the user's question through the recording device. After recording the question, the terminal device 601 sends a first voice corresponding to the question to the voice processing device 602, and at the same time displays the text corresponding to the first voice, or a prompt message, on the query page to inform the user that the question has been received. The first voice may be, for example, "the current payment time of the account".

The voice processing device 602 performs voice recognition processing and word segmentation processing on the first voice to obtain a first text corresponding to the first voice, where the first text includes: account (noun), current (noun), an auxiliary word, and payment time (time word). The voice processing device 602 processes the first voice through the first model and obtains at least one semantic feature corresponding to at least one keyword in the first voice as a = (a, b, c)^T, where each element indicates the semantic feature corresponding to one keyword in the first text.

The voice processing device 602 obtains the voice reply relation network from the preset storage space, and determines the similarity between the at least one semantic feature and each question in the voice reply relation network through the first similarity algorithm. The voice processing device 602 sorts the questions in the voice reply relation network in descending order of similarity, and the sorted questions are question 11, question 2, question 7, question 3, question 6, question 1, question 5, question 8, question 10, question 4, question 12, question 9, and question 13. Assuming K is 5, the first question set includes the first 5 sorted questions, namely question 11, question 2, question 7, question 3, and question 6. The second question set includes the remaining 8 sorted questions, namely question 1, question 5, question 8, question 10, question 4, question 12, question 9, and question 13. The first similarity of the at least one semantic feature to each question in the first question set, determined by the voice processing device 602 through the second similarity algorithm, may be as shown in Table 5:

TABLE 5

First question set	First similarity
Question 11	98.2%
Question 2	97.5%
Question 7	97.0%
Question 3	96.0%
Question 6	95.0%

The second similarity of the at least one semantic feature to each question in the second question set, determined by the voice processing device 602 through the third similarity algorithm, may be as shown in Table 6:

TABLE 6

Second question set	Second similarity
Question 1	94.2%
Question 5	94.0%
Question 8	93.5%
Question 10	92.0%
Question 4	90.0%
Question 12	88.0%
Question 9	86.3%
Question 13	80.0%

Based on the first similarities shown in Table 5 and the second similarities shown in Table 6, the voice processing device 602 may determine that the target similarity does not exist among the plurality of second similarities. The voice processing device 602 determines that, in the voice reply relation network, question 11, question 2, question 7, question 3, and question 6 of the first question set shown in Table 5 all belong to question group 1. At this time, the voice processing device 602 determines question 11, question 2, question 7, question 3, and question 6 in the first question set as the plurality of candidate questions. According to the candidate questions, the voice processing device 602 determines in the voice reply relation network that the target reply corresponding to the first voice is "the current repayment date is April 12", and outputs the target reply. The voice processing device 602 sends the target reply to the terminal device 601, and the terminal device 601 displays or plays the target reply through the application program.

Fig. 7 is a schematic structural diagram of a speech processing device according to an embodiment of the present application. Referring to fig. 7, the voice processing apparatus 10 may include:

a first acquiring module 11, configured to acquire a first voice;

a second obtaining module 12, configured to obtain a voice response relationship network, where the voice response relationship network includes a plurality of problem groups, and responses corresponding to each problem group, where a similarity of problems in the problem groups is greater than or equal to a first threshold;

the processing module 13 is configured to process the first voice through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice;

a first determining module 14, configured to determine a plurality of questions to be selected in the voice response relationship network according to the at least one semantic feature, where a similarity between the questions to be selected and the at least one semantic feature is greater than or equal to a second threshold;

and the second determining module 15 is configured to determine, according to the candidate problem, a target reply corresponding to the first voice in the voice reply relation network, and output the target reply.

In a possible embodiment, the first determining module 14 is specifically configured to:

determining the similarity of the at least one semantic feature and each problem in the voice response relation network through a first similarity algorithm;

sequencing the plurality of similarity from big to small, sequencing the problems in the voice response relation network, and obtaining a plurality of sequenced problems;

determining the first K questions of the ordered plurality of questions as a first question set and the other questions of the plurality of questions except the first K questions as a second question set, wherein K is an integer greater than or equal to 1;

determining the plurality of candidate questions according to the at least one semantic feature, the first question set and the second question set.

determining a first similarity of the at least one semantic feature to each question in the first set of questions by a second similarity algorithm;

determining a second similarity of the at least one semantic feature to each question in the second set of questions by a third similarity algorithm;

and determining the plurality of questions to be selected from the first question set and the second question set according to the plurality of first similarities and the plurality of second similarities.

judging whether target similarity exists in the plurality of second similarities, wherein the target similarity is larger than each first similarity;

if yes, determining target problems corresponding to the target similarity, updating the first problem set according to the target problems, and determining the plurality of problems to be selected according to the first problem set;

if not, determining the plurality of questions to be selected according to the first question set.

determining the similarity of the at least one semantic feature and the target problem according to the second similarity algorithm;

sorting all the questions in the first question set and the target questions according to the sequence of the similarity from big to small;

the first question set is updated to the top K questions after ordering.

for each problem in the first problem set, determining the problem group in which the problem is located in the voice reply relation network;

and if the problem groups of all the problems in the first problem set are the same, determining the problems in the first problem set as the plurality of to-be-selected problems.

In one possible embodiment, the processing module 13 is specifically configured to:

performing voice recognition processing and word segmentation processing on the first voice to obtain a first text corresponding to the first voice, wherein the first text comprises at least one sentence text corresponding to the first voice and a keyword label corresponding to the sentence text, and the keyword label is used for indicating part-of-speech classification to which the keyword belongs;

and processing the first text through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice.

The voice processing device provided in the embodiment of the present application may execute the technical solution shown in the foregoing method embodiment, and its implementation principle and beneficial effects are similar, and will not be described herein again.

Fig. 8 is a schematic structural diagram of another speech processing device according to an embodiment of the present application. Referring to fig. 8, the speech processing apparatus 10 further includes a generating module 16 based on the embodiment shown in fig. 7.

Wherein the generating module 16 is configured to:

acquiring a plurality of voice questions and replies corresponding to each voice question;

classifying the plurality of voice questions to obtain a plurality of question groups, wherein the similarity of the questions in the question groups is greater than or equal to a first threshold;

determining replies corresponding to each question set;

and generating the voice response relation network according to the voice questions, the question sets and the responses corresponding to the question sets.
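The work of the generating module can be sketched as follows. The text does not say how questions are classified, so this sketch assumes a simple greedy grouping in which a question joins the first group whose members are all at least as similar to it as the first threshold. Token-overlap (Jaccard) similarity stands in for the unspecified measure, and each group takes the reply of its first question. All names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    # Stand-in similarity: shared tokens divided by all tokens.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def build_network(
    qa_pairs: List[Tuple[str, str]], first_threshold: float
) -> List[Dict[str, object]]:
    groups: List[Dict[str, object]] = []
    for question, reply in qa_pairs:
        for group in groups:
            members = group["questions"]
            # Join only if similar enough to every question already in the group.
            if all(jaccard(question, m) >= first_threshold for m in members):
                members.append(question)
                break
        else:
            # No suitable group: start a new one with this question's reply.
            groups.append({"questions": [question], "reply": reply})
    return groups


qa_pairs = [
    ("when is my repayment date", "reply A"),
    ("when is my next repayment date", "reply A"),
    ("how do I reset my password", "reply B"),
]
network = build_network(qa_pairs, first_threshold=0.5)
```

Checking against every member, rather than a single representative, keeps the pairwise similarity inside a group at or above the first threshold, at the cost of more comparisons.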

Fig. 9 is a schematic structural diagram of a speech processing device according to an embodiment of the present application. Referring to fig. 9, the voice processing apparatus 20 may include: a memory 21 and a processor 22. Illustratively, the memory 21 and the processor 22 are interconnected by a bus 23.

The memory 21 is used for storing program instructions;

the processor 22 is configured to execute the program instructions stored in the memory, so as to cause the speech processing device 20 to perform the method shown in the above-described method embodiment.

The voice processing device provided in the embodiment of the present application may execute the technical solution shown in the foregoing method embodiment, and its implementation principle and beneficial effects are similar, and will not be described in detail herein.

Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the above-described method when the computer-executable instructions are executed by a processor.

Embodiments of the present application may also provide a computer program product comprising a computer program which, when executed by a processor, performs the above-described method.

All or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a readable memory. When executed, the program performs the steps of the method embodiments described above. The aforementioned memory (storage medium) includes: read-only memory (ROM), random-access memory (RAM), flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disk, and any combination thereof.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to encompass such modifications and variations.

Claims

1. A method of speech processing, comprising:

acquiring a first voice;

2. The method of claim 1, wherein determining a plurality of candidate questions in the voice response relationship network based on the at least one semantic feature comprises:

3. The method of claim 2, wherein determining the plurality of candidate questions from the at least one semantic feature, the first set of questions, and the second set of questions comprises:

4. The method of claim 3, wherein determining the plurality of candidate questions in the first and second sets of questions based on the plurality of first similarities and the plurality of second similarities comprises:

5. The method of claim 4, wherein updating the first set of questions from the target questions comprises:

the first question set is updated to the top K questions after ordering.

6. The method of claim 4 or 5, wherein determining the plurality of candidate questions from the first set of questions comprises:

7. The method of any one of claims 1-6, wherein processing the first speech through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech comprises:

8. The method of any of claims 1-7, further comprising, prior to obtaining the voice reply relationship network:

determining replies corresponding to each question set;

9. A speech processing apparatus, the apparatus comprising:

the first acquisition module is used for acquiring first voice;

10. A speech processing apparatus, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.

11. A non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are for causing a computer to perform the method of any one of claims 1 to 8.

12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.